On steerable AI

You notice it after a while. You push back on a model's answer — even mildly — and it folds. Apologizes, rewrites, agrees with the new framing. Tell it you wanted formal, then ask for casual; it shifts without protest. Tell it the previous answer was wrong, even when it wasn't, and it caves. No defense of prior work, no "are you sure though?", no sunk cost. The model has zero skin in its last turn.

That's steerability. Or at least the visible, slightly disorienting face of it.

People have the sunk cost fallacy baked in: we'll defend a mediocre paragraph for an hour because we wrote it. The model is the opposite. Whatever you said most recently carries the most weight; whatever it produced is already in the past. This is a feature. It's also why interacting with these tools occasionally feels like talking to someone with no convictions: in the relevant sense, it doesn't have any. It has yours.

Steerable, more formally: the AI reliably adjusts its tone, style, and focus based on what you tell it. You say less, you get more of what you wanted, and you don't have to keep saying it.

The opposite end of the spectrum — a model with a fixed default self that just ignores you — is mostly theoretical at this point. But the contrast still tells you what steerability is doing:

Situation	Steerable	Not steerable
You say "be concise"	Stays concise	Drifts back to verbose
You assign a role	Holds the role	Slips back to default voice
You say "focus on X"	Stays focused, pushes back on detours	Answers whatever you put in front of it
System prompt set	Honors it throughout	Wanders off after a few turns

The right column is what a model with a default self looks like — one that keeps reverting to its own preferences. The left is what it looks like when the model treats your input as the steering input, not seasoning on top of its own.

So what makes a model steerable? A few things, stacked.

Trained on instructions. Enough (instruction → correct behavior) pairs that "summarize in three bullets" actually means three bullets, instead of being read as a general vibe about brevity.
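To make "three bullets means three bullets" concrete, here is a minimal sketch of what an instruction pair might look like as data, with a literal check on the response. The `Example` class, the `count_bullets` helper, and the sample responses are all hypothetical, made up for illustration; real instruction-tuning datasets are far larger and messier.

```python
from dataclasses import dataclass


@dataclass
class Example:
    """One hypothetical (instruction -> response) training pair."""
    instruction: str
    response: str


def count_bullets(text: str) -> int:
    """Count lines that start with a '- ' bullet marker."""
    return sum(1 for line in text.splitlines() if line.lstrip().startswith("- "))


# A response that follows the instruction literally.
good = Example(
    instruction="Summarize in three bullets",
    response="- Point one\n- Point two\n- Point three",
)

# A response that treats the instruction as a vibe about brevity.
vague = Example(
    instruction="Summarize in three bullets",
    response="- Point one\n- Point two\n- Point three\n- Point four\n- Point five",
)

print(count_bullets(good.response) == 3)   # True
print(count_bullets(vague.response) == 3)  # False
```

The point of the sketch is only that the first kind of pair is what you want many of in the training data; the second is what the model produces when it has seen too few.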

RLHF. Reinforcement learning from human feedback. Humans rate responses, and those ratings become the training signal. Following the instruction gets the thumbs up. Showing off, ignoring the spec, drifting: thumbs down. Repeat a million times. The model learns that "do what they said" is the move.
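The feedback loop can be sketched as a toy, with heavy caveats. Real RLHF typically trains a separate reward model on human preference data and then optimizes the language model against it; the sketch below skips all of that and assumes a fixed rating per behaviour. The three behaviour names, the +1/-1 ratings, the step count, and the learning rate are all assumptions for illustration. It runs an exact policy-gradient update on a softmax over behaviours, so the rated-up behaviour gradually takes over.

```python
import math

# Hypothetical behaviours and the rating each earns (+1 thumbs up, -1 thumbs down).
REWARDS = {"follow_instruction": 1.0, "show_off": -1.0, "ignore_spec": -1.0}


def softmax(logits):
    """Turn raw preference scores into a probability distribution."""
    m = max(logits.values())
    exps = {k: math.exp(v - m) for k, v in logits.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}


def train(steps=200, lr=0.5):
    """Gradient ascent on expected rating, using the exact softmax gradient."""
    logits = {k: 0.0 for k in REWARDS}  # start with no preference
    for _ in range(steps):
        probs = softmax(logits)
        expected = sum(probs[k] * REWARDS[k] for k in REWARDS)
        for k in REWARDS:
            # d(expected rating)/d(logit_k) = p_k * (r_k - expected)
            logits[k] += lr * probs[k] * (REWARDS[k] - expected)
    return softmax(logits)


final_probs = train()
print(max(final_probs, key=final_probs.get))  # follow_instruction
```

Nothing here resembles a production training run; it only shows the direction of the pressure: behaviours that get rated up become more likely, the rest fade.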

Hierarchical prompt design. Not every instruction has the same weight. There's an order:

Safety guardrails > System prompt (set by the developer) > User > Default

When two things conflict, the higher one wins. Safety overrides everything. The developer's system prompt overrides you. You override the model's defaults. Defaults are the floor — what shows up when nobody bothered to specify anything.
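The precedence order above can be sketched as a simple merge in which higher-priority sources overwrite lower ones. This is an illustration of the idea, not how any particular model or API resolves conflicts internally; the `Source` enum, the `resolve` function, and the setting names (`tone`, `length`) are hypothetical.

```python
from enum import IntEnum


class Source(IntEnum):
    """Instruction sources; a higher value wins, mirroring the order above."""
    DEFAULT = 0
    USER = 1
    SYSTEM = 2
    SAFETY = 3


def resolve(instructions: dict[Source, dict[str, str]]) -> dict[str, str]:
    """Merge per-source settings so higher-priority sources overwrite lower ones."""
    resolved: dict[str, str] = {}
    for source in sorted(instructions):  # ascending priority
        resolved.update(instructions[source])
    return resolved


settings = resolve({
    Source.DEFAULT: {"tone": "friendly", "length": "verbose"},
    Source.USER: {"tone": "casual", "length": "concise"},
    Source.SYSTEM: {"tone": "formal"},
})
print(settings)  # {'tone': 'formal', 'length': 'concise'}
```

The user's "concise" beats the default, but the developer's "formal" beats the user's "casual". Anything nobody specified falls through to the default.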

Persona and tone training. The model is trained to hold a persona once you assign it. Without this, "talk like a grumpy senior engineer" lasts about two turns before "I hope this helps!" creeps back in.

Those four, layered, plus the side effect described at the top: the model doesn't push back the way a colleague would. The cost of that is real: it'll throw out a perfectly good answer because you frowned at it. The benefit is that when you know what you want, you can just say it, and the thing actually listens.
