On steerable AI

You notice it after a while. You push back on a model's answer — even mildly — and it folds. Apologizes, rewrites, agrees with the new framing. Tell it you wanted formal, then ask for casual; it shifts without protest. Tell it the previous answer was wrong, even when it wasn't, and it caves. No defense of prior work, no "are you sure though?", no sunk cost. The model has zero skin in its last turn.
That's steerability. Or at least the visible, slightly disorienting face of it.
We have sunk cost fallacy baked in — we'll defend a mediocre paragraph for an hour because we wrote it. The model is the opposite. Whatever you said most recently carries the most weight; whatever it produced is already in the past. This is a feature. It's also why interacting with these tools occasionally feels like talking to someone with no convictions — because in the relevant sense, it doesn't have any. It has yours.
Steerable, more formally: the AI reliably adjusts its tone, style, and focus based on what you tell it. You say less, you get more of what you wanted, and you don't have to keep saying it.
The opposite end of the spectrum — a model with a fixed default self that just ignores you — is mostly theoretical at this point. But the contrast still tells you what steerability is doing:
| Situation | Steerable | Not steerable |
|---|---|---|
| You say "be concise" | Stays concise | Drifts back to verbose |
| You assign a role | Holds the role | Slips back to default voice |
| You say "focus on X" | Stays focused, pushes back on detours | Answers whatever you put in front of it |
| System prompt set | Honors it throughout | Wanders off after a few turns |
The right column is what a model with a default self looks like — one that keeps reverting to its own preferences. The left is what it looks like when the model treats your input as the steering input, not seasoning on top of its own.
So what makes a model steerable? A few things, stacked.
Trained on instructions. Enough (instruction → correct behavior) pairs that "summarize in three bullets" actually means three bullets, instead of being read as a general vibe about brevity.
RLHF. Reinforcement learning from human feedback. Humans rate responses. Following the instruction gets the thumbs up. Showing off, ignoring the spec, drifting — thumbs down. Repeat a million times. The model learns that "do what they said" is the move.
Hierarchical prompt design. Not every instruction has the same weight. There's an order:
Safety guardrails > System prompt (set by the developer) > User > Default
When two things conflict, the higher one wins. Safety overrides everything. The developer's system prompt overrides you. You override the model's defaults. Defaults are the floor — what shows up when nobody bothered to specify anything.
Persona and tone training. The model is trained to hold a persona once you assign it. Without this, "talk like a grumpy senior engineer" lasts about two turns before "I hope this helps!" creeps back in.
Those four, layered, plus the side effect from the top: the model doesn't push back the way a colleague would. The cost of that is real — it'll throw out a perfectly good answer because you frowned at it. The benefit is that when you know what you want, you can just say it, and the thing actually listens.



