Becoming an assistant

From autocomplete to assistant.

After pre-training you have a "base model" — a brilliant autocomplete that has read the internet. But it only continues text. It doesn't answer questions, follow instructions, or know when to stop. Two more, much smaller, training stages fix that.

See the difference

Same prompt, before and after tuning

Base model (raw)
After fine-tuning + RLHF
You
List three tips for better sleep.
Base model
List three tips for losing weight. List three tips for saving money. List three tips for a better morning routine… — it just continues the pattern, as if completing a web page of headings. It never actually answers you.
You
List three tips for better sleep.
Assistant
1. Keep a consistent sleep schedule, even on weekends. 2. Avoid screens and caffeine in the hour before bed. 3. Keep your room cool, dark, and quiet. — answers directly, in the format you asked for, then stops.

Stage 2 — show it what "helpful" looks like

This is instruction fine-tuning (often called SFT). The model is trained on tens of thousands of example pairs — an instruction and a good response, written and curated by people. It's the same guess-the-next-word training, but now on conversations. The model quickly learns the shape of being an assistant: read the request, answer it, use the format asked for, and stop when done.

Stage 3 — learn from human preferences

Fine-tuning makes it helpful; this stage makes it good. It's called RLHF (reinforcement learning from human feedback). People are shown two answers to the same prompt and pick the better one, thousands of times. Take a turn yourself:

You be the human rater

Pick the better answer — watch a preference take shape

Same question, two answers. Choose the one you'd rather receive. A few rounds in, look at what your choices are quietly teaching the model to value.

Why a whole extra model? Because there aren't enough people to rate every answer the assistant will ever give. So those human choices train a stand-in judge — the reward model — that's available 24/7. The assistant then practises endlessly against it, learning to produce answers the judge scores highly: clearer, kinder, more honest, safer. This is the polish you actually talk to.

Guardrails come from here too. The same preference training is where an assistant learns to refuse genuinely harmful requests and to admit uncertainty. And a hidden system prompt — instructions the app puts in front of your messages — sets its persona and rules for each product. (One side effect to watch: tuning it to please people can make it a bit too agreeable — more on that soon.)

Now it's a helpful assistant — and completely frozen: training is over, the dials are fixed. So how does it use the things you tell it, remember a conversation, or look up today's news? That's what happens at chat time.