From autocomplete to assistant.
After pre-training you have a "base model": a brilliant autocomplete that has read much of the internet. But all it does is continue text. It doesn't reliably answer questions, follow instructions, or know when to stop. Two more, much smaller, training stages fix that.
Same prompt, before and after tuning
Stage 2 — show it what "helpful" looks like
This is instruction fine-tuning (often called SFT, short for supervised fine-tuning). The model is trained on tens of thousands of example pairs, each an instruction and a good response, written and curated by people. It's the same guess-the-next-word training, but now on conversations. The model quickly learns the shape of being an assistant: read the request, answer it, use the format asked for, and stop when done.
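To make "the same guess-the-next-word training, but now on conversations" concrete, here is a minimal sketch of how one example pair might be turned into next-word targets. It is a toy under stated assumptions: the `User:` and `Assistant:` labels and the `<end>` marker are hypothetical, splitting on spaces stands in for a real tokenizer, and scoring only the response words is a common choice rather than something every system does.

```python
END = "<end>"  # hypothetical end-of-turn marker; predicting it is how "stop when done" is learned


def build_sft_example(instruction: str, response: str):
    """Turn one (instruction, response) pair into (context, next word) targets."""
    # Toy tokenization: split on spaces. Real systems use subword tokenizers.
    prompt_tokens = ["User:"] + instruction.split() + ["Assistant:"]
    response_tokens = response.split() + [END]
    tokens = prompt_tokens + response_tokens

    examples = []
    # Assumption: only positions inside the response are used as targets,
    # so the model learns to write answers rather than to write requests.
    for i in range(len(prompt_tokens), len(tokens)):
        examples.append((tokens[:i], tokens[i]))
    return examples


pairs = build_sft_example("Name a primary colour.", "Red is a primary colour.")
for context, target in pairs:
    # Show the last few words of context and the word the model should guess.
    print(" ".join(context[-4:]), "->", target)
```

The final target is the end marker, which is the part a base model never saw in this role: the training data itself shows where an answer finishes.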
Stage 3 — learn from human preferences
Fine-tuning makes it helpful; this stage makes it good. It's called RLHF (reinforcement learning from human feedback). People are shown two answers to the same prompt and pick the better one, thousands of times. Take a turn yourself:
Pick the better answer — watch a preference take shape
Same question, two answers. Choose the one you'd rather receive. A few rounds in, look at what your choices are quietly teaching the model to value.
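Each round of picking can be stored as a small record. The sketch below shows one plausible shape for that data; the `Comparison` class, its field names, and the sample answers are all hypothetical and only illustrate the idea of "same prompt, two answers, one choice".

```python
from dataclasses import dataclass


@dataclass
class Comparison:
    prompt: str
    answer_a: str
    answer_b: str
    choice: str  # "a" or "b", as picked by the human rater


def to_training_pair(c: Comparison) -> tuple[str, str, str]:
    """Return (prompt, chosen, rejected), a convenient form for later training."""
    if c.choice == "a":
        return (c.prompt, c.answer_a, c.answer_b)
    if c.choice == "b":
        return (c.prompt, c.answer_b, c.answer_a)
    raise ValueError(f"choice must be 'a' or 'b', got {c.choice!r}")


# One made-up round of rating.
log = [
    Comparison(
        prompt="How do I reset my password?",
        answer_a="Figure it out.",
        answer_b="Open Settings, choose Security, then select Reset password.",
        choice="b",
    ),
]
pairs = [to_training_pair(c) for c in log]
print(pairs[0][1])
```

Note that the record never says why one answer was better. Whatever the raters value has to be inferred from many such choices.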
The human choices don't train the assistant directly; they train a second model, a stand-in judge called the reward model. Why a whole extra model? Because there aren't enough people to rate every answer the assistant will ever give, whereas the reward model is available 24/7. The assistant then practises endlessly against it, learning to produce answers the judge scores highly: clearer, kinder, more honest, safer. That polished model is the one you actually talk to.
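The idea of a judge learned from human choices can be sketched in a few lines. This is a toy, not how production systems are built: the three hand-made features are hypothetical (a real reward model is a neural network that reads the text itself), and the last step simply picks the highest-scoring candidate as a stand-in for the reinforcement learning that actually adjusts the assistant's dials. The training rule is the standard pairwise logistic loss: nudge the weights until the chosen answer outscores the rejected one.

```python
import math


def features(answer: str) -> list[float]:
    """Hypothetical hand-made features standing in for a neural network."""
    words = answer.lower().split()
    return [
        1.0 if "because" in words else 0.0,    # gives a reason
        1.0 if "obviously" in words else 0.0,  # dismissive tone
        min(len(words), 30) / 30.0,            # amount of detail, capped
    ]


def score(weights: list[float], answer: str) -> float:
    """The judge's score: a weighted sum of the features."""
    return sum(w * x for w, x in zip(weights, features(answer)))


def train_reward_model(preferences, steps: int = 200, lr: float = 0.5) -> list[float]:
    """Fit weights so chosen answers outscore rejected ones."""
    weights = [0.0, 0.0, 0.0]
    for _ in range(steps):
        for chosen, rejected in preferences:
            margin = score(weights, chosen) - score(weights, rejected)
            # Loss is -log(sigmoid(margin)); this is the size of the correction,
            # large when the judge disagrees with the human, near zero otherwise.
            push = 1.0 - 1.0 / (1.0 + math.exp(-margin))
            fc, fr = features(chosen), features(rejected)
            weights = [w + lr * push * (c - r) for w, c, r in zip(weights, fc, fr)]
    return weights


# Made-up human choices: (chosen answer, rejected answer).
preferences = [
    ("The sky looks blue because air scatters blue light more than red light.",
     "Obviously it is blue."),
    ("Water boils at a lower temperature up high because the air pressure is lower.",
     "Obviously it just does."),
]
weights = train_reward_model(preferences)

# The judge can now rate answers no human has seen.
candidates = [
    "Obviously ice floats.",
    "Ice floats because it is less dense than liquid water.",
]
best = max(candidates, key=lambda a: score(weights, a))
print(best)
```

After training, the judge rewards answers that give reasons and penalises dismissive ones, even though nobody told it those rules. It picked them up from the choices alone.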
Now it's a helpful assistant — and completely frozen: training is over, the dials are fixed. So how does it use the things you tell it, remember a conversation, or look up today's news? That's what happens at chat time.