Why it takes thousands of computers and months.
Each nudge is microscopic, and there are trillions of them, spread across trillions of tokens. Doing that in any reasonable time takes staggering computing power, and one discovery shaped the whole field: to get a better model, mostly just make everything bigger.
Thousands of GPUs
Training runs on specialised chips (GPUs) built for massively parallel math. Frontier models use tens of thousands of them at once.
Weeks to months
A single training run for a large model takes many weeks of those chips running non-stop, day and night.
Tens of millions of dollars+
Between the hardware, electricity, and data work, the biggest runs cost tens to hundreds of millions of dollars, which is why only a few labs can afford to build frontier models.
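To see how the chips, the months, and the dollars fit together, a back-of-the-envelope estimate helps. The sketch below relies on a widely quoted rule of thumb that training costs roughly 6 floating-point operations per parameter per token. Everything else (the model size, token count, chip count, per-chip speed, utilisation, and hourly price) is a hypothetical round number chosen for illustration, not a figure for any real model.

```python
# Back-of-the-envelope training estimate. Every input below is an
# illustrative assumption, not a measurement of any real system.

def training_estimate(params, tokens, num_gpus,
                      peak_flops_per_gpu=1e15,    # assumed peak: ~1 petaFLOP/s per chip
                      utilisation=0.4,            # assumed fraction of peak actually achieved
                      dollars_per_gpu_hour=2.0):  # assumed price of one chip for one hour
    # Rule of thumb: about 6 floating-point operations per parameter per token.
    total_flops = 6 * params * tokens
    cluster_flops_per_sec = num_gpus * peak_flops_per_gpu * utilisation
    seconds = total_flops / cluster_flops_per_sec
    gpu_hours = num_gpus * seconds / 3_600
    return {
        "total_flops": total_flops,
        "days": seconds / 86_400,
        "gpu_hours": gpu_hours,
        "dollars": gpu_hours * dollars_per_gpu_hour,
    }

# Hypothetical large run: 1 trillion parameters, 15 trillion tokens, 20,000 GPUs.
est = training_estimate(params=1e12, tokens=15e12, num_gpus=20_000)
print(f"{est['days']:.0f} days, about ${est['dollars'] / 1e6:.0f} million of chip time")
```

With these made-up inputs the estimate lands at roughly 130 days and a little over $100 million of chip time alone, the same ballpark as the figures above. Halve the utilisation or the chip count and the run takes twice as long.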
The scaling law
Around 2020, researchers noticed something almost suspiciously reliable: add more parameters, more data, and more compute (computing power), and the model gets predictably better, following a smooth curve. Progress stopped being only about clever new ideas and became, in large part, about scaling up.
From a toy model to a frontier model
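The "smooth curve" is usually described as a power law: the model's error (its loss) shrinks by the same proportion every time the model grows by the same factor. Here is a minimal sketch, assuming the simple form loss = (N_C / N) ** ALPHA for a model with N parameters. The two constants are illustrative assumptions in the ballpark of published values, and real curves also depend on how much data and compute are available.

```python
# Illustrative power-law scaling curve: predicted loss falls smoothly
# as the parameter count grows. Constants are assumptions for illustration.

ALPHA = 0.076   # assumed exponent: how quickly loss falls with size
N_C = 8.8e13    # assumed scale constant, in parameters

def predicted_loss(num_params):
    """Power-law form: loss = (N_C / N) ** ALPHA."""
    return (N_C / num_params) ** ALPHA

# From a toy model to a frontier-scale one, in 100x steps.
sizes = [1e6, 1e8, 1e10, 1e12]
losses = [predicted_loss(n) for n in sizes]

for n, loss in zip(sizes, losses):
    print(f"{n:.0e} parameters -> predicted loss {loss:.2f}")

# Each 100x step multiplies the loss by the same factor.
ratios = [losses[i + 1] / losses[i] for i in range(len(losses) - 1)]
```

Every hundredfold jump in size multiplies the predicted loss by the same factor (about 0.70 with these constants). That constant ratio is what makes the curve a straight line on a log-log plot, and it is why labs could forecast the payoff of a bigger model before building it.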
Bigger isn't the whole story anymore
Pure size has limits — there's only so much high-quality text, and cost and energy are real. So recent progress also comes from cleaner data, more efficient designs, and a newer trick: letting a model think for longer at answer time (the "reasoning" models). But the scaling era is why today's models are so much more capable than those of just a few years ago.
After all this, you have a giant, powerful model. But it has one big problem: it's still just an autocomplete. Ask it a question and it might continue with more questions. Next, how it's turned into a helpful assistant.