Learning is just guess, check, nudge — a trillion times.
We have billions of random dials (the network) and endless fill-in-the-blank questions (the data). "Learning" is what connects them: one simple loop, repeated at a scale that's hard to imagine.
Predict the next word
Show the model a snippet of real text with the next word hidden. It outputs a probability for every possible next token (a word or a piece of a word) in its vocabulary. That list of probabilities is its guess.
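To make "a probability for every possible next token" concrete, here is a minimal sketch. It assumes a made-up four-token vocabulary and made-up raw scores; a real model has a far larger vocabulary and computes its scores from the billions of dials. The softmax function shown is the standard way to turn raw scores into probabilities.

```python
import math

def softmax(scores):
    """Turn raw scores (one per token) into probabilities that sum to 1."""
    top = max(scores)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical vocabulary and scores, chosen only for illustration.
vocab = ["blue", "green", "falling", "the"]
scores = [1.0, 0.5, 0.2, -1.0]

probs = softmax(scores)
guess = dict(zip(vocab, probs))  # the model's guess: one probability per token

for token, p in guess.items():
    print(f"{token:8s} {p:.2f}")
```

The highest score gets the highest probability, but every token keeps some share, which is what lets the next step measure how wrong the guess was.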
Measure how wrong (the loss)
Compare the guess to the real next word. A single number, the loss, captures how wrong it was: the less probability the model gave to the real word, the higher the loss. Confident and right means low loss; confident and wrong means high loss.
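One common way to compute such a loss is the negative logarithm of the probability the model gave to the real next word (often called cross-entropy). The sketch below assumes that choice; the three probabilities are made-up examples, not measurements from any model.

```python
import math

def loss(prob_of_real_word):
    """Negative log of the probability the model assigned to the real next word."""
    return -math.log(prob_of_real_word)

confident_right = loss(0.90)  # model put 90% on the real word
unsure = loss(0.25)           # model spread its bets evenly over four options
confident_wrong = loss(0.01)  # model put only 1% on the real word

print(f"confident and right: {confident_right:.2f}")
print(f"unsure:              {unsure:.2f}")
print(f"confident and wrong: {confident_wrong:.2f}")
```

The loss is near zero when the model was confident and right, and it grows quickly as the probability on the real word shrinks toward zero.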
Adjust every dial a little
Work out, for each of the billions of dials, which direction would have made this guess slightly better — then nudge every one a tiny step that way.
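A rough sketch of "work out the direction, then nudge" follows. It assumes a toy model whose dials are simply one raw score per token, and it finds each dial's direction by wiggling that dial slightly and watching the loss. Real training gets all the directions at once with backpropagation rather than one wiggle per dial; the names `slopes`, `nudge`, and `step_size` are illustrative only.

```python
import math

def loss_for(dials, correct_index):
    """Loss of a toy model whose dials are one raw score per token."""
    top = max(dials)
    exps = [math.exp(d - top) for d in dials]
    prob_correct = exps[correct_index] / sum(exps)
    return -math.log(prob_correct)

def slopes(dials, correct_index, wiggle=1e-5):
    """Estimate, dial by dial, how the loss changes when that dial is turned up."""
    base = loss_for(dials, correct_index)
    result = []
    for i in range(len(dials)):
        turned = list(dials)
        turned[i] += wiggle
        result.append((loss_for(turned, correct_index) - base) / wiggle)
    return result

def nudge(dials, correct_index, step_size=0.1):
    """Move every dial a small step in the direction that lowers the loss."""
    grads = slopes(dials, correct_index)
    return [d - step_size * g for d, g in zip(dials, grads)]

dials = [0.0, 0.3, -0.2, 0.1]  # made-up starting values
before = loss_for(dials, correct_index=0)
after = loss_for(nudge(dials, correct_index=0), correct_index=0)
print(f"loss before: {before:.4f}, after one nudge: {after:.4f}")
```

A positive slope means turning the dial up would make the guess worse, so the nudge turns it down, and vice versa. The step is kept small on purpose so that one example never rewrites what the others have taught.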
Do it again. And again.
Move to the next snippet and repeat — across trillions of tokens. Each nudge is tiny; the sheer number of them is what does the work.
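The whole loop can be sketched at toy scale. The example below assumes a stand-in for the network: a small table with one row of dials per previous word, trained on a made-up eleven-word text. It is far simpler than a real model, but the guess, check, nudge cycle is the same.

```python
import math
import random

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up training text; real training data is trillions of tokens.
text = "the sky is blue the grass is green the sky is blue".split()
vocab = sorted(set(text))
index = {word: i for i, word in enumerate(vocab)}

random.seed(0)
# One row of dials per previous word, one dial per possible next word,
# all starting at small random values.
dials = [[random.uniform(-0.1, 0.1) for _ in vocab] for _ in vocab]

def train_pass(step_size=0.5):
    """Guess, check, and nudge once for every (word, next word) pair in the text."""
    pairs = list(zip(text, text[1:]))
    total_loss = 0.0
    for prev, real_next in pairs:
        row = dials[index[prev]]
        probs = softmax(row)                         # guess
        total_loss += -math.log(probs[index[real_next]])  # check
        for i, p in enumerate(probs):                # nudge
            target = 1.0 if i == index[real_next] else 0.0
            row[i] -= step_size * (p - target)
    return total_loss / len(pairs)

losses = [train_pass() for _ in range(200)]
print(f"average loss, first pass: {losses[0]:.2f}")
print(f"average loss, last pass:  {losses[-1]:.2f}")
```

The loss does not fall to zero here, because the text itself is ambiguous: "is" is followed by "blue" twice and "green" once, so the best possible guess still hedges between them.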
Watch one example being learned
The model sees "The sky is ___" and must predict the next word. The right answer is "blue". Each training step nudges its dials, and with every step the probability it gives to "blue" rises while the loss falls.
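Here is a small simulation of that single example, under loud assumptions: a hypothetical four-token vocabulary, dials that are just one score per token, and a step size picked for illustration. For this toy setup the slope of the loss with respect to each score works out to the token's probability minus 1 for the right answer (minus 0 for the others), which is what `train_step` uses.

```python
import math

vocab = ["blue", "green", "falling", "the"]  # hypothetical tiny vocabulary
dials = [0.0, 0.0, 0.0, 0.0]                 # equal scores: no preference yet
target = vocab.index("blue")

def probabilities(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def train_step(scores, step_size=0.5):
    """One nudge: raise the score of the right answer, lower the rest."""
    probs = probabilities(scores)
    return [
        s - step_size * (p - (1.0 if i == target else 0.0))
        for i, (s, p) in enumerate(zip(scores, probs))
    ]

history = []
for step in range(20):
    p_blue = probabilities(dials)[target]
    history.append((step, p_blue, -math.log(p_blue)))
    dials = train_step(dials)

for step, p_blue, loss in history[::5]:
    print(f"step {step:2d}  P(blue) = {p_blue:.2f}  loss = {loss:.2f}")
```

At step 0 the model gives "blue" the same 25% as every other token. Each later row shows a higher probability on "blue" and a lower loss than the one before.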
Each nudge is microscopic, and there are trillions of them, spread across trillions of tokens. The real surprise isn't the effort; it's what all that blind guessing quietly builds. The model is never told a single rule of grammar and never handed a list of facts, yet it ends up with both, plus a knack for simple reasoning. That payoff is the next lesson.