Nobody gave it grammar, a dictionary, or logic. It worked them out anyway.
The only thing training ever rewards is guessing the next word. Yet to do that well across everything people have written, the model is forced to pick up the machinery underneath language — grammar, meaning, facts, even a little reasoning. Not because anyone put them there, but because you simply can't guess well without them.
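To make "the only thing training ever rewards" concrete, here is a minimal sketch of how a single guess might be scored. It assumes the usual cross-entropy penalty (the lesson itself doesn't name one), and the function name `penalty` and both probability tables are made up for illustration.

```python
import math

def penalty(predicted_probs, actual_next_word):
    """Cross-entropy for one guess: small when the real word got a high probability."""
    return -math.log(predicted_probs[actual_next_word])

# Hypothetical predictions for the word after "The children ..."
good_guess = {"were": 0.90, "was": 0.05, "is": 0.05}
poor_guess = {"were": 0.10, "was": 0.60, "is": 0.30}

# The real next word in the document was "were".
print(penalty(good_guess, "were"))  # small penalty
print(penalty(poor_guess, "were"))  # much larger penalty
```

Nothing in that score mentions grammar or facts. It only asks how much probability landed on the word that actually came next.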
Never handed the rules — has them anyway
Yet it gets the grammar right
It keeps verbs agreeing with their subjects and words in a sensible order — having never been shown a single rule of grammar.
Yet it knows meanings
It uses words by what they actually mean and recalls facts like capitals and dates — none of it typed in as a list to memorise.
Yet it can reason a little
It follows simple chains of "if this, then that" — picked up purely from how often that shape of sentence appears in text.
How does that happen? The same way a child who has never had a grammar lesson still speaks grammatically: by hearing enough of it. The model "heard" a huge slice of what people have written. To guess the next word across all of it, the cheapest thing its billions of dials can do is store the patterns that keep repeating. The rules are never spelled out anywhere; they get squeezed out of the data.
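A toy version of "store the patterns that keep repeating" can be sketched in a few lines. This is not how a real model stores anything (it uses learned numbers, not a lookup table), and the five-sentence corpus and the function names are invented for the example. The point is only that an agreement rule can fall out of counting.

```python
from collections import Counter, defaultdict

# Made-up toy corpus; no grammar rule is written anywhere in it.
CORPUS = [
    "the children were playing outside",
    "the children were tired",
    "the child was playing outside",
    "the child was tired",
    "the children were late",
]

def count_next_words(sentences):
    """For each word, count which words followed it in the corpus."""
    follows = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for current, nxt in zip(words, words[1:]):
            follows[current][nxt] += 1
    return follows

def guess_next(follows, word):
    """Return the most frequently seen follower of `word`, or None if unseen."""
    if word not in follows:
        return None
    return follows[word].most_common(1)[0][0]

follows = count_next_words(CORPUS)
print(guess_next(follows, "children"))  # were
print(guess_next(follows, "child"))     # was
```

The table never heard of "plural", yet it pairs "children" with "were" and "child" with "was", simply because that is what the text kept doing.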
Guess the next word — and catch what you lean on
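This is the exact game the model was trained on. Read each line, pick the word that should come next, then see which kind of knowledge you just used without noticing. Given "The children ___ playing outside", for instance, you pick "were" rather than "was", and grammar is what you leaned on. Three quick rounds.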
You didn't open a rulebook once. You leaned on patterns soaked up from a lifetime of language. The model has no rulebook either — only patterns, stored as numbers, pressed in by the single thing training ever asked of it. Grammar, meaning, and reasoning all arrived as side effects of getting good at the guess.
How a rule actually gets pressed in
"Arrived as side effects" can still sound like hand-waving, so here is the whole mechanism with nothing hidden. Take the agreement rule you just used — "the children were", never "was" — and watch it get learned the only way the model ever learns anything: read a document, nudge the dials a hair, read the next, nudge again. One document barely matters. The sheer number of them is the entire trick.
Pick something to learn, then feed it documents
Grammar, a fact, a chain of reasoning — each is learned the same way: one document, one tiny nudge, repeated. Switch the example and watch the same thing happen.
No single page held a grammar lesson. The rule is just what's left after the same small correction ("were" yes, "was" almost never) nudges the dials a few billion times. And every other rule of grammar, every fact, every pattern of reasoning is being pressed in the exact same way, all at once, from the same stream of text. That is how "guess the next word" quietly becomes grammar, knowledge, and a little reasoning.
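The read-a-document, nudge-the-dials loop can be sketched with just two dials, one score for "were" and one for "was" after "the children". Everything specific here is an assumption made for illustration: the learning rate, the stream in which 1 document in 20 contains the slip "the children was", and the function names. A real model adjusts billions of dials at once, but each step has the same shape.

```python
import math

VOCAB = ["were", "was"]
LEARNING_RATE = 0.05  # size of each nudge; an arbitrary illustrative value

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def nudge(scores, seen_word):
    """One training step: shift the scores slightly toward the word actually seen."""
    probs = softmax(scores)
    for i, word in enumerate(VOCAB):
        target = 1.0 if word == seen_word else 0.0
        # Gradient of the cross-entropy penalty w.r.t. score i is (probs[i] - target).
        scores[i] -= LEARNING_RATE * (probs[i] - target)

def train(num_documents):
    """Read `num_documents` documents; return the probability given to "were"."""
    scores = [0.0, 0.0]  # two dials, starting with no preference
    for n in range(num_documents):
        # Hypothetical stream: every 20th document says "the children was".
        seen = "was" if n % 20 == 19 else "were"
        nudge(scores, seen)
    return softmax(scores)[0]

for docs in (0, 1, 10, 1000):
    print(docs, round(train(docs), 3))
```

With no documents the two words are tied at 0.5. One document moves the needle only slightly, and after many the probability of "were" settles near the share of documents that used it, roughly 0.95 in this made-up stream.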
So is it actually thinking?
When it fixes your grammar and cracks a little logic puzzle, it's tempting to say it "thinks" or "understands" the way a person does. Worth being precise here, because it's the thing most people get wrong. It isn't looking facts up in a memory, and it isn't reasoning a decision through the way you'd weigh one. It's producing the most likely next words, steered by everything it absorbed. That one idea explains both why it's astonishing and why it slips up.
Fluent, fast, sure of itself
It answers almost anything instantly in clean prose, sounds confident, and is right often enough that it really seems to "know" things.
Extremely good pattern-matching
No beliefs, no check on whether something is true, no sense of being right or wrong — just the most likely continuation, drawn from what it read.
This is also why its famous weaknesses aren't random gremlins. Made-up facts, shaky arithmetic, going out of date — each one follows straight from "it predicts likely text; it doesn't know things." We'll line them all up, and how to steer around them, a few lessons from here.
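A small sketch shows why a made-up fact follows from "predicts likely text". The probabilities below are invented to illustrate the failure mode, not measured from any real model, and `most_likely` is a hypothetical helper. Notice that no step checks the answer against the world.

```python
PROMPT = "The capital of Australia is"

# Invented next-word probabilities, as if "Sydney" had appeared
# near this phrase more often in the training text than "Canberra".
NEXT_WORD_PROBS = {"Sydney": 0.55, "Canberra": 0.40, "Melbourne": 0.05}

def most_likely(probs):
    """Pick the highest-probability word; truth never enters into it."""
    return max(probs, key=probs.get)

print(PROMPT, most_likely(NEXT_WORD_PROBS))
```

Under these assumed numbers the sentence comes out fluent, confident, and wrong: the capital is Canberra. The procedure did exactly what it was built to do.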
And every pattern you just watched it absorb — grammar, facts, reasoning — gets sharper, with brand-new ones switching on, as the model grows bigger. Which raises the obvious questions: how big, and at what cost?