The training data — How LLMs Work

Where the knowledge comes from

The model learned by reading a huge slice of the internet.

Training happens in stages. By far the biggest is the first one, pre-training: play "guess the next word" across an enormous pile of text until the model has soaked up how language and the world work. (Two shorter stages later turn it into a helpful assistant.) Everything starts with the fuel — the data. Let's see what it actually is.

Walk the pipeline

From web pages to a learned word

Click through the steps. A few realistic web pages get cleaned, turned into fill-in-the-blank questions, and, taken together, teach the model a single word.
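To make the "fill-in-the-blank questions" idea concrete, here is a minimal sketch of how one cleaned sentence could be sliced into questions. It is an illustration under assumptions: the function name next_word_examples and the context_size parameter are made up for this example, and it splits on whole words, whereas real models work on tokens (words or pieces of words).

```python
def next_word_examples(text, context_size=4):
    """Turn a passage into (context, target) pairs for next-word prediction.

    Hypothetical helper for illustration: splits on whitespace instead of
    using a real tokenizer.
    """
    words = text.split()
    examples = []
    for i in range(1, len(words)):
        # The context is the few words just before position i;
        # the target is the word the model has to guess.
        context = words[max(0, i - context_size):i]
        examples.append((" ".join(context), words[i]))
    return examples


examples = next_word_examples("the cat sat on the mat")
for context, target in examples:
    print(f"{context} ___  ->  {target}")
```

Every position in the text yields one question, which is why even a short page produces many training examples.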

Where it all comes from — and how much

Models train on enormous piles of exactly this kind of text: public web crawls (Common Crawl), Wikipedia, books, and large amounts of code, plus cleaned-up collections like C4, The Pile, and FineWeb. Altogether it adds up to trillions of tokens (a token is a word or a piece of a word), very roughly the text of tens of millions of books. No human could read even a sliver of it in a lifetime.
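A quick back-of-the-envelope check shows where "tens of millions of books" comes from. Every number in this sketch is an assumption chosen for illustration (dataset size, book length, tokens per word, reading speed), so treat the result as an order of magnitude, not a measurement.

```python
# All of the numbers below are illustrative assumptions, not measurements.
TOTAL_TOKENS = 10 * 10**12   # assume a 10-trillion-token training set
TOKENS_PER_BOOK = 130_000    # assume ~100,000 words per book at ~1.3 tokens per word
BOOKS_PER_DAY = 1            # assume a tireless reader: one book every single day
YEARS_READING = 80

books_in_dataset = TOTAL_TOKENS / TOKENS_PER_BOOK
books_in_lifetime = BOOKS_PER_DAY * 365 * YEARS_READING
fraction_read = books_in_lifetime / books_in_dataset

print(f"Books' worth of text: about {books_in_dataset / 1e6:.0f} million")
print(f"Books read in a lifetime: {books_in_lifetime:,}")
print(f"Share of the dataset read: {fraction_read:.4%}")
```

Under these assumptions, even a book-a-day reader gets through well under a tenth of one percent of the data.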

One thing to be clear about: the model doesn't keep all this text. It reads each passage to adjust its dials, then moves on; the original pages aren't filed away inside it. What's left at the end is mostly the patterns, not the pages, although passages that appear many times can end up memorised.

Garbage in, garbage out

As step 1 showed, raw web data is mostly junk. What the model becomes depends heavily on what survives the cleaning, so a lot of careful work goes in before training:

De-duplicate

Remove repeats

The same text appears thousands of times across the web. Duplicates are stripped out so the model learns broadly instead of memorising the same passages.
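The simplest version of this idea can be sketched as exact de-duplication: tidy each document, fingerprint it with a hash, and keep only the first copy of each fingerprint. The helper names normalize and deduplicate are invented for this example. Production pipelines typically go further and also catch near-duplicates (pages that differ by a few words), which this sketch does not attempt.

```python
import hashlib


def normalize(text):
    """Lowercase and collapse whitespace so trivial differences don't hide a repeat."""
    return " ".join(text.lower().split())


def deduplicate(docs):
    """Keep the first copy of each document, comparing by hash of the normalized text."""
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


pages = ["Hello   world", "hello world", "Something else"]
print(deduplicate(pages))
```

Storing short fingerprints instead of full pages keeps the memory cost small even when there are billions of documents.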

Quality filter

Keep the good stuff

Spam, gibberish, and auto-generated pages are dropped; readable, informative text is kept — often using another model to score quality.
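As a rough illustration, a quality filter can start with a few cheap rules before any model gets involved. The sketch below is a hand-written heuristic, not the model-based scoring mentioned above, and its thresholds (min_words, min_alpha, min_unique) are arbitrary assumptions picked for the example.

```python
import string


def looks_readable(text, min_words=5, min_alpha=0.7, min_unique=0.3):
    """Crude quality check: enough words, mostly real words, not too repetitive."""
    words = [w.strip(string.punctuation) for w in text.split()]
    words = [w for w in words if w]
    if len(words) < min_words:
        return False
    # Share of words made only of letters (gibberish tends to score low).
    alpha_fraction = sum(w.isalpha() for w in words) / len(words)
    # Share of distinct words (spam tends to repeat itself).
    unique_fraction = len({w.lower() for w in words}) / len(words)
    return alpha_fraction >= min_alpha and unique_fraction >= min_unique


print(looks_readable("The heart pumps blood through the body about once a minute."))
print(looks_readable("buy now buy now buy now buy now buy now"))
print(looks_readable("x7$ 99@@ ##1 zz9 0x0 q1w2 !!"))
```

Rules like these are blunt, which is one reason a learned scorer is often used on top of them.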

Safety & privacy

Remove harmful data

Illegal and abusive content and personal data (names, addresses, phone numbers) are filtered out as far as possible before training.
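For the easier kinds of personal data, a pattern match already goes a long way. The sketch below swaps email addresses and phone-like numbers for placeholders. Note that it redacts rather than drops the text, which is one possible design choice and not necessarily what any particular pipeline does. The two patterns are simplified assumptions that will miss some cases and catch some false positives, and names and street addresses need heavier tools than regular expressions.

```python
import re

# Simplified, illustrative patterns; real-world formats vary far more than this.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def scrub(text):
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text


print(scrub("Call 555-123-4567 or mail jo@example.com today"))
```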

Decontaminate

Don't leak the test

Known test and benchmark questions are removed, so later scores measure real ability, not memorised answers.
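One common way to spot leaked test questions is to look for long runs of words shared between a training document and a benchmark. Here is a minimal sketch of that idea, assuming word-level matching and a short window of five words so the example stays readable. The function names and the window size are illustrative choices, not a standard.

```python
def ngrams(text, n):
    """Return the set of all runs of n consecutive lowercase words in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def is_contaminated(doc, benchmark_questions, n=5):
    """True if the document shares any n-word run with any benchmark question."""
    doc_ngrams = ngrams(doc, n)
    return any(doc_ngrams & ngrams(question, n) for question in benchmark_questions)


benchmark = ["What is the capital city of the country of France?"]
leaked = "Quiz answers: what is the capital city of the country of france? Paris."
clean = "Paris is a large city on the Seine."

print(is_contaminated(leaked, benchmark))
print(is_contaminated(clean, benchmark))
```

A longer window makes false alarms rarer but lets lightly reworded copies slip through, so the window size is a trade-off.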

A big lesson of recent years: data quality matters as much as quantity. A smaller, cleaner, well-balanced dataset often beats a bigger, dirtier one.

So now we have the fuel — trillions of clean tokens, turned into endless fill-in-the-blank questions — and in that last step you nudged the dials toward the right word, one training step at a time. Next comes the part everyone hand-waves: how the model works out which way to nudge each of its billions of dials, all at once.