Season 4 · Ch. 2

1.92B Parameters, 38.4B Tokens, Zero Garbage

The 1B taught me that pretraining data quality matters more than anything that comes after it. Clean before you tokenize, or spend weeks trying to undo what you can’t undo. The 2B is the experiment that tests whether I actually learned that lesson, or just said I did. Same architecture family, same training code, same evaluation suite. Different data: 600,000 documents scanned, 660 contaminated ones deleted, everything re-tokenized from scratch. If the hypothesis is right, the 2B should produce zero garbage tokens even before SFT runs. If it’s wrong, congratulations: I just bought a $183 souvenir. Either way, somebody learns something. ...
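For concreteness, here’s a minimal sketch of what a pre-tokenization contamination scan like the one described above can look like: an n-gram overlap filter run against the evaluation suite over a JSONL corpus, before anything hits the tokenizer. The post doesn’t say how the 660 documents were detected, so the 13-gram window, the file format, and every name here are my assumptions, not the actual pipeline.

```python
# Hypothetical sketch of an eval-overlap decontamination pass.
# Assumptions (not from the post): JSONL corpus with a "text" field,
# a 13-gram overlap criterion, and whole-document deletion on any hit.
import json

NGRAM = 13  # a common window size for eval-set decontamination


def ngrams(text: str, n: int = NGRAM) -> set[tuple[str, ...]]:
    """All lowercase whitespace-token n-grams in a document."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}


def build_eval_index(eval_texts: list[str]) -> set[tuple[str, ...]]:
    """Union of n-grams across every evaluation document."""
    idx: set[tuple[str, ...]] = set()
    for t in eval_texts:
        idx |= ngrams(t)
    return idx


def clean_corpus(in_path: str, out_path: str,
                 eval_idx: set[tuple[str, ...]]) -> tuple[int, int]:
    """Copy the corpus, dropping any document that shares an n-gram
    with the eval suite. Returns (scanned, dropped) counts."""
    scanned = dropped = 0
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            doc = json.loads(line)
            scanned += 1
            if ngrams(doc["text"]) & eval_idx:  # any overlap -> contaminated
                dropped += 1
                continue
            fout.write(line)
    return scanned, dropped
```

The design choice worth noting: deleting whole documents rather than excising spans keeps the pipeline simple and is cheap when the hit rate is this low (660 of 600,000 is about 0.1% of documents).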

March 28, 2026 · 4 min · Jun Park