Season 4 · Ch. 2

1.92B Parameters, 38.4B Tokens, Zero Garbage

The 1B taught me that pretraining data quality matters more than anything that comes after it. Clean before you tokenize, or spend weeks trying to undo what you can’t undo. The 2B is the experiment that tests whether I actually learned that lesson, or whether I just said I learned it. Same architecture family, same training code, same evaluation suite. Different data - 600,000 documents scanned, 660 contaminated ones deleted, everything re-tokenized from scratch. If the hypothesis is right, the 2B should produce zero garbage tokens even before SFT runs. If it’s wrong, congratulations, I just bought a $183 souvenir. Either way, somebody learns something. ...
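A minimal sketch of what that scan-then-delete pass could look like, assuming the raw corpus sits in JSONL shards with a "text" field and that contamination is detectable by known marker strings; the directory names and the marker list here are illustrative, not the actual pipeline from the post.

```python
# Hypothetical "clean before you tokenize" pass: scan every document,
# drop any that contain known contamination markers, write clean shards,
# and only then hand the result to the tokenizer.
import json
from pathlib import Path

# Suspect strings taken from the 1B's garbage outputs (see the Season 3/4 posts).
CONTAMINATION_MARKERS = ["PersonX", "AndroidRuntime", "fefefe", "oardvark", "Paasilinna"]

def is_contaminated(text: str) -> bool:
    """A document is dropped if it contains any known marker string."""
    return any(marker in text for marker in CONTAMINATION_MARKERS)

def clean_corpus(in_dir: str, out_dir: str) -> None:
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    scanned = dropped = 0
    for shard in sorted(Path(in_dir).glob("*.jsonl")):
        kept_docs = []
        with shard.open() as f:
            for line in f:
                doc = json.loads(line)
                scanned += 1
                if is_contaminated(doc["text"]):
                    dropped += 1
                    continue
                kept_docs.append(doc)
        with (Path(out_dir) / shard.name).open("w") as f:
            for doc in kept_docs:
                f.write(json.dumps(doc) + "\n")
    print(f"scanned {scanned} docs, dropped {dropped}; re-tokenize from {out_dir}")

if __name__ == "__main__":
    clean_corpus("data/pretrain_raw", "data/pretrain_clean")  # placeholder paths
```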

March 28, 2026 · 4 min · Jun Park
Season 4 · Ch. 1

RIP GPUburnout-1B. Cause of Death: Its Own Training Data.

Nine experiments. Zero fixes. Five SFT runs, four DPO runs, three different datasets - including one written entirely by humans. All failed. The most aggressive DPO config actually made things worse: 7 out of 8 prompts producing garbage. I tried to teach the model manners. It responded by getting louder. We’ve all been there. Diagnosis confirmed. The garbage tokens are pretraining attractors from contaminated source data. No amount of post-training alignment can reach them. The bones were laid wrong. There is no fixing the bones. ...
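For context on how a "7 out of 8" figure like that can be tallied, here is a rough sketch of a garbage-rate check, assuming a Hugging Face causal LM checkpoint and the marker strings from the earlier posts; the model path, prompt list, and sampling settings are placeholders rather than the actual configs.

```python
# Hypothetical garbage-rate check: generate from each evaluation prompt and
# count how many completions contain any known garbage token.
from transformers import AutoModelForCausalLM, AutoTokenizer

MARKERS = ["PersonX", "AndroidRuntime", "fefefe", "oardvark", "Paasilinna"]
PROMPTS = ["Explain what a tokenizer does.", "Write a haiku about GPUs."]  # the post uses 8

tok = AutoTokenizer.from_pretrained("path/to/gpuburnout-1b-dpo")    # placeholder checkpoint
model = AutoModelForCausalLM.from_pretrained("path/to/gpuburnout-1b-dpo")

garbage = 0
for prompt in PROMPTS:
    inputs = tok(prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.7)
    text = tok.decode(out[0], skip_special_tokens=True)
    if any(m in text for m in MARKERS):
        garbage += 1

print(f"{garbage} / {len(PROMPTS)} prompts produced garbage tokens")
```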

March 22, 2026 · 3 min · Jun Park
Season 3 · Ch. 2

My Model's Vocabulary Came from Stack Overflow at 3am

My chat model had a haunted vocabulary. PersonX. AndroidRuntime. fefefe. oardvark. Paasilinna. The same seven nonsense tokens, in different prompts, at different temperatures, across totally separate runs. Not random. Specific. Reproducible. A slot machine that only ever pays out in cursed symbols. I needed to find where they came from. Standard CSI episode: dust the model for prints, follow the trail back, identify the perpetrator. My first suspect: the fine-tuning data. SlimOrca is GPT-4 generated, and machine text sometimes carries annotation crud from academic NLP datasets. Plausible. Easy to test. Confidently wrong. ...
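The "easy to test" part of that suspicion boils down to grepping the fine-tuning data for the suspect strings. A hedged sketch, assuming the Open-Orca/SlimOrca dataset on the Hugging Face Hub and its ShareGPT-style "conversations" field; adjust the field names if the actual schema differs.

```python
# Hypothetical contamination search over the SFT data: count how many
# SlimOrca examples contain each suspect string.
from collections import Counter
from datasets import load_dataset

SUSPECTS = ["PersonX", "AndroidRuntime", "fefefe", "oardvark", "Paasilinna"]

ds = load_dataset("Open-Orca/SlimOrca", split="train")
hits = Counter()
for example in ds:
    text = " ".join(turn.get("value", "") for turn in example["conversations"])
    for s in SUSPECTS:
        if s in text:
            hits[s] += 1

for s in SUSPECTS:
    print(f"{s}: {hits[s]} matching examples")
```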

March 18, 2026 · 4 min · Jun Park
GPUburnout
Will Code for Tokens
S1 GPT-2 134M
S2 Llama 1B
S3 1B SFT
S4 Llama 2B
S5 Llama 3B