Season 4 · Ch. 4

Verbatim: The Proof Is in the Output

Benchmarks say the 1B and 2B are basically the same model. The outputs say otherwise. Here are the receipts - same 8 prompts, same temperature (0.7), same top-p (0.9), same max tokens (200). 1B-160K-Chat vs 2B-75K-Chat-DPO, head to head. Why the 1B’s Chat model and not its DPO version? Because DPO made the 1B worse - the best DPO run scored 4/8 garbage, worse than the Chat baseline. The Chat model is the 1B at its best. This is as fair as it gets. ...
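For anyone who wants to rerun this kind of head-to-head, here is a minimal sketch of the harness. The model IDs and prompts are placeholders, not the actual artifacts from this series; the sampling settings are the ones above, and I'm assuming "max tokens" means new tokens. A real run would also wrap each prompt in the model's chat template.

```python
# Reproduction sketch, not the post's actual harness.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODELS = ["your-org/1B-160K-Chat", "your-org/2B-75K-Chat-DPO"]  # placeholders
PROMPTS = [
    "Explain what a tokenizer does.",  # stand-ins for the post's 8 prompts
    "Write a haiku about GPUs.",
]

for name in MODELS:
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name, device_map="auto")
    for prompt in PROMPTS:
        inputs = tok(prompt, return_tensors="pt").to(model.device)
        out = model.generate(
            **inputs,
            do_sample=True,      # sample rather than greedy-decode
            temperature=0.7,
            top_p=0.9,
            max_new_tokens=200,  # assumption: "max tokens" = new tokens
        )
        completion = out[0][inputs.input_ids.shape[1]:]  # strip the prompt
        print(f"--- {name} :: {prompt}")
        print(tok.decode(completion, skip_special_tokens=True))
```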

March 30, 2026 · 11 min · Jun Park
Season 4 · Ch. 3

7 Out of 8 - How DPO Finally Worked

Season 3: four DPO configurations on the 1B. Best: 4/8 clean. Worst: 7/8 garbage. More training literally made the model dumber. I had receipts. Season 4: same technique, similar hyperparameters, on the 2B. Result: 7/8 clean. First try. No suffering necessary. Same method. Different foundation. Completely different outcome. That’s the entire moral of Season 4 in one A/B test. I could end the post here. I won’t, because the details are too good to skip. ...
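For context on what "same technique" means here, this is the standard DPO objective in a few lines of PyTorch - a generic sketch of the textbook loss, not the actual training code from either season, and beta=0.1 is a common default rather than the value used in these runs.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss from per-sequence log-probabilities.

    Each argument is a tensor of summed token log-probs for the chosen
    or rejected response, under the policy or the frozen reference.
    """
    # How much more the policy prefers the chosen answer than the reference does...
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    # ...and the same for the rejected answer.
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    # Push the implicit reward margin between chosen and rejected apart.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```

The point of the A/B test is that nothing in this objective changed between seasons; only the pretrained weights underneath it did.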

March 29, 2026 · 5 min · Jun Park
Season 4 · Ch. 2

1.92B Parameters, 38.4B Tokens, Zero Garbage

The 1B taught me that pretraining data quality matters more than anything that comes after it. Clean before you tokenize, or spend weeks trying to undo what you can’t undo. The 2B is the experiment that tests whether I actually learned that lesson, or whether I just said I learned it. Same architecture family, same training code, same evaluation suite. Different data - 600,000 documents scanned, 660 contaminated ones deleted, everything re-tokenized from scratch. If the hypothesis is right, the 2B should produce zero garbage tokens even before SFT runs. If it’s wrong, congratulations, I just bought a $183 souvenir. Either way, somebody learns something. ...
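(For scale: 38.4B tokens over 1.92B parameters is the 20-tokens-per-parameter Chinchilla ratio.) The cleaning step itself is conceptually simple: scan every document for known garbage signatures before tokenization and drop whatever matches. A hedged sketch of that scan, with a made-up pattern list standing in for the real contamination signatures, which would come from inspecting the 1B's garbage outputs:

```python
import json
import re

# Hypothetical contamination signatures: replacement chars, long repeats.
GARBAGE_PATTERNS = [re.compile(p) for p in [r"\ufffd", r"(.)\1{20,}"]]

def is_contaminated(text: str) -> bool:
    return any(p.search(text) for p in GARBAGE_PATTERNS)

def clean_corpus(in_path: str, out_path: str) -> tuple[int, int]:
    """Stream a JSONL corpus, drop contaminated docs, report counts."""
    scanned = dropped = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            scanned += 1
            doc = json.loads(line)
            if is_contaminated(doc["text"]):
                dropped += 1  # the 660 of 600,000, in this corpus's case
                continue
            dst.write(line)
    return scanned, dropped
```

660 out of 600,000 is about 0.1% of documents - tiny by volume, which is exactly why the re-tokenization matters: the 1B showed that a fraction of a percent is enough to carve attractors into the weights.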

March 28, 2026 · 4 min · Jun Park
Season 4 · Ch. 1

RIP GPUburnout-1B. Cause of Death: Its Own Training Data.

Nine experiments. Zero fixes. Five SFT runs, four DPO runs, three different datasets - including one written entirely by humans. All failed. The most aggressive DPO config actually made things worse: 7 out of 8 prompts producing garbage. I tried to teach the model manners. It responded by getting louder. We’ve all been there. Diagnosis confirmed. The garbage tokens are pretraining attractors from contaminated source data. No amount of post-training alignment can reach them. The bones were laid wrong. There is no fixing the bones. ...
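One way to check the attractor diagnosis yourself - a sketch, since the post doesn't show its actual method, and the checkpoint name and strings below are placeholders: if the base model, before any SFT or DPO, already assigns the garbage strings likelihood comparable to clean text, the problem predates alignment.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def mean_logprob(model, tok, text: str) -> float:
    """Average per-token log-prob of `text` under `model`."""
    ids = tok(text, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        out = model(ids, labels=ids)
    return -out.loss.item()  # loss is mean NLL, so negate it

name = "your-org/GPUburnout-1B-base"  # placeholder for the base checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, device_map="auto")

garbage = "<paste a garbage string the 1B actually emits>"
clean = "The quick brown fox jumps over the lazy dog."
print("garbage:", mean_logprob(model, tok, garbage))
print("clean:  ", mean_logprob(model, tok, clean))
# If garbage scores comparably or higher, no amount of post-training
# can out-vote what pretraining baked in.
```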

March 22, 2026 · 3 min · Jun Park