Posts

Season 5 · Ch. 3

Nothing Happened for 75,000 Steps and It Was Glorious

After Chapter 1 (three days of cloud chaos) and Chapter 2 (twelve hours of blaming the wrong thing), you have earned the right to expect another disaster chapter. I am sorry. There is no disaster here. The training worked. Here is the diary.

Day   What happened
1     Loss went down
2     Loss went down
3     Loss went down
4     Loss went down
5     Loss reached 2.2475. Run complete.

That is the whole season, basically. We can stop now if you want. ...

Season 5 · Ch. 2

My Code Agent Said It Was a Moose. I Said No. It Was a Moose.

The H200 was working. The 3B was training. After three days of fighting the cloud, the model was finally putting tokens through the GPU at 23,200 per second. I had a checkpoint at step 1,000. I had a checkpoint at step 1,200. I went to bed feeling, briefly, like a person. Six hours later the run was dead. The checkpoint at step 1,200 was corrupted. The next run got to step 25 and froze. The one after that got to step 17 and silently disappeared. ...

Season 5 · Ch. 1

I Have an A100. I Have 528 Shards of Data. I Cannot Combine Them.

I had a 3B model expanded from the 2B-75K base. Code tested. Smoke test passed. 528 shards on my NAS, ~70 GB, ~38 billion tokens of FineWeb, FineMath, PubMed, and cleaned Python. Three days later I had spent zero training tokens and was 1,200 words deep into a Notion page about VRAM accounting. This is that story.

Why a 3B

Two reasons. One: I wanted the next model to know what a kinase is. The 2B was clean, polite, and had read a lot of FineWeb. It had also never seen a single PubMed abstract. I have plans for this model that involve answering biomedical questions, and you cannot retrieve your way out of a model that does not know what “phosphorylation” means. The 3B’s data plan added 256 shards of PubMed, ~5.5B tokens, all fresh. The 2B is a polite generalist. The 3B is a polite generalist who also took two semesters of biochemistry. ...

Season 4 · Ch. 4

Verbatim: The Proof Is in the Output

Benchmarks say the 1B and 2B are basically the same model. The outputs say otherwise. Here are the receipts - same 8 prompts, same temperature (0.7), same top-p (0.9), same max tokens (200). 1B-160K-Chat vs 2B-75K-Chat-DPO, head to head. Why the 1B’s Chat model and not its DPO version? Because DPO made the 1B worse - the best DPO run scored 4/8 garbage, worse than the Chat baseline. The Chat model is the 1B at its best. This is as fair as it gets. ...

Season 4 · Ch. 3

7 Out of 8 - How DPO Finally Worked

Season 3: four DPO configurations on the 1B. Best: 4/8 clean. Worst: 7/8 garbage. More training literally made the model dumber. I had receipts. Season 4: same technique, similar hyperparameters, on the 2B. Result: 7/8 clean. First try. No suffering necessary. Same method. Different foundation. Completely different outcome. That’s the entire moral of Season 4 in one A/B test. I could end the post here. I won’t, because the details are too good to skip. ...

Season 4 · Ch. 2

1.92B Parameters, 38.4B Tokens, Zero Garbage

The 1B taught me that pretraining data quality matters more than anything that comes after it. Clean before you tokenize, or spend weeks trying to undo what you can’t undo. The 2B is the experiment that tests whether I actually learned that lesson, or whether I just said I learned it. Same architecture family, same training code, same evaluation suite. Different data - 600,000 documents scanned, 660 contaminated ones deleted, everything re-tokenized from scratch. If the hypothesis is right, the 2B should produce zero garbage tokens even before SFT runs. If it’s wrong, congratulations, I just bought a $183 souvenir. Either way, somebody learns something. ...

Season 4 · Ch. 1

RIP GPUburnout-1B. Cause of Death: Its Own Training Data.

Nine experiments. Zero fixes. Five SFT runs, four DPO runs, three different datasets - including one written entirely by humans. All failed. The most aggressive DPO config actually made things worse: 7 out of 8 prompts producing garbage. I tried to teach the model manners. It responded by getting louder. We’ve all been there. Diagnosis confirmed. The garbage tokens are pretraining attractors from contaminated source data. No amount of post-training alignment can reach them. The bones were laid wrong. There is no fixing the bones. ...

Season 3 · Ch. 3

Nine Experiments, Nine Funerals

I had a diagnosis. Garbage tokens, pretraining contamination, baked into the base weights, unreachable by fine-tuning. Open and shut. Case closed. Except science doesn’t accept “trust me bro” as evidence. The only way to prove the diagnosis was to try fixing it the wrong way and watch it not work. Repeatedly. With increasing desperation. Nine experiments. Zero fixes. One scoreboard. Here we go.

SFT: Five Attempts, Five Failures

I built a cleaning pipeline, removed 27% of SlimOrca (139K examples), verified zero garbage tokens in the cleaned set, and ran five experiments: ...

Season 3 · Ch. 2

My Model's Vocabulary Came from Stack Overflow at 3am

My chat model had a haunted vocabulary. PersonX. AndroidRuntime. fefefe. oardvark. Paasilinna. The same seven nonsense tokens, in different prompts, at different temperatures, across totally separate runs. Not random. Specific. Reproducible. A slot machine that only ever pays out in cursed symbols. I needed to find where they came from. Standard CSI episode: dust the model for prints, follow the trail back, identify the perpetrator. My first suspect: the fine-tuning data. SlimOrca is GPT-4 generated, and machine text sometimes carries annotation crud from academic NLP datasets. Plausible. Easy to test. Confidently wrong. ...

Season 3 · Ch. 1

Teaching the 1B to Talk

At the end of Season 2, I had a “working” 1B parameter language model. The scare quotes are doing some heavy lifting. Yes, it could complete sentences. Yes, it knew Paris was a city. Yes, it could write paragraphs about single-cell RNA sequencing with journal citations that looked real and were absolutely not. Ask it the capital of France and it would confidently answer “the currency in the money is dollar and the currency is dollar and the currency is the euro and euro.” Technically not wrong about the euro. Wildly wrong about everything else. As base models go, it was functional. As useful tools go, it was a paperweight that costs electricity. ...