Season 5 · Ch. 3

Nothing Happened for 75,000 Steps and It Was Glorious

After Chapter 1 (three days of cloud chaos) and Chapter 2 (twelve hours of blaming the wrong thing), you have earned the right to expect another disaster chapter. I am sorry. There is no disaster here. The training worked. Here is the diary.

Day 1: Loss went down.
Day 2: Loss went down.
Day 3: Loss went down.
Day 4: Loss went down.
Day 5: Loss reached 2.2475. Run complete.

That is the whole season, basically. We can stop now if you want. ...

April 19, 2026 · 6 min · Jun Park
Season 5 · Ch. 2

My Code Agent Said It Was a Moose. I Said No. It Was a Moose.

The H200 was working. The 3B was training. After three days of fighting the cloud, the model was finally putting tokens through the GPU at 23,200 per second. I had a checkpoint at step 1,000. I had a checkpoint at step 1,200. I went to bed feeling, briefly, like a person. Six hours later the run was dead. The checkpoint at step 1,200 was corrupted. The next run got to step 25 and froze. The one after that got to step 17 and silently disappeared. ...
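The excerpt doesn't show how the corruption was eventually contained, so here is the lesson in miniature - a minimal sketch, not the post's actual code: save to a temp file, prove the file loads, and only then replace the last known-good checkpoint. (`save_checkpoint` is a hypothetical helper; PyTorch is assumed.)

```python
import os
import torch

def save_checkpoint(state: dict, path: str) -> None:
    """Write to a temp file, verify it loads, then atomically
    swap it in over the previous checkpoint."""
    tmp_path = path + ".tmp"
    torch.save(state, tmp_path)
    # Re-read before trusting it: a partial write (dying node,
    # full disk) fails loudly here instead of six hours later.
    torch.load(tmp_path, map_location="cpu")
    os.replace(tmp_path, path)  # atomic rename on POSIX
```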

April 12, 2026 · 10 min · Jun Park
Season 5 · Ch. 1

I Have an A100. I Have 528 Shards of Data. I Cannot Combine Them.

I had a 3B model expanded from the 2B-75K base. Code tested. Smoke test passed. 528 shards on my NAS, ~70 GB, ~38 billion tokens of FineWeb, FineMath, PubMed, and cleaned Python. Three days later I had spent zero training tokens and was 1,200 words deep into a Notion page about VRAM accounting. This is that story.

Why a 3B

Two reasons. One: I wanted the next model to know what a kinase is. The 2B was clean, polite, and had read a lot of FineWeb. It had also never seen a single PubMed abstract. I have plans for this model that involve answering biomedical questions, and you cannot retrieve your way out of a model that does not know what “phosphorylation” means. The 3B’s data plan added 256 shards of PubMed, ~5.5B tokens, all fresh. The 2B is a polite generalist. The 3B is a polite generalist who also took two semesters of biochemistry. ...
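For a sense of what that Notion page was wrestling with, the back-of-envelope VRAM accounting for a 3B model looks roughly like this (my sketch, not the post's numbers; bf16 weights/gradients and fp32 AdamW state are assumptions):

```python
# Rough VRAM budget for a 3B-parameter model, before activations.
params = 3e9

weights_gb = params * 2 / 1e9   # bf16 weights: 2 bytes/param  ->  6 GB
grads_gb   = params * 2 / 1e9   # bf16 gradients               ->  6 GB
optim_gb   = params * 12 / 1e9  # fp32 master + AdamW moments  -> 36 GB

static_gb = weights_gb + grads_gb + optim_gb
print(f"{static_gb:.0f} GB of static state")  # ~48 GB
# On an 80 GB A100 that leaves ~32 GB for activations; on a
# 40 GB A100 it does not fit at all without sharding or offload.
```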

April 7, 2026 · 8 min · Jun Park
Season 2 · Ch. 5

I Spent Another $68 Because a Spreadsheet Wouldn't Stop Staring at Me

The question that wouldn’t leave me alone

S2-03 ended with a question I couldn’t stop thinking about: What would more training buy? GPUburnout-1B had trained on 11.8 billion tokens - 59% of Chinchilla-optimal for a 1B model. The data was sitting there. Twenty billion tokens is the Chinchilla-optimal budget for a billion parameters: twenty tokens per parameter, the point where a fixed compute budget is best split between model size and training data. I was 41% short of that line. ...
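Spelled out, the arithmetic behind "59%" and "41% short" is just the twenty-tokens-per-parameter rule from the excerpt:

```python
params = 1e9
tokens_trained = 11.8e9

chinchilla_tokens = 20 * params                 # 20B tokens for a 1B model
fraction = tokens_trained / chinchilla_tokens
print(f"{fraction:.0%} of Chinchilla-optimal")  # 59%
print(f"{1 - fraction:.0%} short")              # 41%
print(f"{chinchilla_tokens - tokens_trained:.1e} tokens left on the table")  # 8.2e+09
```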

March 15, 2026 · 9 min · Jun Park
Season 2 · Ch. 2

The $175 Experiment: Training GPUburnout-1B on a Single GPU

The short version

I trained a 1 billion parameter model from scratch. It took 90,000 steps, 11.8 billion tokens, one A100 GPU, and $175. The model went from generating random unicode soup to writing paragraphs about single-cell RNA sequencing with confidently hallucinated journal citations. (They look real. They are not.) This is the full story - every phase, every dollar, and every moment I stared at a loss curve instead of sleeping like a normal person. ...
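For scale, the headline totals imply these per-step and per-dollar figures (my arithmetic from the excerpt's numbers, not the post's):

```python
steps, tokens, cost_usd = 90_000, 11.8e9, 175

print(f"{tokens / steps:,.0f} tokens per optimizer step")  # ~131,111
print(f"{tokens / cost_usd:,.0f} tokens per dollar")       # ~67 million
```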

March 4, 2026 · 10 min · Jun Park
Season 1 · Ch. 4

11 Training Challenges and How I Solved Them

A comprehensive guide to every way I shot myself in the foot training GPT-2 Small. Learn from my pain.

February 2, 2026 · 6 min · Jun Park
GPUburnout
Will Code for Tokens
S1 GPT-2 134M
S2 Llama 1B
S3 1B SFT
S4 Llama 2B
S5 Llama 3B