Season 5 · Ch. 1

I Have an A100. I Have 528 Shards of Data. I Cannot Combine Them.

I had a 3B model expanded from the 2B-75K base. Code tested. Smoke test passed. 528 shards on my NAS, ~70 GB, ~38 billion tokens of FineWeb, FineMath, PubMed, and cleaned Python. Three days later I had spent zero training tokens and was 1,200 words deep into a Notion page about VRAM accounting. This is that story.

Why a 3B

Two reasons. One: I wanted the next model to know what a kinase is. The 2B was clean, polite, and had read a lot of FineWeb. It had also never seen a single PubMed abstract. I have plans for this model that involve answering biomedical questions, and you cannot retrieve your way out of a model that does not know what “phosphorylation” means. The 3B’s data plan added 256 shards of PubMed, ~5.5B tokens, all fresh. The 2B is a polite generalist. The 3B is a polite generalist who also took two semesters of biochemistry. ...

April 7, 2026 · 8 min · Jun Park
Season 2 · Ch. 5

I Spent Another $68 Because a Spreadsheet Wouldn't Stop Staring at Me

The question that wouldn’t leave me alone

S2-03 ended with a question I couldn’t stop thinking about: What would more training buy? GPUburnout-1B had trained on 11.8 billion tokens - 59% of Chinchilla-optimal for a 1B model. The data was sitting there. Twenty billion tokens is the theoretical ideal for a billion parameters: twenty tokens per parameter, the point where your compute budget is perfectly balanced between model size and training data. I was 41% short of that line. ...

March 15, 2026 · 9 min · Jun Park
Season 2 · Ch. 2

The $175 Experiment: Training GPUburnout-1B on a Single GPU

The short version

I trained a 1-billion-parameter model from scratch. It took 90,000 steps, 11.8 billion tokens, one A100 GPU, and $175. The model went from generating random unicode soup to writing paragraphs about single-cell RNA sequencing with confidently hallucinated journal citations. (They look real. They are not.) This is the full story - every phase, every dollar, and every moment I stared at a loss curve instead of sleeping like a normal person. ...

March 4, 2026 · 10 min · Jun Park
GPUburnout
Will Code for Tokens