I Spent Another $68 Because a Spreadsheet Wouldn't Stop Staring at Me
The question that wouldn’t leave me alone

S2-03 ended with a question I couldn’t stop thinking about: What would more training buy? GPUburnout-1B had trained on 11.8 billion tokens — 59% of Chinchilla-optimal for a 1B model. The data was sitting there. The Chinchilla rule of thumb is twenty tokens per parameter, the point where a fixed compute budget is split evenly between scaling the model and feeding it data; for a billion parameters, that works out to twenty billion tokens. I was 41% short of that line. ...
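If you want the back-of-the-envelope math spelled out, here it is as a quick Python sketch. Nothing in it beyond the numbers already quoted above; the variable names are mine.

```python
# Chinchilla arithmetic for GPUburnout-1B.
# 20 tokens/parameter is the standard Chinchilla rule of thumb;
# the parameter and token counts are the ones from this post.

PARAMS = 1_000_000_000          # GPUburnout-1B parameter count
TOKENS_SEEN = 11_800_000_000    # tokens already trained on
CHINCHILLA_RATIO = 20           # optimal tokens per parameter

optimal_tokens = PARAMS * CHINCHILLA_RATIO   # 20B tokens for a 1B model
progress = TOKENS_SEEN / optimal_tokens      # fraction of optimal reached
remaining = optimal_tokens - TOKENS_SEEN     # tokens still to go

print(f"Optimal budget: {optimal_tokens / 1e9:.1f}B tokens")
print(f"Trained so far: {progress:.0%} of Chinchilla-optimal")
print(f"Shortfall:      {remaining / 1e9:.1f}B tokens ({1 - progress:.0%})")
```

Run it and you get 59% trained, 8.2B tokens (41%) to go — the same gap the spreadsheet kept staring at me with.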