Posts

Season 2 · Ch. 5

I Spent Another $68 Because a Spreadsheet Wouldn't Stop Staring at Me

The question that wouldn’t leave me alone S2-03 ended with a question I couldn’t stop thinking about: What would more training buy? GPUburnout-1B had trained on 11.8 billion tokens — 59% of Chinchilla-optimal for a 1B model. The data was sitting there. Twenty billion tokens is the theoretically ideal ratio for a billion parameters: twenty tokens per parameter, the point where your compute budget is perfectly balanced between model size and training data. I was 41% short of that line. ...

Season 2 · Ch. 4

10 Things I Learned Training a 1B Parameter Model That Nobody Talks About

The stuff that doesn’t make it into papers Research papers tell you about architectures, loss functions, and scaling laws. They do not tell you that the cheapest GPU per hour is almost never the cheapest GPU per token, that your biggest optimization is probably a boolean you forgot to flip, or that every single crash you’ll experience will be infrastructure — never training code. They especially don’t tell you that the five-second decision you make on day one about which datacenter region to pick will haunt you for the entire project. ...

Season 2 · Ch. 3

What GPUburnout-1B Actually Learned

Time to face the music Training a language model is the fun part. You watch the loss drop, you generate text samples that are slightly less incoherent than yesterday’s, you tell yourself “look, it almost knows what France is.” It’s addictive. It’s rewarding. It also tells you absolutely nothing about how good your model actually is. Benchmarking is where the universe hands you a report card you didn’t ask for. ...

Season 2 · Ch. 2

The $175 Experiment: Training GPUburnout-1B on a Single GPU

The short version I trained a 1 billion parameter model from scratch. It took 90,000 steps, 11.8 billion tokens, one A100 GPU, and $175. The model went from generating random unicode soup to writing paragraphs about single-cell RNA sequencing with confidently hallucinated journal citations. (They look real. They are not.) This is the full story — every phase, every dollar, and every moment I stared at a loss curve instead of sleeping like a normal person. ...

Season 2 · Ch. 1

From 134M to 1B: Building GPUburnout-1B From Scratch

Season 1 is over. Time to scale up. Six weeks ago, I started this blog with a simple question: what actually happens inside a language model? The answer turned into a six-post series where I built GPT-2 from scratch — 134 million parameters, 2.8 billion tokens, and a Colab session that crashed more often than it didn’t. I learned a lot. Not just about transformers and tokenizers, but about the thousand small decisions that determine whether your training run produces coherent English or expensive gibberish. I took training time from 90 minutes down to 21 minutes. I watched a random pile of floating-point numbers slowly learn that Paris is a city and “the” comes before nouns. ...

Season 1 · Ch. 6

Training Optimizations Deep Dive: How I Made the A100 Actually Work

The complete technical reference for achieving 16x speedup. Every optimization explained with code and diagrams.

Season 1 · Ch. 5

The Results Are In (And My Wallet Is Empty)

Final loss curves, the damage to my compute budget, and 22 lessons I paid dearly to learn.

Season 1 · Ch. 4

11 Training Challenges and How I Solved Them

A comprehensive guide to every way I shot myself in the foot training GPT-2 Small. Learn from my pain.

Season 1 · Ch. 3

Scaling Up: From Tiny Model to GPT-2 Small

How I went from ‘cute toy model’ to ‘134 million parameters that need an A100 to breathe.’

Season 1 · Ch. 2

Data Preparation: Building a 12GB Training Corpus

How I built a 12GB ChatGPT-style conversational dataset and implemented BPE tokenization for efficient training.