Season 2 · Ch. 1

From 134M to 1B: Building GPUburnout-1B From Scratch

Season 1 is over. Time to scale up. Six weeks ago, I started this blog with a simple question: what actually happens inside a language model? The answer turned into a six-post series where I built GPT-2 from scratch — 134 million parameters, 2.8 billion tokens, and a Colab session that crashed more often than it didn’t. I learned a lot. Not just about transformers and tokenizers, but about the thousand small decisions that determine whether your training run produces coherent English or expensive gibberish. I took training time from 90 minutes down to 21 minutes. I watched a random pile of floating-point numbers slowly learn that Paris is a city and “the” comes before nouns. ...

February 27, 2026 · 7 min · Jun Park
Season 1 · Ch. 3

Scaling Up: From Tiny Model to GPT-2 Small

How I went from ‘cute toy model’ to ‘134 million parameters that need an A100 to breathe.’

January 27, 2026 · 4 min · Jun Park