Season 2 · Ch. 1

From 134M to 1B: Building GPUburnout-1B From Scratch

Season 1 is over. Time to scale up. Six weeks ago, I started this blog with a simple question: what actually happens inside a language model? The answer turned into a six-post series where I built GPT-2 from scratch — 134 million parameters, 2.8 billion tokens, and a Colab session that crashed more often than it didn’t. I learned a lot. Not just about transformers and tokenizers, but about the thousand small decisions that determine whether your training run produces coherent English or expensive gibberish. I took training time from 90 minutes down to 21 minutes. I watched a random pile of floating-point numbers slowly learn that Paris is a city and “the” comes before nouns. ...

February 27, 2026 · 7 min · Jun Park
Season 1 · Ch. 3

Scaling Up: From Tiny Model to GPT-2 Small

How I went from ‘cute toy model’ to ‘134 million parameters that need an A100 to breathe.’

January 27, 2026 · 4 min · Jun Park