Season 4 · Ch. 4

Verbatim: The Proof Is in the Output

Benchmarks say the 1B and 2B are basically the same model. The outputs say otherwise. Here are the receipts - same 8 prompts, same temperature (0.7), same top-p (0.9), same max tokens (200). 1B-160K-Chat vs 2B-75K-Chat-DPO, head to head. Why the 1B’s Chat model and not its DPO version? Because DPO made the 1B worse - the best DPO run scored 4/8 garbage, worse than the Chat baseline. The Chat model is the 1B at its best. This is as fair as it gets. ...
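
For anyone who wants to rerun the comparison, a minimal harness looks roughly like this; a sketch assuming Hugging Face Transformers, with placeholder checkpoint paths and prompts - only the sampling settings (temperature 0.7, top-p 0.9, 200 new tokens) come from the post.

```python
# Head-to-head sampling harness (sketch). Checkpoint paths and prompts are
# placeholders; the sampling settings match the comparison described above.
from transformers import AutoModelForCausalLM, AutoTokenizer

CHECKPOINTS = ["path/to/1B-160K-Chat", "path/to/2B-75K-Chat-DPO"]  # hypothetical paths
PROMPTS = ["What is the capital of France?", "Explain what a tokenizer does."]  # placeholders

for ckpt in CHECKPOINTS:
    tokenizer = AutoTokenizer.from_pretrained(ckpt)
    model = AutoModelForCausalLM.from_pretrained(ckpt)
    for prompt in PROMPTS:
        inputs = tokenizer(prompt, return_tensors="pt")
        output = model.generate(
            **inputs,
            do_sample=True,
            temperature=0.7,
            top_p=0.9,
            max_new_tokens=200,
        )
        print(f"--- {ckpt} | {prompt}")
        print(tokenizer.decode(output[0], skip_special_tokens=True))
```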

March 30, 2026 · 11 min · Jun Park
Season 4 · Ch. 1

RIP GPUburnout-1B. Cause of Death: Its Own Training Data.

Nine experiments. Zero fixes. Five SFT runs, four DPO runs, three different datasets - including one written entirely by humans. All failed. The most aggressive DPO config actually made things worse: 7 out of 8 prompts producing garbage. I tried to teach the model manners. It responded by getting louder. We’ve all been there. Diagnosis confirmed. The garbage tokens are pretraining attractors from contaminated source data. No amount of post-training alignment can reach them. The bones were laid wrong. There is no fixing the bones. ...

March 22, 2026 · 3 min · Jun Park
Season 3 · Ch. 3

Nine Experiments, Nine Funerals

I had a diagnosis. Garbage tokens, pretraining contamination, baked into the base weights, unreachable by fine-tuning. Open and shut. Case closed. Except science doesn’t accept “trust me bro” as evidence. The only way to prove the diagnosis was to try fixing it the wrong way and watch it not work. Repeatedly. With increasing desperation. Nine experiments. Zero fixes. One scoreboard. Here we go. SFT: five attempts, five failures. I built a cleaning pipeline, removed 27% of SlimOrca (139K examples), verified zero garbage tokens in the cleaned set, and ran five experiments: ...
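
The cleaning step is easy to picture in code. Here's a minimal sketch, assuming the Hugging Face datasets API and SlimOrca's ShareGPT-style fields, with the garbage strings taken from the token list in Ch. 2; the real pipeline removed 27% of the set, this only illustrates the shape of the filter.

```python
# Sketch of a garbage-token filter over SlimOrca (not the post's exact pipeline).
# Assumes ShareGPT-style records with a "conversations" list of {"from", "value"} turns.
from datasets import load_dataset

GARBAGE = ["PersonX", "AndroidRuntime", "fefefe", "oardvark", "Paasilinna"]  # tokens named in Ch. 2

def is_clean(example):
    text = " ".join(turn.get("value", "") for turn in example["conversations"])
    return not any(tok in text for tok in GARBAGE)

ds = load_dataset("Open-Orca/SlimOrca", split="train")
cleaned = ds.filter(is_clean)
print(f"kept {len(cleaned)} / {len(ds)} examples "
      f"({100 * (1 - len(cleaned) / len(ds)):.1f}% removed)")
```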

March 21, 2026 · 4 min · Jun Park
Season 3 · Ch. 2

My Model's Vocabulary Came from Stack Overflow at 3am

My chat model had a haunted vocabulary. PersonX. AndroidRuntime. fefefe. oardvark. Paasilinna. The same seven nonsense tokens, in different prompts, at different temperatures, across totally separate runs. Not random. Specific. Reproducible. A slot machine that only ever pays out in cursed symbols. I needed to find where they came from. Standard CSI episode: dust the model for prints, follow the trail back, identify the perpetrator. My first suspect: the fine-tuning data. SlimOrca is GPT-4 generated, and machine text sometimes carries annotation crud from academic NLP datasets. Plausible. Easy to test. Confidently wrong. ...

March 18, 2026 · 4 min · Jun Park
Season 3 · Ch. 1

Teaching the 1B to Talk

At the end of Season 2, I had a “working” 1B parameter language model. The scare quotes are doing some heavy lifting. Yes, it could complete sentences. Yes, it knew Paris was a city. Yes, it could write paragraphs about single-cell RNA sequencing with journal citations that looked real and were absolutely not. Ask it the capital of France and it would confidently answer “the currency in the money is dollar and the currency is dollar and the currency is the euro and euro.” Technically not wrong about the euro. Wildly wrong about everything else. As base models go, it was functional. As useful tools go, it was a paperweight that cost electricity. ...

March 18, 2026 · 4 min · Jun Park
Season 2 · Ch. 5

I Spent Another $68 Because a Spreadsheet Wouldn't Stop Staring at Me

The question that wouldn’t leave me alone. S2-03 ended with a question I couldn’t stop thinking about: What would more training buy? GPUburnout-1B had trained on 11.8 billion tokens - 59% of Chinchilla-optimal for a 1B model. The data was sitting there. Twenty billion tokens is the theoretical ideal for a billion parameters: twenty tokens per parameter, the point where your compute budget is perfectly balanced between model size and training data. I was 41% short of that line. ...
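
The arithmetic behind those percentages is just the twenty-tokens-per-parameter rule applied to the post's own numbers:

```python
# Chinchilla back-of-the-envelope, using only figures quoted above.
params = 1.0e9                    # GPUburnout-1B
tokens_trained = 11.8e9           # tokens seen by the end of S2-03
optimal_tokens = 20 * params      # ~20 tokens per parameter
progress = tokens_trained / optimal_tokens
print(f"{progress:.0%} of Chinchilla-optimal, {1 - progress:.0%} short")
# -> 59% of Chinchilla-optimal, 41% short
```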

March 15, 2026 · 9 min · Jun Park
Season 2 · Ch. 4

10 Things I Learned Training a 1B Parameter Model That Nobody Talks About

The stuff that doesn’t make it into papers. Research papers tell you about architectures, loss functions, and scaling laws. They do not tell you that the cheapest GPU per hour is almost never the cheapest GPU per token, that your biggest optimization is probably a boolean you forgot to flip, or that every single crash you’ll experience will be infrastructure - never training code. They especially don’t tell you that the five-second decision you make on day one about which datacenter region to pick will haunt you for the entire project. ...

March 7, 2026 · 14 min · Jun Park
Season 2 · Ch. 3

What GPUburnout-1B Actually Learned

Time to face the music. Training a language model is the fun part. You watch the loss drop, you generate text samples that are slightly less incoherent than yesterday’s, you tell yourself “look, it almost knows what France is.” It’s addictive. It’s rewarding. It also tells you absolutely nothing about how good your model actually is. Benchmarking is where the universe hands you a report card you didn’t ask for. ...

March 6, 2026 · 10 min · Jun Park
Season 2 · Ch. 2

The $175 Experiment: Training GPUburnout-1B on a Single GPU

The short version: I trained a 1 billion parameter model from scratch. It took 90,000 steps, 11.8 billion tokens, one A100 GPU, and $175. The model went from generating random unicode soup to writing paragraphs about single-cell RNA sequencing with confidently hallucinated journal citations. (They look real. They are not.) This is the full story - every phase, every dollar, and every moment I stared at a loss curve instead of sleeping like a normal person. ...
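
Two numbers worth pulling out of that summary, derived from nothing but the figures above (a rough check that ignores warmup, restarts, and idle GPU time):

```python
# Quick sanity math on the run: tokens per step and dollars per billion tokens.
steps = 90_000
tokens = 11.8e9
cost_usd = 175
print(f"{tokens / steps:,.0f} tokens per optimizer step")      # ~131,111
print(f"${cost_usd / (tokens / 1e9):.2f} per billion tokens")   # ~$14.83
```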

March 4, 2026 · 10 min · Jun Park
Season 2 · Ch. 1

From 134M to 1B: Building GPUburnout-1B From Scratch

Season 1 is over. Time to scale up. Six weeks ago, I started this blog with a simple question: what actually happens inside a language model? The answer turned into a six-post series where I built GPT-2 from scratch - 134 million parameters, 2.8 billion tokens, and a Colab session that crashed more often than it didn’t. I learned a lot. Not just about transformers and tokenizers, but about the thousand small decisions that determine whether your training run produces coherent English or expensive gibberish. I took training time from 90 minutes down to 21 minutes. I watched a random pile of floating-point numbers slowly learn that Paris is a city and “the” comes before nouns. ...

February 27, 2026 · 7 min · Jun Park