The Results Are In (And My Wallet Is Empty)

So. After 140 hours of GPU time, countless Colab disconnects, and one existential crisis about whether I should’ve just used a pretrained model… here’s what I got.

Training Progress (The Numbers Don’t Lie)

The Loss Curve

Loss
4.5 |*
    |
4.0 |    *
    |
3.5 |        *
    |            *
3.0 |                *  *  *  *  ?  ?     <- Epoch 8: still going
    |
2.5 |                                      <- Would be nice
    |
2.0 |                                      <- Maybe in my dreams
    +--+--+--+--+--+--+--+--+--+--+--+
    1  2  3  4  5  6  7  8  9  10
                  Epoch

Legend: * = Actually happened, ? = Hopeful projection

It’s going down. That’s… that’s good, right?

The Epoch-by-Epoch Damage Report

Epoch	Train Loss	Speed	Runtime	What Was Happening
1	~4.5	1.6s/step	~35 hrs	Model: “What are words?”
2	~3.8	1.6s/step	~35 hrs	Model: “Oh, patterns exist!”
3	~3.4	1.6s/step	~35 hrs	Me: “Why is this so slow?”
4	~3.24	0.225s/step	~5 hrs	Me: “OH. That’s why.”
5	3.093	0.225s/step	~5 hrs	Finally making progress
6	~3.05	0.225s/step	~5 hrs	Loss go down
7	~3.03	0.225s/step	~5 hrs	Loss go down more
8	3.014	0.225s/step	~5 hrs	Currently cooking
9-10	TBD	0.225s/step	~5 hrs each	The home stretch

Notice the jump from epoch 3 to 4. That’s the 7x optimization kicking in. Epochs 1-3 ate ~105 hours; at the optimized speed they would have taken ~15. I literally wasted about 90 hours being dumb.
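For anyone who wants to check that jump, here is the arithmetic as a quick sketch. The step times and the ~35-hour epoch come from the table above. The steps-per-epoch figure is back-calculated from those two numbers rather than read from a log, so treat it as an estimate.

```python
# Step times and the slow-epoch runtime are taken from the table above.
SLOW_STEP_S = 1.6
FAST_STEP_S = 0.225
SLOW_EPOCH_HOURS = 35

# Assumption: every epoch has the same number of steps, so the step count
# can be back-calculated from the slow epochs.
steps_per_epoch = SLOW_EPOCH_HOURS * 3600 / SLOW_STEP_S

speedup = SLOW_STEP_S / FAST_STEP_S
fast_epoch_hours = steps_per_epoch * FAST_STEP_S / 3600

# Three slow epochs versus what three optimized epochs would have cost.
hours_spent = 3 * SLOW_EPOCH_HOURS
hours_needed = 3 * fast_epoch_hours

print(f"steps per epoch: ~{steps_per_epoch:,.0f}")
print(f"speedup: {speedup:.1f}x")
print(f"optimized epoch: {fast_epoch_hours:.1f} hours")
print(f"hours lost in epochs 1-3: ~{hours_spent - hours_needed:.0f}")
```

That works out to roughly a 7.1x speedup and a bit under 5 hours per optimized epoch, which lines up with the ~5 hrs in the table.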

What The Model Actually Says Now

Here’s the real test. Same prompts, different epochs. Watch it get less stupid:

Prompt: “What is the capital of France?”

Epoch	Output	Verdict
1	`What is the capital of France? France France France France`	Brain damage
4	`What is the capital of France? The capital of France is a city.`	Technically true?
8	`What is the capital of France? The capital of France is Paris.`	We did it 🎉

Prompt: “Explain machine learning in simple terms.”

Epoch	Output
1	`Explain machine learning in simple terms. terms terms terms`
4	`Explain machine learning in simple terms. Machine learning is a type of learning.`
8	`Explain machine learning in simple terms. Machine learning is a way for computers to learn from data without being explicitly programmed.`

Prompt: “Write a poem about coffee.”

Epoch	Output
1	`Write a poem about coffee. coffee coffee coffee coffee`
4	`Write a poem about coffee. Coffee is a drink. The drink is good.`
8	`Write a poem about coffee. In the morning light, I wake / With a cup of coffee to take / It warms my soul and clears my mind`

Not Shakespeare, but not keyboard spam either. Progress.

Try It Yourself

Don’t take my word for it. Here’s the actual model — pick a checkpoint and see how it generates:

Three models to compare: Tiny Shakespeare (3.2M), Medium (3.3M), and GPT-2 Small (134M). The difference is… noticeable.

The Two Sessions: A Tale of Suffering and Enlightenment

Session 1 (Epochs 1-3): The Dark Ages

~105 hours of my life I’m not getting back
Memory-mapped like an amateur
1.6s per step on an A100. AN A100.
The GPU was probably playing solitaire

Session 2 (Epochs 4-10): I Learned Things

RAM preload + vectorization + torch.compile + AMP
0.225s/step like a civilized person
~35 hours for 7 epochs
GPU finally working for its electricity

The Financial Damage 💸

Item	Compute Units
A100 burn rate	~8 units/hour
Epochs 1-3 (suffering)	~840 units
Epochs 4-10 (enlightened)	~280 units
Total	~1,120 units

Let me do some fun math:

If I had optimized from the start: ~400 units
What I actually used: ~1,120 units
Units wasted on stupidity: ~720 units (64% of total)

Cool. Cool cool cool.
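The fun math above, spelled out as a small sketch. The burn rate and the per-epoch hours are the approximate figures from the tables in this post; the variable names are just for illustration.

```python
UNITS_PER_HOUR = 8      # approximate A100 burn rate from the table above

slow_hours = 3 * 35     # epochs 1-3 at ~35 hrs each
fast_hours = 7 * 5      # epochs 4-10 at ~5 hrs each
ideal_hours = 10 * 5    # all ten epochs at the optimized speed

used = (slow_hours + fast_hours) * UNITS_PER_HOUR
ideal = ideal_hours * UNITS_PER_HOUR
wasted = used - ideal
wasted_pct = 100 * wasted / used

print(f"used:   ~{used} units")
print(f"ideal:  ~{ideal} units")
print(f"wasted: ~{wasted} units ({wasted_pct:.0f}% of total)")
```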

22 Lessons I Paid To Learn

Architecture & Design (Don’t Be Clever)

Start embarrassingly small. 400K params first. Your ego can wait.
Parameterize everything. Magic numbers are technical debt with compound interest.
Use .get() with defaults. Config files change. Your code shouldn’t explode.
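A minimal sketch of the `.get()` lesson, assuming the config is a plain dict loaded from JSON. The key names and default values here are made up for illustration and are not the ones in the repo.

```python
import json

# Hypothetical defaults; every tunable lives here instead of in the model code.
DEFAULTS = {"embed_dim": 128, "num_layers": 4, "seq_len": 256, "dropout": 0.1}


def load_config(raw: dict) -> dict:
    """Fill in any missing keys from DEFAULTS; .get() never raises KeyError."""
    return {key: raw.get(key, default) for key, default in DEFAULTS.items()}


# An older config file that predates the seq_len and dropout options.
old_config = json.loads('{"embed_dim": 64, "num_layers": 2}')
config = load_config(old_config)
print(config)
```

The old file still loads, and the keys it never knew about fall back to the defaults instead of blowing up three hours into a run.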

Data Pipeline (Prepare for the Worst)

Pre-tokenize to binary. Process text once. Not every epoch. Once.
BPE > Character tokenization. This isn’t even debatable.
Store metadata with data. Token counts, vocab size, dtype. Trust nothing.
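Here is one way the "pre-tokenize once, store metadata next to it" idea could look, using only the standard library. The file names, the uint16 choice, and the token ids are assumptions for the sketch; a real pipeline would more likely write the array with NumPy.

```python
import json
import tempfile
from array import array
from pathlib import Path


def save_tokens(token_ids, vocab_size, out_dir):
    """Write token ids as raw uint16 plus a JSON sidecar describing them."""
    if vocab_size > 2**16:
        raise ValueError("vocab does not fit in uint16")
    tokens = array("H", token_ids)  # "H" = unsigned 16-bit integers
    out_dir = Path(out_dir)
    with open(out_dir / "train.bin", "wb") as f:
        tokens.tofile(f)
    meta = {"num_tokens": len(tokens), "vocab_size": vocab_size, "dtype": "uint16"}
    (out_dir / "train.meta.json").write_text(json.dumps(meta))
    return meta


def load_tokens(out_dir):
    """Read the sidecar first, then read exactly as many tokens as it promises."""
    out_dir = Path(out_dir)
    meta = json.loads((out_dir / "train.meta.json").read_text())
    tokens = array("H")
    with open(out_dir / "train.bin", "rb") as f:
        tokens.fromfile(f, meta["num_tokens"])  # raises EOFError if the file is short
    return tokens, meta


with tempfile.TemporaryDirectory() as tmp:
    # Made-up ids standing in for real tokenizer output.
    save_tokens([15496, 995, 50256], vocab_size=50257, out_dir=tmp)
    tokens, meta = load_tokens(tmp)
    print(list(tokens), meta)
```

The sidecar is what makes "trust nothing" cheap: the loader can verify the token count and dtype before a mismatch turns into a cryptic shape error.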

Dataset Creation (The Boring Stuff)

Quality > Quantity. 10GB clean beats 20GB garbage.
Stream large files. Loading 12GB into RAM is for people who hate their computers.
Compress before uploading. 4GB uploads faster than 11GB. Revolutionary.
Use Rust-based tokenizers. Python tokenizers are cute. Rust tokenizers are fast.
Save intermediate files. Re-running 2-hour jobs is a special kind of pain.
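A rough sketch of the streaming and compression lessons together: read the raw text one line at a time, keep what passes a filter, and write the result straight into a gzip file. The length filter is only a placeholder for whatever "clean" means for your data, and the file names are hypothetical.

```python
import gzip
import tempfile
from pathlib import Path


def stream_clean_lines(path, min_chars=20):
    """Yield stripped lines one at a time so the whole file never sits in RAM."""
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            line = line.strip()
            if len(line) >= min_chars:  # placeholder quality filter
                yield line


def write_compressed(lines, out_path):
    """Write lines to a gzip file and return how many were kept."""
    count = 0
    with gzip.open(out_path, "wt", encoding="utf-8") as out:
        for line in lines:
            out.write(line + "\n")
            count += 1
    return count


with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "raw.txt"
    raw.write_text(
        "short\nthis line is long enough to keep\nso is this one, comfortably\n",
        encoding="utf-8",
    )
    packed = Path(tmp) / "clean.txt.gz"
    kept = write_compressed(stream_clean_lines(raw), packed)
    with gzip.open(packed, "rt", encoding="utf-8") as f:
        round_trip = f.read().splitlines()
    print(kept, round_trip)
```

Because the cleaned output lands on disk, it doubles as the saved intermediate file: the next stage can start from `clean.txt.gz` instead of re-running the filter.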

Training Optimization (Read This First)

Profile before optimizing. Find the actual bottleneck before changing random things.
RAM beats mmap for random access. If it fits, preload it.
Vectorize or suffer. Python loops in hot paths are crimes against GPUs.
torch.compile + AMP = mandatory. It’s 2026. There’s no excuse.
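"Profile before optimizing" can start as simply as timing each phase of the loop. The sketch below uses `time.sleep` as a stand-in for real data loading and a real training step, so the numbers are fake; only the pattern is the point.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

phase_seconds = defaultdict(float)


@contextmanager
def timed(phase):
    """Accumulate wall-clock time spent inside the with-block under `phase`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        phase_seconds[phase] += time.perf_counter() - start


def load_batch():
    time.sleep(0.02)   # stand-in for slow data loading


def train_step():
    time.sleep(0.005)  # stand-in for forward/backward/optimizer


for _ in range(5):
    with timed("data"):
        load_batch()
    with timed("compute"):
        train_step()

slowest = max(phase_seconds, key=phase_seconds.get)
for phase, seconds in phase_seconds.items():
    print(f"{phase:8s} {seconds:.3f}s")
print(f"bottleneck: {slowest}")
```

One caveat for real GPU code: CUDA kernels run asynchronously, so call `torch.cuda.synchronize()` before reading the clock, or the compute phase will look faster than it is.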

Colab Survival Guide

Checkpoints go to Drive. Colab WILL disconnect. Plan accordingly.
OOM = restart runtime. empty_cache() is a polite suggestion.
Budget compute units. A100 eats 8 units/hour. Track it.
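A minimal checkpointing sketch for the "Colab WILL disconnect" problem. It writes to a temporary file first and then renames it, so a disconnect mid-write does not leave a half-written checkpoint under the final name. The JSON state, the file naming, and the loss values are illustrative assumptions; with PyTorch you would call `torch.save` instead, and the directory would be a folder on your mounted Drive.

```python
import json
import os
import tempfile
from pathlib import Path


def save_checkpoint(state, ckpt_dir, step):
    """Write the checkpoint to a temp file, then rename it into place."""
    ckpt_dir = Path(ckpt_dir)
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    final_path = ckpt_dir / f"step_{step:08d}.json"  # zero-padded so names sort by step
    tmp_path = final_path.with_suffix(".tmp")
    tmp_path.write_text(json.dumps(state))
    os.replace(tmp_path, final_path)  # atomic on a regular local filesystem
    return final_path


def latest_checkpoint(ckpt_dir):
    """Return the state from the highest-numbered checkpoint, or None."""
    paths = sorted(Path(ckpt_dir).glob("step_*.json"))
    return json.loads(paths[-1].read_text()) if paths else None


with tempfile.TemporaryDirectory() as tmp:
    # Placeholder states; a real one would hold model and optimizer weights.
    save_checkpoint({"step": 100, "loss": 9.9}, tmp, step=100)
    save_checkpoint({"step": 200, "loss": 8.8}, tmp, step=200)
    resumed = latest_checkpoint(tmp)
    leftovers = list(Path(tmp).glob("*.tmp"))
    print(resumed, leftovers)
```

On restart, `latest_checkpoint` gives you the step to resume from. Whether a mounted Drive folder honors the same atomic-rename guarantee is worth testing rather than assuming.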

Debugging (Check These First)

Check dtypes. Then check again. Then .long() anyway.
Configs must match exactly. SEQ_LEN, vocab_size, embed_dim — one mismatch = cryptic error.
Verify files exist. Before writing code that loads them. Novel concept.
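The last two checks are easy to automate as a preflight step that runs before anything touches the GPU. This is a sketch under assumptions: the lowercase key names and the file path are hypothetical, and the list of keys that must match should be whatever your model actually depends on.

```python
from pathlib import Path

# Hypothetical key names for the settings that must agree between runs.
MUST_MATCH = ("seq_len", "vocab_size", "embed_dim")


def preflight(train_cfg, ckpt_cfg, required_files):
    """Return a list of human-readable problems; an empty list means go."""
    problems = []
    for key in MUST_MATCH:
        if train_cfg.get(key) != ckpt_cfg.get(key):
            problems.append(
                f"{key}: training has {train_cfg.get(key)!r}, "
                f"checkpoint has {ckpt_cfg.get(key)!r}"
            )
    for path in required_files:
        if not Path(path).is_file():
            problems.append(f"missing file: {path}")
    return problems


train_cfg = {"seq_len": 512, "vocab_size": 50257, "embed_dim": 768}
ckpt_cfg = {"seq_len": 256, "vocab_size": 50257, "embed_dim": 768}
problems = preflight(train_cfg, ckpt_cfg, ["definitely/not/here/train.bin"])

for problem in problems:
    print("PREFLIGHT:", problem)
```

Two clear lines of output here replace the cryptic shape error you would otherwise hit several minutes into loading.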

Documentation (Future You Is Dumb)

Document as you go. That “obvious” fix you just did? You’ll forget it by next week. Write it down.

What’s Next (If I Haven’t Given Up)

Finish epoch 10 — Almost there. Maybe.
Test generation quality — Does it produce coherent text? We’ll see.
Consider GPT-2 Medium — 355M params. Because I’m a glutton for punishment.
Fine-tuning experiments — Make it useful for something specific.
~~Release on GitHub/HuggingFace~~ — Done! github.com/GPUburnout/gpt2-from-scratch

Final Thoughts

Was this worth it? Honestly? Yes.

Not because I built something amazing — GPT-2 Small is table stakes in 2026. But because I finally understand the stack. The papers make sense now. When something breaks, I know where to look. When someone says “just use torch.compile,” I know what it’s doing under the hood.

The gap between “I’ve read the attention paper” and “I’ve debugged tensor shapes at 2 AM” is massive. This project closed that gap.

My advice: Start smaller than you think you need to. 400K params catches bugs before they’re expensive. Scale up only when the small model works perfectly.

And for the love of GPUs, optimize before you start training. Not on epoch 4. I learned that one the expensive way.

Questions? Check out the code at github.com/GPUburnout/gpt2-from-scratch. I’ve made every mistake so you don’t have to.
