Phase 2: Building Infrastructure I’ll Actually Need Later

Here’s the thing about Phase 1: it was too easy. Everything fit in RAM. Training took 15 minutes. The model was small enough to run on a microwave.

Phase 2 was about building infrastructure for the 12GB monster coming in Phase 3. Because debugging memory issues is way more fun with 250MB than with 12GB. (It’s not fun either way, but at least it’s faster.)

The Setup

| Attribute | Value |
| --- | --- |
| Dataset | ~250MB text — still baby size |
| Tokenization | Character-level (int16) |
| Model | 4-6 layers, 256-384 dim — teenager transformer |
| Parameters | ~10-50M — finally respectable |
| Training | Google Colab T4/V100 — the free tier hustle |
| Sequence Length | 512 — because why not |

The Numbers

| Metric | Value |
| --- | --- |
| Training Time | ~2-4 hours (coffee break compatible) |
| Epochs | 10-15 |
| Speed | ~0.3-0.5s/step |

What The Phase 2 Model Thought “Intelligence” Was

I asked it some questions. It tried its best.

| Prompt | Model Output | My Reaction |
| --- | --- | --- |
| Where is the capital of USA? | The capital of USA is USA. | Technically… no. |
| What is 2 + 2? | 2 + 2 is a number. | I mean, you’re not wrong. |
| Who is the president? | The president is the president of the United States. | Circular logic achievement unlocked. |
| Tell me a joke. | A man walks into a bar and the bar and the bar and | It’s avant-garde comedy. |

The model had mastered the art of saying nothing with great confidence.

Why Memory-Mapped Files?

“But the 250MB dataset fits in RAM!” you say.

Yes. But 12GB doesn’t. And I’d rather debug memory-mapping code with a small dataset than stare at OOM errors for hours with a big one.

```
250MB text → ~500MB as int16 → fits in Colab's 12GB RAM ✓
12GB text  → ~11GB as int32  → LOL no ✗
```

Future me thanked past me for this one.

What I Actually Built

1. Pre-tokenization Pipeline

Because processing text every epoch is for people who enjoy watching progress bars.
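
Here’s the shape of that step (a sketch, not the real tokenize_local_bpe.py: the function name, chunk size, and file handling are illustrative, and it assumes a tokenizer with an `encode()` method that returns token ids):

```python
import numpy as np

def pretokenize(text_path, bin_path, tokenizer, dtype=np.uint16, chunk_chars=10_000_000):
    """Encode the corpus once and append raw token ids to a flat binary file."""
    with open(text_path, "r", encoding="utf-8") as src, open(bin_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_chars)              # read a manageable slice of text
            if not chunk:
                break
            ids = tokenizer.encode(chunk)              # list of ints
            np.asarray(ids, dtype=dtype).tofile(dst)   # no headers, no padding, just ids
            # A real pipeline would split on document boundaries so a chunk
            # never cuts a token in half; skipped here for brevity.
```

Run it once, keep the .bin file, and training never has to look at raw text again.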

2. Memory-Mapped Loading

numpy.memmap is basically magic. Your 11GB file pretends to be a numpy array, but it’s actually reading from disk. The OS handles caching. It’s like having infinite RAM with extra steps.
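
The loading side is only a few lines. A sketch of the idea (the file name, dtype, and random-window sampling are assumptions, not a quote from train_colab_mmap.py):

```python
import numpy as np
import torch

# The file is mapped, not loaded: the OS pages in only the slices we index.
tokens = np.memmap("train_tokens.bin", dtype=np.uint16, mode="r")

def get_batch(batch_size=64, seq_len=512, device="cuda"):
    """Sample random contiguous windows; only those windows ever hit RAM."""
    starts = np.random.randint(0, len(tokens) - seq_len - 1, size=batch_size)
    x = np.stack([tokens[s : s + seq_len] for s in starts]).astype(np.int64)
    y = np.stack([tokens[s + 1 : s + 1 + seq_len] for s in starts]).astype(np.int64)
    return torch.from_numpy(x).to(device), torch.from_numpy(y).to(device)
```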

3. Colab Survival Kit

  • Checkpoints to Google Drive (because Colab WILL disconnect)
  • Session timeout handling (because Colab WILL disconnect)
  • Did I mention Colab disconnects?
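
The whole kit boils down to one pattern: mount Drive, save there, resume from there. A sketch (the paths are placeholders; it assumes a Colab runtime where google.colab is available):

```python
import os
import torch
from google.colab import drive

drive.mount("/content/drive")                        # Drive survives runtime resets; the VM doesn't
CKPT_PATH = "/content/drive/MyDrive/gpt2/ckpt.pt"    # placeholder path

def save_checkpoint(model, optimizer, step):
    torch.save({"model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "step": step}, CKPT_PATH)

def load_checkpoint(model, optimizer):
    """Pick up where the last disconnect left off, if there's anything to pick up."""
    if not os.path.exists(CKPT_PATH):
        return 0
    ckpt = torch.load(CKPT_PATH, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["step"]
```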

Phase 3: The Real Deal 💪

Time to stop playing around. GPT-2 Small. 134 million parameters. On a dataset that makes my laptop cry.

The Configuration

| Attribute | Value |
| --- | --- |
| Dataset | 12GB of ChatGPT-style conversations |
| Tokenization | BPE 32K vocab — like a grown-up |
| Total Tokens | 2.8 billion — with a B |
| Model | GPT-2 Small (12 layers, 768 dim, 12 heads) |
| Parameters | 134 million |
| Training | Colab A100 (40GB) — the big guns |
| Batch Size | 64 |
| Learning Rate | 3e-4 |
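
As a config, that table is just a handful of numbers (key names here are illustrative, not necessarily what model.py calls them):

```python
# Phase 3 configuration, mirroring the table above. Key names are illustrative.
config = {
    "n_layers": 12,
    "n_heads": 12,
    "d_model": 768,
    "vocab_size": 32_000,
    "context_length": 512,
    "batch_size": 64,
    "learning_rate": 3e-4,
}
```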

The Glow Up

| Parameter | Phase 1 (Toy) | Phase 2 (Meh) | Phase 3 (Finally) |
| --- | --- | --- | --- |
| Layers | 2 | 4-6 | 12 |
| Heads | 4 | 4-8 | 12 |
| Embed Dim | 128 | 256-384 | 768 |
| Context | 256 | 512 | 512 |
| Parameters | ~400K | ~10-50M | 134M |

Look at that growth. They grow up so fast.

Runtime Reality Check

| Metric | Value |
| --- | --- |
| Steps/Epoch | ~79,507 (yes, really) |
| Tokens/Step | 32,768 (batch size 64 × 512 tokens) |
| Speed (after optimization) | ~0.225s/step |
| Time/Epoch | ~5 hours |
| Total Epochs | 10 |
| Total Training Time | ~50 hours (optimized) |
| Colab Compute Units | ~400 (at 8 units/hr for A100) |

50 hours of GPU time. On the “cheap” tier. ML is an expensive hobby.
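
If you want to check the math, the table’s numbers multiply out cleanly:

```python
steps_per_epoch = 79_507
sec_per_step = 0.225
epochs = 10
units_per_hour = 8                                       # A100 rate from the table

hours_per_epoch = steps_per_epoch * sec_per_step / 3600  # ~4.97 hours
total_hours = hours_per_epoch * epochs                   # ~49.7 hours
compute_units = total_hours * units_per_hour             # ~398 units
```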


The Files That Made This Possible

All code available at github.com/GPUburnout/gpt2-from-scratch

| File | What It Does |
| --- | --- |
| model.py | The transformer itself (fully parameterized, because hardcoding is for cowards) |
| tokenizer.py | Character-level tokenizer (Phase 1-2) |
| tokenizer_bpe.py | BPE tokenizer (Phase 3, the good stuff) |
| tokenize_local_bpe.py | Converts text to binary with BPE |
| train_colab_mmap.py | The training script that keeps me up at night |
| generate.py | Makes the model spit out text |

Lessons From Scaling 335x

(From 400K params to 134M params. I did the math so you don’t have to.)

  1. Build for scale before you need it. The infrastructure I built for 250MB worked perfectly for 12GB. Zero changes. That’s the dream.

  2. Pre-tokenize everything. Processing text during training is like doing your taxes during a job interview. Just… don’t.

  3. Parameterize obsessively. Every magic number should be a config value. Future you will forget why you used 768 instead of 512.

  4. Use .get() with defaults. Config files evolve. Your code shouldn’t crash because you added a new field. (There’s a sketch of this after the list.)

  5. Colab will disconnect. Save checkpoints like your career depends on it. Because your training run does.
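
Lesson 4 in code, as promised above (keys and defaults are illustrative, not lifted from the repo):

```python
# An older config file that predates some newer fields.
config = {"n_layers": 12, "n_heads": 12, "d_model": 768}

# .get() with defaults: newer fields fall back gracefully instead of raising KeyError.
n_layers = config.get("n_layers", 12)
dropout = config.get("dropout", 0.1)                  # added later; old configs default to 0.1
tie_embeddings = config.get("tie_embeddings", False)

print(n_layers, dropout, tie_embeddings)              # 12 0.1 False
```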