Training Optimizations Deep Dive: How I Made the A100 Actually Work
The complete technical reference for achieving a 16x speedup. Every optimization explained with code and diagrams.
Also from this series:

Final loss curves, the damage to my compute budget, and 22 lessons I paid dearly to learn.
A comprehensive guide to every way I shot myself in the foot training GPT-2 Small. Learn from my pain.
How I went from ‘cute toy model’ to ‘134 million parameters that need an A100 to breathe.’
How I built a 12GB ChatGPT-style conversational dataset and implemented BPE tokenization for efficient training.
Because apparently using someone else’s model was too easy. Here’s how I tortured myself by training GPT from scratch.