Scaling Up: From Tiny Model to GPT-2 Small
How I went from ‘cute toy model’ to ‘134 million parameters that need an A100 to breathe.’
Because apparently using someone else’s model was too easy. Here’s how I tortured myself by training GPT from scratch.