10 Things I Learned Training a 1B Parameter Model That Nobody Talks About
The stuff that doesn’t make it into papers

Research papers tell you about architectures, loss functions, and scaling laws. They do not tell you that the cheapest GPU per hour is almost never the cheapest GPU per token, that your biggest optimization is probably a boolean you forgot to flip, or that every single crash you’ll experience will be infrastructure, never training code. They especially don’t tell you that the five-second decision you make on day one about which datacenter region to pick will haunt you for the entire project. ...