Season 2 · Ch. 5

I Spent Another $68 Because a Spreadsheet Wouldn't Stop Staring at Me

The question that wouldn’t leave me alone

S2-03 ended with a question I couldn’t stop thinking about: what would more training buy? GPUburnout-1B had trained on 11.8 billion tokens, just 59% of Chinchilla-optimal for a 1B model. The data was sitting there. Chinchilla’s rule of thumb is roughly twenty tokens per parameter, which works out to 20 billion tokens for a billion parameters: the point where a fixed compute budget is balanced between model size and training data. I was 41% short of that line. ...
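The arithmetic behind those percentages fits in a few lines. A minimal sketch, assuming the standard ~20 tokens-per-parameter Chinchilla heuristic (the function name and constants here are mine, not from the post):

```python
# Chinchilla heuristic: compute-optimal training uses ~20 tokens per parameter.
TOKENS_PER_PARAM = 20

def chinchilla_optimal_tokens(n_params: float) -> float:
    """Roughly compute-optimal token count for a model with n_params parameters."""
    return TOKENS_PER_PARAM * n_params

params = 1e9                                 # GPUburnout-1B
trained = 11.8e9                             # tokens actually trained on
optimal = chinchilla_optimal_tokens(params)  # 20e9 for a 1B model

pct_of_optimal = 100 * trained / optimal     # 59.0
shortfall = 100 - pct_of_optimal             # 41.0
print(f"{pct_of_optimal:.0f}% of optimal, {shortfall:.0f}% short")
```

Running this reproduces the 59% / 41% split quoted above.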

March 15, 2026 · 9 min · Jun Park
GPUburnout