HellaSwag

Time to face the music

Training a language model is the fun part. You watch the loss drop, you generate text samples that are slightly less incoherent than yesterday’s, you tell yourself “look, it almost knows what France is.” It’s addictive. It’s rewarding. It also tells you absolutely nothing about how good your model actually is. Benchmarking is where the universe hands you a report card you didn’t ask for. ...