Season 4 · Ch. 4

Verbatim: The Proof Is in the Output

Benchmarks say the 1B and 2B are basically the same model. The outputs say otherwise. Here are the receipts - same 8 prompts, same temperature (0.7), same top-p (0.9), same max tokens (200). 1B-160K-Chat vs 2B-75K-Chat-DPO, head to head. Why the 1B’s Chat model and not its DPO version? Because DPO made the 1B worse - the best DPO run scored 4/8 garbage, worse than the Chat baseline. The Chat model is the 1B at its best. This is as fair as it gets. ...

March 30, 2026 · 11 min · Jun Park
GPUburnout
GPUburnout
Will Code for Tokens
S1 GPT-2 134M
S2 Llama 1B
S3 1B SFT
S4 Llama 2B
S5 Llama 3B