Verbatim: The Proof Is in the Output
Benchmarks say the 1B and 2B are basically the same model. The outputs say otherwise. Here are the receipts - same 8 prompts, same temperature (0.7), same top-p (0.9), same max tokens (200). 1B-160K-Chat vs 2B-75K-Chat-DPO, head to head. Why the 1B’s Chat model and not its DPO version? Because DPO made the 1B worse - the best DPO run scored 4/8 garbage, worse than the Chat baseline. The Chat model is the 1B at its best. This is as fair as it gets. ...