7 Out of 8 - How DPO Finally Worked
Season 3: four DPO configurations on the 1B. Best: 4/8 clean. Worst: 7/8 garbage. More training literally made the model dumber. I had receipts.

Season 4: same technique, similar hyperparameters, on the 2B. Result: 7/8 clean. First try. No suffering necessary.

Same method. Different foundation. Completely different outcome. That’s the entire moral of Season 4 in one A/B test. I could end the post here. I won’t, because the details are too good to skip. ...