Season 4 · Ch. 3

7 Out of 8 - How DPO Finally Worked

Season 3: four DPO configurations on the 1B. Best: 4/8 clean. Worst: 7/8 garbage. More training literally made the model dumber. I had receipts. Season 4: same technique, similar hyperparameters, on the 2B. Result: 7/8 clean. First try. No suffering necessary. Same method. Different foundation. Completely different outcome. That’s the entire moral of Season 4 in one A/B test. I could end the post here. I won’t, because the details are too good to skip. ...
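For readers who want the one-line version of what "same technique" means here: DPO optimizes a preference margin between chosen and rejected responses, scaled against a frozen reference model. A minimal sketch of the per-example loss (pure-Python, no framework; the log-prob inputs and `beta` value are illustrative assumptions, not the post's actual training code):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-example DPO loss from summed log-probs of the chosen and
    rejected responses under the policy (pi_*) and the frozen
    reference model (ref_*). beta scales the implicit reward."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # -log(sigmoid(margin)), written stably as log1p(exp(-margin))
    return math.log1p(math.exp(-margin))

# When the policy prefers the chosen response more than the reference
# does, the margin is positive and the loss drops below log(2):
print(dpo_loss(-10.0, -14.0, -12.0, -13.0))  # ~0.554 < log(2) ~0.693
```

The hyperparameters and base weights are the only things that changed between seasons; the loss above is identical in both runs.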

March 29, 2026 · 5 min · Jun Park
Season 3 · Ch. 3

Nine Experiments, Nine Funerals

I had a diagnosis. Garbage tokens, pretraining contamination, baked into the base weights, unreachable by fine-tuning. Open and shut. Case closed. Except science doesn’t accept “trust me bro” as evidence. The only way to prove the diagnosis was to try fixing it the wrong way and watch it not work. Repeatedly. With increasing desperation. Nine experiments. Zero fixes. One scoreboard. Here we go. SFT: Five Attempts, Five Failures I built a cleaning pipeline, removed 27% of SlimOrca (139K examples), verified zero garbage tokens in the cleaned set, and ran five experiments: ...
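The cleaning pipeline mentioned above boils down to a token-level filter: drop any example whose text contains a blocklisted garbage token. A hypothetical sketch of that kind of filter; the `GARBAGE_TOKENS` blocklist and the record shape are assumptions for illustration, not the post's actual pipeline code:

```python
# Assumed blocklist of garbage tokens (replacement char, stray EOS marker).
GARBAGE_TOKENS = {"\ufffd", "<|endoftext|>"}

def is_clean(example: dict) -> bool:
    """True if neither the prompt nor the response contains a garbage token."""
    text = example.get("prompt", "") + example.get("response", "")
    return not any(tok in text for tok in GARBAGE_TOKENS)

# Toy dataset standing in for SlimOrca records.
dataset = [
    {"prompt": "What is 2+2?", "response": "4"},
    {"prompt": "Summarize:", "response": "bad \ufffd output"},
]
cleaned = [ex for ex in dataset if is_clean(ex)]
print(len(cleaned))  # 1 of the 2 toy examples survives the filter
```

Verifying "zero garbage tokens in the cleaned set" is then just asserting `all(is_clean(ex) for ex in cleaned)` over the full 139K-example result.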

March 21, 2026 · 4 min · Jun Park
GPUburnout
Will Code for Tokens