I drafted this chapter on May 14, the morning after the 3B benchmarks landed. The opening was a banger about “the inflection point at 3B parameters.” The conclusion explained how alignment tax shrinks with scale and flips positive between 2B and 3B. There was a chart. The chart had a smooth curve. I was very pleased with the chart.
For two weeks I was going to publish that chapter. I told three people. I rehearsed a tweet thread. I picked which checkpoint to put in the OG image.
Then I read twelve lines of Python in my own notebook.
The lines were a data formatting function I wrote in April. They used plain string concatenation. The evaluation code used chat templates. My 3B was trained in one format and tested in another. The +2.04 IFEval gain was not a scaling breakthrough. It was train/test contamination’s weird cousin. The bug had been sitting in my notebook for two months. In three different notebooks actually. All containing the same broken function, copy-pasted between them. Read by exactly one human ever. Me.
The first draft of this chapter is in a folder called do_not_open.
I am the only person who could have found this bug. It took me two weeks of staring at the broken benchmark to think “maybe I should look at the code.”
The thing that bothered me
The thing that bothered me was extra-curricular math.
I had the result. I had the chart. I had the chapter. What I also had was the previous three data points sitting next to the new one. They went like this:
| |
Quick translation. IFEval is a benchmark that tests strict instruction following. “Respond in exactly three sentences.” “Include the word banana.” “Use only lowercase.” That kind of thing. When you SFT a base model, IFEval usually goes down because the chatty assistant style you just installed overrides literal compliance. The drop is called the alignment tax. The four numbers above are my four IFEval deltas. Three negatives (tax paid) and one positive (??).
If you stare at the first three numbers long enough you can draw a line through them. The line goes from -6.35 to -4.91 over roughly two scale doublings. Slope of about 0.7 per scale step. Extrapolate to 3B and you get something like -4.
The actual number was +2.04. Six points off the line. In the wrong direction.
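The eyeball extrapolation can be written out. This is a minimal sketch, assuming the trend is linear in scale steps and using only the two endpoint deltas quoted above. The function name is made up for illustration.

```python
def extrapolate_next(first: float, last: float, steps_between: int) -> tuple[float, float]:
    """Return (slope per step, predicted value one step past `last`)."""
    slope = (last - first) / steps_between
    return slope, last + slope

# Endpoints of the three-point trend, two scale steps apart.
slope, predicted_3b = extrapolate_next(first=-6.35, last=-4.91, steps_between=2)
observed_3b = 2.04
gap = observed_3b - predicted_3b

print(f"slope per step: {slope:.2f}")
print(f"predicted 3B delta: {predicted_3b:.2f}")
print(f"observed minus predicted: {gap:+.2f}")
```

The prediction lands near -4.2, so the observed +2.04 sits a little over six points above the line.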
I noticed this on May 15. I noticed it again on May 16. I noticed it twice on May 17. Each time I told myself it was a phase transition. Maybe 3B is where the alignment tax stops being a thing. Maybe my model figured something out at scale that the smaller ones couldn’t. Maybe I had discovered an emergent property.
I had a working title. “The 3B Inflection Point: Emergent Alignment Robustness at Scale.” I tried “Beyond the Alignment Tax: A Phase Transition in SFT Dynamics.” I tried “Free Lunch at 3B.” That last one had pizzazz.
I started thinking about which workshop. Maybe one of the NeurIPS satellites. The poster session would have my chart on it. People would walk by and stop. I would say something modest like “yeah, we were surprised too.” I rehearsed the part where someone asks “but did you control for prompt format” and I say “great question” and explain why it doesn’t matter.
I dreamed about this. Twice. I picked a bibtex key. It was park2026emergent.
This is what scientists call “motivated reasoning.” Regular people call it “hoping.”
A linear trend through three data points is a vibe, not a finding. A single data point that misses the vibe by six points in the wrong direction is its own kind of vibe. It is the vibe of “your experiment has a problem and you do not want to look at it.”
So I decided to do a controlled ablation before publishing. Just to rule out the boring explanation. Just to be sure. Just because the boring explanation is almost always right and I had seen this movie before.
I did not do the controlled ablation for another twelve days.
Twelve days later
I opened the notebook.
It was called GPUburnout_3B_SFT_Experiment.ipynb. I had created it in April. I had run it on Colab. I had merged the LoRA adapter into the base. I had benchmarked the result. I had drafted a chapter about the result. I had not looked at the notebook since the day I last hit “Run All.”
I scrolled to the data preparation cell. It was twelve lines. Here they are:
| |
That last line joined the system and user messages with newlines. Just plain text. No chat template, no special tokens, no <|im_start|>, no <|im_end|>. The model was being trained on raw concatenated strings.
Then I opened my 2B notebook. Same scaffold, same role mapping, same dataset. But the formatting line said this:
| |
The 2B was trained with chat templates wrapping every turn. The 3B was not. One notebook had been copy-pasted from the other, then the two had drifted apart, and nobody had noticed because nobody else was reading either of them.
I scrolled down to the inference cell of the 3B notebook. The cell that runs the model after training and shows you sample outputs. The cell I had stared at for weeks confirming the model could speak.
| |
The inference cell used the chat template. The training cell did not. The model was being fine-tuned on one format and tested on a completely different format. Every benchmark I had run, every IFEval delta I had computed, every paper title I had rehearsed, was the result of a system trained to predict text in Format A while being asked questions in Format B.
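To make the mismatch concrete, here is a minimal sketch of the two formats side by side. It is not the notebook code: both helper functions are hypothetical, and the ChatML-style layout is an assumption based on the <|im_start|> and <|im_end|> tokens named above. In a real run the tokenizer’s own apply_chat_template would produce the templated string.

```python
def format_plain(messages: list[dict]) -> str:
    """Plain concatenation: message contents joined by newlines, no role markers."""
    return "\n".join(m["content"] for m in messages)

def format_chatml(messages: list[dict], add_generation_prompt: bool = False) -> str:
    """Hand-rolled ChatML-style template (assumed layout, for illustration only)."""
    text = "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )
    if add_generation_prompt:
        # Open an assistant turn for the model to continue.
        text += "<|im_start|>assistant\n"
    return text

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Respond in exactly three sentences."},
]

train_text = format_plain(messages)        # Format A: what the buggy training cell produced
eval_text = format_chatml(messages, True)  # Format B: what the inference cell asked for

print(repr(train_text))
print(repr(eval_text))
print("special tokens seen in training:", "<|im_end|>" in train_text)
```

The two strings carry the same words, but only one of them contains the tokens the model is later expected to recognize and emit.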
I sat with this for a moment.
I had built three different 3B SFT notebooks. All three had the same broken formatter. I had run each of them. I had merged each adapter. I had benchmarked at least one of them in full. I had drafted a chapter. I had told three friends. I had picked an OG image.
The bug was twelve lines.
What I should have done in April
The fix is a recipe parity ablation. Two SFT runs from the same 3B base. Same dataset, same LoRA shape, same scheduler. Between the two runs, only the learning rate varies. Against the shipped 3B, the prompt format changes too.
Run A uses the lr=2e-4 from my 1B and 2B recipes. Run B uses the lr=5e-5 from the original buggy 3B. Both use apply_chat_template.
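One way to keep an ablation like this honest is to write the recipes down as data and check what actually differs. This is a sketch under assumptions: the Recipe class, its field names, and the "3B-base" label are hypothetical, and only the values stated in this chapter are filled in.

```python
from dataclasses import dataclass, asdict, replace

@dataclass(frozen=True)
class Recipe:
    base_model: str          # placeholder label, not a real checkpoint name
    learning_rate: float
    use_chat_template: bool

def differing_fields(a: Recipe, b: Recipe) -> set[str]:
    """Names of the fields whose values differ between two recipes."""
    da, db = asdict(a), asdict(b)
    return {k for k in da if da[k] != db[k]}

run_a = Recipe(base_model="3B-base", learning_rate=2e-4, use_chat_template=True)
run_b = replace(run_a, learning_rate=5e-5)
shipped = replace(run_b, use_chat_template=False)  # the original buggy 3B

print(differing_fields(run_a, run_b))    # learning rate only
print(differing_fields(run_b, shipped))  # prompt format only
```

Run B against the shipped model isolates the format fix. Run A against Run B isolates the learning rate.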
Before the SFT loop I added an assert to make sure the formatter works.
| |
This is what they call defensive programming. I call it: I am never trusting Past Me again.
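A guard in that spirit might look like the sketch below. It assumes a ChatML-style template, and the function name and the specific checks are illustrative rather than the notebook’s actual assert.

```python
def assert_chat_formatted(text: str, start: str = "<|im_start|>", end: str = "<|im_end|>") -> None:
    """Fail fast if a training string lacks balanced chat-template markers."""
    n_start, n_end = text.count(start), text.count(end)
    assert n_start > 0, "no chat-template start token: formatter is emitting plain text"
    assert n_start == n_end, f"unbalanced turns: {n_start} starts vs {n_end} ends"

good = "<|im_start|>user\nUse only lowercase.<|im_end|>\n<|im_start|>assistant\nok<|im_end|>\n"
bad = "Use only lowercase.\nok"

assert_chat_formatted(good)  # passes silently
try:
    assert_chat_formatted(bad)
except AssertionError as err:
    print("caught:", err)
```

Running a check like this on the first few formatted examples, before any GPU time is spent, is enough to catch a plain-concat formatter.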
The whole experiment ran in five hours on a Colab A100. Total cost was about two dollars and fifty cents. The bug I had been ignoring for two weeks was solved for less than a sandwich.
Then I waited five hours.
The boring answer
Run A came back at -3.36. Run B came back at -3.24.
Both within 0.8 of the eyeball prediction. The line I had drawn through 1B-90K, 1B-160K, and 2B-75K extended cleanly into 3B. The alignment tax did not flip. It just got smaller, the way it had been doing the whole time.
Updated curve:
| |
A nice straight line. Slope still about 0.7 per scale step. Extend it and the tax hits zero somewhere around 5B or 6B. That is actually a more useful finding than “phase transition at 3B.” It tells you how big you need to go before you stop paying. The boring version of the story has a number in it.
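The zero crossing can be roughed out the same way. This sketch assumes the tax is linear in parameter count between the 2B point (-4.91) and the average of the two corrected 3B runs. That choice of x-axis is an assumption for illustration, and two points make a guess rather than a fit.

```python
def zero_crossing(x0: float, y0: float, x1: float, y1: float) -> float:
    """x at which the straight line through (x0, y0) and (x1, y1) reaches y = 0."""
    slope = (y1 - y0) / (x1 - x0)
    return x1 - y1 / slope

tax_2b = -4.91
tax_3b = (-3.36 + -3.24) / 2  # mean of Run A and Run B

# x-axis in billions of parameters (assumed).
crossing = zero_crossing(2.0, tax_2b, 3.0, tax_3b)
print(f"3B mean delta: {tax_3b:.2f}")
print(f"estimated zero crossing: {crossing:.1f}B parameters")
```

Under these assumptions the line reaches zero at roughly 5B, consistent with the rough range above.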
It is also the version where I do not get invited to give a talk.
There was no emergent property, which meant… no NeurIPS workshop. The chart with the smooth curve was right.
I deleted park2026emergent from my bibtex file…
Eight points behind
While I was looking at the corrected IFEval numbers, I scrolled across the row.
The shipped 3B-Chat scored 49.83 on ARC-Easy. The shipped 2B-Chat scored 58.12.
The 2B was beating the 3B by 8.29 points on one of the most important reasoning benchmarks. I had paid $425 to pretrain a 3B model and then quietly shipped a version of it that scored 8 points worse than my smaller, cheaper model.
This explained something I had been ignoring since April. Every time I qualitatively tested the 3B-Chat, it felt about the same as the 2B-Chat. Sometimes worse. I had told myself this was a vibes problem, the kind of thing where your subjective impression doesn’t match the benchmark. The benchmark, it turned out, was telling me the same thing. I just had not bothered to look at it next to the 2B’s row.
When I ran the same ARC-Easy on the recipe-parity 3B-Chat-CT-lr2e4, it scored 59.89.
That is a 10-point jump from the shipped version. It is also, finally, better than the 2B. The full table:
| |
The proper-recipe 3B wins on five of six. The shipped 3B won on one and lost on two.
My 3B is making sense now. Turns out the 3B I built was real. I just had not met it yet.
The fix took five hours and cost less than a sandwich.
The bigger fish
There was another thing I noticed when I ran the inference comparison.
For months I had been chasing weird tokens. The 1B-Chat liked to end its responses with PersonX or fefefe. The 2B-Chat preferred asilinna or Medalists. I tried cleaner SFT data, then DPO. The 2B got cleaner. The 1B did not. I shipped what I had and moved on.
By the time I started the 3B, I had also cleaned my pretraining data and re-tokenized everything. The 3B trained on the cleaner version. When it finished, the garbage was gone. What it had instead was loops.
Same prompt, two models. “Which of two trains traveled farther by noon?”
Shipped 3B (the buggy one):
| |
3B-Chat-CT:
| |
Same setup. “Explain relativity in three sentences a 10-year-old could understand.”
Shipped 3B:
| |
3B-Chat-CT:
| |
I had filed this in my head as “different problem.” It wasn’t.
I had been hunting two different bugs the whole time. Garbage tokens live in contaminated pretraining weights. Loops come from the model not knowing when to stop, because the plain-concat SFT never showed it an <|im_end|> token during training. Both look like a broken chat. Neither fix solves the other.
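The loop failure has a simple mechanical picture. This is a toy sketch, not a real decoder: next_token stands in for the model, and generation stops only when it emits the end-of-turn token or the length budget runs out. All names here are hypothetical.

```python
from typing import Callable

END = "<|im_end|>"

def generate(next_token: Callable[[list[str]], str], max_new_tokens: int = 12) -> list[str]:
    """Toy decoding loop: stop on END or when the token budget is exhausted."""
    out: list[str] = []
    for _ in range(max_new_tokens):
        tok = next_token(out)
        if tok == END:
            break
        out.append(tok)
    return out

answer = ["The", "second", "train."]

def learned_to_stop(prefix: list[str]) -> str:
    # Saw END after every answer in training, so it emits END when done.
    return answer[len(prefix)] if len(prefix) < len(answer) else END

def never_saw_end(prefix: list[str]) -> str:
    # Never saw END in training, so it keeps producing plausible text.
    return answer[len(prefix) % len(answer)]

print(" ".join(generate(learned_to_stop)))
print(" ".join(generate(never_saw_end)))
```

The second toy model has nothing wrong with its words. It simply has no way to say it is finished, so the budget cuts it off mid-loop.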
I had spent three months in the wrong half of the diagnosis. The other half was sitting in a 12-line function I had never read.
Cleaning my pretraining data didn’t fix the 3B because the 3B also needed the format fix. Adding the format fix to the 1B wouldn’t have fixed it because the 1B’s garbage is in its weights. The 3B-Chat-CT is the first model in my family with both halves applied. It produces no garbage. Just clean answers that end when they should.
I had been chasing this since February.
The honest reckoning
The 3B-Chat-CT is the best chat model I have ever shipped. It is also, by industry standards, undertrained by a wide margin.
| |
If Chinchilla 20:1 is the floor, I am in the basement, holding a flashlight.
The industry quietly moved a hundred to a thousand times past Chinchilla. I am still climbing toward 1x.
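For a sense of scale, the multiple is easy to compute. This sketch assumes the common 20-tokens-per-parameter reading of Chinchilla, a nominal 3e9 parameter count, and the figure of up to roughly 18 trillion pretraining tokens that Qwen has reported for the Qwen 2.5 family. That last number is an outside assumption, not something taken from this chapter.

```python
CHINCHILLA_TOKENS_PER_PARAM = 20

def chinchilla_multiple(params: float, tokens: float) -> float:
    """How many times past the 20:1 token budget a training run went."""
    return tokens / (CHINCHILLA_TOKENS_PER_PARAM * params)

params_3b = 3e9                                              # nominal parameter count
chinchilla_tokens = CHINCHILLA_TOKENS_PER_PARAM * params_3b  # 20:1 budget for 3B
qwen_tokens = 18e12                                          # assumed, see note above

print(f"Chinchilla budget for 3B: {chinchilla_tokens / 1e9:.0f}B tokens")
print(f"Qwen 2.5 3B multiple: {chinchilla_multiple(params_3b, qwen_tokens):.0f}x")
```

Under those assumptions the multiple comes out in the hundreds, inside the range quoted above.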
The reason for the move is inference economics. You pretrain a model once. You serve it billions of times. A 3B that matches a 10B in quality is three times cheaper to run for the rest of its operating life, so the extra pretraining compute pays for itself almost immediately. Commercial labs figured this out and started shoveling tokens into small models at industrial scale.
To get my 3B to Chinchilla 20:1 would cost about another $1,000. To match Qwen 2.5 3B’s training budget would cost around $770,000.
I am not going to do either of those things.
$770,000 is more than my house.
What I am going to do is publish the methodology. Everything in this chapter is documented and reproducible from a benchmark table and a Colab notebook. None of it is a capabilities claim.
The 3B is not a competitive 3B. It is the best 3B I could build for $425 in compute, plus one format bug that ate two extra weeks.
Closing Season 5
Season 5 was supposed to be about scaling. I built a 2B and shipped it. I built a 3B and shipped it. Both had chat models with problems I could not quite name. The main one was that my 3B, for some reason, did not feel any better than the 2B, as if I had spent ~$450 and bought no improvement.
Turns out, the actual story of Season 5 is that I had a 12-line Python function I never read.
Twelve lines. About the length of a haiku. I read more on a cereal box.
For Season 6, I am going to read my code first.
