I Have an A100. I Have 528 Shards of Data. I Cannot Combine Them.
I had a 3B model expanded from the 2B-75K base. Code tested. Smoke test passed. 528 shards on my NAS, ~70 GB, ~38 billion tokens of FineWeb, FineMath, PubMed, and cleaned Python. Three days later I had spent zero training tokens and was 1,200 words deep into a Notion page about VRAM accounting. This is that story.

Why a 3B

Two reasons. One: I wanted the next model to know what a kinase is. The 2B was clean, polite, and had read a lot of FineWeb. It had also never seen a single PubMed abstract. I have plans for this model that involve answering biomedical questions, and you cannot retrieve your way out of a model that does not know what “phosphorylation” means. The 3B’s data plan added 256 shards of PubMed, ~5.5B tokens, all fresh. The 2B is a polite generalist. The 3B is a polite generalist who also took two semesters of biochemistry. ...
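The data-mix numbers are easy to sanity-check. Here is a minimal back-of-envelope sketch; only the overall totals and the PubMed slice come from the figures above, and the per-shard averages are derived, not measured:

```python
# Back-of-envelope accounting for the data mix.
# Known from the plan: 528 shards / ~38B tokens total,
# of which 256 shards / ~5.5B tokens are PubMed.
TOTAL_SHARDS = 528
TOTAL_TOKENS = 38e9
PUBMED_SHARDS = 256
PUBMED_TOKENS = 5.5e9

# Everything that is not PubMed (FineWeb, FineMath, cleaned Python).
rest_shards = TOTAL_SHARDS - PUBMED_SHARDS
rest_tokens = TOTAL_TOKENS - PUBMED_TOKENS

print(f"PubMed share of tokens:  {PUBMED_TOKENS / TOTAL_TOKENS:.1%}")
print(f"PubMed tokens per shard: {PUBMED_TOKENS / PUBMED_SHARDS / 1e6:.0f}M")
print(f"Other tokens per shard:  {rest_tokens / rest_shards / 1e6:.0f}M")
```

If these figures hold, PubMed is roughly 14% of the token budget, and its shards average far fewer tokens each than the web-text shards, which is plausible for short abstracts versus long web documents.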