My Code Agent Said It Was a Moose. I Said No. It Was a Moose.
The H200 was working. The 3B was training. After three days of fighting the cloud, the model was finally putting tokens through the GPU at 23,200 per second. I had a checkpoint at step 1,000. I had a checkpoint at step 1,200. I went to bed feeling, briefly, like a person. Six hours later the run was dead. The checkpoint at step 1,200 was corrupted. The next run got to step 25 and froze. The one after that got to step 17 and silently disappeared. ...