Season 1 · Ch. 2

Data Preparation: Building a 12GB Training Corpus

How I built a 12GB ChatGPT-style conversational dataset and implemented BPE tokenization for efficient training.

January 22, 2026 · 4 min · Jun Park
GPUburnout