Data Preparation: Building a 12GB Training Corpus

Where I learned that 90% of ML is just cleaning data and crying about file sizes.