FineWeb-Edu: How to Make a Very High-Quality Dataset to Pre-train LLMs
A deep dive into the making of Hugging Face's FineWeb-Edu
A few months ago, Hugging Face introduced FineWeb, a high-quality dataset of 15 trillion tokens that outperforms other datasets of similar size when used to pre-train large language models (LLMs).
Building on this, Hugging Face further refined and filtered the dataset to retain only the highest-quality data. The result is FineWeb-Edu, a specialized subset of 1.3 trillion tokens of high-quality educational content. Experiments show that LLMs pre-trained on FineWeb-Edu outperform models pre-trained on other datasets 10 times larger.
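If you want to try the data yourself, it is available on the Hugging Face Hub. Below is a minimal sketch of how it could be streamed with the `datasets` library; the repository id `HuggingFaceFW/fineweb-edu` and the `sample-10BT` configuration are my assumptions based on the public release, not details taken from the report.

```python
from datasets import load_dataset

# Minimal sketch: stream a few samples from FineWeb-Edu on the Hugging Face Hub.
# Repository id and config name are assumptions about the public release.
fineweb_edu = load_dataset(
    "HuggingFaceFW/fineweb-edu",
    name="sample-10BT",   # small sample config; the full dataset is ~1.3T tokens
    split="train",
    streaming=True,       # avoid downloading the entire dataset
)

for i, example in enumerate(fineweb_edu):
    print(example["text"][:200])  # each record contains the page text plus metadata
    if i == 2:
        break
```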
Hugging Face published a very detailed technical report on how they created FineWeb-Edu. Since most LLM makers don’t reveal how they build their pre-training data, this report offers valuable insights.
In this article, I review the technical report detailing the creation of FineWeb and FineWeb-Edu, and walk through the process of developing a high-quality dataset for pre-training LLMs.