FineWeb-Edu: How to Make a Very High-Quality Dataset to Pre-train LLMs
A deep dive into the making of Hugging Face's FineWeb-Edu
A few months ago, Hugging Face introduced FineWeb, a high-quality dataset of 15 trillion tokens that outperforms other datasets of similar size when used to pre-train large language models (LLMs).
Building on this, Hugging Face further refined and filtered the dataset to retain only the highest-quality data. The result is FineWeb-Edu, a specialized subset of 1.3 trillion tokens of high-quality educational content. Experiments show that LLMs pre-trained on FineWeb-Edu outperform models pre-trained on other datasets 10 times larger.
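If you want to try the data yourself, it is available on the Hugging Face Hub. Below is a minimal sketch of how it could be streamed with the `datasets` library; the repository id `HuggingFaceFW/fineweb-edu` and the `sample-10BT` configuration are my assumptions based on the public release, not details taken from the report.

```python
from datasets import load_dataset

# Minimal sketch: stream a few samples from FineWeb-Edu on the Hugging Face Hub.
# Repository id and config name are assumptions about the public release.
fineweb_edu = load_dataset(
    "HuggingFaceFW/fineweb-edu",
    name="sample-10BT",   # small sample config; the full dataset is ~1.3T tokens
    split="train",
    streaming=True,       # avoid downloading the entire dataset
)

for i, example in enumerate(fineweb_edu):
    print(example["text"][:200])  # each record contains the page text plus metadata
    if i == 2:
        break
```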
Hugging Face published a very detailed technical report on how they created FineWeb-Edu. Since most LLM makers don’t reveal how they build their pre-training data, this report offers valuable insights.
In this article, I review the technical report detailing the creation of FineWeb and FineWeb-Edu, and walk through the process of developing a high-quality dataset for pre-training LLMs.