The Salt - Curated AI

FineWeb-Edu: How to Make a Very High-Quality Dataset to Pre-train LLMs
A deep dive into the making of Hugging Face's FineWeb-Edu

Benjamin Marie
Jun 12, 2024

[Image: cartoon of salt grains being poured through a sieve, some falling through and some held back. Generated with DALL-E]

A few months ago, Hugging Face introduced FineWeb, a high-quality dataset of 15 trillion tokens that outperforms other datasets of similar size for pre-training large language models (LLMs).

Building on this work, Hugging Face meticulously filtered the dataset to retain only the highest-quality data. The result is FineWeb-Edu, a subset of 1.3 trillion tokens of exceptionally high-quality educational content. Experiments show that LLMs pre-trained on FineWeb-Edu outperform models pre-trained on other datasets that are 10 times larger.
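To make the filtering idea concrete, here is a minimal sketch of the core step: keeping only documents whose educational-quality score, assigned by a classifier, meets a threshold. This is an illustration, not Hugging Face's actual pipeline; the field names, the example documents, and the exact threshold are assumptions for the sake of the example.

```python
# Hedged sketch of score-threshold filtering, the core operation behind a
# dataset like FineWeb-Edu. Assumes each document already carries an
# educational-quality score on a 0-5 scale from an upstream classifier.
# The names `score`, `text`, and `THRESHOLD` are illustrative only.

THRESHOLD = 3  # documents rated at or above this value are kept

def filter_educational(documents, threshold=THRESHOLD):
    """Keep only documents whose quality score meets the threshold."""
    return [doc for doc in documents if doc["score"] >= threshold]

# Toy corpus with made-up scores for demonstration.
docs = [
    {"text": "Photosynthesis converts light into chemical energy...", "score": 4},
    {"text": "Buy cheap watches now!!!", "score": 0},
    {"text": "A primer on linear algebra for beginners.", "score": 3},
]

kept = filter_educational(docs)
print(len(kept))  # the two educational documents survive the filter
```

At web scale this same predicate would be applied in parallel over billions of documents (for example via the `filter` method of a streaming dataset library) rather than over an in-memory list, but the logic is the same.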

Hugging Face published a very detailed technical report on how they created FineWeb-Edu. Since most LLM makers don't reveal how they build their pre-training data, this report offers valuable insights.


In this article, I review the technical report detailing the creation of FineWeb and FineWeb-Edu, exploring the process of building a high-quality dataset for training LLMs.

© 2025 Benjamin Marie