This week, we review:
⭐OpenThoughts: Data Recipes for Reasoning Models
The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text
Inference-Time Hyper-Scaling with KV Cache Compression
⭐: Papers that I particularly recommend reading.
New code repositories (list of all repositories):
⭐OpenThoughts: Data Recipes for Reasoning Models
Recent reasoning models like DeepSeek-R1 and OpenAI's o3 achieve strong performance on math, coding, and science tasks by extending capable base models with post-training methods such as supervised fine-tuning (SFT) and reinforcement learning (RL). These methods enable models to generate explicit chains of reasoning during inference, but the full methodologies behind frontier systems remain closed, limiting reproducibility.
A key insight from recent work is that the quality and structure of SFT data are major drivers of reasoning ability, often more so than architectural changes or RL. Projects like R1-Distill demonstrate that small to mid-scale models can reach competitive performance using only SFT on curated triplets of questions, reasoning traces (“thinking tokens”), and answers. These traces and answers are typically generated by a stronger teacher model, suggesting that reasoning capability can be distilled with sufficient data quality alone.
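To make this data format concrete, here is a minimal sketch of what one question/reasoning/answer example might look like once flattened into an SFT target; the field names and the <think> template are my own illustrative assumptions, not the exact OpenThoughts schema.

```python
# Illustrative SFT example in the question / reasoning / answer format
# described above (field names are assumptions, not the OpenThoughts schema).
example = {
    "question": "What is the sum of the first 10 positive integers?",
    "reasoning": "The sum 1 + 2 + ... + n equals n(n + 1)/2. "
                 "For n = 10, this gives 10 * 11 / 2 = 55.",
    "answer": "55",
}

def to_sft_text(ex: dict) -> str:
    """Flatten a triplet into one training string, using a hypothetical
    <think>...</think> template for the reasoning trace."""
    return (
        f"User: {ex['question']}\n"
        f"Assistant: <think>{ex['reasoning']}</think> {ex['answer']}"
    )

print(to_sft_text(example))
```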
OpenThoughts extends this paradigm by scaling and refining the SFT data pipeline. Early iterations, OpenThoughts-114K and OpenThoughts2-1M, explore automated verification and increased question diversity. The latest version, OpenThoughts3-1.2M, builds on extensive empirical analysis (more than 1,000 ablation experiments) to identify scalable data construction strategies. Fine-tuning Qwen2.5-7B-Instruct on this dataset yields OpenThinker3-7B, which outperforms both R1-Distill-7B and Nemotron-Nano-8B on a suite of reasoning benchmarks, setting a new standard for 7B-scale open models.
This work shows that sampling multiple answers per question from a teacher model dramatically increases data scale and effectiveness, and that the best teachers for distillation are not necessarily the models with the highest benchmark scores. Moreover, simplistic verification or filtering strategies contribute little to performance; more effective curation comes from careful selection of high-quality data sources and from filtering questions by LLM-labeled difficulty.
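As a rough illustration of those two levers (many samples per question, difficulty-based filtering), here is a hedged sketch of such a pipeline; query_teacher and label_difficulty are hypothetical stand-ins for a teacher-model call and an LLM difficulty grader, not functions from the OpenThoughts codebase.

```python
from typing import Callable

def build_sft_corpus(
    questions: list[str],
    query_teacher: Callable[[str], str],     # hypothetical: one sampled reasoning trace + answer
    label_difficulty: Callable[[str], int],  # hypothetical: LLM-assigned difficulty score, e.g. 1-10
    samples_per_question: int = 16,
    min_difficulty: int = 5,
) -> list[dict]:
    """Sketch of the two findings above: keep only questions an LLM rates as
    hard enough, then sample many teacher completions per retained question.
    Illustration only, not the paper's actual pipeline."""
    corpus = []
    for question in questions:
        if label_difficulty(question) < min_difficulty:
            continue  # drop questions labeled as too easy
        for _ in range(samples_per_question):
            corpus.append({"question": question,
                           "completion": query_teacher(question)})
    return corpus
```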

If you are interested in learning about reasoning models, I really recommend reading this long paper!
The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text
Pretraining is a key phase in LLM development, in which models learn to predict the next token in sequences of unstructured text. This step helps LLMs capture language patterns and build broad world knowledge, and is essential for strong performance on downstream tasks. To improve model capabilities, pretraining datasets have grown to trillions of tokens, mostly sourced from publicly available web text.
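For readers newer to this, the objective itself is plain next-token prediction; below is a generic PyTorch-style sketch of the loss, not the paper's training code.

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
    """Causal language-modeling loss: position t predicts token t+1.
    logits: (batch, seq_len, vocab_size); token_ids: (batch, seq_len)."""
    pred = logits[:, :-1, :].reshape(-1, logits.size(-1))  # drop the last position
    target = token_ids[:, 1:].reshape(-1)                  # drop the first token
    return F.cross_entropy(pred, target)
```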
However, much of this web content is copyrighted. It is routinely used for training, often without permission or compensation, a practice that raises serious legal and ethical issues and has already led to lawsuits, DMCA takedowns, and growing restrictions such as websites blocking AI crawlers.
To address these challenges, the authors ask whether it's possible to train competitive LLMs using only public domain and openly licensed text, defined as content with explicit permission for unrestricted use. They introduce Common Pile v0.1, an 8TB dataset built from 30 openly licensed sources spanning academic papers, code, government documents, historical books, forums, and educational resources.

After cleaning and balancing the data, they train two 7B-parameter models, Comma v0.1-1T and Comma v0.1-2T, and show that they match the performance of similarly sized models trained on unlicensed web data, such as Llama 2. All models, data, and preprocessing code are released publicly:
GitHub: r-three/common-pile
Inference-Time Hyper-Scaling with KV Cache Compression
Inference-time scaling boosts LLM reasoning by allocating more compute during generation, allowing models to explore multiple or longer reasoning paths. However, this strategy is constrained by the memory and runtime costs of the key–value (KV) cache, which grows with the number and length of reasoning chains. Attention becomes memory-bound as more cache tokens are added, slowing down generation.
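To get a feel for why the cache, rather than the weights, becomes the bottleneck, here is a back-of-the-envelope calculation of KV cache size; the dimensions below are assumed values roughly typical of a 7B-class model with grouped-query attention, not numbers from the paper.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch: int, bytes_per_elem: int = 2) -> int:
    """Keys + values: 2 tensors per layer of shape (batch, n_kv_heads, seq_len, head_dim)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Assumed 7B-class config, fp16 cache, 8 parallel 32K-token reasoning chains.
size = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128,
                      seq_len=32_768, batch=8)
print(f"{size / 2**30:.0f} GiB")  # 32 GiB, before counting weights or activations
```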
Recent methods attempt to mitigate this by reducing cache size through training-free heuristics or retrofitting models to manage the KV cache more efficiently. However, naive sparsification often harms model quality, and retrofitting approaches like token merging (e.g., DMC) are expensive to train.
This work introduces Dynamic Memory Sparsification (DMS), a lightweight retrofitting method that sparsifies the KV cache while preserving model quality. It works by making the model aware of future cache eviction decisions, allowing high compression (up to 8x) with minimal training (just 1K steps). Unlike prior approaches, DMS maintains or even improves model performance on reasoning benchmarks such as MATH-500, AIME 2024, GPQA, and LiveCodeBench, all within the same memory or runtime budget.
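One way to picture the mechanism: each incoming token receives an eviction decision, but flagged tokens remain readable for a short grace window before being dropped, so the model can adapt to the compressed cache. The toy class below is my sketch of that delayed-eviction idea with hand-set flags; DMS itself learns the eviction decisions during its short retrofitting phase.

```python
from collections import deque

class DelayedEvictionKVCache:
    """Toy sketch of a sparsified KV cache with delayed eviction.
    Tokens flagged for eviction stay visible for `delay` more steps before
    being dropped; the flag here is an input, whereas DMS learns it."""

    def __init__(self, delay: int = 16):
        self.delay = delay
        self.kept = []           # (key, value) pairs retained indefinitely
        self.pending = deque()   # (step, key, value) awaiting removal

    def append(self, key, value, evict: bool, step: int):
        if evict:
            self.pending.append((step, key, value))
        else:
            self.kept.append((key, value))
        # Drop flagged tokens whose grace window has expired.
        while self.pending and step - self.pending[0][0] >= self.delay:
            self.pending.popleft()

    def visible(self) -> list:
        """Keys/values that attention can still read at the current step."""
        return self.kept + [(k, v) for _, k, v in self.pending]
```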

DMS consistently outperforms both vanilla models and existing efficient attention baselines, with particularly strong gains on long-context tasks. These results demonstrate that efficient attention, when implemented through methods like DMS, enables LLMs to scale their reasoning capabilities at inference time without degrading quality. This advances the Pareto frontier of performance versus efficiency.