This week, “thinking” and “reasoning” with LLMs are again the main topics:
Evolving Deeper LLM Thinking
O1-Pruner: Length-Harmonizing Fine-Tuning for O1-Like Reasoning Pruning
⭐DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
⭐: Papers that I particularly recommend reading.
Next Deep Dive
The next deep dive I’ll publish in The Salt will focus on online Direct Preference Optimization (DPO), a highly effective technique for training language models on preference data. I had originally planned to publish the article this week; however, I encountered several bugs in the implementation provided by Hugging Face TRL, which delayed progress on the practical section. Hopefully, I’ll be able to publish it early next week.
Evolving Deeper LLM Thinking
The challenge of guiding LLMs to think more deeply about complex problems often revolves around making better use of compute resources during inference.
Previous strategies to address this include chain-of-thought reasoning, self-consistency, and step-by-step revision based on feedback. These approaches often depend on generating multiple solutions or using evaluators to refine outputs. While methods like Best-of-N or tree search explore a wide range of solutions, they tend to focus either on breadth or on incremental improvement, which can limit their efficiency.
This paper proposes the Mind Evolution strategy which introduces a genetic search approach for LLMs, combining broad exploration with iterative refinement. This method generates a diverse population of candidate solutions, uses feedback from an evaluator to identify promising options, and refines them through selection, crossover, and mutation. Unlike traditional approaches, Mind Evolution refines complete solutions rather than individual steps, requiring only a global evaluator instead of stepwise feedback. This process can be easily parallelized and works effectively even in natural language tasks that lack formalized structures, such as travel planning or meeting scheduling.
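To make the loop concrete, here is a minimal Python sketch of this kind of genetic search. The callables propose_fn, refine_fn, and evaluate_fn are placeholders I’m assuming for illustration, not the paper’s implementation; the point is only that selection, crossover, and mutation operate on complete solutions scored by a single global evaluator.

```python
import random

def mind_evolution_sketch(task, propose_fn, refine_fn, evaluate_fn,
                          population_size=8, generations=5):
    """Minimal sketch of a Mind Evolution-style genetic loop (not the paper's code).

    - propose_fn(task) asks the LLM to draft one complete candidate solution.
    - refine_fn(task, parents) asks the LLM to merge/revise parent solutions
      (crossover + mutation), guided by the global evaluator's feedback.
    - evaluate_fn(task, solution) returns a single global score for a complete
      solution; no stepwise feedback is required.
    """
    # Broad exploration: start from a diverse population of complete solutions.
    population = [propose_fn(task) for _ in range(population_size)]

    for _ in range(generations):
        # Score complete solutions with the global evaluator.
        ranked = sorted(population, key=lambda s: evaluate_fn(task, s), reverse=True)

        # Selection: keep the most promising half of the population.
        parents = ranked[: population_size // 2]

        # Crossover/mutation: produce children by refining pairs of parents.
        children = [refine_fn(task, random.sample(parents, 2))
                    for _ in range(population_size - len(parents))]

        population = parents + children

    return max(population, key=lambda s: evaluate_fn(task, s))
```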
Experiments show that Mind Evolution significantly outperforms existing methods on benchmarks like TravelPlanner and Natural Plan. For example, it allows Gemini 1.5 Flash to achieve a 95.6% success rate on TravelPlanner, far surpassing the 55.6% success rate of Best-of-N with the same model. A two-stage approach using a more advanced LLM further pushes success rates to nearly 100%.
O1-Pruner: Length-Harmonizing Fine-Tuning for O1-Like Reasoning Pruning
Reasoning is a key skill for LLMs. Recently, models such as OpenAI’s O1, DeepSeek-R1, and QwQ have adopted a "long-thought reasoning" approach. This method “mimics” human problem-solving by breaking down complex problems, identifying mistakes, and trying alternative strategies when needed. It’s not limited to text-based reasoning either: research shows it can also work well in multimodal tasks that combine text and images.
While these models are powerful, their detailed reasoning processes often result in longer outputs and higher computational costs. NVIDIA has been communicating a lot about this recently, to plant the idea that we will need more GPUs than ever in the near future.
This paper introduces O1-Pruner, a Length-Harmonizing Fine-Tuning approach that tackles these challenges. The method reduces the computational overhead of long-thought reasoning by dynamically optimizing the reasoning length, adapting it to the problem’s complexity. The optimization objective minimizes reasoning redundancy while treating accuracy as a constraint. O1-Pruner uses a reinforcement learning (RL)-based loss function to guide the model toward more efficient inference: it penalizes unnecessary steps in the reasoning chain and encourages streamlined, accurate problem-solving paths.
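As a rough illustration (not the paper’s exact objective), the reward can be thought of as a length-shortening term measured against a reference model, plus an accuracy term that acts as the constraint. The function below is an assumed simplification, and lam is an assumed weighting hyperparameter.

```python
def length_harmonizing_reward(pred_len, ref_len, pred_correct, ref_accuracy, lam=2.0):
    """Sketch of an O1-Pruner-style reward (simplified, not the exact formulation).

    - The length term is positive when the sampled solution is shorter than the
      reference model's typical length for the same problem.
    - The accuracy term penalizes any drop in correctness relative to the
      reference, constraining the length optimization.
    """
    length_term = ref_len / max(pred_len, 1) - 1.0       # > 0 when shorter than reference
    accuracy_term = float(pred_correct) - ref_accuracy   # <= 0 when accuracy degrades
    return length_term + lam * accuracy_term

# Example: a correct answer half as long as the reference gets a positive reward.
print(length_harmonizing_reward(pred_len=400, ref_len=800, pred_correct=True, ref_accuracy=0.9))
```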
They will release their code here:
GitHub: StarDewXXX/O1-Pruner
⭐DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
This is probably the most significant paper of 2025 so far. It describes how DeepSeek-AI built DeepSeek-R1. Some details are missing, but the simplicity of the approach makes it relatively easy to reproduce, and Hugging Face is already working on a reproduction.
I reviewed the paper in this article for The Kaitchup:
Here is a summary of the training steps:
It starts with DeepSeek-R1-Zero, which utilizes reinforcement learning (RL) without prior supervised fine-tuning, relying on Group Relative Policy Optimization (GRPO) to streamline training. Unlike traditional RL setups that require a separate critic model, GRPO derives its baseline from the rewards of a group of outputs sampled from the previous policy for the same prompt, simplifying the process.
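For intuition, here is a minimal sketch of the group-relative baseline, assuming the common formulation where each reward is normalized by the group’s mean and standard deviation; this is not DeepSeek’s code.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Compute GRPO-style advantages for a group of responses to the same prompt.

    Instead of a learned critic, the baseline is the mean reward of the group;
    rewards are normalized by the group's standard deviation.
    """
    mean_r = statistics.mean(rewards)
    std_r = statistics.pstdev(rewards)
    return [(r - mean_r) / (std_r + eps) for r in rewards]

# Example: 4 sampled answers to one prompt, rewarded 1.0 if correct, else 0.0.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```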
The RL training flow includes sampling candidate responses, assigning rewards based on correctness or adherence to formats, calculating advantage scores within a group, and updating the model to favor higher-reward responses. This approach showcases the model's ability to self-improve through interaction, resulting in advanced reasoning capabilities. However, the lack of initial fine-tuning introduces trade-offs, such as less polished and occasionally cluttered answers, highlighting the limitations of a purely RL-driven system.
Reward modeling plays a central role in DeepSeek-R1-Zero’s progression, using rule-based methods to ensure correctness and format adherence without relying on complex neural reward models. Over time, the model naturally generates detailed reasoning steps, leading to improved accuracy and structured outputs. The emergence of “aha moments,” where the model reevaluates and corrects its reasoning, underscores its evolving meta-reasoning capabilities. Despite these advancements, the RL-only pipeline struggles with user-friendly presentation, often producing tangled or unreadable chains of thought. This limitation inspired the creation of DeepSeek-R1, which introduces a supervised fine-tuning phase with curated chain-of-thought examples to stabilize and improve initial performance, achieving a balance between reasoning ability and clarity.
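To illustrate, a rule-based reward along these lines can be a small script that checks the final answer and the expected output format. The sketch below assumes a template where reasoning goes between <think> tags and the final answer between <answer> tags (similar to the R1-Zero prompt template); the regex and weights are illustrative, not DeepSeek’s exact rules.

```python
import re

def rule_based_reward(completion: str, reference_answer: str) -> float:
    """Sketch of a rule-based reward combining correctness and format checks."""
    # Format reward: the completion must follow the expected tag structure.
    format_ok = bool(re.fullmatch(
        r"(?s)\s*<think>.*</think>\s*<answer>.*</answer>\s*", completion))

    # Accuracy reward: compare the extracted final answer to the reference.
    match = re.search(r"(?s)<answer>(.*)</answer>", completion)
    answer = match.group(1).strip() if match else ""
    correct = answer == reference_answer.strip()

    # Illustrative weighting: correctness dominates, format adds a small bonus.
    return 1.0 * correct + 0.1 * format_ok

# Example: a well-formatted, correct completion gets the full reward.
completion = "<think>2 + 2 = 4</think> <answer>4</answer>"
print(rule_based_reward(completion, "4"))
```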
DeepSeek-R1 employs multi-stage training to refine its outputs further, including language consistency rewards and rejection sampling.