This week, we review:
⭐ Should We Still Pretrain Encoders with Masked Language Modeling?
Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning
⭐: Papers that I particularly recommend reading.
⭐ Should We Still Pretrain Encoders with Masked Language Modeling?
Masked language modeling (MLM) has long been the dominant objective for pre-training encoders, but it is now being challenged by decoder models that are first trained with causal language modeling (CLM) and then adapted with MLM. LLM2Vec is one successful example of this approach.
State-of-the-art results on benchmarks like MTEB have emerged from these hybrid schemes, but they rely on much larger architectures and more data, leaving it unclear whether gains derive from the causal objective itself or simply from scale.
To disentangle these factors, the authors conduct a controlled comparison of the MLM and CLM objectives using models of identical size (210M–1B parameters) trained on the same corpus. They explore three regimes: from-scratch training with MLM versus CLM, a two-stage from-scratch protocol (CLM followed by MLM), and continued pretraining (CPT), where existing MLM-only or CLM-only checkpoints receive additional MLM steps.
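To make the distinction between the two objectives concrete, here is a minimal PyTorch sketch, written for this summary rather than taken from the paper's released code. The 15% mask rate and helper names are illustrative defaults, and real MLM recipes also replace some masked tokens with random or original tokens (BERT's 80/10/10 rule), which is omitted here for brevity:

```python
import torch
import torch.nn.functional as F

def clm_loss(logits, input_ids):
    """Causal objective: predict token t+1 from the prefix up to t."""
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        input_ids[:, 1:].reshape(-1),
    )

def make_mlm_batch(input_ids, mask_token_id, mask_prob=0.15):
    """Corrupt a random subset of positions and build MLM labels."""
    masked = torch.rand(input_ids.shape) < mask_prob
    labels = input_ids.clone()
    labels[~masked] = -100                 # loss only on masked positions
    corrupted = input_ids.clone()
    corrupted[masked] = mask_token_id      # model sees [MASK] here
    return corrupted, labels

def mlm_loss(logits, labels):
    """Bidirectional objective: recover original tokens at masked positions."""
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        labels.reshape(-1),
        ignore_index=-100,
    )
```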

Results show that while pure CLM pre-training yields competitive performance and data-efficient convergence on certain tasks, bidirectional MLM remains indispensable for consistently strong results across the board. Moreover, the two-stage protocol starting with CLM and ending with MLM delivers the best trade-off between data efficiency and downstream accuracy when training from scratch, combining stability and bidirectional context without requiring massive compute.
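As a rough sketch of what the two-stage protocol amounts to, assuming the loss helpers above and a `model(input_ids)` call that returns logits; the switch point, optimizer, and mask token id are hypothetical knobs, not the paper's hyperparameters:

```python
def two_stage_pretrain(model, data_iter, total_steps,
                       clm_fraction=0.5, mask_token_id=103):
    """Stage 1: CLM on a fraction of the step budget; stage 2: MLM.

    Note: the attention mask should also switch from causal to
    bidirectional at the stage boundary (omitted here for brevity).
    """
    opt = torch.optim.AdamW(model.parameters())
    for step in range(total_steps):
        input_ids = next(data_iter)               # (batch, seq_len) token ids
        if step < clm_fraction * total_steps:     # stage 1: causal
            loss = clm_loss(model(input_ids), input_ids)
        else:                                     # stage 2: bidirectional
            corrupted, labels = make_mlm_batch(input_ids, mask_token_id)
            loss = mlm_loss(model(corrupted), labels)
        opt.zero_grad()
        loss.backward()
        opt.step()
```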
All model weights, training scripts, and evaluation code are released:
Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning
Math problems from benchmarks like AIME have clear questions, unambiguous solutions, and answers that are easy to check numerically. That makes math reasoning a popular proxy for evaluating LLMs, but everyday applications such as dialogue, instruction following, and coding rely on broader language and commonsense skills.
Can gains in math reasoning actually carry over to other tasks? To find out, the authors evaluated over twenty open-weight reasoning models on a mix of reasoning challenges (scientific QA, coding, planning) and non-reasoning tasks (conversational QA, instruction following). They introduced a simple Transferability Index to quantify how much of the math improvement shows up elsewhere.
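The paper gives the precise definition; as a rough, hypothetical formalization consistent with the description above, think of it as the relative gain outside math normalized by the relative gain on math:

```python
def relative_gain(tuned_score, base_score):
    """Relative improvement of the fine-tuned model over its base checkpoint."""
    return (tuned_score - base_score) / base_score

def transferability_index(math_tuned, math_base, other_tuned, other_base):
    """Hypothetical formalization (see the paper for the exact definition):
    values near 1 mean the math gain fully transfers; negative values mean
    the model regressed outside math while improving on it.
    """
    return (relative_gain(other_tuned, other_base)
            / relative_gain(math_tuned, math_base))

# Example: +40% on math but only +4% on conversational QA -> index of 0.1
print(transferability_index(math_tuned=70, math_base=50,
                            other_tuned=52, other_base=50))
```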
What stood out was the fine-tuning method. Models tuned with reinforcement learning (RL) reliably transferred their math skills to new domains, while those tuned with supervised fine-tuning (SFT) often lost ground outside of math. To confirm this, they fine-tuned a Qwen3-14B model both ways: SFT on responses filtered for correct answers, and RL with a reward for answer accuracy. The RL-tuned model handled non-math tasks much better than its SFT counterpart.
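To contrast the two recipes schematically (the answer parser and data format below are my own simplifications, not the paper's pipeline): SFT keeps only sampled solutions whose final answer is correct and imitates them token by token, while RL scores each rollout with a binary accuracy reward fed to a policy-gradient method:

```python
import re

def extract_final_answer(text):
    """Hypothetical parser: take the last number in the response."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text)
    return numbers[-1] if numbers else None

def accuracy_reward(response, gold_answer):
    """RL signal: 1.0 if the final answer matches the reference, else 0.0."""
    return 1.0 if extract_final_answer(response) == gold_answer else 0.0

def build_sft_dataset(samples):
    """SFT signal: rejection sampling. Keep only correct responses, then
    fine-tune on them with the usual cross-entropy (imitation) loss.
    `samples` are (prompt, response, gold_answer) triples.
    """
    return [(prompt, response)
            for prompt, response, gold in samples
            if extract_final_answer(response) == gold]
```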
Their code will be released here: