Reviewed this week:
If You Can't Use Them, Recycle Them: Optimizing Merging at Scale Mitigates Performance Tradeoffs
⭐Training Large Language Models to Reason in a Continuous Latent Space
I Don't Know: Explicit Modeling of Uncertainty with an [IDK] Token
⭐: Papers that I particularly recommend reading.
New code repositories (list of all repositories):
No new code repository this week. The code for “I Don't Know: Explicit Modeling of Uncertainty with an [IDK] Token” is expected to be released and looks interesting, but I won’t add it to the list until I can check it.
If You Can't Use Them, Recycle Them: Optimizing Merging at Scale Mitigates Performance Tradeoffs
The paper highlights model merging as a cost-effective alternative to multi-task learning or model ensembling, focusing on its potential to address performance tradeoffs in LLMs. While prior research has mostly involved small models (e.g., 7B parameters) and specialized expert models, this study explores merging generalist models at a much larger scale (100B+ parameters). A key challenge is the tradeoff inherent in multi-task training: optimizing for one task often harms performance on others, and balancing them requires costly tuning.
The main contributions:
Realistic Setup: Investigates merging intermediate checkpoints from diverse training runs (e.g., with different data mixes, objectives, and post-training stages) to optimize task tradeoffs. This reflects practical scenarios in LLM development.
Evolutionary Optimization: Proposes an automated approach to find optimal merging weights, outperforming uniform or greedy strategies while minimizing task conflicts (a toy sketch of this kind of weight search is given after this summary).
Insights into Merging: Reveals that seemingly suboptimal checkpoints can contribute to optimal merges, supporting the concept of recycling intermediate models. Performance depends on synergy among models rather than isolated quality.
The findings demonstrate that model merging can enable training-free optimization of task tradeoffs, providing a scalable method to improve LLM performance across multiple tasks without discarding intermediate models.
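To make the idea concrete, here is a minimal, hypothetical Python sketch of linear checkpoint merging combined with a very small evolutionary search over the merging coefficients. This is not the authors' implementation: the checkpoints are assumed to share the same architecture, and evaluate_on_tasks is a placeholder for whatever multi-task validation score you want to optimize.

```python
# Toy sketch: linear merging of checkpoints plus a tiny evolutionary search
# over the merging coefficients. Not the paper's code; checkpoints must share
# the same architecture, and evaluate_on_tasks is a placeholder metric.
import copy
import random

def merge_state_dicts(state_dicts, weights):
    """Return the weighted average of several checkpoints' parameters."""
    total = sum(weights)
    weights = [w / total for w in weights]          # normalize to sum to 1
    merged = copy.deepcopy(state_dicts[0])
    for name in merged:
        merged[name] = sum(w * sd[name].float() for w, sd in zip(weights, state_dicts))
    return merged

def evolutionary_weight_search(state_dicts, evaluate_on_tasks,
                               generations=20, pop_size=8, sigma=0.1):
    """Mutate the best coefficients found so far and keep improvements."""
    n = len(state_dicts)
    best_w = [1.0 / n] * n                          # start from uniform merging
    best_score = evaluate_on_tasks(merge_state_dicts(state_dicts, best_w))
    for _ in range(generations):
        for _ in range(pop_size):
            candidate = [max(1e-3, w + random.gauss(0.0, sigma)) for w in best_w]
            score = evaluate_on_tasks(merge_state_dicts(state_dicts, candidate))
            if score > best_score:
                best_w, best_score = candidate, score
    return best_w, best_score
```

The paper's search is more sophisticated than this hill-climbing loop, but the overall structure is the same: propose merging coefficients, merge, score the merged model on held-out tasks, and keep the best.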
If you are curious about model merging, I experimented with it for The Kaitchup:
Merging models as a mixture of experts:
Merging models with an ensembling method (the resulting model, The Mayonnaise, ranked first among 7B LLMs on the Open LLM Leaderboard for some time):
⭐Training Large Language Models to Reason in a Continuous Latent Space
This paper introduces Coconut (Chain of Continuous Thought), a paradigm for reasoning in LLMs that operates in a latent, continuous space rather than relying on language tokens. Reasoning methods like Chain-of-Thought (CoT) generate step-by-step solutions in natural language, which is inefficient: many of the generated tokens only maintain textual fluency and contribute little to the reasoning itself.
Coconut modifies the reasoning process by feeding the model's last hidden state back as a continuous input for the next step, bypassing the need for explicit language generation. This allows for efficient reasoning in which multiple potential solutions are encoded simultaneously, letting the model explore and refine reasoning paths in a manner similar to breadth-first search. Continuous thoughts improve decision-making by progressively narrowing down options, even without explicit training for this capability.
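As a rough illustration of the mechanism (not the authors' code, which also relies on a staged training curriculum), the sketch below shows the core inference loop with a Hugging Face causal LM: for a few "latent" steps, the last hidden state of the final position is appended to the input embeddings instead of decoding a token, and only afterwards does the model decode the answer as text. The base model name and the number of latent steps are arbitrary choices for the example.

```python
# Minimal sketch of "continuous thoughts": for a few steps, feed the last
# hidden state back as the next input embedding instead of decoding a token.
# Illustrative only: a real Coconut model is trained with a special curriculum.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"                                     # placeholder base model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

prompt = "Question: ... Let's think step by step."
input_ids = tok(prompt, return_tensors="pt").input_ids
embeds = model.get_input_embeddings()(input_ids)        # (1, seq_len, hidden)

num_latent_steps = 4                                    # reasoning steps in latent space
with torch.no_grad():
    for _ in range(num_latent_steps):
        out = model(inputs_embeds=embeds, output_hidden_states=True)
        last_hidden = out.hidden_states[-1][:, -1:, :]  # last layer, last position
        # Continuous thought: append the hidden state itself; no token is decoded.
        embeds = torch.cat([embeds, last_hidden], dim=1)

    # Switch back to language space: greedily decode a short textual answer.
    answer_ids = []
    for _ in range(20):
        out = model(inputs_embeds=embeds)
        next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        answer_ids.append(next_id)
        embeds = torch.cat([embeds, model.get_input_embeddings()(next_id)], dim=1)

print(tok.decode(torch.cat(answer_ids, dim=1)[0]))
```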
Experimental results show that Coconut enhances reasoning on math (GSM8k) and logic (ProntoQA and ProsQA) tasks: it outperforms traditional CoT on the logic benchmarks, which demand planning, and improves markedly over the no-CoT baseline on GSM8k, all while generating far fewer reasoning tokens. The findings demonstrate that latent reasoning can scale to complex problems, offering a more efficient alternative to purely language-based reasoning in LLMs.
I Don't Know: Explicit Modeling of Uncertainty with an [IDK] Token
This paper proposes a new approach to address hallucinations in LLMs by enabling them to explicitly express uncertainty through a special "[IDK]" ("I Don’t Know") token. During a continued pretraining phase, the authors modify the conventional cross-entropy objective to allocate probability mass to the [IDK] token when the model is uncertain, based on an Uncertainty Factor derived from the model's logits. This process, termed IDK-tuning, does not rely on labeled data and allows models to express uncertainty during pretraining, preserving the ability to fine-tune on specific tasks later.
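To give an idea of what such an objective can look like, here is a hedged PyTorch sketch. The exact Uncertainty Factor in the paper is derived from the model's logits in a more specific way; this toy version simply uses one minus the predicted probability of the gold token as the uncertainty score u, and builds a soft target that puts (1 - u) on the gold token and u on the [IDK] token.

```python
# Hedged sketch of an IDK-style objective: the soft target puts (1 - u) mass
# on the gold token and u mass on a special [IDK] token, where u is an
# uncertainty score derived from the model's own logits. The paper's actual
# Uncertainty Factor may differ from this simple proxy.
import torch
import torch.nn.functional as F

def idk_loss(logits, gold_ids, idk_id):
    """
    logits:   (batch, vocab) next-token logits
    gold_ids: (batch,) indices of the gold next tokens
    idk_id:   index of the added [IDK] token in the vocabulary
    """
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    gold_prob = probs.gather(1, gold_ids.unsqueeze(1)).squeeze(1)   # p(gold)
    u = (1.0 - gold_prob).detach()          # uncertainty proxy, treated as constant

    # Soft targets: (1 - u) on the gold token, u on [IDK].
    targets = torch.zeros_like(probs)
    targets.scatter_(1, gold_ids.unsqueeze(1), (1.0 - u).unsqueeze(1))
    targets[:, idk_id] = u
    return -(targets * log_probs).sum(dim=-1).mean()

# Example: a vocabulary of 10 tokens where index 9 is the added [IDK] token.
logits = torch.randn(4, 10, requires_grad=True)
gold = torch.tensor([1, 3, 5, 7])
loss = idk_loss(logits, gold, idk_id=9)
loss.backward()
```

Plugging a loss of this shape into a continued-pretraining loop in place of the usual cross-entropy is the general idea behind IDK-tuning; for the exact formulation, see the official repository linked below once it is released.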
The results show that IDK-tuning significantly improves factual precision while causing only a minor reduction in recall. It is effective across various model architectures and sizes and does not compromise general language modeling abilities, such as long-text generation. The method offers a robust solution to improving model reliability and demonstrates the potential of incorporating uncertainty modeling directly into the pretraining process.
The code will be released here:
GitHub: roi-hpi/IDK-token-tuning