This week, we review:
⭐ Tower+: Bridging Generality and Translation Specialization in Multilingual LLMs
SparseLoRA: Accelerating LLM Fine-Tuning with Contextual Sparsity
FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language
⭐: Papers that I particularly recommend reading.
⭐ Tower+: Bridging Generality and Translation Specialization in Multilingual LLMs

Recent studies show that proprietary LLMs like GPT-4o and Claude 3.7 are top-tier in translation, but open-weight models are catching up fast. With the right fine-tuning, some of these open models now rival or outperform closed models in translation quality. However, pushing for high translation performance often comes at the expense of general-purpose abilities. A clear example is TOWER V2: it is excellent at translation, winning most language pairs in WMT24, but it falls behind models like GPT-4o and Claude 3 on conversational benchmarks.
This tradeoff between translation accuracy and general utility is increasingly problematic. Specialized models may falter when asked to follow formatting guidelines or apply specific terminology, especially in more complex translation tasks. The limitation becomes visible in a new benchmark called IF-MT, which tests how well a model can translate while following instructions. On IF-MT, translation-focused models struggle, showing that narrowing a model's focus can hinder its broader capabilities.
To tackle this, the paper introduces a refined training pipeline aimed at achieving strong translation quality without sacrificing general-purpose performance. The updated approach builds on earlier TOWER models, adding key adjustments. One is modifying the continued pretraining stage to include a small amount of instruction-style data early on. This helps the model retain general capabilities as it specializes. The supervised fine-tuning phase also shifts: translation now makes up just 22% of the data mix, with the rest covering diverse tasks like coding and Q&A, all to preserve balance.
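To make the data-mix idea concrete, here is a minimal sketch of weighted sampling across task pools, with translation capped at roughly 22% of the mixture. This is my own illustration, not the authors' code, and the task names and non-translation weights are placeholders.

```python
import random

# Illustrative task pools (placeholders for real SFT examples).
pools = {
    "translation": [f"translation_{i}" for i in range(1000)],
    "code":        [f"code_{i}" for i in range(1000)],
    "qa":          [f"qa_{i}" for i in range(1000)],
    "chat":        [f"chat_{i}" for i in range(1000)],
}

# Translation is kept at ~22% of the mix; the remaining split is made up here.
weights = {"translation": 0.22, "code": 0.30, "qa": 0.28, "chat": 0.20}

def sample_sft_mix(n: int, seed: int = 0) -> list[str]:
    """Sample an SFT dataset where each example's task follows the target weights."""
    rng = random.Random(seed)
    tasks = rng.choices(list(weights), weights=list(weights.values()), k=n)
    return [rng.choice(pools[t]) for t in tasks]

mix = sample_sft_mix(10_000)
```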
They also add a sophisticated reward tuning phase using techniques like Weighted Preference Optimization (WPO) and Group Relative Policy Optimization (GRPO), supported by reward models designed to verify outputs. This stage further refines the model’s responses across tasks.
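For intuition on GRPO, here is a minimal sketch of the group-relative advantage it is named after: several candidate responses are sampled per prompt, each is scored by a reward model, and each response's advantage is its reward normalized against the group's mean and standard deviation. This is a simplified illustration of the general recipe, not Tower+'s implementation, and the rewards below are made up.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Normalize rewards within each group of sampled responses.

    rewards: (num_prompts, group_size) tensor of scalar rewards,
    one row per prompt, one column per sampled response.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 sampled responses each.
rewards = torch.tensor([[0.9, 0.2, 0.5, 0.4],
                        [0.1, 0.1, 0.8, 0.3]])
advantages = group_relative_advantages(rewards)
# Responses scoring above their group's mean get positive advantages and are
# reinforced; below-average responses are pushed down.
```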
The models are here:
Hugging Face: Tower+ (CC-BY-NC-SA)
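If you want to try a Tower+ checkpoint, a minimal transformers sketch looks like the following. I'm assuming the 9B repository id (Unbabel/Tower-Plus-9B), so check the collection for the exact names and hardware requirements.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed repository id; check the Tower+ collection on Hugging Face for exact names.
model_id = "Unbabel/Tower-Plus-9B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [
    {"role": "user",
     "content": "Translate the following text into French: The weather is lovely today."}
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```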
SparseLoRA: Accelerating LLM Fine-Tuning with Contextual Sparsity
LLMs are usually trained on massive general-purpose data and then fine-tuned for specific tasks or user preferences. But that fine-tuning gets pricey fast, especially for large models. Techniques like LoRA help reduce memory costs by updating weights in a low-rank way, and follow-ups like QLoRA and DoRA refine this further. But here’s the catch: while these methods save memory, they don’t necessarily speed up training.
This paper introduces SparseLoRA, which aims to fix that by tackling both memory and compute. The key idea is contextual sparsity: compute gradients only for a small, input-dependent subset of weight channels instead of all of them. To do this efficiently, they use an SVD-based estimator that selects which channels to update, adding almost no runtime overhead. The estimator doesn't require any pretraining of its own and adapts based on where you are in training, what part of the model you're updating, and what kind of tokens you're dealing with.
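Here's a rough sketch of the core idea as I understand it (not the authors' code): a cheap low-rank proxy of a projection predicts which output channels matter for the current input, and the expensive computation is restricted to that subset. The real method applies this to gradient computation during LoRA fine-tuning and adapts the sparsity to the layer, training step, and token type; the rank and keep ratio below are made up.

```python
import torch

def select_active_channels(x, U, V, keep_ratio=0.25):
    """Use a low-rank (SVD-style) proxy W ~ U @ V to score output channels
    for the current input, and keep only the top fraction.

    x: (batch, seq, d_in) activations
    U: (d_out, r), V: (r, d_in) low-rank factors of the full weight matrix
    """
    # Cheap proxy of the layer output: x @ V.T @ U.T instead of x @ W.T
    proxy = (x @ V.T) @ U.T                      # (batch, seq, d_out)
    scores = proxy.abs().mean(dim=(0, 1))        # importance per output channel
    k = max(1, int(keep_ratio * scores.numel()))
    return scores.topk(k).indices                # indices of channels to compute

# Toy shapes: the dense weight W is (d_out, d_in); U, V come from its SVD.
d_in, d_out, r = 512, 1024, 8
W = torch.randn(d_out, d_in)
U_full, S, Vh = torch.linalg.svd(W, full_matrices=False)
U, V = U_full[:, :r] * S[:r], Vh[:r, :]

x = torch.randn(2, 16, d_in)
active = select_active_channels(x, U, V)
# During fine-tuning, gradients would only be computed for these channels,
# skipping the rest of the weight matrix for this input.
y_sparse = x @ W[active].T                       # compute only the selected rows
```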
Results show it cuts compute costs by up to 2.2x and gives around a 1.6x speedup, all while holding accuracy steady on benchmarks like math reasoning, code gen, and instruction following. It’s the first approach to bring contextual sparsity into fine-tuning, not just inference, and it works surprisingly well.
They released their code here:
GitHub: z-lab/sparselora
FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language
This paper proposes a new, scalable pre-training data processing pipeline for non-English languages, inspired by the methodology behind FineWeb (a state-of-the-art English dataset).
This version is language-aware. It adjusts its preprocessing automatically based on language-specific features and statistics, instead of applying the same filtering rules across all languages. The design was validated through controlled experiments across nine diverse languages, ensuring that results are based on meaningful evaluation tasks rather than superficial benchmarks.
Another contribution is a principled approach to rebalancing datasets by accounting for both duplication and quality scores, which improves model performance on globally deduplicated corpora. They show that models trained on language-specific corpora from this pipeline outperform those trained on existing multilingual datasets, including when tested on languages that weren't used to tune the pipeline.
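A hedged sketch of what such rebalancing could look like: after global deduplication, each surviving document keeps a count of how many duplicates it had, and that count, combined with a quality score, decides how often it is upsampled. The exact weighting rule below is mine, not the paper's.

```python
import math

def rehydration_weight(duplicate_count: int, quality_score: float,
                       max_repetitions: int = 5) -> int:
    """Decide how many times to repeat a document in the training mix.

    duplicate_count: number of copies found before global deduplication
    quality_score:   filter score in [0, 1], higher is better
    """
    # Documents that were widely duplicated across the web tend to be useful,
    # but only upsample them if they also pass the quality bar (illustrative rule).
    if quality_score < 0.5:
        return 1
    repetitions = 1 + int(math.log2(1 + duplicate_count) * quality_score)
    return min(repetitions, max_repetitions)

# A high-quality page with 30 near-duplicates gets repeated more often
# than a mediocre page that appeared once.
print(rehydration_weight(duplicate_count=30, quality_score=0.9))  # -> 5
print(rehydration_weight(duplicate_count=1,  quality_score=0.6))  # -> 1
```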
Finally, they use this method to create FineWeb2, a massive multilingual dataset with 20 TB of data across 1,000+ languages, collected from Common Crawl snapshots spanning 2013–2024.
This is an excellent dataset that I use very often for various tasks.
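To sample from it, something like the following works with the datasets library. I'm assuming the HuggingFaceFW/fineweb-2 repository id and the fra_Latn config name for French, so double-check the dataset card for the exact subset names.

```python
from datasets import load_dataset

# Stream the French subset so nothing is downloaded in full
# (repo id and config name assumed from the dataset card).
fw2 = load_dataset(
    "HuggingFaceFW/fineweb-2",
    name="fra_Latn",
    split="train",
    streaming=True,
)

for i, doc in enumerate(fw2):
    print(doc["text"][:200])
    if i >= 2:
        break
```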