What Breaks When You Quantize for Translation? A Deep Dive Across 55 Languages
Evaluating LLM translation under quantization with COMET, BLEU, GGUF models, and more
Model quantization comes with a comforting promise: lower precision means smaller models, faster inference, and minimal performance loss, if done right. And in many cases, especially for high-resource languages like English, Japanese, or French, that promise holds.
But things look different on the long tail: languages like Bengali, Malayalam, or Zulu, which are spoken by millions but are often underserved by NLP systems.
To explore the impact of quantization for these languages, I collaborated with NICT, a top machine translation lab. We tested post-training quantization on models ranging from 1.7B to 70B parameters, focusing on translation tasks between English and 55 languages. The setup was deliberately clean: straightforward prompts, COMET for evaluation, and weight-only quantization to keep activations and KV cache intact.
When you quantize for translation, what actually breaks, and what still works?
We published our findings in this report:
The Uneven Impact of Post-Training Quantization in Machine Translation
In the rest of this article, I’ll highlight the key takeaways:
4-bit quantization generally performs well in translation tasks.
GGUF models hold up surprisingly well, even under low-bit quantization. Our tests used k-quantization with an importance matrix. Simpler methods (e.g., GGUF types 0 and 1) without importance matrices may produce significantly worse results.
Bitsandbytes quantization struggles with large models when applied to translation tasks.
Even for languages that are officially supported by the models we tested, the translation quality can be extremely low, and it drops further once the model is quantized to 2-bit.
Automatic evaluation of translation quality is harder than ever.
Experimental Settings
The task is machine translation on WMT24++, which has 55 languages and four domains (literary, news, social, speech). All models translate both directions with English (En→X and X→En; X being one of the languages in WMT24++). Scores come from COMET (wmt22-comet-da). COMET is good for ranking systems, but its absolute values can mislead, so treat absolute score levels with care (more on this in the last section of this article).
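For readers who want to reproduce the scoring step, the wmt22-comet-da checkpoint is available through the unbabel-comet package. The snippet below is a minimal sketch rather than the exact evaluation script used in the paper; the example sentences are placeholders.

```python
# pip install unbabel-comet
from comet import download_model, load_from_checkpoint

# Download and load the reference-based COMET model used in the paper.
model_path = download_model("Unbabel/wmt22-comet-da")
model = load_from_checkpoint(model_path)

# Each sample needs the source, the system output, and a reference.
data = [
    {"src": "Le chat dort sur le canapé.",
     "mt": "The cat is sleeping on the couch.",
     "ref": "The cat sleeps on the sofa."},
]

# Returns segment-level scores and a corpus-level average on a 0-1 scale
# (usually reported multiplied by 100).
output = model.predict(data, batch_size=8, gpus=0)
print(output.system_score)
```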
Because scores are not comparable across languages, results are reported per language, not averaged. The paper shows detailed results for six representative languages that span scripts and resource levels: Japanese (ja_JP), French (fr_FR), Polish (pl_PL), Bengali (bn_IN), Malayalam (ml_IN), and Zulu (zu_ZA). Note that WMT24++ was built for En→X; the reverse X→En side may contain “translationese” and can be easier.
Five models are tested to cover scale: Qwen3-1.7B, Qwen3-8B, Llama-3.1-8B-Instruct, Qwen3-32B, and Llama-3.3-70B. Qwen3 “reasoning mode” is disabled. Qwen3 claims official coverage for almost all WMT24++ languages except Canadian French and Zulu. Llama 3.1/3.3 list only eight supported languages (English, French, German, Hindi, Italian, Portuguese, Spanish, Thai).
Quantization is post-training and weight-only. Activations and the KV cache stay in higher precision. Four PTQ methods are used, chosen for accuracy and ecosystem support, with fixed hyperparameters:
AWQ (4-bit): AutoAWQ implementation; per-channel with zero points; group size 128.
bitsandbytes NF4 (4-bit): Non-uniform 4-bit with nested quantization via the Transformers integration; no 2-bit support (see the loading sketch after this list).
GGUF (4-bit Q4_K_M, 2-bit Q2_K): llama.cpp K-quantization guided by an importance matrix; imatrix fit on 20k WikiText samples (context 512, batch 512).
AutoRound (2-/4-bit): Differentiable rounding; 512 calibration samples; max sequence length 4096; 512 optimization steps; group size 128 at 4-bit and 32 at 2-bit.
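To give a sense of how little code these weight-only setups require, here is a minimal sketch of the bitsandbytes NF4 configuration through the Transformers integration; the other methods follow their own libraries (AutoAWQ, llama.cpp, AutoRound). The model name is only an example, not necessarily the exact checkpoint we evaluated.

```python
# pip install transformers accelerate bitsandbytes
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Weight-only NF4 quantization with nested (double) quantization,
# matching the bitsandbytes setting described above.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # activations stay in higher precision
)

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # example; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
```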
Main Results
Translation quality follows two simple rules in our results:
Bigger models score higher.
Any quantization lowers scores.
The 1.7B model can lose up to ~5 COMET points after 4-bit PTQ, while 32B and 70B models usually lose ≤1 point. Language matters a lot. High-resource pairs like Japanese and French stay close to full precision. Low-resource Indic languages and Zulu start lower and drop the most.
Method choice matters. GGUF has the smallest losses overall. AWQ and especially bitsandbytes (NF4) degrade more. NF4 is okay around 8B but becomes the worst choice at 70B.
Going to 2-bit hurts across the board, and the hit is uneven by language. Example: Qwen3-8B at 2-bit (GGUF) for En→X drops ~2 points on Japanese and French but ~17 points on Bengali and Malayalam. NF4 shows the sharpest split on Indic sources: Llama-3.1-8B at 4-bit drops only 1.5 points (ja→en) and 0.3 (fr→en), but 7.7 (bn→en) and 9.7 (ml→en). The 70B model shows the same pattern, just milder (e.g., −0.9 for pl→en vs. −6.2 for zu→en).
In short: the worse a language’s baseline, the more quantization hurts it. The impact is language-dependent, strongest for low-resource and script-diverse pairs.
If your model can’t translate well or shows limitations for your target language, don’t quantize it!
Should You Calibrate Quantization for Your Target Language?
Most quantization methods can leverage a calibration step to minimize the quantization error. However, the calibration dataset is often an English dataset.
The purpose of this experiment is to test whether the language used for quantization calibration affects translation quality after compression. The model is Llama-3.1-8B-Instruct, and the quantization method is GGUF, evaluated at two bit-widths: 4-bit (Q4_K_M) and 2-bit (Q2_K). Calibration for GGUF is achieved through "importance matrices" (imatrix). These are computed from calibration data; here, from two datasets of 10,000 tokens each, one in English and one in Bengali, both sampled from the FineWeb-2 corpus.
The models are tested on four translation directions:
English → French
English → Bengali
French → English
Bengali → English
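For readers who want to try this kind of per-language calibration themselves, the sketch below drives llama.cpp's llama-imatrix and llama-quantize binaries from Python. The binary names, flags, and file paths are assumptions based on recent llama.cpp builds and may differ in your version; the calibration text files stand in for the English and Bengali FineWeb-2 samples.

```python
import subprocess

# Placeholder paths: a full-precision GGUF export of the model and two
# plain-text calibration files (English and Bengali FineWeb-2 samples).
BASE_GGUF = "llama-3.1-8b-instruct-f16.gguf"
CALIB = {"en": "fineweb2_en_10k_tokens.txt", "bn": "fineweb2_bn_10k_tokens.txt"}

for lang, calib_file in CALIB.items():
    imatrix_file = f"imatrix-{lang}.dat"

    # 1) Compute an importance matrix on the calibration text.
    #    (Flag names follow recent llama.cpp releases and may change.)
    subprocess.run(
        ["llama-imatrix", "-m", BASE_GGUF, "-f", calib_file, "-o", imatrix_file],
        check=True,
    )

    # 2) Quantize to 2-bit K-quants guided by that importance matrix.
    subprocess.run(
        ["llama-quantize", "--imatrix", imatrix_file,
         BASE_GGUF, f"llama-3.1-8b-instruct-{lang}-Q2_K.gguf", "Q2_K"],
        check=True,
    )
```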
Results at 4-bit
At 4-bit, the choice of calibration language has no meaningful impact. When using Q4_K_M, COMET scores are nearly identical whether the imatrix was built on English or Bengali. Across all four translation directions, the difference is ≤ 0.2 COMET points. This suggests that at 4-bit, the model retains enough representational capacity to handle both Latin and Indic scripts without needing language-specific calibration.
Conclusion: For 4-bit quantization, you can safely calibrate on a general-purpose dataset (e.g., in English) and expect it to perform equally well across these languages and directions.
Results at 2-bit
At 2-bit, the situation changes. The representational budget is much smaller, and the model becomes sensitive to the language used for calibration. When the imatrix is built on Bengali instead of English:
English → Bengali improves by +3.1 COMET (48.0 vs. 44.9). Note that even after this improvement, the translation quality remains extremely low.
Bengali → English shows a smaller, inconsistent change. The paper text claims a +0.8 gain, but the table actually shows a drop (72.3 with English vs. 71.5 with Bengali calibration), likely a typo or noise.
The benefit is clear only for Bengali as the target language, especially in the more difficult direction (English→Bengali), where the model must generate complex non-Latin script under tight precision constraints.
There is no benefit for unrelated languages. For French, COMET scores remain flat regardless of calibration language. In some cases, they even slightly improve with the “wrong” calibration language, which supports the idea that the effect is specific, not generalizable.
Why this happens
At 4-bit, group-wise quantization allows a relatively fine-grained allocation of representational space. The imatrix sets per-group scales, but the 4-bit width is still wide enough to represent a broad range of weight values, covering both Latin and Indic features. As a result, the specific calibration text has little influence.
At 2-bit, however, each weight group can only represent four values. The imatrix determines which parts of the weight distribution are preserved. If the calibration data comes from a language that shares properties with the translation target (e.g., similar script, morphology), the quantizer can better preserve the weights that matter most. If the calibration data is mismatched, important information may be lost.
This explains the performance gain when using Bengali calibration for English→Bengali. That direction involves generating long, morphologically rich, and script-diverse outputs. In contrast, translating into English (Bengali→English) is less sensitive, since English is simpler morphologically and script-wise.
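A toy numerical example makes this concrete. The sketch below is not the K-quant algorithm used by llama.cpp; it is a simplified importance-weighted 2-bit round-to-nearest quantizer over a single weight group, meant only to show how the importance weights (a stand-in for the imatrix) influence which scale is chosen and therefore which weights survive quantization accurately.

```python
import numpy as np

def quantize_group_2bit(weights, importance):
    """Pick the scale that minimizes importance-weighted squared error for a
    2-bit quantizer (4 levels: -2, -1, 0, 1 times the scale)."""
    best_scale, best_err, best_deq = None, np.inf, None
    # Brute-force a small grid of candidate scales.
    for scale in np.linspace(1e-4, np.abs(weights).max(), 200):
        q = np.clip(np.round(weights / scale), -2, 1)
        deq = q * scale
        err = np.sum(importance * (weights - deq) ** 2)
        if err < best_err:
            best_scale, best_err, best_deq = scale, err, deq
    return best_scale, best_deq

rng = np.random.default_rng(0)
group = rng.normal(size=32)
group[:4] *= 4.0                # a few "salient" weights with larger magnitude

uniform = np.ones(32)           # flat importance: every weight matters equally
peaked = np.ones(32)
peaked[:4] = 50.0               # importance concentrated on the salient weights

for name, imp in [("uniform", uniform), ("peaked", peaked)]:
    scale, deq = quantize_group_2bit(group, imp)
    err_salient = np.abs(group[:4] - deq[:4]).mean()
    print(f"{name:8s} scale={scale:.3f} error on salient weights={err_salient:.3f}")
```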
Other factors
The baseline quality of the unquantized model also affects how much calibration helps. If the model already struggles on a given language pair, quantizing to 2-bit worsens performance more. Calibration can recover some quality, but not all of it. Conversely, for strong baseline pairs like English↔French, 2-bit quantization has less impact, and calibration adds little.
What to do in practice
Use a generic, language-agnostic calibration set for 4-bit. Do not overthink language choice at this precision.
If you must use 2-bit for a given target language, build the imatrix on that target language. Do this per target. Store separate imatrices per language if you can.
Expect the largest gains when the target uses a non-Latin script or has complex morphology. Expect smaller gains when the target is English.
Validate with more than one metric when you can. COMET is good for ranking, but small changes may sit within noise and are hard to interpret. chrF or human spot checks help confirm real gains.
Something Not in the Paper: Machine Translation Evaluation is Harder than Ever
I’ve long been part of the BLEU/chrF skepticism camp. My work examining a decade of evaluation practices may have even contributed to the shift from traditional n-gram-based metrics to neural ones like COMET and MetricX, which correlate far better with human judgment.
But with LLMs, we’re seeing translation behaviors that older sequence-to-sequence models simply didn’t produce, and neural metrics often fail to penalize them appropriately. In the LLM era, neural metrics still correlate better with human judgments than BLEU/chrF when ranking translations, but deltas between scores, and the absolute scores themselves, are often very misleading. While BLEU and chrF clearly show when a system fails to translate, e.g., with a score below 5 BLEU, neural metrics almost always award some points to a model output.
When you evaluate a system trained specifically for machine translation, you expect translation errors, and machine translation metrics are trained to spot and judge those errors. When you translate with LLMs, the output can be anything, including things that neural metrics were never trained to judge and that are hard to rank at all: is an empty output better or worse than an answer to a question found in the source text instead of a translation?
For example, COMET and MetricX may yield “good” scores even when LLMs:
Answer a question contained in the source text instead of translating it
Copy the source text verbatim
Translate and then continue generating irrelevant content until reaching the maximum sequence length
Annotate their translations (e.g., Qwen models sometimes add translator notes; Llama 3.3 often includes transliterations for Chinese)
Refuse to translate (e.g., due to unsafe content in the source text) or only partially complete the translation
These behaviors would be heavily penalized by BLEU and chrF, which reward exact n-gram overlap and apply length penalties for outputs that are too short or too long. But neural metrics often give them a pass.
In practice, a model showing all of these issues can still score over 60 COMET points (on a 100-point scale), yet score below 1 BLEU point (also on a 100-point scale). That’s a massive disconnect. Even in the machine translation industry, companies and their customers tend to interpret these scores (BLEU, COMET, chrF, …) and use them to define goals, even though the absolute values are meaningless on their own: the metrics were made to rank systems, not to produce an interpretable score. I particularly like MetricX for this reason: it produces a score between 0 and 25, with lower scores being better, i.e., a number that is naturally uninterpretable unless you know how MetricX was trained.
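For the diagnostic side of this comparison, sacrebleu makes the string-matching check trivial. The sketch below uses made-up example strings: a reasonable translation next to a degenerate output that simply copies the source, the kind of output a neural metric may still reward, while BLEU and chrF collapse as expected.

```python
# pip install sacrebleu
import sacrebleu

references = [["The cat sleeps on the sofa."]]

# A real (if imperfect) translation vs. a degenerate output that simply
# copies the source text instead of translating it.
hypotheses = {
    "translation": ["The cat is sleeping on the couch."],
    "source copy": ["Le chat dort sur le canapé."],
}

for name, hyp in hypotheses.items():
    bleu = sacrebleu.corpus_bleu(hyp, references)
    chrf = sacrebleu.corpus_chrf(hyp, references)
    print(f"{name:12s} BLEU={bleu.score:5.1f} chrF={chrf.score:5.1f}")
```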
You can find BLEU and COMET scores for all the quantized models we have evaluated in this dataset:
So here’s my caution:
Don’t trust COMET or other neural metrics unless they’re accompanied by concrete translation examples or traditional metrics like BLEU or chrF.
Only use BLEU or chrF for diagnostic purposes.
There are no “good” COMET, BLEU, or chrF scores.
I’m wondering whether an LLM-as-a-judge could be fine-tuned to better score the LLM translation errors that COMET and MetricX (even the hybrid variants) miss. Even zero-shot, without fine-tuning, some prompts show promising results. We’ll see; maybe even Gemma 270M could be fine-tuned for this task with a well-curated dataset.