AI Translation Quality

How good is AI translation in 2025–2026? Here's what the benchmarks show — from competitive evaluations judged by humans, to production deployment metrics.


WMT competition results

The Workshop on Machine Translation (WMT) is the primary competitive evaluation for translation systems, with human judges evaluating output across dozens of language pairs.

WMT 2024

| Finding | Detail |
|---|---|
| Winner | Claude 3.5 Sonnet — 9 of 11 language pairs |
| "Good" translation rate | 78% across German, Polish, Russian |
| Significance | A general-purpose LLM beat dedicated translation systems |

Source: WMT 2024 General Translation Task results.

WMT 2025

| Finding | Detail |
|---|---|
| Scale | 36 teams, 32 language pairs |
| Designation | Officially the "LLM Era" of machine translation |
| Frontier LLMs | Gemini 2.5 Pro, GPT-4.1, Claude 4 — consistent excellence without translation-specific training |
| Specialized winner | Tencent Hunyuan-MT (7B params) — rank 1.0 in 30/31 categories (with metric-gaming caveats) |
| Open-weight | Tower-7B surpassed GPT-4o despite being 100x smaller |

Source: WMT 2025 proceedings.


Model-specific performance

Translation quality by model

| Model | Key strength | Notable benchmark |
|---|---|---|
| Claude 3.5 Sonnet | European languages, formal docs | 92.8% English performance; WMT 2024 winner |
| GPT-4 | Conversational content | 0.88 BLEU vs traditional MT's 0.82 |
| Gemini 2.5 Pro | Long-context comprehension | 94.5% accuracy on 128K reading tasks |
| DeepSeek R1 | Chinese language, cost efficiency | 90.9% CMMLU; 10x lower cost than competitors |
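BLEU, cited above for GPT-4, scores a candidate translation by n-gram overlap with a reference, penalizing output that is too short. The table's 0.88 figure comes from corpus-level evaluation with real toolkits; this toy sentence-level sketch only illustrates the mechanics:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU: geometric mean of clipped n-gram precisions
    times a brevity penalty. Toy version with a single reference."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        overlap = sum((cand_counts & ref_counts).values())  # clipped matches
        precisions.append(overlap / max(sum(cand_counts.values()), 1))
    if min(precisions) == 0:
        return 0.0  # production toolkits apply smoothing instead
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    brevity = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity * geo_mean

print(bleu("the cat sat on the mat", "the cat sat on the mat"))  # 1.0
```

A perfect match scores 1.0; any missing n-gram order zeroes the toy score, which is why real implementations (e.g. sacreBLEU) smooth low-order counts.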

AI suggestion acceptance rates

| Metric | Number | Source |
|---|---|---|
| AI suggestion acceptance rate (Lokalise) | 82.6% | Lokalise production data |
| Content requiring no post-editing (technical docs) | 80% | Industry benchmark |
| GPT-4 vs human translators (MQM errors) | Junior-to-medium level match | Academic study (1,600 technical sentences) |
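MQM (Multidimensional Quality Metrics) scores like those cited here are derived from annotated errors: each error carries a severity weight, and the penalty total is normalized by word count. A hedged sketch of one common formulation (the severity weights are illustrative defaults, not a normative MQM profile):

```python
def mqm_score(word_count, errors, weights=None):
    """Illustrative MQM-style quality score: 100 minus weighted
    error penalties normalized by document length. Weights and the
    exact normalization vary between MQM scoring profiles."""
    weights = weights or {"minor": 1, "major": 5, "critical": 10}
    penalty = sum(weights[severity] for severity in errors)
    return max(0.0, 100.0 * (1 - penalty / word_count))

# 1,000-word document with eight minor and one major error:
print(round(mqm_score(1000, ["minor"] * 8 + ["major"]), 1))  # 98.7
```

Under this formulation, the 95–99+ scores reported later in this page correspond to only a handful of penalty points per thousand words.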

Language-tier performance

| Tier | Languages | Human parity | Examples |
|---|---|---|---|
| Tier 1 | English | 80–93% | Baseline |
| Tier 2 | Major global | 70–80% | French, German, Japanese, Chinese, Spanish |
| Tier 3 | Low-resource | 55–70% | Many African and Southeast Asian languages |

Production metrics: MTPE workflow

Machine Translation Post-Editing (MTPE) — where a human reviews and corrects AI output — is the dominant production workflow.

| Metric | MTPE | Human from scratch | Improvement |
|---|---|---|---|
| Words per hour | 700–1,500 | 250–300 | 3–5x faster |
| Cost per word | $0.045–0.10 | $0.09–0.35 | 50–70% cheaper |
| Quality (MQM) | 95–99+ | 95–98 | Comparable |

Source: Industry benchmarks, Nimdzi, GALA.
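The speed and cost multipliers in the table follow directly from the per-word figures. A quick back-of-the-envelope check using mid-range values (the 100,000-word project size is illustrative):

```python
words = 100_000                       # illustrative project size
mtpe_wph, scratch_wph = 1_100, 275    # mid-range words per hour
mtpe_cpw, scratch_cpw = 0.07, 0.22    # mid-range USD per word

speedup = (words / scratch_wph) / (words / mtpe_wph)   # hours saved ratio
savings = 1 - (words * mtpe_cpw) / (words * scratch_cpw)
print(f"{speedup:.1f}x faster, {savings:.0%} cheaper")  # 4.0x faster, 68% cheaper
```

Both results land inside the table's 3–5x and 50–70% ranges, as expected.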

Enterprise deployments

| Company | Scale | Result |
|---|---|---|
| Lionbridge Aurora AI | 180,000 employees, 9 languages | 30% reduction in turnaround times |
| Holiday Extras | Company-wide | 500+ hours saved weekly, $500K annual savings, 95% adoption |
| Smartling / Fortune 500 | 50M+ words annually | $3.4M first-year cost savings, 99+ MQM quality scores |

Technical advances (2024–2025)

Key innovations that drove the quality improvement:

| Innovation | Impact | Source |
|---|---|---|
| Chain-of-Dictionary prompting | 13x improvement in chrF++ scores | Academic research |
| TEaR framework | +2.48 BLEU average, up to +6.88 for EN→DE | Academic research |
| Document-level translation | BLEU from 6.78 to 34.06 (5x improvement) | Academic research |
| Retrieval-augmented translation | 129% improvement over NLLB baseline | Academic research |
| ALMA training paradigm | >12 BLEU/COMET improvements with 7–13B params | Academic research |
| Context window expansion | 100K–1M tokens enables full-document translation | Model releases |
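Chain-of-Dictionary prompting, listed above, prepends dictionary hints for rare source words to the translation prompt so the model does not have to guess terminology. A hedged sketch of the prompt construction (the glossary entries and template wording are illustrative, not the paper's exact format):

```python
def chain_of_dictionary_prompt(source, src_lang, tgt_lang, glossary):
    """Build a translation prompt that chains dictionary hints for
    rare words, in the spirit of Chain-of-Dictionary prompting.
    Illustrative template, not the published one."""
    hints = "\n".join(
        f'"{word}" means "{meaning}".' for word, meaning in glossary.items()
    )
    return (
        f"{hints}\n"
        f"Translate the following {src_lang} text into {tgt_lang}:\n"
        f"{source}"
    )

prompt = chain_of_dictionary_prompt(
    "Der Vergaser ist verstopft.", "German", "English",
    {"Vergaser": "carburetor", "verstopft": "clogged"},
)
print(prompt)
```

The published method chains hints across several pivot languages; the single-language version here shows the core idea of grounding rare terminology before asking for the translation.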

Industry adoption

| Metric | Number | Period | Source |
|---|---|---|---|
| LSPs using LLMs for MT | 29% (up from 11%) | 2023 → 2024 | Slator |
| European language professionals using MT | 70% | 2024 | ELIS survey |
| MTPE adoption rate | 46% (up from 26%) | 2022 → 2024 | Industry survey |
| AI translation market size | $1.5B → projected $4.1B | 2022 → 2030 | Market research |
| Global language services market | $71.7B | 2024 | Nimdzi Insights |

Remaining limitations

For completeness — where AI translation still falls short:

| Limitation | Detail |
|---|---|
| Low-resource languages | 60–75% human parity vs 85–90% for high-resource pairs |
| Literary/creative content | Technically correct but culturally flat; human involvement essential |
| Speed | LLM translation runs 25–100x slower than traditional NMT (irrelevant for docs, which translate once and serve statically) |
| Post-editing rate | 30–70% of output still needs some human review for publication quality in critical domains; ~20% for technical documentation |
| Cultural nuance | AI produces accurate translations that may miss cultural connotations; most relevant for marketing, less so for technical docs |