How good is AI translation in 2025–2026? Here's what the benchmarks show, from human-judged competitive evaluations to production deployment metrics.
WMT competition results
The Workshop on Machine Translation (WMT) is the primary competitive evaluation for translation systems, with human judges evaluating output across dozens of language pairs.
WMT 2024
| Finding | Detail |
|---|---|
| Winner | Claude 3.5 Sonnet — 9 of 11 language pairs |
| "Good" translation rate | 78% across German, Polish, Russian |
| Significance | A general-purpose LLM beat dedicated translation systems |
Source: WMT 2024 General Translation Task results.
WMT 2025
| Finding | Detail |
|---|---|
| Scale | 36 teams, 32 language pairs |
| Designation | Officially the "LLM Era" of machine translation |
| Frontier LLMs | Gemini 2.5 Pro, GPT-4.1, Claude 4 — consistent excellence without translation-specific training |
| Specialized winner | Tencent Hunyuan-MT (7B params) — rank 1.0 in 30/31 categories (with metric-gaming caveats) |
| Open-weight | Tower-7B surpassed GPT-4o despite being 100x smaller |
Source: WMT 2025 proceedings.
Translation quality by model
| Model | Key strength | Notable benchmark |
|---|---|---|
| Claude 3.5 Sonnet | European languages, formal docs | 92.8% English performance; WMT 2024 winner |
| GPT-4 | Conversational content | 0.88 BLEU vs traditional MT's 0.82 |
| Gemini 2.5 Pro | Long-context comprehension | 94.5% accuracy on 128K reading tasks |
| DeepSeek R1 | Chinese language, cost efficiency | 90.9% CMMLU; 10x lower cost than competitors |
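The BLEU figures cited above come from corpus-level tooling such as sacreBLEU; as a reference point for what the metric measures, here is a minimal sentence-level BLEU sketch (uniform 1–4-gram weights, clipped precisions, brevity penalty, crude smoothing). It is illustrative only, not the implementation used in formal evaluations.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(hypothesis, reference, max_n=4):
    """Toy BLEU: geometric mean of clipped n-gram precisions x brevity penalty."""
    hyp, ref = hypothesis.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        # Clip each n-gram's count to its count in the reference
        clipped = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = max(sum(hyp_counts.values()), 1)
        # Crude smoothing: avoid log(0) when a higher-order n-gram never matches
        log_precisions.append(math.log(max(clipped, 1e-9) / total))
    # Brevity penalty punishes hypotheses shorter than the reference
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * math.exp(sum(log_precisions) / max_n)

print(round(sentence_bleu("the cat sat on the mat", "the cat sat on the mat"), 2))  # identical sentences score 1.0
```

Real evaluations also apply standardized tokenization and corpus-level aggregation, which is why scores from different papers are only comparable when the tooling matches.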
AI suggestion acceptance rates
| Metric | Number | Source |
|---|---|---|
| AI suggestion acceptance rate (Lokalise) | 82.6% | Lokalise production data |
| Content requiring no post-editing (technical docs) | 80% | Industry benchmark |
| GPT-4 vs human translators (MQM errors) | Junior-to-medium level match | Academic study (1,600 technical sentences) |
Human parity by language tier
| Tier | Languages | Human parity | Examples |
|---|---|---|---|
| Tier 1 | English | 80–93% | Baseline |
| Tier 2 | Major global | 70–80% | French, German, Japanese, Chinese, Spanish |
| Tier 3 | Low-resource | 55–70% | Many African and Southeast Asian languages |
Production metrics: MTPE workflow
Machine Translation Post-Editing (MTPE) — where a human reviews and corrects AI output — is the dominant production workflow.
| Metric | MTPE | Human from scratch | Improvement |
|---|---|---|---|
| Words per hour | 700–1,500 | 250–300 | 3–5x faster |
| Cost per word | $0.045–0.10 | $0.09–0.35 | 50–70% cheaper |
| Quality (MQM) | 95–99+ | 95–98 | Comparable |
Source: Industry benchmarks, Nimdzi, GALA.
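Plugging the midpoints of the ranges above into a back-of-the-envelope calculation shows what they mean for a concrete project (the 50,000-word project size is hypothetical; all figures are illustrative):

```python
# Back-of-the-envelope comparison for a hypothetical 50,000-word project,
# using midpoints of the ranges in the table above (illustrative only).
WORDS = 50_000

mtpe_wph, human_wph = 1_100, 275   # words per hour (range midpoints)
mtpe_cpw, human_cpw = 0.07, 0.22   # cost per word in USD (range midpoints)

mtpe_hours = WORDS / mtpe_wph      # ~45.5 hours
human_hours = WORDS / human_wph    # ~181.8 hours
mtpe_cost = WORDS * mtpe_cpw       # $3,500
human_cost = WORDS * human_cpw     # $11,000

print(f"Speedup: {human_hours / mtpe_hours:.1f}x")       # 4.0x
print(f"Cost saving: {1 - mtpe_cost / human_cost:.0%}")  # 68%
```

Both results land inside the table's stated ranges (3–5x faster, 50–70% cheaper), as expected from using midpoints.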
Enterprise deployments
| Company | Scale | Result |
|---|---|---|
| Lionbridge Aurora AI | 180,000 employees, 9 languages | 30% reduction in turnaround times |
| Holiday Extras | Company-wide | 500+ hours saved weekly, $500K annual savings, 95% adoption |
| Smartling / Fortune 500 | 50M+ words annually | $3.4M first-year cost savings, 99+ MQM quality scores |
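MQM figures like "99+" come from penalty-based scoring: reviewers tag errors by severity, and weighted penalties are subtracted from 100, typically normalized per 1,000 words. The sketch below uses one common formulation; the severity weights and normalization are illustrative, since real MQM scorecards define their own.

```python
def mqm_score(word_count, minor=0, major=0, critical=0):
    """Toy MQM-style score: 100 minus weighted error penalties per 1,000 words.
    Weights (minor=1, major=5, critical=10) are illustrative; real MQM
    scorecards define their own severity multipliers and normalization."""
    penalties = minor * 1 + major * 5 + critical * 10
    return 100 - 1000 * penalties / word_count

# A 10,000-word job with 15 minor errors and 1 major error scores 98.0
print(mqm_score(10_000, minor=15, major=1))
```

Under this formulation, hitting 99+ on a 10,000-word job leaves room for at most 10 penalty points, i.e., only a handful of minor errors.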
Technical advances (2024–2025)
Key innovations that drove the quality improvement:
| Innovation | Impact | Source |
|---|---|---|
| Chain-of-Dictionary prompting | 13x improvement in chrF++ scores | Academic research |
| TEaR framework | +2.48 BLEU average, up to +6.88 for EN→DE | Academic research |
| Document-level translation | BLEU from 6.78 to 34.06 (5x improvement) | Academic research |
| Retrieval-augmented translation | 129% improvement over NLLB baseline | Academic research |
| ALMA training paradigm | >12 BLEU/COMET improvements with 7–13B params | Academic research |
| Context window expansion | 100K–1M tokens enables full-document translation | Model releases |
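Chain-of-Dictionary prompting, the first row above, works by prepending chained multilingual dictionary entries for rare source words to the translation prompt. A minimal sketch of the prompt construction follows; the glossary, the lookup, and the exact template are placeholders (the published method extracts these chains automatically), not the paper's implementation.

```python
# Hypothetical glossary: source word -> chained hints across pivot languages.
# The real method builds these chains automatically; this lookup is a placeholder.
GLOSSARY = {
    "saudade": {"English": "longing", "French": "nostalgie", "German": "Sehnsucht"},
}

def chain_of_dictionary_prompt(source_text, src_lang, tgt_lang):
    """Prepend chained dictionary hints for known rare words, then state the task."""
    hints = []
    for word in source_text.lower().split():
        chain = GLOSSARY.get(word.strip(".,!?"))
        if chain:
            links = " means ".join(f'"{t}" in {lang}' for lang, t in chain.items())
            hints.append(f'"{word}" means {links}.')
    hint_block = "\n".join(hints)
    return (f"{hint_block}\n"
            f"Translate the following {src_lang} text into {tgt_lang}:\n"
            f"{source_text}")

print(chain_of_dictionary_prompt("Que saudade!", "Portuguese", "English"))
```

The chained hints give the model footholds for rare or untranslatable words, which is where the technique's low-resource gains come from.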
Industry adoption
| Metric | Number | Period | Source |
|---|---|---|---|
| LSPs using LLMs for MT | 29% (up from 11%) | 2023 → 2024 | Slator |
| European language professionals using MT | 70% | 2024 | ELIS survey |
| MTPE adoption rate | 46% (up from 26%) | 2022 → 2024 | Industry survey |
| AI translation market size | $1.5B → projected $4.1B | 2022 → 2030 | Market research |
| Global language services market | $71.7B | 2024 | Nimdzi Insights |
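For context, the market projection above implies a compound annual growth rate of roughly 13%. A quick check, using only the figures from the table:

```python
# CAGR implied by $1.5B (2022) -> projected $4.1B (2030), per the table above
start, end, years = 1.5, 4.1, 2030 - 2022
cagr = (end / start) ** (1 / years) - 1
print(f"{cagr:.1%}")  # ~13.4% per year
```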
Remaining limitations
For completeness — where AI translation still falls short:
| Limitation | Detail |
|---|---|
| Low-resource languages | 60–75% human parity vs 85–90% for high-resource pairs |
| Literary/creative content | Technically correct but culturally flat; human involvement essential |
| Speed | LLM translation runs 25–100x slower than traditional NMT (irrelevant for docs, which translate once and serve statically) |
| Post-editing rate | 30–70% of output still needs some human review for publication quality in critical domains; ~20% for technical documentation |
| Cultural nuance | AI produces accurate translations that may miss cultural connotations; most relevant for marketing, less so for technical docs |