I fed the same 12-page contract into ChatGPT and Gemini and got two completely different translations. One preserved the legal style and terminology; the other turned “Vertragspartner” into “agreement parties” instead of “contracting parties.” Whether you’re a translator looking for the best AI assistant or a client trying to figure out whether you can trust machine translation, let’s look at the numbers and real examples to see which model actually translates better.
## Gemini and GPT-4o in 2026: the quick overview
Both models have evolved dramatically over the past year, and comparing them based on last year’s benchmarks is pointless.
GPT-4o by OpenAI is a multimodal model that works with text, images, and audio. Its context window is 128,000 tokens (roughly 200 pages of text). Available through ChatGPT Plus for $20/month or via API. In translation, GPT-4o excels at recognizing idioms, preserving tone, and working with context - in one benchmark, the model showed just a 3% error rate when translating idiomatic expressions.
Gemini 2.5 Pro by Google is a thinking model with a context window of 1 million tokens (roughly 1,500 pages). Available through Google AI Pro for $19.99/month or via API. The main advantage is its ability to “think” before responding and process massive documents in one go, without chunking. At WMT25 (the world’s largest machine translation competition), Gemini 2.5 Pro landed in the top cluster for quality across 14 of 16 language pairs.
Both tools already translate pretty well. The question is in the details - and those details can cost you money or reputation if you pick the wrong tool for a specific task.
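The context-window figures above have a practical consequence: whether a document fits in a single request or has to be chunked. A rough feasibility check can be sketched in a few lines; the ~650-tokens-per-page figure is an assumption derived from the "128K tokens ≈ 200 pages" estimate above, not a measured constant, and `fits_in_context` is an illustrative helper, not any SDK's API.

```python
# Rough check: does a document fit in one request, or does it need chunking?
# Assumption: ~650 tokens per page, in line with "128K tokens ≈ 200 pages".

TOKENS_PER_PAGE = 650

CONTEXT_WINDOWS = {
    "gpt-4o": 128_000,
    "gemini-2.5-pro": 1_000_000,
}

def fits_in_context(pages: int, model: str, reserve_for_output: int = 8_000) -> bool:
    """True if a document of `pages` pages fits in one request,
    leaving headroom for the model's translated output."""
    needed = pages * TOKENS_PER_PAGE + reserve_for_output
    return needed <= CONTEXT_WINDOWS[model]

# A 300-page contract: needs chunking for GPT-4o, fits Gemini in one go.
print(fits_in_context(300, "gpt-4o"))          # False
print(fits_in_context(300, "gemini-2.5-pro"))  # True
```

Chunking isn't just inconvenient: every split point is a place where terminology and pronoun references can drift between chunks, which is exactly the failure mode a 1M-token window avoids.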
## Benchmarks: who translates better by the numbers
Let’s look at actual results, not marketing promises.
### WMT25 - the world championship of translation
At WMT25 (Workshop on Machine Translation, August-November 2025), evaluators tested 60 translation systems, including LLM models and traditional tools like DeepL and Google Translate. Human evaluators used ESA methodology (Error Span Annotation - marking specific errors in the text rather than just rating “like/dislike”).
Result: Gemini 2.5 Pro took first place - landing in the top cluster for 14 of 16 language pairs. GPT-4o performed well too, but didn’t come close to that level of dominance.
### Lokalise blind test: what professional translators think
In 2025, the Lokalise platform ran a blind test - professional translators evaluated translations from different models without knowing which AI produced them. The surprising result: Claude 3.5 got the highest rating - 78% of its translations were rated “good.” GPT-4o and DeepL came in second. Claude also ranked first in 9 of 11 language pairs at WMT24.
### BLEU scores for European languages
BLEU (Bilingual Evaluation Understudy) is an automated metric that compares machine translation against a reference. For the English→German pair:
| Model | BLEU score (EN→DE) |
|---|---|
| DeepL | 64.5 |
| Gemini 2.5 Pro | ~63-64 |
| GPT-4o | 62.1 |
| Google Translate | ~60 |
DeepL still leads for European language pairs on automated metrics, but the gap with LLM models shrinks with every update.
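To make the BLEU numbers above less abstract: BLEU rewards n-gram overlap with a reference translation, penalized for overly short output. Real benchmarks use corpus-level tooling (e.g. sacreBLEU); the toy sentence-level version below is only a sketch to show what the metric rewards, and the example sentences are invented for illustration.

```python
from collections import Counter
import math

def simple_bleu(candidate: str, reference: str, max_n: int = 2) -> float:
    """Toy sentence-level BLEU: geometric mean of modified n-gram
    precisions times a brevity penalty. Illustration only."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        precisions.append(max(overlap, 1e-9) / max(sum(cand_ngrams.values()), 1))
    brevity = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(sum(math.log(p) for p in precisions) / max_n)

gold = "the contracting parties agree to the following terms"
print(simple_bleu(gold, gold))  # 1.0: exact match scores perfectly
print(simple_bleu("the agreement parties accept the following terms", gold))
```

This also shows BLEU's blind spot: a fluent paraphrase with different wording scores low even when the meaning is fine, which is one reason WMT25 relied on human error-span annotation rather than BLEU alone.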
### What this means in practice
Benchmarks are useful, but they measure averages. In practice, quality depends on the specific language pair, text type, and even the prompt you use. The same GPT-4o can brilliantly translate marketing copy - and completely butcher a legal contract if you give it the wrong instructions.
## Ukrainian, German, Russian: how both models perform
Most benchmarks focus on English, Spanish, Chinese - but what about the language pairs that matter most if you’re working between Ukraine and Germany? This is where it gets interesting.
### Ukrainian
Here’s the honest truth: Ukrainian is a mid-resource language for AI. Neither GPT-4o nor Gemini was trained on nearly as much Ukrainian text as English or German.
GPT-4o shows noticeable improvement over previous versions, but still occasionally produces calques and inaccuracies in terminology. Legal and medical texts are where it struggles most. It handles conversational language and everyday texts well.
Gemini 2.5 Pro benefits from Google’s integration of Gemini technology into Google Translate, which improved quality for 100+ languages including Ukrainian. Gemini works better with context thanks to its large window - if you load a glossary or translation examples, the output improves significantly.
One translator shared on a forum: “For everyday texts, both AI models give acceptable Ukrainian translations. But as soon as you get into legal documents or medical reports - you can’t skip manual review. ChatGPT more often preserves sentence structure, while Gemini sometimes paraphrases too freely.”
### German
The situation is better for both models here - German is in the top 10 best-supported languages:
GPT-4o is strong at translating complex compound nouns (zusammengesetzte Substantive - “Aufenthaltserlaubnis”, “Krankenversicherungsbeitrag”), which is critical for legal and technical texts. It correctly distinguishes formal and informal address (Sie/du).
Gemini 2.5 Pro handles long German sentences well - with a 1M token context, it doesn’t “forget” the beginning of a paragraph when translating the end. For German legal prose, where half-page sentences are the norm, the difference is noticeable.
### Russian
An interesting finding: a study of medical text translation through GPT-4o showed factual accuracy of 84% in English but only 69% in Russian. That’s a significant gap. Gemini shows a similar trend - quality drops for languages with less training data.
For translating documents from Ukrainian to German or from Russian to German - no AI can replace a sworn translation for official documents yet. But as a tool for creating a draft before post-editing, both models are already very useful.
## The price question: how much does AI translation cost
| | ChatGPT (GPT-4o) | Gemini 2.5 Pro |
|---|---|---|
| Subscription | $20/mo (Plus) | $19.99/mo (AI Pro) |
| Free access | Yes, with limits | Yes, with limits |
| API: input tokens | $2.50 per 1M | $1.25 per 1M |
| API: output tokens | $10.00 per 1M | $10.00 per 1M |
| Context window | 128K tokens | 1M tokens |
| Top tier | $200/mo (Pro) | ~$42/mo (Ultra) |
On the API, Gemini 2.5 Pro is half the price on input tokens - that adds up fast if you’re translating large volumes. For a 100-page document via API:
- GPT-4o: roughly $0.50-1.00
- Gemini 2.5 Pro: roughly $0.25-0.75
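The arithmetic behind those estimates can be sketched from the table's API prices. The assumptions (not from the pricing table itself) are ~650 input tokens per page and an output roughly as long as the input; prompt overhead is ignored.

```python
# API cost estimate for a translated document, using the table's prices.
# Assumptions: ~650 input tokens per page; output ≈ same length as input.

PRICES_PER_M = {                      # USD per 1M tokens: (input, output)
    "gpt-4o": (2.50, 10.00),
    "gemini-2.5-pro": (1.25, 10.00),
}

def translation_cost(pages: int, model: str, tokens_per_page: int = 650) -> float:
    tokens = pages * tokens_per_page
    price_in, price_out = PRICES_PER_M[model]
    return tokens / 1e6 * price_in + tokens / 1e6 * price_out

for model in PRICES_PER_M:
    print(f"{model}: ${translation_cost(100, model):.2f}")  # 100-page document
```

Note that at these prices most of the bill is output tokens, where the two models cost the same - so Gemini's cheaper input tokens matter most when you pack large context (glossaries, reference documents) alongside the text to translate.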
If you’re using subscriptions (ChatGPT Plus or Google AI Pro) - the difference is minimal, both cost ~$20/month and provide enough for a translator’s daily work. $40/month for both subscriptions combined is less than the cost of one manual translation of a medium-sized document.
## Translation prompts: how to get the best results
An AI translator isn’t Google Translate where you paste text and hit a button. Translation quality through LLMs depends directly on your prompt. Here are proven approaches that work for both GPT-4o and Gemini:
### Assign a role
Instead of “translate this text” - tell the AI who it is:
“You are a certified legal translator with 15 years of experience translating between Ukrainian and German. Translate the following contract, preserving legal terminology and formal style.”
### Two-step translation
Ask the AI to first do an accurate translation, then adapt:
“Step 1: Provide an accurate translation of this text from Ukrainian to German. Step 2: Review the translation for naturalness and compliance with German legal style. Fix anything that sounds like a calque from Ukrainian.”
### Add a glossary
This works especially well with Gemini thanks to its large context:
“Use this glossary when translating: Birth certificate = Geburtsurkunde, Criminal record clearance = Führungszeugnis, Diploma = Hochschulabschluss…”
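If you translate via the API rather than the chat UI, the role and glossary techniques above combine naturally into one assembled prompt. The glossary terms below come from this article; `build_prompt` itself is an illustrative sketch, not part of any SDK.

```python
# Assembling role + glossary + text into one prompt string, which you would
# then send as the user message in an OpenAI or Gemini API call.

GLOSSARY = {
    "Birth certificate": "Geburtsurkunde",
    "Criminal record clearance": "Führungszeugnis",
    "Diploma": "Hochschulabschluss",
}

def build_prompt(text: str, source: str, target: str, glossary: dict) -> str:
    terms = "\n".join(f"- {src} = {tgt}" for src, tgt in glossary.items())
    return (
        f"You are a certified legal translator working from {source} to {target}.\n"
        f"Use this glossary; these renderings are mandatory:\n{terms}\n\n"
        f"Translate, preserving legal terminology and formal style:\n{text}"
    )

prompt = build_prompt("Please enclose your birth certificate.",
                      "English", "German", GLOSSARY)
print(prompt)
```

Keeping the glossary in code rather than pasting it by hand also means every segment of a long job gets the identical terminology instructions - the main source of inconsistency in multi-chunk translations.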
### Back-translation for verification
Ask the AI to translate the result back into the source language - this helps spot inaccuracies and “hallucinations” (when the model invents things that weren’t in the original).
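The back-translation check can even be partly automated: compare the round-tripped text against the source and flag segments that drifted too far for human review. The sketch below uses `difflib.SequenceMatcher`, a crude surface-similarity proxy that catches omissions and hallucinated additions but not subtle terminology errors; `needs_review` and the 0.7 threshold are illustrative choices, not an established standard.

```python
# Cheap back-translation QA screen: flag segments where the round trip
# has drifted far from the source text.
from difflib import SequenceMatcher

def needs_review(source: str, back_translation: str, threshold: float = 0.7) -> bool:
    """True if the back-translated text is too dissimilar to the source."""
    ratio = SequenceMatcher(None, source.lower(), back_translation.lower()).ratio()
    return ratio < threshold

print(needs_review("The contract ends on 31 December.",
                   "The contract ends on December 31."))    # False: harmless reordering
print(needs_review("The contract ends on 31 December.",
                   "The agreement may be renewed yearly."))  # True: meaning drifted
```

A flagged segment isn't necessarily wrong - legitimate paraphrase also lowers the score - but it tells you where to spend your manual-review time first.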
We’ve put together more prompts and examples in our guide to using ChatGPT and Claude for translation.
## When to choose Gemini, when to choose GPT-4o
There’s no universal “X is better than Y” answer - it all depends on the task:
| Task | Better fit | Why |
|---|---|---|
| Long documents (50+ pages) | Gemini 2.5 Pro | 1M token context - doesn’t “forget” earlier content |
| Legal texts | GPT-4o | Better at preserving formal style and terminology |
| High volume via API | Gemini 2.5 Pro | Half the price on input tokens |
| Creative texts, marketing | GPT-4o | Better at adapting tone and cultural nuances |
| Translation from scans or photos | Both | Multimodal, both handle images |
| Preserving formatting | GPT-4o | More stable at maintaining document structure |
If you’re a freelance translator, the optimal strategy is to have access to both and use each for its strengths. And if you want to understand how to integrate AI into your workflow - check out our MTPE guide.
## What about Claude and DeepL?
It wouldn’t be fair to skip two other serious players.
Claude (Anthropic) - in Lokalise’s blind test, professional translators rated Claude 3.5’s translations the highest. It’s especially strong at preserving tone, humor, and literary style. If you’re translating fiction or anything where keeping the author’s “voice” matters - Claude is worth trying.
DeepL - still leads for European language pairs on automated metrics and requires the least post-editing. In blind tests, DeepL translations required 2-3x fewer edits than GPT-4o or Google Translate output. But for Ukrainian and Russian, DeepL quality is noticeably lower than for German or French - both users and benchmarks confirm this.
We did a detailed comparison of DeepL and Google Translate for Ukrainian earlier - with specific examples.
## FAQ
### Which AI translates best in 2026?
Based on WMT25 results, Gemini 2.5 Pro showed the best results overall - top cluster in 14 of 16 language pairs. But for specific tasks, results differ: DeepL is better for European languages by BLEU scores, Claude is best for preserving style and tone, GPT-4o handles legal texts with precise terminology well. The best approach is to test on your own text types.
### Can I use Gemini or GPT-4o for official document translation?
No, AI translation has no legal standing. For submitting documents to German authorities, you need a certified translation from a sworn translator (beeidigter Übersetzer - a translator who has taken an oath in a German court). But you can use AI to create a draft that a professional translator then reviews and certifies - this significantly speeds up the process.
### How much does translation through ChatGPT or Gemini cost?
ChatGPT Plus costs $20/month, Google AI Pro (Gemini) is $19.99/month. Both subscriptions provide enough for everyday translation work. Via API, translating a 100-page document costs $0.25-1.00 depending on the model - orders of magnitude cheaper than manual translation.
### Do Gemini and GPT-4o translate Ukrainian well?
Better than a year ago, but still not perfect. For everyday texts and general correspondence, the quality is acceptable. For legal, medical, and technical documents, you definitely need a human translator to review. Pay close attention to terminology - both models sometimes substitute Russian-influenced forms or English calques when translating Ukrainian.
### Which is better for MTPE - Gemini or GPT-4o?
For machine translation post-editing (MTPE), both models work well but with different strengths. GPT-4o generates a “cleaner” initial translation that needs fewer edits. Gemini handles large context better - you can load an entire document with a glossary and Translation Memory, and the model will account for this when translating each segment.