Pasted a paragraph of legal text into DeepL - got a translation in 200 milliseconds. Pasted the same paragraph into ChatGPT with the prompt “translate as a legal translator with 10 years of experience” - waited 8 seconds, but got a translation that accounts for context from previous pages and maintains consistent terminology. Two tools, two approaches, and both are called “machine translation.” But under the hood, they work completely differently. If you’re a translator trying to figure out what you’re actually using and when each tool gives better results - let’s break it all down with real examples.
NMT and LLM: what they are and how they differ¶
Let’s start with the basics, because people mix these up all the time.
NMT (Neural Machine Translation) is a specialized model trained exclusively for translation. DeepL, Google Translate, Microsoft Translator, Amazon Translate - these are all NMT systems. They’re trained on millions of parallel text pairs (source + translation) and do exactly one thing: convert text from language A to language B.
LLM (Large Language Model) is a general-purpose model that can do everything: write code, answer questions, generate text, and translate as well. GPT-4o, Gemini 2.5 Pro, Claude - these are LLMs. Translation is one of thousands of tasks for them, not their sole specialization.
Simple analogy: NMT is a cardiac surgeon who’s been doing heart operations for 20 years. LLM is a general practitioner with encyclopedic knowledge who can diagnose, prescribe, and answer complex questions. The cardiac surgeon will do heart surgery better. But the GP will see the big picture that the specialist might miss.
Architecture: why they work differently¶
This isn’t just theory - architecture directly affects the quality, speed, and limitations of each approach.
NMT: encoder-decoder¶
Classic NMT uses an encoder-decoder architecture. The encoder reads the input sentence in its entirety, creates a numerical representation (embedding), and the decoder generates the translation in the target language word by word. The key point: the encoder “sees” the input sentence fully, left to right and right to left. This gives the model a deep understanding of sentence structure.
But there’s an important limitation: NMT works sentence by sentence. It translates one sentence at a time, without context from the surrounding sentences. If the first sentence mentions “the company” and the tenth says “it,” NMT won’t connect those two words.
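The difference is easiest to see in how the two approaches package a request. A toy sketch in Python - the `translate(...)` strings below stand in for real API calls, which would look different in practice:

```python
import re

def split_into_segments(document: str) -> list[str]:
    """Naive sentence splitter, the way a CAT tool segments text."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]

def nmt_style_calls(document: str) -> list[str]:
    """Each NMT call carries exactly one sentence - no surrounding context."""
    return [f"translate(text={seg!r})" for seg in split_into_segments(document)]

def llm_style_call(document: str) -> str:
    """One LLM call carries the whole document, so 'it' can resolve to 'the company'."""
    return f"translate(text={document!r})"

doc = "The company filed a claim. It was rejected."
for call in nmt_style_calls(doc):
    print(call)   # two separate requests, each blind to the other
print(llm_style_call(doc))  # one request with full context
```

In the second NMT call, “It” arrives with no antecedent at all - the engine has to guess the grammatical gender in the target language.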
LLM: decoder-only¶
LLMs work differently - they only have a decoder. The input text (your prompt + text to translate) and the output translation are processed as a single token sequence. The model generates each next word based on everything that came before it - the prompt, the original, and the already-generated part of the translation.
The main advantage: LLMs work with the context of the entire document (within their context window - 128K tokens for GPT-4o, 1M for Gemini 2.5 Pro). If you load a glossary plus 50 pages of text, the model will consider all of it when translating each sentence.
The main drawback: LLMs weren’t trained specifically for translation. They “know how to translate” because among the trillions of texts they trained on, there were parallel translations too. But translation is a side skill for them, not their core specialization.
Speed: a 10-100x difference¶
This is where the gap is obvious even without benchmarks - just open both tools and compare.
NMT systems like DeepL or Google Translate process translations in milliseconds. Paste a paragraph - the result appears instantly. Google Translate runs up to 20x faster than LLMs, and that’s not an exaggeration.
LLM models take seconds per request. GPT-4o on a typical paragraph - 3-8 seconds. Gemini - similar. For a single document, that’s manageable. But if you’ve got 10,000 segments in a CAT tool and each one needs to go through the API - the difference becomes critical.
| Parameter | NMT (DeepL, Google) | LLM (GPT-4o, Gemini) |
|---|---|---|
| Latency per segment | 50-200 ms | 3-10 sec |
| 1,000 segments | ~2-3 minutes | 1-3 hours |
| 10,000 segments | ~20-30 minutes | 10-30 hours |
| Suitable for real-time | Yes | No |
For a freelance translator working in Trados, memoQ, or Smartcat - the speed difference between an NMT plugin and an LLM plugin is night and day. NMT gives you a suggestion instantly, LLM makes you wait several seconds for each segment.
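The totals in the table follow from simple arithmetic: per-request latency times segment count, assuming strictly sequential requests (real pipelines batch and parallelize, which can shrink the LLM gap). A quick sanity check:

```python
def total_time_hours(segments: int, latency_seconds: float) -> float:
    """Sequential processing time if every segment is a separate request."""
    return segments * latency_seconds / 3600

# NMT at 50-200 ms per segment vs LLM at 3-10 s, for 10,000 segments
nmt_low, nmt_high = total_time_hours(10_000, 0.05), total_time_hours(10_000, 0.2)
llm_low, llm_high = total_time_hours(10_000, 3), total_time_hours(10_000, 10)
print(f"NMT: {nmt_low:.2f}-{nmt_high:.2f} h, LLM: {llm_low:.1f}-{llm_high:.1f} h")
```

The pure-latency numbers come out slightly lower than the table (which includes network and API overhead), but the order-of-magnitude gap is the same.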
Cost: who’s cheaper at scale¶
For an individual translator, the difference might seem negligible. But at scale, it becomes a deciding factor.
| Service | NMT API pricing | LLM API pricing |
|---|---|---|
| Google Translate API | $20 per 1M characters | - |
| DeepL API Pro | €5.49/mo + $25 per 1M characters | - |
| GPT-4o API | - | $2.50 input + $10 output per 1M tokens |
| Gemini 2.5 Pro API | - | $1.25 input + $10 output per 1M tokens |
| Claude API | - | $3 input + $15 output per 1M tokens |
At first glance, LLMs might seem more expensive. But if you calculate the cost of translating a 100-page document via API:
- Google Translate: ~$5-8
- DeepL: ~$6-10
- GPT-4o: ~$0.50-1.50
- Gemini 2.5 Pro: ~$0.25-0.75
Here’s the paradox: LLMs via API are often cheaper than NMT for one-off large documents, because billing is in tokens rather than characters, and the ratio is different. But for streaming work with thousands of short segments, NMT wins on speed and integration simplicity.
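The paradox is easy to verify with the listed prices. A back-of-the-envelope estimator, assuming roughly 300,000 characters per 100 pages and the common (but rough) rule of thumb of ~4 characters per token:

```python
def nmt_cost_usd(characters: int, usd_per_million_chars: float) -> float:
    """NMT APIs bill per character."""
    return characters / 1_000_000 * usd_per_million_chars

def llm_cost_usd(input_tokens: int, output_tokens: int,
                 usd_in_per_m: float, usd_out_per_m: float) -> float:
    """LLM APIs bill per token, with separate input and output rates."""
    return (input_tokens / 1e6) * usd_in_per_m + (output_tokens / 1e6) * usd_out_per_m

# A 100-page document: ~300,000 characters, ~75,000 tokens each way
chars, tokens = 300_000, 75_000
print(f"Google Translate: ${nmt_cost_usd(chars, 20):.2f}")
print(f"GPT-4o:           ${llm_cost_usd(tokens, tokens, 2.50, 10):.2f}")
```

With these assumptions, Google Translate lands around $6 and GPT-4o under $1 - consistent with the ranges above. The character-to-token ratio varies by language, so treat this as an estimate, not an invoice.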
If you’re using subscriptions (ChatGPT Plus $20/mo, Google AI Pro $19.99/mo) - that’s a fixed price for unlimited translations through the interface. DeepL Pro for translators starts at €8.74/mo. We covered the real cost of document translation in detail in a separate article.
Translation quality: who wins¶
Short answer: it depends on the text type and language pair. The longer answer is below.
Where NMT is stronger¶
NMT is built for translation and is more consistent in its results:
- Technical documentation: manuals, specifications, instructions - NMT delivers consistently good results because it’s trained on millions of similar texts
- Short fragments: individual sentences, UI elements, menu items - NMT has no competition here in speed and accuracy
- European language pairs: for DE↔EN, FR↔EN, ES↔EN, DeepL still shows the highest BLEU scores among all systems
- Repetitive text: if you’re translating 1,000 similar product descriptions, NMT will deliver uniform quality with no surprises
Where LLM is stronger¶
LLMs win where context understanding matters:
- Long documents: thanks to a context window of hundreds of thousands of tokens, LLMs maintain consistent terminology throughout the entire document, instead of translating sentences in isolation
- Creative text: marketing, ad slogans, literary translations - LLMs are better at adapting tone, humor, cultural nuances
- Translation with instructions: you can tell an LLM “translate formally, use Sie, here’s a glossary” - and the model will follow. That doesn’t work with NMT
- Non-standard pairs: for languages with less training data (like Ukrainian↔German), LLMs sometimes produce better results because they “understand” the language more deeply than an NMT system trained on a limited parallel corpus
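A minimal sketch of what “translation with instructions” looks like in practice: packing the register, glossary, and source text into one prompt. The wording is illustrative - a real workflow would send this string to an LLM API:

```python
def build_translation_prompt(text: str, glossary: dict[str, str],
                             register: str = "formal") -> str:
    """Combine instructions, glossary, and source text into a single prompt -
    the kind of control that NMT engines don't expose."""
    terms = "\n".join(f"- {src} -> {tgt}" for src, tgt in glossary.items())
    return (
        f"Translate the text below into German. Register: {register}, use Sie.\n"
        f"Always use these glossary terms:\n{terms}\n\n"
        f"Text:\n{text}"
    )

prompt = build_translation_prompt(
    "The agreement terminates upon notice.",
    {"agreement": "Vertrag", "notice": "Kündigung"},
)
print(prompt)
```

With NMT there is simply no slot for this: the API accepts text and a language pair, and that's it (DeepL's glossary feature being a partial exception).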
At WMT25 (the biggest machine translation competition), LLM models dominated at the system level - Gemini 2.5 Pro landed in the top cluster in 14 of 16 language pairs. But at the individual segment level, NMT systems are still competitive, especially for well-represented language pairs.
Hallucinations: the biggest LLM risk¶
Here’s something not enough people talk about. LLMs hallucinate - they make up things that weren’t in the original. And this is a qualitatively different problem from NMT errors.
When NMT makes a mistake, it’s usually predictable: wrong case, wrong term, literal calque. You see the error and immediately know where it is.
When an LLM hallucinates, it’s sneakier. The model might add information that wasn’t in the original. Or “beautifully” rephrase a sentence in a way that changes the meaning. And it looks completely smooth - you might not catch the error unless you compare with the original word by word.
A real example: one translator shared on a forum that ChatGPT, when translating a medical report from Ukrainian to German, “added” a detail that wasn’t in the original - it specified a medication dosage, even though the original only mentioned the drug name. For a medical document, that’s a potentially dangerous error.
Research shows that LLM translation accuracy for medical texts in English is ~84%, but drops to ~69% for Russian. For legal documents, hallucination risk is a serious argument for either NMT + human review or fully human translation.
If you’re interested in the reliability of machine translation for legal texts, we covered this in detail in our article about why machine translation doesn’t work for legal documents.
When to choose NMT, when to choose LLM¶
Here’s a practical decision table:
| Situation | Choose | Why |
|---|---|---|
| UI/interface translation in a CAT tool | NMT | Speed + stability on short segments |
| Large legal contract | LLM + human review | Context + consistent terminology across document |
| Stream of 50,000 strings per day | NMT | Speed and cost at scale |
| Marketing copy, ad campaigns | LLM | Tone adaptation, creativity, transcreation |
| Live chat, customer support | NMT | Millisecond latency |
| Translation from scans/photos | LLM | Multimodal - reads images |
| Technical docs with repetitions | NMT | Consistency + speed |
| Long document with glossary | LLM | Can “hold” glossary + context |
If you’re a freelancer working with different text types, the best move is to have both. DeepL Pro for fast translations in your CAT tool, and ChatGPT Plus or Google AI Pro for complex documents where context matters.
The hybrid approach: why “either-or” is the wrong question¶
The industry has stopped asking “NMT or LLM?” - it’s asking “how to combine both?”
The hybrid approach works like this: NMT generates a quick draft, and then an LLM reviews and improves the result with full context. Or the reverse: LLM translates complex creative text, while NMT handles “routine” segments (dates, addresses, standard phrases).
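The routing step in such a pipeline can be sketched in a few lines. The heuristics below are purely illustrative - a production router would use segment metadata from the CAT tool, not regexes:

```python
import re

def route_segment(segment: str) -> str:
    """Toy routing heuristic: send 'routine' segments to fast NMT,
    everything else to a context-aware LLM. The rules are illustrative."""
    routine_patterns = [
        r"\d{1,2}[./]\d{1,2}[./]\d{2,4}",  # dates
        r"[\w\s.,-]{1,30}",                # very short standard phrases
    ]
    if any(re.fullmatch(p, segment.strip()) for p in routine_patterns):
        return "NMT"
    return "LLM"

print(route_segment("12.03.2024"))    # routine -> NMT
print(route_segment("Save changes"))  # routine -> NMT
print(route_segment("Our award-winning platform transforms "
                    "how teams collaborate across time zones."))  # -> LLM
```

The point is architectural, not the specific rules: each segment goes to the cheapest engine that can handle it, and only the hard cases pay the LLM's latency and cost.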
Platforms like Smartcat and Phrase already integrate both approaches - translators can choose the “AI engine” for each project depending on content type.
For MTPE workflows (machine translation post-editing), this is especially relevant. If you’re offering MTPE as a service, it’s worth testing both technologies on the client’s specific text type and picking whichever needs fewer edits.
DeepL recently released its own LLM model for translation, which in their tests outperforms GPT-4o. Google integrated Gemini technology into Google Translate. The line between NMT and LLM is blurring - and this is a trend that’s shaping the translation industry in 2026.
FAQ¶
What’s the main difference between LLM and NMT for translation?¶
NMT systems (DeepL, Google Translate) are specialized models trained exclusively on parallel texts for translation. They work sentence by sentence, very fast (milliseconds), and consistently. LLMs (GPT-4o, Gemini, Claude) are general-purpose models for which translation is one task among thousands, but they can account for the context of an entire document, follow instructions, and work with glossaries. NMT is a cardiac surgeon, LLM is a GP with encyclopedic knowledge.
Can LLMs fully replace NMT?¶
No, and that’s unlikely to happen anytime soon. NMT is 10-100x faster and still wins for streaming tasks: UI translation, live chat, processing thousands of short segments. For those scenarios, LLMs are too slow and expensive. But for complex documents, creative text, and context-aware translation, LLMs already deliver better results today. The optimal approach is using both technologies for different tasks.
Which approach is safer for legal documents?¶
Both approaches need human review for legal texts. But NMT makes more predictable mistakes - wrong case, calque, wrong term. LLMs can “hallucinate” - add information that wasn’t in the original, and do it so smoothly you won’t notice without careful checking. For official documents, no AI replaces a sworn translator - but both work as drafts to speed up the process, as long as you review every sentence.
Why does an LLM sometimes give a better translation than DeepL?¶
Because the LLM sees context. DeepL translates each sentence separately, with no connection to previous ones. An LLM can hold the entire document, glossary, and style instructions in context - and generate a more coherent translation. For short individual sentences, DeepL often wins. But for long texts with consistent terminology, LLMs show better results.
What should a beginner translator choose?¶
Start with free tools: DeepL (free tier) for quick translations and ChatGPT (free version) for complex texts where context matters. Once you understand the difference in practice, you’ll know whether you need a paid subscription. For most translators, DeepL Pro (~€9/mo) + ChatGPT Plus ($20/mo) covers 95% of tasks. And if you want to go deeper into AI tools, check out our guide to using ChatGPT and Claude for translation.