Translation Quality Metrics: MQM, DQF, Error Typology - How to Measure Objectively¶
One reviewer checks 50 translation segments and says: “quality looks fine.” Another reviewer checks the same 50 segments and says: “this is full of errors.” Who’s right? Without a shared evaluation system - nobody, because “fine quality” means something different to each person.
That’s exactly why formalized translation quality metrics exist - MQM, DQF, Error Typology and their derivatives. These aren’t academic toys: major clients (Google, Microsoft, the European Commission) have been using these frameworks for years to evaluate thousands of translators and millions of segments. And in 2024, MQM celebrated its 10th anniversary with an updated scoring model that finally became readable for humans, not just researchers.
Let’s break down how these metrics work, what makes them different, and how to implement them in a real workflow - even if you’re a small agency or a freelancer.
What Are Translation Quality Metrics and Why Do They Matter¶
A Translation Quality Metric is a formalized system for detecting, classifying, and counting errors in translation. Instead of subjective “I like it / I don’t,” you get a concrete numerical score you can compare across translators, projects, and time periods.
Why this matters:
- Objectivity: two reviewers working under the same framework will produce comparable results
- Comparability: you can compare translator A and translator B on the same scale
- Process improvement: if 40% of errors are terminological, you know to update the glossary - not “rewrite everything”
- SLA and contracts: a client says “quality must be at least 98 on MQM” - that’s a concrete, measurable requirement
- Client-facing argument: “your translation scored 94.2 on MQM against a threshold of 90 - here’s the report” sounds a lot more convincing than “we checked it, it’s fine”
As the MQM Council states:
The central component of MQM is a hierarchical listing of issue types derived in a careful examination of existing quality evaluation metrics.
MQM didn’t invent categories out of thin air - it systematized what the industry had already been using for decades.
MQM (Multidimensional Quality Metrics): The Industry’s Main Framework¶
MQM is a framework for analytic Translation Quality Evaluation (TQE). Created in 2014 under the EU-funded QTLaunchpad project, it’s now maintained by the MQM Council.
How MQM Works¶
The core idea is simple: an expert (reviewer) reads the translation, spots errors, and for each error determines:
- Error type - from a hierarchical taxonomy
- Severity level - neutral, minor, major, critical
Then a scoring model converts these annotations into a numerical score.
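To make the annotation step concrete, here is a minimal sketch of how a single reviewer finding could be recorded. The class and field names (`Annotation`, `segment_id`, `error_type`, `comment`) are illustrative assumptions, not part of any official MQM schema or tool.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    """The four MQM severity levels named in the text."""
    NEUTRAL = "neutral"
    MINOR = "minor"
    MAJOR = "major"
    CRITICAL = "critical"


@dataclass(frozen=True)
class Annotation:
    """One reviewer finding (hypothetical record layout)."""
    segment_id: int
    error_type: str      # path in the taxonomy, e.g. "Accuracy/Omission"
    severity: Severity
    comment: str = ""


# Two example findings from a review pass (invented data).
annotations = [
    Annotation(12, "Accuracy/Omission", Severity.MAJOR, "second sentence missing"),
    Annotation(17, "Fluency", Severity.MINOR, "typo"),
]

for a in annotations:
    print(a.segment_id, a.error_type, a.severity.value)
```

Keeping type and severity as separate fields means the same list of annotations can later feed both a score and a breakdown by category.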
MQM Error Typology¶
MQM has a tree-like structure of error types. At the top level - 7 main categories:
| Category | What It Covers | Example |
|---|---|---|
| Accuracy | Correspondence to source | Omitted sentence, wrong number translation, added information not in source |
| Fluency | Linguistic correctness of target text | Grammar errors, unnatural phrasing, typos |
| Terminology | Adherence to term bases | Using “agreement” instead of “contract” when there’s an approved glossary |
| Style | Adherence to style requirements | Overly formal tone in marketing copy |
| Design | Formatting, markup | Broken tags, wrong font, RTL issues |
| Locale Convention | Regional conventions | Date format DD/MM vs MM/DD, wrong thousands separator |
| Verity | Factual correctness | Wrong URL, outdated information |
Each category has subcategories. For example, Accuracy breaks down into Addition, Omission, Mistranslation, and Untranslated. Mistranslation further splits into False Friend, Technical Relationship Error, and even MT Hallucination - a subtype added in the MQM-Chat variant for evaluating AI translation.
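The tree structure described above can be sketched as a nested dictionary. Only the Accuracy branch is filled in, because that is the only branch this article spells out; the other categories are left empty rather than guessed. The helper name `is_known_type` and the `/`-separated path format are assumptions made for illustration.

```python
# Partial taxonomy: only the subtypes named in the text are listed.
TAXONOMY = {
    "Accuracy": {
        "Addition": {},
        "Omission": {},
        "Mistranslation": {
            "False Friend": {},
            "Technical Relationship Error": {},
            "MT Hallucination": {},
        },
        "Untranslated": {},
    },
    "Fluency": {},
    "Terminology": {},
    "Style": {},
    "Design": {},
    "Locale Convention": {},
    "Verity": {},
}


def is_known_type(path, tree=TAXONOMY):
    """Return True if a '/'-separated path exists in the taxonomy tree."""
    node = tree
    for part in path.split("/"):
        if part not in node:
            return False
        node = node[part]
    return True


print(is_known_type("Accuracy/Mistranslation/False Friend"))
print(is_known_type("Accuracy/Typo"))
```

A check like this is useful when reviewers type error labels by hand: unknown paths can be rejected before they pollute the statistics.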
Two Variants: MQM Core and MQM Full¶
For practical use, there are two levels:
- MQM Core - a streamlined set of ~20 error types. Covers 95% of commercial translation needs. This is what Phrase TMS, Smartcat, and other platforms use
- MQM Full - an expanded set with 100+ types. For research projects requiring detailed diagnostics
For most agencies and freelancers, MQM Core is what you need. The full typology is used by large LSPs and research labs.
DQF (Dynamic Quality Framework): The TAUS Approach¶
DQF (Dynamic Quality Framework) is a framework from TAUS (Translation Automation User Society), launched in 2011. DQF’s core idea: quality is dynamic, and requirements depend on content type, audience, and translation purpose.
What DQF Offers¶
DQF consists of several components:
- Content Profiling - determining content type (marketing, technical, legal) and quality requirements
- Error Typology - error classification (accuracy, fluency, terminology, style, locale)
- Productivity Tracking - monitoring translator productivity (words per hour, post-editing time)
- Adequacy/Fluency Rating - scoring translation adequacy and fluency
As TAUS notes:
Quality in DQF is considered dynamic since today’s translation quality requirements change depending on content type, purpose and audience.
DQF tracks both productivity and quality simultaneously - that’s its strength for managers who need the full picture.
DQF vs MQM: Competitors or Partners?¶
Initially, DQF and MQM developed separately. But in 2014-2015, TAUS and DFKI (German Research Center for AI) harmonized both frameworks into a unified DQF-MQM typology. The result:
- DQF’s 6 top-level error categories became a subset of MQM
- Anyone using DQF error typology is automatically using MQM 2.0
- Since 2018, the DQF subset of MQM has been updated and renamed MQM Core
So it’s no longer “DQF vs MQM” - it’s “MQM as the unified standard, DQF as the practical implementation for tracking productivity + quality.”
Other Models: LISA QA, SAE J2450, and Industry-Specific Frameworks¶
MQM/DQF aren’t the only metrics out there. Here are a few more you might encounter:
LISA QA Model¶
LISA QA (from the Localization Industry Standards Association) dates back to the 1990s, last updated in 2006. It classifies errors into 7 categories including DTP and UI-specific issues. LISA as an organization ceased to exist in 2011, but the model still appears in legacy SLA contracts.
SAE J2450¶
SAE J2450 is a standard from the Society of Automotive Engineers, built specifically for technical manuals in the automotive industry. It defines 7 error types (wrong term, syntactic error, omission, word structure, misspelling, punctuation, miscellaneous) × 2 severity levels (serious, minor). Simple and strict - perfect for a production line, but too limited for general translation.
DGT Error Typology¶
The European Commission’s Directorate-General for Translation (DGT) has its own typology with 5 error dimensions and 6 error codes. Used for evaluating external EU contractors.
Framework Comparison¶
| Framework | Year | Error Types | Severity Levels | Application |
|---|---|---|---|---|
| MQM Full | 2014 | 100+ | 4 (neutral/minor/major/critical) | Universal |
| MQM Core | 2018 | ~20 | 4 | Commercial translation |
| DQF | 2011 | 6 top-level | 3-4 | Corporate + productivity |
| LISA QA | 1990s | 7 | 3 | Software localization (legacy) |
| SAE J2450 | 2001 | 7 | 2 | Automotive documentation |
| DGT | 2024 | 6 | 3 | EU institutional translation |
Bottom line: MQM/MQM Core is the de facto standard right now. If you’re just starting out - go with MQM Core and don’t overcomplicate things.
The Scoring Model: How Errors Become a Number¶
Finding and classifying errors is half the job. The other half is turning annotations into an understandable score. MQM offers several models for this.
Severity Multipliers¶
By default, MQM uses 4 severity levels with these multipliers:
| Severity | Multiplier (penalty points) | When to Apply |
|---|---|---|
| Neutral | 0 | Not ideal, but acceptable in context |
| Minor | 1 | Error that doesn’t hinder understanding but is noticeable |
| Major | 5 | Error that distorts meaning or looks unprofessional |
| Critical | 25 | Error with legal, financial, or safety consequences |
For example: “Nehmen Sie 5 mg ein” translated as “Take 50 mg” - that’s Critical (25 points), because wrong medication dosage can be dangerous. An extra comma in marketing copy - that’s Minor (1 point).
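As a small sketch of how the multiplier table turns a list of findings into penalty points (the names `SEVERITY_MULTIPLIERS` and `total_penalty` are illustrative, and the weights are simply the defaults from the table above):

```python
# Default multipliers from the table above; organizations may tune these.
SEVERITY_MULTIPLIERS = {"neutral": 0, "minor": 1, "major": 5, "critical": 25}


def total_penalty(severities):
    """Sum penalty points for a list of severity labels."""
    return sum(SEVERITY_MULTIPLIERS[s] for s in severities)


# The dosage error (critical) plus the extra comma (minor) from the example.
print(total_penalty(["critical", "minor"]))  # 26
```

One critical error outweighs twenty-five minor ones, which is exactly the point of the weighting.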
Raw Score¶
The simplest formula:
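Raw Score = 100 - (Total Penalty Points / Word Count) × 1000
Example: 1,000 words, with 2 minor errors (2 × 1 = 2) and 1 major error (1 × 5 = 5). Total penalty = 7. Score = 100 - (7 / 1000) × 1000 = 100 - 7 = 93.0. In other words, the score is 100 minus the number of penalty points per 1,000 words.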
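The formula above translates directly into a few lines of code. This is a minimal sketch of the raw score exactly as this article defines it (penalty points normalized per 1,000 words); the function name `raw_score` is an assumption, and other MQM scoring variants normalize differently.

```python
def raw_score(penalty_points, word_count):
    """Raw score as defined in this article: 100 minus penalty points per 1,000 words."""
    if word_count <= 0:
        raise ValueError("word_count must be positive")
    return 100 - penalty_points * 1000 / word_count


# Worked example from the text: 2 minor + 1 major = 7 points in 1,000 words.
print(raw_score(7, 1000))  # 93.0
```

Note that nothing in the formula stops the result from dropping below zero for very poor or very short texts, so a report may want to state the word count alongside the score.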
Linear Calibrated Scoring Model (2024)¶
Since 2024, the MQM Council recommends the calibrated model. It allows you to:
- Compare scores across different content types
- Set passing thresholds flexibly
- Adapt to different service levels (gist translation vs. certified translation)
For instance, using the raw-score convention above, the threshold for legal translation might be 99.5 (at most 0.5 penalty points per 1,000 words), while for user-generated content it could be 97.2 (up to 2.8 penalty points per 1,000 words). Flexible and pragmatic.
Practical Thresholds: What Counts as “Good” Translation¶
Typical industry thresholds:
| Content Type | Threshold (MQM Score) | Acceptable Penalty Points / 1,000 Words |
|---|---|---|
| Legal, medical, financial | 98-99.5 | 0.5-2 |
| Marketing | 95-98 | 2-5 |
| Technical documentation | 93-97 | 3-7 |
| Gist / internal use | 85-93 | 7-15 |
| Raw MT (no post-editing) | 70-85 | 15-30 |
These are guidelines - every organization calibrates to their own needs.
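A pass/fail check against such thresholds could look like the sketch below. The dictionary keys are made-up labels, and the values are the lower bounds of the guideline ranges in the table above, not official limits; each organization would substitute its own calibrated numbers.

```python
# Lower bounds of the guideline ranges from the table above (assumed keys).
THRESHOLDS = {
    "legal": 98.0,
    "marketing": 95.0,
    "technical": 93.0,
    "gist": 85.0,
    "raw_mt": 70.0,
}


def passes(score, content_type):
    """Return True if the score meets the threshold for the content type."""
    return score >= THRESHOLDS[content_type]


print(passes(94.2, "technical"))  # True
print(passes(94.2, "marketing"))  # False
```

The same score of 94.2 passes for technical documentation and fails for marketing copy, which is the practical meaning of calibrating thresholds to content type.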
How to Implement Metrics in Your Workflow¶
Theory is great, but how does this actually work? Here’s a step-by-step plan for agencies and freelancers.
Step 1: Pick a Framework¶
For 90% of cases - MQM Core. It’s supported by most CAT tools and is detailed enough for commercial translation. If you’re in automotive - SAE J2450. If you work with the European Commission - DGT.
Step 2: Define Severity Guidelines¶
Don’t leave it up to reviewer discretion. Describe specifically what counts as minor, major, and critical FOR YOUR content type. Example:
- Critical: errors in drug names, legal terms, numbers in financial documents, errors that reverse meaning
- Major: omitted sentence, wrong terminology, errors that change meaning
- Minor: stylistic inaccuracies, typos, style guide violations
Step 3: Determine Sample Size¶
Checking 100% of text is expensive. Standard practice:
- Large projects (10,000+ words): sample 5-10% (500-1,000 words for a 10,000-word project), with at least some text from every section
- Medium projects (1,000-10,000): 10-20%
- Small projects (<1,000): 100% (it’s a small volume anyway)
As researchers note in their Multi-Range Theory paper, for very small samples (even a single sentence), MQM recommends using Statistical Quality Control instead of simple counting.
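The sampling bands listed above can be turned into a small helper. This is only a sketch: the function name `sample_range` is hypothetical, and because the text assigns exactly 10,000 words to both the medium and the large band, the code assumes it belongs to the large one.

```python
def sample_range(word_count):
    """Return (min_words, max_words) to review for a project of the given size."""
    if word_count < 1000:
        # Small projects: review everything.
        return (word_count, word_count)
    if word_count < 10000:
        low_pct, high_pct = 10, 20   # medium projects
    else:
        low_pct, high_pct = 5, 10    # large projects (assumed to include 10,000)
    return (word_count * low_pct // 100, word_count * high_pct // 100)


print(sample_range(500))    # (500, 500)
print(sample_range(4000))   # (400, 800)
print(sample_range(10000))  # (500, 1000)
```

The helper only answers "how many words"; which segments to pull, and how to spread them across sections, still needs a deliberate sampling plan.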
Step 4: Pick a Tool¶
Modern CAT systems have built-in LQA support:
| Tool | MQM Support | Customization | Reporting |
|---|---|---|---|
| Phrase TMS | MQM Core template | Full (categories, weights, thresholds) | Dashboard + export |
| memoQ | LQA module | Full | Built-in reports |
| Smartcat | MQM framework | Basic | Auto-checks + manual LQA |
| Lokalise | MQM-based scoring | Full | Scoring 0-100 |
If you’re a freelancer without a TMS - start with Google Sheets: columns for segment, error type, severity, and penalty points. The scoring formula fits in a single spreadsheet cell.
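If the spreadsheet is exported as CSV, the same calculation takes a few lines of script. The sketch below assumes a hypothetical export with `segment`, `error_type`, and `severity` columns and invented sample rows; it uses the default multipliers and the raw-score formula from earlier in this article.

```python
import csv
import io

# Stand-in for an exported review sheet (invented sample data).
SHEET = """segment,error_type,severity
12,Accuracy/Omission,major
17,Fluency,minor
23,Terminology,minor
"""

MULTIPLIERS = {"neutral": 0, "minor": 1, "major": 5, "critical": 25}

rows = list(csv.DictReader(io.StringIO(SHEET)))
penalty = sum(MULTIPLIERS[row["severity"]] for row in rows)

word_count = 1000  # size of the reviewed sample, entered by hand
score = 100 - penalty * 1000 / word_count

print(penalty, score)  # 7 93.0
```

In real use, `io.StringIO(SHEET)` would be replaced by an opened file, and the word count would come from the CAT tool or a word-count column.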
Step 5: Train Your Reviewers¶
The weakest link in any metric is the person applying it. Inter-annotator agreement (consistency between reviewers) is the key challenge. Two reviewers look at the same sentence: one sees minor, the other sees major.
Solutions:
- Run a calibration session: 50 segments with errors, reviewers score independently, then compare and discuss discrepancies
- Create severity guidelines with concrete examples (not abstract rules, but “THIS sentence is major because…”)
- Check agreement quarterly
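One common way to put a number on reviewer consistency is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The article does not prescribe a specific statistic, so treat this as one possible choice; the function below is a plain-Python sketch for two reviewers labeling the same segments, with invented example labels.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two reviewers' labels on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Share of items where both reviewers chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected if both labeled at random with their own frequencies.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1:
        return 1.0  # both reviewers used one identical label throughout
    return (observed - expected) / (1 - expected)


reviewer_1 = ["minor", "minor", "major", "major"]
reviewer_2 = ["minor", "major", "major", "major"]
print(cohens_kappa(reviewer_1, reviewer_2))  # 0.5
```

Here the reviewers agree on 3 of 4 segments (75%), but half of that agreement would be expected by chance, so kappa is only 0.5. Tracking this value per calibration round shows whether the severity guidelines are actually converging.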
MQM for Machine Translation and MTPE Evaluation¶
A separate and important use case: MQM for evaluating MT quality and post-editing.
Traditional automatic metrics (BLEU, COMET, METEOR) compare MT output against a reference translation and produce a single number. The problem: they don’t distinguish error types. BLEU can give a high score to a text in which 95% of the words match the reference, even though one wrong number in a medication dosage makes that text critically wrong.
MQM solves this by evaluating specific errors with their severity. That’s why major MT conferences (WMT) switched to MQM-based evaluation for human assessment starting in 2020.
For the MTPE process, this means you can measure not just “how long post-editing took” but “what specific errors MT makes most often” - and based on that, decide whether to switch engines, refine the prompt, or update the glossary.
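Finding out which errors an engine makes most often is a simple counting exercise once annotations exist. The sketch below groups error-type paths by their top-level category; the function name `category_shares`, the `/`-separated path format, and the sample data are assumptions for illustration.

```python
from collections import Counter


def category_shares(error_types):
    """Share of findings per top-level category, most frequent first."""
    counts = Counter(path.split("/")[0] for path in error_types)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.most_common()}


# Invented findings from an MT post-editing review.
findings = [
    "Terminology",
    "Terminology",
    "Accuracy/Mistranslation",
    "Accuracy/Omission",
    "Fluency",
]
print(category_shares(findings))
# {'Terminology': 0.4, 'Accuracy': 0.4, 'Fluency': 0.2}
```

If terminology dominates the breakdown, a glossary update is the likely fix; if accuracy errors dominate, the engine or prompt deserves a closer look.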
Common Mistakes When Implementing Metrics¶
Mistake 1: Using Metrics as a Punitive Tool¶
“Your score is 91.3 instead of 95 - penalty.” This kills motivation and translator relationships. A metric is a diagnostic tool, not a stick. If a translator consistently scores 91-92 against a threshold of 95, that’s a signal for training or glossary updates, not termination.
Mistake 2: Evaluating Without Calibration¶
If reviewers haven’t gone through calibration - the results are worthless. One reviewer marks everything as minor, the other as major. A score of “92” from the first and “78” from the second for the same text isn’t a metric - it’s chaos.
Mistake 3: Over-Detailed Typology¶
MQM Full has 100+ error types. If you start with the full typology, reviewers will get confused and spend more time classifying an error than finding it. Start with MQM Core (~20 types) and expand only when you see the need for more granularity.
Mistake 4: Ignoring Context¶
“Certified translation” and “gist for internal use” are different quality levels. If you apply legal thresholds to an internal FAQ translation, you’re wasting time and money. Calibrate thresholds to content type.
How It Connects to ISO 17100 and the TEP Model¶
MQM/DQF are metrics for MEASURING quality. ISO 17100 and the TEP model are processes for ENSURING quality. They don’t compete - they complement each other:
- ISO 17100 says: “translation must be reviewed by another person” (process)
- TEP says: “Translation -> Editing -> Proofreading - here are the stages” (workflow)
- MQM says: “here’s how to measure the result of each stage” (metric)
The ideal setup: a TEP process per ISO 17100 + MQM evaluation at the output of each stage. That way you see not only the final quality but exactly where in the pipeline issues arise.
For automated QA checks (tags, numbers, consistency), MQM provides the framework to classify automated findings on the same scale as manual ones.
FAQ¶
What Is MQM in Simple Terms?¶
MQM (Multidimensional Quality Metrics) is a standardized system for evaluating translation quality. A reviewer reads the translation, finds errors, classifies each by type (accuracy, terminology, style) and severity (minor, major, critical). Then a formula converts this into a numerical score from 0 to 100. Think of it as grading an exam - but with clear rules on what counts as an error and how many points to deduct.
Do Freelancers Need Quality Metrics, or Are They Just for Agencies?¶
Metrics are useful for freelancers too. First, major clients increasingly require MQM-based QA in contracts. Second, if you evaluate your own work against a formalized system, you spot your weaknesses and improve them deliberately. Third, “average MQM score of 97.8 across my last 50 projects” in a portfolio looks far more convincing than “I produce quality translations.”
What’s the Difference Between MQM and DQF?¶
Today - practically none. They started as separate frameworks: MQM from DFKI, DQF from TAUS. In 2014-2015 they were harmonized into a unified DQF-MQM typology. Since 2018, the DQF subset of MQM became MQM Core. If someone says “we use DQF,” they’re essentially using MQM Core with additional productivity tracking.
How Much Does It Cost to Implement MQM at an Agency?¶
The framework itself is free - the typology and documentation are open. Your costs will be: (1) training reviewers (1-2 days of calibration sessions), (2) possibly upgrading your TMS if it doesn’t support LQA (Phrase TMS, memoQ already have built-in support), (3) time to create severity guidelines for your content types. For a small agency that already has a suitable TMS, a realistic implementation budget is $0 in licensing fees plus 2-3 days of setup work.
Will Automatic Metrics (BLEU, COMET) Replace Manual MQM Evaluation?¶
No. Automatic metrics are great for rapid screening of large MT volumes, but they can’t tell the difference between a critical medication dosage error and a minor stylistic inaccuracy - both are just “doesn’t match reference” to BLEU. MQM with manual annotation remains the gold standard for quality evaluation where errors carry different weights.
Which Tools Support MQM-Based Evaluation?¶
Phrase TMS, memoQ, Smartcat, and Lokalise all have built-in support. Trados Studio (formerly SDL Trados) also supports LQA with custom profiles. For freelancers without a TMS, Google Sheets or Excel with a scoring formula works well as a starting point.