Translation Quality Metrics: MQM, DQF, Error Typology - How to Measure Objectively

MQM, DQF, Error Typology - how to measure translation quality objectively. Breakdown of frameworks, scoring models, severity levels and practical implementation for agencies and freelancers.

One reviewer checks 50 translation segments and says: “quality looks fine.” Another reviewer checks the same 50 segments and says: “this is full of errors.” Who’s right? Without a shared evaluation system - nobody, because “fine quality” means something different to each person.

That’s exactly why formalized translation quality metrics exist - MQM, DQF, Error Typology and their derivatives. These aren’t academic toys: major clients (Google, Microsoft, the European Commission) have been using these frameworks for years to evaluate thousands of translators and millions of segments. And in 2024, MQM celebrated its 10th anniversary with an updated scoring model that finally became readable for humans, not just researchers.

Let’s break down how these metrics work, what makes them different, and how to implement them in a real workflow - even if you’re a small agency or a freelancer.

What Are Translation Quality Metrics and Why Do They Matter

A Translation Quality Metric is a formalized system for detecting, classifying, and counting errors in translation. Instead of subjective “I like it / I don’t,” you get a concrete numerical score you can compare across translators, projects, and time periods.

Why this matters:

  • Objectivity: two reviewers working under the same framework will produce comparable results
  • Comparability: you can compare translator A and translator B on the same scale
  • Process improvement: if 40% of errors are terminological, you know to update the glossary - not “rewrite everything”
  • SLA and contracts: a client says “quality must be at least 98 on MQM” - that’s a concrete, measurable requirement
  • Client-facing argument: “your translation scored 94.2 on MQM against a threshold of 90 - here’s the report” sounds a lot more convincing than “we checked it, it’s fine”

As the MQM Council states:

The central component of MQM is a hierarchical listing of issue types derived in a careful examination of existing quality evaluation metrics.

MQM didn’t invent categories out of thin air - it systematized what the industry had already been using for decades.

MQM (Multidimensional Quality Metrics): The Industry’s Main Framework

MQM is a framework for analytic Translation Quality Evaluation (TQE). Created in 2014 under the EU-funded QTLaunchpad project, it’s now maintained by the MQM Council.

How MQM Works

The core idea is simple: an expert (reviewer) reads the translation, spots errors, and for each error determines:

  1. Error type - from a hierarchical taxonomy
  2. Severity level - neutral, minor, major, critical

Then a scoring model converts these annotations into a numerical score.
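As a rough illustration, one reviewer annotation can be stored as a small record holding the error type and the severity. The class and field names below (`Annotation`, `segment_id`, `comment`) are assumptions made for this sketch, not part of any MQM specification or tool.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    NEUTRAL = "neutral"
    MINOR = "minor"
    MAJOR = "major"
    CRITICAL = "critical"


@dataclass(frozen=True)
class Annotation:
    segment_id: int      # which segment the error was found in (hypothetical field)
    error_type: str      # path in the taxonomy, e.g. "Accuracy/Omission"
    severity: Severity
    comment: str = ""    # optional reviewer note (hypothetical field)


# Made-up review of one file: three errors found.
annotations = [
    Annotation(3, "Accuracy/Omission", Severity.MAJOR, "second sentence missing"),
    Annotation(7, "Terminology", Severity.MINOR),
    Annotation(12, "Fluency", Severity.MINOR, "typo"),
]

for a in annotations:
    print(a.segment_id, a.error_type, a.severity.value)
```

A list of such records is all a scoring model needs as input.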

MQM Error Typology

MQM has a tree-like structure of error types. At the top level - 7 main categories:

| Category | What It Covers | Example |
|---|---|---|
| Accuracy | Correspondence to source | Omitted sentence, wrong number translation, added information not in source |
| Fluency | Linguistic correctness of target text | Grammar errors, unnatural phrasing, typos |
| Terminology | Adherence to term bases | Using “agreement” instead of “contract” when there’s an approved glossary |
| Style | Adherence to style requirements | Overly formal tone in marketing copy |
| Design | Formatting, markup | Broken tags, wrong font, RTL issues |
| Locale Convention | Regional conventions | Date format DD/MM vs MM/DD, wrong thousands separator |
| Verity | Factual correctness | Wrong URL, outdated information |

Each category has subcategories. For example, Accuracy breaks down into Addition, Omission, Mistranslation, and Untranslated. Mistranslation further splits into False Friend, Technical Relationship Error, and even MT Hallucination - a subtype added in the MQM-Chat variant for evaluating AI translation.
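The tree structure can be sketched as a nested dictionary. Only the categories and subtypes named in this article are included; the slash-separated path format and the helper name `is_valid_path` are assumptions for illustration, and the real typology contains many more nodes.

```python
# Partial taxonomy: only the nodes mentioned in the text above.
TAXONOMY = {
    "Accuracy": {
        "Addition": {},
        "Omission": {},
        "Mistranslation": {
            "False Friend": {},
            "Technical Relationship Error": {},
            "MT Hallucination": {},
        },
        "Untranslated": {},
    },
    "Fluency": {},
    "Terminology": {},
    "Style": {},
    "Design": {},
    "Locale Convention": {},
    "Verity": {},
}


def is_valid_path(path: str) -> bool:
    """Check that a slash-separated error type exists in the tree."""
    node = TAXONOMY
    for part in path.split("/"):
        if part not in node:
            return False
        node = node[part]
    return True


print(is_valid_path("Accuracy/Mistranslation/False Friend"))
print(is_valid_path("Accuracy/Typo"))
```

Validating paths like this keeps reviewers from inventing ad hoc error labels that later cannot be aggregated.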

Two Variants: MQM Core and MQM Full

For practical use, there are two levels:

  • MQM Core - a streamlined set of ~20 error types. Covers 95% of commercial translation needs. This is what Phrase TMS, Smartcat, and other platforms use
  • MQM Full - an expanded set with 100+ types. For research projects requiring detailed diagnostics

For most agencies and freelancers, MQM Core is what you need. The full typology is used by large LSPs and research labs.

DQF (Dynamic Quality Framework): The TAUS Approach

DQF (Dynamic Quality Framework) is a framework from TAUS (Translation Automation User Society), launched in 2011. DQF’s core idea: quality is dynamic, and requirements depend on content type, audience, and translation purpose.

What DQF Offers

DQF consists of several components:

  • Content Profiling - determining content type (marketing, technical, legal) and quality requirements
  • Error Typology - error classification (accuracy, fluency, terminology, style, locale)
  • Productivity Tracking - monitoring translator productivity (words per hour, post-editing time)
  • Adequacy/Fluency Rating - scoring translation adequacy and fluency

As TAUS notes:

Quality in DQF is considered dynamic since today’s translation quality requirements change depending on content type, purpose and audience.

DQF tracks both productivity and quality simultaneously - that’s its strength for managers who need the full picture.

DQF vs MQM: Competitors or Partners?

Initially, DQF and MQM developed separately. But in 2014-2015, TAUS and DFKI (German Research Center for AI) harmonized both frameworks into a unified DQF-MQM typology. The result:

  • DQF’s 6 top-level error categories became a subset of MQM
  • Anyone using DQF error typology is automatically using MQM 2.0
  • Since 2018, the DQF subset of MQM has been updated and renamed MQM Core

So it’s no longer “DQF vs MQM” - it’s “MQM as the unified standard, DQF as the practical implementation for tracking productivity + quality.”

Other Models: LISA QA, SAE J2450, and Industry-Specific Frameworks

MQM/DQF aren’t the only metrics out there. Here are a few more you might encounter:

LISA QA Model

LISA QA (from the Localization Industry Standards Association) dates back to the 1990s, last updated in 2006. It classifies errors into 7 categories including DTP and UI-specific issues. LISA as an organization ceased to exist in 2011, but the model still appears in legacy SLA contracts.

SAE J2450

SAE J2450 is a standard from the Society of Automotive Engineers, built specifically for technical manuals in the automotive industry. It defines 7 error types (wrong term, syntactic error, omission, word structure, misspelling, punctuation, miscellaneous), each rated at one of 2 severity levels (serious, minor). Simple and strict - perfect for a production line, but too limited for general translation.

DGT Error Typology

The European Commission’s Directorate-General for Translation (DGT) has its own typology with 5 error dimensions and 6 error codes. Used for evaluating external EU contractors.

Framework Comparison

| Framework | Year | Error Types | Severity Levels | Application |
|---|---|---|---|---|
| MQM Full | 2014 | 100+ | 4 (neutral/minor/major/critical) | Universal |
| MQM Core | 2018 | ~20 | 4 | Commercial translation |
| DQF | 2011 | 6 top-level | 3-4 | Corporate + productivity |
| LISA QA | 1990s | 7 | 3 | Software localization (legacy) |
| SAE J2450 | 2001 | 7 | 2 | Automotive documentation |
| DGT | 2024 | 6 | 3 | EU institutional translation |

Bottom line: MQM/MQM Core is the de facto standard right now. If you’re just starting out - go with MQM Core and don’t overcomplicate things.

The Scoring Model: How Errors Become a Number

Finding and classifying errors is half the job. The other half is turning annotations into an understandable score. MQM offers several models for this.

Severity Multipliers

By default, MQM uses 4 severity levels with these multipliers:

| Severity | Multiplier (penalty points) | When to Apply |
|---|---|---|
| Neutral | 0 | Not ideal, but acceptable in context |
| Minor | 1 | Error that doesn’t hinder understanding but is noticeable |
| Major | 5 | Error that distorts meaning or looks unprofessional |
| Critical | 25 | Error with legal, financial, or safety consequences |

For example: “Nehmen Sie 5 mg ein” translated as “Take 50 mg” - that’s Critical (25 points), because wrong medication dosage can be dangerous. An extra comma in marketing copy - that’s Minor (1 point).
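In code, the multipliers amount to a lookup table. The sketch below uses the default values listed above; the function name `total_penalty` is made up for this example.

```python
# Default severity multipliers from the table above.
SEVERITY_MULTIPLIERS = {"neutral": 0, "minor": 1, "major": 5, "critical": 25}


def total_penalty(severities):
    """Sum penalty points for a list of severity labels."""
    return sum(SEVERITY_MULTIPLIERS[s] for s in severities)


# The dosage error (critical) plus the extra comma (minor):
print(total_penalty(["critical", "minor"]))  # 26
```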

Raw Score

The simplest formula:

Raw Score = 100 - (Total Penalty Points / Word Count x 1000)

Example: 1,000 words, found 2 minor (2 x 1 = 2) and 1 major (1 x 5 = 5). Total = 7. Score = 100 - (7 / 1000 x 1000) = 100 - 7 = 93.0
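The same calculation as a small function, following the formula exactly as written above. The name `raw_score` and the `per_words` parameter are assumptions for this sketch.

```python
def raw_score(penalty_points, word_count, per_words=1000):
    """Raw score = 100 - penalty points normalized per `per_words` words."""
    if word_count <= 0:
        raise ValueError("word_count must be positive")
    return 100 - penalty_points * per_words / word_count


# 2 minor (2 x 1) + 1 major (1 x 5) = 7 penalty points
print(raw_score(7, 1000))  # 93.0
print(raw_score(7, 500))   # 86.0, same errors in half the text
```

The second call shows why normalizing by word count matters: the same seven penalty points weigh twice as much in a 500-word sample.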

Linear Calibrated Scoring Model (2024)

Since 2024, the MQM Council recommends the calibrated model. It allows you to:

  • Compare scores across different content types
  • Set passing thresholds flexibly
  • Adapt to different service levels (gist translation vs. certified translation)

For instance, the threshold for legal translation might be 99.5 (at most 0.5 penalty points per 1,000 words under the raw score formula), while for user-generated content it could be 97.2 (up to 2.8 penalty points). Flexible and pragmatic.

Practical Thresholds: What Counts as “Good” Translation

Typical industry thresholds:

| Content Type | Threshold (MQM Score) | Acceptable Penalty Points / 1,000 Words |
|---|---|---|
| Legal, medical, financial | 98-99.5 | 0.5-2 |
| Marketing | 95-98 | 2-5 |
| Technical documentation | 93-97 | 3-7 |
| Gist / internal use | 85-93 | 7-15 |
| Raw MT (no post-editing) | 70-85 | 15-30 |

These are guidelines - every organization calibrates to their own needs.
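A threshold and a penalty budget are two views of the same number: under the raw score formula, the budget per 1,000 words is simply 100 minus the threshold. The sketch below assumes the lower bound of each range in the table as the passing threshold; that choice, the dictionary keys, and the function names are all illustrative.

```python
# Assumption: lower bound of each range in the table is the pass mark.
THRESHOLDS = {
    "legal_medical_financial": 98.0,
    "marketing": 95.0,
    "technical_documentation": 93.0,
    "gist_internal": 85.0,
    "raw_mt": 70.0,
}


def penalty_budget(threshold, word_count, per_words=1000):
    """Penalty points a sample of `word_count` words may accumulate and still pass."""
    return (100 - threshold) * word_count / per_words


def passes(score, content_type):
    return score >= THRESHOLDS[content_type]


print(penalty_budget(98.0, 1000))                 # 2.0
print(penalty_budget(93.0, 2000))                 # 14.0
print(passes(94.2, "technical_documentation"))    # True
print(passes(94.2, "marketing"))                  # False
```

The same score of 94.2 passes for technical documentation and fails for marketing, which is the point of calibrating thresholds per content type.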

How to Implement Metrics in Your Workflow

Theory is great, but how does this actually work? Here’s a step-by-step plan for agencies and freelancers.

Step 1: Pick a Framework

For 90% of cases - MQM Core. It’s supported by most CAT tools and is detailed enough for commercial translation. If you’re in automotive - SAE J2450. If you work with the European Commission - DGT.

Step 2: Define Severity Guidelines

Don’t leave it up to reviewer discretion. Describe specifically what counts as minor, major, and critical FOR YOUR content type. Example:

  • Critical: errors in drug names, legal terms, numbers in financial documents, errors that reverse meaning
  • Major: omitted sentence, wrong terminology, errors that change meaning
  • Minor: stylistic inaccuracies, typos, style guide violations
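One way to keep such guidelines unambiguous is to store them as data that reviewers and tools share. This is only a sketch: the issue labels paraphrase the example list above, and `severity_for` is a hypothetical helper.

```python
# Example guidelines from the list above, expressed as a lookup.
SEVERITY_GUIDELINES = {
    "critical": [
        "drug name error",
        "legal term error",
        "wrong number in financial document",
        "reversed meaning",
    ],
    "major": ["omitted sentence", "wrong terminology", "changed meaning"],
    "minor": ["stylistic inaccuracy", "typo", "style guide violation"],
}


def severity_for(issue):
    """Return the agreed severity for a known issue label."""
    for severity, issues in SEVERITY_GUIDELINES.items():
        if issue in issues:
            return severity
    raise KeyError(f"no guideline for issue: {issue!r}")


print(severity_for("typo"))              # minor
print(severity_for("omitted sentence"))  # major
```

An issue with no entry raises an error instead of falling back to a default, which forces the team to extend the guidelines rather than guess.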

Step 3: Determine Sample Size

Checking 100% of text is expensive. Standard practice:

  • Large projects (10,000+ words): sample 5-10% (500-1,000 words for a 10,000-word project), taking at least some segments from each section
  • Medium projects (1,000-10,000): 10-20%
  • Small projects (<1,000): 100% (it’s a small volume anyway)

As researchers note in their Multi-Range Theory paper, for very small samples (even a single sentence), MQM recommends using Statistical Quality Control instead of simple counting.
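The sampling rules above can be sketched as follows. Two simplifications are assumed: the lower end of each percentage range is used, and the sample is drawn by segment count rather than word count. Function names and the fixed seed are illustrative.

```python
import random


def sample_percent(total_words):
    """Sampling rate (in percent) by project size, lower end of each range."""
    if total_words < 1000:
        return 100
    if total_words < 10000:
        return 10
    return 5


def stratified_sample(sections, percent, seed=0):
    """Pick segments from every section so that none is skipped.

    sections: dict mapping section name -> list of segments.
    """
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    sample = {}
    for name, segments in sections.items():
        if not segments:
            continue
        # Ceiling division in integers; at least one segment per section.
        k = max(1, (len(segments) * percent + 99) // 100)
        sample[name] = rng.sample(segments, k)
    return sample


sections = {
    "intro": [f"intro-{i}" for i in range(4)],
    "body": [f"body-{i}" for i in range(40)],
}
picked = stratified_sample(sections, sample_percent(5000))
print({name: len(segs) for name, segs in picked.items()})  # {'intro': 1, 'body': 4}
```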

Step 4: Pick a Tool

Modern CAT systems have built-in LQA support:

| Tool | MQM Support | Customization | Reporting |
|---|---|---|---|
| Phrase TMS | MQM Core template | Full (categories, weights, thresholds) | Dashboard + export |
| memoQ | LQA module | Full | Built-in reports |
| Smartcat | MQM framework | Basic | Auto-checks + manual LQA |
| Lokalise | MQM-based scoring | Full | Scoring 0-100 |

If you’re a freelancer without a TMS - start with a spreadsheet (Google Sheets or Excel): columns for segment, error type, severity, and penalty points. The scoring formula fits in a single cell.
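An error log kept this way can also be scored with a few lines of Python once it is exported as CSV. The column names and sample rows here are made up for the example; the penalty is derived from the severity column using the default multipliers.

```python
import csv
import io

MULTIPLIERS = {"neutral": 0, "minor": 1, "major": 5, "critical": 25}


def score_from_csv(csv_text, word_count):
    """Read an error log (one row per error) and return the raw score."""
    rows = csv.DictReader(io.StringIO(csv_text))
    penalty = sum(MULTIPLIERS[row["severity"].strip().lower()] for row in rows)
    return 100 - penalty * 1000 / word_count


# Hypothetical export of a review sheet for a 1,000-word sample.
LOG = """segment,error_type,severity
4,Terminology,major
9,Fluency,minor
15,Style,minor
"""

print(score_from_csv(LOG, 1000))  # 93.0
```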

Step 5: Train Your Reviewers

The weakest link in any metric is the person applying it. Inter-annotator agreement (consistency between reviewers) is the key challenge. Two reviewers look at the same sentence: one sees minor, the other sees major.

Solutions:

  • Run a calibration session: 50 segments with errors, reviewers score independently, then compare and discuss discrepancies
  • Create severity guidelines with concrete examples (not abstract rules, but “THIS sentence is major because…”)
  • Check agreement quarterly
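To put a number on agreement after a calibration session, one common statistic is Cohen's kappa, which corrects raw agreement for chance. The article does not prescribe a particular statistic, so treat this as one possible choice; the sample labels are invented.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two reviewers labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Share of items where both reviewers chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each reviewer's label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1:
        return 1.0  # both reviewers used a single identical label throughout
    return (observed - expected) / (1 - expected)


reviewer_1 = ["minor", "minor", "major", "major"]
reviewer_2 = ["minor", "major", "major", "major"]
print(cohen_kappa(reviewer_1, reviewer_2))  # 0.5
```

Here the reviewers agree on 3 of 4 segments (75%), but because half of that agreement would be expected by chance, kappa comes out at 0.5.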

MQM for Machine Translation and MTPE Evaluation

A separate and important use case: MQM for evaluating MT quality and post-editing.

Traditional automatic metrics (BLEU, COMET, METEOR) compare MT output against a reference translation and produce a single number. The problem: they don’t distinguish error types. BLEU can give a high score to a text in which 95% of the words match the reference, even when the one mismatch is a wrong number in a medication dosage - a critical error.

MQM solves this by evaluating specific errors with their severity. That’s why WMT, the main annual machine translation conference, switched to MQM-based evaluation for human assessment starting in 2020.

For the MTPE process, this means you can measure not just “how long post-editing took” but “what specific errors MT makes most often” - and based on that, decide whether to switch engines, refine the prompt, or update the glossary.
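A minimal sketch of that kind of diagnosis: count annotations by top-level category and look at the shares. The input list is invented, and the slash-separated error type format is an assumption of this example.

```python
from collections import Counter


def error_distribution(error_types):
    """Share of errors per top-level category, most frequent first."""
    counts = Counter(t.split("/")[0] for t in error_types)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {category: count / total for category, count in counts.most_common()}


# Hypothetical annotations collected from one MT engine's output.
found = [
    "Terminology",
    "Terminology",
    "Accuracy/Omission",
    "Fluency",
    "Terminology",
]
for category, share in error_distribution(found).items():
    print(f"{category}: {share:.0%}")
```

If terminology dominates, as in this made-up sample, the glossary is the first thing to fix; a spike in accuracy errors would point at the engine or the prompt instead.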

Common Mistakes When Implementing Metrics

Mistake 1: Using Metrics as a Punitive Tool

“Your score is 91.3 instead of 95 - penalty.” This kills motivation and translator relationships. A metric is a diagnostic tool, not a stick. If a translator consistently scores 91-92 against a threshold of 95, that’s a signal for training or glossary updates, not termination.

Mistake 2: Evaluating Without Calibration

If reviewers haven’t gone through calibration - the results are worthless. One reviewer marks everything as minor, the other as major. A score of “92” from the first and “78” from the second for the same text isn’t a metric - it’s chaos.

Mistake 3: Over-Detailed Typology

MQM Full has 100+ error types. If you start with the full typology, reviewers will get confused and spend more time classifying an error than finding it. Start with MQM Core (~20 types) and expand only when you see the need for more granularity.

Mistake 4: Ignoring Context

“Certified translation” and “gist for internal use” are different quality levels. If you apply legal thresholds to an internal FAQ translation, you’re wasting time and money. Calibrate thresholds to content type.

How It Connects to ISO 17100 and the TEP Model

MQM/DQF are metrics for MEASURING quality. ISO 17100 and the TEP model are processes for ENSURING quality. They don’t compete - they complement each other:

  • ISO 17100 says: “translation must be reviewed by another person” (process)
  • TEP says: “Translation -> Editing -> Proofreading - here are the stages” (workflow)
  • MQM says: “here’s how to measure the result of each stage” (metric)

The ideal setup: a TEP process per ISO 17100 + MQM evaluation at the output of each stage. That way you see not only the final quality but exactly where in the pipeline issues arise.

For automated QA checks (tags, numbers, consistency), MQM provides the framework to classify automated findings on the same scale as manual ones.
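As an illustration, automated findings can be mapped to a category and severity and then summed together with manual annotations. The check names, the category assignments, and the severities below are all assumptions; a real mapping depends on your QA tool and your severity guidelines.

```python
# Hypothetical mapping from automated check names to (category, severity).
AUTO_CHECK_TO_MQM = {
    "tag_mismatch": ("Design", "major"),
    "number_mismatch": ("Accuracy", "major"),
    "inconsistent_term": ("Terminology", "minor"),
}

MULTIPLIERS = {"neutral": 0, "minor": 1, "major": 5, "critical": 25}


def classify(findings):
    """Turn raw check names into (category, severity) annotations."""
    return [AUTO_CHECK_TO_MQM[name] for name in findings]


def penalty(annotations):
    """Sum penalty points over (category, severity) pairs."""
    return sum(MULTIPLIERS[severity] for _, severity in annotations)


auto = classify(["tag_mismatch", "number_mismatch"])
manual = [("Style", "minor")]  # found by a human reviewer
print(penalty(auto + manual))  # 11
```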

FAQ

What Is MQM in Simple Terms?

MQM (Multidimensional Quality Metrics) is a standardized system for evaluating translation quality. A reviewer reads the translation, finds errors, classifies each by type (accuracy, terminology, style) and severity (minor, major, critical). Then a formula converts this into a numerical score from 0 to 100. Think of it as grading an exam - but with clear rules on what counts as an error and how many points to deduct.

Do Freelancers Need Quality Metrics, or Are They Just for Agencies?

Metrics are useful for freelancers too. First, major clients increasingly require MQM-based QA in contracts. Second, if you evaluate your own work against a formalized system, you spot your weaknesses and improve them deliberately. Third, “average MQM score of 97.8 across my last 50 projects” in a portfolio looks far more convincing than “I produce quality translations.”

What’s the Difference Between MQM and DQF?

Today - practically none. They started as separate frameworks: MQM from DFKI, DQF from TAUS. In 2014-2015 they were harmonized into a unified DQF-MQM typology. Since 2018, the DQF subset of MQM became MQM Core. If someone says “we use DQF,” they’re essentially using MQM Core with additional productivity tracking.

How Much Does It Cost to Implement MQM at an Agency?

The framework itself is free - the typology and documentation are open. Your costs will be: (1) training reviewers (1-2 days of calibration sessions), (2) possibly upgrading your TMS if it doesn’t support LQA (Phrase TMS, memoQ already have built-in support), (3) time to create severity guidelines for your content types. For a small agency that already has a suitable tool, a realistic implementation budget is $0 in licensing plus 2-3 days of setup work.

Will Automatic Metrics (BLEU, COMET) Replace Manual MQM Evaluation?

No. Automatic metrics are great for rapid screening of large MT volumes, but they can’t tell the difference between a critical medication dosage error and a minor stylistic inaccuracy - both are just “doesn’t match reference” to BLEU. MQM with manual annotation remains the gold standard for quality evaluation where errors carry different weights.

Which Tools Support MQM-Based Evaluation?

Phrase TMS, memoQ, Smartcat, and Lokalise all have built-in support. SDL Trados also supports LQA with custom profiles. For freelancers without a TMS, Google Sheets or Excel with a scoring formula works perfectly as a starting point.
