How do we evaluate the machines? Quality estimation for machine translation

By Jose de la Torre Vilariño

Among the various applications that generate natural language, machine translation (MT) is without a doubt one of the most prominent in terms of number of users. There are several online systems and hundreds of companies offering free or paid applications, with a combined translation volume of billions of words per day. It is estimated that Google Translate, to cite the most prolific example, is used by more than 550 million people and translates an estimated 120 billion words per day. Because MT systems are so widely used, quality estimation (QE) is an important tool to help users gauge the reliability of a translation. QE can also be used to evaluate the performance of automatic translators, especially when the user does not understand the source language.

What is quality estimation?

QE, our main topic for today, is the prediction of translation quality based on certain features. These features can be parts of speech, including nouns, adjectives, verbs, adverbs and so on, found in the source text and the target translation. They can also be named entities, like places, people, companies and more. These features can be used with techniques such as deep learning to create a QE model. That model is then used to obtain a score that represents the estimated quality of any translation.

Traditionally, evaluating the quality of an MT engine required a reference translation, which was usually created by a human translator. The similarities and differences between the reference translation and the MT output were then converted into a metric. Many such metrics exist; BLEU, LEPOR, METEOR and NIST are four of the most widely used. If you don’t know about those specific methods, you can learn more from a previous article about the top five evaluation metrics in MT.
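To make the reference-based approach concrete, here is a minimal sketch of clipped n-gram precision, the building block behind BLEU. It is not full BLEU (there is no brevity penalty and no averaging across n-gram orders), and the function names, whitespace tokenization and example sentences are illustrative assumptions.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(hypothesis, reference, n=1):
    """Fraction of hypothesis n-grams that also occur in the reference.

    Each n-gram is credited at most as many times as it appears in the
    reference ("clipping"), as in BLEU's modified precision.
    """
    hyp_counts = Counter(ngrams(hypothesis.lower().split(), n))
    ref_counts = Counter(ngrams(reference.lower().split(), n))
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return matched / total


# Invented example: a human reference and an MT output to compare against it.
reference = "the cat is on the mat"
mt_output = "the cat sat on the mat"
print(clipped_precision(mt_output, reference, n=1))  # unigram overlap
print(clipped_precision(mt_output, reference, n=2))  # bigram overlap
```

Note that the score cannot be computed without the human reference, which is exactly the dependency QE tries to remove.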

In a nutshell, QE for MT aims to predict the quality of a machine-translated text without using reference translations or any human assistance. The predictions usually result in several categories, which include Good, Regular and Bad. While the number of categories depends on the granularity of the machine learning model’s definition, three or four categories are generally enough for our purposes. Some QE models have numeric scores, but that depends on the client’s expectations or how exact we want the estimation to be. Let’s take a look at how it is actually done by examining the main QE tools and frameworks.
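As a rough illustration of how a numeric QE score might be turned into the categories mentioned above, here is a small sketch. The 0 to 1 score range and both thresholds are assumptions chosen for the example, not values from any particular QE model.

```python
def categorize(score, good_threshold=0.8, regular_threshold=0.5):
    """Map a QE score in [0, 1] to a coarse quality label.

    The thresholds are hypothetical; in practice they would be tuned
    against human judgments for the content and language pair.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if score >= good_threshold:
        return "Good"
    if score >= regular_threshold:
        return "Regular"
    return "Bad"


for example_score in (0.92, 0.61, 0.18):
    print(example_score, categorize(example_score))
```

Adding a fourth category would only mean adding one more threshold, which is why the granularity is easy to adjust.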

QE frameworks and models


QuEst++ is open source software that focuses on QE for machine translation. Professor Lucia Specia’s team at the University of Sheffield developed it with assistance from other researchers. It has two main modules: the feature extraction module and the machine learning module.

The framework allows the extraction of several quality indicators from source segments and their translations. It can also draw on external resources, such as corpora, language models or topic models, as well as language tools like parsers or part-of-speech taggers. Finally, it offers machine learning algorithms that can be trained to build QE models.
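The two-module design can be sketched in a few lines of plain Python. This does not use QuEst++ or reproduce its API; the feature set, the nearest-neighbour learner, the function names and the toy training data are all assumptions made purely for illustration.

```python
import string


def extract_features(source, target):
    """Feature extraction module: turn a segment pair into numbers."""
    src_tokens = source.split()
    tgt_tokens = target.split()
    src_punct = sum(ch in string.punctuation for ch in source)
    tgt_punct = sum(ch in string.punctuation for ch in target)
    return [
        len(src_tokens),                            # source length in tokens
        len(tgt_tokens),                            # target length in tokens
        len(tgt_tokens) / max(len(src_tokens), 1),  # length ratio
        abs(src_punct - tgt_punct),                 # ASCII punctuation mismatch
    ]


def predict_quality(features, training_data, k=3):
    """Learning module: average the labels of the k nearest training examples."""
    def distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    nearest = sorted(training_data, key=lambda ex: distance(features, ex[0]))[:k]
    return sum(label for _, label in nearest) / len(nearest)


# Toy training set: (source, MT output, human quality score in [0, 1]).
# The sentences and scores are invented for illustration only.
labelled = [
    ("El gato duerme.", "The cat sleeps.", 0.9),
    ("¿Dónde está la estación de tren?", "Where station?", 0.2),
    ("Me gusta leer libros por la noche.", "I like to read books at night.", 0.8),
]
training_data = [(extract_features(s, t), y) for s, t, y in labelled]

new_features = extract_features("El perro come.", "The dog eats.")
score = predict_quality(new_features, training_data, k=1)
print(score)
```

A real system would use far richer features (language model scores, parser output and so on) and a stronger learner, but the split between extracting features and training a model on them stays the same.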


DeepQuest is a framework for neural-based QE, which was also developed at the University of Sheffield. It provides models for multilevel QE with two different architectures, both based on an RNN encoder-decoder. One of them is a two-stage, end-to-end stacked neural QE model that combines a Predictor and an Estimator.

The Predictor is an encoder-decoder RNN model that predicts words based on representations of their context. The Estimator, a bidirectional RNN model, uses the representations from the Predictor, known as QE feature vectors (QEFVs), to produce quality estimates for words, phrases and sentences.


Qualitative has been in the QE field for several years, but it is still a useful Python toolkit that uses QE to rank the sentence-level output of different MT systems. It contains the implementation of a basic pipeline for scoring given phrases with black-box models and applies a machine learning mechanism for classifying data based on pretrained models of human preferences. The preprocessing pipeline includes support for language models, parsing, language checking and various other preprocessors and feature generators. The code follows the principles of object-oriented programming to enable modularity and extensibility.


Built through a joint effort by Dublin City University and the University of Sheffield, Marmot presents itself as a toolkit developed in Python that is easy to learn and use. Marmot contains utilities targeted mostly at word-level QE, although its flexibility and modularity allow it to be extended to work at the phrase and sentence level. Experimental pipelines are easy to deploy with this framework via configuration files, which speeds up the development process. Finally, and very importantly, any scikit-learn algorithm can be used within the training pipeline, and the extracted features can be used in external tools such as Weka and CRF++.

Practical applications for QE 

While there is no doubt machine translation has done wonders for enhancing the speed of translation, we must still take into account several factors when advising our clients on the best translation solution. We start by keeping an eye on the quality of a translation. We can estimate the quality of translations at the segment and document level using the QE tools we discussed.

Score translation quality

Our segment-level scores map to categories that predict how much post-editing a segment needs: none at all (Good Quality), light post-editing (Adequate Quality) or full post-editing (Bad Quality). Our ability to determine the quality of translations helps us focus post-editing only on content that needs it. This cuts down on time, which saves you money.

Estimate time for post-editing

In addition to estimating translation quality, we can also gauge how much time and effort we will spend post-editing content. Based on past experience, for example, we can assume that segments with a low quality score will take longer to post-edit, which means both time and cost increase.
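One simple way to turn quality labels into an effort estimate is to assume a post-editing speed per category, as in the sketch below. The seconds-per-word rates, the function name and the sample segments are hypothetical; real rates would come from your own post-editing logs.

```python
# Hypothetical post-editing speeds in seconds per source word, by QE label.
SECONDS_PER_WORD = {"Good": 0.0, "Adequate": 1.5, "Bad": 4.0}


def estimate_post_editing_seconds(segments):
    """Sum the expected post-editing time for (source_text, label) pairs."""
    total = 0.0
    for source_text, label in segments:
        total += len(source_text.split()) * SECONDS_PER_WORD[label]
    return total


# Invented segments with invented QE labels, for illustration only.
segments = [
    ("Press the power button to start the device.", "Good"),
    ("Hold the reset key for five seconds.", "Adequate"),
    ("Warranty void if seal is broken.", "Bad"),
]
print(estimate_post_editing_seconds(segments))  # total seconds for the batch
```

Multiplying the total by an hourly rate gives a rough cost estimate, which is how lower quality scores translate into higher time and cost.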

Compare MT systems

Finally, it is also possible for us to compare MT systems based on QE scores to see which performs best for certain types of content. This is especially useful if you are trying to decide which engine, or which version of an engine, to use.
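A comparison like this can be as simple as averaging QE scores per engine and content type. The sketch below assumes scores between 0 and 1; the engine names, content types and scores are placeholders invented for the example.

```python
from collections import defaultdict
from statistics import mean


def rank_engines(scored_segments):
    """Rank engines by mean QE score, best first, for each content type.

    scored_segments: iterable of (engine, content_type, qe_score) tuples.
    """
    buckets = defaultdict(lambda: defaultdict(list))
    for engine, content_type, score in scored_segments:
        buckets[content_type][engine].append(score)
    return {
        content_type: sorted(
            ((engine, mean(scores)) for engine, scores in engines.items()),
            key=lambda pair: pair[1],
            reverse=True,
        )
        for content_type, engines in buckets.items()
    }


# Invented scores for two placeholder engines, for illustration only.
scores = [
    ("engine_a", "legal", 0.50), ("engine_a", "legal", 0.75),
    ("engine_b", "legal", 0.75), ("engine_b", "legal", 1.00),
    ("engine_a", "marketing", 1.00), ("engine_b", "marketing", 0.50),
]
ranking = rank_engines(scores)
for content_type, engines in ranking.items():
    print(content_type, engines)
```

In this toy data, one engine leads on legal content and the other on marketing content, which is the kind of per-content difference a single overall average would hide.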

Ready to leverage QE to reach your business goals?

Acclaro can help determine which MT system will best fit your needs. Starting with the MT system that is best suited for your particular localization program can help ensure timely translations delivered within your budget. Contact us today to learn how you can harness the power of MT to achieve your business goals.
