Retrospective: Machine Translation


Acclaro’s 10th anniversary is fast approaching, and today we want to take a look at one of the more striking changes in the translation services industry over the past decade: machine translation, or MT. There are two basic types of machine translation technology: rule-based systems and statistical (stats-based) systems.

Rule-Based Machine Translation Technology

Rule-based machine translation relies on built-in linguistic rules and bilingual dictionaries containing millions of entries for each language pair. This process requires extensive lexicons with morphological, syntactic, and semantic information, outlining how the information is parsed and displayed. The software uses these complex rule sets to transfer the grammatical structure of the source language into the target language.

Users can improve the out-of-the-box translation quality by adding their terminology into the translation process, on an as-needed basis. They create user-defined dictionaries which override the system’s default settings.
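To make the idea concrete, here is a minimal sketch of rule-based translation with a user-defined dictionary. It is a toy, not how Systran or any commercial engine is built: the four-word English-to-Spanish lexicon, the single adjective-noun reordering rule, and the names `DEFAULT_DICTIONARY` and `translate` are all assumptions made up for illustration.

```python
# Toy rule-based translation: dictionary lookup plus one reordering rule.
# Every entry and rule here is an illustrative assumption, not a real engine's data.

DEFAULT_DICTIONARY = {
    "the": ("el", "DET"),
    "red": ("rojo", "ADJ"),
    "car": ("coche", "NOUN"),
    "engine": ("motor", "NOUN"),
}


def translate(sentence, user_dictionary=None):
    """Translate word by word, then apply a single structural rule."""
    # User-defined entries override the system defaults.
    lexicon = {**DEFAULT_DICTIONARY, **(user_dictionary or {})}

    # Lexical transfer: look up each word; unknown words pass through untouched.
    tagged = [lexicon.get(word, (word, "UNK")) for word in sentence.lower().split()]

    # Structural transfer: English "ADJ NOUN" becomes Spanish "NOUN ADJ".
    output = []
    i = 0
    while i < len(tagged):
        is_adj_noun = (
            i + 1 < len(tagged)
            and tagged[i][1] == "ADJ"
            and tagged[i + 1][1] == "NOUN"
        )
        if is_adj_noun:
            output.extend([tagged[i + 1][0], tagged[i][0]])
            i += 2
        else:
            output.append(tagged[i][0])
            i += 1
    return " ".join(output)


print(translate("the red car"))
print(translate("the red car", {"car": ("auto", "NOUN")}))
```

The second call shows the customization step: the user entry for "car" replaces the default one, while the reordering rule still applies. That is why the output of this kind of system is predictable, and also why every new term or construction has to be added by hand.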

Ten years ago, rule-based MT systems (like Systran) were pretty much the only game in town. While rule-based MT worked well for large companies with straightforward terminology, such as those in the automotive or manufacturing industries, it wasn’t the best option for many other kinds of content, like marketing translations. All that changed with Google Translate, which works on a very different model.

Statistical Machine Translation Technology

Statistical machine translation analyzes and indexes huge amounts of monolingual and bilingual corpora (a.k.a. sets of existing texts and translations). We’re not kidding: a minimum of 2 million words for a specific domain, and even more for general language, is required before stats-based MT can really function effectively. As a result, statistical machine translation is CPU intensive and requires an extensive hardware configuration to run translation models at even average performance levels. However, you gain access to a much greater and more diverse pool of possible translations, thereby increasing the odds of getting good quality from the start.
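For a rough feel of what "learning from existing translations" means, the sketch below estimates word translation probabilities from a tiny bilingual corpus using expectation-maximization, in the style of the classic IBM Model 1 word-alignment approach. This is a swapped-in textbook technique chosen for illustration; we are not claiming it is what Google Translate or any particular product runs. The three sentence pairs and the function names `train_model1` and `best_translation` are assumptions.

```python
from collections import defaultdict


def train_model1(pairs, iterations=10):
    """Estimate t(target_word | source_word) from sentence pairs with EM."""
    corpus = [(src.split(), tgt.split()) for src, tgt in pairs]
    target_vocab = {word for _, tgt in corpus for word in tgt}

    # Start with no preference: every target word is equally likely.
    t_prob = defaultdict(lambda: 1.0 / len(target_vocab))

    for _ in range(iterations):
        count = defaultdict(float)  # expected (target, source) co-translations
        total = defaultdict(float)  # expected translations per source word

        # E-step: split each target word's "credit" among the source words
        # in the same sentence, in proportion to the current probabilities.
        for src, tgt in corpus:
            for tw in tgt:
                norm = sum(t_prob[(tw, sw)] for sw in src)
                for sw in src:
                    share = t_prob[(tw, sw)] / norm
                    count[(tw, sw)] += share
                    total[sw] += share

        # M-step: turn expected counts back into probabilities.
        for (tw, sw), c in count.items():
            t_prob[(tw, sw)] = c / total[sw]

    return t_prob


def best_translation(t_prob, source_word):
    """Return the most probable target word seen alongside source_word."""
    candidates = {tw: p for (tw, sw), p in t_prob.items() if sw == source_word}
    return max(candidates, key=candidates.get)


pairs = [
    ("the house", "das haus"),
    ("the book", "das buch"),
    ("a book", "ein buch"),
]
model = train_model1(pairs)
for word in ["the", "house", "book", "a"]:
    print(word, "->", best_translation(model, word))
```

Nobody told the program that "book" means "buch"; it worked that out from which words keep showing up together. With only three sentence pairs the estimates are fragile, which hints at why real systems need millions of words and serious hardware before the statistics become reliable.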

Google Translate is probably the largest and best-known stats-based system, and for good reason: Google has been indexing translations for years, so they have a large and varied resource base.

What’s the difference?

Rule-based MT provides good out-of-domain quality and is by nature predictable. Dictionary-based customization guarantees improved quality and compliance with corporate terminology, but translation results may lack the fluency readers expect. In terms of investment, the customization cycle needed to reach the quality threshold can be long and costly. The performance is high even on standard hardware.

Statistical MT provides good quality when large and qualified corpora are available. The translation is fluent, meaning it reads well and therefore meets user expectations. However, the translation is neither predictable nor consistent. Training from good corpora is automated and cheaper than rule-based customization. But when the system is trained on general language corpora, meaning text from outside the specified domain, the results are poor. Furthermore, statistical MT requires significant hardware to build and manage large translation models.

What’s next?

Good question. Technology changes make it an exciting time to work in the translation industry, and we’re optimistic about what the next ten years may bring. From our standpoint today, MT and human translation each have their place. While MT may be cheap and fast, it doesn’t always produce the best quality or cope well with the illogical and nuanced characteristics inherent in nearly every language. If you do use machine translation without a human-driven quality check, your global users might get the wrong message. Human translation is flexible, accurate, and conveys the right idea, but can be a slower and more expensive process. While the ideal solution would be a superhuman translator who works 24/7 and outputs a million words a month with 100% accuracy, that’s just not the reality of where we are... or at least, not yet. In some instances, a hybrid solution that combines an initial machine translation with a human post-edit may work, provided you need to get a very large amount of translation work done quickly and your content is a good match for current MT technology.

If you are curious about whether machine translation might be a good fit, we wrote a newsletter article on that very topic, or you can contact us for more information.

Photo attribution: jcorrius