The road to quality: the evolution of neural machine translation

By Jose de la Torre Vilariño

“Machine intelligence is the last invention that humanity will ever need to make.” Nick Bostrom

These days, AI is everywhere, and technologies like neural networks, which may sound like science fiction, are already sitting in our pockets. For language industry professionals, AI is becoming a big part of day-to-day work, but how can we use it to our advantage? And how can we adapt to the great future we have ahead?

Every professional in our field has heard about neural machine translation (NMT). However, only a few have enough information to know the facts, the advantages or how it will change the lives of professionals in this industry. So, let’s get started!

A little background first

Machine translation (MT) came about in the early ‘70s with rule-based machine translation, which defines a set of grammatical rules for converting text between languages. But it had huge disadvantages. It coped with content similar to what had already been reviewed, but any new content became chaos, and maintaining such systems was very time-consuming and expensive.

These issues were addressed with the evolution of statistical machine translation (SMT) in the ‘80s. SMT relies on how often phrases appear as translations of each other, transforming the training corpus into a phrase table. To get a little technical for a moment, these automatic translation systems consist of just a few components: the language model, the translation model, the phrase table and the decoder.

The language model is responsible for calculating the probability that a sentence in the target language is correct. It’s in charge of the fluency of the translation, and it is trained on as large a monolingual corpus of the target language as possible. Fluency ensures that literal translations (i.e. the words are all there, but the sense of the sentence is not) are replaced by a more natural-sounding translation.
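To make the idea of a language model concrete, here is a minimal sketch of a bigram model with add-one smoothing. It is only an illustration under simplifying assumptions: the three-sentence corpus and the function names are made up for this example, and production SMT systems use far larger corpora, higher-order n-grams and better smoothing.

```python
import math
from collections import Counter


def train_bigram_lm(sentences):
    """Count contexts and bigrams over whitespace-tokenized sentences."""
    contexts, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        contexts.update(tokens[:-1])             # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent token pairs
    # Everything that can follow a context: all words plus the end marker.
    vocab = {word for s in sentences for word in s.split()} | {"</s>"}
    return contexts, bigrams, len(vocab)


def log_prob(sentence, contexts, bigrams, vocab_size):
    """Log-probability of a sentence under the add-one smoothed bigram model."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    total = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        total += math.log((bigrams[(prev, word)] + 1) / (contexts[prev] + vocab_size))
    return total


# Toy monolingual corpus (hypothetical data, for illustration only).
corpus = ["the cat sleeps", "the dog sleeps", "the cat eats"]
model = train_bigram_lm(corpus)

fluent = log_prob("the cat sleeps", *model)
scrambled = log_prob("sleeps cat the", *model)
print(fluent > scrambled)  # the natural word order scores higher
```

Both test sentences contain exactly the same words; only the order differs, which is the kind of fluency signal the language model contributes.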

On the other hand, the translation model is in charge of establishing the correspondence between the source and target languages, and is trained using an aligned corpus. During this training phase, the system estimates the probability of a translation from the translations that appear in the training corpus. Finally, the decoder is responsible for searching within all possible translations for the most probable one in each case.
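The interplay between the translation model and the decoder can be sketched in a few lines. The example below is a toy under stated assumptions, not how Moses works internally: the aligned phrase pairs are invented, the stand-in language model (`toy_lm_log_prob`) is a hypothetical two-entry lookup, and the decoder simply tries every word-by-word combination without reordering, whereas real decoders use beam search and tuned feature weights.

```python
import math
from collections import Counter
from itertools import product


def estimate_phrase_table(aligned_pairs):
    """Relative-frequency estimate: p(target | source) = count(source, target) / count(source)."""
    pair_counts = Counter(aligned_pairs)
    source_counts = Counter(src for src, _ in aligned_pairs)
    table = {}
    for (src, tgt), n in pair_counts.items():
        table.setdefault(src, {})[tgt] = n / source_counts[src]
    return table


def decode(source_tokens, table, lm_log_prob):
    """Exhaustive monotone search for the highest-scoring translation.

    Returns None if some source token has no entry in the table.
    """
    options = [list(table.get(tok, {}).items()) for tok in source_tokens]
    best, best_score = None, -math.inf
    for choice in product(*options):
        candidate = " ".join(tgt for tgt, _ in choice)
        # Translation model score plus language model score, in log space.
        score = sum(math.log(p) for _, p in choice) + lm_log_prob(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best


# Hypothetical aligned word pairs, as if extracted from a parallel corpus.
pairs = ([("das", "the")] * 3 + [("das", "that")] + [("alte", "old")] * 2
         + [("Haus", "home")] * 3 + [("Haus", "house")] * 2)
table = estimate_phrase_table(pairs)

# Stand-in language model: bigrams it has "seen" get a high probability.
FLUENT_BIGRAMS = {("the", "old"), ("old", "house")}


def toy_lm_log_prob(sentence):
    tokens = sentence.split()
    return sum(math.log(0.5 if bigram in FLUENT_BIGRAMS else 0.01)
               for bigram in zip(tokens, tokens[1:]))


best = decode("das alte Haus".split(), table, toy_lm_log_prob)
print(best)  # the old house
```

In this toy data the translation model alone slightly prefers "home" for "Haus" (0.6 versus 0.4), but the language model score tips the decoder toward "the old house".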

At Acclaro, our MT team does corpus tokenization, language and translation model training, and tuning and testing on a set disjoint from the training data, using the tools provided by the statistical system Moses.

Let’s dive into neural machine translation (NMT)

In the world of technology, 2013 was ages ago. It was then that Nal Kalchbrenner and Phil Blunsom proposed the idea of encoding a given source text into a vector using a convolutional neural network (that’s a topic for another day) and generating the translation from that vector with a recurrent neural network (RNN). This end-to-end encoder-decoder structure for MT is regarded as the birth of neural machine translation (NMT).

One year later, in 2014, Sutskever et al. and Cho et al. developed a method called sequence-to-sequence (seq2seq) learning, using RNNs for both the encoder and the decoder. The use of long short-term memory (LSTM, a variety of RNN) for NMT brought huge improvements to the field. Thanks to the gate mechanism of LSTMs, the problem of “exploding/vanishing gradients” was brought under control, so the model could capture “long-distance dependencies” in a sentence much better.

But then Google happened and… “attention is all you need”

A Google paper from 2017, “Attention Is All You Need,” presented the transformer architecture, a model whose innovation was to replace recurrent layers, such as LSTMs, with so-called attention layers. These attention layers encode each word in a sentence as a function of the rest of the sequence, thus bringing context into the mathematical representation of the text.
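As a rough illustration of what an attention layer computes, here is a minimal sketch of scaled dot-product self-attention in plain Python. It makes strong simplifying assumptions: the two-dimensional word vectors are made up, and the learned query, key and value projections, the multiple heads and the positional encodings of a real transformer are all left out, so queries, keys and values are just the input vectors themselves.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Scaled dot-product self-attention with identity projections."""
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Similarity of this word to every word in the sequence.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        weights = softmax(scores)
        # New representation: weighted average of all word vectors.
        context = [sum(w * value[d] for w, value in zip(weights, vectors))
                   for d in range(dim)]
        outputs.append(context)
        all_weights.append(weights)
    return outputs, all_weights


# Three hypothetical word embeddings (not taken from any trained model).
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
contextual, weights = self_attention(embeddings)

for row in weights:
    print([round(w, 3) for w in row])
```

Each output vector is a mixture of every vector in the sequence, which is how the representation of one word comes to depend on the words around it.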

At Acclaro, our usual workflow with transformers starts once all the data transformations are done: we segment infrequent words into their corresponding sub-word units by applying the byte pair encoding (BPE) approach. Then, we train an encoder-decoder NMT model using transformers in open-source tools such as OpenNMT, running on powerful GPUs.
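For readers curious about how BPE splits words into sub-word units, here is a minimal sketch of the merge-learning idea. It is not our production pipeline: the word frequencies are invented, the function names are hypothetical, and in practice one would rely on an existing BPE implementation rather than hand-written code.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(word_freqs, num_merges):
    """Repeatedly merge the most frequent adjacent symbol pair."""
    # Start from characters, with an end-of-word marker on each word.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        vocab = {merge_pair(symbols, best): freq for symbols, freq in vocab.items()}
    return merges


def segment(word, merges):
    """Apply the learned merges, in order, to a new word."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical word frequencies from a training corpus.
word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges = learn_bpe(word_freqs, num_merges=5)
print(segment("lowest", merges))  # ['low', 'est</w>']
```

The word "lowest" never appears in the toy training data, yet it is segmented into two units the model has seen, which is why BPE helps with infrequent words.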

These days, NMT has grown to outperform statistical systems to such an extent that the developers of Moses, the most widely used statistical system to date, announced in October 2017 that its fourth version would be the last fully platform-tested incarnation.

This announcement indicated that researchers were convinced that statistical machine translation would give way to more computationally complex systems. Despite the great advances that NMT has provided, it still has weaknesses. Creating an NMT engine still involves very high costs. Also, it’s slow to train, not fully effective in translating terminology, and often leaves part of a segment untranslated, depending on the complexity of the words involved. Clearly, it’s still a work in progress in the scientific community.

Acclaro recognizes the importance and value of expert MT customized to our clients’ needs. It opens up a new world of opportunities and challenges to which the industry is adapting, and like every task, we approach it with great respect for knowledge and science. We want to make sure we’re using the best technology and processes when it comes to selecting or creating engines, evaluating their performance and applying the right type of post-editing.

We believe that every language approach has to be considered with the utmost respect for quality. In this case, that means working within a well-defined process that uses the most advanced techniques in AI, while offering clients the cutting-edge technology and quality they’ve come to expect.