In the field of language technology, engineers and researchers commonly are the ones who create and define terms. To the public, the distinctions between these terms are not always clear. Often, they’re downright confusing. Here at Acclaro, our mission is to simplify complicated concepts. With this terminology guide, we hope to help you understand key terminology surrounding machine translation technology.
Basic terms & tools
Let’s start by clarifying a few basic machine translation (MT) terms and tools.
Statistical machine translation (SMT)
Statistical machine translation is the use of statistical models that learn to translate text from a source language to a target language by training on a large corpus of examples. A 1990 article published in Computational Linguistics, A Statistical Approach to Machine Translation, explains this more fully: “Given a sentence T in the target language, we seek the sentence S from which the translator produced T. We know that our chance of error is minimized by choosing that sentence S that is most probable given T. Thus, we wish to choose S so as to maximize Pr(S|T).”
In short, it’s a paradigm where translations are generated based on statistical models using parameters learned from training with bilingual text corpora.
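To make the quoted idea concrete, here is a minimal sketch. By Bayes' rule, Pr(S|T) is proportional to Pr(S) multiplied by Pr(T|S), so choosing the most probable S means combining a language model score with a translation model score. The candidate sentences and every probability below are invented for illustration; a real SMT system estimates them from bilingual corpora.

```python
# Hypothetical candidate source sentences S for one observed target sentence T.
# "lm" stands in for Pr(S) from a language model and "tm" for Pr(T|S) from a
# translation model. All numbers are made up for this example.
candidates = {
    "the house is small": {"lm": 0.020, "tm": 0.30},
    "the home is small": {"lm": 0.010, "tm": 0.35},
    "small is the house": {"lm": 0.001, "tm": 0.30},
}


def best_source(cands):
    """Return the S that maximizes Pr(S) * Pr(T|S), which is proportional to Pr(S|T)."""
    return max(cands, key=lambda s: cands[s]["lm"] * cands[s]["tm"])


best = best_source(candidates)
print(best)
```

Note that the second candidate has the higher translation model score, but the first one wins once the language model score is taken into account.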
Neural machine translation (NMT)
With neural machine translation, large neural network models are used to predict the likelihood of a sequence of words. According to more than 30 Google researchers, “The strength of NMT lies in its ability to learn directly, in an end-to-end fashion, the mapping from input text to associated output text.”
Unlike SMT, which consumes more memory and time, NMT trains all of its components together, end to end, to maximize performance. These systems are at the forefront of machine translation, often outperforming traditional translation systems.
Hybrid machine translation
Hybrid systems use multiple machine translation approaches within one machine translation system. Usually, hybrid systems offer the best of both SMT and NMT. They populate dictionaries using corpora much smaller than those required by SMT systems, and leverage the great results from NMT. Some hybrid systems use SMT engines to identify errors, which are then corrected by language professionals. This is called statistical post-editing. Other hybrid solutions include combinations of network model architectures, like convolutional neural networks (CNNs), recurrent neural networks (RNNs) or other typical architectures, for sequence-to-sequence problems.
The term corpora (singular: corpus) refers to data sets or collections of texts.
Bilingual corpora are collections of texts that are in both the source language and target language and are perfectly aligned with each other.
Post-editing is the process in which a linguist reviews the machine output to assess its quality and how well it fits the purpose of the translation.
In back translation, text that has been translated into a different language is returned to its source language. This is usually done by a new translator or by an MT engine to check conceptual or cultural accuracy and to compare the overall quality of the translated text.
A segment is the unit of text being translated or stored in a corpus. Generally, it’s a sentence or sequence of words set off by an end-of-file (EOF) marker, a return or a punctuation mark.
Training is the process by which a system, whether SMT or NMT, is created. During this process, a set of parameters and hyper-parameters is adjusted by comparing the current outputs against reviewed translations.
Translation memory is a collection of segments (sentences) and their translations. In translation memory, the segments are not necessarily in order.
Translation memory eXchange (TMX)
Translation memory eXchange is an XML-based format for exchanging translation memories between tools. It’s used to exchange work between translators or to combine translation memories, and a translation memory can be file based or server based, depending on where it is stored.
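As a rough illustration, TMX files are XML documents in which each translation unit (tu) holds one variant (tuv) per language, and each variant wraps its text in a seg element. The sketch below reads a tiny hand-written sample with Python's standard library. The sample content, the header values and the read_tmx helper are our own assumptions for the example, not output from any particular tool.

```python
import xml.etree.ElementTree as ET

# A tiny hand-written TMX sample; the header attribute values are placeholders.
TMX_SAMPLE = """<tmx version="1.4">
  <header creationtool="example-tool" creationtoolversion="1.0"
          segtype="sentence" o-tmf="example" adminlang="en"
          srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Hello, world</seg></tuv>
      <tuv xml:lang="es"><seg>Hola, mundo</seg></tuv>
    </tu>
  </body>
</tmx>"""

# ElementTree exposes the xml:lang attribute under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx(text):
    """Return one {language: segment} dictionary per translation unit."""
    root = ET.fromstring(text)
    units = []
    for tu in root.iter("tu"):
        units.append({tuv.get(XML_LANG): tuv.findtext("seg") for tuv in tu.iter("tuv")})
    return units


units = read_tmx(TMX_SAMPLE)
print(units)
```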
Localization involves taking a product and making it linguistically and culturally appropriate to the target locale (country/region and language) where it will be used and sold. (Esselink, 2000, pg. 3). It’s more than a simple translation because it includes the adaptation to the local cultural context and local market.
Computer-assisted translation (CAT)
Computer-aided, or computer-assisted, translation is the use of software to help a translator during the translation process.
Since the very beginning, the main goal of machine translation has been to build a fully automatic, high-quality translation machine that does not require any human intervention. At a 1952 conference, Yehoshua Bar-Hillel, a pioneer in the field of machine translation, stated that fully automatic machine translation was unrealistic and years away. He coined the term “fully automatic high-quality machine translation” (FAHQMT) and said that it was essentially unattainable.
Fast forward to today: machine translation now lands between fully automatic high-quality translation (FAHQT) and FAHQMT, with a focus on the best fit for purpose rather than on quality.
More complex terms & tools
Now that we’ve covered some basic terms, let’s dive deeper to explain some additional terms and tools you might hear us mention or see in scientific or industry publications.
The encoder is the first piece in an encoder-decoder model that takes an input, like source text, and translates it into a fixed-length representation called a context vector. The encoder-decoder model is sometimes known as sequence to sequence (Seq2Seq). The model is built on a stack of several recurrent units, such as long short-term memory (LSTM) or gated recurrent units (GRU). Each of these accepts a single element of the input sequence, collects information about it and propagates it forward.
The decoder is the second piece in the encoder-decoder architecture. It’s called the decoder because in machine translation, texts are viewed as a series of codes, more specifically fixed-length vectors, that must be decoded, depending on the desired target language. Interestingly, once encoded, different decoding systems could be used to translate the context into different languages.
As Ian Goodfellow, Yoshua Bengio and Aaron Courville specify in their 2016 book, Deep Learning, “ … one model first reads the input sequence and emits a data structure that summarizes the input sequence. We call this summary the ‘context’ C … A second model, usually an RNN, then reads the context C and generates a sentence in the target language.”
Sequence to sequence (seq2seq)
A general sequence-to-sequence model converts sequences from one domain or language to sequences in another domain using two parts: an encoder and a decoder. Both parts are neural network models, combined into one larger network that forms the overall Seq2Seq model.
Recurrent neural network (RNN)
A recurrent neural network is a type of artificial neural network that processes sequential data, such as text in natural language processing (NLP) or time series data. RNNs are known for their “memory,” as they take information from prior inputs to influence the current input and subsequent output.
The output of an RNN depends on the prior elements within the sequence. RNNs deal with sequence prediction problems, which are best described by the types of inputs and outputs:
One to many: One input mapped to a sequence as an output
Many to one: A sequence as input mapped to a class or quantity (usually for time series data)
Many to many: A Seq2Seq problem, which is the type we describe in this article
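The “memory” of an RNN can be sketched in a few lines. In this simplified example the hidden state is a single number, and the weights w_x, w_h and b are arbitrary values we picked for illustration; a trained network learns them and works with vectors and matrices instead. The many_to_many function shows only the simplest case, with one output per input step.

```python
import math


def rnn_step(x, h_prev, w_x=0.5, w_h=0.8, b=0.0):
    """One recurrence step: the new hidden state mixes the input with the old state."""
    return math.tanh(w_x * x + w_h * h_prev + b)


def many_to_one(sequence):
    """Read a whole sequence and return only the final hidden state."""
    h = 0.0
    for x in sequence:
        h = rnn_step(x, h)
    return h


def many_to_many(sequence):
    """Return one hidden state per input element."""
    h, states = 0.0, []
    for x in sequence:
        h = rnn_step(x, h)
        states.append(h)
    return states


# The same inputs in a different order give a different result,
# because each step depends on what came before it.
print(many_to_one([1.0, 0.0]), many_to_one([0.0, 1.0]))
```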
Long short-term memory (LSTM)
Long short-term memory is a special type of RNN built around a new feature: an LSTM cell. At each recurrence step, this cell is able to determine which part of the previous hidden state should be used (or ignored) to compute the new hidden state. It can also determine which part of the previous hidden state must be updated (or left untouched) before passing it to the following step.
This allows an LSTM to maintain the necessary information contained in the first elements of the sequence as it moves through to the final step. Inside a sequence-to-sequence architecture, this final step computes the encoded array and passes it to the decoder. The beauty of LSTM is that it allows the decoder to know what the input was at the beginning of the original sequence.
Gated recurrent units (GRU)
Gated recurrent units aim to solve the vanishing gradient problem that comes with every standard RNN. It’s a common problem that prevents learning from long data sequences.
Gradients are values calculated during the training of a neural network and used to update the network’s weights. They carry the information used in the RNN parameter update, and when a gradient becomes very small, the update to the parameter becomes insignificant, which means no real learning takes place during training.
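A little arithmetic shows why long sequences are the problem. During training, the gradient that reaches an early time step is a product of one factor per step. The sketch below assumes, purely for illustration, that every factor is 0.5; in a real RNN the factors vary from step to step.

```python
def gradient_through_time(per_step_factor, steps):
    """Multiply one factor per time step, as backpropagation through time does."""
    grad = 1.0
    for _ in range(steps):
        grad *= per_step_factor
    return grad


short_run = gradient_through_time(0.5, 5)   # 0.03125
long_run = gradient_through_time(0.5, 50)   # roughly 8.9e-16

print(short_run, long_run)
```

After 5 steps the signal is small but usable. After 50 steps it is practically zero, so the earliest elements of the sequence stop influencing the weight updates.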
To solve the problem, GRU uses an update gate and reset gate to decide what information is passed to the output. GRU may be considered a variation of the LSTM because they’re similarly designed.
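Here is a minimal sketch of the two gates at work, again with a single number as the hidden state. The parameter names (wz, uz, bz and so on) are hypothetical, and published formulations differ on whether the update gate weights the old state or the new candidate; we use one common convention.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def gru_step(x, h_prev, p):
    """One GRU step on scalars; p is a dictionary of hypothetical weights."""
    z = sigmoid(p["wz"] * x + p["uz"] * h_prev + p["bz"])  # update gate
    r = sigmoid(p["wr"] * x + p["ur"] * h_prev + p["br"])  # reset gate
    # Candidate state: the reset gate decides how much of the old state to use.
    h_cand = math.tanh(p["wh"] * x + p["uh"] * (r * h_prev) + p["bh"])
    # The update gate blends the old state with the candidate.
    return (1.0 - z) * h_prev + z * h_cand


params = {"wz": 0.5, "uz": 0.5, "bz": 0.0,
          "wr": 0.5, "ur": 0.5, "br": 0.0,
          "wh": 0.5, "uh": 0.5, "bh": 0.0}

print(gru_step(1.0, 0.7, params))
```

When the update gate stays close to zero, the old state passes through almost unchanged, which is how information from early in a sequence can survive many steps.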
The Transformer is a deep learning model introduced in 2017 and used mostly in natural language processing tasks. The neural network architecture is based on a self-attention mechanism used for language understanding and machine translation.
BLEU Scores for the Standard WMT newstest2014 EN-DE benchmark
The BLEU scores reflect how the Transformer surpassed earlier models on this benchmark. The Transformer applies the self-attention mechanism, which directly models relationships between all the words in a sentence, regardless of their position. It uses those relationships to produce a context-aware output for every word, no matter where that word sits in the sentence.
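The core of self-attention can be sketched without any libraries. In this simplified version the queries, keys and values are all the raw word vectors themselves; a real Transformer first passes them through learned projections and runs several attention heads in parallel. The two-dimensional toy vectors are invented for the example.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Scaled dot-product self-attention with queries = keys = values = vectors."""
    d = len(vectors[0])
    outputs, weights = [], []
    for q in vectors:
        # Compare this word with every word in the sentence, itself included.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in vectors]
        w = softmax(scores)
        weights.append(w)
        # The output is a weighted mix of all the word vectors.
        outputs.append([sum(wj * v[i] for wj, v in zip(w, vectors)) for i in range(d)])
    return outputs, weights


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]  # toy vectors for three words
outputs, weights = self_attention(tokens)
print(weights[0])
```

The first and third words have identical vectors, so the first word attends to both equally even though one of them is two positions away.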
Bidirectional encoder representations from transformers (BERT)
BERT stands for bidirectional encoder representations from transformers. In the pre-BERT world, text sequences were read during training from left to right or from right to left, either individually or combined. This one-directional approach works decently well for generating sentences by predicting the next word. But when BERT entered the picture with its bidirectional approach, it was able to understand context more deeply while learning. Instead of simply predicting the next word in a sequence, BERT uses a technique called masked language modeling (Masked LM or MLM).
MLM randomly masks words in the sentence and then tries to predict them, using an input representation made up of three embedding layers:
Token embeddings: Denote the tokens in the sentence(s), including the special tokens that separate them
Segment embeddings: Mark which sentence each token belongs to
Positional embeddings: Indicate the position of each token in the sentence
The input representation for BERT
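The masking step itself is easy to sketch. The function below hides each token with a fixed probability and records which original tokens the model would have to predict. The mask_tokens name, the seed parameter and the example sentence are our own; the 15% rate and the [MASK] symbol follow the BERT paper, which also sometimes swaps in a random word or leaves the word unchanged, a detail we leave out here.

```python
import random


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Hide some tokens and remember the originals as prediction targets."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append("[MASK]")
            labels.append(tok)    # the model must predict this token
        else:
            masked.append(tok)
            labels.append(None)   # not a prediction target
    return masked, labels


sentence = "the translation was reviewed by a linguist".split()
masked, labels = mask_tokens(sentence, mask_prob=0.3, seed=1)
print(masked)
```

Because the model sees the words on both sides of each hidden token, it has to use context from both directions to fill the gap.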
In 2018, Google released BERT (described above), which started a wider revolution in the industry that went on to include Baidu’s ERNIE, OpenAI’s GPT-3 and Facebook AI’s BART. BART is a pre-trained model for both text generation and comprehension that combines the bidirectionality of BERT with autoregressive methods. BART is trained by corrupting text with an arbitrary noising function and training a model to reconstruct the original text.
BART uses a standard Transformer-based seq2seq/NMT architecture with a bidirectional encoder and a left-to-right decoder. The encoder’s attention mask is fully visible, like BERT, and the decoder’s attention mask is causal, like GPT (we will explain this one in a second).
In 2020, Facebook AI continued developing the BART model. They introduced mBART, which is the first method for pre-training a complete sequence-to-sequence model by denoising full texts in multiple languages for machine translation purposes. You should believe us when we say it’s very good.
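The difference between the two attention masks is easiest to see as small grids. In this sketch, a 1 in row i and column j means that position i is allowed to look at position j. The function names are ours.

```python
def full_mask(n):
    """Fully visible mask: every position can see every other position."""
    return [[1] * n for _ in range(n)]


def causal_mask(n):
    """Causal mask: each position sees only itself and the positions before it."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


for row in causal_mask(3):
    print(row)
# [1, 0, 0]
# [1, 1, 0]
# [1, 1, 1]
```

The fully visible mask lets an encoder read the whole input in both directions, while the causal mask stops a decoder from peeking at words it has not generated yet.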
Generative pre-training (GPT)
GPT stands for generative pre-training. This Transformer-based architecture features training that follows a two-stage procedure. First, a language model is created using unlabeled data to learn the initial parameters of the neural network. Next, these parameters are adapted to a target task using corresponding supervised objectives.
RoBERTa is an extension of BERT with the following modifications included:
More extensive training, with bigger batches and more data
Removing the next sentence prediction objective (used in the BERT training process)
Training on longer sequences
Dynamically changing the masking pattern applied to the training data
In machine translation, a beam search strategy translates a sentence word by word from left to right while keeping a fixed number of candidate partial translations active at each step. The fixed number is called the beam size, and the larger the beam, the more candidates are considered when choosing the next best word. Considering more candidates generally improves translation quality, up to a point. However, decoder speed is reduced significantly.
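A toy example shows what a wider beam buys. The next-word probabilities below are invented, and for simplicity they depend only on the previous word; a real NMT decoder conditions on the source sentence and on everything generated so far.

```python
import math

# Hypothetical next-word probabilities, keyed by the previous word.
NEXT = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.55, "dog": 0.45},
    "a": {"cat": 0.9, "dog": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}


def beam_search(beam_size, max_len=5):
    """Return the most probable word sequence found with the given beam size."""
    beams = [(0.0, ["<s>"])]  # (log-probability, words so far)
    for _ in range(max_len):
        expanded = []
        for score, words in beams:
            if words[-1] == "</s>":
                expanded.append((score, words))  # finished, keep unchanged
                continue
            for word, p in NEXT[words[-1]].items():
                expanded.append((score + math.log(p), words + [word]))
        # Keep only the beam_size most probable candidates.
        beams = sorted(expanded, key=lambda b: b[0], reverse=True)[:beam_size]
    return beams[0][1]


print(beam_search(1))  # ['<s>', 'the', 'cat', '</s>']
print(beam_search(2))  # ['<s>', 'a', 'cat', '</s>']
```

With a beam of 1, the search commits to “the” because it looks best at the first step, and ends with a sequence of probability 0.33. With a beam of 2, it keeps “a” alive long enough to find the sequence with probability 0.36, at the cost of scoring twice as many candidates.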
Bilingual synthetic sentences are sentence pairs in which the target-language side is generated by machine translation. The MT engines producing this content are high-scoring engines, and the source content, or monolingual content, usually comes from low-resource languages or clients with very specific domain content.
Byte pair encoding (BPE)
Byte pair encoding, or digram coding, is one of the simplest methods for data compression. It works by finding the most common pair of consecutive bytes of data (in NLP these are characters) and replacing it with a byte that does not occur within that data. Let’s look at an example taken from Wikipedia: Suppose the data to be encoded is
aaabdaaabac
The byte pair “aa” occurs most often, so it will be replaced by a byte that is not used in the data, “Z.” Now there is the following data and replacement table:
ZabdZabac
Z=aa
Then the process is repeated with byte pair “ab,” replacing it with “Y”:
ZYdZYac
Y=ab
Z=aa
The only literal byte pair left occurs only once, and the encoding might stop here. Or the process could continue with recursive byte pair encoding, replacing “ZY” with “X”:
XdXac
X=ZY
Y=ab
Z=aa
This data cannot be compressed further with byte pair encoding, because there are no pairs of bytes that occur more than once.
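The replacement steps can be reproduced in a few lines of Python. This sketch takes aaabdaaabac, the input string used in the Wikipedia example, and applies the three replacements in the order described above rather than picking pairs automatically. One reason for that choice: after the first step, “Za” and “ab” are equally frequent, so an automatic version would need a tie-breaking rule. The function names are ours.

```python
from collections import Counter


def count_pairs(data):
    """Count every adjacent pair of symbols, using overlapping windows."""
    return Counter(data[i:i + 2] for i in range(len(data) - 1))


def replace_pair(data, pair, symbol):
    """Replace non-overlapping occurrences of pair, scanning left to right."""
    return data.replace(pair, symbol)


def decode(data, table):
    """Undo the replacements, most recent first."""
    for symbol, pair in reversed(table):
        data = data.replace(symbol, pair)
    return data


data = "aaabdaaabac"
table = []
for pair, symbol in [("aa", "Z"), ("ab", "Y"), ("ZY", "X")]:
    data = replace_pair(data, pair, symbol)
    table.append((symbol, pair))

print(data)                 # XdXac
print(decode(data, table))  # aaabdaaabac
```

The replacement table is what makes the compression reversible: decoding simply applies the table backwards.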
Ensemble methods are a machine learning technique that combines several base models in order to produce one optimal predictive model. In its simplest form, an ensemble averages the outputs of various models to produce a more general model.
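Averaging is simple enough to show directly. In this sketch, three hypothetical models each assign probabilities to the same two candidate words, and the ensemble takes the mean. The model names and numbers are invented for the example.

```python
def ensemble_average(distributions):
    """Average several models' probability distributions over the same words."""
    words = distributions[0].keys()
    n = len(distributions)
    return {w: sum(d[w] for d in distributions) / n for w in words}


model_a = {"cat": 0.7, "dog": 0.3}
model_b = {"cat": 0.4, "dog": 0.6}
model_c = {"cat": 0.6, "dog": 0.4}

averaged = ensemble_average([model_a, model_b, model_c])
best = max(averaged, key=averaged.get)
print(best)
```

Model B on its own would pick “dog,” but the averaged output follows the majority and picks “cat.”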
With fine-tuning, a model that has already been trained for a specific task is tuned to perform a second similar task.
In machine translation, the term distillation refers to the process of knowledge distillation where we transfer knowledge from a large model to a smaller one.
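One common way to transfer that knowledge, sketched here under our own assumptions, is to train the small “student” model to match the large “teacher” model’s softened output probabilities. The temperature value and the scores below are illustrative, and real setups usually combine this loss with the ordinary training loss.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a higher temperature flattens them."""
    scaled = [v / temperature for v in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Cross-entropy between the softened teacher and student distributions."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher, student))


teacher = [3.0, 1.0, 0.0]  # hypothetical teacher scores for three candidate words
print(distillation_loss(teacher, [3.0, 1.0, 0.0]))  # student agrees with teacher
print(distillation_loss(teacher, [0.0, 1.0, 3.0]))  # student disagrees
```

The loss is lowest when the student reproduces the teacher’s distribution, so minimizing it pulls the smaller model toward the larger one’s behavior.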
Harness the power of cutting-edge translation technology
Machine translation is a rapidly evolving field — one that often seems to change by the minute. And, as new breakthroughs continue to transform our understanding of what’s possible, the language around machine translation will also grow to describe it. We hope we can help demystify the science behind translation terminology by explaining it in simple terms. If you’d like to know how our team can make cutting-edge language technology work for you, contact us today.