Let’s start simple. What is a language model?
A language model (LM) is a set of statistical and probabilistic techniques used to estimate the probability of a word occurring in a sentence, or of the sentence as a whole. These models analyze language by running text through algorithms that learn rules for context in natural language, and they rely on large corpora of text as the basis for their word predictions. Today, LMs are the backbone of almost every application in natural language processing (NLP).
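To make "the probability of words occurring in a sentence" concrete, here is a minimal sketch of a bigram model, one of the simplest statistical LMs. The toy corpus and the function names are made up for illustration, and there is no smoothing, so any unseen word pair gets probability zero.

```python
from collections import Counter, defaultdict

# Toy corpus, invented for this example. Real models use far larger corpora.
CORPUS = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat chased the dog",
]

def train_bigram_counts(sentences):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        # <s> and </s> mark the start and end of a sentence.
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            counts[prev][word] += 1
    return counts

def word_probability(counts, prev, word):
    """Estimate P(word | prev) from raw counts (no smoothing)."""
    total = sum(counts[prev].values())
    return counts[prev][word] / total if total else 0.0

def sentence_probability(counts, sentence):
    """Multiply the conditional probabilities of each word given the previous one."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        prob *= word_probability(counts, prev, word)
    return prob

counts = train_bigram_counts(CORPUS)
p_cat_after_the = word_probability(counts, "the", "cat")
p_natural = sentence_probability(counts, "the cat sat on the mat")
p_scrambled = sentence_probability(counts, "mat the on sat cat the")
print(p_cat_after_the, p_natural, p_scrambled)
```

The natural word order gets a small but nonzero probability, while the scrambled sentence gets zero because its word pairs never appear in the corpus. Modern neural LMs replace the count table with a learned network, but the quantity being estimated is the same.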
Let’s take a look at some of the most important applications. Maybe you’re already using an LM without even knowing it!
Optical character recognition (OCR)
OCR is used all over the world to recognize text inside images, from scanned documents to photos. This technology can convert virtually any kind of image containing written text into computer-readable text.
Machine translation (MT)
At its most basic level, machine translation (MT) is the substitution of words in one language for words in another. In a globalized world, it’s an essential capability across text and speech applications in fields as varied as government, military, finance, health care and e-commerce.
Speech recognition
Voice assistants such as Siri or Alexa are examples of language models in our everyday lives. They’re prominent examples of how LMs help machines process speech audio.
Sentiment analysis
There’s a good chance that every opinion or comment you’ve ever posted on social media has been used somehow in a sentiment analysis process. Businesses use sentiment analysis to understand social sentiment about their brands. In fact, it’s one of the major ways of monetizing social networks. No wonder they’re billion-dollar businesses!
But getting back to our topic, language models are crucial in modern NLP applications. They’re the main reason machines are able to process language: they transform qualitative information into quantitative information, which is what allows machines to understand people.
The roots go back to the 1948 paper “A Mathematical Theory of Communication,” in which Claude Shannon used a stochastic model called the Markov chain to build a statistical model of the sequences in English text, an early precursor of what we now call N-grams. But it wasn’t until the 1980s and the rise of computers that more complex systems made statistical models the norm. It was a big decade for NLP. John Hopfield introduced the Hopfield network, an early form of recurrent neural network, and Geoffrey Hinton, one of the fathers of modern AI, introduced the idea of representing words as vectors. We had to wait until 2003 for the first neural language model, built on a feed-forward neural network, but from then on, we haven’t looked back.
Let’s take a look at some of the most important language models based on today’s neural networks.
The top 5 language models that accelerated natural language processing
BERT
BERT, or Bidirectional Encoder Representations from Transformers, is a pre-trained NLP language model developed by Google in 2018. BERT was the first truly bidirectional, or nondirectional, unsupervised language representation. Previous models, such as Word2vec or GloVe, generate a single embedding for each word in the vocabulary, whereas BERT takes into account the context and position in the sentence for each occurrence of a given word.
For example, models such as Word2vec will have the same representation for the word “right” in the following three sentences:
- You have the right to defend yourself.
- She just gave the right answer.
- We should make a right at the next corner.
BERT, on the other hand, provides a contextualized embedding that differs from sentence to sentence. Each occurrence is effectively treated as a different word with a different meaning, the same distinction humans make without effort.
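The difference between a static lookup and a context-dependent embedding can be sketched in a few lines. This is a toy illustration only: the three-dimensional vectors are invented, and the "contextual" function simply blends a word's vector with the average of its sentence, which is far cruder than BERT's Transformer layers. All names here are hypothetical.

```python
# Static table: one vector per word, whatever the context (Word2vec/GloVe style).
# The numbers are made up for illustration.
STATIC = {
    "right": [0.3, 0.3, 0.3],
    "defend": [1.0, 0.0, 0.0],
    "answer": [0.0, 1.0, 0.0],
    "corner": [0.0, 0.0, 1.0],
}
UNKNOWN = [0.0, 0.0, 0.0]  # fallback for words missing from the toy table

def static_embedding(tokens, index):
    """Look the word up; the surrounding sentence is ignored."""
    return STATIC.get(tokens[index], UNKNOWN)

def toy_contextual_embedding(tokens, index):
    """Blend the word's own vector with the mean vector of its sentence.

    A crude stand-in for contextualization, NOT how BERT works internally.
    """
    vectors = [STATIC.get(token, UNKNOWN) for token in tokens]
    sentence_mean = [sum(dim) / len(vectors) for dim in zip(*vectors)]
    own = vectors[index]
    return [0.5 * o + 0.5 * m for o, m in zip(own, sentence_mean)]

SENTENCES = [
    "you have the right to defend yourself".split(),
    "she just gave the right answer".split(),
    "we should make a right at the next corner".split(),
]

static_vectors = [static_embedding(s, s.index("right")) for s in SENTENCES]
contextual_vectors = [toy_contextual_embedding(s, s.index("right")) for s in SENTENCES]

for tokens, vec in zip(SENTENCES, contextual_vectors):
    print(" ".join(tokens), "->", [round(v, 3) for v in vec])
```

The static lookup returns the identical vector for "right" in all three sentences, while the toy contextual version returns three different vectors, each pulled toward the other words in its sentence.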
T-NLG
In 2020, Microsoft’s T-NLG, or Turing Natural Language Generation, became the largest model published up to that point, with 17 billion parameters, outperforming state-of-the-art language models on benchmarks for practical tasks such as summarization and question answering. T-NLG is a Transformer-based generative LM, which means it can generate words to complete open-ended textual tasks.
“Beyond saving our users time by summarizing documents and emails, T-NLG can enhance experiences with the Microsoft Office suite by offering writing assistance to authors and answering questions that readers may ask about a document,” noted Microsoft AI Research applied scientist Corby Rosset.
GPT-3
From OpenAI comes GPT-3, the successor to GPT and GPT-2. For comparison, the previous version, GPT-2, had around 1.5 billion parameters, far fewer than Microsoft’s 17-billion-parameter T-NLG. But erase all that from your memory, because OpenAI went to 175 billion parameters with GPT-3, about 10 times more than the next-largest model.
“GPT-3 achieves strong performance on many NLP data sets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing three-digit arithmetic,” according to researchers.
The sheer size of the model puts it out of reach for almost everyone except a few select companies and research labs. Its main contribution, though, is making cutting-edge NLP more accessible, because it requires neither large task-specific data sets nor task-specific model architectures.
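To get a feel for where these parameter counts come from, a common back-of-the-envelope estimate for a Transformer is about 12 × layers × hidden size². Each layer has roughly 4 × hidden² attention weights and 8 × hidden² feed-forward weights, and embeddings, biases and normalization parameters are ignored. The sketch below applies that rule. The layer counts and hidden sizes are the figures I recall from each model’s public description and are not stated in this article, so treat them as assumptions.

```python
def estimate_transformer_params(num_layers, hidden_size):
    """Rough non-embedding parameter count: 12 * layers * hidden_size^2."""
    return 12 * num_layers * hidden_size ** 2

# (number of layers, hidden size): assumed values, not taken from this article.
CONFIGS = {
    "GPT-2": (48, 1600),
    "T-NLG": (78, 4256),
    "GPT-3": (96, 12288),
}

estimates = {
    name: estimate_transformer_params(layers, hidden)
    for name, (layers, hidden) in CONFIGS.items()
}

for name, params in estimates.items():
    print(f"{name}: ~{params / 1e9:.1f} billion parameters")

ratio = estimates["GPT-3"] / estimates["T-NLG"]
print(f"GPT-3 is ~{ratio:.1f}x the size of T-NLG")
```

Under these assumptions the estimate lands close to the headline numbers (roughly 1.5, 17 and 174 billion), and the GPT-3 to T-NLG ratio comes out at about 10, which is where the “10 times larger” figure comes from.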
Megatron-LM
Alongside advances such as BERT, GPT-2 and GPT-3, it was only a matter of time before a major GPU manufacturer pushed the limits of the technology. Enter Nvidia’s Megatron-LM, an 8.3 billion parameter Transformer language model trained on 512 GPUs.
Nvidia tackled the question, “Is having better NLP models as easy as having larger models?” They showed that simply increasing the size of the BERT model from 336 million to 1.3 billion parameters decreased accuracy and compounded the larger models’ issues with memory. To see how careful you must be with layer normalization when increasing model size, “Language modeling using megatron A100 GPU” is a must-read.
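The point about layer normalization is easier to see in code. Below is a minimal sketch of the two usual orderings: normalizing after the residual addition (the original BERT layout) and normalizing the sublayer input while leaving the residual path untouched. I’m assuming the rearrangement Nvidia describes amounts to the second form. This is a generic illustration, not Nvidia’s code, and `toy_sublayer` is a hypothetical stand-in for attention or a feed-forward layer.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def toy_sublayer(x):
    """Hypothetical stand-in for an attention or feed-forward sublayer."""
    return [2.0 * v + 1.0 for v in x]

def post_ln_block(x, sublayer):
    """Original BERT ordering: add the residual, then normalize the sum."""
    summed = [a + b for a, b in zip(x, sublayer(x))]
    return layer_norm(summed)

def pre_ln_block(x, sublayer):
    """Rearranged ordering: normalize the sublayer input, keep the residual raw."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

x = [1.0, 2.0, 3.0, 4.0]
post_out = post_ln_block(x, toy_sublayer)
pre_out = pre_ln_block(x, toy_sublayer)
print("post-LN:", [round(v, 3) for v in post_out])
print("pre-LN: ", [round(v, 3) for v in pre_out])
```

In the second form the input passes straight through the residual connection, so a deep stack always has an unnormalized path from input to output. That is the usual explanation for why this ordering tends to train more stably as models grow.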
ELMo (Author’s attachments to this model should be considered.)
In 2018, the paper “Deep Contextualized Word Representations” introduced ELMo as a new technique for embedding words into a vector space using bidirectional LSTMs trained on a language modeling objective. In addition to beating several NLP benchmarks, ELMo was shown to reduce the training data needed by a potential factor of 10 while achieving the same results.
The model, developed by the Allen Institute for AI (the team behind AllenNLP), is a deep bidirectional language model (biLM) built on biLSTMs and pre-trained on a huge text corpus. Its main differentiator is how easily it can be added to existing models, drastically improving tasks such as Q&A, sentiment analysis or summarization.
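Part of what makes ELMo easy to plug into an existing model is that the downstream task only learns a small set of mixing weights: the final embedding for a token is a softmax-weighted sum of the biLM’s layer outputs, scaled by a single factor. The sketch below shows that mixing step on its own. The layer vectors are made-up numbers and the function names are hypothetical, so this is an illustration of the idea, not the AllenNLP implementation.

```python
import math

def softmax(weights):
    """Turn arbitrary real weights into positive weights that sum to 1."""
    largest = max(weights)
    exps = [math.exp(w - largest) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]

def elmo_style_mix(layer_vectors, layer_weights, gamma=1.0):
    """Weighted sum of per-layer vectors for one token, scaled by gamma."""
    s = softmax(layer_weights)
    dim = len(layer_vectors[0])
    return [
        gamma * sum(s_j * layer[d] for s_j, layer in zip(s, layer_vectors))
        for d in range(dim)
    ]

# Three biLM layer outputs for a single token (invented 2-d vectors).
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Equal raw weights: each layer contributes one third.
mixed_equal = elmo_style_mix(layers, [0.0, 0.0, 0.0])

# A large weight on the last layer makes the mix lean almost entirely on it.
mixed_top = elmo_style_mix(layers, [0.0, 0.0, 10.0])

print(mixed_equal, mixed_top)
```

In a real setup the layer weights and `gamma` would be trained together with the downstream model, letting each task decide which biLM layers matter most.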
The future of language models
Although bigger is not always better, the amount of data is critical when working with language models. In general, the bigger the model, and the more diverse and comprehensive the pre-training data, the better the results.
As Microsoft scientist Corby Rosset put it, “We believe it is more efficient to train a large centralized multitask model and share its capabilities across numerous tasks.”
Language models like GPT-3 or BERT may be able to complete open-ended textual tasks by generating words, build summaries or answer direct questions, but the costs are high, with expensive data sets and millions in resources. So, although we’re not in the race yet, we’ll definitely stay tuned and take the benefits that come our way. Let’s just hope they’re for everyone.