As the internet expands, our personal data becomes more and more vulnerable. Since we each have a digital self to protect, internet privacy is more important than ever. And in a world where privacy violations are far too common, we need to stop and reboot if we are going to adapt.
With this in mind, the EU enacted the General Data Protection Regulation (GDPR) in 2018. Garnering worldwide attention, the regulation standardized the way personal data can be collected, processed and used. It also fueled one of the fastest-growing areas of natural language processing (NLP): data anonymization.
What is data anonymization?
Anonymization is a data processing technique that removes or modifies personally identifiable information so that it can't be associated with any one individual. Anonymizing data of all formats, and especially unstructured data such as text documents and social media posts, has been widely studied in recent years. But because it remains a largely manual task, it needs more attention from researchers.
As a language company, we're already receiving hundreds of requests from one field in particular: the law. Lawyers and legal professionals find it extremely valuable to protect references to the parties and confidential files in their documents. Fortunately, machine learning specialists have made breakthroughs in the development of hybrid techniques known as named entity recognition (NER).
Understanding named entity recognition (NER)
NER is an NLP task that aims to identify named entities in text, such as the names of persons, organizations, locations, phone numbers, dates or emails. In short, NER seeks to find any term that might be used to identify an individual or that contains sensitive information.
Now, we're going to get technical. Once we have all the terms identified, we anonymize (or transform) the data so that it can be published or released without revealing the confidential information it contains. This process is guided by statistical disclosure control (SDC) methods. There are many tagging schemas to choose from based on client preference, but Acclaro typically uses an Inside-Outside-Beginning (IOB) format called IOB2. Here, each word in the text gets labeled with one of three possible tags: B for the beginning of a named entity, I for a word inside (continuing) an entity, and O for a word outside any entity.
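To make the IOB2 scheme concrete, here is a minimal Python sketch. The sentence, tags and entity labels (PER, ORG, LOC) are illustrative assumptions; the helper simply groups tagged tokens back into entity spans:

```python
# A minimal sketch of IOB2 tagging: each token is paired with a tag,
# where B- marks the first token of an entity, I- a continuation,
# and O any token outside a named entity. Example data is illustrative.
tokens = ["John", "Smith", "works", "at", "Acclaro", "in", "New", "York", "."]
tags   = ["B-PER", "I-PER", "O", "O", "B-ORG", "O", "B-LOC", "I-LOC", "O"]

def extract_entities(tokens, tags):
    """Group IOB2-tagged tokens into (text, label) entity spans."""
    entities, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):          # a new entity starts here
            if current:
                entities.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(tok)           # continue the open entity
        else:                             # an O tag closes any open entity
            if current:
                entities.append((" ".join(current), label))
            current, label = [], None
    if current:
        entities.append((" ".join(current), label))
    return entities

print(extract_entities(tokens, tags))
# [('John Smith', 'PER'), ('Acclaro', 'ORG'), ('New York', 'LOC')]
```

Because every entity starts with an explicit B- tag, IOB2 can distinguish two adjacent entities of the same type, which is why it's preferred over the older IOB1 variant.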
Machine learning enters the picture
To get to these schemas, we need to use a series of machine learning techniques, including:
Rule-based approaches: Designed by the client or manually developed as needed, this technique relies on handcrafted rules and therefore doesn’t require annotated data.
Unsupervised learning: In this case, NER relies on unsupervised algorithms without hand-labeled training examples. Sometimes public databases like WordNet are used as well.
Feature-based supervised learning: This approach relies on supervised learning algorithms with careful feature engineering.
Deep learning: This technique automatically discovers the representations needed for classification and detection from raw input in an end-to-end manner.
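The first technique above is the easiest to illustrate. Here is a hedged sketch of a rule-based recognizer: handcrafted regular expressions, no annotated training data. The patterns are deliberately simplified and would need hardening for production use:

```python
import re

# A sketch of the rule-based approach: handcrafted patterns match
# well-structured entity types directly. Patterns are simplified
# illustrations, not production-grade rules.
PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "DATE":  re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
}

def anonymize(text):
    """Replace every matched span with its label as a placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

sample = "Contact jane.doe@example.com or 555-123-4567 before 01/31/2024."
print(anonymize(sample))
# Contact [EMAIL] or [PHONE] before [DATE].
```

Rules like these work well for entities with predictable formats (emails, phone numbers, dates) but break down on names and organizations, which is where the supervised and deep learning techniques take over.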
Several machine learning algorithms have been proposed using the techniques above, although current research, including our own work, focuses mostly on deep learning with word embeddings. Some of the most widely used context-encoder architectures involve convolutional neural networks, recurrent neural networks or deep transformers, all topics for another article. (If you're eager to learn more, I strongly recommend reading A Survey on Deep Learning for Named Entity Recognition, which describes the major techniques and common algorithms used in open source code, as well as almost every commercial application.)
How it’s done: examples of open source NER for developers
Open source application programming interfaces (APIs) are completely free, flexible and easy to integrate with other tools. However, they do come with a learning curve, which is why they're mostly used by developers. Here are some of our favorite options if you want to jump into NER:
Stanford named entity recognizer (SNER): A Java tool developed by Stanford University, this is considered one of the standard libraries for entity extraction. It's based on Conditional Random Fields (CRF) and offers pretrained models for the extraction of persons, organizations, locations and other entities.
SpaCy: This Python framework is known for being fast and very easy to use. It ships with an excellent set of pretrained statistical models, and you can also build custom NER extractors.
Natural language toolkit (NLTK): This library for Python is often used for NLP tasks. It has its own classifier for recognizing named entities called ne_chunk, but also provides a wrapper for using the Stanford NER tagger in Python.
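As a quick taste of one of these tools, here is a hedged sketch using spaCy's EntityRuler to add custom patterns to a blank English pipeline. This avoids downloading a pretrained model; the labels and phrases are illustrative assumptions, and in practice you would usually combine rules like these with a pretrained model such as en_core_web_sm:

```python
import spacy

# A minimal sketch: spaCy's EntityRuler attaches rule-based NER
# patterns to a blank English pipeline (no model download needed).
# Labels and example phrases are illustrative assumptions.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "ORG", "pattern": "Acclaro"},
    {"label": "GPE", "pattern": [{"LOWER": "new"}, {"LOWER": "york"}]},
])

doc = nlp("Acclaro has clients in New York.")
print([(ent.text, ent.label_) for ent in doc.ents])
# [('Acclaro', 'ORG'), ('New York', 'GPE')]
```

Swapping spacy.blank("en") for a pretrained pipeline gives you statistical entity detection out of the box, with the ruler's patterns layered on top.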
NER use cases
NER is suitable for more than just anonymization. It’s great for any situation where a general overview of a large amount of text is helpful. Let’s look at a few of the notable use cases for NER besides anonymization.
Accelerate the hiring process by summarizing the resumes of candidates, building internal filtering pipelines or categorizing employee complaints or questions.
Improve response times by categorizing user requests, filtering by priority or even solving the easy ones with conversational AI or chatbots.
Enable students and researchers to find relevant material faster by summarizing articles and highlighting key terms, topics and issues. Google Scholar is an excellent example of using NER to improve the speed and relevance of search results and recommendations by summarizing descriptive text, reviews and discussions.
Simplify interface content and gain insight into trends by identifying the topics and issues of blog posts and news articles.
Can anonymization and NER help you achieve your goals?
Many companies and organizations have relied on Acclaro to successfully deploy anonymization techniques to help them achieve their growth goals. If you’d like to hear more about how we can do the same for you, contact us today.