Hi! First let me introduce myself. I’m Jose de la Torre Vilariño, the machine translation head at Acclaro, and I direct and develop natural language processing (NLP) projects. A computer scientist with a passion for science, artificial intelligence (AI) and fintech, I hold one master’s degree in intelligent systems from Jaume I University (UJI) and another in data science from the University of Valencia (UV).
I’ve worked for several years in AI, machine learning (ML) and NLP. Before Acclaro, I served as a data engineer consultant to ViacomCBS, where I performed data analysis and maintained the life cycle of data projects and production-ready data in Amazon Web Services (AWS).
So, now that you know a little about me, let’s get into my typical business day.
Reading, Research and Data
Even if I start with a plan, my day usually has its fill of interruptions — extremely good interruptions, like emails about new projects and ideas that can positively impact Acclaro’s growth. Those emails sometimes lead to meetings. While not my core strength, meetings and email help a lot in building strong relationships and teamwork across the company.
A large part of my work is reading research papers and implementing their ideas to improve our translation quality. This feeds directly into developing new state-of-the-art NLP models, a process that covers everything from data preprocessing to modeling. The data usually lives in translation tools like memoQ or Smartling, but it tends to be full of noise, so data preprocessing consumes a lot of my time. Needless to say, if you want to step into ML, get ready to handle a lot of messy data. Let’s take a look at a recent Machine Translation Thursday.
A Typical Machine Translation Thursday
I decided to push a new BERT experiment using the RoBERTa model architecture, reusing existing code from an older Transformer version of an engine and data from Acclaro and my personal database.
Transformer models are attention-based models that are very well suited to chatbots and MT tasks. RoBERTa revisits the way BERT is trained: it trains the model longer with bigger batches and removes the next-sentence-prediction objective. So, in short, it was a good experiment for leveling up our translation results.
I already have the dataset downloaded, normalized, cleaned up and converted into exactly the format the model needs for training. That saved me days of work, literally. Because I’m a native Spanish speaker and a pretty good English speaker, I could write cleaning rules for both languages, and the bilingual dataset looks amazing. No glitches so far: code prepared, AWS instance up and running. Time spent: 1 hour.
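To give a flavor of what cleaning rules for a bilingual dataset can look like, here is a minimal Python sketch. It is not the actual pipeline described in this post: the function name `clean_pairs` and the thresholds `max_ratio` and `max_len` are assumptions chosen purely for illustration.

```python
import unicodedata


def clean_pairs(pairs, max_ratio=3.0, max_len=200):
    """Filter noisy (source, target) sentence pairs.

    Hypothetical rules: normalize Unicode and whitespace, then drop
    empty, overly long, badly length-mismatched and duplicate pairs.
    """
    seen = set()
    cleaned = []
    for src, tgt in pairs:
        # Normalize Unicode to NFC and collapse runs of whitespace.
        src = " ".join(unicodedata.normalize("NFC", src).split())
        tgt = " ".join(unicodedata.normalize("NFC", tgt).split())
        if not src or not tgt:
            continue  # one side is empty
        n_src, n_tgt = len(src.split()), len(tgt.split())
        if n_src > max_len or n_tgt > max_len:
            continue  # too long to be a single clean sentence
        if max(n_src, n_tgt) / min(n_src, n_tgt) > max_ratio:
            continue  # lengths differ too much, likely misaligned
        if (src, tgt) in seen:
            continue  # exact duplicate after normalization
        seen.add((src, tgt))
        cleaned.append((src, tgt))
    return cleaned


sample = [
    ("Hello   world", "Hola mundo"),
    ("", "vacío"),
    ("Hello world", "Hola mundo"),
    ("Yes", "Sí, claro que sí, por supuesto que sí"),
]
result = clean_pairs(sample)
```

In this toy sample only the first pair survives: the second has an empty source, the third is a duplicate once whitespace is normalized, and the fourth fails the length-ratio check. Real rules would be tuned per language pair.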
Time to start with RoBERTa. My toolkit for the task is fairseq, which is built on top of PyTorch. I upload the code to the AWS instance and, after some installation issues, it’s time to get our hands dirty and start adapting the code for both languages. One hour later, the preprocessing wait finally starts, and our data slowly begins to look like a decent ML corpus. There’s no better feeling than an unused corpus. I can only compare it to fresh coffee in the morning. Exceedingly happy, I can begin the training. With the hyperparameters and some stop parameters checked, and alerts set up for costs and the instance’s associated metrics, our long wait starts.
I’ve set the model up to train, ideally, for 200-250 epochs with an in-memory batch size of 4,000 sentences and an initial learning rate of 0.1. I estimate that it’ll take five to seven days of training before we have a proper model ready to run tests and draw first conclusions. With this completed, a break is definitely needed.
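For readers wondering what a stop parameter might look like in practice, one common choice is a patience rule: stop when the validation loss hasn’t improved for a few epochs, or when the epoch cap is reached. The sketch below is an illustration only, not the configuration used in this experiment. The class name `EarlyStopper` and the `patience` and `min_delta` values are assumptions; only the 250-epoch cap comes from the setup above.

```python
class EarlyStopper:
    """Patience-based stopping rule (hypothetical helper, not a fairseq API)."""

    def __init__(self, patience=5, min_delta=0.0, max_epochs=250):
        self.patience = patience      # epochs to wait without improvement
        self.min_delta = min_delta    # smallest drop that counts as improvement
        self.max_epochs = max_epochs  # hard cap on training length
        self.best = float("inf")
        self.bad_epochs = 0

    def should_stop(self, epoch, valid_loss):
        """Return True if training should stop after this epoch."""
        if valid_loss < self.best - self.min_delta:
            self.best = valid_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience or epoch >= self.max_epochs


stopper = EarlyStopper(patience=2)
decisions = [
    stopper.should_stop(epoch, loss)
    for epoch, loss in enumerate([1.0, 0.9, 0.95, 0.96], start=1)
]
```

With a patience of 2, the toy loss sequence triggers a stop at the fourth epoch, after two epochs in a row without a new best loss.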
2:00 p.m. Lunch Break
My lunch breaks are usually coffee breaks — an opportunity to make the best cappuccino ever. But I’ll need to reschedule that delicious treat for later, because today I’m having an hourlong lunch break with my girlfriend. I love to spend my lunchtime with my girlfriend, and it’s actually a key component to my work. I deeply enjoy that hour of sharing a meal and clearing my head. It helps me gain a fresh perspective and discover new questions or solutions to old data or code bugs.
3:30 p.m. Back to Work
Time to prepare for the weekly standup, where we discuss progress and setbacks, model and engine results, and some of the new ideas and solutions already on our road map. (And sometimes we even chat about that new blog article I wrote!) This week, the new initiative the team’s working on is Domain Classification and Topic Modeling. The latest experiments with ELMo and BERT have shown amazing results for English and Spanish, and all the tests conducted for one of our clients look promising. The next stage will be to use the Domain Classifier to build the new dataset for the client. This will set us up to build the custom, client-specific engine they want, which we expect to outperform any translator from any other company when dealing with our client’s data.
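As a rough picture of how a domain classifier can be used to build a dataset, the Python sketch below groups sentence pairs by the domain predicted for the source side. Everything here is hypothetical: `keyword_classifier` is a toy stand-in for a trained ELMo or BERT classifier, and the domains and keywords are invented for the example.

```python
from collections import defaultdict

# Invented keyword lists; a real classifier would be a trained model.
DOMAIN_KEYWORDS = {
    "legal": {"contract", "clause", "liability"},
    "medical": {"patient", "dose", "clinical"},
}


def keyword_classifier(sentence):
    """Toy stand-in for a trained domain classifier."""
    tokens = set(sentence.lower().split())
    scores = {d: len(tokens & kw) for d, kw in DOMAIN_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "general"


def split_by_domain(pairs, classify):
    """Group (source, target) pairs by the domain predicted for the source."""
    buckets = defaultdict(list)
    for src, tgt in pairs:
        buckets[classify(src)].append((src, tgt))
    return dict(buckets)


pairs = [
    ("The patient received one dose", "El paciente recibió una dosis"),
    ("This clause limits liability", "Esta cláusula limita la responsabilidad"),
    ("Good morning", "Buenos días"),
]
by_domain = split_by_domain(pairs, keyword_classifier)
```

Passing the classifier in as a function keeps the grouping step independent of the model, so the toy rule can be swapped for a real one without touching `split_by_domain`.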
The meeting turns out to be very productive, and we agree to start the custom engine next week, applying everything we already know about the languages this client works with. The road map is now clearer, and architecture prototypes for each language are already starting to pop up in my mind.
At this stage, nearly three hours into training, I can start testing the scripts that control the training process, follow the loss function and see how the model is starting to fit the training data. But that will have to wait, because a new task lands in my lap: preparing the pipeline for the Domain Classification project. This will involve ingesting data in various formats from the client and creating the new neural machine translation (NMT) engine, with cleaned and domain-adapted content ready to go to a new AWS instance.
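Following the loss usually means looking past its batch-to-batch jitter. One simple way to do that is to smooth the values with an exponential moving average and compare the latest point with one a few steps back. The sketch below is an assumption-laden illustration, not the monitoring script from this project: the function names, `alpha` and `window` are all made up for the example.

```python
def smooth_losses(losses, alpha=0.3):
    """Exponential moving average of a sequence of loss values."""
    smoothed = []
    ema = None
    for loss in losses:
        # The first value seeds the average; later values are blended in.
        ema = loss if ema is None else alpha * loss + (1 - alpha) * ema
        smoothed.append(ema)
    return smoothed


def is_improving(losses, window=3, alpha=0.3):
    """True if the smoothed loss is lower now than `window` steps ago."""
    smoothed = smooth_losses(losses, alpha)
    if len(smoothed) <= window:
        return False  # not enough history to judge
    return smoothed[-1] < smoothed[-1 - window]


# Made-up per-epoch losses with one small bump at the third epoch.
history = [5.0, 4.0, 4.2, 3.5, 3.0, 2.8, 2.5]
trend_ok = is_improving(history)
```

The smoothing keeps a single noisy epoch, like the third value here, from being read as a reversal of the overall trend.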
As the evening lights turn on, I decide to finally catch up with that postponed cappuccino and call it a day. My big takeaways for the day: I was happy to have tested a new architecture and to see the model progressing nicely into its seventh epoch while the loss improved steadily. And I was thrilled to see that, after two hours of hard work, the preprocessing pipeline for Domain Classification was already taking shape for this new and very exciting project.
Because my work is flexible and I take courses on NLP and data science, I tend to work a few more hours late at night and start a bit later the next day. At the moment, I’m really enjoying the new NLP specialization from deeplearning.ai on Coursera. I’ve found that two late-night hours are the equivalent of four or more daytime hours with the occasional email check. My nights are often my chance to get caught up or even ahead of where I want to be.
The last thing I do is check my to-do list for the next day, ordering by priority and looking for things missed during the day. That way, the next morning I’m ready to start fresh with my plan at hand (even if an unexpected email changes that plan!).