“Under the Hood” with Bill Lafferty: Translation Memory Migration

Category: Technology, Translation Services

Customers want to bring their digital assets with them wherever they go these days. As many new translation tools have exploded onto the marketplace over the past decade, customers have greater ostensible freedom to do just that. A case in point is translation memory (TM) data.

Have you ever wondered why you’d have to move your TM, or if you even can? Here are just a few scenarios in which you might want or need to:

  • Your company is growing and needs another language service provider (LSP) to scale up.
  • You want to change to a new computer-assisted translation (CAT) tool or translation management system (TMS) to take advantage of a new feature.
  • You want to clean your TM data.
  • You want to consolidate TMs managed by multiple LSPs. 

Whatever the reason, it’s a common situation for new and mature localization buyers alike. And if you’re moving to a new LSP, you’ll certainly want them to use your TMs to save money and to keep the translated output consistent.

Translation Memory Export Limitations

TMs should be easy for a TMS or LSP to share, barring a few considerations, such as the logistics of sending very large ones. But unfortunately, we work with customers every year who struggle to get pristine data. 

Here at Acclaro, we view this as a problem for all of us in the translation industry. LSPs build value for customers by producing quality translations, by being a trusted adviser across the many tasks involved in the localization supply chain and especially by being reliable stewards of customers’ data. 

When TMs are not easily accessible, or when exporting them results in a loss of leverage and increased cost, it erodes trust.

The Challenge of Translation Memory Portability

So why do customers find this process a challenge? Let’s look at the anatomy of a TM. Generally, a CAT tool or TMS creates one by converting the translated segments of an XLIFF file into a Translation Memory eXchange (TMX) file. TMX is an open XML standard for exchanging TM data that has been around since 1997.

Using XLIFF to Exchange Localization Data

As Bryan Schnabel explains in his book, “A Practical Guide to XLIFF 2.0,” published in 2015, “XLIFF was developed to facilitate the exchange of content during localization and reduce the number of document formats that localization companies receive from information developers. XLIFF enables information developers — whether they create product documentation, training materials or entire websites — to reduce their translation and localization costs.”

Here’s a snippet of an XLIFF file featured in Schnabel’s book:

<unit id="title-2">
  <segment>
    <source>Birds in Oregon</source>
    <target>Pájaros en Oregon</target>
  </segment>
</unit>

Sentences are broken down into segments in the XML schema. Within each segment, there’s a <source> and a <target> element.
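To make that structure concrete, here is a minimal Python sketch that reads the source and target text out of each segment. It assumes a simplified fragment modeled on the snippet above: the `<xliff>` wrapper has no namespace or `<file>` element, which a real XLIFF 2.0 file would have, and `extract_pairs` is a hypothetical helper name, not part of any CAT tool.

```python
import xml.etree.ElementTree as ET

# Simplified fragment modeled on the snippet above. A real XLIFF 2.0 file
# also declares a namespace and wraps units in <file> elements (omitted here).
XLIFF_FRAGMENT = """
<xliff>
  <unit id="title-2">
    <segment>
      <source>Birds in Oregon</source>
      <target>Pájaros en Oregon</target>
    </segment>
  </unit>
</xliff>
"""


def extract_pairs(xml_text):
    """Return a list of (unit id, source text, target text) tuples."""
    root = ET.fromstring(xml_text)
    pairs = []
    for unit in root.iter("unit"):
        for segment in unit.iter("segment"):
            source = segment.findtext("source", default="")
            target = segment.findtext("target", default="")
            pairs.append((unit.get("id"), source, target))
    return pairs


pairs = extract_pairs(XLIFF_FRAGMENT)
print(pairs)
```

Each tuple is one candidate entry for a translation memory: the source text is what the tool searches on, and the target text is what it offers back.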

As you can imagine, if the phrase “Birds in Oregon” is translated into a given target language (Spanish, in this case), the next time the CAT tool detects the same <source> segment, it will produce a 100% <target> match. 
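In its simplest form, that lookup behaves like a dictionary keyed on the source text. The sketch below is an illustration only: the `TranslationMemory` class and its whitespace normalization are assumptions made for this example, and real CAT tools also handle fuzzy matches, inline tags and much more.

```python
def normalize(text):
    """Collapse runs of whitespace so line wrapping does not defeat a match."""
    return " ".join(text.split())


class TranslationMemory:
    """Toy exact-match TM (hypothetical; not any vendor's implementation)."""

    def __init__(self):
        self._entries = {}

    def add(self, source, target):
        self._entries[normalize(source)] = target

    def lookup(self, source):
        """Return the stored target for an identical source, else None."""
        return self._entries.get(normalize(source))


tm = TranslationMemory()
tm.add("Birds in Oregon", "Pájaros en Oregon")

hit = tm.lookup("Birds in Oregon")       # same source text: 100% match
miss = tm.lookup("Birds in Washington")  # different source text: no exact match
print(hit, miss)
```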

In-Context Exact Matches Are Key

This is great, but there’s more. TMs also support what are called in-context exact (ICE) matches, also known as guaranteed matches. This is when a given segment has been translated word for word before and, additionally, the segments immediately above and below it are also 100% matches.

As Wikipedia puts it, “An ICE match is an exact match that occurs in exactly the same context, that is, the same location in a paragraph. Context is often defined by the surrounding sentences and attributes such as document file name, date and permissions.”

Borrowing again from Schnabel, here’s a longer passage. 

<unit id="title-2">
  <segment>
    <source>Birds in Oregon</source>
    <target>Pájaros en Oregon</target>
  </segment>
</unit>
<unit id="paragraph-3">
  <segment>
    <source>Oregon is a mostly temperate state. There are many
    different kinds of birds that thrive there.</source>
    <target>Oregon es un estado generalmente templado. Muchos
    tipos diferentes de pájaros prosperan allí.</target>
  </segment>
</unit>
<unit id="title-5">
  <segment>
    <source>High Altitude Birds</source>
    <target>Pájaros de gran altura</target>
  </segment>
</unit>

CAT tools add further properties to the XLIFF, which are then saved to the TMX to enable ICE matches and more targeted concordance searches. (Memsource calls this metadata and memoQ calls it properties, but more on our industry’s penchant for using different names for the same thing in another article!) Here’s an example from Memsource:

  • ID – Memsource internal ID
  • {source language code} – for example ‘en’ or ‘en_us’
  • prev – text of the previous segment
  • next – text of the following segment
  • seg_key – text of the context key
  • mdata – metadata of Memsource tags
  • {target language code} – en or en_us
  • created_by – Memsource Username
  • created_at – in format 2017.07.07 14:39:52,000
  • modified_by – Memsource Username
  • modified_at – in format 2017.07.07 14:39:52,000
  • client – Memsource ID (number)
  • project – Memsource ID (number)
  • domain – Memsource ID (number)
  • subdomain – Memsource ID (number)
  • note – text (external use only, not visible in Memsource)
  • reviewed – true/false (external use only, not visible in Memsource)
  • aligned – true/false (external use only, not visible in Memsource)
  • filename – the name of the original file (test.docx)

To come back to the point, the most important properties after the <source> and <target> elements are prev and next. They are what determine the context of a <source> and <target> segment.
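A rough Python sketch shows why these two properties matter. It assumes a TM entry stores the previous and following source segments alongside the source and target; the `TmEntry` type and `classify` function are hypothetical names invented for this illustration, and real tools apply their own, more elaborate rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TmEntry:
    source: str
    target: str
    prev_text: str  # previous source segment (like Memsource's prev), "" if none
    next_text: str  # following source segment (like Memsource's next), "" if none


def classify(entries, source, prev_text, next_text):
    """Return ("ICE", target), ("100%", target) or ("none", None)."""
    exact = [e for e in entries if e.source == source]
    for entry in exact:
        if entry.prev_text == prev_text and entry.next_text == next_text:
            return ("ICE", entry.target)
    if exact:
        return ("100%", exact[0].target)
    return ("none", None)


PARAGRAPH = "Oregon is a mostly temperate state."
entries = [
    TmEntry(PARAGRAPH, "Oregon es un estado generalmente templado.",
            prev_text="Birds in Oregon", next_text="High Altitude Birds"),
]

# Same neighbors as when the segment was first translated: ICE match.
same_context = classify(entries, PARAGRAPH, "Birds in Oregon", "High Altitude Birds")

# Simulate an export that dropped the context fields: only a 100% match remains.
stripped = [TmEntry(e.source, e.target, "", "") for e in entries]
after_export = classify(stripped, PARAGRAPH, "Birds in Oregon", "High Altitude Birds")
print(same_context[0], after_export[0])
```

The translation itself survives the stripped export, but the match is downgraded from ICE to a plain 100% match, which is exactly the loss of leverage described earlier.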

Here’s how memoQ uses the properties x-context-pre and x-context-post:

  • <prop type="x-context-pre"><seg>previous sentence</seg>
  • <prop type="x-context-post"><seg>following sentence</seg>

It’s the additional metadata/properties layered in the XLIFF with the <source> and <target> elements that helps customers get the best leverage from their TMs, including ICE matches.
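As an illustration of that layering, the Python sketch below derives the context of each segment from its neighbors and writes it into TMX-style translation units. It borrows the property names x-context-pre and x-context-post from the memoQ example; the sample data is abbreviated from the passage above, `build_body` is a hypothetical helper, and the plain-text property values are a simplification rather than any tool's exact export format.

```python
import xml.etree.ElementTree as ET

# Source/target pairs in document order (abbreviated from the passage above).
SEGMENTS = [
    ("Birds in Oregon", "Pájaros en Oregon"),
    ("Oregon is a mostly temperate state.", "Oregon es un estado generalmente templado."),
    ("High Altitude Birds", "Pájaros de gran altura"),
]

# ElementTree spells the xml:lang attribute with its full namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_body(segments, src_lang="en", tgt_lang="es"):
    """Build a TMX-style <body> with one <tu> per segment plus context props."""
    body = ET.Element("body")
    for i, (source, target) in enumerate(segments):
        tu = ET.SubElement(body, "tu")
        if i > 0:  # the first segment has no preceding context
            pre = ET.SubElement(tu, "prop", {"type": "x-context-pre"})
            pre.text = segments[i - 1][0]
        if i < len(segments) - 1:  # the last segment has no following context
            post = ET.SubElement(tu, "prop", {"type": "x-context-post"})
            post.text = segments[i + 1][0]
        for lang, text in ((src_lang, source), (tgt_lang, target)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return body


body = build_body(SEGMENTS)
xml_out = ET.tostring(body, encoding="unicode")
print(xml_out)
```

If an export writes only the `<tuv>` elements and drops the `<prop>` elements, the translations are intact but the information needed for ICE matching is gone.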

Walk Before You Run

While much of the new technology is great at saving time and costs when it comes to migrating your TMs, there are some surprising pitfalls, like finding out that all the metadata required to get ICE matches has been stripped.

Our advice? Walk before you run. Tell your LSP or TMS provider that you’d like your TMs. Have their engineers catalog them. Send over a few to test against translated documents and identify problems early. If your TMs are ultimately incomplete, find reliable translations to align and start over. And make sure to ask your new CAT tool provider or TMS provider exactly how TMs can be migrated in the future.
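One inexpensive early test is to count how many translation units in an exported TMX file still carry context properties. The Python sketch below assumes the memoQ-style property names mentioned earlier (x-context-pre and x-context-post); other tools use different names, so adjust `CONTEXT_PROPS` for your export. The sample TMX and the `audit_context` helper are hypothetical and exist only for illustration.

```python
import xml.etree.ElementTree as ET

# Property names assumed for this check; change them to match your tool.
CONTEXT_PROPS = {"x-context-pre", "x-context-post"}

# Hypothetical sample: the first unit kept its context, the second lost it.
SAMPLE_TMX = """
<tmx version="1.4">
  <header creationtool="example" creationtoolversion="1.0" segtype="sentence"
          o-tmf="example" adminlang="en" srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <prop type="x-context-pre">Birds in Oregon</prop>
      <prop type="x-context-post">High Altitude Birds</prop>
      <tuv xml:lang="en"><seg>Oregon is a mostly temperate state.</seg></tuv>
      <tuv xml:lang="es"><seg>Oregon es un estado generalmente templado.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>High Altitude Birds</seg></tuv>
      <tuv xml:lang="es"><seg>Pájaros de gran altura</seg></tuv>
    </tu>
  </body>
</tmx>
"""


def audit_context(tmx_text):
    """Return (total units, units carrying every property in CONTEXT_PROPS)."""
    root = ET.fromstring(tmx_text)
    total = 0
    with_context = 0
    for tu in root.iter("tu"):
        total += 1
        prop_types = {prop.get("type") for prop in tu.iter("prop")}
        if CONTEXT_PROPS <= prop_types:
            with_context += 1
    return total, with_context


total, with_context = audit_context(SAMPLE_TMX)
print(f"{with_context} of {total} translation units carry context properties")
```

A low ratio on a sample export is a signal to raise the issue with your provider before committing to a full migration.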