Machine Translation Master Course Newsletter – Issue 2
As the instructor of the Machine Translation Master Class at The Localization Institute, I am very happy to continue sharing my ideas about the relationship between humans and machines. In this issue, we will look at that relationship from a data management perspective.
Why is data management relevant to me?
In 2016, when Google fixed a bug that translated “Russia” as “Mordor”, it said in a statement: “Google Translate is an automatic translator — it works without the intervention of human translators, using technology instead.”
Indeed, Google Translate did not involve human translators in its translation process. But remember that all natural language data comes from humans in their daily lives. You might have heard the saying “if you hear it enough, you’ll start to believe it”. This “illusion of truth” effect also applies to machines: a machine will believe what it has seen after looking for patterns in hundreds of millions of documents. You can, of course, try to fix some problems by correcting them manually. Yet in many cases, particularly in a neural MT system, the sheer quantity and complexity of what the hidden layers encode make it very difficult for humans to correct by hand, and so some features have to be removed to avoid potentially catastrophic mistakes. It therefore makes a lot of sense to control the quality and quantity of data before feeding it to a machine. Data management is one of the most effective ways to control MT-related risks.
Why is some data more relevant than others?
Relevance is, first and foremost, based on comparison. In a localization process, this comparison is usually between your data and the source text, and you can make it from different perspectives. A translator, for example, usually judges the relevance of reference materials by searching for concepts, words, or knowledge about those words that are similar to those in the source text. If a 100-page document contains none of these, the translator will most probably give up reading it. Humans can make such decisions in a split second, yet simulating this process is a daunting task for machines. Typically, an MT engine diligently scans the whole database and analyzes its patterns. If a large percentage of the data is irrelevant, computing power is wasted and you may not achieve your goal. Of course, machines process language data differently. For example, neural MT uses embeddings to capture word meaning, whereas statistical MT uses n-grams to process corpora. So we cannot judge data relevance only from a human perspective. Still, the analogy gives you a rough, intuitive picture.
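To make the comparison idea concrete, here is a minimal Python sketch that scores how relevant a candidate segment is to a source text by counting shared word n-grams. It is an illustration only, not how any particular MT engine selects data. The function names (ngrams, relevance, select_relevant), the whitespace tokenization, and the 0.2 threshold are all assumptions made for this example.

```python
from collections import Counter


def ngrams(text, n=2):
    """Count word n-grams in lowercased, whitespace-tokenized text (a simplifying assumption)."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def relevance(candidate, source, n=2):
    """Fraction of the source's n-grams that also occur in the candidate (clipped counts)."""
    src = ngrams(source, n)
    cand = ngrams(candidate, n)
    total = sum(src.values())
    if total == 0:
        return 0.0  # source too short to form any n-gram
    overlap = sum(min(count, cand[gram]) for gram, count in src.items())
    return overlap / total


def select_relevant(segments, source, threshold=0.2, n=2):
    """Keep only segments whose score reaches the (hypothetical) threshold."""
    return [s for s in segments if relevance(s, source, n) >= threshold]


source = "the engine translates the user manual"
segments = [
    "please read the user manual carefully",
    "the weather is nice today",
]
kept = select_relevant(segments, source)
```

With these inputs, only the first segment is kept, because it shares the bigrams "the user" and "user manual" with the source. A real pipeline would more likely use proper tokenization and, for neural MT, embedding-based similarity rather than surface overlap.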
Who is involved in the process of managing MT-driven data?
While IT professionals can communicate your ideas to machines, it is translators, linguists, project managers, and content managers who can really make sense of the data from a human perspective. With effective communication based on relevant technological knowledge, you will be able to generate a “collective” insight from your team, other teams outside your department, clients, end-users, and, last but not least, your machine. This insight will guide your attention toward meeting your needs.
Finally, it is important to point out that data management in an MT deployment process has many more aspects, for example data quantity, data format, and the data generated in an interactive MT or MTPE (Machine Translation Post-Editing) process. It is definitely an intriguing topic that we can explore further.
Takeaways:
- Data management is one of the most effective ways to control MT-related risks
- Data relevance is key to training an MT engine
- Communication helps the team make sense of the data
If you want to know more about machine translation, sign up for our next Machine Translation Master Class.
Disclaimer: Copyright © 2021 The Localization Institute. All rights reserved. This document and translations of it may be copied and furnished to others, and derivative works that comment on or otherwise explain it or assist in its implementation may be prepared, copied, published, and distributed, in whole or in part, without restriction of any kind, provided that the above copyright notice and this section are included on all such copies and derivative works. However, this document itself may not be modified in any way, including by removing the copyright notice or references to The Localization Institute, without the permission of the copyright owners. This document and the information contained herein is provided on an “AS IS” basis and THE LOCALIZATION INSTITUTE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY OWNERSHIP RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.