Statistical machine translation (SMT) treats the translation of human language as a machine learning problem. It learns probabilistic mappings by iterating through billions of lines of human-produced bilingual text; these mappings are then used to construct the translation of unseen text. The Microsoft Translator team builds SMT systems for hundreds of language pairs, comprising thousands of models. We run a large number of experiments using publicly available training data from the internet. When those experiments succeed, the ML models are refreshed, which happens multiple times a year.
Let’s say you are planning to tackle an ML problem at global scale. How would you approach it? What engineering principles and solutions should you keep in mind while designing an architecture that you can build and improve iteratively as your product succeeds?
In this talk, we break down the end-to-end process of training ML models into four broad components: data acquisition, training, evaluation and release. This represents a typical workflow for any ML system design. For each of these areas, we describe the kinds of problems you are most likely to encounter. From our specific solutions, we derive generic recommendations that you will find helpful for your own ML problem. The talk highlights real-world lessons that have proven helpful in building automated, reliable ML infrastructure.
Vishal Chowdhary has been a Principal Development Lead with the MSR – Microsoft Translator (MT) team for the past 4 years. His team is primarily responsible for the data acquisition and training infrastructure for building translation models. Earlier, he worked on the .NET Framework, Azure ServiceBus and BizTalk teams. He loves to learn and to fly, both figuratively and literally.