Zachary Smith · Published in Analytics Vidhya · 7 min read · May 28, 2021


Photo by Soner Eker on Unsplash

This post aims to summarize the findings presented in the paper "Beyond English-Centric Multilingual Machine Translation". The headline advancement from this paper is the M2M_100 pre-trained multilingual translation model. In addition to that headline achievement, the research delivers two further contributions: methods for multilingual data mining that expand two large multilingual translation corpora, and the application of techniques for training very large deep learning models to the translation setting.

The following write-up starts with a synopsis presenting an overall summary of the paper. It then describes the datasets produced to train the model, the strategies employed to train such a large model, and finally the design of the M2M_100 model.

Synopsis

This paper introduces M2M_100, a multilingual translation model that translates directly between any pair of 100 languages, doing away with the need to use English as an intermediary language as is common in English-Centric translation models. M2M_100 outperforms English-Centric multilingual models, published bilingual models, and other published direct-translation multilingual models. Additionally, human translators rated M2M_100's output higher than that of English-Centric models in a blind evaluation.

Training such a model was possible due to improvements in the mining of multilingual translation data. The authors leverage and extend the multilingual translation corpora CCMatrix [1] and CCAligned [2] for training and test data. They define a mining strategy based on groupings of language families, plus languages chosen to span across the groupings, dubbed bridge languages, and demonstrate how this strategy improves sparse mining over the language-pair matrix compared with random sampling. Lastly, they show how backtranslated synthetic bitexts can improve translation quality for otherwise low-resource languages.

To train such a large model, the authors implemented recent dense scaling advancements: optimizer state sharding and gradient checkpointing to reduce the memory required during training, and model parallelism to split training across multiple devices.

Data Mining

Language coverage was chosen to include widely spoken languages from geographically diverse language families and a diversity of scripts, with the objective of broad worldwide coverage. Additionally, languages were restricted to those for which public evaluation data and monolingual data existed.

A definition: pairs of sentences that are translations of one another are called bitext data (for example, the French "Bonjour le monde." paired with the English "Hello world."). This type of data makes up the training and test sets for a Transformer-based translation model.

Extending Corpora

The experimenters leverage and extend two multilingual bitext corpora, CCMatrix and CCAligned. CCMatrix uses a global approach to mining for bitexts: it compares each unique sentence in one language to all unique sentences in another language to find bitext pairs. CCAligned first pre-selects documents that are likely to contain mutual translations, then mines for bitexts within the paired documents.

Both approaches perform sentence comparisons using language-agnostic semantic embeddings generated by the LASER [3] encoder.
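
To make this concrete, below is a minimal sketch of global mining with language-agnostic sentence embeddings. It assumes the embeddings have already been computed (e.g. with a LASER-style encoder) and uses plain cosine similarity over a brute-force similarity matrix; the real CCMatrix pipeline uses margin-based scoring and approximate nearest-neighbor search at a vastly larger scale, so treat this purely as an illustration.

```python
import numpy as np

def mine_bitexts(src_embs, tgt_embs, src_sents, tgt_sents, threshold=0.8):
    """Toy global mining: pair each source sentence with its most similar
    target sentence if the cosine similarity clears a threshold.

    src_embs, tgt_embs: arrays of shape (n, d) of sentence embeddings."""
    # Normalize so that the dot product equals cosine similarity.
    src = src_embs / np.linalg.norm(src_embs, axis=1, keepdims=True)
    tgt = tgt_embs / np.linalg.norm(tgt_embs, axis=1, keepdims=True)

    scores = src @ tgt.T              # (n_src, n_tgt) similarity matrix
    best = scores.argmax(axis=1)      # best target for each source sentence

    pairs = []
    for i, j in enumerate(best):
        if scores[i, j] >= threshold:
            pairs.append((src_sents[i], tgt_sents[j], float(scores[i, j])))
    return pairs
```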

Mining Strategy

To address the prohibitive cost of mining data for each and every pair of languages, the authors demonstrate a method of sparse mining built on language family groupings and bridge languages. Language families group similar languages, and all languages within a group are mined against all other languages in that group. Bridge languages are languages for which data is also mined across groups; they are typically the languages within each group that have the most resources.

This sparse mining strategy is efficient and outperforms other sparse mining strategies, while still enabling the model to learn to translate across groups whether or not the particular language pair involves a bridge language.
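
As a rough illustration of which pairs get mined under this strategy, consider the sketch below. The group definitions and bridge languages are illustrative placeholders, not the exact groupings used in the paper.

```python
from itertools import combinations

# Illustrative groupings only; the paper defines groups over all 100 languages.
groups = {
    "germanic": ["en", "de", "nl", "sv"],
    "romance": ["fr", "es", "it", "pt"],
    "indo_aryan": ["hi", "bn", "ur"],
}
# High-resource languages chosen to bridge across groups (also illustrative).
bridges = {"en", "de", "fr", "es", "hi"}

pairs = set()

# 1) Mine every pair of languages within each language family group.
for members in groups.values():
    pairs.update(combinations(sorted(members), 2))

# 2) Additionally mine every pair of bridge languages across groups.
pairs.update(combinations(sorted(bridges), 2))

all_langs = sorted(sum(groups.values(), []))
print(f"{len(pairs)} mined pairs vs "
      f"{len(list(combinations(all_langs, 2)))} possible pairs")
```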

Backtranslation

Backtranslation is the process of creating synthetic bitexts: a monolingual sentence in one language is machine-translated into another language, and the original sentence plus its synthetic translation make up a bitext pair. The authors show how this technique can be of worth when applied to low-resource languages, increasing the training data for those languages and thus the quality of their translations.

The authors first train a 1.2B parameter M2M model and use it to evaluate which language pairs were the worst performers. The same model is then used to generate synthetic translations that are added to the training set. These bitext pairs are tagged with a special encoder-side token to indicate to the model that the pair is synthetic. The monolingual sentences come from the cleaned CommonCrawl [4] corpus, the same corpus from which CCAligned and CCMatrix are mined.
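
A minimal sketch of that step is below, assuming a hypothetical `translate` callable standing in for the intermediate 1.2B-parameter model; the tag token shown is an illustrative placeholder for the encoder-side marker described above, not the exact token used in the paper.

```python
BT_TAG = "<BT>"  # illustrative marker for synthetic (backtranslated) source text

def backtranslate(monolingual_tgt_sents, translate, tgt_lang, src_lang):
    """Create synthetic bitexts for the src_lang -> tgt_lang direction.

    monolingual_tgt_sents: real sentences in tgt_lang (e.g. from CommonCrawl).
    translate: hypothetical callable (sentence, from_lang, to_lang) -> str,
               standing in for the intermediate 1.2B-parameter M2M model.
    """
    bitexts = []
    for sent in monolingual_tgt_sents:
        synthetic_src = translate(sent, tgt_lang, src_lang)
        # The synthetic sentence becomes the (tagged) source; the real
        # monolingual sentence is the target the model learns to produce.
        bitexts.append((f"{BT_TAG} {synthetic_src}", sent))
    return bitexts
```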

Balancing Languages

The authors introduce Sinkhorn Temperature Sampling, which extends the temperature sampling strategy to the Many-to-Many setting in order to balance sampling across under- and over-represented languages.
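
The Sinkhorn variant itself is beyond the scope of this summary, but the plain temperature sampling it extends is easy to sketch: each language's sampling probability is its share of the data raised to the power 1/T, so higher temperatures flatten the distribution toward low-resource languages. A minimal sketch with made-up dataset sizes:

```python
import numpy as np

def temperature_sampling_probs(dataset_sizes, T=5.0):
    """Standard temperature sampling: p_i proportional to (n_i / sum n)**(1/T).

    T=1 reproduces the raw data distribution; larger T upsamples
    low-resource languages. (The paper extends this idea to language
    *pairs* via Sinkhorn Temperature Sampling.)"""
    sizes = np.asarray(dataset_sizes, dtype=float)
    shares = sizes / sizes.sum()
    weights = shares ** (1.0 / T)
    return weights / weights.sum()

# Example: three languages with very unbalanced amounts of bitext.
print(temperature_sampling_probs([1_000_000, 50_000, 1_000], T=5.0))
```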

Computation Improvement Techniques

The authors mainly use two techniques to scale training: (1) reducing memory consumption on each GPU through optimizer state sharding [5] and gradient checkpointing [6], and (2) splitting the model across devices through model parallelism [7, 8, 9, 10].
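
As one concrete piece of this, gradient checkpointing is available out of the box in PyTorch. The sketch below is a toy example, not the paper's fairseq setup, and it omits optimizer state sharding (which additionally partitions optimizer state across data-parallel workers, as in ZeRO [5]).

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# A stand-in stack of Transformer-sized feed-forward blocks.
blocks = nn.Sequential(*[
    nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))
    for _ in range(12)
])

x = torch.randn(8, 1024, requires_grad=True)

# Gradient checkpointing: only activations at a few segment boundaries are
# kept during the forward pass; the rest are recomputed during backward,
# trading extra compute for a much smaller memory footprint.
out = checkpoint_sequential(blocks, 4, x)
out.sum().backward()
```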

Experimental analysis determined that wider models scaled better and performed better than deeper models.

Model Design

Tokenization

Almost all natural language processing operates on some version of tokens. The authors tokenize text using SentencePiece [11] to produce subword units learned from the training dataset. SentencePiece is well suited here because it was designed to be compatible with languages that have no word segmentation.
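
A small sketch of how that looks with the sentencepiece library; the tiny corpus, vocabulary size, and file names are illustrative stand-ins, not the paper's actual settings.

```python
import sentencepiece as spm

# Write a tiny stand-in corpus (the real model trains on mined multilingual text).
sentences = [
    "The quick brown fox jumps over the lazy dog.",
    "Le renard brun rapide saute par-dessus le chien paresseux.",
    "これはテストの文です。",
    "यह एक परीक्षण वाक्य है।",
]
with open("toy_corpus.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(sentences))

# Train a subword model on the toy corpus.
spm.SentencePieceTrainer.train(
    input="toy_corpus.txt",
    model_prefix="toy_spm",
    vocab_size=200,               # illustrative; real models use far larger vocabs
    hard_vocab_limit=False,       # soft limit so the toy corpus can satisfy it
    character_coverage=0.9995,    # keep rare characters from diverse scripts
    model_type="unigram",
)

# Tokenize text, including languages with no whitespace segmentation.
sp = spm.SentencePieceProcessor(model_file="toy_spm.model")
print(sp.encode("これはテストです", out_type=str))
print(sp.encode("The quick brown fox", out_type=str))
```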

Model Structure

Image taken directly from the "Beyond English-Centric Multilingual Machine Translation" paper.

The M2M_100 is an extension of a sequence-to-sequence model based on the encoder-decoder Transformer architecture. Explaining this in non-nerd terms:

  • Sequence-to-sequence largely refers to models designed to transform an input sequence into a target output sequence while optimizing the quality of the generated output. For example, improving the quality of an English sentence generated by translating a French sentence.
  • Encoder-decoder Transformer architecture describes a neural network structured as an "encoder" Transformer and a "decoder" Transformer. The encoder takes a sequence of tokens and generates embeddings, a mathematical representation of the sequence. These embeddings are fed into the decoder, which is tasked with converting them into the sequence of tokens as they should appear in the target language. For example, first the French sentence is converted into embeddings representing its meaning; then those embeddings are converted into the English sentence carrying the same meaning. A minimal sketch of this pattern follows the list.
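
Here is a toy encoder-decoder Transformer built from PyTorch's stock modules, purely to illustrate the pattern; it is not the actual M2M_100 implementation (which lives in fairseq), and positional encodings are omitted for brevity.

```python
import torch
import torch.nn as nn

class ToySeq2Seq(nn.Module):
    """Minimal encoder-decoder Transformer: token ids in, token logits out."""
    def __init__(self, vocab_size=32000, d_model=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=8,
            num_encoder_layers=6, num_decoder_layers=6,
            batch_first=True,
        )
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, src_tokens, tgt_tokens):
        # The encoder turns the source sentence into contextual embeddings;
        # the decoder attends to them while generating the target sentence.
        decoder_out = self.transformer(self.embed(src_tokens),
                                       self.embed(tgt_tokens))
        return self.out(decoder_out)

model = ToySeq2Seq()
src = torch.randint(0, 32000, (2, 10))   # e.g. tokenized French sentences
tgt = torch.randint(0, 32000, (2, 12))   # shifted English target tokens
logits = model(src, tgt)                 # shape (2, 12, 32000)
```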

Thus, at its core, M2M_100 is a fairly standard sequence-to-sequence model built on fairly standard Transformer architecture, with the central challenge being how to effectively train a model of this size on the mountain of mined bitext data. From here the modelers were able to get creative and take advantage of the natural groupings of language families to extend the architecture in a way that makes the previously discussed performance optimization techniques applicable.

This is accomplished by replacing some Transformer sublayers with a set of parallel sublayers, one for each pre-defined group of languages. Each sublayer can then be deployed on a different GPU to increase training and inference speed as well as memory efficiency, because not all languages need all parameters.

These language-specific sublayers can be applied to either the encoder or the decoder side. In practice, however, they are only applied to the decoder side of the model.

A potential drawback of this architecture is that the information the model learns for a language group becomes siloed within that group. To combat this, the authors implement random re-routing and run experiments to tune how much of it to apply. Random re-routing is the process of feeding a sentence to a randomly selected language-group sublayer instead of the deterministically selected one.
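
A rough sketch of the idea (not the paper's fairseq implementation): a decoder sublayer is replaced by a dictionary of parallel sublayers keyed by language group, with a small probability of routing a training batch through a random group's sublayer instead of its own. The group names and re-routing probability below are illustrative.

```python
import random
import torch
import torch.nn as nn

class LanguageGroupFFN(nn.Module):
    """One parallel feed-forward sublayer per language group, with random
    re-routing to keep knowledge from being siloed inside a single group."""
    def __init__(self, groups, d_model=512, d_ff=2048, reroute_prob=0.2):
        super().__init__()
        self.experts = nn.ModuleDict({
            g: nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                             nn.Linear(d_ff, d_model))
            for g in groups
        })
        self.reroute_prob = reroute_prob

    def forward(self, x, lang_group):
        # During training, occasionally send the batch through another
        # group's sublayer instead of the deterministically selected one.
        if self.training and random.random() < self.reroute_prob:
            lang_group = random.choice(list(self.experts.keys()))
        return self.experts[lang_group](x)

layer = LanguageGroupFFN(groups=["germanic", "romance", "indo_aryan"])
hidden = torch.randn(2, 10, 512)          # (batch, sequence, d_model)
out = layer(hidden, lang_group="romance")
```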

Another advantage of language-group sublayers is that new sublayers can be added to an already pre-trained Transformer to learn the language-specific components for new languages, easily extending the translation-pair coverage.

Overall Results

The authors used the BLEU [12] metric to measure translation quality in all of their result studies. They compared the output of M2M_100 to models submitted against the WMT benchmark dataset and also had it evaluated by human translators.
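
For readers who want to compute the same metric on their own outputs, BLEU is available via the sacrebleu package; the sentences below are made up.

```python
import sacrebleu

# Hypothetical system outputs and one reference translation per sentence.
hypotheses = ["the cat sat on the mat", "he read the book quietly"]
references = [["the cat sat on the mat", "he read the book in silence"]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.1f}")
```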

Overall, the M2M_100 model moderately improves on translations to and from English compared to other models, and drastically improves on translations for low-resource languages and non-English directions. This held up when human translators scrutinized example translations from M2M_100 against English-Centric model translations.

References

[1] Holger Schwenk, Guillaume Wenzek, Sergey Edunov, Edouard Grave, Armand Joulin, and Angela Fan. CCMatrix: Mining billions of high-quality parallel sentences on the web. arXiv preprint arXiv:1911.04944, 2019.

[2] Ahmed El-Kishky, Vishrav Chaudhary, Francisco Guzman, and Philipp Koehn. CCAligned: A massive collection of cross-lingual web-document pairs. In Proc. of EMNLP, 2020.

[3] Mikel Artetxe and Holger Schwenk. Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. arXiv preprint arXiv:1812.10464, 2018.

[4] Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzmán, Armand Joulin, and Edouard Grave. CCNet: Extracting high quality monolingual datasets from web crawl data. arXiv preprint arXiv:1911.00359, 2019.

[5] Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. ZeRO: Memory optimizations toward training trillion parameter models. arXiv preprint, 2019.

[6] Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174, 2016.

[7] Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. GPipe: Efficient training of giant neural networks using pipeline parallelism. In Advances in Neural Information Processing Systems, pages 103–112, 2019.

[8] Chiheon Kim, Heungsub Lee, Myungryong Jeong, Woonhyuk Baek, Boogeon Yoon, Ildoo Kim, Sungbin Lim, and Sungwoong Kim. torchgpipe: On-the-fly pipeline parallelism for training giant models. arXiv preprint arXiv:2004.09910, 2020.

[9] Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-LM: Training multi-billion parameter language models using GPU model parallelism. arXiv preprint arXiv:1909.08053, 2019.

[10] Noam Shazeer, Youlong Cheng, Niki Parmar, Dustin Tran, Ashish Vaswani, Penporn Koanantakool, Peter Hawkins, HyoukJoong Lee, Mingsheng Hong, Cliff Young, et al. Mesh-TensorFlow: Deep learning for supercomputers. In Advances in Neural Information Processing Systems, pages 10414–10423, 2018.

[11] Taku Kudo and John Richardson. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226, 2018.

[12] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318, 2002.
