Morphologically-Guided Segmentation For Translation of Agglutinative Low-Resource Languages

Published in The 4th Workshop on Technologies for MT of Low Resource Languages, 2021

William Chen and Brett Fazio. (2021). "Morphologically-Guided Segmentation For Translation of Agglutinative Low-Resource Languages." The 4th Workshop on Technologies for MT of Low Resource Languages.

Abstract: Neural Machine Translation (NMT) for low-resource languages (LRLs) is often limited by the lack of available training data, making it necessary to explore additional techniques to improve translation quality. We propose the use of the Prefix-Root-Postfix-Encoding (PRPE) subword segmentation algorithm to improve translation quality for LRLs, using two agglutinative languages as case studies: Quechua and Indonesian. In the course of our experiments, we reintroduce a parallel corpus for Quechua-Spanish translation that was previously unavailable for NMT. Our experiments show the importance of appropriate subword segmentation, which can even yield better translation quality than systems trained on much larger quantities of data. We show this by achieving state-of-the-art results for both languages, obtaining higher BLEU scores than large pre-trained models while using much smaller amounts of data.

Download paper here

Code available here