Empirical Models for an Indic Dialect Continuum

Niyati Bafna (INRIA, Paris)
Many Indic languages and dialects of the so-called "Hindi Belt" and surrounding regions of the Indian subcontinent, spoken by more than 100 million people, are severely under-resourced and under-researched in NLP, both individually and as a dialect continuum. We describe our efforts to build basic NLP resources for some of these languages: data collection, cognate induction, and an investigation of effective methods for embedding transfer. Specifically, we collect monolingual data for 26 Indic languages and dialects, 16 of which were previously zero-resource, and perform exploratory cross-lingual alignment experiments at the character, lexical and subword levels. We present a novel method for unsupervised cognate/borrowing identification from monolingual corpora, designed for low and extremely low resource scenarios, which combines noisy semantic signals from joint bilingual spaces with orthographic cues that model sound change. We create bilingual evaluation lexicons against Hindi for 20 of the languages, and show that our method outperforms both traditional orthography baselines and iteratively learnt edit distance matrices, indicating that even noisy bilingual embeddings can act as good guides for this task. Next, we investigate static subword embedding transfer for Indic languages, from a relatively higher resource language to a genealogically related low resource language. We work primarily with Hindi-Marathi, simulating a low-resource scenario for Marathi, and confirm the observed trends on Nepali. We demonstrate the consistent benefits of unsupervised morphemic segmentation on both the source and target sides over the subword treatment performed by fastText. Our best-performing method uses an EM-style approach to learning bilingual subword embeddings; we also show, for the first time, that a trivial "copy-and-paste" embedding transfer, even one based on perfect bilingual lexicons, is inadequate for capturing language-specific relationships.
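To make the idea of combining semantic and orthographic signals concrete, the following is a minimal sketch and not the method presented in the talk. It assumes a plain linear interpolation between cosine similarity in a shared embedding space and a normalised Levenshtein similarity, standing in for the learnt sound-change model the abstract mentions. The function names, the weight `alpha`, the romanised toy words and the two-dimensional toy vectors are all hypothetical.

```python
import math


def edit_distance(a, b):
    """Plain Levenshtein distance (unit costs), a stand-in for a learnt model."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def orthographic_similarity(a, b):
    """Edit distance rescaled to [0, 1], where 1 means identical strings."""
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def cognate_score(src, tgt, src_vec, tgt_vec, alpha=0.5):
    """Hypothetical combined score: alpha weights the semantic signal."""
    semantic = cosine(src_vec, tgt_vec)
    orthographic = orthographic_similarity(src, tgt)
    return alpha * semantic + (1 - alpha) * orthographic


def best_cognate(src, src_vec, candidates, alpha=0.5):
    """Return the candidate word with the highest combined score.

    `candidates` is a list of (word, vector) pairs from the other language,
    with vectors assumed to live in the same joint space as `src_vec`.
    """
    return max(
        candidates,
        key=lambda wv: cognate_score(src, wv[0], src_vec, wv[1], alpha),
    )[0]


# Toy data, invented for illustration only.
toy_candidates = [("paani", [0.9, 0.1]), ("dudh", [0.0, 1.0])]
match = best_cognate("pani", [1.0, 0.0], toy_candidates)
```

In this sketch a noisy semantic score can still break ties between orthographically similar candidates, and the orthographic score can rescue a pair whose embeddings are poorly aligned, which is the intuition behind using both signals together.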
