Pursuit of Fair and Effective Text Segmentation in Multilingual Language Models
Speaker:
Tomasz Limisiewicz (ÚFAL MFF UK)
Abstract:
Subword tokenization has become the dominant method of segmenting the textual input of language models. It offers a compromise between covering rare words and avoiding excessive segmentation of text. Nevertheless, popular subword algorithms rely on word frequency, which limits their effectiveness for low-resource languages and domains.
This presentation will examine the aspects of subword tokenization that influence language model performance and cost: the allocation and overlap of vocabulary units across languages. I will also discuss potential improvements and alternatives aimed at producing better and fairer textual representations for NLP models.