Pursuit of Fair and Effective Text Segmentation in Multilingual Language Models

Speaker:
Tomasz Limisiewicz (ÚFAL MFF UK)
Abstract:
Subword tokenization has become the dominant method for segmenting the textual input of language models. It offers a compromise between covering rare words and avoiding excessive segmentation of text. Nevertheless, popular subword tokenization algorithms rely on word frequency, which limits their effectiveness for low-resource languages and domains. This presentation will examine the aspects of subword tokenization that influence language model performance and cost: the allocation and overlap of vocabulary units across languages. Additionally, I will discuss potential improvements and alternatives aimed at producing better and fairer textual representations for NLP models.
Length:
01:04:50
Date:
29/04/2024
