Video Recordings

Pursuit of Fair and Effective Text Segmentation in Multilingual Language Models

Speaker:

Tomasz Limisiewicz (ÚFAL MFF UK)

Abstract:

Subword tokenization has become the dominant method of segmenting the textual input of language models. It offers a compromise between covering rare words and avoiding excessive text segmentation. Nevertheless, popular subword algorithms rely on word frequency, which limits their effectiveness for low-resource languages and domains. This presentation will delve into the aspects of subword tokenization that influence language model performance and costs: the allocation and overlap of vocabulary units across languages. Additionally, I will talk about potential improvements and alternatives aimed at producing better and fairer textual representations for NLP models.

Length:

01:04:50

Date:

29/04/2024

Institute of Formal and Applied Linguistics