Multilingual pre-trained encoders: How far can we get with multilingual data?
Speaker:
Jindřich Libovický (ÚFAL MFF UK)
Abstract:
Multilingual encoders pre-trained on monolingual data alone show surprising cross-lingual abilities. One thing that makes such monolingually trained multilingual encoders attractive is that they do not require explicit cross-lingual alignment using parallel data. Avoiding parallel data might have the advantage of not imposing the culture of the highest-resourced language on the model. But is that really so? In the talk, we will discuss several ways of improving cross-lingual alignment using monolingual data only. Further, we will present two case studies, based on an (almost) unsupervised interpretability method, showing how the decision whether or not to use parallel data affects the way the models capture culture-related aspects of meaning.