The ParlaMint project: developing comparable corpora of parliamentary debates in Europe

Speaker:
Tomaž Erjavec (Jožef Stefan Institute, Ljubljana, Slovenia)
Abstract:
The talk presents the results of the ParlaMint project: comparable corpora of parliamentary debates from 29 European countries and autonomous regions, covering at least the period from 2015 to 2022 and containing over 1 billion words. The corpora are uniformly encoded, contain rich metadata about their 24 thousand speakers, and are linguistically annotated up to the level of Universal Dependencies syntax and named entities. We present the compilation of the corpora, including the encoding infrastructure, the use of GitHub, the production of the individual corpora, the common pipeline for producing their distribution, and the use of CLARIN services for dissemination. We then introduce the latest additions to the corpora: metadata localisation; new metadata, such as the political orientation of political parties; machine translation of the corpora into English and the tagging of the translations with semantic classes; and the production of pilot speech corpora. Finally, we discuss outreach activities and further work.
Length:
00:56:49
Date:
25/03/2024
