Prague Dependency Treebank - Consolidated 1.0

Marie Mikulová, Jaroslava Hlaváčová, Barbora Štěpánková, Jan Hajič (ÚFAL MFF UK)
The Prague Dependency Treebank – Consolidated 1.0 (PDT-C 1.0, hereafter PDT-C) is a richly annotated, genre-diversified language resource: a consolidated release of the existing PDT corpora of Czech data, uniformly annotated using the standard PDT scheme. The PDT corpora included in PDT-C are the Prague Dependency Treebank (the original PDT contents: written newspaper and journal texts from three genres), the Czech part of the Prague Czech-English Dependency Treebank (financial texts translated from English), the Prague Dependency Treebank of Spoken Czech (spoken data, including audio, transcripts, and multiple speech reconstruction annotation), and PDT-Faust (user-generated texts). PDT-C differs from the separately published original treebanks in three ways: it is published as one package, which makes data handling easier across all the datasets; the data is enhanced with manual linguistic annotation at the morphological layer, and a new version of the morphological dictionary is enclosed; and a common valency lexicon for all four original parts is enclosed. Comprehensive documentation is also enclosed, including new morphological guidelines and a lexicon description. There are two desktop tools for browsing and editing (TrEd and MEd), and the corpus can also be searched online using PML-TQ and through Kontext and TEITOK. In the talk, we will concentrate on the composition of PDT-C and on the new features and contents of the Czech morphological dictionary as such, as well as on the impact and changes that result when it is used for the morphological (re)annotation of the corpus.

