Towards A Multilayer And Multidimensional Corpus Annotation: Following The Footprints Of The Meaning-Text Theory
Speaker:
Leo Wanner
Abstract:
An increasing number of treebanks are available for training statistical Natural Language Processing applications. Nearly all of them capture linguistic phenomena of different kinds (at least word order, morphological features and syntactic dependencies), but only a few (among them the Prague Dependency Treebank, PDT) actually separate these phenomena into distinct levels of annotation; the majority use a single agglomerated annotation structure. Such a structure can be considered deficient from the theoretical (linguistic) point of view. It also reduces the quality of the annotated resources, which in turn lowers the quality of the applications trained on them. As numerous scholars have already pointed out, corpus annotation is of higher quality when it follows a well-defined linguistic model that supports multi-level annotation. In my talk, I will present the annotation of Spanish and English corpora rooted in the linguistic model of the Meaning-Text Theory. I will introduce the annotation schema we have developed for the surface-syntactic layer of Spanish and discuss how we (semi-)automatically derive the more abstract deep-syntactic and semantic annotations from the surface-syntactic annotation. In the second half of my talk, I will report on our work in progress on annotating the Penn Treebank with Theme/Rheme structure. To conclude, I will draw some parallels between the annotation philosophy underlying PDT 2.0 and ours.