From The Jungle to a Park: Harmonizing Dependency Treebanks of 30 Languages

Speakers:
Daniel Zeman, Martin Popel
Abstract:
We will present our recent parsing experiments with dependency treebanks in multiple languages. We identified more than 30 languages for which treebanks (mostly dependency-based) are available under acceptable licensing terms. However, the treebanks adhere to many different annotation styles. To make our results comparable, we need to make the annotation styles as similar as possible. An interesting question is what the common annotation style should look like and what criteria we should use to evaluate the suitability of the various approaches. In the first part of the talk, we will present the data we have. We will demonstrate the diversity of annotation styles by giving an overview of various syntactic phenomena, their representation in the treebanks, and our effort to transform these representations into one common scheme. In the second part, we will focus specifically on coordinating structures, one of the most difficult phenomena for both treebank designers and parsers. We will classify the possible annotation styles along several dimensions and evaluate both their theoretical expressive power and their practical impact.
Length:
01:32:15
Date:
31/10/2011
