Combining Symbolic and Statistical Methods in Corpus-Based NLP
Speaker:
Dan Flickinger
Abstract:
Linguists developing formal models of language seek to provide detailed accounts of linguistic phenomena, making predictions that can be tested systematically. Computational linguists building broad-coverage grammar implementations must balance several competing demands if the resulting systems are to be both effective and linguistically satisfying. There is an emerging consensus within computational linguistics that hybrid approaches combining rich symbolic resources and powerful statistical techniques will be necessary to produce NLP applications with a satisfactory balance of robustness and precision. In this talk I will present one approach to this division of labor, which we have been exploring at CSLI as part of an international consortium of researchers working on deep linguistic processing (www.delph-in.net). I will argue for the respective roles of two efforts: the large-scale manual construction of a grammar of English, and the systematic construction of statistical models from annotated corpora that have been parsed with such a grammar and then manually disambiguated. Illustrations of this approach will come from three applications of NLP: machine translation, information extraction from scientific texts, and grammar checking in online elementary school writing courses.