CCGbank: A Corpus of CCG Derivations and Dependency Structures Extracted from the Penn Treebank
Open Access
- 1 September 2007
- journal article
- Published by MIT Press in Computational Linguistics
- Vol. 33 (3), 355-396
- https://doi.org/10.1162/coli.2007.33.3.355
Abstract
This article presents an algorithm for translating the Penn Treebank into a corpus of Combinatory Categorial Grammar (CCG) derivations augmented with local and long-range word-word dependencies. The resulting corpus, CCGbank, includes 99.4% of the sentences in the Penn Treebank. It is available from the Linguistic Data Consortium, and has been used to train wide-coverage statistical parsers that obtain state-of-the-art rates of dependency recovery. In order to obtain linguistically adequate CCG analyses, and to eliminate noise and inconsistencies in the original annotation, an extensive analysis of the constructions and annotations in the Penn Treebank was called for, and a substantial number of changes to the Treebank were necessary. We discuss the implications of our findings for the extraction of other linguistically expressive grammars from the Treebank, and for the design of future treebanks.
This publication has 14 references indexed in Scilit:
- Automated extraction of Tree-Adjoining Grammars from treebanks. Natural Language Engineering, 2005
- Large-Scale Induction and Evaluation of Lexical Resources from the Penn-II and Penn-III Treebanks. Computational Linguistics, 2005
- The Proposition Bank: An Annotated Corpus of Semantic Roles. Computational Linguistics, 2005
- Extending the Coverage of a CCG System. Research on Language and Computation, 2004
- Partial Proof Trees as Building Blocks for a Categorial Grammar. Linguistics and Philosophy, 1997
- On the treatment of complex predicates in categorial grammar. Linguistics and Philosophy, 1995
- A Hypothetical Reasoning Algorithm for Linguistic Analysis. Journal of Logic and Computation, 1994
- Categorial grammars determined from linguistic data by unification. Studia Logica, 1990
- Adverbs and Logical Form: A Linguistically Realistic Theory. Language, 1982
- A Quasi-Arithmetical Notation for Syntactic Description. Language, 1953