Decomposing and Regenerating Syntactic Trees

Abstract

The thesis focuses on learning syntactic tree structures by generalizing over annotated treebanks. It investigates several probabilistic models for three different syntactic representations. Standard phrase-structure and dependency-structure treebanks are used to train and test the models. A third representation is proposed, based on a systematic yet concise formulation of the original dependency theory of Lucien Tesnière (1959). This new representation incorporates the main advantages of phrase structure and dependency structure, and offers a sound compromise between adequacy and simplicity in syntactic description.

One of the main contributions of the thesis is a general framework for defining probabilistic generative models of syntax. In every model, syntactic trees are decomposed into elementary constructs which can be recomposed, by means of specific combinatory operations, to generate novel syntactic structures. For learning phrase structures, a novel Data-Oriented Parsing (DOP; Bod et al., 2003) approach is proposed. Following the original DOP framework, constructs of variable size serve as the building blocks of the model. To restrict the grammar to a small yet representative set of constructions, only those recurring multiple times in the training treebank are retained, and a novel, efficient tree-kernel algorithm is used to find these recurring fragments.

For the other two representations, several generative models are formulated and evaluated within a re-ranking framework. Re-ranking proves an effective methodology: it functions as a parser simulator and guides the process of (re)defining probabilistic generative models for learning syntactic structures.
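To make the notion of recurring constructs concrete, the following is a minimal Python sketch, not the thesis's method: it decomposes toy phrase-structure trees into one-level fragments (whereas DOP uses fragments of variable size) and keeps only those occurring in more than one tree, using naive counting rather than the efficient tree-kernel algorithm described above. The tuple-based tree representation and all function names are illustrative assumptions.

```python
from collections import Counter

def fragments(tree):
    """Yield one-level fragments of a tree.

    A tree is a (label, children) pair, where children is a tuple of
    subtrees, or an empty tuple for a leaf. Each fragment pairs a node
    label with the labels of its direct children (a CFG-rule-like
    construct; real DOP fragments can span multiple levels).
    """
    label, children = tree
    if children:
        yield (label, tuple(child[0] for child in children))
        for child in children:
            yield from fragments(child)

def recurring_fragments(treebank):
    """Return fragments that occur in at least two trees of the treebank."""
    counts = Counter()
    for tree in treebank:
        counts.update(set(fragments(tree)))  # count each fragment once per tree
    return {frag for frag, n in counts.items() if n >= 2}

# Toy treebank: "she sleeps" and "he sleeps" share two constructs.
t1 = ("S", (("NP", (("she", ()),)), ("VP", (("sleeps", ()),))))
t2 = ("S", (("NP", (("he", ()),)), ("VP", (("sleeps", ()),))))
print(recurring_fragments([t1, t2]))
# {('S', ('NP', 'VP')), ('VP', ('sleeps',))}
```

The quadratic pairwise comparison implied by this naive count is exactly what the tree-kernel algorithm of the thesis is designed to avoid at treebank scale; the sketch only illustrates the selection criterion itself.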
