Towards simpler Tree Substitution Grammars

Abstract

In this thesis we investigate several supervised methods for learning the syntactic structure of natural languages, in line with the mainstream computational linguistics tradition of focusing on syntactic analysis. In particular, we focus on different ways of defining the basic units of linguistic structure. Each grammar instance that we consider therefore corresponds to a way of extracting elementary fragments from annotated natural language sentences given as input. In general, we refer to this set of elementary units as the grammar that has been learned. Combining the fragments of a grammar into full syntactic structures makes it possible, in principle, to generate an infinite number of grammatical sentences. The disambiguation problem, in which a sentence may receive more than one analysis, can be solved if we supplement the formal grammar with a probabilistic model. In Chapter 2, we start by defining the general class of Tree Substitution Grammars (TSGs) and show how it is supplemented with a probabilistic model. In Chapter 3, we focus on a subclass of lexicalized TSGs (LTSGs), viz. one-anchored lexicalized tree substitution grammars, characterized by elementary trees that have exactly one lexical item. The goal of Chapter 4 is to further analyze the grammars considered in Chapter 3.
