Linear Average Time Extraction of Phrase-structure Fragments


We present a method and implementation for extracting tree fragments from treebanks. Using a tree kernel method the largest overlapping fragments are extracted from each pair of trees (cf. figure 1 & 2), resulting in a set of fragments, each of which occurs at least twice in the treebank. The algorithm presented is able to find these fragments 70 times faster (cf. table 1) than the previously available implementation (Sangati et al., 2010). This substantial speedup is due to the incorporation of the Fast Tree Kernel (Moschitti, 2006), and opens up the possibility of handling much larger corpora. Furthermore, the tool supports trees with discontinuous constituents, used in inter alia the Ger- man and Dutch treebanks. The resulting fragments can be used for statistical parsing in a tree-substitution grammar, as in Data-Oriented Parsing (Sangati and Zuidema, 2011; van Cranenburgh and Bod, 2013). Another application is in classification problems such as authorship attribution (van Cranenburgh, 2012) and native language detection (Swanson and Charniak, 2012).

The 24th Meeting of Computational Linguistics in The Netherlands (CLIN 2014)