
Discontinuous Data-Oriented Parsing

Project description

The aim of this project is to parse discontinuous constituents with Data-Oriented Parsing (DOP). Concretely, we build a DOP model with a Linear Context-Free Rewriting System (LCFRS) as the symbolic backbone.

(Figure: a contrived discontinuous constituent, for expository purposes.)

Background

This work is partly described in the following publications:

Some references to implemented algorithms:

  • parser, estimates: Maier & Kallmeyer (2010), Data-driven parsing with probabilistic linear context-free rewriting systems.

  • data-oriented parsing (DOP):

    • Goodman (2002), Efficient parsing of DOP with PCFG-reductions

    • Sangati & Zuidema (2011), Accurate parsing with compact tree-substitution grammars: Double-DOP

  • k-best list: Huang & Chiang (2005), Better k-best parsing

  • optimal binarization: Gildea (2010), Optimal parsing strategies for linear context-free rewriting systems

Requirements

The code requires Python with development headers, NumPy, Cython, and a C compiler such as GCC. For example, to install these dependencies and compile the code on Ubuntu (tested on 12.04), run the following sequence of commands:

sudo apt-get install python-dev python-numpy build-essential
sudo pip install cython
git clone --depth 1 git://github.com/andreasvc/disco-dop.git
cd disco-dop
python setup.py install

(Add --user to the pip command and to the python setup.py command to install to your home directory, which does not require root privileges.) To port the code to another compiler such as Visual C, replace the compiler intrinsics in macros.h, bit.pyx, and bit.pxd with their equivalents for the compiler in question. This mainly concerns operations to scan for bits in integers, for which these compiler intrinsics provide the most efficient implementation on a given processor.
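As an illustration of the kind of substitution involved (a sketch only, not the project's actual macros.h; the helper name lowest_set_bit is hypothetical), finding the index of the lowest set bit of a 64-bit integer could be written with the GCC builtin versus the Visual C intrinsic as follows:

/* Illustrative sketch only, not the project's actual macros.h. */
#include <stdint.h>

#if defined(_MSC_VER)            /* Visual C: intrinsic from <intrin.h> */
#include <intrin.h>
static __inline int lowest_set_bit(uint64_t x) {
    unsigned long idx;
    /* _BitScanForward64 returns zero when x is 0, nonzero otherwise */
    return _BitScanForward64(&idx, x) ? (int)idx : -1;
}
#else                            /* GCC/Clang: count-trailing-zeros builtin */
static inline int lowest_set_bit(uint64_t x) {
    return x ? __builtin_ctzll(x) : -1;
}
#endif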

Usage: parser

To run a full experiment from treebank to evaluation on a test set, copy the file sample.prm and edit its parameters. The experiment can then be run by executing:

discodop runexp filename.prm

This will create a new directory with the basename of the parameter file, i.e., filename/ in this case. This directory must not exist yet, to avoid accidentally overwriting previous results. The directory will contain the grammar rules and lexicon in a text format, as well as the parsing results and the gold standard file in Negra’s export format.

Corpora are expected to be in Negra’s export format. Access to the Negra corpus itself can be requested for non-commercial purposes, while the Tiger corpus is freely available for download for research purposes.
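To give a rough idea of the format (an illustrative sketch only: the tag set is arbitrary, the morphology and edge-label columns are left as --, and the exact column layout differs between export-format versions), a toy sentence with a discontinuous NP could look like this:

#BOS 1
A               DT      --      --      501
hearing         NN      --      --      501
is              VBZ     --      --      503
scheduled       VBN     --      --      502
on              IN      --      --      500
the             DT      --      --      500
issue           NN      --      --      500
today           RB      --      --      502
#500            PP      --      --      501
#501            NP      --      --      502
#502            VP      --      --      503
#503            S       --      --      0
#EOS 1

Discontinuity needs no special notation here: the words "A", "hearing" and the PP node #500 all point to the NP node #501 even though "is scheduled" intervenes, so the NP covers a non-contiguous span.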

Alternatively, there is a simpler parser in the shedskin/ directory. This LCFRS parser only produces the Viterbi parse. The grammar is supplied in a file following a simple text format. The plcfrs.py script can be translated to C++ by the Shed Skin compiler, after which the resulting code can be compiled with make:

sudo apt-get install shedskin
cd disco-dop/shedskin
shedskin -b -l -w plcfrs.py
make

Usage: tools

Aside from the parser there are some standalone tools:

fragments:

Finds recurring or common fragments in one or more treebanks. It can be used with discontinuous as well as Penn-style bracketed treebanks.

treetransforms:

A command line interface to perform transformations on treebanks such as binarization.

grammar:

A command line interface to read off grammars from (binarized) treebanks.

parser:

A basic command line interface to the parser comparable to bitpar. Reads grammars from text files.

eval:

Discontinuous evaluation. Reports F-scores and other metrics. Accepts EVALB parameter files:

discodop eval negra-corpus.export sample/plcfrs.export proper.prm

demos:

Contains examples of various formalisms encoded in LCFRS grammars.

gen:

An experiment in generation with LCFRS.

All of these can be started with the discodop command. For example:

discodop fragments --help

… prints instructions for the fragment extractor.
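Similarly, assuming two treebank files treebank1.export and treebank2.export (placeholder names; depending on the treebank format, additional options listed under --help may be needed), the fragments common to both treebanks could be extracted along these lines:

discodop fragments treebank1.export treebank2.export > fragments.txt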

Usage: web interfaces

There are two web-based tools, which require Flask to be installed:

web/parse.py:

A web interface to the parser. Expects a series of grammars in subdirectories of web/grammars/.

web/treesearch.py:

A web interface for searching through treebanks. Expects one or more treebanks with the .mrg extension in the directory web/corpus/.
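Assuming Flask is installed and the grammars or corpora are in place as described above, these are presumably started like any other Flask script, by running them directly and pointing a browser at the address Flask reports (by default http://localhost:5000), e.g.:

python web/parse.py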

Acknowledgments

The Tree data structures in tree.py and the simple binarization algorithm in treetransforms.py were taken from NLTK. The Zhang-Shasha tree-edit distance algorithm in treedist.py was taken from https://github.com/timtadh/zhang-shasha. Elements of the PLCFRS parser and punctuation re-attachment are based on code from rparse. Various other bits were adapted from the Stanford parser, Berkeley parser, Bubs parser, &c.

Download files

Source distribution: disco-dop-0.2.tar.gz (253.1 kB)
