graph-based processing of multi-level annotated corpora
This library enables you to process linguistic corpora with multiple levels of annotations by:
converting the different annotation formats into separate graphs and
merging these graphs into a single multidigraph (based on the common tokenization of the annotation layers)
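The merging idea can be pictured with a small networkx sketch. This is not the library's own merging code: the node names, the attributes and the use of `networkx.compose` are assumptions made for illustration. It only shows how two annotation layers that share the same token nodes end up in one multidigraph.

```python
import networkx as nx


def build_syntax_layer():
    """Hypothetical syntax layer: one sentence node dominating two tokens."""
    graph = nx.MultiDiGraph()
    graph.add_node("tok1", text="It")
    graph.add_node("tok2", text="rains")
    graph.add_node("syn:S", cat="S")
    graph.add_edge("syn:S", "tok1", layer="syntax")
    graph.add_edge("syn:S", "tok2", layer="syntax")
    return graph


def build_rst_layer():
    """Hypothetical RST layer: one segment spanning the same two tokens."""
    graph = nx.MultiDiGraph()
    graph.add_node("tok1", text="It")
    graph.add_node("tok2", text="rains")
    graph.add_node("rst:seg1", rel="span")
    graph.add_edge("rst:seg1", "tok1", layer="rst")
    graph.add_edge("rst:seg1", "tok2", layer="rst")
    return graph


# Nodes with identical IDs (the shared tokens) are unified,
# everything else is simply added to the result.
merged = nx.compose(build_syntax_layer(), build_rst_layer())
```

Because both layers use the same token IDs, each token in `merged` is reachable from the syntax node and from the RST segment.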
So far, the following formats can be imported and merged:
TigerXML (a format for representing tree-like syntax graphs with secondary edges)
RS3 (a format used by RSTTool to annotate documents with Rhetorical Structure Theory)
an ad-hoc plain text format for annotating expletives (which you’re probably not interested in)
Install from PyPI
pip install discoursegraphs # prepend 'sudo' if needed
or, if you’re oldschool:
easy_install discoursegraphs # prepend 'sudo' if needed
Install from source
git clone https://github.com/arne-cl/discoursegraphs.git
cd discoursegraphs
python setup.py install # prepend 'sudo' if needed
Right now, there’s only a primitive command line interface that merges the syntax, RST and expletive annotation layers into one graph and generates a dot file from it.
discoursegraphs syntax/doc.xml rst/doc.rs3 expletives/doc.txt doc.dot
dot -Tpdf doc.dot > discoursegraph.pdf # generates a PDF from the dot file
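To give a rough idea of what such a dot file contains, here is a hand-rolled sketch that serializes a small networkx graph as DOT text. The function `to_dot` and the example graph are hypothetical and are not part of discoursegraphs. The sketch also does not escape quotes in node names.

```python
import networkx as nx


def to_dot(graph, name="discoursegraph"):
    """Serialize a directed graph as DOT text (illustrative only)."""
    lines = ["digraph {} {{".format(name)]
    for node in sorted(graph.nodes()):
        lines.append('    "{}";'.format(node))
    # Sort edges by (source, target) so the output is deterministic.
    edges = sorted(graph.edges(data=True), key=lambda edge: (edge[0], edge[1]))
    for source, target, attrs in edges:
        label = attrs.get("label", "")
        lines.append('    "{}" -> "{}" [label="{}"];'.format(source, target, label))
    lines.append("}")
    return "\n".join(lines)


example = nx.MultiDiGraph()
example.add_edge("S", "tok1", label="syntax")
example.add_edge("seg1", "tok1", label="rst")
dot_text = to_dot(example)
```

Writing `dot_text` to a file gives input that the `dot` command shown above should be able to render.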
If you’re interested in working with just one of those layers, you’ll have to call the code directly:
from discoursegraphs import readwrite
tiger_docgraph = readwrite.TigerDocumentGraph('syntax/doc.xml')
rst_docgraph = readwrite.RSTGraph('rst/doc.rs3')
expletives_docgraph = readwrite.AnaphoraDocumentGraph('expletives/doc.txt')
All the document graphs generated in this example are derived from the networkx.MultiDiGraph class, so you should be able to use all of its methods.
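As a sketch of what that means in practice, the following uses a plain `networkx.MultiDiGraph` as a stand-in for a document graph, so that it runs without any corpus files. The node names and edge labels are made up. Since the document graph classes above derive from `MultiDiGraph`, the same method calls should work on them.

```python
import networkx as nx

# Stand-in for a document graph; a real one would be created
# with one of the readwrite classes shown above.
docgraph = nx.MultiDiGraph()
docgraph.add_edge("S", "NP", label="SB")
docgraph.add_edge("S", "VP", label="HD")
docgraph.add_edge("VP", "NP", label="secondary")
docgraph.add_edge("S", "NP", label="secondary")  # parallel edge between the same nodes

# Distinct child nodes of "S" (parallel edges do not duplicate neighbours).
children_of_s = sorted(docgraph.successors("S"))

# Number of incoming edges of "NP", counting parallel edges separately.
np_in_degree = docgraph.in_degree("NP")

# All edge labels in the graph.
edge_labels = sorted(attrs["label"] for _, _, attrs in docgraph.edges(data=True))
```

The parallel edges between `S` and `NP` are kept apart, which is what a multidigraph adds over a plain directed graph.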
Source code documentation is available here, but you can always get an up-to-date local copy using Sphinx.
You can generate an HTML or PDF version by running these commands in the docs directory:
make latexpdf
to produce a PDF (docs/_build/latex/discoursegraphs.pdf) and
make html
to produce a set of HTML files (docs/_build/html/index.html).
If you’d like to visualize your graphs, you will also need:
People who downloaded this also like
SaltNPepper (a converter framework for various linguistic data formats)
Release date: 13-May-2014
Release date: 25-Apr-2014
added usage examples to readme
discoursegraphs script now uses the commandline interface of the merging module
Release date: 24-Apr-2014
first public release
imports: RS3, TigerXML and an ad-hoc format for expletive annotation
merge these formats/files into a single multidigraph
generates simple dot/graphviz-based visualization