graph-based processing of multi-level annotated corpora
DiscourseGraphs
This library enables you to process linguistic corpora with multiple levels of annotations by:
converting the different annotation formats into separate graphs
merging these graphs into a single multidigraph (based on the common tokenization of the annotation layers)
exporting your (merged) graphs into several output formats
visualizing linguistic graphs directly in an IPython notebook
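The merging idea can be illustrated with a minimal sketch in plain Python. This is not the library's implementation: the `merge_layers` helper, the dictionary layout and all node IDs are assumptions made up for this example. It only shows why a common tokenization matters: layers that use the same token IDs can be unified on those nodes while keeping all of their edges.

```python
from collections import defaultdict


def merge_layers(*layers):
    """Merge annotation layers that share token node IDs (hypothetical helper).

    Each layer is a dict with 'nodes' (node ID -> attribute dict) and
    'edges' (a list of (source, target, attribute dict) tuples).
    """
    nodes = defaultdict(dict)
    edges = []
    for layer in layers:
        for node_id, attrs in layer['nodes'].items():
            # Nodes with the same ID (e.g. shared tokens) are unified and
            # their attributes combined.
            nodes[node_id].update(attrs)
        # Edges are appended, never overwritten, so parallel edges between
        # the same pair of nodes would survive the merge.
        edges.extend(layer['edges'])
    return {'nodes': dict(nodes), 'edges': edges}


# Two made-up layers annotating the same two tokens.
syntax = {
    'nodes': {'tok1': {'token': 'Hello'}, 'tok2': {'token': 'world'},
              'NP': {'cat': 'NP'}},
    'edges': [('NP', 'tok1', {'layer': 'syntax'}),
              ('NP', 'tok2', {'layer': 'syntax'})],
}
coref = {
    'nodes': {'tok1': {'token': 'Hello'}, 'tok2': {'token': 'world'},
              'markable1': {'type': 'mention'}},
    'edges': [('markable1', 'tok2', {'layer': 'coref'})],
}

merged = merge_layers(syntax, coref)
```

The merged structure contains each token once, plus the syntax node and the coreference markable, with the edges of both layers attached to the shared tokens.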
Import formats
So far, the following formats can be imported and merged:
TigerXML (a format for representing tree-like syntax graphs with secondary edges)
NeGra Export Format (a format used i.a. for the TüBa-D/Z Treebank)
Penn Treebank format (an s-expressions/lisp/brackets format for representing syntax trees)
a number of formats for Rhetorical Structure Theory:
MMAX2 (a format / GUI tool for annotating spans and connections between them, e.g. coreferences)
CoNLL 2009 and CoNLL 2010 formats (used for annotating i.a. dependency parses and coreference links)
ConanoXML (a format for annotating connectives, used by Conano)
Decour (an XML format used by a corpus of DEceptive statements in Italian COURts)
EXMARaLDA, a format for annotating spans in spoken or written language
an ad-hoc plain text format for annotating expletives (which you’re probably not interested in)
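To give a feel for the bracketed Penn Treebank notation listed above, here is a small, self-contained sketch that reads one bracketed tree into nested tuples. It is not the importer that discoursegraphs uses; `parse_brackets` and `leaves` are hypothetical helpers, and the sketch assumes a single well-formed tree in which every bracket has a label (it does not handle the unlabeled outer brackets found in some treebank files).

```python
import re


def parse_brackets(text):
    """Parse one bracketed tree into nested (label, children) tuples."""
    # Split the input into '(' and ')' plus runs of non-space, non-bracket text.
    tokens = re.findall(r'[()]|[^\s()]+', text)

    def parse(pos):
        # tokens[pos] is '(' and tokens[pos + 1] is the node label.
        label = tokens[pos + 1]
        pos += 2
        children = []
        while tokens[pos] != ')':
            if tokens[pos] == '(':
                child, pos = parse(pos)
            else:
                # A bare string is a leaf, i.e. a word of the sentence.
                child, pos = tokens[pos], pos + 1
            children.append(child)
        # Skip the closing bracket of this node.
        return (label, children), pos + 1

    tree, _ = parse(0)
    return tree


def leaves(tree):
    """Return the words of a parsed tree in sentence order."""
    _, children = tree
    words = []
    for child in children:
        if isinstance(child, tuple):
            words.extend(leaves(child))
        else:
            words.append(child)
    return words


tree = parse_brackets('(S (NP (DT the) (NN cat)) (VP (VBD slept)))')
words = leaves(tree)
```

Each tuple pairs a category label with its children, so the tree can be walked recursively, as `leaves` does to recover the token sequence.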
Export formats
discoursegraphs can export graphs into the following formats / for the following tools:
dot format, which is used by the open source graph visualization software graphviz
geoff format, used by the neo4j graph database
GEXF and GraphML (common interchange formats for graphs used by various tools such as Gephi and Cytoscape)
PAULA XML 1.1, an exchange format for linguistic data (exporter is still buggy)
EXMARaLDA, a tool for annotating spans in spoken or written language
CoNLL 2009 (so far, only tokens, sentence boundaries and coreferences are exported)
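As an illustration of what a dot export boils down to, the following sketch serializes a list of labeled edges into Graphviz dot syntax. It is a simplified stand-in rather than the exporter shipped with discoursegraphs; the `to_dot` function, the edge labels and the node names are assumptions for this example, and node names containing double quotes are not escaped.

```python
def to_dot(edges, name='G'):
    """Serialize (source, target, label) triples as a dot digraph (hypothetical helper)."""
    lines = ['digraph {} {{'.format(name)]
    for source, target, label in edges:
        # One dot statement per edge; identifiers are quoted so that
        # names with spaces or punctuation remain valid.
        lines.append('  "{}" -> "{}" [label="{}"];'.format(source, target, label))
    lines.append('}')
    return '\n'.join(lines)


# Made-up edges of a tiny syntax graph.
edges = [('NP', 'tok1', 'dominates'), ('NP', 'tok2', 'dominates')]
dot_source = to_dot(edges)
```

Saving `dot_source` to a file and running `dot -Tpdf` on it would render the graph, provided graphviz is installed.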
Installation
This should work on both Linux and Mac OS X using Python 2.7 and either pip or easy_install.
Install from PyPI
pip install discoursegraphs # prepend 'sudo' if needed
or, if you’re oldschool:
easy_install discoursegraphs # prepend 'sudo' if needed
Install from source
sudo apt-get install python-dev libxml2-dev libxslt-dev pkg-config graphviz-dev libgraphviz-dev -y
sudo easy_install -U setuptools
git clone https://github.com/arne-cl/discoursegraphs.git
cd discoursegraphs
sudo python setup.py install
Usage
The command line interface of DiscourseGraphs allows you to merge syntax, rhetorical structure, connectives and expletives annotation files into one graph and to store this graph in one of several output formats (e.g. the geoff format used by the neo4j graph database or the dot format used by the graphviz plotting tool).
discoursegraphs -t syntax/maz-13915.xml -r rst/maz-13915.rs3 -c connectors/maz-13915.xml -a anaphora/tosik/das/maz-13915.txt -o dot
dot -Tpdf doc.dot > discoursegraph.pdf # generates a PDF from the dot file
If you’re interested in working with just one of those layers, you’ll have to call the code directly:
import discoursegraphs as dg
tiger_docgraph = dg.read_tiger('syntax/doc.xml')
rst_docgraph = dg.read_rs3('rst/doc.rs3')
expletives_docgraph = dg.read_anaphoricity('expletives/doc.txt')
All the document graphs generated in this example are derived from the networkx.MultiDiGraph class, so you should be able to use all of its methods.
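The sketch below shows the kind of standard networkx methods this refers to. To stay runnable without corpus files, it builds a small `networkx.MultiDiGraph` by hand as a stand-in for a document graph; the node IDs and the `cat`, `token` and `edge_type` attribute names are hypothetical and not necessarily the ones discoursegraphs assigns.

```python
import networkx as nx

# Hand-built stand-in for a document graph (not produced by discoursegraphs).
graph = nx.MultiDiGraph()
graph.add_node('S', cat='S')
graph.add_node('tok1', token='Hello')
graph.add_node('tok2', token='world')
graph.add_edge('S', 'tok1', edge_type='dominance')
graph.add_edge('S', 'tok2', edge_type='dominance')
# A second, parallel edge between the same two nodes is allowed
# in a multidigraph.
graph.add_edge('S', 'tok2', edge_type='secondary')

num_nodes = graph.number_of_nodes()
num_edges = graph.number_of_edges()
# successors() lists each neighbor once, even with parallel edges.
children = sorted(graph.successors('S'))
# out_edges(..., data=True) yields one (source, target, attributes)
# triple per edge, including parallel ones.
edge_types = sorted(data['edge_type']
                    for _, _, data in graph.out_edges('S', data=True))
```

The same calls should apply to a graph returned by one of the `dg.read_*` functions, since those graphs inherit from the same class.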
Documentation
Source code documentation is available here, but you can always get an up-to-date local copy using Sphinx.
You can generate an HTML or PDF version by running these commands in the docs directory:
make latexpdf
to produce a PDF (docs/_build/latex/discoursegraphs.pdf) and
make html
to produce a set of HTML files (docs/_build/html/index.html).
Requirements
If you’d like to visualize your graphs, you will also need:
License and Citation
This software is released under a 3-Clause BSD license. If you use discoursegraphs in your academic work, please cite the following paper:
Neumann, A. 2015. discoursegraphs: A graph-based merging tool and converter for multilayer annotated corpora. In Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015), pp. 309-312.
@inproceedings{neumann2015discoursegraphs, title={discoursegraphs: A graph-based merging tool and converter for multilayer annotated corpora}, author={Neumann, Arne}, booktitle={Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015)}, pages={309-312}, year={2015} }
People who downloaded this also like
SaltNPepper: a converter framework for various linguistic data formats
educe: a library for handling discourse-annotated corpora (SDRT, RST and PDTB)
treetools: a library for converting treebanks and grammar extraction (supports i.a. TigerXML and Negra/Tüba-Export formats)
TCFnetworks: library for creating graphs from annotated text corpora (based on TCF).
News
0.4.14 (2021-03-14)
Release date: 14-March-2021
minor: renamed fixture for clarity
0.4.13 (2021-03-12)
Release date: 12-March-2021
fix for parsing DPLP trees with only one EDU
0.4.12 (2021-02-22)
Release date: 22-February-2021
added support for StageDP RST parser
0.4.11 (2020-12-10)
Release date: 10-December-2020
fix rs3 parser for files produced by isanlp_rst
0.4.10 (2020-06-03)
Release date: 06-June-2020
added standard RST relations to every rs3 file
0.4.9 (2020-05-10)
Release date: 10-May-2020
added option to write_svgtree() to return SVG image as a string
0.4.8 (2020-04-25)
Release date: 25-April-2020
fixed dependencies in 0.4.7
0.4.7 (2020-04-23)
Release date: 23-April-2020
fixed dependencies in 0.4.6
0.4.6 (2020-04-21)
Release date: 21-April-2020
added write_svgtree (create SVG files from nltk trees)
0.4.5 (2019-05-16)
Release date: 12-May-2019
fixed rstlatex nested tree generation
0.4.4 (2019-05-11)
Release date: 11-May-2019
fixed rstlatex formatting / inheritance bug
0.4.3 (2019-05-10)
Release date: 10-May-2019
fixed rstlatex file export
0.4.2 (2019-05-10)
Release date: 10-May-2019
fixed dependency in setup.py
0.4.1 (2019-04-27)
Release date: 27-April-2019
added exporter for RST trees in Latex
0.4.0 (2019-04-25)
Release date: 25-April-2019
almost three years of additions/fixes (mostly RST-related importers/exporters, e.g. URML, dis, rs3, HILDA, DPLP, Heilman and Sagae (2015))
0.3.2 (2016-05-30)
Release date: 30-May-2016
second attempt to fix the distribution of the data directory with the package
added exporter for FREQT, which extracts frequent embedded subtrees
0.3.1 (2016-05-07)
Release date: 7-May-2016
attempt to fix the distribution of the data directory with the package
document graphs can be converted into PTB-style strings (readwrite/tree.py)
node/edge collections are now ordered (OrderedDict)
0.3.0 (2016-04-30)
Release date: 30-April-2016
almost two years and countless commits later, finally a new official release
added lots of importers and exporters and simplified the API
added 80+ tests (py.test), continuous integration (Travis) and docker support
0.1.2 (2014-05-13)
Release date: 13-May-2014
0.1.1 (2014-04-25)
Release date: 25-Apr-2014
small improvements
added usage examples to readme
discoursegraphs script now uses the commandline interface of the merging module
0.1.0 (2014-04-24)
Release date: 24-Apr-2014
first public release
imports: RS3, TigerXML and an ad-hoc format for expletive annotation
merge these formats/files into a single multidigraph
generates simple dot/graphviz-based visualization