graph-based processing of multi-level annotated corpora

DiscourseGraphs

This library enables you to process linguistic corpora with multiple levels of annotations by:

  1. converting the different annotation formats into separate graphs,

  2. merging these graphs into a single multidigraph (based on the common tokenization of the annotation layers),

  3. exporting your (merged) graphs into several output formats, and

  4. visualizing linguistic graphs directly in an IPython notebook.
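
To illustrate the merging idea, here is a minimal sketch in plain Python. It is not the library's implementation: the function `merge_layers`, the node IDs (`tok_0`, `edu_1`, ...) and the `layer` attribute are hypothetical names chosen for this example. It only shows why a shared tokenization makes merging straightforward: token nodes with the same ID become shared nodes, and edges from every layer are kept side by side.

```python
def merge_layers(*layers):
    """Merge several edge lists into one multidigraph-like structure.

    Nodes are identified by their IDs, so token nodes that occur in
    more than one layer (same tokenization) end up as shared nodes.
    """
    nodes, edges = set(), []
    for layer_name, layer_edges in layers:
        for source, target in layer_edges:
            nodes.update((source, target))
            # remember the originating layer; parallel edges are allowed
            edges.append((source, target, {"layer": layer_name}))
    return nodes, edges


# two hypothetical annotation layers over the tokens "Peter" and "sleeps"
syntax = ("syntax", [("S", "NP"), ("NP", "tok_0"), ("S", "tok_1")])
rst = ("rst", [("edu_1", "tok_0"), ("edu_1", "tok_1")])

nodes, edges = merge_layers(syntax, rst)

# which nodes dominate the first token, and in which layer?
parents_of_tok_0 = sorted(
    (source, attrs["layer"])
    for source, target, attrs in edges
    if target == "tok_0"
)
print(parents_of_tok_0)
```

After merging, the first token is reachable from both the syntax node and the RST segment, which is the kind of cross-layer query a merged graph is meant to support.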

Import formats

So far, the following formats can be imported and merged:

  • TigerXML (a format for representing tree-like syntax graphs with secondary edges)

  • NeGra Export Format (a format used i.a. for the TüBa-D/Z Treebank)

  • Penn Treebank format (an s-expressions/lisp/brackets format for representing syntax trees)

  • a number of formats for Rhetorical Structure Theory:

    • RS3 (a format used by RSTTool to annotate documents with Rhetorical Structure Theory)

    • the .dis “LISP” format used by the RST-DT corpus

    • URML (a format for underspecified rhetorical structure trees)

  • MMAX2 (a format / GUI tool for annotating spans and connections between them, e.g. coreferences)

  • CoNLL 2009 and CoNLL 2010 formats (used for annotating i.a. dependency parses and coreference links)

  • ConanoXML (a format for annotating connectives, used by Conano)

  • Decour (an XML format used by a corpus of DEceptive statements in Italian COURts)

  • EXMARaLDA, a format for annotating spans in spoken or written language

  • an ad-hoc plain text format for annotating expletives (which you’re probably not interested in)
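
As an illustration of the bracketed notation used by the Penn Treebank format listed above, the following sketch parses a small s-expression into nested `(label, children)` tuples. It is a simplified, hypothetical parser written for this example (the names `parse_brackets` and `leaves` are not part of discoursegraphs), and it assumes every opening bracket is followed by a label, so it does not handle the empty outer brackets found in some treebank files.

```python
def parse_brackets(text):
    """Parse a bracketed tree string into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    stack = [("ROOT", [])]  # artificial root that collects the parsed tree
    expect_label = False
    for token in tokens:
        if token == "(":
            expect_label = True
        elif token == ")":
            node = stack.pop()
            stack[-1][1].append(node)  # attach finished node to its parent
        elif expect_label:
            stack.append((token, []))  # open a new node with this label
            expect_label = False
        else:
            stack[-1][1].append(token)  # a leaf, i.e. a word
    return stack[0][1][0]


def leaves(tree):
    """Return the words of a parsed tree in their original order."""
    _label, children = tree
    words = []
    for child in children:
        words.extend(leaves(child) if isinstance(child, tuple) else [child])
    return words


tree = parse_brackets("(S (NP Peter) (VP sleeps))")
print(tree)
print(leaves(tree))
```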

Export formats

discoursegraphs can export graphs into the following formats / for the following tools:

  • dot format, which is used by the open source graph visualization software graphviz

  • geoff format, used by the neo4j graph database

  • GEXF and GraphML (common interchange formats for graphs used by various tools such as Gephi and Cytoscape)

  • PAULA XML 1.1, an exchange format for linguistic data (exporter is still buggy)

  • EXMARaLDA, a tool for annotating spans in spoken or written language

  • CoNLL 2009 (so far, only tokens, sentence boundaries and coreferences are exported)
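
To give an idea of what a dot export involves, here is a minimal sketch that serializes a list of labelled edges as a graphviz `digraph`. The function `to_dot` and the example edges are hypothetical and independent of the exporter shipped with discoursegraphs; the sketch also assumes that node IDs and labels contain no double quotes, since it does not escape them.

```python
def to_dot(edges, name="discoursegraph"):
    """Serialize (source, target, label) triples as a graphviz digraph."""
    lines = ["digraph %s {" % name]
    for source, target, label in edges:
        # quote node IDs so that IDs with spaces or punctuation stay valid
        lines.append('  "%s" -> "%s" [label="%s"];' % (source, target, label))
    lines.append("}")
    return "\n".join(lines)


edges = [
    ("S", "NP", "syntax"),
    ("NP", "Peter", "syntax"),
    ("edu_1", "Peter", "rst"),
]
dot_source = to_dot(edges)
print(dot_source)
```

If the resulting string is saved to a file such as `graph.dot`, graphviz can render it with `dot -Tpdf graph.dot > graph.pdf`.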

Installation

This should work on both Linux and Mac OS X using Python 2.7 and either pip or easy_install.

Install from PyPI

pip install discoursegraphs # prepend 'sudo' if needed

or, if you’re old-school:

easy_install discoursegraphs # prepend 'sudo' if needed

Install from source

sudo apt-get install python-dev libxml2-dev libxslt-dev pkg-config graphviz-dev libgraphviz-dev -y
sudo easy_install -U setuptools
git clone https://github.com/arne-cl/discoursegraphs.git
cd discoursegraphs
sudo python setup.py install

Usage

The command line interface of DiscourseGraphs allows you to merge syntax, rhetorical structure, connectives and expletives annotation files into one graph and to store this graph in one of several output formats (e.g. the geoff format used by the neo4j graph database or the dot format used by the graphviz plotting tool).

discoursegraphs -t syntax/maz-13915.xml -r rst/maz-13915.rs3 -c connectors/maz-13915.xml -a anaphora/tosik/das/maz-13915.txt -o dot
dot -Tpdf doc.dot > discoursegraph.pdf # generates a PDF from the dot file

If you’re interested in working with just one of those layers, you’ll have to call the code directly:

import discoursegraphs as dg
tiger_docgraph = dg.read_tiger('syntax/doc.xml')
rst_docgraph = dg.read_rs3('rst/doc.rs3')
expletives_docgraph = dg.read_anaphoricity('expletives/doc.txt')

All the document graphs generated in this example are derived from the networkx.MultiDiGraph class, so you should be able to use all of its methods.
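
The following sketch shows a few of the `networkx.MultiDiGraph` methods that are useful in this setting. To keep it self-contained, it builds a small stand-in graph by hand instead of reading a corpus file; the node IDs and the `token` and `layer` attributes are assumptions made for this example and do not necessarily match the attribute names discoursegraphs uses.

```python
import networkx as nx

# hand-built stand-in for a merged document graph (hypothetical IDs/attributes)
graph = nx.MultiDiGraph()
graph.add_node("tok_0", token="Peter")
graph.add_node("tok_1", token="sleeps")
graph.add_edge("S", "tok_0", layer="syntax")
graph.add_edge("S", "tok_1", layer="syntax")
graph.add_edge("edu_1", "tok_0", layer="rst")
# a multidigraph may hold several edges between the same pair of nodes
graph.add_edge("S", "tok_0", layer="secondary")

print(graph.number_of_nodes())            # nodes are created implicitly by add_edge
print(graph.number_of_edges("S", "tok_0"))  # parallel edges are counted separately
print(sorted(graph.predecessors("tok_0")))  # unique nodes pointing at the token

# collect the layers of all edges that end in the first token
layers_into_tok_0 = sorted(
    attrs["layer"]
    for _source, target, attrs in graph.edges(data=True)
    if target == "tok_0"
)
print(layers_into_tok_0)
```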

Documentation

Source code documentation is available here, but you can always get an up-to-date local copy using Sphinx.

You can generate an HTML or PDF version by running these commands in the docs directory:

make latexpdf

to produce a PDF (docs/_build/latex/discoursegraphs.pdf) and

make html

to produce a set of HTML files (docs/_build/html/index.html).

Requirements

If you’d like to visualize your graphs, you will also need:

License and Citation

This software is released under a 3-Clause BSD license. If you use discoursegraphs in your academic work, please cite the following paper:

Neumann, A. 2015. discoursegraphs: A graph-based merging tool and converter for multilayer annotated corpora. In Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015), pp. 309-312.

@inproceedings{neumann2015discoursegraphs,
  title={discoursegraphs: A graph-based merging tool and converter for multilayer annotated corpora},
  author={Neumann, Arne},
  booktitle={Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015)},
  pages={309-312},
  year={2015}
}

Author

Arne Neumann

People who downloaded this also like

  • SaltNPepper: a converter framework for various linguistic data formats

  • educe: a library for handling discourse-annotated corpora (SDRT, RST and PDTB)

  • treetools: a library for converting treebanks and grammar extraction (supports i.a. TigerXML and Negra/Tüba-Export formats)

  • TCFnetworks: library for creating graphs from annotated text corpora (based on TCF).

News

0.4.14 (2021-03-14)

Release date: 14-March-2021

  • minor: renamed fixture for clarity

0.4.13 (2021-03-12)

Release date: 12-March-2021

  • fix for parsing DPLP trees with only one EDU

0.4.12 (2021-02-22)

Release date: 22-February-2021

  • added support for StageDP RST parser

0.4.11 (2020-12-10)

Release date: 10-December-2020

  • fix rs3 parser for files produced by isanlp_rst

0.4.10 (2020-06-03)

Release date: 06-June-2020

  • added standard RST relations to every rs3 file

0.4.9 (2020-05-10)

Release date: 10-May-2020

  • added option to write_svgtree() to return SVG image as a string

0.4.8 (2020-04-25)

Release date: 25-April-2020

  • fixed dependencies in 0.4.7

0.4.7 (2020-04-23)

Release date: 23-April-2020

  • fixed dependencies in 0.4.6

0.4.6 (2020-04-21)

Release date: 21-April-2020

  • added write_svgtree (create SVG files from nltk trees)

0.4.5 (2019-05-16)

Release date: 12-May-2019

  • fixed rstlatex nested tree generation

0.4.4 (2019-05-11)

Release date: 11-May-2019

  • fixed rstlatex formatting / inheritance bug

0.4.3 (2019-05-10)

Release date: 10-May-2019

  • fixed rstlatex file export

0.4.2 (2019-05-10)

Release date: 10-May-2019

  • fixed dependency in setup.py

0.4.1 (2019-04-27)

Release date: 27-April-2019

  • added exporter for RST trees in Latex

0.4.0 (2019-04-25)

Release date: 25-April-2019

  • almost three years of additions/fixes (mostly RST-related importers/exporters, e.g. URML, dis, rs3, HILDA, DPLP, Heilman and Sagae (2015))

0.3.2 (2016-05-30)

Release date: 30-May-2016

  • second attempt to fix the distribution of the data directory with the package

  • added exporter for FREQT, which extracts frequent embedded subtrees

0.3.1 (2016-05-07)

Release date: 7-May-2016

  • attempt to fix the distribution of the data directory with the package

  • document graphs can be converted into PTB-style strings (readwrite/tree.py)

  • node/edge collections are now ordered (OrderedDict)

0.3.0 (2016-04-30)

Release date: 30-April-2016

  • almost two years and countless commits later, finally a new official release

  • added lots of importers and exporters and simplified the API

  • added 80+ tests (py.test), continuous integration (Travis) and docker support

0.1.2 (2014-05-13)

Release date: 13-May-2014

  • added basic Geoff and Neo4j exporter (not yet available via the command line)

  • added sphinx-based documentation

0.1.1 (2014-04-25)

Release date: 25-Apr-2014

  • small improvements

  • added usage examples to readme

  • discoursegraphs script now uses the commandline interface of the merging module

0.1.0 (2014-04-24)

Release date: 24-Apr-2014

  • first public release

  • imports: RS3, TigerXML and an ad-hoc format for expletive annotation

  • merge these formats/files into a single multidigraph

  • generates simple dot/graphviz-based visualization
