NLP tools at TUW Informatics

Project description

TUW-NLP

NLP utilities developed at TUW Informatics.

The main goal of the library is to provide a unified interface for working with different semantic graph representations. Graphs are represented using the networkx library. The following semantic graph representations are currently integrated:

  • 4lang
  • UD (Universal Dependencies)
  • AMR (Abstract Meaning Representation)
  • SDP (Semantic Dependency Parsing)
  • UCCA (Universal Conceptual Cognitive Annotation)
  • DRS (Discourse Representation Structure)

Setup and Usage

Install the tuw-nlp package from pip:

pip install tuw-nlp

Or install from source:

pip install -e .

On Windows and Mac, you might also need to install Graphviz manually.

A few additional setup steps are needed before you can use the library:

Download nltk resources:

import nltk
nltk.download('stopwords')
nltk.download('propbank')

Download stanza models for UD parsing:

import stanza

stanza.download("en")
stanza.download("de")

4lang

The 4lang semantic graph representation is implemented in the repository. We use Interpreted Regular Tree Grammars (IRTGs) to build the graphs from UD trees; the grammar rules can be found in the lexicon module. English and German are supported.

To use the parser, download the alto parser and the tuw_nlp dictionaries:

import tuw_nlp

tuw_nlp.download_alto()
tuw_nlp.download_definitions()

Please also make sure Java is installed on your system; it is required to run the alto parser.

Then parsing a sentence is as simple as:

from tuw_nlp.grammar.text_to_4lang import TextTo4lang

tfl = TextTo4lang("en", "en_nlp_cache")

fl_graphs = list(tfl("brown dog", depth=1, substitute=False))

# Each element of fl_graphs is an object wrapping a networkx graph
fl_graphs[0].G.nodes(data=True)

# Visualize the graph
fl_graphs[0].to_dot()

UD

To parse Universal Dependencies into networkx format, we use the stanza library. You can use all the languages supported by stanza: https://stanfordnlp.github.io/stanza/models.html

For parsing, you can use the snippet above; just use the TextToUD class instead.
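
A minimal sketch of the expected usage, assuming TextToUD takes the same arguments as TextTo4lang above (the exact module path is an assumption):

from tuw_nlp.grammar.text_to_ud import TextToUD  # module path is an assumption

# assuming the same constructor and call pattern as TextTo4lang
ud = TextToUD("en", "en_nlp_cache")
ud_graphs = list(ud("brown dog"))
ud_graphs[0].G.nodes(data=True)  # the underlying networkx graph, as above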

AMR

For parsing Abstract Meaning Representation graphs we use the amrlib library. Models are only available for English.

If you want to use AMR parsing, install the amrlib package (this is also included in the setup file) and download the models:

pip install amrlib

Go to the amrlib repository and follow the instructions to download the models.

Then also download the spacy model for AMR parsing:

python -m spacy download en_core_web_sm

To parse AMR, see the TextToAMR class.
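
A minimal sketch mirroring the TextTo4lang example above (the module path and constructor arguments are assumptions; remember that AMR models are English-only):

from tuw_nlp.grammar.text_to_amr import TextToAMR  # module path is an assumption

amr = TextToAMR("en")  # constructor arguments are an assumption
amr_graphs = list(amr("The boy wants to go."))
amr_graphs[0].G.nodes(data=True)  # inspect the underlying networkx graph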

SDP

For Semantic Dependency Parsing (SDP) we integrated the semantic dependency parser from the SuPar library. Models are only available for English.

See the TextToSDP class for more information.
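
As above, a hedged sketch of the expected call pattern (module path and arguments are assumptions):

from tuw_nlp.grammar.text_to_sdp import TextToSDP  # module path is an assumption

sdp = TextToSDP("en")  # English is the only supported language
sdp_graphs = list(sdp("The dog chased the cat."))
sdp_graphs[0].G.edges(data=True)  # semantic dependencies as networkx edges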

UCCA

For parsing UCCA graphs we integrated the tupa parser; see our fork of the parser. Because of the complexity of the parser, we provide a docker image that contains the parser and all the necessary dependencies. You can use this image to parse UCCA graphs; for details, go to the services folder and follow the instructions there.

UCCA parsing currently supports English, French, German and Hebrew. The docker service exposes a REST API that you can use to parse UCCA graphs. To convert its output to networkx graphs, see the TextToUCCA class.
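
A hedged sketch of the intended workflow: send text to the dockerized REST service, then convert the response with the TextToUCCA class. The endpoint URL, port and payload below are assumptions; the services folder documents the actual interface:

import requests

# hypothetical endpoint and payload; see the services folder for the real API
resp = requests.post(
    "http://localhost:5000/parse",
    json={"text": "brown dog", "lang": "en"},
)
ucca_output = resp.json()
# ucca_output can then be converted to networkx graphs with the TextToUCCA class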

DRS

The task of Discourse Representation Structure (DRS) parsing is to convert text into formal meaning representations in the style of Discourse Representation Theory (DRT; Kamp and Reyle 1993).

To make DRS compatible with our library, we build on the paper titled Transparent Semantic Parsing with Universal Dependencies Using Graph Transformations, which first transforms DRS structures into graphs (DRGs) using a rule-based method developed with the GREW library.

Because of the complexity of the parser, we provide a docker image that contains the parser and all the necessary dependencies. You can use this image to parse DRS graphs; for details, go to the services folder and follow the instructions there. For parsing we use our own fork of the ud-boxer repository. It currently supports English, Italian, German and Dutch.

To convert the output of the REST API (from the docker service) to networkx graphs, see the TextToDRS class.
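
The DRS workflow mirrors the UCCA sketch above; again, the endpoint, port and payload are assumptions:

import requests

# hypothetical endpoint; the services folder documents the real API
resp = requests.post(
    "http://localhost:5001/parse",
    json={"text": "A dog barked.", "lang": "en"},
)
drs_output = resp.json()
# drs_output can then be converted to networkx graphs with the TextToDRS class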

For more examples, check our experiments Jupyter notebook.

Command line interface

We provide a simple script, scripts/semparse.py, that parses text into any of the supported formats. Usage:

usage: semparse.py [-h] [-f FORMAT] [-cd CACHE_DIR] [-cn NLP_CACHE] -l LANG [-d DEPTH] [-s SUBSTITUTE] [-p PREPROCESSOR] [-o OUT_DIR]

optional arguments:
  -h, --help            show this help message and exit
  -f FORMAT, --format FORMAT
  -cd CACHE_DIR, --cache-dir CACHE_DIR
  -cn NLP_CACHE, --nlp-cache NLP_CACHE
  -l LANG, --lang LANG
  -d DEPTH, --depth DEPTH
  -s SUBSTITUTE, --substitute SUBSTITUTE
  -p PREPROCESSOR, --preprocessor PREPROCESSOR
  -o OUT_DIR, --out-dir OUT_DIR

For example, to parse a sentence into a UCCA graph, run:

echo "A police statement did not name the man in the boot, but in effect indicated the traveler was State Secretary Samuli Virtanen, who is also the deputy to Foreign Minister Timo Soini." | python scripts/semparse.py -f ucca -l en -cn cache/nlp_cache_en.json

Services

We also provide services built on top of our package. To learn more, visit the services folder.

Text_to_4lang service

To run a browser-based demo (also available online) for building graphs from raw texts, first start the graph building service:

python services/text_to_4lang/backend/service.py

Then run the frontend with this command:

streamlit run services/text_to_4lang/frontend/demo.py

In the demo you can parse English and German sentences and also try out multiple algorithms our graphs implement, such as expand, substitute and append_zero_paths.
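
The expand and substitute options correspond to the depth and substitute parameters of the TextTo4lang call shown earlier, for example:

from tuw_nlp.grammar.text_to_4lang import TextTo4lang

tfl = TextTo4lang("en", "en_nlp_cache")

# depth and substitute are the parameters shown in the 4lang example above;
# the values here are illustrative (append_zero_paths is exposed via the demo UI)
fl_graphs = list(tfl("brown dog", depth=2, substitute=True))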

Modules

text

General text processing utilities, contains:

  • segmentation: stanza-based processors for word and sentence level segmentation
  • patterns: various patterns for text processing tasks

graph

Tools for working with graphs, contains:

  • utils: misc utilities for working with graphs

grammar

Tools for generating and using grammars, contains:

  • alto: tools for interfacing with the alto tool
  • irtg: class for representing Interpreted Regular Tree Grammars
  • lexicon: rule lexica for building lexicalized grammars
  • ud_fl: grammar-based mapping of Universal Dependencies to 4lang semantic graphs
  • utils: misc utilities for working with grammars

Contributing

We welcome all contributions! Please fork this repository and create a branch for your modifications. We suggest getting in touch with us first, by opening an issue or by writing an email to Gabor Recski or Adam Kovacs at firstname.lastname@tuwien.ac.at

Citing

If you use the library, please cite our paper:

@inproceedings{Recski:2021,
  title = {Explainable Rule Extraction via Semantic Graphs},
  author = {Recski, Gabor and Lellmann, Bj{\"o}rn and Kovacs, Adam and Hanbury, Allan},
  booktitle = {Proceedings of the Fifth Workshop on Automated Semantic Analysis of Information in Legal Text (ASAIL 2021)},
  publisher = {CEUR Workshop Proceedings},
  address = {São Paulo, Brazil},
  pages = {24--35},
  url = {http://ceur-ws.org/Vol-2888/paper3.pdf},
  year = {2021}
}

License

MIT license

Download files

Download the file for your platform.

Source Distribution

tuw-nlp-0.1.0.tar.gz (32.1 MB)

Uploaded Source

Built Distribution

tuw_nlp-0.1.0-py3-none-any.whl (32.5 MB)

Uploaded Python 3

File details

Details for the file tuw-nlp-0.1.0.tar.gz.

File metadata

  • Download URL: tuw-nlp-0.1.0.tar.gz
  • Upload date:
  • Size: 32.1 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.7.1 importlib_metadata/4.10.0 pkginfo/1.8.2 requests/2.24.0 requests-toolbelt/0.9.1 tqdm/4.64.0 CPython/3.9.12

File hashes

Hashes for tuw-nlp-0.1.0.tar.gz

  • SHA256: 9a4ad70754f8f00a0a91f38d02a3da964ce71170aa4fa813eae2c283eb5c4df4
  • MD5: 398af1e7fd453289e7743f35edab4a6e
  • BLAKE2b-256: db52892c6493ccdd28babffe2cf2faf25b4b92558fc0abbe0c179d7f52c6e1ed


File details

Details for the file tuw_nlp-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: tuw_nlp-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 32.5 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.7.1 importlib_metadata/4.10.0 pkginfo/1.8.2 requests/2.24.0 requests-toolbelt/0.9.1 tqdm/4.64.0 CPython/3.9.12

File hashes

Hashes for tuw_nlp-0.1.0-py3-none-any.whl

  • SHA256: 2650abbeb3a981191d456d609d2689439c068c81a8e1ed40c255536d0bcf2ab8
  • MD5: a1aea2e89f1298bc2b246453d6434c17
  • BLAKE2b-256: 99bd98379d705f1647bd17a1afcb72970584e091d5eb732efbcfcc0fe5fb0622

