TUW-NLP
NLP utilities developed at TUW Informatics.
The main goal of the library is to provide a unified interface for working with different semantic graph representations. We use the networkx library to represent graphs. The following semantic graph representations are currently integrated:
- 4lang
- UD (Universal Dependencies)
- AMR (Abstract Meaning Representation)
- SDP (Semantic Dependency Parsing)
- UCCA (Universal Conceptual Cognitive Annotation)
- DRS (Discourse Representation Structure)
Setup and Usage
Install the tuw-nlp package from pip:
pip install tuw-nlp
Or install from source:
pip install -e .
On Windows and Mac, you might also need to install Graphviz manually.
A few additional steps are also needed before you can use the library:
Download nltk resources:
import nltk
nltk.download('stopwords')
nltk.download('propbank')
Download stanza models for UD parsing:
import stanza
stanza.download("en")
stanza.download("de")
4lang
The 4lang semantic graph representation is implemented in the repository. We use Interpreted Regular Tree Grammars (IRTGs) to build the graphs from UD trees; the grammar can be found in the lexicon. English and German are supported.
To use the parser, download the alto parser and the tuw_nlp dictionaries:
import tuw_nlp
tuw_nlp.download_alto()
tuw_nlp.download_definitions()
Please also make sure Java is installed on your system; the alto parser requires it.
Then parsing a sentence is as simple as:
from tuw_nlp.grammar.text_to_4lang import TextTo4lang
tfl = TextTo4lang("en", "en_nlp_cache")
fl_graphs = list(tfl("brown dog", depth=1, substitute=False))
# fl_graphs is a list of graph objects; each wraps a networkx graph in its G attribute
fl_graphs[0].G.nodes(data=True)
# Visualize the graph
fl_graphs[0].to_dot()
UD
To parse Universal Dependencies into networkx format, we use the stanza library. All languages supported by stanza are available: https://stanfordnlp.github.io/stanza/models.html
For parsing, you can use the snippet above with the TextToUD class instead.
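A minimal sketch of what this looks like, assuming TextToUD mirrors the TextTo4lang interface (the import path and constructor arguments below are assumptions, not confirmed API):
from tuw_nlp.grammar.text_to_ud import TextToUD  # assumed module path
ud_parser = TextToUD("en", "en_nlp_cache")
ud_graphs = list(ud_parser("The quick brown fox jumps over the lazy dog."))
# As with 4lang, each result should wrap a networkx graph in its G attribute
ud_graphs[0].G.nodes(data=True)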
AMR
For parsing Abstract Meaning Representation graphs we use the amrlib library. Models are only available for English.
If you want to use AMR parsing, install the amrlib package (this is also included in the setup file) and download the models:
pip install amrlib
Go to the amrlib repository and follow the instructions to download the models.
Then also download the spacy model for AMR parsing:
python -m spacy download en_core_web_sm
To parse AMR, see the TextToAMR class.
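A minimal sketch, assuming TextToAMR follows the same calling convention as the other parsers (the import path and constructor arguments are assumptions):
from tuw_nlp.grammar.text_to_amr import TextToAMR  # assumed module path
amr_parser = TextToAMR("en")  # models are only available for English
amr_graphs = list(amr_parser("The boy wants to go."))
amr_graphs[0].G.nodes(data=True)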
SDP
For Semantic Dependency Parsing we integrated the Semantic Dependency Parser from the SuPar library. Models are only available for English.
See the TextToSDP class for more information.
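A minimal sketch, under the same caveats as above (the import path and constructor arguments are assumptions, not confirmed API):
from tuw_nlp.grammar.text_to_sdp import TextToSDP  # assumed module path
sdp_parser = TextToSDP("en")  # English-only models
sdp_graphs = list(sdp_parser("The cat sat on the mat."))
sdp_graphs[0].G.edges(data=True)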
UCCA
For parsing UCCA graphs we integrated the tupa parser; see our fork of the parser here. Because of the complexity of the parser, we provide a docker image that contains it along with all the necessary dependencies, which you can use to parse UCCA graphs. To learn more, go to the services folder and follow the instructions there.
UCCA parsing currently supports English, French, German and Hebrew. The docker service exposes a REST API that you can use to parse UCCA graphs. To convert the output to networkx graphs, see the TextToUCCA class.
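A rough sketch of the intended workflow; the import path, constructor arguments and service URL below are all assumptions, and the dockerized UCCA service must already be running (see the services folder):
from tuw_nlp.grammar.text_to_ucca import TextToUCCA  # assumed module path
# server_url is a hypothetical parameter pointing at the docker REST service
ucca_parser = TextToUCCA("en", server_url="http://localhost:5000")
ucca_graphs = list(ucca_parser("A police statement did not name the man."))
ucca_graphs[0].G.edges(data=True)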
DRS
The task of Discourse Representation Structure (DRS) parsing is to convert text into formal meaning representations in the style of Discourse Representation Theory (DRT; Kamp and Reyle 1993).
To make DRS compatible with our library, we build on the paper titled Transparent Semantic Parsing with Universal Dependencies Using Graph Transformations, which first transforms DRS structures into graphs (DRGs) using a rule-based method built on the GREW library.
Because of the complexity of the parser, we provide a docker image that contains the parser and all the necessary dependencies; to learn more, go to the services folder and follow the instructions there. For parsing we use our own fork of the ud-boxer repository. It currently supports English, Italian, German and Dutch.
To convert the output of the REST API (from the docker service) to networkx graphs, see the TextToDRS class.
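The workflow mirrors UCCA parsing; again, the import path, constructor arguments and service URL below are assumptions, and the dockerized DRS service must already be running (see the services folder):
from tuw_nlp.grammar.text_to_drs import TextToDRS  # assumed module path
drs_parser = TextToDRS("en", server_url="http://localhost:5001")  # hypothetical URL
drs_graphs = list(drs_parser("Tom is afraid of the dark."))
drs_graphs[0].G.nodes(data=True)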
For more examples, check our experiments Jupyter notebook.
Command line interface
We provide a simple script, scripts/semparse.py, to parse text into any of the supported formats. For usage:
usage: semparse.py [-h] [-f FORMAT] [-cd CACHE_DIR] [-cn NLP_CACHE] -l LANG [-d DEPTH] [-s SUBSTITUTE] [-p PREPROCESSOR] [-o OUT_DIR]
optional arguments:
-h, --help show this help message and exit
-f FORMAT, --format FORMAT
-cd CACHE_DIR, --cache-dir CACHE_DIR
-cn NLP_CACHE, --nlp-cache NLP_CACHE
-l LANG, --lang LANG
-d DEPTH, --depth DEPTH
-s SUBSTITUTE, --substitute SUBSTITUTE
-p PREPROCESSOR, --preprocessor PREPROCESSOR
-o OUT_DIR, --out-dir OUT_DIR
For example, to parse a sentence into a UCCA graph, run:
echo "A police statement did not name the man in the boot, but in effect indicated the traveler was State Secretary Samuli Virtanen, who is also the deputy to Foreign Minister Timo Soini." | python scripts/semparse.py -f ucca -l en -cn cache/nlp_cache_en.json
Services
We also provide services built on our package. To learn more, visit services.
Text_to_4lang service
To run a browser-based demo (also available online) for building graphs from raw texts, first start the graph building service:
python services/text_to_4lang/backend/service.py
Then run the frontend with this command:
streamlit run services/text_to_4lang/frontend/demo.py
In the demo you can parse English and German sentences, and you can also try out several algorithms implemented on our graphs, such as expand, substitute and append_zero_paths.
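The expand and substitute algorithms plausibly correspond to the depth and substitute parameters of TextTo4lang shown earlier; this mapping, like the comments below, is our assumption (append_zero_paths has no documented parameter here, so it is omitted):
from tuw_nlp.grammar.text_to_4lang import TextTo4lang
tfl = TextTo4lang("en", "en_nlp_cache")
# depth is assumed to control how many rounds of definition expansion are applied
expanded = list(tfl("brown dog", depth=2, substitute=False))
# substitute is assumed to replace defined nodes with their definitions instead
substituted = list(tfl("brown dog", depth=1, substitute=True))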
Modules
text
General text processing utilities, contains:
- segmentation: stanza-based processors for word and sentence level segmentation
- patterns: various patterns for text processing tasks
graph
Tools for working with graphs, contains:
- utils: misc utilities for working with graphs
grammar
Tools for generating and using grammars, contains:
- alto: tools for interfacing with the alto tool
- irtg: class for representing Interpreted Regular Tree Grammars
- lexicon: Rule lexica for building lexicalized grammars
- ud_fl: grammar-based mapping of Universal Dependencies to 4lang semantic graphs.
- utils: misc utilities for working with grammars
Contributing
We welcome all contributions! Please fork this repository and create a branch for your modifications. We suggest getting in touch with us first, by opening an issue or by writing an email to Gabor Recski or Adam Kovacs at firstname.lastname@tuwien.ac.at
Citing
If you use the library, please cite our paper:
@inproceedings{Recski:2021,
  title = {{Explainable Rule Extraction via Semantic Graphs}},
  author = {Recski, Gabor and Lellmann, Bj{\"o}rn and Kovacs, Adam and Hanbury, Allan},
  booktitle = {{Proceedings of the Fifth Workshop on Automated Semantic Analysis of Information in Legal Text (ASAIL 2021)}},
  publisher = {{CEUR Workshop Proceedings}},
  address = {São Paulo, Brazil},
  pages = {24--35},
  url = {http://ceur-ws.org/Vol-2888/paper3.pdf},
  year = {2021}
}
License
MIT license