Adapts amrlib in the Zensols framework.
AMR annotation and feature generation
Provides support for AMR graph manipulation, annotations and feature generation.
Features:
- Annotation in AMR metadata. For example, sentence types found in the Proxy report AMR corpus.
- AMR token alignment as spaCy components.
- Integrates natural language parsing and features with the Zensols zensols.nlparse library.
- A scoring API that includes Smatch and WLK, which extends a more general NLP scoring module.
- AMR parsing (amrlib) and AMR co-reference (amr_coref).
- Command line and API utilities for AMR Penman graphs, debugging and files.
- Tools for training and evaluating new AMR parse (text to graph) and generation (graph to text) models.
- A method for re-indexing and updating AMR graph variables so that every variable in a document collection is unique.
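To illustrate the last item, here is a minimal standalone sketch of variable re-indexing over plain `(source, role, target)` triples. It does not use the zensols.amr API: the function name `reindex_variables` and the numbering scheme (letter prefix plus a collection-wide counter) are assumptions made for illustration only.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]


def reindex_variables(graphs: List[List[Triple]]) -> List[List[Triple]]:
    """Rename variables so no variable name is shared between graphs.

    Hypothetical helper: a variable is any source of an ``:instance`` triple,
    and new names keep the letter prefix with a collection-wide counter.
    """
    counters: Dict[str, int] = {}  # last index handed out per letter prefix
    reindexed: List[List[Triple]] = []
    for triples in graphs:
        mapping: Dict[str, str] = {}
        for src, role, _ in triples:
            if role == ':instance' and src not in mapping:
                prefix = src.rstrip('0123456789') or 'x'
                counters[prefix] = counters.get(prefix, 0) + 1
                n = counters[prefix]
                mapping[src] = prefix if n == 1 else f'{prefix}{n}'
        reindexed.append([
            # concepts (targets of :instance triples) are never renamed
            (mapping.get(src, src), role,
             tgt if role == ':instance' else mapping.get(tgt, tgt))
            for src, role, tgt in triples])
    return reindexed


first = [('t', ':instance', 'test-01'), ('u', ':instance', 'utility'),
         ('t', ':ARG1', 'u')]
second = [('t', ':instance', 'try-01'), ('t2', ':instance', 'thing'),
          ('t', ':ARG1', 't2')]
for graph in reindex_variables([first, second]):
    print(graph)
```

The sketch treats any target that matches a variable name as a variable, so a constant attribute value that happens to equal a variable name would be renamed as well. A real implementation would need to tell the two apart.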
Documentation
Installing
The library can be installed with pip from the PyPI repository:
pip3 install zensols.amr
Installing the Gsii Model
Because the Gsii model download link expires, the model must be downloaded manually. To install it, do the following:
- Download the Gsii model (click "direct download").
- Move the file to the local directory.
- Install the file by forcing a test parse:
amr parse 'Test sentence.' --override \
  amr_parse_gsii_resource.url=file:model_parse_gsii-v0_1_0.tar.gz
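The last step can also be scripted. The sketch below only assembles the command shown above from a local archive path. The helper name `gsii_install_command` is hypothetical, and actually running the command assumes the `amr` program is on the path.

```python
import shlex
from pathlib import Path
from typing import List


def gsii_install_command(model_file: Path) -> List[str]:
    """Build the test-parse command that installs a local Gsii archive."""
    return [
        'amr', 'parse', 'Test sentence.',
        '--override',
        # "file:" points the resource at the archive in the local directory
        f'amr_parse_gsii_resource.url=file:{model_file.name}',
    ]


cmd = gsii_install_command(Path('model_parse_gsii-v0_1_0.tar.gz'))
print(shlex.join(cmd))
# To run it for real: subprocess.run(cmd, check=True)
```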
Usage
from penman.graph import Graph
from zensols.nlp import FeatureDocument, FeatureDocumentParser
from zensols.amr import AmrDocument, AmrSentence, Dumper, ApplicationFactory
sent: str = """
He was George Washington and first president of the United States.
He was born on February 22, 1732.
""".replace('\n', ' ').strip()
# get the AMR document parser
doc_parser: FeatureDocumentParser = ApplicationFactory.get_doc_parser()
# the parser creates a NLP centric feature document as provided in the
# zensols.nlp package
doc: FeatureDocument = doc_parser(sent)
# the AMR object graph data structure is provided in the feature document
amr_doc: AmrDocument = doc.amr
# dump a human readable output of the AMR document
amr_doc.write()
# get the first AMR sentence instance
amr_sent: AmrSentence = amr_doc.sents[0]
print('sentence:')
print(' ', amr_sent.text)
print('tuples:')
# show the Penman graph representation
pgraph: Graph = amr_sent.graph
print(f'variables: {", ".join(pgraph.variables())}')
for t in pgraph.triples:
    print(' ', t)
print('edges:')
for e in pgraph.edges():
    print(' ', e)
# visualize the graph as a PDF
dumper: Dumper = ApplicationFactory.get_dumper()
dumper(amr_doc)
Per the example, the t5.conf and gsii.conf configuration files show how to include the configuration needed for each AMR model. These files can also be used directly with the amr command using the --config option. However, the other resources in the example must be imported unless you redefine them yourself.
Library
When adding the amr spaCy pipeline component, the doc._.amr attribute is set on the Doc instance. You can either configure spaCy yourself, or you can use the configuration files in test-resources as an example using the zensols.util configuration framework. The command line application provides an example of how to do this, along with the test case.
Command Line
This library is written mostly to be used by other programs, but the command line utility amr is also available to demonstrate its usage and to generate AMR graphs on the command line.
To parse:
$ amr parse -c test-resources/t5.conf 'This is a test of the AMR command line utility.'
# ::snt This is a test of the AMR command line utility.
(t / test-01
   :ARG1 (u / utility
            :mod (c / command-line)
            :name (n / name
                     :op1 "AMR"
                     :toki1 "6")
            :toki1 "9")
   :domain (t2 / this
              :toki1 "0")
   :toki1 "3")
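The `:toki1` attributes in this output appear to line up with zero-based token positions in the sentence (for example `"3"` for `test` and `"6"` for `AMR`). As a minimal sketch of reading them back, the standalone snippet below scans the Penman text with a regular expression. It is not the library's alignment API: the function name `token_alignments` is hypothetical and the whitespace tokenization of the sentence is an assumption.

```python
import re
from typing import Dict, List

PENMAN = '''
(t / test-01
   :ARG1 (u / utility
            :mod (c / command-line)
            :name (n / name
                     :op1 "AMR"
                     :toki1 "6")
            :toki1 "9")
   :domain (t2 / this
              :toki1 "0")
   :toki1 "3")
'''

# matches a node opening "(var / concept", a closing ")" or a :toki1 attribute
SCAN = re.compile(
    r'\(\s*(?P<var>[^\s/()]+)\s*/\s*[^\s()]+'
    r'|(?P<close>\))'
    r'|:toki1\s+"(?P<index>\d+)"')


def token_alignments(penman_text: str) -> Dict[str, int]:
    """Map each variable to the token index of its ``:toki1`` attribute.

    Assumes string literals in the graph contain no parentheses.
    """
    stack: List[str] = []  # variables of the nodes currently open
    aligns: Dict[str, int] = {}
    for match in SCAN.finditer(penman_text):
        if match.group('var'):
            stack.append(match.group('var'))
        elif match.group('close'):
            stack.pop()
        else:
            # the attribute belongs to the innermost open node
            aligns[stack[-1]] = int(match.group('index'))
    return aligns


# assumed tokenization: whitespace split with the final period separated
tokens = 'This is a test of the AMR command line utility .'.split()
aligned = sorted(token_alignments(PENMAN).items(), key=lambda kv: kv[1])
for var, index in aligned:
    print(var, index, tokens[index])
```

The node `c / command-line` has no `:toki1` attribute in the output above, so it does not appear in the result.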
To generate graphs in PDF format:
$ amr plot -c test-resources/t5.conf 'This is a test of the AMR command line utility.'
wrote: amr-graph/this-is-a-test-of-the-amr-comm.pdf
Training
This package uses the amrlib training code, but adds a command line and a downloadable corpus aggregation API. To train:
- Choose a model (e.g. SPRING, T5).
- Optionally edit the train configuration directory of the model you choose.
- Optionally edit resources/train.yml to select or add more corpora (see Adding Corpora).
- Train the model:
./amr --config train-config/<model>.conf
Pretrained Models
This library was used to train all of the amrlib models (using the same checkpoints as amrlib), except the T5 Base v1 model, with additional examples from publicly available human annotated corpora. These models differ from the amrlib models as follows:
- None of the models were evaluated on a test set; only the development SMATCH scores are available. This was intentional, to provide more training examples.
- The AMR Release 3.0 (LDC2020T02) test set was added to the training set.
- The Little Prince and Bio AMR corpora were used to train the models. The first 85% of the AMR sentences were added to the training set and the remaining 15% were added to the development set.
- The mini-batch size changed for generate-t5wtense-base due to memory constraints.
- The number of training epochs was increased to account for the additional training examples.
- Models have the same naming conventions but are prefixed with zsl.
- Generative models were trained on graph metadata annotated by the scispaCy en_core_sci_md model.
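The 85%/15% split described in the list above can be sketched as follows. This is an illustration only, not the code used to build the released models: the function name `split_corpus` and the rounding rule (integer division at the cut point) are assumptions.

```python
from typing import List, Sequence, Tuple


def split_corpus(sents: Sequence[str],
                 train_percent: int = 85) -> Tuple[List[str], List[str]]:
    """Put the first ``train_percent`` percent in training, the rest in dev."""
    # integer arithmetic avoids floating point surprises at the boundary
    cut = len(sents) * train_percent // 100
    return list(sents[:cut]), list(sents[cut:])


corpus = [f'sentence {i}' for i in range(20)]
train, dev = split_corpus(corpus)
print(len(train), len(dev))
```

The split is positional rather than random, which mirrors the "first 85%" wording and keeps it reproducible without a seed.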
The performance of these models:
Model Name | Model Type | Checkpoint | Performance |
---|---|---|---|
zsl_spring | parse | facebook/bart-large | SMATCH: 81.26 |
zsl_xfm_bart_base | parse | facebook/bart-base | SMATCH: 80.5 |
zsl_xfm_bart_large | parse | facebook/bart-large | SMATCH: 82.7 |
zsl_t5wtense_base | generative | t5-base | BLEU: 42.20 |
zsl_t5wtense_large | generative | google/flan-t5-large | BLEU: 44.01 |
These models are available upon request.
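For readers unfamiliar with the metric in the table above, Smatch is an F1 score over graph triples under the best one-to-one variable mapping between two graphs. The sketch below is a brute-force illustration for tiny graphs only. It is neither the scoring API of this package nor the reference Smatch implementation (which uses a hill-climbing search and also scores the graph root), and the function name `triple_f1` is hypothetical.

```python
from itertools import permutations
from typing import List, Tuple

Triple = Tuple[str, str, str]


def _variables(triples: List[Triple]) -> List[str]:
    # a variable is the source of an ":instance" triple
    return sorted({src for src, role, _ in triples if role == ':instance'})


def triple_f1(gold: List[Triple], pred: List[Triple]) -> float:
    """Best triple-overlap F1 over all one-to-one variable mappings.

    Assumes neither graph contains duplicate triples.
    """
    if not gold or not pred:
        return 0.0
    # map the graph with fewer variables into the other one (F1 is symmetric)
    small, large = sorted((gold, pred), key=lambda g: len(_variables(g)))
    small_vars, large_vars = _variables(small), _variables(large)
    large_set = set(large)
    best = 0
    for perm in permutations(large_vars, len(small_vars)):
        mapping = dict(zip(small_vars, perm))
        mapped = {
            (mapping.get(src, src), role,
             tgt if role == ':instance' else mapping.get(tgt, tgt))
            for src, role, tgt in small}
        best = max(best, len(mapped & large_set))
    # 2PR / (P + R) with P = best / |pred| and R = best / |gold|
    return 2 * best / (len(gold) + len(pred))


gold = [('t', ':instance', 'test-01'), ('u', ':instance', 'utility'),
        ('t', ':ARG1', 'u')]
pred = [('a', ':instance', 'test-01'), ('b', ':instance', 'utility'),
        ('a', ':ARG0', 'b')]
renamed = [('x', ':instance', 'test-01'), ('y', ':instance', 'utility'),
           ('x', ':ARG1', 'y')]
print(round(triple_f1(gold, pred), 3))
print(triple_f1(gold, renamed))
```

Here `pred` matches both concepts but gets the role wrong, so two of three triples match, while `renamed` differs from `gold` only in variable names and scores a perfect match. The table reports the same kind of score scaled to 0-100.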
Adding Corpora
You can retrain your own model and add additional training corpora by modifying the list of ${amr_prep_manager:preppers} in resources/train.yml. This file defines the downloaded Little Prince and Bio AMR corpora. To use the AMR 3.0 release, add the file downloaded from the LDC to a (new) download directory.
Attribution
This project, or reference model code, uses:
- Python 3.11
- amrlib for AMR parsing.
- amr_coref for AMR co-reference.
- spaCy for natural language parsing.
- zensols.nlparse for natural language features.
- Smatch (Cai and Knight, 2013) and WLK (Opitz et al., 2021) for scoring.
Citation
If you use this project in your research, please use the following BibTeX entry:
@inproceedings{landes-etal-2023-deepzensols,
title = "{D}eep{Z}ensols: A Deep Learning Natural Language Processing Framework for Experimentation and Reproducibility",
author = "Landes, Paul and
Di Eugenio, Barbara and
Caragea, Cornelia",
editor = "Tan, Liling and
Milajevs, Dmitrijs and
Chauhan, Geeticka and
Gwinnup, Jeremy and
Rippeth, Elijah",
booktitle = "Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023)",
month = dec,
year = "2023",
address = "Singapore, Singapore",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.nlposs-1.16",
pages = "141--146"
}
Changelog
An extensive changelog is available here.
Community
Please star this repository and let me know how and where you use this API. Contributions as pull requests, feedback and any other input are welcome.
License
Copyright (c) 2021 - 2024 Paul Landes