The ERRor ANnotation Toolkit (ERRANT). Automatically extract and classify edits in parallel sentences.

These details have not been verified by PyPI

Project description

VNERRANT v1.0.0

Overview

The main aim of VNERRANT is to automatically annotate parallel English sentences with error type information. Specifically, given an original and corrected sentence pair, VNERRANT will extract the edits that transform the former to the latter and classify them according to a rule-based error type framework. This can be used to standardise parallel datasets or facilitate detailed error type evaluation. Annotated output files are in M2 format and an evaluation script is provided.

Example

Original: This are gramamtical sentence .
Corrected: This is a grammatical sentence .
Output M2:

S This are gramamtical sentence .
A 1 2|||R:VERB:SVA|||is|||REQUIRED|||-NONE-|||0
A 2 2|||M:DET|||a|||REQUIRED|||-NONE-|||0
A 2 3|||R:SPELL|||grammatical|||REQUIRED|||-NONE-|||0
A -1 -1|||noop|||-NONE-|||REQUIRED|||-NONE-|||1
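
Each A line above packs its fields between ||| separators: the token span, the error type, the correction, two fields that are constant in this example, and the annotator id. A minimal sketch of reading such a block back into Python is shown here. The names M2Edit and parse_m2_block are hypothetical helpers written for illustration and are not part of VNERRANT.

```python
from typing import NamedTuple


class M2Edit(NamedTuple):
    start: int        # token offset where the edit starts in the original
    end: int          # token offset where the edit ends in the original
    error_type: str   # e.g. "R:VERB:SVA"
    correction: str   # replacement text
    annotator: int    # annotator id (last field of the A line)


def parse_m2_block(block: str):
    """Split one M2 block into its source tokens and a list of edits."""
    tokens, edits = [], []
    for line in block.strip().splitlines():
        if line.startswith("S "):
            tokens = line[2:].split()
        elif line.startswith("A "):
            fields = line[2:].split("|||")
            start, end = (int(x) for x in fields[0].split())
            edits.append(M2Edit(start, end, fields[1], fields[2], int(fields[-1])))
    return tokens, edits


example = """\
S This are gramamtical sentence .
A 1 2|||R:VERB:SVA|||is|||REQUIRED|||-NONE-|||0
A 2 2|||M:DET|||a|||REQUIRED|||-NONE-|||0
A 2 3|||R:SPELL|||grammatical|||REQUIRED|||-NONE-|||0
A -1 -1|||noop|||-NONE-|||REQUIRED|||-NONE-|||1
"""

tokens, edits = parse_m2_block(example)
for e in edits:
    # An empty token slice means an insertion (or a noop for span -1 -1).
    print(e.annotator, e.error_type, tokens[e.start:e.end], "->", e.correction)
```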

Installation

Pip Install

conda create -n vnerrant python=3.9
conda activate vnerrant

You have two options for installing VNERRANT:

Option 1: Install VNERRANT using pip with the following commands:

pip install -U pip setuptools wheel
pip install vnerrant

Option 2: Alternatively, if you want to install VNERRANT from the source, you can follow these steps:

git clone https://gitlab.testsprep.online/nlp/research/vnerrant
cd vnerrant
pip install -U pip setuptools wheel
pip install -e .

Then download a spaCy model with the following command:

python -m spacy download en_core_web_sm

You can verify the available models at this location.

Usage

CLI

Two main commands are provided with VNERRANT: convert and evaluate. You can run them from anywhere on the command line without having to invoke a specific python script.

1. vnerrant convert parallel-to-m2

This is the main annotation command that takes an original text file and at least one parallel corrected text file as input, and outputs an annotated M2 file. By default, it is assumed that the original and corrected text files are word tokenised with one sentence per line. Example:

vnerrant convert parallel-to-m2 -o <orig_file> -c <cor_file1> [<cor_file2> ...] -out <out_m2>
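
Because the command expects one sentence per line, it can help to confirm that the original and corrected files have the same number of lines before annotating. The following is a small sketch of such a check using only the standard library. The function check_parallel and the file names are hypothetical, and the check assumes that line i of each corrected file corresponds to line i of the original file.

```python
import tempfile
from pathlib import Path


def check_parallel(orig_path, cor_paths):
    """Return the sentence count if every corrected file matches the original, else raise."""
    n_orig = len(Path(orig_path).read_text(encoding="utf-8").splitlines())
    for cor_path in cor_paths:
        n_cor = len(Path(cor_path).read_text(encoding="utf-8").splitlines())
        if n_cor != n_orig:
            raise ValueError(f"{cor_path}: {n_cor} lines, expected {n_orig}")
    return n_orig


# Demo with throwaway files: word-tokenised text, one sentence per line.
with tempfile.TemporaryDirectory() as tmp:
    orig = Path(tmp, "orig.txt")
    cor = Path(tmp, "cor.txt")
    orig.write_text("This are gramamtical sentence .\n", encoding="utf-8")
    cor.write_text("This is a grammatical sentence .\n", encoding="utf-8")
    n_sentences = check_parallel(orig, [cor])
    print(n_sentences)
```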

2. vnerrant convert m2-to-m2

This is a variant of parallel-to-m2 that operates on an M2 file instead of parallel text files. This makes it easier to reprocess existing M2 files. You must also specify whether you want to use gold or auto edits; i.e. -gold will only classify the existing edits, while -auto will extract and classify automatic edits. In both settings, uncorrected edits and noops are preserved. Example:

vnerrant convert m2-to-m2 -i <in_m2> -o <out_m2> {-auto|-gold}

3. vnerrant evaluate m2

This is the evaluation command that compares a hypothesis M2 file against a reference M2 file. The default behaviour evaluates the hypothesis overall in terms of span-based correction. The -cat {1,2,3} flag can be used to evaluate error types at increasing levels of granularity, while the -ds or -dt flag can be used to evaluate in terms of span-based or token-based detection (i.e. ignoring the correction). All scores are presented in terms of Precision, Recall and F-score (default: F0.5), and counts for True Positives (TP), False Positives (FP) and False Negatives (FN) are also shown. Examples:

vnerrant evaluate m2 -hyp <hyp_m2> -ref <ref_m2>
vnerrant evaluate m2 -hyp <hyp_m2> -ref <ref_m2> -cat {1,2,3}
vnerrant evaluate m2 -hyp <hyp_m2> -ref <ref_m2> -ds
vnerrant evaluate m2 -hyp <hyp_m2> -ref <ref_m2> -ds -cat {1,2,3}

All these scripts also have additional advanced command line options which can be displayed using the -h flag.
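
To make the reported scores concrete, here is a sketch of how Precision, Recall and F0.5 follow from the TP, FP and FN counts, using the standard definitions. The function name and the counts are made up for illustration, and scoring an empty denominator as 0.0 is an assumption; the evaluation script may handle that edge case differently.

```python
def precision_recall_f(tp: int, fp: int, fn: int, beta: float = 0.5):
    """Compute precision, recall and F-beta from raw counts."""
    # Assumption: an empty denominator scores 0.0.
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    denom = beta**2 * p + r
    f = (1 + beta**2) * p * r / denom if denom else 0.0
    return p, r, f


# Illustrative counts only, not results from any real system.
p, r, f = precision_recall_f(tp=6, fp=2, fn=4)
print(f"P={p:.4f} R={r:.4f} F0.5={f:.4f}")
```

With beta set to 0.5, precision is weighted more heavily than recall, which is why the score in this example lands closer to the precision than to the recall.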

API

VNERRANT also comes with a Python API.

Quick Start

import vnerrant

annotator = vnerrant.load('en')

orig = 'My    name    is   the     John'
cor = 'My name is John'
edits = annotator.annotate_raw(orig, cor)

for e in edits:
    print(e.original.start_token, e.original.end_token, e.original.text)
    print(e.corrected.start_token, e.corrected.end_token, e.corrected.text)
    print(e.original.start_char, e.original.end_char, e.edit_type)

Loading

vnerrant.load(lang, model_name)

Instantiate a VNERRANT Annotator object. Currently, the lang parameter only accepts 'en' for English, though support for more languages is planned. The model_name parameter is the name of the spaCy model to use. Optionally, pass the nlp parameter if you have already loaded spaCy and want to prevent VNERRANT from loading it again.

Annotator Objects

An Annotator object is the main interface for VNERRANT.

Methods

annotator.parse

annotator.parse(string, tokenize_type='string')

Lemmatise, POS tag, and parse a text string with spacy. Returns a spacy Doc object.

tokenize_type must be one of "spacy", "split" or "string":

spacy: tokenise with the default spaCy tokenizer.
split: tokenise with the split function.
string: tokenise with spaCy and a string tokenizer.

annotator.align

annotator.align(orig, cor, lev=False)

Align spacy-parsed original and corrected text. The default uses a linguistically-enhanced Damerau-Levenshtein alignment, but the lev flag can be used for a standard Levenshtein alignment. Returns an Alignment object.
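
For intuition, the sketch below shows a plain token-level Levenshtein alignment with unit costs, which is the kind of alignment the lev flag refers to. It is not the linguistically-enhanced Damerau-Levenshtein variant and it is not VNERRANT's own code. The function name and the M/S/D/I operation letters (match, substitution, deletion, insertion) are assumptions chosen for illustration.

```python
def levenshtein_align(orig, cor):
    """Return (cost, ops) for two token lists; ops is a string over M, S, D, I."""
    n, m = len(orig), len(cor)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    op = [[""] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0], op[i][0] = i, "D"
    for j in range(1, m + 1):
        cost[0][j], op[0][j] = j, "I"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if orig[i - 1] == cor[j - 1]:
                cost[i][j], op[i][j] = cost[i - 1][j - 1], "M"
            else:
                candidates = [
                    (cost[i - 1][j - 1] + 1, "S"),
                    (cost[i - 1][j] + 1, "D"),
                    (cost[i][j - 1] + 1, "I"),
                ]
                # On ties, the first candidate in the list wins.
                cost[i][j], op[i][j] = min(candidates, key=lambda c: c[0])
    # Walk back from the bottom-right cell to recover one cheapest alignment.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        o = op[i][j]
        ops.append(o)
        if o in ("M", "S"):
            i, j = i - 1, j - 1
        elif o == "D":
            i -= 1
        else:
            j -= 1
    return cost[n][m], "".join(reversed(ops))


cost, ops = levenshtein_align("My name is the John".split(), "My name is John".split())
print(cost, ops)
```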

annotator.merge

annotator.merge(alignment, merging='rules')

Extract edits from the optimum alignment in an Alignment object. Four different merging strategies are available:

rules: Use a rule-based merging strategy (default)
all-split: Merge nothing: MSSDI -> M, S, S, D, I
all-merge: Merge adjacent non-matches: MSSDI -> M, SSDI
all-equal: Merge adjacent same-type non-matches: MSSDI -> M, SS, D, I

Returns a list of Edit objects.
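
The three non-rule strategies can be illustrated on a bare operation string such as MSSDI. The sketch below only groups operation letters and is not how VNERRANT builds Edit objects; the rule-based strategy is not reproduced. The function merge_ops is a hypothetical helper, and keeping each match as its own group is an assumption.

```python
def merge_ops(ops: str, merging: str):
    """Group an operation string according to a simple merging strategy."""
    if merging not in ("all-split", "all-merge", "all-equal"):
        raise ValueError(f"unsupported merging strategy: {merging}")
    groups = []
    for op in ops:
        last = groups[-1][-1:] if groups else ""
        if merging == "all-merge":
            # Join any non-match onto a preceding run of non-matches.
            join = op != "M" and last not in ("", "M")
        elif merging == "all-equal":
            # Join a non-match only onto a run of the same operation.
            join = op != "M" and last == op
        else:
            join = False  # all-split: every operation stands alone
        if join:
            groups[-1] += op
        else:
            groups.append(op)
    return groups


for strategy in ("all-split", "all-merge", "all-equal"):
    print(strategy, merge_ops("MSSDI", strategy))
```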

annotator.classify

annotator.classify(edit)

Classify an edit. Sets the edit.type attribute in an Edit object and returns the same Edit object.

annotator.annotate

annotator.annotate(orig, cor, lev=False, merging='rules')

Run the full annotation pipeline to align two sequences and extract and classify the edits. Equivalent to running annotator.align, annotator.merge and annotator.classify in sequence. Returns a list of Edit objects.

import vnerrant

annotator = vnerrant.load(lang="en", model_name="en_core_web_sm")
orig = annotator.parse("My   name   is    the    John")
cor = annotator.parse("My name is John")
edits = annotator.annotate(orig, cor)
for e in edits:
    print(e)

annotator.annotate_raw

annotator.annotate_raw(orig: str, cor: str, lev=False, merging='rules', tokenize_type='string')

Run the full annotation pipeline to align two strings, extract and classify the edits. Equivalent to running annotator.parse, annotator.align, annotator.merge and annotator.classify in sequence. Returns a list of Edit objects.

import vnerrant

annotator = vnerrant.load(lang="en", model_name="en_core_web_sm")
orig = "My   name   is    the    John"
cor = "My name is John"
edits = annotator.annotate_raw(orig, cor)
for e in edits:
    print(e)

annotator.import_edit

annotator.import_edit(orig, cor, edit, min=True, old_cat=False)

Load an Edit object from a list. orig and cor must be spacy-parsed Doc objects and the edit must be of the form: [o_start, o_end, c_start, c_end(, type)]. The values must be integers that correspond to the token start and end offsets in the original and corrected Doc objects. The type value is an optional string that denotes the error type of the edit (if known). Set min to True to minimise the edit (e.g. [a b -> a c] = [b -> c]) and old_cat to True to preserve the old error type category (i.e. turn off the classifier).

import vnerrant

annotator = vnerrant.load('en')
orig = annotator.parse('This are gramamtical sentence .')
cor = annotator.parse('This is a grammatical sentence .')
edit = [1, 2, 1, 2, 'SVA'] # are -> is
edit = annotator.import_edit(orig, cor, edit)
print(edit.to_m2())
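
The effect of min=True, where [a b -> a c] becomes [b -> c], can be sketched on plain token lists. The function below is a hypothetical illustration and not the library's implementation. It assumes tokens are compared by exact string equality and that shared tokens are trimmed from both ends of the edit.

```python
def minimise(o_toks, c_toks, o_start=0, c_start=0):
    """Trim tokens shared at the start and end of an edit and adjust the offsets."""
    o_toks, c_toks = list(o_toks), list(c_toks)
    o_end, c_end = o_start + len(o_toks), c_start + len(c_toks)
    # Drop a common prefix.
    while o_toks and c_toks and o_toks[0] == c_toks[0]:
        o_toks, c_toks = o_toks[1:], c_toks[1:]
        o_start += 1
        c_start += 1
    # Drop a common suffix.
    while o_toks and c_toks and o_toks[-1] == c_toks[-1]:
        o_toks, c_toks = o_toks[:-1], c_toks[:-1]
        o_end -= 1
        c_end -= 1
    return (o_start, o_end, o_toks), (c_start, c_end, c_toks)


orig_side, cor_side = minimise(["a", "b"], ["a", "c"])
print(orig_side, cor_side)
```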

Alignment Objects

An Alignment object is created from two spacy-parsed text sequences.

Attributes

alignment.orig
alignment.cor
The spacy-parsed original and corrected text sequences.

alignment.cost_matrix
alignment.op_matrix
The cost matrix and operation matrix produced by the alignment.

alignment.align_seq
The first cheapest alignment between the two sequences.

Edit Objects

An Edit object represents a transformation between two text sequences.

Attributes

edit.o_start
edit.o_end
edit.o_toks
edit.o_str
The start and end offsets, the spacy tokens, and the string for the edit in the original text.

edit.c_start
edit.c_end
edit.c_toks
edit.c_str
The start and end offsets, the spacy tokens, and the string for the edit in the corrected text.

edit.type
The error type string.

Method

edit.to_m2(id=0)
Format the edit for an output M2 file. id is the annotator id.
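
As a rough illustration of the target format, the sketch below assembles an M2 annotation line from plain values, following the field layout of the Example section near the top of this page. The function format_m2 and its parameters are hypothetical, and the constant REQUIRED and -NONE- fields are copied from that example rather than from the library source.

```python
def format_m2(o_start, o_end, error_type, c_str, annotator_id=0):
    """Build one M2 'A' line from an edit's span, type, correction and annotator id."""
    fields = [
        f"A {o_start} {o_end}",  # token span in the original sentence
        error_type,
        c_str,
        "REQUIRED",              # constant in the example output
        "-NONE-",                # constant in the example output
        str(annotator_id),
    ]
    return "|||".join(fields)


line = format_m2(1, 2, "R:VERB:SVA", "is")
print(line)
```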

Project details

Release history

2.0.0: Oct 23, 2024
1.0.1rc13 (pre-release): Sep 5, 2024
1.0.1rc12 (pre-release): Sep 5, 2024
1.0.1rc11 (pre-release): Jul 19, 2024
1.0.1rc10 (pre-release): Jul 1, 2024
1.0.1rc9 (pre-release): Jun 30, 2024
1.0.1rc8 (pre-release): Jun 17, 2024
1.0.1rc7 (pre-release): May 22, 2024
1.0.1rc6 (pre-release): May 22, 2024
1.0.1rc5 (pre-release): May 22, 2024
1.0.1rc4 (pre-release, this version): May 16, 2024
1.0.1rc3 (pre-release): May 12, 2024
1.0.1rc2 (pre-release): May 12, 2024
1.0.1rc1 (pre-release): Apr 28, 2024
1.0.0: Feb 13, 2024

Download files

Source Distribution

vnerrant-1.0.1rc4.tar.gz (516.7 kB)

Uploaded May 16, 2024 Source

Built Distribution

vnerrant-1.0.1rc4-py3-none-any.whl (519.5 kB)

Uploaded May 16, 2024 Python 3

File details

Details for the file vnerrant-1.0.1rc4.tar.gz.

File metadata

Download URL: vnerrant-1.0.1rc4.tar.gz
Upload date: May 16, 2024
Size: 516.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.9.13

File hashes

Hashes for vnerrant-1.0.1rc4.tar.gz
Algorithm	Hash digest
SHA256	`65e4f9ef868edbf426c88a2356ee042729fe13902fce004363b3df798c37217d`
MD5	`c498e49590e131ffb8615d6a5bbaebdb`
BLAKE2b-256	`41ce499c9162aa1297588004bae01a0c9c9d92175353a7bcf2bd20ebfa227cd3`

File details

Details for the file vnerrant-1.0.1rc4-py3-none-any.whl.

File metadata

Download URL: vnerrant-1.0.1rc4-py3-none-any.whl
Upload date: May 16, 2024
Size: 519.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.9.13

File hashes

Hashes for vnerrant-1.0.1rc4-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a8eba07b247c79df4c4df2a817fe978cf826c2381af2d2f20b6217b8c3f2e188`
MD5	`ceb8fb6accedd3cd8bc9dc04fc5e0327`
BLAKE2b-256	`5450f2a2af0559a8a3b222d8f8c7651400138c3ea7ecc7178db4ded71aa05415`