Extended CoNLL Utilities for Shallow Parsing
eCoNLL: Extended CoNLL Utilities for Shallow Parsing
Shallow Parsing
Sequence Labeling and Classification
Classification is the problem of identifying to which of a set of categories (subpopulations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known.
Sequence Labeling is a type of pattern recognition task that involves the algorithmic assignment of a categorical label to each member of a sequence of observed values. It is a subclass of structured (output) learning, since we are predicting a sequence object rather than the single discrete or real value predicted in classification problems.
Sequence Labeling and Shallow Parsing
Shallow Parsing is a kind of Sequence Labeling. It differs from Sequence Labeling tasks such as Part-of-Speech Tagging, where there is one output label (tag) per token, in that Shallow Parsing additionally performs chunking: segmentation of the input sequence into constituents. Chunking is required to identify categories (or types) of multi-word expressions.
In other words, we want to capture the information that an expression like "New York", which consists of 2 tokens,
constitutes a single unit.
What this means in practice is that Shallow Parsing performs 2 tasks, jointly or separately:
- Segmentation of the input into constituents (spans)
- Classification (Categorization, Labeling) of these constituents into a predefined set of labels (types)
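The two tasks can be pictured with a small sketch. The sentence, the span representation (half-open token offsets), and the variable names are illustrative assumptions, not part of econll.

```python
tokens = ["I", "live", "in", "New", "York", "."]

# Task 1: segmentation, expressed here as half-open (start, end) token spans.
spans = [(3, 5)]

# Task 2: classification, one label per span.
labels = ["LOC"]

# Combine both: each constituent is a labeled multi-token unit.
chunks = [(label, " ".join(tokens[start:end]))
          for (start, end), label in zip(spans, labels)]
print(chunks)  # [('LOC', 'New York')]
```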
CoNLL Corpus Format
A corpus in CoNLL format consists of a series of sentences separated by blank lines. Each sentence is encoded as a table (or "grid") of values, where each line corresponds to a single word and each column corresponds to an annotation type (such as various token-level features and labels).
The set of columns used by CoNLL-style files can vary from corpus to corpus.
Since a line in the data can correspond to any token (word or not), it is referred to by the more general term token.
Similarly, since the data can be composed of units larger or smaller than a sentence,
a unit delimited by blank lines is referred to as a block.
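As a minimal sketch of this layout, the reader below splits CoNLL-style text into blocks of tokens, where each token is a list of column values. The function name `parse_conll`, its `separator` parameter, and the three-column sample are assumptions made for illustration; they are not econll's own reader.

```python
def parse_conll(text, separator=None):
    """Split CoNLL-style text into blocks; each block is a list of token rows."""
    blocks, block = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current block (if any).
            if block:
                blocks.append(block)
                block = []
            continue
        # separator=None splits on any run of whitespace.
        block.append(line.split(separator))
    if block:
        blocks.append(block)
    return blocks


sample = """\
I PRP O
live VBP O
in IN O
New NNP B-LOC
York NNP I-LOC

Hello UH O
"""

blocks = parse_conll(sample)
print(len(blocks))   # 2
print(blocks[0][3])  # ['New', 'NNP', 'B-LOC']
```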
Encoding Segmentation Information
IOB Scheme
The IOB notation scheme is used to label multi-word spans in a token-per-line format, e.g. to mark that New York is a LOCATION concept that spans 2 tokens.
As such, a token-level tag consists of an affix that encodes segmentation information
and a label that encodes type information.
Consequently, the corpus tagset consists of all possible affix and label combinations.
A segment encoded with affixes and assigned a label is referred to as a chunk.
Both prefix and suffix notations are common:
- prefix: B-LOC
- suffix: LOC-B
Meaning of Affixes
- I for Inside of span
- O for Outside of span (no prefix or suffix, just O)
- B for Beginning of span
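The affix and label can be separated with a few lines of string handling. The sketch below is an assumption about how such a split might look; the helper name `split_tag` is hypothetical, and its `kind`, `glue`, and `otag` parameters merely echo the command-line options of the same names rather than any documented econll function.

```python
def split_tag(tag, kind="prefix", glue="-", otag="O"):
    """Split a tag into (affix, label); the outside tag has no label."""
    if tag == otag:
        return otag, None
    if kind == "prefix":
        # 'B-LOC' -> ('B', 'LOC'); split once so labels may contain the glue.
        affix, label = tag.split(glue, 1)
    elif kind == "suffix":
        # 'LOC-B' -> ('B', 'LOC'); split from the right for the same reason.
        label, affix = tag.rsplit(glue, 1)
    else:
        raise ValueError(f"unknown kind: {kind!r}")
    return affix, label


print(split_tag("B-LOC"))                 # ('B', 'LOC')
print(split_tag("LOC-B", kind="suffix"))  # ('B', 'LOC')
print(split_tag("O"))                     # ('O', None)
```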
Alternative Schemes
- No affix (useful when there are no multi-word concepts)
- IO: deficient without B
- IOB: see above
- IOBE: E for End of span (L in BILOU for Last)
- IOBES: S for Singleton (U in BILOU for Unit)
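To make the relation between the schemes concrete, here is a hedged sketch that rewrites prefix-style IOB tags as IOBES. It assumes well-formed input in which every chunk starts with a B tag; the function name `iob_to_iobes` is hypothetical and is not taken from econll.

```python
def iob_to_iobes(tags):
    """Convert well-formed prefix IOB tags (e.g. 'B-LOC') to IOBES."""
    out = []
    for i, tag in enumerate(tags):
        if tag == "O":
            out.append(tag)
            continue
        affix, label = tag.split("-", 1)
        # The chunk continues only if the next tag is I- with the same label.
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = nxt == "I-" + label
        if affix == "B":
            out.append(("B-" if continues else "S-") + label)
        else:  # affix == "I"
            out.append(("I-" if continues else "E-") + label)
    return out


tags = ["O", "B-LOC", "I-LOC", "O", "B-PER"]
print(iob_to_iobes(tags))  # ['O', 'B-LOC', 'E-LOC', 'O', 'S-PER']
```

Replacing E with L and S with U in the output gives the corresponding BILOU tags.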
Evaluation
There are several methods to evaluate performance of shallow parsing models.
They can be evaluated at token-level and at chunk-level.
Token-Level Evaluation
The unit of evaluation in this case is a tag of a token,
and what is evaluated is how accurately a model assigns tags to tokens.
Consequently, token (or tag) accuracy measures the proportion of correctly predicted tags.
Since a tag consists of an affix-label pair,
it is additionally possible to compute affix and label performance separately.
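The three token-level scores can be sketched as follows. The helper names `split_tag` and `accuracy` and the toy tag sequences are assumptions for illustration; the tags use the prefix notation.

```python
def split_tag(tag):
    """Split a prefix tag like 'B-LOC' into (affix, label); 'O' gets an empty label."""
    affix, _, label = tag.partition("-")
    return affix, label


def accuracy(refs, hyps):
    """Fraction of positions where the reference and the hypothesis agree."""
    if len(refs) != len(hyps):
        raise ValueError("reference and hypothesis must have the same length")
    return sum(r == h for r, h in zip(refs, hyps)) / len(refs)


refs = ["O", "B-LOC", "I-LOC", "O"]
hyps = ["O", "B-LOC", "B-LOC", "O"]  # third token has a wrong affix, right label

tag_acc = accuracy(refs, hyps)
affix_acc = accuracy([split_tag(t)[0] for t in refs],
                     [split_tag(t)[0] for t in hyps])
label_acc = accuracy([split_tag(t)[1] for t in refs],
                     [split_tag(t)[1] for t in hyps])
print(tag_acc, affix_acc, label_acc)  # 0.75 0.75 1.0
```

In this example the only error is a segmentation error, which the separate affix and label scores make visible.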
Chunk-Level Evaluation
The unit of evaluation in this case is a chunk, and the evaluation is "joint"
in the sense that it jointly evaluates segmentation and labeling.
That is, a predicted chunk counts as correct only if both its label and its span are correct.
Similar to token-level evaluation, it is possible to evaluate segmentation independently of labeling.
This is achieved by ignoring the chunk label, e.g. by converting all labels to a single label.
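A minimal sketch of chunk-level scoring for a single block is given below, assuming well-formed prefix IOB tags. The helpers `extract_chunks` and `prf` are hypothetical names and do not reproduce conlleval or econll; a corpus-level score would also need the block index in each chunk.

```python
def extract_chunks(tags):
    """Return the set of (label, start, end) chunks in well-formed prefix IOB tags."""
    chunks, start, label = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last chunk
        affix, _, name = tag.partition("-")
        # An open chunk continues only on an I- tag with the same label.
        if start is not None and not (affix == "I" and name == label):
            chunks.add((label, start, i))
            start, label = None, None
        if affix == "B":
            start, label = i, name
    return chunks


def prf(ref_chunks, hyp_chunks):
    """Precision, recall and F1 over exactly matching chunks."""
    correct = len(ref_chunks & hyp_chunks)
    p = correct / len(hyp_chunks) if hyp_chunks else 0.0
    r = correct / len(ref_chunks) if ref_chunks else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


refs = extract_chunks(["O", "B-LOC", "I-LOC", "O", "B-PER"])
hyps = extract_chunks(["O", "B-ORG", "I-ORG", "O", "B-PER"])

# Joint evaluation: label and span must both match.
joint = prf(refs, hyps)

# Segmentation only: drop the label and compare spans.
spans_only = prf({(s, e) for _, s, e in refs}, {(s, e) for _, s, e in hyps})

print(joint)       # (0.5, 0.5, 0.5)
print(spans_only)  # (1.0, 1.0, 1.0)
```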
Why eCoNLL?
Token-level evaluation is readily available from a number of packages,
and can be easily computed using scikit-learn's classification_report, for instance.
Chunk-level evaluation was originally provided by
the conlleval Perl script distributed with the CoNLL Shared Tasks.
However, one limitation of conlleval is that it does not support the IOBES or BILOU schemes.
The conlleval script has been ported to Python numerous times, and these ports vary in functionality.
One notable port is seqeval,
which is also included in Hugging Face's evaluate package.
Installation
To install econll, run:
pip install econll
Usage
It is possible to run econll from the command line, as well as to import its methods.
Command-Line Usage
usage: PROG [-h] -d DATA [-r REFS]
[--separator SEPARATOR] [--boundary BOUNDARY] [--docstart DOCSTART]
[--kind {prefix,suffix}] [--glue GLUE] [--otag OTAG]
[-f {conll,parse,mdown}] [-o OUTS]
[{eval,conv}]
eCoNLL: Extended CoNLL Utilities
positional arguments:
{eval,conv} task to perform
options:
-h, --help show this help message and exit
I/O Arguments:
-d DATA, --data DATA path to data/hypothesis file
-r REFS, --refs REFS path to references file
Data Format Arguments:
--separator SEPARATOR
field separator string
--boundary BOUNDARY block separator string
--docstart DOCSTART doc start string
Tag Format Arguments:
--kind {prefix,suffix}
tag order
--glue GLUE tag separator
--otag OTAG outside tag
Data Conversion Arguments:
-f {conll,parse,mdown}, --form {conll,parse,mdown}
output format (kind)
-o OUTS, --outs OUTS path to output file
Evaluation
python -m econll -d DATA
python -m econll eval -d DATA
python -m econll eval -d DATA -r REFS
Conversion
python -m econll conv -d DATA -f FORMAT -o PATH
Versioning
This project adheres to Semantic Versioning.