eCoNLL: Extended CoNLL Utilities for Shallow Parsing
Shallow Parsing
Sequence Labeling and Classification
Classification is the problem of identifying to which of a set of categories (subpopulations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known.
Sequence Labeling is a type of pattern recognition task that involves the algorithmic assignment of a categorical label to each member of a sequence of observed values. It is a subclass of structured (output) learning, since we are predicting a sequence object rather than a discrete or real value predicted in classification problems.
Sequence Labeling and Shallow Parsing
Shallow Parsing is a kind of Sequence Labeling. The difference from a plain Sequence Labeling task, such as Part-of-Speech Tagging, where there is one output label (tag) per token, is that Shallow Parsing additionally performs chunking -- the segmentation of the input sequence into constituents. Chunking is required to identify the categories (or types) of multi-word expressions.
In other words, we want to capture the fact that an expression like "New York", which consists of 2 tokens, constitutes a single unit.
What this means in practice is that Shallow Parsing performs 2 tasks (jointly or in a pipeline):
- Segmentation of the input into constituents (spans)
- Classification (categorization, labeling) of these constituents into a predefined set of labels (types)
CoNLL Corpus Format
A corpus in CoNLL format consists of a series of sentences, separated by blank lines. Each sentence is encoded as a table (or "grid") of values, where each line corresponds to a single word and each column corresponds to an annotation type (such as various token-level features and labels).
The set of columns used by CoNLL-style files can vary from corpus to corpus.
Since a line in the data can correspond to any token (a word or not), it is referred to by the more general term token. Similarly, since the data can be composed of units larger or smaller than a sentence, a newline-separated unit is referred to as a block.
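Reading this format takes only a few lines of plain Python. The sketch below assumes tab-separated columns and blank-line block boundaries; real corpora may add comment lines or document markers such as -DOCSTART- that would need extra handling.

```python
def read_conll(text: str, separator: str = "\t") -> list[list[list[str]]]:
    """Split CoNLL-formatted text into blocks of token rows.

    Each block (sentence) is a list of rows; each row is the list of
    column values for one token.
    """
    blocks = []
    for block in text.strip().split("\n\n"):  # blank line = block boundary
        rows = [line.split(separator) for line in block.splitlines() if line.strip()]
        if rows:
            blocks.append(rows)
    return blocks


sample = "New\tB-LOC\nYork\tI-LOC\nis\tO\nnice\tO\n\nYes\tO"
blocks = read_conll(sample)  # 2 blocks: 4 tokens and 1 token
```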
Encoding Segmentation Information
IOB Scheme
This notation scheme is used to label multi-word spans in a token-per-line format, e.g. to encode that "New York" is a LOCATION concept that spans 2 tokens.
As such, a token-level tag consists of an affix that encodes segmentation information and a label that encodes type information. Consequently, the corpus tagset consists of all possible affix and label combinations. A segment encoded with affixes and assigned a label is referred to as a chunk.

Both prefix and suffix notations are common:
- prefix: B-LOC
- suffix: LOC-B
Meaning of Affixes
- I for Inside of a span
- O for Outside of a span (no prefix or suffix, just O)
- B for Beginning of a span
Alternative Schemes
- No affix: useful when there are no multi-word concepts
- IO: deficient without B, since adjacent chunks of the same type cannot be told apart
- IOB: see above
- IOBE: adds E for End of span (L in BILOU, for Last)
- IOBES: additionally adds S for Singleton (U in BILOU, for Unit)
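The relation between the schemes can be illustrated by converting prefix-style IOB tags to IOBES. This is a minimal sketch assuming well-formed input (every chunk starts with B).

```python
def iob_to_iobes(tags: list[str]) -> list[str]:
    """Rewrite IOB tags as IOBES by inspecting each chunk's last token.

    B becomes S, and I becomes E, when the next tag does not continue
    the current chunk.
    """
    out = []
    for i, tag in enumerate(tags):
        if tag == "O":
            out.append(tag)
            continue
        affix, label = tag.split("-", 1)
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        ends = nxt != f"I-{label}"  # chunk ends at this token
        if affix == "B":
            out.append(f"S-{label}" if ends else tag)
        else:  # affix == "I"
            out.append(f"E-{label}" if ends else tag)
    return out
```

For BILOU output the same logic applies with L in place of E and U in place of S.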
Evaluation
There are several methods to evaluate performance of shallow parsing models.
They can be evaluated at token-level and at chunk-level.
Token-Level Evaluation
The unit of evaluation in this case is the tag of a token, and what is evaluated is how accurately a model assigns tags to tokens. Consequently, the token (or tag) accuracy measures the fraction of correctly predicted tags. Since a tag consists of an affix-label pair, it is additionally possible to compute affix and label performances separately.
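Token-level accuracy, together with separate affix and label accuracies, can be sketched in plain Python; equivalent numbers can be obtained by running scikit-learn metrics on the flattened tag sequences. Prefix-style tags are assumed.

```python
def token_accuracies(gold: list[str], pred: list[str]) -> dict[str, float]:
    """Compute tag, affix, and label accuracy over aligned tag sequences."""
    def split(tag):  # "B-LOC" -> ("B", "LOC"); "O" -> ("O", "O")
        affix, _, label = tag.partition("-")
        return affix, label or "O"

    n = len(gold)
    tag_ok = sum(g == p for g, p in zip(gold, pred))
    affix_ok = sum(split(g)[0] == split(p)[0] for g, p in zip(gold, pred))
    label_ok = sum(split(g)[1] == split(p)[1] for g, p in zip(gold, pred))
    return {"tag": tag_ok / n, "affix": affix_ok / n, "label": label_ok / n}


gold = ["B-LOC", "I-LOC", "O", "B-PER"]
pred = ["B-LOC", "B-LOC", "O", "B-PER"]
scores = token_accuracies(gold, pred)  # tag 0.75, affix 0.75, label 1.0
```

Note how the second token is wrong at the tag and affix level but correct at the label level: the model found the right type but segmented it badly.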
Chunk-Level Evaluation
The unit of evaluation in this case is a chunk, and the evaluation is "joint" in the sense that it jointly evaluates segmentation and labeling. That is, a true positive is a predicted chunk that has both the correct label and the correct span. Similar to token-level evaluation, it is possible to evaluate segmentation independently of labeling. This is achieved by ignoring the chunk labels, e.g. by converting all of them to a single label.
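Chunk-level evaluation first decodes each tag sequence into (label, start, end) spans and then compares the two span sets. A minimal sketch for prefix-style IOB tags, assuming well-formed input:

```python
def chunks(tags: list[str]) -> set[tuple[str, int, int]]:
    """Decode IOB tags into a set of (label, start, end) spans, end exclusive."""
    spans, start, label = set(), None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel flushes the last chunk
        affix, _, lbl = tag.partition("-")
        inside = affix == "I" and lbl == label
        if start is not None and not inside:
            spans.add((label, start, i))
            start = None
        if affix == "B":
            start, label = i, lbl
    return spans


def chunk_f1(gold: list[str], pred: list[str]) -> float:
    """F1 over chunks: a true positive has correct label AND correct span."""
    g, p = chunks(gold), chunks(pred)
    tp = len(g & p)
    return 2 * tp / (len(g) + len(p)) if g or p else 0.0
```

Replacing every label with a single dummy label before calling chunk_f1 yields the segmentation-only score described above.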
Why eCoNLL?
Token-level evaluation is readily available from a number of packages, and can be easily computed using scikit-learn's classification_report, for instance. Chunk-level evaluation was originally provided by the conlleval perl script within the CoNLL Shared Tasks. However, one limitation of conlleval is that it does not support the IOBES or BILOU schemes. The conlleval script has been ported to Python numerous times, and these ports vary in functionality. One notable port is seqeval, which is also included in Hugging Face's evaluate package.
Installation
To install econll
run:
```
pip install econll
```
Usage
It is possible to run econll from the command-line, as well as to import its methods.
Command-Line Usage
```
usage: PROG [-h] -d DATA [-r REFS]
            [--separator SEPARATOR] [--boundary BOUNDARY] [--docstart DOCSTART]
            [--kind {prefix,suffix}] [--glue GLUE] [--otag OTAG]
            [-f {conll,parse,mdown}] [-o OUTS]
            [{eval,conv}]

eCoNLL: Extended CoNLL Utilities

positional arguments:
  {eval,conv}           task to perform

options:
  -h, --help            show this help message and exit

I/O Arguments:
  -d DATA, --data DATA  path to data/hypothesis file
  -r REFS, --refs REFS  path to references file

Data Format Arguments:
  --separator SEPARATOR
                        field separator string
  --boundary BOUNDARY   block separator string
  --docstart DOCSTART   doc start string

Tag Format Arguments:
  --kind {prefix,suffix}
                        tag order
  --glue GLUE           tag separator
  --otag OTAG           outside tag

Data Conversion Arguments:
  -f {conll,parse,mdown}, --form {conll,parse,mdown}
                        output format (kind)
  -o OUTS, --outs OUTS  path to output file
```
Evaluation
```
python -m econll -d DATA
python -m econll eval -d DATA
python -m econll eval -d DATA -r REFS
```
Conversion
```
python -m econll conv -d DATA -f FORMAT -o PATH
```
Versioning
This project adheres to Semantic Versioning.