NER error analysis for column-format (CoNLL) datasets, including CoNLL-2003, WNUT-2017, ...
NER Error Analyzer
Quick Start
from nlu.error import *
from nlu.parser import *
cols_format = [{'type': 'predict', 'col_num': 1, 'tagger': 'ner'},
               {'type': 'gold', 'col_num': 2, 'tagger': 'ner'}]
parser = ConllParser('testb.pred.gold', cols_format)
parser.obtain_statistics(entity_stat=True, source='predict')
parser.obtain_statistics(entity_stat=True, source='gold')
parser.set_entity_mentions()
NERErrorAnnotator.annotate(parser)
parser.print_corrects()
parser.print_all_errors()
parser.error_overall_stats()
See the Input File Format section below for the expected input format.
Usage
import
from nlu.error import *
from nlu.parser import *
First create a ConllParser instance from the input file path, and specify which column holds which tags in cols_format:
ConllParser(filepath, cols_format)
cols_format = [{'type': 'predict', 'col_num': 1, 'tagger': 'ner'},
               {'type': 'gold', 'col_num': 2, 'tagger': 'ner'}]
parser = ConllParser('testb.pred.gold', cols_format)
Obtain the basic statistics with the obtain_statistics() method:
parser.obtain_statistics(entity_stat=True, source='predict')
parser.obtain_statistics(entity_stat=True, source='gold')
To annotate NER errors in the documents held by the ConllParser (the Quick Start calls parser.set_entity_mentions() before this step):
NERErrorAnnotator.annotate(parser)
To print all correct predictions or all errors, use
parser.print_corrects() or
parser.print_all_errors()
or call parser.error_overall_stats() to get the overall error statistics.
Input File Format
ConllParser expects the column format used by CoNLL-2003.
For example,
Natural I-ORG O
Language I-ORG O
Laboratory I-ORG I-ORG
...
where the first column is the token text, and the second and third columns are the predicted and gold (ground-truth) tags, respectively. The column order can be set with the cols_format argument when instantiating ConllParser:
cols_format = [{'type': 'predict', 'col_num': 1, 'tagger': 'ner'},
               {'type': 'gold', 'col_num': 2, 'tagger': 'ner'}]  # col_num starts from 0
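To make the column indexing concrete, here is a minimal sketch of how a cols_format list could map onto the columns of each line. It does not use or reproduce ConllParser: read_columns is a hypothetical helper, and splitting on whitespace with blank lines as sentence separators is an assumption based on the CoNLL-2003 convention.

```python
def read_columns(lines, cols_format):
    """Split column-format lines into sentences of (token, {type: tag}) pairs.

    Hypothetical helper for illustration only; not part of ConllParser.
    """
    sentences, current = [], []
    for line in lines:
        fields = line.split()
        if not fields:  # blank line: assumed to be a sentence boundary
            if current:
                sentences.append(current)
                current = []
            continue
        # Column 0 is the token; tag columns are looked up via col_num (0-based).
        tags = {col['type']: fields[col['col_num']] for col in cols_format}
        current.append((fields[0], tags))
    if current:  # flush the last sentence if the input has no trailing blank line
        sentences.append(current)
    return sentences


cols_format = [{'type': 'predict', 'col_num': 1, 'tagger': 'ner'},
               {'type': 'gold', 'col_num': 2, 'tagger': 'ner'}]
sample = ["Natural I-ORG O", "Language I-ORG O", "Laboratory I-ORG I-ORG", ""]
sentences = read_columns(sample, cols_format)
print(sentences[0][0])
```

Swapping the two col_num values is all it takes to read a file whose gold tags come before the predicted tags.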
To extract the x-th column of a file, I recommend the shell command awk '{print $x}' filepath; for example, awk '{print $4}' filepath prints the 4th column.
Use paste file1.txt file2.txt to join two files side by side, line by line.
For example,
awk '{print $4}' eng.train > ner_tags_file # $num starts from 1
paste ner_pred_tags_file ner_tags_file
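If shell tools are not available, the same two steps can be sketched in Python. The helpers nth_column and paste below are hypothetical names chosen to mirror the shell commands, and the sample rows are only illustrative lines in the four-column CoNLL-2003 layout.

```python
def nth_column(lines, n):
    """Return the n-th whitespace-separated field of each line (n starts from 1, as in awk).

    Lines with fewer than n fields, such as blank sentence separators, yield ''.
    """
    out = []
    for line in lines:
        fields = line.split()
        out.append(fields[n - 1] if len(fields) >= n else '')
    return out


def paste(*columns):
    """Join parallel columns line by line with tabs, like the paste command.

    Note: zip stops at the shortest column, whereas paste pads missing lines.
    """
    return ['\t'.join(row) for row in zip(*columns)]


gold_lines = ["EU NNP I-NP I-ORG", "rejects VBZ I-VP O", ""]
pred_tags = ["I-ORG", "O", ""]

tokens = nth_column(gold_lines, 1)
gold_tags = nth_column(gold_lines, 4)
merged = paste(tokens, pred_tags, gold_tags)  # token, predicted tag, gold tag
print("\n".join(merged))
```

The resulting lines put the token in column 0, the predicted tag in column 1, and the gold tag in column 2, which matches the cols_format shown earlier.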
Types of Span Errors
| Types | Number of Mentions (Predicted and Gold) | Subtypes | Examples | Notes |
|---|---|---|---|---|
| Missing Mention (False Negative) | 1 | TYPES → O | [] → None # todo | |
| Extra Mention (False Positive) | 1 | O → TYPES | None → [...] # todo | |
| Mention with Wrong Type (Type Errors) | ≥ 2 | TYPES → TYPES − self ({(p, g) \| p ∈ T, g ∈ T − {p}}) | [PER ...] → [ORG ...] # todo | But the spans are the same |
| Missing Tokens | 2 | L/R/LR Diminished | [MISC 1991 World Cup] → [MISC 1991] [MISC World Cup] | also possible with type errors |
| Extra Tokens | 2 | L/R/LR Expanded | [...] → [......] # todo | also possible with type errors |
| Missing + Extra Tokens | 2 | L/R Crossed | ..[...].. → .[..]... | also possible with type errors |
| Conflated Mention | ≥ 3 | | [][][] → [] # todo | also possible with type errors |
| Divided Mention | ≥ 3 | | [MISC 1991 World Cup] → [MISC 1991] [MISC World Cup]; [PER Barack Hussein Obama] → [PER Barack] [PER Hussein] [PER Obama] | also possible with type errors |
| Complicated Case | ≥ 3 | | [][][] → [][] # todo | also possible with type errors |
| Ex - Mention with Wrong Segmentation (Same overall range but wrong segmentation) | ≥ 4 | | [...][......][.] → [......][.....] | also possible with type errors |
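One way to read the rows of the table that involve one or two mentions is as a comparison of span boundaries. The sketch below is not the package's implementation: extract_spans and classify_pair are hypothetical helpers, the arrows in the table are assumed to point from the gold mention to the predicted one, and "L"/"R" are assumed to name the boundary that differs (with "L Crossed" meaning the prediction is shifted to the left). The rows with three or more mentions would additionally require grouping overlapping spans, which is not shown here.

```python
def extract_spans(tags):
    """Return (start, end, type) spans, end exclusive, from IOB1 or IOB2 tags."""
    spans, start, cur = [], None, None
    for i, tag in enumerate(list(tags) + ['O']):  # sentinel 'O' closes the last span
        prefix, _, etype = tag.partition('-')
        if tag == 'O':
            prefix, etype = 'O', None
        # A span ends at O, at an explicit B- tag, or when the entity type changes.
        if cur is not None and (prefix in ('O', 'B') or etype != cur):
            spans.append((start, i, cur))
            start, cur = None, None
        if prefix != 'O' and cur is None:
            start, cur = i, etype
    return spans


def classify_pair(gold, pred):
    """Return (error type, subtype, has_type_error) for one gold and one predicted span."""
    if gold is None and pred is None:
        raise ValueError('need at least one span')
    if pred is None:
        return ('Missing Mention', None, False)
    if gold is None:
        return ('Extra Mention', None, False)
    (gs, ge, gt), (ps, pe, pt) = gold, pred
    type_error = gt != pt
    if pe <= gs or ge <= ps:
        raise ValueError('spans do not overlap')
    if (gs, ge) == (ps, pe):
        return ('Mention with Wrong Type' if type_error else 'Correct', None, type_error)
    # Assumption: L/R names the boundary on which gold and prediction disagree.
    side = ('L' if ps != gs else '') + ('R' if pe != ge else '')
    if gs <= ps and pe <= ge:  # prediction lies inside the gold span
        return ('Missing Tokens', side + ' Diminished', type_error)
    if ps <= gs and ge <= pe:  # gold span lies inside the prediction
        return ('Extra Tokens', side + ' Expanded', type_error)
    # Boundaries cross: the prediction is shifted left or right of the gold span.
    return ('Missing + Extra Tokens', ('L' if ps < gs else 'R') + ' Crossed', type_error)


# The three-token example from the Input File Format section.
pred_spans = extract_spans(['I-ORG', 'I-ORG', 'I-ORG'])
gold_spans = extract_spans(['O', 'O', 'I-ORG'])
print(classify_pair(gold_spans[0], pred_spans[0]))
```

Under these assumptions, the "Natural Language Laboratory" example is an Extra Tokens error: the predicted span starts two tokens before the gold span and ends at the same position.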