
NER error analysis for column-format (CoNLL-style) datasets, including CoNLL-2003, WNUT-2017, ...


NER Error Analyzer

Quick Start

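from nlu.error import *
from nlu.parser import *

# Column layout of the input file; col_num starts from 0 (column 0 is the token text)
cols_format = [{'type': 'predict', 'col_num': 1, 'tagger': 'ner'},
               {'type': 'gold', 'col_num': 2, 'tagger': 'ner'}]

parser = ConllParser('testb.pred.gold', cols_format)

# Basic statistics for the predicted and the gold tags
parser.obtain_statistics(entity_stat=True, source='predict')
parser.obtain_statistics(entity_stat=True, source='gold')

# Set the entity mentions, then annotate the NER errors
parser.set_entity_mentions()
NERErrorAnnotator.annotate(parser)

# Print the correct predictions and all errors, then the overall error statistics
parser.print_corrects()
parser.print_all_errors()
parser.error_overall_stats()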

See the Input File Format section below for the expected input format.

Usage

Import

from nlu.error import *
from nlu.parser import *

First create a ConllParser instance from the input file path, specifying the column numbers in the cols_format argument:

ConllParser(filepath, cols_format)

cols_format = [{'type': 'predict', 'col_num': 1, 'tagger': 'ner'},
               {'type': 'gold', 'col_num': 2, 'tagger': 'ner'}]

parser = ConllParser('testb.pred.gold', cols_format)

Obtain the basic statistics with the obtain_statistics() method:

parser.obtain_statistics(entity_stat=True, source='predict')

parser.obtain_statistics(entity_stat=True, source='gold')
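
As a rough illustration of what entity-level statistics involve, the sketch below counts mentions per type directly from a tag column. It does not use or reproduce obtain_statistics(); the helper names extract_spans and count_mentions are hypothetical, and it assumes IOB1/IOB2-style tags such as I-ORG or B-PER.

```python
from collections import Counter


def extract_spans(tags):
    """Return (start, end_exclusive, type) spans from IOB1/IOB2 tags (hypothetical helper)."""
    spans = []
    start, cur_type = None, None
    # A trailing "O" acts as a sentinel that closes a span ending at the last token.
    for i, tag in enumerate(list(tags) + ["O"]):
        prefix, _, etype = tag.partition("-")
        # A mention starts at B-X, or at I-X when no mention of type X is open.
        starts_new = prefix == "B" or (prefix == "I" and etype != cur_type)
        if start is not None and (prefix == "O" or starts_new):
            spans.append((start, i, cur_type))
            start, cur_type = None, None
        if starts_new:
            start, cur_type = i, etype
    return spans


def count_mentions(tags):
    """Count entity mentions per type in one tag sequence."""
    return Counter(etype for _, _, etype in extract_spans(tags))


# Tag columns of the three-token example from the Input File Format section.
predicted = ["I-ORG", "I-ORG", "I-ORG"]
gold = ["O", "O", "I-ORG"]
print(count_mentions(predicted))  # Counter({'ORG': 1})
print(count_mentions(gold))       # Counter({'ORG': 1})
```

Both columns contain one ORG mention, yet the mentions cover different tokens, which is why mention counts alone are not enough and a span-level error annotation is needed.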

To annotate the NER errors in the documents held by the ConllParser (the Quick Start calls parser.set_entity_mentions() before this step):

NERErrorAnnotator.annotate(parser)

To print all correct predictions or all errors, use

parser.print_corrects() or parser.print_all_errors()

or call the error_overall_stats() method to get the overall statistics.

Input File Format

The input file format of ConllParser follows the column format used by CoNLL-2003.

For example,

Natural I-ORG O
Language I-ORG O
Laboratory I-ORG I-ORG
...

Here the first column is the text, and the second and third columns are the predicted tag and the ground-truth tag respectively. The column order can be specified with the cols_format argument when instantiating ConllParser:

cols_format = [{'type': 'predict', 'col_num': 1, 'tagger': 'ner'},
               {'type': 'gold', 'col_num': 2, 'tagger': 'ner'}]  # col_num starts from 0
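
To make the column layout concrete, here is a minimal sketch of how such a file could be read with a cols_format-style specification. It is not ConllParser's implementation: the function name parse_columns and the text_col parameter are hypothetical, and it assumes whitespace-separated columns with blank lines between sentences.

```python
def parse_columns(lines, cols_format, text_col=0):
    """Group column-format lines into sentences of {'text', 'predict', 'gold'} rows.

    Hypothetical helper: assumes whitespace-separated columns and
    blank lines as sentence boundaries.
    """
    sentences, current = [], []
    for line in lines:
        fields = line.split()
        if not fields:  # a blank line ends the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        row = {"text": fields[text_col]}
        for spec in cols_format:
            # col_num starts from 0, as in cols_format above
            row[spec["type"]] = fields[spec["col_num"]]
        current.append(row)
    if current:
        sentences.append(current)
    return sentences


cols_format = [{'type': 'predict', 'col_num': 1, 'tagger': 'ner'},
               {'type': 'gold', 'col_num': 2, 'tagger': 'ner'}]

sample = """Natural I-ORG O
Language I-ORG O
Laboratory I-ORG I-ORG
"""

for row in parse_columns(sample.splitlines(), cols_format)[0]:
    print(row["text"], row["predict"], row["gold"])
```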

I recommend using the shell command awk '{print $x}' filepath to obtain the x-th column, for example awk '{print $4}' filepath to obtain the 4th column.

Then use paste file1.txt file2.txt to join two files side by side, line by line.

For example,

awk '{print $4}' eng.train > ner_tags_file  # $num starts from 1
paste ner_pred_tags_file ner_tags_file
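
If awk and paste are not available, the same two steps can be approximated in Python. This is only a sketch: select_column and paste_columns are hypothetical helper names, and they mimic the default behaviour of the shell commands (1-based whitespace-separated fields, tab-joined output).

```python
from itertools import zip_longest


def select_column(lines, n):
    """Like awk '{print $n}': the n-th whitespace-separated field (1-based), or '' if absent."""
    result = []
    for line in lines:
        fields = line.split()
        result.append(fields[n - 1] if len(fields) >= n else "")
    return result


def paste_columns(left, right, sep="\t"):
    """Like paste file1 file2: join corresponding lines with a tab."""
    return [sep.join(pair) for pair in zip_longest(left, right, fillvalue="")]


# Illustrative lines in the four-column CoNLL-2003 layout (token, POS, chunk, NER).
train_lines = ["EU NNP I-NP I-ORG", "", "rejects VBZ I-VP O"]
gold_tags = select_column(train_lines, 4)
pred_tags = ["I-ORG", "", "O"]  # placeholder predictions for the example

for line in paste_columns(pred_tags, gold_tags):
    print(line)
```

Blank sentence separators come out as a line containing only a tab, as with paste itself, so they still split to zero fields when the merged file is read back.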

Types of Span Errors

| Types | Number of Mentions (Predicted and Gold) | Subtypes | Examples | Notes |
|---|---|---|---|---|
| Missing Mention (False Negative) | 1 | TYPES → O | [] → None # todo | |
| Extra Mention (False Positive) | 1 | O → TYPES | None → [...] # todo | |
| Mention with Wrong Type (Type Errors) | ≥ 2 | TYPES → TYPES - self ( {(p, g) : p ∈ T, g ∈ T - p} ) | [PER ...] → [ORG ...] # todo | But the spans are the same |
| Missing Tokens | 2 | L / R / LR Diminished | [MISC 1991 World Cup] → [MISC 1991] [MISC World Cup] | also possible with type errors |
| Extra Tokens | 2 | L / R / LR Expanded | [...] → [......] # todo | also possible with type errors |
| Missing + Extra Tokens | 2 | L / R Crossed | ..[...].. → .[..]... | also possible with type errors |
| Conflated Mention | ≥ 3 | | [][][] → [] # todo | also possible with type errors |
| Divided Mention | ≥ 3 | | [MISC 1991 World Cup] → [MISC 1991] [MISC World Cup]; [PER Barack Hussein Obama] → [PER Barack][PER Hussein][PER Obama] | also possible with type errors |
| Complicated Case | ≥ 3 | | [][][] → [][] # todo | also possible with type errors |
| Ex - Mention with Wrong Segmentation (same overall range but wrong segmentation) | ≥ 4 | | [...][......][.] → [......][.....] | also possible with type errors |
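
The rows that involve at most one gold and one predicted mention can be told apart by comparing span boundaries alone. The sketch below illustrates that idea; it is not the logic used by NERErrorAnnotator. The function name classify_pair, the (start, end_exclusive, type) span tuples, the "Correct" and "No Overlap" labels, and the choice to name a Crossed error after the direction in which the prediction is shifted are all assumptions made for the example.

```python
def classify_pair(gold, pred):
    """Classify one gold span against one predicted span.

    Spans are (start, end_exclusive, type) tuples or None.
    Returns (category, subtype, has_type_error).
    """
    if gold is None and pred is None:
        raise ValueError("need at least one span")
    if pred is None:
        return ("Missing Mention", None, False)
    if gold is None:
        return ("Extra Mention", None, False)

    gs, ge, gt = gold
    ps, pe, pt = pred
    type_error = gt != pt

    if (gs, ge) == (ps, pe):  # identical boundaries
        return ("Mention with Wrong Type" if type_error else "Correct", None, type_error)
    if pe <= gs or ge <= ps:  # the spans share no token
        return ("No Overlap", None, type_error)

    # Which boundaries differ: left, right, or both.
    side = ("L" if ps != gs else "") + ("R" if pe != ge else "")
    if gs <= ps and pe <= ge:  # prediction lies inside the gold span
        return ("Missing Tokens", side + " Diminished", type_error)
    if ps <= gs and ge <= pe:  # prediction covers the gold span
        return ("Extra Tokens", side + " Expanded", type_error)
    # Partial overlap: named after the side the prediction is shifted to (assumption).
    return ("Missing + Extra Tokens", ("L" if ps < gs else "R") + " Crossed", type_error)


# "Natural Language Laboratory": gold marks only token 2 as ORG, the prediction marks tokens 0-2.
print(classify_pair((2, 3, "ORG"), (0, 3, "ORG")))  # ('Extra Tokens', 'L Expanded', False)
```

Conflated, divided, and more complicated cases involve three or more mentions, so they would need the overlapping spans to be grouped first; that step is outside this sketch.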
