Library to quickly build basic datasets for Named Entity Recognition (NER) and Relation Extraction (RE) Machine Learning tasks.
extr-ds
Library to programmatically build labeled datasets for Named-Entity Recognition (NER) and Relation Extraction (RE) Machine Learning tasks.
Install
pip install extr-ds
Command Line
See the instructions on how to use the command-line utility to manage your project.
1. Init Project
extr-ds --init
2. Split and Annotate
extr-ds --split
3.a Annotate Entities or Relations Again?
extr-ds --annotate -ents
extr-ds --annotate -rels
3.b Change Relation Extraction Label
extr-ds --relate -label NO_RELATION=5,7,9
3.c Remove Relation Extraction Instance
extr-ds --relate -delete 5,6,7
3.d Recover removed Relation Extraction Instances
extr-ds --relate -recover 5,6,7
4. Save
extr-ds --save -ents
extr-ds --save -rels
5. Reset "Gold Standard" datasets
extr-ds --reset
6. Help!?
extr-ds --help
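When the relation commands in step 3 are driven from a script, the comma-separated index lists can be assembled programmatically. The following is a minimal sketch, not part of extr-ds: `build_relate_args` is a hypothetical helper, and it only uses the `--relate`, `-label`, `-delete` and `-recover` flags shown above.

```python
def build_relate_args(action, indices, label=None):
    """Build an argument list for `extr-ds --relate` (hypothetical helper)."""
    if action not in ('label', 'delete', 'recover'):
        raise ValueError(f'unsupported action: {action}')

    # The CLI expects instance indices as a comma-separated list, e.g. 5,7,9.
    ids = ','.join(str(index) for index in indices)

    if action == 'label':
        if not label:
            raise ValueError('a label is required when relabeling')
        value = f'{label}={ids}'
    else:
        value = ids

    return ['extr-ds', '--relate', f'-{action}', value]


relabel_args = build_relate_args('label', [5, 7, 9], label='NO_RELATION')
delete_args = build_relate_args('delete', [5, 6, 7])
```

The resulting lists can be handed to `subprocess.run(args, check=True)` from inside an initialized project directory.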
API
Example
text = 'Ted Johnson is a pitcher.'
1. Label Entities for Named-Entity Recognition Task (NER)
import re
from extr import RegEx, RegExLabel
from extr.entities import EntityExtractor
from extr_ds.labelers import IOB
entity_extractor = EntityExtractor([
    RegExLabel('PERSON', [
        RegEx([r'(ted\s+johnson|ted)'], re.IGNORECASE)
    ]),
    RegExLabel('POSITION', [
        RegEx([r'pitcher'], re.IGNORECASE)
    ]),
])
sentence_tokenizer = ...  ## placeholder: supply a 3rd party tokenizer here ##
label = IOB(sentence_tokenizer, entity_extractor).label(text)
## label == <Label tokens=..., labels=['B-PERSON', 'I-PERSON', 'O', 'O', 'B-POSITION', 'O']>
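To make the IOB scheme in the output above concrete, here is a minimal, library-free sketch of how such tags can be derived. It does not reproduce how extr-ds implements `IOB`: the function `iob_labels` and its simple regex tokenizer are assumptions made for illustration only.

```python
import re


def iob_labels(text, patterns):
    """Tag each token as B-<label>, I-<label> or O (illustrative sketch only)."""
    # Collect a (start, end, label) span for every regex match in the text.
    spans = []
    for label, pattern in patterns.items():
        for match in re.finditer(pattern, text, re.IGNORECASE):
            spans.append((match.start(), match.end(), label))

    tokens, labels = [], []
    # Assumed tokenizer: runs of word characters, or single punctuation marks.
    for token in re.finditer(r'\w+|[^\w\s]', text):
        tag = 'O'
        for start, end, label in spans:
            if start <= token.start() and token.end() <= end:
                # The first token of an entity gets B-, later tokens get I-.
                prefix = 'B-' if token.start() == start else 'I-'
                tag = prefix + label
                break
        tokens.append(token.group())
        labels.append(tag)
    return tokens, labels


tokens, labels = iob_labels(
    'Ted Johnson is a pitcher.',
    {'PERSON': r'ted\s+johnson|ted', 'POSITION': r'pitcher'},
)
# labels == ['B-PERSON', 'I-PERSON', 'O', 'O', 'B-POSITION', 'O']
```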
2. Annotate for Relation Extraction Task (RE)
import re
from extr import RegEx, RegExLabel
from extr.entities import EntityExtractor
from extr.relations import RegExRelationLabelBuilder, \
    RelationExtractor
from extr_ds.labelers import RelationClassification
from extr_ds.labelers.relation import RelationBuilder, BaseRelationLabeler, RuleBasedRelationLabeler
person_to_position_relationship = RegExRelationLabelBuilder('is_a') \
    .add_e1_to_e2(
        'PERSON',
        [
            r'\s+is\s+a\s+',
        ],
        'POSITION'
    ) \
    .build()
base_relation_labeler = BaseRelationLabeler(
    RelationBuilder(relation_formats=[
        ('PERSON', 'POSITION', 'NO_RELATION')
    ])
)
rule_based_relation_labeler = RuleBasedRelationLabeler(
    RelationExtractor([person_to_position_relationship])
)
labeler = RelationClassification(
    EntityExtractor([
        RegExLabel('PERSON', [
            RegEx([r'(ted johnson|bob)'], re.IGNORECASE)
        ]),
        RegExLabel('POSITION', [
            RegEx([r'pitcher'], re.IGNORECASE)
        ]),
    ]),
    base_relation_labeler,
    relation_labelers=[
        rule_based_relation_labeler
    ]
)
results = labeler.label(text)
## results.relation_labels == [
## <RelationLabel sentence="<e1>Ted Johnson</e1> is a <e2>pitcher</e2>." label="is_a">
## ]
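The idea behind the example above is that every PERSON/POSITION pair starts with a default label (`NO_RELATION`) and a rule upgrades it to `is_a` when the text between the two entities matches a pattern. The sketch below illustrates that idea with the standard library only. It is not the extr-ds implementation, and `label_relation` is a hypothetical function that handles a single e1/e2 pair.

```python
import re


def label_relation(text, e1_pattern, e2_pattern, between_pattern,
                   label, default='NO_RELATION'):
    """Annotate one e1 -> e2 pair and pick a relation label (sketch only)."""
    e1 = re.search(e1_pattern, text, re.IGNORECASE)
    e2 = re.search(e2_pattern, text, re.IGNORECASE)
    # This sketch only handles the case where e1 appears before e2.
    if e1 is None or e2 is None or e1.end() > e2.start():
        return None

    between = text[e1.end():e2.start()]
    annotated = (
        text[:e1.start()]
        + '<e1>' + e1.group() + '</e1>'
        + between
        + '<e2>' + e2.group() + '</e2>'
        + text[e2.end():]
    )
    # The rule fires only if the text between the entities matches entirely.
    matched = re.fullmatch(between_pattern, between, re.IGNORECASE)
    return annotated, (label if matched else default)


hit = label_relation(
    'Ted Johnson is a pitcher.',
    r'ted johnson|bob', r'pitcher', r'\s+is\s+a\s+', 'is_a',
)
miss = label_relation(
    'Bob was a pitcher.',
    r'ted johnson|bob', r'pitcher', r'\s+is\s+a\s+', 'is_a',
)
```

Here `hit` carries the `is_a` label, while `miss` falls back to `NO_RELATION` because " was a " does not match the rule.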
3. Find and define the type of difference between labels
from extr_ds.validators import check_for_differences
differences_in_labels = check_for_differences(
    ['B-PERSON', 'I-PERSON', 'O', 'O', 'B-POSITION', 'O'],
    ['B-PERSON', 'O', 'O', 'O', 'B-POSITION', 'O']
)
## differences_in_labels.has_diffs == True
## differences_in_labels.diffs_between_labels == [
## <Difference index=1, diff_type=DifferenceTypes.S2_MISSING>
## ]
differences_in_labels = check_for_differences(
    ['B-PERSON', 'I-PERSON', 'O', 'O', 'B-POSITION', 'O'],
    ['B-PERSON', 'B-PERSON', 'O', 'O', 'B-POSITION', 'O']
)
## differences_in_labels.has_diffs == True
## differences_in_labels.diffs_between_labels == [
## <Difference index=1, diff_type=DifferenceTypes.MISMATCH>
## ]
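The two outputs above suggest a position-by-position comparison: a label present in the first sequence but `O` in the second is reported as missing from the second, and two different non-`O` labels are a mismatch. The following is a minimal sketch of that logic, not the extr-ds source. `DiffType` and `find_differences` are hypothetical names, and the `S1_MISSING` case is an assumption made by symmetry.

```python
from enum import Enum


class DiffType(Enum):
    S1_MISSING = 's1_missing'  # assumed counterpart of S2_MISSING
    S2_MISSING = 's2_missing'
    MISMATCH = 'mismatch'


def find_differences(s1, s2):
    """Return (index, DiffType) pairs where two label sequences disagree."""
    if len(s1) != len(s2):
        raise ValueError('label sequences must have the same length')

    diffs = []
    for index, (a, b) in enumerate(zip(s1, s2)):
        if a == b:
            continue
        if b == 'O':
            # s1 has an entity label here, s2 has none.
            diffs.append((index, DiffType.S2_MISSING))
        elif a == 'O':
            # s2 has an entity label here, s1 has none.
            diffs.append((index, DiffType.S1_MISSING))
        else:
            # Both have entity labels, but they differ.
            diffs.append((index, DiffType.MISMATCH))
    return diffs


reference = ['B-PERSON', 'I-PERSON', 'O', 'O', 'B-POSITION', 'O']
missing = find_differences(
    reference, ['B-PERSON', 'O', 'O', 'O', 'B-POSITION', 'O'])
mismatch = find_differences(
    reference, ['B-PERSON', 'B-PERSON', 'O', 'O', 'B-POSITION', 'O'])
```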