spaCy pipeline component for CRF entity extraction
spacy_crfsuite: CRF entity tagger for spaCy.
✨ Features
- Simple but tough to beat CRF entity tagger (via sklearn-crfsuite)
- spaCy NER component
- Command line interface for training and evaluation, plus an example notebook
- CoNLL, JSON and Markdown annotations
- Pre-trained NER component
⏳ Installation
pip install spacy_crfsuite
🚀 Quickstart
Standalone usage
from spacy_crfsuite import CRFExtractor, prepare_example
crf_extractor = CRFExtractor().from_disk("model.pkl")
raw_example = {"text": "show mexican restaurants up north"}
example = prepare_example(raw_example, crf_extractor=crf_extractor)
crf_extractor.process(example)
# Output:
# [{'start': 5,
# 'end': 12,
# 'value': 'mexican',
# 'entity': 'cuisine',
# 'confidence': 0.5823148506311286},
# {'start': 28,
# 'end': 33,
# 'value': 'north',
# 'entity': 'location',
# 'confidence': 0.8863076478494413}]
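Each extracted entity is a plain dictionary, so downstream filtering needs no library support. The sketch below drops low-confidence predictions; the `filter_entities` helper and the 0.6 threshold are illustrative assumptions, not part of spacy_crfsuite.

```python
# Entities copied from the example output above.
entities = [
    {"start": 5, "end": 12, "value": "mexican", "entity": "cuisine",
     "confidence": 0.5823148506311286},
    {"start": 28, "end": 33, "value": "north", "entity": "location",
     "confidence": 0.8863076478494413},
]


def filter_entities(entities, min_confidence=0.6):
    """Keep only entities whose confidence is at or above the threshold.

    Hypothetical helper: the threshold value is an arbitrary choice.
    """
    return [e for e in entities if e["confidence"] >= min_confidence]


for ent in filter_entities(entities):
    print(ent["entity"], "=", ent["value"])
# Output:
# location = north
```

A suitable threshold depends on the data; tune it on a held-out set rather than reusing 0.6.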
Usage as a spaCy pipeline component
import spacy
from spacy_crfsuite import CRFEntityExtractor
nlp = spacy.blank('en')
pipe = CRFEntityExtractor(nlp).from_disk("model.pkl")
nlp.add_pipe(pipe)
doc = nlp("show mexican restaurants up north")
for ent in doc.ents:
print(ent.text, "--", ent.label_)
# Output:
# mexican -- cuisine
# north -- location
Follow this notebook to learn how to train an entity tagger from a few restaurant search examples.
Pre-trained model
You can download a pre-trained model.
Dataset | Size | 📥 Download (zipped) |
---|---|---|
CoNLL03 | 1.2 MB | part 1 |
The example below loads the pre-trained CoNLL03 model into a spaCy pipeline.
import spacy
from spacy_crfsuite import CRFEntityExtractor, CRFExtractor
crf_extractor = CRFExtractor().from_disk("spacy_crfsuite_conll03.bz2")
nlp = spacy.blank("en")
pipe = CRFEntityExtractor(nlp, crf_extractor=crf_extractor)
nlp.add_pipe(pipe)
doc = nlp(
"George Walker Bush (born July 6, 1946) is an American politician and businessman "
"who served as the 43rd president of the United States from 2001 to 2009.")
for ent in doc.ents:
print(ent, "-", ent.label_)
# Output:
Command Line Interface
Model training
$ python -m spacy_crfsuite.train examples/restaurent_search.md -c examples/default-config.json -o model/
ℹ Loading config from disk
✔ Successfully loaded config from file.
examples/default-config.json
ℹ Loading training examples.
✔ Successfully loaded 15 training examples from file.
examples/restaurent_search.md
ℹ Using spaCy blank: 'en'
ℹ Training entity tagger with CRF.
ℹ Saving model to disk
✔ Successfully saved model to file.
model/model.pkl
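The training file above is a Markdown annotation file, but its syntax is not shown here. As a rough sketch, assuming inline annotations of the form `[value](entity)` (the Rasa NLU Markdown style, which is an assumption about this project), such a line can be turned into the text and character offsets used in the Quickstart output. `parse_markdown_example` is a hypothetical helper, not a spacy_crfsuite function.

```python
import re

# Assumed annotation syntax: [entity text](label)
ANNOTATION = re.compile(r"\[([^\]]+)\]\(([^)]+)\)")


def parse_markdown_example(line):
    """Convert one annotated line into plain text plus character-offset entities."""
    text_parts = []
    entities = []
    cursor = 0  # position in the annotated line
    length = 0  # length of the plain text built so far
    for match in ANNOTATION.finditer(line):
        # Copy the unannotated text that precedes this annotation.
        before = line[cursor:match.start()]
        text_parts.append(before)
        length += len(before)
        value, label = match.group(1), match.group(2)
        entities.append(
            {"start": length, "end": length + len(value),
             "value": value, "entity": label}
        )
        text_parts.append(value)
        length += len(value)
        cursor = match.end()
    text_parts.append(line[cursor:])
    return {"text": "".join(text_parts), "entities": entities}


example = parse_markdown_example(
    "show [mexican](cuisine) restaurants up [north](location)"
)
print(example["text"])
for ent in example["entities"]:
    print(ent)
```

The loader shipped with the package may behave differently; check the example files in the repository for the exact format it accepts.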
Evaluation (F1 & Classification report)
$ python -m spacy_crfsuite.eval examples/restaurent_search.md -m model/model.pkl
ℹ Loading model from file
model/model.pkl
✔ Successfully loaded CRF tagger
<spacy_crfsuite.crf_extractor.CRFExtractor object at 0x126e5f438>
ℹ Loading dev dataset from file
examples/example.md
✔ Successfully loaded 15 dev examples.
⚠ f1 score: 1.0
              precision    recall  f1-score   support
   B-cuisine      1.000     1.000     1.000         2
   I-cuisine      1.000     1.000     1.000         1
   L-cuisine      1.000     1.000     1.000         2
   U-cuisine      1.000     1.000     1.000         5
  U-location      1.000     1.000     1.000         7
   micro avg      1.000     1.000     1.000        17
   macro avg      1.000     1.000     1.000        17
weighted avg      1.000     1.000     1.000        17
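The `B-`, `I-`, `L-` and `U-` prefixes in the report look like the BILOU tagging scheme (Begin, Inside, Last, Unit, with `O` for tokens outside any entity). Under that assumption, the sketch below shows how token-level tags group into entity spans. `bilou_to_spans` is an illustrative helper rather than a library function, and it assumes well-formed tag sequences.

```python
def bilou_to_spans(tags):
    """Group token-level BILOU tags into (label, first_token, last_token) spans."""
    spans = []
    start = None  # index of the token that opened the current multi-token span
    for i, tag in enumerate(tags):
        if tag == "O":
            start = None
            continue
        prefix, label = tag.split("-", 1)
        if prefix == "U":
            # Unit: a single-token entity.
            spans.append((label, i, i))
            start = None
        elif prefix == "B":
            # Begin: open a multi-token entity.
            start = i
        elif prefix == "L" and start is not None:
            # Last: close the entity opened by the preceding B tag.
            spans.append((label, start, i))
            start = None
        # "I" tags continue the open span and need no action.
    return spans


tokens = ["show", "north", "indian", "food", "in", "centre"]
tags = ["O", "B-cuisine", "L-cuisine", "O", "O", "U-location"]
for label, first, last in bilou_to_spans(tags):
    print(label, "->", " ".join(tokens[first:last + 1]))
# Output:
# cuisine -> north indian
# location -> centre
```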
Tips & tricks
Use the .explain() method to understand the model's decisions.
print(crf_extractor.explain())
# Output:
#
# Most likely transitions:
# O -> O 1.617362
# U-cuisine -> O 1.277659
# B-cuisine -> I-cuisine 1.206597
# I-cuisine -> L-cuisine 0.800963
# O -> U-location 0.719703
# B-cuisine -> L-cuisine 0.589600
# O -> U-cuisine 0.402591
# U-location -> U-cuisine 0.325804
# O -> B-cuisine 0.150878
# L-cuisine -> O 0.087336
#
# Positive features:
# 2.186071 O 0:bias:bias
# 1.973212 U-location -1:low:the
# 1.135395 B-cuisine -1:low:for
# 1.121395 U-location 0:prefix5:centr
# 1.121395 U-location 0:prefix2:ce
# 1.106081 U-location 0:digit
# 1.019241 U-cuisine 0:prefix5:chine
# 1.019241 U-cuisine 0:prefix2:ch
# 1.011240 U-cuisine 0:suffix2:an
# 0.945071 U-cuisine -1:low:me
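To read these numbers: a linear-chain CRF scores a tag sequence by summing the weights of the features active at each token and the weights of the transitions between consecutive tags, then prefers the highest-scoring sequence. The sketch below applies that sum to a few of the weights listed above. The token features chosen for the toy input are assumptions, features are treated as binary, and the real model holds many more weights than the top entries printed here, so the result is illustrative only.

```python
# A few weights copied from the explain() output above.
transitions = {
    ("O", "O"): 1.617362,
    ("O", "U-location"): 0.719703,
    ("O", "U-cuisine"): 0.402591,
}
state_features = {
    ("O", "0:bias:bias"): 2.186071,
    ("U-location", "-1:low:the"): 1.973212,
    ("U-location", "0:prefix5:centr"): 1.121395,
    ("U-location", "0:prefix2:ce"): 1.121395,
}


def path_score(tags, features_per_token):
    """Unnormalized score of a tag sequence: feature weights plus transition weights.

    Weights that are not listed count as 0.0.
    """
    score = 0.0
    for i, (tag, features) in enumerate(zip(tags, features_per_token)):
        score += sum(state_features.get((tag, f), 0.0) for f in features)
        if i > 0:
            score += transitions.get((tags[i - 1], tag), 0.0)
    return score


# Hypothetical features for the two tokens "the centre".
features = [
    {"0:bias:bias"},
    {"-1:low:the", "0:prefix5:centr", "0:prefix2:ce"},
]
location_score = path_score(["O", "U-location"], features)
outside_score = path_score(["O", "O"], features)
print(location_score > outside_score)
```

With these weights the path that tags the second token as `U-location` scores higher than the all-`O` path, which is the kind of preference the positive features express.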
Development
Set up pip & virtualenv
$ pipenv sync -d
Run unit tests
$ pipenv run pytest
Run black (code formatter)
$ pipenv run black spacy_crfsuite/ --config=pyproject.toml