spaCy pipeline component for CRF entity extraction
spacy_crfsuite: CRF entity tagger for spaCy.
✨ Features
- Simple but tough to beat CRF entity tagger (via sklearn-crfsuite)
- spaCy NER component
- Command line interface for training & evaluation and example notebook
- CoNLL, JSON and Markdown annotations
- Pre-trained NER component
⏳ Installation
pip install spacy_crfsuite
🚀 Quickstart
Standalone usage
from spacy_crfsuite import CRFExtractor, prepare_example
crf_extractor = CRFExtractor().from_disk("model.pkl")
raw_example = {"text": "show mexican restaurents up north"}
example = prepare_example(raw_example, crf_extractor=crf_extractor)
crf_extractor.process(example)
# Output:
# [{'start': 5,
# 'end': 12,
# 'value': 'mexican',
# 'entity': 'cuisine',
# 'confidence': 0.5823148506311286},
# {'start': 28,
# 'end': 33,
# 'value': 'north',
# 'entity': 'location',
# 'confidence': 0.8863076478494413}]
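The extractor returns a plain list of dicts, so ordinary Python is enough to post-process it. The following is a minimal sketch that reuses the output shown above; the `filter_entities` helper and the `0.8` threshold are illustrative assumptions, not part of spacy_crfsuite.

```python
# Output copied from the standalone example above.
text = "show mexican restaurents up north"
entities = [
    {"start": 5, "end": 12, "value": "mexican",
     "entity": "cuisine", "confidence": 0.5823148506311286},
    {"start": 28, "end": 33, "value": "north",
     "entity": "location", "confidence": 0.8863076478494413},
]


def filter_entities(entities, min_confidence):
    """Hypothetical helper: keep entities at or above a confidence threshold."""
    return [e for e in entities if e["confidence"] >= min_confidence]


# 'start' and 'end' are character offsets into the original text.
for ent in entities:
    assert text[ent["start"]:ent["end"]] == ent["value"]

confident = filter_entities(entities, min_confidence=0.8)
print([(e["value"], e["entity"]) for e in confident])
```

With this threshold only the `location` entity is kept; a suitable cut-off depends on the model and the data.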
Usage as a spaCy pipeline component
import spacy
from spacy_crfsuite import CRFEntityExtractor
nlp = spacy.blank('en')
pipe = CRFEntityExtractor(nlp).from_disk("model.pkl")
nlp.add_pipe(pipe)
doc = nlp("show mexican restaurents up north")
for ent in doc.ents:
    print(ent.text, "--", ent.label_)
# Output:
# mexican -- cuisine
# north -- location
Follow this notebook to learn how to train an entity tagger from a few restaurant search examples.
Pre-trained model
You can download a pre-trained model.
| Dataset | Size | 📥 Download (zipped) |
|---|---|---|
| CoNLL03 | 1.2 MB | part 1 |
Below is another usage example.
import spacy
from spacy_crfsuite import CRFEntityExtractor, CRFExtractor
crf_extractor = CRFExtractor().from_disk("spacy_crfsuite_conll03.bz2")
nlp = spacy.blank("en")
pipe = CRFEntityExtractor(nlp, crf_extractor=crf_extractor)
nlp.add_pipe(pipe)
doc = nlp(
    "George Walker Bush (born July 6, 1946) is an American politician and businessman "
    "who served as the 43rd president of the United States from 2001 to 2009."
)
for ent in doc.ents:
    print(ent, "-", ent.label_)
# Output:
Command Line Interface
Model training
$ python -m spacy_crfsuite.train examples/restaurent_search.md -c examples/default-config.json -o model/
ℹ Loading config from disk
✔ Successfully loaded config from file.
examples/default-config.json
ℹ Loading training examples.
✔ Successfully loaded 15 training examples from file.
examples/restaurent_search.md
ℹ Using spaCy blank: 'en'
ℹ Training entity tagger with CRF.
ℹ Saving model to disk
✔ Successfully saved model to file.
model/model.pkl
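The training file above is a Markdown annotation file. Its contents are not shown here, so the sketch below assumes the common inline convention `[value](entity)` and shows how such a line could be turned into text plus character offsets. The `parse_annotated` function and the sample line are hypothetical and are not taken from the library or from `examples/restaurent_search.md`.

```python
import re

# Assumed inline annotation syntax: [value](entity)
ANNOTATION = re.compile(r"\[(?P<value>[^\]]+)\]\((?P<entity>[^)]+)\)")


def parse_annotated(line):
    """Hypothetical parser: strip annotations and record entity offsets."""
    parts = []      # pieces of the plain text
    entities = []   # entity dicts with offsets into the plain text
    cursor = 0      # read position in the annotated line
    length = 0      # length of the plain text built so far
    for match in ANNOTATION.finditer(line):
        before = line[cursor:match.start()]
        parts.append(before)
        length += len(before)
        value = match.group("value")
        entities.append({
            "start": length,
            "end": length + len(value),
            "value": value,
            "entity": match.group("entity"),
        })
        parts.append(value)
        length += len(value)
        cursor = match.end()
    parts.append(line[cursor:])
    return {"text": "".join(parts), "entities": entities}


example = parse_annotated(
    "show [mexican](cuisine) restaurents up [north](location)"
)
print(example["text"])
print(example["entities"])
```

The resulting offsets line up with the `start` and `end` fields in the Quickstart output, which is why this shape was chosen for the sketch.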
Evaluation (F1 & Classification report)
$ python -m spacy_crfsuite.eval examples/restaurent_search.md -m model/model.pkl
ℹ Loading model from file
model/model.pkl
✔ Successfully loaded CRF tagger
<spacy_crfsuite.crf_extractor.CRFExtractor object at 0x126e5f438>
ℹ Loading dev dataset from file
examples/example.md
✔ Successfully loaded 15 dev examples.
⚠ f1 score: 1.0
              precision    recall  f1-score   support
   B-cuisine      1.000     1.000     1.000         2
   I-cuisine      1.000     1.000     1.000         1
   L-cuisine      1.000     1.000     1.000         2
   U-cuisine      1.000     1.000     1.000         5
  U-location      1.000     1.000     1.000         7
   micro avg      1.000     1.000     1.000        17
   macro avg      1.000     1.000     1.000        17
weighted avg      1.000     1.000     1.000        17
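The label prefixes in the report (`B-`, `I-`, `L-`, `U-`) look like the BILOU tagging scheme: Begin, Inside and Last mark multi-token entities, Unit marks single-token entities, and `O` marks tokens outside any entity. As a rough sketch under that assumption, per-token tags can be collapsed into entity spans as follows; `bilou_to_spans` and the sample tokens are hypothetical, not the library's own decoder.

```python
def bilou_to_spans(tags):
    """Collapse BILOU tags into (start_token, end_token, label) spans.

    end_token is exclusive. Malformed sequences are silently dropped.
    """
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        if tag == "O":
            start, label = None, None
            continue
        prefix, _, name = tag.partition("-")
        if prefix == "U":                      # single-token entity
            spans.append((i, i + 1, name))
            start, label = None, None
        elif prefix == "B":                    # open a multi-token entity
            start, label = i, name
        elif prefix == "I":                    # must continue the open entity
            if label != name:
                start, label = None, None
        elif prefix == "L":                    # close the open entity
            if start is not None and label == name:
                spans.append((start, i + 1, name))
            start, label = None, None
    return spans


tokens = ["any", "south", "east", "asian", "places", "up", "north"]
tags = ["O", "B-cuisine", "I-cuisine", "L-cuisine", "O", "O", "U-location"]
spans = bilou_to_spans(tags)
print([(" ".join(tokens[s:e]), label) for s, e, label in spans])
```

Note that the report above scores individual tags, so a three-token cuisine entity contributes to the `B-`, `I-` and `L-` rows separately.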
Tips & tricks
Use the .explain() method to understand the model's decisions.
print(crf_extractor.explain())
# Output:
#
# Most likely transitions:
# O -> O 1.617362
# U-cuisine -> O 1.277659
# B-cuisine -> I-cuisine 1.206597
# I-cuisine -> L-cuisine 0.800963
# O -> U-location 0.719703
# B-cuisine -> L-cuisine 0.589600
# O -> U-cuisine 0.402591
# U-location -> U-cuisine 0.325804
# O -> B-cuisine 0.150878
# L-cuisine -> O 0.087336
#
# Positive features:
# 2.186071 O 0:bias:bias
# 1.973212 U-location -1:low:the
# 1.135395 B-cuisine -1:low:for
# 1.121395 U-location 0:prefix5:centr
# 1.121395 U-location 0:prefix2:ce
# 1.106081 U-location 0:digit
# 1.019241 U-cuisine 0:prefix5:chine
# 1.019241 U-cuisine 0:prefix2:ch
# 1.011240 U-cuisine 0:suffix2:an
# 0.945071 U-cuisine -1:low:me
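The feature names in this output appear to follow an `offset:feature:value` pattern, where the offset is the token position relative to the current token (`-1` is the previous token). The sketch below rebuilds a few such features by hand to make the names easier to read. It is an illustration inferred from the output above, not the library's actual feature extractor, and `word_features` and `as_feature_names` are hypothetical names.

```python
def word_features(tokens, i):
    """Hypothetical feature dict for token i with a one-token context window."""
    features = {}
    for offset in (-1, 0, 1):
        j = i + offset
        if not 0 <= j < len(tokens):
            continue  # no neighbour at a sentence boundary
        word = tokens[j]
        features[f"{offset}:low"] = word.lower()
        if offset == 0:
            features["0:bias"] = "bias"
            features["0:prefix2"] = word[:2]
            features["0:prefix5"] = word[:5]
            features["0:suffix2"] = word[-2:]
            features["0:digit"] = word.isdigit()
    return features


def as_feature_names(features):
    """Render string features as 'key:value' and true flags as 'key'."""
    names = []
    for key, value in features.items():
        if isinstance(value, bool):
            if value:
                names.append(key)
        else:
            names.append(f"{key}:{value}")
    return names


names = as_feature_names(word_features(["in", "the", "centre"], 2))
print(names)
```

Read this way, a positive weight on `U-location -1:low:the` means that a preceding "the" pushes the current token toward a single-token location.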
Development
Set up pip & virtualenv
$ pipenv sync -d
Run unit tests
$ pipenv run pytest
Run black (code formatter)
$ pipenv run black spacy_crfsuite/ --config=pyproject.toml