# spacy_crfsuite: CRF entity tagger for spaCy

A spaCy pipeline component for CRF entity extraction: sequence tagging with spaCy and crfsuite, ported from Rasa NLU.
## ✨ Features

- Simple but tough-to-beat CRF entity tagger (via sklearn-crfsuite)
- spaCy NER component
- Command line interface for training & evaluation, plus an example notebook
- CoNLL, JSON and Markdown annotations
- Pre-trained NER component
## ⏳ Installation

```bash
pip install spacy_crfsuite
```
## 🚀 Quickstart

### Usage as a spaCy pipeline component

```python
import spacy
from spacy.language import Language

from spacy_crfsuite import CRFEntityExtractor, CRFExtractor


@Language.factory("ner_crf")
def create_component(nlp, name):
    crf_extractor = CRFExtractor().from_disk("spacy_crfsuite_conll03_sm.bz2")
    return CRFEntityExtractor(nlp, crf_extractor=crf_extractor)


nlp = spacy.load("en_core_web_sm", disable=["ner"])
nlp.add_pipe("ner_crf")

doc = nlp(
    "George Walker Bush (born July 6, 1946) is an American politician and businessman "
    "who served as the 43rd president of the United States from 2001 to 2009."
)
for ent in doc.ents:
    print(ent, "-", ent.label_)

# Output:
# George Walker Bush - PER
# American - MISC
# United States - LOC
```
### Visualization (via Gradio)

Run the commands below to launch a Gradio playground:

```bash
$ pip install gradio
$ python spacy_crfsuite/visualize.py
```
### Pre-trained models

You can download a pre-trained model:

| Dataset | F1  | 📥 Download |
|---------|-----|-------------|
| CoNLL03 | 82% | spacy_crfsuite_conll03_sm.bz2 |
### Train your own model

Below is a command line to train a simple model for a restaurant search bot with Markdown annotations and save it to disk. If you prefer working in Jupyter, follow this notebook.

```bash
$ python -m spacy_crfsuite.train examples/restaurent_search.md -c examples/default-config.json -o model/ -lm en_core_web_sm
ℹ Loading config from disk
✔ Successfully loaded config from file.
examples/default-config.json
ℹ Loading training examples.
✔ Successfully loaded 15 training examples from file.
examples/restaurent_search.md
ℹ Using spaCy model: en_core_web_sm
ℹ Training entity tagger with CRF.
ℹ Saving model to disk
✔ Successfully saved model to file.
model/model.pkl
```
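The training file uses Rasa-NLU-style Markdown annotations, where entities are marked inline as `[value](entity)`. The lines below are an illustrative sketch of that format, not an excerpt from `examples/restaurent_search.md`:

```md
## intent:restaurant_search
- show me a [mexican](cuisine) place in the [centre](location)
- i am looking for an [indian](cuisine) spot
```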
Below is a command line to test the CRF model and print the classification report. (In this example we evaluate on the training set; normally you would use a held-out set.)

```bash
$ python -m spacy_crfsuite.eval examples/restaurent_search.md -m model/model.pkl -lm en_core_web_sm
ℹ Loading model from file
model/model.pkl
✔ Successfully loaded CRF tagger
<spacy_crfsuite.crf_extractor.CRFExtractor object at 0x126e5f438>
ℹ Loading dev dataset from file
examples/example.md
✔ Successfully loaded 15 dev examples.
ℹ Using spaCy model: en_core_web_sm
ℹ Classification Report:

              precision    recall  f1-score   support

   B-cuisine      1.000     1.000     1.000         2
   I-cuisine      1.000     1.000     1.000         1
   L-cuisine      1.000     1.000     1.000         2
   U-cuisine      1.000     1.000     1.000         5
  U-location      1.000     1.000     1.000         7

   micro avg      1.000     1.000     1.000        17
   macro avg      1.000     1.000     1.000        17
weighted avg      1.000     1.000     1.000        17
```
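The report is scored over BILOU tags (Begin/Inside/Last/Unit, plus O for outside), the per-token scheme the CRF predicts. A minimal sketch of how such tag sequences map back to entity spans (the helper below is illustrative, not part of the library):

```python
def bilou_to_entities(tokens, tags):
    """Collapse per-token BILOU tags into (label, text) entity spans."""
    entities, start = [], None
    for i, tag in enumerate(tags):
        if tag.startswith("U-"):       # single-token entity
            entities.append((tag[2:], tokens[i]))
        elif tag.startswith("B-"):     # multi-token entity begins
            start = i
        elif tag.startswith("L-") and start is not None:  # entity ends
            entities.append((tag[2:], " ".join(tokens[start:i + 1])))
            start = None
        # "I-" (inside) and "O" (outside) need no action here
    return entities

print(bilou_to_entities(
    ["show", "mexican", "restaurents", "up", "north"],
    ["O", "U-cuisine", "O", "O", "U-location"],
))
# [('cuisine', 'mexican'), ('location', 'north')]
```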
Now we can use the tagger for named entity recognition in a spaCy pipeline!

```python
import spacy
from spacy.language import Language

from spacy_crfsuite import CRFEntityExtractor, CRFExtractor


@Language.factory("ner_crf")
def create_component(nlp, name):
    crf_extractor = CRFExtractor().from_disk("model/model.pkl")
    return CRFEntityExtractor(nlp, crf_extractor=crf_extractor)


nlp = spacy.load("en_core_web_sm", disable=["ner"])
nlp.add_pipe("ner_crf")

doc = nlp("show mexican restaurents up north")
for ent in doc.ents:
    print(ent.text, "--", ent.label_)

# Output:
# mexican -- cuisine
# north -- location
```
Or, alternatively, use it as a standalone component:

```python
from spacy_crfsuite import CRFExtractor
from spacy_crfsuite.tokenizer import SpacyTokenizer

crf_extractor = CRFExtractor().from_disk("model/model.pkl")
tokenizer = SpacyTokenizer()

example = {"text": "show mexican restaurents up north"}
tokenizer.tokenize(example, attribute="text")
crf_extractor.process(example)

# Output:
# [{'start': 5,
#   'end': 12,
#   'value': 'mexican',
#   'entity': 'cuisine',
#   'confidence': 0.5823148506311286},
#  {'start': 28,
#   'end': 33,
#   'value': 'north',
#   'entity': 'location',
#   'confidence': 0.8863076478494413}]
```
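Each result carries character offsets into the original text, so the surface string can always be recovered by slicing. A small illustrative helper (not part of the library; the `predicted` data mirrors the output above):

```python
def entity_strings(text, entities):
    """Slice each predicted entity back out of the original text."""
    return [(text[e["start"]:e["end"]], e["entity"]) for e in entities]

text = "show mexican restaurents up north"
predicted = [
    {"start": 5, "end": 12, "value": "mexican", "entity": "cuisine"},
    {"start": 28, "end": 33, "value": "north", "entity": "location"},
]
print(entity_strings(text, predicted))
# [('mexican', 'cuisine'), ('north', 'location')]
```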
We can also take a look at what the model has learned. Use the `.explain()` method to understand the model's decisions:

```python
print(crf_extractor.explain())

# Output:
#
# Most likely transitions:
# O          -> O          1.637338
# B-cuisine  -> I-cuisine  1.373766
# U-cuisine  -> O          1.306077
# I-cuisine  -> L-cuisine  0.915989
# O          -> U-location 0.751463
# B-cuisine  -> L-cuisine  0.698893
# O          -> U-cuisine  0.480360
# U-location -> U-cuisine  0.403487
# O          -> B-cuisine  0.261450
# L-cuisine  -> O          0.182695
#
# Positive features:
# 1.976502 O          0:bias:bias
# 1.957180 U-location -1:low:the
# 1.216547 B-cuisine  -1:low:for
# 1.153924 U-location 0:prefix5:centr
# 1.153924 U-location 0:prefix2:ce
# 1.110536 U-location 0:digit
# 1.058294 U-cuisine  0:prefix5:chine
# 1.058294 U-cuisine  0:prefix2:ch
# 1.051457 U-cuisine  0:suffix2:an
# 0.999976 U-cuisine  -1:low:me
```
**Notice:** you can also access the `crf_extractor` directly with `nlp.get_pipe("ner_crf").crf_extractor`.
### Deploy to a web server

Start a web service:

```bash
$ pip install uvicorn
$ uvicorn spacy_crfsuite.serve:app --host 127.0.0.1 --port 5000
```

**Notice:** set `$SPACY_MODEL` and `$CRF_MODEL` in your environment to control the server configuration.
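For example, the two variables can be exported before launching the server (the paths below are illustrative; point them at your own spaCy model and trained CRF model):

```shell
export SPACY_MODEL=en_core_web_sm
export CRF_MODEL=model/model.pkl
```

Then start uvicorn as shown above, and the service will load these models on startup.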
cURL example:

```bash
$ curl -X POST http://127.0.0.1:5000/parse -H 'Content-Type: application/json' -d '{"text": "George Walker Bush (born July 6, 1946) is an American politician and businessman who served as the 43rd president of the United States from 2001 to 2009."}'
```

```json
{
  "data": [
    {
      "text": "George Walker Bush (born July 6, 1946) is an American politician and businessman who served as the 43rd president of the United States from 2001 to 2009.",
      "entities": [
        {"start": 0, "end": 18, "value": "George Walker Bush", "entity": "PER"},
        {"start": 45, "end": 53, "value": "American", "entity": "MISC"},
        {"start": 121, "end": 134, "value": "United States", "entity": "LOC"}
      ]
    }
  ]
}
```
## Development

Set up env:

```bash
$ poetry install
$ poetry run spacy download en_core_web_sm
```

Run unit tests:

```bash
$ poetry run pytest
```

Run black (code formatting):

```bash
$ poetry run black spacy_crfsuite/ --config=pyproject.toml
```