
spaCy pipeline component for CRF entity extraction


spacy_crfsuite: CRF tagger for spaCy.

Sequence tagging with spaCy and crfsuite.

Copied from Rasa NLU.

✨ Features

  • Simple but tough-to-beat CRF entity tagger (via sklearn-crfsuite)
  • spaCy NER component
  • Command line interface for training and evaluation, plus an example notebook
  • CoNLL, JSON and Markdown annotations
  • Pre-trained NER component

⏳ Installation

pip install spacy_crfsuite

🚀 Quickstart

Usage as a spaCy pipeline component

import spacy

from spacy_crfsuite import CRFEntityExtractor, CRFExtractor


nlp = spacy.load("en_core_web_sm", disable=["ner"])
crf_extractor = CRFExtractor().from_disk("spacy_crfsuite_conll03_sm.bz2")
pipe = CRFEntityExtractor(nlp, crf_extractor=crf_extractor)
nlp.add_pipe(pipe)

doc = nlp(
    "George Walker Bush (born July 6, 1946) is an American politician and businessman "
    "who served as the 43rd president of the United States from 2001 to 2009.")

for ent in doc.ents:
    print(ent, "-", ent.label_)

# Output:
# George Walker Bush - PER
# American - MISC
# United States - LOC

Pre-trained models

You can download a pre-trained model.

Dataset    F1     📥 Download
CoNLL03    82%    spacy_crfsuite_conll03_sm.bz2

Train your own model

Let's train a simple model for a restaurant search bot using Markdown annotations and the command line. You can also try this notebook.
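For readers unfamiliar with Markdown annotations, here is a minimal sketch of how an inline-annotated line could be turned into plain text plus entity offsets. It assumes the Rasa-style `[entity text](label)` syntax; the helper `parse_annotated_line` and the sample line are hypothetical and are not taken from the library or from examples/restaurent_search.md, whose reader may support more than this.

```python
import re

# Assumption: annotations use the Rasa-style inline form [entity text](label).
ANNOTATION = re.compile(r"\[(?P<value>[^\]]+)\]\((?P<entity>[^)]+)\)")


def parse_annotated_line(line):
    """Strip [value](entity) markup; return plain text and entity offsets."""
    text_parts = []
    entities = []
    cursor = 0      # position in the annotated line
    plain_len = 0   # length of the plain text built so far
    for match in ANNOTATION.finditer(line):
        # Copy the unannotated text that precedes this annotation.
        prefix = line[cursor:match.start()]
        text_parts.append(prefix)
        plain_len += len(prefix)
        value = match.group("value")
        entities.append({
            "start": plain_len,
            "end": plain_len + len(value),
            "value": value,
            "entity": match.group("entity"),
        })
        text_parts.append(value)
        plain_len += len(value)
        cursor = match.end()
    # Copy whatever follows the last annotation.
    text_parts.append(line[cursor:])
    return {"text": "".join(text_parts), "entities": entities}


# Hypothetical training line, written for illustration only.
example = parse_annotated_line(
    "show [mexican](cuisine) restaurents up [north](location)"
)
print(example["text"])
for ent in example["entities"]:
    print(ent)
```

The offsets are character positions in the plain text, which is the same convention the standalone extractor output uses further down.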

We start by training a model and saving it to disk.

$ python -m spacy_crfsuite.train examples/restaurent_search.md -c examples/default-config.json -o model/ -lm en_core_web_sm
ℹ Loading config from disk
✔ Successfully loaded config from file.
examples/default-config.json
ℹ Loading training examples.
✔ Successfully loaded 15 training examples from file.
examples/restaurent_search.md
ℹ Using spaCy model: en_core_web_sm
ℹ Training entity tagger with CRF.
ℹ Saving model to disk
✔ Successfully saved model to file.
model/model.pkl

We can also evaluate on a dev set to get the F1 score and a classification report. Below we reuse the training examples, so the scores are not a measure of generalization.

$ python -m spacy_crfsuite.eval examples/restaurent_search.md -m model/model.pkl -lm en_core_web_sm
ℹ Loading model from file
model/model.pkl
✔ Successfully loaded CRF tagger
<spacy_crfsuite.crf_extractor.CRFExtractor object at 0x126e5f438>
ℹ Loading dev dataset from file
examples/example.md
✔ Successfully loaded 15 dev examples.
ℹ Using spaCy model: en_core_web_sm
⚠ f1 score: 1.0
              precision    recall  f1-score   support

   B-cuisine      1.000     1.000     1.000         2
   I-cuisine      1.000     1.000     1.000         1
   L-cuisine      1.000     1.000     1.000         2
   U-cuisine      1.000     1.000     1.000         5
  U-location      1.000     1.000     1.000         7

   micro avg      1.000     1.000     1.000        17
   macro avg      1.000     1.000     1.000        17
weighted avg      1.000     1.000     1.000        17
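The row labels in the report (B-, I-, L-, U- prefixes) follow the BILOU tagging scheme: Begin, Inside and Last mark the tokens of a multi-token entity, Unit marks a single-token entity, and O marks everything else. The sketch below shows one way entity spans could be mapped to such tags. The helpers `whitespace_tokens` and `bilou_tags` are hypothetical, the whitespace tokenizer stands in for spaCy, and the second sentence is invented for illustration; the library's own conversion may differ.

```python
def whitespace_tokens(text):
    """Split on whitespace and keep character offsets: (word, start, end)."""
    tokens, pos = [], 0
    for word in text.split():
        start = text.index(word, pos)
        end = start + len(word)
        tokens.append((word, start, end))
        pos = end
    return tokens


def bilou_tags(tokens, entities):
    """Assign one BILOU tag per token from character-offset entity spans."""
    tags = ["O"] * len(tokens)
    for ent in entities:
        # Indices of the tokens that lie fully inside the entity span.
        covered = [
            i for i, (_, start, end) in enumerate(tokens)
            if start >= ent["start"] and end <= ent["end"]
        ]
        if not covered:
            continue
        label = ent["entity"]
        if len(covered) == 1:
            tags[covered[0]] = "U-" + label
        else:
            tags[covered[0]] = "B-" + label
            for i in covered[1:-1]:
                tags[i] = "I-" + label
            tags[covered[-1]] = "L-" + label
    return tags


# Single-token entities, using the sentence from this README.
text_a = "show mexican restaurents up north"
ents_a = [
    {"start": 5, "end": 12, "entity": "cuisine"},
    {"start": 28, "end": 33, "entity": "location"},
]
tags_a = bilou_tags(whitespace_tokens(text_a), ents_a)
print(tags_a)

# Multi-token entity (invented sentence) to show B-, I- and L- tags.
text_b = "book south east asian place"
span = "south east asian"
start_b = text_b.index(span)
ents_b = [{"start": start_b, "end": start_b + len(span), "entity": "cuisine"}]
tags_b = bilou_tags(whitespace_tokens(text_b), ents_b)
print(tags_b)
```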

Now we can use the tagger in a spaCy pipeline!

import spacy

from spacy_crfsuite import CRFEntityExtractor

nlp = spacy.load('en_core_web_sm')
pipe = CRFEntityExtractor(nlp).from_disk("model/model.pkl")
nlp.add_pipe(pipe)

doc = nlp("show mexican restaurents up north")
for ent in doc.ents:
    print(ent.text, "--", ent.label_)

# Output:
# mexican -- cuisine
# north -- location

Alternatively, use it as a standalone component.

from spacy_crfsuite import CRFExtractor
from spacy_crfsuite.tokenizer import SpacyTokenizer

crf_extractor = CRFExtractor().from_disk("model/model.pkl")
tokenizer = SpacyTokenizer()

example = {"text": "show mexican restaurents up north"}
tokenizer.tokenize(example, attribute="text")
crf_extractor.process(example)

# Output:
# [{'start': 5,
#   'end': 12,
#   'value': 'mexican',
#   'entity': 'cuisine',
#   'confidence': 0.5823148506311286},
#  {'start': 28,
#   'end': 33,
#   'value': 'north',
#   'entity': 'location',
#   'confidence': 0.8863076478494413}]
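Because each predicted entity carries character offsets and a confidence score, the result can be post-processed with plain Python. The sketch below copies the output shown above into a list and keeps only entities above a threshold. The helper `filter_entities` and the 0.6 cut-off are hypothetical choices for illustration, not part of the library.

```python
text = "show mexican restaurents up north"

# Copied from the standalone extractor output shown above.
entities = [
    {"start": 5, "end": 12, "value": "mexican",
     "entity": "cuisine", "confidence": 0.5823148506311286},
    {"start": 28, "end": 33, "value": "north",
     "entity": "location", "confidence": 0.8863076478494413},
]


def filter_entities(entities, min_confidence):
    """Keep entities whose confidence is at least min_confidence."""
    return [e for e in entities if e["confidence"] >= min_confidence]


# The offsets index directly into the original text.
for ent in entities:
    assert text[ent["start"]:ent["end"]] == ent["value"]

# 0.6 is an arbitrary threshold chosen for this example.
confident = filter_entities(entities, min_confidence=0.6)
print([(e["value"], e["entity"]) for e in confident])
```

A threshold like this trades recall for precision, so it is worth tuning on a held-out dev set rather than on the training examples.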

We can also take a look at what the model learned.

Use the .explain() method to understand the model's decisions.

print(crf_extractor.explain())

# Output:
#
# Most likely transitions:
# O          -> O          1.637338
# B-cuisine  -> I-cuisine  1.373766
# U-cuisine  -> O          1.306077
# I-cuisine  -> L-cuisine  0.915989
# O          -> U-location 0.751463
# B-cuisine  -> L-cuisine  0.698893
# O          -> U-cuisine  0.480360
# U-location -> U-cuisine  0.403487
# O          -> B-cuisine  0.261450
# L-cuisine  -> O          0.182695
# 
# Positive features:
# 1.976502 O          0:bias:bias
# 1.957180 U-location -1:low:the
# 1.216547 B-cuisine  -1:low:for
# 1.153924 U-location 0:prefix5:centr
# 1.153924 U-location 0:prefix2:ce
# 1.110536 U-location 0:digit
# 1.058294 U-cuisine  0:prefix5:chine
# 1.058294 U-cuisine  0:prefix2:ch
# 1.051457 U-cuisine  0:suffix2:an
# 0.999976 U-cuisine  -1:low:me
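The feature names in this output, such as -1:low:the or 0:prefix5:centr, appear to read as offset:feature:value, where the offset is the token's position relative to the one being tagged. The sketch below builds a feature dict of that shape for one token. The function `token_features` and its exact feature set are assumptions made for illustration; the features the library actually extracts depend on its config. With crfsuite-style dict features, a string value is joined to its key (for example "0:low" with value "mexican" becomes 0:low:mexican), while a boolean keeps the bare key, which would explain names like 0:digit.

```python
def token_features(tokens, i):
    """Build window features for tokens[i]; keys are '<offset>:<feature>'."""
    word = tokens[i]
    features = {
        "0:bias": "bias",            # constant feature, shown as 0:bias:bias
        "0:low": word.lower(),
        "0:prefix2": word[:2],
        "0:prefix5": word[:5],
        "0:suffix2": word[-2:],
        "0:digit": word.isdigit(),   # boolean feature, shown as 0:digit
    }
    # Assumption: neighbouring tokens contribute only their lower-cased form.
    if i > 0:
        features["-1:low"] = tokens[i - 1].lower()
    if i < len(tokens) - 1:
        features["1:low"] = tokens[i + 1].lower()
    return features


tokens = ["show", "mexican", "restaurents", "up", "north"]
features = token_features(tokens, 1)
for name, value in sorted(features.items()):
    print(name, "=", value)
```

Read this way, a positive weight on U-cuisine for 0:suffix2:an says that a token ending in "an" is evidence for a single-token cuisine entity.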

Note: You can also access the crf_extractor directly with nlp.get_pipe("crf_ner").crf_extractor.

Development

Set up virtualenv

$ pipenv sync -d

Run unit tests

$ pipenv run pytest

Run black (code formatting)

$ pipenv run black spacy_crfsuite/ --config=pyproject.toml
