spaCy pipeline component for CRF entity extraction

spacy_crfsuite: CRF entity tagger for spaCy.

✨ Features

  • Simple but tough-to-beat CRF entity tagger (via sklearn-crfsuite)
  • spaCy NER component
  • Command line interface for training & evaluation, plus an example notebook
  • CoNLL, JSON and Markdown annotations
  • Pre-trained NER component
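The Markdown annotation format is not spelled out here. As a rough sketch, assuming annotations use the inline `[value](entity)` convention (an assumption, not confirmed by this page), one annotated line could be converted into plain text plus character-offset entities like this. `parse_markdown_example` is a hypothetical helper and is not part of spacy_crfsuite.

```python
import re

# Assumed inline annotation syntax: [entity text](entity_label)
ANNOTATION = re.compile(r"\[(?P<value>[^\]]+)\]\((?P<entity>[^)]+)\)")


def parse_markdown_example(line):
    """Convert one annotated line into plain text plus character-offset entities."""
    text_parts = []
    entities = []
    cursor = 0  # position in the annotated input
    offset = 0  # position in the plain-text output
    for match in ANNOTATION.finditer(line):
        # Copy the unannotated text that precedes this annotation.
        before = line[cursor:match.start()]
        text_parts.append(before)
        offset += len(before)

        value = match.group("value")
        entities.append({
            "start": offset,
            "end": offset + len(value),  # end offset is exclusive
            "value": value,
            "entity": match.group("entity"),
        })
        text_parts.append(value)
        offset += len(value)
        cursor = match.end()
    # Copy whatever follows the last annotation.
    text_parts.append(line[cursor:])
    return {"text": "".join(text_parts), "entities": entities}


example = parse_markdown_example(
    "show [mexican](cuisine) restaurents up [north](location)"
)
print(example["text"])  # show mexican restaurents up north
for ent in example["entities"]:
    print(ent)
```

The resulting offsets (5 to 12 for "mexican", 28 to 33 for "north") line up with the entity dictionaries shown in the Quickstart output.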

⏳ Installation

pip install spacy_crfsuite

🚀 Quickstart

Standalone usage

from spacy_crfsuite import CRFExtractor, prepare_example

crf_extractor = CRFExtractor().from_disk("model.pkl")
raw_example = {"text": "show mexican restaurents up north"}
example = prepare_example(raw_example, crf_extractor=crf_extractor)
crf_extractor.process(example)

# Output:
# [{'start': 5,
#   'end': 12,
#   'value': 'mexican',
#   'entity': 'cuisine',
#   'confidence': 0.5823148506311286},
#  {'start': 28,
#   'end': 33,
#   'value': 'north',
#   'entity': 'location',
#   'confidence': 0.8863076478494413}]
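Each returned entity carries character offsets and a confidence score, so the result can be post-processed with plain Python. The sketch below copies the output shown above and keeps only entities above a threshold; `filter_entities` and the 0.7 cut-off are illustrative choices, not part of the library.

```python
# Entities copied from the output shown above.
entities = [
    {"start": 5, "end": 12, "value": "mexican", "entity": "cuisine",
     "confidence": 0.5823148506311286},
    {"start": 28, "end": 33, "value": "north", "entity": "location",
     "confidence": 0.8863076478494413},
]
text = "show mexican restaurents up north"


def filter_entities(text, entities, min_confidence=0.0):
    """Keep entities at or above the threshold whose offsets match the text."""
    kept = []
    for ent in entities:
        if ent["confidence"] < min_confidence:
            continue
        # start/end are character offsets into the input text (end exclusive).
        if text[ent["start"]:ent["end"]] != ent["value"]:
            continue
        kept.append(ent)
    return kept


confident = filter_entities(text, entities, min_confidence=0.7)
print([(e["value"], e["entity"]) for e in confident])
# [('north', 'location')]
```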

Usage as a spaCy pipeline component

import spacy

from spacy_crfsuite import CRFEntityExtractor

nlp = spacy.blank('en')
pipe = CRFEntityExtractor(nlp).from_disk("model.pkl")
nlp.add_pipe(pipe)

doc = nlp("show mexican restaurents up north")
for ent in doc.ents:
    print(ent.text, "--", ent.label_)

# Output:
# mexican -- cuisine
# north -- location

Follow this notebook to learn how to train an entity tagger from a few restaurant search examples.

Pre-trained model

You can download a pre-trained model.

Dataset    Size      📥 Download (zipped)
CoNLL03    1.2 MB    part 1

The example below loads the pre-trained CoNLL03 model and adds it to a spaCy pipeline.

import spacy

from spacy_crfsuite import CRFEntityExtractor, CRFExtractor

crf_extractor = CRFExtractor().from_disk("spacy_crfsuite_conll03.bz2")

nlp = spacy.blank("en")

pipe = CRFEntityExtractor(nlp, crf_extractor=crf_extractor)
nlp.add_pipe(pipe)

doc = nlp(
    "George Walker Bush (born July 6, 1946) is an American politician and businessman "
    "who served as the 43rd president of the United States from 2001 to 2009.")

for ent in doc.ents:
    print(ent, "-", ent.label_)

# Output:

Command Line Interface

Model training

$ python -m spacy_crfsuite.train examples/restaurent_search.md -c examples/default-config.json -o model/
ℹ Loading config from disk
✔ Successfully loaded config from file.
examples/default-config.json
ℹ Loading training examples.
✔ Successfully loaded 15 training examples from file.
examples/restaurent_search.md
ℹ Using spaCy blank: 'en' Training entity tagger with CRF.
ℹ Saving model to disk
✔ Successfully saved model to file.
model/model.pkl

Evaluation (F1 & Classification report)

$ python -m spacy_crfsuite.eval examples/restaurent_search.md -m model/model.pkl
ℹ Loading model from file
model/model.pkl
✔ Successfully loaded CRF tagger
<spacy_crfsuite.crf_extractor.CRFExtractor object at 0x126e5f438>
ℹ Loading dev dataset from file
examples/example.md
✔ Successfully loaded 15 dev examples.
⚠ f1 score: 1.0
              precision    recall  f1-score   support

   B-cuisine      1.000     1.000     1.000         2
   I-cuisine      1.000     1.000     1.000         1
   L-cuisine      1.000     1.000     1.000         2
   U-cuisine      1.000     1.000     1.000         5
  U-location      1.000     1.000     1.000         7

   micro avg      1.000     1.000     1.000        17
   macro avg      1.000     1.000     1.000        17
weighted avg      1.000     1.000     1.000        17
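The tag prefixes in the report (B-, I-, L-, U-) appear to follow the BILOU scheme: Begin, Inside and Last mark multi-token entities, Unit marks single-token entities, and O marks tokens outside any entity. As a minimal sketch of how such tags map back to entity spans, the hypothetical helper below groups a tag sequence into token ranges; the sentence and tags are made up for illustration, and the library's own decoding may differ.

```python
def bilou_to_spans(tags):
    """Group BILOU tags into (label, first_token, last_token) spans, inclusive."""
    spans = []
    start = None  # index of the currently open B- tag, if any
    label = None
    for i, tag in enumerate(tags):
        prefix, _, name = tag.partition("-")
        if prefix == "U":
            # Single-token entity.
            spans.append((name, i, i))
            start, label = None, None
        elif prefix == "B":
            start, label = i, name
        elif prefix == "I" and start is not None and name == label:
            pass  # still inside the open entity
        elif prefix == "L" and start is not None and name == label:
            spans.append((label, start, i))
            start, label = None, None
        else:
            # "O" or an inconsistent sequence: drop any open span.
            start, label = None, None
    return spans


# Illustrative sentence and tags (not taken from the example dataset).
tokens = ["find", "south", "east", "asian", "food", "in", "centre"]
tags = ["O", "B-cuisine", "I-cuisine", "L-cuisine", "O", "O", "U-location"]

spans = bilou_to_spans(tags)
for label, first, last in spans:
    print(label, "->", " ".join(tokens[first:last + 1]))
# cuisine -> south east asian
# location -> centre
```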

Tips & tricks

Use the .explain() method to inspect the learned transition and feature weights behind the model's decisions.

print(crf_extractor.explain())

# Output:
#
# Most likely transitions:
# O          -> O          1.617362
# U-cuisine  -> O          1.277659
# B-cuisine  -> I-cuisine  1.206597
# I-cuisine  -> L-cuisine  0.800963
# O          -> U-location 0.719703
# B-cuisine  -> L-cuisine  0.589600
# O          -> U-cuisine  0.402591
# U-location -> U-cuisine  0.325804
# O          -> B-cuisine  0.150878
# L-cuisine  -> O          0.087336
# 
# Positive features:
# 2.186071 O          0:bias:bias
# 1.973212 U-location -1:low:the
# 1.135395 B-cuisine  -1:low:for
# 1.121395 U-location 0:prefix5:centr
# 1.121395 U-location 0:prefix2:ce
# 1.106081 U-location 0:digit
# 1.019241 U-cuisine  0:prefix5:chine
# 1.019241 U-cuisine  0:prefix2:ch
# 1.011240 U-cuisine  0:suffix2:an
# 0.945071 U-cuisine  -1:low:me
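The feature names in this output read as `offset:feature:value`, where the offset is the token's position relative to the current one (for example, `-1:low:the` would mean the previous token, lowercased, is "the"). The sketch below shows how features with such names could be produced. It is inferred from the names above only: `token_features`, `flatten`, and the exact feature set are assumptions, not the library's actual implementation.

```python
def token_features(tokens, i):
    """Build a feature dict for tokens[i] using a [-1, 0, +1] window."""
    features = {"0:bias": "bias"}
    for offset in (-1, 0, 1):
        j = i + offset
        if j < 0 or j >= len(tokens):
            continue  # no neighbour on this side
        token = tokens[j]
        features[f"{offset}:low"] = token.lower()
        features[f"{offset}:digit"] = token.isdigit()
        if offset == 0:
            # Prefix and suffix features for the current token only.
            features["0:prefix2"] = token[:2]
            features["0:prefix5"] = token[:5]
            features["0:suffix2"] = token[-2:]
    return features


def flatten(features):
    """Render features as names: 'key:value' for strings, 'key' for True flags."""
    names = []
    for key, value in features.items():
        if isinstance(value, bool):
            if value:
                names.append(key)
        else:
            names.append(f"{key}:{value}")
    return names


names = flatten(token_features(["in", "the", "centre"], 2))
print(names)
# ['0:bias:bias', '-1:low:the', '0:low:centre', '0:prefix2:ce', '0:prefix5:centr', '0:suffix2:re']
```

Read this way, a high positive weight such as `U-location -1:low:the` suggests that a single-token location often follows the word "the" in the training data.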

Development

Set up the virtualenv and install dependencies (including dev packages) with pipenv

$ pipenv sync -d

Run unit tests

$ pipenv run pytest

Run black (code formatter)

$ pipenv run black spacy_crfsuite/ --config=pyproject.toml
