spaCy pipeline component for CRF entity extraction

spacy_crfsuite: CRF tagger for spaCy.

Sequence tagging with spaCy and crfsuite, ported from Rasa NLU's CRF entity extractor.

✨ Features

  • A simple but tough-to-beat CRF entity tagger (via sklearn-crfsuite)
  • spaCy NER component
  • Command line interface for training and evaluation, plus an example notebook
  • CoNLL, JSON and Markdown annotations
  • Pre-trained NER component

⏳ Installation

pip install spacy_crfsuite
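
The examples below also assume a spaCy language model is available; if needed, download the small English model:

$ python -m spacy download en_core_web_sm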

🚀 Quickstart

Usage as a spaCy pipeline component

The example below registers the CRF extractor as a pipeline component and disables spaCy's built-in NER, so the CRF tagger becomes the sole source of doc.ents.

import spacy

from spacy.language import Language
from spacy_crfsuite import CRFEntityExtractor, CRFExtractor


@Language.factory("ner_crf")
def create_component(nlp, name):
    crf_extractor = CRFExtractor().from_disk("spacy_crfsuite_conll03_sm.bz2")
    return CRFEntityExtractor(nlp, crf_extractor=crf_extractor)


nlp = spacy.load("en_core_web_sm", disable=["ner"])
nlp.add_pipe("ner_crf")

doc = nlp(
    "George Walker Bush (born July 6, 1946) is an American politician and businessman "
    "who served as the 43rd president of the United States from 2001 to 2009.")

for ent in doc.ents:
    print(ent, "-", ent.label_)

# Output:
# George Walker Bush - PER
# American - MISC
# United States - LOC

Visualization (via Gradio)

Run the commands below to launch a Gradio playground:

$ pip install gradio
$ python spacy_crfsuite/visualize.py

Pre-trained models

You can download a pre-trained model:

| Dataset | F1  | 📥 Download                   |
|---------|-----|-------------------------------|
| CoNLL03 | 82% | spacy_crfsuite_conll03_sm.bz2 |

Train your own model

Below is a command line to train a simple model for a restaurant search bot from Markdown annotations and save it to disk (the annotation format is sketched just below). If you prefer working in Jupyter, follow this notebook.
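
For reference, the Markdown annotations follow the Rasa NLU style: entity values in square brackets, immediately followed by the entity label in parentheses. A minimal illustrative sketch (the actual training file is examples/restaurent_search.md):

## intent:restaurant_search
- show me [mexican](cuisine) restaurants
- search for a place to eat in [madrid](location)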

$ python -m spacy_crfsuite.train examples/restaurent_search.md -c examples/default-config.json -o model/ -lm en_core_web_sm
ℹ Loading config from disk
✔ Successfully loaded config from file.
examples/default-config.json
ℹ Loading training examples.
✔ Successfully loaded 15 training examples from file.
examples/restaurent_search.md
ℹ Using spaCy model: en_core_web_sm
ℹ Training entity tagger with CRF.
ℹ Saving model to disk
✔ Successfully saved model to file.
model/model.pkl

Below is a command line to test the CRF model and print a classification report (in this example we evaluate on the training set; normally you would use a held-out set). The tags follow the BILOU scheme: Begin, Inside, Last, and Unit for single-token entities.

$ python -m spacy_crfsuite.eval examples/restaurent_search.md -m model/model.pkl -lm en_core_web_sm
ℹ Loading model from file
model/model.pkl
✔ Successfully loaded CRF tagger
<spacy_crfsuite.crf_extractor.CRFExtractor object at 0x126e5f438>
ℹ Loading dev dataset from file
examples/restaurent_search.md
✔ Successfully loaded 15 dev examples.
ℹ Using spaCy model: en_core_web_sm
ℹ Classification Report:
              precision    recall  f1-score   support

   B-cuisine      1.000     1.000     1.000         2
   I-cuisine      1.000     1.000     1.000         1
   L-cuisine      1.000     1.000     1.000         2
   U-cuisine      1.000     1.000     1.000         5
  U-location      1.000     1.000     1.000         7

   micro avg      1.000     1.000     1.000        17
   macro avg      1.000     1.000     1.000        17
weighted avg      1.000     1.000     1.000        17

Now we can use the tagger for named entity recognition in a spaCy pipeline!

import spacy

from spacy.language import Language
from spacy_crfsuite import CRFEntityExtractor, CRFExtractor


@Language.factory("ner_crf")
def create_component(nlp, name):
    crf_extractor = CRFExtractor().from_disk("model/model.pkl")
    return CRFEntityExtractor(nlp, crf_extractor=crf_extractor)


nlp = spacy.load("en_core_web_sm", disable=["ner"])
nlp.add_pipe("ner_crf")

doc = nlp("show mexican restaurents up north")
for ent in doc.ents:
    print(ent.text, "--", ent.label_)

# Output:
# mexican -- cuisine
# north -- location

Or, alternatively, use it as a standalone component:

from spacy_crfsuite import CRFExtractor
from spacy_crfsuite.tokenizer import SpacyTokenizer

crf_extractor = CRFExtractor().from_disk("model/model.pkl")
tokenizer = SpacyTokenizer()

example = {"text": "show mexican restaurents up north"}
tokenizer.tokenize(example, attribute="text")
crf_extractor.process(example)

# Output:
# [{'start': 5,
#   'end': 12,
#   'value': 'mexican',
#   'entity': 'cuisine',
#   'confidence': 0.5823148506311286},
#  {'start': 28,
#   'end': 33,
#   'value': 'north',
#   'entity': 'location',
#   'confidence': 0.8863076478494413}]

We can also take a look at what the model learned.

Use the .explain() method to understand the model's decisions. Feature names are prefixed with the token offset they were extracted from, e.g. -1:low:the is the lowercased form of the previous token.

print(crf_extractor.explain())

# Output:
#
# Most likely transitions:
# O          -> O          1.637338
# B-cuisine  -> I-cuisine  1.373766
# U-cuisine  -> O          1.306077
# I-cuisine  -> L-cuisine  0.915989
# O          -> U-location 0.751463
# B-cuisine  -> L-cuisine  0.698893
# O          -> U-cuisine  0.480360
# U-location -> U-cuisine  0.403487
# O          -> B-cuisine  0.261450
# L-cuisine  -> O          0.182695
# 
# Positive features:
# 1.976502 O          0:bias:bias
# 1.957180 U-location -1:low:the
# 1.216547 B-cuisine  -1:low:for
# 1.153924 U-location 0:prefix5:centr
# 1.153924 U-location 0:prefix2:ce
# 1.110536 U-location 0:digit
# 1.058294 U-cuisine  0:prefix5:chine
# 1.058294 U-cuisine  0:prefix2:ch
# 1.051457 U-cuisine  0:suffix2:an
# 0.999976 U-cuisine  -1:low:me

Note: you can also access the crf_extractor directly with nlp.get_pipe("ner_crf").crf_extractor.
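
For example, to pull the extractor out of a loaded pipeline and inspect it (assuming the pipeline was registered as "ner_crf", as in the snippets above):

crf_extractor = nlp.get_pipe("ner_crf").crf_extractor
print(crf_extractor.explain())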

Deploy to a web server

Start a web service

$ pip install uvicorn
$ uvicorn spacy_crfsuite.serve:app --host 127.0.0.1 --port 5000

Note: set the $SPACY_MODEL and $CRF_MODEL environment variables to control which spaCy model and CRF model the server loads.
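
For example, to serve the restaurant model trained above (assuming the server reads these two variables, as described):

$ SPACY_MODEL=en_core_web_sm CRF_MODEL=model/model.pkl uvicorn spacy_crfsuite.serve:app --host 127.0.0.1 --port 5000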

cURL example

$ curl -X POST http://127.0.0.1:5000/parse -H 'Content-Type: application/json' -d '{"text": "George Walker Bush (born July 6, 1946) is an American politician and businessman who served as the 43rd president of the United States from 2001 to 2009."}'
{
  "data": [
    {
      "text": "George Walker Bush (born July 6, 1946) is an American politician and businessman who served as the 43rd president of the United States from 2001 to 2009.",
      "entities": [
        {
          "start": 0,
          "end": 18,
          "value": "George Walker Bush",
          "entity": "PER"
        },
        {
          "start": 45,
          "end": 53,
          "value": "American",
          "entity": "MISC"
        },
        {
          "start": 121,
          "end": 134,
          "value": "United States",
          "entity": "LOC"
        }
      ]
    }
  ]
}
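
The same request can be made from Python. A minimal sketch using the requests library (a hypothetical client; the /parse endpoint and response shape follow the cURL example above):

import requests

# POST a text to the /parse endpoint and print the extracted entities.
resp = requests.post(
    "http://127.0.0.1:5000/parse",
    json={"text": "George Walker Bush (born July 6, 1946) is an American politician."},
)
resp.raise_for_status()

for doc in resp.json()["data"]:
    for ent in doc["entities"]:
        print(ent["value"], "--", ent["entity"])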

Development

Set up env

$ poetry install
$ poetry run spacy download en_core_web_sm

Run unit tests

$ poetry run pytest

Run black (code formatting)

$ poetry run black spacy_crfsuite/ --config=pyproject.toml
