spaCy wrapper for Hugging Face Transformers pipelines

spacy-huggingface-pipelines: Use pretrained transformer models for text and token classification

This package provides spaCy components to use pretrained Hugging Face Transformers pipelines for inference only.

🚀 Installation

Installing the package from pip will automatically install all dependencies, including PyTorch and spaCy.

pip install -U pip setuptools wheel
pip install spacy-huggingface-pipelines

For GPU installation, follow the spaCy installation quickstart with GPU, e.g.

pip install -U "spacy[cuda-autodetect]"

If you are having trouble installing PyTorch, follow the instructions on the official website for your specific operating system and requirements.

📖 Documentation

This module provides spaCy wrappers for two inference-only transformers pipelines: TokenClassificationPipeline and TextClassificationPipeline.

The models are downloaded on initialization from the Hugging Face Hub if they're not already in your local cache, or alternatively they can be loaded from a local path.

Note that the transformer model data is not saved with the pipeline when you call nlp.to_disk. If you load pipelines in an environment with limited internet access, make sure the model is available in your transformers cache directory and enable offline mode if needed.
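As a rough sketch of the offline setup described above: the Hugging Face libraries honor the HF_HUB_OFFLINE and TRANSFORMERS_OFFLINE environment variables, and it is safest to set them before those libraries are imported. The helper name load_saved_pipeline is hypothetical, and the sketch assumes the model is already in the local cache and that this package is installed so its factories are registered.

```python
import os

# Set the offline flags first, before transformers / huggingface_hub
# get imported anywhere in the program.
os.environ["HF_HUB_OFFLINE"] = "1"
os.environ["TRANSFORMERS_OFFLINE"] = "1"


def load_saved_pipeline(path):
    """Load a pipeline that was written earlier with nlp.to_disk(path).

    Hypothetical helper: the transformer weights are not stored under
    `path`, so they must already be in the local transformers cache
    (or at the local model path named in the component config).
    """
    import spacy  # imported lazily, after the offline flags are set

    return spacy.load(path)
```

Calling load_saved_pipeline("/path/to/pipeline") should then fail fast if the model is missing from the cache, rather than attempting a download.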

Token classification

Config settings for hf_token_pipe:

[components.hf_token_pipe]
factory = "hf_token_pipe"
model = "dslim/bert-base-NER"     # Model name or path
revision = "main"                 # Model revision
aggregation_strategy = "average"  # "simple", "first", "average", "max"
stride = 16                       # If stride >= 0, process long texts in
                                  # overlapping windows of the model max
                                  # length. The value is the length of the
                                  # window overlap in transformer tokenizer
                                  # tokens, NOT the length of the stride.
kwargs = {}                       # Any additional arguments for
                                  # TokenClassificationPipeline
alignment_mode = "strict"         # "strict", "contract", "expand"
annotate = "ents"                 # "ents", "pos", "spans", "tag"
annotate_spans_key = null         # Doc.spans key for annotate = "spans"
scorer = null                     # Optional scorer
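To make the aggregation_strategy values in the config above more concrete, here is a simplified pure-Python sketch of how "first", "average" and "max" could resolve one word whose subwords disagree. It is an illustration based on the descriptions in the transformers documentation, not the library's implementation; the function name aggregate_word and the scores are made up, and "simple" (which groups adjacent tokens with the same tag) is not covered.

```python
from collections import defaultdict


def aggregate_word(subword_scores, strategy):
    """Pick one (label, score) for a word from its subword predictions.

    subword_scores: list of {label: score} dicts, one per subword.
    Assumes every dict contains the same set of labels.
    """
    if strategy == "first":
        # Use the prediction of the first subword only.
        scores = subword_scores[0]
    elif strategy == "max":
        # Use the subword whose best label has the highest score.
        scores = max(subword_scores, key=lambda s: max(s.values()))
    elif strategy == "average":
        # Average each label's score over all subwords.
        totals = defaultdict(float)
        for s in subword_scores:
            for label, score in s.items():
                totals[label] += score
        scores = {lab: tot / len(subword_scores) for lab, tot in totals.items()}
    else:
        raise ValueError(f"unsupported strategy: {strategy}")
    label = max(scores, key=scores.get)
    return label, scores[label]


# Made-up scores for one word split into three subwords.
subwords = [
    {"LOC": 0.60, "ORG": 0.40},
    {"LOC": 0.05, "ORG": 0.95},
    {"LOC": 0.70, "ORG": 0.30},
]
for strategy in ("first", "average", "max"):
    print(strategy, aggregate_word(subwords, strategy))
```

With these made-up scores, "first" picks LOC while "average" and "max" pick ORG, which is the kind of disagreement the setting controls.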

TokenClassificationPipeline settings

  • model: The model name or path.
  • revision: The model revision. For production use, a specific git commit is recommended instead of the default main.
  • stride: For stride >= 0, the text is processed in overlapping windows where the stride setting specifies the number of overlapping tokens between windows (NOT the stride length). If stride is None, then the text may be truncated. stride is only supported for fast tokenizers.
  • aggregation_strategy: The aggregation strategy determines the word-level tags for cases where subwords within one word do not receive the same predicted tag. See: https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.TokenClassificationPipeline.aggregation_strategy
  • kwargs: Any additional arguments to TokenClassificationPipeline.
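Because the stride setting names the overlap rather than the step, a small sketch may help. The function below is hypothetical and ignores special tokens such as [CLS] and [SEP], which reduce the usable window in a real tokenizer; it only shows how the window boundaries follow from the model max length and the overlap.

```python
def window_bounds(n_tokens, max_length, overlap):
    """Return (start, end) token offsets of overlapping windows.

    `overlap` plays the role of the `stride` setting: the number of
    tokens shared by consecutive windows. The step between window
    starts is therefore max_length - overlap.
    """
    if not 0 <= overlap < max_length:
        raise ValueError("overlap must satisfy 0 <= overlap < max_length")
    step = max_length - overlap
    bounds = []
    start = 0
    while True:
        end = min(start + max_length, n_tokens)
        bounds.append((start, end))
        if end >= n_tokens:
            break
        start += step
    return bounds


print(window_bounds(1000, 512, 16))
# [(0, 512), (496, 1000)]
```

Here the second window starts at token 496, so tokens 496 to 511 appear in both windows: an overlap of 16 and a step of 496.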

spaCy settings

  • alignment_mode determines how transformer predictions are aligned to spaCy token boundaries as described for Doc.char_span.
  • annotate and annotate_spans_key configure how the annotation is saved to the spaCy doc. You can save the output as token.tag_, token.pos_ (only for UPOS tags), doc.ents or doc.spans.

Examples

  1. Save named entity annotation as Doc.ents:

import spacy
nlp = spacy.blank("en")
nlp.add_pipe("hf_token_pipe", config={"model": "dslim/bert-base-NER"})
doc = nlp("My name is Sarah and I live in London")
print(doc.ents)
# (Sarah, London)

  2. Save named entity annotation as Doc.spans[spans_key] and scores as Doc.spans[spans_key].attrs["scores"]:

import spacy
nlp = spacy.blank("en")
nlp.add_pipe(
    "hf_token_pipe",
    config={
        "model": "dslim/bert-base-NER",
        "annotate": "spans",
        "annotate_spans_key": "bert-base-ner",
    },
)
doc = nlp("My name is Sarah and I live in London")
print(doc.spans["bert-base-ner"])
# [Sarah, London]
print(doc.spans["bert-base-ner"].attrs["scores"])
# [0.99854773, 0.9996215]

  3. Save fine-grained tags as Token.tag:

import spacy
nlp = spacy.blank("en")
nlp.add_pipe(
    "hf_token_pipe",
    config={
        "model": "QCRI/bert-base-multilingual-cased-pos-english",
        "annotate": "tag",
    },
)
doc = nlp("My name is Sarah and I live in London")
print([t.tag_ for t in doc])
# ['PRP$', 'NN', 'VBZ', 'NNP', 'CC', 'PRP', 'VBP', 'IN', 'NNP']

  4. Save coarse-grained tags as Token.pos:

import spacy
nlp = spacy.blank("en")
nlp.add_pipe(
    "hf_token_pipe",
    config={"model": "vblagoje/bert-english-uncased-finetuned-pos", "annotate": "pos"},
)
doc = nlp("My name is Sarah and I live in London")
print([t.pos_ for t in doc])
# ['PRON', 'NOUN', 'AUX', 'PROPN', 'CCONJ', 'PRON', 'VERB', 'ADP', 'PROPN']
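When annotations are saved to Doc.spans, the scores are stored in the same order as the spans, so the two can be zipped together. The helper below is a hypothetical sketch of filtering by a score threshold; it uses plain strings as stand-ins for spans so that it runs without downloading a model.

```python
def filter_by_score(spans, scores, threshold):
    """Keep the spans whose score is at least `threshold`.

    `spans` and `scores` are parallel sequences, as in
    doc.spans[key] and doc.spans[key].attrs["scores"].
    """
    return [span for span, score in zip(spans, scores) if score >= threshold]


# Stand-ins for the spans and scores printed in the spans example above.
spans = ["Sarah", "London"]
scores = [0.99854773, 0.9996215]
print(filter_by_score(spans, scores, threshold=0.999))
# ['London']
```

With a real doc, the call would presumably look like filter_by_score(doc.spans["bert-base-ner"], doc.spans["bert-base-ner"].attrs["scores"], 0.999).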

Text classification

Config settings for hf_text_pipe:

[components.hf_text_pipe]
factory = "hf_text_pipe"
model = "distilbert-base-uncased-finetuned-sst-2-english"  # Model name or path
revision = "main"                 # Model revision
kwargs = {}                       # Any additional arguments for
                                  # TextClassificationPipeline
scorer = null                     # Optional scorer

The input texts are truncated according to the transformers model max length.

TextClassificationPipeline settings

  • model: The model name or path.
  • revision: The model revision. For production use, a specific git commit is recommended instead of the default main.
  • kwargs: Any additional arguments to TextClassificationPipeline.

Example

import spacy

nlp = spacy.blank("en")
nlp.add_pipe(
    "hf_text_pipe",
    config={"model": "distilbert-base-uncased-finetuned-sst-2-english"},
)
doc = nlp("This is great!")
print(doc.cats)
# {'POSITIVE': 0.9998694658279419, 'NEGATIVE': 0.00013048505934420973}
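Since doc.cats is a plain dict of label scores, picking the predicted label is a one-liner. The sketch below wraps it in a hypothetical helper, top_label, that also handles an empty dict (for instance when a doc was returned without annotation) and an optional minimum score; the scores are copied from the output above.

```python
def top_label(cats, min_score=0.0):
    """Return the highest-scoring label, or None if there is none.

    None is returned for an empty dict or when the best score is
    below `min_score`.
    """
    if not cats:
        return None
    label = max(cats, key=cats.get)
    return label if cats[label] >= min_score else None


cats = {"POSITIVE": 0.9998694658279419, "NEGATIVE": 0.00013048505934420973}
print(top_label(cats))
# POSITIVE
```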

Batching and GPU

Both token and text classification support batching with nlp.pipe:

for doc in nlp.pipe(texts, batch_size=256):
    do_something(doc)

If the component runs into an error processing a batch (e.g. on an empty text), nlp.pipe will back off to processing each text individually. If it runs into an error on an individual text, a warning is shown and the doc is returned without additional annotation.
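The back-off behavior described above follows a common pattern, sketched here in plain Python. This is an illustration of the pattern only, not the package's code: annotate_with_backoff and fake_annotate are hypothetical names, and None stands in for a doc returned without annotation.

```python
import warnings


def annotate_with_backoff(texts, annotate_batch):
    """Try the whole batch first, then fall back to one text at a time."""
    try:
        return annotate_batch(texts)
    except Exception:
        pass  # fall through to per-text processing
    results = []
    for text in texts:
        try:
            results.extend(annotate_batch([text]))
        except Exception as err:
            # Warn and keep a placeholder so the output stays aligned
            # with the input.
            warnings.warn(f"Could not annotate text {text!r}: {err}")
            results.append(None)
    return results


def fake_annotate(batch):
    """Hypothetical annotator that fails on empty texts."""
    if any(not text for text in batch):
        raise ValueError("empty text")
    return [text.upper() for text in batch]


print(annotate_with_backoff(["a", "", "b"], fake_annotate))
# [ 'A', None, 'B' ] is printed as ['A', None, 'B'], plus one warning
```

One failing text costs a second pass over its batch, so very large batch sizes can make such failures more expensive.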

Switch to GPU:

import spacy
spacy.require_gpu()

for doc in nlp.pipe(texts):
    do_something(doc)

Bug reports and issues

Please report bugs in the spaCy issue tracker or open a new thread on the discussion board for other issues.
