Spacy to HF converter

spacy-to-hf

A simple converter from SpaCy Entities (Spans) to Huggingface BILOU formatted data (tokens and ner_tags)

I've always struggled to convert my spaCy-formatted spans into data that can be used to train Hugging Face transformers, even though spaCy's entity format is the most intuitive format for tagging entities for NER.

This repo is a simple converter that leverages spacy.gold.biluo_tags_from_offsets and the SpaCy tokenizations repo to provide a one-line function that converts spaCy-formatted spans into tokens and ner_tags, which can be fed into any token classification transformer.

Try before you buy

You can demo the functionality on Streamlit or Hugging Face Spaces.

What is "Spacy" or "HuggingFace" format?

Spacy format simply means having a text input and character level span assignments.
For example:

text = "Hello, my name is Ben"
spans = [{"start": 18, "end": 21, "label": "person"}, ...]
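A quick way to sanity-check spans in this format is to slice the text with each span's offsets. In the example above the end offset is exclusive, as in Python slicing. The helper name `span_texts` below is hypothetical and is not part of spacy-to-hf; this is only a minimal sketch.

```python
def span_texts(text, spans):
    """Return (label, surface text) for each character-level span."""
    # text[start:end] treats `end` as exclusive, matching the example offsets.
    return [(s["label"], text[s["start"]:s["end"]]) for s in spans]


text = "Hello, my name is Ben"
spans = [{"start": 18, "end": 21, "label": "person"}]
print(span_texts(text, spans))  # [('person', 'Ben')]
```

If a slice comes back with stray whitespace or a partial word, the offsets are likely off by one.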

This is the common structure of output data from labeling tools like Label Studio or Labelbox, because it's easy to produce and human-interpretable.

Huggingface format refers to the BIO/BILOU/BIOES tagging format commonly used for fine-tuning transformers. The input text is tokenized, and each token is given a tag denoting whether it is part of an entity and, if so, its position within that entity (Beginning, Inside, etc.). Here's an example: https://huggingface.co/datasets/wikiann

For more information about this tagging system, see wikipedia

This format is tricky, though, because it is entirely dependent on the tokenizer used. Tokens are not simply space-separated words. Each tokenizer has a specific vocabulary of tokens and breaks words down into sub-words from that vocabulary. So moving from character-level spans to token-level tags is a very manual process. That's a core reason I built this tool.
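To make the character-to-token step concrete, here is a minimal sketch that assigns BILOU tags to tokens based on their character offsets. It uses a simple regex tokenizer as a stand-in, which is an assumption for illustration only: a real transformer tokenizer would split words into sub-words, and this is not how spacy-to-hf is implemented internally. The function name `char_spans_to_bilou` is hypothetical.

```python
import re


def char_spans_to_bilou(text, spans):
    """Tag regex tokens with BILOU labels derived from character-level spans."""
    # Stand-in tokenizer: words and single punctuation marks, with offsets.
    tokens = [(m.group(), m.start(), m.end())
              for m in re.finditer(r"\w+|[^\w\s]", text)]
    tags = ["O"] * len(tokens)
    for span in spans:
        # Tokens lying fully inside the span; partial overlaps stay "O".
        inside = [i for i, (_, start, end) in enumerate(tokens)
                  if start >= span["start"] and end <= span["end"]]
        if not inside:
            continue
        label = span["label"]
        if len(inside) == 1:
            tags[inside[0]] = f"U-{label}"      # Unit: single-token entity
        else:
            tags[inside[0]] = f"B-{label}"      # Beginning
            for i in inside[1:-1]:
                tags[i] = f"I-{label}"          # Inside
            tags[inside[-1]] = f"L-{label}"     # Last
    return [tok for tok, _, _ in tokens], tags


tokens, tags = char_spans_to_bilou(
    "Hello, my name is Ben",
    [{"start": 18, "end": 21, "label": "person"}],
)
print(list(zip(tokens, tags)))
```

With a sub-word tokenizer the same entity may span several tokens, so the tags change even though the character spans do not.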

Installation

pip install spacy-to-hf
python -m spacy download en_core_web_sm

Usage

from spacy_to_hf import spacy_to_hf

span_data = [
    {
        "text": "I have a BSc (Bachelors of Computer Sciences) from NYU",
        "spans": [
            {"start": 9, "end": 12, "label": "degree"},
            {"start": 14, "end": 44, "label": "degree"},
            {"start": 51, "end": 54, "label": "university"}
        ]
    }
]
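# Convert the character-level spans to tokens and ner_tags using the bert-base-cased tokenizer
hf_data = spacy_to_hf(span_data, "bert-base-cased")
# Pair each token of the first example with its tag
print(list(zip(hf_data["tokens"][0], hf_data["ner_tags"][0])))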

Or, if you want to immediately start fine-tuning or upload this to huggingface, you can run

ds = spacy_to_hf(span_data, "bert-base-cased", as_hf_dataset=True)

print(ds.features["ner_tags"].feature.names)

This returns your data as a HuggingFace Dataset and automatically string-indexes your ner_tags into a ClassLabel object, so each tag is stored as an integer id alongside the list of tag names.
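For readers unfamiliar with string-indexing, the idea can be sketched in plain Python: collect the distinct tag names and replace each tag with its position in that list. The helper `index_tags` is hypothetical, and the sorted ordering is an assumption; the ordering spacy-to-hf actually uses may differ.

```python
def index_tags(tag_sequences):
    """Map string tags to integer ids; return (names, encoded sequences)."""
    # Assumption: names are sorted alphabetically to get a stable ordering.
    names = sorted({tag for seq in tag_sequences for tag in seq})
    tag_to_id = {name: i for i, name in enumerate(names)}
    encoded = [[tag_to_id[tag] for tag in seq] for seq in tag_sequences]
    return names, encoded


names, encoded = index_tags([["O", "U-person"], ["B-degree", "L-degree", "O"]])
print(names)    # ['B-degree', 'L-degree', 'O', 'U-person']
print(encoded)  # [[2, 3], [0, 1, 2]]
```

The list of names plays the same role as the names printed from the ClassLabel feature above: it lets you map integer predictions back to readable tags.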

Project Setup

Project setup is credited to @anthonycorletti and his awesome project template repo
