Convert IOB2 NER span annotations into integer label sequences aligned to any HuggingFace-compatible tokenizer.
iob2labels
Convert IOB2-format NER span annotations into integer label sequences for Transformer-based token classification tasks.
If you use annotation tools like Prodigy, Label Studio, or Doccano to annotate NER data, this library converts those character-offset span annotations into the label arrays you need for training.
Installation
uv add iob2labels
Dependencies: tokenizers (HuggingFace Rust-backed tokenizer) and pydantic. No torch or transformers required.
Quick Start
from iob2labels import IOB2Encoder
encoder = IOB2Encoder(
    labels=["actor", "character", "plot"],
    tokenizer="bert-base-uncased",
)
labels = encoder(
    text="Did Dame Judy Dench star in a British film about Queen Elizabeth?",
    spans=[
        {"label": "actor", "start": 4, "end": 19},
        {"label": "plot", "start": 30, "end": 37},
        {"label": "character", "start": 49, "end": 64},
    ]
)
# >>> [-100, 0, 1, 2, 2, 2, 0, 0, 0, 5, 0, 0, 3, 4, 0, -100]
Example pulled from the MITMovie dataset.
The output is a list[int] aligned to the tokenizer's output. Convert to a tensor or array as needed:
import torch
x = torch.tensor(labels)
# or with numpy
import numpy as np
x = np.array(labels)
How It Works
The IOB2 format assigns each token one of three tag types:
- O (Outside) - not part of any entity
- B-LABEL (Beginning) - first token of an entity
- I-LABEL (Inside) - continuation of an entity
Each entity class generates two labels (B and I), and all classes share a single O label, so n entity classes always yield (n * 2) + 1 labels. The three classes from the Quick Start produce seven:
encoder.label_map
# >>> {'O': 0, 'B-ACTOR': 1, 'I-ACTOR': 2, 'B-CHARACTER': 3, 'I-CHARACTER': 4, 'B-PLOT': 5, 'I-PLOT': 6}
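A label map of this shape can be built in a few lines. The sketch below is an illustration only, not the library's own code: `build_label_map` is a hypothetical helper, and the uppercase tag names are assumed from the output shown above. It also inverts the map to turn the Quick Start output back into readable tags.

```python
def build_label_map(labels: list[str]) -> dict[str, int]:
    """Map 'O' to 0, then give each entity class a consecutive B-/I- pair."""
    label_map = {"O": 0}
    for name in labels:
        tag = name.upper()
        label_map[f"B-{tag}"] = len(label_map)
        label_map[f"I-{tag}"] = len(label_map)
    return label_map


label_map = build_label_map(["actor", "character", "plot"])

# Invert the map to decode integer labels back into tag strings.
id_to_tag = {index: tag for tag, index in label_map.items()}

# Quick Start output; -100 marks special tokens and has no tag of its own.
encoded = [-100, 0, 1, 2, 2, 2, 0, 0, 0, 5, 0, 0, 3, 4, 0, -100]
tags = [id_to_tag.get(i, "IGNORED") for i in encoded]
```

Decoding like this is a quick way to eyeball whether an encoded sequence lines up with the annotation.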
Special tokens (e.g., [CLS], [SEP]) receive the ignore value -100, which matches the default ignore_index of PyTorch's CrossEntropyLoss, so those positions are excluded from the loss.
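To make the alignment step concrete, here is a minimal sketch of mapping character-offset spans onto token offsets. It is not the library's implementation: `align_spans` is a hypothetical helper, the token offsets and special-token mask are written by hand on the assumption that every word and the final period is a single token, and a token only counts as part of an entity if it lies entirely inside the span.

```python
IGNORE = -100
LABEL_MAP = {"O": 0, "B-ACTOR": 1, "I-ACTOR": 2, "B-CHARACTER": 3, "I-CHARACTER": 4}


def align_spans(offsets, special_mask, spans, label_map):
    """Assign one integer label per token from character-offset spans."""
    token_labels = []
    for (tok_start, tok_end), is_special in zip(offsets, special_mask):
        if is_special:
            token_labels.append(IGNORE)  # e.g. [CLS] / [SEP]
            continue
        label = label_map["O"]
        for span in spans:
            if tok_start >= span["start"] and tok_end <= span["end"]:
                # The token that opens the span gets B-, the rest get I-.
                prefix = "B" if tok_start == span["start"] else "I"
                label = label_map[f"{prefix}-{span['label'].upper()}"]
                break
        token_labels.append(label)
    return token_labels


text = "Matt Damon was Jason Bourne."
# Hand-written (start, end) offsets: [CLS] matt damon was jason bourne . [SEP]
offsets = [(0, 0), (0, 4), (5, 10), (11, 14), (15, 20), (21, 27), (27, 28), (0, 0)]
special_mask = [1, 0, 0, 0, 0, 0, 0, 1]
spans = [
    {"label": "actor", "start": 0, "end": 10},
    {"label": "character", "start": 15, "end": 27},
]

token_labels = align_spans(offsets, special_mask, spans, LABEL_MAP)
# [-100, 1, 2, 0, 3, 4, 0, -100]
```

With a real tokenizer, the offsets and special-token mask would come from its encoding output rather than being typed in.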
Tokenizer Input
The tokenizer argument accepts three forms:
from iob2labels import IOB2Encoder
# entity class names (not the encoded output named `labels` in the Quick Start)
entity_labels = ["actor", "character", "plot"]
# 1. checkpoint name (downloads from HuggingFace Hub)
encoder = IOB2Encoder(labels=entity_labels, tokenizer="bert-base-uncased")
# 2. standalone tokenizers.Tokenizer instance
from tokenizers import Tokenizer
tok = Tokenizer.from_pretrained("bert-base-uncased")
encoder = IOB2Encoder(labels=entity_labels, tokenizer=tok)
# 3. transformers PreTrainedTokenizerFast (unwrapped automatically)
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = IOB2Encoder(labels=entity_labels, tokenizer=tok)
Batch Encoding
annotations = [
    {"text": "Did Dame Judy Dench star?", "spans": [{"label": "actor", "start": 4, "end": 19}]},
    {"text": "Matt Damon was Jason Bourne.", "spans": [{"label": "actor", "start": 0, "end": 10}]},
]
results = encoder.batch(annotations)
# >>> [[-100, 0, 1, 2, 2, 2, 0, 0, -100], [-100, 1, 2, 0, 0, 0, 0, -100]]
The batch path uses the Rust-backed encode_batch() for parallelized tokenization. It returns list[list[int]] without padding, so sequences in a batch may differ in length; use HuggingFace's DataCollatorForTokenClassification or your own padding for training.
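If you pad the labels yourself, one simple approach is to right-pad every sequence to the longest one in the batch with the ignore value, so padded positions stay out of the loss. The sketch below assumes right-padding and uses a hypothetical helper, `pad_labels`, with made-up ragged input; the input IDs and attention mask need matching padding on the tokenizer side.

```python
IGNORE = -100


def pad_labels(batch: list[list[int]], pad_value: int = IGNORE) -> list[list[int]]:
    """Right-pad each label sequence to the length of the longest one."""
    max_len = max((len(seq) for seq in batch), default=0)
    return [seq + [pad_value] * (max_len - len(seq)) for seq in batch]


# Hypothetical unpadded label sequences of different lengths.
ragged = [[-100, 0, 1, 2, 0, -100], [-100, 1, 2, 0, -100]]
padded = pad_labels(ragged)
# [[-100, 0, 1, 2, 0, -100], [-100, 1, 2, 0, -100, -100]]
```

Once every row has the same length, the result converts directly into a rectangular tensor or array.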
Custom Field Names
If your annotation data uses non-standard field names, configure them at construction:
# BioMed-NER dataset uses "entities" and "class" instead of "spans" and "label"
encoder = IOB2Encoder(
    labels=["organism", "chemicals"],
    tokenizer="bert-base-uncased",
    spans_field="entities",
    label_field="class",
)
Built-in Conversion Check
By default, every encoding is verified by recovering the entity text from the produced labels and comparing it to the original annotation. This catches misalignment bugs early. Disable it for performance in production:
encoder = IOB2Encoder(labels=labels, tokenizer=tok, conversion_check=False)
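The idea behind such a round-trip check can be sketched as follows. This is an illustration under assumptions, not the library's actual checker: `recover_entities` and `passes_check` are hypothetical helpers, and the token offsets are hand-written for a sentence where each word is assumed to be a single token.

```python
def recover_entities(text, labels, offsets, id_to_tag):
    """Rebuild (label, entity text) pairs from integer labels and token offsets."""
    entities = []
    current = None  # [label, start, end] of the entity being assembled
    for label_id, (start, end) in zip(labels, offsets):
        tag = id_to_tag.get(label_id, "O")  # -100 is treated as outside
        if tag.startswith("B-"):
            if current:
                entities.append(current)
            current = [tag[2:], start, end]
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[2] = end  # extend the open entity
        else:
            if current:
                entities.append(current)
            current = None
    if current:
        entities.append(current)
    return [(name, text[s:e]) for name, s, e in entities]


def passes_check(text, spans, labels, offsets, id_to_tag):
    """Compare recovered entity text against the original annotation."""
    expected = [(s["label"].upper(), text[s["start"]:s["end"]]) for s in spans]
    return recover_entities(text, labels, offsets, id_to_tag) == expected


id_to_tag = {0: "O", 1: "B-ACTOR", 2: "I-ACTOR", 3: "B-CHARACTER", 4: "I-CHARACTER"}
text = "Matt Damon was Jason Bourne."
offsets = [(0, 0), (0, 4), (5, 10), (11, 14), (15, 20), (21, 27), (27, 28), (0, 0)]
spans = [
    {"label": "actor", "start": 0, "end": 10},
    {"label": "character", "start": 15, "end": 27},
]

good = [-100, 1, 2, 0, 3, 4, 0, -100]
shifted = [-100, 0, 1, 2, 3, 4, 0, -100]  # actor labels off by one token

ok = passes_check(text, spans, good, offsets, id_to_tag)
bad = passes_check(text, spans, shifted, offsets, id_to_tag)
```

Here the shifted sequence recovers "Damon was" instead of "Matt Damon", which is the kind of misalignment a check like this is meant to surface.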
Supported Tokenizers
Tested across three tokenizer families:
| Family | Checkpoints |
|---|---|
| WordPiece | bert-base-cased, bert-base-uncased, bert-large-cased, bert-large-uncased, distilbert-base-cased, distilbert-base-uncased, google/electra-base-discriminator |
| BPE | roberta-base, roberta-large |
| SentencePiece | albert-base-v2, xlnet-base-cased, t5-small |
Other HuggingFace-compatible tokenizers should work as well. The built-in conversion check will flag any issues.
Tests
uv run pytest tests/ -v
The test suite includes unit tests for label map construction, entity range detection, and the conversion checker, plus a parametrized matrix of 12 tokenizer checkpoints across multiple annotation edge cases (entities at text boundaries, adjacent entities, punctuation, etc.).