Convert IOB2 NER span annotations into integer label sequences aligned to any HuggingFace-compatible tokenizer.

iob2labels

Convert IOB2-format NER span annotations into integer label sequences for Transformer-based token classification tasks.

If you use annotation tools like Prodigy, Label Studio, or Doccano to annotate NER data, this library converts those character-offset span annotations into the label arrays you need for training.

Installation

uv add iob2labels

Dependencies: tokenizers (HuggingFace Rust-backed tokenizer) and pydantic. No torch or transformers required.

Quick Start

from iob2labels import IOB2Encoder

encoder = IOB2Encoder(
    labels=["actor", "character", "plot"],
    tokenizer="bert-base-uncased",
)

labels = encoder(
    text="Did Dame Judy Dench star in a British film about Queen Elizabeth?",
    spans=[
        {"label": "actor", "start": 4, "end": 19},
        {"label": "plot", "start": 30, "end": 37},
        {"label": "character", "start": 49, "end": 64},
    ]
)
# >>> [-100, 0, 1, 2, 2, 2, 0, 0, 0, 5, 0, 0, 3, 4, 0, -100]

Example pulled from the MITMovie dataset.

The output is a list[int] aligned to the tokenizer's output. Convert to a tensor or array as needed:

import torch
x = torch.tensor(labels)

# or with numpy
import numpy as np
x = np.array(labels)

How It Works

The IOB2 format assigns each token one of three tag types:

O (Outside) - not part of any entity
B-LABEL (Beginning) - first token of an entity
I-LABEL (Inside) - continuation of an entity
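
The tagging rule above can be illustrated without any model tokenizer. The following is a minimal sketch, not the library's implementation: `tokenize_with_offsets` and `tag_tokens` are hypothetical helpers, and a regex word splitter stands in for a real subword tokenizer. It assumes each span starts exactly at a token boundary.

```python
import re


def tokenize_with_offsets(text):
    """Stand-in tokenizer: words and punctuation with (start, end) character offsets."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\w+|[^\w\s]", text)]


def tag_tokens(text, spans):
    """Assign an IOB2 tag string to each token from character-offset spans."""
    tags = []
    for _, tok_start, tok_end in tokenize_with_offsets(text):
        tag = "O"
        for span in spans:
            # A token belongs to a span when it lies fully inside the span.
            if span["start"] <= tok_start and tok_end <= span["end"]:
                # Assumption: the span begins on a token boundary, so the
                # token sharing the span's start offset is the B- token.
                prefix = "B" if tok_start == span["start"] else "I"
                tag = f"{prefix}-{span['label'].upper()}"
                break
        tags.append(tag)
    return tags


text = "Did Dame Judy Dench star in a British film about Queen Elizabeth?"
spans = [
    {"label": "actor", "start": 4, "end": 19},
    {"label": "plot", "start": 30, "end": 37},
    {"label": "character", "start": 49, "end": 64},
]
tags = tag_tokens(text, spans)
print(tags)
```

With a subword tokenizer the same rule applies per subword, which is why a single word can produce several I- labels in a row.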

Each entity class generates two labels (B and I), and the O class adds one more, so n entity classes always yield (n * 2) + 1 labels:

encoder.label_map
# >>> {'O': 0, 'B-ACTOR': 1, 'I-ACTOR': 2, 'B-CHARACTER': 3, 'I-CHARACTER': 4, 'B-PLOT': 5, 'I-PLOT': 6}
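
A label map of this shape can be built in a few lines. This is a sketch under the assumption that class names are upper-cased and numbered in the order supplied, which is what the output above suggests; `build_label_map` is a hypothetical name, not part of the library's documented API.

```python
def build_label_map(labels):
    """Map 'O' to 0, then give each class a B-/I- pair in the order supplied."""
    label_map = {"O": 0}
    for name in labels:
        tag = name.upper()
        # len(label_map) is always the next unused integer id.
        label_map[f"B-{tag}"] = len(label_map)
        label_map[f"I-{tag}"] = len(label_map)
    return label_map


label_map = build_label_map(["actor", "character", "plot"])
print(label_map)
# 3 classes -> (3 * 2) + 1 = 7 labels
print(len(label_map))
```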

Special tokens (e.g., [CLS], [SEP]) receive the ignore value -100, which PyTorch's CrossEntropyLoss skips by default.
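
To make "skips" concrete, here is a dependency-free sketch of a mean cross-entropy that ignores positions labeled -100. It only imitates the effect of an ignore index; `masked_cross_entropy` is a hypothetical helper, and the logits are made-up numbers for illustration.

```python
import math

IGNORE_INDEX = -100


def masked_cross_entropy(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean negative log-likelihood over positions whose label is not ignore_index."""
    losses = []
    for scores, label in zip(logits, labels):
        if label == ignore_index:
            continue  # special tokens contribute nothing to the loss
        log_norm = math.log(sum(math.exp(s) for s in scores))
        losses.append(log_norm - scores[label])
    if not losses:
        return float("nan")  # every position was ignored
    return sum(losses) / len(losses)


# Four token positions, two classes; the first and last are special tokens.
logits = [[5.0, 0.0], [0.0, 0.0], [0.0, 0.0], [5.0, 0.0]]
token_labels = [-100, 0, 1, -100]
loss = masked_cross_entropy(logits, token_labels)
print(loss)
```

Only the two middle positions enter the average, so the confident scores at the special-token positions have no effect on the result.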

Tokenizer Input

The tokenizer argument accepts three forms:

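# entity class names (Quick Start reuses the name `labels` for the encoder output)
labels = ["actor", "character", "plot"]

# 1. checkpoint name (downloads from HuggingFace Hub)
encoder = IOB2Encoder(labels=labels, tokenizer="bert-base-uncased")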

# 2. standalone tokenizers.Tokenizer instance
from tokenizers import Tokenizer
tok = Tokenizer.from_pretrained("bert-base-uncased")
encoder = IOB2Encoder(labels=labels, tokenizer=tok)

# 3. transformers PreTrainedTokenizerFast (unwrapped automatically)
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = IOB2Encoder(labels=labels, tokenizer=tok)
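
One plausible way to accept all three forms is to normalize them to a single Rust-backed tokenizer object. The sketch below is an assumption about how such dispatch could look, not the library's code: `resolve_tokenizer` is a hypothetical name, and the duck-typed check on `encode_batch` is a simplification.

```python
def resolve_tokenizer(tokenizer):
    """Return a tokenizers.Tokenizer-like object from any of the three input forms."""
    if isinstance(tokenizer, str):
        # Imported lazily so the other branches need neither the package nor the Hub.
        from tokenizers import Tokenizer

        return Tokenizer.from_pretrained(tokenizer)

    # transformers fast tokenizers expose the Rust tokenizer as `backend_tokenizer`.
    backend = getattr(tokenizer, "backend_tokenizer", None)
    if backend is not None:
        return backend

    # A standalone tokenizers.Tokenizer already provides encode_batch().
    if hasattr(tokenizer, "encode_batch"):
        return tokenizer

    raise TypeError(f"unsupported tokenizer input: {type(tokenizer).__name__}")
```

Checking for `backend_tokenizer` before `encode_batch` means a wrapped tokenizer is always unwrapped first, so downstream code only has to handle one interface.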

Batch Encoding

annotations = [
    {"text": "Did Dame Judy Dench star?", "spans": [{"label": "actor", "start": 4, "end": 19}]},
    {"text": "Matt Damon was Jason Bourne.", "spans": [{"label": "actor", "start": 0, "end": 10}]},
]

results = encoder.batch(annotations)
# >>> [[-100, 0, 1, 2, 2, 2, 0, 0, -100], [-100, 1, 2, 0, 0, 0, 0, -100]]

The batch path uses the Rust-backed encode_batch() for parallelized tokenization. It returns list[list[int]] with no padding; use HuggingFace's DataCollatorForTokenClassification or your own padding for training.
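
If you pad by hand, the label sequences are usually padded with the same ignore value used for special tokens. A minimal sketch, assuming right-padding and illustrative label sequences; `pad_labels` is a hypothetical helper:

```python
def pad_labels(batch, pad_value=-100):
    """Right-pad every label sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in batch)
    return [seq + [pad_value] * (max_len - len(seq)) for seq in batch]


unpadded = [
    [-100, 0, 1, 2, -100],
    [-100, 1, 2, 0, 0, 0, -100],
]
padded = pad_labels(unpadded)
print(padded)
```

The token ids and attention mask need matching padding on the model-input side; this sketch covers only the labels.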

Custom Field Names

If your annotation data uses non-standard field names, configure them at construction:

# BioMed-NER dataset uses "entities" and "class" instead of "spans" and "label"
encoder = IOB2Encoder(
    labels=["organism", "chemicals"],
    tokenizer="bert-base-uncased",
    spans_field="entities",
    label_field="class",
)
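
Conceptually, the two options rename fields on the way in. The sketch below shows that mapping on a made-up record; `normalize_record` is a hypothetical helper, and it assumes the offset fields are still called "start" and "end".

```python
def normalize_record(record, spans_field="spans", label_field="label"):
    """Rewrite a record with custom field names into the default layout."""
    return {
        "text": record["text"],
        "spans": [
            {"label": span[label_field], "start": span["start"], "end": span["end"]}
            for span in record[spans_field]
        ],
    }


# Illustrative record only, not taken from the BioMed-NER dataset.
record = {
    "text": "Aspirin inhibits COX-1.",
    "entities": [{"class": "chemicals", "start": 0, "end": 7}],
}
normalized = normalize_record(record, spans_field="entities", label_field="class")
print(normalized)
```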

Built-in Conversion Check

By default, every encoding is verified by recovering the entity text from the produced labels and comparing it to the original annotation. This catches misalignment bugs early. Disable it for performance in production:

encoder = IOB2Encoder(labels=labels, tokenizer=tok, conversion_check=False)
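
The idea behind such a round-trip check can be sketched as follows. This is an illustration under assumptions, not the library's checker: `recover_entities` is a hypothetical helper, and the token offsets are written by hand for a word-level split, with (0, 0) standing in for special tokens.

```python
def recover_entities(text, labels, offsets, id_to_tag):
    """Rebuild (label, surface text) pairs from integer labels and token offsets."""
    entities = []
    current = None  # [label, start, end] of the entity being built
    for label_id, (start, end) in zip(labels, offsets):
        tag = id_to_tag.get(label_id, "O")  # -100 and unknown ids act like O
        if tag.startswith("B-"):
            if current:
                entities.append(current)
            current = [tag[2:], start, end]
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[2] = end  # extend the open entity to this token
        else:
            if current:
                entities.append(current)
            current = None
    if current:
        entities.append(current)
    return [(name, text[s:e]) for name, s, e in entities]


text = "Matt Damon was Jason Bourne."
spans = [{"label": "actor", "start": 0, "end": 10}]
token_labels = [-100, 1, 2, 0, 0, 0, 0, -100]
# Hand-written offsets: [CLS] Matt Damon was Jason Bourne . [SEP]
offsets = [(0, 0), (0, 4), (5, 10), (11, 14), (15, 20), (21, 27), (27, 28), (0, 0)]
id_to_tag = {0: "O", 1: "B-ACTOR", 2: "I-ACTOR"}

recovered = recover_entities(text, token_labels, offsets, id_to_tag)
expected = [(s["label"].upper(), text[s["start"]:s["end"]]) for s in spans]
print(recovered == expected)
```

A mismatch between `recovered` and `expected` would indicate that the labels no longer line up with the annotated characters.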

Supported Tokenizers

Tested across three tokenizer families:

Family	Checkpoints
WordPiece	`bert-base-cased`, `bert-base-uncased`, `bert-large-cased`, `bert-large-uncased`, `distilbert-base-cased`, `distilbert-base-uncased`, `google/electra-base-discriminator`
BPE	`roberta-base`, `roberta-large`
SentencePiece	`albert-base-v2`, `xlnet-base-cased`, `t5-small`

Other HuggingFace-compatible tokenizers should work as well. The built-in conversion check will flag any issues.

Tests

uv run pytest tests/ -v

The test suite includes unit tests for label map construction, entity range detection, and the conversion checker, plus a parametrized matrix of 12 tokenizer checkpoints across multiple annotation edge cases (entities at text boundaries, adjacent entities, punctuation, etc.).

Release history

0.3.0 (Feb 25, 2026)
0.2.0 (Feb 25, 2026)
0.1.0 (Feb 24, 2026), this version

Download files

Source Distribution

iob2labels-0.1.0.tar.gz (75.4 kB)

Uploaded Feb 24, 2026 Source

Built Distribution

iob2labels-0.1.0-py3-none-any.whl (11.4 kB)

Uploaded Feb 24, 2026 Python 3

File details

Details for the file iob2labels-0.1.0.tar.gz.

File metadata

Download URL: iob2labels-0.1.0.tar.gz
Upload date: Feb 24, 2026
Size: 75.4 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for iob2labels-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`0f130086b141622ef48322a0245feeaab56ee79d1b2fe6f4de90e77777c6d262`
MD5	`9cbea73f994ddbbe3eef75a3b9effd31`
BLAKE2b-256	`a21bbeb9b6952efbff0ca076185c3a0dbae7b2439f3c0c509be2882bc1617de3`

Provenance

The following attestation bundles were made for iob2labels-0.1.0.tar.gz:

Publisher: publish.yml on cldixon/iob2labels

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: iob2labels-0.1.0.tar.gz
- Subject digest: 0f130086b141622ef48322a0245feeaab56ee79d1b2fe6f4de90e77777c6d262
- Sigstore transparency entry: 984577484
- Sigstore integration time: Feb 24, 2026
Source repository:
- Permalink: cldixon/iob2labels@3439ccd6f6e179bcf89a53acfc80266f7f989be5
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/cldixon
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@3439ccd6f6e179bcf89a53acfc80266f7f989be5
- Trigger Event: release

File details

Details for the file iob2labels-0.1.0-py3-none-any.whl.

File metadata

Download URL: iob2labels-0.1.0-py3-none-any.whl
Upload date: Feb 24, 2026
Size: 11.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for iob2labels-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`7389bb8659f340869fc55f602c8f29e9d5a6249af44d8ece95836f0403b04c20`
MD5	`295be87e91a92e4e10153bea621bd61d`
BLAKE2b-256	`87743dc3375e43374ad4c6df29b5ad15d65ebe993c1f602a4c2bb1fc08bb1bcc`

Provenance

The following attestation bundles were made for iob2labels-0.1.0-py3-none-any.whl:

Publisher: publish.yml on cldixon/iob2labels

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: iob2labels-0.1.0-py3-none-any.whl
- Subject digest: 7389bb8659f340869fc55f602c8f29e9d5a6249af44d8ece95836f0403b04c20
- Sigstore transparency entry: 984577487
- Sigstore integration time: Feb 24, 2026
Source repository:
- Permalink: cldixon/iob2labels@3439ccd6f6e179bcf89a53acfc80266f7f989be5
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/cldixon
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@3439ccd6f6e179bcf89a53acfc80266f7f989be5
- Trigger Event: release
