NER training-data generation built on py-agent-lib. Bring your own labels, LLM, and downstream trainer.

Maintainers

halfwitgaslit

py-agent-ner

NER training-data generation built on py-agent-lib.

Use an LLM to label text → write a JSONL with character-indexed spans → convert to BIO TSV → fine-tune a small NER model.
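As a rough illustration of what "character-indexed spans" means in a JSONL row, the sketch below builds one row and checks that every span's offsets reproduce its stored text. The field names (`text`, `entities`, `label`, `spans`, `start`, `end`) and the helper `check_spans` are assumptions for illustration; the real TrainingRecord schema may differ.

```python
import json

# Hypothetical row shape; the package's actual TrainingRecord fields may differ.
row = {
    "text": "Jordan Lee at City Library tomorrow 4:30 PM.",
    "entities": [
        {"label": "person_name",
         "spans": [{"start": 0, "end": 10, "text": "Jordan Lee"}]},
        {"label": "location_reference",
         "spans": [{"start": 14, "end": 26, "text": "City Library"}]},
    ],
}

def check_spans(record):
    """Return the spans whose offsets do not reproduce their stored text."""
    bad = []
    for entity in record["entities"]:
        for span in entity["spans"]:
            if record["text"][span["start"]:span["end"]] != span["text"]:
                bad.append(span)
    return bad

line = json.dumps(row)        # one JSON object per line is what makes a file JSONL
restored = json.loads(line)
print(check_spans(restored))  # an empty list means every offset lines up
```

A check like this is cheap to run over a whole file before converting it, since a span that is off by one character silently corrupts the BIO tags downstream.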

Install

pip install py-agent-ner
# or, for local dev:
pip install -e .

py-agent-ner pulls in py-agent-lib[adapters] (Pydantic + Jinja2 + instructor + httpx + rapidfuzz), so everything you need to run is one install away.

Quickstart (no setup)

import asyncio
from py_agent_lib.adapters.llm import from_ollama
from py_agent_ner import extract_training_data, jsonl_to_bio

LABELS = {
    "person_name":        "Full names. Ignore organizations and locations.",
    "location_reference": "Places, venues, addresses. Include 'at'/'in' if it's part of the phrase.",
    "time_reference":     "Times like 'tomorrow', '4:30 PM', 'next Monday'.",
}
TEXTS = ["Jordan Lee at City Library tomorrow 4:30 PM."]

async def main():
    llm = from_ollama("gemma4", think=False)
    records = await extract_training_data(
        texts=TEXTS,
        labels=LABELS,
        llm=llm,
        output_path="training_data.jsonl",
        fuzzy=True,
    )
    jsonl_to_bio("training_data.jsonl", "training_data.bio.tsv")
    return records

asyncio.run(main())

That's it. You get:

training_data.jsonl — one Pydantic TrainingRecord per line, with character-indexed spans
training_data.bio.tsv — CoNLL BIO format, ready to feed into HuggingFace token-classification or spacy convert

A more complete end-to-end demo lives at examples/quickstart/.
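To make the relationship between the two output files concrete, here is a stdlib-only sketch of projecting character spans onto tokens as BIO tags. The regex tokenizer, the `(start, end, label)` tuple shape, and the names `tokenize_with_offsets` and `spans_to_bio` are illustrative assumptions, not the package's `simple_tokenize` or `record_to_bio`.

```python
import re

def tokenize_with_offsets(text):
    """Split into word and punctuation tokens, keeping character offsets."""
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"\w+|[^\w\s]", text)]

def spans_to_bio(text, spans):
    """Tag each token B-/I-/O from (start, end, label) character spans.

    Assumes spans do not overlap. A token belongs to a span when it lies
    entirely inside it; the first such token gets B-, the rest get I-.
    """
    rows, opened = [], set()
    for token, start, end in tokenize_with_offsets(text):
        tag = "O"
        for i, (s_start, s_end, label) in enumerate(spans):
            if start >= s_start and end <= s_end:
                tag = ("I-" if i in opened else "B-") + label
                opened.add(i)
                break
        rows.append((token, tag))
    return rows

text = "Jordan Lee at City Library tomorrow 4:30 PM."
spans = [
    (0, 10, "person_name"),
    (14, 26, "location_reference"),
    (27, 35, "time_reference"),
    (36, 43, "time_reference"),
]
rows = spans_to_bio(text, spans)
for token, tag in rows:
    print(f"{token}\t{tag}")
```

Note that "4:30" splits into three tokens under this tokenizer, so a single time span covers four rows of the TSV.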

Public API

from py_agent_ner import (
    # Canonical Pydantic shapes
    LabelExtraction,    # the LLM's per-label response
    EntitySpan,         # one character-indexed span
    TaggedEntity,       # one label with confidence + spans
    TrainingRecord,     # one input text + all entities (JSONL row shape)

    # The fan-out-merge pipeline
    build_plan,           # build a Plan from a list of LabelPrompt
    make_handlers,        # build extract+merge handlers for one text
    extract_training_data,  # one-shot: run it, get records, write JSONL
    MERGE_STEP_ID,

    # BIO conversion for downstream training
    simple_tokenize,      # word + punctuation tokenizer with offsets
    record_to_bio,        # one TrainingRecord -> [(token, tag), ...]
    jsonl_to_bio,         # JSONL file -> CoNLL TSV file
    records_to_bio,       # in-memory records -> CoNLL TSV file

    # Prompts (re-exported from py-agent-lib for convenience)
    LabelPrompt,
    label_prompts_from_dict,
    load_label_prompts,
    labels_of,
    DEFAULT_BASE_TEMPLATE,
)
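The "fan-out-merge" comment in the import list suggests one extraction call per label joined by a single merge step. A minimal sketch of that pattern with plain asyncio follows. `fake_extract`, `merge`, `label_text`, and the dictionary shapes are hypothetical stand-ins, and the actual pipeline schedules its steps through py-agent-lib's DagExecutor rather than `asyncio.gather`.

```python
import asyncio

async def fake_extract(text, label, description):
    """Stand-in for one per-label LLM call; returns surface strings found."""
    canned = {"person_name": ["Jordan Lee"], "time_reference": ["tomorrow"]}
    await asyncio.sleep(0)  # yield control, as a real network call would
    return {"label": label, "mentions": canned.get(label, [])}

def merge(text, extractions):
    """Merge step: resolve each mention to character offsets in the text."""
    entities = []
    for extraction in extractions:
        for mention in extraction["mentions"]:
            start = text.find(mention)
            if start != -1:  # drop mentions the text does not contain
                entities.append((start, start + len(mention), extraction["label"]))
    return {"text": text, "entities": sorted(entities)}

async def label_text(text, labels):
    # Fan out: one concurrent task per label. Merge: a single join step.
    tasks = [fake_extract(text, name, desc) for name, desc in labels.items()]
    return merge(text, await asyncio.gather(*tasks))

labels = {"person_name": "Full names.", "time_reference": "Times and dates."}
record = asyncio.run(
    label_text("Jordan Lee at City Library tomorrow 4:30 PM.", labels)
)
print(record["entities"])
```

Asking about one label per call keeps each prompt small and lets a failed label be retried on its own, at the cost of more calls per text.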

Architecture

py-agent-ner is a thin opinionated layer over py-agent-lib:

your labeler
     │
     ▼
py-agent-ner ─── pipeline.py     (build_plan + make_handlers + extract_training_data)
     │           models.py       (LabelExtraction, EntitySpan, TaggedEntity, TrainingRecord)
     │           bio.py          (simple_tokenize, record_to_bio, jsonl_to_bio)
     │
     ▼
py-agent-lib ── adapters.llm     (Structured LLM clients: Ollama native, OpenAI, Anthropic, ...)
                adapters.prompts  (Jinja2 prompt loading + DEFAULT_BASE_TEMPLATE)
                adapters.spans    (strict + fuzzy span finding)
                adapters.training (JSONL read/write)
                DagExecutor       (fan-out scheduling, retries, observers, snapshots)

Everything in py-agent-ner is replaceable — if you want a different prompt scaffold, write your own LabelPrompt builder. If you want a different span strategy, use find_spans directly with your own make_handlers. If you want a different output format, take the list[TrainingRecord] from extract_training_data and write whatever you want.
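As an example of what a custom span strategy could look like, the sketch below tries an exact substring match first and falls back to a fuzzy sliding-window search. It uses the standard library's `difflib` in place of rapidfuzz, and `locate` with its `threshold` parameter is a hypothetical helper, not the library's `find_spans`.

```python
from difflib import SequenceMatcher

def locate(text, mention, threshold=0.8):
    """Return (start, end) of mention in text, or None.

    Strict pass: exact substring. Fuzzy pass: slide a window of the
    mention's length over the text and keep the best-scoring position.
    """
    start = text.find(mention)
    if start != -1:
        return start, start + len(mention)

    best_score, best_start = 0.0, -1
    width = len(mention)
    for i in range(0, max(1, len(text) - width + 1)):
        window = text[i:i + width].lower()
        score = SequenceMatcher(None, window, mention.lower()).ratio()
        if score > best_score:
            best_score, best_start = score, i
    if best_score >= threshold:
        return best_start, best_start + width
    return None

text = "Jordan Lee at City Library tomorrow 4:30 PM."
print(locate(text, "City Library"))     # found by the strict pass
print(locate(text, "City Librery"))     # misspelled, found by the fuzzy pass
print(locate(text, "Central Station"))  # nothing close enough
```

Because the window is fixed to the mention's length, a mention with dropped or extra characters can come back with slightly shifted boundaries; snapping the result to token boundaries would be one way to tighten that.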

Training a model from the output

The JSONL output is the canonical intermediate format. To fine-tune a model:

HuggingFace Transformers

from py_agent_ner import jsonl_to_bio

# 1) Convert JSONL → BIO TSV
jsonl_to_bio("training_data.jsonl", "training_data.bio.tsv")

# 2) Load BIO TSV, tokenize with your target model's tokenizer, align labels
#    using `word_ids()`, then fine-tune with `Trainer` +
#    `DataCollatorForTokenClassification`. This is the official HuggingFace
#    token-classification tutorial verbatim:
#    https://huggingface.co/docs/transformers/tasks/token_classification
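The label-alignment step mentioned in the comments can be sketched without any dependencies. Here `word_ids` is written by hand to mimic what a fast tokenizer's `word_ids()` returns (one word index per subword, `None` for special tokens); `align_labels` and the `tag2id` mapping are hypothetical names for illustration.

```python
def align_labels(word_ids, word_tags, tag2id):
    """Map word-level BIO tags onto subword positions.

    Only the first subword of each word keeps the word's tag. Special
    tokens and later subwords get -100, the value that PyTorch's
    cross-entropy loss ignores by default.
    """
    labels, previous = [], None
    for word_id in word_ids:
        if word_id is None:
            labels.append(-100)                        # special tokens
        elif word_id != previous:
            labels.append(tag2id[word_tags[word_id]])  # first piece of a word
        else:
            labels.append(-100)                        # later pieces of the word
        previous = word_id
    return labels

tag2id = {"O": 0, "B-person_name": 1, "I-person_name": 2}
word_tags = ["B-person_name", "I-person_name", "O"]  # Jordan / Lee / arrived
word_ids = [None, 0, 0, 1, 2, None]                  # hand-written example split
aligned = align_labels(word_ids, word_tags, tag2id)
print(aligned)
```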

spaCy

python -m spacy convert training_data.bio.tsv ./spacy-data/ --converter ner
python -m spacy init config config.cfg --lang en --pipeline ner
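# spaCy's default config also needs a dev corpus. The training file is reused below
# only as a placeholder; point --paths.dev at held-out data for real runs.
python -m spacy train config.cfg --paths.train ./spacy-data/training_data.bio.spacy --paths.dev ./spacy-data/training_data.bio.spacy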

GLiNER / Flair / custom

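Each of these can be trained from BIO-tagged data, though some expect a short conversion into their own input format first. Once training_data.bio.tsv exists, what remains is loading it into the trainer you picked.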
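For a custom trainer, reading the TSV back is a few lines of standard-library code. The sketch below assumes the usual CoNLL layout of one `token<TAB>tag` pair per line with blank lines between sentences; the exact layout written by `jsonl_to_bio` is not confirmed here, and `read_bio` and `bio_to_spans` are hypothetical helpers.

```python
import io

def read_bio(handle):
    """Yield (tokens, tags) per sentence from token<TAB>tag lines."""
    tokens, tags = [], []
    for raw in handle:
        line = raw.rstrip("\n")
        if not line.strip():          # blank line closes the current sentence
            if tokens:
                yield tokens, tags
                tokens, tags = [], []
            continue
        columns = line.split("\t")
        tokens.append(columns[0])     # first column is the token
        tags.append(columns[-1])      # last column is the tag
    if tokens:
        yield tokens, tags

def bio_to_spans(tags):
    """Collapse BIO tags into (first_token, end_token_exclusive, label) spans."""
    spans, start, label = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel flushes the last span
        inside = tag.startswith("I-") and tag[2:] == label
        if start is not None and not inside:
            spans.append((start, i, label))
            start, label = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, label = i, tag[2:]
    return spans

sample = (
    "Jordan\tB-person_name\nLee\tI-person_name\nat\tO\n"
    "\n"
    "City\tB-location_reference\n"
)
sentences = list(read_bio(io.StringIO(sample)))
print(sentences)
print(bio_to_spans(sentences[0][1]))
```

Token-index spans like these are a common starting point for span-based trainers, which typically want their own JSON layout rather than per-token tags.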

Tests

pip install -e ".[dev]"
pytest

FakeLlm in tests/test_pipeline.py returns canned LabelExtraction responses, so the suite runs in milliseconds without a live model.
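The same idea works for testing your own code on top of the pipeline. A minimal sketch of a canned-response test double follows; `CannedLlm`, its `extract` method, and `collect` are hypothetical and do not mirror the interface of the FakeLlm in the test suite.

```python
import asyncio

class CannedLlm:
    """Test double: returns a preset response per label and records calls."""

    def __init__(self, responses):
        self.responses = responses
        self.calls = []

    async def extract(self, text, label):  # hypothetical method name
        self.calls.append((text, label))
        return self.responses.get(label, [])

async def collect(llm, text, labels):
    """Code under test: query every label and keep the non-empty results."""
    found = {}
    for label in labels:
        mentions = await llm.extract(text, label)
        if mentions:
            found[label] = mentions
    return found

llm = CannedLlm({"person_name": ["Jordan Lee"]})
result = asyncio.run(
    collect(llm, "Jordan Lee at City Library.", ["person_name", "time_reference"])
)
print(result)
print(len(llm.calls))
```

Recording the calls lets a test assert on what was asked as well as on what came back, with no network involved.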

Release history

0.1.1, May 26, 2026

0.1.0, May 26, 2026 (this version)

Download files

Source Distribution

py_agent_ner-0.1.0.tar.gz (11.0 kB), uploaded May 26, 2026, Source

Built Distribution

py_agent_ner-0.1.0-py3-none-any.whl (13.4 kB), uploaded May 26, 2026, Python 3

File details

Details for the file py_agent_ner-0.1.0.tar.gz.

File metadata

Download URL: py_agent_ner-0.1.0.tar.gz
Upload date: May 26, 2026
Size: 11.0 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for py_agent_ner-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`eb9b1acbb62a4b41848f40e0e563cd396afff27f47e02e711c5d7913a7dbcdbc`
MD5	`36a59ff0a72cc0f14a213c0ee7ac3888`
BLAKE2b-256	`542db0ad08118646ef71134446d0594fdccf4c00956f7bb8af2e0d38b4a47b7f`

Provenance

The following attestation bundles were made for py_agent_ner-0.1.0.tar.gz:

Publisher: publish-ner.yml on gaslit-ai/py-agent

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: py_agent_ner-0.1.0.tar.gz
- Subject digest: eb9b1acbb62a4b41848f40e0e563cd396afff27f47e02e711c5d7913a7dbcdbc
- Sigstore transparency entry: 1635430621
- Sigstore integration time: May 26, 2026
Source repository:
- Permalink: gaslit-ai/py-agent@74aef7aa4916654ae7345a3a27e3fd6754f2c966
- Branch / Tag: refs/tags/ner-v0.1.0
- Owner: https://github.com/gaslit-ai
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-ner.yml@74aef7aa4916654ae7345a3a27e3fd6754f2c966
- Trigger Event: push

File details

Details for the file py_agent_ner-0.1.0-py3-none-any.whl.

File metadata

Download URL: py_agent_ner-0.1.0-py3-none-any.whl
Upload date: May 26, 2026
Size: 13.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for py_agent_ner-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`0cafb1f449fc7775f8eb7e64b07c0dd0f5dfc00a1e48ed4de73cc998ed1f4fe8`
MD5	`8426d3cac878e3567a518bd9233acb26`
BLAKE2b-256	`a76fc85f62eb559433fb8e82b6df0b26f2795798d81087c2e2fcefcc513cf85a`

Provenance

The following attestation bundles were made for py_agent_ner-0.1.0-py3-none-any.whl:

Publisher: publish-ner.yml on gaslit-ai/py-agent

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: py_agent_ner-0.1.0-py3-none-any.whl
- Subject digest: 0cafb1f449fc7775f8eb7e64b07c0dd0f5dfc00a1e48ed4de73cc998ed1f4fe8
- Sigstore transparency entry: 1635430628
- Sigstore integration time: May 26, 2026
Source repository:
- Permalink: gaslit-ai/py-agent@74aef7aa4916654ae7345a3a27e3fd6754f2c966
- Branch / Tag: refs/tags/ner-v0.1.0
- Owner: https://github.com/gaslit-ai
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-ner.yml@74aef7aa4916654ae7345a3a27e3fd6754f2c966
- Trigger Event: push
