NER training-data generation built on py-agent-lib. Bring your own labels, LLM, and downstream trainer.

Maintainers

halfwitgaslit

py-agent-ner

NER training-data generation built on py-agent-lib.

Architecture: one LLM call per (text, label). Each call uses a single-value Literal["<that_label>"] enum on its response schema; the model returns the verbatim substrings it found for that label. You fan out N calls per text (asyncio or DagExecutor) and assemble a TrainingRecord from the N responses. The package gives you the shapes, the prompt template, the schema builder, and the BIO converter — nothing else. The loop, the fan-out, and the merge are in your code where you can see them.
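The single-value Literal idea can be sketched with Pydantic's create_model. This is only an illustration, not the package's actual implementation: LabelExtractionSketch and single_label_schema_sketch are hypothetical names, and the field set (label, matches, confidence) is assumed from the field names this README mentions.

```python
from typing import Literal

from pydantic import BaseModel, Field, ValidationError, create_model


class LabelExtractionSketch(BaseModel):
    """Hypothetical stand-in for the package's LabelExtraction shape."""

    label: str
    matches: list[str] = Field(default_factory=list)
    confidence: float = Field(default=0.0, ge=0.0, le=1.0)


def single_label_schema_sketch(name: str) -> type[LabelExtractionSketch]:
    """Build a subclass whose `label` field accepts only `name`."""
    return create_model(
        f"{name.title()}Extraction",
        __base__=LabelExtractionSketch,
        label=(Literal[name], ...),  # single-value "enum" for this label
    )


PersonExtraction = single_label_schema_sketch("PERSON")
ok = PersonExtraction(label="PERSON", matches=["Ada Lovelace"], confidence=0.9)

# Any other label value fails validation, so a response cannot drift
# to a different label than the one the call asked about.
try:
    PersonExtraction(label="ORG", matches=[])
    rejected = False
except ValidationError:
    rejected = True
```

Pinning the label in the schema means the per-label call has only one valid answer for that field, which leaves the model to decide just the matches and the confidence.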

Install

pip install py-agent-ner
# or for local dev within this monorepo:
uv sync

py-agent-ner pulls in py-agent-lib[llm,fuzzy] (Pydantic + httpx + instructor + rapidfuzz) plus Jinja2 — everything needed to run.

What you write yourself

The loop, the fan-out, the assembly. The package never hides any of these. The canonical pattern (from examples/quickstart/run.py):

import asyncio
from pathlib import Path

from py_agent_lib.adapters.llm import from_ollama
from py_agent_lib.adapters.spans import find_spans
from py_agent_ner import (
    EntitySpan, TaggedEntity, TrainingRecord,
    build_system_prompt, load_label_prompts, labels_of, single_label_schema,
)

label_prompts  = load_label_prompts("./labels")
all_labels     = labels_of(label_prompts)
base_template  = Path("./prompts/base_system.jinja").read_text()
llm            = from_ollama("gemma4", think=False)


async def build_records(your_texts):
    # `await` needs a coroutine, so the loop lives inside an async function
    records = []
    for text in your_texts:
        # Fan out: N LLM calls, one per label, in parallel
        extractions = await asyncio.gather(*[
            llm.extract(
                system=build_system_prompt(lp, all_labels=all_labels, base_template=base_template),
                user=text,
                response_model=single_label_schema(lp.label),
            )
            for lp in label_prompts
        ])
        # Merge: assemble the TrainingRecord
        entities = {
            lp.label: TaggedEntity(
                label=lp.label,
                confidence=ext.confidence,
                spans=[EntitySpan(text=s.text, start=s.start, end=s.end)
                       for s in find_spans(ext.matches, text, fuzzy=True)],
            )
            for lp, ext in zip(label_prompts, extractions, strict=True)
        }
        records.append(TrainingRecord(input_text=text, entities=entities))
    return records


# your_texts is whatever iterable of strings you supply
records = asyncio.run(build_records(your_texts))

That's the whole story. Add JSONL output and BIO conversion:

from py_agent_ner import write_records_jsonl, jsonl_to_bio
write_records_jsonl(records, "training_data.jsonl")
jsonl_to_bio("training_data.jsonl", "training_data.bio.tsv")

Public API

from py_agent_ner import (
    # Pydantic v2 shapes (every field carries a description)
    LabelPrompt,          # input: label + instructions
    LabelExtraction,      # output of ONE call: label + matches + confidence
    EntitySpan,           # one located occurrence with character offsets
    TaggedEntity,         # one label's spans + confidence
    TrainingRecord,       # input text + per-label TaggedEntities (JSONL row)

    # Prompts — single-label rendering, matching the per-label fan-out
    DEFAULT_BASE_TEMPLATE,
    build_system_prompt,       # render ONE label's system prompt
    label_prompts_from_dict,   # {label: instructions} → [LabelPrompt]
    load_label_prompts,        # dir of .jinja → [LabelPrompt]
    labels_of,                 # convenience

    # Schema — 5-line helper around create_model + Literal[label]
    single_label_schema,       # name → type[LabelExtraction] with Literal[name]

    # BIO conversion (JSONL → CoNLL TSV for HuggingFace / spaCy / Flair)
    simple_tokenize,
    record_to_bio,
    jsonl_to_bio,
    records_to_bio,
    write_records_jsonl,
)

What py-agent-ner intentionally does NOT provide

No extract_training_data-style runner. You write the per-text loop.
No "merge" helper. Building the dict[label, TaggedEntity] from N LabelExtractions is a short dict comprehension you write in your code.
No multi-label wrapper class. The LLM returns LabelExtraction per call. There is no "Extraction containing entities for all labels" type.
No DAG plumbing inside the package. Use asyncio.gather (cheapest), or wire your fan-out as steps on py-agent-lib's DagExecutor if you want its observers / retries / cancellation / snapshots.

Architecture

your run.py
    │   loops over texts
    │   fans out N LLM calls per text via asyncio.gather
    │   assembles TrainingRecord from N LabelExtractions
    │
    ▼
py-agent-ner
    models.py    LabelPrompt, LabelExtraction, EntitySpan, TaggedEntity, TrainingRecord
    prompts.py   build_system_prompt + DEFAULT_BASE_TEMPLATE + loaders
    schema.py    single_label_schema (Literal[label_name] subclass of LabelExtraction)
    bio.py       JSONL → CoNLL BIO TSV
    │
    ▼
py-agent-lib    (domain-agnostic — knows nothing about NER)
    adapters/llm        from_ollama, from_openai, from_anthropic, from_litellm
    adapters/spans      find_spans (strict + opt-in rapidfuzz)
    adapters/training   write_records_jsonl, read_records_jsonl (generic Pydantic JSONL)
    (also: DagExecutor, observers, snapshots, etc. — for users who want them)
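The strict half of span location amounts to finding every verbatim occurrence of each returned substring and recording its character offsets. A minimal standard-library sketch of that idea follows. Span and find_spans_strict are hypothetical names, this is not py-agent-lib's find_spans, and the fuzzy path is left out.

```python
from typing import NamedTuple


class Span(NamedTuple):
    text: str
    start: int  # inclusive character offset
    end: int    # exclusive character offset


def find_spans_strict(matches: list[str], source: str) -> list[Span]:
    """Locate every non-overlapping verbatim occurrence of each match."""
    spans: list[Span] = []
    for needle in matches:
        if not needle:
            continue  # an empty string would match at every position
        cursor = 0
        while (start := source.find(needle, cursor)) != -1:
            end = start + len(needle)
            spans.append(Span(needle, start, end))
            cursor = end  # resume searching after this occurrence
    # Drop duplicates (repeated matches) and order by position in the text.
    return sorted(set(spans), key=lambda s: (s.start, s.end))


text = "Ada met Grace. Ada left."
spans = find_spans_strict(["Ada", "Grace"], text)
```

Because the model is asked for verbatim substrings rather than offsets, the offsets are computed from the text itself and each span can be checked with text[start:end] == span.text.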

Training a model from the output

jsonl_to_bio converts the JSONL to BIO TSV — the canonical input for:

HuggingFace — follow the official token-classification tutorial verbatim.
spaCy — python -m spacy convert training_data.bio.tsv ./spacy-data/ --converter ner
GLiNER, Flair, custom — all consume BIO.
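To make the BIO format concrete, the sketch below tags whitespace tokens from character-offset spans and emits token/tag TSV rows. It is an illustration under stated assumptions: whitespace_tokenize and spans_to_bio are hypothetical helpers, not the package's simple_tokenize or record_to_bio, whose tokenization rules may differ.

```python
import re


def whitespace_tokenize(text: str) -> list[tuple[str, int, int]]:
    """Split on whitespace, keeping each token's (start, end) offsets."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def spans_to_bio(text: str, spans: list[tuple[int, int, str]]) -> list[tuple[str, str]]:
    """Tag each token B-/I-<label> if it overlaps a span, otherwise O."""
    tagged = []
    for token, tok_start, tok_end in whitespace_tokenize(text):
        tag = "O"
        for span_start, span_end, label in spans:
            if tok_start < span_end and tok_end > span_start:  # overlap
                # The first token of a span starts at or before the span start.
                prefix = "B" if tok_start <= span_start else "I"
                tag = f"{prefix}-{label}"
                break
        tagged.append((token, tag))
    return tagged


text = "Ada Lovelace joined Acme"
rows = spans_to_bio(text, [(0, 12, "PERSON"), (20, 24, "ORG")])
tsv = "\n".join(f"{token}\t{tag}" for token, tag in rows)
```

Here rows pairs each token with its tag: B-PERSON and I-PERSON for the two name tokens, O for "joined", and B-ORG for "Acme".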

Tests

uv run pytest packages/py-agent-ner/tests/

Tests don't need a live model — they use a FakeLlm that returns canned LabelExtraction responses for the per-label calls.

Release history

0.1.1 (this version): May 26, 2026
0.1.0: May 26, 2026

Download files

Source Distribution

py_agent_ner-0.1.1.tar.gz (10.5 kB)

Uploaded May 26, 2026 Source

Built Distribution

py_agent_ner-0.1.1-py3-none-any.whl (12.8 kB)

Uploaded May 26, 2026 Python 3

File details

Details for the file py_agent_ner-0.1.1.tar.gz.

File metadata

Download URL: py_agent_ner-0.1.1.tar.gz
Upload date: May 26, 2026
Size: 10.5 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for py_agent_ner-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`70ec5df489088f170e3b482d2e03bf74e6493178978ca953e7e9b06acc7550bb`
MD5	`5df1fbfd7b28a1e5b17df278c38d5ec3`
BLAKE2b-256	`4f4fb4a9a25c8d8ea987e18a2160a64af28fea88b51518980443a9b6850c57c3`

Provenance

The following attestation bundles were made for py_agent_ner-0.1.1.tar.gz:

Publisher: publish-ner.yml on gaslit-ai/py-agent

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: py_agent_ner-0.1.1.tar.gz
- Subject digest: 70ec5df489088f170e3b482d2e03bf74e6493178978ca953e7e9b06acc7550bb
- Sigstore transparency entry: 1635433627
- Sigstore integration time: May 26, 2026
Source repository:
- Permalink: gaslit-ai/py-agent@4bb4779ca51d0c0bec43b36ec9d8ec7588e077b1
- Branch / Tag: refs/tags/ner-v0.1.1
- Owner: https://github.com/gaslit-ai
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-ner.yml@4bb4779ca51d0c0bec43b36ec9d8ec7588e077b1
- Trigger Event: push

File details

Details for the file py_agent_ner-0.1.1-py3-none-any.whl.

File metadata

Download URL: py_agent_ner-0.1.1-py3-none-any.whl
Upload date: May 26, 2026
Size: 12.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for py_agent_ner-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`124d1108d44fdf50a0c591d8b9bc3b95e4a5420fa0d7e1a5eeaccb2f1aa392a2`
MD5	`f0baffe59cef7efa2470f461e5527d7d`
BLAKE2b-256	`c617a5ffb249939156f3eb5e34e2d766f97e34dd9635abcca5022292d23f660d`

Provenance

The following attestation bundles were made for py_agent_ner-0.1.1-py3-none-any.whl:

Publisher: publish-ner.yml on gaslit-ai/py-agent

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: py_agent_ner-0.1.1-py3-none-any.whl
- Subject digest: 124d1108d44fdf50a0c591d8b9bc3b95e4a5420fa0d7e1a5eeaccb2f1aa392a2
- Sigstore transparency entry: 1635433633
- Sigstore integration time: May 26, 2026
Source repository:
- Permalink: gaslit-ai/py-agent@4bb4779ca51d0c0bec43b36ec9d8ec7588e077b1
- Branch / Tag: refs/tags/ner-v0.1.1
- Owner: https://github.com/gaslit-ai
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-ner.yml@4bb4779ca51d0c0bec43b36ec9d8ec7588e077b1
- Trigger Event: push
