NER training-data generation built on py-agent-lib. Bring your own labels, LLM, and downstream trainer.

py-agent-ner

NER training-data generation built on py-agent-lib.

Architecture: one LLM call per (text, label). Each call uses a single-value Literal["<that_label>"] enum on its response schema; the model returns the verbatim substrings it found for that label. You fan out N calls per text (asyncio or DagExecutor) and assemble a TrainingRecord from the N responses. The package gives you the shapes, the prompt template, the schema builder, and the BIO converter — nothing else. The loop, the fan-out, and the merge are in your code where you can see them.
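
As a rough illustration of the single-value Literal idea, the sketch below builds a per-label response schema with Pydantic v2's create_model. LabelExtractionSketch and single_label_schema_sketch are hypothetical names used only here, and the field types (in particular a float confidence) are assumptions rather than the package's actual definitions.

```python
from typing import Literal

from pydantic import BaseModel, Field, ValidationError, create_model


class LabelExtractionSketch(BaseModel):
    """Hypothetical stand-in for the package's LabelExtraction shape."""

    label: str = Field(description="The label this call was asked to extract.")
    matches: list[str] = Field(description="Verbatim substrings found for the label.")
    confidence: float = Field(ge=0.0, le=1.0, description="Assumed to be a score in [0, 1].")


def single_label_schema_sketch(label: str) -> type[LabelExtractionSketch]:
    """Build a subclass whose `label` field accepts exactly one value."""
    return create_model(
        f"LabelExtraction_{label}",
        __base__=LabelExtractionSketch,
        # Narrow `label` from str to a one-member Literal.
        label=(Literal[label], Field(description=f"Always '{label}'.")),
    )


PersonSchema = single_label_schema_sketch("PERSON")
ok = PersonSchema(label="PERSON", matches=["Ada Lovelace"], confidence=0.9)

# Any other label value fails validation, so a response cannot drift
# to a different label than the one the call asked about.
try:
    PersonSchema(label="ORG", matches=[], confidence=0.5)
    rejected = False
except ValidationError:
    rejected = True
```

Because the constraint lives in the schema, a structured-output client that validates against it would reject an off-label response rather than pass it through.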

Install

pip install py-agent-ner
# or for local dev within this monorepo:
uv sync

py-agent-ner pulls in py-agent-lib[llm,fuzzy] (Pydantic + httpx + instructor + rapidfuzz) plus Jinja2 — everything needed to run.

What you write yourself

The loop, the fan-out, the assembly. The package never hides any of these. The canonical pattern (from examples/quickstart/run.py):

import asyncio
from pathlib import Path

from py_agent_lib.adapters.llm import from_ollama
from py_agent_lib.adapters.spans import find_spans
from py_agent_ner import (
    EntitySpan, LabelExtraction, TaggedEntity, TrainingRecord,
    build_system_prompt, load_label_prompts, labels_of, single_label_schema,
)

label_prompts  = load_label_prompts("./labels")
all_labels     = labels_of(label_prompts)
base_template  = Path("./prompts/base_system.jinja").read_text(encoding="utf-8")
llm            = from_ollama("gemma4", think=False)

async def build_records(texts):
    records = []
    for text in texts:
        # Fan out: N LLM calls, one per label, in parallel
        extractions = await asyncio.gather(*[
            llm.extract(
                system=build_system_prompt(lp, all_labels=all_labels, base_template=base_template),
                user=text,
                response_model=single_label_schema(lp.label),
            )
            for lp in label_prompts
        ])
        # Merge: assemble the TrainingRecord (gather keeps results in label order)
        entities = {
            lp.label: TaggedEntity(
                label=lp.label,
                confidence=ext.confidence,
                spans=[EntitySpan(text=s.text, start=s.start, end=s.end)
                       for s in find_spans(ext.matches, text, fuzzy=True)],
            )
            for lp, ext in zip(label_prompts, extractions, strict=True)
        }
        records.append(TrainingRecord(input_text=text, entities=entities))
    return records

# your_texts is whatever iterable of input strings you supply
records = asyncio.run(build_records(your_texts))

That's the whole story. Add JSONL output and BIO conversion:

from py_agent_ner import write_records_jsonl, jsonl_to_bio
write_records_jsonl(records, "training_data.jsonl")
jsonl_to_bio("training_data.jsonl", "training_data.bio.tsv")
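
To make the BIO step concrete, here is a rough sketch of turning character spans into token-level tags. It is not the package's jsonl_to_bio: tokenize and spans_to_bio are hypothetical helpers, the regex tokenizer is an assumption, and spans are assumed to be non-overlapping and aligned with token boundaries.

```python
import re


def tokenize(text: str) -> list[tuple[str, int, int]]:
    """Split into words and single punctuation marks, keeping character offsets."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\w+|[^\w\s]", text)]


def spans_to_bio(text: str, spans: list[tuple[str, int, int]]) -> list[tuple[str, str]]:
    """Tag each token B-/I-/O given (label, start, end) character spans."""
    rows = []
    for token, tok_start, tok_end in tokenize(text):
        tag = "O"
        for label, start, end in spans:
            if tok_start >= start and tok_end <= end:
                # The token that begins exactly at the span start opens the entity.
                tag = ("B-" if tok_start == start else "I-") + label
                break
        rows.append((token, tag))
    return rows


text = "Ada Lovelace visited London."
rows = spans_to_bio(text, [("PERSON", 0, 12), ("LOC", 21, 27)])

# One "token<TAB>tag" line per token, the usual CoNLL-style layout.
tsv = "\n".join(f"{token}\t{tag}" for token, tag in rows)
```

Here "Ada" is tagged B-PERSON, "Lovelace" I-PERSON, "London" B-LOC, and the remaining tokens O.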

Public API

from py_agent_ner import (
    # Pydantic v2 shapes (every field carries a description)
    LabelPrompt,          # input: label + instructions
    LabelExtraction,      # output of ONE call: label + matches + confidence
    EntitySpan,           # one located occurrence with character offsets
    TaggedEntity,         # one label's spans + confidence
    TrainingRecord,       # input text + per-label TaggedEntities (JSONL row)

    # Prompts — single-label rendering, matching the per-label fan-out
    DEFAULT_BASE_TEMPLATE,
    build_system_prompt,       # render ONE label's system prompt
    label_prompts_from_dict,   # {label: instructions} → [LabelPrompt]
    load_label_prompts,        # dir of .jinja → [LabelPrompt]
    labels_of,                 # convenience

    # Schema — 5-line helper around create_model + Literal[label]
    single_label_schema,       # name → type[LabelExtraction] with Literal[name]

    # BIO conversion (JSONL → CoNLL TSV for HuggingFace / spaCy / Flair)
    simple_tokenize,
    record_to_bio,
    jsonl_to_bio,
    records_to_bio,
    write_records_jsonl,
)

What py-agent-ner intentionally does NOT provide

  • No extract_training_data-style runner. You write the per-text loop.
  • No "merge" helper. Building the dict[label, TaggedEntity] from N LabelExtractions is five lines you write in your code.
  • No multi-label wrapper class. The LLM returns LabelExtraction per call. There is no "Extraction containing entities for all labels" type.
  • No DAG plumbing inside the package. Use asyncio.gather (cheapest), or wire your fan-out as steps on py-agent-lib's DagExecutor if you want its observers / retries / cancellation / snapshots.
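
If plain asyncio.gather launches more concurrent calls than your backend tolerates, one common way to cap the fan-out is a semaphore. The sketch below uses only the standard library; fake_extract is a hypothetical stand-in for a per-label LLM call and max_concurrency is an illustrative parameter, neither of which comes from py-agent-ner or py-agent-lib.

```python
import asyncio


async def fake_extract(label: str, text: str) -> dict:
    """Hypothetical stand-in for one per-label LLM call."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return {"label": label, "matches": [w for w in text.split() if w.istitle()]}


async def fan_out(labels: list[str], text: str, max_concurrency: int = 4) -> list[dict]:
    """Run one call per label with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(label: str) -> dict:
        async with semaphore:
            return await fake_extract(label, text)

    # gather returns results in input order, so they line up with labels.
    return await asyncio.gather(*(bounded(label) for label in labels))


results = asyncio.run(fan_out(["PERSON", "ORG", "LOC"], "Ada joined Acme in Paris"))
```

Because the results stay in label order, the zip-based merge from the quickstart works unchanged on top of a bounded fan-out like this.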

Architecture

your run.py
    │   loops over texts
    │   fans out N LLM calls per text via asyncio.gather
    │   assembles TrainingRecord from N LabelExtractions
    │
    ▼
py-agent-ner
    models.py    LabelPrompt, LabelExtraction, EntitySpan, TaggedEntity, TrainingRecord
    prompts.py   build_system_prompt + DEFAULT_BASE_TEMPLATE + loaders
    schema.py    single_label_schema (Literal[label_name] subclass of LabelExtraction)
    bio.py       JSONL → CoNLL BIO TSV
    │
    ▼
py-agent-lib    (domain-agnostic — knows nothing about NER)
    adapters/llm        from_ollama, from_openai, from_anthropic, from_litellm
    adapters/spans      find_spans (strict + opt-in rapidfuzz)
    adapters/training   write_records_jsonl, read_records_jsonl (generic Pydantic JSONL)
    (also: DagExecutor, observers, snapshots, etc. — for users who want them)
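
The strict half of span location in the diagram above can be pictured with a few lines of standard-library Python. This is only a sketch of exact substring matching, not py-agent-lib's find_spans: Span and find_exact_spans are hypothetical names, and the fuzzy (rapidfuzz) path is left out entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical span shape: text[start:end] equals Span.text."""

    text: str
    start: int
    end: int


def find_exact_spans(matches: list[str], text: str) -> list[Span]:
    """Locate every exact occurrence of each match string in text."""
    spans: list[Span] = []
    for match in dict.fromkeys(matches):  # de-duplicate while keeping order
        if not match:
            continue  # an empty string would match at every position
        start = text.find(match)
        while start != -1:
            end = start + len(match)
            spans.append(Span(text=match, start=start, end=end))
            # Resume after this hit so repeats of the same match do not overlap.
            start = text.find(match, end)
    return sorted(spans, key=lambda s: (s.start, s.end))


text = "Ada met Grace in Paris. Ada left Paris on Monday."
spans = find_exact_spans(["Ada", "Paris"], text)
```

Each returned span satisfies text[span.start:span.end] == span.text, which is the property that offset-based training formats depend on.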

Training a model from the output

jsonl_to_bio converts the JSONL to BIO TSV — the canonical input for:

  • HuggingFace — follow the official token-classification tutorial verbatim.
  • spaCy: python -m spacy convert training_data.bio.tsv ./spacy-data/ --converter ner
  • GLiNER, Flair, custom — all consume BIO.

Tests

uv run pytest packages/py-agent-ner/tests/

Tests don't need a live model — they use a FakeLlm that returns canned LabelExtraction responses for the per-label calls.
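
A fake of this kind can be very small. The sketch below is not the FakeLlm from the test suite; it is a hypothetical stand-in that mirrors the extract(system=..., user=..., response_model=...) call shape from the quickstart, and keying canned responses by the response model's class name is an assumption made for illustration.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class PersonExtraction:
    """Hypothetical response shape for a single label."""

    matches: list[str]
    confidence: float


@dataclass
class FakeLlm:
    """Returns canned responses instead of calling a model."""

    canned: dict[str, object]  # response model class name -> canned response
    calls: list[tuple[str, str]] = field(default_factory=list)

    async def extract(self, *, system: str, user: str, response_model: type) -> object:
        # Record the prompt pair so a test can assert on what was sent.
        self.calls.append((system, user))
        return self.canned[response_model.__name__]


fake = FakeLlm(canned={"PersonExtraction": PersonExtraction(["Ada"], 1.0)})
result = asyncio.run(
    fake.extract(system="find people", user="Ada wrote notes.", response_model=PersonExtraction)
)
```

Recording the calls lets a test check both the merged output and that exactly one request was issued per label.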
