NER training-data generation built on py-agent-lib. Bring your own labels, LLM, and downstream trainer.
py-agent-ner
Architecture: one LLM call per (text, label). Each call uses a single-value
Literal["<that_label>"] enum on its response schema; the model returns the
verbatim substrings it found for that label. You fan out N calls per text
(asyncio or DagExecutor) and assemble a TrainingRecord from the N
responses. The package gives you the shapes, the prompt template, the schema
builder, and the BIO converter — nothing else. The loop, the fan-out, and
the merge are in your code where you can see them.
Install
pip install py-agent-ner
# or for local dev within this monorepo:
uv sync
py-agent-ner pulls in py-agent-lib[llm,fuzzy] (Pydantic + httpx +
instructor + rapidfuzz) plus Jinja2 — everything needed to run.
What you write yourself
The loop, the fan-out, the assembly. The package never hides any of these.
The canonical pattern (from examples/quickstart/run.py):
import asyncio

from py_agent_lib.adapters.llm import from_ollama
from py_agent_lib.adapters.spans import find_spans
from py_agent_ner import (
    EntitySpan, LabelExtraction, TaggedEntity, TrainingRecord,
    build_system_prompt, load_label_prompts, labels_of, single_label_schema,
)

label_prompts = load_label_prompts("./labels")
all_labels = labels_of(label_prompts)
base_template = open("./prompts/base_system.jinja").read()
llm = from_ollama("gemma4", think=False)


async def generate(your_texts):
    # `await` is only valid inside a coroutine, so the loop lives in one.
    records = []
    for text in your_texts:
        # Fan out: N LLM calls, one per label, run concurrently
        extractions = await asyncio.gather(*[
            llm.extract(
                system=build_system_prompt(lp, all_labels=all_labels, base_template=base_template),
                user=text,
                response_model=single_label_schema(lp.label),
            )
            for lp in label_prompts
        ])
        # Merge: assemble the TrainingRecord
        entities = {
            lp.label: TaggedEntity(
                label=lp.label,
                confidence=ext.confidence,
                spans=[EntitySpan(text=s.text, start=s.start, end=s.end)
                       for s in find_spans(ext.matches, text, fuzzy=True)],
            )
            for lp, ext in zip(label_prompts, extractions, strict=True)
        }
        records.append(TrainingRecord(input_text=text, entities=entities))
    return records


# your_texts is whatever iterable of strings you want labelled
records = asyncio.run(generate(your_texts))
That's the whole story. Add JSONL output and BIO conversion:
from py_agent_ner import write_records_jsonl, jsonl_to_bio
write_records_jsonl(records, "training_data.jsonl")
jsonl_to_bio("training_data.jsonl", "training_data.bio.tsv")
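To make the BIO step concrete, here is a minimal sketch of turning character spans into BIO tags. It assumes whitespace tokenization and non-overlapping spans; `tokenize_with_offsets` and `spans_to_bio` are hypothetical names, not the package's `simple_tokenize` or `record_to_bio`, whose actual rules may differ.

```python
import re


def tokenize_with_offsets(text: str) -> list[tuple[str, int, int]]:
    """Split on whitespace, keeping each token's character offsets."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def spans_to_bio(text: str, spans: list[tuple[int, int, str]]) -> list[tuple[str, str]]:
    """Tag each token B-/I-<label> if it lies inside a span, else O.

    spans are (start, end, label) with end exclusive.
    """
    tagged = []
    for token, t_start, t_end in tokenize_with_offsets(text):
        tag = "O"
        for s_start, s_end, label in spans:
            if t_start >= s_start and t_end <= s_end:
                # The first token of a span starts at the span's start offset.
                tag = ("B-" if t_start == s_start else "I-") + label
                break
        tagged.append((token, tag))
    return tagged


text = "Ada Lovelace joined the Royal Society"
spans = [(0, 12, "PERSON"), (24, 37, "ORG")]
bio = spans_to_bio(text, spans)
for token, tag in bio:
    print(f"{token}\t{tag}")
```

A whitespace tokenizer keeps trailing punctuation attached to the token ("Society."), which would push that token outside its span; a real tokenizer has to split punctuation for the tags to line up.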
Public API
from py_agent_ner import (
    # Pydantic v2 shapes (every field carries a description)
    LabelPrompt,              # input: label + instructions
    LabelExtraction,          # output of ONE call: label + matches + confidence
    EntitySpan,               # one located occurrence with character offsets
    TaggedEntity,             # one label's spans + confidence
    TrainingRecord,           # input text + per-label TaggedEntities (JSONL row)

    # Prompts: single-label rendering, matching the per-label fan-out
    DEFAULT_BASE_TEMPLATE,
    build_system_prompt,      # render ONE label's system prompt
    label_prompts_from_dict,  # {label: instructions} → [LabelPrompt]
    load_label_prompts,       # dir of .jinja → [LabelPrompt]
    labels_of,                # convenience

    # Schema: 5-line helper around create_model + Literal[label]
    single_label_schema,      # name → type[LabelExtraction] with Literal[name]

    # BIO conversion (JSONL → CoNLL TSV for HuggingFace / spaCy / Flair)
    simple_tokenize,
    record_to_bio,
    jsonl_to_bio,
    records_to_bio,
    write_records_jsonl,
)
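The schema helper is described as a thin wrapper around Pydantic's `create_model` plus `Literal[label]`. A rough sketch of that idea follows. `LabelExtractionSketch` and `single_label_schema_sketch` are hypothetical names, and the field set is assumed from the comments above; the package's real model may carry different fields or constraints.

```python
from typing import Literal

from pydantic import BaseModel, Field, ValidationError, create_model


class LabelExtractionSketch(BaseModel):
    """Hypothetical stand-in for the package's LabelExtraction."""

    label: str = Field(description="Label this response is about")
    matches: list[str] = Field(default_factory=list, description="Verbatim substrings")
    confidence: float = Field(default=0.0, description="Model confidence")


def single_label_schema_sketch(name: str) -> type[LabelExtractionSketch]:
    # Literal[name] narrows `label` to exactly one allowed value,
    # so a response about any other label fails validation.
    return create_model(
        f"{name.title()}Extraction",
        __base__=LabelExtractionSketch,
        label=(Literal[name], ...),
    )


PersonSchema = single_label_schema_sketch("PERSON")
ok = PersonSchema(label="PERSON", matches=["Ada Lovelace"], confidence=0.9)

try:
    PersonSchema(label="ORG", matches=[])
    rejected = False
except ValidationError:
    rejected = True

print(ok.label, rejected)
```

Pinning the label in the schema means the model cannot drift to a neighbouring label within a call; any such drift surfaces as a validation error.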
What py-agent-ner intentionally does NOT provide
- No extract_training_data-style runner. You write the per-text loop.
- No "merge" helper. Building the dict[label, TaggedEntity] from N LabelExtractions is five lines you write in your code.
- No multi-label wrapper class. The LLM returns LabelExtraction per call. There is no "Extraction containing entities for all labels" type.
- No DAG plumbing inside the package. Use asyncio.gather (cheapest), or wire your fan-out as steps on py-agent-lib's DagExecutor if you want its observers / retries / cancellation / snapshots.
Architecture
your run.py
  │  loops over texts
  │  fans out N LLM calls per text via asyncio.gather
  │  assembles TrainingRecord from N LabelExtractions
  │
  ▼
py-agent-ner
  models.py    LabelPrompt, LabelExtraction, EntitySpan, TaggedEntity, TrainingRecord
  prompts.py   build_system_prompt + DEFAULT_BASE_TEMPLATE + loaders
  schema.py    single_label_schema (Literal[label_name] subclass of LabelExtraction)
  bio.py       JSONL → CoNLL BIO TSV
  │
  ▼
py-agent-lib (domain-agnostic; knows nothing about NER)
  adapters/llm       from_ollama, from_openai, from_anthropic, from_litellm
  adapters/spans     find_spans (strict + opt-in rapidfuzz)
  adapters/training  write_records_jsonl, read_records_jsonl (generic Pydantic JSONL)
  (also: DagExecutor, observers, snapshots, etc., for users who want them)
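The span adapter is what turns the model's verbatim substrings into character offsets. The sketch below shows only the strict (exact-match) half of that idea using `str.find`; `find_spans_strict` is a hypothetical name, and the real `find_spans` returns objects with `text`, `start`, and `end` attributes and also offers a fuzzy mode that is not reproduced here.

```python
def find_spans_strict(matches: list[str], text: str) -> list[tuple[str, int, int]]:
    """Locate every exact occurrence of each match in text.

    Returns (substring, start, end) tuples with end exclusive, sorted by start.
    """
    spans = []
    for match in dict.fromkeys(matches):  # de-duplicate while keeping order
        if not match:
            continue  # an empty string would match at every position
        pos = text.find(match)
        while pos != -1:
            spans.append((match, pos, pos + len(match)))
            # Resume after this hit so occurrences never overlap themselves.
            pos = text.find(match, pos + len(match))
    return sorted(spans, key=lambda s: s[1])


text = "Paris is lovely. I moved to Paris in 2019."
spans = find_spans_strict(["Paris", "2019", "Berlin"], text)
for substring, start, end in spans:
    print(substring, start, end)
```

A substring the model hallucinated ("Berlin" here) produces no span at all, which is why asking for verbatim substrings and locating them afterwards is safer than asking the model for offsets directly.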
Training a model from the output
jsonl_to_bio converts the JSONL to BIO TSV — the canonical input for:
- HuggingFace: follow the official token-classification tutorial verbatim.
- spaCy: python -m spacy convert training_data.bio.tsv ./spacy-data/ --converter ner
- GLiNER, Flair, custom: all consume BIO.
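For a custom trainer, reading the TSV back is short. The sketch below assumes the usual CoNLL layout (one `token<TAB>tag` pair per line, a blank line between records); that layout is an assumption about the output of `jsonl_to_bio`, so check a generated file before relying on it. `read_bio` is a hypothetical helper.

```python
import io


def read_bio(lines) -> list[tuple[list[str], list[str]]]:
    """Parse CoNLL-style lines into (tokens, tags) pairs, one per record."""
    sentences, tokens, tags = [], [], []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current record.
            if tokens:
                sentences.append((tokens, tags))
                tokens, tags = [], []
            continue
        token, tag = line.split("\t")
        tokens.append(token)
        tags.append(tag)
    if tokens:  # file may not end with a blank line
        sentences.append((tokens, tags))
    return sentences


# In practice: open("training_data.bio.tsv", encoding="utf-8")
sample = io.StringIO("Ada\tB-PERSON\nLovelace\tI-PERSON\nwrote\tO\n\nParis\tB-LOC\n")
sentences = read_bio(sample)
label_set = sorted({tag for _, tags in sentences for tag in tags})
print(len(sentences), label_set)
```

The sorted tag set is typically what a token-classification head needs for its label-to-id mapping.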
Tests
uv run pytest packages/py-agent-ner/tests/
Tests don't need a live model — they use a FakeLlm that returns canned
LabelExtraction responses for the per-label calls.
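If you want the same kind of test double in your own project, the idea can be sketched as follows. Everything here is hypothetical: this `FakeLlm` is not the one in the package's test suite, `CannedExtraction` stands in for `LabelExtraction`, and `extract` takes the label directly rather than a `response_model`.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class CannedExtraction:
    """Hypothetical stand-in for a LabelExtraction response."""

    label: str
    matches: list[str] = field(default_factory=list)
    confidence: float = 1.0


class FakeLlm:
    """Returns pre-baked responses keyed by label and records each call."""

    def __init__(self, canned: dict[str, CannedExtraction]):
        self.canned = canned
        self.calls: list[tuple[str, str]] = []

    async def extract(self, *, system: str, user: str, label: str) -> CannedExtraction:
        self.calls.append((label, user))
        # Unknown labels get an empty, zero-confidence response.
        return self.canned.get(label, CannedExtraction(label=label, confidence=0.0))


async def run_labels(llm: FakeLlm, text: str, labels: list[str]) -> list[CannedExtraction]:
    return await asyncio.gather(
        *(llm.extract(system=f"Extract {lb}", user=text, label=lb) for lb in labels)
    )


def test_fan_out_uses_one_call_per_label():
    llm = FakeLlm({"PERSON": CannedExtraction("PERSON", ["Ada Lovelace"], 0.9)})
    results = asyncio.run(run_labels(llm, "Ada Lovelace wrote.", ["PERSON", "ORG"]))
    assert [r.label for r in results] == ["PERSON", "ORG"]
    assert results[0].matches == ["Ada Lovelace"]
    assert results[1].matches == []
    assert len(llm.calls) == 2


test_fan_out_uses_one_call_per_label()
```

Recording the calls lets a test assert on the fan-out itself (one call per label) as well as on the merged result.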