NER training-data generation built on py-agent-lib. Bring your own labels, LLM, and downstream trainer.
py-agent-ner
NER training-data generation built on py-agent-lib.
Use an LLM to label text → write a JSONL with character-indexed spans → convert to BIO TSV → fine-tune a small NER model.
Install
pip install py-agent-ner
# or, for local dev:
pip install -e .
py-agent-ner pulls in py-agent-lib[adapters] (Pydantic + Jinja2 + instructor
+ httpx + rapidfuzz), so everything you need to run is one install away.
Quickstart (no setup)
import asyncio

from py_agent_lib.adapters.llm import from_ollama
from py_agent_ner import extract_training_data, jsonl_to_bio

# Label name -> instructions given to the LLM for that label
LABELS = {
    "person_name": "Full names. Ignore organizations and locations.",
    "location_reference": "Places, venues, addresses. Include 'at'/'in' if it's part of the phrase.",
    "time_reference": "Times like 'tomorrow', '4:30 PM', 'next Monday'.",
}

TEXTS = ["Jordan Lee at City Library tomorrow 4:30 PM."]


async def main():
    llm = from_ollama("gemma4", think=False)
    records = await extract_training_data(
        texts=TEXTS,
        labels=LABELS,
        llm=llm,
        output_path="training_data.jsonl",
        fuzzy=True,
    )
    jsonl_to_bio("training_data.jsonl", "training_data.bio.tsv")
    return records


asyncio.run(main())
That's it. You get:
- training_data.jsonl: one Pydantic TrainingRecord per line, with character-indexed spans
- training_data.bio.tsv: CoNLL BIO format, ready to feed into HuggingFace token-classification or spacy convert
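To make "character-indexed spans" concrete, here is a minimal sketch of what one JSONL row might look like and how to sanity-check it. The field names (`text`, `entities`, `label`, `confidence`, `spans`, `start`, `end`) and the confidence values are assumptions for illustration only; check the actual TrainingRecord schema before relying on them. The sketch assumes half-open offsets, so `text[start:end]` yields the entity text.

```python
import json

# Hypothetical row shape; the real TrainingRecord fields may differ.
SAMPLE_ROW = json.dumps({
    "text": "Jordan Lee at City Library tomorrow 4:30 PM.",
    "entities": [
        {"label": "person_name", "confidence": 0.9,
         "spans": [{"start": 0, "end": 10}]},
        {"label": "location_reference", "confidence": 0.8,
         "spans": [{"start": 11, "end": 26}]},
        {"label": "time_reference", "confidence": 0.9,
         "spans": [{"start": 27, "end": 35}, {"start": 36, "end": 43}]},
    ],
})


def span_texts(row_json: str) -> list[tuple[str, str]]:
    """Return (label, surface text) for every span in one JSONL row."""
    row = json.loads(row_json)
    found = []
    for entity in row["entities"]:
        for span in entity["spans"]:
            # Offsets index into the original text; end is exclusive.
            surface = row["text"][span["start"]:span["end"]]
            found.append((entity["label"], surface))
    return found
```

Running a check like this over a generated file is a cheap way to confirm that offsets still line up with the source text before training.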
A more complete end-to-end demo lives at examples/quickstart/.
Public API
from py_agent_ner import (
    # Canonical Pydantic shapes
    LabelExtraction,         # the LLM's per-label response
    EntitySpan,              # one character-indexed span
    TaggedEntity,            # one label with confidence + spans
    TrainingRecord,          # one input text + all entities (JSONL row shape)

    # The fan-out-merge pipeline
    build_plan,              # build a Plan from a list of LabelPrompt
    make_handlers,           # build extract+merge handlers for one text
    extract_training_data,   # one-shot: run it, get records, write JSONL
    MERGE_STEP_ID,

    # BIO conversion for downstream training
    simple_tokenize,         # word + punctuation tokenizer with offsets
    record_to_bio,           # one TrainingRecord -> [(token, tag), ...]
    jsonl_to_bio,            # JSONL file -> CoNLL TSV file
    records_to_bio,          # in-memory records -> CoNLL TSV file

    # Prompts (re-exported from py-agent-lib for convenience)
    LabelPrompt,
    label_prompts_from_dict,
    load_label_prompts,
    labels_of,
    DEFAULT_BASE_TEMPLATE,
)
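The BIO conversion step can be pictured with a small standalone sketch. This is not the library's implementation of simple_tokenize or record_to_bio; the helper names `tokenize_with_offsets` and `spans_to_bio`, the regex, and the `(start, end, label)` tuple shape are all assumptions chosen to illustrate the idea of mapping character spans onto word and punctuation tokens.

```python
import re

# Assumed tokenization rule: runs of word characters, or single punctuation marks.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize_with_offsets(text: str) -> list[tuple[str, int, int]]:
    """Split text into (token, start, end) triples; end is exclusive."""
    return [(m.group(), m.start(), m.end()) for m in TOKEN_RE.finditer(text)]


def spans_to_bio(text: str, spans: list[tuple[int, int, str]]) -> list[tuple[str, str]]:
    """Tag each token B-/I-/O according to the character span that contains it."""
    started = set()  # indices of spans that already received their B- token
    tagged = []
    for token, tok_start, tok_end in tokenize_with_offsets(text):
        tag = "O"
        for index, (start, end, label) in enumerate(spans):
            if tok_start >= start and tok_end <= end:
                prefix = "I-" if index in started else "B-"
                tag = prefix + label
                started.add(index)
                break
        tagged.append((token, tag))
    return tagged
```

For example, `spans_to_bio("Jordan Lee at City Library.", [(0, 10, "PER"), (14, 26, "LOC")])` tags "Jordan" as `B-PER`, "Lee" as `I-PER`, "City" as `B-LOC`, "Library" as `I-LOC`, and everything else as `O`. Tokens that only partially overlap a span are left as `O` in this sketch; a real converter has to decide how to handle that case.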
Architecture
py-agent-ner is a thin opinionated layer over py-agent-lib:
your labeler
     │
     ▼
py-agent-ner ─── pipeline.py        (build_plan + make_handlers + extract_training_data)
     │           models.py          (LabelExtraction, EntitySpan, TaggedEntity, TrainingRecord)
     │           bio.py             (simple_tokenize, record_to_bio, jsonl_to_bio)
     │
     ▼
py-agent-lib ─── adapters.llm       (Structured LLM clients: Ollama native, OpenAI, Anthropic, ...)
                 adapters.prompts   (Jinja2 prompt loading + DEFAULT_BASE_TEMPLATE)
                 adapters.spans     (strict + fuzzy span finding)
                 adapters.training  (JSONL read/write)
                 DagExecutor        (fan-out scheduling, retries, observers, snapshots)
Everything in py-agent-ner is replaceable. If you want a different prompt
scaffold, write your own LabelPrompt builder. If you want a different span
strategy, use find_spans directly with your own make_handlers. If you want
a different output format, take the list[TrainingRecord] from
extract_training_data and write whatever you want.
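If you do swap in your own span strategy, the general strict-then-fuzzy idea can be sketched as follows. This is an illustration only, not the behavior of find_spans: the helper name `find_span` and the `min_ratio` threshold are hypothetical, and the sketch uses the standard library's `difflib` in place of rapidfuzz so that it runs without extra dependencies.

```python
from difflib import SequenceMatcher
from typing import Optional


def find_span(text: str, phrase: str, min_ratio: float = 0.8) -> Optional[tuple[int, int]]:
    """Locate phrase in text: exact match first, then the best fuzzy window."""
    if not phrase:
        return None

    # Strict pass: the LLM quoted the text exactly.
    start = text.find(phrase)
    if start != -1:
        return (start, start + len(phrase))

    # Fuzzy pass: slide a window the size of the phrase across the text
    # and keep the first window with the highest similarity ratio.
    width = len(phrase)
    best_span, best_ratio = None, 0.0
    for i in range(max(1, len(text) - width + 1)):
        window = text[i:i + width]
        ratio = SequenceMatcher(None, window.lower(), phrase.lower()).ratio()
        if ratio > best_ratio:
            best_span, best_ratio = (i, i + len(window)), ratio

    # Reject weak matches so hallucinated phrases do not produce spans.
    return best_span if best_ratio >= min_ratio else None
```

A fixed-width window is the simplest possible choice; it handles small typos and casing differences but not insertions or deletions that change the phrase length. The threshold trades recall for precision and would need tuning on real LLM output.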
Training a model from the output
The JSONL output is the canonical intermediate format. To fine-tune a model:
HuggingFace Transformers
from py_agent_ner import jsonl_to_bio
# 1) Convert JSONL → BIO TSV
jsonl_to_bio("training_data.jsonl", "training_data.bio.tsv")
# 2) Load BIO TSV, tokenize with your target model's tokenizer, align labels
# using `word_ids()`, then fine-tune with `Trainer` +
# `DataCollatorForTokenClassification`. This is the official HuggingFace
# token-classification tutorial verbatim:
# https://huggingface.co/docs/transformers/tasks/token_classification
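Step 2 can be sketched without any HuggingFace dependency. The two helpers below are hypothetical, not part of py-agent-ner: `read_bio_tsv` assumes the TSV has one `token<TAB>tag` pair per line with a blank line between sentences, and `align_labels` follows the common convention of labeling only the first sub-token of each word and masking the rest with -100. The `word_ids` argument is the list a fast tokenizer returns from `word_ids()`.

```python
from typing import Optional


def read_bio_tsv(tsv_text: str) -> list[list[tuple[str, str]]]:
    """Parse CoNLL-style text into sentences of (token, tag) pairs."""
    sentences, current = [], []
    for line in tsv_text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split("\t")
        current.append((columns[0], columns[-1]))
    if current:
        sentences.append(current)
    return sentences


def align_labels(
    word_ids: list[Optional[int]],
    word_tags: list[str],
    tag_to_id: dict[str, int],
    ignore_index: int = -100,
) -> list[int]:
    """Expand word-level tags to sub-token labels for token classification."""
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(ignore_index)  # special tokens such as [CLS]/[SEP]
        elif word_id != previous:
            aligned.append(tag_to_id[word_tags[word_id]])  # first sub-token of a word
        else:
            aligned.append(ignore_index)  # later sub-tokens of the same word
        previous = word_id
    return aligned
```

In a real training script, `word_ids` would come from calling the tokenizer with `is_split_into_words=True` on each sentence's tokens, and the output of `align_labels` would become the `labels` field of the encoded example.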
spaCy
python -m spacy convert training_data.bio.tsv ./spacy-data/ --converter ner
python -m spacy init config config.cfg --lang en --pipeline ner
python -m spacy train config.cfg --paths.train ./spacy-data/training_data.bio.spacy
GLiNER / Flair / custom
Most other NER trainers accept BIO data directly or after a short conversion step, so training_data.bio.tsv is a practical starting point for all of them.
Tests
pip install -e ".[dev]"
pytest
FakeLlm in tests/test_pipeline.py returns canned LabelExtraction responses,
so the suite runs in milliseconds without a live model.
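The same stubbing idea works for your own tests. The sketch below is a simplified stand-in, not the FakeLlm from the test suite and not the real pipeline: `CannedLlm`, its `extract` method, and `label_text` are hypothetical names, and the fan-out is done with plain `asyncio.gather` instead of DagExecutor. It shows one request per label running concurrently, followed by a merge into a single record.

```python
import asyncio


class CannedLlm:
    """Hypothetical stub: returns fixed phrases per label, no network calls."""

    def __init__(self, responses: dict[str, list[str]]):
        self.responses = responses
        self.calls: list[str] = []

    async def extract(self, text: str, label: str) -> list[str]:
        self.calls.append(label)
        return self.responses.get(label, [])


async def label_text(text: str, labels: list[str], llm: CannedLlm) -> dict:
    """Fan out one extraction per label, then merge the results."""
    # Fan-out: all label requests are awaited concurrently.
    results = await asyncio.gather(*(llm.extract(text, label) for label in labels))

    # Merge: turn each returned phrase into a character-indexed span.
    entities = []
    for label, phrases in zip(labels, results):
        for phrase in phrases:
            start = text.find(phrase)
            if start != -1:  # drop phrases that do not occur in the text
                entities.append({"label": label, "start": start, "end": start + len(phrase)})
    return {"text": text, "entities": entities}
```

Because the stub is deterministic, assertions can compare the merged record against exact expected offsets, and `llm.calls` confirms that every label was requested once.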