An easy-to-extend LLM annotator for robust, resumable data annotation.
Robust, resumable LLM dataset annotation
llm-annotator is a Python 3.12+ library for robust, resumable
LLM-driven dataset annotation and generation.
It supports multiple providers through pluggable clients:
- vLLM offline inference: VLLMOfflineClient
- vLLM server API: VLLMClient
- OpenAI API: OpenAIClient
- Anthropic API: ClaudeClient
Key capabilities:
- Staged pipeline: prepare_data + run_annotation separates expensive template application and sorting from model inference, enabling SLURM and cluster restart workflows.
- Resumable processing with JSONL checkpoints.
- Annotation of existing datasets and generation from scratch.
- Structured outputs via JSON schema.
- Retry and validation hooks for robust pipelines.
- Optional Hugging Face Hub upload cadence for both prepared data and outputs.
- Context-manager cleanup of client resources.
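To make the resumable-processing idea concrete, here is a minimal sketch of JSONL checkpointing in plain Python. It is not llm-annotator's implementation: the function names, the idx field, and the upper-casing stand-in for a model are all hypothetical and only illustrate the general technique of skipping samples that already have a line in the checkpoint file.

```python
import json
import tempfile
from pathlib import Path


def load_done_ids(checkpoint: Path) -> set[int]:
    """Collect the indices of samples already written to the checkpoint."""
    if not checkpoint.exists():
        return set()
    with checkpoint.open(encoding="utf-8") as fh:
        return {json.loads(line)["idx"] for line in fh if line.strip()}


def annotate_with_checkpoint(samples, annotate, checkpoint: Path) -> int:
    """Annotate unseen samples, appending one JSON line each; return how many were new."""
    done = load_done_ids(checkpoint)
    newly_processed = 0
    with checkpoint.open("a", encoding="utf-8") as fh:
        for idx, text in enumerate(samples):
            if idx in done:
                continue  # finished in an earlier run
            fh.write(json.dumps({"idx": idx, "output": annotate(text)}) + "\n")
            fh.flush()  # push each record out of Python's buffer right away
            newly_processed += 1
    return newly_processed


def fake_model(text: str) -> str:
    """Hypothetical stand-in for an LLM call."""
    return text.upper()


with tempfile.TemporaryDirectory() as tmp:
    ckpt = Path(tmp) / "checkpoint.jsonl"
    data = ["good film", "bad film", "fine film"]
    # Simulate an interrupted first run that only reached two samples.
    first_run = annotate_with_checkpoint(data[:2], fake_model, ckpt)
    # The second run sees the checkpoint and only processes what is missing.
    second_run = annotate_with_checkpoint(data, fake_model, ckpt)
    total_lines = len(ckpt.read_text(encoding="utf-8").splitlines())
```

Appending one line per sample means an interrupted run loses at most the sample that was in flight, which is the property that makes restarts cheap.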
It is not intended for parallel, multi-node, multi-instance generation.
If that is what you need, consider datatrove instead.
Documentation
Read the full documentation at bramvanroy.github.io/llm-annotator.
Provider setup reference: docs/provider-info.md
Installation
Recommended:
uv add llm-annotator
or
pip install llm-annotator
Install provider extras as needed:
uv add "llm-annotator[vllm]"
uv add "llm-annotator[openai]"
uv add "llm-annotator[anthropic]"
See docs/provider-info.md for auth environment variables and provider-specific setup notes.
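Before constructing a client, it can help to confirm which optional backends are importable in the current environment. The sketch below uses only the standard library; the mapping from extra name to top-level module name (vllm, openai, anthropic) is an assumption based on those packages' usual import names, not something taken from llm-annotator itself.

```python
from importlib.util import find_spec

# Assumed mapping: extra name -> top-level module that extra is expected to provide.
EXTRA_MODULES = {"vllm": "vllm", "openai": "openai", "anthropic": "anthropic"}


def available_extras(extra_modules: dict[str, str] = EXTRA_MODULES) -> dict[str, bool]:
    """Report, per extra, whether its module can be found without importing it."""
    return {
        extra: find_spec(module) is not None
        for extra, module in extra_modules.items()
    }


status = available_extras()
for extra, installed in status.items():
    print(f"{extra}: {'installed' if installed else 'missing'}")
```

Using find_spec avoids actually importing heavy packages such as vLLM just to check for their presence.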
Usage
One-step convenience
Annotate an existing dataset:
from llm_annotator import Annotator, VLLMOfflineClient

client = VLLMOfflineClient(
    model="meta-llama/Llama-3.2-3B-Instruct",
    max_model_len=4096,
)

with Annotator(client=client, verbose=True) as anno:
    ds = anno.annotate_dataset(
        output_dir="outputs/sentiment",
        prompt_template="Classify the sentiment of this text: {text}",
        dataset_name="stanfordnlp/imdb",
        dataset_split="test",
        max_num_samples=100,
    )
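As a rough illustration of what a prompt template like the one above expresses, the sketch below fills a {text} placeholder from dataset rows. It assumes that placeholders name dataset columns and that Python's str.format semantics apply; how llm-annotator renders templates internally may differ, and render_prompt is a hypothetical helper.

```python
PROMPT_TEMPLATE = "Classify the sentiment of this text: {text}"


def render_prompt(template: str, row: dict) -> str:
    """Fill {column} placeholders from one row; fail clearly on a missing column."""
    try:
        return template.format_map(row)
    except KeyError as exc:
        raise KeyError(
            f"placeholder {exc} has no matching column; available: {sorted(row)}"
        ) from exc


# Two made-up rows shaped like a text-classification dataset.
rows = [
    {"text": "A wonderful, moving film.", "label": 1},
    {"text": "Two hours I will never get back.", "label": 0},
]
prompts = [render_prompt(PROMPT_TEMPLATE, row) for row in rows]
print(prompts[0])
```

Columns that the template does not mention (label here) are simply ignored, so the same rows can feed different templates.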
Generate a dataset from scratch:
from llm_annotator import Annotator, OpenAIClient

client = OpenAIClient(model="gpt-4o-mini")

with Annotator(client=client) as anno:
    ds = anno.generate_dataset(
        output_dir="outputs/generated-qa",
        prompts="Write a short geography quiz question with answer.",
        max_num_samples=200,
    )
Two-step staged workflow
For large datasets or cluster (SLURM) environments, split the pipeline
explicitly into a preparation step and a generation step. prepare_data
applies prompt templates, optional sorting, and saves the prepared
artifacts locally and to Hugging Face Hub. run_annotation then handles
only model inference. If generation fails, re-run run_annotation with
prepared_hub_id pointing to the Hub backup: preparation is skipped.
from llm_annotator import Annotator, VLLMOfflineClient

client = VLLMOfflineClient(
    model="meta-llama/Llama-3.2-3B-Instruct",
    max_model_len=4096,
)

HUB_ID = "my-org/imdb-prepared"  # Hub repo for prepared data backup

with Annotator(client=client, verbose=True) as anno:
    # Step 1: prepare data (reuses local cache or Hub backup if available)
    prepared_dataset, local_path, hub_id = anno.prepare_data(
        output_dir="outputs/imdb-sentiment",
        prompt_template="Classify the sentiment of this text: {text}",
        dataset_name="stanfordnlp/imdb",
        dataset_split="test",
        max_num_samples=100,
        sort_by_length=True,
        prepared_hub_id=HUB_ID,
    )

    # Step 2: run generation against the prepared data
    ds = anno.run_annotation(
        output_dir="outputs/imdb-sentiment",
        prompt_template="Classify the sentiment of this text: {text}",
        prepared_dataset=prepared_dataset,
        new_hub_id="my-org/imdb-annotated",
        upload_every_n_samples=500,
    )
To force a fresh preparation (ignoring any cached or Hub-stored artifacts),
pass force_data_preparation=True to prepare_data or to annotate_dataset.
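The reasoning behind the staged workflow can be sketched without the library at all. The toy version below caches rendered prompts on disk, reuses that cache on later calls, and rebuilds it only when forced. Everything here is hypothetical (the prepare and run functions, the prepared.jsonl file name, the force flag, and sorting longest-first); it mirrors the idea described above, not llm-annotator's actual code or artifact layout.

```python
import json
import tempfile
from pathlib import Path


def prepare(rows, template, out_dir: Path, sort_by_length=True, force=False):
    """Stage 1: render prompts, optionally sort them, and cache the result on disk.

    Returns (prepared_items, reused_cache).
    """
    prepared_path = out_dir / "prepared.jsonl"
    if prepared_path.exists() and not force:
        with prepared_path.open(encoding="utf-8") as fh:
            return [json.loads(line) for line in fh], True
    prepared = [
        {"idx": i, "prompt": template.format(**row)} for i, row in enumerate(rows)
    ]
    if sort_by_length:
        # Assumption for this sketch: longest prompts first.
        prepared.sort(key=lambda item: len(item["prompt"]), reverse=True)
    out_dir.mkdir(parents=True, exist_ok=True)
    with prepared_path.open("w", encoding="utf-8") as fh:
        for item in prepared:
            fh.write(json.dumps(item) + "\n")
    return prepared, False


def run(prepared, model):
    """Stage 2: inference only; no template or sorting work happens here."""
    return [{"idx": item["idx"], "output": model(item["prompt"])} for item in prepared]


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "imdb-sentiment"
    rows = [{"text": "ok"}, {"text": "a much longer review"}]
    tpl = "Classify the sentiment of this text: {text}"
    prepared, reused_first = prepare(rows, tpl, out)        # builds the cache
    _, reused_second = prepare(rows, tpl, out)              # reuses the cache
    _, reused_forced = prepare(rows, tpl, out, force=True)  # rebuilds on request
    results = run(prepared, model=len)  # len stands in for a real model call
```

Because stage 2 only consumes the prepared items, a failed inference job can be restarted without paying for stage 1 again.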
See the documentation for more examples, including:
- Structured output with JSON schemas
- Custom validation and post-processing
- Generating datasets from scratch
Or check out the examples/ directory for complete working examples.
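For readers who want a feel for validation combined with retries before opening the examples, here is a small library-independent sketch. The validate and generate_with_retries functions, the expected sentiment field, and the scripted fake model are all hypothetical; llm-annotator's real hook signatures are documented separately and may look different.

```python
import json
from typing import Callable, Optional

ALLOWED_SENTIMENTS = {"positive", "negative"}


def validate(raw: str) -> Optional[dict]:
    """Return the parsed object if it is valid JSON with an allowed sentiment, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or data.get("sentiment") not in ALLOWED_SENTIMENTS:
        return None
    return data


def generate_with_retries(call_model: Callable[[str], str], prompt: str, max_retries: int = 3):
    """Call the model until validation passes; return (result_or_None, attempts_used)."""
    for attempt in range(1, max_retries + 1):
        parsed = validate(call_model(prompt))
        if parsed is not None:
            return parsed, attempt
    return None, max_retries


# A scripted stand-in for a model: two bad answers, then a good one.
scripted = iter(['not json', '{"sentiment": "maybe"}', '{"sentiment": "positive"}'])
result, attempts = generate_with_retries(lambda prompt: next(scripted), "Classify: great film")
print(result, attempts)
```

Keeping validation as a separate function makes it easy to tighten the checks later, for example by validating against a full JSON schema.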
Testing
Install development dependencies first:
uv sync --dev
Run the default checks:
make style
make quality
make test
make typecheck
Pytest marker targets:
# Fast tests (same as `make test`)
make test-fast
# Slow tests only
make test-slow
# Integration tests only
make test-integration
# Entire suite (fast + slow)
make test-all
You can also run markers directly with pytest:
uv run pytest -m "not slow"
uv run pytest -m "slow"
uv run pytest -m "integration"
Slow and integration tests may load local models, require more runtime, or depend on optional components.
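The marker expressions above select tests by the marks attached to them. The sketch below shows how such marks are applied with pytest's standard decorator syntax. The count_tokens helper and the test names are hypothetical and do not come from this project's test suite; markers are normally also registered in the pytest configuration, which is not shown here.

```python
import pytest


def count_tokens(text: str) -> int:
    """Hypothetical helper under test: number of whitespace-separated tokens."""
    return len(text.split())


def test_count_tokens_fast():
    # No marker: selected by `pytest -m "not slow"`.
    assert count_tokens("a short prompt") == 3


@pytest.mark.slow
def test_count_tokens_large_input():
    # Marked slow: selected by `pytest -m "slow"`.
    assert count_tokens("word " * 100_000) == 100_000


@pytest.mark.integration
def test_count_tokens_on_file(tmp_path):
    # Marked integration: selected by `pytest -m "integration"`.
    path = tmp_path / "sample.txt"
    path.write_text("one two three", encoding="utf-8")
    assert count_tokens(path.read_text(encoding="utf-8")) == 3
```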
Building documentation
Local versioned docs preview (uses mike on a temporary local branch):
make serve-docs
Override version metadata when needed:
make serve-docs DOCS_VERSION=0.4.0 DOCS_ALIAS=latest DOCS_SOURCE_REF=v0.4.0
Docs are published with mike on release tags through
.github/workflows/docs.yml.