An easy-to-extend LLM annotator for robust, resumable data annotation.

Robust, resumable LLM dataset annotation

llm-annotator is a Python 3.12+ library for robust, resumable LLM-driven dataset annotation and generation.

It supports multiple providers through pluggable clients:

  • vLLM offline inference: VLLMOfflineClient
  • vLLM server API: VLLMClient
  • OpenAI API: OpenAIClient
  • Anthropic API: ClaudeClient

Key capabilities:

  • Staged pipeline: prepare_data + run_annotation separates expensive template application and sorting from model inference, enabling SLURM and cluster restart workflows.
  • Resumable processing with JSONL checkpoints.
  • Annotation of existing datasets and generation from scratch.
  • Structured outputs via JSON schema.
  • Retry and validation hooks for robust pipelines.
  • Optional Hugging Face Hub upload cadence for both prepared data and outputs.
  • Context-manager cleanup of client resources.
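To make the idea of resumable processing with JSONL checkpoints concrete, here is a minimal, library-independent sketch. It is not llm-annotator's implementation: the function names (load_done_ids, annotate_resumable), the record layout with "id" and "response" keys, and the annotate callable are all assumptions made for illustration. Each finished sample is appended as one JSON line, and a restarted run skips every id already present in the file.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Iterable


def load_done_ids(checkpoint: Path) -> set:
    """Collect the ids of samples already written to the checkpoint file."""
    if not checkpoint.exists():
        return set()
    done = set()
    with checkpoint.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # ignore blank lines
                done.add(json.loads(line)["id"])
    return done


def annotate_resumable(
    samples: Iterable[dict],
    annotate: Callable[[str], str],  # hypothetical stand-in for a model call
    checkpoint: Path,
) -> int:
    """Annotate samples not yet in the checkpoint; return how many were processed."""
    done = load_done_ids(checkpoint)
    processed = 0
    # Append mode keeps earlier results intact across restarts.
    with checkpoint.open("a", encoding="utf-8") as fh:
        for sample in samples:
            if sample["id"] in done:
                continue  # already annotated in a previous run
            record = {"id": sample["id"], "response": annotate(sample["text"])}
            fh.write(json.dumps(record) + "\n")
            fh.flush()  # hand each record to the OS as soon as it is ready
            processed += 1
    return processed


if __name__ == "__main__":
    data = [{"id": i, "text": f"sample {i}"} for i in range(5)]
    with tempfile.TemporaryDirectory() as tmp:
        ckpt = Path(tmp) / "checkpoint.jsonl"
        # First run is "interrupted" after two samples.
        print(annotate_resumable(data[:2], str.upper, ckpt))
        # Second run only handles the three samples that are still missing.
        print(annotate_resumable(data, str.upper, ckpt))
```

Appending one line per sample means an interrupted job loses at most the sample that was in flight, which is the property that makes restarts cheap.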

It is not intended for parallel, multi-node, multi-instance generation. If that is what you need, consider datatrove instead.

Documentation

Read the full documentation at bramvanroy.github.io/llm-annotator.

Provider setup reference: docs/provider-info.md

Installation

Recommended:

uv add llm-annotator

or

pip install llm-annotator

Install provider extras as needed:

uv add "llm-annotator[vllm]"
uv add "llm-annotator[vllm-flashinfer]"  # Faster if your hardware supports it
uv add "llm-annotator[openai]"
uv add "llm-annotator[anthropic]"

See docs/provider-info.md for auth environment variables and provider-specific setup notes.

Usage

One-step convenience

Annotate an existing dataset:

from llm_annotator import Annotator, VLLMOfflineClient

client = VLLMOfflineClient(
    model="meta-llama/Llama-3.2-3B-Instruct",
    max_model_len=4096,
)

with Annotator(client=client, verbose=True) as anno:
    ds = anno.annotate_dataset(
        output_dir="outputs/sentiment",
        prompt_template="Classify the sentiment of this text: {text}",
        dataset_name="stanfordnlp/imdb",
        dataset_split="test",
        max_num_samples=100,
    )

Generate a dataset from scratch:

from llm_annotator import Annotator, OpenAIClient

client = OpenAIClient(model="gpt-4o-mini")

with Annotator(client=client) as anno:
    ds = anno.generate_dataset(
        output_dir="outputs/generated-qa",
        prompts="Write a short geography quiz question with answer.",
        max_num_samples=200,
    )

Two-step staged workflow

For large datasets or cluster (SLURM) environments, split the pipeline explicitly into a preparation step and a generation step. prepare_data applies prompt templates, optionally sorts the samples, and saves the prepared artifacts locally and to the Hugging Face Hub. run_annotation then handles only model inference. If generation fails, re-run run_annotation with prepared_hub_id pointing to the Hub backup; preparation is then skipped.

from llm_annotator import Annotator, VLLMOfflineClient

client = VLLMOfflineClient(
    model="meta-llama/Llama-3.2-3B-Instruct",
    max_model_len=4096,
)

HUB_ID = "my-org/imdb-prepared"  # Hub repo for prepared data backup

with Annotator(client=client, verbose=True) as anno:
    # Step 1: prepare data (reuses local cache or Hub backup if available)
    prepared_dataset, local_path, hub_id = anno.prepare_data(
        output_dir="outputs/imdb-sentiment",
        prompt_template="Classify the sentiment of this text: {text}",
        dataset_name="stanfordnlp/imdb",
        dataset_split="test",
        max_num_samples=100,
        sort_by_length=True,
        prepared_hub_id=HUB_ID,
    )

    # Step 2: run generation against the prepared data
    ds = anno.run_annotation(
        output_dir="outputs/imdb-sentiment",
        prompt_template="Classify the sentiment of this text: {text}",
        prepared_dataset=prepared_dataset,
        new_hub_id="my-org/imdb-annotated",
        upload_every_n_samples=500,
    )

To force a fresh preparation (ignoring any cached or Hub-stored artifacts), pass force_data_preparation=True to prepare_data or to annotate_dataset.
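Conceptually, the preparation stage turns each dataset row into a finished prompt and optionally reorders the rows before any model is loaded. The sketch below illustrates that idea in plain Python. It is not the library's code: prepare_prompts is a hypothetical helper, character count is used as a rough proxy for prompt length, and sorting longest-first is an arbitrary choice made for this example.

```python
def prepare_prompts(rows, prompt_template, sort_by_length=True):
    """Fill the template for every row and optionally sort by prompt length."""
    prepared = [
        # str.format ignores row keys that the template does not use.
        {**row, "prompt": prompt_template.format(**row)}
        for row in rows
    ]
    if sort_by_length:
        # Longest prompts first (assumption for this sketch).
        prepared.sort(key=lambda row: len(row["prompt"]), reverse=True)
    return prepared


if __name__ == "__main__":
    rows = [
        {"text": "Great film.", "label": 1},
        {"text": "A slow, overlong and ultimately forgettable movie.", "label": 0},
    ]
    template = "Classify the sentiment of this text: {text}"
    for row in prepare_prompts(rows, template):
        print(len(row["prompt"]), row["prompt"])
```

Because this step needs no GPU, its result can be computed once, stored, and reused by every later inference run.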

See the documentation for more examples, including:

  • Structured output with JSON schemas
  • Custom validation and post-processing
  • Generating datasets from scratch

Or check out the examples/ directory for complete working examples.
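As a rough illustration of what a validation hook combined with retries can look like, the following sketch checks that a model reply is valid JSON with an expected field and asks again when it is not. Everything here is an assumption for illustration rather than llm-annotator's API: the names is_valid_sentiment and generate_with_retries, the "sentiment" field, and the generate callable standing in for a model call.

```python
import json
from typing import Callable, Optional


def is_valid_sentiment(raw: str) -> bool:
    """Accept only a JSON object whose "sentiment" is positive or negative."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and data.get("sentiment") in {"positive", "negative"}


def generate_with_retries(
    generate: Callable[[str], str],  # hypothetical stand-in for a model call
    prompt: str,
    validate: Callable[[str], bool],
    max_attempts: int = 3,
) -> Optional[str]:
    """Return the first reply that passes validation, or None if all attempts fail."""
    for _ in range(max_attempts):
        raw = generate(prompt)
        if validate(raw):
            return raw
    return None


if __name__ == "__main__":
    # A fake model that needs three attempts to produce a usable answer.
    replies = iter(["not json", '{"sentiment": "maybe"}', '{"sentiment": "positive"}'])
    result = generate_with_retries(lambda _prompt: next(replies), "Classify: ...", is_valid_sentiment)
    print(result)
```

Returning None instead of raising lets the caller decide whether a sample that never validates should be skipped, logged, or retried later.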

Testing

Install development dependencies first:

uv sync --dev

Run the default checks:

make style
make quality
make test
make typecheck

Pytest marker targets:

# Fast tests (same as `make test`)
make test-fast

# Slow tests only
make test-slow

# Integration tests only
make test-integration

# Entire suite (fast + slow)
make test-all

You can also run markers directly with pytest:

uv run pytest -m "not slow"
uv run pytest -m "slow"
uv run pytest -m "integration"

Slow and integration tests may load local models, require more runtime, or depend on optional components.

Building documentation

Local versioned docs preview (uses mike on a temporary local branch):

make serve-docs

Override version metadata when needed:

make serve-docs DOCS_VERSION=0.4.0 DOCS_ALIAS=latest DOCS_SOURCE_REF=v0.4.0

Docs are published with mike on release tags through .github/workflows/docs.yml.
