An easy-to-extend LLM annotator for robust, resumable data annotation.

Robust, resumable LLM dataset annotation

llm-annotator is a Python 3.12+ library for robust, resumable LLM-driven dataset annotation and generation.

It supports multiple providers through pluggable clients:

  • vLLM offline inference: VLLMOfflineClient
  • vLLM server API: VLLMClient
  • OpenAI API: OpenAIClient
  • Anthropic API: ClaudeClient
  • Gemini API: GeminiClient

Key capabilities:

  • Resumable processing with JSONL checkpoints.
  • Annotation of existing datasets and generation from scratch.
  • Structured outputs via JSON schema.
  • Retry and validation hooks for robust pipelines.
  • Optional periodic uploads to the Hugging Face Hub.
  • Context-manager cleanup of client resources.
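
To illustrate what resumable processing with JSONL checkpoints means in general, here is a minimal, library-independent sketch. It is not llm-annotator's implementation: the function names, the `id` field, and the checkpoint file name are all assumptions made for the example.

```python
import json
import tempfile
from pathlib import Path


def load_done_ids(checkpoint: Path) -> set[int]:
    """Collect the ids of records already written to the checkpoint."""
    if not checkpoint.exists():
        return set()
    with checkpoint.open(encoding="utf-8") as fh:
        return {json.loads(line)["id"] for line in fh if line.strip()}


def process_resumable(items: list[dict], checkpoint: Path, annotate) -> int:
    """Annotate only unseen items, appending one JSON line per result.

    Returns the number of items processed in this run.
    """
    done = load_done_ids(checkpoint)
    processed = 0
    with checkpoint.open("a", encoding="utf-8") as fh:
        for item in items:
            if item["id"] in done:
                continue  # finished in an earlier run
            record = {"id": item["id"], "output": annotate(item["text"])}
            fh.write(json.dumps(record) + "\n")
            fh.flush()  # hand the record to the OS before moving on
            processed += 1
    return processed


# Hypothetical data; str.upper stands in for an LLM call.
items = [{"id": i, "text": f"sample {i}"} for i in range(5)]
with tempfile.TemporaryDirectory() as tmp:
    ckpt = Path(tmp) / "checkpoint.jsonl"
    first_run = process_resumable(items[:3], ckpt, str.upper)  # simulates an interrupted run
    second_run = process_resumable(items, ckpt, str.upper)  # resumes, skipping ids 0-2
    total_lines = len(ckpt.read_text(encoding="utf-8").splitlines())
```

Because each result is appended as its own line, a crash loses at most the record being written, and a rerun only pays for the items that are still missing.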

Documentation

Read the full documentation at bramvanroy.github.io/llm-annotator.

Provider setup reference: docs/provider-info.md

Installation

Recommended:

uv add llm-annotator

or

pip install llm-annotator

Install provider extras as needed:

uv add "llm-annotator[vllm]"
uv add "llm-annotator[openai]"
uv add "llm-annotator[anthropic]"
uv add "llm-annotator[gemini]"

See docs/provider-info.md for authentication environment variables and provider-specific setup notes.

For local vLLM runs, install the flashinfer packages that match your CUDA version:

uv pip install flashinfer-python flashinfer-cubin
# JIT cache package (replace cu128 with your CUDA variant)
uv pip install flashinfer-jit-cache --index-url https://flashinfer.ai/whl/cu128

Usage

Annotate an existing dataset:

from llm_annotator import Annotator, VLLMOfflineClient

# Use a local vLLM model
client = VLLMOfflineClient(
    model="meta-llama/Llama-3.2-3B-Instruct",
    max_model_len=4096,
)

with Annotator(client=client, verbose=True) as anno:
    ds = anno.annotate_dataset(
        output_dir="outputs/sentiment",
        prompt_template="Classify the sentiment of this text: {text}",
        dataset_name="stanfordnlp/imdb",
        dataset_split="test",
        max_num_samples=100,
    )
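
The `{text}` placeholder in `prompt_template` suggests that each prompt is built by substituting dataset columns into the template. A minimal sketch of that substitution with plain `str.format`, assuming placeholders map to column names; the library's actual rendering logic may differ, and the row below is made up:

```python
prompt_template = "Classify the sentiment of this text: {text}"

# Hypothetical row shaped like an IMDB sample (text and label columns).
row = {"text": "A surprisingly moving film.", "label": 1}

# str.format ignores unused keyword arguments, so extra columns are harmless.
prompt = prompt_template.format(**row)
```

If the rendering is indeed `str.format`-style, literal braces in a template (for instance in an inline JSON example) would need to be doubled as `{{` and `}}`.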

Generate a dataset from scratch:

from llm_annotator import Annotator, OpenAIClient

client = OpenAIClient(model="gpt-4o-mini")

with Annotator(client=client) as anno:
    ds = anno.generate_dataset(
        output_dir="outputs/generated-qa",
        prompts="Write a short geography quiz question with answer.",
        max_num_samples=200,
    )
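
Both examples use `Annotator` in a `with` block, which ties in with the context-manager cleanup listed under key capabilities. The general pattern can be sketched with hypothetical stand-in classes (`FakeClient` and `ManagedRunner` are not part of llm-annotator):

```python
class FakeClient:
    """Hypothetical stand-in for a provider client that holds resources."""

    def __init__(self) -> None:
        self.closed = False

    def close(self) -> None:
        self.closed = True


class ManagedRunner:
    """Hypothetical wrapper that closes its client when the with-block exits."""

    def __init__(self, client: FakeClient) -> None:
        self.client = client

    def __enter__(self) -> "ManagedRunner":
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        self.client.close()  # runs on normal exit and on exceptions
        return False  # do not swallow exceptions


client = FakeClient()
try:
    with ManagedRunner(client):
        raise RuntimeError("simulated failure mid-run")
except RuntimeError:
    pass  # the error still propagates, but the client has been closed
```

The point of the pattern is that cleanup happens even when a run fails partway through, which matters for clients that hold GPU memory or open connections.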

See the documentation for more examples, including:

  • Structured output with JSON schemas
  • Custom validation and post-processing
  • Large-scale streaming annotation
  • Generating datasets from scratch
  • Multi-GPU support

Or check out the examples/ directory for complete working examples.

Testing

Install development dependencies first:

uv sync --dev

Run the default checks:

make style
make quality
make test
make typecheck

Pytest marker targets:

# Fast tests (same as `make test`)
make test-fast

# Slow tests only
make test-slow

# Integration tests only
make test-integration

# Entire suite (fast + slow)
make test-all

You can also run markers directly with pytest:

uv run pytest -m "not slow"
uv run pytest -m "slow"
uv run pytest -m "integration"

Slow and integration tests may load local models, require more runtime, or depend on optional components.

Building documentation

Local versioned docs preview (uses mike on a temporary local branch):

make serve-docs

Override version metadata when needed:

make serve-docs DOCS_VERSION=0.4.0 DOCS_ALIAS=latest DOCS_SOURCE_REF=v0.4.0

Docs are published with mike on release tags through .github/workflows/docs.yml.
