Python port of data-tamer using LiteLLM for structured outputs and batching

data-tamer

Lightweight Python wrappers (built on LiteLLM) for transforming data with structured outputs, compact prompts for lower token usage, and batching utilities. Strict structured outputs are supported via Pydantic models or JSON Schema.

Install

Install from PyPI with pip or uv:

pip install data-tamer
# or with uv
uv add data-tamer

Basic usage mirrors the TypeScript API, including its prompt-compaction behavior:

from pydantic import BaseModel
import os
from data_tamer import transform_object, transform_batch


class Person(BaseModel):
    name: str
    age: int | None

# Choose a LiteLLM model id; set provider API keys via env (e.g., OPENAI_API_KEY, OPENROUTER_API_KEY)
model = os.environ.get("LITELLM_MODEL", "gpt-4o-mini")

# Single transform from guidance only
single = transform_object(
    model=model,
    schema=Person,
    prompt_context={
        "instructions": "Extract name and age. Use null when unknown.",
    },
)
print(single["data"])  # -> Person(name=..., age=...)

# Batch transform from compact prompt
inputs = [
    "Jane Doe, 29",
    "Mr. Smith, unknown age",
    {"text": "Alice, 41"},
]

results = transform_batch(
    model=model,
    schema=Person,
    items=inputs,
    batch_size=2,
    prompt_context={
        "instructions": "Extract name and age. Use null when unknown.",
    },
)
print(results)  # list of Person-like dicts

Streaming structured output is supported via data_tamer.stream_transform_object (LiteLLM streaming under the hood).

Async batching

For higher throughput, use the async variant with concurrency:

import asyncio
from pydantic import BaseModel
import os
from data_tamer import async_transform_batch


class Person(BaseModel):
    name: str
    age: int | None


async def main():
    model = os.environ.get("LITELLM_MODEL", "gpt-4o-mini")
    inputs = [f"User {i}, {20 + (i % 40)}" for i in range(100)]
    results = await async_transform_batch(
        model=model,
        schema=Person,
        items=inputs,
        batch_size=10,
        concurrency=5,
        prompt_context={"instructions": "Extract name and age"},
    )
    print(len(results))


asyncio.run(main())
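
To illustrate what batch_size and concurrency control, here is a minimal standard-library sketch of bounded-concurrency batching. It is not data-tamer's implementation: chunked, fake_llm_call, and run_batches are hypothetical names, and the fake call only parses the input string instead of contacting a provider.

```python
import asyncio


def chunked(items, batch_size):
    """Split a list into consecutive batches of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


async def fake_llm_call(batch):
    """Hypothetical stand-in for one provider request; returns one dict per input."""
    await asyncio.sleep(0)  # yield to the event loop, as a real network call would
    return [
        {"name": text.split(",")[0], "age": int(text.split(",")[1])}
        for text in batch
    ]


async def run_batches(items, batch_size, concurrency):
    """Run all batches with at most `concurrency` requests in flight at once."""
    semaphore = asyncio.Semaphore(concurrency)

    async def guarded(batch):
        async with semaphore:
            return await fake_llm_call(batch)

    # gather returns results in the order the awaitables were passed in
    per_batch = await asyncio.gather(
        *(guarded(batch) for batch in chunked(items, batch_size))
    )
    return [row for batch in per_batch for row in batch]


inputs = [f"User {i}, {20 + (i % 40)}" for i in range(100)]
results = asyncio.run(run_batches(inputs, batch_size=10, concurrency=5))
print(len(results))  # 100
```

With 100 inputs and batch_size=10 there are ten requests in total, and the semaphore keeps no more than five of them active at the same time.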

Prompt Compaction

The prompt builder:

  • De-duplicates schema guidance and uses short, strict JSON directions.
  • Truncates per-item input via char_limit_per_item.
  • Supports optional system, instructions, and few-shot examples.
  • Treats items as raw inputs (strings or objects). Place guidance in prompt_context.system or prompt_context.instructions rather than in the items.
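
The points above can be sketched with a few lines of standard-library Python. This is an illustration under assumptions, not the library's actual prompt builder: truncate_item, build_compact_prompt, and the schema_hint string are hypothetical, and only char_limit_per_item is a name taken from this README.

```python
import json


def truncate_item(item, char_limit_per_item):
    """Serialize one raw input and cut it to at most char_limit_per_item characters."""
    if isinstance(item, str):
        text = item
    else:
        # Compact separators avoid spending characters on whitespace
        text = json.dumps(item, separators=(",", ":"))
    return text[:char_limit_per_item]


def build_compact_prompt(items, schema_hint, instructions=None, char_limit_per_item=200):
    """Hypothetical builder: schema guidance is stated once, items are numbered and truncated."""
    lines = []
    if instructions:
        lines.append(instructions)
    lines.append(
        f"Return only a JSON array of {len(items)} objects matching: {schema_hint}"
    )
    for index, item in enumerate(items, start=1):
        lines.append(f"{index}. {truncate_item(item, char_limit_per_item)}")
    return "\n".join(lines)


prompt = build_compact_prompt(
    ["Jane Doe, 29", {"text": "Alice, 41"}],
    schema_hint='{"name": string, "age": integer or null}',
    instructions="Extract name and age. Use null when unknown.",
    char_limit_per_item=12,
)
print(prompt)
```

The schema guidance appears once for the whole batch rather than once per item, which is where most of the token saving in a batched prompt would come from.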

API

  • transform_object(model, schema, items|prompt_context, ...)

    • Generates a single structured object. If items are provided, a compact prompt is built; otherwise use prompt_context with instructions.
    • schema can be a Pydantic model class or a JSON Schema dict. When supported by the provider, LiteLLM enforces structured output. We also parse JSON and, for dict schemas, validate locally via jsonschema as a fallback.
  • stream_transform_object(...)

    • Streams text chunks and allows awaiting the final parsed object.
  • transform_batch(model, schema, items, batch_size=..., concurrency=...)

    • Splits inputs into batches, builds compact prompts, and parses array outputs. Uses threads when concurrency > 1.
  • async_transform_batch(...)

    • Async variant with concurrency control via asyncio.
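
As a rough picture of the synchronous batching path described above (split into batches, use threads when concurrency > 1), consider the following sketch. It assumes nothing about data-tamer's internals: split_batches, process_batch, and run_batches_threaded are hypothetical names, and process_batch stands in for a real model request.

```python
from concurrent.futures import ThreadPoolExecutor


def split_batches(items, batch_size):
    """Split inputs into consecutive batches of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def process_batch(batch):
    """Hypothetical stand-in for one request that returns one entry per input."""
    return [{"input": item, "length": len(str(item))} for item in batch]


def run_batches_threaded(items, batch_size, concurrency=1):
    """Process batches sequentially, or on a thread pool when concurrency > 1."""
    batches = split_batches(items, batch_size)
    if concurrency > 1:
        with ThreadPoolExecutor(max_workers=concurrency) as pool:
            # Executor.map yields results in input order, so output order is stable
            per_batch = list(pool.map(process_batch, batches))
    else:
        per_batch = [process_batch(batch) for batch in batches]
    return [row for batch in per_batch for row in batch]


items = ["Jane Doe, 29", "Mr. Smith, unknown age", {"text": "Alice, 41"}]
sequential = run_batches_threaded(items, batch_size=2)
threaded = run_batches_threaded(items, batch_size=2, concurrency=2)
print(sequential == threaded)
```

Threads suit this workload because each batch spends most of its time waiting on network I/O rather than on CPU work.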

Notes

  • Providers (LiteLLM): pass a model id string (e.g., gpt-4o-mini, openrouter/google/gemini-2.5-flash-lite) and set the corresponding API key in env (OPENAI_API_KEY, OPENROUTER_API_KEY, etc.). Alternatively, pass credentials directly via provider_options, e.g. provider_options={"api_key": "sk-...", "api_base": "https://..."}.
  • Structured outputs:
    • Pydantic: pass a BaseModel subclass as schema. LiteLLM will request structured responses when supported; we parse JSON regardless.
    • JSON Schema: pass a dict; we set LiteLLM response_format={"type":"json_schema",...} and also validate locally with jsonschema.
    • Helpers: pydantic_json_schema, pydantic_array_json_schema generate dict schemas from Pydantic models.
  • OpenRouter: set OPENROUTER_API_KEY and pick an OpenRouter model id via LITELLM_MODEL, e.g., openrouter/google/gemini-2.5-flash-lite. Or pass provider_options={"api_key": "..."} with an OpenRouter model id.
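
The "parse JSON regardless, then validate locally" idea from the structured-output notes can be sketched as follows. This is a simplified illustration: parse_and_check is a hypothetical helper, and its required-key check is only a stand-in for full validation with the jsonschema package, which the library is described as using for dict schemas.

```python
import json

# A JSON Schema dict equivalent to the Person model used earlier
person_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": ["integer", "null"]},
    },
    "required": ["name", "age"],
    "additionalProperties": False,
}


def parse_and_check(raw_text, schema):
    """Parse a model reply as JSON and run a minimal required-key check.

    Hypothetical helper: a real fallback would validate types and
    constraints too, e.g. with jsonschema.validate(data, schema).
    """
    data = json.loads(raw_text)
    missing = [key for key in schema.get("required", []) if key not in data]
    if missing:
        raise ValueError(f"missing required keys: {missing}")
    return data


person = parse_and_check('{"name": "Jane Doe", "age": null}', person_schema)
print(person)
```

Validating locally means a malformed reply is caught even when the provider does not enforce the schema itself.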

Examples

  • examples/generate_object_example.py — basic structured generation
  • examples/transform_batch_example.py — batching with compact prompts
  • examples/jsonschema_example.py — JSON Schema with validation
  • examples/legacy_contacts.py — real-world cleanup with OpenRouter (default Gemini model)
