Python port of data-tamer using LiteLLM for structured outputs and batching

data-tamer

Lightweight Python wrappers, built on LiteLLM, for transforming data with structured outputs. They provide compact prompts for lower token usage and batching utilities. Strict structured outputs are supported via Pydantic models or JSON Schema.

Install

Install from PyPI with pip or uv:

pip install data-tamer
# or with uv
uv add data-tamer

Basic usage mirrors the TypeScript API, including its prompt-compaction behavior:

from pydantic import BaseModel
import os
from data_tamer import transform_object, transform_batch


class Person(BaseModel):
    name: str
    age: int | None

# Choose a LiteLLM model id; set provider API keys via env (e.g., OPENAI_API_KEY, OPENROUTER_API_KEY)
model = os.environ.get("LITELLM_MODEL", "gpt-4o-mini")

# Single transform from guidance only
single = transform_object(
    model=model,
    schema=Person,
    prompt_context={
        "instructions": "Extract name and age. Use null when unknown.",
    },
)
print(single["data"])  # -> Person(name=..., age=...)

# Batch transform from compact prompt
inputs = [
    "Jane Doe, 29",
    "Mr. Smith, unknown age",
    {"text": "Alice, 41"},
]

results = transform_batch(
    model=model,
    schema=Person,
    items=inputs,
    batch_size=2,
    prompt_context={
        "instructions": "Extract name and age. Use null when unknown.",
    },
)
print(results)  # list of Person-like dicts

Streaming structured output is supported via data_tamer.stream_transform_object, which uses LiteLLM streaming under the hood.

Async batching

For higher throughput, use the async variant with concurrency:

import asyncio
from pydantic import BaseModel
import os
from data_tamer import async_transform_batch


class Person(BaseModel):
    name: str
    age: int | None


async def main():
    model = os.environ.get("LITELLM_MODEL", "gpt-4o-mini")
    inputs = [f"User {i}, {20 + (i % 40)}" for i in range(100)]
    results = await async_transform_batch(
        model=model,
        schema=Person,
        items=inputs,
        batch_size=10,
        concurrency=5,
        prompt_context={"instructions": "Extract name and age"},
    )
    print(len(results))


asyncio.run(main())
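
To make the batch_size and concurrency parameters concrete, here is a minimal sketch of bounded async batching. It is not data-tamer's implementation: fake_transform is a hypothetical stand-in for one model call, and the semaphore approach is an assumption about one reasonable way to cap in-flight requests with asyncio.

```python
import asyncio


async def fake_transform(batch):
    """Hypothetical stand-in for one model call; returns one result per item."""
    await asyncio.sleep(0.01)
    return [{"input": item} for item in batch]


async def run_batches(items, batch_size, concurrency):
    """Process batches with at most `concurrency` calls in flight at once."""
    # Split the inputs into consecutive batches of at most batch_size items
    batches = [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
    semaphore = asyncio.Semaphore(concurrency)

    async def guarded(batch):
        # The semaphore blocks here once `concurrency` calls are running
        async with semaphore:
            return await fake_transform(batch)

    # gather returns results in the order the awaitables were passed in
    per_batch = await asyncio.gather(*(guarded(b) for b in batches))
    # Flatten the per-batch lists back into one list aligned with the inputs
    return [result for batch in per_batch for result in batch]


inputs = [f"User {i}, {20 + (i % 40)}" for i in range(100)]
results = asyncio.run(run_batches(inputs, batch_size=10, concurrency=5))
print(len(results))  # 100
```

With 100 inputs, batch_size=10, and concurrency=5, this sketch issues 10 calls and runs at most 5 of them at a time.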

Prompt Compaction

The prompt builder:

  • De-duplicates schema guidance and uses short, strict JSON directions.
  • Truncates per-item input via char_limit_per_item.
  • Supports optional system, instructions, and few-shot examples.
  • Treats items as raw inputs (strings or objects). Place guidance and instructions in prompt_context.system or prompt_context.instructions, not in the items.
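
The behavior listed above can be illustrated with a rough sketch. The function build_compact_prompt, its schema_hint argument, and the exact prompt wording are hypothetical and are not taken from data-tamer's source; only the char_limit_per_item name comes from the list above. The sketch states the guidance once and truncates each item before numbering it.

```python
import json


def truncate(text, limit):
    """Cut text to at most `limit` characters."""
    return text if len(text) <= limit else text[:limit]


def build_compact_prompt(items, instructions, schema_hint, char_limit_per_item=200):
    """Assemble one prompt: guidance stated once, then numbered, truncated items."""
    lines = [
        instructions,
        f"Return a JSON array of objects matching: {schema_hint}",
        "Items:",
    ]
    for index, item in enumerate(items, start=1):
        # Strings are used as-is; objects are serialized without extra whitespace
        raw = item if isinstance(item, str) else json.dumps(item, separators=(",", ":"))
        lines.append(f"{index}. {truncate(raw, char_limit_per_item)}")
    return "\n".join(lines)


prompt = build_compact_prompt(
    ["Jane Doe, 29", {"text": "Alice, 41"}],
    instructions="Extract name and age. Use null when unknown.",
    schema_hint='{"name": str, "age": int|null}',
    char_limit_per_item=12,
)
print(prompt)
```

Because the instructions and schema hint appear once per prompt instead of once per item, the token cost of the guidance is shared across the whole batch.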

API

  • transform_object(model, schema, items|prompt_context, ...)

    • Generates a single structured object. If items are provided, a compact prompt is built; otherwise use prompt_context with instructions.
    • schema can be a Pydantic model class or a JSON Schema dict. When supported by the provider, LiteLLM enforces structured output. We also parse JSON and, for dict schemas, validate locally via jsonschema as a fallback.
  • stream_transform_object(...)

    • Streams text chunks and allows awaiting the final parsed object.
  • transform_batch(model, schema, items, batch_size=..., concurrency=...)

    • Splits inputs into batches, builds compact prompts, and parses array outputs. Uses threads when concurrency > 1.
  • async_transform_batch(...)

    • Async variant with concurrency control via asyncio.
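
As an illustration of the transform_batch description above (split into batches, parse array outputs, use threads when concurrency > 1), here is a small sketch using the standard library. The helpers chunk, fake_call, and run_batches are hypothetical names; fake_call stands in for a real model call, and the actual library may organize this differently.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk(items, batch_size):
    """Split a list into consecutive batches of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def fake_call(batch):
    """Hypothetical stand-in for one model call returning an array, one entry per item."""
    return [{"name": text.split(",")[0].strip()} for text in batch]


def run_batches(items, batch_size, concurrency=1):
    """Run one call per batch, using a thread pool only when concurrency > 1."""
    batches = chunk(items, batch_size)
    if concurrency > 1:
        with ThreadPoolExecutor(max_workers=concurrency) as pool:
            # Executor.map yields results in the same order as the input batches
            per_batch = list(pool.map(fake_call, batches))
    else:
        per_batch = [fake_call(batch) for batch in batches]
    # Flatten the per-batch arrays into a single list aligned with the inputs
    return [row for batch in per_batch for row in batch]


rows = run_batches(
    ["Jane Doe, 29", "Mr. Smith, unknown age", "Alice, 41"],
    batch_size=2,
    concurrency=2,
)
print(rows)
```

Threads suit this workload because each call spends most of its time waiting on network I/O, so the pool overlaps the waits without needing an event loop.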

Notes

  • Providers (LiteLLM): pass a model id string (e.g., gpt-4o-mini, openrouter/google/gemini-2.5-flash-lite) and set the corresponding API key in env (OPENAI_API_KEY, OPENROUTER_API_KEY, etc.).
  • Structured outputs:
    • Pydantic: pass a BaseModel subclass as schema. LiteLLM will request structured responses when supported; we parse JSON regardless.
    • JSON Schema: pass a dict; we set LiteLLM response_format={"type":"json_schema",...} and also validate locally with jsonschema.
    • Helpers: pydantic_json_schema, pydantic_array_json_schema generate dict schemas from Pydantic models.
  • OpenRouter: set OPENROUTER_API_KEY and pick an OpenRouter model id via LITELLM_MODEL, e.g., openrouter/google/gemini-2.5-flash-lite.
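
For the JSON Schema path described in the notes above, the following sketch shows the general shape of an OpenAI-style response_format value and a simplified local check. The schema is hand-written for the Person example, the "name" and "strict" values are illustrative choices, and parse_and_check is a hypothetical, much-reduced stand-in for validation with the jsonschema package; none of this is copied from data-tamer.

```python
import json

# Hand-written JSON Schema for the Person example (name required, age nullable)
person_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": ["integer", "null"]},
    },
    "required": ["name", "age"],
    "additionalProperties": False,
}

# OpenAI-style structured-output request; "name" and "strict" are illustrative
response_format = {
    "type": "json_schema",
    "json_schema": {"name": "person", "schema": person_schema, "strict": True},
}


def parse_and_check(raw_text, schema):
    """Parse model text as JSON and verify that the required keys are present.

    This only checks key presence; a full validator would also check types.
    """
    data = json.loads(raw_text)
    missing = [key for key in schema["required"] if key not in data]
    if missing:
        raise ValueError(f"missing required keys: {missing}")
    return data


person = parse_and_check('{"name": "Jane Doe", "age": null}', person_schema)
print(person)
```

Checking locally as well as requesting structured output from the provider guards against providers that accept the request but do not strictly enforce the schema.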

Examples

  • examples/generate_object_example.py — basic structured generation
  • examples/transform_batch_example.py — batching with compact prompts
  • examples/jsonschema_example.py — JSON Schema with validation
  • examples/legacy_contacts.py — real-world cleanup with OpenRouter (default Gemini model)

Download files

Source Distribution

data_tamer-0.1.3.tar.gz (11.2 kB view details)

Uploaded Source

Built Distribution

data_tamer-0.1.3-py3-none-any.whl (11.6 kB view details)

Uploaded Python 3

File details

Details for the file data_tamer-0.1.3.tar.gz.

File metadata

  • Download URL: data_tamer-0.1.3.tar.gz
  • Upload date:
  • Size: 11.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for data_tamer-0.1.3.tar.gz
Algorithm Hash digest
SHA256 c2f91c6a38a1cbdb1222b0983905d51f908d4de9625ec2f9cde06ad85c1b3607
MD5 27eec20652bff866c32f1d4fc20dd6b4
BLAKE2b-256 70a303222ec65be323268ff7a94696c9bd27eaf9b9158edf2a98e8faca3fdb1d

Provenance

The following attestation bundles were made for data_tamer-0.1.3.tar.gz:

Publisher: pypi-publish.yml on seb-lewis/data-tamer-py

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file data_tamer-0.1.3-py3-none-any.whl.

File metadata

  • Download URL: data_tamer-0.1.3-py3-none-any.whl
  • Upload date:
  • Size: 11.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for data_tamer-0.1.3-py3-none-any.whl
Algorithm Hash digest
SHA256 77b753023122ed4228775413bd7f0bb2e020dd553476a6903e5ed13e4906c3ad
MD5 f4a02058328b39d0ab7e9a593ef2e637
BLAKE2b-256 7919a502018b14b2d5366a7477407b373eab9d298a3b31cb02a5de0c19be525e

Provenance

The following attestation bundles were made for data_tamer-0.1.3-py3-none-any.whl:

Publisher: pypi-publish.yml on seb-lewis/data-tamer-py
