Vetch SDK

Planet-aware observability for LLM inference.

Vetch is a Python SDK that wraps LLM API calls to log energy consumption, cost, and carbon per inference using live grid data. It never reads prompt or completion content—only metadata from the response usage.

Why Vetch?

Attributed Spend, Not Just Total Spend

Provider dashboards (OpenAI Usage, Anthropic Console, Google Cloud Billing) show you total spend. Vetch shows you attributed spend. Using tags, you can track cost per feature, cost per user, or cost per environment in real time, without building custom infrastructure.
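
As a rough illustration of what tag-based attribution means, the sketch below groups per-call events by one tag and sums their cost. The event dictionaries are hand-written stand-ins: the `tags` key and the `cost_by_tag` helper are assumptions made for this example, and only the `estimated_cost_usd` field name is taken from the examples later in this README.

```python
from collections import defaultdict

# Hand-written stand-ins for per-call events (hypothetical shape).
events = [
    {"tags": {"feature": "search", "env": "prod"}, "estimated_cost_usd": 0.012},
    {"tags": {"feature": "chat", "env": "prod"}, "estimated_cost_usd": 0.030},
    {"tags": {"feature": "search", "env": "dev"}, "estimated_cost_usd": 0.004},
]


def cost_by_tag(events, tag_key):
    """Sum estimated cost per value of one tag; untagged events go under None."""
    totals = defaultdict(float)
    for event in events:
        totals[event["tags"].get(tag_key)] += event["estimated_cost_usd"]
    return dict(totals)


per_feature = cost_by_tag(events, "feature")
per_env = cost_by_tag(events, "env")
```

The same events answer both questions (cost per feature, cost per environment) because attribution is just a different grouping key over identical data.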

Sustainability Instrumentation

Begin tracking AI inference emissions for future CSRD (EU) and SEC (US) Scope 3 reporting. Note: Current estimates are Tier 3 (order-of-magnitude). Vetch provides the instrumentation infrastructure—audit-grade accuracy requires Tier 1/2 energy data from providers or calibrated measurements.

Design Guarantees

Fail-Open Architecture

Vetch is architected with a non-blocking, fail-open boundary. Every Vetch operation (patching, calculation, emission) is wrapped in isolated error handlers. If Vetch fails, your LLM call proceeds normally, and a tracking_disabled: true event is logged. Vetch will never cause an inference outage.
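
The fail-open idea can be sketched with a small decorator. This is not Vetch's internal code: `fail_open`, `record_usage`, and `call_llm` are hypothetical names, and the log record shape is an assumption that only mirrors the `tracking_disabled` flag described above.

```python
import functools
import logging

logger = logging.getLogger("fail_open_sketch")


def fail_open(tracking_step):
    """Run a tracking step, swallowing any exception it raises."""
    @functools.wraps(tracking_step)
    def guarded(*args, **kwargs):
        try:
            return tracking_step(*args, **kwargs)
        except Exception:
            # Hypothetical log record, mirroring the tracking_disabled flag.
            logger.warning("tracking failed", extra={"tracking_disabled": True})
            return None
    return guarded


@fail_open
def record_usage(response):
    raise RuntimeError("simulated tracking bug")


def call_llm():
    response = {"text": "hello"}   # stand-in for a real provider call
    record_usage(response)         # a failure here must not reach the caller
    return response


result = call_llm()
```

The provider response is returned unchanged even though the tracking step raised, which is the property the guarantee describes.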

Privacy & Data Perimeter

Vetch never reads or stores prompt/completion content. It only extracts metadata (token counts, model names, timing) directly from SDK response objects. No PII or proprietary prompt data ever leaves your execution environment.

Thread Safety (v0.1.4+)

Vetch is fully thread-safe and supports multi-client isolation. It uses contextvars for async safety and WeakKeyDictionary for client patching, ensuring that unpatching one client doesn't affect another in the same process.

Features

  • Fail-Open: LLM calls always proceed even if Vetch fails
  • Privacy-First: No prompt or completion data is ever read or buffered
  • Multi-tier Caching: Memory -> File -> API -> Regional averages for grid data
  • Observability-Transparent: Works seamlessly with Datadog, OpenTelemetry, and Sentry
  • Low Overhead: Under 5ms overhead for sync calls; no added time-to-first-token (TTFT) latency for streaming
  • MoE-Aware: Energy estimates account for active parameters in Mixture-of-Experts models
  • Session Aggregation: Group multiple LLM calls into sessions for agentic AI tracking
  • Cache-Aware Pricing: Accurate cost calculation with prompt cache discounts
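
The multi-tier caching order listed above (Memory -> File -> API -> Regional averages) can be pictured as a chain of fallbacks. The sketch below is a simplified model, not Vetch's implementation: the function name, the JSON file layout, and the intensity numbers are all made up for illustration.

```python
import json
import tempfile
from pathlib import Path

REGIONAL_AVERAGES = {"us-east-1": 400.0}   # made-up gCO2/kWh placeholder


def lookup_intensity(region, memory, cache_file, fetch_live):
    """Return (value, tier) from the first tier that has an answer."""
    if region in memory:
        return memory[region], "memory"
    if cache_file.exists():
        on_disk = json.loads(cache_file.read_text())
        if region in on_disk:
            memory[region] = on_disk[region]   # promote to the faster tier
            return on_disk[region], "file"
    try:
        value = fetch_live(region)
        memory[region] = value
        return value, "api"
    except Exception:
        # Last resort: a static average, deliberately not cached.
        return REGIONAL_AVERAGES[region], "regional_average"


def offline_api(region):
    raise ConnectionError("no network")


memory = {}
cache_file = Path(tempfile.mkdtemp()) / "grid.json"
first = lookup_intensity("us-east-1", memory, cache_file, offline_api)
cache_file.write_text(json.dumps({"us-east-1": 380.0}))
second = lookup_intensity("us-east-1", memory, cache_file, offline_api)
third = lookup_intensity("us-east-1", memory, cache_file, offline_api)
```

With no network and no cache, the first lookup falls through to the regional average; once a file cache exists, later lookups are served from the file and then from memory.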

Supported Providers

| Provider | Status | Instrumentation |
|---|---|---|
| OpenAI | Supported | vetch.instrument() or vetch.wrap() |
| Azure OpenAI | Supported | vetch.instrument() (auto-detects AzureOpenAI) |
| Anthropic | Supported | vetch.instrument() or vetch.wrap() |
| Vertex AI (Gemini) | Supported | vetch.instrument() or vetch.wrap() |
| OpenRouter | Compatible | Uses OpenAI instrumentation (OpenAI-compatible API) |
| Together.ai | Compatible | Uses OpenAI instrumentation (OpenAI-compatible API) |
| Anyscale | Compatible | Uses OpenAI instrumentation (OpenAI-compatible API) |
| Ollama | Compatible | Uses OpenAI instrumentation (OpenAI-compatible API) |
| vLLM / TGI | Compatible | Uses OpenAI instrumentation (OpenAI-compatible API) |

OpenAI-compatible endpoints (OpenRouter, Together.ai, Anyscale, Ollama, vLLM, TGI) work automatically with vetch.instrument() when you call them through the openai Python SDK, because Vetch instruments that SDK rather than the endpoint itself.

Installation

pip install vetch

Quick Start

The simplest way to use Vetch is with instrument() — one line at startup, and all LLM calls are tracked automatically:

import vetch
import openai

# One line to instrument all providers
vetch.instrument(region="us-east-1", tags={"service": "chat-api"})

# All LLM calls are now automatically tracked
client = openai.OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello world"}]
)
# Energy, cost, and carbon events emitted automatically!

Advanced: Context Manager

For per-call control, use the wrap() context manager:

import openai
from vetch import wrap

client = openai.OpenAI()

with wrap(region="us-east-1", tags={"team": "ml", "env": "prod"}) as ctx:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Hello world"}]
    )

# Access inference metadata once the block has exited
print(f"Energy: {ctx.event['estimated_energy_wh']} Wh")
print(f"Carbon: {ctx.event['estimated_carbon_g']} gCO2e")
print(f"Cost:   ${ctx.event['estimated_cost_usd']}")

Async Support

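import openai
import vetch

client = openai.AsyncOpenAI()

# Run inside a coroutine; `async with` is not valid at module level
async def main():
    async with vetch.awrap(region="us-east-1") as ctx:
        response = await client.chat.completions.create(...)
    print(ctx.event["estimated_energy_wh"])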

Session Aggregation (Agentic AI)

Group multiple LLM calls into sessions for agentic frameworks like CrewAI, AutoGPT, or LangGraph:

import vetch

with vetch.Session(tags={"agent": "researcher", "task": "summarize"}) as session:
    with vetch.wrap() as ctx1:
        response1 = client.chat.completions.create(...)

    # Nested sessions for sub-agents
    with vetch.Session(tags={"agent": "summarizer"}) as sub_session:
        with vetch.wrap() as ctx2:
            response2 = client.chat.completions.create(...)

# Aggregate metrics across all calls
print(f"Total energy: {session.total_energy_wh} Wh")
print(f"Total cost: ${session.total_cost_usd}")
print(f"Call count: {session.call_count}")

Sessions support distributed propagation across microservices:

# In FastAPI service:
headers = session.inject_headers({})
celery_task.delay(task_id, headers=headers)

# In Celery worker:
with vetch.Session.from_headers(task_headers) as worker_session:
    with vetch.wrap() as ctx:
        response = client.chat.completions.create(...)
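
Conceptually, header propagation only needs a session identifier to travel with the request. The sketch below models that idea with a toy class. The header name, the `SketchSession` class, and its behavior when the header is missing are assumptions for illustration and are not taken from Vetch.

```python
import uuid

HEADER = "x-example-session-id"   # hypothetical header name, not Vetch's


class SketchSession:
    def __init__(self, session_id=None):
        self.session_id = session_id or uuid.uuid4().hex

    def inject_headers(self, headers):
        """Return a copy of headers carrying this session's id."""
        return {**headers, HEADER: self.session_id}

    @classmethod
    def from_headers(cls, headers):
        """Rejoin the caller's session, or start a fresh one if absent."""
        return cls(headers.get(HEADER))


parent = SketchSession()
outgoing = parent.inject_headers({"content-type": "application/json"})
worker = SketchSession.from_headers(outgoing)
orphan = SketchSession.from_headers({})
```

Here `worker` rejoins the parent's session because the id survived the hop, while `orphan` starts a new session because no header was present.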

Budget Alerts

Set spending thresholds with automatic alerting:

import vetch

vetch.set_budget("hourly", cost_usd=10.0, energy_wh=50.0)

@vetch.on_budget_alert
def handle_alert(alert):
    print(f"Budget alert: {alert}")

# Check budget status
status = vetch.get_budget_status()
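
The threshold-and-callback pattern behind budget alerts can be sketched in a few lines. This is a simplified model rather than Vetch's implementation: `BudgetSketch` is a hypothetical class, it tracks cost only, and it omits the time window ("hourly") entirely.

```python
class BudgetSketch:
    """Accumulate spend and fire callbacks once when a limit is crossed."""

    def __init__(self, cost_usd):
        self.limit = cost_usd
        self.spent = 0.0
        self.alerted = False
        self.handlers = []

    def on_alert(self, handler):
        self.handlers.append(handler)
        return handler            # lets this be used as a decorator

    def record(self, cost_usd):
        self.spent += cost_usd
        if self.spent >= self.limit and not self.alerted:
            self.alerted = True   # alert once, not on every later call
            for handler in self.handlers:
                handler({"spent": self.spent, "limit": self.limit})


budget = BudgetSketch(cost_usd=10.0)
alerts = []
budget.on_alert(alerts.append)
for cost in (4.0, 4.0, 4.0, 4.0):
    budget.record(cost)
```

The third call crosses the limit and triggers a single alert; the fourth call adds to the total without alerting again.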

OTLP Export (Grafana, Datadog)

Export metrics to any OpenTelemetry-compatible backend:

import vetch

vetch.configure_otlp_export(
    endpoint="http://localhost:4317",
    service_name="my-llm-service"
)

# Export a pre-built Grafana dashboard
# vetch dashboard --export grafana --output grafana_vetch.json

CLI Usage

# Check Vetch status and configuration
vetch status

# Estimate energy/carbon for a model without running code
vetch estimate --model gpt-4o --input-tokens 1000 --output-tokens 500

# Compare multiple models
vetch compare --models gpt-4o,claude-3-opus,gemini-1.5-pro --tokens 1000

# Analyze token usage patterns
vetch audit

# Export Grafana dashboard
vetch dashboard --export grafana --output dashboard.json

# Freeze registry for CI/CD (eliminates cold-start latency)
vetch registry freeze --output vetch_registry.json

# Generate usage reports
vetch report --days 7 --tags team=ml

Token Waste Audit

Vetch tracks token usage patterns across your session and provides actionable recommendations:

from vetch import wrap, get_session_stats, generate_advisories

# Make multiple LLM calls
for _ in range(10):
    with wrap() as ctx:
        response = client.chat.completions.create(...)

# Analyze patterns
stats = get_session_stats()
advisories = generate_advisories(stats)

for a in advisories:
    print(f"[{a.level.value}] {a.title}")
    print(f"  {a.description}")

What it detects:

  • Static system prompts: Repeated input token counts suggest cacheable prompts
  • High input:output ratios: Large inputs producing small outputs
  • Expensive model usage: Opportunities to use smaller, cheaper models
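
The first two detections above can be approximated from token counts alone. The sketch below shows one plausible way to compute such signals. The sample data, function names, and thresholds are arbitrary choices for illustration, not the heuristics Vetch actually uses.

```python
from collections import Counter

calls = [  # (input_tokens, output_tokens), made-up numbers
    (1200, 40), (1200, 35), (1200, 50), (1200, 42), (300, 280),
]


def repeated_input_share(calls):
    """Fraction of calls that share the most common input token count."""
    counts = Counter(inp for inp, _ in calls)
    return counts.most_common(1)[0][1] / len(calls)


def input_output_ratio(calls):
    """Total input tokens divided by total output tokens."""
    total_in = sum(inp for inp, _ in calls)
    total_out = sum(out for _, out in calls)
    return total_in / total_out


share = repeated_input_share(calls)
ratio = input_output_ratio(calls)
findings = []
if share >= 0.5:     # threshold is an arbitrary choice for this sketch
    findings.append("possible static prompt")
if ratio >= 5:       # likewise arbitrary
    findings.append("high input:output ratio")
```

Note that both signals use only counts, never prompt text, which is consistent with the privacy perimeter described earlier.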

GPU Calibration (Local Inference)

For local inference (Ollama, vLLM, llama.cpp), calibrate energy measurements using actual GPU power draw:

import ollama
from vetch.calibrate import calibrate_model, format_calibration_result

def my_inference():
    # One representative request; calibration measures GPU power while it runs
    response = ollama.generate(model="llama3.1:8b", prompt="Hello world")
    return 100, 50  # (input_tokens, output_tokens)

result = calibrate_model("ollama", "llama3.1:8b", workload=my_inference)
print(format_calibration_result(result))

Requirements: NVIDIA GPU with pynvml (pip install nvidia-ml-py3)

Clean Test Isolation

Remove instrumentation for clean test environments:

import vetch

vetch.instrument()
# ... run your code ...
vetch.uninstrument()  # Restore original SDK methods

Energy Tiers

Vetch uses a tiered system for energy estimate confidence:

| Tier | Name | Uncertainty | Source |
|---|---|---|---|
| 0 | Measured | ±10-20% | Direct GPU measurement (pynvml) |
| 1 | Vendor-Published | ±20-50% | Official provider data |
| 2 | Validated | ±50-100% | Crowdsourced aggregates |
| 3 | Estimated | Order of magnitude | Parameter-based calculation |

Run vetch methodology to see full methodology documentation.
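
To make the tiers concrete, the sketch below turns an energy estimate into a range and converts it to carbon with the standard relation gCO2e = kWh x grid intensity (gCO2e/kWh). The helper names and the grid intensity of 400 gCO2e/kWh are made up, and treating Tier 3 as a factor of 10 in each direction is one reading of "order of magnitude", not Vetch's documented formula.

```python
# Relative uncertainty per tier, using the upper end of each range in the table.
TIER_RELATIVE_UNCERTAINTY = {0: 0.20, 1: 0.50, 2: 1.00}


def carbon_grams(energy_wh, grid_g_per_kwh):
    """gCO2e = kWh consumed x grid intensity in gCO2e/kWh."""
    return energy_wh / 1000.0 * grid_g_per_kwh


def energy_bounds(energy_wh, tier):
    """Return (low, high) in Wh; tier 3 is a factor of 10 either way."""
    if tier == 3:
        return energy_wh / 10.0, energy_wh * 10.0
    spread = TIER_RELATIVE_UNCERTAINTY[tier]
    return max(0.0, energy_wh * (1 - spread)), energy_wh * (1 + spread)


low, high = energy_bounds(0.5, tier=3)
carbon_range = (carbon_grams(low, 400.0), carbon_grams(high, 400.0))
```

For a 0.5 Wh Tier 3 estimate this gives a range of 0.05 to 5 Wh, which is why such figures are suitable for trend tracking but not for audit-grade reporting.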

Environment Variables

| Variable | Description |
|---|---|
| VETCH_DISABLED | Set to true to completely disable Vetch (emergency kill switch) |
| VETCH_REGION | Default grid region (e.g., us-east-1, eu-west-1) |
| VETCH_OUTPUT | Output target: none (default), stderr, or file path |
| VETCH_HOME | Vetch home directory (default: ~/.vetch/) |
| VETCH_REGISTRY_REMOTE | Set to false to disable remote registry updates |
| VETCH_REGISTRY_PATH | Path to offline registry directory (air-gapped environments) |
| VETCH_REGISTRY_URL | Custom remote registry URL |
| ELECTRICITY_MAPS_API_KEY | API key for live grid carbon intensity data |
| VETCH_CACHE_MODE | Set to memory-only for serverless/Lambda environments |
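
If your own application needs to inspect the same settings (for example, to log the effective configuration at startup), a small reader like the one below is one option. It is a sketch, not Vetch's parser: `read_settings` is a hypothetical helper, and the case-insensitive handling of true/false is an assumption. Only the variable names and defaults come from the table above.

```python
import os


def read_settings(env):
    """Collect a few of the documented variables into a plain dict."""
    return {
        "disabled": env.get("VETCH_DISABLED", "").lower() == "true",
        "region": env.get("VETCH_REGION"),          # None if unset
        "output": env.get("VETCH_OUTPUT", "none"),  # documented default
        "home": os.path.expanduser(env.get("VETCH_HOME", "~/.vetch/")),
        "registry_remote": env.get("VETCH_REGISTRY_REMOTE", "").lower() != "false",
        "memory_only": env.get("VETCH_CACHE_MODE") == "memory-only",
    }


example = read_settings({"VETCH_DISABLED": "true", "VETCH_REGION": "eu-west-1"})
defaults = read_settings({})
```

Passing a plain dict keeps the helper easy to test; in an application you would pass `os.environ` instead.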

Alpha Limitations

This is an alpha release. Please be aware of:

  1. Energy estimates are uncertain: Most models use Tier 3 estimates, which carry order-of-magnitude uncertainty (the true value may be roughly 10x higher or lower). See vetch methodology for details.

  2. Region inference is approximate: Without explicit VETCH_REGION, timezone-based inference is ~30% accurate. Set the region explicitly for accurate carbon calculations.

  3. Experimental modules: vetch.calibrate, vetch.storage, and vetch.ci emit FutureWarning and may change in future versions.

Troubleshooting

Vetch is blocking my LLM calls:

export VETCH_DISABLED=true  # Emergency kill switch

Too much output:

export VETCH_OUTPUT=none  # Silence all output

Need to debug:

import logging
logging.getLogger("vetch").setLevel(logging.DEBUG)

Contributing

See CONTRIBUTING.md for development setup, testing guidelines, and how to contribute energy data.

License

Apache License 2.0. See LICENSE and NOTICE for details.
