
Vetch SDK


Planet-aware observability for LLM inference.

Vetch is a Python SDK that wraps LLM API calls to log energy consumption, cost, and carbon per inference using live grid data. It never reads prompt or completion content; it uses only metadata, such as token counts, from the response's usage field.

Features

  • Fail-Open: LLM calls always proceed even if Vetch fails.
  • Privacy-First: No prompt or completion data is ever read or buffered.
  • Multi-tier Caching: Memory and file-based caching for grid intensity data.
  • Observability-Transparent: Works seamlessly with Datadog, OpenTelemetry, and Sentry.
  • Low Overhead: Under 5 ms of overhead for sync calls; no added time-to-first-token (TTFT) latency for streaming.
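The fail-open guarantee can be sketched in a few lines. This is a hypothetical pattern illustrating the idea, not Vetch's actual internals: the wrapper swallows its own telemetry errors so the user's call is never interrupted.

```python
from contextlib import contextmanager

@contextmanager
def fail_open(sink):
    # Hypothetical sketch of a fail-open wrapper: the user's code inside
    # the `with` block always runs, and errors raised by the telemetry
    # sink are swallowed rather than propagated to the caller.
    event = {}
    try:
        yield event
    finally:
        try:
            sink(event)      # telemetry may fail...
        except Exception:
            pass             # ...but the wrapped call is unaffected
```

The key design choice is that only telemetry errors are suppressed; exceptions raised by the wrapped LLM call itself still propagate normally.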

Installation

pip install vetch

Quick Start

from vetch import wrap
from openai import OpenAI

client = OpenAI()

with wrap(region="us-east-1", tags={"team": "ml", "env": "prod"}) as ctx:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Hello world"}]
    )

# Access inference metadata
print(f"Energy: {ctx.event['estimated_energy_wh']} Wh")
print(f"Carbon: {ctx.event['estimated_carbon_g']} gCO2e")

CLI Usage

Estimate energy/carbon for a model without running code:

vetch estimate --model gpt-4o --input-tokens 1000 --output-tokens 500 --region us-east-1

Compare multiple models:

vetch compare --models gpt-4o,claude-3-opus,gemini-1.5-pro --tokens 1000

Analyze your token usage patterns:

vetch audit

Check your environment:

vetch check

Token Waste Audit

Vetch tracks token usage patterns across your session and provides actionable recommendations:

from vetch import wrap, get_session_stats, generate_advisories

# Make multiple LLM calls
for _ in range(10):
    with wrap() as ctx:
        response = client.chat.completions.create(...)

# Analyze patterns
stats = get_session_stats()
advisories = generate_advisories(stats)

for a in advisories:
    print(f"[{a.level.value}] {a.title}")
    print(f"  {a.description}")

What it detects:

  • Static system prompts: Repeated input token counts suggest cacheable prompts
  • High input:output ratios: Large inputs producing small outputs
  • Expensive model usage: Opportunities to use smaller, cheaper models
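The static-prompt check, for example, can be approximated with a simple dispersion heuristic. This is an illustration of the idea, not Vetch's actual algorithm: near-identical input token counts across calls suggest a fixed, cacheable prompt prefix.

```python
from statistics import mean, pstdev

def looks_like_static_prompt(input_token_counts, rel_tolerance=0.05):
    # If input sizes barely vary across calls, the prompt prefix is
    # probably static and a candidate for prompt caching.
    if len(input_token_counts) < 3:
        return False  # too few samples to judge
    m = mean(input_token_counts)
    return m > 0 and pstdev(input_token_counts) / m < rel_tolerance
```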

GPU Calibration (Local Inference)

For local inference (Ollama, vLLM, llama.cpp), calibrate energy measurements using actual GPU power draw:

from vetch.calibrate import calibrate_model, format_calibration_result
import ollama

def my_inference():
    # Run your inference workload and return (input_tokens, output_tokens)
    response = ollama.generate(model="llama3.1:8b", prompt="Hello world")
    return 100, 50  # Replace with the actual token counts for your workload

result = calibrate_model("ollama", "llama3.1:8b", workload=my_inference)
print(format_calibration_result(result))

# Use calibrated values for accurate tracking
with wrap(energy_override=result.to_override()) as ctx:
    response = ollama.generate(...)

Check calibration status:

vetch calibrate --status

Requirements: NVIDIA GPU with pynvml (pip install nvidia-ml-py3)
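Conceptually, calibration integrates GPU power draw over the inference window and attributes it to generated tokens. A sketch of that arithmetic, assuming evenly spaced power samples (this is not the calibrate module's actual code):

```python
def wh_per_token(power_samples_w, duration_s, output_tokens):
    # Average sampled power (watts) times duration (seconds) gives
    # joules; divide by 3600 for watt-hours, then by tokens generated.
    avg_power_w = sum(power_samples_w) / len(power_samples_w)
    energy_wh = avg_power_w * duration_s / 3600.0
    return energy_wh / output_tokens

# e.g. a 300 W average over a 6 s run emitting 50 tokens -> 0.01 Wh/token
```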

Historical Analysis & Reporting

Vetch can persist events to SQLite for historical FinOps analysis:

from vetch import configure_storage, query_usage, wrap
from datetime import datetime, timedelta

# Enable persistent storage
configure_storage()  # Uses ~/.vetch/usage.db

# Your LLM calls are now tracked
with wrap(tags={"team": "ml", "feature": "chat"}) as ctx:
    response = client.chat.completions.create(...)

# Query historical usage
summary = query_usage(
    start=datetime.now() - timedelta(days=7),
    tags={"team": "ml"}
)

print(f"Total cost: ${summary.total_cost_usd:.2f}")
print(f"Total energy: {summary.total_energy_wh:.2f} Wh")
print(f"Requests: {summary.total_requests}")

Generate reports from CLI:

# Weekly report
vetch report --days 7

# Filter by team
vetch report --tags team=ml

# Show top consumers
vetch report --top --top-by team --days 30

# JSON output for dashboards
vetch report --format json

Energy Tiers

Vetch uses a tiered system for energy estimate confidence:

Tier  Name              Uncertainty         Source
0     Measured          ±10-20%             Direct GPU measurement (pynvml)
1     Vendor-Published  ±20-50%             Official provider data
2     Validated         ±50-100%            Crowdsourced aggregates
3     Estimated         Order of magnitude  Parameter-based calculation

Run vetch methodology to see full methodology documentation.
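The tiers translate into confidence bands around a point estimate. A hypothetical helper (not part of the Vetch API) showing how such bands could be applied:

```python
# Worst-case uncertainty per tier, mirroring the tier table above.
TIER_UNCERTAINTY = {
    0: 0.20,   # Measured: up to +/-20%
    1: 0.50,   # Vendor-Published: up to +/-50%
    2: 1.00,   # Validated: up to +/-100%
    3: 10.0,   # Estimated: order of magnitude (factor of 10)
}

def energy_bounds(estimate_wh, tier):
    u = TIER_UNCERTAINTY[tier]
    if tier == 3:
        # Order-of-magnitude tiers use a multiplicative band.
        return estimate_wh / u, estimate_wh * u
    return estimate_wh * (1 - u), estimate_wh * (1 + u)
```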

Environment Variables

Variable                  Description
VETCH_DISABLED            Set to true to completely disable Vetch (emergency kill switch)
VETCH_REGION              Default grid region (e.g., us-east-1, eu-west-1)
VETCH_OUTPUT              Output target: stderr (default), none, or a file path
ELECTRICITY_MAPS_API_KEY  API key for live grid carbon intensity data
VETCH_CACHE_MODE          Set to memory-only for serverless/Lambda environments

Alpha Limitations

This is an alpha release. Please be aware of:

  1. Energy estimates are uncertain: Most models use Tier 3 estimates, which are accurate only to within roughly a factor of 10. See vetch methodology for details.

  2. Region inference is approximate: Without explicit VETCH_REGION, timezone-based inference is ~30% accurate. Set the region explicitly for accurate carbon calculations.

  3. Experimental modules: vetch.calibrate, vetch.storage, and vetch.ci emit FutureWarning and may change in future versions.

  4. Provider support: Currently supports OpenAI, Anthropic, and Vertex AI. Other providers coming soon.
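Point 2 matters because carbon is derived from grid intensity, which varies widely by region. The conversion itself is simple; the intensity figures below are rough illustrations, not live data:

```python
def carbon_g(energy_wh, grid_intensity_g_per_kwh):
    # gCO2e = Wh * (gCO2e per kWh) / 1000
    return energy_wh * grid_intensity_g_per_kwh / 1000.0

# The same 2 Wh inference differs by nearly 9x between grids:
# carbon_g(2, 80)  -> 0.16 gCO2e  (low-carbon grid)
# carbon_g(2, 700) -> 1.40 gCO2e  (coal-heavy grid)
```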

Troubleshooting

Vetch is blocking my LLM calls:

export VETCH_DISABLED=true  # Emergency kill switch

Too much output:

export VETCH_OUTPUT=none  # Silence all output

Need to debug:

import logging
logging.getLogger("vetch").setLevel(logging.DEBUG)

License

Apache License 2.0. See LICENSE and NOTICE for details.
