Vetch SDK
Energy-aware observability for LLM inference.
Vetch is a Python SDK that wraps LLM API calls to log energy consumption, cost, and carbon per inference using live grid data. It never reads prompt or completion content; it logs only the usage metadata from the response.
Features
- Fail-Open: LLM calls always proceed even if Vetch fails.
- Privacy-First: No prompt or completion data is ever read or buffered.
- Multi-tier Caching: Memory and file-based caching for grid intensity data.
- Observability-Transparent: Works seamlessly with Datadog, OpenTelemetry, and Sentry.
- Low Overhead: Under 5ms overhead for sync calls; zero TTFT latency for streaming.
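
The fail-open guarantee means instrumentation errors are swallowed rather than raised into the application. A minimal sketch of the pattern (hypothetical, not Vetch's actual implementation — the `fail_open` and `broken_recorder` names are illustrative):

```python
from contextlib import contextmanager

@contextmanager
def fail_open(record):
    """Yield to the wrapped LLM call; swallow any error raised by
    the instrumentation step itself (here, the `record` hook)."""
    event = {}
    try:
        yield event
    finally:
        try:
            record(event)  # instrumentation that may fail
        except Exception:
            pass  # fail open: never break the caller's LLM call

def broken_recorder(event):
    raise RuntimeError("metrics backend down")

# The body always completes, even though the recorder raises.
with fail_open(broken_recorder) as event:
    event["output_tokens"] = 42  # stands in for real response usage

print(event["output_tokens"])  # 42
```

The `finally` block is what makes this robust: the recording step runs after the user's call, inside its own try/except, so neither its success nor its failure can alter the call's outcome.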
Installation
pip install vetch
Quick Start
```python
from vetch import wrap
from openai import OpenAI

client = OpenAI()

with wrap(region="us-east-1", tags={"team": "ml", "env": "prod"}) as ctx:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Hello world"}]
    )

# Access inference metadata
print(f"Energy: {ctx.event['estimated_energy_wh']} Wh")
print(f"Carbon: {ctx.event['estimated_carbon_g']} gCO2e")
```
CLI Usage
Estimate energy/carbon for a model without running code:
vetch estimate --model gpt-4o --input-tokens 1000 --output-tokens 500 --region us-east-1
Compare multiple models:
vetch compare --models gpt-4o,claude-3-opus,gemini-1.5-pro --tokens 1000
Analyze your token usage patterns:
vetch audit
Check your environment:
vetch check
Token Waste Audit
Vetch tracks token usage patterns across your session and provides actionable recommendations:
```python
from vetch import wrap, get_session_stats, generate_advisories

# Make multiple LLM calls
for _ in range(10):
    with wrap() as ctx:
        response = client.chat.completions.create(...)

# Analyze patterns
stats = get_session_stats()
advisories = generate_advisories(stats)

for a in advisories:
    print(f"[{a.level.value}] {a.title}")
    print(f"  {a.description}")
```
What it detects:
- Static system prompts: Repeated input token counts suggest cacheable prompts
- High input:output ratios: Large inputs producing small outputs
- Expensive model usage: Opportunities to use smaller, cheaper models
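
As an illustration, the static-prompt check can be approximated by looking for one input size dominating a session: repeated identical input token counts suggest a static (and cacheable) system prompt. A simplified sketch (Vetch's real heuristics are internal; `flag_static_prompts` is a made-up name):

```python
from collections import Counter

def flag_static_prompts(input_token_counts, threshold=0.5):
    """Return True when a single input token count accounts for at
    least `threshold` of all calls in the session."""
    if not input_token_counts:
        return False
    dominant = Counter(input_token_counts).most_common(1)[0][1]
    return dominant / len(input_token_counts) >= threshold

# Nine identical 812-token inputs out of ten calls: likely static.
print(flag_static_prompts([812] * 9 + [305]))  # True
# Four distinct sizes: nothing to flag.
print(flag_static_prompts([120, 480, 95, 2210]))  # False
```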
GPU Calibration (Local Inference)
For local inference (Ollama, vLLM, llama.cpp), calibrate energy measurements using actual GPU power draw:
```python
from vetch.calibrate import calibrate_model, format_calibration_result

def my_inference():
    # Run your inference workload
    # Return (input_tokens, output_tokens)
    response = ollama.generate(model="llama3.1:8b", prompt="Hello world")
    return 100, 50  # Your actual token counts

result = calibrate_model("ollama", "llama3.1:8b", workload=my_inference)
print(format_calibration_result(result))

# Use calibrated values for accurate tracking
with wrap(energy_override=result.to_override()) as ctx:
    response = ollama.generate(...)
```
Check calibration status:
vetch calibrate --status
Requirements: an NVIDIA GPU with pynvml (`pip install nvidia-ml-py3`)
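
Conceptually, calibration integrates GPU power over the workload and divides by the tokens generated. A back-of-the-envelope version of that arithmetic, assuming power samples in watts at a fixed interval (the real calibrator reads samples via pynvml and accounts for idle-power baselines; these function names are illustrative):

```python
def energy_wh(power_samples_w, interval_s):
    """Sum power samples (W) times the sampling interval (s) to get
    joules, then convert joules to watt-hours (1 Wh = 3600 J)."""
    joules = sum(power_samples_w) * interval_s
    return joules / 3600.0

def wh_per_token(power_samples_w, interval_s, output_tokens):
    return energy_wh(power_samples_w, interval_s) / output_tokens

# 300 W sustained for 10 s (one sample per second) while generating
# 50 tokens => 3000 J = 0.833 Wh total, ~0.0167 Wh per token.
samples = [300.0] * 10
print(round(wh_per_token(samples, 1.0, 50), 4))  # 0.0167
```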
Historical Analysis & Reporting
Vetch can persist events to SQLite for historical FinOps analysis:
```python
from vetch import configure_storage, query_usage, wrap
from datetime import datetime, timedelta

# Enable persistent storage
configure_storage()  # Uses ~/.vetch/usage.db

# Your LLM calls are now tracked
with wrap(tags={"team": "ml", "feature": "chat"}) as ctx:
    response = client.chat.completions.create(...)

# Query historical usage
summary = query_usage(
    start=datetime.now() - timedelta(days=7),
    tags={"team": "ml"}
)
print(f"Total cost: ${summary.total_cost_usd:.2f}")
print(f"Total energy: {summary.total_energy_wh:.2f} Wh")
print(f"Requests: {summary.total_requests}")
```
Generate reports from CLI:
```shell
# Weekly report
vetch report --days 7

# Filter by team
vetch report --tags team=ml

# Show top consumers
vetch report --top --top-by team --days 30

# JSON output for dashboards
vetch report --format json
```
Energy Tiers
Vetch uses a tiered system for energy estimate confidence:
| Tier | Name | Uncertainty | Source |
|---|---|---|---|
| 0 | Measured | ±10-20% | Direct GPU measurement (pynvml) |
| 1 | Vendor-Published | ±20-50% | Official provider data |
| 2 | Validated | ±50-100% | Crowdsourced aggregates |
| 3 | Estimated | ±10x | Parameter-based calculation |
Run `vetch methodology` to see the full methodology documentation.
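
The carbon figure itself is just estimated energy times grid intensity; the tier determines how wide the error bars are. A sketch of that arithmetic (the ×10 band mirrors the Tier 3 row above; the function names are illustrative, not Vetch's API):

```python
def carbon_g(energy_wh, grid_intensity_g_per_kwh):
    """gCO2e = Wh * (gCO2e per kWh) / 1000."""
    return energy_wh * grid_intensity_g_per_kwh / 1000.0

def tier3_bounds(value):
    """Tier 3: a factor-of-10 uncertainty band around the point value."""
    return value / 10.0, value * 10.0

# 2 Wh of estimated energy on a 400 gCO2e/kWh grid:
point = carbon_g(2.0, 400.0)
print(point)                # 0.8
print(tier3_bounds(point))  # (0.08, 8.0)
```

The wide Tier 3 band is why the alpha notes below stress setting a region explicitly: a precise grid intensity narrows the only factor the user controls.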
Environment Variables
| Variable | Description |
|---|---|
| `VETCH_DISABLED` | Set to `true` to completely disable Vetch (emergency kill switch) |
| `VETCH_REGION` | Default grid region (e.g., `us-east-1`, `eu-west-1`) |
| `VETCH_OUTPUT` | Output target: `stderr` (default), `none`, or a file path |
| `ELECTRICITY_MAPS_API_KEY` | API key for live grid carbon intensity data |
| `VETCH_CACHE_MODE` | Set to `memory-only` for serverless/Lambda environments |
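
The cache-mode switch matters because the file tier assumes a writable home directory, which Lambda-style sandboxes lack. A toy version of the memory-then-file lookup order (illustrative only; Vetch's actual cache and its paths are internal):

```python
import json
import os

class TieredCache:
    """Check memory first, then a JSON file; misses populate both tiers.
    With path=None the file tier is skipped (memory-only mode)."""

    def __init__(self, path=None):
        self.memory = {}
        self.path = path

    def get(self, key, fetch):
        if key in self.memory:
            return self.memory[key]
        if self.path and os.path.exists(self.path):
            with open(self.path) as f:
                disk = json.load(f)
            if key in disk:
                self.memory[key] = disk[key]
                return disk[key]
        value = fetch()  # e.g. a live grid-intensity API call
        self.memory[key] = value
        if self.path:
            disk = {}
            if os.path.exists(self.path):
                with open(self.path) as f:
                    disk = json.load(f)
            disk[key] = value
            with open(self.path, "w") as f:
                json.dump(disk, f)
        return value

cache = TieredCache(path=None)  # memory-only, as in serverless mode
print(cache.get("us-east-1", lambda: 412.5))  # 412.5 (fetched)
print(cache.get("us-east-1", lambda: 0.0))    # 412.5 (from memory)
```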
Alpha Limitations
This is an alpha release. Please be aware of:
- Energy estimates are uncertain: Most models use Tier 3 estimates (±10x uncertainty). See `vetch methodology` for details.
- Region inference is approximate: Without an explicit `VETCH_REGION`, timezone-based inference is ~30% accurate. Set the region explicitly for accurate carbon calculations.
- Experimental modules: `vetch.calibrate`, `vetch.storage`, and `vetch.ci` emit `FutureWarning` and may change in future versions.
- Provider support: Currently supports OpenAI, Anthropic, and Vertex AI. Other providers coming soon.
Troubleshooting
Vetch is blocking my LLM calls:
export VETCH_DISABLED=true # Emergency kill switch
Too much output:
export VETCH_OUTPUT=none # Silence all output
Need to debug:
import logging
logging.getLogger("vetch").setLevel(logging.DEBUG)
License
Apache License 2.0. See LICENSE and NOTICE for details.