Zero-config observability for AI agents

peekr

Observability and evaluation for AI agents.

PyPI CI License: MIT Python 3.9+

Website · Docs · PyPI · TypeScript SDK


Peekr captures every LLM call, tool call, and framework step in your agent — what was sent, what came back, how long it took, and what it cost. Two lines of code, no backend, no account.

import peekr
peekr.instrument()

That's it. Spans stream to traces.jsonl (or SQLite) and to your console. Inspect them with peekr view, find expensive calls with peekr cost, generate a self-contained dashboard with peekr dashboard, and score every output with built-in LLM-as-judge evaluators including RAGAS-style claim decomposition.



Install

pip install peekr                   # base
pip install "peekr[openai]"         # with OpenAI
pip install "peekr[anthropic]"      # with Anthropic
pip install "peekr[bedrock]"        # with AWS Bedrock
pip install "peekr[gemini]"         # with Google Gemini
pip install "peekr[langchain]"      # with LangChain / LangGraph
pip install "peekr[llamaindex]"     # with LlamaIndex
pip install "peekr[crewai]"         # with CrewAI
pip install "peekr[otel]"           # with OpenTelemetry / OpenInference export
pip install "peekr[all]"            # everything

Quick start

1. Instrument once at startup. Patches every installed LLM SDK (OpenAI, Anthropic, Bedrock, Gemini) and any installed agent framework.

import peekr
peekr.instrument()

2. Trace your tools so they appear in the same tree as LLM calls.

from peekr import trace

@trace
def search_web(query: str) -> list[str]:
    return fetch_results(query)

@trace                       # async works
async def fetch_user(user_id: int) -> dict:
    return await db.get(user_id)

3. View the trace.

peekr view traces.jsonl          # tree view
peekr view --io traces.jsonl     # include inputs and outputs
peekr cost traces.jsonl          # cost breakdown + top hotspots

Trace a3f2b1c0  1243ms  891tok
────────────────────────────────────────────────
agent.run  1243ms
   └─ tool.search_web  210ms
         in:  {"query": "climate policy"}
         out: ["result1", "result2", ...]
   └─ openai.chat.completions [gpt-4o]  1033ms  891tok
         in:  [{"role": "user", "content": "..."}]
         out: "Based on recent research..."

What you get

Capability                          API
Auto-instrumentation                peekr.instrument(), which patches OpenAI, Anthropic, Bedrock, LangChain, LlamaIndex, CrewAI
Tool tracing                        @peekr.trace on any sync or async function
Sessions                            with peekr.session(user_id="alice", tenant_id="acme"): ...
Multi-tenant schema                 tenant_id and retention_class first-class on every span
Alerts + Slack/webhook sinks        ErrorRate(0.05).with_sinks(SlackSink(url), WebhookSink(url))
LLM-as-judge eval                   instrument(evaluators=[peekr.eval.Rubric("Be concise")])
Hallucination detection             instrument(evaluators=[peekr.eval.Hallucination()])
Claim-level (RAGAS) hallucination   Hallucination(detailed=True) for per-claim verdicts
Drift dashboard                     peekr dashboard traces.db -o report.html
Feedback + fine-tuning export       peekr.feedback(trace_id, rating="good")
A/B experiments                     @peekr.experiment(variants=["control", "test"])
Trace replay                        peekr replay <trace_id>
TypeScript SDK                      npm install @peekr/sdk (same wire format)
OpenTelemetry export                add_exporter(peekr.OTelExporter()), OpenInference-shaped spans into any OTel pipeline
Sampling                            instrument(sample_rate=0.1), a whole-trace decision; errored spans always kept

Failure modes peekr catches that timing alone won't

A profiler tells you a function was slow. Peekr also tells you it returned the wrong shape and the LLM had no idea.

agent.run  2100ms
   └─ tool.fetch_user  12ms     out: null         ← tool returned null
   └─ openai.chat       2088ms  in: "User profile: null..."   ← LLM got garbage

Slow steps are obvious in the tree, with each step's share of latency broken out:

agent.run  4300ms
   └─ tool.search_web   3800ms  ← 88% of latency. Cache, don't swap models.
   └─ openai.chat        490ms

Token growth across runs surfaces unbounded conversation history:

Trace 1:  18,432 tokens
Trace 2:  21,104 tokens
Trace 3:  24,891 tokens   ← summarise after N turns

And prod-vs-local divergence is a tool I/O diff, not guesswork:

local:  out: [{"id": 1, "qty": 42}]
prod:   out: []   ← upstream pipeline bug, not agent logic

CLI

peekr view

Tree view of every trace, optionally with inputs and outputs.

peekr view traces.jsonl
peekr view --io traces.jsonl
peekr view traces.db          # SQLite works the same way

peekr cost

Where money and time went, with a top-10 hotspots list ranked by a composite score weighted 60% cost and 40% latency.

peekr cost traces.jsonl

────────────────────────────────────────────────────────────
  peekr cost  ·  traces.jsonl
────────────────────────────────────────────────────────────
  Total spans        : 8,022
  LLM calls          : 85
  Errors             : 0
  Total input tokens : 130,807
  Total output tokens: 10,274
  Total LLM time     : 161.9s
  Total cost (est.)  : $0.14574
────────────────────────────────────────────────────────────

  Top 10 hottest calls  (60% cost · 40% latency):
  #   Operation                In      Out      Cost      ms  Model
  1   anthropic.messages    5,066     264 $ 0.00511   2965ms  claude-haiku-4-5
  2   anthropic.messages    3,924     376 $ 0.00464   3458ms  claude-haiku-4-5
  ...
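
The output header gives the weights (60% cost, 40% latency) but not how the two metrics are combined. The following is a minimal sketch of one plausible scheme, assuming each metric is divided by its maximum across all calls before weighting. The normalisation, the `Call` record, and the `hotspot_rank` helper are assumptions for illustration, not peekr's documented formula, and the sample data is made up.

```python
from dataclasses import dataclass


@dataclass
class Call:
    name: str
    cost_usd: float
    duration_ms: float


def hotspot_rank(calls, cost_weight=0.6, latency_weight=0.4, top=10):
    """Rank calls by a weighted sum of max-normalised cost and latency."""
    if not calls:
        return []
    # Fall back to 1.0 when every value is zero, to avoid dividing by zero.
    max_cost = max(c.cost_usd for c in calls) or 1.0
    max_ms = max(c.duration_ms for c in calls) or 1.0

    def score(c):
        return (cost_weight * c.cost_usd / max_cost
                + latency_weight * c.duration_ms / max_ms)

    return sorted(calls, key=score, reverse=True)[:top]


# Hypothetical calls: A is expensive, B is slow, C is neither.
calls = [
    Call("C", cost_usd=0.001, duration_ms=500),
    Call("B", cost_usd=0.002, duration_ms=4000),
    Call("A", cost_usd=0.010, duration_ms=1000),
]
for call in hotspot_rank(calls):
    print(call.name, call.cost_usd, call.duration_ms)
```

With these weights the expensive call outranks the slow one, which is the point of a composite score: a cheap but slow tool and a fast but costly model call end up on the same list.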

peekr dashboard

Self-contained HTML report — see Dashboard.

peekr replay

Re-run a stored trace through the live SDK, with the same inputs.

peekr replay a3f2b1c0

Evaluators

Score every LLM output for groundedness, conciseness, or any custom rubric. Scores land on the span as attributes.eval_scores.

import peekr

peekr.instrument(evaluators=[
    peekr.eval.Hallucination(),                  # 0.0 = hallucinated, 1.0 = grounded
    peekr.eval.Rubric("Answer is concise and direct"),
    peekr.eval.NotEmpty(),
    peekr.eval.NoError(),
])

openai.chat [gpt-4o]  843ms  312tok
   in:  "When was the Eiffel Tower built?"
   out: "The Eiffel Tower was built in 1923 by Frank Lloyd Wright."
   eval_scores: {Hallucination: 0.0, Rubric: 0.9, NotEmpty: 1.0}

For RAG flows, point Hallucination at the retrieved document instead of the prompt:

peekr.eval.Hallucination(
    context_extractor=lambda span: span.attributes.get("retrieved_docs", "")
)

Claim-level (RAGAS-style) detection

To see why a response scored low, not just what the score was, set detailed=True. The judge decomposes the output into atomic claims and assigns each one a verdict (supported / contradicted / unsupported), the same pipeline RAGAS Faithfulness uses.

peekr.instrument(evaluators=[peekr.eval.Hallucination(detailed=True)])

// span.attributes.hallucination_details
{
  "total": 3, "supported": 1, "contradicted": 2, "unsupported": 0, "score": 0.33,
  "claims": [
    {"text": "The Eiffel Tower is in Paris",         "verdict": "supported"},
    {"text": "It was built in 1923",                 "verdict": "contradicted"},
    {"text": "It was designed by Frank Lloyd Wright", "verdict": "contradicted"}
  ]
}

Use simple mode for cheap monitoring across many traces; detailed mode for the cases worth investigating. Cost is roughly one judge call per scored span.
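
The score in the payload above is consistent with the fraction of supported claims (1 of 3, rounded to 0.33). Here is a small sketch of that arithmetic, assuming the summary fields are derived this way. The `summarise_claims` helper is illustrative and not part of peekr.

```python
from collections import Counter

VERDICTS = ("supported", "contradicted", "unsupported")


def summarise_claims(claims):
    """Tally per-claim verdicts and score the output as supported / total."""
    counts = Counter(claim["verdict"] for claim in claims)
    total = len(claims)
    summary = {"total": total}
    summary.update({verdict: counts.get(verdict, 0) for verdict in VERDICTS})
    # Scoring an output with no claims as 1.0 is an assumption of this sketch.
    summary["score"] = round(counts["supported"] / total, 2) if total else 1.0
    return summary


claims = [
    {"text": "The Eiffel Tower is in Paris", "verdict": "supported"},
    {"text": "It was built in 1923", "verdict": "contradicted"},
    {"text": "It was designed by Frank Lloyd Wright", "verdict": "contradicted"},
]
print(summarise_claims(claims))
```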

Query recent low-scoring traces from SQLite to find regressions (scores below 0.5, newest first):

SELECT trace_id,
       json_extract(attributes, '$.eval_scores.Hallucination') AS score,
       json_extract(attributes, '$.output')                    AS output
FROM spans
WHERE score IS NOT NULL AND score < 0.5
ORDER BY start_time DESC;
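
The same query can be run from Python with the standard-library sqlite3 module. The sketch below builds a throwaway in-memory table so it runs on its own; the three-column schema is an assumption reduced to the columns the query touches, and the rows are invented. It needs a SQLite build with the JSON functions, which current Python distributions normally include.

```python
import json
import sqlite3

# Hypothetical minimal schema: only the columns the query above touches.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE spans (trace_id TEXT, start_time REAL, attributes TEXT)")

rows = [
    ("t1", 1.0, {"output": "grounded answer", "eval_scores": {"Hallucination": 0.9}}),
    ("t2", 2.0, {"output": "built in 1923", "eval_scores": {"Hallucination": 0.0}}),
    ("t3", 3.0, {"output": "no evaluator ran"}),
    ("t4", 4.0, {"output": "partly wrong", "eval_scores": {"Hallucination": 0.33}}),
]
conn.executemany(
    "INSERT INTO spans VALUES (?, ?, ?)",
    [(trace_id, ts, json.dumps(attrs)) for trace_id, ts, attrs in rows],
)


def low_scoring(conn, threshold=0.5):
    """Return (trace_id, score, output) for spans scored below threshold, newest first."""
    return conn.execute(
        """
        SELECT trace_id,
               json_extract(attributes, '$.eval_scores.Hallucination') AS score,
               json_extract(attributes, '$.output')                    AS output
        FROM spans
        WHERE json_extract(attributes, '$.eval_scores.Hallucination') < ?
        ORDER BY start_time DESC
        """,
        (threshold,),
    ).fetchall()


for row in low_scoring(conn):
    print(row)
```

Spans without a score (t3 here) drop out on their own, because comparing NULL with the threshold is never true.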

Dashboard

Generate a self-contained HTML observability report. No server, no build step — open the file in a browser, or attach it to a Slack message.

peekr dashboard traces.db -o report.html   # SQLite
peekr dashboard traces.jsonl               # writes ./dashboard.html

Five tabs (press 1 to 5 to switch, / to search, R to clear filters, Esc to close panels):

Tab        Purpose
Overview   Health hero (0–100), narrative summary of what's happening, top 3 action items
Traces     Search and filter every trace; click any row for full I/O, claim verdicts, citations
Quality    Rolling chart with thresholds, score distribution, channel × time heatmap
Diagnose   AI-generated likely causes, severity-tagged action lists, worst-offender cards with side-by-side context vs answer
Help       Setup checklist, glossary, evaluator snippets, troubleshooting

A persistent filter bar (tenant · model · endpoint · time range) refilters every panel across every tab in one click. Tab and filter state live in the URL hash so links are shareable.

To populate the channel breakdown, peekr reads attributes.model automatically and tenant_id from the span schema. Attach an endpoint yourself in your request handler:

from peekr import trace, get_current_span

@trace
def handle_request(req):
    get_current_span().attributes["endpoint"] = req.path
    return call_llm(...)

Full screenshots and tab-by-tab walkthrough → docs.


Multi-tenant traces

Every span carries two first-class fields — tenant_id (the customer org) and retention_class (a storage-tier hint). They're separate from user_id (the end-user) so a B2B agent can tag both without conflict.

import peekr
peekr.instrument(tenant_id="acme", retention_class="default")

with peekr.session(user_id="alice", tenant_id="acme",
                   retention_class="long"):
    run_agent()

Resolution order, highest priority first:

  1. peekr.session(tenant_id=..., retention_class=...)
  2. peekr.instrument(tenant_id=..., retention_class=...)
  3. Env vars PEEKR_TENANT_ID / PEEKR_RETENTION_CLASS
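
The precedence above amounts to "first value that is set wins". A minimal sketch of that rule follows; `resolve_field` and its dict arguments are hypothetical stand-ins for whatever peekr does internally, and only the environment variable names come from this README.

```python
import os


def resolve_field(name, session_kwargs, instrument_kwargs, environ=None):
    """Return the first non-None value in priority order: session, instrument, env."""
    environ = os.environ if environ is None else environ
    env_var = "PEEKR_" + name.upper()   # tenant_id -> PEEKR_TENANT_ID
    candidates = (
        session_kwargs.get(name),       # 1. peekr.session(...)
        instrument_kwargs.get(name),    # 2. peekr.instrument(...)
        environ.get(env_var),           # 3. environment variable
    )
    for candidate in candidates:
        if candidate is not None:
            return candidate
    return None


env = {"PEEKR_TENANT_ID": "from-env", "PEEKR_RETENTION_CLASS": "short"}
print(resolve_field("tenant_id", {"tenant_id": "acme"}, {"tenant_id": "globex"}, env))
print(resolve_field("retention_class", {}, {"retention_class": "default"}, env))
print(resolve_field("retention_class", {}, {}, env))
```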

Both fields are top-level columns in SQLite (indexed) and top-level keys in JSONL — query without json_extract:

SELECT tenant_id, COUNT(*) FROM spans GROUP BY tenant_id;
SELECT * FROM spans WHERE retention_class = 'long' AND start_time > ?;

retention_class is a free-form string in the OSS SDK. Recommended values are default, short, long, and pii; the meaning of each is enforced by your storage tier (or by Peekr Cloud when you're ready).


Storage

peekr.instrument()                    # JSONL — default, grep-able
peekr.instrument(storage="sqlite")    # SQLite — queryable, multi-process safe
peekr.instrument(storage="both")      # both

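SQLite uses WAL mode so multiple processes (Docker, CI, parallel agents) can share one database: readers never block on a writer, and concurrent writes are serialized by SQLite instead of failing outright. Query across runs: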

# slowest spans by average duration
sqlite3 traces.db "
  SELECT name, ROUND(AVG(duration_ms)) avg_ms
  FROM spans GROUP BY name ORDER BY avg_ms DESC;"

# token spend by model
sqlite3 traces.db "
  SELECT json_extract(attributes,'\$.model')        AS model,
         SUM(json_extract(attributes,'\$.tokens_total')) AS tokens
  FROM spans GROUP BY model;"

# all errors
sqlite3 traces.db "
  SELECT name, trace_id, json_extract(attributes,'\$.error') AS msg
  FROM spans WHERE status = 'error';"
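
To see what WAL mode buys you, here is a self-contained sketch using the standard-library sqlite3 module. Two connections to one file stand in for two processes; the two-column table is a cut-down, assumed version of the spans schema and the rows are invented.

```python
import os
import sqlite3
import tempfile

db_path = os.path.join(tempfile.mkdtemp(), "traces.db")

writer = sqlite3.connect(db_path)
# journal_mode is stored in the file, so later connections get WAL as well.
mode = writer.execute("PRAGMA journal_mode=WAL").fetchone()[0]
writer.execute("CREATE TABLE spans (name TEXT, duration_ms REAL)")
writer.executemany(
    "INSERT INTO spans VALUES (?, ?)",
    [("tool.search_web", 3800), ("tool.search_web", 210), ("openai.chat", 490)],
)
writer.commit()

# A second connection stands in for another process.
reader = sqlite3.connect(db_path)
# The writer starts a new, still uncommitted transaction...
writer.execute("INSERT INTO spans VALUES ('openai.chat', 1033)")
# ...and the reader is not blocked; it sees the last committed snapshot.
slowest = reader.execute(
    "SELECT name, ROUND(AVG(duration_ms)) AS avg_ms "
    "FROM spans GROUP BY name ORDER BY avg_ms DESC"
).fetchall()
writer.commit()

print(mode, slowest)
reader.close()
writer.close()
```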

Alert routing — Slack, webhooks, PagerDuty

By default, alert messages go to stderr. Attach one or more sinks to route them anywhere:

import peekr
from peekr.alert import ErrorRate, CostSpike, LatencyP95, SlackSink, WebhookSink

peekr.instrument(alerts=[
    ErrorRate(threshold=0.05).with_sinks(
        SlackSink("https://hooks.slack.com/services/T0/B0/abc"),
    ),
    CostSpike(multiplier=3.0).with_sinks(
        WebhookSink(
            "https://events.pagerduty.com/v2/enqueue",
            payload_builder=lambda name, msg: {
                "routing_key": "your-key",
                "event_action": "trigger",
                "payload": {"summary": msg, "source": "peekr", "severity": "warning"},
            },
        ),
    ),
])

Sinks are best-effort — network failures, timeouts, and exceptions inside notify() are swallowed silently so a flaky webhook never breaks the application's tracing path. Use WebhookSink(payload_builder=...) to fit any incident system (PagerDuty Events v2, Opsgenie, OpsLevel, custom routers).
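
The best-effort behaviour described above can be pictured as a loop that isolates each sink. This is a sketch of the pattern, not peekr's code: the `dispatch` function, the two sink classes, and the `notify(name, message)` signature are assumptions made for the example.

```python
class FlakySink:
    """Stand-in for a sink whose webhook is down."""

    def notify(self, name, message):
        raise ConnectionError("webhook unreachable")


class ListSink:
    """Collects alerts in memory so the example is observable."""

    def __init__(self):
        self.received = []

    def notify(self, name, message):
        self.received.append((name, message))


def dispatch(sinks, name, message):
    """Deliver an alert to every sink; one failing sink never stops the rest."""
    delivered = 0
    for sink in sinks:
        try:
            sink.notify(name, message)
            delivered += 1
        except Exception:
            # Swallowed on purpose: alert delivery must not break the caller.
            pass
    return delivered


healthy = ListSink()
count = dispatch([FlakySink(), healthy], "ErrorRate", "error rate above threshold")
print(count, healthy.received)
```

The trade-off is that a misconfigured sink fails quietly, so it is worth sending one test alert when you first wire a sink up.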

Sampling

High-traffic agents produce a lot of spans. sample_rate drops a fraction of traces from storage while keeping evaluators and alerts running on the full stream — so your error rate, hallucination score, and cost figures stay accurate.

peekr.instrument(
    sample_rate=0.1,        # keep 10% of traces; default 1.0
    keep_errors=True,       # errored spans always persisted (default)
)

The decision is made once per trace at root-span creation and inherited by every child, so a trace is never partially captured — you don't get orphan openai.chat.completions spans without their parent.
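
A whole-trace decision can be sketched as a lookup keyed by trace id: roll the dice once at the root, then have every child span consult the stored result. The `Sampler` class below is hypothetical and only mirrors the `sample_rate` and `keep_errors` options shown above.

```python
import random


class Sampler:
    """Decide once per trace at the root span; children reuse the decision."""

    def __init__(self, sample_rate=1.0, keep_errors=True, rng=None):
        self.sample_rate = sample_rate
        self.keep_errors = keep_errors
        self._rng = rng or random.Random()
        self._decisions = {}   # trace_id -> bool

    def start_trace(self, trace_id):
        """Called when the root span is created."""
        keep = self._rng.random() < self.sample_rate
        self._decisions[trace_id] = keep
        return keep

    def should_persist(self, trace_id, is_error=False):
        """Called for every span, root or child."""
        if is_error and self.keep_errors:
            return True        # errored spans bypass sampling
        return self._decisions.get(trace_id, True)


sampler = Sampler(sample_rate=0.1)
kept = sampler.start_trace("a3f2b1c0")
# Every child span of the trace gets the same answer as the root.
print(kept, sampler.should_persist("a3f2b1c0"))
```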

OpenTelemetry export

Ship peekr spans into any OTel-compatible backend (Datadog, Honeycomb, Grafana Tempo, Arize Phoenix, Langfuse-OTel, etc.) by translating attributes into the OpenInference semantic conventions the LLM observability ecosystem uses.

pip install "peekr[otel]"

import peekr
from peekr.exporters import add_exporter

peekr.instrument()

# Option 1: reuse your app's existing OTel setup
add_exporter(peekr.OTelExporter())

# Option 2: configure the endpoint inline instead
add_exporter(peekr.OTelExporter(endpoint="https://api.honeycomb.io",
                                headers={"x-honeycomb-team": "..."}))

No agent, no collector, no separate process. Peekr writes OpenInference-shaped spans in-process, and any OTel pipeline you already operate consumes them.

Custom exporters

Ship spans to any backend by implementing one method:

import peekr
import requests

from peekr.exporters import add_exporter

class MyExporter:
    def export(self, span):
        # Send each span as JSON to your backend
        requests.post("https://my-backend.com/spans", json=span.to_dict())

peekr.instrument()
add_exporter(MyExporter())
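
Posting once per span can get chatty on busy agents. One possible refinement is to buffer spans and send them in batches, sketched below under the assumption that an exporter only needs the `export(span)` method shown above. `BatchingExporter`, its `send` callback, and `FakeSpan` are illustrative names, not peekr classes.

```python
class BatchingExporter:
    """Buffers span dicts and hands them to `send` in batches."""

    def __init__(self, send, batch_size=50):
        self._send = send              # callable taking a list of span dicts
        self._batch_size = batch_size
        self._buffer = []

    def export(self, span):
        self._buffer.append(span.to_dict())
        if len(self._buffer) >= self._batch_size:
            self.flush()

    def flush(self):
        """Send whatever is buffered; call this at shutdown as well."""
        if self._buffer:
            batch, self._buffer = self._buffer, []
            self._send(batch)


class FakeSpan:
    """Stand-in for a span object with a to_dict() method."""

    def __init__(self, name):
        self.name = name

    def to_dict(self):
        return {"name": self.name}


sent = []
exporter = BatchingExporter(sent.append, batch_size=2)
for name in ("a", "b", "c"):
    exporter.export(FakeSpan(name))
exporter.flush()
print(sent)
```

In real use `send` would wrap an HTTP call, and the final `flush()` would need to run at process exit so the last partial batch is not lost.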

@trace options

@trace                        # auto-names from module.function, captures I/O
@trace(name="tool.search")    # custom span name
@trace(capture_io=False)      # skip args/output (e.g. secrets)

Supported clients

LLM SDKs

Provider        SDK                                            Install
OpenAI          openai                                         pip install "peekr[openai]"
Anthropic       anthropic                                      pip install "peekr[anthropic]"
AWS Bedrock     boto3                                          pip install "peekr[bedrock]"
Google Gemini   google-genai (or legacy google-generativeai)   pip install "peekr[gemini]"

Agent frameworks

Framework               Package            Install
LangChain / LangGraph   langchain-core     pip install "peekr[langchain]"
LlamaIndex              llama-index-core   pip install "peekr[llamaindex]"
CrewAI                  crewai             pip install "peekr[crewai]"

peekr.instrument() detects whichever SDKs and frameworks are installed and patches them. Streaming is supported across all LLM SDKs. Frameworks emit chain / tool / retriever / agent / LLM spans nested in the order they actually executed:

crewai.crew.kickoff                       3.4s
  └─ crewai.task.execute                  3.4s   task=plan_trip
       └─ crewai.agent.execute_task       3.4s   agent=planner
            └─ openai.chat.completions    1.2s   gpt-4o  · 891tok
            └─ langchain.tool.search_web  2.1s

TypeScript SDK

npm install @peekr/sdk

import { instrument, wrap, trace, withSession } from "@peekr/sdk";
import OpenAI from "openai";

instrument({ jsonlPath: "./traces.jsonl" });
const openai = wrap(new OpenAI());

await withSession(
  { user_id: "alice", tenant_id: "acme" },
  async () => {
    await openai.chat.completions.create({
      model: "gpt-4o-mini",
      messages: [{ role: "user", content: "Summarise the docs above" }],
    });
  },
);

The TypeScript SDK writes the same JSONL schema as Python, so a Node app's traces work with peekr view, peekr cost, and peekr dashboard unchanged. Full reference → peekr-ts/README.md.


Peekr Cloud

The OSS SDK runs in your process, writes to local files, and is MIT licensed forever — that's not changing. When a single-process file isn't the right fit any more (multiple services, a team that needs shared dashboards, longer retention, audit-grade trace storage), Peekr Cloud is the managed backend.

Sign up at peekr.cloud.ashwanijha.dev — free up to 10k spans/month, no card required.

Once you have a pk_live_ key from the project settings page:

import peekr

peekr.instrument(
    tenant_id="acme",
    exporter=peekr.HTTPExporter(
        endpoint="https://peekr.cloud.ashwanijha.dev",
        api_key="pk_live_…",
    ),
)

HTTPExporter is fully implemented as of v0.5 — batched, retried, flushed at interpreter exit. The spans you already instrument locally ship to the Cloud dashboard unchanged; tenant_id and retention_class are first-class columns.

Tier      Spans / month   Price
Free      10k             $0
Starter   500k            $29/mo
Pro       5M              $99/mo
Scale     50M             $399/mo

How it works

instrument() monkey-patches the OpenAI, Anthropic, and Bedrock SDK methods when it is called, which is why it belongs at startup. Python looks up methods by name at call time, so every subsequent call hits the wrapper without any change to your code.
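
A toy version of the patching step makes the mechanism concrete. Everything in this sketch is a stand-in: `FakeCompletions` is not a real SDK class, and `patch_method` and the span dict layout are assumptions, not peekr internals.

```python
import functools
import time


class FakeCompletions:
    """Stand-in for an SDK resource class; not a real client."""

    def create(self, model, prompt):
        return f"[{model}] echo: {prompt}"


SPANS = []


def patch_method(cls, method_name, span_name):
    """Replace cls.method_name with a wrapper that records a span per call."""
    original = getattr(cls, method_name)

    @functools.wraps(original)
    def wrapper(self, *args, **kwargs):
        start = time.perf_counter()
        status = "ok"
        try:
            return original(self, *args, **kwargs)
        except Exception:
            status = "error"
            raise
        finally:
            SPANS.append({
                "name": span_name,
                "status": status,
                "duration_ms": (time.perf_counter() - start) * 1000,
            })

    setattr(cls, method_name, wrapper)


client = FakeCompletions()   # created before the patch is applied
patch_method(FakeCompletions, "create", "fake.completions")
# The method is looked up on the class at call time, so the wrapper runs.
print(client.create("demo-model", "hi"))
print(SPANS[0]["name"], SPANS[0]["status"])
```

Note that the client instance was created before patching and is still traced, because the lookup of `create` happens on the class each time it is called.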

Parent / child span relationships are tracked through contextvars.ContextVar, which propagates correctly across async / await without manual threading. The TypeScript SDK uses Node's AsyncLocalStorage for the same reason.


Contributing

git clone https://github.com/ashwanijha04/peekr
cd peekr
pip install -e ".[dev]"
pytest

Open an issue before large changes. PRs welcome.

