
llm-evaltrack

Drop-in observability for LLM applications: automatic quality scoring, hallucination detection, cost tracking, and agent-run debugging, all with two lines of code.

Install

pip install llm-evaltrack

Quick Start

import llm_observe

llm_observe.init(api_url="https://your-server.com/ingest")
llm_observe.patch_openai()     # auto-track all OpenAI calls
llm_observe.patch_anthropic()  # auto-track all Anthropic calls

That's it. Your existing code is unchanged; every chat.completions.create() and messages.create() call is now tracked automatically.
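Conceptually, patching works by wrapping the client's completion method so every call is recorded before the response is returned. The sketch below is a hypothetical illustration of that idea using a stand-in function, not the package's actual internals:

```python
import functools
import time

CALLS = []  # a real tracker would ship these records to the ingest endpoint

def patched(create_fn):
    """Wrap a chat-completions-style function so each call is recorded."""
    @functools.wraps(create_fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        response = create_fn(*args, **kwargs)
        CALLS.append({
            "model": kwargs.get("model"),
            "messages": kwargs.get("messages"),
            "latency_s": time.perf_counter() - start,
        })
        return response
    return wrapper

# Stand-in for client.chat.completions.create (hypothetical, for illustration):
def fake_create(**kwargs):
    return {"choices": [{"message": {"content": "Paris."}}]}

fake_create = patched(fake_create)
response = fake_create(model="gpt-4o",
                       messages=[{"role": "user", "content": "Capital of France?"}])
```

Because the wrapper returns the original response unchanged, calling code never notices the instrumentation, which is what makes the patching "drop-in".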

Agent Debugging (v0.2.0)

Trace multi-step agent runs to see every step, find where things break, and measure cost per span:

from llm_observe import trace_agent

with trace_agent("research_agent", input="Research renewable energy trends") as trace:

    with trace.span("web_search", span_type="retrieval") as s:
        results = search("renewable energy 2024")
        s.set_output(results)

    with trace.span("llm_summarize", span_type="llm", model="gpt-4o") as s:
        summary = llm.summarize(results)
        s.set_output(summary)
        s.set_tokens(1200)
        s.set_cost(0.009)

    with trace.span("fact_check", span_type="tool") as s:
        verified = fact_check(summary)
        s.set_output(verified)

    trace.set_output("Report complete")

Span types: llm · tool · retrieval · decision · custom

Each trace captures total duration, tokens, cost, per-step timing, inputs/outputs, and errors. An exception raised inside a with block is captured automatically and the span is marked as failed.
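For intuition, a span context manager with exactly that error behavior can be sketched in a few lines of plain Python. This is a hypothetical illustration of the pattern, not the package's actual implementation:

```python
import time
from contextlib import contextmanager

class Span:
    """Minimal record for one step: name, type, output, status, duration."""
    def __init__(self, name, span_type):
        self.name = name
        self.span_type = span_type
        self.output = None
        self.status = "ok"
        self.duration_s = None

    def set_output(self, output):
        self.output = output

@contextmanager
def span(name, span_type="custom"):
    s = Span(name, span_type)
    start = time.perf_counter()
    try:
        yield s
    except Exception:
        s.status = "failed"  # mark the step failed, then let the error propagate
        raise
    finally:
        s.duration_s = time.perf_counter() - start  # timing is recorded either way

with span("fact_check", span_type="tool") as s:
    s.set_output("verified")
```

The try/except/finally split is the key design point: the exception is re-raised so the caller still sees the failure, but the span has already recorded its status and duration.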

What Gets Tracked (auto)

Field                 Source
Input / Output        Message content
Model                 response.model
Tokens                response.usage
Cost (USD)            Calculated from token counts
Quality Score         Heuristic evaluation or LLM judge
Hallucination flags   Automatic detection
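The cost row reflects a simple derivation: multiply the token counts from response.usage by per-model prices. A sketch with made-up prices (real prices vary by provider and change over time, so the numbers here are placeholders):

```python
# Hypothetical per-million-token prices; check your provider for current numbers.
PRICES_PER_1M = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
}

def cost_usd(model, input_tokens, output_tokens):
    """Derive cost in USD from the token counts reported in response.usage."""
    p = PRICES_PER_1M[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

c = cost_usd("gpt-4o", input_tokens=1000, output_tokens=200)  # 0.0045 with these prices
```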

Manual Tracking

llm_observe.track_llm_call(
    input="What is the capital of France?",
    output="Paris.",
    prompt="You are a helpful assistant.",
    model="gpt-4o",
    metadata={"feature": "qa", "user_id": "u_123", "cost_usd": 0.0003},
)
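Whether tracked automatically or manually, each event presumably ends up as a JSON body POSTed to api_url. The wire format is not documented, so the field names below are assumptions that simply mirror track_llm_call's arguments:

```python
import json

def build_payload(input, output, prompt=None, model=None, metadata=None):
    """Assemble a JSON body like the one a tracker might POST to its ingest endpoint.
    Field names are hypothetical, mirroring track_llm_call's keyword arguments."""
    return json.dumps({
        "input": input,
        "output": output,
        "prompt": prompt,
        "model": model,
        "metadata": metadata or {},
    })

body = build_payload(
    input="What is the capital of France?",
    output="Paris.",
    prompt="You are a helpful assistant.",
    model="gpt-4o",
    metadata={"feature": "qa", "user_id": "u_123", "cost_usd": 0.0003},
)
```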

Configuration

llm_observe.init(
    api_url="https://your-server.com/ingest",
    api_key="your-secret",   # optional bearer token
    max_retries=3,
    timeout=5.0,
    enabled=True,            # set False in tests
)
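The max_retries and timeout options suggest that delivery to the ingest endpoint is retried on failure. A hypothetical sketch of what retry-with-backoff looks like (the package's actual retry logic is not documented here, and the helper names are invented for illustration):

```python
import time

def send_with_retries(send, payload, max_retries=3, base_delay=0.5):
    """Call a flaky send() up to max_retries extra times, backing off exponentially."""
    for attempt in range(max_retries + 1):
        try:
            return send(payload)
        except ConnectionError:
            if attempt == max_retries:
                raise  # out of retries: surface the error to the caller
            time.sleep(base_delay * 2 ** attempt)

# A stand-in sender that fails twice, then succeeds:
attempts = {"count": 0}

def flaky_send(payload):
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("ingest endpoint unreachable")
    return "ok"

result = send_with_retries(flaky_send, {"event": "llm_call"}, base_delay=0.0)
```

Setting enabled=False in tests would presumably short-circuit this path entirely, so test suites never touch the network.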

Dashboard

Pair with the self-hosted server for:

  • Real-time quality trend charts
  • Bad response categorization
  • Root-cause analysis by prompt
  • Cost vs quality per model
  • Regression alerts
  • Agent Debugger — waterfall timeline of every span in a trace

Live demo: llm-evaltrack-production.up.railway.app

Self-host: github.com/Soufianeazz/llm-evaltrack

Changelog

v0.2.0

  • Added trace_agent() context manager for agent run tracing
  • Added span() and trace.span() for individual steps
  • Spans support: set_output(), set_tokens(), set_cost(), set_error()
  • Automatic error capture on exceptions inside with blocks
  • Nested span support via parent_span_id

v0.1.0

  • Initial release: patch_openai(), patch_anthropic(), track_llm_call()

