
LLMAuditor — Execution-Based GenAI Application Evaluation & Certification Framework. Audit, track, and certify any LLM/GenAI/Agentic AI application with cost tracking, hallucination detection, and governance enforcement.

Project description

LLMAuditor

Release v1.1.0 — LLMAuditor evaluation & certification framework

Execution-Based GenAI Application Evaluation & Certification Framework.

LLMAuditor wraps around any LLM integration to provide per-execution auditing, hallucination detection, cost governance, certification scoring, and enterprise-grade report export — without locking you into any specific AI provider.

Python 3.9+ · License: MIT


Features

  • Execution Auditing: Per-call tracking of tokens, cost, latency, confidence, and risk level
  • Hallucination Detection: Hybrid engine — rule-based heuristics + optional AI judge
  • Certification Scoring: 5 subscores → weighted overall score → Platinum / Gold / Silver / Conditional / Fail
  • Governance: Budget enforcement, guard mode, alert mode, role-based access
  • Report Export: Markdown, HTML, and PDF with circular certification stamp, digital signature, and certificate number
  • Evaluation Sessions: Batch multiple executions into a scored, exportable certification report
  • Rich CLI Output: Color-coded panels with live token, cost, and risk display
  • Model-Agnostic: Works with OpenAI, Anthropic, Google, AWS Bedrock, or any custom wrapper

Installation

pip install llmauditor

Dependencies: rich>=13.0.0 (CLI display), reportlab>=4.0.0 (PDF export)


Quick Start

1. Single Execution Audit

from llmauditor import auditor

report = auditor.execute(
    model="gpt-4o",
    input_tokens=520,
    output_tokens=290,
    raw_response="Merchant risk is assessed as LOW.",
    input_text="Analyze merchant risk for account #4417",
)
report.display()

This produces a rich CLI panel showing execution ID, model, latency, token counts, estimated cost, confidence score, risk level, hallucination analysis, and improvement suggestions.

2. Monitor Decorator

from llmauditor import auditor

@auditor.monitor(model="gpt-4o")
def call_openai(prompt: str) -> dict:
    # Your actual API call here
    return {
        "response": "Market outlook: bullish on tech sector.",
        "input_tokens": 400,
        "output_tokens": 150,
    }

result = call_openai("Summarize today's market")
# Automatically tracked, displayed, and recorded in history
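
To wire the decorator to a real provider, return the response text and token counts from the wrapped function. A minimal sketch assuming the official openai Python client (v1+); the client setup and usage fields below belong to the OpenAI SDK, not to LLMAuditor:

from llmauditor import auditor
from openai import OpenAI  # assumes openai>=1.0 is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

@auditor.monitor(model="gpt-4o")
def ask_gpt(prompt: str) -> dict:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    # Return the dict shape the decorator expects
    return {
        "response": resp.choices[0].message.content,
        "input_tokens": resp.usage.prompt_tokens,
        "output_tokens": resp.usage.completion_tokens,
    }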

3. Passive Observation

report = auditor.observe(
    model="claude-3.5-sonnet",
    input_tokens=300,
    output_tokens=120,
    raw_response="The quarterly results show 12% growth.",
)
# Records without governance enforcement — useful for logging

4. Evaluation Session with Certification

from llmauditor import auditor

auditor.start_evaluation("My GenAI App", version="2.1.0")

# Run multiple executions...
auditor.execute(model="gpt-4o", input_tokens=500, output_tokens=200,
                raw_response="Response 1...", input_text="Prompt 1")
auditor.execute(model="gpt-4o", input_tokens=600, output_tokens=250,
                raw_response="Response 2...", input_text="Prompt 2")

auditor.end_evaluation()

# Generate certification report
eval_report = auditor.generate_evaluation_report()
eval_report.display()
eval_report.export("pdf", output_dir="./reports")

The exported PDF includes 11 sections, with a circular certification stamp at the top and a digital signature with a unique certificate number (LMA-YYYYMMDD-XXXXXX) at the bottom.


Core API Reference

auditor.execute()

Record and audit a single LLM execution with full governance enforcement.

report = auditor.execute(
    model="gpt-4o",             # Model name (used for cost lookup)
    input_tokens=500,           # Tokens consumed by the prompt
    output_tokens=200,          # Tokens in the completion
    raw_response="...",         # The model's output text
    input_text="...",           # Original prompt (optional, for hallucination analysis)
)

Returns an ExecutionReport with: report.display(), report.export("md"|"html"|"pdf"), report.to_dict()
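
Because to_dict() is part of the report API, a machine-readable audit trail can be kept alongside the CLI display by appending each report to a JSON Lines file (the file name below is arbitrary, and default=str is a guess at handling any non-JSON-serializable values such as timestamps):

import json

from llmauditor import auditor

report = auditor.execute(
    model="gpt-4o",
    input_tokens=500,
    output_tokens=200,
    raw_response="...",
)

# Append one JSON object per execution
with open("audit_trail.jsonl", "a") as f:
    f.write(json.dumps(report.to_dict(), default=str) + "\n")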

auditor.observe()

Passive recording — no governance enforcement, no blocking.

report = auditor.observe(
    model="gpt-4o",
    input_tokens=500,
    output_tokens=200,
    raw_response="...",
)

@auditor.monitor(model=...)

Decorator that automatically audits any function returning {"response": str, "input_tokens": int, "output_tokens": int}.

@auditor.monitor(model="gpt-4o")
def my_function(prompt):
    return {"response": "...", "input_tokens": 100, "output_tokens": 50}

Governance

Budget Enforcement

auditor.set_budget(max_cost_usd=5.00)

# Check remaining budget
status = auditor.get_budget_status()
# → {"budget_limit": 5.0, "spent": 1.23, "remaining": 3.77, "utilization_pct": 24.6}

# Exceeding budget raises BudgetExceededError
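
If you would rather check headroom before an expensive call than catch the exception afterwards, the keys returned by get_budget_status() make that straightforward (the 0.05 estimate below is an arbitrary placeholder, not an LLMAuditor feature):

from llmauditor import auditor

estimated_cost = 0.05  # rough guess for the next call (placeholder value)
status = auditor.get_budget_status()

if status["remaining"] < estimated_cost:
    print("Skipping call: it would likely exceed the configured budget")
else:
    report = auditor.execute(model="gpt-4o", input_tokens=500,
                             output_tokens=200, raw_response="...")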

Guard Mode

Blocks executions with confidence below the threshold.

auditor.guard_mode(confidence_threshold=70)
# Executions scoring below 70 confidence raise LowConfidenceError

Alert Mode

Prints real-time warnings for high-risk executions without blocking.

auditor.set_alert_mode(enabled=True)

Exception Handling

from llmauditor import auditor, BudgetExceededError, LowConfidenceError

try:
    report = auditor.execute(model="gpt-4o", input_tokens=500,
                              output_tokens=200, raw_response="...")
except BudgetExceededError as e:
    print(f"Budget exceeded: {e}")
except LowConfidenceError as e:
    print(f"Blocked by guard mode: {e}")

Hallucination Detection

LLMAuditor includes a hybrid hallucination detection engine:

  • Rule-based heuristics — detects hedging language, self-contradiction, unsupported assertions, specificity without grounding, and temporal/numerical inconsistencies
  • AI judge (optional) — configurable external LLM call for deeper analysis

Each execution report includes a HallucinationAnalysis with:

  • risk_level — None / Low / Medium / High / Critical
  • risk_score — 0.0 to 1.0
  • flags — list of detected patterns
  • explanation — human-readable summary
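
The attribute that exposes this object on the execution report is not spelled out here, so the sketch below assumes a report.hallucination attribute purely for illustration; the four fields are the documented ones:

from llmauditor import auditor

report = auditor.execute(
    model="gpt-4o",
    input_tokens=500,
    output_tokens=200,
    raw_response="Revenue grew exactly 47.3% last quarter.",
)

analysis = report.hallucination  # attribute name is an assumption
if analysis.risk_level in ("High", "Critical"):
    print(f"Risk score {analysis.risk_score:.2f}: {analysis.explanation}")
    for flag in analysis.flags:
        print(f"  - {flag}")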

Certification & Scoring

Evaluation sessions produce a CertificationScore with 5 subscores:

  • Stability: Latency/token variance, failure rate
  • Factual Reliability: Hallucination risk, confidence levels
  • Governance Compliance: Guard/budget/role violation rates
  • Cost Predictability: Cost variance, budget adherence
  • Risk Profile: Distribution of execution risk levels

Certification Levels:

  • Platinum: score ≥ 90 🏆
  • Gold: score ≥ 80 🥇
  • Silver: score ≥ 70 🥈
  • Conditional Pass: score ≥ 60 ⚠️
  • Fail: score < 60

Customizing Weights

auditor.set_certification_thresholds(weights={
    "stability": 0.25,
    "factual_reliability": 0.25,
    "governance_compliance": 0.20,
    "cost_predictability": 0.15,
    "risk_profile": 0.15,
})
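
As a worked example of how the subscores roll up, the overall score is the weighted sum of the five subscores, which is then mapped onto the level thresholds above (the subscore values here are made up, and the weights are the ones from the snippet above):

weights = {"stability": 0.25, "factual_reliability": 0.25,
           "governance_compliance": 0.20, "cost_predictability": 0.15,
           "risk_profile": 0.15}
subscores = {"stability": 92, "factual_reliability": 85,
             "governance_compliance": 78, "cost_predictability": 88,
             "risk_profile": 80}

overall = sum(weights[k] * subscores[k] for k in weights)
print(overall)  # 85.05 -> Gold (>= 80 but < 90)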

Report Export

Per-Execution Reports

report = auditor.execute(model="gpt-4o", input_tokens=500,
                          output_tokens=200, raw_response="...")
report.export("md", output_dir="./reports")   # Markdown
report.export("html", output_dir="./reports") # HTML
report.export("pdf", output_dir="./reports")  # PDF

Evaluation Certification Reports

eval_report = auditor.generate_evaluation_report()
eval_report.export("pdf", output_dir="./reports")

Certification reports contain 11 sections:

  1. Certification Stamp — circular stamp with certification level
  2. Executive Summary — app name, version, execution count, overall score
  3. Certification Score — overall score with level and emoji
  4. Score Breakdown — 5 subscores with progress bars
  5. Execution Log — per-execution details table
  6. Token & Cost Analysis — aggregated token/cost statistics
  7. Hallucination Analysis — risk distribution and flagged patterns
  8. Governance Summary — budget utilization, guard/alert mode status
  9. Improvement Suggestions — actionable recommendations
  10. Methodology — scoring methodology and weight explanation
  11. Plain-Language Summary — business-readable certification narrative

Each report includes a digital signature with a unique certificate number (format: LMA-YYYYMMDD-XXXXXX).


Supported Models & Pricing

Provider Model Input (per 1K) Output (per 1K)
OpenAI gpt-4 $0.0300 $0.0600
OpenAI gpt-4-turbo $0.0100 $0.0300
OpenAI gpt-4o $0.0050 $0.0150
OpenAI gpt-4o-mini $0.00015 $0.0006
OpenAI gpt-3.5-turbo $0.0005 $0.0015
Anthropic claude-3-opus $0.0150 $0.0750
Anthropic claude-3-sonnet $0.0030 $0.0150
Anthropic claude-3-haiku $0.00025 $0.00125
Anthropic claude-3.5-sonnet $0.0030 $0.0150
Anthropic claude-3.5-haiku $0.0008 $0.0040
Google gemini-pro $0.00025 $0.0005
Google gemini-1.5-flash $0.000075 $0.0003
Google gemini-1.5-pro $0.00125 $0.0050
Google gemini-2.0-flash $0.00015 $0.0006
Google gemini-2.0-flash-lite $0.000075 $0.0003
Google gemini-2.5-flash $0.00015 $0.0006
Google gemini-2.5-pro $0.00125 $0.0100
AWS amazon.titan-text-express $0.0002 $0.0006
AWS amazon.titan-text-lite $0.00015 $0.0002

Custom Pricing

auditor.set_pricing_table({
    "my-custom-model": {"input": 0.002, "output": 0.008},
})

Unlisted models default to $0.00 (never crash on unknown models).
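
Per-1K pricing means the cost of an execution is simply tokens / 1000 multiplied by the listed rate for each direction. A quick sanity check against the gpt-4o row, computed by hand rather than through the library:

input_tokens, output_tokens = 520, 290      # token counts from the Quick Start example
input_rate, output_rate = 0.0050, 0.0150    # gpt-4o rates per 1K tokens

cost = (input_tokens / 1000) * input_rate + (output_tokens / 1000) * output_rate
print(f"${cost:.5f}")  # $0.00695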


Project Structure

llm-control-engine/
├── llmauditor/
│   ├── __init__.py          # Public API — singleton, exports, version
│   ├── auditor.py           # LLMAuditor class — orchestration
│   ├── tracker.py           # ExecutionTracker — time & token aggregation
│   ├── cost.py              # Pricing registry + calculate_cost()
│   ├── report.py            # ExecutionReport + rich CLI display
│   ├── exporter.py          # Audit & certification export (MD/HTML/PDF)
│   ├── hallucination.py     # Hybrid hallucination detection engine
│   ├── scoring.py           # Certification scoring (5 subscores)
│   ├── evaluation.py        # Evaluation sessions & metrics
│   └── suggestions.py       # Improvement recommendations
├── validate_llmauditor.py   # Validation script (49 checks)
├── pyproject.toml
├── LICENSE
└── README.md

Design Principles

  • Model-agnostic — works with any LLM provider or custom wrapper
  • Metric-driven — all scoring and certification derived from actual execution data
  • Separation of concerns — 10 focused modules, each with a single responsibility
  • Non-invasive — integrates via execute(), observe(), or @monitor decorator
  • Enterprise-ready — budget enforcement, guard mode, role control, audit export
  • Never crash — unknown models, missing data, and edge cases handled gracefully

Contributing

Contributions are welcome. See the Issues tab for open tasks.


License

MIT

Download files

Download the file for your platform.

Source Distribution

llmauditor-1.1.0.tar.gz (52.5 kB)


Built Distribution


llmauditor-1.1.0-py3-none-any.whl (53.4 kB)


File details

Details for the file llmauditor-1.1.0.tar.gz.

File metadata

  • Download URL: llmauditor-1.1.0.tar.gz
  • Size: 52.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.11

File hashes

Hashes for llmauditor-1.1.0.tar.gz
Algorithm Hash digest
SHA256 a4f7cc7df188b5166606bf0142e4f2fb1c3cfd21be5107b585b14fe950c34ba1
MD5 97401df4445062b9d2fb2a0126f38803
BLAKE2b-256 e5b9f194fd404859abbd8cc6c5f0279f445c9f1be43a43efb3c648b5b8a19fc0


File details

Details for the file llmauditor-1.1.0-py3-none-any.whl.

File metadata

  • Download URL: llmauditor-1.1.0-py3-none-any.whl
  • Size: 53.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.11

File hashes

Hashes for llmauditor-1.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 2601bd60550721d82b84fbe58132c78c8b9ee6e0f4aeed5a7a4a17d61238ae98
MD5 92cc4eabc12c2bf23dcf32f202687f05
BLAKE2b-256 f5664a620f5af20dc7657148abbbf4443a38d608736cd485f7999b0e10723348

