Skip to main content

Inference cost/quality tradeoff auditor for AI systems. Finds the Pareto frontier, flags dominated routing decisions, and outputs IF/THEN routing rules with PASS/WARN/FAIL verdicts.

Project description

InferenceLens

Inference Cost/Quality Tradeoff Auditor

An auditor for AI inference routing decisions — makes the cost/quality tradeoff explicit, measurable, and defensible.

Python FastAPI Pydantic Tests Pareto License


Architecture

InferenceLens Pipeline Architecture

Sample Report Output

Sample Inference Report


The Problem

AI systems make implicit cost/quality tradeoffs on every inference call. Most teams don't know which model configuration is on their Pareto frontier, which is dominated, or whether their routing heuristics are costing more than they save.

The failure mode: "AI systems that make implicit cost/quality routing decisions with no auditability."

When gpt-4o handles a sentiment classification that gpt-3.5-turbo would solve at 94% quality for 97% less cost — that's a routing failure with no paper trail. InferenceLens finds it, proves it, and generates defensible routing rules backed by empirical evidence.


The Thesis Question

How does an AI system know it's working correctly when cost and quality are competing constraints?

InferenceLens answers this by profiling inference calls across model configurations, measuring quality degradation as cost decreases, finding the Pareto frontier, and generating routing recommendations with evidence.


Component Architecture

Component File Role
InferenceProfiler pipeline/profiler.py Runs tasks across model configs; logs tokens, latency, cost per call
QualityEvaluator pipeline/evaluator.py ROUGE-1 (summarization), macro F1 (classification), exact match (extraction); graceful fallback to semantic similarity
CostTracker pipeline/cost_tracker.py Logs token usage, USD cost, latency per call; aggregates by tier and task type
ParetoAnalyzer pipeline/pareto.py Identifies Pareto frontier; flags dominated configs (higher cost, lower quality than an alternative)
RoutingRecommender pipeline/router.py Generates IF/THEN routing rules per complexity band; outputs human-readable + structured rules
AuditLogger pipeline/audit.py Append-only JSONL audit trail of all inference decisions and routing choices
ReportGenerator pipeline/report.py Assembles InferenceLensReport: Pareto curve data, routing rules, PASS/WARN/FAIL verdict

Composite Scoring

routing_efficiency  =  quality_score  /  (normalized_cost × 1000  +  0.001)
pareto_rank         =  position on cost-quality frontier (1 = best)
audit_verdict       =  PASS   (optimal routing — frontier config)
                       WARN   (suboptimal — quality drop > 10% or savings < 20%)
                       FAIL   (dominated config in use)

Pareto dominance rule: Config A dominates Config B if:

A.avg_cost < B.avg_cost  AND  A.avg_quality >= B.avg_quality

Sample Output

InferenceLens Report — Summarization Task
═══════════════════════════════════════════════════════════════
Config           Cost/call   Latency    Quality    Pareto
───────────────────────────────────────────────────────────────
large            $0.0142     1,240ms     0.91      DOMINATED ✗
medium           $0.0021       380ms     0.88      FRONTIER  ✓
small            $0.0008       210ms     0.79      FRONTIER  ✓
local            $0.0000       890ms     0.74      FRONTIER  ✓
───────────────────────────────────────────────────────────────
[PASS] IF task=summarization AND complexity<0.35
       THEN use=small [gpt-3.5-turbo] (saves 94% cost, quality_delta=-0.12)

[PASS] IF task=summarization AND complexity<0.70
       THEN use=medium [gpt-4o-mini] (saves 85% cost, quality_delta=-0.03)

[PASS] IF task=summarization AND complexity>=0.70
       THEN use=medium [gpt-4o-mini] (saves 85% cost, quality_delta=-0.03)
═══════════════════════════════════════════════════════════════
Global verdict: PASS ✓
═══════════════════════════════════════════════════════════════

Complexity Score Definition

The complexity score (0–1) determines which routing band a prompt falls into. It is computed before any model call — making it a zero-cost, deterministic signal that cannot depend on LLM output.

complexity_score = (
    0.50 * token_count_score      +  # normalised prompt token count
    0.30 * vocabulary_entropy     +  # type-token ratio (lexical diversity)
    0.20 * task_type_baseline        # fixed baseline by task type
)

Token count score: min(prompt_tokens / 500, 1.0) — a 500-token prompt scores 1.0; a 50-token prompt scores 0.1. Longer prompts require more context retention and typically benefit from larger models.

Vocabulary entropy: Type-token ratio = unique tokens / total tokens. High TTR (0.8+) signals domain-specific or technical language where smaller models lose precision. Low TTR (0.2–0.4) signals repetitive, formulaic text a smaller model handles well.

Task type baseline: Extraction tasks get a 0.15 baseline (structured field correctness is sensitive to model quality), classification gets 0.10, summarization gets 0.05 (most tolerant of quality degradation).

Routing bands:

Band Score Range Default Route Rationale
Simple < 0.35 local / small Short, low-entropy — local model sufficient
Standard 0.35–0.70 medium Moderate complexity — gpt-4o-mini hits quality floor
Complex > 0.70 large Long, technical, or extraction — quality floor requires large

The score is logged in the audit record per call, enabling post-hoc validation that routing decisions were based on reproducible inputs.


Synthetic Dataset

data/generator.py generates a deterministic (seeded) dataset of 300 prompts across 3 task types:

Task Type Count Ground Truth Quality Metric
Summarization 100 Article summaries (5 domains) ROUGE-1 F1
Classification 100 Sentiment labels (positive/negative/neutral) Macro F1
Extraction 100 Structured JSON from documents Field-level exact match

4 model configurations are simulated (deterministic, seed=42):

Tier Model Input cost/1K Output cost/1K Avg latency
large gpt-4o $0.005 $0.015 1,240ms
medium gpt-4o-mini $0.00015 $0.0006 380ms
small gpt-3.5-turbo $0.0005 $0.0015 210ms
local ollama/mistral $0.000 $0.000 890ms

Quick Start

1. Install

git clone https://github.com/SidharthKriplani/inferencelens
cd inferencelens
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env

2. Run the demo

python demo.py

This runs the full pipeline: generates 300 synthetic prompts, profiles all 4 configs across all 3 task types (1,200 simulated calls), runs Pareto analysis, generates routing rules, writes the audit log, and prints the terminal report.

3. Run tests

pytest tests/ -v

4. Start the API

uvicorn api.main:app --reload
# → http://localhost:8000/docs

5. Docker

docker build -t inferencelens .
docker run -p 8000:8000 inferencelens

API Reference

Method Endpoint Description
GET /health Service health check
GET /configs Model configs and cost table
POST /profile Run profiling across model configs
GET /report/{task_type} Full report for summarization, classification, or extraction
GET /audit-log Query JSONL audit log with filters

POST /profile — request body

{
  "task_types": ["summarization", "classification"],
  "n_samples": 20,
  "tiers": ["large", "medium", "small"]
}

GET /audit-log — query params

?limit=50&event_type=routing_decision&task_type=classification&verdict=PASS

Project Structure

inferencelens/
├── config.py                 # Model configs, cost table, routing thresholds
├── pipeline/
│   ├── profiler.py           # InferenceProfiler
│   ├── evaluator.py          # QualityEvaluator (ROUGE, F1, ExactMatch)
│   ├── cost_tracker.py       # CostTracker
│   ├── pareto.py             # ParetoAnalyzer
│   ├── router.py             # RoutingRecommender
│   ├── audit.py              # AuditLogger (JSONL)
│   └── report.py             # ReportGenerator
├── data/
│   └── generator.py          # Deterministic synthetic dataset
├── models/                   # Empty — no training required
├── api/
│   └── main.py               # FastAPI app
├── tests/
│   ├── test_generator.py
│   ├── test_evaluator.py
│   ├── test_pareto.py
│   ├── test_router.py
│   └── test_pipeline_integration.py
├── docs/assets/
│   ├── pipeline_architecture.svg
│   └── inference_report_sample.svg
├── demo.py
├── requirements.txt
├── .env.example
├── Dockerfile
└── README.md

Design Decisions

Why no ML model training? The core problem is measurement and auditing, not prediction. InferenceLens is a metrological tool: it instruments inference calls, applies well-defined metrics (ROUGE, F1, exact match), and applies geometric analysis (Pareto dominance) to surface routing failures. No gradient descent required.

Why deterministic synthetic data? Real inference costs money and requires API keys. Seeded synthetic data (seed=42) allows the full pipeline to demonstrate meaningful cost/quality gradients without dependencies, making the project fully runnable in any environment.

Why JSONL for the audit trail? Append-only JSONL is the industry standard for audit logs: human-readable, streamable, queryable with jq, and trivially ingestible by any data platform.

Why Pydantic v2 throughout? All data structures are Pydantic models: type-safe, auto-validated, JSON-serializable. The FastAPI integration is zero-ceremony because request/response models are already Pydantic.


Interview Defense

📄 InferenceLens_Interview_Defense.pdf — covers:

  • Pareto dominance definition and edge cases (non-monotone quality curves)
  • Why ROUGE-1 for summarization and not BERTScore — tradeoffs and known limitations
  • How routing rules translate to a production API gateway (beyond IF/THEN logic)
  • Complexity signal definition — what features determine prompt complexity
  • Scaling to thousands of task types and distribution shift in production

Part of Applied LLM Systems Portfolio

This library is part of a 13-repo Applied LLM Systems portfolio targeting Applied LLM Systems Engineer, MLOps, and Technical AI PM roles.

Applied Systems (LangGraph pipelines — same failure modes under domain pressure):

Project Domain Primary Failure Mode
LendFlow Financial underwriting When to stop or escalate
AgentReliabilityLab Cyber threat triage When to stop or escalate
NexusSupply Supplier risk intelligence Conflicting signal fusion

Libraries & Auditors (domain-agnostic tooling):

Project What It Audits
InferenceLens Inference cost/quality tradeoffs — Pareto frontier, routing rules
RiskFrame ML model lifecycle — champion/challenger, drift, fairness
MetaSignal A/B experiment validity — CUPED, guardrail-first, SRM
DevPulse Version-safe RAG — conflict detection, LLM-Last architecture
PulseRank Marketplace ranking — IPS debiasing, MMR diversity
TrialCheck A/B readout audit — SRM, peeking, underpowered tests
FeatureLeakageLens Pre-training leakage — target, temporal, overlap
GoldenSetAuditor LLM/RAG eval dataset quality
DocIngestQA RAG document ingestion quality — 11 deterministic checks
MetricLens Metric movement decomposition — mix shift vs rate shift

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

inferencelens-1.0.0.tar.gz (32.1 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

inferencelens-1.0.0-py3-none-any.whl (32.0 kB view details)

Uploaded Python 3

File details

Details for the file inferencelens-1.0.0.tar.gz.

File metadata

  • Download URL: inferencelens-1.0.0.tar.gz
  • Upload date:
  • Size: 32.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.5

File hashes

Hashes for inferencelens-1.0.0.tar.gz
Algorithm Hash digest
SHA256 9fc96c1297335f67e3fff6de29cecbec9d12709db740eb7b8479f4481c5e1bee
MD5 1ccb5379db98041c1e02c7e118af65cf
BLAKE2b-256 340f82099ee877564a4fc6188e169fa46c0867395233fc3970d47b2ead6f4b96

See more details on using hashes here.

File details

Details for the file inferencelens-1.0.0-py3-none-any.whl.

File metadata

  • Download URL: inferencelens-1.0.0-py3-none-any.whl
  • Upload date:
  • Size: 32.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.5

File hashes

Hashes for inferencelens-1.0.0-py3-none-any.whl
Algorithm Hash digest
SHA256 235b16dcbacdd6b178876266923b7458f0bd9ec83747c9f1e1e07fbec54a25d4
MD5 c27b187b1088ab7330a2d54acfddcead
BLAKE2b-256 b1e8e052b7fc811d6ed88892f3650c3e62ccce10187e3e996dea08cd947278d3

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page