Inference cost/quality tradeoff auditor for AI systems. Finds the Pareto frontier, flags dominated routing decisions, and outputs IF/THEN routing rules with PASS/WARN/FAIL verdicts.

These details have not been verified by PyPI

InferenceLens

Inference Cost/Quality Tradeoff Auditor

An auditor for AI inference routing decisions — makes the cost/quality tradeoff explicit, measurable, and defensible.

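Architecture

[Figure: InferenceLens Pipeline Architecture (docs/assets/pipeline_architecture.svg)]

Sample Report Output

[Figure: Sample Inference Report (docs/assets/inference_report_sample.svg)]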

The Problem

AI systems make implicit cost/quality tradeoffs on every inference call. Most teams don't know which model configuration is on their Pareto frontier, which is dominated, or whether their routing heuristics are costing more than they save.

The failure mode: "AI systems that make implicit cost/quality routing decisions with no auditability."

When gpt-4o handles a sentiment classification that gpt-3.5-turbo would solve at 94% quality for 97% less cost — that's a routing failure with no paper trail. InferenceLens finds it, proves it, and generates defensible routing rules backed by empirical evidence.

The Thesis Question

How does an AI system know it's working correctly when cost and quality are competing constraints?

InferenceLens answers this by profiling inference calls across model configurations, measuring quality degradation as cost decreases, finding the Pareto frontier, and generating routing recommendations with evidence.

Component Architecture

Component	File	Role
InferenceProfiler	`pipeline/profiler.py`	Runs tasks across model configs; logs tokens, latency, cost per call
QualityEvaluator	`pipeline/evaluator.py`	ROUGE-1 (summarization), macro F1 (classification), exact match (extraction); graceful fallback to semantic similarity
CostTracker	`pipeline/cost_tracker.py`	Logs token usage, USD cost, latency per call; aggregates by tier and task type
ParetoAnalyzer	`pipeline/pareto.py`	Identifies Pareto frontier; flags dominated configs (an alternative costs less and is at least as good on quality)
RoutingRecommender	`pipeline/router.py`	Generates IF/THEN routing rules per complexity band; outputs human-readable + structured rules
AuditLogger	`pipeline/audit.py`	Append-only JSONL audit trail of all inference decisions and routing choices
ReportGenerator	`pipeline/report.py`	Assembles InferenceLensReport: Pareto curve data, routing rules, PASS/WARN/FAIL verdict

Composite Scoring

routing_efficiency  =  quality_score  /  (normalized_cost × 1000  +  0.001)
pareto_rank         =  position on cost-quality frontier (1 = best)
audit_verdict       =  PASS   (optimal routing — frontier config)
                       WARN   (suboptimal — quality drop > 10% or savings < 20%)
                       FAIL   (dominated config in use)
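
The scoring block above can be sketched in Python. This is a minimal sketch, not the project's code: the function names are hypothetical, `normalized_cost` is assumed to be already scaled to the range 0 to 1, `quality_drop` and `savings` are assumed to be fractions measured against a baseline config, and FAIL is assumed to take precedence over WARN.

```python
def routing_efficiency(quality_score: float, normalized_cost: float) -> float:
    """quality_score / (normalized_cost * 1000 + 0.001), as in the formula above.

    The 0.001 term keeps the denominator non-zero for free (local) configs.
    """
    return quality_score / (normalized_cost * 1000 + 0.001)


def audit_verdict(on_frontier: bool, quality_drop: float, savings: float) -> str:
    """Map one routing decision to PASS / WARN / FAIL.

    Assumption: quality_drop and savings are fractions (0.10 means 10%).
    """
    if not on_frontier:
        return "FAIL"  # a dominated config is in use
    if quality_drop > 0.10 or savings < 0.20:
        return "WARN"  # on the frontier, but a weak trade
    return "PASS"


print(audit_verdict(on_frontier=True, quality_drop=0.03, savings=0.85))
```

Because the cost term is multiplied by 1000, even a small normalized cost dominates the denominator, so the score strongly favors cheaper configs at similar quality.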

Pareto dominance rule: Config A dominates Config B if:

A.avg_cost < B.avg_cost  AND  A.avg_quality >= B.avg_quality
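
A minimal sketch of the dominance rule applied to a set of configs. The `ConfigStats` class, the `split_frontier` helper, and the three example configs are hypothetical and chosen only to illustrate the rule; they are not taken from `pipeline/pareto.py`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfigStats:
    name: str
    avg_cost: float     # USD per call
    avg_quality: float  # metric score in [0, 1]


def dominates(a: ConfigStats, b: ConfigStats) -> bool:
    """A dominates B if A is strictly cheaper and at least as good."""
    return a.avg_cost < b.avg_cost and a.avg_quality >= b.avg_quality


def split_frontier(configs):
    """Return (frontier_names, dominated_names) under the rule above."""
    frontier, dominated = [], []
    for c in configs:
        if any(dominates(other, c) for other in configs if other is not c):
            dominated.append(c.name)
        else:
            frontier.append(c.name)
    return frontier, dominated


# Hypothetical numbers for illustration only.
configs = [
    ConfigStats("a", avg_cost=0.010, avg_quality=0.80),
    ConfigStats("b", avg_cost=0.002, avg_quality=0.85),
    ConfigStats("c", avg_cost=0.001, avg_quality=0.70),
]
frontier, dominated = split_frontier(configs)
print(frontier, dominated)
```

Here `a` is dominated by `b` (cheaper and better), while `c` stays on the frontier because nothing cheaper matches its quality. Under the strict cost inequality, two configs with identical cost and quality do not dominate each other.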

Sample Output

InferenceLens Report — Summarization Task
═══════════════════════════════════════════════════════════════
Config           Cost/call   Latency    Quality    Pareto
───────────────────────────────────────────────────────────────
large            $0.0142     1,240ms     0.91      FRONTIER  ✓
medium           $0.0021       380ms     0.88      FRONTIER  ✓
small            $0.0008       210ms     0.79      FRONTIER  ✓
local            $0.0000       890ms     0.74      FRONTIER  ✓
───────────────────────────────────────────────────────────────
[PASS] IF task=summarization AND complexity<0.35
       THEN use=small [gpt-3.5-turbo] (saves 94% cost, quality_delta=-0.12)

[PASS] IF task=summarization AND complexity<0.70
       THEN use=medium [gpt-4o-mini] (saves 85% cost, quality_delta=-0.03)

[PASS] IF task=summarization AND complexity>=0.70
       THEN use=medium [gpt-4o-mini] (saves 85% cost, quality_delta=-0.03)
═══════════════════════════════════════════════════════════════
Global verdict: PASS ✓
═══════════════════════════════════════════════════════════════
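
The savings and quality_delta figures in the sample rules are consistent with comparing each recommended tier against the large tier. A minimal sketch of that arithmetic, assuming large is the baseline; the function name `rule_stats` is hypothetical.

```python
def rule_stats(baseline, candidate):
    """Return (savings_pct, quality_delta) for candidate vs. baseline.

    Each argument is a (cost_per_call, quality) pair.
    """
    base_cost, base_quality = baseline
    cand_cost, cand_quality = candidate
    savings_pct = round((1 - cand_cost / base_cost) * 100)
    quality_delta = round(cand_quality - base_quality, 2)
    return savings_pct, quality_delta


# Cost/call and Quality values from the sample table above.
large = (0.0142, 0.91)
medium = (0.0021, 0.88)
small = (0.0008, 0.79)

print(rule_stats(large, small))   # (94, -0.12)
print(rule_stats(large, medium))  # (85, -0.03)
```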

Complexity Score Definition

The complexity score (0–1) determines which routing band a prompt falls into. It is computed before any model call — making it a zero-cost, deterministic signal that cannot depend on LLM output.

complexity_score = (
    0.50 * token_count_score      +  # normalised prompt token count
    0.30 * vocabulary_entropy     +  # type-token ratio (lexical diversity)
    0.20 * task_type_baseline        # fixed baseline by task type
)

Token count score: min(prompt_tokens / 500, 1.0) — a 500-token prompt scores 1.0; a 50-token prompt scores 0.1. Longer prompts require more context retention and typically benefit from larger models.

Vocabulary entropy: approximated by the type-token ratio (TTR) = unique tokens / total tokens. High TTR (0.8+) signals domain-specific or technical language where smaller models lose precision. Low TTR (0.2–0.4) signals repetitive, formulaic text that a smaller model handles well.

Task type baseline: Extraction tasks get a 0.15 baseline (structured field correctness is sensitive to model quality), classification gets 0.10, summarization gets 0.05 (most tolerant of quality degradation).
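
The three components above can be combined as in the following sketch. It assumes lowercase whitespace tokenization, which the text does not specify, and the names `TASK_BASELINE` and `complexity_score` as written here are illustrative rather than the project's actual API.

```python
# Baselines as listed above.
TASK_BASELINE = {"extraction": 0.15, "classification": 0.10, "summarization": 0.05}


def complexity_score(prompt: str, task_type: str) -> float:
    """Weighted sum of token count, type-token ratio, and task baseline."""
    tokens = prompt.lower().split()  # assumption: whitespace tokens
    baseline = TASK_BASELINE[task_type]
    if not tokens:
        # Empty prompt: only the task baseline contributes.
        return 0.20 * baseline
    token_count_score = min(len(tokens) / 500, 1.0)
    vocabulary_entropy = len(set(tokens)) / len(tokens)  # type-token ratio
    return 0.50 * token_count_score + 0.30 * vocabulary_entropy + 0.20 * baseline


# 50 tokens, 5 unique: 0.50*0.1 + 0.30*0.1 + 0.20*0.05 = 0.09
prompt = "a b c d e " * 10
print(round(complexity_score(prompt, "summarization"), 4))
```

The function uses only the prompt text and the task type, so it can run before any model call. Type-token ratio is length-sensitive: very short prompts tend to have few repeated tokens and therefore score high on the vocabulary term.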

Routing bands:

Band	Score Range	Default Route	Rationale
Simple	< 0.35	local / small	Short, low-entropy prompts; a local model is sufficient
Standard	0.35 to < 0.70	medium	Moderate complexity; gpt-4o-mini meets the quality floor
Complex	>= 0.70	large	Long, technical, or extraction prompts; the quality floor requires large

The score is logged in the audit record per call, enabling post-hoc validation that routing decisions were based on reproducible inputs.
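
A minimal sketch of the band lookup. It assumes a score of exactly 0.70 falls in the Complex band, matching the `complexity>=0.70` condition in the sample rules; the function name `routing_band` is hypothetical.

```python
def routing_band(score: float) -> tuple[str, str]:
    """Return (band, default_route) for a complexity score in [0, 1]."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"complexity score out of range: {score}")
    if score < 0.35:
        return ("Simple", "local / small")
    if score < 0.70:
        return ("Standard", "medium")
    return ("Complex", "large")


print(routing_band(0.21))  # ('Simple', 'local / small')
print(routing_band(0.70))  # ('Complex', 'large')
```

These are default routes only. The empirically generated rules may override them per task type, as the sample report does for summarization.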

Synthetic Dataset

data/generator.py generates a deterministic (seeded) dataset of 300 prompts across 3 task types:

Task Type	Count	Ground Truth	Quality Metric
Summarization	100	Article summaries (5 domains)	ROUGE-1 F1
Classification	100	Sentiment labels (positive/negative/neutral)	Macro F1
Extraction	100	Structured JSON from documents	Field-level exact match
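
For reference, the three quality metrics in the table can be sketched in plain Python. This is a simplified illustration, not the code in `pipeline/evaluator.py`: it assumes whitespace tokenization for ROUGE-1, and the function names are hypothetical.

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 between a reference and a candidate summary."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred) -> float:
    """Unweighted mean of per-label F1 over labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        return 0.0
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def field_exact_match(expected: dict, predicted: dict) -> float:
    """Fraction of expected fields whose predicted value matches exactly."""
    if not expected:
        return 1.0
    hits = sum(predicted.get(key) == value for key, value in expected.items())
    return hits / len(expected)


print(rouge1_f1("the cat sat on the mat", "the cat"))  # close to 0.5
```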

Four model configurations are simulated (deterministic, seed=42):

Tier	Model	Input cost/1K	Output cost/1K	Avg latency
`large`	gpt-4o	$0.005	$0.015	1,240ms
`medium`	gpt-4o-mini	$0.00015	$0.0006	380ms
`small`	gpt-3.5-turbo	$0.0005	$0.0015	210ms
`local`	ollama/mistral	$0.000	$0.000	890ms
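
The per-call cost follows from the table by pricing input and output tokens separately. A minimal sketch, with prices copied from the table; the function name and the token counts in the example are hypothetical.

```python
# (input USD per 1K tokens, output USD per 1K tokens), from the table above.
PRICES_PER_1K = {
    "large": (0.005, 0.015),
    "medium": (0.00015, 0.0006),
    "small": (0.0005, 0.0015),
    "local": (0.0, 0.0),
}


def call_cost_usd(tier: str, input_tokens: int, output_tokens: int) -> float:
    """USD cost of one call given its token counts."""
    in_price, out_price = PRICES_PER_1K[tier]
    return input_tokens / 1000 * in_price + output_tokens / 1000 * out_price


# Hypothetical call: 2,000 prompt tokens and 500 completion tokens.
for tier in PRICES_PER_1K:
    print(f"{tier:<7} ${call_cost_usd(tier, 2000, 500):.5f}")
```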

Quick Start

1. Install

git clone https://github.com/SidharthKriplani/inferencelens
cd inferencelens
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env

2. Run the demo

python demo.py

This runs the full pipeline: generates 300 synthetic prompts, profiles all 4 configs across all 3 task types (1,200 simulated calls), runs Pareto analysis, generates routing rules, writes the audit log, and prints the terminal report.

3. Run tests

pytest tests/ -v

4. Start the API

uvicorn api.main:app --reload
# → http://localhost:8000/docs

5. Docker

docker build -t inferencelens .
docker run -p 8000:8000 inferencelens

API Reference

Method	Endpoint	Description
`GET`	`/health`	Service health check
`GET`	`/configs`	Model configs and cost table
`POST`	`/profile`	Run profiling across model configs
`GET`	`/report/{task_type}`	Full report for summarization, classification, or extraction
`GET`	`/audit-log`	Query JSONL audit log with filters

POST /profile — request body

{
  "task_types": ["summarization", "classification"],
  "n_samples": 20,
  "tiers": ["large", "medium", "small"]
}
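
A minimal sketch of sending this body from Python using only the standard library. It assumes the API is running locally on port 8000 as in the Quick Start; the helper name `build_profile_request` is hypothetical, and the response schema is not documented here, so the example does not assume one.

```python
import json
import urllib.request


def build_profile_request(base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build (but do not send) a POST /profile request."""
    payload = {
        "task_types": ["summarization", "classification"],
        "n_samples": 20,
        "tiers": ["large", "medium", "small"],
    }
    return urllib.request.Request(
        url=f"{base_url}/profile",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_profile_request()
print(request.get_method(), request.full_url)

# To send it against a running server:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```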

GET /audit-log — query params

?limit=50&event_type=routing_decision&task_type=classification&verdict=PASS
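
The same query string can be built programmatically. A minimal sketch using the standard library; the helper name `audit_log_url` is hypothetical and the base URL assumes the local server from the Quick Start.

```python
from urllib.parse import urlencode


def audit_log_url(base_url: str = "http://localhost:8000", **filters) -> str:
    """Build a GET /audit-log URL, skipping filters that are None."""
    params = {key: value for key, value in filters.items() if value is not None}
    query = urlencode(params)
    return f"{base_url}/audit-log" + (f"?{query}" if query else "")


url = audit_log_url(
    limit=50,
    event_type="routing_decision",
    task_type="classification",
    verdict="PASS",
)
print(url)
```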

Project Structure

inferencelens/
├── config.py                 # Model configs, cost table, routing thresholds
├── pipeline/
│   ├── profiler.py           # InferenceProfiler
│   ├── evaluator.py          # QualityEvaluator (ROUGE, F1, ExactMatch)
│   ├── cost_tracker.py       # CostTracker
│   ├── pareto.py             # ParetoAnalyzer
│   ├── router.py             # RoutingRecommender
│   ├── audit.py              # AuditLogger (JSONL)
│   └── report.py             # ReportGenerator
├── data/
│   └── generator.py          # Deterministic synthetic dataset
├── models/                   # Empty — no training required
├── api/
│   └── main.py               # FastAPI app
├── tests/
│   ├── test_generator.py
│   ├── test_evaluator.py
│   ├── test_pareto.py
│   ├── test_router.py
│   └── test_pipeline_integration.py
├── docs/assets/
│   ├── pipeline_architecture.svg
│   └── inference_report_sample.svg
├── demo.py
├── requirements.txt
├── .env.example
├── Dockerfile
└── README.md

Design Decisions

Why no ML model training? The core problem is measurement and auditing, not prediction. InferenceLens is a measurement tool: it instruments inference calls, scores them with well-defined metrics (ROUGE, F1, exact match), and uses Pareto dominance analysis to surface routing failures. No gradient descent required.

Why deterministic synthetic data? Real inference costs money and requires API keys. Seeded synthetic data (seed=42) lets the full pipeline demonstrate meaningful cost/quality gradients without API keys or network access, so the project runs end to end in any environment.
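
A minimal sketch of the seeding idea, not the code in `data/generator.py`: a local `random.Random(seed)` instance yields the same sequence on every run without touching global random state. The function name `generate_labels` is hypothetical; the label set is the one used for the classification task.

```python
import random

SENTIMENT_LABELS = ["positive", "negative", "neutral"]


def generate_labels(n: int = 100, seed: int = 42) -> list:
    """Return n sentiment labels, reproducible for a given seed."""
    rng = random.Random(seed)  # local generator, isolated from global state
    return [rng.choice(SENTIMENT_LABELS) for _ in range(n)]


first = generate_labels()
second = generate_labels()
print(first == second)  # True: same seed, same data
```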

Why JSONL for the audit trail? Append-only JSONL is a common choice for audit logs: human-readable, streamable, queryable with jq, and easy to ingest into most data platforms.
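
A minimal sketch of an append-only JSONL log with simple equality filters. The helper names are hypothetical and the record layout is an assumption: the field names are borrowed from the `/audit-log` query parameters and the logged complexity score, not from `pipeline/audit.py`.

```python
import json
import os
import tempfile


def append_event(path: str, event: dict) -> None:
    """Append one event as a single JSON line; existing lines are never rewritten."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(event, sort_keys=True) + "\n")


def read_events(path: str, **filters) -> list:
    """Return events whose fields equal every given filter value."""
    events = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue
            record = json.loads(line)
            if all(record.get(key) == value for key, value in filters.items()):
                events.append(record)
    return events


log_path = os.path.join(tempfile.mkdtemp(), "audit.jsonl")
append_event(log_path, {"event_type": "routing_decision", "task_type": "classification",
                        "verdict": "PASS", "complexity_score": 0.21})
append_event(log_path, {"event_type": "routing_decision", "task_type": "summarization",
                        "verdict": "WARN", "complexity_score": 0.58})
passed = read_events(log_path, verdict="PASS")
print(len(passed), passed[0]["task_type"])
```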

Why Pydantic v2 throughout? All data structures are Pydantic models: type-safe, auto-validated, JSON-serializable. The FastAPI integration is zero-ceremony because request/response models are already Pydantic.

Interview Defense

📄 InferenceLens_Interview_Defense.pdf — covers:

Pareto dominance definition and edge cases (non-monotone quality curves)
Why ROUGE-1 for summarization and not BERTScore — tradeoffs and known limitations
How routing rules translate to a production API gateway (beyond IF/THEN logic)
Complexity signal definition — what features determine prompt complexity
Scaling to thousands of task types and distribution shift in production

Part of Applied LLM Systems Portfolio

This library is part of a 13-repo Applied LLM Systems portfolio targeting Applied LLM Systems Engineer, MLOps, and Technical AI PM roles.

Applied Systems (LangGraph pipelines — same failure modes under domain pressure):

Project	Domain	Primary Failure Mode
LendFlow	Financial underwriting	When to stop or escalate
AgentReliabilityLab	Cyber threat triage	When to stop or escalate
NexusSupply	Supplier risk intelligence	Conflicting signal fusion

Libraries & Auditors (domain-agnostic tooling):

Project	What It Audits
InferenceLens	Inference cost/quality tradeoffs — Pareto frontier, routing rules
RiskFrame	ML model lifecycle — champion/challenger, drift, fairness
MetaSignal	A/B experiment validity — CUPED, guardrail-first, SRM
DevPulse	Version-safe RAG — conflict detection, LLM-Last architecture
PulseRank	Marketplace ranking — IPS debiasing, MMR diversity
TrialCheck	A/B readout audit — SRM, peeking, underpowered tests
FeatureLeakageLens	Pre-training leakage — target, temporal, overlap
GoldenSetAuditor	LLM/RAG eval dataset quality
DocIngestQA	RAG document ingestion quality — 11 deterministic checks
MetricLens	Metric movement decomposition — mix shift vs rate shift

Project details

This version

1.0.0

Jun 20, 2026

