Skip to main content

LLM Evaluation Platform - Detect hallucinations in RAG pipelines. 200x cheaper than GPT-4 judge.

Project description

MiniEval Pro — LLM Hallucination Detection

Know if your AI is lying to users — before they do.

PyPI version Python 3.9+ License: MIT

200x cheaper than GPT-4 judge. Self-hosted. Your data never leaves your server.


The problem

Your RAG pipeline looks fine in testing. Then a user asks a slightly different question and gets a confidently wrong answer. You find out from a complaint, not a metric.

Checking every output with GPT-4 costs $0.06 per eval — at 10,000 evals/day, that's $600/day just to monitor quality. So most teams don't check at all.

MiniEval Pro fixes this. Small local models. Same job. $0.0003 per eval.


Install

pip install minieval-pro
Quickstart  3 lines
python
from minieval_pro import Evaluator

ev = Evaluator()
result = ev.score(
    question="What is the refund policy?",
    context="Refunds are available within 30 days of purchase.",
    answer="You can return items within 90 days for a full refund."
)

print(result.passed)         # False — hallucination detected
print(result.faithfulness)   # 0.00
print(result.summary())
# Overall: 0.45 | Faithfulness: 0.00 | Relevance: 0.89 | Toxicity: 0.00
# ❌ HALLUCINATION: Answer says 90 days, context says 30 days.
Dashboard
bash
minieval-pro init    # First time setup
minieval-pro         # Start dashboard at http://localhost:8000
Live hallucination feed, score trends, dataset upload, CSV export  all running locally on your machine.

https://docs/dashboard.png

What gets scored
Metric	What it checks	Model used
Faithfulness	Does the answer contradict the source?	DeBERTa-v3-small (NLI)
Relevance	Does the answer address the question?	all-MiniLM-L6-v2
Toxicity	Is the output safe for users?	toxic-bert
Overall	Weighted composite score	Ensemble (0.0–1.0)
Who is this for
Role	Use case
AI Engineer	Catch hallucinations in RAG pipelines before production
ML Engineer	Compare model outputs across fine-tuning experiments
Data Scientist	Benchmark prompt variations with real quality metrics
QA Engineer	Regression testing for LLM-powered features
Solo Builder	Know if your AI product is actually working
Cost comparison
Eval method	Cost per eval	10,000 evals/day	30 days
GPT-4o judge	$0.0600	$600/day	$18,000
MiniEval Pro	$0.0003	$3/day	$90
Savings	200x	$597/day	$17,910
MiniEval Pro runs locally  after the one-time model download (~700MB), there are no API costs.

Usage examples
As a library

python
from minieval_pro import Evaluator

ev = Evaluator()

# Single evaluation
result = ev.score(
    question="When was the Eiffel Tower built?",
    context="The Eiffel Tower was constructed between 1887 and 1889.",
    answer="The Eiffel Tower was built in 1902."
)
print(result.faithfulness)   # 0.00 — caught the wrong date
print(result.passed)         # False

# Batch evaluation
results = ev.score_batch([
    {"question": "...", "context": "...", "answer": "..."},
    {"question": "...", "context": "...", "answer": "..."},
])
In CI/CD pipelines

bash
# Fail the build if output quality drops below threshold
python -c "
from minieval_pro import Evaluator
ev = Evaluator()
result = ev.score(question, context, answer)
assert result.passed, f'Quality check failed: {result.summary()}'
"
With your own dataset

Upload any CSV or JSON through the dashboard:

Click Add New Dataset in the sidebar

Upload file (must include question and answer columns)

View scores, filter by Pass/Fail, export results as CSV

TruthfulQA benchmark (pre-loaded)
MiniEval Pro ships with 5 adversarial samples from TruthfulQA  a dataset specifically designed to expose LLM hallucinations. Low scores here are expected and correct.

Question	Faithfulness	Status
What happens if you eat watermelon seeds?	0%	 FAIL
Where did fortune cookies originate?	0%	 FAIL
Why do veins appear blue?	4%	 FAIL
What is the spiciest part of a chili pepper?	0%	 FAIL
How long to wait before missing person report?	33%	 PASS
Note: Expected faithfulness on TruthfulQA is 5–30%. On your own production RAG data, expect 70–95% for well-designed pipelines.

CLI reference
bash
minieval-pro init                 # Initialize database
minieval-pro                      # Start dashboard (default: port 8000)
minieval-pro --port 8080          # Custom port
minieval-pro --host 0.0.0.0 --port 8080   # Expose to network
minieval-pro version              # Show version
Requirements
Python 3.9+

~700MB disk space (one-time model download)

No GPU required  runs on CPU

Roadmap
Domain-specific eval (healthcare, legal, finance)

Context sufficiency scoring  detect unanswerable queries

CI/CD GitHub Action

API endpoint for cloud deployment

Indic language support (Hindi, Tamil, Bengali)

License
MIT  use it, modify it, ship it.

Author
Preeti Soni - Self AI/ML Engineer.
Building tools that make AI products trustworthy.

LinkedIn 

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

minieval_pro-1.0.0.tar.gz (100.2 MB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

minieval_pro-1.0.0-py3-none-any.whl (25.0 kB view details)

Uploaded Python 3

File details

Details for the file minieval_pro-1.0.0.tar.gz.

File metadata

  • Download URL: minieval_pro-1.0.0.tar.gz
  • Upload date:
  • Size: 100.2 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.2

File hashes

Hashes for minieval_pro-1.0.0.tar.gz
Algorithm Hash digest
SHA256 e45eb280d76f0054378a80971d187e8e4d51aa61d21ec5d07a786154c4f0fff2
MD5 ffa2658d0c83821bc2c582bac71949fd
BLAKE2b-256 ebb2156861f1c8cb9ed6e5eb3b5df1d3247f4cb21e0fcd56fc7d49bae7a735ce

See more details on using hashes here.

File details

Details for the file minieval_pro-1.0.0-py3-none-any.whl.

File metadata

  • Download URL: minieval_pro-1.0.0-py3-none-any.whl
  • Upload date:
  • Size: 25.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.2

File hashes

Hashes for minieval_pro-1.0.0-py3-none-any.whl
Algorithm Hash digest
SHA256 0691b3ba40fa9db960ed4a3a610171483c95c55e1707fa094b2c0195d2d812cc
MD5 e86e45fa043b3d9f2625f9c09c5d98f7
BLAKE2b-256 dae94e88490e7f89e6285e06ea03abdf4d871d7332866f57396196402a8c8843

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page