Fast AI evaluator for scikit-learn models

ai-critic 🧠

The Quality Gate for Machine Learning Models

ai-critic is a specialized decision-making system designed to evaluate whether a machine learning model is safe, reliable, and trustworthy enough to be deployed in real-world environments.

Unlike traditional ML evaluation tools that focus almost exclusively on performance metrics, ai-critic operates as a Quality Gate — a final checkpoint that actively probes models to uncover hidden risks that frequently cause silent failures in production.

ai-critic does not ask “How accurate is this model?” It asks “Can this model be trusted in the real world?”

🎯 What Problem Does ai-critic Solve?

In production, many ML failures are not accuracy problems.

They are caused by:

Data leakage hidden inside features
Overfitting disguised as strong validation scores
Models that collapse under small noise
Models that rely on a single fragile signal
Configuration choices that look fine — but are structurally unsafe

These failures usually appear after deployment, when it is already expensive or dangerous to fix them.

ai-critic exists to catch these failures before deployment.

🚀 Getting Started (The Basics)

This section is written for beginners, students, and engineers under time pressure.

If you only want a fast, conservative verdict, this is all you need.

Installation

Install directly from PyPI:

pip install ai-critic

Python ≥ 3.8 is recommended.

The Quick Verdict

With just a few lines of code, you can obtain:

An executive-level verdict
A risk classification
A deployment recommendation

from ai_critic import AICritic
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

# 1. Prepare data and model
X, y = make_classification(
    n_samples=1000,
    n_features=20,
    random_state=42
)

model = RandomForestClassifier(
    max_depth=5,
    random_state=42
)

# 2. Initialize the Critic
critic = AICritic(model, X, y)

# 3. Run the audit
report = critic.evaluate(view="executive")

print(f"Verdict: {report['verdict']}")
print(f"Risk Level: {report['risk_level']}")
print(f"Deploy Recommended: {report['deploy_recommended']}")
print(f"Main Reason: {report['main_reason']}")

Example Output:

Verdict: ⚠️ Risky
Risk Level: medium
Deploy Recommended: False
Main Reason: Structural, robustness, or dependency-related risks detected.

This verdict is conservative by design.

If ai-critic approves deployment, none of its independent heuristics detected a meaningful risk.

🧭 How to Read the Verdict

| Field | Meaning |
| --- | --- |
| `verdict` | Human-readable summary |
| `risk_level` | low / medium / high |
| `deploy_recommended` | Final gate decision |
| `main_reason` | Primary blocking factor |
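
The four fields above are enough to wire the verdict into a CI step. The sketch below is one possible way to do that, not part of ai-critic itself: `gate_exit_code` is a hypothetical helper, and `example_report` is a hand-written stand-in for the dict returned by `critic.evaluate(view="executive")`.

```python
import sys


def gate_exit_code(report: dict) -> int:
    """Return 0 when the executive report allows deployment, 1 otherwise.

    Hypothetical helper: it only reads the four fields documented above.
    """
    if report["deploy_recommended"]:
        return 0
    print(
        f"Blocked ({report['risk_level']} risk): {report['main_reason']}",
        file=sys.stderr,
    )
    return 1


# Stand-in for the dict returned by critic.evaluate(view="executive").
example_report = {
    "verdict": "Risky",
    "risk_level": "medium",
    "deploy_recommended": False,
    "main_reason": "Structural, robustness, or dependency-related risks detected.",
}

exit_code = gate_exit_code(example_report)
# In a CI script, finish with: sys.exit(exit_code)
```

A non-zero exit code fails the pipeline step, so a blocked model cannot be promoted by accident.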

The goal is clarity, not ambiguity.

💡 Understanding the Critique (Intermediate Level)

This section is for data scientists, ML engineers, and students who want to understand why the model was flagged — and how to improve it.

The Four Pillars of the Audit

ai-critic evaluates models across four independent risk dimensions.

| Pillar | What It Detects | Why It Matters |
| --- | --- | --- |
| 📊 Data Integrity | Leakage, correlations, shortcuts | Inflated performance |
| 🧠 Model Structure | Over-complexity, unsafe configs | Poor generalization |
| 📈 Performance | Suspicious CV behavior | False confidence |
| 🧪 Robustness | Noise sensitivity | Production collapse |

Each pillar produces signals, not binary judgments.

Those signals are later aggregated by the deployment gate.

📊 Data Integrity Analysis

This pillar focuses on the relationship between features and the target.

It answers questions like:

Are some features too predictive?
Are there suspicious correlations?
Does performance collapse when a single feature is disturbed?

These are classic symptoms of data leakage and shortcut learning.
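
To make the "too predictive" question concrete, here is a minimal sketch of a correlation screen. It is an illustration under stated assumptions, not ai-critic's internal check: the `suspicious_features` helper and the 0.95 threshold are hypothetical, and Pearson correlation only catches linear leakage.

```python
import numpy as np


def suspicious_features(X, y, threshold=0.95):
    """Return indices of features whose |Pearson r| with y exceeds threshold."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    flagged = []
    for j in range(X.shape[1]):
        col = X[:, j]
        if col.std() == 0:  # constant column: correlation is undefined
            continue
        r = np.corrcoef(col, y)[0, 1]
        if abs(r) > threshold:
            flagged.append(j)
    return flagged


# Synthetic data: four random features, then feature 2 is made to leak the target.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)
X = rng.normal(size=(500, 4))
X[:, 2] = y + rng.normal(scale=0.01, size=500)

flagged = suspicious_features(X, y)
print(flagged)
```

A flagged feature is a prompt to ask where the column comes from, not proof of leakage on its own.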

🧠 Model Structure Analysis

A model can be accurate and still be unsafe.

Structural analysis looks for:

Excessive depth
Over-parameterization
Configuration choices that amplify variance
Inconsistent bias–variance tradeoffs

This is especially important for:

Decision trees
Boosting models
Neural networks with limited data
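
As an illustration of a structural check, the sketch below inspects the depth of fitted trees in a scikit-learn forest. The `depth_report` helper and the `log2(n_samples)` depth budget are assumptions made for this example; the rule ai-critic actually applies may differ.

```python
import math

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def depth_report(forest, n_samples, slack=1.0):
    """Compare fitted tree depths with a log2(n_samples) budget (hypothetical rule)."""
    depths = [tree.get_depth() for tree in forest.estimators_]
    budget = slack * math.log2(n_samples)
    return {
        "max_tree_depth": max(depths),
        "depth_budget": round(budget, 2),
        "too_deep": max(depths) > budget,
    }


X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# A depth-limited forest stays well inside the budget of log2(300), about 8.2.
shallow = RandomForestClassifier(n_estimators=20, max_depth=3, random_state=0).fit(X, y)
shallow_report = depth_report(shallow, n_samples=len(X))

# An unconstrained forest is free to grow until its leaves are pure.
deep = RandomForestClassifier(n_estimators=20, max_depth=None, random_state=0).fit(X, y)
deep_report = depth_report(deep, n_samples=len(X))

print(shallow_report)
print(deep_report)
```

The budget is a rough proxy: a balanced binary tree needs about `log2(n_samples)` levels to isolate every sample, so trees much deeper than that are likely memorizing.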

📈 Performance Sanity Checks

Rather than optimizing metrics, ai-critic questions them.

It checks:

Cross-validation stability
Variance across folds
Learning curve consistency
Performance under perturbations

A strong score that behaves strangely is treated as a warning, not a success.
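
The idea of questioning a score can be sketched with scikit-learn's `cross_val_score`. The `cv_sanity` helper and both thresholds (`max_std`, `too_good`) are hypothetical values chosen for this example, not ai-critic's defaults.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier


def cv_sanity(model, X, y, cv=5, max_std=0.05, too_good=0.995):
    """Summarize fold scores and flag unstable or implausibly high results."""
    scores = cross_val_score(model, X, y, cv=cv)
    return {
        "mean": float(scores.mean()),
        "std": float(scores.std()),
        "unstable": bool(scores.std() > max_std),
        "suspiciously_high": bool(scores.mean() >= too_good),
    }


X, y = make_classification(n_samples=300, n_features=10, random_state=0)
model = DecisionTreeClassifier(max_depth=3, random_state=0)

clean = cv_sanity(model, X, y)

# Appending the target as an extra column simulates a leaked feature.
X_leaky = np.column_stack([X, y])
leaky = cv_sanity(model, X_leaky, y)

print(clean)
print(leaky)
```

On the leaky data every fold is classified perfectly, which is exactly the kind of strong score that deserves suspicion rather than celebration.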

🧪 Robustness Testing (Noise Injection)

Production data is never clean.

This test injects controlled noise into inputs and measures degradation.

robustness = report["details"]["robustness"]

print(f"Original CV Score: {robustness['cv_score_original']}")
print(f"Noisy CV Score: {robustness['cv_score_noisy']}")
print(f"Performance Drop: {robustness['performance_drop']}")
print(f"Verdict: {robustness['verdict']}")

Possible outcomes:

stable → acceptable degradation
fragile → high sensitivity
misleading → performance likely inflated
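
For intuition, a noise-injection test can be sketched in a few lines. This is an assumption about the general technique, not ai-critic's implementation: the `noise_robustness` helper, the Gaussian noise model, and the `noise_scale` value are hypothetical, and only the result keys are borrowed from the report fields shown above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def noise_robustness(model, X, y, noise_scale=0.1, cv=5, seed=0):
    """Compare cross-validated scores on clean and noise-perturbed inputs."""
    rng = np.random.default_rng(seed)
    # Scale the noise by each feature's standard deviation so that
    # wide-range features are perturbed as strongly as narrow ones.
    noise = rng.normal(size=X.shape) * X.std(axis=0) * noise_scale
    original = float(cross_val_score(model, X, y, cv=cv).mean())
    noisy = float(cross_val_score(model, X + noise, y, cv=cv).mean())
    return {
        "cv_score_original": original,
        "cv_score_noisy": noisy,
        "performance_drop": original - noisy,
    }


X, y = make_classification(n_samples=400, n_features=20, random_state=42)
result = noise_robustness(LogisticRegression(max_iter=1000), X, y)
print(result)
```

Mapping the drop to `stable`, `fragile`, or `misleading` requires thresholds, which this sketch deliberately leaves out.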

🔍 Explainability & Feature Sensitivity

Accuracy alone hides why a model works.

The explainability module performs feature sensitivity analysis to detect:

Feature-level leakage
Over-reliance on a single signal
Structural shortcuts

How Explainability Works

For each feature:

The feature is randomly permuted.
The model is re-evaluated.
Performance drop is measured.

Large drops indicate critical dependency.

This approach is:

Model-agnostic
Lightweight
Framework-independent
Interpretable by humans
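
The permutation procedure described above can be sketched as follows. This version scores an already fitted model on a held-out split, which is an assumption; ai-critic may re-evaluate differently. `permutation_drops` is a hypothetical helper, and scikit-learn's `sklearn.inspection.permutation_importance` is a ready-made alternative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def permutation_drops(model, X_val, y_val, seed=0):
    """Return the score drop caused by shuffling each feature in turn."""
    rng = np.random.default_rng(seed)
    baseline = model.score(X_val, y_val)
    drops = []
    for j in range(X_val.shape[1]):
        X_perm = X_val.copy()
        # Shuffling one column breaks its link to the target
        # while keeping its marginal distribution intact.
        X_perm[:, j] = rng.permutation(X_perm[:, j])
        drops.append(baseline - model.score(X_perm, y_val))
    return np.array(drops)


X, y = make_classification(n_samples=500, n_features=6, random_state=0)
X[:, 0] = y  # overwrite feature 0 with the target to simulate leakage

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0
)
model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

drops = permutation_drops(model, X_val, y_val)
print(drops.round(3))
```

Here the model leans entirely on the leaked column, so shuffling it is the only permutation that hurts the score. That pattern corresponds to a single dominating feature.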

Explainability Verdicts

| Verdict | Meaning |
| --- | --- |
| `stable` | Balanced feature usage |
| `feature_dependency` | Few features dominate |
| `feature_leakage_risk` | Single feature dominates |

These verdicts directly affect:

Deployment decision
Confidence score
Recommendations

🧠 Recommendations Engine (New)

ai-critic does not stop at “deploy or not”.

It generates actionable recommendations, such as:

“Reduce max_depth”
“Increase regularization”
“Likely feature leakage detected”
“Model shows structural overfitting”
“High noise sensitivity — retrain with augmentation”

These recommendations are rule-based and data-driven; they are not generated by an LLM.
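
To show what "rule-based" can mean in practice, here is a minimal sketch that maps measured signals to messages like the ones listed above. The `recommend` function, the signal names, and every threshold are hypothetical; ai-critic's actual rules are not documented here.

```python
from typing import Dict, List, Optional


def recommend(signals: Dict[str, Optional[float]]) -> List[str]:
    """Translate audit signals into recommendations using fixed rules."""
    recs: List[str] = []

    # An unlimited (None) or very large max_depth invites overfitting.
    if "max_depth" in signals:
        depth = signals["max_depth"]
        if depth is None or depth > 10:
            recs.append("Reduce max_depth")

    # A large score drop under input noise signals fragility.
    drop = signals.get("performance_drop")
    if drop is not None and drop > 0.10:
        recs.append("High noise sensitivity: retrain with augmentation")

    # One feature carrying most of the score suggests leakage.
    top = signals.get("top_feature_drop")
    if top is not None and top > 0.30:
        recs.append("Likely feature leakage detected")

    return recs


recommendations = recommend(
    {"max_depth": None, "performance_drop": 0.18, "top_feature_drop": 0.05}
)
print(recommendations)
```

Because each message traces back to one rule and one measured value, a reviewer can check why it was issued.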

⚙️ Deployment Gate

The final decision is produced by deploy_decision().

decision = critic.deploy_decision()

print(decision["deploy"])
print(decision["risk_level"])
print(decision["confidence"])
print(decision["blocking_issues"])

Conceptually:

Hard blockers → deployment denied
Soft blockers → deployment discouraged
Confidence score (0–1) → heuristic trust
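
The conceptual rules above can be sketched as a small aggregation function. This is an illustration only: `sketch_deploy_decision` is a hypothetical name, the penalty weights are invented for the example, and treating soft blockers as "do not deploy, medium risk" is an assumption based on the example output earlier in this README.

```python
from typing import Dict, List


def sketch_deploy_decision(hard_blockers: List[str], soft_blockers: List[str]) -> Dict:
    """Aggregate blockers into a gate decision (hypothetical weights)."""
    if hard_blockers:
        deploy, risk_level = False, "high"    # hard blockers deny deployment
    elif soft_blockers:
        deploy, risk_level = False, "medium"  # soft blockers discourage it
    else:
        deploy, risk_level = True, "low"

    # Heuristic trust in [0, 1]: start at 1 and subtract a penalty per issue.
    penalty = 0.4 * len(hard_blockers) + 0.15 * len(soft_blockers)
    confidence = max(0.0, 1.0 - penalty)

    return {
        "deploy": deploy,
        "risk_level": risk_level,
        "confidence": confidence,
        "blocking_issues": hard_blockers + soft_blockers,
    }


clean_decision = sketch_deploy_decision([], [])
soft_decision = sketch_deploy_decision([], ["fragile", "feature_dependency"])
hard_decision = sketch_deploy_decision(["feature_leakage_risk"], ["fragile"])
print(soft_decision)
```

The returned keys mirror the ones read from `critic.deploy_decision()` above, so the sketch can stand in for the real call when prototyping a pipeline.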

🔄 Feedback Loop & Learning Critic

ai-critic improves over time.

Each evaluation can be stored as feedback:

Model config
Signals
Final outcome
Human override (optional)

This enables:

Meta-learning
Better future recommendations
Context-aware criticism
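
One simple way to persist such feedback is an append-only JSON Lines file. The sketch below is an assumption about a possible storage format, not ai-critic's own: `FeedbackRecord`, `append_feedback`, and `load_feedback` are hypothetical names, and the fields follow the four items listed above.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import List, Optional


@dataclass
class FeedbackRecord:
    model_config: dict
    signals: dict
    outcome: str
    human_override: Optional[str] = None  # optional reviewer decision


def append_feedback(path: Path, record: FeedbackRecord) -> None:
    """Append one evaluation as a single JSON line."""
    with Path(path).open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(asdict(record)) + "\n")


def load_feedback(path: Path) -> List[FeedbackRecord]:
    """Read all stored evaluations back into records."""
    with Path(path).open(encoding="utf-8") as fh:
        return [FeedbackRecord(**json.loads(line)) for line in fh if line.strip()]


# Round trip through a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "feedback.jsonl"
    record = FeedbackRecord(
        model_config={"max_depth": 5},
        signals={"performance_drop": 0.04},
        outcome="approved",
    )
    append_feedback(log_path, record)
    loaded = load_feedback(log_path)

print(loaded)
```

An append-only log keeps every past verdict available for later analysis without rewriting earlier entries.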

🧪 Session Tracking & Comparison

You can compare models over time:

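# Audit the first version of the model under the session name "v1"
critic_v1 = AICritic(model, X, y, session="v1")
critic_v1.evaluate()

# Audit the retrained or reconfigured model under the session name "v2"
# (the same `model` variable is reused here only for brevity)
critic_v2 = AICritic(model, X, y, session="v2")
critic_v2.evaluate()

# Compare the current session against the earlier "v1" session
critic_v2.compare_with("v1")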

Use cases:

Regression detection
Risk drift
Governance audits

⚙️ Multi-Framework Support

The same API works for:

scikit-learn
PyTorch
TensorFlow

Adapters handle training, evaluation, and probing internally.

🧩 Design Philosophy

ai-critic is intentionally skeptical.

It assumes:

Metrics can lie
Data is imperfect
Models fail silently
Confidence must be earned

This makes it ideal as a final gate, not a tuning toy.

🛡️ What ai-critic Is NOT

❌ A hyperparameter optimizer
❌ A leaderboard benchmark tool
❌ A replacement for domain expertise
❌ A magic “approve all” system

🧠 Final Note

ai-critic is not here to make models look good. It exists to prevent bad models from looking good enough to deploy.

A failed audit does not mean your model is bad. It means your model is not yet safe to trust.

That distinction is everything.



ai-critic 2.0.0
