Fast AI evaluator for scikit-learn models
ai-critic 🧠
The Quality Gate for Machine Learning Models
ai-critic is a specialized decision-making system designed to evaluate whether a machine learning model is safe, reliable, and trustworthy enough to be deployed in real-world environments.
Unlike traditional ML evaluation tools that focus almost exclusively on performance metrics, ai-critic operates as a Quality Gate — a final checkpoint that actively probes models to uncover hidden risks that frequently cause silent failures in production.
ai-critic does not ask “How accurate is this model?” It asks “Can this model be trusted in the real world?”
🎯 What Problem Does ai-critic Solve?
In production, most ML failures are not accuracy problems.
They are caused by:
- Data leakage hidden inside features
- Overfitting disguised as strong validation scores
- Models that collapse under small noise
- Models that rely on a single fragile signal
- Configuration choices that look fine — but are structurally unsafe
These failures usually appear after deployment, when it is already expensive or dangerous to fix them.
ai-critic exists to catch these failures before deployment.
🚀 Getting Started (The Basics)
This section is written for beginners, students, and engineers under time pressure.
If you only want a fast, conservative verdict, this is all you need.
Installation
Install directly from PyPI:
pip install ai-critic
Python ≥ 3.8 is recommended.
The Quick Verdict
With just a few lines of code, you can obtain:
- An executive-level verdict
- A risk classification
- A deployment recommendation
from ai_critic import AICritic
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
# 1. Prepare data and model
X, y = make_classification(
    n_samples=1000,
    n_features=20,
    random_state=42
)
model = RandomForestClassifier(
    max_depth=5,
    random_state=42
)
# 2. Initialize the Critic
critic = AICritic(model, X, y)
# 3. Run the audit
report = critic.evaluate(view="executive")
print(f"Verdict: {report['verdict']}")
print(f"Risk Level: {report['risk_level']}")
print(f"Deploy Recommended: {report['deploy_recommended']}")
print(f"Main Reason: {report['main_reason']}")
Example Output:
Verdict: ⚠️ Risky
Risk Level: medium
Deploy Recommended: False
Main Reason: Structural, robustness, or dependency-related risks detected.
This verdict is conservative by design.
If ai-critic approves deployment, it means that none of its independent heuristics detected a meaningful risk.
🧭 How to Read the Verdict
| Field | Meaning |
|---|---|
| verdict | Human-readable summary |
| risk_level | low / medium / high |
| deploy_recommended | Final gate decision |
| main_reason | Primary blocking factor |
The goal is clarity, not ambiguity.
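Because the executive report is read with dictionary-style keys, the gate decision can be turned into a process exit code for a CI job. The sketch below is an illustration only: `gate_exit_code` is a hypothetical helper, not part of ai-critic, and the example report simply mirrors the sample output shown earlier.

```python
import sys


def gate_exit_code(report: dict) -> int:
    """Return 0 when deployment is recommended, 1 otherwise (hypothetical helper)."""
    if report["deploy_recommended"]:
        return 0
    # Surface the blocking reason so it shows up in CI logs.
    print(
        f"Blocked ({report['risk_level']} risk): {report['main_reason']}",
        file=sys.stderr,
    )
    return 1


# Example report shaped like the sample output above (values copied from it).
example_report = {
    "verdict": "⚠️ Risky",
    "risk_level": "medium",
    "deploy_recommended": False,
    "main_reason": "Structural, robustness, or dependency-related risks detected.",
}

exit_code = gate_exit_code(example_report)
print(f"Exit code: {exit_code}")
```

In a pipeline, the script would typically end with `sys.exit(exit_code)` so that a blocked model fails the build.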
💡 Understanding the Critique (Intermediate Level)
This section is for data scientists, ML engineers, and students who want to understand why the model was flagged — and how to improve it.
The Four Pillars of the Audit
ai-critic evaluates models across four independent risk dimensions.
| Pillar | What It Detects | Why It Matters |
|---|---|---|
| 📊 Data Integrity | Leakage, correlations, shortcuts | Inflated performance |
| 🧠 Model Structure | Over-complexity, unsafe configs | Poor generalization |
| 📈 Performance | Suspicious CV behavior | False confidence |
| 🧪 Robustness | Noise sensitivity | Production collapse |
Each pillar produces signals, not binary judgments.
Those signals are later aggregated by the deployment gate.
📊 Data Integrity Analysis
This pillar focuses on the relationship between features and the target.
It answers questions like:
- Are some features too predictive?
- Are there suspicious correlations?
- Does performance collapse when a single feature is disturbed?
These are classic symptoms of data leakage and shortcut learning.
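One simple way to look for features that are "too predictive" is to measure each feature's correlation with the target. The sketch below uses plain NumPy and is not ai-critic's internal check; the function name `suspicious_features` and the 0.95 threshold are assumptions chosen for illustration.

```python
import numpy as np


def suspicious_features(X, y, threshold=0.95):
    """Return (index, |corr|) for features whose absolute Pearson
    correlation with the target exceeds the (assumed) threshold."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    flagged = []
    for j in range(X.shape[1]):
        column = X[:, j]
        if column.std() == 0:
            # Constant feature: correlation is undefined, so skip it.
            continue
        corr = abs(np.corrcoef(column, y)[0, 1])
        if corr > threshold:
            flagged.append((j, float(corr)))
    return flagged


rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=500)
X_demo = rng.normal(size=(500, 4))
# Column 2 is the label plus tiny noise: a deliberately leaked feature.
X_demo[:, 2] = y_demo + rng.normal(scale=0.01, size=500)

flagged = suspicious_features(X_demo, y_demo)
print(flagged)
```

Linear correlation only catches the most obvious leaks; a feature can leak the target through a non-linear relationship and still pass this check.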
🧠 Model Structure Analysis
A model can be accurate and still be unsafe.
Structural analysis looks for:
- Excessive depth
- Over-parameterization
- Configuration choices that amplify variance
- Inconsistent bias–variance tradeoffs
This is especially important for:
- Decision trees
- Boosting models
- Neural networks with limited data
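As a rough illustration of a structural check, the sketch below fits an unconstrained random forest, reads the depth of each tree, and compares the training score with the cross-validated score. It uses only scikit-learn and is not ai-critic's own analysis; which depth or gap counts as "excessive" is left to the reader.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=20, random_state=42)

# max_depth=None lets every tree grow until its leaves are pure.
model = RandomForestClassifier(n_estimators=20, max_depth=None, random_state=42)

# Cross-validated accuracy (the model is cloned and refit per fold).
cv_score = cross_val_score(model, X, y, cv=5).mean()

# Fit on all data to inspect the structure and the training score.
model.fit(X, y)
train_score = model.score(X, y)
depths = [tree.get_depth() for tree in model.estimators_]

# A large gap between training and CV accuracy hints at overfitting.
gap = train_score - cv_score
print(f"Max tree depth: {max(depths)}")
print(f"Train accuracy: {train_score:.3f}")
print(f"CV accuracy:    {cv_score:.3f}")
print(f"Train/CV gap:   {gap:.3f}")
```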
📈 Performance Sanity Checks
Rather than optimizing metrics, ai-critic questions them.
It checks:
- Cross-validation stability
- Variance across folds
- Learning curve consistency
- Performance under perturbations
A strong score that behaves strangely is treated as a warning, not a success.
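Fold-to-fold variance is straightforward to inspect with scikit-learn. The sketch below is one possible reading of "cross-validation stability", not ai-critic's implementation; the two thresholds (0.99 and 0.05) are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=300, n_features=20, random_state=42)
model = RandomForestClassifier(n_estimators=20, max_depth=5, random_state=42)

# Shuffled, stratified folds with a fixed seed for repeatability.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)

mean_score = scores.mean()
spread = scores.std()

# Assumed, illustrative thresholds: tune them for your own problem.
too_good = mean_score > 0.99   # near-perfect scores deserve a second look
unstable = spread > 0.05       # large variation across folds

print(f"Fold scores: {scores.round(3)}")
print(f"Mean: {mean_score:.3f}, std: {spread:.3f}")
print(f"Suspiciously high: {too_good}, unstable: {unstable}")
```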
🧪 Robustness Testing (Noise Injection)
Production data is never clean.
This test injects controlled noise into inputs and measures degradation.
robustness = report["details"]["robustness"]
print(f"Original CV Score: {robustness['cv_score_original']}")
print(f"Noisy CV Score: {robustness['cv_score_noisy']}")
print(f"Performance Drop: {robustness['performance_drop']}")
print(f"Verdict: {robustness['verdict']}")
Possible outcomes:
- stable → acceptable degradation
- fragile → high sensitivity
- misleading → performance likely inflated
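The idea behind noise injection can be sketched with scikit-learn alone. The version below adds Gaussian noise, scaled to each feature's standard deviation, to the whole dataset before cross-validation. This is a simplification, and both the noise model and the `noise_scale` value are assumptions rather than ai-critic's actual settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=20, random_state=42)
model = RandomForestClassifier(n_estimators=20, max_depth=5, random_state=42)

rng = np.random.default_rng(0)
noise_scale = 0.1  # assumed: noise std as a fraction of each feature's std

# Per-feature Gaussian noise, broadcast across all rows.
X_noisy = X + rng.normal(size=X.shape) * noise_scale * X.std(axis=0)

score_original = cross_val_score(model, X, y, cv=5).mean()
score_noisy = cross_val_score(model, X_noisy, y, cv=5).mean()
performance_drop = score_original - score_noisy

print(f"Original CV score: {score_original:.3f}")
print(f"Noisy CV score:    {score_noisy:.3f}")
print(f"Performance drop:  {performance_drop:.3f}")
```

A stricter variant would train on clean data and add noise only to the held-out folds, which is closer to what happens when production inputs degrade after training.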
🔍 Explainability & Feature Sensitivity
Accuracy alone hides why a model works.
The explainability module performs feature sensitivity analysis to detect:
- Feature-level leakage
- Over-reliance on a single signal
- Structural shortcuts
How Explainability Works
For each feature:
- The feature is randomly permuted.
- The model is re-evaluated.
- Performance drop is measured.
Large drops indicate critical dependency.
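The permute, re-evaluate, and measure procedure described above is the same idea as scikit-learn's `permutation_importance`. The sketch below uses that function to illustrate the technique; it does not claim to reproduce ai-critic's explainability module, and the `top_share` summary is an assumed way of expressing how much one feature dominates.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=400, n_features=10, n_informative=3, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = RandomForestClassifier(n_estimators=30, max_depth=5, random_state=42)
model.fit(X_train, y_train)

# Shuffle each feature several times on held-out data and record the score drop.
result = permutation_importance(
    model, X_test, y_test, n_repeats=5, random_state=42
)
drops = result.importances_mean      # mean score drop per feature
ranking = drops.argsort()[::-1]      # most important feature first

# Assumed summary: share of the total positive drop owned by the top feature.
positive = np.clip(drops, 0, None)
total = positive.sum()
top_share = float(positive.max() / total) if total > 0 else 0.0

for j in ranking[:3]:
    print(f"Feature {j}: mean drop {drops[j]:.3f}")
print(f"Top feature share of total drop: {top_share:.2f}")
```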
This approach is:
- Model-agnostic
- Lightweight
- Framework-independent
- Interpretable by humans
Explainability Verdicts
| Verdict | Meaning |
|---|---|
| stable | Balanced feature usage |
| feature_dependency | Few features dominate |
| feature_leakage_risk | Single feature dominates |
These verdicts directly affect:
- Deployment decision
- Confidence score
- Recommendations
🧠 Recommendations Engine (New)
ai-critic does not stop at “deploy or not”.
It generates actionable recommendations, such as:
- “Reduce max_depth”
- “Increase regularization”
- “Likely feature leakage detected”
- “Model shows structural overfitting”
- “High noise sensitivity — retrain with augmentation”
These recommendations are rule-based and data-driven; they are not generated by an LLM.
⚙️ Deployment Gate
The final decision is produced by deploy_decision().
decision = critic.deploy_decision()
print(decision["deploy"])
print(decision["risk_level"])
print(decision["confidence"])
print(decision["blocking_issues"])
Conceptually:
- Hard blockers → deployment denied
- Soft blockers → deployment discouraged
- Confidence score (0–1) → heuristic trust
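To make the hard and soft blocker distinction concrete, here is a minimal sketch of how such signals could be combined into a decision dictionary with the same keys as above. The function `aggregate_gate` and its rules are hypothetical; in particular, treating soft blockers as medium risk with `deploy` set to `False` is an assumption, not ai-critic's documented behavior.

```python
def aggregate_gate(hard_blockers, soft_blockers, confidence):
    """Hypothetical aggregation rule, for illustration only."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")

    if hard_blockers:
        # Any hard blocker denies deployment outright.
        risk_level, deploy = "high", False
    elif soft_blockers:
        # Assumed: soft blockers discourage deployment without a hard denial.
        risk_level, deploy = "medium", False
    else:
        risk_level, deploy = "low", True

    return {
        "deploy": deploy,
        "risk_level": risk_level,
        "confidence": confidence,
        "blocking_issues": list(hard_blockers) + list(soft_blockers),
    }


decision_demo = aggregate_gate(
    hard_blockers=[],
    soft_blockers=["high noise sensitivity"],
    confidence=0.6,
)
print(decision_demo)
```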
🔄 Feedback Loop & Learning Critic
ai-critic improves over time.
Each evaluation can be stored as feedback:
- Model config
- Signals
- Final outcome
- Human override (optional)
This enables:
- Meta-learning
- Better future recommendations
- Context-aware criticism
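The storage format for feedback is not described here, so the following is only a sketch of what one stored evaluation might look like. `FeedbackRecord` and `append_feedback` are hypothetical names, and JSON Lines is an assumed format chosen because it is easy to append to and replay.

```python
import io
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class FeedbackRecord:
    """Hypothetical record holding the four items listed above."""
    model_config: dict
    signals: dict
    final_outcome: str
    human_override: Optional[str] = None  # optional, as in the list above


def append_feedback(stream, record: FeedbackRecord) -> None:
    """Append one record as a single JSON line."""
    stream.write(json.dumps(asdict(record)) + "\n")


# An in-memory buffer stands in for a log file in this example.
buffer = io.StringIO()
append_feedback(
    buffer,
    FeedbackRecord(
        model_config={"max_depth": 5, "random_state": 42},
        signals={"robustness": "fragile"},
        final_outcome="deployment_blocked",
    ),
)

stored = [json.loads(line) for line in buffer.getvalue().splitlines()]
print(stored[0])
```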
🧪 Session Tracking & Comparison
You can compare models over time:
critic_v1 = AICritic(model, X, y, session="v1")
critic_v1.evaluate()
critic_v2 = AICritic(model, X, y, session="v2")
critic_v2.evaluate()
critic_v2.compare_with("v1")
Use cases:
- Regression detection
- Risk drift
- Governance audits
⚙️ Multi-Framework Support
The same API works for:
- scikit-learn
- PyTorch
- TensorFlow
Adapters handle training, evaluation, and probing internally.
🧩 Design Philosophy
ai-critic is intentionally skeptical.
It assumes:
- Metrics can lie
- Data is imperfect
- Models fail silently
- Confidence must be earned
This makes it ideal as a final gate, not a tuning toy.
🛡️ What ai-critic Is NOT
- ❌ A hyperparameter optimizer
- ❌ A leaderboard benchmark tool
- ❌ A replacement for domain expertise
- ❌ A magic “approve all” system
🧠 Final Note
ai-critic is not here to make models look good. It exists to prevent bad models from looking good enough to deploy.
A failed audit does not mean your model is bad. It means your model is not yet safe to trust.
That distinction is everything.