Fast AI evaluator for scikit-learn models

ai-critic 🧠

The Quality Gate for Machine Learning Models

ai-critic is a specialized decision-making system designed to evaluate whether a machine learning model is safe, reliable, and trustworthy enough to be deployed in real-world environments.

Unlike traditional ML evaluation tools that focus almost exclusively on performance metrics, ai-critic operates as a Quality Gate — a final checkpoint that actively probes models to uncover hidden risks that frequently cause silent failures in production.

ai-critic does not ask “How accurate is this model?” It asks “Can this model be trusted in the real world?”


🎯 What Problem Does ai-critic Solve?

In production, most ML failures are not accuracy problems.

They are caused by:

  • Data leakage hidden inside features
  • Overfitting disguised as strong validation scores
  • Models that collapse under small noise
  • Models that rely on a single fragile signal
  • Configuration choices that look fine — but are structurally unsafe

These failures usually appear after deployment, when it is already expensive or dangerous to fix them.

ai-critic exists to catch these failures before deployment.


🚀 Getting Started (The Basics)

This section is for beginners, students, and engineers under time pressure.

If you only want a fast, conservative verdict, this is all you need.


Installation

Install directly from PyPI:

pip install ai-critic

Python ≥ 3.8 is recommended.


The Quick Verdict

With just a few lines of code, you can obtain:

  • An executive-level verdict
  • A risk classification
  • A deployment recommendation

from ai_critic import AICritic
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

# 1. Prepare data and model
X, y = make_classification(
    n_samples=1000,
    n_features=20,
    random_state=42
)

model = RandomForestClassifier(
    max_depth=5,
    random_state=42
)

# 2. Initialize the Critic
critic = AICritic(model, X, y)

# 3. Run the audit
report = critic.evaluate(view="executive")

print(f"Verdict: {report['verdict']}")
print(f"Risk Level: {report['risk_level']}")
print(f"Deploy Recommended: {report['deploy_recommended']}")
print(f"Main Reason: {report['main_reason']}")

Example Output:

Verdict: ⚠️ Risky
Risk Level: medium
Deploy Recommended: False
Main Reason: Structural, robustness, or dependency-related risks detected.

This verdict is conservative by design.

If ai-critic approves deployment, none of its independent heuristics detected a meaningful risk.


🧭 How to Read the Verdict

| Field | Meaning |
| --- | --- |
| verdict | Human-readable summary |
| risk_level | low / medium / high |
| deploy_recommended | Final gate decision |
| main_reason | Primary blocking factor |

The goal is clarity, not ambiguity.
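
Because the fields are simple, they are easy to wire into an automated check. The sketch below assumes the report is a plain dict carrying the four fields listed above. `gate_exit_code` is a hypothetical helper, not part of ai-critic, and the sample report is hand-written for illustration.

```python
# Hypothetical helper: turn an executive report into a CI exit code.
# Assumes `report` is a plain dict with the four documented fields.

def gate_exit_code(report):
    """Return 0 when deployment is recommended, 1 otherwise."""
    if report["deploy_recommended"]:
        return 0
    # Surface the blocking reason so the CI log explains the failure.
    print(f"Blocked ({report['risk_level']} risk): {report['main_reason']}")
    return 1


# Hand-written sample that mirrors the fields of the executive view.
sample_report = {
    "verdict": "Risky",
    "risk_level": "medium",
    "deploy_recommended": False,
    "main_reason": "Structural, robustness, or dependency-related risks detected.",
}

exit_code = gate_exit_code(sample_report)
```

In a pipeline, the returned value could be passed to `sys.exit()` so that a blocked model fails the build.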


💡 Understanding the Critique (Intermediate Level)

This section is for data scientists, ML engineers, and students who want to understand why the model was flagged — and how to improve it.


The Four Pillars of the Audit

ai-critic evaluates models across four independent risk dimensions.

| Pillar | What It Detects | Why It Matters |
| --- | --- | --- |
| 📊 Data Integrity | Leakage, correlations, shortcuts | Inflated performance |
| 🧠 Model Structure | Over-complexity, unsafe configs | Poor generalization |
| 📈 Performance | Suspicious CV behavior | False confidence |
| 🧪 Robustness | Noise sensitivity | Production collapse |

Each pillar produces signals, not binary judgments.

Those signals are later aggregated by the deployment gate.


📊 Data Integrity Analysis

This pillar focuses on the relationship between features and the target.

It answers questions like:

  • Are some features too predictive?
  • Are there suspicious correlations?
  • Does performance collapse when a single feature is disturbed?

These are classic symptoms of data leakage and shortcut learning.
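
One way to picture the "too predictive" check is a plain correlation scan. This is a minimal sketch using NumPy only, not ai-critic's internal logic. The function name `suspicious_features` and the 0.95 threshold are assumptions chosen for illustration.

```python
import numpy as np


def suspicious_features(X, y, threshold=0.95):
    """Return indices of features whose |Pearson r| with y exceeds threshold.

    Hypothetical helper; the threshold is an illustrative assumption.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    flagged = []
    for j in range(X.shape[1]):
        column = X[:, j]
        if column.std() == 0:  # constant feature: correlation is undefined
            continue
        r = np.corrcoef(column, y)[0, 1]
        if abs(r) > threshold:
            flagged.append(j)
    return flagged


# Synthetic data: five random features, one of which leaks the target.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)
X = rng.normal(size=(500, 5))
X[:, 3] = y + rng.normal(scale=0.01, size=500)  # simulated leaked feature

leaky = suspicious_features(X, y)
```

A linear correlation scan only catches the most blatant leaks. Non-linear shortcuts need the perturbation-based checks described later.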


🧠 Model Structure Analysis

A model can be accurate and still be unsafe.

Structural analysis looks for:

  • Excessive depth
  • Over-parameterization
  • Configuration choices that amplify variance
  • Inconsistent bias–variance tradeoffs

This is especially important for:

  • Decision trees
  • Boosting models
  • Neural networks with limited data

📈 Performance Sanity Checks

Rather than optimizing metrics, ai-critic questions them.

It checks:

  • Cross-validation stability
  • Variance across folds
  • Learning curve consistency
  • Performance under perturbations

A strong score that behaves strangely is treated as a warning, not a success.
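
Fold-to-fold stability can be sketched with scikit-learn's `cross_val_score`. The helper `cv_signals` and its two thresholds are assumptions for illustration. ai-critic's actual rules are not documented here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=42)
model = RandomForestClassifier(n_estimators=50, max_depth=5, random_state=42)

# One accuracy value per fold.
scores = cross_val_score(model, X, y, cv=5)


def cv_signals(scores, max_std=0.05, suspicious_mean=0.99):
    """Flag unstable or implausibly high CV results (illustrative thresholds)."""
    mean = float(scores.mean())
    std = float(scores.std())
    return {
        "mean": mean,
        "std": std,
        "unstable": std > max_std,                  # folds disagree too much
        "suspiciously_high": mean > suspicious_mean,  # near-perfect score
    }


signals = cv_signals(scores)
```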


🧪 Robustness Testing (Noise Injection)

Production data is never clean.

This test injects controlled noise into inputs and measures degradation.

robustness = report["details"]["robustness"]

print(f"Original CV Score: {robustness['cv_score_original']}")
print(f"Noisy CV Score: {robustness['cv_score_noisy']}")
print(f"Performance Drop: {robustness['performance_drop']}")
print(f"Verdict: {robustness['verdict']}")

Possible outcomes:

  • stable → acceptable degradation
  • fragile → high sensitivity
  • misleading → performance likely inflated
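
The idea behind the test can be reproduced with scikit-learn and NumPy alone. The sketch below is an assumption about how such a test could work, not ai-critic's implementation. It adds Gaussian noise scaled to 10% of each feature's standard deviation and re-runs cross-validation. `robustness_label` and its threshold are hypothetical, and the `misleading` outcome is not modeled.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=42)
model = RandomForestClassifier(n_estimators=50, max_depth=5, random_state=42)

# Inject noise proportional to each feature's spread.
rng = np.random.default_rng(42)
noise_scale = 0.1 * X.std(axis=0)
X_noisy = X + rng.normal(scale=noise_scale, size=X.shape)

score_original = cross_val_score(model, X, y, cv=5).mean()
score_noisy = cross_val_score(model, X_noisy, y, cv=5).mean()
drop = score_original - score_noisy


def robustness_label(drop, fragile_threshold=0.1):
    """Map a score drop to a label (illustrative threshold)."""
    return "fragile" if drop > fragile_threshold else "stable"


label = robustness_label(drop)
```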

🔍 Explainability & Feature Sensitivity

Accuracy alone hides why a model works.

The explainability module performs feature sensitivity analysis to detect:

  • Feature-level leakage
  • Over-reliance on a single signal
  • Structural shortcuts

How Explainability Works

For each feature:

  1. The feature is randomly permuted.
  2. The model is re-evaluated.
  3. Performance drop is measured.

Large drops indicate critical dependency.

This approach is:

  • Model-agnostic
  • Lightweight
  • Framework-independent
  • Interpretable by humans
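
The three steps above can be sketched in a few lines. This is a hand-rolled illustration on synthetic data, not ai-critic's own code. It scores on a held-out split, which is an assumption.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=500, n_features=8, n_informative=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

rng = np.random.default_rng(0)
baseline = model.score(X_test, y_test)

drops = []
for j in range(X_test.shape[1]):
    X_permuted = X_test.copy()
    X_permuted[:, j] = rng.permutation(X_permuted[:, j])  # 1. permute one feature
    permuted_score = model.score(X_permuted, y_test)      # 2. re-evaluate the model
    drops.append(baseline - permuted_score)               # 3. measure the drop

# Index of the feature the model depends on most.
most_important = int(np.argmax(drops))
```

scikit-learn ships a packaged version of the same technique as `sklearn.inspection.permutation_importance`, which also repeats the shuffle several times to reduce variance.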

Explainability Verdicts

| Verdict | Meaning |
| --- | --- |
| stable | Balanced feature usage |
| feature_dependency | Few features dominate |
| feature_leakage_risk | Single feature dominates |

These verdicts directly affect:

  • Deployment decision
  • Confidence score
  • Recommendations
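
To make the three verdicts concrete, here is one possible mapping from per-feature performance drops to a label. The function `explainability_verdict`, the 80% share thresholds, and the `top_k` value are all assumptions for illustration. The real cut-offs used by ai-critic are not stated in this document.

```python
def explainability_verdict(drops, leakage_share=0.8, dependency_share=0.8, top_k=3):
    """Classify feature usage from permutation drops (hypothetical thresholds)."""
    # Negative drops mean the feature did not help; treat them as zero.
    positive = sorted((max(d, 0.0) for d in drops), reverse=True)
    total = sum(positive)
    if total == 0:
        return "stable"
    # A single feature carrying most of the signal suggests leakage.
    if positive[0] / total >= leakage_share:
        return "feature_leakage_risk"
    # A handful of features carrying most of the signal suggests dependency.
    if len(positive) > top_k and sum(positive[:top_k]) / total >= dependency_share:
        return "feature_dependency"
    return "stable"


single = explainability_verdict([0.5, 0.01, 0.01, 0.01, 0.01])
few = explainability_verdict([0.2, 0.2, 0.2, 0.02, 0.02, 0.02, 0.02, 0.02])
balanced = explainability_verdict([0.1] * 10)
```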

🧠 Recommendations Engine (New)

ai-critic does not stop at “deploy or not”.

It generates actionable recommendations, such as:

  • “Reduce max_depth”
  • “Increase regularization”
  • “Likely feature leakage detected”
  • “Model shows structural overfitting”
  • “High noise sensitivity — retrain with augmentation”

These recommendations are rule-based and data-driven; they are not generated by an LLM.
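
A rule-based engine of this kind can be pictured as a list of threshold checks over the audit signals. The sketch below is hypothetical. The signal names (`tree_depth`, `top_feature_share`, `train_test_gap`) and every threshold are invented for illustration. Only the recommendation strings come from the list above.

```python
def recommend(signals):
    """Map audit signals to recommendations (hypothetical rules and thresholds)."""
    recommendations = []
    if signals.get("tree_depth", 0) > 10:
        recommendations.append("Reduce max_depth")
    if signals.get("top_feature_share", 0.0) >= 0.8:
        recommendations.append("Likely feature leakage detected")
    if signals.get("train_test_gap", 0.0) > 0.1:
        recommendations.append("Model shows structural overfitting")
    return recommendations


advice = recommend({"tree_depth": 25, "train_test_gap": 0.2})
```

Keeping each rule as an explicit condition means every recommendation can be traced back to the signal that triggered it.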


⚙️ Deployment Gate

The final decision is produced by deploy_decision().

decision = critic.deploy_decision()

print(decision["deploy"])
print(decision["risk_level"])
print(decision["confidence"])
print(decision["blocking_issues"])

Conceptually:

  • Hard blockers → deployment denied
  • Soft blockers → deployment discouraged
  • Confidence score (0–1) → heuristic trust
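
The aggregation can be sketched as follows. This is a conceptual illustration only. `deploy_gate`, the per-blocker penalty, and the minimum confidence are assumptions, and the returned keys simply mirror the ones printed above.

```python
def deploy_gate(hard_blockers, soft_blockers, soft_penalty=0.15, min_confidence=0.5):
    """Combine blockers into a decision (hypothetical scoring scheme)."""
    hard_blockers = list(hard_blockers)
    soft_blockers = list(soft_blockers)
    if hard_blockers:
        # Any hard blocker denies deployment outright.
        return {
            "deploy": False,
            "risk_level": "high",
            "confidence": 0.0,
            "blocking_issues": hard_blockers + soft_blockers,
        }
    # Each soft blocker erodes confidence; clamp the result to the 0-1 range.
    confidence = max(0.0, 1.0 - soft_penalty * len(soft_blockers))
    return {
        "deploy": confidence >= min_confidence,
        "risk_level": "medium" if soft_blockers else "low",
        "confidence": confidence,
        "blocking_issues": soft_blockers,
    }


clean = deploy_gate([], [])
discouraged = deploy_gate([], ["fragile"])
denied = deploy_gate(["feature_leakage_risk"], ["fragile"])
```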

🔄 Feedback Loop & Learning Critic

ai-critic improves over time.

Each evaluation can be stored as feedback:

  • Model config
  • Signals
  • Final outcome
  • Human override (optional)

This enables:

  • Meta-learning
  • Better future recommendations
  • Context-aware criticism
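
As a rough picture of what a stored feedback record could contain, the sketch below appends one JSON line per evaluation. The storage format, the file name, and the helpers `record_feedback` and `load_feedback` are assumptions. This document does not describe how ai-critic persists feedback.

```python
import json
import tempfile
from pathlib import Path


def record_feedback(path, model_config, signals, outcome, human_override=None):
    """Append one evaluation record as a JSON line (hypothetical format)."""
    entry = {
        "model_config": model_config,
        "signals": signals,
        "outcome": outcome,
        "human_override": human_override,
    }
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(entry) + "\n")


def load_feedback(path):
    """Read every stored record back into a list of dicts."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


# Demonstration in a temporary directory so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "feedback.jsonl"
    record_feedback(log_path, {"max_depth": 5}, {"noise_drop": 0.02}, "approved")
    record_feedback(
        log_path, {"max_depth": None}, {"noise_drop": 0.3}, "blocked",
        human_override="approved",
    )
    history = load_feedback(log_path)
```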

🧪 Session Tracking & Comparison

You can compare models over time:

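# Audit the first version under the session name "v1"
critic_v1 = AICritic(model, X, y, session="v1")
critic_v1.evaluate()

# Audit the next version under a new session name
# (in practice, pass the retrained or reconfigured model here)
critic_v2 = AICritic(model, X, y, session="v2")
critic_v2.evaluate()

# Compare the current session against the "v1" session
critic_v2.compare_with("v1")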

Use cases:

  • Regression detection
  • Risk drift
  • Governance audits
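
Regression detection between two audits boils down to comparing their reports. The sketch below works on two hand-written dicts that use the executive-view fields. `compare_reports` is a hypothetical helper and says nothing about what `compare_with` returns.

```python
# Ordering of the documented risk levels, from safest to riskiest.
RISK_ORDER = {"low": 0, "medium": 1, "high": 2}


def compare_reports(previous, current):
    """Describe how risk moved between two audits (hypothetical helper)."""
    delta = RISK_ORDER[current["risk_level"]] - RISK_ORDER[previous["risk_level"]]
    return {
        "risk_changed": delta != 0,
        "regression": delta > 0,  # the newer model is riskier
        "deploy_flipped": previous["deploy_recommended"] != current["deploy_recommended"],
    }


report_v1 = {"risk_level": "low", "deploy_recommended": True}
report_v2 = {"risk_level": "medium", "deploy_recommended": False}

drift = compare_reports(report_v1, report_v2)
```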

⚙️ Multi-Framework Support

The same API works for:

  • scikit-learn
  • PyTorch
  • TensorFlow

Adapters handle training, evaluation, and probing internally.
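
The adapter idea can be illustrated with a small interface that hides framework differences behind `fit` and `score`. Everything in this sketch (`ModelAdapter`, `SklearnStyleAdapter`, `MajorityClass`, `audit_score`) is hypothetical and only shows the pattern. It is not ai-critic's adapter API.

```python
from typing import Any, Protocol


class ModelAdapter(Protocol):
    """Minimal interface an audit would need from any framework."""

    def fit(self, X: Any, y: Any) -> None: ...

    def score(self, X: Any, y: Any) -> float: ...


class SklearnStyleAdapter:
    """Wraps any object exposing scikit-learn style fit/score methods."""

    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y):
        self.estimator.fit(X, y)

    def score(self, X, y):
        return float(self.estimator.score(X, y))


class MajorityClass:
    """Tiny stand-in estimator so the sketch runs without any ML framework."""

    def fit(self, X, y):
        self.label_ = max(set(y), key=list(y).count)
        return self

    def score(self, X, y):
        return sum(1 for label in y if label == self.label_) / len(y)


def audit_score(adapter: ModelAdapter, X, y) -> float:
    """Framework-agnostic probe: train, then evaluate through the adapter."""
    adapter.fit(X, y)
    return adapter.score(X, y)


result = audit_score(
    SklearnStyleAdapter(MajorityClass()), [[0], [1], [2], [3]], [1, 1, 1, 0]
)
```

A PyTorch or TensorFlow adapter would implement the same two methods around its own training loop, leaving the probing code unchanged.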


🧩 Design Philosophy

ai-critic is intentionally skeptical.

It assumes:

  • Metrics can lie
  • Data is imperfect
  • Models fail silently
  • Confidence must be earned

This makes it ideal as a final gate, not a tuning toy.


🛡️ What ai-critic Is NOT

  • ❌ A hyperparameter optimizer
  • ❌ A leaderboard benchmark tool
  • ❌ A replacement for domain expertise
  • ❌ A magic “approve all” system

🧠 Final Note

ai-critic is not here to make models look good. It exists to prevent bad models from looking good enough to deploy.

A failed audit does not mean your model is bad. It means your model is not yet safe to trust.

That distinction is everything.
