Conformal Prediction Moderation & Human-in-the-Loop Routing Python Library

commCP 🛡️

License: MIT

commCP (Conformal Prediction Moderation & Human-in-the-Loop Routing) is a post-training wrapper for binary classification estimators. It combines Conformal Prediction (to enforce statistical reliability guarantees) and LLM Refereeing to decide when a prediction can be auto-accepted vs. when it should be escalated for human review.

Inspired by statistics-focused tools such as MAPIE, commCP pairs conformal reliability guarantees with LLM verification.


Features

  • Statistical Coverage Guarantees: Targets a coverage level of $1 - \alpha$ (that is, an error rate of at most $\alpha$) via conformal calibration.
  • Selective Prediction / HITL: Automatically routes predictions into auto_decided or escalated queues.
  • LLM-as-a-Referee: Mediates ensemble disagreements and conformal "gray-zone" uncertainties dynamically.
  • Cost-Optimized: Bypasses the LLM completely for obvious acceptances or low-confidence/high-risk rejections, keeping API costs to a minimum.
  • Seamless sklearn Compatibility: Works with any estimator exposing a predict_proba method (e.g., LogisticRegression, RandomForest, XGBoost).

How commCP Works under the Hood

1. What is Conformal Prediction?

Conformal Prediction is a statistical framework that wraps around any standard machine learning model and turns its scores into predictions with finite-sample, distribution-free coverage guarantees.

Instead of trusting raw model probabilities blindly, you specify a significance level alpha (the error rate you are willing to tolerate). For example, with alpha = 0.05, commCP calibrates its acceptance cutoff so that the predictions it accepts automatically target at least 95% accuracy (reported as empirical coverage). As with any conformal method, this assumes the calibration data and future data are exchangeable, for example independent draws from the same distribution.

2. The Calibration Phase

Before you make predictions, you run the calibration phase using ccp.calibrate(X_calib, y_calib) on a small held-out dataset:

  1. The model predicts probabilities for each calibration sample.
  2. The library calculates the "nonconformity score" (how surprised the model was by the true label).
  3. We sort these scores and locate the boundary, determined by alpha, beyond which the model's errors would exceed the tolerated rate.
  4. This boundary is saved as conformal_cutoff_ (e.g., 84%). Any prediction with confidence at or above the cutoff is auto-accepted as meeting the target reliability level.
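
The four steps above can be sketched with the standard split-conformal recipe. This is a minimal illustration, not commCP's confirmed internals: the function name conformal_cutoff, the score definition (1 minus the probability of the true label), and the finite-sample rank are all assumptions borrowed from the textbook method.

```python
import math


def conformal_cutoff(true_label_probs, alpha):
    """Turn calibration probabilities of the TRUE label into a confidence cutoff.

    Assumed recipe (standard split conformal, not confirmed for commCP):
    nonconformity score = 1 - p(true label); the threshold q_hat is the
    ceil((n + 1) * (1 - alpha))-th smallest score.
    """
    scores = sorted(1.0 - p for p in true_label_probs)
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # finite-sample corrected rank
    if rank > n:
        raise ValueError("calibration set too small for this alpha")
    q_hat = scores[rank - 1]
    return 1.0 - q_hat  # confidence a label needs in order to conform


# Hypothetical calibration probabilities assigned to the true label.
calib_probs = [0.99, 0.97, 0.95, 0.90, 0.85, 0.80, 0.60]
cutoff = conformal_cutoff(calib_probs, alpha=0.25)
print(f"cutoff = {cutoff:.2f}")  # prints "cutoff = 0.80"
```

With only 7 calibration samples, a stricter alpha such as 0.05 raises the error above, which is why a reasonably large calibration set matters in practice.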

3. The Routing Decision Rules

For every new test case, the library routes the prediction using three possible paths:

  • Auto-Accepted ("ACCEPT"): If the model's confidence is above the safety cutoff (e.g., $\ge$ 84%), it is highly confident and statistically safe. The prediction is accepted automatically with zero human or LLM overhead.
  • LLM Refereeing ("LLM_VERIFY" / "LLM_VERIFIED"):
    • CRITICAL REQUIREMENT: LLM refereeing needs an API key for your provider (e.g., a Groq or OpenAI key), supplied either through llm_api_key when initializing the wrapper or through the GROQ_API_KEY / OPENAI_API_KEY environment variable. If no key is available, LLM-moderation calls fail and the case goes directly to human escalation.
    • The LLM Referee is called to review the raw feature data and make an independent judgment in two scenarios:
      1. Gray-Zone Gating: The model's confidence falls just below the safety cutoff, within verify_margin of it.
      2. Ensemble Disagreement: You are using an ensemble of models (e.g. RF + SVM + LR) and the models disagree on the prediction.
    • If the LLM referee agrees with the model's choice, it is accepted as "LLM_VERIFIED".
  • Human Review ("HUMAN_REVIEW"): Escalated to a human queue if:
    • The model's confidence is extremely low (below the gray zone).
    • The LLM referee disagrees with the model's prediction (flagging a potential model error).
    • The LLM call fails due to connection issues or missing API credentials.
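
For a single model, the rules above reduce to a small decision function. The sketch below is illustrative only: route_prediction is a hypothetical helper, not part of the commCP API, and treating the lower edge of the gray zone as inclusive is an assumption. The referee's verdict is passed in as llm_agrees (True, False, or None when the call failed or no key was available).

```python
from typing import Optional


def route_prediction(confidence: float, cutoff: float, verify_margin: float,
                     llm_agrees: Optional[bool] = None) -> str:
    """Return the route label for one prediction (single-model case)."""
    if confidence >= cutoff:
        # At or above the conformal cutoff: no LLM or human involved.
        return "ACCEPT"
    if confidence >= cutoff - verify_margin:
        # Gray zone: accepted only if the referee explicitly agrees.
        return "LLM_VERIFIED" if llm_agrees is True else "HUMAN_REVIEW"
    # Below the gray zone: escalate without calling the LLM.
    return "HUMAN_REVIEW"


cutoff, margin = 0.84, 0.15
print(route_prediction(0.91, cutoff, margin))                    # ACCEPT
print(route_prediction(0.80, cutoff, margin, llm_agrees=True))   # LLM_VERIFIED
print(route_prediction(0.80, cutoff, margin, llm_agrees=False))  # HUMAN_REVIEW
print(route_prediction(0.80, cutoff, margin, llm_agrees=None))   # HUMAN_REVIEW
print(route_prediction(0.50, cutoff, margin, llm_agrees=True))   # HUMAN_REVIEW
```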

4. The Benefit of Human-in-the-Loop (HITL) Automation

Running business processes completely manually is slow and expensive. Running them fully automatically is risky because machine learning models are occasionally confidently wrong.

commCP addresses this by automating the easy, confident cases (typically 70% to 95% of the workload) and leaving only the uncertain, borderline cases (5% to 30%) for a human expert to verify.

This shrinks the reviewer's workload and operating cost while keeping the accuracy of automated decisions at the calibrated target.


Installation

# Install from PyPI
pip install commcp
# Or, from a local source checkout
pip install .

How to Prepare Your Data

Conformal prediction requires a held-out calibration set that the model was not exposed to during training. Before wrapping your model, split your dataset into three distinct partitions:

  1. Training Set (e.g., 60% of data): Used to train your base classifier (e.g. RandomForest).
  2. Calibration Set (e.g., 20% of data): Used by commCP to calculate safety thresholds. Crucial: Calibrating on the training set violates statistical guarantees.
  3. Test Set (e.g., 20% of data): Used for incoming predictions and routing.
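
A 60/20/20 split can be sketched as follows. The helper three_way_split is hypothetical and uses only the standard library so that it runs anywhere; with scikit-learn the same result is usually obtained by calling train_test_split twice. It returns index lists that you would then use to slice your features and labels.

```python
import random


def three_way_split(n_samples, train_frac=0.6, calib_frac=0.2, seed=0):
    """Shuffle sample indices and cut them into train / calibration / test."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_train = round(n_samples * train_frac)
    n_calib = round(n_samples * calib_frac)
    train_idx = indices[:n_train]
    calib_idx = indices[n_train:n_train + n_calib]
    test_idx = indices[n_train + n_calib:]  # whatever remains
    return train_idx, calib_idx, test_idx


train_idx, calib_idx, test_idx = three_way_split(100)
print(len(train_idx), len(calib_idx), len(test_idx))  # 60 20 20
```

Because the three index lists are disjoint, the calibration samples are never seen during training, which is the property the conformal guarantee depends on.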

Quick Start Guide

1. Train Your Classifier

from sklearn.ensemble import RandomForestClassifier
from commcp import CommCP

# Train a standard sklearn classifier
model = RandomForestClassifier()
model.fit(X_train, y_train)

2. Wrap and Calibrate commCP

Initialize CommCP with your trained estimator, along with a task description and class labels (which are required to build high-accuracy prompts for the LLM Referee). Pass a held-out calibration set to establish the conformal threshold.

# Initialize commcp wrapper (configured for significance level alpha=0.05 -> 95% coverage)
ccp = CommCP(
    estimator=model,
    task_description="Predict whether a patient has heart disease based on clinical features",
    class_labels={0: "Healthy", 1: "Heart Disease Present"},
    alpha=0.05,
    llm_provider="groq",          # Supports "groq" or "openai"
    llm_api_key=None,             # Optional: API key string (defaults to GROQ_API_KEY/OPENAI_API_KEY env vars)
    llm_model=None,               # Optional: override default model (e.g. "llama-3.3-70b-versatile")
    base_url=None,                # Optional: specify custom API endpoint URL
    verify_margin=0.15,           # Trigger LLM verification when confidence is up to 0.15 below the cutoff
    consensus_threshold=1.0       # Threshold for ensemble consensus (1.0 = trigger LLM on any disagreement)
)

# Calibrate
ccp.calibrate(X_calib, y_calib)

Constructor Parameters

When initializing CommCP, you can customize the configuration using the following parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| estimator | BaseEstimator \| list | Required | A pre-trained scikit-learn compatible binary classifier (exposing predict_proba) or a list of classifiers for ensemble gating. |
| task_description | str | Required | A plain-text description of the prediction task (e.g. "Predict if credit application will default"). Used by the LLM Referee to understand context. |
| class_labels | dict | Required | A dictionary mapping classes to labels, e.g., {0: "Repay", 1: "Default"}. Used by the LLM Referee for high-accuracy prompting. |
| alpha | float | 0.05 | Conformal significance level (tolerated error rate). 1 - alpha is the target coverage (e.g., alpha = 0.05 targets >= 95% accuracy on automated decisions). |
| llm_provider | str | "groq" | The LLM provider API client. Supports "groq" or "openai". |
| llm_api_key | str \| None | None | The API key for the chosen LLM provider. If None (default), the library looks for the GROQ_API_KEY or OPENAI_API_KEY environment variables. If no key is found, LLM calls fail gracefully and default to human escalation. |
| llm_model | str \| None | None | Name of the model to use. If None (default), uses "llama-3.3-70b-versatile" for Groq and "gpt-4o-mini" for OpenAI. |
| base_url | str \| None | None | API endpoint URL for custom or self-hosted LLM endpoints. |
| verify_margin | float | 0.15 | The width of the band below the conformal cutoff in which the LLM referee is called (e.g., with a cutoff of 0.90 and a margin of 0.15, predictions with confidence between 0.75 and 0.90 go to LLM verification). |
| consensus_threshold | float | 1.0 | In ensemble gating, the fraction of models that must agree. A value of 1.0 triggers LLM refereeing on any disagreement (i.e., anything short of 100% consensus). |
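
The key lookup described for llm_api_key can be pictured as below. This is a sketch under assumptions: resolve_api_key is a hypothetical helper, not a commCP function, and the mapping of each provider to its own environment variable is inferred from the parameter descriptions rather than confirmed.

```python
import os

# Assumed provider-to-variable mapping, inferred from the parameter table.
ENV_VAR_BY_PROVIDER = {"groq": "GROQ_API_KEY", "openai": "OPENAI_API_KEY"}


def resolve_api_key(provider, explicit_key=None, environ=None):
    """Return the key to use, or None if no key can be found."""
    if provider not in ENV_VAR_BY_PROVIDER:
        raise ValueError(f"unsupported provider: {provider!r}")
    if explicit_key:
        return explicit_key  # an explicitly passed key takes precedence
    environ = os.environ if environ is None else environ
    return environ.get(ENV_VAR_BY_PROVIDER[provider])


# A plain dict stands in for os.environ so the example has no side effects.
fake_env = {"GROQ_API_KEY": "gsk-example"}
print(resolve_api_key("groq", environ=fake_env))
print(resolve_api_key("openai", environ=fake_env))
```

A None result corresponds to the documented fallback: LLM calls fail gracefully and the affected cases are escalated to human review.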

3. Predict & Moderate

Predict outcomes for test data. CommCP will execute conformal gating, query the LLM referee on borderline cases, and partition predictions.

# Run predictions
results = ccp.predict(
    X_test, 
    text_dossiers=text_descriptions # Optional natural language dossiers for LLM inspection
)

# Get automation and routing results
print(f"Automation rate: {results.automation_rate:.2%}")

# Access lists of auto-decided and escalated records
auto_cases = results.auto_decided  # list of dicts
human_queue = results.escalated    # list of dicts

Understanding the Results Structure

Each record inside results.auto_decided and results.escalated is a dictionary with the following schema:

  • sample_index (int): The index of the sample in the test dataset.
  • model_prediction (int): The raw output prediction of your base classifier.
  • confidence (float): The probability score assigned to the predicted class by the model.
  • route (str): The final decision route. It will be:
    • "ACCEPT": Auto-accepted directly by Conformal Prediction (high confidence).
    • "LLM_VERIFIED": Evaluated by the LLM Referee (due to borderline confidence or ensemble disagreement) and the LLM agreed with the classifier.
    • "HUMAN_REVIEW": Escalated to a human reviewer (due to low confidence or LLM disagreement).
  • route_reason (str): A detailed description explaining why this route was selected.
  • llm_prediction (int or None): The decision made by the LLM Referee (0 or 1 if called, otherwise None).
  • llm_reasoning (str or None): The short reasoning sentence written by the LLM Referee (if called).
  • final_prediction (int or None): The final automated prediction value if automated. If escalated to a human, this is None (requiring human resolution).
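
Working with these records needs nothing beyond plain dictionaries. The sketch below uses hand-written records that follow the schema above; the values, including the route_reason strings, are made up for illustration and are not output from commCP.

```python
from collections import Counter

# Illustrative records only; real ones come from results.auto_decided
# and results.escalated.
records = [
    {"sample_index": 0, "model_prediction": 1, "confidence": 0.97,
     "route": "ACCEPT", "route_reason": "confidence above cutoff",
     "llm_prediction": None, "llm_reasoning": None, "final_prediction": 1},
    {"sample_index": 1, "model_prediction": 0, "confidence": 0.78,
     "route": "LLM_VERIFIED", "route_reason": "gray zone, referee agreed",
     "llm_prediction": 0, "llm_reasoning": "Features look low risk.",
     "final_prediction": 0},
    {"sample_index": 2, "model_prediction": 1, "confidence": 0.55,
     "route": "HUMAN_REVIEW", "route_reason": "confidence below gray zone",
     "llm_prediction": None, "llm_reasoning": None, "final_prediction": None},
]

# How many samples took each route.
route_counts = Counter(record["route"] for record in records)

# Samples still waiting for a human decision have no final prediction.
pending = [r["sample_index"] for r in records if r["final_prediction"] is None]

print(dict(route_counts))
print("needs human review:", pending)
```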

How Statistics are Calculated

When you call results.stats(y_true=y_test), the library calculates the following metrics under the hood:

  1. Automation Rate:
    Automation Rate = (ACCEPT Cases + LLM_VERIFIED Cases) / Total Samples
    
  2. Empirical Conformal Coverage: The accuracy of the automated decisions against the true labels. Conformal calibration targets a value >= 1 - alpha, provided the calibration and test data are exchangeable.
  3. CommCP Wrapped System Accuracy:
    CommCP Wrapped System Accuracy = (Correct Automated Predictions + Total Human Escalated Cases) / Total Samples
    
    (Note: This calculation assumes the human reviewer acts as a ground-truth oracle and corrects any escalated case to the right label).
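
The three formulas can be reproduced by hand from the result records. The function summarize below is a hypothetical re-implementation for illustration, not the library's stats method, and the records carry only the fields the formulas need.

```python
def summarize(records, y_true):
    """Compute the three metrics from route records and true labels."""
    automated = [r for r in records if r["route"] in ("ACCEPT", "LLM_VERIFIED")]
    escalated = [r for r in records if r["route"] == "HUMAN_REVIEW"]
    correct = sum(r["final_prediction"] == y_true[r["sample_index"]]
                  for r in automated)
    total = len(records)
    return {
        "automation_rate": len(automated) / total,
        # Accuracy of automated decisions only; undefined if nothing was automated.
        "empirical_coverage": correct / len(automated) if automated else float("nan"),
        # Escalated cases count as correct (human acts as an oracle).
        "wrapped_accuracy": (correct + len(escalated)) / total,
    }


y_true = [1, 0, 1, 1, 0]
records = [
    {"sample_index": 0, "route": "ACCEPT", "final_prediction": 1},
    {"sample_index": 1, "route": "ACCEPT", "final_prediction": 0},
    {"sample_index": 2, "route": "ACCEPT", "final_prediction": 0},  # wrong
    {"sample_index": 3, "route": "LLM_VERIFIED", "final_prediction": 1},
    {"sample_index": 4, "route": "HUMAN_REVIEW", "final_prediction": None},
]
stats = summarize(records, y_true)
print(stats)
```

Here 4 of 5 samples are automated (automation rate 0.8), 3 of those 4 are correct (coverage 0.75), and the wrapped accuracy is (3 + 1) / 5 = 0.8.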

4. Evaluate Guarantees

Verify if your target mathematical coverage guarantee was met:

empirical_coverage = results.coverage(y_test)
print(f"Empirical Coverage: {empirical_coverage:.2%}") # Target: >= 95% for alpha=0.05

Examine system performance details:

print(results.stats(y_test))

Customizing Gating Logic

CommCP dynamically adjusts its gating based on your model architecture:

  • Single Models: Uses Gray-Zone Gating. Calls the LLM referee only when confidence is close to, but below, the conformal cutoff.
  • Ensembles: Uses Disagreement Gating. Automatically inspects ensemble consensus and calls the LLM to referee conflicting model predictions.
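
Disagreement gating can be pictured as a majority-vote check against consensus_threshold. The helpers below are hypothetical and only illustrate the documented meaning of the parameter (the fraction of models that must agree); how commCP aggregates ensemble votes internally is not specified here.

```python
from collections import Counter


def consensus_fraction(votes):
    """Return (majority label, share of models voting for it)."""
    label, count = Counter(votes).most_common(1)[0]
    return label, count / len(votes)


def needs_referee(votes, consensus_threshold=1.0):
    """True when agreement falls short of the required fraction."""
    _, fraction = consensus_fraction(votes)
    return fraction < consensus_threshold


# Votes from three hypothetical models, e.g. RF, SVM and LR.
print(needs_referee([1, 1, 1]))                           # False: unanimous
print(needs_referee([1, 1, 0]))                           # True: 2/3 < 1.0
print(needs_referee([1, 1, 0], consensus_threshold=0.6))  # False: 2/3 >= 0.6
```

Lowering consensus_threshold therefore trades fewer LLM calls for a looser definition of agreement.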

License

Licensed under the MIT License.
