Slice-based model evaluation for ML engineers. Find where your model fails before production does.

sliceval

Your model's global metric is lying to you.

Slice-based evaluation for ML models. Find hidden failures. Ship with confidence.

Your dashboard says F1 = 0.91. You ship.
Three months later, a production line keeps missing failures.

The 200-sample subgroup that matters? It was at F1 = 0.41.
The global metric never moved.

Quick Start · The Problem · How It Works · API Reference · Use Cases

⚡ Quick Start

pip install sliceval

from sliceval import SliceEvaluator

# 1. Wrap your trained model + test data
ev = SliceEvaluator(model, X_test, y_test)

# 2. Define slices you care about
ev.add_slice('sensor_b', X_test['sensor_type'] == 'B')
ev.add_slice('night_shift', lambda X: X['hour'] < 6)

# 3. Auto-discover slices you didn't think of
ev.discover_slices()

# 4. Evaluate — model.predict() called once, all slices reuse cached predictions
report = ev.evaluate()

# 5. See the truth
print(report.worst_slices())

Every metric includes a confidence interval, sample count, delta from global, and a significance test. No ambiguity.

🔍 The Problem

Standard ML evaluation computes one number across the entire test set. That number is a weighted average where majority subgroups dominate and minority subgroups disappear.
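To make the weighted-average effect concrete, here is a small arithmetic sketch. The group sizes and accuracies are illustrative assumptions, not measurements. It uses accuracy rather than F1 because accuracy decomposes exactly into a sample-weighted average.

```python
# Illustrative numbers only: group sizes and accuracies are assumptions.
groups = {
    "majority": {"n": 9800, "accuracy": 0.92},
    "minority": {"n": 200, "accuracy": 0.41},
}


def global_accuracy(groups):
    """Sample-weighted average of per-group accuracy."""
    total = sum(g["n"] for g in groups.values())
    return sum(g["n"] * g["accuracy"] for g in groups.values()) / total


baseline = global_accuracy(groups)

# Worst case: the minority group fails completely.
collapsed = {**groups, "minority": {"n": 200, "accuracy": 0.0}}
worst_case = global_accuracy(collapsed)

print(f"global with minority at 0.41: {baseline:.4f}")
print(f"global with minority at 0.00: {worst_case:.4f}")
```

The minority group carries 2% of the weight, so even a total collapse moves the global number by less than 0.01.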

Why not just use accuracy / F1 / confusion matrix?

What You're Using	What It Tells You	What It Hides
Global F1 / Accuracy	Average performance across all data	Which subgroups are failing
Confusion Matrix	TP/FP/TN/FN counts overall	Where those errors concentrate
Per-class Metrics	Performance per label	Feature-driven failure patterns
sliceval	Performance per data subgroup with CI + significance	Nothing. That's the point.

💡 The confusion matrix tells you what the model gets wrong. sliceval tells you where and why.

🔬 How It Works

Three ways to define slices

Tree Discovery — how it finds failures automatically

A shallow decision tree is fit on model errors. Each leaf represents a region of feature space where the model systematically fails. The leaves become candidate slices.
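The idea can be sketched with a depth-1 tree (a decision stump) fit on a per-sample error indicator. This is a pure-Python stand-in for illustration, not sliceval's implementation; `Split`, `gini`, `best_split`, and the toy data are hypothetical names and values. A real run would likely use a deeper tree so that leaves can describe feature conjunctions.

```python
from dataclasses import dataclass


@dataclass
class Split:
    feature: str
    threshold: float
    left_error_rate: float   # rows with value <= threshold
    right_error_rate: float  # rows with value > threshold


def gini(errors):
    """Gini impurity of a list of 0/1 error flags."""
    if not errors:
        return 0.0
    p = sum(errors) / len(errors)
    return 2 * p * (1 - p)


def best_split(rows, errors):
    """Find the single threshold split that best separates errors from non-errors."""
    n = len(rows)
    best = None
    best_impurity = gini(errors)
    for feature in rows[0]:
        values = sorted({row[feature] for row in rows})
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            left = [e for row, e in zip(rows, errors) if row[feature] <= threshold]
            right = [e for row, e in zip(rows, errors) if row[feature] > threshold]
            impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
            if impurity < best_impurity:
                best_impurity = impurity
                best = Split(feature, threshold,
                             sum(left) / len(left), sum(right) / len(right))
    return best


# Toy data (assumed): the model is wrong on every row before 06:00.
rows = [{"hour": h, "temp": (h * 7) % 24} for h in range(24)]
errors = [1 if row["hour"] < 6 else 0 for row in rows]

split = best_split(rows, errors)
print(split)
```

Each side of the split is a candidate slice: here the left leaf (`hour <= 5.5`) has an error rate of 1.0 and the right leaf has 0.0.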

🚀 Tree discovery is the default. It's fast and finds axis-aligned failure regions. Use beam search when you need exhaustive coverage and can afford the compute.

📊 Visual Output

Bar Chart with Confidence Intervals

fig = report.plot(metric='f1', top_n=10)

Confidence Intervals on Every Slice

Red = substantially worse than global (delta < -0.1)
Amber = somewhat worse (-0.1 ≤ delta < 0)
Green = at or above global
Dashed line = global metric baseline
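The color legend reduces to two thresholds on delta (slice minus global). A minimal sketch of that rule; `delta_color` is a hypothetical helper, not part of the sliceval API:

```python
def delta_color(delta: float) -> str:
    """Map a slice's delta from the global metric to the legend color."""
    if delta < -0.1:
        return "red"
    if delta < 0:
        return "amber"
    return "green"


for d in (-0.50, -0.05, 0.0, 0.03):
    print(d, delta_color(d))
```

Note that the color depends only on the size of the delta. Statistical significance is reported separately through the p-value.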

🏭 Real-World Use Cases

🔧 Predictive Maintenance

Your sensor model hits F1 = 0.91 globally. But Sensor Type B on night shifts? F1 = 0.41. That production line keeps missing failures. sliceval finds it before deployment.

🏥 Healthcare / Clinical AI

A diagnostic model performs well overall — but recall drops to 0.25 for patients with large tumor radius. In cancer diagnosis, a missed malignant case kills. sliceval surfaces exactly where recall collapses.

💳 Fraud Detection

Your fraud model catches 95% of fraud globally. But for transactions over $10K from new accounts? Precision drops to 0.30 — you're blocking legitimate high-value customers. sliceval shows you the segment.

🎯 Recommendation Systems

CTR model looks great in aggregate. But for users in the 18-24 cohort with < 5 interactions? The model is essentially random. sliceval quantifies the cold-start problem per segment.

📊 Supported Metrics

Classification

Metric	Key	Notes
F1 Score	`'f1'`	`average='binary'` or `'macro'` for multiclass
Precision	`'precision'`	Same averaging
Recall	`'recall'`	Same averaging
Accuracy	`'accuracy'`	—
ROC AUC	`'auc'`	Requires `predict_proba()`
Expected Calibration Error	`'ece'`	Requires `predict_proba()`

Regression

Metric	Key
Root Mean Squared Error	`'rmse'`
Mean Absolute Error	`'mae'`

🔒 Confidence Intervals & Significance

Every metric on every slice gets a confidence interval and a p-value. Two CI methods:

Method	How	When
`'bootstrap'` (default)	Resample N times, take percentiles	Always works, any metric
`'wilson'`	Wilson score interval	Binary classification only, faster

ev = SliceEvaluator(
    model, X_test, y_test,
    ci_method='bootstrap',
    ci_alpha=0.05,        # 95% CI
    n_bootstrap=1000,
)

⚠️ When ci_method='wilson' is set but a metric doesn't support it (e.g., F1), sliceval silently falls back to bootstrap. No warning, no error — Wilson is a preference, not a requirement.
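For intuition, the two interval methods can be sketched with the standard library alone. This is an illustration under assumptions, not sliceval's code: `wilson_interval`, `bootstrap_interval`, and `accuracy` are hypothetical helpers, and the percentile indexing is one of several reasonable conventions.

```python
import random
from statistics import NormalDist


def wilson_interval(successes, n, alpha=0.05):
    """Wilson score interval for a proportion such as accuracy."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    p = successes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half = z * (p * (1 - p) / n + z ** 2 / (4 * n ** 2)) ** 0.5 / denom
    return center - half, center + half


def bootstrap_interval(y_true, y_pred, metric, alpha=0.05,
                       n_bootstrap=1000, seed=42):
    """Percentile bootstrap: resample rows with replacement, recompute the metric."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_bootstrap):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_bootstrap)]
    hi = stats[min(n_bootstrap - 1, int((1 - alpha / 2) * n_bootstrap))]
    return lo, hi


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy slice (assumed): 100 samples, 41 predicted correctly.
y_true = [1] * 100
y_pred = [1] * 41 + [0] * 59

wilson_lo, wilson_hi = wilson_interval(41, 100)
boot_lo, boot_hi = bootstrap_interval(y_true, y_pred, accuracy)
print(f"wilson:    ({wilson_lo:.3f}, {wilson_hi:.3f})")
print(f"bootstrap: ({boot_lo:.3f}, {boot_hi:.3f})")
```

Wilson needs only a success count and a sample size, which is why it applies to proportions like accuracy but not to F1. The bootstrap works for any metric function at the cost of `n_bootstrap` recomputations.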

📦 MLflow Integration

One call. Everything logged.

import mlflow

with mlflow.start_run():
    report = ev.evaluate()
    report.to_mlflow()  # uses active run

💡 Requires pip install sliceval[mlflow]. Raises ImportError with install instructions if missing.

🏗️ Performance & Design

Built for production ML pipelines, not just notebooks.

Property	Detail
Single inference pass	`model.predict()` called once. 100 slices cost the same inference as 1.
Lazy evaluation	Callable masks evaluated at `.evaluate()` time, not at definition.
Non-invasive	Wraps any sklearn-compatible model. No training code changes.
Composable	Use slicing without discovery. Use discovery without MLflow. Each piece works alone.
Zero heavy deps	Core = numpy + pandas + sklearn. MLflow, matplotlib, scipy are optional.
Tested	162 tests including stress tests across 7 datasets, 13 model types, 3 task types.
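The single-pass and lazy-evaluation properties can be illustrated with a toy evaluator. This is a sketch of the pattern only: `CountingModel` and `TinyEvaluator` are hypothetical, and rows are plain dicts rather than a DataFrame to keep the example dependency-free.

```python
class CountingModel:
    """Toy model that records how many times predict() is called."""

    def __init__(self):
        self.calls = 0

    def predict(self, rows):
        self.calls += 1
        return [1 if row["temp"] > 50 else 0 for row in rows]


class TinyEvaluator:
    def __init__(self, model, rows, y):
        self.model, self.rows, self.y = model, rows, y
        self._slices = {}

    def add_slice(self, name, mask):
        # Stored as-is: callable masks are not invoked at definition time.
        self._slices[name] = mask

    def evaluate(self):
        preds = self.model.predict(self.rows)  # the only inference call
        correct = [int(p == t) for p, t in zip(preds, self.y)]
        report = {"[global]": sum(correct) / len(correct)}
        for name, mask in self._slices.items():
            if callable(mask):
                mask = mask(self.rows)  # lazy: resolved only now
            picked = [c for c, keep in zip(correct, mask) if keep]
            report[name] = sum(picked) / len(picked)
        return report


rows = [
    {"temp": 60, "hour": 2},
    {"temp": 40, "hour": 3},
    {"temp": 70, "hour": 14},
    {"temp": 30, "hour": 15},
]
y = [0, 0, 1, 0]

model = CountingModel()
ev = TinyEvaluator(model, rows, y)
ev.add_slice("night", lambda X: [r["hour"] < 6 for r in X])
ev.add_slice("day", [False, False, True, True])
report = ev.evaluate()
print(report, "predict calls:", model.calls)
```

Every slice is scored from the same cached per-sample correctness, so adding slices adds metric computation but no further inference.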

🧰 Compared to Other Tools

⚠️ Common Mistakes

1. Using discovery without manual slices. Discovery is automated, not omniscient. Always add slices for known risk segments first. Discovery finds what you missed.

2. Ignoring p-values. A slice with delta = -0.30 and p = 0.45 is indistinguishable from noise. A slice with delta = -0.10 and p = 0.002 is very likely real. Filter by significance.

3. Setting min_support too low. A slice with 8 samples and F1 = 0.0 is not actionable. Keep min_support >= 0.05 unless you have a specific reason.

4. Evaluating on training data. sliceval is for test/validation sets. Slice metrics on training data tell you about memorization, not generalization.
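To see why the p-value in mistake 2 matters, here is a minimal sketch of a two-sided permutation test on the delta between slice accuracy and global accuracy. It is an assumption about how such a test can be built, not a description of sliceval's internals; `permutation_p_value` and the toy data are hypothetical.

```python
import random


def permutation_p_value(correct, in_slice, n_permutations=2000, seed=42):
    """Return (observed delta, p-value) for slice accuracy minus global accuracy."""
    rng = random.Random(seed)
    n_slice = sum(in_slice)
    global_acc = sum(correct) / len(correct)

    def delta(mask):
        hits = sum(c for c, m in zip(correct, mask) if m)
        return hits / n_slice - global_acc

    observed = delta(in_slice)
    shuffled = list(in_slice)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # random slice of the same size
        if abs(delta(shuffled)) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return observed, (extreme + 1) / (n_permutations + 1)


# Toy data (assumed): 40-sample slice at 50% accuracy, 160 other samples at 95%.
in_slice = [True] * 40 + [False] * 160
correct = [1] * 20 + [0] * 20 + [1] * 152 + [0] * 8

observed, p_value = permutation_p_value(correct, in_slice)
print(f"delta = {observed:.2f}, p = {p_value:.4f}")
```

A random 40-sample subset almost never scores this far from the global accuracy, so the p-value is small and the gap is unlikely to be a sampling artifact.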

🛡️ Error Handling

sliceval fails loudly with descriptive messages. No silent corruption.

Exceptions

Situation	Exception	Message
`X` is not a DataFrame	`TypeError`	`X must be a pd.DataFrame, got ndarray`
`len(X) != len(y)`	`ValueError`	`X and y must have the same length. Got X: 500, y: 400`
Invalid task string	`ValueError`	`task must be 'binary', 'multiclass', or 'regression'. Got: 'classify'`
Unknown metric	`ValueError`	`Unknown metric 'f2'. Valid metrics: [...]`
`auc`/`ece` without `predict_proba`	`ValueError`	`Metric 'auc' requires model.predict_proba()`
Slice mask wrong length	`ValueError`	`Slice 'x' mask has length 50, expected 1000`
Empty slice	`ValueError`	`Slice 'x' has 0 samples. Check your mask condition.`
Discovery metric not in list	`ValueError`	`Discovery metric 'auc' is not in the evaluator's metric list`
MLflow not installed	`ImportError`	`MLflow integration requires: pip install sliceval[mlflow]`
matplotlib not installed	`ImportError`	`Plotting requires: pip install sliceval[plot]`

Warnings

Situation	Warning
Slice with < 30 samples	`Slice 'x' has 15 samples. Metrics may be unreliable.`
Duplicate slice name	`Slice 'x' already exists and will be overwritten.`
No slices before evaluate	`No slices defined. Call add_slice() or discover_slices().`
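The mask-related rows in the tables above suggest a validation step along these lines. This is a hedged sketch: `resolve_mask` is a hypothetical helper, and only the message strings are taken from the tables.

```python
import warnings


def resolve_mask(name, mask, X):
    """Resolve a callable mask, then validate its length and sample count."""
    if callable(mask):
        mask = mask(X)
    mask = [bool(m) for m in mask]
    if len(mask) != len(X):
        raise ValueError(
            f"Slice '{name}' mask has length {len(mask)}, expected {len(X)}"
        )
    n_samples = sum(mask)
    if n_samples == 0:
        raise ValueError(f"Slice '{name}' has 0 samples. Check your mask condition.")
    if n_samples < 30:
        warnings.warn(f"Slice '{name}' has {n_samples} samples. Metrics may be unreliable.")
    return mask


X = list(range(1000))  # stand-in for 1000 test rows

try:
    resolve_mask("x", [True] * 50, X)
except ValueError as exc:
    print(exc)
```

Failing at this point, before any metric is computed, keeps a malformed mask from producing a plausible-looking but wrong number.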

📖 Full API Reference

SliceEvaluator

SliceEvaluator(
    model,                          # any object with .predict()
    X: pd.DataFrame,                # test features (must be DataFrame)
    y: pd.Series | np.ndarray,      # ground truth labels
    task: str = 'binary',           # 'binary' | 'multiclass' | 'regression'
    metrics: list = None,           # default depends on task
    ci_method: str = 'bootstrap',   # 'bootstrap' | 'wilson'
    ci_alpha: float = 0.05,         # confidence level = 1 - ci_alpha
    n_bootstrap: int = 1000,        # bootstrap iterations
    average: str = 'macro',         # multiclass averaging
    random_state: int = 42,
)

add_slice(name, mask)

ev.add_slice(
    name: str,                      # human-readable label
    mask,                           # pd.Series[bool] | np.ndarray[bool] | callable
)

If mask is callable, it receives X and must return a boolean array. Evaluated lazily at .evaluate() time.

discover_slices(method, **kwargs)

ev.discover_slices(
    method: str = 'tree',           # 'tree' | 'beam'
    max_depth: int = 3,             # max feature conjunctions
    min_support: float = 0.05,      # min fraction of test set
    metric: str = 'f1',             # metric to rank by
    n_slices: int = 10,             # max slices to return
    significance: float = 0.05,     # p-value threshold
)
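To show how `max_depth`, `min_support`, and `n_slices` interact in a beam-style search, here is a simplified sketch over conjunctions of predicates. It ranks by raw error rate rather than by a metric such as F1, and it is not sliceval's implementation; `beam_search_slices`, `beam_width`, and the predicates are hypothetical.

```python
def beam_search_slices(rows, errors, predicates, max_depth=3,
                       min_support=0.05, beam_width=5, n_slices=10):
    """Grow conjunctions of predicates, keeping the highest-error ones per depth."""
    n = len(rows)
    min_count = min_support * n

    def evaluate(conj):
        """Return (error_rate, n_samples) for a tuple of predicate indices, or None."""
        members = [i for i, row in enumerate(rows)
                   if all(predicates[j][1](row) for j in conj)]
        if not members or len(members) < min_count:
            return None
        return sum(errors[i] for i in members) / len(members), len(members)

    found = {}
    beam = [()]
    for _ in range(max_depth):
        candidates = {}
        for conj in beam:
            for j in range(len(predicates)):
                if j in conj:
                    continue
                new = tuple(sorted(conj + (j,)))
                if new in found or new in candidates:
                    continue
                result = evaluate(new)
                if result is not None:
                    candidates[new] = result
        if not candidates:
            break
        found.update(candidates)
        beam = sorted(candidates, key=lambda c: -candidates[c][0])[:beam_width]

    ranked = sorted(found.items(), key=lambda kv: (-kv[1][0], -kv[1][1]))
    return [(" & ".join(predicates[j][0] for j in conj), rate, count)
            for conj, (rate, count) in ranked[:n_slices]]


# Toy data (assumed): the model fails only for sensor B before 06:00.
rows = [{"sensor_type": s, "hour": h} for s in "AB" for h in range(24)]
errors = [1 if r["sensor_type"] == "B" and r["hour"] < 6 else 0 for r in rows]
predicates = [
    ("sensor_type == B", lambda r: r["sensor_type"] == "B"),
    ("hour < 6", lambda r: r["hour"] < 6),
    ("hour >= 12", lambda r: r["hour"] >= 12),
]

top = beam_search_slices(rows, errors, predicates)
print(top[0])
```

Each depth adds one more condition to a conjunction, and `min_support` prunes conjunctions that cover too little of the test set to be actionable.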

evaluate() -> SliceReport

Runs inference once, computes all metrics on all slices, returns a SliceReport.

SliceReport

Attribute	Type	Description
`global_metrics`	`dict`	`{'f1': 0.91, ...}`
`slices`	`list[Slice]`	All evaluated slices
`metrics`	`list[SliceMetrics]`	Per-slice results
`task`	`str`	Task type
`evaluated_at`	`datetime`	UTC timestamp

worst_slices(n=5, metric=None, min_support=0.0) -> pd.DataFrame

Returns the n worst slices, sorted by delta ascending (most negative first).
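The ranking logic amounts to a filter and a sort. A minimal sketch over plain dicts, assuming the `{m}_delta` column naming used by `to_dataframe()`; the function body, the `slice` and `support` keys, and the exclusion of the `[global]` row are assumptions:

```python
def worst_slices(rows, metric, n=5, min_support=0.0):
    """Return the n slices with the most negative delta for the given metric."""
    key = f"{metric}_delta"
    candidates = [
        r for r in rows
        if r["slice"] != "[global]" and r["support"] >= min_support
    ]
    return sorted(candidates, key=lambda r: r[key])[:n]


rows = [
    {"slice": "[global]", "support": 1.0, "f1_delta": 0.0},
    {"slice": "sensor_b", "support": 0.20, "f1_delta": -0.12},
    {"slice": "night_shift", "support": 0.25, "f1_delta": -0.05},
    {"slice": "sensor_b_night", "support": 0.02, "f1_delta": -0.50},
]

top = worst_slices(rows, "f1", n=2, min_support=0.05)
print([r["slice"] for r in top])
```

Raising `min_support` drops the tiny `sensor_b_night` slice from the ranking, which is useful when very small slices would otherwise dominate the list.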

to_dataframe() -> pd.DataFrame

Full slice x metric matrix. First row is [global]. Columns per metric: {m}_value, {m}_ci_lower, {m}_ci_upper, {m}_delta, {m}_p_value.
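The column layout can be illustrated by flattening one per-slice result into a row. The dataclass mirrors the `SliceMetrics` fields documented in this reference; `flatten`, the leading `slice`, `n_samples`, and `support` columns, and the example values for `n_samples` and `support` are assumptions.

```python
from dataclasses import dataclass


@dataclass
class SliceMetrics:
    slice_name: str
    n_samples: int
    support: float
    metrics: dict
    ci_lower: dict
    ci_upper: dict
    delta: dict
    p_value: dict


def flatten(sm: SliceMetrics) -> dict:
    """Turn one SliceMetrics into a flat row with {m}_* columns per metric."""
    row = {"slice": sm.slice_name, "n_samples": sm.n_samples, "support": sm.support}
    for m in sm.metrics:
        row[f"{m}_value"] = sm.metrics[m]
        row[f"{m}_ci_lower"] = sm.ci_lower[m]
        row[f"{m}_ci_upper"] = sm.ci_upper[m]
        row[f"{m}_delta"] = sm.delta[m]
        row[f"{m}_p_value"] = sm.p_value[m]
    return row


sm = SliceMetrics(
    slice_name="sensor_b_night",
    n_samples=200,
    support=0.02,
    metrics={"f1": 0.41},
    ci_lower={"f1": 0.33},
    ci_upper={"f1": 0.49},
    delta={"f1": -0.50},
    p_value={"f1": 0.003},
)
row = flatten(sm)
print(list(row))
```

A list of such rows can be passed to `pd.DataFrame(...)` to obtain one row per slice and five columns per metric.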

to_mlflow(run_id=None, artifact_path='slice_eval')

Logs CSV and JSON artifacts to MLflow.

plot(metric=None, top_n=10, figsize=(10, 6)) -> Figure

Horizontal bar chart with CI error bars and global baseline.

Slice and SliceMetrics dataclasses

@dataclass
class Slice:
    name: str                       # human-readable label
    mask: np.ndarray                # boolean, shape (n_test_samples,)
    n_samples: int
    support: float                  # n_samples / len(X_test)
    source: str                     # 'manual' | 'tree' | 'beam'
    feature_conditions: list        # e.g. ['sensor_type == B', 'hour < 6']

@dataclass
class SliceMetrics:
    slice_name: str
    n_samples: int
    support: float
    metrics: dict                   # {'f1': 0.41, ...}
    ci_lower: dict                  # {'f1': 0.33, ...}
    ci_upper: dict                  # {'f1': 0.49, ...}
    delta: dict                     # {'f1': -0.50, ...}  (slice - global)
    p_value: dict                   # {'f1': 0.003, ...}

🗺️ Roadmap

Multi-model comparison (compare slices across model versions)
HTML report export (standalone, no MLflow needed)
Weights and Biases integration
Slice-aware cross-validation
Interactive slice explorer (panel/streamlit widget)

🧑‍💻 Development

git clone https://github.com/kartikeyamandhar/sliceval.git
cd sliceval
python -m venv .venv
source .venv/bin/activate
pip install -e .
pip install pytest
pytest tests/ -v

Project Structure

sliceval/
├── __init__.py                 # public API
├── evaluator.py                # SliceEvaluator
├── slice.py                    # Slice, SliceMetrics
├── metrics.py                  # metric computation + CI
├── report.py                   # SliceReport
├── discovery/
│   ├── tree.py                 # decision tree discovery
│   └── beam.py                 # beam search (SliceFinder)
├── integrations/
│   └── mlflow.py               # MLflow artifact export
└── utils/
    ├── stats.py                # bootstrap, Wilson, permutation tests
    └── validation.py           # input validation

Built because global metrics are dangerous defaults.

If this saves you from a production failure, consider starring the repo.

⭐ github.com/kartikeyamandhar/sliceval

MIT License · Made by Kartikeya Mandhar

Version 0.1.0 · Released Apr 7, 2026
