Three checks that catch the leakage and schema bugs that slip past peer review: target-correlated features, schema violations, and state-dependent transforms.
schema-firewall
Three checks that catch the leakage and schema bugs that slip past peer review.
pip install schema-firewall
The problem
In the last five years, published and competition-grade ML systems have repeatedly shipped with one of these three bugs:
| Bug | Real example | Impact |
|---|---|---|
| Feature statistically mirrors the target | COVID-19 chest X-ray classifiers learned hospital-ID confounders, not pulmonary features | Internal AUC 0.99, external-hospital AUC near-chance |
| Forbidden / post-outcome feature in the input | JAMA Network Open 2024: 40.2% of MIMIC same-admission prediction studies fed in ICD codes finalised at discharge | AUROC 0.97 from leaky codes alone |
| Transform that reads across the whole dataset | Kaggle Santander 2019 "magic" leak: frequency features computed on (train ∪ real-test) | Public AUC jumped 0.90 → 0.92 |
Each one escaped peer review, code review, or competition scrutiny — because the bug isn't a type error. It's a statistical / semantic contract violation.
schema-firewall provides three drop-in checks, one per bug class.
Usage
import pandas as pd
from schema_firewall import (
    check_leakage,
    check_schema,
    check_stateless,
    SchemaContract,
    LeakageError,
)

X: pd.DataFrame  # your feature frame
y: pd.Series     # your target

# 1. Statistical leakage: Pearson + Spearman + normalised mutual info.
#    Catches target-copies, monotonic transforms, sigmoid/rank re-encodings,
#    and strong confounders. Raises LeakageError on fail.
check_leakage(X, y)

# 2. Schema contract: forbidden columns, required columns, dtypes.
#    Catches ICD-code-style post-outcome features and schema drift.
contract = SchemaContract(
    forbidden_columns=frozenset({"SALE PRICE", "PRICE_PER_SQFT"}),
    required_columns=frozenset({"sqft", "year_built"}),
)
check_schema(X, contract)

# 3. Statelessness: runs your feature pipeline on the full frame vs a
#    single-row subset. Flags any transform whose per-row output depends
#    on other rows: mean encoders, frequency encoders, target encoders
#    applied outside CV, ComBat/global normalisation, etc.
check_stateless(my_pipeline_fn, raw_frame)
Each function raises on failure and returns None on pass. No silent
degradation.
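To make the idea behind the statistical check concrete, here is a minimal sketch of a correlation-based leakage score. It is not schema-firewall's implementation: the names `leakage_scores` and `flag_suspects` and the 0.95 threshold are assumptions made for illustration, and the mutual-information term mentioned above is left out to keep the example short.

```python
import numpy as np
import pandas as pd


def leakage_scores(X: pd.DataFrame, y: pd.Series) -> pd.Series:
    """Per-column score: max of |Pearson| and |Spearman| against the target."""
    y_rank = y.rank()
    scores = {}
    for col in X.select_dtypes("number").columns:
        x = X[col]
        pearson = abs(x.corr(y))
        # Spearman is the Pearson correlation of the ranks.
        spearman = abs(x.rank().corr(y_rank))
        scores[col] = max(pearson, spearman)
    return pd.Series(scores, name="score")


def flag_suspects(X: pd.DataFrame, y: pd.Series, threshold: float = 0.95) -> list:
    """Return the columns whose score meets the (hypothetical) threshold."""
    scores = leakage_scores(X, y)
    return sorted(scores[scores >= threshold].index)


# Synthetic demo: one honest feature, one linear copy of the target,
# and one monotonic (exponential) re-encoding of the target.
rng = np.random.default_rng(0)
y = pd.Series(rng.normal(size=500))
X = pd.DataFrame(
    {
        "noise": rng.normal(size=500),
        "target_copy": y * 3.0 + 1.0,
        "target_exp": np.exp(y),
    }
)
suspects = flag_suspects(X, y)
```

The rank-based term is what picks up `target_exp`: a monotonic re-encoding has a Spearman correlation of 1 with the target even when its Pearson correlation is lower.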
The demo notebook
examples/leakage_demo.ipynb: 60 seconds, California housing dataset, one deliberate leak, one library call.
Open it. It reproduces the target-encoding bug that sits in real production pipelines, shows an R² that looks impressive, then one call to check_stateless catches the leak before the model ships.
If you've ever applied .mean(), .value_counts(), TargetEncoder, or ComBat/fit_transform to your full dataset before cross-validation, the notebook is pointed at you.
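The full-frame versus single-row comparison can be sketched in a few lines. This is an illustration of the idea only, not the library's code: `is_stateless`, `probe_rows`, and the two toy pipelines are hypothetical names, and the sketch assumes the pipeline preserves the frame's index and that exact equality is an acceptable comparison.

```python
import pandas as pd


def is_stateless(pipeline_fn, frame: pd.DataFrame, probe_rows: int = 5) -> bool:
    """True if each probed row gets the same output alone as inside the full frame."""
    full = pipeline_fn(frame)
    for idx in frame.index[:probe_rows]:
        alone = pipeline_fn(frame.loc[[idx]])  # one-row frame
        if not alone.loc[idx].equals(full.loc[idx]):
            return False
    return True


def add_sqft_k(frame: pd.DataFrame) -> pd.DataFrame:
    """Row-wise transform: each output depends only on its own row."""
    out = frame.copy()
    out["sqft_k"] = out["sqft"] / 1000
    return out


def add_city_freq(frame: pd.DataFrame) -> pd.DataFrame:
    """Frequency encoder: each output depends on every other row."""
    out = frame.copy()
    out["city_freq"] = out["city"].map(out["city"].value_counts())
    return out


raw = pd.DataFrame(
    {
        "city": ["a", "a", "b", "a"],
        "sqft": [800.0, 950.0, 1200.0, 700.0],
    }
)
rowwise_ok = is_stateless(add_sqft_k, raw)
freq_ok = is_stateless(add_city_freq, raw)
```

The frequency encoder fails because city "a" counts three times in the full frame but once when the row is processed alone.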
What this is NOT
- Not a replacement for train/test splitting, cross-validation, or sklearn Pipeline.
- Not a feature-importance tool.
- Not a drift-monitoring service.
- Not a validation framework with its own DSL.
Three checks. One contract class. Four exceptions. That's the whole library.
Design constraints (locked)
- ≤ 500 LoC of core implementation. Actual: ~305.
- 3 public check functions: check_leakage, check_schema, check_stateless. No more.
- 27 adversarial tests covering every documented failure mode above.
- Three dependencies: numpy, pandas, scikit-learn. Nothing else.
If schema-firewall v0.1 is missing a check you need, the library is wrong for your use case. Build the check in-line. v0.1 will not grow to absorb it.
When to use each check
| You did this | Run this |
|---|---|
| Built any feature-engineering function that reads the full frame | check_stateless(pipeline_fn, raw) |
| Joined multiple datasets with different origins / schemas / timestamps | check_schema(X, SchemaContract(forbidden_columns=…)) |
| Want a fast sanity gate before training | check_leakage(X, y) on the final feature frame |
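For the joined-datasets case, the core of a column contract is two set operations. The sketch below is a stand-in written for illustration, not the library's SchemaContract: `ToyContract` and `contract_violations` are hypothetical names, and it returns a list of problems rather than raising, so the result is easy to inspect.

```python
from dataclasses import dataclass

import pandas as pd


@dataclass(frozen=True)
class ToyContract:
    """Hypothetical stand-in for a column contract."""

    forbidden_columns: frozenset = frozenset()
    required_columns: frozenset = frozenset()


def contract_violations(frame: pd.DataFrame, contract: ToyContract) -> list:
    """List every forbidden column present and every required column missing."""
    cols = set(frame.columns)
    problems = []
    for name in sorted(cols & contract.forbidden_columns):
        problems.append(f"forbidden column present: {name}")
    for name in sorted(contract.required_columns - cols):
        problems.append(f"required column missing: {name}")
    return problems


# A join that silently carried the outcome column along and dropped a feature.
merged = pd.DataFrame({"sqft": [800.0, 950.0], "SALE PRICE": [410000, 525000]})
contract = ToyContract(
    forbidden_columns=frozenset({"SALE PRICE", "PRICE_PER_SQFT"}),
    required_columns=frozenset({"sqft", "year_built"}),
)
problems = contract_violations(merged, contract)
```

A frozen dataclass with frozenset fields keeps the contract immutable once declared, which fits a contract that should not change between training and scoring.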
What it caught in production (dogfood)
The schema-firewall checks are the same ones used by the NYC Real Estate Predictor external benchmark against NYC.gov 2024 Rolling Sales data. The flagship benchmark uses schema-firewall as a dependency, not a vendored copy. When the library breaks, the benchmark breaks. This is by design.
Attribution
Extracted from the firewall layer of the NYC Real Estate Predictor's external benchmark. The scoring-determinism pattern comes from the Protocol-based core of the Job Decision Engine project. Credit for the underlying problem classes goes to:
- DeGrave et al. (Nature Machine Intelligence, 2021) — COVID X-ray shortcut learning
- Rosenblatt et al. (Nature Communications, 2024) — connectome leakage
- Ramadan et al. (JAMIA, 2024) — clinical label-leakage framework
- YaG320 — Santander "magic" competition kernel
License
MIT. See LICENSE.
Download files
File details
Details for the file schema_firewall-0.1.0.tar.gz.
File metadata
- Download URL: schema_firewall-0.1.0.tar.gz
- Upload date:
- Size: 14.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 8e39294455dea7acb5cdce54f54250e19acfeee9dffe19fbd663d0611e478c5e |
| MD5 | 634da14023c05eb1db97f98c2d8f57e0 |
| BLAKE2b-256 | d7e34b8552250eaa12f4e1e786b92197ca0f3cf1afcdc2f48ff63b5e914961e6 |
File details
Details for the file schema_firewall-0.1.0-py3-none-any.whl.
File metadata
- Download URL: schema_firewall-0.1.0-py3-none-any.whl
- Upload date:
- Size: 10.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c3995195bfc91d89c8ba868573702d0e7dff43e8f2f4478ddf9a5a406a42e9b0 |
| MD5 | 967729a640759009b03bdc9a776816d5 |
| BLAKE2b-256 | b2fddab12a0d93cbf57efe849767c153c7cdef05254c30566d96c43ddb8544f8 |