Three checks that catch the leakage and schema bugs that slip past peer review: target-correlated features, schema violations, and state-dependent transforms.

schema-firewall

Three checks that catch the leakage and schema bugs that slip past peer review.

pip install schema-firewall

Production usage. Extracted from the firewall layer of nyc-real-estate-predictor. The flagship pins schema-firewall==0.1.0 in requirements.txt and re-validates the firewall integration in its External Benchmark CI job on every push. Read this as a directional signal that the library has a real consumer (a pinned dependency plus a consuming CI job), not as a semantic contract invariant.

The problem

In the last five years, published and competition-grade ML systems have repeatedly shipped with one of these three bugs:

Bug	Real example	Impact
Feature statistically mirrors the target	COVID-19 chest X-ray classifiers learned hospital-ID confounders, not pulmonary features	Internal AUC 0.99, external-hospital AUC near-chance
Forbidden / post-outcome feature in the input	JAMA Network Open 2024: 40.2% of MIMIC same-admission prediction studies fed in ICD codes finalised at discharge	AUROC 0.97 from leaky codes alone
Transform that reads across the whole dataset	Kaggle Santander 2019 "magic" leak: frequency features computed on (train ∪ real-test)	Public AUC jumped 0.90 → 0.92

Each one escaped peer review, code review, or competition scrutiny — because the bug isn't a type error. It's a statistical / semantic contract violation.

schema-firewall provides three drop-in checks, one per bug class.

Usage

import pandas as pd
from schema_firewall import (
    check_leakage,
    check_schema,
    check_stateless,
    SchemaContract,
    LeakageError,
)

X: pd.DataFrame  # your feature frame
y: pd.Series     # your target

# 1. Statistical leakage — Pearson + Spearman + normalised mutual info.
#    Catches target-copies, monotonic transforms, sigmoid/rank re-encodings,
#    and strong confounders. Raises LeakageError on fail.
check_leakage(X, y)

# 2. Schema contract — forbidden columns, required columns, dtypes.
#    Catches ICD-code-style post-outcome features and schema drift.
contract = SchemaContract(
    forbidden_columns=frozenset({"SALE PRICE", "PRICE_PER_SQFT"}),
    required_columns=frozenset({"sqft", "year_built"}),
)
check_schema(X, contract)

# 3. Statelessness — runs your feature pipeline on the full frame vs a
#    single-row subset. Flags any transform whose per-row output depends
#    on other rows: mean encoders, frequency encoders, target encoders
#    applied outside CV, ComBat/global normalisation, etc.
check_stateless(my_pipeline_fn, raw_frame)

Each function raises on failure and returns None on pass. No silent degradation.
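To make the first check more concrete, here is a minimal sketch of a correlation-based leakage screen. It is not the library's implementation: the names sketch_check_leakage and SketchLeakageError and the 0.95 threshold are assumptions made for illustration, and the normalised mutual information score mentioned in the usage comment is left out. Spearman is computed as Pearson on ranks, which is why a sigmoid re-encoding of the target is still flagged.

```python
import numpy as np
import pandas as pd


class SketchLeakageError(ValueError):
    """Hypothetical error type for this sketch; not the library's LeakageError."""


def sketch_check_leakage(X, y, threshold=0.95):
    """Raise if any numeric column tracks the target too closely.

    The threshold is an assumed value, not one taken from schema-firewall.
    """
    y_rank = y.rank()
    for col in X.select_dtypes(include="number").columns:
        pearson = abs(X[col].corr(y))
        # Spearman correlation is Pearson correlation computed on ranks.
        spearman = abs(X[col].rank().corr(y_rank))
        # Constant columns give NaN correlations; ignore those scores.
        scores = [s for s in (pearson, spearman) if not np.isnan(s)]
        if scores and max(scores) >= threshold:
            raise SketchLeakageError(
                f"column {col!r} mirrors the target (score {max(scores):.3f})"
            )


rng = np.random.default_rng(0)
y = pd.Series(rng.normal(size=200), name="target")

# A frame with only noise should pass and return None.
X_clean = pd.DataFrame({"noise": rng.normal(size=200)})

# A monotonic re-encoding of the target should be caught via the rank score.
X_leaky = X_clean.assign(sigmoid_of_y=1.0 / (1.0 + np.exp(-3.0 * y)))
```

Calling sketch_check_leakage(X_clean, y) returns None, while sketch_check_leakage(X_leaky, y) raises, mirroring the raise-on-failure convention described above.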

The demo notebook

examples/leakage_demo.ipynb — 60 seconds, California housing dataset, one deliberate leak, one library call.

Open it. It reproduces the target-encoding bug that sits in real production pipelines, shows an R² that looks impressive, and then catches the leak with one call to check_stateless before the model ships.

If you've ever applied .mean(), .value_counts(), TargetEncoder, or ComBat/fit_transform to your full dataset before cross-validation, the notebook is pointed at you.
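The effect is easy to reproduce on synthetic data. The sketch below is not the notebook's code and does not use the California housing dataset; it assumes a random high-cardinality bucket column that carries no information about the target, then compares a target-mean encoding fitted on the full frame with one fitted on the training rows only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "bucket": rng.integers(0, 200, size=n),  # random group ids, unrelated to y
    "y": rng.normal(size=n),                 # target is pure noise
})

# Leaky: each row's encoding is a group mean that includes the row's own target.
leaky_feature = df.groupby("bucket")["y"].transform("mean")
leaky_corr = leaky_feature.corr(df["y"])

# Honest: fit the encoding on the training rows, then apply it to held-out rows.
train, test = df.iloc[:200], df.iloc[200:]
bucket_means = train.groupby("bucket")["y"].mean()
honest_feature = test["bucket"].map(bucket_means).fillna(train["y"].mean())
honest_corr = honest_feature.corr(test["y"])

print(f"leaky encoding vs target:  r = {leaky_corr:.2f}")
print(f"honest encoding vs target: r = {honest_corr:.2f}")
```

Because the buckets are random, any correlation between the encoded feature and the target comes from the encoding having seen the target. The full-frame encoding correlates strongly with y, while the train-only encoding does not.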

Verified invariants under execution

The library is in production use today as a pinned dep of nyc-real-estate-predictor. The flagship's External Benchmark CI job re-checks these invariants against the published wheel on every push to main:

Statistical leakage detection triggers on the bundled California housing demo. Build a target-mean-encoded feature on rounded lat/lon buckets — Ridge regression returns R² = 0.9495 (leaky). Apply the same target encoding per train fold only — R² collapses to 0.4384 (honest). Both check_leakage and check_stateless raise on the leaky pipeline. Reproducible in 60 seconds via examples/leakage_demo.ipynb.
Statelessness holds under subset perturbation. check_stateless runs the user pipeline on the full frame, then on a one-row subset, and compares the per-row output. Any transform whose per-row output depends on other rows (frequency encoders, target-mean encoders, ComBat-style global normalisation) fails this invariant by construction. By default it samples five indices spread across the frame rather than only row 0, which could belong to a singleton group and pass by accident.
Forbidden-column gate raises on the documented set. nyc-real-estate-predictor configures SchemaContract(forbidden_columns=frozenset({"SALE PRICE", "SALE DATE", "PRICE_PER_SQFT", "TARGET", "log_price"})). The 18-test adversarial suite in the flagship asserts that check_schema raises on each of these columns presented under several disguises.
Determinism check catches non-deterministic transforms. Two consecutive pipeline_fn(raw) calls must produce identical frames. Unseeded random initialisation, dict-order dependency, and side-effecting transforms all fail. The comparison uses pd.testing.assert_frame_equal internally.

These hold across the test matrix; numbers (test counts, coverage %) age — the invariants don't.
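The statelessness and determinism invariants above can be sketched in a few lines of pandas. This is an illustration under stated assumptions, not the library's source: sketch_check_stateless, SketchStatefulError, and the n_probe parameter are hypothetical names, and the probe positions are chosen with np.linspace as one plausible reading of "five spread indices".

```python
import numpy as np
import pandas as pd


class SketchStatefulError(ValueError):
    """Hypothetical error type for this sketch; not the library's exception."""


def sketch_check_stateless(pipeline_fn, raw, n_probe=5):
    """Raise if pipeline_fn is non-deterministic or reads across rows."""
    full = pipeline_fn(raw)

    # Determinism: a second call on the same input must give an identical frame.
    try:
        pd.testing.assert_frame_equal(full, pipeline_fn(raw))
    except AssertionError as exc:
        raise SketchStatefulError("pipeline is not deterministic") from exc

    # Statelessness: probe a few positions spread across the frame and compare
    # each row processed alone with the same row processed in context.
    positions = np.linspace(0, len(raw) - 1, num=min(n_probe, len(raw)), dtype=int)
    for pos in positions:
        alone = pipeline_fn(raw.iloc[[int(pos)]])
        in_context = full.iloc[[int(pos)]]
        try:
            pd.testing.assert_frame_equal(alone, in_context, check_dtype=False)
        except AssertionError as exc:
            raise SketchStatefulError(
                f"row at position {int(pos)} depends on other rows"
            ) from exc


raw = pd.DataFrame({
    "city": ["NYC", "NYC", "NYC", "LA", "LA", "SF"],
    "sqft": [700.0, 850.0, 1200.0, 900.0, 1500.0, 650.0],
})


def stateless_fn(frame):
    # Row-wise transform: each output value depends only on its own row.
    return frame.assign(log_sqft=np.log1p(frame["sqft"]))


def frequency_fn(frame):
    # Frequency encoder: each output value depends on how often the city appears.
    counts = frame["city"].value_counts()
    return frame.assign(city_freq=frame["city"].map(counts))
```

sketch_check_stateless(stateless_fn, raw) returns None. sketch_check_stateless(frequency_fn, raw) raises, because a row processed alone always gets a frequency of 1 while the same row in the full frame does not.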

What this is NOT

Not a replacement for train/test splitting, cross-validation, or sklearn Pipeline.
Not a feature-importance tool.
Not a drift-monitoring service.
Not a validation framework with its own DSL.

Three checks. One contract class. Four exceptions. That's the whole library.

Design constraints (locked)

≤ 500 LoC of core implementation. Actual: 344 lines (raw) / 270 lines (excluding blanks + comments).
3 public check functions — check_leakage, check_schema, check_stateless. No more.
30 adversarial tests covering every documented failure mode above.
Three dependencies: numpy, pandas, scikit-learn. Nothing else.

If schema-firewall v0.1 is missing a check you need, the library is wrong for your use case. Build the check in-line. v0.1 will not grow to absorb it.

When to use each check

You did this	Run this
Built any feature-engineering function that reads the full frame	`check_stateless(pipeline_fn, raw)`
Joined multiple datasets with different origins / schemas / timestamps	`check_schema(X, SchemaContract(forbidden_columns=…))`
Want a fast sanity gate before training	`check_leakage(X, y)` on the final feature frame
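For the joined-datasets row of the table, a forbidden-column gate might look like the dependency-free sketch below. The names SketchContract, SketchSchemaError, and sketch_check_schema are hypothetical, and only the two contract fields shown in the usage section are modelled. Matching column names after lowercasing and stripping punctuation is an assumption about what a "disguise" could be; the source does not say how the real check handles renamed columns.

```python
import re
from dataclasses import dataclass
from typing import Iterable


class SketchSchemaError(ValueError):
    """Hypothetical error type for this sketch; not the library's exception."""


@dataclass(frozen=True)
class SketchContract:
    """Stand-in for SchemaContract with only the two fields shown in Usage."""
    forbidden_columns: frozenset = frozenset()
    required_columns: frozenset = frozenset()


def _normalise(name: str) -> str:
    # Assumed normalisation: lowercase, then drop everything except a-z and 0-9.
    return re.sub(r"[^a-z0-9]", "", str(name).lower())


def sketch_check_schema(columns: Iterable[str], contract: SketchContract) -> None:
    """Raise if a forbidden column is present or a required column is missing."""
    cols = list(columns)
    forbidden = {_normalise(c) for c in contract.forbidden_columns}
    hits = [c for c in cols if _normalise(c) in forbidden]
    if hits:
        raise SketchSchemaError(f"forbidden columns present: {hits}")
    missing = sorted(contract.required_columns - set(cols))
    if missing:
        raise SketchSchemaError(f"required columns missing: {missing}")


contract = SketchContract(
    forbidden_columns=frozenset({"SALE PRICE", "PRICE_PER_SQFT"}),
    required_columns=frozenset({"sqft", "year_built"}),
)

clean_columns = ["sqft", "year_built", "borough"]
# After a join, the target can reappear under a slightly different spelling.
joined_columns = ["sqft", "year_built", "sale_price "]
```

With a pandas frame, passing X.columns as the first argument works the same way. sketch_check_schema(clean_columns, contract) returns None, and sketch_check_schema(joined_columns, contract) raises on the renamed price column.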

What it caught in production (dogfood)

The schema-firewall checks are the same ones used by the NYC Real Estate Predictor external benchmark against NYC.gov 2024 Rolling Sales data. The flagship benchmark uses schema-firewall as a dependency, not a vendored copy. When the library breaks, the benchmark breaks. This is by design.

Attribution

Extracted from the firewall layer of the NYC Real Estate Predictor's external benchmark. The scoring-determinism pattern comes from the Protocol-based core of the Job Decision Engine project. Credit for the underlying problem classes goes to:

DeGrave et al. (Nature Machine Intelligence, 2021) — COVID X-ray shortcut learning
Rosenblatt et al. (Nature Communications, 2024) — connectome leakage
Ramadan et al. (JAMIA, 2024) — clinical label-leakage framework
YaG320 — Santander "magic" competition kernel

License

MIT. See LICENSE.

Release history

0.1.2 (this version): May 28, 2026
0.1.1: May 24, 2026
0.1.0: Apr 20, 2026
