Dataset readiness checker for machine learning — run a pre-flight checklist on any DataFrame before training.
Project description
Dataset readiness checks for ML pipelines.
Use preflight to catch data blockers before training and deployment.
Why preflight
- Runs fast checks for data quality, target risk, schema/type issues, split integrity, and statistical anomalies.
- Produces machine-readable findings and a CI gate decision (PASS/FAIL).
- Keeps output stable with schema versioning for downstream tooling.
Quickstart
Install:
pip install preflight-data
Python API:
import pandas as pd
import preflight
df = pd.read_csv("data.csv")
report = preflight.run(df, target="churn", profile="ci-balanced")
print(report)
CLI:
preflight run data.csv --target churn --profile ci-balanced --format text
Example output:
Preflight Run Report
────────────────────────────────────────────────────────
Gate: PASS
Heuristic Score: 97.0/100
Profile: ci-balanced
Dataset: 1000/1000 rows analyzed across 6 columns
Target: churn
Summary: 8 info, 1 warn, 0 error, 0 critical
Gate reasons:
- No findings met fail conditions for this profile
Findings:
- [WARN] completeness.missingness: Overall missingness is 1.8% (108 cells missing across dataset). (confidence=0.95)
- [INFO] duplicates.exact: No exact duplicate rows detected. (confidence=0.95)
- [INFO] balance.class_imbalance: Class distribution is within configured tolerance. (confidence=0.90)
HTML report preview
If you want to see what the generated HTML report looks like, open:
The notebook contains rendered report.to_html() output cells.
Core model
- finding: one detected issue or advisory signal
- severity: info | warn | error | critical
- gate: policy decision based on severities (PASS | FAIL)
- score: heuristic summary for trend/comparison, not statistical truth
Score guidance:
- Good for rough trend tracking across runs
- Not a probability of model success
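To make the relationship between findings, severities, and the gate concrete, here is a minimal sketch in plain Python. It is not preflight's implementation: the helper names `summarize` and `gate_decision`, the dictionary shape of a finding, and the default set of blocking severities are all assumptions chosen for illustration.

```python
from collections import Counter

# Severity levels as listed in the core model above.
SEVERITIES = ("info", "warn", "error", "critical")


def summarize(findings):
    """Count findings per severity (hypothetical helper, not a preflight API)."""
    counts = Counter(f["severity"] for f in findings)
    return {severity: counts.get(severity, 0) for severity in SEVERITIES}


def gate_decision(findings, fail_on=("error", "critical")):
    """Return ("PASS" | "FAIL", blocking findings).

    Assumption: the gate fails when any finding has a severity in `fail_on`.
    A real policy profile may apply additional conditions.
    """
    blocking = [f for f in findings if f["severity"] in fail_on]
    return ("FAIL" if blocking else "PASS"), blocking


# Illustrative findings; check IDs are taken from the example output above.
findings = [
    {"check_id": "completeness.missingness", "severity": "warn"},
    {"check_id": "duplicates.exact", "severity": "info"},
    {"check_id": "balance.class_imbalance", "severity": "info"},
]

gate, blocking = gate_decision(findings)
print(gate, summarize(findings))
```

Under these assumptions a single warn does not block the gate, while passing `fail_on=("warn",)` to the same function would turn the result into FAIL.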
Common workflows
Dataset readiness (single table)
report = preflight.run(df, target="churn", profile="ci-balanced")
Split integrity (train/validation or train/test)
split_report = preflight.run_split(train_df, valid_df, profile="ci-balanced")
Policy profiles
Built-in profiles:
- exploratory: permissive, useful in notebooks
- ci-balanced: practical CI default
- ci-strict: highest sensitivity for blocking conditions
Example:
preflight run data.csv --target churn --profile ci-strict --format json
--fail-on override:
preflight run data.csv --target churn --profile ci-balanced --fail-on error,critical
Policy argument rules:
- Use either --profile or --policy-file (mutually exclusive).
- --fail-on is only supported with --profile.
- Invalid policy/config files fail fast at load time.
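The first two rules above can be illustrated with the standard library's `argparse`. This is only a sketch of how such constraints are commonly expressed, not preflight's actual CLI code; the parser name and the `parse_policy_args` helper are hypothetical.

```python
import argparse


def build_parser():
    """Build a toy parser that mirrors the policy argument rules (illustrative only)."""
    parser = argparse.ArgumentParser(prog="policy-args-sketch")
    # --profile and --policy-file are mutually exclusive.
    source = parser.add_mutually_exclusive_group()
    source.add_argument("--profile")
    source.add_argument("--policy-file")
    parser.add_argument("--fail-on")
    return parser


def parse_policy_args(argv):
    """Parse argv and reject --fail-on unless --profile was given."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.fail_on is not None and args.profile is None:
        # parser.error prints a usage message and exits with status 2.
        parser.error("--fail-on is only supported with --profile")
    return args


args = parse_policy_args(["--profile", "ci-balanced", "--fail-on", "error,critical"])
print(args.profile, args.fail_on.split(","))
```

In this sketch, combining `--profile` with `--policy-file`, or passing `--fail-on` without `--profile`, ends the process through `argparse`'s standard error path.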
CLI reference
# Recommended policy-first commands
preflight run data.csv --target churn --profile ci-balanced --format json
preflight run-split train.csv test.csv --profile ci-balanced --format markdown
# Optional artifacts
preflight run data.csv --target churn --format text --output report.txt --output-html report.html
# Compare against baseline JSON report
preflight compare current.json baseline.json --max-score-drop 3 --fail-on-new-error
# Suppressions
preflight suppress add --file suppressions.json --check-id leakage.high_correlation --reason "known safe"
preflight suppress list --file suppressions.json
preflight suppress validate --file suppressions.json --fail-on-expired
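As a rough mental model for the compare command, the sketch below applies two checks to a pair of report dictionaries: a maximum allowed score drop and a check for error-level findings that are absent from the baseline. The function `compare_reports` is hypothetical, and it assumes that `score` is a plain number and that "new error" covers both error and critical severities; the real command may define these differently.

```python
BLOCKING = ("error", "critical")  # assumption: what counts as a "new error"


def blocking_ids(report):
    """Collect check IDs of error-level findings in a report dictionary."""
    return {f["check_id"] for f in report["findings"] if f["severity"] in BLOCKING}


def compare_reports(current, baseline, max_score_drop=None, fail_on_new_error=False):
    """Return ("PASS" | "FAIL", reasons) for a current report against a baseline."""
    reasons = []
    drop = baseline["score"] - current["score"]
    if max_score_drop is not None and drop > max_score_drop:
        reasons.append(f"score dropped by {drop:.1f} (limit {max_score_drop})")
    if fail_on_new_error:
        new_ids = sorted(blocking_ids(current) - blocking_ids(baseline))
        if new_ids:
            reasons.append("new error findings: " + ", ".join(new_ids))
    return ("FAIL" if reasons else "PASS"), reasons


# Illustrative reports; the numbers and findings are made up.
baseline = {"score": 97.0, "findings": []}
current = {
    "score": 92.5,
    "findings": [{"check_id": "leakage.high_correlation", "severity": "error"}],
}

status, reasons = compare_reports(current, baseline, max_score_drop=3, fail_on_new_error=True)
print(status, reasons)
```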
HTML output
CLI:
preflight run data.csv --target churn --profile ci-balanced --format text --output-html report.html
Python:
report = preflight.run(df, target="churn", profile="ci-balanced")
html = report.to_html()
with open("report.html", "w", encoding="utf-8") as f:
    f.write(html)
This creates a shareable HTML report you can attach to CI artifacts, docs, or review tickets.
Exit codes:
- 0: gate pass
- 2: gate fail or explicit CLI validation failure
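A CI script can branch on these exit codes. The sketch below wraps `subprocess.run` in a hypothetical `run_gate` helper and maps the two documented codes to their meaning. So that it runs without preflight installed, it uses a stand-in command that simply exits with status 2; in a real pipeline the command list would be the preflight invocation.

```python
import subprocess
import sys

# Meanings taken from the exit code list above.
EXIT_MEANINGS = {
    0: "gate pass",
    2: "gate fail or explicit CLI validation failure",
}


def run_gate(command):
    """Run a command and return (exit code, meaning). Hypothetical helper."""
    completed = subprocess.run(command, check=False)
    meaning = EXIT_MEANINGS.get(completed.returncode, "unexpected exit code")
    return completed.returncode, meaning


# Stand-in for a real preflight invocation: a Python process that exits with 2.
stand_in = [sys.executable, "-c", "import sys; sys.exit(2)"]
code, meaning = run_gate(stand_in)
print(code, meaning)
```

Treating any code outside the documented set as unexpected keeps a crashed process from being mistaken for a gate decision.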
Output schema contract
RunReport.to_dict() includes stable contract keys:
- schema_version
- run
- dataset
- gate
- score
- summary
- findings
Per-finding payload includes evidence and explainability fields:
- check_id, title, domain, severity, suppressed
- suggested_action, docs_url
- evidence.metrics, evidence.threshold, evidence.samples
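Downstream tooling can rely on these key names. The following sketch checks a payload for the contract keys and filters for unsuppressed findings at warn level or above. Only the key names come from the contract above; the value shapes in the sample payload (for example `gate` as a string, the `schema_version` value, and the contents of `evidence`) are invented for illustration, and both helper functions are hypothetical.

```python
# Top-level keys named in the output schema contract.
REQUIRED_KEYS = {"schema_version", "run", "dataset", "gate", "score", "summary", "findings"}


def missing_contract_keys(payload):
    """Return the contract keys that are absent from a report dictionary."""
    return sorted(REQUIRED_KEYS - payload.keys())


def actionable_findings(payload, severities=("warn", "error", "critical")):
    """Return unsuppressed findings whose severity is in `severities`."""
    return [
        finding
        for finding in payload["findings"]
        if not finding["suppressed"] and finding["severity"] in severities
    ]


# Hand-written sample; every value here is a placeholder, not real tool output.
payload = {
    "schema_version": "1",
    "run": {},
    "dataset": {},
    "gate": "PASS",
    "score": 97.0,
    "summary": {"info": 1, "warn": 1, "error": 0, "critical": 0},
    "findings": [
        {
            "check_id": "completeness.missingness",
            "title": "Missing values",
            "domain": "completeness",
            "severity": "warn",
            "suppressed": False,
            "suggested_action": "Review columns with missing values.",
            "docs_url": "",
            "evidence": {"metrics": {}, "threshold": {}, "samples": []},
        },
        {
            "check_id": "duplicates.exact",
            "title": "Exact duplicates",
            "domain": "duplicates",
            "severity": "info",
            "suppressed": False,
            "suggested_action": "",
            "docs_url": "",
            "evidence": {"metrics": {}, "threshold": {}, "samples": []},
        },
    ],
}

print(missing_contract_keys(payload))
for finding in actionable_findings(payload):
    print(finding["check_id"], finding["severity"])
```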
Examples
- Realistic workflow notebook:
- Public dataset demo notebook:
- Script demo:
Legacy compatibility
Legacy check(...) and check_split(...) APIs are still available for compatibility, but run(...) and run_split(...) are recommended for policy-first workflows.
Migration status: the policy-first runner now uses native checks for class balance, completeness, leakage, duplicates, distributional health, correlations, and types. Legacy APIs remain supported during migration.
Compatibility namespace:
- preflight.legacy.check(...)
- preflight.legacy.check_split(...)
- preflight.legacy.Report
Development
make env
conda activate preflight
make install-dev
make test
make lint
make typecheck
make build
Supported versions
- Python: 3.9-3.13
- pandas: >=1.3
- numpy: >=1.21
License
Apache-2.0