preflight ✈️
Dataset readiness checker for machine learning. Run a pre-flight checklist on any pandas DataFrame before you train a model.
import preflight
report = preflight.check(df, target="churn")
print(report)
Preflight Report
────────────────────────────────────────────────
Readiness Score: 74/100 ⚠ CAUTION
Completeness
✓ No missing values detected across all columns.
Class Balance
⚠ Moderate class imbalance: 82%/18% split (`0` vs `1`).
Leakage Detection
✗ Potential target leakage — suspiciously high correlation: `signup_date` (r=0.97).
⚠ 1 ID-like column(s) detected: `user_id`.
Duplicates
✓ No exact duplicate rows.
✓ No near-duplicate rows.
Distributional Health
✓ No constant columns.
⚠ Feature value ranges span 4.2 orders of magnitude.
Feature Correlation
✓ No highly correlated feature pairs.
Data Types
✓ No object columns appear to be numeric.
✓ No mixed-type object columns.
────────────────────────────────────────────────
✓ 7 passed ⚠ 3 warnings ✗ 1 failed
Installation
# From PyPI
pip install preflight-data
# From source
git clone https://github.com/preflight-ml/preflight
cd preflight
make install-dev
Conda environment
make env # creates 'preflight' conda env
conda activate preflight
make test # run full test suite
API
preflight.check(df, target=None) → Report
Run all checks on a DataFrame.
| Parameter | Type | Description |
|---|---|---|
| `df` | `pd.DataFrame` | Dataset to analyse |
| `target` | `str \| None` | Name of the label/target column |
report = preflight.check(df, target="price")
report.score # float 0–100
report.verdict # "READY" | "CAUTION" | "NOT READY"
str(report) # terminal-friendly summary
report.to_dict() # machine-readable dict
report.to_markdown() # markdown for model cards / READMEs
# configurable runtime (fast mode + sampling)
from preflight import PreflightConfig
cfg = PreflightConfig()
report = preflight.check(df, target="price", config=cfg)
preflight.run(df, target=None, profile="exploratory") → RunReport
Policy-first API with explicit gate semantics.
import preflight
run_report = preflight.run(df, target="price", profile="ci-strict")
run_report.gate.status # PASS | FAIL
run_report.to_dict() # schema v2 machine output
preflight.check_split(X_train, X_test) → Report
Detect distribution drift between train and test splits:
- numeric drift via Population Stability Index (PSI)
- categorical drift via total variation distance (TVD)
- missingness drift via absolute missing-rate delta
split_report = preflight.check_split(X_train, X_test)
print(split_report)
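The three drift measures listed above can be sketched in plain Python. This is an illustration only, not preflight's implementation: the function names `psi`, `tvd`, and `missing_rate_delta`, the 10 quantile bins, the `eps` floor, and the use of `None` as the missing marker are all assumptions made for this example.

```python
import math
from bisect import bisect_right
from collections import Counter


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index of `actual` relative to `expected`."""
    ref = sorted(expected)
    # Interior bin edges at evenly spaced quantiles of the reference sample.
    edges = sorted({ref[len(ref) * i // bins] for i in range(1, bins)})

    def shares(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            counts[bisect_right(edges, v)] += 1
        # Floor each share at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def tvd(expected, actual):
    """Total variation distance between two categorical samples."""
    e, a = Counter(expected), Counter(actual)
    categories = e.keys() | a.keys()
    return 0.5 * sum(
        abs(e[k] / len(expected) - a[k] / len(actual)) for k in categories
    )


def missing_rate_delta(expected, actual):
    """Absolute difference in the share of missing (None) values."""
    def rate(values):
        return sum(v is None for v in values) / len(values)

    return abs(rate(expected) - rate(actual))


train = [float(i) for i in range(100)]
test = [v + 30.0 for v in train]  # shifted copy, so PSI is positive
print(psi(train, test))
print(tvd(["a", "a", "b", "b"], ["a", "a", "a", "b"]))  # 0.25
print(missing_rate_delta([1, None, 3, 4], [None, None, 3, 4]))  # 0.25
```

All three functions expect non-empty lists. Identical samples give 0 for each measure, and larger values indicate stronger drift.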
JSON schema stability
report.to_dict() includes a schema_version field so downstream CI/pipeline parsing can version-lock safely.
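A downstream parser might guard on `schema_version` before reading anything else. The sketch below assumes the report was serialized to JSON; the accepted version value `"2"` and the helper name `load_report` are hypothetical and should be adjusted to whatever your installed version actually emits.

```python
import json

# Hypothetical: the set of schema versions this parser was written against.
SUPPORTED_SCHEMA_VERSIONS = {"2"}


def load_report(text: str) -> dict:
    """Parse a serialized report and refuse unknown schema versions."""
    payload = json.loads(text)
    version = str(payload.get("schema_version"))
    if version not in SUPPORTED_SCHEMA_VERSIONS:
        raise ValueError(f"unsupported schema_version: {version!r}")
    return payload


report = load_report('{"schema_version": "2"}')
print(report["schema_version"])
```

Failing fast on an unexpected version keeps a CI job from silently misreading a report whose layout has changed.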
CLI
# policy-first commands (recommended)
preflight run data.csv --target churn --profile ci-strict --format json
preflight run data.csv --target churn --policy-file policy.json --format json
preflight run data.csv --target churn --config-file config.json --format json
preflight run data.csv --target churn --format markdown --output-html report.html
preflight run-split train.csv test.csv --profile exploratory --format markdown
preflight run data.csv --target churn --profile ci-strict --suppressions suppressions.json
preflight compare current.json baseline.json --max-score-drop 3 --fail-on-new-error --fail-on-domain-increase target_risk=1
preflight suppress add --file suppressions.json --check-id leakage.high_correlation --reason "known safe"
preflight suppress list --file suppressions.json
preflight suppress validate --file suppressions.json --fail-on-expired
preflight plugins doctor --format json
# full dataset readiness check
preflight check data.csv --target churn --format json --output preflight.json
# train/test drift check
preflight check-split train.csv test.csv --format markdown
# fast mode sampling
preflight check data.csv --target churn --mode fast --sample-rows 50000
Suppressions
Policy runs can load suppressions from JSON:
[
{
"check_id": "leakage.high_correlation",
"column": "signup_date",
"expires": "2026-12-31",
"reason": "feature excluded from model; tracked in migration plan"
}
]
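To show how such a file could be consumed, here is a minimal sketch that drops expired entries and matches a finding against the rest. It is not preflight's own loader: the helper names are hypothetical, and treating `expires` as an inclusive date and a missing `column` as "applies to every column" are assumptions.

```python
import json
from datetime import date


def active_suppressions(text: str, today: date) -> list[dict]:
    """Return the entries whose `expires` date has not passed yet."""
    entries = json.loads(text)
    # Assumption: a suppression is still valid on its expiry date.
    return [e for e in entries if date.fromisoformat(e["expires"]) >= today]


def is_suppressed(check_id: str, column: str, suppressions: list[dict]) -> bool:
    """True if any entry matches the check and (optionally) the column."""
    return any(
        s["check_id"] == check_id and s.get("column") in (None, column)
        for s in suppressions
    )


SAMPLE = """
[
  {
    "check_id": "leakage.high_correlation",
    "column": "signup_date",
    "expires": "2026-12-31",
    "reason": "feature excluded from model; tracked in migration plan"
  }
]
"""

active = active_suppressions(SAMPLE, today=date(2026, 6, 1))
print(is_suppressed("leakage.high_correlation", "signup_date", active))  # True
```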
Checks
| Category | Check | Penalty |
|---|---|---|
| Completeness | Overall missing rate | 5–15 pts |
| | Per-column missing (>20% warn, >50% fail) | 5–30 pts |
| Class Balance | Majority/minority ratio (>4:1 warn, >9:1 fail) | 7–15 pts |
| Leakage Detection | Correlation to target (>0.85 warn, >0.95 fail) | 8–20 pts |
| | ID-like columns | 8 pts |
| | Datetime columns | 8 pts |
| | Temporal leakage signal from datetime columns | 6–12 pts |
| Duplicates | Exact duplicate rows | 5–10 pts |
| | Near-duplicate rows | 5–10 pts |
| Distributional Health | Constant columns | 5 pts each |
| | Near-zero variance | 5 pts |
| | High-cardinality categoricals (>95% unique) | 5 pts |
| | Scale disparity (>4 orders of magnitude) | 5 pts |
| Feature Correlation | Feature pairs with r>0.90 | 3–20 pts |
| Data Types | Numeric stored as object | 5 pts |
| | Mixed types in object column | 8 pts |
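As an illustration of how a threshold row in this table translates into a check, the class balance rule (>4:1 warn, >9:1 fail) can be sketched as below. The function names and the `"pass"`/`"warn"`/`"fail"` labels are assumptions for this example, and raising on a single-class target is a choice made here, not documented behaviour.

```python
from collections import Counter


def imbalance_ratio(labels) -> float:
    """Majority class count divided by minority class count."""
    counts = Counter(labels)
    if len(counts) < 2:
        # Assumption: a single-class target is treated as an error here.
        raise ValueError("need at least two classes to compute a ratio")
    return max(counts.values()) / min(counts.values())


def class_balance_severity(labels, warn_ratio=4.0, fail_ratio=9.0) -> str:
    """Map the majority/minority ratio onto the thresholds in the table."""
    ratio = imbalance_ratio(labels)
    if ratio > fail_ratio:
        return "fail"
    if ratio > warn_ratio:
        return "warn"
    return "pass"


# An 82%/18% split has a ratio of about 4.6:1, which lands in the warn band.
labels = [0] * 82 + [1] * 18
print(class_balance_severity(labels))  # warn
```

Because the thresholds are strict inequalities, an exact 80%/20% split (4:1) still passes under this reading.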
Scoring
Score = 100 − Σ(penalties) clamped to [0, 100]
≥ 85 → READY
60–84 → CAUTION
< 60 → NOT READY
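The scoring rule is small enough to write out directly. The sketch below follows the formula and bands above; the function names and the sample penalty list are hypothetical, and treating a fractional score between 84 and 85 as CAUTION is an assumption.

```python
from typing import Iterable


def readiness_score(penalties: Iterable[float]) -> float:
    """Score = 100 - sum(penalties), clamped to [0, 100]."""
    return max(0.0, min(100.0, 100.0 - sum(penalties)))


def verdict(score: float) -> str:
    """Map a score onto the READY / CAUTION / NOT READY bands."""
    if score >= 85:
        return "READY"
    if score >= 60:
        return "CAUTION"
    return "NOT READY"


# Hypothetical penalties from four findings.
penalties = [7, 12, 8, 5]
score = readiness_score(penalties)
print(score, verdict(score))  # 68.0 CAUTION
```

The clamp matters once penalties exceed 100 points in total: the score bottoms out at 0 rather than going negative.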
Development
git clone https://github.com/preflight-ml/preflight
cd preflight
# Conda setup
make env
conda activate preflight
# Install in editable mode with dev deps
make install-dev
# Run tests
make test # pytest + coverage
make test-fast # pytest only
make test-stdlib # stdlib unittest runner (no pytest needed)
# Lint
make lint
# Build
make build
Project structure
preflight/
├── __init__.py # check() and check_split() entry points
├── _types.py # CheckResult dataclass, Severity enum
├── scorer.py # penalty → score → verdict
├── report.py # Report class (__str__, to_dict, to_markdown)
└── checks/
    ├── completeness.py
    ├── balance.py
    ├── leakage.py
    ├── duplicates.py
    ├── distributions.py
    ├── correlations.py
    └── types.py
tests/
├── conftest.py # shared fixtures
├── test_checks.py # pytest-style tests
└── run_tests.py # stdlib unittest runner
Adding a check
- Create `preflight/checks/my_check.py` with a `run(df, **kwargs) -> list[CheckResult]` function.
- Import and call it in `preflight/__init__.py` inside `check()`.
- Add tests to `tests/test_checks.py` and `tests/run_tests.py`.
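A new check module following the steps above might look roughly like this. The `CheckResult` dataclass here is a hypothetical stand-in defined only so the example runs on its own; the real class lives in `preflight/_types.py` and its fields may differ. The check id, severity strings, and default penalty are likewise assumptions.

```python
from dataclasses import dataclass

import pandas as pd


@dataclass
class CheckResult:
    """Hypothetical stand-in for preflight's CheckResult; fields are assumed."""
    check_id: str
    severity: str
    message: str
    penalty: float = 0.0


def run(df: pd.DataFrame, **kwargs) -> list[CheckResult]:
    """Flag columns in which every value is missing."""
    empty = [str(col) for col in df.columns if df[col].isna().all()]
    if not empty:
        return [
            CheckResult("my_check.empty_columns", "pass", "No fully empty columns.")
        ]
    return [
        CheckResult(
            check_id="my_check.empty_columns",
            severity="warning",
            message=f"{len(empty)} fully empty column(s): {', '.join(empty)}.",
            penalty=kwargs.get("penalty", 5.0),
        )
    ]


df = pd.DataFrame({"a": [1, 2], "b": [None, None]})
for result in run(df):
    print(result.severity, result.message)
```

Returning a list even for a single finding keeps the signature uniform, so `check()` can concatenate the results of every module in the same way.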
Requirements
- Python ≥ 3.9
- pandas ≥ 1.3
- numpy ≥ 1.21
- scipy ≥ 1.7 (optional extra: `pip install preflight-data[stats]`)
- scikit-learn ≥ 1.0 (optional extra: `pip install preflight-data[ml]`)
License
Apache 2.0 — see LICENSE.