preflight ✈️

Dataset readiness checker for machine learning. Run a pre-flight checklist on any pandas DataFrame before you train a model.

import preflight

report = preflight.check(df, target="churn")
print(report)

Preflight Report
────────────────────────────────────────────────
Readiness Score: 74/100  ⚠ CAUTION

Completeness
  ✓ No missing values detected across all columns.

Class Balance
  ⚠ Moderate class imbalance: 82%/18% split (`0` vs `1`).

Leakage Detection
  ✗ Potential target leakage — suspiciously high correlation: `signup_date` (r=0.97).
  ⚠ 1 ID-like column(s) detected: `user_id`.

Duplicates
  ✓ No exact duplicate rows.
  ✓ No near-duplicate rows.

Distributional Health
  ✓ No constant columns.
  ⚠ Feature value ranges span 4.2 orders of magnitude.

Feature Correlation
  ✓ No highly correlated feature pairs.

Data Types
  ✓ No object columns appear to be numeric.
  ✓ No mixed-type object columns.

────────────────────────────────────────────────
✓ 7 passed  ⚠ 3 warnings  ✗ 1 failed

Installation

# From PyPI
pip install preflight-data

# From source
git clone https://github.com/preflight-ml/preflight
cd preflight
make install-dev

Conda environment

make env                   # creates 'preflight' conda env
conda activate preflight
make test                  # run full test suite

API

preflight.check(df, target=None) → Report

Run all checks on a DataFrame.

Parameter  Type          Description
df         pd.DataFrame  Dataset to analyse
target     str | None    Name of the label/target column

report = preflight.check(df, target="price")

report.score          # float  0–100
report.verdict        # "READY" | "CAUTION" | "NOT READY"
str(report)           # terminal-friendly summary
report.to_dict()      # machine-readable dict
report.to_markdown()  # markdown for model cards / READMEs

# configurable runtime (fast mode + sampling)
from preflight import PreflightConfig
cfg = PreflightConfig()
report = preflight.check(df, target="price", config=cfg)

preflight.run(df, target=None, profile="exploratory") → RunReport

Policy-first API with explicit gate semantics.

import preflight

run_report = preflight.run(df, target="price", profile="ci-strict")
run_report.gate.status     # PASS | FAIL
run_report.to_dict()       # schema v2 machine output

preflight.check_split(X_train, X_test) → Report

Detect distribution drift between train and test splits:

  • numeric drift via Population Stability Index (PSI)
  • categorical drift via total variation distance (TVD)
  • missingness drift via absolute missing-rate delta

split_report = preflight.check_split(X_train, X_test)
print(split_report)
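
The three drift measures named above can be sketched in plain Python to show what each one computes. This is an illustration only, not preflight's implementation: the function names, the equal-width binning, the `eps` floor, and the use of `None` as the missing marker are all assumptions made for the example.

```python
import math
from collections import Counter


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between two numeric samples.

    Assumption: equal-width bins over the range of `expected`;
    out-of-range values in `actual` are clamped into the edge bins.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for constant columns

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # Floor at eps so empty bins do not produce log(0).
        return [max(c / len(values), eps) for c in counts]

    p, q = proportions(expected), proportions(actual)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


def tvd(expected, actual):
    """Total variation distance between two categorical samples."""
    p, q = Counter(expected), Counter(actual)
    n_p, n_q = len(expected), len(actual)
    return 0.5 * sum(abs(p[k] / n_p - q[k] / n_q) for k in set(p) | set(q))


def missing_rate_delta(expected, actual):
    """Absolute difference in the share of missing (None) values."""
    def rate(values):
        return sum(v is None for v in values) / len(values)
    return abs(rate(expected) - rate(actual))
```

All three return 0 for identical samples and grow as the test split diverges from the train split. The thresholds preflight applies to these values are not documented here, so none are assumed.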

JSON schema stability

report.to_dict() includes a schema_version field so downstream CI/pipeline parsing can version-lock safely.
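
A downstream consumer could version-lock on that field roughly as follows. This is a sketch under assumptions: the helper name `load_report` and the expected value `"2"` are hypothetical, and only the `schema_version` key itself comes from the description above.

```python
import json

# Hypothetical pinned version; check what your installed release actually emits.
EXPECTED_SCHEMA_VERSION = "2"


def load_report(text, expected=EXPECTED_SCHEMA_VERSION):
    """Parse a JSON report and refuse payloads with an unexpected schema."""
    payload = json.loads(text)
    found = str(payload.get("schema_version"))
    if found != expected:
        raise ValueError(
            f"unsupported schema_version {found!r}; expected {expected!r}"
        )
    return payload
```

Failing fast on a mismatch keeps a CI job from silently misreading a report whose layout has changed.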


CLI

# policy-first commands (recommended)
preflight run data.csv --target churn --profile ci-strict --format json
preflight run data.csv --target churn --policy-file policy.json --format json
preflight run data.csv --target churn --config-file config.json --format json
preflight run data.csv --target churn --format markdown --output-html report.html
preflight run-split train.csv test.csv --profile exploratory --format markdown
preflight run data.csv --target churn --profile ci-strict --suppressions suppressions.json
preflight compare current.json baseline.json --max-score-drop 3 --fail-on-new-error --fail-on-domain-increase target_risk=1
preflight suppress add --file suppressions.json --check-id leakage.high_correlation --reason "known safe"
preflight suppress list --file suppressions.json
preflight suppress validate --file suppressions.json --fail-on-expired
preflight plugins doctor --format json

# full dataset readiness check
preflight check data.csv --target churn --format json --output preflight.json

# train/test drift check
preflight check-split train.csv test.csv --format markdown

# fast mode sampling
preflight check data.csv --target churn --mode fast --sample-rows 50000

Suppressions

Policy runs can load suppressions from JSON:

[
  {
    "check_id": "leakage.high_correlation",
    "column": "signup_date",
    "expires": "2026-12-31",
    "reason": "feature excluded from model; tracked in migration plan"
  }
]
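
To make the file format concrete, here is a minimal sketch of how such a file might be read and matched against a finding. It is not preflight's own logic: the helper names are hypothetical, and it assumes that `expires` is inclusive, that an entry without `expires` never expires, and that a suppression matches on `check_id` plus `column`.

```python
import json
from datetime import date


def active_suppressions(text, today):
    """Return the entries whose `expires` date has not yet passed."""
    entries = json.loads(text)
    return [
        e for e in entries
        if "expires" not in e or date.fromisoformat(e["expires"]) >= today
    ]


def is_suppressed(check_id, column, suppressions):
    """True if any entry matches both the check ID and the column."""
    return any(
        s["check_id"] == check_id and s.get("column") == column
        for s in suppressions
    )
```

With the example file above, the `signup_date` leakage finding would be suppressed through 2026-12-31 and reported again afterwards.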

Checks

Category               Check                                            Penalty
Completeness           Overall missing rate                             5–15 pts
Completeness           Per-column missing (>20% warn, >50% fail)        5–30 pts
Class Balance          Majority/minority ratio (>4:1 warn, >9:1 fail)   7–15 pts
Leakage Detection      Correlation to target (>0.85 warn, >0.95 fail)   8–20 pts
Leakage Detection      ID-like columns                                  8 pts
Leakage Detection      Datetime columns                                 8 pts
Leakage Detection      Temporal leakage signal from datetime columns    6–12 pts
Duplicates             Exact duplicate rows                             5–10 pts
Duplicates             Near-duplicate rows                              5–10 pts
Distributional Health  Constant columns                                 5 pts each
Distributional Health  Near-zero variance                               5 pts
Distributional Health  High-cardinality categoricals (>95% unique)      5 pts
Distributional Health  Scale disparity (>4 orders of magnitude)         5 pts
Feature Correlation    Feature pairs with r>0.90                        3–20 pts
Data Types             Numeric stored as object                         5 pts
Data Types             Mixed types in object column                     8 pts
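
The warn/fail thresholds in the table can be read as simple comparisons. The sketch below spells out three of them; the function names and the `"pass"`/`"warn"`/`"fail"` strings are hypothetical, `None` stands in for a missing value, and the code is an illustration rather than what the library runs.

```python
from collections import Counter


def class_balance_severity(labels):
    """Majority/minority ratio: >4:1 warns, >9:1 fails."""
    counts = Counter(labels)
    ratio = max(counts.values()) / min(counts.values())
    if ratio > 9:
        return "fail"
    if ratio > 4:
        return "warn"
    return "pass"


def missing_severity(values):
    """Per-column missing rate: >20% warns, >50% fails."""
    rate = sum(v is None for v in values) / len(values)
    if rate > 0.5:
        return "fail"
    if rate > 0.2:
        return "warn"
    return "pass"


def target_correlation_severity(r):
    """Absolute correlation to target: >0.85 warns, >0.95 fails."""
    if abs(r) > 0.95:
        return "fail"
    if abs(r) > 0.85:
        return "warn"
    return "pass"
```

Applied to the sample report at the top, an 82%/18% split has a ratio of about 4.6:1 and warns, while r=0.97 for `signup_date` exceeds 0.95 and fails.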

Scoring

Score = 100 − Σ(penalties)    clamped to [0, 100]

≥ 85  → READY
60–84 → CAUTION
< 60  → NOT READY
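
A minimal sketch of the formula and verdict bands above, with hypothetical function names and made-up penalty values. It assumes fractional scores between 84 and 85 fall into CAUTION, which the bands above do not spell out.

```python
def score(penalties):
    """100 minus the sum of penalties, clamped to [0, 100]."""
    return max(0.0, min(100.0, 100.0 - sum(penalties)))


def verdict(value):
    """Map a score to the verdict bands listed above."""
    if value >= 85:
        return "READY"
    if value >= 60:
        return "CAUTION"
    return "NOT READY"


# Illustrative penalties only, not taken from a real report.
example = score([7, 8, 5])
print(example, verdict(example))
```

The clamp matters when penalties add up to more than 100: the score bottoms out at 0 instead of going negative.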

Development

git clone https://github.com/preflight-ml/preflight
cd preflight

# Conda setup
make env
conda activate preflight

# Install in editable mode with dev deps
make install-dev

# Run tests
make test          # pytest + coverage
make test-fast     # pytest only
make test-stdlib   # stdlib unittest runner (no pytest needed)

# Lint
make lint

# Build
make build

Project structure

preflight/
├── __init__.py          # check() and check_split() entry points
├── _types.py            # CheckResult dataclass, Severity enum
├── scorer.py            # penalty → score → verdict
├── report.py            # Report class (__str__, to_dict, to_markdown)
└── checks/
    ├── completeness.py
    ├── balance.py
    ├── leakage.py
    ├── duplicates.py
    ├── distributions.py
    ├── correlations.py
    └── types.py
tests/
├── conftest.py          # shared fixtures
├── test_checks.py       # pytest-style tests
└── run_tests.py         # stdlib unittest runner

Adding a check

  1. Create preflight/checks/my_check.py with a run(df, **kwargs) -> list[CheckResult] function.
  2. Import and call it in preflight/__init__.py inside check().
  3. Add tests to tests/test_checks.py and tests/run_tests.py.
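
As a rough illustration of step 1, a new check module might look like the sketch below. The `Severity` and `CheckResult` definitions here are stand-ins so the example runs on its own; in the real package they live in preflight/_types.py and their fields may differ. The check itself (column names padded with whitespace), its ID, and the `penalty` keyword are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):  # stand-in for the enum in preflight/_types.py
    PASS = "pass"
    WARN = "warn"
    FAIL = "fail"


@dataclass
class CheckResult:  # stand-in; the real dataclass fields may differ
    check_id: str
    severity: Severity
    message: str
    penalty: float = 0.0


def run(df, **kwargs) -> list[CheckResult]:
    """Flag column names with leading or trailing whitespace."""
    padded = [c for c in df.columns if isinstance(c, str) and c != c.strip()]
    if not padded:
        return [CheckResult("names.whitespace", Severity.PASS,
                            "No padded column names.")]
    names = ", ".join(repr(c) for c in padded)
    return [CheckResult(
        "names.whitespace",
        Severity.WARN,
        f"{len(padded)} column name(s) with surrounding whitespace: {names}.",
        penalty=float(kwargs.get("penalty", 5.0)),
    )]
```

The function only reads `df.columns`, so it accepts any pandas DataFrame. Returning a list, even for a single finding, matches the `list[CheckResult]` signature in step 1.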

Requirements

  • Python ≥ 3.9
  • pandas ≥ 1.3
  • numpy ≥ 1.21
  • scipy ≥ 1.7 (optional extra: pip install preflight-data[stats])
  • scikit-learn ≥ 1.0 (optional extra: pip install preflight-data[ml])

License

Apache 2.0 — see LICENSE.
