preflight ✈️

Dataset readiness checker for machine learning. Run a pre-flight checklist on any pandas DataFrame before you train a model.

import preflight

report = preflight.check(df, target="churn")
print(report)

Preflight Report
────────────────────────────────────────────────
Readiness Score: 74/100  ⚠ CAUTION

Completeness
  ✓ No missing values detected across all columns.

Class Balance
  ⚠ Moderate class imbalance: 82%/18% split (`0` vs `1`).

Leakage Detection
  ✗ Potential target leakage — suspiciously high correlation: `signup_date` (r=0.97).
  ⚠ 1 ID-like column(s) detected: `user_id`.

Duplicates
  ✓ No exact duplicate rows.
  ✓ No near-duplicate rows.

Distributional Health
  ✓ No constant columns.
  ⚠ Feature value ranges span 4.2 orders of magnitude.

Feature Correlation
  ✓ No highly correlated feature pairs.

Data Types
  ✓ No object columns appear to be numeric.
  ✓ No mixed-type object columns.

────────────────────────────────────────────────
✓ 7 passed  ⚠ 3 warnings  ✗ 1 failed

Installation

# From PyPI
pip install preflight-data

# From source
git clone https://github.com/preflight-ml/preflight
cd preflight
make install-dev

Conda environment

make env                   # creates 'preflight' conda env
conda activate preflight
make test                  # run full test suite

API

`preflight.check(df, target=None) → Report`

Run all checks on a DataFrame.

| Parameter | Type | Description |
| --- | --- | --- |
| `df` | `pd.DataFrame` | Dataset to analyse |
| `target` | `str \| None` | Name of the label/target column |

report = preflight.check(df, target="price")

report.score          # float  0–100
report.verdict        # "READY" | "CAUTION" | "NOT READY"
str(report)           # terminal-friendly summary
report.to_dict()      # machine-readable dict
report.to_markdown()  # markdown for model cards / READMEs

# optional runtime configuration (fast mode, sampling); defaults are used here
from preflight import PreflightConfig
cfg = PreflightConfig()
report = preflight.check(df, target="price", config=cfg)

`preflight.run(df, target=None, profile="exploratory") → RunReport`

Policy-first API: runs the checks under a named profile and reports an explicit PASS/FAIL gate.

import preflight

run_report = preflight.run(df, target="price", profile="ci-strict")
run_report.gate.status     # PASS | FAIL
run_report.to_dict()       # schema v2 machine output

`preflight.check_split(X_train, X_test) → Report`

Detect distribution drift between train and test splits:

- numeric drift via Population Stability Index (PSI)
- categorical drift via total variation distance (TVD)
- missingness drift via absolute missing-rate delta

split_report = preflight.check_split(X_train, X_test)
print(split_report)
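To make the three drift measures concrete, here is a minimal standard-library sketch of the textbook formulas. It is not preflight's implementation: the equal-width binning, the `eps` floor for empty bins, and the use of `None` for missing values are all assumptions made for illustration.

```python
import math
from collections import Counter


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between two numeric samples.

    Bin edges are equal-width over the range of `expected`; empty bins
    are floored at `eps` so the logarithm stays finite.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for a constant sample

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[max(0, min(bins - 1, idx))] += 1  # clamp out-of-range values
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def tvd(expected, actual):
    """Total variation distance between two categorical samples."""
    e, a = Counter(expected), Counter(actual)
    categories = set(e) | set(a)
    return 0.5 * sum(
        abs(e[c] / len(expected) - a[c] / len(actual)) for c in categories
    )


def missing_rate_delta(expected, actual):
    """Absolute difference in the share of missing (None) values."""
    def rate(values):
        return sum(v is None for v in values) / len(values)
    return abs(rate(expected) - rate(actual))


train_plan = ["a", "a", "b", "b"]
test_plan = ["a", "a", "a", "b"]
print(tvd(train_plan, test_plan))
```

All three measures are zero for identical samples and grow as the test split moves away from the training split.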

JSON schema stability

`report.to_dict()` includes a `schema_version` field, so downstream CI and pipeline parsers can pin the schema version they were written for.
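A downstream parser can use that field to fail fast on output it does not understand. The sketch below is an assumption-heavy illustration: the source does not say what `schema_version` values look like, so the `"2"` used here is a placeholder, and `parse_report` is a hypothetical helper rather than part of preflight.

```python
# Hypothetical set of versions this parser understands; check real output
# from report.to_dict() for the actual value format.
SUPPORTED_SCHEMA_VERSIONS = {"2"}


def parse_report(payload):
    """Return the payload unchanged, or raise if its schema is unsupported."""
    version = str(payload.get("schema_version"))
    if version not in SUPPORTED_SCHEMA_VERSIONS:
        raise ValueError(f"unsupported schema_version: {version!r}")
    return payload


# Stand-in payload; a real one would come from report.to_dict().
report_dict = parse_report({"schema_version": "2"})
print(report_dict["schema_version"])
```

Rejecting unknown versions up front keeps a CI job from silently misreading fields that were renamed or removed in a later schema.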

CLI

# policy-first commands (recommended)
preflight run data.csv --target churn --profile ci-strict --format json
preflight run data.csv --target churn --policy-file policy.json --format json
preflight run data.csv --target churn --config-file config.json --format json
preflight run data.csv --target churn --format markdown --output-html report.html
preflight run-split train.csv test.csv --profile exploratory --format markdown
preflight run data.csv --target churn --profile ci-strict --suppressions suppressions.json
preflight compare current.json baseline.json --max-score-drop 3 --fail-on-new-error --fail-on-domain-increase target_risk=1
preflight suppress add --file suppressions.json --check-id leakage.high_correlation --reason "known safe"
preflight suppress list --file suppressions.json
preflight suppress validate --file suppressions.json --fail-on-expired
preflight plugins doctor --format json

# full dataset readiness check
preflight check data.csv --target churn --format json --output preflight.json

# train/test drift check
preflight check-split train.csv test.csv --format markdown

# fast mode sampling
preflight check data.csv --target churn --mode fast --sample-rows 50000

Suppressions

Policy runs can load suppressions from JSON:

[
  {
    "check_id": "leakage.high_correlation",
    "column": "signup_date",
    "expires": "2026-12-31",
    "reason": "feature excluded from model; tracked in migration plan"
  }
]
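As a rough illustration of how such a file could be filtered, the sketch below loads entries shaped like the example above and keeps the ones that have not expired. The required-key set and the rule that an entry stays active through its `expires` date are assumptions, and `active_suppressions` is a hypothetical helper, not preflight's own logic.

```python
import json
from datetime import date

# Assumption: the minimum fields a suppression entry needs.
REQUIRED_KEYS = {"check_id", "reason"}


def active_suppressions(raw_json, today):
    """Return suppression entries that are well-formed and not yet expired."""
    active = []
    for entry in json.loads(raw_json):
        missing = REQUIRED_KEYS - set(entry)
        if missing:
            raise ValueError(f"suppression missing keys: {sorted(missing)}")
        expires = entry.get("expires")
        # Assumption: an entry remains active through its expiry date.
        if expires is None or date.fromisoformat(expires) >= today:
            active.append(entry)
    return active


raw = """
[
  {
    "check_id": "leakage.high_correlation",
    "column": "signup_date",
    "expires": "2026-12-31",
    "reason": "feature excluded from model; tracked in migration plan"
  }
]
"""
print(len(active_suppressions(raw, date(2026, 6, 1))))
```

Passing `today` explicitly, rather than calling `date.today()` inside the function, keeps the expiry behaviour easy to test.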

Checks

| Category | Check | Penalty |
| --- | --- | --- |
| Completeness | Overall missing rate | 5–15 pts |
| | Per-column missing (>20% warn, >50% fail) | 5–30 pts |
| Class Balance | Majority/minority ratio (>4:1 warn, >9:1 fail) | 7–15 pts |
| Leakage Detection | Correlation to target (>0.85 warn, >0.95 fail) | 8–20 pts |
| | ID-like columns | 8 pts |
| | Datetime columns | 8 pts |
| | Temporal leakage signal from datetime columns | 6–12 pts |
| Duplicates | Exact duplicate rows | 5–10 pts |
| | Near-duplicate rows | 5–10 pts |
| Distributional Health | Constant columns | 5 pts each |
| | Near-zero variance | 5 pts |
| | High-cardinality categoricals (>95% unique) | 5 pts |
| | Scale disparity (>4 orders of magnitude) | 5 pts |
| Feature Correlation | Feature pairs with r>0.90 | 3–20 pts |
| Data Types | Numeric stored as object | 5 pts |
| | Mixed types in object column | 8 pts |
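To show how a warn/fail threshold pair from the table can be applied, here is a small sketch of the class-balance rule (>4:1 warn, >9:1 fail). The function name, its return shape, and the handling of a single-class target are assumptions for illustration, not preflight's code.

```python
from collections import Counter


def class_balance_severity(labels, warn_ratio=4.0, fail_ratio=9.0):
    """Classify imbalance from the majority/minority class ratio."""
    counts = Counter(labels)
    if len(counts) < 2:
        # Assumption: a target with a single class is treated as a failure.
        return "fail", float("inf")
    ratio = max(counts.values()) / min(counts.values())
    if ratio > fail_ratio:
        return "fail", ratio
    if ratio > warn_ratio:
        return "warn", ratio
    return "pass", ratio


# The 82%/18% split from the sample report at the top of this page.
labels = [0] * 82 + [1] * 18
severity, ratio = class_balance_severity(labels)
print(severity, round(ratio, 2))  # warn 4.56
```

An 82/18 split gives a ratio of about 4.56:1, which lands between the two thresholds and matches the warning in the sample report.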

Scoring

Score = 100 − Σ(penalties)    clamped to [0, 100]

≥ 85  → READY
60–84 → CAUTION
< 60  → NOT READY
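The formula and thresholds above can be sketched in a few lines. The function names are hypothetical, the penalties in the example are made up, and treating a fractional score between 84 and 85 as CAUTION is an assumption, since the thresholds are listed for whole numbers only.

```python
def readiness_score(penalties):
    """Return 100 minus the summed penalties, clamped to [0, 100]."""
    return max(0.0, min(100.0, 100.0 - sum(penalties)))


def verdict(score):
    """Map a score to a verdict label.

    Assumption: any score below 85 and at or above 60 is CAUTION.
    """
    if score >= 85:
        return "READY"
    if score >= 60:
        return "CAUTION"
    return "NOT READY"


# Hypothetical penalties, chosen only to exercise the arithmetic.
example_penalties = [7, 8, 5]
example_score = readiness_score(example_penalties)
print(example_score, verdict(example_score))  # 80.0 CAUTION
```

The clamp matters when penalties add up to more than 100: the score bottoms out at 0 instead of going negative.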

Development

git clone https://github.com/preflight-ml/preflight
cd preflight

# Conda setup
make env
conda activate preflight

# Install in editable mode with dev deps
make install-dev

# Run tests
make test          # pytest + coverage
make test-fast     # pytest only
make test-stdlib   # stdlib unittest runner (no pytest needed)

# Lint
make lint

# Build
make build

Project structure

preflight/
├── __init__.py          # check() and check_split() entry points
├── _types.py            # CheckResult dataclass, Severity enum
├── scorer.py            # penalty → score → verdict
├── report.py            # Report class (__str__, to_dict, to_markdown)
└── checks/
    ├── completeness.py
    ├── balance.py
    ├── leakage.py
    ├── duplicates.py
    ├── distributions.py
    ├── correlations.py
    └── types.py
tests/
├── conftest.py          # shared fixtures
├── test_checks.py       # pytest-style tests
└── run_tests.py         # stdlib unittest runner

Adding a check

1. Create `preflight/checks/my_check.py` with a `run(df, **kwargs) -> list[CheckResult]` function.
2. Import and call it in `preflight/__init__.py` inside `check()`.
3. Add tests to `tests/test_checks.py` and `tests/run_tests.py`.
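The steps above might look roughly like the sketch below. The `Severity` and `CheckResult` definitions are stand-ins written for this example, because the real fields in `preflight/_types.py` are not documented here; the check itself and its `naming.padded_columns` id are hypothetical too.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):  # stand-in for preflight._types.Severity
    PASS = "pass"
    WARN = "warn"


@dataclass
class CheckResult:  # stand-in; the real dataclass fields may differ
    check_id: str
    severity: Severity
    message: str


def run(df, **kwargs):
    """Flag columns whose name has leading or trailing whitespace.

    `df` only needs to yield column names when iterated, which a pandas
    DataFrame does (and so does the plain dict used in the demo below).
    """
    padded = [c for c in df if isinstance(c, str) and c != c.strip()]
    if padded:
        return [CheckResult(
            "naming.padded_columns",
            Severity.WARN,
            f"{len(padded)} column name(s) have stray whitespace: {padded}",
        )]
    return [CheckResult(
        "naming.padded_columns",
        Severity.PASS,
        "No column names with stray whitespace.",
    )]


# A dict of columns stands in for a DataFrame so the sketch has no dependencies.
results = run({"age": [31, 45], " income": [52000, 61000]})
print(results[0].severity.name, results[0].message)
```

In the real package the module would import `CheckResult` and `Severity` from `preflight/_types.py` instead of defining them locally.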

Requirements

- Python ≥ 3.9
- pandas ≥ 1.3
- numpy ≥ 1.21
- scipy ≥ 1.7 (optional extra: `pip install "preflight-data[stats]"`)
- scikit-learn ≥ 1.0 (optional extra: `pip install "preflight-data[ml]"`)

License

Apache 2.0 — see LICENSE.

Release history

- 0.1.1 (Mar 28, 2026)
- 0.1.0 (Mar 27, 2026, this version)

Download files

Source Distribution

preflight_data-0.1.0.tar.gz (70.4 kB)

Uploaded Mar 27, 2026 Source

Built Distribution

preflight_data-0.1.0-py3-none-any.whl (75.3 kB)

Uploaded Mar 27, 2026 Python 3

File details

Details for the file preflight_data-0.1.0.tar.gz.

File metadata

Download URL: preflight_data-0.1.0.tar.gz
Upload date: Mar 27, 2026
Size: 70.4 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for preflight_data-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`2b6533452c7fb58df967de36e62715f660a82527f4d6cf7dbc7e149b426fa4f6`
MD5	`da1fe660183625724943ebac2b4bb450`
BLAKE2b-256	`029bdcad3988ac1cc0f4acbdd2f06c4012ce7f5937c12d627ba1c121c748dfac`

Provenance

The following attestation bundles were made for preflight_data-0.1.0.tar.gz:

Publisher: ci.yml on ryan-wolbeck/preflight

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: preflight_data-0.1.0.tar.gz
- Subject digest: 2b6533452c7fb58df967de36e62715f660a82527f4d6cf7dbc7e149b426fa4f6
- Sigstore transparency entry: 1189446530
- Sigstore integration time: Mar 27, 2026
Source repository:
- Permalink: ryan-wolbeck/preflight@6c6e46e00f6275d54be6e6333dde76c968cb3e95
- Branch / Tag: refs/tags/0.1.0
- Owner: https://github.com/ryan-wolbeck
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: ci.yml@6c6e46e00f6275d54be6e6333dde76c968cb3e95
- Trigger Event: release

File details

Details for the file preflight_data-0.1.0-py3-none-any.whl.

File metadata

Download URL: preflight_data-0.1.0-py3-none-any.whl
Upload date: Mar 27, 2026
Size: 75.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for preflight_data-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`c2ec6e4b192b59da5292c9b8e8a19ee4dbfddc3440411af2f754f2a824a7b14f`
MD5	`0e5448e9317d157450b38bf9ec131081`
BLAKE2b-256	`7660b6fb451ed5f029b65138ce7aca3213fd240afd3399f668eeda751cf09bc9`

Provenance

The following attestation bundles were made for preflight_data-0.1.0-py3-none-any.whl:

Publisher: ci.yml on ryan-wolbeck/preflight

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: preflight_data-0.1.0-py3-none-any.whl
- Subject digest: c2ec6e4b192b59da5292c9b8e8a19ee4dbfddc3440411af2f754f2a824a7b14f
- Sigstore transparency entry: 1189446532
- Sigstore integration time: Mar 27, 2026
Source repository:
- Permalink: ryan-wolbeck/preflight@6c6e46e00f6275d54be6e6333dde76c968cb3e95
- Branch / Tag: refs/tags/0.1.0
- Owner: https://github.com/ryan-wolbeck
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: ci.yml@6c6e46e00f6275d54be6e6333dde76c968cb3e95
- Trigger Event: release
