Dataset readiness checker for machine learning — run a pre-flight checklist on any DataFrame before training.

Dataset readiness checks for ML pipelines. Use preflight to catch data blockers before training and deployment.

Why preflight

  • Runs fast checks for data quality, target risk, schema/type issues, split integrity, and statistical anomalies.
  • Produces machine-readable findings and a CI gate decision (PASS/FAIL).
  • Keeps output stable with schema versioning for downstream tooling.

Quickstart

Install:

pip install preflight-data

Python API:

import pandas as pd
import preflight

df = pd.read_csv("data.csv")
report = preflight.run(df, target="churn", profile="ci-balanced")

print(report)

CLI:

preflight run data.csv --target churn --profile ci-balanced --format text

Example output:

Preflight Run Report
────────────────────────────────────────────────────────
Gate: PASS
Heuristic Score: 97.0/100
Profile: ci-balanced
Dataset: 1000/1000 rows analyzed across 6 columns
Target: churn
Summary: 8 info, 1 warn, 0 error, 0 critical

Gate reasons:
- No findings met fail conditions for this profile

Findings:
- [WARN] completeness.missingness: Overall missingness is 1.8% (108 cells missing across dataset). (confidence=0.95)
- [INFO] duplicates.exact: No exact duplicate rows detected. (confidence=0.95)
- [INFO] balance.class_imbalance: Class distribution is within configured tolerance. (confidence=0.90)

HTML report preview

To see what the generated HTML report looks like, open the example notebook, which contains rendered report.to_html() output cells.

Core model

  • finding: one detected issue or advisory signal
  • severity: info | warn | error | critical
  • gate: policy decision based on severities (PASS | FAIL)
  • score: heuristic summary for trend/comparison, not statistical truth
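
The relationship between severities and the gate can be sketched in a few lines. This is an illustrative model only, not preflight's implementation: the function name gate_decision, the fail_on argument, and the dict shape of a finding are all assumptions made for the example.

```python
# Illustrative sketch only: NOT preflight's implementation.
# gate_decision and its arguments are hypothetical names.
SEVERITIES = ("info", "warn", "error", "critical")


def gate_decision(findings, fail_on=("error", "critical")):
    """Return "FAIL" if any finding has a severity listed in fail_on, else "PASS"."""
    unknown = set(fail_on) - set(SEVERITIES)
    if unknown:
        raise ValueError(f"unknown severities: {sorted(unknown)}")
    blocking = [f for f in findings if f["severity"] in fail_on]
    return "FAIL" if blocking else "PASS"


findings = [
    {"check_id": "completeness.missingness", "severity": "warn"},
    {"check_id": "duplicates.exact", "severity": "info"},
]
print(gate_decision(findings))                                         # PASS
print(gate_decision(findings, fail_on=("warn", "error", "critical")))  # FAIL
```

The same findings pass or fail depending on the policy, which is why the gate is described as a policy decision rather than a property of the data.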

Score guidance:

  • Good for rough trend tracking across runs
  • Not a probability of model success

Common workflows

Dataset readiness (single table)

report = preflight.run(df, target="churn", profile="ci-balanced")

Split integrity (train/validation or train/test)

split_report = preflight.run_split(train_df, valid_df, profile="ci-balanced")

Policy profiles

Built-in profiles:

  • exploratory: permissive, useful in notebooks
  • ci-balanced: practical CI default
  • ci-strict: highest sensitivity for blocking conditions

Example:

preflight run data.csv --target churn --profile ci-strict --format json

--fail-on override:

preflight run data.csv --target churn --profile ci-balanced --fail-on error,critical

Policy argument rules:

  • Use either --profile or --policy-file (mutually exclusive).
  • --fail-on is only supported with --profile.
  • Invalid policy/config files fail fast at load time.
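
The first two rules above can be modelled with the standard library's argparse. This is a sketch of the rules as stated, not preflight's actual CLI code. The helper names build_parser and parse_policy_args are hypothetical, and the sketch assumes --fail-on is rejected whenever --profile is not given explicitly.

```python
import argparse


def build_parser():
    # Hypothetical parser that models only the three policy flags.
    parser = argparse.ArgumentParser(prog="policy-args-sketch")
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--profile")
    group.add_argument("--policy-file")
    parser.add_argument("--fail-on")
    return parser


def parse_policy_args(argv):
    """Parse argv and enforce that --fail-on is only used with --profile."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.fail_on is not None and args.profile is None:
        # parser.error prints a usage message and raises SystemExit(2).
        parser.error("--fail-on is only supported with --profile")
    return args


args = parse_policy_args(["--profile", "ci-balanced", "--fail-on", "error,critical"])
print(args.profile, args.fail_on)
```

Passing both --profile and --policy-file, or --fail-on together with --policy-file, makes this sketch exit with an argparse usage error.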

CLI reference

# Recommended policy-first commands
preflight run data.csv --target churn --profile ci-balanced --format json
preflight run-split train.csv test.csv --profile ci-balanced --format markdown

# Optional artifacts
preflight run data.csv --target churn --format text --output report.txt --output-html report.html

# Compare against baseline JSON report
preflight compare current.json baseline.json --max-score-drop 3 --fail-on-new-error

# Suppressions
preflight suppress add --file suppressions.json --check-id leakage.high_correlation --reason "known safe"
preflight suppress list --file suppressions.json
preflight suppress validate --file suppressions.json --fail-on-expired
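
As a rough illustration of what a baseline comparison with --max-score-drop and --fail-on-new-error might decide, here is a minimal sketch. It is not the logic of preflight compare. The function compare_reports, the simplified input keys score and error_check_ids, the strict "greater than" threshold, and the definition of a "new error" as an error-level check ID absent from the baseline are all assumptions.

```python
def compare_reports(current, baseline, max_score_drop=3.0, fail_on_new_error=True):
    """Hypothetical comparison: return (decision, reasons) for two simplified reports."""
    reasons = []

    # Assumed rule: fail when the score fell by more than the allowed drop.
    drop = baseline["score"] - current["score"]
    if drop > max_score_drop:
        reasons.append(f"score dropped by {drop:.1f} (max allowed {max_score_drop:.1f})")

    # Assumed rule: fail when an error-level check appears that the baseline lacked.
    if fail_on_new_error:
        new_errors = sorted(current["error_check_ids"] - baseline["error_check_ids"])
        if new_errors:
            reasons.append("new errors: " + ", ".join(new_errors))

    return ("FAIL" if reasons else "PASS"), reasons


# Simplified stand-ins, not the JSON report layout.
baseline = {"score": 97.0, "error_check_ids": set()}
current = {"score": 92.5, "error_check_ids": {"leakage.high_correlation"}}

decision, reasons = compare_reports(current, baseline)
print(decision)
for reason in reasons:
    print("-", reason)
```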

HTML output

CLI:

preflight run data.csv --target churn --profile ci-balanced --format text --output-html report.html

Python:

report = preflight.run(df, target="churn", profile="ci-balanced")
html = report.to_html()
with open("report.html", "w", encoding="utf-8") as f:
    f.write(html)

This creates a shareable HTML report you can attach to CI artifacts, docs, or review tickets.

Exit codes:

  • 0: gate pass
  • 2: gate fail or explicit CLI validation failure
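
A CI wrapper script can branch on these exit codes. The sketch below uses only the standard library. The helper run_gate and its return labels are hypothetical, and the demo command is a stand-in that exits with code 2 so the example runs without preflight installed.

```python
import subprocess
import sys


def run_gate(command):
    """Run a command and map its exit code onto the documented meanings."""
    completed = subprocess.run(command)
    if completed.returncode == 0:
        return "pass"
    if completed.returncode == 2:
        return "fail-or-invalid"
    return f"unexpected exit code {completed.returncode}"


# Stand-in for a real invocation such as:
# ["preflight", "run", "data.csv", "--target", "churn", "--profile", "ci-balanced"]
demo = [sys.executable, "-c", "import sys; sys.exit(2)"]
print(run_gate(demo))
```

Because exit code 2 covers both a failed gate and a CLI validation failure, a wrapper that needs to tell them apart would have to inspect the command's output as well.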

Output schema contract

RunReport.to_dict() includes stable contract keys:

  • schema_version
  • run
  • dataset
  • gate
  • score
  • summary
  • findings

Per-finding payload includes evidence and explainability fields:

  • check_id, title, domain, severity, suppressed
  • suggested_action, docs_url
  • evidence.metrics, evidence.threshold, evidence.samples
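
Downstream tooling can filter findings using these keys. The sketch below works on a hand-written dict rather than real output. The sample values are illustrative, the nesting of evidence as a dict is an assumption based on the dotted names above, and blocking_check_ids is a hypothetical helper.

```python
# Hand-written sample that mimics the documented keys. Values are illustrative,
# and nesting "evidence" as a dict is an assumption based on the dotted names.
sample_report = {
    "findings": [
        {
            "check_id": "completeness.missingness",
            "severity": "warn",
            "suppressed": False,
            "evidence": {"metrics": {}, "threshold": None, "samples": []},
        },
        {
            "check_id": "leakage.high_correlation",
            "severity": "error",
            "suppressed": True,
            "evidence": {"metrics": {}, "threshold": None, "samples": []},
        },
        {
            "check_id": "duplicates.exact",
            "severity": "critical",
            "suppressed": False,
            "evidence": {"metrics": {}, "threshold": None, "samples": []},
        },
    ],
}


def blocking_check_ids(report_dict, severities=("error", "critical")):
    """Hypothetical helper: IDs of unsuppressed findings at the given severities."""
    return [
        finding["check_id"]
        for finding in report_dict["findings"]
        if finding["severity"] in severities and not finding["suppressed"]
    ]


print(blocking_check_ids(sample_report))  # ['duplicates.exact']
```

With a real report, the same helper would presumably be called on the result of report.to_dict().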

Legacy compatibility

Legacy check(...) and check_split(...) APIs are still available for compatibility, but run(...) and run_split(...) are recommended for policy-first workflows.

Migration status: the policy-first runner now uses native checks for class balance, completeness, leakage, duplicates, distributional health, correlations, and types. Legacy APIs remain supported during migration.

Compatibility namespace:

  • preflight.legacy.check(...)
  • preflight.legacy.check_split(...)
  • preflight.legacy.Report

Development

make env
conda activate preflight
make install-dev
make test
make lint
make typecheck
make build

Supported versions

  • Python: 3.9-3.13
  • pandas: >=1.3
  • numpy: >=1.21

License

Apache-2.0
