freshdata

Fast, safe, automatic data cleaning for real-world tabular data.

freshdata cleans messy CSV / Excel / SQL-export data in one call — and tells you exactly what it did and why. It is not a fillna wrapper: a rule-based decision engine profiles every column (missing ratio, dtype, skewness, cardinality, inferred role) and chooses the right action per column, logging a rationale, a risk level, and a confidence score for each one.

import pandas as pd
import freshdata as fd

df = pd.read_csv("export.csv")

cleaned = fd.clean(df)                             # one line
cleaned, report = fd.clean(df, return_report=True) # ... with a full audit trail
print(report.summary())

freshdata clean report
  rows:    525 -> 500 (-25)
  columns: 7 -> 6 (-1)
  missing: 421 -> 0 cell(s)
  memory:  100.8 KB -> 89.2 KB
  time:    0.017s
  engine:  25 duplicate row(s) removed; 20 outlier(s) flagged; imputed: age, segment
  actions (7):
    - [fix_dtypes] 'mostly_gone': converted to Int64
    - [drop_duplicates] dropped 25 duplicate row(s) (4.8% of rows, keep='first')
    - [missing] 'age': filled 12 missing value(s) with median (39.6846)
    - [missing] 'segment': filled 90 missing value(s) with sentinel "Missing" ('Missing')
    - [missing] 'mostly_gone': preserved 300 missing value(s)
    - [outliers] 'amount': flagged 15 outlier(s), 3.0% of values (method=iqr, factor=1.5) in new column 'amount_outlier'
    - [outliers] 'age': flagged 5 outlier(s), 1.0% of values (method=iqr, factor=1.5) in new column 'age_outlier'
  review (1):
    ? column 'mostly_gone' preserved at 60.0% missing in balanced mode

Install

pip install freshdata-cleaner                 # pandas + numpy only
pip install "freshdata-cleaner[ml]"           # + scikit-learn (KNN imputation, IsolationForest)
pip install "freshdata-cleaner[enterprise]"   # + polars, pyarrow, requests, pyyaml (enterprise layer + CLI)

Requires Python ≥ 3.9 and pandas ≥ 1.5.

How cleaning works

Layer 1 — representation repair (always on):

order	step	what it does
1	`column_names`	snake_case names, deduplicate collisions (`"a", "a"` → `"a", "a_2"`)
2	`strip_whitespace`	trim surrounding whitespace in text cells (internal spacing kept)
3	`normalize_sentinels`	`"N/A"`, `"null"`, `"-"`, `""`, `"#REF!"`, … → missing
4	`drop_empty_columns` / `drop_empty_rows`	remove all-missing columns and rows
5	`fix_dtypes`	text → numeric (`"$1,234.56"` works) / datetime / boolean, validated
6	`drop_duplicates`	resolve duplicate rows (`duplicate_keep`: first/last/drop/aggregate)
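The `column_names` step can be pictured with a small standard-library sketch. This is an illustration of the behavior described in the table, not freshdata's own code; the helper names `snake_case` and `dedupe` are hypothetical.

```python
import re


def snake_case(name: str) -> str:
    """Lowercase a header and join its alphanumeric chunks with underscores."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", name)  # split camelCase
    words = re.findall(r"[A-Za-z0-9]+", spaced)
    return "_".join(word.lower() for word in words)


def dedupe(names):
    """Suffix repeated names with _2, _3, ... so every column stays unique."""
    used, out = set(), []
    for name in names:
        candidate, n = name, 1
        while candidate in used:  # also avoids clashing with an existing "a_2"
            n += 1
            candidate = f"{name}_{n}"
        used.add(candidate)
        out.append(candidate)
    return out


headers = [" First Name", "AGE", "Joined Date", "Salary($)", "a", "a"]
print(dedupe([snake_case(h) for h in headers]))
# ['first_name', 'age', 'joined_date', 'salary', 'a', 'a_2']
```

The `while` loop is a design choice of this sketch: it keeps the result unique even when a frame already contains a column literally named `a_2`.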

Layer 2 — the decision engine (strategy="balanced", the default) infers each column's role — id, target/label, datetime, free text, categorical, numeric — and applies explicit threshold rules. Use strategy="aggressive" for v0.2-style scrubbing (KNN imputation, column drops, winsorization). strategy="auto" is deprecated (alias for "aggressive").

Missing values (balanced default)

missing ratio	numeric	categorical	datetime
≤ 5% (low)	mean if ~normal & no outliers, else median	mode if clear majority, else `"Unknown"`	ffill/bfill if time-ordered
> 5% and ≤ 30% (medium)	median (KNN only in aggressive mode)	mode if dominant, else `"Missing"`	ffill/bfill if time-ordered
> 30% (high/extreme)	preserved + warning (balanced); dropped in aggressive unless preserved/informative	same as numeric	same as numeric

Aggressive mode additionally: KNN imputation for correlated numerics, column drops for high/extreme missingness without informative signal.

Role gates run first: targets are never modified, IDs are never imputed, free text is never force-filled — those columns are preserved with the reason written into the report, so a remaining NaN is never silent. A <col>_was_missing indicator column is added when the missingness itself correlates with other features (configurable via missing_indicators). On frames under 30 rows the ratios are too noisy: the engine preserves and recommends manual review instead of guessing.
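The gates and bands above can be read as one ordered decision function. The sketch below is a simplified restatement under assumptions: the function name, the returned labels, and the boolean inputs (`roughly_normal`, `has_outliers`, `dominant_mode`, `informative`) are hypothetical, datetime columns are left out, and the aggressive-mode KNN branch is not modeled.

```python
def choose_missing_action(role, missing_ratio, n_rows, *,
                          strategy="balanced", low=0.05, medium=0.30,
                          roughly_normal=False, has_outliers=False,
                          dominant_mode=False, informative=False):
    """Return a label for the action the documented rules would pick."""
    if role in {"target", "id", "free_text"}:
        return "preserve"                 # role gates run first
    if n_rows < 30:
        return "preserve"                 # small frame: recommend manual review
    if missing_ratio == 0:
        return "nothing_to_do"
    if missing_ratio > medium:            # high/extreme band
        if strategy == "aggressive" and not informative:
            return "drop_column"
        return "preserve"                 # balanced keeps the column and warns
    is_low = missing_ratio <= low
    if role == "numeric":
        if is_low and roughly_normal and not has_outliers:
            return "mean"
        return "median"
    if role == "categorical":
        if dominant_mode:
            return "mode"
        return 'sentinel "Unknown"' if is_low else 'sentinel "Missing"'
    raise ValueError(f"role not covered by this sketch: {role!r}")


# The three columns from the sample report at the top of this page:
print(choose_missing_action("numeric", 12 / 500, 500, has_outliers=True))
print(choose_missing_action("categorical", 90 / 500, 500))
print(choose_missing_action("numeric", 300 / 500, 500))
```

Under these assumptions the three calls give `median`, `sentinel "Missing"`, and `preserve`, which matches the `age`, `segment`, and `mostly_gone` lines of the sample report.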

Outliers

Detection: IQR fences (default), z-score, outlier_method="auto" (z-score for ~normal columns, IQR for skewed), or "isolation_forest" (scikit-learn, ≥ 100 rows, falls back to IQR). The method, threshold, and action are always logged.
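To make the two simplest detectors concrete, here is a standard-library sketch of IQR fences and z-score detection. It is an illustration only: the function names are hypothetical, the z-score threshold of 3.0 is an assumption (the text above does not state freshdata's default), and the quartile interpolation may differ from what the library uses.

```python
from statistics import mean, pstdev, quantiles


def iqr_fences(values, factor=1.5):
    """Return (low, high) fences: Q1 - factor*IQR and Q3 + factor*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    return q1 - factor * spread, q3 + factor * spread


def iqr_outliers(values, factor=1.5):
    """Boolean mask: True where a value falls outside the IQR fences."""
    low, high = iqr_fences(values, factor)
    return [v < low or v > high for v in values]


def zscore_outliers(values, threshold=3.0):
    """Boolean mask: True where |value - mean| / std exceeds the threshold."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [False] * len(values)
    return [abs(v - mu) / sigma > threshold for v in values]


amounts = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
print(iqr_fences(amounts))       # (-3.5, 14.5)
print(iqr_outliers(amounts))     # only the last value is flagged
print(zscore_outliers(amounts))  # nothing is flagged
```

On this small skewed sample the extreme value inflates the standard deviation enough that its z-score stays just under 3, so only the IQR rule catches it. That is one reason an "auto" choice would prefer IQR for skewed columns.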

Action (outlier_action): in balanced mode the default "cap" is converted to "flag" (adds a boolean <col>_outlier column). Explicit "remove" still drops rows. In aggressive mode, "cap" winsorizes to the fences. None detects and reports only. Outliers in ID and target columns, preserve_columns, and domain-sensitive columns (AQI, pollutants, fraud/anomaly/risk-like names) are always preserved — there the extremes usually are the signal. Heavy-tailed columns (> 15% outside the fences) are flagged instead of capped.
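The actions can be sketched as a single function over precomputed fences. This is a minimal illustration, assuming a hypothetical `apply_outlier_action` helper; it models the heavy-tail guard (more than 15% outside the fences turns "cap" into "flag") but not the strategy-level conversion or the role and domain protections described above.

```python
def apply_outlier_action(values, low, high, action="flag", heavy_tail=0.15):
    """Return (new_values, outlier_mask, effective_action)."""
    mask = [v < low or v > high for v in values]
    share = sum(mask) / len(values) if values else 0.0
    if action == "cap" and share > heavy_tail:
        action = "flag"                     # heavy tail: do not winsorize
    if action == "cap":
        out = [min(max(v, low), high) for v in values]   # clip to the fences
    elif action == "remove":
        out = [v for v, flagged in zip(values, mask) if not flagged]
    elif action in ("flag", None):
        out = list(values)                  # data untouched, mask reported
    else:
        raise ValueError(f"unknown outlier action: {action!r}")
    return out, mask, action


values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
capped, mask, used = apply_outlier_action(values, -3.5, 14.5, action="cap")
print(capped[-1], used)   # 14.5 cap
```

Returning the effective action alongside the data mirrors the idea that the method, threshold, and action are always logged.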

Duplicates

Exact duplicates are removed by default (count and percentage reported). Time-indexed frames never lose rows unless allow_timeseries_duplicates=True. A duplicate ratio above duplicate_threshold (10%) raises a data-quality warning. With duplicate_subset, duplicate_keep="aggregate" collapses each group (numeric mean, first non-missing otherwise).
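The "aggregate" behavior (numeric mean, first non-missing otherwise) can be illustrated on plain dictionaries. This is a dependency-free sketch, not the library's pandas implementation; `aggregate_duplicates` and the sample rows are hypothetical.

```python
from statistics import mean


def aggregate_duplicates(rows, subset):
    """Collapse rows that share the same values in the subset columns."""
    groups = {}
    for row in rows:
        key = tuple(row[col] for col in subset)
        groups.setdefault(key, []).append(row)

    merged_rows = []
    for members in groups.values():
        merged = {}
        for col in members[0]:
            present = [m[col] for m in members if m[col] is not None]
            if not present:
                merged[col] = None                      # nothing to recover
            elif all(isinstance(v, (int, float)) and not isinstance(v, bool)
                     for v in present):
                merged[col] = mean(present)             # numeric: mean
            else:
                merged[col] = present[0]                # else: first non-missing
        merged_rows.append(merged)
    return merged_rows


orders = [
    {"order_id": "A1", "amount": 10.0, "note": None},
    {"order_id": "A1", "amount": 20.0, "note": "rush"},
    {"order_id": "B2", "amount": 5.0, "note": "ok"},
]
print(aggregate_duplicates(orders, subset=("order_id",)))
```

Here the two `A1` rows collapse into one row with `amount` 15.0 and `note` "rush", while `B2` passes through unchanged.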

Tuning the engine

fd.clean(
    df,
    strategy="balanced",             # "aggressive" | "conservative"
    missing_threshold_low=0.05,      # band edges for the missing-value rules
    missing_threshold_medium=0.30,
    missing_threshold_high=0.60,
    duplicate_threshold=0.10,        # warn above this duplicate ratio
    outlier_method="iqr",            # "zscore" | "auto" | "isolation_forest"
    outlier_action="cap",            # balanced converts cap→flag; "remove" | None
    target_column="churn",           # never modified
    preserve_columns=("notes",),     # never dropped
    id_columns=("ref",),             # never imputed
    preserve_original=True,          # False allows in-place memory reuse
    verbose=True,                    # one-line summary per clean
    return_report=True,
)

# Preview engine choices before cleaning
plan = fd.suggest_plan(df)
print(plan.summary())
fd.clean(df, config=plan.config)

# Compare strategies side-by-side
print(fd.compare_plans(df))

Explicit choices always override the engine: impute="median" / outliers="clip" force simple uniform handling, and strategy="conservative" restores the old opt-in behavior. Every option lives on one frozen dataclass — fd.CleanConfig — and unknown names fail immediately with a "did you mean" suggestion:

config = fd.CleanConfig(duplicate_keep="aggregate", duplicate_subset=("order_id",))
fd.clean(df, config=config, outlier_action="flag")   # config + overrides

cleaner = fd.Cleaner(target_column="churn")          # reusable pipeline
for path in paths:
    out = cleaner.clean(pd.read_csv(path))
    log.info(cleaner.report_.summary())
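The "frozen dataclass plus did-you-mean" pattern can be reproduced with the standard library. The sketch below is a guess at the general shape, not `fd.CleanConfig` itself: `SketchConfig` and `with_overrides` are hypothetical names, and only a handful of the documented options are included.

```python
from dataclasses import dataclass, fields, replace
from difflib import get_close_matches


@dataclass(frozen=True)
class SketchConfig:
    """Stand-in for a frozen options object (hypothetical, not fd.CleanConfig)."""
    strategy: str = "balanced"
    outlier_method: str = "iqr"
    outlier_action: str = "cap"
    duplicate_threshold: float = 0.10


def with_overrides(config, **options):
    """Return a copy with overrides applied; reject unknown option names."""
    known = [f.name for f in fields(config)]
    for name in options:
        if name not in known:
            hint = get_close_matches(name, known, n=1)
            suffix = f"; did you mean {hint[0]!r}?" if hint else ""
            raise TypeError(f"unknown option {name!r}{suffix}")
    return replace(config, **options)   # frozen: build a new instance


base = SketchConfig()
print(with_overrides(base, outlier_action="flag").outlier_action)   # flag
try:
    with_overrides(base, outlier_acton="flag")
except TypeError as exc:
    print(exc)   # unknown option 'outlier_acton'; did you mean 'outlier_action'?
```

Because the dataclass is frozen, overrides never mutate a shared config; each call yields a fresh object, which is what makes a reusable pipeline safe to configure once and call many times.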

The report

fd.clean(df, return_report=True) returns (cleaned_df, CleanReport):

dataset shape, memory, and missing-cell counts before/after;
one Action per decision — step, column, description, affected count, rationale, risk level (low/medium/high), confidence score;
columns dropped / imputed / preserved, duplicates removed, outliers handled;
report.warnings for risky decisions and report.recommendations for manual review;
report.summary() (text), report.to_frame() (DataFrame), report.to_dict() (JSON-friendly).

If any NaN survives cleaning, the report says exactly why it was preserved.

Profiling

fd.profile(df) inspects without changing anything, and because it runs the same inference code as clean, its suggestions are a faithful preview. Pass include_plan=True to also attach a dry-run cleaning plan:

print(fd.profile(df))
profile = fd.profile(df, include_plan=True)
print(profile.plan.summary())   # primary model per column

freshdata profile — 5 rows x 6 columns, 1.5 KB
  missing cells: 6 (20.0%)   duplicate rows: 1
  column        dtype    missing  issues
   First Name   object       20%  20.0% missing; 1 value(s) with surrounding whitespace; 1 sentinel value(s) meaning missing
  AGE           object         -  1 sentinel value(s) meaning missing; would convert to Int64
  Joined Date   object         -  1 sentinel value(s) meaning missing; would convert to datetime64[ns]
  Active        object         -  would convert to bool
  Salary($)     object         -  1 sentinel value(s) meaning missing; would convert to float64
  empty         object      100%  100.0% missing; constant column

What freshdata will not do

Touch a target/label column, impute an identifier, or force-fill free text.
Remove outliers blindly: in the default balanced mode outliers are flagged rather than capped or dropped (capping applies only in aggressive mode), and fraud/anomaly-style columns keep their extremes.
Guess at fuzzy entity resolution in clean() — variant/typo merging is opt-in via the enterprise layer's clustering.
Parse ambiguous European decimal commas ("1.234,56") — too risky to guess.
Mutate your DataFrame (unless you pass preserve_original=False).
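The refusal to guess at European decimals can be shown with a deliberately strict parser. This is a sketch of the idea only, assuming a hypothetical `parse_number` helper and a single `$` currency prefix; freshdata's actual `fix_dtypes` validation is not shown on this page.

```python
import re

# Digits with optional well-formed thousands commas and an optional
# dot-decimal part. Anything else is treated as "not safely numeric".
_US_NUMBER = re.compile(r"^[+-]?(\d{1,3}(,\d{3})+|\d+)(\.\d+)?$")


def parse_number(text: str):
    """Return a float for unambiguous input, or None when guessing is risky."""
    cleaned = text.strip().lstrip("$").strip()
    if not _US_NUMBER.match(cleaned):
        return None
    return float(cleaned.replace(",", ""))


print(parse_number("$1,234.56"))   # 1234.56
print(parse_number("1.234,56"))    # None
```

Returning `None` instead of a best guess keeps the column as text, so a value such as "1.234,56" is never silently turned into 1.234 or 1234.56.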

API

name	purpose
`fd.clean(df, *, return_report=False, config=None, **options)`	clean, optionally returning a `CleanReport`
`fd.suggest_plan(df, *, config=None, **options)`	dry-run: primary + alternative models per column
`fd.compare_clean(df, *, strategies=...)`	side-by-side actual clean outcomes per strategy
`fd.compare_plans(df, *, strategies=..., include_metrics=False)`	side-by-side models across strategies
`fd.profile(df, *, include_plan=False, config=None, **options)`	read-only inspection with actionable issues
`fd.Cleaner(config=None, **options)`	reusable configured pipeline (`.clean()`, `.report_`)
`fd.CleanConfig`	frozen dataclass holding every option
`fd.CleanPlan` / `fd.ColumnPlan`	engine preview before cleaning
`fd.CleanReport` / `fd.Action`	audit trail with rationale/risk/confidence/model_id
`fd.Profile` / `fd.ColumnProfile`	profiling results

Enterprise layer

freshdata.enterprise adds opt-in governance and data-quality features on top of the core cleaner: fuzzy value clustering, PII masking, semantic validation, a 0–100 Data Trust Score, OpenLineage metadata, and a batch CLI. It accepts and returns either a pandas or a polars DataFrame — running Polars-native fast paths when polars is installed and falling back to vectorized pandas otherwise. Optional dependencies stay lazy, so a plain import freshdata is unaffected.

pip install "freshdata-cleaner[enterprise]"   # polars, pyarrow, requests, pyyaml
pip install "freshdata-cleaner[cleanlab]"     # + cleanlab (ML label-noise detection)

from freshdata.enterprise import (
    clean_enterprise, EnterpriseConfig, ClusterConfig, MaskingRule, SemanticValidatorConfig,
)

ec = EnterpriseConfig(
    enable_clustering=True,
    clustering=ClusterConfig(columns=("vendor",)),       # merge "Acme Inc" / "ACME  inc"
    masking=(MaskingRule(name="pii", columns=("email",), strategy="hash", salt="…"),),
    semantic=(SemanticValidatorConfig(name="iso", kind="reference",
              columns=("country",), reference=("US", "CA", "GB")),),
    fail_under_trust=80,                                  # quality gate
)
result = clean_enterprise(df, enterprise=ec)              # df may be pandas OR polars
print(result.summary())
print(result.quality.to_markdown())                       # before/after trust report
result.lineage.emit("lineage.json")                       # OpenLineage RunEvents
assert result.passed_gate

Run it as a batch job in Airflow / Prefect / cron — the CLI exits non-zero when the trust gate fails:

freshdata clean in.csv -o out.parquet --mask email:hash --cluster vendor \
    --report quality.json --lineage lineage.json --fail-under-trust 80
freshdata trust in.csv --fail-under 90
freshdata profile in.csv --json

name	purpose
`clean_enterprise(df, *, enterprise=…, clean_config=…, **opts)`	full pipeline → `EnterpriseResult`
`compute_trust_score(df)` → `TrustScore`	0–100 completeness / validity / uniqueness / consistency
`merge_clusters(df, cols)` / `cluster_column(df, col)`	key-collision + n-gram value merging
`mask_dataframe(df, rules)` → `MaskReport`	hash / redact / partial / regex-scrub / drop PII
`run_semantic_validation(df, configs)` → `ValidationReport`	reference / regex / API checks
`LineageTracker` / `schema_of`	OpenLineage-compatible transformation lineage
`detect_label_issues` / `detect_outliers`	optional Cleanlab wrappers
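For the "hash" masking strategy, a keyed hash is one common way to pseudonymize a column while keeping joins possible. The sketch below uses HMAC-SHA256 from the standard library; the digest choice, the truncation length, and the `mask_hash` name are assumptions, since this page does not say how `mask_dataframe` derives its hashes.

```python
import hashlib
import hmac


def mask_hash(value, salt: str, length: int = 16):
    """Return a stable pseudonym for value; None (missing) stays None."""
    if value is None:
        return None
    digest = hmac.new(salt.encode("utf-8"), str(value).encode("utf-8"),
                      hashlib.sha256).hexdigest()
    return digest[:length]


emails = ["ann@example.com", "bob@example.com", "ann@example.com", None]
masked = [mask_hash(e, salt="demo-salt") for e in emails]
print(masked[0] == masked[2], masked[0] != masked[1], masked[3])
# True True None
```

The same input and salt always map to the same token, so masked columns can still be grouped or joined, while a different salt yields unrelated tokens.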

Migrating from 0.2.x

Breaking: the default strategy changed from "auto" to "balanced".

If you want…	Do this
Same behavior as freshdata 0.2	`fd.clean(df, strategy="aggressive")`
Accuracy-first cleaning (recommended)	`fd.clean(df)` — new default
Representation repair only	`fd.clean(df, strategy="conservative")`

strategy="auto" still works but emits a DeprecationWarning (alias for "aggressive"). Other notable 0.3 changes:

High-missing columns are preserved in balanced mode (not dropped).
Outliers are flagged by default in balanced mode (not capped).
KNN imputation runs only in aggressive mode.
Target heuristics expanded (aqi, *_bucket, score, …).
Action.model_id records which imputation/outlier model was chosen.
fd.suggest_plan() / fd.compare_plans() / fd.compare_clean() preview and compare engine decisions.
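A deprecated alias such as strategy="auto" is typically resolved in one place with a warning. The following is a minimal sketch of that pattern, assuming a hypothetical `resolve_strategy` helper; it is not taken from freshdata's source.

```python
import warnings

_STRATEGIES = {"conservative", "balanced", "aggressive"}


def resolve_strategy(strategy: str = "balanced") -> str:
    """Map the deprecated "auto" alias to "aggressive" and validate the rest."""
    if strategy == "auto":
        warnings.warn(
            'strategy="auto" is deprecated; use strategy="aggressive"',
            DeprecationWarning,
            stacklevel=2,   # point the warning at the caller's line
        )
        return "aggressive"
    if strategy not in _STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return strategy


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    print(resolve_strategy("auto"))   # aggressive
    print(len(caught))                # 1
```

In a test suite, running with `-W error::DeprecationWarning` is a quick way to find remaining callers that still pass "auto".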

Validated scenarios

Every fixture in tests/fixtures/ is run under conservative, balanced, and aggressive strategies in CI. Use fd.compare_clean(df) to reproduce the quality/efficiency matrix on your own data.

Fixture	Rows	What it stress-tests
`aqi_sample`	500	Real AQI panel slice — targets, pollutants, outliers
`large_panel`	3,000	AQI-shaped panel at scale — perf + preserve rules
`sales_export`	200	CRM export — currency strings, whitespace, dupes
`survey_responses`	150	High missing categoricals, free-text `notes`
`sensor_timeseries`	120	Datetime readings, time-ordered fills
`fraud_signals`	180	Domain-sensitive scores — outliers preserved
`tiny_cohort`	12	Small frame gate — preserve, don't drop
`wide_sparse`	200×20	Sparse columns — balanced never drops
`duplicate_heavy`	260	~30% duplicate rows — layer-1 dedup
`locale_numbers`	100	European decimals — must not auto-convert
`mixed_roles`	100	Misnamed target, free text, id-like columns

Online datasets (50 curated)

Fifty real public datasets are catalogued in tests/fixtures/online/registry.json. Pinned URLs and sha256 hashes live in manifest.json; cached CSV slices in tests/fixtures/online/cache/ power CI (no network). Formats include CSV, TSV, JSON, and ZIP.

Tier	Count	CI scope
Tier 1 (anchors)	10	Full expectations + golden snapshots + live URL checks
Tier 2	40	Smoke tests (all strategies run, basic invariants)

Tier 1 anchors: titanic, wine_quality, adult_income, air_quality_uci, iris, loan_approval, heart_cleveland, bank_marketing, mushroom, weather_json.

Domain coverage: UCI classics, GitHub mirrors, environmental panels (OWID), finance/census, JSON-native (Vega datasets), medical, and high-dimensional numeric sets.

Refresh cached slices:

python scripts/fetch_online_fixtures.py --discover --update-manifest
python scripts/fetch_online_fixtures.py --refresh --only titanic
python scripts/search_datasets.py --tag missing --domain finance
python scripts/search_datasets.py --format json

Debug, explain, and compare:

python scripts/debug_datasets.py --online --explain titanic
python scripts/debug_datasets.py --infer-roles --online adult_income
python scripts/debug_datasets.py --search missing --online
python benchmarks/bench.py --online-all --compare
python benchmarks/bench.py --online-all --tier 1

Reverse-engineering APIs:

import freshdata as fd

# Infer column roles before cleaning
print(fd.infer_roles(df))

# Explain what clean() did and why
explanation = fd.explain_clean(df, strategy="balanced")
print(explanation.summary())
print(explanation.roles)

Polars adapter (optional extra):

pip install "freshdata-cleaner[polars]"

import polars as pl
import freshdata as fd
pl_df = pl.read_csv("export.csv")   # any Polars DataFrame works here
cleaned = fd.clean(pl_df)           # returns pl.DataFrame when input is Polars

Live URL validation (network required, not default CI):

pytest -m online tests/test_online_datasets.py
pytest -m tier1 tests/test_online_datasets.py

Compare cleaning across strategies

import freshdata as fd

# Actual outcomes: missing after, duration, models used
print(fd.compare_clean(df))

# Planned models + optional actual metrics
print(fd.compare_plans(df, include_metrics=True))

Performance expectations

Typical wall-clock time on a modern laptop, given as upper bounds (see tests/fixtures/perf/baselines.json):

Dataset size	Balanced	Aggressive
500 rows	<0.5s	<1s
3,000 rows	<2.5s	<6s
29k rows (full AQI)	<5s	KNN gated

Run benchmarks:

python benchmarks/bench.py --fixtures --compare   # all fixtures, side-by-side
pytest -m large                                   # optional full AQI.csv (set FRESHDATA_AQI_PATH)

Performance is achieved via vectorized pandas/NumPy and one-pass engine caching (correlation matrix, column contexts). A C extension is not used — profiling showed the bottleneck was KNN on large frames (now gated to aggressive mode only).

Development

git clone https://github.com/FreshCode-Org/freshdata
cd freshdata
pip install -e ".[dev,ml,polars]"
pytest
ruff check src tests
mypy src/freshdata

Update golden report snapshots after intentional engine changes:

pytest tests/test_golden.py tests/test_online_datasets.py --update-golden

Benchmarks: python benchmarks/bench.py (synthetic), python benchmarks/bench.py --fixtures --compare (11 local scenario fixtures), or python benchmarks/bench.py --online --compare (6 online cached datasets).

Optional large-file benchmark (29k-row AQI.csv, not committed to repo):

export FRESHDATA_AQI_PATH=/path/to/AQI.csv
pytest -m large

License

MIT — see LICENSE.
