Fast, safe, automatic data cleaning for real-world tabular data.
freshdata
freshdata cleans messy CSV / Excel / SQL-export data in one call — and tells
you exactly what it did and why. It is not a fillna wrapper: a rule-based
decision engine profiles every column (missing ratio, dtype, skewness,
cardinality, inferred role) and chooses the right action per column, logging a
rationale, a risk level, and a confidence score for each one.
import pandas as pd
import freshdata as fd
df = pd.read_csv("export.csv")
cleaned = fd.clean(df) # one line
cleaned, report = fd.clean(df, return_report=True) # ... with a full audit trail
print(report.summary())
freshdata clean report
rows: 525 -> 500 (-25)
columns: 7 -> 6 (-1)
missing: 421 -> 0 cell(s)
memory: 100.8 KB -> 89.2 KB
time: 0.017s
engine: 25 duplicate row(s) removed; 20 outlier(s) flagged; imputed: age, segment
actions (7):
- [fix_dtypes] 'mostly_gone': converted to Int64
- [drop_duplicates] dropped 25 duplicate row(s) (4.8% of rows, keep='first')
- [missing] 'age': filled 12 missing value(s) with median (39.6846)
- [missing] 'segment': filled 90 missing value(s) with sentinel "Missing" ('Missing')
- [missing] 'mostly_gone': preserved 300 missing value(s)
- [outliers] 'amount': flagged 15 outlier(s), 3.0% of values (method=iqr, factor=1.5) in new column 'amount_outlier'
- [outliers] 'age': flagged 5 outlier(s), 1.0% of values (method=iqr, factor=1.5) in new column 'age_outlier'
review (1):
? column 'mostly_gone' preserved at 60.0% missing in balanced mode
Install
pip install freshdata-cleaner # pandas + numpy only
pip install "freshdata-cleaner[ml]" # + scikit-learn (KNN imputation, IsolationForest)
pip install "freshdata-cleaner[enterprise]" # + polars, pyarrow, requests, pyyaml (enterprise layer + CLI)
Requires Python ≥ 3.9 and pandas ≥ 1.5.
How cleaning works
Layer 1 — representation repair (always on):
| order | step | what it does |
|---|---|---|
| 1 | column_names | snake_case names, deduplicate collisions ("a", "a" → "a", "a_2") |
| 2 | strip_whitespace | trim surrounding whitespace in text cells (internal spacing kept) |
| 3 | normalize_sentinels | "N/A", "null", "-", "", "#REF!", … → missing |
| 4 | drop_empty_columns / drop_empty_rows | remove all-missing columns and rows |
| 5 | fix_dtypes | text → numeric ("$1,234.56" works) / datetime / boolean, validated |
| 6 | drop_duplicates | resolve duplicate rows (duplicate_keep: first/last/drop/aggregate) |
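As a rough illustration of step 1, here is a minimal standard-library sketch of snake_casing headers and de-duplicating collisions. The helper names `snake_case` and `dedupe` are hypothetical and this is not freshdata's internal code; it only reproduces the "a", "a" → "a", "a_2" behaviour from the table.

```python
import re


def snake_case(name: str) -> str:
    """Lowercase a header and collapse non-alphanumeric runs into underscores."""
    s = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", name.strip())  # split camelCase
    s = re.sub(r"[^0-9A-Za-z]+", "_", s)                      # punctuation/space -> "_"
    return s.strip("_").lower() or "column"


def dedupe(names):
    """Suffix repeated names with _2, _3, ... until every name is unique."""
    used = set()
    out = []
    for n in names:
        candidate, k = n, 1
        while candidate in used:
            k += 1
            candidate = f"{n}_{k}"
        used.add(candidate)
        out.append(candidate)
    return out


headers = ["First Name", "AGE", "Salary($)", "a", "a"]
cleaned_names = dedupe([snake_case(h) for h in headers])
print(cleaned_names)  # ['first_name', 'age', 'salary', 'a', 'a_2']
```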
Layer 2 — the decision engine (strategy="balanced", the default) infers
each column's role — id, target/label, datetime, free text,
categorical, numeric — and applies explicit threshold rules.
Use strategy="aggressive" for v0.2-style scrubbing (KNN imputation, column
drops, winsorization). strategy="auto" is deprecated (alias for
"aggressive").
Missing values (balanced default)
| missing ratio | numeric | categorical | datetime |
|---|---|---|---|
| ≤ 5% (low) | mean if ~normal & no outliers, else median | mode if clear majority, else "Unknown" | ffill/bfill if time-ordered |
| > 5% and ≤ 30% (medium) | median (KNN only in aggressive mode) | mode if dominant, else "Missing" | ffill/bfill if time-ordered |
| > 30% (high/extreme) | preserved + warning (balanced); dropped in aggressive unless preserved/informative | same | same |
Aggressive mode additionally: KNN imputation for correlated numerics, column drops for high/extreme missingness without informative signal.
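The numeric column of the table above can be read as a small decision function. The sketch below is an approximation for illustration only: the function name, its flags (`roughly_normal`, `has_outliers`, `correlated`, `informative`) and the returned labels are assumptions, and it assumes KNN applies only in the medium band, which the table implies but does not state outright.

```python
def numeric_missing_rule(missing_ratio, *, strategy="balanced", roughly_normal=False,
                         has_outliers=False, correlated=False, informative=False,
                         low=0.05, medium=0.30):
    """Return a label for the action a band-based rule would pick (hypothetical labels)."""
    if strategy == "conservative" or missing_ratio <= 0:
        return "none"                      # representation repair only / nothing to fill
    if missing_ratio <= low:               # low band
        return "mean" if roughly_normal and not has_outliers else "median"
    if missing_ratio <= medium:            # medium band
        return "knn" if strategy == "aggressive" and correlated else "median"
    # high/extreme band: balanced preserves, aggressive drops unless informative
    if strategy == "aggressive" and not informative:
        return "drop"
    return "preserve"


for ratio in (0.03, 0.20, 0.60):
    print(ratio, numeric_missing_rule(ratio), numeric_missing_rule(ratio, strategy="aggressive"))
```

The band edges default to the same values as missing_threshold_low and missing_threshold_medium, so changing those options moves the boundaries in the same way.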
Role gates run first: targets are never modified, IDs are never
imputed, free text is never force-filled — those columns are preserved
with the reason written into the report, so a remaining NaN is never silent.
A <col>_was_missing indicator column is added when the missingness itself
correlates with other features (configurable via missing_indicators).
On frames under 30 rows the ratios are too noisy: the engine preserves and
recommends manual review instead of guessing.
Outliers
Detection: IQR fences (default), z-score, outlier_method="auto" (z-score
for ~normal columns, IQR for skewed), or "isolation_forest" (scikit-learn,
≥ 100 rows, falls back to IQR). The method, threshold, and action are always
logged.
Action (outlier_action): in balanced mode the default "cap" is
converted to "flag" (adds a boolean <col>_outlier column). Explicit
"remove" still drops rows. In aggressive mode, "cap" winsorizes to
the fences. None detects and reports only. Outliers in ID and target
columns, preserve_columns, and domain-sensitive columns (AQI, pollutants,
fraud/anomaly/risk-like names) are always preserved — there the extremes
usually are the signal. Heavy-tailed columns (> 15% outside the fences) are
flagged instead of capped.
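For readers unfamiliar with IQR fences, the following standard-library sketch shows the general technique behind method=iqr, factor=1.5 and the "flag" action. It is not freshdata's implementation; `quantile` and `iqr_flags` are hypothetical helpers, and the quantile uses plain linear interpolation.

```python
def quantile(sorted_vals, q):
    """Linear-interpolation quantile on an already sorted list (0 <= q <= 1)."""
    pos = (len(sorted_vals) - 1) * q
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def iqr_flags(values, factor=1.5):
    """Return (flags, (lower, upper)); missing values (None) are never flagged."""
    vals = sorted(v for v in values if v is not None)
    q1, q3 = quantile(vals, 0.25), quantile(vals, 0.75)
    spread = q3 - q1
    lower, upper = q1 - factor * spread, q3 + factor * spread
    flags = [v is not None and (v < lower or v > upper) for v in values]
    return flags, (lower, upper)


amount = [10, 12, 11, 13, 12, 11, 250, None]
flags, fences = iqr_flags(amount)
print(fences)  # (8.75, 14.75)
print(flags)   # only the 250 entry is True
```

Flagging keeps every row and records the decision in a separate boolean list, which mirrors the idea of the <col>_outlier column; capping would instead replace 250 with the upper fence.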
Duplicates
Exact duplicates are removed by default (count and percentage reported).
Time-indexed frames never lose rows unless allow_timeseries_duplicates=True.
A duplicate ratio above duplicate_threshold (10%) raises a data-quality
warning. With duplicate_subset, duplicate_keep="aggregate" collapses each
group (numeric mean, first non-missing otherwise).
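The "aggregate" rule (numeric mean, first non-missing otherwise) can be sketched on plain dictionaries as follows. This is an illustrative approximation, not the library's code; `aggregate_duplicates` and the sample rows are made up for the example.

```python
def aggregate_duplicates(rows, key):
    """Collapse rows sharing `key`: mean for numeric columns, first non-missing otherwise."""
    groups = {}
    for row in rows:
        groups.setdefault(row[key], []).append(row)  # dict keeps first-seen order

    merged_rows = []
    for members in groups.values():
        merged = {}
        for col in members[0]:
            present = [m[col] for m in members if m[col] is not None]
            numeric = present and all(
                isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
            )
            if not present:
                merged[col] = None                     # nothing to recover
            elif col != key and numeric:
                merged[col] = sum(present) / len(present)
            else:
                merged[col] = present[0]
        merged_rows.append(merged)
    return merged_rows


rows = [
    {"order_id": "A1", "amount": 10.0, "city": None},
    {"order_id": "A1", "amount": 20.0, "city": "Oslo"},
    {"order_id": "B2", "amount": 5.0, "city": "Rome"},
]
collapsed = aggregate_duplicates(rows, key="order_id")
print(collapsed)
```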
Tuning the engine
fd.clean(
    df,
    strategy="balanced",            # "aggressive" | "conservative"
    missing_threshold_low=0.05,     # band edges for the missing-value rules
    missing_threshold_medium=0.30,
    missing_threshold_high=0.60,
    duplicate_threshold=0.10,       # warn above this duplicate ratio
    outlier_method="iqr",           # "zscore" | "auto" | "isolation_forest"
    outlier_action="cap",           # balanced converts cap→flag; "remove" | None
    target_column="churn",          # never modified
    preserve_columns=("notes",),    # never dropped
    id_columns=("ref",),            # never imputed
    preserve_original=True,         # False allows in-place memory reuse
    verbose=True,                   # one-line summary per clean
    return_report=True,
)
# Preview engine choices before cleaning
plan = fd.suggest_plan(df)
print(plan.summary())
fd.clean(df, config=plan.config)
# Compare strategies side-by-side
print(fd.compare_plans(df))
Explicit choices always override the engine: impute="median" /
outliers="clip" force simple uniform handling, and
strategy="conservative" restores the old opt-in behavior. Every option
lives on one frozen dataclass — fd.CleanConfig — and unknown names fail
immediately with a "did you mean" suggestion:
config = fd.CleanConfig(duplicate_keep="aggregate", duplicate_subset=("order_id",))
fd.clean(df, config=config, outlier_action="flag") # config + overrides
cleaner = fd.Cleaner(target_column="churn") # reusable pipeline
for path in paths:
    out = cleaner.clean(pd.read_csv(path))
    log.info(cleaner.report_.summary())
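The pattern described above (one frozen dataclass, keyword overrides, and a "did you mean" hint for unknown names) can be approximated with the standard library. The sketch below uses a hypothetical `SketchConfig` with only four fields and a hypothetical `with_overrides` helper; how fd.CleanConfig actually validates names is not shown in this README.

```python
import difflib
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class SketchConfig:
    """Hypothetical stand-in for a frozen options object (not fd.CleanConfig)."""
    strategy: str = "balanced"
    outlier_method: str = "iqr"
    outlier_action: str = "cap"
    duplicate_keep: str = "first"


def with_overrides(config, **options):
    """Return a new config with overrides applied; reject unknown option names."""
    known = [f.name for f in fields(config)]
    for name in options:
        if name not in known:
            close = difflib.get_close_matches(name, known, n=1)
            hint = f" Did you mean {close[0]!r}?" if close else ""
            raise TypeError(f"unknown option {name!r}.{hint}")
    return replace(config, **options)  # frozen: the original is left untouched


base = SketchConfig(duplicate_keep="aggregate")
tuned = with_overrides(base, outlier_action="flag")

try:
    with_overrides(base, outlier_acton="flag")  # typo on purpose
except TypeError as exc:
    message = str(exc)
    print(message)
```

Because the dataclass is frozen, overrides always produce a new object, which is what makes it safe to share one config across many clean calls.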
The report
fd.clean(df, return_report=True) returns (cleaned_df, CleanReport):
- dataset shape, memory, and missing-cell counts before/after;
- one Action per decision: step, column, description, affected count, rationale, risk level (low/medium/high), confidence score;
- columns dropped / imputed / preserved, duplicates removed, outliers handled;
- report.warnings for risky decisions and report.recommendations for manual review;
- report.summary() (text), report.to_frame() (DataFrame), report.to_dict() (JSON-friendly).
If any NaN survives cleaning, the report says exactly why it was preserved.
Profiling
fd.profile(df) inspects without changing anything — and because it runs the
same inference code as clean, its suggestions are a faithful preview.
With include_plan=True, attach a dry-run cleaning plan:
print(fd.profile(df))
profile = fd.profile(df, include_plan=True)
print(profile.plan.summary()) # primary model per column
freshdata profile — 5 rows x 6 columns, 1.5 KB
missing cells: 6 (20.0%) duplicate rows: 1
column dtype missing issues
First Name object 20% 20.0% missing; 1 value(s) with surrounding whitespace; 1 sentinel value(s) meaning missing
AGE object - 1 sentinel value(s) meaning missing; would convert to Int64
Joined Date object - 1 sentinel value(s) meaning missing; would convert to datetime64[ns]
Active object - would convert to bool
Salary($) object - 1 sentinel value(s) meaning missing; would convert to float64
empty object 100% 100.0% missing; constant column
What freshdata will not do
- Touch a target/label column, impute an identifier, or force-fill free text.
- Remove outliers blindly: in balanced mode the default "cap" setting is converted to "flag", and fraud/anomaly-style columns keep their extremes.
- Guess at fuzzy entity resolution in clean(): variant/typo merging is opt-in via the enterprise layer's clustering.
- Parse ambiguous European decimal commas ("1.234,56"): too risky to guess.
- Mutate your DataFrame (unless you pass preserve_original=False).
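To make the decimal-comma point concrete, here is a minimal sketch of a strict parser that converts "$1,234.56" but declines "1.234,56". It assumes US-style grouping (comma for thousands, dot for decimals); the `parse_amount` helper and its regular expression are illustrative and are not taken from freshdata.

```python
import re

# Either 1-3 digits followed by ",ddd" groups, or a plain run of digits;
# both may carry an optional ".dd" fraction and a leading minus sign.
_US_NUMBER = re.compile(r"^-?\d{1,3}(,\d{3})*(\.\d+)?$|^-?\d+(\.\d+)?$")


def parse_amount(text):
    """Return a float for unambiguous US-style numbers, else None (keep the cell as text)."""
    s = text.strip().replace("$", "").replace(" ", "")
    if not _US_NUMBER.match(s):
        return None
    return float(s.replace(",", ""))


print(parse_amount("$1,234.56"))  # 1234.56
print(parse_amount("1.234,56"))   # None: looks European, so no guess is made
```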
API
| name | purpose |
|---|---|
| fd.clean(df, *, return_report=False, config=None, **options) | clean, optionally returning a CleanReport |
| fd.suggest_plan(df, *, config=None, **options) | dry-run: primary + alternative models per column |
| fd.compare_clean(df, *, strategies=...) | side-by-side actual clean outcomes per strategy |
| fd.compare_plans(df, *, strategies=..., include_metrics=False) | side-by-side models across strategies |
| fd.profile(df, *, include_plan=False, config=None, **options) | read-only inspection with actionable issues |
| fd.Cleaner(config=None, **options) | reusable configured pipeline (.clean(), .report_) |
| fd.CleanConfig | frozen dataclass holding every option |
| fd.CleanPlan / fd.ColumnPlan | engine preview before cleaning |
| fd.CleanReport / fd.Action | audit trail with rationale/risk/confidence/model_id |
| fd.Profile / fd.ColumnProfile | profiling results |
Enterprise layer
freshdata.enterprise adds opt-in governance and data-quality features on top of the core
cleaner: fuzzy value clustering, PII masking, semantic validation, a 0–100 Data Trust
Score, OpenLineage metadata, and a batch CLI. It accepts and returns either a pandas
or a polars DataFrame — running Polars-native fast paths when polars is installed and
falling back to vectorized pandas otherwise. Optional dependencies stay lazy, so a plain
import freshdata is unaffected.
pip install "freshdata-cleaner[enterprise]" # polars, pyarrow, requests, pyyaml
pip install "freshdata-cleaner[cleanlab]" # + cleanlab (ML label-noise detection)
from freshdata.enterprise import (
    clean_enterprise, EnterpriseConfig, ClusterConfig, MaskingRule, SemanticValidatorConfig,
)
ec = EnterpriseConfig(
    enable_clustering=True,
    clustering=ClusterConfig(columns=("vendor",)),  # merge "Acme Inc" / "ACME inc"
    masking=(MaskingRule(name="pii", columns=("email",), strategy="hash", salt="…"),),
    semantic=(SemanticValidatorConfig(name="iso", kind="reference",
                                      columns=("country",), reference=("US", "CA", "GB")),),
    fail_under_trust=80,  # quality gate
)
result = clean_enterprise(df, enterprise=ec) # df may be pandas OR polars
print(result.summary())
print(result.quality.to_markdown()) # before/after trust report
result.lineage.emit("lineage.json") # OpenLineage RunEvents
assert result.passed_gate
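For intuition about what merging "Acme Inc" / "ACME inc" involves, the sketch below shows the generic key-collision technique: values that share a normalized fingerprint are mapped to their most frequent spelling. It is a simplified standard-library illustration with hypothetical helper names, and it does not cover the n-gram matching the enterprise layer also mentions.

```python
import re
from collections import Counter, defaultdict


def fingerprint(value):
    """Key-collision fingerprint: lowercase, drop punctuation, sort unique tokens."""
    tokens = re.sub(r"[^\w\s]", "", value.lower()).split()
    return " ".join(sorted(set(tokens)))


def merge_variants(values):
    """Map every value to the most frequent spelling sharing its fingerprint."""
    groups = defaultdict(Counter)
    for v in values:
        groups[fingerprint(v)][v] += 1
    canonical = {key: counts.most_common(1)[0][0] for key, counts in groups.items()}
    return [canonical[fingerprint(v)] for v in values]


vendors = ["Acme Inc", "ACME inc", "Acme Inc", "acme, inc.", "Globex"]
merged = merge_variants(vendors)
print(merged)  # the four Acme spellings collapse to "Acme Inc"
```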
Run it as a batch job in Airflow / Prefect / cron — the CLI exits non-zero when the trust gate fails:
freshdata clean in.csv -o out.parquet --mask email:hash --cluster vendor \
--report quality.json --lineage lineage.json --fail-under-trust 80
freshdata trust in.csv --fail-under 90
freshdata profile in.csv --json
| name | purpose |
|---|---|
| clean_enterprise(df, *, enterprise=…, clean_config=…, **opts) | full pipeline → EnterpriseResult |
| compute_trust_score(df) → TrustScore | 0–100 completeness / validity / uniqueness / consistency |
| merge_clusters(df, cols) / cluster_column(df, col) | key-collision + n-gram value merging |
| mask_dataframe(df, rules) → MaskReport | hash / redact / partial / regex-scrub / drop PII |
| run_semantic_validation(df, configs) → ValidationReport | reference / regex / API checks |
| LineageTracker / schema_of | OpenLineage-compatible transformation lineage |
| detect_label_issues / detect_outliers | optional Cleanlab wrappers |
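As background for the "hash" masking strategy, this is a minimal sketch of deterministic salted hashing with the standard library. It uses HMAC-SHA-256 keyed by the salt and truncates the digest to 16 hex characters; both choices are assumptions made for the example, and the README does not say which construction mask_dataframe uses.

```python
import hashlib
import hmac


def mask_hash(value, salt):
    """Replace a value with a keyed, truncated digest; missing values stay missing."""
    if value is None:
        return None
    digest = hmac.new(salt.encode("utf-8"), value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


emails = ["ann@example.com", "bob@example.com", "ann@example.com", None]
masked = [mask_hash(e, salt="demo-salt") for e in emails]
print(masked)
```

The same input and salt always give the same token, so joins and group-bys on the masked column still work, while a different salt yields unrelated tokens.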
Migrating from 0.2.x
Breaking: the default strategy changed from "auto" to "balanced".
| If you want… | Do this |
|---|---|
| Same behavior as freshdata 0.2 | fd.clean(df, strategy="aggressive") |
| Accuracy-first cleaning (recommended) | fd.clean(df) — new default |
| Representation repair only | fd.clean(df, strategy="conservative") |
strategy="auto" still works but emits a DeprecationWarning (alias for
"aggressive"). Other notable 0.3 changes:
- High-missing columns are preserved in balanced mode (not dropped).
- Outliers are flagged by default in balanced mode (not capped).
- KNN imputation runs only in aggressive mode.
- Target heuristics expanded (aqi, *_bucket, score, …).
- Action.model_id records which imputation/outlier model was chosen.
- fd.suggest_plan() / fd.compare_plans() / fd.compare_clean() preview and compare engine decisions.
Validated scenarios
Every fixture in tests/fixtures/ is run under conservative, balanced, and
aggressive strategies in CI. Use fd.compare_clean(df) to reproduce the
quality/efficiency matrix on your own data.
| Fixture | Rows | What it stress-tests |
|---|---|---|
| aqi_sample | 500 | Real AQI panel slice — targets, pollutants, outliers |
| large_panel | 3,000 | AQI-shaped panel at scale — perf + preserve rules |
| sales_export | 200 | CRM export — currency strings, whitespace, dupes |
| survey_responses | 150 | High missing categoricals, free-text notes |
| sensor_timeseries | 120 | Datetime readings, time-ordered fills |
| fraud_signals | 180 | Domain-sensitive scores — outliers preserved |
| tiny_cohort | 12 | Small frame gate — preserve, don't drop |
| wide_sparse | 200×20 | Sparse columns — balanced never drops |
| duplicate_heavy | 260 | ~30% duplicate rows — layer-1 dedup |
| locale_numbers | 100 | European decimals — must not auto-convert |
| mixed_roles | 100 | Misnamed target, free text, id-like columns |
Online datasets (50 curated)
Fifty real public datasets are catalogued in tests/fixtures/online/registry.json.
Pinned URLs and sha256 hashes live in manifest.json; cached CSV
slices in tests/fixtures/online/cache/ power CI (no network). Formats include CSV, TSV, JSON, and ZIP.
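Pinning sha256 hashes means a cached slice can be checked offline before a test trusts it. The sketch below shows that general idea with the standard library; the manifest layout (a mapping of name to "file" and "sha256") and the `verify` helper are assumptions for illustration, since the real manifest.json schema is not shown here.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 16):
    """Stream a file through SHA-256 so large slices need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify(cache_dir, manifest):
    """Return names whose cached file is missing or does not match the pinned hash."""
    bad = []
    for name, entry in manifest.items():
        path = Path(cache_dir) / entry["file"]
        if not path.exists() or sha256_of(path) != entry["sha256"]:
            bad.append(name)
    return bad


# Tiny self-contained demo with a throwaway cache directory.
with tempfile.TemporaryDirectory() as tmp:
    payload = b"a,b\n1,2\n"
    (Path(tmp) / "titanic.csv").write_bytes(payload)
    manifest = {
        "titanic": {"file": "titanic.csv", "sha256": hashlib.sha256(payload).hexdigest()},
        "iris": {"file": "iris.csv", "sha256": "0" * 64},  # file absent on purpose
    }
    mismatches = verify(tmp, manifest)

print(mismatches)  # ['iris']
```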
| Tier | Count | CI scope |
|---|---|---|
| Tier 1 (anchors) | 10 | Full expectations + golden snapshots + live URL checks |
| Tier 2 | 40 | Smoke tests (all strategies run, basic invariants) |
Tier 1 anchors: titanic, wine_quality, adult_income, air_quality_uci, iris,
loan_approval, heart_cleveland, bank_marketing, mushroom, weather_json.
Domain coverage: UCI classics, GitHub mirrors, environmental panels (OWID), finance/census, JSON-native (Vega datasets), medical, and high-dimensional numeric sets.
Refresh cached slices:
python scripts/fetch_online_fixtures.py --discover --update-manifest
python scripts/fetch_online_fixtures.py --refresh --only titanic
python scripts/search_datasets.py --tag missing --domain finance
python scripts/search_datasets.py --format json
Debug, explain, and compare:
python scripts/debug_datasets.py --online --explain titanic
python scripts/debug_datasets.py --infer-roles --online adult_income
python scripts/debug_datasets.py --search missing --online
python benchmarks/bench.py --online-all --compare
python benchmarks/bench.py --online-all --tier 1
Reverse-engineering APIs:
import freshdata as fd
# Infer column roles before cleaning
print(fd.infer_roles(df))
# Explain what clean() did and why
explanation = fd.explain_clean(df, strategy="balanced")
print(explanation.summary())
print(explanation.roles)
Polars adapter (optional extra):
pip install "freshdata-cleaner[polars]"
import polars as pl
cleaned = fd.clean(pl_df) # returns pl.DataFrame when input is Polars
Live URL validation (network required, not default CI):
pytest -m online tests/test_online_datasets.py
pytest -m tier1 tests/test_online_datasets.py
Compare cleaning across strategies
import freshdata as fd
# Actual outcomes: missing after, duration, models used
print(fd.compare_clean(df))
# Planned models + optional actual metrics
print(fd.compare_plans(df, include_metrics=True))
Performance expectations
Typical throughput on a modern laptop (see tests/fixtures/perf/baselines.json):
| Dataset size | Balanced | Aggressive |
|---|---|---|
| 500 rows | <0.5s | <1s |
| 3,000 rows | <2.5s | <6s |
| 29k rows (full AQI) | <5s | KNN gated |
Run benchmarks:
python benchmarks/bench.py --fixtures --compare # all fixtures, side-by-side
pytest -m large # optional full AQI.csv (set FRESHDATA_AQI_PATH)
Performance is achieved via vectorized pandas/NumPy and one-pass engine caching (correlation matrix, column contexts). A C extension is not used — profiling showed the bottleneck was KNN on large frames (now gated to aggressive mode only).
Development
git clone https://github.com/FreshCode-Org/freshdata
cd freshdata
pip install -e ".[dev,ml,polars]"
pytest
ruff check src tests
mypy src/freshdata
Update golden report snapshots after intentional engine changes:
pytest tests/test_golden.py tests/test_online_datasets.py --update-golden
Benchmarks: python benchmarks/bench.py (synthetic),
python benchmarks/bench.py --fixtures --compare (11 local scenario fixtures), or
python benchmarks/bench.py --online --compare (6 online cached datasets).
Optional large-file benchmark (29k-row AQI.csv, not committed to repo):
export FRESHDATA_AQI_PATH=/path/to/AQI.csv
pytest -m large
License
MIT — see LICENSE.