Fast, safe, automatic data cleaning for real-world tabular data.
freshdata
Automated DataFrame cleaning for pandas — explainable, safe, and production-ready.
One call turns a messy CSV, Excel, or SQL export into analysis- and ML-ready data — and tells you exactly what it changed and why.
freshdata is an automated data-cleaning library for Python that does real,
intelligent preprocessing of real-world tabular data. It is not a fillna
wrapper: a rule-based decision engine profiles every column (missing ratio,
dtype, skewness, cardinality, inferred role) and chooses the right action per
column — then logs a rationale, a risk level, and a confidence score for each
decision so nothing happens silently.
import pandas as pd
import freshdata as fd
df = pd.read_csv("export.csv")
cleaned = fd.clean(df) # one line
cleaned, report = fd.clean(df, return_report=True) # ... with a full audit trail
print(report.summary())
freshdata clean report
rows: 525 -> 500 (-25)
columns: 7 -> 6 (-1)
missing: 421 -> 0 cell(s)
memory: 100.8 KB -> 89.2 KB
time: 0.017s
engine: 25 duplicate row(s) removed; 20 outlier(s) flagged; imputed: age, segment
actions (7):
- [fix_dtypes] 'mostly_gone': converted to Int64
- [drop_duplicates] dropped 25 duplicate row(s) (4.8% of rows, keep='first')
- [missing] 'age': filled 12 missing value(s) with median (39.6846)
- [missing] 'segment': filled 90 missing value(s) with sentinel "Missing" ('Missing')
- [missing] 'mostly_gone': preserved 300 missing value(s)
- [outliers] 'amount': flagged 15 outlier(s), 3.0% of values (method=iqr, factor=1.5) in new column 'amount_outlier'
- [outliers] 'age': flagged 5 outlier(s), 1.0% of values (method=iqr, factor=1.5) in new column 'age_outlier'
review (1):
? column 'mostly_gone' preserved at 60.0% missing in balanced mode
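The report above shows outliers being flagged in a new column rather than removed (method=iqr, factor=1.5). As a rough illustration of that general technique, and not freshdata's actual implementation, the plain-Python sketch below marks values that fall outside the interquartile fences. The function name `iqr_flags` and the sample values are made up for this example.

```python
from statistics import quantiles

def iqr_flags(values, factor=1.5):
    """Return one boolean per value: True when it lies outside the IQR fences."""
    present = [v for v in values if v is not None]
    # method="inclusive" interpolates linearly between the observed min and max.
    q1, _, q3 = quantiles(present, n=4, method="inclusive")
    spread = q3 - q1
    low, high = q1 - factor * spread, q3 + factor * spread
    # Missing values are never flagged; flagged values are kept, not dropped.
    return [v is not None and not (low <= v <= high) for v in values]

amounts = [10, 12, 11, 13, 12, 11, 250, None]
flags = iqr_flags(amounts)
print(list(zip(amounts, flags)))
```

Keeping the flags next to the data, instead of deleting rows, leaves the decision about extreme values to whoever reads the report.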
✨ Key features
- Automated DataFrame cleaning in one call: fd.clean(df) handles missing values, outliers, duplicates, dtype repair, and messy column names.
- Per-column decision engine: infers each column's role (id, target, datetime, free text, categorical, numeric) and applies explicit, documented threshold rules instead of one blunt global strategy.
- Explainable by design: every decision carries a rationale, risk level, and confidence score. If a NaN survives, the report says exactly why.
- Safe defaults: never imputes an identifier, never modifies a target/label column, never force-fills free text, never removes outliers blindly.
- AI-ready preprocessing: produces clean, typed, leakage-aware frames ready for scikit-learn, XGBoost, or any ML pipeline.
- Data profiling: fd.profile(df) gives read-only data-quality insight using the same inference code as clean, so previews are faithful.
- pandas-first, Polars-optional: pandas + NumPy core; pass a Polars frame and get a Polars frame back when the optional adapter is installed.
- Enterprise layer: opt-in fuzzy clustering, PII masking, semantic validation, a 0–100 Data Trust Score, OpenLineage metadata, and a batch CLI.
- Typed, tested, fast: fully type-hinted (py.typed), 800+ tests, 95%+ coverage, vectorized pandas/NumPy throughout.
🤔 Why FreshData exists
Most data-cleaning code is hand-written, one-off, and silent. People reach for
df.dropna() or df.fillna(0) and quietly corrupt their analysis — imputing an
ID, leaking a target, or deleting the very outliers that were the signal.
General-purpose tools don't fix this:
- pandas gives you primitives, not decisions — you still write every rule.
- profiling tools (sweetviz, ydata-profiling) describe data but don't clean it.
- validation tools (Great Expectations) check data but don't repair it.
freshdata fills the gap: an opinionated engine that makes the right cleaning
decision per column and explains it, so you get reproducible, auditable,
ML-ready data without writing — or trusting — yet another bespoke script.
📦 Installation
pip install freshdata-cleaner # pandas + numpy only
pip install "freshdata-cleaner[ml]" # + scikit-learn (KNN imputation, IsolationForest)
pip install "freshdata-cleaner[enterprise]" # + polars, pyarrow, requests, pyyaml (enterprise layer + CLI)
pip install "freshdata-cleaner[all]" # everything, including cleanlab
Requires Python ≥ 3.9 and pandas ≥ 1.5. Verify the install:
python -c "import freshdata as fd; print(fd.__version__)"
🚀 Quickstart
import pandas as pd
import freshdata as fd
df = pd.read_csv("messy_export.csv")
# Clean with sensible, explainable defaults
cleaned, report = fd.clean(df, return_report=True)
print(report.summary()) # human-readable audit trail
report.to_frame() # decisions as a DataFrame
report.to_dict() # JSON-friendly for logging / dashboards
Preview the engine's choices before touching your data:
print(fd.profile(df)) # read-only data-quality report
print(fd.suggest_plan(df).summary()) # the exact plan clean() would run
print(fd.compare_plans(df)) # strategies side by side
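To make the idea of a read-only profile concrete, here is a minimal plain-Python sketch of the per-column signals this README mentions (missing ratio, cardinality, skewness). It is an illustration under assumptions, not what fd.profile computes: the function `profile_column`, the dictionary layout, and the sample table are all hypothetical.

```python
def profile_column(values):
    """Summarise one column without modifying it."""
    present = [v for v in values if v is not None]
    summary = {
        "missing_ratio": (len(values) - len(present)) / len(values) if values else 0.0,
        "cardinality": len(set(present)),
        "skewness": None,  # only filled in for all-numeric columns
    }
    numeric = [v for v in present
               if isinstance(v, (int, float)) and not isinstance(v, bool)]
    if len(numeric) == len(present) and len(numeric) >= 3:
        mean = sum(numeric) / len(numeric)
        m2 = sum((v - mean) ** 2 for v in numeric) / len(numeric)
        m3 = sum((v - mean) ** 3 for v in numeric) / len(numeric)
        # Population skewness: third central moment over variance ** 1.5.
        summary["skewness"] = m3 / m2 ** 1.5 if m2 > 0 else 0.0
    return summary

table = {"age": [31, 35, None, 40, 95], "segment": ["a", "b", "a", None, None]}
report = {name: profile_column(col) for name, col in table.items()}
for name, stats in report.items():
    print(name, stats)
```

Signals like these are the kind of input a rule-based engine could use to pick between, say, a mean and a median fill.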
🔁 Before vs after
| Before: raw export | After: fd.clean(df) |
|---|---|
| whitespace, … | snake_case names, real … |
Every one of those changes appears in report.summary() with a rationale, risk
level, and confidence score — no silent mutations.
🧩 Core API
| name | purpose |
|---|---|
| `fd.clean(df, *, return_report=False, config=None, **options)` | clean, optionally returning a CleanReport |
| `fd.profile(df, *, include_plan=False, **options)` | read-only inspection with actionable issues |
| `fd.suggest_plan(df, **options)` | dry-run: primary + alternative models per column |
| `fd.compare_plans(df, *, strategies=...)` | side-by-side models across strategies |
| `fd.compare_clean(df, *, strategies=...)` | side-by-side actual clean outcomes |
| `fd.explain_clean(df, **options)` | what clean() did and why, plus inferred roles |
| `fd.Cleaner(config=None, **options)` | reusable configured pipeline (.clean(), .report_) |
| `fd.CleanConfig` | frozen dataclass holding every option |
| `fd.CleanReport` / `fd.Action` | audit trail with rationale / risk / confidence |
# Tune the engine: explicit choices always override the defaults
cleaned, report = fd.clean(  # return_report=True returns (frame, report)
    df,
    strategy="balanced",            # or "aggressive" | "conservative"
    target_column="churn",          # never modified (no leakage)
    id_columns=("customer_id",),    # never imputed
    preserve_columns=("notes",),    # never dropped
    outlier_method="iqr",           # or "zscore" | "auto" | "isolation_forest"
    return_report=True,
)
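# Reusable pipeline across many files
import logging
from pathlib import Path

log = logging.getLogger(__name__)
paths = sorted(Path("exports").glob("*.csv"))  # hypothetical input folder
cleaner = fd.Cleaner(target_column="churn")
for path in paths:
    out = cleaner.clean(pd.read_csv(path))
    log.info(cleaner.report_.summary())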
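The table above describes fd.CleanConfig as a frozen dataclass, and the snippet notes that explicit choices override the defaults. The sketch below shows how that general pattern can work in plain Python. `SketchConfig` and `resolve` are hypothetical stand-ins written for this example; they are not freshdata's classes and cover only a few of the options shown above.

```python
from dataclasses import FrozenInstanceError, dataclass, replace
from typing import Optional, Tuple

@dataclass(frozen=True)
class SketchConfig:
    """Hypothetical options object; NOT freshdata's CleanConfig."""
    strategy: str = "balanced"
    target_column: Optional[str] = None
    id_columns: Tuple[str, ...] = ()
    outlier_method: str = "iqr"

def resolve(config=None, **options):
    """Start from defaults (or a given config) and let explicit options win."""
    base = config if config is not None else SketchConfig()
    return replace(base, **options)  # unknown option names raise TypeError

shared = SketchConfig(target_column="churn")
tuned = resolve(shared, strategy="conservative")

try:
    shared.strategy = "aggressive"  # frozen instances reject mutation
except FrozenInstanceError:
    mutated = False
else:
    mutated = True

print(tuned, mutated)
```

Because the config is immutable, a shared instance can be reused across many files without one run changing the settings of the next.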
How the cleaning engine works (two layers)
Layer 1 — representation repair (always on):
| order | step | what it does |
|---|---|---|
| 1 | column_names | snake_case names, deduplicate collisions ("a", "a" → "a", "a_2") |
| 2 | strip_whitespace | trim surrounding whitespace in text cells |
| 3 | normalize_sentinels | "N/A", "null", "-", "", "#REF!", … → missing |
| 4 | drop_empty_columns / drop_empty_rows | remove all-missing columns and rows |
| 5 | fix_dtypes | text → numeric ("$1,234.56" works) / datetime / boolean, validated |
| 6 | drop_duplicates | resolve duplicate rows (first/last/drop/aggregate) |
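A few of the Layer 1 steps are easy to picture in isolation. The following plain-Python sketch approximates steps 1, 2, 3, and the numeric part of step 5 on individual strings. It is only an illustration: the helper names are invented here, the sentinel set contains just the tokens listed in the table, and freshdata's own rules are presumably broader and vectorized.

```python
import re

SENTINELS = {"n/a", "null", "-", "", "#ref!"}  # only the tokens listed above

def snake_case(name):
    """Lower-case a header and join its words with underscores."""
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name.strip())  # split camelCase
    return re.sub(r"[^0-9a-zA-Z]+", "_", name).strip("_").lower()

def clean_names(names):
    """snake_case every name, then suffix repeats with _2, _3, ..."""
    seen, out = {}, []
    for name in names:
        base = snake_case(name)
        seen[base] = seen.get(base, 0) + 1
        out.append(base if seen[base] == 1 else f"{base}_{seen[base]}")
    return out

def clean_cell(text):
    """Trim whitespace and map sentinel strings to None (missing)."""
    text = text.strip()
    return None if text.lower() in SENTINELS else text

def parse_number(text):
    """Parse '$1,234.56'-style text; return None when it is not numeric."""
    stripped = text.replace("$", "").replace(",", "").strip()
    try:
        return float(stripped)
    except ValueError:
        return None

print(clean_names(["Customer ID", " totalAmount ", "a", "a"]))
print(clean_cell(" N/A "), clean_cell("  Bob "))
print(parse_number("$1,234.56"), parse_number("abc"))
```

Returning None from `parse_number` instead of raising mirrors the "validated" idea in step 5: a column would only be converted if its values actually parse.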
Layer 2 — the decision engine (strategy="balanced", the default) infers
each column's role and applies explicit threshold rules:
| missing ratio | numeric | categorical | datetime |
|---|---|---|---|
| ≤ 5% | mean if ~normal & no outliers, else median | mode if clear majority, else "Unknown" | ffill/bfill if time-ordered |
| 5–30% | median (KNN only in aggressive mode) | mode if dominant, else "Missing" | ffill/bfill if time-ordered |
| > 30% | preserved + warning (balanced) | same | same |
Role gates run first: targets are never modified, IDs are never imputed,
free text is never force-filled. Outliers in ID/target columns,
preserve_columns, and domain-sensitive columns (AQI, pollutants, fraud/risk
names) are always preserved — there the extremes usually are the signal.
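The threshold rules and role gates above can be read as a small decision function. The sketch below encodes them that way for illustration. The function `choose_fill`, its parameters, and its return labels are hypothetical; cases the table does not spell out (a datetime column that is not time-ordered, or the > 30% row outside balanced mode) fall back to preserving the gaps as an assumption of this sketch.

```python
def choose_fill(role, kind, missing_ratio, *, strategy="balanced",
                near_normal=False, has_outliers=False,
                dominant_mode=False, time_ordered=False):
    """Return a label for the fill action suggested by the table above."""
    # Role gates run before any threshold rule.
    if role in {"target", "id", "free_text"}:
        return "preserve"
    if missing_ratio == 0:
        return "none"
    if missing_ratio > 0.30:
        # Documented for balanced mode; assumed here for other strategies too.
        return "preserve+warn"
    low = missing_ratio <= 0.05
    if kind == "numeric":
        if low:
            return "mean" if near_normal and not has_outliers else "median"
        # KNN is only an option in aggressive mode.
        return "knn" if strategy == "aggressive" else "median"
    if kind == "categorical":
        if dominant_mode:
            return "mode"
        return "sentinel:Unknown" if low else "sentinel:Missing"
    if kind == "datetime" and time_ordered:
        return "ffill/bfill"
    return "preserve"  # assumption: anything else is left untouched

# Ratios taken from the sample report earlier in this README.
print(choose_fill("feature", "numeric", 12 / 500, has_outliers=True))
print(choose_fill("feature", "categorical", 90 / 500))
print(choose_fill("feature", "numeric", 300 / 500))
```

With those inputs the sketch picks median, the "Missing" sentinel, and preserve-with-warning, which lines up with the actions listed for age, segment, and mostly_gone in the sample report.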
⚡ Performance highlights
Typical run times on a modern laptop (vectorized pandas/NumPy with one-pass engine caching; no C extension required):
| Dataset size | Balanced | Aggressive |
|---|---|---|
| 500 rows | < 0.5 s | < 1 s |
| 3,000 rows | < 2.5 s | < 6 s |
| 29k rows (full AQI) | < 5 s | KNN gated |
python benchmarks/bench.py --fixtures --compare # all fixtures, side by side
📊 How FreshData compares
| Capability | freshdata | pandas | pyjanitor | Great Expectations | sweetviz | cleanlab |
|---|---|---|---|---|---|---|
| One-call automatic cleaning | ✅ | ❌ | ➖ | ❌ | ❌ | ❌ |
| Per-column decisions by inferred role | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
| Missing-value imputation (smart) | ✅ | ➖ | ➖ | ❌ | ❌ | ❌ |
| Outlier detection & handling | ✅ | ❌ | ❌ | ➖ | ➖ | ✅ |
| Duplicate resolution | ✅ | ➖ | ✅ | ❌ | ❌ | ❌ |
| Dtype / format repair | ✅ | ➖ | ✅ | ❌ | ❌ | ❌ |
| Explainable audit trail | ✅ | ❌ | ❌ | ➖ | ❌ | ➖ |
| Data profiling | ✅ | ➖ | ❌ | ➖ | ✅ | ❌ |
| Data validation / quality gates | ✅¹ | ❌ | ❌ | ✅ | ❌ | ❌ |
| PII masking | ✅¹ | ❌ | ❌ | ❌ | ❌ | ❌ |
| Label-noise (ML) detection | ✅¹ | ❌ | ❌ | ❌ | ❌ | ✅ |
| Polars support | ✅ | ❌ | ❌ | ➖ | ❌ | ❌ |
✅ built-in · ➖ partial / manual · ❌ not a goal · ¹ via the optional enterprise layer
🌍 Real-world use cases
- ML preprocessing: turn raw CSVs into leakage-aware, typed feature matrices before scikit-learn / XGBoost, without imputing IDs or touching the label.
- Analytics & BI ingestion: clean CRM, finance, and survey exports (currency strings, N/A sentinels, duplicate rows) on the way into a warehouse.
- Data-quality gates in ETL: run the enterprise CLI in Airflow/Prefect/cron; fail the job when the Data Trust Score drops below a threshold.
- Exploratory data analysis (EDA): fd.profile(df) surfaces missingness, dtype issues, and duplicates before you commit to a modeling approach.
- Notebook hygiene: replace ad-hoc dropna/fillna cells with one auditable, reproducible call.
🛠️ Example pipeline
import pandas as pd
import freshdata as fd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
raw = pd.read_csv("customers.csv")
# 1. Clean with the target protected from leakage
clean_df, report = fd.clean(raw, target_column="churn", return_report=True)
assert not report.warnings, report.warnings # gate on data quality
# 2. Split & model on AI-ready data
X = pd.get_dummies(clean_df.drop(columns="churn"))
y = clean_df["churn"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
print("accuracy:", model.score(X_te, y_te))
See examples/ for 8 runnable scripts and notebooks/
for narrated walkthroughs.
Enterprise layer — clustering, PII masking, trust scores, lineage, CLI
from freshdata.enterprise import (
    clean_enterprise, EnterpriseConfig, ClusterConfig, MaskingRule, SemanticValidatorConfig,
)
ec = EnterpriseConfig(
    enable_clustering=True,
    clustering=ClusterConfig(columns=("vendor",)),  # merge "Acme Inc" / "ACME inc"
    masking=(MaskingRule(name="pii", columns=("email",), strategy="hash", salt="…"),),
    semantic=(SemanticValidatorConfig(name="iso", kind="reference",
                                      columns=("country",), reference=("US", "CA", "GB")),),
    fail_under_trust=80,  # quality gate
)
result = clean_enterprise(df, enterprise=ec) # df may be pandas OR polars
print(result.quality.to_markdown()) # before/after trust report
result.lineage.emit("lineage.json") # OpenLineage RunEvents
assert result.passed_gate
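For readers unfamiliar with the three ideas configured above, here is a simplified plain-Python sketch of each: salted hashing, merging near-identical spellings, and checking values against a reference list. None of this is the enterprise layer's code. All helper names are hypothetical, and the clustering shown is exact matching on a normalized key, which is a much cruder technique than real fuzzy clustering.

```python
import hashlib
import re

def mask_hash(value, salt):
    """Replace a value with a salted SHA-256 digest (same input, same token)."""
    if value is None:
        return None
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()

def cluster_key(name):
    """Collapse case, punctuation, and spacing so close spellings share a key."""
    return re.sub(r"[^a-z0-9]+", " ", name.casefold()).strip()

def cluster(names):
    """Map every spelling to the first spelling seen with the same key."""
    canonical = {}
    return [canonical.setdefault(cluster_key(n), n) for n in names]

def invalid_codes(values, reference):
    """Return the values that are absent from the reference set."""
    allowed = set(reference)
    return [v for v in values if v not in allowed]

token = mask_hash("a@example.com", salt="demo-salt")  # demo salt, not a real secret
print(len(token))
print(cluster(["Acme Inc", "ACME inc", "Globex"]))
print(invalid_codes(["US", "FR", "GB"], ("US", "CA", "GB")))
```

Hashing with a salt keeps masked values joinable (equal inputs give equal tokens) while making the tokens differ between deployments that use different salts.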
Batch CLI (exits non-zero when the trust gate fails):
freshdata clean in.csv -o out.parquet --mask email:hash --cluster vendor \
--report quality.json --lineage lineage.json --fail-under-trust 80
freshdata trust in.csv --fail-under 90
freshdata profile in.csv --json
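The gate behaviour described above (exit non-zero when a score falls below a threshold) is a common pattern for ETL jobs. The sketch below shows the pattern only. The scoring function is a deliberately naive, HYPOTHETICAL stand-in (share of non-missing cells) and says nothing about how the Data Trust Score is actually computed.

```python
def completeness_score(rows):
    """HYPOTHETICAL 0-100 score: the share of cells that are not None."""
    cells = [cell for row in rows for cell in row]
    if not cells:
        return 100.0
    return 100.0 * sum(cell is not None for cell in cells) / len(cells)

def gate(score, fail_under):
    """Return a process exit code: 0 when the score passes, 1 otherwise."""
    return 0 if score >= fail_under else 1

rows = [["US", 10], ["CA", None], [None, 7], ["GB", 3]]
score = completeness_score(rows)  # 6 of 8 cells are present
exit_code = gate(score, fail_under=80)
print(f"score={score:.1f} exit_code={exit_code}")
# A real script would finish with sys.exit(exit_code) so the scheduler sees the failure.
```

Schedulers such as Airflow or cron treat a non-zero exit code as a failed task, which is what lets a quality gate stop a pipeline.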
📚 Documentation
Full documentation lives at https://freshcode-org.github.io/freshdata/.
🤝 Contributing
Contributions are welcome! Please read CONTRIBUTING.md and our Code of Conduct. Quick start:
git clone https://github.com/FreshCode-Org/freshdata
cd freshdata
pip install -e ".[dev,ml,polars]"
pre-commit install
pytest && ruff check src tests && mypy src/freshdata
Security issues: see SECURITY.md for private disclosure.
🗺️ Roadmap
- Per-column decision engine with explainable reports (0.3)
- Enterprise layer: clustering, masking, trust score, lineage, CLI (0.4)
- Documentation site + examples + packaging governance (0.5)
- Pluggable custom cleaning rules / strategy registry
- Native Polars cleaning engine (beyond the adapter)
- HTML/interactive profiling report
- Config-as-YAML for the core cleaner (not just the CLI)
- 1.0 — stable public API
Have an idea? Open a discussion or issue.
📄 License
MIT — see LICENSE.
👤 Maintainer
Built and maintained by Johnny Wilson Dougherty (@JohnnyWilson-Portfolio).
If freshdata saves you time, please ⭐ the
repository — it genuinely helps
others discover the project.