
Fast, safe, automatic data cleaning for real-world tabular data.


freshdata

Automated DataFrame cleaning for pandas — explainable, safe, and production-ready.

One call turns a messy CSV, Excel, or SQL export into analysis- and ML-ready data — and tells you exactly what it changed and why.


Documentation · Quickstart · API Reference · Examples · Changelog


freshdata is an automated data-cleaning library for Python built for real-world tabular data. It is not a fillna wrapper: a rule-based decision engine profiles every column (missing ratio, dtype, skewness, cardinality, inferred role) and chooses an action per column, then logs a rationale, a risk level, and a confidence score for each decision so nothing happens silently.

import pandas as pd
import freshdata as fd

df = pd.read_csv("export.csv")

cleaned = fd.clean(df)                              # one line
cleaned, report = fd.clean(df, return_report=True)  # ... with a full audit trail
print(report.summary())
freshdata clean report
  rows:    525 -> 500 (-25)
  columns: 7 -> 6 (-1)
  missing: 421 -> 0 cell(s)
  memory:  100.8 KB -> 89.2 KB
  time:    0.017s
  engine:  25 duplicate row(s) removed; 20 outlier(s) flagged; imputed: age, segment
  actions (7):
    - [fix_dtypes] 'mostly_gone': converted to Int64
    - [drop_duplicates] dropped 25 duplicate row(s) (4.8% of rows, keep='first')
    - [missing] 'age': filled 12 missing value(s) with median (39.6846)
    - [missing] 'segment': filled 90 missing value(s) with sentinel "Missing" ('Missing')
    - [missing] 'mostly_gone': preserved 300 missing value(s)
    - [outliers] 'amount': flagged 15 outlier(s), 3.0% of values (method=iqr, factor=1.5) in new column 'amount_outlier'
    - [outliers] 'age': flagged 5 outlier(s), 1.0% of values (method=iqr, factor=1.5) in new column 'age_outlier'
  review (1):
    ? column 'mostly_gone' preserved at 60.0% missing in balanced mode
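
The outlier lines in the report mention method=iqr with factor=1.5 and a new boolean column such as 'amount_outlier'. The sketch below shows the standard IQR fence rule in plain pandas to make those numbers concrete. It is an illustration only, not freshdata's implementation; the helper name flag_iqr_outliers and the sample values are assumptions.

```python
import pandas as pd


def flag_iqr_outliers(series: pd.Series, factor: float = 1.5) -> pd.Series:
    """Return a boolean mask that is True outside the IQR fences.

    Hypothetical helper: illustrates the textbook IQR rule only.
    """
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - factor * iqr, q3 + factor * iqr
    # Comparisons with missing values are False, so NaN is never flagged.
    return (series < lower) | (series > upper)


# Made-up sample data: five typical amounts and one extreme value.
df = pd.DataFrame({"amount": [10.0, 12.0, 11.0, 13.0, 12.0, 500.0]})
df["amount_outlier"] = flag_iqr_outliers(df["amount"])
print(df)
```

Flagging into a separate column, rather than deleting rows, keeps the extreme values available for review.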

✨ Key features

  • Automated DataFrame cleaning in one call: fd.clean(df) handles missing values, outliers, duplicates, dtype repair, and messy column names.
  • Per-column decision engine — infers each column's role (id, target, datetime, free text, categorical, numeric) and applies explicit, documented threshold rules instead of one blunt global strategy.
  • Explainable by design — every decision carries a rationale, risk level, and confidence score. If a NaN survives, the report says exactly why.
  • Safe defaults — never imputes an identifier, never modifies a target/label column, never force-fills free text, never removes outliers blindly.
  • AI-ready preprocessing — produces clean, typed, leakage-aware frames ready for scikit-learn, XGBoost, or any ML pipeline.
  • Data profiling: fd.profile(df) gives read-only data-quality insight using the same inference code as clean, so previews are faithful.
  • pandas-first, Polars-optional — pandas + NumPy core; pass a Polars frame and get a Polars frame back when the optional adapter is installed.
  • Enterprise layer — opt-in fuzzy clustering, PII masking, semantic validation, a 0–100 Data Trust Score, OpenLineage metadata, and a batch CLI.
  • Typed, tested, fast — fully type-hinted (py.typed), 800+ tests, 95%+ coverage, vectorized pandas/NumPy throughout.

🤔 Why FreshData exists

Most data-cleaning code is hand-written, one-off, and silent. People reach for df.dropna() or df.fillna(0) and quietly corrupt their analysis — imputing an ID, leaking a target, or deleting the very outliers that were the signal. General-purpose tools don't fix this:

  • pandas gives you primitives, not decisions — you still write every rule.
  • profiling tools (sweetviz, ydata-profiling) describe data but don't clean it.
  • validation tools (Great Expectations) check data but don't repair it.

freshdata fills the gap: an opinionated engine that makes the right cleaning decision per column and explains it, so you get reproducible, auditable, ML-ready data without writing — or trusting — yet another bespoke script.

📦 Installation

pip install freshdata-cleaner                 # pandas + numpy only
pip install "freshdata-cleaner[ml]"           # + scikit-learn (KNN imputation, IsolationForest)
pip install "freshdata-cleaner[enterprise]"   # + polars, pyarrow, requests, pyyaml (enterprise layer + CLI)
pip install "freshdata-cleaner[all]"          # everything, including cleanlab

Requires Python ≥ 3.9 and pandas ≥ 1.5. Verify the install:

python -c "import freshdata as fd; print(fd.__version__)"

🚀 Quickstart

import pandas as pd
import freshdata as fd

df = pd.read_csv("messy_export.csv")

# Clean with sensible, explainable defaults
cleaned, report = fd.clean(df, return_report=True)

print(report.summary())        # human-readable audit trail
report.to_frame()              # decisions as a DataFrame
report.to_dict()               # JSON-friendly for logging / dashboards

Preview the engine's choices before touching your data:

print(fd.profile(df))                    # read-only data-quality report
print(fd.suggest_plan(df).summary())     # the exact plan clean() would run
print(fd.compare_plans(df))              # strategies side by side
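
To give a feel for the per-column signals the engine is said to profile (missing ratio, dtype, skewness, cardinality), here is a minimal read-only sketch in plain pandas. The function profile_columns and its output layout are hypothetical; the real fd.profile output may look quite different.

```python
import pandas as pd

PROFILE_FIELDS = ["column", "dtype", "missing_ratio", "cardinality", "skewness"]


def profile_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Summarise each column without modifying the input frame.

    Hypothetical helper: not part of freshdata.
    """
    rows = []
    for name in df.columns:
        col = df[name]
        is_numeric = pd.api.types.is_numeric_dtype(col)
        rows.append({
            "column": name,
            "dtype": str(col.dtype),
            "missing_ratio": float(col.isna().mean()),
            "cardinality": int(col.nunique(dropna=True)),
            # Skewness is only computed for numeric columns.
            "skewness": float(col.skew()) if is_numeric else None,
        })
    return pd.DataFrame(rows, columns=PROFILE_FIELDS).set_index("column")


# Made-up sample frame with one missing value per column.
sample = pd.DataFrame({
    "age": [34, None, 41, 29],
    "segment": ["a", "b", None, "a"],
})
summary = profile_columns(sample)
print(summary)
```

Because the helper only reads from the frame, it can be run before any cleaning step to decide which options to pass.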

🔁 Before vs after

Before (raw export):

| First Name | AGE | Salary($) | empty |
| --- | --- | --- | --- |
| Ann | 34 | $1,200.50 | |
| Bob | N/A | - | |
| Bob | N/A | - | |
| Cara | 41 | $2,000 | |

whitespace, N/A and - sentinels, currency strings, an all-empty column, a duplicate row, text dtypes

After (fd.clean(df)):

| first_name | age | salary | age_was_missing |
| --- | --- | --- | --- |
| Ann | 34 | 1200.50 | False |
| Bob | 38 | Missing | True |
| Cara | 41 | 2000.00 | False |

snake_case names, real Int64/float64 dtypes, sentinels → missing → imputed, duplicate dropped, empty column removed, a missingness indicator added

Every one of those changes appears in report.summary() with a rationale, risk level, and confidence score — no silent mutations.
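
For comparison, the same before/after transformation can be written by hand in plain pandas. This is a rough sketch of what the individual steps amount to, not how freshdata performs them. The stray whitespace in the sample cells, the SENTINELS list, and the snake_case helper are assumptions made for the illustration, and the salary gap is simply left missing.

```python
import re

import pandas as pd

SENTINELS = ["N/A", "-", ""]  # assumed subset of the sentinel strings


def snake_case(name: str) -> str:
    """Lowercase a header and collapse non-alphanumeric runs to underscores."""
    return re.sub(r"[^0-9a-z]+", "_", name.strip().lower()).strip("_")


raw = pd.DataFrame({
    " First Name ": ["Ann ", "Bob", "Bob", " Cara"],
    "AGE": ["34", "N/A", "N/A", "41"],
    "Salary($)": ["$1,200.50", "-", "-", "$2,000"],
    "empty": ["", "", "", ""],
})

df = raw.rename(columns=snake_case)
df = df.apply(lambda col: col.str.strip())      # every column here is text
df = df.mask(df.isin(SENTINELS))                # sentinels become missing
df = df.dropna(axis="columns", how="all")       # drop the all-empty column
df["age"] = pd.to_numeric(df["age"]).astype("Int64")
df["salary"] = pd.to_numeric(df["salary"].str.replace(r"[$,]", "", regex=True))
df = df.drop_duplicates(keep="first").reset_index(drop=True)

# Record where age was missing, then fill with the rounded median.
df["age_was_missing"] = df["age"].isna()
median_age = int(round(float(df["age"].median())))
df["age"] = df["age"].fillna(median_age)
print(df)
```

Even for four rows the manual version needs about ten order-dependent steps, which is the kind of bespoke script the one-call API is meant to replace.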

🧩 Core API

| name | purpose |
| --- | --- |
| fd.clean(df, *, return_report=False, config=None, **options) | clean, optionally returning a CleanReport |
| fd.profile(df, *, include_plan=False, **options) | read-only inspection with actionable issues |
| fd.suggest_plan(df, **options) | dry-run: primary + alternative models per column |
| fd.compare_plans(df, *, strategies=...) | side-by-side models across strategies |
| fd.compare_clean(df, *, strategies=...) | side-by-side actual clean outcomes |
| fd.explain_clean(df, **options) | what clean() did and why, plus inferred roles |
| fd.Cleaner(config=None, **options) | reusable configured pipeline (.clean(), .report_) |
| fd.CleanConfig | frozen dataclass holding every option |
| fd.CleanReport / fd.Action | audit trail with rationale / risk / confidence |

# Tune the engine — explicit choices always override the defaults
cleaned = fd.clean(
    df,
    strategy="balanced",          # "aggressive" | "conservative"
    target_column="churn",        # never modified (no leakage)
    id_columns=("customer_id",),  # never imputed
    preserve_columns=("notes",),  # never dropped
    outlier_method="iqr",         # "zscore" | "auto" | "isolation_forest"
    return_report=True,
)

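# Reusable pipeline across many files
import logging
from pathlib import Path

log = logging.getLogger(__name__)
paths = sorted(Path("data").glob("*.csv"))  # assumed location of the input files

cleaner = fd.Cleaner(target_column="churn")
for path in paths:
    out = cleaner.clean(pd.read_csv(path))
    log.info(cleaner.report_.summary())

How the cleaning engine works (two layers)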

Layer 1 — representation repair (always on):

| order | step | what it does |
| --- | --- | --- |
| 1 | column_names | snake_case names, deduplicate collisions ("a", "a" → "a", "a_2") |
| 2 | strip_whitespace | trim surrounding whitespace in text cells |
| 3 | normalize_sentinels | "N/A", "null", "-", "", "#REF!", … → missing |
| 4 | drop_empty_columns / drop_empty_rows | remove all-missing columns and rows |
| 5 | fix_dtypes | text → numeric ("$1,234.56" works) / datetime / boolean, validated |
| 6 | drop_duplicates | resolve duplicate rows (first/last/drop/aggregate) |
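
As an illustration of step 1, the following sketch normalises headers to snake_case and resolves collisions with numeric suffixes, so that "a", "a" becomes "a", "a_2". The function clean_column_names and the "column" fallback for empty headers are assumptions for this example; freshdata's actual rules may differ in detail.

```python
import re
from typing import Iterable, List


def clean_column_names(names: Iterable[object]) -> List[str]:
    """snake_case each header and make the results unique.

    Hypothetical helper: a repeated name gets the suffix _2, _3, ...
    """
    used = set()
    result = []
    for raw in names:
        base = re.sub(r"[^0-9a-z]+", "_", str(raw).strip().lower()).strip("_")
        base = base or "column"  # assumed fallback for headers with no letters or digits
        candidate, n = base, 1
        # Keep incrementing the suffix until the name is unused.
        while candidate in used:
            n += 1
            candidate = f"{base}_{n}"
        used.add(candidate)
        result.append(candidate)
    return result


print(clean_column_names(["a", "a"]))
print(clean_column_names(["First Name", "first_name", "AGE", "Salary($)"]))
```

Checking each candidate against the set of names already used also covers the case where a generated suffix would clash with a header that already exists.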

Layer 2 — the decision engine (strategy="balanced", the default) infers each column's role and applies explicit threshold rules:

| missing ratio | numeric | categorical | datetime |
| --- | --- | --- | --- |
| ≤ 5% | mean if ~normal & no outliers, else median | mode if clear majority, else "Unknown" | ffill/bfill if time-ordered |
| 5–30% | median (KNN only in aggressive mode) | mode if dominant, else "Missing" | ffill/bfill if time-ordered |
| > 30% | preserved + warning (balanced) | same | same |

Role gates run first: targets are never modified, IDs are never imputed, free text is never force-filled. Outliers in ID/target columns, preserve_columns, and domain-sensitive columns (AQI, pollutants, fraud/risk names) are always preserved — there the extremes usually are the signal.
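
The numeric column of the threshold table can be read as a small decision function. The sketch below encodes those balanced-mode thresholds in plain pandas. It is not freshdata's code: the name choose_numeric_fill, the skewness cutoff of 0.5 used to stand in for "~normal", and the IQR test for "no outliers" are all assumptions.

```python
import pandas as pd


def choose_numeric_fill(
    series: pd.Series, *, skew_limit: float = 0.5, iqr_factor: float = 1.5
) -> str:
    """Return 'none', 'mean', 'median', or 'preserve' for a numeric column.

    Hypothetical helper mirroring the balanced-mode table above.
    """
    missing_ratio = series.isna().mean()
    if missing_ratio == 0:
        return "none"
    if missing_ratio > 0.30:
        return "preserve"  # too sparse to impute safely
    if missing_ratio > 0.05:
        return "median"
    observed = series.dropna()
    q1, q3 = observed.quantile(0.25), observed.quantile(0.75)
    fence = iqr_factor * (q3 - q1)
    has_outliers = bool(((observed < q1 - fence) | (observed > q3 + fence)).any())
    # Assumed normality proxy: small absolute skewness.
    roughly_normal = abs(observed.skew()) < skew_limit
    return "mean" if roughly_normal and not has_outliers else "median"


symmetric = pd.Series(list(range(1, 40)) + [None], dtype="float64")
with_extreme = pd.Series(list(range(1, 39)) + [10_000, None], dtype="float64")
sparse = pd.Series([1, 2, None, None, None], dtype="float64")

for label, col in [("symmetric", symmetric), ("with_extreme", with_extreme), ("sparse", sparse)]:
    print(label, "->", choose_numeric_fill(col))
```

In a fuller version, the role gates would run before this function, so ID and target columns never reach it.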

⚡ Performance highlights

Typical throughput on a modern laptop (vectorized pandas/NumPy, one-pass engine caching — no C extension required):

| Dataset size | Balanced | Aggressive |
| --- | --- | --- |
| 500 rows | < 0.5 s | < 1 s |
| 3,000 rows | < 2.5 s | < 6 s |
| 29k rows (full AQI) | < 5 s | KNN gated |

python benchmarks/bench.py --fixtures --compare   # all fixtures, side by side

📊 How FreshData compares

Capability freshdata pandas pyjanitor Great Expectations sweetviz cleanlab
One-call automatic cleaning
Per-column decisions by inferred role
Missing-value imputation (smart)
Outlier detection & handling
Duplicate resolution
Dtype / format repair
Explainable audit trail
Data profiling
Data validation / quality gates ✅¹
PII masking ✅¹
Label-noise (ML) detection ✅¹
Polars support

✅ built-in · ➖ partial / manual · ❌ not a goal · ¹ via the optional enterprise layer

🌍 Real-world use cases

  • ML preprocessing — turn raw CSVs into leakage-aware, typed feature matrices before scikit-learn / XGBoost, without imputing IDs or touching the label.
  • Analytics & BI ingestion — clean CRM, finance, and survey exports (currency strings, N/A sentinels, duplicate rows) on the way into a warehouse.
  • Data-quality gates in ETL — run the enterprise CLI in Airflow/Prefect/cron; fail the job when the Data Trust Score drops below a threshold.
  • Exploratory data analysis (EDA): fd.profile(df) surfaces missingness, dtype issues, and duplicates before you commit to a modeling approach.
  • Notebook hygiene — replace ad-hoc dropna/fillna cells with one auditable, reproducible call.

🛠️ Example pipeline

import pandas as pd
import freshdata as fd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

raw = pd.read_csv("customers.csv")

# 1. Clean with the target protected from leakage
clean_df, report = fd.clean(raw, target_column="churn", return_report=True)
assert not report.warnings, report.warnings        # gate on data quality

# 2. Split & model on AI-ready data
X = pd.get_dummies(clean_df.drop(columns="churn"))
y = clean_df["churn"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
print("accuracy:", model.score(X_te, y_te))

See examples/ for 8 runnable scripts and notebooks/ for narrated walkthroughs.

Enterprise layer: clustering, PII masking, trust scores, lineage, CLI

from freshdata.enterprise import (
    clean_enterprise, EnterpriseConfig, ClusterConfig, MaskingRule, SemanticValidatorConfig,
)

ec = EnterpriseConfig(
    enable_clustering=True,
    clustering=ClusterConfig(columns=("vendor",)),       # merge "Acme Inc" / "ACME  inc"
    masking=(MaskingRule(name="pii", columns=("email",), strategy="hash", salt="…"),),
    semantic=(SemanticValidatorConfig(name="iso", kind="reference",
              columns=("country",), reference=("US", "CA", "GB")),),
    fail_under_trust=80,                                  # quality gate
)
result = clean_enterprise(df, enterprise=ec)             # df may be pandas OR polars
print(result.quality.to_markdown())                      # before/after trust report
result.lineage.emit("lineage.json")                      # OpenLineage RunEvents
assert result.passed_gate

Batch CLI (exits non-zero when the trust gate fails):

freshdata clean in.csv -o out.parquet --mask email:hash --cluster vendor \
    --report quality.json --lineage lineage.json --fail-under-trust 80
freshdata trust in.csv --fail-under 90
freshdata profile in.csv --json

📚 Documentation

Full documentation lives at https://freshcode-org.github.io/freshdata/.

🤝 Contributing

Contributions are welcome! Please read CONTRIBUTING.md and our Code of Conduct. Quick start:

git clone https://github.com/FreshCode-Org/freshdata
cd freshdata
pip install -e ".[dev,ml,polars]"
pre-commit install
pytest && ruff check src tests && mypy src/freshdata

Security issues: see SECURITY.md for private disclosure.

🗺️ Roadmap

  • Per-column decision engine with explainable reports (0.3)
  • Enterprise layer: clustering, masking, trust score, lineage, CLI (0.4)
  • Documentation site + examples + packaging governance (0.5)
  • Pluggable custom cleaning rules / strategy registry
  • Native Polars cleaning engine (beyond the adapter)
  • HTML/interactive profiling report
  • Config-as-YAML for the core cleaner (not just the CLI)
  • 1.0 — stable public API

Have an idea? Open a discussion or issue.

📄 License

MIT — see LICENSE.

👤 Maintainer

Built and maintained by Johnny Wilson Dougherty (@JohnnyWilson-Portfolio).

If freshdata saves you time, please ⭐ the repository — it genuinely helps others discover the project.

