Fast, safe, automatic data cleaning for real-world tabular data.

freshdata

Automated DataFrame cleaning for pandas — explainable, safe, and production-ready.

One call turns a messy CSV, Excel, or SQL export into analysis- and ML-ready data — and tells you exactly what it changed and why.

Documentation · Quickstart · API Reference · Examples · Changelog


freshdata is an automated data-cleaning library for Python that preprocesses real-world tabular data. It is not a fillna wrapper: a rule-based decision engine profiles every column (missing ratio, dtype, skewness, cardinality, inferred role) and chooses an action per column, then logs a rationale, a risk level, and a confidence score for each decision so nothing happens silently.

import pandas as pd
import freshdata as fd

df = pd.read_csv("export.csv")

cleaned = fd.clean(df)                              # one line
cleaned, report = fd.clean(df, return_report=True)  # ... with a full audit trail
print(report.summary())

freshdata clean report
  rows:    525 -> 500 (-25)
  columns: 7 -> 6 (-1)
  missing: 421 -> 0 cell(s)
  memory:  100.8 KB -> 89.2 KB
  time:    0.017s
  engine:  25 duplicate row(s) removed; 20 outlier(s) flagged; imputed: age, segment
  actions (7):
    - [fix_dtypes] 'mostly_gone': converted to Int64
    - [drop_duplicates] dropped 25 duplicate row(s) (4.8% of rows, keep='first')
    - [missing] 'age': filled 12 missing value(s) with median (39.6846)
    - [missing] 'segment': filled 90 missing value(s) with sentinel "Missing" ('Missing')
    - [missing] 'mostly_gone': preserved 300 missing value(s)
    - [outliers] 'amount': flagged 15 outlier(s), 3.0% of values (method=iqr, factor=1.5) in new column 'amount_outlier'
    - [outliers] 'age': flagged 5 outlier(s), 1.0% of values (method=iqr, factor=1.5) in new column 'age_outlier'
  review (1):
    ? column 'mostly_gone' preserved at 60.0% missing in balanced mode
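
The [outliers] lines in the report above mention method=iqr, factor=1.5 and a new flag column. As a rough illustration of what an IQR fence of that kind looks like in plain pandas (a sketch only, not freshdata's implementation; the function name flag_iqr_outliers and the sample values are hypothetical):

```python
import pandas as pd


def flag_iqr_outliers(series: pd.Series, factor: float = 1.5) -> pd.Series:
    """Return a boolean mask that is True where a value lies outside the IQR fences."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - factor * iqr, q3 + factor * iqr
    # Comparisons against NaN are False, so missing values are never flagged.
    return (series < lower) | (series > upper)


df = pd.DataFrame({"amount": [10.0, 12.0, 11.0, 13.0, 12.5, 11.5, 500.0]})
# Flag rather than delete: the extreme value stays in the frame for review.
df["amount_outlier"] = flag_iqr_outliers(df["amount"])
print(df)
```

Flagging into a separate column keeps the original values intact, which matches the "never removes outliers blindly" default described below.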

✨ Key features

  • Automated DataFrame cleaning in one call: fd.clean(df) handles missing values, outliers, duplicates, dtype repair, and messy column names.
  • Per-column decision engine — infers each column's role (id, target, datetime, free text, categorical, numeric) and applies explicit, documented threshold rules instead of one blunt global strategy.
  • Explainable by design — every decision carries a rationale, risk level, and confidence score. If a NaN survives, the report says exactly why.
  • Safe defaults — never imputes an identifier, never modifies a target/label column, never force-fills free text, never removes outliers blindly.
  • AI-ready preprocessing — produces clean, typed, leakage-aware frames ready for scikit-learn, XGBoost, or any ML pipeline.
  • Data profiling: fd.profile(df) gives read-only data-quality insight using the same inference code as clean, so previews are faithful.
  • pandas-first, Polars-optional — pandas + NumPy core; pass a Polars frame and get a Polars frame back when the optional adapter is installed.
  • Enterprise layer — opt-in fuzzy clustering, PII masking, semantic validation, a 0–100 Data Trust Score, OpenLineage metadata, and a batch CLI.
  • Typed, tested, fast — fully type-hinted (py.typed), 800+ tests, 95%+ coverage, vectorized pandas/NumPy throughout.
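
To make the per-column statistics named above concrete (missing ratio, dtype, cardinality, skewness), here is a small read-only sketch in plain pandas. It is an assumption-laden illustration, not the code behind fd.profile; profile_columns and the sample frame are hypothetical names and data.

```python
import pandas as pd


def profile_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Collect simple per-column statistics without modifying df."""
    rows = []
    for name in df.columns:
        col = df[name]
        numeric = pd.api.types.is_numeric_dtype(col) and not pd.api.types.is_bool_dtype(col)
        rows.append(
            {
                "column": name,
                "dtype": str(col.dtype),
                "missing_ratio": float(col.isna().mean()),
                "cardinality": int(col.nunique(dropna=True)),
                # Skewness is only computed for numeric columns.
                "skewness": float(col.skew()) if numeric else None,
            }
        )
    return pd.DataFrame(rows).set_index("column")


df = pd.DataFrame(
    {
        "age": [34, None, 41, 29],
        "segment": ["a", "b", None, "a"],
    }
)
stats = profile_columns(df)
print(stats)
```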

🤔 Why FreshData exists

Most data-cleaning code is hand-written, one-off, and silent. People reach for df.dropna() or df.fillna(0) and quietly corrupt their analysis — imputing an ID, leaking a target, or deleting the very outliers that were the signal. General-purpose tools don't fix this:

  • pandas gives you primitives, not decisions — you still write every rule.
  • profiling tools (sweetviz, ydata-profiling) describe data but don't clean it.
  • validation tools (Great Expectations) check data but don't repair it.

freshdata fills the gap: an opinionated engine that makes the right cleaning decision per column and explains it, so you get reproducible, auditable, ML-ready data without writing — or trusting — yet another bespoke script.

📦 Installation

pip install freshdata-cleaner                 # pandas + numpy only
pip install "freshdata-cleaner[ml]"           # + scikit-learn (KNN imputation, IsolationForest)
pip install "freshdata-cleaner[enterprise]"   # + polars, pyarrow, requests, pyyaml (enterprise layer + CLI)
pip install "freshdata-cleaner[all]"          # everything, including cleanlab

Requires Python ≥ 3.9 and pandas ≥ 1.5. Verify the install:

python -c "import freshdata as fd; print(fd.__version__)"

🚀 Quickstart

import pandas as pd
import freshdata as fd

df = pd.read_csv("messy_export.csv")

# Clean with sensible, explainable defaults
cleaned, report = fd.clean(df, return_report=True)

print(report.summary())        # human-readable audit trail
report.to_frame()              # decisions as a DataFrame
report.to_dict()               # JSON-friendly for logging / dashboards

Preview the engine's choices before touching your data:

print(fd.profile(df))                    # read-only data-quality report
print(fd.suggest_plan(df).summary())     # the exact plan clean() would run
print(fd.compare_plans(df))              # strategies side by side

🔁 Before vs after

Before (raw export):

First Name   AGE   Salary($)   empty
Ann          34    $1,200.50
Bob          N/A   -
Bob          N/A   -
Cara         41    $2,000

whitespace, N/A/- sentinels, currency strings, an all-empty column, a duplicate row, text dtypes

After (fd.clean(df)):

first_name   age   salary    age_was_missing
Ann          34    1200.50   False
Bob          38    Missing   True
Cara         41    2000.00   False

snake_case names, real Int64/float64 dtypes, sentinels → missing → imputed, duplicate dropped, empty column removed, a missingness indicator added

Every one of those changes appears in report.summary() with a rationale, risk level, and confidence score — no silent mutations.
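
The age and age_was_missing columns in the example above can be pictured with a few lines of pandas. This is only a sketch of the general technique (record missingness first, then impute with the median); rounding the median before casting to Int64 is an assumption made here so the result is an integer, not something the report states.

```python
import pandas as pd

df = pd.DataFrame({"first_name": ["Ann", "Bob", "Cara"], "age": [34, None, 41]})

# Record which cells were missing BEFORE imputing, so the information is not lost.
df["age_was_missing"] = df["age"].isna()

# Impute with the median, then store as a nullable integer column.
median_age = df["age"].median()
df["age"] = df["age"].fillna(median_age).round().astype("Int64")
print(df)
```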

🧩 Core API

name                                                            purpose
fd.clean(df, *, return_report=False, config=None, **options)    clean, optionally returning a CleanReport
fd.profile(df, *, include_plan=False, **options)                read-only inspection with actionable issues
fd.suggest_plan(df, **options)                                  dry-run: primary + alternative models per column
fd.compare_plans(df, *, strategies=...)                         side-by-side models across strategies
fd.compare_clean(df, *, strategies=...)                         side-by-side actual clean outcomes
fd.explain_clean(df, **options)                                 what clean() did and why, plus inferred roles
fd.Cleaner(config=None, **options)                              reusable configured pipeline (.clean(), .report_)
fd.CleanConfig                                                  frozen dataclass holding every option
fd.CleanReport / fd.Action                                      audit trail with rationale / risk / confidence

# Tune the engine — explicit choices always override the defaults
cleaned = fd.clean(
    df,
    strategy="balanced",          # "aggressive" | "conservative"
    target_column="churn",        # never modified (no leakage)
    id_columns=("customer_id",),  # never imputed
    preserve_columns=("notes",),  # never dropped
    outlier_method="iqr",         # "zscore" | "auto" | "isolation_forest"
    return_report=True,
)

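# Reusable pipeline across many files
import logging
from pathlib import Path

log = logging.getLogger(__name__)
paths = sorted(Path("exports").glob("*.csv"))   # hypothetical input directory

cleaner = fd.Cleaner(target_column="churn")
for path in paths:
    out = cleaner.clean(pd.read_csv(path))
    log.info(cleaner.report_.summary())

How the cleaning engine works (two layers)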

Layer 1 — representation repair (always on):

order  step                                   what it does
1      column_names                           snake_case names, deduplicate collisions ("a", "a" → "a", "a_2")
2      strip_whitespace                       trim surrounding whitespace in text cells
3      normalize_sentinels                    "N/A", "null", "-", "", "#REF!", … → missing
4      drop_empty_columns / drop_empty_rows   remove all-missing columns and rows
5      fix_dtypes                             text → numeric ("$1,234.56" works) / datetime / boolean, validated
6      drop_duplicates                        resolve duplicate rows (first/last/drop/aggregate)
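
The six steps above can be approximated by hand in pandas, which helps show what each one does. The sketch below is an assumption-based illustration, not freshdata's code: the helper names (snake_case_columns, to_numeric_if_clean, repair) are hypothetical, the sentinel set is only the subset listed in the table, and the dtype step handles numbers only.

```python
import re

import numpy as np
import pandas as pd

SENTINELS = {"N/A", "null", "-", "", "#REF!"}  # illustrative subset only


def snake_case_columns(columns):
    """Step 1: snake_case each name and suffix collisions with _2, _3, ..."""
    seen, result = {}, []
    for raw in columns:
        name = re.sub(r"[^0-9a-zA-Z]+", "_", str(raw).strip()).strip("_").lower()
        seen[name] = seen.get(name, 0) + 1
        result.append(name if seen[name] == 1 else f"{name}_{seen[name]}")
    return result


def to_numeric_if_clean(col: pd.Series) -> pd.Series:
    """Step 5: parse text such as "$1,234.56", but only if every non-missing value parses."""
    if pd.api.types.is_numeric_dtype(col):
        return col
    stripped = col.map(lambda v: re.sub(r"[$,]", "", v) if isinstance(v, str) else v)
    parsed = pd.to_numeric(stripped, errors="coerce")
    return parsed if parsed.notna().sum() == col.notna().sum() else col


def repair(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    out.columns = snake_case_columns(out.columns)
    for col in out.columns:
        if not pd.api.types.is_numeric_dtype(out[col]):
            # Step 2: trim whitespace; step 3: turn sentinel strings into missing values.
            text = out[col].map(lambda v: v.strip() if isinstance(v, str) else v)
            out[col] = text.map(
                lambda v: np.nan if isinstance(v, str) and v in SENTINELS else v
            )
    # Step 4: drop all-missing columns, then all-missing rows.
    out = out.dropna(axis="columns", how="all").dropna(axis="index", how="all")
    for col in out.columns:
        out[col] = to_numeric_if_clean(out[col])
    # Step 6: keep the first copy of each duplicated row.
    return out.drop_duplicates(keep="first").reset_index(drop=True)


raw = pd.DataFrame(
    {
        " First Name ": ["Ann ", "Bob", "Bob", "Cara"],
        "AGE": ["34", "N/A", "N/A", "41"],
        "Salary($)": ["$1,200.50", "-", "-", "$2,000"],
        "empty": ["", "", "", ""],
    }
)
fixed = repair(raw)
print(fixed)
```

Only accepting a numeric conversion when every non-missing value parses is one simple way to read the "validated" note in step 5; the library may validate differently.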

Layer 2 — the decision engine (strategy="balanced", the default) infers each column's role and applies explicit threshold rules:

missing ratio   numeric                                      categorical                              datetime
≤ 5%            mean if ~normal & no outliers, else median   mode if clear majority, else "Unknown"   ffill/bfill if time-ordered
5–30%           median (KNN only in aggressive mode)         mode if dominant, else "Missing"         ffill/bfill if time-ordered
> 30%           preserved + warning (balanced)               same                                     same

Role gates run first: targets are never modified, IDs are never imputed, free text is never force-filled. Outliers in ID/target columns, preserve_columns, and domain-sensitive columns (AQI, pollutants, fraud/risk names) are always preserved — there the extremes usually are the signal.
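
As a way to read the balanced-mode table and the role gates together, the numeric column rules can be written as one small function. This is a hypothetical sketch: numeric_action, its parameters, and the returned labels are made up for illustration, and the normality and outlier checks are assumed to have been computed elsewhere.

```python
def numeric_action(
    missing_ratio: float,
    role: str = "numeric",
    roughly_normal: bool = False,
    has_outliers: bool = True,
) -> str:
    """Pick an action for a numeric column, following the balanced-mode table."""
    # Role gates run before any threshold rule.
    if role == "target":
        return "leave untouched"
    if role == "id":
        return "do not impute"
    if missing_ratio == 0:
        return "nothing to do"
    if missing_ratio <= 0.05:
        return "mean" if roughly_normal and not has_outliers else "median"
    if missing_ratio <= 0.30:
        return "median"
    return "preserve + warn"


# 12 missing out of 500 rows, as for 'age' in the sample report.
print(numeric_action(12 / 500))
# 60% missing, as for 'mostly_gone' in the sample report.
print(numeric_action(0.60))
```

Putting the role checks first means a threshold rule can never reach an identifier or target column, which is the ordering the paragraph above describes.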

⚡ Performance highlights

Typical throughput on a modern laptop (vectorized pandas/NumPy, one-pass engine caching — no C extension required):

Dataset size          Balanced   Aggressive
500 rows              < 0.5 s    < 1 s
3,000 rows            < 2.5 s    < 6 s
29k rows (full AQI)   < 5 s      KNN gated

python benchmarks/bench.py --fixtures --compare   # all fixtures, side by side

📊 How FreshData compares

Capability freshdata pandas pyjanitor Great Expectations sweetviz cleanlab
One-call automatic cleaning
Per-column decisions by inferred role
Missing-value imputation (smart)
Outlier detection & handling
Duplicate resolution
Dtype / format repair
Explainable audit trail
Data profiling
Data validation / quality gates ✅¹
PII masking ✅¹
Label-noise (ML) detection ✅¹
Polars support

✅ built-in · ➖ partial / manual · ❌ not a goal · ¹ via the optional enterprise layer

🌍 Real-world use cases

  • ML preprocessing — turn raw CSVs into leakage-aware, typed feature matrices before scikit-learn / XGBoost, without imputing IDs or touching the label.
  • Analytics & BI ingestion — clean CRM, finance, and survey exports (currency strings, N/A sentinels, duplicate rows) on the way into a warehouse.
  • Data-quality gates in ETL — run the enterprise CLI in Airflow/Prefect/cron; fail the job when the Data Trust Score drops below a threshold.
  • Exploratory data analysis (EDA): fd.profile(df) surfaces missingness, dtype issues, and duplicates before you commit to a modeling approach.
  • Notebook hygiene — replace ad-hoc dropna/fillna cells with one auditable, reproducible call.

🛠️ Example pipeline

import pandas as pd
import freshdata as fd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

raw = pd.read_csv("customers.csv")

# 1. Clean with the target protected from leakage
clean_df, report = fd.clean(raw, target_column="churn", return_report=True)
assert not report.warnings, report.warnings        # gate on data quality

# 2. Split & model on AI-ready data
X = pd.get_dummies(clean_df.drop(columns="churn"))
y = clean_df["churn"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
print("accuracy:", model.score(X_te, y_te))

See examples/ for 8 runnable scripts and notebooks/ for narrated walkthroughs.

Enterprise layer: clustering, PII masking, trust scores, lineage, CLI

from freshdata.enterprise import (
    clean_enterprise, EnterpriseConfig, ClusterConfig, MaskingRule, SemanticValidatorConfig,
)

ec = EnterpriseConfig(
    enable_clustering=True,
    clustering=ClusterConfig(columns=("vendor",)),       # merge "Acme Inc" / "ACME  inc"
    masking=(MaskingRule(name="pii", columns=("email",), strategy="hash", salt="…"),),
    semantic=(SemanticValidatorConfig(name="iso", kind="reference",
              columns=("country",), reference=("US", "CA", "GB")),),
    fail_under_trust=80,                                  # quality gate
)
result = clean_enterprise(df, enterprise=ec)             # df may be pandas OR polars
print(result.quality.to_markdown())                      # before/after trust report
result.lineage.emit("lineage.json")                      # OpenLineage RunEvents
assert result.passed_gate
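
For readers unfamiliar with hash-based masking, the strategy="hash" rule with a salt shown above can be pictured as salted hashing of each cell. The sketch below uses SHA-256 from the standard library; which hash freshdata actually uses is not stated here, so treat the algorithm, the hash_mask name, and the demo salt as assumptions.

```python
import hashlib

import pandas as pd


def hash_mask(value, salt: str):
    """Replace a value with a salted SHA-256 hex digest; missing values stay missing."""
    if pd.isna(value):
        return value
    return hashlib.sha256((salt + str(value)).encode("utf-8")).hexdigest()


df = pd.DataFrame({"email": ["ann@example.com", None, "ann@example.com"]})
df["email"] = df["email"].map(lambda v: hash_mask(v, salt="demo-salt"))
print(df)
```

Equal inputs map to equal digests, so joins and duplicate checks on the masked column still work, while the salt makes precomputed lookup tables less useful.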

Batch CLI (exits non-zero when the trust gate fails):

freshdata clean in.csv -o out.parquet --mask email:hash --cluster vendor \
    --report quality.json --lineage lineage.json --fail-under-trust 80
freshdata trust in.csv --fail-under 90
freshdata profile in.csv --json
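
The clustering option merges spelling variants such as "Acme Inc" and "ACME  inc". One simple way to get that effect is key collision: normalize each value to a fingerprint and map every group to its most common spelling. This is a generic sketch of that idea, not necessarily the fuzzy method freshdata uses; fingerprint and the sample vendors are hypothetical.

```python
import re

import pandas as pd


def fingerprint(value: str) -> str:
    """Lowercase, drop punctuation, and collapse runs of whitespace."""
    return " ".join(re.sub(r"[^\w\s]", "", value.lower()).split())


vendors = pd.Series(["Acme Inc", "ACME  inc", "Acme Inc", "Globex"])
keys = vendors.map(fingerprint)

# For each fingerprint, pick the most frequent original spelling as the canonical form.
canonical = vendors.groupby(keys).agg(lambda s: s.mode().iloc[0])
merged = keys.map(canonical)
print(merged.tolist())
```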

📚 Documentation

Full documentation lives at https://freshcode-org.github.io/freshdata/.

🤝 Contributing

Contributions are welcome! Please read CONTRIBUTING.md and our Code of Conduct. Quick start:

git clone https://github.com/FreshCode-Org/freshdata
cd freshdata
pip install -e ".[dev,ml,polars]"
pre-commit install
pytest && ruff check src tests && mypy src/freshdata

Security issues: see SECURITY.md for private disclosure.

🗺️ Roadmap

  • Per-column decision engine with explainable reports (0.3)
  • Enterprise layer: clustering, masking, trust score, lineage, CLI (0.4)
  • Documentation site + examples + packaging governance (0.5)
  • Pluggable custom cleaning rules / strategy registry
  • Native Polars cleaning engine (beyond the adapter)
  • HTML/interactive profiling report
  • Config-as-YAML for the core cleaner (not just the CLI)
  • 1.0 — stable public API

Have an idea? Open a discussion or issue.

📄 License

MIT — see LICENSE.

👤 Maintainer

Built and maintained by Johnny Wilson Dougherty (@JohnnyWilson-Portfolio).

If freshdata saves you time, please ⭐ the repository — it genuinely helps others discover the project.

