freshdata

Fast, safe, automatic data cleaning for real-world tabular data.

freshdata fixes the messy parts of CSV / Excel / SQL-export data — stray whitespace, "N/A" strings, numbers stored as text, duplicate rows — in one call, and tells you exactly what it did.

import pandas as pd
import freshdata as fd

df = pd.read_csv("export.csv")

cleaned = fd.clean(df)                      # one line
cleaned, report = fd.clean(df, report=True) # ... with a full audit trail
print(report.summary())

freshdata clean report
  rows:    5 -> 4 (-1)
  columns: 6 -> 5 (-1)
  memory:  1.5 KB -> 298 B
  time:    0.011s
  actions (12):
    - [column_names] renamed 5 column(s): ' First Name '->'first_name', 'AGE'->'age', …
    - [strip_whitespace] 'first_name': trimmed surrounding whitespace
    - [normalize_sentinels] 'age': replaced sentinel strings ("N/A", "-", "", …) with missing
    - [drop_empty_columns] dropped 1 all-missing column(s): empty
    - [fix_dtypes] 'age': converted to Int64
    - [fix_dtypes] 'joined_date': converted to datetime64[ns]
    - [fix_dtypes] 'active': converted to bool
    - [fix_dtypes] 'salary': converted to float64
    - [drop_duplicates] dropped 1 duplicate row(s)

Install

pip install freshdata-cleaner

Requires Python ≥ 3.9 and pandas ≥ 1.5. No other dependencies.

Why another cleaning library?

Most auto-cleaners are either trivial wrappers or opaque frameworks that guess. freshdata is built on four rules:

  1. No surprises. Defaults only repair representation — whitespace, sentinel strings, wrong dtypes, exact duplicate rows, all-empty rows/columns. Anything that changes your data's statistics (imputation, outlier handling, lossy downcasting) is opt-in.
  2. Everything is reported. Every transformation is recorded with the column name and the number of affected cells. bool(report) is False when nothing changed.
  3. Never mutates your input. clean returns a new frame (built from a shallow copy, so unchanged columns cost no extra memory). profile is read-only.
  4. Fast by construction. Vectorized pandas operations only — no row-wise apply. Type inference pre-screens a sample, so hopeless conversions are rejected at O(sample), not O(n), and conversions only stick when ≥ 95 % of values parse (configurable).
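
Rule 4 can be illustrated with a small plain-Python sketch. It is not freshdata's implementation: the function names, `sample_size`, `sample_floor`, and `seed` are assumptions made up for the illustration, and the real library works on pandas columns rather than lists. The idea shown is a two-stage check, where a cheap sample decides whether the full parse is worth attempting.

```python
import random


def parse_rate(values, parser):
    """Fraction of values that `parser` accepts without raising ValueError."""
    if not values:
        return 0.0
    ok = 0
    for v in values:
        try:
            parser(v)
            ok += 1
        except ValueError:
            pass
    return ok / len(values)


def should_convert(values, parser, threshold=0.95,
                   sample_size=100, sample_floor=0.5, seed=0):
    """Two-stage check (hypothetical): screen a sample, then confirm on all values."""
    non_missing = [v for v in values if v is not None]
    if not non_missing:
        return False
    if len(non_missing) <= sample_size:
        sample = non_missing
    else:
        sample = random.Random(seed).sample(non_missing, sample_size)
    # Stage 1: a column that mostly fails on the sample is rejected
    # without ever parsing the remaining values.
    if parse_rate(sample, parser) < sample_floor:
        return False
    # Stage 2: the conversion only sticks if the full column clears the threshold.
    return parse_rate(non_missing, parser) >= threshold
```

With this sketch, `should_convert(["abc"] * 10000, float)` calls the parser on 100 sampled values only, while a column of numeric strings goes on to the full pass.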

What clean does by default

| order | step | what it does |
|---|---|---|
| 1 | column_names | snake_case names, deduplicate collisions ("a", "a" → "a", "a_2") |
| 2 | strip_whitespace | trim surrounding whitespace in text cells (internal spacing kept) |
| 3 | normalize_sentinels | "N/A", "null", "-", "", "#REF!", … → missing |
| 4 | drop_empty_columns / drop_empty_rows | remove all-missing columns and rows |
| 5 | fix_dtypes | text → numeric ("$1,234.56" works) / datetime / boolean, validated |
| 6 | drop_duplicates | drop exact duplicate rows, keep the first |
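
As a rough illustration of step 1, the sketch below snake_cases names and resolves collisions with numeric suffixes. It is a guess at the behaviour described above, not the library's code: `snake_case`, `clean_column_names`, and the `"column"` fallback for empty names are hypothetical, and the sketch does not try to split camelCase.

```python
import re


def snake_case(name):
    """Lowercase the name and collapse runs of non-alphanumerics into one underscore."""
    s = re.sub(r"[^0-9a-zA-Z]+", "_", str(name).strip())
    return s.strip("_").lower()


def clean_column_names(names):
    """snake_case each name; repeats get _2, _3, ... so every name stays unique."""
    seen = set()
    out = []
    for name in names:
        base = snake_case(name) or "column"  # fallback name is an assumption
        candidate, n = base, 1
        while candidate in seen:
            n += 1
            candidate = f"{base}_{n}"
        seen.add(candidate)
        out.append(candidate)
    return out


print(clean_column_names([" First Name ", "AGE", "Salary($)", "a", "a"]))
# ['first_name', 'age', 'salary', 'a', 'a_2']
```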

Conversions are conservative: a column converts only when at least numeric_threshold (default 0.95) of its non-missing values parse, mixed-type columns never lose their non-string values, and every value coerced to missing is counted in the report.
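
The threshold rule can be sketched in plain Python as follows. This is an illustration under assumptions, not freshdata's code: `parse_number` and `convert_column` are hypothetical names, only `$` and thousands commas are stripped, and `None` stands in for a missing value.

```python
def parse_number(text):
    """Parse strings such as "$1,234.56"; raise ValueError for anything else."""
    return float(text.strip().replace("$", "").replace(",", ""))


def convert_column(values, threshold=0.95):
    """Return (new_values, converted, n_coerced) for one column.

    The column converts only if at least `threshold` of its non-missing
    strings parse; strings that fail are coerced to None and counted.
    """
    results = []
    n_ok = n_bad = 0
    for v in values:
        if not isinstance(v, str):
            results.append(v)  # missing and non-string values pass through untouched
            continue
        try:
            results.append(parse_number(v))
            n_ok += 1
        except ValueError:
            results.append(None)
            n_bad += 1
    total = n_ok + n_bad
    if total == 0 or n_ok / total < threshold:
        return list(values), False, 0  # below threshold: leave the column as text
    return results, True, n_bad
```

For a column with 20 parseable strings and one stray `"oops"`, 20/21 ≈ 0.952 clears the default threshold, so the column converts and the report would count one coerced value.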

Opt-in steps

fd.clean(
    df,
    impute="auto",              # median for numeric, mode otherwise ("mean"/"median"/"mode")
    outliers="clip",            # or "flag" to add a boolean <col>_outlier column
    outlier_method="iqr",       # or "zscore"; factors default to 1.5 / 3.0
    drop_constant_columns=True, # single-valued columns
    optimize_memory=True,       # downcast numerics, categorize low-cardinality text
    reset_index=True,           # 0..n-1 index instead of original labels
)
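
To make the `outliers` and `outlier_method="iqr"` options concrete, here is a minimal stdlib sketch of IQR fences with the default factor of 1.5. The helper names are hypothetical, and the quartile interpolation (`method="inclusive"`) is an assumption; freshdata may compute quartiles differently.

```python
import statistics


def iqr_bounds(values, factor=1.5):
    """Lower and upper fences: Q1 - factor * IQR and Q3 + factor * IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - factor * iqr, q3 + factor * iqr


def clip_outliers(values, factor=1.5):
    """Return a new list with values pulled back to the nearest fence."""
    lo, hi = iqr_bounds(values, factor)
    return [min(max(v, lo), hi) for v in values]


def flag_outliers(values, factor=1.5):
    """Return a boolean per value: True where the value lies outside the fences."""
    lo, hi = iqr_bounds(values, factor)
    return [not (lo <= v <= hi) for v in values]


data = [10, 12, 11, 13, 12, 11, 100]
print(clip_outliers(data))  # [10, 12, 11, 13, 12, 11, 14.75]
print(flag_outliers(data))  # [False, False, False, False, False, False, True]
```

Here Q1 = 11 and Q3 = 12.5, so the fences are 8.75 and 14.75. Clipping rewrites the value 100, while flagging leaves the data alone and marks it instead.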

Every option lives on one frozen dataclass — fd.CleanConfig — and unknown names fail immediately with a "did you mean" suggestion:

config = fd.CleanConfig(drop_duplicates=False, extra_sentinels=("unknown",))
fd.clean(df, config=config, impute="median")   # config + overrides

import logging
log = logging.getLogger(__name__)

cleaner = fd.Cleaner(impute="median")          # reusable pipeline
for path in paths:                             # paths: any iterable of CSV file paths
    out = cleaner.clean(pd.read_csv(path))
    log.info(cleaner.report_.summary())
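
The "frozen dataclass plus did-you-mean" pattern can be sketched with the standard library alone. `SketchConfig` and `make_config` are hypothetical stand-ins with three illustrative fields; the real fd.CleanConfig has more options, and its error type and message wording may differ.

```python
import dataclasses
import difflib


@dataclasses.dataclass(frozen=True)
class SketchConfig:
    """Hypothetical stand-in for a frozen options object."""
    drop_duplicates: bool = True
    impute: object = None
    numeric_threshold: float = 0.95


def make_config(base=None, **options):
    """Build a config from an optional base plus keyword overrides."""
    valid = [f.name for f in dataclasses.fields(SketchConfig)]
    for name in options:
        if name not in valid:
            # Suggest the closest valid option name, if any is similar enough.
            hint = difflib.get_close_matches(name, valid, n=1)
            suffix = f" Did you mean {hint[0]!r}?" if hint else ""
            raise TypeError(f"unknown option {name!r}.{suffix}")
    start = base if base is not None else SketchConfig()
    # replace() returns a new instance, so the base config is never modified.
    return dataclasses.replace(start, **options)
```

Calling `make_config(drop_duplicate=False)` raises a `TypeError` whose message suggests `'drop_duplicates'`, and assigning to a field of the returned object raises `dataclasses.FrozenInstanceError`.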

Profiling

fd.profile(df) inspects without changing anything — and because it runs the same inference code as clean, its suggestions are a faithful preview:

print(fd.profile(df))

freshdata profile — 5 rows x 6 columns, 1.5 KB
  missing cells: 6 (20.0%)   duplicate rows: 1
  column        dtype    missing  issues
   First Name   object       20%  20.0% missing; 1 value(s) with surrounding whitespace; …
  AGE           object         -  1 sentinel value(s) meaning missing; would convert to Int64
  Joined Date   object         -  would convert to datetime64[ns]
  Active        object         -  would convert to bool
  Salary($)     object         -  would convert to float64
  empty         object      100%  100.0% missing; constant column

The returned Profile object offers two export formats: to_frame() gives the same information as a DataFrame, and to_dict() is JSON-friendly for logging and data-quality dashboards.

What freshdata will not do

  • Guess at fuzzy entity resolution ("Jon" vs "John").
  • Impute, drop outliers, or change distributions unless you ask.
  • Parse ambiguous European decimal commas ("1.234,56") — too risky to guess.
  • Mutate your DataFrame, ever.
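
The decimal-comma point can be made concrete with a short sketch. Both parsers below are hypothetical helpers written for this illustration, not part of freshdata; they show that one and the same string is valid under both conventions but yields values a factor of 1000 apart.

```python
def parse_us(text):
    """Comma groups thousands, dot marks decimals: "1,234.56" -> 1234.56."""
    return float(text.replace(",", ""))


def parse_european(text):
    """Dot groups thousands, comma marks decimals: "1.234,56" -> 1234.56."""
    return float(text.replace(".", "").replace(",", "."))


ambiguous = "1.234"
print(parse_us(ambiguous))        # 1.234
print(parse_european(ambiguous))  # 1234.0
```

Nothing in the string itself says which reading is intended, which is why a cleaner that guesses can silently corrupt a column.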

API

| name | purpose |
|---|---|
| fd.clean(df, *, report=False, config=None, **options) | clean, optionally returning a CleanReport |
| fd.profile(df, *, config=None, **options) | read-only inspection with actionable issues |
| fd.Cleaner(config=None, **options) | reusable configured pipeline (.clean(), .report_) |
| fd.CleanConfig | frozen dataclass holding every option |
| fd.CleanReport / fd.Action | audit trail (summary(), to_dict(), to_frame()) |
| fd.Profile / fd.ColumnProfile | profiling results |

Development

git clone https://github.com/JohnnyWilson-Portfolio/freshdata
cd freshdata
pip install -e ".[dev]"
pytest
ruff check src tests

Benchmarks live in benchmarks/bench.py (python benchmarks/bench.py).

License

MIT — see LICENSE.
