freshdata

Fast, safe, automatic data cleaning for real-world tabular data.

freshdata fixes the messy parts of CSV / Excel / SQL-export data — stray whitespace, "N/A" strings, numbers stored as text, duplicate rows — in one call, and tells you exactly what it did.

import pandas as pd
import freshdata as fd

df = pd.read_csv("export.csv")

cleaned = fd.clean(df)                      # one line
cleaned, report = fd.clean(df, report=True) # ... with a full audit trail
print(report.summary())

freshdata clean report
  rows:    5 -> 4 (-1)
  columns: 6 -> 5 (-1)
  memory:  1.5 KB -> 298 B
  time:    0.011s
  actions (12):
    - [column_names] renamed 5 column(s): ' First Name '->'first_name', 'AGE'->'age', …
    - [strip_whitespace] 'first_name': trimmed surrounding whitespace
    - [normalize_sentinels] 'age': replaced sentinel strings ("N/A", "-", "", …) with missing
    - [drop_empty_columns] dropped 1 all-missing column(s): empty
    - [fix_dtypes] 'age': converted to Int64
    - [fix_dtypes] 'joined_date': converted to datetime64[ns]
    - [fix_dtypes] 'active': converted to bool
    - [fix_dtypes] 'salary': converted to float64
    - [drop_duplicates] dropped 1 duplicate row(s)

Install

pip install freshdata-cleaner

Requires Python ≥ 3.9 and pandas ≥ 1.5. No other dependencies.

Why another cleaning library?

Most auto-cleaners are either trivial wrappers or opaque frameworks that guess. freshdata is built on four rules:

No surprises. Defaults only repair representation: whitespace, sentinel strings, wrong dtypes, exact duplicate rows, all-empty rows/columns. Anything that changes your data's statistics (imputation, outlier handling, lossy downcasting) is opt-in.
Everything is reported. Every transformation is recorded with the column name and the number of affected cells. `bool(report)` is `False` when nothing changed.
Never mutates your input. `clean` returns a new frame (built from a shallow copy, so unchanged columns cost no extra memory). `profile` is read-only.
Fast by construction. Vectorized pandas operations only, with no row-wise `apply`. Type inference pre-screens a sample, so hopeless conversions are rejected at O(sample) cost rather than O(n), and a conversion only sticks when at least 95% of values parse (configurable).
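
To picture the "everything is reported" rule, here is a minimal audit-trail sketch in plain Python. It is not freshdata's `CleanReport`: the class names `SketchAction` and `SketchReport` and their fields are assumptions made for illustration. It shows one way a report object can be falsy when no step changed anything.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SketchAction:
    """One recorded transformation (hypothetical shape, not freshdata's Action)."""
    step: str      # e.g. "strip_whitespace"
    column: str    # column the step touched
    cells: int     # number of affected cells


@dataclass
class SketchReport:
    """Collects actions; falsy when nothing changed."""
    actions: list = field(default_factory=list)

    def record(self, step: str, column: str, cells: int) -> None:
        # Only log steps that actually changed something.
        if cells > 0:
            self.actions.append(SketchAction(step, column, cells))

    def __bool__(self) -> bool:
        return bool(self.actions)

    def summary(self) -> str:
        lines = [f"actions ({len(self.actions)}):"]
        lines += [f"  - [{a.step}] {a.column!r}: {a.cells} cell(s)" for a in self.actions]
        return "\n".join(lines)


report = SketchReport()
report.record("strip_whitespace", "first_name", 0)   # no-op, so nothing is recorded
print(bool(report))                                   # False
report.record("normalize_sentinels", "age", 1)
print(bool(report))                                   # True
print(report.summary())
```

With this shape, a caller can write `if report:` to decide whether a cleaning run is worth logging at all.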

What `clean` does by default

| order | step | what it does |
|---|---|---|
| 1 | `column_names` | snake_case names, deduplicate collisions (`"a", "a"` → `"a", "a_2"`) |
| 2 | `strip_whitespace` | trim surrounding whitespace in text cells (internal spacing kept) |
| 3 | `normalize_sentinels` | `"N/A"`, `"null"`, `"-"`, `""`, `"#REF!"`, … → missing |
| 4 | `drop_empty_columns` / `drop_empty_rows` | remove all-missing columns and rows |
| 5 | `fix_dtypes` | text → numeric (`"$1,234.56"` works) / datetime / boolean, validated |
| 6 | `drop_duplicates` | drop exact duplicate rows, keep the first |
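
The first three steps can be sketched in plain Python. This is an illustration of the behavior described in the table, not freshdata's code: the helper names, the short `SENTINELS` set, and the case-insensitive sentinel match are all assumptions, and the real library works on whole pandas columns rather than single cells.

```python
import re

# Assumed sentinel set for illustration; the table above elides the full list.
SENTINELS = {"n/a", "null", "-", "", "#ref!"}


def snake_case(name: str) -> str:
    """Lowercase, collapse runs of non-alphanumerics to '_', trim stray underscores."""
    return re.sub(r"[^0-9a-z]+", "_", name.strip().lower()).strip("_")


def dedupe(names):
    """Suffix repeated names with _2, _3, ... so "a", "a" becomes "a", "a_2"."""
    seen = {}
    out = []
    for name in names:
        seen[name] = seen.get(name, 0) + 1
        out.append(name if seen[name] == 1 else f"{name}_{seen[name]}")
    return out


def normalize_cell(value):
    """Trim surrounding whitespace; map sentinel strings to None (missing)."""
    if not isinstance(value, str):
        return value                      # non-strings pass through untouched
    trimmed = value.strip()
    return None if trimmed.lower() in SENTINELS else trimmed


print(dedupe([snake_case(c) for c in [" First Name ", "AGE", "Salary($)", "a", "a"]]))
# ['first_name', 'age', 'salary', 'a', 'a_2']
print([normalize_cell(v) for v in ["  Ada ", "N/A", "-", 42, "New  York"]])
# ['Ada', None, None, 42, 'New  York']
```

Note that `"New  York"` keeps its internal double space: only surrounding whitespace is trimmed.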

Conversions are conservative: a column converts only when at least `numeric_threshold` (default 0.95) of its non-missing values parse; mixed-type columns never lose their non-string values; and every value coerced to missing is counted in the report.
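
A threshold-gated conversion with a sample pre-screen might look roughly like the sketch below. It is a plain-Python approximation, not the library's implementation: `parse_number`, `try_numeric`, and `sample_size` are hypothetical names, the pre-screen simply looks at the first values, and only `$` and `,` are stripped before parsing.

```python
def parse_number(text):
    """Return a float for strings like "$1,234.56", or None if it does not parse."""
    cleaned = text.strip().replace("$", "").replace(",", "")
    try:
        return float(cleaned)
    except ValueError:
        return None


def try_numeric(values, threshold=0.95, sample_size=100):
    """Convert a column to floats only if >= threshold of non-missing values parse.

    Returns (converted_values, n_coerced_to_missing), or None when rejected.
    """
    present = [v for v in values if v is not None]
    if not present:
        return None
    # Pre-screen: reject hopeless columns after looking at a sample only.
    sample = present[:sample_size]
    if sum(parse_number(v) is not None for v in sample) / len(sample) < threshold:
        return None
    # Full pass: count every value that would be coerced to missing.
    parsed = [None if v is None else parse_number(v) for v in values]
    failures = sum(p is None for p, v in zip(parsed, values) if v is not None)
    if (len(present) - failures) / len(present) < threshold:
        return None
    return parsed, failures


print(try_numeric(["$1,234.56", "12", None, "7.5"]))
# ([1234.56, 12.0, None, 7.5], 0)
print(try_numeric(["red", "green", "3"]))
# None
```

Returning the failure count alongside the converted values is what lets a report state how many cells were coerced to missing.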

Opt-in steps

fd.clean(
    df,
    impute="auto",              # median for numeric, mode otherwise ("mean"/"median"/"mode")
    outliers="clip",            # or "flag" to add a boolean <col>_outlier column
    outlier_method="iqr",       # or "zscore"; factors default to 1.5 / 3.0
    drop_constant_columns=True, # single-valued columns
    optimize_memory=True,       # downcast numerics, categorize low-cardinality text
    reset_index=True,           # 0..n-1 index instead of original labels
)
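
For readers unfamiliar with IQR fences, the following plain-Python sketch shows what "clip" and "flag" mean with the default factor of 1.5. The function names are hypothetical and the quartile interpolation (`method="inclusive"`) is an assumption; freshdata's own quartile computation may differ.

```python
from statistics import quantiles


def iqr_bounds(values, factor=1.5):
    """Return (low, high) fences: Q1 - factor*IQR and Q3 + factor*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    return q1 - factor * spread, q3 + factor * spread


def clip_outliers(values, factor=1.5):
    """Pull values outside the fences back to the nearest fence ("clip")."""
    low, high = iqr_bounds(values, factor)
    return [min(max(v, low), high) for v in values]


def flag_outliers(values, factor=1.5):
    """Return a boolean list marking values outside the fences ("flag")."""
    low, high = iqr_bounds(values, factor)
    return [v < low or v > high for v in values]


salaries = [10, 12, 11, 13, 12, 100]
print(iqr_bounds(salaries))      # (9.0, 15.0)
print(clip_outliers(salaries))   # [10, 12, 11, 13, 12, 15.0]
print(flag_outliers(salaries))   # [False, False, False, False, False, True]
```

Clipping changes the column's distribution while flagging leaves the values alone, which is why both are opt-in.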

Every option lives on one frozen dataclass, `fd.CleanConfig`, and unknown option names fail immediately with a "did you mean" suggestion:

import logging

log = logging.getLogger(__name__)
paths = ["a.csv", "b.csv"]                     # hypothetical input files

config = fd.CleanConfig(drop_duplicates=False, extra_sentinels=("unknown",))
fd.clean(df, config=config, impute="median")   # config + overrides

cleaner = fd.Cleaner(impute="median")          # reusable pipeline
for path in paths:
    out = cleaner.clean(pd.read_csv(path))
    log.info(cleaner.report_.summary())
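
The "frozen dataclass plus did-you-mean" pattern can be reproduced with the standard library alone. The sketch below is an assumption about how such validation could work, not a copy of `fd.CleanConfig`: `SketchConfig` holds only three illustrative fields, and `with_overrides` is a hypothetical helper.

```python
import difflib
from dataclasses import dataclass, fields, replace
from typing import Optional


@dataclass(frozen=True)
class SketchConfig:
    """Hypothetical subset of options; not the real fd.CleanConfig field list."""
    drop_duplicates: bool = True
    impute: Optional[str] = None
    numeric_threshold: float = 0.95


def with_overrides(config: SketchConfig, **options) -> SketchConfig:
    """Return a copy of config with options applied; reject unknown names early."""
    known = [f.name for f in fields(config)]
    for name in options:
        if name not in known:
            close = difflib.get_close_matches(name, known, n=1)
            hint = f" Did you mean {close[0]!r}?" if close else ""
            raise TypeError(f"unknown option {name!r}.{hint}")
    # Frozen instances cannot be mutated, so build a new one instead.
    return replace(config, **options)


base = SketchConfig(drop_duplicates=False)
print(with_overrides(base, impute="median"))
try:
    with_overrides(base, imput="median")      # typo on purpose
except TypeError as exc:
    print(exc)                                # unknown option 'imput'. Did you mean 'impute'?
```

Because the config is frozen, overrides always produce a new object and a shared base config cannot be changed by accident.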

Profiling

`fd.profile(df)` inspects without changing anything. Because it runs the same inference code as `clean`, its suggestions are a faithful preview:

print(fd.profile(df))

freshdata profile — 5 rows x 6 columns, 1.5 KB
  missing cells: 6 (20.0%)   duplicate rows: 1
  column        dtype    missing  issues
   First Name   object       20%  20.0% missing; 1 value(s) with surrounding whitespace; …
  AGE           object         -  1 sentinel value(s) meaning missing; would convert to Int64
  Joined Date   object         -  would convert to datetime64[ns]
  Active        object         -  would convert to bool
  Salary($)     object         -  would convert to float64
  empty         object      100%  100.0% missing; constant column

`profile.to_frame()` returns the same information as a DataFrame; `profile.to_dict()` is JSON-friendly for logging and data-quality dashboards.

What freshdata will not do

Guess at fuzzy entity resolution ("Jon" vs "John").
Impute, drop outliers, or change distributions unless you ask.
Parse ambiguous European decimal commas ("1.234,56") — too risky to guess.
Mutate your DataFrame, ever.
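
The decimal-comma point is easy to demonstrate: the same string yields different numbers depending on which convention is assumed. The two parsers below are deliberately naive, hypothetical helpers written for this illustration only.

```python
def parse_us(text):
    """Read ',' as a thousands separator and '.' as the decimal point."""
    return float(text.replace(",", ""))


def parse_eu(text):
    """Read '.' as a thousands separator and ',' as the decimal point."""
    return float(text.replace(".", "").replace(",", "."))


for raw in ["1.234,56", "1.234", "1,234"]:
    results = {}
    for label, parser in (("us", parse_us), ("eu", parse_eu)):
        try:
            results[label] = parser(raw)
        except ValueError:
            results[label] = None   # not valid under this convention
    print(raw, results)
# 1.234,56 {'us': 1.23456, 'eu': 1234.56}
# 1.234 {'us': 1.234, 'eu': 1234.0}
# 1,234 {'us': 1234.0, 'eu': 1.234}
```

Every input parses under both conventions, with results that differ by a factor of a thousand, so a wrong guess would corrupt values without raising an error.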

API

| name | purpose |
|---|---|
| `fd.clean(df, *, report=False, config=None, **options)` | clean, optionally returning a `CleanReport` |
| `fd.profile(df, *, config=None, **options)` | read-only inspection with actionable issues |
| `fd.Cleaner(config=None, **options)` | reusable configured pipeline (`.clean()`, `.report_`) |
| `fd.CleanConfig` | frozen dataclass holding every option |
| `fd.CleanReport` / `fd.Action` | audit trail (`summary()`, `to_dict()`, `to_frame()`) |
| `fd.Profile` / `fd.ColumnProfile` | profiling results |

Development

git clone https://github.com/JohnnyWilson-Portfolio/freshdata
cd freshdata
pip install -e ".[dev]"
pytest
ruff check src tests

Benchmarks live in `benchmarks/bench.py`; run them with `python benchmarks/bench.py`.

License

MIT — see LICENSE.

Release history

1.0.1 (Jun 15, 2026)
1.0.0 (Jun 14, 2026)
0.5.0 (Jun 14, 2026)
0.4.0 (Jun 14, 2026)
0.2.0 (Jun 12, 2026)
0.1.0 (Jun 12, 2026), this version
