

pandasclean

A lightweight Python library for cleaning and optimizing pandas DataFrames. Built for data analysts and data scientists who want practical, no-fuss data cleaning with sensible defaults.


Quick Start

import pandas as pd
from pandasclean import auto_clean

df = pd.read_csv('your_data.csv')
df_clean, report = auto_clean(df)

That's it. One line cleans your entire DataFrame.


Features

  • Outlier Detection & Handling — Detect outliers using the IQR method and choose to report, drop, or cap them
  • Memory Reduction — Automatically downcast numeric dtypes and convert low-cardinality string columns to category to save memory
  • NaN Handling — Drop, fill with mean/median, or supply custom fill values per column
  • Auto Clean — One function that runs everything with sensible defaults

Installation

pip install pandasclean

Usage

Auto Clean

Runs all cleaning functions in the correct order with sensible defaults.

from pandasclean import auto_clean

df_clean, report = auto_clean(df)

For custom behaviour, use the individual functions directly.


Outlier Detection & Handling

from pandasclean import find_outliers

# Report outlier bounds without changing data
df, bounds = find_outliers(df, strategy='report')

# Drop rows containing outliers
df_clean, bounds = find_outliers(df, strategy='drop')

# Cap outliers to the nearest bound (Winsorization)
df_clean, bounds = find_outliers(df, strategy='cap')

# Target specific columns with a custom multiplier
df_clean, bounds = find_outliers(df, columns=['age', 'salary'], multiplier=3.0, strategy='cap')

NaN Handling

from pandasclean import handle_nan

# Report null counts and percentages without making changes
df, report = handle_nan(df, strategy='report')

# Drop rows with any NaN
df_clean, report = handle_nan(df, strategy='drop', axis='rows', how='any')

# Drop columns where more than 50% of values are NaN
df_clean, report = handle_nan(df, strategy='drop', axis='columns', threshold=50)

# Fill NaN with column mean (numeric columns only)
df_clean, report = handle_nan(df, strategy='mean')

# Fill NaN with column median (numeric columns only)
df_clean, report = handle_nan(df, strategy='median')

# Fill all NaN with a single custom value
df_clean, report = handle_nan(df, strategy='custom', fill_value=0)

# Fill NaN with different values per column
df_clean, report = handle_nan(df, strategy='custom', fill_value={
    'age': 0,
    'name': 'unknown',
    'salary': 50000
})

Memory Reduction

from pandasclean import reduce_memory

# Optimize all columns with default settings
df_optimized, report = reduce_memory(df)

# Disable category conversion for string columns
df_optimized, report = reduce_memory(df, convert_category=False)

# Custom cardinality threshold for category conversion
df_optimized, report = reduce_memory(df, cardinality_threshold=0.3)

Benchmarks

Tested on a 1.5 million row DataFrame with mixed dtypes:

| Column | Before | After | Notes |
| --- | --- | --- | --- |
| Account_ID | int64 | int32 | Downcast |
| Customer_Age | int64 | int8 | Downcast |
| Account_Balance | float64 | float32 | Downcast |
| Monthly_Spend | float64 | float32 | Downcast |
| Credit_Score | float64 | float32 | Downcast |
| Activity_Metric_1 | float64 | float32 | Downcast |
| Activity_Metric_2 | float64 | float32 | Downcast |
| Tier | object | category | Low cardinality |

before = df.memory_usage(deep=True).sum() / (1024 * 1024)
df_optimized, report = reduce_memory(df)
after = df_optimized.memory_usage(deep=True).sum() / (1024 * 1024)

# Before: 152.71 MB
# After:  37.19 MB
# Reduction: 75.6%

Always use memory_usage(deep=True) for accurate measurement: the default shallow count includes only the object pointers in string columns, not the strings they point to, so it can understate true memory use by a large factor.
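
The difference is easy to demonstrate in plain pandas, independently of pandasclean:

```python
import pandas as pd

# A small frame with an object-dtype string column: the shallow count measures
# only the 8-byte object pointers, while deep=True also counts the strings.
df = pd.DataFrame(
    {"city": pd.Series(["Amsterdam", "Rotterdam", "Utrecht"] * 1000, dtype="object")}
)

shallow = df.memory_usage().sum()
deep = df.memory_usage(deep=True).sum()

print(f"shallow: {shallow} bytes, deep: {deep} bytes")
```

On object-dtype columns the two numbers differ substantially; the benchmark figures above use the deep measurement.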


How It Works

Memory Reduction

| dtype | Action |
| --- | --- |
| int64 | Downcast to smallest safe type (int8 / int16 / int32) |
| float64 | Downcast to float32 where possible |
| object / str | Convert to category if cardinality ratio is below threshold |
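
As an illustration, these conversions can be approximated in plain pandas. This is a sketch of the general technique, not pandasclean's actual implementation; the 0.5 threshold and the use of pd.to_numeric here are assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    "account_id": pd.Series(range(1, 7), dtype="int64"),
    "balance": pd.Series([10.5, 20.25, 30.0, 40.5, 50.0, 60.75], dtype="float64"),
    "tier": ["gold", "silver", "gold", "gold", "silver", "gold"],
})

# Integers: downcast to the smallest type that can hold the values.
df["account_id"] = pd.to_numeric(df["account_id"], downcast="integer")

# Floats: downcast="float" goes to float32 when the values survive the cast.
df["balance"] = pd.to_numeric(df["balance"], downcast="float")

# Strings: convert to category when the unique ratio is below a threshold.
if df["tier"].nunique() / len(df) < 0.5:
    df["tier"] = df["tier"].astype("category")

print(df.dtypes)  # account_id: int8, balance: float32, tier: category
```

Category conversion pays off because each distinct string is stored once, with rows holding small integer codes.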

Outlier Detection

Uses the IQR method to compute bounds:

  • lower_bound = Q1 - (multiplier × IQR)
  • upper_bound = Q3 + (multiplier × IQR)

Standard multiplier values:

  • 1.5 — mild outliers (default)
  • 3.0 — extreme outliers only
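
The bounds can be computed directly in pandas; this is a standalone illustration of the IQR method, not pandasclean's internals:

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 11, 95])  # 95 is an obvious outlier

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
multiplier = 1.5

lower = q1 - multiplier * iqr
upper = q3 + multiplier * iqr

# 'drop'-style: keep only the rows outside the bounds for inspection
outliers = s[(s < lower) | (s > upper)]
print(outliers.tolist())  # [95]

# 'cap'-style (Winsorization): clip values to the bounds instead of dropping
capped = s.clip(lower=lower, upper=upper)
```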

Parameters

auto_clean(df)

| Parameter | Default | Description |
| --- | --- | --- |
| df | required | Input DataFrame |

find_outliers(df, columns, multiplier, strategy)

| Parameter | Default | Description |
| --- | --- | --- |
| df | required | Input DataFrame |
| columns | None | Columns to check. Defaults to all numeric columns |
| multiplier | 1.5 | IQR multiplier. Use 3.0 for extreme outliers only |
| strategy | 'report' | One of 'report', 'drop', 'cap' |

handle_nan(df, columns, strategy, fill_value, axis, how, threshold)

| Parameter | Default | Description |
| --- | --- | --- |
| df | required | Input DataFrame |
| columns | None | Columns to process. Defaults to all columns |
| strategy | 'report' | One of 'report', 'drop', 'mean', 'median', 'custom' |
| fill_value | None | Scalar or dict. Required when strategy is 'custom' |
| axis | 'rows' | 'rows' or 'columns'. Only used with the 'drop' strategy |
| how | 'any' | 'any' or 'all'. Only used with the 'drop' strategy |
| threshold | None | NaN percentage threshold for dropping. Overrides how when set |
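
The threshold semantics can be illustrated in plain pandas; this sketches the behaviour described above, not the library's implementation:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [np.nan, np.nan, np.nan, 4],  # 75% NaN
    "c": [1, np.nan, 3, 4],            # 25% NaN
})

threshold = 50  # drop columns where more than 50% of values are NaN

# isna().mean() gives the NaN fraction per column; keep those at or below it
keep = df.columns[df.isna().mean() * 100 <= threshold]
df_clean = df[keep]

print(list(df_clean.columns))  # ['a', 'c']
```

With a threshold set, the any/all distinction of how no longer matters: the percentage test alone decides what is dropped.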

reduce_memory(df, columns, convert_category, cardinality_threshold)

| Parameter | Default | Description |
| --- | --- | --- |
| df | required | Input DataFrame |
| columns | None | Columns to process. Defaults to all columns |
| convert_category | True | Whether to convert low-cardinality strings to category |
| cardinality_threshold | 0.5 | Max unique ratio to trigger category conversion |

Use Cases

  • Data cleaning before analysis — handle outliers and NaN values in one step
  • Memory optimization before ML training — reduce DataFrame size before fitting models, especially useful for GPU training where float32 is standard
  • CSV/Parquet compression — smaller dtypes mean smaller files on disk. Saving to parquet after running reduce_memory can reduce file sizes significantly

Roadmap

  • Outlier detection and handling (IQR method)
  • Memory reduction (dtype downcasting + category conversion)
  • NaN handling (drop, mean, median, custom)
  • Auto clean
  • Z-score based outlier detection
  • Skewness detection and fixing
  • Duplicate detection and removal
  • HTML report generator

License

MIT License
