

pandasclean

A lightweight Python library for cleaning and optimizing pandas DataFrames. Built for data analysts and data scientists who want practical, no-fuss data cleaning with sensible defaults.


Quick Start

import pandas as pd
from pandasclean import auto_clean

df = pd.read_csv('your_data.csv')
df_clean, report = auto_clean(df)

That's it. One line cleans your entire DataFrame.


Features

  • Outlier Detection & Handling — Detect outliers using the IQR method and choose to report, drop, or cap them
  • Memory Reduction — Automatically downcast numeric dtypes and convert low-cardinality string columns to category to save memory
  • NaN Handling — Drop, fill with mean/median, or supply custom fill values per column
  • Auto Clean — One function that runs everything with sensible defaults

Installation

pip install pandasclean

Usage

Auto Clean

Runs all cleaning functions in the correct order with sensible defaults.

from pandasclean import auto_clean

df_clean, report = auto_clean(df)

For custom behaviour, use the individual functions directly.


Outlier Detection & Handling

from pandasclean import find_outliers

# Report outlier bounds without changing data
df, bounds = find_outliers(df, strategy='report')

# Drop rows containing outliers
df_clean, bounds = find_outliers(df, strategy='drop')

# Cap outliers to the nearest bound (Winsorization)
df_clean, bounds = find_outliers(df, strategy='cap')

# Target specific columns with a custom multiplier
df_clean, bounds = find_outliers(df, columns=['age', 'salary'], multiplier=3.0, strategy='cap')

NaN Handling

from pandasclean import handle_nan

# Report null counts and percentages without making changes
df, report = handle_nan(df, strategy='report')

# Drop rows with any NaN
df_clean, report = handle_nan(df, strategy='drop', axis='rows', how='any')

# Drop columns where more than 50% of values are NaN
df_clean, report = handle_nan(df, strategy='drop', axis='columns', threshold=50)

# Fill NaN with column mean (numeric columns only)
df_clean, report = handle_nan(df, strategy='mean')

# Fill NaN with column median (numeric columns only)
df_clean, report = handle_nan(df, strategy='median')

# Fill all NaN with a single custom value
df_clean, report = handle_nan(df, strategy='custom', fill_value=0)

# Fill NaN with different values per column
df_clean, report = handle_nan(df, strategy='custom', fill_value={
    'age': 0,
    'name': 'unknown',
    'salary': 50000
})

Memory Reduction

from pandasclean import reduce_memory

# Optimize all columns with default settings
df_optimized, report = reduce_memory(df)

# Disable category conversion for string columns
df_optimized, report = reduce_memory(df, convert_category=False)

# Custom cardinality threshold for category conversion
df_optimized, report = reduce_memory(df, cardinality_threshold=0.3)

Benchmarks

Tested on a 1.5 million row DataFrame with mixed dtypes:

| Column | Before | After | Notes |
| --- | --- | --- | --- |
| Account_ID | int64 | int32 | Downcast |
| Customer_Age | int64 | int8 | Downcast |
| Account_Balance | float64 | float32 | Downcast |
| Monthly_Spend | float64 | float32 | Downcast |
| Credit_Score | float64 | float32 | Downcast |
| Activity_Metric_1 | float64 | float32 | Downcast |
| Activity_Metric_2 | float64 | float32 | Downcast |
| Tier | object | category | Low cardinality |

before = df.memory_usage(deep=True).sum() / (1024 * 1024)
df_optimized, report = reduce_memory(df)
after = df_optimized.memory_usage(deep=True).sum() / (1024 * 1024)

# Before: 152.71 MB
# After:  37.19 MB
# Reduction: 75.6%

Always use memory_usage(deep=True) for accurate measurement: the default shallow count includes only the object pointers in string columns, not the strings they point to, so it can understate true memory use by a large factor.
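
The difference is easy to demonstrate in plain pandas, independently of pandasclean:

```python
import pandas as pd

# A small frame with an object-dtype string column: the shallow count measures
# only the 8-byte object pointers, while deep=True also counts the strings.
df = pd.DataFrame(
    {"city": pd.Series(["Amsterdam", "Rotterdam", "Utrecht"] * 1000, dtype="object")}
)

shallow = df.memory_usage().sum()
deep = df.memory_usage(deep=True).sum()

print(f"shallow: {shallow} bytes, deep: {deep} bytes")
```

On object-dtype columns the two numbers differ substantially; the benchmark figures above use the deep measurement.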


How It Works

Memory Reduction

| dtype | Action |
| --- | --- |
| int64 | Downcast to smallest safe type (int8 / int16 / int32) |
| float64 | Downcast to float32 where possible |
| object / str | Convert to category if cardinality ratio is below threshold |
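
As an illustration, these conversions can be approximated in plain pandas. This is a sketch of the general technique, not pandasclean's actual implementation; the 0.5 threshold and the use of pd.to_numeric here are assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    "account_id": pd.Series(range(1, 7), dtype="int64"),
    "balance": pd.Series([10.5, 20.25, 30.0, 40.5, 50.0, 60.75], dtype="float64"),
    "tier": ["gold", "silver", "gold", "gold", "silver", "gold"],
})

# Integers: downcast to the smallest type that can hold the values.
df["account_id"] = pd.to_numeric(df["account_id"], downcast="integer")

# Floats: downcast="float" goes to float32 when the values survive the cast.
df["balance"] = pd.to_numeric(df["balance"], downcast="float")

# Strings: convert to category when the unique ratio is below a threshold.
if df["tier"].nunique() / len(df) < 0.5:
    df["tier"] = df["tier"].astype("category")

print(df.dtypes)  # account_id: int8, balance: float32, tier: category
```

Category conversion pays off because each distinct string is stored once, with rows holding small integer codes.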

Outlier Detection

Uses the IQR method to compute bounds:

  • lower_bound = Q1 - (multiplier × IQR)
  • upper_bound = Q3 + (multiplier × IQR)

Standard multiplier values:

  • 1.5 — mild outliers (default)
  • 3.0 — extreme outliers only
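
The bounds can be computed directly in pandas; this is a standalone illustration of the IQR method, not pandasclean's internals:

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 11, 95])  # 95 is an obvious outlier

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
multiplier = 1.5

lower = q1 - multiplier * iqr
upper = q3 + multiplier * iqr

# 'drop'-style: keep only the rows outside the bounds for inspection
outliers = s[(s < lower) | (s > upper)]
print(outliers.tolist())  # [95]

# 'cap'-style (Winsorization): clip values to the bounds instead of dropping
capped = s.clip(lower=lower, upper=upper)
```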

Parameters

auto_clean(df)

| Parameter | Default | Description |
| --- | --- | --- |
| df | required | Input DataFrame |

find_outliers(df, columns, multiplier, strategy)

| Parameter | Default | Description |
| --- | --- | --- |
| df | required | Input DataFrame |
| columns | None | Columns to check. Defaults to all numeric columns |
| multiplier | 1.5 | IQR multiplier. Use 3.0 for extreme outliers only |
| strategy | 'report' | One of 'report', 'drop', 'cap' |

handle_nan(df, columns, strategy, fill_value, axis, how, threshold)

| Parameter | Default | Description |
| --- | --- | --- |
| df | required | Input DataFrame |
| columns | None | Columns to process. Defaults to all columns |
| strategy | 'report' | One of 'report', 'drop', 'mean', 'median', 'custom' |
| fill_value | None | Scalar or dict. Required when strategy is 'custom' |
| axis | 'rows' | 'rows' or 'columns'. Only used with the 'drop' strategy |
| how | 'any' | 'any' or 'all'. Only used with the 'drop' strategy |
| threshold | None | NaN percentage threshold for dropping. Overrides how when set |
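
The threshold semantics can be illustrated in plain pandas; this sketches the behaviour described above, not the library's implementation:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [np.nan, np.nan, np.nan, 4],  # 75% NaN
    "c": [1, np.nan, 3, 4],            # 25% NaN
})

threshold = 50  # drop columns where more than 50% of values are NaN

# isna().mean() gives the NaN fraction per column; keep those at or below it
keep = df.columns[df.isna().mean() * 100 <= threshold]
df_clean = df[keep]

print(list(df_clean.columns))  # ['a', 'c']
```

With a threshold set, the any/all distinction of how no longer matters: the percentage test alone decides what is dropped.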

reduce_memory(df, columns, convert_category, cardinality_threshold)

| Parameter | Default | Description |
| --- | --- | --- |
| df | required | Input DataFrame |
| columns | None | Columns to process. Defaults to all columns |
| convert_category | True | Whether to convert low-cardinality strings to category |
| cardinality_threshold | 0.5 | Max unique ratio to trigger category conversion |

Use Cases

  • Data cleaning before analysis — handle outliers and NaN values in one step
  • Memory optimization before ML training — reduce DataFrame size before fitting models, especially useful for GPU training where float32 is standard
  • CSV/Parquet compression — smaller dtypes mean smaller files on disk. Saving to parquet after running reduce_memory can reduce file sizes significantly

Roadmap

  • Outlier detection and handling (IQR method)
  • Memory reduction (dtype downcasting + category conversion)
  • NaN handling (drop, mean, median, custom)
  • Auto clean
  • Z-score based outlier detection
  • Skewness detection and fixing
  • Duplicate detection and removal
  • HTML report generator

License

MIT License
