autoclean-dataframe
Automatic, configurable data cleansing for pandas DataFrames
A Python library for automatic, configurable data cleansing of pandas DataFrames with detailed reporting. Clean messy tabular data quickly using declarative configuration.
Features
- Declarative Configuration: Define cleaning rules using Python dicts, Pydantic models, or YAML/JSON files
- Comprehensive Cleaning Operations:
- Missing value imputation (mean, median, mode, constant, forward/backward fill)
- Type conversion with intelligent error handling
- Whitespace and text normalization (strip, case conversion)
- Categorical normalization and value validation
- PII masking (email, phone, custom patterns)
- Outlier detection and handling (flag, remove, or clip)
- Duplicate removal and empty row/column handling
- Smart Defaults: Quick auto_clean() function for common scenarios
- Detailed Reporting: Track all changes with human-readable summaries and machine-parseable JSON
- Configurable I/O: Load/save configurations from YAML or JSON files
- Type-Safe: Full type hints with Pydantic validation
- Testable: Immutable operations return new DataFrames
Installation
pip install autoclean-dataframe
Quick Start
1. Automatic Cleaning (Easiest)
from autoclean_dataframe import auto_clean
# Apply smart defaults: remove duplicates, infer types, detect outliers
df_clean, report = auto_clean(df)
print(report)
2. Programmatic Configuration
from autoclean_dataframe import (
clean_dataframe,
DataCleanConfig,
GeneralCleanConfig,
ColumnConfig,
TypeConversionConfig,
MissingValueConfig,
)
config = DataCleanConfig(
    general=GeneralCleanConfig(
        remove_duplicates=True,
        drop_fully_empty_rows=True,
    ),
    columns={
        "age": ColumnConfig(
            column_name="age",
            type_conversion=TypeConversionConfig(target_type="int"),
            missing_values=MissingValueConfig(strategy="mean"),
        ),
        "email": ColumnConfig(
            column_name="email",
            strip_whitespace=True,
            to_lowercase=True,
        ),
    },
)
df_clean, report = clean_dataframe(df, config)
print(report)
3. YAML Configuration
Create config.yaml:
general:
  remove_duplicates: true
  drop_fully_empty_rows: true
columns:
  age:
    column_name: age
    type_conversion:
      target_type: int
    missing_values:
      strategy: mean
  email:
    column_name: email
    strip_whitespace: true
    to_lowercase: true
    pii:
      pii_type: email
Then in Python:
from autoclean_dataframe import load_config, clean_dataframe
config = load_config("config.yaml")
df_clean, report = clean_dataframe(df, config)
Core Features
1. Missing Value Handling
MissingValueConfig(
    strategy="mean",    # "mean", "median", "mode", "constant", "forward_fill", "backward_fill", "drop_row", "none"
    constant_value=0,   # For strategy="constant"
    threshold=0.5,      # Drop column if missing % > threshold
)
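For intuition, strategy="mean" behaves like plain-pandas mean imputation. A minimal sketch of the semantics (not the library's internal code):

import pandas as pd

s = pd.Series([1.0, None, 3.0])
# strategy="mean" fills NaN with the column mean (2.0 here)
filled = s.fillna(s.mean())  # -> [1.0, 2.0, 3.0]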
2. Type Conversion
TypeConversionConfig(
    target_type="int",            # "int", "float", "str", "bool", "datetime", "category", "none"
    datetime_format="%Y-%m-%d",   # For datetime conversion
    strict=False,                 # If True, raise on conversion failure; else coerce to NaN
    infer_type=False,             # Auto-detect type if not specified
)
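With strict=False, failed conversions are coerced to NaN rather than raising. This mirrors pandas' own coercion mode, shown here as an illustrative sketch rather than the library's internals:

import pandas as pd

s = pd.Series(["42", "n/a", "7"])
# Unparseable values become NaN instead of raising an error
converted = pd.to_numeric(s, errors="coerce")  # -> [42.0, NaN, 7.0]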
3. Outlier Detection
OutlierConfig(
    method="iqr",           # "iqr", "zscore", "none"
    action="flag",          # "flag", "remove", "clip", "none"
    iqr_multiplier=1.5,     # Bounds: Q1 - k*IQR, Q3 + k*IQR
    zscore_threshold=3.0,   # Flag when |z| > threshold
)
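The IQR rule flags values outside [Q1 - k*IQR, Q3 + k*IQR]. The equivalent computation in plain pandas, for illustration (the library's implementation may differ):

import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 400])  # 400 is an obvious outlier
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = (s < lower) | (s > upper)        # action="flag"
clipped = s.clip(lower=lower, upper=upper)  # action="clip"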
4. PII Masking
PiiConfig(
    pii_type="email",                     # "email", "phone", "ssn", "credit_card", "custom", "none"
    custom_pattern=r"\d{3}-\d{2}-\d{4}",  # For pii_type="custom"
    mask_char="*",                        # Character to use for masking
)
Masked outputs:
- Email: john@example.com → ***@***.com
- Phone: 555-123-4567 → ***-***-4567 (keeps the last 4 digits)
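As a rough illustration of the email format above, masking can be expressed as a regex substitution. Note that mask_email below is a hypothetical helper, not part of the library's API:

import re

def mask_email(value: str, mask_char: str = "*") -> str:
    # Hypothetical sketch: mask the local part and domain name, keep the TLD
    return re.sub(
        r"^[^@]+@[^.]+(\.[^.]+)$",
        lambda m: f"{mask_char * 3}@{mask_char * 3}{m.group(1)}",
        value,
    )

mask_email("john@example.com")  # -> '***@***.com'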
5. Text Normalization
ColumnConfig(
    column_name="name",
    strip_whitespace=True,   # Remove leading/trailing spaces
    to_lowercase=True,       # Convert to lowercase
    to_uppercase=False,      # Convert to uppercase (mutually exclusive with to_lowercase)
)
6. Categorical Validation
ColumnConfig(
    column_name="status",
    allowed_values=["active", "inactive", "pending"],  # Restrict to these values
)
7. General Cleaning
GeneralCleanConfig(
    drop_fully_empty_rows=True,      # Drop rows where ALL values are NaN
    drop_fully_empty_columns=True,   # Drop columns where ALL values are NaN
    remove_duplicates=True,          # Remove duplicate rows
    normalize_unicode=False,         # Normalize text to NFC form
    infer_dtypes=False,              # Auto-detect column types
)
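For reference, these flags correspond roughly to the following plain-pandas operations (a sketch of the semantics, not the library's code):

# drop_fully_empty_rows / drop_fully_empty_columns
df = df.dropna(axis=0, how="all")
df = df.dropna(axis=1, how="all")
# remove_duplicates
df = df.drop_duplicates()
# infer_dtypes (approximately)
df = df.convert_dtypes()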
Cleaning Report
The cleaning pipeline returns a CleanReport object with detailed information:
df_clean, report = clean_dataframe(df, config)
# Print human-readable summary
print(report)
# Export to JSON or a plain dict
json_str = report.to_json()
report_dict = report.to_dict()
# Save to file
from autoclean_dataframe import save_report
save_report(report, "report.json")
save_report(report, "report.txt")
Report includes:
- Row/column counts before and after
- Per-column change summaries
- Count of specific operations (type conversions, imputations, outliers removed, etc.)
- Warnings and errors encountered
Example:
======================================================================
DATA CLEANING REPORT
======================================================================
Timestamp: 2024-01-15T10:30:45.123456
OVERVIEW
----------------------------------------------------------------------
Rows before: 100
Rows after: 95
Rows removed: 5
Columns before: 10
Columns after: 10
Duplicate rows removed: 2
COLUMN CHANGES
----------------------------------------------------------------------
age:
  - Missing values handled: 3
  - Type conversions: 97
  - Outliers detected: 2
  - Outliers clipped: 2
email:
  - Whitespace stripped: 5
  - PII values masked: 100
======================================================================
Examples
See the examples/ directory for complete examples:
- simple_usage.py: Basic cleaning operations
- yaml_config_usage.py: Using YAML configuration files
- config_example.yaml: Annotated example configuration
Run examples:
cd examples
python3 simple_usage.py
python3 yaml_config_usage.py
Configuration Schema
Full Pydantic model schema:
DataCleanConfig(
    general: GeneralCleanConfig = GeneralCleanConfig(),
    columns: Dict[str, ColumnConfig] = {},
    preserve_index: bool = True,
    verbose: bool = False,
)

GeneralCleanConfig(
    drop_fully_empty_rows: bool = False,
    drop_fully_empty_columns: bool = False,
    remove_duplicates: bool = False,
    normalize_unicode: bool = False,
    infer_dtypes: bool = False,
)

ColumnConfig(
    column_name: str,
    strip_whitespace: bool = False,
    to_lowercase: bool = False,
    to_uppercase: bool = False,
    type_conversion: Optional[TypeConversionConfig] = None,
    missing_values: Optional[MissingValueConfig] = None,
    outliers: Optional[OutlierConfig] = None,
    pii: Optional[PiiConfig] = None,
    allowed_values: Optional[List[Any]] = None,
)
API Reference
Main Functions
# Apply cleaning with config
clean_dataframe(df: pd.DataFrame, config: DataCleanConfig) -> Tuple[pd.DataFrame, CleanReport]
# Apply smart defaults
auto_clean(df: pd.DataFrame, verbose: bool = False) -> Tuple[pd.DataFrame, CleanReport]
Configuration & I/O
# Load config from file
load_config(path: Union[str, Path]) -> DataCleanConfig
# Save config to file
save_config(config: DataCleanConfig, path: Union[str, Path], format: str = "yaml") -> None
# Save report to file
save_report(report: CleanReport, path: Union[str, Path], format: str = "json") -> None
# Config serialization
config_to_dict(config: DataCleanConfig) -> Dict[str, Any]
config_to_yaml(config: DataCleanConfig) -> str
config_to_json(config: DataCleanConfig) -> str
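These helpers compose into a simple round trip, e.g. persisting a programmatically built config for later runs:

from autoclean_dataframe import save_config, load_config

save_config(config, "pipeline.yaml", format="yaml")
# ...later, or in another process:
config = load_config("pipeline.yaml")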
Types & Enums
ColumnType = {"numeric", "categorical", "datetime", "text", "unknown"}
ImputationMethod = {"mean", "median", "mode", "forward_fill", "backward_fill", "constant", "drop_row", "none"}
OutlierMethod = {"iqr", "zscore", "none"}
OutlierAction = {"remove", "clip", "flag", "none"}
PiiType = {"email", "phone", "ssn", "credit_card", "custom", "none"}
Exceptions
AutocleanException # Base exception
ConfigValidationError # Configuration validation failed
DataValidationError # Input DataFrame validation failed
TypeConversionError # Type conversion failed
OutlierDetectionError # Outlier detection failed
ReportExportError # Report serialization failed
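A typical guard around the pipeline, assuming these exceptions are importable from the package root:

from autoclean_dataframe import (
    clean_dataframe,
    AutocleanException,
    ConfigValidationError,
)

try:
    df_clean, report = clean_dataframe(df, config)
except ConfigValidationError as exc:
    print(f"Bad configuration: {exc}")
except AutocleanException as exc:
    # Base class catches any other library-specific failure
    print(f"Cleaning failed: {exc}")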
Design Principles
- Immutable by Default: Always returns new DataFrames, never modifies input
- Fail-Safe: Coerces conversion failures to NaN by default, tracks issues in report
- Explicit Over Implicit: Conservative defaults, requires explicit configuration
- Traceable: Every change tracked and reported
- Type-Safe: Full type hints, Pydantic validation
Common Workflows
Clean CSV with Smart Defaults
import pandas as pd
from autoclean_dataframe import auto_clean, save_report
# Load and clean
df = pd.read_csv("messy_data.csv")
df_clean, report = auto_clean(df, verbose=True)
# Save results
df_clean.to_csv("clean_data.csv", index=False)
save_report(report, "report.json")
Type Inference and Conversion
from autoclean_dataframe import auto_clean
# Auto-detect types and convert
df_clean, report = auto_clean(df)
# Check what was inferred
for col in df_clean.columns:
    print(f"{col}: {df_clean[col].dtype}")
Handle Missing Values
config = DataCleanConfig(
    columns={
        "numeric_col": ColumnConfig(
            column_name="numeric_col",
            missing_values=MissingValueConfig(strategy="median"),
        ),
        "categorical_col": ColumnConfig(
            column_name="categorical_col",
            missing_values=MissingValueConfig(
                strategy="constant",
                constant_value="unknown",
            ),
        ),
    }
)
df_clean, report = clean_dataframe(df, config)
Detect and Remove Outliers
from autoclean_dataframe import OutlierConfig, OutlierMethod, OutlierAction
config = DataCleanConfig(
    columns={
        "measurement": ColumnConfig(
            column_name="measurement",
            outliers=OutlierConfig(
                method=OutlierMethod.IQR,
                action=OutlierAction.REMOVE,
                iqr_multiplier=1.5,
            ),
        )
    }
)
df_clean, report = clean_dataframe(df, config)
print(f"Rows removed: {report.rows_removed}")
Performance Notes
- Memory: Always creates a copy of the DataFrame (immutable design)
- Speed: Optimized for typical data sizes (up to millions of rows)
- Scaling: Linear time complexity for most operations
For very large datasets (>1B rows), consider:
- Processing in chunks (see the sketch after this list)
- Using more targeted configurations (fewer columns)
- Disabling expensive operations (outlier detection)
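For example, chunked CSV processing might look like the sketch below. It assumes each chunk can be cleaned independently, which does not hold for global operations such as cross-chunk duplicate removal:

import pandas as pd
from autoclean_dataframe import auto_clean

chunks = []
for chunk in pd.read_csv("huge.csv", chunksize=1_000_000):
    cleaned, _report = auto_clean(chunk)
    chunks.append(cleaned)
df_clean = pd.concat(chunks, ignore_index=True)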
Testing
Run the test suite:
pip install pytest pytest-cov
pytest tests/ -v
Coverage: >80% of codebase
Contributing
Contributions welcome! Areas for enhancement:
- Additional PII pattern types
- Custom outlier detection methods
- Integration with Dask for larger-than-memory data
- Web API for cleaning service
License
MIT License
Project Status
This is an alpha release (v0.1.0); the core API is largely stable but may still evolve. Please report issues on GitHub.
See Also
- pandas: Data manipulation
- pydantic: Configuration validation
- great-expectations: More advanced data validation
- pandas-profiling: Data profiling and analysis