Automatic, configurable data cleansing for pandas DataFrames

autoclean-dataframe

A Python library for automatic, configurable data cleansing of pandas DataFrames with detailed reporting. Clean messy tabular data quickly using declarative configuration.

Features

  • Declarative Configuration: Define cleaning rules using Python dicts, Pydantic models, or YAML/JSON files
  • Comprehensive Cleaning Operations:
    • Missing value imputation (mean, median, mode, constant, forward/backward fill)
    • Type conversion with intelligent error handling
    • Whitespace and text normalization (strip, case conversion)
    • Categorical normalization and value validation
    • PII masking (email, phone, custom patterns)
    • Outlier detection and handling (flag, remove, or clip)
    • Duplicate removal and empty row/column handling
  • Smart Defaults: Quick auto_clean() function for common scenarios
  • Detailed Reporting: Track all changes with human-readable summaries and machine-parseable JSON
  • Configurable I/O: Load/save configurations from YAML or JSON files
  • Type-Safe: Full type hints with Pydantic validation
  • Testable: Immutable operations return new DataFrames

Installation

pip install autoclean-dataframe

Quick Start

1. Automatic Cleaning (Easiest)

from autoclean_dataframe import auto_clean

# Apply smart defaults: remove duplicates, infer types, detect outliers
df_clean, report = auto_clean(df)
print(report)

2. Programmatic Configuration

from autoclean_dataframe import (
    clean_dataframe,
    DataCleanConfig,
    GeneralCleanConfig,
    ColumnConfig,
    TypeConversionConfig,
    MissingValueConfig,
)

config = DataCleanConfig(
    general=GeneralCleanConfig(
        remove_duplicates=True,
        drop_fully_empty_rows=True,
    ),
    columns={
        "age": ColumnConfig(
            column_name="age",
            type_conversion=TypeConversionConfig(target_type="int"),
            missing_values=MissingValueConfig(strategy="mean"),
        ),
        "email": ColumnConfig(
            column_name="email",
            strip_whitespace=True,
            to_lowercase=True,
        ),
    }
)

df_clean, report = clean_dataframe(df, config)
print(report)

3. YAML Configuration

Create config.yaml:

general:
  remove_duplicates: true
  drop_fully_empty_rows: true

columns:
  age:
    column_name: age
    type_conversion:
      target_type: int
    missing_values:
      strategy: mean

  email:
    column_name: email
    strip_whitespace: true
    to_lowercase: true
    pii:
      pii_type: email

Then in Python:

from autoclean_dataframe import load_config, clean_dataframe

config = load_config("config.yaml")
df_clean, report = clean_dataframe(df, config)

Core Features

1. Missing Value Handling

MissingValueConfig(
    strategy="mean",        # "mean", "median", "mode", "constant", "forward_fill", "backward_fill", "drop_row", "none"
    constant_value=0,       # For strategy="constant"
    threshold=0.5,          # Drop column if missing % > threshold
)
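As an illustration of what `strategy="mean"` describes, here is a plain-pandas sketch (`impute_mean` is a hypothetical helper, not part of the library's API):

```python
import pandas as pd

def impute_mean(s: pd.Series, threshold: float = 0.5) -> pd.Series:
    """Mean-impute a numeric Series, mirroring strategy="mean".
    Hypothetical helper; the library drops the column instead of
    raising when the missing fraction exceeds the threshold."""
    if s.isna().mean() > threshold:
        raise ValueError("missing fraction exceeds threshold")
    return s.fillna(s.mean())

ages = pd.Series([25.0, None, 31.0, None, 40.0])
print(impute_mean(ages).tolist())  # [25.0, 32.0, 31.0, 32.0, 40.0]
```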

2. Type Conversion

TypeConversionConfig(
    target_type="int",           # "int", "float", "str", "bool", "datetime", "category", "none"
    datetime_format="%Y-%m-%d",  # For datetime conversion
    strict=False,                # If True, raise on conversion failure; else coerce to NaN
    infer_type=False,            # Auto-detect type if not specified
)
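The `strict=False` behavior corresponds to pandas' own `errors="coerce"` mode, which the following sketch illustrates (an analogy, not the library's internal code):

```python
import pandas as pd

# Unparseable values become NaN instead of raising, so one bad cell
# does not abort the whole conversion.
raw = pd.Series(["42", "7", "n/a", "19"])
coerced = pd.to_numeric(raw, errors="coerce")
print(coerced.tolist())  # [42.0, 7.0, nan, 19.0]
```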

3. Outlier Detection

OutlierConfig(
    method="iqr",           # "iqr", "zscore", "none"
    action="flag",          # "flag", "remove", "clip", "none"
    iqr_multiplier=1.5,     # Q1 - k*IQR, Q3 + k*IQR
    zscore_threshold=3.0,   # |z| > threshold
)
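Independent of the library, the fence logic that `method="iqr"` with `action="clip"` describes can be sketched in plain pandas (`iqr_clip` is a hypothetical helper):

```python
import pandas as pd

def iqr_clip(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR].
    Hypothetical helper sketching action="clip"."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

vals = pd.Series([10, 12, 11, 13, 12, 100])
print(iqr_clip(vals).tolist())  # the 100 is pulled down to the upper fence
```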

4. PII Masking

PiiConfig(
    pii_type="email",       # "email", "phone", "ssn", "credit_card", "custom", "none"
    custom_pattern=r"\d{3}-\d{2}-\d{4}",  # For pii_type="custom"
    mask_char="*",          # Character to use for masking
)

Masked outputs:

  • Email: john@example.com → ***@***.com
  • Phone: 555-123-4567 → ***-***-4567 (keeps last 4 digits)
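The email masking can be sketched with a regex (an illustration only; `mask_email` is a hypothetical helper and this is not the library's actual pattern):

```python
import re

def mask_email(value: str, mask_char: str = "*") -> str:
    """Hide the local part and domain, keep the top-level domain.
    Regex sketch, not the library's implementation."""
    return re.sub(
        r"[^@\s]+@.+(\.[A-Za-z]{2,})$",
        lambda m: f"{mask_char * 3}@{mask_char * 3}{m.group(1)}",
        value,
    )

print(mask_email("john@example.com"))  # ***@***.com
```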

5. Text Normalization

ColumnConfig(
    column_name="name",
    strip_whitespace=True,      # Remove leading/trailing spaces
    to_lowercase=True,          # Convert to lowercase
    to_uppercase=False,         # Convert to uppercase (mutually exclusive with to_lowercase)
)

6. Categorical Validation

ColumnConfig(
    column_name="status",
    allowed_values=["active", "inactive", "pending"],  # Restrict to these values
)
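One plausible behavior for `allowed_values` (an assumption on our part; consult the cleaning report to see how disallowed values are actually handled) is coercing out-of-set values to NaN so a missing-value strategy can pick them up:

```python
import pandas as pd

# Assumed behavior: values outside allowed_values become NaN.
allowed = ["active", "inactive", "pending"]
status = pd.Series(["active", "ACTIVE", "pending", "deleted"])
validated = status.where(status.isin(allowed))
print(validated.tolist())  # ['active', nan, 'pending', nan]
```

Note that `"ACTIVE"` fails validation here because matching is case-sensitive; pairing `allowed_values` with `to_lowercase=True` avoids that.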

7. General Cleaning

GeneralCleanConfig(
    drop_fully_empty_rows=True,      # Drop rows where ALL values are NaN
    drop_fully_empty_columns=True,   # Drop columns where ALL values are NaN
    remove_duplicates=True,          # Remove duplicate rows
    normalize_unicode=False,         # Normalize to NFC form
    infer_dtypes=False,             # Auto-detect column types
)
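The first three flags map directly onto pandas built-ins, as this sketch shows (plain pandas, not the library's internals):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"a": [1.0, 1.0, np.nan, 2.0], "b": ["x", "x", np.nan, "y"]}
)
# drop_fully_empty_rows -> dropna(how="all")
# remove_duplicates     -> drop_duplicates()
cleaned = df.dropna(how="all").drop_duplicates()
print(len(cleaned))  # 2
```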

Cleaning Report

The cleaning pipeline returns a CleanReport object with detailed information:

df_clean, report = clean_dataframe(df, config)

# Print human-readable summary
print(report)

# Export to JSON
json_str = report.to_json()
report.to_dict()

# Save to file
from autoclean_dataframe import save_report
save_report(report, "report.json")
save_report(report, "report.txt")

Report includes:

  • Row/column counts before and after
  • Per-column change summaries
  • Count of specific operations (type conversions, imputations, outliers removed, etc.)
  • Warnings and errors encountered

Example:

======================================================================
DATA CLEANING REPORT
======================================================================
Timestamp: 2024-01-15T10:30:45.123456

OVERVIEW
----------------------------------------------------------------------
Rows before: 100
Rows after:  95
Rows removed: 5
Columns before: 10
Columns after:  10

Duplicate rows removed: 2

COLUMN CHANGES
----------------------------------------------------------------------

age:
  - Missing values handled: 3
  - Type conversions: 97
  - Outliers detected: 2
  - Outliers clipped: 2

email:
  - Whitespace stripped: 5
  - PII values masked: 100

======================================================================

Examples

See the examples/ directory for complete examples:

  1. simple_usage.py: Basic cleaning operations
  2. yaml_config_usage.py: Using YAML configuration files
  3. config_example.yaml: Annotated example configuration

Run examples:

cd examples
python3 simple_usage.py
python3 yaml_config_usage.py

Configuration Schema

Full Pydantic model schema:

DataCleanConfig(
    general: GeneralCleanConfig = GeneralCleanConfig(),
    columns: Dict[str, ColumnConfig] = {},
    preserve_index: bool = True,
    verbose: bool = False,
)

GeneralCleanConfig(
    drop_fully_empty_rows: bool = False,
    drop_fully_empty_columns: bool = False,
    remove_duplicates: bool = False,
    normalize_unicode: bool = False,
    infer_dtypes: bool = False,
)

ColumnConfig(
    column_name: str,
    strip_whitespace: bool = False,
    to_lowercase: bool = False,
    to_uppercase: bool = False,
    type_conversion: Optional[TypeConversionConfig] = None,
    missing_values: Optional[MissingValueConfig] = None,
    outliers: Optional[OutlierConfig] = None,
    pii: Optional[PiiConfig] = None,
    allowed_values: Optional[List[Any]] = None,
)

API Reference

Main Functions

# Apply cleaning with config
clean_dataframe(df: pd.DataFrame, config: DataCleanConfig) -> Tuple[pd.DataFrame, CleanReport]

# Apply smart defaults
auto_clean(df: pd.DataFrame, verbose: bool = False) -> Tuple[pd.DataFrame, CleanReport]

Configuration & I/O

# Load config from file
load_config(path: Union[str, Path]) -> DataCleanConfig

# Save config to file
save_config(config: DataCleanConfig, path: Union[str, Path], format: str = "yaml") -> None

# Save report to file
save_report(report: CleanReport, path: Union[str, Path], format: str = "json") -> None

# Config serialization
config_to_dict(config: DataCleanConfig) -> Dict[str, Any]
config_to_yaml(config: DataCleanConfig) -> str
config_to_json(config: DataCleanConfig) -> str

Types & Enums

ColumnType = {"numeric", "categorical", "datetime", "text", "unknown"}
ImputationMethod = {"mean", "median", "mode", "forward_fill", "backward_fill", "constant", "drop_row", "none"}
OutlierMethod = {"iqr", "zscore", "none"}
OutlierAction = {"remove", "clip", "flag", "none"}
PiiType = {"email", "phone", "ssn", "credit_card", "custom", "none"}

Exceptions

AutocleanException              # Base exception
ConfigValidationError           # Configuration validation failed
DataValidationError             # Input DataFrame validation failed
TypeConversionError             # Type conversion failed
OutlierDetectionError           # Outlier detection failed
ReportExportError               # Report serialization failed

Design Principles

  1. Immutable by Default: Always returns new DataFrames, never modifies input
  2. Fail-Safe: Coerces conversion failures to NaN by default, tracks issues in report
  3. Explicit Over Implicit: Conservative defaults, requires explicit configuration
  4. Traceable: Every change tracked and reported
  5. Type-Safe: Full type hints, Pydantic validation

Common Workflows

Clean CSV with Smart Defaults

import pandas as pd
from autoclean_dataframe import auto_clean, save_report

# Load and clean
df = pd.read_csv("messy_data.csv")
df_clean, report = auto_clean(df, verbose=True)

# Save results
df_clean.to_csv("clean_data.csv", index=False)
save_report(report, "report.json")

Type Inference and Conversion

from autoclean_dataframe import auto_clean

# Auto-detect types and convert
df_clean, report = auto_clean(df)

# Check what was inferred
for col in df_clean.columns:
    print(f"{col}: {df_clean[col].dtype}")

Handle Missing Values

config = DataCleanConfig(
    columns={
        "numeric_col": ColumnConfig(
            column_name="numeric_col",
            missing_values=MissingValueConfig(strategy="median"),
        ),
        "categorical_col": ColumnConfig(
            column_name="categorical_col",
            missing_values=MissingValueConfig(
                strategy="constant",
                constant_value="unknown",
            ),
        ),
    }
)
df_clean, report = clean_dataframe(df, config)

Detect and Remove Outliers

from autoclean_dataframe import OutlierConfig, OutlierMethod, OutlierAction

config = DataCleanConfig(
    columns={
        "measurement": ColumnConfig(
            column_name="measurement",
            outliers=OutlierConfig(
                method=OutlierMethod.IQR,
                action=OutlierAction.REMOVE,
                iqr_multiplier=1.5,
            ),
        )
    }
)
df_clean, report = clean_dataframe(df, config)
print(f"Rows removed: {report.rows_removed}")

Performance Notes

  • Memory: Always creates a copy of the DataFrame (immutable design)
  • Speed: Optimized for typical data sizes (up to millions of rows)
  • Scaling: Linear time complexity for most operations

For datasets beyond that range, consider:

  • Processing in chunks
  • Using more targeted configurations (fewer columns)
  • Disabling expensive operations (outlier detection)
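A chunked workflow can be sketched in plain pandas (a hypothetical pattern, not a library feature; note that per-chunk statistics such as the imputation mean differ from the global ones a single pass would use, and cross-chunk operations like duplicate removal need a second pass over the combined result):

```python
import io
import pandas as pd

# Clean each chunk independently, then reconcile cross-chunk state.
csv = io.StringIO("id,score\n1,10\n1,10\n2,\n3,30\n")
chunks = []
for chunk in pd.read_csv(csv, chunksize=2):
    # Per-chunk mean imputation (differs from the global mean).
    chunk["score"] = chunk["score"].fillna(chunk["score"].mean())
    chunks.append(chunk)
# Second pass: global duplicate removal on the combined frame.
combined = pd.concat(chunks, ignore_index=True).drop_duplicates()
print(len(combined))  # 3
```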

Testing

Run the test suite:

pip install pytest pytest-cov
pytest tests/ -v

Coverage: >80% of codebase

Contributing

Contributions welcome! Areas for enhancement:

  • Additional PII pattern types
  • Custom outlier detection methods
  • Integration with Dask for larger-than-memory data
  • Web API for cleaning service

License

MIT License

Project Status

This is an alpha release (v0.1.0). The API is stable but may evolve. Please report issues on GitHub.

See Also

  • pandas: Data manipulation
  • pydantic: Configuration validation
  • great-expectations: More advanced data validation
  • pandas-profiling: Data profiling and analysis
