autoclean-dataframe
Automatic, configurable data cleansing for pandas DataFrames
A Python library for automatic, configurable data cleansing of pandas DataFrames with detailed reporting. Clean messy tabular data quickly using declarative configuration.
Features
- Declarative Configuration: Define cleaning rules using Python dicts, Pydantic models, or YAML/JSON files
- Comprehensive Cleaning Operations:
- Missing value imputation (mean, median, mode, constant, forward/backward fill)
- Type conversion with intelligent error handling
- Whitespace and text normalization (strip, case conversion)
- Categorical normalization and value validation
- PII masking (email, phone, custom patterns)
- Outlier detection and handling (flag, remove, or clip)
- Duplicate removal and empty row/column handling
- Smart Defaults: Quick auto_clean() function for common scenarios
- Detailed Reporting: Track all changes with human-readable summaries and machine-parseable JSON
- Configurable I/O: Load/save configurations from YAML or JSON files
- Type-Safe: Full type hints with Pydantic validation
- Testable: Immutable operations return new DataFrames
Installation
pip install autoclean-dataframe
Quick Start
1. Automatic Cleaning (Easiest)
from autoclean_dataframe import auto_clean
# Apply smart defaults: remove duplicates, infer types, detect outliers
df_clean, report = auto_clean(df)
print(report)
2. Programmatic Configuration
from autoclean_dataframe import (
clean_dataframe,
DataCleanConfig,
GeneralCleanConfig,
ColumnConfig,
TypeConversionConfig,
MissingValueConfig,
)
config = DataCleanConfig(
    general=GeneralCleanConfig(
        remove_duplicates=True,
        drop_fully_empty_rows=True,
    ),
    columns={
        "age": ColumnConfig(
            column_name="age",
            type_conversion=TypeConversionConfig(target_type="int"),
            missing_values=MissingValueConfig(strategy="mean"),
        ),
        "email": ColumnConfig(
            column_name="email",
            strip_whitespace=True,
            to_lowercase=True,
        ),
    },
)
df_clean, report = clean_dataframe(df, config)
print(report)
3. YAML Configuration
Create config.yaml:
general:
  remove_duplicates: true
  drop_fully_empty_rows: true
columns:
  age:
    column_name: age
    type_conversion:
      target_type: int
    missing_values:
      strategy: mean
  email:
    column_name: email
    strip_whitespace: true
    to_lowercase: true
    pii:
      pii_type: email
Then in Python:
from autoclean_dataframe import load_config, clean_dataframe
config = load_config("config.yaml")
df_clean, report = clean_dataframe(df, config)
Core Features
1. Missing Value Handling
MissingValueConfig(
    strategy="mean",    # "mean", "median", "mode", "constant", "forward_fill", "backward_fill", "drop_row", "none"
    constant_value=0,   # For strategy="constant"
    threshold=0.5,      # Drop column if missing % > threshold
)
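For intuition, strategy="mean" behaves like plain-pandas mean imputation. A minimal sketch of the semantics (not the library's internal code):

import pandas as pd

s = pd.Series([1.0, None, 3.0])
# strategy="mean" fills NaN with the column mean (2.0 here)
filled = s.fillna(s.mean())  # -> [1.0, 2.0, 3.0]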
2. Type Conversion
TypeConversionConfig(
    target_type="int",            # "int", "float", "str", "bool", "datetime", "category", "none"
    datetime_format="%Y-%m-%d",   # For datetime conversion
    strict=False,                 # If True, raise on conversion failure; else coerce to NaN
    infer_type=False,             # Auto-detect type if not specified
)
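With strict=False, failed conversions are coerced to NaN rather than raising. This mirrors pandas' own coercion mode, shown here as an illustrative sketch rather than the library's internals:

import pandas as pd

s = pd.Series(["42", "n/a", "7"])
# Unparseable values become NaN instead of raising an error
converted = pd.to_numeric(s, errors="coerce")  # -> [42.0, NaN, 7.0]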
3. Outlier Detection
OutlierConfig(
    method="iqr",           # "iqr", "zscore", "none"
    action="flag",          # "flag", "remove", "clip", "none"
    iqr_multiplier=1.5,     # Bounds: Q1 - k*IQR, Q3 + k*IQR
    zscore_threshold=3.0,   # Flag when |z| > threshold
)
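The IQR rule flags values outside [Q1 - k*IQR, Q3 + k*IQR]. The equivalent computation in plain pandas, for illustration (the library's implementation may differ):

import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 400])  # 400 is an obvious outlier
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = (s < lower) | (s > upper)        # action="flag"
clipped = s.clip(lower=lower, upper=upper)  # action="clip"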
4. PII Masking
PiiConfig(
    pii_type="email",                     # "email", "phone", "ssn", "credit_card", "custom", "none"
    custom_pattern=r"\d{3}-\d{2}-\d{4}",  # For pii_type="custom"
    mask_char="*",                        # Character to use for masking
)
Masked outputs:
- Email: john@example.com → ***@***.com
- Phone: 555-123-4567 → ***-***-4567 (keeps the last 4 digits)
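As a rough illustration of the email format above, masking can be expressed as a regex substitution. Note that mask_email below is a hypothetical helper, not part of the library's API:

import re

def mask_email(value: str, mask_char: str = "*") -> str:
    # Hypothetical sketch: mask the local part and domain name, keep the TLD
    return re.sub(
        r"^[^@]+@[^.]+(\.[^.]+)$",
        lambda m: f"{mask_char * 3}@{mask_char * 3}{m.group(1)}",
        value,
    )

mask_email("john@example.com")  # -> '***@***.com'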
5. Text Normalization
ColumnConfig(
    column_name="name",
    strip_whitespace=True,   # Remove leading/trailing spaces
    to_lowercase=True,       # Convert to lowercase
    to_uppercase=False,      # Convert to uppercase (mutually exclusive with to_lowercase)
)
6. Categorical Validation
ColumnConfig(
    column_name="status",
    allowed_values=["active", "inactive", "pending"],  # Restrict to these values
)
7. General Cleaning
GeneralCleanConfig(
    drop_fully_empty_rows=True,      # Drop rows where ALL values are NaN
    drop_fully_empty_columns=True,   # Drop columns where ALL values are NaN
    remove_duplicates=True,          # Remove duplicate rows
    normalize_unicode=False,         # Normalize text to NFC form
    infer_dtypes=False,              # Auto-detect column types
)
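For reference, these flags correspond roughly to the following plain-pandas operations (a sketch of the semantics, not the library's code):

# drop_fully_empty_rows / drop_fully_empty_columns
df = df.dropna(axis=0, how="all")
df = df.dropna(axis=1, how="all")
# remove_duplicates
df = df.drop_duplicates()
# infer_dtypes (approximately)
df = df.convert_dtypes()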
Cleaning Report
The cleaning pipeline returns a CleanReport object with detailed information:
df_clean, report = clean_dataframe(df, config)
# Print human-readable summary
print(report)
# Export to JSON or a plain dict
json_str = report.to_json()
report_dict = report.to_dict()
# Save to file
from autoclean_dataframe import save_report
save_report(report, "report.json")
save_report(report, "report.txt")
Report includes:
- Row/column counts before and after
- Per-column change summaries
- Count of specific operations (type conversions, imputations, outliers removed, etc.)
- Warnings and errors encountered
Example:
======================================================================
DATA CLEANING REPORT
======================================================================
Timestamp: 2024-01-15T10:30:45.123456
OVERVIEW
----------------------------------------------------------------------
Rows before: 100
Rows after: 95
Rows removed: 5
Columns before: 10
Columns after: 10
Duplicate rows removed: 2
COLUMN CHANGES
----------------------------------------------------------------------
age:
  - Missing values handled: 3
  - Type conversions: 97
  - Outliers detected: 2
  - Outliers clipped: 2
email:
  - Whitespace stripped: 5
  - PII values masked: 100
======================================================================
Examples
See the examples/ directory for complete examples:
- simple_usage.py: Basic cleaning operations
- yaml_config_usage.py: Using YAML configuration files
- config_example.yaml: Annotated example configuration
Run examples:
cd examples
python3 simple_usage.py
python3 yaml_config_usage.py
Configuration Schema
Full Pydantic model schema:
DataCleanConfig(
    general: GeneralCleanConfig = GeneralCleanConfig(),
    columns: Dict[str, ColumnConfig] = {},
    preserve_index: bool = True,
    verbose: bool = False,
)

GeneralCleanConfig(
    drop_fully_empty_rows: bool = False,
    drop_fully_empty_columns: bool = False,
    remove_duplicates: bool = False,
    normalize_unicode: bool = False,
    infer_dtypes: bool = False,
)

ColumnConfig(
    column_name: str,
    strip_whitespace: bool = False,
    to_lowercase: bool = False,
    to_uppercase: bool = False,
    type_conversion: Optional[TypeConversionConfig] = None,
    missing_values: Optional[MissingValueConfig] = None,
    outliers: Optional[OutlierConfig] = None,
    pii: Optional[PiiConfig] = None,
    allowed_values: Optional[List[Any]] = None,
)
API Reference
Main Functions
# Apply cleaning with config
clean_dataframe(df: pd.DataFrame, config: DataCleanConfig) -> Tuple[pd.DataFrame, CleanReport]
# Apply smart defaults
auto_clean(df: pd.DataFrame, verbose: bool = False) -> Tuple[pd.DataFrame, CleanReport]
Configuration & I/O
# Load config from file
load_config(path: Union[str, Path]) -> DataCleanConfig
# Save config to file
save_config(config: DataCleanConfig, path: Union[str, Path], format: str = "yaml") -> None
# Save report to file
save_report(report: CleanReport, path: Union[str, Path], format: str = "json") -> None
# Config serialization
config_to_dict(config: DataCleanConfig) -> Dict[str, Any]
config_to_yaml(config: DataCleanConfig) -> str
config_to_json(config: DataCleanConfig) -> str
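These helpers compose into a simple round trip, e.g. persisting a programmatically built config for later runs:

from autoclean_dataframe import save_config, load_config

save_config(config, "pipeline.yaml", format="yaml")
# ...later, or in another process:
config = load_config("pipeline.yaml")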
Types & Enums
ColumnType = {"numeric", "categorical", "datetime", "text", "unknown"}
ImputationMethod = {"mean", "median", "mode", "forward_fill", "backward_fill", "constant", "drop_row", "none"}
OutlierMethod = {"iqr", "zscore", "none"}
OutlierAction = {"remove", "clip", "flag", "none"}
PiiType = {"email", "phone", "ssn", "credit_card", "custom", "none"}
Exceptions
AutocleanException # Base exception
ConfigValidationError # Configuration validation failed
DataValidationError # Input DataFrame validation failed
TypeConversionError # Type conversion failed
OutlierDetectionError # Outlier detection failed
ReportExportError # Report serialization failed
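A typical guard around the pipeline, assuming these exceptions are importable from the package root:

from autoclean_dataframe import (
    clean_dataframe,
    AutocleanException,
    ConfigValidationError,
)

try:
    df_clean, report = clean_dataframe(df, config)
except ConfigValidationError as exc:
    print(f"Bad configuration: {exc}")
except AutocleanException as exc:
    # Base class catches any other library-specific failure
    print(f"Cleaning failed: {exc}")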
Design Principles
- Immutable by Default: Always returns new DataFrames, never modifies input
- Fail-Safe: Coerces conversion failures to NaN by default, tracks issues in report
- Explicit Over Implicit: Conservative defaults, requires explicit configuration
- Traceable: Every change tracked and reported
- Type-Safe: Full type hints, Pydantic validation
Common Workflows
Clean CSV with Smart Defaults
import pandas as pd
from autoclean_dataframe import auto_clean, save_report
# Load and clean
df = pd.read_csv("messy_data.csv")
df_clean, report = auto_clean(df, verbose=True)
# Save results
df_clean.to_csv("clean_data.csv", index=False)
save_report(report, "report.json")
Type Inference and Conversion
from autoclean_dataframe import auto_clean
# Auto-detect types and convert
df_clean, report = auto_clean(df)
# Check what was inferred
for col in df_clean.columns:
    print(f"{col}: {df_clean[col].dtype}")
Handle Missing Values
config = DataCleanConfig(
    columns={
        "numeric_col": ColumnConfig(
            column_name="numeric_col",
            missing_values=MissingValueConfig(strategy="median"),
        ),
        "categorical_col": ColumnConfig(
            column_name="categorical_col",
            missing_values=MissingValueConfig(
                strategy="constant",
                constant_value="unknown",
            ),
        ),
    }
)
df_clean, report = clean_dataframe(df, config)
Detect and Remove Outliers
from autoclean_dataframe import OutlierConfig, OutlierMethod, OutlierAction
config = DataCleanConfig(
    columns={
        "measurement": ColumnConfig(
            column_name="measurement",
            outliers=OutlierConfig(
                method=OutlierMethod.IQR,
                action=OutlierAction.REMOVE,
                iqr_multiplier=1.5,
            ),
        )
    }
)
df_clean, report = clean_dataframe(df, config)
print(f"Rows removed: {report.rows_removed}")
Performance Notes
- Memory: Always creates a copy of the DataFrame (immutable design)
- Speed: Optimized for typical data sizes (up to millions of rows)
- Scaling: Linear time complexity for most operations
For very large datasets (>1B rows), consider:
- Processing in chunks (see the sketch after this list)
- Using more targeted configurations (fewer columns)
- Disabling expensive operations (outlier detection)
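For example, chunked CSV processing might look like the sketch below. It assumes each chunk can be cleaned independently, which does not hold for global operations such as cross-chunk duplicate removal:

import pandas as pd
from autoclean_dataframe import auto_clean

chunks = []
for chunk in pd.read_csv("huge.csv", chunksize=1_000_000):
    cleaned, _report = auto_clean(chunk)
    chunks.append(cleaned)
df_clean = pd.concat(chunks, ignore_index=True)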
Testing
Run the test suite:
pip install pytest pytest-cov
pytest tests/ -v
Coverage: >80% of codebase
Contributing
Contributions welcome! Areas for enhancement:
- Additional PII pattern types
- Custom outlier detection methods
- Integration with Dask for larger-than-memory data
- Web API for cleaning service
License
MIT License
Project Status
This is an alpha release (v0.1.0); the core API is largely stable but may still evolve. Please report issues on GitHub.
See Also
- pandas: Data manipulation
- pydantic: Configuration validation
- great-expectations: More advanced data validation
- pandas-profiling: Data profiling and analysis