A comprehensive data cleaning toolkit for various data structures.
CleanEasy
CleanEasy is a powerful, user-friendly Python library designed to simplify data cleaning and preprocessing for data scientists and analysts. Built on top of pandas, numpy, scikit-learn, and nltk, it provides a chainable API to handle common tasks like missing value imputation, outlier detection, text processing, date manipulation, and categorical encoding. With detailed logging and formatted output, CleanEasy makes data preparation intuitive, transparent, and visually appealing.
Table of Contents
- Introduction
- Features
- Installation
- Usage
- Project Structure
- Testing
- Contributing
- License
- Contact and Support
- FAQ
- Roadmap
Introduction
CleanEasy streamlines the data cleaning process by offering a unified interface for a wide range of preprocessing tasks. Whether you're working with DataFrames, NumPy arrays, lists, dictionaries, or CSV files, CleanEasy handles data conversion, cleaning, and validation with ease. Its method-chaining API allows you to build complex cleaning pipelines in a readable, maintainable way, while detailed logs and formatted outputs (using tabulate and colorama) ensure clarity and usability.
Key highlights:
- Supports multiple data input formats.
- Extensive methods for imputation, outlier removal, text processing, and encoding.
- Built-in validation tools for skewness, normality, and correlations.
- Auto-cleaning pipeline for quick preprocessing.
- Pretty-printed output for easy interpretation.
Features
CleanEasy offers a rich set of tools for data preprocessing:
Data Input and Conversion
- Accepts `pandas.DataFrame`, `numpy.ndarray`, lists, dictionaries, or CSV file paths.
- Automatically converts inputs to a `pandas.DataFrame` using `convert_to_dataframe`.
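For illustration, this kind of coercion can be sketched in plain pandas. The helper name `to_dataframe` below is hypothetical; CleanEasy's own implementation is `convert_to_dataframe` in `cleaneasy/utils.py` and may differ in detail:

```python
import pandas as pd
import numpy as np

def to_dataframe(data):
    """Coerce common input types to a pandas DataFrame (illustrative sketch)."""
    if isinstance(data, pd.DataFrame):
        return data
    if isinstance(data, str):          # treat strings as CSV file paths
        return pd.read_csv(data)
    if isinstance(data, (np.ndarray, list, dict)):
        return pd.DataFrame(data)
    raise TypeError(f"Unsupported input type: {type(data).__name__}")

df = to_dataframe({'a': [1, 2], 'b': [3.0, 4.0]})
```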
Missing Value Imputation
- KNN Imputation: `impute_knn` for numeric columns using k-nearest neighbors.
- Statistical Imputation: `impute_mean`, `impute_median`, `impute_mode`.
- Time-Series Imputation: `impute_forward_fill`, `impute_backward_fill`, `impute_interpolate`.
- Constant Imputation: `impute_constant` with a user-specified value.
- Drop Missing: `drop_missing_rows`, `drop_missing_columns` based on thresholds.
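As a rough illustration of what these strategies do (not CleanEasy's actual implementation), the statistical and time-series imputers map onto plain pandas operations:

```python
import pandas as pd

s = pd.Series([1.0, None, 3.0, None, 5.0])

mean_filled  = s.fillna(s.mean())   # ~ impute_mean
ffilled      = s.ffill()            # ~ impute_forward_fill
interpolated = s.interpolate()      # ~ impute_interpolate (linear)
constant     = s.fillna(0)          # ~ impute_constant(value=0)
```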
Outlier Detection and Handling
- Isolation Forest: `remove_outliers_isolation_forest` for robust outlier removal.
- IQR: `remove_outliers_iqr` and `cap_outliers_iqr` for interquartile-range-based handling.
- Z-Score: `remove_outliers_zscore` and `cap_outliers_zscore` for standard-deviation-based handling.
- DBSCAN: `remove_outliers_dbscan` for clustering-based outlier detection.
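The IQR rule flags values outside [Q1 − 1.5·IQR, Q3 + 1.5·IQR]. A plain-pandas sketch of the kind of computation `remove_outliers_iqr` and `cap_outliers_iqr` presumably perform:

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 300])   # 300 is an obvious outlier

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

removed = s[(s >= lower) & (s <= upper)]    # ~ remove_outliers_iqr
capped  = s.clip(lower=lower, upper=upper)  # ~ cap_outliers_iqr
```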
Text Processing
- Tokenization: `tokenize_text` using NLTK's word tokenizer.
- Lemmatization: `lemmatize_text` with the WordNet lemmatizer.
- Cleaning: `lowercase_text`, `remove_special_chars`, `trim_whitespace`, `remove_numbers`, `replace_text`.
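The cleaning steps compose naturally. As an illustration (using only the standard library, not CleanEasy's internals), a combined lowercase/strip-specials/strip-digits/trim pass might look like:

```python
import re

def clean_text(s):
    """Lowercase, strip special characters and digits, trim whitespace.

    Mirrors the kind of steps lowercase_text / remove_special_chars /
    remove_numbers / trim_whitespace perform, as a stdlib-only sketch.
    """
    s = s.lower()
    s = re.sub(r'[^a-z0-9\s]', '', s)   # remove special characters
    s = re.sub(r'\d+', '', s)           # remove numbers
    return re.sub(r'\s+', ' ', s).strip()

clean_text('  John@Doe 42! ')  # → 'johndoe'
```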
Date and Time Processing
- Parsing: `parse_dates` to convert strings to datetime.
- Feature Extraction: `extract_year`, `extract_month`, `extract_quarter`, `extract_day_of_week`.
- Formatting: `standardize_date_format` for consistent date strings.
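Under the hood these operations correspond to pandas datetime accessors; a sketch of the equivalent plain-pandas calls (invalid strings become `NaT` when coerced):

```python
import pandas as pd

dates = pd.to_datetime(
    pd.Series(['2023-01-01', 'invalid', '2023-03-03']),
    errors='coerce',                # unparseable values become NaT
)

years    = dates.dt.year            # ~ extract_year
quarters = dates.dt.quarter         # ~ extract_quarter
weekdays = dates.dt.day_name()      # ~ extract_day_of_week (as names)
```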
Categorical Encoding
- Frequency Encoding: `frequency_encode` for value counts.
- Label Encoding: `label_encode` for ordinal categories.
- One-Hot Encoding: `one_hot_encode` with a drop-first option.
- Rare Categories: `merge_rare_categories` to group infrequent categories.
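The three encodings can be illustrated with plain pandas (a sketch of the techniques, not CleanEasy's exact implementation):

```python
import pandas as pd

s = pd.Series(['A', 'B', 'A', 'C'])

freq    = s.map(s.value_counts(normalize=True))             # ~ frequency_encode
labels  = s.astype('category').cat.codes                    # ~ label_encode
one_hot = pd.get_dummies(s, prefix='cat', drop_first=True)  # ~ one_hot_encode
```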
Data Validation
- Skewness: `check_skewness` for numeric columns.
- Normality: `check_normality` using the Shapiro-Wilk test.
- Missing Values: `check_missing_proportion` for column-wise missing ratios.
- Unique Values: `check_unique_values` for distinct counts.
- Correlations: `check_correlation` and `remove_highly_correlated` for numeric columns.
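Dropping highly correlated columns typically means scanning the upper triangle of the correlation matrix for pairs above a threshold; a plain pandas/numpy sketch of that idea:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [2, 4, 6, 8, 10],   # perfectly correlated with 'a'
    'c': [5, 3, 8, 1, 9],
})

corr = df.corr(method='pearson').abs()
# Keep only the upper triangle so each pair is considered once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.8).any()]
reduced = df.drop(columns=to_drop)
```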
Other Utilities
- Duplicates: `drop_duplicates` and `identify_duplicates`.
- Scaling: `standardize_numeric` (z-score) and `normalize_numeric` (min-max).
- Binning: `bin_numeric` for discretizing numeric columns.
- Log Transformation: `log_transform` for handling skewed data.
- Auto-Cleaning: `auto_clean` for a customizable, one-step pipeline.
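The scaling and transformation utilities correspond to standard formulas. A pandas/numpy sketch of the underlying math (not CleanEasy's exact code):

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])   # right-skewed data

standardized = (s - s.mean()) / s.std()              # ~ standardize_numeric (z-score)
normalized   = (s - s.min()) / (s.max() - s.min())   # ~ normalize_numeric (min-max)
binned       = pd.cut(s, bins=3, labels=['low', 'mid', 'high'])  # ~ bin_numeric
logged       = np.log1p(s)                           # ~ log_transform, log(1 + x)
```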
Output and Logging
- Detailed logging of all operations with customizable log levels.
- Formatted console output with tables (`tabulate`) and colors (`colorama`).
- Results stored in `get_results()` for inspection.
Installation
Prerequisites
- Python: 3.8 or higher
- Operating System: Windows, macOS, or Linux
- Virtual Environment: Recommended for dependency isolation
- Terminal: For running commands (e.g., Windows Terminal, VS Code, or bash)
Step-by-Step Instructions
1. Clone or Download the Repository

   ```bash
   git clone https://github.com/CyberMatic-AmAn/cleaneasy.git
   cd cleaneasy
   ```

2. Create a Virtual Environment (optional but recommended)

   ```bash
   python -m venv venv
   .\venv\Scripts\activate   # Windows
   source venv/bin/activate  # Linux/macOS
   ```

3. Install Dependencies

   Install required packages from `requirements.txt`:

   ```bash
   pip install -r requirements.txt
   ```

   Dependencies include: `pandas>=1.5.0`, `numpy>=1.23.0`, `scipy>=1.9.0`, `scikit-learn>=1.1.0`, `nltk>=3.7`, `pytest>=7.0.0`, `tabulate>=0.8.9`, `colorama>=0.4.4`.

4. Download NLTK Data

   Some methods (e.g., `tokenize_text`, `lemmatize_text`) require NLTK resources:

   ```python
   import nltk
   nltk.download('punkt')
   nltk.download('punkt_tab')
   nltk.download('wordnet')
   ```

5. Install CleanEasy as a Package

   Install the `cleaneasy` package locally to make it importable:

   ```bash
   pip install .
   ```
Usage
Basic Example
The `main.py` script demonstrates a typical cleaning pipeline. It processes a sample dataset with missing values, outliers, text, dates, and categorical data, producing formatted output.
```python
import numpy as np
import pandas as pd
from tabulate import tabulate
from colorama import init, Fore, Style
from cleaneasy import CleanEasy

# Initialize colorama for colored output
init()

def format_dict(d, indent=0):
    """Pretty-print a dictionary with indentation for nested structures."""
    result = []
    for key, value in d.items():
        key_str = f"{Fore.CYAN}{key}{Style.RESET_ALL}"
        if isinstance(value, dict):
            result.append(f"{' ' * indent}{key_str}:")
            result.append(format_dict(value, indent + 1))
        elif isinstance(value, list) and key == 'name_tokens':
            value_str = ', '.join([str(item) for item in value])
            result.append(f"{' ' * indent}{key_str}: {value_str}")
        else:
            if isinstance(value, (np.floating, np.integer)):
                value = float(value) if isinstance(value, np.floating) else int(value)
            result.append(f"{' ' * indent}{key_str}: {value}")
    return '\n'.join(result)

# Sample data
data = {
    'name': ['John@Doe', 'Jane Smith!', None, 'Alice'],
    'age': [25, 30, 1000, None],
    'salary': [50000, None, 60000, 55000],
    'date': ['2023-01-01', '2023-02-02', 'invalid', '2023-03-03'],
    'category': ['A', 'B', 'A', 'C']
}
df = pd.DataFrame(data)

# Initialize CleanEasy
cleaner = CleanEasy(df, log_level='INFO')

# Apply cleaning steps
cleaner.parse_dates(columns=['date'])
cleaner = (cleaner
    .impute_knn(columns=['age', 'salary'], n_neighbors=3, weights='distance')
    .remove_outliers_isolation_forest(columns=['age'], contamination=0.2, random_state=42)
    .tokenize_text(columns=['name'], lowercase=True)
    .extract_day_of_week(columns=['date'], return_numeric=True)
    .frequency_encode(columns=['category'], normalize=True)
)

# Store skewness results
skewness_results = cleaner.check_skewness(columns=['age', 'salary'])

# Continue method chain
cleaned_df = (cleaner
    .remove_highly_correlated(threshold=0.8, method='pearson')
    .get_cleaned_data()
)

# Display results
print(f"\n{Fore.GREEN}=== Cleaned DataFrame ==={Style.RESET_ALL}")
cleaned_df_display = cleaned_df.copy()
cleaned_df_display['name_tokens'] = cleaned_df_display['name_tokens'].apply(lambda x: ', '.join(x))
print(tabulate(cleaned_df_display, headers='keys', tablefmt='psql', showindex=True, floatfmt='.2f'))

print(f"\n{Fore.GREEN}=== Cleaning Steps ==={Style.RESET_ALL}")
for i, step in enumerate(cleaner.get_cleaning_log(), 1):
    print(f"{i}. {step}")

print(f"\n{Fore.GREEN}=== Skewness Results ==={Style.RESET_ALL}")
skewness_formatted = {k: float(v) for k, v in skewness_results.items()}
for col, value in skewness_formatted.items():
    print(f"{Fore.CYAN}{col}{Style.RESET_ALL}: {value:.4f}")

print(f"\n{Fore.GREEN}=== All Results ==={Style.RESET_ALL}")
results = cleaner.get_results()
for key, value in results.items():
    if isinstance(value, dict):
        for subkey, subvalue in value.items():
            if isinstance(subvalue, (np.floating, np.integer)):
                results[key][subkey] = float(subvalue) if isinstance(subvalue, np.floating) else int(subvalue)
print(format_dict(results))
```
Example Output
Running `python main.py` produces:

```
2025-07-05 12:35:10,417 - CleanEasy - INFO - Initialized CleanEasy with data type: DataFrame
2025-07-05 12:35:10,421 - CleanEasy - INFO - Parsed date to datetime
2025-07-05 12:35:10,425 - CleanEasy - INFO - Imputed ['age', 'salary'] with KNN (n_neighbors=3, weights=distance)
2025-07-05 12:35:10,540 - CleanEasy - INFO - Removed 1 outliers from ['age'] using Isolation Forest (contamination=0.2)
2025-07-05 12:35:10,610 - CleanEasy - INFO - Tokenized text in name (lowercase=True)
2025-07-05 12:35:10,637 - CleanEasy - INFO - Extracted day of week from date to date_dayofweek (numeric=True)
2025-07-05 12:35:10,641 - CleanEasy - INFO - Frequency encoded category to category_freq (normalize=True)
2025-07-05 12:35:10,641 - CleanEasy - INFO - Skewness for age: 1.7314
2025-07-05 12:35:10,642 - CleanEasy - INFO - Skewness for salary: 1.7314
2025-07-05 12:35:10,645 - CleanEasy - INFO - Dropped 1 highly correlated columns (method=pearson, threshold=0.8)

=== Cleaned DataFrame ===
+----+-------------+--------+------------+----------+----------------+------------------+-----------------+
|    | name        |    age | date       | category | name_tokens    |   date_dayofweek |   category_freq |
|----+-------------+--------+------------+----------+----------------+------------------+-----------------|
|  0 | John@Doe    |  25.00 | 2023-01-01 | A        | john, @, doe   |                6 |            0.33 |
|  1 | Jane Smith! |  30.00 | 2023-02-02 | B        | jane, smith, ! |                3 |            0.33 |
|  3 | Alice       | 512.50 | 2023-03-03 | C        | alice          |                4 |            0.33 |
+----+-------------+--------+------------+----------+----------------+------------------+-----------------+

=== Cleaning Steps ===
1. Parsed date columns
2. Imputed missing values with KNN (weights=distance)
3. Removed outliers using Isolation Forest
4. Tokenized text columns
5. Extracted day of week from datetime columns
6. Applied frequency encoding
7. Checked skewness
8. Removed highly correlated columns (threshold=0.8)

=== Skewness Results ===
age: 1.7314
salary: 1.7314

=== All Results ===
knn_imputation:
  columns: ['age', 'salary']
  n_neighbors: 3
  weights: distance
isolation_forest:
  columns: ['age']
  outliers_removed: 1
name_tokens: [john, @, doe], [jane, smith, !], [alice]
category_freq:
  A: 0.3333333333333333
  B: 0.3333333333333333
  C: 0.3333333333333333
skewness:
  age: 1.7314295926231227
  salary: 1.7314295926231076
correlated_columns_dropped: ['salary']
```
Auto-Cleaning Example
Use `auto_clean` for a one-step pipeline:

```python
cleaner = CleanEasy(df, log_level='INFO')
cleaned_df = cleaner.auto_clean(
    impute_method='knn',
    outlier_method='isolation_forest',
    text_clean=True,
    date_parse=True,
    categorical_encode='frequency'
)
print(f"\n{Fore.GREEN}=== Auto-Cleaned DataFrame ==={Style.RESET_ALL}")
print(tabulate(cleaned_df, headers='keys', tablefmt='psql', showindex=True, floatfmt='.2f'))
```
Project Structure
```
cleaneasy/
├── cleaneasy/
│   ├── __init__.py         # Package initialization and exports
│   ├── core.py             # Core CleanEasy class with cleaning methods
│   ├── utils.py            # Utility functions (e.g., convert_to_dataframe)
│   └── validators.py       # Validation functions (e.g., check_skewness)
├── tests/
│   ├── __init__.py         # Test package initialization
│   ├── test_core.py        # Tests for core.py
│   ├── test_utils.py       # Tests for utils.py
│   └── test_validators.py  # Tests for validators.py
├── docs/
│   ├── conf.py             # Sphinx documentation configuration
│   └── index.rst           # Sphinx documentation index
├── main.py                 # Example script demonstrating usage
├── pyproject.toml          # Project metadata and build configuration
├── requirements.txt        # Dependencies
├── README.md               # This file
└── LICENSE                 # License file (MIT)
```
Testing
CleanEasy includes a test suite using pytest to ensure reliability.
1. Install pytest

   ```bash
   pip install pytest
   ```

2. Run Tests

   ```bash
   cd cleaneasy
   pytest tests/
   ```

Tests cover:

- Initialization and data conversion (`test_core.py`, `test_utils.py`)
- Cleaning methods (e.g., `impute_knn`, `remove_outliers_isolation_forest`)
- Validation functions (e.g., `check_skewness`, `check_normality`)
Contributing
We welcome contributions to CleanEasy! To contribute:
1. Fork the Repository

   ```bash
   git clone https://github.com/CyberMatic-AmAn/cleaneasy.git
   cd cleaneasy
   ```

2. Create a Branch

   ```bash
   git checkout -b feature/your-feature-name
   ```

3. Make Changes
   - Add new features or fix bugs in `cleaneasy/`.
   - Update tests in `tests/`.
   - Document changes in `docs/` if necessary.

4. Run Tests

   Ensure all tests pass:

   ```bash
   pytest tests/
   ```

5. Submit a Pull Request
   - Push your branch: `git push origin feature/your-feature-name`.
   - Open a pull request on GitHub with a clear description of changes.

6. Report Issues
   - Use the GitHub Issues page to report bugs or suggest features.
   - Include detailed descriptions and reproduction steps.
License
CleanEasy is licensed under the MIT License. See the LICENSE file for details.
Contact and Support
- Email: exehyper999@gmail.com (replace with your contact)
- GitHub Issues: github.com/CyberMatic-AmAn/cleaneasy/issues
- Documentation: https://github.com/CyberMatic-AmAn/cleaneasy
For support, open an issue on GitHub or contact the maintainer directly.
FAQ
Why do I get an NLTK error when using tokenize_text?
Ensure the NLTK data is downloaded:

```python
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('wordnet')
```
Why is the output not colored?
- Verify `colorama` is installed: `pip show colorama`.
- Ensure your terminal supports ANSI colors (e.g., Windows Terminal, VS Code).
- Check that `colorama.init()` is called in `main.py`.
How do I add a new cleaning method?
- Add the method to `cleaneasy/core.py` in the `CleanEasy` class.
- Ensure it returns `self` for method chaining.
- Update tests in `tests/test_core.py`.
- Document the method in `docs/` and this `README.md`.
Can I use CleanEasy with large datasets?
Yes, but performance depends on the methods used (e.g., `impute_knn` and `remove_outliers_isolation_forest` can be computationally intensive). Test with a sample first.
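A common tactic is to tune the pipeline on a random sample first, then apply the finalized steps to the full dataset; for example, with plain pandas:

```python
import pandas as pd
import numpy as np

# Simulate a large dataset
rng = np.random.default_rng(0)
big = pd.DataFrame({'x': rng.normal(size=1_000_000)})

# Prototype the cleaning pipeline on a 1% sample before the full run
sample = big.sample(frac=0.01, random_state=42)
```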
Project details
File details
Details for the file cleaneasy-0.4.2.tar.gz.
File metadata
- Download URL: cleaneasy-0.4.2.tar.gz
- Upload date:
- Size: 20.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `4a19818971b5753243f001cc0318ae35090e31df16867e4081c6f42537715c5f` |
| MD5 | `c50db5cf3f7a34fc02fb021225e691af` |
| BLAKE2b-256 | `d0496b84b93a0d4c38ebf074a2d293ec5065fc42f47a353f9aaf9dbefc3cbfe0` |
File details
Details for the file cleaneasy-0.4.2-py3-none-any.whl.
File metadata
- Download URL: cleaneasy-0.4.2-py3-none-any.whl
- Upload date:
- Size: 15.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `0d72d1a9d0f29b9cef19e2cdab981739a5bd54f5afdff07a43479d244b9a1085` |
| MD5 | `6f95d90c1df2461a0448e3530b57b7b1` |
| BLAKE2b-256 | `d0d8e9e5734e4daed89c0b1dd6ba51ca122f0e595dfa5093033d48b964f32d22` |