AI-powered data cleaning library with intelligent recommendations and professional visualizations

DataPolish 🧹✨

AI-Powered Data Cleaning Library for Python

DataPolish is a data cleaning and preprocessing library for Python. It automates common cleaning steps (missing values, outliers, duplicates, column removal) and adds rule-based descriptions, recommendations, and explanations to help you prepare data for analysis and machine learning.

Python 3.7+ | License: MIT


✨ Features

🤖 AI-Powered Intelligence

  • Smart Data Descriptions - Get instant rule-based insights about your dataset
  • Intelligent Recommendations - Receive personalized cleaning suggestions
  • Automated Explanations - Understand what cleaning operations did to your data

🧹 Comprehensive Cleaning

  • Missing Values - 5 strategies (drop, mean, median, mode, forward fill)
  • Outlier Detection - IQR and Z-score methods with flexible actions
  • Duplicate Removal - Smart duplicate detection and removal
  • Column Management - Drop columns by name or index

📊 Advanced Analysis

  • Correlation Analysis - Detailed correlation reports with interpretations
  • Data Profiling - Comprehensive quality scoring and statistics
  • Quality Metrics - Automated data quality assessment

📈 Professional Visualizations

  • Overview dashboards
  • Missing value heatmaps
  • Distribution plots
  • Correlation matrices
  • Outlier visualizations
  • Categorical analysis
  • Export DataFrame as images - Perfect for reports and presentations

💾 Flexible I/O

  • Support for CSV, Excel, JSON formats
  • Smart auto-save functionality
  • Multiple export options
  • Preserves original data

🚀 Quick Start

Installation

pip install datapolish

Basic Usage

from datapolish import DataCleaner

# Load your data
cleaner = DataCleaner("your_data.csv")

# Get AI-powered description
print(cleaner.describe_data(detail_level=0))

# Get intelligent recommendations
print(cleaner.get_recommendations())

# Clean your data
config = {
    'missing': {'strategy': 'median'},
    'outliers': {'method': 'iqr', 'action': 'cap'},
    'duplicates': {'drop': True}
}
cleaner.clean(config)

# Get explanation of what was done
print(cleaner.explain_cleaning('detailed'))

# Save cleaned data
cleaner.save()  # Auto-saves to ./datapolish_output/

📚 Comprehensive Example

from datapolish import DataCleaner

# Initialize
cleaner = DataCleaner("sales_data.csv")

# 1. Drop unnecessary columns
cleaner.drop_columns('ID', 'Internal_Code', 0)  # By name or index

# 2. Analyze correlations
result = cleaner.analyze_correlation('Price', 'Sales')
print(result['interpretation'])

# 3. View data as professional image
cleaner.view_as_image(rows=10, save_path='preview.png')
cleaner.view_as_image(rows=-10, save_path='tail.png')  # Last 10 rows

# 4. Profile your data
profile = cleaner.profile()
print(f"Quality Score: {profile['quality_score']}/100")

# 5. Visualize data quality
cleaner.visualize('overview', save_path='dashboard.png')
cleaner.visualize('missing', save_path='missing_values.png')

# 6. Clean with smart configuration
config = {
    'missing': {'strategy': 'median'},
    'outliers': {'method': 'iqr', 'action': 'cap', 'threshold': 1.5},
    'duplicates': {'drop': True, 'keep': 'first'}
}
cleaner.clean(config)

# 7. Verify results
cleaner.view_as_image(rows=10, save_path='cleaned_preview.png')

# 8. Save cleaned data
cleaner.save("cleaned_sales_data.csv")

🎯 Key Features in Detail

1. Drop Columns (New!)

Drop columns by name, index, or both:

# By name
cleaner.drop_columns('Age', 'Gender')

# By index (0-based)
cleaner.drop_columns(0, 2, 5)

# Mix both!
cleaner.drop_columns('Name', 0, 'Email', 3)
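When names and indices are mixed, it matters that each index refers to the original column order and that a column named twice is dropped only once. The sketch below shows one way that resolution could work. `resolve_columns` is a hypothetical helper written for illustration and is not part of DataPolish; the library's internal behavior may differ.

```python
from typing import List, Sequence, Union


def resolve_columns(all_columns: Sequence[str],
                    specs: Sequence[Union[str, int]]) -> List[str]:
    """Map a mix of names and 0-based indices to unique column names.

    Hypothetical helper: indices are looked up against the ORIGINAL
    column order, before anything is dropped.
    """
    resolved: List[str] = []
    for spec in specs:
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(spec, bool):
            raise TypeError("column spec must be str or int, not bool")
        if isinstance(spec, int):
            name = all_columns[spec]  # raises IndexError if out of range
        elif isinstance(spec, str):
            if spec not in all_columns:
                raise KeyError(spec)
            name = spec
        else:
            raise TypeError(f"unsupported column spec: {spec!r}")
        if name not in resolved:  # skip columns already selected
            resolved.append(name)
    return resolved


columns = ["Name", "Age", "Email", "City", "Score"]
to_drop = resolve_columns(columns, ["Name", 0, "Email", 3])
remaining = [c for c in columns if c not in to_drop]
print(to_drop)    # ['Name', 'Email', 'City']
print(remaining)  # ['Age', 'Score']
```

Here `'Name'` and index `0` point at the same column, so three columns are dropped rather than four.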

2. Correlation Analysis (New!)

Get detailed correlation analysis with AI interpretation:

# Analyze specific pair
result = cleaner.analyze_correlation('Height', 'Weight')
print(result['interpretation'])  # Plain English explanation

# Find all significant correlations
result = cleaner.analyze_correlation(threshold=0.5)
for corr in result['all_correlations']:
    print(f"{corr['col1']} <-> {corr['col2']}: {corr['correlation']:.3f}")
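To make the all-pairs mode concrete, here is a minimal standard-library sketch of what a threshold filter over Pearson correlations could look like. It reuses the `col1`, `col2`, and `correlation` keys shown above, but the functions themselves are illustrative, and comparing the absolute value against the threshold is an assumption rather than documented DataPolish behavior.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)


def significant_pairs(data: Dict[str, Sequence[float]],
                      threshold: float) -> List[dict]:
    """Return every column pair whose |r| reaches the threshold (assumed rule)."""
    results = []
    for col1, col2 in combinations(data, 2):
        r = pearson(data[col1], data[col2])
        if abs(r) >= threshold:
            results.append({"col1": col1, "col2": col2, "correlation": r})
    return results


data = {
    "Height": [150, 160, 170, 180, 190],
    "Weight": [50, 60, 70, 80, 90],
    "Noise": [3, 1, 4, 1, 5],
}
pairs = significant_pairs(data, threshold=0.5)
for corr in pairs:
    print(f"{corr['col1']} <-> {corr['col2']}: {corr['correlation']:.3f}")
```

With this toy data only the Height/Weight pair passes a threshold of 0.5; both pairs involving `Noise` fall below it.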

3. View as Image (New!)

Export DataFrame as professional table images:

# First 10 rows
cleaner.view_as_image(rows=10, save_path='preview.png')

# Last 10 rows (using negative index!)
cleaner.view_as_image(rows=-10, save_path='tail.png')

# Full table
cleaner.view_as_image(save_path='full_data.png')

# Custom styling
cleaner.view_as_image(
    rows=20,
    save_path='styled.png',
    title='My Dataset',
    show_dtypes=True
)

4. AI-Powered Descriptions

# Brief description
print(cleaner.describe_data(0))

# Detailed description
print(cleaner.describe_data(1))

5. Smart Recommendations

recommendations = cleaner.get_recommendations()
print(recommendations)

6. Professional Visualizations

# Overview dashboard
cleaner.visualize('overview', save_path='dashboard.png')

# Missing values heatmap
cleaner.visualize('missing', save_path='missing.png')

# Distribution plots
cleaner.visualize('distribution', 
                 columns=['Age', 'Salary', 'Score'],
                 save_path='distributions.png')

# Correlation matrix
cleaner.visualize('correlation', save_path='correlations.png')

🎓 Documentation

Core Methods

DataCleaner(file_path)

Initialize the cleaner with your data file.

Parameters:

  • file_path (str): Path to CSV, Excel, or JSON file

drop_columns(*columns)

Remove columns by name or index.

Parameters:

  • *columns: Column names (str) or indices (int, 0-based)

Returns: List of dropped column names

Example:

cleaner.drop_columns('ID', 0, 'Email', 3)

analyze_correlation(col1=None, col2=None, threshold=0.3, method='pearson')

Analyze correlations with detailed interpretations.

Parameters:

  • col1 (str): First column (for specific pair analysis)
  • col2 (str): Second column (for specific pair analysis)
  • threshold (float): Minimum correlation to report (for all-pairs)
  • method (str): 'pearson', 'spearman', or 'kendall'

Returns: Dictionary with correlation analysis

Example:

# Specific pair
result = cleaner.analyze_correlation('X', 'Y')

# All pairs
result = cleaner.analyze_correlation(threshold=0.5)

view_as_image(rows=None, save_path=None, **kwargs)

Export DataFrame as professional table image.

Parameters:

  • rows (int/None): Number of rows (positive=first N, negative=last N, None=all)
  • save_path (str): Where to save the image
  • title (str): Custom title
  • show_dtypes (bool): Show data types in headers
  • figsize (tuple): Figure size

Example:

cleaner.view_as_image(rows=10, save_path='preview.png')
cleaner.view_as_image(rows=-10, save_path='tail.png')
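The `rows` convention (positive for the first N, negative for the last N, `None` for everything) maps directly onto Python slicing. The sketch below illustrates that mapping on a plain list; `select_rows` is a hypothetical name used for illustration, not a DataPolish function.

```python
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def select_rows(records: Sequence[T], rows: Optional[int] = None) -> List[T]:
    """Pick rows the way the `rows` parameter is described above."""
    if rows is None:
        return list(records)          # None: the full table
    if rows >= 0:
        return list(records[:rows])   # positive: first N rows
    return list(records[rows:])       # negative: last |N| rows


records = list(range(1, 26))          # stand-in for 25 table rows
print(select_rows(records, 10))       # rows 1 through 10
print(select_rows(records, -10))      # rows 16 through 25
print(len(select_rows(records)))      # 25
```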

profile()

Get comprehensive data profile.

Returns: Dictionary with statistics and quality metrics


describe_data(detail_level=0)

Get AI-generated data description.

Parameters:

  • detail_level (int): 0=brief, 1=detailed

Returns: String description


get_recommendations()

Get AI-generated cleaning recommendations.

Returns: String with recommendations


visualize(plot_type='overview', **kwargs)

Create professional visualizations.

Parameters:

  • plot_type (str): 'overview', 'missing', 'distribution', 'correlation', 'outliers', 'categorical'
  • save_path (str): Where to save the plot
  • columns (list): Specific columns (for some plot types)

clean(config)

Execute cleaning operations.

Parameters:

  • config (dict): Cleaning configuration

Example:

config = {
    'missing': {'strategy': 'median'},
    'outliers': {'method': 'iqr', 'action': 'cap'},
    'duplicates': {'drop': True}
}
cleaner.clean(config)

explain_cleaning(detail_level='summary')

Get explanation of cleaning operations.

Parameters:

  • detail_level (str): 'summary' or 'detailed'

Returns: String explanation


save(filename=None, format='csv')

Save cleaned data.

Parameters:

  • filename (str): Output filename (optional)
  • format (str): 'csv', 'excel', or 'json'

Returns: Path to saved file


📊 Cleaning Configuration

Missing Values

config = {
    'missing': {
        'strategy': 'median',  # 'drop', 'mean', 'median', 'mode', 'ffill'
        'columns': ['Age', 'Salary']  # Optional: specific columns
    }
}
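For readers unsure how the five strategies differ, the following standard-library sketch applies each one to a single column, with `None` marking a missing value. It only illustrates the general techniques; `fill_missing` is a hypothetical helper, and DataPolish's own handling (for example, of leading gaps under `ffill`, or of whole rows under `drop`) may differ.

```python
import statistics
from typing import List, Optional


def fill_missing(values: List[Optional[float]],
                 strategy: str) -> List[Optional[float]]:
    """Apply one missing-value strategy to a single column (illustrative)."""
    present = [v for v in values if v is not None]
    if strategy == "drop":
        return present                      # discard missing entries
    if strategy == "ffill":
        filled, last = [], None
        for v in values:
            if v is not None:
                last = v
            filled.append(last)             # a leading gap stays None
        return filled
    if strategy == "mean":
        fill = statistics.mean(present)
    elif strategy == "median":
        fill = statistics.median(present)
    elif strategy == "mode":
        fill = statistics.mode(present)
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return [fill if v is None else v for v in values]


ages = [25, None, 31, 40, None, 31]
for strategy in ("drop", "mean", "median", "mode", "ffill"):
    print(strategy, fill_missing(ages, strategy))
```

Median and mode are less sensitive to extreme values than the mean, which is one reason the examples in this README default to `'median'`.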

Outliers

config = {
    'outliers': {
        'method': 'iqr',        # 'iqr' or 'zscore'
        'action': 'cap',         # 'cap', 'remove', or 'flag'
        'threshold': 1.5,        # IQR multiplier (default 1.5)
        'columns': ['Price']     # Optional: specific columns
    }
}
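The sketch below shows, under stated assumptions, how the two detection methods and three actions could combine. The IQR fences use linear-interpolation quartiles and the z-score fences use the population standard deviation; both are choices made for this example, and `outlier_bounds` and `apply_action` are hypothetical helpers, not DataPolish functions.

```python
import statistics
from typing import List, Sequence, Tuple


def quantile(sorted_values: Sequence[float], q: float) -> float:
    """Linear-interpolation quantile of pre-sorted data."""
    pos = (len(sorted_values) - 1) * q
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    span = sorted_values[upper] - sorted_values[lower]
    return sorted_values[lower] + (pos - lower) * span


def outlier_bounds(values: Sequence[float], method: str = "iqr",
                   threshold: float = 1.5) -> Tuple[float, float]:
    """Return (low, high) fences; values outside them count as outliers."""
    if method == "iqr":
        ordered = sorted(values)
        q1, q3 = quantile(ordered, 0.25), quantile(ordered, 0.75)
        iqr = q3 - q1
        return q1 - threshold * iqr, q3 + threshold * iqr
    if method == "zscore":
        mean = statistics.mean(values)
        std = statistics.pstdev(values)  # population std (an assumption)
        return mean - threshold * std, mean + threshold * std
    raise ValueError(f"unknown method: {method!r}")


def apply_action(values: Sequence[float], bounds: Tuple[float, float],
                 action: str) -> List:
    """Cap, remove, or flag the values that fall outside the fences."""
    low, high = bounds
    if action == "cap":
        return [min(max(v, low), high) for v in values]
    if action == "remove":
        return [v for v in values if low <= v <= high]
    if action == "flag":
        return [not (low <= v <= high) for v in values]
    raise ValueError(f"unknown action: {action!r}")


prices = [10, 12, 11, 13, 12, 11, 100]
bounds = outlier_bounds(prices, method="iqr", threshold=1.5)
print(bounds)                                 # (8.75, 14.75)
print(apply_action(prices, bounds, "cap"))    # 100 becomes 14.75
print(apply_action(prices, bounds, "flag"))   # only the last value is True
```

Note that `threshold` lives on a different scale for each method: it multiplies the IQR in one case and counts standard deviations in the other, so a value tuned for `'iqr'` should not be reused unchanged for `'zscore'`.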

Duplicates

config = {
    'duplicates': {
        'drop': True,
        'keep': 'first',  # 'first', 'last', or False
        'subset': None    # Optional: columns to check
    }
}
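The `keep` and `subset` options have the same names and values as those of pandas' `drop_duplicates`, so the sketch below assumes they carry the same meaning: `'first'` keeps the earliest row of each duplicate group, `'last'` keeps the latest, and `False` discards every row that has a duplicate. The function is a plain-Python illustration, not DataPolish code.

```python
from collections import Counter
from typing import Dict, Hashable, List, Optional, Sequence, Union

Row = Dict[str, Hashable]


def drop_duplicates(rows: Sequence[Row],
                    keep: Union[str, bool] = "first",
                    subset: Optional[Sequence[str]] = None) -> List[Row]:
    """Remove duplicate rows, comparing only `subset` columns if given."""
    def key(row: Row) -> tuple:
        cols = subset if subset is not None else sorted(row)
        return tuple(row[c] for c in cols)

    keys = [key(r) for r in rows]
    if keep is False:
        counts = Counter(keys)
        return [r for r, k in zip(rows, keys) if counts[k] == 1]
    if keep not in ("first", "last"):
        raise ValueError(f"unsupported keep value: {keep!r}")

    # For 'last', scan from the end and restore the order afterwards.
    pairs = list(zip(rows, keys))
    if keep == "last":
        pairs.reverse()
    seen, kept = set(), []
    for row, k in pairs:
        if k not in seen:
            seen.add(k)
            kept.append(row)
    if keep == "last":
        kept.reverse()
    return kept


rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": "b@example.com"},
    {"id": 3, "email": "a@example.com"},
]
for keep in ("first", "last", False):
    ids = [r["id"] for r in drop_duplicates(rows, keep=keep, subset=["email"])]
    print(keep, ids)
```

With `subset=None` these three rows are all distinct because their `id` values differ, which is why restricting the comparison to `email` changes the result.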

Complete Configuration

config = {
    'missing': {'strategy': 'median'},
    'outliers': {'method': 'iqr', 'action': 'cap'},
    'duplicates': {'drop': True}
}
cleaner.clean(config)
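Because the configuration is a plain dictionary, a misspelled key or value is easy to miss. The sketch below checks a config against the option values listed in this section before it is passed to `clean`. `validate_config` and `ALLOWED` are hypothetical names for this example; DataPolish may perform its own validation, and how it reacts to unknown keys is not documented here.

```python
from typing import List

# Option values taken from the configuration comments in this section.
ALLOWED = {
    "missing": {"strategy": {"drop", "mean", "median", "mode", "ffill"}},
    "outliers": {"method": {"iqr", "zscore"},
                 "action": {"cap", "remove", "flag"}},
    "duplicates": {"keep": {"first", "last", False}},
}


def validate_config(config: dict) -> List[str]:
    """Return a list of problems; an empty list means nothing was flagged."""
    problems = []
    for section, options in config.items():
        if section not in ALLOWED:
            problems.append(f"unknown section: {section!r}")
            continue
        for option, allowed in ALLOWED[section].items():
            if option in options and options[option] not in allowed:
                problems.append(
                    f"{section}.{option}: unsupported value {options[option]!r}"
                )
    return problems


good = {
    "missing": {"strategy": "median"},
    "outliers": {"method": "iqr", "action": "cap"},
    "duplicates": {"drop": True},
}
bad = {"missing": {"strategy": "avg"}, "outlier": {"method": "iqr"}}

print(validate_config(good))  # []
print(validate_config(bad))
```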

💡 Use Cases

Data Science Projects

  • Clean datasets before analysis
  • Generate quality reports
  • Create visualization dashboards

Machine Learning Pipelines

  • Preprocess training data
  • Handle missing values intelligently
  • Detect and handle outliers

Business Analytics

  • Prepare data for reporting
  • Ensure data quality
  • Create professional visualizations

Data Quality Audits

  • Assess data quality scores
  • Identify data issues
  • Generate comprehensive reports

🔧 Requirements

  • Python 3.7 or higher
  • pandas >= 1.3.0
  • numpy >= 1.20.0
  • scipy >= 1.7.0
  • matplotlib >= 3.3.0
  • seaborn >= 0.11.0
  • openpyxl >= 3.0.0

📄 License

MIT License - see LICENSE file for details


🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.


🙏 Acknowledgments

DataPolish was created to make data cleaning accessible, intelligent, and efficient for everyone from beginners to data science professionals.


⭐ Star us on GitHub!

If you find DataPolish helpful, please star the repository!


Happy Data Cleaning! 🧹✨📊
