Self-contained CSV data cleaning tool with AI capabilities

CSV Data Cleaner

A self-contained command-line tool for cleaning CSV data. It builds on established Python libraries (pandas, pyjanitor, feature-engine, dedupe, missingno) and adds AI-generated cleaning suggestions that can be applied automatically.

🚀 Key Features

AI-Powered Features

  • 🤖 AI-Powered Automatic Cleaning: Execute AI suggestions automatically with the ai-clean command
  • 🧠 Intelligent Suggestions: Get AI-powered cleaning recommendations with the ai-suggest command
  • 📊 Data Analysis: Get AI-powered data analysis and insights with the ai-analyze command
  • 🎯 Learning System: The AI learns from your feedback to improve suggestions over time
  • ⚡ Multi-Provider Support: Works with OpenAI, Anthropic, and local LLMs

Core Cleaning Capabilities

  • 🔧 Multiple Libraries: pandas, pyjanitor, feature-engine, dedupe, missingno
  • ⚙️ 30+ Operations: Remove duplicates, handle missing values, clean text, fix dates, etc.
  • 📈 Performance Optimization: Parallel processing, memory management, chunked processing
  • 📊 Data Validation: Schema validation, data quality assessment, comprehensive reporting
  • 🎨 Visualization: Data quality heatmaps, missing data analysis, correlation matrices

🛠️ Installation

Quick Install

pip install csv-data-cleaner

From Source

git clone https://github.com/your-repo/csv-cleaner.git
cd csv-cleaner
pip install -e .

🚀 Quick Start

AI-Powered Automatic Cleaning (NEW!)

# Automatic cleaning with AI suggestions
csv-cleaner ai-clean input.csv output.csv

# Auto-confirm all suggestions
csv-cleaner ai-clean input.csv output.csv --auto-confirm

# Preview execution plan without modifying files
csv-cleaner ai-clean input.csv output.csv --dry-run

# Limit number of suggestions
csv-cleaner ai-clean input.csv output.csv --max-suggestions 10

AI-Powered Suggestions

# Get AI-powered cleaning suggestions
csv-cleaner ai-suggest input.csv

# Save suggestions to a file
csv-cleaner ai-suggest input.csv --output suggestions.json

AI-Powered Data Analysis

# Get comprehensive data analysis
csv-cleaner ai-analyze input.csv

# Save analysis to file
csv-cleaner ai-analyze input.csv --output analysis.json

Traditional Cleaning

# Clean with specific operations
csv-cleaner clean input.csv output.csv --operations "remove_duplicates,fill_missing"

# Interactive mode
csv-cleaner clean input.csv output.csv --interactive

# Performance optimized
csv-cleaner clean input.csv output.csv --parallel --chunk-size 10000

🤖 AI Configuration

Setup AI Providers

# Configure OpenAI
csv-cleaner ai-configure set --provider openai --api-key sk-...

# Configure Anthropic
csv-cleaner ai-configure set --provider anthropic --api-key sk-ant-...

# Show current configuration
csv-cleaner ai-configure show

# Validate configuration
csv-cleaner ai-configure validate

AI Features Overview

AI-Powered Automatic Cleaning (ai-clean)

  • Automatic Execution: AI generates and executes cleaning suggestions
  • Execution Planning: Shows detailed execution plan with confidence levels
  • User Control: Choose between automatic execution and manual confirmation
  • Dry-Run Mode: Preview changes without modifying files
  • Learning Integration: AI learns from execution results

AI-Powered Suggestions (ai-suggest)

  • Intelligent Analysis: AI analyzes data and suggests optimal cleaning operations
  • Confidence Scoring: Each suggestion includes confidence level and reasoning
  • Library Selection: AI recommends the best library for each operation
  • Impact Assessment: Estimates the impact of each suggestion

AI-Powered Analysis (ai-analyze)

  • Comprehensive Profiling: Detailed data quality assessment
  • Pattern Recognition: Identifies data patterns and anomalies
  • Recommendation Engine: Suggests cleaning strategies based on analysis
  • Exportable Reports: Save analysis results for further review

📋 Available Operations

Basic Data Cleaning (Pandas)

  • remove_duplicates - Remove duplicate rows
  • fill_missing - Fill missing values with various strategies
  • drop_missing - Remove rows/columns with missing values
  • clean_text - Clean and normalize text data
  • fix_dates - Convert and standardize date formats
  • convert_types - Convert data types automatically
  • rename_columns - Rename columns
  • drop_columns - Remove unwanted columns
  • select_columns - Select specific columns

Advanced Data Cleaning (PyJanitor)

  • clean_names - Clean column names
  • remove_empty - Remove empty rows/columns
  • fill_empty - Fill empty values
  • handle_missing - Advanced missing value handling
  • remove_constant_columns - Remove columns with constant values
  • remove_columns_with_nulls - Remove columns with null values
  • coalesce_columns - Combine multiple columns

Feature Engineering (Feature-Engine)

  • advanced_imputation - Advanced missing value imputation
  • categorical_encoding - Encode categorical variables
  • outlier_detection - Detect and handle outliers
  • variable_selection - Select relevant variables
  • data_transformation - Apply data transformations
  • missing_indicator - Create missing value indicators

Missing Data Analysis (MissingNo)

  • missing_matrix - Generate missing data matrix visualization
  • missing_bar - Generate missing data bar chart
  • missing_heatmap - Generate missing data heatmap
  • missing_dendrogram - Generate missing data dendrogram
  • missing_summary - Generate missing data summary
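
A missing data summary boils down to counting empty cells per column. The following sketch does that with the standard csv module rather than missingno; the set of tokens treated as missing ("", "NA", "NaN") is an assumption, and missing_summary here is a hypothetical helper, not the tool's function.

```python
import csv
import io


def missing_summary(rows, missing_tokens=("", "NA", "NaN")):
    """Return {column: (missing_count, missing_fraction)} for dict rows."""
    counts = {}
    total = 0
    for row in rows:
        total += 1
        for column, value in row.items():
            if column is None:  # extra cells beyond the header; ignore them
                continue
            is_missing = value is None or value.strip() in missing_tokens
            counts[column] = counts.get(column, 0) + int(is_missing)
    return {col: (n, n / total if total else 0.0) for col, n in counts.items()}


sample = io.StringIO("name,age,city\nAda,36,\nBob,NA,Paris\n,41,Oslo\n")
summary = missing_summary(csv.DictReader(sample))
print(summary)
```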

ML-Based Deduplication (Dedupe)

  • dedupe - ML-based deduplication with fuzzy matching
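
To show what fuzzy matching means in practice, the sketch below groups near-identical strings using difflib from the standard library. It is a much simpler stand-in for the dedupe library, which learns its matching rules from labeled examples; the 0.85 similarity threshold and the function names are assumptions.

```python
from difflib import SequenceMatcher


def similar(a: str, b: str, threshold: float = 0.85) -> bool:
    """Case-insensitive similarity check based on SequenceMatcher.ratio()."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def fuzzy_dedupe(records, threshold=0.85):
    """Keep the first record of each group of near-identical strings."""
    kept = []
    for record in records:
        if not any(similar(record, seen, threshold) for seen in kept):
            kept.append(record)
    return kept


names = ["Acme Corporation", "ACME Corporation ", "Acme Corporatoin", "Globex Inc"]
unique = fuzzy_dedupe(names)
print(unique)
```

This pairwise comparison is quadratic in the number of records, so it only suits small inputs.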

📊 Examples

Example 1: AI-Powered Automatic Cleaning

# Clean messy data automatically
csv-cleaner ai-clean messy_data.csv cleaned_data.csv --auto-confirm

Output:

🤖 AI-Powered Data Cleaning
===========================

📊 Data Analysis Complete
- Rows: 10,000 | Columns: 15
- Missing values: 1,250 (0.8% of cells)
- Duplicates: 150 (1.5%)
- Data quality score: 78%

🎯 AI Suggestions Generated (5 suggestions)
1. Remove duplicates (confidence: 95%)
2. Fill missing values with median (confidence: 88%)
3. Clean column names (confidence: 92%)
4. Convert date columns (confidence: 85%)
5. Handle outliers in 'price' column (confidence: 76%)

📋 Execution Plan
================
1. clean_names (pyjanitor) - Clean column names
2. remove_duplicates (pandas) - Remove 150 duplicate rows
3. fill_missing (pandas) - Fill 1,250 missing values
4. fix_dates (pandas) - Convert date columns
5. outlier_detection (feature-engine) - Handle price outliers

🚀 Executing AI suggestions...
████████████████████████████████████████ 100%

✅ Successfully executed 5 operations
📊 Results: 10,000 rows → 9,850 rows (150 duplicates removed)
💾 Saved to: cleaned_data.csv

Example 2: AI-Powered Suggestions

csv-cleaner ai-suggest data.csv

Output:

🤖 AI-Powered Cleaning Suggestions
==================================

📊 Data Analysis
- Dataset: 5,000 rows × 12 columns
- Quality issues detected: Missing values, inconsistent dates, duplicates

🎯 Recommended Operations:

1. **Remove Duplicates** (Confidence: 94%)
   - Library: pandas
   - Impact: Remove ~50 duplicate rows
   - Reasoning: Found exact duplicates in customer data

2. **Fill Missing Values** (Confidence: 89%)
   - Library: pandas
   - Strategy: Forward fill for dates, median for numeric
   - Impact: Fill 200 missing values

3. **Fix Date Columns** (Confidence: 87%)
   - Library: pandas
   - Columns: 'order_date', 'ship_date'
   - Impact: Standardize date formats

4. **Clean Column Names** (Confidence: 92%)
   - Library: pyjanitor
   - Impact: Standardize naming convention

5. **Handle Outliers** (Confidence: 76%)
   - Library: feature-engine
   - Column: 'amount'
   - Impact: Cap extreme values

🔧 Configuration

Performance Settings

# Set memory limit
csv-cleaner config set performance.memory_limit 4.0

# Enable parallel processing
csv-cleaner config set performance.parallel_processing true

# Set chunk size
csv-cleaner config set performance.chunk_size 5000

AI Settings

# Set default AI provider
csv-cleaner config set ai.default_provider openai

# Set suggestion confidence threshold
csv-cleaner config set ai.confidence_threshold 0.7

# Enable learning mode
csv-cleaner config set ai.learning_enabled true
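
A confidence threshold such as 0.7 can be read as a filter over the generated suggestions. The sketch below illustrates that reading only: the Suggestion class is hypothetical, the tool's real suggestion format is not documented here, and treating the threshold as inclusive is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Suggestion:
    """Hypothetical suggestion record, for illustration only."""
    operation: str
    confidence: float  # expected range: 0.0 to 1.0


def filter_by_confidence(suggestions, threshold=0.7):
    """Keep suggestions at or above the threshold, highest confidence first."""
    kept = [s for s in suggestions if s.confidence >= threshold]
    return sorted(kept, key=lambda s: s.confidence, reverse=True)


suggestions = [
    Suggestion("fill_missing", 0.88),
    Suggestion("outlier_detection", 0.65),
    Suggestion("remove_duplicates", 0.95),
]
selected = filter_by_confidence(suggestions, threshold=0.7)
print([s.operation for s in selected])
```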

📈 Performance Features

  • Parallel Processing: Multi-core data processing
  • Memory Management: Efficient memory usage for large datasets
  • Chunked Processing: Process large files in chunks
  • Progress Tracking: Real-time progress monitoring
  • Performance Monitoring: Track processing times and resource usage
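
Chunked processing means reading a fixed number of rows at a time so that the whole file never has to fit in memory. The sketch below shows the general pattern with the standard csv module; it is not the tool's implementation, and iter_chunks is a hypothetical helper.

```python
import csv
import io
from itertools import islice


def iter_chunks(reader, chunk_size):
    """Yield lists of up to chunk_size rows from any row iterator."""
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


# In-memory stand-in for a large file: 10 data rows plus a header.
source = io.StringIO("id,value\n" + "".join(f"{i},{i * 2}\n" for i in range(10)))
reader = csv.DictReader(source)

chunk_sizes = []
total = 0
for chunk in iter_chunks(reader, chunk_size=4):
    chunk_sizes.append(len(chunk))
    total += sum(int(row["value"]) for row in chunk)

print(chunk_sizes, total)
```

Operations that compare rows with each other, such as duplicate removal, need state carried across chunks (for example a set of already seen keys), since a duplicate pair can be split between two chunks.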

🧪 Testing

# Run all tests
pytest

# Run with coverage
pytest --cov=csv_cleaner

# Run specific test categories
pytest tests/unit/
pytest tests/integration/

🚀 Deployment

PyPI Deployment

The project includes automated deployment scripts for PyPI:

# Set up basic version
python scripts/setup-pypi.py

# Deploy to TestPyPI
python scripts/deploy-pypi.py --test

# Deploy to production PyPI
python scripts/deploy-pypi.py --version 1.0.0

Deployment Features

  • ✅ Automated testing and validation
  • ✅ Safety checks and prerequisites verification
  • ✅ Package building and quality checks
  • ✅ Version management and tagging
  • ✅ Release notes generation

For detailed deployment instructions, see scripts/deployment-guide.md.

📚 Documentation

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests
  5. Submit a pull request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🆘 Support

Made with ❤️ for data scientists and analysts
