# CSV Data Cleaner
A self-contained tool for cleaning CSV data with industry-standard Python libraries, offering AI-powered suggestions and automatic cleaning.
## Key Features

### AI-Powered Features

- **AI-Powered Automatic Cleaning**: Execute AI suggestions automatically with the `ai-clean` command
- **Intelligent Suggestions**: Get AI-powered cleaning recommendations with the `ai-suggest` command
- **Data Analysis**: AI-powered data analysis and insights with the `ai-analyze` command
- **Learning System**: AI learns from your feedback to improve suggestions over time
- **Multi-Provider Support**: OpenAI, Anthropic, and local LLMs
### Core Cleaning Capabilities

- **Multiple Libraries**: pandas, pyjanitor, feature-engine, dedupe, missingno
- **30+ Operations**: Remove duplicates, handle missing values, clean text, fix dates, and more
- **Performance Optimization**: Parallel processing, memory management, chunked processing
- **Data Validation**: Schema validation, data quality assessment, comprehensive reporting
- **Visualization**: Data quality heatmaps, missing data analysis, correlation matrices
## Installation

### Quick Install

```bash
pip install csv-cleaner
```

### From Source

```bash
git clone https://github.com/your-repo/csv-cleaner.git
cd csv-cleaner
pip install -e .
```
## Quick Start

### AI-Powered Automatic Cleaning (NEW!)

```bash
# Automatic cleaning with AI suggestions
csv-cleaner ai-clean input.csv output.csv

# Auto-confirm all suggestions
csv-cleaner ai-clean input.csv output.csv --auto-confirm

# Preview the execution plan without modifying files
csv-cleaner ai-clean input.csv output.csv --dry-run

# Limit the number of suggestions
csv-cleaner ai-clean input.csv output.csv --max-suggestions 10
```

### AI-Powered Suggestions

```bash
# Get AI-powered cleaning suggestions
csv-cleaner ai-suggest input.csv

# Save suggestions to a file
csv-cleaner ai-suggest input.csv --output suggestions.json
```

### AI-Powered Data Analysis

```bash
# Get comprehensive data analysis
csv-cleaner ai-analyze input.csv

# Save analysis to a file
csv-cleaner ai-analyze input.csv --output analysis.json
```

### Traditional Cleaning

```bash
# Clean with specific operations
csv-cleaner clean input.csv output.csv --operations "remove_duplicates,fill_missing"

# Interactive mode
csv-cleaner clean input.csv output.csv --interactive

# Performance-optimized run
csv-cleaner clean input.csv output.csv --parallel --chunk-size 10000
```
## AI Configuration

### Set Up AI Providers

```bash
# Configure OpenAI
csv-cleaner ai-configure set --provider openai --api-key sk-...

# Configure Anthropic
csv-cleaner ai-configure set --provider anthropic --api-key sk-ant-...

# Show the current configuration
csv-cleaner ai-configure show

# Validate the configuration
csv-cleaner ai-configure validate
```
### AI Features Overview

#### AI-Powered Automatic Cleaning (`ai-clean`)

- **Automatic Execution**: AI generates and executes cleaning suggestions
- **Execution Planning**: Shows a detailed execution plan with confidence levels
- **User Control**: Choose between automatic execution and manual confirmation
- **Dry-Run Mode**: Preview changes without modifying files
- **Learning Integration**: AI learns from execution results

#### AI-Powered Suggestions (`ai-suggest`)

- **Intelligent Analysis**: AI analyzes the data and suggests optimal cleaning operations
- **Confidence Scoring**: Each suggestion includes a confidence level and reasoning
- **Library Selection**: AI recommends the best library for each operation
- **Impact Assessment**: Estimates the impact of each suggestion

#### AI-Powered Analysis (`ai-analyze`)

- **Comprehensive Profiling**: Detailed data quality assessment
- **Pattern Recognition**: Identifies data patterns and anomalies
- **Recommendation Engine**: Suggests cleaning strategies based on the analysis
- **Exportable Reports**: Save analysis results for further review
## Available Operations

### Basic Data Cleaning (pandas)

- `remove_duplicates` - Remove duplicate rows
- `fill_missing` - Fill missing values with various strategies
- `drop_missing` - Remove rows/columns with missing values
- `clean_text` - Clean and normalize text data
- `fix_dates` - Convert and standardize date formats
- `convert_types` - Convert data types automatically
- `rename_columns` - Rename columns
- `drop_columns` - Remove unwanted columns
- `select_columns` - Select specific columns
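For readers unfamiliar with these operations, here is a rough pandas sketch of what `remove_duplicates`, `fill_missing`, and `clean_text` amount to (the frame and the fill strategy are illustrative, not the tool's internals):

```python
import pandas as pd

# Toy frame with a duplicate row, missing values, and messy text
df = pd.DataFrame({
    "name": [" Alice ", "BOB", "BOB", None],
    "score": [90.0, None, None, 75.0],
})

df = df.drop_duplicates()                                # remove_duplicates
df["score"] = df["score"].fillna(df["score"].median())   # fill_missing (median strategy)
df["name"] = df["name"].str.strip().str.title()          # clean_text: trim + normalize case
```

The tool wraps these primitives behind one CLI, choosing the strategy (mean, median, forward fill, etc.) per column.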
### Advanced Data Cleaning (PyJanitor)

- `clean_names` - Clean column names
- `remove_empty` - Remove empty rows/columns
- `fill_empty` - Fill empty values
- `handle_missing` - Advanced missing value handling
- `remove_constant_columns` - Remove columns with constant values
- `remove_columns_with_nulls` - Remove columns with null values
- `coalesce_columns` - Combine multiple columns
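As a point of reference, `clean_names` behaves roughly like this plain-pandas sketch (pyjanitor's exact normalization rules differ in detail):

```python
import pandas as pd

df = pd.DataFrame({" First Name ": [1], "Total ($)": [2], "CITY": [3]})

# Approximate clean_names: strip, lowercase, and snake_case the labels
df.columns = (
    df.columns.str.strip()
              .str.lower()
              .str.replace(r"[^\w]+", "_", regex=True)
              .str.strip("_")
)
```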
### Feature Engineering (Feature-Engine)

- `advanced_imputation` - Advanced missing value imputation
- `categorical_encoding` - Encode categorical variables
- `outlier_detection` - Detect and handle outliers
- `variable_selection` - Select relevant variables
- `data_transformation` - Apply data transformations
- `missing_indicator` - Create missing value indicators
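To illustrate the idea behind `outlier_detection`, here is a minimal IQR-fence capping sketch in plain pandas; feature-engine's transformers apply a similar rule with configurable parameters, so treat this as a sketch of the concept rather than the library's implementation:

```python
import pandas as pd

prices = pd.Series([10, 12, 11, 13, 12, 500])  # one extreme value

# Cap values outside the Tukey fences: [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = prices.quantile([0.25, 0.75])
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
capped = prices.clip(lower=lower, upper=upper)
```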
### Missing Data Analysis (MissingNo)

- `missing_matrix` - Generate a missing data matrix visualization
- `missing_bar` - Generate a missing data bar chart
- `missing_heatmap` - Generate a missing data heatmap
- `missing_dendrogram` - Generate a missing data dendrogram
- `missing_summary` - Generate a missing data summary
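The visual outputs above require missingno itself, but the numeric core of `missing_summary` is easy to reproduce in pandas:

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1, None, 3, None],
    "b": ["x", "y", None, "z"],
})

# Per-column null counts and percentages
summary = pd.DataFrame({
    "missing": df.isna().sum(),
    "percent": df.isna().mean() * 100,
})
```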
### ML-Based Deduplication (Dedupe)

- `dedupe` - ML-based deduplication with fuzzy matching
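The dedupe library trains an ML model on labeled pairs; as a toy stand-in for the core idea (pairwise similarity above a threshold), here is a standard-library sketch. The names and the threshold are illustrative, not the library's defaults:

```python
from difflib import SequenceMatcher

names = ["Acme Corp", "ACME Corporation", "Widgets Inc", "acme corp."]

def similar(a: str, b: str, threshold: float = 0.7) -> bool:
    """Case-insensitive fuzzy match on edit-based similarity."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

# Keep the first occurrence of each fuzzy-duplicate group
unique: list[str] = []
for name in names:
    if not any(similar(name, kept) for kept in unique):
        unique.append(name)
```

Real deduplication also blocks candidate pairs (to avoid O(n²) comparisons) and learns field weights, which is what dedupe's model adds.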
## Examples

### Example 1: AI-Powered Automatic Cleaning

```bash
# Clean messy data automatically
csv-cleaner ai-clean messy_data.csv cleaned_data.csv --auto-confirm
```

Output:

```text
AI-Powered Data Cleaning
========================

Data Analysis Complete
- Rows: 10,000 | Columns: 15
- Missing values: 1,250 (8.3%)
- Duplicates: 150 (1.5%)
- Data quality score: 78%

AI Suggestions Generated (5 suggestions)
1. Remove duplicates (confidence: 95%)
2. Fill missing values with median (confidence: 88%)
3. Clean column names (confidence: 92%)
4. Convert date columns (confidence: 85%)
5. Handle outliers in 'price' column (confidence: 76%)

Execution Plan
==============
1. clean_names (pandas) - Clean column names
2. remove_duplicates (pandas) - Remove 150 duplicate rows
3. fill_missing (pandas) - Fill 1,250 missing values
4. fix_dates (pandas) - Convert date columns
5. handle_outliers (feature-engine) - Handle price outliers

Executing AI suggestions...
[========================================] 100%
Successfully executed 5 operations

Results: 10,000 rows -> 9,850 rows (150 duplicates removed)
Saved to: cleaned_data.csv
```
### Example 2: AI-Powered Suggestions

```bash
csv-cleaner ai-suggest data.csv
```

Output:

```text
AI-Powered Cleaning Suggestions
===============================

Data Analysis
- Dataset: 5,000 rows x 12 columns
- Quality issues detected: missing values, inconsistent dates, duplicates

Recommended Operations:

1. Remove Duplicates (Confidence: 94%)
   - Library: pandas
   - Impact: Remove ~50 duplicate rows
   - Reasoning: Found exact duplicates in customer data

2. Fill Missing Values (Confidence: 89%)
   - Library: pandas
   - Strategy: Forward fill for dates, median for numeric
   - Impact: Fill 200 missing values

3. Fix Date Columns (Confidence: 87%)
   - Library: pandas
   - Columns: 'order_date', 'ship_date'
   - Impact: Standardize date formats

4. Clean Column Names (Confidence: 92%)
   - Library: pyjanitor
   - Impact: Standardize naming convention

5. Handle Outliers (Confidence: 76%)
   - Library: feature-engine
   - Column: 'amount'
   - Impact: Cap extreme values
```
## Configuration

### Performance Settings

```bash
# Set the memory limit
csv-cleaner config set performance.memory_limit 4.0

# Enable parallel processing
csv-cleaner config set performance.parallel_processing true

# Set the chunk size
csv-cleaner config set performance.chunk_size 5000
```

### AI Settings

```bash
# Set the default AI provider
csv-cleaner config set ai.default_provider openai

# Set the suggestion confidence threshold
csv-cleaner config set ai.confidence_threshold 0.7

# Enable learning mode
csv-cleaner config set ai.learning_enabled true
```
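To illustrate what `ai.confidence_threshold` controls, here is a hypothetical sketch of threshold filtering; the suggestion records shown are illustrative, not the tool's internal schema:

```python
# Each AI suggestion carries a confidence score; only suggestions at or
# above the configured threshold are surfaced to the user
suggestions = [
    {"operation": "remove_duplicates", "confidence": 0.95},
    {"operation": "fill_missing", "confidence": 0.88},
    {"operation": "handle_outliers", "confidence": 0.62},
]

threshold = 0.7  # the ai.confidence_threshold value set above
accepted = [s for s in suggestions if s["confidence"] >= threshold]
```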
## Performance Features

- **Parallel Processing**: Multi-core data processing
- **Memory Management**: Efficient memory usage for large datasets
- **Chunked Processing**: Process large files in chunks
- **Progress Tracking**: Real-time progress monitoring
- **Performance Monitoring**: Track processing times and resource usage
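Chunked processing follows the standard pandas pattern: read and clean one piece at a time so the whole file never has to fit in memory. A minimal sketch, using an in-memory CSV in place of a large file:

```python
import io
import pandas as pd

# In-memory CSV standing in for a large file (illustrative data)
csv_text = "id,value\n" + "\n".join(f"{i},{i % 7}" for i in range(10))

# Process the file in fixed-size chunks, then stitch the results together
chunks = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    chunk = chunk.drop_duplicates()  # any per-chunk cleaning step
    chunks.append(chunk)
result = pd.concat(chunks, ignore_index=True)
```

Note that some operations (e.g. global deduplication) need a cross-chunk pass as well, which is part of what the tool's chunked mode manages.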
## Testing

```bash
# Run all tests
pytest

# Run with coverage
pytest --cov=csv_cleaner

# Run specific test categories
pytest tests/unit/
pytest tests/integration/
```
## Deployment

### PyPI Deployment

The project includes automated deployment scripts for PyPI:

```bash
# Set up the PyPI configuration
python scripts/setup-pypi.py

# Deploy to TestPyPI
python scripts/deploy-pypi.py --test

# Deploy to production PyPI
python scripts/deploy-pypi.py --version 1.0.0
```

### Deployment Features

- Automated testing and validation
- Safety checks and prerequisites verification
- Package building and quality checks
- Version management and tagging
- Release notes generation

For detailed deployment instructions, see `scripts/deployment-guide.md`.
## Documentation

See the `docs/` directory.

## Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests
5. Submit a pull request

## License

This project is licensed under the MIT License; see the LICENSE file for details.

## Support

- Documentation: `docs/`
- Issues: GitHub Issues
- Discussions: GitHub Discussions

Made with love for data scientists and analysts