AI-powered data cleaning assistant with multiple interfaces
ScrubPy
Introduction
ScrubPy is a comprehensive Python library for intelligent data cleaning and preprocessing. It provides multiple interfaces including a web application, CLI tools, and AI-powered chat assistance to help data scientists, analysts, and researchers transform messy datasets into clean, analysis-ready formats. The library combines automated quality analysis with intelligent suggestions to streamline the data preparation workflow.
Key Features
- Multi-Interface Support: Web GUI (Streamlit), Command Line Interface (CLI), and Interactive Chat Assistant
- AI-Powered Analysis: Integration with Large Language Models for intelligent data cleaning recommendations
- Comprehensive Quality Assessment: Automated detection of missing values, duplicates, outliers, and data type inconsistencies
- Smart Cleaning Operations: Automated and guided data cleaning with preview capabilities
- Professional Reporting: Generate detailed PDF reports and export cleaned datasets
Architecture Overview
ScrubPy follows a modular architecture where users can interact through multiple interfaces (Web, CLI, Chat) that all utilize the same core data processing engine. The workflow starts with data loading through the core module, followed by quality analysis using the quality analyzer, interactive cleaning operations with preview capabilities, and finally export of cleaned data with comprehensive reporting. The AI components provide intelligent suggestions throughout the process.
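The shared-engine idea described above can be sketched in a few lines of plain Python. This is only an illustration of the pattern, not ScrubPy's code: `clean_rows`, `cli_frontend`, and `web_frontend` are hypothetical names, and rows are modeled as simple dictionaries rather than DataFrames.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]


def clean_rows(rows: List[Row]) -> List[Row]:
    """Stand-in for a shared core engine: drop exact duplicate rows, keep order."""
    seen = set()
    cleaned = []
    for row in rows:
        key = tuple(sorted(row.items()))  # hashable fingerprint of the row
        if key not in seen:
            seen.add(key)
            cleaned.append(row)
    return cleaned


def cli_frontend(rows: List[Row]) -> str:
    """Hypothetical CLI layer: formats the core result as a line of text."""
    cleaned = clean_rows(rows)
    return f"{len(rows)} rows in, {len(cleaned)} rows out"


def web_frontend(rows: List[Row]) -> Dict[str, int]:
    """Hypothetical web layer: returns the same result as a JSON-style payload."""
    cleaned = clean_rows(rows)
    return {"rows_in": len(rows), "rows_out": len(cleaned)}


sample = [{"id": "1"}, {"id": "1"}, {"id": "2"}]
print(cli_frontend(sample))
print(web_frontend(sample))
```

Both front ends call the same function, so a fix in the core logic reaches every interface without duplicated code.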
Installation
Install ScrubPy using pip:
pip install scrubpy
For AI features, install with additional dependencies:
pip install "scrubpy[ai]"
Module Documentation
Core Module (scrubpy.core)
The core module provides fundamental data loading and cleaning operations:
- load_dataset(file_path): Intelligent data loading with automatic format detection for CSV, JSON, Excel, and Parquet files
- get_dataset_summary(df): Comprehensive dataset overview including shape, column types, and basic statistics
- remove_duplicates(df, method): Advanced duplicate detection with configurable strategies
- fill_missing_values(df, method, columns): Multiple imputation methods including mean, median, mode, and forward/backward fill
- detect_outliers(df, method): Statistical outlier detection using IQR, Z-score, and isolation forest methods
- convert_data_types(df): Automatic data type optimization and conversion
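To make the IQR method concrete, here is a minimal standard-library sketch of the usual rule: a value is flagged when it lies more than 1.5 interquartile ranges outside the first or third quartile. The function name `iqr_outlier_mask` and the parameter `k` are assumptions made for illustration; this is not how `detect_outliers` is implemented, and its return type may differ.

```python
from statistics import quantiles
from typing import List, Optional, Sequence


def iqr_outlier_mask(values: Sequence[Optional[float]], k: float = 1.5) -> List[bool]:
    """Return True for each value outside [Q1 - k*IQR, Q3 + k*IQR].

    Missing values (None) are ignored when computing quartiles and are
    never flagged as outliers.
    """
    present = [v for v in values if v is not None]
    if len(present) < 2:
        # quantiles() needs at least two data points
        return [False] * len(values)
    # "inclusive" interpolates linearly between the observed min and max
    q1, _, q3 = quantiles(present, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v is not None and (v < lower or v > upper) for v in values]


readings = [10, 12, 11, 13, 12, 11, 95]
print(iqr_outlier_mask(readings))
```

For these readings only the last value, 95, falls outside the fences. A larger `k` makes the rule more permissive.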
Quality Analyzer Module (scrubpy.quality_analyzer)
Intelligent quality assessment system:
- SmartDataQualityAnalyzer: Main analyzer class providing comprehensive quality scoring
- analyze_quality(df): Complete quality analysis returning issue detection and recommendations
- QualityIssue dataclass: Structured representation of detected data quality issues
- Quality scoring algorithms for completeness, consistency, validity, and uniqueness metrics
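As a rough illustration of two of these metrics, the sketch below computes completeness (share of non-missing cells) and uniqueness (share of distinct rows) on plain dictionaries. The `QualityIssue` fields shown here, and the helper names `completeness`, `uniqueness`, and `find_missing`, are assumptions made for the example; the library's actual dataclass and scoring formulas may differ.

```python
from dataclasses import dataclass
from typing import Any, Dict, List

Row = Dict[str, Any]


@dataclass
class QualityIssue:
    # Hypothetical fields, chosen only for this sketch
    kind: str
    column: str
    count: int


def completeness(rows: List[Row]) -> float:
    """Fraction of cells that are not None (1.0 for an empty dataset)."""
    total = sum(len(r) for r in rows)
    if total == 0:
        return 1.0
    filled = sum(1 for r in rows for v in r.values() if v is not None)
    return filled / total


def uniqueness(rows: List[Row]) -> float:
    """Fraction of rows that are distinct (1.0 for an empty dataset)."""
    if not rows:
        return 1.0
    distinct = {tuple(sorted(r.items())) for r in rows}
    return len(distinct) / len(rows)


def find_missing(rows: List[Row]) -> List[QualityIssue]:
    """Report one issue per column that contains missing values."""
    counts: Dict[str, int] = {}
    for r in rows:
        for col, v in r.items():
            if v is None:
                counts[col] = counts.get(col, 0) + 1
    return [QualityIssue("missing_values", col, n) for col, n in counts.items()]


data = [{"a": 1, "b": None}, {"a": 1, "b": None}, {"a": 2, "b": "x"}]
print(completeness(data), uniqueness(data), find_missing(data))
```

Keeping each metric as a separate function makes it easy to combine them into a weighted overall score later.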
CLI Module (scrubpy.cli)
Interactive command-line interface:
- Rich terminal interface with progress indicators and colored output
- Interactive dataset selection and preview capabilities
- Step-by-step guided cleaning workflow
- Export options for cleaned datasets and quality reports
Web Interface (scrubpy.web)
Modern Streamlit-based web application:
- Drag-and-drop file upload with format validation
- Real-time data preview with pagination
- Interactive quality dashboard with visual indicators
- One-click cleaning operations with preview capabilities
- Export functionality for multiple formats
Usage Examples
Basic Data Cleaning
import scrubpy
# Load your dataset
df = scrubpy.load_dataset("data.csv")
# Analyze data quality
analyzer = scrubpy.SmartDataQualityAnalyzer()
quality_report = analyzer.analyze_quality(df)
# Clean the data
clean_df = scrubpy.remove_duplicates(df)
clean_df = scrubpy.fill_missing_values(clean_df, method="mean")
# Detect outliers in the cleaned data
outliers = scrubpy.detect_outliers(clean_df, method="iqr")
Command Line Interface
# Launch interactive CLI
scrubpy
# Follow the interactive prompts to clean your data
Web Interface Usage
# Start the web application
scrubpy-web
# Navigate to http://localhost:8501 in your browser
# Upload your dataset and follow the interactive cleaning workflow
AI Chat Assistant
# Start chat mode with your dataset
scrubpy-chat data.csv
# Interact with the AI assistant using natural language:
# "What quality issues does my data have?"
# "Remove duplicates and handle missing values"
# "Generate a quality report"
API Reference
Core Functions
import scrubpy
# Data loading
df = scrubpy.load_dataset(file_path, **kwargs)
# Quality analysis
analyzer = scrubpy.SmartDataQualityAnalyzer()
issues = analyzer.analyze_quality(df)
# Data cleaning operations
clean_df = scrubpy.remove_duplicates(df, method='exact')
clean_df = scrubpy.fill_missing_values(df, method='mean')
outliers = scrubpy.detect_outliers(df, method='iqr')
System Requirements
- Python: 3.8 or higher
- Operating System: Windows, macOS, Linux
- Memory: 2GB minimum (4GB recommended for large datasets)
- Storage: 100MB for installation
Contributing
We welcome contributions to ScrubPy! Please follow these guidelines:
- Fork the repository and create a feature branch
- Write tests for new functionality
- Ensure code follows PEP 8 style guidelines
- Submit a pull request with a clear description of changes
Development Setup
git clone https://github.com/username/scrubpy.git
cd scrubpy
pip install -e ".[dev]"
pytest tests/
License
This project is licensed under the MIT License. See the LICENSE file for details.
Acknowledgments
- Built with pandas and numpy for efficient data processing
- Streamlit for the modern web interface
- Typer and Rich for enhanced CLI experience
- OpenAI for AI-powered features
What’s Next?
We plan to add smart visual exports, column intelligence, and eventually ML-powered cleaning.
Why This Exists
Sometimes you just need a quick tool to clean and inspect your data without writing boilerplate pandas code. ScrubPy helps you do that, even if you're not a data wizard.
Made with ❤️ by a student learning to make tools that help others.