AI-powered data cleaning assistant with multiple interfaces

ScrubPy

Introduction

ScrubPy is a comprehensive Python library for intelligent data cleaning and preprocessing. It provides multiple interfaces including a web application, CLI tools, and AI-powered chat assistance to help data scientists, analysts, and researchers transform messy datasets into clean, analysis-ready formats. The library combines automated quality analysis with intelligent suggestions to streamline the data preparation workflow.

Key Features

  • Multi-Interface Support: Web GUI (Streamlit), Command Line Interface (CLI), and Interactive Chat Assistant
  • AI-Powered Analysis: Integration with Large Language Models for intelligent data cleaning recommendations
  • Comprehensive Quality Assessment: Automated detection of missing values, duplicates, outliers, and data type inconsistencies
  • Smart Cleaning Operations: Automated and guided data cleaning with preview capabilities
  • Professional Reporting: Generate detailed PDF reports and export cleaned datasets

Architecture Overview

ScrubPy follows a modular architecture where users can interact through multiple interfaces (Web, CLI, Chat) that all utilize the same core data processing engine. The workflow starts with data loading through the core module, followed by quality analysis using the quality analyzer, interactive cleaning operations with preview capabilities, and finally export of cleaned data with comprehensive reporting. The AI components provide intelligent suggestions throughout the process.

Installation

Install ScrubPy using pip:

pip install scrubpy

For AI features, install with additional dependencies:

pip install "scrubpy[ai]"

Module Documentation

Core Module (scrubpy.core)

The core module provides fundamental data loading and cleaning operations:

  • load_dataset(file_path): Intelligent data loading with automatic format detection for CSV, JSON, Excel, and Parquet files
  • get_dataset_summary(df): Comprehensive dataset overview including shape, column types, and basic statistics
  • remove_duplicates(df, method): Advanced duplicate detection with configurable strategies
  • fill_missing_values(df, method, columns): Multiple imputation methods including mean, median, mode, and forward/backward fill
  • detect_outliers(df, method): Statistical outlier detection using IQR, Z-score, and isolation forest methods
  • convert_data_types(df): Automatic data type optimization and conversion
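
As an illustration of the IQR method named above, the following is a minimal sketch using only the Python standard library. It is not ScrubPy's implementation: the function name iqr_outliers and the multiplier k are assumptions made for this example, and the library's detect_outliers may behave differently.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR].

    Hypothetical helper for illustration; not part of ScrubPy.
    """
    # quantiles(..., n=4) returns the three quartile cut points (Q1, Q2, Q3).
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    # Keep the original order of the flagged values.
    return [v for v in values if v < low or v > high]


readings = [10, 12, 11, 13, 12, 11, 95]
print(iqr_outliers(readings))  # prints [95]
```

The conventional multiplier of 1.5 flags values far from the middle half of the data. A larger k flags fewer points.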

Quality Analyzer Module (scrubpy.quality_analyzer)

Intelligent quality assessment system:

  • SmartDataQualityAnalyzer: Main analyzer class providing comprehensive quality scoring
  • analyze_quality(df): Complete quality analysis returning issue detection and recommendations
  • QualityIssue dataclass: Structured representation of detected data quality issues
  • Quality scoring algorithms for completeness, consistency, validity, and uniqueness metrics
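
To make the completeness and uniqueness metrics concrete, here is a small standard-library sketch. Everything in it is an assumption for illustration: the fields of this QualityIssue dataclass, the helper names, and the 50% severity threshold are hypothetical and may not match ScrubPy's actual definitions or scoring.

```python
from dataclasses import dataclass


@dataclass
class QualityIssue:
    # Hypothetical fields; ScrubPy's real dataclass may differ.
    column: str
    kind: str
    severity: str
    detail: str


def completeness(rows, columns):
    """Fraction of cells that are not None."""
    total = len(rows) * len(columns)
    if total == 0:
        return 1.0
    filled = sum(1 for row in rows for col in columns if row.get(col) is not None)
    return filled / total


def uniqueness(rows, columns):
    """Fraction of rows that are distinct across the given columns."""
    if not rows:
        return 1.0
    distinct = {tuple(row.get(col) for col in columns) for row in rows}
    return len(distinct) / len(rows)


def find_missing_value_issues(rows, columns):
    """Report one issue per column that contains missing values."""
    issues = []
    for col in columns:
        missing = sum(1 for row in rows if row.get(col) is None)
        if missing:
            # Assumed threshold: more than half missing counts as "high".
            severity = "high" if missing / len(rows) > 0.5 else "low"
            issues.append(
                QualityIssue(col, "missing_values", severity,
                             f"{missing} of {len(rows)} values missing")
            )
    return issues


rows = [
    {"id": 1, "city": "Oslo"},
    {"id": 2, "city": None},
    {"id": 1, "city": "Oslo"},
    {"id": 3, "city": "Lima"},
]
columns = ["id", "city"]
print(completeness(rows, columns))  # 0.875
print(uniqueness(rows, columns))    # 0.75
print(find_missing_value_issues(rows, columns))
```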

CLI Module (scrubpy.cli)

Interactive command-line interface:

  • Rich terminal interface with progress indicators and colored output
  • Interactive dataset selection and preview capabilities
  • Step-by-step guided cleaning workflow
  • Export options for cleaned datasets and quality reports

Web Interface (scrubpy.web)

Modern Streamlit-based web application:

  • Drag-and-drop file upload with format validation
  • Real-time data preview with pagination
  • Interactive quality dashboard with visual indicators
  • One-click cleaning operations with preview capabilities
  • Export functionality for multiple formats

Usage Examples

Basic Data Cleaning

import scrubpy

# Load your dataset
df = scrubpy.load_dataset("data.csv")

# Analyze data quality
analyzer = scrubpy.SmartDataQualityAnalyzer()
quality_report = analyzer.analyze_quality(df)

# Clean the data
clean_df = scrubpy.remove_duplicates(df)
clean_df = scrubpy.fill_missing_values(clean_df, method="mean")

# Detect outliers (kept in a separate variable so clean_df is not overwritten)
outliers = scrubpy.detect_outliers(clean_df, method="iqr")
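
For readers unsure what mean imputation does to a column, the sketch below shows the idea on plain Python dictionaries. It is an assumption-laden illustration, not ScrubPy code: fill_missing_with_mean is a hypothetical helper, and the library's own fill_missing_values may treat edge cases differently.

```python
from statistics import mean


def fill_missing_with_mean(rows, column):
    """Return new rows where None in `column` is replaced by the column mean.

    Hypothetical helper for illustration; the input rows are not modified.
    """
    observed = [row[column] for row in rows if row[column] is not None]
    if not observed:
        # Nothing to compute a mean from, so return unchanged copies.
        return [dict(row) for row in rows]
    fill = mean(observed)
    return [
        {**row, column: fill if row[column] is None else row[column]}
        for row in rows
    ]


rows = [{"age": 30}, {"age": None}, {"age": 40}]
filled = fill_missing_with_mean(rows, "age")
print(filled[1]["age"])  # prints 35
```

Mean imputation keeps the column average unchanged but shrinks its variance, so median or mode filling can be a better fit for skewed or categorical columns.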

Command Line Interface

# Launch interactive CLI
scrubpy

# Follow the interactive prompts to clean your data

Web Interface Usage

# Start the web application
scrubpy-web

# Navigate to http://localhost:8501 in your browser
# Upload your dataset and follow the interactive cleaning workflow

AI Chat Assistant

# Start chat mode with your dataset
scrubpy-chat data.csv

# Interact with the AI assistant using natural language:
# "What quality issues does my data have?"
# "Remove duplicates and handle missing values"
# "Generate a quality report"

API Reference

Core Functions

import scrubpy

# Data loading
df = scrubpy.load_dataset(file_path, **kwargs)

# Quality analysis
analyzer = scrubpy.SmartDataQualityAnalyzer()
issues = analyzer.analyze_quality(df)

# Data cleaning operations
clean_df = scrubpy.remove_duplicates(df, method='exact')
clean_df = scrubpy.fill_missing_values(df, method='mean')
outliers = scrubpy.detect_outliers(df, method='iqr')
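
The method='exact' argument above suggests matching rows whose values are all identical. A minimal sketch of that idea, assuming rows are dictionaries with hashable values, could look like this. The helper drop_exact_duplicates is hypothetical and is not ScrubPy's remove_duplicates.

```python
def drop_exact_duplicates(rows):
    """Keep the first occurrence of each row, preserving input order.

    Hypothetical helper for illustration; assumes all values are hashable.
    """
    seen = set()
    unique = []
    for row in rows:
        # Sort the items so key order inside the dict does not matter.
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


records = [
    {"id": 1, "name": "a"},
    {"name": "a", "id": 1},  # same content, different key order
    {"id": 2, "name": "b"},
]
print(drop_exact_duplicates(records))
# prints [{'id': 1, 'name': 'a'}, {'id': 2, 'name': 'b'}]
```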

System Requirements

  • Python: 3.8 or higher
  • Operating System: Windows, macOS, Linux
  • Memory: 2GB minimum (4GB recommended for large datasets)
  • Storage: 100MB for installation

Contributing

We welcome contributions to ScrubPy! Please follow these guidelines:

  1. Fork the repository and create a feature branch
  2. Write tests for new functionality
  3. Ensure code follows PEP 8 style guidelines
  4. Submit a pull request with a clear description of changes

Development Setup

git clone https://github.com/username/scrubpy.git
cd scrubpy
pip install -e ".[dev]"
pytest tests/

License

This project is licensed under the MIT License. See the LICENSE file for details.

Acknowledgments

  • Built with pandas and numpy for efficient data processing
  • Streamlit for the modern web interface
  • Typer and Rich for enhanced CLI experience
  • OpenAI for AI-powered features

What’s Next?

We plan to add smart visual exports, column intelligence, and eventually ML-powered cleaning.


Why This Exists

Sometimes you just need a quick tool to clean and inspect your data without writing boilerplate pandas code. ScrubPy helps you do that, even if you're not a data wizard.



Made with ❤️ by a student learning to make tools that help others.
