
A cross-language DataFrame cleaning assistant with interactive GUI and one-click code export


🧹 Databroom

A DataFrame cleaning tool with a command-line interface, an interactive GUI, and a programmatic API. It automatically generates reproducible Python/pandas code, R/tidyverse code, and CLI commands.

PyPI version Python 3.8+ MIT License



🆚 Why Databroom?

The Problem: Manual Data Cleaning is Tedious

With pandas (manual approach):

import pandas as pd
import unicodedata

# Read the file
df = pd.read_csv("messy_data.csv")

# Remove columns with more than 90% missing values
threshold = 0.9
df = df.loc[:, df.isnull().mean() < threshold]

# Standardize column names manually
df.columns = df.columns.str.lower().str.replace(' ', '_')

# Normalize text values (removing accents)
def normalize_text(text):
    if not isinstance(text, str):
        return text
    return ''.join(
        c for c in unicodedata.normalize('NFKD', text)
        if not unicodedata.combining(c)
    )

# Apply to all string columns (need to identify them first)
string_cols = df.select_dtypes(include=['object']).columns
for col in string_cols:
    df[col] = df[col].apply(normalize_text)

# Save the result
df.to_csv("clean_data.csv", index=False)

With Databroom (one command):

databroom clean messy_data.csv \
  --clean-all \
  --output-file clean_data.csv \
  --output-code cleaning_script.py

The Benefits

Feature | Manual Pandas | Databroom
Lines of code | ~20+ lines | 1 command
Time to implement | 10-15 minutes | 10 seconds
Error prone | High (manual logic) | Low (tested operations)
Reproducible | Need to save script | Auto-generates code
Cross-language | Python only | Python + R output
GUI option | No | Yes (databroom gui)
Parameter tuning | Manual coding | CLI flags & GUI sliders

Real-world Comparison

Complex cleaning task:

# Pandas approach: ~30 lines of code
import pandas as pd
import unicodedata

df = pd.read_excel("survey_data.xlsx")

# Remove columns where 80% or more of the values are missing
empty_threshold = 0.8
df = df.loc[:, df.isnull().mean() < empty_threshold]

# Remove empty rows  
df = df.dropna(how='all')

# Fix column names
df.columns = df.columns.str.lower()
df.columns = df.columns.str.replace(' ', '_')
df.columns = df.columns.str.replace('[^a-z0-9_]', '', regex=True)

# Normalize text in all string columns
def clean_text(text):
    if pd.isna(text) or not isinstance(text, str):
        return text
    # Remove accents
    text = unicodedata.normalize('NFKD', text)
    text = ''.join(c for c in text if not unicodedata.combining(c))
    return text

string_columns = df.select_dtypes(include=['object']).columns
for col in string_columns:
    df[col] = df[col].apply(clean_text)

df.to_csv("cleaned_survey.csv", index=False)

Databroom approach:

databroom clean survey_data.xlsx \
  --clean-all \
  --empty-threshold 0.8 \
  --output-file cleaned_survey.csv \
  --output-code survey_cleaning.py \
  --verbose

Result: equivalent cleaning in 1 command, with a reproducible script generated alongside.

When to Use Databroom

✅ Perfect for:

  • 🤖 Full automation - Transform your entire data cleaning pipeline into a single command
  • Quick data exploration and cleaning
  • Batch processing multiple files
  • Learning data cleaning best practices
  • Generating reproducible cleaning scripts
  • Teams needing consistent data preprocessing
  • Converting workflows between Python and R

🚀 Quick Start

Installation

# Complete installation - CLI + GUI + API (recommended)
pip install databroom

# CLI + API only (no Streamlit GUI)
# Quotes stop shells such as zsh from expanding the brackets
pip install "databroom[cli]"

# GUI + API only (no CLI interface)
pip install "databroom[gui]"

Command Line Interface (Primary Interface)

Clean your data files instantly with powerful CLI commands:

# Smart clean everything (recommended)
databroom clean data.csv --clean-all --output-file clean.csv

# Column cleaning with custom threshold
databroom clean messy.xlsx --clean-columns --empty-threshold 0.8 --output-file cleaned.xlsx

# Complete cleaning pipeline with code generation
databroom clean survey.csv --clean-all --output-code cleaning_script.py --lang python

# Generate R/tidyverse code
databroom clean data.csv --clean-rows --output-code analysis.R --lang r

# Advanced options with verbose output
databroom clean dataset.json --clean-all --no-snakecase --verbose --info

# Launch interactive GUI
databroom gui

# List all available operations
databroom list

Interactive GUI

Launch the web-based interface for visual data cleaning:

databroom gui
# Opens http://localhost:8501 in your browser

GUI Screenshots

[GUI screenshots will be added here to showcase the interactive interface, file upload, operation panels, live preview, and code generation features]

Programmatic API

Use Databroom directly in your Python scripts:

from databroom.core.broom import Broom

# Load and clean data with method chaining
broom = Broom.from_csv('data.csv')
result = broom.clean_all()  # Smart clean everything

# Or use specific operations
result = (broom
    .clean_columns(empty_threshold=0.9)
    .clean_rows())

# Get cleaned DataFrame
cleaned_df = result.get_df()
print(f"Cleaned {cleaned_df.shape[0]} rows × {cleaned_df.shape[1]} columns")

# Generate reproducible code
from databroom.generators.base import CodeGenerator
generator = CodeGenerator('python')
generator.load_history(result.get_history())
generator.export_code('my_cleaning_pipeline.py')

✨ Features

🖥️ Command Line Interface

  • Instant cleaning with intuitive flags and parameters
  • Batch processing capabilities for multiple files
  • Code generation in Python/pandas and R/tidyverse
  • Flexible output formats (CSV, Excel, JSON)
  • Rich help system with examples and colored output
  • Verbose mode for detailed operation feedback

🎨 Interactive GUI

  • Drag & drop file upload (CSV, Excel, JSON)
  • Live preview of cleaning operations
  • Interactive parameter tuning with sliders and inputs
  • Real-time code generation with syntax highlighting
  • One-click download of cleaned data and generated scripts
  • Operation history with undo functionality

⚙️ Programmatic API

  • Chainable methods for fluent data cleaning workflows
  • Factory methods for easy file loading (from_csv(), from_excel(), etc.)
  • History tracking for reproducible operations
  • Template-based code generation with Jinja2

🔄 Code Generation

  • Complete scripts with imports, file loading, and execution
  • Cross-language support (Python/pandas ↔ R/tidyverse)
  • Template system for customizable output formats
  • Reproducible workflows that can be shared and version controlled
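
The idea of turning a recorded operation history into a standalone script can be sketched in a few lines. This is only an illustration under stated assumptions: the operation names, the SNIPPETS table, and render_script are hypothetical, and plain str.format stands in for the Jinja2 templates that Databroom itself uses.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical one-line pandas snippets, keyed by an illustrative operation name.
SNIPPETS: Dict[str, str] = {
    "remove_empty_rows": "df = df.dropna(how='all')",
    "remove_empty_cols": "df = df.loc[:, df.isnull().mean() < {threshold}]",
    "lowercase_columns": "df.columns = df.columns.str.lower()",
}

History = List[Tuple[str, Dict[str, Any]]]


def render_script(history: History, source: str) -> str:
    """Turn a recorded list of (operation, params) pairs into a pandas script."""
    lines = ["import pandas as pd", "", f"df = pd.read_csv({source!r})"]
    for name, params in history:
        # Fill the snippet's placeholders with the recorded parameters.
        lines.append(SNIPPETS[name].format(**params))
    return "\n".join(lines) + "\n"


# An assumed history, in the order the operations were applied.
history: History = [
    ("remove_empty_cols", {"threshold": 0.9}),
    ("remove_empty_rows", {}),
    ("lowercase_columns", {}),
]
script = render_script(history, "data.csv")
print(script)
```

Because the script is rebuilt from the history alone, replaying the same history always yields the same code, which is what makes the exported file safe to share and version control.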

🧰 Available Cleaning Operations

Operation | CLI Flag | Purpose
🧹 Clean All | --clean-all | Smart clean everything: columns + rows with all operations
📋 Clean Columns | --clean-columns | Clean column names: snake_case + remove accents + remove empty
📊 Clean Rows | --clean-rows | Clean row data: snake_case + remove accents + remove empty
Remove Empty Columns | --remove-empty-cols | Legacy: Use --clean-columns instead
Remove Empty Rows | --remove-empty-rows | Legacy: Use --clean-rows instead
Standardize Column Names | --standardize-column-names | Legacy: Use --clean-columns instead
Normalize Column Names | --normalize-column-names | Legacy: Use --clean-columns instead
Normalize Values | --normalize-values | Legacy: Use --clean-rows instead
Standardize Values | --standardize-values | Legacy: Use --clean-rows instead
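
To make "snake_case + remove accents" concrete, here is a minimal standard-library sketch of that kind of column-name cleanup. It is not Databroom's implementation: strip_accents and to_snake_case are hypothetical helpers, and the exact rules Databroom applies (for example, how it treats punctuation) may differ.

```python
import re
import unicodedata


def strip_accents(text: str) -> str:
    """Decompose characters (NFKD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def to_snake_case(name: str) -> str:
    """Lowercase a label and join its words with single underscores."""
    name = strip_accents(name).strip().lower()
    # Any run of characters outside a-z and 0-9 becomes one underscore.
    name = re.sub(r"[^a-z0-9]+", "_", name)
    return name.strip("_")


columns = ["Año de Registro", "Nombre  Completo", "E-mail"]
cleaned = [to_snake_case(c) for c in columns]
print(cleaned)
```

The same two helpers could be applied to text values in rows; the sketch only covers column names to keep it short.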

CLI Parameters

# Smart Operations (recommended)
--clean-all                          # Clean everything: columns + rows
--clean-columns                      # Clean column names only
--clean-rows                         # Clean row data only

# Advanced Options (disable specific operations)
--no-snakecase                       # Keep original text case in rows
--no-snakecase-cols                  # Keep original column name case
--no-remove-accents-vals             # Keep accents in text values
--no-remove-empty-cols               # Keep empty columns

# Parameters
--empty-threshold 0.8                # Custom missing value threshold (default: 0.9)

# Output options
--output-file cleaned.csv            # Save cleaned data
--output-code script.py              # Generate code file
--lang python                        # Code language (python/r)

# Behavior options
--verbose                            # Detailed output
--quiet                              # Minimal output  
--info                               # Show DataFrame info
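
The effect of --empty-threshold can be illustrated without pandas. The sketch below assumes the threshold is compared against each column's fraction of missing values and that a column is kept only while that fraction stays strictly below the threshold, mirroring the manual pandas snippet earlier in this README. Databroom's handling of the exact boundary is not stated here, and missing_fraction and drop_sparse_columns are hypothetical names.

```python
from typing import Any, Dict, List, Optional

Row = Dict[str, Optional[Any]]


def missing_fraction(rows: List[Row], column: str) -> float:
    """Share of rows whose value for `column` is None or absent."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get(column) is None)
    return missing / len(rows)


def drop_sparse_columns(rows: List[Row], empty_threshold: float = 0.9) -> List[Row]:
    """Keep only columns whose missing fraction is below the threshold."""
    columns = {key for row in rows for key in row}
    keep = {c for c in columns if missing_fraction(rows, c) < empty_threshold}
    return [{k: v for k, v in row.items() if k in keep} for row in rows]


rows = [
    {"id": 1, "name": "Ana", "notes": None},
    {"id": 2, "name": None, "notes": None},
    {"id": 3, "name": "Luz", "notes": None},
    {"id": 4, "name": "Eva", "notes": None},
    {"id": 5, "name": "Sol", "notes": "ok"},
]
# "notes" is missing in 4 of 5 rows (0.8): kept at the default 0.9, dropped at 0.8.
default_result = drop_sparse_columns(rows)
strict_result = drop_sparse_columns(rows, empty_threshold=0.8)
```

Lowering the threshold makes the cleanup more aggressive, since columns with fewer gaps start to qualify for removal.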

📊 Example Workflows

Data Science Pipeline

# Clean survey data and generate analysis script
databroom clean survey_data.xlsx \
  --clean-all \
  --empty-threshold 0.7 \
  --output-file clean_survey.csv \
  --output-code survey_analysis.py \
  --verbose

R/Tidyverse Workflow

# Generate R script for tidyverse users
databroom clean research_data.csv \
  --clean-all \
  --output-code tidyverse_pipeline.R \
  --lang r

Batch Processing Setup

# Process multiple files with consistent operations
for file in data/*.csv; do
  databroom clean "$file" \
    --clean-columns \
    --output-file "clean_$(basename "$file")" \
    --quiet
done
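
The same batch loop can be driven from Python. The sketch below only builds the argument lists, using the flags shown in the shell loop above; build_clean_command and plan_batch are hypothetical helper names, and nothing is executed.

```python
from pathlib import Path
from typing import List


def build_clean_command(source: Path, output_dir: Path) -> List[str]:
    """Build the argv for one `databroom clean` call (not executed here)."""
    target = output_dir / f"clean_{source.name}"
    return [
        "databroom", "clean", str(source),
        "--clean-columns",
        "--output-file", str(target),
        "--quiet",
    ]


def plan_batch(data_dir: Path, output_dir: Path) -> List[List[str]]:
    """One command per CSV file in data_dir, in sorted order."""
    return [build_clean_command(p, output_dir) for p in sorted(data_dir.glob("*.csv"))]


# Example: the command that would be planned for a single file.
command = build_clean_command(Path("data/sales.csv"), Path("out"))
print(" ".join(command))
```

Each planned list can then be passed to subprocess.run(command, check=True), which stops the batch at the first file that fails instead of silently continuing.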

🏗️ Architecture

Databroom follows a modular architecture designed for extensibility and maintainability:

databroom/
├── cli/                 # Command line interface (Typer + Rich)
│   ├── main.py          # Entry point and app configuration
│   ├── commands.py      # CLI commands (clean, gui, list)
│   ├── operations.py    # Operation parsing and execution
│   └── utils.py         # File handling and code generation
├── core/                # Core cleaning engine
│   ├── janitor.py       # Main API with method chaining
│   ├── pipeline.py      # Operation coordination and state management
│   ├── cleaning_ops.py  # Individual cleaning operations
│   └── history_tracker.py # Automatic operation tracking
├── generators/          # Code generation system
│   ├── base.py          # Template-based code generator
│   └── templates/       # Jinja2 templates for Python/R
├── gui/                 # Streamlit web interface
│   └── app.py           # Interactive GUI application
└── tests/               # Comprehensive test suite

🛠️ Development

Local Development

# Clone repository
git clone https://github.com/onlozanoo/databroom.git
cd databroom

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install in development mode
pip install -e ".[dev,cli,all]"

# Run tests
pytest

# Run CLI locally
python -m databroom.cli.main --help

Testing

# Run full test suite
pytest

# Run with coverage
pytest --cov=databroom

# Run specific test categories
pytest -m "not slow"           # Skip slow tests
pytest tests/cli/              # Test CLI only
pytest tests/core/             # Test core functionality

Code Quality

# Format code
black databroom/
isort databroom/

# Lint
flake8 databroom/

# Type check
mypy databroom/

📈 Project Status

Current Version: v0.3.0 - Production Ready

✅ Fully Implemented

  • Complete CLI with all cleaning operations
  • Interactive Streamlit GUI with live preview
  • Programmatic API with method chaining
  • Python and R code generation
  • Comprehensive test suite
  • PyPI package distribution ready
  • Dynamic loading of new operations

🚧 In Active Development

  • Extended cleaning operations library
  • Advanced parameter validation
  • Performance optimizations
  • Enhanced error handling

📋 Planned Features

  • Preview in CLI
  • Presets
  • Batch transform
  • Save/load cleaning pipelines
  • Custom cleaning operation plugins
  • Integration with popular data tools
  • Advanced reporting capabilities

🤝 Contributing

I welcome contributions! Here's how you can help:

Ways to Contribute

  • 🐛 Bug Reports: Submit issues with detailed reproduction steps
  • 💡 Feature Requests: Propose new cleaning operations or CLI features
  • 📝 Documentation: Improve examples, tutorials, or API docs
  • 🧪 Testing: Add test cases or improve test coverage
  • 💻 Code: Implement new features or fix existing issues

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

Built with ❤️ for the data science community

