A cross-language DataFrame cleaning assistant with interactive GUI and one-click code export

These details have not been verified by PyPI

Project links

Project description

🧹 Databroom

A powerful DataFrame cleaning tool with Command Line Interface, Interactive GUI, and Programmatic API - automatically generates reproducible Python/pandas, R/tidyverse code, and CLI Commands

📑 Table of Contents

🧹 Databroom

🆚 Why Databroom?

The Problem: Manual Data Cleaning is Tedious

With pandas (manual approach):

import pandas as pd
import unicodedata

# Read the file
df = pd.read_csv("messy_data.csv")

# Remove columns with more than 90% missing values
threshold = 0.9
df = df.loc[:, df.isnull().mean() < threshold]

# Standardize column names manually
df.columns = df.columns.str.lower().str.replace(' ', '_')

# Normalize text values (removing accents)
def normalize_text(text):
    if not isinstance(text, str):
        return text
    return ''.join(
        c for c in unicodedata.normalize('NFKD', text)
        if not unicodedata.combining(c)
    )

# Apply to all string columns (need to identify them first)
string_cols = df.select_dtypes(include=['object']).columns
for col in string_cols:
    df[col] = df[col].apply(normalize_text)

# Save the result
df.to_csv("clean_data.csv", index=False)

With Databroom (one command):

databroom clean messy_data.csv \
  --clean-all \
  --output-file clean_data.csv \
  --output-code cleaning_script.py

The Benefits

Feature	Manual Pandas	Databroom
Lines of code	~20+ lines	1 command
Time to implement	10-15 minutes	10 seconds
Error prone	High (manual logic)	Low (tested operations)
Reproducible	Need to save script	Auto-generates code
Cross-language	Python only	Python + R output
GUI option	No	Yes (`databroom gui`)
Parameter tuning	Manual coding	CLI flags & GUI sliders

Real-world Comparison

Complex cleaning task:

# Pandas approach: ~50 lines of code
import pandas as pd
import unicodedata
import numpy as np

df = pd.read_excel("survey_data.xlsx")

# Remove empty columns
empty_threshold = 0.8
df = df.dropna(axis=1, thresh=int(empty_threshold * len(df)))

# Remove empty rows  
df = df.dropna(how='all')

# Fix column names
df.columns = df.columns.str.lower()
df.columns = df.columns.str.replace(' ', '_')
df.columns = df.columns.str.replace('[^a-z0-9_]', '', regex=True)

# Normalize text in all string columns
def clean_text(text):
    if pd.isna(text) or not isinstance(text, str):
        return text
    # Remove accents
    text = unicodedata.normalize('NFKD', text)
    text = ''.join(c for c in text if not unicodedata.combining(c))
    return text

string_columns = df.select_dtypes(include=['object']).columns
for col in string_columns:
    df[col] = df[col].apply(clean_text)

df.to_csv("cleaned_survey.csv", index=False)

Databroom approach:

databroom clean survey_data.xlsx \
  --clean-all \
  --empty-threshold 0.8 \
  --output-file cleaned_survey.csv \
  --output-code survey_cleaning.py \
  --verbose

Result: Same output, 1 command, includes reproducible script generation.

When to Use Databroom

✅ Perfect for:

🤖 Full automation - Transform your entire data cleaning pipeline into a single command
Quick data exploration and cleaning
Batch processing multiple files
Learning data cleaning best practices
Generating reproducible cleaning scripts
Teams needing consistent data preprocessing
Converting workflows between Python and R

🚀 Quick Start

Installation

# Complete installation - CLI + GUI + API (recommended)
pip install databroom

# CLI + API only (no Streamlit GUI)
pip install databroom[cli]

# GUI + API only (no CLI interface)  
pip install databroom[gui]

Command Line Interface (Primary Interface)

Clean your data files instantly with powerful CLI commands:

# Smart clean everything (recommended)
databroom clean data.csv --clean-all --output-file clean.csv

# Column cleaning with custom threshold
databroom clean messy.xlsx --clean-columns --empty-threshold 0.8 --output-file cleaned.xlsx

# Complete cleaning pipeline with code generation
databroom clean survey.csv --clean-all --output-code cleaning_script.py --lang python

# Generate R/tidyverse code
databroom clean data.csv --clean-rows --output-code analysis.R --lang r

# Advanced options with verbose output
databroom clean dataset.json --clean-all --no-snakecase --verbose --info

# Launch interactive GUI
databroom gui

# List all available operations
databroom list

Interactive GUI

Launch the web-based interface for visual data cleaning:

databroom gui
# Opens http://localhost:8501 in your browser

GUI Screenshots

[GUI screenshots will be added here to showcase the interactive interface, file upload, operation panels, live preview, and code generation features]

Programmatic API

Use Databroom directly in your Python scripts:

from databroom.core.broom import Broom

# Load and clean data with method chaining
broom = Broom.from_csv('data.csv')
result = broom.clean_all()  # Smart clean everything

# Or use specific operations
result = (broom
    .clean_columns(empty_threshold=0.9)
    .clean_rows())

# Get cleaned DataFrame
cleaned_df = result.get_df()
print(f"Cleaned {cleaned_df.shape[0]} rows × {cleaned_df.shape[1]} columns")

# Generate reproducible code
from databroom.generators.base import CodeGenerator
generator = CodeGenerator('python')
generator.load_history(result.get_history())
generator.export_code('my_cleaning_pipeline.py')

✨ Features

🖥️ Command Line Interface

Instant cleaning with intuitive flags and parameters
Batch processing capabilities for multiple files
Code generation in Python/pandas and R/tidyverse
Flexible output formats (CSV, Excel, JSON)
Rich help system with examples and colored output
Verbose mode for detailed operation feedback

🎨 Interactive GUI

Drag & drop file upload (CSV, Excel, JSON)
Live preview of cleaning operations
Interactive parameter tuning with sliders and inputs
Real-time code generation with syntax highlighting
One-click download of cleaned data and generated scripts
Operation history with undo functionality

⚙️ Programmatic API

Chainable methods for fluent data cleaning workflows
Factory methods for easy file loading (from_csv(), from_excel(), etc.)
History tracking for reproducible operations
Template-based code generation with Jinja2

🔄 Code Generation

Complete scripts with imports, file loading, and execution
Cross-language support (Python/pandas ↔ R/tidyverse)
Template system for customizable output formats
Reproducible workflows that can be shared and version controlled

🧰 Available Cleaning Operations

Operation	CLI Flag	Purpose
🧹 Clean All	`--clean-all`	Smart clean everything: columns + rows with all operations
📋 Clean Columns	`--clean-columns`	Clean column names: snake_case + remove accents + remove empty
📊 Clean Rows	`--clean-rows`	Clean row data: snake_case + remove accents + remove empty
~~Remove Empty Columns~~	~~`--remove-empty-cols`~~	~~Legacy: Use `--clean-columns` instead~~
~~Remove Empty Rows~~	~~`--remove-empty-rows`~~	~~Legacy: Use `--clean-rows` instead~~
~~Standardize Column Names~~	~~`--standardize-column-names`~~	~~Legacy: Use `--clean-columns` instead~~
~~Normalize Column Names~~	~~`--normalize-column-names`~~	~~Legacy: Use `--clean-columns` instead~~
~~Normalize Values~~	~~`--normalize-values`~~	~~Legacy: Use `--clean-rows` instead~~
~~Standardize Values~~	~~`--standardize-values`~~	~~Legacy: Use `--clean-rows` instead~~

CLI Parameters

# Smart Operations (recommended)
--clean-all                          # Clean everything: columns + rows
--clean-columns                      # Clean column names only
--clean-rows                         # Clean row data only

# Advanced Options (disable specific operations)
--no-snakecase                       # Keep original text case in rows
--no-snakecase-cols                  # Keep original column name case
--no-remove-accents-vals             # Keep accents in text values
--no-remove-empty-cols               # Keep empty columns

# Parameters
--empty-threshold 0.8                # Custom missing value threshold (default: 0.9)

# Output options
--output-file cleaned.csv            # Save cleaned data
--output-code script.py              # Generate code file
--lang python                        # Code language (python/r)

# Behavior options
--verbose                            # Detailed output
--quiet                              # Minimal output  
--info                               # Show DataFrame info

📊 Example Workflows

Data Science Pipeline

# Clean survey data and generate analysis script
databroom clean survey_data.xlsx \
  --clean-all \
  --empty-threshold 0.7 \
  --output-file clean_survey.csv \
  --output-code survey_analysis.py \
  --verbose

R/Tidyverse Workflow

# Generate R script for tidyverse users
databroom clean research_data.csv \
  --clean-all \
  --output-code tidyverse_pipeline.R \
  --lang r

Batch Processing Setup

# Process multiple files with consistent operations
for file in data/*.csv; do
  databroom clean "$file" \
    --clean-columns \
    --output-file "clean_$(basename "$file")" \
    --quiet
done

🏗️ Architecture

Databroom follows a modular architecture designed for extensibility and maintainability:

databroom/
├── cli/                 # Command line interface (Typer + Rich)
│   ├── main.py          # Entry point and app configuration
│   ├── commands.py      # CLI commands (clean, gui, list)
│   ├── operations.py    # Operation parsing and execution
│   └── utils.py         # File handling and code generation
├── core/                # Core cleaning engine
│   ├── janitor.py       # Main API with method chaining
│   ├── pipeline.py      # Operation coordination and state management  
│   ├── cleaning_ops.py  # Individual cleaning operations
│   └── history_tracker.py # Automatic operation tracking
├── generators/          # Code generation system
│   ├── base.py          # Template-based code generator
│   └── templates/       # Jinja2 templates for Python/R
├── gui/                 # Streamlit web interface
│   └── app.py           # Interactive GUI application
└── tests/               # Comprehensive test suite

🛠️ Development

Local Development

# Clone repository
git clone https://github.com/onlozanoo/databroom.git
cd databroom

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install in development mode
pip install -e ".[dev,cli,all]"

# Run tests
pytest

# Run CLI locally
python -m databroom.cli.main --help

Testing

# Run full test suite
pytest

# Run with coverage
pytest --cov=databroom

# Run specific test categories
pytest -m "not slow"           # Skip slow tests
pytest tests/cli/              # Test CLI only
pytest tests/core/             # Test core functionality

Code Quality

# Format code
black databroom/
isort databroom/

# Lint
flake8 databroom/

# Type check
mypy databroom/

📈 Project Status

Current Version: v0.3.0 - Production Ready

✅ Fully Implemented

Complete CLI with all cleaning operations
Interactive Streamlit GUI with live preview
Programmatic API with method chaining
Python and R code generation
Comprehensive test suite
PyPI package distribution ready
- Dinamic new operations load

🚧 In Active Development

Extended cleaning operations library
Advanced parameter validation
Performance optimizations
Enhanced error handling

📋 Planned Features

Preview in CLI
Presets
Batch transform
Save/load cleaning pipelines
Custom cleaning operation plugins
Integration with popular data tools
Advanced reporting capabilities

🤝 Contributing

I welcome contributions! Here's how you can help:

Ways to Contribute

🐛 Bug Reports: Submit issues with detailed reproduction steps
💡 Feature Requests: Propose new cleaning operations or CLI features
📝 Documentation: Improve examples, tutorials, or API docs
🧪 Testing: Add test cases or improve test coverage
💻 Code: Implement new features or fix existing issues

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

Built with ❤️ for the data science community

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

0.4

Sep 29, 2025

0.3.1

Jul 30, 2025

This version

0.3.0

Jul 30, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

databroom-0.3.0.tar.gz (15.6 kB view details)

Uploaded Jul 30, 2025 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

databroom-0.3.0-py3-none-any.whl (10.1 kB view details)

Uploaded Jul 30, 2025 Python 3

File details

Details for the file databroom-0.3.0.tar.gz.

File metadata

Download URL: databroom-0.3.0.tar.gz
Upload date: Jul 30, 2025
Size: 15.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.11.1

File hashes

Hashes for databroom-0.3.0.tar.gz
Algorithm	Hash digest
SHA256	`70e8559c269c88f304fbf7125eda346714139551a24bcd02af45193ccb8982e4`
MD5	`80dcb9d7c44151a7f1760b813776b502`
BLAKE2b-256	`c2e94c376fdda49f4570ee7f853d605432a150781a02eb54429084b74205e5a3`

See more details on using hashes here.

File details

Details for the file databroom-0.3.0-py3-none-any.whl.

File metadata

Download URL: databroom-0.3.0-py3-none-any.whl
Upload date: Jul 30, 2025
Size: 10.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.11.1

File hashes

Hashes for databroom-0.3.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`e6592424b2da244650c1a63eb98666d7279dc386402b1cccb2ef84d2f88cfba6`
MD5	`640505d5e332f0c69870bfbfad29da62`
BLAKE2b-256	`98617b7a81cfd4d2b41a19184a3da456830c50c03f2db9fd074d0093043bb6a7`

See more details on using hashes here.

databroom 0.3.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

🧹 Databroom

📑 Table of Contents

🆚 Why Databroom?

The Problem: Manual Data Cleaning is Tedious

The Benefits

Real-world Comparison

When to Use Databroom

🚀 Quick Start

Installation

Command Line Interface (Primary Interface)

Interactive GUI

GUI Screenshots

Programmatic API

✨ Features

🖥️ Command Line Interface

🎨 Interactive GUI

⚙️ Programmatic API

🔄 Code Generation

🧰 Available Cleaning Operations

CLI Parameters

📊 Example Workflows

Data Science Pipeline

R/Tidyverse Workflow

Batch Processing Setup

🏗️ Architecture

🛠️ Development

Local Development

Testing

Code Quality

📈 Project Status

🤝 Contributing

Ways to Contribute

📄 License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes