A cross-language DataFrame cleaning assistant with interactive GUI and one-click code export
Databroom
A DataFrame cleaning tool with a command line interface, an interactive GUI, and a programmatic API. It automatically generates reproducible Python/pandas code, R/tidyverse code, and CLI commands.
Why Databroom?
The Problem: Manual Data Cleaning is Tedious
With pandas (manual approach):
import pandas as pd
import unicodedata
# Read the file
df = pd.read_csv("messy_data.csv")
# Remove columns with more than 90% missing values
threshold = 0.9
df = df.loc[:, df.isnull().mean() < threshold]
# Standardize column names manually
df.columns = df.columns.str.lower().str.replace(' ', '_')
# Normalize text values (removing accents)
def normalize_text(text):
    if not isinstance(text, str):
        return text
    return ''.join(
        c for c in unicodedata.normalize('NFKD', text)
        if not unicodedata.combining(c)
    )
# Apply to all string columns (need to identify them first)
string_cols = df.select_dtypes(include=['object']).columns
for col in string_cols:
    df[col] = df[col].apply(normalize_text)
# Save the result
df.to_csv("clean_data.csv", index=False)
With Databroom (one command):
databroom clean messy_data.csv \
--clean-all \
--output-file clean_data.csv \
--output-code cleaning_script.py
The Benefits
| Feature | Manual Pandas | Databroom |
|---|---|---|
| Lines of code | ~20+ lines | 1 command |
| Time to implement | 10-15 minutes | 10 seconds |
| Error prone | High (manual logic) | Low (tested operations) |
| Reproducible | Need to save script | Auto-generates code |
| Cross-language | Python only | Python + R output |
| GUI option | No | Yes (databroom gui) |
| Parameter tuning | Manual coding | CLI flags & GUI sliders |
Real-world Comparison
Complex cleaning task:
# Pandas approach: ~50 lines of code
import pandas as pd
import unicodedata
import numpy as np
df = pd.read_excel("survey_data.xlsx")
# Remove mostly empty columns (80% or more of the values missing)
empty_threshold = 0.8
df = df.loc[:, df.isnull().mean() < empty_threshold]
# Remove empty rows
df = df.dropna(how='all')
# Fix column names
df.columns = df.columns.str.lower()
df.columns = df.columns.str.replace(' ', '_')
df.columns = df.columns.str.replace('[^a-z0-9_]', '', regex=True)
# Normalize text in all string columns
def clean_text(text):
    if pd.isna(text) or not isinstance(text, str):
        return text
    # Remove accents
    text = unicodedata.normalize('NFKD', text)
    text = ''.join(c for c in text if not unicodedata.combining(c))
    return text
string_columns = df.select_dtypes(include=['object']).columns
for col in string_columns:
    df[col] = df[col].apply(clean_text)
df.to_csv("cleaned_survey.csv", index=False)
Databroom approach:
databroom clean survey_data.xlsx \
--clean-all \
--empty-threshold 0.8 \
--output-file cleaned_survey.csv \
--output-code survey_cleaning.py \
--verbose
Result: Same output, 1 command, includes reproducible script generation.
When to Use Databroom
Perfect for:
- Full automation: transform your entire data cleaning pipeline into a single command
- Quick data exploration and cleaning
- Batch processing multiple files
- Learning data cleaning best practices
- Generating reproducible cleaning scripts
- Teams needing consistent data preprocessing
- Converting workflows between Python and R
Quick Start
Installation
# Complete installation - CLI + GUI + API (recommended)
pip install databroom
# CLI + API only (no Streamlit GUI)
pip install "databroom[cli]"
# GUI + API only (no CLI interface)
pip install "databroom[gui]"
Command Line Interface (Primary Interface)
Clean your data files instantly with powerful CLI commands:
# Smart clean everything (recommended)
databroom clean data.csv --clean-all --output-file clean.csv
# Column cleaning with custom threshold
databroom clean messy.xlsx --clean-columns --empty-threshold 0.8 --output-file cleaned.xlsx
# Complete cleaning pipeline with code generation
databroom clean survey.csv --clean-all --output-code cleaning_script.py --lang python
# Generate R/tidyverse code
databroom clean data.csv --clean-rows --output-code analysis.R --lang r
# Advanced options with verbose output
databroom clean dataset.json --clean-all --no-snakecase --verbose --info
# Launch interactive GUI
databroom gui
# List all available operations
databroom list
Interactive GUI
Launch the web-based interface for visual data cleaning:
databroom gui
# Opens http://localhost:8501 in your browser
GUI Screenshots
[GUI screenshots will be added here to showcase the interactive interface, file upload, operation panels, live preview, and code generation features]
Programmatic API
Use Databroom directly in your Python scripts:
from databroom.core.broom import Broom
# Load and clean data with method chaining
broom = Broom.from_csv('data.csv')
result = broom.clean_all() # Smart clean everything
# Or use specific operations
result = (broom
    .clean_columns(empty_threshold=0.9)
    .clean_rows())
# Get cleaned DataFrame
cleaned_df = result.get_df()
print(f"Cleaned {cleaned_df.shape[0]} rows × {cleaned_df.shape[1]} columns")
# Generate reproducible code
from databroom.generators.base import CodeGenerator
generator = CodeGenerator('python')
generator.load_history(result.get_history())
generator.export_code('my_cleaning_pipeline.py')
Features
Command Line Interface
- Instant cleaning with intuitive flags and parameters
- Batch processing capabilities for multiple files
- Code generation in Python/pandas and R/tidyverse
- Flexible output formats (CSV, Excel, JSON)
- Rich help system with examples and colored output
- Verbose mode for detailed operation feedback
Interactive GUI
- Drag & drop file upload (CSV, Excel, JSON)
- Live preview of cleaning operations
- Interactive parameter tuning with sliders and inputs
- Real-time code generation with syntax highlighting
- One-click download of cleaned data and generated scripts
- Operation history with undo functionality
Programmatic API
- Chainable methods for fluent data cleaning workflows
- Factory methods for easy file loading (from_csv(), from_excel(), etc.)
- History tracking for reproducible operations
- Template-based code generation with Jinja2
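The combination of chainable methods and history tracking can be illustrated with a small sketch. This is not Databroom's implementation: `MiniBroom`, its methods, and the history format are hypothetical names chosen for illustration, and the sketch works on a plain list of dicts rather than a DataFrame so that it has no dependencies.

```python
from typing import Any, Callable, Dict, List, Tuple

Row = Dict[str, Any]
Step = Tuple[str, Dict[str, Any]]


class MiniBroom:
    """Hypothetical chainable cleaner; a stand-in, not Databroom's Broom class."""

    def __init__(self, rows: List[Row]) -> None:
        self._rows = list(rows)
        self._history: List[Step] = []

    def _apply(self, name: str, func: Callable[[List[Row]], List[Row]],
               **params: Any) -> "MiniBroom":
        # Run the step, record its name and parameters, return self for chaining.
        self._rows = func(self._rows)
        self._history.append((name, params))
        return self

    def drop_empty_rows(self) -> "MiniBroom":
        # A row counts as empty when every value in it is None.
        return self._apply(
            "drop_empty_rows",
            lambda rows: [r for r in rows if any(v is not None for v in r.values())],
        )

    def lowercase_keys(self) -> "MiniBroom":
        return self._apply(
            "lowercase_keys",
            lambda rows: [{k.lower(): v for k, v in r.items()} for r in rows],
        )

    def get_rows(self) -> List[Row]:
        return list(self._rows)

    def get_history(self) -> List[Step]:
        return list(self._history)


if __name__ == "__main__":
    broom = MiniBroom([{"Name": "Ana", "Age": 31}, {"Name": None, "Age": None}])
    result = broom.drop_empty_rows().lowercase_keys()
    print(result.get_rows())
    print(result.get_history())
```

Because every operation goes through one `_apply` helper, the history stays in step with the data without each method having to remember to log itself.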
Code Generation
- Complete scripts with imports, file loading, and execution
- Cross-language support (Python/pandas and R/tidyverse)
- Template system for customizable output formats
- Reproducible workflows that can be shared and version controlled
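To make the idea of turning a recorded history into a complete script more concrete, here is a minimal sketch. It uses `string.Template` from the standard library instead of Jinja2, and the operation names and pandas snippets in `SNIPPETS` are assumptions made for illustration; Databroom's actual templates and history format may look different.

```python
from string import Template

# Hypothetical operation names mapped to pandas snippets (assumed, not taken
# from Databroom's templates).
SNIPPETS = {
    "drop_empty_cols": Template("df = df.loc[:, df.isnull().mean() < $threshold]"),
    "drop_empty_rows": Template('df = df.dropna(how="all")'),
    "lowercase_columns": Template("df.columns = df.columns.str.lower()"),
}


def render_script(history, input_path):
    """Turn a list of (operation, params) pairs into a pandas script string."""
    lines = ["import pandas as pd", "", f'df = pd.read_csv("{input_path}")']
    for name, params in history:
        # Each recorded step becomes one line of generated code.
        lines.append(SNIPPETS[name].substitute(params))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    history = [("drop_empty_cols", {"threshold": 0.9}), ("drop_empty_rows", {})]
    print(render_script(history, "data.csv"))
```

Supporting a second output language in this design would mean adding a second snippet table with the same keys, which is roughly what a per-language template directory provides.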
Available Cleaning Operations
| Operation | CLI Flag | Purpose |
|---|---|---|
| Clean All | --clean-all | Smart clean everything: columns + rows with all operations |
| Clean Columns | --clean-columns | Clean column names: snake_case + remove accents + remove empty |
| Clean Rows | --clean-rows | Clean row data: snake_case + remove accents + remove empty |

| CLI Flag | Use Instead |
|---|---|
| --remove-empty-cols | --clean-columns |
| --remove-empty-rows | --clean-rows |
| --standardize-column-names | --clean-columns |
| --normalize-column-names | --clean-columns |
| --normalize-values | --clean-rows |
| --standardize-values | --clean-rows |
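The snake_case and accent-removal steps named in the table can be sketched with the standard library alone. The helper names `strip_accents` and `to_snake_case` are hypothetical, and the exact rules Databroom applies (for example, how it treats punctuation or CamelCase) are not documented here, so treat this as one plausible reading rather than the tool's behavior.

```python
import re
import unicodedata


def strip_accents(text: str) -> str:
    """Decompose characters (NFKD) and drop the combining accent marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def to_snake_case(name: str) -> str:
    """Lowercase a label and join its alphanumeric parts with underscores."""
    ascii_name = strip_accents(name).strip().lower()
    # Collapse every run of non-alphanumeric characters into one underscore.
    return re.sub(r"[^a-z0-9]+", "_", ascii_name).strip("_")


if __name__ == "__main__":
    for label in ["Prénom Client", "Año de Alta", " Total (€) "]:
        print(repr(label), "->", to_snake_case(label))
```

This sketch does not split CamelCase names such as `TotalAmount`; a real implementation would need an extra rule for that case.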
CLI Parameters
# Smart Operations (recommended)
--clean-all # Clean everything: columns + rows
--clean-columns # Clean column names only
--clean-rows # Clean row data only
# Advanced Options (disable specific operations)
--no-snakecase # Keep original text case in rows
--no-snakecase-cols # Keep original column name case
--no-remove-accents-vals # Keep accents in text values
--no-remove-empty-cols # Keep empty columns
# Parameters
--empty-threshold 0.8 # Custom missing value threshold (default: 0.9)
# Output options
--output-file cleaned.csv # Save cleaned data
--output-code script.py # Generate code file
--lang python # Code language (python/r)
# Behavior options
--verbose # Detailed output
--quiet # Minimal output
--info # Show DataFrame info
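The effect of --empty-threshold is easiest to see on a tiny table. The sketch below assumes the flag follows the same rule as the manual pandas example earlier in this README, where a column is kept only when its fraction of missing values is below the threshold; whether Databroom uses a strict or an inclusive comparison is not stated, and the function names are hypothetical.

```python
def missing_fraction(values):
    """Fraction of entries that are None; an empty column counts as fully missing."""
    if not values:
        return 1.0
    return sum(v is None for v in values) / len(values)


def keep_columns(table, threshold=0.9):
    """Keep columns whose missing fraction is strictly below the threshold."""
    return {
        name: values
        for name, values in table.items()
        if missing_fraction(values) < threshold
    }


if __name__ == "__main__":
    table = {
        "id": [1, 2, 3, 4, 5],
        "notes": [None, None, None, None, "x"],  # 80% missing
        "empty": [None] * 5,                     # 100% missing
    }
    print(list(keep_columns(table)))                 # default threshold 0.9
    print(list(keep_columns(table, threshold=0.8)))  # stricter threshold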
Example Workflows
Data Science Pipeline
# Clean survey data and generate analysis script
databroom clean survey_data.xlsx \
--clean-all \
--empty-threshold 0.7 \
--output-file clean_survey.csv \
--output-code survey_analysis.py \
--verbose
R/Tidyverse Workflow
# Generate R script for tidyverse users
databroom clean research_data.csv \
--clean-all \
--output-code tidyverse_pipeline.R \
--lang r
Batch Processing Setup
# Process multiple files with consistent operations
for file in data/*.csv; do
    databroom clean "$file" \
        --clean-columns \
        --output-file "clean_$(basename "$file")" \
        --quiet
done
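The same batch loop can be driven from Python when the surrounding pipeline is already a Python script. This sketch only uses flags shown in this README (`clean`, `--clean-columns`, `--output-file`, `--quiet`). It assumes the `databroom` executable is on the PATH and that it exits with a non-zero status on failure, which is not confirmed here. Unlike the shell loop, it writes results into a separate output directory; the function names are hypothetical.

```python
import subprocess
from pathlib import Path
from typing import List


def build_clean_command(src: Path, out_dir: Path) -> List[str]:
    """Build the argument list for one file, mirroring the shell loop."""
    return [
        "databroom", "clean", str(src),
        "--clean-columns",
        "--output-file", str(out_dir / f"clean_{src.name}"),
        "--quiet",
    ]


def clean_directory(data_dir: Path, out_dir: Path) -> List[Path]:
    """Clean every CSV in data_dir and return the files whose run failed."""
    out_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for src in sorted(data_dir.glob("*.csv")):
        result = subprocess.run(build_clean_command(src, out_dir))
        # Assumption: a non-zero exit status signals a failed cleaning run.
        if result.returncode != 0:
            failed.append(src)
    return failed


if __name__ == "__main__":
    failures = clean_directory(Path("data"), Path("cleaned"))
    print(f"{len(failures)} file(s) failed")
```

Passing the command as a list avoids shell quoting problems with file names that contain spaces.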
Architecture
Databroom follows a modular architecture designed for extensibility and maintainability:
databroom/
├── cli/                    # Command line interface (Typer + Rich)
│   ├── main.py             # Entry point and app configuration
│   ├── commands.py         # CLI commands (clean, gui, list)
│   ├── operations.py       # Operation parsing and execution
│   └── utils.py            # File handling and code generation
├── core/                   # Core cleaning engine
│   ├── janitor.py          # Main API with method chaining
│   ├── pipeline.py         # Operation coordination and state management
│   ├── cleaning_ops.py     # Individual cleaning operations
│   └── history_tracker.py  # Automatic operation tracking
├── generators/             # Code generation system
│   ├── base.py             # Template-based code generator
│   └── templates/          # Jinja2 templates for Python/R
├── gui/                    # Streamlit web interface
│   └── app.py              # Interactive GUI application
└── tests/                  # Comprehensive test suite
Development
Local Development
# Clone repository
git clone https://github.com/onlozanoo/databroom.git
cd databroom
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install in development mode
pip install -e ".[dev,cli,all]"
# Run tests
pytest
# Run CLI locally
python -m databroom.cli.main --help
Testing
# Run full test suite
pytest
# Run with coverage
pytest --cov=databroom
# Run specific test categories
pytest -m "not slow" # Skip slow tests
pytest tests/cli/ # Test CLI only
pytest tests/core/ # Test core functionality
Code Quality
# Format code
black databroom/
isort databroom/
# Lint
flake8 databroom/
# Type check
mypy databroom/
Project Status
Current Version: v0.3.0 - Production Ready
Fully Implemented
- Complete CLI with all cleaning operations
- Interactive Streamlit GUI with live preview
- Programmatic API with method chaining
- Python and R code generation
- Comprehensive test suite
- PyPI package distribution ready
- Dynamic loading of new operations
In Active Development
- Extended cleaning operations library
- Advanced parameter validation
- Performance optimizations
- Enhanced error handling
Planned Features
- Preview in CLI
- Presets
- Batch transform
- Save/load cleaning pipelines
- Custom cleaning operation plugins
- Integration with popular data tools
- Advanced reporting capabilities
Contributing
I welcome contributions! Here's how you can help:
Ways to Contribute
- Bug Reports: Submit issues with detailed reproduction steps
- Feature Requests: Propose new cleaning operations or CLI features
- Documentation: Improve examples, tutorials, or API docs
- Testing: Add test cases or improve test coverage
- Code: Implement new features or fix existing issues
License
This project is licensed under the MIT License - see the LICENSE file for details.
Built with ❤️ for the data science community