A cross-language DataFrame cleaning assistant with interactive GUI and one-click code export
Databroom
Clean messy DataFrames fast with CLI, GUI, and API – generate reproducible Python and R code in seconds.
Why Databroom?
Manual pandas approach:
# 15+ lines of repetitive code
import pandas as pd
import unicodedata
df = pd.read_csv("messy_data.csv")
# Remove empty columns
df = df.loc[:, df.isnull().mean() < 0.9]
# Clean column names
df.columns = df.columns.str.lower().str.replace(' ', '_')
# Remove accents from text values
def clean_text(text):
    if pd.isna(text):
        return text
    return ''.join(c for c in unicodedata.normalize('NFKD', str(text))
                   if not unicodedata.combining(c))

for col in df.select_dtypes(include=['object']).columns:
    df[col] = df[col].apply(clean_text)
df.to_csv("clean_data.csv", index=False)
Databroom approach:
# One-liner equivalent
databroom clean messy_data.csv --clean-all --output-file clean_data.csv
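To see what the manual steps amount to on a concrete input, here is a small sketch that bundles them into one function and runs it on an in-memory table. The names `strip_accents`, `manual_clean`, and the sample data are made up for illustration; this mirrors the pandas snippet above and is not Databroom's own implementation.

```python
import unicodedata

import pandas as pd


def strip_accents(value):
    """Return the value without combining accent marks; missing values pass through."""
    if pd.isna(value):
        return value
    decomposed = unicodedata.normalize("NFKD", str(value))
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def manual_clean(df, empty_threshold=0.9):
    """Hypothetical helper mirroring the manual steps shown above."""
    # Keep only columns whose share of missing values is below the threshold
    df = df.loc[:, df.isnull().mean() < empty_threshold].copy()
    # Lowercase the column names and replace spaces with underscores
    df.columns = df.columns.str.lower().str.replace(" ", "_")
    # Strip accents from every object-typed (text) column
    for col in df.select_dtypes(include=["object"]).columns:
        df[col] = df[col].apply(strip_accents)
    return df


# Illustrative sample data (not from the project); dtype=object keeps text columns as objects
sample = pd.DataFrame(
    {"First Name": ["José", "Zoë", None], "Empty Col": [None, None, None]},
    dtype=object,
)
cleaned = manual_clean(sample)
```

In this run the all-empty column is dropped, the remaining column is renamed to `first_name`, and the accented names lose their diacritics.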
Installation
pip install databroom
Quick Start
Command Line Interface
# Full cleaning pipeline
databroom clean data.csv --clean-all --output-file cleaned.csv
# Clean only columns
databroom clean data.csv --clean-columns --output-file cleaned.csv
# Clean with code generation
databroom clean data.csv --clean-all --output-code script.py
# Generate R code
databroom clean data.csv --clean-all --output-code script.R --lang r
# Launch interactive GUI
databroom gui
Python API
from databroom.core.broom import Broom
# Load and clean data
broom = Broom.from_csv('data.csv')
cleaned = broom.clean_all() # Smart clean everything
# Or use specific operations
cleaned = broom.clean_columns().clean_rows()
# Get cleaned DataFrame
df = cleaned.get_df()
Features
- Smart Operations: --clean-all, --clean-columns, --clean-rows, --promote-headers
- Structure Operations: Fix data format issues (promote headers, etc.)
- Advanced Options: Fine-tune with --no-snakecase, --empty-threshold, etc.
- Code Generation: Export Python/pandas or R/tidyverse scripts
- Pipeline Management: Save and load cleaning pipelines in JSON format (GUI, CLI, and API)
- Modular GUI: Component-based Streamlit interface with organized operations
- File Support: CSV, Excel, JSON input/output
Legacy operations (still supported)
- remove_empty_cols(), remove_empty_rows()
- standardize_column_names(), normalize_column_names()
- normalize_values(), standardize_values()
CLI Parameters
# Smart Operations
--clean-all # Clean everything
--clean-columns # Clean column names only
--clean-rows # Clean row data only
# Structure Operations
--promote-headers # Convert data row to column headers
--promote-row-index 1 # Row index to promote (default: 0)
--keep-promoted-row # Keep the promoted row in data
# Advanced Options
--no-snakecase # Keep original text case
--no-remove-accents-vals # Keep accents in values
--empty-threshold 0.8 # Custom missing value threshold
# Output
--output-file clean.csv # Save cleaned data
--output-code script.py # Generate reproducible code
--lang python # Code language (python/r)
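The header-promotion flags are easiest to picture on a tiny table. The following is a minimal pandas sketch of what promoting a row to headers could look like. `promote_headers_sketch` is a hypothetical helper written for illustration, not Databroom's implementation, and its handling of rows above the promoted one (they are kept) is an assumption.

```python
import pandas as pd


def promote_headers_sketch(df, row_index=0, keep_promoted_row=False):
    """Hypothetical helper: use the values of one data row as the column names."""
    promoted = df.copy()
    # Take the chosen row's values as the new header
    promoted.columns = [str(value) for value in df.iloc[row_index]]
    if not keep_promoted_row:
        # Drop the row that became the header (assumed default behaviour)
        promoted = promoted.drop(index=df.index[row_index])
    return promoted.reset_index(drop=True)


# Illustrative data: the real header sits in the second row (index 1)
raw = pd.DataFrame([["report", "2024"], ["name", "score"], ["Ana", "9"]])
result = promote_headers_sketch(raw, row_index=1)
kept = promote_headers_sketch(raw, row_index=1, keep_promoted_row=True)
```

Here `row_index` and `keep_promoted_row` play the roles described for --promote-row-index and --keep-promoted-row.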
Examples
Data Science Workflow
databroom clean survey.xlsx \
--clean-all \
--empty-threshold 0.7 \
--output-file clean.csv \
--output-code analysis.py \
--pipeline-file pipeline.json
R/Tidyverse Code Generation
databroom clean data.csv \
--clean-all \
--output-code analysis.R \
--lang r
Batch Processing
for file in *.csv; do
  databroom clean "$file" --clean-columns --output-file "clean_$file"
done
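The same loop can be driven from Python when a shell is not convenient. This is a sketch that only uses the CLI flags shown above; `build_clean_command` and `clean_directory` are hypothetical helper names, and the `dry_run` switch is an addition for illustration so the commands can be inspected before anything is executed.

```python
import subprocess
from pathlib import Path


def build_clean_command(csv_path):
    """Build the databroom invocation for one file, as in the shell loop above."""
    csv_path = Path(csv_path)
    output = csv_path.with_name(f"clean_{csv_path.name}")
    return [
        "databroom", "clean", str(csv_path),
        "--clean-columns",
        "--output-file", str(output),
    ]


def clean_directory(directory=".", dry_run=True):
    """Collect one command per CSV file; run them only when dry_run is False."""
    commands = [
        build_clean_command(path)
        for path in sorted(Path(directory).glob("*.csv"))
    ]
    if not dry_run:
        for command in commands:
            # check=True raises if databroom exits with an error
            subprocess.run(command, check=True)
    return commands


example = build_clean_command("sales.csv")
```

As with the shell version, files already named `clean_*.csv` would be picked up on a second run, so it may be worth writing the outputs to a separate directory.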
GUI Interface
Launch the interactive web interface:
databroom gui
# Opens http://localhost:8501
Features:
- Drag & drop file upload
- Organized operations by category (Structure, Column, Row)
- Live preview of operations with step back/reset
- Interactive parameter tuning with configuration panels
- Real-time code generation (Python/R)
- One-click download of cleaned data and code
Method Chaining
from databroom.core.broom import Broom
# Method chaining with structure operations
result = (Broom.from_csv('messy_data.csv')
          .promote_headers(row_index=1)  # Convert row 1 to headers
          .clean_columns(empty_threshold=0.8)
          .clean_rows(snakecase=False)
          .get_df())
Code Generation
All operations automatically generate reproducible code:
# Generated Python code
import pandas as pd
from databroom.core.broom import Broom
broom_instance = Broom.from_csv("data.csv")
broom_instance = broom_instance.clean_all()
df_cleaned = broom_instance.pipeline.df
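One way to picture how chained operations can turn into a script like the one above is to record each call and render the list afterwards. The class below is a hypothetical stand-in written for illustration only: `RecordingCleaner`, `_record`, and `to_python` are invented names, and nothing here is claimed to match how Databroom is built internally.

```python
class RecordingCleaner:
    """Hypothetical stand-in (not Databroom's Broom): records chained steps."""

    def __init__(self, source):
        self.source = source
        self.steps = []

    def _record(self, name, **params):
        self.steps.append((name, params))
        return self  # returning self is what allows method chaining

    def clean_columns(self, **params):
        return self._record("clean_columns", **params)

    def clean_rows(self, **params):
        return self._record("clean_rows", **params)

    def to_python(self):
        """Render the recorded steps in the shape of the generated code above."""
        lines = [
            "import pandas as pd",
            "from databroom.core.broom import Broom",
            f'broom_instance = Broom.from_csv("{self.source}")',
        ]
        for name, params in self.steps:
            args = ", ".join(f"{key}={value!r}" for key, value in params.items())
            lines.append(f"broom_instance = broom_instance.{name}({args})")
        lines.append("df_cleaned = broom_instance.pipeline.df")
        return "\n".join(lines)


script = (RecordingCleaner("data.csv")
          .clean_columns(empty_threshold=0.8)
          .clean_rows(snakecase=False)
          .to_python())
```

Keeping the steps as plain name and parameter pairs is a design choice that would also make them straightforward to serialize, for example to JSON.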
License
MIT License - see LICENSE file for details.
File details
Details for the file databroom-0.4.tar.gz.
File metadata
- Download URL: databroom-0.4.tar.gz
- Upload date:
- Size: 42.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.1
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | cd297e1088391ab29d6e85ff67b77ee5c8a795a0708ef99bf6e4fe075ab301b1 |
| MD5 | e50fededf793b95238071eda57e34128 |
| BLAKE2b-256 | 46e4d29da444d22a8ca8e796ecc39e3d45c1c1090d936e14fa7142d320b3aaed |
File details
Details for the file databroom-0.4-py3-none-any.whl.
File metadata
- Download URL: databroom-0.4-py3-none-any.whl
- Upload date:
- Size: 42.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.1
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 32a3b701e3760ae43166fcc28f433b31fe2d78c5619cd1e66f4a62126417a0f2 |
| MD5 | 0aaa16b8975fff2a088f54b085400a92 |
| BLAKE2b-256 | fc1c119f5625c5c41a818009d3b8dcce57ef8979baa25a8d813a32446023354f |