A cross-language DataFrame cleaning assistant with interactive GUI and one-click code export

Databroom

Clean messy DataFrames from the CLI, the GUI, or the Python API, and export the cleaning steps as reproducible Python or R code.

Why Databroom?

Manual pandas approach:

# 15+ lines of repetitive code
import pandas as pd
import unicodedata

df = pd.read_csv("messy_data.csv")
# Remove empty columns
df = df.loc[:, df.isnull().mean() < 0.9]
# Clean column names
df.columns = df.columns.str.lower().str.replace(' ', '_')
# Remove accents from text values
def clean_text(text):
    if pd.isna(text): return text
    return ''.join(c for c in unicodedata.normalize('NFKD', str(text)) 
                   if not unicodedata.combining(c))
for col in df.select_dtypes(include=['object']).columns:
    df[col] = df[col].apply(clean_text)
df.to_csv("clean_data.csv", index=False)

Databroom approach:

# One-liner equivalent
databroom clean messy_data.csv --clean-all --output-file clean_data.csv

Installation

pip install databroom

Quick Start

Command Line Interface

# Full cleaning pipeline
databroom clean data.csv --clean-all --output-file cleaned.csv

# Clean only columns
databroom clean data.csv --clean-columns --output-file cleaned.csv

# Clean with code generation
databroom clean data.csv --clean-all --output-code script.py

# Generate R code
databroom clean data.csv --clean-all --output-code script.R --lang r

# Launch interactive GUI
databroom gui

Python API

from databroom.core.broom import Broom

# Load and clean data
broom = Broom.from_csv('data.csv')
cleaned = broom.clean_all()  # Smart clean everything

# Or use specific operations
cleaned = broom.clean_columns().clean_rows()

# Get cleaned DataFrame
df = cleaned.get_df()

Features

Smart Operations: --clean-all, --clean-columns, --clean-rows
Structure Operations: Fix data layout issues, such as headers stored in a data row (--promote-headers)
Advanced Options: Fine-tune with --no-snakecase, --empty-threshold, etc.
Code Generation: Export Python/pandas or R/tidyverse scripts
Pipeline Management: Save and load cleaning pipelines in JSON format (GUI, CLI, and API)
Modular GUI: Component-based Streamlit interface with organized operations
File Support: CSV, Excel, JSON input/output

Legacy operations (still supported)

remove_empty_cols(), remove_empty_rows()
standardize_column_names(), normalize_column_names()
normalize_values(), standardize_values()
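
To make the idea behind column-name standardization concrete, here is a minimal standard-library sketch. It is not Databroom's implementation: the helper names `strip_accents` and `to_snake_case` are hypothetical, and the exact rules Databroom applies may differ. The accent handling mirrors the NFKD approach shown in the manual pandas example.

```python
import re
import unicodedata


def strip_accents(text: str) -> str:
    """Drop combining marks after NFKD decomposition (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def to_snake_case(name: str) -> str:
    """Lower-case a name and join its alphanumeric runs with underscores."""
    ascii_name = strip_accents(name).lower()
    # Any run of characters outside 0-9 and a-z becomes a single underscore.
    return re.sub(r"[^0-9a-z]+", "_", ascii_name).strip("_")


columns = ["  Año de Ingreso ", "Total (USD)", "café"]
cleaned_columns = [to_snake_case(column) for column in columns]
print(cleaned_columns)
```

Characters that have no ASCII base form after decomposition are treated as separators in this sketch, which may be too aggressive for some datasets.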

CLI Parameters

# Smart Operations
--clean-all                # Clean everything
--clean-columns            # Clean column names only
--clean-rows               # Clean row data only

# Structure Operations
--promote-headers          # Convert data row to column headers
--promote-row-index 1      # Row index to promote (default: 0)
--keep-promoted-row        # Keep the promoted row in data

# Advanced Options
--no-snakecase             # Keep original text case
--no-remove-accents-vals   # Keep accents in values
--empty-threshold 0.8      # Custom missing value threshold

# Output
--output-file clean.csv    # Save cleaned data
--output-code script.py    # Generate reproducible code
--lang python              # Code language (python/r)
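
As a rough illustration of what a missing-value threshold such as `--empty-threshold` controls, the sketch below drops columns in plain pandas. It follows the manual pandas line earlier in this README (`df.isnull().mean() < 0.9`). The function name is hypothetical, and whether Databroom uses a strict or inclusive comparison is an assumption here.

```python
import pandas as pd


def drop_mostly_empty_columns(df: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Keep columns whose fraction of missing values is below `threshold`."""
    missing_fraction = df.isnull().mean()  # per-column share of missing cells
    return df.loc[:, missing_fraction < threshold]


df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "notes": [None, None, None, None, "ok"],   # 4 of 5 values missing
    "unused": [None, None, None, None, None],  # every value missing
})

lenient = drop_mostly_empty_columns(df, threshold=0.9)
strict = drop_mostly_empty_columns(df, threshold=0.5)
print(list(lenient.columns), list(strict.columns))
```

A lower threshold removes more columns, so it is worth previewing the result before overwriting any data.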

Examples

Data Science Workflow

databroom clean survey.xlsx \
  --clean-all \
  --empty-threshold 0.7 \
  --output-file clean.csv \
  --output-code analysis.py \
  --pipeline-file pipeline.json

R/Tidyverse Code Generation

databroom clean data.csv \
  --clean-all \
  --output-code analysis.R \
  --lang r

Batch Processing

for file in *.csv; do
  databroom clean "$file" --clean-columns --output-file "clean_$file"
done
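
The same loop can be driven from Python when a shell is not available. This is a sketch using only the standard library and the flags shown above; the function names `build_clean_command` and `clean_directory` are hypothetical and not part of Databroom.

```python
import subprocess
from pathlib import Path


def build_clean_command(csv_path: Path) -> list[str]:
    """Build the command that the shell loop runs for one file."""
    output_path = csv_path.with_name(f"clean_{csv_path.name}")
    return [
        "databroom", "clean", str(csv_path),
        "--clean-columns",
        "--output-file", str(output_path),
    ]


def clean_directory(directory: Path) -> None:
    """Run the cleaning command on every CSV file in `directory`."""
    for csv_path in sorted(directory.glob("*.csv")):
        # check=True raises CalledProcessError if the command exits non-zero.
        subprocess.run(build_clean_command(csv_path), check=True)


# Example: clean_directory(Path("."))
```

Like the shell version, a second run would also pick up the `clean_*.csv` files written by the first, so writing output to a separate directory may be preferable.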

GUI Interface

Launch the interactive web interface:

databroom gui
# Opens http://localhost:8501

Features:

Drag & drop file upload
Organized operations by category (Structure, Column, Row)
Live preview of operations with step back/reset
Interactive parameter tuning with configuration panels
Real-time code generation (Python/R)
One-click download of cleaned data and code

Method Chaining

from databroom.core.broom import Broom

# Method chaining with structure operations
result = (Broom.from_csv('messy_data.csv')
          .promote_headers(row_index=1)    # Convert row 1 to headers
          .clean_columns(empty_threshold=0.8)
          .clean_rows(snakecase=False)
          .get_df())
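
For readers unfamiliar with header promotion, the following pandas sketch shows the general idea behind `promote_headers(row_index=1)`: one data row becomes the column names. It is an illustration under assumptions, not Databroom's code. The function `promote_row_to_header` and its `keep_row` parameter are hypothetical, and how Databroom treats rows above the promoted one is not stated in this README.

```python
import pandas as pd


def promote_row_to_header(df: pd.DataFrame, row_index: int = 0,
                          keep_row: bool = False) -> pd.DataFrame:
    """Use the values of one data row as the column names."""
    promoted = df.copy()
    promoted.columns = [str(value) for value in df.iloc[row_index]]
    if not keep_row:
        # Remove the row that now serves as the header.
        promoted = promoted.drop(index=df.index[row_index])
    return promoted.reset_index(drop=True)


raw = pd.DataFrame([
    ["Survey export", None],  # title line above the real header
    ["name", "score"],        # the real header, stored as data
    ["Ana", 9],
    ["Luis", 7],
])

result = promote_row_to_header(raw, row_index=1)
print(result)
```

In this sketch the title row above the header stays in the data, so a follow-up row-cleaning step would still be needed to remove it.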

Code Generation

All operations automatically generate reproducible code:

# Generated Python code
import pandas as pd
from databroom.core.broom import Broom

broom_instance = Broom.from_csv("data.csv")
broom_instance = broom_instance.clean_all()
df_cleaned = broom_instance.pipeline.df

License

MIT License - see LICENSE file for details.

Release history

0.4 (this version): Sep 29, 2025
0.3.1: Jul 30, 2025
0.3.0: Jul 30, 2025

Download files

Source Distribution

databroom-0.4.tar.gz (42.6 kB)

Uploaded Sep 29, 2025 Source

Built Distribution

databroom-0.4-py3-none-any.whl (42.6 kB)

Uploaded Sep 29, 2025 Python 3

File details

Details for the file databroom-0.4.tar.gz.

File metadata

Download URL: databroom-0.4.tar.gz
Upload date: Sep 29, 2025
Size: 42.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.1

File hashes

Hashes for databroom-0.4.tar.gz
Algorithm	Hash digest
SHA256	`cd297e1088391ab29d6e85ff67b77ee5c8a795a0708ef99bf6e4fe075ab301b1`
MD5	`e50fededf793b95238071eda57e34128`
BLAKE2b-256	`46e4d29da444d22a8ca8e796ecc39e3d45c1c1090d936e14fa7142d320b3aaed`

File details

Details for the file databroom-0.4-py3-none-any.whl.

File metadata

Download URL: databroom-0.4-py3-none-any.whl
Upload date: Sep 29, 2025
Size: 42.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.1

File hashes

Hashes for databroom-0.4-py3-none-any.whl
Algorithm	Hash digest
SHA256	`32a3b701e3760ae43166fcc28f433b31fe2d78c5619cd1e66f4a62126417a0f2`
MD5	`0aaa16b8975fff2a088f54b085400a92`
BLAKE2b-256	`fc1c119f5625c5c41a818009d3b8dcce57ef8979baa25a8d813a32446023354f`
