A cross-language DataFrame cleaning assistant with interactive GUI and one-click code export

Databroom

Clean messy DataFrames fast with CLI, GUI, and API – generate reproducible Python and R code in seconds.

Why Databroom?

Manual pandas approach:

# 15+ lines of repetitive code
import pandas as pd
import unicodedata

df = pd.read_csv("messy_data.csv")
# Remove empty columns
df = df.loc[:, df.isnull().mean() < 0.9]
# Clean column names
df.columns = df.columns.str.lower().str.replace(' ', '_')
# Remove accents from text values
def clean_text(text):
    if pd.isna(text): return text
    return ''.join(c for c in unicodedata.normalize('NFKD', str(text)) 
                   if not unicodedata.combining(c))
for col in df.select_dtypes(include=['object']).columns:
    df[col] = df[col].apply(clean_text)
df.to_csv("clean_data.csv", index=False)

Databroom approach:

# One-liner equivalent
databroom clean messy_data.csv --clean-all --output-file clean_data.csv

Installation

pip install databroom

Quick Start

Command Line Interface

# Full cleaning pipeline
databroom clean data.csv --clean-all --output-file cleaned.csv

# Clean only columns
databroom clean data.csv --clean-columns --output-file cleaned.csv

# Clean with code generation
databroom clean data.csv --clean-all --output-code script.py

# Generate R code
databroom clean data.csv --clean-all --output-code script.R --lang r

# Launch interactive GUI
databroom gui

Python API

from databroom.core.broom import Broom

# Load and clean data
broom = Broom.from_csv('data.csv')
cleaned = broom.clean_all()  # Smart clean everything

# Or use specific operations
cleaned = broom.clean_columns().clean_rows()

# Get cleaned DataFrame
df = cleaned.get_df()

Features

  • Smart Operations: --clean-all, --clean-columns, --clean-rows
  • Structure Operations: Fix data format issues (--promote-headers, etc.)
  • Advanced Options: Fine-tune with --no-snakecase, --empty-threshold, etc.
  • Code Generation: Export Python/pandas or R/tidyverse scripts
  • Pipeline Management: Save and load cleaning pipelines in JSON format (GUI, CLI, and API)
  • Modular GUI: Component-based Streamlit interface with organized operations
  • File Support: CSV, Excel, JSON input/output
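To illustrate the idea behind saving and loading a cleaning pipeline as JSON, here is a minimal sketch in plain pandas. It is not Databroom's actual file format or API: the `OPERATIONS` registry, the step names, and the `op`/`params` keys are all hypothetical and chosen only for illustration.

```python
import json

import pandas as pd

# Hypothetical registry (not Databroom's): step name -> function(df, **params) -> df.
OPERATIONS = {
    "drop_empty_columns": lambda df, threshold=0.9: df.loc[:, df.isnull().mean() < threshold],
    "lowercase_columns": lambda df: df.rename(columns=str.lower),
}


def run_pipeline(df: pd.DataFrame, steps: list) -> pd.DataFrame:
    """Apply each recorded step in order and return the resulting DataFrame."""
    for step in steps:
        df = OPERATIONS[step["op"]](df, **step.get("params", {}))
    return df


# A pipeline is just an ordered list of steps, so it round-trips through JSON.
steps = [
    {"op": "drop_empty_columns", "params": {"threshold": 0.8}},
    {"op": "lowercase_columns"},
]
saved = json.dumps(steps, indent=2)  # text that could be written to a pipeline file
restored = json.loads(saved)         # what a later run would load back

raw = pd.DataFrame({"Name": ["Ana", "Bo"], "Empty": [None, None]})
result = run_pipeline(raw, restored)
print(list(result.columns))
```

Storing steps as data rather than code is what would let the same pipeline be replayed on a different file later.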

Legacy operations (still supported)

  • remove_empty_cols(), remove_empty_rows()
  • standardize_column_names(), normalize_column_names()
  • normalize_values(), standardize_values()

CLI Parameters

# Smart Operations
--clean-all               # Clean everything
--clean-columns           # Clean column names only
--clean-rows              # Clean row data only

# Structure Operations
--promote-headers         # Convert data row to column headers
--promote-row-index 1     # Row index to promote (default: 0)
--keep-promoted-row       # Keep the promoted row in data

# Advanced Options
--no-snakecase            # Keep original text case
--no-remove-accents-vals  # Keep accents in values
--empty-threshold 0.8     # Custom missing value threshold

# Output
--output-file clean.csv   # Save cleaned data
--output-code script.py   # Generate reproducible code
--lang python             # Code language (python/r)
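For readers who want to see what the advanced options roughly control, the sketch below approximates an empty-column threshold, snake_case column names, and accent removal in plain pandas. It is an illustration modeled on the manual example earlier in this README, not Databroom's implementation. The function names are hypothetical, and the assumption that the threshold means "drop columns whose missing fraction reaches this value" is mine.

```python
import unicodedata

import pandas as pd


def drop_sparse_columns(df: pd.DataFrame, empty_threshold: float = 0.9) -> pd.DataFrame:
    """Keep only columns whose fraction of missing values is below the threshold."""
    missing_fraction = df.isnull().mean()
    return df.loc[:, missing_fraction < empty_threshold]


def strip_accents(text: str) -> str:
    """Decompose characters (NFKD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def snake_case_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Strip accents, lower-case, and replace whitespace runs with underscores."""
    cleaned = (
        pd.Index(df.columns.astype(str))
        .map(strip_accents)
        .str.strip()
        .str.lower()
        .str.replace(r"\s+", "_", regex=True)
    )
    out = df.copy()
    out.columns = cleaned
    return out


raw = pd.DataFrame({
    "First Name": ["José", "Zoë"],
    "Año Nacimiento": [1990, 1985],
    "Unused": [None, None],  # fully missing, so it falls above any threshold below 1.0
})
tidy = snake_case_columns(drop_sparse_columns(raw, empty_threshold=0.8))
print(list(tidy.columns))
```

Skipping `snake_case_columns` or `strip_accents` corresponds loosely to what `--no-snakecase` and `--no-remove-accents-vals` are described as doing.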

Examples

Data Science Workflow

databroom clean survey.xlsx \
  --clean-all \
  --empty-threshold 0.7 \
  --output-file clean.csv \
  --output-code analysis.py \
  --pipeline-file pipeline.json

R/Tidyverse Code Generation

databroom clean data.csv \
  --clean-all \
  --output-code analysis.R \
  --lang r

Batch Processing

for file in *.csv; do
  databroom clean "$file" --clean-columns --output-file "clean_$file"
done
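The same loop can be sketched in Python for cases where a shell is not available. The helper names below (`clean_directory`, `read_and_tidy_columns`) are hypothetical, and the default cleaner is a plain-pandas stand-in so the example runs without Databroom installed.

```python
import tempfile
from pathlib import Path
from typing import Callable, List

import pandas as pd


def read_and_tidy_columns(path: Path) -> pd.DataFrame:
    """Stand-in cleaner: read a CSV and snake_case its column names."""
    df = pd.read_csv(path)
    df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_", regex=False)
    return df


def clean_directory(
    src_dir: Path,
    dst_dir: Path,
    clean_file: Callable[[Path], pd.DataFrame] = read_and_tidy_columns,
) -> List[Path]:
    """Clean every CSV in src_dir and write clean_<name> into dst_dir."""
    dst_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for csv_path in sorted(src_dir.glob("*.csv")):
        target = dst_dir / f"clean_{csv_path.name}"
        clean_file(csv_path).to_csv(target, index=False)
        written.append(target)
    return written


# Small demonstration on a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "raw"
    src.mkdir()
    (src / "survey.csv").write_text("First Name,Age\nAna,31\n", encoding="utf-8")
    written = clean_directory(src, Path(tmp) / "cleaned")
    written_names = [p.name for p in written]
    cleaned_columns = list(pd.read_csv(written[0]).columns)

print(written_names, cleaned_columns)
```

With Databroom installed, the stand-in could presumably be swapped for `lambda p: Broom.from_csv(str(p)).clean_columns().get_df()`, which uses only the calls shown in the Python API section.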

GUI Interface

Launch the interactive web interface:

databroom gui
# Opens http://localhost:8501

Features:

  • Drag & drop file upload
  • Organized operations by category (Structure, Column, Row)
  • Live preview of operations with step back/reset
  • Interactive parameter tuning with configuration panels
  • Real-time code generation (Python/R)
  • One-click download of cleaned data and code

Method Chaining

from databroom.core.broom import Broom

# Method chaining with structure operations
result = (Broom.from_csv('messy_data.csv')
          .promote_headers(row_index=1)    # Convert row 1 to headers
          .clean_columns(empty_threshold=0.8)
          .clean_rows(snakecase=False)
          .get_df())
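As a rough picture of what the `promote_headers` step in the chain above does, here is a plain-pandas approximation. It is a sketch, not Databroom's code: the function is a hypothetical stand-in, and its parameters only mirror `--promote-row-index` and `--keep-promoted-row`. It leaves rows above the promoted row in place, and this README does not say whether Databroom does the same.

```python
import pandas as pd


def promote_headers(
    df: pd.DataFrame, row_index: int = 0, keep_promoted_row: bool = False
) -> pd.DataFrame:
    """Use the values of one data row as the new column names."""
    out = df.copy()
    out.columns = out.iloc[row_index].astype(str).tolist()
    if not keep_promoted_row:
        # Remove the row that has just become the header.
        out = out.drop(out.index[row_index])
    return out.reset_index(drop=True)


# A file whose real header sits on the second row (position 1).
raw = pd.DataFrame([["Report 2024", None], ["name", "age"], ["Ana", 31]])
fixed = promote_headers(raw, row_index=1)
kept = promote_headers(raw, row_index=1, keep_promoted_row=True)
print(list(fixed.columns), len(fixed), len(kept))
```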

Code Generation

All operations automatically generate reproducible code:

# Generated Python code
import pandas as pd
from databroom.core.broom import Broom

broom_instance = Broom.from_csv("data.csv")
broom_instance = broom_instance.clean_all()
df_cleaned = broom_instance.pipeline.df

License

MIT License - see LICENSE file for details.
