Recursive Data Cleaner
LLM-powered incremental data cleaning for massive datasets. Process files in chunks, identify quality issues, and automatically generate Python cleaning functions.
How It Works
- Chunk your data (JSONL, CSV, JSON, Parquet, PDF, Word, Excel, XML, and more)
- Analyze each chunk with an LLM to identify issues
- Generate one cleaning function per issue
- Validate functions on holdout data before accepting
- Output a ready-to-use cleaning_functions.py
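The holdout step in the list above can be pictured with a minimal sketch. It assumes each generated function takes one record (a dict) and returns a dict; the names split_holdout and passes_holdout are hypothetical and are not part of the package's API.

```python
import random


def split_holdout(records, holdout_ratio=0.2, seed=0):
    """Split one chunk into (analysis, holdout) lists.

    Hypothetical helper: the real pipeline may split differently.
    """
    if not 0 <= holdout_ratio < 1:
        raise ValueError("holdout_ratio must be in [0, 1)")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is repeatable
    n_holdout = int(len(shuffled) * holdout_ratio)
    return shuffled[n_holdout:], shuffled[:n_holdout]


def passes_holdout(func, holdout):
    """Accept func only if it runs on every held-out record and returns a dict."""
    for record in holdout:
        try:
            result = func(dict(record))  # copy so the holdout stays untouched
        except Exception:
            return False
        if not isinstance(result, dict):
            return False
    return True
```

A function that raises on unseen records, or returns the wrong type, is rejected before it reaches the output file.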
The system maintains a "docstring registry": descriptions of the functions generated so far are fed back into each prompt, so the LLM knows what is already solved and avoids duplicate work.
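As a rough illustration of the idea, here is a sketch of a registry that keeps docstrings within a character budget and evicts the oldest entries first. The class name, method names, and rendering format are assumptions made for this example, not the package's actual implementation.

```python
from collections import deque


class DocstringRegistry:
    """Hypothetical registry: oldest docstrings are dropped first (FIFO)."""

    def __init__(self, context_budget=8000):
        self.context_budget = context_budget  # max characters of rendered context
        self._entries = deque()

    def add(self, name, docstring):
        self._entries.append((name, docstring.strip()))
        # Evict from the front until the rendered text fits the budget,
        # always keeping at least the newest entry.
        while len(self._entries) > 1 and len(self.render()) > self.context_budget:
            self._entries.popleft()

    def render(self):
        """Text block that could be pasted into the next prompt."""
        return "\n".join(f"- {name}: {doc}" for name, doc in self._entries)
```

Rendering on every insert is quadratic in the number of entries, which is acceptable for a sketch but worth replacing with a running character count in real code.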
Installation
pip install -e .
For Apple Silicon (MLX backend):
pip install -e ".[mlx]"
For document conversion (PDF, Word, Excel, HTML, etc.):
pip install -e ".[markitdown]"
For Parquet files:
pip install -e ".[parquet]"
For Terminal UI (Rich dashboard):
pip install -e ".[tui]"
Quick Start
from recursive_cleaner import DataCleaner
from backends import MLXBackend
# Any LLM with generate(prompt) -> str works
llm = MLXBackend(model_path="your-model")
cleaner = DataCleaner(
    llm_backend=llm,
    file_path="messy_data.jsonl",
    chunk_size=50,
    instructions="""
    - Normalize phone numbers to E.164
    - Fix typos in status field (valid: active, pending, churned)
    - Convert dates to ISO 8601
    """,
)
cleaner.run()  # Generates cleaning_functions.py
Features
Core
- Chunked Processing: Handle files larger than LLM context windows
- Incremental Generation: One function per issue, building up a complete solution
- Docstring Registry: Automatic context management with FIFO eviction
- AST Validation: All generated code validated before output
- Error Recovery: Retries with error feedback on parse failures
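The AST validation and error-recovery bullets above can be sketched with the standard library's ast module. This is an assumed, simplified check (the function name validate_function_source is hypothetical); the package's own validation may be stricter.

```python
import ast


def validate_function_source(source):
    """Return (ok, message) for a piece of LLM-generated code.

    Hypothetical helper: checks only that the code parses and
    defines at least one top-level function.
    """
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        # The message can be appended to the next prompt for a retry.
        return False, f"SyntaxError: {exc.msg} (line {exc.lineno})"
    if not any(isinstance(node, ast.FunctionDef) for node in tree.body):
        return False, "no top-level function definition found"
    return True, "ok"
```

Returning the error text instead of raising makes it straightforward to include the failure in a retry prompt.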
Data Quality (v0.4.0+)
- Holdout Validation: Test functions on unseen 20% of each chunk
- Sampling Strategies: Sequential, random, or stratified sampling
- Quality Metrics: Before/after comparison with improvement reports
- Dependency Resolution: Topological sort for correct function ordering
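Dependency resolution by topological sort can be illustrated with graphlib from the standard library (Python 3.9+). The input shape, a mapping from each function name to the names it must run after, is an assumption for this sketch, and order_functions is a hypothetical name.

```python
from graphlib import CycleError, TopologicalSorter


def order_functions(dependencies):
    """Return function names so that each runs after its prerequisites.

    dependencies: dict mapping a function name to the set of names
    that must run before it (hypothetical input format).
    """
    try:
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError as exc:
        # exc.args[1] holds the list of nodes that form the cycle
        raise ValueError(f"circular dependency: {exc.args[1]}") from exc
```

For example, if date parsing must follow whitespace stripping, passing {"parse_dates": {"strip_whitespace"}} places strip_whitespace first.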
Optimization (v0.5.0+)
- Two-Pass Consolidation: Merge redundant functions after generation
- Early Termination: Stop when LLM detects pattern saturation
- LLM Agency: Model decides chunk cleanliness and saturation
Security (v0.5.1+)
- Dangerous Code Detection: AST-based detection of exec, eval, subprocess, network calls
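A minimal sketch of AST-based detection follows. The two deny-lists are illustrative assumptions chosen to match the categories named above (exec, eval, subprocess, network modules); the package's actual rules are not shown in this README.

```python
import ast

# Illustrative deny-lists (assumptions, not the package's real configuration)
DANGEROUS_CALLS = {"exec", "eval"}
DANGEROUS_MODULES = {"subprocess", "socket", "urllib", "http", "requests"}


def find_dangerous_code(source):
    """Return a list of human-readable findings; an empty list means none found."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            # Direct calls such as eval("...") or exec("...")
            if node.func.id in DANGEROUS_CALLS:
                findings.append(f"call to {node.func.id}() on line {node.lineno}")
        elif isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] in DANGEROUS_MODULES:
                    findings.append(f"import of {alias.name} on line {node.lineno}")
        elif isinstance(node, ast.ImportFrom):
            root = (node.module or "").split(".")[0]
            if root in DANGEROUS_MODULES:
                findings.append(f"import from {node.module} on line {node.lineno}")
    return findings
```

A static check like this catches only direct uses; indirect access (for example through getattr or aliases) would slip past it, so it is a first filter rather than a sandbox.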
Observability (v0.6.0)
- Latency Metrics: Track min/max/avg/total LLM call times
- Import Consolidation: Deduplicate and merge imports in output
- Cleaning Reports: Markdown summary with functions, timing, quality delta
- Dry-Run Mode: Analyze data without generating functions
Format Expansion (v0.7.0)
- Markitdown Integration: Convert 20+ formats (PDF, Word, Excel, PowerPoint, HTML, EPUB, etc.) to text
- Parquet Support: Load parquet files as structured data via pyarrow
- LLM-Generated Parsers: Auto-generate parsers for XML and unknown formats (auto_parse=True)
Terminal UI (v0.8.0)
- Mission Control Dashboard: Rich-based live terminal UI with retro aesthetic
- Real-time Progress: Animated progress bars, chunk/iteration counters
- Transmission Log: Parsed LLM responses showing issues detected and functions being generated
- Token Estimation: Track estimated input/output tokens across the run
- Graceful Fallback: Works without Rich installed (falls back to callbacks)
CLI (v0.9.0)
- Command Line Interface: Use without writing Python code
- Multiple Backends: MLX (Apple Silicon) and OpenAI-compatible (OpenAI, LM Studio, Ollama)
- Four Commands: generate, analyze (dry-run), resume, apply
Apply Mode (v1.0.0)
- Apply Cleaning Functions: Apply generated functions to full datasets
- Data Formats: JSONL, CSV, JSON, Parquet, and Excel (.xlsx/.xls) inputs are written back in the same format
- Text Formats: PDF, Word, HTML, etc. are written out as Markdown
- Streaming: Memory-efficient line-by-line processing for JSONL/CSV
- Colored TUI: Enhanced transmission log with syntax-highlighted XML parsing
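The streaming bullet above describes line-by-line processing; a sketch of that pattern for JSONL is shown here. It assumes a per-record cleaning callable, and apply_streaming is a hypothetical name rather than the function the apply command uses.

```python
import json


def apply_streaming(clean_record, src, dst):
    """Clean a JSONL stream one line at a time and return the record count.

    src and dst are text file objects; only one record is held in
    memory at any moment.
    """
    count = 0
    for line in src:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = clean_record(json.loads(line))
        dst.write(json.dumps(record, ensure_ascii=False) + "\n")
        count += 1
    return count
```

With real files this would be called as apply_streaming(func, open("data.jsonl"), open("cleaned_data.jsonl", "w")), ideally inside with blocks so both files are closed.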
Command Line Interface
After installation, the recursive-cleaner command is available:
# Generate cleaning functions with MLX (Apple Silicon)
recursive-cleaner generate data.jsonl \
  --provider mlx \
  --model "lmstudio-community/Qwen3-80B-MLX-4bit" \
  --instructions "Normalize phone numbers to E.164" \
  --output cleaning_functions.py
# Use OpenAI
export OPENAI_API_KEY=your-key
recursive-cleaner generate data.jsonl \
  --provider openai \
  --model gpt-4o \
  --instructions "Fix date formats"
# Use LM Studio or Ollama (OpenAI-compatible)
recursive-cleaner generate data.jsonl \
  --provider openai \
  --model "qwen/qwen3-vl-30b" \
  --base-url http://localhost:1234/v1 \
  --instructions "Normalize prices"
# Dry-run analysis
recursive-cleaner analyze data.jsonl \
  --provider openai \
  --model gpt-4o \
  --instructions @instructions.txt
# Resume from checkpoint
recursive-cleaner resume cleaning_state.json \
  --provider mlx \
  --model "model-path"
# Apply cleaning functions to data
recursive-cleaner apply data.jsonl \
  --functions cleaning_functions.py \
  --output cleaned_data.jsonl
# Apply to Excel (outputs same format)
recursive-cleaner apply sales.xlsx \
  --functions cleaning_functions.py
# Apply to PDF (outputs markdown)
recursive-cleaner apply document.pdf \
  --functions cleaning_functions.py \
  --output cleaned.md
CLI Options
recursive-cleaner generate <FILE> [OPTIONS]
Required:
  FILE                         Input data file
  -p, --provider {mlx,openai}  LLM provider
  -m, --model MODEL            Model name/path
Optional:
  -i, --instructions TEXT      Cleaning instructions (or @file.txt)
  --base-url URL               API URL for OpenAI-compatible servers
  --chunk-size N               Items per chunk (default: 50)
  --max-iterations N           Max iterations per chunk (default: 5)
  -o, --output PATH            Output file (default: cleaning_functions.py)
  --tui                        Enable Rich dashboard
  --optimize                   Consolidate redundant functions
  --track-metrics              Measure before/after quality
Configuration
cleaner = DataCleaner(
    # Required
    llm_backend=llm,
    file_path="data.jsonl",
    # Chunking
    chunk_size=50,                   # Items per chunk (or chars for text mode)
    max_iterations=5,                # Max iterations per chunk
    context_budget=8000,             # Max chars for docstring context
    # Validation
    validate_runtime=True,           # Test functions before accepting
    schema_sample_size=10,           # Records for schema inference
    holdout_ratio=0.2,               # Fraction held out for validation
    # Sampling
    sampling_strategy="stratified",  # "sequential", "random", "stratified"
    stratify_field="status",         # Field for stratified sampling
    # Optimization
    optimize=True,                   # Consolidate redundant functions
    early_termination=True,          # Stop when patterns saturate
    track_metrics=True,              # Measure before/after quality
    # Observability
    report_path="report.md",         # Markdown report output (None to disable)
    dry_run=False,                   # Analyze without generating functions
    # Format Expansion
    auto_parse=False,                # LLM generates parser for unknown formats
    # Terminal UI
    tui=True,                        # Enable Rich dashboard (requires [tui] extra)
    # Progress & State
    on_progress=callback,            # Progress event callback
    state_file="state.json",         # Enable resume on interrupt
)
Progress Events
def on_progress(event):
    # match/case requires Python 3.10+
    match event["type"]:
        case "chunk_start":
            print(f"Chunk {event['chunk_index']}/{event['total_chunks']}")
        case "llm_call":
            print(f"LLM latency: {event['latency_ms']}ms")
        case "function_generated":
            print(f"Generated: {event['function_name']}")
        case "issues_detected":  # dry-run mode
            print(f"Found {len(event['issues'])} issues")
        case "complete":
            stats = event["latency_stats"]
            print(f"Done! Avg latency: {stats['avg_ms']}ms")
Output
The cleaner generates cleaning_functions.py:
# Auto-generated cleaning functions
import re
def normalize_phone_numbers(data):
    """Normalize phone numbers to E.164 format."""
    # ... implementation ...
def fix_status_typos(data):
    """Fix typos in status field."""
    # ... implementation ...
def clean_data(data):
    """Apply all cleaning functions in order."""
    data = normalize_phone_numbers(data)
    data = fix_status_typos(data)
    return data
Custom LLM Backend
Implement the simple protocol:
class MyBackend:
    def generate(self, prompt: str) -> str:
        # Call your LLM (OpenAI, Anthropic, local, etc.) and bind its text reply to response
        return response
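To make the protocol concrete, here is a sketch that writes it down with typing.Protocol and pairs it with a canned backend useful for tests. Both LLMBackend and CannedBackend are hypothetical names introduced for this example; the package only requires an object with generate(prompt) -> str.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class LLMBackend(Protocol):
    """Structural type: anything with generate(prompt) -> str qualifies."""

    def generate(self, prompt: str) -> str: ...


class CannedBackend:
    """Test double that returns pre-written replies in order."""

    def __init__(self, replies):
        self._replies = list(replies)
        self.prompts = []  # every prompt received, for later inspection

    def generate(self, prompt: str) -> str:
        self.prompts.append(prompt)
        if not self._replies:
            raise RuntimeError("no canned replies left")
        return self._replies.pop(0)
```

A canned backend like this makes it possible to exercise a pipeline deterministically without network access or a local model.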
Text Mode
For plain text files (for example, text extracted from PDFs or other documents):
cleaner = DataCleaner(
    llm_backend=llm,
    file_path="document.txt",
    chunk_size=4000,  # Characters, not items
    instructions="Fix OCR errors, normalize whitespace",
)
Text mode uses sentence-aware chunking to avoid splitting mid-sentence.
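The idea behind sentence-aware chunking can be shown with a small stand-in. The package vendors a chunker adapted from Chonkie; the regex-based function below is not that code, only a simplified sketch with a hypothetical name (chunk_by_sentence) and a naive notion of where sentences end.

```python
import re

# Naive sentence boundary: whitespace that follows ., ! or ?
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def chunk_by_sentence(text, chunk_size=4000):
    """Group whole sentences into chunks of at most chunk_size characters.

    A single sentence longer than chunk_size is kept whole rather than split.
    """
    sentences = [s for s in _SENTENCE_END.split(text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > chunk_size:
            chunks.append(current)  # close the chunk before it overflows
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Abbreviations such as "e.g." would be treated as sentence ends by this regex, which is one reason a dedicated chunker is preferable in practice.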
Resume on Interrupt
# Start with state file
cleaner = DataCleaner(
    llm_backend=llm,
    file_path="huge_file.jsonl",
    state_file="cleaning_state.json",
)
cleaner.run()
# If interrupted, resume later:
cleaner = DataCleaner.resume("cleaning_state.json", llm)
cleaner.run()
Architecture
recursive_cleaner/
├── cli.py # Command line interface
├── cleaner.py # Main DataCleaner class
├── context.py # Docstring registry with FIFO eviction
├── dependencies.py # Topological sort for function ordering
├── metrics.py # Quality metrics before/after
├── optimizer.py # Two-pass consolidation with LLM agency
├── output.py # Function file generation + import consolidation
├── parser_generator.py # LLM-generated parsers for unknown formats
├── parsers.py # Chunking for all formats + sampling
├── prompt.py # LLM prompt templates
├── report.py # Markdown report generation
├── response.py # XML/markdown parsing + agency dataclasses
├── schema.py # Schema inference
├── tui.py # Rich terminal dashboard
├── validation.py # Runtime validation + holdout
└── vendor/
└── chunker.py # Vendored sentence-aware chunker
backends/
├── mlx_backend.py # MLX-LM backend for Apple Silicon
└── openai_backend.py # OpenAI-compatible backend
Testing
pytest tests/ -v
555 tests covering all features. Test datasets in test_cases/:
- E-commerce product catalogs
- Healthcare patient records
- Financial transaction data
Philosophy
- Simplicity over extensibility: ~5,000 lines that do one thing well
- stdlib over dependencies: Only tenacity required
- Retry over recover: On error, retry with error in prompt
- Wu wei: Let the LLM make decisions about data it understands
Version History
| Version | Features |
|---|---|
| v1.0.1 | Return type validation, prompt signature clarity, duplicate field detection |
| v1.0.0 | Apply mode for cleaning data, Excel support (.xlsx/.xls), enhanced TUI colors |
| v0.9.0 | CLI tool with MLX and OpenAI-compatible backends (LM Studio, Ollama) |
| v0.8.0 | Terminal UI with Rich dashboard, mission control aesthetic, transmission log |
| v0.7.0 | Markitdown (20+ formats), Parquet support, LLM-generated parsers |
| v0.6.0 | Latency metrics, import consolidation, cleaning report, dry-run mode |
| v0.5.1 | Dangerous code detection (AST-based security) |
| v0.5.0 | Two-pass optimization, early termination, LLM agency |
| v0.4.0 | Holdout validation, dependency resolution, sampling, quality metrics |
| v0.3.0 | Text mode with sentence-aware chunking |
| v0.2.0 | Runtime validation, schema inference, callbacks, incremental saves |
| v0.1.0 | Core pipeline, chunking, docstring registry |
Acknowledgments
- Sentence-aware text chunking adapted from Chonkie (MIT License)
- Development assisted by Claude Code
License
MIT