Terminal-based CLI/TUI for managing MongoDB data quality operations
Yirifi Data Quality - CLI/TUI Tool
MongoDB data quality operations made fast, safe, and repeatable
A comprehensive CLI/TUI tool for managing data quality across MongoDB databases. Think of it as TypeScript for your database operations - with safety rails, state tracking, and automation for common tasks.
What Is This?
Yirifi Data Quality (yirifi-dq) is a command-line tool that automates MongoDB data quality operations:
- Remove duplicates - Find and clean duplicate records
- Detect orphans - Identify and fix broken relationships
- Normalize data - Standardize field values
- Track everything - Complete audit trail of all operations
- Safety first - Automatic backups, test mode, verification
From 60 minutes to 2 minutes per operation.
Quick Start (5 Minutes)
1. Install
# Clone the repository (or navigate to it)
cd /path/to/yirifi-data-fixes
# Install the CLI tool
pip install -e .
# Verify installation
yirifi-dq --help
2. Configure MongoDB Connection
Create a .env file:
cp .env.example .env
Add your connection strings:
DEV_MONGODB_URI=mongodb://localhost:27017/regdb_dev
UAT_MONGODB_URI=mongodb://your-uat-server/regdb
PRD_MONGODB_URI=mongodb://your-prod-server/regdb
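As a rough illustration of how an --env value could map onto the variables above, the sketch below looks up `<ENV>_MONGODB_URI` in a mapping. The helper name `resolve_mongo_uri` and its error handling are assumptions made for this example, not the tool's actual loader.

```python
import os
from typing import Mapping, Optional

# Environment names used in this README's examples.
KNOWN_ENVS = ("DEV", "UAT", "PRD")


def resolve_mongo_uri(env: str, environ: Optional[Mapping[str, str]] = None) -> str:
    """Return the value of <ENV>_MONGODB_URI (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    env = env.upper()
    if env not in KNOWN_ENVS:
        raise ValueError(f"Unknown environment: {env!r}")
    key = f"{env}_MONGODB_URI"
    uri = environ.get(key, "").strip()
    if not uri:
        raise KeyError(f"{key} is not set; add it to .env")
    return uri


if __name__ == "__main__":
    sample = {"DEV_MONGODB_URI": "mongodb://localhost:27017/regdb_dev"}
    print(resolve_mongo_uri("dev", sample))
```

Failing early on a missing variable keeps a typo in .env from silently pointing an operation at the wrong server.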
3. Run Your First Operation
Option A: Interactive Wizard (Recommended for first-time users)
yirifi-dq new
Follow the 8-screen guided workflow to create and execute an operation.
Option B: Command Mode (For power users)
# Remove duplicate URLs from links collection (test mode - only 10 records)
yirifi-dq new duplicate-cleanup \
--database regdb \
--collection links \
--field url \
--keep-strategy oldest \
--env DEV \
--test-mode \
--execute-now
# Output:
# ✓ Operation OP-2025-001 created
# ✓ Backup created: output/backup_20250116_100000.json
# ✓ Found 5 duplicates, removed 5
# ✓ Verification passed
Who Is This For?
Data Analysts
Use the interactive wizard (yirifi-dq new) for guided workflows. No coding required!
Junior Developers
Use CLI commands for faster execution. Learn from examples in docs/tutorials/.
Senior Developers
Use CLI commands for automation, extend with custom operations, or write manual Python scripts for complex cases. See docs/developer-guide/.
Key Features
1. Dual Interface
Interactive Wizard (TUI)
- 8-screen guided workflow
- Perfect for exploratory work
- No MongoDB knowledge needed
Command Line (CLI)
- 9 powerful commands
- Scriptable and automatable
- Tab completion support
2. Automatic Safety
- Mandatory backups before deletion (no exceptions)
- Test mode default (limits to 10 records for safety)
- Collection locking (prevents concurrent operations)
- Auto-verification (confirms expected results)
- Rollback support (restore from backup with one command)
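The backup-before-delete and test-mode rules can be pictured with a small in-memory sketch. Here a plain dict keyed by `_id` stands in for a MongoDB collection, and `backup_then_delete` is a hypothetical name. The limit of 10 comes from the list above; everything else is an assumption for illustration.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List

TEST_MODE_LIMIT = 10  # test mode limits an operation to 10 records


def backup_then_delete(
    store: Dict[Any, Dict[str, Any]],
    ids_to_delete: List[Any],
    backup_path: Path,
    test_mode: bool = True,
) -> int:
    """Write the affected records to a JSON backup, then delete them from store."""
    if test_mode:
        ids_to_delete = ids_to_delete[:TEST_MODE_LIMIT]
    doomed = [store[i] for i in ids_to_delete if i in store]
    # The backup is written first; if this raises, nothing has been deleted yet.
    backup_path.write_text(json.dumps(doomed, indent=2))
    for record in doomed:
        del store[record["_id"]]
    return len(doomed)


if __name__ == "__main__":
    collection = {i: {"_id": i, "url": f"https://example.com/{i}"} for i in range(25)}
    with tempfile.TemporaryDirectory() as tmp:
        removed = backup_then_delete(collection, list(range(25)), Path(tmp) / "backup.json")
    print(removed, len(collection))
```

Writing the backup before the first delete means a failed write aborts the operation with the data still intact.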
3. State Management
- SQLite database (state.db) - Fast queries, filtering, concurrent operation management
- INDEX.yaml export - Git-friendly operation history
- Complete audit trail - Every action logged with INFO/WARNING/ERROR levels
4. Pre-defined Operations
- duplicate-cleanup - Remove duplicate records intelligently
- orphan-cleanup - Clean orphaned records
- framework-stats - Generate framework statistics
- verify-all-operations - Verify all completed operations
- Custom operations via YAML definitions
CLI Commands Overview
| Command | Description | Example |
|---|---|---|
| yirifi-dq new | Create operation (wizard or command mode) | yirifi-dq new duplicate-cleanup --database regdb --collection links --field url |
| yirifi-dq list | List/filter operations | yirifi-dq list --status completed --database regdb |
| yirifi-dq show <id> | Show operation details | yirifi-dq show OP-2025-001 |
| yirifi-dq execute <id> | Execute saved operation | yirifi-dq execute OP-2025-001 |
| yirifi-dq verify <id> | Verify operation results | yirifi-dq verify OP-2025-001 |
| yirifi-dq rollback <id> | Rollback with backup restore | yirifi-dq rollback OP-2025-001 |
| yirifi-dq stats | Framework statistics | yirifi-dq stats --database regdb |
| yirifi-dq logs <id> | View operation logs | yirifi-dq logs OP-2025-001 --level ERROR |
| yirifi-dq export-index | Export to INDEX.yaml for git | yirifi-dq export-index |
Complete reference: docs/reference/cli-commands.md
Project Structure
yirifi-data-fixes/
├── yirifi_dq/              # Main package
│   ├── commands/           # CLI commands (new, list, execute, etc.)
│   ├── tui/                # Interactive wizard screens
│   ├── engine/             # Orchestrator, safety, templates
│   ├── db/                 # State management (state.db, SQLite)
│   ├── models/             # Pydantic models (validation)
│   ├── config/             # Operation & category YAML definitions
│   ├── validators/         # Duplicate/orphan detection
│   ├── fixers/             # Remove duplicates, clean orphans
│   ├── analyzers/          # Field analysis, statistics
│   └── generators/         # Slug generation, etc.
│
├── docs/                   # Complete documentation
│   ├── user-guide/         # For data analysts & junior devs
│   ├── developer-guide/    # For senior devs & contributors
│   ├── tutorials/          # Step-by-step learning
│   ├── workflows/          # CLI-first operation guides
│   ├── reference/          # CLI commands, YAML specs, schemas
│   ├── troubleshooting/    # Common issues & solutions
│   └── architecture/       # Design decisions & patterns
│
├── databases/              # Operation folders (auto-created by CLI)
│   └── {db}/{collection}/{field}/{type}/{operation_name}/
│       ├── OPERATION.md    # Auto-generated documentation
│       ├── input/          # Input data (if needed)
│       ├── scripts/        # Generated scripts (if needed)
│       ├── output/         # Backups, reports, results
│       └── analysis/       # Analysis results
│
├── framework/              # Framework metadata
│   ├── INDEX.yaml          # Git-friendly operation history (auto-exported)
│   └── INDEX.json.legacy   # Legacy format (archived)
│
├── templates/              # Templates for manual operations
├── .env                    # MongoDB connection strings
└── README.md               # This file
Common Use Cases
Remove Duplicate URLs
# Interactive wizard (guided)
yirifi-dq new
# Or command mode (direct)
yirifi-dq new duplicate-cleanup \
--database regdb \
--collection links \
--field url \
--keep-strategy oldest \
--env DEV \
--test-mode \
--execute-now
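To make the --field and --keep-strategy options concrete, here is a minimal in-memory sketch of duplicate selection. It assumes each record carries a `created_at` timestamp that decides which copy is "oldest"; that field name and the function `plan_duplicate_cleanup` are hypothetical, and the real tool may order records differently.

```python
from collections import defaultdict
from typing import Any, Dict, List


def plan_duplicate_cleanup(
    records: List[Dict[str, Any]],
    field: str,
    keep_strategy: str = "oldest",
    timestamp_field: str = "created_at",  # assumed field name
) -> List[Any]:
    """Return the _id values to remove, keeping one record per distinct field value."""
    if keep_strategy not in ("oldest", "newest"):
        raise ValueError(f"Unsupported keep strategy: {keep_strategy!r}")

    groups: Dict[Any, List[Dict[str, Any]]] = defaultdict(list)
    for record in records:
        if field in record:  # records without the field are left untouched
            groups[record[field]].append(record)

    to_remove: List[Any] = []
    for group in groups.values():
        if len(group) < 2:
            continue  # a value that appears once is not a duplicate
        ordered = sorted(group, key=lambda r: r[timestamp_field])
        keeper = ordered[0] if keep_strategy == "oldest" else ordered[-1]
        to_remove.extend(r["_id"] for r in ordered if r is not keeper)
    return to_remove


if __name__ == "__main__":
    links = [
        {"_id": 1, "url": "https://a.example", "created_at": 10},
        {"_id": 2, "url": "https://a.example", "created_at": 20},
        {"_id": 3, "url": "https://b.example", "created_at": 30},
    ]
    print(plan_duplicate_cleanup(links, "url", "oldest"))  # [2]
```

Returning a removal plan instead of deleting directly leaves room for a backup and a test-mode limit before anything is changed.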
Clean Orphaned Articles
yirifi-dq new orphan-cleanup \
--database regdb \
--primary-collection links \
--foreign-collection articlesdocuments \
--primary-field link_yid \
--foreign-field articleYid \
--action delete \
--env DEV \
--test-mode
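The relationship check behind this command can be sketched with plain Python collections. The example assumes an orphan is a record in the foreign collection whose foreign-field value matches no primary-field value in the primary collection. That direction, and the helper name `find_orphans`, are assumptions for illustration.

```python
from typing import Any, Dict, Iterable, List


def find_orphans(
    primary: Iterable[Dict[str, Any]],
    foreign: Iterable[Dict[str, Any]],
    primary_field: str,
    foreign_field: str,
) -> List[Dict[str, Any]]:
    """Return foreign records whose reference has no match in the primary collection."""
    known = {doc[primary_field] for doc in primary if primary_field in doc}
    # A foreign record that lacks the field entirely is also reported as an orphan.
    return [doc for doc in foreign if doc.get(foreign_field) not in known]


if __name__ == "__main__":
    links = [{"link_yid": "L1"}, {"link_yid": "L2"}]
    articles = [
        {"_id": 1, "articleYid": "L1"},
        {"_id": 2, "articleYid": "L9"},
    ]
    print([doc["_id"] for doc in find_orphans(links, articles, "link_yid", "articleYid")])  # [2]
```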
View Framework Statistics
yirifi-dq stats
# Or for specific database/collection
yirifi-dq stats --database regdb --collection links
Rollback an Operation
# Dry-run preview first
yirifi-dq rollback OP-2025-001 --dry-run
# Then rollback for real
yirifi-dq rollback OP-2025-001
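The dry-run-then-restore pattern can be sketched as follows, assuming the backup is a JSON array of documents that each carry an `_id`. A dict stands in for the collection, and `restore_from_backup` is a hypothetical name rather than the tool's own function.

```python
import json
from pathlib import Path
from typing import Any, Dict, List, Union


def restore_from_backup(
    backup_path: Union[str, Path],
    store: Dict[Any, Dict[str, Any]],
    dry_run: bool = True,
) -> List[Any]:
    """Return the _id values missing from store; re-insert them unless dry_run is set."""
    backed_up = json.loads(Path(backup_path).read_text())
    missing = [doc for doc in backed_up if doc["_id"] not in store]
    if not dry_run:
        for doc in missing:
            store[doc["_id"]] = doc
    return [doc["_id"] for doc in missing]
```

Because only missing records are re-inserted, running the restore a second time changes nothing.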
Documentation
For AI Assistants (Claude Code)
- CLAUDE.md - Quick reference guide (450 lines, optimized for LLM parsing)
- CLAUDE_GUIDE.md - Comprehensive guide (650+ lines, all commands, workflows, architecture)
For Human Users
New to the CLI?
Running Operations?
Extending the Framework?
Having Problems?
Complete documentation hub: docs/README.md
Architecture (Simplified)
CLI/TUI Layer (yirifi-dq commands, interactive wizard)
↓
Orchestration Layer (workflow engine, safety enforcement)
↓
State Management Layer (SQLite state.db, INDEX.yaml export)
↓
Data Operations Layer (validators, fixers, analyzers, MongoDB utilities)
Data Operations Layer (validators, fixers, analyzers, MongoDB utilities)
4-layer architecture:
- CLI/TUI - User interface (commands + interactive wizard)
- Orchestration - Workflow coordination (folder creation, backup, execute, verify)
- State Management - Operation tracking (SQLite + INDEX.yaml)
- Data Operations - MongoDB utilities (validators, fixers, analyzers)
Complete architecture: docs/architecture/architecture-overview.md
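One way to picture the orchestration layer is as an ordered list of steps (backup, execute, verify) that stops at the first failure and records each outcome. The sketch below is a simplified model, not the project's engine. The `run_operation` function and the step callables are hypothetical.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], bool]]


def run_operation(steps: List[Step]) -> List[Tuple[str, str]]:
    """Run steps in order, stop at the first failure, and return an audit log."""
    log: List[Tuple[str, str]] = []
    for name, action in steps:
        try:
            ok = action()
        except Exception as exc:
            log.append(("ERROR", f"{name} raised {exc!r}"))
            break
        if not ok:
            log.append(("ERROR", f"{name} failed"))
            break
        log.append(("INFO", f"{name} completed"))
    return log


if __name__ == "__main__":
    steps = [
        ("backup", lambda: True),
        ("execute", lambda: True),
        ("verify", lambda: False),  # simulated verification failure
    ]
    for level, message in run_operation(steps):
        print(level, message)
```

Placing backup first in the list means a backup failure stops the run before any later step can touch data.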
Safety & Best Practices
Automatic Safety Features
- Mandatory backups before any deletion operation
- Test mode default (limits to 10 records unless explicitly disabled)
- Collection locks (prevents concurrent operations on same collection)
- Auto-verification after execution (count checks, orphan detection)
- Complete audit trail (all operations logged to SQLite)
Golden Rules
- ⚠️ Always back up before deletion (CLI does this automatically)
- ⚠️ Always test on DEV first or use --test-mode (CLI defaults to test mode)
- ⚠️ Always verify after execution (yirifi-dq verify <id>)
- ⚠️ Check cross-collection relationships (links and articlesdocuments)
- ⚠️ If verification fails → roll back immediately (yirifi-dq rollback <id>)
Framework Evolution
Phase 1: Bespoke Scripts (2024 Q4)
- Manual folder creation
- One-off Python scripts
- Manual backup/restore
- 30-60 minutes per operation
Phase 2: Utility Library (2025 Q1)
- Reusable validators, fixers
- Consistent error handling
- Still required manual orchestration
Phase 3: CLI/TUI Tool (2025 Q2 - Current)
- yirifi-dq CLI (9 commands)
- Interactive wizard (8 screens)
- State management (SQLite + INDEX.yaml)
- Automatic safety (backups, locks, verification)
- 2 minutes per operation
Contributing
Adding a New CLI Command
# 1. Create yirifi_dq/commands/my_command.py
import click

@click.command()
@click.option('--param', required=True)
def my_command(param: str):
    """Command description"""
    click.echo(f"Executing: {param}")

# 2. Register in yirifi_dq/main.py
from yirifi_dq.commands.my_command import my_command
cli.add_command(my_command)

# 3. Test (from the shell)
yirifi-dq my-command --param test
Complete guide: docs/developer-guide/adding-commands.md
Defining a New Operation Type
Create operation YAML:
# yirifi_dq/config/operations/my_operation.yaml
operation:
  id: my-operation
  name: My Operation
  description: What this operation does
  categories:
    - data_quality
  parameters:
    - name: my_param
      type: string
      required: true
  safety:
    requires_backup: true
  verification:
    - check: custom_check
Complete guide: docs/developer-guide/adding-operations.md
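A declared `parameters` list like the one in the YAML above lends itself to a simple pre-flight check on user input. The sketch below mirrors that section as a Python dict and validates supplied values against it. The `validate_params` function and the type mapping are assumptions for illustration; the project itself lists Pydantic models for validation.

```python
from typing import Any, Dict, List

# The parameters section of the YAML example, written as a plain Python structure.
DEFINITION: Dict[str, Any] = {
    "id": "my-operation",
    "parameters": [{"name": "my_param", "type": "string", "required": True}],
}

# Assumed mapping from YAML type names to Python types.
TYPE_MAP = {"string": str, "integer": int, "boolean": bool}


def validate_params(definition: Dict[str, Any], supplied: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the input is valid."""
    errors: List[str] = []
    declared = {p["name"]: p for p in definition.get("parameters", [])}
    for name, spec in declared.items():
        if name not in supplied:
            if spec.get("required", False):
                errors.append(f"missing required parameter: {name}")
            continue
        type_name = spec.get("type", "string")
        if not isinstance(supplied[name], TYPE_MAP.get(type_name, str)):
            errors.append(f"{name}: expected {type_name}")
    for name in supplied:
        if name not in declared:
            errors.append(f"unknown parameter: {name}")
    return errors


if __name__ == "__main__":
    print(validate_params(DEFINITION, {"my_param": "value"}))  # []
    print(validate_params(DEFINITION, {}))
```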
State Management
SQLite Database (state.db)
Tracks all operations with:
- operations table - Operation configs and status
- operation_logs table - Complete audit trail (INFO/WARNING/ERROR/DEBUG)
- operation_locks table - Collection-level locks (prevents concurrent operations)
- framework_stats table - Cumulative statistics
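Collection-level locking maps naturally onto a uniqueness constraint in SQLite. The sketch below uses an in-memory database and a hypothetical `operation_locks` schema. The column names are assumptions and may not match the real state.db.

```python
import sqlite3


def init_locks(conn: sqlite3.Connection) -> None:
    # Hypothetical schema: at most one row per (database, collection) pair.
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS operation_locks (
            database_name   TEXT NOT NULL,
            collection_name TEXT NOT NULL,
            operation_id    TEXT NOT NULL,
            PRIMARY KEY (database_name, collection_name)
        )
        """
    )


def acquire_lock(conn: sqlite3.Connection, database: str, collection: str, operation_id: str) -> bool:
    """Try to lock a collection; return False if another operation already holds it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO operation_locks VALUES (?, ?, ?)",
                (database, collection, operation_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def release_lock(conn: sqlite3.Connection, database: str, collection: str, operation_id: str) -> None:
    """Remove the lock row, but only if it belongs to the given operation."""
    with conn:
        conn.execute(
            "DELETE FROM operation_locks "
            "WHERE database_name = ? AND collection_name = ? AND operation_id = ?",
            (database, collection, operation_id),
        )


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    init_locks(conn)
    print(acquire_lock(conn, "regdb", "links", "OP-2025-001"))  # True
    print(acquire_lock(conn, "regdb", "links", "OP-2025-002"))  # False
```

Relying on the primary key makes the check and the insert a single atomic statement, so two processes cannot both believe they hold the lock.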
INDEX.yaml Export
Git-friendly YAML export of operation history:
- Human-readable
- Git-trackable (diffs, blame, history)
- Backward compatible with old INDEX.json workflow
- Auto-exported on operation completion
Troubleshooting
Command not found: yirifi-dq
pip install -e .
yirifi-dq --help
ModuleNotFoundError: No module named 'yirifi_dq'
# Ensure you're in project root
cd /path/to/yirifi-data-fixes
pip install -e .
Collection is locked
Another operation is running on this collection. Wait for it to complete or check locks:
yirifi-dq list --status executing
All troubleshooting: docs/troubleshooting/
License
Internal use only - Yirifi Data Quality Team
Support
- Quick questions? Check CLAUDE.md
- Comprehensive guide? See CLAUDE_GUIDE.md
- User docs? Browse docs/user-guide/
- Developer docs? See docs/developer-guide/
- Issues? Check docs/troubleshooting/
Framework Version: 2.0.0 (CLI/TUI)
Last Updated: 2025-11-16
Operations Tracked: See yirifi-dq stats
Success Stories
"From 60 minutes of manual scripting to 2 minutes with yirifi-dq new. Game changer!" - Data Team
"The interactive wizard is perfect for training new team members. No MongoDB knowledge required." - Team Lead
"I can finally automate data quality checks in our CI/CD pipeline with the CLI commands." - DevOps Engineer
Basic Commands
Check for linting issues
ruff check .
Auto-fix issues
ruff check . --fix
Format code
ruff format .
Check with statistics
ruff check . --statistics
Pre-commit Integration
Install pre-commit hooks (if not already done)
pre-commit install
Run all hooks manually
pre-commit run --all-files
Run only ruff
pre-commit run ruff --all-files
CI/CD Integration
Add to your CI pipeline:
ruff check .            # Fails if issues found
ruff format --check .   # Fails if formatting needed
Usage Examples
CLI Usage:
List all available scripts
yirifi-dq scripts list
Get detailed script information
yirifi-dq scripts info articles/duplicate-cleanup
Run a script
yirifi-dq run articles/duplicate-cleanup \
--database regdb \
--collection articlesdocuments \
--field slug \
--keep-strategy newest \
--test-mode
TUI Usage:
Launch TUI
yirifi-dq tui
Ready to get started? Run yirifi-dq new and follow the wizard!