
Terminal-based CLI/TUI for managing MongoDB data quality operations


Yirifi Data Quality - CLI/TUI Tool

MongoDB data quality operations made fast, safe, and repeatable

A comprehensive CLI/TUI tool for managing data quality across MongoDB databases. Think of it as TypeScript for your database operations - with safety rails, state tracking, and automation for common tasks.


🎯 What Is This?

Yirifi Data Quality (yirifi-dq) is a command-line tool that automates MongoDB data quality operations:

  • Remove duplicates - Find and clean duplicate records
  • Detect orphans - Identify and fix broken relationships
  • Normalize data - Standardize field values
  • Track everything - Complete audit trail of all operations
  • Safety first - Automatic backups, test mode, verification

From 60 minutes to 2 minutes per operation.


🚀 Quick Start (5 Minutes)

1. Install

# Clone the repository (or navigate to it)
cd /path/to/yirifi-data-fixes

# Install the CLI tool
pip install -e .

# Verify installation
yirifi-dq --help

2. Configure MongoDB Connection

Create a .env file:

cp .env.example .env

Add your connection strings:

DEV_MONGODB_URI=mongodb://localhost:27017/regdb_dev
UAT_MONGODB_URI=mongodb://your-uat-server/regdb
PRD_MONGODB_URI=mongodb://your-prod-server/regdb
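The naming pattern above (DEV, UAT, PRD followed by _MONGODB_URI) suggests a simple lookup by environment name. The following is a minimal sketch of that idea, not the tool's actual loader: the helper name resolve_mongodb_uri and its error messages are hypothetical, and it reads from a plain mapping so it can be tried without a real .env file.

```python
import os

# Hypothetical helper: maps an environment name (DEV, UAT, PRD) to the
# matching <ENV>_MONGODB_URI variable shown in the .env example.
VALID_ENVS = ("DEV", "UAT", "PRD")


def resolve_mongodb_uri(env, environ=None):
    """Return the connection string for env, or raise a clear error."""
    environ = os.environ if environ is None else environ
    name = env.upper()
    if name not in VALID_ENVS:
        raise ValueError(f"Unknown environment {env!r}; expected one of {VALID_ENVS}")
    key = f"{name}_MONGODB_URI"
    uri = environ.get(key, "").strip()
    if not uri:
        raise KeyError(f"{key} is not set; add it to your .env file")
    return uri


if __name__ == "__main__":
    sample = {"DEV_MONGODB_URI": "mongodb://localhost:27017/regdb_dev"}
    print(resolve_mongodb_uri("dev", sample))  # mongodb://localhost:27017/regdb_dev
```

Failing early on a missing variable keeps a typo in --env from silently pointing an operation at the wrong server.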

3. Run Your First Operation

Option A: Interactive Wizard (Recommended for first-time users)

yirifi-dq new

Follow the 8-screen guided workflow to create and execute an operation.

Option B: Command Mode (For power users)

# Remove duplicate URLs from links collection (test mode - only 10 records)
yirifi-dq new duplicate-cleanup \
  --database regdb \
  --collection links \
  --field url \
  --keep-strategy oldest \
  --env DEV \
  --test-mode \
  --execute-now

# Output:
# ✓ Operation OP-2025-001 created
# ✓ Backup created: output/backup_20250116_100000.json
# ✓ Found 5 duplicates, removed 5
# ✓ Verification passed

🎓 Who Is This For?

Data Analysts

Use the interactive wizard (yirifi-dq new) for guided workflows. No coding required!

Junior Developers

Use CLI commands for faster execution. Learn from examples in docs/tutorials/.

Senior Developers

Use CLI commands for automation, extend with custom operations, or write manual Python scripts for complex cases. See docs/developer-guide/.


✨ Key Features

1. Dual Interface

Interactive Wizard (TUI)

  • 8-screen guided workflow
  • Perfect for exploratory work
  • No MongoDB knowledge needed

Command Line (CLI)

  • 9 powerful commands
  • Scriptable and automatable
  • Tab completion support

2. Automatic Safety

  • Mandatory backups before deletion (no exceptions)
  • Test mode default (limits to 10 records for safety)
  • Collection locking (prevents concurrent operations)
  • Auto-verification (confirms expected results)
  • Rollback support (restore from backup with one command)

3. State Management

  • SQLite database (state.db) - Fast queries, filtering, concurrent operation management
  • INDEX.yaml export - Git-friendly operation history
  • Complete audit trail - Every action logged with INFO/WARNING/ERROR levels

4. Pre-defined Operations

  • duplicate-cleanup - Remove duplicate records intelligently
  • orphan-cleanup - Clean orphaned records
  • framework-stats - Generate framework statistics
  • verify-all-operations - Verify all completed operations
  • Custom operations via YAML definitions

📋 CLI Commands Overview

Command | Description | Example
yirifi-dq new | Create operation (wizard or command mode) | yirifi-dq new duplicate-cleanup --database regdb --collection links --field url
yirifi-dq list | List/filter operations | yirifi-dq list --status completed --database regdb
yirifi-dq show <id> | Show operation details | yirifi-dq show OP-2025-001
yirifi-dq execute <id> | Execute saved operation | yirifi-dq execute OP-2025-001
yirifi-dq verify <id> | Verify operation results | yirifi-dq verify OP-2025-001
yirifi-dq rollback <id> | Rollback with backup restore | yirifi-dq rollback OP-2025-001
yirifi-dq stats | Framework statistics | yirifi-dq stats --database regdb
yirifi-dq logs <id> | View operation logs | yirifi-dq logs OP-2025-001 --level ERROR
yirifi-dq export-index | Export to INDEX.yaml for git | yirifi-dq export-index

Complete reference: docs/reference/cli-commands.md


๐Ÿ—‚๏ธ Project Structure

yirifi-data-fixes/
├── yirifi_dq/               # Main package
│   ├── commands/            # CLI commands (new, list, execute, etc.)
│   ├── tui/                 # Interactive wizard screens
│   ├── engine/              # Orchestrator, safety, templates
│   ├── db/                  # State management (state.db, SQLite)
│   ├── models/              # Pydantic models (validation)
│   ├── config/              # Operation & category YAML definitions
│   ├── validators/          # Duplicate/orphan detection
│   ├── fixers/              # Remove duplicates, clean orphans
│   ├── analyzers/           # Field analysis, statistics
│   └── generators/          # Slug generation, etc.
│
├── docs/                    # Complete documentation
│   ├── user-guide/          # For data analysts & junior devs
│   ├── developer-guide/     # For senior devs & contributors
│   ├── tutorials/           # Step-by-step learning
│   ├── workflows/           # CLI-first operation guides
│   ├── reference/           # CLI commands, YAML specs, schemas
│   ├── troubleshooting/     # Common issues & solutions
│   └── architecture/        # Design decisions & patterns
│
├── databases/               # Operation folders (auto-created by CLI)
│   └── {db}/{collection}/{field}/{type}/{operation_name}/
│       ├── OPERATION.md     # Auto-generated documentation
│       ├── input/           # Input data (if needed)
│       ├── scripts/         # Generated scripts (if needed)
│       ├── output/          # Backups, reports, results
│       └── analysis/        # Analysis results
│
├── framework/               # Framework metadata
│   ├── INDEX.yaml           # Git-friendly operation history (auto-exported)
│   └── INDEX.json.legacy    # Legacy format (archived)
│
├── templates/               # Templates for manual operations
├── .env                     # MongoDB connection strings
└── README.md                # This file

🎯 Common Use Cases

Remove Duplicate URLs

# Interactive wizard (guided)
yirifi-dq new

# Or command mode (direct)
yirifi-dq new duplicate-cleanup \
  --database regdb \
  --collection links \
  --field url \
  --keep-strategy oldest \
  --env DEV \
  --test-mode \
  --execute-now
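To make the --keep-strategy and --test-mode options concrete, here is a minimal in-memory sketch of how duplicates could be grouped and one record kept per value. It is an illustration only, not the package's fixer: the function name plan_duplicate_cleanup, the created_at field used for ordering, and the way the limit is applied are all assumptions.

```python
from collections import defaultdict


def plan_duplicate_cleanup(records, field, keep_strategy="oldest", limit=None):
    """Return the _id values to delete, keeping one record per duplicated value.

    Assumes each record is a dict with "_id", "created_at" and the given field.
    """
    if keep_strategy not in ("oldest", "newest"):
        raise ValueError(f"unsupported keep strategy: {keep_strategy}")

    # Group records by the value of the field being de-duplicated.
    groups = defaultdict(list)
    for record in records:
        groups[record[field]].append(record)

    to_delete = []
    for group in groups.values():
        if len(group) < 2:
            continue  # value is unique, nothing to clean
        ordered = sorted(
            group,
            key=lambda r: r["created_at"],
            reverse=(keep_strategy == "newest"),
        )
        to_delete.extend(r["_id"] for r in ordered[1:])  # ordered[0] is kept

    # A limit stands in for test mode (assumed here to cap deletions).
    return to_delete[:limit] if limit is not None else to_delete


if __name__ == "__main__":
    links = [
        {"_id": 1, "url": "https://a.example", "created_at": "2025-01-01"},
        {"_id": 2, "url": "https://a.example", "created_at": "2025-01-03"},
        {"_id": 3, "url": "https://b.example", "created_at": "2025-01-02"},
        {"_id": 4, "url": "https://a.example", "created_at": "2025-01-02"},
    ]
    print(plan_duplicate_cleanup(links, "url"))                          # [4, 2]
    print(plan_duplicate_cleanup(links, "url", keep_strategy="newest"))  # [4, 1]
```

Returning a deletion plan instead of deleting directly makes it easy to back up exactly the affected records first.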

Clean Orphaned Articles

yirifi-dq new orphan-cleanup \
  --database regdb \
  --primary-collection links \
  --foreign-collection articlesdocuments \
  --primary-field link_yid \
  --foreign-field articleYid \
  --action delete \
  --env DEV \
  --test-mode
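The orphan check behind a command like this can be sketched as a set lookup. The example below assumes that an orphan is a record in the foreign collection whose reference matches no record in the primary collection, and that a missing reference also counts as an orphan; both are assumptions for illustration, as is the function name find_orphans.

```python
def find_orphans(primary, foreign, primary_field, foreign_field):
    """Return foreign records whose reference matches no primary record."""
    # Collect every key that exists on the primary side.
    known = {doc[primary_field] for doc in primary if primary_field in doc}
    # A foreign record is an orphan when its reference is absent or unknown.
    return [doc for doc in foreign if doc.get(foreign_field) not in known]


if __name__ == "__main__":
    links = [{"link_yid": "L1"}, {"link_yid": "L2"}]
    articles = [
        {"_id": 10, "articleYid": "L1"},  # matches a link
        {"_id": 11, "articleYid": "L9"},  # points at a link that does not exist
        {"_id": 12},                      # has no reference at all
    ]
    orphans = find_orphans(links, articles, "link_yid", "articleYid")
    print([doc["_id"] for doc in orphans])  # [11, 12]
```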

View Framework Statistics

yirifi-dq stats

# Or for specific database/collection
yirifi-dq stats --database regdb --collection links

Rollback an Operation

# Dry-run preview first
yirifi-dq rollback OP-2025-001 --dry-run

# Then rollback for real
yirifi-dq rollback OP-2025-001
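As a rough picture of what a backup-based rollback involves, the sketch below writes records to a timestamped JSON file (named like the backup in the Quick Start output) and restores whatever is missing, with a dry-run option. It is not the tool's implementation: the function names are hypothetical, the collection is a plain list, and records are assumed to hold only JSON-serializable values.

```python
import json
import tempfile
from datetime import datetime
from pathlib import Path


def backup_path(output_dir, when):
    """Build a name like backup_20250116_100000.json inside output_dir."""
    return Path(output_dir) / f"backup_{when:%Y%m%d_%H%M%S}.json"


def write_backup(records, output_dir, when):
    """Write records to a JSON backup file and return its path."""
    path = backup_path(output_dir, when)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(records, indent=2), encoding="utf-8")
    return path


def restore_missing(current, backup_file, dry_run=False):
    """Re-add backed-up records whose _id is absent from current.

    With dry_run=True nothing is changed; the records that would be
    restored are only reported.
    """
    saved = json.loads(Path(backup_file).read_text(encoding="utf-8"))
    present = {doc["_id"] for doc in current}
    missing = [doc for doc in saved if doc["_id"] not in present]
    if not dry_run:
        current.extend(missing)
    return missing


if __name__ == "__main__":
    docs = [{"_id": 1, "url": "https://a.example"}, {"_id": 2, "url": "https://a.example"}]
    with tempfile.TemporaryDirectory() as tmp:
        backup = write_backup(docs, Path(tmp) / "output", datetime(2025, 1, 16, 10, 0, 0))
        print(backup.name)                                  # backup_20250116_100000.json
        docs.pop()                                          # simulate a deletion
        print(restore_missing(docs, backup, dry_run=True))  # preview only
        restore_missing(docs, backup)                       # actual restore
        print(len(docs))                                    # 2
```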

📚 Documentation

For AI Assistants (Claude Code)

  • CLAUDE.md - Quick reference guide (450 lines, optimized for LLM parsing)
  • CLAUDE_GUIDE.md - Comprehensive guide (650+ lines, all commands, workflows, architecture)

For Human Users

New to the CLI?

Running Operations?

Extending the Framework?

Having Problems?

Complete documentation hub: docs/README.md


๐Ÿ—๏ธ Architecture (Simplified)

CLI/TUI Layer (yirifi-dq commands, interactive wizard)
    ↓
Orchestration Layer (workflow engine, safety enforcement)
    ↓
State Management Layer (SQLite state.db, INDEX.yaml export)
    ↓
Data Operations Layer (validators, fixers, analyzers, MongoDB utilities)

4-layer architecture:

  1. CLI/TUI - User interface (commands + interactive wizard)
  2. Orchestration - Workflow coordination (folder creation, backup, execute, verify)
  3. State Management - Operation tracking (SQLite + INDEX.yaml)
  4. Data Operations - MongoDB utilities (validators, fixers, analyzers)
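The orchestration step (backup, execute, verify) can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the engine's code: run_operation is a hypothetical name, the collection is a plain list standing in for MongoDB, and the automatic restore on failure is a design choice of this sketch.

```python
import copy


def run_operation(collection, execute, verify):
    """Back up, execute, verify; restore the backup if anything fails.

    collection is a list of dicts standing in for a MongoDB collection.
    execute mutates it in place; verify returns True when results look right.
    """
    backup = copy.deepcopy(collection)       # step 1: backup before any change
    try:
        execute(collection)                  # step 2: run the operation
        if not verify(collection):           # step 3: verify the outcome
            raise RuntimeError("verification failed")
    except Exception as exc:
        collection[:] = backup               # step 4: restore on any failure
        return {"status": "rolled_back", "error": str(exc)}
    return {"status": "completed", "error": None}


def drop_last(docs):
    docs.pop()


def expect_one(docs):
    return len(docs) == 1


def expect_none(docs):
    return len(docs) == 0


if __name__ == "__main__":
    docs = [{"_id": 1}, {"_id": 2}]
    print(run_operation(docs, drop_last, expect_one))   # status: completed
    print(run_operation(docs, drop_last, expect_none))  # status: rolled_back
    print(docs)                                         # [{'_id': 1}]
```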

Complete architecture: docs/architecture/architecture-overview.md


๐Ÿ›ก๏ธ Safety & Best Practices

Automatic Safety Features

  • ✅ Mandatory backups before any deletion operation
  • ✅ Test mode default (limits to 10 records unless explicitly disabled)
  • ✅ Collection locks (prevents concurrent operations on same collection)
  • ✅ Auto-verification after execution (count checks, orphan detection)
  • ✅ Complete audit trail (all operations logged to SQLite)

Golden Rules

  1. โš ๏ธ Always backup before deletion (CLI does this automatically)
  2. โš ๏ธ Always test on DEV first or use --test-mode (CLI defaults to test mode)
  3. โš ๏ธ Always verify after execution (yirifi-dq verify <id>)
  4. โš ๏ธ Check cross-collection relationships (links โ†” articlesdocuments)
  5. โš ๏ธ If verification fails โ†’ Rollback immediately (yirifi-dq rollback <id>)

🔄 Framework Evolution

Phase 1: Bespoke Scripts (2024 Q4)

  • Manual folder creation
  • One-off Python scripts
  • Manual backup/restore
  • 30-60 minutes per operation

Phase 2: Utility Library (2025 Q1)

  • Reusable validators, fixers
  • Consistent error handling
  • Still required manual orchestration

Phase 3: CLI/TUI Tool (2025 Q2 - Current)

  • yirifi-dq CLI (9 commands)
  • Interactive wizard (8 screens)
  • State management (SQLite + INDEX.yaml)
  • Automatic safety (backups, locks, verification)
  • 2 minutes per operation ✨

๐Ÿค Contributing

Adding a New CLI Command

# 1. Create yirifi_dq/commands/my_command.py
import click

@click.command()
@click.option('--param', required=True)
def my_command(param: str):
    """Command description"""
    click.echo(f"Executing: {param}")

# 2. Register in yirifi_dq/main.py
from yirifi_dq.commands.my_command import my_command
cli.add_command(my_command)

# 3. Test (run from your shell, not in Python)
yirifi-dq my-command --param test

Complete guide: docs/developer-guide/adding-commands.md

Defining a New Operation Type

Create operation YAML:

# yirifi_dq/config/operations/my_operation.yaml
operation:
  id: my-operation
  name: My Operation
  description: What this operation does
  categories:
    - data_quality
  parameters:
    - name: my_param
      type: string
      required: true
  safety:
    requires_backup: true
  verification:
    - check: custom_check

Complete guide: docs/developer-guide/adding-operations.md


📊 State Management

SQLite Database (state.db)

Tracks all operations with:

  • operations table - Operation configs and status
  • operation_logs table - Complete audit trail (INFO/WARNING/ERROR/DEBUG)
  • operation_locks table - Collection-level locks (prevents concurrent operations)
  • framework_stats table - Cumulative statistics
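Collection-level locking maps naturally onto a uniqueness constraint in SQLite. The sketch below shows one way this could work, using an in-memory database. Only the table names operations and operation_locks come from the list above; the column names, the composite primary key, and the helper functions are assumptions for illustration, not the real state.db schema.

```python
import sqlite3

# Assumed schema: one lock row per (database, collection) pair.
SCHEMA = """
CREATE TABLE operations (
    id TEXT PRIMARY KEY,
    database_name TEXT NOT NULL,
    collection_name TEXT NOT NULL,
    status TEXT NOT NULL
);
CREATE TABLE operation_locks (
    database_name TEXT NOT NULL,
    collection_name TEXT NOT NULL,
    operation_id TEXT NOT NULL,
    PRIMARY KEY (database_name, collection_name)
);
"""


def acquire_lock(conn, operation_id, database, collection):
    """Return True if the lock was taken, False if the collection is locked."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO operation_locks VALUES (?, ?, ?)",
                (database, collection, operation_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # primary key clash: another operation holds the lock


def release_lock(conn, operation_id):
    """Remove every lock held by the given operation."""
    with conn:
        conn.execute(
            "DELETE FROM operation_locks WHERE operation_id = ?", (operation_id,)
        )


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    print(acquire_lock(conn, "OP-2025-001", "regdb", "links"))  # True
    print(acquire_lock(conn, "OP-2025-002", "regdb", "links"))  # False
    release_lock(conn, "OP-2025-001")
    print(acquire_lock(conn, "OP-2025-002", "regdb", "links"))  # True
```

Letting the database enforce uniqueness avoids a check-then-insert race between two operations started at the same time.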

INDEX.yaml Export

Git-friendly YAML export of operation history:

  • Human-readable
  • Git-trackable (diffs, blame, history)
  • Backward compatible with old INDEX.json workflow
  • Auto-exported on operation completion

🆘 Troubleshooting

Command not found: yirifi-dq

pip install -e .
yirifi-dq --help

ModuleNotFoundError: No module named 'yirifi_dq'

# Ensure you're in project root
cd /path/to/yirifi-data-fixes
pip install -e .

Collection is locked

Another operation is running on this collection. Wait for it to complete or check locks:

yirifi-dq list --status executing

All troubleshooting: docs/troubleshooting/


๐Ÿ“ License

Internal use only - Yirifi Data Quality Team


📞 Support


Framework Version: 2.0.0 (CLI/TUI)
Last Updated: 2025-11-16
Operations Tracked: See yirifi-dq stats


🎉 Success Stories

"From 60 minutes of manual scripting to 2 minutes with yirifi-dq new. Game changer!" - Data Team

"The interactive wizard is perfect for training new team members. No MongoDB knowledge required." - Team Lead

"I can finally automate data quality checks in our CI/CD pipeline with the CLI commands." - DevOps Engineer


Basic Commands

Check for linting issues

ruff check .

Auto-fix issues

ruff check . --fix

Format code

ruff format .

Check with statistics

ruff check . --statistics

Pre-commit Integration

Install pre-commit hooks (if not already done)

pre-commit install

Run all hooks manually

pre-commit run --all-files

Run only ruff

pre-commit run ruff --all-files

CI/CD Integration

Add to your CI pipeline:

ruff check .            # Fails if issues found
ruff format --check .   # Fails if formatting needed

Usage Examples

CLI Usage:

List all available scripts

yirifi-dq scripts list

Get detailed script information

yirifi-dq scripts info articles/duplicate-cleanup

Run a script

yirifi-dq run articles/duplicate-cleanup \
  --database regdb \
  --collection articlesdocuments \
  --field slug \
  --keep-strategy newest \
  --test-mode

TUI Usage:

Launch TUI

yirifi-dq tui

Ready to get started? Run yirifi-dq new and follow the wizard! 🚀
