Data Validation Gini (DVG) CLI for row count and row/column comparison with HTML reports

Data Validation Gini (DVG)

Data Validation Gini is a lightweight Python CLI for validating source and target datasets and generating a rich HTML reconciliation report.

The repository also includes a CSV data mutation utility (data_corruptor.py) to create controlled mismatches for validation testing.

What This Project Does

  • Compares source vs target files using row-level and cell-level checks.
  • Supports CSV and Excel (.xlsx, .xlsm, .xltx) inputs.
  • Supports single-sheet and multi-sheet validation (via sheet mapping).
  • Produces a styled, filterable HTML report with KPI summary cards.
  • Includes repeatable batch scripts for common mutation and validation scenarios.

Current Validation Modes

  • ROWCOUNT: checks source/target data row counts.
  • ROW_COL_VALIDATION: checks headers and row/column values.
  • Combined mode: pass both as comma-separated values:
    • ROWCOUNT,ROW_COL_VALIDATION
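
As a rough illustration of how a comma-separated mode string could be split and checked, here is a minimal sketch. The helper name `parse_validation_types` is hypothetical, and the case-insensitive handling is an assumption; dvg.py may parse this argument differently.

```python
# Hypothetical helper: not taken from dvg.py, shown only to illustrate the idea.
KNOWN_MODES = ("ROWCOUNT", "ROW_COL_VALIDATION")


def parse_validation_types(raw):
    """Split a comma-separated mode string into a validated, de-duplicated list."""
    modes = []
    for token in raw.split(","):
        mode = token.strip().upper()  # assumption: input is treated case-insensitively
        if not mode:
            continue  # tolerate stray commas such as "ROWCOUNT,"
        if mode not in KNOWN_MODES:
            raise ValueError(f"Unknown validation type: {mode!r}")
        if mode not in modes:
            modes.append(mode)
    if not modes:
        raise ValueError("At least one validation type is required")
    return modes


print(parse_validation_types("ROWCOUNT,ROW_COL_VALIDATION"))
```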

Key Features in Current Implementation

  • Header mismatch detection:
    • header length mismatches
    • header name mismatches
  • Row alignment using preferred key columns:
    • employee_id, id, emp_id, record_id, pk
    • falls back to first column if no preferred key exists
  • Mismatch classification:
    • CELL
    • SRC_ONLY
    • TGT_ONLY
    • HEADER_LENGTH
    • HEADER_NAME
    • ROWCOUNT
  • HTML report KPIs:
    • SRC Count
    • TGT Count
    • PASSED
    • FAILED
    • Pass Rate
    • Failed Rate
    • SRC Only
    • TGT Only
  • Per-column filter inputs in mismatch table for quick triage.
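
The key selection and mismatch classification described above can be sketched roughly as follows. This is not the project's code: the function names are hypothetical, and the sketch assumes case-insensitive header matching, priority in the listed order, unique key values, and identical headers on both sides.

```python
# Illustrative sketch only; names and matching rules are assumptions.
PREFERRED_KEYS = ("employee_id", "id", "emp_id", "record_id", "pk")


def pick_key_column(headers):
    """Return the index of the first preferred key found, else 0 (first column)."""
    lowered = [h.strip().lower() for h in headers]
    for key in PREFERRED_KEYS:
        if key in lowered:
            return lowered.index(key)
    return 0


def compare_rows(headers, src_rows, tgt_rows):
    """Align rows by key and classify differences as CELL, SRC_ONLY, or TGT_ONLY."""
    key_idx = pick_key_column(headers)
    src = {row[key_idx]: row for row in src_rows}
    tgt = {row[key_idx]: row for row in tgt_rows}
    mismatches = []
    for key, row in src.items():
        if key not in tgt:
            mismatches.append(("SRC_ONLY", key, None))
            continue
        for col, s_val, t_val in zip(headers, row, tgt[key]):
            if s_val != t_val:
                mismatches.append(("CELL", key, col))
    for key in tgt:
        if key not in src:
            mismatches.append(("TGT_ONLY", key, None))
    return mismatches


headers = ["name", "emp_id", "email"]
source = [["Ann", "1", "a@x"], ["Bob", "2", "b@x"]]
target = [["Ann", "1", "a@y"], ["Cy", "3", "c@x"]]
for mismatch in compare_rows(headers, source, target):
    print(mismatch)
```

Aligning on a key rather than on row position means a single inserted or deleted row shows up as one SRC_ONLY or TGT_ONLY entry instead of shifting every later row into a CELL mismatch.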

Requirements

  • Python 3.9+
  • Packages:
    • openpyxl
    • pytest (for tests)
    • python-dotenv

Install dependencies:

pip install -r requirements.txt

Quick Start (Windows Batch Flow)

From project root:

scripts\001_env.bat
scripts\002_activate.bat
scripts\003_setup.bat

Run all mutation scenarios:

scripts\004_run.bat

Run a DVG validation and generate HTML:

scripts\dvg.bat

Run sheet mapping validation (Excel to Excel):

scripts\006_run_sheet_mapping.bat

Deactivate venv:

scripts\008_deactivate.bat

CLI Usage

DVG Validator

python dvg.py \
  --file-type EXCEL \
  --src-path inputs/employees.csv \
  --tgt-path outputs/employees.csv \
  --validation-type ROWCOUNT,ROW_COL_VALIDATION \
  --html-output "output/report_<datetime>.html"

Optional arguments:

  • --src-sheet <sheet_name>
  • --tgt-sheet <sheet_name>
  • --sheet-mapping "SRC1:TGT1,SRC2:TGT2"
  • --chunk-size <positive_int> (default: 1000)
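
To make the --sheet-mapping format concrete, here is a minimal sketch of turning a "SRC1:TGT1,SRC2:TGT2" string into a source-to-target dictionary. The helper name `parse_sheet_mapping` is hypothetical and the error handling is an assumption, not the behavior of dvg.py.

```python
# Hypothetical parser for the "SRC:TGT,SRC:TGT" format; not the project's own code.
def parse_sheet_mapping(raw):
    """Return a dict that maps each source sheet name to its target sheet name."""
    mapping = {}
    for pair in raw.split(","):
        pair = pair.strip()
        if not pair:
            continue  # ignore empty segments from trailing commas
        src, sep, tgt = pair.partition(":")
        if not sep or not src.strip() or not tgt.strip():
            raise ValueError(f"Expected SRC:TGT, got {pair!r}")
        mapping[src.strip()] = tgt.strip()
    return mapping


print(parse_sheet_mapping("SRC1:TGT1,SRC2:TGT2"))
```

Excel does not allow ":" in sheet names, so the colon is a safe separator within a pair; sheet names that contain a comma would not survive this simple split.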

Notes:

  • --sheet-mapping is supported only for Excel file pairs.
  • --file-type currently accepts EXCEL (for both CSV and Excel processing paths).
  • <datetime> token in --html-output is replaced at runtime with YYYYMMDD_HHMMSS.
  • --chunk-size controls the number of data rows read per batch for CSV/XLSX loading.
  • Console output now shows chunk progress for source/target loading: total chunks, current chunk, and completion summary.
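
The <datetime> substitution can be pictured with a short sketch like the one below. The function name `resolve_html_output` is hypothetical; only the token and the YYYYMMDD_HHMMSS format come from the notes above.

```python
from datetime import datetime


# Hypothetical helper illustrating the token replacement; dvg.py may differ.
def resolve_html_output(template, now=None):
    """Replace the literal <datetime> token with a YYYYMMDD_HHMMSS timestamp."""
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return template.replace("<datetime>", stamp)


print(resolve_html_output("output/report_<datetime>.html"))
```

Because < and > are redirection operators in both cmd.exe and POSIX shells, the path should be quoted when it is passed on the command line.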

Large-file tuning tip:

  • Start with --chunk-size 1000 (default), then increase to 2000 or 5000 for faster reads if memory allows.
  • In dvg.bat, set CHUNK_SIZE in the config block to tune batch size without changing CLI commands.
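
For intuition about what a chunk size trades off, the following sketch reads a CSV in fixed-size batches using only the standard library. It is an assumption about how chunked loading might look, not the loader used by dvg.py; `iter_csv_chunks` is a hypothetical name.

```python
import csv
from itertools import islice


# Illustrative chunked reader; the real loader in dvg.py may be structured differently.
def iter_csv_chunks(path, chunk_size=1000):
    """Yield (header, rows) pairs with at most chunk_size data rows each."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer")
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, None)
        if header is None:
            return  # empty file: nothing to yield
        while True:
            chunk = list(islice(reader, chunk_size))
            if not chunk:
                break
            yield header, chunk
```

A larger chunk size means fewer iterations and less per-batch overhead, at the cost of holding more rows in memory at once.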

Installed CLI Entry Point

If installed as a package, you can run:

dvg --file-type EXCEL --src-path ... --tgt-path ... --validation-type ROWCOUNT

Data Mutation Utility (data_corruptor.py)

Use this utility to generate controlled data drift before validation.

Example:

python data_corruptor.py \
  --input inputs/employees.csv \
  --output outputs/employees_typos.csv \
  --column email \
  --percentage 1.0 \
  --type typo

Batch Scripts for Mutation Scenarios

Located in the scripts/ folder:

  • run_case_swap.bat - Swap character cases
  • run_date_shift.bat - Shift dates by random days
  • run_nullify.bat - Replace values with NULL/empty
  • run_numeric_shift.bat - Shift numeric values
  • run_typo.bat - Introduce character typos

Example:

scripts\run_case_swap.bat

Supported mutation types:

  • nullify
    • Replaces selected values with blank strings.
    • Purpose: validate missing-value detection.
  • case_swap
    • Swaps letter casing in selected values.
    • Purpose: validate case sensitivity behavior.
  • numeric_shift
    • Adds/subtracts a numeric offset (--value).
    • Purpose: validate precision and tolerance checks.
  • date_shift
    • Shifts date/datetime values by day count (--value).
    • Supported formats: YYYY-MM-DD, YYYY-MM-DD HH:MM:SS.
    • Purpose: validate temporal drift handling.
  • typo
    • Randomly replaces one character in selected strings.
    • Purpose: validate strict text/hash mismatch detection.
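
A minimal sketch of the typo and case_swap ideas is shown below. It is not taken from data_corruptor.py: the function names are hypothetical, the `fraction` parameter (0.0 to 1.0) is this sketch's own convention rather than the scale used by --percentage, and the replacement alphabet is an assumption.

```python
import random
import string


# Hypothetical mutation helpers; data_corruptor.py may select and alter values differently.
def apply_typo(value, rng):
    """Replace exactly one character with a different lowercase letter."""
    if not value:
        return value
    pos = rng.randrange(len(value))
    choices = [c for c in string.ascii_lowercase if c != value[pos]]
    return value[:pos] + rng.choice(choices) + value[pos + 1:]


def apply_case_swap(value, rng):
    """Swap upper and lower case; rng is unused but keeps the signature uniform."""
    return value.swapcase()


def mutate_column(rows, column, fraction, mutate, seed=0):
    """Return copies of rows, mutating `column` in roughly `fraction` of them."""
    rng = random.Random(seed)  # fixed seed keeps the mutation repeatable
    out = []
    for row in rows:
        row = dict(row)  # copy so the input rows are left untouched
        if rng.random() < fraction:
            row[column] = mutate(row[column], rng)
        out.append(row)
    return out


rows = [{"email": "ann@example.com"}, {"email": "bob@example.com"}]
print(mutate_column(rows, "email", 1.0, apply_typo, seed=42))
```

Seeding the random generator is what makes a mutation scenario repeatable, which matters when the mutated file is used as a fixed test input for validation.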

Sample Scenario Scripts

Each mutation scenario script (run_case_swap.bat, run_date_shift.bat, run_nullify.bat, run_numeric_shift.bat, run_typo.bat) mutates inputs/employees.csv into a corresponding file under outputs/.

Reports

Generated reports are written under output/ and include:

  • high-level pass/fail status
  • validation metadata (source, target, validation type, timestamp)
  • KPI cards
  • detailed mismatch table with filters

Tests

Run tests with:

pytest

Security Audits

The project includes comprehensive security scanning with automated HTML report generation. See SECURITY_AUDITS.md for detailed documentation.

Quick Start

Run all security audits:

scripts\013_run_all_security_audits.bat

Or on Linux/macOS:

bash scripts/013_run_all_security_audits.sh

Individual audit scripts:

  • scripts/010_run_pip_audit.bat - Scan Python dependencies for known vulnerabilities
  • scripts/011_run_trivy_audit.bat - Scan filesystem for misconfigurations and secrets
  • scripts/012_run_gitleaks_audit.bat - Detect accidentally committed secrets

Reports Generated:

  • audits/pip_audit_report.html - Dependency vulnerability report
  • audits/trivy_fs_report.html - Filesystem audit report
  • audits/gitleaks_report.html - Secret detection report
  • audits/security_audit_consolidated.html - Master consolidated report

Install Security Tools:

# Windows (Chocolatey)
choco install trivy gitleaks
pip install pip-audit

# macOS (Homebrew)
brew install trivy gitleaks
pip install pip-audit

See SECURITY_AUDITS.md for:

  • Detailed tool documentation
  • CI/CD integration examples
  • Troubleshooting guides
  • Report interpretation tips

MCP Server

This project now ships a small MCP server for the CLI. Start it with:

dvg-mcp

The server exposes four tools:

  • run_validation - run the existing file comparison workflow and return a structured result.
  • preview_input - inspect a CSV or Excel file without loading the full dataset.
  • mutate_data - create a controlled CSV mutation using the same corruption rules as the CLI helper.
  • get_last_report - read the latest HTML report and return the KPI summary.
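
For readers curious what a client sends under the hood, MCP tool invocations are JSON-RPC 2.0 requests with the method tools/call, written as one JSON object per line on the stdio transport. The sketch below only builds such a message. The argument names (src_path, tgt_path, validation_type) are hypothetical placeholders, since the actual input schema of run_validation is not documented here, and a real client must complete the MCP initialize handshake before calling tools.

```python
import json


def build_tool_call(request_id, tool_name, arguments):
    """Serialize an MCP tools/call request as a single newline-terminated line."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    return json.dumps(message) + "\n"


# Argument names below are assumptions for illustration, not the server's real schema.
line = build_tool_call(
    1,
    "run_validation",
    {
        "src_path": "inputs/employees.csv",
        "tgt_path": "outputs/employees.csv",
        "validation_type": "ROWCOUNT",
    },
)
print(line, end="")
```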

IDE Setup

VS Code

Option 1: Using .vscode/settings.json

Create or edit .vscode/settings.json in your workspace:

{
  "github.copilot.codeium.enabled": true,
  "mcp.servers": [
    {
      "name": "data-validation-gini",
      "command": "dvg-mcp",
      "cwd": "c:\\MyProjects\\data-validation-gini",
      "transport": "stdio",
      "disabled": false
    }
  ]
}

Option 2: Using VS Code MCP Extension Settings

  1. Open Command Palette (Ctrl+Shift+P)
  2. Search for "MCP: Add Server"
  3. Configure with:
    • Name: data-validation-gini
    • Command: dvg-mcp
    • Working Directory: c:\MyProjects\data-validation-gini
    • Transport: stdio

Option 3: Using Copilot Chat Extension Settings

Edit settings.json with Copilot-specific MCP configuration:

{
  "chat.mcp.servers": [
    {
      "name": "data-validation-gini",
      "command": "dvg-mcp",
      "cwd": "c:\\MyProjects\\data-validation-gini",
      "args": [],
      "env": {
        "PYTHONPATH": "c:\\MyProjects\\data-validation-gini"
      }
    }
  ]
}

Cursor

Using cursor_settings.json

Edit your Cursor settings file (usually in %APPDATA%\Cursor\User\settings.json on Windows):

{
  "mcp.servers": [
    {
      "name": "data-validation-gini",
      "command": "dvg-mcp",
      "cwd": "c:\\MyProjects\\data-validation-gini",
      "transport": "stdio",
      "timeout": 30000
    }
  ]
}

Alternatively, use Cursor's GUI:

  1. Open Cursor Settings
  2. Navigate to "MCP Servers"
  3. Click "Add Server"
  4. Enter the configuration above

Claude Desktop

Using claude_desktop_config.json

Edit %APPDATA%\Claude\claude_desktop_config.json on Windows:

{
  "mcpServers": {
    "data-validation-gini": {
      "command": "dvg-mcp",
      "args": [],
      "cwd": "c:\\MyProjects\\data-validation-gini",
      "env": {
        "PYTHONPATH": "c:\\MyProjects\\data-validation-gini"
      }
    }
  }
}

JetBrains IDEs (PyCharm, IntelliJ IDEA)

Using IDE Settings (MCP Plugin)

If using a JetBrains MCP integration plugin:

  1. Open Settings > Tools > MCP Servers (or similar)
  2. Click Add and configure:
{
  "type": "custom",
  "name": "data-validation-gini",
  "command": "dvg-mcp",
  "workingDirectory": "c:\\MyProjects\\data-validation-gini",
  "stdio": true,
  "disabled": false,
  "environment": {
    "PYTHONPATH": "c:\\MyProjects\\data-validation-gini"
  }
}

Neovim (with MCP Client Plugin)

Using neovim/init.lua or MCP plugin config

Example for a Neovim MCP plugin:

require('mcp').register_server({
  name = "data-validation-gini",
  command = "dvg-mcp",
  cwd = "c:\\MyProjects\\data-validation-gini",
  transport = "stdio"
})

Or in YAML if using a config file:

servers:
  - name: data-validation-gini
    command: dvg-mcp
    cwd: c:\MyProjects\data-validation-gini
    transport: stdio

Generic MCP Clients (Python, Node.js, etc.)

For Python clients:

import subprocess

mcp_server = {
    "name": "data-validation-gini",
    "command": "dvg-mcp",
    "args": [],
    "cwd": "c:\\MyProjects\\data-validation-gini",
    "transport": "stdio"
}

# Start server
process = subprocess.Popen(
    [mcp_server["command"]] + mcp_server.get("args", []),
    cwd=mcp_server["cwd"],
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
    text=True
)

For Node.js/JavaScript clients:

const { spawn } = require('child_process');

const mcpServer = {
  name: 'data-validation-gini',
  command: 'dvg-mcp',
  cwd: 'c:\\MyProjects\\data-validation-gini',
  transport: 'stdio'
};

// Named "child" to avoid shadowing Node's global "process" object
const child = spawn(mcpServer.command, [], {
  cwd: mcpServer.cwd,
  stdio: ['pipe', 'pipe', 'pipe']
});

Other IDEs and MCP clients

  1. Use any IDE or assistant that supports MCP over stdio.
  2. Register the server command as dvg-mcp.
  3. Set the working directory to the repository root so relative paths like inputs/ and output/ resolve correctly.
  4. Make sure the project dependencies are installed before launching the server.

Key Configuration Properties:

| Property | Value | Required | Notes |
|---|---|---|---|
| command | dvg-mcp | Yes | The entry point for the MCP server |
| cwd / workingDirectory | c:\MyProjects\data-validation-gini | Yes | Path to project root (enables relative file paths) |
| transport | stdio | Yes | Communication protocol (HTTP and other protocols not supported) |
| timeout | 30000 | No | Timeout in milliseconds (default: 30s) |
| disabled | false | No | Set to true to temporarily disable the server |
| env.PYTHONPATH | Project root path | No | Helps Python resolve imports correctly |

Natural Language Usage

You can talk to the server in plain English and let the client translate that into tool calls.

Example requests:

  • "Compare these two CSV files with chunk size 5000 and save a report."
  • "Preview the first 5 rows of this XLSX sheet before I validate it."
  • "Mutate the email column in this CSV using the typo mode at 1%."
  • "Show me the latest report summary and pass/fail counts."
  • "Validate this Excel workbook with the departments sheet mapped to departments."
  • "Run a row-count check only and use the default chunk size."

The server defaults to chunk size 1000 when you do not specify one.

Project Structure (High Level)

Core Files

  • dvg.py - validation CLI
  • dvg_report.py - HTML report generation
  • data_corruptor.py - mutation utility
  • dvg_mcp.py - MCP server for the CLI
  • README.md - Main documentation
  • SECURITY_AUDITS.md - Security audit scripts documentation

Scripts Folder (scripts/)

Setup & Environment:

  • 001_env.bat/sh - Python environment setup
  • 002_activate.bat/sh - Activate virtual environment
  • 003_setup.bat/sh - Install dependencies
  • 008_deactivate.bat/sh - Deactivate virtual environment

Validation & Mutation:

  • dvg.bat - Run DVG validation
  • 004_run.bat - Run all data mutation scenarios
  • 006_run_sheet_mapping.bat - Run sheet mapping validation

Data Mutation Scripts:

  • run_case_swap.bat - Character case swapping
  • run_date_shift.bat - Date shifting
  • run_nullify.bat - Nullify values
  • run_numeric_shift.bat - Numeric value shifting
  • run_typo.bat - Introduce typos

Security Audits:

  • 010_run_pip_audit.bat/sh - Run pip-audit security scan
  • 011_run_trivy_audit.bat/sh - Run Trivy filesystem audit
  • 012_run_gitleaks_audit.bat/sh - Run GitLeaks secret detection
  • 013_run_all_security_audits.bat/sh - Run all security audits (consolidated)

Directories

  • inputs/ - baseline sample datasets
  • outputs/ - mutated sample datasets
  • output/ - generated validation report files
  • audits/ - generated security audit reports (JSON & HTML)
  • tests/ - unit tests
  • data_validation_gini.egg-info/ - package metadata

License

MIT
