Data Validation Gini (DVG) CLI for row count and row/column comparison with HTML reports
Data Validation Gini (DVG)
Data Validation Gini is a lightweight Python CLI for validating source and target datasets and generating a rich HTML reconciliation report.
The repository also includes a CSV data mutation utility (data_corruptor.py) to create controlled mismatches for validation testing.
Latest Updates
- Migrated to a src/ package layout (data_validation_gini) while preserving root-level compatibility wrappers.
- Added reusable file I/O classes:
  - IniConfigStore for INI read/write operations
  - JsonFileStore for JSON read/write operations
- Refactored test and coverage scripts for reliable local execution on Windows and Linux/macOS.
- Expanded automated tests and achieved 100% package coverage for data_validation_gini.
What This Project Does
- Compares source vs target files using row-level and cell-level checks.
- Supports CSV and Excel (.xlsx, .xlsm, .xltx) inputs.
- Supports single-sheet and multi-sheet validation (via sheet mapping).
- Produces a styled, filterable HTML report with KPI summary cards.
- Includes repeatable batch scripts for common mutation and validation scenarios.
Current Validation Modes
- ROWCOUNT: checks source/target data row counts.
- ROW_COL_VALIDATION: checks headers and row/column values.
- Combined mode: pass both as comma-separated values: ROWCOUNT,ROW_COL_VALIDATION
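The comma-separated mode value can be split and checked before any file is read. The following is a minimal sketch of that idea, not DVG's actual parser: the function name parse_validation_types, the case-insensitive matching, and the de-duplication are all assumptions made for illustration.

```python
# Hypothetical helper: DVG's real argument handling may differ.
VALID_MODES = {"ROWCOUNT", "ROW_COL_VALIDATION"}


def parse_validation_types(raw: str) -> list:
    """Split a comma-separated mode string into a validated, ordered list."""
    modes = []
    for part in raw.split(","):
        name = part.strip().upper()  # assumption: modes are matched case-insensitively
        if not name:
            continue  # tolerate stray commas such as "ROWCOUNT,"
        if name not in VALID_MODES:
            raise ValueError(f"Unsupported validation type: {name}")
        if name not in modes:
            modes.append(name)  # keep first-seen order, drop repeats
    if not modes:
        raise ValueError("At least one validation type is required")
    return modes
```

Rejecting unknown names up front means a typo in --validation-type fails immediately instead of after the source and target files have been loaded.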
Key Features in Current Implementation
- Header mismatch detection:
  - header length mismatches
  - header name mismatches
- Row alignment using preferred key columns:
  - employee_id, id, emp_id, record_id, pk
  - falls back to first column if no preferred key exists
- Mismatch classification:
  - CELL
  - SRC_ONLY
  - TGT_ONLY
  - HEADER_LENGTH
  - HEADER_NAME
  - ROWCOUNT
- HTML report KPIs:
  - SRC Count
  - TGT Count
  - PASSED
  - FAILED
  - Pass Rate
  - Failed Rate
  - SRC Only
  - TGT Only
- Per-column filter inputs in mismatch table for quick triage.
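To make the row alignment and mismatch classes above concrete, here is a small sketch of key-based comparison. It is an illustration under stated assumptions rather than DVG's implementation: the helper names pick_key_column and classify_rows are hypothetical, the preferred keys are assumed to be tried in the listed order with exact (case-sensitive) header matching, and rows are assumed to be dictionaries sharing one header list.

```python
# Hypothetical sketch of key-based row alignment; not DVG's actual code.
PREFERRED_KEYS = ("employee_id", "id", "emp_id", "record_id", "pk")


def pick_key_column(headers):
    """Return the first preferred key present, else fall back to the first column."""
    for name in PREFERRED_KEYS:
        if name in headers:
            return name
    return headers[0]


def classify_rows(src_rows, tgt_rows, headers):
    """Align rows on the key column and label each difference."""
    key = pick_key_column(headers)
    src = {row[key]: row for row in src_rows}
    tgt = {row[key]: row for row in tgt_rows}
    mismatches = []
    for key_value, row in src.items():
        if key_value not in tgt:
            mismatches.append(("SRC_ONLY", key_value, None))
            continue
        for column in headers:
            if row[column] != tgt[key_value][column]:
                mismatches.append(("CELL", key_value, column))
    for key_value in tgt:
        if key_value not in src:
            mismatches.append(("TGT_ONLY", key_value, None))
    return mismatches
```

Aligning on a key rather than on row position keeps a single inserted or deleted row from cascading into a cell mismatch on every row that follows it.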
Requirements
- Python 3.9+
- Packages:
  - openpyxl
  - pytest (for tests)
  - python-dotenv
Install dependencies:
pip install -r requirements.txt
Quick Start (Windows Batch Flow)
From project root:
scripts\001_env.bat
scripts\002_activate.bat
scripts\003_setup.bat
Run all mutation scenarios:
scripts\004_run.bat
Run a DVG validation and generate HTML:
scripts\dvg.bat
Run sheet mapping validation (Excel to Excel):
scripts\006_run_sheet_mapping.bat
Deactivate venv:
scripts\008_deactivate.bat
CLI Usage
DVG Validator
python dvg.py \
--file-type EXCEL \
--src-path inputs/employees.csv \
--tgt-path outputs/employees.csv \
--validation-type ROWCOUNT,ROW_COL_VALIDATION \
--html-output output/report_<datetime>.html
Optional arguments:
- --src-sheet <sheet_name>
- --tgt-sheet <sheet_name>
- --sheet-mapping "SRC1:TGT1,SRC2:TGT2"
- --chunk-size <positive_int> (default: 1000)
Notes:
- --sheet-mapping is supported only for Excel file pairs.
- --file-type currently accepts EXCEL (for both CSV and Excel processing paths).
- The <datetime> token in --html-output is replaced at runtime with YYYYMMDD_HHMMSS.
- --chunk-size controls the number of data rows read per batch for CSV/XLSX loading.
- Console output shows chunk progress for source/target loading: total chunks, current chunk, and completion summary.
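The <datetime> substitution described in the notes can be sketched in a few lines. This is an assumed illustration of the behavior, not the code DVG ships: resolve_output_path is a hypothetical name, and the optional now parameter exists only to make the example reproducible.

```python
from datetime import datetime


def resolve_output_path(template, now=None):
    """Replace the <datetime> token with a YYYYMMDD_HHMMSS timestamp."""
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return template.replace("<datetime>", stamp)


# Example with a fixed clock so the result is deterministic.
example = resolve_output_path("output/report_<datetime>.html", datetime(2024, 1, 2, 3, 4, 5))
```

A path without the token passes through unchanged, so a fixed report name still works.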
Large-file tuning tip:
- Start with --chunk-size 1000 (default), then increase to 2000 or 5000 for faster reads if memory allows.
- In dvg.bat, set CHUNK_SIZE in the config block to tune batch size without changing CLI commands.
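For intuition on what the chunk size trades off, the sketch below reads a CSV in fixed-size batches using only the standard library. It is a hypothetical illustration, not DVG's loader: the name iter_csv_chunks, the UTF-8 encoding, and the assumption that the first row is the header are all choices made for this example.

```python
import csv
from itertools import islice


def iter_csv_chunks(path, chunk_size=1000):
    """Yield (header, rows) pairs with at most chunk_size data rows each."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer")
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, None)  # assumption: first row holds column names
        if header is None:
            return  # empty file: nothing to yield
        while True:
            rows = list(islice(reader, chunk_size))
            if not rows:
                break
            yield header, rows
```

Only one batch of rows is held by the reader at a time, so a larger chunk size means fewer iterations at the cost of a larger batch in memory.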
Installed CLI Entry Point
If installed as a package, you can run:
dvg --file-type EXCEL --src-path ... --tgt-path ... --validation-type ROWCOUNT
Data Mutation Utility (data_corruptor.py)
Use this utility to generate controlled data drift before validation.
Example:
python data_corruptor.py \
--input inputs/employees.csv \
--output outputs/employees_typos.csv \
--column email \
--percentage 1.0 \
--type typo
Batch Scripts for Mutation Scenarios
Located in the scripts/ folder:
- run_case_swap.bat - Swap character cases
- run_date_shift.bat - Shift dates by random days
- run_nullify.bat - Replace values with NULL/empty
- run_numeric_shift.bat - Shift numeric values
- run_typo.bat - Introduce character typos
Example:
scripts\run_case_swap.bat
Supported mutation types:
- nullify
  - Replaces selected values with blank strings.
  - Purpose: validate missing-value detection.
- case_swap
  - Swaps letter casing in selected values.
  - Purpose: validate case sensitivity behavior.
- numeric_shift
  - Adds/subtracts a numeric offset (--value).
  - Purpose: validate precision and tolerance checks.
- date_shift
  - Shifts date/datetime values by day count (--value).
  - Supported formats: YYYY-MM-DD, YYYY-MM-DD HH:MM:SS.
  - Purpose: validate temporal drift handling.
- typo
  - Randomly replaces one character in selected strings.
  - Purpose: validate strict text/hash mismatch detection.
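As an illustration of the date_shift rule, the sketch below shifts a value by a number of days while preserving whichever of the two supported formats it arrived in. It is an assumed reading of the behavior, not the code in data_corruptor.py: shift_date is a hypothetical name, and returning unparseable values unchanged is an assumption.

```python
from datetime import datetime, timedelta

# The two formats listed above, longest first so datetimes are tried before dates.
DATE_FORMATS = ("%Y-%m-%d %H:%M:%S", "%Y-%m-%d")


def shift_date(value, days):
    """Shift a date or datetime string by `days`, keeping its original format."""
    for fmt in DATE_FORMATS:
        try:
            parsed = datetime.strptime(value, fmt)
        except ValueError:
            continue  # not this format, try the next one
        return (parsed + timedelta(days=days)).strftime(fmt)
    return value  # assumption: values in other formats are left untouched
```

For example, shift_date("2024-02-28", 2) crosses the leap day and returns "2024-03-01", a change that a cell-level comparison reports as a mismatch on that column.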
Sample Scenario Scripts
- run_case_swap.bat
- run_date_shift.bat
- run_nullify.bat
- run_numeric_shift.bat
- run_typo.bat
Each script mutates inputs/employees.csv into a corresponding file under outputs/.
Reports
Generated reports are written under output/ and include:
- high-level pass/fail status
- validation metadata (source, target, validation type, timestamp)
- KPI cards
- detailed mismatch table with filters
Tests
Run tests with:
pytest
Local Test Scripts
Windows:
scripts\005_run_unit_tests.bat
scripts\005_run_code_cov.bat
Linux/macOS:
bash scripts/005_run_unit_tests.sh
bash scripts/005_run_code_cov.sh
Coverage command used by the scripts:
python -m pytest --cov=data_validation_gini --cov-report=term-missing --cov-report=html
Current target and baseline: 100% coverage for package modules under src/data_validation_gini.
Security Audits
The project includes comprehensive security scanning with automated HTML report generation. See docs/security/SECURITY_AUDITS.md for detailed documentation.
Quick Start
Run all security audits:
scripts\013_run_all_security_audits.bat
Or on Linux/macOS:
bash scripts/013_run_all_security_audits.sh
Individual audit scripts:
- scripts/010_run_pip_audit.bat - Scan Python dependencies for known vulnerabilities
- scripts/011_run_trivy_audit.bat - Scan filesystem for misconfigurations and secrets
- scripts/012_run_gitleaks_audit.bat - Detect accidentally committed secrets
Reports Generated:
- audits/pip_audit_report.html - Dependency vulnerability report
- audits/trivy_fs_report.html - Filesystem audit report
- audits/gitleaks_report.html - Secret detection report
Install Security Tools:
# Windows (Chocolatey)
choco install trivy gitleaks
pip install pip-audit
# macOS (Homebrew)
brew install trivy gitleaks
pip install pip-audit
See docs/security/SECURITY_AUDITS.md for:
- Detailed tool documentation
- CI/CD integration examples
- Troubleshooting guides
- Report interpretation tips
MCP Server
This project now ships a small MCP server for the CLI. Start it with:
dvg-mcp
The server exposes four tools:
- run_validation - run the existing file comparison workflow and return a structured result.
- preview_input - inspect a CSV or Excel file without loading the full dataset.
- mutate_data - create a controlled CSV mutation using the same corruption rules as the CLI helper.
- get_last_report - read the latest HTML report and return the KPI summary.
Additional tools:
- run_db_validation - run DB-to-DB table validation and return structured status/output.
- list_db_tables_tool - list available user tables for a configured DB alias.
- read_json_file - read and parse a JSON file.
- write_json_file - write structured payloads to JSON.
IDE Setup
VS Code
Option 1: Using .vscode/settings.json
Create or edit .vscode/settings.json in your workspace:
{
"github.copilot.codeium.enabled": true,
"mcp.servers": [
{
"name": "data-validation-gini",
"command": "dvg-mcp",
"cwd": "c:\\MyProjects\\data-validation-gini",
"transport": "stdio",
"disabled": false
}
]
}
Option 2: Using VS Code MCP Extension Settings
- Open Command Palette (Ctrl+Shift+P)
- Search for "MCP: Add Server"
- Configure with:
  - Name: data-validation-gini
  - Command: dvg-mcp
  - Working Directory: c:\MyProjects\data-validation-gini
  - Transport: stdio
Option 3: Using Copilot Chat Extension Settings
Edit settings.json with Copilot-specific MCP configuration:
{
"chat.mcp.servers": [
{
"name": "data-validation-gini",
"command": "dvg-mcp",
"cwd": "c:\\MyProjects\\data-validation-gini",
"args": [],
"env": {
"PYTHONPATH": "c:\\MyProjects\\data-validation-gini"
}
}
]
}
Cursor
Using cursor_settings.json
Edit your Cursor settings file (usually in %APPDATA%\Cursor\User\settings.json on Windows):
{
"mcp.servers": [
{
"name": "data-validation-gini",
"command": "dvg-mcp",
"cwd": "c:\\MyProjects\\data-validation-gini",
"transport": "stdio",
"timeout": 30000
}
]
}
Alternatively, use Cursor's GUI:
- Open Cursor Settings
- Navigate to "MCP Servers"
- Click "Add Server"
- Enter the configuration above
Claude Desktop
Using claude_desktop_config.json
Edit %APPDATA%\Claude\claude_desktop_config.json on Windows:
{
"mcpServers": {
"data-validation-gini": {
"command": "dvg-mcp",
"args": [],
"cwd": "c:\\MyProjects\\data-validation-gini",
"env": {
"PYTHONPATH": "c:\\MyProjects\\data-validation-gini"
}
}
}
}
JetBrains IDEs (PyCharm, IntelliJ IDEA)
Using IDE Settings (MCP Plugin)
If using a JetBrains MCP integration plugin:
- Open Settings → Tools → MCP Servers (or similar)
- Click Add and configure:
{
"type": "custom",
"name": "data-validation-gini",
"command": "dvg-mcp",
"workingDirectory": "c:\\MyProjects\\data-validation-gini",
"stdio": true,
"disabled": false,
"environment": {
"PYTHONPATH": "c:\\MyProjects\\data-validation-gini"
}
}
Neovim (with MCP Client Plugin)
Using neovim/init.lua or MCP plugin config
Example for a Neovim MCP plugin:
require('mcp').register_server({
name = "data-validation-gini",
command = "dvg-mcp",
cwd = "c:\\MyProjects\\data-validation-gini",
transport = "stdio"
})
Or in YAML if using a config file:
servers:
  - name: data-validation-gini
    command: dvg-mcp
    cwd: c:\MyProjects\data-validation-gini
    transport: stdio
Generic MCP Clients (Python, Node.js, etc.)
For Python clients:
import subprocess

# Server definition; cwd should point at the repository root.
mcp_server = {
    "name": "data-validation-gini",
    "command": "dvg-mcp",
    "args": [],
    "cwd": "c:\\MyProjects\\data-validation-gini",
    "transport": "stdio"
}

# Start server and talk to it over stdin/stdout
process = subprocess.Popen(
    [mcp_server["command"]] + mcp_server.get("args", []),
    cwd=mcp_server["cwd"],
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
    text=True
)
For Node.js/JavaScript clients:
const { spawn } = require('child_process');

const mcpServer = {
  name: 'data-validation-gini',
  command: 'dvg-mcp',
  cwd: 'c:\\MyProjects\\data-validation-gini',
  transport: 'stdio'
};

// Named "child" so it does not shadow Node's global "process" object.
const child = spawn(mcpServer.command, [], {
  cwd: mcpServer.cwd,
  stdio: ['pipe', 'pipe', 'pipe']
});
Other IDEs and MCP clients
- Use any IDE or assistant that supports MCP over stdio.
- Register the server command as dvg-mcp.
- Set the working directory to the repository root so relative paths like inputs/ and output/ resolve correctly.
- Make sure the project dependencies are installed before launching the server.
Key Configuration Properties:
| Property | Value | Required | Notes |
|---|---|---|---|
| command | dvg-mcp | Yes | The entry point for the MCP server |
| cwd / workingDirectory | c:\MyProjects\data-validation-gini | Yes | Path to project root (enables relative file paths) |
| transport | stdio | Yes | Communication protocol (HTTP and other protocols not supported) |
| timeout | 30000 | No | Timeout in milliseconds (default: 30s) |
| disabled | false | No | Set to true to temporarily disable the server |
| env.PYTHONPATH | Project root path | No | Helps Python resolve imports correctly |
Natural Language Usage
You can talk to the server in plain English and let the client translate that into tool calls.
Example requests:
- "Compare these two CSV files with chunk size 5000 and save a report."
- "Preview the first 5 rows of this XLSX sheet before I validate it."
- "Mutate the email column in this CSV using the typo mode at 1%."
- "Show me the latest report summary and pass/fail counts."
- "Validate this Excel workbook with the departments sheet mapped to departments."
- "Run a row-count check only and use the default chunk size."
The server defaults to chunk size 1000 when you do not specify one.
Project Structure (High Level)
Core Files
- src/data_validation_gini/dvg.py - validation CLI implementation
- src/data_validation_gini/dvg_report.py - HTML report generation
- src/data_validation_gini/data_corruptor.py - mutation utility implementation
- src/data_validation_gini/dvg_mcp.py - MCP server implementation
- src/data_validation_gini/dvg_db.py - database connectivity and table loading
- src/data_validation_gini/file_stores.py - INI/JSON file reader-writer classes
- dvg.py, dvg_db.py, dvg_mcp.py, dvg_report.py, data_corruptor.py - root compatibility wrappers
- README.md - main documentation
- docs/CONTRIBUTING.md - contributor workflow and repository boundaries
- docs/security/SECURITY_AUDITS.md - security audit scripts documentation
Scripts Folder (scripts/)
Setup & Environment:
- 001_env.bat/sh - Python environment setup
- 002_activate.bat/sh - Activate virtual environment
- 003_setup.bat/sh - Install dependencies
- 008_deactivate.bat/sh - Deactivate virtual environment
Domain Implementations:
- scripts/data/ - operational data workflows (mutations, sheet mapping, DB startup/seed/compare)
- scripts/testing/ - local test and coverage workflows
- scripts/security/ - security audit workflows and consolidated run
Compatibility Wrappers (root scripts):
- Existing root scripts remain valid (for example 004_run.bat, 005_run_unit_tests.bat, 010_run_pip_audit.bat).
- Each wrapper forwards to the new domain script path so existing entrypoints and automation remain unchanged.
Validation & CLI:
- dvg.bat/sh - Run DVG validation
Directories
- inputs/ - baseline sample datasets
- outputs/ - mutated sample datasets
- output/ - generated validation report files
- audits/ - generated security audit reports (JSON & HTML)
- tests/ - unit tests
- data_validation_gini.egg-info/ - package metadata
License
MIT