Data Validation Gini (DVG) CLI for cross-platform data validation with file-to-DB, DB-to-file, and DB-to-DB support

Data Validation Gini (DVG)

Data Validation Gini is a lightweight Python CLI for validating source and target datasets and generating a rich HTML reconciliation report.

The repository also includes a CSV data mutation utility (data_corruptor.py) to create controlled mismatches for validation testing.

Latest Updates (v0.3.15)

  • NEW: File-to-Database & Database-to-File Validation ⭐ MAJOR FEATURE
    • CSV ↔ PostgreSQL, MySQL, SQLite (12 combinations total)
    • Excel ↔ PostgreSQL, MySQL, SQLite
    • Validate data loads without export/import workflows
    • Suited to ETL validation and data migration verification
    • Examples: scripts/data/020_csv_to_sqlite.bat
  • --version flag - Reports the installed CLI version
    • Use python dvg.py --version or dvg --version
    • Version accessible via data_validation_gini.__version__
  • SCHEMA_VALIDATION - Full implementation of schema validation:
    • Validates column count, column names, and inferred data types
    • Detects INTEGER, FLOAT, BOOLEAN, DATE, and STRING types from sample data
    • Can be combined with ROWCOUNT_VALIDATION and ROW_COL_VALIDATION
    • See scripts/data/007_run_schema_validation.bat for examples
  • Migrated to a src/ package layout (data_validation_gini) while preserving root-level compatibility wrappers.
  • Enhanced CLI contract with explicit source/target kind flags (--src-kind, --tgt-kind) and compatibility shims.
  • Added canonical validation-type normalization (ROWCOUNT alias -> ROWCOUNT_VALIDATION).
  • Added mismatch capping with --max-mismatches.
  • Added reusable file I/O classes:
    • IniConfigStore for INI read/write operations
    • JsonFileStore for JSON read/write operations
  • Refactored test and coverage scripts for reliable local execution on Windows and Linux/macOS.
  • Expanded automated tests and achieved 100% package coverage for data_validation_gini.
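The type detection described under SCHEMA_VALIDATION above (INTEGER, FLOAT, BOOLEAN, DATE, and STRING inferred from sample data) can be illustrated with a minimal sketch. This is not the package's implementation: the function names infer_type and infer_column_type, the order of the checks, the accepted date formats, and the rule that mixed INTEGER/FLOAT samples resolve to FLOAT are all assumptions made for illustration.

```python
from datetime import datetime

# Hypothetical date formats; the real detector may accept others.
DATE_FORMATS = ("%Y-%m-%d", "%Y-%m-%d %H:%M:%S")


def infer_type(value: str) -> str:
    """Classify one text value as INTEGER, FLOAT, BOOLEAN, DATE, or STRING."""
    text = value.strip()
    if text.lower() in {"true", "false"}:
        return "BOOLEAN"
    try:
        int(text)
        return "INTEGER"
    except ValueError:
        pass
    try:
        float(text)
        return "FLOAT"
    except ValueError:
        pass
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(text, fmt)
            return "DATE"
        except ValueError:
            continue
    return "STRING"


def infer_column_type(samples):
    """Infer a single type for a column from sample values, ignoring blanks."""
    kinds = {infer_type(s) for s in samples if s.strip()}
    if not kinds:
        return "STRING"
    if len(kinds) == 1:
        return kinds.pop()
    # Assumed widening rule: integers mixed with floats are reported as FLOAT.
    if kinds == {"INTEGER", "FLOAT"}:
        return "FLOAT"
    return "STRING"
```

Under these assumptions, a SCHEMA_DATA_TYPE mismatch would correspond to infer_column_type returning different labels for the same column in source and target.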

What This Project Does

  • Compares source vs target files using row-level and cell-level checks.
  • Supports CSV and Excel (.xlsx, .xlsm, .xltx) inputs.
  • Supports single-sheet and multi-sheet validation (via sheet mapping).
  • Produces a styled, filterable HTML report with KPI summary cards.
  • Includes repeatable batch scripts for common mutation and validation scenarios.

Current Validation Modes

  • ROWCOUNT_VALIDATION: checks source/target data row counts.
  • ROWCOUNT: compatibility alias of ROWCOUNT_VALIDATION.
  • ROW_COL_VALIDATION: checks headers and row/column values.
  • SCHEMA_VALIDATION: checks column count, column names (order-sensitive), and inferred data types.
  • Combined mode: pass multiple as comma-separated values:
    • ROWCOUNT_VALIDATION,ROW_COL_VALIDATION
    • SCHEMA_VALIDATION,ROW_COL_VALIDATION
    • ROWCOUNT_VALIDATION,SCHEMA_VALIDATION,ROW_COL_VALIDATION
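As a rough illustration of how a comma-separated --validation-type value could be normalized to the canonical names listed above, consider the sketch below. The function name normalize_validation_types, the upper-casing of input, the de-duplication, and the error raised for unknown names are assumptions; only the mode names and the ROWCOUNT alias come from this document.

```python
CANONICAL_TYPES = ("ROWCOUNT_VALIDATION", "ROW_COL_VALIDATION", "SCHEMA_VALIDATION")
ALIASES = {"ROWCOUNT": "ROWCOUNT_VALIDATION"}


def normalize_validation_types(raw: str) -> list[str]:
    """Split a comma-separated value and map aliases to canonical names."""
    result = []
    for token in raw.split(","):
        name = token.strip().upper()  # assumption: input is case-insensitive
        if not name:
            continue
        name = ALIASES.get(name, name)
        if name not in CANONICAL_TYPES:
            raise ValueError(f"Unknown validation type: {token.strip()!r}")
        if name not in result:  # keep first occurrence, drop repeats
            result.append(name)
    return result
```

For example, normalize_validation_types("ROWCOUNT,ROW_COL_VALIDATION") yields the two canonical names in the order given.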

Key Features in Current Implementation

  • Header mismatch detection:
    • header length mismatches
    • header name mismatches
  • Row alignment using preferred key columns:
    • employee_id, id, emp_id, record_id, pk
    • falls back to first column if no preferred key exists
  • Mismatch classification:
    • CELL - cell value mismatch
    • SRC_ONLY - value in source only
    • TGT_ONLY - value in target only
    • HEADER_LENGTH - header column count mismatch
    • HEADER_NAME - header name mismatch
    • ROWCOUNT - row count mismatch
    • SCHEMA_COLUMN_COUNT - schema column count mismatch
    • SCHEMA_COLUMN_NAME - schema column name mismatch
    • SCHEMA_DATA_TYPE - schema data type mismatch (INTEGER, FLOAT, BOOLEAN, DATE, STRING)
  • HTML report KPIs:
    • SRC Count
    • TGT Count
    • PASSED
    • FAILED
    • Pass Rate
    • Failed Rate
    • SRC Only
    • TGT Only
  • Per-column filter inputs in mismatch table for quick triage.
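The key-column preference and the CELL, SRC_ONLY, and TGT_ONLY classifications above can be sketched as follows. This is an illustrative reading, not the package's code: the names choose_key and compare_rows are hypothetical, the preferred keys are assumed to be tried in the listed order with exact-case matching, and SRC_ONLY/TGT_ONLY are assumed to mean a key present on one side only.

```python
PREFERRED_KEYS = ("employee_id", "id", "emp_id", "record_id", "pk")


def choose_key(header):
    """Pick the first preferred key found in the header, else the first column."""
    for name in PREFERRED_KEYS:
        if name in header:
            return name
    return header[0]


def compare_rows(header, src_rows, tgt_rows):
    """Align rows (dicts keyed by column name) on the key and list mismatches.

    Returns (kind, key_value, column) tuples; column is None for row-level kinds.
    """
    key = choose_key(header)
    src = {row[key]: row for row in src_rows}
    tgt = {row[key]: row for row in tgt_rows}
    mismatches = []
    for key_value, src_row in src.items():
        if key_value not in tgt:
            mismatches.append(("SRC_ONLY", key_value, None))
            continue
        for column in header:
            if src_row[column] != tgt[key_value][column]:
                mismatches.append(("CELL", key_value, column))
    for key_value in tgt:
        if key_value not in src:
            mismatches.append(("TGT_ONLY", key_value, None))
    return mismatches
```

Aligning on a key rather than on row position means a single inserted or deleted row produces one SRC_ONLY or TGT_ONLY entry instead of a cascade of CELL mismatches.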

Requirements

  • Python 3.9+
  • Packages:
    • openpyxl
    • pytest (for tests)
    • python-dotenv

Install dependencies:

pip install -r requirements.txt

Quick Start (Windows Batch Flow)

From project root:

scripts\001_env.bat
scripts\002_activate.bat
scripts\003_setup.bat

Run all mutation scenarios:

scripts\004_run.bat

Run a DVG validation and generate HTML:

scripts\dvg.bat

Run sheet mapping validation (Excel to Excel):

scripts\006_run_sheet_mapping.bat

Deactivate venv:

scripts\008_deactivate.bat

CLI Usage

Version Information

Check the installed version:

python dvg.py --version
# or if installed as package:
dvg --version

DVG Validator

python dvg.py \
  --src-kind csv \
  --tgt-kind csv \
  --src-path inputs/employees.csv \
  --tgt-path outputs/employees.csv \
  --validation-type ROWCOUNT_VALIDATION,ROW_COL_VALIDATION \
  --html-output output/report_<datetime>.html

Legacy compatibility mode is still available:

python dvg.py \
  --file-type CSV \
  --src-path inputs/employees.csv \
  --tgt-path outputs/employees.csv \
  --validation-type ROWCOUNT,ROW_COL_VALIDATION

Optional arguments:

  • --src-sheet <sheet_name>
  • --tgt-sheet <sheet_name>
  • --sheet-mapping "SRC1:TGT1,SRC2:TGT2"
  • --chunk-size <positive_int> (default: 1000)
  • --src-db-alias <alias>, --tgt-db-alias <alias>
  • --src-env <env>, --tgt-env <env>, --allow-cross-env
  • --max-mismatches <int>
  • --key-mode <AUTO|PRIMARY_KEY|COLUMNS|GROUP_CANONICAL|HASH>
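The --sheet-mapping value shown above uses a "SRC:TGT" pair format separated by commas. A minimal parsing sketch follows; the function name parse_sheet_mapping and its error handling are assumptions, not the CLI's actual behavior.

```python
def parse_sheet_mapping(raw: str) -> dict[str, str]:
    """Turn "SRC1:TGT1,SRC2:TGT2" into {"SRC1": "TGT1", "SRC2": "TGT2"}."""
    mapping = {}
    for pair in raw.split(","):
        pair = pair.strip()
        if not pair:
            continue
        source, separator, target = pair.partition(":")
        source, target = source.strip(), target.strip()
        if not separator or not source or not target:
            raise ValueError(f"Expected SRC:TGT, got {pair!r}")
        mapping[source] = target
    return mapping
```

Each resulting pair would then drive one source-sheet to target-sheet comparison.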

Notes:

  • --sheet-mapping is supported only for Excel file pairs.
  • Provide either --file-type or both --src-kind and --tgt-kind.
  • --file-type remains supported for backward compatibility.
  • sqlserver and oracle are declared as DB kinds, but DB execution is currently implemented only for sqlite, postgresql, and mysql.
  • Mixed file<->DB validation in a single run is not implemented yet.
  • <datetime> token in --html-output is replaced at runtime with YYYYMMDD_HHMMSS.
  • --chunk-size controls the number of data rows read per batch for CSV/XLSX loading.
  • --max-mismatches truncates mismatch details included in console preview and HTML report.
  • Console output now shows chunk progress for source/target loading: total chunks, current chunk, and completion summary.
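The <datetime> token substitution noted above can be pictured with a short sketch. The function name resolve_html_output and the optional now parameter are hypothetical; only the token and the YYYYMMDD_HHMMSS format come from this document.

```python
from datetime import datetime


def resolve_html_output(template, now=None):
    """Replace the <datetime> token with a YYYYMMDD_HHMMSS timestamp."""
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return template.replace("<datetime>", stamp)
```

A path without the token is returned unchanged, so a fixed report name overwrites the previous run while a tokenized name keeps one report per run.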

Large-file tuning tip:

  • Start with --chunk-size 1000 (default), then increase to 2000 or 5000 for faster reads if memory allows.
  • In dvg.bat, set CHUNK_SIZE in the config block to tune batch size without changing CLI commands.
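To show what a chunk size means in practice, here is a minimal sketch of reading a CSV in batches of data rows. It is not the loader used by DVG; the function name iter_csv_chunks, the UTF-8 encoding, and the choice to yield the header with every chunk are assumptions.

```python
import csv
from itertools import islice


def iter_csv_chunks(path, chunk_size=1000):
    """Yield (header, rows) pairs with at most chunk_size data rows each."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer")
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, None)
        if header is None:  # empty file: nothing to yield
            return
        while True:
            chunk = list(islice(reader, chunk_size))
            if not chunk:
                break
            yield header, chunk
```

A larger chunk size means fewer iterations but more rows held in memory at once, which is the trade-off the tuning tip describes.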

Installed CLI Entry Point

If installed as a package, you can run:

dvg --src-kind csv --tgt-kind csv --src-path ... --tgt-path ... --validation-type ROWCOUNT_VALIDATION

Data Mutation Utility (data_corruptor.py)

Use this utility to generate controlled data drift before validation.

Example:

python data_corruptor.py \
  --input inputs/employees.csv \
  --output outputs/employees_typos.csv \
  --column email \
  --percentage 1.0 \
  --type typo

Batch Scripts for Mutation Scenarios

Located in the scripts/ folder:

  • run_case_swap.bat - Swap character cases
  • run_date_shift.bat - Shift dates by random days
  • run_nullify.bat - Replace values with NULL/empty
  • run_numeric_shift.bat - Shift numeric values
  • run_typo.bat - Introduce character typos

Example:

scripts\run_case_swap.bat

Supported mutation types:

  • nullify
    • Replaces selected values with blank strings.
    • Purpose: validate missing-value detection.
  • case_swap
    • Swaps letter casing in selected values.
    • Purpose: validate case sensitivity behavior.
  • numeric_shift
    • Adds/subtracts a numeric offset (--value).
    • Purpose: validate precision and tolerance checks.
  • date_shift
    • Shifts date/datetime values by day count (--value).
    • Supported formats: YYYY-MM-DD, YYYY-MM-DD HH:MM:SS.
    • Purpose: validate temporal drift handling.
  • typo
    • Randomly replaces one character in selected strings.
    • Purpose: validate strict text/hash mismatch detection.
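Three of the mutation types above (typo, case_swap, date_shift) can be sketched in a few lines. These are illustrative stand-ins, not the code in data_corruptor.py: the function names, the lowercase replacement alphabet, and the mutate_column helper (which treats its fraction argument as a share between 0 and 1 of the rows to change) are assumptions.

```python
import random
from datetime import datetime, timedelta

LETTERS = "abcdefghijklmnopqrstuvwxyz"


def typo(value, rng):
    """Replace one randomly chosen character with a different letter."""
    if not value:
        return value
    index = rng.randrange(len(value))
    replacement = rng.choice([c for c in LETTERS if c != value[index]])
    return value[:index] + replacement + value[index + 1:]


def case_swap(value):
    """Swap the casing of every letter."""
    return value.swapcase()


def date_shift(value, days):
    """Shift YYYY-MM-DD or YYYY-MM-DD HH:MM:SS values; leave others untouched."""
    for fmt in ("%Y-%m-%d %H:%M:%S", "%Y-%m-%d"):
        try:
            parsed = datetime.strptime(value, fmt)
        except ValueError:
            continue
        return (parsed + timedelta(days=days)).strftime(fmt)
    return value


def mutate_column(values, fraction, mutate, rng):
    """Apply mutate to a random share of the values (hypothetical helper)."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    count = round(len(values) * fraction)
    chosen = set(rng.sample(range(len(values)), count))
    return [mutate(v) if i in chosen else v for i, v in enumerate(values)]
```

Passing a seeded random.Random instance makes a mutation run repeatable, which helps when the mutated file feeds a validation test with expected mismatch counts.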

Sample Scenario Scripts

Each run_*.bat script listed under Batch Scripts for Mutation Scenarios mutates inputs/employees.csv into a corresponding file under outputs/.

Reports

Generated reports are written under output/ and include:

  • high-level pass/fail status
  • validation metadata (source, target, validation type, timestamp)
  • KPI cards
  • detailed mismatch table with filters

Tests

Run tests with:

pytest

Local Test Scripts

Windows:

scripts\005_run_unit_tests.bat
scripts\005_run_code_cov.bat

Linux/macOS:

bash scripts/005_run_unit_tests.sh
bash scripts/005_run_code_cov.sh

Coverage command used by the scripts:

python -m pytest --cov=data_validation_gini --cov-report=term-missing --cov-report=html

Current target and baseline: 100% coverage for package modules under src/data_validation_gini.

Security Audits

The project includes security scanning with pip-audit, Trivy, and Gitleaks, with automated HTML report generation. See docs/security/SECURITY_AUDITS.md for detailed documentation.

Quick Start

Run all security audits:

scripts\013_run_all_security_audits.bat

Or on Linux/macOS:

bash scripts/013_run_all_security_audits.sh

Individual audit scripts:

  • scripts/010_run_pip_audit.bat - Scan Python dependencies for known vulnerabilities
  • scripts/011_run_trivy_audit.bat - Scan filesystem for misconfigurations and secrets
  • scripts/012_run_gitleaks_audit.bat - Detect accidentally committed secrets

Reports Generated:

  • audits/pip_audit_report.html - Dependency vulnerability report
  • audits/trivy_fs_report.html - Filesystem audit report
  • audits/gitleaks_report.html - Secret detection report
  • audits/consolidated_security_report.html - Multi-scanner dashboard (all tools combined)

Install Security Tools:

# Windows (Chocolatey)
choco install trivy gitleaks
pip install pip-audit

# macOS (Homebrew)
brew install trivy gitleaks
pip install pip-audit

See docs/security/SECURITY_AUDITS.md for:

  • Detailed tool documentation
  • CI/CD integration examples
  • Troubleshooting guides
  • Report interpretation tips

Project Structure (High Level)

Core Files

  • src/data_validation_gini/dvg.py - validation CLI implementation
  • src/data_validation_gini/dvg_report.py - HTML report generation
  • src/data_validation_gini/data_corruptor.py - mutation utility implementation
  • src/data_validation_gini/dvg_db.py - database connectivity and table loading
  • src/data_validation_gini/file_stores.py - INI/JSON file reader-writer classes
  • dvg.py, dvg_db.py, dvg_report.py, data_corruptor.py - root compatibility wrappers
  • README.md - Main documentation
  • docs/CONTRIBUTING.md - contributor workflow and repository boundaries
  • docs/security/SECURITY_AUDITS.md - Security audit scripts documentation

Scripts Folder (scripts/)

Setup & Environment:

  • 001_env.bat/sh - Python environment setup
  • 002_activate.bat/sh - Activate virtual environment
  • 003_setup.bat/sh - Install dependencies
  • 008_deactivate.bat/sh - Deactivate virtual environment

Domain Implementations:

  • scripts/data/ - operational data workflows (mutations, sheet mapping, DB startup/seed/compare)
  • scripts/testing/ - local test and coverage workflows
  • scripts/security/ - security audit workflows and consolidated run

Compatibility Wrappers (root scripts):

  • Existing root scripts remain valid (for example 004_run.bat, 005_run_unit_tests.bat, 010_run_pip_audit.bat).
  • Each wrapper forwards to the new domain script path so existing entrypoints and automation remain unchanged.

Validation & CLI:

  • dvg.bat/sh - Run DVG validation

Directories

  • inputs/ - baseline sample datasets
  • outputs/ - mutated sample datasets
  • output/ - generated validation report files
  • audits/ - generated security audit reports (JSON & HTML)
  • tests/ - unit tests
  • data_validation_gini.egg-info/ - package metadata

License

MIT

Download files

Source Distribution

data_validation_gini-0.3.16.tar.gz (51.3 kB)

  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12
  • SHA256: ca3fb582e459e3da8e5267fae25d3866efb7c0d1d1d49bb6f99eb7db3591e5c5
  • MD5: 69f938f4c22d5df4fc2a38fa8784eedb
  • BLAKE2b-256: d40db2d3932e46b1e1667312f493aec47db69ec6155b102a62fff0ae852ab118
  • Publisher: publish-pypi.yml on ShanKonduru/data-validation-gini

Built Distribution

data_validation_gini-0.3.16-py3-none-any.whl (38.6 kB)

  • SHA256: 89424adf4754cfe049927a76bf84eca9330d617787cd992e77802dc4f8d4f746
  • MD5: 3c1f7305abdd6586e2c1c246cb0f23ad
  • BLAKE2b-256: a810f26a3152d42450ccc675e84a38a4557ec9866299ffd39904e526da037ce0
  • Publisher: publish-pypi.yml on ShanKonduru/data-validation-gini

Attestation values reflect the state when the release was signed and may no longer be current.
