Data Validation Gini (DVG) CLI for row count and row/column comparison with HTML reports

Data Validation Gini (DVG)

Data Validation Gini is a lightweight Python CLI for validating source and target datasets and generating a rich HTML reconciliation report.

The repository also includes a CSV data mutation utility (data_corruptor.py) to create controlled mismatches for validation testing.

What This Project Does

  • Compares source vs target files using row-level and cell-level checks.
  • Supports CSV and Excel (.xlsx, .xlsm, .xltx) inputs.
  • Supports single-sheet and multi-sheet validation (via sheet mapping).
  • Produces a styled, filterable HTML report with KPI summary cards.
  • Includes repeatable batch scripts for common mutation and validation scenarios.

Current Validation Modes

  • ROWCOUNT: compares the number of data rows in source and target.
  • ROW_COL_VALIDATION: checks headers and row/column values.
  • Combined mode: pass both as comma-separated values:
    • ROWCOUNT,ROW_COL_VALIDATION
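
As a rough illustration of how a comma-separated --validation-type value could be split into individual modes, here is a minimal sketch. The helper names (parse_validation_types, rowcount_matches) are hypothetical, and the case-insensitive matching and duplicate handling are assumptions; DVG's actual argument handling may differ.

```python
# Hypothetical helpers for illustration; not taken from the DVG source.
KNOWN_MODES = ("ROWCOUNT", "ROW_COL_VALIDATION")


def parse_validation_types(raw):
    """Split a comma-separated --validation-type value into known mode names."""
    modes = []
    for token in raw.split(","):
        name = token.strip().upper()  # assumption: matching ignores case
        if not name:
            continue  # tolerate stray commas
        if name not in KNOWN_MODES:
            raise ValueError(f"Unknown validation type: {name!r}")
        if name not in modes:
            modes.append(name)  # keep first-seen order, drop duplicates
    if not modes:
        raise ValueError("No validation type given")
    return modes


def rowcount_matches(src_rows, tgt_rows):
    """ROWCOUNT-style check on data rows (header assumed already removed)."""
    return len(src_rows) == len(tgt_rows)
```

Under these assumptions, parse_validation_types("ROWCOUNT,ROW_COL_VALIDATION") yields both mode names in order, so a caller can run each check in turn.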

Key Features in Current Implementation

  • Header mismatch detection:
    • header length mismatches
    • header name mismatches
  • Row alignment using preferred key columns:
    • employee_id, id, emp_id, record_id, pk
    • falls back to first column if no preferred key exists
  • Mismatch classification:
    • CELL
    • SRC_ONLY
    • TGT_ONLY
    • HEADER_LENGTH
    • HEADER_NAME
    • ROWCOUNT
  • HTML report KPIs:
    • SRC Count
    • TGT Count
    • PASSED
    • FAILED
    • Pass Rate
    • Failed Rate
    • SRC Only
    • TGT Only
  • Per-column filter inputs in mismatch table for quick triage.
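
The key selection and mismatch classification described above can be sketched as follows. This is an illustration, not the DVG implementation: the function names are hypothetical, the listed key order is assumed to be the priority order, header matching is assumed to be case-sensitive, and rows are modeled as dictionaries keyed by column name.

```python
# Illustrative sketch only; names and data model are assumptions.
PREFERRED_KEYS = ("employee_id", "id", "emp_id", "record_id", "pk")


def pick_key_column(header):
    """Return the first preferred key found in the header, else the first column."""
    for name in PREFERRED_KEYS:
        if name in header:
            return name
    return header[0]


def compare_rows(src_rows, tgt_rows, key):
    """Align rows by key and return (kind, key_value, column) mismatch tuples."""
    # If a key value repeats, the last row with that key wins in this sketch.
    src_by_key = {row[key]: row for row in src_rows}
    tgt_by_key = {row[key]: row for row in tgt_rows}
    mismatches = []
    for key_value, src in src_by_key.items():
        tgt = tgt_by_key.get(key_value)
        if tgt is None:
            mismatches.append(("SRC_ONLY", key_value, None))
            continue
        for column, value in src.items():
            if tgt.get(column) != value:
                mismatches.append(("CELL", key_value, column))
    for key_value in tgt_by_key:
        if key_value not in src_by_key:
            mismatches.append(("TGT_ONLY", key_value, None))
    return mismatches
```

Aligning by key rather than by row position means a single inserted or deleted row shows up as one SRC_ONLY or TGT_ONLY entry instead of shifting every following row into a CELL mismatch.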

Requirements

  • Python 3.9+
  • Packages:
    • openpyxl
    • pytest (for tests)
    • python-dotenv

Install dependencies:

pip install -r requirements.txt

Quick Start (Windows Batch Flow)

From project root:

001_env.bat
002_activate.bat
003_setup.bat

Run all mutation scenarios:

004_run.bat

Run a DVG validation and generate HTML:

dvg.bat

Run sheet mapping validation (Excel to Excel):

006_run_sheet_mapping.bat

Deactivate venv:

008_deactivate.bat

CLI Usage

DVG Validator

python dvg.py \
  --file-type EXCEL \
  --src-path inputs/employees.csv \
  --tgt-path outputs/employees.csv \
  --validation-type ROWCOUNT,ROW_COL_VALIDATION \
  --html-output "output/report_<datetime>.html"

Optional arguments:

  • --src-sheet <sheet_name>
  • --tgt-sheet <sheet_name>
  • --sheet-mapping "SRC1:TGT1,SRC2:TGT2"

Notes:

  • --sheet-mapping is supported only for Excel file pairs.
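  • --file-type currently accepts only EXCEL; use this value for CSV inputs as well as Excel inputs.
  • The <datetime> token in --html-output is replaced at runtime with a timestamp in YYYYMMDD_HHMMSS format. Quote the path so the shell does not treat < and > as redirections.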
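
To make the --sheet-mapping format and the <datetime> substitution concrete, here is a minimal sketch. Both helper names are hypothetical and the error handling is an assumption; only the "SRC1:TGT1,SRC2:TGT2" syntax and the YYYYMMDD_HHMMSS format come from the notes above.

```python
from datetime import datetime


def parse_sheet_mapping(raw):
    """Turn "SRC1:TGT1,SRC2:TGT2" into {"SRC1": "TGT1", "SRC2": "TGT2"}."""
    pairs = {}
    for item in raw.split(","):
        src, sep, tgt = item.partition(":")
        src, tgt = src.strip(), tgt.strip()
        if not sep or not src or not tgt:
            # Assumption: malformed entries are rejected rather than skipped.
            raise ValueError(f"Expected SRC:TGT, got {item!r}")
        pairs[src] = tgt
    return pairs


def expand_datetime_token(path, now=None):
    """Replace the literal <datetime> token with a YYYYMMDD_HHMMSS timestamp."""
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return path.replace("<datetime>", stamp)
```

Passing an explicit now value keeps the output path predictable in tests; leaving it out uses the current local time.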

Installed CLI Entry Point

If installed as a package, you can run:

dvg --file-type EXCEL --src-path ... --tgt-path ... --validation-type ROWCOUNT

Data Mutation Utility (data_corruptor.py)

Use this utility to generate controlled data drift before validation.

Example:

python data_corruptor.py \
  --input inputs/employees.csv \
  --output outputs/employees_typos.csv \
  --column email \
  --percentage 1.0 \
  --type typo

Supported mutation types:

  • nullify
    • Replaces selected values with blank strings.
    • Purpose: validate missing-value detection.
  • case_swap
    • Swaps letter casing in selected values.
    • Purpose: validate case sensitivity behavior.
  • numeric_shift
    • Adds/subtracts a numeric offset (--value).
    • Purpose: validate precision and tolerance checks.
  • date_shift
    • Shifts date/datetime values by day count (--value).
    • Supported formats: YYYY-MM-DD, YYYY-MM-DD HH:MM:SS.
    • Purpose: validate temporal drift handling.
  • typo
    • Randomly replaces one character in selected strings.
    • Purpose: validate strict text/hash mismatch detection.
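
For readers who want a feel for what these mutations do to a single value, the sketch below mimics three of them. It is not the code in data_corruptor.py: the function names are hypothetical, replacement characters are assumed to be lowercase ASCII letters, and values that do not match a supported date format are assumed to pass through unchanged.

```python
import random
import string
from datetime import datetime, timedelta

# The two date formats listed above, most specific first.
DATE_FORMATS = ("%Y-%m-%d %H:%M:%S", "%Y-%m-%d")


def case_swap(value):
    """Swap upper and lower case for every letter."""
    return value.swapcase()


def date_shift(value, days):
    """Shift a date or datetime string by a number of days, keeping its format."""
    for fmt in DATE_FORMATS:
        try:
            parsed = datetime.strptime(value, fmt)
        except ValueError:
            continue
        return (parsed + timedelta(days=days)).strftime(fmt)
    return value  # assumption: unparseable values are left untouched


def typo(value, rng):
    """Replace exactly one character with a different lowercase letter."""
    if not value:
        return value
    pos = rng.randrange(len(value))
    choices = [c for c in string.ascii_lowercase if c != value[pos]]
    return value[:pos] + rng.choice(choices) + value[pos + 1:]
```

Passing a seeded random.Random instance as rng makes a typo run repeatable, which helps when the same mutated file needs to be regenerated for a validation test.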

Sample Scenario Scripts

  • run_case_swap.bat
  • run_date_shift.bat
  • run_nullify.bat
  • run_numeric_shift.bat
  • run_typo.bat

Each script mutates inputs/employees.csv into a corresponding file under outputs/.

Reports

Generated reports are written under output/ and include:

  • high-level pass/fail status
  • validation metadata (source, target, validation type, timestamp)
  • KPI cards
  • detailed mismatch table with filters

Tests

Run tests with:

pytest

Project Structure (High Level)

  • dvg.py - validation CLI
  • dvg_report.py - HTML report generation
  • data_corruptor.py - mutation utility
  • inputs/ - baseline sample datasets
  • outputs/ - mutated sample datasets
  • output/ - generated report files
  • tests/ - unit tests

License

MIT

Download files

Source Distribution

data_validation_gini-0.1.4.tar.gz (16.3 kB view details)

Uploaded Source

Built Distribution

data_validation_gini-0.1.4-py3-none-any.whl (13.6 kB view details)

Uploaded Python 3

File details

Details for the file data_validation_gini-0.1.4.tar.gz.

File metadata

  • Download URL: data_validation_gini-0.1.4.tar.gz
  • Size: 16.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for data_validation_gini-0.1.4.tar.gz
Algorithm Hash digest
SHA256 8075c5345b531856f89aecaba12336660056faea9f1eb23b4227994a3f84d2fb
MD5 d0eeb2c3b5ccfc0f6a1e8ba41dbfdcd3
BLAKE2b-256 f47b0ab7e0a189714da2e9fb8bf46fc9adb59eabe006d82dde1e9552b00699dd

Provenance

The following attestation bundles were made for data_validation_gini-0.1.4.tar.gz:

Publisher: publish-pypi.yml on ShanKonduru/data-validation-gini

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file data_validation_gini-0.1.4-py3-none-any.whl.

File hashes

Hashes for data_validation_gini-0.1.4-py3-none-any.whl
Algorithm Hash digest
SHA256 9d34bcf1832d7a645d7eb6388236da532a8e4c04bc810fb04dc924bbb50dc996
MD5 17aa9100212ce7c5bd7beb94a678d2d3
BLAKE2b-256 005ec69d03dd5b68109ced2c776fe4f4f1963a16de2d6c4b256654c4db086139

Provenance

The following attestation bundles were made for data_validation_gini-0.1.4-py3-none-any.whl:

Publisher: publish-pypi.yml on ShanKonduru/data-validation-gini

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
