icsDV - initions Data Validation Tool

Introduction

The icsDataValidation tool identifies data mismatches between two databases. Its functionality is geared specifically toward migration projects: it compares tables and views between a source and a target system and helps to find data issues.

What is "generic" about the tool?

The icsDataValidation tool (icsDV) is structured so that it is easily extensible. The main code is shared by all database options. Specifics of each supported database are implemented in a dedicated database service.

The different database services are very similar. They provide the same methods with the same input and output parameters. Each method is adapted to the syntax and the settings of the database it is written for. Each core implementation includes the connection setup, the object comparison functionality and the result preparation.
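The idea of identically structured database services can be sketched with an abstract base class. This is only an illustration: the class and method names (`DatabaseService`, `get_row_count`, `get_column_names`, `InMemoryService`) are hypothetical and are not taken from the icsDV code base.

```python
from abc import ABC, abstractmethod


class DatabaseService(ABC):
    """Hypothetical common interface; the tool's real method names may differ."""

    @abstractmethod
    def get_row_count(self, object_name: str) -> int:
        """Return the number of rows of a table or view."""

    @abstractmethod
    def get_column_names(self, object_name: str) -> list[str]:
        """Return the column names of a table or view."""


class InMemoryService(DatabaseService):
    """Stand-in 'database' backed by a dict, used only for illustration."""

    def __init__(self, tables: dict[str, list[dict]]):
        self._tables = tables

    def get_row_count(self, object_name: str) -> int:
        return len(self._tables[object_name])

    def get_column_names(self, object_name: str) -> list[str]:
        rows = self._tables[object_name]
        return sorted(rows[0]) if rows else []


def compare_object(source: DatabaseService, target: DatabaseService, name: str) -> dict:
    """Database-independent core logic: only the shared interface is used."""
    return {
        "row_count_equal": source.get_row_count(name) == target.get_row_count(name),
        "columns_equal": source.get_column_names(name) == target.get_column_names(name),
    }


source = InMemoryService({"CUSTOMER": [{"ID": 1, "NAME": "a"}, {"ID": 2, "NAME": "b"}]})
target = InMemoryService({"CUSTOMER": [{"ID": 1, "NAME": "a"}]})
result = compare_object(source, target, "CUSTOMER")
print(result)
```

Because `compare_object` relies only on the shared interface, supporting another database in this sketch means adding one more subclass and leaving the core logic untouched.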

Supported Databases

The icsDV supports comparisons between the following databases:

Snowflake
Teradata
Azure SQL Server
Exasol
Oracle
Databricks with and without Unity Catalog

Comparison results can be written to either Snowflake or Databricks.

Features

The key features of the tool are:

Comparison of tables and views between a source and a target system.
Pipeline integration in Azure DevOps or GitLab
Multiple verification/comparison steps:
- Row count comparison
- Column names comparison
- Aggregation comparison (depending on data type)
- "group by" comparison
- Pandas DataFrame comparison (with a threshold for the size of the object)
- Pandas DataFrame sample comparison (with a random sample of the object)
Detailed representation of the comparison result
- "high-level" result (for each pipeline/execution)
- "object-level" result (for each table/view)
- "column-level" result (for each column)
Parallelization for performance enhancement of the comparison of a large number of objects
Input testsets (white-listing of objects)
Object filter (black-listing of objects)
Object mappings between the source and the target system
Comparison result saved and displayed in multiple instances
- saved as JSON files in the repository
- export to result tables in the target system (Snowflake or Databricks)
- export to Azure Blob Storage or AWS S3 Bucket
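How white-listing, black-listing and object mappings could interact is sketched below. The function `select_objects` and the object names are hypothetical; the actual file formats and selection logic of the icsDV are not shown here.

```python
def select_objects(available, whitelist=None, blacklist=(), mapping=None):
    """Return (source_name, target_name) pairs to compare.

    Assumptions of this sketch: a whitelist of None means "no testset
    restriction", and unmapped objects keep their name in the target system.
    """
    mapping = mapping or {}
    blocked = set(blacklist)
    selected = []
    for name in available:
        if whitelist is not None and name not in whitelist:
            continue  # not part of the testset
        if name in blocked:
            continue  # explicitly filtered out
        selected.append((name, mapping.get(name, name)))
    return selected


pairs = select_objects(
    available=["DB.SALES.ORDERS", "DB.SALES.CUSTOMERS", "DB.SALES.TMP_LOAD"],
    whitelist={"DB.SALES.ORDERS", "DB.SALES.TMP_LOAD"},
    blacklist=["DB.SALES.TMP_LOAD"],
    mapping={"DB.SALES.ORDERS": "DWH.SALES.ORDERS"},
)
print(pairs)
```

In this sketch the blacklist takes precedence over the whitelist, which is a design choice of the example and not a documented behavior of the tool.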

Repository Structure

The repository is structured in the following sections:

icsDataValidation

This is where all code files are stored.

icsDataValidation/main.py

Entry point for Python.

icsDataValidation/core

Main code files for the parts that are independent of the source and target system.

icsDataValidation/services/database_services

Database services for all supported systems can be found here. Each file contains a class that is structured identically to the other database service classes. Each database service class contains methods to query metadata, create aggregations, and retrieve data for the comparison step.

icsDataValidation/connection_setups

The connection setups are database dependent. They define how the credentials for the database connections are retrieved.

examples/comparison_results

The comparison results are saved here. One JSON file with all results is saved for each execution/pipeline run. Additionally, live comparison results are saved for each compared object as a failsafe.

examples

This folder contains all files defining a specific validation setup.
- A file named migration_config.json contains configurations about the source system, the target system and the mapping of objects between both. It contains the blacklists and "group by" aggregation settings.
- A file named ics_data_validation_config.json specifies the source system, the target system and the results system. Most importantly, this includes the names of the results tables and the connection configurations (Server, Port, Secrets) of source and target system.
- A file named manual_execution_params.py is only relevant for local execution of the code. It contains settings which would otherwise be defined in the pipeline setup, i.e. limits on the size of objects to compare and the numeric precision.
- The folder testsets contains JSON files specifying whitelists of objects to compare.

For all the files here, empty *.template.* files are available and may serve as a starting point. This repo stores only template files. The actual files used for each setup should not be committed here. They are stored in a separate repository.

examples/pipeline

Files defining the pipelines that execute the icsDV are stored here. For example, YML files for Azure DevOps pipelines.

icsDV - Execution Manual

icsDV - Input Parameters

There are four types of input parameters:

Pipeline Parameters - which are defined as input parameters of a pipeline (Azure DevOps Pipeline or Gitlab Pipeline).
Manual Execution Parameters - defined in the code (testing_tool.py). They correspond to the Pipeline Parameters and replace them when the code is executed directly, without a pipeline.
Global Parameters - directly defined in the TestingToolParams class. They are used in pipeline runs and for manual executions.
Environmental Parameters - Stored either in Azure DevOps in a variable group, in Gitlab, or, for manual executions, in a *.env file in a location that can be specified in the manual_execution_params.py.
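For manual executions, reading secrets from a *.env file can be pictured as follows. This is a minimal standard-library sketch, assuming a simple `KEY=value` format; the function `read_env_file` and the variable names in the sample file are hypothetical, and the icsDV may load the file differently.

```python
import os
import tempfile


def read_env_file(path):
    """Parse KEY=value lines, skipping blank lines and '#' comments."""
    values = {}
    with open(path, encoding="utf-8") as handle:
        for raw_line in handle:
            line = raw_line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # Surrounding quotes are optional in this sketch.
            values[key.strip()] = value.strip().strip('"').strip("'")
    return values


# Write a throwaway file so that the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    env_path = os.path.join(tmp, "example.env")
    with open(env_path, "w", encoding="utf-8") as handle:
        handle.write('# secrets for a local run\nSNOWFLAKE_PASSWORD="s3cret"\n\nEXASOL_USER=validator\n')
    env_values = read_env_file(env_path)

print(sorted(env_values))
```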

Additionally the parameters can be categorized into 3 groups:

Setup Parameters - these are parameters which are usually just set once when setting up the icsDV.
Configuration Parameters - are used to configure the general settings but can be adjusted to the conditions of the workload on the fly.
Execution Parameters - are set individually for each execution of the icsDV, e.g. the selection of objects to be tested.

Setup Parameters

Stored in ics_data_validation_config.json:

Parameter	Description	Input Type
source_system_selection	Name of the source system as defined in the database_config.json as a key.	Pipeline Parameter or Manual Execution Parameter
target_system_selection	Name of the target system as defined in the database_config.json as a key.	Pipeline Parameter or Manual Execution Parameter
result_system_selection	Name of the result system as defined in the database_config.json as a key.	Pipeline Parameter or Manual Execution Parameter
azure_devops_pipeline	Azure DevOps Pipeline support. Set to "True" to push the changes of a run to the GIT repository.	Global Parameter - TestingToolParams
gitlab_pipeline	Gitlab Pipeline support. Set to "True" to push the changes of a run to the GIT repository.	Global Parameter - TestingToolParams
result_database_name	Name of the database or catalog the results are written to	Global Parameter - TestingToolParams
result_schema_name	Name of the schema the results are written to	Global Parameter - TestingToolParams
result_table_highlevel_name	Name of the high-level results table	Global Parameter - TestingToolParams
result_table_objectlevel_name	Name of the object-level results table	Global Parameter - TestingToolParams
result_table_columnlevel_name	Name of the column-level results table	Global Parameter - TestingToolParams
result_meta_data_schema_name	Name of the schema the full results are written to	Global Parameter - TestingToolParams
result_table_name	Name of the table the full results are written to	Global Parameter - TestingToolParams
result_live_table_name	Name of the table the live results are written to	Global Parameter - TestingToolParams
results_folder_name	Folder in which the results are stored in JSON format. Default: `examples/comparison_results/`	Global Parameter - TestingToolParams
remaining_mapping_objects_folder_name	Output folder that holds information about source system objects which are not covered by the mapping and are therefore not included in the comparison. Default: `examples/remaining_mapping_objects/`	Global Parameter - TestingToolParams
testset_folder_name	Folder that holds the test set files in JSON format. Default: `examples/testsets/`	Global Parameter - TestingToolParams
stage_schema	Name of the Snowflake Schema where the stage is created to upload the comparison results to Snowflake. Only needed if the `upload_result_to_result_database` functionality is used with Snowflake as target system.	Global Parameter - TestingToolParams
stage_name_prefix	Prefix of the name of the Snowflake Stage which is used to upload the comparison results to Snowflake. The name is complemented by a run_guid which is a unique uuid for each icsDV execution. Only needed if the `upload_result_to_result_database` functionality is used.	Global Parameter - TestingToolParams
container_name	Name of the Azure Storage Container the comparison results are uploaded to. Note: Only needed if the `upload_result_to_blob` functionality is used.	Global Parameter - TestingToolParams
bucket_name	Name of the AWS S3 Bucket the comparison results are uploaded to. Note: Only needed if the `upload_result_to_bucket` functionality is used.	Global Parameter - TestingToolParams

Configuration Parameters

Stored in manual_execution_params.py:

Parameter	Description	Input Type
ENV_FILEPATH	Absolute path to the `*.env` file containing secrets, passwords and tokens.	Pipeline Parameter or Manual Execution Parameter
UPLOAD_RESULT_TO_BLOB	Set to "True" to upload the comparison results to an Azure Blob Storage. An `azure_storage_connection_string` is needed if set to "True".	Pipeline Parameter or Manual Execution Parameter
UPLOAD_RESULT_TO_BUCKET	Set to "True" to upload the comparison results to an AWS S3 Bucket. An `aws_bucket_access_key` and an `aws_bucket_secret_key` are needed if set to "True".	Pipeline Parameter or Manual Execution Parameter
UPLOAD_RESULT_TO_RESULT_DATABASE	Set to "True" to upload the comparison results to Snowflake or Databricks. A `result_system_selection` is needed if set to "True".	Pipeline Parameter or Manual Execution Parameter
MAX_OBJECT_SIZE	Limits Pandas comparison to objects of a size smaller than `MAX_OBJECT_SIZE` bytes. Data type is String. Default: `str(-1)`, no limit.	Pipeline Parameter or Manual Execution Parameter
MAX_ROW_NUMBER	Limits Pandas comparison to objects with fewer than `MAX_ROW_NUMBER` rows. Data type is String. Default: `str(-1)`, no limit.	Pipeline Parameter or Manual Execution Parameter
EXECUTE_GROUP_BY_COMPARISON	Set to "True" to execute group-by comparisons. See sec. "Group-By-Aggregation" for details.	Pipeline Parameter or Manual Execution Parameter
USE_GROUP_BY_COLUMNS	Set to "True" to activate group-by columns. See sec. "Group-By-Aggregation" for details.	Pipeline Parameter or Manual Execution Parameter
MIN_GROUP_BY_COUNT_DISTINCT	Minimum number of distinct values a column must exceed to qualify as a group-by column. See sec. "Group-By-Aggregation" for details.	Pipeline Parameter or Manual Execution Parameter
MAX_GROUP_BY_COUNT_DISTINCT	Maximum number of distinct values a column must stay below to qualify as a group-by column. See sec. "Group-By-Aggregation" for details.	Pipeline Parameter or Manual Execution Parameter
MAX_GROUP_BY_SIZE	Maximum size of the expected result of the group-by query. See sec. "Group-By-Aggregation" for details.	Pipeline Parameter or Manual Execution Parameter
NUMERIC_SCALE	Number of digits to compare. Data type is String. Default: `str(2)`, i.e. deviations below 0.01 are tolerated.	Pipeline Parameter or Manual Execution Parameter
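Two of the parameters above lend themselves to a short illustration: the numeric tolerance derived from NUMERIC_SCALE and the "no limit" convention of `str(-1)`. The helpers `within_numeric_scale` and `exceeds_limit` are hypothetical; the sketch assumes that any negative limit means "no limit" and that values are compared by absolute difference, which may differ from the tool's internal logic.

```python
from decimal import Decimal


def within_numeric_scale(source_value, target_value, numeric_scale="2"):
    """True if the absolute deviation is below 10 ** -numeric_scale."""
    tolerance = Decimal(1).scaleb(-int(numeric_scale))  # "2" -> Decimal("0.01")
    deviation = abs(Decimal(str(source_value)) - Decimal(str(target_value)))
    return deviation < tolerance


def exceeds_limit(actual, limit="-1"):
    """True if the object is too large for the Pandas comparison.

    The limit arrives as a string, as the parameters above do.
    Assumption: a negative value (the default "-1") disables the check.
    """
    limit_value = int(limit)
    return limit_value >= 0 and actual >= limit_value


print(within_numeric_scale("100.004", "100.0"))  # deviation 0.004
print(within_numeric_scale("10.50", "10.52"))    # deviation 0.02
print(exceeds_limit(5_000_000, "1000000"))
```

`Decimal` is used instead of `float` so that the tolerance check is not affected by binary rounding of values such as 0.01.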

Execution Parameters

Stored in manual_execution_params.py:

Parameter	Description	Input Type
DATABASE_NAME	Filters the test set on a specific database/catalog. For no filter set "None" as a Manual Execution Parameter and leave it empty as a Pipeline Parameter.	Pipeline Parameter or Manual Execution Parameter
SCHEMA_NAME	Filters the test set on a specific schema. For no filter set "None" as a Manual Execution Parameter and leave it empty as a Pipeline Parameter.	Pipeline Parameter or Manual Execution Parameter
TESTSET_FILE_NAMES	File names of the test set as defined in the folder testset_folder_name (see Setup Parameters) as JSON files.	Pipeline Parameter or Manual Execution Parameter
OBJECT_TYPE_RESTRICTION	Filters the testset to only tables (`"include_only_tables"`), only views (`"include_only_views"`) or all tables and views (`"include_all"`).	Pipeline Parameter or Manual Execution Parameter
MAX_NUMBER_OF_THREADS	Maximum number of threads used. Values larger than the default, `str(1)`, activate parallelization.	Pipeline Parameter or Manual Execution Parameter
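The effect of MAX_NUMBER_OF_THREADS can be pictured with a thread pool from the Python standard library. This is a sketch under assumptions: `compare_object` is a placeholder for the per-object comparison, `run_comparisons` is a hypothetical helper, and the icsDV may organize its parallelization differently.

```python
from concurrent.futures import ThreadPoolExecutor


def compare_object(name):
    """Placeholder for the per-object comparison; returns a tiny result record."""
    return {"object": name, "compared": True}


def run_comparisons(object_names, max_number_of_threads="1"):
    """Compare all objects using at most max_number_of_threads worker threads."""
    workers = max(1, int(max_number_of_threads))  # the parameter is a string
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns the results in input order, regardless of finish order.
        return list(pool.map(compare_object, object_names))


results = run_comparisons(["ORDERS", "CUSTOMERS", "ITEMS"], max_number_of_threads="2")
print([r["object"] for r in results])
```

With the default of one thread, the objects are processed one after another; larger values let comparisons that mostly wait on database queries overlap.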

icsDV - Configuration

Blacklists

Whitelists (Testsets)

Mapping

Group-By-Aggregation

The Group-By-Aggregation is a feature to pinpoint the differences in the data. It can be activated by setting the parameter EXECUTE_GROUP_BY_COMPARISON to TRUE. If activated, an additional comparison step is performed: each table is queried with a group-by statement that includes aggregations depending on the data type, and those aggregations are subsequently compared. As a result, the differences in the data can be narrowed down to certain grouping values.
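The narrowing-down step can be sketched as a comparison of two grouped aggregation results. The dictionaries below stand in for query results, and `diff_grouped_aggregates` is a hypothetical helper, not a function of the icsDV.

```python
def diff_grouped_aggregates(source, target):
    """Return the grouping values whose aggregates differ.

    Both inputs map a grouping value to a dict of aggregate name -> value.
    Groups that exist on only one side are reported as differences, too.
    """
    differing = {}
    for group in sorted(set(source) | set(target), key=str):
        if source.get(group) != target.get(group):
            differing[group] = {"source": source.get(group), "target": target.get(group)}
    return differing


# Aggregates per COUNTRY value, as a group-by query might return them.
source_agg = {"DE": {"COUNT": 3, "SUM_AMOUNT": 60}, "FR": {"COUNT": 2, "SUM_AMOUNT": 15}}
target_agg = {"DE": {"COUNT": 3, "SUM_AMOUNT": 60}, "FR": {"COUNT": 2, "SUM_AMOUNT": 14}}

mismatches = diff_grouped_aggregates(source_agg, target_agg)
print(list(mismatches))
```

Here only the group "FR" is reported, so a follow-up investigation can be restricted to the rows with that grouping value.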

There are three options to define the column over which the group-by is executed.

"group-by-columns-per-table" defined as multiple lists for specific tables. Activated with the USE_GROUP_BY_COLUMNS parameter and GROUP_BY_COLUMNS_PER_TABLE defined in the migration_config.json.
"group-by-columns" taken from a predefined list for all tables and checked by a validation. Activated with the USE_GROUP_BY_COLUMNS parameter and GROUP_BY_COLUMNS defined in the migration_config.json.
"group-by-columns" evaluated from all existing columns by a validation

The validation consists of several tests. Its parameters can be tuned either to accept columns liberally, so that group-by columns are found easily, or to accept only columns that add definite value for pinpointing the differences in the data.

The validation tests for the "group-by-columns" are:

Number of distinct values of the column is more than 1.
Number of distinct values of the column is less than the rowcount of the table.
Number of distinct values of the column exceeds the MIN_GROUP_BY_COUNT_DISTINCT parameter.
Number of distinct values of the column is below the MAX_GROUP_BY_COUNT_DISTINCT parameter.
The size of the expected result of the group-by-query is below the MAX_GROUP_BY_SIZE parameter. (The size is defined by "Number of distinct values" * "Number of columns")

All tests are executed on source and target.
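The five validation tests can be written down as one predicate. This is a sketch, assuming strict inequalities for "exceeds" and "below"; the function names and the shape of the statistics dictionaries are hypothetical.

```python
def is_valid_group_by_column(count_distinct, row_count, column_count,
                             min_count_distinct, max_count_distinct, max_group_by_size):
    """Apply the five tests listed above to the statistics of one column."""
    return (
        count_distinct > 1                                     # more than one distinct value
        and count_distinct < row_count                         # fewer distinct values than rows
        and count_distinct > min_count_distinct                # exceeds MIN_GROUP_BY_COUNT_DISTINCT
        and count_distinct < max_count_distinct                # below MAX_GROUP_BY_COUNT_DISTINCT
        and count_distinct * column_count < max_group_by_size  # expected result size below MAX_GROUP_BY_SIZE
    )


def is_valid_on_both(source_stats, target_stats, **limits):
    """All tests are executed on source and target, so both must pass."""
    return all(is_valid_group_by_column(**stats, **limits)
               for stats in (source_stats, target_stats))


limits = {"min_count_distinct": 2, "max_count_distinct": 1000, "max_group_by_size": 100000}
country = {"count_distinct": 25, "row_count": 10000, "column_count": 12}
primary_key = {"count_distinct": 10000, "row_count": 10000, "column_count": 12}

print(is_valid_on_both(country, country, **limits))
print(is_valid_on_both(primary_key, primary_key, **limits))
```

A column such as a primary key fails the second test because grouping by it would return one group per row and would not narrow anything down.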

Note: The group-by comparison can be activated by setting the execute_group_by_comparison parameter to TRUE. The migration_config.json has to include the following keys when the parameter use_group_by_columns is set to TRUE.

"GROUP_BY_AGGREGATION":{
  "GROUP_BY_COLUMNS_PER_TABLE": {},
  "GROUP_BY_COLUMNS":[]
}

The values of those keys can be empty.
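A quick check that the required keys are present could look like the sketch below. It assumes that GROUP_BY_AGGREGATION sits at the top level of migration_config.json, which the fragment above does not state; `missing_group_by_keys` is a hypothetical helper.

```python
import json

# The fragment from above, wrapped in braces so that it is valid JSON on its own.
config_text = """
{
  "GROUP_BY_AGGREGATION": {
    "GROUP_BY_COLUMNS_PER_TABLE": {},
    "GROUP_BY_COLUMNS": []
  }
}
"""


def missing_group_by_keys(config):
    """Return the names of required keys that are absent from the config."""
    section = config.get("GROUP_BY_AGGREGATION")
    if not isinstance(section, dict):
        return ["GROUP_BY_AGGREGATION"]
    required = ("GROUP_BY_COLUMNS_PER_TABLE", "GROUP_BY_COLUMNS")
    return [key for key in required if key not in section]


missing = missing_group_by_keys(json.loads(config_text))
print(missing)
```

Empty values pass this check, in line with the statement that the values of those keys can be empty.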

icsDV - Comparison Results

JSON Results

Complete Comparison Result JSONs
Live Comparison Result JSONs

Target System Result Tables

High-Level Result
Object-Level Result
Column-Level Result

Result Export in a File Storage

icsDV - Setup

Code setup

To handle the code, we recommend using VS Code.
The code is written in Python. The tool is compatible with version 3.11.
It is recommended to use a project-specific Python environment. You can create one with python -m venv .env in the root folder of this repo. After creating it, activate it (source .env/bin/activate), select the Python binary .env/bin/python as your Python interpreter in VS Code, and make sure that Python libraries are read from and installed to this environment, i.e. export PYTHONPATH=$(pwd)/.env/lib/python3.11/site-packages.
In this environment, install the packages listed in requirements.txt and requirements-dev.txt, e.g. run pip install -r requirements.txt.

Setup for manual execution

Setup as Azure DevOps pipeline

Setup as GitLab pipeline

Authentication

The following authentication methods for Snowflake are supported:

password, provided via PASSWORD_NAME
private key with/without encryption, provided via PRIVATE_KEY_NAME with/without PRIVATE_KEY_PASSPHRASE_NAME
path to private key file with/without encryption, provided via PRIVATE_KEY_FILE_PATH with/without PRIVATE_KEY_FILE_PASSWORD

Devcontainer

Run with uv as follows in the devcontainer:

uv run -s icsDataValidation/main.py

Inside the devcontainer config the mounts setting is used to bring a .env from the host system into the devcontainer.

"mounts": [
        "source=/home/Documents/Generic_Testing_Tool/generic_testing_tool_password.env,target=/workspaces/icsDataValidation/examples/generic_testing_tool_password.env,type=bind"
    ]

To use this feature, either create the .env file under the source path on your host or change the source path to another path on the host system. The target path does not need to be adjusted.
