
A Python package designed to validate data sources and sinks


Hashmap Data Validator

About

Hashmap Data Validator, or hdv, is a tool used to validate the data in two database tables. HDV uses Great Expectations to run validation on the two tables. It currently supports connections to Snowflake via snowflake-connector-python and to Oracle via cx_Oracle. HDV currently runs expectations on row count and row hash values to validate the tables.
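
Neither check's implementation is reproduced in this README, but the two expectations it names can be sketched with pandas and the classic Great Expectations API. The DataFrames below are hypothetical stand-ins for the two tables, and hdv's actual internals may differ:

import great_expectations as ge
import pandas as pd

# Hypothetical source and target tables, already pulled into DataFrames.
source_df = pd.DataFrame({"id": [1, 2, 3], "amount": [10.0, 20.0, 30.0]})
target_df = pd.DataFrame({"id": [1, 2, 3], "amount": [10.0, 20.0, 30.0]})

# Row-count expectation: the target must have as many rows as the source.
target_ge = ge.from_pandas(target_df)
count_result = target_ge.expect_table_row_count_to_equal(len(source_df))

# Row-hash comparison: hash each row and check that the hash sets match.
source_hashes = set(pd.util.hash_pandas_object(source_df, index=False))
target_hashes = set(pd.util.hash_pandas_object(target_df, index=False))

print(count_result.success and source_hashes == target_hashes)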

How To Use

  • The user installs the package via PyPI with:
pip install hashmap-data-validator
  • After installation, the user needs to run a .py file with the following import: import hdv (a minimal example follows this list)
  • Two .yml files will then be created in the user's home directory with the following path: .hashmap_data_validator/hdv_profiles.yml and .hashmap_data_validator/hdv_configuration.yml
  • The user then configures both of the newly created YAML files accordingly. (Examples below)
  • After configuration is complete, the user can call and run validation with hdv from the command line with:
hdv
  • HDV writes the validation results to a validation_results.html file (created if it does not already exist) in the directory where the command is called and opens that file in the user's browser
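
A minimal bootstrap script, with a hypothetical filename, could look like this; importing hdv is what triggers creation of the configuration files:

# bootstrap_hdv.py
# Running this once creates the two YAML files under
# ~/.hashmap_data_validator/ if they do not already exist.
import hdv  # noqa: F401  (the import itself does the setup work)

After editing the generated YAML files, run the validation from the same shell with the hdv command.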

Approximate Execution Times

  • 2 million total rows = ~15 seconds
  • 8 million total rows = ~2 minutes
  • 15 million total rows = ~3.5 minutes

User Documentation

Classes

Configuration templates for HDV-supported databases. Add these YAML templates to your hdv_configuration.yml and configure them accordingly.

Oracle

Oracle database resource.

base class

OracleResource

configuration

  • required

    • env: section name in the hdv_profiles.yml file containing the connection information
    • table_name: table name to validate
  • optional

    • chunk_size: the number of rows in the table to grab in batches (iterates over the table with offset)
    • offset: the number of rows to skip between table batch grabs

    NOTE: Only include chunk_size and offset values if you would like to validate a sample of your tables. (A sketch of this batching pattern follows the template below.)

    oracle:
      type: OracleResource
      conf:
        env: oracle
        table_name: <oracle_db>.<oracle_table>
        chunk_size: 100000 # batches are grabbed in chunks of 100000 rows (optional)
        offset: 200000 # skips every 200000 rows in the table after a batch grab (optional)
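
How hdv batches internally isn't documented here; as a rough illustration of what chunk_size and offset imply, here is a hedged sketch using cx_Oracle with Oracle 12c-style pagination. The function name, the exact SQL, and the reading of offset as rows skipped between batches are assumptions:

import cx_Oracle

def sample_batches(conn, table, chunk_size=100_000, offset=200_000):
    """Yield chunk_size-row batches, skipping offset rows between grabs."""
    cursor = conn.cursor()
    start = 0
    while True:
        # Oracle 12c+ pagination: skip `start` rows, fetch the next chunk.
        cursor.execute(
            f"SELECT * FROM {table} OFFSET :skip ROWS FETCH NEXT :n ROWS ONLY",
            skip=start,
            n=chunk_size,
        )
        batch = cursor.fetchall()
        if not batch:
            break
        yield batch
        # Jump past this batch plus `offset` unread rows before the next grab.
        start += chunk_size + offset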

Snowflake

Snowflake database resource.

base class

SnowflakeResource

configuration

  • required

    • env: section name in the hdv_profiles.yml file containing the connection information
    • table_name: table name to validate
  • optional

    • chunk_size: the number of rows in the table to grab in batches (iterates over the table with offset)
    • offset: the number of rows to skip between table batch grabs

    NOTE: Only include chunk_size and offset values if you would like to validate a sample of your tables.

    snowflake:
      type: SnowflakeResource
      conf:
        env: snowflake
        table_name: <snowflake_table_name>
        chunk_size: 100000 # batches are grabbed in chunks of 100000 rows (optional)
        offset: 200000 # skips every 200000 rows in the table after a batch grab (optional)

More database support to come

HDV Configuration Setup

After you run a Python file with import hdv, two configuration files will be created in your home directory under the .hashmap_data_validator/ folder.

  1. hdv_profiles.yml is where you configure your connection credentials for the databases you would like to validate. The default file will look similar to this:
dev:
  oracle:
    host: <host>
    port: <port>
    sid: <sid>
    user: <oracle_username>
    password: <oracle_password>
    client_library_dir: <path to oracle client library files e.g. C:\instantclient-basic-windows.x64-19.9.0.0.0dbru\instantclient_19_9>
  snowflake:
    account: <account>
    role: <role>
    warehouse: <warehouse_name>
    database: <database_name>
    schema: <schema_name>
    user: <snowflake_username>
    password: <snowflake_password>

HDV uses this file to manage the connections to the necessary databases. HDV only supports Oracle and Snowflake at the moment, but we plan to extend support to other databases in the future. In this file, you can add your credentials for any database that HDV supports. When the validation method is called, HDV is pointed to the individual configurations in this file and connects using the corresponding credentials.
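
HDV manages these connections itself, but as a hedged sketch of how the profile fields line up with the two connectors' standard connect calls (parsing the file with PyYAML; whether hdv reads it exactly this way is an assumption):

import cx_Oracle
import snowflake.connector
import yaml
from pathlib import Path

profiles_path = Path.home() / ".hashmap_data_validator" / "hdv_profiles.yml"
profiles = yaml.safe_load(profiles_path.read_text())["dev"]

# Oracle: the client library directory must be initialized before connecting.
ora = profiles["oracle"]
cx_Oracle.init_oracle_client(lib_dir=ora["client_library_dir"])
oracle_conn = cx_Oracle.connect(
    user=ora["user"],
    password=ora["password"],
    dsn=cx_Oracle.makedsn(ora["host"], ora["port"], sid=ora["sid"]),
)

# Snowflake: the profile fields map one-to-one onto snowflake-connector-python.
sf = profiles["snowflake"]
snowflake_conn = snowflake.connector.connect(
    account=sf["account"],
    role=sf["role"],
    warehouse=sf["warehouse"],
    database=sf["database"],
    schema=sf["schema"],
    user=sf["user"],
    password=sf["password"],
)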

Note: you can add any number of database configurations in this file. The above is just a starting template with two databases. For example, you could have multiple Snowflake configurations; however, their names would need to be unique (e.g. snowflake_1, snowflake_2).

  2. hdv_configuration.yml is where you configure the specifics of your validation. The file should look similar to the one below:
version: 1beta

validations:
  from: oracle_1
  to: snowflake_1

resources:
  oracle_1:
    type: OracleResource
    conf:
      env: oracle
      table_name: <oracle_db>.<oracle_table>
      chunk_size: <optional>
      offset: <optional>
  snowflake_1:
    type: SnowflakeResource
    conf:
      env: snowflake
      table_name: <snowflake_table>
      chunk_size: <optional>
      offset: <optional>
  dummy_database:
    type: TestResource
    conf:
      env: test_environment
      table_name: <test_table>
      chunk_size: <optional>
      offset: <optional>

As you can see above, you can add any number of resources in this file. Then, to run a validation, you call the specific resources in the validations section of the YAML.

The example above will run a validation on the oracle_1 and snowflake_1 resources, which are both configured in the above file as well. Notice that the env values of the oracle_1 and snowflake_1 resources point to named configurations in the example hdv_profiles.yml shown above. The type values point to which HDV class type to use in validation (SnowflakeResource and OracleResource in this case). Refer above to the HDV supported class types.

To run a validation on different tables, simply alter the to and from values to point to different resources configured in the file, or change the current resource configurations, as shown below.
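
For example, pointing the validation at the hypothetical dummy_database resource defined above instead of snowflake_1 would only require changing the validations block:

validations:
  from: oracle_1
  to: dummy_database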

Note: You can configure any number of resources in hdv_configuration.yml
