A Python package designed to validate data sources and sinks
Hashmap Data Validator
About
Hashmap Data Validator, or hdv, is a tool used to validate the data in two database tables. HDV uses Great Expectations to run validation on the two tables. It currently supports connections to Snowflake via snowflake-connector-python and to Oracle via cx_Oracle. HDV currently runs expectations on row count and row hash values to validate the tables.
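For intuition, the two checks boil down to comparing row counts and per-row hashes between the tables. Below is a conceptual sketch using the classic Great Expectations pandas API; the DataFrames are stand-ins for the two tables, and this is not HDV's internal code:

```python
import great_expectations as ge
import pandas as pd

# Stand-ins for the source and sink tables.
source = pd.DataFrame({"id": [1, 2, 3], "val": ["a", "b", "c"]})
sink = pd.DataFrame({"id": [1, 2, 3], "val": ["a", "b", "c"]})

# Row-count expectation: the source should hold as many rows as the sink.
count_check = ge.from_pandas(source).expect_table_row_count_to_equal(len(sink))

# Row-hash comparison: hash each row and compare the resulting sets.
source_hashes = set(pd.util.hash_pandas_object(source, index=False))
sink_hashes = set(pd.util.hash_pandas_object(sink, index=False))

print(count_check.success and source_hashes == sink_hashes)  # True if the tables match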
How To Use
- The user installs the package from PyPI with: `pip install hashmap-data-validator`
- After installation, the user needs to run a `.py` file with the following import: `import hdv` (a minimal sketch follows this list)
- Two `.yml` files will then be created in the user's home directory at the following paths: `.hashmap_data_validator/hdv_profiles.yml` and `.hashmap_data_validator/hdv_configuration.yml`
- The user then configures both of the newly created YAML files accordingly (examples below)
- After configuration is complete, the user can run validation from the command line with: `hdv`
- HDV writes the validation results to a `validation_results.html` file (created if it does not exist) in the directory where the command is called and opens that file in the browser
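As an end-to-end illustration, the bootstrap step above is a one-line script; importing hdv creates the two YAML files as a side effect (the file name below is just an example):

```python
# bootstrap_hdv.py -- the name is arbitrary; any .py file works.
# Importing hdv creates ~/.hashmap_data_validator/hdv_profiles.yml and
# ~/.hashmap_data_validator/hdv_configuration.yml on first run.
import hdv  # noqa: F401 -- imported for its side effect only
```

Run it once with `python bootstrap_hdv.py`, fill in the generated YAML files, then call `hdv` from the command line.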
Approximate Execution Times
- 2 million total rows = ~15 seconds
- 8 million total rows = ~2 minutes
- 15 million total rows = ~3.5 minutes
User Documentation
Classes
Configuration templates for HDV-supported databases. Add these YAML templates to your `hdv_configuration.yml` and configure accordingly.
Oracle
Oracle database resource.
base class: `OracleResource`
configuration:
- required
  - `env`: section name in the `hdv_profiles.yml` file for connection information
  - `table_name`: table name to validate
- optional
  - `chunk_size`: the number of rows in the table to grab in batches (iterates over the table with an offset)
  - `offset`: the number of rows to skip between table batch grabs

NOTE: Only include `chunk_size` and `offset` values if you would like to validate a sample of your tables (a sketch of this sampling pattern follows the example below).
```yaml
oracle:
  type: OracleResource
  conf:
    env: oracle
    table_name: <oracle_db>.<oracle_table>
    chunk_size: 100000  # batches are grabbed in chunks of 100000 rows (optional)
    offset: 200000      # skips 200000 rows in the table after each batch grab (optional)
```
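To make the sampling behavior concrete, here is a rough sketch of how `chunk_size` and `offset` interact; this illustrates the iteration pattern described above and is not HDV's actual implementation:

```python
# Illustration only: yield the row ranges a sampled validation would read,
# grabbing chunk_size rows per batch and skipping offset rows between batches.
def sample_batches(total_rows: int, chunk_size: int, offset: int):
    start = 0
    while start < total_rows:
        stop = min(start + chunk_size, total_rows)
        yield start, stop            # rows fetched in this batch
        start = stop + offset        # skip `offset` rows before the next batch

# With the values from the example above, a 1,000,000-row table is sampled as:
for lo, hi in sample_batches(1_000_000, chunk_size=100_000, offset=200_000):
    print(f"rows {lo}..{hi - 1}")    # 0..99999, 300000..399999, 600000..699999, 900000..999999
```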
Snowflake
Snowflake database resource.
base class: `SnowflakeResource`
configuration:
- required
  - `env`: section name in the `hdv_profiles.yml` file for connection information
  - `table_name`: table name to validate
- optional
  - `chunk_size`: the number of rows in the table to grab in batches (iterates over the table with an offset)
  - `offset`: the number of rows to skip between table batch grabs

NOTE: Only include `chunk_size` and `offset` values if you would like to validate a sample of your tables.
```yaml
snowflake:
  type: SnowflakeResource
  conf:
    env: snowflake
    table_name: <snowflake_table_name>
    chunk_size: 100000  # batches are grabbed in chunks of 100000 rows (optional)
    offset: 200000      # skips 200000 rows in the table after each batch grab (optional)
```
More database support to come
HDV Configuration Setup
After you run the Python file with `import hdv`, two configuration files will be created in your home directory under the `.hashmap_data_validator/` folder.
`hdv_profiles.yml` is where you configure your connection credentials for the databases you would like to validate. The default file will look similar to this:
```yaml
dev:
  oracle:
    host: <host>
    port: <port>
    sid: <sid>
    user: <oracle_username>
    password: <oracle_password>
    client_library_dir: <path to oracle client library files e.g. C:\instantclient-basic-windows.x64-19.9.0.0.0dbru\instantclient_19_9>
  snowflake:
    account: <account>
    role: <role>
    warehouse: <warehouse_name>
    database: <database_name>
    schema: <schema_name>
    user: <snowflake_username>
    password: <snowflake_password>
```
HDV uses this file to manage the connections to the necessary databases. HDV only supports Oracle and Snowflake at the moment, but we plan to extend support to other databases in the future. In this file, you can add your credentials for any database that HDV supports. When the validation method is called, HDV is pointed to the individual configurations in this file and connects using the corresponding credentials.
Note: you can add any number of database configurations in this file. The above is just a starting template with two databases. For example, you could have multiple Snowflake configurations; however, the names would need to be unique (e.g. `snowflake_1`, `snowflake_2`).
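For context, here is a sketch of how these profile fields map onto the two connectors the project names, snowflake-connector-python and cx_Oracle. This is illustrative glue code that assumes the generated file paths and the `dev` section shown above; it is not HDV's internal logic:

```python
import cx_Oracle
import snowflake.connector
import yaml
from pathlib import Path

# Load the "dev" section of the generated profiles file.
profiles_path = Path.home() / ".hashmap_data_validator" / "hdv_profiles.yml"
profiles = yaml.safe_load(profiles_path.read_text())["dev"]

# Oracle: the client library directory feeds cx_Oracle's thick-client init.
ora = profiles["oracle"]
cx_Oracle.init_oracle_client(lib_dir=ora["client_library_dir"])
oracle_conn = cx_Oracle.connect(
    user=ora["user"],
    password=ora["password"],
    dsn=cx_Oracle.makedsn(ora["host"], ora["port"], sid=ora["sid"]),
)

# Snowflake: the remaining fields pass straight to the connector.
sf = profiles["snowflake"]
snowflake_conn = snowflake.connector.connect(
    account=sf["account"],
    user=sf["user"],
    password=sf["password"],
    role=sf["role"],
    warehouse=sf["warehouse"],
    database=sf["database"],
    schema=sf["schema"],
)
```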
`hdv_configuration.yml` is where you configure the specifics of your validation. The file should look similar to below:
```yaml
version: 1beta
validations:
  from: oracle_1
  to: snowflake_1
resources:
  oracle_1:
    type: OracleResource
    conf:
      env: oracle
      table_name: <oracle_db>.<oracle_table>
      chunk_size: <optional>
      offset: <optional>
  snowflake_1:
    type: SnowflakeResource
    conf:
      env: snowflake
      table_name: <snowflake_table>
      chunk_size: <optional>
      offset: <optional>
  dummy_database:
    type: TestResource
    conf:
      env: test_environment
      table_name: <test_table>
      chunk_size: <optional>
      offset: <optional>
```
As you can see above, you can add any number of resources in this file. Then, to run a validation, you name the specific resources in the `validations` section of the YAML. The example above will run a validation on the `oracle_1` and `snowflake_1` resources, both of which are configured in the same file. Notice that the `env` values of the `oracle_1` and `snowflake_1` resources point to named configurations in the example `hdv_profiles.yml` shown above. The `type` values point to which HDV class type to use in validation (`SnowflakeResource` and `OracleResource` in this case). Refer above for the HDV-supported class types.
To run a validation on different tables, simply alter the `to` and `from` values to point to different resources configured in the file, or change the existing configurations.
Note: You can configure any number of resources in `hdv_configuration.yml`.
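As a final illustration, here is a minimal sketch of how the `validations` section resolves to entries under `resources`, assuming the file layout shown above (this is not part of the hdv API):

```python
import yaml
from pathlib import Path

cfg_path = Path.home() / ".hashmap_data_validator" / "hdv_configuration.yml"
cfg = yaml.safe_load(cfg_path.read_text())

# `from` and `to` name keys under `resources`; each resource carries the
# class type to instantiate and the conf used to reach the table.
source = cfg["resources"][cfg["validations"]["from"]]   # e.g. the oracle_1 block
sink = cfg["resources"][cfg["validations"]["to"]]       # e.g. the snowflake_1 block
print(source["type"], "->", sink["type"])               # OracleResource -> SnowflakeResource
```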
Download files
Source Distribution
File details
Details for the file hashmap_data_validator-0.1.0.0.tar.gz.
File metadata
- Download URL: hashmap_data_validator-0.1.0.0.tar.gz
- Upload date:
- Size: 12.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.3.0 pkginfo/1.7.0 requests/2.23.0 setuptools/47.1.1 requests-toolbelt/0.9.1 tqdm/4.56.2 CPython/3.7.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `18eb544d0827f6b7f36fc1b1ea957f1d57f429ede379e4d5eb7b46b01b6be21b` |
| MD5 | `7a74e9142548491c08392b8caa371c21` |
| BLAKE2b-256 | `53de427b7c50f16f33dff9a6f9435e5e709b9ea03415925dc581d0047d355005` |