A Python Package designed to validate data sources and sinks
Hashmap Data Validator
About
Hashmap Data Validator, or HDV, is a tool used to validate the data in two database tables. HDV uses Great Expectations to run validation on the two tables. It currently supports connections to Snowflake via snowflake-connector-python and to Oracle via cx_Oracle. HDV currently runs expectations on row count and row hash values to validate the tables.
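To make the idea of row count and row hash validation concrete, here is a minimal sketch in plain Python. It is not HDV's implementation and does not use Great Expectations; the function names `row_hash` and `compare_tables`, the `|` separator, and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib


def row_hash(row):
    """Hash one row by joining its stringified values with a separator.

    Hypothetical helper: the separator and hash algorithm are illustrative
    choices, not something HDV documents.
    """
    joined = "|".join("" if value is None else str(value) for value in row)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def compare_tables(source_rows, target_rows):
    """Compare two in-memory tables by row count and by set of row hashes."""
    source_hashes = {row_hash(row) for row in source_rows}
    target_hashes = {row_hash(row) for row in target_rows}
    return {
        "row_count_match": len(source_rows) == len(target_rows),
        "missing_in_target": len(source_hashes - target_hashes),
        "unexpected_in_target": len(target_hashes - source_hashes),
    }


# Two small stand-in tables that differ in one value.
source = [(1, "alice"), (2, "bob"), (3, None)]
target = [(1, "alice"), (2, "bob"), (3, "carol")]
result = compare_tables(source, target)
print(result)
```

In this sketch the row counts agree while one row hash differs on each side, which is the kind of mismatch a count-only check would miss.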
How To Use
- Install the package from PyPI with:
pip install hashmap-data-validator
- After installation, run a .py file containing the following import: import hdv
- Two .yml files will then be created in your home directory at the following paths: .hashmap_data_validator/hdv_profiles.yml and .hashmap_data_validator/hdv_configuration.yml
- Configure both of the newly created YAML files accordingly. (Examples below)
- After configuration is complete, run the validation from the command line with:
hdv
- HDV writes the validation results to a validation_results.html file (creating it if it does not exist) in the directory where the command is run, and opens that file in your browser.
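Before editing the YAML files, it can help to confirm that they were created where expected. The following is a small sketch that only uses the standard library; `expected_config_paths` is a hypothetical helper name, and the paths are taken from the steps above.

```python
from pathlib import Path


def expected_config_paths(home=None):
    """Return the two config file paths described above, under the given home.

    Hypothetical helper: it only builds paths and does not import or call hdv.
    """
    home = Path(home) if home is not None else Path.home()
    config_dir = home / ".hashmap_data_validator"
    return [
        config_dir / "hdv_profiles.yml",
        config_dir / "hdv_configuration.yml",
    ]


# Report whether each file is present in the current user's home directory.
for path in expected_config_paths():
    status = "found" if path.exists() else "missing"
    print(f"{path}: {status}")
```

If either file is reported as missing, rerun the .py file that contains import hdv.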
Time To Execute Method
- 2 million total rows = ~15 seconds
- 8 million total rows = ~2 minutes
- 15 million total rows = ~3.5 minutes
User Documentation
Classes
Configuration templates for HDV supported databases. Add these YAML templates to your hdv_configuration.yml and configure accordingly.
Oracle
Oracle database resource.
base class: OracleResource
configuration
- required
  env: section name in the hdv profile yml file for connection information
  table_name: table name to validate
- optional
  chunk_size: the number of rows in the table to grab in batches (iterates over table with offset)
  offset: the number of rows to skip between table batch grabs
NOTE: Only include chunk_size and offset values if you would like to validate a sample of your tables.
oracle:
  type: OracleResource
  conf:
    env: oracle
    table_name: <oracle_db>.<oracle_table>
    chunk_size: 100000 # batches are grabbed in chunks of 100000 rows (optional)
    offset: 200000 # skips every 200000 rows in the table after a batch grab (optional)
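One way to picture how chunk_size and offset sample a table is as alternating "read" and "skip" windows. The sketch below follows the wording of the comments above (read chunk_size rows, then skip offset rows); it is an interpretation for illustration, not HDV's actual query logic, and `sampled_ranges` is a hypothetical name.

```python
def sampled_ranges(total_rows, chunk_size, offset):
    """Yield half-open (start, end) row ranges that would be read.

    Assumed semantics: read chunk_size rows, skip offset rows, repeat
    until the end of the table.
    """
    if chunk_size <= 0 or offset < 0:
        raise ValueError("chunk_size must be positive and offset non-negative")
    start = 0
    while start < total_rows:
        end = min(start + chunk_size, total_rows)
        yield (start, end)
        start = end + offset  # skip `offset` rows after each batch


ranges = list(sampled_ranges(total_rows=1_000_000, chunk_size=100_000, offset=200_000))
sampled = sum(end - start for start, end in ranges)
print(ranges)
print(f"rows sampled: {sampled} of 1000000")
```

Under these assumed semantics, the values from the template would read 4 batches covering 400,000 of 1,000,000 rows.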
Snowflake
Snowflake database resource.
base class: SnowflakeResource
configuration
- required
  env: section name in the hdv profile yml file for connection information
  table_name: table name to validate
- optional
  chunk_size: the number of rows in the table to grab in batches (iterates over table with offset)
  offset: the number of rows to skip between table batch grabs
NOTE: Only include chunk_size and offset values if you would like to validate a sample of your tables.
snowflake:
  type: SnowflakeResource
  conf:
    env: snowflake
    table_name: <snowflake_table_name>
    chunk_size: 100000 # batches are grabbed in chunks of 100000 rows (optional)
    offset: 200000 # skips every 200000 rows in the table after a batch grab (optional)
More database support to come
HDV Configuration Setup
After you run the Python file with import hdv, two configuration files will be created in your home directory under the .hashmap_data_validator/ folder.
hdv_profiles.yml is where you configure your connection credentials for the databases you would like to validate. The default file will look similar to this:
dev:
  oracle:
    host: <host>
    port: <port>
    sid: <sid>
    user: <oracle_username>
    password: <oracle_password>
    client_library_dir: <path to oracle client library files e.g. C:\instantclient-basic-windows.x64-19.9.0.0.0dbru\instantclient_19_9>
  snowflake:
    account: <account>
    role: <role>
    warehouse: <warehouse_name>
    database: <database_name>
    schema: <schema_name>
    user: <snowflake_username>
    password: <snowflake_password>
HDV uses this file to manage the connections to the necessary databases. HDV only supports Oracle and Snowflake at the moment, but we plan to extend support to other databases in the future. In this file, you can add your credentials for any database that HDV supports. When the validation method is called, HDV is pointed to the individual configurations in this file and connects using the corresponding credentials.
Note: you can add any number of database configurations in this file. The above is just a starting template with two databases. For example, you could have multiple Snowflake configurations. However, the names would need to be unique (e.g. snowflake_1, snowflake_2).
hdv_configuration.yml is where you configure the specifics of your validation. The file should look similar to the example below:
version: 1beta
validations:
  from: oracle_1
  to: snowflake_1
resources:
  oracle_1:
    type: OracleResource
    conf:
      env: oracle
      table_name: <oracle_db>.<oracle_table>
      chunk_size: <optional>
      offset: <optional>
  snowflake_1:
    type: SnowflakeResource
    conf:
      env: snowflake
      table_name: <snowflake_table>
      chunk_size: <optional>
      offset: <optional>
  dummy_database:
    type: TestResource
    conf:
      env: test_environment
      table_name: <test_table>
      chunk_size: <optional>
      offset: <optional>
As shown above, you can add any number of resources to this file. To run a validation, reference the specific resources in the validations section of the YAML.
The example above runs a validation on the oracle_1 and snowflake_1 resources, both of which are configured in the same file. Notice that the env values of the oracle_1 and snowflake_1 resources point to named configurations in the example hdv_profiles.yml shown above. The type values specify which HDV class type to use in validation (SnowflakeResource and OracleResource in this case). Refer to the HDV supported class types above.
To run a validation on different tables, either change the to and from values to point to other resources configured in the file, or edit the existing resource configurations.
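The relationship between the two files can be sketched as a pair of dictionary lookups: validations names a resource, and that resource's env names a section in the profiles file. The snippet below uses plain dicts standing in for the parsed YAML so that it runs without extra packages. The `resolve` function is hypothetical, and reading the connection from under the `dev` key is an assumption about how HDV selects a profile.

```python
# Stand-ins for the parsed contents of hdv_profiles.yml and hdv_configuration.yml.
profiles = {
    "dev": {
        "oracle": {"host": "<host>", "user": "<oracle_username>"},
        "snowflake": {"account": "<account>", "user": "<snowflake_username>"},
    }
}
configuration = {
    "version": "1beta",
    "validations": {"from": "oracle_1", "to": "snowflake_1"},
    "resources": {
        "oracle_1": {
            "type": "OracleResource",
            "conf": {"env": "oracle", "table_name": "<oracle_db>.<oracle_table>"},
        },
        "snowflake_1": {
            "type": "SnowflakeResource",
            "conf": {"env": "snowflake", "table_name": "<snowflake_table>"},
        },
    },
}


def resolve(side, configuration, profiles, profile_name="dev"):
    """Follow validations[side] -> resources[name] -> profiles[profile][env].

    Hypothetical helper; profile_name="dev" is an assumption for illustration.
    """
    resource_name = configuration["validations"][side]
    resource = configuration["resources"][resource_name]
    env = resource["conf"]["env"]
    return {
        "resource": resource_name,
        "type": resource["type"],
        "table_name": resource["conf"]["table_name"],
        "connection": profiles[profile_name][env],
    }


source = resolve("from", configuration, profiles)
target = resolve("to", configuration, profiles)
print(source["resource"], "->", target["resource"])
```

Pointing from or to at a different resource name changes which table and which credentials are picked up, without touching the resource definitions themselves.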
Note: You can configure any number of resources in hdv_configuration.yml