Calculate the Unique Numeric Fingerprint for Tabular Data
IDHash
Efficiently create an identifying hash for a dataset that is:
- Independent of row ordering
- Dependent on column ordering
Quickly identify if two datasets are identical by comparing their hashes, without needing to presort the values.
IDHash is based upon UNFv6 (Universal Numerical Fingerprint, version 6); details around UNF can be found in the UNF specification. The major difference is that UNF is column-invariant but row-dependent, whereas IDHash is column-dependent and row-invariant.
Example - Hash Entire Dataset
import pandas as pd
import pyarrow as pa
from idhash import id_hash
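def hash_pd(df: pd.DataFrame) -> int:
    # Column dtypes as strings, in column order
    dtypes = [str(df.dtypes[x]) for x in df.columns]
    # Convert the frame passed in (not a global) to Arrow record batches
    df_batches = pa.Table.from_pandas(df).to_batches()
    return id_hash(df_batches, df.columns, dtypes)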
x=pd.DataFrame.from_dict({'a': [1,2,3]})
print(hash_pd(x))
> 259167810065665855969772359546814925541
Example - Hash Iteratively
import pandas as pd
import pyarrow as pa
from idhash import id_hash, IDHasher
from typing import List
def create_hasher(columns: List[str], dtypes: pd.Series) -> IDHasher:
    dtypes = [str(dtypes[x]) for x in columns]
    return IDHasher(field_names=columns, field_types=dtypes)
x=pd.DataFrame.from_dict({'a': [1,2,3]})
hasher = create_hasher(x.columns, x.dtypes)
batches = pa.Table.from_pandas(x).to_batches(max_chunksize=1)
for batch in batches:
    hasher.write_batches([batch], delta="Add")
print(hasher.finalize())
> 259167810065665855969772359546814925541
Iterative hashing has an additional benefit: it's possible to verify a delta between two datasets, for example:
# load_data() is a placeholder for your own loader. It is expected to return
# (dataset_a: pd.DataFrame, dataset_b: pd.DataFrame, delta: List[pa.RecordBatch])
dataset_a, dataset_b, delta = load_data()
hasher_a = create_hasher(dataset_a.columns, dataset_a.dtypes)
hasher_a.write_batches(pa.Table.from_pandas(dataset_a).to_batches(), delta="Add")
hasher_b = create_hasher(dataset_b.columns, dataset_b.dtypes)
hasher_b.write_batches(pa.Table.from_pandas(dataset_b).to_batches(), delta="Add")
# Applying the delta to dataset_a should reproduce the hash of dataset_b
hasher_a.write_batches(delta, delta="Add")
assert hasher_a.finalize() == hasher_b.finalize()
Preprocessing
Each column has specific pre-processing according to the UNF definition. This mostly consists of ensuring that floating point values, datetimes, and timestamps are representable consistently across datasets when taking into account floating point epsilon.
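As a rough illustration of why this normalisation matters, the sketch below renders floats with a fixed number of significant digits before they would be hashed. It is not IDHash's or UNF's exact encoding: the `normalise_float` helper is hypothetical, and the 7-digit setting is an assumption made for the example.

```python
def normalise_float(value: float, digits: int = 7) -> str:
    """Render a float with a fixed number of significant digits (illustrative only)."""
    # One leading digit plus (digits - 1) decimals gives `digits` significant digits.
    return f"{value:.{digits - 1}e}"


a = 0.1 + 0.2  # differs from 0.3 in the last bits of the binary representation
b = 0.3

raw_equal = a == b
normalised_equal = normalise_float(a) == normalise_float(b)
print(raw_equal, normalised_equal)
```

Without a step like this, two datasets holding the "same" values computed along different paths could hash differently.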
Hash Generation
Each row is taken as a single bytestream and hashed using Murmurhash128. Murmurhash is a non-cryptographic hash function that produces a well-distributed hash for each individual value. By adding (wrapping around f64::max) the individual hashed primitives, a final hash can be produced for the final dataset that does not take into account duplicates.
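The idea can be illustrated with a small pure-Python sketch. This is not IDHash's implementation: `hashlib.blake2b` with a 16-byte digest stands in for Murmurhash128 (which is not in the standard library), the `row_hash` and `dataset_hash` helpers and the row serialisation are hypothetical, and wrapping at 2**128 is an assumption chosen to match the digest width.

```python
import hashlib

MODULUS = 2 ** 128  # assumption: wrap at the width of a 128-bit digest


def row_hash(row) -> int:
    """Hash one row as a single bytestream, values taken in column order."""
    # repr() plus a separator keeps (1, 23) and (12, 3) distinct; illustrative only.
    payload = "\x1f".join(repr(v) for v in row).encode("utf-8")
    digest = hashlib.blake2b(payload, digest_size=16).digest()
    return int.from_bytes(digest, "big")


def dataset_hash(rows) -> int:
    """Sum the row hashes with wrap-around; addition commutes, so row order is irrelevant."""
    total = 0
    for row in rows:
        total = (total + row_hash(row)) % MODULUS
    return total


rows = [(1, "a"), (2, "b"), (3, "c")]
shuffled = [rows[2], rows[0], rows[1]]
swapped_columns = [(b, a) for a, b in rows]

same_when_reordered = dataset_hash(rows) == dataset_hash(shuffled)
differs_when_columns_swap = dataset_hash(rows) != dataset_hash(swapped_columns)
print(same_when_reordered, differs_when_columns_swap)
```

Because each row is serialised in column order before hashing, swapping columns changes every row hash, while reordering rows only changes the order of the additions.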
Checking for Equality + Delta
As the hashed rows are added to each other to produce the final value, it is also possible to remove rows against the final hash by producing a row hash in the same manner as was originally performed.
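A minimal sketch of this add/remove behaviour follows, under the same kind of assumptions: `AdditiveHasher` is a hypothetical accumulator rather than the `IDHasher` class, `hashlib.blake2b` stands in for Murmurhash128, and the 2**128 modulus is assumed for illustration.

```python
import hashlib

MODULUS = 2 ** 128  # assumption: wrap at the width of a 128-bit digest


def row_hash(row) -> int:
    """Hash one row as a single bytestream (illustrative serialisation)."""
    payload = "\x1f".join(repr(v) for v in row).encode("utf-8")
    return int.from_bytes(hashlib.blake2b(payload, digest_size=16).digest(), "big")


class AdditiveHasher:
    """Hypothetical accumulator that supports adding and removing rows."""

    def __init__(self) -> None:
        self.total = 0

    def add(self, rows) -> None:
        for row in rows:
            self.total = (self.total + row_hash(row)) % MODULUS

    def remove(self, rows) -> None:
        # Subtracting the same row hash undoes the earlier addition.
        for row in rows:
            self.total = (self.total - row_hash(row)) % MODULUS


old = [(1, "a"), (2, "b"), (3, "c")]
new = [(1, "a"), (3, "c"), (4, "d")]

hasher = AdditiveHasher()
hasher.add(old)
hasher.remove([(2, "b")])  # row dropped between old and new
hasher.add([(4, "d")])     # row introduced in new

reference = AdditiveHasher()
reference.add(new)

delta_matches = hasher.total == reference.total
print(delta_matches)
```

Only the changed rows need to be rehashed to move from one dataset's hash to the other's, which is what makes delta verification cheap.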
Data Processing
IDHash operates over Apache Arrow RecordBatches and can process the batches without copying the underlying data (zero-copy).