Calculate the Unique Numeric Fingerprint for Tabular Data

IDHash

Efficiently create an identifiable hash for a dataset that is:

  • Independent of row ordering
  • Dependent on column ordering

Quickly identify if two datasets are identical by comparing their hashes, without needing to presort the values.

IDHash is based upon UNFv6; details on UNF can be found here. The major difference is that UNF is column-invariant but row-dependent, whereas IDHash is column-dependent and row-invariant.

Example - Hash Entire Dataset

import pandas as pd
import pyarrow as pa
from idhash import id_hash

def hash_pd(df: pd.DataFrame) -> int:
    # Column dtypes as strings, in column order
    dtypes = [str(df.dtypes[x]) for x in df.columns]
    # Convert the DataFrame passed in (not a global) to Arrow record batches
    df_batches = pa.Table.from_pandas(df).to_batches()
    return id_hash(df_batches, df.columns, dtypes)

x = pd.DataFrame.from_dict({'a': [1, 2, 3]})
print(hash_pd(x))
> 259167810065665855969772359546814925541

Example - Hash Iteratively

import pandas as pd
import pyarrow as pa
from idhash import id_hash, IDHasher
from typing import List

def create_hasher(columns: List[str], dtypes: pd.Series) -> IDHasher:
    dtypes = [str(dtypes[x]) for x in columns]
    return IDHasher(field_names=columns, field_types=dtypes)

x=pd.DataFrame.from_dict({'a': [1,2,3]})
hasher = create_hasher(x.columns, x.dtypes)
batches = pa.Table.from_pandas(x).to_batches(max_chunksize=1)
for batch in batches:
    hasher.write_batches([batch], delta="Add")
print(hasher.finalize())
> 259167810065665855969772359546814925541

Iterative hashing has an additional benefit: it is possible to verify a delta between two datasets. For example:

# load_data() is a placeholder returning two pd.DataFrames and a
# List[pa.RecordBatch] holding the rows to add to dataset_a to obtain dataset_b
dataset_a, dataset_b, delta = load_data()
hasher_a = create_hasher(dataset_a.columns, dataset_a.dtypes)
hasher_a.write_batches(pa.Table.from_pandas(dataset_a).to_batches(), delta="Add")
hasher_b = create_hasher(dataset_b.columns, dataset_b.dtypes)
hasher_b.write_batches(pa.Table.from_pandas(dataset_b).to_batches(), delta="Add")

# Applying the delta to dataset_a's hasher should reproduce dataset_b's hash
hasher_a.write_batches(delta, delta="Add")
assert hasher_a.finalize() == hasher_b.finalize()

Preprocessing

Each column is pre-processed according to the UNF definition. This mostly consists of ensuring that floating point values, datetimes, and timestamps have a consistent representation across datasets, taking floating point epsilon into account.
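As a rough illustration of why this normalisation matters, the sketch below renders floats with a fixed number of significant digits before they would be hashed. The function name normalize_float and the choice of 7 digits are assumptions made for this example; it is not IDHash's actual preprocessing code and does not reproduce the exact UNF string format.

```python
import math

def normalize_float(value: float, digits: int = 7) -> str:
    """Render a float with a fixed number of significant digits (illustrative only)."""
    if math.isnan(value):
        return "nan"
    if math.isinf(value):
        return "+inf" if value > 0 else "-inf"
    # Exponential notation with (digits - 1) decimals keeps `digits` significant digits
    return f"{value:+.{digits - 1}e}"

a = 0.1 + 0.2   # differs from 0.3 by floating point error
b = 0.3
print(a == b)                                    # False
print(normalize_float(a) == normalize_float(b))  # True
```

Two values that differ only by floating point error map to the same string, so rows containing them would hash identically.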

Hash Generation

Each row is taken as a single bytestream and hashed using Murmurhash128. Murmurhash is a non-cryptographic hash function that produces a well-distributed hash for each individual value. By adding the individual row hashes together (wrapping around f64::max), a final hash is produced for the whole dataset that does not take into account duplicates.
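The idea of summing per-row hashes can be sketched in pure Python. This is a simplified stand-in rather than IDHash's implementation: it uses BLAKE2b truncated to 128 bits in place of Murmurhash128 (which is not in the standard library), serialises rows with repr, and wraps at 2**128, all of which are assumptions made for illustration. The names row_hash and dataset_hash are hypothetical.

```python
import hashlib
from typing import Iterable, Sequence

MODULUS = 2 ** 128  # assumption: wrap at the 128-bit hash width

def row_hash(row: Sequence[object]) -> int:
    """Hash one row as a single bytestream (BLAKE2b stands in for Murmurhash128)."""
    # Fields are joined in column order, so swapping columns changes the bytes
    payload = "\x1f".join(repr(value) for value in row).encode("utf-8")
    digest = hashlib.blake2b(payload, digest_size=16).digest()
    return int.from_bytes(digest, "big")

def dataset_hash(rows: Iterable[Sequence[object]]) -> int:
    """Sum the row hashes with wraparound; addition is commutative, so row order is irrelevant."""
    total = 0
    for row in rows:
        total = (total + row_hash(row)) % MODULUS
    return total

rows = [(1, "a"), (2, "b"), (3, "c")]
shuffled = [rows[2], rows[0], rows[1]]
swapped_columns = [(second, first) for first, second in rows]

print(dataset_hash(rows) == dataset_hash(shuffled))         # same rows, different order
print(dataset_hash(rows) == dataset_hash(swapped_columns))  # same values, different column order
```

Because addition commutes, the shuffled dataset produces the same total, while reordering the columns changes every row's bytestream and therefore the result.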

Checking for Equality + Delta

As the hashed rows are added to each other to produce the final value, it is also possible to remove rows from the final hash: compute each row's hash in the same manner as was originally performed and subtract it.
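A minimal sketch of this add and remove behaviour follows. The class AdditiveHasher is hypothetical and is not the IDHasher API; BLAKE2b again stands in for Murmurhash128, and the 2**128 modulus is an assumption.

```python
import hashlib
from typing import Sequence

MODULUS = 2 ** 128  # assumption: wrap at the 128-bit hash width

def row_hash(row: Sequence[object]) -> int:
    """Hash one row as a single bytestream (BLAKE2b stands in for Murmurhash128)."""
    payload = "\x1f".join(repr(value) for value in row).encode("utf-8")
    return int.from_bytes(hashlib.blake2b(payload, digest_size=16).digest(), "big")

class AdditiveHasher:
    """Hypothetical running hash that supports adding and removing rows."""

    def __init__(self) -> None:
        self.total = 0

    def add(self, row: Sequence[object]) -> None:
        self.total = (self.total + row_hash(row)) % MODULUS

    def remove(self, row: Sequence[object]) -> None:
        # Subtraction modulo MODULUS exactly undoes an earlier add of the same row
        self.total = (self.total - row_hash(row)) % MODULUS

# Start from the old dataset, then apply a delta: drop one row, add another
old = AdditiveHasher()
for row in [(1, "a"), (2, "b"), (3, "c")]:
    old.add(row)
old.remove((2, "b"))
old.add((4, "d"))

# Hash the new dataset from scratch for comparison
new = AdditiveHasher()
for row in [(1, "a"), (3, "c"), (4, "d")]:
    new.add(row)

print(old.total == new.total)  # True
```

The updated hash matches the hash of the new dataset computed from scratch, without rehashing the unchanged rows.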

Data Processing

IDHash operates over Apache Arrow RecordBatches and can process the batches without copying them (zero-copy).
