Calculate the Unique Numeric Fingerprint for Tabular Data

Project description

IDHash

Efficiently create an identifiable hash for a dataset that is:

  • Independent of row ordering
  • Dependent on column ordering

Quickly identify if two datasets are identical by comparing their hashes, without needing to presort the values.

IDHash is based on UNFv6; details can be found in the UNF documentation. The major difference is that UNF is column-invariant but row-dependent, whereas IDHash is column-dependent and row-invariant.

Example - Hash Entire Dataset

import pandas as pd
from idhash import id_hash

# Hash the whole DataFrame in a single call
x = pd.DataFrame.from_dict({'a': [1, 2, 3]})
print(id_hash(x))
> 128759287229449989335676029604592902962

Example - Hash Iteratively

import pandas as pd
import pyarrow as pa
from idhash import id_hash, create_hasher
from typing import List

x = pd.DataFrame.from_dict({'a': [1, 2, 3]})
# The hasher is created from the column names and dtypes
hasher = create_hasher([str(y) for y in x.columns], x.dtypes)
# Split the table into record batches of at most one row each
batches = pa.Table.from_pandas(x).to_batches(max_chunksize=1)
for batch in batches:
    hasher.write_batches([batch], delta="Add")
# The result matches the hash of the entire dataset
print(hasher.finalize())
> 128759287229449989335676029604592902962

Iterative hashing has an additional benefit: it is possible to verify a delta between two datasets. For example:

# load_data() stands in for your own loading code; it returns
# (pd.DataFrame, pd.DataFrame, List[pa.RecordBatch])
dataset_a, dataset_b, delta = load_data()
hasher_a = create_hasher([str(c) for c in dataset_a.columns], dataset_a.dtypes)
hasher_a.write_batches(pa.Table.from_pandas(dataset_a).to_batches(), delta="Add")
hasher_b = create_hasher([str(c) for c in dataset_b.columns], dataset_b.dtypes)
hasher_b.write_batches(pa.Table.from_pandas(dataset_b).to_batches(), delta="Add")

# Applying the delta to dataset_a's hasher should reproduce dataset_b's hash
hasher_a.write_batches(delta, delta="Add")
assert hasher_a.finalize() == hasher_b.finalize()

Preprocessing

Each column is preprocessed according to the UNF definition. This mostly consists of ensuring that floating-point values, datetimes, and timestamps are represented consistently across datasets, taking floating-point epsilon into account.
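To illustrate why this normalization matters, here is a minimal sketch of rounding floats to a fixed number of significant digits before hashing. It is not IDHash's actual preprocessing. The helper name normalize_float and the choice of 7 significant digits are assumptions made for illustration only.

```python
def normalize_float(value: float, digits: int = 7) -> str:
    """Render a float with a fixed number of significant digits.

    Hypothetical helper: both the name and the default of 7 digits
    are assumptions for this sketch, not part of the idhash API.
    """
    # Exponential notation with (digits - 1) decimals keeps exactly
    # `digits` significant digits after rounding.
    return format(value, f".{digits - 1}e")


# 0.1 + 0.2 and 0.3 differ in their last binary digits...
assert (0.1 + 0.2) != 0.3
# ...but normalize to the same text, so they would hash identically.
assert normalize_float(0.1 + 0.2) == normalize_float(0.3)
```

Hashing the normalized text rather than the raw bytes means two datasets that differ only by floating-point rounding noise produce the same row hashes.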

Hash Generation

Each row is treated as a single byte stream and hashed using MurmurHash128. MurmurHash is a non-cryptographic hash function that produces a well-distributed hash for each individual value. Adding the individual hashed primitives together (wrapping around f64::max) produces a final hash for the dataset that does not take duplicates into account.
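The idea of summing per-row hashes can be sketched in plain Python. This is an illustration under stated assumptions, not the library's implementation. It swaps in BLAKE2b from the standard library (truncated to 128 bits) for MurmurHash128, serializes rows with repr, and wraps at 2**128. The names row_hash, dataset_hash and MODULUS are hypothetical.

```python
import hashlib
from typing import Iterable, Sequence

# Assumed wrap-around point for this sketch: a 128-bit unsigned integer.
MODULUS = 2 ** 128


def row_hash(row: Sequence[object]) -> int:
    """Hash one row as a single byte stream (BLAKE2b stands in for MurmurHash128)."""
    # Join values in column order, so swapping columns changes the bytes.
    data = "\x1f".join(repr(v) for v in row).encode("utf-8")
    digest = hashlib.blake2b(data, digest_size=16).digest()
    return int.from_bytes(digest, "little")


def dataset_hash(rows: Iterable[Sequence[object]]) -> int:
    """Sum the row hashes with wrap-around; addition is order-independent."""
    total = 0
    for row in rows:
        total = (total + row_hash(row)) % MODULUS
    return total


# Row order does not matter...
assert dataset_hash([(1, "a"), (2, "b")]) == dataset_hash([(2, "b"), (1, "a")])
# ...but column order does.
assert dataset_hash([(1, "a")]) != dataset_hash([("a", 1)])
```

Because addition is commutative, the total is the same for any row order, while the per-row byte stream keeps the result sensitive to column order.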

Checking for Equality + Delta

As the hashed rows are added to each other to produce the final value, it is also possible to remove rows from the final hash: compute each row's hash in the same manner as before and subtract it.
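Removal by subtraction can be sketched as follows. As before, this is an assumption-laden illustration rather than IDHash's code: BLAKE2b stands in for MurmurHash128, the modulus of 2**128 is assumed, and row_hash, add_row and remove_row are hypothetical names.

```python
import hashlib
from typing import Sequence

# Assumed wrap-around point for this sketch.
MODULUS = 2 ** 128


def row_hash(row: Sequence[object]) -> int:
    """Hash one row as a single byte stream (BLAKE2b stands in for MurmurHash128)."""
    data = "\x1f".join(repr(v) for v in row).encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=16).digest(), "little")


def add_row(total: int, row: Sequence[object]) -> int:
    """Fold a row into the running dataset hash."""
    return (total + row_hash(row)) % MODULUS


def remove_row(total: int, row: Sequence[object]) -> int:
    """Undo add_row by subtracting the same row hash, wrapping below zero."""
    return (total - row_hash(row)) % MODULUS


# Build a hash over three rows, then remove the middle one.
total = 0
for r in [(1, "a"), (2, "b"), (3, "c")]:
    total = add_row(total, r)
total = remove_row(total, (2, "b"))

# The result equals the hash of a dataset that never contained that row.
assert total == add_row(add_row(0, (1, "a")), (3, "c"))
```

This is what makes delta verification cheap: only the changed rows need to be rehashed, not the whole dataset.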

Data Processing

IDHash operates over Apache Arrow RecordBatches and can process the batches without copying them (zero-copy).

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release. See tutorial on generating distribution archives.

Built Distributions

idhash-0.3.0-pp38-pypy38_pp73-manylinux_2_5_x86_64.manylinux1_x86_64.whl (1.3 MB)

Uploaded PyPy manylinux: glibc 2.5+ x86-64

idhash-0.3.0-pp37-pypy37_pp73-manylinux_2_5_x86_64.manylinux1_x86_64.whl (1.3 MB)

Uploaded PyPy manylinux: glibc 2.5+ x86-64

idhash-0.3.0-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.whl (1.3 MB)

Uploaded CPython 3.10 manylinux: glibc 2.5+ x86-64

idhash-0.3.0-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.whl (1.3 MB)

Uploaded CPython 3.9 manylinux: glibc 2.5+ x86-64

idhash-0.3.0-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.whl (1.3 MB)

Uploaded CPython 3.8 manylinux: glibc 2.5+ x86-64

idhash-0.3.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.whl (1.3 MB)

Uploaded CPython 3.7m manylinux: glibc 2.5+ x86-64

File details

Details for the file idhash-0.3.0-pp38-pypy38_pp73-manylinux_2_5_x86_64.manylinux1_x86_64.whl.

File hashes

Hashes for idhash-0.3.0-pp38-pypy38_pp73-manylinux_2_5_x86_64.manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 85ebb65f9867af87b141e8316e88957df5a83777302d41006863df84cadbba57
MD5 1eebf901ce8d971b34c607434daf3d29
BLAKE2b-256 7007c356f0b4d576d420db5468c13d8ce827161cb948e28a82dde834bfecc153

See more details on using hashes here.

File details

Details for the file idhash-0.3.0-pp37-pypy37_pp73-manylinux_2_5_x86_64.manylinux1_x86_64.whl.

File hashes

Hashes for idhash-0.3.0-pp37-pypy37_pp73-manylinux_2_5_x86_64.manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 f59d8d4b862d3c65c45dcf089890006ce3c5e716050006f1504dabf64c9ee9af
MD5 7de0d7cab576e0a4d0834eca662a114f
BLAKE2b-256 61b32dc773aa907af7f8768ce5ef8aa8144b4b7b321f37341bf24212fc20e65a

File details

Details for the file idhash-0.3.0-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.whl.

File hashes

Hashes for idhash-0.3.0-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 c9022c9e5c186e11155cd22b504e472596f90d097e7e35a180a9c85b78607eda
MD5 6627c7eed80701165bb40089bda4f85e
BLAKE2b-256 d4f5fbd3232be221e54ad4a5915f5421747d0bd7de7298089b8e4641ad697d64

File details

Details for the file idhash-0.3.0-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.whl.

File metadata

  • Download URL: idhash-0.3.0-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.whl
  • Upload date:
  • Size: 1.3 MB
  • Tags: CPython 3.9, manylinux: glibc 2.5+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.2 importlib_metadata/4.8.1 pkginfo/1.7.1 requests/2.22.0 requests-toolbelt/0.9.1 tqdm/4.62.2 CPython/3.8.10

File hashes

Hashes for idhash-0.3.0-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 3fe88a105c23db2d84ea63a4ebeee34f9bf291e4fc1cf08f170ee868112a4d71
MD5 3775909baebaf47b14c17110b554e659
BLAKE2b-256 707853b3b8c0fce1569c8efd97574bbe1db3bdcc0da2d4334571198cbc0338ca

File details

Details for the file idhash-0.3.0-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.whl.

File metadata

  • Download URL: idhash-0.3.0-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.whl
  • Upload date:
  • Size: 1.3 MB
  • Tags: CPython 3.8, manylinux: glibc 2.5+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.2 importlib_metadata/4.8.1 pkginfo/1.7.1 requests/2.22.0 requests-toolbelt/0.9.1 tqdm/4.62.2 CPython/3.8.10

File hashes

Hashes for idhash-0.3.0-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 58f06545b1a785f6f0441d467ee94dcb58c6ad8267d6e01c58e95b50a6c27191
MD5 b7bc4dd2140fa22e0b200f0a3c3efe6d
BLAKE2b-256 b9698ce475478015e31fd7ac28728152f99dfab2a726f121269ac21afe414b1c

File details

Details for the file idhash-0.3.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.whl.

File hashes

Hashes for idhash-0.3.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 cc2d56b438267b2197a5dd3723298dcdca5878da99e7d521587bfcbf2e01967f
MD5 f805fabdf6d58388e431a81f00c8e81c
BLAKE2b-256 43c49ea69ba292d799852dc4624dd39a8e09f33f72a1057ad08395907cb658dc
