Calculate the Unique Numeric Fingerprint for Tabular Data

IDHash

Efficiently create an identifiable hash for a dataset that is independent of row ordering but dependent on column ordering. The core purpose of IDHash is to quickly determine whether two datasets are the same without requiring both to be on the same machine. It also extends to checking whether one dataset is a modified version of another, provided the modification is known upfront.

IDHash is based on UNFv6 (details about UNF can be found here). The major difference is that UNF is column-invariant but row-dependent, whereas IDHash is column-dependent and row-invariant.

Examples

import pandas as pd
import pyarrow as pa
from idhash import id_hash

def hash_pd(df: pd.DataFrame) -> int:
    # Column dtypes as strings, in column order
    dtypes = [str(df.dtypes[x]) for x in df.columns]
    # Convert the DataFrame (not the global `x`) into Arrow RecordBatches
    df_batches = pa.Table.from_pandas(df).to_batches()
    return id_hash(df_batches, df.columns, dtypes)

x = pd.DataFrame.from_dict({'a': [1, 2, 3]})
print(hash_pd(x))
> 259167810065665855969772359546814925541

Method Drawbacks

Duplicate Identification

Due to the requirement for row-invariance, a dataset of

    A    B
    1    2
    1    2
    1    2

will produce the same hash as:

    A    B
    1    2

but not the same as:

    A    B
    1    2
    1    2

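This happens because the row hashes are XORed together: pairs of identical rows cancel out, so an odd number of copies of a row hashes like a single copy, and an even number of copies hashes like none. In practice, exact duplicate rows are relatively uncommon, and for the core use case of machine learning datasets this is not a primary concern.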
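The cancellation can be illustrated with a plain integer standing in for a row hash. This is a toy sketch: `combine` and the value of `h` are hypothetical and not part of the IDHash API.

```python
from functools import reduce
from operator import xor

def combine(row_hashes):
    """XOR a sequence of per-row hashes into one dataset hash."""
    return reduce(xor, row_hashes, 0)

h = 0x9E3779B97F4A7C15  # arbitrary stand-in for the hash of the row (1, 2)

three_copies = combine([h, h, h])  # equals h: two of the copies cancel
one_copy = combine([h])            # equals h
two_copies = combine([h, h])       # equals 0: both copies cancel
```

Three copies and one copy collapse to the same value, while two copies collapse to zero, which matches the three tables above.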

Preprocessing

Each column is pre-processed according to its type, as specified by the UNF definition. This mostly consists of ensuring that floating-point values and timestamps (timestamps are currently unsupported in IDHash) are represented consistently across datasets, taking floating-point epsilon into account.
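As a rough illustration of why this normalization matters, the sketch below rounds floats to a fixed number of significant digits before they would be hashed. The helper `normalize_float` is hypothetical, the default of 7 significant digits is an assumption borrowed from UNF v6, and the exact byte encoding used by UNF and IDHash is not reproduced here.

```python
def normalize_float(value: float, digits: int = 7) -> str:
    """Format a float with `digits` significant digits in exponent notation."""
    # One digit before the decimal point plus (digits - 1) after it.
    return format(value, f"+.{digits - 1}e")

# These two values differ at the level of floating-point epsilon...
raw_equal = (0.1 + 0.2) == 0.3

# ...but normalize to the same string, so they would hash identically.
a = normalize_float(0.1 + 0.2)
b = normalize_float(0.3)
```

Without a step like this, two datasets holding the "same" numbers computed on different machines could produce different hashes.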

Hash Generation

Each row is treated as a single byte stream and hashed using Murmurhash128. Murmurhash is a non-cryptographic hash function that produces a well-distributed hash for each input. XORing the row hashes together produces the final hash for the dataset. Because XOR is commutative, the result does not depend on row order, but it also does not account for duplicate rows.
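The scheme can be sketched in pure Python. This is not the IDHash implementation: `row_hash` and `dataset_hash` are hypothetical helpers, the row serialization is an assumption, and 128-bit BLAKE2b from the standard library is swapped in for Murmurhash128 so the example has no third-party dependencies.

```python
import hashlib
from functools import reduce
from operator import xor

def row_hash(row) -> int:
    """Hash one row as a single byte stream; returns a 128-bit integer."""
    # Values are joined in column order, so reordering columns changes the bytes.
    data = "\x00".join(repr(v) for v in row).encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=16).digest(), "big")

def dataset_hash(rows) -> int:
    """XOR all row hashes; XOR is commutative, so row order is irrelevant."""
    return reduce(xor, (row_hash(r) for r in rows), 0)

rows = [(1, "a"), (2, "b"), (3, "c")]
shuffled = [rows[2], rows[0], rows[1]]       # same rows, different order
swapped_columns = [(b, a) for a, b in rows]  # same values, columns swapped
```

Here `dataset_hash(rows)` equals `dataset_hash(shuffled)`, while `dataset_hash(swapped_columns)` differs, mirroring the row-invariant, column-dependent behaviour described above.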

Checking for Equality + Delta

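Because the row hashes are XORed together to produce the final value, a row can also be removed from (or added to) a final hash: compute that row's hash in the same way as before and XOR it with the final hash. This makes it possible to check whether one dataset is a known modification of another without rehashing either dataset.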
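A minimal sketch of such a delta check follows. As before, `row_hash` and `dataset_hash` are hypothetical stand-ins rather than the IDHash API, and 128-bit BLAKE2b replaces Murmurhash128 to keep the example dependency-free.

```python
import hashlib
from functools import reduce
from operator import xor

def row_hash(row) -> int:
    """Hash one row as a single byte stream; returns a 128-bit integer."""
    data = "\x00".join(repr(v) for v in row).encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=16).digest(), "big")

def dataset_hash(rows) -> int:
    """XOR all row hashes into one dataset hash."""
    return reduce(xor, (row_hash(r) for r in rows), 0)

original = [(1, "a"), (2, "b"), (3, "c")]
modified = [(1, "a"), (3, "c")]  # the row (2, "b") was removed

# Known modification: XOR the removed row's hash out of the original hash.
expected = dataset_hash(original) ^ row_hash((2, "b"))

# The same operation in reverse adds the row back.
restored = dataset_hash(modified) ^ row_hash((2, "b"))
```

Only the hash of the original dataset and the changed row are needed, so the party holding `modified` never has to see `original` itself.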

Data Processing

IDHash operates on Apache Arrow RecordBatches and processes them without copying the underlying data (zero-copy).
