Skip to main content

Fast hash in 2D Arrays (Numpy/Pandas/lists/tuples)

Project description

Fast hash in 2D Arrays (Numpy/Pandas/lists/tuples)

pip install arrayhascher

Tested against Windows / Python 3.11 / Anaconda

Cython (and a C/C++ compiler) must be installed

    Computes a hash value for each column in a DataFrame/NumPy Array/list/tuple.

    Parameters:
    - df (numpy.ndarray, pandas.Series, pandas.DataFrame, list, tuple): 2D (!) Input data to compute hash values for.
    - fail_convert_to_string (bool, optional): If True, tries to convert non-string columns to strings after failed hashing. - The original data won't change!
                                               If False, raises an exception if conversion fails. Default is True.
    - whole_result (bool, optional): If True, returns an array of hash values for each element in the DataFrame/NumPy Array/list/tuple.
                                    If False, returns a condensed array of hash values for each column.
                                    Default is False.

    Returns:
    - numpy.ndarray: If `whole_result` is False, returns a condensed array of hash values for each column.
                     If `whole_result` is True, returns an array of hash values for each element in the DataFrame.

    Example:
        import pandas as pd

        from arrayhascher import get_hash_column

        def test_drop_duplicates(df,hashdata):
            # Example of how to delete duplicates

            return df.assign(__XXXX___DELETE____=hashdata).drop_duplicates(subset='__XXXX___DELETE____').drop(
                columns='__XXXX___DELETE____')

        # With pandas ----------------------------------------------------------------
        df = pd.read_csv(
            "https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv"
        )
        df = pd.concat([df for _ in range(10000)], ignore_index=True)
        df = df.sample(len(df))
        hashdata = get_hash_column(df, fail_convert_to_string=True, whole_result=False)
        # Out[3]:
        # array([-4123592378399267822,   -20629003135630820,  1205215161148196795,
        #        ...,  4571993557129865534, -5454081294880889185,
        #         2672790383060839465], dtype=int64)

        # %timeit test_drop_duplicates(df,hashdata)
        # %timeit df.drop_duplicates()
        # 947 ms ± 18.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
        # 2.94 s ± 10.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

        # Numpy only  ----------------------------------------------------------------
        hashdata = get_hash_column(df.to_numpy(), fail_convert_to_string=True, whole_result=False)
        print(hashdata)
        # # array([-4123592378399267822,   -20629003135630820,  1205215161148196795,
        # #        ...,  4571993557129865534, -5454081294880889185,
        # #         2672790383060839465], dtype=int64)

        # Works also with lists  ------------------------------------------------------
        get_hash_column(df[:100].to_numpy().tolist(), fail_convert_to_string=True, whole_result=False)
        # array([-5436153420663104440, -1384246600780856199,   177114776690388363,
        #          788413506175135724,  1442743010667139722, -6386366738900951630,
        #        -8610361015858259700,  3995349003546064044,  3627302932646306514,
        #         3448626572271213155, -1555175565302024830,  3265835764424924148, ....
        # And tuples  ----------------------------------------------------------------
        tuple(map(tuple, df[:100].to_numpy().tolist()))

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

arrayhascher-0.11.tar.gz (84.5 kB view details)

Uploaded Source

Built Distribution

arrayhascher-0.11-py3-none-any.whl (87.8 kB view details)

Uploaded Python 3

File details

Details for the file arrayhascher-0.11.tar.gz.

File metadata

  • Download URL: arrayhascher-0.11.tar.gz
  • Upload date:
  • Size: 84.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.5

File hashes

Hashes for arrayhascher-0.11.tar.gz
Algorithm Hash digest
SHA256 342167a6d5cb42da05a568657c4109b0413839da1f6409752fa3ee1ac180b6bb
MD5 f2cfe947013ac3de3c149ff9009e2c5f
BLAKE2b-256 7600cba8c25067062fc273ad80979557317526416d53b8896b0304a3084a5c98

See more details on using hashes here.

File details

Details for the file arrayhascher-0.11-py3-none-any.whl.

File metadata

  • Download URL: arrayhascher-0.11-py3-none-any.whl
  • Upload date:
  • Size: 87.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.5

File hashes

Hashes for arrayhascher-0.11-py3-none-any.whl
Algorithm Hash digest
SHA256 5a1aec3a4128f9523f1291ecf310da643a4308489cf69e97f02f1b59fefec9b0
MD5 6bb9ac1af988e073c1b44ea29c970087
BLAKE2b-256 b5b7b815087f49ea87de8abccec4d8f61a637706c4e6ce85e802dd5d3d7d5428

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page