Fast hash in 2D Arrays (Numpy/Pandas/lists/tuples)
pip install arrayhascher
Tested against Windows / Python 3.11 / Anaconda
Cython (and a C/C++ compiler) must be installed
Computes one hash value per row of a 2D DataFrame/NumPy array/list/tuple. The result can be attached to the data as a hash column, as in the example.
Parameters:
- df (numpy.ndarray, pandas.Series, pandas.DataFrame, list, tuple): 2D (!) Input data to compute hash values for.
- fail_convert_to_string (bool, optional): If True, tries to convert non-string columns to strings after hashing fails; the original data is not modified.
If False, raises an exception if conversion fails. Default is True.
- whole_result (bool, optional): If True, returns an array of hash values for each element in the DataFrame/NumPy array/list/tuple.
If False, returns a condensed array with one hash value per row.
Default is False.
Returns:
- numpy.ndarray: If `whole_result` is False, returns a condensed array with one hash value per row.
If `whole_result` is True, returns an array of hash values for each element in the input.
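To make the `fail_convert_to_string` behaviour more concrete, here is a minimal pure-Python sketch of per-row hashing with a string fallback. It is not the library's Cython implementation: `row_hashes` is a hypothetical helper, it uses Python's built-in `hash`, and it assumes (based on the `drop_duplicates` example in this description) that the condensed result holds one hash per row.

```python
def row_hashes(rows, fail_convert_to_string=True):
    """Return one hash per row of a 2D list/tuple (illustrative sketch only)."""
    result = []
    for row in rows:
        try:
            # Fast path: every cell in the row is hashable.
            result.append(hash(tuple(row)))
        except TypeError:
            # At least one cell is unhashable (e.g. a list or dict).
            if not fail_convert_to_string:
                raise
            # Fallback: hash the string form of each cell.
            # A new tuple is built, so the input rows are not modified.
            result.append(hash(tuple(str(cell) for cell in row)))
    return result


data = [(1, "a", 2.5), (1, "a", 2.5), (2, "b", [3, 4])]
hashes = row_hashes(data)
print(hashes[0] == hashes[1])  # identical rows share a hash
```

Note that Python randomizes string hashes per process unless `PYTHONHASHSEED` is set, so the values from this sketch are only comparable within a single run.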
Example:
import pandas as pd
from arrayhascher import get_hash_column
def test_drop_duplicates(df, hashdata):
    # Example of how to delete duplicates
    return df.assign(__XXXX___DELETE____=hashdata).drop_duplicates(subset='__XXXX___DELETE____').drop(
        columns='__XXXX___DELETE____')
# With pandas ----------------------------------------------------------------
df = pd.read_csv(
    "https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv"
)
df = pd.concat([df for _ in range(10000)], ignore_index=True)
df = df.sample(len(df))
hashdata = get_hash_column(df, fail_convert_to_string=True, whole_result=False)
# Out[3]:
# array([-4123592378399267822, -20629003135630820, 1205215161148196795,
# ..., 4571993557129865534, -5454081294880889185,
# 2672790383060839465], dtype=int64)
# %timeit test_drop_duplicates(df,hashdata)
# 947 ms ± 18.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# %timeit df.drop_duplicates()
# 2.94 s ± 10.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# Numpy only ----------------------------------------------------------------
hashdata = get_hash_column(df.to_numpy(), fail_convert_to_string=True, whole_result=False)
print(hashdata)
# array([-4123592378399267822, -20629003135630820, 1205215161148196795,
# ..., 4571993557129865534, -5454081294880889185,
# 2672790383060839465], dtype=int64)
# Works also with lists ------------------------------------------------------
get_hash_column(df[:100].to_numpy().tolist(), fail_convert_to_string=True, whole_result=False)
# array([-5436153420663104440, -1384246600780856199, 177114776690388363,
# 788413506175135724, 1442743010667139722, -6386366738900951630,
# -8610361015858259700, 3995349003546064044, 3627302932646306514,
# 3448626572271213155, -1555175565302024830, 3265835764424924148, ....
# And tuples ----------------------------------------------------------------
get_hash_column(tuple(map(tuple, df[:100].to_numpy().tolist())), fail_convert_to_string=True, whole_result=False)
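For list or tuple input there is no `drop_duplicates` method, so the hashes have to be applied by hand. The sketch below shows one way to do that, keeping the first row seen for each hash value. `drop_duplicate_rows` is a hypothetical helper, and the built-in `hash` stands in for the output of `get_hash_column` so that the snippet runs without the package installed.

```python
def drop_duplicate_rows(rows, hashdata):
    """Keep the first row seen for each hash value, preserving order."""
    seen = set()
    unique = []
    for row, h in zip(rows, hashdata):
        if h not in seen:
            seen.add(h)
            unique.append(row)
    return unique


rows = [(1, "a"), (2, "b"), (1, "a"), (3, "c"), (2, "b")]
# Stand-in for get_hash_column(rows, ...): one hash per row.
hashdata = [hash(r) for r in rows]
unique_rows = drop_duplicate_rows(rows, hashdata)
print(unique_rows)
```

As with any hash-based deduplication, two different rows that happen to collide on the same hash would be treated as duplicates, so it may be worth comparing the rows themselves when exactness matters.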