Calculate overlapping values between two arrays and return the results as a DataFrame
Project description
Calculate overlapping values between two arrays and return the results as a DataFrame
Tested against Windows 10 / Python 3.10 / Anaconda
pip install stridesduplicatefinder
Problem: you have to lists of different sizes and want to find the overlapping values.
Using pure Python - working, but slow
all indices / same values
a1=[1,2,3,4,5,6,7]
a2=[0,0,3,1,5,6,8,1,32,]
res1=[(index1, index2, value1, value2) for index1, value1 in enumerate(a1) for index2, value2 in enumerate(a2) if value1 == value2]
print(res1)
# [(0, 3, 1, 1), (0, 7, 1, 1), (2, 2, 3, 3), (4, 4, 5, 5), (5, 5, 6, 6)]
same indices / same values
res2=[(index1, index2, value1, value2) for index1, value1 in enumerate(a1) for index2, value2 in enumerate(a2) if value1 == value2 and index1==index2]
print(res2)
# [(2, 2, 3, 3), (4, 4, 5, 5), (5, 5, 6, 6)]
Using stridesduplicatefinder - numpy or numexpr
from stridesduplicatefinder import get_overlapping
def test_numexpr():
start = perf_counter()
_ = get_overlapping(
fu="a==b", a=a1, b=a2, numpy_or_numexpr="numexpr", same_index_required=False
)
print(f"numexpr test: {perf_counter() - start}")
print(_)
def test_numpy():
start = perf_counter()
_ = get_overlapping(
fu=lambda a, b: a == b,
a=a1,
b=a2,
numpy_or_numexpr="numpy",
same_index_required=False,
)
print(f"numpy test: {perf_counter() - start}")
print(_)
def python_test():
start = perf_counter()
_ = [(i1, i2, a, b) for i2, a in enumerate(a1) for i1, b in enumerate(a2) if a == b]
print(f"python test: {perf_counter() - start}")
print(_[:10])
a1 = np.random.randint(1, 100, size=(19000,),dtype=np.int64)
a2 = np.random.randint(1, 100, size=(7777,),dtype=np.int64)
from time import perf_counter
python_test()
# python test: 13.229658300006122
test_numpy()
# numpy test: 0.5666937999994843
test_numexpr()
# numexpr test: 0.48387080000247806
Calculate overlapping values between two arrays and return the results as a DataFrame.
Parameters:
- fu: function or string to be evaluated as a condition for overlap.
- a: First input array.
- b: Second input array.
- numpy_or_numexpr: 'numpy' or 'numexpr' indicating the evaluation method.
- same_index_required: If True, only return rows where index1 == index2.
Returns:
- A DataFrame with columns 'index1', 'value1', 'index2', 'value2' containing
information about overlapping values.
Example Usage:
- To find overlapping values between two NumPy arrays:
a1 = np.random.randint(1, 10, size=(100000,))
a2 = np.random.randint(1, 10, size=(100,))
df1 = get_overlapping(
fu="a==b", a=a1, b=a2, numpy_or_numexpr="numexpr", same_index_required=True
)
print(df1)
- To find overlapping values using a custom function:
a1 = np.random.randint(1, 10, size=(100000,))
a2 = np.random.randint(1, 10, size=(100,))
df2 = get_overlapping(
fu=lambda a, b: a == b,
a=a1,
b=a2,
numpy_or_numexpr="numpy",
same_index_required=False,
)
print(df2)
- To find overlapping values between two arrays of strings:
a1 = np.array(["aa", "b", "c", "d", "ee11", "f", "gg", "h", "i", "j"])
a1 = np.repeat(a1, 1000)
a2 = np.array(["aa", "b", "c", "ee11", "f", "gg"])
a2 = np.repeat(a2, 1000)
np.random.shuffle(a1)
np.random.shuffle(a2)
df3 = get_overlapping(
fu="a == b",
a=np.char.array(a1).encode("utf-8"),
b=np.char.array(a2).encode("utf-8"),
numpy_or_numexpr="numexpr",
same_index_required=True,
)
print(df3)
# index1 value1 index2 value2
# 0 5 1 5 1
# 1 20 8 20 8
# 2 33 5 33 5
# 3 34 1 34 1
# 4 41 5 41 5
# 5 43 2 43 2
# 6 51 7 51 7
# 7 52 1 52 1
# 8 55 7 55 7
# 9 57 1 57 1
# 10 70 2 70 2
# 11 74 8 74 8
# index1 value1 index2 value2
# 0 0 4 8 4
# 1 0 4 12 4
# 2 0 4 13 4
# 3 0 4 26 4
# 4 0 4 53 4
# ... ... ... ...
# 1112213 99999 9 47 9
# 1112214 99999 9 62 9
# 1112215 99999 9 72 9
# 1112216 99999 9 81 9
# 1112217 99999 9 96 9
# [1112218 rows x 4 columns]
# index1 value1 index2 value2
# 0 1 gg 4 gg
# 1 1 gg 5 gg
# 2 1 gg 10 gg
# 3 1 gg 13 gg
# 4 1 gg 17 gg
# ... ... ... ...
# 5999995 9999 c 5978 c
# 5999996 9999 c 5979 c
# 5999997 9999 c 5990 c
# 5999998 9999 c 5992 c
# 5999999 9999 c 5995 c
# [6000000 rows x 4 columns]
# index1 value1 index2 value2
# 0 31 b'aa' 31 b'aa'
# 1 40 b'b' 40 b'b'
# 2 46 b'aa' 46 b'aa'
# 3 47 b'gg' 47 b'gg'
# 4 65 b'b' 65 b'b'
# .. ... ... ... ...
# 626 5966 b'aa' 5966 b'aa'
# 627 5982 b'f' 5982 b'f'
# 628 5985 b'ee11' 5985 b'ee11'
# 629 5995 b'c' 5995 b'c'
# 630 5996 b'gg' 5996 b'gg'
# [631 rows x 4 columns]
The function computes the overlapping values based on the specified condition (function or string)
and returns a DataFrame with the results. If `same_index_required` is set to True, it filters
the results to include only rows where the indices match.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file stridesduplicatefinder-0.10.tar.gz
.
File metadata
- Download URL: stridesduplicatefinder-0.10.tar.gz
- Upload date:
- Size: 24.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.12
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 10b516d4ed9438eb11b0ddc50ca2f82dfeabf83aeb3ff53c7894eccbe403c576 |
|
MD5 | c2f9511cb87d31dcd7ee6c7f2a174766 |
|
BLAKE2b-256 | f0b1e8e73fa23e2847aba474f934835167929532d145ea049640cd9e677ffef9 |
File details
Details for the file stridesduplicatefinder-0.10-py3-none-any.whl
.
File metadata
- Download URL: stridesduplicatefinder-0.10-py3-none-any.whl
- Upload date:
- Size: 25.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.12
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | df2c2ccf6227ecc2c62b74bdba596031942af8668c9fdfceb860a57ba3ad201d |
|
MD5 | 07a9a1b3fe332f226a81fa7b07829440 |
|
BLAKE2b-256 | f5906bddc3b23a639ed8f0460b6addad87b643ec06e70240607fc9dd7c11f2d0 |