A fast isin() function using Cython (C++), up to 80 times faster than NumPy/Pandas.

Project description

pip install isincython

Tested against Python 3.11 / Windows 10

Cython (and a C/C++ compiler) must be installed to use the optimized Cython implementation.

This module provides functions for efficiently checking if elements in one array are present in another array. It includes a Cython implementation for improved performance.
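As a small illustration of what "present in another array" means here, the sketch below uses `np.isin`, the NumPy function that the benchmark further down compares against. It assumes that `fast_isin(arr, seq)` returns the same boolean mask, which the benchmark's `np.all(q == u)` check suggests. The array values are made up for the example.

```python
import numpy as np

# Hypothetical example data: the array to test and the values to look for.
haystack = np.array([3, 7, 7, 1, 9], dtype=np.int64)
needles = np.array([7, 1], dtype=np.int64)

# Boolean mask with one entry per element of haystack:
# True where that element occurs anywhere in needles.
mask = np.isin(haystack, needles)

# Indexing with the mask keeps only the matching elements.
matches = haystack[mask]

print(mask)
print(matches)
```

The mask has the same shape as the first argument, so it can be used directly for boolean indexing, as the benchmark does with `arr[u]`.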

Note: The Cython implementation is compiled on the first import, and the compiled extension module is stored in the same directory. Subsequent imports load the precompiled module, so only the first import pays the compilation cost.
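Because the optimized version depends on a working compiler, a script may want a fallback. The following is a minimal sketch, not part of the package: it assumes that a missing package or a failed build surfaces as an exception at import time, and it substitutes `np.isin` in that case. The alias `isin` is a name chosen for this example.

```python
import numpy as np

try:
    # Per the note above, the first import may trigger compilation.
    from isincython import fast_isin as isin
except Exception:
    # Assumption: a missing package or failed build raises on import.
    # np.isin takes the same (array, test values) arguments.
    isin = np.isin

arr = np.array([1, 2, 3, 4], dtype=np.int32)
seq = np.array([2, 4], dtype=np.int32)

mask = isin(arr, seq)
print(mask)
```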

import timeit
from isincython import generate_random_arrays, fast_isin
import numpy as np

size = 10000000
low = 0
high = 254
arras = [
    (size, "float32", low, high),
    (size, "float64", low, high),
    (size, np.uint8, low, high),
    (size, np.int8, low, high),
    (size, np.int16, low, high),
    (size, np.int32, low, high),
    (size, np.int64, low, high),
    (size, np.uint16, low, high),
    (size, np.uint32, low, high),
    (size, np.uint64, low, high),
]

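reps = 1  # number of timed runs per statement
for a in arras:
    # Array to search in, plus a lookup array one tenth its size (same dtype and range).
    arr = generate_random_arrays(*a)
    seq = generate_random_arrays(size // 10, *a[1:])

    # Run fast_isin once untimed to show the matches, then time it.
    s = """u=fast_isin(arr,seq)"""
    u = fast_isin(arr, seq)
    print("c++", arr[u])
    t1 = timeit.timeit(s, globals=globals(), number=reps) / reps
    print(t1)

    # Same procedure for np.isin.
    s2 = """q=np.isin(arr,seq)"""
    q = np.isin(arr, seq)
    print("numpy", arr[q])
    t2 = timeit.timeit(s2, globals=globals(), number=reps) / reps
    print(t2)

    # Both implementations should produce the same boolean mask.
    print(np.all(q == u))

    print("-----------------")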

haystack = np.array(
    [
        b"Cumings",
        b"Heikkinen",
        b"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
        b"aaa",
        b"bbbb()",
        b"Futrelle",
        b"Allen",
        b"Cumings, Mrs. John Bradley (Florence Briggs Thayer)q",
        b"Braund, Mr. Owen Harris",
        b"Heikkinen, Miss. Laina",
        b"Futrelle, Mrs. Jacques Heath (Lily May Peel)",
        b"Allen, Mr. William Henry",
        b"Braund",
    ],
    dtype="S",
)
needels = np.array(
    [
        b"Braund, Mr. Owen Harris",
        b"Cumings, Mrs. John Bradley (Florence Briggs Th",
        b"Heikkinen, Miss. Lxxaina",
        b"Futrelle, Mrs. Jacqxues Heath (Lily May Peel)",
        b"Allen, Mxr. William Henry",
        b"sdfsdd",
        b"aaa",
        b"bbbb()",
    ],
    dtype="S",
)
haystack = np.ascontiguousarray(np.concatenate([haystack for _ in range(200000)]))
needels = np.ascontiguousarray(np.concatenate([needels for _ in range(10000)]))

s = "o = fast_isin(haystack, needels)"
t1 = timeit.timeit(s, globals=globals(), number=reps) / reps
s1 = "o = np.isin(haystack, needels)"
t2 = timeit.timeit(s1, globals=globals(), number=reps) / reps
print(f"c++ {t1}")
print(f"numpy {t2}")
o1 = fast_isin(haystack, needels)
o2 = np.isin(haystack, needels)
print(np.all(o1 == o2))
needels = needels.astype("U")
haystack = haystack.astype("U")
s = "o = fast_isin(haystack, needels)"
t1 = timeit.timeit(s, globals=globals(), number=reps) / reps
s1 = "o = np.isin(haystack, needels)"
t2 = timeit.timeit(s1, globals=globals(), number=reps) / reps
print(f"c++ {t1}")
print(f"numpy {t2}")
o1 = fast_isin(haystack, needels)
o2 = np.isin(haystack, needels)
print(np.all(o1 == o2))

# c++ [136.03264   62.5741   156.39038  ...  78.545906 229.14676  186.44472 ]
# 0.39614199999778066
# numpy [136.03264   62.5741   156.39038  ...  78.545906 229.14676  186.44472 ]
# 2.1623376999996253
# True
# -----------------
# c++ []
# 0.4184691000045859
# numpy []
# 2.189824300003238
# True
# -----------------
# c++ [126 128  31 ... 113 190 146]
# 0.011114299995824695
# numpy [126 128  31 ... 113 190 146]
# 0.05381579999811947
# True
# -----------------
# c++ [  23   35   52 ...   54   98 -125]
# 0.010347299998102244
# numpy [  23   35   52 ...   54   98 -125]
# 0.8121466000011424
# True
# -----------------
# c++ [144  29  89 ...  90  34 202]
# 0.012101899999834131
# numpy [144  29  89 ...  90  34 202]
# 0.05841199999849778
# True
# -----------------
# c++ [ 93  51 131 ... 231 147 140]
# 0.013264799999888055
# numpy [ 93  51 131 ... 231 147 140]
# 0.07822610000584973
# True
# -----------------
# c++ [138 158 233 ...  64  82 160]
# 0.018734699995548
# numpy [138 158 233 ...  64  82 160]
# 0.09425780000310624
# True
# -----------------
# c++ [158  17 126 ...  55   7 116]
# 0.011595800002396572
# numpy [158  17 126 ...  55   7 116]
# 0.06014610000420362
# True
# -----------------
# c++ [ 60  12 226 ... 152 190 155]
# 0.013999900002090726
# numpy [ 60  12 226 ... 152 190 155]
# 0.07416449999436736
# True
# -----------------
# c++ [239  84  81 ... 146  85  63]
# 0.026196500002697576
# numpy [239  84  81 ... 146  85  63]
# 0.11476380000385689
# True
# -----------------
# c++ 0.7991062000000966
# numpy 2.1993997000026866
# True
# c++ 1.7051588000031188
# numpy 3.0464809000040987
# True
