Skip to main content

Finds close numerical matches across two arrays

Project description

Close Numerical Matches

PyPI - Downloads PyPI - Version GitHub Workflow Status (branch) CodeFactor Grade GitHub issues GitHub license PyPI - Python Version PRs welcome

This package finds close numerical matches fast across two 2D arrays of shape (n, d) and (m, d) (if it can be assumed there will be relatively few matches and d is relatively low). Returns the indices of the matches.

Installation

You can install close-numerical-matches from PyPI:

$ pip install close-numerical-matches

The package is supported on Python 3.6 and above and requires Numpy.

How to use

Import find_matches from close_numerical_matches and supply two arrays of shape (n, d) and (m, d) and a given tolerance level. Optionally provide your desired distance metric and a bucket tolerance multiplier. The arguments in more detail:

  • arr0 : np.ndarray
    First array to find matches against. Should be of size (n, d).
  • arr1 : np.ndarray
    Second array to find matches against. Should be of size (m, d).
  • dist : {'norm', 'max'} or Callable[[np.ndarray], np.ndarray], default='norm'
    Distance metric to calculate distance. 'norm' and 'max' are currently supported. If you want some other distance function, you can supply your own function. It should take an (n, d) array as argument and return an (n,) array.
  • tol : float, default=0.1
    The tolerance where values are considered the similar enough to count as a match. Should be > 0.
  • bucket_tol_mult : int, default=2
    The tolerance multiplier to use for assigning buckets. Can in some instances make algorithm faster to tweak this. Should never be less than 1.

Example

>>> import numpy as np
>>> from close_numerical_matches import find_matches
>>> arr0 = np.array([[25, 24], [50, 50], [25, 26]])
>>> arr1 = np.array([[25, 23], [25, 25], [50.6, 50.6], [60, 60]])
>>> find_matches(arr0, arr1, tol=1.0001)
array([[0, 0], [0, 1], [1, 2], [2, 1]])
>>> find_matches(arr0, arr1, tol=0.9999)
array([[1, 2]])
>>> find_matches(arr0, arr1, tol=0.60001)
array([], dtype=int64)
>>> find_matches(arr0, arr1, tol=0.60001, dist='max')
array([[1, 2]])
>>> manhatten_dist = lambda arr: np.sum(np.abs(arr), axis=1)
>>> matches = find_matches(arr0, arr1, tol=0.11, dist=manhatten_dist)
>>> matches
array([[0, 1], [0, 1], [2, 1]])
>>> indices0, indices1 = matches.T
>>> arr0[indices0]
array([[25, 24], [25, 24], [25, 26]])

More examples can be found in the test cases.

How fast is it?

Here is an unscientific example:

from timeit import default_timer as timer
import numpy as np
from close_numerical_matches import naive_find_matches, find_matches

arr1 = np.random.rand(320_000, 2)
arr2 = np.random.rand(44_000, 2)

start = timer()
naive_find_matches(arr1, arr2, tol=0.001)
end = timer()
print(end - start)  # 255.335 s

start = timer()
find_matches(arr1, arr2, tol=0.001)
end = timer()
print(end - start)  # 5.821 s

How it works

Instead of comparing every element in the first array against every element in the second array, resulting in an O(nmd) runtime, all elements are at first assigned to buckets so only elements that are relatively close are compared. In the case of relatively few matches and a low dimensionality d, this cuts the runtime down to almost linear O((n + m)d).

In general, the algorithm runtime of the bucket approach is O((n + m)d + Bd³ + ∑_{b ∈ B} n_b m_b) where B is the number of buckets and n_b and m_b are the number of items assigned to bucket b. As can be seen, it scales bad with dimensionality and also does not improve from the naive approach if all elements are assigned to the same bucket. In case the bucket approach is likely to be slower than the naive approach, this library will fall back to the naive approach.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

close-numerical-matches-0.1.0.tar.gz (7.0 kB view details)

Uploaded Source

Built Distribution

close_numerical_matches-0.1.0-py2.py3-none-any.whl (7.6 kB view details)

Uploaded Python 2 Python 3

File details

Details for the file close-numerical-matches-0.1.0.tar.gz.

File metadata

  • Download URL: close-numerical-matches-0.1.0.tar.gz
  • Upload date:
  • Size: 7.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/4.6.1 pkginfo/1.7.1 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.61.2 CPython/3.9.6

File hashes

Hashes for close-numerical-matches-0.1.0.tar.gz
Algorithm Hash digest
SHA256 9b45a59282470065245610dcc7bc76a8b0eaeb67dbfb95322704cc2fcaa96824
MD5 d85b61c3a975aa1ac89e9f6bb5a3bd0c
BLAKE2b-256 34b5d20f1928c9345c0a8500d6d446d1f3ecb0919599d3f7eb4157074e144d22

See more details on using hashes here.

File details

Details for the file close_numerical_matches-0.1.0-py2.py3-none-any.whl.

File metadata

  • Download URL: close_numerical_matches-0.1.0-py2.py3-none-any.whl
  • Upload date:
  • Size: 7.6 kB
  • Tags: Python 2, Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/4.6.1 pkginfo/1.7.1 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.61.2 CPython/3.9.6

File hashes

Hashes for close_numerical_matches-0.1.0-py2.py3-none-any.whl
Algorithm Hash digest
SHA256 4f6ff05db885c0bdc1431e44fd0724afd3b29d9b54999e4d3e421237dcd75ac1
MD5 1e28a68ec7086fe3c8e2304d00a5edf8
BLAKE2b-256 03bd85ff650c0d78cdd46ffee869a8970cba204eee8857293ca583b33626559b

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page