Distance functions: A drop-in replacement for, and a superset of, the scipy.spatial.distance module.

Algorithms for Big Data: Distances (v1.0.3)

This package contains algorithms for computing distances between data points. It is a thin Python wrapper around the distances crate, which is written in Rust. It provides drop-in replacements for the distance functions in scipy.spatial.distance.

Supported Distance Functions

Installation

pip install abd-distances

Usage

import math

import numpy
import abd_distances.simd as distance

a = numpy.array([i for i in range(10_000)], dtype=numpy.float32)
b = a + 1.0

dist = distance.euclidean(a, b)

assert math.fabs(dist - 100.0) < 1e-6

print(dist)
# 100.0
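
The expected value of 100.0 in the example above follows from simple arithmetic: every one of the 10,000 coordinates differs by exactly 1, so the sum of squared differences is 10,000 and its square root is 100. The following pure-Python sketch reproduces that calculation without this package or numpy. It is only an illustration of the arithmetic, not the package's implementation.

```python
import math

n = 10_000
a = [float(i) for i in range(n)]
b = [x + 1.0 for x in a]

# Each coordinate differs by exactly 1.0, so the sum of squares equals n.
sum_sq = sum((x - y) ** 2 for x, y in zip(a, b))
dist = math.sqrt(sum_sq)

print(dist)
# 100.0
```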

Vector Distances

  • Bray-Curtis: abd_distances.vector.braycurtis
  • Canberra: abd_distances.vector.canberra
  • Chebyshev: abd_distances.vector.chebyshev
  • Correlation
  • Cosine: abd_distances.vector.cosine
  • Euclidean: abd_distances.vector.euclidean
  • Jensen-Shannon
  • Mahalanobis
  • Manhattan: abd_distances.vector.manhattan and abd_distances.vector.cityblock
  • Minkowski: abd_distances.vector.minkowski
  • Standardized Euclidean
  • Squared Euclidean: abd_distances.vector.sqeuclidean
  • Pairwise Distances: abd_distances.vector.cdist and abd_distances.vector.pdist
  • ...
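
For readers unfamiliar with some of these metrics, here is a minimal pure-Python sketch of a few of them, following the definitions documented for scipy.spatial.distance. The function bodies are reference definitions written for illustration only. They are not this package's Rust implementation, and the sample vectors `u` and `v` are made up.

```python
import math


def minkowski(u, v, p=2):
    # (sum |u_i - v_i|^p)^(1/p); p=1 is Manhattan, p=2 is Euclidean.
    return sum(abs(x - y) ** p for x, y in zip(u, v)) ** (1.0 / p)


def chebyshev(u, v):
    # Largest absolute coordinate difference.
    return max(abs(x - y) for x, y in zip(u, v))


def braycurtis(u, v):
    # sum |u_i - v_i| / sum |u_i + v_i|
    num = sum(abs(x - y) for x, y in zip(u, v))
    den = sum(abs(x + y) for x, y in zip(u, v))
    return num / den


def canberra(u, v):
    # sum |u_i - v_i| / (|u_i| + |v_i|); a 0/0 term contributes 0.
    total = 0.0
    for x, y in zip(u, v):
        den = abs(x) + abs(y)
        if den != 0:
            total += abs(x - y) / den
    return total


u = [1.0, 2.0, 3.0]
v = [4.0, 0.0, 3.0]

print(minkowski(u, v, p=1))  # Manhattan: 3 + 2 + 0
print(minkowski(u, v, p=2))  # Euclidean: sqrt(9 + 4 + 0)
print(chebyshev(u, v))       # max(3, 2, 0)
print(braycurtis(u, v))      # 5 / 13
print(canberra(u, v))        # 3/5 + 2/2 + 0/6
```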

Boolean Distances

  • Dice
  • Hamming
  • Jaccard
  • Kulczynski 1D
  • Rogers-Tanimoto
  • Russell-Rao
  • Sokal-Michener
  • Sokal-Sneath
  • Yule
  • ...
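
The boolean distances are all functions of four counts over the paired positions of two boolean vectors: both true, only the first true, only the second true, and both false. The sketch below computes those counts and four of the listed distances, using the formulas documented for scipy.spatial.distance. The helper `counts` and the sample vectors are hypothetical, and none of this is taken from the package's own code.

```python
def counts(u, v):
    # Returns (c_tt, c_tf, c_ft, c_ff) for two equal-length boolean vectors.
    c_tt = sum(1 for x, y in zip(u, v) if x and y)
    c_tf = sum(1 for x, y in zip(u, v) if x and not y)
    c_ft = sum(1 for x, y in zip(u, v) if not x and y)
    c_ff = sum(1 for x, y in zip(u, v) if not x and not y)
    return c_tt, c_tf, c_ft, c_ff


def hamming(u, v):
    # Fraction of positions where the two vectors disagree.
    c_tt, c_tf, c_ft, c_ff = counts(u, v)
    return (c_tf + c_ft) / (c_tt + c_tf + c_ft + c_ff)


def jaccard(u, v):
    # Disagreements over positions where at least one vector is true.
    c_tt, c_tf, c_ft, _ = counts(u, v)
    den = c_tt + c_tf + c_ft
    return (c_tf + c_ft) / den if den else 0.0


def dice(u, v):
    # Like Jaccard, but agreements on true count twice.
    c_tt, c_tf, c_ft, _ = counts(u, v)
    den = 2 * c_tt + c_tf + c_ft
    return (c_tf + c_ft) / den if den else 0.0


def russellrao(u, v):
    # Fraction of positions that are not true in both vectors.
    c_tt, c_tf, c_ft, c_ff = counts(u, v)
    n = c_tt + c_tf + c_ft + c_ff
    return (n - c_tt) / n


u = [True, True, False, False, True]
v = [True, False, True, False, True]

print(counts(u, v))
print(hamming(u, v), jaccard(u, v), dice(u, v), russellrao(u, v))
```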

SIMD-Accelerated Vector Distances

  • Euclidean: abd_distances.simd.euclidean
  • Squared Euclidean: abd_distances.simd.sqeuclidean
  • Cosine: abd_distances.simd.cosine
  • Pairwise Distances: abd_distances.simd.cdist and abd_distances.simd.pdist
  • ...
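
The two pairwise functions differ in what they return. Under the scipy.spatial.distance conventions, cdist takes two collections and returns the full matrix of distances between them, while pdist takes one collection and returns a condensed list of the n(n-1)/2 distances between distinct pairs, ordered (0,1), (0,2), ..., (1,2), and so on. The pure-Python sketch below illustrates those output shapes. It assumes this package follows the same conventions (as a drop-in replacement would), and the function signatures here are simplified for illustration rather than copied from either library.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def cdist(xa, xb, metric=euclidean):
    # Full len(xa) x len(xb) matrix of distances.
    return [[metric(u, v) for v in xb] for u in xa]


def pdist(x, metric=euclidean):
    # Condensed form: one entry per unordered pair (i, j) with i < j.
    n = len(x)
    return [metric(x[i], x[j]) for i in range(n) for j in range(i + 1, n)]


points = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]

full = cdist(points, points)
condensed = pdist(points)

print(full)
print(condensed)
```

For n points, the condensed result holds the strict upper triangle of the cdist matrix computed against itself, which is why the pdist timings in the benchmarks are roughly half of the corresponding cdist timings.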

String Distances

  • Hamming: abd_distances.strings.hamming
  • Levenshtein: abd_distances.strings.levenshtein
  • Needleman-Wunsch: abd_distances.strings.needleman_wunsch
  • Smith-Waterman
  • Pairwise Distances
  • ...
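
As a reference for what the first two string distances measure, here is a minimal pure-Python sketch of Hamming and Levenshtein distance with unit costs. It is written for illustration only. The package's functions may accept different arguments or penalty settings, so treat the signatures below as assumptions.

```python
def hamming(a, b):
    # Number of positions at which two equal-length strings differ.
    if len(a) != len(b):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(1 for ca, cb in zip(a, b) if ca != cb)


def levenshtein(a, b):
    # Minimum number of single-character insertions, deletions, and
    # substitutions needed to turn a into b (two-row dynamic programming).
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


print(hamming("karolin", "kathrin"))
# 3
print(levenshtein("kitten", "sitting"))
# 3
```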

Benchmarks

SIMD-Accelerated Vector Distance Benchmarks

These benchmarks were run on an Intel Core i7-11700KF CPU @ 4.900GHz, using a single thread. The OS was Arch Linux, with kernel version 6.7.4-arch1-1.

The "Min", "Max", and "Mean" columns show the minimum, maximum, and mean times (in seconds) taken to compute the distances using the functions from scipy.spatial.distance. The "Min (+)", "Max (+)", and "Mean (+)" columns show the corresponding times for this package's functions, with the speedup over the scipy functions in parentheses. All pairwise distances (cdist and pdist) were computed for 200x200 vectors of 500 dimensions, and the average time was taken over 100 runs. All individual distances were computed for 20x20 vectors of 500 dimensions, and the average time was taken over 100 runs.

| Benchmark | Min | Max | Mean | Min (+) | Max (+) | Mean (+) |
|---|---|---|---|---|---|---|
| cdist, euclidean, f32 | 2.560 | 2.576 | 2.566 | 0.185 (13.9x) | 0.196 (13.2x) | 0.188 (13.7x) |
| cdist, euclidean, f64 | 2.398 | 2.406 | 2.401 | 0.292 (8.2x) | 0.307 (7.8x) | 0.298 (8.0x) |
| cdist, sqeuclidean, f32 | 2.519 | 2.527 | 2.523 | 0.182 (13.9x) | 0.197 (12.8x) | 0.187 (13.5x) |
| cdist, sqeuclidean, f64 | 2.381 | 2.393 | 2.389 | 0.293 (8.1x) | 0.318 (7.5x) | 0.301 (7.9x) |
| cdist, cosine, f32 | 4.011 | 4.021 | 4.016 | 0.625 (6.4x) | 0.637 (6.3x) | 0.632 (6.4x) |
| cdist, cosine, f64 | 3.978 | 4.009 | 3.992 | 0.626 (6.4x) | 0.666 (6.0x) | 0.638 (6.3x) |
| pdist, euclidean, f32 | 1.235 | 1.249 | 1.241 | 0.252 (4.9x) | 0.263 (4.7x) | 0.257 (4.8x) |
| pdist, euclidean, f64 | 1.216 | 1.262 | 1.234 | 0.302 (4.0x) | 0.312 (4.0x) | 0.308 (4.0x) |
| pdist, sqeuclidean, f32 | 1.229 | 1.250 | 1.237 | 0.251 (4.9x) | 0.303 (4.1x) | 0.265 (4.7x) |
| pdist, sqeuclidean, f64 | 1.209 | 1.213 | 1.211 | 0.306 (3.9x) | 0.313 (3.9x) | 0.310 (3.9x) |
| pdist, cosine, f32 | 2.001 | 2.017 | 2.006 | 0.468 (4.3x) | 0.484 (4.2x) | 0.478 (4.2x) |
| pdist, cosine, f64 | 1.991 | 2.004 | 1.996 | 0.461 (4.3x) | 0.476 (4.2x) | 0.471 (4.2x) |
| euclidean, f32 | 0.644 | 0.670 | 0.654 | 0.076 (8.5x) | 0.080 (8.4x) | 0.078 (8.3x) |
| euclidean, f64 | 0.672 | 0.701 | 0.682 | 0.097 (6.9x) | 0.102 (6.9x) | 0.100 (6.8x) |
| sqeuclidean, f32 | 0.506 | 0.512 | 0.508 | 0.076 (6.6x) | 0.079 (6.5x) | 0.078 (6.5x) |
| sqeuclidean, f64 | 0.515 | 0.519 | 0.518 | 0.100 (5.1x) | 0.104 (5.0x) | 0.103 (5.0x) |
| cosine, f32 | 0.668 | 0.687 | 0.677 | 0.110 (6.1x) | 0.113 (6.1x) | 0.111 (6.1x) |
| cosine, f64 | 0.465 | 0.472 | 0.469 | 0.127 (3.7x) | 0.130 (3.6x) | 0.129 (3.6x) |
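
To make the column definitions concrete, the sketch below shows one way to collect minimum, maximum, and mean timings over repeated runs and to express a speedup as the ratio of baseline time to candidate time. It uses only the standard library and two stand-in workloads. The helper names `time_runs` and `speedup` are hypothetical, and this is not the harness that produced the numbers in the table.

```python
import statistics
import time


def time_runs(func, runs=100):
    # Call func `runs` times and summarize the wall-clock seconds per call.
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        func()
        samples.append(time.perf_counter() - start)
    return {
        "min": min(samples),
        "max": max(samples),
        "mean": statistics.fmean(samples),
    }


def speedup(baseline_seconds, candidate_seconds):
    # A value of 4.0 means the candidate took a quarter of the baseline time.
    return baseline_seconds / candidate_seconds


# Stand-in workloads (assumptions for illustration, not real distance calls).
data = list(range(10_000))


def baseline():
    total = 0
    for x in data:
        total += x * x
    return total


def candidate():
    return sum(x * x for x in data)


base = time_runs(baseline, runs=20)
cand = time_runs(candidate, runs=20)
for key in ("min", "max", "mean"):
    ratio = speedup(base[key], cand[key])
    print(f"{key}: {base[key]:.6f}s vs {cand[key]:.6f}s ({ratio:.1f}x)")
```

The printed numbers vary from run to run and from machine to machine, which is why the benchmark tables report all three of minimum, maximum, and mean.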

[Plots: Euclidean, Squared Euclidean, and Cosine, each for f32 and f64]

Vector Distance Benchmarks (No SIMD)

These benchmarks were run on an Intel Core i7-11700KF CPU @ 4.900GHz, using a single thread. The OS was Arch Linux, with kernel version 6.7.4-arch1-1.

The "Min", "Max", and "Mean" columns show the minimum, maximum, and mean times (in seconds) taken to compute the distances using the functions from scipy.spatial.distance. The "Min (+)", "Max (+)", and "Mean (+)" columns show the corresponding times for this package's functions, with the speedup over the scipy functions in parentheses. All pairwise distances (cdist and pdist) were computed for 200x200 vectors of 500 dimensions, and the average time was taken over 100 runs. All individual distances were computed for 20x20 vectors of 500 dimensions, and the average time was taken over 100 runs.

These benchmarks were run using the richbench package.

| Benchmark | Min | Max | Mean | Min (+) | Max (+) | Mean (+) |
|---|---|---|---|---|---|---|
| braycurtis, f32 | 1.103 | 1.134 | 1.114 | 0.323 (3.4x) | 0.324 (3.5x) | 0.323 (3.4x) |
| braycurtis, f64 | 0.834 | 0.843 | 0.838 | 0.170 (4.9x) | 0.173 (4.9x) | 0.171 (4.9x) |
| canberra, f32 | 2.524 | 2.529 | 2.526 | 0.153 (16.5x) | 0.155 (16.3x) | 0.154 (16.4x) |
| canberra, f64 | 2.216 | 2.260 | 2.235 | 0.168 (13.2x) | 0.170 (13.3x) | 0.169 (13.2x) |
| chebyshev, f32 | 2.738 | 2.774 | 2.753 | 0.149 (18.3x) | 0.151 (18.4x) | 0.150 (18.4x) |
| chebyshev, f64 | 2.777 | 2.784 | 2.781 | 0.165 (16.8x) | 0.166 (16.8x) | 0.165 (16.8x) |
| euclidean, f32 | 0.641 | 0.641 | 0.641 | 0.150 (4.3x) | 0.150 (4.3x) | 0.150 (4.3x) |
| euclidean, f64 | 0.657 | 0.662 | 0.660 | 0.167 (3.9x) | 0.168 (3.9x) | 0.168 (3.9x) |
| sqeuclidean, f32 | 0.506 | 0.509 | 0.507 | 0.149 (3.4x) | 0.149 (3.4x) | 0.149 (3.4x) |
| sqeuclidean, f64 | 0.514 | 0.518 | 0.516 | 0.165 (3.1x) | 0.173 (3.0x) | 0.170 (3.0x) |
| cityblock, f32 | 0.437 | 0.443 | 0.440 | 0.150 (2.9x) | 0.150 (2.9x) | 0.150 (2.9x) |
| cityblock, f64 | 0.444 | 0.451 | 0.447 | 0.167 (2.7x) | 0.168 (2.7x) | 0.168 (2.7x) |
| cosine, f32 | 0.659 | 0.668 | 0.664 | 0.314 (2.1x) | 0.315 (2.1x) | 0.314 (2.1x) |
| cosine, f64 | 0.459 | 0.471 | 0.465 | 0.321 (1.4x) | 0.325 (1.4x) | 0.324 (1.4x) |
| cdist, braycurtis, f32 | 4.902 | 4.906 | 4.904 | 1.802 (2.7x) | 1.875 (2.6x) | 1.833 (2.7x) |
| cdist, braycurtis, f64 | 4.765 | 4.775 | 4.768 | 0.710 (6.7x) | 0.735 (6.5x) | 0.725 (6.6x) |
| cdist, canberra, f32 | 6.914 | 6.943 | 6.930 | 1.356 (5.1x) | 1.385 (5.0x) | 1.367 (5.1x) |
| cdist, canberra, f64 | 6.782 | 6.813 | 6.797 | 0.684 (9.9x) | 0.701 (9.7x) | 0.692 (9.8x) |
| cdist, chebyshev, f32 | 2.763 | 2.768 | 2.765 | 0.640 (4.3x) | 0.663 (4.2x) | 0.649 (4.3x) |
| cdist, chebyshev, f64 | 2.659 | 2.677 | 2.664 | 0.647 (4.1x) | 0.662 (4.0x) | 0.655 (4.1x) |
| cdist, euclidean, f32 | 2.563 | 2.570 | 2.564 | 0.644 (4.0x) | 0.658 (3.9x) | 0.653 (3.9x) |
| cdist, euclidean, f64 | 2.378 | 2.400 | 2.388 | 0.630 (3.8x) | 0.649 (3.7x) | 0.640 (3.7x) |
| cdist, sqeuclidean, f32 | 2.516 | 2.523 | 2.519 | 0.648 (3.9x) | 0.660 (3.8x) | 0.652 (3.9x) |
| cdist, sqeuclidean, f64 | 2.412 | 2.423 | 2.417 | 0.631 (3.8x) | 0.645 (3.8x) | 0.638 (3.8x) |
| cdist, cityblock, f32 | 4.545 | 4.552 | 4.548 | 0.647 (7.0x) | 0.671 (6.8x) | 0.658 (6.9x) |
| cdist, cityblock, f64 | 4.406 | 4.407 | 4.407 | 0.633 (7.0x) | 0.657 (6.7x) | 0.647 (6.8x) |
| cdist, cosine, f32 | 4.010 | 4.020 | 4.013 | 2.254 (1.8x) | 2.292 (1.8x) | 2.270 (1.8x) |
| cdist, cosine, f64 | 3.987 | 3.992 | 3.990 | 2.241 (1.8x) | 2.288 (1.7x) | 2.258 (1.8x) |
| pdist, braycurtis, f32 | 2.382 | 2.387 | 2.385 | 1.062 (2.2x) | 1.074 (2.2x) | 1.069 (2.2x) |
| pdist, braycurtis, f64 | 2.368 | 2.378 | 2.371 | 0.510 (4.6x) | 0.523 (4.5x) | 0.516 (4.6x) |
| pdist, canberra, f32 | 3.374 | 3.411 | 3.389 | 0.831 (4.1x) | 0.841 (4.1x) | 0.835 (4.1x) |
| pdist, canberra, f64 | 3.369 | 3.411 | 3.396 | 0.504 (6.7x) | 0.515 (6.6x) | 0.509 (6.7x) |
| pdist, chebyshev, f32 | 1.362 | 1.364 | 1.363 | 0.478 (2.9x) | 0.488 (2.8x) | 0.484 (2.8x) |
| pdist, chebyshev, f64 | 1.338 | 1.343 | 1.341 | 0.476 (2.8x) | 0.485 (2.8x) | 0.481 (2.8x) |
| pdist, euclidean, f32 | 1.241 | 1.250 | 1.246 | 0.482 (2.6x) | 0.487 (2.6x) | 0.484 (2.6x) |
| pdist, euclidean, f64 | 1.222 | 1.228 | 1.225 | 0.474 (2.6x) | 0.503 (2.4x) | 0.482 (2.5x) |
| pdist, sqeuclidean, f32 | 1.224 | 1.247 | 1.233 | 0.477 (2.6x) | 0.490 (2.5x) | 0.481 (2.6x) |
| pdist, sqeuclidean, f64 | 1.211 | 1.214 | 1.213 | 0.470 (2.6x) | 0.478 (2.5x) | 0.475 (2.6x) |
| pdist, cityblock, f32 | 2.204 | 2.207 | 2.205 | 0.483 (4.6x) | 0.491 (4.5x) | 0.486 (4.5x) |
| pdist, cityblock, f64 | 2.189 | 2.198 | 2.192 | 0.476 (4.6x) | 0.483 (4.5x) | 0.481 (4.6x) |
| pdist, cosine, f32 | 2.000 | 2.004 | 2.002 | 1.292 (1.5x) | 1.302 (1.5x) | 1.296 (1.5x) |
| pdist, cosine, f64 | 1.988 | 1.992 | 1.990 | 1.288 (1.5x) | 1.296 (1.5x) | 1.292 (1.5x) |

[Plots: Chebyshev, Euclidean, Squared Euclidean, Manhattan, Canberra, and Cosine, each for f32 and f64; Bray-Curtis for u32 and u64]

String Distance Benchmarks

These benchmarks were run on an Intel Core i7-11700KF CPU @ 4.900GHz, using a single thread. The OS was Arch Linux, with kernel version 6.7.4-arch1-1.

All string distances were computed 100 times each, among different pairs of strings, and the average time was taken.

[Plots: Hamming, Levenshtein, Needleman-Wunsch]

License

This package is licensed under the MIT license.
