Compute distances in numpy arrays with nans
Nandist: Calculating distances in arrays with missing values
The Python library nandist enables fast computation of various distances in numpy arrays containing missing (NaN) values.
Its distance functions are implemented as drop-in replacements for the distance functions in the scipy.spatial.distance module.
Currently, nandist offers the following distance functions:
chebyshev
cityblock
cosine
euclidean
minkowski
It also provides drop-in replacements for pdist and cdist, which can be used for fast calculation of pairwise distances between the vectors stored in matrices.
These functions can be passed a distance metric (metric) and optional parameters such as a weight vector (w) and metric-specific parameters such as Minkowski's p parameter.
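To illustrate these parameters, the sketch below uses scipy.spatial.distance.cdist and pdist on NaN-free data. Because nandist is described as a drop-in replacement, the same metric, w and p arguments are assumed to carry over to nandist.cdist and nandist.pdist; the array values are made up for the example.

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist, squareform

# Example data without NaNs (values chosen only for illustration).
XA = np.array([[0.0, 0.0], [1.0, 2.0]])
XB = np.array([[1.0, 1.0]])
w = np.array([1.0, 0.5])

# Weighted Minkowski distance with p=2:
# (sum_i w_i * |u_i - v_i|**p) ** (1 / p)
D = cdist(XA, XB, metric="minkowski", p=2, w=w)

# pdist returns the condensed pairwise distances within one matrix;
# squareform expands them into a symmetric square matrix.
P = squareform(pdist(XA, metric="cityblock"))

print(D)
print(P)
```

Swapping the scipy imports for their nandist counterparts should, under that assumption, give the same results for NaN-free input.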
Examples
A simple example for calculating the cityblock distance between (0, 1) and (NaN, 0) is shown below.
>>> import nandist
>>> import scipy.spatial.distance
>>> import numpy as np
>>>
>>> # City-block distance between (0, 1) and (NaN, 0)
>>> u, v = np.array([0, 1]), np.array([np.nan, 0])
>>> scipy.spatial.distance.cityblock(u, v)
nan
>>> nandist.cityblock(u, v)
1.0
You can replace the function cityblock with any of the supported distance functions.
You can get pairwise distances between arrays in two matrices using cdist.
The NaNs do not need to be in the same component.
>>> import nandist
>>> import numpy as np
>>> # City-block distances between vectors A = [(0, 0), (1, NaN)] and vectors B = [(1, NaN), (1, 1)]
>>> XA, XB = np.array([[0, 0], [1, np.nan]]), np.array([[1, np.nan], [1, 1]])
>>> nandist.cdist(XA, XB, metric="cityblock")
array([[1., 2.],
       [0., 0.]])
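As a cross-check on what these pairwise values mean, here is a brute-force sketch in plain numpy that skips every component where either vector is NaN. The helpers masked_cityblock and masked_cdist are hypothetical names for this illustration and are not part of nandist.

```python
import numpy as np

def masked_cityblock(u, v):
    """City-block distance over the components where neither u nor v is NaN."""
    keep = ~(np.isnan(u) | np.isnan(v))
    return float(np.abs(u[keep] - v[keep]).sum())

def masked_cdist(XA, XB):
    """Brute-force pairwise masked city-block distances (hypothetical helper)."""
    return np.array([[masked_cityblock(a, b) for b in XB] for a in XA])

XA = np.array([[0, 0], [1, np.nan]])
XB = np.array([[1, np.nan], [1, 1]])
print(masked_cdist(XA, XB))
```

For this input, the loop gives the same matrix as the cdist call above.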
How to install
Using pip:
pip install nandist
Supported metrics
Supported distance metrics are:
- Chebyshev: chebyshev, metric="chebyshev"
- Cityblock: cityblock, metric="cityblock"
- Cosine: cosine, metric="cosine"
- Euclidean: euclidean, metric="euclidean"
- Minkowski: minkowski, metric="minkowski"
If you require support for additional distance metrics, please submit an Issue or Merge Request.
How does it work?
In nandist, the components where a vector is NaN are ignored (interpreted as "any number") in the distance metric.
This is achieved by replacing NaN values with zeros and correcting the resulting overestimated distance value.
Under the hood, nandist calls functions from scipy.spatial.distance and then applies the corrections using numpy linear algebra.
This ensures that the outcomes of nandist functions are equivalent to those of the scipy.spatial.distance functions when arrays without NaNs are passed.
In addition, all heavy computational lifting is done by scipy, so the only additional computational cost is applying the corrections.
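The idea can be sketched for the city-block metric as follows. This is an illustration of zero imputation followed by a linear-algebra correction, written under the assumption that the correction removes the contribution of every imputed component; it is not nandist's actual source, and cityblock_cdist_nan is a hypothetical name.

```python
import numpy as np
from scipy.spatial.distance import cdist

def cityblock_cdist_nan(XA, XB):
    """Pairwise city-block distances that ignore NaN components (sketch)."""
    XA = np.asarray(XA, dtype=float)
    XB = np.asarray(XB, dtype=float)
    nan_a, nan_b = np.isnan(XA), np.isnan(XB)

    # Step 1: impute NaNs as zero and let scipy do the heavy lifting.
    A0 = np.where(nan_a, 0.0, XA)
    B0 = np.where(nan_b, 0.0, XB)
    D = cdist(A0, B0, metric="cityblock")

    # Step 2: a zero standing in for a NaN contributes |other value| to the
    # sum. Collect that excess for every pair with two matrix products.
    excess = nan_a.astype(float) @ np.abs(B0).T + np.abs(A0) @ nan_b.astype(float).T
    return D - excess

XA = np.array([[0, 0], [1, np.nan]])
XB = np.array([[1, np.nan], [1, 1]])
print(cityblock_cdist_nan(XA, XB))
```

When neither matrix contains NaNs, both masks are all False, the excess is zero, and the result is exactly what scipy returns.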
Does it always work?
No. The package nandist corrects the overestimated distances that result when missing values are imputed as zero.
This correction can run into the limits of floating-point arithmetic.
In that case, nandist will raise an appropriate error.
However, such edge cases are rare in typical usage.
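A made-up example of such an edge case, using a hand-written city-block correction in plain numpy rather than nandist itself: when one component is very large compared with the others, the zero-imputed sum absorbs the small terms, and subtracting the correction cannot recover them. How nandist detects this situation is not described here.

```python
import numpy as np

u = np.array([1e16, 1.0])
v = np.array([np.nan, 0.0])

# Zero-impute the NaNs, as in the correction approach.
u0 = np.where(np.isnan(u), 0.0, u)
v0 = np.where(np.isnan(v), 0.0, v)

# 1e16 + 1.0 is not representable in float64 and rounds back to 1e16.
imputed = np.abs(u0 - v0).sum()

# Excess contributed by the imputed zeros (here: |1e16 - 0|).
excess = np.abs(u0[np.isnan(v)]).sum() + np.abs(v0[np.isnan(u)]).sum()
corrected = imputed - excess

# Distance computed directly over the components without NaNs.
keep = ~(np.isnan(u) | np.isnan(v))
expected = np.abs(u[keep] - v[keep]).sum()

print(corrected, expected)  # 0.0 1.0
```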