Closely: find the closest pairs of points (e.g., duplicates) in a dataset

Closely 📐

Find the closest pairs in an array.

Closely compares the distances between the rows of an array (for example, embeddings) and sorts the resulting pairs by distance.
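The idea of comparing distances and sorting them can be illustrated with a brute-force sketch that uses only the standard library. This is not closely's implementation: the helper name closest_pairs_bruteforce and its return format are assumptions made for illustration, and it uses Euclidean distance, which may or may not match what the package computes.

```python
import math
from itertools import combinations


def closest_pairs_bruteforce(points, n=1):
    """Return the n closest index pairs and their Euclidean distances.

    Hypothetical helper for illustration only; not part of closely.
    """
    # Compute the distance for every unordered pair of row indices (i < j).
    scored = [
        (math.dist(points[i], points[j]), (i, j))
        for i, j in combinations(range(len(points)), 2)
    ]
    # Sort by distance (ties broken by index pair) and keep the n smallest.
    scored.sort()
    top = scored[:n]
    pairs = [pair for _, pair in top]
    distances = [d for d, _ in top]
    return pairs, distances


points = [(0.0, 0.0), (5.0, 5.0), (0.1, 0.0), (5.0, 5.3), (9.0, 1.0)]
pairs, distances = closest_pairs_bruteforce(points, n=2)
print(pairs)      # index pairs, closest first
print(distances)  # matching distances
```

The sketch examines every pair, so its cost grows quadratically with the number of points. It is meant to show what a result looks like rather than to scale to large datasets.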

Getting Started

pip install closely

or install from source:

git clone https://github.com/justinshenk/closely
cd closely
pip install .

How to use

import closely

# X is a NumPy array of shape (n_points, n_features)
pairs, distances = closely.solve(X, n=1)

You can specify how many pairs you want to identify with n.
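When the goal is to find duplicates rather than a fixed number of pairs, a distance tolerance can be a more natural stopping rule than a count. The following is a minimal standard-library sketch of that idea. The function near_duplicates and its tol parameter are hypothetical and are not part of closely's API.

```python
import math
from itertools import combinations


def near_duplicates(points, tol=1e-9):
    """Return index pairs whose Euclidean distance is at most tol, closest first.

    Hypothetical helper for illustration only; not part of closely.
    """
    hits = []
    for i, j in combinations(range(len(points)), 2):
        d = math.dist(points[i], points[j])
        if d <= tol:
            hits.append((d, i, j))
    # Sort so that exact duplicates (distance 0) come before near-duplicates.
    hits.sort()
    return [(i, j) for _, i, j in hits]


rows = [(1.0, 2.0), (3.0, 4.0), (1.0, 2.0), (3.0, 4.0 + 1e-12), (7.0, 8.0)]
dups = near_duplicates(rows)
print(dups)
```

Setting tol=0.0 restricts the result to exact duplicates. A small positive tolerance also catches rows that differ only by floating-point noise.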

Example

import closely
import numpy as np
import matplotlib.pyplot as plt

# Create a dataset of 100 random 2-D points
X = np.random.random((100, 2))
pairs, distances = closely.solve(X, n=1)

# Plot the points
x, y = X[:, 0], X[:, 1]
fig, ax = plt.subplots()
ax.scatter(x, y)

# Label every point with its index; points that belong to a closest pair are red
for i in range(len(X)):
    if i in pairs:
        ax.annotate(str(i), (x[i], y[i]), color='red')
    else:
        ax.annotate(str(i), (x[i], y[i]))

plt.show()

Check the pairs (the dataset is random, so your indices will differ):

In [10]: pairs
Out[10]:
array([[ 7, 16],
       [96, 50]])

Output: [Figure: example plot]

Credit and Explanation

The Python code for ordering distance matrices is adapted from Andriy Lazorenko. Justin Shenk packaged it and extended it to work with more than two features.

Download files

Source Distribution

closely-19.0.2.dev0.tar.gz (6.5 kB view details)

Uploaded Source

File details

Details for the file closely-19.0.2.dev0.tar.gz.

File metadata

  • Download URL: closely-19.0.2.dev0.tar.gz
  • Upload date:
  • Size: 6.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.0.1 requests-toolbelt/0.9.1 tqdm/4.32.1 CPython/3.7.3

File hashes

Hashes for closely-19.0.2.dev0.tar.gz
Algorithm Hash digest
SHA256 75c67c61a72f5a956fa5cd336b78b099fbf639eeeca9c6b4acb50fa9967113e1
MD5 06dba047d161b91743ebf976cb1750b2
BLAKE2b-256 deb89eca3a2b22ba1d0d226cd8a9edf4ed8e59b7134adf7325c766af5d03f559
