Sample Relation Finder; Efficient thresholded proximity graph computation; RGP toolkit.
Project description
Refnd
In datasets generated by an RGP (Relational Generative Process), such as datasets produced by evolution-like processes, the relational structure matters for many tasks: splitting without leakage, visualization, hypothesis generation and validation, etc. The relational structure of the dataset consists of knowing which pairs of elements are related (for example, which samples are evolutionarily related). However, this structure is rarely known in advance, so it has to be inferred.
Given a distance measurement, we can brute-force compute all pairwise distances to find related samples. Then, we can define a distance threshold under which samples are considered related. Linking each such pair of samples with an edge weighted by their distance yields a thresholded proximity graph.
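To make the brute-force construction concrete, here is a minimal pure-Python sketch. It is not the refnd implementation: the function names and the toy mismatch-fraction distance are assumptions made for illustration, and any symmetric distance function could be plugged in instead.

```python
from itertools import combinations


def hamming_distance(a: str, b: str) -> float:
    """Toy distance (illustrative only): fraction of mismatched positions."""
    if len(a) != len(b):
        raise ValueError("this toy distance expects equal-length strings")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def brute_force_proximity_graph(samples, distance, threshold):
    """Return weighted edges (i, j, d) for every pair with d < threshold."""
    edges = []
    # Visits all n * (n - 1) / 2 unordered pairs, hence the quadratic cost.
    for i, j in combinations(range(len(samples)), 2):
        d = distance(samples[i], samples[j])
        if d < threshold:
            edges.append((i, j, d))
    return edges


samples = ["ACDE", "ACDF", "WXYZ", "WXYA"]
edges = brute_force_proximity_graph(samples, hamming_distance, threshold=0.3)
print(edges)
```

Only the pairs that differ in a single position fall under the 0.3 threshold here, so the resulting graph has two edges and two disconnected groups.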
The problem with the brute-force approach is its $O(n^2)$ computational complexity, which does not scale to large datasets. Fortunately, we defined a variant of the Hierarchical Navigable Small World (HNSW) algorithm that builds an approximate thresholded proximity graph in $O(n \log n)$ instead.
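As a rough illustration of what the two growth rates mean in practice, the short script below compares the number of pairs a brute-force pass evaluates with $n \log_2 n$. It ignores the constant factors hidden in the big-O notation, so it only shows the trend and is not a benchmark of the library.

```python
import math


def pair_count(n: int) -> int:
    """Number of unordered pairs evaluated by the brute-force approach."""
    return n * (n - 1) // 2


def n_log_n(n: int) -> float:
    """Growth term of the HNSW-based construction, constants ignored."""
    return n * math.log2(n)


for n in (1_000, 100_000, 10_000_000):
    print(f"n={n:>10,}  pairs={pair_count(n):.3e}  n*log2(n)={n_log_n(n):.3e}")
```

For 100,000 samples, the brute-force approach evaluates about 5 billion pairs, while $n \log_2 n$ is about 1.66 million.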
Once the graph is obtained, operations on the dataset become easier, and more theoretically grounded. In fact, the distance in the graph between two samples can correlate with the likelihood of two samples being related if the distance measurement is well chosen. Hence, this helps visualize the data, split datasets without leakage, make discoveries, etc.
Furthermore, we can cluster the graph by finding communities or connected components. By splitting along these clusters, we can divide the dataset into train and test sets without leakage with respect to the proximity threshold.
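The idea of splitting along clusters can be sketched in plain Python as follows. This is an illustration under assumptions, not the library's algorithm: it finds connected components with a union-find structure, then greedily assigns whole components to the test set. The names `connected_components` and `split_along_clusters` and the `test_fraction` parameter are hypothetical.

```python
from collections import defaultdict


def connected_components(n_nodes, edges):
    """Group node ids into connected components using union-find."""
    parent = list(range(n_nodes))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j, *_ in edges:  # ignore the optional edge weight
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri

    groups = defaultdict(list)
    for node in range(n_nodes):
        groups[find(node)].append(node)
    return list(groups.values())


def split_along_clusters(clusters, test_fraction=0.2):
    """Assign whole clusters to the test set while it stays within budget."""
    total = sum(len(c) for c in clusters)
    train, test = [], []
    for cluster in sorted(clusters, key=len):  # smallest clusters first
        if len(test) + len(cluster) <= test_fraction * total:
            test.extend(cluster)
        else:
            train.extend(cluster)
    return sorted(train), sorted(test)


edges = [(0, 1, 0.25), (1, 2, 0.10), (3, 4, 0.20)]  # node 5 is isolated
clusters = connected_components(6, edges)
train_ids, test_ids = split_along_clusters(clusters, test_fraction=0.5)
print(train_ids, test_ids)
```

Because a cluster is never divided, no edge of the graph can connect a training sample to a test sample.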
This library contains a toolkit of efficient functions and data structures to work with datasets generated by an RGP. The core computations are implemented in Rust and multithreaded for maximum throughput! Everything is wrapped in an easy-to-use Python API. It currently supports:
- Protein/peptide sequences with local and global alignments
- Molecules with real-valued and bit-based Tanimoto similarities
- More coming!
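For readers unfamiliar with Tanimoto similarity, the sketch below shows the standard textbook definitions for bit fingerprints and for real-valued vectors. Whether refnd uses exactly these formulas, and how it represents fingerprints, is not stated on this page; the function names and the integer-as-bitset representation are assumptions for illustration.

```python
def tanimoto_bits(a: int, b: int) -> float:
    """Tanimoto similarity of two fingerprints stored as Python ints (bitsets)."""
    union = bin(a | b).count("1")
    if union == 0:
        return 1.0  # convention chosen here: two empty fingerprints are identical
    return bin(a & b).count("1") / union


def tanimoto_real(x, y) -> float:
    """Tanimoto similarity of real-valued vectors: x.y / (|x|^2 + |y|^2 - x.y)."""
    dot = sum(p * q for p, q in zip(x, y))
    denom = sum(p * p for p in x) + sum(q * q for q in y) - dot
    return 1.0 if denom == 0 else dot / denom


fp_a = 0b101101
fp_b = 0b100111
sim = tanimoto_bits(fp_a, fp_b)  # 3 shared bits out of 5 set in either
distance = 1.0 - sim             # one common way to turn a similarity into a distance
print(sim, distance)
```

On 0/1 vectors the real-valued formula reduces to the bit-based one, which is why both are called Tanimoto similarity.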
To give an idea of what the library contains, we have these functions:
- HNSW approximate proximity graph in $O(n \log n)$
- Find nearest neighbors using an exact algorithm ($O(n)$) or an approximate one using HNSW ($O(\log n)$)
- Exact proximity graph in $O(n^2)$
- Leiden clustering to find communities in the graph.
- Find connected components within the graph.
- Partition a dataset along clusters to prevent data leakage.
- And more!
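As an illustration of the exact nearest-neighbor search listed above, the following sketch scans every sample once per query, which costs $O(n)$ distance evaluations. It does not use the refnd API; `exact_nearest_neighbors` and the scalar toy distance are hypothetical names chosen for the example.

```python
import heapq


def exact_nearest_neighbors(query, samples, distance, k=3):
    """Return the k closest (distance, index) pairs by scanning every sample."""
    scored = ((distance(query, s), i) for i, s in enumerate(samples))
    return heapq.nsmallest(k, scored)


def abs_distance(a: float, b: float) -> float:
    """Toy distance between two numbers (illustrative only)."""
    return abs(a - b)


samples = [0.0, 2.5, 9.0, 3.0, 7.5]
neighbors = exact_nearest_neighbors(3.2, samples, abs_distance, k=2)
print([index for _, index in neighbors])
```

An approximate index such as HNSW trades this guaranteed-correct linear scan for a much cheaper search that may occasionally miss a true neighbor.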
Installation – Python
pip install refnd
Build from source (latest version, potentially unstable)
pip install "git+https://github.com/anthol42/refnd.git#subdirectory=py"
Example
The following example shows how to split a protein dataset with the Python API, using 1 - global alignment as the distance function.
from refnd import KernelVariant, HNSWState, find_communities, find_components, partition
from refnd.utils import read_fasta
# Load the dataset
dataset = read_fasta("datasets/proteins.fasta")
sequences = [seq for header, seq in dataset]
# Initiate the HNSW index
hnsw = HNSWState(KernelVariant.ProteinGlobal, sequences, proximity_threshold=0.3)
# Build it
hnsw.build()
# Get the proximity edges
edges = hnsw.edges()
# Load the graph
g = edges.graph()
# Get clusters (pick one strategy; the second assignment overwrites the first)
clusters = find_components(g) # Component-based clustering: faster
clusters = find_communities(g) # Community-based clustering: smaller clusters
# Partition into train and test
train_ids, test_ids = partition(clusters, g, post_filtering=True)
train = [dataset[i] for i in train_ids]
test = [dataset[i] for i in test_ids]
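After a split like the one above, it can be reassuring to check that no proximity edge crosses the train/test boundary. The sketch below assumes the edges are available as plain `(i, j, distance)` tuples; how to export them from the refnd graph object is not shown on this page, so `edge_list` and `count_leaking_edges` are hypothetical.

```python
def count_leaking_edges(edges, train_ids, test_ids):
    """Count edges that connect a training sample to a test sample."""
    train, test = set(train_ids), set(test_ids)
    return sum(
        1
        for i, j, *_ in edges
        if (i in train and j in test) or (i in test and j in train)
    )


# Hypothetical edge list: (source index, target index, distance).
edge_list = [(0, 1, 0.12), (1, 2, 0.25), (3, 4, 0.08)]

clean = count_leaking_edges(edge_list, train_ids=[0, 1, 2], test_ids=[3, 4])
leaky = count_leaking_edges(edge_list, train_ids=[0, 1], test_ids=[2, 3, 4])
print(clean, leaky)
```

A count of zero means the split respects the proximity threshold; any positive count identifies related samples that ended up on opposite sides.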
Documentation
See the reference documentation here: https://anthol42.github.io/refnd/
Feedback
This project is in active development, and your feedback is greatly appreciated. If you find a bug, would like a new feature, or want to share your thoughts on the API, please open an issue and we will be happy to help.
Download files
Source Distribution
Built Distributions
File details
Details for the file refnd-0.0.1.tar.gz.
File metadata
- Download URL: refnd-0.0.1.tar.gz
- Upload date:
- Size: 215.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 23fca8460af9a16302614c660b06f95dd089f7d8521287b2b22d40977d85a641 |
| MD5 | 47812ec426d4a97ed1f5879b43ccd012 |
| BLAKE2b-256 | 560f47211660c822a5ab9482d89b331cd24dbf9a4ef2f8e3ffe6fa41f39ee238 |
File details
Details for the file refnd-0.0.1-cp314-cp314-win_amd64.whl.
File metadata
- Download URL: refnd-0.0.1-cp314-cp314-win_amd64.whl
- Upload date:
- Size: 3.3 MB
- Tags: CPython 3.14, Windows x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 5833c590a62b4a160706c3a6e5ced153d445a0ddf52977799e5d73426c4a0270 |
| MD5 | f652722ea6d7a7493ae1acf7a0f077e8 |
| BLAKE2b-256 | e647fd3f2e54cc44feae3104ce37268f4bd33b414fc8f09b7d49374a24f4d12b |
File details
Details for the file refnd-0.0.1-cp314-cp314-manylinux_2_28_x86_64.whl.
File metadata
- Download URL: refnd-0.0.1-cp314-cp314-manylinux_2_28_x86_64.whl
- Upload date:
- Size: 4.0 MB
- Tags: CPython 3.14, manylinux: glibc 2.28+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | fd6369908399a81c424abacf3639980a12ce0cc7342a8841a21902cebb3bdb1e |
| MD5 | 34b411d3a729402301ef0dd3396024be |
| BLAKE2b-256 | d444b7192f9cad40895aeefa3fa2989bdca70e39efb979f4a231f697c95baf32 |
File details
Details for the file refnd-0.0.1-cp314-cp314-macosx_11_0_arm64.whl.
File metadata
- Download URL: refnd-0.0.1-cp314-cp314-macosx_11_0_arm64.whl
- Upload date:
- Size: 1.6 MB
- Tags: CPython 3.14, macOS 11.0+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e9d85e97e588d1f8933e8f90ddfd119b9af0feef2030e25d9bb617066aa3d8f0 |
| MD5 | 11984ebb9b4e8208f40df443b7e1b6a5 |
| BLAKE2b-256 | 70855662f55e75f17f629f87aead9380b88b93caff1b0852eeb2ca8f2dae4111 |
File details
Details for the file refnd-0.0.1-cp314-cp314-macosx_10_15_x86_64.whl.
File metadata
- Download URL: refnd-0.0.1-cp314-cp314-macosx_10_15_x86_64.whl
- Upload date:
- Size: 3.2 MB
- Tags: CPython 3.14, macOS 10.15+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6ea9dbce7f3bce3a396fa8025be3ebe1b95135ee69b56de6fffafb2b5ec4420f |
| MD5 | 6ecf426b11db4009d67451add86137dc |
| BLAKE2b-256 | ed1bb63db1f19695adda34ff40fe52cd4b946bb5c183dea1bfed2e639ffc6907 |
File details
Details for the file refnd-0.0.1-cp313-cp313-win_amd64.whl.
File metadata
- Download URL: refnd-0.0.1-cp313-cp313-win_amd64.whl
- Upload date:
- Size: 3.3 MB
- Tags: CPython 3.13, Windows x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 077333f684420f3d61922d0cac16fbc904903aad8419c2e3490ea1403bd9f010 |
| MD5 | d50f3e690a8811050cbb6e5f05f8111f |
| BLAKE2b-256 | a83824b3384e921938dab4e70c5c3e509f745de8173d0eb1afc54f8185edb46d |
File details
Details for the file refnd-0.0.1-cp313-cp313-manylinux_2_28_x86_64.whl.
File metadata
- Download URL: refnd-0.0.1-cp313-cp313-manylinux_2_28_x86_64.whl
- Upload date:
- Size: 4.0 MB
- Tags: CPython 3.13, manylinux: glibc 2.28+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 1c1325bbd926745f5dc2a729575d56e7af9f042460844b9227129c8942e49a92 |
| MD5 | 6617854064494d68b084ddc10a7a8098 |
| BLAKE2b-256 | cb09958717c62bb5315a06724508a6982bdb69b58ec5322afb0b8998e6a9ff10 |
File details
Details for the file refnd-0.0.1-cp313-cp313-macosx_11_0_arm64.whl.
File metadata
- Download URL: refnd-0.0.1-cp313-cp313-macosx_11_0_arm64.whl
- Upload date:
- Size: 1.6 MB
- Tags: CPython 3.13, macOS 11.0+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e281a0d5d8e4ff2290e3cfb93d036056f89cdecfe8dbde9fe7759bce80377ee3 |
| MD5 | 3379b6c77e70c942f13f9461e6b83ed9 |
| BLAKE2b-256 | 780c3b1fd9d9311d5336027836f1db0d4765d0dbbdd2a23803de5f84549c5037 |
File details
Details for the file refnd-0.0.1-cp313-cp313-macosx_10_14_x86_64.whl.
File metadata
- Download URL: refnd-0.0.1-cp313-cp313-macosx_10_14_x86_64.whl
- Upload date:
- Size: 3.3 MB
- Tags: CPython 3.13, macOS 10.14+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | da072e4c1909426369a4b1990376ec751ba3f385167dc0c32faee04af18e6534 |
| MD5 | 9fe00b81eda27ea54b73aedf88abc2bf |
| BLAKE2b-256 | 44ea37219b9334e8749515b922952722794f10b56a1ae2894447a53b315c41db |
File details
Details for the file refnd-0.0.1-cp312-cp312-win_amd64.whl.
File metadata
- Download URL: refnd-0.0.1-cp312-cp312-win_amd64.whl
- Upload date:
- Size: 3.3 MB
- Tags: CPython 3.12, Windows x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d9ae177a03f6596ab486685dba9f67340a5e6bb0c607ff364c6b92831c7c1f46 |
| MD5 | e5f137a268a873c50730dd979fc6fd5b |
| BLAKE2b-256 | 9d0a20855401e4ef6a308d5b52c464e5fb70fa43cbf355ebde99e3589ddec488 |
File details
Details for the file refnd-0.0.1-cp312-cp312-manylinux_2_28_x86_64.whl.
File metadata
- Download URL: refnd-0.0.1-cp312-cp312-manylinux_2_28_x86_64.whl
- Upload date:
- Size: 4.0 MB
- Tags: CPython 3.12, manylinux: glibc 2.28+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 351320d323c32a3b7271d4bd8cb9be603a7f45e533bbcdfce8cedb2ab1cc6969 |
| MD5 | 004509628f116d27ae1720d9971a6d5e |
| BLAKE2b-256 | 3c6b5a9a9c8acba4e47874b25a5b8c44569a252d13269a3b8027df2b7a1532b0 |
File details
Details for the file refnd-0.0.1-cp312-cp312-macosx_11_0_arm64.whl.
File metadata
- Download URL: refnd-0.0.1-cp312-cp312-macosx_11_0_arm64.whl
- Upload date:
- Size: 1.6 MB
- Tags: CPython 3.12, macOS 11.0+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3d2359afce611e194dea3334aca2e30125ea311c78fa996ec459c96f15b2e475 |
| MD5 | fb32bc027198efd3487df80e00dc8996 |
| BLAKE2b-256 | e5649e2386b323e6c3b11011d72417586d0197bfd5bcecdc87febf1065ec87de |
File details
Details for the file refnd-0.0.1-cp312-cp312-macosx_10_14_x86_64.whl.
File metadata
- Download URL: refnd-0.0.1-cp312-cp312-macosx_10_14_x86_64.whl
- Upload date:
- Size: 3.3 MB
- Tags: CPython 3.12, macOS 10.14+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | eb4204183cd5cbf253a7c762035b859f3714a5ee852037fb8b2ae54863317ce2 |
| MD5 | f55623f8a3971019eac35af83b05187b |
| BLAKE2b-256 | 4f649eaf0002af2a2891944e1cdcbcf3b91bab2de09feb492f24f3bccf391316 |
File details
Details for the file refnd-0.0.1-cp311-cp311-win_amd64.whl.
File metadata
- Download URL: refnd-0.0.1-cp311-cp311-win_amd64.whl
- Upload date:
- Size: 3.3 MB
- Tags: CPython 3.11, Windows x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 1333f74b3cef01f0d4c0f4bd4a7c66381a379f05310b9428113b3501c90ab07c |
| MD5 | 31a245cc609c2b842104b10479beb488 |
| BLAKE2b-256 | 012d7724673f2756facf15b6a3922bdf3e8c4f44a1a203d630cf8d72babb44ea |
File details
Details for the file refnd-0.0.1-cp311-cp311-manylinux_2_28_x86_64.whl.
File metadata
- Download URL: refnd-0.0.1-cp311-cp311-manylinux_2_28_x86_64.whl
- Upload date:
- Size: 4.0 MB
- Tags: CPython 3.11, manylinux: glibc 2.28+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 686cc0a35ac52b8704312c0ef7193cac150ecc01a251b319436cb597f531d408 |
| MD5 | 8f2a3303cc39335d8dfeae49afae7d5d |
| BLAKE2b-256 | b54512401c64fb3fbe4c28767636519538ca77b158a8c4f4eaacf9b63ea57ae5 |
File details
Details for the file refnd-0.0.1-cp311-cp311-macosx_11_0_arm64.whl.
File metadata
- Download URL: refnd-0.0.1-cp311-cp311-macosx_11_0_arm64.whl
- Upload date:
- Size: 1.6 MB
- Tags: CPython 3.11, macOS 11.0+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 9d1b993a6b61e6c161403e43a9df02b676e7d8a910a3a7098687f73675fbf2bd |
| MD5 | 2c486cc0a05cfc0d4ac0e7a1ff25c260 |
| BLAKE2b-256 | 0c0579a84fcd0bcb34214b641ab65810c2eef439dcac11a848b762af029fc625 |
File details
Details for the file refnd-0.0.1-cp311-cp311-macosx_10_14_x86_64.whl.
File metadata
- Download URL: refnd-0.0.1-cp311-cp311-macosx_10_14_x86_64.whl
- Upload date:
- Size: 3.3 MB
- Tags: CPython 3.11, macOS 10.14+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6f8f676592167ba1673b577559a06d3a8932b7a10d9285fa31a739e18a6e7b9c |
| MD5 | e022b5af7b375c1b8b66303af6b73e77 |
| BLAKE2b-256 | 31d219d7af0b6c23df4b9a0b8acab14467d7673ea4619e8225f0b3f0c2dc6ac0 |
File details
Details for the file refnd-0.0.1-cp310-cp310-win_amd64.whl.
File metadata
- Download URL: refnd-0.0.1-cp310-cp310-win_amd64.whl
- Upload date:
- Size: 3.3 MB
- Tags: CPython 3.10, Windows x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 9b268f96edf87cf11cf9acb9027280ee2acd77fdc35550543d06a435fec847a1 |
| MD5 | cd36dfcf71236e0588ed08325382a0df |
| BLAKE2b-256 | a978188ed62339bff5494a3dc92dcc82941111502676a8ed94923168d218eb03 |
File details
Details for the file refnd-0.0.1-cp310-cp310-manylinux_2_28_x86_64.whl.
File metadata
- Download URL: refnd-0.0.1-cp310-cp310-manylinux_2_28_x86_64.whl
- Upload date:
- Size: 4.0 MB
- Tags: CPython 3.10, manylinux: glibc 2.28+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 5ed4ceaaa80dfb083fd5778cc3963782d81cdac3d50c028dd5881c4083f9c1f3 |
| MD5 | 8b64ecd3ff2ab8333dd63e681363599e |
| BLAKE2b-256 | 35e9c6082b1d3bba126ec19ea3bcbaf92b44b749be7f99debbcd4a0f910914c2 |
File details
Details for the file refnd-0.0.1-cp310-cp310-macosx_11_0_arm64.whl.
File metadata
- Download URL: refnd-0.0.1-cp310-cp310-macosx_11_0_arm64.whl
- Upload date:
- Size: 1.6 MB
- Tags: CPython 3.10, macOS 11.0+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3f5a7d4499b23390eebec3aef66434f83510c40a2eb10900c541f81780124485 |
| MD5 | bb27f0d630324dcdb9e69f4efa4df37f |
| BLAKE2b-256 | 282048a1a24d7e55bd2e7d41d838bd832d2b53e084adb09482996aaf38df0ad6 |
File details
Details for the file refnd-0.0.1-cp310-cp310-macosx_10_14_x86_64.whl.
File metadata
- Download URL: refnd-0.0.1-cp310-cp310-macosx_10_14_x86_64.whl
- Upload date:
- Size: 3.3 MB
- Tags: CPython 3.10, macOS 10.14+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 746aac7d5fc82f9444a6b8054325369753a530c9a39bbc6681829b7078cf40d6 |
| MD5 | d0d0eb517531527ca68b6f4609b5f660 |
| BLAKE2b-256 | 92bea7be4febd4043770be4852bd15cf61302a1181170a775272d974e4364cab |