Sample Relation Finder; Efficient thresholded proximity graph computation; RGP toolkit.

Refnd

In datasets generated by an RGP (Relational Generative Process), such as an evolution-like process, the relational structure matters for many tasks: leakage-free splitting, visualization, hypothesis generation and validation, etc. The relational structure of a dataset is the knowledge of which pairs of elements are related (for example, which samples are evolutionarily related). However, this structure is rarely known in advance, so it must be inferred.

Given a distance measure, we can compute all pairwise distances by brute force to find related samples. We then define a distance threshold under which two samples are considered related. Linking each such pair with an edge weighted by its distance yields a thresholded proximity graph.
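As an illustration of the brute-force construction described above, here is a minimal pure-Python sketch. It is not the library's implementation: the function names, the toy mismatch-fraction distance, and the strict `<` comparison against the threshold are all assumptions made for the example.

```python
from itertools import combinations


def brute_force_proximity_graph(items, distance, threshold):
    """Return (i, j, d) edges for every pair whose distance is below threshold.

    Evaluates all n * (n - 1) / 2 pairs, hence the O(n^2) cost.
    """
    edges = []
    for i, j in combinations(range(len(items)), 2):
        d = distance(items[i], items[j])
        if d < threshold:  # assumption: strictly below the threshold
            edges.append((i, j, d))
    return edges


def mismatch_fraction(a, b):
    """Toy distance (hypothetical): fraction of differing positions in equal-length strings."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


samples = ["ACDE", "ACDF", "WXYZ", "WXYY"]
edges = brute_force_proximity_graph(samples, mismatch_fraction, threshold=0.3)
print(edges)  # [(0, 1, 0.25), (2, 3, 0.25)]
```

Each edge carries the indices of the two related samples and their distance as the weight.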

The problem with the brute-force approach is its $O(n^2)$ computational complexity, which does not scale to large datasets. To address this, we defined a variant of the Hierarchical Navigable Small World (HNSW) algorithm that builds an approximation of this thresholded proximity graph in $O(n \log n)$ instead.

Once the graph is obtained, operations on the dataset become easier and more theoretically grounded. If the distance measure is well chosen, the graph distance between two samples can correlate with the likelihood that they are related. This helps to visualize the data, split datasets without leakage, make discoveries, etc.

Furthermore, we can cluster the graph by finding communities or connected components. Splitting along these clusters divides the dataset into train and test sets without leakage with respect to the proximity threshold.
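The idea of splitting along clusters can be sketched in pure Python as follows. This is only an illustration under stated assumptions, not the library's `find_components` or `partition`: the union-find helper, the greedy smallest-cluster-first assignment, and the `test_fraction` parameter are all hypothetical.

```python
from collections import defaultdict


def connected_components(n, edges):
    """Group node ids 0..n-1 into connected components with union-find."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri

    groups = defaultdict(list)
    for node in range(n):
        groups[find(node)].append(node)
    return list(groups.values())


def split_along_clusters(clusters, test_fraction):
    """Greedy heuristic (assumption): fill the test set with the smallest clusters first."""
    total = sum(len(c) for c in clusters)
    train, test = [], []
    for cluster in sorted(clusters, key=len):
        if len(test) + len(cluster) <= test_fraction * total:
            test.extend(cluster)
        else:
            train.extend(cluster)
    return sorted(train), sorted(test)


edges = [(0, 1), (1, 2), (3, 4)]  # pairs closer than the proximity threshold
clusters = connected_components(6, edges)
train_ids, test_ids = split_along_clusters(clusters, test_fraction=0.5)

# No edge crosses the split, so no related pair leaks between train and test.
train_set = set(train_ids)
leaks = [(i, j) for i, j in edges if (i in train_set) != (j in train_set)]
print(train_ids, test_ids, leaks)  # [0, 1, 2] [3, 4, 5] []
```

Because whole clusters are assigned to one side, every pair of samples closer than the threshold ends up on the same side of the split.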

This library contains a toolkit of efficient functions and data structures for working with datasets generated by an RGP. The core computations are implemented in Rust and multithreaded for maximum throughput. Everything is wrapped in an easy-to-use Python API. It currently supports:

  • Protein/peptide sequences with local and global alignments
  • Molecules with real-valued and bit-based Tanimoto similarities
  • More coming!
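For readers unfamiliar with Tanimoto similarity, the sketch below shows the standard bit-based and real-valued definitions in pure Python. It is not the library's code: the function names, the use of Python integers as bit fingerprints, and the convention that two empty inputs have similarity 1.0 are assumptions for this example.

```python
def tanimoto_bits(a: int, b: int) -> float:
    """Bit-based Tanimoto: |A AND B| / |A OR B| on integer-encoded fingerprints."""
    union = bin(a | b).count("1")
    if union == 0:
        return 1.0  # assumption: two empty fingerprints count as identical
    return bin(a & b).count("1") / union


def tanimoto_real(x, y) -> float:
    """Real-valued Tanimoto: x.y / (|x|^2 + |y|^2 - x.y)."""
    dot = sum(p * q for p, q in zip(x, y))
    denom = sum(p * p for p in x) + sum(q * q for q in y) - dot
    return 1.0 if denom == 0 else dot / denom


sim_bits = tanimoto_bits(0b1101, 0b1001)           # 2 shared bits out of 3 set bits
sim_real = tanimoto_real([1.0, 0.0, 1.0], [1.0, 1.0, 1.0])
print(sim_bits, sim_real)

# A similarity s in [0, 1] can be turned into a distance as 1 - s.
dist_bits = 1.0 - sim_bits
```

On 0/1 vectors the two definitions agree, which is why both calls above give 2/3.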

To give an idea of what the library contains, here are some of its functions:

  • HNSW approximate proximity graph in $O(n \log n)$
  • Nearest-neighbor search, either exact ($O(n)$) or approximate using HNSW ($O(\log n)$)
  • Exact proximity graph in $O(n^2)$
  • Leiden clustering to find communities in the graph.
  • Find connected components within the graph.
  • Partition a dataset along clusters to prevent data leakage.
  • And more!
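To make the $O(n)$ cost of exact nearest-neighbor search concrete, here is a minimal linear-scan sketch in pure Python. It does not use the library's API: the function name, its signature, and the absolute-difference distance are hypothetical.

```python
import heapq


def exact_nearest_neighbors(query, items, distance, k=1):
    """Return the k closest items as (distance, index) pairs, nearest first.

    Every item is compared with the query once, so one query costs O(n)
    distance evaluations.
    """
    scored = ((distance(query, item), idx) for idx, item in enumerate(items))
    return heapq.nsmallest(k, scored)


points = [1.0, 4.0, 2.5, 9.0]
neighbors = exact_nearest_neighbors(3.0, points, lambda a, b: abs(a - b), k=2)
print(neighbors)  # [(0.5, 2), (1.0, 1)]
```

An HNSW index avoids scanning every item by navigating a layered graph, trading exactness for the logarithmic query time listed above.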

Installation – Python

pip install refnd

Build from source (latest version, potentially unstable)

pip install "git+https://github.com/anthol42/refnd.git#subdirectory=py"

Example

The following example shows how to split a protein dataset with the Python API, using 1 - global alignment as the distance function.

from refnd import KernelVariant, HNSWState, find_communities, find_components, partition
from refnd.utils import read_fasta

# Load the dataset
dataset = read_fasta("datasets/proteins.fasta")
sequences = [seq for header, seq in dataset]

# Initiate the HNSW index
hnsw = HNSWState(KernelVariant.ProteinGlobal, sequences, proximity_threshold=0.3)

# Build it
hnsw.build()

# Get the proximity edges
edges = hnsw.edges()

# Load the graph
g = edges.graph()

# Get clusters: pick one of the two strategies
# clusters = find_components(g)  # Component-based clustering: faster
clusters = find_communities(g)   # Community-based clustering: smaller clusters

# Partition into train and test
train_ids, test_ids = partition(clusters, g, post_filtering=True)
train = [dataset[i] for i in train_ids]
test = [dataset[i] for i in test_ids]

Documentation

See the documentation here: https://anthol42.github.io/refnd/

Feedback

This project is in active development, and your feedback is greatly appreciated. If you find a bug, would like a new feature, or want to share your thoughts on the API, please open an issue and we will be happy to help.
