Redundancy reduction and leakage-aware dataset partitioning for biological ML.

BioSieve

BioSieve is a Python toolkit for preparing biological sequence datasets for machine learning.

It covers two main workflows:

  • Redundancy reduction — remove near-duplicate sequences before training using sequence, embedding, descriptor, or structural similarity
  • Leakage-aware splitting — partition datasets into train/val/test (or k-folds) with strategies that respect biological structure (homology clusters, sequence distance, groups, time)

Installation

BioSieve supports Python 3.11+.

pip install git+https://github.com/kren-ai-lab/biosieve.git

Install optional extras as needed:

  • minhash for approximate Jaccard-based deduplication (minhash_jaccard strategy)
  • faiss for GPU-accelerated embedding similarity search (embedding_cosine strategy)

pip install 'biosieve[minhash] @ git+https://github.com/kren-ai-lab/biosieve.git'
pip install 'biosieve[faiss] @ git+https://github.com/kren-ai-lab/biosieve.git'

The mmseqs2 reducer and homology_aware splitter require the MMseqs2 binary to be available in PATH.

Tip: you can install MMseqs2 with pixi:

pixi global install -c bioconda -c conda-forge mmseqs2

Quick Start

Redundancy reduction

Remove near-duplicate sequences using k-mer Jaccard similarity:

biosieve reduce \
  -i dataset.csv \
  -o dataset_nr.csv \
  --strategy kmer_jaccard \
  --mapping-output mapping.csv \
  --report-output report.json
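To make the idea behind the kmer_jaccard strategy concrete, here is a minimal pure-Python sketch of greedy reduction by k-mer Jaccard similarity. It is an illustration only, not BioSieve's implementation: the function names (kmers, jaccard, greedy_reduce), the input order as the choice of representative, and the `>=` comparison against the threshold are all assumptions.

```python
def kmers(seq: str, k: int = 5) -> set[str]:
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity: shared k-mers divided by all k-mers (0.0 if both empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def greedy_reduce(records: dict[str, str], threshold: float = 0.8, k: int = 5):
    """Keep a record only if it is below `threshold` against every kept record.

    Hypothetical helper. Returns (kept_ids, mapping), where mapping maps each
    removed id to (representative_id, score).
    """
    kept: list[tuple[str, set[str]]] = []
    mapping: dict[str, tuple[str, float]] = {}
    for seq_id, seq in records.items():
        profile = kmers(seq, k)
        for rep_id, rep_profile in kept:
            score = jaccard(profile, rep_profile)
            if score >= threshold:
                # Too similar to an existing representative: drop it.
                mapping[seq_id] = (rep_id, score)
                break
        else:
            # No representative was similar enough: keep this record.
            kept.append((seq_id, profile))
    return [rep_id for rep_id, _ in kept], mapping


records = {
    "s1": "MKTAYIAKQRQISFVKSHFSRQ",
    "s2": "MKTAYIAKQRQISFVKSHFSRQ",  # exact duplicate of s1
    "s3": "GSHMLEDPVAGWTRNCYQKLIV",  # shares no 5-mers with s1
}
kept_ids, mapping = greedy_reduce(records, threshold=0.8, k=5)
print(kept_ids, mapping)
```

The all-against-kept loop is quadratic in the worst case, which is presumably why an approximate variant such as minhash_jaccard exists for larger datasets.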

Pass parameters via a YAML file:

biosieve reduce \
  -i dataset.csv \
  -o dataset_nr.csv \
  --strategy kmer_jaccard \
  --params params.yaml

# params.yaml
kmer_jaccard:
  threshold: 0.8
  k: 5

Override a single parameter inline without a file:

biosieve reduce -i dataset.csv -o out.csv --strategy kmer_jaccard --set kmer_jaccard.threshold=0.9
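One way to picture what a dotted `--set` assignment does is as an update to the nested mapping loaded from params.yaml. The sketch below is an assumption about the general technique, not BioSieve's parser: parse_scalar and apply_override are hypothetical names, and the real tool may coerce values differently.

```python
def parse_scalar(text: str):
    """Interpret a command-line value as bool, int, or float, else keep it as str."""
    lowered = text.lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            continue
    return text


def apply_override(params: dict, assignment: str) -> dict:
    """Apply one 'section.key=value' assignment to a nested dict, in place."""
    dotted_key, sep, raw_value = assignment.partition("=")
    if not sep:
        raise ValueError(f"expected KEY=VALUE, got {assignment!r}")
    *parents, leaf = dotted_key.split(".")
    node = params
    for name in parents:
        # Create intermediate sections on demand.
        node = node.setdefault(name, {})
    node[leaf] = parse_scalar(raw_value)
    return params


# Same values as the params.yaml example, then the inline override.
params = {"kmer_jaccard": {"threshold": 0.8, "k": 5}}
apply_override(params, "kmer_jaccard.threshold=0.9")
print(params)
```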

Dataset splitting

Split with a leakage-aware strategy:

biosieve split \
  -i dataset_nr.csv \
  -o splits/ \
  --strategy homology_aware \
  --params params.yaml

# params.yaml
homology_aware:
  mode: precomputed
  clusters_path: clusters.csv
  member_col: id
  cluster_col: cluster_id
  test_size: 0.2
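The property this configuration is after is that every precomputed cluster lands entirely in one split. A minimal sketch of that idea follows, using only the standard library. The column names id and cluster_id come from the example above; the function split_by_cluster, the toy data, and the fill-until-target rule are assumptions, not BioSieve's algorithm.

```python
import csv
import io
import random


def split_by_cluster(rows, member_col="id", cluster_col="cluster_id",
                     test_size=0.2, seed=0):
    """Assign whole clusters to test until about `test_size` of members are covered."""
    clusters: dict[str, list[str]] = {}
    for row in rows:
        clusters.setdefault(row[cluster_col], []).append(row[member_col])

    cluster_ids = sorted(clusters)
    random.Random(seed).shuffle(cluster_ids)  # reproducible cluster order

    target = test_size * sum(len(m) for m in clusters.values())
    train, test = [], []
    for cid in cluster_ids:
        # A cluster is never divided: it goes wholly to test or wholly to train.
        bucket = test if len(test) < target else train
        bucket.extend(clusters[cid])
    return train, test


# Toy stand-in for clusters.csv: five clusters of two members each.
clusters_csv = """id,cluster_id
p1,c1
p2,c1
p3,c2
p4,c2
p5,c3
p6,c3
p7,c4
p8,c4
p9,c5
p10,c5
"""
rows = list(csv.DictReader(io.StringIO(clusters_csv)))
train, test = split_by_cluster(rows, test_size=0.2, seed=0)
print(len(train), len(test))
```

Because clusters are indivisible, the achieved test fraction can overshoot the requested one when clusters are large relative to the dataset.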

Strategies

Redundancy reduction

| Strategy | Description | Extra needed |
|---|---|---|
| exact | Remove exact sequence duplicates | |
| identity_greedy | Greedy reduction by sequence identity | |
| kmer_jaccard | Greedy reduction by k-mer Jaccard similarity | |
| minhash_jaccard | Approximate k-mer Jaccard via MinHash LSH (fast) | biosieve[minhash] |
| embedding_cosine | Cosine similarity on precomputed embeddings | biosieve[faiss] (optional) |
| descriptor_euclidean | Euclidean distance on numeric descriptor columns | |
| structural_distance | Graph-based reduction on precomputed structural edges | |
| mmseqs2 | Homology clustering via MMseqs2 | mmseqs binary |
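For readers unfamiliar with the approximation behind minhash_jaccard, the following sketch estimates k-mer Jaccard similarity from MinHash signatures using only hashlib. It is a teaching example under stated assumptions: BioSieve relies on datasketch for this strategy, the function names here are hypothetical, the LSH indexing step is omitted, and sequences shorter than k are not handled.

```python
import hashlib


def kmers(seq: str, k: int = 5) -> set[str]:
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def minhash_signature(items: set[str], num_perm: int = 128) -> list[int]:
    """One minimum hash value per salted hash function (items must be non-empty)."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash
        signature.append(
            min(
                int.from_bytes(
                    hashlib.blake2b(item.encode(), digest_size=8, salt=salt).digest(),
                    "little",
                )
                for item in items
            )
        )
    return signature


def estimate_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """The fraction of matching signature slots estimates the Jaccard similarity."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)


seq_a = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
seq_b = "MKTAYIAKQRQISFVKSHFSRQLEERLGAAAAA"  # same prefix, different tail
a, b = kmers(seq_a), kmers(seq_b)
exact = len(a & b) / len(a | b)
approx = estimate_jaccard(minhash_signature(a), minhash_signature(b))
print(f"exact={exact:.3f} approx={approx:.3f}")
```

Signatures have a fixed length regardless of sequence length, so comparing them is cheap; the price is an estimate whose error shrinks as num_perm grows.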

Splitting

| Strategy | Description |
|---|---|
| random | Random train/val/test split |
| stratified | Stratified by a categorical label column |
| stratified_numeric | Stratified by a numeric label column (binned) |
| group | No group appears in more than one split |
| time | Chronological split by a time column |
| cluster_aware | Group split using a precomputed cluster column |
| distance_aware | Test set selected as farthest points in embedding/descriptor space |
| homology_aware | Group split derived from MMseqs2 clusters or precomputed clusters |
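The distance_aware description leaves open what the test points are farthest from. As one possible reading, the sketch below selects the points farthest from the dataset centroid. That reference point, the function name farthest_from_centroid, and the toy vectors are all assumptions; BioSieve may define the criterion differently.

```python
import math


def farthest_from_centroid(vectors: dict[str, list[float]], test_size: float = 0.2):
    """Return (train_ids, test_ids); test holds the points farthest from the centroid."""
    dim = len(next(iter(vectors.values())))
    centroid = [
        sum(vec[d] for vec in vectors.values()) / len(vectors) for d in range(dim)
    ]
    distance = {key: math.dist(vec, centroid) for key, vec in vectors.items()}
    n_test = max(1, round(test_size * len(vectors)))
    ranked = sorted(distance, key=distance.get, reverse=True)  # farthest first
    return ranked[n_test:], ranked[:n_test]


# Four tightly grouped points and one outlier in a 2-D descriptor space.
vectors = {
    "a": [0.0, 0.0],
    "b": [0.1, 0.0],
    "c": [0.0, 0.1],
    "d": [0.1, 0.1],
    "e": [5.0, 5.0],
}
train_ids, test_ids = farthest_from_centroid(vectors, test_size=0.2)
print(train_ids, test_ids)
```

A split of this kind makes the test set deliberately out-of-distribution, so scores on it are expected to be lower than on a random split.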

Several strategies also have k-fold variants: random_kfold, stratified_kfold, group_kfold, stratified_numeric_kfold, and distance_aware_kfold.
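As a rough picture of what a group-respecting k-fold looks like, the sketch below places whole groups into folds, largest first, always into the currently smallest fold. It is a generic illustration and not BioSieve's group_kfold: the function body, the balancing heuristic, and the toy group labels are assumptions.

```python
def group_kfold(groups: dict[str, str], n_splits: int = 3):
    """Yield (train_ids, test_ids) pairs with every group confined to one fold."""
    members: dict[str, list[str]] = {}
    for sample_id, group in groups.items():
        members.setdefault(group, []).append(sample_id)

    folds: list[list[str]] = [[] for _ in range(n_splits)]
    # Largest groups first, each into the fold that is currently smallest.
    for group in sorted(members, key=lambda g: (-len(members[g]), g)):
        min(folds, key=len).extend(members[group])

    for i, test_ids in enumerate(folds):
        train_ids = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield train_ids, test_ids


# Hypothetical sample-to-group assignment.
groups = {"s1": "g1", "s2": "g1", "s3": "g1", "s4": "g2", "s5": "g2", "s6": "g3"}
splits = list(group_kfold(groups, n_splits=3))
for train_ids, test_ids in splits:
    print(train_ids, test_ids)
```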

Outputs

Every run produces consistent artefacts:

  • Reduced or split CSVs
  • Mapping CSV (removed_id, representative_id, cluster_id, score) for reduction runs
  • JSON report with strategy name, effective parameters, and reduction/split statistics
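The mapping CSV lends itself to quick sanity checks after a reduction run. The sketch below parses a small in-memory example with the four columns listed above and summarises it. The rows and the value formats (for instance integer cluster ids) are invented for illustration; only the column names come from this README.

```python
import csv
import io
from collections import Counter

# Invented stand-in for mapping.csv; real files come from --mapping-output.
mapping_csv = """removed_id,representative_id,cluster_id,score
s2,s1,0,1.0
s5,s1,0,0.91
s7,s4,1,0.85
"""

rows = list(csv.DictReader(io.StringIO(mapping_csv)))

# How many sequences each representative absorbed.
removed_per_rep = Counter(row["representative_id"] for row in rows)

# The removal with the lowest score; what the score means depends on the strategy.
lowest = min(rows, key=lambda row: float(row["score"]))

print(removed_per_rep)
print(lowest["removed_id"], lowest["score"])
```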

License

GPL-3.0-or-later. See LICENSE.

Acknowledgements

Built on top of scikit-learn, polars, NumPy, and optionally datasketch, FAISS, and MMseqs2.

Developed by KREN AI Lab at Universidad de Magallanes, Chile.
