Redundancy reduction and leakage-aware dataset partitioning for biological ML.
BioSieve
BioSieve is a Python toolkit for preparing biological sequence datasets for machine learning.
It covers two main workflows:
- Redundancy reduction — remove near-duplicate sequences before training using sequence, embedding, descriptor, or structural similarity
- Leakage-aware splitting — partition datasets into train/val/test (or k-folds) with strategies that respect biological structure (homology clusters, sequence distance, groups, time)
Installation
BioSieve supports Python 3.11+.
pip install git+https://github.com/kren-ai-lab/biosieve.git
Install optional extras as needed:
- minhash for approximate Jaccard-based deduplication (minhash_jaccard strategy)
- faiss for GPU-accelerated embedding similarity search (embedding_cosine strategy)
pip install 'biosieve[minhash] @ git+https://github.com/kren-ai-lab/biosieve.git'
pip install 'biosieve[faiss] @ git+https://github.com/kren-ai-lab/biosieve.git'
The mmseqs2 reducer and homology_aware splitter require the MMseqs2 binary
to be available in PATH.
[!TIP] You can install MMseqs2 with pixi:
pixi global install -c bioconda -c conda-forge mmseqs2
Quick Start
Redundancy reduction
Remove near-duplicate sequences using k-mer Jaccard similarity:
biosieve reduce \
-i dataset.csv \
-o dataset_nr.csv \
--strategy kmer_jaccard \
--mapping-output mapping.csv \
--report-output report.json
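As a rough illustration of what a greedy k-mer Jaccard pass does, here is a minimal Python sketch. It is not BioSieve's implementation: the function names, the `>=` comparison against the threshold, and the rule that the first sequence seen becomes the representative are all assumptions made for this example. The defaults mirror the `k: 5` and `threshold: 0.8` values from this README's params.yaml example.

```python
def kmers(seq: str, k: int) -> set[str]:
    """Return the set of overlapping k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity |a & b| / |a | b|; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def greedy_reduce(records: dict[str, str], k: int = 5, threshold: float = 0.8):
    """Keep a sequence only if it scores below `threshold` against every kept one.

    Returns (kept_ids, mapping); mapping rows are
    (removed_id, representative_id, score). All names here are hypothetical.
    """
    kept: list[tuple[str, set[str]]] = []
    mapping: list[tuple[str, str, float]] = []
    for seq_id, seq in records.items():  # input order decides who represents whom
        profile = kmers(seq, k)
        best_id, best_score = None, 0.0
        for rep_id, rep_profile in kept:
            score = jaccard(profile, rep_profile)
            if score > best_score:
                best_id, best_score = rep_id, score
        if best_id is not None and best_score >= threshold:
            mapping.append((seq_id, best_id, best_score))
        else:
            kept.append((seq_id, profile))
    return [seq_id for seq_id, _ in kept], mapping


# Made-up toy sequences, for illustration only.
records = {
    "a": "MKTAYIAKQRQISFVKSHFSRQ",
    "b": "MKTAYIAKQRQISFVKSHFSRQ",  # exact duplicate of "a"
    "c": "GSHMLEDPVAGNQWERTYIPAS",  # shares no 5-mer with "a"
}
kept_ids, mapping = greedy_reduce(records, k=5, threshold=0.8)
```

The sketch compares every sequence against every kept representative, so it is quadratic in the worst case; that cost is the usual motivation for approximate approaches such as MinHash LSH.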
Pass parameters via a YAML file:
biosieve reduce \
-i dataset.csv \
-o dataset_nr.csv \
--strategy kmer_jaccard \
--params params.yaml
# params.yaml
kmer_jaccard:
  threshold: 0.8
  k: 5
Override a single parameter inline without a file:
biosieve reduce -i dataset.csv -o out.csv --strategy kmer_jaccard --set kmer_jaccard.threshold=0.9
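The dotted key passed to `--set` addresses the same nested structure as the YAML file. The sketch below shows one way such an assignment could be folded into a parameter dictionary. It is only a conceptual illustration: `apply_override` is a hypothetical helper, the value-parsing rule is a simplification, and nothing here is taken from BioSieve's code. How `--set` interacts with `--params` is not stated in this README.

```python
import copy


def apply_override(params: dict, assignment: str) -> dict:
    """Return a copy of params with one `dotted.key=value` assignment applied."""
    dotted_key, _, raw_value = assignment.partition("=")
    *parents, leaf = dotted_key.split(".")
    updated = copy.deepcopy(params)  # leave the caller's dictionary untouched
    node = updated
    for name in parents:
        node = node.setdefault(name, {})
    # Try a numeric interpretation first and fall back to the raw string.
    try:
        value = float(raw_value) if "." in raw_value else int(raw_value)
    except ValueError:
        value = raw_value
    node[leaf] = value
    return updated


# Equivalent of the params.yaml example, as a Python dictionary.
file_params = {"kmer_jaccard": {"threshold": 0.8, "k": 5}}
effective = apply_override(file_params, "kmer_jaccard.threshold=0.9")
```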
Dataset splitting
Split with a leakage-aware strategy:
biosieve split \
-i dataset_nr.csv \
-o splits/ \
--strategy homology_aware \
--params params.yaml
# params.yaml
homology_aware:
  mode: precomputed
  clusters_path: clusters.csv
  member_col: id
  cluster_col: cluster_id
  test_size: 0.2
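The idea behind a split driven by precomputed clusters can be sketched in a few lines of Python: whole clusters, never individual sequences, are assigned to the test set. This is an assumption-laden illustration rather than BioSieve's algorithm; `cluster_holdout`, the shuffling, and the fill-until-target rule are hypothetical choices.

```python
import random
from collections import defaultdict


def cluster_holdout(cluster_of: dict[str, str], test_size: float = 0.2, seed: int = 0):
    """Split member IDs into (train, test) without splitting any cluster."""
    members = defaultdict(list)
    for member_id, cluster_id in cluster_of.items():
        members[cluster_id].append(member_id)

    cluster_ids = sorted(members)
    random.Random(seed).shuffle(cluster_ids)  # seeded, so the split is reproducible

    target = test_size * len(cluster_of)
    train, test = [], []
    for cluster_id in cluster_ids:
        # Send whole clusters to the test set until it reaches the target size.
        bucket = test if len(test) < target else train
        bucket.extend(members[cluster_id])
    return train, test


# Made-up member-to-cluster assignments, mirroring the id / cluster_id columns.
cluster_of = {
    "p1": "c1", "p2": "c1", "p3": "c2", "p4": "c3", "p5": "c3",
    "p6": "c4", "p7": "c5", "p8": "c5", "p9": "c6", "p10": "c7",
}
train, test = cluster_holdout(cluster_of, test_size=0.2)
```

Because clusters are indivisible, the test set can overshoot the requested fraction; with a few large clusters the achieved ratio may differ noticeably from `test_size`.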
Strategies
Redundancy reduction
| Strategy | Description | Extra needed |
|---|---|---|
| exact | Remove exact sequence duplicates | none |
| identity_greedy | Greedy reduction by sequence identity | none |
| kmer_jaccard | Greedy reduction by k-mer Jaccard similarity | none |
| minhash_jaccard | Approximate k-mer Jaccard via MinHash LSH (fast) | biosieve[minhash] |
| embedding_cosine | Cosine similarity on precomputed embeddings | biosieve[faiss] (optional) |
| descriptor_euclidean | Euclidean distance on numeric descriptor columns | none |
| structural_distance | Graph-based reduction on precomputed structural edges | none |
| mmseqs2 | Homology clustering via MMseqs2 | mmseqs binary |
Splitting
| Strategy | Description |
|---|---|
| random | Random train/val/test split |
| stratified | Stratified by a categorical label column |
| stratified_numeric | Stratified by a numeric label column (binned) |
| group | No group appears in more than one split |
| time | Chronological split by a time column |
| cluster_aware | Group split using a precomputed cluster column |
| distance_aware | Test set selected as farthest points in embedding/descriptor space |
| homology_aware | Group split derived from MMseqs2 clusters or precomputed clusters |
K-fold variants are also available: random_kfold, stratified_kfold, group_kfold,
stratified_numeric_kfold, and distance_aware_kfold.
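For intuition about group-based k-fold, the following Python sketch assigns whole groups to folds so that no group is shared between a fold's train and test sides. The greedy "largest group into the currently smallest fold" rule and the name `group_kfold` are assumptions for illustration; the README does not describe how BioSieve balances folds.

```python
from collections import defaultdict


def group_kfold(group_of: dict[str, str], n_splits: int = 3) -> list[list[str]]:
    """Assign whole groups to folds, largest group first, into the smallest fold."""
    members = defaultdict(list)
    for item_id, group_id in group_of.items():
        members[group_id].append(item_id)

    folds: list[list[str]] = [[] for _ in range(n_splits)]
    # Sort by size (descending), then by group ID, so the result is deterministic.
    for group_id in sorted(members, key=lambda g: (-len(members[g]), g)):
        smallest = min(folds, key=len)
        smallest.extend(members[group_id])
    return folds


# Made-up item-to-group assignments.
group_of = {
    "s1": "g1", "s2": "g1", "s3": "g1",
    "s4": "g2", "s5": "g2",
    "s6": "g3", "s7": "g4",
}
folds = group_kfold(group_of, n_splits=3)

# Each fold serves once as the held-out set; the rest form the training set.
for i, held_out in enumerate(folds):
    train = [x for j, fold in enumerate(folds) if j != i for x in fold]
    print(f"fold {i}: train={len(train)} test={len(held_out)}")
```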
Outputs
Every run produces consistent artefacts:
- Reduced or split CSVs
- Mapping CSV (removed_id, representative_id, cluster_id, score) for reduction runs
- JSON report with strategy name, effective parameters, and reduction/split statistics
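The mapping CSV can be inspected with the standard library alone. The sketch below counts how many removed sequences each representative absorbed. It assumes the file has a header row with exactly the column names listed above; the sample rows are invented for the example and `summarize_mapping` is a hypothetical helper.

```python
import csv
import io
from collections import Counter


def summarize_mapping(handle) -> Counter:
    """Count how many removed sequences each representative absorbed."""
    reader = csv.DictReader(handle)
    return Counter(row["representative_id"] for row in reader)


# Inline sample data with the documented columns; for a real run, pass
# open("mapping.csv", newline="") instead of this StringIO object.
sample = io.StringIO(
    "removed_id,representative_id,cluster_id,score\n"
    "seq2,seq1,0,0.93\n"
    "seq5,seq1,0,0.88\n"
    "seq4,seq3,1,0.91\n"
)
counts = summarize_mapping(sample)
```

Calling `counts.most_common()` then lists the representatives that absorbed the most removed sequences first.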
Learn More
- examples/README.md for runnable scripts and config files
License
GPL-3.0-or-later. See LICENSE.
Acknowledgements
Built on top of scikit-learn, polars, NumPy, and optionally datasketch, FAISS, and MMseqs2.
Developed by KREN AI Lab at Universidad de Magallanes, Chile.