Redundancy reduction and leakage-aware dataset partitioning for biological ML.

Maintainers

dialvarezs

BioSieve

BioSieve is a Python toolkit for preparing biological sequence datasets for machine learning.

It covers two main workflows:

- Redundancy reduction: remove near-duplicate sequences before training, using sequence, embedding, descriptor, or structural similarity
- Leakage-aware splitting: partition datasets into train/val/test sets (or k folds) with strategies that respect biological structure (homology clusters, sequence distance, groups, time)

Installation

BioSieve supports Python 3.11+.

pip install git+https://github.com/kren-ai-lab/biosieve.git

Install optional extras as needed:

- `minhash` for approximate Jaccard-based deduplication (`minhash_jaccard` strategy)
- `faiss` for GPU-accelerated embedding similarity search (`embedding_cosine` strategy)

pip install 'biosieve[minhash] @ git+https://github.com/kren-ai-lab/biosieve.git'
pip install 'biosieve[faiss] @ git+https://github.com/kren-ai-lab/biosieve.git'

The mmseqs2 reducer and homology_aware splitter require the MMseqs2 binary (`mmseqs`) to be available on PATH.

Tip: MMseqs2 can be installed with pixi:

pixi global install -c bioconda -c conda-forge mmseqs2

Quick Start

Redundancy reduction

Remove near-duplicate sequences using k-mer Jaccard similarity:

biosieve reduce \
  -i dataset.csv \
  -o dataset_nr.csv \
  --strategy kmer_jaccard \
  --mapping-output mapping.csv \
  --report-output report.json
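
To make the idea behind kmer_jaccard concrete, here is a minimal pure-Python sketch of greedy reduction by k-mer Jaccard similarity. It is not BioSieve's implementation: the function names, the first-seen ordering, and the `>=` comparison against the threshold are assumptions made for illustration, and the sequences are toy data.

```python
def kmers(seq: str, k: int = 5) -> set[str]:
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity |A & B| / |A | B|; two empty sets count as 0.0 here."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def greedy_reduce(
    seqs: dict[str, str], k: int = 5, threshold: float = 0.8
) -> tuple[list[str], dict[str, str]]:
    """Keep a sequence unless it is at least `threshold`-similar to a kept one.

    Returns (kept_ids, mapping), where mapping is removed_id -> representative_id.
    """
    kept: list[tuple[str, set[str]]] = []
    mapping: dict[str, str] = {}
    for seq_id, seq in seqs.items():
        ks = kmers(seq, k)
        for rep_id, rep_ks in kept:
            if jaccard(ks, rep_ks) >= threshold:
                mapping[seq_id] = rep_id  # redundant with an earlier representative
                break
        else:
            kept.append((seq_id, ks))  # no similar representative found: keep it
    return [rep_id for rep_id, _ in kept], mapping


# Toy input (hypothetical sequences, not from any real dataset).
toy = {
    "s1": "MKTAYIAKQRQISFVKSHFSRQ",
    "s2": "MKTAYIAKQRQISFVKSHFSRQ",  # exact duplicate of s1
    "s3": "GSHMLEDPVAGWTRRLLQNNCE",  # shares no 5-mer with s1
}
kept_ids, mapping = greedy_reduce(toy, k=5, threshold=0.8)
print(kept_ids, mapping)
```

Each candidate is compared against every kept representative, so this sketch is quadratic in the worst case.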

Pass parameters via a YAML file:

biosieve reduce \
  -i dataset.csv \
  -o dataset_nr.csv \
  --strategy kmer_jaccard \
  --params params.yaml

# params.yaml
kmer_jaccard:
  threshold: 0.8
  k: 5

Override a single parameter inline without a file:

biosieve reduce -i dataset.csv -o out.csv --strategy kmer_jaccard --set kmer_jaccard.threshold=0.9
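
As a rough illustration of how a dotted override such as kmer_jaccard.threshold=0.9 could be layered over file-based parameters, the sketch below merges one KEY=VALUE assignment into a nested dictionary. The helper names and the precedence (inline value wins over the file value) are assumptions, not BioSieve's documented behaviour.

```python
import copy


def parse_scalar(text: str):
    """Interpret an override value as bool, int, or float, else keep it as str."""
    lowered = text.lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            continue
    return text


def apply_override(params: dict, assignment: str) -> dict:
    """Return a copy of `params` with one 'a.b.c=value' assignment applied."""
    dotted_key, sep, raw_value = assignment.partition("=")
    if not sep:
        raise ValueError(f"expected KEY=VALUE, got {assignment!r}")
    result = copy.deepcopy(params)  # leave the caller's dict untouched
    node = result
    *parents, leaf = dotted_key.split(".")
    for part in parents:
        node = node.setdefault(part, {})  # create intermediate levels if missing
    node[leaf] = parse_scalar(raw_value)
    return result


# Mirrors the params.yaml shown above, written as a plain dict.
file_params = {"kmer_jaccard": {"threshold": 0.8, "k": 5}}
effective = apply_override(file_params, "kmer_jaccard.threshold=0.9")
print(effective)
```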

Dataset splitting

Split with a leakage-aware strategy:

biosieve split \
  -i dataset_nr.csv \
  -o splits/ \
  --strategy homology_aware \
  --params params.yaml

# params.yaml
homology_aware:
  mode: precomputed
  clusters_path: clusters.csv
  member_col: id
  cluster_col: cluster_id
  test_size: 0.2
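
The core idea of a cluster-based split is that whole clusters move together, so homologous sequences never end up on both sides. The following is a minimal sketch of that idea on an in-memory mapping; the function name, the shuffling, and the "fill test until the target is reached" rule are assumptions for illustration and may differ from what homology_aware actually does.

```python
import random


def cluster_split(
    member_to_cluster: dict[str, str], test_size: float = 0.2, seed: int = 0
) -> tuple[list[str], list[str]]:
    """Assign whole clusters to test until about `test_size` of members are held out."""
    clusters: dict[str, list[str]] = {}
    for member, cluster in member_to_cluster.items():
        clusters.setdefault(cluster, []).append(member)

    cluster_ids = sorted(clusters)
    random.Random(seed).shuffle(cluster_ids)  # reproducible cluster order

    target = test_size * len(member_to_cluster)
    train: list[str] = []
    test: list[str] = []
    for cid in cluster_ids:
        # A cluster is never divided, so the test share can overshoot the target.
        if len(test) < target:
            test.extend(clusters[cid])
        else:
            train.extend(clusters[cid])
    return train, test


# Toy mapping in the spirit of clusters.csv (id -> cluster_id); values are made up.
toy_clusters = {
    "p1": "c1", "p2": "c1", "p3": "c1",
    "p4": "c2", "p5": "c2",
    "p6": "c3", "p7": "c3", "p8": "c3",
    "p9": "c4", "p10": "c4",
}
train_ids, test_ids = cluster_split(toy_clusters, test_size=0.2, seed=0)
train_clusters = {toy_clusters[m] for m in train_ids}
test_clusters = {toy_clusters[m] for m in test_ids}
print(sorted(test_clusters), len(train_ids), len(test_ids))
```

Because clusters are indivisible, the realised test fraction only approximates test_size; with a few large clusters the deviation can be substantial.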

Strategies

Redundancy reduction

| Strategy | Description | Extra needed |
| --- | --- | --- |
| `exact` | Remove exact sequence duplicates | none |
| `identity_greedy` | Greedy reduction by sequence identity | none |
| `kmer_jaccard` | Greedy reduction by k-mer Jaccard similarity | none |
| `minhash_jaccard` | Approximate k-mer Jaccard via MinHash LSH (fast) | `biosieve[minhash]` |
| `embedding_cosine` | Cosine similarity on precomputed embeddings | `biosieve[faiss]` (optional) |
| `descriptor_euclidean` | Euclidean distance on numeric descriptor columns | none |
| `structural_distance` | Graph-based reduction on precomputed structural edges | none |
| `mmseqs2` | Homology clustering via MMseqs2 | `mmseqs` binary |
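
For readers unfamiliar with MinHash, the sketch below shows how a fixed-length signature can approximate the Jaccard similarity of two k-mer sets. It uses only the standard library rather than datasketch, omits the LSH bucketing step entirely, and its function names and the choice of salted BLAKE2b hashes are assumptions made for illustration.

```python
import hashlib


def kmers(seq: str, k: int = 5) -> set[str]:
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def minhash_signature(items: set[str], num_perm: int = 128) -> list[int]:
    """One minimum per seeded hash function; `items` must be non-empty."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash
        signature.append(
            min(
                int.from_bytes(
                    hashlib.blake2b(item.encode(), digest_size=8, salt=salt).digest(),
                    "little",
                )
                for item in items
            )
        )
    return signature


def estimate_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of signature positions that agree, an estimate of Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


# Toy sequences (hypothetical).
sig_1 = minhash_signature(kmers("MKTAYIAKQRQISFVKSHFSRQ"))
sig_2 = minhash_signature(kmers("MKTAYIAKQRQISFVKSHFSRQ"))
sig_3 = minhash_signature(kmers("GSHMLEDPVAGWTRRLLQNNCE"))
print(estimate_jaccard(sig_1, sig_2), estimate_jaccard(sig_1, sig_3))
```

The estimate becomes tighter as num_perm grows, at the cost of more hashing per sequence.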

Splitting

| Strategy | Description |
| --- | --- |
| `random` | Random train/val/test split |
| `stratified` | Stratified by a categorical label column |
| `stratified_numeric` | Stratified by a numeric label column (binned) |
| `group` | No group appears in more than one split |
| `time` | Chronological split by a time column |
| `cluster_aware` | Group split using a precomputed cluster column |
| `distance_aware` | Test set selected as farthest points in embedding/descriptor space |
| `homology_aware` | Group split derived from MMseqs2 clusters or precomputed clusters |

Several strategies also have k-fold variants: random_kfold, stratified_kfold, group_kfold, stratified_numeric_kfold, and distance_aware_kfold.
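
To illustrate what a group-respecting k-fold assignment involves, here is a minimal sketch that places whole groups into folds while keeping fold sizes balanced. The greedy "largest group into the smallest fold" heuristic and the function name are assumptions for illustration; group_kfold in BioSieve may work differently.

```python
from collections import defaultdict


def group_kfold(groups: dict[str, str], n_splits: int = 3) -> list[list[str]]:
    """Partition items into `n_splits` folds so that no group spans two folds."""
    members_by_group: dict[str, list[str]] = defaultdict(list)
    for item, group in groups.items():
        members_by_group[group].append(item)
    if len(members_by_group) < n_splits:
        raise ValueError("need at least as many groups as folds")

    folds: list[list[str]] = [[] for _ in range(n_splits)]
    # Largest groups first, each into the currently smallest fold, to balance sizes.
    ordered = sorted(members_by_group, key=lambda g: (-len(members_by_group[g]), g))
    for group in ordered:
        smallest = min(range(n_splits), key=lambda i: len(folds[i]))
        folds[smallest].extend(members_by_group[group])
    return folds


# Toy item -> group mapping (made up for the example).
toy_groups = {
    "x1": "a", "x2": "a", "x3": "a",
    "x4": "b", "x5": "b",
    "x6": "c", "x7": "c",
    "x8": "d",
    "x9": "e",
}
folds = group_kfold(toy_groups, n_splits=3)
for i, fold in enumerate(folds):
    print(i, fold)
```

Each fold then serves once as the held-out set while the remaining folds form the training set.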

Outputs

Every run produces consistent artefacts:

- Reduced or split CSVs
- Mapping CSV (removed_id, representative_id, cluster_id, score) for reduction runs
- JSON report with strategy name, effective parameters, and reduction/split statistics
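
A mapping CSV with the columns listed above can be inspected with the standard library alone. The sketch below counts how many sequences each representative absorbed; the rows are made-up sample data, and treating score as a similarity value is an assumption.

```python
import csv
import io
from collections import Counter

# Made-up rows using the documented column names; a real run would instead
# open the file passed to --mapping-output.
sample = io.StringIO(
    "removed_id,representative_id,cluster_id,score\n"
    "s2,s1,0,1.00\n"
    "s5,s1,0,0.91\n"
    "s7,s4,1,0.85\n"
)
rows = list(csv.DictReader(sample))

# How many removed sequences point at each representative.
removed_per_rep = Counter(row["representative_id"] for row in rows)
# Smallest score among removed sequences (assumed here to be a similarity).
lowest_score = min(float(row["score"]) for row in rows)

print(removed_per_rep.most_common())
print(lowest_score)
```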

Learn More

See examples/README.md for runnable scripts and config files.

License

GPL-3.0-or-later. See LICENSE.

Acknowledgements

Built on top of scikit-learn, polars, NumPy, and optionally datasketch, FAISS, and MMseqs2.

Developed by KREN AI Lab at Universidad de Magallanes, Chile.

Version 0.1.0, released Apr 20, 2026.
