
diverse_seq: a tool for sampling diverse biological sequences


diverse-seq provides alignment-free algorithms to facilitate phylogenetic workflows

diverse-seq implements computationally efficient alignment-free algorithms that enable rapid prototyping of phylogenetic workflows. It can accelerate parameter selection searches for sequence alignment and phylogeny estimation by identifying a subset of sequences that is representative of the diversity in a collection. We show that selecting representative sequences with an entropy measure of k-mer frequencies corresponds well to sampling via conventional genetic distances. The computational cost is linear with respect to the number of sequences, and the algorithms can be run in parallel. Applied to a collection of 10.5k whole microbial genomes on a laptop, diverse-seq took ~8 minutes to prepare the data and 4 minutes to select 100 representatives. diverse-seq can further boost the performance of phylogenetic estimation by providing a seed phylogeny that can then be refined by a more sophisticated algorithm. For ~1k whole microbial genomes on a laptop, it takes ~1.8 minutes to estimate a bifurcating tree from mash distances.
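To make the "entropy measure of k-mer frequencies" more concrete, here is a minimal pure-Python sketch of the Jensen-Shannon divergence (JSD) between the k-mer frequency distributions of a set of sequences. It is an illustration under simplifying assumptions (equal weights, exact counting of overlapping k-mers, no handling of ambiguous characters), not the diverse-seq implementation. The function names `kmer_freqs`, `entropy` and `jsd` are hypothetical and are not part of the package.

```python
import math
from collections import Counter


def kmer_freqs(seq: str, k: int) -> dict[str, float]:
    """Relative frequencies of the overlapping k-mers in seq (hypothetical helper)."""
    counts = Counter(seq[i : i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def entropy(freqs: dict[str, float]) -> float:
    """Shannon entropy, in bits, of a frequency distribution."""
    return -sum(p * math.log2(p) for p in freqs.values() if p > 0)


def jsd(seqs: list[str], k: int = 2) -> float:
    """Equal-weight JSD: entropy of the mean distribution minus the mean entropy."""
    dists = [kmer_freqs(s, k) for s in seqs]
    n = len(dists)
    mean: dict[str, float] = {}
    for dist in dists:
        for kmer, p in dist.items():
            mean[kmer] = mean.get(kmer, 0.0) + p / n
    return entropy(mean) - sum(entropy(d) for d in dists) / n
```

With this definition, a set of identical sequences has a JSD of zero, and the value grows as the k-mer compositions of the sequences diverge. A sequence that raises the JSD of a set when added to it therefore contributes diversity to that set.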

You can read more about the methods implemented in diverse-seq in the preprint here.

The user documentation is here.

Installation

For the full Jupyter experience, we recommend installing diverse-seq from PyPI with its optional extras:

pip install "diverse-seq[extra]"

For command line only usage, install as follows

pip install diverse-seq

NOTE If you experience any errors during installation, we recommend using uv pip. This command provides much better error messages than the standard pip command. If you cannot resolve the installation problem, please open an issue on the GitHub repository.

Using uv

uv also provides a simple way to install dvs as a command-line-only tool:

uv tool install diverse-seq

Usage in this case is then

uvx --from diverse-seq dvs

Dependencies

For a full listing of dependencies, see the pyproject.toml file.

The command line interface

dvs is the command line interface for diverse-seq.

The `dvs` subcommands
Usage: dvs [OPTIONS] COMMAND [ARGS]...

  dvs -- alignment free detection of the most diverse sequences using JSD

Options:
  --version  Show the version and exit.
  --help     Show this message and exit.

Commands:
  demo-data  Export a demo sequence file
  prep       Writes processed sequences to a <HDF5 file>.dvseqs.
  max        Identify the seqs that maximise average delta JSD
  nmost      Identify n seqs that maximise average delta JSD
  ctree      Quickly compute a cluster tree based on kmers for a collection...
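As background for `ctree`, which builds a tree from k-mer based distances such as the mash distance, the sketch below computes a mash-style distance between two sequences. It assumes the published Mash formula D = -(1/k) ln(2j / (1 + j)), where j is the Jaccard index of the two k-mer sets, and uses exact k-mer sets instead of MinHash sketches. The names `kmer_set` and `mash_distance` are hypothetical, and this is not how `dvs ctree` is implemented.

```python
import math


def kmer_set(seq: str, k: int) -> set[str]:
    """Set of distinct overlapping k-mers in seq (hypothetical helper)."""
    return {seq[i : i + k] for i in range(len(seq) - k + 1)}


def mash_distance(seq_a: str, seq_b: str, k: int = 4) -> float:
    """Mash-style distance from the exact Jaccard index of two k-mer sets."""
    a, b = kmer_set(seq_a, k), kmer_set(seq_b, k)
    union = a | b
    if not union:
        raise ValueError("both sequences are shorter than k")
    j = len(a & b) / len(union)
    if j == 0:
        # Assumed convention: no shared k-mers gives the maximum distance of 1.
        return 1.0
    # max() guards against a negative zero when the two sets are identical.
    return max(0.0, -math.log(2 * j / (1 + j)) / k)
```

A matrix of such pairwise distances can be passed to any distance-based tree-building method to obtain a quick seed phylogeny.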

The Python API

We make comparable capabilities available as cogent3 apps. The main difference is the app instances directly operate on, and return, cogent3 sequence collections. See the docs for demonstrations of how to use the apps.

Project Information

diverse-seq is released under the BSD-3 license. If you want to contribute to the diverse-seq project (and we hope you do! 😇), the code of conduct and other useful developer information are available on the wiki.
