diverse_seq: a tool for sampling diverse biological sequences

diverse-seq provides alignment-free algorithms to facilitate phylogenetic workflows

diverse-seq implements computationally efficient alignment-free algorithms that enable rapid prototyping of phylogenetic workflows. It can accelerate parameter selection searches for sequence alignment and phylogeny estimation by identifying a subset of sequences that is representative of the diversity in a collection. We show that selecting representative sequences with an entropy measure of k-mer frequencies corresponds well to sampling via conventional genetic distances. The computational cost scales linearly with the number of sequences, and the computation can be run in parallel. Applied to a collection of 10.5k whole microbial genomes on a laptop, diverse-seq took ~8 minutes to prepare the data and 4 minutes to select 100 representatives. diverse-seq can further boost the performance of phylogenetic estimation by providing a seed phylogeny that can then be refined by a more sophisticated algorithm. For ~1k whole microbial genomes on a laptop, it takes ~1.8 minutes to estimate a bifurcating tree from mash distances.
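To make the "entropy measure of k-mer frequencies" more concrete, here is a minimal pure-Python sketch of the Jensen-Shannon divergence (JSD) between k-mer frequency distributions. It is an illustration only, not the diverse-seq implementation: the function names (`kmer_freqs`, `entropy`, `jsd`), the toy sequences, the choice of k=2 and the equal weighting of sequences are all assumptions made for this example.

```python
from collections import Counter
from math import log2


def kmer_freqs(seq: str, k: int) -> dict[str, float]:
    """Return relative frequencies of the overlapping k-mers in seq."""
    if len(seq) < k:
        raise ValueError("sequence is shorter than k")
    counts = Counter(seq[i : i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def entropy(freqs: dict[str, float]) -> float:
    """Shannon entropy (in bits) of a frequency distribution."""
    return -sum(p * log2(p) for p in freqs.values() if p > 0)


def jsd(distributions: list[dict[str, float]]) -> float:
    """JSD of equally weighted distributions: the entropy of the mean
    distribution minus the mean of the individual entropies."""
    n = len(distributions)
    mean: dict[str, float] = {}
    for dist in distributions:
        for kmer, p in dist.items():
            mean[kmer] = mean.get(kmer, 0.0) + p / n
    return entropy(mean) - sum(entropy(d) for d in distributions) / n


# Hypothetical toy sequences: s1 and s2 are identical, s3 shares no 2-mers with them.
seqs = {"s1": "ACGTACGTACGT", "s2": "ACGTACGTACGT", "s3": "TTTTGGGGCCCC"}
freqs = {name: kmer_freqs(s, k=2) for name, s in seqs.items()}

identical = jsd([freqs["s1"], freqs["s2"]])
different = jsd([freqs["s1"], freqs["s3"]])
print(f"JSD(s1, s2) = {identical:.3f}")
print(f"JSD(s1, s3) = {different:.3f}")
```

Identical sequences contribute no divergence, while sequences with disjoint k-mer content give a large value. A set of sequences with high JSD therefore covers more of the k-mer diversity in a collection.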

You can read more about the methods implemented in diverse-seq in the preprint here.

The user documentation is here.

Installation

We recommend installing diverse-seq from PyPI as follows

pip install "diverse-seq[extra]"

for the full Jupyter experience.

For command line only usage, install as follows

pip install diverse-seq

NOTE If you experience any errors during installation, we recommend using uv pip. This command provides much better error messages than the standard pip command. If you cannot resolve the installation problem, please open an issue on the GitHub repository.

Using uv

uv also provides a simpler way to install dvs as a command-line-only tool:

uv tool install diverse-seq

Usage in this case is then

uvx --from diverse-seq dvs

Dependencies

For a full listing of dependencies, see the pyproject.toml file.

The command line interface

dvs is the command line interface for diverse-seq.

The `dvs` subcommands
Usage: dvs [OPTIONS] COMMAND [ARGS]...

  dvs -- alignment free detection of the most diverse sequences using JSD

Options:
  --version  Show the version and exit.
  --help     Show this message and exit.

Commands:
  demo-data  Export a demo sequence file
  prep       Writes processed sequences to a <HDF5 file>.dvseqs.
  max        Identify the seqs that maximise average delta JSD
  nmost      Identify n seqs that maximise average delta JSD
  ctree      Quickly compute a cluster tree based on kmers for a collection...

The Python API

We make comparable capabilities available as cogent3 apps. The main difference is that the app instances operate directly on, and return, cogent3 sequence collections. See the docs for demonstrations of how to use the apps.
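To see which apps are available in your environment, cogent3's `available_apps()` function lists the installed apps. The sketch below filters that listing on the substring "dvs". The assumptions here are that the diverse-seq app names contain "dvs" and that your cogent3 version accepts the `name_filter` argument. The helper name `list_dvs_apps` is hypothetical.

```python
def list_dvs_apps():
    """Return a cogent3 table of apps whose name includes 'dvs'.

    Returns None if cogent3 is not installed.
    """
    try:
        from cogent3 import available_apps
    except ImportError:
        return None
    # Assumption: diverse-seq registers its apps under names containing "dvs".
    return available_apps(name_filter="dvs")


apps = list_dvs_apps()
print(apps if apps is not None else "cogent3 is not installed")
```

The app names and their arguments are documented in the diverse-seq docs, which remain the authoritative reference.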

Project Information

diverse-seq is released under the BSD-3 license. If you want to contribute to the diverse-seq project (and we hope you do!), the code of conduct and other useful developer information are available on the wiki.

Download files

Source Distribution

diverse_seq-2025.12.17.tar.gz (180.5 kB)

Uploaded Source

Built Distribution

diverse_seq-2025.12.17-py3-none-any.whl (66.4 kB)

Uploaded Python 3

File details

Details for the file diverse_seq-2025.12.17.tar.gz.

File metadata

  • Download URL: diverse_seq-2025.12.17.tar.gz
  • Size: 180.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for diverse_seq-2025.12.17.tar.gz
Algorithm Hash digest
SHA256 a53b381c619881276ecdd5f8a1944ad59a810125cc9f10ae3aeebd5035441117
MD5 e97a4f2dc2028e71b54e959421005011
BLAKE2b-256 136d3b9ac15d87f0c8ae7a231c5d6c7a4269967eeedb702bba4e1f4a1d167077

See more details on using hashes here.

Provenance

The following attestation bundles were made for diverse_seq-2025.12.17.tar.gz:

Publisher: release.yml on HuttleyLab/DiverseSeq

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file diverse_seq-2025.12.17-py3-none-any.whl.

File hashes

Hashes for diverse_seq-2025.12.17-py3-none-any.whl
Algorithm Hash digest
SHA256 062c3b7141f0d7cb3d7480df4a0d7e48f4c506914f29119a44df0f9e07d42862
MD5 d95b8efaa00c48d523e815ccaca0c4cb
BLAKE2b-256 5261612a5d27de59165f7f8d1d91822c994de70982d826e80f5fdaead51a3927

Provenance

The following attestation bundles were made for diverse_seq-2025.12.17-py3-none-any.whl:

Publisher: release.yml on HuttleyLab/DiverseSeq
