diverse_seq: a tool for sampling diverse biological sequences

`diverse_seq` provides alignment-free algorithms to facilitate phylogenetic workflows

diverse-seq implements computationally efficient, alignment-free algorithms that enable rapid prototyping of phylogenetic workflows. It can accelerate parameter selection searches for sequence alignment and phylogeny estimation by identifying a subset of sequences that is representative of the diversity in a collection. We show that selecting representative sequences with an entropy measure of k-mer frequencies corresponds well to sampling via conventional genetic distances. The computational cost is linear with respect to the number of sequences, and the analysis can be run in parallel. Applied to a collection of 10.5k whole microbial genomes on a laptop, it took ~8 minutes to prepare the data and 4 minutes to select 100 representatives. diverse-seq can further boost the performance of phylogenetic estimation by providing a seed phylogeny that a more sophisticated algorithm can then refine. For ~1k whole microbial genomes on a laptop, it takes ~1.8 minutes to estimate a bifurcating tree from mash distances.
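To make the "entropy measure of k-mer frequencies" concrete, here is a minimal sketch in plain Python. It is not the package's implementation; the function names `kmer_freqs` and `shannon_entropy` are illustrative assumptions. It counts overlapping k-mers in a sequence and computes the Shannon entropy of the resulting frequency distribution.

```python
from collections import Counter
from math import log2


def kmer_freqs(seq: str, k: int) -> dict[str, float]:
    """Relative frequency of each overlapping k-mer in seq (hypothetical helper)."""
    counts = Counter(seq[i : i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def shannon_entropy(freqs: dict[str, float]) -> float:
    """Shannon entropy, in bits, of a frequency distribution."""
    return sum(-p * log2(p) for p in freqs.values() if p > 0)


# A homopolymer has a single distinct 2-mer, so its entropy is zero.
low = kmer_freqs("AAAAAAAAAA", k=2)
# A mixed sequence spreads its counts over several distinct 2-mers.
high = kmer_freqs("ACGTTGCAAC", k=2)

print(shannon_entropy(low))
print(shannon_entropy(high))
```

A sequence whose k-mers are spread over many distinct values has higher entropy than a repetitive one, which is the intuition behind using k-mer frequencies to compare sequences without aligning them.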

You can read more about the methods implemented in diverse_seq in the preprint here.

`dvs prep`: preparing the sequence data

Converts sequence data into a more efficient format for the diversity assessment. This must be done before running the nmost, max or ctree commands, each of which takes the resulting .dvseqs file as input.

CLI options for dvs prep

Usage: dvs prep [OPTIONS]

  Writes processed sequences to a <HDF5 file>.dvseqs.

Options:
  -s, --seqdir PATH        directory containing sequence files  [required]
  -sf, --suffix TEXT       sequence file suffix  [default: fa]
  -o, --outpath PATH       write processed seqs to this filename  [required]
  -np, --numprocs INTEGER  number of processes  [default: 1]
  -F, --force_overwrite    Overwrite existing file if it exists
  -m, --moltype [dna|rna]  Molecular type of sequences  [default: dna]
  -L, --limit INTEGER      number of sequences to process
  -hp, --hide_progress     hide progress bars
  --help                   Show this message and exit.
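
As a hedged illustration of driving `dvs prep` from a script, the sketch below assembles the command from the options listed above and runs it only if the `dvs` executable and the input directory are present. The helper name `build_prep_command`, the directory `genomes` and the output name `genomes.dvseqs` are assumptions made for the example.

```python
import shutil
import subprocess
from pathlib import Path


def build_prep_command(seqdir, outpath, suffix="fa", numprocs=1,
                       moltype="dna", force=False):
    """Assemble a `dvs prep` argument list (hypothetical helper)."""
    cmd = [
        "dvs", "prep",
        "--seqdir", str(seqdir),
        "--suffix", suffix,
        "--outpath", str(outpath),
        "--numprocs", str(numprocs),
        "--moltype", moltype,
    ]
    if force:
        # Overwrite an existing output file.
        cmd.append("--force_overwrite")
    return cmd


# "genomes" and "genomes.dvseqs" are placeholder paths for illustration.
cmd = build_prep_command(Path("genomes"), Path("genomes.dvseqs"), numprocs=4)
print(" ".join(cmd))

# Run only when dvs is installed and the input directory exists.
if shutil.which("dvs") and Path("genomes").is_dir():
    subprocess.run(cmd, check=True)
```

Passing the arguments as a list, rather than a single shell string, avoids quoting problems with paths that contain spaces.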

`dvs nmost`: select the n-most diverse sequences

Selects the n sequences that maximise the total Jensen-Shannon divergence (JSD). We recommend using nmost for large datasets.

Note: A fuller explanation is coming soon!
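
Until that explanation is available, the following sketch shows one way to compute the JSD of a set of k-mer frequency distributions and to pick sequences greedily so that the JSD of the chosen set grows. It is a simplified illustration, not the algorithm used by `dvs nmost`; the function names and the greedy strategy are assumptions, and `n` must not exceed the number of sequences.

```python
from collections import Counter
from math import log2


def kmer_freqs(seq, k):
    """Relative frequency of each overlapping k-mer in seq."""
    counts = Counter(seq[i : i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def entropy(freqs):
    """Shannon entropy in bits."""
    return sum(-p * log2(p) for p in freqs.values() if p > 0)


def jsd(dists):
    """Equal-weight JSD: H(mean distribution) minus the mean of H(each)."""
    n = len(dists)
    mean = Counter()
    for d in dists:
        for kmer, p in d.items():
            mean[kmer] += p / n
    return entropy(mean) - sum(entropy(d) for d in dists) / n


def greedy_nmost(seqs, n, k=2):
    """Greedy illustration only: start from the first sequence and keep
    adding whichever remaining sequence gives the largest set JSD."""
    freqs = {name: kmer_freqs(s, k) for name, s in seqs.items()}
    names = list(freqs)
    chosen = [names[0]]
    while len(chosen) < n:
        best = max(
            (name for name in names if name not in chosen),
            key=lambda name: jsd([freqs[c] for c in chosen] + [freqs[name]]),
        )
        chosen.append(best)
    return chosen


seqs = {
    "a": "ACGTACGTACGTACGT",
    "b": "ACGTACGTACGTACGA",
    "c": "TTTTTTTTGGGGGGGG",  # shares no 2-mers with "a"
    "d": "ACGTACGTACGAACGT",
}
print(greedy_nmost(seqs, n=2))
```

In this toy data, sequence "c" shares no 2-mers with "a", so adding it gives the largest JSD and it is selected ahead of the near-duplicates "b" and "d".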

Options for command line dvs nmost

Usage: dvs nmost [OPTIONS]

  Identify n seqs that maximise average delta JSD

Options:
  -s, --seqfile PATH       path to .dvseqs file  [required]
  -o, --outpath PATH       the input string will be cast to Path instance
  -n, --number INTEGER     number of seqs in divergent set  [required]
  -k INTEGER               k-mer size  [default: 6]
  -i, --include TEXT       seqnames to include in divergent set
  -np, --numprocs INTEGER  number of processes  [default: 1]
  -L, --limit INTEGER      number of sequences to process
  -v, --verbose            is an integer indicating number of cl occurrences
                           [default: 0]
  -hp, --hide_progress     hide progress bars
  --help                   Show this message and exit.

Options for cogent3 app dvs_nmost

The `dvs nmost` command is also available as the cogent3 app `dvs_nmost`. The output of `cogent3.app_help("dvs_nmost")` is shown below.

Overview
--------
select the n-most diverse seqs from a sequence collection

Options for making the app
--------------------------
dvs_nmost_app = get_app(
    'dvs_nmost',
    n=10,
    moltype='dna',
    include=None,
    k=6,
    seed=None,
)

Parameters
----------
n
    the number of divergent sequences
moltype
    molecular type of the sequences
k
    k-mer size
include
    sequence names to include in the final result
seed
    random number seed

Notes
-----
If called with an alignment, the ungapped sequences are used.
The order of the sequences is randomised. If include is not None, the
named sequences are added to the final result.

Input type
----------
ArrayAlignment, SequenceCollection, Alignment

Output type
-----------
ArrayAlignment, SequenceCollection, Alignment

`dvs max`: maximise variance in the selected sequences

The result of the max command is typically a set that is modestly more diverse than the one produced by nmost.

Note: A fuller explanation is coming soon!
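
The `--stat` option refers to statistics of delta JSD. The sketch below is an assumption-laden illustration, not the package's code: it takes delta JSD for a sequence to be the JSD of the whole set minus the JSD of the set without that sequence, and it reads `cov` as the coefficient of variation (standard deviation divided by the mean). The function names and the use of the sample standard deviation are choices made for the example.

```python
from collections import Counter
from math import log2
from statistics import mean, stdev


def entropy(freqs):
    """Shannon entropy in bits."""
    return sum(-p * log2(p) for p in freqs.values() if p > 0)


def jsd(dists):
    """Equal-weight JSD: H(mean distribution) minus the mean of H(each)."""
    n = len(dists)
    avg = Counter()
    for d in dists:
        for key, p in d.items():
            avg[key] += p / n
    return entropy(avg) - sum(entropy(d) for d in dists) / n


def delta_jsd(dists):
    """Assumed definition: JSD(all) - JSD(all except member i), for each i."""
    total = jsd(dists)
    return [total - jsd(dists[:i] + dists[i + 1 :]) for i in range(len(dists))]


def summary_stat(deltas, stat="stdev"):
    """Spread of the delta JSD values; 'cov' is assumed to mean
    coefficient of variation and is undefined when the mean is zero."""
    sd = stdev(deltas)
    if stat == "stdev":
        return sd
    if stat == "cov":
        return sd / mean(deltas)
    raise ValueError(f"unknown stat: {stat!r}")


# Toy nucleotide frequency distributions: two identical, one distinct.
dists = [
    {"A": 0.5, "C": 0.5},
    {"A": 0.5, "C": 0.5},
    {"G": 0.5, "T": 0.5},
]
deltas = delta_jsd(dists)
print(deltas)
print(summary_stat(deltas, "stdev"), summary_stat(deltas, "cov"))
```

The distinct third distribution has the largest delta JSD, because removing it leaves two identical members whose JSD is zero.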

Options for command line dvs max

Usage: dvs max [OPTIONS]

  Identify the seqs that maximise average delta JSD

Options:
  -s, --seqfile PATH       path to .dvseqs file  [required]
  -o, --outpath PATH       the input string will be cast to Path instance
  -z, --min_size INTEGER   minimum size of divergent set  [default: 7]
  -zp, --max_size INTEGER  maximum size of divergent set
  -k INTEGER               k-mer size  [default: 6]
  -st, --stat [stdev|cov]  statistic to maximise  [default: stdev]
  -i, --include TEXT       seqnames to include in divergent set
  -np, --numprocs INTEGER  number of processes  [default: 1]
  -L, --limit INTEGER      number of sequences to process
  -T, --test_run           reduce number of paths and size of query seqs
  -v, --verbose            is an integer indicating number of cl occurrences
                           [default: 0]
  -hp, --hide_progress     hide progress bars
  --help                   Show this message and exit.

Options for cogent3 app dvs_max

The `dvs max` command is also available as the cogent3 app `dvs_max`.

Overview
--------
select the maximally divergent seqs from a sequence collection

Options for making the app
--------------------------
dvs_max_app = get_app(
    'dvs_max',
    min_size=5,
    max_size=30,
    stat='stdev',
    moltype='dna',
    include=None,
    k=6,
    seed=None,
)

Parameters
----------
min_size
    minimum size of the divergent set
max_size
    the maximum size of the divergent set
stat
    either stdev or cov, which represent the statistics
    std(delta_jsd) and cov(delta_jsd) respectively
moltype
    molecular type of the sequences
include
    sequence names to include in the final result
k
    k-mer size
seed
    random number seed

Notes
-----
If called with an alignment, the ungapped sequences are used.
The order of the sequences is randomised. If include is not None, the
named sequences are added to the final result.

Input type
----------
ArrayAlignment, SequenceCollection, Alignment

Output type
-----------
ArrayAlignment, SequenceCollection, Alignment

`dvs ctree`: build a phylogeny using k-mers

The result of the ctree command is a Newick-formatted tree string without distances.

Note: A fuller explanation is coming soon!
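
For readers unfamiliar with the mash distance used by default, the sketch below follows the general MinHash recipe from Ondov et al. (2016): keep the smallest hashes of each sequence's k-mers as a sketch, estimate the Jaccard index j from the merged sketches, and convert it with D = -ln(2j / (1 + j)) / k. It is an illustration under stated assumptions, not the code inside `dvs ctree`: the function names and the choice of BLAKE2b as the hash are hypothetical, and the tree-building step is not shown.

```python
import hashlib
from math import log

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Lexicographically smaller of a k-mer and its reverse complement."""
    return min(kmer, kmer.translate(_COMPLEMENT)[::-1])


def sketch(seq, k, size, use_canonical=False):
    """Bottom-`size` MinHash sketch: the smallest hashes of the distinct k-mers."""
    kmers = {seq[i : i + k] for i in range(len(seq) - k + 1)}
    if use_canonical:
        kmers = {canonical(x) for x in kmers}
    hashes = sorted(
        int.from_bytes(hashlib.blake2b(x.encode(), digest_size=8).digest(), "big")
        for x in kmers
    )
    return hashes[:size]


def mash_distance(sk1, sk2, k, size):
    """Estimate Jaccard from the merged sketches, then apply the mash formula."""
    s1, s2 = set(sk1), set(sk2)
    union = sorted(s1 | s2)[:size]
    shared = sum(1 for h in union if h in s1 and h in s2)
    j = shared / len(union)
    if j == 0:
        return 1.0  # no shared hashes: report the maximum distance
    return max(0.0, -log(2 * j / (1 + j)) / k)


k, size = 3, 10
x = sketch("ACGTACGTTGCAACGTAGCTAGCTAGGATC", k, size)
print(mash_distance(x, x, k, size))
print(mash_distance(sketch("AAAAAAAAAA", k, size), sketch("CCCCCCCCCC", k, size), k, size))
```

With canonical k-mers enabled, a sequence and its reverse complement produce the same sketch, which is why the option matters when strand orientation is unknown.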

Options for command line dvs ctree

Usage: dvs ctree [OPTIONS]

  Quickly compute a cluster tree based on kmers for a collection of sequences.

Options:
  -s, --seqfile PATH              path to .dvseqs file  [required]
  -o, --outpath PATH              the input string will be cast to Path instance
  -m, --moltype [dna|rna]         Molecular type of sequences  [default: dna]
  -k INTEGER                      k-mer size  [default: 6]
  --sketch-size INTEGER           sketch size for mash distance
  -d, --distance [mash|euclidean]
                                  distance measure for tree construction
                                  [default: mash]
  -c, --canonical-kmers           consider kmers identical to their reverse
                                  complement
  -L, --limit INTEGER             number of sequences to process
  -np, --numprocs INTEGER         number of processes  [default: 1]
  -hp, --hide_progress            hide progress bars
  --help                          Show this message and exit.

Options for cogent3 app dvs_ctree

The `dvs ctree` command is also available as the cogent3 apps `dvs_ctree` and `dvs_par_ctree`. The latter is not composable, but can run the analysis for a single collection in parallel.

Overview
--------
Create a cluster tree from kmer distances.

Options for making the app
--------------------------
dvs_ctree_app = get_app(
    'dvs_ctree',
    k=12,
    sketch_size=3000,
    moltype='dna',
    distance_mode='mash',
    mash_canonical_kmers=None,
    show_progress=False,
)

Initialise parameters for generating a kmer cluster tree.

Parameters
----------
k
    kmer size
sketch_size
    size of sketches, only applies to mash distance
moltype
    seq collection molecular type
distance_mode
    mash distance or euclidean distance between kmer freqs
mash_canonical_kmers
    whether to use mash canonical kmers for mash distance
show_progress
    whether to show progress bars

Notes
-----
This app is composable.

If mash_canonical_kmers is enabled when using the mash distance,
kmers are considered identical to their reverse complement.

References
----------
.. [1] Ondov, B. D., Treangen, T. J., Melsted, P., Mallonee, A. B.,
   Bergman, N. H., Koren, S., & Phillippy, A. M. (2016).
   Mash: fast genome and metagenome distance estimation using MinHash.
   Genome biology, 17, 1-14.

Input type
----------
ArrayAlignment, SequenceCollection, Alignment

Output type
-----------
PhyloNode

Overview
--------
Create a cluster tree from kmer distances in parallel.

Options for making the app
--------------------------
dvs_par_ctree_app = get_app(
    'dvs_par_ctree',
    k=12,
    sketch_size=3000,
    moltype='dna',
    distance_mode='mash',
    mash_canonical_kmers=None,
    show_progress=False,
    max_workers=None,
    parallel=True,
)

Initialise parameters for generating a kmer cluster tree.

Parameters
----------
k
    kmer size
sketch_size
    size of sketches, only applies to mash distance
moltype
    seq collection molecular type
distance_mode
    mash distance or euclidean distance between kmer freqs
mash_canonical_kmers
    whether to use mash canonical kmers for mash distance
show_progress
    whether to show progress bars
numprocs
    number of workers, defaults to running serial

Notes
-----
This app is not composable but can run in parallel. It is
best suited to a single large sequence collection.

If mash_canonical_kmers is enabled when using the mash distance,
kmers are considered identical to their reverse complement.

References
----------
.. [1] Ondov, B. D., Treangen, T. J., Melsted, P., Mallonee, A. B.,
   Bergman, N. H., Koren, S., & Phillippy, A. M. (2016).
   Mash: fast genome and metagenome distance estimation using MinHash.
   Genome biology, 17, 1-14.

Input type
----------
ArrayAlignment, SequenceCollection, Alignment

Output type
-----------
PhyloNode

Release history

2024.11.22a1 (pre-release, this version): Nov 22, 2024
2024.11.8a3 (pre-release): Nov 11, 2024
2024.11.8a2 (pre-release): Nov 9, 2024
2024.11.8a1 (pre-release): Nov 8, 2024
2024.9.2a1 (pre-release): Sep 2, 2024
2024.8.26a6 (pre-release): Aug 29, 2024
2024.8.26a5 (pre-release): Aug 26, 2024
