A collection of scripts that are useful for dealing with viral RNA NGS data.

The smallgenomeutilities are a collection of scripts that are useful for handling and manipulating NGS data of small viral genomes. They are written in Python 3 with a small number of dependencies.

The smallgenomeutilities are part of the V-pipe workflow for analysing NGS data of short viral genomes.

Dependencies

You can install these Python modules using either pip or bioconda:

  • biopython

  • bcbio-gff

  • numpy

  • pandas

  • progress

  • pysam

  • pysamstats

  • scikit-learn

  • matplotlib

  • pyyaml

  • more_itertools

In addition to these modules, frameshift_deletions_checks currently requires mafft to be installed; it is also available on bioconda.

Installation

The recommended way to install the smallgenomeutilities is using the bioconda package:

mamba install smallgenomeutilities

Another possibility is using pip:

# install from the current directory
pip install --editable .

# install from GitHub
pip install git+https://github.com/cbg-ethz/smallgenomeutilities.git

# install from PyPI
pip install smallgenomeutilities

Description of utilities

aln2basecnt

Extract base counts and coverage information from a single alignment file.

compute_mds

Compute multidimensional scaling for visualizing distances among reconstructed haplotypes.

convert_qr

Convert QuasiRecomb output of a transmitter and recipient set of haplotypes to a combined set of haplotypes, where gaps have been filtered. Optionally translate to peptide sequence.

convert_reference

Perform a genomic liftover. Transform an alignment in SAM or BAM format from one reference sequence to another. Can replace M states by =/X.

coverage

Calculate average coverage for a target region on a different contig.

coverage_depth_qc

Compute ‘fraction of genome covered at depth’ QC metrics from coverage TSV files (made by aln2basecnt, samtools depth, etc.).
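
As a rough illustration of what such a metric means, the following sketch computes, for a list of per-position depths, the fraction of positions covered at or above each depth threshold. It is not the tool's implementation: the function name `fraction_covered`, the thresholds and the example depths are all assumptions made for this example, and reading the TSV file is left out.

```python
def fraction_covered(depths, thresholds=(1, 10, 100)):
    """Return {threshold: fraction of positions with depth >= threshold}.

    `depths` holds one coverage value per genome position.
    The default thresholds are illustrative, not the tool's defaults.
    """
    depths = list(depths)
    if not depths:
        raise ValueError("no positions given")
    return {t: sum(d >= t for d in depths) / len(depths) for t in thresholds}


# Hypothetical per-position depths for a 10-base genome
depths = [0, 5, 12, 150, 200, 90, 10, 3, 0, 45]
qc = fraction_covered(depths)
print(qc)
```

Here 8 of the 10 positions have at least one read, 6 reach a depth of 10, and 2 reach a depth of 100.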

coverage_stats

Calculate average coverage for a target region of an alignment.

extract_consensus

Build consensus sequences including either the majority base or the ambiguous bases from an alignment (BAM) file.
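
The difference between the two kinds of consensus can be sketched as follows. This is an illustrative example only, not the code used by extract_consensus: the function `call_position`, the `min_freq` cut-off and the example counts are assumptions, and per-position base counts are taken as given rather than read from a BAM file. The ambiguity letters follow the standard IUPAC nucleotide codes.

```python
# Standard IUPAC nucleotide ambiguity codes, keyed by the set of bases they cover
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D", frozenset("ACT"): "H",
    frozenset("ACG"): "V", frozenset("ACGT"): "N",
}


def call_position(counts, min_freq=0.05):
    """Return (majority_base, ambiguous_base) for one position.

    `counts` maps bases to read counts. `min_freq` (an assumed cut-off) is the
    minimum frequency a base needs to be included in the ambiguity code.
    Positions without any A/C/G/T reads are reported as "N".
    """
    depth = sum(counts.get(b, 0) for b in "ACGT")
    if depth == 0:
        return "N", "N"
    # Ties are broken by the order A, C, G, T
    majority = max("ACGT", key=lambda b: counts.get(b, 0))
    present = {b for b in "ACGT" if counts.get(b, 0) / depth >= min_freq}
    present.add(majority)  # the majority base is always part of the code
    return majority, IUPAC[frozenset(present)]


# Hypothetical base counts for three positions
positions = [{"A": 97, "G": 3}, {"C": 60, "T": 40}, {}]
calls = [call_position(c) for c in positions]
majority_consensus = "".join(m for m, _ in calls)
ambiguous_consensus = "".join(a for _, a in calls)
print(majority_consensus, ambiguous_consensus)
```

At the second position the minority base T is frequent enough to turn the majority call C into the ambiguity code Y, while the 3% G at the first position falls below the assumed cut-off.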

extract_coverage_intervals

Extract regions with sufficient coverage for running ShoRAH. Half-open intervals are returned, [start:end), and 0-based indexing is used.
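
The interval convention can be illustrated with a small sketch that scans per-position depths and reports every run of positions at or above a coverage threshold as a 0-based, half-open interval. The function name `coverage_intervals`, the threshold and the depths are assumptions for this example and are not taken from the tool.

```python
def coverage_intervals(depths, min_cov):
    """Return 0-based half-open intervals [start, end) where depth >= min_cov."""
    intervals = []
    start = None  # start of the run currently being extended, if any
    for pos, depth in enumerate(depths):
        if depth >= min_cov:
            if start is None:
                start = pos
        elif start is not None:
            intervals.append((start, pos))  # `pos` is excluded from the interval
            start = None
    if start is not None:  # the last run extends to the end of the contig
        intervals.append((start, len(depths)))
    return intervals


# Hypothetical per-position depths
depths = [0, 5, 12, 150, 200, 3, 0, 45, 50]
intervals = coverage_intervals(depths, min_cov=10)
print(intervals)
```

Because the intervals are half-open, they can be used directly as Python slices: `depths[2:5]` yields exactly the three covered positions of the first interval.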

extract_sam

Extract subsequences of an alignment, with the option of converting it to peptide sequences. Can filter on the basis of subsequence frequency or gap frequencies in subsequences.

extract_seq

Extract sequences of alignments into a FASTA file where the sequence id matches a given string.

frameshift_deletions_checks

Produce a report about frameshifting indels in a consensus sequence.

gather_coverage

Gather per-sample coverage information from multiple samples into a single unified file.

mapper

Determine the genomic offsets on a target contig, given an initial contig and offsets. Can be used to map between reference genomes.
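
The general idea of mapping an offset between two contigs can be sketched with a pairwise alignment given as two gapped strings of equal length. This is only a conceptual sketch under that assumption; how mapper obtains and represents its alignment is not described here, and `map_offset` and the example sequences are hypothetical.

```python
def map_offset(aligned_src, aligned_dst, src_offset):
    """Map a 0-based offset on the source contig to the target contig.

    Both arguments are rows of the same pairwise alignment, with "-" for gaps.
    Returns None if the source base is aligned to a gap in the target.
    """
    if len(aligned_src) != len(aligned_dst):
        raise ValueError("aligned sequences must have equal length")
    src_pos = dst_pos = 0  # ungapped coordinates reached so far
    for s, d in zip(aligned_src, aligned_dst):
        if s != "-" and src_pos == src_offset:
            return dst_pos if d != "-" else None
        src_pos += int(s != "-")
        dst_pos += int(d != "-")
    raise IndexError("offset lies beyond the end of the source contig")


# Hypothetical alignment of two short contigs
src = "ACG-TAC"
dst = "A-GGTAC"
mapped = [map_offset(src, dst, i) for i in range(4)]
print(mapped)
```

The second source base has no counterpart on the target (it is aligned to a gap), so it maps to `None`, and the insertion in the target shifts the later coordinates.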

min_coverage

Find the minimum coverage in a region of an alignment.

minority_freq

Extract frequencies of minority variants from multiple samples. A region of interest is also supported.

pair_sequences

Compare sequences from a multiple sequence alignment from transmitter and recipient samples in order to determine the optimal matching of transmitters to recipients.

paired_end_read_merger

Merge paired-end reads to one merged read based on alignment.

predict_num_reads

Predict number of reads after quality preprocessing.

prepare_primers

Starting with a primers BED file, generate the other files used by V-pipe (inserts BED file, and TSV and FASTA files of primer sequences).

remove_gaps_msa

Given a multiple sequence alignment, remove loci with a gap fraction above a certain threshold.
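
A minimal sketch of this kind of filtering is shown below, assuming the alignment is held as a list of equal-length strings with "-" as the gap character. The function name `remove_gappy_columns`, the default threshold and the example alignment are assumptions for illustration; the tool itself works on alignment files and may differ in detail.

```python
def remove_gappy_columns(seqs, max_gap_fraction=0.5):
    """Drop alignment columns whose gap fraction exceeds `max_gap_fraction`."""
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("need at least one sequence, all of equal length")
    n = len(seqs)
    # zip(*seqs) iterates over columns; keep those that are not too gappy
    keep = [
        i for i, column in enumerate(zip(*seqs))
        if column.count("-") / n <= max_gap_fraction
    ]
    return ["".join(s[i] for i in keep) for s in seqs]


# Hypothetical alignment of three sequences
msa = ["AC-GT", "A--GT", "ACTG-"]
filtered = remove_gappy_columns(msa, max_gap_fraction=0.5)
print(filtered)
```

Only the third column, with two gaps out of three sequences, exceeds the threshold and is removed; columns with a single gap are kept.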

Using the utilities

After installation, all utilities are available as command-line programs. You can run any utility by simply typing its name in your terminal, followed by any required arguments:

# Get help for any utility
aln2basecnt --help

# Example usage of paired_end_read_merger
paired_end_read_merger input.sam -f reference.fasta -o output_fused.sam

Each utility supports the --help flag which provides detailed information about its usage, required arguments, and available options.

Citation

If you use paired_end_read_merger or frameshift_deletions_checks, please cite

Fuhrmann, L., Jablonski, K. P., Topolsky, I., Batavia, A. A., Borgsmueller, N., Icer Baykal, P., … & Beerenwinkel, N. (2023). “V-pipe 3.0: a sustainable pipeline for within-sample viral genetic diversity estimation.” https://doi.org/10.1101/2023.10.16.562462

For all other scripts, please cite

Posada-Céspedes S., Seifert D., Topolsky I., Jablonski K.P., Metzner K.J., and Beerenwinkel N. 2021. “V-pipe: a computational pipeline for assessing viral genetic diversity from high-throughput sequencing data.” Bioinformatics, January. https://doi.org/10.1093/bioinformatics/btab015

Contributions

  • David Seifert

  • Susana Posada Cespedes

  • Ivan Blagoev Topolsky

  • Lara Fuhrmann

  • Mateo Carrara

  • Michal Okoniewski

  • Gordon J. Köhn
