RNA virus analysis toolkit
RolyPoly
RolyPoly is an RNA virus analysis toolkit, meant to be a "swiss-army knife" for RNA virus discovery and characterization. It includes a variety of commands, wrappers, parsers, automations, and some "quality of life" features for many stages of a virus investigation process (from raw read processing to genome annotation). While it includes an "end-2-end" command that runs an entire pipeline, the main goals of rolypoly are:
- Help non-computational researchers take a deep dive into their data without compromising on using tools that are non-techie friendly.
- Help (software) developers of virus analysis pipelines "plug" holes in their framework, by using specific RolyPoly commands to add features to their existing code base.
WIP - NOTE
RolyPoly is an open, still in-progress project. I aim to summarise the main functionality in a manuscript by the end of 2025 or early 2026. Pull requests and contributions are welcome and will be considered (see the Contribution section).
Docs
For more detailed information, please refer to the docs. While they aren't updated often, they should still be helpful. Most commands support a --help flag, and that tends to be the most up-to-date reference.
Installation
Quick and Easy - One Conda/Mamba Environment
Recommended for most users who want a "just works" solution and primarily intend to use rolypoly as a CLI tool in an independent environment.
We hope to have rolypoly available from bioconda in the near future.
In the meantime, it can be installed with the quick_setup.sh script, which will also fetch the pre-generated data rolypoly requires.
curl -O https://code.jgi.doe.gov/rolypoly/rolypoly/-/raw/main/src/setup/quick_setup.sh && \
bash quick_setup.sh
Quick Setup - Additional Options
You can specify custom paths for the code, databases, and conda environment location:
bash quick_setup.sh /path/to/conda/env /path/to/install/rolypoly_code /path/to/store/databases /path/to/logfile
By default, if no positional arguments are supplied, rolypoly is installed into the current working folder (the path quick_setup.sh is called from):
- database in ./rolypoly/data/
- code in ./rolypoly/code/
- conda environment in ./rolypoly/env/
- log file in ./RolyPoly_quick_setup.log
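The four positional arguments are easy to mix up, so it can help to name them before calling the script. The following is a minimal sketch, not part of rolypoly: the helper `quick_setup_cmd` and the base folder are hypothetical, and the argument order (environment, code, databases, log file) is taken from the example command above. It only prints the command (a dry run) and does not install anything.

```shell
#!/bin/sh
# Sketch: build the quick_setup.sh call from named locations.
# quick_setup_cmd and all paths are hypothetical placeholders.

# Print the command line for a custom install.
# Argument order follows the example above: env, code, databases, log file.
quick_setup_cmd() {
    env_dir=$1
    code_dir=$2
    db_dir=$3
    log_file=$4
    printf 'bash quick_setup.sh %s %s %s %s\n' \
        "$env_dir" "$code_dir" "$db_dir" "$log_file"
}

BASE="$HOME/rolypoly_install"   # hypothetical base folder

# Dry run: show the command that would be executed.
quick_setup_cmd "$BASE/env" "$BASE/code" "$BASE/data" "$BASE/quick_setup.log"
```

To install, run the printed command from the folder that contains quick_setup.sh. Paths containing spaces would need quoting, which this sketch does not handle.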
Modular / Dev - Command-Specific Pixi Environments
For software developers looking to try or make use of specific rolypoly features with minimal risk of dependency conflicts. This approach allows you to install only the tools you need for specific functionality.
# Install pixi first (if not already installed)
curl -fsSL https://pixi.sh/install.sh | bash
# Clone the repository
git clone https://code.jgi.doe.gov/rolypoly/rolypoly.git
cd rolypoly
# Install for specific functionality (examples):
pixi install -e reads-only # Just read processing tools
pixi install -e assembly-only # Just assembly tools
pixi install -e basic-analysis # Reads + assembly + identification
pixi install -e complete # All tools (equivalent to legacy install)
# Run commands in the appropriate environment
pixi run -e reads-only rolypoly filter-reads --help
# or load the environment
pixi shell -e reads-only
rolypoly filter-reads --help
For detailed modular installation options, see the installation documentation.
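When scripting around the modular install, a small lookup can make the choice of environment explicit. This is only a sketch: the function `pick_env` and the task keywords (`reads`, `assembly`, `basic`, `all`) are hypothetical, while the environment names are the ones shown in the examples above. It prints the pixi commands rather than running them.

```shell
#!/bin/sh
# Sketch: map a task keyword (hypothetical labels) to one of the
# pixi environment names used in the examples above.
pick_env() {
    case "$1" in
        reads)    echo "reads-only" ;;
        assembly) echo "assembly-only" ;;
        basic)    echo "basic-analysis" ;;
        all)      echo "complete" ;;
        *)        echo "unknown task: $1" >&2; return 1 ;;
    esac
}

# Dry run: print the install and run commands for read processing.
if env_name=$(pick_env reads); then
    echo "pixi install -e $env_name"
    echo "pixi run -e $env_name rolypoly filter-reads --help"
fi
```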
Usage
RolyPoly is a command-line tool with subcommands grouped by analysis stage. For detailed help in the terminal, use rolypoly --help or rolypoly <group> --help. For more specific help, see the docs.
rolypoly [OPTIONS] <GROUP> <COMMAND> [ARGS]...
Command Groups and Subcommands
data
- get-data: Download/set up required data
- version: Show code and data version info
reads
- filter-reads: Host/rRNA/adapters/artifact filtering and QC (bbmap, seqkit, etc.)
- shrink-reads: Downsample or subsample reads (seqkit, custom)
- mask-dna: Mask DNA regions in RNA-seq reads (bbmap, seqkit)
annotation
- annotate: Genome feature annotation (prodigal, pyrodigal-rv, custom)
- annotate-rna: RNA secondary structure labelling and ribozyme detection (Infernal, ViennaRNA, Rfam)
- annotate-prot: Protein domain annotation and functional prediction (HMMER, Pfam, custom)
assembly (Meta/Genome Assembly)
- assemble: Assemble genomes/metagenomes
- filter-contigs: Filter assembled contigs
misc (Miscellaneous)
- end2end: Run end-to-end pipeline
- fetch-sra: Download SRA fastq files
- fastx-stats: Compute FASTX statistics
- rename-seqs: Rename sequences
- quick-taxonomy: Quick taxonomy assignment
identify (RNA Virus Identification)
- marker-search: Search for viral markers
- search-viruses: Map and identify viruses
Notes:
- Only the commands listed above are currently exposed via the CLI. Some modules in the codebase are not available as CLI commands.
- For help on any command, use: rolypoly <group> <command> --help
- Some commands (e.g., co-assembly, refine, visualize, characterise, etc.) are not currently available or are commented out in the CLI.
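To get an overview of what is exposed, the group/command pairs listed above can be turned into help invocations. The sketch below is an illustration only: `list_help_cmds` is a hypothetical helper, the pairs are copied from the list in this section, and the only flag used is --help, in the `rolypoly <group> <command> --help` form given in the notes. It prints the commands and does not require rolypoly to be installed.

```shell
#!/bin/sh
# Sketch: print one help invocation per CLI command listed in this section.
# list_help_cmds is a hypothetical helper, not part of rolypoly.
list_help_cmds() {
    while read -r group cmd; do
        if [ -n "$group" ]; then
            printf 'rolypoly %s %s --help\n' "$group" "$cmd"
        fi
    done <<'EOF'
data get-data
data version
reads filter-reads
reads shrink-reads
reads mask-dna
annotation annotate
annotation annotate-rna
annotation annotate-prot
assembly assemble
assembly filter-contigs
misc end2end
misc fetch-sra
misc fastx-stats
misc rename-seqs
misc quick-taxonomy
identify marker-search
identify search-viruses
EOF
}

# Dry run: list the help commands.
list_help_cmds
```

With rolypoly on the PATH, any printed line can be pasted into the terminal to see that command's options.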
Project Status
Active development. Currently implemented features:
- ✅ NGS raw read filtering (host, rRNA, adapters, artefacts) and quality control report (reads filter-reads)
- ✅ Assembly (SPAdes, MEGAHIT and penguin) (assembly assemble)
- ✅ Contig filtering and clustering (assembly filter-contigs)
- ✅ Marker gene search with pyhmmer (mainly RdRps, genomad VV's or user-provided) (identify marker-search)
- ✅ RNA secondary structure prediction, annotation and ribozyme identification (annotation annotate-rna)
- ✅ Nucleotide search vs known viruses (identify search-viruses)
- ✅ Prepare external data (data get-data)
Under development:
- 🚧 Protein annotation (annotation annotate-prot) (mostly done, but other DBs or tools still need to be checked; currently no structural prediction support)
- 🚧 Host prediction (TBD)
- 🚧 Genome binning and refinement (TBD)
- 🚧 Virus taxonomic classification (TBD)
- 🚧 Virus feature prediction (+/-ssRNA/dsRNA, circular/linear, mono/poly-segmented, capsid type, etc.) (TBD)
- 🚧 Cross-sample analysis (TBD)
For more details about the implementation status and roadmap please contact us directly or open an issue.
Dependencies
📦 Modular Installation Available: RolyPoly supports both quick setup (one environment for all tools) and modular installation (command-specific environments). The modular approach is particularly useful for software developers who want to integrate specific rolypoly features with minimal dependency conflicts. See the installation documentation for details.
Not all 3rd party software is used by all the different commands. RolyPoly includes a "citation reminder" that will try to list all the external software used by a command. The "reminded citations" are pretty printed to console (stdout) but are also written to a logfile. The bibtex file rolypoly uses for this is included in the codebase.
Non-Python
- SPAdes
- seqkit
- datasets
- bbmap - via bbmapy
- megahit
- mmseqs2
- plass and penguin
- diamond
- pigz
- prodigal - via pyrodigal-rv
- linearfold
- HMMER - via pyhmmer
- needletail
- infernal
- aragorn
- tRNAscan-SE
- bowtie1
- falco
Python Libraries
Databases used by rolypoly
RolyPoly will try to remind you to cite these too based on the commands you run. For more details, see the citation_reminder.py script and all_used_tools_dbs_citations
- NCBI RefSeq rRNAs - Reference RNA sequences from NCBI RefSeq
- NCBI RefSeq viruses - Reference viral sequences from NCBI RefSeq
- PFAM_A_37 - RdRp and RT profiles from Pfam-A version 37
- RVMT - RNA Virus Meta-Transcriptomes database
- SILVA_138 - High-quality ribosomal RNA database
- NeoRdRp_v2.1 - Collection of RdRp profiles
- RdRp-Scan - RdRp profile database incorporating PALMdb
- TSA_2018 - RNA virus profiles from transcriptome assemblies
- Rfam - Database of RNA families (structural/catalytic/both)
Motivation
There is a lot of good virus analysis software out there*. Many tools are custom-made for specific virus groups, some are generalists, but most of them require complete control over the analysis process (so only one or two points of entry for data). Apart from the input, these pipelines vary in their implementation (language, workflow management system (snakemake, nextflow...), dependencies), methodologies (tool choice for a similar step, such as the assembler), and goals (e.g. specific pathogen analysis vs whole virome analysis). These and other differences affect the design process and the tooling choices (such as selecting a fast nucleotide-based sequence search method limited to high identity over a slow but more sensitive profile-based or structure-based (amino acid) search method). This has created some "lock-in" (IMO), and I have found myself asked by people "what do you recommend for xyz" or "which pipeline should I use". Most people have limited time to invest in custom analysis pipeline design and so end up opting for an existing, off-the-shelf option, potentially compromising or having to align their goals with what the given software offers (if they are already aligned - great!).
* Check out awesome-rna-virus-tools for an awesome list of RNA virus (and related) software.
Reporting Issues
Please report any bugs you find on the Issues page.
Contribution
All forms of contributions are welcome - please see the CONTRIBUTING.md file for more details.
Authors (partial list, TBD update)
- Uri Neri
- Brian Bushnell
- Simon Roux
- Antônio Pedro Castello Branco Rocha Camargo
- Andrei Stecca Steindorff
- Clement Coclet
- David Parker
- Dimitris Karapliafis
- And more!
- Your name here? Open a PR :)
Related projects
- RdRp-CATCH: if you are interested in profile-based marker searches, benchmarking, and threshold setting.
- suvtk: if you are looking to expedite NCBI submission (among other tasks).
Acknowledgments
Thanks to the DOE Joint Genome Institute for infrastructure support. Special thanks to all contributors who have offered insights and improvements.
Copyright Notice
RolyPoly (rp) Copyright (c) 2024, The Regents of the University of California, through Lawrence Berkeley National Laboratory (subject to receipt of any required approvals from the U.S. Dept. of Energy). All rights reserved.
If you have questions about your rights to use or distribute this software, please contact Berkeley Lab's Intellectual Property Office at IPO@lbl.gov.
NOTICE. This Software was developed under funding from the U.S. Department of Energy and the U.S. Government consequently retains certain rights. As such, the U.S. Government has been granted for itself and others acting on its behalf a paid-up, nonexclusive, irrevocable, worldwide license in the Software to reproduce, distribute copies to the public, prepare derivative works, and perform publicly and display publicly, and to permit others to do so.
License Agreement
GPL v3 License
RolyPoly (rp) Copyright (c) 2024, The Regents of the University of California, through Lawrence Berkeley National Laboratory (subject to receipt of any required approvals from the U.S. Dept. of Energy). All rights reserved.
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see https://www.gnu.org/licenses/.