A toolkit for prokaryotic comparative genomics
SCARAP: pangenome inference and comparative genomics of prokaryotes
SCARAP is a toolkit with modules for various tasks related to comparative genomics of prokaryotes. SCARAP has been designed to be fast and scalable. Its main feature is pangenome inference, but it also has modules for direct core genome inference (without inferring the full pangenome), subsampling representatives from a (large) set of genomes and constructing a concatenated core gene alignment ("supermatrix") that can later be used for phylogeny inference. SCARAP has been designed for prokaryotes but should work for eukaryotic genomes as well. It can handle large genome datasets on a range of taxonomic levels; it has been tested on datasets with prokaryotic genomes from the species to the order level.
Installation
You can install SCARAP through conda:
git clone https://github.com/swittouck/scarap.git
cd scarap
conda env create -f environment.yml
You can then run SCARAP as follows:
conda activate scarap
scarap -h
conda deactivate
You can also install SCARAP manually by cloning it and installing the following dependencies:
Quick start
Obtaining data
SCARAP works mainly with faa files: amino acid sequences of all (predicted) genes in a genome assembly. You can obtain faa files in at least three ways:
- You can run a gene prediction tool like Prodigal on genome assemblies of your favorite strains, or a complete annotation pipeline such as Prokka or Bakta.
- You can search your favorite taxon on NCBI genome and manually download assemblies in the following way: click on an assembly, click "Download", select "Protein (FASTA)" as file type and click "Download" again.
- Given a list of assembly accession numbers (i.e. starting with GCA/GCF), you can use ncbi-genome-download to download the corresponding faa files.
Given a list of accessions in a file called accessions.txt, you can use ncbi-genome-download to download faa files as follows:
ncbi-genome-download -P \
--assembly-accessions accessions.txt \
--section genbank \
--formats protein-fasta \
bacteria
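Before starting a large download, it can help to check that accessions.txt contains only well-formed accessions. The following is a minimal sketch, not part of SCARAP or ncbi-genome-download; it assumes versioned accessions of the form GCA_/GCF_ followed by nine digits, a dot and a version number, and the function name is hypothetical.

```python
import re

# Assumption: versioned assembly accessions such as GCA_000005845.2.
ACCESSION_PATTERN = re.compile(r"^GC[AF]_\d{9}\.\d+$")

def split_accessions(lines):
    """Split lines into well-formed and malformed accessions, skipping blanks."""
    valid, invalid = [], []
    for line in lines:
        accession = line.strip()
        if not accession:
            continue
        if ACCESSION_PATTERN.match(accession):
            valid.append(accession)
        else:
            invalid.append(accession)
    return valid, invalid

# Toy input standing in for the lines of accessions.txt.
valid, invalid = split_accessions(
    ["GCA_000005845.2\n", "GCF_000005845.2\n", "\n", "GCA_123\n"]
)
print(valid)
print(invalid)
```

With a real file, the same function can be called on an open file handle, e.g. `split_accessions(open("accessions.txt"))`.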
Inferring a pangenome
If you want to infer the pangenome of a set of genomes, you only need their faa files (fasta files with protein sequences) as input. If the faa files are stored in a folder faas, you can infer the pangenome using 16 threads by running:
scarap pan ./faas ./pan -t 16
The pangenome will be stored in pan/pangenome.tsv.
The pangenome is stored in a "long format": a table with the columns gene, genome and orthogroup.
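A long-format table like this is easy to load into nested dictionaries for downstream analysis. The sketch below is illustrative only: it assumes the file is tab-separated, has no header row and lists the columns in the order gene, genome, orthogroup. The toy data and the function name are made up.

```python
import csv
import io
from collections import defaultdict

def read_pangenome(handle):
    """Parse a long-format pangenome table into {orthogroup: {genome: [genes]}}."""
    pangenome = defaultdict(lambda: defaultdict(list))
    # Assumption: tab-separated, no header, columns gene / genome / orthogroup.
    for gene, genome, orthogroup in csv.reader(handle, delimiter="\t"):
        pangenome[orthogroup][genome].append(gene)
    return pangenome

# Toy table standing in for pan/pangenome.tsv (all identifiers are made up).
toy_table = io.StringIO(
    "g1_001\tgenomeA\tOG1\n"
    "g2_001\tgenomeB\tOG1\n"
    "g1_002\tgenomeA\tOG2\n"
    "g1_003\tgenomeA\tOG2\n"
)
pangenome = read_pangenome(toy_table)

# Number of genomes in which each orthogroup occurs.
genomes_per_orthogroup = {og: len(members) for og, members in pangenome.items()}
print(genomes_per_orthogroup)
```

For a real run, replace the toy table with `open("pan/pangenome.tsv", newline="")`.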
Inferring a core genome
If you want to infer the core genome of a set of genomes directly, without inferring the full pangenome first, SCARAP can do this as well. Direct core genome inference is faster, and the core genome is often all you need (e.g. when you are planning to infer a phylogeny).
You can infer the core genome, given a set of faa files in a folder faas, in the following way:
scarap core ./faas ./core -t 16
The core genome will be stored in core/genes.tsv.
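To make the notion of a core genome concrete, the sketch below selects orthogroups that occur as a single copy in at least a given fraction of genomes, starting from (gene, genome, orthogroup) rows. This is not how the core module works internally (it avoids building the full pangenome); the function name, the threshold parameter and the toy rows are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def single_copy_core(rows, min_fraction=1.0):
    """Return orthogroups that are single-copy in at least min_fraction of genomes."""
    copies = defaultdict(Counter)  # orthogroup -> genome -> number of genes
    genomes = set()
    for _gene, genome, orthogroup in rows:
        copies[orthogroup][genome] += 1
        genomes.add(genome)
    core = []
    for orthogroup, counts in copies.items():
        n_single = sum(1 for n in counts.values() if n == 1)
        if n_single >= min_fraction * len(genomes):
            core.append(orthogroup)
    return sorted(core)

# Toy rows (gene, genome, orthogroup); all identifiers are made up.
rows = [
    ("a1", "A", "OG1"), ("b1", "B", "OG1"), ("c1", "C", "OG1"),
    ("a2", "A", "OG2"), ("b2", "B", "OG2"), ("b3", "B", "OG2"), ("c2", "C", "OG2"),
    ("a3", "A", "OG3"),
]
strict_core = single_copy_core(rows)                     # single-copy in all genomes
relaxed_core = single_copy_core(rows, min_fraction=0.6)  # single-copy in >= 60%
print(strict_core, relaxed_core)
```

Lowering the threshold trades completeness of each orthogroup for a larger number of usable genes, which can matter when some assemblies are fragmented.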
Subsampling a set of genomes
If you have a (large) dataset of genomes that you wish to subsample in a representative way, you can do this using the sample module. You will need to precompute the pangenome or core genome to do this; SCARAP calculates average amino acid identity (AAI) or core amino acid identity (cAAI) values in the subsampling process, and it uses the single-copy orthogroups from a pan- or core genome to do this.
For example, if you want to sample 100 genomes given a set of faa files in a folder faas:
scarap core ./faas ./core -t 16
scarap sample ./faas ./core/genes.tsv ./representatives -m 100 -t 16
The representative genomes will be stored in representatives/seeds.txt.
Important remark: by default, the per-gene amino acid identity values are estimated from alignment scores per column by MMseqs (alignment mode 1). For AAI values > 90%, these estimates are on average lower than the exact values. It is possible to calculate exact AAI values by adding the --exact option to the sample module, but this will be slower.
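As a rough illustration of what an exact AAI value means, the sketch below computes the identity of each pair of pre-aligned orthologous protein sequences and averages over genes. It is not SCARAP's or MMseqs' implementation: the gap handling (columns with a gap in either sequence are ignored) and the function names are assumptions, and the toy sequences are made up.

```python
def pairwise_identity(aligned_a, aligned_b, gap="-"):
    """Fraction of identical residues over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = identical = 0
    for res_a, res_b in zip(aligned_a, aligned_b):
        if res_a == gap or res_b == gap:
            continue  # assumption: gapped columns do not count
        compared += 1
        identical += res_a == res_b
    return identical / compared if compared else 0.0

def average_identity(gene_pairs):
    """Mean per-gene identity over a list of aligned sequence pairs."""
    identities = [pairwise_identity(a, b) for a, b in gene_pairs]
    return sum(identities) / len(identities)

# One aligned pair per single-copy orthogroup shared by two genomes (toy data).
gene_pairs = [
    ("MKTAYIAK", "MKTAYIAK"),  # identical
    ("MKVLLAGT", "MKVLLSGA"),  # 6 of 8 columns identical
    ("MKV-L", "MKVAI"),        # 3 of 4 ungapped columns identical
]
aai = average_identity(gene_pairs)
print(round(aai, 4))
```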
You can also sample genomes based on average nucleotide identity (ANI) or core nucleotide identity (cANI) values. In that case, you need to supply nucleotide sequences of predicted genes, e.g. in a folder ffns:
scarap core ./faas ./core -t 16
scarap sample ./ffns ./core/genes.tsv ./representatives -m 100 -t 16
Building a "supermatrix" for a set of genomes
You can build a concatenated alignment of core genes ("supermatrix") for a set of genomes using the concat module.
Let's say you want to build a supermatrix of 100 core genes for a set of genomes, with faa files given in a folder faas:
scarap core ./faas ./core -m 100 -t 16
scarap concat ./faas ./core/genes.tsv ./supermatrix -t 16
The amino acid supermatrix will be saved in supermatrix/supermatrix_aas.fasta.
If you want to produce a nucleotide-level supermatrix, this can be achieved by giving a folder with ffn files (nucleotide sequences of predicted genes) as an additional argument:
scarap concat ./faas ./core/genes.tsv ./supermatrix -n ./ffns -t 16
The nucleotide-level supermatrix will be saved in supermatrix/supermatrix_nucs.fasta.
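Before passing a supermatrix to a phylogeny tool, it can be worth checking that all rows have the same length and seeing how gappy each genome is. The sketch below uses a small hand-written fasta parser; it assumes "-" is the gap character and one record per genome, and the function names and toy alignment are made up.

```python
import io

def read_fasta(handle):
    """Read a fasta file into {record name: sequence}, joining wrapped lines."""
    records, name, chunks = {}, None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                records[name] = "".join(chunks)
            name, chunks = line[1:].split()[0], []
        elif line:
            chunks.append(line)
    if name is not None:
        records[name] = "".join(chunks)
    return records

def gap_fractions(records, gap="-"):
    """Return the fraction of gap characters per record; all rows must be equally long."""
    lengths = {len(seq) for seq in records.values()}
    if len(lengths) != 1:
        raise ValueError(f"rows differ in length: {sorted(lengths)}")
    return {name: seq.count(gap) / len(seq) for name, seq in records.items()}

# Toy alignment standing in for supermatrix/supermatrix_aas.fasta.
toy_supermatrix = io.StringIO(">genomeA\nMKT-\nAYIA\n>genomeB\nMKTA\n--IA\n")
records = read_fasta(toy_supermatrix)
fractions = gap_fractions(records)
print(fractions)
```

Genomes with a very high gap fraction are often missing many core genes and may be candidates for removal before tree inference.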
Modules
SCARAP is able to perform a number of specific tasks related to prokaryotic comparative genomics (see also scarap -h).
The most useful modules of SCARAP are probably the following:
- pan: infer a pangenome from a set of faa files
- core: infer a core genome from a set of faa files
- sample: sample a subset of representative genomes
Modules for other useful tasks are also available:
- build: build a profile database for a core/pangenome
- search: search query genes in a profile database
- checkgenomes: assess the genomes in a core genome
- checkgroups: assess the orthogroups in a core genome
- filter: filter the genomes/orthogroups in a pangenome
- concat: construct a concatenated core orthogroup alignment from a core genome
- fetch: fetch sequences and store in fasta per orthogroup
License
SCARAP is free software, licensed under GPLv3.
Feedback
All feedback and suggestions are very welcome at stijn.wittouck[at]uantwerpen.be. You are of course also welcome to file issues.
Citation
A manuscript describing SCARAP and its validation has been prepared and will (hopefully) be published shortly.