A toolkit for prokaryotic comparative genomics

SCARAP: pangenome inference and comparative genomics of prokaryotes

SCARAP is a toolkit with modules for various comparative genomics tasks in prokaryotes, designed to be fast and scalable. Its main feature is pangenome inference, but it also has modules for direct core genome inference (without inferring the full pangenome), for subsampling representatives from a (large) set of genomes, and for constructing a concatenated core gene alignment ("supermatrix") that can later be used for phylogeny inference. SCARAP has been designed for prokaryotes but should work for eukaryotic genomes as well. It can handle large genome datasets on a range of taxonomic levels; it has been tested on datasets with prokaryotic genomes from the species to the order level.

Installation

You can install SCARAP through conda:

git clone https://github.com/swittouck/scarap.git
cd scarap
conda env create -f environment.yml 

You can then run SCARAP as follows:

conda activate scarap
scarap -h
conda deactivate

You can also install SCARAP manually by cloning it and installing the following dependencies:

Quick start

Obtaining data

SCARAP works mainly with faa files: amino acid sequences of all (predicted) genes in a genome assembly. You can obtain faa files in at least three ways:

  • You can run a gene prediction tool like Prodigal on genome assemblies of your favorite strains, or a complete annotation pipeline such as Prokka or Bakta.
  • You can search your favorite taxon on NCBI genome and manually download assemblies in the following way: click on an assembly, click "Download", select "Protein (FASTA)" as file type and click "Download" again.
  • Given a list of assembly accession numbers (i.e. starting with GCA/GCF), you can use ncbi-genome-download to download the corresponding faa files.

Given a list of accessions in a file called accessions.txt, you can use ncbi-genome-download to download faa files as follows:

  ncbi-genome-download -P \
    --assembly-accessions accessions.txt \
    --section genbank \
    --formats protein-fasta \
    bacteria
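ncbi-genome-download writes each assembly into its own subfolder as a gzip-compressed file, while the commands in this README expect all faa files in a single folder. The sketch below gathers and decompresses them. The function name collect_faas is hypothetical, and the *_protein.faa.gz file name pattern is an assumption based on how NCBI names protein FASTA files; adjust it if your downloads look different.

```shell
# Hypothetical helper (not part of SCARAP or ncbi-genome-download):
# gather downloaded protein FASTA files into one folder.
# Usage: collect_faas <download_dir> <faa_dir>
collect_faas() {
  src=$1
  dest=$2
  mkdir -p "$dest"
  # Assumption: downloaded protein files end in _protein.faa.gz
  find "$src" -type f -name '*_protein.faa.gz' | while IFS= read -r f; do
    # Decompress into <faa_dir>, dropping the .gz suffix and
    # leaving the original download untouched
    gunzip -c "$f" > "$dest/$(basename "$f" .gz)"
  done
}
```

With the download command above run in the current directory, this could be called as collect_faas ./genbank ./faas.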

Inferring a pangenome

If you want to infer the pangenome of a set of genomes, you only need their faa files (fasta files with protein sequences) as input. If the faa files are stored in a folder faas, you can infer the pangenome using 16 threads by running:

  scarap pan ./faas ./pan -t 16

The pangenome will be stored in pan/pangenome.tsv.

The pangenome is stored in a "long format": a table with the columns gene, genome and orthogroup.
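Because every row of the long format is one gene, simple summaries can be computed with standard command-line tools. The sketch below counts in how many genomes each orthogroup occurs. It assumes a tab-separated file without a header line and with the columns in the order given above (gene, genome, orthogroup); the function name orthogroup_occupancy is hypothetical. If your file does have a header line, remove it first.

```shell
# Hypothetical helper: number of distinct genomes per orthogroup.
# Assumes tab-separated input without header; columns: gene, genome, orthogroup.
# Usage: orthogroup_occupancy pan/pangenome.tsv
orthogroup_occupancy() {
  awk -F'\t' '
    # Count each (genome, orthogroup) pair only once, so that
    # multiple copies in one genome do not inflate the count
    !seen[$2 FS $3]++ { n[$3]++ }
    END { for (og in n) print og "\t" n[og] }
  ' "$1" | sort -t "$(printf '\t')" -k2,2nr -k1,1
}
```

The output has one line per orthogroup (name, tab, number of genomes), with the most widespread orthogroups first.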

Inferring a core genome

If you want to infer the core genome of a set of genomes directly, without inferring the full pangenome first, you can also do this with SCARAP. You might want to do this because it is faster and because you sometimes need only the core genome (e.g. when you plan to infer a phylogeny).

You can infer the core genome, given a set of faa files in a folder faas, in the following way:

  scarap core ./faas ./core -t 16

The core genome will be stored in core/genes.tsv.
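For intuition about what a core genome contains, the sketch below derives a strict single-copy core from a pangenome table instead: the orthogroups that occur exactly once in every genome. This is only an illustration of the concept and not how the core module works, since that module avoids inferring the full pangenome. It assumes a tab-separated pangenome file without a header line and with the columns gene, genome, orthogroup; the function name single_copy_core is hypothetical.

```shell
# Hypothetical helper: orthogroups present exactly once in every genome.
# Assumes tab-separated input without header; columns: gene, genome, orthogroup.
# Usage: single_copy_core pan/pangenome.tsv
single_copy_core() {
  awk -F'\t' '
    # Record all genomes and orthogroups, and the copy number
    # of each orthogroup in each genome
    { genomes[$2] = 1; ogs[$3] = 1; copies[$3 FS $2]++ }
    END {
      ngenomes = 0
      for (g in genomes) ngenomes++
      for (og in ogs) {
        single = 0
        for (g in genomes) if (copies[og FS g] == 1) single++
        # Keep the orthogroup only if it is single-copy in all genomes
        if (single == ngenomes) print og
      }
    }
  ' "$1" | sort
}
```

A strict 100% criterion like this one is sensitive to incomplete assemblies; it is meant as a quick sanity check rather than a replacement for a dedicated core genome inference.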

Subsampling a set of genomes

If you have a (large) dataset of genomes that you wish to subsample in a representative way, you can do this using the sample module. You will need to precompute the pangenome or core genome to do this; SCARAP calculates average amino acid identity (AAI) or core amino acid identity (cAAI) values in the subsampling process, and it uses the single-copy orthogroups from a pan- or core genome to do this.

For example, if you want to sample 100 genomes given a set of faa files in a folder faas:

  scarap core ./faas ./core -t 16
  scarap sample ./faas ./core/genes.tsv ./representatives -m 100 -t 16

The representative genomes will be stored in representatives/seeds.txt.

Important remark: by default, the per-gene amino acid identity values are estimated by MMseqs from the alignment score per column (alignment mode 1). For AAI values > 90%, these estimates are on average lower than the exact values. You can calculate exact AAI values by adding the --exact option to the sample module, but this will be slower.

You can also sample genomes based on average nucleotide identity (ANI) or core nucleotide identity (cANI) values. In that case, you need to supply nucleotide sequences of predicted genes, e.g. in a folder ffns:

  scarap core ./faas ./core -t 16
  scarap sample ./ffns ./core/genes.tsv ./representatives -m 100 -t 16

Building a "supermatrix" for a set of genomes

You can build a concatenated alignment of core genes ("supermatrix") for a set of genomes using the concat module.

Let's say you want to build a supermatrix of 100 core genes for a set of genomes, with faa files given in a folder faas:

  scarap core ./faas ./core -m 100 -t 16
  scarap concat ./faas ./core/genes.tsv ./supermatrix -t 16

The amino acid supermatrix will be saved in supermatrix/supermatrix_aas.fasta.

If you want to produce a nucleotide-level supermatrix, this can be achieved by giving a folder with ffn files (nucleotide sequences of predicted genes) as an additional argument:

  scarap concat ./faas ./core/genes.tsv ./supermatrix -n ./ffns -t 16

The nucleotide-level supermatrix will be saved in supermatrix/supermatrix_nucs.fasta.
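Since a supermatrix is an alignment, every genome should have a sequence of the same length in it. The sketch below is a quick sanity check you could run before phylogeny inference. It assumes a standard FASTA file (sequences may be wrapped over several lines); the function names fasta_lengths and is_aligned are hypothetical.

```shell
# Hypothetical helper: print name and length of every sequence in a FASTA file.
# Usage: fasta_lengths supermatrix/supermatrix_aas.fasta
fasta_lengths() {
  awk '
    # New header: print the previous record, then start a new one
    /^>/ { if (name != "") print name "\t" len; name = substr($1, 2); len = 0; next }
    # Sequence line: add its length (gap characters count as columns)
    { len += length($0) }
    END { if (name != "") print name "\t" len }
  ' "$1"
}

# Hypothetical helper: succeeds (exit status 0) only if the file contains
# at least one sequence and all sequences have the same length.
is_aligned() {
  fasta_lengths "$1" | awk -F'\t' '
    NR == 1 { first = $2 }
    $2 != first { bad = 1 }
    END { exit (NR == 0 || bad) }
  '
}
```

For example, is_aligned supermatrix/supermatrix_aas.fasta && echo "all sequences have equal length" prints the message only when the check passes.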

Modules

SCARAP is able to perform a number of specific tasks related to prokaryotic comparative genomics (see also scarap -h).

The most useful modules of SCARAP are probably the following:

  • pan: infer a pangenome from a set of faa files
  • core: infer a core genome from a set of faa files
  • sample: sample a subset of representative genomes

Modules for other useful tasks are also available:

  • build: build a profile database for a core/pangenome
  • search: search query genes in a profile database
  • checkgenomes: assess the genomes in a core genome
  • checkgroups: assess the orthogroups in a core genome
  • filter: filter the genomes/orthogroups in a pangenome
  • concat: construct a concatenated core orthogroup alignment from a core genome
  • fetch: fetch sequences and store in fasta per orthogroup

License

SCARAP is free software, licensed under GPLv3.

Feedback

All feedback and suggestions are very welcome at stijn.wittouck[at]uantwerpen.be. You are of course also welcome to file issues.

Citation

A manuscript describing SCARAP and its validation has been prepared and will (hopefully) be published shortly.
