A Centrifuge based plasmid prediction tool

PlasmidCC

PlasmidCC is a plasmid classification tool that uses Centrifuge to predict the origin of contigs (plasmid or chromosome).

PlasmidCC is a generalization of PlasmidEC, which uses multiple classification tools to classify plasmids in E. coli isolates.

Installation
Usage
Output files
- Compatibility with gplas
- Intermediary files
Contributions
Citation

Installation

An installation of Centrifuge is required to run plasmidCC.

We recommend using a conda environment with the centrifuge-core package installed:

conda create --name plasmidCC -c conda-forge -c bioconda centrifuge-core=1.0.4.1 pip
conda activate plasmidCC

Install plasmidCC using pip:

pip install plasmidCC

Verify installation:

plasmidCC --help

Usage

Test run example

plasmidCC -i test/test_ecoli.gfa -o test -n testEcoli -s Escherichia_coli -D

This uses the 'test_ecoli.gfa' file as input (-i) and stores the output in the 'test' directory (-o) under a new subdirectory named 'testEcoli' (-n). plasmidCC looks for the embedded E. coli database (-s) and, if it is not found, tries to download it (-D).

Input

As input, plasmidCC takes assembled contigs in .fasta format or an assembly graph in .gfa format. Such files can be produced with the Unicycler or SPAdes genome assemblers.

Quick usage

Out of the box, plasmidCC can predict plasmid contigs for species that have an embedded database. Use the --speciesopts flag to see a list of supported species:

plasmidCC --speciesopts

General (warning: general database requires >47GB of available RAM)
Escherichia_coli
Enterococcus_faecium
Enterococcus_faecalis
Salmonella_enterica
Staphylococcus_aureus
Acinetobacter_baumannii
Klebsiella_pneumoniae

You can specify which species database to use with the -s flag. For example:

plasmidCC -i test/K_pneumoniae_test.fasta -s Klebsiella_pneumoniae

Other species

plasmidCC can also be used for other species, but this requires constructing a custom Centrifuge database for the desired species. Instructions on how to do this can be found here. Once the database is constructed, supply its location and name to plasmidCC with the -p flag:

plasmidCC -i test/P_aeruginosa_test.fasta -p databases/my_custom_db

All options

plasmidCC --help

usage: plasmidCC -i INPUT [-o OUTPUT] [-n NAME] (-s SPECIES | -p CUSTOM_DB_PATH) [-l LENGTH]
                 [-t THREADS] [-P PLASMID_CUTOFF] [-C CHROMOSOME_CUTOFF] [-D] [-g] [-f] [-k]
                 [--speciesopts] [-v] [-h]

PlasmidCC: a Centrifuge based plasmid prediction tool

General:
  -i INPUT              input file (.fasta or .gfa)
  -o OUTPUT             Output directory
  -n NAME               Name prefix for output files (default: input file name)
  -s SPECIES            Select an embedded species database. Use --speciesopts for a list of all
                        supported species
  -p CUSTOM_DB_PATH     Path to a custom Centrifuge database (name without file extensions)

Parameters:
  -l LENGTH             Minimum sequence length filter (default: 1000)
  -t THREADS            Number of alignment threads to launch (default: 8)
  -P PLASMID_CUTOFF     Threshold of plasmid fraction to predict contig as plasmid (default: 0.7)
  -C CHROMOSOME_CUTOFF  Threshold of plasmid fraction to predict contig as chromosome (default: 0.3)

Other:
  -D, --download        Download embedded database if not yet downloaded
  -g, --gplas           Write an extra output file that is compatible for use with gplas
  -f, --force           Overwrite existing output if the same name is already used
  -k, --keep            Keep intermediary files

Info:
  --speciesopts         Prints a list of all supported species for the -s flag
  -v, --version         Prints plasmidCC version
  -h, --help            Prints this message
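
The usage line above lists -s and -p as alternatives, one of which must be given. As a minimal sketch of scripting runs from Python, the helper below assembles an argument list using only flags documented in the help text and rejects calls that supply both or neither database option. The name build_plasmidcc_command and its parameters are hypothetical and not part of plasmidCC.

```python
from typing import List, Optional


def build_plasmidcc_command(
    input_path: str,
    species: Optional[str] = None,
    custom_db: Optional[str] = None,
    output_dir: Optional[str] = None,
    name: Optional[str] = None,
    gplas: bool = False,
) -> List[str]:
    """Assemble a plasmidCC argument list (hypothetical helper, not part of plasmidCC)."""
    # The usage line shows (-s SPECIES | -p CUSTOM_DB_PATH): exactly one is expected.
    if (species is None) == (custom_db is None):
        raise ValueError("provide exactly one of species (-s) or custom_db (-p)")

    cmd = ["plasmidCC", "-i", input_path]
    if output_dir is not None:
        cmd += ["-o", output_dir]
    if name is not None:
        cmd += ["-n", name]
    if species is not None:
        cmd += ["-s", species]
    else:
        cmd += ["-p", str(custom_db)]
    if gplas:
        cmd.append("-g")
    return cmd


if __name__ == "__main__":
    # Only prints the command; nothing is executed here.
    print(" ".join(build_plasmidcc_command(
        "test/test_ecoli.gfa",
        species="Escherichia_coli",
        output_dir="test",
        name="testEcoli",
        gplas=True,
    )))
```

The returned list can be passed to subprocess.run(cmd, check=True) in an environment where plasmidCC is installed.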

Output files

plasmids.fasta

Sequences of all contigs predicted to originate from plasmids in FASTA format.

grep '>' test/testEcoli/testEcoli_plasmids.fasta

>S20_LN:i:91233_dp:f:0.5815421095375989
>S32_LN:i:42460_dp:f:0.6016122804021161
>S44_LN:i:21171_dp:f:0.5924640018897323
>S47_LN:i:17888_dp:f:0.5893320957724726
>S50_LN:i:11225_dp:f:0.6758514700227541
>S56_LN:i:6837_dp:f:0.5759570101860518
>S59_LN:i:5519_dp:f:0.5544497698217399
>S67_LN:i:2826_dp:f:0.6746421335091037
>S76_LN:i:1486_dp:f:1.3509551203209675

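The contig names in the example above appear to carry the contig length (LN:i:) and sequencing depth (dp:f:) from the assembly graph. As a rough sketch, assuming every name follows the pattern name_LN:i:length_dp:f:depth (contigs from a plain .fasta input may be named differently), the hypothetical parse_header helper below splits a header into its parts.

```python
import re
from typing import NamedTuple, Optional

# ASSUMPTION: names look like "S20_LN:i:91233_dp:f:0.58...", as in the example output.
HEADER_RE = re.compile(
    r"^>?(?P<name>\S+?)_LN:i:(?P<length>\d+)_dp:f:(?P<depth>\d+(?:\.\d+)?)$"
)


class ContigInfo(NamedTuple):
    name: str
    length: int
    depth: float


def parse_header(header: str) -> Optional[ContigInfo]:
    """Return name, length and depth from a header, or None if it does not match."""
    match = HEADER_RE.match(header.strip())
    if match is None:
        return None
    return ContigInfo(
        match.group("name"),
        int(match.group("length")),
        float(match.group("depth")),
    )


if __name__ == "__main__":
    print(parse_header(">S76_LN:i:1486_dp:f:1.3509551203209675"))
```

Headers that do not match the assumed pattern return None rather than raising, so mixed inputs can be handled by the caller.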
centrifuge_classified.txt

Table containing the predictions made by Centrifuge, the total number of matches, and the final classification for each contig.

head -n 5 test/testEcoli/testEcoli_centrifuge_classified.txt

readID	chromosome	plasmid	total_matches	chromosome_fraction	plasmid_fraction	final_classification
S80_LN:i:1427_dp:f:0.9617101819819399	2.0	0.0	2.0	1.0	0.0	chromosome
S81_LN:i:1343_dp:f:4.494970368199747	117.0	40.0	157.0	0.75	0.25	chromosome
S82_LN:i:1253_dp:f:1.182459332915489	1.0	0.0	1.0	1.0	0.0	chromosome
S83_LN:i:1242_dp:f:0.9224653122847608	1.0	0.0	1.0	1.0	0.0	chromosome
S84_LN:i:1063_dp:f:3.2697611578099566	118.0	33.0	151.0	0.78	0.22	chromosome
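
To illustrate how the columns relate, the sketch below recomputes the plasmid fraction from the match counts and applies the default cutoffs from the help text (-P 0.7, -C 0.3). It assumes the file is tab-separated. The inclusive comparisons and the "unclassified" label for fractions between the two cutoffs are assumptions made for illustration; the actual behaviour of plasmidCC at and between the cutoffs is not described here.

```python
import csv
import io
from typing import Iterable, List, Tuple

# Two rows copied from the example table above (assumed tab-separated).
SAMPLE = (
    "readID\tchromosome\tplasmid\ttotal_matches\tchromosome_fraction\t"
    "plasmid_fraction\tfinal_classification\n"
    "S80_LN:i:1427_dp:f:0.9617101819819399\t2.0\t0.0\t2.0\t1.0\t0.0\tchromosome\n"
    "S81_LN:i:1343_dp:f:4.494970368199747\t117.0\t40.0\t157.0\t0.75\t0.25\tchromosome\n"
)


def classify(
    plasmid_fraction: float,
    plasmid_cutoff: float = 0.7,
    chromosome_cutoff: float = 0.3,
) -> str:
    """Label a contig from its plasmid fraction (illustrative, not plasmidCC's code)."""
    # ASSUMPTION: comparisons are inclusive and the middle band is "unclassified".
    if plasmid_fraction >= plasmid_cutoff:
        return "plasmid"
    if plasmid_fraction <= chromosome_cutoff:
        return "chromosome"
    return "unclassified"


def recheck(lines: Iterable[str]) -> List[Tuple[str, float, str]]:
    """Recompute plasmid fraction (rounded to 2 decimals) and a label per contig."""
    results = []
    for row in csv.DictReader(lines, delimiter="\t"):
        total = float(row["total_matches"])
        fraction = float(row["plasmid"]) / total if total else 0.0
        results.append((row["readID"], round(fraction, 2), classify(fraction)))
    return results


if __name__ == "__main__":
    for read_id, fraction, label in recheck(io.StringIO(SAMPLE)):
        print(read_id, fraction, label)
```

For the second sample row, 40 plasmid matches out of 157 gives a fraction of about 0.25, which agrees with the plasmid_fraction column in the table.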

Compatibility with gplas

gplas is a tool that accurately bins predicted plasmid contigs into individual plasmids.

By using the -g flag, plasmidCC provides an extra output file that can be directly used as input for gplas. See an example below:

plasmidCC -i test/test_ecoli.gfa -o test -n testEcoli -s Escherichia_coli -g

head -n 10 test/testEcoli/testEcoli_gplas.tab

Prob_Chromosome	Prediction	Contig_name	Contig_length
1.0	Chromosome	S10_LN:i:198295_dp:f:0.8919341045340952	198295
1.0	Chromosome	S11_LN:i:173581_dp:f:0.8682632509656248	173581
1.0	Chromosome	S12_LN:i:169985_dp:f:1.0893451820087325	169985
1.0	Chromosome	S13_LN:i:169238_dp:f:1.1143772255735436	169238
1.0	Chromosome	S14_LN:i:135734_dp:f:0.8900147755192753	135734
1.0	Chromosome	S15_LN:i:114916_dp:f:0.8135597349289454	114916
1.0	Chromosome	S16_LN:i:112152_dp:f:0.9565731810452665	112152
1.0	Chromosome	S17_LN:i:107357_dp:f:1.0935311833495955	107357
1.0	Chromosome	S18_LN:i:105440_dp:f:0.9191174721979478	105440
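
Before passing this file to gplas, it can be useful to check how much sequence falls into each prediction class. The sketch below sums Contig_length per value of the Prediction column, assuming the tab-separated layout and column names shown above; length_by_prediction is a hypothetical helper, not part of plasmidCC or gplas.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, Iterable

# Two rows copied from the example output above (assumed tab-separated).
SAMPLE = (
    "Prob_Chromosome\tPrediction\tContig_name\tContig_length\n"
    "1.0\tChromosome\tS10_LN:i:198295_dp:f:0.8919341045340952\t198295\n"
    "1.0\tChromosome\tS11_LN:i:173581_dp:f:0.8682632509656248\t173581\n"
)


def length_by_prediction(lines: Iterable[str]) -> Dict[str, int]:
    """Sum contig lengths (bp) for each value in the Prediction column."""
    totals: Dict[str, int] = defaultdict(int)
    for row in csv.DictReader(lines, delimiter="\t"):
        totals[row["Prediction"]] += int(row["Contig_length"])
    return dict(totals)


if __name__ == "__main__":
    print(length_by_prediction(io.StringIO(SAMPLE)))
```

To run it on a real file, pass an open file handle, for example open("test/testEcoli/testEcoli_gplas.tab", newline="").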

Intermediary files

By default, intermediary files are deleted at the end of a run. Use the -k flag to keep them.

centrifuge_results.txt

Raw Centrifuge classification output that is used by plasmidCC to produce the 'centrifuge_classified.txt' output.

summary.txt

Centrifuge report file summarizing details per classification group.

filtered.fasta

Input sequences (-i) filtered for minimum sequence length (-l). This file is used when running Centrifuge.
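
As an illustration of what a minimum-length filter does, the sketch below reads FASTA records and keeps those at or above a threshold (default 1000, matching the -l default). This is not plasmidCC's own implementation: the function names are hypothetical, and whether plasmidCC keeps sequences of exactly the minimum length is an assumption.

```python
from typing import Iterable, Iterator, List, Optional, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header: Optional[str] = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(
    records: Iterable[Tuple[str, str]], min_length: int = 1000
) -> Iterator[Tuple[str, str]]:
    """Keep records whose sequence is at least min_length long."""
    # ASSUMPTION: the comparison is inclusive (>=).
    for header, sequence in records:
        if len(sequence) >= min_length:
            yield header, sequence


if __name__ == "__main__":
    demo = [">short", "ACGT", ">long", "ACGT" * 300]
    for header, sequence in filter_by_length(read_fasta(demo)):
        print(header, len(sequence))
```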

Contributions

PlasmidCC has been developed with contributions from Lisa Vader, Malbert Rogers, Julian Paganini, Jesse Kerkvliet, Anita Schürch and Oscar Jordan.

Citation

If you use plasmidCC, please cite:

(Citation follows)

Release history

1.0.1 (Apr 2, 2024)
1.0.0 (Mar 25, 2024, this version)
