Binning plasmid-predicted contigs using short-read graphs


gplasCC: binning plasmid-predicted contigs

GplasCC is a tool to bin plasmid-predicted contigs based on sequence composition, coverage and assembly graph information. It is a new version of gplas that accepts the predictions of any binary plasmid classifier. It also extends the ability to bin predicted plasmid contigs accurately into discrete plasmid components by attempting to place unbinned and repeat contigs into plasmid bins.

Table of Contents
Installation
- Requirements
- Installation using pip
Usage
- Using gplasCC with plasmidCC
- Using gplasCC with an external classification tool
Output files
- Intermediary results files
Complete usage
Issues and Bugs
Contributions
Citation

Installation

Requirements

An installation of Centrifuge is required if you are using plasmidCC as the binary classifier (default). We recommend using a conda environment with the centrifuge-core package installed.

conda create --name gplasCC -c conda-forge -c bioconda centrifuge-core=1.0.4.1 pip
conda activate gplasCC

If you prefer to use a different binary classifier, you can use gplasCC without installing Centrifuge.

Installation using pip

The preferred way of installing gplasCC is through pip:

pip install gplas

When this has finished, test the installation using

gplas --help

This should show the help page of gplasCC.
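If you script your analyses, a quick pre-flight check can confirm that the required executables are on the PATH before starting a batch run. The sketch below is not part of gplasCC: `missing_tools` is a hypothetical helper, and it assumes that the Centrifuge executable is named `centrifuge`.

```python
import shutil


def missing_tools(tools=("gplas", "centrifuge")):
    """Return the names in `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("gplas and centrifuge are available.")
```

If you use an external classifier instead of plasmidCC, only `gplas` needs to be checked, for example with `missing_tools(("gplas",))`.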

Usage

Using gplasCC with plasmidCC

GplasCC comes built in with plasmidCC as a binary classifier. When using plasmidCC, gplasCC only requires one input file:

An assembly graph in .gfa format. Such an assembly graph can be obtained by assembling quality trimmed reads using Unicycler (preferred) or with SPAdes genome assembler.

Provide the path to your assembly graph with the -i flag, and select which plasmidCC database to use with the -s flag. Optionally, provide a custom name for your output with the -n flag. See example below:

gplas -i test_ecoli.gfa -s Escherichia_coli -n my_isolate

For an overview of plasmidCC supported species, use the --speciesopts flag:

gplas --speciesopts

Using gplasCC with an external classification tool

If you wish to use a different binary classifier, it is possible to provide your own external plasmid prediction file. We've listed and reviewed several other classifier tools here. Although they are all compatible with gplasCC, extra preprocessing steps are required:

Use gplasCC to convert the nodes from the assembly graph to FASTA format (most binary classifiers only accept FASTA files as input). To do this, provide your assembly graph (.gfa) and include the --extract flag.

gplas -i test_ecoli.gfa --extract -n my_isolate

The output FASTA file will be located at gplas_input/my_isolate_contigs.fasta. By default, this file only contains contigs larger than 1000 bp; this cutoff can be changed with the -l flag.

Use this FASTA file as an input for the binary classification tool of your choice.
Format the output file:

The output from the binary classification tool has to be formatted as a tab-separated file containing specific columns and headers (case-sensitive). See an example below:

head -n 4 test_ecoli_plasmid_prediction.tab

Prob_Chromosome	Prob_Plasmid	Prediction	Contig_name	Contig_length
1.0	0.0	Chromosome	S1_LN:i:374865_dp:f:1.0749885035087077	374865
1.0	0.0	Chromosome	S10_LN:i:198295_dp:f:0.8919341045340952	198295
0.0	1.0	Plasmid	S20_LN:i:91233_dp:f:0.5815421095375989	91233

For proper compatibility with gplasCC, please make sure your prediction file is tab-separated and uses the correct (case-sensitive) column names and prediction labels (Plasmid/Chromosome).
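The formatting step can be sketched in Python as follows. This is a minimal illustration, not part of gplasCC. It assumes that your classifier reports one plasmid probability per contig, that the chromosome probability is its complement, and that a cutoff of 0.5 decides the label. The helper name `write_prediction_table` and the tie rule (`>=`) are hypothetical choices.

```python
import csv
import io

# Column names exactly as in the example prediction file (case-sensitive).
HEADER = ["Prob_Chromosome", "Prob_Plasmid", "Prediction", "Contig_name", "Contig_length"]


def write_prediction_table(records, handle, threshold=0.5):
    """Write (contig_name, contig_length, prob_plasmid) records as a tab-separated table."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(HEADER)
    for name, length, prob_plasmid in records:
        # Assumption: the two probabilities sum to one.
        prob_chromosome = round(1.0 - prob_plasmid, 6)
        label = "Plasmid" if prob_plasmid >= threshold else "Chromosome"
        writer.writerow([prob_chromosome, prob_plasmid, label, name, length])


# Two contigs taken from the example above.
records = [
    ("S1_LN:i:374865_dp:f:1.0749885035087077", 374865, 0.0),
    ("S20_LN:i:91233_dp:f:0.5815421095375989", 91233, 1.0),
]
buffer = io.StringIO()
write_prediction_table(records, buffer)
print(buffer.getvalue(), end="")
```

To write a file instead of printing, pass an open file handle, for example `open("test_ecoli_plasmid_prediction.tab", "w", newline="")`, in place of the `StringIO` buffer.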

Once you've formatted the prediction file as above, move to Predict plasmids.

Predict plasmids

After pre-processing, we are now ready to predict individual plasmids.

Provide the paths to your assembly graph, using the -i flag, and to your binary classification file, with the -P flag. Optionally, provide a custom name for your output with the -n flag. See example below:

gplas -i test_ecoli.gfa -P test_ecoli_plasmid_prediction.tab -n my_isolate
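When processing many isolates, the two invocations shown so far can be assembled programmatically. The sketch below is a hypothetical wrapper, not an API provided by gplas. It only uses the -i, -s, -P and -n flags described above and leaves out the other modes (-p and --extract).

```python
import subprocess


def build_gplas_command(graph, species=None, prediction=None, name=None):
    """Build the argument list for a gplas run with either -s or -P."""
    if (species is None) == (prediction is None):
        raise ValueError("Provide exactly one of species (-s) or prediction (-P).")
    command = ["gplas", "-i", graph]
    if species is not None:
        command += ["-s", species]      # plasmidCC species database
    else:
        command += ["-P", prediction]   # external prediction file
    if name is not None:
        command += ["-n", name]         # optional output name prefix
    return command


def run_gplas(graph, **kwargs):
    """Run gplas; subprocess raises CalledProcessError on a non-zero exit status."""
    return subprocess.run(build_gplas_command(graph, **kwargs), check=True)
```

For example, `run_gplas("test_ecoli.gfa", prediction="test_ecoli_plasmid_prediction.tab", name="my_isolate")` issues the same command as the example above.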

Output files

GplasCC will create a folder called ‘results’ with the following files:

ls results/my_isolate*

## results/my_isolate_bin_0.fasta
## results/my_isolate_bin_1.fasta
## results/my_isolate_bin_2.fasta
## results/my_isolate_bins.tab
## results/my_isolate_chromosome_repeats.tab
## results/my_isolate_plasmidome_network.png
## results/my_isolate_results.tab

results/*.fasta

FASTA files with the contigs belonging to each predicted plasmid bin.

grep '>' results/my_isolate*.fasta

>S20_LN:i:91233_dp:f:0.5815421095375989
>S1_LN:i:374865_dp:f:1.0749885035087077
>S32_LN:i:42460_dp:f:0.6016122804021161
>S44_LN:i:21171_dp:f:0.5924640018897323
>S47_LN:i:17888_dp:f:0.5893320957724726
>S48_LN:i:11703_dp:f:1.1884320594277211
>S50_LN:i:11225_dp:f:0.6758514700227541
>S56_LN:i:6837_dp:f:0.5759570101860518
>S59_LN:i:5519_dp:f:0.5544497698217399
>S67_LN:i:2826_dp:f:0.6746421335091037
>S70_LN:i:2125_dp:f:9.215759397832965
>S76_LN:i:1486_dp:f:1.3509551203209675
>S84_LN:i:1063_dp:f:3.2697611578099566

results/*bins.tab

Tab-delimited file containing a short overview of the contigs that were assigned to each plasmid bin.

number	Bin
1	1
20	0
32	1
44	1
47	1
48	1
50	1
56	1
59	1
67	1
70	1
76	1
84	1
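For downstream scripts it is often convenient to group the contig numbers by bin. The following sketch is an illustration only and assumes the two-column layout shown above (`number` and `Bin`); `contigs_per_bin` is a hypothetical helper name.

```python
import csv
import io
from collections import defaultdict


def contigs_per_bin(handle):
    """Map each bin label to the list of contig numbers assigned to it."""
    bins = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        bins[row["Bin"]].append(int(row["number"]))
    return dict(bins)


# First rows of the example bins.tab shown above.
EXAMPLE = "number\tBin\n1\t1\n20\t0\n32\t1\n44\t1\n"
bins = contigs_per_bin(io.StringIO(EXAMPLE))
print(bins)
```

With a real run, replace the `StringIO` object with an open file such as `open("results/my_isolate_bins.tab")`.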

results/*results.tab

Tab-delimited file containing the classification given by plasmidCC (or another binary classification tool) together with the bin prediction from gplasCC. The file contains the following columns: probability of being chromosome-derived, probability of being plasmid-derived, class prediction, contig name, contig number, length, k-mer coverage and assigned bin.

Prob_Chromosome	Prob_Plasmid	Prediction	Contig_name	number	length	coverage	Bin
1.0	0.0	Repeat	S1_LN:i:374865_dp:f:1.0749885035087077	1	374865	1.07	1
1.0	0.0	Repeat	S48_LN:i:11703_dp:f:1.1884320594277211	48	11703	1.19	1
0.5	0.5	Repeat	S70_LN:i:2125_dp:f:9.215759397832965	70	2125	9.22	1
0.0	1.0	Repeat	S76_LN:i:1486_dp:f:1.3509551203209675	76	1486	1.35	1
0.78	0.22	Repeat	S84_LN:i:1063_dp:f:3.2697611578099566	84	1063	3.27	1
0.0	1.0	Plasmid	S20_LN:i:91233_dp:f:0.5815421095375989	20	91233	0.58	0
0.0	1.0	Plasmid	S32_LN:i:42460_dp:f:0.6016122804021161	32	42460	0.6	1
0.0	1.0	Plasmid	S44_LN:i:21171_dp:f:0.5924640018897323	44	21171	0.59	1
0.0	1.0	Plasmid	S47_LN:i:17888_dp:f:0.5893320957724726	47	17888	0.59	1
0.0	1.0	Plasmid	S50_LN:i:11225_dp:f:0.6758514700227541	50	11225	0.68	1
0.0	1.0	Plasmid	S56_LN:i:6837_dp:f:0.5759570101860518	56	6837	0.58	1
0.0	1.0	Plasmid	S59_LN:i:5519_dp:f:0.5544497698217399	59	5519	0.55	1
0.0	1.0	Plasmid	S67_LN:i:2826_dp:f:0.6746421335091037	67	2826	0.67	1
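In the example, the number, length and coverage columns mirror the fields embedded in each contig name (`S<number>_LN:i:<length>_dp:f:<depth>`). The sketch below parses such names and checks a results table for consistency. The naming pattern is inferred from the example output rather than from a documented guarantee, and both helper names are hypothetical.

```python
import csv
import io
import re

# Pattern inferred from names such as S20_LN:i:91233_dp:f:0.5815421095375989
NAME_PATTERN = re.compile(r"^S(\d+)_LN:i:(\d+)_dp:f:([0-9.]+)$")


def parse_contig_name(name):
    """Return (number, length, depth) parsed from a contig name."""
    match = NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"Unexpected contig name: {name!r}")
    number, length, depth = match.groups()
    return int(number), int(length), float(depth)


def inconsistent_rows(handle, tolerance=0.01):
    """Return contig names whose columns disagree with the fields in the name."""
    bad = []
    for row in csv.DictReader(handle, delimiter="\t"):
        number, length, depth = parse_contig_name(row["Contig_name"])
        agrees = (
            number == int(row["number"])
            and length == int(row["length"])
            and abs(depth - float(row["coverage"])) <= tolerance
        )
        if not agrees:
            bad.append(row["Contig_name"])
    return bad


# Two rows copied from the example results.tab above.
EXAMPLE = (
    "Prob_Chromosome\tProb_Plasmid\tPrediction\tContig_name\tnumber\tlength\tcoverage\tBin\n"
    "0.0\t1.0\tPlasmid\tS20_LN:i:91233_dp:f:0.5815421095375989\t20\t91233\t0.58\t0\n"
    "0.0\t1.0\tPlasmid\tS32_LN:i:42460_dp:f:0.6016122804021161\t32\t42460\t0.6\t1\n"
)
print(inconsistent_rows(io.StringIO(EXAMPLE)))
```

An empty list means every row agrees with its contig name. Names produced by other assemblers may not follow this pattern, in which case `parse_contig_name` raises a `ValueError`.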

results/*chromosome_repeats.tab

Tab-delimited file showing which contigs were assigned as chromosomal repeats.

number	Bin
1	Chromosome
48	Chromosome
55	Chromosome
66	Chromosome
68	Chromosome
70	Chromosome
74	Chromosome
79	Chromosome
81	Chromosome
84	Chromosome
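In the example output, contigs 1, 48, 70 and 84 are listed here and also appear in a plasmid bin in the bins.tab file. A short script can list such shared repeat contigs. The sketch below is an illustration that assumes both files use the `number` column shown above; `contig_numbers` is a hypothetical helper.

```python
import csv
import io


def contig_numbers(handle):
    """Return the set of contig numbers listed in a two-column gplasCC table."""
    return {int(row["number"]) for row in csv.DictReader(handle, delimiter="\t")}


# Shortened versions of the example bins.tab and chromosome_repeats.tab files.
BINS = "number\tBin\n1\t1\n20\t0\n32\t1\n48\t1\n"
REPEATS = "number\tBin\n1\tChromosome\n48\tChromosome\n55\tChromosome\n"

shared = sorted(contig_numbers(io.StringIO(BINS)) & contig_numbers(io.StringIO(REPEATS)))
print(shared)
```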

results/*plasmidome_network.png

A visual representation of the plasmidome network generated by gplasCC. The network is created using an undirected graph with edges between plasmid unitigs co-existing in the random walks created by gplasCC.
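The idea of linking unitigs that co-exist in the same walk can be sketched as follows. This is a simplified illustration, not the gplasCC implementation: it counts co-occurrences for every node in a walk, whereas gplasCC restricts the network to plasmid unitigs and applies its own thresholds. Using the counts as edge weights is an assumption made here.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(walks):
    """Count, for each unordered node pair, the number of walks containing both nodes."""
    edges = Counter()
    for walk in walks:
        # Drop the orientation sign and count each node once per walk.
        nodes = sorted({step.rstrip("+-") for step in walk})
        edges.update(combinations(nodes, 2))
    return edges


walks = [
    ["67-", "70-", "50-", "143-"],
    ["67-", "70-", "47+", "117-", "84-", "59+", "70-", "50-", "143-"],
]
edges = cooccurrence_edges(walks)
for (node_a, node_b), weight in sorted(edges.items()):
    print(node_a, node_b, weight)
```

Node identifiers are kept as strings, so each pair is stored in lexicographic order (for example `("143", "50")`).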

Intermediary results files

If the -k flag is selected, gplasCC will also keep all intermediary files needed to construct the plasmid predictions. For example:

walks/normal_mode/*solutions.tab

gplasCC generates plasmid-like walks for each plasmid starting node. These paths are later used to generate the edges of the plasmidome network, but they can also be useful to observe all the different walks starting from a single node (plasmid unitig). These walks can be directly given to Bandage to visualize and manually inspect a walk.

In the example below, we find different possible plasmid walks starting from the node 67-. These paths may contain inversions and rearrangements, since repeat units such as transposases can be present several times within the same plasmid sequence. In these cases, gplasCC can traverse the sequence in different ways, generating different plasmid-like paths.

tail -n 10 walks/normal_mode/my_isolate_solutions.tab

67-,70-,50-,143-
67-,70-,50-,143-
67-,70-,50-,143-
67-,70-,47+,117-,84-,59+,70-,50-,143-
67-,70-,50-,143-
67-,70-,50-,143-
67-,70-,47+,117-,84-,59+,70-,50-,143-
67-,70-,47+,117-,84-,59+,70-,50-,143-
67-,70-,50-,143-
67-,70-,50-,143-

We can use Bandage to inspect the following path on the assembly graph: 67-,70-,47+,117-,84-,59+,70-,50-,143-
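To decide which walks are worth inspecting, it can help to count how often each distinct walk occurs. The sketch below is an illustration that assumes one comma-separated walk per line, as in the output above; `count_walks` is a hypothetical helper.

```python
from collections import Counter


def count_walks(lines):
    """Count how many times each distinct walk (one per line) occurs."""
    return Counter(line.strip() for line in lines if line.strip())


# The ten walks shown in the example output above.
SOLUTIONS = """\
67-,70-,50-,143-
67-,70-,50-,143-
67-,70-,50-,143-
67-,70-,47+,117-,84-,59+,70-,50-,143-
67-,70-,50-,143-
67-,70-,50-,143-
67-,70-,47+,117-,84-,59+,70-,50-,143-
67-,70-,47+,117-,84-,59+,70-,50-,143-
67-,70-,50-,143-
67-,70-,50-,143-
"""

counts = count_walks(SOLUTIONS.splitlines())
for walk, times in counts.most_common():
    print(times, walk)
```

The printed walks keep the same comma-separated format as the solutions file, so they can be copied as they are.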

Complete usage

gplas --help

usage: gplas -i INPUT [-n NAME]
             (-s SPECIES | -p CUSTOM_DB_PATH | -P PREDICTION | --extract)
             [-t THRESHOLD_PREDICTION] [-b BOLD_COVERAGE_SD]
             [-x NUMBER_ITERATIONS] [-f FILT_GPLAS] [-e EDGE_THRESHOLD]
             [-q MODULARITY_THRESHOLD] [-l LENGTH_FILTER] [-k]
             [--speciesopts] [-v] [-h]

gplasCC: A tool for binning plasmid-predicted contigs into individual
predictions

General:
  -i INPUT              Path to the graph file in GFA (.gfa) format, used
                        to extract nodes and links
  -n NAME               Name prefix for output files (default: input file
                        name)
  -s SPECIES            Choose a species database for plasmidCC
                        classification. Use --speciesopts for a list of
                        all supported species
  -p CUSTOM_DB_PATH     Path to a custom Centrifuge database (name without
                        file extensions)
  -P PREDICTION         If not using plasmidCC. Provide a path to an
                        independent binary classification file
  --extract             extract FASTA sequences from the assembly graph to
                        use with an external classifier

Parameters:
  -t THRESHOLD_PREDICTION
                        Prediction threshold for plasmid-derived sequences
                        (default: 0.5)
  -b BOLD_COVERAGE_SD   Coverage variance allowed for bold walks to
                        recover unbinned plasmid-predicted nodes (default:
                        5)
  -x NUMBER_ITERATIONS  Number of walk iterations per starting node
                        (default: 20)
  -f FILT_GPLAS         filtering threshold to reject outgoing edges
                        (default: 0.1)
  -e EDGE_THRESHOLD     Edge threshold (default: 0.1)
  -q MODULARITY_THRESHOLD
                        Modularity threshold to split components in the
                        plasmidome network (default: 0.2)
  -l LENGTH_FILTER      Filtering threshold for sequence length (default:
                        1000)

Other:
  -k, --keep            Keep intermediary files

Info:
  --speciesopts         Prints a list of all supported species for the -s
                        flag
  -v, --version         Prints gplas version
  -h, --help            Prints this message

Issues and Bugs

You can report any issues or bugs that you find while installing/running gplasCC using the issue tracker.

Contributions

GplasCC has been developed with contributions from Oscar Jordan, Julian Paganini, Jesse Kerkvliet, Malbet Rogers, Sergio Arredondo and Anita Schürch.

Citation

A publication is in preparation. If you used an earlier version of gplas in your study, please cite https://doi.org/10.1093/bioinformatics/btaa233 and https://doi.org/10.1099/mgen.0.001193

Project details

Release history

- 1.0.1 (Jun 26, 2025), this version
- 1.0.0 (Oct 15, 2024)
- 0.9.0 (Jun 10, 2024)

