Escherichia coli fast serotyping using both raw reads and assemblies with automatic species identification
ECTyper (an easy typer)
ECTyper is a standalone, versatile serotyping module for Escherichia coli. It supports both fasta (assembled) and fastq (raw reads) file formats.
The tool couples species identification with a quality control module to produce a complete, transparent E. coli serotyping report that is suitable for reference laboratories.
Dependencies:
- python >= 3.5
- bcftools >= 1.8
- blast == 2.7.1
- seqtk >= 1.2
- samtools >= 1.8
- bowtie2 >= 2.3.4.1
- mash >= 2.0
Python packages:
- biopython >= 1.70
- pandas >= 0.23.1
- requests >= 2.0
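Before running the tool, it can help to confirm that the external programs listed above are reachable on the PATH. The following is a minimal sketch using only the Python standard library. The executable names (in particular `blastn` for the blast package) are assumptions about how the dependencies are installed, and the check does not verify version numbers.

```python
import shutil

# Executable names are assumptions; adjust them to match your installation.
REQUIRED_TOOLS = ["bcftools", "blastn", "seqtk", "samtools", "bowtie2", "mash"]


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the names of tools that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = find_missing_tools()
if missing:
    print("Missing executables: " + ", ".join(missing))
else:
    print("All external dependencies were found on the PATH.")
```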
Installation
Option 1: As a conda package
- If you do not have a conda environment, get and install miniconda or anaconda. With the installer saved as miniconda.sh, run:
bash miniconda.sh -b -p $HOME/miniconda
echo ". $HOME/miniconda/etc/profile.d/conda.sh" >> ~/.bashrc
source ~/.bashrc
- Install the conda package from the bioconda channel:
conda install -c bioconda ectyper
Option 2: From the source directly
The second option is to install from source.
- Install dependencies. On Ubuntu, run:
apt install samtools bowtie2 mash bcftools ncbi-blast+ seqtk
- Install Python dependencies via pip:
pip3 install pandas biopython
- Clone the repository and, optionally, check out a particular release (e.g. v1.0.0):
git clone https://github.com/phac-nml/ecoli_serotyping.git
cd ecoli_serotyping
git checkout v1.0.0 # optional: check out a release version
- Install ectyper:
python3 setup.py install
Basic Usage
- Put the fasta/fastq files for serotyping analysis in one folder (concatenate paired raw read files if you would like them to be treated as a single entity), then run:
ectyper -i [file path] -o [output_dir]
- View the results on the console or in the output file:
cat [output_dir]/output.tsv
Example Usage
- For a single file:
ectyper -i ecoliA.fasta
- For a single file, with results stored in output_dir:
ectyper -i ecoliA.fasta -o output_dir
- For multiple files (comma-delimited):
ectyper -i ecoliA.fasta,ecoliB.fastq,ecoliC.fna
- For a folder (all files in the folder will be checked by the tool):
ectyper -i ecoli_folder
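When ectyper is called from a script rather than typed by hand, the comma-delimited input can be assembled programmatically. The sketch below is one possible wrapper, not part of ECTyper itself: the function names are hypothetical, and it only uses the `-i`, `-o` and `-c` options described in this document.

```python
import subprocess


def build_ectyper_command(inputs, output_dir=None, cores=None):
    """Build the argument list for an ectyper call.

    inputs: a list of file or folder paths, joined with commas for -i.
    """
    if not inputs:
        raise ValueError("at least one input path is required")
    command = ["ectyper", "-i", ",".join(inputs)]
    if output_dir is not None:
        command += ["-o", output_dir]
    if cores is not None:
        command += ["-c", str(cores)]
    return command


def run_ectyper(inputs, output_dir=None, cores=None):
    """Run ectyper and raise CalledProcessError if it exits with an error."""
    return subprocess.run(build_ectyper_command(inputs, output_dir, cores), check=True)


# Only builds and prints the command; call run_ectyper(...) to execute it.
print(build_ectyper_command(["ecoliA.fasta", "ecoliB.fastq"], "output_dir"))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.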
Advanced Usage
usage: ectyper [-h] [-V] -i INPUT [-c CORES] [-opid PERCENTIDENTITYOTYPE]
[-hpid PERCENTIDENTITYHTYPE] [-oplen PERCENTLENGTHOTYPE]
[-hplen PERCENTLENGTHHTYPE] [--verify] [-o OUTPUT] [-r REFSEQ] [-s] [--debug]
[--dbpath DBPATH]
ectyper v1.0 database v1.0 Prediction of Escherichia coli serotype from raw reads or assembled
genome sequences. The default settings are recommended.
optional arguments:
-h, --help show this help message and exit
-V, --version show program's version number and exit
-i INPUT, --input INPUT
Location of E. coli genome file(s). Can be a single file, a comma-
separated list of files, or a directory
-c CORES, --cores CORES
The number of cores to run ectyper with
-opid PERCENTIDENTITYOTYPE, --percentIdentityOtype PERCENTIDENTITYOTYPE
Percent identity required for an O antigen allele match [default 90]
-hpid PERCENTIDENTITYHTYPE, --percentIdentityHtype PERCENTIDENTITYHTYPE
Percent identity required for an H antigen allele match [default 95]
-oplen PERCENTLENGTHOTYPE, --percentLengthOtype PERCENTLENGTHOTYPE
Percent length required for an O antigen allele match [default 95]
-hplen PERCENTLENGTHHTYPE, --percentLengthHtype PERCENTLENGTHHTYPE
Percent length required for an H antigen allele match [default 50]
--verify Enable E. coli species verification
-o OUTPUT, --output OUTPUT
Directory location of output files
-r REFSEQ, --refseq REFSEQ
Location of pre-computed MASH RefSeq sketch. If provided, genomes
identified as non-E. coli will have their species identified using MASH.
For best results the pre-sketched RefSeq archive
https://gembox.cbcb.umd.edu/mash/refseq.genomes.k21s1000.msh is
recommended
-s, --sequence Prints the allele sequences if enabled as the final columns of the
output
--debug Print more detailed log including debug messages
--dbpath DBPATH Path to a custom database of O and H antigen alleles in JSON format. Check
Data/ectyper_database.json for more information
Fine-tuning parameters
ECTyper requires only a minimal set of options to run (-i and -o) but allows extensive configuration to accommodate a wide variety of typing scenarios.
Parameter | Explanation | Usage scenario |
---|---|---|
-opid | Minimum %identity threshold for an O antigen match | Poor coverage of O antigen genes or exploratory work (recommended value is 90) |
-opcov | Minimum %coverage threshold for a valid match against reference O antigen alleles | Poor coverage of O antigen genes when the user wants an O antigen call regardless (recommended value is 95) |
-hpid | Minimum %identity threshold for an H antigen match | Poor coverage of H antigen genes or exploratory work (recommended value is 95) |
-hpcov | Minimum %coverage threshold for a valid match against reference H antigen alleles | Poor coverage of H antigen genes when the user wants an H antigen call regardless (recommended value is 95) |
--verify | Verify the species of the input and run the QC module, which reports on the reliability of the result and any typing issues | User is not sure whether the sample is E. coli and wants to know whether the serotype prediction is of sufficient quality for reporting purposes |
-r | Specify a custom MASH sketch of reference genomes to be used for species inference | User has a newly assembled genome that is not available in the NCBI RefSeq database. Make sure to add metadata to assembly_summary_refseq.txt and provide a custom accession number that starts with the GCF_ prefix |
--dbpath | Provide a custom appended database of O and H antigen reference alleles in JSON format, following the structure and field names of the default database ectyper_alleles_db.json | User wants to add new alleles to the alleles database to improve typing performance |
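To illustrate how a pair of %identity and %coverage cut-offs acts on a candidate allele match, here is a small sketch. It is not ECTyper's implementation: the function name and the example match values are hypothetical, and only the cut-off values (90 and 95 for the O antigen) come from the table above.

```python
def passes_thresholds(identity, coverage, min_identity, min_coverage):
    """Return True when a match meets both the identity and coverage cut-offs."""
    for value in (identity, coverage, min_identity, min_coverage):
        if not 0 <= value <= 100:
            raise ValueError("percentages must lie between 0 and 100")
    return identity >= min_identity and coverage >= min_coverage


# Hypothetical O antigen match: 92.5 % identity over 97 % of the reference allele.
print(passes_thresholds(92.5, 97.0, min_identity=90, min_coverage=95))  # True
# The same identity with only 80 % coverage falls below the coverage cut-off.
print(passes_thresholds(92.5, 80.0, min_identity=90, min_coverage=95))  # False
```

Lowering a cut-off admits more matches, which is why relaxed values are suggested only for poor coverage or exploratory work.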
Quality Control (QC) module
To make the results and typing metrics easier to interpret, the following QC codes were developed.
These codes make it possible to quickly separate "reportable" from "non-reportable" samples. The QC module is tightly linked to the ECTyper allele database, specifically to the MinPident and MinPcov fields.
For each reference allele, minimum %identity and %coverage values were determined as a function of potential "cross-talk" between antigens (i.e. multiple potential antigen calls at a given setting).
The QC module covers the following serotyping scenarios. More scenarios might be added in future versions depending on user needs.
QC flag | Explanation |
---|---|
PASS (REPORTABLE) | Both O and H antigen alleles meet min %identity or %coverage thresholds (ensuring no antigen cross-talk) and single antigen predicted for O and H |
FAIL (-:- TYPING) | Sample is E.coli and O and H antigens are not typed. Serotype: -:- |
WARNING MIXED O-TYPE | A mixed O antigen call is predicted requiring wet-lab confirmation |
WARNING (WRONG SPECIES) | A sample is non-E.coli (e.g. E.albertii, Shigella, etc.) based on RefSeq assemblies |
WARNING (-:H TYPING) | A sample is E.coli and O antigen is not predicted (e.g. -:H18) |
WARNING (O:- TYPING) | A sample is E.coli and H antigen is not predicted (e.g. O17:-) |
WARNING (O NON-REPORT) | O antigen alleles do not meet min %identity or %coverage thresholds |
WARNING (H NON-REPORT) | H antigen alleles do not meet min %identity or %coverage thresholds |
WARNING (O and H NON-REPORT) | Neither O nor H antigen alleles meet min %identity or %coverage thresholds |
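Because every QC flag starts with PASS, FAIL or WARNING, reportable samples can be separated with a simple prefix test on the QC column of the tab-delimited report. The sketch below uses only the Python standard library; the function name and the two-sample excerpt are hypothetical and contain just the columns needed for the illustration.

```python
import csv
import io


def split_by_qc(tsv_text):
    """Split report rows into reportable (QC starts with PASS) and other rows."""
    reportable, flagged = [], []
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        if row["QC"].startswith("PASS"):
            reportable.append(row)
        else:
            flagged.append(row)
    return reportable, flagged


# Hypothetical two-sample excerpt with only the columns needed here.
example = (
    "Name\tSerotype\tQC\n"
    "sampleA\tO157:H43\tPASS (REPORTABLE)\n"
    "sampleB\t-:H18\tWARNING (-:H TYPING)\n"
)
reportable, flagged = split_by_qc(example)
print([row["Name"] for row in reportable])  # ['sampleA']
print([row["Name"] for row in flagged])     # ['sampleB']
```

For a real run, the text of the report file can be read from disk and passed to the same function.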
Report format
ECTyper aims for concise, minimal output that is easy to interpret and report. ECTyper v1.0 serotyping results are written to a tab-delimited output.tsv file consisting of the 16 columns listed below:
- Name: Sample name (usually a unique identifier)
- Species: the species column provides valuable species identification information in case of inadvertent sample contamination or mislabelling events
- O-type: O antigen
- H-type: H antigen
- Serotype: Predicted O and H antigen(s)
- QC: The Quality Control value summarizing the overall quality of prediction
- Evidence: Total number of alleles used to call both the O and H antigens
- GeneScores: ECTyper O and H antigen gene scores in 0 to 1 range
- AllelesKeys: Best matching ECTyper database allele keys used to call the serotype
- GeneIdentities(%): %identity values of the query alleles
- GeneCoverages(%): %coverage values of the query alleles
- GeneContigNames: the contig names where the query alleles were found
- GeneRanges: genomic coordinates of the query alleles
- GeneLengths: allele lengths of the query alleles
- Database: database release version and date
- Warnings: any additional warnings linked to the quality control status or any other error message(s).
Selected columns from a typical ECTyper report are shown below.
Name | Species | Serotype | Evidence | QC | GeneScores | AlleleKeys | GeneIdentities(%) | GeneCoverages(%) | GeneContigNames | GeneRanges | GeneLengths | Database | Warnings |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
15-520 | Escherichia coli | O174:H21 | Based on 3 allele(s) | PASS (REPORTABLE) | wzx:1; wzy:1; fliC:1; | O104-5-wzx-origin;O104-13-wzy;H7-6-fliC-origin; | 100;100;100; | 100;100;100; | contig00049;contig00001;contig00019; | 22302-23492;178-1290;6507-8264; | 1191;1113;1758; | v1.0 (2020-05-07) | - |
EC20151709 | Escherichia coli | O157:H43 | Based on 3 allele(s) | PASS (REPORTABLE) | wzx:1;wzy:0.999;fliC:1 | O157-5-wzx-origin;O157-9-wzy-origin;H43-1-fliC-origin; | 100;99.916;99.934; | 100;100;100; | contig00002;contig00002;contig00003; | 62558-63949;64651-65835;59962-61467; | 1392;1185;1506; | v1.0 (2020-05-07) | - |
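Several report columns pack one value per gene into a semicolon-delimited string. The sketch below shows one way to unpack them into per-gene records. It assumes that the fields are positionally aligned (the first score, allele key, identity and coverage all belong to the same gene), which is how the rows above read but is an assumption rather than a documented guarantee. The function names are hypothetical.

```python
def split_field(value):
    """Split a semicolon-delimited report field, ignoring a trailing ';'."""
    return [item.strip() for item in value.split(";") if item.strip()]


def per_gene_records(gene_scores, allele_keys, identities, coverages):
    """Combine parallel report fields into one dictionary per gene."""
    records = []
    for score, key, ident, cov in zip(
        split_field(gene_scores),
        split_field(allele_keys),
        split_field(identities),
        split_field(coverages),
    ):
        # A score entry looks like "wzx:1": gene name, colon, score.
        gene, _, score_value = score.partition(":")
        records.append(
            {
                "gene": gene,
                "score": float(score_value),
                "allele": key,
                "identity": float(ident),
                "coverage": float(cov),
            }
        )
    return records


# Values taken from the EC20151709 row above.
records = per_gene_records(
    "wzx:1;wzy:0.999;fliC:1",
    "O157-5-wzx-origin;O157-9-wzy-origin;H43-1-fliC-origin;",
    "100;99.916;99.934;",
    "100;100;100;",
)
for record in records:
    print(record["gene"], record["allele"], record["identity"])
```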
Availability
Resource | Description | Type |
---|---|---|
PyPI | PyPI package that can be installed via the pip utility | Terminal |
Conda | Conda package available from BioConda channel | Terminal |
Docker | Images containing completely initialized ECTyper with all dependencies | Terminal |
Singularity | Images containing completely initialized ECTyper with all dependencies | Terminal |
GitHub | Install source code as any Python package | Terminal |
Galaxy ToolShed | Galaxy wrapper available for installation on a private/public instance | Web-based |
Galaxy Europe | Galaxy public server to execute your analysis from anywhere | Web-based |
IRIDA plugin | Pipeline plugin that can be installed on IRIDA instances | Web-based |