Calculate pairwise allelic distances from cgMLST profiles; implements chewBBACA.

COREugate - A pipeline for cgMLST

From contigs to cgMLST profiles and single linkage clustering (SLC).

COREugate has had a small facelift!! Under the hood we now use Nextflow as our pipeline engine, and we have introduced some additional functionality for clustering the profiles.

  1. Prepare the schema with PrepSchema (if necessary) and call alleles using chewBBACA.
  2. Combine profiles and statistics for the whole dataset.
  3. Calculate pairwise allelic distances (missing data is ignored).
  4. Perform SLC to group related profiles, based on user-supplied thresholds.
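
Step 3 can be illustrated with a small Python sketch. This is not COREugate's own code: the function names are hypothetical, and two details are assumptions about the profile format, namely that any non-numeric call (for example LNF) counts as missing and that an INF- prefix marks a newly inferred allele number. A locus contributes to the distance only when both isolates have a call.

```python
from itertools import combinations


def parse_call(call):
    """Return the allele number as an int, or None when the call is missing."""
    call = call.strip()
    if call.startswith("INF-"):  # assumption: prefix used for inferred alleles
        call = call[4:]
    return int(call) if call.isdigit() and int(call) > 0 else None


def allelic_distance(profile_a, profile_b):
    """Count loci where both isolates have a call and the calls differ."""
    differences = 0
    for a, b in zip(profile_a, profile_b):
        a, b = parse_call(a), parse_call(b)
        if a is None or b is None:
            continue  # missing data is ignored
        if a != b:
            differences += 1
    return differences


def pairwise_distances(profiles):
    """profiles maps isolate name -> list of calls in the same locus order."""
    names = sorted(profiles)
    dist = {(n, n): 0 for n in names}
    for x, y in combinations(names, 2):
        d = allelic_distance(profiles[x], profiles[y])
        dist[(x, y)] = dist[(y, x)] = d
    return dist


# Made-up profiles over four loci, for illustration only.
profiles = {
    "iso1": ["1", "2", "3", "4"],
    "iso2": ["1", "2", "LNF", "5"],
    "iso3": ["2", "INF-7", "3", "4"],
}
distances = pairwise_distances(profiles)
print(distances[("iso1", "iso2")])  # the LNF locus is skipped, one locus differs
```

Ignoring a locus when either call is missing keeps incomplete assemblies from inflating distances, at the cost of comparing different isolate pairs over slightly different sets of loci.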

Dependencies

Python >=3.7
Biopython >=1.70
Nextflow >=20.10
chewBBACA >=2.6

Nextflow

Ensure that you have Nextflow installed. Detailed instructions can be found in the Nextflow documentation.

chewBBACA

chewBBACA is used here to prepare the schema, by selecting exemplar alleles for comparison, and to call allele profiles. More information about chewBBACA and how it works can be found in the chewBBACA documentation. COREugate can use a Singularity version of chewBBACA; however, if you want to install the latest version (>=2.0.16), follow the chewBBACA installation instructions.

Run COREugate

Get COREugate

pip3 install git+https://github.com/kristyhoran/Coreugate

If you are installing COREugate on a server using --user, please ensure that your ~/.local/bin is part of your PATH:

export PATH=$PATH:/path/to/.local/bin
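
To confirm that the directory really is on your PATH before running the pipeline, a quick check along these lines may help. It is only a sketch: on_path is a hypothetical helper, not part of COREugate, and it compares directory entries literally without resolving symlinks.

```python
import os
import shutil
from pathlib import Path


def on_path(directory, path_value=None):
    """Return True if `directory` is one of the entries in PATH."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    target = Path(directory).expanduser()
    entries = (e for e in path_value.split(os.pathsep) if e)
    return any(Path(e).expanduser() == target for e in entries)


# Check the per-user install location mentioned above.
local_bin_ok = on_path("~/.local/bin")
# shutil.which returns the full path of the executable, or None if not found.
coreugate_exe = shutil.which("coreugate")
print(local_bin_ok, coreugate_exe)
```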

Running COREugate

coreugate [-h] [-v] [--input_file INPUT_FILE]
                 [--schema_path SCHEMA_PATH]
                 [--prodigal_training PRODIGAL_TRAINING] [--workdir WORKDIR]
                 [--threads THREADS]
                 [--filter_samples_threshold FILTER_SAMPLES_THRESHOLD]
                 [--cluster] [--cluster_thresholds CLUSTER_THRESHOLDS]
                 [--force] [--report]

Coreugate - a cgMLST pipeline implementing chewBACCA

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  --input_file INPUT_FILE, -i INPUT_FILE
                        Input file tab-delimited file3 columns isolate_id
                        path_to_input_file (contigs) (default: )
  --schema_path SCHEMA_PATH, -s SCHEMA_PATH
                        Path to species schema/allele db (or url if using
                        chewie Nomenclature server) (default: )
  --prodigal_training PRODIGAL_TRAINING, -p PRODIGAL_TRAINING
                        Prodigal file to be used in allele calling. See https:
                        //github.com/B-UMMI/chewBBACA/tree/master/CHEWBBACA/pr
                        odigal_training_files for options (default: )
  --workdir WORKDIR, -w WORKDIR
                        Working directory, default is current directory
                        (default: /home/khhor/validation/salmonella_typing/rev
                        erification_20210322)
  --threads THREADS, -t THREADS
                        Number of threads to run chewBACCA (default: 16)
  --filter_samples_threshold FILTER_SAMPLES_THRESHOLD, -ft FILTER_SAMPLES_THRESHOLD
                        The proportion of loci present in a sample for an
                        sample to be included in further analysis (0-1)
                        (default: 0.95)
  --cluster, -c         If you would like to cluster the pairwise distance
                        matrix. If selected you must provide a list of
                        thresholds. (default: False)
  --cluster_thresholds CLUSTER_THRESHOLDS, -ct CLUSTER_THRESHOLDS
                        Provide a comma separate list (NO SPACES) eg 20,40,200
                        (default: )
  --force, -f           If you want to force chewBBACA to re-run. (default:
                        False)
  --report              Save nextflow reports. (default: False)
                                 Display this help message
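
The --cluster and --cluster_thresholds options apply single linkage clustering to the pairwise distance matrix. The sketch below shows the general technique in Python and is not taken from COREugate's source: the function names are hypothetical, and whether a distance equal to the threshold joins two isolates (<= rather than <) is an assumption.

```python
def parse_thresholds(text):
    """Turn a string such as '20,40,200' into a sorted list of ints."""
    return sorted(int(t) for t in text.split(","))


def single_linkage(names, dist, threshold):
    """Group isolates so that any pair within `threshold` shares a cluster."""
    parent = {n: n for n in names}

    def find(n):
        # Follow parent links to the cluster root, shortening the path as we go.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if dist[(a, b)] <= threshold:  # assumption: threshold is inclusive
                parent[find(a)] = find(b)

    clusters = {}
    for n in names:
        clusters.setdefault(find(n), []).append(n)
    return sorted(sorted(c) for c in clusters.values())


# Made-up distances for four isolates, for illustration only.
pairs = {
    ("A", "B"): 5, ("A", "C"): 30, ("B", "C"): 18,
    ("A", "D"): 150, ("B", "D"): 150, ("C", "D"): 150,
}
dist = {**pairs, **{(b, a): d for (a, b), d in pairs.items()}}
names = ["A", "B", "C", "D"]

result = {t: single_linkage(names, dist, t) for t in parse_thresholds("20,4,200")}
for threshold, clusters in result.items():
    print(threshold, clusters)
```

Note that at a threshold of 20, A and C end up in the same cluster even though they are 30 alleles apart, because each is within 20 of B. This chaining is characteristic of single linkage.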
Sample data

Assemblies

isolate_name	path/to/assembly.fa	
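
If your assemblies sit in one directory, the input file can be generated rather than typed by hand. The following Python sketch assumes a two-column layout (isolate name, then path) as in the example line above, uses the file name without its extension as the isolate name, and assumes a .fa suffix; write_input_file is a hypothetical helper, not part of COREugate.

```python
import csv
import tempfile
from pathlib import Path


def write_input_file(assembly_dir, out_path, suffix=".fa"):
    """Write one 'isolate<TAB>absolute path' line per assembly found."""
    assemblies = sorted(Path(assembly_dir).glob(f"*{suffix}"))
    rows = [(p.stem, str(p.resolve())) for p in assemblies]
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerows(rows)
    return rows


# Demonstration with two placeholder assemblies in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("iso_b.fa", "iso_a.fa"):
        Path(tmp, name).write_text(">contig1\nACGT\n")
    out = Path(tmp, "input.tab")
    rows = write_input_file(tmp, out)
    lines = out.read_text().splitlines()

print(lines)
```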
Species cgMLST schema

COREugate requires an existing cgMLST schema, either generated by the user or downloaded from one of the publicly available databases. A schema should consist of one FASTA file per locus, with each file containing the different alleles for that locus. Note that during allele calling, chewBBACA (implemented by COREugate) will add inferred alleles to your schema, so it is recommended that the schema path be fixed; that is, keep the schema in a central location and use a single version for each species/study.
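
Because allele calling can add alleles to the schema, it can be useful to record how many alleles each locus holds before and after a run. The Python sketch below counts FASTA records per locus file. It assumes a .fasta suffix and one record per allele; summarise_schema is a hypothetical helper, not part of COREugate or chewBBACA.

```python
import tempfile
from pathlib import Path


def summarise_schema(schema_dir, suffix=".fasta"):
    """Map each locus (file name without suffix) to its number of alleles."""
    summary = {}
    for fasta in sorted(Path(schema_dir).glob(f"*{suffix}")):
        with open(fasta) as handle:
            # Each FASTA record starts with a '>' header line.
            summary[fasta.stem] = sum(1 for line in handle if line.startswith(">"))
    return summary


# Demonstration with a tiny made-up schema of two loci.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "locus1.fasta").write_text(">locus1_1\nATGAAATAA\n>locus1_2\nATGAAGTAA\n")
    Path(tmp, "locus2.fasta").write_text(">locus2_1\nATGCCCTAA\n")
    summary = summarise_schema(tmp)

print(summary)
```

Comparing two such summaries taken before and after a run shows which loci gained inferred alleles.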

Other optional arguments
  • prodigal_training: a Prodigal training file for allele calling, recommended by the chewBBACA developers. A list of default training files and further information can be found at https://github.com/B-UMMI/chewBBACA/tree/master/CHEWBBACA/prodigal_training_files.

Limitations of the pipeline

  • COREugate can only work with pre-existing schemas that have been prepared as described above, to derive profiles for isolates.
  • Possibly more, I just haven't found them yet!!


