Core Sequence Identifier

CORSID

CORSID is a computational tool that simultaneously identifies transcription regulatory sequence (TRS) sites, the core sequence, and gene locations, given an unannotated coronavirus genome sequence. We also provide another tool, CORSID-A, which identifies TRS sites and the core sequence given a coronavirus genome sequence with annotated gene locations.

The data and results can be found in the repo CORSID-data. The visualized results of our tool applied to 468 coronavirus genomes can be found in CORSID-viz (source repo). Docker containers can be found in CORSID-container and on Docker Hub.

If you use CORSID in your work, please cite the following paper (bioRxiv):

Zhang, Chuanyi, Palash Sashittal, and Mohammed El-Kebir. "CORSID enables de novo identification of transcription regulatory sequences and genes in coronaviruses." bioRxiv (2021).

Contents

  1. Pre-requisites
  2. Installation
  3. Usage instructions

Pre-requisites

If you install with conda or pip as described below, you don't need to install the dependencies manually.

Installation

Using conda

  1. Create a new conda environment named "corsid" and install dependencies:

    conda create -n corsid python=3.7
    
  2. Then activate the created environment: conda activate corsid.

  3. Install the package into the current environment "corsid":

    conda install -c bioconda corsid
    

Using pip

  1. Create a new conda environment named "corsid" and install dependencies:

    conda create -n corsid python=3.7
    
  2. Then activate the created environment: conda activate corsid.

  3. Use pip to install the package:

    pip install corsid
    

Usage instructions

I/O formats

Input files

  • CORSID: CORSID identifies TRS-L, TRS-Bs, and genes directly in the complete genome.
    • FASTA file: the complete input genome
    • GFF3 annotation (optional): annotation file to validate the identified genes
  • CORSID-A: CORSID-A finds candidate regions for each gene given in the annotation file and identifies TRS-L and TRS-Bs in candidate regions.
    • FASTA file: the complete input genome
    • GFF3 annotation: known genes
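Both tools expect the FASTA file to hold one complete genome. A minimal sketch of a pre-flight check is shown here; it is not part of CORSID, the function name `read_fasta` is hypothetical, and the record is a made-up toy sequence.

```python
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA text into {record id: sequence} (hypothetical helper)."""
    records: Dict[str, str] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The record id is the first whitespace-delimited token of the header.
            header = line[1:].split()
            current = header[0] if header else ""
            records[current] = ""
        elif current is not None:
            # Sequence lines may be wrapped; concatenate them in order.
            records[current] += line.upper()
    return records


# Made-up toy record. For a real genome, pass an open file handle instead,
# for example: read_fasta(open("test/NC_045512.fasta")).
toy = """>toy_genome hypothetical example
ACGTACGTAC
GTTAACGG
""".splitlines()

records = read_fasta(toy)
if len(records) != 1:
    raise SystemExit("expected exactly one genome record")
for name, seq in records.items():
    print(name, len(seq))
```

The check only confirms that the file contains a single record and reports its length; it cannot tell whether the genome is actually complete.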

Output files

  • CORSID:
    • JSON {filename}.json: sorted solutions and auxiliary information. This file can be used as the input to the visualization webapp. Solutions are sorted in lexicographical order of (genome coverage, total matching score, minimum score), where "genome coverage" is the count of bases covered by identified genes, "total matching score" is the sum of matching scores between TRS-L and all identified TRS-Bs in the solution, and "minimum score" is the smallest matching score in the solution.
    • GFF3 {filename}.gff: annotated genes in GFF3 format of the optimal solution (the first one in the JSON output). Note that it shares the same file name as the JSON output, and the only difference is the extension.
    • Standard output: tables of solutions and a visualization of the TRS alignment. You can redirect the standard output to a file, as shown in the example below.
  • CORSID-A:
    • JSON {filename}.json: sorted solutions and auxiliary information. Solutions are sorted by their total matching score, which is the sum of matching scores between TRS-L and all identified TRS-Bs in the solution.
    • Standard output: similar to output of CORSID.
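To illustrate the lexicographical ordering described for the CORSID JSON output, here is a minimal sketch. The `Solution` type, its field names, and the numbers are hypothetical and do not reflect CORSID's actual JSON schema; the sketch also assumes that larger values rank first, which the description above implies but does not state outright.

```python
from typing import List, NamedTuple


class Solution(NamedTuple):
    """Hypothetical summary of one solution; not CORSID's JSON schema."""
    label: str
    genome_coverage: int  # count of bases covered by identified genes
    total_score: int      # sum of matching scores between TRS-L and TRS-Bs
    min_score: int        # smallest matching score in the solution


def rank(solutions: List[Solution]) -> List[Solution]:
    # Python compares tuples element by element, which is exactly a
    # lexicographical order. reverse=True puts larger values first
    # (an assumption about the sort direction).
    return sorted(
        solutions,
        key=lambda s: (s.genome_coverage, s.total_score, s.min_score),
        reverse=True,
    )


# Made-up values chosen to exercise each tie-breaking level.
candidates = [
    Solution("a", 29000, 70, 4),
    Solution("b", 29500, 60, 3),
    Solution("c", 29500, 60, 5),
    Solution("d", 29500, 65, 2),
]

print([s.label for s in rank(candidates)])  # ['d', 'c', 'b', 'a']
```

Solution "a" has the highest total score but ranks last because genome coverage is compared first; "c" beats "b" only on the minimum score.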

Example

After installation, you can check if the program runs correctly by analyzing the SARS-CoV-2 genome (NC_045512) as follows:

git clone git@github.com:elkebir-group/CORSID.git
cd CORSID
corsid -f test/NC_045512.fasta -o test/NC_045512.corsid.json > test/NC_045512.corsid.txt

The output files will be test/NC_045512.corsid.json, test/NC_045512.corsid.gff, and test/NC_045512.corsid.txt.

You can find a list of solutions displayed as tables in test/NC_045512.corsid.txt. The best solution should match the figure below:

[Figure: Expected result]

The corresponding GFF3 output should look like this:

[Figure: Expected GFF output]
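To inspect a GFF3 file programmatically, the sketch below reads the start and end columns and counts the bases covered by the listed features, which is one reading of the "genome coverage" quantity defined under the output files. It relies only on the standard GFF3 layout (nine tab-separated columns, 1-based inclusive coordinates). The helper names and the sample records are made up, and counting overlapping bases once is an assumption rather than CORSID's documented behavior.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[int, int]


def read_intervals(lines: Iterable[str]) -> List[Interval]:
    """Collect (start, end) from GFF3 feature lines (hypothetical helper)."""
    intervals = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # not a well-formed GFF3 feature line
        intervals.append((int(cols[3]), int(cols[4])))
    return intervals


def covered_bases(intervals: List[Interval]) -> int:
    """Count bases covered by at least one interval (union length)."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                total += cur_end - cur_start + 1  # close the previous run
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)  # extend the overlapping run
    if cur_end is not None:
        total += cur_end - cur_start + 1
    return total


# Made-up records; real input would come from a file such as
# open("test/NC_045512.corsid.gff").
sample = ["##gff-version 3"] + [
    "\t".join(["toy", "example", "CDS", str(s), str(e), ".", "+", "0", "ID=" + i])
    for i, s, e in [("g1", 10, 50), ("g2", 40, 80), ("g3", 100, 120)]
]

genes = read_intervals(sample)
print(genes, covered_bases(genes))
```

Here the first two features overlap and merge into positions 10 to 80 (71 bases), and the third adds 21 bases.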

You can also pass the option -g test/NC_045512.gff to validate the identified genes against the annotation:

corsid -f test/NC_045512.fasta -g test/NC_045512.gff \
    -o test/NC_045512.corsid.json > test/NC_045512.corsid.txt

The result will look like this:

[Figure: Expected result]

Similarly, you can run CORSID-A with the following command:

corsid_a -f test/NC_045512.fasta -g test/NC_045512.gff \
    -o test/NC_045512.corsid_a.json > test/NC_045512.corsid_a.txt

Note that the annotation GFF file is required for CORSID-A. The output files will be test/NC_045512.corsid_a.json and test/NC_045512.corsid_a.txt.
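If you prefer to drive the commands above from a script, the following sketch assembles the same command lines and redirects the standard output to a file. It uses only the options shown in this README (-f, -g, -o); the helper names `build_command` and `run` are hypothetical and not part of the CORSID package.

```python
import shutil
import subprocess
from typing import List, Optional


def build_command(tool: str, fasta: str, out_json: str,
                  gff: Optional[str] = None) -> List[str]:
    """Assemble a corsid or corsid_a command line (hypothetical helper)."""
    if tool not in ("corsid", "corsid_a"):
        raise ValueError("tool must be 'corsid' or 'corsid_a'")
    if tool == "corsid_a" and gff is None:
        # The README states that the annotation is required for CORSID-A.
        raise ValueError("corsid_a requires a GFF3 annotation (-g)")
    cmd = [tool, "-f", fasta, "-o", out_json]
    if gff is not None:
        cmd += ["-g", gff]
    return cmd


def run(cmd: List[str], stdout_path: str) -> None:
    """Run the command, writing its standard output to stdout_path."""
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(cmd[0] + " is not on PATH; is the environment active?")
    with open(stdout_path, "w") as handle:
        subprocess.run(cmd, stdout=handle, check=True)


cmd = build_command(
    "corsid_a",
    "test/NC_045512.fasta",
    "test/NC_045512.corsid_a.json",
    gff="test/NC_045512.gff",
)
print(" ".join(cmd))
# Uncomment in an environment where CORSID is installed:
# run(cmd, "test/NC_045512.corsid_a.txt")
```

Keeping command construction separate from execution makes it possible to check the arguments without having CORSID installed.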
