Core Sequence Identifier
CORSID
CORSID is a computational tool that simultaneously identifies transcription regulatory sequence (TRS) sites, the core sequence, and gene locations given an unannotated coronavirus genome sequence. We also provide another tool, CORSID-A, which identifies TRS sites and the core sequence given a coronavirus genome sequence with annotated gene locations.
The data and results can be found in the repo CORSID-data. The visualized results of our tool applied to 468 coronavirus genomes can be found in CORSID-viz.
Contents
- Pre-requisites
- Installation
- Using conda (recommended)
- Using pip (alternative)
- Usage instructions
Pre-requisites
- python3 (>=3.7)
- numpy
- pysam
- pandas
- pytablewriter
- (optional for simulation pipeline) snakemake (>=5.2.0)
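Once an environment is set up, a quick check like the one below can confirm that the packages listed above are visible to the interpreter. This is only a sketch: it assumes each package's import name matches the name in the list, and the helper `missing_modules` is a hypothetical name used for illustration, not part of CORSID.

```python
import importlib.util
import sys

# Import names are assumed to match the package names in the list above.
REQUIRED = ["numpy", "pysam", "pandas", "pytablewriter"]


def missing_modules(names):
    """Return the names that cannot be located in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print("Python >= 3.7:", sys.version_info >= (3, 7))
print("Missing packages:", missing_modules(REQUIRED) or "none")
```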
Installation
Using conda (recommended)
1. Create a new conda environment named "corsid" with Python 3.7:
conda create -n corsid python=3.7
2. Activate the created environment:
conda activate corsid
3. Install the package, along with its dependencies, into the "corsid" environment:
conda install -c bioconda corsid
Using pip (alternative)
We recommend installing in a virtual environment, as described in steps 1 and 2 of the previous section. Then use pip to install the package:
pip install corsid
Usage instructions
I/O formats
CORSID takes a FASTA file containing the complete genome as input. Optionally it also takes an annotation file (GFF format) to validate the identified genes.
CORSID-A takes a FASTA file and an annotation file (GFF format) as input. It will find candidate regions for each gene given the annotation file, and run CORSID-A on candidate regions.
The output is a JSON file containing sorted solutions and auxiliary information. This file can be used as the input to the visualization webapp. The program also writes to standard output, where it shows tables of solutions and a visualization of the TRS alignment.
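The layout of the JSON output is not described here, so the sketch below makes no assumption about its keys. It only loads a file and reports the type of each top-level entry, which may be a convenient first step before writing any downstream parsing. The function name `summarize_json` is hypothetical.

```python
import json


def summarize_json(path):
    """Describe the top-level structure of a JSON file without assuming a schema."""
    with open(path) as handle:
        data = json.load(handle)
    if isinstance(data, dict):
        # Map each top-level key to the type name of its value.
        return {key: type(value).__name__ for key, value in data.items()}
    # Anything other than an object is reported as a single entry.
    return {"<top level>": type(data).__name__}
```

For instance, calling `summarize_json("test/NC_045512.json")` after running the example in the next section would list whatever top-level fields that file contains.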
Example
After installation, you can check if the program runs correctly by analyzing the SARS-CoV-2 genome (NC_045512) as follows:
git clone git@github.com:elkebir-group/CORSID.git
cd CORSID
corsid -f test/NC_045512.fasta -o test/NC_045512.json > test/NC_045512.txt
You can find a list of solutions displayed as tables in test/NC_045512.txt. The best solution should be the same as the figure below:
You can also use the option -g test/NC_045512.gff to validate the identified genes:
corsid -f test/NC_045512.fasta -g test/NC_045512.gff \
-o test/NC_045512.json > test/NC_045512.txt
The result will look like:
Similarly, you can run CORSID-A with the command:
corsid_a -f test/NC_045512.fasta -g test/NC_045512.gff \
-o test/NC_045512.corsid_a.json > test/NC_045512.corsid_a.txt
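When several genomes need to be processed, the commands above can be wrapped in a small script. The following is a minimal sketch that uses only the -f, -g and -o options shown in this section. The helper names `build_command` and `run`, and the choice to store output files under the input file's stem, are assumptions made for illustration.

```python
import subprocess
from pathlib import Path


def build_command(fasta, out_json, gff=None, program="corsid"):
    """Assemble the argument list using only the -f, -g and -o options."""
    cmd = [program, "-f", str(fasta), "-o", str(out_json)]
    if gff is not None:
        cmd += ["-g", str(gff)]
    return cmd


def run(fasta, out_dir, gff=None, program="corsid"):
    """Run the tool on one genome, saving the JSON file and the printed tables."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    stem = Path(fasta).stem
    cmd = build_command(fasta, out_dir / f"{stem}.json", gff, program)
    # Redirect standard output to a text file, as in the shell examples.
    with open(out_dir / f"{stem}.txt", "w") as log:
        subprocess.run(cmd, stdout=log, check=True)
```

Passing `program="corsid_a"` together with a GFF file mirrors the CORSID-A invocation, for example `run("test/NC_045512.fasta", "results", gff="test/NC_045512.gff", program="corsid_a")`.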