covSonar2
covSonar is a database-driven system for handling genomic sequences and screening genomic profiles.
What's new in covSonar V.2
- New design
- Improved workflows
- Performance improvements
- Exciting new features
- Support multiple pathogens
- Flexible addition of meta information
- New database design
- New database schema
- Retrieval efficiency
- Significantly smaller than the previous version
1. Prerequisites / Setup
covSonar2 has some software environment requirements that can most easily be met by building a custom conda environment.
Proceed as follows to install covSonar:
# download the repository to the current working directory using git
git clone https://github.com/rki-mf1/covsonar
# build the custom software environment using conda [recommended]
conda env create -n sonar2 -f covsonar/sonar.env.yml
# activate the conda environment if built
conda activate sonar2
# testing
./covsonar/test.sh
3. Usage
In covSonar2 there are several tools that can be called via subcommands.
subcommand | purpose |
---|---|
setup | set up a new database |
import | import genome sequences and sample information to the database |
list-prop | view sample properties added to the database |
add-prop | add a sample property to the database |
delete-prop | delete a sample property from the database |
match | get mutation profiles for given accessions |
restore | restore sequence(s) from the database |
info | show software and database info |
optimize | optimize the database |
db-upgrade | upgrade a database to the latest version |
update-lineage-info | download latest lineage information |
Each tool provides a help page that can be accessed with the -h option.
# display help page
./sonar.py -h
# display help page for each tool
./sonar.py import -h
3.1 Setup database (setup ⛽)
First, we have to create a new database instance.
./sonar.py setup --db test.db
Alternatively, we can create a new database instance with predefined properties.
./sonar.py setup --db test.db --auto-create
By default, MN908947.3 (SARS-CoV-2) is used as the reference. To set up a database for a different pathogen, add --gbk followed by a GenBank file.
Example:
./sonar.py setup --db test.db --auto-create --gbk Ebola.gb
note 📌: how to download genbank file
3.2 Property management (list-prop, add-prop, delete-prop)
In covSonar2, users can now add arbitrary meta information or properties to a database to fit specific project objectives.
To view the added properties, we can use the list-prop command to display all information.
./sonar.py list-prop --db test.db
To add more properties, we can use the add-prop command to add meta information to the database.
The required arguments for add-prop are listed below:
- --name, name of the sample property
- --descr, description of the new property
- --dtype, data type of the new property (e.g., 'integer', 'float', 'text', 'date', 'zip')
./sonar.py add-prop --db test.db --name SEQ_REASON --dtype text --descr "seq. reason"
tip 🕯️: run ./sonar.py add-prop -h to see all available arguments
The delete-prop command is used to delete an unwanted property from the database.
./sonar.py delete-prop --db test.db --name SEQ_REASON
The program will ask the user to confirm the action.
Do you really want to delete this property? [YES/no]: YES
3.3 Adding genomes and meta information to the database (import)
Add sequences together with their meta information:
./sonar.py import --db test.db --fasta valid.fasta --tsv day.tsv --threads 64 --cache tmp_cache --cols sample=IMS_ID
Update more
example:
./sonar.py import --db test.db --fasta valid.fasta --tsv day.tsv --threads 64 --cache tmp_cache --cols sample=IMS_ID
3.4 Query genome sequences based on profiles (match)
Genomic profiles can be defined to match genomes. For this purpose, variants relative to the complete genome of the SARS-CoV-2 isolate Wuhan-Hu-1 (NC_045512.2) must be expressed as follows:
type | nucleotide level | amino acid level |
---|---|---|
SNP | ref_nuc followed by ref_pos followed by alt_nuc (e.g. A3451T) | protein_symbol:ref_aa followed by ref_pos followed by alt_aa (e.g. S:N501Y) |
deletion | del:ref_pos:length_in_bp (e.g. del:3001:8) | protein_symbol:del:ref_pos:length_in_aa (e.g. ORF1ab:del:3001:21) |
insertion | ref_nuc followed by ref_pos followed by alt_nucs (e.g. A3451TGAT) | protein_symbol:ref_aa followed by ref_pos followed by alt_aas (e.g. N:A34AK) |
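The notation in the table can be illustrated with a small parser. This is only a sketch written for this README, not covSonar's own parsing code: the Variant type, the parse_variant function and the two regular expressions are hypothetical names, and the sketch assumes upper-case single-letter residues and a protein symbol made of word characters.

```python
import re
from typing import NamedTuple, Optional


class Variant(NamedTuple):
    level: str              # "nucleotide" or "amino acid"
    kind: str               # "SNP", "insertion" or "deletion"
    protein: Optional[str]  # protein symbol; None at nucleotide level
    pos: int                # 1-based reference position
    ref: Optional[str]      # reference residue; None for deletions
    alt: Optional[str]      # alternate residue(s); None for deletions
    length: Optional[int]   # deleted length; None for SNPs and insertions


# Optional "protein_symbol:" prefix, followed by one of the two core forms.
_PREFIX = r"^(?:(?P<protein>\w+):)?"
_DELETION = re.compile(_PREFIX + r"del:(?P<pos>\d+):(?P<length>\d+)$")
_SUBSTITUTION = re.compile(_PREFIX + r"(?P<ref>[A-Z])(?P<pos>\d+)(?P<alt>[A-Z]+)$")


def parse_variant(text: str) -> Variant:
    """Classify one variant definition written in the notation of the table."""
    m = _DELETION.match(text)
    if m:
        level = "amino acid" if m["protein"] else "nucleotide"
        return Variant(level, "deletion", m["protein"], int(m["pos"]),
                       None, None, int(m["length"]))
    m = _SUBSTITUTION.match(text)
    if m:
        level = "amino acid" if m["protein"] else "nucleotide"
        # One alternate residue is an SNP; several are read as an insertion.
        kind = "SNP" if len(m["alt"]) == 1 else "insertion"
        return Variant(level, kind, m["protein"], int(m["pos"]),
                       m["ref"], m["alt"], None)
    raise ValueError(f"unrecognized variant definition: {text!r}")


if __name__ == "__main__":
    for example in ("A3451T", "S:N501Y", "del:3001:8",
                    "ORF1ab:del:3001:21", "A3451TGAT", "N:A34AK"):
        print(example, "->", parse_variant(example))
```

The deletion pattern is tried first so that a leading "del" is never mistaken for a protein symbol.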
The positions refer to the reference (the first nucleotide in the genome is position 1). Using the option --profile, multiple variant definitions can be combined into a nucleotide, amino acid or mixed profile, which means that matching genomes must have all those variations in common. In contrast, alternative variations can be defined by multiple --profile options. As an example, --profile S:N501Y S:E484K matches genomes sharing the Nelly AND Erik variation, while --profile S:N501Y --profile S:E484K matches genomes that share either the Nelly OR Erik variation OR both. Accordingly, using the ^ prefix, profiles can be defined that must not be present in the matched genomes.
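The AND/OR logic described above can be sketched in a few lines of Python. This is an illustration of the semantics only, not covSonar's query code (which works against the database). The function names and the toy genomes are hypothetical, and the sketch assumes that ^ is written directly in front of a single variant definition.

```python
from typing import Dict, Iterable, List, Set


def matches_profile(genome_variants: Set[str], profile: Iterable[str]) -> bool:
    """One --profile: every plain variant must be present (AND),
    every ^-prefixed variant must be absent."""
    for variant in profile:
        if variant.startswith("^"):
            if variant[1:] in genome_variants:
                return False
        elif variant not in genome_variants:
            return False
    return True


def select(genomes: Dict[str, Set[str]], profiles: List[List[str]]) -> List[str]:
    """Several --profile options: a genome is kept if ANY profile matches (OR)."""
    return [name for name, variants in genomes.items()
            if any(matches_profile(variants, p) for p in profiles)]


if __name__ == "__main__":
    # Toy data, invented for this example.
    genomes = {
        "g1": {"S:N501Y", "S:E484K"},
        "g2": {"S:N501Y"},
        "g3": {"S:E484K"},
        "g4": set(),
    }
    # --profile S:N501Y S:E484K
    print(select(genomes, [["S:N501Y", "S:E484K"]]))
    # --profile S:N501Y --profile S:E484K
    print(select(genomes, [["S:N501Y"], ["S:E484K"]]))
    # --profile S:N501Y ^S:E484K
    print(select(genomes, [["S:N501Y", "^S:E484K"]]))
```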
There are additional options to adjust the matching.
option | description |
---|---|
--count | count matching genomes only |
Examples:
# matching B.1.1.7 genomes in DB 'test.db' that share an additional "Erik" mutation
./sonar.py match --profile S:E484K --LINEAGE B.1.1.7 --db test.db
# as before but matching genomes are counted only
./sonar.py match --profile S:E484K --LINEAGE B.1.1.7 --count --db test.db
# matching genomes in DB 'test.db' sharing the "Nelly" mutation
# and that were sampled in 2020
./sonar.py match --profile S:N501Y --DATE 2020-01-01:2020-12-31 --db test.db
# matching genomes in DB 'test.db' sharing the "Nelly" and the "Erik" mutation but not
# belonging to the B.1.1.7 lineage
./sonar.py match --profile S:N501Y S:E484K --LINEAGE ^B.1.1.7 --db test.db
Export to CSV/TSV/VCF file
covSonar can return results in different formats: --format ["csv", "tsv", "vcf"]
# example command
./sonar.py match --profile S:N501Y S:E484K --LINEAGE ^B.1.1.7 --db test.db --format csv -o out.csv
# in vcf format
./sonar.py match -i S:N501Y S:E484K --lineage Q.1 --db test.db --format vcf -o out.vcf
# example of --sample-file
./sonar.py match --sample-file accessions.txt --db test.db --format vcf -o out.vcf
Parent-Child relationship
⚠️ This function has only been tested on SARS-CoV-2.
If we want to search all sublineages of a given lineage, covSonar offers --with-sublineage PROP_COLUMN (PROP_COLUMN is the name of the property that we added to our database).
./sonar.py match --profile S:E484K --LINEAGE B.1.1.7 --with-sublineage LINEAGE --count --db test.db --debug
This query will return results ('B.1.1.7', 'Q.4', 'Q.5', 'Q.3', 'Q.6', 'Q.1', 'Q.7', 'Q.2', 'Q.8').
By default, we use SARS-CoV-2 lineages for this search, and the file name must be lineage.all.tsv.
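The idea behind a sublineage search can be sketched as a walk over a parent-to-children mapping. This is a minimal illustration only: the function name, the dictionary layout and the toy data are assumptions made for this example and do not reflect the layout of lineage.all.tsv or covSonar's implementation.

```python
from collections import deque
from typing import Dict, List


def with_sublineages(lineage: str, children: Dict[str, List[str]]) -> List[str]:
    """Return the lineage followed by all of its descendants (breadth-first)."""
    found = [lineage]
    seen = {lineage}
    queue = deque([lineage])
    while queue:
        current = queue.popleft()
        for child in children.get(current, []):
            if child not in seen:  # guard against repeated entries
                seen.add(child)
                found.append(child)
                queue.append(child)
    return found


if __name__ == "__main__":
    # Hypothetical excerpt of a lineage tree, invented for this example.
    children = {
        "B.1.1": ["B.1.1.7"],
        "B.1.1.7": ["Q.1", "Q.2", "Q.3"],
    }
    print(with_sublineages("B.1.1.7", children))
```

The expanded list can then be used like a set of alternative lineage values in the query.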
lineage-update function for SARS-CoV-2 (COVID-19) ❗
Run the update-lineage-info subcommand to download the latest version of the lineages from https://github.com/cov-lineages/pango-designation/ and install it in lib/lineage.all.tsv
# example command
./sonar.py update-lineage-info
3.5 Show information about the sonar system and database (info)
The info command shows detailed information about the sonar system in use (e.g. version, reference, number of imported genomes, unique sequences, available metadata).
# Show information about the sonar system in use and the database 'test.db'
./sonar.py info --db test.db
3.6 Restore genome sequences from the database (restore)
Genome sequences can be restored from the database based on their accessions.
The restored sequences are combined with their original FASTA header and shown on the screen. The screen output can easily be redirected to a file by using >.
# Restore genome sequences linked to accessions 'mygenome1' and 'mygenome2' from the
# database 'test.db' and write these to a fasta file named 'restored.fasta'
./sonar.py restore --sample mygenome1 mygenome2 --db test.db > restored.fasta
# as before, but consider all accessions from 'accessions.txt' (the file has to
# contain one accession per line)
./sonar.py restore --sample-file accessions.txt --db test.db > restored.fasta
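An accession file in the one-accession-per-line layout mentioned above could be produced with a short helper such as the following. This is a sketch only: both function names are hypothetical, dropping blank and repeated entries is a choice made for this example, and the command list merely mirrors the restore call shown above without running it.

```python
from pathlib import Path
from typing import Iterable, List


def write_accessions(path: str, accessions: Iterable[str]) -> int:
    """Write unique, non-empty accessions to `path`, one per line.

    Returns the number of accessions written. Order of first appearance is kept.
    """
    cleaned = (a.strip() for a in accessions)
    unique = list(dict.fromkeys(a for a in cleaned if a))
    Path(path).write_text("".join(f"{a}\n" for a in unique), encoding="utf-8")
    return len(unique)


def restore_command(db: str, sample_file: str) -> List[str]:
    """Build the argument list for the restore call documented above."""
    return ["./sonar.py", "restore", "--sample-file", sample_file, "--db", db]


if __name__ == "__main__":
    count = write_accessions("accessions.txt", ["mygenome1", "mygenome2", "mygenome1"])
    print(count, "accessions written")
    print(" ".join(restore_command("test.db", "accessions.txt")))
```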
3.7 Database management (db-upgrade, optimize)
Sometimes you might need the optimize command to clean up after database operations (e.g., unused data blocks or storage overhead).
# optimize the database 'test.db'
./sonar.py optimize --db test.db
When the newest version of covSonar is used with an old database version, covSonar returns the following error:
Compatibility error: the given database is not compatible with this version of sonar (Current database version: XXX; Supported database version: XXX)
Please run 'sonar.py db-upgrade' to upgrade database
We provide our database upgrade assistant to solve the problem.
# RUN
./sonar.py db-upgrade --db test.db
# Output
Warning: Backup db file before upgrading, Press Enter to continue...
## after pressing the Enter key
Current version: 3 Upgrade to: 4
Perform the Upgrade: file: mydb.db
Database now version: 4
Success: Database upgrade was successfully completed
⚠️ Warning: Back up the db file before upgrading.
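One way to take such a backup is a timestamped copy of the database file. The helper below is a sketch, not part of covSonar: the function name and the .bak naming scheme are made up for this example, and it assumes that the database is a single file that no other process is writing to while it is copied.

```python
import shutil
import time
from pathlib import Path


def backup_database(db_path: str) -> Path:
    """Copy the database file next to itself with a timestamped .bak suffix."""
    source = Path(db_path)
    if not source.is_file():
        raise FileNotFoundError(f"database file not found: {source}")
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = source.with_name(f"{source.name}.{stamp}.bak")
    # copy2 copies the content and, where possible, the file metadata.
    shutil.copy2(source, target)
    return target


if __name__ == "__main__":
    import sys

    if len(sys.argv) == 2:
        print("backup written to", backup_database(sys.argv[1]))
    else:
        print("usage: python backup_db.py <database file>")
```

After the copy has been written, ./sonar.py db-upgrade --db test.db can be run as shown above.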
How to contribute 🏗️
covSonar has been very carefully programmed and tested, but is still in an early stage of development. You can contribute to this project by reporting problems 🐛 or submitting feature requests in the issue section 👨‍💻
Your feedback is very welcome 👨‍🔧!
Contact
For business inquiries or professional support requests 🍺 please contact Dr. Stephan Fuchs