A database-driven system for handling genomic sequences and screening genomic profiles.

These details have not been verified by PyPI

Project links

GitHub Statistics

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Project description

covSonar2

covSonar is a database-driven system for handling genomic sequences and screening genomic profiles.

What's new in covSonar V.2

New design
- Improve workflows
- Performance improvements
Exciting new features
- Support multiple pathogens
- Flexible in adding meta information
New database design
- New database schema
- Retrieval efficiency
- Significantly smaller than the previous version

1. Prerequisites / Setup

covSonar2 has some software-environmental requirements that can most easily be met by building a custom conda environment.

Proceed as follows to install covSonar:

# download the repository to the current working directory using git
git clone https://github.com/rki-mf1/covsonar
# build the custom software environment using conda [recommended]
conda env create -n sonar2 -f covsonar/sonar.env.yml
# activate the conda evironment if built
conda activate sonar2
# testing
./covsonar/test.sh

3. Usage

In covSonar2 there are several tools that can be called via subcommands.

subcommand	purpose
setup	setup a new database.
import	import genome sequences and sample information to the database
list-prop	view sample properties added to the database
add-prop	add a sample property to the database
delete-prop	delete a sample property from the database
match	get mutations profiles for given accessions
restore	restore sequence(s) from the database
info	show software and database info.
optimize	optimizes the database
db-upgrade	upgrade a database to the latest version
update-lineage-info	download latest lineage information

Each tool provides a help page that can be accessed with the -h option.

# display help page
./sonar.py -h
# display help page for each tool
./sonar.py import -h

3.1 Setup database (setup ⛽)

First, we have to create a new database instance.

./sonar.py setup --db test.db

, or we can create a new database instance with predefined properties.

./sonar.py setup --db test.db --auto-create

By default, the MN908947.3 (SARS-CoV-2) is used as a reference. If we want to set up database for a different pathogen, we can add --gbk following with Genbank file.

example;

./sonar.py setup --db test.db --auto-create --gbk Ebola.gb

note 📌: how to download genbank file

3.2 Property management (list-prop, add-prop, delete-prop)

In covSonar2, users now can arbitrarily add meta information or properties into a database to fit a specific project objective.s

To view the added properties, we can use the list-prop command to display all information.

./sonar.py list-prop --db test.db

To add more properties, we can use the add-prop command to add meta information into the database.

The required arguments are listed below when we use add-prop

--name, name of sample property
--descr, description of the new property
--dtype, data type of the new property (e.g., 'integer', 'float', 'text', 'date', 'zip')

./sonar.py add-prop --db test.db  --name SEQ_REASON --dtype text --descr "seq. reason"

tip 🕯️: ./sonar.py add-prop -h to see all available arguments

The delete-prop command is used to delete an unwanted property from the database.

./sonar.py delete-prop --db test.db  --name SEQ_REASON

The program will ask an user for confirmation of the action.

Do you really want to delete this property? [YES/no]: YES

3.3 Adding genomes and meta information to the database (import)

Add sequence with meta information

./sonar.py import --db test.db --fasta valid.fasta --tsv day.tsv --threads 64 --cache tmp_cache --cols sample=IMS_ID

Update more

example:

./sonar.py import --db test.db --fasta valid.fasta --tsv day.tsv --threads 64 --cache tmp_cache --cols sample=IMS_ID

./sonar.py import --db test.db --fasta valid.fasta --tsv day.tsv --threads 64 --cache tmp_cache --cols sample=IMS_ID

sample

3.4 Query genome sequences based on profiles (match)

Genomic profiles can be defined to align genomes. For this purpose, the variants related to the complete genome of the SARS-CoV-2 isolate Wuhan-Hu-1 (NC_045512.2) must be expressed as follows:

type	nucleotide level	amino acid level
SNP	ref_nuc followed by ref_pos followed by alt_nuc (e.g. A3451T)	protein_symbol:ref_aa followed by ref_pos followed by alt_aa (e.g. S:N501Y)
deletion	del:ref_pos:length_in_bp (e.g. del:3001:8)	protein_symbol:del:ref_pos:length_in_aa (e.g. ORF1ab:del:3001:21)
insertion	ref_nuc followed by ref_pos followed by alt_nucs (e.g. A3451TGAT)	protein_symbol:ref_aa followed by ref_pos followed by alt_aas (e.g. N:A34AK)

The positions refer to the reference (first nucleotide in the genome is position 1). Using the option --profile, multiple variant definitions can be combined into a nucleotide, amino acid or mixed profile, which means that matching genomes must have all those variations in common. In contrast, alternative variations can be defined by multiple --profile options. As an example, --profile S:N501Y S:E484K matches genomes sharing the Nelly AND Erik variation while --profile S:N501Y --profile S:E484K matches to genomes that share either the Nelly OR Erik variation OR both. Accordingly, using the option ^ profiles can be defined that have not to be present in the matched genomes.

There are additional options to adjust the matching.

option	description
--count	count matching genomes only

example;

./sonar.py match --profile S:E484K --LINEAGE B.1.1.7 --db test.db

# matching B.1.1.7 genomes in DB 'test.db' that share an additional "Erik" mutation
./sonar.py match --profile S:E484K --LINEAGE B.1.1.7 --db test.db

# as before but matching genomes are counted only
./sonar.py match --profile S:E484K --LINEAGE B.1.1.7 --count --db test.db

# matching genomes in DB 'test.db' sharing the "Nelly" mutation
# and that were sampled in 2020
./sonar.py match --profile S:N501Y  --DATE 2020-01-01:2020-12-31 --db test.db

# matching genomes in DB 'mydb' sharing the "Nelly" and the "Erik" mutation but not
# belonging to the B.1.1.7 lineage
./sonar.py match -profile S:N501Y S:E484K --LINEAGE ^B.1.1.7 --db test.db

Export to CSV/TSV/VCF file

covSonar can return results in different formats: --format ["csv", "tsv", "vcf"]

# example command
./sonar.py match --profile S:N501Y S:E484K --LINEAGE ^B.1.1.7 --db test.db --format csv -o out.csv

# in vcf format
./sonar.py match -i S:N501Y S:E484K --lineage Q.1 --db test.db --format vcf -o out.vcf

# example of --sample-file
./sonar.py match --sample-file accessions.txt --db test.db --format vcf -o out.vcf

Parent-Child relationship

⚠️ This function we only test on SARS-CoV-2

If we want to search all sublineages with a given lineage, covSonar offers --with-sublineage PROP_COLUMN (PROP_COLUMN means the property name that we added to our database).

./sonar.py match --profle S:E484K --LINEAGE B.1.1.7 --with-sublineage LINEAGE --count --db test.db --debug

This query will return results ('B.1.1.7', 'Q.4', 'Q.5', 'Q.3', 'Q.6', 'Q.1', 'Q.7', 'Q.2', 'Q.8').

By default, we use SARS-CoV-2 lineages for this search and the file name must be lineage.all.tsv.

lineage-update function for SARS-CoV-2 (COVID-19) ❗

Run update-lineage-info flag, it will download the latest version of lineages from https://github.com/cov-lineages/pango-designation/ and install it in lib/lineage.all.tsv

# example command
./sonar.py update-lineage-info

3.5 Show infos about the used sonar system and database (info)

Detailed infos about the used sonar system (e.g. version, reference, number of imported genomes, unique sequences, available metadata).

# Show infos about the used sonar system and database 'test.db'
./sonar.py info --db test.db

3.6 Restore genome sequences from the database (restore)

Genome sequences can be restored from the database based on their accessions. The restored sequences are combined with their original FASTA header and shown on the screen. The screen output can be redirected to a file easily by using >.

# Restore genome sequences linked to accessions 'mygenome1' and 'mygenome2' from the
# database 'test.db' and write these to a fasta file named 'restored.fasta'
./sonar.py restore --sample mygenome1 mygenome2 --db test.db > restored.fasta
# as before, but consider all accessions from 'accessions.txt' (the file has to
# contain one accession per line)
./sonar.py restore --sample-file accessions.txt --db test.db > restored.fasta

3.7 Database management (db-upgrade, optimize)

Sometimes you might need the optimize command to clean the problems from database operation (e.g., unused data block or storage overhead ).

# Show infos about the used sonar system
./sonar.py optimize  --db test.db

When the newest version of covSonar use an old database version, covSonar will return the following error;

Compatibility error: the given database is not compatible with this version of sonar (Current database version: XXX; Supported database version: XXX)
Please run 'sonar.py  db-upgrade' to upgrade database

We provide our database upgrade assistant to solve the problem.

# RUN
./sonar.py db-upgrade --db test.db

# Output
Warning: Backup db file before upgrading, Press Enter to continue...

## after pressing the Enter key
Current version: 3  Upgrade to: 4
Perform the Upgrade: file: mydb.db
Database now version: 4
Success: Database upgrade was successfully completed

⚠️ Warning: Backup the db file before upgrade.

How to contribute 🏗️

covSonar has been very carefully programmed and tested, but is still in an early stage of development. You can contribute to this project by reporting problems 🐛 or writing feature requests to the issue section 👨‍💻

Your feedback is very welcome 👨‍🔧!

Contact

For business inquiries or professional support requests 🍺 please contact Dr. Stephan Fuchs

Project details

These details have not been verified by PyPI

Project links

GitHub Statistics

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Release history Release notifications | RSS feed

This version

2.0.0a1 pre-release

Jul 6, 2022

2.0.0a0 pre-release

Jul 4, 2022

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

covsonar-2.0.0a1.tar.gz (125.2 kB view hashes)

Uploaded Jul 6, 2022 Source

Built Distribution

covsonar-2.0.0a1-py3-none-any.whl (136.6 kB view hashes)

Uploaded Jul 6, 2022 Python 3

Hashes for covsonar-2.0.0a1.tar.gz

Hashes for covsonar-2.0.0a1.tar.gz
Algorithm	Hash digest
SHA256	`4c9e9c0634460a7b349217e160c8aeaf42af217b7eb0d084d951b82bab8bf66d`
MD5	`7c01d5792191b3ec48078fd525f7b25f`
BLAKE2b-256	`dbb62994c60a7454aa2139f85fcaac0af5c115ca0ddea0ec5da23bca45daa061`

Hashes for covsonar-2.0.0a1-py3-none-any.whl

Hashes for covsonar-2.0.0a1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`98008d2ae61e6bc7cb241138af4f8675ede1c1b283cfe68d58e6f968c93068fa`
MD5	`ac19d5c1e882554366068a32a3b44e5d`
BLAKE2b-256	`928659c434ad290cb8c5dde9de5fd28162230249ade23ec12f6c970fc02b33d0`