sarscov2vec applies continuous vector space representation to the genomes of novel coronavirus species as a genome feature extraction step, in order to distinguish the most common SARS-CoV-2 variants with a supervised machine learning model.
Project description
sarscov2vec applies continuous vector space representation to the genomes of novel coronavirus species as a genome feature extraction step, in order to distinguish the 5 most common SARS-CoV-2 variants (Alpha, Beta, Delta, Gamma, Omicron) with a supervised machine learning model.
In this research we used 367,004 unique and complete genome sequence records from the official virus repositories. The datasets prepared for this research had balanced classes. A subset of 25,000 sequences from the final dataset was randomly selected and used to train the Natural Language Processing (NLP) algorithm. A further 36,365 samples, unseen during embedding training, were processed by the machine learning pipeline. Each SARS-CoV-2 variant was represented by 12,000 samples from different parts of the world. Separating the data used for the embedding from the data used for the classifier was crucial to prevent data leakage, which is a common problem in NLP.
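The separation described above can be illustrated with a minimal sketch. It assumes that every record has a unique identifier and that a single shuffled list is cut into two non-overlapping slices. The function name `split_disjoint`, the seed, and the toy identifiers are hypothetical and are not taken from the sarscov2vec code base.

```python
import random


def split_disjoint(record_ids, n_embedding, n_classifier, seed=0):
    """Return two non-overlapping random samples of record identifiers."""
    if n_embedding + n_classifier > len(record_ids):
        raise ValueError("not enough records for two disjoint subsets")
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(record_ids)
    rng.shuffle(shuffled)
    # Consecutive slices of one shuffled list cannot share an element,
    # provided the identifiers themselves are unique.
    embedding_ids = shuffled[:n_embedding]
    classifier_ids = shuffled[n_embedding:n_embedding + n_classifier]
    return embedding_ids, classifier_ids


# Toy identifiers standing in for genome accession numbers.
ids = [f"genome_{i}" for i in range(100)]
embedding_ids, classifier_ids = split_disjoint(ids, n_embedding=40, n_classifier=60)
print(set(embedding_ids).isdisjoint(classifier_ids))
```

Because both subsets come from one shuffle, no genome used to train the embedding can reappear in the classifier's training or test data.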
Our results show that the final hyperparameter-tuned machine learning model achieved 99% accuracy on the test set. Furthermore, this study demonstrated that the continuous vector space representation of SARS-CoV-2 genomes can be decomposed into a 2D vector space and visualized as a way of explaining the machine learning model's decisions.
The proposed methodology, wrapped in sarscov2vec, provides a new alignment-free, AI-aided bioinformatics tool that distinguishes different SARS-CoV-2 variants based solely on their genome sequences. Importantly, the obtained results serve as a proof of concept that the presented approach can also be applied to understanding the genomic diversity of other pathogens.
Table of Contents
Modules | Installation | Contributions | Have a question? | Found a bug? | Team | Change log | License | Cite
Modules
fastText NLP model
Filename with SHA256 checksum | Variants | Description |
---|---|---|
ffasttext_unsupervised_kmer7_25k_samples.28.02.2022.bin 44f789dcb156201dac9217f8645d86ac585ec24c6eba68901695dc254a14adc3 | Alpha, Beta, Delta, Gamma, Omicron (BA.1) | fastText unsupervised model trained on 7-mer tokens extracted from 25,000 unique SARS-CoV-2 samples |
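To make the notion of 7-mer tokens concrete, here is a minimal sketch of k-mer extraction. It assumes overlapping k-mers taken with a stride of one nucleotide; the source does not state the stride, and `kmer_tokens` is a hypothetical helper rather than part of the package.

```python
def kmer_tokens(sequence, k=7):
    """Split a nucleotide sequence into overlapping k-mers (stride 1)."""
    sequence = sequence.upper()
    # A sequence shorter than k yields an empty list.
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


tokens = kmer_tokens("ATGTTTGTTTT", k=7)
# Joining tokens with spaces gives a "sentence" that a word-embedding
# model such as fastText can consume.
print(" ".join(tokens))
# prints: ATGTTTG TGTTTGT GTTTGTT TTTGTTT TTGTTTT
```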
Machine Learning model and label encoder
Filename with SHA256 checksum | Variants | Description |
---|---|---|
svm_supervised_36k_samples.28.02.2022.joblib 70abd23b0181786d4ab8e06ea23bd14641f509c13db58c7f2fa2baea17aa42af | Alpha, Beta, Delta, Gamma, Omicron (BA.1, BA.2) | SVM supervised model trained and tested using 36,365 unique SARS-CoV-2 samples. Each genome sample was transformed by the fastText model on 28.02.2022. |
label_encoder_36k_samples.28.02.2022.joblib 7cb654924f69de6efbf6f409efd91af05874e1392220d22b9883d36c17b366c9 | Alpha, Beta, Delta, Gamma, Omicron (BA.1, BA.2) | Labels extracted from 36,365 unique SARS-CoV-2 samples on 28.02.2022. |
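The SHA256 checksums listed in the tables can be used to check a downloaded model file before loading it. The sketch below uses only the Python standard library. The helper names `sha256_of_file` and `verify_checksum` are hypothetical, and the local path in the commented usage line is an assumption about where the file was saved.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(path, expected_hex):
    """Return True when the file's digest matches the published checksum."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Usage (the local path is an assumption; the checksum is the one listed above):
# verify_checksum(
#     "svm_supervised_36k_samples.28.02.2022.joblib",
#     "70abd23b0181786d4ab8e06ea23bd14641f509c13db58c7f2fa2baea17aa42af",
# )
```

Checking the digest first matters in particular for `.joblib` files, which are pickle-based and should only be loaded from a trusted source.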
Installation and usage
sarscov2vec package
sarscov2vec requires Python 3.8.0+ to run and can be installed by running:
pip install sarscov2vec
If you can't wait for the latest hotness from the develop branch, then install it directly from the repository:
pip install git+https://github.com/ptynecki/sarscov2vec.git@develop
Package examples are available in the notebooks directory.
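For orientation, the published model files could in principle be combined as in the rough sketch below. This is not the sarscov2vec API, and the notebooks remain the authoritative reference. It assumes that the `fasttext` and `joblib` packages are installed, that the `.bin` file loads with `fasttext.load_model`, that the `.joblib` files hold a scikit-learn style classifier and label encoder, and that a genome is embedded as one space-separated "sentence" of overlapping 7-mers. The function names `load_models` and `predict_variant` are hypothetical.

```python
def load_models(fasttext_path, classifier_path, label_encoder_path):
    """Load the three published files (assumes fasttext and joblib are installed)."""
    import fasttext  # third-party, imported lazily
    import joblib    # third-party, imported lazily

    embedder = fasttext.load_model(fasttext_path)
    classifier = joblib.load(classifier_path)
    label_encoder = joblib.load(label_encoder_path)
    return embedder, classifier, label_encoder


def predict_variant(sequence, embedder, classifier, label_encoder, k=7):
    """Embed one genome and return the predicted variant label."""
    seq = sequence.upper()
    # Assumption: overlapping k-mers joined by spaces form the input sentence.
    sentence = " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))
    vector = embedder.get_sentence_vector(sentence)
    encoded = classifier.predict([vector])  # one sample in, one class index out
    return label_encoder.inverse_transform(encoded)[0]
```

Keeping the loading step separate from the prediction step means the models are read from disk once and can then be reused for many genomes.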
Contributions
Development on the latest stable version of Python 3+ is preferred. As of this writing it's 3.8. You can use any operating system.
If you're fixing a bug or adding a new feature, add a test with pytest and check the code with Black and mypy. Before adding any large feature, first open an issue for us to discuss the idea with the core devs and community.
Have a question?
If you have a private question or want to cooperate with us, you can always reach out to us directly by email.
Found a bug?
Feel free to open a new issue with an appropriate title and description on the sarscov2vec repository. If you have already found a solution to your problem, we would be happy to review your pull request.
Team
Researchers contributing to sarscov2vec:
- Piotr Tynecki (Faculty of Computer Science, Bialystok University of Technology, Bialystok, Poland)
- Marcin Lubocki (Laboratory of Virus Molecular Biology, Intercollegiate Faculty of Biotechnology, University of Gdansk, Medical University of Gdańsk, Gdansk, Poland)
Change log
The change log would become rather long, so it has been moved to its own file.
See CHANGELOG.md.
License
The sarscov2vec package is released under the terms of the MIT License.
Cite
Application of continuous embedding of viral genome sequences and machine learning in the prediction of SARS-CoV-2 variants
Tynecki, P.; Lubocki, M.
Computer Information Systems and Industrial Management. CISIM 2022. Lecture Notes in Computer Science, Springer