Project description

sarscov2vec applies continuous vector space representation to the genomes of novel coronavirus species as a genome feature extraction step, in order to distinguish the five most common SARS-CoV-2 variants (Alpha, Beta, Delta, Gamma, Omicron) with a supervised machine learning model.

In this research we used 367,004 unique and complete genome sequence records from the official virus repositories. The datasets prepared for this research had balanced classes. A subset of 25,000 sequences from the final dataset was randomly selected and used to train the Natural Language Processing (NLP) algorithm. A further 36,365 samples, unseen during the embedding training sessions, were processed by the machine learning pipeline. Each SARS-CoV-2 variant was represented by 12,000 samples from different parts of the world. Keeping the embedding data separate from the classifier data was crucial to prevent data leakage, which is a common problem in NLP.
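The separation between embedding data and classifier data can be illustrated with a minimal sketch. It is not the authors' actual sampling code: the function name `split_disjoint`, the seed, and the toy sample IDs are assumptions made only for illustration.

```python
import random


def split_disjoint(sample_ids, n_embedding, n_classifier, seed=42):
    """Randomly draw two non-overlapping subsets of sample IDs."""
    # Deduplicate and sort so the result depends only on the seed.
    unique_ids = sorted(set(sample_ids))
    if n_embedding + n_classifier > len(unique_ids):
        raise ValueError("not enough unique samples for both subsets")
    rng = random.Random(seed)
    shuffled = rng.sample(unique_ids, len(unique_ids))
    embedding_ids = shuffled[:n_embedding]
    classifier_ids = shuffled[n_embedding:n_embedding + n_classifier]
    return embedding_ids, classifier_ids


# Toy example with hypothetical IDs; the study itself used 25,000 and 36,365 samples.
ids = [f"sample_{i}" for i in range(100)]
embedding_ids, classifier_ids = split_disjoint(ids, n_embedding=40, n_classifier=60)
```

Because both subsets are slices of one shuffled list, no genome used to train the embedding can reach the classifier's training or test data.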

Our results show that the final hyperparameter-tuned machine learning model achieved 99% accuracy on the test set. Furthermore, this study demonstrated that the continuous vector space representation of SARS-CoV-2 genomes can be decomposed into a 2D vector space and visualized as a way of explaining the machine learning model's decisions.

The proposed methodology, wrapped in the sarscov2vec package, provides a new alignment-free, AI-aided bioinformatics tool that distinguishes SARS-CoV-2 variants based solely on their genome sequences. Importantly, the results serve as a proof of concept that the presented approach can also be applied to understanding the genomic diversity of other pathogens.

Table of Contents

Modules | Installation | Contributions | Have a question? | Found a bug? | Team | Change log | License | Cite

Modules

fastText NLP model

Filename: ffasttext_unsupervised_kmer7_25k_samples.28.02.2022.bin
SHA256 checksum: 44f789dcb156201dac9217f8645d86ac585ec24c6eba68901695dc254a14adc3
Variants: Alpha, Beta, Delta, Gamma, Omicron (BA.1)
Description: fastText unsupervised model trained on 7-mer tokens extracted from 25,000 unique SARS-CoV-2 samples.
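As a rough illustration of what 7-mer tokens look like, the sketch below splits a nucleotide sequence into overlapping k-mers. The helper name `kmer_tokens`, the stride of 1, and the uppercase normalization are assumptions; the package may tokenize genomes differently.

```python
def kmer_tokens(sequence, k=7):
    """Split a nucleotide sequence into overlapping k-mers (stride 1, an assumption)."""
    sequence = sequence.upper()
    # A sequence shorter than k yields an empty list.
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


# Short made-up fragment, not a real SARS-CoV-2 sequence.
tokens = kmer_tokens("ATGGCTAGCTA", k=7)
print(" ".join(tokens))
# ATGGCTA TGGCTAG GGCTAGC GCTAGCT CTAGCTA
```

Joining the tokens with spaces turns a genome into a "sentence" of "words", which is the input form that NLP embedding models such as fastText expect.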

Machine Learning model and label encoder

Filename: svm_supervised_36k_samples.28.02.2022.joblib
SHA256 checksum: 70abd23b0181786d4ab8e06ea23bd14641f509c13db58c7f2fa2baea17aa42af
Variants: Alpha, Beta, Delta, Gamma, Omicron (BA.1, BA.2)
Description: SVM supervised model trained and tested using 36,365 unique SARS-CoV-2 samples. Each genome sample was transformed by the fastText model (28.02.2022).

Filename: label_encoder_36k_samples.28.02.2022.joblib
SHA256 checksum: 7cb654924f69de6efbf6f409efd91af05874e1392220d22b9883d36c17b366c9
Variants: Alpha, Beta, Delta, Gamma, Omicron (BA.1, BA.2)
Description: Labels extracted from 36,365 unique SARS-CoV-2 samples (28.02.2022).
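The published SHA256 checksums can be used to confirm that a downloaded model file is intact before loading it. The following is a minimal sketch using only the Python standard library; the helper names and the local file path in the usage comment are assumptions.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(path, expected_hex):
    """Compare a file's digest with the checksum listed above."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Usage, assuming the file was downloaded to the current directory:
# verify_checksum(
#     "svm_supervised_36k_samples.28.02.2022.joblib",
#     "70abd23b0181786d4ab8e06ea23bd14641f509c13db58c7f2fa2baea17aa42af",
# )
```

Checking the digest is especially worthwhile for the .joblib files, because loading a serialized Python object from an untrusted or corrupted file can execute arbitrary code.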

Installation and usage

sarscov2vec package

sarscov2vec requires Python 3.8.0+ to run and can be installed by running:

pip install sarscov2vec

If you can't wait for the latest hotness from the develop branch, then install it directly from the repository:

pip install git+https://github.com/ptynecki/sarscov2vec.git@develop

Package examples are available in the notebooks directory.
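For orientation, the sketch below shows how the three published files could fit together at prediction time. It is not the sarscov2vec API: the function `predict_variant` is hypothetical, and the space-separated overlapping 7-mer input format is an assumption. It relies only on methods that the fastText Python bindings (`get_sentence_vector`) and scikit-learn estimators and encoders (`predict`, `inverse_transform`) provide; the notebooks remain the authoritative reference.

```python
def predict_variant(genome, embedder, classifier, label_encoder, k=7):
    """Embed one genome as a vector and return the predicted variant label."""
    genome = genome.upper()
    # Assumption: the embedder expects a space-separated string of overlapping k-mers.
    document = " ".join(genome[i:i + k] for i in range(len(genome) - k + 1))
    vector = embedder.get_sentence_vector(document)
    # scikit-learn estimators expect a 2D input: one row per sample.
    encoded = classifier.predict([vector])
    return label_encoder.inverse_transform(encoded)[0]


# Usage, assuming the model files listed above are in the current directory
# and that the fasttext and joblib packages are installed:
# import fasttext, joblib
# embedder = fasttext.load_model("ffasttext_unsupervised_kmer7_25k_samples.28.02.2022.bin")
# classifier = joblib.load("svm_supervised_36k_samples.28.02.2022.joblib")
# label_encoder = joblib.load("label_encoder_36k_samples.28.02.2022.joblib")
# print(predict_variant(genome_sequence, embedder, classifier, label_encoder))
```

Passing the loaded objects in as arguments keeps the function independent of file locations and makes it easy to exercise with small stand-in objects.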

Contributions

Development on the latest stable version of Python 3+ is preferred. As of this writing it's 3.8. You can use any operating system.

If you're fixing a bug or adding a new feature, add a test with pytest and check the code with Black and mypy. Before adding any large feature, first open an issue for us to discuss the idea with the core devs and community.

Have a question?

If you have a private question or want to cooperate with us, you can always reach out to us directly by email.

Found a bug?

Feel free to open a new issue with an appropriate title and description on the sarscov2vec repository. If you have already found a solution to your problem, we would be happy to review your pull request.

Team

Researchers contributing to sarscov2vec:

  • Piotr Tynecki (Faculty of Computer Science, Bialystok University of Technology, Bialystok, Poland)
  • Marcin Lubocki (Laboratory of Virus Molecular Biology, Intercollegiate Faculty of Biotechnology, University of Gdansk, Medical University of Gdańsk, Gdansk, Poland)

Change log

The change log will become rather long, so it has been moved to its own file.

See CHANGELOG.md.

License

The sarscov2vec package is released under the terms of the MIT License.

Cite

Application of continuous embedding of viral genome sequences and machine learning in the prediction of SARS-CoV-2 variants

Tynecki, P.; Lubocki, M.

Computer Information Systems and Industrial Management. CISIM 2022. Lecture Notes in Computer Science, Springer
