Genomic Benchmarks

Genomic Benchmarks 🧬🏋️✔️

This repository collects benchmarks for the classification of genomic sequences. It is shipped as a Python package, together with functions that help you download and manipulate the datasets and train neural network (NN) models.

Install

Genomic Benchmarks can be installed as follows:

pip install genomic-benchmarks

To use it with papermill, TensorFlow (TF), or PyTorch, install the corresponding dependencies. The version specifiers are quoted so that the shell does not interpret `>` as an output redirection:

# if you want to use jupyter and papermill
pip install "jupyter>=1.0.0"
pip install "papermill>=2.3.0"

# if you want to train NN with TF
pip install "tensorflow>=2.6.0"
pip install tensorflow-addons
pip install typing-extensions --upgrade  # fixing TF installation issue

# if you want to train NN with torch
pip install "torch>=1.10.0"
pip install torchtext
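
To see which of the optional packages above are already available in the current environment, a small check like the following may help. It is only a sketch: the helper names `is_installed` and `check_optional_dependencies` are hypothetical, and the import names in `OPTIONAL_MODULES` are assumed to correspond to the pip packages listed above.

```python
from importlib.util import find_spec

# Import names assumed to match the optional pip packages listed above.
OPTIONAL_MODULES = [
    "jupyter",
    "papermill",
    "tensorflow",
    "tensorflow_addons",
    "torch",
    "torchtext",
]


def is_installed(module_name):
    """Return True if the module can be located without importing it."""
    try:
        return find_spec(module_name) is not None
    except (ImportError, ValueError):
        # Treat any lookup problem as "not available".
        return False


def check_optional_dependencies(module_names):
    """Map each module name to True (found) or False (missing)."""
    return {name: is_installed(name) for name in module_names}


for name, available in check_optional_dependencies(OPTIONAL_MODULES).items():
    print(f"{name}: {'installed' if available else 'missing'}")
```

The check only locates the modules; it does not import them or verify their versions.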

For package development, use Python 3.8 (ideally 3.8.9) and the installation described here.

Usage

Get the list of all datasets with the list_datasets function:

>>> from genomic_benchmarks.data_check import list_datasets
>>> 
>>> list_datasets()
['demo_coding_vs_intergenomic_seqs', 'demo_human_or_worm', 'dummy_mouse_enhancers_ensembl', 'human_enhancers_cohn', 'human_enhancers_ensembl', 'human_ensembl_regulatory',  'human_nontata_promoters', 'human_ocr_ensembl']

You can get basic information about a benchmark with the info function:

>>> from genomic_benchmarks.data_check import info
>>> 
>>> info("human_nontata_promoters", version=0)
Dataset `human_nontata_promoters` has 2 classes: negative, positive.

All lenghts of genomic intervals equals 251.

Totally 36131 sequences have been found, 27097 for training and 9034 for testing.
          train  test
negative  12355  4119
positive  14742  4915
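
The counts in the table above can be checked with a few lines of arithmetic, which also gives the class balance that a trivial majority-class baseline would achieve. This is a sketch that copies the numbers from the output above by hand; the helper names `split_totals` and `class_fractions` are hypothetical and not part of the package.

```python
# Per-class counts copied from the info() output above.
counts = {
    "train": {"negative": 12355, "positive": 14742},
    "test": {"negative": 4119, "positive": 4915},
}


def split_totals(counts):
    """Total number of sequences in each split."""
    return {split: sum(per_class.values()) for split, per_class in counts.items()}


def class_fractions(counts, split):
    """Fraction of each class within one split."""
    total = sum(counts[split].values())
    return {label: n / total for label, n in counts[split].items()}


totals = split_totals(counts)
print(totals)                # sequences per split
print(sum(totals.values()))  # all sequences in the benchmark
print(class_fractions(counts, "train"))
print(class_fractions(counts, "test"))
```

In both splits the positive class makes up roughly 54% of the sequences, so a classifier should clearly exceed that accuracy to be useful.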

The function download_dataset downloads the full-sequence form of the required benchmark (split into train and test sets, with one folder for each class). Unless specified otherwise, the data is stored in the .genomic_benchmarks subfolder of your home directory. By default, the dataset is obtained from our cloud cache (use_cloud_cache=True).

>>> from genomic_benchmarks.loc2seq import download_dataset
>>> 
>>> download_dataset("human_nontata_promoters", version=0)
Downloading 1VdUg0Zu8yfLS6QesBXwGz1PIQrTW3Ze4 into /home/petr/.genomic_benchmarks/human_nontata_promoters.zip... Done.
Unzipping...Done.
PosixPath('/home/petr/.genomic_benchmarks/human_nontata_promoters')
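
After the download, the folder layout described above (one folder per split, one subfolder per class) can be inspected with the standard library alone. The following is a minimal sketch, assuming each sequence is stored as one file inside its class folder; the helper name `count_sequences` is hypothetical and not part of the package.

```python
from pathlib import Path


def count_sequences(dataset_dir):
    """Count files per split and class under dataset_dir/<split>/<class>/."""
    counts = {}
    for split_dir in sorted(p for p in Path(dataset_dir).iterdir() if p.is_dir()):
        counts[split_dir.name] = {
            class_dir.name: sum(1 for f in class_dir.iterdir() if f.is_file())
            for class_dir in sorted(p for p in split_dir.iterdir() if p.is_dir())
        }
    return counts


# Default download location mentioned above; skipped if nothing was downloaded.
dataset_dir = Path.home() / ".genomic_benchmarks" / "human_nontata_promoters"
if dataset_dir.is_dir():
    print(count_sequences(dataset_dir))
```

If the assumption about the layout holds, the result should agree with the train and test counts reported by info.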

Getting a TensorFlow Dataset for the benchmark and displaying samples is straightforward:

>>> from pathlib import Path
>>> import tensorflow as tf
>>> 
>>> BATCH_SIZE = 64
>>> SEQ_TRAIN_PATH = Path.home() / '.genomic_benchmarks' / 'human_nontata_promoters' / 'train'
>>> CLASSES = ['negative', 'positive']
>>> 
>>> train_dset = tf.keras.preprocessing.text_dataset_from_directory(
...     directory=SEQ_TRAIN_PATH,
...     batch_size=BATCH_SIZE,
...     class_names=CLASSES)
Found 27097 files belonging to 2 classes.
>>> 
>>> list(train_dset)[0][0][0]
<tf.Tensor: shape=(), dtype=string, numpy=b'TCCTGCCTTTCCACTTGCACCAGTTTTCCCACCCCAGCCTCAGGGCGGGGCTGCCTCGTCACTTGTCTCGGGGCAGATCTGCCCTACACACGTTAGCGCCGCGCGCAAAGCAGCCCCGCAGCACCCAGGCGCCTCCTGGCGGCGCCGCGAAGGGGCGGGGCTGTCGGCTGCGCGTTGTGCGCTGTCCCAGGTTGGAAACCAGTGCCCCAGGCGGCGAGGAGAGCGGTGCCTTGCAGGGATGCTGCGGGCGG'>

See How_To_Train_CNN_Classifier_With_TF.ipynb for a more detailed description of how to train a CNN classifier with TensorFlow.
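
The samples shown above are raw nucleotide strings, so they need a numeric encoding before a CNN can consume them. The notebook may well use a different approach; the following is only a sketch of one common option, one-hot encoding, in plain Python. The function name `one_hot_encode` and the choice to map characters outside A, C, G, T (such as N) to an all-zero row are assumptions made for illustration.

```python
NUCLEOTIDES = "ACGT"  # column order of the one-hot vectors (an assumption)


def one_hot_encode(sequence):
    """Encode a DNA string as a list of 4-element rows, one row per base."""
    index = {nt: i for i, nt in enumerate(NUCLEOTIDES)}
    encoded = []
    for nt in sequence.upper():
        row = [0] * len(NUCLEOTIDES)
        if nt in index:
            row[index[nt]] = 1  # unknown characters keep an all-zero row
        encoded.append(row)
    return encoded


print(one_hot_encode("TCCTG"))
```

For a 251-base sequence from human_nontata_promoters this yields a 251 x 4 matrix, which can be converted to a tensor in the framework of your choice.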

Getting a PyTorch Dataset and displaying samples is also easy:

>>> from genomic_benchmarks.dataset_getters.pytorch_datasets import HumanNontataPromoters
>>> 
>>> dset = HumanNontataPromoters(split='train', version=0)
>>> dset[0]
('CAATCTCACAGGCTCCTGGTTGTCTACCCATGGACCCAGAGGTTCTTTGACAGCTTTGGCAACCTGTCCTCTGCCTCTGCCATCATGGGCAACCCCAAAGTCAAGGCACATGGCAAGAAGGTGCTGACTTCCTTGGGAGATGCCATAAAGCACCTGGATGATCTCAAGGGCACCTTTGCCCAGCTGAGTGAACTGCACTGTGACAAGCTGCATGTGGATCCTGAGAACTTCAAGGTGAGTCCAGGAGATGT', 0)

See How_To_Train_CNN_Classifier_With_Pytorch.ipynb for a more detailed description of how to train a CNN classifier with PyTorch.
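
As the sample above shows, each item is a (sequence, label) pair. A quick sanity check of the label distribution can therefore be written against any object that supports len() and integer indexing. This is a sketch with a hypothetical helper name, `label_counts`, demonstrated on a small hand-made list rather than on the real dataset.

```python
from collections import Counter


def label_counts(dataset):
    """Count labels in an indexable collection of (sequence, label) pairs."""
    return Counter(dataset[i][1] for i in range(len(dataset)))


# Tiny stand-in for a dataset; the sequences here are made up for illustration.
toy_dataset = [("ACGT", 0), ("TTGA", 1), ("GGCC", 1)]
print(label_counts(toy_dataset))
```

Assuming the dataset object implements len() and indexing in the usual way, the same call should also work on `dset` from the example above.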

Structure of the package

  • datasets: Each folder is one benchmark dataset (or a set of benchmarks in subfolders); see README.md for the format specification
  • docs: Each folder contains a Python notebook that was used to create the dataset
  • experiments: Training of simple neural network models for each benchmark dataset; these can be used as baselines
  • notebooks: Main use cases demonstrated in the form of Jupyter notebooks
  • src/genomic_benchmarks: Python module for dataset manipulation (downloading, checking, etc.)
  • tests: Unit tests for pytest and pytest-cov

How to contribute

How to contribute a model

If you beat our current best model on any dataset, or just came up with an interesting new idea, let us know about it: make your code publicly available (GitHub repo, Colab...) and fill in the form at

https://forms.gle/pvkkrgHNCNmAAC1TA

How to contribute a dataset

If you have an interesting genomic dataset, send us an issue with a description and, if possible, a link to the data (e.g. a BED file and a FASTA reference). In the future, we will provide functions to make the import easy.

If you are a hero, read the specification of our dataset format and send us a pull request with new datasets/[YOUR_DATASET_NAME] and docs/[YOUR_DATASET_NAME] folders.

How to improve code in this package

We welcome new code contributors. If you see a bug, send us an issue with a minimal reproducible example. Or even better, fix the bug and send us a pull request.

Citing Genomic Benchmarks

If you use Genomic Benchmarks in your research, please cite it as follows.

Text

GRESOVA, Katarina, et al. Genomic Benchmarks: A Collection of Datasets for Genomic Sequence Classification. bioRxiv, 2022.

BibTeX

@article{gresova2022genomic,
  title={Genomic Benchmarks: A Collection of Datasets for Genomic Sequence Classification},
  author={Gresova, Katarina and Martinek, Vlastimil and Cechak, David and Simecek, Petr and Alexiou, Panagiotis},
  journal={bioRxiv},
  year={2022},
  publisher={Cold Spring Harbor Laboratory},
  url={https://www.biorxiv.org/content/10.1101/2022.06.08.495248}
}
