DNARecords

Genomics data ML ready.

Transform your genomics datasets (VCF, BGEN, etc.) into a sample-wise format so that you can use them with Deep Learning models.

Installation

The DNARecords package has two main dependencies:

  • Hail, if you are transforming your genomics data into DNARecords
  • Tensorflow, if you are using a previously created DNARecords dataset, for example to train a DL model

As you may know, Tensorflow and Spark do not play very well together on a cluster with more than one machine.

However, the dnarecords package needs to be installed only on the driver machine of a Hail cluster.

For that reason, we recommend following these installation tips.

On a dev environment

$ pip install dnarecords

For further details (or if you run into any trouble), review the Local environments section.

On a Hail cluster or submitting a job to it

You will already have Pyspark installed and will not want to install Tensorflow.

So, just install dnarecords without dependencies on the driver machine.

There will be no missing modules as long as you use only the classes and functionality intended for Spark.

$ /opt/conda/miniconda3/bin/python -m pip install dnarecords --no-deps

Note: this assumes the Hail Python executable is /opt/conda/miniconda3/bin/python

On a Tensorflow environment or submitting a job to it

You will already have Tensorflow installed and will not want to install Pyspark.

So, just install dnarecords without dependencies.

There will be no missing modules as long as you use only the classes and functionality intended for Tensorflow.

$ pip install dnarecords --no-deps
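
Because a --no-deps install leaves it to the environment to provide either Hail/Pyspark or Tensorflow, it can help to check what is importable before choosing which dnarecords classes to use. The following is a minimal sketch using only the Python standard library; the helper name available_backends is hypothetical and is not part of dnarecords.

```python
import importlib.util


def available_backends(candidates=("hail", "pyspark", "tensorflow")):
    """Return the candidate top-level modules that are importable here.

    Hypothetical helper for illustration; not part of the dnarecords API.
    """
    found = []
    for name in candidates:
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(name) is not None:
            found.append(name)
    return found


if __name__ == "__main__":
    # The result depends on the environment the script runs in.
    print(available_backends())
```

On a Hail driver machine one would expect the result to include hail and pyspark, and on a Tensorflow environment to include tensorflow, but this depends entirely on how each environment was set up.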

Working on Google Dataproc

Just use an initialization action that installs dnarecords without dependencies.

$ hailctl dataproc start dnarecords --init gs://dnarecords/dataproc-init.sh

If you need to work with other cloud providers, refer to the Hail docs.

Usage

It is quite straightforward to understand the functionality of the package.

Given some genomics data, you can transform it into a DNARecords Dataset this way:

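import dnarecords as dr


# Initialize Hail through the dnarecords helper and download a subset of the 1000 Genomes dataset
hl = dr.helper.DNARecordsUtils.init_hail()
hl.utils.get_1kg('/tmp/1kg')
mt = hl.read_matrix_table('/tmp/1kg/1kg.mt')
# Compute the expected genotype dosage of each entry from its Phred-scaled likelihoods (PL)
mt = mt.annotate_entries(dosage=hl.pl_dosage(mt.PL))

# Write the dosage field as a DNARecords dataset
dnarecords_path = '/tmp/dnarecords'
writer = dr.writer.DNARecordsWriter(mt.dosage)
writer.write(dnarecords_path, sparse=True, sample_wise=True, variant_wise=True,
             tfrecord_format=True, parquet_format=True,
             write_mode='overwrite', gzip=True)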
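
As a rough mental model of the sparse, sample_wise and variant_wise options, the sketch below turns a tiny dense dosage matrix (rows are variants, columns are samples) into sparse records keyed either by sample or by variant. It is plain Python for illustration only: the function name to_sparse_records and the record layout (key, indices, values) are assumptions, not necessarily the schema that DNARecordsWriter writes.

```python
def to_sparse_records(matrix, by="sample"):
    """Build one sparse record per sample (column) or per variant (row).

    matrix[i][j] is the dosage of variant i for sample j. Zero dosages are
    dropped, which is what makes the representation sparse.
    The key/indices/values layout is illustrative, not the dnarecords schema.
    """
    if by == "sample":
        # Transpose so that each row holds all the variants of one sample.
        rows = [list(col) for col in zip(*matrix)]
    elif by == "variant":
        rows = [list(row) for row in matrix]
    else:
        raise ValueError("by must be 'sample' or 'variant'")

    records = []
    for key, row in enumerate(rows):
        # Keep only the positions that carry a non-zero dosage.
        nonzero = [(i, v) for i, v in enumerate(row) if v != 0]
        records.append({
            "key": key,
            "indices": [i for i, _ in nonzero],
            "values": [v for _, v in nonzero],
        })
    return records


# 3 variants x 2 samples
dosages = [
    [0.0, 2.0],
    [1.0, 0.0],
    [0.0, 1.0],
]
sample_wise = to_sparse_records(dosages, by="sample")
variant_wise = to_sparse_records(dosages, by="variant")
```

Here sample_wise holds one record per sample (two records) and variant_wise one record per variant (three records); both describe the same non-zero dosages, only grouped along a different axis.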

Given a DNARecords Dataset, you can read it as Tensorflow Datasets this way:

import dnarecords as dr


dnarecords_path = '/tmp/dnarecords'
reader = dr.reader.DNARecordsReader(dnarecords_path)
samplewise_ds = reader.sample_wise_dataset()
variantwise_ds = reader.variant_wise_dataset()
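
Models that expect fixed-length inputs need a sparse record expanded into a dense vector at some point. The following is a minimal sketch of that step in plain Python, assuming a record exposes hypothetical indices and values fields; the actual structure of the elements in the datasets returned by the reader may differ, and the function name densify is not part of dnarecords.

```python
def densify(indices, values, size, fill=0.0):
    """Expand a sparse (indices, values) pair into a dense list of length size.

    Hypothetical helper for illustration; positions that are not listed in
    indices keep the fill value.
    """
    if len(indices) != len(values):
        raise ValueError("indices and values must have the same length")
    dense = [fill] * size
    for i, v in zip(indices, values):
        if not 0 <= i < size:
            raise IndexError(f"index {i} is out of range for size {size}")
        dense[i] = v
    return dense


# A sample with non-zero dosages at variants 0 and 2, out of 3 variants.
dense_sample = densify([0, 2], [2.0, 1.0], size=3)
```

In a real Tensorflow pipeline this kind of expansion would more likely be done with tensor operations inside a dataset map step; the pure-Python version is only meant to show the idea.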

Or, given a DNARecords Dataset, you can read it as Pyspark DataFrames this way:

import dnarecords as dr


dnarecords_path = '/tmp/dnarecords'
reader = dr.reader.DNASparkReader(dnarecords_path)
samplewise_df = reader.sample_wise_dnarecords()
variantwise_df = reader.variant_wise_dnarecords()

We will provide more examples and integrations soon.

Contributing

Interested in contributing? Check out the contributing guidelines. Please note that this project is released with a Code of Conduct. By contributing to this project, you agree to abide by its terms.

License

dnarecords was created by Atray Dixit, Andrés Mañas Mañas, and Lucas Seninge. It is licensed under the terms of the MIT license.

Credits

dnarecords was created with cookiecutter and the py-pkgs-cookiecutter template.


