Genomics data ML ready
DNARecords
Genomics data ML ready.
Transform your VCF, BGEN, and other genomics datasets into a sample-wise format so that you can use them with Deep Learning models.
Installation
The DNARecords package has two main dependencies:
- Hail, if you are transforming your genomics data into DNARecords
- Tensorflow, if you are using a previously created DNARecords dataset, for example, to train a DL model
As you may know, Tensorflow and Spark do not play very well together on a cluster with more than one machine.
However, the dnarecords package needs to be installed only on the driver machine of a Hail cluster.
For that reason, we recommend following these installation tips.
On a dev environment
$ pip install dnarecords
For further details (or if you run into trouble), review the Local environments section.
On a Hail cluster or submitting a job to it
You will already have Pyspark installed and will probably not want to install Tensorflow.
So, just install dnarecords without dependencies on the driver machine.
There will be no missing modules as long as you use only the classes and functionality intended for Spark.
$ /opt/conda/miniconda3/bin/python -m pip install dnarecords --no-deps
Note: this assumes the Hail Python executable is /opt/conda/miniconda3/bin/python
On a Tensorflow environment or submitting a job to it
You will already have Tensorflow installed and will probably not want to install Pyspark.
So, just install dnarecords without dependencies.
There will be no missing modules as long as you use only the classes and functionality intended for Tensorflow.
$ pip install dnarecords --no-deps
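Because --no-deps skips dependency resolution, it can help to confirm which backends are importable in the current environment before choosing between the Spark and Tensorflow classes. The following is a minimal sketch using only the standard library; the helper name available_backends is hypothetical and is not part of dnarecords.

```python
from importlib.util import find_spec


def available_backends(modules=("hail", "pyspark", "tensorflow")):
    """Return {module_name: bool} telling whether each module can be found.

    Hypothetical helper, not part of dnarecords. For top-level names,
    find_spec only locates the module without importing it, so the
    check is cheap even for heavy packages.
    """
    return {name: find_spec(name) is not None for name in modules}


if __name__ == "__main__":
    for name, found in available_backends().items():
        print(f"{name}: {'installed' if found else 'missing'}")
```

On a Hail driver you would expect hail and pyspark to be found, and on a training machine tensorflow; anything else suggests the wrong interpreter is in use.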
Working on Google Dataproc
Just use an initialization action that installs dnarecords without dependencies.
$ hailctl dataproc start dnarecords --init gs://dnarecords/dataproc-init.sh
If you need to work with other cloud providers, refer to the Hail docs.
Usage
It is quite straightforward to understand the functionality of the package.
Given some genomics data, you can transform it into a DNARecords Dataset this way:
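import dnarecords as dr

# Start Hail through the dnarecords helper and download the 1000 Genomes sample data.
hl = dr.helper.DNARecordsUtils.init_hail()
hl.utils.get_1kg('/tmp/1kg')
mt = hl.read_matrix_table('/tmp/1kg/1kg.mt')
# Add a per-entry dosage computed from the Phred-scaled genotype likelihoods (PL).
mt = mt.annotate_entries(dosage=hl.pl_dosage(mt.PL))

dnarecords_path = '/tmp/dnarecords'
writer = dr.writer.DNARecordsWriter(mt.dosage)
# Write both layouts (sample-wise and variant-wise) in both formats (TFRecord and Parquet).
writer.write(dnarecords_path, sparse=True, sample_wise=True, variant_wise=True,
             tfrecord_format=True, parquet_format=True,
             write_mode='overwrite', gzip=True)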
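To make "sample wise" and "sparse" more concrete, here is a small pure-Python sketch of the general idea: a variant-by-sample dosage matrix is pivoted so that each sample becomes one record, and only non-zero dosages are kept. This is an illustration under our own assumptions about what those options mean, not the on-disk layout that dnarecords writes, and to_sparse_sample_wise is a hypothetical name.

```python
def to_sparse_sample_wise(dosages, sample_ids):
    """Pivot a variant-major dosage matrix into sparse per-sample records.

    dosages[v][s] is the dosage of variant v for sample s.
    Returns one dict per sample holding the indices of the variants whose
    dosage is non-zero, together with the matching dosage values.
    """
    records = []
    for s, sample_id in enumerate(sample_ids):
        indices, values = [], []
        for v, row in enumerate(dosages):
            if row[s] != 0:
                indices.append(v)
                values.append(row[s])
        records.append({"sample": sample_id, "indices": indices, "values": values})
    return records


# Toy data: 3 variants (rows) x 2 samples (columns).
toy = [
    [0.0, 1.0],
    [2.0, 0.0],
    [0.0, 0.5],
]
records = to_sparse_sample_wise(toy, ["s1", "s2"])
print(records)
```

Keeping one record per sample is what lets a training loop stream individuals one at a time instead of loading every variant for every sample.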
Given a DNARecords Dataset, you can read it as Tensorflow Datasets this way:
import dnarecords as dr
dnarecords_path = '/tmp/dnarecords'
reader = dr.reader.DNARecordsReader(dnarecords_path)
samplewise_ds = reader.sample_wise_dataset()
variantwise_ds = reader.variant_wise_dataset()
Or, given a DNARecords Dataset, you can read it as Pyspark DataFrames this way:
import dnarecords as dr
dnarecords_path = '/tmp/dnarecords'
reader = dr.reader.DNASparkReader(dnarecords_path)
samplewise_df = reader.sample_wise_dnarecords()
variantwise_df = reader.variant_wise_dnarecords()
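Models usually expect fixed-length inputs, so a sparse per-sample record typically has to be expanded before it is fed to a network. The sketch below shows that step in plain Python under an assumed record layout (indices and values); the actual field names and types returned by the readers may differ, so inspect one record from your own dataset first. The function densify is hypothetical and not part of dnarecords.

```python
def densify(indices, values, n_variants, fill=0.0):
    """Expand sparse (indices, values) into a dense list of length n_variants."""
    if len(indices) != len(values):
        raise ValueError("indices and values must have the same length")
    dense = [fill] * n_variants
    for i, v in zip(indices, values):
        if not 0 <= i < n_variants:
            raise IndexError(f"variant index {i} out of range")
        dense[i] = v
    return dense


# Hypothetical record for one sample; not the actual dnarecords schema.
record = {"indices": [0, 2], "values": [1.0, 0.5]}
dense = densify(record["indices"], record["values"], n_variants=4)
print(dense)  # [1.0, 0.0, 0.5, 0.0]
```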
We will provide more examples and integrations soon.
Contributing
Interested in contributing? Check out the contributing guidelines. Please note that this project is released with a Code of Conduct. By contributing to this project, you agree to abide by its terms.
License
dnarecords was created by Atray Dixit, Andrés Mañas Mañas, Lucas Seninge. It is licensed under the terms of the MIT license.
Credits
dnarecords was created with cookiecutter and the py-pkgs-cookiecutter template.