bio2zarr

Convert bioinformatics file formats to Zarr

Initially supports converting VCF to the sgkit vcf-zarr specification.

This is early alpha-status code: everything is subject to change, and it has not been thoroughly tested.

Install

$ python3 -m pip install bio2zarr

This will install the programs vcf2zarr, plink2zarr and vcf_partition into your local Python path. You may need to update your $PATH to call the executables directly.

Alternatively, calling

$ python3 -m bio2zarr vcf2zarr <args>

is equivalent to

$ vcf2zarr <args>

and will always work.
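
If you are unsure whether the executables are on your $PATH, a small wrapper can pick whichever invocation is available. This is only a sketch: the function name vcf2zarr_cmd is hypothetical and not part of bio2zarr.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of bio2zarr): prefer the vcf2zarr executable
# if it is on $PATH, otherwise fall back to the module invocation.
vcf2zarr_cmd() {
    if command -v vcf2zarr >/dev/null 2>&1; then
        echo "vcf2zarr"
    else
        echo "python3 -m bio2zarr vcf2zarr"
    fi
}

# Show which form would be used; run it as: $(vcf2zarr_cmd) <args>
echo "Using: $(vcf2zarr_cmd)"
```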

vcf2zarr

Convert one or more VCF files to Zarr format in a single step:

$ vcf2zarr convert <VCF1> <VCF2> <zarr>

Do not use this for anything but the smallest files.

The recommended approach is to use a multi-stage conversion.

First, convert the VCF into the intermediate format:

vcf2zarr explode tests/data/vcf/sample.vcf.gz tmp/sample.exploded

Then, (optionally) inspect this representation to get a feel for your dataset:

vcf2zarr inspect tmp/sample.exploded

Then, (optionally) generate a conversion schema to describe the corresponding Zarr arrays:

vcf2zarr mkschema tmp/sample.exploded > sample.schema.json

View and edit the schema, deleting any columns you don't want, or tweaking dtypes and compression settings to your taste.

Finally, encode to Zarr:

vcf2zarr encode tmp/sample.exploded tmp/sample.zarr -s sample.schema.json

Use the -p, --worker-processes argument to control the number of workers used in the explode and encode phases.
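
The stages above can be chained in one script. The following is a minimal sketch, assuming the paths from the examples above and that -p is accepted after the positional arguments; the run helper and the DRY_RUN and WORKERS variables are hypothetical names introduced here, not part of vcf2zarr. By default it only prints the commands.

```shell
#!/usr/bin/env bash
set -eu

# Assumed locations, taken from the examples above; adjust to your data.
VCF="tests/data/vcf/sample.vcf.gz"
EXPLODED="tmp/sample.exploded"
SCHEMA="sample.schema.json"
ZARR="tmp/sample.zarr"
WORKERS="${WORKERS:-4}"      # passed to -p in the explode and encode phases
DRY_RUN="${DRY_RUN:-1}"      # 1 = only print the commands, 0 = run them

# Print the command, and execute it unless DRY_RUN=1.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" != "1" ]; then
        "$@"
    fi
}

run vcf2zarr explode "$VCF" "$EXPLODED" -p "$WORKERS"
run vcf2zarr inspect "$EXPLODED"

# mkschema writes to stdout, so the redirect is handled outside run().
echo "+ vcf2zarr mkschema $EXPLODED > $SCHEMA"
if [ "$DRY_RUN" != "1" ]; then
    vcf2zarr mkschema "$EXPLODED" > "$SCHEMA"
fi

run vcf2zarr encode "$EXPLODED" "$ZARR" -s "$SCHEMA" -p "$WORKERS"
```

Set DRY_RUN=0 to execute the commands. If you want to edit the schema by hand, run the stages separately and edit the schema file before the encode step.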

Shell completion

To enable shell completion for a particular session in Bash do:

eval "$(_VCF2ZARR_COMPLETE=bash_source vcf2zarr)" 

If you add this to your .bashrc, vcf2zarr shell completion should be available in all new shell sessions.
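
One way to add the line without creating duplicates on repeated runs is sketched below. The add_completion function is a hypothetical helper, not something bio2zarr provides; it appends the line only when the file does not already contain it.

```shell
#!/usr/bin/env bash
# Hypothetical helper: append the completion line to a startup file only if
# it is not already there, so re-running it does not create duplicates.
COMPLETION_LINE='eval "$(_VCF2ZARR_COMPLETE=bash_source vcf2zarr)"'

add_completion() {
    local rcfile="$1"
    touch "$rcfile"
    # -x matches whole lines only, -F treats the pattern as a fixed string.
    if ! grep -qxF "$COMPLETION_LINE" "$rcfile"; then
        printf '%s\n' "$COMPLETION_LINE" >> "$rcfile"
    fi
}

# Usage: add_completion "$HOME/.bashrc"
```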

See the Click documentation for instructions on how to enable completion in other shells.

plink2zarr

Convert a PLINK .bed file to Zarr format. This is incomplete.

vcf_partition

Partition a given VCF file into (approximately) a given number of regions:

vcf_partition 20201028_CCDG_14151_B01_GRM_WGS_2020-08-05_chr20.recalibrated_variants.vcf.gz -n 10

gives

chr20:1-6799360
chr20:6799361-14319616
chr20:14319617-21790720
chr20:21790721-28770304
chr20:28770305-31096832
chr20:31096833-38043648
chr20:38043649-45580288
chr20:45580289-52117504
chr20:52117505-58834944
chr20:58834945-

These region strings can then be used to split computation of the VCF into chunks for parallelisation.

TODO give a nice example here using xargs

WARNING: this does not take into account that indels may overlap partition boundaries, so variants that do may be counted twice or more.
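
As a rough illustration of fanning the regions out with xargs, the sketch below prints one command per region. Everything beyond vcf_partition itself is an assumption: the file name is a placeholder, the regions function stands in for the real vcf_partition output, and bcftools is just one possible per-region tool (its -r option needs an indexed VCF).

```shell
#!/usr/bin/env bash
set -eu

# Assumed file name; replace with your own indexed VCF.
VCF="chr20.vcf.gz"

# Stand-in for: vcf_partition "$VCF" -n 10
# (the first three regions from the example output above)
regions() {
    printf '%s\n' \
        "chr20:1-6799360" \
        "chr20:6799361-14319616" \
        "chr20:14319617-21790720"
}

# Dry run: print one command per region, up to 4 at a time.
# Remove "echo" to actually execute the commands.
regions | xargs -P 4 -I {} echo bcftools view -H -r {} "$VCF"
```

With -P the jobs run concurrently, so the output order is not guaranteed. The warning above still applies: a variant spanning a partition boundary may be reported by more than one job.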
