
Cloud utilities for running Hail systematically.

Project description

pycloud

Introduction

pycloud is a codebase that provides a user-friendly interface around gsutil and gcloud in interactive Python environments such as Jupyter.

Information on gcloud and gsutil

Command-line references

gcloud Overview

gsutil Overview

Regions and zones

Google regions and zones

Regions are collections of zones. In order to deploy fault-tolerant applications that have high availability, Google recommends deploying applications across multiple zones and multiple regions.

If you have specific needs that require your data to live in the US, it makes sense to store your resources in zones in the us-central1 region or zones in the us-east1 region.

A zone is an isolated location within a region.

Stick with the Eastern US zones: us-east1-b, us-east1-c, and us-east1-d.
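To see which zones a region offers before creating anything, you can query gcloud directly from Python. A minimal sketch using the standard library (the filter string is plain gcloud syntax, not part of pycloud):

import subprocess

# List the zones in us-east1; assumes the gcloud CLI is installed
# and authenticated.
subprocess.run(['gcloud', 'compute', 'zones', 'list',
                '--filter=region:us-east1'], check=True)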

Permissions

Project permissions (IAM)

  • Owner

  • Editor

  • Viewer

  • Browser

ACL roles (as used with gsutil acl ch):

R: READ
W: WRITE
O: OWNER
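No pycloud wrapper for ACL changes is shown here, so the sketch below shells out to gsutil directly; the object path is a placeholder:

import subprocess

# Grant READ access (the R role above) on an object to a single user.
subprocess.run(['gsutil', 'acl', 'ch',
                '-u', 'tsingh@broadinstitute.org:R',
                'gs://sczmeta_genomes/test/example.txt'], check=True)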

GCloud class

To sign in to Google Cloud:

gc = GCloud(account = 'tsingh@broadinstitute.org', 
            project = 'daly-neale-sczmeta', 
            local = True)

To log out of Google Cloud:

gc.logout()

Common functions

ls:

gc.ls('gs://sczmeta_genomes/test/*')

rm:

gc.rm('gs://test/', recursive=True)

cp:

import glob

gc.cpi(glob.glob('test/*'), 'gs://test/')
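Putting the three together, a hypothetical round trip (all paths are placeholders):

import glob

# Upload local files, confirm they arrived, then clean up.
gc.cpi(glob.glob('test/*'), 'gs://test/')
gc.ls('gs://test/*')
gc.rm('gs://test/', recursive=True)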

ComputeEngine class

The ComputeEngine class is used to monitor the Compute Engine virtual machines currently running in Google Cloud.

Create a ComputeEngine object:

ce = gc.ComputeEngine()

or

ce = ComputeEngine(account = 'tsingh@broadinstitute.org', project = 'daly-neale-sczmeta')

Get all running VM instances:

ce.get_instances()

Load a running instance:

cei = ce.ComputeEngineInstance('vm01')
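A minimal monitoring loop, assuming get_instances() returns the names of the running VMs (the return type isn't documented here):

# Print every running VM in the project.
for name in ce.get_instances():
    print(name)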

ComputeEngineInstance class

The ComputeEngineInstance class is used to create, modify, and delete a single virtual machine running in Google Cloud.

To load a running instance:

cei = ce.ComputeEngineInstance('vm01')

or

cei = ComputeEngineInstance(instance_name, project_name)
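The class is described as handling create, modify, and delete, but only loading is shown. By analogy with DataProcInstance below, something like the following may exist (delete() here is an assumption):

# Hypothetical teardown call, mirroring DataProcInstance.delete().
cei = ComputeEngineInstance('vm01', 'daly-neale-sczmeta')
cei.delete()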

DataProc class

The DataProc class is used to monitor DataProc clusters currently running in Google Cloud.

To create a DataProc object:

dp = DataProc(account='tsingh@broadinstitute.org', 
              project='daly-neale-sczmeta')
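The monitoring calls themselves aren't shown here; by analogy with ComputeEngine.get_instances(), something like the following may exist (get_clusters() is a hypothetical name):

# Hypothetical: list the DataProc clusters running in the project.
print(dp.get_clusters())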

DataProcInstance class

The DataProcInstance class is used to create, modify, and delete a DataProc cluster running in Google Cloud.

# create = False presumably attaches to an existing cluster rather than
# creating a new one.
dpi = DataProcInstance('cluster01', 'tsingh@broadinstitute.org', 'daly-neale-sczmeta',
                       zone = 'us-central1-b',
                       master_machine_type = 'n1-highmem-4',
                       master_boot_disk_size = 50,
                       n_workers = 2,
                       worker_machine_type = 'n1-highmem-4',
                       worker_boot_disk_size = 50,
                       n_pre_workers = 2, create = False)

To change the number of preemptible workers:

dpi.update(n_pre_workers = 20)

To delete a cluster:

dpi.delete()
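An end-to-end sketch, assuming create = True triggers cluster creation (an inference from the create flag above) and that the machine-type arguments have defaults when omitted:

# Create a small cluster, scale it up for a heavy job, then tear it down.
dpi = DataProcInstance('cluster02', 'tsingh@broadinstitute.org', 'daly-neale-sczmeta',
                       n_workers = 2, n_pre_workers = 2, create = True)
dpi.update(n_pre_workers = 20)
dpi.update(n_pre_workers = 2)
dpi.delete()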

HailRunner class

The HailRunner class provides a user-friendly interface for submitting sequential Hail jobs to a Spark cluster. HailRunner can be subclassed to adapt it to different cloud infrastructures.

import os

outdir = 'outdir'

runner = HailRunner(outdir = outdir, 
                    input_file = "/Users/tsingh/1000Genomes_248samples_coreExome10K.vcf.bgz",
                    hail_jar_path = "/Users/tsingh/repos/hail/build/libs/hail-all-spark.jar", 
                    pyhail_zip = "/Users/tsingh/repos/hail/python/")

# The two placeholders appear to be filled with the current input file
# (runner.f) and the output path supplied as the second argument.
runner.add_batch("""
vds = hc.import_vcf('{}')
vds = vds.split_multi()
vds.write('{}')""", os.path.join(outdir, '1kg.vds'))

runner.submit('save_vds')
# runner.f appears to track the most recently written output
# (here, the VDS saved by the batch above).
runner.quick_submit("""
vds = hc.read('{}')

print(vds.count(genotypes=True))
""".format(runner.f))

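Because batches run sequentially, a later batch can consume the previous one's output. A sketch, assuming runner.f always points at the last written file:

# Second batch: read the VDS written above, run per-sample QC, save.
runner.add_batch("""
vds = hc.read('{}')
vds = vds.sample_qc()
vds.write('{}')""", os.path.join(outdir, '1kg.qc.vds'))
runner.submit('sample_qc')
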
HailRunnerGC class

HailRunnerGC is the HailRunner class adapted for use with Google Cloud and DataProc.

outdir = 'test/'
bucket_dir = 'gs://test/'
runner = HailRunnerGC('vm01',
                      outdir = outdir, 
                      input_file = "gs://data/exomes.vcf.bgz")

# change_ext appears to derive the output path from runner.f by swapping
# its .vcf.bgz extension for .vds and relocating it under bucket_dir.
runner.add_batch("""
vds = hc.import_vcf('{}')

vds = vds.split_multi()

vds.write('{}')""", change_ext(runner.f, '.vds', '.vcf.bgz', bucket_dir))

runner.submit('save_vds')
runner.quick_submit("""

vds = hc.read('{}')

print(vds.count())
print(vds.variant_schema)
print(vds.sample_schema)
print(vds.global_schema)

""".format(runner.f))

Installing Hail on the Broad on-prem cluster

Set SPARK_HOME and HAIL_HOME, and add the Hail and Spark bin directories to PATH.

On the Broad cluster, make sure the build targets the installed Spark version:

./gradlew -Dspark.version=2.1.0.cloudera shadowJar

Before running gradlew, load the build dependencies:

use CMake
use GCC-5.2
use Java-1.8



Download files

Download the file for your platform.

Source Distribution

sparkhub-0.2.3.tar.gz (38.5 kB)


Built Distribution

sparkhub-0.2.3-py3-none-any.whl (19.7 kB)


File details

Details for the file sparkhub-0.2.3.tar.gz.

File metadata

  • Download URL: sparkhub-0.2.3.tar.gz
  • Upload date:
  • Size: 38.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.4

File hashes

Hashes for sparkhub-0.2.3.tar.gz

  • SHA256: 09ec64f831e1ce162f4ff135cfa1269ea579e351d69626ed297a232b6f8a73ae
  • MD5: 8d5d04e356feffdae46335ea276a22b7
  • BLAKE2b-256: e1ddbb2cd44ba995b2443a6e7a786f7ddad9c33f3c5dfd3cbedd5b412440f08a


File details

Details for the file sparkhub-0.2.3-py3-none-any.whl.

File metadata

  • Download URL: sparkhub-0.2.3-py3-none-any.whl
  • Upload date:
  • Size: 19.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.4

File hashes

Hashes for sparkhub-0.2.3-py3-none-any.whl

  • SHA256: 6e6912072134f2cdd16ae6cb88247bd609cacc977f047029f81d0fedee994c87
  • MD5: cc2e6b8710295f0d8762e2bc14a3197a
  • BLAKE2b-256: 78895865c980587d315b8e4047491c8bbadd92dae3837668c327bb515b62f1a3

