Analysis runner to help make analysis results reproducible

These details have not been verified by PyPI

Project links

Homepage

Environment
- Console
Intended Audience
- Science/Research
License
- OSI Approved :: GNU General Public License v3 (GPLv3)
Natural Language
- English
Operating System
Programming Language
- Python
Topic
- Scientific/Engineering
- Scientific/Engineering :: Bio-Informatics

Project description

Analysis runner

This tool helps to make analysis results reproducible, by automating the following aspects:

- Allow quick iteration using an environment that resembles production.
- Only allow access to production datasets through code that has been reviewed.
- Link the output data with the exact program invocation of how the data has been generated.

One of our main workflow pipeline systems at the CPG is Hail Batch. By default, its pipelines are defined by running a Python program locally. This tool instead lets you run the "driver" on Hail Batch itself.

Furthermore, all invocations are logged together with the output data, as well as to Airtable and the sample-metadata server.

When using the analysis-runner, the batch jobs are not run under your standard Hail Batch service account user (<USERNAME>-trial). Instead, a separate Hail Batch account is used to run the batch jobs on your behalf. There's a dedicated Batch service account for each dataset (e.g. "tob-wgs", "fewgenomes") and access level ("test", "standard", or "full", as documented in the team docs storage policies), which helps with bucket permission management and billing budgets.

Note that you can use the analysis-runner to start arbitrary jobs, e.g. R scripts. They're just launched in the Hail Batch environment, but you can use any Docker image you like.

The analysis-runner is also integrated with our Cromwell server to run WDL-based workflows.

CLI

The analysis-runner CLI can be used to start pipelines based on a GitHub repository, commit, and command to run. To install it, use pip:

pip install analysis-runner

Run analysis-runner --help to see usage information.

Make sure that you're logged into GCP:

gcloud auth application-default login

If you're in the directory of the project you want to run, you can omit the --commit and --repository parameters; the CLI then uses your current git remote and the HEAD commit.

For example:

analysis-runner \
    --dataset <dataset> \
    --description <description> \
    --access-level <level> \
    --output-dir <directory-within-bucket> \
    script_to_run.py with arguments

<level> corresponds to an access level as defined in the storage policies.

<directory-within-bucket> does not contain a prefix like gs://cpg-fewgenomes-main/. For example, if you want your results to be stored in gs://cpg-fewgenomes-main/1kg_pca/v2, specify --output-dir 1kg_pca/v2.
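The relationship between --output-dir and the final bucket path can be sketched in a few lines of Python. This is only an illustration: the function name full_output_path is hypothetical, and the cpg-<dataset>-<suffix> bucket pattern is inferred from the single example above rather than taken from the analysis-runner source.

```python
def full_output_path(dataset: str, bucket_suffix: str, output_dir: str) -> str:
    """Join a dataset bucket and a relative output directory (hypothetical helper)."""
    if "://" in output_dir:
        # --output-dir is relative to the bucket, so a gs:// prefix is a mistake.
        raise ValueError("output_dir must not contain a prefix like gs://")
    # Assumed bucket naming, inferred from gs://cpg-fewgenomes-main/1kg_pca/v2.
    bucket = f"gs://cpg-{dataset}-{bucket_suffix}"
    return f"{bucket}/{output_dir.strip('/')}"


print(full_output_path("fewgenomes", "main", "1kg_pca/v2"))
# prints gs://cpg-fewgenomes-main/1kg_pca/v2
```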

If you provide a --repository, you MUST supply a --commit <SHA>, e.g.:

analysis-runner \
    --repository my-approved-repo \
    --commit <commit-sha> \
    --dataset <dataset> \
    --description <description> \
    --access-level <level> \
    --output-dir <directory-within-bucket> \
    script_to_run.py with arguments
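If you launch runs from a Python script rather than a shell, the rules above can be checked before the command is built. The following is a minimal sketch, not part of the analysis-runner package: build_cli_args is a hypothetical helper, and it only uses the flags and access levels mentioned in this README.

```python
import shlex
from typing import List, Optional

# Access levels named in this README.
ACCESS_LEVELS = ("test", "standard", "full")


def build_cli_args(
    dataset: str,
    description: str,
    access_level: str,
    output_dir: str,
    script: List[str],
    repository: Optional[str] = None,
    commit: Optional[str] = None,
) -> List[str]:
    """Assemble an analysis-runner command line (hypothetical helper)."""
    if access_level not in ACCESS_LEVELS:
        raise ValueError(f"access level must be one of {ACCESS_LEVELS}")
    if repository is not None and commit is None:
        # Mirrors the rule above: --repository requires --commit.
        raise ValueError("--repository requires --commit")
    args = ["analysis-runner"]
    if repository is not None:
        args += ["--repository", repository]
    if commit is not None:
        args += ["--commit", commit]
    args += [
        "--dataset", dataset,
        "--description", description,
        "--access-level", access_level,
        "--output-dir", output_dir,
    ]
    return args + list(script)


args = build_cli_args(
    dataset="fewgenomes",
    description="PCA on 1kg",
    access_level="test",
    output_dir="1kg_pca/v2",
    script=["script_to_run.py", "with", "arguments"],
)
# Shell-quoted form, suitable for logging or for subprocess.run(args).
print(shlex.join(args))
```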

For more examples (including for running an R script and dataproc), see the examples directory.

Custom Docker images

The default driver image that's used to run scripts comes with Hail and some statistics libraries preinstalled (see the corresponding Hail Dockerfile). It's possible to use any custom Docker image instead, using the --image parameter. Note that any such image needs to contain the critical dependencies as specified in the base image.

For R scripts, we add the R-tidyverse set of packages to the base image; see the corresponding R Dockerfile and the R example for more details.

Helper for Hail Batch

The analysis-runner package has a number of functions that make it easier to run reproducible analyses through Hail Batch.

The package is installed in the analysis-runner driver image, i.e. you can access the analysis_runner module when running scripts through the analysis-runner.

Checking out a git repository at the current commit

import hailtop.batch as hb
from analysis_runner.git import (
  prepare_git_job,
  get_repo_name_from_current_directory,
  get_git_commit_ref_of_current_repository,
)

b = hb.Batch('do-some-analysis')
j = b.new_job('checkout_repo')
prepare_git_job(
  job=j,
  # you could specify a name here, like 'analysis-runner'
  repo_name=get_repo_name_from_current_directory(),
  # you could specify the specific commit here, eg: '1be7bb44de6182d834d9bbac6036b841f459a11a'
  commit=get_git_commit_ref_of_current_repository(),
)

# Now, the working directory of j is the checked-out repository
j.command('examples/bash/hello.sh')
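The two lookup helpers used above derive the repository name and commit from your local checkout. As a rough illustration of what such a lookup might involve, the sketch below uses standard git commands. It is not the package's actual implementation, and the names repo_name_from_remote_url and current_commit are hypothetical.

```python
import subprocess


def repo_name_from_remote_url(remote_url: str) -> str:
    """Extract the repository name from a git remote URL (illustrative only)."""
    # Treat ':' like '/' so that both https://github.com/org/repo.git and
    # git@github.com:org/repo.git end in the repository name.
    name = remote_url.rstrip("/").replace(":", "/").split("/")[-1]
    if name.endswith(".git"):
        name = name[: -len(".git")]
    return name


def current_commit() -> str:
    """Return the full SHA of HEAD; must be called inside a git checkout."""
    return subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()


print(repo_name_from_remote_url("https://github.com/populationgenomics/analysis-runner.git"))
# prints analysis-runner
```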

Running a dataproc script

import hailtop.batch as hb
from analysis_runner.dataproc import setup_dataproc

b = hb.Batch('do-some-analysis')

# starts up a cluster, and submits a script to the cluster,
# see the definition for more information about how you can configure the cluster
# https://github.com/populationgenomics/analysis-runner/blob/main/analysis_runner/dataproc.py#L80
cluster = setup_dataproc(
    b,
    max_age='1h',
    packages=['click', 'selenium'],
    init=['gs://cpg-reference/hail_dataproc/install_common.sh'],
    cluster_name='My Cluster with max-age=1h',
)
cluster.add_job('examples/dataproc/query.py', job_name='example')

Development

You can ignore this section if you just want to run the tool.

To bring up a stack corresponding to a dataset as described in the storage policies, see the stack directory.

To set up a development environment for the analysis runner using pip, run the following:

pip install -r requirements-dev.txt

pre-commit install --install-hooks

pip install --editable .

Deployment

- Add a Hail Batch service account for all supported datasets.
- Copy the Hail tokens to the Secret Manager.
- Deploy the server by invoking the hail_update workflow manually, specifying the Hail package version.
- Deploy the Airtable publisher.
- Publish the CLI tool and library to PyPI.

Note that the hail_update workflow gets invoked whenever a new Hail package is published to PyPI. You can test this manually as follows:

curl \
  -X POST \
  -H "Authorization: token $GITHUB_TOKEN" -H "Accept: application/vnd.github.v3+json" \
  https://api.github.com/repos/populationgenomics/analysis-runner/actions/workflows/6364059/dispatches \
  -d '{"ref": "main", "inputs": {"hail_version": "0.2.84"}}'
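The same request can be prepared from Python using only the standard library. This is a sketch that mirrors the curl call above; build_dispatch_request is a hypothetical helper name, and the request is only constructed here, not sent.

```python
import json
import urllib.request

# Same endpoint as in the curl example above.
WORKFLOW_DISPATCH_URL = (
    "https://api.github.com/repos/populationgenomics/analysis-runner"
    "/actions/workflows/6364059/dispatches"
)


def build_dispatch_request(
    github_token: str, hail_version: str, ref: str = "main"
) -> urllib.request.Request:
    """Build (but do not send) the workflow dispatch request."""
    payload = {"ref": ref, "inputs": {"hail_version": hail_version}}
    return urllib.request.Request(
        WORKFLOW_DISPATCH_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"token {github_token}",
            "Accept": "application/vnd.github.v3+json",
        },
        method="POST",
    )


request = build_dispatch_request("<your-github-token>", "0.2.84")
# To actually trigger the workflow, send the request:
# urllib.request.urlopen(request)
```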

The CLI tool is shipped as a pip package. To build a new version, we use bump2version. For example, to increment the patch section of the version tag 1.0.0 and make it 1.0.1, run:

git checkout -b add-new-version
bump2version patch
git push --set-upstream origin add-new-version
# Open pull request
open "https://github.com/populationgenomics/analysis-runner/pull/new/add-new-version"

It's important that the pull request name starts with "Bump version:" (which should happen by default). Once this is merged into main, a GitHub Actions workflow will build a new package and upload it to PyPI, where it becomes available to install with pip install.
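To make the version arithmetic and the naming rule concrete, here is a small sketch. Both functions are hypothetical illustrations: bump_patch only imitates what bump2version patch does to a plain MAJOR.MINOR.PATCH string, and is_release_pull_request assumes that a simple prefix check is all that matters, which this README does not spell out.

```python
def bump_patch(version: str) -> str:
    """Increment the patch component of a MAJOR.MINOR.PATCH version string."""
    major, minor, patch = (int(part) for part in version.split("."))
    return f"{major}.{minor}.{patch + 1}"


def is_release_pull_request(title: str) -> bool:
    """Check the naming rule described above (assumed to be a prefix match)."""
    return title.startswith("Bump version:")


print(bump_patch("1.0.0"))
# prints 1.0.1
```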



Download files

Source Distribution

analysis-runner-2.16.0.tar.gz (30.9 kB), uploaded Mar 10, 2022

Hashes for analysis-runner-2.16.0.tar.gz
Algorithm	Hash digest
SHA256	`d9445d2688b9cd546b84c98bdf71e36a44be07c5ea0e559adaf3d445c49f2d70`
MD5	`821799affe677bcdfb251a2a6b157374`
BLAKE2b-256	`62927a2388a53b70d1a30d223b38eb4b704bceb4b1056eb9871b94a0487feb85`