Polls a GCP Pub/Sub topic for new SampleSheet notification messages in order to initiate bcl2fastq.

Project description

A downstream tool of smon that uses Pub/Sub notifications to initiate demultiplexing of an Illumina sequencing run.

Use case

smon has done its job of persistently storing the raw sequencing runs in a Google Storage bucket. Now another tool is needed to automatically start demultiplexing. Demultiplexing, however, must not begin until a SampleSheet is available; once one is, demultiplexing needs to start and the results need to be uploaded to Google Storage.

How it works

SampleSheets Subscriber (ssub) solves the aforementioned challenges by utilizing the power of GCP events and triggers. At a high level, it works as follows:

  1. A user or application uploads a SampleSheet to a dedicated bucket. The SampleSheet naming convention is ${RUN_NAME}.csv.

  2. Google Storage immediately fires off an event to a Pub/Sub topic (whenever there is a new SampleSheet or when an existing one is overwritten).

  3. Meanwhile, ssub is running on a compute instance as a daemon process. It is subscribed to that same Pub/Sub topic and polls it for new messages at a regular interval, e.g. once a minute.

  4. When ssub receives a new message, the script parses information about the event.

  5. ssub will then query the Firestore collection - the same one used by smon - for a document whose name equals the SampleSheet name (minus the .csv extension). ssub will then download both the SampleSheet and the run tarball. The SampleSheet location is provided in the Pub/Sub message; the raw run tarball location is provided within the Firestore document.

  6. ssub will then kick off bcl2fastq.

  7. Demultiplexing results are output to a folder named ‘demux’ within the local run directory.

  8. ssub will upload the demux folder to the same Google Storage folder that contains the raw sequencing run.

  9. ssub will update the relevant Firestore document to add the location of the demux folder in Google Storage.

All processing happens within a sub-directory of the calling directory that is named ssub_runs.
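
To make steps 3-5 concrete, here is a minimal sketch of such a polling loop, assuming the google-cloud-pubsub client library. The project and subscription names are placeholders, and this is an illustration, not ssub’s actual implementation:

    import json
    import time

    from google.cloud import pubsub_v1

    # Sketch of the polling loop (steps 3-5); names are placeholders.
    subscriber = pubsub_v1.SubscriberClient()
    subscription = subscriber.subscription_path("my-project", "ssub")

    while True:
        response = subscriber.pull(
            request={"subscription": subscription, "max_messages": 10}
        )
        for received in response.received_messages:
            # With JSON notifications, the message payload is the JSON
            # representation of the SampleSheet object, including its
            # bucket, name, and generation.
            event = json.loads(received.message.data)
            print(event["bucket"], event["name"], event["generation"])
            subscriber.acknowledge(
                request={"subscription": subscription,
                         "ack_ids": [received.ack_id]}
            )
        time.sleep(60)  # see cycle_pause_sec below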

Reanalysis

Reruns of the demultiplexing pipeline may be necessary for various reasons, e.g. the demultiplexing results are no longer available and need to be regenerated, or a barcode in the SampleSheet was missing or incorrect.

Reanalysis is accomplished simply by re-uploading the SampleSheet to Google Storage, overwriting the previous one. Google Storage will assign a generation number to the SampleSheet. Think of the generation number as a version number that Google Storage assigns to an object each time the object changes. Even re-uploading the exact same file produces a new generation number.

Internally, ssub does all of its processing (file downloads, analysis) within a local directory path named after the run and the generation number of the SampleSheet. Thus, it’s perfectly fine for a user to upload an incorrect SampleSheet and then immediately upload the correct one afterwards. In such a scenario, there will be two runs of the pipeline, and they won’t interfere with each other. You will notice, however, that there will be two sets of demultiplexing results uploaded to Google Storage, each of which exists within a folder named after the original generation number.
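
As an illustration of that isolation, a per-generation working directory and bcl2fastq invocation might look roughly like the following. The run name, directory layout, and paths here are assumptions for the sketch, not ssub’s actual internals:

    import os
    import subprocess

    run_name = "190625_A00123_0010_BHXYZ"  # SampleSheet name minus ".csv"
    generation = "1561500000123456"        # GCS generation from the message

    # Each (run, generation) pair gets its own directory, so two uploads of
    # the same SampleSheet never collide.
    workdir = os.path.join("ssub_runs", f"{run_name}_{generation}")
    os.makedirs(workdir, exist_ok=True)

    # ... download the SampleSheet and unpack the run tarball into workdir ...

    subprocess.run(
        ["bcl2fastq",
         "--runfolder-dir", os.path.join(workdir, run_name),
         "--sample-sheet", os.path.join(workdir, f"{run_name}.csv"),
         "--output-dir", os.path.join(workdir, run_name, "demux")],
        check=True,
    )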

Scalability

While there aren’t any parallel steps in ssub, you can achieve scalability by launching two or more instances of ssub, either on a single, beefy compute instance or on separate ones. While it’s certainly possible for two running instances of ssub to pull the same message from Pub/Sub, only one of them will actually make use of it. It works as follows:

  1. Instance 1 of ssub receives a new message from Pub/Sub and immediately begins to process it.

  2. Instance 1 downloads and parses the Firestore document related to the SampleSheet detailed within the Pub/Sub message.

  3. Instance 1 notices that the document doesn’t have the ssub.FIRESTORE_ATTR_SS_PUBSUB_DATA attribute set, so it sets it to the JSON serialization of the Pub/Sub message data.

  4. Meanwhile, Instance 2 of ssub has also pulled down the same Pub/Sub message.

  5. Instance 2 queries Firestore and downloads the corresponding document.

  6. Instance 2 notices that the document attribute ssub.FIRESTORE_ATTR_SS_PUBSUB_DATA is already set, so it downloads this JSON value.

  7. Instance 2 then parses the generation number out of the JSON value it downloaded from Firestore and notices that it matches the generation number in the Pub/Sub message it is currently working on.

  8. Instance 2 logs a message that it is deferring further processing - thus leaving the rest of the work to be fulfilled by Instance 1.

Let’s now take a step back and pose the question: what if Instance 2 noticed that the generation numbers differ? In that case, it will continue on to run the demultiplexing workflow, since there are different versions of the SampleSheet at hand. First, however, it will set the Firestore document’s ssub.FIRESTORE_ATTR_SS_PUBSUB_DATA attribute to the JSON serialization of the Pub/Sub message data that it’s working on.
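
A sketch of that deduplication logic, assuming the google-cloud-firestore client. The stored field name and document layout below are assumptions; the real attribute is only referenced in this text as ssub.FIRESTORE_ATTR_SS_PUBSUB_DATA:

    import json

    from google.cloud import firestore

    # Assumed name of the field behind ssub.FIRESTORE_ATTR_SS_PUBSUB_DATA.
    FIRESTORE_ATTR_SS_PUBSUB_DATA = "ss_pubsub_data"

    def should_process(db, collection, run_name, msg_data):
        """Return True if this instance should run the demux workflow."""
        doc_ref = db.collection(collection).document(run_name)
        stored = (doc_ref.get().to_dict() or {}).get(
            FIRESTORE_ATTR_SS_PUBSUB_DATA)
        if stored and json.loads(stored)["generation"] == msg_data["generation"]:
            # Another instance already claimed this SampleSheet generation.
            return False
        # Claim this generation before kicking off demultiplexing.
        doc_ref.set({FIRESTORE_ATTR_SS_PUBSUB_DATA: json.dumps(msg_data)},
                    merge=True)
        return True

Note that a plain get-then-set like this is not atomic; a real implementation could wrap it in a Firestore transaction to narrow the remaining race window.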

Note: If using more than one deployment of ssub on the same instance, it is recommended to run each in a separate working directory.

Setup

  1. You should already have a Firestore collection for smon’s use. smon will create one for you if necessary, but you can create one in advance if you’d like for manual testing. See the documentation for smon for details on the structure of the documents stored in this collection.

  2. Create a dedicated Google Storage bucket for storing your SampleSheets and give it a useful name, e.g. samplesheets. Make sure to set the bucket to use Fine-Grained access control rather than Uniform.

  3. Create a dedicated Pub/Sub topic and give it a useful name, e.g. samplesheets.

  4. Create a notification configuration so that your samplesheets storage bucket will notify the samplesheets Pub/Sub topic whenever a new file is added or modified. Note that you can use gsutil for this as detailed here. Here is an example command:

    gsutil notification create -e OBJECT_FINALIZE -f json -t samplesheets gs://samplesheets

    If you get an access denied error while doing this, give the included script named create_notification_configuration.py a try. It uses the GCP Python API, which is much easier to work with as far as permissions are concerned (see the sketch after this list).

  5. Create a Pub/Sub subscription. For example:

    gcloud beta pubsub subscriptions create --topic samplesheets ssub

  6. Locate the Cloud Storage service account and grant it the IAM role pubsub.publisher. By default, a bucket doesn’t have the privilege to send notifications to Pub/Sub. Follow the instructions in steps 5 and 6 in this GCP documentation.
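
As mentioned in step 4, the notification configuration can also be created through the GCP Python API. Here is a sketch using google-cloud-storage that mirrors the gsutil command above; it is not necessarily how the bundled create_notification_configuration.py is written:

    from google.cloud import storage

    # Create the same notification configuration as the gsutil command in
    # step 4: JSON payloads for new or overwritten objects in
    # gs://samplesheets, published to the samplesheets topic.
    client = storage.Client()
    bucket = client.bucket("samplesheets")
    notification = bucket.notification(
        topic_name="samplesheets",
        event_types=["OBJECT_FINALIZE"],
        payload_format="JSON_API_V1",
    )
    notification.create()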

Mail notifications

If the ‘mail’ JSON object is set in your JSON configuration file, then the designated recipients will receive email notifications on the following events:

  • There is an Exception in the main thread.

  • A new Pub/Sub message is being processed (duplicates excluded).

You can use the script send_test_email.py to test that the mail configuration you provide is working. If it is, you should receive an email with the subject “ssub test email”.

The configuration file

This is a small JSON file that tells the monitor things such as which GCP bucket and Firestore collection to use. The possible keys are:

  • name: The client name of the subscriber. The name will appear in the subject line if email notification is configured, as well as in other places, e.g. log messages.

  • cycle_pause_sec: The number of seconds to wait in-between polls of the Pub/Sub topic. Defaults to 60.

  • firestore_collection: The name of the Google Firestore collection to use for persistent workflow state that downstream tools can query. If it doesn’t exist yet, it will be created. If this parameter is not provided, support for Firestore is turned off.

  • sweep_age_sec: When an analysis directory (within the ssub_runs directory) is older than this many seconds, remove it. Defaults to 604800 (1 week).

The user-supplied configuration file is validated against a built-in schema.
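
For example, a minimal configuration file using the keys above might look like this (the values are illustrative; the mail object described above is omitted here since its fields are not spelled out in this section):

    {
        "name": "ssub-prod",
        "cycle_pause_sec": 60,
        "firestore_collection": "sequencing_runs",
        "sweep_age_sec": 604800
    }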

Installation

Using a recent version of Python 3, run:

pip3 install ssub

It is recommended to start your compute instance (that will run the monitor) using a service account with the following roles:

  • roles/storage.objectAdmin

  • roles/datastore.owner

Alternatively, give your compute instance the cloud-platform scope.
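
For example, a sketch of launching such an instance (the instance name and service account email are placeholders):

    gcloud compute instances create ssub-worker \
        --service-account=ssub@my-project.iam.gserviceaccount.com \
        --scopes=cloud-platform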

Deployment

It is suggested to use the Dockerfile that comes with the repository.
