snakemake-storage-plugin-rucio

A Snakemake storage plugin that handles files available through Rucio.

Usage

A manual for using different storage providers with Snakemake is available here. Documentation for this plugin is available here. Below are some examples of using the plugin in a Snakemake rule.

Download files

Download all input files and then run the workflow.

Snakefile content:

rule download:
    input:
        storage("rucio://test_scope/test{sample}.txt")
    output:
        "results/test{sample}.txt"
    shell:
        "mv {input} {output}"

The command below downloads the files test1.txt and test2.txt from the scope test_scope and moves them to results/test1.txt and results/test2.txt respectively. The --verbose flag prints debug logging, which helps when things do not work on the first attempt.

snakemake --cores 2 --verbose results/test1.txt results/test2.txt
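
The storage queries in these examples follow the pattern rucio://<scope>/<name>. The following is a minimal sketch of how such a query splits into its two parts, assuming the scope sits in the host position of the URL and the file name in the path. The helper split_rucio_query is hypothetical and not part of the plugin, whose own parsing may differ.

```python
from urllib.parse import urlparse


def split_rucio_query(query: str) -> tuple[str, str]:
    """Split a rucio://<scope>/<name> query into (scope, name).

    Hypothetical helper for illustration only.
    """
    parsed = urlparse(query)
    if parsed.scheme != "rucio":
        raise ValueError(f"not a rucio query: {query}")
    scope = parsed.netloc            # host position holds the scope
    name = parsed.path.lstrip("/")   # path holds the file name
    if not scope or not name:
        raise ValueError(f"expected rucio://<scope>/<name>, got: {query}")
    return scope, name


print(split_rucio_query("rucio://test_scope/test1.txt"))
```

A check like this can be handy for validating queries before they are passed to storage(...) in a Snakefile.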

Only get the URLs for later download

This is useful if the workflow processes multiple input files on multiple CPU cores and you would like to overlap download with computations, or if there is not enough storage space available to download all files prior to processing them.

Snakefile content:

rule get_url:
    input:
        storage("rucio://test_scope/test{sample}.txt", retrieve=False)
    output:
        "results/url{sample}.txt"
    shell:
        "echo {input} > {output}"

The command below stores the URLs of the files test1.txt and test2.txt from the scope test_scope in the files results/url1.txt and results/url2.txt respectively. In a real workflow, these URLs would be used to download each file when it is needed.

snakemake --cores 2 results/url1.txt results/url2.txt
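
As a rough sketch of how a later step might consume one of these URL files, the helper below reads the stored URL and hands it to a download function only when the data is actually needed. The names fetch_when_needed and fake_download, as well as the example URL, are placeholders invented for this illustration. A real workflow would pass in a function that wraps an actual transfer client.

```python
import tempfile
from pathlib import Path
from typing import Callable


def fetch_when_needed(
    url_file: str, target: str, download: Callable[[str, str], None]
) -> Path:
    """Read the URL stored in url_file and download it to target."""
    url = Path(url_file).read_text().strip()
    target_path = Path(target)
    target_path.parent.mkdir(parents=True, exist_ok=True)
    download(url, str(target_path))
    return target_path


def fake_download(url: str, destination: str) -> None:
    # Stand-in for a real transfer client; it only records the URL.
    Path(destination).write_text(f"downloaded from {url}\n")


with tempfile.TemporaryDirectory() as tmp:
    url_file = Path(tmp) / "url1.txt"
    url_file.write_text("https://example.invalid/test1.txt\n")  # placeholder URL
    local = fetch_when_needed(str(url_file), f"{tmp}/data/test1.txt", fake_download)
    print(local.read_text())
```

Passing the download function as an argument keeps the sketch independent of any particular client library.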

Only get the URL and stream the data

This is useful if your input files are large and you only need part of the data or the data does not fit in local storage.

Snakefile content:

rule stream_file:
    input:
        storage("rucio://test_scope/test{sample}.txt", retrieve=False)
    output:
        "results/stream{sample}.txt"
    run:
        # Stream the file content into the output file.
        import gfal2
        from pathlib import Path

        input_url = input[0]
        output_path = output[0]
        print(f"Copying from {input_url} to {output_path}")
        Path(output_path).parent.mkdir(parents=True, exist_ok=True)

        ctx = gfal2.creat_context()  # the gfal2 bindings spell it "creat", not "create"
        size = ctx.stat(input_url).st_size
        file = ctx.open(input_url, "r")
        chunk_size = 2  # read 2 byte chunks for demonstration purposes
        # Round up so a trailing partial chunk is read without an extra empty read.
        n_chunks = (size + chunk_size - 1) // chunk_size
        with open(output_path, "w") as out_file:
            for _ in range(n_chunks):
                data = file.read(chunk_size)
                out_file.write(data)

The command below retrieves the URLs of the files test1.txt and test2.txt from the scope test_scope and streams their content in 2-byte chunks to the files results/stream1.txt and results/stream2.txt respectively. In a real workflow, larger chunks or a smarter access pattern that reads only the required parts of the data are recommended.

snakemake --cores 2 --verbose results/stream1.txt results/stream2.txt
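
To illustrate the advice about larger chunks, here is a minimal sketch of a chunked copy loop written against ordinary Python file-like objects rather than gfal2 handles. The function name copy_in_chunks and the 1 MiB default are assumptions chosen for this example. The loop stops at the first empty read instead of computing the number of chunks up front, which assumes the source returns an empty result at end of stream.

```python
import io
from typing import BinaryIO


def copy_in_chunks(
    source: BinaryIO, sink: BinaryIO, chunk_size: int = 1024 * 1024
) -> int:
    """Copy source to sink chunk by chunk; return the number of bytes copied."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    copied = 0
    while True:
        data = source.read(chunk_size)
        if not data:  # an empty read marks the end of the stream
            break
        sink.write(data)
        copied += len(data)
    return copied


source = io.BytesIO(b"Hello world\n")
sink = io.BytesIO()
print(copy_in_chunks(source, sink, chunk_size=4), sink.getvalue())
```

Only one chunk is held in memory at a time, so the chunk size trades memory use against the number of read calls.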

Upload a file

Upload a file using Rucio.

Snakefile content:

rule upload:
    output:
        "rucio://test_scope/test_file.txt"
    message:
        "Writing Hello world to {output} and uploading"
    shell:
        """
        echo "Hello world" > {output}
        """

The command below writes some text to a local file test_file.txt and uploads it to Rucio. The file is uploaded to a storage element matching the RSE expression TEST_RSE_EXPRESSION in the scope test_scope and attached to the dataset test_dataset. Specifying the target dataset is required to avoid creating one replication rule per file, which would make the number of replication rules unmanageable.

snakemake --default-storage-provider rucio --storage-rucio-upload-rse TEST_RSE_EXPRESSION --storage-rucio-upload-dataset test_dataset --cores 1 --verbose 'rucio://test_scope/test_file.txt'
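
Because the upload command carries several options, it can help to assemble it programmatically when launching Snakemake from a script. The sketch below builds the argument list using only the flags shown in the command above. The function name build_upload_command is hypothetical.

```python
import shlex


def build_upload_command(
    target: str, rse_expression: str, dataset: str, cores: int = 1
) -> list[str]:
    """Return the snakemake argument list for uploading target via Rucio."""
    return [
        "snakemake",
        "--default-storage-provider", "rucio",
        "--storage-rucio-upload-rse", rse_expression,
        "--storage-rucio-upload-dataset", dataset,
        "--cores", str(cores),
        "--verbose",
        target,
    ]


command = build_upload_command(
    "rucio://test_scope/test_file.txt", "TEST_RSE_EXPRESSION", "test_dataset"
)
print(shlex.join(command))
```

The resulting list can be passed directly to subprocess.run(command, check=True), which avoids shell quoting issues.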

Contributing

Contributions are very welcome. Instructions on how to get started are available in the contribution guidelines.
