Snakemake storage plugin for downloading files via HTTP with caching and rate limiting

Snakemake Storage Plugin: Cached HTTP

A Snakemake storage plugin for downloading files via HTTP with local caching, checksum verification, and adaptive rate limiting.

Note: This plugin is currently designed specifically for zenodo.org URLs.

Features

  • Local caching: Downloads are cached to avoid redundant transfers (can be disabled)
  • Checksum verification: Automatically verifies downloads against the MD5 checksums published by the Zenodo API
  • Rate limit handling: Automatically respects Zenodo's rate limits using X-RateLimit-* headers with exponential backoff retry
  • Concurrent download control: Limits simultaneous downloads to prevent overwhelming Zenodo
  • Progress bars: Shows download progress with tqdm
  • Immutable URLs: Returns mtime=0 since Zenodo URLs are persistent
  • Environment variable support: Configure via environment variables for CI/CD workflows

Installation

From the pypsa-eur repository root:

pip install -e plugins/snakemake-storage-plugin-cached-http

Configuration

The Cached HTTP storage plugin works alongside other storage providers (such as the HTTP plugin). Snakemake automatically routes URLs to the correct provider based on the URL pattern.

Register additional settings in your Snakefile if you want to customize the defaults:

# Optional: Configure cached HTTP storage with custom settings
# This extends the existing storage configuration (e.g., for HTTP)
storage cached_http:
    provider="cached-http",
    cache="~/.cache/snakemake-pypsa-eur",  # Cache directory (the default is platform-dependent)
    max_concurrent_downloads=3,  # Download max 3 files at once

If you don't explicitly configure it, the plugin will use default settings automatically.

Settings

  • cache (optional): Cache directory for downloaded files

    • Default: Platform-dependent user cache directory (via platformdirs.user_cache_dir("snakemake-pypsa-eur"))
    • Set to "" (empty string) to disable caching
    • Files are cached here to avoid re-downloading
    • Environment variable: SNAKEMAKE_STORAGE_CACHED_HTTP_CACHE
  • skip_remote_checks (optional): Skip metadata checks against the remote API

    • Default: False (perform checks)
    • Set to True or "1" to skip remote existence/size checks (useful for CI/CD)
    • Environment variable: SNAKEMAKE_STORAGE_CACHED_HTTP_SKIP_REMOTE_CHECKS
  • max_concurrent_downloads (optional): Maximum concurrent downloads

    • Default: 3
    • Controls how many files can be downloaded simultaneously
    • No environment variable support
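
The way the two environment variables map onto settings can be sketched in plain Python. This is only an illustration, not the plugin's actual code: `resolve_settings` is a hypothetical helper, the default cache path is passed in by the caller, and accepting "true" in addition to "1" is an assumption.

```python
import os

ENV_CACHE = "SNAKEMAKE_STORAGE_CACHED_HTTP_CACHE"
ENV_SKIP = "SNAKEMAKE_STORAGE_CACHED_HTTP_SKIP_REMOTE_CHECKS"


def resolve_settings(env, default_cache):
    """Hypothetical helper: derive effective settings from an environment mapping."""
    # An unset variable falls back to the default cache directory;
    # an empty string disables caching (represented here as None).
    cache = env.get(ENV_CACHE, default_cache)
    cache_dir = cache if cache else None
    # Assumption: "1" and "true" (any case) both enable the flag.
    skip = env.get(ENV_SKIP, "").strip().lower() in {"1", "true"}
    return {"cache": cache_dir, "skip_remote_checks": skip}


if __name__ == "__main__":
    print(resolve_settings(os.environ, "~/.cache/snakemake-pypsa-eur"))
```

Passing the environment as a mapping keeps the helper easy to exercise without touching the real process environment.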

Usage

Use Zenodo URLs directly in your rules. Snakemake automatically detects zenodo.org URLs and routes them to this plugin:

rule download_data:
    input:
        storage("https://zenodo.org/records/3520874/files/natura.tiff"),
    output:
        "resources/natura.tiff"
    shell:
        "cp {input} {output}"

Or if you configured a tagged storage entity:

rule download_data:
    input:
        storage.cached_http(
            "https://zenodo.org/records/3520874/files/natura.tiff"
        ),
    output:
        "resources/natura.tiff"
    shell:
        "cp {input} {output}"

The plugin will:

  1. Check if the file exists in the cache (if caching is enabled)
  2. If cached, copy from cache (fast)
  3. If not cached, download from Zenodo with:
    • Progress bar showing download status
    • Automatic rate limit handling with exponential backoff retry
    • Concurrent download limiting
    • MD5 checksum verification against Zenodo API metadata
  4. Store in cache for future use (if caching is enabled)
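
The steps above can be sketched as a small standalone function. This is a minimal illustration under stated assumptions, not the plugin's implementation: `fetch` and `md5_of` are hypothetical names, the actual transfer is delegated to a caller-supplied `download` function, and progress bars, rate limiting, and concurrency control are left out.

```python
import hashlib
import shutil
from pathlib import Path


def md5_of(path):
    """Return the hex MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fetch(name, target, cache_dir, download, expected_md5):
    """Hypothetical helper mirroring steps 1-4.

    `download(dest)` must write the remote file to `dest`;
    `cache_dir=None` means caching is disabled.
    """
    target = Path(target)
    cached = Path(cache_dir) / name if cache_dir else None
    if cached is not None and cached.exists():
        shutil.copy2(cached, target)  # steps 1-2: cache hit, copy locally
        return "cache"
    download(target)  # step 3: fetch from the remote
    if md5_of(target) != expected_md5:
        target.unlink()  # do not keep a corrupt file
        raise ValueError(f"MD5 mismatch for {name}")
    if cached is not None:
        cached.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(target, cached)  # step 4: populate the cache
    return "download"
```

Verifying the checksum before writing to the cache means a corrupted transfer never ends up being served from the cache later.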

Example: CI/CD Configuration

For continuous integration environments where you want to skip caching and remote checks:

# GitHub Actions example
- name: Run snakemake workflows
  env:
    SNAKEMAKE_STORAGE_CACHED_HTTP_CACHE: ""
    SNAKEMAKE_STORAGE_CACHED_HTTP_SKIP_REMOTE_CHECKS: "1"
  run: |
    snakemake --cores all

Rate Limiting and Retry

Zenodo API limits:

  • Guest users: 60 requests/minute
  • Authenticated users: 100 requests/minute

The plugin automatically:

  • Monitors X-RateLimit-Remaining header
  • Waits when rate limit is reached
  • Uses X-RateLimit-Reset to calculate wait time
  • Retries failed requests with exponential backoff (up to 5 attempts)
  • Handles transient errors: HTTP errors, timeouts, checksum mismatches, and network issues
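
The waiting and retry behaviour listed above can be sketched as two small helpers. Both `wait_seconds` and `retry` are hypothetical names, and the sketch assumes that X-RateLimit-Reset carries a Unix timestamp in seconds and that the backoff doubles from a one-second base; the plugin's actual delays may differ.

```python
import time


def wait_seconds(headers, now):
    """Seconds to pause before the next request, based on rate-limit headers.

    Assumption: X-RateLimit-Reset is a Unix timestamp in seconds.
    """
    remaining = int(headers.get("X-RateLimit-Remaining", 1))
    if remaining > 0:
        return 0.0  # budget left, no need to wait
    reset = float(headers.get("X-RateLimit-Reset", now))
    return max(0.0, reset - now)


def retry(func, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call func(), retrying on any exception with doubling delays."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
```

Injecting `sleep` and `now` keeps both helpers deterministic, so they can be checked without real waiting or network access.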

URL Handling

  • Only handles URLs containing zenodo.org
  • Other HTTP(S) URLs are handled by the standard snakemake-storage-plugin-http
  • Both plugins can coexist in the same workflow

Plugin Priority

When using storage() without specifying a plugin name, Snakemake checks all installed plugins:

  • Cached HTTP plugin: Only accepts zenodo.org URLs (is_valid_query returns True only for zenodo.org)
  • HTTP plugin: Accepts all HTTP/HTTPS URLs (including zenodo.org)

If both plugins are installed, zenodo.org URLs are ambiguous because both plugins accept them. Snakemake would normally raise the error "Multiple suitable storage providers found" when storage() is called without a plugin name, which would force you to call the Cached HTTP provider explicitly with storage.cached_http(url) instead of storage(url). To avoid this, this plugin monkey-patches the HTTP plugin to refuse zenodo.org URLs, so storage(url) also works for them.
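
The kind of URL test that decides which plugin accepts a query can be sketched as follows. `is_zenodo_url` is a hypothetical standalone function, not the plugin's is_valid_query; it matches on the hostname, which is an assumption and is stricter than a plain "contains zenodo.org" substring check.

```python
from urllib.parse import urlparse


def is_zenodo_url(url):
    """Hypothetical check: True for http(s) URLs hosted on zenodo.org or a subdomain."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    return parsed.scheme in {"http", "https"} and (
        host == "zenodo.org" or host.endswith(".zenodo.org")
    )
```

Matching on the parsed hostname avoids false positives for URLs that merely mention zenodo.org in their path or query string.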

License

MIT License

Source Distribution

snakemake_storage_plugin_cached_http-0.1.0.tar.gz (22.0 kB)

File details

Details for the file snakemake_storage_plugin_cached_http-0.1.0.tar.gz.

File hashes

Hashes for snakemake_storage_plugin_cached_http-0.1.0.tar.gz
Algorithm Hash digest
SHA256 07fe4d130a323554d0b50574f9dfc1ab59fbe5db941778be07ad0b6becb604a1
MD5 d712dde5a8b4da5332ae32f526c67f75
BLAKE2b-256 b5329c2a3cb83f23a4d8c25e988f9f691d750ec3410183abf7171a9edaf5e768

Provenance

The following attestation bundles were made for snakemake_storage_plugin_cached_http-0.1.0.tar.gz:

Publisher: publish.yml on PyPSA/snakemake-storage-plugin-cached-http

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file snakemake_storage_plugin_cached_http-0.1.0-py3-none-any.whl.

File hashes

Hashes for snakemake_storage_plugin_cached_http-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 5001530695488da8a79bd222c2f2f25674088cdd771af41b522ce08a68cf2183
MD5 0f6dd08b5f23df98f1517916e0fc5656
BLAKE2b-256 51c222af3ad190cb2a1679b61e196db0eabcd7d88f92e8f3289ca5d8ca854688

Provenance

The following attestation bundles were made for snakemake_storage_plugin_cached_http-0.1.0-py3-none-any.whl:

Publisher: publish.yml on PyPSA/snakemake-storage-plugin-cached-http
