Snakemake storage plugin for downloading files via HTTP with caching and rate limiting
Snakemake Storage Plugin: Cached HTTP
A Snakemake storage plugin for downloading files via HTTP with local caching, checksum verification, and adaptive rate limiting.
Supported sources:
- zenodo.org - Zenodo data repository (checksum from API)
- data.pypsa.org - PyPSA data repository (checksum from manifest.yaml)
- storage.googleapis.com - Google Cloud Storage (checksum from GCS JSON API)
- any http(s) URL - Generic fallback with size/mtime from HTTP headers
Features
- Local caching: Downloads are cached to avoid redundant transfers (can be disabled)
- Checksum verification: Automatically verifies checksums (from Zenodo API, data.pypsa.org manifests, or GCS object metadata)
- Rate limit handling: Automatically respects Zenodo's rate limits using X-RateLimit-* headers with exponential backoff retry
- Concurrent download control: Limits simultaneous downloads to prevent overwhelming servers
- Resumable downloads: Interrupted transfers resume from where they left off using HTTP range requests
- Progress bars: Shows download progress with tqdm
- Immutable URLs: Returns mtime=0 for Zenodo and data.pypsa.org (persistent URLs); uses actual mtime for GCS and generic HTTP
- Environment variable support: Configure via environment variables for CI/CD workflows
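As an illustration of the resumable-download feature listed above, the following is a minimal sketch of how a client can continue an interrupted transfer with an HTTP range request. The helper names `resume_headers` and `open_mode` are hypothetical and are not taken from the plugin's source; the sketch only shows the general technique.

```python
def resume_headers(bytes_on_disk: int) -> dict[str, str]:
    """Build request headers that ask the server to skip bytes we already have.

    Hypothetical helper: an empty dict means "download from the start".
    """
    if bytes_on_disk > 0:
        # "bytes=N-" requests everything from offset N to the end of the file.
        return {"Range": f"bytes={bytes_on_disk}-"}
    return {}


def open_mode(status_code: int) -> str:
    """Choose how to open the partial file based on the server's answer.

    206 Partial Content means the range was honoured, so we append.
    Any other success status means the full body is being sent again,
    so the partial file must be overwritten.
    """
    return "ab" if status_code == 206 else "wb"
```

A server that ignores the `Range` header typically answers with status 200 and the full body, which is why the sketch falls back to overwriting in that case.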
Installation
From the pypsa-eur repository root:
pip install -e plugins/snakemake-storage-plugin-cached-http
Configuration
The cached HTTP storage plugin works alongside other storage providers (like HTTP). Snakemake automatically routes URLs to the correct provider based on the URL pattern.
Register additional settings in your Snakefile if you want to customize the defaults:
# Optional: Configure cached HTTP storage with custom settings
# This extends the existing storage configuration (e.g., for HTTP)
storage cached_http:
    provider="cached-http",
    cache="~/.cache/snakemake-pypsa-eur",  # Default location
    max_concurrent_downloads=3,  # Download max 3 files at once
If you don't explicitly configure it, the plugin will use default settings automatically.
Settings
- cache (optional): Cache directory for downloaded files
  - Default: Platform-dependent user cache directory (via platformdirs.user_cache_dir("snakemake-pypsa-eur"))
  - Set to "" (empty string) to disable caching
  - Files are cached here to avoid re-downloading
  - Environment variable: SNAKEMAKE_STORAGE_CACHED_HTTP_CACHE
- skip_remote_checks (optional): Skip metadata checking with remote API
  - Default: False (perform checks)
  - Set to True or "1" to skip remote existence/size checks (useful for CI/CD)
  - Environment variable: SNAKEMAKE_STORAGE_CACHED_HTTP_SKIP_REMOTE_CHECKS
- max_concurrent_downloads (optional): Maximum concurrent downloads
  - Default: 3
  - Controls how many files can be downloaded simultaneously
  - No environment variable support
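To make the interaction between the settings and their environment variables concrete, here is a minimal sketch of how such values could be resolved. It is not the plugin's actual code: the function `resolve_settings`, its `default_cache` parameter, and the accepted truthy spellings are assumptions for illustration. Only the two environment variable names and the meaning of the empty string come from the list above.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

ENV_CACHE = "SNAKEMAKE_STORAGE_CACHED_HTTP_CACHE"
ENV_SKIP = "SNAKEMAKE_STORAGE_CACHED_HTTP_SKIP_REMOTE_CHECKS"


def resolve_settings(
    env: Optional[Mapping[str, str]] = None,
    default_cache: str = "~/.cache/snakemake-pypsa-eur",
) -> dict:
    """Hypothetical resolver: environment variables override the defaults."""
    env = os.environ if env is None else env

    raw_cache = env.get(ENV_CACHE, default_cache)
    # An empty string disables caching, represented here as None.
    cache = Path(raw_cache).expanduser() if raw_cache else None

    # Assumption: "1" and "true" (any case) count as enabled.
    skip = env.get(ENV_SKIP, "").strip().lower() in {"1", "true"}

    return {"cache": cache, "skip_remote_checks": skip}
```

Passing a plain dictionary as `env` makes the behaviour easy to check without touching the real process environment, for example `resolve_settings({ENV_CACHE: ""})` yields a disabled cache.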
Usage
Use any HTTP(S) URL directly in your rules. Snakemake automatically routes all HTTP(S) URLs to this plugin:
rule download_zenodo:
    input:
        storage("https://zenodo.org/records/3520874/files/natura.tiff"),
    output:
        "resources/natura.tiff"
    shell:
        "cp {input} {output}"

rule download_pypsa:
    input:
        storage("https://data.pypsa.org/workflows/eur/eez/v12_20231025/World_EEZ_v12_20231025_LR.zip"),
    output:
        "resources/eez.zip"
    shell:
        "cp {input} {output}"

rule download_gcs:
    input:
        storage("https://storage.googleapis.com/open-tyndp-data-store/CBA_projects.zip"),
    output:
        "resources/cba_projects.zip"
    shell:
        "cp {input} {output}"

rule download_generic:
    input:
        storage("https://example.com/data/dataset.csv"),
    output:
        "resources/dataset.csv"
    shell:
        "cp {input} {output}"
Or if you configured a tagged storage entity:
rule download_data:
    input:
        storage.cached_http(
            "https://zenodo.org/records/3520874/files/natura.tiff"
        ),
    output:
        "resources/natura.tiff"
    shell:
        "cp {input} {output}"
The plugin will:
- Check if the file exists in the cache (if caching is enabled)
- If cached, copy from cache (fast)
- If not cached, download with:
  - Progress bar showing download status
  - Automatic rate limit handling with exponential backoff retry
  - Concurrent download limiting
  - Checksum verification where available (Zenodo API, data.pypsa.org manifest, GCS metadata)
- Store in cache for future use (if caching is enabled)
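The steps above can be sketched as a small cache-or-download routine. This is only an illustration under stated assumptions, not the plugin's implementation: the function `fetch`, the injected `download` callable, the use of a SHA-256 hash of the URL as cache key, and SHA-256 as the checksum algorithm are all hypothetical choices made for the example.

```python
import hashlib
import shutil
from pathlib import Path
from typing import Callable, Optional


def file_sha256(path: Path) -> str:
    """Hash a file in chunks so large downloads do not fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fetch(
    url: str,
    dest: Path,
    cache_dir: Optional[Path],
    download: Callable[[str, Path], None],
    expected_sha256: Optional[str] = None,
) -> str:
    """Return "cache" or "download" depending on where the file came from."""
    # Assumption: cache entries are keyed by a hash of the URL.
    cached = None
    if cache_dir is not None:
        cached = cache_dir / hashlib.sha256(url.encode()).hexdigest()

    # Steps 1 and 2: serve from the cache when possible.
    if cached is not None and cached.exists():
        shutil.copyfile(cached, dest)
        return "cache"

    # Step 3: download, then verify the checksum if one is known.
    download(url, dest)
    if expected_sha256 is not None and file_sha256(dest) != expected_sha256:
        dest.unlink()
        raise ValueError(f"checksum mismatch for {url}")

    # Step 4: store the verified file for future runs.
    if cached is not None:
        cached.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(dest, cached)
    return "download"
```

Passing `cache_dir=None` corresponds to disabled caching: the file is downloaded and verified on every call and nothing is stored.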
Example: CI/CD Configuration
For continuous integration environments where you want to skip caching and remote checks:
# GitHub Actions example
- name: Run snakemake workflows
  env:
    SNAKEMAKE_STORAGE_CACHED_HTTP_CACHE: ""
    SNAKEMAKE_STORAGE_CACHED_HTTP_SKIP_REMOTE_CHECKS: "1"
  run: |
    snakemake --cores all
Rate Limiting and Retry
Zenodo API limits:
- Guest users: 60 requests/minute
- Authenticated users: 100 requests/minute
The plugin automatically:
- Monitors X-RateLimit-Remaining header
- Waits when rate limit is reached
- Uses X-RateLimit-Reset to calculate wait time
- Retries failed requests with exponential backoff (up to 5 attempts)
- Handles transient errors: HTTP errors, timeouts, checksum mismatches, and network issues
- Resumes interrupted downloads using Range requests where supported by the server
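The behaviour listed above can be sketched as two small helpers: one that derives a wait time from the rate limit headers, and one that retries an action with exponential backoff. This is a hedged illustration rather than the plugin's code. The names `wait_seconds` and `with_retries`, the one-second base delay, and the interpretation of X-RateLimit-Reset as a Unix timestamp are assumptions.

```python
import time
from typing import Callable, Mapping, TypeVar

T = TypeVar("T")


def wait_seconds(headers: Mapping[str, str], now: float) -> float:
    """Seconds to pause before the next request, based on rate limit headers."""
    remaining = headers.get("X-RateLimit-Remaining")
    reset = headers.get("X-RateLimit-Reset")
    if remaining is None or reset is None:
        return 0.0  # Server sent no rate limit information.
    if int(remaining) > 0:
        return 0.0  # Requests are still available in this window.
    # Assumption: the reset header is a Unix timestamp in seconds.
    return max(0.0, float(reset) - now)


def with_retries(
    action: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `action`, retrying on any exception with doubling delays."""
    for attempt in range(max_attempts):
        try:
            return action()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the last error.
            sleep(base_delay * 2**attempt)  # 1s, 2s, 4s, 8s, ...
    raise RuntimeError("max_attempts must be at least 1")
```

The `sleep` parameter is injectable so the retry logic can be exercised without real waiting. A real client would likely catch only transient error types rather than every exception.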
URL Handling
This plugin accepts all HTTP(S) URLs and replaces snakemake-storage-plugin-http. It provides
enhanced support for specific sources:
| Source | Checksum | mtime | Immutable |
|---|---|---|---|
| zenodo.org, sandbox.zenodo.org | ✓ (from API) | — | ✓ |
| data.pypsa.org | ✓ (from manifest.yaml) | — | ✓ |
| storage.googleapis.com | ✓ (from GCS API) | ✓ | — |
| any other HTTP(S) | — | ✓ (Last-Modified) | — |
Generic HTTP URLs are treated as mutable: size and mtime are read from Content-Length and
Last-Modified response headers. Servers that do not support HEAD requests are handled
gracefully (size and mtime default to 0).
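As a minimal sketch of the generic fallback described above, the following shows how size and mtime could be derived from response headers, defaulting to 0 when a header is missing or malformed. The function name `size_and_mtime` is hypothetical, and the plugin's own parsing may differ.

```python
from email.utils import parsedate_to_datetime
from typing import Mapping


def size_and_mtime(headers: Mapping[str, str]) -> tuple[int, float]:
    """Extract (size in bytes, mtime as Unix timestamp) from HTTP headers."""
    try:
        size = int(headers.get("Content-Length", 0))
    except ValueError:
        size = 0  # Header present but not a number.

    try:
        # Last-Modified uses the HTTP date format, e.g.
        # "Wed, 21 Oct 2015 07:28:00 GMT".
        mtime = parsedate_to_datetime(headers["Last-Modified"]).timestamp()
    except (KeyError, TypeError, ValueError):
        mtime = 0.0  # Header missing or unparseable.

    return size, mtime
```

An empty header mapping, as might result from a server that rejects HEAD requests, yields `(0, 0.0)`.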
License
MIT License