Skip to main content

A high-performance, fault-tolerant, and streaming-capable async downloader.

Project description

🐉 HydraStream

Python 3.11+ License: MIT Tests

HydraStream Demo

A high-performance, fault-tolerant, and streaming-capable downloader for Big Data. Built with pure Python, uvloop, and httpx.

💡 The Problem vs. The Solution

The Problem: Downloading massive datasets (ML weights, DB dumps, genomic sequences) using standard tools like wget or curl is slow due to single-connection limits. Furthermore, processing these huge files usually requires saving them to disk first, creating severe I/O bottlenecks and requiring massive storage.

The Solution: HydraStream acts like a multi-headed beast. It utilizes HTTP/2 multiplexing and concurrent chunk downloading to max out your bandwidth. Its killer feature is the Sequential Reordering Buffer, which instantly converts chaotic, multi-threaded downloads into a strict sequential byte stream, allowing you to pipe terabytes of data directly into other tools without ever touching your hard drive.


✨ Key Features

  • 🚀 Maximized Throughput: Concurrent chunk downloading using uvloop.
  • 🌊 True In-Memory Streaming: Downloads chunks asynchronously but yields them sequentially. Pipe data directly into parsers or Unix tools (zero disk I/O).
  • 🛡️ Bulletproof Reliability (The Hydra):
    • AIMD Rate Limiting & Circuit Breaker to prevent IP bans.
    • Exponential Backoff + Full Jitter for network drops.
    • Partial Chunk Commits: If a connection drops, it saves the exact byte offset. You never lose progress.
  • 🎯 The "Unplug" Challenge: We dare you to break it. Start a massive download, disable your Wi-Fi, close your laptop lid, or send a SIGSTOP to the process. Wait 10 minutes. Turn it back on. HydraStream won't crash. It will patiently wait, auto-recover, and resume from the exact byte where it left off.
  • 🧩 Smart Integrity Validation: Automatically extracts and verifies MD5 checksums from AWS S3 (ETag), Google Cloud (x-goog-hash), standard HTTP headers, and NCBI provider files.
  • 💾 Atomic Writes: Uses low-level os.pwrite to prevent Global Interpreter Lock (GIL) bottlenecks during disk I/O.
  • 📊 Adaptive UI:
    • Default: Beautiful, dynamic terminal UI powered by Rich (gradients, global ETA).
    • -nu / --no-ui: Plain text logs for CI/CD environments.
    • -q / --quiet: Strict POSIX compliance (stderr for logs, stdout for data streams).

🛠 Installation

Requires Python 3.11+. Install globally using uv (recommended) or pipx:

uv tool install git+https://github.com/Zhukovetski/HydraStream.git
pipx install git+https://github.com/Zhukovetski/HydraStream.git

You can use hydrastream, hstream, or simply hs to run the tool from anywhere in your system:

hs "https://ftp.ncbi.nlm.nih.gov/.../genome.fna.gz" -t 20 --output ./data

🚀 Usage

1. Basic Download (Disk Mode)

Download a file using 20 concurrent connections:

hs "https://ftp.ncbi.nlm.nih.gov/.../genome.fna.gz" -t 20 --output ./data

Disk Download Demo

*(If interrupted, rerun the exact command to resume from the last saved byte).*

2. Unix Pipeline Streaming (The Killer Feature) 💥

Download a compressed 100GB file, decompress it in memory, and process it—without saving the archive to your disk:

hs "https://ftp.ncbi.nlm.nih.gov/.../genome.fna.gz" -t 20 --stream --quiet | zcat | grep -c "^>"

Pipeline Streaming Demo

3. Use as a Python Library (For Data Science / MLOps)

Embed the streaming engine directly into your PyTorch/Pandas data loaders:

import asyncio
from hydrastream import HydraStream

async def main():
    urls =["https://url1.gz", "https://url2.gz"]

    async with HydraStream(threads=10, quiet=True) as loader:
        async for filename, stream in loader.stream_all(urls):
            print(f"Processing {filename}...")
            async for chunk_bytes in stream:
                # Feed raw bytes to your parser, ML model, or decompressor
                process_data(chunk_bytes)

asyncio.run(main())

⚙️ CLI Options

Option Shortcut Default Description
URLS - Required One or multiple URLs to download (separated by space).
--threads -t 1 Number of concurrent connections.
--output -o download/ Directory to save files and .state.json trackers.
--stream -s False Enable streaming mode (redirects data to stdout).
--no-ui -nu False Disables progress bars, leaves plain text logs.
--quiet -q False Dead silence. No console output at all (for strict pipelines).
--md5 None Expected MD5 hash (works only if a single URL is provided).
--buffer -b threads * 5MB Maximum stream buffer size in bytes.

🧠 Under the Hood (Architecture)

For those interested in System Design, this tool implements several advanced engineering patterns:

  • Network Congestion Control (AIMD): Implements an Additive Increase / Multiplicative Decrease algorithm and a Circuit Breaker pattern to dynamically scale requests, preventing IP bans and mitigating "Thundering Herd" problems.
  • Out-of-Order Execution to Sequential Stream: Uses asyncio.PriorityQueue for LIFO retry handling and heapq as a sliding reordering buffer to convert chaotic concurrent HTTP ranges into a strict sequential byte stream.
  • Zero-Overhead UI Debouncing: Uses a detached asynchronous refresh loop and a defaultdict buffer to batch terminal rendering operations, ensuring CPU load stays near 0% even at speeds of 500+ MB/s.
  • Crash-Proof State Persistence: Uses NamedTemporaryFile and POSIX directory fsync to guarantee atomic state saves. If power is lost mid-save, the state file never corrupts.
  • Smart File Discovery: Implements RFC 5987 parsing for Content-Disposition headers to extract complex UTF-8 filenames, falling back to URL parsing and mimetype guessing.
  • Graceful Shutdown & Fail-Fast: Intercepts SIGINT/SIGTERM, safely flushes queues with Poison Pills (-1), and instantly cancels all worker tasks for a specific file if a fatal HTTP 404/403 is encountered.

🗺️ Roadmap

The journey of the Hydra has just begun. Here is what is planned for the future:

  • v1.1: Autonomous Worker Scaling (AIMD)
    • Evolve the static thread pool into an adaptive concurrency manager. The system will dynamically spawn or kill download workers based on real-time network health and downstream pipeline backpressure. No more manual --threads tuning—the Hydra will automatically grow or shed heads to match your system's optimal capacity.
  • v2.0: Rewrite It In Rust (RIIR) 🦀
    • Port the core engine to Rust using tokio and reqwest to bypass the Python GIL. This will enable true multi-core execution, zero-cost abstractions, and bare-metal performance for hashing and I/O, while maintaining a Python wrapper (PyO3) for seamless Data Science / ML integration.

License

MIT License. Feel free to use, modify, and distribute.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

hydrastream-1.0.0.tar.gz (26.7 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

hydrastream-1.0.0-py3-none-any.whl (30.0 kB view details)

Uploaded Python 3

File details

Details for the file hydrastream-1.0.0.tar.gz.

File metadata

  • Download URL: hydrastream-1.0.0.tar.gz
  • Upload date:
  • Size: 26.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.15

File hashes

Hashes for hydrastream-1.0.0.tar.gz
Algorithm Hash digest
SHA256 df7d03d52e1164e61893e2943fbfae9b16d3e5f44d720012759ba80233b7465d
MD5 8e3492c67159fd9aaf85d7dcd054401f
BLAKE2b-256 051e0a31cff597fcbc632fe5484a42b9cf4f8441dc163a88c02fa91ba46d934c

See more details on using hashes here.

File details

Details for the file hydrastream-1.0.0-py3-none-any.whl.

File metadata

  • Download URL: hydrastream-1.0.0-py3-none-any.whl
  • Upload date:
  • Size: 30.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.15

File hashes

Hashes for hydrastream-1.0.0-py3-none-any.whl
Algorithm Hash digest
SHA256 3195ad648698a9ce6062e02595762e3e3e42dc3761802e7e626b64caacc5f44b
MD5 16f831153739dbdc77c4fb528da2ce42
BLAKE2b-256 97c5313b98996aa9734b297fea75432c89669bcf57980ffe3ba098b6f99a1c1a

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page