A high-performance, fault-tolerant, and streaming-capable async downloader.
🐉 HydraStream
A high-performance, fault-tolerant, and streaming-capable downloader for Big Data. Built with pure Python, uvloop, and httpx.
💡 The Problem vs. The Solution
The Problem: Downloading massive datasets (ML weights, DB dumps, genomic sequences) using standard tools like wget or curl is slow due to single-connection limits. Furthermore, processing these huge files usually requires saving them to disk first, creating severe I/O bottlenecks and requiring massive storage.
The Solution: HydraStream acts like a multi-headed beast. It uses HTTP/2 multiplexing and concurrent chunk downloading to saturate your available bandwidth. Its killer feature is the Sequential Reordering Buffer, which converts out-of-order concurrent chunk downloads into a strict sequential byte stream, so you can pipe terabytes of data directly into other tools without writing the file to disk first.
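To make the reordering idea concrete, here is a minimal sketch of a heap-based reordering buffer. It is not HydraStream's actual implementation: the `reorder` function, the `(index, payload)` tuple shape, and the assumption that chunk indices are unique and contiguous from 0 are all illustrative.

```python
import heapq
from typing import Iterable, Iterator, Tuple


def reorder(chunks: Iterable[Tuple[int, bytes]]) -> Iterator[bytes]:
    """Yield payloads in index order from (index, payload) pairs arriving in any order.

    Assumption for this sketch: indices are unique and cover 0..n-1.
    """
    pending: list[Tuple[int, bytes]] = []  # min-heap keyed by chunk index
    next_index = 0  # index of the next chunk the consumer is waiting for

    for index, payload in chunks:
        heapq.heappush(pending, (index, payload))
        # Emit every buffered chunk that is now contiguous with the output.
        while pending and pending[0][0] == next_index:
            _, ready = heapq.heappop(pending)
            yield ready
            next_index += 1


# Chunks finish downloading out of order but are emitted sequentially.
arrival_order = [(2, b"C"), (0, b"A"), (1, b"B"), (3, b"D")]
result = b"".join(reorder(arrival_order))
```

In a real downloader the buffer would also need a size cap (compare the `--buffer` option) so that a slow early chunk cannot make memory grow without bound.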
✨ Key Features
- 🚀 Maximized Throughput: Concurrent chunk downloading using uvloop.
- 🌊 True In-Memory Streaming: Downloads chunks asynchronously but yields them sequentially. Pipe data directly into parsers or Unix tools (zero disk I/O).
- 🛡️ Bulletproof Reliability (The Hydra):
  - AIMD Rate Limiting & Circuit Breaker to prevent IP bans.
  - Exponential Backoff + Full Jitter for network drops.
  - Partial Chunk Commits: If a connection drops, it saves the exact byte offset. You never lose progress.
- 🎯 The "Unplug" Challenge: We dare you to break it. Start a massive download, disable your Wi-Fi, close your laptop lid, or send a SIGSTOP to the process. Wait 10 minutes. Turn it back on. HydraStream won't crash. It will patiently wait, auto-recover, and resume from the exact byte where it left off.
- 🧩 Smart Integrity Validation: Automatically extracts and verifies MD5 checksums from AWS S3 (ETag), Google Cloud (x-goog-hash), standard HTTP headers, and NCBI provider files.
- 💾 Atomic Writes: Uses low-level os.pwrite to prevent Global Interpreter Lock (GIL) bottlenecks during disk I/O.
- 📊 Adaptive UI:
  - Default: Beautiful, dynamic terminal UI powered by Rich (gradients, global ETA).
  - -nu / --no-ui: Plain text logs for CI/CD environments.
  - -q / --quiet: Strict POSIX compliance (stderr for logs, stdout for data streams).
🛠 Installation
Requires Python 3.11+. Install globally using uv (recommended) or pipx:
uv tool install git+https://github.com/Zhukovetski/HydraStream.git
pipx install git+https://github.com/Zhukovetski/HydraStream.git
You can use hydrastream, hstream, or simply hs to run the tool from anywhere in your system:
hs "https://ftp.ncbi.nlm.nih.gov/.../genome.fna.gz" -t 20 --output ./data
🚀 Usage
1. Basic Download (Disk Mode)
Download a file using 20 concurrent connections:
hs "https://ftp.ncbi.nlm.nih.gov/.../genome.fna.gz" -t 20 --output ./data
2. Unix Pipeline Streaming (The Killer Feature) 💥
Download a compressed 100GB file, decompress it in memory, and process it, all without saving the archive to your disk:
hs "https://ftp.ncbi.nlm.nih.gov/.../genome.fna.gz" -t 20 --stream --quiet | zcat | grep -c "^>"
3. Use as a Python Library (For Data Science / MLOps)
Embed the streaming engine directly into your PyTorch/Pandas data loaders:
import asyncio

from hydrastream import HydraStream


def process_data(chunk_bytes: bytes) -> None:
    """Placeholder: replace with your parser, ML model, or decompressor."""


async def main():
    urls = ["https://url1.gz", "https://url2.gz"]
    async with HydraStream(threads=10, quiet=True) as loader:
        async for filename, stream in loader.stream_all(urls):
            print(f"Processing {filename}...")
            async for chunk_bytes in stream:
                # Feed raw bytes to your parser, ML model, or decompressor
                process_data(chunk_bytes)


asyncio.run(main())
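As an illustration of what a chunk consumer might look like, the sketch below decompresses a gzip byte stream incrementally and counts FASTA header lines, roughly mirroring the `zcat | grep -c "^>"` pipeline. It does not import HydraStream: `fake_stream` is a hypothetical stand-in for the per-file stream, and `count_fasta_records` is an illustrative name.

```python
import asyncio
import gzip
import zlib
from typing import AsyncIterator


async def count_fasta_records(stream: AsyncIterator[bytes]) -> int:
    """Count lines starting with '>' in a gzip-compressed stream of byte chunks.

    Limitation of this sketch: only the first gzip member is decoded, so
    multi-member archives (which zcat handles) would need extra logic.
    """
    # wbits = MAX_WBITS | 16 tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(zlib.MAX_WBITS | 16)
    count = 0
    tail = b""  # partial last line carried over between chunks
    async for chunk in stream:
        data = tail + decompressor.decompress(chunk)
        *lines, tail = data.split(b"\n")
        count += sum(line.startswith(b">") for line in lines)
    # Handle whatever remains after the final chunk.
    data = tail + decompressor.flush()
    count += sum(line.startswith(b">") for line in data.split(b"\n"))
    return count


async def fake_stream(payload: bytes, size: int = 7) -> AsyncIterator[bytes]:
    """Hypothetical stand-in for a download stream: yields small byte chunks."""
    for start in range(0, len(payload), size):
        yield payload[start:start + size]


sample = gzip.compress(b">seq1\nACGT\n>seq2\nTTGA\n")
records = asyncio.run(count_fasta_records(fake_stream(sample)))
```

Because only one partial line is held in memory at a time, the consumer's footprint stays small regardless of the archive size.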
⚙️ CLI Options
| Option | Shortcut | Default | Description |
|---|---|---|---|
| URLS | - | Required | One or multiple URLs to download (separated by space). |
| --threads | -t | 1 | Number of concurrent connections. |
| --output | -o | download/ | Directory to save files and .state.json trackers. |
| --stream | -s | False | Enable streaming mode (redirects data to stdout). |
| --no-ui | -nu | False | Disables progress bars, leaves plain text logs. |
| --quiet | -q | False | Dead silence. No console output at all (for strict pipelines). |
| --md5 | | None | Expected MD5 hash (works only if a single URL is provided). |
| --buffer | -b | threads * 5MB | Maximum stream buffer size in bytes. |
🧠 Under the Hood (Architecture)
For those interested in System Design, this tool implements several advanced engineering patterns:
- Network Congestion Control (AIMD): Implements an Additive Increase / Multiplicative Decrease algorithm and a Circuit Breaker pattern to dynamically scale requests, preventing IP bans and mitigating "Thundering Herd" problems.
- Out-of-Order Execution to Sequential Stream: Uses asyncio.PriorityQueue for LIFO retry handling and heapq as a sliding reordering buffer to convert chaotic concurrent HTTP ranges into a strict sequential byte stream.
- Zero-Overhead UI Debouncing: Uses a detached asynchronous refresh loop and a defaultdict buffer to batch terminal rendering operations, ensuring CPU load stays near 0% even at speeds of 500+ MB/s.
- Crash-Proof State Persistence: Uses NamedTemporaryFile and POSIX directory fsync to guarantee atomic state saves. If power is lost mid-save, the state file never corrupts.
- Smart File Discovery: Implements RFC 5987 parsing for Content-Disposition headers to extract complex UTF-8 filenames, falling back to URL parsing and mimetype guessing.
- Graceful Shutdown & Fail-Fast: Intercepts SIGINT/SIGTERM, safely flushes queues with Poison Pills (-1), and instantly cancels all worker tasks for a specific file if a fatal HTTP 404/403 is encountered.
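The state persistence pattern described above (temporary file, fsync, atomic rename, directory fsync) can be sketched like this. It is a generic POSIX recipe under stated assumptions, not HydraStream's code: the function name, the `.tmp` suffix, and the state fields `chunks_done` and `offset` are hypothetical.

```python
import json
import os
import tempfile


def save_state_atomically(path: str, state: dict) -> None:
    """Write `state` as JSON so readers see either the old file or the new one."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live in the target directory so the rename stays
    # on one filesystem, which is what makes os.replace atomic.
    tmp = tempfile.NamedTemporaryFile("w", dir=directory, suffix=".tmp", delete=False)
    try:
        with tmp:
            json.dump(state, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # push file contents to stable storage
        os.replace(tmp.name, path)  # atomic swap of old and new state
    except BaseException:
        if os.path.exists(tmp.name):
            os.unlink(tmp.name)  # do not leave a stray temp file behind
        raise
    # fsync the directory so the rename itself survives a power loss (POSIX only).
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)


with tempfile.TemporaryDirectory() as workdir:
    state_path = os.path.join(workdir, "genome.fna.gz.state.json")
    save_state_atomically(state_path, {"chunks_done": [0, 1], "offset": 1048576})
    with open(state_path) as handle:
        restored = json.load(handle)
    leftovers = [name for name in os.listdir(workdir) if name.endswith(".tmp")]
```

Opening a directory with `os.open` works on Linux and macOS; a Windows port would need to skip or replace the directory fsync step.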
🗺️ Roadmap
The journey of the Hydra has just begun. Here is what is planned for the future:
- v1.1: Autonomous Worker Scaling (AIMD)
  - Evolve the static thread pool into an adaptive concurrency manager. The system will dynamically spawn or kill download workers based on real-time network health and downstream pipeline backpressure. No more manual --threads tuning: the Hydra will automatically grow or shed heads to match your system's optimal capacity.
- v2.0: Rewrite It In Rust (RIIR) 🦀
  - Port the core engine to Rust using tokio and reqwest to bypass the Python GIL. This will enable true multi-core execution, zero-cost abstractions, and bare-metal performance for hashing and I/O, while maintaining a Python wrapper (PyO3) for seamless Data Science / ML integration.
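For readers unfamiliar with AIMD, which the v1.1 item plans to apply to worker scaling, here is a minimal sketch of the rule itself. The `AIMDLimit` class, its default values, and the choice of which events count as congestion are assumptions for illustration only.

```python
class AIMDLimit:
    """Additive Increase / Multiplicative Decrease concurrency limit."""

    def __init__(
        self,
        initial: int = 4,
        minimum: int = 1,
        maximum: int = 64,
        increase: int = 1,
        decrease: float = 0.5,
    ) -> None:
        self.minimum = minimum
        self.maximum = maximum
        self.increase = increase
        self.decrease = decrease
        self.limit = max(minimum, min(maximum, initial))

    def on_success(self) -> None:
        # Additive increase: probe for more capacity one step at a time.
        self.limit = min(self.maximum, self.limit + self.increase)

    def on_congestion(self) -> None:
        # Multiplicative decrease: back off sharply on a timeout or an
        # HTTP 429/503 (which signals to use is an assumption of this sketch).
        self.limit = max(self.minimum, int(self.limit * self.decrease))


limiter = AIMDLimit(initial=8)
limiter.on_success()     # limit grows from 8 to 9
limiter.on_congestion()  # limit is cut to int(9 * 0.5) = 4
limiter.on_success()     # limit grows from 4 to 5
```

The asymmetry is the point of the design: capacity is regained slowly but surrendered quickly, which keeps many independent clients from overwhelming the same server.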
License
MIT License. Feel free to use, modify, and distribute.
File details
Details for the file hydrastream-1.0.0.tar.gz.
File metadata
- Download URL: hydrastream-1.0.0.tar.gz
- Upload date:
- Size: 26.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.15
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | df7d03d52e1164e61893e2943fbfae9b16d3e5f44d720012759ba80233b7465d |
| MD5 | 8e3492c67159fd9aaf85d7dcd054401f |
| BLAKE2b-256 | 051e0a31cff597fcbc632fe5484a42b9cf4f8441dc163a88c02fa91ba46d934c |
File details
Details for the file hydrastream-1.0.0-py3-none-any.whl.
File metadata
- Download URL: hydrastream-1.0.0-py3-none-any.whl
- Upload date:
- Size: 30.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.15
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3195ad648698a9ce6062e02595762e3e3e42dc3761802e7e626b64caacc5f44b |
| MD5 | 16f831153739dbdc77c4fb528da2ce42 |
| BLAKE2b-256 | 97c5313b98996aa9734b297fea75432c89669bcf57980ffe3ba098b6f99a1c1a |