🐉 HydraStream

A high-performance, fault-tolerant, and streaming-capable downloader for Big Data. Built with pure Python, uvloop, and httpx.

💡 The Problem vs. The Solution

The Problem: Downloading massive datasets (ML weights, DB dumps, genomic sequences) using standard tools like wget or curl is slow due to single-connection limits. Furthermore, processing these huge files usually requires saving them to disk first, creating severe I/O bottlenecks and requiring massive storage.

The Solution: HydraStream acts like a multi-headed beast. It uses HTTP/2 multiplexing and concurrent chunk downloading to saturate your bandwidth. Its killer feature is the Sequential Reordering Buffer, which turns out-of-order concurrent chunk downloads into a strict sequential byte stream, so you can pipe terabytes of data directly into other tools without writing them to disk first.
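
The reordering idea can be illustrated in a few lines. This is a minimal sketch, not HydraStream's actual code: it assumes each chunk arrives tagged with a zero-based index, and the `reorder` function name is hypothetical.

```python
import heapq
from typing import Iterable, Iterator, Tuple


def reorder(chunks: Iterable[Tuple[int, bytes]]) -> Iterator[bytes]:
    """Yield chunk payloads in index order, whatever order they arrive in."""
    pending: list[Tuple[int, bytes]] = []  # min-heap keyed by chunk index
    next_index = 0  # index of the next chunk the consumer is waiting for
    for index, data in chunks:
        heapq.heappush(pending, (index, data))
        # Release every buffered chunk that is now contiguous with the output.
        while pending and pending[0][0] == next_index:
            yield heapq.heappop(pending)[1]
            next_index += 1


# Chunks 2, 0, 1, 3 arrive out of order but are emitted sequentially.
arrivals = [(2, b"cc"), (0, b"aa"), (1, b"bb"), (3, b"dd")]
print(b"".join(reorder(arrivals)))
```

The heap only holds chunks that arrived ahead of the one currently needed, so memory use depends on how far out of order the downloads complete.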

✨ Key Features

🚀 Maximized Throughput: Concurrent chunk downloading using uvloop.
🌊 True In-Memory Streaming: Downloads chunks asynchronously but yields them sequentially. Pipe data directly into parsers or Unix tools (zero disk I/O).
🛡️ Bulletproof Reliability (The Hydra):
- AIMD Rate Limiting & Circuit Breaker to prevent IP bans.
- Exponential Backoff + Full Jitter for network drops.
- Partial Chunk Commits: If a connection drops, it saves the exact byte offset. You never lose progress.
🎯 The "Unplug" Challenge: We dare you to break it. Start a massive download, disable your Wi-Fi, close your laptop lid, or send a SIGSTOP to the process. Wait 10 minutes. Turn it back on. HydraStream won't crash. It will patiently wait, auto-recover, and resume from the exact byte where it left off.
🧩 Smart Integrity Validation: Automatically extracts and verifies MD5 checksums from AWS S3 (ETag), Google Cloud (x-goog-hash), standard HTTP headers, and NCBI provider files.
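💾 Atomic Writes: Uses low-level os.pwrite, which combines seek and write into one system call, so concurrent chunk writes never share a file position. CPython releases the Global Interpreter Lock (GIL) for the duration of the call.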
📊 Adaptive UI:
- Default: Beautiful, dynamic terminal UI powered by Rich (gradients, global ETA).
- -nu / --no-ui: Plain text logs for CI/CD environments.
- -q / --quiet: Strict POSIX compliance (stderr for logs, stdout for data streams).
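
Exponential backoff with full jitter, listed among the reliability features above, is commonly computed as a uniform random delay between zero and a capped exponential ceiling. The sketch below shows that general technique; the function name, `base`, and `cap` values are assumptions for illustration, not HydraStream's actual parameters.

```python
import random
from typing import Optional


def full_jitter_delay(
    attempt: int,
    base: float = 0.5,
    cap: float = 60.0,
    rng: Optional[random.Random] = None,
) -> float:
    """Return a retry delay drawn uniformly from [0, min(cap, base * 2**attempt)]."""
    rng = rng if rng is not None else random.Random()
    ceiling = min(cap, base * (2 ** attempt))  # exponential growth, capped
    return rng.uniform(0.0, ceiling)  # "full" jitter: anywhere below the ceiling


for attempt in range(5):
    print(f"attempt {attempt}: sleep {full_jitter_delay(attempt):.2f}s")
```

Randomizing the whole interval, rather than adding a small offset to a fixed delay, spreads retries from many workers so they do not hit the server at the same moment.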

🛠 Installation

Requires Python 3.11+. Install globally using uv (recommended) or pipx:

uv tool install git+https://github.com/Zhukovetski/HydraStream.git

pipx install git+https://github.com/Zhukovetski/HydraStream.git

You can use hydrastream, hstream, or simply hs to run the tool from anywhere in your system:

hs "https://ftp.ncbi.nlm.nih.gov/.../genome.fna.gz" -t 20 --output ./data

🚀 Usage

1. Basic Download (Disk Mode)

Download a file using 20 concurrent connections:

hs "https://ftp.ncbi.nlm.nih.gov/.../genome.fna.gz" -t 20 --output ./data
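
Concurrent connections usually work by requesting different byte ranges of the same file. As a rough sketch of that idea (how HydraStream actually sizes its chunks is not documented here, so `split_ranges` and its `chunk_size` parameter are hypothetical), a file can be divided into inclusive ranges for the standard HTTP `Range` header:

```python
def split_ranges(total_size: int, chunk_size: int) -> list[tuple[int, int]]:
    """Split [0, total_size) into inclusive (start, end) byte ranges."""
    if total_size < 0 or chunk_size <= 0:
        raise ValueError("total_size must be >= 0 and chunk_size must be > 0")
    return [
        (start, min(start + chunk_size, total_size) - 1)
        for start in range(0, total_size, chunk_size)
    ]


def range_header(start: int, end: int) -> dict[str, str]:
    """Build the HTTP Range header for one chunk (both ends are inclusive)."""
    return {"Range": f"bytes={start}-{end}"}


for start, end in split_ranges(total_size=10, chunk_size=4):
    print(range_header(start, end))
```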

If interrupted, rerun the same command to resume from the last saved byte.
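
Resuming only works if the saved state survives an interruption intact. The sketch below shows one common way to save a state file atomically with a temporary file, `fsync`, and `os.replace`. The function name and the `{"offset": ...}` layout are hypothetical and are not taken from HydraStream's `.state.json` format.

```python
import json
import os
import tempfile
from pathlib import Path


def save_state_atomically(path: Path, state: dict) -> None:
    """Write state as JSON so readers see either the old file or the new one."""
    directory = path.parent
    # The temp file lives in the target directory so os.replace stays on one filesystem.
    with tempfile.NamedTemporaryFile(
        "w", dir=directory, suffix=".tmp", delete=False
    ) as tmp:
        json.dump(state, tmp)
        tmp.flush()
        os.fsync(tmp.fileno())  # force file contents to disk before the rename
        tmp_name = tmp.name
    os.replace(tmp_name, path)  # atomic rename over the previous state file
    if os.name == "posix":
        # Persist the rename itself by syncing the containing directory.
        dir_fd = os.open(directory, os.O_RDONLY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)


with tempfile.TemporaryDirectory() as workdir:
    state_path = Path(workdir) / "genome.state.json"
    save_state_atomically(state_path, {"offset": 1024})
    print(state_path.read_text())
```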

2. Unix Pipeline Streaming (The Killer Feature) 💥

Download a compressed 100 GB file, decompress it in memory, and process it without ever saving the archive to disk:

hs "https://ftp.ncbi.nlm.nih.gov/.../genome.fna.gz" -t 20 --stream --quiet | zcat | grep -c "^>"

3. Use as a Python Library (For Data Science / MLOps)

Embed the streaming engine directly into your PyTorch/Pandas data loaders:

import asyncio
from hydrastream import HydraStream


def process_data(chunk: bytes) -> None:
    """Placeholder: replace with your parser, ML model, or decompressor."""


async def main():
    urls = ["https://url1.gz", "https://url2.gz"]

    async with HydraStream(threads=10, quiet=True) as loader:
        # stream_all yields (filename, stream) pairs
        async for filename, stream in loader.stream_all(urls):
            print(f"Processing {filename}...")
            async for chunk_bytes in stream:
                # Feed raw bytes to your parser, ML model, or decompressor
                process_data(chunk_bytes)

asyncio.run(main())
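
Because the chunks arrive as raw bytes in file order, a `.gz` stream can be decompressed incrementally without buffering the whole archive. The following is a minimal sketch using only the standard library; `gunzip_stream` is a hypothetical helper, and it assumes a single-member gzip file (concatenated gzip members would need extra handling).

```python
import gzip
import zlib
from typing import Iterable, Iterator


def gunzip_stream(chunks: Iterable[bytes]) -> Iterator[bytes]:
    """Decompress a gzip byte stream piece by piece."""
    # wbits = 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(wbits=16 + zlib.MAX_WBITS)
    for chunk in chunks:
        data = decompressor.decompress(chunk)
        if data:
            yield data
    tail = decompressor.flush()  # emit anything still buffered
    if tail:
        yield tail


# Simulate a download delivered in 64-byte pieces.
payload = b">seq1\nACGT\n" * 1000
compressed = gzip.compress(payload)
pieces = [compressed[i:i + 64] for i in range(0, len(compressed), 64)]
assert b"".join(gunzip_stream(pieces)) == payload
```

Inside the `async for chunk_bytes in stream` loop, the same `decompressor.decompress(chunk_bytes)` call can be applied to each chunk as it arrives.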

⚙️ CLI Options

| Option | Shortcut | Default | Description |
| --- | --- | --- | --- |
| `URLS` | - | Required | One or more URLs to download (separated by spaces). |
| `--threads` | `-t` | `1` | Number of concurrent connections. |
| `--output` | `-o` | `download/` | Directory to save files and `.state.json` trackers. |
| `--stream` | `-s` | `False` | Enable streaming mode (redirects data to `stdout`). |
| `--no-ui` | `-nu` | `False` | Disables progress bars, leaves plain text logs. |
| `--quiet` | `-q` | `False` | Dead silence. No console output at all (for strict pipelines). |
| `--md5` |  | `None` | Expected MD5 hash (works only if a single URL is provided). |
| `--buffer` | `-b` | `threads * 5MB` | Maximum stream buffer size in bytes. |

🧠 Under the Hood (Architecture)

For those interested in System Design, this tool implements several advanced engineering patterns:

Network Congestion Control (AIMD): Implements an Additive Increase / Multiplicative Decrease algorithm and a Circuit Breaker pattern to dynamically scale requests, preventing IP bans and mitigating "Thundering Herd" problems.
Out-of-Order Execution to Sequential Stream: Uses asyncio.PriorityQueue for priority-based retry handling and heapq as a sliding reordering buffer that converts out-of-order concurrent HTTP range responses into a strict sequential byte stream.
Zero-Overhead UI Debouncing: Uses a detached asynchronous refresh loop and a defaultdict buffer to batch terminal rendering operations, ensuring CPU load stays near 0% even at speeds of 500+ MB/s.
Crash-Proof State Persistence: Uses NamedTemporaryFile and POSIX directory fsync to guarantee atomic state saves. If power is lost mid-save, the state file is never left corrupted.
Smart File Discovery: Implements RFC 5987 parsing for Content-Disposition headers to extract complex UTF-8 filenames, falling back to URL parsing and mimetype guessing.
Graceful Shutdown & Fail-Fast: Intercepts SIGINT/SIGTERM, safely flushes queues with Poison Pills (-1), and instantly cancels all worker tasks for a specific file if a fatal HTTP 404/403 is encountered.
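
To make the AIMD pattern from the list above concrete, here is a minimal sketch of an additive-increase / multiplicative-decrease limit. The `AimdLimiter` class, its field names, and its default values are assumptions for illustration and are not HydraStream's internals.

```python
from dataclasses import dataclass


@dataclass
class AimdLimiter:
    """Track an allowed request rate that grows slowly and shrinks quickly."""

    limit: float = 1.0       # current allowed concurrency or request rate
    min_limit: float = 1.0   # never drop below this floor
    max_limit: float = 64.0  # never grow beyond this ceiling
    increase: float = 1.0    # additive step applied on success
    decrease: float = 0.5    # multiplicative factor applied on throttling

    def on_success(self) -> None:
        """Additive increase: probe for more capacity one step at a time."""
        self.limit = min(self.max_limit, self.limit + self.increase)

    def on_throttle(self) -> None:
        """Multiplicative decrease: back off sharply, e.g. after HTTP 429 or 503."""
        self.limit = max(self.min_limit, self.limit * self.decrease)


limiter = AimdLimiter(limit=4.0)
limiter.on_success()   # 4.0 -> 5.0
limiter.on_throttle()  # 5.0 -> 2.5
print(limiter.limit)
```

The asymmetry is the point of the design: capacity is reclaimed gradually, while any sign of server pushback cuts the load at once.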

🗺️ Roadmap

The journey of the Hydra has just begun. Here is what is planned for the future:

v1.1: Autonomous Worker Scaling (AIMD)
- Evolve the static thread pool into an adaptive concurrency manager. The system will dynamically spawn or kill download workers based on real-time network health and downstream pipeline backpressure. No more manual --threads tuning—the Hydra will automatically grow or shed heads to match your system's optimal capacity.
v2.0: Rewrite It In Rust (RIIR) 🦀
- Port the core engine to Rust using tokio and reqwest to bypass the Python GIL. This will enable true multi-core execution, zero-cost abstractions, and bare-metal performance for hashing and I/O, while maintaining a Python wrapper (PyO3) for seamless Data Science / ML integration.

License

MIT License. Feel free to use, modify, and distribute.

Release history

1.1.0: Mar 20, 2026
1.0.0: Mar 12, 2026
