hctef

Python library with helper classes to read files over HTTP using Range requests, with caching.

Overview

hctef provides a file-like interface for reading files over HTTP/HTTPS, using HTTP Range requests to fetch only the data you need. It includes intelligent caching to minimize network requests and supports both synchronous and asynchronous operations.

Features

  • File-like API: Works like a regular Python file object with read(), seek(), and tell() methods
  • Efficient Range Requests: Fetches only the data you need using HTTP Range headers
  • Intelligent Caching: Uses an interval tree to track cached byte ranges and minimize redundant requests
  • Prefetching: Optionally prefetch data from the start or end of the file
  • Sync and Async: Both synchronous and asynchronous implementations available
  • Context Manager Support: Works with the with statement for automatic cleanup

Installation

pip install hctef

To include async support:

pip install hctef[async]

Quick Start

Synchronous Usage

from hctef import HttpFile

url = "https://example.com/large-file.bin"

with HttpFile(url) as f:
    # Read first 100 bytes
    data = f.read(100)

    # Seek to a specific position
    f.seek(1000)

    # Read from current position
    more_data = f.read(50)

    # Get current position
    position = f.tell()

    # Seek relative to the end of the file (whence=2, i.e. os.SEEK_END)
    f.seek(-100, 2)

Asynchronous Usage

The async implementation supports independent cursors for concurrent reads:

import asyncio
from hctef.aio import AsyncHttpFile

url = "https://example.com/large-file.bin"

async with AsyncHttpFile(url) as f:
    # Read first 100 bytes
    data = await f.read(100)

    # Seek to a specific position (synchronous - no I/O)
    f.seek(1000)

    # Read from current position
    more_data = await f.read(50)

Parallel Reads with Multiple Cursors

Create independent cursors to read from different positions concurrently:

import asyncio
from hctef.aio import AsyncHttpFile

url = "https://example.com/large-file.bin"

async with AsyncHttpFile(url) as f:
    # Create independent cursors for parallel reading
    cursor1 = f.clone()
    cursor2 = f.clone()

    # Position each cursor at different locations
    f.seek(0)
    cursor1.seek(1000)
    cursor2.seek(2000)

    # Read from all three positions in parallel
    # All cursors share the same cache and HTTP session
    results = await asyncio.gather(
        f.read(100),        # Read bytes 0-99
        cursor1.read(100),  # Read bytes 1000-1099
        cursor2.read(100),  # Read bytes 2000-2099
    )

    # Each cursor maintains independent position
    print(f.tell())        # 100
    print(cursor1.tell())  # 1100
    print(cursor2.tell())  # 2100

Cursors are lightweight and share:

  • HTTP session (connection pooling)
  • Byte range cache (deduplication of overlapping requests)
  • File metadata
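
Conceptually, clone() hands back a new cursor that points at the same shared state while tracking its own position. A minimal sketch of those semantics (the Cursor class below is illustrative only, not hctef's implementation):

```python
class Cursor:
    def __init__(self, cache=None):
        # Clones pass in the same dict, so fetched data is shared.
        self.cache = cache if cache is not None else {}
        # Position is per-cursor, never shared.
        self.pos = 0

    def clone(self):
        return Cursor(cache=self.cache)

    def seek(self, pos):
        self.pos = pos

f = Cursor()
c1 = f.clone()
c1.seek(1000)
f.cache[0] = b"header"  # cached via the original cursor...
print(c1.cache[0])      # ...visible through the clone: b'header'
print(f.pos, c1.pos)    # 0 1000 -- positions stay independent
```

Because only the position is per-cursor, creating many cursors costs almost nothing while every byte fetched by one benefits all of them.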

Configuration Options

Both HttpFile and AsyncHttpFile accept the following parameters:

HttpFile(
    url,
    minimum_range_request_bytes=8192,  # Minimum bytes per request (default: 8KB)
    prefetch_bytes=1048576,             # Bytes to prefetch on open (default: 1MB)
    prefetch_direction='END'            # 'START' or 'END' (default: 'END')
)
  • minimum_range_request_bytes: The minimum number of bytes to request in a single HTTP Range request (except when filling small cache gaps)
  • prefetch_bytes: How many bytes to fetch immediately when opening the file. Set to 0 to disable prefetching
  • prefetch_direction: Whether to prefetch from the start ('START') or end ('END') of the file
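
To make the effect of minimum_range_request_bytes concrete, here is one plausible way a small read could be widened to the minimum request size, clamped to the file length. This is an illustrative sketch, not necessarily hctef's exact coalescing logic:

```python
def expand_range(start: int, length: int, minimum: int, file_size: int) -> tuple[int, int]:
    """Grow a read of `length` bytes at `start` to at least `minimum`
    bytes, clamped so the range never extends past the end of the file."""
    end = min(start + max(length, minimum), file_size)
    return start, end

# A 100-byte read with an 8 KiB minimum becomes an 8 KiB request:
print(expand_range(5000, 100, 8192, 10_000_000))  # (5000, 13192)
```

Requesting more than strictly needed trades a little bandwidth for far fewer round trips when reads cluster together.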

Requirements

  • Python 3.12 or higher
  • HTTP server must support Range requests
  • For async: aiohttp>=3.13.0
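
A server advertises Range support with the Accept-Ranges response header. The helper below is a hypothetical pre-flight check, not part of hctef; note that some servers honor Range requests without advertising the header, so a 206 Partial Content response is the more reliable signal:

```python
def supports_ranges(headers: dict[str, str]) -> bool:
    """Check whether response headers advertise byte-range support.

    Header names are case-insensitive; a value of "none" explicitly
    disables ranges.
    """
    value = {k.lower(): v for k, v in headers.items()}.get("accept-ranges", "")
    return value.strip().lower() == "bytes"

print(supports_ranges({"Accept-Ranges": "bytes"}))  # True
print(supports_ranges({"Accept-Ranges": "none"}))   # False
```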

How It Works

When you open an HTTP file, hctef:

  1. Sends an initial Range request to determine the file size and verify Range support
  2. Optionally prefetches data from the start or end of the file
  3. Maintains an in-memory cache of fetched byte ranges (not suitable for downloading complete large files)
  4. On read(), checks the cache first and only fetches missing data from the server
  5. Combines multiple small requests into larger ones based on minimum_range_request_bytes

This approach minimizes HTTP requests while providing efficient random access to remote files.
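
Step 4 can be modeled simply: given the byte ranges already cached, compute the gaps that still need to be fetched. hctef tracks ranges with an interval tree; this sketch uses a plain sorted list for clarity and is not the library's actual code:

```python
def missing_gaps(cached: list[tuple[int, int]], start: int, end: int) -> list[tuple[int, int]]:
    """Return the sub-ranges of [start, end) not covered by any
    (start, end) interval in `cached`."""
    gaps = []
    pos = start
    for c_start, c_end in sorted(cached):
        if c_end <= pos:      # cached range is entirely before our position
            continue
        if c_start >= end:    # cached range is entirely past the request
            break
        if c_start > pos:     # uncovered gap before this cached range
            gaps.append((pos, c_start))
        pos = max(pos, c_end)
    if pos < end:             # uncovered tail after the last cached range
        gaps.append((pos, end))
    return gaps

# Bytes 0-100 and 500-600 are cached; a read of 50-550 fetches only 100-500:
print(missing_gaps([(0, 100), (500, 600)], 50, 550))  # [(100, 500)]
```

Each gap then becomes (at most) one Range request, widened as needed by minimum_range_request_bytes.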

Error Handling

hctef defines custom exceptions:

  • HctefError: Base exception class
  • HctefNetworkError: Raised for network-related errors (inherits from IOError)
  • HctefUrlError: Raised for invalid URLs (inherits from ValueError)

from hctef import HttpFile
from hctef.exceptions import HctefNetworkError, HctefUrlError

try:
    with HttpFile("https://example.com/file.bin") as f:
        data = f.read(100)
except HctefNetworkError as e:
    print(f"Network error: {e}")
except HctefUrlError as e:
    print(f"Invalid URL: {e}")

Development

To set up for development:

# Clone the repository
git clone https://github.com/jkeifer/hctef
cd hctef

# Install dependencies
uv sync --all-extras --dev

# Setup pre-commit
pre-commit install

# Run tests
pytest

# Run all checks with pre-commit
pre-commit run --all-files

Future Ideas

  • Consolidate sync/async implementations
  • Allow an uncached "cursor" for reading a large file segment
  • Cursors with separate caches (to allow clearing memory when done)
    • would allow cursor-based access with non-async implementation

License

Apache License 2.0

What is hctef?

It's the HTTP Client That Eats Files, obviously.
