Helper classes to read files over HTTP using Range requests, with caching

hctef

Python library with helper classes to read files over HTTP using Range requests, with caching.

Overview

hctef provides a file-like interface for reading files over HTTP/HTTPS, using HTTP Range requests to fetch only the data you need. It includes intelligent caching to minimize network requests and supports both synchronous and asynchronous operations.

Features

  • File-like API: Works like a regular Python file object with read(), seek(), and tell() methods
  • Efficient Range Requests: Fetches only the data you need using HTTP Range headers
  • Intelligent Caching: Uses an interval tree to track cached byte ranges and minimize redundant requests
  • Prefetching: Optionally prefetch data from the start or end of the file
  • Sync and Async: Both synchronous and asynchronous implementations available
  • Context Manager Support: Use in with statements for automatic cleanup

Installation

pip install hctef

To include async support:

pip install "hctef[async]"

Quick Start

Synchronous Usage

from hctef import HttpFile

url = "https://example.com/large-file.bin"

with HttpFile(url) as f:
    # Read first 100 bytes
    data = f.read(100)

    # Seek to a specific position
    f.seek(1000)

    # Read from current position
    more_data = f.read(50)

    # Get current position
    position = f.tell()

    # Seek relative to end of file
    f.seek(-100, 2)
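
The second argument in the last call is the standard whence value, where 2 equals os.SEEK_END. The sketch below uses an in-memory io.BytesIO as a stand-in for a remote file to show how the positions resolve. It assumes HttpFile follows the same io conventions, which the example above suggests but this snippet does not verify against the library.

```python
import io
import os

# In-memory stand-in for a remote file: 2000 bytes of repeating 0..255 values.
payload = bytes(i % 256 for i in range(2000))
f = io.BytesIO(payload)

first = f.read(100)          # bytes 0..99
f.seek(1000)                 # absolute position (whence defaults to os.SEEK_SET)
middle = f.read(50)          # bytes 1000..1049
position = f.tell()          # position is now just past the last byte read

f.seek(-100, os.SEEK_END)    # equivalent to f.seek(-100, 2)
tail_start = f.tell()        # 100 bytes before the end of the file
tail = f.read()              # the final 100 bytes
```

Using os.SEEK_END instead of the literal 2 tends to make the intent easier to read.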

Asynchronous Usage

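The async implementation exposes the same interface, with read() awaited and seek() left synchronous. It also supports independent cursors for concurrent reads, covered in the next section: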

import asyncio
from hctef.aio import AsyncHttpFile

url = "https://example.com/large-file.bin"


async def main():
    async with AsyncHttpFile(url) as f:
        # Read first 100 bytes
        data = await f.read(100)

        # Seek to a specific position (synchronous - no I/O)
        f.seek(1000)

        # Read from current position
        more_data = await f.read(50)


# "async with" is only valid inside a coroutine, so run it via asyncio.run()
asyncio.run(main())

Parallel Reads with Multiple Cursors

Create independent cursors to read from different positions concurrently:

import asyncio
from hctef.aio import AsyncHttpFile

url = "https://example.com/large-file.bin"


async def main():
    async with AsyncHttpFile(url) as f:
        # Create independent cursors for parallel reading
        cursor1 = f.clone()
        cursor2 = f.clone()

        # Position each cursor at different locations
        f.seek(0)
        cursor1.seek(1000)
        cursor2.seek(2000)

        # Read from all three positions in parallel
        # All cursors share the same cache and HTTP session
        results = await asyncio.gather(
            f.read(100),        # Read bytes 0-99
            cursor1.read(100),  # Read bytes 1000-1099
            cursor2.read(100),  # Read bytes 2000-2099
        )

        # Each cursor maintains independent position
        print(f.tell())        # 100
        print(cursor1.tell())  # 1100
        print(cursor2.tell())  # 2100


# "async with" is only valid inside a coroutine, so run it via asyncio.run()
asyncio.run(main())

Cursors are lightweight and share:

  • HTTP session (connection pooling)
  • Byte range cache (deduplication of overlapping requests)
  • File metadata
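
One way to picture the split between per-cursor and shared state is the toy model below. SharedSource and Cursor are hypothetical names invented for this sketch, not hctef classes, and the byte source is an in-memory buffer rather than an HTTP session. It only illustrates that each cursor keeps its own position while all cursors delegate to one shared object.

```python
import asyncio


class SharedSource:
    """Hypothetical stand-in for the state that all cursors share."""

    def __init__(self, data: bytes) -> None:
        self.data = data
        self.fetches = 0  # counts simulated round trips

    async def fetch(self, start: int, length: int) -> bytes:
        self.fetches += 1
        await asyncio.sleep(0)  # yield to the event loop, as real I/O would
        return self.data[start:start + length]


class Cursor:
    """Hypothetical cursor: a private position plus a shared source."""

    def __init__(self, source: SharedSource, position: int = 0) -> None:
        self.source = source
        self.position = position

    def clone(self) -> "Cursor":
        # The clone points at the same source but tracks its own position.
        return Cursor(self.source, self.position)

    def seek(self, offset: int) -> int:
        self.position = offset
        return self.position

    def tell(self) -> int:
        return self.position

    async def read(self, length: int) -> bytes:
        chunk = await self.source.fetch(self.position, length)
        self.position += len(chunk)
        return chunk


async def demo() -> list[int]:
    f = Cursor(SharedSource(bytes(3000)))
    c1, c2 = f.clone(), f.clone()
    c1.seek(1000)
    c2.seek(2000)
    await asyncio.gather(f.read(100), c1.read(100), c2.read(100))
    return [f.tell(), c1.tell(), c2.tell()]


positions = asyncio.run(demo())
```

Because each position lives on the cursor, concurrent reads through different cursors cannot disturb one another's offsets.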

Configuration Options

Both HttpFile and AsyncHttpFile accept the following parameters:

HttpFile(
    url,
    minimum_range_request_bytes=8192,  # Minimum bytes per request (default: 8 KiB)
    prefetch_bytes=1048576,            # Bytes to prefetch on open (default: 1 MiB)
    prefetch_direction='END',          # 'START' or 'END' (default: 'END')
)

  • minimum_range_request_bytes: The minimum number of bytes to request in a single HTTP Range request (except when filling small cache gaps)
  • prefetch_bytes: How many bytes to fetch immediately when opening the file. Set to 0 to disable prefetching
  • prefetch_direction: Whether to prefetch from the start ('START') or end ('END') of the file

Requirements

  • Python 3.12 or higher
  • HTTP server must support Range requests
  • For async: aiohttp>=3.13.0

How It Works

When you open an HTTP file, hctef:

  1. Sends an initial Range request to determine the file size and verify Range support
  2. Optionally prefetches data from the start or end of the file
  3. Maintains an in-memory cache of fetched byte ranges (not suitable for downloading complete large files)
  4. On read(), checks the cache first and only fetches missing data from the server
  5. Combines multiple small requests into larger ones based on minimum_range_request_bytes

This approach minimizes HTTP requests while providing efficient random access to remote files.
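
The cache lookup in step 4 can be sketched as a gap search over the byte ranges already held. For brevity this example keeps the ranges in a plain sorted list rather than the interval tree hctef is described as using, so it illustrates the idea and not the library's data structure. The names missing_ranges and add_range are hypothetical.

```python
Interval = tuple[int, int]  # half-open: (start, end) covers start..end-1


def missing_ranges(cached: list[Interval], start: int, end: int) -> list[Interval]:
    """Return the parts of [start, end) not covered by `cached`.

    `cached` must be sorted and non-overlapping.
    """
    gaps: list[Interval] = []
    cursor = start
    for c_start, c_end in cached:
        if c_end <= cursor:
            continue  # entirely before the region still needed
        if c_start >= end:
            break  # entirely after the requested region
        if c_start > cursor:
            gaps.append((cursor, c_start))
        cursor = max(cursor, c_end)
        if cursor >= end:
            break
    if cursor < end:
        gaps.append((cursor, end))
    return gaps


def add_range(cached: list[Interval], start: int, end: int) -> list[Interval]:
    """Record [start, end) as cached, merging overlapping or adjacent ranges."""
    merged: list[Interval] = []
    for c_start, c_end in sorted(cached + [(start, end)]):
        if merged and c_start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], c_end))
        else:
            merged.append((c_start, c_end))
    return merged


cached = [(0, 100), (200, 300)]
gaps = missing_ranges(cached, 50, 350)   # only these need a network fetch
cached = add_range(cached, 100, 200)     # filling the hole merges everything
```

Only the gaps would be requested from the server. A real implementation would additionally widen or coalesce them according to minimum_range_request_bytes.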

Error Handling

hctef defines custom exceptions:

  • HctefError: Base exception class
  • HctefNetworkError: Raised for network-related errors (inherits from IOError)
  • HctefUrlError: Raised for invalid URLs (inherits from ValueError)

from hctef import HttpFile
from hctef.exceptions import HctefNetworkError, HctefUrlError

try:
    with HttpFile("https://example.com/file.bin") as f:
        data = f.read(100)
except HctefNetworkError as e:
    print(f"Network error: {e}")
except HctefUrlError as e:
    print(f"Invalid URL: {e}")
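
Because the exceptions also inherit from built-in types, handlers written for those built-ins will catch them too. The stand-in classes below mirror the hierarchy described above to demonstrate this. They are not the library's own definitions, and the assumption that each specific error also derives from the base class is inferred from "Base exception class" rather than confirmed.

```python
class SketchError(Exception):
    """Stand-in for the base exception class."""


class SketchNetworkError(SketchError, IOError):
    """Stand-in for the network error (IOError is an alias of OSError)."""


class SketchUrlError(SketchError, ValueError):
    """Stand-in for the invalid-URL error."""


def handler_for(exc: Exception) -> str:
    """Report which built-in handler would catch `exc`."""
    try:
        raise exc
    except OSError:
        return "OSError"
    except ValueError:
        return "ValueError"
    except Exception:
        return "Exception"


network_handler = handler_for(SketchNetworkError("connection reset"))
url_handler = handler_for(SketchUrlError("missing scheme"))
base_handler = handler_for(SketchError("generic failure"))
```

In practice this means existing code that already handles OSError around file access may keep working when a local file is swapped for a remote one.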

Development

To set up for development:

# Clone the repository
git clone https://github.com/jkeifer/hctef
cd hctef

# Install dependencies
uv sync --all-extras --dev

# Setup pre-commit
pre-commit install

# Run tests
pytest

# Run all checks with pre-commit
pre-commit run --all-files

Future Ideas

  • Consolidate sync/async implementations
  • Allow uncached "cursor" for reading a large file segment
  • Cursors with separate caches (to allow clearing memory when done)
    • would allow cursor-based access with non-async implementation

License

Apache License 2.0
