An efficient, secure, and deterministic TAR streaming engine.

TarTape

TarTape is a streaming engine designed to turn massive directories into deterministic TAR archives on-the-fly, without requiring intermediate storage.

It is purpose-built for cloud-native backups and large-scale data movement, where terabytes of data must stream directly to remote storage (S3, Azure, GCP). It removes the need for local disk space to hold a second copy of the data, and it can resume a failed upload from the exact byte at which it stopped.

Why is it useful?

  • Zero-Copy Streaming: Generates the TAR stream "in-flight" while transmitting. If your dataset is 100GB, you transmit it without staging an intermediate archive or using a single extra GB of local cache.
  • Byte-Level Resume: If a 500GB upload fails at 80%, you can restart the stream at the exact byte offset where it stopped, without re-scanning the source.
  • Logical Volume Slicing: Easily split a massive stream into fixed-size volumes (e.g., 5GB parts) to meet cloud provider upload limits, while maintaining a single valid TAR structure.
  • Stream Navigation: Jump to any file or offset within the resulting archive without having to process or read the preceding data.

Installation

pip install tartape

Usage Examples

1. Recording the Tape

Before streaming, you must "record" the directory state. This creates a lightweight index in .tartape/index.db.

import tartape

# Scan the dataset and generate the integrity catalog
tape = tartape.create("./massive_dataset")

print(f"Fingerprint: {tape.fingerprint}")
print(f"Total stream size: {tape.total_size} bytes")
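
To illustrate why a recorded index makes the total stream size known before any data is sent, here is a minimal standard-library sketch. It assumes a plain TAR layout (one 512-byte header per regular file, bodies padded to a multiple of 512 bytes, a 1024-byte footer) and ignores directories and extended headers. `plan_offsets` and `padded` are hypothetical helpers, not part of TarTape, and the tuple layout is not TarTape's index format.

```python
import os

BLOCK = 512  # TAR block size in bytes


def padded(size: int) -> int:
    """Round a file size up to the next multiple of 512 bytes."""
    return (size + BLOCK - 1) // BLOCK * BLOCK


def plan_offsets(root: str):
    """Walk `root` in sorted order and predict where each file starts in the stream.

    Hypothetical helper: regular files only, one 512-byte header per file.
    Returns (entries, total_size); each entry is
    (relative_path, start_offset, size, mtime_ns).
    """
    entries, offset = [], 0
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # in-place sort makes the traversal order deterministic
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            entries.append((os.path.relpath(path, root), offset, st.st_size, st.st_mtime_ns))
            offset += BLOCK + padded(st.st_size)  # header + padded body
    total_size = offset + 2 * BLOCK  # end-of-archive marker: two zero blocks
    return entries, total_size
```

Because every offset follows from file sizes alone, a catalog like this can be built with `os.stat` calls and no file reads.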

2. Direct Streaming (Single-file Upload)

If you don't need to split the archive, you can consume the byte stream directly.

import requests
import tartape

with tartape.open("./massive_dataset") as tape:
    # 'play' emits events. We filter for 'file_data' to get raw bytes.
    def data_generator():
        for event in tape.play():
            if event.type == "file_data":
                yield event.data

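    # Send the full TAR stream via HTTP without saving it to disk.
    # A generator body is sent with chunked transfer encoding.
    response = requests.put("https://storage.com/backup.tar", data=data_generator())
    response.raise_for_status()  # surface HTTP errors instead of ignoring them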
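
To make a later resume possible, the sender needs to know how many bytes it has handed off. The following is a minimal sketch of one way to track that; `ByteCounter` is a hypothetical wrapper, not part of TarTape or `requests`.

```python
from typing import Iterable, Iterator


class ByteCounter:
    """Wrap a chunk iterator and remember how many bytes have been consumed."""

    def __init__(self, chunks: Iterable[bytes]) -> None:
        self._chunks = chunks
        self.sent = 0

    def __iter__(self) -> Iterator[bytes]:
        for chunk in self._chunks:
            yield chunk
            # Runs only when the consumer asks for the next chunk, so the
            # count is conservative: it never includes a chunk in flight.
            self.sent += len(chunk)
```

Under these assumptions, passing `data=ByteCounter(data_generator())` to the HTTP client should leave `sent` holding a safe lower bound after a failure. Bytes handed to the client are not necessarily stored by the server, so confirm the offset with the remote side where possible.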

3. Volume Slicing (Cloud Slicing)

Ideal for services like AWS S3 or Azure Blob Storage that accept uploads in fixed-size parts.

import tartape

with tartape.open("./massive_dataset") as tape:
    # Split the stream into 1 GiB (1024**3 bytes) logical volumes
    for volume, manifest in tape.iter_volumes(size=1024**3):
        # 'volume' behaves like an open file (read, seek, tell).
        # 'upload_to_s3' is a placeholder for your own upload function.
        upload_to_s3(key=volume.name, body=volume)
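
The arithmetic behind fixed-size slicing can be sketched as follows. `volume_ranges` is a hypothetical helper shown only to illustrate how volume boundaries follow from the total stream size; it is not how TarTape necessarily computes them.

```python
def volume_ranges(total_size: int, volume_size: int) -> list[tuple[int, int]]:
    """Return half-open (start, end) byte ranges covering a stream of total_size bytes."""
    if volume_size <= 0:
        raise ValueError("volume_size must be positive")
    return [
        (start, min(start + volume_size, total_size))
        for start in range(0, total_size, volume_size)
    ]


# A 2,500-byte stream cut into 1,024-byte volumes: the last one is shorter.
print(volume_ranges(2500, 1024))  # [(0, 1024), (1024, 2048), (2048, 2500)]
```

Every volume except possibly the last has exactly `volume_size` bytes, so a volume's index determines its starting offset in the stream.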

4. Byte-Perfect Resume

If a transfer is interrupted, you can resume it from the exact byte where it left off.

import tartape

# Suppose logs indicate that 45,678,912 bytes were sent before the error
LAST_BYTE_SENT = 45678912

with tartape.open("./massive_dataset") as tape:
    # 'play' will instantly jump to the requested offset
    for event in tape.play(start_offset=LAST_BYTE_SENT):
        if event.type == "file_data":
            # 'sock' is an already-connected socket.socket instance;
            # sendall() keeps writing until the whole chunk is sent
            sock.sendall(event.data)
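
Jumping to an offset without reading preceding data is possible when the start offset of every entry is known in advance. A minimal sketch of such a lookup, using a binary search over a sorted list of start offsets, is shown below. `locate` and the `(name, start_offset)` tuples are illustrative assumptions, not TarTape's internal structures.

```python
from bisect import bisect_right


def locate(entries: list[tuple[str, int]], offset: int) -> tuple[str, int]:
    """Find the entry containing a stream offset.

    `entries` is a list of (name, start_offset) sorted by start_offset.
    Returns (name, bytes_into_entry); the entry's header bytes count too.
    """
    if offset < 0 or not entries:
        raise ValueError("offset is outside the stream")
    starts = [start for _, start in entries]
    index = bisect_right(starts, offset) - 1  # last entry starting at or before offset
    name, start = entries[index]
    return name, offset - start


index = [("a.bin", 0), ("b.bin", 1024), ("c.bin", 2560)]
print(locate(index, 1500))  # ('b.bin', 476)
```

The lookup costs O(log n) in the number of entries, regardless of how many bytes precede the requested offset.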

5. Integrity Verification

Check if local files have mutated (mtime or size) relative to the recorded index.

import tartape

with tartape.open("./massive_dataset") as tape:
    # 'verify' performs a random spot-check for quick detection.
    # Use verify(deep=True) for a full bit-by-bit audit of every file.
    try:
        tape.verify()
        print("Dataset is consistent with the index.")
    except Exception as e:
        print(f"Integrity compromised: {e}")
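
The idea of a quick spot-check on size and mtime can be sketched with the standard library alone. `spot_check` and the `recorded` mapping are hypothetical and only illustrate the comparison; they do not reflect how `verify` is implemented.

```python
import os
import random


def spot_check(root: str, recorded: dict, sample: int = 10, seed=None) -> list:
    """Compare a random sample of files against recorded (size, mtime_ns) pairs.

    `recorded` maps relative paths to (size, mtime_ns).
    Returns the sorted list of sampled paths that are missing or changed.
    """
    rng = random.Random(seed)
    paths = sorted(recorded)
    chosen = rng.sample(paths, min(sample, len(paths)))
    changed = []
    for rel in chosen:
        size, mtime_ns = recorded[rel]
        try:
            st = os.stat(os.path.join(root, rel))
        except FileNotFoundError:
            changed.append(rel)  # file was deleted after recording
            continue
        if st.st_size != size or st.st_mtime_ns != mtime_ns:
            changed.append(rel)
    return sorted(changed)
```

A check like this reads only metadata, which keeps it fast, but it cannot detect a change that preserves both size and mtime; that case calls for a full content audit.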

Observable Events

TarTape provides full visibility into the streaming process. Every chunk of data and every file transition is emitted as a structured event.

Event Type     | Description                                                          | Key Metadata Available
file_start     | Emitted before a file or directory enters the stream.               | entry (metadata), start_offset, resumed (boolean)
file_data      | Raw bytes belonging to the current file (header, body, or padding). | data (bytes)
file_end       | Emitted after a file is fully processed and closed.                 | entry, end_offset, md5sum (if not resumed)
tape_completed | Emitted after the 1024-byte TAR footer is sent.                     | -
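
As an illustration of consuming these events, the sketch below tallies a stream of event objects. It relies only on the `type` and `data` attributes listed in the table and uses `SimpleNamespace` stand-ins instead of real TarTape events; `summarize` is a hypothetical helper.

```python
from types import SimpleNamespace


def summarize(events) -> dict:
    """Count entries and data bytes, and note whether the tape completed."""
    entries, data_bytes, completed = 0, 0, False
    for event in events:
        if event.type == "file_start":
            entries += 1
        elif event.type == "file_data":
            data_bytes += len(event.data)
        elif event.type == "tape_completed":
            completed = True
    return {"entries": entries, "bytes": data_bytes, "completed": completed}


# Stand-in events for illustration; real ones would come from tape.play().
demo = [
    SimpleNamespace(type="file_start"),
    SimpleNamespace(type="file_data", data=b"\0" * 512),
    SimpleNamespace(type="file_data", data=b"hello"),
    SimpleNamespace(type="file_end"),
    SimpleNamespace(type="tape_completed"),
]
print(summarize(demo))  # {'entries': 1, 'bytes': 517, 'completed': True}
```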

Integrity Rules & Constraints

  • T0 State Consistency: If a file changes after it has been recorded, the engine will abort the stream to prevent generating a corrupt or mismatched archive.
  • Anonymization: User/Group IDs (UID/GID) are scrubbed by default. This ensures that the same dataset generates the exact same byte stream (and hash) regardless of the host machine or user.
  • Path Limits: For universal compatibility and fixed-offset predictability, paths are limited to 255 bytes total, and individual folder/file names are limited to 100 bytes.
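
The effect of the anonymization and path rules can be demonstrated with Python's standard `tarfile` module. This is a sketch under stated assumptions, not TarTape's code: it assumes ustar-format headers, a fixed mode of 0o644, and UTF-8 path encoding, and `ustar_header` and `check_path` are hypothetical helpers.

```python
import tarfile


def ustar_header(name, size, mtime, uid=0, gid=0, uname="", gname="", scrub=True) -> bytes:
    """Build a 512-byte ustar header, optionally scrubbing ownership fields."""
    info = tarfile.TarInfo(name=name)
    info.size = size
    info.mtime = mtime
    info.mode = 0o644  # assumption: fixed mode for the demo
    if scrub:
        uid = gid = 0
        uname = gname = ""
    info.uid, info.gid = uid, gid
    info.uname, info.gname = uname, gname
    return info.tobuf(format=tarfile.USTAR_FORMAT)


def check_path(path: str) -> None:
    """Raise ValueError if a '/'-separated path exceeds the documented limits."""
    if len(path.encode("utf-8")) > 255:
        raise ValueError("path exceeds 255 bytes")
    for part in path.split("/"):
        if len(part.encode("utf-8")) > 100:
            raise ValueError(f"name exceeds 100 bytes: {part[:20]}...")


# The same file recorded by two different users yields identical headers once scrubbed.
a = ustar_header("data/a.txt", 10, 0, uid=1000, gid=1000, uname="alice", gname="staff")
b = ustar_header("data/a.txt", 10, 0, uid=501, gid=20, uname="bob", gname="wheel")
print(a == b, len(a))  # True 512
```

With `scrub=False` the two headers differ, which would change every hash computed over the stream.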

Compatible with Python 3.10+ and any standard extraction tool (tar, 7-Zip, etc.).
