Archive handling for VCollab applications — extract zip/tar.gz archives and stream directories as zip

Archive Utilities

Purpose

Archive handling for VCollab applications -- extract ZIP and TAR.GZ archives from binary streams and stream directories as ZIP archives (in-memory or via tempfile).

This package provides two categories of functionality:

  • Extraction -- Extract ZIP and TAR.GZ archives from any seekable binary stream (BinaryIO) to a target directory, using either in-memory BytesIO (fast, for small files) or temporary file (memory-efficient, for large files) strategies. Includes path traversal protection and configurable size/count limits.
  • Streaming -- Generate ZIP archives from directory contents on-the-fly for download responses, with memory-based streaming for small directories and tempfile-based streaming for large directories. Supports file filtering/exclusion.
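
The two extraction strategies differ only in where the incoming stream is buffered before the archive is opened. The following is a minimal stdlib sketch of that idea, not the package's implementation; the function names `extract_via_bytesio` and `extract_via_tempfile` are hypothetical, and the size/count limits are left out here:

```python
import io
import shutil
import tempfile
import zipfile
from pathlib import Path
from typing import BinaryIO


def extract_via_bytesio(stream: BinaryIO, target: Path) -> None:
    """Read the whole stream into memory, then extract (fast, small archives)."""
    buffer = io.BytesIO(stream.read())
    with zipfile.ZipFile(buffer) as zf:
        zf.extractall(target)


def extract_via_tempfile(stream: BinaryIO, target: Path) -> None:
    """Copy the stream to a temporary file in chunks, then extract from disk."""
    with tempfile.TemporaryFile() as tmp:
        # Only one chunk (1 MiB here) is held in memory at a time.
        shutil.copyfileobj(stream, tmp, length=1024 * 1024)
        tmp.seek(0)
        with zipfile.ZipFile(tmp) as zf:
            zf.extractall(target)
```

Both functions produce the same files on disk; the tempfile variant trades an extra disk write for a flat memory footprint.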

When to use this package

Use vcti-archive when your application needs to:

  • Accept uploaded ZIP or TAR.GZ files and extract them to disk
  • Serve directory contents as downloadable ZIP archives
  • Stream large directory archives without loading everything into memory
  • Choose between memory-efficient and fast extraction strategies
  • Protect against malicious archives (path traversal, zip bombs)
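
To make the last point concrete, here is a rough sketch of the kind of pre-extraction checks that guard against path traversal and zip bombs. It uses only the stdlib and is an illustration, not the package's actual code; `UnsafeArchiveError` and `check_zip` are hypothetical names, while `max_total_size` and `max_file_count` mirror the parameter names shown in the Quick Start:

```python
import zipfile
from pathlib import Path


class UnsafeArchiveError(Exception):
    """Hypothetical error type used only in this sketch."""


def check_zip(zf: zipfile.ZipFile, target: Path,
              max_total_size: int, max_file_count: int) -> None:
    """Raise UnsafeArchiveError if the archive breaks a limit or escapes target."""
    infos = zf.infolist()
    if len(infos) > max_file_count:
        raise UnsafeArchiveError(f"too many entries: {len(infos)}")

    # Sum of declared uncompressed sizes; a zip bomb shows up as a huge total.
    total = sum(info.file_size for info in infos)
    if total > max_total_size:
        raise UnsafeArchiveError(f"uncompressed size {total} exceeds limit")

    root = target.resolve()
    for info in infos:
        # Resolve each destination and make sure it stays inside the target.
        dest = (root / info.filename).resolve()
        if not dest.is_relative_to(root):
            raise UnsafeArchiveError(f"path escapes target: {info.filename}")
```

The checks read only the archive's central directory, so they run before any file is written to disk.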

Installation

The package has zero required dependencies. FastAPI integration is available as an optional extra.

Without FastAPI (CLI tools, background workers, any framework)

Use this when you only need extraction and streaming — no FastAPI-specific helpers. Works with Django, Flask, plain scripts, or anything that accepts BinaryIO and bytes iterators.

pip install "vcti-archive>=1.0.2"

What you get:

  • ZipExtractor, TarGzExtractor — extract from any BinaryIO
  • DirectoryZipMemoryStreamer — stream directory as ZIP (in-memory)
  • LargeDirectoryZipStreamer — stream directory as ZIP (tempfile)
  • Async wrappers, bomb protection, path traversal safety, logging

With FastAPI

Use this when you need streaming_zip_response() — a helper that wraps LargeDirectoryZipStreamer in a StreamingResponse with correct headers and BackgroundTasks cleanup.

pip install "vcti-archive[fastapi]>=1.0.2"

Everything above, plus:

  • streaming_zip_response(streamer, background_tasks) from vcti.archive.fastapi

In requirements.txt

# Without FastAPI
vcti-archive>=1.0.2

# With FastAPI
vcti-archive[fastapi]>=1.0.2

In pyproject.toml dependencies

# Without FastAPI
dependencies = [
    "vcti-archive>=1.0.2",
]

# With FastAPI
dependencies = [
    "vcti-archive[fastapi]>=1.0.2",
]

Quick Start

Usage without FastAPI

All core functionality works with any framework or no framework at all. Extractors accept any seekable BinaryIO (open files, io.BytesIO, UploadFile.file, etc.) and streamers yield plain bytes iterators.

Extract a ZIP archive:

from pathlib import Path
from vcti.archive import ZipExtractor

with open("archive.zip", "rb") as f:
    extractor = ZipExtractor(f, Path("/target/dir"))
    # Pick one strategy:
    extractor.extract_using_bytesio()     # Fast, for small files
    # extractor.extract_using_tempfile()  # Memory-efficient, for large files

# With bomb protection
with open("archive.zip", "rb") as stream:
    extractor = ZipExtractor(
        stream, Path("/target"),
        max_total_size=500_000_000,  # 500 MB limit
        max_file_count=10_000,       # 10K files limit
    )
    extractor.extract_using_bytesio()

Extract a TAR.GZ archive:

from pathlib import Path
from vcti.archive import TarGzExtractor

with open("archive.tar.gz", "rb") as f:
    TarGzExtractor(f, Path("/target/dir")).extract_using_bytesio()

Async extraction (non-blocking for async frameworks):

# Inside an async function, with `stream` an open seekable binary stream:
extractor = ZipExtractor(stream, Path("/target/dir"))
await extractor.async_extract_using_bytesio()
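
Archive extraction is blocking I/O, so an async wrapper usually hands the work to a worker thread. The sketch below shows that pattern with `asyncio.to_thread`. It assumes, without confirmation from the source, that the package's `async_*` methods work along similar lines; `blocking_extract` and `async_extract` are hypothetical names:

```python
import asyncio
import io
import zipfile
from pathlib import Path
from typing import BinaryIO


def blocking_extract(stream: BinaryIO, target: Path) -> int:
    """Extract synchronously and return the number of archive entries."""
    with zipfile.ZipFile(io.BytesIO(stream.read())) as zf:
        zf.extractall(target)
        return len(zf.namelist())


async def async_extract(stream: BinaryIO, target: Path) -> int:
    """Run the blocking extraction in a worker thread; the event loop stays free."""
    return await asyncio.to_thread(blocking_extract, stream, target)
```

From synchronous code this can be driven with `asyncio.run(async_extract(stream, target))`; inside a request handler it is simply awaited.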

Stream a directory as ZIP (in-memory):

from pathlib import Path
from vcti.archive import DirectoryZipMemoryStreamer

streamer = DirectoryZipMemoryStreamer(Path("/data/project"))

# Write to a file
with open("output.zip", "wb") as out:
    for chunk in streamer:
        out.write(chunk)

# Or pass to any framework's streaming response
# Django: StreamingHttpResponse(streamer, content_type="application/zip")
# Flask:  Response(streamer, mimetype="application/zip")

Stream a large directory as ZIP (tempfile):

from pathlib import Path
from vcti.archive import LargeDirectoryZipStreamer

streamer = LargeDirectoryZipStreamer(
    folder_path=Path("/data/project"),
    archive_name="project.zip",
)
# `response` stands for any writable sink, such as a framework response object
for chunk in streamer.stream():
    response.write(chunk)

File filtering (works with both streamers):

streamer = DirectoryZipMemoryStreamer(
    Path("/data/project"),
    exclude=lambda p: p.name.startswith(".") or p.suffix == ".log",
)
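
An `exclude` callable of this shape is a predicate over `Path` objects: a file is skipped when the predicate returns true. The sketch below shows how such a predicate could be applied while walking a directory. It is stdlib-only and illustrative; `iter_files` and `zip_directory` are hypothetical helpers, not part of the package:

```python
import io
import zipfile
from pathlib import Path
from typing import Callable, Iterator, Optional


def iter_files(root: Path,
               exclude: Optional[Callable[[Path], bool]] = None) -> Iterator[Path]:
    """Yield files under root in sorted order, skipping those exclude() accepts."""
    for path in sorted(root.rglob("*")):
        if path.is_file() and not (exclude and exclude(path)):
            yield path


def zip_directory(root: Path,
                  exclude: Optional[Callable[[Path], bool]] = None) -> bytes:
    """Build a ZIP of root in memory, storing paths relative to root."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in iter_files(root, exclude):
            zf.write(path, path.relative_to(root).as_posix())
    return buffer.getvalue()
```

In this sketch the predicate sees each file path individually, so `p.name.startswith(".")` skips hidden files but not ordinary files inside a hidden directory; checking `p.parts` would cover that case. Whether the package prunes whole directories is not stated in this README.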

Usage with FastAPI

Install with pip install vcti-archive[fastapi]. Everything above still works, plus you get streaming_zip_response() — a helper that wraps LargeDirectoryZipStreamer in a StreamingResponse with correct headers and deferred temp-file cleanup via BackgroundTasks.

Extract an uploaded file:

from pathlib import Path
from fastapi import FastAPI, UploadFile
from vcti.archive import ZipExtractor

app = FastAPI()

@app.post("/upload")
async def upload(file: UploadFile):
    extractor = ZipExtractor(file.file, Path("/data/uploads"))
    await extractor.async_extract_using_bytesio()
    return {"status": "extracted"}

Stream a directory as a download (in-memory, small dirs):

from pathlib import Path
from fastapi.responses import StreamingResponse
from vcti.archive import DirectoryZipMemoryStreamer

@app.get("/download")
def download():
    streamer = DirectoryZipMemoryStreamer(Path("/data/project"))
    return StreamingResponse(streamer, media_type="application/zip")

Stream a large directory as a download (tempfile, large dirs):

from pathlib import Path
from fastapi import BackgroundTasks
from vcti.archive import LargeDirectoryZipStreamer
from vcti.archive.fastapi import streaming_zip_response

@app.get("/download/large")
def download_large(background_tasks: BackgroundTasks):
    streamer = LargeDirectoryZipStreamer(
        folder_path=Path("/data/dataset"),
        archive_name="dataset.zip",
    )
    return streaming_zip_response(streamer, background_tasks)

Choosing a streamer

Both streamers produce identical ZIP output. The difference is where the ZIP is assembled:

  • DirectoryZipMemoryStreamer -- builds the ZIP in a BytesIO buffer, yielding chunks as it goes. Simple (no temp files, no cleanup), but the buffer stays in memory for the duration of the request.
  • LargeDirectoryZipStreamer -- writes the complete ZIP to a temp file first, then streams from disk. Needs cleanup (via on_cleanup or the streaming_zip_response helper) but memory usage stays flat regardless of archive size.
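
The tempfile approach can be pictured as a generator that writes the archive to disk, yields it in fixed-size chunks, and removes the file afterwards. This is a stdlib sketch of the technique under that assumption, not the code behind `LargeDirectoryZipStreamer`; `stream_zip_via_tempfile` and `chunk_size` are hypothetical names:

```python
import os
import tempfile
import zipfile
from pathlib import Path
from typing import Iterator


def stream_zip_via_tempfile(root: Path, chunk_size: int = 64 * 1024) -> Iterator[bytes]:
    """Write the full ZIP to a temp file, yield it in chunks, then delete it."""
    fd, tmp_name = tempfile.mkstemp(suffix=".zip")
    os.close(fd)
    try:
        with zipfile.ZipFile(tmp_name, "w", zipfile.ZIP_DEFLATED) as zf:
            for path in sorted(root.rglob("*")):
                if path.is_file():
                    zf.write(path, path.relative_to(root).as_posix())
        with open(tmp_name, "rb") as tmp:
            # Only one chunk is held in memory at a time.
            while chunk := tmp.read(chunk_size):
                yield chunk
    finally:
        # Runs when the generator is exhausted or closed.
        os.unlink(tmp_name)
```

Cleanup in a `finally` block only runs once the generator is exhausted or closed, which is one reason a response-level hook (such as the `BackgroundTasks` cleanup mentioned above) is a more dependable place to delete the file when a client disconnects mid-download.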

The right choice depends on your deployment, not a universal size threshold. Consider:

  • Process memory budget -- in a 512 MB container, a 200 MB in-memory ZIP may be too large; on a 32 GB server it's trivial.
  • Concurrent requests -- one 300 MB buffer is fine; fifty concurrent ones may not be.
  • Disk I/O -- the tempfile streamer writes then reads the full archive, so slow disks add latency that the memory streamer avoids.

When in doubt, start with DirectoryZipMemoryStreamer (fewer moving parts) and switch to LargeDirectoryZipStreamer if you observe memory pressure under production load.
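
If you want a number to reason with, the uncompressed size of the directory gives a rough estimate of the ZIP size, and multiplying it by the expected concurrency shows whether in-memory streaming fits your budget. A small sketch, where `directory_size`, `fits_in_memory`, `memory_budget` and `concurrency` are hypothetical names chosen for illustration:

```python
from pathlib import Path


def directory_size(root: Path) -> int:
    """Total size in bytes of all regular files under root (uncompressed)."""
    return sum(p.stat().st_size for p in root.rglob("*") if p.is_file())


def fits_in_memory(root: Path, memory_budget: int, concurrency: int = 1) -> bool:
    """True if `concurrency` simultaneous in-memory ZIPs of root fit the budget.

    The uncompressed size is used as a rough estimate of the ZIP size.
    """
    return directory_size(root) * concurrency <= memory_budget
```

The budget and concurrency figures are deployment-specific inputs; the function only does the arithmetic described in the considerations above.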


Public API

Class / Function              Purpose
ArchiveExtractor              ABC base class for archive extractors (BytesIO and tempfile strategies)
ZipExtractor                  Extract ZIP archives with path traversal and bomb protection
TarGzExtractor                Extract TAR.GZ archives with filter="data" security
DirectoryZipMemoryStreamer    Stream directory as ZIP using in-memory buffer (reusable)
LargeDirectoryZipStreamer     Stream directory as ZIP using temporary file
UnsupportedArchiveFormat      Exception for unsupported archive formats
streaming_zip_response()      FastAPI helper (optional, requires vcti-archive[fastapi])

Dependencies

  • Zero required dependencies -- Core functionality uses Python stdlib only (zipfile, tarfile, shutil, tempfile, asyncio).
  • Optional: fastapi -- Install with vcti-archive[fastapi] for streaming_zip_response() and FastAPI-specific integration.

Documentation

  • Design -- Architecture decisions and extraction strategies
  • Source Guide -- File descriptions and execution flow traces
  • API Reference -- Autodoc for all modules
