Use cloud-stored zipfiles with full ZipFile functionality, including partial downloads.

cloudzipfile

This module provides access to zip files in cloud storage without downloading the entire archive. It is inspired by remotezip, but uses the respective cloud APIs rather than requiring support for the HTTP Range header. It currently supports only Azure. Porting it to other providers should be fairly simple, and pull requests are very welcome!

Installation

pip install cloudzipfile

Usage

CloudZipFile is a subclass of Python's standard library zipfile.ZipFile and thus supports all its read methods.

Instead of providing ZipFile with a path, you provide a blob client for your cloud provider, for example:

# Imports
import os
import tempfile
import uuid

from azure.storage.blob import BlobClient
from cloudzipfile.cloudzipfile import CloudZipFile

# Define the blob client
BLOB_URL = 'https://cloudzipfileexamples.blob.core.windows.net/test/files.zip'
blobClient = BlobClient.from_blob_url(BLOB_URL)

# Choose an output directory and the files to extract
PATH_OUTPUT = os.path.join(tempfile.gettempdir(), str(uuid.uuid4()))
FILES_DESIRED = ['file1.txt', 'file3.txt']

# Open the remote zipfile
# This downloads the central directory (which records where to find specific files)
cloudZipFile = CloudZipFile(blobClient)

# Extract specific files
cloudZipFile.extractall(path=PATH_OUTPUT, members=FILES_DESIRED)

# Verify success: should show file1.txt and file3.txt
print(f'{PATH_OUTPUT}: {os.listdir(PATH_OUTPUT)}')

Future Development

Supporting other systems is fairly straightforward, as only two methods are required: one that determines the size of the cloud file and one that performs a partial download. These should be supported by all major providers (I simply don't have experience with them).
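To make the two-method idea concrete, here is a minimal sketch of how such a backend could plug into the standard library. It is not the package's actual code: the names RangeReader, InMemoryReader, and RangeFile are hypothetical, and the in-memory backend only stands in for a real cloud SDK call.

```python
import io
import zipfile
from typing import Protocol


class RangeReader(Protocol):
    """Hypothetical provider interface: the two operations described above."""

    def size(self) -> int: ...

    def read_range(self, offset: int, length: int) -> bytes: ...


class InMemoryReader:
    """Stand-in backend for local testing; a real one would call a cloud SDK."""

    def __init__(self, data: bytes) -> None:
        self._data = data
        self.bytes_fetched = 0  # tracks how much was "downloaded"

    def size(self) -> int:
        return len(self._data)

    def read_range(self, offset: int, length: int) -> bytes:
        chunk = self._data[offset:offset + length]
        self.bytes_fetched += len(chunk)
        return chunk


class RangeFile(io.RawIOBase):
    """Seekable, read-only file object built on top of a RangeReader."""

    def __init__(self, reader: RangeReader) -> None:
        super().__init__()
        self._reader = reader
        self._size = reader.size()
        self._pos = 0

    def readable(self) -> bool:
        return True

    def seekable(self) -> bool:
        return True

    def tell(self) -> int:
        return self._pos

    def seek(self, offset: int, whence: int = io.SEEK_SET) -> int:
        base = {io.SEEK_SET: 0, io.SEEK_CUR: self._pos, io.SEEK_END: self._size}[whence]
        new_pos = base + offset
        if new_pos < 0:
            raise OSError("negative seek position")
        self._pos = new_pos
        return self._pos

    def readinto(self, buffer) -> int:
        # Fetch only the bytes requested, never past the end of the file.
        length = min(len(buffer), max(self._size - self._pos, 0))
        if length == 0:
            return 0
        chunk = self._reader.read_range(self._pos, length)
        buffer[:len(chunk)] = chunk
        self._pos += len(chunk)
        return len(chunk)


# Build a sample archive in memory, then read one member through the adapter.
sample = io.BytesIO()
with zipfile.ZipFile(sample, "w", zipfile.ZIP_STORED) as archive:
    archive.writestr("file1.txt", b"hello from file1")
    archive.writestr("file2.txt", b"x" * 200_000)
    archive.writestr("file3.txt", b"hello from file3")

reader = InMemoryReader(sample.getvalue())
with zipfile.ZipFile(RangeFile(reader)) as remote:
    names = remote.namelist()
    content = remote.read("file1.txt")

print(f"fetched {reader.bytes_fetched} of {reader.size()} bytes")
```

Under these assumptions, a new provider would only need its own class with size and read_range; the file adapter and zipfile itself stay unchanged.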

How It Works

Zip files have a fixed structure, which can be leveraged for partial reading. They end with an end of central directory (EOCD) record, which states where to find the central directory. The central directory lists all files in the archive and where to find them. Python's zipfile uses these two pieces to determine which part of the file to load into memory when the user requests a particular file. This package overrides that loading process to work with cloud APIs directly rather than only with local filesystems. All credit goes to remotezip for figuring out how to override the process; I only edited it to use cloud APIs rather than HTTP requests.
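The lookup chain described above (EOCD, then central directory, then file offsets) can be illustrated with the standard library alone. This is a simplified sketch rather than what zipfile or this package does internally: it assumes an archive with no trailing comment, no ZIP64 extensions, and ASCII file names, and the helper names read_eocd and list_entries are made up for illustration.

```python
import io
import struct
import zipfile

# Fixed-size record layouts from the zip format (little-endian).
EOCD = struct.Struct("<4s4H2LH")        # 22 bytes at the very end of the file
CDIR = struct.Struct("<4s4B4HL2L5H2L")  # 46-byte central directory header


def read_eocd(data: bytes) -> tuple[int, int, int]:
    """Return (entry count, central directory size, central directory offset)."""
    fields = EOCD.unpack(data[-EOCD.size:])
    signature, total_entries, cd_size, cd_offset, comment_len = (
        fields[0], fields[4], fields[5], fields[6], fields[7])
    if signature != b"PK\x05\x06" or comment_len != 0:
        raise ValueError("EOCD not at the expected position (archive comment?)")
    return total_entries, cd_size, cd_offset


def list_entries(data: bytes) -> dict[str, tuple[int, int]]:
    """Map each file name to (local header offset, compressed size)."""
    total_entries, cd_size, cd_offset = read_eocd(data)
    directory = data[cd_offset:cd_offset + cd_size]
    entries = {}
    pos = 0
    for _ in range(total_entries):
        fields = CDIR.unpack_from(directory, pos)
        if fields[0] != b"PK\x01\x02":
            raise ValueError("bad central directory signature")
        compressed_size = fields[10]
        name_len, extra_len, comment_len = fields[12:15]
        header_offset = fields[18]
        name_start = pos + CDIR.size
        name = directory[name_start:name_start + name_len].decode("ascii")
        entries[name] = (header_offset, compressed_size)
        # Variable-length name, extra field, and comment follow each header.
        pos = name_start + name_len + extra_len + comment_len
    return entries


# Build a small sample archive in memory and inspect it.
sample = io.BytesIO()
with zipfile.ZipFile(sample, "w", zipfile.ZIP_STORED) as archive:
    for name in ("file1.txt", "file2.txt", "file3.txt"):
        archive.writestr(name, f"contents of {name}".encode("ascii"))

archive_bytes = sample.getvalue()
entries = list_entries(archive_bytes)
for name, (offset, size) in entries.items():
    print(f"{name}: local header at byte {offset}, {size} compressed bytes")
```

Only the last 22 bytes and the central directory are needed to build this table; each file's data can then be fetched with a single ranged read starting at its local header offset.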

