repro-zipfile

A tiny, zero-dependency replacement for Python's zipfile.ZipFile class for creating reproducible/deterministic ZIP archives.

"Reproducible" or "deterministic" in this context means that the binary content of the ZIP archive is identical if you add files with identical binary content in the same order. It means you can reliably check equality of the contents of two ZIP archives by simply comparing checksums of the archive using a hash function like MD5 or SHA-256.
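
As a minimal sketch of that checksum comparison, the snippet below hashes two archive files with SHA-256 from the standard library's hashlib and compares the digests. The helper names sha256_of_file and archives_match are hypothetical and are not part of repro-zipfile.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA-256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fp:
        # Read fixed-size chunks so large archives are never loaded whole.
        for chunk in iter(lambda: fp.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archives_match(path_a, path_b):
    """Return True when both files have byte-identical content."""
    return sha256_of_file(path_a) == sha256_of_file(path_b)
```

Calling archives_match on two archive paths returns True only when the two files are byte-for-byte identical, which is what a reproducible archive is meant to guarantee for identical inputs.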

This Python package provides a ReproducibleZipFile class that works exactly like zipfile.ZipFile from the Python standard library, except that certain file metadata are set to fixed values. See "How does repro-zipfile work?" below for details.

You can also optionally install a command-line program, rpzip. See "rpzip command-line program" below for more information.

Looking instead to create reproducible/deterministic tar archives? Check out our sister package, repro-tarfile!

Installation

repro-zipfile is available from PyPI. To install, run:

pip install repro-zipfile

It is also available from conda-forge. To install, run:

conda install repro-zipfile -c conda-forge

Usage

Simply import ReproducibleZipFile and use it in the same way you would use zipfile.ZipFile from the Python standard library.

from repro_zipfile import ReproducibleZipFile

with ReproducibleZipFile("archive.zip", "w") as zp:
    # Use write to add a file to the archive
    zp.write("examples/data.txt", arcname="data.txt")
    # Or writestr to write data to the archive
    zp.writestr("lore.txt", data="goodbye")

Note that files must be written to the archive in the same order to reproduce an identical archive. Be aware that functions like os.listdir, glob.glob, Path.iterdir, and Path.glob return files in an arbitrary, platform-dependent order, so you should call sorted on their returned values first.
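
One way to get a stable order is sketched below. The helper write_sorted is a hypothetical name, not part of repro-zipfile; it only assumes that zp is an open archive object with the standard write method, which holds for zipfile.ZipFile and its subclasses.

```python
from pathlib import Path


def write_sorted(zp, root):
    """Add every file under `root` to the open archive `zp` in sorted order."""
    root = Path(root)
    files = (p for p in root.rglob("*") if p.is_file())
    # Sort on the archive name (always "/"-separated) so the order does not
    # depend on the platform or on the order the file system returns entries.
    entries = sorted((p.relative_to(root).as_posix(), p) for p in files)
    for arcname, path in entries:
        zp.write(path, arcname=arcname)
```

This sketch adds files only and skips directory entries. Since ReproducibleZipFile is used the same way as zipfile.ZipFile, an instance opened in "w" mode can be passed as zp.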

See examples/usage.py for an example script that you can run, and examples/demo_vs_zipfile.py for a demonstration in contrast with the standard library's zipfile module.

For more advanced usage, such as customizing the fixed metadata values, see the subsections under "How does repro-zipfile work?".

rpzip command-line program

You can optionally install a lightweight command-line program, rpzip. This adds a dependency on the typer CLI framework. You can install it either directly or through the cli extra of repro-zipfile. We recommend pipx, which installs Python CLIs into isolated virtual environments, but regular pip works too.

pipx install rpzip
# or
pipx install repro-zipfile[cli]

rpzip is designed to be a partial drop-in replacement for the ubiquitous zip program. Use rpzip --help to see the documentation. Here are some usage examples:

# Archive a single file
rpzip archive.zip examples/data.txt
# Archive multiple files
rpzip archive.zip examples/data.txt README.md
# Archive multiple files with a shell glob
rpzip archive.zip examples/*.py
# Archive a directory recursively
rpzip -r archive.zip examples

In addition to fixing file metadata the way repro-zipfile does, rpzip always sorts all paths being written.

How does repro-zipfile work?

ZIP archives are not normally reproducible even when containing files with identical content because of file metadata. In particular, the usual culprits are:

  1. Last-modified timestamps
  2. File-system permissions (mode)
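
The effect of these two fields can be seen with the standard library alone. The sketch below builds small in-memory archives with identical file content and varies only the timestamp or the mode; build_archive is a hypothetical helper written for this illustration.

```python
import io
from zipfile import ZipFile, ZipInfo


def build_archive(date_time, mode):
    """Return the bytes of a one-file archive with the given metadata."""
    buffer = io.BytesIO()
    with ZipFile(buffer, "w") as zp:
        zinfo = ZipInfo("data.txt", date_time=date_time)
        # The Unix permission bits live in the high 16 bits of external_attr.
        zinfo.external_attr = mode << 16
        zp.writestr(zinfo, "same content")
    return buffer.getvalue()


baseline = build_archive((1980, 1, 1, 0, 0, 0), 0o644)
repeat = build_archive((1980, 1, 1, 0, 0, 0), 0o644)
later = build_archive((2024, 1, 1, 0, 0, 0), 0o644)
other_mode = build_archive((1980, 1, 1, 0, 0, 0), 0o600)

print(baseline == repeat)      # same metadata, same bytes
print(baseline == later)       # only the timestamp differs
print(baseline == other_mode)  # only the mode differs
```

Only the first comparison is equal: the archives with a different timestamp or mode differ in their bytes even though the stored file content is the same.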

repro_zipfile.ReproducibleZipFile is a subclass of zipfile.ZipFile that overrides the write, writestr, and mkdir methods with versions that set the above metadata to fixed values. Note that repro-zipfile does not modify the original files—only the metadata written to the archive.

You can effectively reproduce what ReproducibleZipFile does with something like this:

from zipfile import ZipFile, ZipInfo

with ZipFile("archive.zip", "w") as zp:
    # Add a file from disk. Set the fixed metadata on the ZipInfo before
    # writing, because the entry's header is written as soon as it is added.
    zinfo = ZipInfo.from_file("examples/data.txt", arcname="data.txt")
    zinfo.date_time = (1980, 1, 1, 0, 0, 0)
    zinfo.external_attr = 0o644 << 16
    with open("examples/data.txt", "rb") as fp:
        zp.writestr(zinfo, fp.read())
    # Or write in-memory data to the archive
    zinfo = ZipInfo("lore.txt", date_time=(1980, 1, 1, 0, 0, 0))
    zinfo.external_attr = 0o644 << 16
    zp.writestr(zinfo, "goodbye")

It's not hard to do, but we believe ReproducibleZipFile is sufficiently more convenient to justify a small package!

See the next two sections for more details about the replacement metadata values and how to customize them.

Last-modified timestamps

ZIP archives store the last-modified timestamps of files and directories. ReproducibleZipFile will set this to a fixed value. By default, the fixed value is 1980-01-01 00:00 UTC, which is the earliest timestamp that is supported by the ZIP format specifications.

You can customize this value with the SOURCE_DATE_EPOCH environment variable. If set, it will be used as the fixed value instead. This should be an integer corresponding to the Unix epoch time of the timestamp you want to set, e.g., 1704067200 for 2024-01-01 00:00:00 UTC. SOURCE_DATE_EPOCH is a standard created by the Reproducible Builds project for software distributions.
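
To illustrate how an epoch value relates to the six-field timestamp that ZIP entries store, here is a rough sketch using only the standard library. The function zip_date_time is hypothetical and is not claimed to match repro-zipfile's internals; it only shows the conversion, assuming the value is interpreted in UTC.

```python
import os
import time
from datetime import datetime, timezone


def zip_date_time(environ=None, default=(1980, 1, 1, 0, 0, 0)):
    """Return a (year, month, day, hour, minute, second) tuple for ZipInfo."""
    environ = os.environ if environ is None else environ
    value = environ.get("SOURCE_DATE_EPOCH")
    if value is None:
        return default
    # Interpret the epoch seconds in UTC and keep the first six fields.
    return tuple(time.gmtime(int(value))[:6])


# Compute the epoch for a chosen UTC date rather than typing it by hand.
epoch = int(datetime(2024, 1, 1, tzinfo=timezone.utc).timestamp())
example = zip_date_time({"SOURCE_DATE_EPOCH": str(epoch)})
```

Deriving the epoch from a datetime, as in the last two lines, avoids off-by-a-few-seconds mistakes when choosing a value for the environment variable.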

File-system permissions

ZIP archives store the file-system permissions of files and directories. The default permissions set for new files or directories often can be different across different systems or users without any intentional choices being made. (These default permissions are controlled by something called umask.) ReproducibleZipFile will set these to fixed values. By default, the fixed values are 0o644 (rw-r--r--) for files and 0o755 (rwxr-xr-x) for directories, which matches the common default umask of 0o022 for root users on Unix systems. (The 0o prefix is how you can write an octal—i.e., base 8—integer literal in Python.)
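
The relationship between the umask and these defaults is plain bit arithmetic: new files typically start from 0o666 and new directories from 0o777, and the umask bits are cleared from that base. A short worked example:

```python
import stat

UMASK = 0o022

# Clear the umask bits from the usual base permissions.
file_mode = 0o666 & ~UMASK
dir_mode = 0o777 & ~UMASK

# Render the modes symbolically, adding the file-type bits stat expects.
file_symbolic = stat.filemode(stat.S_IFREG | file_mode)
dir_symbolic = stat.filemode(stat.S_IFDIR | dir_mode)

print(oct(file_mode), file_symbolic)
print(oct(dir_mode), dir_symbolic)
```

The results are 0o644 for files and 0o755 for directories, the same values ReproducibleZipFile uses by default.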

You can customize these values using the environment variables REPRO_ZIPFILE_FILE_MODE and REPRO_ZIPFILE_DIR_MODE. They should be in three-digit octal Unix numeric notation, e.g., 644 for rw-r--r--.
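
Because the variables hold octal digits as text, the string has to be parsed in base 8 to get the numeric mode. The sketch below shows that parsing step; mode_from_env is a hypothetical helper and is not claimed to be how repro-zipfile reads these variables.

```python
import os


def mode_from_env(name, default, environ=None):
    """Parse an octal permission string such as "644" from the environment."""
    environ = os.environ if environ is None else environ
    value = environ.get(name)
    # int(..., 8) reads the digits as octal, so "644" becomes 0o644 (420).
    return default if value is None else int(value, 8)


file_mode = mode_from_env(
    "REPRO_ZIPFILE_FILE_MODE", 0o644, {"REPRO_ZIPFILE_FILE_MODE": "600"}
)
dir_mode = mode_from_env("REPRO_ZIPFILE_DIR_MODE", 0o755, {})
```

Here file_mode comes from the supplied string and dir_mode falls back to the default, since the second mapping does not define the variable.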

Why care about reproducible ZIP archives?

ZIP archives are often useful when dealing with a set of multiple files, especially if the files are large and can be compressed. Creating reproducible ZIP archives is often useful for:

  • Building a software package. This is a development best practice to make it easier to verify distributed software packages. See the Reproducible Builds project for more explanation.
  • Working with data. Verify that your data pipeline produced the same outputs, and avoid further reprocessing of identical data.
  • Packaging machine learning model artifacts. Manage model artifact packages more effectively by knowing when they contain identical models.

Related Tools and Alternatives
