sarfile

Like tarfile, but streamable.

What is this?

This repository implements a "streaming archive" (SAR) file format for collecting multiple files into one. It is similar to the tar format, but it puts the information about all the files in the archive into one contiguous block at the end of the file (referred to below as the header). This solves a couple of problems:

  1. When the file is on local disk, opening the archive is much faster, because we can read the entire header in one go rather than seeking to each file's block.
  2. The benefit is even greater when the file is on a remote file system such as S3, because we can download the entire header in one network request rather than making a network request for each file in the archive.
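
The idea behind both points can be sketched with a toy layout. This is not the actual SAR on-disk format, which this README does not specify; the JSON index, the 8-byte footer, and the names `write_toy_archive`, `read_index`, and `read_member` are all assumptions made for illustration.

```python
import io
import json
import struct

# Hypothetical footer: 8-byte little-endian offset of the index block.
# The real SAR layout may differ; this only illustrates the idea.
FOOTER = struct.Struct("<Q")


def write_toy_archive(out, files):
    """Write payloads back to back, then one index block, then the footer."""
    index = {}
    for name, data in files.items():
        index[name] = {"offset": out.tell(), "size": len(data)}
        out.write(data)
    index_offset = out.tell()
    out.write(json.dumps(index).encode("utf-8"))
    out.write(FOOTER.pack(index_offset))


def read_index(f):
    """Read the footer, then the contiguous index block it points to."""
    end = f.seek(0, io.SEEK_END)
    f.seek(end - FOOTER.size)
    (index_offset,) = FOOTER.unpack(f.read(FOOTER.size))
    f.seek(index_offset)
    return json.loads(f.read(end - FOOTER.size - index_offset))


def read_member(f, index, name):
    """Fetch one member with a single seek and a single read."""
    entry = index[name]
    f.seek(entry["offset"])
    return f.read(entry["size"])


buf = io.BytesIO()
write_toy_archive(buf, {"a.txt": b"hello", "b.txt": b"world!"})
index = read_index(buf)
print(sorted(index))
print(read_member(buf, index, "b.txt"))
```

Because every offset and size lives in one block, listing the archive never touches the payload bytes, and reading a member needs only the byte range recorded for it.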

The downside is that once we've written a SAR file, we can't change it. A future version of the format may support this, but for now the recommended flow is to first generate a TAR file, then convert it using the built-in sarpack command line tool or the sarfile.pack_tar Python API.
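
As a rough sketch of that flow, the TAR file can be built with the standard library's tarfile module and then handed to sarfile.pack_tar. The helper `build_tar`, the temporary directory, and the file contents are illustrative assumptions; the pack_tar call uses the keyword arguments shown in this README and is skipped if sarfile is not installed.

```python
import io
import os
import tarfile
import tempfile


def build_tar(tar_path, files):
    """Write in-memory payloads into an uncompressed TAR file."""
    with tarfile.open(tar_path, "w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))


# Illustrative paths and contents; replace with your own data.
workdir = tempfile.mkdtemp()
tar_path = os.path.join(workdir, "myfile.tar")
sar_path = os.path.join(workdir, "myfile.sar")
build_tar(tar_path, {"myfile.txt": b"hello\n", "other.txt": b"world\n"})

with tarfile.open(tar_path) as tar:
    tar_names = tar.getnames()
print(tar_names)

try:
    import sarfile
except ImportError:
    sarfile = None  # conversion step is skipped without the package

if sarfile is not None:
    # Same keyword arguments as the pack_tar example in this README.
    sarfile.pack_tar(out=sar_path, tar=tar_path)
```

Since a SAR file cannot be modified after it is written, any additions or removals should happen at the TAR stage, before conversion.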

Getting Started

Install the package using pip:

pip install sarfile

Next, simply import the module:

import sarfile

You can convert a tarfile to a sarfile using the Python API:

sarfile.pack_tar(out="myfile.sar", tar="myfile.tar")

Alternatively, you can use the built-in command line tool:

sarpack myfile.sar myfile.tar

Finally, the file can be used in your Python script:

f = sarfile.open("myfile.sar")
print(f.names)
with f["myfile.txt"] as myfile:
    print(myfile.read())

If you have installed smart_open, then you can also read from S3 as follows:

# "my-bucket" is a placeholder bucket name; substitute your own S3 path.
f = sarfile.open("s3://my-bucket/myfile.sar")
print(f.names)
with f["myfile.txt"] as myfile:
    print(myfile.read())

The above code is much faster than reading a TAR file from S3, because we read the entire header into memory in one network request, rather than having to make a network request for each file in the archive. On subsequent accesses we also only download the part of the file we want to read.
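
To get a feel for the TAR side of this comparison, the sketch below counts how many read calls the standard library's tarfile module issues just to list an archive's members. `CountingBytesIO` is a hypothetical helper written for this example, and the exact count depends on the Python version. On an unbuffered remote file, each of these reads could turn into its own network request.

```python
import io
import tarfile


class CountingBytesIO(io.BytesIO):
    """In-memory file that counts how many times read() is called."""

    def __init__(self, data):
        super().__init__(data)
        self.reads = 0

    def read(self, size=-1):
        self.reads += 1
        return super().read(size)


# Build a TAR archive with 50 small members in memory.
n_members = 50
raw = io.BytesIO()
with tarfile.open(fileobj=raw, mode="w") as tar:
    for i in range(n_members):
        data = f"payload {i}".encode("utf-8")
        info = tarfile.TarInfo(name=f"file_{i}.txt")
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))

# Listing the names walks every per-member header in the archive.
src = CountingBytesIO(raw.getvalue())
with tarfile.open(fileobj=src, mode="r:") as tar:
    names = tar.getnames()

print(len(names), "members listed with", src.reads, "read calls")
```

The number of reads grows with the number of members, whereas a format with a single contiguous header needs a fixed number of reads regardless of archive size.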

Requirements

This package is tested against Python 3.10. Although not required, it is a good idea to install smart_open to support reading from S3 or other remote file systems, and tqdm to show a progress bar when packing large files.
