sarfile
Like tarfile, but streamable.
What is this?
This repository implements a "streaming archive" (SAR) file format for collecting multiple files into one. It is similar to the TAR format, but it puts the information about all the files in the archive into a contiguous block at the beginning of the file. This has a few benefits:
- Much faster startup times for large archives (we read the entire header into memory in one go)
- Much friendlier to remote file systems (only one network request rather than a bunch), in combination with smart_open
- Fast random access
The file size is the same as an uncompressed TAR file.
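The header-first layout described above can be illustrated with a small sketch. This is not the actual SAR on-disk format (that is defined in _header.py); the length prefix, the JSON index, and the function names write_archive, read_index, and read_member are all assumptions made up for this example.

```python
import io
import json
import struct


def write_archive(out, files):
    """Write `files` (name -> bytes) as: length prefix, JSON index, payloads."""
    index = {}
    offset = 0
    for name, data in files.items():
        # Offsets are relative to the start of the payload region.
        index[name] = (offset, len(data))
        offset += len(data)
    header = json.dumps(index).encode("utf-8")
    out.write(struct.pack("<Q", len(header)))  # 8-byte little-endian length
    out.write(header)
    for data in files.values():
        out.write(data)


def read_index(fp):
    """Read the whole index up front; return it with the payload start offset."""
    fp.seek(0)
    (header_len,) = struct.unpack("<Q", fp.read(8))
    index = json.loads(fp.read(header_len).decode("utf-8"))
    return index, 8 + header_len


def read_member(fp, index, data_start, name):
    """Random access: seek straight to one member and read only its bytes."""
    offset, size = index[name]
    fp.seek(data_start + offset)
    return fp.read(size)


buf = io.BytesIO()
write_archive(buf, {"a.txt": b"hello", "b.txt": b"world!"})
index, data_start = read_index(buf)
second = read_member(buf, index, data_start, "b.txt")
```

Because every name, offset, and size sits in one block at the front, listing the archive never touches the payloads, and reading one member needs a single seek.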
The downside is that once we've written a SAR file, we can't change it. A future version of the format may support this, but for now, the recommended flow is to first generate a TAR file, then convert it using the built-in sarpack command line tool or the sarfile.pack_tar Python API.
Also, the file format only exists in this repository, although it's very simple to implement (see the _header.py documentation and the sarfile object for how to load items).
Getting Started
Install the package using pip:
pip install sarfile
Next, simply import the module:
import sarfile
You can convert a tarfile to a sarfile using the Python API:
sarfile.pack_tar(out="myfile.sar", tar="myfile.tar")
Alternatively, you can use the built-in command line tool:
sarpack myfile.sar myfile.tar
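If there is no TAR file to start from, one can be built with the standard library first. The sketch below does that and then runs the conversion step shown above. The build_tar helper and the temporary paths are made up for this example, and the sarfile import is guarded so that the TAR-building part runs even where the package is not installed.

```python
import io
import os
import tarfile
import tempfile

try:
    import sarfile  # only the conversion step at the end needs it
except ImportError:
    sarfile = None


def build_tar(tar_path, files):
    """Write `files` (name -> bytes) into an uncompressed TAR archive."""
    with tarfile.open(tar_path, "w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))


workdir = tempfile.mkdtemp()
tar_path = os.path.join(workdir, "myfile.tar")
sar_path = os.path.join(workdir, "myfile.sar")
build_tar(tar_path, {"myfile.txt": b"hello from the archive\n"})

if sarfile is not None:
    # Same call as in the example above, with explicit paths.
    sarfile.pack_tar(out=sar_path, tar=tar_path)
```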
Finally, the file can be used in your Python script:
f = sarfile.open("myfile.sar")
print(f.names)
with f["myfile.txt"] as myfile:
    print(myfile.read())
If you have installed smart_open, then you can also read from S3 as follows:
# Placeholder S3 URL; substitute your own bucket and key.
f = sarfile.open("s3://my-bucket/myfile.sar")
print(f.names)
with f["myfile.txt"] as myfile:
    print(myfile.read())
The above code is much faster than reading a TAR file from S3, because we read the entire header into memory in one network request, rather than having to make a network request for each file in the archive. On subsequent accesses we also only download the part of the file we want to read.
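To make the request-count argument concrete, the sketch below counts the ranged reads a naive reader needs just to list a TAR archive that has no up-front index. The CountingReader class and list_tar_members function are made up for this example, each ranged read stands in for one network request, and real TAR readers may buffer differently.

```python
import io
import tarfile

BLOCK = 512  # TAR headers and payload padding use 512-byte blocks


class CountingReader:
    """Wraps a file object and counts ranged reads (stand-ins for requests)."""

    def __init__(self, fp):
        self.fp = fp
        self.requests = 0

    def read_range(self, offset, size):
        self.requests += 1
        self.fp.seek(offset)
        return self.fp.read(size)


def list_tar_members(reader):
    """Walk the TAR headers one by one, as a reader without an index must."""
    names = []
    offset = 0
    while True:
        header = reader.read_range(offset, BLOCK)
        if len(header) < BLOCK or header == b"\0" * BLOCK:
            break  # an all-zero block marks the end of the archive
        name = header[0:100].split(b"\0")[0].decode("utf-8")
        # The size field is an octal number stored as ASCII text.
        size = int(header[124:136].split(b"\0")[0].strip() or b"0", 8)
        names.append(name)
        # Skip the payload, which is padded up to the next block boundary.
        offset += BLOCK + ((size + BLOCK - 1) // BLOCK) * BLOCK
    return names


raw = io.BytesIO()
with tarfile.open(fileobj=raw, mode="w", format=tarfile.USTAR_FORMAT) as tar:
    for i in range(3):
        data = f"payload {i}".encode("utf-8")
        info = tarfile.TarInfo(name=f"file{i}.txt")
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))

reader = CountingReader(raw)
names = list_tar_members(reader)
```

Here the number of reads grows with the number of members, whereas a header-first layout needs only the initial header read to learn every name and offset.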
Requirements
This package is tested against Python 3.10. Although not required, it is a good idea to install smart_open to support reading from S3 or other remote file systems, and tqdm to show a progress bar when packing large files.
Download files
File details
Details for the file sarfile-0.1.1.tar.gz.
File metadata
- Download URL: sarfile-0.1.1.tar.gz
- Upload date:
- Size: 12.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.11.5
File hashes
Algorithm | Hash digest
---|---
SHA256 | 9ae33d91938d8f56fecae34478e9d7d430d1816e50dea7f88f664b7f8a40cf4f
MD5 | 447dbe4ee27846ee9f18119ca2872ccf
BLAKE2b-256 | 3792a795e2a1aedc7beb60498f75d73eab7960d575399fd9eedb1acf12abcbbc
File details
Details for the file sarfile-0.1.1-py3-none-any.whl.
File metadata
- Download URL: sarfile-0.1.1-py3-none-any.whl
- Upload date:
- Size: 10.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.11.5
File hashes
Algorithm | Hash digest
---|---
SHA256 | cccc95c1412c2b2a4cbcd5997888b143bdacabe688994d1ab7f4d25c8a82de95
MD5 | d5a8a2ad6703e7440ba4dcb936007b82
BLAKE2b-256 | ea11237dba5310e86bbcc8f3c4a687b47d6b65da1175bc14ed4cb3246f39044c