save and load complex objects to disk without pickling
ZANJ
Overview
The ZANJ format is meant to be a way of saving arbitrary objects to disk that is flexible, keeps configuration and data together, and is human readable. It is loosely inspired by HDF5 and the derived exdir format, and the implementation is similar to npz files. The on-disk format is as follows:
a file <filename>.zanj is a zip file containing:
- __zanj_meta__.json: a file containing zanj-specific metadata, including:
  - system information
  - installed packages
  - information about external files
- __zanj__.json: a file containing user-specified data
  - when an element is too big, it can be moved to an external file
    - .npy for numpy arrays or torch tensors
    - .jsonl for pandas dataframes or large sequences
  - list of external files stored in __zanj_meta__.json
  - "$ref" key will have value pointing to external file
  - __format__ key will detail an external format type
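The layout described above can be sketched with the standard library alone. This is a minimal illustration, not the ZANJ implementation: the helper names `write_zanj_like` and `read_zanj_like`, the `threshold` parameter, the `"externals"` metadata key, and the exact shape of the reference entry are all assumptions made for the example. Only the member names `__zanj__.json` and `__zanj_meta__.json` and the `"$ref"` and `__format__` keys come from the description. It handles only large lists (stored as `.jsonl`) so that it needs neither numpy nor pandas.

```python
import io
import json
import zipfile


def write_zanj_like(path, data, threshold=4):
    """Write a flat dict to a zip archive laid out like the description above.

    Hypothetical helper: lists longer than `threshold` are moved to an
    external .jsonl member and replaced by a small reference entry.
    """
    main = {}
    external = {}  # member name -> encoded payload
    for key, value in data.items():
        if isinstance(value, list) and len(value) > threshold:
            member = f"{key}.jsonl"
            text = "\n".join(json.dumps(item) for item in value)
            external[member] = text.encode("utf-8")
            # Assumed shape of a reference entry.
            main[key] = {"$ref": member, "__format__": "jsonl"}
        else:
            main[key] = value
    # Assumed metadata shape: just the list of external members.
    meta = {"externals": sorted(external)}
    with zipfile.ZipFile(path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("__zanj_meta__.json", json.dumps(meta, indent=2))
        zf.writestr("__zanj__.json", json.dumps(main, indent=2))
        for member, payload in external.items():
            zf.writestr(member, payload)


def read_zanj_like(path):
    """Load the main JSON and resolve every "$ref" entry from its member."""
    with zipfile.ZipFile(path) as zf:
        main = json.loads(zf.read("__zanj__.json"))
        for key, value in main.items():
            if isinstance(value, dict) and "$ref" in value:
                text = zf.read(value["$ref"]).decode("utf-8")
                main[key] = [json.loads(line) for line in text.splitlines()]
    return main


# Round trip through an in-memory buffer instead of a real file.
original = {"name": "run-1", "losses": [0.5, 0.4, 0.3, 0.2, 0.1]}
buffer = io.BytesIO()
write_zanj_like(buffer, original)
restored = read_zanj_like(buffer)
```

Because the main file stays plain JSON, the small values remain human readable after unzipping, while the large sequence lives in its own member.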
This library was originally a module in muutils.
Implementation
ZANJ: main class for saving and loading zanj files. It contains some configuration info about saving, such as:
- thresholds for how big an array/table has to be before moving to external file
- compression settings
- error modes
- handlers for serialization
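To make the four kinds of settings listed above concrete, here is a hedged sketch of how such a configuration object could be organised. It is not the signature of the `ZANJ` class: the class name `SaveConfigSketch`, every field name, the default values, and the three error-mode strings are hypothetical and chosen only for illustration.

```python
import warnings
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Literal


@dataclass
class SaveConfigSketch:
    """Hypothetical container for save-time settings (not the real ZANJ API)."""

    # Thresholds: element counts above which data moves to an external file.
    external_array_threshold: int = 256
    external_table_threshold: int = 256
    # Compression setting for the zip archive.
    compress: bool = True
    # Error mode: what to do with an object no handler can serialize.
    error_mode: Literal["except", "warn", "ignore"] = "except"
    # Handlers: map a type to a function returning JSON-compatible data.
    handlers: Dict[type, Callable[[Any], Any]] = field(default_factory=dict)

    def should_externalize(self, n_elements: int, kind: str) -> bool:
        """Return True when an array or table exceeds its threshold."""
        limit = (
            self.external_array_threshold
            if kind == "array"
            else self.external_table_threshold
        )
        return n_elements > limit

    def serialize(self, obj: Any) -> Any:
        """Serialize one object using the handlers and the error mode."""
        handler = self.handlers.get(type(obj))
        if handler is not None:
            return handler(obj)
        if obj is None or isinstance(obj, (bool, int, float, str)):
            return obj  # already JSON-compatible
        if self.error_mode == "except":
            raise TypeError(f"no handler for {type(obj).__name__}")
        if self.error_mode == "warn":
            warnings.warn(f"no handler for {type(obj).__name__}, using repr")
        return repr(obj)  # fallback for "warn" and "ignore"


config = SaveConfigSketch(
    external_array_threshold=10,
    handlers={complex: lambda z: [z.real, z.imag]},
)
encoded = config.serialize(1 + 2j)
```

Keeping these settings on one object means a save call only needs the object to store and a destination, with the policy decided once up front.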
Comparison to other formats
Format | Safe | Zero-copy | Lazy loading | No file size limit | Layout control | Flexibility | Bfloat16 |
---|---|---|---|---|---|---|---|
pickle (PyTorch) | ❌ | ❌ | ❌ | ✅ | ❌ | ✅ | ✅ |
H5 (Tensorflow) | ✅ | ❌ | ✅ | ✅ | ~ | ~ | ❌ |
HDF5 | ✅ | ? | ✅ | ✅ | ~ | ✅ | ❌ |
SavedModel (Tensorflow) | ✅ | ❌ | ❌ | ✅ | ✅ | ❌ | ✅ |
MsgPack (flax) | ✅ | ✅ | ❌ | ✅ | ❌ | ❌ | ✅ |
Protobuf (ONNX) | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ |
Cap'n'Proto | ✅ | ✅ | ~ | ✅ | ✅ | ~ | ❌ |
Numpy (npy,npz) | ✅ | ? | ? | ❌ | ✅ | ❌ | ❌ |
SafeTensors | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ |
exdir | ✅ | ? | ? | ? | ? | ✅ | ❌ |
ZANJ | ✅ | ? | ❌* | ✅ | ✅ | ✅ | ❌ |
- Safe: Can I use a randomly downloaded file and expect it not to run arbitrary code?
- Zero-copy: Does reading the file require more memory than the original file?
- Lazy loading: Can I inspect the file without loading everything? Can I load only some of the tensors in it without scanning the whole file (distributed setting)?
- Layout control: Lazy loading is not necessarily enough. If the information about tensors is spread out in your file, then even if it is lazily accessible you might have to access most of your file to read the available tensors (incurring many DISK -> RAM copies). Controlling the layout to keep fast access to single tensors is important.
- No file size limit: Is there a limit to the file size?
- Flexibility: Can I save custom code in the format and be able to use it later with zero extra code? (~ means we can store more than pure tensors, but no custom code)
- Bfloat16: Does the format support native bfloat16 (meaning no weird workarounds are necessary)? This is becoming increasingly important in the ML world.
(This table was stolen from safetensors)
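As a small illustration of the "inspect the file without loading everything" criterion, the sketch below uses only Python's `zipfile` module to list the members of a zip archive and read a single member. It shows a property of the zip container itself and makes no claim about what the ZANJ library does when loading. The helper names `list_members` and `read_member_json` and the demo archive contents are hypothetical.

```python
import io
import json
import zipfile


def list_members(archive):
    """Return {member name: uncompressed size} from the zip directory only."""
    with zipfile.ZipFile(archive) as zf:
        return {info.filename: info.file_size for info in zf.infolist()}


def read_member_json(archive, name):
    """Decompress and parse one JSON member, leaving the others untouched."""
    with zipfile.ZipFile(archive) as zf:
        return json.loads(zf.read(name))


# Build a small demo archive in memory: one tiny JSON file, one larger member.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("__zanj__.json", json.dumps({"name": "run-1"}))
    zf.writestr("big.jsonl", "0\n" * 500)

sizes = list_members(demo)
small = read_member_json(demo, "__zanj__.json")
```

Here `sizes` comes from the archive's directory entries, and `small` is obtained without decompressing `big.jsonl`.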