save and load complex objects to disk without pickling

ZANJ

Installation

PyPi: https://pypi.org/project/zanj/

pip install zanj

Overview

The ZANJ format is meant to be a way of saving arbitrary objects to disk that is flexible, keeps configuration and data together, and is human readable. It is loosely inspired by HDF5 and the derived exdir format, and the implementation is similar to npz files. The on-disk format is as follows:

A file <filename>.zanj is a zip file containing:

  • __zanj_meta__.json: a file containing zanj-specific metadata including:
    • system information
    • installed packages
    • information about external files
  • __zanj__.json: a file containing user-specified data
    • when an element is too big, it can be moved to an external file
      • .npy for numpy arrays or torch tensors
      • .jsonl for pandas dataframes or large sequences
    • the list of external files is stored in __zanj_meta__.json
    • a "$ref" key holds a value pointing to the external file
    • a __format__ key describes the external format type
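
The layout above can be explored with nothing but the standard library, since a .zanj file is a zip archive. The following is a minimal sketch, not the library's own reader: it builds a toy archive in memory and reads it back. Only the file names __zanj__.json and __zanj_meta__.json and the "$ref" and __format__ keys come from the description above; the "externals" key, the "npy" format string, the member name weights.npy, and the helper list_zanj_members are hypothetical.

```python
import io
import json
import zipfile

# Build a toy archive in memory that mimics the layout described above.
# All values are made up for illustration; real ZANJ metadata differs.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    # "externals" is a hypothetical key name for the list of external files
    zf.writestr("__zanj_meta__.json", json.dumps({"externals": ["weights.npy"]}))
    zf.writestr(
        "__zanj__.json",
        json.dumps(
            {
                "name": "demo",
                # stub that stands in for a large array moved to its own file
                "weights": {"$ref": "weights.npy", "__format__": "npy"},
            }
        ),
    )
    zf.writestr("weights.npy", b"placeholder bytes, not a real array")


def list_zanj_members(source):
    """Return the sorted member names and the parsed user data of a ZANJ-style zip."""
    with zipfile.ZipFile(source) as zf:
        names = sorted(zf.namelist())
        data = json.loads(zf.read("__zanj__.json"))
    return names, data


buf.seek(0)
names, data = list_zanj_members(buf)
print(names)
print(data["weights"]["$ref"])
```

Because the user data is plain JSON, the references to external files can be inspected without loading the arrays themselves.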

This library was originally a module in muutils.

Implementation

ZANJ

The main class for saving and loading ZANJ files.

It holds configuration for saving, such as:

  • thresholds for how big an array/table has to be before moving to external file
  • compression settings
  • error modes
  • handlers for serialization
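
To illustrate what a size threshold means in practice, here is a minimal sketch of the idea, assuming a single element-count threshold applied to Python lists. ToySaveConfig and split_externals are hypothetical names invented for this example and are not part of the zanj API; the default of 4 elements and the "jsonl" format string are likewise made up.

```python
from dataclasses import dataclass


@dataclass
class ToySaveConfig:
    """Hypothetical stand-in for the kind of settings the ZANJ class holds."""

    external_threshold: int = 4  # element count; made-up default
    compress: bool = True  # unused here, shown only as an example setting


def split_externals(data: dict, cfg: ToySaveConfig):
    """Replace lists longer than the threshold with a "$ref" stub.

    Returns the inline data and a mapping of external file name to content.
    """
    inline, externals = {}, {}
    for key, value in data.items():
        if isinstance(value, list) and len(value) > cfg.external_threshold:
            fname = f"{key}.jsonl"
            externals[fname] = value
            inline[key] = {"$ref": fname, "__format__": "jsonl"}
        else:
            inline[key] = value
    return inline, externals


inline, externals = split_externals(
    {"lr": 0.1, "losses": [1, 2, 3, 4, 5]}, ToySaveConfig()
)
print(inline)
print(externals)
```

Small values stay inline in the JSON, while anything over the threshold is swapped for a reference, which keeps the main file readable.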

Comparison to other formats

Format Safe Zero-copy Lazy loading No file size limit Layout control Flexibility Bfloat16
pickle (PyTorch)
H5 (Tensorflow) ~ ~
HDF5 ? ~
SavedModel (Tensorflow)
MsgPack (flax)
Protobuf (ONNX)
Cap'n'Proto ~ ~
Numpy (npy,npz) ? ?
SafeTensors
exdir ? ? ? ?
ZANJ ? ❌*
  • Safe: Can I use a randomly downloaded file and expect it not to run arbitrary code?
  • Zero-copy: Does reading the file require more memory than the original file?
  • Lazy loading: Can I inspect the file without loading everything? Can I load only some of the tensors in it without scanning the whole file (distributed setting)?
  • Layout control: Lazy loading is not necessarily enough. If the information about tensors is spread out in the file, then even if it is lazily accessible you might have to access most of the file to read the available tensors (incurring many DISK -> RAM copies). Controlling the layout to keep fast access to single tensors is important.
  • No file size limit: Is there a limit to the file size?
  • Flexibility: Can I save custom code in the format and use it later with zero extra code? (~ means we can store more than pure tensors, but no custom code)
  • Bfloat16: Does the format support native bfloat16 (meaning no weird workarounds are necessary)? This is becoming increasingly important in the ML world.

(This table was stolen from safetensors)
