Easy serialization of Python objects with multiprocessing support

Epic serialize - Easy Python objects serialization

What is it?

The epic-serialize Python library provides an easy and efficient way to serialize multiple Python objects into a single file, and then to deserialize them with random-access-like indexing and slicing.

Here's a quick example:

from epic.serialize import SpocReader, SpocWriter

with SpocWriter("myfile.spoc") as spocw:
    spocw.write(map(str, range(10)))
    spocw.write("--done--")

assert list(SpocReader("myfile.spoc")[-3:]) == ["8", "9", "--done--"]

The SPOC format

Spoc (Serialized Python Objects Chunks) is a multi-serialization file format for easily serializing multiple Python objects into a single file that can later be read and sliced quickly.

Consider the case where a dict with 100M items is to be serialized to disk. With a plain pickle.dump, for example, the entire serialization is done in memory first and only then written to the file, which may be slow and may cause out-of-memory errors. It is also sometimes desirable to serialize objects one by one (for example, while traversing an iterator).
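The one-by-one idea can be illustrated with the standard library alone. The following is a minimal sketch, not a description of how Spoc works internally: each item is pickled separately into the same stream and read back until the stream ends. The helper names dump_stream and load_stream are hypothetical, and an in-memory buffer stands in for a real file.

```python
import io
import pickle
from typing import Any, BinaryIO, Iterable, Iterator


def dump_stream(items: Iterable[Any], fileobj: BinaryIO) -> int:
    """Pickle each item separately so only one item is serialized at a time."""
    count = 0
    for item in items:
        pickle.dump(item, fileobj)
        count += 1
    return count


def load_stream(fileobj: BinaryIO) -> Iterator[Any]:
    """Yield the pickled objects back one by one until the stream is exhausted."""
    while True:
        try:
            yield pickle.load(fileobj)
        except EOFError:
            return


buffer = io.BytesIO()  # stands in for a file opened in binary mode
# The generator is consumed lazily; the full dict never exists on the writer side.
written = dump_stream(((k, k * k) for k in range(5)), buffer)
buffer.seek(0)
restored = dict(load_stream(buffer))
```

This plain approach only supports sequential reading; it offers neither compression nor access by index or slice.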

In addition to serializing objects (in "chunks", as the name implies), Spoc also applies compression before writing to disk, thus potentially saving a lot of space.
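As a rough illustration of serializing in compressed chunks, the sketch below groups items into fixed-size lists, pickles each list, and compresses it with zlib. It uses only the standard library; the function names, the chunk size, and the choice of one pickled list per chunk are assumptions made for the example and are not taken from the Spoc implementation.

```python
import pickle
import zlib
from itertools import islice
from typing import Any, Iterable, Iterator, List


def to_compressed_chunks(items: Iterable[Any], chunk_size: int) -> Iterator[bytes]:
    """Group items into lists of up to chunk_size, then pickle and compress each list."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield zlib.compress(pickle.dumps(chunk))


def from_compressed_chunk(blob: bytes) -> List[Any]:
    """Invert to_compressed_chunks for a single chunk."""
    return pickle.loads(zlib.decompress(blob))


items = ["record-%d" % i for i in range(1000)]
chunks = list(to_compressed_chunks(items, chunk_size=100))

# Compare the size of one uncompressed pickle with the total size of the chunks.
raw_size = len(pickle.dumps(items))
compressed_size = sum(len(blob) for blob in chunks)

# Only the chunk that holds the wanted items needs to be decompressed.
third_chunk = from_compressed_chunk(chunks[2])
```

Because each chunk is decoded independently, reading items 200 to 299 here touches a single chunk rather than the whole data set.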

The two main classes are SpocWriter, which can either create a new file or append to an existing one, and SpocReader, which allows reading, iteration and random access via index or slice.

SpocReader slices may also be passed to other processes (they are picklable), which is useful when dividing a job over multiple slices of the same Spoc file.
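One way to divide such a job is to compute contiguous slice bounds up front, one per worker. The sketch below shows only that bookkeeping step with a plain list standing in for the reader; split_into_slices is a hypothetical helper, and the total number of items is assumed to be known in advance.

```python
from typing import List


def split_into_slices(total: int, parts: int) -> List[slice]:
    """Return `parts` contiguous slices that together cover range(total)."""
    if parts <= 0:
        raise ValueError("parts must be positive")
    base, extra = divmod(total, parts)
    slices = []
    start = 0
    for i in range(parts):
        # The first `extra` slices take one additional item each.
        stop = start + base + (1 if i < extra else 0)
        slices.append(slice(start, stop))
        start = stop
    return slices


data = list(range(10))  # stand-in for an indexable reader
job_slices = split_into_slices(len(data), 3)
pieces = [data[s] for s in job_slices]
```

With a Spoc file, each slice would presumably be applied to a SpocReader in the same way as in the slicing examples, and the resulting object handed to a worker process.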

The SpocReader/SpocWriter classes fully support the 'with' statement, and this is the preferred syntax.

Currently, the following serialization schemes are supported:

  • pickle (builtin)
  • dill (requires installation)

Currently, the following compression algorithms are supported:

  • zlib (builtin)
  • bz2 (builtin)
  • gzip (builtin)
  • lzma (builtin)
  • lz4 (requires installation)

These parameters can be passed (by name or class) to SpocWriter, but they are only used when the file is new or overwritten, not when it is appended to. SpocReader, and SpocWriter when appending to an existing file, read the serializer and compressor from the file's header.
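To show why a header lets the reader work without being told the codec, here is a minimal standard-library sketch. The header layout (a single JSON line) and the names COMPRESSORS, write_with_header and read_with_header are hypothetical; the actual Spoc header format is not described here and may differ entirely.

```python
import bz2
import gzip
import io
import json
import lzma
import pickle
import zlib
from typing import Any, BinaryIO

# Name -> (compress, decompress) for the builtin algorithms listed above.
COMPRESSORS = {
    "zlib": (zlib.compress, zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "gzip": (gzip.compress, gzip.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}


def write_with_header(fileobj: BinaryIO, obj: Any, compression: str) -> None:
    """Write a one-line JSON header naming the codec, then the compressed pickle."""
    compress, _ = COMPRESSORS[compression]
    header = json.dumps({"serialization": "pickle", "compression": compression})
    fileobj.write(header.encode("utf-8") + b"\n")
    fileobj.write(compress(pickle.dumps(obj)))


def read_with_header(fileobj: BinaryIO) -> Any:
    """Read the header first and let it choose the decompressor."""
    header = json.loads(fileobj.readline().decode("utf-8"))
    _, decompress = COMPRESSORS[header["compression"]]
    return pickle.loads(decompress(fileobj.read()))


buffer = io.BytesIO()  # stands in for a file opened in binary mode
write_with_header(buffer, {"answer": 42}, compression="lzma")
buffer.seek(0)
restored = read_with_header(buffer)  # no codec argument needed on the read side
```

The same reasoning explains why the parameters are ignored when appending: the codec is already fixed by what the header says.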

Usage examples

Create new file with explicit serialization/compression:

with SpocWriter(filename, serialization="pickle", compression="gzip") as spocw:
    for item in items:
        spocw.write(item)

Open an existing file for appending:

with SpocWriter(filename, append=True) as spocw:
    spocw.write(more_items)

Read all items:

with SpocReader(filename) as spocr:
    read_items = list(spocr)

Read sliced items:

sliced_items = list(SpocReader(filename)[10:-1])
sliced_item = SpocReader(filename)[1000]
