Icechunk


[!TIP] Icechunk 1.0 is released! Better API, more performance and stability


Icechunk is an open-source (Apache 2.0), transactional storage engine for tensor / ND-array data designed for use on cloud object storage. Icechunk works together with Zarr, augmenting the Zarr core data model with features that enhance performance, collaboration, and safety in a cloud-computing context.

Documentation and Resources

Crate Structure

The Rust workspace is organized into layered crates:

graph TD
    python[icechunk-python] --> core[icechunk]
    core --> arrow[icechunk-arrow-object-store]
    core --> s3[icechunk-s3 *optional*]
    core --> storage[icechunk-storage]
    core --> format[icechunk-format]
    arrow --> storage
    s3 --> storage
    storage --> types[icechunk-types]
    format --> types

Crate                         Description
icechunk-macros               Procedural macro helpers for tests and internal use
icechunk-types                Shared foundational types (Path, ETag, Move, error wrappers) used across all crates
icechunk-format               Binary format types and serialization (snapshots, manifests, transaction logs, repo info)
icechunk-storage              Storage trait definitions and common storage utilities
icechunk-arrow-object-store   Storage backend using Apache Arrow's object_store (in-memory, local, GCS, Azure, etc.)
icechunk-s3                   Native AWS S3 storage backend (optional feature)
icechunk                      Core storage engine: transactions, version control, repositories
icechunk-python               PyO3 bindings exposing the engine to Python

Icechunk Overview

Let's break down what "transactional storage engine for Zarr" actually means:

  • Zarr is an open source specification for the storage of multidimensional array (a.k.a. tensor) data. Zarr defines the metadata for describing arrays (shape, dtype, etc.) and the way these arrays are chunked, compressed, and converted to raw bytes for storage. Zarr can store its data in any key-value store. There are many different implementations of Zarr in different languages. Right now, Icechunk only supports Zarr Python. If you're interested in implementing Icechunk support, please open an issue so we can help you.
  • Storage engine - Icechunk exposes a key-value interface to Zarr and manages all of the actual I/O for getting, setting, and updating both metadata and chunk data in cloud object storage. Zarr libraries don't have to know exactly how Icechunk works under the hood in order to use it.
  • Transactional - The key improvement that Icechunk brings on top of regular Zarr is to provide consistent serializable isolation between transactions. This means that Icechunk data is safe to read and write in parallel from multiple uncoordinated processes. This allows Zarr to be used more like a database.

The core entity in Icechunk is a repository or repo. A repo is defined as a Zarr hierarchy containing one or more Arrays and Groups, and a repo functions as a self-contained Zarr Store. The most common scenario is for an Icechunk repo to contain a single Zarr group with multiple arrays, each corresponding to different physical variables but sharing common spatiotemporal coordinates. However, formally a repo can be any valid Zarr hierarchy, from a single Array to a deeply nested structure of Groups and Arrays. Users of Icechunk should aim to scope their repos only to related arrays and groups that require consistent transactional updates.

Icechunk supports the following core requirements:

  1. Object storage - the format is designed around the consistency features and performance characteristics available in modern cloud object storage. No external database or catalog is required to maintain a repo. (It also works with file storage.)
  2. Serializable isolation - Reads are isolated from concurrent writes and always use a committed snapshot of a repo. Writes are committed atomically and are never partially visible. No locks are required for reading.
  3. Time travel - Previous snapshots of a repo remain accessible after new ones have been written.
  4. Data version control - Repos support both tags (immutable references to snapshots) and branches (mutable references to snapshots).
  5. Chunk sharding - Chunk storage is decoupled from specific file names. Multiple chunks can be packed into a single object (sharding).
  6. Chunk references - Zarr-compatible chunks within other file formats (e.g. HDF5, NetCDF) can be referenced.
  7. Schema evolution - Arrays and Groups can be added, renamed, and removed from the hierarchy with minimal overhead.

Key Concepts

Groups, Arrays, and Chunks

Icechunk is designed around the Zarr data model, widely used in scientific computing, data science, and AI / ML. (The Zarr high-level data model is effectively the same as HDF5.) The core data structure in this data model is the array. Arrays have two fundamental properties:

  • shape - a tuple of integers which specify the dimensions of each axis of the array. A 10 x 10 square array would have shape (10, 10)
  • data type - a specification of what type of data is found in each element, e.g. integer, float, etc. Different data types have different precision (e.g. 16-bit integer, 64-bit float, etc.)

In Zarr / Icechunk, arrays are split into chunks. A chunk is the minimum unit of data that must be read / written from storage, and thus choices about chunking have strong implications for performance. Zarr leaves this completely up to the user. Chunk shape should be chosen based on the anticipated data access pattern for each array. An Icechunk array is not bounded by an individual file and is effectively unlimited in size.
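
To make the performance implication concrete, here is a minimal sketch (plain Python, not part of the Zarr or Icechunk API) that counts how many chunks a read has to fetch under two different chunk shapes. The helper names `chunk_grid_shape` and `chunks_touched` are hypothetical and exist only for this illustration.

```python
import math
from itertools import product


def chunk_grid_shape(shape, chunks):
    """Number of chunks along each axis (the last chunk may be partial)."""
    return tuple(math.ceil(s / c) for s, c in zip(shape, chunks))


def chunks_touched(selection, chunks):
    """Chunk coordinates overlapped by a non-empty selection.

    `selection` holds one (start, stop) pair per axis, with stop exclusive.
    """
    per_axis = [
        range(start // c, (stop - 1) // c + 1)
        for (start, stop), c in zip(selection, chunks)
    ]
    return list(product(*per_axis))


shape = (1000, 1000)
row_chunks = (1, 1000)      # each chunk holds one full row
square_chunks = (100, 100)  # each chunk holds a 100 x 100 tile

# Reading one column: all 1000 rows, a single column index.
column = ((0, 1000), (5, 6))
row_reads = len(chunks_touched(column, row_chunks))
square_reads = len(chunks_touched(column, square_chunks))
print(chunk_grid_shape(shape, square_chunks), row_reads, square_reads)
```

With row-shaped chunks the column read touches 1000 chunks, while square tiles reduce it to 10; a row read would show the opposite trade-off, which is why chunk shape should follow the expected access pattern.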

For further organization of data, Icechunk supports groups within a single repo. Groups are like folders that contain multiple arrays and/or other groups. Groups enable data to be organized into hierarchical trees. A common usage pattern is to store multiple arrays in a group representing a NetCDF-style dataset.

Arbitrary JSON-style key-value metadata can be attached to both arrays and groups.

Snapshots

Every update to an Icechunk store creates a new snapshot with a unique ID. Icechunk users must organize their updates into groups of related operations called transactions. For example, appending a new time slice to multiple arrays should be done as a single transaction, comprising the following steps:

  1. Update the array metadata to resize the array to accommodate the new elements.
  2. Write new chunks for each array in the group.

While the transaction is in progress, none of these changes will be visible to other users of the store. Once the transaction is committed, a new snapshot is generated. Readers can only see and use committed snapshots.
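
The visibility rule can be sketched with a toy model in plain Python. This is not the Icechunk API: `ToyRepo` and `ToySession` are hypothetical names, snapshots are simple dictionaries, and conflict detection is deliberately left out. The sketch only shows that staged changes stay private until commit and that a commit yields a new snapshot ID.

```python
import uuid


class ToyRepo:
    """Holds committed snapshots; each snapshot maps keys to values."""

    def __init__(self):
        self.snapshots = {"initial": {}}
        self.head = "initial"

    def read(self, key):
        # Readers only ever see the latest committed snapshot.
        return self.snapshots[self.head].get(key)


class ToySession:
    """Collects the changes of one transaction before they are committed."""

    def __init__(self, repo):
        self.repo = repo
        self.base = repo.head
        self.staged = {}

    def set(self, key, value):
        self.staged[key] = value  # visible only inside this session

    def commit(self):
        # Build the new snapshot from the base snapshot plus staged changes.
        new_state = {**self.repo.snapshots[self.base], **self.staged}
        snapshot_id = uuid.uuid4().hex
        self.repo.snapshots[snapshot_id] = new_state
        self.repo.head = snapshot_id
        return snapshot_id


repo = ToyRepo()
session = ToySession(repo)
for name in ("temperature", "pressure"):
    session.set(f"{name}/shape", (11, 10))       # step 1: resize metadata
    session.set(f"{name}/c/10/0", b"new slice")  # step 2: write new chunk

before = repo.read("temperature/shape")  # still None: nothing committed yet
snapshot_id = session.commit()
after = repo.read("temperature/shape")   # (11, 10) once committed
```

Both arrays change in the same commit, so a reader sees either neither update or both.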

Branches and Tags

Snapshots occur in a specific linear (i.e. serializable) order within a branch. A branch is a mutable reference to a snapshot--a pointer that maps the branch name to a snapshot ID. The default branch is main. Every commit to the main branch updates this reference. Icechunk's design protects against the race condition in which two uncoordinated sessions attempt to update the branch at the same time; only one can succeed.
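
One common way to obtain the "only one can succeed" behavior is a compare-and-swap on the branch pointer. The sketch below is a single-threaded toy in plain Python with hypothetical names (`ToyRefs`, `BranchConflict`); it makes no claim about how Icechunk implements the check, and a real store would need the comparison and the update to happen atomically.

```python
class BranchConflict(Exception):
    """Raised when a branch moved since the session started."""


class ToyRefs:
    def __init__(self, initial_snapshot):
        self.branches = {"main": initial_snapshot}

    def update_branch(self, name, expected, new):
        """Move the branch only if it still points at `expected`."""
        current = self.branches[name]
        if current != expected:
            raise BranchConflict(f"{name} is at {current}, expected {expected}")
        self.branches[name] = new


refs = ToyRefs("snap-0")

# Two uncoordinated sessions start from the same branch tip.
base_a = refs.branches["main"]
base_b = refs.branches["main"]

refs.update_branch("main", expected=base_a, new="snap-a")  # first writer wins
try:
    refs.update_branch("main", expected=base_b, new="snap-b")
    outcome = "both succeeded"
except BranchConflict:
    outcome = "second writer rejected"

print(outcome, refs.branches["main"])
```

The rejected writer has lost nothing but the race: it can inspect the new branch tip and decide whether to retry on top of it.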

Icechunk also defines tags--immutable references to a snapshot. Tags are appropriate for publishing specific releases of a repository or for any application which requires a persistent, immutable identifier to the store state.

Chunk References

Chunk references are "pointers" to chunks that exist in other files--HDF5, NetCDF, GRIB, etc. Icechunk can store these references alongside native Zarr chunks as "virtual datasets". You can then update these virtual datasets incrementally (overwrite chunks, change metadata, etc.) without touching the underlying files.
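
As a rough illustration of the idea, a reference can be modeled as a file location plus a byte range. The `ChunkRef` class, its field names, and the temporary file standing in for an existing HDF5 / NetCDF file are all assumptions made for this sketch; they are not the Icechunk API or on-disk format.

```python
import os
import tempfile
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkRef:
    """Hypothetical pointer to a chunk stored inside another file."""
    location: str
    offset: int
    length: int


def read_chunk(ref):
    """Fetch only the referenced byte range from the underlying file."""
    with open(ref.location, "rb") as f:
        f.seek(ref.offset)
        return f.read(ref.length)


# Stand-in for a pre-existing file: a 6-byte header and two 8-byte chunks.
with tempfile.NamedTemporaryFile(delete=False, suffix=".bin") as f:
    f.write(b"HEADER" + b"chunk-00" + b"chunk-01")
    path = f.name

refs = {
    (0, 0): ChunkRef(path, offset=6, length=8),
    (0, 1): ChunkRef(path, offset=14, length=8),
}

first = read_chunk(refs[(0, 0)])
second = read_chunk(refs[(0, 1)])
os.remove(path)  # clean up the stand-in file
print(first, second)
```

Replacing an entry in `refs` changes what a chunk resolves to while the original file stays untouched, which is the property the paragraph above describes.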

How Does It Work?

!!! Note: For a more detailed explanation, have a look at the Icechunk spec.

Zarr itself works by storing both metadata and chunk data in an abstract store according to a specified system of "keys". For example, a 2D Zarr array called myarray, within a group called mygroup, would generate the following keys:

mygroup/zarr.json
mygroup/myarray/zarr.json
mygroup/myarray/c/0/0
mygroup/myarray/c/0/1

In standard Zarr stores, these keys map directly to filenames in a filesystem or object keys in an object storage system. When writing data, a Zarr implementation will create these keys and populate them with data. When modifying existing arrays or groups, a Zarr implementation will potentially overwrite existing keys with new data.
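
The key layout listed above can be reproduced with a few lines of Python. The helper `zarr_keys` is hypothetical, and the chunk grid of 1 x 2 is an assumption chosen so that the output matches the four keys shown; the sketch also assumes the `c/` prefix and `/` separator seen in that listing.

```python
from itertools import product


def zarr_keys(group, array, grid_shape):
    """List the metadata and chunk keys for one array inside one group."""
    keys = [f"{group}/zarr.json", f"{group}/{array}/zarr.json"]
    for coords in product(*(range(n) for n in grid_shape)):
        chunk_part = "/".join(str(i) for i in coords)
        keys.append(f"{group}/{array}/c/{chunk_part}")
    return keys


keys = zarr_keys("mygroup", "myarray", (1, 2))
for key in keys:
    print(key)
```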

This is generally not a problem, as long as there is only one person or process coordinating access to the data. However, when multiple uncoordinated readers and writers attempt to access the same Zarr data at the same time, various consistency problems emerge. These consistency problems can occur in both file storage and object storage; they are particularly severe in a cloud setting where Zarr is being used as an active store for data that are frequently changed while also being read.

With Icechunk, we keep the same core Zarr data model, but add a layer of indirection between the Zarr keys and the on-disk storage. The Icechunk library translates between the Zarr keys and the actual on-disk data given the particular context of the user's state. Icechunk defines a series of interconnected metadata and data files that together enable efficient isolated reading and writing of metadata and chunks. Once written, these files are immutable. Icechunk keeps track of every single chunk explicitly in a "chunk manifest".
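
The layer of indirection can be sketched as a lookup table per snapshot. In this toy model (plain Python, not the Icechunk format), a manifest maps each Zarr key to the ID of an immutable stored object. The function names and the use of a content hash as the object ID are assumptions made purely for illustration.

```python
import hashlib

objects = {}    # immutable blobs: object ID -> bytes
snapshots = {}  # snapshot ID -> manifest (Zarr key -> object ID)


def put_object(data):
    """Store a blob under a new ID; existing objects are never overwritten."""
    object_id = hashlib.sha256(data).hexdigest()[:12]
    objects.setdefault(object_id, data)
    return object_id


def commit(parent, changes):
    """Create a snapshot whose manifest is the parent's plus `changes`."""
    manifest = dict(snapshots.get(parent, {}))
    for key, data in changes.items():
        manifest[key] = put_object(data)
    snapshot_id = f"snap-{len(snapshots)}"
    snapshots[snapshot_id] = manifest
    return snapshot_id


def get(snapshot_id, key):
    """Translate a Zarr key to stored bytes in the context of one snapshot."""
    return objects[snapshots[snapshot_id][key]]


v1 = commit(None, {"mygroup/myarray/c/0/0": b"old"})
v2 = commit(v1, {"mygroup/myarray/c/0/0": b"new"})

print(get(v1, "mygroup/myarray/c/0/0"), get(v2, "mygroup/myarray/c/0/0"))
```

Because the "overwrite" in `v2` only adds a new object and a new manifest, a reader holding `v1` keeps resolving the same key to the old bytes.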

flowchart TD
    zarr-python[Zarr Library] <-- key / value--> icechunk[Icechunk Library]
    icechunk <-- data / metadata files --> storage[(Object Storage)]
