Fast binary serialization and storage for ASE Atoms.

asebytes

Storage-agnostic, lazy-loading data layer with pluggable backends (LMDB, Zarr, HDF5/H5MD, HuggingFace Datasets, ASE file formats). Three IO tiers — raw bytes, structured dicts, and ASE Atoms — each with full sync and async APIs plus pandas-style column views.

pip install "asebytes[lmdb]"      # LMDB backend (recommended)
pip install "asebytes[zarr]"      # Zarr backend (fast compression)
pip install "asebytes[h5md]"      # HDF5/H5MD backend
pip install "asebytes[hf]"        # HuggingFace Datasets backend
pip install "asebytes[mongodb]"   # MongoDB backend (shared remote storage)
# In-memory backend (MemoryObjectBackend) is built in; no extras needed

Quick Start

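from asebytes import ASEIO

# Sync (atoms_list is a list of ase.Atoms; new_atoms is a single ase.Atoms)
db = ASEIO("data.lmdb")
db.extend(atoms_list)   # append frames
db[0] = new_atoms       # overwrite the first frame
atoms = db[0]           # read it back as ase.Atoms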

# Async
import asyncio
from asebytes import AsyncASEIO

async def main():
    db = AsyncASEIO("data.lmdb")
    await db.extend(atoms_list)
    atoms = await db[0]
    async for atoms in db:
        process(atoms)

asyncio.run(main())

String paths auto-detect the backend from the file extension. Pass a backend instance directly for full control.

Three IO Layers

| Class | Async class | Row type | Use case |
| --- | --- | --- | --- |
| ASEIO | AsyncASEIO | ase.Atoms | Atomistic simulations |
| ObjectIO | AsyncObjectIO | dict[str, Any] | Structured data without ASE |
| BlobIO | AsyncBlobIO | dict[bytes, bytes] | Raw bytes, zero deserialization |

ASEIO — Atoms objects

from asebytes import ASEIO, AsyncASEIO

# Sync
db = ASEIO("atoms.lmdb")
db.extend(atoms_list)
db.update(0, calc={"energy": -10.5})
atoms = db[0]                  # ase.Atoms

# Async
db = AsyncASEIO("atoms.lmdb")
await db.extend(atoms_list)
atoms = await db[0]            # ase.Atoms
await db.update(0, calc={"energy": -10.5})

ObjectIO — plain dicts

from asebytes import ObjectIO, AsyncObjectIO

# Sync
db = ObjectIO("records.lmdb")
db.extend([
    {"arrays.numbers": [29], "calc.energy": -3.5},
    {"arrays.numbers": [26], "calc.energy": -8.3},
])
row = db[0]  # {"arrays.numbers": [29], "calc.energy": -3.5}

# Async
db = AsyncObjectIO("records.lmdb")
await db.extend([{"arrays.numbers": [29], "calc.energy": -3.5}])
row = await db[0]

BlobIO — raw bytes

from asebytes import BlobIO, AsyncBlobIO

# Sync
db = BlobIO("blobs.lmdb")
db.extend([{b"key": b"value"}, {b"key": b"other"}])
row = db[0]                    # {b"key": b"value"}

# Async
db = AsyncBlobIO("blobs.lmdb")
await db.extend([{b"key": b"value"}])
row = await db[0]

Lazy Views

Indexing with slices, lists, or strings returns lazy views — nothing is loaded until you iterate or materialize.
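
The idea behind a lazy view can be sketched in plain Python. This is not asebytes code: LazyRowView and CountingStore are hypothetical names, and the sketch only assumes that a view stores the selected indices and defers reads until iteration.

```python
from typing import Any, Iterator, Sequence


class CountingStore:
    """Toy backend that counts how many rows have been read."""

    def __init__(self, rows) -> None:
        self._rows = list(rows)
        self.reads = 0

    def __len__(self) -> int:
        return len(self._rows)

    def __getitem__(self, i: int) -> dict[str, Any]:
        self.reads += 1
        return self._rows[i]


class LazyRowView:
    """Hypothetical row view: keeps indices only and reads rows on demand."""

    def __init__(self, backend: CountingStore, indices: Sequence[int]) -> None:
        self._backend = backend
        self._indices = indices

    def __len__(self) -> int:
        return len(self._indices)

    def __iter__(self) -> Iterator[dict[str, Any]]:
        # Rows are fetched one at a time, only when the caller iterates.
        for i in self._indices:
            yield self._backend[i]

    def to_list(self) -> list[dict[str, Any]]:
        # Materialize every selected row in memory.
        return list(self)


store = CountingStore({"calc.energy": -float(i)} for i in range(100))
view = LazyRowView(store, range(5, 10))   # building the view reads nothing
reads_before = store.reads
rows = view.to_list()                     # the five reads happen here
reads_after = store.reads
```

Creating the view costs no reads; the backend is touched only once to_list() or a for loop consumes it.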

Row views

# Sync
view = db[5:100]               # RowView (lazy)
view = db[[0, 42, 99]]         # RowView from index list
for row in view:
    process(row)

# Async
view = db[5:100]               # AsyncRowView (lazy)
async for row in view:
    process(row)
rows = await view.to_list()    # materialize to list

Column views

# Sync
energies = db["calc.energy"].to_list()
cols = db[["calc.energy", "calc.forces"]].to_dict()
# → {"calc.energy": [...], "calc.forces": [...]}

# Async
energies = await db["calc.energy"].to_list()
cols = await db[["calc.energy", "calc.forces"]].to_dict()
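
To make the two result shapes concrete, here is a plain-Python sketch of what to_list() and to_dict() return for column selections. The helper names column_to_list and columns_to_dict are hypothetical and operate on an ordinary list of dicts rather than on an asebytes database.

```python
from typing import Any

rows = [
    {"calc.energy": -3.5, "calc.forces": [[0.0, 0.0, 0.1]]},
    {"calc.energy": -8.3, "calc.forces": [[0.0, 0.2, 0.0]]},
]


def column_to_list(rows: list[dict[str, Any]], key: str) -> list[Any]:
    """Single column: one value per row."""
    return [row[key] for row in rows]


def columns_to_dict(rows: list[dict[str, Any]], keys: list[str]) -> dict[str, list[Any]]:
    """Several columns: each key maps to a list with one entry per row."""
    return {key: column_to_list(rows, key) for key in keys}


energies = column_to_list(rows, "calc.energy")
cols = columns_to_dict(rows, ["calc.energy", "calc.forces"])
```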

Chaining rows + columns

# Sync
db[0:500]["calc.energy"].to_list()

# Async
await db[0:500]["calc.energy"].to_list()

Materialization

# Sync
view.to_list()                 # load all into memory
view.to_dict()                 # column-oriented dict (ColumnView only)
for batch in view.chunked(1000):  # iterate in chunks
    process(batch)

# Async
await view.to_list()
await view.to_dict()
async for batch in view.chunked(1000):
    process(batch)
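
Chunked iteration keeps memory bounded by holding only one batch at a time. A minimal standalone sketch of the technique, assuming nothing about how asebytes implements chunked() internally:

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def chunked(rows: Iterable[T], size: int) -> Iterator[list[T]]:
    """Yield successive lists of at most `size` items from `rows`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(rows)
    # islice pulls up to `size` items; an empty list means the input is exhausted.
    while batch := list(islice(it, size)):
        yield batch


batches = list(chunked(range(7), 3))
```

The last batch may be shorter than the requested size, so consumers should not assume a fixed batch length.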

Write-back

Views support in-place mutations when backed by a writable backend.

# Sync
db[0:10].set(new_rows)         # overwrite rows
db[0:10].update({"info.tag": "train"})  # partial update (applies to all rows)
db[0:10].delete()              # delete rows (contiguous only)

# Async
await db[0:10].set(new_rows)
await db[0:10].update({"info.tag": "train"})
await db[0:10].delete()

Backends

Backend is auto-detected from the file extension:

| Extension | Backend | Install extra |
| --- | --- | --- |
| *.lmdb | LMDBObjectBackend / LMDBBlobBackend | asebytes[lmdb] |
| *.zarr | ZarrBackend | asebytes[zarr] |
| *.h5 / *.h5md | H5MDBackend | asebytes[h5md] |
| *.xyz / *.extxyz / *.traj | ASEReadOnlyBackend | (none) |

URI schemes for remote/streaming sources:

| Scheme | Source | Example |
| --- | --- | --- |
| memory:// | In-memory (no persistence) | ObjectIO("memory://") |
| mongodb:// | MongoDB | ObjectIO("mongodb://host:port/db") |
| redis:// | Redis | ObjectIO("redis://host:port") |
| hf:// | HuggingFace Datasets | ASEIO("hf://user/dataset", ...) |
| colabfit:// | ColabFit datasets | ASEIO("colabfit://mlearn_Cu_train", ...) |
| optimade:// | OPTIMADE datasets | ASEIO("optimade://LeMaterial/LeMat-Bulk", ...) |
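
One way such a lookup could work is sketched below. The mappings are copied from the two tables above, but resolve_backend and its return labels are hypothetical; the real registry may resolve targets differently.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Data taken from the tables above; the lookup logic itself is an assumption.
EXTENSION_BACKENDS = {
    ".lmdb": "LMDBObjectBackend / LMDBBlobBackend",
    ".zarr": "ZarrBackend",
    ".h5": "H5MDBackend",
    ".h5md": "H5MDBackend",
    ".xyz": "ASEReadOnlyBackend",
    ".extxyz": "ASEReadOnlyBackend",
    ".traj": "ASEReadOnlyBackend",
}
URI_SCHEMES = {"memory", "mongodb", "redis", "hf", "colabfit", "optimade"}


def resolve_backend(target: str) -> str:
    """Return a label for the backend that would handle `target`."""
    scheme = urlparse(target).scheme
    if scheme in URI_SCHEMES:
        # URI schemes take priority over file extensions.
        return f"{scheme}:// source"
    suffix = PurePosixPath(target).suffix.lower()
    try:
        return EXTENSION_BACKENDS[suffix]
    except KeyError:
        raise ValueError(f"no backend registered for {target!r}") from None


print(resolve_backend("trajectory.h5"))
print(resolve_backend("colabfit://mlearn_Cu_train"))
```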

Groups

All backends support a unified group parameter to organize data into independent collections within the same storage location. Groups are useful for storing multiple datasets, splits, or configurations in a single file/database.

# LMDB: separate subdirectories per group
db1 = ASEIO("data.lmdb", group="train")
db2 = ASEIO("data.lmdb", group="test")

# H5MD: /particles/{group}/ in the HDF5 structure
db = ASEIO("multi.h5", group="solvent")

# MongoDB: each group = a collection in the database
db = ObjectIO("mongodb://host:port/mydb", group="train")

# Zarr: separate subdirectories per group
db = ASEIO("data.zarr", group="conformers")

# Redis: key prefix = group
db = ObjectIO("redis://host:port", group="mydata")

# Memory: independent storage per group
db = ObjectIO("memory://", group="temp")

List available groups with list_groups():

from asebytes import ASEIO, H5MDBackend, LMDBObjectBackend

# Static method on backends
groups = H5MDBackend.list_groups("multi.h5")
groups = LMDBObjectBackend.list_groups("data.lmdb")

# Or via facades
groups = ASEIO.list_groups("data.lmdb")

When group is omitted, the default is backend-specific: "default" for most backends, "atoms" for H5MD. Each backend stores groups using its native mechanism:

| Backend | Group storage |
| --- | --- |
| LMDB | Subdirectory: {path}/{group}/ |
| H5MD | HDF5 group: /particles/{group}/ |
| Zarr | Subdirectory: {path}/{group}/ |
| MongoDB | Collection: group in database |
| Redis | Key prefix: {group}: |
| Memory | Internal dict keyed by group |
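
As an illustration of the key-prefix strategy, the sketch below isolates two groups inside one shared mapping. PrefixedStore is a hypothetical class built on a plain dict; it is not the Redis backend, and it assumes group names contain no ":" character.

```python
class PrefixedStore:
    """Hypothetical view of one group inside a shared key-value mapping."""

    def __init__(self, shared: dict[str, object], group: str = "default") -> None:
        self._shared = shared
        self._prefix = f"{group}:"

    def put(self, key: str, value: object) -> None:
        self._shared[self._prefix + key] = value

    def get(self, key: str) -> object:
        return self._shared[self._prefix + key]

    @staticmethod
    def list_groups(shared: dict[str, object]) -> list[str]:
        # A group name is everything before the first ":" of a stored key.
        return sorted({key.split(":", 1)[0] for key in shared})


shared: dict[str, object] = {}
train = PrefixedStore(shared, group="train")
test = PrefixedStore(shared, group="test")
train.put("0", {"calc.energy": -3.5})
test.put("0", {"calc.energy": -8.3})
groups = PrefixedStore.list_groups(shared)
```

Both groups use row key "0" without colliding, because the stored keys are "train:0" and "test:0".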

Read-Through Cache

For slow or remote sources, cache_to creates a persistent local cache. The first pass reads from the source and fills the cache; later reads are served from the cache.

db = ASEIO("colabfit://dataset", split="train", cache_to="cache.lmdb")
for atoms in db:    # epoch 1: reads source, populates cache
    train(atoms)
for atoms in db:    # epoch 2+: reads from local cache
    train(atoms)

cache_to is available on ASEIO only. It accepts a file path (the cache backend is created automatically) or any ReadWriteBackend instance. There is no invalidation; delete the cache file to reset.
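
The read-through pattern itself is small enough to sketch. ReadThroughCache below is a hypothetical stand-in that uses a dict where asebytes would use a persistent backend; it shows the miss-then-fill behaviour and, like cache_to, never invalidates entries.

```python
from typing import Any, Callable


class ReadThroughCache:
    """Hypothetical read-through wrapper: check the cache, else read the source."""

    def __init__(self, read_source: Callable[[int], Any], cache: dict[int, Any]) -> None:
        self._read_source = read_source
        self._cache = cache          # stands in for a persistent local store
        self.source_reads = 0

    def __getitem__(self, index: int) -> Any:
        if index not in self._cache:
            # Cache miss: fetch from the slow source and remember the result.
            self._cache[index] = self._read_source(index)
            self.source_reads += 1
        return self._cache[index]


remote = [{"calc.energy": -1.0}, {"calc.energy": -2.0}, {"calc.energy": -3.0}]
db = ReadThroughCache(remote.__getitem__, cache={})

epoch1 = [db[i] for i in range(len(remote))]   # reads the source, fills the cache
epoch2 = [db[i] for i in range(len(remote))]   # served entirely from the cache
```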

HuggingFace Datasets

Stream or download datasets from the HuggingFace Hub via URI schemes.

# ColabFit (auto-selects column mapping, streams by default)
db = ASEIO("colabfit://mlearn_Cu_train", split="train")

# OPTIMADE (e.g. LeMaterial)
db = ASEIO("optimade://LeMaterial/LeMat-Bulk", split="train", name="compatible_pbe")

# Generic HuggingFace (requires explicit column mapping)
from asebytes import ColumnMapping
mapping = ColumnMapping(
    positions="pos", numbers="nums",
    calc={"energy": "total_energy"},
)
db = ASEIO("hf://user/dataset", mapping=mapping, split="train")

# Downloaded mode for faster access
db = ASEIO("colabfit://dataset", split="train", streaming=False)

Zarr / HDF5 / H5MD

Zarr

Flat layout with Blosc/LZ4 compression. Compact files and fast reads. Supports variable particle counts via NaN padding.

db = ASEIO("trajectory.zarr")
db.extend(atoms_list)

# Custom compression
from asebytes import ZarrBackend
db = ASEIO(ZarrBackend("data.zarr", compressor="zstd", clevel=9))
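
NaN padding lets frames with different atom counts share one rectangular array. The standard-library sketch below shows the idea for positions only; pad_frames and unpad_frame are hypothetical helpers, and how the Zarr backend pads non-float arrays is not covered here.

```python
import math

Frame = list[list[float]]  # one [x, y, z] triple per atom


def pad_frames(frames: list[Frame]) -> list[Frame]:
    """Pad every frame with NaN rows up to the largest atom count."""
    width = max(len(frame) for frame in frames)
    return [
        frame + [[math.nan] * 3 for _ in range(width - len(frame))]
        for frame in frames
    ]


def unpad_frame(frame: Frame) -> Frame:
    """Drop padding rows, i.e. rows whose coordinates are all NaN."""
    return [row for row in frame if not all(math.isnan(x) for x in row)]


water = [[0.0, 0.0, 0.0], [0.96, 0.0, 0.0], [-0.24, 0.93, 0.0]]
h2 = [[0.0, 0.0, 0.0], [0.74, 0.0, 0.0]]
padded = pad_frames([water, h2])
restored = unpad_frame(padded[1])
```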

HDF5 / H5MD

H5MD-standard files with variable particle counts, per-frame PBC, and bond connectivity.

db = ASEIO("trajectory.h5", author_name="Jane Doe", compression="gzip")
db.extend(atoms_list)

# Multi-group files
from asebytes import H5MDBackend
groups = H5MDBackend.list_groups("multi.h5")
db = ASEIO("multi.h5", group="solvent")

MongoDB

Shared remote storage for multi-client access. Requires a running MongoDB instance (>= 4.4).

# Sync
db = ObjectIO("mongodb://user:pass@host:27017/mydb", group="train")
db.extend([{"energy": -3.5, "positions": [[0, 0, 0]]}])
row = db[0]

# Async — auto-dispatches to native AsyncMongoObjectBackend
db = AsyncObjectIO("mongodb://user:pass@host:27017/mydb", group="test")
row = await db[0]

Uses a sort-key array for O(1) positional access, with server-side field filtering via MongoDB projections — requesting specific keys (e.g. db.get(0, keys=["energy"])) only transfers those fields over the network.
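
A rough model of that access pattern, using plain Python containers instead of MongoDB: the sort-key list maps a row position to a document key in constant time, and the keys argument plays the role of a projection. The data, key names, and the get function are all illustrative assumptions.

```python
from typing import Any, Optional

# Documents keyed by an opaque sort key, as a document store might hold them.
documents: dict[str, dict[str, Any]] = {
    "k-a": {"energy": -3.5, "positions": [[0, 0, 0]]},
    "k-b": {"energy": -8.3, "positions": [[0, 0, 1]]},
    "k-c": {"energy": -1.2, "positions": [[1, 0, 0]]},
}
# Position i in this list holds the key of row i, so lookup by index is O(1).
sort_keys: list[str] = ["k-b", "k-a", "k-c"]


def get(index: int, keys: Optional[list[str]] = None) -> dict[str, Any]:
    """Fetch the row at `index`, optionally restricted to `keys`."""
    doc = documents[sort_keys[index]]
    if keys is None:
        return dict(doc)
    # Mimics a projection: only the requested fields are returned.
    return {key: doc[key] for key in keys if key in doc}


energy_only = get(0, keys=["energy"])
full_row = get(1)
```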

In-Memory Backend

MemoryObjectBackend stores data in a plain Python list — no persistence, no dependencies. Useful for testing, ephemeral storage, and prototyping.

from asebytes import ObjectIO, ASEIO

db = ObjectIO("memory://")
db.extend([{"a": 1}, {"a": 2}])
assert len(db) == 2

# Works with all facades
db = ASEIO("memory://")
db.extend(atoms_list)

Key Convention

All data follows a flat namespace:

| Prefix | Content | Examples |
| --- | --- | --- |
| arrays.* | Per-atom arrays | arrays.positions, arrays.numbers, arrays.forces |
| calc.* | Calculator results | calc.energy, calc.stress |
| info.* | Frame metadata | info.smiles, info.label |
| (top-level) | | cell, pbc, constraints |

from asebytes import atoms_to_dict, dict_to_atoms

d = atoms_to_dict(atoms)   # Atoms → flat dict
atoms = dict_to_atoms(d)   # flat dict → Atoms
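
To show how the dotted keys relate to nested data, here is a small round-trip sketch. flatten and unflatten are hypothetical helpers, not atoms_to_dict or dict_to_atoms, and the nested input shape is an assumption chosen for illustration.

```python
from typing import Any

PREFIXES = ("arrays", "calc", "info")


def flatten(nested: dict[str, Any]) -> dict[str, Any]:
    """Turn {"calc": {"energy": E}} into {"calc.energy": E}; keep top-level keys."""
    flat: dict[str, Any] = {}
    for key, value in nested.items():
        if key in PREFIXES:
            for sub, item in value.items():
                flat[f"{key}.{sub}"] = item
        else:
            flat[key] = value
    return flat


def unflatten(flat: dict[str, Any]) -> dict[str, Any]:
    """Inverse of flatten(): regroup dotted keys under their prefix."""
    nested: dict[str, Any] = {}
    for key, value in flat.items():
        prefix, dot, rest = key.partition(".")
        if dot and prefix in PREFIXES:
            nested.setdefault(prefix, {})[rest] = value
        else:
            nested[key] = value
    return nested


frame = {
    "arrays": {"numbers": [29], "positions": [[0.0, 0.0, 0.0]]},
    "calc": {"energy": -3.5},
    "info": {"label": "Cu atom"},
    "pbc": [False, False, False],
}
flat = flatten(frame)
```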

Facade API Reference

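All three tiers share the same method names. The async facades expose the same operations as awaitables; the few that are spelled differently (item assignment, deletion, and length) are shown in the table.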

| Operation | BlobIO / ObjectIO / ASEIO | AsyncBlobIO / AsyncObjectIO / AsyncASEIO |
| --- | --- | --- |
| Read one row | db[i] | await db[i] |
| Read with key filter | db.get(i, keys=[...]) | await db.get(i, keys=[...]) |
| List keys at index | db.keys(i) | await db.keys(i) |
| Append rows | n = db.extend([...]) | n = await db.extend([...]) |
| Insert at position | db.insert(i, row) | await db.insert(i, row) |
| Overwrite row | db[i] = row | await db[i].set(row) |
| Partial update | db.update(i, {...}) | await db.update(i, {...}) |
| Delete row | del db[i] | await db[i].delete() |
| Drop columns | db.drop(keys=[...]) | await db.drop(keys=[...]) |
| Pre-allocate slots | db.reserve(n) | await db.reserve(n) |
| Clear all rows | db.clear() | await db.clear() |
| Remove container | db.remove() | await db.remove() |
| Length | len(db) | await db.len() |
| Iterate | for row in db: | async for row in db: |
| Context manager | with db: | async with db: |

ASEIO / AsyncASEIO additionally support keyword-style updates:

db.update(i, info={"tag": "done"}, calc={"energy": -10.5})

Backend Adapters

Adapters convert between blob-level (dict[bytes, bytes]) and object-level (dict[str, Any]) backends:

| Adapter | Wraps | Exposes |
| --- | --- | --- |
| BlobToObjectReadAdapter | ReadBackend[bytes, bytes] | ReadBackend[str, Any] |
| BlobToObjectReadWriteAdapter | ReadWriteBackend[bytes, bytes] | ReadWriteBackend[str, Any] |
| ObjectToBlobReadAdapter | ReadBackend[str, Any] | ReadBackend[bytes, bytes] |
| ObjectToBlobReadWriteAdapter | ReadWriteBackend[str, Any] | ReadWriteBackend[bytes, bytes] |

Async variants (AsyncBlobToObjectReadAdapter, etc.) mirror the same pattern for async backends.

from asebytes import BlobToObjectReadWriteAdapter, ObjectIO
from asebytes import LMDBBlobBackend

# Use a blob backend through the ObjectIO facade
blob_backend = LMDBBlobBackend("data.lmdb")
object_backend = BlobToObjectReadWriteAdapter(blob_backend)
db = ObjectIO(object_backend)

The registry uses these adapters automatically — e.g., BlobIO("data.lmdb") wraps the object backend as a blob backend via ObjectToBlobReadWriteAdapter when no native blob backend is registered.
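
The adapter pattern can be sketched without asebytes. In the example below, DictBlobBackend and JsonBlobToObjectAdapter are hypothetical classes, and JSON is used as the wire format purely for illustration; the serialization asebytes actually uses is not assumed here.

```python
import json
from typing import Any


class DictBlobBackend:
    """Toy blob-level backend: rows are dict[bytes, bytes]."""

    def __init__(self) -> None:
        self._rows: list[dict[bytes, bytes]] = []

    def __len__(self) -> int:
        return len(self._rows)

    def get(self, index: int) -> dict[bytes, bytes]:
        return self._rows[index]

    def extend(self, rows: list[dict[bytes, bytes]]) -> None:
        self._rows.extend(rows)


class JsonBlobToObjectAdapter:
    """Hypothetical adapter exposing dict[str, Any] rows over a blob backend."""

    def __init__(self, blob_backend: DictBlobBackend) -> None:
        self._blobs = blob_backend

    def __len__(self) -> int:
        return len(self._blobs)

    def extend(self, rows: list[dict[str, Any]]) -> None:
        # Encode keys as UTF-8 and values as JSON before handing them down.
        self._blobs.extend(
            [{k.encode(): json.dumps(v).encode() for k, v in row.items()} for row in rows]
        )

    def get(self, index: int) -> dict[str, Any]:
        # Decode keys and values on the way back up.
        return {k.decode(): json.loads(v) for k, v in self._blobs.get(index).items()}


blobs = DictBlobBackend()
objects = JsonBlobToObjectAdapter(blobs)
objects.extend([{"arrays.numbers": [29], "calc.energy": -3.5}])
row = objects.get(0)
```

The same storage is visible at both levels: blobs.get(0) returns encoded bytes, while objects.get(0) returns the decoded dict.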

Custom Backends

Implement ReadBackend[K, V] for read-only access or ReadWriteBackend[K, V] for full read-write:

from asebytes import ObjectIO, ReadBackend

class MyBackend(ReadBackend[str, object]):
    def __len__(self) -> int: ...
    def get(self, index: int, keys: list[str] | None = None) -> dict[str, object] | None: ...

db = ObjectIO(MyBackend())

For async backends, subclass AsyncReadBackend[K, V] / AsyncReadWriteBackend[K, V], or wrap an existing sync backend:

from asebytes import SyncToAsyncAdapter, AsyncObjectIO

async_backend = SyncToAsyncAdapter(MyBackend())
db = AsyncObjectIO(async_backend)
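
A common way to wrap blocking code for async callers is to run each call in a worker thread. The sketch below shows that technique with asyncio.to_thread; ListBackend and ThreadedAsyncAdapter are hypothetical, and whether SyncToAsyncAdapter works this way internally is an assumption.

```python
import asyncio
from typing import Any, Optional


class ListBackend:
    """Toy synchronous read-only backend with the two methods shown above."""

    def __init__(self, rows: list[dict[str, Any]]) -> None:
        self._rows = rows

    def __len__(self) -> int:
        return len(self._rows)

    def get(self, index: int, keys: Optional[list[str]] = None) -> dict[str, Any]:
        row = self._rows[index]
        return dict(row) if keys is None else {k: row[k] for k in keys if k in row}


class ThreadedAsyncAdapter:
    """Hypothetical wrapper that runs each blocking call in a worker thread."""

    def __init__(self, backend: ListBackend) -> None:
        self._backend = backend

    async def length(self) -> int:
        return await asyncio.to_thread(len, self._backend)

    async def get(self, index: int, keys: Optional[list[str]] = None) -> dict[str, Any]:
        return await asyncio.to_thread(self._backend.get, index, keys)


async def demo() -> tuple[int, dict[str, Any]]:
    backend = ThreadedAsyncAdapter(ListBackend([{"a": 1, "b": 2}, {"a": 3, "b": 4}]))
    return await backend.length(), await backend.get(1, keys=["a"])


result = asyncio.run(demo())
```

Offloading to threads keeps the event loop responsive during blocking I/O, but it does not make the underlying backend thread-safe; that remains the backend's responsibility.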

Benchmarks

Each benchmark uses 1000 frames from each of two datasets: ethanol conformers (small molecules, fixed size) and LeMat-Traj (periodic structures, variable atom counts). All frames include energy, forces, and stress. Results are compared against aselmdb, znh5md, extxyz, and SQLite. Plots use a log scale; lower is better.

# LeMat-Traj benchmark data: first 1000 frames of the "train" split
from asebytes import ASEIO

lemat = list(ASEIO("optimade://LeMaterial/LeMat-Traj", split="train", name="compatible_pbe")[:1000])

Note: HDF5 performance is heavily influenced by compression and chunking settings. Both asebytes H5MD and znh5md use gzip compression by default, which reduces file size at the cost of read/write speed. The Zarr backend uses Blosc/LZ4 compression, which achieves compact file sizes with faster decompression than gzip.

Write

[Figures: Write Trajectory, Write Single]

Read

[Figures: Read Trajectory, Read Single]

Random Access

[Figures: Random Trajectory, Random Single]

Property Access

[Figures: Read Positions Trajectory, Read Positions Single]

Column Access

[Figure: Column Energy]

Update

[Figure: Update Property Trajectory]
