Fast binary serialization and storage for ASE Atoms.
asebytes
Storage-agnostic, lazy-loading data layer with pluggable backends (LMDB, Zarr, HDF5/H5MD, HuggingFace Datasets, ASE file formats). Three IO tiers — raw bytes, structured dicts, and ASE Atoms — each with full sync and async APIs plus pandas-style column views.
pip install asebytes[lmdb] # LMDB backend (recommended)
pip install asebytes[zarr] # Zarr backend (fast compression)
pip install asebytes[h5md] # HDF5/H5MD backend
pip install asebytes[hf] # HuggingFace Datasets backend
pip install asebytes[mongodb] # MongoDB backend (shared remote storage)
# In-memory backend (MemoryObjectBackend) is built-in — no extras needed
Quick Start
from asebytes import ASEIO
# Sync
db = ASEIO("data.lmdb")
db.extend(atoms_list)
db[0] = new_atoms
atoms = db[0]
# Async
import asyncio
from asebytes import AsyncASEIO
async def main():
    db = AsyncASEIO("data.lmdb")
    await db.extend(atoms_list)
    atoms = await db[0]
    async for atoms in db:
        process(atoms)
asyncio.run(main())
String paths auto-detect the backend from the file extension. Pass a backend instance directly for full control.
Three IO Layers
| Class | Async class | Row type | Use case |
|---|---|---|---|
| ASEIO | AsyncASEIO | ase.Atoms | Atomistic simulations |
| ObjectIO | AsyncObjectIO | dict[str, Any] | Structured data without ASE |
| BlobIO | AsyncBlobIO | dict[bytes, bytes] | Raw bytes, zero deserialization |
ASEIO — Atoms objects
from asebytes import ASEIO, AsyncASEIO
# Sync
db = ASEIO("atoms.lmdb")
db.extend(atoms_list)
db.update(0, calc={"energy": -10.5})
atoms = db[0] # ase.Atoms
# Async
db = AsyncASEIO("atoms.lmdb")
await db.extend(atoms_list)
atoms = await db[0] # ase.Atoms
await db.update(0, calc={"energy": -10.5})
ObjectIO — plain dicts
from asebytes import ObjectIO, AsyncObjectIO
# Sync
db = ObjectIO("records.lmdb")
db.extend([
    {"arrays.numbers": [29], "calc.energy": -3.5},
    {"arrays.numbers": [26], "calc.energy": -8.3},
])
row = db[0] # {"arrays.numbers": [29], "calc.energy": -3.5}
# Async
db = AsyncObjectIO("records.lmdb")
await db.extend([{"arrays.numbers": [29], "calc.energy": -3.5}])
row = await db[0]
BlobIO — raw bytes
from asebytes import BlobIO, AsyncBlobIO
# Sync
db = BlobIO("blobs.lmdb")
db.extend([{b"key": b"value"}, {b"key": b"other"}])
row = db[0] # {b"key": b"value"}
# Async
db = AsyncBlobIO("blobs.lmdb")
await db.extend([{b"key": b"value"}])
row = await db[0]
JSON
Serialize ase.Atoms through stdlib json using two encoder/decoder classes. The wire format is a compact base64-of-msgpack envelope — the same binary path used by asebytes.encode / asebytes.decode.
import json
import ase
import asebytes
import molify
frames: list[ase.Atoms] = molify.smiles2conformers("CCO", numConfs=3)
s = json.dumps(frames, cls=asebytes.AtomsEncoder)
recovered = json.loads(s, cls=asebytes.AtomsDecoder) # list[ase.Atoms]
AtomsEncoder is a json.JSONEncoder subclass — override default() in your own subclass to handle additional types.
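The subclassing pattern described above can be sketched with the standard library alone. This example does not use asebytes: `ExtendedEncoder`, `decode_complex`, and the `__complex__` tag are hypothetical names chosen for illustration. With asebytes you would presumably subclass `asebytes.AtomsEncoder` instead of `json.JSONEncoder`, so that the call to `super().default()` still handles `ase.Atoms`.

```python
import json


class ExtendedEncoder(json.JSONEncoder):
    """Hypothetical encoder that adds support for complex numbers."""

    def default(self, o):
        # Handle the additional type first ...
        if isinstance(o, complex):
            return {"__complex__": [o.real, o.imag]}
        # ... and defer everything else to the parent class.
        return super().default(o)


def decode_complex(d):
    # json.loads calls object_hook for every decoded JSON object.
    if "__complex__" in d:
        real, imag = d["__complex__"]
        return complex(real, imag)
    return d


encoded = json.dumps({"z": 1 + 2j}, cls=ExtendedEncoder)
decoded = json.loads(encoded, object_hook=decode_complex)
```

Deferring to `super().default()` for unknown types keeps the parent's behavior intact, including its `TypeError` for objects nobody can serialize.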
Lazy Views
Indexing with slices, lists, or strings returns lazy views — nothing is loaded until you iterate or materialize.
Row views
# Sync
view = db[5:100] # RowView (lazy)
view = db[[0, 42, 99]] # RowView from index list
for row in view:
    process(row)
# Async
view = db[5:100] # AsyncRowView (lazy)
async for row in view:
    process(row)
rows = await view.to_list() # materialize to list
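To make "lazy" concrete, here is a minimal stdlib-only sketch of a row view that stores indices and touches the data source only on iteration. `LazyRowView` and `CountingSource` are hypothetical stand-ins, not the asebytes classes, and the real views may work differently.

```python
from typing import Any, Iterator, Sequence


class CountingSource:
    """Toy data source that counts how many rows were actually read."""

    def __init__(self, rows: list):
        self._rows = rows
        self.reads = 0

    def __len__(self) -> int:
        return len(self._rows)

    def __getitem__(self, index: int) -> Any:
        self.reads += 1
        return self._rows[index]


class LazyRowView:
    """Hypothetical lazy view: keeps indices, loads rows on demand."""

    def __init__(self, source: Any, indices: Sequence[int]):
        self._source = source
        self._indices = list(indices)

    def __len__(self) -> int:
        return len(self._indices)

    def __iter__(self) -> Iterator[Any]:
        # Rows are fetched one at a time, only while iterating.
        for i in self._indices:
            yield self._source[i]

    def to_list(self) -> list:
        return list(self)


source = CountingSource([{"id": i} for i in range(100)])
view = LazyRowView(source, range(5, 8))  # selects rows 5, 6, 7
reads_before = source.reads              # still 0: nothing loaded yet
rows = view.to_list()                    # now the three rows are read
```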
Column views
# Sync
energies = db["calc.energy"].to_list()
cols = db[["calc.energy", "calc.forces"]].to_dict()
# → {"calc.energy": [...], "calc.forces": [...]}
# Async
energies = await db["calc.energy"].to_list()
cols = await db[["calc.energy", "calc.forces"]].to_dict()
Chaining rows + columns
# Sync
db[0:500]["calc.energy"].to_list()
# Async
await db[0:500]["calc.energy"].to_list()
Materialization
# Sync
view.to_list() # load all into memory
view.to_dict() # column-oriented dict (ColumnView only)
for batch in view.chunked(1000): # iterate in chunks
    process(batch)
# Async
await view.to_list()
await view.to_dict()
async for batch in view.chunked(1000):
    process(batch)
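Chunked iteration bounds memory use by materializing one batch at a time. The helper below is a generic sketch built on `itertools.islice`. It is a free function with a hypothetical signature, not the `chunked` method of the asebytes views.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def chunked(rows: Iterable[T], size: int) -> Iterator[list]:
    """Yield consecutive lists of at most `size` items from `rows`."""
    if size < 1:
        raise ValueError("size must be >= 1")
    it = iter(rows)
    # islice pulls at most `size` items; an empty list ends the loop.
    while batch := list(islice(it, size)):
        yield batch


batches = list(chunked(range(7), 3))
```

The last batch may be shorter than `size`, so consumers should not assume a fixed batch length.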
Write-back
Views support in-place mutations when backed by a writable backend.
# Sync
db[0:10].set(new_rows) # overwrite rows
db[0:10].update({"info.tag": "train"}) # partial update (applies to all rows)
db[0:10].delete() # delete rows (contiguous only)
# Async
await db[0:10].set(new_rows)
await db[0:10].update({"info.tag": "train"})
await db[0:10].delete()
Backends
Backend is auto-detected from the file extension:
| Extension | Backend | Install extra |
|---|---|---|
| *.lmdb | LMDBObjectBackend / LMDBBlobBackend | asebytes[lmdb] |
| *.zarr | ZarrBackend | asebytes[zarr] |
| *.h5 / *.h5md | H5MDBackend | asebytes[h5md] |
| *.xyz / *.extxyz / *.traj | ASEReadOnlyBackend | (none) |
URI schemes for remote/streaming sources:
| Scheme | Source | Example |
|---|---|---|
| memory:// | In-memory (no persistence) | ObjectIO("memory://") |
| mongodb:// | MongoDB | ObjectIO("mongodb://host:port/db") |
| redis:// | Redis | ObjectIO("redis://host:port") |
| hf:// | HuggingFace Datasets | ASEIO("hf://user/dataset", ...) |
| colabfit:// | ColabFit datasets | ASEIO("colabfit://mlearn_Cu_train", ...) |
| optimade:// | OPTIMADE datasets | ASEIO("optimade://LeMaterial/LeMat-Bulk", ...) |
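The dispatch rule in the two tables can be sketched as a small lookup: check for a URI scheme first, then fall back to the file extension. The labels come from the tables above, but `detect_backend` and both dictionaries are hypothetical. The asebytes registry itself is not shown in this document and may behave differently.

```python
from pathlib import PurePath

# File extension -> backend label (from the extension table).
EXTENSIONS = {
    ".lmdb": "LMDB",
    ".zarr": "Zarr",
    ".h5": "H5MD",
    ".h5md": "H5MD",
    ".xyz": "ASE read-only",
    ".extxyz": "ASE read-only",
    ".traj": "ASE read-only",
}

# URI scheme -> source label (from the scheme table).
SCHEMES = {
    "memory": "In-memory",
    "mongodb": "MongoDB",
    "redis": "Redis",
    "hf": "HuggingFace Datasets",
    "colabfit": "ColabFit",
    "optimade": "OPTIMADE",
}


def detect_backend(target: str) -> str:
    """Return a backend label for a path or URI, or raise ValueError."""
    scheme, sep, _ = target.partition("://")
    if sep:  # the target looks like a URI
        if scheme in SCHEMES:
            return SCHEMES[scheme]
        raise ValueError(f"unknown URI scheme: {scheme!r}")
    suffix = PurePath(target).suffix.lower()
    if suffix in EXTENSIONS:
        return EXTENSIONS[suffix]
    raise ValueError(f"no backend registered for {target!r}")
```

Matching the extension case-insensitively is a choice made for this sketch only.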
Groups
All backends support a unified group parameter to organize data into independent collections within the same storage location. Groups are useful for storing multiple datasets, splits, or configurations in a single file/database.
# LMDB: separate subdirectories per group
db1 = ASEIO("data.lmdb", group="train")
db2 = ASEIO("data.lmdb", group="test")
# H5MD: /particles/{group}/ in the HDF5 structure
db = ASEIO("multi.h5", group="solvent")
# MongoDB: each group = a collection in the database
db = ObjectIO("mongodb://host:port/mydb", group="train")
# Zarr: separate subdirectories per group
db = ASEIO("data.zarr", group="conformers")
# Redis: key prefix = group
db = ObjectIO("redis://host:port", group="mydata")
# Memory: independent storage per group
db = ObjectIO("memory://", group="temp")
List available groups with list_groups():
from asebytes import ASEIO, H5MDBackend, LMDBObjectBackend
# Static method on backends
groups = H5MDBackend.list_groups("multi.h5")
groups = LMDBObjectBackend.list_groups("data.lmdb")
# Or via facades
groups = ASEIO.list_groups("data.lmdb")
When no group is specified, the default is backend-specific: "default" for most backends, "atoms" for H5MD. Each backend stores groups using its native mechanism:
| Backend | Group storage |
|---|---|
| LMDB | Subdirectory: {path}/{group}/ |
| H5MD | HDF5 group: /particles/{group}/ |
| Zarr | Subdirectory: {path}/{group}/ |
| MongoDB | Collection: group in database |
| Redis | Key prefix: {group}: |
| Memory | Internal dict keyed by group |
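The key-prefix strategy listed for Redis can be illustrated with a plain dictionary standing in for the shared store. `PrefixedStore` and its methods are hypothetical names for this sketch. It shows only how a `{group}:` prefix keeps collections independent, not how asebytes lays out its keys.

```python
class PrefixedStore:
    """Hypothetical group view over one shared key-value mapping."""

    def __init__(self, shared: dict, group: str = "default"):
        self._shared = shared
        self._prefix = f"{group}:"

    def put(self, index: int, row: dict) -> None:
        self._shared[f"{self._prefix}{index}"] = row

    def get(self, index: int) -> dict:
        return self._shared[f"{self._prefix}{index}"]

    def __len__(self) -> int:
        # Count only the keys that belong to this group.
        return sum(1 for key in self._shared if key.startswith(self._prefix))

    @staticmethod
    def list_groups(shared: dict) -> list:
        # The group name is everything before the first colon.
        return sorted({key.split(":", 1)[0] for key in shared})


shared = {}
train = PrefixedStore(shared, group="train")
test = PrefixedStore(shared, group="test")
train.put(0, {"calc.energy": -3.5})
train.put(1, {"calc.energy": -8.3})
test.put(0, {"calc.energy": -1.0})
```

Both views write into the same mapping, yet index 0 in `train` and index 0 in `test` never collide.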
Read-Through Cache
For slow or remote sources, cache_to creates a persistent local cache. First pass reads from source and fills the cache; subsequent reads are served from cache.
db = ASEIO("colabfit://dataset", split="train", cache_to="cache.lmdb")
for atoms in db: # epoch 1: reads source, populates cache
    train(atoms)
for atoms in db: # epoch 2+: reads from local cache
    train(atoms)
cache_to is available on ASEIO only. Accepts a file path (auto-creates backend) or any ReadWriteBackend instance. No invalidation — delete the cache file to reset.
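The read-through behavior described above can be pictured with a short stdlib-only sketch. `ReadThroughCache` is a hypothetical class that uses a dictionary as the cache. The real `cache_to` persists to a backend such as LMDB, which this sketch does not attempt.

```python
class ReadThroughCache:
    """Hypothetical wrapper: serve from cache, fall back to the source."""

    def __init__(self, source, cache=None):
        self._source = source
        self._cache = {} if cache is None else cache
        self.source_reads = 0  # counts reads that reached the slow source

    def __len__(self) -> int:
        return len(self._source)

    def __getitem__(self, index: int):
        if index not in self._cache:
            # Cache miss: read from the source and remember the result.
            self._cache[index] = self._source[index]
            self.source_reads += 1
        return self._cache[index]

    def __iter__(self):
        for i in range(len(self)):
            yield self[i]


db = ReadThroughCache(["frame-0", "frame-1", "frame-2"])
epoch1 = list(db)                 # fills the cache from the source
reads_after_epoch1 = db.source_reads
epoch2 = list(db)                 # served entirely from the cache
```

As in the description, nothing here invalidates entries: a changed source row would go unnoticed until the cache is discarded.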
HuggingFace Datasets
Stream or download datasets from the HuggingFace Hub via URI schemes.
# ColabFit (auto-selects column mapping, streams by default)
db = ASEIO("colabfit://mlearn_Cu_train", split="train")
# OPTIMADE (e.g. LeMaterial)
db = ASEIO("optimade://LeMaterial/LeMat-Bulk", split="train", name="compatible_pbe")
# Generic HuggingFace (requires explicit column mapping)
from asebytes import ColumnMapping
mapping = ColumnMapping(
    positions="pos", numbers="nums",
    calc={"energy": "total_energy"},
)
db = ASEIO("hf://user/dataset", mapping=mapping, split="train")
# Downloaded mode for faster access
db = ASEIO("colabfit://dataset", split="train", streaming=False)
Zarr / HDF5 / H5MD
Zarr
Flat layout with Blosc/LZ4 compression. Compact files and fast reads. Supports variable particle counts via NaN padding.
db = ASEIO("trajectory.zarr")
db.extend(atoms_list)
# Custom compression
from asebytes import ZarrBackend
db = ASEIO(ZarrBackend("data.zarr", compressor="zstd", clevel=9))
HDF5 / H5MD
H5MD-standard files with variable particle counts, per-frame PBC, and bond connectivity.
db = ASEIO("trajectory.h5", author_name="Jane Doe", compression="gzip")
db.extend(atoms_list)
# Multi-group files
from asebytes import H5MDBackend
groups = H5MDBackend.list_groups("multi.h5")
db = ASEIO("multi.h5", group="solvent")
MongoDB
Shared remote storage for multi-client access. Requires a running MongoDB instance (>= 4.4).
# Sync
db = ObjectIO("mongodb://user:pass@host:27017/mydb", group="train")
db.extend([{"energy": -3.5, "positions": [[0, 0, 0]]}])
row = db[0]
# Async — auto-dispatches to native AsyncMongoObjectBackend
db = AsyncObjectIO("mongodb://user:pass@host:27017/mydb", group="test")
row = await db[0]
Uses a sort-key array for O(1) positional access, with server-side field filtering via MongoDB projections — requesting specific keys (e.g. db.get(0, keys=["energy"])) only transfers those fields over the network.
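One way to picture a sort-key array with field filtering is the in-memory sketch below. It is an assumption-heavy illustration: `SortKeyIndex` is a hypothetical class, a dictionary stands in for the MongoDB collection, and the dict comprehension in `get` mimics what a server-side projection would do. The actual storage schema of the MongoDB backend is not described in this document.

```python
class SortKeyIndex:
    """Hypothetical positional index: position -> sort key -> document."""

    def __init__(self):
        self._sort_keys = []  # list position i holds the key of row i
        self._docs = {}       # sort key -> stored document
        self._next_key = 0

    def append(self, row: dict) -> None:
        key = self._next_key
        self._next_key += 1
        self._sort_keys.append(key)
        self._docs[key] = dict(row)

    def delete(self, index: int) -> None:
        # Later rows shift down by one; their documents are untouched.
        key = self._sort_keys.pop(index)
        del self._docs[key]

    def get(self, index: int, keys=None) -> dict:
        # One list lookup plus one dict lookup, independent of row count.
        doc = self._docs[self._sort_keys[index]]
        if keys is None:
            return dict(doc)
        # Stand-in for a projection: return only the requested fields.
        return {k: doc[k] for k in keys if k in doc}


index = SortKeyIndex()
index.append({"energy": -3.5, "positions": [[0, 0, 0]]})
index.append({"energy": -8.3, "positions": [[1, 0, 0]]})
index.append({"energy": -1.0, "positions": [[2, 0, 0]]})
index.delete(0)
row = index.get(0, keys=["energy"])
```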
In-Memory Backend
MemoryObjectBackend stores data in a plain Python list — no persistence, no dependencies. Useful for testing, ephemeral storage, and prototyping.
from asebytes import ObjectIO, ASEIO
db = ObjectIO("memory://")
db.extend([{"a": 1}, {"a": 2}])
assert len(db) == 2
# Works with all facades
db = ASEIO("memory://")
db.extend(atoms_list)
Key Convention
All data follows a flat namespace:
| Prefix | Content | Examples |
|---|---|---|
| arrays.* | Per-atom arrays | arrays.positions, arrays.numbers, arrays.forces |
| calc.* | Calculator results | calc.energy, calc.stress |
| info.* | Frame metadata | info.smiles, info.label |
| (top-level) | cell, pbc, constraints | |
from asebytes import atoms_to_dict, dict_to_atoms
d = atoms_to_dict(atoms) # Atoms → flat dict
atoms = dict_to_atoms(d) # flat dict → Atoms
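The flat namespace can be illustrated without ASE by converting between a nested dictionary and dotted keys. `flatten` and `unflatten` are hypothetical helpers written for this sketch. They are not `atoms_to_dict` / `dict_to_atoms` and operate on plain dictionaries only.

```python
PREFIXES = ("arrays", "calc", "info")


def flatten(nested: dict) -> dict:
    """Turn {"calc": {"energy": x}} into {"calc.energy": x}."""
    flat = {}
    for section, value in nested.items():
        if section in PREFIXES and isinstance(value, dict):
            for name, item in value.items():
                flat[f"{section}.{name}"] = item
        else:
            flat[section] = value  # top-level keys such as cell or pbc
    return flat


def unflatten(flat: dict) -> dict:
    """Inverse of flatten for keys that use a known prefix."""
    nested = {}
    for key, value in flat.items():
        prefix, dot, name = key.partition(".")
        if dot and prefix in PREFIXES:
            nested.setdefault(prefix, {})[name] = value
        else:
            nested[key] = value
    return nested


frame = {
    "arrays": {"numbers": [8, 1, 1]},
    "calc": {"energy": -10.5},
    "info": {"label": "water"},
    "pbc": [False, False, False],
}
flat = flatten(frame)
```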
Facade API Reference
All three tiers share the same method names. Async facades use await instead of direct calls.
| Method | BlobIO / ObjectIO / ASEIO | AsyncBlobIO / AsyncObjectIO / AsyncASEIO |
|---|---|---|
| Read one row | db[i] | await db[i] |
| Read with key filter | db.get(i, keys=[...]) | await db.get(i, keys=[...]) |
| List keys at index | db.keys(i) | await db.keys(i) |
| Append rows | n = db.extend([...]) | n = await db.extend([...]) |
| Insert at position | db.insert(i, row) | await db.insert(i, row) |
| Overwrite row | db[i] = row | await db[i].set(row) |
| Partial update | db.update(i, {...}) | await db.update(i, {...}) |
| Delete row | del db[i] | await db[i].delete() |
| Drop columns | db.drop(keys=[...]) | await db.drop(keys=[...]) |
| Pre-allocate slots | db.reserve(n) | await db.reserve(n) |
| Clear all rows | db.clear() | await db.clear() |
| Remove container | db.remove() | await db.remove() |
| Length | len(db) | await db.len() |
| Iterate | for row in db: | async for row in db: |
| Context manager | with db: | async with db: |
ASEIO / AsyncASEIO additionally support keyword-style updates:
db.update(i, info={"tag": "done"}, calc={"energy": -10.5})
Backend Adapters
Adapters convert between blob-level (dict[bytes, bytes]) and object-level (dict[str, Any]) backends:
| Adapter | Wraps | Exposes |
|---|---|---|
| BlobToObjectReadAdapter | ReadBackend[bytes, bytes] | ReadBackend[str, Any] |
| BlobToObjectReadWriteAdapter | ReadWriteBackend[bytes, bytes] | ReadWriteBackend[str, Any] |
| ObjectToBlobReadAdapter | ReadBackend[str, Any] | ReadBackend[bytes, bytes] |
| ObjectToBlobReadWriteAdapter | ReadWriteBackend[str, Any] | ReadWriteBackend[bytes, bytes] |
Async variants (AsyncBlobToObjectReadAdapter, etc.) mirror the same pattern for async backends.
from asebytes import BlobToObjectReadWriteAdapter, ObjectIO
from asebytes import LMDBBlobBackend
# Use a blob backend through the ObjectIO facade
blob_backend = LMDBBlobBackend("data.lmdb")
object_backend = BlobToObjectReadWriteAdapter(blob_backend)
db = ObjectIO(object_backend)
The registry uses these adapters automatically — e.g., BlobIO("data.lmdb") wraps the object backend as a blob backend via ObjectToBlobReadWriteAdapter when no native blob backend is registered.
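The adapter idea, translating between `dict[bytes, bytes]` and `dict[str, Any]` rows, can be sketched as follows. Everything here is a stand-in: `ListBlobBackend` and `BlobToObjectAdapter` are hypothetical classes, and JSON replaces the msgpack-based encoding that asebytes uses, purely to keep the example dependency-free.

```python
import json


class ListBlobBackend:
    """Toy blob-level backend: each row is a dict[bytes, bytes]."""

    def __init__(self):
        self._rows = []

    def __len__(self) -> int:
        return len(self._rows)

    def get(self, index: int) -> dict:
        return self._rows[index]

    def extend(self, rows: list) -> None:
        self._rows.extend(rows)


class BlobToObjectAdapter:
    """Hypothetical adapter exposing str keys and decoded values."""

    def __init__(self, blob_backend):
        self._backend = blob_backend

    def __len__(self) -> int:
        return len(self._backend)

    def get(self, index: int, keys=None) -> dict:
        raw = self._backend.get(index)
        wanted = None if keys is None else {k.encode() for k in keys}
        # Decode only the requested fields; skip the rest untouched.
        return {
            k.decode(): json.loads(v)
            for k, v in raw.items()
            if wanted is None or k in wanted
        }

    def extend(self, rows: list) -> None:
        self._backend.extend(
            [{k.encode(): json.dumps(v).encode() for k, v in row.items()}
             for row in rows]
        )


blob_backend = ListBlobBackend()
adapter = BlobToObjectAdapter(blob_backend)
adapter.extend([{"calc.energy": -3.5, "arrays.numbers": [29]}])
row = adapter.get(0, keys=["calc.energy"])
```

Filtering on the encoded keys before decoding means unrequested values are never deserialized.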
Custom Backends
Implement ReadBackend[K, V] for read-only access or ReadWriteBackend[K, V] for full read-write:
from asebytes import ObjectIO, ReadBackend
class MyBackend(ReadBackend[str, object]):
    def __len__(self) -> int: ...
    def get(self, index: int, keys: list[str] | None = None) -> dict[str, object] | None: ...
db = ObjectIO(MyBackend())
For async backends, subclass AsyncReadBackend[K, V] / AsyncReadWriteBackend[K, V], or wrap an existing sync backend:
from asebytes import SyncToAsyncAdapter, AsyncObjectIO
async_backend = SyncToAsyncAdapter(MyBackend())
db = AsyncObjectIO(async_backend)
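Wrapping a sync backend for async use usually means running its blocking calls off the event loop. The sketch below does this with `asyncio.to_thread`. `ListBackend` and `ThreadedAsyncAdapter` are hypothetical classes, and whether `SyncToAsyncAdapter` uses threads in the same way is an assumption.

```python
import asyncio


class ListBackend:
    """Toy sync backend with the __len__ / get shape shown above."""

    def __init__(self, rows):
        self._rows = list(rows)

    def __len__(self) -> int:
        return len(self._rows)

    def get(self, index: int, keys=None) -> dict:
        row = self._rows[index]
        if keys is None:
            return dict(row)
        return {k: row[k] for k in keys if k in row}


class ThreadedAsyncAdapter:
    """Hypothetical adapter: runs each sync call in a worker thread."""

    def __init__(self, backend):
        self._backend = backend

    async def len(self) -> int:
        return await asyncio.to_thread(len, self._backend)

    async def get(self, index: int, keys=None) -> dict:
        # The event loop stays free while the blocking call runs.
        return await asyncio.to_thread(self._backend.get, index, keys)


async def main():
    backend = ThreadedAsyncAdapter(ListBackend([{"a": 1, "b": 2}, {"a": 3}]))
    n = await backend.len()
    row = await backend.get(0, keys=["a"])
    return n, row


result = asyncio.run(main())
```

A thread-based wrapper adds per-call overhead, so a native async backend is preferable where one exists.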
Benchmarks
1000 frames each on two datasets — ethanol conformers (small molecules, fixed size) and LeMat-Traj (periodic structures, variable atom counts). All frames include energy, forces, and stress. Compared against aselmdb, znh5md, extxyz, and SQLite. Log scale — lower is better.
# LeMat-Traj benchmark data
lemat = list(ASEIO("optimade://LeMaterial/LeMat-Traj", split="train", name="compatible_pbe")[:1000])
Note: HDF5 performance is heavily influenced by compression and chunking settings. Both asebytes H5MD and znh5md use gzip compression by default, which reduces file size at the cost of read/write speed. The Zarr backend uses Blosc/LZ4 compression, which achieves compact file sizes with faster decompression than gzip.