LMDB-based storage for ASE.
asebytes
Storage-agnostic, lazy-loading interface for ASE Atoms objects. Pluggable backends (LMDB, Zarr, HDF5/H5MD, HuggingFace Datasets, ASE file formats) behind a single list-like API with pandas-style column views.
pip install asebytes[lmdb] # LMDB backend (recommended)
pip install asebytes[zarr] # Zarr backend (fast compression)
pip install asebytes[h5md] # HDF5/H5MD backend
pip install asebytes[hf] # HuggingFace Datasets backend
Quick Start
from asebytes import ASEIO
# Write
db = ASEIO("data.lmdb")
db.extend(atoms_list) # bulk append
db[0] = new_atoms # replace row
db.update(0, calc={"energy": -10.5}) # partial update
# Read
atoms = db[0] # ase.Atoms
atoms = db[-1] # negative indexing
Backend is auto-detected from the file extension:
| Extension | Backend | Install extra |
|---|---|---|
| `*.lmdb` | `LMDBBackend` | `asebytes[lmdb]` |
| `*.zarr` | `ZarrBackend` | `asebytes[zarr]` |
| `*.h5` / `*.h5md` | `H5MDBackend` | `asebytes[h5md]` |
| `*.xyz` / `*.extxyz` / `*.traj` | `ASEReadOnlyBackend` | (none) |
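As a rough illustration of extension-based dispatch, the table above can be expressed as a suffix lookup. This is only a sketch: `guess_backend_name` and `_EXTENSION_TO_BACKEND` are hypothetical names, and the way asebytes actually resolves backends may differ.

```python
from pathlib import Path

# Hypothetical lookup table mirroring the extension table above;
# the real asebytes registry may be organised differently.
_EXTENSION_TO_BACKEND = {
    ".lmdb": "LMDBBackend",
    ".zarr": "ZarrBackend",
    ".h5": "H5MDBackend",
    ".h5md": "H5MDBackend",
    ".xyz": "ASEReadOnlyBackend",
    ".extxyz": "ASEReadOnlyBackend",
    ".traj": "ASEReadOnlyBackend",
}


def guess_backend_name(path: str) -> str:
    """Return the backend name for a path, based only on its final suffix."""
    suffix = Path(path).suffix.lower()
    try:
        return _EXTENSION_TO_BACKEND[suffix]
    except KeyError:
        raise ValueError(f"no backend registered for extension {suffix!r}") from None
```

For an unrecognised extension the sketch raises `ValueError`; passing a backend instance explicitly, as in the Zarr and custom-backend examples, sidesteps detection entirely.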
Lazy Views
Indexing with slices, lists, or strings returns lazy views that load data on demand.
# Row views — lazy, stream one frame at a time
view = db[5:100] # slice → RowView (nothing loaded yet)
view = db[[0, 42, 99]] # list of indices → RowView
for atoms in view:
    process(atoms)
# Chunked iteration — loads N rows per batch for throughput
for atoms in db[:].chunked(1000):
    process(atoms)
# Column views — avoid constructing full Atoms objects
energies = db["calc.energy"].to_list()
cols = db[["calc.energy", "calc.forces"]].to_dict()
# → {"calc.energy": [...], "calc.forces": [...]}
# Chaining — slice rows, then select columns
db[0:500]["calc.energy"].to_list()
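To make "nothing loaded yet" concrete, here is a minimal sketch of how a lazy view can work: it stores indices and touches the underlying store only during iteration. `LazyRowView` and `CountingSource` are hypothetical stand-ins, not the asebytes classes, and rows are plain dicts rather than Atoms objects.

```python
from typing import Iterator, List, Sequence


class CountingSource:
    """Toy row store that records how many rows have been read."""

    def __init__(self, rows):
        self._rows = list(rows)
        self.reads = 0

    def __len__(self) -> int:
        return len(self._rows)

    def __getitem__(self, index: int):
        self.reads += 1
        return self._rows[index]


class LazyRowView:
    """Hypothetical stand-in for a row view: stores indices, not data."""

    def __init__(self, source, indices: Sequence[int]):
        self._source = source
        self._indices = list(indices)

    def __len__(self) -> int:
        return len(self._indices)

    def __iter__(self) -> Iterator:
        # Rows are fetched one at a time, only when iteration reaches them.
        for i in self._indices:
            yield self._source[i]

    def chunked(self, size: int) -> Iterator:
        # Fetch up to `size` rows per batch, then yield them one by one.
        for start in range(0, len(self._indices), size):
            batch = [self._source[i] for i in self._indices[start:start + size]]
            yield from batch

    def column(self, key: str) -> List:
        # Return one field per row instead of whole rows.
        return [self._source[i][key] for i in self._indices]


source = CountingSource({"calc.energy": float(i)} for i in range(10))
view = LazyRowView(source, range(2, 6))   # no reads have happened yet
energies = view.column("calc.energy")     # four reads happen here
```

In this toy version `column` still reads whole rows and picks one key; a storage backend can do better by reading only the requested key from disk.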
Persistent Read-Through Cache
For slow or remote sources, cache_to creates a persistent local cache.
First pass reads from source and fills the cache; all subsequent reads are served from cache.
db = ASEIO("colabfit://dataset", split="train", cache_to="cache.lmdb")
for atoms in db: # epoch 1: reads source, populates cache
    train(atoms)
for atoms in db: # epoch 2+: all reads from local cache
    train(atoms)
cache_to accepts a file path (the backend is created automatically) or any WritableBackend instance.
There is no cache invalidation; delete the cache file to reset.
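The read-through pattern itself is small enough to sketch. The class below is a hypothetical illustration using a plain dict as the cache, not the asebytes implementation: a miss reads from the source and stores the row, a hit never touches the source, and nothing is ever invalidated.

```python
class ReadThroughCache:
    """Hypothetical read-through wrapper: check the cache, fall back to source."""

    def __init__(self, source, cache):
        self._source = source   # slow store, indexable by int
        self._cache = cache     # fast store, any dict-like mapping
        self.source_reads = 0   # counter kept only for illustration

    def __len__(self):
        return len(self._source)

    def __getitem__(self, index):
        if index in self._cache:
            return self._cache[index]
        self.source_reads += 1
        row = self._source[index]
        self._cache[index] = row  # stored permanently; there is no invalidation
        return row

    def __iter__(self):
        for i in range(len(self)):
            yield self[i]


db = ReadThroughCache(["frame-0", "frame-1", "frame-2"], cache={})
first_epoch = list(db)    # every row comes from the source
second_epoch = list(db)   # every row comes from the cache
```

Because entries are never refreshed, a change in the source after the first pass is invisible to readers, which is why resetting means deleting the cache.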
HuggingFace Datasets
Stream or download datasets from the HuggingFace Hub via URI schemes.
# ColabFit (auto-selects column mapping, streams by default)
db = ASEIO("colabfit://mlearn_Cu_train", split="train")
# OPTIMADE (e.g. LeMaterial)
db = ASEIO("optimade://LeMaterial/LeMat-Bulk", split="train", name="compatible_pbe")
# Generic HuggingFace (requires explicit column mapping)
from asebytes import ColumnMapping
mapping = ColumnMapping(
    positions="pos", numbers="nums",
    calc={"energy": "total_energy"},
)
db = ASEIO("hf://user/dataset", mapping=mapping, split="train")
# Downloaded mode for faster access
db = ASEIO("colabfit://dataset", split="train", streaming=False)
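Conceptually, a column mapping renames dataset columns to the keys asebytes expects. The sketch below shows that idea on a single row; `apply_mapping` and the flat dict form of the mapping are assumptions made for illustration, not the `ColumnMapping` internals.

```python
def apply_mapping(row: dict, mapping: dict) -> dict:
    """Rename source columns to target keys; `mapping` is {target_key: source_column}."""
    return {target: row[source] for target, source in mapping.items()}


# Hypothetical flat equivalent of the ColumnMapping example above.
mapping = {
    "arrays.positions": "pos",
    "arrays.numbers": "nums",
    "calc.energy": "total_energy",
}

raw_row = {
    "pos": [[0.0, 0.0, 0.0]],
    "nums": [1],
    "total_energy": -0.5,
    "extra": "not mapped, so dropped",
}
frame = apply_mapping(raw_row, mapping)
```

Columns that are not named in the mapping are simply left out of the result in this sketch.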
Zarr
Zarr backend with a flat layout and Blosc/LZ4 compression. It offers compact file sizes and fast reads. Variable particle counts are supported via NaN padding, and writes are append-only.
db = ASEIO("trajectory.zarr")
db.extend(atoms_list)
# Custom compression
from asebytes import ZarrBackend
db = ASEIO(ZarrBackend("data.zarr", compressor="zstd", clevel=9))
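NaN padding lets frames with different atom counts share one rectangular array. A minimal pure-Python sketch of the idea follows, using per-atom scalars for brevity; `pad_frames` and `unpad_frame` are hypothetical helpers, and the Zarr backend's actual storage layout is not shown here.

```python
import math


def pad_frames(frames, fill=math.nan):
    """Pad per-frame lists of scalars to a common length with `fill`."""
    width = max((len(f) for f in frames), default=0)
    return [list(f) + [fill] * (width - len(f)) for f in frames]


def unpad_frame(row):
    """Drop trailing NaN padding to recover the original per-atom values."""
    end = len(row)
    while end > 0 and math.isnan(row[end - 1]):
        end -= 1
    return row[:end]


# Two frames with 3 and 1 atoms become a 2 x 3 rectangular block.
padded = pad_frames([[1.0, 2.0, 3.0], [4.0]])
restored = [unpad_frame(row) for row in padded]
```

One caveat of this naive version: a genuine NaN at the end of a frame would be stripped along with the padding, so a real implementation needs another way to recover the true atom count.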
HDF5 / H5MD
H5MD-standard files with support for variable particle counts, per-frame PBC, and bond connectivity.
db = ASEIO("trajectory.h5", author_name="Jane Doe", compression="gzip")
db.extend(atoms_list)
# Multi-group files
from asebytes import H5MDBackend
groups = H5MDBackend.list_groups("multi.h5")
db = ASEIO("multi.h5", particles_group="solvent")
Key Convention
All data follows a flat namespace:
| Prefix | Content | Examples |
|---|---|---|
| `arrays.*` | Per-atom arrays | `arrays.positions`, `arrays.numbers`, `arrays.forces` |
| `calc.*` | Calculator results | `calc.energy`, `calc.stress` |
| `info.*` | Frame metadata | `info.smiles`, `info.label` |
| (top-level) | `cell`, `pbc`, `constraints` | |
from asebytes import atoms_to_dict, dict_to_atoms
d = atoms_to_dict(atoms) # Atoms → flat dict (~5x faster than encode/decode)
atoms = dict_to_atoms(d) # flat dict → Atoms
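The dotted-key convention can be illustrated without ASE at all. The sketch below flattens a nested dict into `prefix.name` keys and back; `flatten` and `unflatten` are hypothetical helpers that only mimic the naming scheme, not what `atoms_to_dict` and `dict_to_atoms` do internally.

```python
# Prefixes taken from the table above; everything else stays top-level.
PREFIXES = ("arrays", "calc", "info")


def flatten(nested: dict) -> dict:
    """Turn {"calc": {"energy": x}} into {"calc.energy": x}."""
    flat = {}
    for key, value in nested.items():
        if key in PREFIXES and isinstance(value, dict):
            for name, item in value.items():
                flat[f"{key}.{name}"] = item
        else:
            flat[key] = value
    return flat


def unflatten(flat: dict) -> dict:
    """Inverse of flatten: group dotted keys under their prefix."""
    nested = {}
    for key, value in flat.items():
        prefix, sep, name = key.partition(".")
        if sep and prefix in PREFIXES:
            nested.setdefault(prefix, {})[name] = value
        else:
            nested[key] = value
    return nested


frame = {
    "arrays": {"positions": [[0.0, 0.0, 0.0]], "numbers": [1]},
    "calc": {"energy": -1.0},
    "info": {"label": "demo"},
    "cell": [10.0, 10.0, 10.0],
    "pbc": [True, True, True],
}
flat = flatten(frame)
```

A flat namespace like this is what makes string indexing such as `db["calc.energy"]` a direct key lookup.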
Custom Backends
Implement ReadableBackend for read-only or WritableBackend for read-write:
from asebytes import ASEIO, ReadableBackend
class MyBackend(ReadableBackend):
    def __len__(self): ...
    def columns(self, index=0): ...
    def read_row(self, index, keys=None): ...
db = ASEIO(MyBackend())
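For a sense of what those three methods might contain, here is an in-memory sketch. It is deliberately duck-typed rather than a subclass of `ReadableBackend`, so it runs without asebytes installed. The return types (a list of keys from `columns`, a flat dict from `read_row`) are assumptions inferred from the method names and the key convention above, not confirmed by the library's documentation.

```python
class InMemoryBackend:
    """Duck-typed sketch of the three methods listed above."""

    def __init__(self, rows):
        # Each row is a flat dict, e.g. {"calc.energy": -1.0, "info.label": "a"}.
        self._rows = [dict(row) for row in rows]

    def __len__(self):
        return len(self._rows)

    def columns(self, index=0):
        # Assumed to return the keys available in the row at `index`.
        return list(self._rows[index].keys())

    def read_row(self, index, keys=None):
        # Assumed to return a flat dict, restricted to `keys` when given.
        row = self._rows[index]
        if keys is None:
            return dict(row)
        return {key: row[key] for key in keys}


backend = InMemoryBackend([
    {"calc.energy": -1.0, "info.label": "a"},
    {"calc.energy": -2.0, "info.label": "b"},
])
energy_only = backend.read_row(1, keys=["calc.energy"])
```

Honouring `keys` in `read_row` is what would allow column views to skip data they do not need.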
Benchmarks
Benchmarks use 1000 frames from each of two datasets: ethanol conformers (small molecules, fixed size) and LeMat-Traj (periodic structures, variable atom counts). All frames include energy, forces, and stress. Results are compared against aselmdb, znh5md, extxyz, and SQLite.
# LeMat-Traj benchmark data
lemat = list(ASEIO("optimade://LeMaterial/LeMat-Traj", split="train", name="compatible_pbe")[:1000])
Note: HDF5 performance is heavily influenced by compression and chunking settings. Both asebytes H5MD and znh5md use gzip compression by default, which reduces file size at the cost of read/write speed. The Zarr backend uses Blosc/LZ4 compression, which achieves compact file sizes with faster decompression than gzip.
[Figures: Write, Sequential Read, Random Access, Column Access, File Size]