Fast binary serialization and storage for ASE Atoms.

Owner

zincware

asebytes

Storage-agnostic, lazy-loading data layer with pluggable backends (LMDB, Zarr, HDF5/H5MD, HuggingFace Datasets, ASE file formats). Three IO tiers — raw bytes, structured dicts, and ASE Atoms — each with full sync and async APIs plus pandas-style column views.

pip install asebytes[lmdb]      # LMDB backend (recommended)
pip install asebytes[zarr]      # Zarr backend (fast compression)
pip install asebytes[h5md]      # HDF5/H5MD backend
pip install asebytes[hf]        # HuggingFace Datasets backend
pip install asebytes[mongodb]   # MongoDB backend (shared remote storage)
# In-memory backend (MemoryObjectBackend) is built-in — no extras needed

Quick Start

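from ase.build import molecule
from asebytes import ASEIO

# Example frames (any list of ase.Atoms works)
atoms_list = [molecule("H2O"), molecule("CO2")]
new_atoms = molecule("CH4")

# Sync
db = ASEIO("data.lmdb")
db.extend(atoms_list)   # append frames
db[0] = new_atoms       # overwrite the first frame
atoms = db[0]           # read it back as ase.Atoms

# Async
import asyncio
from asebytes import AsyncASEIO

async def main():
    db = AsyncASEIO("data.lmdb")
    await db.extend(atoms_list)
    atoms = await db[0]
    async for atoms in db:
        print(atoms)

asyncio.run(main())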

String paths select the backend automatically from the file extension or URI scheme. Pass a backend instance directly for full control.

Three IO Layers

Class	Async class	Row type	Use case
`ASEIO`	`AsyncASEIO`	`ase.Atoms`	Atomistic simulations
`ObjectIO`	`AsyncObjectIO`	`dict[str, Any]`	Structured data without ASE
`BlobIO`	`AsyncBlobIO`	`dict[bytes, bytes]`	Raw bytes, zero deserialization

ASEIO — Atoms objects

from asebytes import ASEIO, AsyncASEIO

# Sync
db = ASEIO("atoms.lmdb")
db.extend(atoms_list)
db.update(0, calc={"energy": -10.5})
atoms = db[0]                  # ase.Atoms

# Async
db = AsyncASEIO("atoms.lmdb")
await db.extend(atoms_list)
atoms = await db[0]            # ase.Atoms
await db.update(0, calc={"energy": -10.5})

ObjectIO — plain dicts

from asebytes import ObjectIO, AsyncObjectIO

# Sync
db = ObjectIO("records.lmdb")
db.extend([
    {"arrays.numbers": [29], "calc.energy": -3.5},
    {"arrays.numbers": [26], "calc.energy": -8.3},
])
row = db[0]  # {"arrays.numbers": [29], "calc.energy": -3.5}

# Async
db = AsyncObjectIO("records.lmdb")
await db.extend([{"arrays.numbers": [29], "calc.energy": -3.5}])
row = await db[0]

BlobIO — raw bytes

from asebytes import BlobIO, AsyncBlobIO

# Sync
db = BlobIO("blobs.lmdb")
db.extend([{b"key": b"value"}, {b"key": b"other"}])
row = db[0]                    # {b"key": b"value"}

# Async
db = AsyncBlobIO("blobs.lmdb")
await db.extend([{b"key": b"value"}])
row = await db[0]

JSON

Serialize ase.Atoms through stdlib json using two encoder/decoder classes. The wire format is a compact base64-of-msgpack envelope — the same binary path used by asebytes.encode / asebytes.decode.

import json

import ase
import asebytes
import molify

frames: list[ase.Atoms] = molify.smiles2conformers("CCO", numConfs=3)

s = json.dumps(frames, cls=asebytes.AtomsEncoder)
recovered = json.loads(s, cls=asebytes.AtomsDecoder)  # list[ase.Atoms]

AtomsEncoder is a json.JSONEncoder subclass — override default() in your own subclass to handle additional types.
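The encoder/decoder pattern described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the asebytes implementation: `Frame`, `FrameEncoder`, `FrameDecoder`, and the `__frame_b64__` key are hypothetical names, and the binary payload is JSON-encoded bytes standing in for the msgpack bytes that asebytes produces.

```python
import base64
import json
from dataclasses import asdict, dataclass


@dataclass
class Frame:
    """Hypothetical stand-in for ase.Atoms."""
    numbers: list
    positions: list


ENVELOPE_KEY = "__frame_b64__"  # hypothetical marker, not the asebytes key


class FrameEncoder(json.JSONEncoder):
    def default(self, o):
        if isinstance(o, Frame):
            # Stand-in binary payload; asebytes would emit msgpack bytes here.
            payload = json.dumps(asdict(o)).encode("utf-8")
            return {ENVELOPE_KEY: base64.b64encode(payload).decode("ascii")}
        # Anything else falls through to the stdlib behaviour (TypeError).
        return super().default(o)


class FrameDecoder(json.JSONDecoder):
    def __init__(self, **kwargs):
        super().__init__(object_hook=self._hook, **kwargs)

    @staticmethod
    def _hook(obj):
        # Only dicts that consist of exactly the envelope key are unwrapped.
        if set(obj) == {ENVELOPE_KEY}:
            payload = base64.b64decode(obj[ENVELOPE_KEY])
            return Frame(**json.loads(payload))
        return obj


frames = [Frame([8, 1, 1], [[0.0, 0.0, 0.0], [0.0, 0.8, 0.6], [0.0, -0.8, 0.6]])]
text = json.dumps(frames, cls=FrameEncoder)
recovered = json.loads(text, cls=FrameDecoder)
```

Because the envelope is an ordinary JSON object holding a base64 string, it can sit inside larger JSON documents next to plain values.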

Lazy Views

Indexing with slices, lists, or strings returns lazy views — nothing is loaded until you iterate or materialize.

Row views

# Sync
view = db[5:100]               # RowView (lazy)
view = db[[0, 42, 99]]         # RowView from index list
for row in view:
    process(row)

# Async
view = db[5:100]               # AsyncRowView (lazy)
async for row in view:
    process(row)
rows = await view.to_list()    # materialize to list
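To make the "nothing is loaded until you iterate" behaviour concrete, here is a minimal sketch that does not use asebytes. `ListBackend`, `LazyRowView`, and `Store` are hypothetical classes; the read counter exists only to show when rows are actually fetched.

```python
class ListBackend:
    """Hypothetical backend that counts how many rows were actually read."""

    def __init__(self, rows):
        self._rows = list(rows)
        self.reads = 0

    def __len__(self):
        return len(self._rows)

    def get(self, index):
        self.reads += 1
        return self._rows[index]


class LazyRowView:
    """Holds row indices only; rows are fetched during iteration."""

    def __init__(self, backend, indices):
        self._backend = backend
        self._indices = list(indices)

    def __len__(self):
        return len(self._indices)

    def __iter__(self):
        for i in self._indices:
            yield self._backend.get(i)

    def to_list(self):
        return list(self)


class Store:
    def __init__(self, backend):
        self._backend = backend

    def __getitem__(self, key):
        n = len(self._backend)
        if isinstance(key, slice):
            return LazyRowView(self._backend, range(*key.indices(n)))
        if isinstance(key, list):
            return LazyRowView(self._backend, key)
        return self._backend.get(key)  # single int: load immediately


backend = ListBackend({"calc.energy": -float(i)} for i in range(200))
db = Store(backend)
view = db[5:100]
reads_before = backend.reads  # building the view reads nothing
rows = view.to_list()
reads_after = backend.reads   # one read per row in the view
```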

Column views

# Sync
energies = db["calc.energy"].to_list()
cols = db[["calc.energy", "calc.forces"]].to_dict()
# → {"calc.energy": [...], "calc.forces": [...]}

# Async
energies = await db["calc.energy"].to_list()
cols = await db[["calc.energy", "calc.forces"]].to_dict()

Chaining rows + columns

# Sync
db[0:500]["calc.energy"].to_list()

# Async
await db[0:500]["calc.energy"].to_list()

Materialization

# Sync
view.to_list()                 # load all into memory
view.to_dict()                 # column-oriented dict (ColumnView only)
for batch in view.chunked(1000):  # iterate in chunks
    process(batch)

# Async
await view.to_list()
await view.to_dict()
async for batch in view.chunked(1000):
    process(batch)
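Chunked iteration can be sketched over any iterable with `itertools.islice`. The helper below is a hypothetical free function, not the asebytes method; whether asebytes also batches backend reads per chunk is not stated here.

```python
from itertools import islice


def chunked(rows, size):
    """Yield lists of at most `size` rows from any iterable."""
    if size < 1:
        raise ValueError("size must be >= 1")
    it = iter(rows)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


batches = list(chunked(range(2500), 1000))
sizes = [len(b) for b in batches]
```

The last batch is shorter when the row count is not a multiple of the chunk size, so consumers should not assume a fixed batch length.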

Write-back

Views support in-place mutations when backed by a writable backend.

# Sync
db[0:10].set(new_rows)         # overwrite rows
db[0:10].update({"info.tag": "train"})  # partial update (applies to all rows)
db[0:10].delete()              # delete rows (contiguous only)

# Async
await db[0:10].set(new_rows)
await db[0:10].update({"info.tag": "train"})
await db[0:10].delete()

Backends

The backend is auto-detected from the file extension:

Extension	Backend	Install extra
`*.lmdb`	`LMDBObjectBackend` / `LMDBBlobBackend`	`asebytes[lmdb]`
`*.zarr`	`ZarrBackend`	`asebytes[zarr]`
`*.h5` / `*.h5md`	`H5MDBackend`	`asebytes[h5md]`
`*.xyz` / `*.extxyz` / `*.traj`	`ASEReadOnlyBackend`	(none)

URI schemes select in-memory, remote, and streaming sources:

Scheme	Source	Example
`memory://`	In-memory (no persistence)	`ObjectIO("memory://")`
`mongodb://`	MongoDB	`ObjectIO("mongodb://host:port/db")`
`redis://`	Redis	`ObjectIO("redis://host:port")`
`hf://`	HuggingFace Datasets	`ASEIO("hf://user/dataset", ...)`
`colabfit://`	ColabFit datasets	`ASEIO("colabfit://mlearn_Cu_train", ...)`
`optimade://`	OPTIMADE datasets	`ASEIO("optimade://LeMaterial/LeMat-Bulk", ...)`
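The two lookup tables above suggest a simple resolution order: check for a URI scheme first, then fall back to the file extension. The sketch below only mirrors the tables; the `resolve` function and its return labels are hypothetical, and the real registry in asebytes may work differently.

```python
from pathlib import PurePosixPath

# Entries copied from the tables above; the lookup logic is an assumption.
EXTENSIONS = {
    ".lmdb": "LMDBObjectBackend",
    ".zarr": "ZarrBackend",
    ".h5": "H5MDBackend",
    ".h5md": "H5MDBackend",
    ".xyz": "ASEReadOnlyBackend",
    ".extxyz": "ASEReadOnlyBackend",
    ".traj": "ASEReadOnlyBackend",
}
SCHEMES = {"memory", "mongodb", "redis", "hf", "colabfit", "optimade"}


def resolve(target: str) -> str:
    """Return a label describing which backend a path or URI would select."""
    if "://" in target:
        scheme = target.split("://", 1)[0]
        if scheme in SCHEMES:
            return f"scheme:{scheme}"
        raise ValueError(f"unknown URI scheme: {scheme!r}")
    suffix = PurePosixPath(target).suffix.lower()
    try:
        return EXTENSIONS[suffix]
    except KeyError:
        raise ValueError(f"no backend registered for {suffix!r}") from None
```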

Groups

All backends support a unified group parameter to organize data into independent collections within the same storage location. Groups are useful for storing multiple datasets, splits, or configurations in a single file/database.

# LMDB: separate subdirectories per group
db1 = ASEIO("data.lmdb", group="train")
db2 = ASEIO("data.lmdb", group="test")

# H5MD: /particles/{group}/ in the HDF5 structure
db = ASEIO("multi.h5", group="solvent")

# MongoDB: each group = a collection in the database
db = ObjectIO("mongodb://host:port/mydb", group="train")

# Zarr: separate subdirectories per group
db = ASEIO("data.zarr", group="conformers")

# Redis: key prefix = group
db = ObjectIO("redis://host:port", group="mydata")

# Memory: independent storage per group
db = ObjectIO("memory://", group="temp")

List available groups with list_groups():

from asebytes import ASEIO, H5MDBackend, LMDBObjectBackend

# Static method on backends
groups = H5MDBackend.list_groups("multi.h5")
groups = LMDBObjectBackend.list_groups("data.lmdb")

# Or via facades
groups = ASEIO.list_groups("data.lmdb")

When no group is specified, the default is backend-specific: "default" for most backends, "atoms" for H5MD. Each backend stores groups using its native mechanism:

Backend	Group storage
LMDB	Subdirectory: `{path}/{group}/`
H5MD	HDF5 group: `/particles/{group}/`
Zarr	Subdirectory: `{path}/{group}/`
MongoDB	Collection: `{group}` in the database
Redis	Key prefix: `{group}:`
Memory	Internal dict keyed by group
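As an illustration of the key-prefix strategy in the table, the sketch below keeps several groups in one shared flat dictionary. `PrefixedStore` and its per-group `count` key are hypothetical; they are not how the asebytes Redis backend lays out its keys.

```python
class PrefixedStore:
    """Hypothetical group-aware wrapper over one shared flat key-value dict."""

    def __init__(self, shared: dict, group: str = "default"):
        self._shared = shared
        self._prefix = f"{group}:"

    def _count_key(self) -> str:
        return f"{self._prefix}count"

    def __len__(self) -> int:
        return self._shared.get(self._count_key(), 0)

    def extend(self, rows) -> int:
        """Append rows and return the new length (return value is an assumption)."""
        n = len(self)
        for row in rows:
            self._shared[f"{self._prefix}{n}"] = row
            n += 1
        self._shared[self._count_key()] = n
        return n

    def __getitem__(self, index: int):
        if not 0 <= index < len(self):
            raise IndexError(index)
        return self._shared[f"{self._prefix}{index}"]

    @staticmethod
    def list_groups(shared: dict) -> list:
        # Every key starts with "{group}:", so the prefix identifies the group.
        return sorted({key.split(":", 1)[0] for key in shared})


shared = {}
train = PrefixedStore(shared, group="train")
test = PrefixedStore(shared, group="test")
train.extend([{"a": 1}, {"a": 2}])
test.extend([{"a": 3}])
```

Both stores write into the same dictionary, yet indexing and length stay independent per group.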

Read-Through Cache

For slow or remote sources, cache_to creates a persistent local cache. The first pass reads from the source and fills the cache; subsequent reads are served from the cache.

db = ASEIO("colabfit://dataset", split="train", cache_to="cache.lmdb")
for atoms in db:    # epoch 1: reads source, populates cache
    train(atoms)
for atoms in db:    # epoch 2+: reads from local cache
    train(atoms)

cache_to is available on ASEIO only. It accepts a file path (the backend is created automatically) or any ReadWriteBackend instance. There is no cache invalidation: delete the cache file to reset.
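The read-through behaviour can be sketched in a few lines. This is an assumption-laden illustration, not the asebytes code: `CountingSource` and `ReadThrough` are hypothetical, and a plain dict stands in for the persistent cache backend.

```python
class CountingSource:
    """Hypothetical slow source that records how often it is read."""

    def __init__(self, rows):
        self._rows = list(rows)
        self.reads = 0

    def __len__(self):
        return len(self._rows)

    def get(self, index):
        self.reads += 1
        return self._rows[index]


class ReadThrough:
    """Serve rows from `cache` when present, otherwise read `source` and store."""

    def __init__(self, source, cache: dict):
        self._source = source
        self._cache = cache

    def __len__(self):
        return len(self._source)

    def __getitem__(self, index):
        if index not in self._cache:      # miss: read the source once
            self._cache[index] = self._source.get(index)
        return self._cache[index]         # hit: no source access

    def __iter__(self):
        return (self[i] for i in range(len(self)))


source = CountingSource({"calc.energy": float(i)} for i in range(5))
db = ReadThrough(source, cache={})
epoch1 = list(db)
reads_after_epoch1 = source.reads
epoch2 = list(db)
reads_after_epoch2 = source.reads
```

The source read count stops growing after the first epoch. Like the behaviour described above, this sketch never invalidates entries, so a changed source would go unnoticed until the cache is cleared.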

HuggingFace Datasets

Stream or download datasets from the HuggingFace Hub via URI schemes.

# ColabFit (auto-selects column mapping, streams by default)
db = ASEIO("colabfit://mlearn_Cu_train", split="train")

# OPTIMADE (e.g. LeMaterial)
db = ASEIO("optimade://LeMaterial/LeMat-Bulk", split="train", name="compatible_pbe")

# Generic HuggingFace (requires explicit column mapping)
from asebytes import ColumnMapping
mapping = ColumnMapping(
    positions="pos", numbers="nums",
    calc={"energy": "total_energy"},
)
db = ASEIO("hf://user/dataset", mapping=mapping, split="train")

# Downloaded mode for faster access
db = ASEIO("colabfit://dataset", split="train", streaming=False)
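A column mapping of this kind boils down to renaming dataset columns into the flat key namespace. The function below is a hypothetical sketch of that idea using plain dicts; it is not the `ColumnMapping` class, and the `info` parameter is an assumption added for symmetry.

```python
def apply_mapping(row, positions, numbers, calc=None, info=None):
    """Rename dataset columns to flat arrays.* / calc.* / info.* keys."""
    flat = {
        "arrays.positions": row[positions],
        "arrays.numbers": row[numbers],
    }
    for target, column in (calc or {}).items():
        flat[f"calc.{target}"] = row[column]
    for target, column in (info or {}).items():
        flat[f"info.{target}"] = row[column]
    return flat


# Hypothetical dataset row using the column names from the mapping above.
hf_row = {"pos": [[0.0, 0.0, 0.0]], "nums": [29], "total_energy": -3.5, "split": "train"}
flat = apply_mapping(
    hf_row,
    positions="pos",
    numbers="nums",
    calc={"energy": "total_energy"},
)
```

Columns that are not named in the mapping (here `split`) are simply dropped in this sketch.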

Zarr / HDF5 / H5MD

Zarr

Flat layout with Blosc/LZ4 compression. Compact files and fast reads. Supports variable particle counts via NaN padding.
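NaN padding lets frames with different atom counts share one rectangular array. The sketch below shows one possible scheme with plain lists; the helper names are hypothetical, and how the Zarr backend actually marks and strips padding is not described here.

```python
import math


def pad_frames(frames, width=None):
    """Pad per-atom rows with NaN so every frame has the same length."""
    width = max(len(f) for f in frames) if width is None else width
    nan_row = [math.nan, math.nan, math.nan]
    return [
        list(f) + [list(nan_row) for _ in range(width - len(f))]
        for f in frames
    ]


def unpad_frame(padded):
    """Drop trailing rows that consist entirely of NaN."""
    rows = list(padded)
    while rows and all(math.isnan(x) for x in rows[-1]):
        rows.pop()
    return rows


water = [[0.0, 0.0, 0.12], [0.0, 0.76, -0.48], [0.0, -0.76, -0.48]]
h2 = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.74]]
padded = pad_frames([water, h2])
```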

db = ASEIO("trajectory.zarr")
db.extend(atoms_list)

# Custom compression
from asebytes import ZarrBackend
db = ASEIO(ZarrBackend("data.zarr", compressor="zstd", clevel=9))

HDF5 / H5MD

H5MD-standard files with variable particle counts, per-frame PBC, and bond connectivity.

db = ASEIO("trajectory.h5", author_name="Jane Doe", compression="gzip")
db.extend(atoms_list)

# Multi-group files
from asebytes import H5MDBackend
groups = H5MDBackend.list_groups("multi.h5")
db = ASEIO("multi.h5", group="solvent")

MongoDB

Shared remote storage for multi-client access. Requires a running MongoDB instance (>= 4.4).

# Sync
db = ObjectIO("mongodb://user:pass@host:27017/mydb", group="train")
db.extend([{"energy": -3.5, "positions": [[0, 0, 0]]}])
row = db[0]

# Async — auto-dispatches to native AsyncMongoObjectBackend
db = AsyncObjectIO("mongodb://user:pass@host:27017/mydb", group="test")
row = await db[0]

The MongoDB backend uses a sort-key array for O(1) positional access and applies server-side field filtering via MongoDB projections: requesting specific keys (e.g. db.get(0, keys=["energy"])) transfers only those fields over the network.
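One plausible reading of the sort-key idea is an array that maps each position to a stable document key, so positional access is one array lookup followed by one key lookup. The sketch below models that reading in memory. `SortKeyStore` is hypothetical, the dict stands in for a MongoDB collection, and the key filter only imitates what a projection would do on the server.

```python
class SortKeyStore:
    """Hypothetical model: position -> sort key -> document."""

    def __init__(self):
        self._docs = {}        # sort_key -> document (stand-in for a collection)
        self._sort_keys = []   # position -> sort_key
        self._next_key = 0

    def __len__(self):
        return len(self._sort_keys)

    def extend(self, rows):
        for row in rows:
            self._docs[self._next_key] = dict(row)
            self._sort_keys.append(self._next_key)
            self._next_key += 1

    def delete(self, index):
        # Positions shift, but the remaining documents keep their keys.
        key = self._sort_keys.pop(index)
        del self._docs[key]

    def get(self, index, keys=None):
        doc = self._docs[self._sort_keys[index]]
        if keys is None:
            return dict(doc)
        return {k: doc[k] for k in keys if k in doc}  # projection-like filter


store = SortKeyStore()
store.extend([
    {"energy": -1.0, "forces": [0.1]},
    {"energy": -2.0, "forces": [0.2]},
    {"energy": -3.0, "forces": [0.3]},
])
store.delete(0)
```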

In-Memory Backend

MemoryObjectBackend stores data in a plain Python list — no persistence, no dependencies. Useful for testing, ephemeral storage, and prototyping.

from asebytes import ObjectIO, ASEIO

db = ObjectIO("memory://")
db.extend([{"a": 1}, {"a": 2}])
assert len(db) == 2

# Works with all facades
db = ASEIO("memory://")
db.extend(atoms_list)

Key Convention

All data follows a flat namespace:

Prefix	Content	Examples
`arrays.*`	Per-atom arrays	`arrays.positions`, `arrays.numbers`, `arrays.forces`
`calc.*`	Calculator results	`calc.energy`, `calc.stress`
`info.*`	Frame metadata	`info.smiles`, `info.label`
(top-level)	Unprefixed keys	`cell`, `pbc`, `constraints`

from asebytes import atoms_to_dict, dict_to_atoms

d = atoms_to_dict(atoms)   # Atoms → flat dict
atoms = dict_to_atoms(d)   # flat dict → Atoms
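The flat namespace can be illustrated with plain dictionaries. The `flatten` and `unflatten` helpers below are hypothetical and operate on nested dicts rather than ase.Atoms; they only show how the prefixes in the table map to dotted keys.

```python
SECTIONS = ("arrays", "calc", "info")


def flatten(frame: dict) -> dict:
    """Map nested arrays/calc/info sections to dotted keys; keep the rest top-level."""
    flat = {}
    for name, value in frame.items():
        if name in SECTIONS:
            for key, item in value.items():
                flat[f"{name}.{key}"] = item
        else:
            flat[name] = value
    return flat


def unflatten(flat: dict) -> dict:
    """Inverse of flatten for non-empty sections."""
    frame = {}
    for key, value in flat.items():
        prefix, dot, rest = key.partition(".")
        if dot and prefix in SECTIONS:
            frame.setdefault(prefix, {})[rest] = value
        else:
            frame[key] = value
    return frame


nested = {
    "arrays": {"numbers": [29], "positions": [[0.0, 0.0, 0.0]]},
    "calc": {"energy": -3.5},
    "info": {"label": "Cu"},
    "pbc": [True, True, True],
}
flat = flatten(nested)
```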

Facade API Reference

All three tiers share the same operations. Async facades await the same calls, except that item assignment, item deletion, and len() are replaced by awaitable methods, as the table shows.

Operation	BlobIO / ObjectIO / ASEIO	AsyncBlobIO / AsyncObjectIO / AsyncASEIO
Read one row	`db[i]`	`await db[i]`
Read with key filter	`db.get(i, keys=[...])`	`await db.get(i, keys=[...])`
List keys at index	`db.keys(i)`	`await db.keys(i)`
Append rows	`n = db.extend([...])`	`n = await db.extend([...])`
Insert at position	`db.insert(i, row)`	`await db.insert(i, row)`
Overwrite row	`db[i] = row`	`await db[i].set(row)`
Partial update	`db.update(i, {...})`	`await db.update(i, {...})`
Delete row	`del db[i]`	`await db[i].delete()`
Drop columns	`db.drop(keys=[...])`	`await db.drop(keys=[...])`
Pre-allocate slots	`db.reserve(n)`	`await db.reserve(n)`
Clear all rows	`db.clear()`	`await db.clear()`
Remove container	`db.remove()`	`await db.remove()`
Length	`len(db)`	`await db.len()`
Iterate	`for row in db:`	`async for row in db:`
Context manager	`with db:`	`async with db:`

ASEIO / AsyncASEIO additionally support keyword-style updates:

db.update(i, info={"tag": "done"}, calc={"energy": -10.5})
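Under the flat key convention, a keyword-style update is equivalent to a partial update with dotted keys. The helper below is a hypothetical sketch of that translation on a plain dict; it is not the asebytes `update` method.

```python
def apply_update(row, data=None, **sections):
    """Return a copy of `row` with flat and keyword-style updates applied."""
    updated = dict(row)
    updated.update(data or {})
    for prefix, values in sections.items():
        for key, value in values.items():
            updated[f"{prefix}.{key}"] = value  # e.g. info={"tag": ...} -> "info.tag"
    return updated


row = {"calc.energy": -1.0, "info.tag": "todo", "arrays.numbers": [29]}
updated = apply_update(row, info={"tag": "done"}, calc={"energy": -10.5})
```

Keys that the update does not mention are left untouched, which is what distinguishes a partial update from overwriting the row.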

Backend Adapters

Adapters convert between blob-level (dict[bytes, bytes]) and object-level (dict[str, Any]) backends:

Adapter	Wraps	Exposes
`BlobToObjectReadAdapter`	`ReadBackend[bytes, bytes]`	`ReadBackend[str, Any]`
`BlobToObjectReadWriteAdapter`	`ReadWriteBackend[bytes, bytes]`	`ReadWriteBackend[str, Any]`
`ObjectToBlobReadAdapter`	`ReadBackend[str, Any]`	`ReadBackend[bytes, bytes]`
`ObjectToBlobReadWriteAdapter`	`ReadWriteBackend[str, Any]`	`ReadWriteBackend[bytes, bytes]`

Async variants (AsyncBlobToObjectReadAdapter, etc.) mirror the same pattern for async backends.

from asebytes import BlobToObjectReadWriteAdapter, ObjectIO
from asebytes import LMDBBlobBackend

# Use a blob backend through the ObjectIO facade
blob_backend = LMDBBlobBackend("data.lmdb")
object_backend = BlobToObjectReadWriteAdapter(blob_backend)
db = ObjectIO(object_backend)

The registry uses these adapters automatically — e.g., BlobIO("data.lmdb") wraps the object backend as a blob backend via ObjectToBlobReadWriteAdapter when no native blob backend is registered.
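The adapter idea can be sketched as a wrapper that serializes each field on write and deserializes it on read. Everything below is hypothetical: `BlobList` and `BlobToObject` are illustrative classes, and JSON stands in for whatever binary encoding the real adapters use.

```python
import json


class BlobList:
    """Hypothetical blob-level backend: rows are dict[bytes, bytes]."""

    def __init__(self):
        self.rows = []

    def __len__(self):
        return len(self.rows)

    def get(self, index, keys=None):
        row = self.rows[index]
        if keys is None:
            return dict(row)
        return {k: row[k] for k in keys if k in row}

    def extend(self, rows):
        self.rows.extend(rows)
        return len(self.rows)


class BlobToObject:
    """Expose a blob backend as dict[str, Any] by (de)serializing each field."""

    def __init__(self, blob_backend):
        self._blobs = blob_backend

    def __len__(self):
        return len(self._blobs)

    def get(self, index, keys=None):
        raw_keys = None if keys is None else [k.encode("utf-8") for k in keys]
        raw = self._blobs.get(index, raw_keys)
        return {k.decode("utf-8"): json.loads(v) for k, v in raw.items()}

    def extend(self, rows):
        return self._blobs.extend(
            {k.encode("utf-8"): json.dumps(v).encode("utf-8") for k, v in row.items()}
            for row in rows
        )


blobs = BlobList()
objects = BlobToObject(blobs)
objects.extend([{"arrays.numbers": [29], "calc.energy": -3.5}])
```

Because keys are encoded one by one, a key filter passes straight through to the blob backend and only the requested fields are decoded.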

Custom Backends

Implement ReadBackend[K, V] for read-only access or ReadWriteBackend[K, V] for full read-write:

from asebytes import ObjectIO, ReadBackend

class MyBackend(ReadBackend[str, object]):
    def __len__(self) -> int: ...
    def get(self, index: int, keys: list[str] | None = None) -> dict[str, object] | None: ...

db = ObjectIO(MyBackend())

For async backends, subclass AsyncReadBackend[K, V] / AsyncReadWriteBackend[K, V], or wrap an existing sync backend:

from asebytes import SyncToAsyncAdapter, AsyncObjectIO

async_backend = SyncToAsyncAdapter(MyBackend())
db = AsyncObjectIO(async_backend)
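Wrapping a sync backend for async use typically means running each blocking call in a worker thread. The sketch below shows that approach with `asyncio.to_thread`; `SyncRows` and `ToAsync` are hypothetical, and this is not necessarily how `SyncToAsyncAdapter` is implemented.

```python
import asyncio


class SyncRows:
    """Hypothetical read-only sync backend."""

    def __init__(self, rows):
        self._rows = list(rows)

    def __len__(self):
        return len(self._rows)

    def get(self, index, keys=None):
        row = self._rows[index]
        if keys is None:
            return dict(row)
        return {k: row[k] for k in keys if k in row}


class ToAsync:
    """Run each blocking backend call in a worker thread."""

    def __init__(self, backend):
        self._backend = backend

    async def len(self):
        return await asyncio.to_thread(len, self._backend)

    async def get(self, index, keys=None):
        return await asyncio.to_thread(self._backend.get, index, keys)


async def demo():
    backend = ToAsync(SyncRows([{"a": 1, "b": 2}, {"a": 3, "b": 4}]))
    n = await backend.len()
    # Reads are issued concurrently; gather preserves the request order.
    rows = await asyncio.gather(*(backend.get(i, ["a"]) for i in range(n)))
    return n, rows


result = asyncio.run(demo())
```

Threads keep the event loop responsive but do not make the underlying backend faster, so a native async backend is preferable where one exists.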

Benchmarks

Each benchmark uses 1000 frames from each of two datasets: ethanol conformers (small molecules, fixed size) and LeMat-Traj (periodic structures, variable atom counts). All frames include energy, forces, and stress. asebytes is compared against aselmdb, znh5md, extxyz, and SQLite. Results are plotted on a log scale; lower is better.

# LeMat-Traj benchmark data
lemat = list(ASEIO("optimade://LeMaterial/LeMat-Traj", split="train", name="compatible_pbe")[:1000])

Note: HDF5 performance is heavily influenced by compression and chunking settings. Both asebytes H5MD and znh5md use gzip compression by default, which reduces file size at the cost of read/write speed. The Zarr backend uses Blosc/LZ4 compression, which achieves compact file sizes with faster decompression than gzip.

View benchmark dashboard


Release history

Version	Release date
0.3.3	Jun 1, 2026
0.3.2 (this version)	May 7, 2026
0.3.1	Mar 10, 2026
0.3.0	Mar 10, 2026
0.3.0a3 (pre-release)	Mar 1, 2026
0.3.0a2 (pre-release)	Mar 1, 2026
0.3.0a1 (pre-release)	Feb 28, 2026
0.2.1	Feb 26, 2026
0.2.0	Feb 20, 2026
0.1.7	Dec 13, 2025
0.1.6	Nov 14, 2025
0.1.5	Nov 7, 2025
0.1.4	Nov 7, 2025
0.1.3	Nov 7, 2025
0.1.2	Nov 7, 2025
0.1.1	Nov 7, 2025
0.1.0	Nov 6, 2025

