Store agent conversation logs (e.g. Claude Code JSONL) as HDF5 for cheaper context reconstruction, analytical queries, and self-contained provenance.

These details have been verified by PyPI

Maintainers

mlarsonhdf

Project description

agentic-conversations-hdf5

This repo explores two ways to use HDF5 in agentic AI pipelines.

The main body of the repo stores Claude Code session logs as HDF5 instead of JSONL. Recent messages can then be recovered with hyperslab reads rather than full-file scans, and aggregate statistics can be computed faster. Tool call data lives in the same file as any numerical artifacts the agent produced.

The subfolder claude-mem-vectors focuses on using HDF5 as a drop-in vector store backend for claude-mem, replacing ChromaDB. Three HDF5 layout variants (VLEN, packed, compound) are measured against SQLite+BLOB and in-memory baselines on the exact interface claude-mem's ChromaSync exercises.

Conversation Log Storage

Install

pip install -e .
# for benchmarks:
pip install -e ".[bench]"

Live session recording (hook)

The easiest way to capture sessions is the live hook, which writes incrementally to HDF5 as Claude Code runs — no post-hoc conversion needed.

pip install -e .
agentic-conversations-hdf5 setup-hook

That one command patches ~/.claude/settings.json to register hooks on UserPromptSubmit and Stop. Sessions are written to ~/.claude/hdf5-sessions/<session-id>.h5 by default.
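As a rough illustration of what registering hooks on those two events could involve, the sketch below merges a command hook into a settings file without disturbing its other keys. The nested `hooks` layout is an assumption based on Claude Code's settings format, and the command string `agentic-hdf5-hook` and the function `register_hooks` are hypothetical names, not taken from this package.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical command name; the real hook entry point is not shown in this README.
HOOK_COMMAND = "agentic-hdf5-hook"
EVENTS = ("UserPromptSubmit", "Stop")


def register_hooks(settings_path: Path, command: str = HOOK_COMMAND) -> dict:
    """Add `command` as a hook for each event, leaving other settings untouched."""
    settings = json.loads(settings_path.read_text()) if settings_path.exists() else {}
    hooks = settings.setdefault("hooks", {})
    for event in EVENTS:
        entries = hooks.setdefault(event, [])
        already = any(
            h.get("command") == command
            for entry in entries
            for h in entry.get("hooks", [])
        )
        if not already:  # idempotent: running setup twice adds nothing new
            entries.append({"hooks": [{"type": "command", "command": command}]})
    settings_path.write_text(json.dumps(settings, indent=2))
    return settings


# Demonstrate on a throwaway file rather than the real ~/.claude/settings.json.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "settings.json"
    path.write_text(json.dumps({"model": "example"}))
    register_hooks(path)
    patched = register_hooks(path)  # second call should be a no-op
```

Making the merge idempotent means running the setup command more than once would not stack duplicate hook entries.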

# Inspect a live session (while Claude Code is running or after):
agentic-conversations-hdf5 inspect ~/.claude/hdf5-sessions/<session-id>.h5

To write files to a different directory:

agentic-conversations-hdf5 setup-hook --output-dir ~/my-sessions
# or set the env var when the hook runs:
export AGENTIC_HDF5_DIR=~/my-sessions
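One plausible way a hook could resolve its output path from that environment variable is sketched below. The function `session_path` is hypothetical, and how the real hook combines `--output-dir` with `AGENTIC_HDF5_DIR` is not stated here, so this covers only the env-var-or-default case.

```python
import os
from pathlib import Path

# Default location stated in this README.
DEFAULT_DIR = "~/.claude/hdf5-sessions"


def session_path(session_id, environ=os.environ):
    """Return the .h5 path for a session (hypothetical helper, not the package's API)."""
    base = environ.get("AGENTIC_HDF5_DIR") or DEFAULT_DIR
    return Path(base).expanduser() / f"{session_id}.h5"


custom = session_path("abc123", {"AGENTIC_HDF5_DIR": "/tmp/my-sessions"})
default = session_path("abc123", {})
```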

To remove the hook:

agentic-conversations-hdf5 teardown-hook

The hook never blocks Claude Code — all errors are swallowed silently so a broken HDF5 install cannot interrupt your session.
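The error-swallowing behaviour described above can be pictured as a thin wrapper around the recording step. This is a minimal sketch under that assumption; `run_hook_safely` and `broken_handler` are hypothetical names, not the package's code.

```python
def run_hook_safely(handler, payload):
    """Run `handler(payload)` and return exit code 0 whatever happens."""
    try:
        handler(payload)
    except Exception:
        # Swallow everything: a failed recording should never fail the session.
        pass
    return 0


def broken_handler(payload):
    # Simulates a broken HDF5 install surfacing at call time.
    raise ImportError("h5py is not installed")


exit_code = run_hook_safely(broken_handler, {"session_id": "demo"})
```

A hook script built this way would pass the return value to `sys.exit`, so the caller always sees a successful exit. The trade-off is that recording failures go unnoticed unless the output files are inspected.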

Converting a Claude Code session (post-hoc)

agentic-conversations-hdf5 convert \
    ~/.claude/projects/<encoded-cwd>/<session-id>.jsonl \
    -o session.h5

agentic-conversations-hdf5 inspect session.h5
agentic-conversations-hdf5 tail session.h5 <session-id> -n 5

Multiple sessions can share one file:

agentic-conversations-hdf5 convert \
    ~/.claude/projects/<encoded-cwd>/*.jsonl \
    -o all-sessions.h5
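To make the multi-session case concrete, here is a sketch of grouping JSONL records by session before writing each group under its own `/sessions/<sid>/` path. The JSONL field names (`sessionId`, `uuid`, `parentUuid`, `type`) are assumptions about the Claude Code log format, the sample records are invented, and `group_by_session` is a hypothetical helper rather than the converter's actual code.

```python
import json
from collections import defaultdict

# Invented sample records; field names are assumptions about the log format.
SAMPLE_JSONL = """\
{"sessionId": "s1", "uuid": "u1", "parentUuid": null, "type": "user"}
{"sessionId": "s1", "uuid": "u2", "parentUuid": "u1", "type": "assistant"}
{"sessionId": "s2", "uuid": "u3", "parentUuid": null, "type": "user"}
"""


def group_by_session(lines):
    """Collect (uuid, parent_uuid, type) rows per session id, skipping blank lines."""
    sessions = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        sessions[record["sessionId"]].append(
            (record["uuid"], record.get("parentUuid") or "", record["type"])
        )
    return dict(sessions)


grouped = group_by_session(SAMPLE_JSONL.splitlines())
# Each session would land in its own group inside the shared file.
targets = {sid: f"/sessions/{sid}/messages" for sid in grouped}
```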

Backends

HDF5Session implements the SessionBackend interface; SQLite and JSON+NumPy backends implement the same interface for benchmark comparison.

Identifier columns are VLEN UTF-8 strings. Unbounded content (content_text, content_json, tool args/results) is packed into flat uint8 byte buffers with a compound offset/length index, so the gzip filter compresses the content itself; HDF5 does not apply filters to variable-length data, which is stored outside the dataset's chunks. Token usage is a standalone compound numeric dataset for one-read analytical queries. Embeddings, when present, are a single consolidated (N, D) dataset.
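The packed layout and the tail read it enables can be sketched with h5py as follows. The dataset names `content_index` and `content_bytes` come from the schema summary in this README; the field names `offset` and `length`, the helper functions, and the sample messages are assumptions for illustration, not the package's implementation.

```python
import h5py
import numpy as np

# Assumed field names for the compound offset/length index.
INDEX_DTYPE = np.dtype([("offset", "<u8"), ("length", "<u8")])


def write_packed(group, texts):
    """Store UTF-8 texts as one uint8 buffer plus an (offset, length) index."""
    encoded = [t.encode("utf-8") for t in texts]
    index = np.zeros(len(encoded), dtype=INDEX_DTYPE)
    index["length"] = [len(b) for b in encoded]
    index["offset"] = np.cumsum(index["length"]) - index["length"]
    buf = np.frombuffer(b"".join(encoded), dtype=np.uint8)
    group.create_dataset("content_index", data=index)
    group.create_dataset("content_bytes", data=buf, chunks=True, compression="gzip")


def tail(group, n):
    """Read only the last n messages: one slice of the index, one of the bytes."""
    idx_ds = group["content_index"]
    total = idx_ds.shape[0]
    idx = idx_ds[max(0, total - n):total]  # hyperslab read of the index rows
    if len(idx) == 0:
        return []
    start = int(idx["offset"][0])
    stop = int(idx["offset"][-1] + idx["length"][-1])
    raw = group["content_bytes"][start:stop].tobytes()  # hyperslab read of the bytes
    return [
        raw[int(o) - start:int(o) - start + int(l)].decode("utf-8")
        for o, l in zip(idx["offset"], idx["length"])
    ]


# In-memory file so the example leaves nothing on disk.
with h5py.File("demo.h5", "w", driver="core", backing_store=False) as f:
    msgs = f.create_group("sessions/demo/messages")
    write_packed(msgs, ["hello", "run the tests", "héllo wörld", "done"])
    last_two = tail(msgs, 2)
```

Because offsets are in bytes rather than characters, multi-byte UTF-8 text round-trips without special handling, and the amount of data read depends on the size of the requested tail rather than the size of the session.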

Schema

Full schema is in docs/schema.md. The short version:

/sessions/<sid>/
    messages/       — uuid, parent_uuid, type, role, model (VLEN str), timestamp,
                      usage (compound), content_index + content_bytes (packed text),
                      has_embedding, embeddings (N, D)
    tool_calls/     — one row per tool invocation, joined by tool_use_id;
                      args/result packed into call_index + call_bytes
    artifacts/      — arbitrary binary outputs (figures, arrays, etc.)

parent_uuid is preserved verbatim, so forks in the source log survive the round-trip. The usage compound dataset is the main analytical win: total cache tokens for a session is arr["cache_read_input_tokens"].sum() — one read, no JSON parsing.
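The one-read aggregate can be tried without an HDF5 file by using a NumPy structured array as a stand-in for the compound dataset. Only `cache_read_input_tokens` is named in this README; the other field names, the dataset path in the comment, and the token counts are invented for illustration.

```python
import numpy as np

# Field names other than cache_read_input_tokens are assumptions.
USAGE_DTYPE = np.dtype([
    ("input_tokens", "<i8"),
    ("output_tokens", "<i8"),
    ("cache_read_input_tokens", "<i8"),
])

# Stand-in for something like `arr = f["/sessions/<sid>/messages/usage"][...]`
# read through h5py; the values here are made up.
arr = np.array(
    [(120, 45, 0), (80, 210, 1500), (95, 60, 3200)],
    dtype=USAGE_DTYPE,
)

# Column access on a structured array needs no per-row parsing.
total_cache_reads = int(arr["cache_read_input_tokens"].sum())
total_output = int(arr["output_tokens"].sum())
```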

Benchmarks

python benchmarks/benchmark.py --quick

Synthetic sessions are generated via benchmarks/gen_synthetic.py. Test fixtures at three scales (2 MB, 25 MB, 250 MB) live in tests/fixtures/.

HDF5 as a Vector Store for claude-mem

Source and benchmarks are in claude-mem-vectors/. The VectorStore ABC in claude-mem-vectors/store/vector_store.py mirrors exactly the interface ChromaSync calls: upsert, delete, query with metadata where filters, list_ids, and update_metadata. Swapping backends requires no changes above the store layer.
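The shape of such an interface might look like the sketch below, with a toy in-memory backend to show that callers depend only on the abstract methods. The method names come from the list above; every signature, the `InMemoryStore` class, and the choice of Euclidean distance are assumptions, not the contents of vector_store.py.

```python
from abc import ABC, abstractmethod
import math


class VectorStore(ABC):
    """Hypothetical shape of the interface; signatures are assumptions."""

    @abstractmethod
    def upsert(self, ids, embeddings, metadatas): ...

    @abstractmethod
    def delete(self, ids): ...

    @abstractmethod
    def query(self, embedding, n_results=5, where=None): ...

    @abstractmethod
    def list_ids(self): ...

    @abstractmethod
    def update_metadata(self, id_, metadata): ...


class InMemoryStore(VectorStore):
    """Toy backend: a dict of id -> (embedding, metadata)."""

    def __init__(self):
        self._rows = {}

    def upsert(self, ids, embeddings, metadatas):
        for i, e, m in zip(ids, embeddings, metadatas):
            self._rows[i] = (list(e), dict(m))

    def delete(self, ids):
        for i in ids:
            self._rows.pop(i, None)

    def query(self, embedding, n_results=5, where=None):
        where = where or {}
        # Keep rows whose metadata matches every key in `where`, nearest first.
        hits = [
            (math.dist(embedding, e), i)
            for i, (e, m) in self._rows.items()
            if all(m.get(k) == v for k, v in where.items())
        ]
        return [i for _, i in sorted(hits)[:n_results]]

    def list_ids(self):
        return sorted(self._rows)

    def update_metadata(self, id_, metadata):
        self._rows[id_][1].update(metadata)


store = InMemoryStore()
store.upsert(
    ["a", "b", "c"],
    [[0.0, 0.0], [1.0, 0.0], [0.1, 0.0]],
    [{"project": "x"}, {"project": "y"}, {"project": "x"}],
)
store.update_metadata("b", {"project": "x"})
store.delete(["a"])
nearest = store.query([0.0, 0.0], n_results=2, where={"project": "x"})
```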

The same three layout variants from the conversation log portion appear here: VLEN, packed, and compound, applied to embedding metadata rather than conversation turns.

Using HDF5 as your claude-mem backend

An MCP shim server that acts as a drop-in replacement for chroma-mcp is in claude-mem-vectors/mcp_server/. It exposes the same chroma_* tool interface claude-mem calls, backed by your choice of HDF5 or SQLite. No changes to claude-mem are required — you change one line in your MCP config.

See claude-mem-vectors/mcp_server/README.md for installation and configuration instructions.

Results and benchmarks

Full benchmark tables, design comparison, and reproduction instructions are in claude-mem-vectors/results.md.

The short version: for claude-mem's hook-driven write pattern (one document per hook call), SQLite is the practical choice. Flushing the file through h5py (File.flush()) dominates upsert cost regardless of layout, putting all three HDF5 variants ~200× behind SQLite at batch size 1. HDF5 earns its place if sessions also store large numerical artifacts alongside embeddings, which is the scenario where its hierarchical structure adds something SQLite cannot match.
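Why batch size 1 is the worst case for a flush-dominated store can be seen by counting flushes rather than timing them. The toy class below touches no disk and measures nothing; `CountingStore` and `flushes_for` are hypothetical names, and the batch size of 64 is an arbitrary choice for the comparison.

```python
class CountingStore:
    """Toy store that counts flushes instead of touching disk."""

    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.pending = []
        self.flushes = 0

    def upsert(self, doc):
        self.pending.append(doc)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.pending:
            self.pending.clear()
            self.flushes += 1  # stands in for one file-level flush


def flushes_for(n_docs, batch_size):
    """Count flushes needed to persist n_docs at a given batch size."""
    store = CountingStore(batch_size)
    for i in range(n_docs):
        store.upsert({"id": i})
    store.flush()  # write out any remainder
    return store.flushes


per_call = flushes_for(1000, batch_size=1)
batched = flushes_for(1000, batch_size=64)
```

With one document per hook call there is nothing to batch, so every upsert pays the full flush cost. That matches the write pattern described above, where batching is not available to the caller.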

Tests

pip install -e ".[dev]"
pytest

Project details

This version

1.0.0

May 26, 2026

Download files

Source Distribution

agentic_conversations_hdf5-1.0.0.tar.gz (28.4 kB)

Uploaded May 26, 2026 Source

Built Distribution

agentic_conversations_hdf5-1.0.0-py3-none-any.whl (28.3 kB)

Uploaded May 26, 2026 Python 3

File details

Details for the file agentic_conversations_hdf5-1.0.0.tar.gz.

File metadata

Download URL: agentic_conversations_hdf5-1.0.0.tar.gz
Upload date: May 26, 2026
Size: 28.4 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for agentic_conversations_hdf5-1.0.0.tar.gz
Algorithm	Hash digest
SHA256	`3ddfee482d345ee64c785adbda5050b73b320b337b661a40fa8ff39a9af276e7`
MD5	`6f6e20b339b6de2232a8028e7da5c25f`
BLAKE2b-256	`376c5c89ba916ab61dd99886ed5b5395c0f60cd1af84f2e8ca7288070cb7a773`

Provenance

The following attestation bundles were made for agentic_conversations_hdf5-1.0.0.tar.gz:

Publisher: release.yml on mattjala/agentic-conversations-hdf5

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: agentic_conversations_hdf5-1.0.0.tar.gz
- Subject digest: 3ddfee482d345ee64c785adbda5050b73b320b337b661a40fa8ff39a9af276e7
- Sigstore transparency entry: 1634587405
- Sigstore integration time: May 26, 2026
Source repository:
- Permalink: mattjala/agentic-conversations-hdf5@cd876ce6ac4ab0b9337ea95b4776789d0ec652b4
- Branch / Tag: refs/heads/main
- Owner: https://github.com/mattjala
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@cd876ce6ac4ab0b9337ea95b4776789d0ec652b4
- Trigger Event: workflow_dispatch

File details

Details for the file agentic_conversations_hdf5-1.0.0-py3-none-any.whl.

File metadata

Download URL: agentic_conversations_hdf5-1.0.0-py3-none-any.whl
Upload date: May 26, 2026
Size: 28.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for agentic_conversations_hdf5-1.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`78e5edc69dca49f2fee7064f98a6980db91713136bb0db21c7c68714ee99b094`
MD5	`63920f1dafb39bf3672882a98e8aa8f6`
BLAKE2b-256	`11f44f03de2340687c99e01ae474a08830f331cf8160cd170027b6717af41562`

Provenance

The following attestation bundles were made for agentic_conversations_hdf5-1.0.0-py3-none-any.whl:

Publisher: release.yml on mattjala/agentic-conversations-hdf5

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: agentic_conversations_hdf5-1.0.0-py3-none-any.whl
- Subject digest: 78e5edc69dca49f2fee7064f98a6980db91713136bb0db21c7c68714ee99b094
- Sigstore transparency entry: 1634587459
- Sigstore integration time: May 26, 2026
Source repository:
- Permalink: mattjala/agentic-conversations-hdf5@cd876ce6ac4ab0b9337ea95b4776789d0ec652b4
- Branch / Tag: refs/heads/main
- Owner: https://github.com/mattjala
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@cd876ce6ac4ab0b9337ea95b4776789d0ec652b4
- Trigger Event: workflow_dispatch
