
Store agent conversation logs (e.g. Claude Code JSONL) as HDF5 for cheaper context reconstruction, analytical queries, and self-contained provenance.


agentic-conversations-hdf5


This repo explores two ways to use HDF5 in agentic AI pipelines.

The main body of the repo stores Claude Code session logs as HDF5 instead of JSONL. Recent messages can then be recovered with hyperslab reads rather than full-file scans, and aggregate statistics are faster to compute. Tool call data lives in the same file as any numerical artifacts the agent produced.
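The hyperslab read mentioned above can be sketched with h5py. This is a minimal illustration rather than the project's reader: the dataset name `timestamp`, the chunk size, and the helper `tail_rows` are assumptions made for the example, and the file is kept in memory so nothing is written to disk.

```python
import h5py
import numpy as np


def tail_rows(dataset, n):
    """Read only the last n rows of a 1-D dataset with one contiguous slice."""
    total = dataset.shape[0]
    start = max(total - n, 0)
    # h5py translates this slice into a hyperslab selection, so only the
    # chunks that overlap [start, total) are read.
    return dataset[start:total]


# In-memory file (core driver, no backing store) so the sketch leaves no file behind.
with h5py.File("tail-demo.h5", "w", driver="core", backing_store=False) as f:
    # Hypothetical layout: one float64 timestamp per message, stored in chunks.
    ts = f.create_dataset(
        "timestamp", data=np.arange(10_000, dtype="f8"), chunks=(1024,)
    )
    last_five = tail_rows(ts, 5)

print(last_five)
```

A JSONL reader has to scan from the start of the file to find the last five records; here the slice bounds alone determine which part of the dataset is touched.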

The subfolder claude-mem-vectors focuses on using HDF5 as a drop-in vector store backend for claude-mem, replacing ChromaDB. Three HDF5 layout variants (VLEN, packed, compound) are measured against SQLite+BLOB and in-memory baselines on the exact interface claude-mem's ChromaSync exercises.


Conversation Log Storage

Install

pip install -e .
# for benchmarks:
pip install -e ".[bench]"

Live session recording (hook)

The easiest way to capture sessions is the live hook, which writes incrementally to HDF5 as Claude Code runs — no post-hoc conversion needed.

pip install -e .
agentic-conversations-hdf5 setup-hook

That one command patches ~/.claude/settings.json to register hooks on UserPromptSubmit and Stop. Sessions are written to ~/.claude/hdf5-sessions/<session-id>.h5 by default.
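As a rough picture of what such a patch involves, the sketch below registers a command on both events in a settings dictionary without duplicating entries. It assumes the hooks layout used by Claude Code settings (event name mapped to a list of groups, each with a `hooks` list of `{"type": "command", ...}` entries). The function `register_hooks` and the command string `my-hdf5-hook` are hypothetical, not what `setup-hook` actually writes.

```python
import json

HOOK_EVENTS = ("UserPromptSubmit", "Stop")


def register_hooks(settings, command):
    """Return a copy of settings with command registered once per event."""
    patched = json.loads(json.dumps(settings))  # deep copy of JSON-compatible data
    hooks = patched.setdefault("hooks", {})
    for event in HOOK_EVENTS:
        # Build a fresh entry per event so the two lists never share an object.
        entry = {"hooks": [{"type": "command", "command": command}]}
        groups = hooks.setdefault(event, [])
        if entry not in groups:  # keep the patch idempotent
            groups.append(entry)
    return patched


# Hypothetical existing settings and command string.
existing = {"model": "example", "hooks": {"Stop": []}}
patched = register_hooks(existing, "my-hdf5-hook")
print(json.dumps(patched, indent=2))
```

Running the function a second time on its own output changes nothing, which is the property that makes a setup command safe to repeat.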

# Inspect a live session (while Claude Code is running or after):
agentic-conversations-hdf5 inspect ~/.claude/hdf5-sessions/<session-id>.h5

To write files to a different directory:

agentic-conversations-hdf5 setup-hook --output-dir ~/my-sessions
# or set the env var when the hook runs:
export AGENTIC_HDF5_DIR=~/my-sessions

To remove the hook:

agentic-conversations-hdf5 teardown-hook

The hook never blocks Claude Code — all errors are swallowed silently so a broken HDF5 install cannot interrupt your session.
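The swallow-everything behaviour can be pictured as a thin wrapper around the hook body. This is a sketch under assumptions, not the package's code: `run_hook_safely` and `broken_handler` are hypothetical names, and a real entry point would presumably also exit with status 0 whatever the result.

```python
def run_hook_safely(handler, payload):
    """Call handler(payload); report failure as False instead of raising."""
    try:
        handler(payload)
        return True
    except Exception:
        # Deliberately silent: a logging failure must not surface in the session.
        return False


def broken_handler(payload):
    # Stands in for a hook whose HDF5 dependency is missing or broken.
    raise OSError("HDF5 library not available")


ok = run_hook_safely(broken_handler, {"session_id": "demo"})
print(ok)
```

The trade-off is that failures are invisible, so inspecting the output file is the way to confirm that recording is actually happening.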

Converting a Claude Code session (post-hoc)

agentic-conversations-hdf5 convert \
    ~/.claude/projects/<encoded-cwd>/<session-id>.jsonl \
    -o session.h5

agentic-conversations-hdf5 inspect session.h5
agentic-conversations-hdf5 tail session.h5 <session-id> -n 5

Multiple sessions can share one file:

agentic-conversations-hdf5 convert \
    ~/.claude/projects/<encoded-cwd>/*.jsonl \
    -o all-sessions.h5

Backends

HDF5Session implements the SessionBackend interface; SQLite and JSON+NumPy backends implement the same interface for benchmark comparison.

Identifier columns are VLEN UTF-8 strings. Unbounded content (content_text, content_json, tool args/results) is packed into flat uint8 byte buffers with a compound offset/length index, so the gzip filter compresses the content itself (HDF5 filters do not reach variable-length data, which is stored outside the filtered chunks). Token usage is a standalone compound numeric dataset for one-read analytical queries. Embeddings, when present, are a single consolidated (N, D) dataset.
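The packed layout is easy to picture in plain NumPy. The sketch below is an assumption-laden illustration: the field names `offset` and `length` follow the description above, but the integer widths and the helpers `pack_texts` and `unpack_text` are invented for the example. In the real file both arrays would be HDF5 datasets.

```python
import numpy as np

# Assumed index layout: byte offset into the buffer plus byte length per entry.
INDEX_DTYPE = np.dtype([("offset", "<u8"), ("length", "<u4")])


def pack_texts(texts):
    """Concatenate UTF-8 encoded texts into one uint8 buffer plus an index."""
    encoded = [t.encode("utf-8") for t in texts]
    index = np.zeros(len(encoded), dtype=INDEX_DTYPE)
    index["length"] = [len(b) for b in encoded]
    # Each offset is the total length of everything stored before it.
    index["offset"][1:] = np.cumsum(index["length"][:-1])
    buf = np.frombuffer(b"".join(encoded), dtype=np.uint8)
    return index, buf


def unpack_text(index, buf, i):
    """Recover entry i by slicing the buffer with its offset and length."""
    off = int(index["offset"][i])
    n = int(index["length"][i])
    return buf[off:off + n].tobytes().decode("utf-8")


index, buf = pack_texts(["hello", "naïve café", ""])
print(unpack_text(index, buf, 1))
```

Because the buffer is an ordinary fixed-type array, a chunked and gzip-filtered dataset compresses the text bytes directly, and reading one message is a single slice once its index row is known.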

Schema

Full schema is in docs/schema.md. The short version:

/sessions/<sid>/
    messages/       — uuid, parent_uuid, type, role, model (VLEN str), timestamp,
                      usage (compound), content_index + content_bytes (packed text),
                      has_embedding, embeddings (N, D)
    tool_calls/     — one row per tool invocation, joined by tool_use_id;
                      args/result packed into call_index + call_bytes
    artifacts/      — arbitrary binary outputs (figures, arrays, etc.)

parent_uuid is preserved verbatim, so forks in the source log survive the round-trip. The usage compound dataset is the main analytical win: total cache tokens for a session is arr["cache_read_input_tokens"].sum() — one read, no JSON parsing.
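Reading a compound HDF5 dataset with h5py yields a NumPy structured array, so the aggregate above can be tried on a hand-built array. The field `cache_read_input_tokens` comes from the text; the other two field names, the integer widths, and the sample values are assumptions for illustration only.

```python
import numpy as np

# Assumed subset of the usage record; the real schema is in docs/schema.md.
USAGE_DTYPE = np.dtype([
    ("input_tokens", "<i8"),
    ("output_tokens", "<i8"),
    ("cache_read_input_tokens", "<i8"),
])

# Three made-up messages standing in for one session's usage dataset.
arr = np.array(
    [(120, 45, 0), (30, 210, 1500), (12, 80, 1620)],
    dtype=USAGE_DTYPE,
)

# Selecting a field gives a plain numeric column; no per-row JSON decoding.
total_cache_read = int(arr["cache_read_input_tokens"].sum())
total_output = int(arr["output_tokens"].sum())
print(total_cache_read, total_output)
```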

Benchmarks

python benchmarks/benchmark.py --quick

Synthetic sessions are generated via benchmarks/gen_synthetic.py. Test fixtures at three scales (2 MB, 25 MB, 250 MB) live in tests/fixtures/.


HDF5 as a Vector Store for claude-mem

Source and benchmarks are in claude-mem-vectors/. The VectorStore ABC in claude-mem-vectors/store/vector_store.py mirrors exactly the interface ChromaSync calls: upsert, delete, query with metadata where filters, list_ids, and update_metadata. Swapping backends requires no changes above the store layer.
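To make the shape of that interface concrete, here is a sketch of an abstract base class with the five operations named above and a toy in-memory backend. The method names come from the text; every signature, the return types, the equality-only `where` filter, and the `InMemoryStore` class are assumptions, and the real definitions live in claude-mem-vectors/store/vector_store.py.

```python
import math
from abc import ABC, abstractmethod


class VectorStore(ABC):
    """Hypothetical signatures for the five operations named in the text."""

    @abstractmethod
    def upsert(self, ids, embeddings, metadatas): ...

    @abstractmethod
    def delete(self, ids): ...

    @abstractmethod
    def query(self, embedding, n_results, where=None): ...

    @abstractmethod
    def list_ids(self): ...

    @abstractmethod
    def update_metadata(self, doc_id, metadata): ...


class InMemoryStore(VectorStore):
    """Toy backend: a dict of id -> (embedding, metadata)."""

    def __init__(self):
        self._rows = {}

    def upsert(self, ids, embeddings, metadatas):
        for doc_id, emb, meta in zip(ids, embeddings, metadatas):
            self._rows[doc_id] = (list(emb), dict(meta))

    def delete(self, ids):
        for doc_id in ids:
            self._rows.pop(doc_id, None)

    def query(self, embedding, n_results, where=None):
        where = where or {}
        # Keep rows whose metadata matches every where key, then rank by distance.
        hits = [
            (math.dist(embedding, emb), doc_id)
            for doc_id, (emb, meta) in self._rows.items()
            if all(meta.get(k) == v for k, v in where.items())
        ]
        return [doc_id for _, doc_id in sorted(hits)[:n_results]]

    def list_ids(self):
        return sorted(self._rows)

    def update_metadata(self, doc_id, metadata):
        emb, meta = self._rows[doc_id]
        self._rows[doc_id] = (emb, {**meta, **metadata})


store = InMemoryStore()
store.upsert(
    ["a", "b", "c"],
    [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]],
    [{"project": "x"}, {"project": "y"}, {"project": "x"}],
)
print(store.query([0.9, 0.1], n_results=2, where={"project": "x"}))
```

Code written against the base class does not care which subclass it receives, which is what lets an HDF5 or SQLite backend be substituted below the store layer.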

The same three layout variants from the conversation log portion appear here: VLEN, packed, and compound, applied to embedding metadata rather than conversation turns.

Using HDF5 as your claude-mem backend

An MCP shim server that acts as a drop-in replacement for chroma-mcp is in claude-mem-vectors/mcp_server/. It exposes the same chroma_* tool interface claude-mem calls, backed by your choice of HDF5 or SQLite. No changes to claude-mem are required — you change one line in your MCP config.

See claude-mem-vectors/mcp_server/README.md for installation and configuration instructions.

Results and benchmarks

Full benchmark tables, design comparison, and reproduction instructions are in claude-mem-vectors/results.md.

The short version: for claude-mem's hook-driven write pattern (one document per hook call), SQLite is the practical choice. The per-call flush (h5py's File.flush()) dominates upsert cost regardless of layout, putting all three HDF5 variants ~200× behind SQLite at batch size 1. HDF5 earns its place if sessions also store large numerical artifacts alongside embeddings, the scenario where its hierarchical structure adds something SQLite cannot match.


Tests

pip install -e ".[dev]"
pytest
