
ChromaDB virtual filesystem backend for deepagents — instant session creation, zero marginal compute cost.


deepagents-chromafs

A read-only BackendProtocol backend for DeepAgents that treats a ChromaDB collection as a virtual filesystem.

Inspired by the ChromaFs algorithm from Mintlify: it replaces an expensive sandbox boot (~46 s) with an in-memory virtual filesystem bootstrapped from a single Chroma document (~100 ms).


How it works

Path tree

The entire directory tree is stored as a single JSON document in Chroma under the key __path_tree__:

{
    "auth/oauth.md": { "isPublic": true, "groups": [] },
    "auth/api-keys.mdx": { "isPublic": true, "groups": [] },
    "internal/billing.md": { "isPublic": false, "groups": ["admin", "billing"] }
}

Slug format contract: every key must exactly match the page_slug metadata on each chunk in the same collection. Slugs may or may not carry a file extension — auth/oauth.md, Makefile, and Dockerfile are all valid. Extension-based glob patterns (**/*.md, **/*.py) only match slugs that include the corresponding extension; slugs without an extension simply won't match those patterns, which is the expected behavior.
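
Concretely, with the example tree above (backend is a ChromaFsBackend, as set up in the quick start below):

result = backend.glob("**/*.md")
# Matches auth/oauth.md and internal/billing.md (RBAC permitting);
# auth/api-keys.mdx and extensionless slugs like Makefile do not match.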

The path tree document may optionally be gzip-compressed and base64-encoded. On bootstrap, the backend fetches this document, applies RBAC filtering (hiding paths the user cannot access), and builds an in-memory directory index — no further network calls are needed for ls, glob, or path-scoping.
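
A minimal sketch of that bootstrap, assuming the stored form is base64(gzip(json)); the helper below is illustrative, not the library's internals:

import base64
import gzip
import json

def load_path_tree(collection, user_groups: frozenset) -> dict:
    """Fetch __path_tree__, decode it, and drop paths the user cannot access."""
    raw = collection.get(ids=["__path_tree__"])["documents"][0]
    try:
        # Stored form may be base64-encoded gzip; fall back to raw JSON.
        raw = gzip.decompress(base64.b64decode(raw)).decode("utf-8")
    except (ValueError, OSError):
        pass
    tree = json.loads(raw)
    return {
        slug: meta
        for slug, meta in tree.items()
        if meta["isPublic"] or user_groups & set(meta["groups"])
    }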

Content (cat)

Page content is stored as chunks in Chroma, each with page_slug and chunk_index metadata fields. On the first read of a page, all of its chunks are fetched, sorted by chunk_index, joined, and cached for the session lifetime.
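
A sketch of what that first read amounts to, using the raw Chroma API (the real backend adds caching; the join separator here is a guess):

def fetch_page(collection, slug: str) -> str:
    """Reassemble one page from its chunks, ordered by chunk_index."""
    res = collection.get(where={"page_slug": slug})
    chunks = sorted(
        zip(res["metadatas"], res["documents"]),
        key=lambda pair: pair[0]["chunk_index"],
    )
    return "".join(doc for _, doc in chunks)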

Grep (4-step pipeline)

  1. Scope — derive candidate slugs from the in-memory tree (limited to the requested path / glob).
  2. Coarse filter — Chroma $contains / $regex on where_document to find matching chunks.
  3. Bulk prefetch — fetch all matched page slugs concurrently into the in-memory cache.
  4. Fine filter — in-memory regex on cached content to produce line-level GrepMatch results.
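
A simplified sketch of the pipeline (the real backend batches its Chroma calls, prefetches concurrently, and returns structured GrepMatch results; fetch_page is the helper sketched in the cat section above):

import re
from fnmatch import fnmatch

def grep_pages(collection, tree: dict, pattern: str, glob: str = "*"):
    # 1. Scope: candidate slugs from the in-memory tree.
    #    fnmatch stands in for the backend's real glob semantics.
    slugs = [slug for slug in tree if fnmatch(slug, glob)]

    # 2. Coarse filter: let Chroma discard chunks that lack the literal
    #    text ($regex is the where_document alternative for patterns).
    hits = collection.get(
        where={"page_slug": {"$in": slugs}},
        where_document={"$contains": pattern},
        include=["metadatas"],
    )
    matched = {meta["page_slug"] for meta in hits["metadatas"]}

    # 3. Bulk prefetch (sequential here, concurrent in the backend).
    pages = {slug: fetch_page(collection, slug) for slug in matched}

    # 4. Fine filter: line-level regex over the cached content.
    rx = re.compile(pattern)
    for slug, text in pages.items():
        for lineno, line in enumerate(text.splitlines(), start=1):
            if rx.search(line):
                yield slug, lineno, line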

Write operations

All write operations (write, edit, upload_files) return an EROFS error. The filesystem is stateless by design.


Installation

pip install deepagents-chromafs

Or with uv:

uv add deepagents-chromafs

With Redis cache support:

pip install deepagents-chromafs[redis]
# or
uv add deepagents-chromafs[redis]

Quick start

import chromadb
from deepagents_chromafs import ChromaFsBackend

client = chromadb.Client()
collection = client.get_collection("my_docs")

backend = ChromaFsBackend(collection)

# List root directory
result = backend.ls("/")
for entry in result.entries:
    print(entry["path"], "dir" if entry.get("is_dir") else "file")

# Read a page
result = backend.read("/auth/oauth.md")
print(result.file_data["content"])

# Grep across all pages
result = backend.grep("OAuth2")
for match in result.matches:
    print(f"{match['path']}:{match['line']}: {match['text']}")

# Glob for files
result = backend.glob("**/*.md")
for entry in result.matches:
    print(entry["path"])

RBAC (group-based access control)

backend = ChromaFsBackend(
    collection,
    user_groups=frozenset({"admin", "billing"}),
)

Paths whose isPublic is False and whose groups list does not intersect with user_groups are hidden from the tree entirely — they do not appear in ls, glob, or grep results.
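
Applied to the example tree above, internal/billing.md (isPublic false, groups ["admin", "billing"]) is visible only when the user shares one of those groups:

# internal/billing.md is visible: the user is in "admin"
backend = ChromaFsBackend(collection, user_groups=frozenset({"admin"}))

# internal/billing.md is hidden: only the two public auth/ pages remain
backend = ChromaFsBackend(collection, user_groups=frozenset())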

Custom metadata field names

backend = ChromaFsBackend(
    collection,
    slug_field="doc_slug",        # default: "page_slug"
    chunk_index_field="seq",      # default: "chunk_index"
)

Redis cache (multi-session / multi-worker)

By default, page content is cached in-memory for the lifetime of the ChromaFsBackend instance. For multi-session or multi-worker deployments, plug in RedisContentCache to share the cache across processes:

import redis
from deepagents_chromafs import ChromaFsBackend
from deepagents_chromafs.redis_cache import RedisContentCache

cache = RedisContentCache(
    redis.Redis(host="localhost", port=6379, db=0),
    prefix="myapp",   # namespace — avoids key collisions between collections
    ttl=3600,         # seconds; 0 = no expiry
)

backend = ChromaFsBackend(collection, cache=cache)

Any ContentCache subclass is accepted, so you can wire in other backends (Memcached, DynamoDB, etc.) by overriding get, put, has, and clear.
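
As a skeletal illustration (the import path and method signatures here are assumptions; check the ContentCache base class for the exact interface):

from deepagents_chromafs import ContentCache  # assumed import path

class DictCache(ContentCache):
    """Trivial in-process cache; swap the dict for any key-value store."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}

    def get(self, key: str) -> str | None:
        return self._store.get(key)

    def put(self, key: str, content: str) -> None:
        self._store[key] = content

    def has(self, key: str) -> bool:
        return key in self._store

    def clear(self) -> None:
        self._store.clear()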


ChromaDB schema

Each page chunk document must have these metadata fields:

Field        Type  Description
page_slug    str   Page identifier including extension (e.g. auth/oauth.md)
chunk_index  int   Chunk ordering within the page

The path tree is stored as a single document with ID __path_tree__.
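
Putting the schema together, ingesting one page might look like this (the ID scheme and chunking are up to you; only the two metadata fields are required):

slug = "auth/oauth.md"
chunks = ["# OAuth\n\nIntro...", "## Token refresh\n..."]  # your own splitter

collection.add(
    ids=[f"{slug}#{i}" for i in range(len(chunks))],
    documents=chunks,
    metadatas=[{"page_slug": slug, "chunk_index": i} for i in range(len(chunks))],
)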

Preventing __path_tree__ from polluting search

By default, ChromaDB auto-generates an embedding for every document added via collection.add(), including __path_tree__. This wastes embedding compute and lets the tree document surface in semantic similarity searches (collection.query()). Two mitigations are recommended when inserting the path tree:

1. Zero-vector embedding (semantic search)

Pass an explicit zero vector so the document never wins a cosine similarity match:

EMBEDDING_DIM = 1536  # match your collection's embedding dimension

collection.add(
    ids=["__path_tree__"],
    documents=[tree_json],
    embeddings=[[0.0] * EMBEDDING_DIM],
)

2. Metadata marker (full-text / where_document queries)

Add a metadata field that lets you exclude the document from your own queries:

collection.add(
    ids=["__path_tree__"],
    documents=[tree_json],
    embeddings=[[0.0] * EMBEDDING_DIM],
    metadatas=[{"_system": True}],
)

Then filter it out in any custom where_document scan:

collection.get(
    where={"_system": {"$ne": True}},
    where_document={"$contains": "access_token"},
)

Note: ChromaFsBackend itself is not affected — its grep pipeline always scopes queries to page_slug metadata, so __path_tree__ (which has no page_slug) is naturally excluded from all results.


Development

# Install dev dependencies
make install

# Run tests
make test

# Lint
make lint

# Format
make format

Algorithm reference

See the ChromaFs algorithm post on Mintlify for the original description.
