Embedded, ontology-leaning, Arrow-native analytical GraphDB for KG and GNN MLOps workflows.
CaracalDB
An Embedded, Ontology-Leaning, Arrow-Native Analytical GraphDB for KG and GNN Workflows.
Why CaracalDB | Quickstart | API Overview | Architecture
CaracalDB is an embedded graph database for knowledge graphs, ontology-aware query planning, GNN sampling, and ML feature workflows. The current implementation is a Python reference engine that validates the .crcl storage format, Tuft query language, planner surface, and user-facing API. A Rust core is planned, but it is not part of the current package.
Quickstart
Install
pip install caracaldb
or
uv add caracaldb
For development from a repository checkout:
uv sync --extra dev
uv run pytest
30-Second Quickstart
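import caracaldb as cdb

# Open (or create) the "demo" database; the context manager closes it on exit.
with cdb.connect("demo") as db:
    # Register a class, insert one node of that class, then query it back.
    db.define_class("Gene")
    db.insert_nodes("Gene", [{"symbol": "TP53", "chromosome": "17"}])
    rows = db.sql("MATCH (g:Gene) RETURN g.symbol").rows()
    print(rows)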
The current Python reference query path supports a single MATCH (alias:Class) node pattern with WHERE, RETURN, and LIMIT. Broader graph patterns, richer binding, and multi-hop query execution are tracked in the milestone docs.
Start Here
- Language spec: docs/01_language_spec.md
- Engine spec: docs/02_engine_spec.md
- Modeling case study: docs/03_user_modeling_case_study.md
- Implementation plan: docs/04_caracaldb_implementation.md
- Work breakdown: docs/05_wbs.md
- Error index: docs/errors/TF-INDEX.md
- Examples: examples/
- Benchmark CI: .github/workflows/bench.yml
Why CaracalDB
CaracalDB is built around explicit storage, ontology, and execution boundaries:
flowchart LR
A["Tuft query"] --> B["Parser and diagnostics"]
B --> C["Binder and ontology catalog"]
C --> D["Logical plan"]
D --> E["Physical operators"]
E --> F["Arrow RecordBatch"]
G[".crcl bundle or packed file"] --> H["Catalog, WAL, snapshots, stores"]
H --> E
H --> I["CSR / CSC graph indexes"]
I --> J["Traversal, sampling, and ML adapters"]
- Embedded-first operation: no required server process.
- Tuft combines Cypher-like graph patterns with SPARQL-like ontology semantics.
- Arrow is the execution boundary for scan results and downstream analytics.
- CSR and CSC graph layouts support traversal, neighbor sampling, and GNN workflows.
- Snapshot, WAL, and packed .crcl storage paths are tested as first-class engine pieces.
- The Python API is intentionally small; Rust core work is planned after the reference behavior is stable.
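To make the CSR / CSC bullet above concrete, here is a minimal standard-library sketch of the two layouts and of uniform neighbor sampling. It does not reflect CaracalDB's actual graph/ module: the function names (build_csr, build_csc, sample_neighbors) and the integer node ids are assumptions chosen for illustration.

```python
import random


def build_csr(num_nodes, edges):
    """Build a CSR adjacency (indptr, indices) from (src, dst) pairs."""
    # Count the out-degree of every source node.
    counts = [0] * num_nodes
    for src, _ in edges:
        counts[src] += 1
    # indptr[i]:indptr[i + 1] delimits node i's slice of `indices`.
    indptr = [0] * (num_nodes + 1)
    for node in range(num_nodes):
        indptr[node + 1] = indptr[node] + counts[node]
    indices = [0] * len(edges)
    cursor = list(indptr[:-1])  # next free slot per source node
    for src, dst in edges:
        indices[cursor[src]] = dst
        cursor[src] += 1
    return indptr, indices


def build_csc(num_nodes, edges):
    """CSC is CSR over the reversed edges, so it indexes incoming neighbors."""
    return build_csr(num_nodes, [(dst, src) for src, dst in edges])


def neighbors(indptr, indices, node):
    """Return the contiguous neighbor slice for one node."""
    return indices[indptr[node]:indptr[node + 1]]


def sample_neighbors(indptr, indices, node, fanout, rng):
    """Uniformly sample up to `fanout` neighbors without replacement."""
    nbrs = neighbors(indptr, indices, node)
    if len(nbrs) <= fanout:
        return list(nbrs)
    return rng.sample(nbrs, fanout)


edges = [(0, 1), (0, 2), (1, 2), (3, 0)]
csr = build_csr(4, edges)
csc = build_csc(4, edges)
print(neighbors(*csr, 0))  # out-neighbors of node 0
print(neighbors(*csc, 2))  # in-neighbors of node 2
print(sample_neighbors(*csr, 0, fanout=1, rng=random.Random(0)))
```

Keeping both layouts means that outgoing and incoming neighbor lookups are each a single contiguous slice, which is the property that traversal and GNN samplers typically rely on.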
Benchmarks
Benchmark automation is scaffolded in the repository:
- CI automation: .github/workflows/bench.yml
- Benchmark harness tests: tests/test_bench_pkg/
The CLI exposes a benchmark command for registered scenarios:
caracal bench NAME
CLI
The CLI is available as caracal:
# Initialise an empty .crcl bundle
caracal init demo
# Run a Tuft query from a file
caracal run demo.crcl --file query.tuft
# Print an explain tree
caracal explain demo.crcl Gene
# Pack and unpack .crcl storage
caracal pack demo.crcl -o demo-packed.crcl
caracal unpack demo-packed.crcl -o restored.crcl
API Overview
Top-level functions and types
| API | Description |
|---|---|
| cdb.connect(path, mode="rw", format="auto") | Open or create a .crcl database |
| Database.cursor() | Create a query connection |
| Database.catalog | Access the ontology catalog |
| Database.bundle | Access the underlying storage bundle |
| Database.open_node_store(class_iri) | Open a node store for a class |
| Connection.sql(text, params=None) | Execute supported Tuft query text |
| Result.arrow() | Return a pyarrow.Table |
| Result.record_batches() | Iterate pyarrow.RecordBatch results |
CLI commands
| Command | Description |
|---|---|
| caracal init PATH | Initialise an empty .crcl bundle |
| caracal run BUNDLE --file QUERY | Execute a Tuft query and emit JSON |
| caracal explain BUNDLE QUERY | Print a logical explain tree |
| caracal bench NAME | Run a registered microbenchmark |
| caracal pack BUNDLE -o FILE | Package a directory bundle into a packed .crcl file |
| caracal unpack FILE -o DIR | Restore a packed .crcl file into a bundle |
Architecture
CaracalDB is organized as a Python package with focused modules for language, planning, execution, storage, graph layout, ontology, and ML interop:
caracaldb/
api.py Public connect / Database / Connection / Result API
cli/ Typer command-line interface
lang/tuft/ Tuft parser, AST, binder, transformer, typing
plan/ Logical plan nodes, rules, cost model, pattern compiler
exec/ Physical operators and execution context
storage/ .crcl bundle, WAL, snapshots, pack/unpack, stores
graph/ CSR / CSC builders, readers, HNSW support
onto/ Catalog, hierarchy, closure, reasoner
ingest/ Parquet ingestion helpers
ml/ Subgraph, neighbor loader, framework adapters
observability/ Explain, profile, and tracing helpers
udf/ Python and Tuft UDF registry
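The onto/ entry above lists hierarchy and closure. As a rough illustration of what a subclass closure computes, and not a description of how CaracalDB implements it, the sketch below derives, for each class, the set containing itself and all transitive subclasses. The class names and the subclass_closure helper are hypothetical, and whether the engine expands class scans this way is an assumption.

```python
from collections import defaultdict


def subclass_closure(subclass_of):
    """Map each class to the set of itself plus all transitive subclasses."""
    children = defaultdict(set)
    classes = set()
    for child, parent in subclass_of:
        children[parent].add(child)
        classes.update((child, parent))

    closure = {}
    for cls in classes:
        seen = {cls}
        stack = [cls]
        while stack:  # iterative DFS; `seen` also guards against cycles
            for child in children[stack.pop()]:
                if child not in seen:
                    seen.add(child)
                    stack.append(child)
        closure[cls] = seen
    return closure


# Hypothetical hierarchy given as (subclass, superclass) pairs.
hierarchy = [
    ("ProteinCodingGene", "Gene"),
    ("Gene", "BiologicalEntity"),
    ("Protein", "BiologicalEntity"),
]
closure = subclass_closure(hierarchy)
# An ontology-aware scan over "Gene" could also read its subclasses.
print(sorted(closure["Gene"]))
```

Precomputing the closure once turns a hierarchy-aware class match into a plain set membership test at query time.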
Execution Pipeline
Tuft text
|
v
Parser -> Binder -> Logical plan -> Physical pipeline
|
v
NodeScan / Filter / Project / Expand / Join / Aggregate operators
|
v
Arrow RecordBatch -> pyarrow.Table
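The physical stage of the pipeline above can be pictured as a chain of pull-based operators that pass column-oriented batches downstream. The following is a minimal sketch under stated assumptions: plain dicts of lists stand in for Arrow RecordBatch objects, and the operator functions (node_scan, filter_op, project_op, limit_op) are illustrative names, not the contents of CaracalDB's exec/ module.

```python
def node_scan(store, batch_size=2):
    """Yield column-oriented batches (column name -> list) from row dicts."""
    for start in range(0, len(store), batch_size):
        rows = store[start:start + batch_size]
        yield {col: [row[col] for row in rows] for col in rows[0]}


def filter_op(batches, predicate):
    """Keep only the rows of each batch for which predicate(row) is true."""
    for batch in batches:
        names = list(batch)
        rows = [dict(zip(names, values)) for values in zip(*batch.values())]
        kept = [row for row in rows if predicate(row)]
        if kept:  # drop batches that end up empty
            yield {name: [row[name] for row in kept] for name in names}


def project_op(batches, columns):
    """Keep only the requested columns of each batch."""
    for batch in batches:
        yield {col: batch[col] for col in columns}


def limit_op(batches, n):
    """Stop pulling from upstream once n rows have been emitted."""
    remaining = n
    for batch in batches:
        if remaining <= 0:
            return
        size = len(next(iter(batch.values())))
        take = min(size, remaining)
        yield {col: values[:take] for col, values in batch.items()}
        remaining -= take


genes = [
    {"symbol": "TP53", "chromosome": "17"},
    {"symbol": "BRCA1", "chromosome": "17"},
    {"symbol": "EGFR", "chromosome": "7"},
    {"symbol": "BRCA2", "chromosome": "13"},
]

# Scan Gene nodes, keep chromosome 17, return the symbol, cap at 5 rows.
plan = limit_op(
    project_op(
        filter_op(node_scan(genes), lambda row: row["chromosome"] == "17"),
        ["symbol"],
    ),
    5,
)
result = list(plan)
print(result)
```

Because every operator is a generator, nothing is read from the scan until the consumer pulls a batch, and the limit operator can stop the whole chain early.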
Storage Pipeline
.crcl path
|
+-- packed single file
| |
| v
| temporary working bundle -> repacked on close
|
+-- directory bundle
|
v
manifest / catalog / WAL / snapshots / node stores / edge stores / indexes
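The packed-file branch of the diagram (unpack into a temporary working bundle, repack on close) can be sketched with the standard library. This is only an illustration of the pattern: a zip archive stands in for the real packed .crcl format, which is not described here, and the file names manifest.json and wal.log as well as the pack and open_packed helpers are hypothetical.

```python
import tempfile
import zipfile
from contextlib import contextmanager
from pathlib import Path


def pack(bundle_dir, packed_path):
    """Write every file under bundle_dir into a single archive."""
    bundle_dir = Path(bundle_dir)
    with zipfile.ZipFile(packed_path, "w") as archive:
        for path in sorted(bundle_dir.rglob("*")):
            if path.is_file():
                archive.write(path, path.relative_to(bundle_dir).as_posix())


@contextmanager
def open_packed(packed_path):
    """Unpack into a temporary working bundle, then repack on clean exit."""
    with tempfile.TemporaryDirectory() as tmp:
        work = Path(tmp) / "bundle"
        work.mkdir()
        with zipfile.ZipFile(packed_path) as archive:
            archive.extractall(work)
        yield work
        # Reached only if the caller's block did not raise.
        pack(work, packed_path)


with tempfile.TemporaryDirectory() as root:
    source = Path(root) / "demo-bundle"
    source.mkdir()
    (source / "manifest.json").write_text("{}")
    packed = Path(root) / "demo-packed.crcl"
    pack(source, packed)

    # Changes made to the working bundle land in the packed file on close.
    with open_packed(packed) as work:
        (work / "wal.log").write_text("insert Gene TP53\n")

    with zipfile.ZipFile(packed) as archive:
        names = sorted(archive.namelist())

print(names)
```

A production implementation would more likely write the new archive to a temporary file and rename it over the original, so that a crash during repacking cannot leave a truncated file behind.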
Repository Layout
caracaldb/ Python package source
tests/ Unit, golden, property, and end-to-end tests
schema/ FlatBuffers and storage/catalog schemas
docs/ Design documents and user documentation
examples/ Runnable examples and case-study notebooks
Project Status
CaracalDB is pre-release and not yet suitable for production use. M0 through M5 are accepted in docs/milestones/, and the engine is currently in the v0.2.x docs and benchmark sweep. Multi-hop pattern matching, rel-type unions, and the degree() graph built-in are wired through Connection.sql; variable-length paths, multi-label nodes, and the remaining graph-topology built-ins (neighbors, shortest_path, k_hop) are tracked carry-overs.
The closest peers — embedded analytical graph engines — are kuzu, DuckPGQ, and Memgraph's embedded library mode. Comparisons against server-tier graph databases (Neo4j Enterprise, Neptune, TigerGraph) are not the right reference frame for an embedded .crcl file.
Non-goals
CaracalDB is deliberately scoped against a small set of features that belong in a different product:
- No server process, no network protocol. No Bolt, no gRPC, no HTTP endpoint. The analogue is DuckDB or SQLite, not Neo4j Enterprise.
- No multi-writer concurrency. A .crcl bundle is opened by one writer; readers can hold older snapshots. Coordinating multiple writers belongs to a layer above the engine.
- No authentication, authorization, or row-level ACLs. Filesystem permissions are the only access boundary. Embedded governance belongs to the host application or a server tier.
- No SPARQL endpoint, no full OWL-DL. CaracalDB supports OWL-RL-style class/property hierarchies and IRI identity; RDF/Turtle is an import concern, not an engine surface (see docs/adr/0005-rdf-as-import-only.md).
- No bundled LLM / GraphRAG framework. CaracalDB is a substrate for GNN and KG workflows; LLM glue is the host application's job. The Arrow record_batches() / arrow() outputs are the integration contract.
The one governance-adjacent feature that does fit the embedded model is deterministic, named snapshots with content-addressable manifests, plus a caracal diff command for auditing graph versions. That lets an outer governance layer pin and diff a database without the engine taking on multi-tenant concerns.
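One way to picture content-addressable manifests and snapshot diffing is the sketch below. It assumes a manifest is a mapping from file name to the SHA-256 of that file's bytes, which is an illustrative choice and not the documented .crcl manifest layout; the file names and the build_manifest, manifest_id, and diff_manifests helpers are likewise hypothetical.

```python
import hashlib
import json


def build_manifest(files):
    """Map each file name to the SHA-256 hex digest of its bytes."""
    return {
        name: hashlib.sha256(data).hexdigest()
        for name, data in sorted(files.items())
    }


def manifest_id(manifest):
    """Derive a deterministic snapshot id from the manifest contents."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def diff_manifests(old, new):
    """Report which files were added, removed, or changed between snapshots."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(
            name for name in old.keys() & new.keys() if old[name] != new[name]
        ),
    }


v1 = build_manifest({"nodes/Gene.bin": b"TP53", "catalog.bin": b"Gene"})
v2 = build_manifest(
    {
        "nodes/Gene.bin": b"TP53,BRCA1",
        "catalog.bin": b"Gene",
        "edges/binds.bin": b"",
    }
)
print(manifest_id(v1) == manifest_id(v2))  # differing content, differing ids
print(diff_manifests(v1, v2))
```

Because the snapshot id depends only on content, an outer governance layer can pin a database version by id and detect any drift without trusting file names or timestamps.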
Contributing
Start with docs/04_caracaldb_implementation.md and docs/05_wbs.md. The core project constraints are:
- Keep the engine embedded-first.
- Preserve Arrow-native execution boundaries.
- Treat Tuft diagnostics and golden parser tests as public contract.
- Keep .crcl storage reproducible through WAL, snapshots, and pack/unpack tests.
- Measure performance changes before claiming speedups.
License
Apache License 2.0. See LICENSE.