Embedded, ontology-leaning, Arrow-native analytical GraphDB for KG and GNN MLOps workflows.
CaracalDB
An Embedded, Ontology-Leaning, Arrow-Native Analytical GraphDB for KG and GNN Workflows.
CaracalDB is an embedded graph database for knowledge graphs, ontology-aware query planning, GNN sampling, and ML feature workflows. The current implementation is a Python reference engine that validates the .crcl storage format, Tuft query language, planner surface, and user-facing API. A Rust core is planned, but it is not part of the current package.
Quickstart
Install
pip install caracaldb
or
uv add caracaldb
For development from a repository checkout:
uv sync --extra dev
uv run pytest
30-Second Quickstart
import caracaldb as cdb

with cdb.connect("demo") as db:
    db.define_class("Gene")
    db.insert_nodes("Gene", [{"symbol": "TP53", "chromosome": "17"}])
    rows = db.sql("MATCH (g:Gene) RETURN g.symbol").rows()
    print(rows)
The current Python reference query path supports a single MATCH (alias:Class) node pattern with WHERE, RETURN, and LIMIT. Broader graph patterns, richer binding, and multi-hop query execution are tracked in the milestone docs.
Start Here
- Language spec: docs/01_language_spec.md
- Engine spec: docs/02_engine_spec.md
- Modeling case study: docs/03_user_modeling_case_study.md
- Implementation plan: docs/04_caracaldb_implementation.md
- Work breakdown: docs/05_wbs.md
- Error index: docs/errors/TF-INDEX.md
- Examples: examples/
- Benchmark CI: .github/workflows/bench.yml
Why CaracalDB
CaracalDB is built around explicit storage, ontology, and execution boundaries:
flowchart LR
A["Tuft query"] --> B["Parser and diagnostics"]
B --> C["Binder and ontology catalog"]
C --> D["Logical plan"]
D --> E["Physical operators"]
E --> F["Arrow RecordBatch"]
G[".crcl bundle or packed file"] --> H["Catalog, WAL, snapshots, stores"]
H --> E
H --> I["CSR / CSC graph indexes"]
I --> J["Traversal, sampling, and ML adapters"]
- Embedded-first operation: no required server process.
- Tuft combines Cypher-like graph patterns with SPARQL-like ontology semantics.
- Arrow is the execution boundary for scan results and downstream analytics.
- CSR and CSC graph layouts support traversal, neighbor sampling, and GNN workflows.
- Snapshot, WAL, and packed .crcl storage paths are tested as first-class engine pieces.
- The Python API is intentionally small; Rust core work is planned after the reference behavior is stable.
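To make the CSR / CSC bullet concrete, here is a minimal pure-Python sketch of a compressed sparse row layout and fan-out-limited neighbor sampling. It is an illustration of the general technique only, not CaracalDB's implementation; the function names `build_csr`, `neighbors`, and `sample_neighbors` are hypothetical and do not come from the package.

```python
import random
from typing import List, Sequence, Tuple


def build_csr(num_nodes: int, edges: Sequence[Tuple[int, int]]) -> Tuple[List[int], List[int]]:
    """Build CSR arrays (indptr, indices) from (src, dst) pairs."""
    # Count the out-degree of every source node.
    indptr = [0] * (num_nodes + 1)
    for src, _ in edges:
        indptr[src + 1] += 1
    # Prefix-sum the counts so indptr[v]:indptr[v + 1] spans v's neighbors.
    for v in range(num_nodes):
        indptr[v + 1] += indptr[v]
    # Scatter destinations into their slots using one write cursor per node.
    cursor = indptr[:-1].copy()
    indices = [0] * len(edges)
    for src, dst in edges:
        indices[cursor[src]] = dst
        cursor[src] += 1
    return indptr, indices


def neighbors(indptr: List[int], indices: List[int], node: int) -> List[int]:
    """Return the neighbor slice of `node` without scanning other nodes."""
    return indices[indptr[node]:indptr[node + 1]]


def sample_neighbors(indptr: List[int], indices: List[int], node: int,
                     fanout: int, rng: random.Random) -> List[int]:
    """Sample at most `fanout` distinct neighbors, as GNN loaders typically do."""
    nbrs = neighbors(indptr, indices, node)
    if len(nbrs) <= fanout:
        return list(nbrs)
    return rng.sample(nbrs, fanout)


edges = [(0, 1), (0, 2), (0, 3), (1, 2), (3, 0)]
indptr, indices = build_csr(4, edges)

# CSC is the same construction applied to the reversed edges,
# which makes in-neighbor lookups as cheap as out-neighbor lookups.
csc_indptr, csc_indices = build_csr(4, [(dst, src) for src, dst in edges])

print(neighbors(indptr, indices, 0))          # out-neighbors of node 0
print(neighbors(csc_indptr, csc_indices, 2))  # in-neighbors of node 2
print(sample_neighbors(indptr, indices, 0, 2, random.Random(0)))
```

Keeping both orientations trades memory for constant-time access to either direction of an edge, which is the usual reason an engine stores CSR and CSC side by side.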
Benchmarks
Benchmark automation is scaffolded in the repository:
- CI automation: .github/workflows/bench.yml
- Benchmark harness tests: tests/test_bench_pkg/
The CLI exposes a benchmark command for registered scenarios:
caracal bench NAME
CLI
The CLI is available as caracal:
# Initialise an empty .crcl bundle
caracal init demo
# Run a Tuft query from a file
caracal run demo.crcl --file query.tuft
# Print an explain tree
caracal explain demo.crcl Gene
# Pack and unpack .crcl storage
caracal pack demo.crcl -o demo-packed.crcl
caracal unpack demo-packed.crcl -o restored.crcl
API Overview
Top-level functions and types
| API | Description |
|---|---|
| cdb.connect(path, mode="rw", format="auto") | Open or create a .crcl database |
| Database.cursor() | Create a query connection |
| Database.catalog | Access the ontology catalog |
| Database.bundle | Access the underlying storage bundle |
| Database.open_node_store(class_iri) | Open a node store for a class |
| Connection.sql(text, params=None) | Execute supported Tuft query text |
| Result.arrow() | Return a pyarrow.Table |
| Result.record_batches() | Iterate pyarrow.RecordBatch results |
CLI commands
| Command | Description |
|---|---|
| caracal init PATH | Initialise an empty .crcl bundle |
| caracal run BUNDLE --file QUERY | Execute a Tuft query and emit JSON |
| caracal explain BUNDLE QUERY | Print a logical explain tree |
| caracal bench NAME | Run a registered microbenchmark |
| caracal pack BUNDLE -o FILE | Package a directory bundle into a packed .crcl file |
| caracal unpack FILE -o DIR | Restore a packed .crcl file into a bundle |
Architecture
CaracalDB is organized as a Python package with focused modules for language, planning, execution, storage, graph layout, ontology, and ML interop:
caracaldb/
api.py Public connect / Database / Connection / Result API
cli/ Typer command-line interface
lang/tuft/ Tuft parser, AST, binder, transformer, typing
plan/ Logical plan nodes, rules, cost model, pattern compiler
exec/ Physical operators and execution context
storage/ .crcl bundle, WAL, snapshots, pack/unpack, stores
graph/ CSR / CSC builders, readers, HNSW support
onto/ Catalog, hierarchy, closure, reasoner
ingest/ Parquet ingestion helpers
ml/ Subgraph, neighbor loader, framework adapters
observability/ Explain, profile, and tracing helpers
udf/ Python and Tuft UDF registry
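The onto/ module lists hierarchy and closure among its responsibilities. As a rough illustration of what a subclass closure computes, the sketch below derives every class's ancestors from direct parent links and then expands a class pattern to include its subclasses. This is an assumption about the general idea, not the package's algorithm; `subclass_closure`, `classes_matching`, and every class name other than `Gene` are hypothetical.

```python
from typing import Dict, Set


def subclass_closure(parents: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Map every class to all of its ancestors (transitive closure)."""
    closure: Dict[str, Set[str]] = {}

    def ancestors(cls: str, visiting: Set[str]) -> Set[str]:
        if cls in closure:
            return closure[cls]  # already computed, reuse it
        if cls in visiting:
            raise ValueError(f"cycle in class hierarchy at {cls!r}")
        visiting.add(cls)
        result: Set[str] = set()
        for parent in parents.get(cls, set()):
            result.add(parent)
            result |= ancestors(parent, visiting)
        visiting.discard(cls)
        closure[cls] = result
        return result

    for cls in list(parents):
        ancestors(cls, set())
    return closure


def classes_matching(target: str, closure: Dict[str, Set[str]]) -> Set[str]:
    """Classes whose instances satisfy a pattern on `target` under subclass semantics."""
    return {target} | {cls for cls, anc in closure.items() if target in anc}


# Hypothetical hierarchy: direct parent links only.
hierarchy = {
    "ProteinCodingGene": {"Gene"},
    "Pseudogene": {"Gene"},
    "Gene": {"BiologicalEntity"},
}
closure = subclass_closure(hierarchy)

print(sorted(closure["ProteinCodingGene"]))
print(sorted(classes_matching("Gene", closure)))
```

Precomputing the closure once lets a planner rewrite a single-class scan into a scan over a fixed set of classes instead of walking the hierarchy per row.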
Execution Pipeline
Tuft text
|
v
Parser -> Binder -> Logical plan -> Physical pipeline
|
v
NodeScan / Filter / Project / Expand / Join / Aggregate operators
|
v
Arrow RecordBatch -> pyarrow.Table
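The operator stage of this pipeline can be pictured as a chain of pull-based iterators over batches. The sketch below models NodeScan, Filter, Project, and a limit step with plain Python generators and lists of dicts standing in for Arrow record batches. It is a simplified illustration under those assumptions; the function names are hypothetical and the real operators in exec/ may be structured differently.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, object]
Batch = List[Row]


def node_scan(rows: List[Row], batch_size: int) -> Iterator[Batch]:
    """Emit stored rows in fixed-size batches."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


def filter_op(batches: Iterable[Batch], predicate: Callable[[Row], bool]) -> Iterator[Batch]:
    """Keep rows that satisfy the predicate; skip batches that become empty."""
    for batch in batches:
        kept = [row for row in batch if predicate(row)]
        if kept:
            yield kept


def project_op(batches: Iterable[Batch], columns: List[str]) -> Iterator[Batch]:
    """Keep only the requested columns."""
    for batch in batches:
        yield [{col: row[col] for col in columns} for row in batch]


def limit_op(batches: Iterable[Batch], n: int) -> Iterator[Batch]:
    """Stop pulling from upstream once n rows have been emitted."""
    remaining = n
    if remaining <= 0:
        return
    for batch in batches:
        taken = batch[:remaining]
        yield taken
        remaining -= len(taken)
        if remaining == 0:
            return


genes = [
    {"symbol": "TP53", "chromosome": "17"},
    {"symbol": "BRCA1", "chromosome": "17"},
    {"symbol": "EGFR", "chromosome": "7"},
    {"symbol": "NF1", "chromosome": "17"},
]

# Roughly the shape of: scan Gene, filter on chromosome, return symbol, limit 2.
pipeline = limit_op(
    project_op(
        filter_op(node_scan(genes, batch_size=2), lambda r: r["chromosome"] == "17"),
        ["symbol"],
    ),
    2,
)
result = [row for batch in pipeline for row in batch]
print(result)
```

Because each operator pulls from the one before it, the limit step ends the scan early: the second batch is never read once two rows have been produced.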
Storage Pipeline
.crcl path
  |
  +-- packed single file
  |     |
  |     v
  |   temporary working bundle -> repacked on close
  |
  +-- directory bundle
        |
        v
      manifest / catalog / WAL / snapshots / node stores / edge stores / indexes
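The packed-file branch implies a round trip: a directory bundle is written into a single file and later restored byte for byte. The sketch below shows that round trip with a zip archive as a stand-in container. The actual packed .crcl format is not described here, so the zip choice, the helper names, and the file names `manifest.json` and `wal/000001.log` are all assumptions made for illustration.

```python
import tempfile
import zipfile
from pathlib import Path
from typing import Dict


def pack_bundle(bundle_dir: Path, packed_file: Path) -> None:
    """Write every file of a directory bundle into one archive."""
    with zipfile.ZipFile(packed_file, "w", zipfile.ZIP_DEFLATED) as archive:
        # Sorted traversal keeps the entry order stable across runs.
        for path in sorted(bundle_dir.rglob("*")):
            if path.is_file():
                archive.write(path, path.relative_to(bundle_dir).as_posix())


def unpack_bundle(packed_file: Path, target_dir: Path) -> None:
    """Restore the archive contents into a directory bundle."""
    with zipfile.ZipFile(packed_file) as archive:
        archive.extractall(target_dir)


def snapshot(bundle_dir: Path) -> Dict[str, bytes]:
    """Map relative file paths to their bytes, for round-trip comparison."""
    return {
        path.relative_to(bundle_dir).as_posix(): path.read_bytes()
        for path in sorted(bundle_dir.rglob("*"))
        if path.is_file()
    }


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    bundle = root / "demo.crcl"
    (bundle / "wal").mkdir(parents=True)
    # Hypothetical bundle contents, not the real .crcl layout.
    (bundle / "manifest.json").write_text('{"version": 1}')
    (bundle / "wal" / "000001.log").write_bytes(b"\x00\x01")

    packed = root / "demo-packed.crcl"
    restored = root / "restored.crcl"
    pack_bundle(bundle, packed)
    unpack_bundle(packed, restored)

    restored_files = sorted(snapshot(restored))
    round_trip_ok = snapshot(bundle) == snapshot(restored)

print(restored_files, round_trip_ok)
```

A comparison like `snapshot(bundle) == snapshot(restored)` is the kind of check a pack/unpack test can assert on. Note that this sketch does not preserve empty directories, which a real format may need to handle.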
Repository Layout
caracaldb/ Python package source
tests/ Unit, golden, property, and end-to-end tests
schema/ FlatBuffers and storage/catalog schemas
docs/ Design documents and user documentation
examples/ Runnable examples and case-study notebooks
Project Status
CaracalDB is pre-release and not yet suitable for production use. The current milestone line is documented in docs/05_wbs.md; M0 has been accepted in docs/milestones/M0-gate.md, and the repository is now focused on expanding the M1 vertical slice.
Contributing
Start with docs/04_caracaldb_implementation.md and docs/05_wbs.md. The core project constraints are:
- Keep the engine embedded-first.
- Preserve Arrow-native execution boundaries.
- Treat Tuft diagnostics and golden parser tests as public contract.
- Keep .crcl storage reproducible through WAL, snapshots, and pack/unpack tests.
- Measure performance changes before claiming speedups.
License
Apache License 2.0. See LICENSE.