Embedded, ontology-leaning, Arrow-native analytical GraphDB for KG and GNN MLOps workflows.

CaracalDB

An Embedded, Ontology-Leaning, Arrow-Native Analytical GraphDB for KG and GNN Workflows.

Status: pre-alpha | Engine: Python reference | Rust core: planned | License: Apache-2.0

Why CaracalDB | Quickstart | API Overview | Architecture

CaracalDB is an embedded graph database for knowledge graphs, ontology-aware query planning, GNN sampling, and ML feature workflows. The current implementation is a Python reference engine that validates the .crcl storage format, Tuft query language, planner surface, and user-facing API. A Rust core is planned, but it is not part of the current package.

Quickstart

Install

pip install caracaldb

or

uv add caracaldb

For development from a repository checkout:

uv sync --extra dev
uv run pytest

30-Second Quickstart

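import caracaldb as cdb

# Open (or create) the "demo" database; the context manager closes it on exit.
with cdb.connect("demo") as db:
    # Declare a node class, then insert one node with two properties.
    db.define_class("Gene")
    db.insert_nodes("Gene", [{"symbol": "TP53", "chromosome": "17"}])

    # Run a Tuft query and print the matching rows.
    rows = db.sql("MATCH (g:Gene) RETURN g.symbol").rows()
    print(rows)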

The current Python reference query path supports a single MATCH (alias:Class) node pattern with WHERE, RETURN, and LIMIT. Broader graph patterns, richer binding, and multi-hop query execution are tracked in the milestone docs.
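
To make the supported shape concrete, the sketch below checks whether a query string looks like the single-node form described above. It is a rough illustration only: the regular expression and the helper name `looks_supported` are assumptions made for this example, not the Tuft grammar or part of the CaracalDB API. In particular, it approximates class names with `\w+` and expects keywords in upper case.

```python
import re

# Rough approximation of the supported shape:
#   MATCH (alias:Class) [WHERE ...] RETURN ... [LIMIT n]
# This pattern is an illustrative assumption, not the Tuft grammar.
_SINGLE_NODE_QUERY = re.compile(
    r"""
    ^\s*MATCH\s*\(\s*(?P<alias>\w+)\s*:\s*(?P<cls>\w+)\s*\)  # exactly one node pattern
    (?:\s+WHERE\s+(?P<where>.+?))?                            # optional filter
    \s+RETURN\s+(?P<ret>.+?)                                  # projection list
    (?:\s+LIMIT\s+(?P<limit>\d+))?                            # optional row limit
    \s*$
    """,
    re.VERBOSE | re.DOTALL,
)


def looks_supported(query: str) -> bool:
    """Return True if the query has the single-node MATCH shape (hypothetical helper)."""
    return _SINGLE_NODE_QUERY.match(query) is not None


print(looks_supported("MATCH (g:Gene) RETURN g.symbol"))
print(looks_supported("MATCH (g:Gene) WHERE g.chromosome = '17' RETURN g.symbol LIMIT 10"))
```

Anything after the first node pattern other than WHERE or RETURN, such as a Cypher-style relationship arrow, is rejected by this check.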

Start Here

  • Language spec: docs/01_language_spec.md
  • Engine spec: docs/02_engine_spec.md
  • Modeling case study: docs/03_user_modeling_case_study.md
  • Implementation plan: docs/04_caracaldb_implementation.md
  • Work breakdown: docs/05_wbs.md
  • Error index: docs/errors/TF-INDEX.md
  • Examples: examples/
  • Benchmark CI: .github/workflows/bench.yml

Why CaracalDB

CaracalDB is built around explicit storage, ontology, and execution boundaries:

flowchart LR
    A["Tuft query"] --> B["Parser and diagnostics"]
    B --> C["Binder and ontology catalog"]
    C --> D["Logical plan"]
    D --> E["Physical operators"]
    E --> F["Arrow RecordBatch"]
    G[".crcl bundle or packed file"] --> H["Catalog, WAL, snapshots, stores"]
    H --> E
    H --> I["CSR / CSC graph indexes"]
    I --> J["Traversal, sampling, and ML adapters"]

  • Embedded-first operation: no required server process.
  • Tuft combines Cypher-like graph patterns with SPARQL-like ontology semantics.
  • Arrow is the execution boundary for scan results and downstream analytics.
  • CSR and CSC graph layouts support traversal, neighbor sampling, and GNN workflows.
  • Snapshot, WAL, and packed .crcl storage paths are tested as first-class engine pieces.
  • The Python API is intentionally small; Rust core work is planned after the reference behavior is stable.
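
For readers unfamiliar with the CSR and CSC layouts mentioned above, here is a minimal pure-Python sketch of the general technique. It does not reflect CaracalDB's actual index format or API; the function names (`build_csr`, `build_csc`, `sample_neighbors`) and the integer node IDs are assumptions chosen for illustration.

```python
import random
from typing import List, Sequence, Tuple


def build_csr(num_nodes: int, edges: Sequence[Tuple[int, int]]) -> Tuple[List[int], List[int]]:
    """Build CSR arrays (indptr, indices) for out-neighbors from (src, dst) pairs."""
    indptr = [0] * (num_nodes + 1)
    for src, _ in edges:
        indptr[src + 1] += 1            # count the out-degree of each source node
    for i in range(num_nodes):
        indptr[i + 1] += indptr[i]      # prefix sum turns counts into row offsets
    indices = [0] * len(edges)
    cursor = indptr[:-1].copy()         # next free slot in each row
    for src, dst in edges:
        indices[cursor[src]] = dst
        cursor[src] += 1
    return indptr, indices


def build_csc(num_nodes, edges):
    """CSC is the CSR of the reversed edges, so it indexes in-neighbors."""
    return build_csr(num_nodes, [(dst, src) for src, dst in edges])


def neighbors(indptr, indices, node):
    """Slice out the contiguous neighbor list of one node."""
    return indices[indptr[node]:indptr[node + 1]]


def sample_neighbors(indptr, indices, node, k, rng):
    """Sample up to k neighbors without replacement, as a GNN loader might."""
    nbrs = neighbors(indptr, indices, node)
    return nbrs if len(nbrs) <= k else rng.sample(nbrs, k)


edges = [(0, 1), (0, 2), (1, 2), (2, 0), (2, 3)]
csr = build_csr(4, edges)
csc = build_csc(4, edges)
print(neighbors(*csr, 2))   # out-neighbors of node 2 -> [0, 3]
print(neighbors(*csc, 2))   # in-neighbors of node 2  -> [0, 1]
print(sample_neighbors(*csr, 0, 1, random.Random(0)))
```

Keeping both layouts means forward and backward traversal are each a single contiguous slice, which is what makes neighbor sampling cheap.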

Benchmarks

Benchmark automation is scaffolded in the repository:

  • CI automation: .github/workflows/bench.yml
  • Benchmark harness tests: tests/test_bench_pkg/

The CLI exposes a benchmark command for registered scenarios:

caracal bench NAME
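
As an illustration of what a "registered scenario" could look like, the following sketch keeps a name-to-function registry and times a scenario by name. This is not the CaracalDB benchmark harness: `SCENARIOS`, `scenario`, `run_bench`, and the `sum_range` workload are all hypothetical names invented for this example.

```python
import statistics
import time
from typing import Callable, Dict

SCENARIOS: Dict[str, Callable[[], object]] = {}   # hypothetical registry


def scenario(name: str):
    """Decorator that registers a zero-argument benchmark under a name."""
    def register(fn):
        SCENARIOS[name] = fn
        return fn
    return register


@scenario("sum_range")
def sum_range():
    # Placeholder workload; a real scenario would exercise the engine.
    return sum(range(10_000))


def run_bench(name: str, repeats: int = 5) -> dict:
    """Time one registered scenario several times and summarise the samples."""
    fn = SCENARIOS[name]                 # raises KeyError for unregistered names
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return {
        "name": name,
        "repeats": repeats,
        "median_s": statistics.median(samples),
        "min_s": min(samples),
    }


report = run_bench("sum_range")
print(report["name"], report["repeats"])
```

Reporting the median and minimum over several repeats, rather than a single run, makes before/after comparisons less sensitive to noise.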

CLI

The CLI is available as caracal:

# Initialise an empty .crcl bundle
caracal init demo

# Run a Tuft query from a file
caracal run demo.crcl --file query.tuft

# Print an explain tree
caracal explain demo.crcl "MATCH (g:Gene) RETURN g.symbol"

# Pack and unpack .crcl storage
caracal pack demo.crcl -o demo-packed.crcl
caracal unpack demo-packed.crcl -o restored.crcl
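
The idea behind pack and unpack can be sketched with the standard library: collect a directory bundle into one file, restore it, and confirm the round trip is byte-identical. This is an assumption-laden illustration, not the real packed .crcl format. It uses a zip archive as a stand-in, and the file names inside the bundle (`manifest.json`, `stores/gene.bin`) are made up.

```python
import tempfile
import zipfile
from pathlib import Path


def pack_bundle(bundle_dir: Path, packed_file: Path) -> None:
    """Write every file under bundle_dir into one archive, in sorted order."""
    with zipfile.ZipFile(packed_file, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(p for p in bundle_dir.rglob("*") if p.is_file()):
            zf.write(path, arcname=path.relative_to(bundle_dir).as_posix())


def unpack_bundle(packed_file: Path, out_dir: Path) -> None:
    """Restore the archive contents into a directory bundle."""
    out_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(packed_file) as zf:
        zf.extractall(out_dir)


def tree_bytes(root: Path) -> dict:
    """Map relative path -> file bytes, used to compare two directory trees."""
    return {p.relative_to(root).as_posix(): p.read_bytes()
            for p in root.rglob("*") if p.is_file()}


with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    bundle = tmp / "demo.crcl"
    (bundle / "stores").mkdir(parents=True)
    (bundle / "manifest.json").write_text('{"version": 1}')   # hypothetical file
    (bundle / "stores" / "gene.bin").write_bytes(b"\x00\x01")  # hypothetical file

    pack_bundle(bundle, tmp / "demo-packed.crcl")
    unpack_bundle(tmp / "demo-packed.crcl", tmp / "restored.crcl")
    round_trip_ok = tree_bytes(bundle) == tree_bytes(tmp / "restored.crcl")
    print(round_trip_ok)
```

A round-trip comparison like this is the kind of check a pack/unpack test can assert on.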

API Overview

Top-level functions and types

API                                          Description
cdb.connect(path, mode="rw", format="auto")  Open or create a .crcl database
Database.cursor()                            Create a query connection
Database.catalog                             Access the ontology catalog
Database.bundle                              Access the underlying storage bundle
Database.open_node_store(class_iri)          Open a node store for a class
Connection.sql(text, params=None)            Execute supported Tuft query text
Result.arrow()                               Return a pyarrow.Table
Result.record_batches()                      Iterate pyarrow.RecordBatch results

CLI commands

Command                          Description
caracal init PATH                Initialise an empty .crcl bundle
caracal run BUNDLE --file QUERY  Execute a Tuft query and emit JSON
caracal explain BUNDLE QUERY     Print a logical explain tree
caracal bench NAME               Run a registered microbenchmark
caracal pack BUNDLE -o FILE      Package a directory bundle into a packed .crcl file
caracal unpack FILE -o DIR       Restore a packed .crcl file into a bundle

Architecture

CaracalDB is organized as a Python package with focused modules for language, planning, execution, storage, graph layout, ontology, and ML interop:

caracaldb/
  api.py                 Public connect / Database / Connection / Result API
  cli/                   Typer command-line interface
  lang/tuft/             Tuft parser, AST, binder, transformer, typing
  plan/                  Logical plan nodes, rules, cost model, pattern compiler
  exec/                  Physical operators and execution context
  storage/               .crcl bundle, WAL, snapshots, pack/unpack, stores
  graph/                 CSR / CSC builders, readers, HNSW support
  onto/                  Catalog, hierarchy, closure, reasoner
  ingest/                Parquet ingestion helpers
  ml/                    Subgraph, neighbor loader, framework adapters
  observability/         Explain, profile, and tracing helpers
  udf/                   Python and Tuft UDF registry
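
To give a feel for what a hierarchy closure in an ontology module typically computes, the sketch below derives, for each class, the set containing itself and all transitive subclasses. It is a generic illustration under stated assumptions: the function `subclass_closure` and the class names are hypothetical, and the README does not say that CaracalDB's `onto/` module works this way.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def subclass_closure(subclass_of: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Map each class to itself plus all of its transitive subclasses."""
    children = defaultdict(set)
    classes = set()
    for child, parent in subclass_of:
        children[parent].add(child)
        classes.update((child, parent))

    closure = {}
    for cls in classes:
        seen, stack = {cls}, [cls]
        while stack:                          # iterative DFS down the hierarchy
            for sub in children.get(stack.pop(), ()):
                if sub not in seen:           # also guards against cycles
                    seen.add(sub)
                    stack.append(sub)
        closure[cls] = seen
    return closure


# Hypothetical hierarchy: (subclass, superclass) pairs.
hierarchy = [
    ("ProteinCodingGene", "Gene"),
    ("TumorSuppressorGene", "ProteinCodingGene"),
    ("Gene", "BioEntity"),
]
closure = subclass_closure(hierarchy)
print(sorted(closure["Gene"]))
```

With such a closure precomputed, an ontology-aware planner could expand a scan over `Gene` into scans over every class in `closure["Gene"]`.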

Execution Pipeline

Tuft text
    |
    v
Parser -> Binder -> Logical plan -> Physical pipeline
                                      |
                                      v
NodeScan / Filter / Project / Expand / Join / Aggregate operators
                                      |
                                      v
Arrow RecordBatch -> pyarrow.Table
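
The physical stage of the pipeline above can be pictured as a chain of pull-based operators that pass columnar batches downstream. The sketch below uses plain Python generators and dicts of lists as a stand-in for Arrow RecordBatch objects. The operator functions (`node_scan`, `filter_op`, `project`, `limit`) are illustrative assumptions, not CaracalDB's operator classes.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Batch = Dict[str, list]   # column name -> values; stand-in for an Arrow RecordBatch


def node_scan(rows: List[dict], batch_size: int) -> Iterator[Batch]:
    """Emit rows as columnar batches of at most batch_size rows."""
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        yield {key: [row[key] for row in chunk] for key in chunk[0]}


def filter_op(batches: Iterable[Batch], column: str, predicate: Callable) -> Iterator[Batch]:
    """Keep only the rows whose value in `column` satisfies the predicate."""
    for batch in batches:
        keep = [i for i, value in enumerate(batch[column]) if predicate(value)]
        if keep:
            yield {key: [values[i] for i in keep] for key, values in batch.items()}


def project(batches: Iterable[Batch], columns: List[str]) -> Iterator[Batch]:
    """Keep only the requested columns."""
    for batch in batches:
        yield {key: batch[key] for key in columns}


def limit(batches: Iterable[Batch], n: int) -> Iterator[Batch]:
    """Stop pulling from upstream once n rows have been emitted."""
    remaining = n
    for batch in batches:
        if remaining <= 0:
            break
        size = len(next(iter(batch.values())))
        take = min(size, remaining)
        yield {key: values[:take] for key, values in batch.items()}
        remaining -= take


genes = [
    {"symbol": "TP53", "chromosome": "17"},
    {"symbol": "BRCA1", "chromosome": "17"},
    {"symbol": "EGFR", "chromosome": "7"},
]
scan = node_scan(genes, batch_size=2)
filtered = filter_op(scan, "chromosome", lambda c: c == "17")
result = list(limit(project(filtered, ["symbol"]), n=10))
print(result)   # [{'symbol': ['TP53', 'BRCA1']}]
```

Because each operator is a generator, no batch is materialised until the consumer asks for it, and LIMIT can stop the scan early.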

Storage Pipeline

.crcl path
    |
    +-- packed single file
    |       |
    |       v
    |   temporary working bundle -> repacked on close
    |
    +-- directory bundle
            |
            v
    manifest / catalog / WAL / snapshots / node stores / edge stores / indexes
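
How a WAL and snapshots cooperate can be shown with a toy key-value store: every write is appended to a log before it is applied, a checkpoint persists the full state and truncates the log, and recovery loads the snapshot and replays whatever the log still holds. This is a generic sketch, not CaracalDB's storage code. `TinyStore`, the file names `wal.jsonl` and `snapshot.json`, and the JSON encoding are all assumptions, and a production WAL would also fsync and checksum its records.

```python
import json
import tempfile
from pathlib import Path


class TinyStore:
    """Toy key-value store whose state is a snapshot plus replayed WAL records."""

    def __init__(self, root: Path):
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)
        self.wal_path = root / "wal.jsonl"            # hypothetical file name
        self.snapshot_path = root / "snapshot.json"   # hypothetical file name
        self.state = self._recover()

    def _recover(self) -> dict:
        """Load the last snapshot, then replay WAL records written after it."""
        state = {}
        if self.snapshot_path.exists():
            state = json.loads(self.snapshot_path.read_text())
        if self.wal_path.exists():
            for line in self.wal_path.read_text().splitlines():
                record = json.loads(line)
                state[record["key"]] = record["value"]
        return state

    def put(self, key: str, value) -> None:
        """Append to the log first, then apply the change in memory."""
        with self.wal_path.open("a") as wal:
            wal.write(json.dumps({"key": key, "value": value}) + "\n")
        self.state[key] = value

    def checkpoint(self) -> None:
        """Persist the full state, then truncate the log."""
        tmp = self.snapshot_path.with_suffix(".tmp")
        tmp.write_text(json.dumps(self.state))
        tmp.replace(self.snapshot_path)   # rename over the old snapshot
        self.wal_path.write_text("")


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "demo.crcl"
    store = TinyStore(path)
    store.put("TP53", {"chromosome": "17"})
    store.checkpoint()                           # TP53 now lives in the snapshot
    store.put("BRCA1", {"chromosome": "17"})     # BRCA1 exists only in the WAL
    recovered = TinyStore(path).state            # simulate reopening the bundle
    print(sorted(recovered))   # ['BRCA1', 'TP53']
```

Reopening the store yields the same state whether a record was checkpointed or only logged, which is the property that makes recovery reproducible.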

Repository Layout

caracaldb/   Python package source
tests/       Unit, golden, property, and end-to-end tests
schema/      FlatBuffers and storage/catalog schemas
docs/        Design documents and user documentation
examples/    Runnable examples and case-study notebooks

Project Status

CaracalDB is pre-release and not yet suitable for production use. The current milestone line is documented in docs/05_wbs.md; M0 has been accepted in docs/milestones/M0-gate.md, and the repository is now focused on expanding the M1 vertical slice.

Contributing

Start with docs/04_caracaldb_implementation.md and docs/05_wbs.md. The core project constraints are:

  1. Keep the engine embedded-first.
  2. Preserve Arrow-native execution boundaries.
  3. Treat Tuft diagnostics and golden parser tests as public contract.
  4. Keep .crcl storage reproducible through WAL, snapshots, and pack/unpack tests.
  5. Measure performance changes before claiming speedups.

License

Apache License 2.0. See LICENSE.
