CaracalDB

An Embedded, Ontology-Leaning, Arrow-Native Analytical GraphDB for KG and GNN Workflows.

Why CaracalDB | Quickstart | API Overview | Architecture

CaracalDB is an embedded graph database for knowledge graphs, ontology-aware query planning, GNN sampling, and ML feature workflows. The current implementation is a Python reference engine that validates the .crcl storage format, Tuft query language, planner surface, and user-facing API. A Rust core is planned, but it is not part of the current package.

Quickstart

Install

pip install caracaldb

or

uv add caracaldb

For development from a repository checkout:

uv sync --extra dev
uv run pytest

30-Second Quickstart

import caracaldb as cdb

with cdb.connect("examples/data/example_simple.crcl") as db:
    db.define_class("Person")
    db.insert_nodes(
        "Person",
        [
            {"name": "Alice", "age": 28, "city": "New York"},
            {"name": "Bob", "age": 34, "city": "London"},
            {"name": "Charlie", "age": 25, "city": "Paris"},
            {"name": "Diana", "age": 42, "city": "Tokyo"},
        ],
    )

with cdb.connect("examples/data/example_simple.crcl", mode="ro") as db:
    rows = db.sql("MATCH (p:Person) RETURN p.name, p.city LIMIT 2").rows()
    print(rows)

Expected output:

[{'name': 'Alice', 'city': 'New York'}, {'name': 'Bob', 'city': 'London'}]

The current Python reference query path supports a single MATCH (alias:Class) node pattern with WHERE, RETURN, and LIMIT. Broader graph patterns, richer binding, and multi-hop query execution are tracked in the milestone docs.
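To make that restriction concrete, the one supported query shape can be sketched with a toy recognizer (an illustration of the grammar only — the names `PATTERN` and `describe` are hypothetical and this is not the actual Tuft parser):

```python
import re

# Toy recognizer for the single pattern the reference engine supports:
#   MATCH (alias:Class) [WHERE ...] RETURN ... [LIMIT n]
# Illustrative only; the real Tuft parser is far richer.
PATTERN = re.compile(
    r"MATCH \((?P<alias>\w+):(?P<cls>\w+)\)"
    r"(?: WHERE (?P<where>.+?))?"
    r" RETURN (?P<ret>.+?)"
    r"(?: LIMIT (?P<limit>\d+))?$"
)

def describe(query: str) -> dict:
    m = PATTERN.match(query)
    if m is None:
        raise ValueError("unsupported pattern")
    return m.groupdict()

print(describe("MATCH (p:Person) RETURN p.name, p.city LIMIT 2"))
# {'alias': 'p', 'cls': 'Person', 'where': None,
#  'ret': 'p.name, p.city', 'limit': '2'}
```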

Start Here

  • Language spec: docs/01_language_spec.md
  • Engine spec: docs/02_engine_spec.md
  • Modeling case study: docs/03_user_modeling_case_study.md
  • Implementation plan: docs/04_caracaldb_implementation.md
  • Work breakdown: docs/05_wbs.md
  • Error index: docs/errors/TF-INDEX.md
  • Examples: examples/
  • Benchmark CI: .github/workflows/bench.yml

Why CaracalDB

CaracalDB is built around explicit storage, ontology, and execution boundaries:

flowchart LR
    A["Tuft query"] --> B["Parser and diagnostics"]
    B --> C["Binder and ontology catalog"]
    C --> D["Logical plan"]
    D --> E["Physical operators"]
    E --> F["Arrow RecordBatch"]
    G[".crcl bundle or packed file"] --> H["Catalog, WAL, snapshots, stores"]
    H --> E
    H --> I["CSR / CSC graph indexes"]
    I --> J["Traversal, sampling, and ML adapters"]

  • Embedded-first operation: no required server process.
  • Tuft combines Cypher-like graph patterns with SPARQL-like ontology semantics.
  • Arrow is the execution boundary for scan results and downstream analytics.
  • CSR and CSC graph layouts support traversal, neighbor sampling, and GNN workflows.
  • Snapshot, WAL, and packed .crcl storage paths are tested as first-class engine pieces.
  • The Python API is intentionally small; Rust core work is planned after the reference behavior is stable.
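The CSR layout mentioned above can be illustrated with a minimal pure-Python sketch — two flat arrays, `offsets` and `targets` — which is the general technique, not CaracalDB's actual `graph/` builder:

```python
# Minimal CSR (compressed sparse row) sketch: node i's out-neighbors live
# in targets[offsets[i]:offsets[i + 1]]. Illustrative only; CaracalDB's
# graph/ module holds the real builders and readers.
def build_csr(num_nodes, edges):
    """edges: iterable of (src, dst) pairs with integer node ids."""
    degree = [0] * num_nodes
    for src, _ in edges:
        degree[src] += 1
    offsets = [0] * (num_nodes + 1)
    for i in range(num_nodes):
        offsets[i + 1] = offsets[i] + degree[i]
    targets = [0] * offsets[-1]
    cursor = offsets[:-1].copy()  # next write position per source node
    for src, dst in edges:
        targets[cursor[src]] = dst
        cursor[src] += 1
    return offsets, targets

def neighbors(offsets, targets, node):
    return targets[offsets[node]:offsets[node + 1]]

offsets, targets = build_csr(4, [(0, 1), (0, 2), (1, 3), (2, 3)])
print(neighbors(offsets, targets, 0))  # [1, 2]
```

The same two-array idea with edges grouped by destination gives CSC, which serves reverse traversal and in-neighbor sampling.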

Benchmarks

Benchmark automation is scaffolded in the repository:

  • CI automation: .github/workflows/bench.yml
  • Benchmark harness tests: tests/test_bench_pkg/

The CLI exposes a benchmark command for registered scenarios:

caracal bench NAME

CLI

The CLI is available as caracal:

# Initialise an empty .crcl bundle
caracal init demo

# Run a Tuft query from a file
caracal run demo.crcl --file query.tuft

# Print an explain tree
caracal explain demo.crcl Gene

# Pack and unpack .crcl storage
caracal pack demo.crcl -o demo-packed.crcl
caracal unpack demo-packed.crcl -o restored.crcl

API Overview

Top-level functions and types

API                                           Description
cdb.connect(path, mode="rw", format="auto")   Open or create a .crcl database
Database.cursor()                             Create a query connection
Database.catalog                              Access the ontology catalog
Database.bundle                               Access the underlying storage bundle
Database.open_node_store(class_iri)           Open a node store for a class
Connection.sql(text, params=None)             Execute supported Tuft query text
Result.arrow()                                Return a pyarrow.Table
Result.record_batches()                       Iterate pyarrow.RecordBatch results

CLI commands

Command                          Description
caracal init PATH                Initialise an empty .crcl bundle
caracal run BUNDLE --file QUERY  Execute a Tuft query and emit JSON
caracal explain BUNDLE QUERY     Print a logical explain tree
caracal bench NAME               Run a registered microbenchmark
caracal pack BUNDLE -o FILE      Package a directory bundle into a packed .crcl file
caracal unpack FILE -o DIR       Restore a packed .crcl file into a bundle

Architecture

CaracalDB is organized as a Python package with focused modules for language, planning, execution, storage, graph layout, ontology, and ML interop:

caracaldb/
  api.py                 Public connect / Database / Connection / Result API
  cli/                   Typer command-line interface
  lang/tuft/             Tuft parser, AST, binder, transformer, typing
  plan/                  Logical plan nodes, rules, cost model, pattern compiler
  exec/                  Physical operators and execution context
  storage/               .crcl bundle, WAL, snapshots, pack/unpack, stores
  graph/                 CSR / CSC builders, readers, HNSW support
  onto/                  Catalog, hierarchy, closure, reasoner
  ingest/                Parquet ingestion helpers
  ml/                    Subgraph, neighbor loader, framework adapters
  observability/         Explain, profile, and tracing helpers
  udf/                   Python and Tuft UDF registry

Execution Pipeline

Tuft text
    |
    v
Parser -> Binder -> Logical plan -> Physical pipeline
                                      |
                                      v
NodeScan / Filter / Project / Expand / Join / Aggregate operators
                                      |
                                      v
Arrow RecordBatch -> pyarrow.Table
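The operator stages above can be mimicked with Python generators over dict rows. This is a conceptual sketch of the NodeScan → Filter → Project composition; the real engine streams Arrow RecordBatches through physical operators, not dicts:

```python
# Conceptual NodeScan -> Filter -> Project pipeline built from generators.
# Each stage consumes rows lazily from the stage before it.
def node_scan(table):
    yield from table

def filter_op(rows, predicate):
    return (row for row in rows if predicate(row))

def project(rows, columns):
    return ({col: row[col] for col in columns} for row in rows)

people = [
    {"name": "Alice", "age": 28, "city": "New York"},
    {"name": "Bob", "age": 34, "city": "London"},
]
pipeline = project(filter_op(node_scan(people), lambda r: r["age"] > 30), ["name"])
print(list(pipeline))  # [{'name': 'Bob'}]
```

The lazy composition is the point: nothing materializes until the sink pulls, which is the same shape a batch-at-a-time Arrow pipeline takes.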

Storage Pipeline

.crcl path
    |
    +-- packed single file
    |       |
    |       v
    |   temporary working bundle -> repacked on close
    |
    +-- directory bundle
            |
            v
    manifest / catalog / WAL / snapshots / node stores / edge stores / indexes
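The packed-file vs directory-bundle duality can be illustrated with an in-memory zip round trip. The bundle contents and the zip container here are purely illustrative — the actual packed .crcl layout is CaracalDB-specific and not a zip:

```python
import io
import zipfile

# Stand-in directory bundle: relative paths mapped to file bytes.
bundle = {
    "manifest.json": b'{"version": 1}',
    "catalog/classes.json": b'["Person"]',
    "wal/000001.log": b"",
}

# "pack": fold the directory tree into one file.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    for path, data in bundle.items():
        zf.writestr(path, data)

# "unpack": restore the tree from the single file.
with zipfile.ZipFile(io.BytesIO(buf.getvalue())) as zf:
    restored = {name: zf.read(name) for name in zf.namelist()}

print(restored == bundle)  # True
```

Whichever container format is used, the invariant is the same: pack followed by unpack must reproduce the working bundle byte for byte.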

Repository Layout

caracaldb/   Python package source
tests/       Unit, golden, property, and end-to-end tests
schema/      FlatBuffers and storage/catalog schemas
docs/        Design documents and user documentation
examples/    Runnable examples and case-study notebooks

Project Status

CaracalDB is pre-release and not yet suitable for production use. Milestones M0 through M5 are accepted in docs/milestones/, and the engine is currently in the v0.2.x documentation and benchmark sweep. Multi-hop pattern matching, relationship-type unions, Tuft bounded variable-length paths, vector search calls, and the degree() graph built-in are wired through Connection.sql. Python-level graph ecosystem primitives now include the vector index lifecycle, vector search, neighbors, k_hop, bounded paths, shortest_path, Arrow batch upsert, property-index metadata, capabilities, and profile/explain telemetry. Multi-label nodes remain a carry-over item.

The closest peers — embedded analytical graph engines — are kuzu, DuckPGQ, and Memgraph's embedded library mode. Comparisons against server-tier graph databases (Neo4j Enterprise, Neptune, TigerGraph) are not the right reference frame for an embedded .crcl file.

Non-goals

CaracalDB is deliberately scoped against a small set of features that belong in a different product:

  • No server process, no network protocol. No Bolt, no gRPC, no HTTP endpoint. The analogue is DuckDB or SQLite, not Neo4j Enterprise.
  • No multi-writer concurrency. A .crcl bundle is opened by one writer; readers can hold older snapshots. Coordinating multiple writers belongs to a layer above the engine.
  • No authentication, authorization, or row-level ACLs. Filesystem permissions are the only access boundary. Embedded governance belongs to the host application or a server tier.
  • No SPARQL endpoint, no full OWL-DL. CaracalDB supports OWL-RL-style class/property hierarchies and IRI identity; RDF/Turtle is an import concern, not an engine surface (see docs/adr/0005-rdf-as-import-only.md).
  • No bundled LLM / GraphRAG framework. CaracalDB is a substrate for GNN and KG workflows; LLM glue is the host application's job. The Arrow record_batches() / arrow() outputs are the integration contract.

The one governance-adjacent feature that does fit the embedded model is deterministic, named snapshots with content-addressable manifests, plus a caracal diff command for auditing graph versions. That lets an outer governance layer pin and diff a database without the engine taking on multi-tenant concerns.
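The content-addressing and diffing idea can be sketched in a few lines of stdlib Python. The manifest shape, `snapshot_id`, and `diff` here are illustrative assumptions, not CaracalDB's API:

```python
import hashlib
import json

# A manifest maps bundle paths to content digests. Hashing a canonical
# JSON encoding of it yields a deterministic snapshot id.
def snapshot_id(manifest: dict) -> str:
    canonical = json.dumps(manifest, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()

# Diff two manifests the way a `caracal diff` command might report it.
def diff(old: dict, new: dict) -> dict:
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "changed": sorted(k for k in set(old) & set(new) if old[k] != new[k]),
    }

v1 = {"nodes/person.arrow": "sha256:aa", "edges/knows.arrow": "sha256:bb"}
v2 = {"nodes/person.arrow": "sha256:cc", "edges/likes.arrow": "sha256:dd"}
print(diff(v1, v2))
print(snapshot_id(v1) != snapshot_id(v2))  # True
```

Because the id is a pure function of content, two bundles with identical manifests pin to the same snapshot id regardless of when or where they were written.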

Contributing

Start with docs/04_caracaldb_implementation.md and docs/05_wbs.md. The core project constraints are:

  1. Keep the engine embedded-first.
  2. Preserve Arrow-native execution boundaries.
  3. Treat Tuft diagnostics and golden parser tests as public contract.
  4. Keep .crcl storage reproducible through WAL, snapshots, and pack/unpack tests.
  5. Measure performance changes before claiming speedups.

License

Apache License 2.0. See LICENSE.
