# arangodb-schema-analyzer

Agentic schema analyzer for ArangoDB: a conceptual model plus a conceptual-to-physical mapping for transpilers.

Standalone Python library that analyzes an ArangoDB database's physical schema and produces:

- a conceptual schema (entities, relationships, properties)
- a conceptual→physical mapping suitable for transpilers (Cypher, SPARQL, future targets)
- metadata (confidence, timestamp, analyzed collection counts, detected patterns, per-entity tenant scope, deployment-style sharding profile)

Current release: see CHANGELOG.md. The version in `pyproject.toml` is the single source of truth.
## Install

From source (this repo):

```bash
python -m pip install -e .
```

Optional LLM provider extras:

```bash
python -m pip install -e ".[openai]"
python -m pip install -e ".[anthropic]"
python -m pip install -e ".[openrouter]"
```

The OpenRouter provider requires no extra SDK (it uses the stdlib `urllib`); the `[openrouter]` extra exists only as a documentation marker and installs nothing.

MCP (Model Context Protocol): optional stdio server wrapping the v1 JSON tool contract:

```bash
python -m pip install -e ".[mcp]"
arangodb-schema-analyzer-mcp
```

Development extras (pytest, ruff, mypy, etc.):

```bash
python -m pip install -e ".[dev]"
```

If you don't install a provider SDK (or don't provide an API key), analysis degrades gracefully to deterministic baseline inference.
## Usage

```python
from arango import ArangoClient
from schema_analyzer import AgenticSchemaAnalyzer

client = ArangoClient(hosts="http://localhost:8529")
db = client.db("mydb", username="root", password="openSesame")

analyzer = AgenticSchemaAnalyzer(
    llm_provider="openai",  # or "anthropic" or "openrouter"
    api_key=None,  # e.g. os.environ["OPENAI_API_KEY"]
    model="gpt-4o-mini",
    cache={"type": "filesystem", "directory": ".schema-analyzer-cache"},
)

analysis = analyzer.analyze_physical_schema(
    db,
    timeout_ms=60_000,
    sample_limit_per_collection=5,
)
print(analysis.metadata.confidence)
```
## Tool usage (CLI)

This project can be called as a non-interactive tool (stdin JSON → stdout JSON) using the v1 contract under `docs/tool-contract/v1/`.

Install (editable):

```bash
python -m pip install -e .
```

Example (analyze), using the provided request example:

```bash
cat docs/tool-contract/v1/examples/request.analyze.json | arangodb-schema-analyzer --pretty
```

### CLI options

```
arangodb-schema-analyzer [--request FILE] [--out FILE] [--pretty] [-v|--verbose]
```

- `--request FILE` — path to the request JSON (default: read from stdin)
- `--out FILE` — write the response JSON to a file (default: stdout)
- `--pretty` — pretty-print JSON output
- `-v` / `--verbose` — enable verbose logging
## Evaluation CLI

Run analysis-quality benchmarks against the bundled domain packs:

```bash
arangodb-schema-analyzer eval \
  --provider openai \
  --model gpt-4o-mini \
  --report eval_report.json
```

Pass `--baseline <prior-report.json>` to diff a new run against an earlier report (the baseline file is whatever a previous `--report` run produced; no baseline ships in the repo).

Options: `--url`, `--user`, `--password`, `--database`, `--domains`, `--sample-limit`, `--timeout-ms`, `--scale`, `--no-cleanup`.

Domains included: healthcare, financial_fraud_detection, insurance, intelligence, network_asset_management.
## Public API

Exports (see `schema_analyzer/__init__.py`):

- `AgenticSchemaAnalyzer` — main analyzer class
- `ConceptualSchema` — conceptual schema dataclass
- `PhysicalMapping` — physical mapping dataclass with AQL helpers
- `generate_schema_docs(analysis)` — Markdown documentation generator
- `export_mapping(analysis, target)` — transpiler export (currently only `cypher`)
- `export_conceptual_model_as_owl_turtle(analysis)` — OWL Turtle export
- `register_provider(name, ...)` — register custom LLM providers
- `list_providers()` — list registered LLM provider names
- `run_tool(request_dict)` — programmatic entrypoint to the v1 tool contract
- `fingerprint_physical_schema(snapshot)` — full-snapshot SHA-256 (cache key)
- `fingerprint_physical_shape(db, *, exclude_collections=None)` — cheap probe that hashes only the collection set plus per-collection type and index digests
- `fingerprint_physical_counts(db, *, exclude_collections=None)` — shape fingerprint combined with per-collection `count()`
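The shape fingerprint supports cheap change detection: hash only structural facts and re-run the full analysis when the hash changes. A minimal illustrative sketch of that idea (not the library's actual implementation — the real probes take a live `db` handle):

```python
import hashlib
import json

def shape_fingerprint(collections: dict) -> str:
    """Hash the collection set plus each collection's type and index digests.

    `collections` maps name -> {"type": ..., "indexes": [...]}; this mirrors
    the idea behind fingerprint_physical_shape, not its code.
    """
    canonical = {
        name: {"type": info["type"], "indexes": sorted(info["indexes"])}
        for name, info in sorted(collections.items())
    }
    blob = json.dumps(canonical, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()

before = shape_fingerprint({"movies": {"type": "document", "indexes": ["primary"]}})
after = shape_fingerprint({"movies": {"type": "document", "indexes": ["primary", "hash:title"]}})
print(before != after)  # True: adding an index changes the fingerprint
```

Because document contents never enter the hash, inserts and updates leave the shape fingerprint unchanged; `fingerprint_physical_counts` exists for when data churn should also trigger re-analysis.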
## Recent additions

See CHANGELOG.md for the full history. Highlights since 0.3.0:

- **0.6.0** — Shard-family detection (`physicalMapping.shardFamilies`) groups conceptual entities that share an identical property set and a common name suffix (the per-source / per-repo collection-duplication pattern), so downstream consumers can emit UNION-aware guidance instead of silently picking one member. Also adds multitenancy classification (`metadata.multitenancy`) layered on the sharding profile.
- **0.5.0** — Sharding-profile classification. Every analysis stamps `metadata.shardingProfile` with one of `OneShard`, `DisjointSmartGraph`, `SmartGraph`, `SatelliteGraph`, or `Sharded`, plus per-graph and per-collection evidence. Snapshot-only, no extra DB round trip.
- **0.4.0** — Tenant-scope annotations. Every entity in `physicalMapping.entities[*]` now carries a `tenantScope` block (`tenant_root` / `tenant_scoped` / `global`), with a per-run `metadata.tenantScopeReport` summary. Configurable via `SCHEMA_ANALYZER_TENANT_*` env vars.
- **0.3.0** — Cheap change-detection probes (`fingerprint_physical_shape`, `fingerprint_physical_counts`), a statistics block on `metadata.statistics`, and a reconciliation step that backfills any collections the LLM omitted.
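The shard-family grouping can be pictured as bucketing on two keys: the shared name suffix and the property set. A sketch of that idea only (here the suffix is assumed to be the part after the last underscore; the library's actual detector may differ):

```python
from collections import defaultdict

def shard_families(entities: dict, min_size: int = 2) -> list:
    """Group entity names sharing an identical property set and a common suffix.

    `entities` maps entity name -> set of property names. Illustrative only:
    mirrors the grouping idea behind physicalMapping.shardFamilies.
    """
    buckets = defaultdict(list)
    for name, props in entities.items():
        suffix = name.rsplit("_", 1)[-1]  # e.g. repoA_commits -> "commits"
        buckets[(suffix, frozenset(props))].append(name)
    # Only groups meeting the minimum size (cf. MIN_SHARD_FAMILY_SIZE) qualify.
    return [sorted(members) for members in buckets.values() if len(members) >= min_size]

families = shard_families({
    "repoA_commits": {"sha", "author"},
    "repoB_commits": {"sha", "author"},
    "users": {"name"},
})
print(families)  # [['repoA_commits', 'repoB_commits']]
```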
## Configuration

Tunable defaults live in `schema_analyzer/defaults.py` (full list there). Selected parameters:

| Parameter | Default | Description |
|---|---|---|
| `MAX_REPAIR_ATTEMPTS` | `2` | LLM repair-loop iterations |
| `LLM_TEMPERATURE` | `0.0` | Sampling temperature |
| `DEFAULT_TIMEOUT_MS` | `60000` | Analysis timeout (ms) |
| `DEFAULT_REVIEW_THRESHOLD` | `0.6` | Confidence threshold for `review_required` |
| `DEFAULT_CACHE_TTL_SECONDS` | `86400` | Cache TTL (seconds) |
| `TENANT_SCOPE_ROOT_NAMES` | `("Tenant",)` | Entity names treated as tenant roots |
| `TENANT_SCOPE_FIELD_REGEX` | `^tenant[_-]?(id\|key)$` | Denormalised tenant-reference field detector (regex pipe escaped as `\|` to avoid breaking the markdown table) |
| `MIN_TENANT_FIELD_COVERAGE_FRACTION` | `0.5` | Threshold for `discriminator_field` multitenancy |
| `MIN_SHARD_FAMILY_SIZE` | `2` | Min members for a shard-family group |
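As an illustration of the tenant-field detector, the default regex matches common denormalised tenant-reference names. The regex is compiled as-is here; whether the library adds flags such as case-insensitivity is not documented above:

```python
import re

# Default TENANT_SCOPE_FIELD_REGEX from the table above.
TENANT_FIELD = re.compile(r"^tenant[_-]?(id|key)$")

candidates = ["tenant_id", "tenant-key", "tenantid", "customer_id", "tenant_name"]
matches = [f for f in candidates if TENANT_FIELD.match(f)]
print(matches)  # ['tenant_id', 'tenant-key', 'tenantid']
```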
## Notes

- Secrets: API keys are read from config/env and are never persisted by this library.
- AQL fragments: helper methods return AQL text plus bind variables; collection names are passed via bind parameters.
- Graceful degradation: without an LLM provider, the analyzer returns deterministic baseline inference with `review_required=True`.
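Passing collection names via bind parameters is standard AQL practice: a collection bind parameter is written `@@name` in the query and keyed as `"@name"` in the bind-variable map. A hedged sketch of the shape such a fragment might take (the helper names and exact return structure are not shown above):

```python
# Illustrative only: AQL text plus bind variables, as the Notes describe.
# "@@col" is a collection bind parameter; its bind-variable key is "@col".
fragment = {
    "query": "FOR doc IN @@col FILTER doc.tenant_id == @tenant RETURN doc",
    "bind_vars": {"@col": "orders", "tenant": "acme"},
}
print(sorted(fragment["bind_vars"]))  # ['@col', 'tenant']
```

Keeping collection names out of the query string avoids string interpolation and lets the same fragment be reused across collections.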
## Integration evaluation (Docker ArangoDB)

Bring up a local ArangoDB:

```bash
docker compose up -d
```

Run the integration tests (opt-in):

```bash
export RUN_INTEGRATION=1
export ARANGO_URL=http://localhost:18529
export ARANGO_DB=schema_analyzer_it
export ARANGO_USER=root
export ARANGO_PASS=openSesame
pytest -q -m integration
```