Central, multi-repo code knowledge graph for AI agents — Neo4j + Tree-sitter + MCP.
central-code-knowledge-graph
Stop re-reading. Start querying.
AI coding tools re-read your entire codebase on every task. ckg fixes that. One server indexes every repo in your org with Tree-sitter across 26 languages, stores the structural map as a Neo4j property graph, keeps it fresh via incremental ingest + webhooks, and serves precise context to your AI assistant via MCP so it reads only what matters.
One server that:
- ingests many repositories (not just one) and keeps them incrementally fresh
- stores them as a Neo4j property graph (`File`, `Class`, `Function`, `Module` + `CONTAINS`, `DEFINES`, `HAS_METHOD`, `CALLS`, `IMPORTS`)
- exposes REST, GraphQL, MCP/JSON-RPC, and a `ckg` CLI
- supports structural queries (callers, callees, imports, blast radius, downstream dependencies), full-text search, and semantic vector search
- generates an architecture map with coupling warnings (cyclic deps, god modules, SDP violations) every ingest
- secures every endpoint with scoped API tokens (argon2id-hashed)
- runs as a single `docker compose up`
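To make the coupling warnings concrete, here is a minimal sketch of two of them: cycle detection and Stable Dependencies Principle (SDP) checks. It is not ckg's implementation. It assumes a module-level import graph held in a plain dict, and `IMPORTS`, `find_cycle`, `instability` and `sdp_violations` are hypothetical names. Instability is the usual metric I = Ce / (Ca + Ce), where Ce counts outgoing and Ca incoming dependencies.

```python
from collections import defaultdict

# Hypothetical import graph: module -> modules it imports.
IMPORTS = {
    "api": {"services", "auth"},
    "services": {"db", "parsers"},
    "parsers": {"services"},  # cycle: services <-> parsers
    "auth": {"db"},
    "db": set(),
}


def find_cycle(graph):
    """Return one dependency cycle as a list of modules, or None."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = defaultdict(int)
    stack = []

    def visit(node):
        color[node] = GREY
        stack.append(node)
        for dep in sorted(graph.get(node, ())):
            if color[dep] == GREY:  # back edge closes a cycle
                return stack[stack.index(dep):] + [dep]
            if color[dep] == WHITE:
                found = visit(dep)
                if found:
                    return found
        stack.pop()
        color[node] = BLACK
        return None

    for start in sorted(graph):
        if color[start] == WHITE:
            found = visit(start)
            if found:
                return found
    return None


def instability(graph):
    """I = Ce / (Ca + Ce) per module; 0.0 for isolated modules."""
    afferent = defaultdict(int)
    for deps in graph.values():
        for dep in deps:
            afferent[dep] += 1
    result = {}
    for mod, deps in graph.items():
        ce, ca = len(deps), afferent[mod]
        result[mod] = ce / (ca + ce) if (ca + ce) else 0.0
    return result


def sdp_violations(graph):
    """Edges where a module depends on a less stable (higher I) module."""
    inst = instability(graph)
    return sorted(
        (src, dst)
        for src, deps in graph.items()
        for dst in deps
        if inst[dst] > inst[src]
    )
```

On the sample graph, `find_cycle(IMPORTS)` reports the `services` / `parsers` loop. A "god module" check could be layered on the same afferent counts by flagging modules above a chosen threshold.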
Why
| Need | How this server delivers |
|---|---|
| Rock-solid, won't fall over | Stateless API + workers; Neo4j/Postgres/Redis run with healthchecks + restart: unless-stopped; horizontal scale via --scale worker=N |
| Fast relationship search for AI agents | Native graph DB (Cypher) + Lucene FTS + vector index — all in Neo4j |
| Multi-language | Tree-sitter parsers (23): Python, JS/TS (incl. JSX/TSX → React, Angular), Rust, Go, Java, Ruby, C, C++, C#, Kotlin, Scala, Swift, PHP, Solidity, Dart, R, Perl, Lua, Zig, PowerShell, Julia, Nix. Extraction wrappers (3): Vue, Svelte (delegates <script> to JS/TS), Jupyter/Databricks .ipynb (concatenates code cells, dispatches by kernel). Pluggable — one file under ckg/parsers/ adds another language |
| Precise cross-file edges | Opt-in LSP pass (CKG_LSP_ENABLED=true) upgrades CALLS edges with language-server-resolved targets. Pyright today; rust-analyzer / gopls / ts-server / jdtls planned. Graph stays functional with no LSP installed |
| Fast updates | Incremental ingest (--incremental): sha-diffs files against the graph, only re-parses what changed. Full reparse stays available as --full |
| Context for AI tools | Built-in MCP HTTP server → Cursor, VS Code, Claude Code drop in |
| Two query surfaces | REST (/v1/*) for simple calls + GraphQL (/v1/graphql) for composed traversals; both use the same API token |
| CLI for automation | ckg Typer CLI: register, ingest, query, search |
| Spec-driven | Auto-generated OpenAPI at /docs; GraphiQL UI at /v1/graphql; ADRs under docs/adr/ |
| Whole-codebase index | One Neo4j graph spans all registered repos |
| Neo4j-backed | Functions, classes, files, imports, calls all stored as labeled nodes + typed relationships |
| Secure | API tokens with scopes (admin, repo:write, repo:read); hashed at rest |
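The incremental ingest row describes a sha-diff against the graph. A rough sketch of that idea follows, under stated assumptions: SHA-256 is picked for illustration (the hash ckg actually uses is not stated here), the graph's stored hashes are modelled as a dict, and `sha_of`, `scan` and `diff_hashes` are hypothetical helpers.

```python
import hashlib
from pathlib import Path


def sha_of(data: bytes) -> str:
    """Content hash of one file's bytes (SHA-256 chosen for this sketch)."""
    return hashlib.sha256(data).hexdigest()


def scan(root: Path) -> dict[str, str]:
    """Map every file under root (repo-relative path) to its content hash."""
    return {
        p.relative_to(root).as_posix(): sha_of(p.read_bytes())
        for p in root.rglob("*")
        if p.is_file()
    }


def diff_hashes(current: dict[str, str], known: dict[str, str]):
    """Compare the working tree with hashes recorded at the last ingest.

    Returns (to_parse, to_delete): new or changed files to re-parse,
    and files that vanished and should be dropped from the graph.
    """
    to_parse = sorted(p for p, sha in current.items() if known.get(p) != sha)
    to_delete = sorted(p for p in known if p not in current)
    return to_parse, to_delete
```

Unchanged files never reach the parser, which is where the time saving of `--incremental` over `--full` would come from.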
Supported languages
Tree-sitter parsers (23): Python · JavaScript (incl. JSX → React) · TypeScript (incl. TSX → Angular) · Rust · Go · Java · Ruby · C · C++ · C# · Kotlin · Scala · Swift · PHP · Solidity · Dart · R · Perl · Lua · Zig · PowerShell · Julia · Nix
Extraction wrappers (3): Vue & Svelte SFCs (delegate <script> to JS/TS) · Jupyter / Databricks .ipynb (concatenate code cells, dispatch by kernel language)
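As an illustration of the notebook wrapper, the sketch below concatenates the code cells of an nbformat-4 notebook and reads the kernel language that a dispatcher could use. It is an approximation, not the wrapper shipped in ckg: `extract_notebook_code` is a hypothetical name, and falling back to Python when no kernel is declared is an assumption.

```python
import json


def extract_notebook_code(raw: str) -> tuple[str, str]:
    """Return (language, concatenated code cells) for an nbformat-4 notebook."""
    nb = json.loads(raw)
    meta = nb.get("metadata", {})
    # kernelspec.language is the usual place; language_info.name is a fallback.
    language = (
        meta.get("kernelspec", {}).get("language")
        or meta.get("language_info", {}).get("name")
        or "python"  # assumption: default when the notebook declares nothing
    )
    chunks = []
    for cell in nb.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # skip markdown and raw cells
        source = cell.get("source", "")
        # nbformat stores source as either one string or a list of lines.
        chunks.append("".join(source) if isinstance(source, list) else source)
    return language.lower(), "\n\n".join(chunks)
```

The returned language string would then select the matching Tree-sitter parser for the concatenated source.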
Pluggable — adding another language is one file under ckg/parsers/ and one line in the registry.
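The "one file plus one registry line" pattern could look roughly like the sketch below. The real interface under ckg/parsers/ is not shown in this README, so `REGISTRY`, `register`, `parser_for` and the toy `parse_python` are all hypothetical, and the line-based scan merely stands in for a Tree-sitter walk.

```python
from pathlib import Path
from typing import Callable, Dict, List

# A parser takes source text and returns the symbol names it found.
Parser = Callable[[str], List[str]]
REGISTRY: Dict[str, Parser] = {}


def register(*extensions: str):
    """Decorator that maps file extensions to a parser function."""
    def wrap(func: Parser) -> Parser:
        for ext in extensions:
            REGISTRY[ext.lower()] = func
        return func
    return wrap


@register(".py")
def parse_python(source: str) -> List[str]:
    # Toy stand-in for a Tree-sitter walk: collect top-level def names.
    names = []
    for line in source.splitlines():
        if line.startswith("def "):
            names.append(line[4:].split("(", 1)[0].strip())
    return names


def parser_for(path: str):
    """Look up the parser for a path, or None if the language is unknown."""
    return REGISTRY.get(Path(path).suffix.lower())
```

With this shape, supporting a new language means writing one decorated function; files with unregistered extensions are simply skipped.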
Architecture
                    ┌──────────────┐
AI agents ───MCP──▶ │              │
CLI (ckg) ──REST──▶ │   FastAPI    │──▶ Auth (API tokens, scopes)
Web UI ─────GQL───▶ │              │──▶ Audit log
                    └──────┬───────┘
                           │
        ┌──────────────────┼───────────────────┐
        ▼                  ▼                   ▼
  ┌────────────┐    ┌─────────────┐    ┌───────────────┐
  │  Neo4j 5   │    │  Postgres   │    │     Redis     │
  │  graph +   │    │  repos +    │    │ cache + queue │
  │  vector +  │    │  tokens +   │    └───────┬───────┘
  │  FTS       │    │  runs +     │            │
  └────────────┘    │  audit      │    ┌───────▼───────┐
                    └─────────────┘    │ Celery workers│
                                       │  - clone      │
                                       │  - parse      │
                                       │  - embed      │
                                       │  - write graph│
                                       └───────┬───────┘
                                               │
                                       ┌───────▼───────┐
                                       │  Tree-sitter  │
                                       │    parsers    │
                                       │ 23 languages  │
                                       │ + 3 wrappers  │
                                       └───────────────┘
Full design rationale: docs/adr/0001-architecture.md.
Quickstart
1. Prerequisites
- Docker Desktop (macOS / Windows) or Docker Engine + Compose v2 (Linux)
- 8 GB free RAM recommended
- Python 3.11+ on the host only if you want the CLI locally
2. Clone and configure
git clone https://github.com/ajankurjain/central-code-knowledge-graph.git
cd central-code-knowledge-graph
cp .env.example .env
Edit .env and replace every change-me-*. Generate strong values with:
python -c "import secrets; print(secrets.token_urlsafe(32))"
3. Start the stack
make up
(or docker compose up -d)
First boot is 1–3 minutes (image pulls + Neo4j schema init).
curl http://localhost:8080/readyz
# {"ready": true, "checks": {"neo4j": true, "postgres": true, "redis": true}, ...}
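For scripted setups, the same readiness check can be polled from Python. This is a small sketch using only the standard library; it relies on the `ready` and `checks` fields shown in the sample response above, while `failing_checks`, `wait_until_ready` and the timeout values are illustrative choices.

```python
import json
import time
import urllib.error
import urllib.request


def failing_checks(payload: dict) -> list[str]:
    """Names of stores whose readiness check is not true."""
    return sorted(k for k, ok in payload.get("checks", {}).items() if not ok)


def wait_until_ready(base_url: str, timeout: float = 180.0, interval: float = 3.0) -> bool:
    """Poll /readyz until it reports ready or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/readyz", timeout=5) as resp:
                payload = json.load(resp)
            if payload.get("ready"):
                return True
            print("waiting on:", ", ".join(failing_checks(payload)))
        except (urllib.error.URLError, OSError, ValueError):
            pass  # server not accepting connections yet, or body not JSON
        time.sleep(interval)
    return False
```

Calling `wait_until_ready("http://localhost:8080")` after `make up` blocks until all three stores report healthy or three minutes pass.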
Open the auto-generated API docs: http://localhost:8080/docs
Open the web UI: http://localhost:3000 (paste an API token to sign in).
4. Install the CLI
From PyPI (recommended — CLI-only, light install):
pip install central-code-knowledge-graph
# or, isolated:
pipx install central-code-knowledge-graph
Or from a checkout for development:
pip install -e .
# Or with everything (server stack + dev tools):
pip install -e '.[dev]'
Then point the CLI at your server and sign in with the bootstrap token:
export CKG_SERVER=http://localhost:8080
ckg login --token "$(grep ^CKG_BOOTSTRAP_TOKEN .env | cut -d= -f2)"
# Mint a real token, then re-login with it:
ckg token create my-laptop --scope repo:read --scope repo:write
ckg login --token ckg_xxxxxxxxxxxxxxxxxxxx
5. Ingest your first repo
ckg repo register my-repo file:///Users/you/code/my-repo --branch main
ckg repo ingest my-repo
ckg repo runs my-repo # watch progress
ckg graph stats
ckg search keyword "ingest pipeline"
ckg search semantic "where do we parse Tree-sitter trees?"
ckg graph callers my-repo my.module.foo --depth 2
ckg graph blast my-repo src/foo/bar.py # what breaks if bar.py changes
ckg graph downstream my-repo src/foo/bar.py # what bar.py depends on
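What `--depth 2` means for a callers query can be sketched as a depth-limited breadth-first walk over reversed CALLS edges. The example keeps the graph in memory and is not the Cypher the server runs; the edge data and the `callers_of` helper are hypothetical.

```python
from collections import deque

# Hypothetical CALLS edges: caller -> callees.
CALLS = {
    "cli.main": {"ingest.run"},
    "api.ingest_route": {"ingest.run"},
    "ingest.run": {"parsers.parse", "graph.write"},
    "parsers.parse": {"graph.write"},
}


def callers_of(calls, target, depth):
    """Map each transitive caller of `target` to its hop distance (<= depth)."""
    # Reverse the edges once so we can walk from callee to caller.
    reverse = {}
    for caller, callees in calls.items():
        for callee in callees:
            reverse.setdefault(callee, set()).add(caller)

    seen = {}
    queue = deque([(target, 0)])
    while queue:
        node, dist = queue.popleft()
        if dist == depth:
            continue  # do not expand beyond the requested depth
        for caller in reverse.get(node, ()):
            if caller not in seen and caller != target:
                seen[caller] = dist + 1
                queue.append((caller, dist + 1))
    return seen
```

Blast radius follows the same upstream walk but reports the files that contain the callers; downstream dependencies walk the edges in the forward direction instead.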
5b. Or pull an entire org / group / workspace at once
Paste a single URL — GitHub org/user, GitLab group/user, Bitbucket workspace, or a JSON/YAML manifest — and ckg discovers every accessible repo, registers them, and queues a full ingest for each.
# Public org, anonymous
ckg source add https://github.com/orgs/anthropics
# Private org with a Personal Access Token (example: read from env)
export CKG_SOURCE_TOKEN="$GH_PAT"
ckg source add https://github.com/orgs/acme --include-forks
# GitLab group (incl. subgroups)
ckg source add https://gitlab.com/groups/gitlab-org
# Bitbucket workspace (token format: "username:app-password")
ckg source add https://bitbucket.org/atlassian --token "$BB_USER:$BB_APP_PASSWORD"
# Manifest URL (JSON or YAML list)
ckg source add https://example.com/all-repos.yaml
ckg source list # see what you've added
ckg source repos 1 # repos discovered for source 1
ckg source sync 1 # re-discover; queues ingests for newly-added repos
ckg source delete 1 --yes # CASCADE — drops every repo + graph data this source created
PATs are encrypted at rest with Fernet (key in CKG_SECRET_KEY). They
never appear in repos.url — the worker injects them into the clone
URL at fetch time.
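Injecting a credential only at fetch time might look like the following sketch, which builds the authenticated URL in memory and strips it again before anything is logged or stored. The function names are hypothetical, the worker's real code is not shown in this README, and the username to pair with a token varies by provider.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def _host(parts) -> str:
    """Host (and port, if any) without any userinfo."""
    host = parts.hostname or ""
    return f"{host}:{parts.port}" if parts.port else host


def clone_url_with_token(url: str, username: str, token: str) -> str:
    """Return an HTTPS clone URL carrying credentials in its userinfo part."""
    parts = urlsplit(url)
    if parts.scheme != "https":
        raise ValueError("refusing to attach a token to a non-HTTPS URL")
    # Percent-encode both halves so characters like "/" or "@" stay intact.
    userinfo = f"{quote(username, safe='')}:{quote(token, safe='')}"
    netloc = f"{userinfo}@{_host(parts)}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


def redact(url: str) -> str:
    """Strip credentials again, e.g. before logging or persisting the URL."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, _host(parts), parts.path, parts.query, parts.fragment))
```

Keeping the stored URL clean and deriving the authenticated one per fetch means a database dump of the repos table exposes no credentials.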
5c. Keep the graph fresh — polling + webhooks
Two ways to keep ingested repos up-to-date without manual triggers. Polling uses a Celery Beat scheduler (one extra Compose service); webhooks are push-driven by GitHub / GitLab / Bitbucket.
# Polling
ckg source schedule 1 30m # re-discover source 1 every 30 minutes
ckg repo poll my-repo 5m # incremental ingest of my-repo every 5 minutes
# Webhooks (returns the secret + receiver URL — paste both into the provider)
ckg source webhook 1 --enable
Provider setup:
| Provider | Where | Field |
|---|---|---|
| GitHub | repo / org Settings → Webhooks | Payload URL = <your-server>/v1/webhooks/<source_id>; Content type: application/json; Secret = the printed value; tick just the push event |
| GitLab | project Settings → Webhooks | URL = same as above; Secret token = the printed value; tick Push events |
| Bitbucket | workspace Webhooks → Add | URL = <your-server>/v1/webhooks/<source_id>?secret=<paste>; trigger on Repository push |
GitHub uses HMAC-SHA256 of the body, GitLab a shared-token header,
Bitbucket Cloud the URL-embedded secret. The same /v1/webhooks/<id>
endpoint detects the provider from headers automatically.
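The three verification schemes can be sketched in a few lines of standard-library Python. The header names are the ones the providers send (`X-GitHub-Event` and `X-Hub-Signature-256`, `X-Gitlab-Event` and `X-Gitlab-Token`, Bitbucket's `X-Event-Key`), but the functions themselves are hypothetical and only approximate what the receiver endpoint does.

```python
import hashlib
import hmac


def detect_provider(headers: dict) -> str:
    """Guess the sender from the event header each provider sets."""
    names = {k.lower() for k in headers}
    if "x-github-event" in names:
        return "github"
    if "x-gitlab-event" in names:
        return "gitlab"
    if "x-event-key" in names:  # Bitbucket Cloud
        return "bitbucket"
    return "unknown"


def _same(a: str, b: str) -> bool:
    """Constant-time string comparison."""
    return hmac.compare_digest(a.encode(), b.encode())


def verify_webhook(headers: dict, body: bytes, query_secret: str, secret: str) -> bool:
    """Check the shared secret the way each provider presents it."""
    lower = {k.lower(): v for k, v in headers.items()}
    provider = detect_provider(headers)
    if provider == "github":
        # HMAC-SHA256 over the raw request body, hex-encoded.
        digest = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
        return _same("sha256=" + digest, lower.get("x-hub-signature-256", ""))
    if provider == "gitlab":
        return _same(secret, lower.get("x-gitlab-token", ""))
    if provider == "bitbucket":
        return _same(secret, query_secret)  # secret travels in the URL
    return False
```

The GitHub check must run over the raw bytes of the request, before any JSON parsing, or the signature will not match.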
6. Browse it in the UI
Open http://localhost:3000, paste an API token, and explore:
- Dashboard — node/edge/repo/file counts; repo list
- Repos — register repos, queue incremental or full ingests, watch run status
- Sources — paste a GitHub org / GitLab group / Bitbucket workspace / manifest URL and bulk-add every repo it exposes
- Search — keyword (Lucene FTS) or semantic (vector) across all (or one) repos
- Graph — force-directed call graph for any function, callers + callees up to depth 4
The UI is a static Next.js bundle served from the web container; the
browser hits the API directly using the bearer token kept in
localStorage.
7. Hook up your editor
| Editor | Guide |
|---|---|
| Cursor | integrations/cursor/README.md |
| VS Code (Copilot Chat / Cline / Roo Code) | integrations/vscode/README.md |
| Claude Code | integrations/claude-code/README.md |
What the graph looks like
(Repo)-[:CONTAINS]->(File)-[:DEFINES]->(Class)-[:HAS_METHOD]->(Function)
                          -[:DEFINES]->(Function)-[:CALLS]->(Function)
                          -[:IMPORTS]->(Module|File)
Function nodes carry an embedding vector property indexed for cosine
similarity. Names + docs feed Lucene full-text indexes. So one Neo4j store
answers all three styles of query (structural / keyword / semantic).
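Cosine similarity over function embeddings can be illustrated with a brute-force ranking. This is a toy sketch: the three-dimensional vectors and the `top_k` helper are made up, and a Neo4j vector index answers the same question with an approximate nearest-neighbour search rather than a full scan.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, functions, k=2):
    """Rank (name, embedding) pairs by similarity to the query vector."""
    scored = [(cosine(query_vec, emb), name) for name, emb in functions]
    return [name for _, name in sorted(scored, reverse=True)[:k]]
```

A semantic search embeds the query text with the same model used at ingest time, then ranks function nodes this way.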
API surface (short)
Full reference: docs/api.md.
| Verb | Path | Purpose |
|---|---|---|
| GET | /healthz | Liveness |
| GET | /readyz | Readiness (per-store) |
| POST | /v1/tokens | Mint a token (admin) |
| GET | /v1/tokens | List tokens (admin) |
| DELETE | /v1/tokens/{id} | Revoke (admin) |
| POST | /v1/repos | Register a repo |
| GET | /v1/repos | List repos |
| POST | /v1/repos/{id}/ingest | Queue ingest |
| GET | /v1/repos/{id}/runs | Ingest history |
| GET | /v1/graph/stats | Graph counts |
| GET | /v1/graph/callers_of | Transitive callers |
| GET | /v1/graph/callees_of | Transitive callees |
| GET | /v1/graph/imports_of | Imports for a file |
| GET | /v1/graph/blast_radius | Files affected if this file changes (upstream callers) |
| GET | /v1/graph/downstream_dependencies | Files this file depends on (outgoing callees) |
| GET | /v1/graph/file | Symbols in a file |
| GET | /v1/search/keyword | Lucene FTS |
| GET | /v1/search/semantic | Vector cosine |
| POST | /v1/mcp | MCP JSON-RPC for IDEs |
| POST | /v1/graphql | GraphQL endpoint (open in browser for GraphiQL UI) |
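A REST call from a script could be assembled as in the sketch below. The path comes from the table above and the bearer-token header matches what the web UI sends, but the query parameter names (`repo`, `symbol`, `depth`) are guesses made for illustration; docs/api.md or the OpenAPI page at /docs has the real ones.

```python
import urllib.parse
import urllib.request


def build_get(base_url: str, token: str, path: str, **params) -> urllib.request.Request:
    """Build (but do not send) an authenticated GET request."""
    query = urllib.parse.urlencode(params)
    url = f"{base_url}{path}" + (f"?{query}" if query else "")
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


# Hypothetical parameter names; check the API reference before relying on them.
request = build_get(
    "http://localhost:8080", "ckg_xxxxxxxxxxxxxxxxxxxx",
    "/v1/graph/callers_of", repo="my-repo", symbol="my.module.foo", depth=2,
)
```

Sending it with `urllib.request.urlopen(request)` requires a running server and a valid token.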
Roadmap
- Phase 1 — Foundation, auth, Python/JS/TS ingest, REST + MCP, CLI
- Phase 2 — Incremental updates (per-file sha diff), GraphQL endpoint, Rust/Go/Java/Ruby parsers
- Phase 3 — C/C++ parsers; opt-in LSP precision pass (pyright today; rust-analyzer / gopls / ts-server / jdtls planned)
- Phase 4 — Next.js web UI: token login, dashboard, repo management, search (keyword + semantic), force-directed function call-graph viz
- Phase 5 — Multi-tenant orgs/users, k8s/Helm, OpenTelemetry, Neo4j Causal Cluster
Development
Backend:
pip install -e '.[dev]'
pytest -q
ruff check ckg
Web UI:
cd web
npm install --legacy-peer-deps
NEXT_PUBLIC_CKG_API=http://localhost:8080 npm run dev
# open http://localhost:3000
Project layout:
ckg/
├── api/ # FastAPI app + routes (REST + GraphQL + MCP)
├── auth.py # API tokens, principal, scopes
├── cli/ # `ckg` Typer CLI
├── config.py # Pydantic settings
├── db/ # neo4j / postgres / redis clients + schema
├── lsp/ # Opt-in LSP precision pass (Phase 3)
├── parsers/ # tree-sitter parsers, one per language
├── services/ # ingest, embeddings, lsp_resolve
└── worker/ # Celery app + tasks
web/ # Next.js 15 + Tailwind + react-force-graph-2d (Phase 4)
docker/ # API + worker + web Dockerfiles
docs/ # ADRs, deployment, API
integrations/ # cursor / vscode / claude-code MCP snippets
tests/ # pytest
Security
- API tokens are 32-byte URL-safe random strings prefixed `ckg_`, never stored in plaintext; only argon2id hashes are persisted.
- The bootstrap token (`.env`) is your only way in on day 0; rotate it immediately after minting a scoped token.
- All non-health endpoints require a token; CORS is restricted to `CKG_CORS_ORIGINS`.
- `.env` is git-ignored. Do not commit it. Do not paste tokens into chats.
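The token lifecycle in the first bullet can be sketched with the standard library alone. One substitution to note: the project persists argon2id hashes, whereas this sketch uses `hashlib.scrypt` as a stand-in so that it runs without third-party packages. The function names and the `salt$digest` storage format are hypothetical.

```python
import hashlib
import hmac
import secrets

# scrypt cost parameters for the sketch (stand-in for argon2id settings).
_SCRYPT = {"n": 2**14, "r": 8, "p": 1}


def mint_token() -> str:
    """32 random bytes, URL-safe encoded, with the ckg_ prefix."""
    return "ckg_" + secrets.token_urlsafe(32)


def hash_token(token: str) -> str:
    """Salted hash to persist instead of the token, stored as salt$digest."""
    salt = secrets.token_bytes(16)
    digest = hashlib.scrypt(token.encode(), salt=salt, **_SCRYPT)
    return f"{salt.hex()}${digest.hex()}"


def verify_token(token: str, stored: str) -> bool:
    """Recompute the hash with the stored salt and compare in constant time."""
    salt_hex, digest_hex = stored.split("$", 1)
    digest = hashlib.scrypt(token.encode(), salt=bytes.fromhex(salt_hex), **_SCRYPT)
    return hmac.compare_digest(digest.hex(), digest_hex)
```

The plaintext token is shown to the user once at creation; only the hash is kept, so a leaked database does not yield usable tokens. With the argon2-cffi package, `argon2.PasswordHasher` offers equivalent `hash` and `verify` methods using argon2id.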
If you find a security issue, please open a private vulnerability report on GitHub.
Pre-commit credential audit
A small audit script refuses to commit credentials, IDE-assistant configs
(.claude/, CLAUDE.md, .mcp.json, .cursor/, .continue/, .aider*,
.windsurf/), or files matching common secret patterns (GitHub PAT,
OpenAI key, AWS access key, Slack token, JWT, PEM private key):
./scripts/audit-secrets.sh
# install as a git pre-commit hook (recommended):
ln -sf ../../scripts/audit-secrets.sh .git/hooks/pre-commit
License
MIT — see LICENSE.