Central, multi-repo code knowledge graph for AI agents — Neo4j + Tree-sitter + MCP.

central-code-knowledge-graph

Stop re-reading. Start querying.

AI coding tools re-read your entire codebase on every task. ckg fixes that. One server indexes every repo in your org with Tree-sitter across 26 languages, stores the structural map as a Neo4j property graph, keeps it fresh via incremental ingest + webhooks, and serves precise context to your AI assistant via MCP so it reads only what matters.

One server that:

  • ingests many repositories (not just one) and keeps them incrementally fresh
  • stores them as a Neo4j property graph (File, Class, Function, Module + CONTAINS, DEFINES, HAS_METHOD, CALLS, IMPORTS)
  • exposes REST, GraphQL, MCP/JSON-RPC, and a ckg CLI
  • supports structural queries (callers, callees, imports, blast radius, downstream dependencies), full-text search, and semantic vector search
  • generates an architecture map with coupling warnings (cyclic deps, god modules, SDP violations) every ingest
  • secures every endpoint with scoped API tokens (argon2id-hashed)
  • runs as a single docker compose up
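The coupling warnings in the list above (cyclic deps, SDP violations) can be pictured with a small in-memory sketch. It assumes a plain list of (importer, imported) pairs and uses the textbook instability metric I = Ce / (Ca + Ce); the function names are hypothetical and ckg's own thresholds and Cypher queries may differ.

```python
from collections import defaultdict


def instability(edges):
    """Instability I = Ce / (Ca + Ce) per module.

    `edges` are (importer, imported) pairs. Ce counts outgoing
    dependencies, Ca counts incoming ones.
    """
    out_deg = defaultdict(int)
    in_deg = defaultdict(int)
    for src, dst in set(edges):
        out_deg[src] += 1
        in_deg[dst] += 1
    modules = set(out_deg) | set(in_deg)
    return {m: out_deg[m] / (out_deg[m] + in_deg[m]) for m in modules}


def sdp_violations(edges):
    """Edges where a module depends on one that is less stable than itself."""
    inst = instability(edges)
    return sorted((s, d) for s, d in set(edges) if inst[d] > inst[s])


def find_cycle(edges):
    """Return one dependency cycle as a list of modules, or None."""
    graph = defaultdict(list)
    for src, dst in edges:
        graph[src].append(dst)
    WHITE, GREY, BLACK = 0, 1, 2
    colour = defaultdict(int)  # every node starts WHITE
    stack = []

    def visit(node):
        colour[node] = GREY
        stack.append(node)
        for nxt in graph[node]:
            if colour[nxt] == GREY:  # back edge closes a cycle
                return stack[stack.index(nxt):] + [nxt]
            if colour[nxt] == WHITE:
                found = visit(nxt)
                if found:
                    return found
        stack.pop()
        colour[node] = BLACK
        return None

    for start in list(graph):
        if colour[start] == WHITE:
            found = visit(start)
            if found:
                return found
    return None


edges = [("api", "services"), ("services", "db"), ("db", "api"), ("cli", "api")]
print(find_cycle(edges))      # one import cycle, if any
print(sdp_violations(edges))  # dependencies pointing at less stable modules
```

The recursive search is fine for an illustration; a graph with very deep import chains would want an iterative version.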

Why

Need | How this server delivers
Rock-solid, won't fall over | Stateless API + workers; Neo4j/Postgres/Redis run with healthchecks + restart: unless-stopped; horizontal scale via --scale worker=N
Fast relationship search for AI agents | Native graph DB (Cypher) + Lucene FTS + vector index, all in Neo4j
Multi-language | Tree-sitter parsers (23): Python, JS/TS (incl. JSX/TSX → React, Angular), Rust, Go, Java, Ruby, C, C++, C#, Kotlin, Scala, Swift, PHP, Solidity, Dart, R, Perl, Lua, Zig, PowerShell, Julia, Nix. Extraction wrappers (3): Vue, Svelte (delegates <script> to JS/TS), Jupyter/Databricks .ipynb (concatenates code cells, dispatches by kernel). Pluggable: one file under ckg/parsers/ adds another language
Precise cross-file edges | Opt-in LSP pass (CKG_LSP_ENABLED=true) upgrades CALLS edges with language-server-resolved targets. Pyright today; rust-analyzer / gopls / ts-server / jdtls planned. Graph stays functional with no LSP installed
Fast updates | Incremental ingest (--incremental): sha-diffs files against the graph, only re-parses what changed. Full reparse stays available as --full
Context for AI tools | Built-in MCP HTTP server → Cursor, VS Code, Claude Code drop in
Two query surfaces | REST (/v1/*) for simple calls + GraphQL (/v1/graphql) for composed traversals; both use the same API token
CLI for automation | ckg Typer CLI: register, ingest, query, search
Spec-driven | Auto-generated OpenAPI at /docs; GraphiQL UI at /v1/graphql; ADRs under docs/adr/
Whole-codebase index | One Neo4j graph spans all registered repos
Neo4j-backed | Functions, classes, files, imports, calls all stored as labeled nodes + typed relationships
Secure | API tokens with scopes (admin, repo:write, repo:read); hashed at rest
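The "Fast updates" row describes a sha-diff between the working tree and what the graph already holds. A minimal sketch of that planning step follows, assuming SHA-256 over file contents and a simple path-to-hash mapping read back from the graph; the real hash choice, storage layout, and function names in ckg are not documented here, so treat all of them as illustrative.

```python
import hashlib


def file_sha(content: bytes) -> str:
    """Content hash used to decide whether a file needs re-parsing."""
    return hashlib.sha256(content).hexdigest()


def plan_incremental(stored, working_tree):
    """Compare stored hashes with the current tree.

    stored:       {path: sha} as last written to the graph (assumed shape)
    working_tree: {path: file bytes} for the freshly fetched checkout
    Returns (paths to parse, paths to delete from the graph).
    """
    current = {path: file_sha(data) for path, data in working_tree.items()}
    to_parse = sorted(p for p, sha in current.items() if stored.get(p) != sha)
    to_delete = sorted(p for p in stored if p not in current)
    return to_parse, to_delete


stored = {
    "a.py": file_sha(b"x"),
    "b.py": file_sha(b"y"),
    "gone.py": file_sha(b"z"),
}
tree = {"a.py": b"x", "b.py": b"changed", "new.py": b"n"}
print(plan_incremental(stored, tree))
```

Unchanged files never reach the parser, which is where the time saving comes from; a full reparse simply ignores the stored hashes.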

Supported languages

Tree-sitter parsers (23): Python · JavaScript (incl. JSX → React) · TypeScript (incl. TSX → Angular) · Rust · Go · Java · Ruby · C · C++ · C# · Kotlin · Scala · Swift · PHP · Solidity · Dart · R · Perl · Lua · Zig · PowerShell · Julia · Nix

Extraction wrappers (3): Vue & Svelte SFCs (delegate <script> to JS/TS) · Jupyter / Databricks .ipynb (concatenate code cells, dispatch by kernel language)
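As a rough illustration of the notebook wrapper, the sketch below concatenates the code cells of an nbformat v4 document and picks a language from the kernel metadata. The function name and the fallback to "python" are assumptions made for this example, not ckg's actual behaviour.

```python
import json


def extract_notebook_code(raw: str):
    """Return (language, concatenated code cells) for a .ipynb JSON string."""
    nb = json.loads(raw)
    meta = nb.get("metadata", {})
    # Prefer the kernelspec language, then language_info, then a default.
    language = (
        meta.get("kernelspec", {}).get("language")
        or meta.get("language_info", {}).get("name")
        or "python"  # assumed default for this sketch
    )
    chunks = []
    for cell in nb.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # skip markdown and raw cells
        source = cell.get("source", "")
        if isinstance(source, list):  # nbformat allows a list of lines
            source = "".join(source)
        chunks.append(source)
    return language, "\n\n".join(chunks)


notebook = json.dumps({
    "metadata": {"kernelspec": {"language": "python"}},
    "cells": [
        {"cell_type": "markdown", "source": ["# Title"]},
        {"cell_type": "code", "source": ["import os\n", "print(os.name)"]},
        {"cell_type": "code", "source": "x = 1"},
    ],
})
language, code = extract_notebook_code(notebook)
print(language)
print(code)
```

The returned language would then select which Tree-sitter parser receives the concatenated source.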

Pluggable — adding another language is one file under ckg/parsers/ and one line in the registry.

Architecture

                      ┌──────────────┐
   AI agents ───MCP──▶│              │
   CLI (ckg) ──REST──▶│   FastAPI    │──▶ Auth (API tokens, scopes)
   Web UI ────GQL───▶ │              │──▶ Audit log
                      └──────┬───────┘
                             │
            ┌────────────────┼─────────────────────────────┐
            ▼                ▼                             ▼
     ┌────────────┐   ┌─────────────┐              ┌───────────────┐
     │ Neo4j 5    │   │ Postgres    │              │ Redis         │
     │ graph +    │   │ repos +     │              │ cache + queue │
     │ vector +   │   │ tokens +    │              └───────┬───────┘
     │ FTS        │   │ runs +      │                      │
     └────────────┘   │ audit       │              ┌───────▼───────┐
                      └─────────────┘              │ Celery workers│
                                                   │  - clone      │
                                                   │  - parse      │
                                                   │  - embed      │
                                                   │  - write graph│
                                                   └───────┬───────┘
                                                           │
                                                   ┌───────▼───────┐
                                                   │ Tree-sitter   │
                                                   │ parsers       │
                                                   │ 23 languages  │
                                                   │ + 3 wrappers  │
                                                   └───────────────┘

Full design rationale: docs/adr/0001-architecture.md.

Quickstart

1. Prerequisites

Requirement | Required | Notes
Docker | Docker Desktop (macOS / Windows) or Docker Engine + Compose v2 (Linux) | Must be running before step 3. Confirm with docker info.
RAM | 8 GB free | Neo4j wants 2 GB, sentence-transformers ~500 MB on first warmup
Disk | ~3 GB free | Base images (Neo4j, Postgres, Redis, Python, Node) total ~2 GB, plus your repo clones under the repo_data volume.
Network | Outbound HTTPS | First boot pulls images from Docker Hub and packages from npm and PyPI
Python | 3.11+ (host) | Only if you want to install the CLI on your laptop. Not needed otherwise: make up runs everything in containers.

2. Clone and configure

git clone https://github.com/ajankurjain/central-code-knowledge-graph.git
cd central-code-knowledge-graph
cp .env.example .env

Replace every change-me-* value in .env with a strong random value. The snippet below generates a full, ready-to-go .env for local dev in one shot:

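python3 - <<'PY'
import base64
import os
import secrets
from pathlib import Path

# Map each placeholder shipped in .env.example to a fresh random value.
subs = {
    "change-me-please-bootstrap-token": secrets.token_urlsafe(32),
    # Fernet keys are 32 random bytes, URL-safe base64-encoded.
    "change-me-please-fernet-key":     base64.urlsafe_b64encode(os.urandom(32)).decode(),
    "change-me-neo4j-password":        secrets.token_urlsafe(24),
    "change-me-postgres-password":     secrets.token_urlsafe(24),
}

env_path = Path(".env")
env = env_path.read_text()
for placeholder, value in subs.items():
    env = env.replace(placeholder, value)
env_path.write_text(env)
PY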
chmod 600 .env

⚠️ Keep .env out of git. It's already in .gitignore, and the pre-commit hook (scripts/audit-secrets.sh) refuses any commit that contains it.
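Before starting the stack it can help to confirm that no placeholder survived. The helper below is a small sketch, not part of ckg: it assumes simple KEY=value lines and reports any key whose value still contains change-me.

```python
def unresolved_placeholders(env_text: str):
    """Return the keys in a .env file whose values still contain 'change-me'."""
    leftovers = []
    for line in env_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and malformed lines
        key, _, value = line.partition("=")
        if "change-me" in value:
            leftovers.append(key.strip())
    return leftovers


sample = (
    "# local dev\n"
    "CKG_BOOTSTRAP_TOKEN=change-me-please-bootstrap-token\n"
    "NEO4J_PASSWORD=abc123\n"
    "\n"
    "CKG_API_PORT=8080\n"
)
print(unresolved_placeholders(sample))
```

Against a real file, pass the result of pathlib.Path(".env").read_text() and expect an empty list.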

3. Start the stack

Make sure Docker Desktop is running first (docker info should succeed), then:

make up
# or: docker compose up -d --build

First boot takes 5–10 minutes: it pulls ~2 GB of base images and builds the api / worker / web / beat images locally. Subsequent make up runs take ~10 seconds.

Confirm everything came up healthy:

docker compose ps
# all containers should show "running" and (healthy):
# ckg-api-1, ckg-beat-1, ckg-neo4j-1, ckg-postgres-1, ckg-redis-1, ckg-web-1, ckg-worker-1

Health check from outside:

curl http://localhost:8080/readyz
# {"ready":true,"checks":{"neo4j":true,"postgres":true,"redis":true},"version":"0.1.3"}
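For scripted setups, the readiness payload shown above can be polled until every store reports true. This is a sketch under assumptions: the helper names are made up, the retry counts are arbitrary, and it treats any connection error or non-2xx response as "not ready yet".

```python
import json
import time
import urllib.request


def failing_checks(body: str):
    """Names of the per-store checks that are not true in a /readyz body."""
    payload = json.loads(body)
    return sorted(name for name, ok in payload.get("checks", {}).items() if not ok)


def wait_until_ready(url="http://localhost:8080/readyz", attempts=30, delay=2.0):
    """Poll /readyz until it reports ready, or give up after `attempts` tries."""
    for _ in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                body = resp.read().decode()
            if json.loads(body).get("ready") and not failing_checks(body):
                return True
        except (OSError, ValueError):
            pass  # server not up yet, error status, or non-JSON body
        time.sleep(delay)
    return False


example = '{"ready":false,"checks":{"neo4j":false,"postgres":true,"redis":false}}'
print(failing_checks(example))
```

Calling wait_until_ready() right after make up gives a single pass/fail signal for CI scripts.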

URLs:

Service | URL
Web UI | http://localhost:3000
API (Swagger UI) | http://localhost:8080/docs
GraphQL (GraphiQL) | http://localhost:8080/v1/graphql
Neo4j Browser | http://localhost:7474 (login neo4j / value of NEO4J_PASSWORD from .env)
Postgres | localhost:5433 (mapped off default port to avoid clashes)
Redis | localhost:6379

4. Sign in

Grab the bootstrap token from .env:

grep ^CKG_BOOTSTRAP_TOKEN .env | cut -d= -f2-

Then either:

a) Use the web UI — open http://localhost:3000/login, paste the token, click Sign in. The Dashboard lights up.

b) Use the ckg CLI:

# From PyPI (light install — CLI only, talks to the Docker server):
pip install central-code-knowledge-graph

# Or pipx for an isolated install:
pipx install central-code-knowledge-graph

# Or editable install from a checkout for development:
pip install -e '.[dev]'

# Then:
export CKG_SERVER=http://localhost:8080
ckg login --token "$(grep ^CKG_BOOTSTRAP_TOKEN .env | cut -d= -f2-)"
ckg status        # should print graph counts

The bootstrap token has admin scope and is meant for one-time setup — mint a scoped token and use that going forward:

ckg token create my-laptop --scope repo:read --scope repo:write
# copy the printed `ckg_…` token, then:
ckg login --token ckg_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

5. Ingest your first repo

# Pick a local repo to index. `file:///abs/path` clones-in-place, no network.
ckg repo register my-repo file:///Users/you/code/my-repo --branch main

# Run a full ingest.
ckg repo ingest    my-repo --full
ckg repo runs      my-repo                 # watch progress; a small repo finishes in seconds

# Verify the graph populated.
ckg graph stats
# → {"nodes": 3288, "edges": 6676, "repos": 1, "files": 80}  (example)

# Search.
ckg search keyword  "ingest pipeline"
ckg search semantic "where do we parse Tree-sitter trees?"

# Structural queries.
ckg graph callers    my-repo my.module.foo --depth 2
ckg graph blast      my-repo src/foo/bar.py        # files that break if bar.py changes
ckg graph downstream my-repo src/foo/bar.py        # files bar.py depends on

You can do the same from the web UI under Repos → Register: fill the form, then click ingest Δ or full reparse. Watch progress on the repo detail page (auto-refreshes while a run is in flight).
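Conceptually, a callers query with --depth behaves like a bounded breadth-first search over reversed CALLS edges. The sketch below models that on an in-memory edge list; the server runs the equivalent traversal in Neo4j, and the function name and return shape here are illustrative only.

```python
from collections import deque


def callers_of(calls, target, depth):
    """Transitive callers of `target` up to `depth` hops.

    calls: iterable of (caller, callee) qualified names
    Returns {caller: distance in hops}.
    """
    reverse = {}
    for caller, callee in calls:
        reverse.setdefault(callee, set()).add(caller)

    seen = {}
    queue = deque([(target, 0)])
    while queue:
        node, dist = queue.popleft()
        if dist == depth:
            continue  # do not expand past the requested depth
        for caller in sorted(reverse.get(node, ())):
            if caller not in seen and caller != target:
                seen[caller] = dist + 1
                queue.append((caller, dist + 1))
    return seen


calls = [
    ("a.main", "b.run"),
    ("b.run", "my.module.foo"),
    ("c.task", "my.module.foo"),
    ("z.top", "a.main"),
]
print(callers_of(calls, "my.module.foo", depth=2))
```

Blast radius follows the same idea at file granularity: walk the reversed edges outward from the changed file.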

5b. Or pull an entire org / group / workspace at once

Paste a single URL — GitHub org/user, GitLab group/user, Bitbucket workspace, or a JSON/YAML manifest — and ckg discovers every accessible repo, registers them, and queues a full ingest for each.

# Public org, anonymous
ckg source add https://github.com/orgs/anthropics

# Private org with a Personal Access Token (example: read from env)
export CKG_SOURCE_TOKEN="$GH_PAT"
ckg source add https://github.com/orgs/acme --include-forks

# GitLab group (incl. subgroups)
ckg source add https://gitlab.com/groups/gitlab-org

# Bitbucket workspace (token format: "username:app-password")
ckg source add https://bitbucket.org/atlassian --token "$BB_USER:$BB_APP_PASSWORD"

# Manifest URL (JSON or YAML list)
ckg source add https://example.com/all-repos.yaml

ckg source list          # see what you've added
ckg source repos 1       # repos discovered for source 1
ckg source sync 1        # re-discover; queues ingests for newly-added repos
ckg source delete 1 --yes  # CASCADE — drops every repo + graph data this source created

PATs are encrypted at rest with Fernet (key in CKG_SECRET_KEY). They never appear in repos.url — the worker injects them into the clone URL at fetch time.
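The idea of injecting a token only at fetch time, and keeping it out of anything persisted, can be sketched with the standard library. Both helpers below are hypothetical: the default oauth2 username mirrors the GitLab-style URL mentioned in the release notes, other providers may expect a different username, and the Fernet decryption step is omitted because it needs the third-party cryptography package.

```python
import re
from urllib.parse import quote, urlsplit, urlunsplit


def inject_token(clone_url: str, token: str, user: str = "oauth2") -> str:
    """Build a credentialed clone URL without storing it anywhere."""
    parts = urlsplit(clone_url)
    if parts.scheme not in ("http", "https"):
        raise ValueError("tokens are only injected into http(s) URLs")
    host = parts.hostname or ""
    if parts.port:
        host = f"{host}:{parts.port}"
    # Percent-encode so unusual characters cannot break the URL structure.
    netloc = f"{quote(user, safe='')}:{quote(token, safe='')}@{host}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


_USERINFO = re.compile(r"(?P<scheme>[a-z][a-z0-9+.-]*://)[^/@\s]+@", re.IGNORECASE)


def scrub(text: str) -> str:
    """Mask user:password@ segments before an error message is persisted."""
    return _USERINFO.sub(r"\g<scheme>***@", text)


url = inject_token("https://gitlab.example.com/group/app.git", "glpat-abc")
print(url)
print(scrub(f"fatal: unable to access '{url}/'"))
```

Scrubbing at the point where errors are saved means a failed clone cannot leak the token even though the URL briefly contained it.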

5c. Keep the graph fresh — polling + webhooks

Two ways to keep ingested repos up-to-date without manual triggers. Polling uses a Celery Beat scheduler (one extra Compose service); webhooks are push-driven by GitHub / GitLab / Bitbucket.

# Polling
ckg source schedule 1 30m       # re-discover source 1 every 30 minutes
ckg repo   poll     my-repo 5m  # incremental ingest of my-repo every 5 minutes

# Webhooks (returns the secret + receiver URL — paste both into the provider)
ckg source webhook  1 --enable

Provider setup:

Provider | Where | Field
GitHub | repo / org Settings → Webhooks | Payload URL = <your-server>/v1/webhooks/<source_id>; Content type: application/json; Secret = the printed value; tick just the push event
GitLab | project Settings → Webhooks | URL = same as above; Secret token = the printed value; tick Push events
Bitbucket | workspace Webhooks → Add | URL = <your-server>/v1/webhooks/<source_id>?secret=<paste>; trigger on Repository push

GitHub uses HMAC-SHA256 of the body, GitLab a shared-token header, Bitbucket Cloud the URL-embedded secret. The same /v1/webhooks/<id> endpoint detects the provider from headers automatically.
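The three verification styles can be sketched as follows. The header names (X-Hub-Signature-256, X-GitHub-Event, X-Gitlab-Token, X-Gitlab-Event, X-Event-Key) are the ones the providers send; the function names and the detection order are assumptions for illustration and may not match the receiver's real logic.

```python
import hashlib
import hmac


def verify_github(secret: str, body: bytes, signature_header: str) -> bool:
    """GitHub: X-Hub-Signature-256 carries 'sha256=' + HMAC-SHA256(body)."""
    digest = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(("sha256=" + digest).encode(), signature_header.encode())


def verify_shared_secret(secret: str, presented: str) -> bool:
    """GitLab (X-Gitlab-Token header) and Bitbucket (?secret= query value)."""
    return hmac.compare_digest(secret.encode(), presented.encode())


def detect_provider(headers):
    """Guess the sender from request headers (case-insensitive)."""
    names = {name.lower() for name in headers}
    if "x-hub-signature-256" in names or "x-github-event" in names:
        return "github"
    if "x-gitlab-token" in names or "x-gitlab-event" in names:
        return "gitlab"
    if "x-event-key" in names:
        return "bitbucket"
    return None


body = b'{"ref":"refs/heads/main"}'
signature = "sha256=" + hmac.new(b"s3cret", body, hashlib.sha256).hexdigest()
print(detect_provider({"X-GitHub-Event": "push"}), verify_github("s3cret", body, signature))
```

hmac.compare_digest is used for every comparison so that timing differences do not reveal how much of a secret matched.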

6. Browse it in the UI

Open http://localhost:3000, paste an API token, and explore:

  • Dashboard — node/edge/repo/file counts; repo list
  • Repos — register repos, queue incremental or full ingests, watch run status
  • Sources — paste a GitHub org / GitLab group / Bitbucket workspace / manifest URL and bulk-add every repo it exposes
  • Search — keyword (Lucene FTS) or semantic (vector) across all (or one) repos
  • Graph — force-directed call graph for any function, callers + callees up to depth 4

The UI is a static Next.js bundle served from the web container; the browser hits the API directly using the bearer token kept in localStorage.

7. Hook up your editor

Editor | Guide
Cursor | integrations/cursor/README.md
VS Code (Copilot Chat / Cline / Roo Code) | integrations/vscode/README.md
Claude Code | integrations/claude-code/README.md

Day-2 operations

make logs                 # tail every service
make restart              # restart api + worker only
docker compose stop       # park everything; data volumes persist
make up                   # bring it back
make clean                # WARNING: removes volumes — wipes graph + Postgres
make psql                 # psql shell inside the postgres container
make neo4j-shell          # cypher-shell inside the neo4j container

Troubleshooting

Things that bit me during local setup — keep this open the first time you run.

Symptom | Diagnosis / fix
docker: command not found | Docker Desktop isn't on PATH. macOS shortcut: export PATH="/Applications/Docker.app/Contents/Resources/bin:$PATH".
docker info fails / "Cannot connect to the Docker daemon" | Docker Desktop is installed but not running. Launch the Docker app and wait ~10s.
make up errors with neo4j password required | You skipped step 2: .env doesn't exist (or still has change-me-* placeholders for the strict-required vars). Re-run the Python snippet in step 2.
ckg-web-1 stays in Created state and never starts | The image was never built. Run docker compose build web && docker compose up -d web.
ckg-neo4j-1 flaps Restarting with Unrecognized setting. No declared setting with name: PASSWORD | Old compose file. Fixed in v0.1.1 by renaming the healthcheck env vars to CKG_HEALTHCHECK_*. Pull main.
API container loops with TypeError: APIRouter.__init__() got an unexpected keyword argument 'graphiql' | strawberry-graphql renamed the arg. Fixed in v0.1.1. Pull main.
Worker / beat crash with exec: "celery": executable file not found in $PATH | Dockerfile didn't install [server] extras. Fixed in v0.1.1. Pull main + docker compose build --no-cache worker beat.
Ingest reports files_skipped for every file, files_parsed: 0 | tree-sitter-language-pack 1.x compatibility issue. Fixed in v0.1.1 by pinning to 0.7-0.9. Pull main + rebuild api/worker.
GitHub README badge stuck on a stale version | GitHub's camo proxy caches images by URL. Bump the URL slightly (e.g. change cacheSeconds=N to a different N) to force a refetch.
Forgot the bootstrap token | grep ^CKG_BOOTSTRAP_TOKEN .env | cut -d= -f2-
Want to wipe the graph and start over | make clean && make up && python … (regenerate .env). Note: this also drops the Postgres data, so all minted tokens go too.
Forgot which port is which | All ports are configurable via .env (CKG_API_PORT, CKG_WEB_PORT). Defaults: 8080 / 3000 / 7474 (Neo4j) / 5433 (Postgres) / 6379 (Redis).
Run integration tests against the live stack | docker compose exec api pytest tests/integration/ -q (after make up).

What the graph looks like

(Repo)-[:CONTAINS]->(File)-[:DEFINES]->(Class)-[:HAS_METHOD]->(Function)
                          -[:DEFINES]->(Function)-[:CALLS]->(Function)
                          -[:IMPORTS]->(Module|File)

Function nodes carry an embedding vector property indexed for cosine similarity. Names + docs feed Lucene full-text indexes. So one Neo4j store answers all three styles of query (structural / keyword / semantic).
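To make the semantic side concrete, here is a small pure-Python sketch of cosine ranking over function embeddings. In the running system the vector index in Neo4j does this work; the tiny two-dimensional vectors and the function names are made up for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, functions, k=3):
    """Names of the k functions whose embeddings are closest to `query`."""
    ranked = sorted(functions.items(), key=lambda kv: cosine(query, kv[1]), reverse=True)
    return [name for name, _ in ranked[:k]]


embeddings = {
    "parse": [0.9, 0.1],
    "embed": [0.0, 1.0],
    "clone": [-1.0, 0.0],
}
print(top_k([1.0, 0.0], embeddings, k=2))
```

A real query first embeds the search text with the same model used at ingest time, then ranks against the stored vectors in the same way.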

API surface (short)

Full reference: docs/api.md.

Verb | Path | Purpose
GET | /healthz | Liveness
GET | /readyz | Readiness (per-store)
POST | /v1/tokens | Mint a token (admin)
GET | /v1/tokens | List tokens (admin)
DELETE | /v1/tokens/{id} | Revoke (admin)
POST | /v1/repos | Register a repo
GET | /v1/repos | List repos
POST | /v1/repos/{id}/ingest | Queue ingest
GET | /v1/repos/{id}/runs | Ingest history
GET | /v1/graph/stats | Graph counts
GET | /v1/graph/callers_of | Transitive callers
GET | /v1/graph/callees_of | Transitive callees
GET | /v1/graph/imports_of | Imports for a file
GET | /v1/graph/blast_radius | Files affected if this file changes (upstream callers)
GET | /v1/graph/downstream_dependencies | Files this file depends on (outgoing callees)
GET | /v1/graph/file | Symbols in a file
GET | /v1/search/keyword | Lucene FTS
GET | /v1/search/semantic | Vector cosine
POST | /v1/mcp | MCP JSON-RPC for IDEs
POST | /v1/graphql | GraphQL endpoint (open in browser for GraphiQL UI)
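A minimal sketch of calling two of these endpoints from Python with only the standard library. It assumes the API token is sent as an Authorization: Bearer header (consistent with the web UI's description) and that the MCP endpoint accepts a plain JSON-RPC 2.0 tools/list request; whether an initialize handshake is needed first is not covered by this document. The helper names are hypothetical, and query parameters for the graph endpoints are left out because docs/api.md is the reference for them.

```python
import json
import urllib.request


def _request(server, token, path, body=None):
    """Build an authenticated request; GET without a body, POST with one."""
    url = f"{server.rstrip('/')}{path}"
    headers = {"Authorization": f"Bearer {token}"}
    data = None
    if body is not None:
        data = json.dumps(body).encode()
        headers["Content-Type"] = "application/json"
    method = "POST" if body is not None else "GET"
    return urllib.request.Request(url, data=data, headers=headers, method=method)


def graph_stats_request(server, token):
    """GET /v1/graph/stats"""
    return _request(server, token, "/v1/graph/stats")


def mcp_tools_list_request(server, token, request_id=1):
    """POST /v1/mcp with a JSON-RPC 2.0 'tools/list' call."""
    payload = {"jsonrpc": "2.0", "id": request_id, "method": "tools/list"}
    return _request(server, token, "/v1/mcp", body=payload)


stats_req = graph_stats_request("http://localhost:8080/", "ckg_test")
mcp_req = mcp_tools_list_request("http://localhost:8080", "ckg_test", request_id=7)
print(stats_req.get_method(), stats_req.full_url)
print(mcp_req.get_method(), mcp_req.full_url)
```

To actually send one, pass it to urllib.request.urlopen(req, timeout=10) against a running server and decode the JSON response.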

Releases

Current: v0.1.3 on PyPI · full notes at Releases.

Version highlights

v0.1.3
  • Self-hosted GitLab support (gitlab_instance kind + bring-your-own-base-URL).
  • Worker now scrubs <scheme>://user:pw@… userinfo and bare GitHub / GitLab PAT shapes before persisting clone errors, so a failed git clone https://oauth2:glpat-…@host/path no longer leaks the token into ingest_runs.error.
  • credentialed_clone_url_for_repo falls back to a host-aware token injector when the source's kind is unknown to the running worker.
  • Per-source live progress bar on /sources driven by a new GET /v1/sources/{id}/progress endpoint (indexed / queued / running / failed counts + last-sync / last-ingest timestamps + 5 s auto-poll while in flight).
  • New ckg.reconcile_stuck_ingests beat task every 60 s: re-publishes orphaned queued rows (DB-vs-broker desync) and reaps zombie running rows so the progress bar never silently freezes.
  • Searchable repo combobox on /repos, /arch, /graph with INDEXED / NOT INDEXED badges + refresh icon.
  • Dashboard and /repos table now paginated (10 / 25 per page) with id / url / language filter.
  • /graph shows top-20 most-connected functions as click-to-fill entry points when no qname is set (backed by GET /v1/graph/entry_points), plus a back-to-functions button.
  • tree-sitter>=0.25.2,<0.26 (csharp grammar v15 ABI).
  • Python parser detects async either as the node type or as a child keyword.
  • Lua parser covers modern (function_declaration, local_function) and legacy node types.

v0.1.2
  • Per-repo PAT for cloning private repos (POST /v1/repos {token} + PUT …/credentials + Credentials panel in the web UI + ckg repo register --token / ckg repo credentials).
  • Idempotent ADD COLUMN IF NOT EXISTS migration so existing installs pick up new columns on restart.
  • Per-row ingest feedback on the /repos page.
  • Runs-table error column now collapsible with full text.
  • Web register form auto-slugifies with live preview.
  • tree-sitter>=0.24,<0.25 (csharp grammar v15).

v0.1.1
  • Live-verified runtime fixes: Dockerfile installs [server] extras so celery is on the worker's PATH.
  • Neo4j healthcheck creds exposed via CKG_HEALTHCHECK_* (was conflicting with Neo4j's NEO4J_*-as-setting parsing).
  • strawberry-graphql graphql_ide= arg compatibility.
  • tree-sitter-language-pack pinned to the 0.x line where the parser objects still expose .parse().

v0.1.0
  • Initial release.

Upgrade:

pip install --upgrade central-code-knowledge-graph
# or, in a Docker checkout:
git pull && make build && make restart

Roadmap

  • Phase 1 — Foundation, auth, Python/JS/TS ingest, REST + MCP, CLI
  • Phase 2 — Incremental updates (per-file sha diff), GraphQL endpoint, Rust/Go/Java/Ruby parsers
  • Phase 3 — C/C++ parsers; opt-in LSP precision pass (pyright today; rust-analyzer / gopls / ts-server / jdtls planned)
  • Phase 4 — Next.js web UI: token login, dashboard, repo management, search (keyword + semantic), force-directed function call-graph viz
  • Phase 5 — Multi-tenant orgs/users, k8s/Helm, OpenTelemetry, Neo4j Causal Cluster

Development

Backend:

pip install -e '.[dev]'
pytest -q
ruff check ckg

Web UI:

cd web
npm install --legacy-peer-deps
NEXT_PUBLIC_CKG_API=http://localhost:8080 npm run dev
# open http://localhost:3000

Project layout:

ckg/
├── api/        # FastAPI app + routes (REST + GraphQL + MCP)
├── auth.py     # API tokens, principal, scopes
├── cli/        # `ckg` Typer CLI
├── config.py   # Pydantic settings
├── db/         # neo4j / postgres / redis clients + schema
├── lsp/        # Opt-in LSP precision pass (Phase 3)
├── parsers/    # tree-sitter parsers, one per language
├── services/   # ingest, embeddings, lsp_resolve
└── worker/     # Celery app + tasks
web/            # Next.js 15 + Tailwind + react-force-graph-2d (Phase 4)
docker/         # API + worker + web Dockerfiles
docs/           # ADRs, deployment, API
integrations/   # cursor / vscode / claude-code MCP snippets
tests/          # pytest

Security

  • API tokens are 32-byte URL-safe random strings prefixed ckg_, never stored in plaintext — only argon2id hashes are persisted.
  • The bootstrap token (.env) is your only way in on day 0; rotate it immediately after minting a scoped token.
  • All non-health endpoints require a token; CORS is restricted to CKG_CORS_ORIGINS.
  • .env is git-ignored. Do not commit it. Do not paste tokens into chats.
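The token shape and scope names in the list above can be illustrated with a short sketch. It assumes that admin satisfies any required scope and that other scopes must match exactly; that rule, and the function names, are assumptions rather than documented behaviour. The argon2id hashing step is left out because it needs a third-party package.

```python
import secrets

SCOPES = ("admin", "repo:write", "repo:read")


def mint_token() -> str:
    """32 random bytes, URL-safe encoded, with the ckg_ prefix."""
    return "ckg_" + secrets.token_urlsafe(32)


def allows(granted, required: str) -> bool:
    """Scope check: admin passes everything (assumed), otherwise exact match."""
    if required not in SCOPES:
        raise ValueError(f"unknown scope: {required}")
    return "admin" in granted or required in granted


token = mint_token()
print(token[:4], len(token))
print(allows({"repo:read"}, "repo:write"))
```

Only a hash of the minted string would be stored server-side, so the plaintext is shown to the user once and cannot be recovered later.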

If you find a security issue, please open a private vulnerability report on GitHub.

Pre-commit credential audit

A small audit script refuses to commit credentials, IDE-assistant configs (.claude/, CLAUDE.md, .mcp.json, .cursor/, .continue/, .aider*, .windsurf/), or files matching common secret patterns (GitHub PAT, OpenAI key, AWS access key, Slack token, JWT, PEM private key):

./scripts/audit-secrets.sh

# install as a git pre-commit hook (recommended):
ln -sf ../../scripts/audit-secrets.sh .git/hooks/pre-commit

License

MIT — see LICENSE.
