Code Spider

Centralized codebase knowledge graph + coordinate index for AI coding agents. It is backed by Neo4j 5.x Community, written in Python 3.12+, parses source with Tree-sitter, and exposes the graph to agents via the Model Context Protocol (MCP).

Status: Phase 0 (Foundations). The current goal is end-to-end indexing of a single Python repo into Neo4j. Phase 1 (TS/JS, REST flow, Kafka flow, MCP server, hybrid search) and Phase 2 (incremental indexing, observability) follow.

Why

AI coding agents burn large portions of their context windows on grep/list/read loops while exploring large polyglot codebases. Code Spider precomputes the structural and semantic shape of an entire workspace (every symbol, import, call, REST route, Kafka topic flow, and code chunk embedding) into a single queryable Neo4j graph, then exposes navigation primitives via MCP so agents can:

  • Jump directly to file/line coordinates without scanning.
  • Trace call graphs, run impact analysis, and follow cross-service HTTP/Kafka flows in a single Cypher query.
  • Resolve natural-language queries via hybrid lexical + vector search and receive precise coordinates.
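
Impact analysis in the list above is, conceptually, reverse reachability over call edges. In Code Spider that traversal would run as a Cypher query against Neo4j. The sketch below shows the same idea on an in-memory dict so the logic is visible. The function name and the sample symbols are assumptions made for illustration, not part of the project.

```python
from collections import deque


def impacted_by(calls: dict[str, set[str]], changed: str) -> set[str]:
    """Return every symbol that directly or transitively calls `changed`."""
    # Invert caller -> callees into callee -> callers.
    callers: dict[str, set[str]] = {}
    for caller, callees in calls.items():
        for callee in callees:
            callers.setdefault(callee, set()).add(caller)

    # Breadth-first walk up the inverted edges.
    seen: set[str] = set()
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for caller in callers.get(node, ()):
            if caller not in seen and caller != changed:
                seen.add(caller)
                queue.append(caller)
    return seen


if __name__ == "__main__":
    # Hypothetical call graph: caller -> set of callees.
    graph = {
        "api.handler": {"svc.load"},
        "cli.main": {"svc.load"},
        "svc.load": {"db.query"},
        "util.fmt": set(),
    }
    print(sorted(impacted_by(graph, "db.query")))
```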

See the design plan: ~/.windsurf/plans/code-spider-knowledge-graph-aea777.md.

Architecture (one screen)

workspaces.yaml --> CI indexer ----> Neo4j 5.x Community
                       |                  ^
                       v                  | Cypher
                Shared FS (commit SHA)    |
                       ^                  |
                       +----- MCP server (Python)
                                          ^
                                          | MCP / JSON-RPC
                                  AI agents (Windsurf / Cursor / Claude Code / Codex)

Locked design decisions

Dimension            Decision
Topology             Single shared central Neo4j 5.x Community
MVP languages        Python, TypeScript, JavaScript
Cross-service edges  REST/HTTP + Kafka producer/consumer
Enrichment           Structural + hybrid lexical/vector search (RRF)
Indexing trigger     CI pipeline step on merge to main
Vector storage       Neo4j native HNSW (abstracted behind VectorBackend)
Call resolution      Tree-sitter + 6-strategy heuristic cascade
Agent interface      MCP server only
Workspace model      Explicit workspaces.yaml manifest
Embedding model      Local sentence-transformers in-process
Snippet retrieval    Indexer-managed shared filesystem keyed by commit SHA
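
The Enrichment row names reciprocal rank fusion (RRF) for combining lexical and vector results. As a minimal sketch of the general technique, and not necessarily how Code Spider implements it: each ranked list contributes 1 / (k + rank) for every item it contains, and items are sorted by the summed score. The function name, the k = 60 default (a commonly used constant), and the sample symbol IDs are assumptions.

```python
from collections import defaultdict
from typing import Hashable, Iterable, Sequence


def rrf_fuse(
    rankings: Iterable[Sequence[Hashable]], k: int = 60
) -> list[tuple[Hashable, float]]:
    """Fuse ranked lists with reciprocal rank fusion.

    Each item scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores: dict[Hashable, float] = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by repr for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], repr(kv[0])))


if __name__ == "__main__":
    lexical = ["pkg.a.parse", "pkg.b.load", "pkg.c.run"]  # hypothetical IDs
    vector = ["pkg.b.load", "pkg.c.run", "pkg.d.save"]
    for item, score in rrf_fuse([lexical, vector]):
        print(f"{item}\t{score:.5f}")
```

An item that appears in both lists outranks one that tops only a single list, which is the property that makes RRF useful when the two scorers are not on a comparable scale.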

Quickstart for developers (consume an existing central graph)

If your team already runs a central Neo4j with the graph indexed, this is all you need. No Docker, no local Neo4j, no indexing.

# 1. Install (requires Python 3.12+)
pip install code-spider              # or: pipx install code-spider
# or zero-install with uv:           uvx code-spider serve

# 2. Point it at the central Neo4j
code-spider configure                # interactive wizard, saves to
                                     # ~/.config/code-spider/config.env (0600)

# 3. Verify the connection end-to-end
code-spider doctor                   # checks env -> bolt -> auth -> schema

# 4. Print the MCP JSON snippet for your coding agent
code-spider mcp-config --agent windsurf       # or: cursor | claude-code | generic
# Paste the printed JSON into the path the wizard tells you about.

That's it — restart your agent and the code-spider MCP server is wired in.
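
To inspect what the wizard saved, a small script can read the file and confirm it is owner-only. This is a sketch under assumptions: it treats config.env as plain KEY=VALUE lines, which the README does not spell out, and the helper names are hypothetical. The permission check assumes a POSIX filesystem.

```python
import stat
from pathlib import Path


def parse_env_file(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def is_owner_only(path: Path) -> bool:
    """True when no group/other permission bits are set (e.g. mode 0600)."""
    mode = stat.S_IMODE(path.stat().st_mode)
    return mode & (stat.S_IRWXG | stat.S_IRWXO) == 0


if __name__ == "__main__":
    cfg = Path.home() / ".config" / "code-spider" / "config.env"
    if cfg.exists():
        print("owner-only:", is_owner_only(cfg))
        # Print key names only, never the secret values.
        print("keys:", sorted(parse_env_file(cfg.read_text())))
    else:
        print(f"{cfg} not found; run `code-spider configure` first")
```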

Supported coding agents

Agent        Where to paste the mcp-config output
Windsurf     ~/.codeium/windsurf/mcp_config.json
Cursor       ~/.cursor/mcp.json (or project-level .cursor/mcp.json)
Claude Code  claude mcp add-json code-spider '<inner object>'
Generic      Any MCP client that consumes the standard JSON schema

Quickstart for admins (run the central server)

This is the side that operates Neo4j, defines workspaces.yaml, and indexes repos in CI on every merge to main.

1. Start Neo4j Community

docker compose up -d neo4j
# Browser: http://localhost:7474  (neo4j / codespider-dev-password)

2. Install with dev extras

python3.12 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev,embedding]"

3. Deploy graph schema

code-spider migrate

4. Index repositories

cp workspaces.example.yaml workspaces.yaml
# edit workspaces.yaml to point at real repos (path or git URL)
code-spider index --workspace demo

5. Verify

// in Neo4j Browser
MATCH (s:Symbol) RETURN s.kind, count(*) AS n ORDER BY n DESC;
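
The same check can be scripted, for example as a CI smoke test. The sketch below assumes the official neo4j Python driver is installed and reuses the CODE_SPIDER_NEO4J_* variable names from the hand-rolled MCP JSON in this README. The fallback defaults and function names are assumptions.

```python
QUERY = "MATCH (s:Symbol) RETURN s.kind AS kind, count(*) AS n ORDER BY n DESC"


def connection_settings(env: dict[str, str]) -> dict[str, str]:
    """Read CODE_SPIDER_NEO4J_* values, with local-dev fallbacks (assumed)."""
    return {
        "uri": env.get("CODE_SPIDER_NEO4J_URI", "bolt://localhost:7687"),
        "user": env.get("CODE_SPIDER_NEO4J_USER", "neo4j"),
        "password": env.get("CODE_SPIDER_NEO4J_PASSWORD", ""),
        "database": env.get("CODE_SPIDER_NEO4J_DATABASE", "neo4j"),
    }


def symbol_counts(settings: dict[str, str]) -> list[tuple[str, int]]:
    """Run the verification query and return (kind, count) pairs."""
    # Imported lazily so connection_settings() works without the driver.
    from neo4j import GraphDatabase

    auth = (settings["user"], settings["password"])
    with GraphDatabase.driver(settings["uri"], auth=auth) as driver:
        with driver.session(database=settings["database"]) as session:
            return [(rec["kind"], rec["n"]) for rec in session.run(QUERY)]


# Usage, against a running Neo4j:
#   import os
#   for kind, n in symbol_counts(connection_settings(dict(os.environ))):
#       print(kind, n)
```

An empty result after indexing would suggest the indexer wrote nothing, or that the script is pointed at the wrong database.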

6. Production indexing options

# Full run with embeddings + Prometheus metrics
code-spider index --workspace demo --embed sentence-transformers --metrics-port 9464

# Incremental on subsequent CI runs (skip unchanged files)
code-spider index --workspace demo --incremental --embed auto

# Prometheus scraping
curl http://localhost:9464/metrics | grep code_spider_
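
For scripted checks, the curl | grep step can be reproduced in Python. This is a simplified sketch: it assumes samples are emitted without trailing timestamps, and the metric names in the usage comment are hypothetical, since the README only establishes the code_spider_ prefix.

```python
import urllib.request


def filter_metrics(exposition: str, prefix: str = "code_spider_") -> dict[str, float]:
    """Return {sample name (with labels): value} for samples starting with prefix."""
    samples: dict[str, float] = {}
    for raw in exposition.splitlines():
        line = raw.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # The value is the last space-separated field (no timestamp assumed).
        name, _, value = line.rpartition(" ")
        if name.startswith(prefix):
            samples[name] = float(value)
    return samples


def fetch_metrics(url: str = "http://localhost:9464/metrics") -> dict[str, float]:
    """Fetch the metrics endpoint and keep only Code Spider samples."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return filter_metrics(resp.read().decode("utf-8"))


# Usage, while the indexer is running with --metrics-port 9464:
#   for name, value in sorted(fetch_metrics().items()):
#       print(name, value)
```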

7. Recommended security model for developers

Create a read-only Neo4j user for developers so a leaked password can't mutate the graph:

// run as the admin user in Neo4j Browser
CREATE USER codespider_ro SET PASSWORD 'rotate-me' CHANGE NOT REQUIRED;
GRANT ROLE reader TO codespider_ro;

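Hand codespider_ro (not the admin user) to developers running code-spider configure. Note that Neo4j documents role-based access control, including GRANT ROLE, as an Enterprise Edition feature, so confirm that the edition you run accepts the statements above before relying on them.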

8. Hand-rolled MCP JSON (if you don't want to use mcp-config)

{
  "mcpServers": {
    "code-spider": {
      "command": "/absolute/path/to/code-spider",
      "args": ["serve"],
      "env": {
        "CODE_SPIDER_NEO4J_URI": "bolt://central-neo4j.example.com:7687",
        "CODE_SPIDER_NEO4J_USER": "codespider_ro",
        "CODE_SPIDER_NEO4J_PASSWORD": "rotate-me",
        "CODE_SPIDER_NEO4J_DATABASE": "neo4j"
      }
    }
  }
}
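
If you maintain this JSON by script rather than by hand, the entry can be generated and merged into an existing config without clobbering other servers. This is a sketch under assumptions: the helper names are hypothetical, the "other-tool" entry stands in for whatever your config already contains, and it assumes the target file uses the top-level mcpServers key shown above.

```python
import json
import shutil
from copy import deepcopy


def build_entry(uri: str, user: str, password: str, database: str = "neo4j") -> dict:
    """Build the code-spider server entry in the shape shown above."""
    # Fall back to the README's placeholder when the executable is not on PATH.
    command = shutil.which("code-spider") or "/absolute/path/to/code-spider"
    return {
        "command": command,
        "args": ["serve"],
        "env": {
            "CODE_SPIDER_NEO4J_URI": uri,
            "CODE_SPIDER_NEO4J_USER": user,
            "CODE_SPIDER_NEO4J_PASSWORD": password,
            "CODE_SPIDER_NEO4J_DATABASE": database,
        },
    }


def merge_entry(existing: dict, entry: dict, name: str = "code-spider") -> dict:
    """Return a copy of an MCP config with entry added under mcpServers."""
    merged = deepcopy(existing)
    merged.setdefault("mcpServers", {})[name] = entry
    return merged


if __name__ == "__main__":
    # Hypothetical existing config with one unrelated server.
    current = {"mcpServers": {"other-tool": {"command": "other", "args": []}}}
    entry = build_entry(
        "bolt://central-neo4j.example.com:7687", "codespider_ro", "rotate-me"
    )
    print(json.dumps(merge_entry(current, entry), indent=2))
```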

Layout

code_spider/
├── config.py             # env + manifest loading (CWD .env + ~/.config/code-spider/config.env)
├── onboarding.py         # `configure` wizard, `mcp-config`, `doctor`
├── workspace/manifest.py # YAML schema + diff
├── checkout/git.py       # GitPython wrapper
├── parser/               # tree-sitter language adapters
├── symbols/              # domain model + FQN helpers
├── resolver/             # 6-strategy cascade (Phase 1)
├── routes/               # REST extractors + HTTP_FLOW matcher (Phase 1)
├── messaging/            # Kafka extractors + KAFKA_FLOW matcher (Phase 1)
├── chunker/              # AST-aware chunker (Phase 1)
├── embedding/            # sentence-transformers wrapper (Phase 1)
├── graph/                # Neo4j client, schema, writer, vector backends
├── search/               # lexical + vector + RRF fusion (Phase 1)
├── mcp/                  # MCP server + 8 tools (Phase 1)
└── cli.py                # `code-spider configure|doctor|mcp-config|migrate|index|serve`
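
The resolver/ entry above refers to a 6-strategy cascade. The README does not list the six strategies, so the sketch below only illustrates the cascade pattern itself: try resolution strategies in a fixed order and stop at the first one that produces an answer. The two strategies, the context dict layout, and all names are hypothetical.

```python
from typing import Callable, Optional

# A strategy maps (call name, context) to a fully qualified name,
# or None when it cannot decide.
Strategy = Callable[[str, dict], Optional[str]]


def same_module(name: str, ctx: dict) -> Optional[str]:
    """Hypothetical strategy: the callee is defined in the caller's module."""
    if name in ctx.get("local_defs", ()):
        return f"{ctx['module']}.{name}"
    return None


def imported_name(name: str, ctx: dict) -> Optional[str]:
    """Hypothetical strategy: the callee was brought in by an import."""
    return ctx.get("imports", {}).get(name)


def resolve(
    name: str, ctx: dict, strategies: list[Strategy]
) -> tuple[Optional[str], Optional[str]]:
    """Try strategies in order; return (fqn, strategy name) of the first hit."""
    for strategy in strategies:
        fqn = strategy(name, ctx)
        if fqn is not None:
            return fqn, strategy.__name__
    return None, None


if __name__ == "__main__":
    ctx = {
        "module": "pkg.mod",
        "local_defs": {"helper"},
        "imports": {"load": "pkg.io.load"},
    }
    for call in ("helper", "load", "missing"):
        print(call, "->", resolve(call, ctx, [same_module, imported_name]))
```

Returning the name of the strategy that matched makes it possible to record how confident each resolved edge is, since earlier strategies in a cascade are usually the more precise ones.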

Development

pytest                                # unit tests
pytest -m integration                 # requires Neo4j on localhost:7687
ruff check . && ruff format --check . # lint + format
mypy code_spider                      # type-check

License

Apache-2.0
