Centralized codebase knowledge graph + coordinate index for AI coding agents (Neo4j-backed).

Code Spider

Centralized codebase knowledge graph + coordinate index for AI coding agents. It is backed by Neo4j 5.x Community, written in Python 3.12+, parses source with Tree-sitter, and exposes the graph to agents via the Model Context Protocol (MCP).

Status: Phase 0 — Foundations. End-to-end indexing for a single Python repo into Neo4j is the current goal. Phases 1 (TS/JS, REST flow, Kafka flow, MCP server, hybrid search) and 2 (incremental, observability) follow.

Why

AI coding agents waste enormous context windows on grep/list/read loops while exploring large polyglot codebases. Code Spider precomputes the structural + semantic shape of an entire workspace (every symbol, import, call, REST route, Kafka topic flow, code chunk embedding) into a single queryable Neo4j graph, then exposes navigation primitives via MCP so agents can:

- Jump directly to file/line coordinates without scanning.
- Trace call graphs, run impact analysis, and follow cross-service HTTP/Kafka flows in a single Cypher hop.
- Resolve natural-language queries via hybrid lexical + vector search and receive precise coordinates.
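
To make the coordinate idea concrete, here is a minimal sketch of what an agent-side helper might do once it holds a (file, start line, end line) coordinate. The directory layout `<root>/<commit_sha>/<relative path>` and the function name `read_snippet` are assumptions made for illustration; they are not taken from Code Spider's actual storage format.

```python
from pathlib import Path


def read_snippet(root: Path, commit_sha: str, rel_path: str,
                 start_line: int, end_line: int) -> str:
    """Return lines start_line..end_line (1-indexed, inclusive) of a file.

    Assumed layout (hypothetical): <root>/<commit_sha>/<rel_path>.
    """
    if start_line < 1 or end_line < start_line:
        raise ValueError("expected 1 <= start_line <= end_line")
    path = root / commit_sha / rel_path
    with path.open(encoding="utf-8", errors="replace") as handle:
        lines = handle.read().splitlines()
    # Slicing past the end of the file simply returns the lines that exist.
    return "\n".join(lines[start_line - 1:end_line])
```

Only the requested range is returned to the caller, which is the point of coordinates: the agent spends context on the lines it asked for rather than on the whole file.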

See the design plan: ~/.windsurf/plans/code-spider-knowledge-graph-aea777.md.

Architecture (one screen)

workspaces.yaml --> CI indexer ----> Neo4j 5.x Community
                       |                  ^
                       v                  | Cypher
                Shared FS (commit SHA)    |
                       ^                  |
                       +----- MCP server (Python)
                                          ^
                                          | MCP / JSON-RPC
                                  AI agents (Windsurf / Cursor / Claude Code / Codex)

Locked design decisions

Dimension	Decision
Topology	Single shared central Neo4j 5.x Community
MVP languages	Python, TypeScript, JavaScript
Cross-service edges	REST/HTTP + Kafka producer/consumer
Enrichment	Structural + hybrid lexical/vector search (RRF)
Indexing trigger	CI pipeline step on merge to main
Vector storage	Neo4j native HNSW (abstracted behind `VectorBackend`)
Call resolution	Tree-sitter + 6-strategy heuristic cascade
Agent interface	MCP server only
Workspace model	Explicit `workspaces.yaml` manifest
Embedding model	Local `sentence-transformers` by default; optional LiteLLM-backed external models (Voyage, OpenAI, Cohere, OpenRouter) via `.env`
Snippet retrieval	Indexer-managed shared filesystem keyed by commit SHA
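
The "RRF" in the enrichment row refers to reciprocal rank fusion, a standard way to merge a lexical ranking and a vector ranking without comparing their raw scores. The sketch below shows the general technique only; the function name and the constant `k = 60` (the value commonly used in the literature) are assumptions, and Code Spider's own implementation may differ.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked ID lists: score(d) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID for deterministic output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    lexical_hits = ["a", "b", "c"]   # e.g. full-text index results
    vector_hits = ["b", "d", "a"]    # e.g. nearest-neighbour results
    print(reciprocal_rank_fusion([lexical_hits, vector_hits]))
```

Items that appear high in both lists rise to the top, while an item found by only one retriever still gets a non-zero score.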

Quickstart for developers (consume an existing central graph)

If your team already runs a central Neo4j with the graph indexed, this is all you need. No Docker, no local Neo4j, no indexing.

# 1. Install (requires Python 3.12+)
pip install code-spider              # or: pipx install code-spider
pip install 'code-spider[embedding]' # for local embedding models
# or zero-install with uv:           uvx code-spider serve

# 2. Point it at the central Neo4j
code-spider configure                # interactive wizard, saves to
                                     # ~/.config/code-spider/config.env (0600)

# 3. Verify the connection end-to-end
code-spider doctor                   # checks env -> bolt -> auth -> schema

# 4. Print the MCP JSON snippet for your coding agent
code-spider mcp-config --agent windsurf       # or: cursor | claude-code | generic
# Paste the printed JSON into the path the wizard tells you about.

That's it — restart your agent and the code-spider MCP server is wired in.

Supported coding agents

Agent	Where to paste the `mcp-config` output
Windsurf	`~/.codeium/windsurf/mcp_config.json`
Cursor	`~/.cursor/mcp.json` (or project-level `.cursor/mcp.json`)
Claude Code	`claude mcp add-json code-spider '<inner object>'`
Generic	Any MCP client that consumes the standard JSON schema

Quickstart for admins (run the central server)

This is the side that operates Neo4j, defines workspaces.yaml, and indexes repos in CI on every merge to main.

1. Start Neo4j Community

docker compose up -d neo4j
# Browser: http://localhost:7474  (neo4j / codespider-dev-password)

2. Install with dev extras

python3.12 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev,embedding]"

3. Deploy graph schema

code-spider migrate

4. Index repositories

cp workspaces.example.yaml workspaces.yaml
# edit workspaces.yaml to point at real repos (path or git URL)
code-spider index --workspace demo

5. Verify

// in Neo4j Browser
MATCH (s:Symbol) RETURN s.kind, count(*) AS n ORDER BY n DESC;

6. Production indexing options

# Full run with embeddings + Prometheus metrics
code-spider index --workspace demo --embed sentence-transformers --metrics-port 9464

# Incremental on subsequent CI runs (skip unchanged files)
code-spider index --workspace demo --incremental --embed auto

# Prometheus scraping
curl http://localhost:9464/metrics | grep code_spider_
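
If you would rather check the endpoint from a script than from curl, the following is a rough Python equivalent. It assumes the endpoint serves the standard Prometheus text exposition format; the helper names are hypothetical, and no particular metric names are assumed beyond the code_spider_ prefix used above. Unlike the grep, it drops `# HELP` and `# TYPE` comment lines and keeps only samples.

```python
from urllib.request import urlopen


def filter_metrics(exposition_text: str, prefix: str = "code_spider_") -> list[str]:
    """Keep sample lines whose metric name starts with prefix.

    Comment lines (# HELP / # TYPE) and blank lines are dropped.
    """
    return [
        line
        for line in exposition_text.splitlines()
        if line.startswith(prefix)
    ]


def scrape(url: str = "http://localhost:9464/metrics") -> list[str]:
    """Fetch the metrics endpoint and return only Code Spider samples."""
    with urlopen(url, timeout=5) as response:
        return filter_metrics(response.read().decode("utf-8"))
```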

6a. External embedding models (LiteLLM)

The default sentence-transformers/all-MiniLM-L6-v2 runs locally and needs no API key. For production-grade code retrieval quality you can switch to a hosted model via the LiteLLM SDK without touching any code:

pip install -e ".[litellm]"        # adds the litellm dependency

Pick one of the recommended models in .env:

Model	Dim	Strengths	Env vars
`voyage/voyage-code-3` (recommended for code)	1024	Tuned on source code; tops code-retrieval benchmarks	`VOYAGE_API_KEY`
`openai/text-embedding-3-small`	1536	Cheap, widely available, strong general baseline	`OPENAI_API_KEY`
`cohere/embed-multilingual-v3.0`	1024	Multilingual code + prose	`COHERE_API_KEY`
OpenRouter (OpenAI-compatible)	varies	Single key, many backends (verify the chosen route exposes /embeddings)	`CODE_SPIDER_EMBED_API_BASE`, `CODE_SPIDER_EMBED_API_KEY`

.env example for Voyage:

CODE_SPIDER_EMBED_PROVIDER=litellm
CODE_SPIDER_EMBED_MODEL=voyage/voyage-code-3
CODE_SPIDER_EMBED_DIM=1024
VOYAGE_API_KEY=...

Then re-create the vector index at the new dimension and reindex:

code-spider migrate                                       # auto-recreates index at CODE_SPIDER_EMBED_DIM
code-spider index --workspace demo                        # picks up litellm via .env

migrate auto-detects when CODE_SPIDER_EMBED_DIM differs from the existing chunk_embedding index and drops + recreates the index at the new dimension. This deletes every existing chunk embedding, so you must reindex affected workspaces afterwards (you would need to anyway — vectors from one model can't be compared to vectors from another).

Precedence: an explicit --embed <name> flag always wins; --embed auto (the default) reads CODE_SPIDER_EMBED_PROVIDER.

6b. Resource tuning (4 GiB / 2 vCPU and bigger boxes)

The indexer is engineered to run on small CI workers without OOM kills. Four knobs control the trade-off between speed and memory:

Env var	Default	What it does
`CODE_SPIDER_MAX_FILE_BYTES`	`1048576` (1 MiB)	Skip files larger than this at the walker, before they are even read. Auto-generated bundles, minified assets, vendored libraries, and lockfiles are almost always over 1 MiB and have near-zero semantic value for code intelligence. Set to `0` to disable.
`CODE_SPIDER_EMBED_BATCH_SIZE`	`64`	Inputs per outbound embedding call. Lower → smaller request bodies (helps under gateway caps) but more roundtrips.
`CODE_SPIDER_EMBED_WORKERS`	`min(cpu_count, 4)`	Number of concurrent embedding sub-batches dispatched per repo. Threaded — fine on 2 vCPUs because embedding is I/O-bound. Lower this if you're hitting upstream rate limits.
`CODE_SPIDER_EMBED_MAX_INPUT_CHARS`	`120000`	Per-input character cap. Anything longer is pre-truncated before being sent. Set well below your model's context window (e.g. `2000` for `all-MiniLM-L6-v2`) to keep the request body small.
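
The size check described for CODE_SPIDER_MAX_FILE_BYTES can be illustrated with a walker that consults file metadata before opening anything. This is a minimal sketch under stated assumptions: `walk_source_files` is a hypothetical name, and the real walker presumably also applies ignore rules and language filters that are not shown here.

```python
import os
from collections.abc import Iterator
from pathlib import Path


def walk_source_files(root: Path, max_file_bytes: int = 1_048_576) -> Iterator[Path]:
    """Yield files under root, skipping any larger than max_file_bytes.

    The size comes from stat(), so oversized files are never opened.
    max_file_bytes == 0 disables the limit.
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = Path(dirpath) / name
            if max_file_bytes and path.stat().st_size > max_file_bytes:
                continue  # skipped before a single byte is read
            yield path
```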

4 GiB / 2 vCPU recipe (.env):

# Memory-safe small-box defaults
CODE_SPIDER_MAX_FILE_BYTES=524288        # 512 KiB — extra safety margin
CODE_SPIDER_EMBED_WORKERS=2              # one per vCPU
CODE_SPIDER_EMBED_BATCH_SIZE=16          # smaller request bodies
CODE_SPIDER_EMBED_MAX_INPUT_CHARS=8000   # tune to your model's context window

The walker chunks each file inline during the parse pass and drops the source bytes immediately, so resident memory is bounded by one file at a time rather than by the full workspace. The embedding stage processes one repo at a time with CODE_SPIDER_EMBED_WORKERS threads in flight; if a sub-batch fails (provider outage, transient 5xx, persistent payload cap), the remaining sub-batches still finish and the failure is isolated to that slice. Progress is rendered live via rich.progress when stderr is a TTY, and otherwise as structured log lines every 5%, so you always see motion.
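
The interplay of batch size, worker count, input truncation, and failure isolation can be sketched with the standard library alone. Everything here is illustrative: `embed_all` and its parameters are hypothetical names that mirror the env vars above, and `embed_batch` stands in for whatever provider call is configured.

```python
from collections.abc import Callable, Sequence
from concurrent.futures import ThreadPoolExecutor

Vector = list[float]


def embed_all(
    texts: Sequence[str],
    embed_batch: Callable[[list[str]], list[Vector]],
    batch_size: int = 64,
    workers: int = 4,
    max_input_chars: int = 120_000,
) -> tuple[dict[int, Vector], list[range]]:
    """Embed texts in concurrent sub-batches.

    Returns (vectors keyed by input index, index ranges of failed sub-batches).
    """
    # Pre-truncate every input so request bodies stay bounded.
    truncated = [text[:max_input_chars] for text in texts]
    slices = [range(i, min(i + batch_size, len(truncated)))
              for i in range(0, len(truncated), batch_size)]
    vectors: dict[int, Vector] = {}
    failed: list[range] = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {
            pool.submit(embed_batch, [truncated[i] for i in chunk]): chunk
            for chunk in slices
        }
        for future, chunk in futures.items():
            try:
                result = future.result()
            except Exception:  # isolate the failure to this slice
                failed.append(chunk)
                continue
            vectors.update(zip(chunk, result))
    return vectors, failed
```

Returning the failed index ranges, rather than raising, lets a caller retry or report just those slices while keeping the vectors that did succeed.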

7. Recommended security model for developers

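Create a read-only Neo4j user for developers so a leaked password can't mutate the graph. Role assignment (GRANT ROLE) belongs to Neo4j's role-based access control, which is an Enterprise Edition feature, so confirm that your Neo4j edition accepts these commands before relying on them: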

// run as the admin user in Neo4j Browser
CREATE USER codespider_ro SET PASSWORD 'rotate-me' CHANGE NOT REQUIRED;
GRANT ROLE reader TO codespider_ro;

Hand codespider_ro (not the admin user) to developers running code-spider configure.

8. Hand-rolled MCP JSON (if you don't want to use `mcp-config`)

{
  "mcpServers": {
    "code-spider": {
      "command": "/absolute/path/to/code-spider",
      "args": ["serve"],
      "env": {
        "CODE_SPIDER_NEO4J_URI": "bolt://central-neo4j.example.com:7687",
        "CODE_SPIDER_NEO4J_USER": "codespider_ro",
        "CODE_SPIDER_NEO4J_PASSWORD": "rotate-me",
        "CODE_SPIDER_NEO4J_DATABASE": "neo4j"
      }
    }
  }
}
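
If the agent's config file already lists other MCP servers, pasting by hand risks overwriting them. The sketch below builds the same structure as the JSON above and merges it into an existing config. The helper names are hypothetical; only the keys and env var names come from the example above.

```python
import json


def build_mcp_config(command: str, uri: str, user: str, password: str,
                     database: str = "neo4j") -> dict:
    """Build the mcpServers entry shown in the hand-rolled JSON."""
    return {
        "mcpServers": {
            "code-spider": {
                "command": command,
                "args": ["serve"],
                "env": {
                    "CODE_SPIDER_NEO4J_URI": uri,
                    "CODE_SPIDER_NEO4J_USER": user,
                    "CODE_SPIDER_NEO4J_PASSWORD": password,
                    "CODE_SPIDER_NEO4J_DATABASE": database,
                },
            }
        }
    }


def merge_mcp_config(existing: dict, addition: dict) -> dict:
    """Return a copy of existing with addition's servers merged in."""
    merged = dict(existing)
    merged["mcpServers"] = {
        **existing.get("mcpServers", {}),
        **addition["mcpServers"],
    }
    return merged


if __name__ == "__main__":
    entry = build_mcp_config(
        "/absolute/path/to/code-spider",
        "bolt://central-neo4j.example.com:7687",
        "codespider_ro",
        "rotate-me",
    )
    print(json.dumps(entry, indent=2))
```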

Layout

code_spider/
├── config.py             # env + manifest loading (CWD .env + ~/.config/code-spider/config.env)
├── onboarding.py         # `configure` wizard, `mcp-config`, `doctor`
├── progress.py           # rich.progress (TTY) / structured-log (CI) reporters
├── workspace/manifest.py # YAML schema + diff
├── checkout/git.py       # GitPython wrapper
├── parser/               # tree-sitter language adapters
├── symbols/              # domain model + FQN helpers
├── resolver/             # 6-strategy cascade (Phase 1)
├── routes/               # REST extractors + HTTP_FLOW matcher (Phase 1)
├── messaging/            # Kafka extractors + KAFKA_FLOW matcher (Phase 1)
├── chunker/              # AST-aware chunker (Phase 1)
├── embedding/            # sentence-transformers wrapper (Phase 1)
├── graph/                # Neo4j client, schema, writer, vector backends
├── search/               # lexical + vector + RRF fusion (Phase 1)
├── mcp/                  # MCP server + 8 tools (Phase 1)
└── cli.py                # `code-spider configure|doctor|mcp-config|migrate|index|serve`

Development

pytest                                # unit tests
pytest -m integration                 # requires Neo4j on localhost:7687
ruff check . && ruff format --check . # lint + format
mypy code_spider                      # type-check

License

Apache-2.0

Project details

Maintainers

hypen-code

Release history

Version	Released
0.1.2 (this version)	May 29, 2026
0.1.1	May 29, 2026
0.1.0	May 29, 2026

Download files

Source Distribution

code_spider-0.1.2.tar.gz (161.1 kB), uploaded May 29, 2026 (Source)

Built Distribution

code_spider-0.1.2-py3-none-any.whl (145.6 kB), uploaded May 29, 2026 (Python 3)

File details

Details for the file code_spider-0.1.2.tar.gz.

File metadata

Download URL: code_spider-0.1.2.tar.gz
Upload date: May 29, 2026
Size: 161.1 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for code_spider-0.1.2.tar.gz
Algorithm	Hash digest
SHA256	`a8049ae5b8c4d5a4bbf55cd36510bff2ba1a24506de577a6b12e7bf85455b784`
MD5	`8f5a5539a16be728aae283ba5c2b16d4`
BLAKE2b-256	`f37ffcb36f369bd57b4f76448a4622e42e26e70ca9146baf16c9f4a4d0c2617b`

Provenance

The following attestation bundles were made for code_spider-0.1.2.tar.gz:

Publisher: publish.yml on hypen-code/code-spider

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: code_spider-0.1.2.tar.gz
- Subject digest: a8049ae5b8c4d5a4bbf55cd36510bff2ba1a24506de577a6b12e7bf85455b784
- Sigstore transparency entry: 1669972634
- Sigstore integration time: May 29, 2026
Source repository:
- Permalink: hypen-code/code-spider@fd0e952716af6e8f6364dbaab0a3925b9aa881d8
- Branch / Tag: refs/tags/v0.1.2
- Owner: https://github.com/hypen-code
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@fd0e952716af6e8f6364dbaab0a3925b9aa881d8
- Trigger Event: push

File details

Details for the file code_spider-0.1.2-py3-none-any.whl.

File metadata

Download URL: code_spider-0.1.2-py3-none-any.whl
Upload date: May 29, 2026
Size: 145.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for code_spider-0.1.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`628f1d64e9174d0e1a59d9c76614802e2bca0faea4907cfbd3bef75ed2d1da3e`
MD5	`46ddad5ff51906c8119183ad175d7019`
BLAKE2b-256	`702d9121565a49757d90f4cdf156be88bafca830d95563232d801b146600b036`

Provenance

The following attestation bundles were made for code_spider-0.1.2-py3-none-any.whl:

Publisher: publish.yml on hypen-code/code-spider

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: code_spider-0.1.2-py3-none-any.whl
- Subject digest: 628f1d64e9174d0e1a59d9c76614802e2bca0faea4907cfbd3bef75ed2d1da3e
- Sigstore transparency entry: 1669972769
- Sigstore integration time: May 29, 2026
Source repository:
- Permalink: hypen-code/code-spider@fd0e952716af6e8f6364dbaab0a3925b9aa881d8
- Branch / Tag: refs/tags/v0.1.2
- Owner: https://github.com/hypen-code
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@fd0e952716af6e8f6364dbaab0a3925b9aa881d8
- Trigger Event: push
