Code->CPG chunker: tree-sitter symbol + relation extraction, size-capped chunks, ProximaRecord projection. Shared by Victor, ProximaDB SDK, and AnvaiOps.

victor-codegraph

Shared code → Code-Property-Graph chunker: tree-sitter symbol + relation extraction, size-capped embeddable chunks, and a ProximaRecord projection. One chunker, three consumers — Victor (owner), the ProximaDB SDK ([codegraph] extra), and AnvaiOps (SaaS code-graph vertical).

Design: ProximaDB ADR-029 (authoritative) · Victor ADR-014 (owner/donor) · AnvaiOps ADR-0018 (consumer). This package is the TD-CG1 scaffold.

Why

The same tree-sitter code→symbol+relation chunker existed twice (ProximaDB SDK code.py and Victor victor-coding) and was about to be written a third time in AnvaiOps. This package is the single neutral home. It merges the best of both donors and fixes their two gaps:

  • Size-capping — ProximaDB's code.py emitted one chunk per symbol with no size bound (a huge function became a huge chunk). Here, oversized symbols are body-split with overlap (LlamaIndex CodeSplitter discipline). See sizing.py.
  • Real JS/TS — the donor JS/TS parser was a stub returning no symbols. Here JS/TS get a real tree-sitter extractor (functions, classes, methods, const … = () =>, imports).
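The size-capping idea can be sketched in a few lines. This is a minimal illustration only, not the contents of sizing.py: the helper names `estimate_tokens` and `split_body` are hypothetical, the 4-characters-per-token estimate is an assumption, and the split is line-aligned rather than AST-aligned to keep the example short.

```python
from typing import List


def estimate_tokens(text: str) -> int:
    """Crude token estimate (assumption): roughly 4 characters per token."""
    return max(1, len(text) // 4)


def split_body(source: str, max_tokens: int = 512, overlap_lines: int = 2) -> List[str]:
    """Split an oversized symbol into line-aligned windows that share a few lines."""
    lines = source.splitlines()
    chunks: List[str] = []
    start = 0
    while start < len(lines):
        end = start
        budget = 0
        # Grow the window line by line until the next line would exceed the budget.
        # A single line larger than the budget is still emitted on its own.
        while end < len(lines):
            cost = estimate_tokens(lines[end])
            if budget + cost > max_tokens and end > start:
                break
            budget += cost
            end += 1
        chunks.append("\n".join(lines[start:end]))
        if end >= len(lines):
            break
        # Step back by the overlap, but always make forward progress.
        start = max(end - overlap_lines, start + 1)
    return chunks
```

Each window repeats the last `overlap_lines` lines of the previous one, so a retriever that lands on the second chunk still sees a little of the preceding context.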

Install

Not yet published to PyPI — use an editable install from the monorepo for now. Consumers (Victor, the ProximaDB SDK, AnvaiOps) reference it editable until the first victor-codegraph-v* release is cut.

# dev: editable, with tree-sitter grammars + test tooling
make -C victor-codegraph dev          # = pip install -e ../victor-contracts && pip install -e ".[dev]"

# minimal: Python-only (stdlib ast) path, zero native deps
pip install -e ./victor-codegraph

# once published:
#   pip install victor-codegraph                 # Python path
#   pip install "victor-codegraph[treesitter]"   # + multi-language grammars

Releasing

CI: .github/workflows/ci-codegraph.yml runs the suite (editable install, grammars on) for every PR touching victor-codegraph/**.

Publishing: push a tag such as victor-codegraph-v0.1.0 to trigger .github/workflows/release-codegraph.yml, which builds and publishes via PyPI Trusted Publishing (OIDC, no API token). Configure the publisher once on PyPI (owner vjsingh1984, repo victor, workflow release-codegraph.yml, environments pypi / testpypi); see the header of that workflow.

Use

from pathlib import Path

from victor_codegraph import chunk, parse, to_proxima_records, ChunkConfig

# Read the file to index (any source string works).
source = Path("app/service.py").read_text(encoding="utf-8")

# Size-capped, embeddable chunks:
chunks = chunk(source, file_path="app/service.py", config=ChunkConfig(max_chunk_tokens=512))

# Symbols + relations:
parsed = parse(source, file_path="app/service.py")

# Project to the ProximaDB substrate-keystone record shape (one symbol = row+node+vector).
# The embedder argument is optional; my_embed_fn stands for your own embedding callable.
records = to_proxima_records(parsed, repo_graph_id="myrepo", branch_id="main",
                             embedder=my_embed_fn)

Design principles (the "best posture" this encodes)

  1. Chunk at symbol granularity (not statement, not fixed-size).
  2. AST-aligned and size-capped — never split mid-statement, never exceed the budget.
  3. Extract relations (CALLS/EXTENDS/CONTAINS/…) and project to a CPG.
  4. Deterministic IDs + content hash → idempotent incremental re-index.
  5. Graceful fallback chain: python-ast → tree-sitter → sliding-window.
  6. Token budget matched to the embedding model (e.g. BGE-small: 384-dimensional vectors, 512-token input limit).
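Principle 4 can be illustrated with a short sketch. The key layout, the truncation to 32 hex characters, and the names `symbol_id`, `content_hash`, and `plan_reindex` are all assumptions made for this example; the package's actual ID scheme is not shown here.

```python
import hashlib
from typing import Dict, List, Tuple


def symbol_id(repo_graph_id: str, branch_id: str, file_path: str, qualified_name: str) -> str:
    """Deterministic ID: the same symbol location always maps to the same key."""
    # Hypothetical key layout; a unit separator avoids accidental collisions.
    key = "\x1f".join((repo_graph_id, branch_id, file_path, qualified_name))
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:32]


def content_hash(text: str) -> str:
    """Hash of the symbol text, used to detect whether it changed."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(index: Dict[str, str], symbols: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Compare stored id -> hash with fresh id -> text; return (upserts, deletes)."""
    upsert = [sid for sid, text in symbols.items() if index.get(sid) != content_hash(text)]
    delete = [sid for sid in index if sid not in symbols]
    return upsert, delete
```

Because IDs do not depend on when or in what order files are processed, re-running the indexer on unchanged code yields empty upsert and delete lists, which is what makes the re-index idempotent.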

Status

0.1.0 — TD-CG1 scaffold. Python (stdlib ast) is the primary, fully-offline path. Multi-language extraction is best-effort via tree-sitter; deeper per-language relation extraction (the donor parsers' Rust/Go/Java specifics) lands incrementally.
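As a rough illustration of the offline Python path, the sketch below extracts function and class spans with the stdlib ast module and falls back to fixed-size line windows when the source does not parse. It is not the package's implementation: the names `python_symbols`, `sliding_windows`, and `extract` are hypothetical, and the tree-sitter stage of the fallback chain is left out to keep the example stdlib-only.

```python
import ast
from typing import List, Tuple

# (kind, name, start_line, end_line), 1-based and inclusive
Span = Tuple[str, str, int, int]


def python_symbols(source: str) -> List[Span]:
    """Extract function and class spans with the stdlib ast module."""
    tree = ast.parse(source)
    spans: List[Span] = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            spans.append(("function", node.name, node.lineno, node.end_lineno))
        elif isinstance(node, ast.ClassDef):
            spans.append(("class", node.name, node.lineno, node.end_lineno))
    return sorted(spans, key=lambda s: s[2])


def sliding_windows(source: str, size: int = 40, overlap: int = 5) -> List[Span]:
    """Last-resort fallback: fixed-size line windows with overlap."""
    n = len(source.splitlines())
    step = max(1, size - overlap)
    spans: List[Span] = []
    start = 0
    while start < n:
        end = min(start + size, n)
        spans.append(("window", f"lines_{start + 1}_{end}", start + 1, end))
        if end == n:
            break
        start += step
    return spans


def extract(source: str) -> List[Span]:
    """Try the AST path first; fall back to windows if the source does not parse."""
    try:
        return python_symbols(source)
    except (SyntaxError, ValueError):
        return sliding_windows(source)
```

The fallback keeps unparseable files searchable at the cost of symbol-level structure, so no file is silently dropped from the index.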

Source Distribution

victor_codegraph-0.1.1.tar.gz (23.4 kB)

Uploaded Source

Built Distribution

victor_codegraph-0.1.1-py3-none-any.whl (20.9 kB)

Uploaded Python 3

File details

Details for the file victor_codegraph-0.1.1.tar.gz.

File metadata

  • Download URL: victor_codegraph-0.1.1.tar.gz
  • Size: 23.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for victor_codegraph-0.1.1.tar.gz
Algorithm     Hash digest
SHA256        4fc5e40f246e72f793e960a459db229ea2594584131121c2757bf2bd6f2e3bb2
MD5           3a1fe2e9e4e61dc2df2478b20749ee86
BLAKE2b-256   0a3e5d30373ee695d91e21fe22170d88700ba38674a6b1c6a15aa63f20cdb9f2

Provenance

The following attestation bundles were made for victor_codegraph-0.1.1.tar.gz:

Publisher: release-codegraph.yml on vjsingh1984/victor

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file victor_codegraph-0.1.1-py3-none-any.whl.

File hashes

Hashes for victor_codegraph-0.1.1-py3-none-any.whl
Algorithm     Hash digest
SHA256        b11356e94565afd69647fa44749caa97859aaa493c307d922b08110fdb491ba5
MD5           21ef93419eadadcd8ad6beb419f0465b
BLAKE2b-256   113011c258d2b36eeb61a7ede523721a68f8c91bc788d54200677fbfedfb8263

Provenance

The following attestation bundles were made for victor_codegraph-0.1.1-py3-none-any.whl:

Publisher: release-codegraph.yml on vjsingh1984/victor

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
