Code->CPG chunker: tree-sitter symbol + relation extraction, size-capped chunks, ProximaRecord projection. Shared by Victor, ProximaDB SDK, and AnvaiOps.

victor-codegraph

Shared code → Code-Property-Graph chunker: tree-sitter symbol + relation extraction, size-capped embeddable chunks, and a ProximaRecord projection. One chunker, three consumers — Victor (owner), the ProximaDB SDK ([codegraph] extra), and AnvaiOps (SaaS code-graph vertical).

Design: ProximaDB ADR-029 (authoritative) · Victor ADR-014 (owner/donor) · AnvaiOps ADR-0018 (consumer). This package is the TD-CG1 scaffold.

Why

The same tree-sitter code→symbol+relation chunker existed twice (ProximaDB SDK code.py and Victor victor-coding) and was about to be written a third time in AnvaiOps. This package is the single neutral home. It merges the best of both donors and fixes their two gaps:

  • Size-capping — ProximaDB's code.py emitted one chunk per symbol with no size bound (a huge function became a huge chunk). Here, oversized symbols are body-split with overlap (LlamaIndex CodeSplitter discipline). See sizing.py.
  • Real JS/TS — the donor JS/TS parser was a stub returning no symbols. Here JS/TS get a real tree-sitter extractor (functions, classes, methods, const … = () =>, imports).
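
The size-capping idea can be sketched as follows. This is a minimal illustration, not the contents of sizing.py: it splits on line boundaries rather than AST statement boundaries, and it estimates tokens by counting whitespace-separated words. The names split_body, estimate_tokens, and overlap_lines are hypothetical.

```python
from typing import List


def estimate_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: whitespace-separated words.
    return len(text.split())


def split_body(lines: List[str], max_tokens: int = 512, overlap_lines: int = 2) -> List[str]:
    """Split an oversized symbol body into windows that overlap by a few lines."""
    chunks: List[str] = []
    start, n = 0, len(lines)
    while start < n:
        end, used = start, 0
        # Grow the window one line at a time until the budget would be exceeded.
        # A single line larger than the budget is still emitted on its own.
        while end < n:
            cost = estimate_tokens(lines[end])
            if end > start and used + cost > max_tokens:
                break
            used += cost
            end += 1
        chunks.append("\n".join(lines[start:end]))
        if end >= n:
            break
        # Step back to create the overlap, but always advance by at least one line.
        start = max(end - overlap_lines, start + 1)
    return chunks


if __name__ == "__main__":
    body = [f"x{i} = {i}" for i in range(10)]
    for piece in split_body(body, max_tokens=9, overlap_lines=1):
        print(repr(piece))
```

The overlap means the last lines of one chunk reappear at the start of the next, so an embedding of either chunk retains some local context from its neighbour.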

Install

Not yet published to PyPI — use an editable install from the monorepo for now. Consumers (Victor, the ProximaDB SDK, AnvaiOps) reference it editable until the first victor-codegraph-v* release is cut.

# dev: editable, with tree-sitter grammars + test tooling
make -C victor-codegraph dev          # = pip install -e ../victor-contracts && pip install -e ".[dev]"

# minimal: Python-only (stdlib ast) path, zero native deps
pip install -e ./victor-codegraph

# once published:
#   pip install victor-codegraph                 # Python path
#   pip install "victor-codegraph[treesitter]"   # + multi-language grammars

Releasing

CI: .github/workflows/ci-codegraph.yml runs the suite (editable install, grammars on) for every PR touching victor-codegraph/**.

Publishing: push a tag such as victor-codegraph-v0.1.0 to trigger .github/workflows/release-codegraph.yml, which builds and publishes via PyPI Trusted Publishing (OIDC, no API token). Configure the publisher once on PyPI (owner vjsingh1984, repo victor, workflow release-codegraph.yml, environments pypi / testpypi); see the header of that workflow.

Use

from pathlib import Path

from victor_codegraph import chunk, parse, to_proxima_records, ChunkConfig

# The source text of the file to index:
source = Path("app/service.py").read_text(encoding="utf-8")

# Size-capped, embeddable chunks:
chunks = chunk(source, file_path="app/service.py", config=ChunkConfig(max_chunk_tokens=512))

# Symbols + relations:
parsed = parse(source, file_path="app/service.py")

# Project to the ProximaDB substrate-keystone record shape (one symbol = row+node+vector).
# The embedder argument is optional; pass embedder=my_embed_fn to supply your own embedding function.
records = to_proxima_records(parsed, repo_graph_id="myrepo", branch_id="main")

Design principles (the "best posture" this encodes)

  1. Chunk at symbol granularity (not statement, not fixed-size).
  2. AST-aligned and size-capped — never split mid-statement, never exceed the budget.
  3. Extract relations (CALLS/EXTENDS/CONTAINS/…) and project to a CPG.
  4. Deterministic IDs + content hash → idempotent incremental re-index.
  5. Graceful fallback chain: python-ast → tree-sitter → sliding-window.
  6. Token budget matched to the embedding model (e.g. BGE-small: 384-dimensional vectors, 512-token input limit).
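
Principle 4 can be illustrated with a small sketch. It assumes, hypothetically, that a symbol ID is derived from the repo graph ID, branch, file path, and qualified symbol name, and that the index is a plain dict from ID to content hash. The package's actual ID scheme and storage are not shown in this README, so symbol_id, content_hash, and plan_reindex are illustrative names only.

```python
import hashlib
from typing import Dict, Iterable, List, Tuple


def symbol_id(repo_graph_id: str, branch_id: str, file_path: str, qualified_name: str) -> str:
    # Same inputs always give the same ID (assumed key fields, for illustration).
    key = "\x1f".join([repo_graph_id, branch_id, file_path, qualified_name])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:32]


def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(index: Dict[str, str], symbols: Iterable[Tuple[str, str]]) -> List[str]:
    """Return the IDs whose content is new or changed, and record their hashes."""
    changed: List[str] = []
    for sid, text in symbols:
        digest = content_hash(text)
        if index.get(sid) != digest:
            index[sid] = digest
            changed.append(sid)
    return changed


if __name__ == "__main__":
    index: Dict[str, str] = {}
    sid = symbol_id("myrepo", "main", "app/service.py", "Service.run")
    print(plan_reindex(index, [(sid, "def run(self): pass")]))      # first pass: indexed
    print(plan_reindex(index, [(sid, "def run(self): pass")]))      # unchanged: skipped
    print(plan_reindex(index, [(sid, "def run(self): return 1")]))  # edited: re-indexed
```

Because the ID is stable and the hash tracks the content, re-running the indexer over an unchanged file produces no work, which is what makes incremental re-indexing idempotent.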

Status

0.1.0 — TD-CG1 scaffold. Python (stdlib ast) is the primary, fully-offline path. Multi-language extraction is best-effort via tree-sitter; deeper per-language relation extraction (the donor parsers' Rust/Go/Java specifics) lands incrementally.
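
The fallback chain named in the design principles (python-ast, then tree-sitter, then sliding-window) can be sketched with the standard library alone. This is an assumption-laden illustration, not the package's code: the tree-sitter stage is a placeholder that always reports "unavailable", and all function names here are hypothetical.

```python
import ast
from typing import Callable, List, Optional, Tuple

Chunk = Tuple[str, str]  # (label, text)


def python_ast_chunks(source: str) -> Optional[List[Chunk]]:
    """Stage 1: top-level functions and classes via the stdlib ast module."""
    try:
        tree = ast.parse(source)
    except (SyntaxError, ValueError):
        return None  # not valid Python: let the next stage try
    found: List[Chunk] = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            segment = ast.get_source_segment(source, node)
            if segment:
                found.append((node.name, segment))
    return found or None


def tree_sitter_chunks(source: str) -> Optional[List[Chunk]]:
    """Stage 2 placeholder: a real stage would query a tree-sitter grammar."""
    return None  # pretend no grammar is installed


def sliding_window_chunks(source: str, window: int = 40, overlap: int = 5) -> List[Chunk]:
    """Stage 3: language-agnostic line windows, used when nothing else works."""
    lines = source.splitlines()
    step = max(window - overlap, 1)
    out: List[Chunk] = []
    for i in range(0, max(len(lines), 1), step):
        out.append((f"lines_{i + 1}", "\n".join(lines[i:i + window])))
        if i + window >= len(lines):
            break
    return out


def chunk_with_fallback(source: str) -> Tuple[str, List[Chunk]]:
    stages: List[Callable[[str], Optional[List[Chunk]]]] = [
        python_ast_chunks, tree_sitter_chunks, sliding_window_chunks,
    ]
    for stage in stages:
        result = stage(source)
        if result:
            return stage.__name__, result
    return "none", []


if __name__ == "__main__":
    print(chunk_with_fallback("def f():\n    return 1\n"))
    print(chunk_with_fallback("function f() { return 1; }"))
```

Each stage returns None (or an empty result) to hand over to the next one, so a missing grammar or a parse failure degrades the chunk quality without failing the index run.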

Download files

Source Distribution

victor_codegraph-0.1.2.tar.gz (27.3 kB)

Uploaded Source

Built Distribution

victor_codegraph-0.1.2-py3-none-any.whl (23.1 kB)

Uploaded Python 3

File details

Details for the file victor_codegraph-0.1.2.tar.gz.

File metadata

  • Download URL: victor_codegraph-0.1.2.tar.gz
  • Size: 27.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for victor_codegraph-0.1.2.tar.gz
Algorithm    Hash digest
SHA256       bec08d6be71f2b8986535c560e3e6ff767c3625b3b7af62ac5782654a24c9bb1
MD5          1e14d75c7c277591dd1947abcfca70a1
BLAKE2b-256  cbdf46a10910282390520884537a53a42a44740ce8b3f81cfd30b39005979d57
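
A downloaded file can be checked against the published SHA256 digest with the standard library. The sketch below is one way to do it; sha256_of and verify are hypothetical helper names, and the local file path is assumed to be wherever the download was saved.

```python
import hashlib

# SHA256 digest published above for victor_codegraph-0.1.2.tar.gz
EXPECTED_SDIST_SHA256 = "bec08d6be71f2b8986535c560e3e6ff767c3625b3b7af62ac5782654a24c9bb1"


def sha256_of(path: str, chunk_size: int = 65536) -> str:
    """Hash a file in fixed-size blocks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify(path: str, expected_hex: str) -> bool:
    return sha256_of(path) == expected_hex.lower()


# Usage, assuming the sdist was saved to the current directory:
#   verify("victor_codegraph-0.1.2.tar.gz", EXPECTED_SDIST_SHA256)
```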

Provenance

The following attestation bundles were made for victor_codegraph-0.1.2.tar.gz:

Publisher: release-codegraph.yml on vjsingh1984/victor

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file victor_codegraph-0.1.2-py3-none-any.whl.

File hashes

Hashes for victor_codegraph-0.1.2-py3-none-any.whl
Algorithm    Hash digest
SHA256       fe4297312efb3dc1549dd66c12e80c8b077804fa876db13e81b788413348e74d
MD5          d145033951ba13cfe87e0a2129d2a880
BLAKE2b-256  964a475be684b5d67badf95f5456e4d0459cb9651d4e94a98e69f3d9fa6f4db3

Provenance

The following attestation bundles were made for victor_codegraph-0.1.2-py3-none-any.whl:

Publisher: release-codegraph.yml on vjsingh1984/victor
