
Code->CPG chunker: tree-sitter symbol + relation extraction, size-capped chunks, ProximaRecord projection. Shared by Victor, ProximaDB SDK, and AnvaiOps.


victor-codegraph

Shared code → Code-Property-Graph chunker: tree-sitter symbol + relation extraction, size-capped embeddable chunks, and a ProximaRecord projection. One chunker, three consumers — Victor (owner), the ProximaDB SDK ([codegraph] extra), and AnvaiOps (SaaS code-graph vertical).

Design: ProximaDB ADR-029 (authoritative) · Victor ADR-014 (owner/donor) · AnvaiOps ADR-0018 (consumer). This package is the TD-CG1 scaffold.

Why

The same tree-sitter code→symbol+relation chunker existed twice (ProximaDB SDK code.py and Victor victor-coding) and was about to be written a third time in AnvaiOps. This package is the single neutral home. It merges the best of both donors and fixes their two gaps:

  • Size-capping — ProximaDB's code.py emitted one chunk per symbol with no size bound (a huge function became a huge chunk). Here, oversized symbols are body-split with overlap (LlamaIndex CodeSplitter discipline). See sizing.py.
  • Real JS/TS — the donor JS/TS parser was a stub returning no symbols. Here JS/TS get a real tree-sitter extractor (functions, classes, methods, const … = () =>, imports).
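The size-capping idea in the first bullet can be sketched as follows. This is a minimal illustration, not the contents of sizing.py: the helper names `count_tokens` and `split_with_overlap` are hypothetical, the token count is a crude whitespace estimate, and the split is line-aligned rather than AST-aligned, so a real implementation would be stricter about statement boundaries.

```python
from typing import List


def count_tokens(text: str) -> int:
    """Crude token estimate: whitespace-separated words (an assumption)."""
    return len(text.split())


def split_with_overlap(body: str, max_tokens: int, overlap_lines: int = 2) -> List[str]:
    """Split `body` into line-aligned pieces of at most `max_tokens` each.

    Consecutive pieces share up to `overlap_lines` lines so that context
    carries across the boundary. A single line that is over budget on its
    own is emitted as its own piece rather than being cut mid-line.
    """
    lines = body.splitlines()
    pieces: List[str] = []
    start = 0
    while start < len(lines):
        end = start
        used = 0
        # Grow the window line by line until the budget would be exceeded.
        while end < len(lines):
            cost = count_tokens(lines[end])
            if end > start and used + cost > max_tokens:
                break
            used += cost
            end += 1
        pieces.append("\n".join(lines[start:end]))
        if end >= len(lines):
            break
        # Step back to create the overlap, but always advance by one line.
        start = max(end - overlap_lines, start + 1)
    return pieces
```

Because the overlap lines are counted against the next piece's budget, every piece stays within `max_tokens` unless a single line already exceeds it.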

Install

Version 0.1.0 is published on PyPI. Consumers developing against the monorepo (Victor, the ProximaDB SDK, AnvaiOps) can still use an editable install.

# from PyPI:
pip install victor-codegraph                 # Python path
pip install "victor-codegraph[treesitter]"   # + multi-language grammars

# dev: editable, with tree-sitter grammars + test tooling
make -C victor-codegraph dev          # = pip install -e ../victor-contracts && pip install -e ".[dev]"

# minimal editable: Python-only (stdlib ast) path, zero native deps
pip install -e ./victor-codegraph

Releasing

CI: .github/workflows/ci-codegraph.yml runs the suite (editable install, grammars on) for every PR touching victor-codegraph/**.

Publishing: push a tag victor-codegraph-v0.1.0 to trigger .github/workflows/release-codegraph.yml, which builds and publishes via PyPI Trusted Publishing (OIDC, no API token). Configure the publisher once on PyPI (owner vjsingh1984, repo victor, workflow release-codegraph.yml, environments pypi / testpypi); see the header of that workflow.

Use

from pathlib import Path

from victor_codegraph import chunk, parse, to_proxima_records, ChunkConfig

# Read the file to index; file_path is passed separately to the chunker.
source = Path("app/service.py").read_text(encoding="utf-8")

# Size-capped, embeddable chunks:
chunks = chunk(source, file_path="app/service.py", config=ChunkConfig(max_chunk_tokens=512))

# Symbols + relations:
parsed = parse(source, file_path="app/service.py")

# Project to the ProximaDB substrate-keystone record shape (one symbol = row+node+vector).
# my_embed_fn is a caller-supplied embedding function; the embedder argument is optional.
records = to_proxima_records(parsed, repo_graph_id="myrepo", branch_id="main",
                             embedder=my_embed_fn)

Design principles (the "best posture" this encodes)

  1. Chunk at symbol granularity (not statement, not fixed-size).
  2. AST-aligned and size-capped — never split mid-statement, never exceed the budget.
  3. Extract relations (CALLS/EXTENDS/CONTAINS/…) and project to a CPG.
  4. Deterministic IDs + content hash → idempotent incremental re-index.
  5. Graceful fallback chain: python-ast → tree-sitter → sliding-window.
  6. Token budget matched to the embedding model (e.g. BGE-small: 384-dimensional vectors, 512-token input limit).
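Principle 4 can be illustrated with a short sketch. It is not the package's actual ID scheme: `symbol_id`, `content_hash`, and `plan_reindex` are hypothetical names, and the choice of SHA-256 and of the key fields is an assumption. The point is only that an ID derived from a symbol's location, plus a hash of its text, lets a re-index skip everything that has not changed.

```python
import hashlib
from typing import Dict, List


def symbol_id(repo_graph_id: str, branch_id: str, file_path: str, qualified_name: str) -> str:
    """Stable ID derived only from where the symbol lives, not from its content."""
    key = "\x1f".join([repo_graph_id, branch_id, file_path, qualified_name])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:32]


def content_hash(text: str) -> str:
    """Hash of the symbol's source text, used to detect edits."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(index: Dict[str, str], symbols: Dict[str, str]) -> List[str]:
    """Return the IDs that are new or changed, updating `index` (id -> hash) in place."""
    stale = []
    for sid, text in symbols.items():
        digest = content_hash(text)
        if index.get(sid) != digest:
            stale.append(sid)
            index[sid] = digest
    return stale
```

Running `plan_reindex` twice over unchanged input returns an empty list the second time, which is what makes the re-index idempotent in this sketch.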

Status

0.1.0 — TD-CG1 scaffold. Python (stdlib ast) is the primary, fully-offline path. Multi-language extraction is best-effort via tree-sitter; deeper per-language relation extraction (the donor parsers' Rust/Go/Java specifics) lands incrementally.
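As a rough picture of the offline Python path, the sketch below extracts symbol spans with the stdlib ast module and falls back to fixed-size line windows when the source does not parse. The function name `symbol_spans` and the window size are hypothetical, and the tree-sitter stage of the fallback chain is deliberately left out, so this shows the shape of the idea rather than the package's own extractor.

```python
import ast
from typing import List, Tuple

Span = Tuple[str, int, int]  # (name, first_line, last_line), 1-based inclusive


def symbol_spans(source: str, window: int = 40) -> List[Span]:
    """Return one span per function or class, or line windows if parsing fails."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        # Fallback: sliding windows of `window` lines, named by their line offset.
        n = len(source.splitlines())
        return [(f"window_{i}", i + 1, min(i + window, n)) for i in range(0, n, window)]
    spans: List[Span] = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # end_lineno is available on AST nodes since Python 3.8.
            spans.append((node.name, node.lineno, node.end_lineno))
    return sorted(spans, key=lambda span: span[1])
```

Nested symbols (a method inside a class) appear as their own spans here; a real chunker would also record the containment relation between them.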

