
Code->CPG chunker: tree-sitter symbol + relation extraction, size-capped chunks, ProximaRecord projection. Shared by Victor, ProximaDB SDK, and AnvaiOps.


victor-codegraph

Shared code → Code-Property-Graph (CPG) chunker: tree-sitter symbol + relation extraction, size-capped embeddable chunks, and a ProximaRecord projection. One chunker, three consumers: Victor (owner), the ProximaDB SDK ([codegraph] extra), and AnvaiOps (SaaS code-graph vertical).

Design: ProximaDB ADR-029 (authoritative) · Victor ADR-014 (owner/donor) · AnvaiOps ADR-0018 (consumer). This package is the TD-CG1 scaffold.

Why

The same tree-sitter code→symbol+relation chunker existed twice (ProximaDB SDK code.py and Victor victor-coding) and was about to be written a third time in AnvaiOps. This package is the single neutral home. It merges the best of both donors and fixes their two gaps:

  • Size-capping: ProximaDB's code.py emitted one chunk per symbol with no size bound, so a huge function became a huge chunk. Here, oversized symbols are body-split with overlap, following the discipline of LlamaIndex's CodeSplitter. See sizing.py.
  • Real JS/TS: the donor JS/TS parser was a stub returning no symbols. Here JS/TS get a real tree-sitter extractor (functions, classes, methods, const … = () =>, imports).
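The body-splitting idea in the first bullet can be sketched as follows. This is a minimal illustration under stated assumptions, not the contents of sizing.py: the names estimate_tokens and split_with_overlap are hypothetical, tokens are approximated by whitespace-separated words, and splits fall on line boundaries rather than AST statement boundaries.

```python
from typing import List


def estimate_tokens(text: str) -> int:
    """Crude token estimate: whitespace-separated words (an assumption, not a real tokenizer)."""
    return len(text.split())


def split_with_overlap(body: str, max_tokens: int = 512, overlap_lines: int = 2) -> List[str]:
    """Split an oversized symbol body into line-aligned chunks that share a small overlap."""
    lines = body.splitlines()
    chunks: List[str] = []
    start = 0
    while start < len(lines):
        end = start
        used = 0
        # Grow the window line by line until the next line would exceed the budget.
        # A single line larger than the budget is still emitted on its own.
        while end < len(lines):
            cost = estimate_tokens(lines[end])
            if end > start and used + cost > max_tokens:
                break
            used += cost
            end += 1
        chunks.append("\n".join(lines[start:end]))
        if end >= len(lines):
            break
        # Step back by the overlap, but always make forward progress.
        start = max(end - overlap_lines, start + 1)
    return chunks
```

The overlap repeats the last few lines of one chunk at the start of the next, so an embedding of the second chunk still sees some of the preceding context.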

Install

Not yet published to PyPI — use an editable install from the monorepo for now. Consumers (Victor, the ProximaDB SDK, AnvaiOps) reference it editable until the first victor-codegraph-v* release is cut.

# dev: editable, with tree-sitter grammars + test tooling
make -C victor-codegraph dev          # = pip install -e ../victor-contracts && pip install -e ".[dev]"

# minimal: Python-only (stdlib ast) path, zero native deps
pip install -e ./victor-codegraph

# once published:
#   pip install victor-codegraph                 # Python path
#   pip install "victor-codegraph[treesitter]"   # + multi-language grammars

Releasing

CI: .github/workflows/ci-codegraph.yml runs the suite (editable install, grammars on) for every PR touching victor-codegraph/**.

Publishing: push a tag victor-codegraph-v0.1.0 to trigger .github/workflows/release-codegraph.yml, which builds and publishes via PyPI Trusted Publishing (OIDC, no API token). Configure the publisher once on PyPI (owner vjsingh1984, repo victor, workflow release-codegraph.yml, environments pypi / testpypi); see the header of that workflow.

Use

from pathlib import Path

from victor_codegraph import chunk, parse, to_proxima_records, ChunkConfig

# Read the file to index (the path is illustrative).
source = Path("app/service.py").read_text(encoding="utf-8")

# Size-capped, embeddable chunks:
chunks = chunk(source, file_path="app/service.py", config=ChunkConfig(max_chunk_tokens=512))

# Symbols + relations:
parsed = parse(source, file_path="app/service.py")

# Project to the ProximaDB substrate-keystone record shape (one symbol = row+node+vector).
# The embedder argument is optional; my_embed_fn stands for your own embedding callable.
records = to_proxima_records(parsed, repo_graph_id="myrepo", branch_id="main",
                             embedder=my_embed_fn)

Design principles (the "best posture" this encodes)

  1. Chunk at symbol granularity (not statement, not fixed-size).
  2. AST-aligned and size-capped — never split mid-statement, never exceed the budget.
  3. Extract relations (CALLS/EXTENDS/CONTAINS/…) and project to a CPG.
  4. Deterministic IDs + content hash → idempotent incremental re-index.
  5. Graceful fallback chain: python-ast → tree-sitter → sliding-window.
  6. Token budget matched to the embedding model (e.g. BGE-small: 384-dimensional embeddings, 512-token input limit).
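Principle 4 can be illustrated with a short sketch. It assumes a symbol is identified by repo, branch, file path, and qualified name, and that an in-memory dict stands in for the index; the helper names symbol_id, content_hash, and needs_reindex are hypothetical and are not part of the victor_codegraph API.

```python
import hashlib
from typing import Dict


def symbol_id(repo_graph_id: str, branch_id: str, file_path: str, qualified_name: str) -> str:
    """Derive a stable ID from the symbol's coordinates (the same inputs always give the same ID)."""
    key = "\x1f".join((repo_graph_id, branch_id, file_path, qualified_name))
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:32]


def content_hash(text: str) -> str:
    """Hash the symbol's source text so that changes can be detected cheaply."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def needs_reindex(index: Dict[str, str], sid: str, text: str) -> bool:
    """Return True (and record the new hash) if the symbol is new or its text changed."""
    digest = content_hash(text)
    if index.get(sid) == digest:
        return False
    index[sid] = digest
    return True
```

Because the ID depends only on where the symbol lives and the hash only on its text, re-running the indexer over unchanged code touches nothing, and an edited symbol overwrites its own record instead of creating a duplicate.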

Status

0.1.0 — TD-CG1 scaffold. Python (stdlib ast) is the primary, fully-offline path. Multi-language extraction is best-effort via tree-sitter; deeper per-language relation extraction (the donor parsers' Rust/Go/Java specifics) lands incrementally.
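To make the offline Python path concrete, here is a minimal sketch of symbol extraction with the standard-library ast module, falling back to a sliding window when the source does not parse. It is an illustration only: the Symbol tuple and the function names are hypothetical, the tree-sitter step of the fallback chain is omitted, and the package's own extractor may differ.

```python
import ast
from typing import List, Tuple

# (kind, name, start_line, end_line); a hypothetical shape for illustration.
Symbol = Tuple[str, str, int, int]


def extract_python_symbols(source: str) -> List[Symbol]:
    """Collect classes and functions (including methods) with their line spans."""
    tree = ast.parse(source)
    symbols: List[Symbol] = []
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef):
            symbols.append(("class", node.name, node.lineno, node.end_lineno))
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            symbols.append(("function", node.name, node.lineno, node.end_lineno))
    return sorted(symbols, key=lambda s: s[2])


def sliding_window(source: str, window: int = 40, step: int = 30) -> List[Symbol]:
    """Last-resort chunking: fixed-size, overlapping line windows."""
    n = len(source.splitlines())
    return [("window", f"lines_{start}", start, min(start + window - 1, n))
            for start in range(1, n + 1, step)]


def extract_with_fallback(source: str) -> List[Symbol]:
    """Try the AST path first; fall back to windows if the source does not parse."""
    try:
        return extract_python_symbols(source)
    except SyntaxError:
        return sliding_window(source)
```

The AST path needs no native dependencies, which is why it can serve as the default; the window fallback guarantees that even unparsable files still produce something embeddable.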
