TriVector Code Intelligence - Multi-view code relationship model with advanced semantic embeddings

TriVector Embeddings for Smarter Code Search for AI Agents

TriCoder learns high-quality symbol-level embeddings from codebases using three complementary views:

  1. Graph View: Structural relationships via PPMI and SVD
  2. Context View: Semantic context via Node2Vec random walks and Word2Vec
  3. Typed View: Type information via type-token co-occurrence (optional)
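To make the Graph View more concrete, here is a minimal sketch of the general PPMI-then-SVD technique on a toy graph. It is not TriCoder's implementation: the function names `ppmi` and `svd_embed`, the toy adjacency matrix, and the choice to scale singular vectors by the square root of the singular values are all assumptions made for illustration.

```python
import numpy as np


def ppmi(counts: np.ndarray) -> np.ndarray:
    """Positive pointwise mutual information of a co-occurrence matrix."""
    total = counts.sum()
    row = counts.sum(axis=1, keepdims=True)
    col = counts.sum(axis=0, keepdims=True)
    expected = row @ col / total
    # log(0) and 0/0 produce -inf/nan; both are mapped to 0 below.
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(counts / expected)
    pmi[~np.isfinite(pmi)] = 0.0
    return np.maximum(pmi, 0.0)


def svd_embed(matrix: np.ndarray, dim: int) -> np.ndarray:
    """Keep the top `dim` singular directions as node embeddings."""
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :dim] * np.sqrt(s[:dim])


# Toy symmetric adjacency for four hypothetical symbols (edge weights as counts).
adjacency = np.array(
    [
        [0, 2, 1, 0],
        [2, 0, 1, 0],
        [1, 1, 0, 1],
        [0, 0, 1, 0],
    ],
    dtype=float,
)

weights = ppmi(adjacency)
embeddings = svd_embed(weights, dim=2)
print(embeddings.shape)
```

Each row of `embeddings` is a low-dimensional vector for one symbol; symbols with similar neighborhoods in the graph tend to end up with similar rows.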

Features

  • Subtoken Semantic Graph: Captures fine-grained semantic relationships through subtoken analysis
  • File & Module Hierarchy: Leverages file/directory structure for better clustering
  • Static Call-Graph Expansion: Propagates call relationships to depth 2-3
  • Type Semantic Expansion: Expands composite types into constructors and primitives
  • Context Window Co-occurrence: Captures lexical context within ±5 lines
  • Improved Negative Sampling: Biased sampling for better temperature calibration
  • Hybrid Similarity Scoring: Length-penalized cosine similarity
  • Iterative Embedding Smoothing: Diffusion-based smoothing for better clustering
  • Query-Time Semantic Expansion: Expands queries with subtokens and types
  • GPU Acceleration: Supports CUDA (NVIDIA) and MPS (Mac) for faster training
  • Keyword Search: Search symbols by keywords and type tokens
  • Graph Optimization: Filter out low-value nodes and edges to improve training efficiency
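Several of the features above rely on subtokens. The sketch below shows one common way to split identifiers into subtokens; the function name `split_subtokens` and the exact splitting rules are assumptions, and TriCoder's own tokenizer may behave differently.

```python
import re
from typing import List


def split_subtokens(identifier: str) -> List[str]:
    """Split snake_case, kebab-case and camelCase names into lowercase subtokens."""
    tokens: List[str] = []
    # First split on underscores and any other non-alphanumeric separators.
    for part in re.split(r"[_\W]+", identifier):
        # Then split on case boundaries, keeping acronyms and digit runs together.
        tokens.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part))
    return [token.lower() for token in tokens]


print(split_subtokens("process_payment"))
print(split_subtokens("getUserById"))
print(split_subtokens("HTTPServer"))
```

Under this scheme a query for "payment" can match `process_payment` even though the full symbol name differs.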

Installation

Using Poetry (Recommended)

poetry install

Using pip

pip install .

GPU Support (Optional)

For NVIDIA GPUs (CUDA):

pip install cupy-cuda12x

For Mac GPUs (MPS):

pip install torch

Usage

1. Extract Symbols from Codebase

# Basic extraction (Python files only)
tricoder extract --input-dir /path/to/codebase

# Extract specific file types
tricoder extract --input-dir /path/to/codebase --extensions "py,js,ts"

# Exclude specific keywords from extraction
tricoder extract --input-dir /path/to/codebase --exclude-keywords debug --exclude-keywords temp

# Custom output files
tricoder extract --input-dir /path/to/codebase --output-nodes my_nodes.jsonl --output-edges my_edges.jsonl

Extraction Options:

  • --input-dir, --root, -r: Input directory to scan (default: current directory)
  • --extensions, --ext: Comma-separated file extensions (default: py)
  • --include-dirs, -i: Include only specific subdirectories (can specify multiple)
  • --exclude-dirs, -e: Exclude directories (default: .venv, __pycache__, .git, node_modules, .pytest_cache)
  • --exclude-keywords, --exclude: Exclude symbol names (appended to default excluded keywords)
  • --output-nodes, -n: Output file for nodes (default: nodes.jsonl)
  • --output-edges, -d: Output file for edges (default: edges.jsonl)
  • --output-types, -t: Output file for types (default: types.jsonl)
  • --no-gitignore: Disable .gitignore filtering (enabled by default)
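The extraction step writes JSON Lines files, which can be inspected with a few lines of Python. This is a generic JSONL reader, not part of TriCoder; the `id` and `kind` fields and the second sample record are hypothetical, since the actual record schema is not documented here.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, Iterator


def read_jsonl(path: Path) -> Iterator[Dict]:
    """Yield one parsed JSON object per non-empty line."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


# Demo on a throwaway file; the field names below are hypothetical.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "nodes.jsonl"
    sample.write_text(
        '{"id": "function_my_function_0001", "kind": "function"}\n'
        '{"id": "class_my_class_0002", "kind": "class"}\n',
        encoding="utf-8",
    )
    records = list(read_jsonl(sample))

print(len(records), "records; keys:", sorted(records[0]))
```

Pointing `read_jsonl` at a real `nodes.jsonl` and printing the keys of the first record is a quick way to see which fields the extractor actually emits.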

2. Optimize Graph (Optional)

Reduce graph size by filtering low-value nodes and edges:

# Basic optimization (overwrites input files)
tricoder optimize

# Custom output files
tricoder optimize --output-nodes nodes_opt.jsonl --output-edges edges_opt.jsonl

# Customize thresholds
tricoder optimize --min-edge-weight 0.5 --remove-isolated --remove-generic

# Keep isolated nodes
tricoder optimize --keep-isolated

Optimization Options:

  • --nodes, -n: Input nodes file (default: nodes.jsonl)
  • --edges, -e: Input edges file (default: edges.jsonl)
  • --types, -t: Input types file (default: types.jsonl, optional)
  • --output-nodes, -N: Output nodes file (default: overwrites input)
  • --output-edges, -E: Output edges file (default: overwrites input)
  • --output-types, -T: Output types file (default: overwrites input)
  • --min-edge-weight: Minimum edge weight to keep (default: 0.3)
  • --remove-isolated: Remove nodes with no edges (default: True)
  • --keep-isolated: Keep isolated nodes (overrides --remove-isolated)
  • --remove-generic: Remove generic names (default: True)
  • --keep-generic: Keep generic names (overrides --remove-generic)
  • --exclude-keywords, --exclude: Additional keywords to exclude (can specify multiple)
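The effect of `--min-edge-weight` and `--remove-isolated` can be pictured with the sketch below. It is an illustration under assumptions, not the tool's code: the `src`, `dst`, and `weight` field names, the `prune_graph` helper, and the inclusive threshold comparison are all hypothetical.

```python
from typing import Dict, List, Tuple


def prune_graph(
    nodes: List[str],
    edges: List[Dict],
    min_edge_weight: float = 0.3,
    remove_isolated: bool = True,
) -> Tuple[List[str], List[Dict]]:
    """Drop light edges, then optionally drop nodes left without any edge."""
    # Assumption: an edge exactly at the threshold is kept.
    kept_edges = [e for e in edges if e["weight"] >= min_edge_weight]
    if not remove_isolated:
        return list(nodes), kept_edges
    connected = {e["src"] for e in kept_edges} | {e["dst"] for e in kept_edges}
    return [n for n in nodes if n in connected], kept_edges


# Toy graph with hypothetical symbol names.
nodes = ["load_config", "connect_db", "run_query", "unused_helper"]
edges = [
    {"src": "load_config", "dst": "connect_db", "weight": 0.9},
    {"src": "connect_db", "dst": "run_query", "weight": 0.5},
    {"src": "run_query", "dst": "unused_helper", "weight": 0.1},
]

kept_nodes, kept_edges = prune_graph(nodes, edges)
print(kept_nodes)
print(len(kept_edges), "edges kept")
```

Here the 0.1-weight edge falls below the default threshold, which in turn leaves `unused_helper` isolated, so it is removed as well.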

3. Train Model

# Basic training
tricoder train --out model_output

# With GPU acceleration
tricoder train --out model_output --use-gpu

# Fast mode (faster training, slightly lower quality)
tricoder train --out model_output --fast

# Custom dimensions
tricoder train --out model_output --graph-dim 128 --context-dim 128 --final-dim 256

# Custom training parameters
tricoder train --out model_output --num-walks 20 --walk-length 100 --train-ratio 0.9

Training Options:

  • --nodes, -n: Path to nodes.jsonl (default: nodes.jsonl)
  • --edges, -e: Path to edges.jsonl (default: edges.jsonl)
  • --types, -t: Path to types.jsonl (default: types.jsonl, optional)
  • --out, -o: Output directory (required)
  • --graph-dim: Graph view dimensionality (default: auto-calculated)
  • --context-dim: Context view dimensionality (default: auto-calculated)
  • --typed-dim: Typed view dimensionality (default: auto-calculated)
  • --final-dim: Final fused embedding dimensionality (default: auto-calculated)
  • --num-walks: Number of random walks per node (default: 10)
  • --walk-length: Length of each random walk (default: 80)
  • --train-ratio: Fraction of edges for training (default: 0.8)
  • --random-state: Random seed for reproducibility (default: 42)
  • --fast: Enable fast mode (reduces parameters for faster training)
  • --use-gpu: Enable GPU acceleration (CUDA or MPS, falls back to CPU if unavailable)
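To show what `--num-walks`, `--walk-length`, and `--random-state` control conceptually, here is a sketch of uniform random walks over an adjacency list. Uniform walks correspond to Node2Vec with both bias parameters set to 1; whether TriCoder uses biased walks, and how it represents the graph, is not stated here, so the `random_walks` helper and the toy graph are assumptions.

```python
import random
from typing import Dict, List


def random_walks(
    graph: Dict[str, List[str]],
    num_walks: int = 10,
    walk_length: int = 80,
    seed: int = 42,
) -> List[List[str]]:
    """Start `num_walks` uniform random walks from every node."""
    rng = random.Random(seed)
    nodes = sorted(graph)
    walks: List[List[str]] = []
    for _ in range(num_walks):
        rng.shuffle(nodes)  # visit start nodes in a fresh order each round
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length:
                neighbors = graph[walk[-1]]
                if not neighbors:
                    break  # dead end: stop this walk early
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks


# Toy graph with hypothetical symbols; "d" has no outgoing edges.
graph = {"a": ["b", "c"], "b": ["a"], "c": ["a"], "d": []}
walks = random_walks(graph, num_walks=3, walk_length=5)
print(len(walks), "walks")
```

Each walk can then be treated as a sentence of symbol IDs and passed to a skip-gram model such as gensim's Word2Vec, which is the general idea behind the Context View.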

4. Query Model

# Query by symbol ID
tricoder query --model-dir model_output --symbol function_my_function_0001 --top-k 10

# Search by keywords
tricoder query --model-dir model_output --keywords "database connection" --top-k 10

# Multi-word phrases (use quotes)
tricoder query --model-dir model_output --keywords '"user authentication" login'

# Exclude specific keywords from results
tricoder query --model-dir model_output --keywords handler --exclude-keywords debug --exclude-keywords temp

# Interactive mode
tricoder query --model-dir model_output --interactive

Query Options:

  • --model-dir, -m: Path to model directory (required)
  • --symbol, -s: Symbol ID to query
  • --keywords, -w: Keywords to search for (use quotes for multi-word: "my function")
  • --top-k, -k: Number of results to return (default: 10)
  • --exclude-keywords, --exclude: Additional keywords to exclude (appended to default excluded keywords)
  • --interactive, -i: Interactive mode
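The quoting convention for `--keywords` behaves like shell-style tokenization. The sketch below uses Python's standard `shlex` module to illustrate it; whether TriCoder parses keywords this way internally is an assumption, and `parse_keywords` is a hypothetical helper.

```python
import shlex
from typing import List


def parse_keywords(raw: str) -> List[str]:
    """Split a keyword string, keeping double-quoted phrases together."""
    return shlex.split(raw)


print(parse_keywords('"user authentication" login'))
print(parse_keywords("auth login password"))
```

The first call yields the phrase "user authentication" as a single keyword plus "login", while the second yields three separate keywords.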

5. Incremental Retraining

Retrain only on files that have changed since the last training run:

# Basic retraining (detects changed files automatically)
tricoder retrain --model-dir model_output --codebase-dir /path/to/codebase

# Force full retraining
tricoder retrain --model-dir model_output --codebase-dir /path/to/codebase --force

# Custom training parameters
tricoder retrain --model-dir model_output --codebase-dir /path/to/codebase --num-walks 20

Retrain Options:

  • --model-dir, -m: Path to existing model directory (required)
  • --codebase-dir, -c: Path to codebase root (default: current directory)
  • --output-nodes, -n: Temporary nodes file (default: nodes_retrain.jsonl)
  • --output-edges, -d: Temporary edges file (default: edges_retrain.jsonl)
  • --output-types, -t: Temporary types file (default: types_retrain.jsonl)
  • --graph-dim, --context-dim, --typed-dim, --final-dim: Override model dimensions
  • --num-walks, --walk-length, --train-ratio, --random-state: Training parameters
  • --force: Force full retraining even if no files changed
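How changed files are detected is not described here. One common approach is to compare content hashes between runs, sketched below; the `snapshot` and `changed_files` helpers and the SHA-256 manifest are assumptions for illustration, not TriCoder's mechanism.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Dict, List, Tuple


def snapshot(root: Path, extensions: Tuple[str, ...] = ("py",)) -> Dict[str, str]:
    """Map each matching file (relative path) to the SHA-256 of its contents."""
    digests: Dict[str, str] = {}
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lstrip(".") in extensions:
            key = path.relative_to(root).as_posix()
            digests[key] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def changed_files(old: Dict[str, str], new: Dict[str, str]) -> List[str]:
    """Files that were added, removed, or modified between two snapshots."""
    return sorted(p for p in set(old) | set(new) if old.get(p) != new.get(p))


# Demo in a temporary directory with hypothetical files.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.py").write_text("x = 1\n", encoding="utf-8")
    (root / "b.py").write_text("y = 1\n", encoding="utf-8")
    before = snapshot(root)

    (root / "b.py").write_text("y = 2\n", encoding="utf-8")  # modified
    (root / "c.py").write_text("z = 3\n", encoding="utf-8")  # added
    after = snapshot(root)

changed = changed_files(before, after)
print(changed)
```

Only the symbols defined in the changed files would need to be re-extracted; `--force` corresponds to ignoring such a comparison and retraining on everything.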

Examples

Complete Workflow

# 1. Extract symbols from codebase
tricoder extract --input-dir ./my_project --extensions "py,js"

# 2. (Optional) Optimize the graph
tricoder optimize --min-edge-weight 0.4

# 3. Train model with GPU acceleration
tricoder train --out ./models/my_project --use-gpu

# 4. Query for similar symbols
tricoder query --model-dir ./models/my_project --keywords "database" --top-k 5

# 5. After code changes, retrain incrementally
tricoder retrain --model-dir ./models/my_project --codebase-dir ./my_project

Keyword Search Examples

# Search for authentication-related code
tricoder query --model-dir model_output --keywords "auth login password"

# Search for specific function name
tricoder query --model-dir model_output --keywords '"process_payment"'

# Search excluding common keywords
tricoder query --model-dir model_output --keywords handler --exclude-keywords temp --exclude-keywords debug

Requirements

  • Python 3.8+
  • numpy >= 1.21.0
  • scipy >= 1.7.0
  • scikit-learn >= 1.0.0
  • gensim >= 4.0.0
  • annoy >= 1.17.0
  • click >= 8.0.0
  • rich >= 13.0.0

Optional (for GPU acceleration):

  • cupy-cuda12x >= 12.0.0 (for NVIDIA GPUs)
  • torch >= 2.0.0 (for Mac GPUs or CUDA fallback)

License

TriCoder is available under a Non-Commercial License.

  • Free for non-commercial use: Personal projects, education, research, open-source
  • Commercial license required: Paid products, SaaS, commercial consulting, enterprise use

For commercial licensing inquiries, please contact: j.f.otoupal@gmail.com

See LICENSE for full terms and LICENSE_COMMERCIAL.md for commercial license information.


Did I make your life less painful?

Support my coffee addiction ;)
Buy me a Coffee
