Retrieval-Augmented Generation for VTK code and documentation

VTK RAG

Retrieval-Augmented Generation for VTK code and documentation.

Transform natural language queries into relevant VTK code examples and class/method documentation using semantic search with hybrid vector + BM25 indexing.

Quick Start

# Setup (installs uv if needed, creates .venv, installs dependencies)
./setup.sh --dev

# Start Qdrant
docker run -d -p 6333:6333 -p 6334:6334 qdrant/qdrant

# Build (chunk + index)
uv run vtk-rag build

# Search
uv run vtk-rag search "create a sphere"

Installation

This project uses uv for fast, reproducible dependency management.

Option 1: Using setup.sh (Recommended)

./setup.sh          # Production dependencies
./setup.sh --dev    # Production + development (pytest, ruff)

Option 2: Manual with uv

# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create virtual environment and install
uv venv .venv
uv pip install -e ".[dev]"

# Copy environment config
cp .env.example .env

Option 3: Traditional pip

python3 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"

CLI

uv run vtk-rag chunk              # Process raw data into chunks
uv run vtk-rag index              # Build Qdrant indexes
uv run vtk-rag build              # Full pipeline (chunk + index)
uv run vtk-rag clean              # Remove processed data and indexes
uv run vtk-rag search "query"     # Search code and docs

Or activate the venv and run directly:

source .venv/bin/activate
vtk-rag build

Search Options

vtk-rag search "query" -n 10           # Limit results
vtk-rag search "query" --hybrid        # Hybrid search (dense + BM25)
vtk-rag search "query" --bm25          # BM25 keyword search
vtk-rag search "query" --code          # Code chunks only
vtk-rag search "query" --docs          # Doc chunks only
vtk-rag search "query" --role source_geometric  # Filter by role
vtk-rag search "query" -v              # Verbose (show content)

Python API

from vtk_rag.retrieval import Retriever, FilterBuilder

retriever = Retriever()

# Semantic search
results = retriever.search("create a sphere", collection="vtk_code")

# Hybrid search (dense + BM25)
results = retriever.hybrid_search("vtkSphereSource", collection="vtk_docs")

# BM25 keyword search
results = retriever.bm25_search("vtkConeSource SetRadius")

# Filtered search
results = retriever.search(
    "render pipeline",
    filters={"role": "source_geometric", "visibility_score": {"gte": 0.7}},
)

# Convenience methods
results = retriever.search_code("how to create a cylinder")
results = retriever.search_docs("vtkPolyDataMapper")
results = retriever.search_by_class("vtkSphereSource")

# Access results
for r in results:
    print(f"{r.class_name}: {r.synopsis}")
    print(f"  Score: {r.score:.3f}")
    print(f"  Code:\n{r.content}")

Testing

uv run pytest tests/              # Run tests
uv run pytest tests/ -v           # Verbose output
uv run ruff check vtk_rag/ tests/ # Lint code

Architecture

Data Flow

Raw Data (JSONL)                    
    │                               
    ▼                               
┌─────────┐     ┌─────────┐        
│ Chunker │────→│ Indexer │        
└─────────┘     └─────────┘        
    │               │              
    ▼               ▼              
code-chunks.jsonl   Qdrant         
doc-chunks.jsonl    Collections    
                        │          
                        ▼          
                  ┌───────────┐    
                  │ Retriever │    
                  └───────────┘    
                        │          
                        ▼          
                  SearchResults    

Collections

Collection   Chunks     Description
vtk_code     ~15,700    Code examples from VTK examples/tests
vtk_docs     ~32,800    Class/method documentation

Chunking Module

Semantic chunking for VTK Python code and class/method documentation.

Code Chunks

Chunk Types

Type                       Description
Visualization Pipeline     property→mapper→actor groups
Rendering Infrastructure   camera/lights→renderer→window→interactor groups
vtkmodules.{module}        sources, filters, readers, writers (individual chunks)

Semantic Grouping

The LifecycleAnalyzer tracks VTK object lifecycles and groups them:

  1. Visualization pipelines - Groups property + mapper + actor via SetMapper/SetProperty
  2. Rendering infrastructure - Combines cameras, lights, renderers, windows, interactors
  3. Query elements - Sources, readers, writers, filters as individual chunks
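
The first grouping rule can be illustrated with a small, standalone sketch. It is not the LifecycleAnalyzer itself: the function name group_pipelines and the LINKING_METHODS set are hypothetical, and the sketch only handles the simple case where both the actor and its argument are plain variable names.

```python
import ast

# Methods that attach a mapper or property to an actor (from the list above).
LINKING_METHODS = {"SetMapper", "SetProperty"}

def group_pipelines(source):
    """Map each actor variable to the variables attached to it (hypothetical helper)."""
    groups = {}
    for node in ast.walk(ast.parse(source)):
        # Only look at method calls such as actor.SetMapper(mapper).
        if not (isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute)):
            continue
        if node.func.attr not in LINKING_METHODS:
            continue
        owner = node.func.value
        if isinstance(owner, ast.Name) and node.args and isinstance(node.args[0], ast.Name):
            groups.setdefault(owner.id, []).append(node.args[0].id)
    return groups

example = """
mapper = vtkPolyDataMapper()
prop = vtkProperty()
actor = vtkActor()
actor.SetMapper(mapper)
actor.SetProperty(prop)
renderer.AddActor(actor)
"""
pipelines = group_pipelines(example)
print(pipelines)
```

The source is only parsed, never executed, so the VTK names do not need to be importable for the sketch to run.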

Code Chunk Metadata

Field                              Description
chunk_id                           Unique identifier
example_id                         Source example URL
type                               Chunk type
function_name                      Containing function
title                              Human-readable title
description                        Detailed description
synopsis                           Natural language summary
content                            Executable Python code with imports
roles                              Functional roles (source_geometric, mapper_polydata, etc.)
visibility_score                   User-facing likelihood (0.0-1.0)
input_datatype / output_datatype   Data types
vtk_class                          Primary VTK class
queries                            Pre-generated search queries

Doc Chunks

Chunk Types

Type                 Description
class_overview       Class description and synopsis
constructor          How to instantiate the class
property_group       Related Set/Get/On/Off methods grouped by property
standalone_methods   Methods not part of property groups
inheritance          Parent class hierarchy

Property Grouping

VTK methods are grouped by property name:

  • SetRadius, GetRadius, GetRadiusMinValue, GetRadiusMaxValue → one chunk
  • ScalarVisibilityOn, ScalarVisibilityOff, SetScalarVisibility, GetScalarVisibility → one chunk
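
A minimal sketch of this grouping, assuming the property name is whatever remains after stripping a Set/Get prefix, a MinValue/MaxValue suffix, or an On/Off suffix. The regular expressions and the helper names property_name and group_by_property are illustrative assumptions, not the package's actual rules.

```python
import re
from collections import defaultdict

# Hypothetical patterns: Set/Get + property (+ optional range suffix), or property + On/Off.
_PREFIXED = re.compile(r"^(?:Set|Get)(?P<prop>[A-Z]\w*?)(?:MinValue|MaxValue)?$")
_TOGGLE = re.compile(r"^(?P<prop>[A-Z]\w*?)(?:On|Off)$")

def property_name(method):
    """Return the property a method belongs to, or None for standalone methods."""
    for pattern in (_PREFIXED, _TOGGLE):
        m = pattern.match(method)
        if m:
            return m.group("prop")
    return None

def group_by_property(methods):
    """Split method names into property groups and standalone methods."""
    groups = defaultdict(list)
    standalone = []
    for name in methods:
        prop = property_name(name)
        if prop is None:
            standalone.append(name)
        else:
            groups[prop].append(name)
    return dict(groups), standalone

groups, standalone = group_by_property([
    "SetRadius", "GetRadius", "GetRadiusMinValue", "GetRadiusMaxValue",
    "ScalarVisibilityOn", "ScalarVisibilityOff",
    "SetScalarVisibility", "GetScalarVisibility",
    "Update",
])
print(groups)
print(standalone)
```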

Doc Chunk Metadata

Field           Description
chunk_id        Unique identifier
chunk_type      Type (class_overview, constructor, property_group, etc.)
class_name      VTK class name
content         Full documentation text
synopsis        Brief summary
role            Functional role
action_phrase   Concise action description
visibility      User-facing likelihood
queries         Pre-generated search queries

Query Generation

Queries are pre-generated for each chunk to improve search recall:

Code chunks: Pattern templates, configuration categories, synopsis values
Doc chunks: Action phrases, camelCase→words conversion, class names
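
As an illustration of the camelCase→words step, here is one way such a conversion could look. The function name camel_to_words and the exact splitting rule are assumptions made for this sketch. The package's own generator may differ, for example in how it treats the vtk prefix.

```python
import re

def camel_to_words(name):
    """Split an identifier such as 'vtkSphereSource' into lowercase words."""
    # Three alternatives, tried in order:
    #   1. a run of capitals not followed by a lowercase letter (acronyms like DICOM)
    #   2. a capitalized word (Sphere, Source)
    #   3. a lowercase or numeric run (vtk)
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z][a-z0-9]*|[a-z0-9]+", name)
    return " ".join(part.lower() for part in parts)

print(camel_to_words("vtkSphereSource"))
print(camel_to_words("SetScalarVisibility"))
print(camel_to_words("vtkDICOMImageReader"))
```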


Indexing Module

Index VTK chunks into Qdrant for hybrid search.

Search Architecture

Each collection carries three kinds of index, which together back the semantic, BM25, and hybrid search modes:

Dense Vectors (Semantic)

  • Model: SentenceTransformer (all-MiniLM-L6-v2, 384-dim)
  • content: Single embedding of chunk text
  • queries: Multi-vector of pre-generated query embeddings

Sparse Vectors (BM25)

  • Model: FastEmbed (Qdrant/bm25)
  • bm25: Sparse embedding with IDF weighting
  • Good for exact VTK class names like vtkSphereSource
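
To see why IDF weighting favors exact class names, consider this textbook BM25 scorer in plain Python. It is a sketch for intuition only, not the FastEmbed model or Qdrant's scoring code. The function name bm25_scores, the parameters k1=1.2 and b=0.75, and the three toy documents are all assumptions.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document against a tokenized query with BM25."""
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    # Document frequency: how many documents contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Rare terms get a large IDF; terms found everywhere get a small one.
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "vtkSphereSource creates a sphere polygonal mesh".lower().split(),
    "vtkConeSource creates a cone polygonal mesh".lower().split(),
    "vtkPolyDataMapper maps polygonal data to graphics primitives".lower().split(),
]
class_scores = bm25_scores(["vtkspheresource"], docs)
common_scores = bm25_scores(["polygonal"], docs)
print(class_scores)
print(common_scores)
```

The class name appears in a single document, so it scores highly there and zero elsewhere, while a word shared by every document contributes little to any of them.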

Payload Indexes (Filtering)

  • keyword: Exact match (fast)
  • text: Tokenized full-text
  • float: Range queries

Hybrid Search

Hybrid search runs the dense and sparse (BM25) queries separately, then merges the two ranked lists with Reciprocal Rank Fusion (RRF).
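
RRF scores each result by summing 1 / (k + rank) over every ranked list it appears in. The sketch below shows the idea in plain Python. The constant k = 60 is a commonly used default and an assumption here, and the result IDs are made up. In vtk-rag the fusion itself is presumably delegated to Qdrant rather than done client-side.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of IDs into one list, best first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, descending; break ties by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Hypothetical result IDs from a dense query and a BM25 query.
dense = ["sphere_example", "cone_example", "mapper_doc"]
sparse = ["vtkSphereSource_doc", "sphere_example", "cone_example"]
fused = reciprocal_rank_fusion([dense, sparse])
print(fused)
```

Results that rank well in both lists rise to the top, and no score normalization between the dense and sparse scales is needed because only ranks are used.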

Collection Fields

vtk_code

Field              Type      Description
content            Dense     Semantic similarity
bm25               Sparse    BM25 keyword matching
queries            Multi     Pre-generated queries
type               Keyword   Visualization Pipeline, Rendering Infrastructure, vtkmodules.*
vtk_class          Keyword   Primary VTK class
function_name      Keyword   Containing function
roles              Keyword   Functional roles
input_datatype     Keyword   Input data type
output_datatype    Keyword   Output data type
visibility_score   Float     User-facing likelihood (0.0-1.0)

vtk_docs

Field             Type      Description
content           Dense     Semantic similarity
bm25              Sparse    BM25 keyword matching
queries           Multi     Pre-generated queries
chunk_type        Keyword   class_overview, constructor, property_group, etc.
class_name        Keyword   VTK class name
role              Keyword   Functional role
visibility        Keyword   User-facing likelihood
metadata.module   Keyword   VTK module path

Retrieval Module

Core retrieval primitives for searching VTK code and documentation.

Search Modes

Semantic Search

Dense vector similarity using SentenceTransformer embeddings. Best for natural language queries.

results = retriever.search("how do I visualize medical imaging data")

BM25 Search

Sparse vector keyword matching using FastEmbed BM25. Best for exact VTK class/method names.

results = retriever.bm25_search("vtkDICOMImageReader")

Hybrid Search

Combines dense + sparse with Reciprocal Rank Fusion (RRF). Best for mixed queries with both natural language and VTK terms.

results = retriever.hybrid_search("create sphere using vtkSphereSource")

Multi-Vector Search

Search against pre-generated query embeddings for better recall.

results = retriever.search("sphere", vector_name="queries")
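
One plausible reading of why this helps recall: a chunk stores several query embeddings, and the chunk is scored by its best-matching one. The sketch below assumes a max-similarity comparison with cosine distance and uses toy 2-dimensional vectors. Neither the comparator nor the helper names cosine and max_sim are confirmed by the project.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def max_sim(query_vec, chunk_query_vecs):
    """Score a chunk by its best-matching pre-generated query embedding (assumption)."""
    return max(cosine(query_vec, v) for v in chunk_query_vecs)

# Toy embeddings: the chunk's content vector and two pre-generated query vectors.
content_vec = [1.0, 0.0]
query_vecs = [[1.0, 0.0], [0.0, 1.0]]
user_query = [0.0, 1.0]

content_score = cosine(user_query, content_vec)
multi_score = max_sim(user_query, query_vecs)
print(content_score, multi_score)
```

Here the user query is orthogonal to the content embedding but matches one of the stored query embeddings exactly, so the multi-vector score is high where the single-vector score is zero.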

Filtering

Filters narrow search results by metadata fields. No Qdrant imports required.

Dict Syntax (Simple)

For basic filters with AND logic only:

# Exact match
results = retriever.search("sphere", filters={"role": "source_geometric"})

# Match any
results = retriever.search("sphere", filters={
    "class_name": ["vtkSphereSource", "vtkConeSource"]
})

# Range
results = retriever.search("sphere", filters={
    "visibility_score": {"gte": 0.7}
})

# Combined (all must match)
results = retriever.search("sphere", filters={
    "type": "Visualization Pipeline",
    "visibility_score": {"gte": 0.5},
})
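
The semantics of the dict syntax can be summarized with a small evaluator that checks a payload against a filter dict. This is a sketch of the matching rules only, not how the Retriever translates filters for Qdrant. The function name matches and the payload are hypothetical, and the operators other than gte are assumed by analogy.

```python
import operator

# Range operators; only "gte" appears in the examples above, the rest are assumed.
RANGE_OPS = {"gt": operator.gt, "gte": operator.ge, "lt": operator.lt, "lte": operator.le}

def matches(payload, filters):
    """Return True if the payload satisfies every condition (AND logic)."""
    for field, condition in filters.items():
        value = payload.get(field)
        if value is None:
            return False
        if isinstance(condition, dict):      # range, e.g. {"gte": 0.7}
            if not all(RANGE_OPS[op](value, bound) for op, bound in condition.items()):
                return False
        elif isinstance(condition, list):    # match any value in the list
            if value not in condition:
                return False
        elif value != condition:             # exact match
            return False
    return True

payload = {
    "type": "Visualization Pipeline",
    "class_name": "vtkSphereSource",
    "visibility_score": 0.8,
}
print(matches(payload, {"type": "Visualization Pipeline", "visibility_score": {"gte": 0.5}}))
print(matches(payload, {"class_name": ["vtkSphereSource", "vtkConeSource"]}))
print(matches(payload, {"visibility_score": {"gte": 0.9}}))
```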

FilterBuilder (Full Control)

For exclusions, optional matches, or complex logic:

from vtk_rag.retrieval import FilterBuilder

filters = (
    FilterBuilder()
    .match("role", "source_geometric")           # must match exactly
    .match_any("vtk_class", ["vtkSphereSource", "vtkConeSource"])  # must match one
    .range("visibility_score", gte=0.7)          # must be >= 0.7
    .exclude("chunk_type", "inheritance")        # must NOT match
    .should_match("type", "Visualization Pipeline")  # bonus if matches
    .build()
)

results = retriever.search("sphere", filters=filters)

Available Filter Fields

Code collection (vtk_code):

  • type, vtk_class, function_name, roles
  • input_datatype, output_datatype, visibility_score
  • example_id, variable_name

Doc collection (vtk_docs):

  • chunk_type, class_name, role, visibility
  • metadata.module, metadata.input_datatype, metadata.output_datatype

SearchResult

result = results[0]

# Core fields
result.id          # Qdrant point ID
result.score       # Relevance score
result.content     # Chunk text
result.chunk_id    # Original chunk identifier
result.collection  # vtk_code or vtk_docs
result.payload     # Full metadata dict

# Common properties
result.class_name       # VTK class name
result.chunk_type       # Chunk type
result.synopsis         # Brief summary
result.role             # Primary functional role
result.input_datatype   # Input data type
result.output_datatype  # Output data type
result.module           # VTK module path

# Code chunk properties
result.title            # Human-readable title
result.description      # Detailed description
result.example_id       # Source example URL
result.function_name    # Containing function
result.variable_name    # Primary variable
result.roles            # All functional roles (list)
result.visibility_score # User-facing likelihood (0.0-1.0)

# Doc chunk properties
result.action_phrase    # Concise action description

Code Map

vtk_rag/
├── __init__.py
├── __main__.py          # python -m vtk_rag
├── cli.py               # Unified CLI
├── build.py             # Build pipeline
│
├── chunking/
│   ├── __init__.py      # Exports: Chunker, CodeChunker, DocChunker, etc.
│   ├── chunk.py         # CLI entry point
│   ├── chunker.py       # Chunker class
│   ├── code_chunker.py  # Code chunk extraction
│   ├── code_chunk.py    # CodeChunk dataclass
│   ├── doc_chunker.py   # Doc chunk extraction
│   ├── doc_chunk.py     # DocChunk dataclass
│   ├── code_query_generator.py
│   ├── doc_query_generator.py
│   ├── lifecycle_analyzer.py
│   ├── semantic_chunk_builder.py
│   └── vtk_categories.py
│
├── mcp/
│   ├── __init__.py          # Exports: VTKAPIClient, VTK_API_CLIENT, PersistentMCPClient
│   ├── vtk_api_client.py    # VTK class resolution via MCP
│   └── persistent_mcp_client.py
│
├── indexing/
│   ├── __init__.py      # Exports: Indexer, CollectionConfig, FieldConfig
│   ├── index.py         # CLI entry point
│   ├── indexer.py       # Indexer class
│   └── collection_config.py
│
└── retrieval/
    ├── __init__.py      # Exports: Retriever, SearchResult, FilterBuilder
    ├── retriever.py     # Retriever class
    ├── filter_builder.py
    └── search_result.py

tests/
├── conftest.py          # Fixtures
├── test_chunking.py
├── test_indexing.py
├── test_retrieval.py
└── test_cli.py

Prerequisites

  • Python 3.10+
  • Qdrant: docker run -d -p 6333:6333 -p 6334:6334 qdrant/qdrant
  • Raw data files in data/raw/:
    • vtk-python-docs.jsonl (~2,900 classes)
    • vtk-python-examples.jsonl (~850 examples)
    • vtk-python-tests.jsonl (~900 tests)

License

MIT License
