VTK RAG
Retrieval-Augmented Generation for VTK code and documentation.
Retrieve relevant VTK code examples and class/method documentation for natural language queries, using semantic search over a hybrid dense-vector + BM25 index.
Quick Start
# Setup (installs uv if needed, creates .venv, installs dependencies)
./setup.sh --dev
# Start Qdrant
docker run -d -p 6333:6333 -p 6334:6334 qdrant/qdrant
# Build (chunk + index)
uv run vtk-rag build
# Search
uv run vtk-rag search "create a sphere"
Installation
This project uses uv for fast, reproducible dependency management.
Option 1: Using setup.sh (Recommended)
./setup.sh # Production dependencies
./setup.sh --dev # Production + development (pytest, ruff)
Option 2: Manual with uv
# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh
# Create virtual environment and install
uv venv .venv
uv pip install -e ".[dev]"
# Copy environment config
cp .env.example .env
Option 3: Traditional pip
python3 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
CLI
uv run vtk-rag chunk # Process raw data into chunks
uv run vtk-rag index # Build Qdrant indexes
uv run vtk-rag build # Full pipeline (chunk + index)
uv run vtk-rag clean # Remove processed data and indexes
uv run vtk-rag search "query" # Search code and docs
Or activate the venv and run directly:
source .venv/bin/activate
vtk-rag build
Search Options
vtk-rag search "query" -n 10 # Limit results
vtk-rag search "query" --hybrid # Hybrid search (dense + BM25)
vtk-rag search "query" --bm25 # BM25 keyword search
vtk-rag search "query" --code # Code chunks only
vtk-rag search "query" --docs # Doc chunks only
vtk-rag search "query" --role source_geometric # Filter by role
vtk-rag search "query" -v # Verbose (show content)
Python API
from vtk_rag.retrieval import Retriever, FilterBuilder
retriever = Retriever()
# Semantic search
results = retriever.search("create a sphere", collection="vtk_code")
# Hybrid search (dense + BM25)
results = retriever.hybrid_search("vtkSphereSource", collection="vtk_docs")
# BM25 keyword search
results = retriever.bm25_search("vtkConeSource SetRadius")
# Filtered search
results = retriever.search(
    "render pipeline",
    filters={"role": "source_geometric", "visibility_score": {"gte": 0.7}},
)
# Convenience methods
results = retriever.search_code("how to create a cylinder")
results = retriever.search_docs("vtkPolyDataMapper")
results = retriever.search_by_class("vtkSphereSource")
# Access results
for r in results:
    print(f"{r.class_name}: {r.synopsis}")
    print(f"  Score: {r.score:.3f}")
    print(f"  Code:\n{r.content}")
Testing
uv run pytest tests/ # Run tests
uv run pytest tests/ -v # Verbose output
uv run ruff check vtk_rag/ tests/ # Lint code
Architecture
Data Flow
  Raw Data (JSONL)
        │
        ▼
   ┌─────────┐        ┌─────────┐
   │ Chunker │───────→│ Indexer │
   └─────────┘        └─────────┘
        │                  │
        ▼                  ▼
code-chunks.jsonl       Qdrant
doc-chunks.jsonl        Collections
                           │
                           ▼
                     ┌───────────┐
                     │ Retriever │
                     └───────────┘
                           │
                           ▼
                     SearchResults
Collections
| Collection | Chunks | Description |
|---|---|---|
| vtk_code | ~15,700 | Code examples from VTK examples/tests |
| vtk_docs | ~32,800 | Class/method documentation |
Chunking Module
Semantic chunking for VTK Python code and class/method documentation.
Code Chunks
Chunk Types
| Type | Description |
|---|---|
| Visualization Pipeline | property→mapper→actor groups |
| Rendering Infrastructure | camera/lights→renderer→window→interactor groups |
| vtkmodules.{module} | sources, filters, readers, writers (individual chunks) |
Semantic Grouping
The LifecycleAnalyzer tracks VTK object lifecycles and groups them:
- Visualization pipelines - Groups property + mapper + actor via SetMapper/SetProperty
- Rendering infrastructure - Combines cameras, lights, renderers, windows, interactors
- Query elements - Sources, readers, writers, filters as individual chunks
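The first grouping rule can be illustrated with Python's standard `ast` module. This is a minimal sketch, not the LifecycleAnalyzer implementation: the function name `group_visualization_pipelines` and the sample snippet are hypothetical, and only the SetMapper/SetProperty links named above are followed.

```python
import ast
from collections import defaultdict

# Hypothetical VTK snippet; it is only parsed, never executed.
SAMPLE = """
sphere = vtkSphereSource()
mapper = vtkPolyDataMapper()
mapper.SetInputConnection(sphere.GetOutputPort())
prop = vtkProperty()
actor = vtkActor()
actor.SetMapper(mapper)
actor.SetProperty(prop)
"""

# Methods that attach a mapper or property to an actor.
LINKING_METHODS = {"SetMapper", "SetProperty"}


def group_visualization_pipelines(source: str) -> dict:
    """Map each actor variable to the variables attached via SetMapper/SetProperty."""
    groups = defaultdict(list)
    for node in ast.walk(ast.parse(source)):
        # Only consider calls shaped like `owner.Method(argument)`.
        if not (isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute)):
            continue
        if node.func.attr not in LINKING_METHODS:
            continue
        owner = node.func.value
        if isinstance(owner, ast.Name) and node.args and isinstance(node.args[0], ast.Name):
            groups[owner.id].append(node.args[0].id)
    return dict(groups)


pipelines = group_visualization_pipelines(SAMPLE)
print(pipelines)
```

In this sketch the sphere source is left out of the actor's group, which mirrors the rule that sources become individual chunks.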
Code Chunk Metadata
| Field | Description |
|---|---|
| chunk_id | Unique identifier |
| example_id | Source example URL |
| type | Chunk type |
| function_name | Containing function |
| title | Human-readable title |
| description | Detailed description |
| synopsis | Natural language summary |
| content | Executable Python code with imports |
| roles | Functional roles (source_geometric, mapper_polydata, etc.) |
| visibility_score | User-facing likelihood (0.0-1.0) |
| input_datatype / output_datatype | Data types |
| vtk_class | Primary VTK class |
| queries | Pre-generated search queries |
Doc Chunks
Chunk Types
| Type | Description |
|---|---|
| class_overview | Class description and synopsis |
| constructor | How to instantiate the class |
| property_group | Related Set/Get/On/Off methods grouped by property |
| standalone_methods | Methods not part of property groups |
| inheritance | Parent class hierarchy |
Property Grouping
VTK methods are grouped by property name:
- SetRadius, GetRadius, GetRadiusMinValue, GetRadiusMaxValue → one chunk
- ScalarVisibilityOn, ScalarVisibilityOff, SetScalarVisibility, GetScalarVisibility → one chunk
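One way to derive the shared property name is to strip the conventional prefixes and suffixes from each method name. The sketch below assumes only the patterns visible in the two examples above (Set/Get prefixes, MinValue/MaxValue suffixes, On/Off suffixes); `property_name` and `group_by_property` are hypothetical helpers, not the package's API.

```python
from collections import defaultdict
from typing import Optional


def property_name(method: str) -> Optional[str]:
    """Return the property a method belongs to, or None for standalone methods."""
    # Boolean toggles: ScalarVisibilityOn / ScalarVisibilityOff -> ScalarVisibility
    if not method.startswith(("Set", "Get")):
        for suffix in ("On", "Off"):
            if method.endswith(suffix) and len(method) > len(suffix):
                return method[: -len(suffix)]
        return None
    # Accessors: SetRadius / GetRadius / GetRadiusMinValue -> Radius
    base = method[3:]
    if not base:
        return None
    for suffix in ("MinValue", "MaxValue"):
        if base.endswith(suffix) and len(base) > len(suffix):
            base = base[: -len(suffix)]
    return base


def group_by_property(methods):
    """Split method names into property groups and standalone methods."""
    groups = defaultdict(list)
    standalone = []
    for method in methods:
        name = property_name(method)
        if name is None:
            standalone.append(method)
        else:
            groups[name].append(method)
    return dict(groups), standalone


groups, standalone = group_by_property([
    "SetRadius", "GetRadius", "GetRadiusMinValue", "GetRadiusMaxValue",
    "ScalarVisibilityOn", "ScalarVisibilityOff",
    "SetScalarVisibility", "GetScalarVisibility",
    "Update",
])
print(groups)
print(standalone)
```

Methods that match none of the patterns, such as `Update` here, would fall into the standalone_methods chunk type.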
Doc Chunk Metadata
| Field | Description |
|---|---|
| chunk_id | Unique identifier |
| chunk_type | Type (class_overview, constructor, property_group, etc.) |
| class_name | VTK class name |
| content | Full documentation text |
| synopsis | Brief summary |
| role | Functional role |
| action_phrase | Concise action description |
| visibility | User-facing likelihood |
| queries | Pre-generated search queries |
Query Generation
Queries are pre-generated for each chunk to improve search recall:
- Code chunks: Pattern templates, configuration categories, synopsis values
- Doc chunks: Action phrases, camelCase→words conversion, class names
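The camelCase→words step can be sketched with a single regular expression. This is an illustrative helper (`camel_to_words` is a hypothetical name), and the package's actual splitting rules, for example how it treats the `vtk` prefix, may differ.

```python
import re

# Match, in order: an acronym followed by a capitalized word (DICOM in DICOMImage),
# a capitalized or lowercase word, a trailing acronym, or a run of digits.
_WORD = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+")


def camel_to_words(identifier: str) -> str:
    """Split a camelCase identifier into lowercase, space-separated words."""
    return " ".join(part.lower() for part in _WORD.findall(identifier))


for name in ("SetRadius", "vtkSphereSource", "vtkDICOMImageReader"):
    print(name, "->", camel_to_words(name))
```

Splitting identifiers this way lets a query such as "set radius" match a chunk whose only literal mention is `SetRadius`.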
Indexing Module
Index VTK chunks into Qdrant for hybrid search.
Search Architecture
Each collection supports three search modes:
Dense Vectors (Semantic)
- Model: SentenceTransformer (all-MiniLM-L6-v2, 384-dim)
- content: Single embedding of chunk text
- queries: Multi-vector of pre-generated query embeddings
Sparse Vectors (BM25)
- Model: FastEmbed (Qdrant/bm25)
- bm25: Sparse embedding with IDF weighting
- Good for exact VTK class names like vtkSphereSource
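To show why IDF weighting favors exact class names, here is a textbook BM25 scorer in plain Python. It is a sketch for intuition only: the FastEmbed Qdrant/bm25 model has its own tokenization and parameters, and the whitespace tokenizer, the `k1`/`b` values, and the three toy documents are assumptions.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document against the query with the classic BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Rare terms (e.g. a class name) get a high IDF, common ones a low IDF.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores


docs = [
    "vtkSphereSource creates a sphere polygon mesh",
    "vtkConeSource creates a cone polygon mesh",
    "vtkPolyDataMapper maps polygon data to graphics primitives",
]
exact = bm25_scores("vtkSphereSource", docs)
mixed = bm25_scores("polygon mesh vtkConeSource", docs)
print(exact)
print(mixed)
```

A term that appears in every document, such as "polygon" here, contributes little to the score, while a class name found in a single document dominates it.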
Payload Indexes (Filtering)
- keyword: Exact match (fast)
- text: Tokenized full-text
- float: Range queries
Hybrid Search
Combine dense + sparse with Reciprocal Rank Fusion (RRF).
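Reciprocal Rank Fusion scores each result by summing `1 / (k + rank)` over every ranking in which it appears, so it needs only rank positions, not comparable scores. The sketch below uses the standard formula; `k=60` is a commonly used constant and an assumption here, as the constant applied by Qdrant may differ, and the chunk IDs are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of IDs into one list sorted by RRF score."""
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


dense_ranking = ["chunk_a", "chunk_b", "chunk_c"]   # hypothetical semantic results
sparse_ranking = ["chunk_b", "chunk_d", "chunk_a"]  # hypothetical BM25 results

for doc_id, score in reciprocal_rank_fusion([dense_ranking, sparse_ranking]):
    print(f"{doc_id}: {score:.5f}")
```

Results that appear in both rankings (chunk_a and chunk_b) rise above those found by only one search mode.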
Collection Fields
vtk_code
| Field | Type | Description |
|---|---|---|
| content | Dense | Semantic similarity |
| bm25 | Sparse | BM25 keyword matching |
| queries | Multi | Pre-generated queries |
| type | Keyword | Visualization Pipeline, Rendering Infrastructure, vtkmodules.* |
| vtk_class | Keyword | Primary VTK class |
| function_name | Keyword | Containing function |
| roles | Keyword | Functional roles |
| input_datatype | Keyword | Input data type |
| output_datatype | Keyword | Output data type |
| visibility_score | Float | User-facing likelihood (0.0-1.0) |
vtk_docs
| Field | Type | Description |
|---|---|---|
| content | Dense | Semantic similarity |
| bm25 | Sparse | BM25 keyword matching |
| queries | Multi | Pre-generated queries |
| chunk_type | Keyword | class_overview, constructor, property_group, etc. |
| class_name | Keyword | VTK class name |
| role | Keyword | Functional role |
| visibility | Keyword | User-facing likelihood |
| metadata.module | Keyword | VTK module path |
Retrieval Module
Core retrieval primitives for searching VTK code and documentation.
Search Modes
Semantic Search
Dense vector similarity using SentenceTransformer embeddings. Best for natural language queries.
results = retriever.search("how do I visualize medical imaging data")
BM25 Search
Sparse vector keyword matching using FastEmbed BM25. Best for exact VTK class/method names.
results = retriever.bm25_search("vtkDICOMImageReader")
Hybrid Search
Combines dense + sparse with Reciprocal Rank Fusion (RRF). Best for mixed queries with both natural language and VTK terms.
results = retriever.hybrid_search("create sphere using vtkSphereSource")
Multi-Vector Search
Search against pre-generated query embeddings for better recall.
results = retriever.search("sphere", vector_name="queries")
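The recall benefit can be seen in a toy example. The sketch assumes the `queries` multi-vector is compared with a max-similarity rule, where a chunk is scored by its best-matching pre-generated query; the 2-dimensional vectors stand in for 384-dimensional embeddings, and `cosine` and `max_sim` are hypothetical helpers.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def max_sim(query_vector, stored_vectors):
    """Score a chunk by its best-matching stored vector."""
    return max(cosine(query_vector, vector) for vector in stored_vectors)


# One chunk, indexed two ways (toy values).
content_vector = [1.0, 1.0]                 # single embedding of the chunk text
query_vectors = [[1.0, 0.0], [0.0, 1.0]]    # embeddings of pre-generated queries

user_query = [0.0, 1.0]
content_score = cosine(user_query, content_vector)
queries_score = max_sim(user_query, query_vectors)
print(f"content: {content_score:.3f}, queries: {queries_score:.3f}")
```

Because one of the pre-generated queries points the same way as the user query, the multi-vector score is higher than the score against the single content embedding.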
Filtering
Filters narrow search results by metadata fields. No Qdrant imports required.
Dict Syntax (Simple)
For basic filters with AND logic only:
# Exact match
results = retriever.search("sphere", filters={"role": "source_geometric"})
# Match any
results = retriever.search("sphere", filters={
    "class_name": ["vtkSphereSource", "vtkConeSource"]
})
# Range
results = retriever.search("sphere", filters={
    "visibility_score": {"gte": 0.7}
})
# Combined (all must match)
results = retriever.search("sphere", filters={
    "type": "Visualization Pipeline",
    "visibility_score": {"gte": 0.5},
})
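The semantics of the dict syntax can be expressed as a small predicate over a chunk's payload. This is a sketch of the AND logic described above, not how the Retriever evaluates filters (in practice the conditions would be evaluated by Qdrant); `matches` is a hypothetical helper, and the `gt`, `lte`, and `lt` operators are assumptions beyond the `gte` shown in the examples.

```python
from typing import Any, Dict

# Range operators; only "gte" appears in the examples above, the rest are assumed.
_RANGE_OPS = {
    "gte": lambda value, bound: value >= bound,
    "gt": lambda value, bound: value > bound,
    "lte": lambda value, bound: value <= bound,
    "lt": lambda value, bound: value < bound,
}


def matches(payload: Dict[str, Any], filters: Dict[str, Any]) -> bool:
    """Return True only if the payload satisfies every filter condition (AND)."""
    for field, condition in filters.items():
        value = payload.get(field)
        if value is None:
            return False
        if isinstance(condition, dict):
            # Range: every bound must hold.
            if not all(_RANGE_OPS[op](value, bound) for op, bound in condition.items()):
                return False
            continue
        # Treat scalar and list payload values uniformly.
        values = value if isinstance(value, list) else [value]
        # A list condition means "match any"; a scalar means "exact match".
        allowed = condition if isinstance(condition, list) else [condition]
        if not any(item in allowed for item in values):
            return False
    return True


payload = {
    "type": "Visualization Pipeline",
    "vtk_class": "vtkSphereSource",
    "roles": ["source_geometric"],
    "visibility_score": 0.8,
}
print(matches(payload, {"type": "Visualization Pipeline", "visibility_score": {"gte": 0.5}}))
print(matches(payload, {"visibility_score": {"gte": 0.9}}))
```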
FilterBuilder (Full Control)
For exclusions, optional matches, or complex logic:
from vtk_rag.retrieval import FilterBuilder
filters = (
    FilterBuilder()
    .match("role", "source_geometric")  # must match exactly
    .match_any("vtk_class", ["vtkSphereSource", "vtkConeSource"])  # must match one
    .range("visibility_score", gte=0.7)  # must be >= 0.7
    .exclude("chunk_type", "inheritance")  # must NOT match
    .should_match("type", "Visualization Pipeline")  # bonus if matches
    .build()
)
results = retriever.search("sphere", filters=filters)
Available Filter Fields
Code collection (vtk_code):
- type, vtk_class, function_name, roles
- input_datatype, output_datatype, visibility_score
- example_id, variable_name
Doc collection (vtk_docs):
- chunk_type, class_name, role, visibility
- metadata.module, metadata.input_datatype, metadata.output_datatype
SearchResult
result = results[0]
# Core fields
result.id # Qdrant point ID
result.score # Relevance score
result.content # Chunk text
result.chunk_id # Original chunk identifier
result.collection # vtk_code or vtk_docs
result.payload # Full metadata dict
# Common properties
result.class_name # VTK class name
result.chunk_type # Chunk type
result.synopsis # Brief summary
result.role # Primary functional role
result.input_datatype # Input data type
result.output_datatype # Output data type
result.module # VTK module path
# Code chunk properties
result.title # Human-readable title
result.description # Detailed description
result.example_id # Source example URL
result.function_name # Containing function
result.variable_name # Primary variable
result.roles # All functional roles (list)
result.visibility_score # User-facing likelihood (0.0-1.0)
# Doc chunk properties
result.action_phrase # Concise action description
Code Map
vtk_rag/
├── __init__.py
├── __main__.py # python -m vtk_rag
├── cli.py # Unified CLI
├── build.py # Build pipeline
│
├── chunking/
│ ├── __init__.py # Exports: Chunker, CodeChunker, DocChunker, etc.
│ ├── chunk.py # CLI entry point
│ ├── chunker.py # Chunker class
│ ├── code_chunker.py # Code chunk extraction
│ ├── code_chunk.py # CodeChunk dataclass
│ ├── doc_chunker.py # Doc chunk extraction
│ ├── doc_chunk.py # DocChunk dataclass
│ ├── code_query_generator.py
│ ├── doc_query_generator.py
│ ├── lifecycle_analyzer.py
│ ├── semantic_chunk_builder.py
│ └── vtk_categories.py
│
├── mcp/
│ ├── __init__.py # Exports: VTKAPIClient, VTK_API_CLIENT, PersistentMCPClient
│ ├── vtk_api_client.py # VTK class resolution via MCP
│ └── persistent_mcp_client.py
│
├── indexing/
│ ├── __init__.py # Exports: Indexer, CollectionConfig, FieldConfig
│ ├── index.py # CLI entry point
│ ├── indexer.py # Indexer class
│ └── collection_config.py
│
└── retrieval/
├── __init__.py # Exports: Retriever, SearchResult, FilterBuilder
├── retriever.py # Retriever class
├── filter_builder.py
└── search_result.py
tests/
├── conftest.py # Fixtures
├── test_chunking.py
├── test_indexing.py
├── test_retrieval.py
└── test_cli.py
Prerequisites
- Python 3.10+
- Qdrant: docker run -d -p 6333:6333 -p 6334:6334 qdrant/qdrant
- Raw data files in data/raw/:
  - vtk-python-docs.jsonl (~2,900 classes)
  - vtk-python-examples.jsonl (~850 examples)
  - vtk-python-tests.jsonl (~900 tests)
License
MIT License