VTK RAG

Retrieval-Augmented Generation for VTK code and documentation.

Transform natural language queries into relevant VTK code examples and class/method documentation using semantic search with hybrid vector + BM25 indexing.

Quick Start

# Install
pip install -e .

# Start Qdrant
docker run -d -p 6333:6333 -p 6334:6334 qdrant/qdrant

# Build (chunk + index)
vtk-rag build

# Search
vtk-rag search "create a sphere"

CLI

vtk-rag chunk              # Process raw data into chunks
vtk-rag index              # Build Qdrant indexes
vtk-rag build              # Full pipeline (chunk + index)
vtk-rag clean              # Remove processed data and indexes
vtk-rag search "query"     # Search code and docs

Or use module invocation: python -m vtk_rag <command>

Search Options

vtk-rag search "query" -n 10           # Limit results
vtk-rag search "query" --hybrid        # Hybrid search (dense + BM25)
vtk-rag search "query" --bm25          # BM25 keyword search
vtk-rag search "query" --code          # Code chunks only
vtk-rag search "query" --docs          # Doc chunks only
vtk-rag search "query" --role source_geometric  # Filter by role
vtk-rag search "query" -v              # Verbose (show content)

Python API

from vtk_rag.retrieval import Retriever, FilterBuilder

retriever = Retriever()

# Semantic search
results = retriever.search("create a sphere", collection="vtk_code")

# Hybrid search (dense + BM25)
results = retriever.hybrid_search("vtkSphereSource", collection="vtk_docs")

# BM25 keyword search
results = retriever.bm25_search("vtkConeSource SetRadius")

# Filtered search
results = retriever.search(
    "render pipeline",
    filters={"role": "source_geometric", "visibility_score": {"gte": 0.7}},
)

# Convenience methods
results = retriever.search_code("how to create a cylinder")
results = retriever.search_docs("vtkPolyDataMapper")
results = retriever.search_by_class("vtkSphereSource")

# Access results
for r in results:
    print(f"{r.class_name}: {r.synopsis}")
    print(f"  Score: {r.score:.3f}")
    print(f"  Code:\n{r.content}")

Testing

python -m pytest tests/           # Run tests with coverage
python -m pytest tests/ --no-cov  # Without coverage

Architecture

Data Flow

Raw Data (JSONL)                    
    │                               
    ▼                               
┌─────────┐     ┌─────────┐        
│ Chunker │────→│ Indexer │        
└─────────┘     └─────────┘        
    │               │              
    ▼               ▼              
code-chunks.jsonl   Qdrant         
doc-chunks.jsonl    Collections    
                        │          
                        ▼          
                  ┌───────────┐    
                  │ Retriever │    
                  └───────────┘    
                        │          
                        ▼          
                  SearchResults    
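
Both ends of the chunking stage are JSONL files, so a small reader is enough to inspect them. The sketch below is not part of vtk-rag; `iter_jsonl` is a hypothetical helper name, and it assumes only that each non-empty line holds one JSON object.

```python
import json
from pathlib import Path


def iter_jsonl(path):
    """Yield one dict per non-empty line of a JSONL file (hypothetical helper)."""
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines between records
                yield json.loads(line)
```

Pointing it at a chunk file and printing a field such as `chunk_id` for each record is a quick way to check what the chunker produced before indexing.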

Collections

Collection  Chunks    Description
vtk_code    ~15,700   Code examples from VTK examples/tests
vtk_docs    ~32,800   Class/method documentation

Chunking Module

Semantic chunking for VTK Python code and class/method documentation.

Code Chunks

Chunk Types

Type                      Description
Visualization Pipeline    property→mapper→actor groups
Rendering Infrastructure  camera/lights→renderer→window→interactor groups
vtkmodules.{module}       sources, filters, readers, writers (individual chunks)

Semantic Grouping

The LifecycleAnalyzer tracks VTK object lifecycles and groups them:

  1. Visualization pipelines - Groups property + mapper + actor via SetMapper/SetProperty
  2. Rendering infrastructure - Combines cameras, lights, renderers, windows, interactors
  3. Query elements - Sources, readers, writers, filters as individual chunks
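
As a rough illustration of the first grouping rule, the sketch below walks a script's syntax tree and records which objects an actor is linked to through SetMapper/SetProperty. It is not the LifecycleAnalyzer implementation; `linked_objects` and `LINKING_METHODS` are hypothetical names, and only calls of the form `name.Method(name)` are considered.

```python
import ast

# Methods that tie a mapper or property to an actor (taken from the list above).
LINKING_METHODS = {"SetMapper", "SetProperty"}

source = """
actor.SetMapper(mapper)
actor.SetProperty(prop)
mapper.SetInputConnection(sphere.GetOutputPort())
"""


def linked_objects(code):
    """Map each receiver variable to the variables it is linked to."""
    groups = {}
    for node in ast.walk(ast.parse(code)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr in LINKING_METHODS
            and isinstance(node.func.value, ast.Name)
            and node.args
            and isinstance(node.args[0], ast.Name)
        ):
            groups.setdefault(node.func.value.id, []).append(node.args[0].id)
    return groups


print(linked_objects(source))
```

Only the code is parsed, so VTK does not need to be installed to run this. The `SetInputConnection` call is ignored because it is not one of the linking methods.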

Code Chunk Metadata

Field                             Description
chunk_id                          Unique identifier
example_id                        Source example URL
type                              Chunk type
function_name                     Containing function
title                             Human-readable title
description                       Detailed description
synopsis                          Natural language summary
content                           Executable Python code with imports
roles                             Functional roles (source_geometric, mapper_polydata, etc.)
visibility_score                  User-facing likelihood (0.0-1.0)
input_datatype / output_datatype  Data types
vtk_class                         Primary VTK class
queries                           Pre-generated search queries

Doc Chunks

Chunk Types

Type                Description
class_overview      Class description and synopsis
constructor         How to instantiate the class
property_group      Related Set/Get/On/Off methods grouped by property
standalone_methods  Methods not part of property groups
inheritance         Parent class hierarchy

Property Grouping

VTK methods are grouped by property name:

  • SetRadius, GetRadius, GetRadiusMinValue, GetRadiusMaxValue → one chunk
  • ScalarVisibilityOn, ScalarVisibilityOff, SetScalarVisibility, GetScalarVisibility → one chunk
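
One way to derive the property name is to strip the Set/Get prefix and the On/Off/MinValue/MaxValue suffix. The sketch below does this with a regular expression; the naming rules are inferred from the two examples above, and `property_name` and `group_by_property` are hypothetical names rather than the DocChunker's actual code.

```python
import re
from collections import defaultdict

# Optional Set/Get prefix, the property name, then an optional known suffix.
_PATTERN = re.compile(
    r"^(?:Set|Get)?(?P<prop>[A-Z]\w*?)(?:MinValue|MaxValue|On|Off)?$"
)


def property_name(method):
    """Return the property a method belongs to, e.g. 'GetRadiusMaxValue' -> 'Radius'."""
    match = _PATTERN.match(method)
    return match.group("prop") if match else method


def group_by_property(methods):
    """Group method names by their derived property name, preserving order."""
    groups = defaultdict(list)
    for method in methods:
        groups[property_name(method)].append(method)
    return dict(groups)


methods = [
    "SetRadius", "GetRadius", "GetRadiusMinValue", "GetRadiusMaxValue",
    "ScalarVisibilityOn", "ScalarVisibilityOff",
    "SetScalarVisibility", "GetScalarVisibility",
]
groups = group_by_property(methods)
```

Here `groups` has two keys, `Radius` and `ScalarVisibility`, each holding four methods. This naive version also gives a method such as `Update` a group of its own, so a real chunker would need an extra rule to send those to standalone_methods.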

Doc Chunk Metadata

Field          Description
chunk_id       Unique identifier
chunk_type     Type (class_overview, constructor, property_group, etc.)
class_name     VTK class name
content        Full documentation text
synopsis       Brief summary
role           Functional role
action_phrase  Concise action description
visibility     User-facing likelihood
queries        Pre-generated search queries

Query Generation

Queries are pre-generated for each chunk to improve search recall:

Code chunks: Pattern templates, configuration categories, synopsis values
Doc chunks: Action phrases, camelCase→words conversion, class names
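
The camelCase→words step can be sketched with one regular expression. This is an illustration rather than the project's query generator: `camel_to_words` is a hypothetical name, and keeping the `vtk` prefix as its own word is an assumption.

```python
import re

# Split before a capital that follows a lowercase letter or digit, and before
# the last capital of an acronym run that is followed by a lowercase letter.
_BOUNDARY = re.compile(r"(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])")


def camel_to_words(name):
    """Turn an identifier such as 'vtkSphereSource' into 'vtk sphere source'."""
    return _BOUNDARY.sub(" ", name).lower()


print(camel_to_words("vtkSphereSource"))
print(camel_to_words("vtkDICOMImageReader"))
```

The second alternative in the pattern keeps acronyms such as `DICOM` together instead of splitting them letter by letter.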


Indexing Module

Index VTK chunks into Qdrant for hybrid search.

Search Architecture

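Each collection is indexed in three ways: dense vectors for semantic search, sparse vectors for BM25 keyword search, and payload indexes for filtering.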

Dense Vectors (Semantic)

  • Model: SentenceTransformer (all-MiniLM-L6-v2, 384-dim)
  • content: Single embedding of chunk text
  • queries: Multi-vector of pre-generated query embeddings

Sparse Vectors (BM25)

  • Model: FastEmbed (Qdrant/bm25)
  • bm25: Sparse embedding with IDF weighting
  • Good for exact VTK class names like vtkSphereSource
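
To show why IDF weighting favours rare tokens such as class names, here is a textbook BM25 scorer in plain Python. It is not FastEmbed's or Qdrant's implementation; the whitespace tokenizer and the `k1` and `b` values are common defaults chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document against the query with the classic BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    terms = query.lower().split()
    # Document frequency: in how many documents each query term occurs.
    doc_freq = {t: sum(1 for tokens in tokenized if t in tokens) for t in terms}

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for term in terms:
            df = doc_freq[term]
            if df == 0:
                continue  # term occurs nowhere, contributes nothing
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            freq = counts[term]
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * freq * (k1 + 1) / (freq + k1 * length_norm)
        scores.append(score)
    return scores


docs = [
    "vtkSphereSource creates a sphere",
    "vtkConeSource creates a cone",
    "render the scene",
]
scores = bm25_scores("vtkSphereSource", docs)
```

Only the first document contains the class name, so it is the only one with a non-zero score.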

Payload Indexes (Filtering)

  • keyword: Exact match (fast)
  • text: Tokenized full-text
  • float: Range queries

Hybrid Search

Combine dense + sparse with Reciprocal Rank Fusion (RRF).
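
Reciprocal Rank Fusion can be sketched in a few lines: every result earns 1 / (k + rank) from each ranked list it appears in, and the sums decide the final order. The code below is a standalone illustration, not the Retriever's code path; the chunk IDs are made up, and k = 60 is a commonly used constant rather than a value stated by this project.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of IDs into one list, best first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


dense_hits = ["chunk_a", "chunk_b", "chunk_c"]   # hypothetical dense ranking
sparse_hits = ["chunk_b", "chunk_d", "chunk_a"]  # hypothetical BM25 ranking
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)  # ['chunk_b', 'chunk_a', 'chunk_d', 'chunk_c']
```

Because only ranks are used, the dense and sparse scores never need to be put on a common scale.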

Collection Fields

vtk_code

Field             Type     Description
content           Dense    Semantic similarity
bm25              Sparse   BM25 keyword matching
queries           Multi    Pre-generated queries
type              Keyword  Visualization Pipeline, Rendering Infrastructure, vtkmodules.*
vtk_class         Keyword  Primary VTK class
function_name     Keyword  Containing function
roles             Keyword  Functional roles
input_datatype    Keyword  Input data type
output_datatype   Keyword  Output data type
visibility_score  Float    User-facing likelihood (0.0-1.0)

vtk_docs

Field            Type     Description
content          Dense    Semantic similarity
bm25             Sparse   BM25 keyword matching
queries          Multi    Pre-generated queries
chunk_type       Keyword  class_overview, constructor, property_group, etc.
class_name       Keyword  VTK class name
role             Keyword  Functional role
visibility       Keyword  User-facing likelihood
metadata.module  Keyword  VTK module path

Retrieval Module

Core retrieval primitives for searching VTK code and documentation.

Search Modes

Semantic Search

Dense vector similarity using SentenceTransformer embeddings. Best for natural language queries.

results = retriever.search("how do I visualize medical imaging data")

BM25 Search

Sparse vector keyword matching using FastEmbed BM25. Best for exact VTK class/method names.

results = retriever.bm25_search("vtkDICOMImageReader")

Hybrid Search

Combines dense + sparse with Reciprocal Rank Fusion (RRF). Best for mixed queries with both natural language and VTK terms.

results = retriever.hybrid_search("create sphere using vtkSphereSource")

Multi-Vector Search

Search against pre-generated query embeddings for better recall.

results = retriever.search("sphere", vector_name="queries")
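
Multi-vector scoring can be thought of as comparing the query with every pre-generated query embedding of a chunk and keeping the best match. The sketch below uses toy 2-dimensional vectors and a max-similarity rule. That vtk-rag's `queries` vector is scored this way is an assumption, and the chunk names are made up.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def multi_vector_score(query_vec, stored_vecs):
    """Score a chunk by its best-matching stored query embedding."""
    return max(cosine(query_vec, vec) for vec in stored_vecs)


# Hypothetical chunks, each with two pre-generated query embeddings.
chunk_queries = {
    "sphere_chunk": [[0.0, 1.0], [1.0, 0.0]],
    "cone_chunk": [[0.0, 1.0], [1.0, 1.0]],
}
query = [1.0, 0.0]
ranked = sorted(
    chunk_queries,
    key=lambda chunk: multi_vector_score(query, chunk_queries[chunk]),
    reverse=True,
)
```

A chunk ranks highly as soon as one of its stored queries is close to the user's query, which is why this mode tends to help recall.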

Filtering

Filters narrow search results by metadata field, and no Qdrant imports are required. Filter on fields that exist in the collection being searched; see Available Filter Fields.

Dict Syntax (Simple)

For basic filters with AND logic only:

# Exact match
results = retriever.search("sphere", filters={"role": "source_geometric"})

# Match any
results = retriever.search("sphere", filters={
    "class_name": ["vtkSphereSource", "vtkConeSource"]
})

# Range
results = retriever.search("sphere", filters={
    "visibility_score": {"gte": 0.7}
})

# Combined (all must match)
results = retriever.search("sphere", filters={
    "type": "Visualization Pipeline",
    "visibility_score": {"gte": 0.5},
})
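
The semantics of the dict syntax can be modelled in plain Python: a string is an exact match, a list matches any of its values, a nested dict is a range, and every condition must hold. This sketch only illustrates those rules against a single payload. It is not how the Retriever builds Qdrant filters, `matches` is a hypothetical name, and the `gt`, `lte` and `lt` operators are assumed alongside the `gte` shown above.

```python
import operator

# Range operators; only "gte" appears in the examples above, the rest are assumed.
_RANGE_OPS = {
    "gte": operator.ge,
    "gt": operator.gt,
    "lte": operator.le,
    "lt": operator.lt,
}


def matches(payload, filters):
    """Return True if the payload satisfies every condition (AND logic)."""
    for field, condition in filters.items():
        value = payload.get(field)
        if value is None:
            return False  # a missing field never matches
        if isinstance(condition, dict):  # range, e.g. {"gte": 0.7}
            if not all(_RANGE_OPS[op](value, bound) for op, bound in condition.items()):
                return False
        elif isinstance(condition, list):  # match any value in the list
            if value not in condition:
                return False
        elif value != condition:  # exact match
            return False
    return True


payload = {
    "type": "Visualization Pipeline",
    "vtk_class": "vtkActor",
    "visibility_score": 0.8,
}
print(matches(payload, {"type": "Visualization Pipeline", "visibility_score": {"gte": 0.5}}))
```

A filter on a field the payload does not have returns False here, which mirrors why filters should use fields from the collection being searched.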

FilterBuilder (Full Control)

For exclusions, optional matches, or complex logic:

from vtk_rag.retrieval import FilterBuilder

filters = (
    FilterBuilder()
    .match("role", "source_geometric")           # must match exactly
    .match_any("vtk_class", ["vtkSphereSource", "vtkConeSource"])  # must match one
    .range("visibility_score", gte=0.7)          # must be >= 0.7
    .exclude("chunk_type", "inheritance")        # must NOT match
    .should_match("type", "Visualization Pipeline")  # bonus if matches
    .build()
)

results = retriever.search("sphere", filters=filters)

Available Filter Fields

Code collection (vtk_code):

  • type, vtk_class, function_name, roles
  • input_datatype, output_datatype, visibility_score
  • example_id, variable_name

Doc collection (vtk_docs):

  • chunk_type, class_name, role, visibility
  • metadata.module, metadata.input_datatype, metadata.output_datatype

SearchResult

result = results[0]

# Core fields
result.id          # Qdrant point ID
result.score       # Relevance score
result.content     # Chunk text
result.chunk_id    # Original chunk identifier
result.collection  # vtk_code or vtk_docs
result.payload     # Full metadata dict

# Common properties
result.class_name       # VTK class name
result.chunk_type       # Chunk type
result.synopsis         # Brief summary
result.role             # Primary functional role
result.input_datatype   # Input data type
result.output_datatype  # Output data type
result.module           # VTK module path

# Code chunk properties
result.title            # Human-readable title
result.description      # Detailed description
result.example_id       # Source example URL
result.function_name    # Containing function
result.variable_name    # Primary variable
result.roles            # All functional roles (list)
result.visibility_score # User-facing likelihood (0.0-1.0)

# Doc chunk properties
result.action_phrase    # Concise action description

Code Map

vtk_rag/
├── __init__.py
├── __main__.py          # python -m vtk_rag
├── cli.py               # Unified CLI
├── build.py             # Build pipeline
│
├── chunking/
│   ├── __init__.py      # Exports: Chunker, CodeChunker, DocChunker, etc.
│   ├── chunk.py         # CLI entry point
│   ├── chunker.py       # Chunker class
│   ├── code_chunker.py  # Code chunk extraction
│   ├── code_chunk.py    # CodeChunk dataclass
│   ├── doc_chunker.py   # Doc chunk extraction
│   ├── doc_chunk.py     # DocChunk dataclass
│   ├── code_query_generator.py
│   ├── doc_query_generator.py
│   ├── lifecycle_analyzer.py
│   ├── semantic_chunk_builder.py
│   ├── vtk_categories.py
│   ├── vtk_class_resolver.py
│   └── persistent_mcp_client.py
│
├── indexing/
│   ├── __init__.py      # Exports: Indexer, CollectionConfig, FieldConfig
│   ├── index.py         # CLI entry point
│   ├── indexer.py       # Indexer class
│   └── collection_config.py
│
└── retrieval/
    ├── __init__.py      # Exports: Retriever, SearchResult, FilterBuilder
    ├── retriever.py     # Retriever class
    ├── filter_builder.py
    └── search_result.py

tests/
├── conftest.py          # Fixtures
├── test_chunking.py
├── test_indexing.py
├── test_retrieval.py
└── test_cli.py

Prerequisites

  • Python 3.10+
  • Qdrant: docker run -d -p 6333:6333 -p 6334:6334 qdrant/qdrant
  • Raw data files in data/raw/:
    • vtk-python-docs.jsonl (~2,900 classes)
    • vtk-python-examples.jsonl (~850 examples)
    • vtk-python-tests.jsonl (~900 tests)

License

MIT License
