Retrieval-Augmented Generation for VTK code and documentation

These details have not been verified by PyPI

Project links

Project description

VTK RAG

Retrieval-Augmented Generation for VTK code and documentation.

Transform natural language queries into relevant VTK code examples and class/method documentation using semantic search with hybrid vector + BM25 indexing.

Quick Start

# Setup (installs uv if needed, creates .venv, installs dependencies)
./setup.sh --dev

# Start Qdrant
docker run -d -p 6333:6333 -p 6334:6334 qdrant/qdrant

# Build (chunk + index)
uv run vtk-rag build

# Search
uv run vtk-rag search "create a sphere"

Installation

This project uses uv for fast, reproducible dependency management.

Option 1: Using setup.sh (Recommended)

./setup.sh          # Production dependencies
./setup.sh --dev    # Production + development (pytest, ruff)

Option 2: Manual with uv

# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create virtual environment and install
uv venv .venv
uv pip install -e ".[dev]"

# Copy environment config
cp .env.example .env

Option 3: Traditional pip

python3 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"

CLI

uv run vtk-rag chunk              # Process raw data into chunks
uv run vtk-rag index              # Build Qdrant indexes
uv run vtk-rag build              # Full pipeline (chunk + index)
uv run vtk-rag clean              # Remove processed data and indexes
uv run vtk-rag search "query"     # Search code and docs

Or activate the venv and run directly:

source .venv/bin/activate
vtk-rag build

Search Options

vtk-rag search "query" -n 10           # Limit results
vtk-rag search "query" --hybrid        # Hybrid search (dense + BM25)
vtk-rag search "query" --bm25          # BM25 keyword search
vtk-rag search "query" --code          # Code chunks only
vtk-rag search "query" --docs          # Doc chunks only
vtk-rag search "query" --role source_geometric  # Filter by role
vtk-rag search "query" -v              # Verbose (show content)

Python API

from vtk_rag.retrieval import Retriever, FilterBuilder

retriever = Retriever()

# Semantic search
results = retriever.search("create a sphere", collection="vtk_code")

# Hybrid search (dense + BM25)
results = retriever.hybrid_search("vtkSphereSource", collection="vtk_docs")

# BM25 keyword search
results = retriever.bm25_search("vtkConeSource SetRadius")

# Filtered search
results = retriever.search(
    "render pipeline",
    filters={"role": "source_geometric", "visibility_score": {"gte": 0.7}},
)

# Convenience methods
results = retriever.search_code("how to create a cylinder")
results = retriever.search_docs("vtkPolyDataMapper")
results = retriever.search_by_class("vtkSphereSource")

# Access results
for r in results:
    print(f"{r.class_name}: {r.synopsis}")
    print(f"  Score: {r.score:.3f}")
    print(f"  Code:\n{r.content}")

Testing

uv run pytest tests/              # Run tests
uv run pytest tests/ -v           # Verbose output
uv run ruff check vtk_rag/ tests/ # Lint code

Architecture

Data Flow

Raw Data (JSONL)                    
    │                               
    ▼                               
┌─────────┐     ┌─────────┐        
│ Chunker │────→│ Indexer │        
└─────────┘     └─────────┘        
    │               │              
    ▼               ▼              
code-chunks.jsonl   Qdrant         
doc-chunks.jsonl    Collections    
                        │          
                        ▼          
                  ┌───────────┐    
                  │ Retriever │    
                  └───────────┘    
                        │          
                        ▼          
                  SearchResults

Collections

Collection	Chunks	Description
`vtk_code`	~15,700	Code examples from VTK examples/tests
`vtk_docs`	~32,800	Class/method documentation

Chunking Module

Semantic chunking for VTK Python code and class/method documentation.

Code Chunks

Chunk Types

Type	Description
Visualization Pipeline	property→mapper→actor groups
Rendering Infrastructure	camera/lights→renderer→window→interactor groups
vtkmodules.{module}	sources, filters, readers, writers (individual chunks)

Semantic Grouping

The LifecycleAnalyzer tracks VTK object lifecycles and groups them:

Visualization pipelines - Groups property + mapper + actor via SetMapper/SetProperty
Rendering infrastructure - Combines cameras, lights, renderers, windows, interactors
Query elements - Sources, readers, writers, filters as individual chunks

Code Chunk Metadata

Field	Description
`chunk_id`	Unique identifier
`example_id`	Source example URL
`type`	Chunk type
`function_name`	Containing function
`title`	Human-readable title
`description`	Detailed description
`synopsis`	Natural language summary
`content`	Executable Python code with imports
`roles`	Functional roles (source_geometric, mapper_polydata, etc.)
`visibility_score`	User-facing likelihood (0.0-1.0)
`input_datatype` / `output_datatype`	Data types
`vtk_class`	Primary VTK class
`queries`	Pre-generated search queries

Doc Chunks

Chunk Types

Type	Description
class_overview	Class description and synopsis
constructor	How to instantiate the class
property_group	Related Set/Get/On/Off methods grouped by property
standalone_methods	Methods not part of property groups
inheritance	Parent class hierarchy

Property Grouping

VTK methods are grouped by property name:

SetRadius, GetRadius, GetRadiusMinValue, GetRadiusMaxValue → one chunk
ScalarVisibilityOn, ScalarVisibilityOff, SetScalarVisibility, GetScalarVisibility → one chunk

Doc Chunk Metadata

Field	Description
`chunk_id`	Unique identifier
`chunk_type`	Type (class_overview, constructor, property_group, etc.)
`class_name`	VTK class name
`content`	Full documentation text
`synopsis`	Brief summary
`role`	Functional role
`action_phrase`	Concise action description
`visibility`	User-facing likelihood
`queries`	Pre-generated search queries

Query Generation

Queries are pre-generated for each chunk to improve search recall:

Code chunks: Pattern templates, configuration categories, synopsis values
Doc chunks: Action phrases, camelCase→words conversion, class names

Indexing Module

Index VTK chunks into Qdrant for hybrid search.

Search Architecture

Each collection supports three search modes:

Dense Vectors (Semantic)

Model: SentenceTransformer (all-MiniLM-L6-v2, 384-dim)
content: Single embedding of chunk text
queries: Multi-vector of pre-generated query embeddings

Sparse Vectors (BM25)

Model: FastEmbed (Qdrant/bm25)
bm25: Sparse embedding with IDF weighting
Good for exact VTK class names like vtkSphereSource

Payload Indexes (Filtering)

keyword: Exact match (fast)
text: Tokenized full-text
float: Range queries

Hybrid Search

Combine dense + sparse with Reciprocal Rank Fusion (RRF).

Collection Fields

vtk_code

Field	Type	Description
`content`	Dense	Semantic similarity
`bm25`	Sparse	BM25 keyword matching
`queries`	Multi	Pre-generated queries
`type`	Keyword	Visualization Pipeline, Rendering Infrastructure, vtkmodules.*
`vtk_class`	Keyword	Primary VTK class
`function_name`	Keyword	Containing function
`roles`	Keyword	Functional roles
`input_datatype`	Keyword	Input data type
`output_datatype`	Keyword	Output data type
`visibility_score`	Float	User-facing likelihood (0.0-1.0)

vtk_docs

Field	Type	Description
`content`	Dense	Semantic similarity
`bm25`	Sparse	BM25 keyword matching
`queries`	Multi	Pre-generated queries
`chunk_type`	Keyword	class_overview, constructor, property_group, etc.
`class_name`	Keyword	VTK class name
`role`	Keyword	Functional role
`visibility`	Keyword	User-facing likelihood
`metadata.module`	Keyword	VTK module path

Retrieval Module

Core retrieval primitives for searching VTK code and documentation.

Search Modes

Semantic Search

Dense vector similarity using SentenceTransformer embeddings. Best for natural language queries.

results = retriever.search("how do I visualize medical imaging data")

BM25 Search

Sparse vector keyword matching using FastEmbed BM25. Best for exact VTK class/method names.

results = retriever.bm25_search("vtkDICOMImageReader")

Hybrid Search

Combines dense + sparse with Reciprocal Rank Fusion (RRF). Best for mixed queries with both natural language and VTK terms.

results = retriever.hybrid_search("create sphere using vtkSphereSource")

Multi-Vector Search

Search against pre-generated query embeddings for better recall.

results = retriever.search("sphere", vector_name="queries")

Filtering

Filters narrow search results by metadata fields. No Qdrant imports required.

Dict Syntax (Simple)

For basic filters with AND logic only:

# Exact match
results = retriever.search("sphere", filters={"role": "source_geometric"})

# Match any
results = retriever.search("sphere", filters={
    "class_name": ["vtkSphereSource", "vtkConeSource"]
})

# Range
results = retriever.search("sphere", filters={
    "visibility_score": {"gte": 0.7}
})

# Combined (all must match)
results = retriever.search("sphere", filters={
    "type": "Visualization Pipeline",
    "visibility_score": {"gte": 0.5},
})

FilterBuilder (Full Control)

For exclusions, optional matches, or complex logic:

from vtk_rag.retrieval import FilterBuilder

filters = (
    FilterBuilder()
    .match("role", "source_geometric")           # must match exactly
    .match_any("vtk_class", ["vtkSphereSource", "vtkConeSource"])  # must match one
    .range("visibility_score", gte=0.7)          # must be >= 0.7
    .exclude("chunk_type", "inheritance")        # must NOT match
    .should_match("type", "Visualization Pipeline")  # bonus if matches
    .build()
)

results = retriever.search("sphere", filters=filters)

Available Filter Fields

Code collection (vtk_code):

type, vtk_class, function_name, roles
input_datatype, output_datatype, visibility_score
example_id, variable_name

Doc collection (vtk_docs):

chunk_type, class_name, role, visibility
metadata.module, metadata.input_datatype, metadata.output_datatype

SearchResult

result = results[0]

# Core fields
result.id          # Qdrant point ID
result.score       # Relevance score
result.content     # Chunk text
result.chunk_id    # Original chunk identifier
result.collection  # vtk_code or vtk_docs
result.payload     # Full metadata dict

# Common properties
result.class_name       # VTK class name
result.chunk_type       # Chunk type
result.synopsis         # Brief summary
result.role             # Primary functional role
result.input_datatype   # Input data type
result.output_datatype  # Output data type
result.module           # VTK module path

# Code chunk properties
result.title            # Human-readable title
result.description      # Detailed description
result.example_id       # Source example URL
result.function_name    # Containing function
result.variable_name    # Primary variable
result.roles            # All functional roles (list)
result.visibility_score # User-facing likelihood (0.0-1.0)

# Doc chunk properties
result.action_phrase    # Concise action description

Code Map

vtk_rag/
├── __init__.py
├── __main__.py          # python -m vtk_rag
├── cli.py               # Unified CLI
├── build.py             # Build pipeline
│
├── chunking/
│   ├── __init__.py      # Exports: Chunker, CodeChunker, DocChunker, etc.
│   ├── chunk.py         # CLI entry point
│   ├── chunker.py       # Chunker class
│   ├── code_chunker.py  # Code chunk extraction
│   ├── code_chunk.py    # CodeChunk dataclass
│   ├── doc_chunker.py   # Doc chunk extraction
│   ├── doc_chunk.py     # DocChunk dataclass
│   ├── code_query_generator.py
│   ├── doc_query_generator.py
│   ├── lifecycle_analyzer.py
│   ├── semantic_chunk_builder.py
│   └── vtk_categories.py
│
├── mcp/
│   ├── __init__.py          # Exports: VTKAPIClient, VTK_API_CLIENT, PersistentMCPClient
│   ├── vtk_api_client.py    # VTK class resolution via MCP
│   └── persistent_mcp_client.py
│
├── indexing/
│   ├── __init__.py      # Exports: Indexer, CollectionConfig, FieldConfig
│   ├── index.py         # CLI entry point
│   ├── indexer.py       # Indexer class
│   └── collection_config.py
│
└── retrieval/
    ├── __init__.py      # Exports: Retriever, SearchResult, FilterBuilder
    ├── retriever.py     # Retriever class
    ├── filter_builder.py
    └── search_result.py

tests/
├── conftest.py          # Fixtures
├── test_chunking.py
├── test_indexing.py
├── test_retrieval.py
└── test_cli.py

Prerequisites

Python 3.10+
Qdrant: docker run -d -p 6333:6333 -p 6334:6334 qdrant/qdrant
Raw data files in data/raw/:
- vtk-python-docs.jsonl (~2,900 classes)
- vtk-python-examples.jsonl (~850 examples)
- vtk-python-tests.jsonl (~900 tests)

License

MIT License

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

1.1.0

Jan 2, 2026

1.0.0

Dec 26, 2025

This version

0.2.0

Dec 10, 2025

0.1.0

Dec 5, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

vtk_rag-0.2.0.tar.gz (64.5 kB view details)

Uploaded Dec 10, 2025 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

vtk_rag-0.2.0-py3-none-any.whl (66.2 kB view details)

Uploaded Dec 10, 2025 Python 3

File details

Details for the file vtk_rag-0.2.0.tar.gz.

File metadata

Download URL: vtk_rag-0.2.0.tar.gz
Upload date: Dec 10, 2025
Size: 64.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.11

File hashes

Hashes for vtk_rag-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`ecda654ded33c7786f5a05f794a5b39ad0800f81e2854f6d11d1a1b943f71cd6`
MD5	`861631678a51cd825ae64e304300c404`
BLAKE2b-256	`6e58821ab860af3e641da900ff32b913be6657b684f172b021e165730a27638b`

See more details on using hashes here.

File details

Details for the file vtk_rag-0.2.0-py3-none-any.whl.

File metadata

Download URL: vtk_rag-0.2.0-py3-none-any.whl
Upload date: Dec 10, 2025
Size: 66.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.11

File hashes

Hashes for vtk_rag-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`61f648c12476403bf9c90098baf59c14795a8d344609eee9a6efbbea7459f4c1`
MD5	`3bee83256b555f679799d767a3abfb42`
BLAKE2b-256	`d561f7537cf80d8ed8998c1caedcbc32399e7bc9ef359133448e9f23c0dfe457`

See more details on using hashes here.

vtk-rag 0.2.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

VTK RAG

Quick Start

Installation

Option 1: Using setup.sh (Recommended)

Option 2: Manual with uv

Option 3: Traditional pip

CLI

Search Options

Python API

Testing

Architecture

Data Flow

Collections

Chunking Module

Code Chunks

Chunk Types

Semantic Grouping

Code Chunk Metadata

Doc Chunks

Chunk Types

Property Grouping

Doc Chunk Metadata

Query Generation

Indexing Module

Search Architecture

Dense Vectors (Semantic)

Sparse Vectors (BM25)

Payload Indexes (Filtering)

Hybrid Search

Collection Fields

vtk_code

vtk_docs

Retrieval Module

Search Modes

Semantic Search

BM25 Search

Hybrid Search

Multi-Vector Search

Filtering

Dict Syntax (Simple)

FilterBuilder (Full Control)

Available Filter Fields

SearchResult

Code Map

Prerequisites

License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes