vortexa 🧠
Codebase indexing and semantic search engine
Dense + sparse hybrid retrieval · AST-aware chunking · LMDB persistence · MCP server
Overview
vortexa is a standalone codebase indexing and semantic search engine designed for AI agents and developers. It builds a persistent, hybrid search index over source code using:
- Dense retrieval via static or learned embeddings (Model2Vec / SentenceTransformers)
- Sparse retrieval via BM25 keyword scoring
- AST-aware chunking that respects function and class boundaries via tree-sitter
- LMDB-backed storage for fast, persistent vector and chunk storage
The result: natural language code search that understands intent, not just keywords.
results = indexer.search("authentication middleware that validates JWT tokens", top_k=5)
# → Finds the right files even if they use "auth", "verify", "token" instead of "authentication"
vortexa can run as a standalone Python library, be embedded into any agent, or serve as an MCP server for LLM tools.
Features
| Feature | Description |
|---|---|
| Semantic search | Find code by describing what it does in natural language; no exact-string matching needed. |
| Hybrid retrieval | Combines dense embeddings (semantic meaning) with BM25 (keyword precision) using adaptive alpha weighting. |
| AST-aware chunking | Splits source code at function/class/block boundaries using tree-sitter when available, and falls back to line-based splitting otherwise. |
| Incremental indexing | Content-hash memoization means only changed files are re-indexed. Even a full re-index reuses cached embeddings for unchanged content. |
| Persistent storage | LMDB-backed vector store survives restarts. Embedding cache avoids recomputing identical content. |
| Live watch mode | Background thread polls for file changes and auto-re-indexes with configurable debounce. |
| MCP server | Expose as a single search tool for any MCP-compatible agent (Claude Code, Cursor, etc.). |
| Zero mandatory heavy deps | Core requires only numpy, lmdb, and pathspec. Model2Vec and tree-sitter are optional extras. |
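To make the keyword half of hybrid retrieval concrete, here is a minimal Okapi BM25 scorer in plain Python. It is a sketch only: the function name `bm25_scores`, the parameters `k1=1.5` and `b=0.75`, and the whitespace tokenization are assumptions for illustration, not vortexa's actual implementation.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    n_docs = len(docs_tokens)
    if n_docs == 0:
        return []
    avg_len = sum(len(doc) for doc in docs_tokens) / n_docs

    # Document frequency: how many documents contain each term.
    df = Counter()
    for doc in docs_tokens:
        df.update(set(doc))

    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates and is normalized by document length.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical, pre-tokenized chunks.
docs = [
    "def parse csv filepath delimiter".split(),
    "def test parse csv with header".split(),
    "class connection pool".split(),
]
scores = bm25_scores(["parse", "csv"], docs)
```

Chunks that contain neither query term score zero, and of two chunks with the same term counts the shorter one ranks higher because of length normalization.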
Quick Start
Installation
# Core (BM25 + line-based chunking)
pip install vortexa
# Full (Model2Vec embeddings + tree-sitter AST chunking)
pip install "vortexa[full]"
# With MCP server support
pip install "vortexa[full]" fastmcp
Index a codebase
from vortexa.core.indexer import CodebaseIndexer
indexer = CodebaseIndexer(root=".")
stats = indexer.index()
print(f"Indexed {stats.indexed_files} files, {stats.total_chunks} chunks")
print(f"Languages detected: {stats.languages}")
Search with natural language
results = indexer.search("CSV parser implementation", top_k=5)
for r in results:
    print(f"{r.chunk.file_path}:{r.chunk.start_line} score={r.score:.3f}")
    print(f" {r.chunk.content[:150].strip()}")
    print()
Output:
src/parsers/csv_parser.py:42 score=0.892
 def parse_csv(filepath: str, delimiter: str = ",") -> list[dict]:
    """Parse a CSV file into a list of dictionaries."""
    with open(filepath, "r") as f:

tests/test_csv_parser.py:15 score=0.756
 def test_parse_csv_with_header():
    result = parse_csv("test.csv")
    assert len(result) == 3
Python API
Indexing
from vortexa.core.indexer import CodebaseIndexer
from vortexa.core.types import ChunkConfig
# Default chunking (aim for 50-line chunks, 5-line overlap)
indexer = CodebaseIndexer(root="/path/to/project")
stats = indexer.index()
# → IndexStats(indexed_files=127, total_chunks=843, languages={"python": 45, "typescript": 32, ...})
# Custom chunk configuration
indexer = CodebaseIndexer(
    root=".",
    chunk_config=ChunkConfig(chunk_size=100, chunk_overlap=10),
)
stats = indexer.index(force=False, include_text_files=True)
# Force full re-index
stats = indexer.index(force=True)
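The interaction between chunk size and overlap can be seen in a small line-based splitter. This is a sketch under stated assumptions: `chunk_lines` is a hypothetical helper, and it assumes that size and overlap are both counted in lines and that each new window starts `chunk_size - chunk_overlap` lines after the previous one. The library's own splitter may behave differently.

```python
def chunk_lines(lines, chunk_size=50, chunk_overlap=5):
    """Split lines into overlapping windows.

    Returns (start_line, end_line, text) tuples with 1-based, inclusive
    line numbers.
    """
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")

    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(lines):
        end = min(start + chunk_size, len(lines))
        chunks.append((start + 1, end, "\n".join(lines[start:end])))
        if end == len(lines):
            break  # the last window reached the end of the file
        start += step
    return chunks


# A synthetic 120-line file.
source_lines = [f"line {i}" for i in range(1, 121)]
chunks = chunk_lines(source_lines)
spans = [(start, end) for start, end, _ in chunks]
# spans == [(1, 50), (46, 95), (91, 120)]
```

Each window repeats the last five lines of the previous one, so a statement that straddles a boundary still appears whole in at least one chunk.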
Searching
# Hybrid search (auto-weighted semantic + BM25)
results = indexer.search("error handling", top_k=10)
# Pure semantic search
results = indexer.search("database connection pool", top_k=5, alpha=1.0)
# Pure BM25 keyword search
results = indexer.search("parse csv", top_k=5, alpha=0.0)
# Symbol lookup (find definitions by name)
results = indexer.find_symbol("ConnectionPool", top_k=5)
# Related chunks (find chunks similar to a given chunk index)
results = indexer.find_related(chunk_idx=3, top_k=5)
Each result is a SearchResult with:
| Field | Type | Description |
|---|---|---|
| chunk.file_path | str | Relative file path |
| chunk.start_line | int | Start line number |
| chunk.end_line | int | End line number |
| chunk.content | str | Code snippet (up to 500 chars) |
| chunk.language | str | Detected programming language |
| chunk.lineage | Lineage | Source path + byte offsets |
| chunk.chunk_hash | str | Content hash for memoization |
| score | float | Relevance score (0–1) |
| source | str | "semantic", "bm25", or "hybrid" |
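The `alpha` argument in the search calls above selects between pure semantic (`alpha=1.0`) and pure BM25 (`alpha=0.0`) ranking. One common way to blend the two, shown here as a sketch, is a weighted sum of min-max normalized scores. The helpers `normalize` and `fuse` are hypothetical, and the fixed `alpha` is an assumption: the source describes the default weighting as adaptive without saying how alpha is chosen.

```python
def normalize(scores):
    """Min-max normalize a {chunk_id: score} mapping into the range 0 to 1."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {key: 1.0 for key in scores}
    return {key: (value - lo) / (hi - lo) for key, value in scores.items()}


def fuse(dense, sparse, alpha=0.5):
    """Blend dense and sparse scores: alpha * dense + (1 - alpha) * sparse."""
    d, s = normalize(dense), normalize(sparse)
    fused = {
        key: alpha * d.get(key, 0.0) + (1 - alpha) * s.get(key, 0.0)
        for key in set(d) | set(s)
    }
    # Highest fused score first.
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Hypothetical raw scores: cosine similarities and BM25 values.
dense = {"a": 0.9, "b": 0.5, "c": 0.1}
sparse = {"b": 12.0, "c": 3.0, "d": 6.0}

ranking = [chunk_id for chunk_id, _ in fuse(dense, sparse, alpha=0.5)]
# ranking == ["b", "a", "d", "c"]
```

Normalizing first matters because raw BM25 scores are unbounded while cosine similarities are not, so an unnormalized sum would let the keyword side dominate.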
Watch Mode
from vortexa.interfaces.watcher import IndexWatcher
watcher = IndexWatcher(indexer, poll_interval=3.0)
watcher.start() # Background thread, polls every 3s, debounces 2s
# ... files change on disk, auto-re-index happens ...
watcher.stop()
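The polling and debounce behaviour can be sketched with the standard library alone. The names `snapshot`, `changed_paths`, and `Debouncer` are hypothetical, and comparing modification times is an assumption about how changes are detected; the real watcher may use content hashes or other signals.

```python
import os
import time


def snapshot(root):
    """Map every file path under root to its modification time in nanoseconds."""
    mtimes = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mtimes[path] = os.stat(path).st_mtime_ns
            except OSError:
                pass  # file disappeared between listing and stat
    return mtimes


def changed_paths(old, new):
    """Paths that were added, removed, or modified between two snapshots."""
    return {path for path in old.keys() | new.keys() if old.get(path) != new.get(path)}


class Debouncer:
    """Report ready only once `delay` seconds pass without a new event."""

    def __init__(self, delay=2.0, clock=time.monotonic):
        self.delay = delay
        self.clock = clock
        self._last_event = None

    def notify(self):
        """Record that a change was just observed."""
        self._last_event = self.clock()

    def ready(self):
        """Return True once per quiet period, then reset."""
        if self._last_event is None:
            return False
        if self.clock() - self._last_event >= self.delay:
            self._last_event = None
            return True
        return False
```

A polling loop would take a snapshot every `poll_interval` seconds, call `notify()` whenever `changed_paths` is non-empty, and trigger a re-index when `ready()` returns True. The debounce keeps a burst of saves from causing a re-index per file.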
Management
# Index statistics
stats = indexer.stats()
# → {indexed_files: 127, total_chunks: 843, languages: {...}, memo_hits: 42, memo_misses: 15}
# Reset
indexer.clear() # Delete the persistent index
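The `memo_hits` and `memo_misses` counters in the stats above suggest a cache keyed by content hash. The following sketch shows that idea in isolation. `EmbeddingMemo` and `toy_embed` are hypothetical names, SHA-256 is an assumed hash function, and the in-memory dict stands in for whatever persistent store the library actually uses.

```python
import hashlib


class EmbeddingMemo:
    """Cache embeddings by the SHA-256 hash of the chunk content."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn
        self._cache = {}
        self.hits = 0
        self.misses = 0

    def get(self, content):
        key = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if key in self._cache:
            self.hits += 1
        else:
            # Only unseen content pays the embedding cost.
            self.misses += 1
            self._cache[key] = self._embed_fn(content)
        return self._cache[key]


def toy_embed(text):
    """Stand-in for a real embedding model (assumption for illustration)."""
    return [float(len(text)), float(text.count("\n"))]


memo = EmbeddingMemo(toy_embed)
memo.get("def f():\n    return 1\n")
memo.get("def g():\n    return 2\n")
memo.get("def f():\n    return 1\n")  # identical content, served from cache
# memo.hits == 1, memo.misses == 2
```

Because the key depends only on content, a file that is moved or touched without being edited does not trigger a new embedding.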
MCP Server
vortexa ships with a built-in MCP (Model Context Protocol) server that exposes codebase search as a single search tool. Start it with:
# Auto-indexes current directory, serves on stdio
python -m vortexa.interfaces.mcp_server
# Or via the installed entry point
vortexa-mcp
On startup it indexes the current working directory and prints stats to stderr:
[vortexa] Indexing C:\projects\my-app ...
[vortexa] Ready: 127 files, 843 chunks
[vortexa] Auto-reindex watcher started (polling every 3s)
The server exposes one tool:
| Tool | Description | Arguments |
|---|---|---|
| search | Semantic + BM25 hybrid code search | query (str), top_k (int, default 10) |
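For readers curious what a call to this tool looks like on the wire, the sketch below builds the JSON-RPC 2.0 `tools/call` request that an MCP client sends. The helper `build_search_call` is hypothetical, and in practice an MCP client library performs the initialization handshake and sends this message for you.

```python
import json


def build_search_call(query, top_k=10, request_id=1):
    """Build an MCP tools/call request for the search tool."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": "search",
            "arguments": {"query": query, "top_k": top_k},
        },
    }


# On the stdio transport each message is a single line of JSON.
message = json.dumps(build_search_call("JWT validation middleware", top_k=5))
```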
Usage with Claude Code / Cursor
Add to your MCP configuration file (~/.cursor/mcp.json or Claude Code's mcp_servers config):
{
"mcpServers": {
"vortexa": {
"command": "python",
"args": ["-m", "vortexa.interfaces.mcp_server"],
"cwd": "/path/to/your/project"
}
}
}
The agent will now have access to semantic code search: it can find functions, classes, and patterns by describing them in natural language. For exploratory queries, where the exact identifier is not known in advance, this tends to work better than grep or rg.
Architecture
Directory Layout
vortexa/
├── core/
│   ├── indexer.py       # CodebaseIndexer, the main orchestrator
│   ├── chunking.py      # AST-aware (tree-sitter) + line-based chunking
│   ├── embedding.py     # Embedding models (Model2Vec, SentenceTransformers)
│   ├── language.py      # Language detection & file extension mapping
│   └── types.py         # Shared types (Chunk, ChunkConfig, IndexStats, SearchResult, ...)
├── storage/
│   ├── vector_store.py  # LMDB-backed persistent vector store
│   ├── bm25.py          # BM25 keyword index with persistent storage
│   └── walker.py        # File system walker with .gitignore support
├── search/
│   ├── search.py        # Hybrid search orchestrator (dense + sparse)
│   ├── ranking.py       # Result ranking & symbol query detection
│   └── tokens.py        # Identifier tokenization (camelCase, snake_case)
└── interfaces/
    ├── mcp_server.py    # MCP server (stdio transport)
    └── watcher.py       # Live file poller with debounced auto-reindex
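Identifier tokenization, listed above for `tokens.py`, is what lets a keyword query like "connection pool" match a symbol named `ConnectionPool`. The regex-based splitter below is a sketch of the general technique; `split_identifier` and its pattern are assumptions, not the module's actual code.

```python
import re

# Order matters: acronym before a capitalized word, then ordinary words,
# then trailing acronyms, then digit runs.
_WORD = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+")


def split_identifier(name):
    """Split a camelCase, PascalCase, or snake_case identifier into lowercase words."""
    return [match.group(0).lower() for match in _WORD.finditer(name)]


print(split_identifier("ConnectionPool"))   # ['connection', 'pool']
print(split_identifier("parse_csv"))        # ['parse', 'csv']
print(split_identifier("HTTPServerError"))  # ['http', 'server', 'error']
```

Underscores and other separators are simply skipped because no alternative in the pattern matches them.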
Data Flow
sequenceDiagram
participant User as User Code
participant Indexer as CodebaseIndexer
participant Walker as File Walker
participant Chunker as Chunking Engine
participant Embedder as Embedding Model
participant Store as LMDB Vector Store
participant BM25 as BM25 Index
participant Search as Search Engine
User->>Indexer: index()
Indexer->>Walker: walk_files(root, extensions)
Walker-->>Indexer: file_paths
loop Each file
Indexer->>Chunker: chunk_source(source, language)
Chunker-->>Indexer: list[Chunk]
Indexer->>Embedder: embed(chunks)
Embedder-->>Indexer: vectors
Indexer->>Store: store(vectors, chunks)
Indexer->>BM25: index(chunks)
end
Indexer-->>User: IndexStats
User->>Search: search(query)
Search->>Store: query(vector)
Search->>BM25: query(tokens)
Search->>Search: hybrid_fusion(results)
Search-->>User: list[SearchResult]
Indexing Pipeline
graph LR
A[Source Files] --> B[File Walker<br/>.gitignore aware]
B --> C[Language Detector]
C --> D{AST Available?}
D -->|Yes| E[Tree-sitter Parser<br/>Function/class boundaries]
D -->|No| F[Line-based Splitter<br/>Configurable size/overlap]
E --> G[Chunk Set]
F --> G
G --> H[Embedding Model<br/>Model2Vec / SentenceTransformer]
G --> I[BM25 Tokenizer]
H --> J[(LMDB Vector Store)]
I --> K[(BM25 Index)]
J --> L[Content Hash Memo]
K --> L
L --> M[Skip unchanged files]
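The AST branch of this pipeline can be illustrated without tree-sitter. The sketch below swaps in Python's built-in `ast` module, so it only handles Python source, and `chunk_python_source` is a hypothetical helper. It returns one span per top-level function or class and an empty list when parsing fails, which is the point where a real pipeline would fall back to line-based splitting.

```python
import ast


def chunk_python_source(source):
    """Return (name, start_line, end_line) for each top-level function or class.

    An empty list signals that the caller should fall back to line-based
    splitting, for example when the source does not parse.
    """
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return []

    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # lineno and end_lineno are 1-based and inclusive.
            chunks.append((node.name, node.lineno, node.end_lineno))
    return chunks


SAMPLE = '''import csv

def parse_csv(path):
    with open(path) as f:
        return list(csv.DictReader(f))

class Parser:
    def run(self):
        return 1
'''

boundaries = chunk_python_source(SAMPLE)
# boundaries == [("parse_csv", 3, 5), ("Parser", 7, 9)]
```

Cutting at these boundaries keeps a function body in a single chunk instead of splitting it at an arbitrary line count.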
Module Dependencies
graph TD
subgraph "Public API"
Indexer["core.indexer<br/>CodebaseIndexer"]
Search["search.search<br/>search_hybrid()"]
end
subgraph "Core"
Chunking["core.chunking<br/>chunk_source()"]
Embedding["core.embedding<br/>Embedder"]
Language["core.language<br/>detect_language()"]
Types["core.types<br/>Chunk, ChunkConfig, ..."]
end
subgraph "Storage"
VectorStore["storage.vector_store<br/>LMDB Vector Store"]
BM25["storage.bm25<br/>BM25 Index"]
Walker["storage.walker<br/>walk_files()"]
end
subgraph "Interfaces"
MCP["interfaces.mcp_server<br/>FastMCP server"]
Watcher["interfaces.watcher<br/>IndexWatcher"]
end
Indexer --> Chunking
Indexer --> Embedding
Indexer --> Language
Indexer --> Types
Indexer --> VectorStore
Indexer --> BM25
Indexer --> Walker
Indexer --> Search
Search --> Embedding
Search --> VectorStore
Search --> BM25
Search --> Types
MCP --> Indexer
MCP --> Watcher
Watcher --> Walker
Dependencies
| Package | Required | Used For |
|---|---|---|
| numpy | Yes | Vector operations, embedding inference |
| lmdb | Yes | Persistent vector and chunk metadata storage |
| pathspec | Yes | .gitignore pattern matching in file walker |
| huggingface-hub | Yes (default model) | Loading VTXAI/Vortex-Embed-4.7M |
| tokenizers | Yes (default model) | HF tokenizer for embedding model |
| safetensors | Yes (default model) | Safe tensor loading for 4-bit weights |
| sentence-transformers | Optional | Transformer-based dense embeddings |
| model2vec | Optional | Alternative static embeddings |
| tree-sitter-language-pack | Optional | AST-aware code chunking |
| fastmcp | Optional | MCP server for LLM tool integration |
Install optional groups:
pip install "vortexa[full]" # model2vec + sentence-transformers + tree-sitter
pip install "vortexa[full, mcp]" # everything including MCP server
License
Copyright 2025 VortexAI
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.