vortexa 🧠

Codebase indexing and semantic search engine

Dense + sparse hybrid retrieval · AST-aware chunking · LMDB persistence · MCP server


Overview

vortexa is a standalone codebase indexing and semantic search engine designed for AI agents and developers. It builds a persistent, hybrid search index over source code using:

  • Dense retrieval via static or learned embeddings (Model2Vec / SentenceTransformers)
  • Sparse retrieval via BM25 keyword scoring
  • AST-aware chunking that respects function and class boundaries via tree-sitter
  • LMDB-backed storage for fast, persistent vector and chunk storage

The result: natural language code search that understands intent, not just keywords.

results = indexer.search("authentication middleware that validates JWT tokens", top_k=5)
# → Finds the right files even if they use "auth", "verify", "token" instead of "authentication"

vortexa can run as a standalone Python library, be embedded into any agent, or serve as an MCP server for LLM tools.


Features

Semantic search            Find code by describing what it does in natural language; no exact-string matching needed.
Hybrid retrieval           Combines dense embeddings (semantic meaning) with BM25 (keyword precision) using adaptive alpha weighting.
AST-aware chunking         Splits source code at function/class/block boundaries using tree-sitter when available; falls back to line-based splitting otherwise.
Incremental indexing       Content-hash memoization means only changed files are re-indexed. A full re-index also avoids redundant embedding computations by reusing cached embeddings for identical content.
Persistent storage         LMDB-backed vector store survives restarts. Embedding cache avoids recomputing identical content.
Live watch mode            Background thread polls for file changes and auto-re-indexes with configurable debounce.
MCP server                 Exposes a single search tool to any MCP-compatible agent (Claude Code, Cursor, etc.).
Zero mandatory heavy deps  Core requires only numpy, lmdb, and pathspec. Model2Vec and tree-sitter are optional extras.
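
The content-hash memoization behind incremental indexing can be illustrated with a short sketch. It assumes SHA-256 over raw file bytes and an in-memory dict as the memo; the names content_hash, files_to_reindex, and memo are hypothetical and are not part of vortexa's API.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Return a hex digest that identifies the exact file content."""
    return hashlib.sha256(data).hexdigest()


def files_to_reindex(files: dict[str, bytes], memo: dict[str, str]) -> list[str]:
    """Return paths whose content hash differs from the memo, updating the memo.

    `files` maps path -> current bytes; `memo` maps path -> last indexed hash.
    Both are hypothetical stand-ins for whatever the real index persists.
    """
    changed = []
    for path, data in sorted(files.items()):
        digest = content_hash(data)
        if memo.get(path) != digest:
            changed.append(path)
            memo[path] = digest
    return changed
```

Calling files_to_reindex twice with the same inputs returns every path the first time and an empty list the second time, which is the behavior that lets unchanged files be skipped.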

Quick Start

Installation

# Core (BM25 + line-based chunking)
pip install vortexa

# Full (Model2Vec embeddings + tree-sitter AST chunking)
pip install "vortexa[full]"

# With MCP server support
pip install "vortexa[full]" fastmcp

Index a codebase

from vortexa.core.indexer import CodebaseIndexer

indexer = CodebaseIndexer(root=".")
stats = indexer.index()

print(f"Indexed {stats.indexed_files} files, {stats.total_chunks} chunks")
print(f"Languages detected: {stats.languages}")

Search with natural language

results = indexer.search("CSV parser implementation", top_k=5)

for r in results:
    print(f"{r.chunk.file_path}:{r.chunk.start_line}  score={r.score:.3f}")
    print(f"  {r.chunk.content[:150].strip()}")
    print()

Output:

src/parsers/csv_parser.py:42  score=0.892
  def parse_csv(filepath: str, delimiter: str = ",") -> list[dict]:
      """Parse a CSV file into a list of dictionaries."""
      with open(filepath, "r") as f:

tests/test_csv_parser.py:15  score=0.756
  def test_parse_csv_with_header():
      result = parse_csv("test.csv")
      assert len(result) == 3

Python API

Indexing

from vortexa.core.indexer import CodebaseIndexer
from vortexa.core.types import ChunkConfig

# Default chunking (aim for 50-line chunks, 5-line overlap)
indexer = CodebaseIndexer(root="/path/to/project")
stats = indexer.index()
# → IndexStats(indexed_files=127, total_chunks=843, languages={"python": 45, "typescript": 32, ...})

# Custom chunk configuration
indexer = CodebaseIndexer(
    root=".",
    chunk_config=ChunkConfig(chunk_size=100, chunk_overlap=10),
)
stats = indexer.index(force=False, include_text_files=True)

# Force full re-index
stats = indexer.index(force=True)
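
To make chunk_size and chunk_overlap concrete, here is a minimal sketch of a line-based splitter with overlapping windows. It is an assumption about how such a fallback could work, not vortexa's chunking code; chunk_lines is a hypothetical name, and the defaults mirror the 50-line / 5-line values mentioned above.

```python
def chunk_lines(
    source: str, chunk_size: int = 50, chunk_overlap: int = 5
) -> list[tuple[int, int, str]]:
    """Split source into (start_line, end_line, text) windows, 1-indexed inclusive."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    lines = source.splitlines()
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(lines), step):
        window = lines[start:start + chunk_size]
        chunks.append((start + 1, start + len(window), "\n".join(window)))
        if start + chunk_size >= len(lines):
            break  # this window already reaches the end of the file
    return chunks
```

With the defaults, a 120-line file yields windows covering lines 1-50, 46-95, and 91-120, so each chunk shares five lines with its predecessor.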

Searching

# Hybrid search (auto-weighted semantic + BM25)
results = indexer.search("error handling", top_k=10)

# Pure semantic search
results = indexer.search("database connection pool", top_k=5, alpha=1.0)

# Pure BM25 keyword search
results = indexer.search("parse csv", top_k=5, alpha=0.0)

# Symbol lookup (find definitions by name)
results = indexer.find_symbol("ConnectionPool", top_k=5)

# Related chunks (find chunks similar to a given chunk index)
results = indexer.find_related(chunk_idx=3, top_k=5)
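
The alpha parameter above moves between pure BM25 (0.0) and pure semantic search (1.0). One common way to blend the two is a weighted sum of min-max normalized scores, sketched below. This is an illustrative assumption: the README does not specify vortexa's fusion formula or how its adaptive alpha is chosen, and min_max and fuse are hypothetical helpers.

```python
def min_max(scores: dict[int, float]) -> dict[int, float]:
    """Rescale scores to [0, 1]; a constant score set maps to all 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 1.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}


def fuse(
    dense: dict[int, float], sparse: dict[int, float], alpha: float, top_k: int = 5
) -> list[tuple[int, float]]:
    """Blend normalized scores as alpha * dense + (1 - alpha) * sparse."""
    d, s = min_max(dense), min_max(sparse)
    combined = {
        idx: alpha * d.get(idx, 0.0) + (1.0 - alpha) * s.get(idx, 0.0)
        for idx in set(d) | set(s)
    }
    # Highest score first; ties broken by chunk index for determinism.
    ranked = sorted(combined.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_k]
```

Normalizing first matters because raw BM25 scores are unbounded while cosine similarities are not, so an unnormalized sum would let one signal dominate.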

Each result is a SearchResult with:

Field             Type     Description
chunk.file_path   str      Relative file path
chunk.start_line  int      Start line number
chunk.end_line    int      End line number
chunk.content     str      Code snippet (up to 500 chars)
chunk.language    str      Detected programming language
chunk.lineage     Lineage  Source path + byte offsets
chunk.chunk_hash  str      Content hash for memoization
score             float    Relevance score (0–1)
source            str      "semantic", "bm25", or "hybrid"

Watch Mode

from vortexa.interfaces.watcher import IndexWatcher

watcher = IndexWatcher(indexer, poll_interval=3.0)
watcher.start()   # Background thread, polls every 3s, debounces 2s
# ... files change on disk, auto-re-index happens ...
watcher.stop()
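
The poll-and-debounce behavior can be sketched with three small helpers. This is a rough model under stated assumptions (modification times as the change signal, a 2-second quiet period); snapshot, changed_paths, and should_reindex are hypothetical names, not IndexWatcher internals.

```python
import os
from typing import Optional


def snapshot(root: str) -> dict[str, float]:
    """Map every file under root to its last-modified time."""
    state = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                state[path] = os.stat(path).st_mtime
            except OSError:
                continue  # file vanished between listing and stat
    return state


def changed_paths(old: dict[str, float], new: dict[str, float]) -> set[str]:
    """Paths added, removed, or modified between two snapshots."""
    return {p for p in old.keys() | new.keys() if old.get(p) != new.get(p)}


def should_reindex(last_change: Optional[float], now: float, debounce: float = 2.0) -> bool:
    """True once a change was seen and things have been quiet for `debounce` seconds."""
    return last_change is not None and now - last_change >= debounce
```

A polling loop would take a snapshot every poll interval, record the time whenever changed_paths is non-empty, and trigger a re-index only when should_reindex returns True, so a burst of saves results in a single re-index.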

Management

# Index statistics
stats = indexer.stats()
# → {indexed_files: 127, total_chunks: 843, languages: {...}, memo_hits: 42, memo_misses: 15}

# Reset
indexer.clear()   # Delete the persistent index

MCP Server

vortexa ships with a built-in MCP (Model Context Protocol) server that exposes codebase search as a single search tool. Start it with:

# Auto-indexes current directory, serves on stdio
python -m vortexa.interfaces.mcp_server

# Or via the installed entry point
vortexa-mcp

On startup it indexes the current working directory and prints stats to stderr:

[vortexa] Indexing C:\projects\my-app ...
[vortexa] Ready: 127 files, 843 chunks
[vortexa] Auto-reindex watcher started (polling every 3s)

The server exposes one tool:

Tool    Description                         Arguments
search  Semantic + BM25 hybrid code search  query (str), top_k (int, default 10)
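
For orientation, MCP clients invoke tools with JSON-RPC 2.0 tools/call requests. The sketch below builds such a request for the search tool using the argument names from the table above. The helper build_search_call is hypothetical, and in practice an MCP client library also handles the initialization handshake and stdio framing, which are omitted here.

```python
import json


def build_search_call(query: str, top_k: int = 10, request_id: int = 1) -> str:
    """Serialize a JSON-RPC 2.0 `tools/call` request for the `search` tool."""
    request = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": "search",
            "arguments": {"query": query, "top_k": top_k},
        },
    }
    return json.dumps(request)
```

Agents such as Claude Code or Cursor generate messages of this shape on their own; the sketch only shows what travels over the transport.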

Usage with Claude Code / Cursor

Add to your MCP configuration file (~/.cursor/mcp.json or Claude Code's mcp_servers config):

{
  "mcpServers": {
    "vortexa": {
      "command": "python",
      "args": ["-m", "vortexa.interfaces.mcp_server"],
      "cwd": "/path/to/your/project"
    }
  }
}

The agent now has access to semantic code search: it can find functions, classes, and patterns by describing them in natural language. For exploratory queries where the exact identifier is not known, this can be more effective than grep or rg.


Architecture

Directory Layout

vortexa/
├── core/
│   ├── indexer.py       # CodebaseIndexer: main orchestrator
│   ├── chunking.py      # AST-aware (tree-sitter) + line-based chunking
│   ├── embedding.py     # Embedding models (Model2Vec, SentenceTransformers)
│   ├── language.py      # Language detection & file extension mapping
│   └── types.py         # Shared types (Chunk, ChunkConfig, IndexStats, SearchResult, ...)
├── storage/
│   ├── vector_store.py  # LMDB-backed persistent vector store
│   ├── bm25.py          # BM25 keyword index with persistent storage
│   └── walker.py        # File system walker with .gitignore support
├── search/
│   ├── search.py        # Hybrid search orchestrator (dense + sparse)
│   ├── ranking.py       # Result ranking & symbol query detection
│   └── tokens.py        # Identifier tokenization (camelCase, snake_case)
└── interfaces/
    ├── mcp_server.py    # MCP server (stdio transport)
    └── watcher.py       # Live file poller with debounced auto-reindex
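
The identifier tokenization that tokens.py is described as providing can be approximated with one regular expression. The sketch below is an assumption about the general technique, not the module's actual code; split_identifier is a hypothetical name.

```python
import re

# Matches, in order of preference: an acronym followed by a capitalized word,
# an optionally capitalized lowercase word, a trailing acronym, or a digit run.
_WORD = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+")


def split_identifier(name: str) -> list[str]:
    """Split camelCase, PascalCase, and snake_case identifiers into lowercase tokens."""
    return [m.group(0).lower() for m in _WORD.finditer(name)]
```

Splitting identifiers this way lets a keyword query such as "connection pool" match a symbol named ConnectionPool or connection_pool in the BM25 index.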

Data Flow

sequenceDiagram
    participant User as User Code
    participant Indexer as CodebaseIndexer
    participant Walker as File Walker
    participant Chunker as Chunking Engine
    participant Embedder as Embedding Model
    participant Store as LMDB Vector Store
    participant BM25 as BM25 Index
    participant Search as Search Engine

    User->>Indexer: index()
    Indexer->>Walker: walk_files(root, extensions)
    Walker-->>Indexer: file_paths
    loop Each file
        Indexer->>Chunker: chunk_source(source, language)
        Chunker-->>Indexer: list[Chunk]
        Indexer->>Embedder: embed(chunks)
        Embedder-->>Indexer: vectors
        Indexer->>Store: store(vectors, chunks)
        Indexer->>BM25: index(chunks)
    end
    Indexer-->>User: IndexStats

    User->>Search: search(query)
    Search->>Store: query(vector)
    Search->>BM25: query(tokens)
    Search->>Search: hybrid_fusion(results)
    Search-->>User: list[SearchResult]
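
The dense half of the query path (Search to Store in the diagram) amounts to ranking stored vectors by similarity to the query vector. Below is a brute-force sketch using cosine similarity in pure Python. It assumes nothing about how the LMDB store is actually queried, and the function names are hypothetical; a real implementation would more likely use numpy.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; defined as 0.0 when either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query: list[float], vectors: list[list[float]], k: int = 5
) -> list[tuple[int, float]]:
    """Return (chunk_index, similarity) pairs for the k most similar vectors."""
    scored = [(i, cosine(query, v)) for i, v in enumerate(vectors)]
    # Highest similarity first; ties broken by chunk index.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:k]
```

The indices returned here play the role of chunk indices, which is the kind of handle that find_related(chunk_idx=...) takes in the API shown earlier.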

Indexing Pipeline

graph LR
    A[Source Files] --> B[File Walker<br/>.gitignore aware]
    B --> C[Language Detector]
    C --> D{AST Available?}
    D -->|Yes| E[Tree-sitter Parser<br/>Function/class boundaries]
    D -->|No| F[Line-based Splitter<br/>Configurable size/overlap]
    E --> G[Chunk Set]
    F --> G
    G --> H[Embedding Model<br/>Model2Vec / SentenceTransformer]
    G --> I[BM25 Tokenizer]
    H --> J[(LMDB Vector Store)]
    I --> K[(BM25 Index)]
    J --> L[Content Hash Memo]
    K --> L
    L --> M[Skip unchanged files]
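
The BM25 branch of the pipeline scores tokenized chunks against query tokens. The following sketch implements the standard Okapi BM25 formula with the commonly used non-negative IDF variant and default parameters k1=1.5 and b=0.75. Those parameter values and the function name bm25_scores are assumptions for illustration, since the README does not state what vortexa uses.

```python
import math
from collections import Counter


def bm25_scores(
    query: list[str], docs: list[list[str]], k1: float = 1.5, b: float = 0.75
) -> list[float]:
    """Score each tokenized document against the query with Okapi BM25."""
    n_docs = len(docs)
    if n_docs == 0:
        return []
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue  # term absent from this document
            idf = math.log(1.0 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: longer-than-average documents are penalized.
            norm = tf[term] + k1 * (1.0 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1.0) / norm
        scores.append(score)
    return scores
```

Rare terms contribute more through the IDF factor, which is why a chunk containing both "parse" and "csv" outranks one that merely repeats "csv".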

Module Dependencies

graph TD
    subgraph "Public API"
        Indexer["core.indexer<br/>CodebaseIndexer"]
        Search["search.search<br/>search_hybrid()"]
    end

    subgraph "Core"
        Chunking["core.chunking<br/>chunk_source()"]
        Embedding["core.embedding<br/>Embedder"]
        Language["core.language<br/>detect_language()"]
        Types["core.types<br/>Chunk, ChunkConfig, ..."]
    end

    subgraph "Storage"
        VectorStore["storage.vector_store<br/>LMDB Vector Store"]
        BM25["storage.bm25<br/>BM25 Index"]
        Walker["storage.walker<br/>walk_files()"]
    end

    subgraph "Interfaces"
        MCP["interfaces.mcp_server<br/>FastMCP server"]
        Watcher["interfaces.watcher<br/>IndexWatcher"]
    end

    Indexer --> Chunking
    Indexer --> Embedding
    Indexer --> Language
    Indexer --> Types
    Indexer --> VectorStore
    Indexer --> BM25
    Indexer --> Walker
    Indexer --> Search

    Search --> Embedding
    Search --> VectorStore
    Search --> BM25
    Search --> Types

    MCP --> Indexer
    MCP --> Watcher
    Watcher --> Walker

Dependencies

Package                    Required             Used For
numpy                      Yes                  Vector operations, embedding inference
lmdb                       Yes                  Persistent vector and chunk metadata storage
pathspec                   Yes                  .gitignore pattern matching in file walker
huggingface-hub            Yes (default model)  Loading VTXAI/Vortex-Embed-4.7M
tokenizers                 Yes (default model)  HF tokenizer for embedding model
safetensors                Yes (default model)  Safe tensor loading for 4-bit weights
sentence-transformers      Optional             Transformer-based dense embeddings
model2vec                  Optional             Alternative static embeddings
tree-sitter-language-pack  Optional             AST-aware code chunking
fastmcp                    Optional             MCP server for LLM tool integration

Install optional groups:

pip install "vortexa[full]"      # model2vec + sentence-transformers + tree-sitter
pip install "vortexa[full,mcp]"  # everything including MCP server

License

Copyright 2025 VortexAI

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
