CLI tool for vectorizing codebases and serving them via MCP

Project Vectorizer

A powerful CLI tool that vectorizes codebases, stores them in a vector database, tracks changes, and serves them via MCP (Model Context Protocol) for AI agents like Claude, Codex, and others.

Latest Version: 0.1.4 | Changelog | GitHub


Features

🚀 Performance & Optimization

  • Auto-Optimized Config - Auto-detect CPU cores and RAM for optimal settings (--optimize)
  • Max Resources Mode - Use maximum system resources for fastest indexing (--max-resources)
  • Smart Incremental - 60-70% faster indexing with intelligent change categorization
  • Git-Aware Indexing - 80-90% faster by indexing only git-changed files
  • Parallel Processing - Multi-threaded with auto-detected optimal worker count (up to 16 workers)
  • Memory Monitoring - Real-time memory tracking with automatic garbage collection
  • Batch Optimization - Memory-based batch size calculation for safe processing

๐Ÿ” Search & Indexing

  • Code Vectorization - Parse and vectorize with sentence-transformers or OpenAI embeddings
  • Multi-Level Chunking - Functions, classes, micro-chunks, and word-level chunks for precision
  • Enhanced Single-Word Search - High-precision search for single keywords (0.8+ thresholds)
  • Semantic + Exact Search - Combines semantic similarity with exact word matching
  • Adaptive Thresholds - Automatically adjusts for optimal results
  • Multiple Languages - 30+ languages (Python, JS, TS, Go, Rust, Java, C++, C, PHP, Ruby, Swift, Kotlin, and more)

🔄 Change Management

  • Git Integration - Track changes via git commits with index-git command
  • Smart File Categorization - Detects New, Modified, and Deleted files
  • Watch Mode - Real-time monitoring with configurable debouncing (0.5-10s)
  • Incremental Updates - Only re-index changed content
  • Hash-Based Detection - SHA256 file hashing for accurate change detection
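
The hash-based detection and New/Modified/Deleted categorization listed above can be pictured roughly as follows. This is a minimal sketch, not the tool's actual implementation: the helper names `sha256_of` and `categorize_changes`, and the idea of comparing two path-to-hash snapshots, are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, block_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a file, reading it in blocks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def categorize_changes(previous: dict, current: dict) -> dict:
    """Compare two {path: hash} snapshots (hypothetical storage format)."""
    return {
        "new": sorted(p for p in current if p not in previous),
        "modified": sorted(
            p for p in current if p in previous and current[p] != previous[p]
        ),
        "deleted": sorted(p for p in previous if p not in current),
    }
```

Only the paths under "new" and "modified" would need re-embedding; paths under "deleted" would only need their chunks removed from the index.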

๐ŸŒ AI Integration

  • MCP Server - Model Context Protocol for AI agents (Claude, Codex, etc.)
  • HTTP Fallback API - RESTful endpoints when MCP unavailable
  • Semantic Search - Natural language queries for code discovery
  • File Operations - Get content, list files, project statistics

🎨 User Experience

  • Clean Progress Output - Single unified progress bar with timing information
  • Suppressed Library Logs - No cluttered batch progress bars from dependencies
  • Timing Information - Elapsed time for all operations (seconds or minutes+seconds)
  • Verbose Mode - Optional detailed logging for debugging
  • Professional UI - Rich terminal output with colors, panels, and formatting
  • Real-time Updates - Live file names and status tags during indexing

💾 Database & Storage

  • ChromaDB Backend - High-performance vector database
  • Fast HNSW Indexing - Optimized similarity search algorithm
  • Scalable - Handles 500K+ chunks efficiently
  • Single Database - No external dependencies required
  • Custom Paths - Configurable database location

Installation

From PyPI (Recommended)

# Install from PyPI
pip install project-vectorizer

# Verify installation
pv --version

From Source

# Clone repository
git clone https://github.com/starkbaknet/project-vectorizer.git
cd project-vectorizer

# Install
pip install -e .

# Or with development dependencies
pip install -e ".[dev]"

Quick Start

1. Initialize Your Project

# 🚀 Recommended: Auto-optimize based on your system (16 workers, 400 batch on 8-core/16GB RAM)
pv init /path/to/project --optimize

# Or with custom settings
pv init /path/to/project \
  --name "My Project" \
  --embedding-model "all-MiniLM-L6-v2" \
  --chunk-size 256 \
  --optimize

Output:

✓ Project initialized successfully!

Name: My Project
Path: /path/to/project
Model: all-MiniLM-L6-v2
Provider: sentence-transformers
Chunk Size: 256 tokens

Optimized Settings:
  • Workers: 16
  • Batch Size: 400
  • Embedding Batch: 200
  • Memory Monitoring: Enabled
  • GC Interval: 100 files

2. Index Your Codebase

# 🚀 Recommended: First-time indexing with max resources (2-4x faster)
pv index /path/to/project --max-resources

# 🚀 Recommended: Smart incremental for updates (60-70% faster)
pv index /path/to/project --smart

# 🚀 Recommended: Git-aware for recent changes (80-90% faster)
pv index-git /path/to/project --since HEAD~5

# Standard full indexing
pv index /path/to/project

# Force re-index everything
pv index /path/to/project --force

# Combine for maximum performance
pv index /path/to/project --smart --max-resources

Output:

Using maximum system resources (optimized settings)...
  • Workers: 16
  • Batch Size: 400
  • Embedding Batch: 200

  Indexing examples/demo.py ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100%

╭────────────────── Indexing Complete ──────────────────╮
│ ✓ Indexing complete!                                  │
│                                                       │
│ Files indexed: 48/49                                  │
│ Total chunks: 9222                                    │
│ Model: all-MiniLM-L6-v2                               │
│ Time taken: 2m 16s                                    │
│                                                       │
│ You can now search with: pv search . "your query"     │
╰───────────────────────────────────────────────────────╯

3. Search Your Code

# Natural language search
pv search /path/to/project "authentication logic"

# Single-word searches work great (high precision)
pv search /path/to/project "async" --threshold 0.8
pv search /path/to/project "test" --threshold 0.9

# Multi-word queries (semantic search)
pv search /path/to/project "user login validation" --threshold 0.5

# Find specific constructs
pv search /path/to/project "class" --limit 10

Output:

Search Results for: authentication logic

Found 5 result(s) with threshold >= 0.5

╭─────────────────────── Result 1 ───────────────────────╮
│ src/auth/login.py                                      │
│ Lines 45-67 | Similarity: 0.892                        │
│                                                        │
│ def authenticate_user(username: str, password: str):   │
│     """                                                │
│     Authenticate user credentials against database     │
│     Returns user object if valid, None otherwise       │
│     """                                                │
│     ...                                                │
╰────────────────────────────────────────────────────────╯

4. Start MCP Server

# Start server (default: localhost:8000)
pv serve /path/to/project

# Custom host/port
pv serve /path/to/project --host 0.0.0.0 --port 8080

5. Monitor Changes in Real-Time

# Watch for file changes (default 2s debounce)
pv sync /path/to/project --watch

# Fast feedback (0.5s)
pv sync /path/to/project --watch --debounce 0.5

# Slower systems (5s)
pv sync /path/to/project --watch --debounce 5.0

Performance Optimization

Understanding the Optimization Flags

--optimize (Permanent)

Use when initializing a new project. Detects your system and saves optimal settings.

pv init /path/to/project --optimize

What it does:

  • Detects CPU cores → sets max_workers (e.g., 8 cores = 16 workers)
  • Calculates RAM → sets safe batch_size (e.g., 16GB = 400 batch)
  • Sets memory thresholds based on total RAM
  • Saves to config - All future operations use these settings

When to use:

  • ✅ New projects
  • ✅ Want permanent optimization
  • ✅ Same machine for all operations
  • ✅ "Set and forget" approach

--max-resources (Temporary)

Use when indexing to temporarily boost performance without changing config.

pv index /path/to/project --max-resources
pv index-git /path/to/project --since HEAD~1 --max-resources

What it does:

  • Detects system resources (same as --optimize)
  • Temporarily overrides config for this operation only
  • Original config unchanged

When to use:

  • ✅ Existing project without optimization
  • ✅ One-time heavy indexing
  • ✅ CI/CD with dedicated resources
  • ✅ Don't want to modify config

Performance Benchmarks

System: 8-core CPU, 16GB RAM, SSD

Mode               | Files     | Chunks | Time   | Settings
Standard           | 48        | 9222   | 4m 32s | 4 workers, 100 batch
--max-resources    | 48        | 9222   | 2m 16s | 16 workers, 400 batch
Smart incremental  | 5 changed | 412    | 24s    | 16 workers, 400 batch
Git-aware (HEAD~1) | 3 changed | 287    | 15s    | 16 workers, 400 batch

Key Findings:

  • --max-resources: 2x faster for full indexing
  • Smart incremental: 60-70% faster than full reindex
  • Git-aware: 80-90% faster for recent changes
  • Chunk size (128 vs 512): No performance difference (same ~2m 16s)
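
The 2x figure for full indexing follows directly from the two full-index timings in the benchmark table. A small worked example, assuming a hypothetical helper `to_seconds` that parses the "Xm Ys" durations used in this README:

```python
import re


def to_seconds(text: str) -> int:
    """Parse durations such as '4m 32s' or '24s' into whole seconds."""
    match = re.fullmatch(r"(?:(\d+)m\s*)?(\d+)s", text.strip())
    if match is None:
        raise ValueError(f"Unrecognized duration: {text!r}")
    minutes = int(match.group(1) or 0)
    return minutes * 60 + int(match.group(2))


standard = to_seconds("4m 32s")        # 272 seconds
max_resources = to_seconds("2m 16s")   # 136 seconds
speedup = standard / max_resources     # 2.0
```

The incremental and git-aware rows index only a few changed files, so their times are not directly comparable to the full-index rows on a per-file basis.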

System Resource Detection

CPU Detection:

Detected: 8 cores
Optimal workers: min(8 * 2, 16) = 16 workers

Memory Detection:

Total RAM: 16GB
Available RAM: 8GB
Safe batch size: 8GB * 0.5 * 100 = 400
Embedding batch: 400 * 0.5 = 200
GC interval: 100 files

Memory Thresholds:

32GB+ RAM → threshold: 50000
16-32GB   → threshold: 20000
8-16GB    → threshold: 10000
<8GB      → threshold: 5000
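
The detection arithmetic above can be restated as a few small functions. This is a sketch under stated assumptions, not the tool's code: the function names are hypothetical, and because the listed RAM ranges share their endpoints, the sketch assumes each tier starts at its lower bound (so exactly 16GB falls in the 16-32GB tier).

```python
def optimal_workers(cpu_cores: int, cap: int = 16) -> int:
    """Two workers per core, capped (min(cores * 2, 16) in the text)."""
    return min(cpu_cores * 2, cap)


def safe_batch_size(available_ram_gb: float) -> int:
    """Available RAM in GB * 0.5 * 100, as in the worked example."""
    return int(available_ram_gb * 0.5 * 100)


def embedding_batch_size(batch_size: int) -> int:
    """Half of the file batch size."""
    return int(batch_size * 0.5)


def memory_threshold(total_ram_gb: float) -> int:
    """Map total RAM to the listed threshold (tier boundaries assumed)."""
    if total_ram_gb >= 32:
        return 50000
    if total_ram_gb >= 16:
        return 20000
    if total_ram_gb >= 8:
        return 10000
    return 5000
```

For the 8-core machine with 8GB available, this reproduces the values shown: 16 workers, a batch size of 400, and an embedding batch of 200.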

Best Practices

  1. Initialize with optimization

    pv init ~/my-project --optimize
    
  2. Use max resources for heavy operations

    pv index ~/my-project --force --max-resources
    
  3. Use smart mode for daily updates

    pv index ~/my-project --smart
    
  4. Use git-aware after pulling changes

    pv index-git ~/my-project --since HEAD~1
    
  5. Monitor memory with verbose mode

    pv index ~/my-project --max-resources --verbose
    

CLI Commands

Global Options

pv [OPTIONS] COMMAND [ARGS]

Options:
  -v, --verbose    Enable verbose output
  --version        Show version
  --help           Show help

pv init - Initialize Project

Initialize a new project for vectorization.

pv init [OPTIONS] PROJECT_PATH

Options:
  -n, --name TEXT              Project name (default: directory name)
  -m, --embedding-model TEXT   Model name (default: all-MiniLM-L6-v2)
  -p, --embedding-provider     Provider: sentence-transformers | openai
  -c, --chunk-size INT         Chunk size in tokens (default: 256)
  -o, --chunk-overlap INT      Overlap in tokens (default: 32)
  --optimize                   Auto-optimize based on system resources ⭐

Examples:

# Basic initialization
pv init /path/to/project

# With optimization (recommended)
pv init /path/to/project --optimize

# With OpenAI embeddings
export OPENAI_API_KEY="sk-..."
pv init /path/to/project \
  --embedding-provider openai \
  --embedding-model text-embedding-ada-002 \
  --optimize

pv index - Index Codebase

Index the codebase for searching.

pv index [OPTIONS] PROJECT_PATH

Options:
  -i, --incremental      Only index changed files
  -s, --smart            Smart incremental (categorized: new/modified/deleted) ⭐
  -f, --force            Force re-index all files
  --max-resources        Use maximum system resources ⭐

Examples:

# Full indexing with max resources
pv index /path/to/project --max-resources

# Smart incremental (fastest for updates)
pv index /path/to/project --smart

# Combine for maximum performance
pv index /path/to/project --smart --max-resources

# Force complete reindex
pv index /path/to/project --force

pv index-git - Git-Aware Indexing

Index only files changed in git commits.

pv index-git [OPTIONS] PROJECT_PATH

Options:
  -s, --since TEXT       Git reference (default: HEAD~1)
  --max-resources        Use maximum system resources ⭐

Examples:

# Last commit
pv index-git /path/to/project --since HEAD~1

# Last 5 commits
pv index-git /path/to/project --since HEAD~5

# Since main branch
pv index-git /path/to/project --since main

# Since specific commit
pv index-git /path/to/project --since abc123def

# With max resources
pv index-git /path/to/project --since HEAD~10 --max-resources

Use Cases:

  • After git pull - index only new changes
  • Before code review - index PR changes
  • CI/CD pipelines - index commit range
  • After branch switch - index differences
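
To preview which files a given --since reference would cover, you can ask git directly. A minimal sketch, assuming the changed set corresponds to `git diff --name-only <since> HEAD`; how pv index-git computes its file list internally is not documented here, and both helper names are hypothetical.

```python
import subprocess


def changed_files_command(repo_path: str, since: str = "HEAD~1") -> list:
    """Build the git command that lists paths changed between `since` and HEAD."""
    return ["git", "-C", repo_path, "diff", "--name-only", since, "HEAD"]


def changed_files(repo_path: str, since: str = "HEAD~1") -> list:
    """Run the command and return the changed paths (requires git on PATH)."""
    result = subprocess.run(
        changed_files_command(repo_path, since),
        check=True,
        capture_output=True,
        text=True,
    )
    return [line for line in result.stdout.splitlines() if line]
```

Note that the list also contains deleted paths, which an indexer would remove rather than re-embed.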

pv search - Search Code

Search through vectorized codebase.

pv search [OPTIONS] PROJECT_PATH QUERY

Options:
  -l, --limit INT        Number of results (default: 10)
  -t, --threshold FLOAT  Similarity threshold 0.0-1.0 (default: 0.3)

Examples:

# Natural language search
pv search /path/to/project "error handling in database connections"

# Single-word search (high threshold)
pv search /path/to/project "async" --threshold 0.9

# Find all tests
pv search /path/to/project "test" --limit 20 --threshold 0.8

# Broad semantic search (low threshold)
pv search /path/to/project "api authentication" --threshold 0.3

Threshold Guide:

  • 0.8-0.95: Single words, exact matches
  • 0.5-0.7: Multi-word phrases, semantic
  • 0.3-0.5: Complex queries, broad search
  • 0.1-0.3: Very broad, exploratory

pv sync - Sync Changes / Watch Mode

Sync changes or watch for file modifications.

pv sync [OPTIONS] PROJECT_PATH

Options:
  -w, --watch           Watch for file changes
  -d, --debounce FLOAT  Debounce delay in seconds (default: 2.0)

Examples:

# One-time sync (smart incremental)
pv sync /path/to/project

# Watch mode with default debounce (2s)
pv sync /path/to/project --watch

# Fast feedback (0.5s)
pv sync /path/to/project --watch --debounce 0.5

# Slower systems (5s)
pv sync /path/to/project --watch --debounce 5.0

Debounce Explained:

  • Waits X seconds after last file change before indexing
  • Batches multiple rapid changes together
  • Prevents redundant indexing when saving files repeatedly
  • Reduces CPU usage during active development

Recommended Values:

  • 0.5-1.0s: Fast machines, need instant feedback
  • 2.0s: Balanced (default)
  • 5.0-10.0s: Slower machines, large codebases
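
The effect of the debounce delay can be illustrated without any file watcher. The sketch below groups timestamped change events into indexing batches: a batch closes once the delay has passed with no further event. The function name `debounce_batches` and the (timestamp, path) event format are assumptions for illustration only.

```python
def debounce_batches(events: list, delay: float) -> list:
    """Group (timestamp, path) events into batches separated by quiet gaps."""
    batches = []
    current = []
    last_time = None
    for timestamp, path in sorted(events):
        # A gap of at least `delay` seconds means the previous batch was flushed.
        if last_time is not None and timestamp - last_time >= delay:
            batches.append(current)
            current = []
        if path not in current:  # repeated saves of one file are indexed once
            current.append(path)
        last_time = timestamp
    if current:
        batches.append(current)
    return batches


events = [(0.0, "a.py"), (0.5, "a.py"), (1.0, "b.py"), (5.0, "c.py")]
batches = debounce_batches(events, delay=2.0)
# batches == [["a.py", "b.py"], ["c.py"]]
```

With a shorter delay the same events split into more, smaller batches, which is the trade-off behind the recommended values above.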

pv serve - Start MCP Server

Start MCP server for AI agent integration.

pv serve [OPTIONS] PROJECT_PATH

Options:
  -p, --port INT   Port number (default: 8000)
  -h, --host TEXT  Host address (default: localhost)

Examples:

# Start server
pv serve /path/to/project

# Custom port
pv serve /path/to/project --port 8080

# Expose to network
pv serve /path/to/project --host 0.0.0.0 --port 8000

pv status - Show Project Status

Show project status and statistics.

pv status PROJECT_PATH

Output:

╭────────────── Project Status ──────────────╮
│ Name              my-project               │
│ Path              /path/to/project         │
│ Embedding Model   all-MiniLM-L6-v2         │
│                                            │
│ Total Files       49                       │
│ Indexed Files     48                       │
│ Total Chunks      9222                     │
│                                            │
│ Git Branch        main                     │
│ Last Updated      2025-10-13 12:15:42      │
│ Created           2025-10-10 09:30:15      │
╰────────────────────────────────────────────╯

Configuration

Config File Location

Configuration is stored at <project>/.vectorizer/config.json

Full Configuration Reference

{
  "chromadb_path": null,
  "embedding_model": "all-MiniLM-L6-v2",
  "embedding_provider": "sentence-transformers",
  "openai_api_key": null,
  "chunk_size": 128,
  "chunk_overlap": 32,
  "max_file_size_mb": 10,
  "included_extensions": [
    ".py",
    ".js",
    ".ts",
    ".jsx",
    ".tsx",
    ".go",
    ".rs",
    ".java",
    ".cpp",
    ".c",
    ".h",
    ".hpp",
    ".cs",
    ".php",
    ".rb",
    ".swift",
    ".kt",
    ".scala",
    ".clj",
    ".sh",
    ".bash",
    ".zsh",
    ".fish",
    ".ps1",
    ".bat",
    ".cmd",
    ".md",
    ".txt",
    ".rst",
    ".json",
    ".yaml",
    ".yml",
    ".toml",
    ".xml",
    ".html",
    ".css",
    ".scss",
    ".sql",
    ".graphql",
    ".proto"
  ],
  "excluded_patterns": [
    "node_modules/**",
    ".git/**",
    "__pycache__/**",
    "*.pyc",
    ".pytest_cache/**",
    "venv/**",
    "env/**",
    ".env/**",
    "build/**",
    "dist/**",
    "*.egg-info/**",
    ".DS_Store",
    "*.min.js",
    "*.min.css"
  ],
  "mcp_host": "localhost",
  "mcp_port": 8000,
  "log_level": "INFO",
  "log_file": null,
  "max_workers": 4,
  "batch_size": 100,
  "embedding_batch_size": 100,
  "parallel_file_processing": true,
  "memory_monitoring_enabled": true,
  "memory_efficient_search_threshold": 10000,
  "gc_interval": 100
}

Key Settings Explained

Embedding Settings:

  • embedding_model: Model for embeddings (all-MiniLM-L6-v2, text-embedding-ada-002, etc.)
  • embedding_provider: "sentence-transformers" (local) or "openai" (API)
  • chunk_size: Tokens per chunk (128 for precision, 512 for context)
  • chunk_overlap: Overlap between chunks (16-32 recommended)
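
chunk_size and chunk_overlap interact as a sliding window: each chunk starts chunk_size minus chunk_overlap tokens after the previous one. A minimal sketch of that arithmetic, assuming a plain token list and a hypothetical helper `chunk_tokens`; the tool itself also chunks by functions, classes, and smaller units, which this does not model.

```python
def chunk_tokens(tokens: list, chunk_size: int, chunk_overlap: int) -> list:
    """Split tokens into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end
    return chunks
```

With 10 tokens, a chunk size of 4, and an overlap of 1, this yields three chunks, each sharing one token with its neighbor.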

Performance Settings:

  • max_workers: Parallel workers (auto-detected with --optimize)
  • batch_size: Files per batch (auto-calculated with --optimize)
  • embedding_batch_size: Embeddings per batch
  • parallel_file_processing: Enable parallel processing (recommended: true)

Memory Settings:

  • memory_monitoring_enabled: Monitor RAM usage (recommended: true)
  • memory_efficient_search_threshold: Switch to streaming for large results
  • gc_interval: Garbage collection frequency (files between GC)
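
As an illustration of what gc_interval means, the sketch below triggers a garbage collection after every N processed files. It is a simplified picture under assumptions: `process_files` and its parameters are hypothetical, and the real indexer also monitors memory usage, which is not shown.

```python
import gc
from typing import Callable, Iterable


def process_files(
    paths: Iterable[str],
    handle: Callable[[str], None],
    gc_interval: int = 100,
    collect: Callable[[], object] = gc.collect,
) -> int:
    """Handle each path and collect garbage every gc_interval files.

    Returns the number of collections that were triggered.
    """
    collections = 0
    for count, path in enumerate(paths, start=1):
        handle(path)
        if gc_interval > 0 and count % gc_interval == 0:
            collect()
            collections += 1
    return collections
```

A lower interval frees memory more often at the cost of more collection pauses, which is why the troubleshooting section suggests 50 for memory-constrained runs.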

File Filtering:

  • included_extensions: File types to index
  • excluded_patterns: Glob patterns to ignore
  • max_file_size_mb: Skip files larger than this

Server Settings:

  • mcp_host: MCP server host
  • mcp_port: MCP server port
  • log_level: INFO, DEBUG, WARNING, ERROR
  • chromadb_path: Custom ChromaDB location (optional)

Environment Variables

Create a .env file or export the variables in your shell:

# OpenAI API Key (required for OpenAI embeddings)
export OPENAI_API_KEY="sk-..."

# Override config values
export EMBEDDING_PROVIDER="sentence-transformers"
export EMBEDDING_MODEL="all-MiniLM-L6-v2"
export CHUNK_SIZE="256"
export DEFAULT_SEARCH_THRESHOLD="0.3"

# Database
export CHROMADB_PATH="/custom/path/to/chromadb"

# Logging
export LOG_LEVEL="INFO"
export LOG_FILE="/var/log/vectorizer.log"

For the complete list, see docs/ENVIRONMENT.md

Editing Configuration

# View current config
cat /path/to/project/.vectorizer/config.json

# Edit manually
nano /path/to/project/.vectorizer/config.json

# Or regenerate with optimization
pv init /path/to/project --optimize
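
Because the configuration is plain JSON, it can also be edited from a script. A minimal sketch using only the standard library; the helper name `update_config` is hypothetical, and rejecting keys that are not already in the file is a choice made here to catch typos, not a rule of the tool.

```python
import json
from pathlib import Path


def update_config(project_path: str, **overrides) -> dict:
    """Update existing keys in <project>/.vectorizer/config.json and save it."""
    config_file = Path(project_path) / ".vectorizer" / "config.json"
    config = json.loads(config_file.read_text(encoding="utf-8"))
    unknown = set(overrides) - set(config)
    if unknown:
        raise KeyError(f"Unknown config keys: {sorted(unknown)}")
    config.update(overrides)
    config_file.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

For example, `update_config("/path/to/project", batch_size=50, max_workers=4)` would apply the settings suggested for high memory usage.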

Search Features

Single-Word Search

Optimized for high-precision single-keyword searches.

# Programming keywords
pv search /path/to/project "async" --threshold 0.9
pv search /path/to/project "test" --threshold 0.8
pv search /path/to/project "class" --threshold 0.9
pv search /path/to/project "import" --threshold 0.85

# Works great for finding specific constructs
pv search /path/to/project "def" --threshold 0.9  # Python functions
pv search /path/to/project "function" --threshold 0.9  # JS functions
pv search /path/to/project "catch" --threshold 0.8  # Error handling

Features:

  • Exact Word Matching: Prioritizes exact word boundaries
  • Keyword Detection: Special handling for programming keywords
  • Relevance Boosting: Huge boost for exact matches
  • High Thresholds: Reliable results even at 0.8-0.9+
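
The difference between an exact word match and a partial match inside a larger word can be shown with a word-boundary regular expression. This is an illustrative sketch with a hypothetical helper `match_kind`; it is case-sensitive and does not reflect how the tool scores matches internally.

```python
import re


def match_kind(word: str, text: str) -> str:
    """Classify how `word` occurs in `text`: 'exact', 'partial', or 'none'."""
    # \b requires a word boundary on both sides, so "async" does not
    # match inside "run_asynchronously".
    if re.search(rf"\b{re.escape(word)}\b", text):
        return "exact"
    if word in text:
        return "partial"
    return "none"
```

For instance, "async" is an exact match in `async def run():` but only a partial match in `run_asynchronously()`.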

Multi-Word Search

Semantic search for phrases and concepts.

# Natural language
pv search /path/to/project "user authentication logic" --threshold 0.5

# Code patterns
pv search /path/to/project "error handling in database" --threshold 0.4

# Features
pv search /path/to/project "rate limiting middleware" --threshold 0.6

Search Result Ranking

Results are ranked by:

  1. Exact word matches (highest priority)
  2. Content type (micro/word chunks get boost)
  3. Partial matches within larger words
  4. Semantic similarity from embeddings
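
One way to read the priority list above is as a lexicographic sort key. The sketch below takes that strict reading; it is an assumption, since the feature list describes relevance "boosts", so the actual scoring is likely weighted rather than strictly tiered. The `Hit` fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    file_path: str
    similarity: float    # semantic similarity from embeddings
    exact_match: bool    # query word found on word boundaries
    partial_match: bool  # query word found inside a larger word
    chunk_type: str      # e.g. "function", "class", "micro", "word"


def rank(hits: list) -> list:
    """Sort hits by the four criteria in priority order, best first."""
    def key(hit: Hit) -> tuple:
        return (
            hit.exact_match,
            hit.chunk_type in {"micro", "word"},
            hit.partial_match,
            hit.similarity,
        )
    return sorted(hits, key=key, reverse=True)
```

Under this reading an exact match always outranks a non-exact one, even when the non-exact hit has a higher similarity score.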

Recommended Thresholds by Query Type

Query Type     | Threshold | Example
Single keyword | 0.7-0.95  | "async", "test", "class"
Two words      | 0.5-0.8   | "error handling", "api routes"
Short phrase   | 0.4-0.7   | "user login validation"
Complex query  | 0.3-0.5   | "authentication with jwt tokens"
Exploratory    | 0.1-0.3   | "machine learning model training"
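
The recommendations above can be turned into a small helper that picks a starting range from the number of words in the query. This is a rough heuristic sketch: the mapping from word count to query type is an assumption, and the "Exploratory" row is left out because it cannot be detected from word count alone.

```python
THRESHOLD_RANGES = {
    "single keyword": (0.7, 0.95),
    "two words": (0.5, 0.8),
    "short phrase": (0.4, 0.7),
    "complex query": (0.3, 0.5),
}


def suggest_threshold_range(query: str) -> tuple:
    """Return a (low, high) threshold range based on the query's word count."""
    words = len(query.split())
    if words == 0:
        raise ValueError("query must not be empty")
    if words == 1:
        return THRESHOLD_RANGES["single keyword"]
    if words == 2:
        return THRESHOLD_RANGES["two words"]
    if words == 3:
        return THRESHOLD_RANGES["short phrase"]
    return THRESHOLD_RANGES["complex query"]
```

Start near the top of the suggested range and lower the threshold if a search returns too few results.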

MCP Server

Starting the Server

# Default (localhost:8000)
pv serve /path/to/project

# Custom settings
pv serve /path/to/project --host 0.0.0.0 --port 8080

Available MCP Tools

While the server is running, AI agents can use these tools:

  1. search_code - Search vectorized codebase

    {
      "query": "authentication logic",
      "limit": 10,
      "threshold": 0.5
    }
    
  2. get_file_content - Retrieve full file

    {
      "file_path": "src/auth/login.py"
    }
    
  3. list_files - List all files (file_type is an optional filter)

    {
      "file_type": "py"
    }
    
  4. get_project_stats - Get statistics

    {}
    

HTTP Fallback API

If MCP is unavailable, the server also provides HTTP endpoints:

# Search
curl "http://localhost:8000/search?q=authentication&limit=5&threshold=0.5"

# Get file
curl "http://localhost:8000/file/src/auth/login.py"

# List files
curl "http://localhost:8000/files?type=py"

# Statistics
curl "http://localhost:8000/stats"

# Health check
curl "http://localhost:8000/health"
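
The same endpoints can be called from Python with the standard library. A minimal sketch: the URL shapes mirror the curl examples above, while the helper names and the assumption that responses are JSON are mine and should be checked against a running server.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:8000"


def search_url(query: str, limit: int = 10, threshold: float = 0.3,
               base_url: str = BASE_URL) -> str:
    """Build the /search URL with properly encoded query parameters."""
    params = urlencode({"q": query, "limit": limit, "threshold": threshold})
    return f"{base_url}/search?{params}"


def file_url(file_path: str, base_url: str = BASE_URL) -> str:
    """Build the /file/<path> URL, keeping slashes in the path."""
    return f"{base_url}/file/{quote(file_path)}"


def fetch_json(url: str, timeout: float = 10.0):
    """GET a URL and decode the body as JSON (response format assumed)."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With `pv serve` running, `fetch_json(search_url("authentication", limit=5, threshold=0.5))` issues the same request as the first curl example.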

Use Cases

  1. AI Code Review: Let Claude analyze your codebase semantically
  2. Intelligent Navigation: Ask AI to find relevant code
  3. Documentation: Generate docs from actual code
  4. Onboarding: Help new devs understand codebase
  5. Refactoring: Find similar patterns across project

Advanced Usage

Python API

Basic Usage

import asyncio
from pathlib import Path
from project_vectorizer.core.config import Config
from project_vectorizer.core.project import ProjectManager

async def main():
    # Initialize project
    config = Config.create_optimized(
        embedding_model="all-MiniLM-L6-v2",
        chunk_size=256
    )

    project_path = Path("/path/to/project")
    manager = ProjectManager(project_path, config)

    # Initialize
    await manager.initialize("My Project")

    # Index
    await manager.load()
    await manager.index_all()

    # Search
    results = await manager.search("authentication", limit=10, threshold=0.5)
    for result in results:
        print(f"{result['file_path']}: {result['similarity']:.3f}")

asyncio.run(main())

Progress Tracking

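from rich.progress import Progress, BarColumn, TaskProgressColumn

from project_vectorizer.core.config import Config
from project_vectorizer.core.project import ProjectManager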

async def index_with_progress(project_path):
    config = Config.load_from_project(project_path)
    manager = ProjectManager(project_path, config)
    await manager.load()

    with Progress() as progress:
        task = progress.add_task("Indexing...", total=100)

        def update_progress(current, total, description):
            progress.update(task, completed=current, total=total, description=description)

        manager.set_progress_callback(update_progress)
        await manager.index_all()

Custom Resource Limits

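import psutil

from project_vectorizer.core.config import Config
from project_vectorizer.core.project import ProjectManager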

async def adaptive_index(project_path):
    """Index with resources based on current load."""
    cpu_percent = psutil.cpu_percent(interval=1)

    if cpu_percent < 50:  # System idle
        config = Config.create_optimized()
    else:  # System busy
        config = Config(max_workers=4, batch_size=100)

    manager = ProjectManager(project_path, config)
    await manager.load()
    await manager.index_all()

Chunk Size Optimization

The engine enforces a maximum of 128 tokens per chunk (see engine.py:35) for precision. Larger sizes can be configured, but chunks are still capped at 128 tokens by the engine:

# Precision (default, forced max 128)
pv init /path/to/project --chunk-size 128

# More context (still capped at 128 by engine)
pv init /path/to/project --chunk-size 512

Performance Note: Chunk size has virtually NO impact on indexing speed (~2m 16s for both 128 and 512 tokens). Choose based on search quality needs:

  • 128: Better precision, exact matches
  • 512: More context, better understanding

CI/CD Integration

# .github/workflows/vectorize.yml
name: Vectorize Codebase

on:
  push:
    branches: [main]

jobs:
  vectorize:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.9"

      - name: Install vectorizer
        run: pip install project-vectorizer

      - name: Initialize and index
        run: |
          pv init . --optimize --name "${{ github.repository }}"
          pv index . --max-resources

      - name: Test search
        run: pv search . "test" --limit 5

Custom File Filters

{
  "included_extensions": [".py", ".js", ".custom"],
  "excluded_patterns": ["tests/**", "*.generated.js", "vendor/**", "*.min.*"]
}
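
To check which paths a filter configuration like the one above would admit, the rules can be approximated with `fnmatch`. This is a sketch under assumptions: the helper `should_index` is hypothetical, and the tool's real glob semantics (for example how `**` is treated in nested directories) may differ from `fnmatch`, where `*` also matches path separators.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def should_index(rel_path: str, included_extensions: list,
                 excluded_patterns: list) -> bool:
    """Return True if a project-relative path passes both filters."""
    path = PurePosixPath(rel_path)
    if path.suffix not in included_extensions:
        return False
    # Test each pattern against the full relative path and the bare file name.
    return not any(
        fnmatchcase(rel_path, pattern) or fnmatchcase(path.name, pattern)
        for pattern in excluded_patterns
    )


included = [".py", ".js", ".custom"]
excluded = ["tests/**", "*.generated.js", "vendor/**", "*.min.*"]
```

With these lists, `src/app.py` is indexed, while `tests/test_app.py`, `static/app.min.js`, and `README.md` are skipped.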

Watch Mode During Development

# Terminal 1: Watch mode
pv sync /path/to/project --watch --debounce 1.0

# Terminal 2: Make code changes
# Auto-indexes when you save

# Terminal 3: Search as you code
pv search /path/to/project "your new function" --threshold 0.5

Troubleshooting

Common Issues

1. Slow Indexing

Problem: Indexing taking too long

Solutions:

# Use max resources
pv index /path/to/project --max-resources

# Use smart incremental for updates
pv index /path/to/project --smart

# Use git-aware for recent changes
pv index-git /path/to/project --since HEAD~1

# Check if optimization is working
pv index /path/to/project --max-resources --verbose
# Look for: "Workers: 16, Batch Size: 400"

2. High Memory Usage

Problem: Process using too much RAM or getting killed

Solutions:

# Reduce batch size in config
{
  "batch_size": 50,
  "max_workers": 4
}

# Enable memory monitoring
{
  "memory_monitoring_enabled": true,
  "gc_interval": 50
}

# Use smaller chunks
pv init /path/to/project --chunk-size 128

3. Poor Search Results

Problem: Search not finding relevant code

Solutions:

# Lower threshold for phrases
pv search /path/to/project "your query" --threshold 0.3

# Higher threshold for keywords
pv search /path/to/project "async" --threshold 0.9

# Use smaller chunk size for precision
# Edit config: "chunk_size": 128

# Ensure index is up to date
pv index /path/to/project --smart

4. No Results for Single Words

Problem: Single-word searches return nothing

Solutions:

# Try lower threshold
pv search /path/to/project "yourword" --threshold 0.5

# Check if word exists
pv search /path/to/project "yourword" --threshold 0.1 --limit 1

# Reindex with smaller chunks
# Edit config: "chunk_size": 128
pv index /path/to/project --force

5. Missing Recent Changes

Problem: Just-edited code not showing in search

Solutions:

# Run smart incremental
pv index /path/to/project --smart

# Or git-aware
pv index-git /path/to/project --since HEAD~1

# Check status
pv status /path/to/project

6. psutil Not Found

Problem: Optimization not working

Solution:

# Install psutil
pip install psutil

# Verify
python -c "import psutil; print(f'CPUs: {psutil.cpu_count()}, RAM: {psutil.virtual_memory().available / 1024**3:.1f}GB')"

# Try again
pv init /path/to/project --optimize

Debug Mode

# Enable verbose logging
pv --verbose index /path/to/project

# Check project status
pv status /path/to/project

# View config
cat /path/to/project/.vectorizer/config.json

# Check ChromaDB
ls -lh /path/to/project/.vectorizer/chromadb/

Performance Debugging

# Time operations
time pv index /path/to/project
time pv index /path/to/project --max-resources

# Monitor resources during indexing
# Terminal 1:
pv index /path/to/project --max-resources

# Terminal 2:
htop  # or top
# Should see high CPU across all cores

# Check memory warnings
pv index /path/to/project --max-resources --verbose
# Look for memory warnings

Changelog

[0.1.4] - 2025-10-13

Fixed

  • Hardcoded value - Replaced hardcoded configuration with dynamic variable lookup

Notes

  • This is a minor bugfix release with no API or CLI changes.

[0.1.3] - 2025-10-13

Fixed

  • Hardcoded value - Replaced hardcoded configuration with dynamic variable lookup
    • Prevents unexpected behavior when running with custom configs

Notes

  • This is a minor bugfix release with no API or CLI changes.

[0.1.2] - 2025-10-13

Added

  • Optimized Config Generation - Config.create_optimized() auto-detects CPU/RAM
  • Max Resources Flag - --max-resources for temporary performance boost
  • psutil Integration - Automatic system resource detection
  • Unified Progress Tracking - Clean single-line progress bar
  • Library Progress Suppression - No more cluttered batch progress bars
  • Timing Information - All operations show elapsed time
  • Clean Terminal Output - Professional UI with timing

Performance

  • 2x faster full indexing with --max-resources
  • 60-70% faster smart incremental updates
  • 80-90% faster git-aware indexing

Documentation

  • Comprehensive documentation overhaul
  • Consolidated all guides into main README
  • Added CHANGELOG.md with version history

[0.1.1] - 2025-10-12

  • Enhanced single-word search with high precision
  • Multi-level chunking (micro + word-level)
  • Adaptive search thresholds
  • Programming keyword detection
  • Improved word matching and relevance boosting

[0.1.0] - 2025-10-10

  • Initial release
  • Code vectorization
  • Smart incremental indexing
  • Git-aware indexing
  • MCP server
  • Watch mode
  • ChromaDB backend
  • 30+ language support

Contributing

Development Setup

# Clone repository
git clone https://github.com/starkbaknet/project-vectorizer.git
cd project-vectorizer

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install with dev dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Format code
black .
isort .

Running Tests

# All tests
pytest

# With coverage
pytest --cov=project_vectorizer

# Specific test
pytest tests/test_config.py

# Verbose
pytest -v

See docs/TESTING.md for details.

Publishing

See docs/PUBLISHING.md for PyPI publishing guide.

Contributing Guidelines

  1. Fork repository
  2. Create feature branch: git checkout -b feature/amazing-feature
  3. Make changes and add tests
  4. Ensure tests pass: pytest
  5. Format code: black . && isort .
  6. Commit: git commit -m 'Add amazing feature'
  7. Push: git push origin feature/amazing-feature
  8. Open Pull Request

License

MIT License - see LICENSE file



Made with ❤️ by StarkBakNet

Vectorize your codebase. Empower your AI agents. Build better software.
