CLI tool for vectorizing codebases and serving them via MCP
Project Vectorizer
A powerful CLI tool that vectorizes codebases, stores them in a vector database, tracks changes, and serves them via MCP (Model Context Protocol) for AI agents like Claude, Codex, and others.
Latest Version: 0.1.3 | Changelog | GitHub
Table of Contents
- Features
- Installation
- Quick Start
- Performance Optimization
- CLI Commands
- Configuration
- Search Features
- MCP Server
- Advanced Usage
- Troubleshooting
- Changelog
- Contributing
Features
Performance & Optimization
- Auto-Optimized Config - Auto-detect CPU cores and RAM for optimal settings (--optimize)
- Max Resources Mode - Use maximum system resources for fastest indexing (--max-resources)
- Smart Incremental - 60-70% faster indexing with intelligent change categorization
- Git-Aware Indexing - 80-90% faster by indexing only git-changed files
- Parallel Processing - Multi-threaded with auto-detected optimal worker count (up to 16 workers)
- Memory Monitoring - Real-time memory tracking with automatic garbage collection
- Batch Optimization - Memory-based batch size calculation for safe processing
Search & Indexing
- Code Vectorization - Parse and vectorize with sentence-transformers or OpenAI embeddings
- Multi-Level Chunking - Functions, classes, micro-chunks, and word-level chunks for precision
- Enhanced Single-Word Search - High-precision search for single keywords (0.8+ thresholds)
- Semantic + Exact Search - Combines semantic similarity with exact word matching
- Adaptive Thresholds - Automatically adjusts for optimal results
- Multiple Languages - 30+ languages (Python, JS, TS, Go, Rust, Java, C++, C, PHP, Ruby, Swift, Kotlin, and more)
Change Management
- Git Integration - Track changes via git commits with the index-git command
- Smart File Categorization - Detects New, Modified, and Deleted files
- Watch Mode - Real-time monitoring with configurable debouncing (0.5-10s)
- Incremental Updates - Only re-index changed content
- Hash-Based Detection - SHA256 file hashing for accurate change detection
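The hash-based detection and New/Modified/Deleted categorization described above could be sketched as follows. This is a minimal illustration, not the tool's actual implementation: the function names and the `{relative_path: sha256}` snapshot format are assumptions made for the example.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA256 hex digest of a file, read in 64 KiB blocks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(65536), b""):
            digest.update(block)
    return digest.hexdigest()


def categorize_changes(previous: dict, current: dict) -> dict:
    """Compare two {relative_path: sha256} snapshots (hypothetical format)."""
    return {
        # Present now, absent in the previous snapshot.
        "new": sorted(p for p in current if p not in previous),
        # Present in both snapshots but with a different digest.
        "modified": sorted(
            p for p in current if p in previous and current[p] != previous[p]
        ),
        # Present before, missing now.
        "deleted": sorted(p for p in previous if p not in current),
    }
```

Only the files in the "new" and "modified" lists would need re-embedding; entries in "deleted" would have their chunks removed from the index.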
AI Integration
- MCP Server - Model Context Protocol for AI agents (Claude, Codex, etc.)
- HTTP Fallback API - RESTful endpoints when MCP unavailable
- Semantic Search - Natural language queries for code discovery
- File Operations - Get content, list files, project statistics
User Experience
- Clean Progress Output - Single unified progress bar with timing information
- Suppressed Library Logs - No cluttered batch progress bars from dependencies
- Timing Information - Elapsed time for all operations (seconds or minutes+seconds)
- Verbose Mode - Optional detailed logging for debugging
- Professional UI - Rich terminal output with colors, panels, and formatting
- Real-time Updates - Live file names and status tags during indexing
Database & Storage
- ChromaDB Backend - High-performance vector database
- Fast HNSW Indexing - Optimized similarity search algorithm
- Scalable - Handles 500K+ chunks efficiently
- Single Database - No external dependencies required
- Custom Paths - Configurable database location
Installation
From PyPI (Recommended)
# Install from PyPI
pip install project-vectorizer
# Verify installation
pv --version
From Source
# Clone repository
git clone https://github.com/starkbaknet/project-vectorizer.git
cd project-vectorizer
# Install
pip install -e .
# Or with development dependencies
pip install -e ".[dev]"
Quick Start
1. Initialize Your Project
# Recommended: Auto-optimize based on your system (16 workers, 400 batch on 8-core/16GB RAM)
pv init /path/to/project --optimize
# Or with custom settings
pv init /path/to/project \
--name "My Project" \
--embedding-model "all-MiniLM-L6-v2" \
--chunk-size 256 \
--optimize
Output:
✓ Project initialized successfully!
Name: My Project
Path: /path/to/project
Model: all-MiniLM-L6-v2
Provider: sentence-transformers
Chunk Size: 256 tokens
Optimized Settings:
• Workers: 16
• Batch Size: 400
• Embedding Batch: 200
• Memory Monitoring: Enabled
• GC Interval: 100 files
2. Index Your Codebase
# Recommended: First-time indexing with max resources (2-4x faster)
pv index /path/to/project --max-resources
# Recommended: Smart incremental for updates (60-70% faster)
pv index /path/to/project --smart
# Recommended: Git-aware for recent changes (80-90% faster)
pv index-git /path/to/project --since HEAD~5
# Standard full indexing
pv index /path/to/project
# Force re-index everything
pv index /path/to/project --force
# Combine for maximum performance
pv index /path/to/project --smart --max-resources
Output:
Using maximum system resources (optimized settings)...
• Workers: 16
• Batch Size: 400
• Embedding Batch: 200
Indexing examples/demo.py ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100%
╭────────────────── Indexing Complete ──────────────────╮
│ ✓ Indexing complete!                                  │
│                                                       │
│ Files indexed: 48/49                                  │
│ Total chunks: 9222                                    │
│ Model: all-MiniLM-L6-v2                               │
│ Time taken: 2m 16s                                    │
│                                                       │
│ You can now search with: pv search . "your query"     │
╰───────────────────────────────────────────────────────╯
3. Search Your Code
# Natural language search
pv search /path/to/project "authentication logic"
# Single-word searches work great (high precision)
pv search /path/to/project "async" --threshold 0.8
pv search /path/to/project "test" --threshold 0.9
# Multi-word queries (semantic search)
pv search /path/to/project "user login validation" --threshold 0.5
# Find specific constructs
pv search /path/to/project "class" --limit 10
Output:
Search Results for: authentication logic
Found 5 result(s) with threshold >= 0.5
╭──────────────────────── Result 1 ────────────────────────╮
│ src/auth/login.py                                        │
│ Lines 45-67 | Similarity: 0.892                          │
│                                                          │
│ def authenticate_user(username: str, password: str):     │
│     """                                                  │
│     Authenticate user credentials against database       │
│     Returns user object if valid, None otherwise         │
│     """                                                  │
│     ...                                                  │
╰──────────────────────────────────────────────────────────╯
4. Start MCP Server
# Start server (default: localhost:8000)
pv serve /path/to/project
# Custom host/port
pv serve /path/to/project --host 0.0.0.0 --port 8080
5. Monitor Changes in Real-Time
# Watch for file changes (default 2s debounce)
pv sync /path/to/project --watch
# Fast feedback (0.5s)
pv sync /path/to/project --watch --debounce 0.5
# Slower systems (5s)
pv sync /path/to/project --watch --debounce 5.0
Performance Optimization
Understanding the Optimization Flags
--optimize (Permanent)
Use when initializing a new project. Detects your system and saves optimal settings.
pv init /path/to/project --optimize
What it does:
- Detects CPU cores → sets max_workers (e.g., 8 cores = 16 workers)
- Calculates RAM → sets safe batch_size (e.g., 16GB = 400 batch)
- Sets memory thresholds based on total RAM
- Saves to config - All future operations use these settings
When to use:
- ✓ New projects
- ✓ Want permanent optimization
- ✓ Same machine for all operations
- ✓ "Set and forget" approach
--max-resources (Temporary)
Use when indexing to temporarily boost performance without changing config.
pv index /path/to/project --max-resources
pv index-git /path/to/project --since HEAD~1 --max-resources
What it does:
- Detects system resources (same as --optimize)
- Temporarily overrides config for this operation only
- Original config unchanged
When to use:
- ✓ Existing project without optimization
- ✓ One-time heavy indexing
- ✓ CI/CD with dedicated resources
- ✓ Don't want to modify config
Performance Benchmarks
System: 8-core CPU, 16GB RAM, SSD
| Mode | Files | Chunks | Time | Settings |
|---|---|---|---|---|
| Standard | 48 | 9222 | 4m 32s | 4 workers, 100 batch |
| --max-resources | 48 | 9222 | 2m 16s | 16 workers, 400 batch |
| Smart incremental | 5 changed | 412 | 24s | 16 workers, 400 batch |
| Git-aware (HEAD~1) | 3 changed | 287 | 15s | 16 workers, 400 batch |
Key Findings:
- --max-resources: 2x faster for full indexing
- Smart incremental: 60-70% faster than full reindex
- Git-aware: 80-90% faster for recent changes
- Chunk size (128 vs 512): No performance difference (same ~2m 16s)
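As a quick check of the 2x figure, the two full-indexing rows of the benchmark table can be compared directly. The helper below is only an illustrative sketch; `to_seconds` is a hypothetical name, not part of the tool.

```python
import re


def to_seconds(duration: str) -> int:
    """Convert a duration such as '4m 32s' or '24s' into seconds."""
    match = re.fullmatch(r"(?:(\d+)m\s*)?(\d+)s", duration.strip())
    if match is None:
        raise ValueError(f"unrecognized duration: {duration!r}")
    minutes, seconds = match.groups()
    return int(minutes or 0) * 60 + int(seconds)


standard = to_seconds("4m 32s")       # "Standard" row: 4 workers, 100 batch
max_resources = to_seconds("2m 16s")  # "--max-resources" row: 16 workers, 400 batch
speedup = standard / max_resources
print(f"{standard}s / {max_resources}s = {speedup:.1f}x")  # prints "272s / 136s = 2.0x"
```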
System Resource Detection
CPU Detection:
Detected: 8 cores
Optimal workers: min(8 * 2, 16) = 16 workers
Memory Detection:
Total RAM: 16GB
Available RAM: 8GB
Safe batch size: 8GB * 0.5 * 100 = 400
Embedding batch: 400 * 0.5 = 200
GC interval: 100 files
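The CPU and memory formulas above can be written out as a small function. This is a sketch that only restates the arithmetic shown in this section; the function name and the returned keys are assumptions, and the real detection code (which relies on psutil) may differ.

```python
def optimized_settings(cpu_cores: int, available_ram_gb: float) -> dict:
    """Restate the formulas from this section (illustrative only)."""
    # Optimal workers: min(cores * 2, 16)
    max_workers = min(cpu_cores * 2, 16)
    # Safe batch size: available RAM (GB) * 0.5 * 100
    batch_size = int(available_ram_gb * 0.5 * 100)
    # Embedding batch: half of the file batch size
    embedding_batch_size = int(batch_size * 0.5)
    return {
        "max_workers": max_workers,
        "batch_size": batch_size,
        "embedding_batch_size": embedding_batch_size,
        "gc_interval": 100,  # fixed value shown above
    }


# The 8-core machine with 8GB available RAM from the example above.
print(optimized_settings(cpu_cores=8, available_ram_gb=8))
```

In practice the inputs could come from `os.cpu_count()` and `psutil.virtual_memory().available`; they are passed in explicitly here so the arithmetic is easy to verify.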
Memory Thresholds:
32GB+ RAM → threshold: 50000
16-32GB → threshold: 20000
8-16GB → threshold: 10000
<8GB → threshold: 5000
Best Practices
- Initialize with optimization
  pv init ~/my-project --optimize
- Use max resources for heavy operations
  pv index ~/my-project --force --max-resources
- Use smart mode for daily updates
  pv index ~/my-project --smart
- Use git-aware after pulling changes
  pv index-git ~/my-project --since HEAD~1
- Monitor memory with verbose mode
  pv index ~/my-project --max-resources --verbose
CLI Commands
Global Options
pv [OPTIONS] COMMAND [ARGS]
Options:
-v, --verbose Enable verbose output
--version Show version
--help Show help
pv init - Initialize Project
Initialize a new project for vectorization.
pv init [OPTIONS] PROJECT_PATH
Options:
-n, --name TEXT Project name (default: directory name)
-m, --embedding-model TEXT Model name (default: all-MiniLM-L6-v2)
-p, --embedding-provider Provider: sentence-transformers | openai
-c, --chunk-size INT Chunk size in tokens (default: 256)
-o, --chunk-overlap INT Overlap in tokens (default: 32)
--optimize Auto-optimize based on system resources ⭐
Examples:
# Basic initialization
pv init /path/to/project
# With optimization (recommended)
pv init /path/to/project --optimize
# With OpenAI embeddings
export OPENAI_API_KEY="sk-..."
pv init /path/to/project \
--embedding-provider openai \
--embedding-model text-embedding-ada-002 \
--optimize
pv index - Index Codebase
Index the codebase for searching.
pv index [OPTIONS] PROJECT_PATH
Options:
-i, --incremental Only index changed files
-s, --smart Smart incremental (categorized: new/modified/deleted) ⭐
-f, --force Force re-index all files
--max-resources Use maximum system resources ⭐
Examples:
# Full indexing with max resources
pv index /path/to/project --max-resources
# Smart incremental (fastest for updates)
pv index /path/to/project --smart
# Combine for maximum performance
pv index /path/to/project --smart --max-resources
# Force complete reindex
pv index /path/to/project --force
pv index-git - Git-Aware Indexing
Index only files changed in git commits.
pv index-git [OPTIONS] PROJECT_PATH
Options:
-s, --since TEXT Git reference (default: HEAD~1)
--max-resources Use maximum system resources ⭐
Examples:
# Last commit
pv index-git /path/to/project --since HEAD~1
# Last 5 commits
pv index-git /path/to/project --since HEAD~5
# Since main branch
pv index-git /path/to/project --since main
# Since specific commit
pv index-git /path/to/project --since abc123def
# With max resources
pv index-git /path/to/project --since HEAD~10 --max-resources
Use Cases:
- After git pull - index only new changes
- Before code review - index PR changes
- CI/CD pipelines - index commit range
- After branch switch - index differences
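One way to obtain the list of files changed since a git reference is `git diff --name-status`. The sketch below parses that output into the same new/modified/deleted categories used elsewhere in this README. It is an assumption about how such a step could work, not a description of what `pv index-git` runs internally; both function names are hypothetical.

```python
import subprocess


def parse_name_status(output: str) -> dict:
    """Parse `git diff --name-status` output into change categories."""
    changes = {"new": [], "modified": [], "deleted": []}
    for line in output.splitlines():
        if not line.strip():
            continue
        status, *paths = line.split("\t")
        if status.startswith("A"):
            changes["new"].append(paths[0])
        elif status.startswith("D"):
            changes["deleted"].append(paths[0])
        elif status.startswith("R"):
            # Rename lines carry two paths: the old one goes away, the new one appears.
            changes["deleted"].append(paths[0])
            changes["new"].append(paths[1])
        elif status.startswith("C"):
            # Copy lines also carry two paths; only the copy is new.
            changes["new"].append(paths[1])
        else:
            # M (modified), T (type change), and anything else: re-index.
            changes["modified"].append(paths[-1])
    return changes


def changed_since(repo: str, since: str = "HEAD~1") -> dict:
    """Run git in `repo` and categorize files changed between `since` and HEAD."""
    result = subprocess.run(
        ["git", "-C", repo, "diff", "--name-status", since, "HEAD"],
        check=True, capture_output=True, text=True,
    )
    return parse_name_status(result.stdout)
```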
pv search - Search Code
Search through vectorized codebase.
pv search [OPTIONS] PROJECT_PATH QUERY
Options:
-l, --limit INT Number of results (default: 10)
-t, --threshold FLOAT Similarity threshold 0.0-1.0 (default: 0.3)
Examples:
# Natural language search
pv search /path/to/project "error handling in database connections"
# Single-word search (high threshold)
pv search /path/to/project "async" --threshold 0.9
# Find all tests
pv search /path/to/project "test" --limit 20 --threshold 0.8
# Broad semantic search (low threshold)
pv search /path/to/project "api authentication" --threshold 0.3
Threshold Guide:
- 0.8-0.95: Single words, exact matches
- 0.5-0.7: Multi-word phrases, semantic
- 0.3-0.5: Complex queries, broad search
- 0.1-0.3: Very broad, exploratory
pv sync - Sync Changes / Watch Mode
Sync changes or watch for file modifications.
pv sync [OPTIONS] PROJECT_PATH
Options:
-w, --watch Watch for file changes
-d, --debounce FLOAT Debounce delay in seconds (default: 2.0)
Examples:
# One-time sync (smart incremental)
pv sync /path/to/project
# Watch mode with default debounce (2s)
pv sync /path/to/project --watch
# Fast feedback (0.5s)
pv sync /path/to/project --watch --debounce 0.5
# Slower systems (5s)
pv sync /path/to/project --watch --debounce 5.0
Debounce Explained:
- Waits X seconds after last file change before indexing
- Batches multiple rapid changes together
- Prevents redundant indexing when saving files repeatedly
- Reduces CPU usage during active development
Recommended Values:
- 0.5-1.0s: Fast machines, need instant feedback
- 2.0s: Balanced (default)
- 5.0-10.0s: Slower machines, large codebases
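The debounce behavior described above (wait for a quiet period, then index the batched changes once) can be illustrated with a small class. This is a sketch under stated assumptions: `Debouncer` and its methods are hypothetical names, and timestamps are passed in explicitly so the logic is easy to follow; the watcher inside `pv sync --watch` may be structured differently.

```python
from typing import List, Optional, Set


class Debouncer:
    """Collect changed paths and release them after a quiet period."""

    def __init__(self, delay: float = 2.0) -> None:
        self.delay = delay
        self.pending: Set[str] = set()
        self.last_event: Optional[float] = None

    def record(self, path: str, now: float) -> None:
        """Register a file change; every change restarts the quiet period."""
        self.pending.add(path)
        self.last_event = now

    def flush_if_quiet(self, now: float) -> List[str]:
        """Return the batched paths once `delay` seconds passed without changes."""
        if self.last_event is None or now - self.last_event < self.delay:
            return []
        batch = sorted(self.pending)
        self.pending.clear()
        self.last_event = None
        return batch
```

Saving the same file three times within the delay produces a single entry in the batch, which is why a longer debounce reduces redundant indexing on slower machines.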
pv serve - Start MCP Server
Start MCP server for AI agent integration.
pv serve [OPTIONS] PROJECT_PATH
Options:
-p, --port INT Port number (default: 8000)
-h, --host TEXT Host address (default: localhost)
Examples:
# Start server
pv serve /path/to/project
# Custom port
pv serve /path/to/project --port 8080
# Expose to network
pv serve /path/to/project --host 0.0.0.0 --port 8000
pv status - Show Project Status
Show project status and statistics.
pv status PROJECT_PATH
Output:
╭────────────── Project Status ──────────────╮
│ Name              my-project               │
│ Path              /path/to/project         │
│ Embedding Model   all-MiniLM-L6-v2         │
│                                            │
│ Total Files       49                       │
│ Indexed Files     48                       │
│ Total Chunks      9222                     │
│                                            │
│ Git Branch        main                     │
│ Last Updated      2025-10-13 12:15:42      │
│ Created           2025-10-10 09:30:15      │
╰────────────────────────────────────────────╯
Configuration
Config File Location
Configuration is stored at <project>/.vectorizer/config.json
Full Configuration Reference
{
"chromadb_path": null,
"embedding_model": "all-MiniLM-L6-v2",
"embedding_provider": "sentence-transformers",
"openai_api_key": null,
"chunk_size": 128,
"chunk_overlap": 32,
"max_file_size_mb": 10,
"included_extensions": [
".py",
".js",
".ts",
".jsx",
".tsx",
".go",
".rs",
".java",
".cpp",
".c",
".h",
".hpp",
".cs",
".php",
".rb",
".swift",
".kt",
".scala",
".clj",
".sh",
".bash",
".zsh",
".fish",
".ps1",
".bat",
".cmd",
".md",
".txt",
".rst",
".json",
".yaml",
".yml",
".toml",
".xml",
".html",
".css",
".scss",
".sql",
".graphql",
".proto"
],
"excluded_patterns": [
"node_modules/**",
".git/**",
"__pycache__/**",
"*.pyc",
".pytest_cache/**",
"venv/**",
"env/**",
".env/**",
"build/**",
"dist/**",
"*.egg-info/**",
".DS_Store",
"*.min.js",
"*.min.css"
],
"mcp_host": "localhost",
"mcp_port": 8000,
"log_level": "INFO",
"log_file": null,
"max_workers": 4,
"batch_size": 100,
"embedding_batch_size": 100,
"parallel_file_processing": true,
"memory_monitoring_enabled": true,
"memory_efficient_search_threshold": 10000,
"gc_interval": 100
}
Key Settings Explained
Embedding Settings:
- embedding_model: Model for embeddings (all-MiniLM-L6-v2, text-embedding-ada-002, etc.)
- embedding_provider: "sentence-transformers" (local) or "openai" (API)
- chunk_size: Tokens per chunk (128 for precision, 512 for context)
- chunk_overlap: Overlap between chunks (16-32 recommended)
Performance Settings:
- max_workers: Parallel workers (auto-detected with --optimize)
- batch_size: Files per batch (auto-calculated with --optimize)
- embedding_batch_size: Embeddings per batch
- parallel_file_processing: Enable parallel processing (recommended: true)
Memory Settings:
- memory_monitoring_enabled: Monitor RAM usage (recommended: true)
- memory_efficient_search_threshold: Switch to streaming for large results
- gc_interval: Garbage collection frequency (files between GC)
File Filtering:
- included_extensions: File types to index
- excluded_patterns: Glob patterns to ignore
- max_file_size_mb: Skip files larger than this
Server Settings:
- mcp_host: MCP server host
- mcp_port: MCP server port
- log_level: INFO, DEBUG, WARNING, ERROR
- chromadb_path: Custom ChromaDB location (optional)
Environment Variables
Create a .env file or export the variables in your shell:
# OpenAI API Key (required for OpenAI embeddings)
export OPENAI_API_KEY="sk-..."
# Override config values
export EMBEDDING_PROVIDER="sentence-transformers"
export EMBEDDING_MODEL="all-MiniLM-L6-v2"
export CHUNK_SIZE="256"
export DEFAULT_SEARCH_THRESHOLD="0.3"
# Database
export CHROMADB_PATH="/custom/path/to/chromadb"
# Logging
export LOG_LEVEL="INFO"
export LOG_FILE="/var/log/vectorizer.log"
For complete list, see docs/ENVIRONMENT.md
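The idea of environment variables overriding values from config.json can be sketched as below. The mapping from variable names to config keys, the type casts, and both function names are assumptions made for illustration; docs/ENVIRONMENT.md remains the authority on which variables exist and how they are applied.

```python
import json
import os
from pathlib import Path

# Assumed mapping: environment variable -> (config key, type cast).
ENV_TO_KEY = {
    "EMBEDDING_PROVIDER": ("embedding_provider", str),
    "EMBEDDING_MODEL": ("embedding_model", str),
    "CHUNK_SIZE": ("chunk_size", int),
    "CHROMADB_PATH": ("chromadb_path", str),
    "LOG_LEVEL": ("log_level", str),
    "LOG_FILE": ("log_file", str),
}


def apply_env_overrides(config: dict, environ=None) -> dict:
    """Return a copy of `config` with values replaced by any set variables."""
    environ = os.environ if environ is None else environ
    merged = dict(config)
    for name, (key, cast) in ENV_TO_KEY.items():
        if name in environ:
            merged[key] = cast(environ[name])
    return merged


def load_config(project: str) -> dict:
    """Read <project>/.vectorizer/config.json and apply overrides."""
    path = Path(project) / ".vectorizer" / "config.json"
    return apply_env_overrides(json.loads(path.read_text()))
```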
Editing Configuration
# View current config
cat /path/to/project/.vectorizer/config.json
# Edit manually
nano /path/to/project/.vectorizer/config.json
# Or regenerate with optimization
pv init /path/to/project --optimize
Search Features
Single-Word Search
Optimized for high-precision single-keyword searches.
# Programming keywords
pv search /path/to/project "async" --threshold 0.9
pv search /path/to/project "test" --threshold 0.8
pv search /path/to/project "class" --threshold 0.9
pv search /path/to/project "import" --threshold 0.85
# Works great for finding specific constructs
pv search /path/to/project "def" --threshold 0.9 # Python functions
pv search /path/to/project "function" --threshold 0.9 # JS functions
pv search /path/to/project "catch" --threshold 0.8 # Error handling
Features:
- Exact Word Matching: Prioritizes exact word boundaries
- Keyword Detection: Special handling for programming keywords
- Relevance Boosting: Huge boost for exact matches
- High Thresholds: Reliable results even at 0.8-0.9+
Multi-Word Search
Semantic search for phrases and concepts.
# Natural language
pv search /path/to/project "user authentication logic" --threshold 0.5
# Code patterns
pv search /path/to/project "error handling in database" --threshold 0.4
# Features
pv search /path/to/project "rate limiting middleware" --threshold 0.6
Search Result Ranking
Results ranked by:
- Exact word matches (highest priority)
- Content type (micro/word chunks get boost)
- Partial matches within larger words
- Semantic similarity from embeddings
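A rough sketch of how the four ranking signals above might be combined into one score is shown below. The additive form and the boost values (0.5, 0.2, 0.1) are hypothetical numbers chosen only to preserve the stated priority order; they are not the weights used by the tool.

```python
import re


def rank_score(query: str, text: str, similarity: float, chunk_type: str = "function") -> float:
    """Combine embedding similarity with match-based boosts (illustrative weights)."""
    score = similarity
    needle = query.lower()
    haystack = text.lower()
    if re.search(rf"\b{re.escape(needle)}\b", haystack):
        score += 0.5  # exact word match: highest priority
    elif needle in haystack:
        score += 0.1  # partial match inside a larger word
    if chunk_type in {"micro", "word"}:
        score += 0.2  # micro and word-level chunks get a boost
    return score


candidates = [
    ("asyncio.run(main())", 0.4),   # partial match only
    ("def run(): ...", 0.4),        # no textual match
    ("async def run(): ...", 0.4),  # exact word match
]
ranked = sorted(candidates, key=lambda c: rank_score("async", c[0], c[1]), reverse=True)
print([text for text, _ in ranked])
```

With equal embedding similarity, the exact match sorts first, the partial match second, and the chunk with no textual match last.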
Recommended Thresholds by Query Type
| Query Type | Threshold | Example |
|---|---|---|
| Single keyword | 0.7-0.95 | "async", "test", "class" |
| Two words | 0.5-0.8 | "error handling", "api routes" |
| Short phrase | 0.4-0.7 | "user login validation" |
| Complex query | 0.3-0.5 | "authentication with jwt tokens" |
| Exploratory | 0.1-0.3 | "machine learning model training" |
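The table can be turned into a small helper that proposes a starting range from the number of words in a query. This is only a convenience sketch: the function name is hypothetical, and treating three words as a "short phrase" and four or more as a "complex query" is an assumption. Exploratory searches (0.1-0.3) depend on intent rather than length, so they are not inferred here.

```python
from typing import Tuple


def suggest_threshold_range(query: str) -> Tuple[float, float]:
    """Suggest a (low, high) --threshold range based on query length."""
    words = len(query.split())
    if words <= 1:
        return (0.7, 0.95)  # single keyword
    if words == 2:
        return (0.5, 0.8)   # two words
    if words == 3:
        return (0.4, 0.7)   # short phrase
    return (0.3, 0.5)       # complex query


for example in ["async", "error handling", "user login validation",
                "authentication with jwt tokens"]:
    print(example, suggest_threshold_range(example))
```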
MCP Server
Starting the Server
# Default (localhost:8000)
pv serve /path/to/project
# Custom settings
pv serve /path/to/project --host 0.0.0.0 --port 8080
Available MCP Tools
When running, AI agents can use these tools:
- search_code - Search vectorized codebase
  { "query": "authentication logic", "limit": 10, "threshold": 0.5 }
- get_file_content - Retrieve full file
  { "file_path": "src/auth/login.py" }
- list_files - List all files (file_type is an optional filter)
  { "file_type": "py" }
- get_project_stats - Get statistics
  {}
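In the Model Context Protocol, a client invokes a tool by sending a JSON-RPC 2.0 request with the method `tools/call`, passing the tool name and its arguments. The sketch below builds such a message for the tools listed above. The helper name is hypothetical, and how the message is transported to this server is not covered here; MCP client libraries normally handle that step.

```python
import json


def tool_call(request_id: int, name: str, arguments: dict) -> str:
    """Serialize an MCP tools/call request as a JSON-RPC 2.0 message."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    })


# Argument shape taken from the search_code example above.
print(tool_call(1, "search_code",
                {"query": "authentication logic", "limit": 10, "threshold": 0.5}))
```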
HTTP Fallback API
If MCP is unavailable, the following HTTP endpoints are provided:
# Search
curl "http://localhost:8000/search?q=authentication&limit=5&threshold=0.5"
# Get file
curl "http://localhost:8000/file/src/auth/login.py"
# List files
curl "http://localhost:8000/files?type=py"
# Statistics
curl "http://localhost:8000/stats"
# Health check
curl "http://localhost:8000/health"
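The same endpoints can be addressed from Python. The sketch below only builds the URLs shown in the curl examples, with proper query-string encoding; the helper names are hypothetical, and the response format is not specified in this README, so no parsing is shown.

```python
from urllib.parse import quote, urlencode

BASE_URL = "http://localhost:8000"  # default host and port of `pv serve`


def search_url(query: str, limit: int = 10, threshold: float = 0.3,
               base: str = BASE_URL) -> str:
    """URL for the /search endpoint; spaces and symbols in `query` are encoded."""
    params = urlencode({"q": query, "limit": limit, "threshold": threshold})
    return f"{base}/search?{params}"


def file_url(file_path: str, base: str = BASE_URL) -> str:
    """URL for the /file/<path> endpoint; slashes in the path are preserved."""
    return f"{base}/file/{quote(file_path)}"


def files_url(file_type: str, base: str = BASE_URL) -> str:
    """URL for the /files endpoint filtered by file type."""
    return f"{base}/files?{urlencode({'type': file_type})}"


print(search_url("authentication", limit=5, threshold=0.5))
```

The resulting strings can be passed to `urllib.request.urlopen` or any HTTP client once the server is running.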
Use Cases
- AI Code Review: Let Claude analyze your codebase semantically
- Intelligent Navigation: Ask AI to find relevant code
- Documentation: Generate docs from actual code
- Onboarding: Help new devs understand codebase
- Refactoring: Find similar patterns across project
Advanced Usage
Python API
Basic Usage
import asyncio
from pathlib import Path
from project_vectorizer.core.config import Config
from project_vectorizer.core.project import ProjectManager

async def main():
    # Initialize project
    config = Config.create_optimized(
        embedding_model="all-MiniLM-L6-v2",
        chunk_size=256
    )
    project_path = Path("/path/to/project")
    manager = ProjectManager(project_path, config)
    # Initialize
    await manager.initialize("My Project")
    # Index
    await manager.load()
    await manager.index_all()
    # Search
    results = await manager.search("authentication", limit=10, threshold=0.5)
    for result in results:
        print(f"{result['file_path']}: {result['similarity']:.3f}")

asyncio.run(main())
Progress Tracking
from rich.progress import Progress
from project_vectorizer.core.config import Config
from project_vectorizer.core.project import ProjectManager

async def index_with_progress(project_path):
    config = Config.load_from_project(project_path)
    manager = ProjectManager(project_path, config)
    await manager.load()
    with Progress() as progress:
        task = progress.add_task("Indexing...", total=100)

        # Forward indexing progress to the Rich progress bar
        def update_progress(current, total, description):
            progress.update(task, completed=current, total=total, description=description)

        manager.set_progress_callback(update_progress)
        await manager.index_all()
Custom Resource Limits
import psutil
from project_vectorizer.core.config import Config
from project_vectorizer.core.project import ProjectManager

async def adaptive_index(project_path):
    """Index with resources based on current load."""
    # Sample CPU usage over one second
    cpu_percent = psutil.cpu_percent(interval=1)
    if cpu_percent < 50:  # System idle
        config = Config.create_optimized()
    else:  # System busy
        config = Config(max_workers=4, batch_size=100)
    manager = ProjectManager(project_path, config)
    await manager.load()
    await manager.index_all()
Chunk Size Optimization
The engine enforces a maximum of 128 tokens per chunk (see engine.py:35) for precision. Larger --chunk-size values are accepted, but chunks are still capped at 128 tokens by the engine:
# Precision (default, forced max 128)
pv init /path/to/project --chunk-size 128
# More context (still capped at 128 by engine)
pv init /path/to/project --chunk-size 512
Performance Note: Chunk size has virtually NO impact on indexing speed (~2m 16s for both 128 and 512 tokens). Choose based on search quality needs:
- 128: Better precision, exact matches
- 512: More context, better understanding
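To make the interplay of chunk_size and chunk_overlap concrete, here is a minimal sliding-window sketch. It splits on whitespace as a stand-in for a real tokenizer and ignores the multi-level (function, class, micro, word) chunking the tool performs, so treat it purely as an illustration; `chunk_tokens` is a hypothetical name.

```python
from typing import List


def chunk_tokens(tokens: List[str], chunk_size: int = 128, overlap: int = 32) -> List[List[str]]:
    """Split tokens into windows of `chunk_size` that share `overlap` tokens."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end
    return chunks


words = "def authenticate_user ( username , password ) : return check ( username , password )".split()
for chunk in chunk_tokens(words, chunk_size=6, overlap=2):
    print(" ".join(chunk))
```

Smaller windows keep each embedding focused on a few tokens (better for exact matches), while the overlap keeps statements that straddle a boundary searchable from both sides.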
CI/CD Integration
# .github/workflows/vectorize.yml
name: Vectorize Codebase
on:
  push:
    branches: [main]
jobs:
  vectorize:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.9"
      - name: Install vectorizer
        run: pip install project-vectorizer
      - name: Initialize and index
        run: |
          pv init . --optimize --name "${{ github.repository }}"
          pv index . --max-resources
      - name: Test search
        run: pv search . "test" --limit 5
Custom File Filters
{
"included_extensions": [".py", ".js", ".custom"],
"excluded_patterns": ["tests/**", "*.generated.js", "vendor/**", "*.min.*"]
}
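How the two filter lists might interact can be sketched with the standard library's `fnmatch`. This approximates glob matching (in `fnmatch`, `*` also matches `/`), and the tool's own matching rules may differ in edge cases; `should_index` is a hypothetical helper, not part of the package API.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def should_index(path: str, included_extensions: list, excluded_patterns: list) -> bool:
    """Return True if `path` has an included extension and matches no excluded pattern."""
    posix = PurePosixPath(path)
    if posix.suffix not in included_extensions:
        return False
    for pattern in excluded_patterns:
        # Check the pattern against both the full relative path and the file name.
        if fnmatchcase(str(posix), pattern) or fnmatchcase(posix.name, pattern):
            return False
    return True


included = [".py", ".js", ".custom"]
excluded = ["tests/**", "*.generated.js", "vendor/**", "*.min.*"]
for candidate in ["src/app.py", "tests/test_app.py", "static/app.min.js", "README.md"]:
    print(candidate, should_index(candidate, included, excluded))
```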
Watch Mode During Development
# Terminal 1: Watch mode
pv sync /path/to/project --watch --debounce 1.0
# Terminal 2: Make code changes
# Auto-indexes when you save
# Terminal 3: Search as you code
pv search /path/to/project "your new function" --threshold 0.5
Troubleshooting
Common Issues
1. Slow Indexing
Problem: Indexing taking too long
Solutions:
# Use max resources
pv index /path/to/project --max-resources
# Use smart incremental for updates
pv index /path/to/project --smart
# Use git-aware for recent changes
pv index-git /path/to/project --since HEAD~1
# Check if optimization is working
pv index /path/to/project --max-resources --verbose
# Look for: "Workers: 16, Batch Size: 400"
2. High Memory Usage
Problem: Process using too much RAM or getting killed
Solutions:
# Reduce batch size in config
{
"batch_size": 50,
"max_workers": 4
}
# Enable memory monitoring
{
"memory_monitoring_enabled": true,
"gc_interval": 50
}
# Use smaller chunks
pv init /path/to/project --chunk-size 128
3. Poor Search Results
Problem: Search not finding relevant code
Solutions:
# Lower threshold for phrases
pv search /path/to/project "your query" --threshold 0.3
# Higher threshold for keywords
pv search /path/to/project "async" --threshold 0.9
# Use smaller chunk size for precision
# Edit config: "chunk_size": 128
# Ensure index is up to date
pv index /path/to/project --smart
4. No Results for Single Words
Problem: Single-word searches return nothing
Solutions:
# Try lower threshold
pv search /path/to/project "yourword" --threshold 0.5
# Check if word exists
pv search /path/to/project "yourword" --threshold 0.1 --limit 1
# Reindex with smaller chunks
# Edit config: "chunk_size": 128
pv index /path/to/project --force
5. Missing Recent Changes
Problem: Just-edited code not showing in search
Solutions:
# Run smart incremental
pv index /path/to/project --smart
# Or git-aware
pv index-git /path/to/project --since HEAD~1
# Check status
pv status /path/to/project
6. psutil Not Found
Problem: Optimization not working
Solution:
# Install psutil
pip install psutil
# Verify
python -c "import psutil; print(f'CPUs: {psutil.cpu_count()}, RAM: {psutil.virtual_memory().available / 1024**3:.1f}GB')"
# Try again
pv init /path/to/project --optimize
Debug Mode
# Enable verbose logging
pv --verbose index /path/to/project
# Check project status
pv status /path/to/project
# View config
cat /path/to/project/.vectorizer/config.json
# Check ChromaDB
ls -lh /path/to/project/.vectorizer/chromadb/
Performance Debugging
# Time operations
time pv index /path/to/project
time pv index /path/to/project --max-resources
# Monitor resources during indexing
# Terminal 1:
pv index /path/to/project --max-resources
# Terminal 2:
htop # or top
# Should see high CPU across all cores
# Check memory warnings
pv index /path/to/project --max-resources --verbose
# Look for memory warnings
Changelog
[0.1.3] - 2025-10-13
Fixed
- Hardcoded value in work module - Replaced hardcoded configuration with dynamic variable lookup
- Prevents unexpected behavior when running with custom configs
Notes
- This is a minor bugfix release with no API or CLI changes.
[0.1.2] - 2025-10-13
Added
- Optimized Config Generation - Config.create_optimized() auto-detects CPU/RAM
- Max Resources Flag - --max-resources for temporary performance boost
- psutil Integration - Automatic system resource detection
- Unified Progress Tracking - Clean single-line progress bar
- Library Progress Suppression - No more cluttered batch progress bars
- Timing Information - All operations show elapsed time
- Clean Terminal Output - Professional UI with timing
Performance
- 2x faster full indexing with --max-resources
- 60-70% faster smart incremental updates
- 80-90% faster git-aware indexing
Documentation
- Comprehensive documentation overhaul
- Consolidated all guides into main README
- Added CHANGELOG.md with version history
[0.1.1] - 2025-10-12
- Enhanced single-word search with high precision
- Multi-level chunking (micro + word-level)
- Adaptive search thresholds
- Programming keyword detection
- Improved word matching and relevance boosting
[0.1.0] - 2025-10-10
- Initial release
- Code vectorization
- Smart incremental indexing
- Git-aware indexing
- MCP server
- Watch mode
- ChromaDB backend
- 30+ language support
Contributing
Development Setup
# Clone repository
git clone https://github.com/starkbaknet/project-vectorizer.git
cd project-vectorizer
# Create virtual environment
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
# Install with dev dependencies
pip install -e ".[dev]"
# Run tests
pytest
# Format code
black .
isort .
Running Tests
# All tests
pytest
# With coverage
pytest --cov=project_vectorizer
# Specific test
pytest tests/test_config.py
# Verbose
pytest -v
See docs/TESTING.md for details.
Publishing
See docs/PUBLISHING.md for PyPI publishing guide.
Contributing Guidelines
- Fork repository
- Create feature branch: git checkout -b feature/amazing-feature
- Make changes and add tests
- Ensure tests pass: pytest
- Format code: black . && isort .
- Commit: git commit -m 'Add amazing feature'
- Push: git push origin feature/amazing-feature
- Open Pull Request
License
MIT License - see LICENSE file
Additional Resources
- GitHub: https://github.com/starkbaknet/project-vectorizer
- PyPI: https://pypi.org/project/project-vectorizer/
- Issues: https://github.com/starkbaknet/project-vectorizer/issues
Made with ❤️ by StarkBakNet
Vectorize your codebase. Empower your AI agents. Build better software.