Large Codebase Indexer
A command-line tool for indexing large codebases and enabling AI-powered queries.
Overview
This tool allows you to index any codebase and query it using natural language. It leverages:
- OpenAI's embedding model for code semantics
- Pinecone vector database for efficient storage and retrieval
- Claude LLM for high-quality responses
- LangChain framework for integrating all components
Installation
Option 1: Install from PyPI (recommended)
# Install the package
pip install codebase-indexer
# Configure your API keys interactively
codebase-indexer configure
If you encounter "command not found" errors, use our installer scripts:
For macOS/Linux:
curl -O https://raw.githubusercontent.com/yourusername/indexer/main/easy-install.sh
chmod +x easy-install.sh
./easy-install.sh
For Windows:
curl -O https://raw.githubusercontent.com/yourusername/indexer/main/install-windows.bat
install-windows.bat
Option 2: Install from source
- Clone this repository:
  git clone https://github.com/yourusername/indexer.git
  cd indexer
- Install the package in development mode:
  # Create a virtual environment (recommended)
  python3 -m venv venv
  source venv/bin/activate # On Windows: venv\Scripts\activate
  # Install in development mode
  pip install -e .
- Create a .env file with your API keys:
  cp .env.example .env
  # Edit .env with your actual API keys
Option 3: Quick setup (using the install script)
./install.sh
API Keys
This tool requires API keys for:
- OpenAI (for embeddings)
- Anthropic (for Claude LLM)
- Pinecone (for vector storage)
You can get these keys by signing up with each of these providers.
Setting up API keys
You can configure API keys using the interactive CLI:
# Interactive configuration
codebase-indexer configure
# Non-interactive configuration
codebase-indexer configure --openai=your-openai-key --anthropic=your-anthropic-key --pinecone=your-pinecone-key
The configuration command will:
- Create a .env file if it doesn't exist
- Prompt for missing API keys (or use the ones provided via command-line arguments)
- Allow you to select the Claude model to use
- Validate that all required keys are set
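The validation step can be pictured with a short sketch. It assumes the keys are read from environment variables with the names shown in the Troubleshooting section (OPENAI_API_KEY, ANTHROPIC_API_KEY, PINECONE_API_KEY). The helper name missing_api_keys is hypothetical and is not part of the tool.

```python
import os

# Variable names taken from the .env example in the Troubleshooting section.
REQUIRED_KEYS = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY", "PINECONE_API_KEY")


def missing_api_keys(env=None):
    """Return the required key names that are unset or blank (hypothetical helper)."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_KEYS if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_api_keys()
    if missing:
        print("Missing API keys:", ", ".join(missing))
    else:
        print("All required API keys are set.")
```

Running a check like this before indexing can surface a missing key early rather than partway through a long job.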
Usage
Indexing a Codebase
Index a codebase to create vector embeddings stored in Pinecone:
python src/main.py index --path /path/to/your/codebase --index-name your-index-name
Options:
- --path: Path to the codebase directory (required)
- --index-name: Name of the Pinecone index (default: "codebase-index")
- --namespace: Namespace within the index for this codebase (default: directory name)
- --chunk-size: Size of code chunks (default: 500, recommended range: 300-800)
- --chunk-overlap: Overlap between chunks (default: 100, recommended range: 50-150)
- --extensions: Comma-separated list of file extensions to index (e.g., py,js,java)
- --batch-size: Batch size for indexing (default: 100)
The tool uses a text splitter to divide each file into overlapping chunks. The overlap keeps code that straddles a chunk boundary retrievable from either neighboring chunk. This approach balances simplicity and effectiveness for most codebases. Adjust --chunk-size and --chunk-overlap to suit your codebase structure and typical queries.
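To make the effect of chunk size and overlap concrete, here is a minimal sliding-window sketch. It assumes sizes are measured in characters, which this README does not state. The tool's actual splitter may count differently or prefer natural boundaries such as blank lines. The function name chunk_text is hypothetical.

```python
def chunk_text(text, chunk_size=500, chunk_overlap=100):
    """Split text into fixed-size windows that overlap by chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks
```

With the defaults, each window starts 400 characters after the previous one, so the last 100 characters of one chunk reappear at the start of the next. Larger overlaps preserve more context across boundaries, but they produce more chunks to embed and store.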
By default, the tool indexes only Python, JavaScript (including JSX), TypeScript (including TSX), HTML, CSS, SCSS, Markdown files, and Dockerfiles to focus on the most relevant code files.
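The default filter and the --extensions option can be sketched as follows. The suffix set is an assumed mapping of the file types named above, and the tool's real list may differ. The helper names is_indexable and parse_extensions are hypothetical.

```python
from pathlib import Path

# Assumed suffixes for the file types listed above; not taken from the tool's source.
DEFAULT_SUFFIXES = {".py", ".js", ".jsx", ".ts", ".tsx", ".html", ".css", ".scss", ".md"}


def is_indexable(path, suffixes=DEFAULT_SUFFIXES):
    """Return True for Dockerfiles and for files whose suffix is in the allowed set."""
    path = Path(path)
    return path.name == "Dockerfile" or path.suffix.lower() in suffixes


def parse_extensions(value):
    """Turn a value such as "py,js,java" into {".py", ".js", ".java"}."""
    return {"." + ext.strip().lstrip(".").lower() for ext in value.split(",") if ext.strip()}
```

For example, is_indexable("lib/Main.java") is false under the assumed defaults, but it becomes true when called with parse_extensions("py,js,java") as the second argument.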
Listing Files in a Codebase
List all files in a codebase or show file count by extension:
python src/main.py list --path /path/to/your/codebase --extensions py,js,java
python src/main.py list --path /path/to/your/codebase --count
Analyzing a Codebase
Analyze a codebase to extract project metadata:
python src/main.py analyze --path /path/to/your/codebase
Scanning a Codebase
Scan a codebase to get information about files and languages:
python src/main.py scan --path /path/to/your/codebase
Managing Indexes
List all Pinecone indexes:
python src/main.py list-indexes
Get statistics about an index:
python src/main.py stats --index-name your-index-name
Delete an index or namespace:
python src/main.py delete --index-name your-index-name
python src/main.py delete --index-name your-index-name --namespace your-namespace
Querying the Indexed Codebase
Query the indexed codebase using natural language:
python src/main.py query --query "What does the authenticate_user function do?" --index-name your-index-name --namespace your-namespace
Options:
- --query: The query string (required)
- --index-name: Name of the Pinecone index (default: "codebase-index")
- --namespace: Namespace to query in Pinecone
- --limit: Maximum number of results to return (default: 5)
- --no-mmr: Disable Maximum Marginal Relevance retrieval (MMR is enabled by default)
- --diversity: Diversity parameter for MMR (0-1, default: 0.3)
The tool uses Maximum Marginal Relevance (MMR) by default, which balances relevance with diversity in search results.
How MMR Works:
- First, it finds a larger set of potentially relevant code snippets based on similarity to your query
- Then it selects a diverse subset by considering both:
  - Relevance: How closely each snippet matches your query
  - Diversity: How different each snippet is from ones already selected
This approach avoids returning redundant information and provides broader context from different parts of the codebase. The diversity parameter (default: 0.3) controls this balance:
- Higher values (closer to 1.0) prioritize diversity over relevance
- Lower values (closer to 0.0) prioritize relevance over diversity
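The selection step described above can be sketched as a greedy loop over embedding vectors. This is an illustration of the general MMR technique, not the tool's implementation. It assumes cosine similarity, and it assumes the diversity value is used directly as the weight on the redundancy penalty. The function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query_vec, candidate_vecs, limit=5, diversity=0.3):
    """Return indices of up to `limit` candidates chosen greedily by MMR."""
    relevance = [cosine(query_vec, vec) for vec in candidate_vecs]
    selected = []
    remaining = list(range(len(candidate_vecs)))
    while remaining and len(selected) < limit:
        def score(i):
            # Penalty: similarity to the closest snippet already picked.
            redundancy = max(
                (cosine(candidate_vecs[i], candidate_vecs[j]) for j in selected),
                default=0.0,
            )
            return (1 - diversity) * relevance[i] - diversity * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With a diversity of 0.0 the loop reduces to plain top-k similarity search. As the value rises, a candidate that nearly duplicates an already selected snippet is penalized more heavily, so a less similar but novel snippet can be chosen ahead of it.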
Chat with Conversation History
Have a conversation with the codebase, maintaining context between questions:
python src/main.py chat --query "How does the file loading work?" --index-name your-index-name
python src/main.py chat --query "What parameters does it accept?" --index-name your-index-name
Find Related Code
Get code snippets related to a query without generating an answer:
python src/main.py related --query "error handling" --index-name your-index-name --limit 10
Testing the CLI (without API keys)
You can use the test CLI script for commands that don't require API keys:
python src/test_cli.py list --path /path/to/your/codebase --count
python src/test_cli.py analyze --path /path/to/your/codebase
python src/test_cli.py file --path /path/to/your/codebase/some_file.py
Command-Line Interface
The tool provides the following commands:
- index: Index a codebase
- list: List files in a codebase
- analyze: Analyze a codebase and extract project metadata
- scan: Scan a codebase and output file statistics
- list-indexes: List all Pinecone indexes
- stats: Show statistics about an index
- delete: Delete an index or namespace
- query: Query the indexed codebase
- chat: Chat with the codebase using conversation history
- related: Get code snippets related to a query
Use --help with any command to see available options:
python src/main.py --help
python src/main.py index --help
Development Status
This project is being developed in phases:
- ✅ Environment Setup
- ✅ Command-Line Tool Framework
- ✅ Codebase Indexing
- ✅ RAG System and Agent Development
- 🔜 Testing, Refinement, and Deployment
Project Structure
indexer/
├── docs/
│ └── ADR.md # Architecture Decision Record
├── src/
│ ├── agents/ # RAG agent implementation
│ ├── indexers/ # Code indexing functionality
│ ├── models/ # OpenAI and Claude model wrappers
│ ├── utils/ # Utility functions
│ ├── main.py # Main CLI entry point
│ └── test_cli.py # Test CLI (no API keys required)
├── .env.example # Example environment variables
├── README.md # This file
├── requirements.txt # Python dependencies
└── setup.py # Package installation
Current Features
Milestone 1: Environment Setup
- ✅ Virtual environment and dependency management
- ✅ Configuration and API key handling
- ✅ Logging setup
- ✅ Basic project structure
Milestone 2: Command-Line Tool Framework
- ✅ Argument parsing and command handling
- ✅ Directory traversal for any codebase path
- ✅ File filtering by extension
- ✅ Project metadata extraction
- ✅ Code analysis for supported languages
- ✅ Test CLI for verification without API keys
Milestone 3: Codebase Indexing
- ✅ Loading and chunking code files
- ✅ Generating embeddings with OpenAI
- ✅ Storing embeddings in Pinecone DB
- ✅ Namespace support for multiple codebases
- ✅ Index management (create, delete, stats)
- ✅ Batch processing for large codebases
Milestone 4: RAG System and Agent Development
- ✅ Semantic code retrieval via embeddings
- ✅ Natural language querying of code
- ✅ Conversational interface with memory
- ✅ Code-specific prompt engineering
- ✅ Finding related code snippets
- ✅ Integration with Claude LLM for high-quality responses
License
This project is licensed under the MIT License - see the LICENSE file for details.
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
Troubleshooting
Common Issues
- API key errors: Make sure you have properly set up your .env file with valid API keys.
  OPENAI_API_KEY=your-openai-api-key
  ANTHROPIC_API_KEY=your-anthropic-api-key
  PINECONE_API_KEY=your-pinecone-api-key
- Package not found errors: If you encounter errors about packages not being found, try reinstalling the dependencies:
  pip install -r requirements.txt
- Model not found: If you encounter errors about Claude models not being found, you can update the model name in src/utils/config.py:
  # Try using a different model name if the current one isn't accessible
  LLM_MODEL = "claude-3-haiku-20240307" # or another available model
- Rate limiting: If you hit API rate limits, try reducing the batch size when indexing:
  codebase-indexer index --path /path/to/codebase --batch-size 50
- Import errors: If you see "No module named 'utils'" or similar import errors after installation, this is a Python import path issue. Use one of these solutions:
  - Use the latest version (1.0.1+), which fixes the import issues
  - Install from the latest distribution:
    pip install --upgrade codebase-indexer
  - Or alternatively, use the direct runner script, which includes import path fixes:
    # Download run-indexer.py
    curl -O https://raw.githubusercontent.com/yourusername/indexer/main/run-indexer.py
    python run-indexer.py configure
- Command not found: If you get a "command not found" error after installing with pip, try the following:
  - Use our easy installer script (macOS/Linux):
    curl -O https://raw.githubusercontent.com/yourusername/indexer/main/easy-install.sh
    chmod +x easy-install.sh
    ./easy-install.sh
  - Check that the Python scripts directory is in your PATH:
    # Find where the script was installed
    pip show codebase-indexer
    # On macOS/Linux:
    find ~/Library/Python/*/bin ~/.local/bin -name "codebase-indexer*" 2>/dev/null
    # On Windows (dir does not expand wildcards in directory names, so search recursively):
    dir /s /b "%USERPROFILE%\AppData\Roaming\Python\codebase-indexer*.exe"
    # Add to your PATH:
    # macOS/Linux (add to your .bashrc, .zshrc, etc.; use $HOME because ~ is not expanded inside quotes)
    export PATH="$PATH:$HOME/Library/Python/3.9/bin" # Adjust the path as needed
    # Windows (in Command Prompt as Administrator)
    setx PATH "%PATH%;%USERPROFILE%\AppData\Roaming\Python\Python39\Scripts"
  - Try the alternate script names:
    indexer --help
    code-indexer --help
  - Install with pip install --user to ensure it installs in your user directory:
    pip install --user codebase-indexer
  - Run the package module directly if all else fails (same command on macOS/Linux and Windows):
    python -m src.main
  - Use the direct runner script:
    # Download the runner script
    curl -O https://raw.githubusercontent.com/yourusername/indexer/main/run-indexer.py
    # Or clone the repo and use the script directly
    python run-indexer.py --help
For more help, please open an issue on GitHub.
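For the "command not found" case, the PATH check can also be done from Python using only the standard library. The sketch below is a diagnostic aid under stated assumptions, not part of the tool. The function names are hypothetical, and the directories it reports are the interpreter's standard script locations, which may not match a custom install.

```python
import os
import shutil
import sys
import sysconfig


def find_command(name="codebase-indexer"):
    """Return the full path of `name` if it is on PATH, else None."""
    return shutil.which(name)


def candidate_script_dirs():
    """Directories where pip usually places console scripts for this interpreter."""
    dirs = [sysconfig.get_path("scripts")]
    schemes = [f"{os.name}_user"]  # "posix_user" or "nt_user"
    if sys.platform == "darwin":
        schemes.append("osx_framework_user")  # used by macOS framework builds
    for scheme in schemes:
        # User schemes can be absent when user site-packages are disabled.
        if scheme in sysconfig.get_scheme_names():
            dirs.append(sysconfig.get_path("scripts", scheme))
    return dirs


if __name__ == "__main__":
    location = find_command()
    if location:
        print("Found:", location)
    else:
        print("Not on PATH. Check these directories:")
        for directory in candidate_script_dirs():
            print(" ", directory)
```

If the command is missing from PATH but the executable exists in one of the listed directories, adding that directory to PATH should resolve the error.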