A command-line tool for indexing and querying large codebases using AI

Large Codebase Indexer

A command-line tool for indexing large codebases and enabling AI-powered queries.

Overview

This tool allows you to index any codebase and query it using natural language. It leverages:

  • OpenAI's embedding model for code semantics
  • Pinecone vector database for efficient storage and retrieval
  • Claude LLM for high-quality responses
  • LangChain framework for integrating all components

Installation

Option 1: Install from PyPI (recommended)

# Install the package
pip install codebase-indexer

# Configure your API keys interactively
codebase-indexer configure

If you encounter "command not found" errors, use our installer scripts:

For macOS/Linux:

curl -O https://raw.githubusercontent.com/yourusername/indexer/main/easy-install.sh
chmod +x easy-install.sh
./easy-install.sh

For Windows:

curl -O https://raw.githubusercontent.com/yourusername/indexer/main/install-windows.bat
install-windows.bat

Option 2: Install from source

  1. Clone this repository:

    git clone https://github.com/yourusername/indexer.git
    cd indexer
    
  2. Install the package in development mode:

    # Create a virtual environment (recommended)
    python3 -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
    
    # Install in development mode
    pip install -e .
    
  3. Create a .env file with your API keys:

    cp .env.example .env
    # Edit .env with your actual API keys
    

Option 3: Quick setup (using the install script)

./install.sh

API Keys

This tool requires API keys for:

  • OpenAI (for embeddings)
  • Anthropic (for Claude LLM)
  • Pinecone (for vector storage)

You can get these keys by signing up with each provider.

Setting up API keys

You can configure API keys using the interactive CLI:

# Interactive configuration
codebase-indexer configure

# Non-interactive configuration
codebase-indexer configure --openai=your-openai-key --anthropic=your-anthropic-key --pinecone=your-pinecone-key

The configuration command will:

  1. Create a .env file if it doesn't exist
  2. Prompt for missing API keys (or use the ones provided via command-line arguments)
  3. Allow you to select the Claude model to use
  4. Validate that all required keys are set
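The validation step can be pictured with a minimal sketch. It assumes the .env file holds simple KEY=value lines and uses the three variable names shown in the Troubleshooting section; `read_env_file` and `missing_keys` are hypothetical helper names for illustration, not functions of this package.

```python
from pathlib import Path

# Variable names as listed in the Troubleshooting section of this README.
REQUIRED_KEYS = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY", "PINECONE_API_KEY")


def read_env_file(path):
    """Parse simple KEY=value lines, skipping blank lines and comments.

    Hypothetical helper: the real tool may parse .env files differently.
    """
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for line in env_path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values):
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not values.get(key)]


if __name__ == "__main__":
    absent = missing_keys(read_env_file(".env"))
    if absent:
        print("Missing API keys:", ", ".join(absent))
    else:
        print("All required API keys are set.")
```

A check like this only confirms that each key is present; it does not confirm that the provider accepts it.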

Usage

Indexing a Codebase

Index a codebase to create vector embeddings stored in Pinecone:

python src/main.py index --path /path/to/your/codebase --index-name your-index-name

Options:

  • --path: Path to the codebase directory (required)
  • --index-name: Name of the Pinecone index (default: "codebase-index")
  • --namespace: Namespace within the index for this codebase (default: directory name)
  • --chunk-size: Size of code chunks (default: 500, recommended range: 300-800)
  • --chunk-overlap: Overlap between chunks (default: 100, recommended range: 50-150)
  • --extensions: Comma-separated list of file extensions to index (e.g., py,js,java)
  • --batch-size: Batch size for indexing (default: 100)

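The tool uses a text splitter that divides each file into chunks no larger than --chunk-size, with consecutive chunks sharing up to --chunk-overlap of content so that code spanning a chunk boundary stays retrievable. This approach balances simplicity and effectiveness for most codebases. Smaller chunks tend to give more precise matches, while larger chunks keep more surrounding context per result; adjust both options to suit your codebase structure and typical queries.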
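To illustrate how chunk size and overlap interact, here is a minimal sliding-window sketch. It assumes both values are measured in characters, which this README does not state, and `split_with_overlap` is a hypothetical function; the tool's actual splitter may cut at different boundaries.

```python
def split_with_overlap(text, chunk_size=500, chunk_overlap=100):
    """Split text into windows of chunk_size that overlap by chunk_overlap.

    Illustrative only: sizes are counted in characters here (an assumption).
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


if __name__ == "__main__":
    sample = "x" * 1200
    print([len(chunk) for chunk in split_with_overlap(sample)])
```

With the defaults, each window advances 400 characters, so a larger overlap produces more chunks (and more embeddings to store) for the same file.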

By default, the tool indexes only Python, JavaScript (including JSX), TypeScript (including TSX), HTML, CSS, SCSS, Markdown files, and Dockerfiles to focus on the most relevant code files.
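A file filter along these lines could look like the sketch below. The extension strings are inferred from the languages named above, and `DEFAULT_EXTENSIONS` and `should_index` are hypothetical names, so treat this as an illustration rather than the tool's actual filter.

```python
from pathlib import Path

# Inferred from the default file types listed above (an assumption).
DEFAULT_EXTENSIONS = {"py", "js", "jsx", "ts", "tsx", "html", "css", "scss", "md"}


def should_index(path, extensions=None):
    """Decide whether a file would be indexed.

    extensions mirrors the comma-separated --extensions value, e.g. "py,js,java".
    When it is None, the default set applies and Dockerfiles match by name.
    """
    p = Path(path)
    if extensions is None:
        if p.name == "Dockerfile":
            return True  # Dockerfiles have no extension, so match the file name
        allowed = DEFAULT_EXTENSIONS
    else:
        allowed = {e.strip().lower().lstrip(".") for e in extensions.split(",") if e.strip()}
    return p.suffix.lower().lstrip(".") in allowed


if __name__ == "__main__":
    for name in ("src/app.tsx", "Dockerfile", "main.go"):
        print(name, should_index(name))
```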

Listing Files in a Codebase

List all files in a codebase or show file count by extension:

python src/main.py list --path /path/to/your/codebase --extensions py,js,java
python src/main.py list --path /path/to/your/codebase --count
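For readers curious what a per-extension count involves, the following sketch walks a directory tree and tallies files by extension. `count_by_extension` is a hypothetical function written for this illustration; the output format of the real `list --count` command is not shown here.

```python
import os
from collections import Counter


def count_by_extension(root, extensions=None):
    """Count files under root by extension, optionally limited to some extensions."""
    wanted = {ext.lower().lstrip(".") for ext in extensions} if extensions else None
    counts = Counter()
    for _dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            ext = os.path.splitext(name)[1].lower().lstrip(".")
            if wanted is None or ext in wanted:
                counts[ext or "(none)"] += 1  # group extensionless files together
    return counts


if __name__ == "__main__":
    for ext, total in count_by_extension(".").most_common():
        print(f"{ext}: {total}")
```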

Analyzing a Codebase

Analyze a codebase to extract project metadata:

python src/main.py analyze --path /path/to/your/codebase

Scanning a Codebase

Scan a codebase to get information about files and languages:

python src/main.py scan --path /path/to/your/codebase

Managing Indexes

List all Pinecone indexes:

python src/main.py list-indexes

Get statistics about an index:

python src/main.py stats --index-name your-index-name

Delete an index or namespace:

python src/main.py delete --index-name your-index-name
python src/main.py delete --index-name your-index-name --namespace your-namespace

Querying the Indexed Codebase

Query the indexed codebase using natural language:

python src/main.py query --query "What does the authenticate_user function do?" --index-name your-index-name --namespace your-namespace

Options:

  • --query: The query string (required)
  • --index-name: Name of the Pinecone index (default: "codebase-index")
  • --namespace: Namespace to query in Pinecone
  • --limit: Maximum number of results to return (default: 5)
  • --no-mmr: Disable Maximum Marginal Relevance retrieval (MMR is enabled by default)
  • --diversity: Diversity parameter for MMR (0-1, default: 0.3)

The tool uses Maximum Marginal Relevance (MMR) by default, which balances relevance with diversity in search results.

How MMR Works:

  1. First, it finds a larger set of potentially relevant code snippets based on similarity to your query
  2. Then it selects a diverse subset by considering both:
    • Relevance: How closely each snippet matches your query
    • Diversity: How different each snippet is from ones already selected

This approach avoids returning redundant information and provides broader context from different parts of the codebase. The diversity parameter (default: 0.3) controls this balance:

  • Higher values (closer to 1.0) prioritize diversity over relevance
  • Lower values (closer to 0.0) prioritize relevance over diversity
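The selection step can be sketched in plain Python. This is a generic MMR formulation, not the tool's code: it assumes cosine similarity between embedding vectors and assumes the diversity value weights the redundancy penalty directly, and `cosine` and `mmr_select` are hypothetical names.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query_vec, candidates, limit=5, diversity=0.3):
    """Return indices of candidates picked by Maximum Marginal Relevance.

    Each round picks the candidate with the best trade-off between
    relevance to the query and similarity to what is already selected.
    """
    relevance = [cosine(query_vec, c) for c in candidates]
    selected = []
    remaining = list(range(len(candidates)))

    while remaining and len(selected) < limit:
        def score(i):
            # Redundancy: similarity to the closest already-selected snippet.
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return (1 - diversity) * relevance[i] - diversity * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


if __name__ == "__main__":
    query = [1.0, 0.0]
    snippets = [[1.0, 0.0], [0.99, 0.1], [0.6, 0.8]]
    print(mmr_select(query, snippets, limit=2, diversity=0.3))
    print(mmr_select(query, snippets, limit=2, diversity=0.7))
```

In this toy example the second snippet is nearly identical to the first, so a higher diversity value skips it in favor of the less similar third snippet.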

Chat with Conversation History

Have a conversation with the codebase, maintaining context between questions:

python src/main.py chat --query "How does the file loading work?" --index-name your-index-name
python src/main.py chat --query "What parameters does it accept?" --index-name your-index-name

Find Related Code

Get code snippets related to a query without generating an answer:

python src/main.py related --query "error handling" --index-name your-index-name --limit 10

Testing the CLI (without API keys)

You can use the test CLI script for commands that don't require API keys:

python src/test_cli.py list --path /path/to/your/codebase --count
python src/test_cli.py analyze --path /path/to/your/codebase
python src/test_cli.py file --path /path/to/your/codebase/some_file.py

Command-Line Interface

The tool provides the following commands:

  • index: Index a codebase
  • list: List files in a codebase
  • analyze: Analyze a codebase and extract project metadata
  • scan: Scan a codebase and output file statistics
  • list-indexes: List all Pinecone indexes
  • stats: Show statistics about an index
  • delete: Delete an index or namespace
  • query: Query the indexed codebase
  • chat: Chat with the codebase using conversation history
  • related: Get code snippets related to a query

Use --help with any command to see available options:

python src/main.py --help
python src/main.py index --help

Development Status

This project is being developed in phases:

  1. ✅ Environment Setup
  2. ✅ Command-Line Tool Framework
  3. ✅ Codebase Indexing
  4. ✅ RAG System and Agent Development
  5. 🔜 Testing, Refinement, and Deployment

Project Structure

indexer/
├── docs/
│   └── ADR.md           # Architecture Decision Record
├── src/
│   ├── agents/          # RAG agent implementation
│   ├── indexers/        # Code indexing functionality
│   ├── models/          # OpenAI and Claude model wrappers
│   ├── utils/           # Utility functions
│   ├── main.py          # Main CLI entry point
│   └── test_cli.py      # Test CLI (no API keys required)
├── .env.example         # Example environment variables
├── README.md            # This file
├── requirements.txt     # Python dependencies
└── setup.py             # Package installation

Current Features

Milestone 1: Environment Setup

  • ✅ Virtual environment and dependency management
  • ✅ Configuration and API key handling
  • ✅ Logging setup
  • ✅ Basic project structure

Milestone 2: Command-Line Tool Framework

  • ✅ Argument parsing and command handling
  • ✅ Directory traversal for any codebase path
  • ✅ File filtering by extension
  • ✅ Project metadata extraction
  • ✅ Code analysis for supported languages
  • ✅ Test CLI for verification without API keys

Milestone 3: Codebase Indexing

  • ✅ Loading and chunking code files
  • ✅ Generating embeddings with OpenAI
  • ✅ Storing embeddings in Pinecone DB
  • ✅ Namespace support for multiple codebases
  • ✅ Index management (create, delete, stats)
  • ✅ Batch processing for large codebases

Milestone 4: RAG System and Agent Development

  • ✅ Semantic code retrieval via embeddings
  • ✅ Natural language querying of code
  • ✅ Conversational interface with memory
  • ✅ Code-specific prompt engineering
  • ✅ Finding related code snippets
  • ✅ Integration with Claude LLM for high-quality responses

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Troubleshooting

Common Issues

  1. API key errors: Make sure you have properly set up your .env file with valid API keys.

    OPENAI_API_KEY=your-openai-api-key
    ANTHROPIC_API_KEY=your-anthropic-api-key
    PINECONE_API_KEY=your-pinecone-api-key
    
  2. Package not found errors: If you encounter errors about packages not being found, try reinstalling the dependencies:

    pip install -r requirements.txt
    
  3. Model not found: If you encounter errors about Claude models not being found, you can update the model name in src/utils/config.py:

    # Try using a different model name if the current one isn't accessible
    LLM_MODEL = "claude-3-haiku-20240307"  # or another available model
    
  4. Rate limiting: If you hit API rate limits, try reducing the batch size when indexing:

    codebase-indexer index --path /path/to/codebase --batch-size 50
    
  5. Import errors: If you see "No module named 'utils'" or similar import errors after installation, this is a Python import path issue. Use one of these solutions:

    • Use the latest version (1.0.1+) which fixes the import issues
    • Install from the latest distribution:
      pip install --upgrade codebase-indexer
      
    • Or alternatively, use the direct runner script which includes import path fixes:
      # Download run-indexer.py
      curl -O https://raw.githubusercontent.com/yourusername/indexer/main/run-indexer.py
      python run-indexer.py configure
      
  6. Command not found: If you get a "command not found" error after installing with pip:

    • Use our easy installer script (macOS/Linux):

      curl -O https://raw.githubusercontent.com/yourusername/indexer/main/easy-install.sh
      chmod +x easy-install.sh
      ./easy-install.sh
      
    • Check that the Python scripts directory is in your PATH:

      # Find where the script was installed
      pip show codebase-indexer
      
      # On macOS/Linux:
      find ~/Library/Python/*/bin ~/.local/bin -name "codebase-indexer*" 2>/dev/null
      
      # On Windows:
      dir %USERPROFILE%\AppData\Roaming\Python\*\Scripts\codebase-indexer*.exe
      
      # Add to your PATH:
      # macOS/Linux (add to your .bashrc, .zshrc, etc.)
      export PATH="$PATH:$HOME/Library/Python/3.9/bin"  # Adjust the path as needed; use $HOME because ~ is not expanded inside double quotes
      
      # Windows (in Command Prompt as Administrator)
      # Caution: setx truncates values longer than 1024 characters, so check the length of PATH first
      setx PATH "%PATH%;%USERPROFILE%\AppData\Roaming\Python\Python39\Scripts"
      
    • Try the alternate script names:

      indexer --help
      code-indexer --help
      
    • Install with pip install --user to ensure it installs in your user directory:

      pip install --user codebase-indexer
      
    • Run the package module directly if all else fails:

      # Same command on macOS/Linux and Windows:
      python -m src.main
      
    • Use the direct runner script:

      # Download the runner script
      curl -O https://raw.githubusercontent.com/yourusername/indexer/main/run-indexer.py
      # Or clone the repo and use the script directly
      python run-indexer.py --help
      

For more help, please open an issue on GitHub.
