Large Codebase Indexer

A command-line tool for indexing large codebases and enabling AI-powered queries.

Overview

This tool allows you to index any codebase and query it using natural language. It leverages:

OpenAI's embedding model for code semantics
Pinecone vector database for efficient storage and retrieval
Claude LLM for high-quality responses
LangChain framework for integrating all components

Installation

Option 1: Install from PyPI (recommended)

# Install the package
pip install codebase-indexer

# Configure your API keys interactively
codebase-indexer configure

If you encounter "command not found" errors, use our installer scripts:

For macOS/Linux:

curl -O https://raw.githubusercontent.com/yourusername/indexer/main/easy-install.sh
chmod +x easy-install.sh
./easy-install.sh

For Windows:

curl -O https://raw.githubusercontent.com/yourusername/indexer/main/install-windows.bat
install-windows.bat

Option 2: Install from source

Clone this repository:

git clone https://github.com/yourusername/indexer.git
cd indexer

Install the package in development mode:

# Create a virtual environment (recommended)
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install in development mode
pip install -e .

Create a .env file with your API keys:

cp .env.example .env
# Edit .env with your actual API keys

Option 3: Quick setup (using the install script)

./install.sh

API Keys

This tool requires API keys for:

OpenAI (for embeddings)
Anthropic (for Claude LLM)
Pinecone (for vector storage)

You can get these keys by signing up with each provider.

Setting up API keys

You can configure API keys with the configure command, either interactively or by passing the keys as arguments:

# Interactive configuration
codebase-indexer configure

# Non-interactive configuration
codebase-indexer configure --openai=your-openai-key --anthropic=your-anthropic-key --pinecone=your-pinecone-key

The configuration command will:

Create a .env file if it doesn't exist
Prompt for missing API keys (or use the ones provided via command-line arguments)
Allow you to select the Claude model to use
Validate that all required keys are set
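
The validation step can be pictured with a minimal sketch. It assumes the keys are read from environment variables named as in the .env example in the Troubleshooting section; the function name missing_keys is hypothetical, and the real configure command may do more (for example, checking the keys against each service).

```python
import os

# Variable names taken from the .env example in the Troubleshooting section.
REQUIRED_KEYS = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY", "PINECONE_API_KEY")


def missing_keys(env=None):
    """Return the required key names that are unset or blank in env."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_KEYS if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_keys()
    if missing:
        print("Missing API keys: " + ", ".join(missing))
    else:
        print("All required API keys are set.")
```

Passing a plain dict instead of os.environ makes the check easy to try without touching your real environment.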

Usage

Indexing a Codebase

Index a codebase to create vector embeddings stored in Pinecone:

python src/main.py index --path /path/to/your/codebase --index-name your-index-name

Options:

--path: Path to the codebase directory (required)
--index-name: Name of the Pinecone index (default: "codebase-index")
--namespace: Namespace within the index for this codebase (default: directory name)
--chunk-size: Size of code chunks (default: 500, recommended range: 300-800)
--chunk-overlap: Overlap between chunks (default: 100, recommended range: 50-150)
--extensions: Comma-separated list of file extensions to index (e.g., py,js,java)
--batch-size: Batch size for indexing (default: 100)

The tool uses a text splitter that divides each file into chunks of up to --chunk-size, where consecutive chunks share --chunk-overlap of content so that code cut at a chunk boundary still appears with some of its surrounding context. This simple approach works well for most codebases. Adjust --chunk-size and --chunk-overlap to suit your codebase structure and typical queries.
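
To illustrate how chunk size and overlap interact, here is a minimal sketch of fixed-offset chunking. It assumes sizes are measured in characters, which this README does not state, and the function chunk_text is hypothetical. The tool's actual splitter may break on separators such as newlines rather than at fixed offsets.

```python
def chunk_text(text, chunk_size=500, chunk_overlap=100):
    """Split text into chunks of at most chunk_size characters.

    Each chunk after the first starts chunk_overlap characters before
    the end of the previous one. Character units are an assumption.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks
```

With the defaults, each new chunk begins 400 characters after the previous one, so a larger overlap produces more chunks (and more embeddings) for the same file.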

By default, the tool indexes only Python, JavaScript (including JSX), TypeScript (including TSX), HTML, CSS, SCSS, Markdown files, and Dockerfiles to focus on the most relevant code files.
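
The default file selection can be sketched as a simple predicate. The extension set below is inferred from the file types named above and is an assumption, as are the names DEFAULT_EXTENSIONS, is_indexable, and parse_extensions; the tool's real list and matching rules may differ.

```python
from pathlib import Path

# Hypothetical default set inferred from the file types listed above.
DEFAULT_EXTENSIONS = {"py", "js", "jsx", "ts", "tsx", "html", "css", "scss", "md"}


def parse_extensions(value):
    """Turn a comma-separated string such as 'py,js,java' into a set."""
    return {ext.strip().lstrip(".").lower() for ext in value.split(",") if ext.strip()}


def is_indexable(path, extensions=None):
    """Return True if path should be indexed.

    With no explicit extensions, the default set applies and files
    named Dockerfile are matched by name, since they have no extension.
    """
    p = Path(path)
    if extensions is None:
        if p.name == "Dockerfile":
            return True
        allowed = DEFAULT_EXTENSIONS
    else:
        allowed = set(extensions)
    return p.suffix.lstrip(".").lower() in allowed
```

For example, is_indexable("Main.java", parse_extensions("py,java")) mirrors what passing --extensions py,java would be expected to select.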

Listing Files in a Codebase

List all files in a codebase or show file count by extension:

python src/main.py list --path /path/to/your/codebase --extensions py,js,java
python src/main.py list --path /path/to/your/codebase --count
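
A count by extension, as the --count option reports, can be approximated with the standard library. This is only a sketch: count_by_extension is a hypothetical helper, and the real command may skip hidden or ignored directories that this version walks.

```python
import os
from collections import Counter
from pathlib import Path


def count_by_extension(root, extensions=None):
    """Count files under root by extension.

    If extensions is given, only those extensions are counted.
    Files without an extension are grouped under "(none)".
    """
    counts = Counter()
    for _dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            ext = Path(name).suffix.lstrip(".").lower() or "(none)"
            if extensions is None or ext in extensions:
                counts[ext] += 1
    return counts


if __name__ == "__main__":
    # Print the most common extensions in the current directory.
    for ext, n in count_by_extension(".").most_common():
        print(f"{ext}: {n}")
```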

Analyzing a Codebase

Analyze a codebase to extract project metadata:

python src/main.py analyze --path /path/to/your/codebase

Scanning a Codebase

Scan a codebase to get information about files and languages:

python src/main.py scan --path /path/to/your/codebase

Managing Indexes

List all Pinecone indexes:

python src/main.py list-indexes

Get statistics about an index:

python src/main.py stats --index-name your-index-name

Delete an index or namespace:

python src/main.py delete --index-name your-index-name
python src/main.py delete --index-name your-index-name --namespace your-namespace

Querying the Indexed Codebase

Query the indexed codebase using natural language:

python src/main.py query --query "What does the authenticate_user function do?" --index-name your-index-name --namespace your-namespace

Options:

--query: The query string (required)
--index-name: Name of the Pinecone index (default: "codebase-index")
--namespace: Namespace to query in Pinecone
--limit: Maximum number of results to return (default: 5)
--no-mmr: Disable Maximum Marginal Relevance retrieval (MMR is enabled by default)
--diversity: Diversity parameter for MMR (0-1, default: 0.3)

The tool uses Maximum Marginal Relevance (MMR) by default, which balances relevance with diversity in search results.

How MMR Works:

First, it finds a larger set of potentially relevant code snippets based on similarity to your query
Then it selects a diverse subset by considering both:
- Relevance: How closely each snippet matches your query
- Diversity: How different each snippet is from ones already selected

This approach avoids returning redundant information and provides broader context from different parts of the codebase. The diversity parameter (default: 0.3) controls this balance:

Higher values (closer to 1.0) prioritize diversity over relevance
Lower values (closer to 0.0) prioritize relevance over diversity
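
The selection step described above can be sketched in plain Python. This is the textbook greedy form of MMR, not necessarily the tool's implementation, which likely delegates to its vector store integration. The mapping of the diversity parameter onto the score (higher means more diverse) follows the description above but is otherwise an assumption, and the function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query_vec, candidates, limit=5, diversity=0.3):
    """Greedily pick up to limit candidate indices using MMR.

    Each step scores a remaining candidate as
    (1 - diversity) * relevance - diversity * redundancy,
    where redundancy is its highest similarity to any pick so far.
    """
    relevance = [cosine(query_vec, c) for c in candidates]
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < limit:
        def score(i):
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return (1 - diversity) * relevance[i] - diversity * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With diversity set to 0.0 the result is a plain relevance ranking; raising it makes a near-duplicate of an already selected snippet lose out to a less similar one.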

Chat with Conversation History

Have a conversation with the codebase, maintaining context between questions:

python src/main.py chat --query "How does the file loading work?" --index-name your-index-name
python src/main.py chat --query "What parameters does it accept?" --index-name your-index-name

Find Related Code

Get code snippets related to a query without generating an answer:

python src/main.py related --query "error handling" --index-name your-index-name --limit 10

Testing the CLI (without API keys)

You can use the test CLI script for commands that don't require API keys:

python src/test_cli.py list --path /path/to/your/codebase --count
python src/test_cli.py analyze --path /path/to/your/codebase
python src/test_cli.py file --path /path/to/your/codebase/some_file.py

Command-Line Interface

The tool provides the following commands:

index: Index a codebase
list: List files in a codebase
analyze: Analyze a codebase and extract project metadata
scan: Scan a codebase and output file statistics
list-indexes: List all Pinecone indexes
stats: Show statistics about an index
delete: Delete an index or namespace
query: Query the indexed codebase
chat: Chat with the codebase using conversation history
related: Get code snippets related to a query

Use --help with any command to see available options:

python src/main.py --help
python src/main.py index --help

Development Status

This project is being developed in phases:

✅ Environment Setup
✅ Command-Line Tool Framework
✅ Codebase Indexing
✅ RAG System and Agent Development
🔜 Testing, Refinement, and Deployment

Project Structure

indexer/
├── docs/
│   └── ADR.md           # Architecture Decision Record
├── src/
│   ├── agents/          # RAG agent implementation
│   ├── indexers/        # Code indexing functionality
│   ├── models/          # OpenAI and Claude model wrappers
│   ├── utils/           # Utility functions
│   ├── main.py          # Main CLI entry point
│   └── test_cli.py      # Test CLI (no API keys required)
├── .env.example         # Example environment variables
├── README.md            # This file
├── requirements.txt     # Python dependencies
└── setup.py             # Package installation

Current Features

Milestone 1: Environment Setup

✅ Virtual environment and dependency management
✅ Configuration and API key handling
✅ Logging setup
✅ Basic project structure

Milestone 2: Command-Line Tool Framework

✅ Argument parsing and command handling
✅ Directory traversal for any codebase path
✅ File filtering by extension
✅ Project metadata extraction
✅ Code analysis for supported languages
✅ Test CLI for verification without API keys

Milestone 3: Codebase Indexing

✅ Loading and chunking code files
✅ Generating embeddings with OpenAI
✅ Storing embeddings in Pinecone DB
✅ Namespace support for multiple codebases
✅ Index management (create, delete, stats)
✅ Batch processing for large codebases

Milestone 4: RAG System and Agent Development

✅ Semantic code retrieval via embeddings
✅ Natural language querying of code
✅ Conversational interface with memory
✅ Code-specific prompt engineering
✅ Finding related code snippets
✅ Integration with Claude LLM for high-quality responses

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Troubleshooting

Common Issues

API key errors: Make sure you have properly set up your .env file with valid API keys.

OPENAI_API_KEY=your-openai-api-key
ANTHROPIC_API_KEY=your-anthropic-api-key
PINECONE_API_KEY=your-pinecone-api-key

Package not found errors: If you encounter errors about packages not being found, try reinstalling the dependencies:
```
pip install -r requirements.txt
```

Model not found: If you encounter errors about Claude models not being found, you can update the model name in src/utils/config.py:

# Try using a different model name if the current one isn't accessible
LLM_MODEL = "claude-3-haiku-20240307"  # or another available model

Rate limiting: If you hit API rate limits, try reducing the batch size when indexing:
```
codebase-indexer index --path /path/to/codebase --batch-size 50
```

Import errors: If you see "No module named 'utils'" or similar import errors after installation, the cause is a Python import path problem. Use one of these solutions:
- Upgrade to the latest version (1.0.1 or later), which fixes the import issues:
```
pip install --upgrade codebase-indexer
```
- Alternatively, use the direct runner script, which includes import path fixes:
```
# Download run-indexer.py
curl -O https://raw.githubusercontent.com/yourusername/indexer/main/run-indexer.py
python run-indexer.py configure
```

Command not found: If you get a "command not found" error after installing with pip:

Use our easy installer script (macOS/Linux):

curl -O https://raw.githubusercontent.com/yourusername/indexer/main/easy-install.sh
chmod +x easy-install.sh
./easy-install.sh

Check that the Python scripts directory is in your PATH:

# Find where the script was installed
pip show codebase-indexer

# On macOS/Linux:
find ~/Library/Python/*/bin ~/.local/bin -name "codebase-indexer*" 2>/dev/null

# On Windows:
dir %USERPROFILE%\AppData\Roaming\Python\*\Scripts\codebase-indexer*.exe

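# Add to your PATH:
# macOS/Linux (add to your .bashrc, .zshrc, etc.)
# Use $HOME rather than ~, which is not expanded inside double quotes
export PATH="$PATH:$HOME/Library/Python/3.9/bin"  # Adjust the path as needed

# Windows (in Command Prompt as Administrator)
# Note: setx truncates values longer than 1024 characters
setx PATH "%PATH%;%USERPROFILE%\AppData\Roaming\Python\Python39\Scripts"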

Try the alternate script names:
```
indexer --help
code-indexer --help
```
Install with pip install --user to ensure it installs in your user directory:
```
pip install --user codebase-indexer
```

Run the package module directly if all else fails:

# Same command on macOS/Linux and Windows:
python -m src.main

Use the direct runner script:

# Download the runner script
curl -O https://raw.githubusercontent.com/yourusername/indexer/main/run-indexer.py
# Or clone the repo and use the script directly
python run-indexer.py --help
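
When diagnosing the PATH problems described above, a short Python check can confirm whether the command is visible and whether a given directory is on PATH. This is a sketch using only the standard library; find_command and dir_on_path are hypothetical helpers, and the directory you pass in is whatever the find or dir command above reported.

```python
import os
import shutil


def find_command(name="codebase-indexer"):
    """Return the full path of name if it is found on PATH, else None."""
    return shutil.which(name)


def dir_on_path(directory, path_value=None):
    """Return True if directory is one of the entries in PATH.

    PATH entries are compared as written, so a literal "~" in PATH
    does not match an expanded home directory.
    """
    path_value = os.environ.get("PATH", "") if path_value is None else path_value
    target = os.path.normcase(os.path.normpath(os.path.expanduser(directory)))
    entries = [e for e in path_value.split(os.pathsep) if e]
    return any(os.path.normcase(os.path.normpath(e)) == target for e in entries)


if __name__ == "__main__":
    location = find_command()
    print(location or "codebase-indexer is not on PATH")
```

If find_command() returns None but the script exists on disk, dir_on_path() on its directory tells you whether the PATH entry is missing.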

For more help, please open an issue on GitHub.
