
A library to index a code repository and chat with it via LLMs.


repo2vec

An open-source pair programmer for chatting with any codebase.

[Screenshot: our chat window, showing a conversation with the Transformers library. 🚀]

Getting started

Installation

To install the library, simply run `pip install repo2vec`.
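
A minimal install-and-verify sketch (assuming the `r2v-index` and `r2v-chat` entry points described below are on your `PATH` after installation, and that they accept the conventional `--help` flag):

```
pip install repo2vec

# Confirm the package is installed and the CLI entry points resolve.
pip show repo2vec
r2v-index --help
r2v-chat --help
```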

Prerequisites

repo2vec performs two steps:

  1. Indexes your codebase (requiring an embedder and a vector store)
  2. Enables chatting via LLM + RAG (requiring access to an LLM)

💻 Running locally
  1. To index the codebase locally, we use the open-source project Marqo, which is both an embedder and a vector store. To bring up a Marqo instance:

    docker rm -f marqo
    docker pull marqoai/marqo:latest
    docker run --name marqo -it -p 8882:8882 marqoai/marqo:latest
    
  2. To chat with an LLM locally, we use Ollama:

    • Head over to ollama.com to download the appropriate binary for your machine.
    • Pull the desired model, e.g. ollama pull llama3.1.
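
Once Marqo and Ollama are both up, a quick optional sanity check using their own standard CLIs (nothing here is repo2vec-specific):

```
# The Marqo container should be listed and bound to port 8882.
docker ps --filter "name=marqo"

# The model you pulled (e.g. llama3.1) should appear in Ollama's local list.
ollama list
```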

☁️ Using external providers
  1. We support OpenAI for embeddings (they have a super fast batch embedding API) and Pinecone for the vector store. So you will need two API keys:

    export OPENAI_API_KEY=...
    export PINECONE_API_KEY=...
    
  2. For chatting with an LLM, we support OpenAI and Anthropic. For the latter, set an additional API key:

    export ANTHROPIC_API_KEY=...
    

Optional: if you plan to index GitHub issues in addition to the codebase, you will also need a GitHub token:

    export GITHUB_TOKEN=...
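
A small, optional check that the environment variables mentioned above are actually set in your current shell before you start indexing (plain bash; not part of repo2vec):

```
# Prints "set" or "MISSING" for each key this README references.
for var in OPENAI_API_KEY PINECONE_API_KEY ANTHROPIC_API_KEY GITHUB_TOKEN; do
    [ -n "${!var}" ] && echo "$var: set" || echo "$var: MISSING"
done
```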

Running it

💻 Running locally

To index the codebase:

# github-repo-name is e.g. Storia-AI/repo2vec
r2v-index github-repo-name \
    --embedder-type=marqo \
    --vector-store-type=marqo \
    --index-name=your-index-name

To chat with your codebase:

r2v-chat github-repo-name \
    --vector-store-type=marqo \
    --index-name=your-index-name \
    --llm-provider=ollama \
    --llm-model=llama3.1

☁️ Using external providers

To index the codebase:

# github-repo-name is e.g. Storia-AI/repo2vec
r2v-index github-repo-name \
    --embedder-type=openai \
    --vector-store-type=pinecone \
    --index-name=your-index-name

To chat with your codebase:

r2v-chat github-repo-name \
    --vector-store-type=pinecone \
    --index-name=your-index-name \
    --llm-provider=openai \
    --llm-model=gpt-4

To get a public URL for your chat app, set `--share=true`.

Additional features

  • Control which files get indexed based on their extension. You can whitelist or blacklist extensions by passing a file with one extension per line, in the format `.ext` (an example extensions file is shown at the end of this list):

    • To only index a whitelist of files:

      ```
      r2v-index ... --include=/path/to/extensions/file
      ```
      
    • To index all code except a blacklist of files:

      ```
      r2v-index ... --exclude=/path/to/extensions/file
      ```
      
  • Index open GitHub issues (remember to export GITHUB_TOKEN=...):

    • To index GitHub issues without comments:

      ```
      r2v-index ... --index-issues
      ```
      
    • To index GitHub issues with comments:

      ```
      r2v-index ... --index-issues --index-issue-comments
      ```
      
    • To index GitHub issues, but not the codebase:

      ```
      r2v-index ... --index-issues --no-index-repo
      ```
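
For concreteness, a whitelist extensions file is just a plain text file with one extension per line. For example, a file named python_docs.txt (the name is arbitrary) containing:

```
.py
.ipynb
.md
```

could be passed as `r2v-index ... --include=python_docs.txt` to restrict indexing to Python sources, notebooks, and Markdown docs.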
      

Why chat with a codebase?

Sometimes you just want to learn how a codebase works and how to integrate it, without spending hours sifting through the code itself.

repo2vec is like an open-source GitHub Copilot with the most up-to-date information about your repo.

Features:

  • Dead-simple set-up. Run two scripts and you have a functional chat interface for your code. That's really it.
  • Heavily documented answers. Every response shows where in the code the context for the answer was pulled from. Let's build trust in the AI.
  • Runs locally or in the cloud.
  • Plug-and-play. Want to improve the algorithms powering the code understanding/generation? We've made every component of the pipeline easily swappable. Google-grade engineering standards allow you to customize to your heart's content.

Changelog

  • 2024-09-06: Updated command names to r2v-index and r2v-chat to avoid clash with local utilities.
  • 2024-09-03: repo2vec is now available on PyPI.
  • 2024-09-03: Support for indexing GitHub issues.
  • 2024-08-30: Support for running everything locally (Marqo for embeddings, Ollama for LLMs).

Want your repository hosted?

We're working to make all code on the internet searchable and understandable for devs. You can check out our early product, Code Sage. We pre-indexed a slew of OSS repos, and you can index your desired ones by simply pasting a GitHub URL.

If you're the maintainer of an OSS repo and would like a dedicated page on Code Sage (e.g. sage.storia.ai/your-repo), then send us a message at founders@storia.ai. We'll do it for free!

Extensions & Contributions

We purposely built the code to be modular, so you can plug in your desired embedding, LLM, and vector store providers by simply implementing the relevant abstract classes.
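
For illustration only, here is a rough sketch of what plugging in a custom embedder could look like. The actual abstract classes and method names live in the repo2vec source; `Embedder`, `embed`, and `MyCustomEmbedder` below are hypothetical stand-ins, not the library's real API:

```python
# Hypothetical sketch: the names below are illustrative, not repo2vec's real interfaces.
from abc import ABC, abstractmethod
from typing import List


class Embedder(ABC):
    """Stand-in for the kind of abstract class a custom provider would implement."""

    @abstractmethod
    def embed(self, chunks: List[str]) -> List[List[float]]:
        """Return one embedding vector per input text chunk."""


class MyCustomEmbedder(Embedder):
    """Example plug-in backed by your own model or embedding service."""

    def __init__(self, dimension: int = 256):
        self.dimension = dimension

    def embed(self, chunks: List[str]) -> List[List[float]]:
        # Replace this placeholder with a call to your embedding model or API.
        return [[0.0] * self.dimension for _ in chunks]
```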

Feel free to send feature requests to founders@storia.ai or make a pull request!

Download files

Download the file for your platform. If you're not sure which to choose, see the Python packaging documentation on installing packages.

Source Distribution

repo2vec-0.1.6.tar.gz (26.6 kB, Source)

Built Distribution

repo2vec-0.1.6-py3-none-any.whl (27.7 kB, Python 3)

File details

Details for the file repo2vec-0.1.6.tar.gz.

File metadata

  • File name: repo2vec-0.1.6.tar.gz
  • Size: 26.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.12.5

File hashes

Hashes for repo2vec-0.1.6.tar.gz:

  • SHA256: f5d4c9cd1a4de21cb3aa684cf7164a104ca87224ca320097fb376548b45f2205
  • MD5: b91b640d46e4362c6e030d19d3fa18f1
  • BLAKE2b-256: 540a0fec93e1f1c8286605e0f0e9463983fc9b25a87d6095179fc420bbe5fa23

See the pip documentation for more details on using hashes.
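
As a quick sketch of putting these digests to use, assuming you have downloaded the sdist into your current directory:

```
# Compute the SHA256 of the downloaded archive and compare it to the digest above.
sha256sum repo2vec-0.1.6.tar.gz
# Expected: f5d4c9cd1a4de21cb3aa684cf7164a104ca87224ca320097fb376548b45f2205
```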

File details

Details for the file repo2vec-0.1.6-py3-none-any.whl.

File metadata

  • File name: repo2vec-0.1.6-py3-none-any.whl
  • Size: 27.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.12.5

File hashes

Hashes for repo2vec-0.1.6-py3-none-any.whl:

  • SHA256: 7bcbb7b69a0082c06e9d005de096252b3a03bb63d0991ed9b92c2047324ffa45
  • MD5: 53e40c79b31560243ece650c65fd2ad6
  • BLAKE2b-256: bfb14fd90884ed13c47d413b9e46af01642d72956ca080751b60ab81293d3176

See the pip documentation for more details on using hashes.
