
Code Indexer Loop


Code Indexer Loop is a Python library designed to index and retrieve code snippets.

It builds on the indexing utilities of the LlamaIndex library and the multi-language tree-sitter library to parse code from many popular programming languages. tiktoken is used to right-size retrieved chunks by token count, and LangChain is used to obtain embeddings (defaulting to OpenAI's text-embedding-ada-002) and store them in an embedded ChromaDB vector database. watchdog keeps the index continuously updated based on file system events.

Read the launch blog post for more details about why we've built this!

Installation:

Use pip to install Code Indexer Loop from PyPI.

pip install code-indexer-loop

Usage:

  1. Import the necessary module:

     from code_indexer_loop.api import CodeIndexer

  2. Create a CodeIndexer object and have it watch for changes:

     indexer = CodeIndexer(src_dir="path/to/code/", watch=True)

  3. Use .query to perform a search query:

     query = "pandas"
     print(indexer.query(query)[0:30])

Note: make sure the OPENAI_API_KEY environment variable is set. This is needed for generating the embeddings.

You can also use indexer.query_nodes to get the nodes of a query or indexer.query_documents to receive the entire source code files.

Note that if you edit any of the source code files in src_dir, the indexer efficiently re-indexes only those files, using watchdog and an md5-based caching mechanism. This keeps the embeddings up to date every time you query the index.
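The caching idea can be pictured roughly like this. Below is a minimal sketch (the class and method names are illustrative, not the library's actual internals) that only flags a file for re-embedding when its md5 digest changes:

```python
import hashlib

class Md5Cache:
    """Tracks file content digests so unchanged files can be skipped."""

    def __init__(self):
        self._digests = {}  # path -> md5 hex digest

    def needs_reindex(self, path: str, content: bytes) -> bool:
        digest = hashlib.md5(content).hexdigest()
        if self._digests.get(path) == digest:
            return False  # content unchanged, keep cached embeddings
        self._digests[path] = digest
        return True

cache = Md5Cache()
print(cache.needs_reindex("a.py", b"print('hi')"))   # True: first sighting
print(cache.needs_reindex("a.py", b"print('hi')"))   # False: unchanged
print(cache.needs_reindex("a.py", b"print('bye')"))  # True: file edited
```

Hashing content rather than relying on modification times means a touch without an actual edit costs only one digest computation, not a round of new embeddings.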

Examples

Check out the basic_usage notebook for a quick overview of the API.

Token limits

You can configure token limits for the chunks through the CodeIndexer constructor:

indexer = CodeIndexer(
    src_dir="path/to/code/", watch=True,
    target_chunk_tokens=300,
    max_chunk_tokens=1000,
    enforce_max_chunk_tokens=False,
    coalesce=50,
    token_model="gpt-4",
)

Note that you can choose whether max_chunk_tokens is enforced. If it is, an exception is raised whenever no semantic split exists that respects max_chunk_tokens.

The coalesce argument controls the threshold (also measured in tokens) below which small chunks are merged into neighboring chunks, avoiding a proliferation of very small chunks.
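Coalescing can be sketched as follows. This is a simplified stand-in (whitespace splitting instead of tiktoken, and a hypothetical function name), not the library's actual algorithm:

```python
def count_tokens(text: str) -> int:
    # Stand-in tokenizer; the library uses tiktoken for real token counts.
    return len(text.split())

def coalesce_chunks(chunks, coalesce=50):
    """Merge each chunk smaller than `coalesce` tokens into the previous one."""
    merged = []
    for chunk in chunks:
        if merged and count_tokens(chunk) < coalesce:
            merged[-1] += chunk  # concatenation preserves the original text
        else:
            merged.append(chunk)
    return merged

chunks = ["def f():\n    return 1\n", "# tiny comment\n", "def g():\n    return 2\n"]
print(coalesce_chunks(chunks, coalesce=5))
```

Because merging is plain concatenation, coalescing never loses or reorders source text; it only changes where chunk boundaries fall.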

tree-sitter

Because parsing is done with tree-sitter, chunks are split only at valid node-level string positions in the source file. This avoids breaking up, for example, function and class definitions.

Supported languages:

C, C++, C#, Go, Haskell, Java, Julia, JavaScript, PHP, Python, Ruby, Rust, Scala, Swift, SQL, TypeScript

Note: Python support is the most thoroughly tested. Use other languages at your own peril.

Contributing

Pull requests are welcome. Please make sure to update tests as appropriate. Use tools provided within dev dependencies to maintain the code standard.

Tests

Run the unit tests by invoking pytest in the root.

License

Please see the LICENSE file provided with the source code.

Attribution

We'd like to thank Sweep AI for publishing their ideas about code chunking. Read their blog posts about the topic here and here. The implementation in code_indexer_loop is modified from their original implementation, mainly to limit chunks by tokens instead of characters and to achieve perfect document reconstruction ("".join(chunks) == original_source_code).
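The reconstruction invariant can be checked directly. The chunker below is a toy (fixed-size line grouping, not the library's semantic chunker), but any correct chunker must pass the same assertion:

```python
def toy_chunker(source: str, lines_per_chunk: int = 2) -> list:
    # Split on newlines but keep them, so no characters are lost or added.
    lines = source.splitlines(keepends=True)
    return ["".join(lines[i:i + lines_per_chunk])
            for i in range(0, len(lines), lines_per_chunk)]

original_source_code = "import os\n\ndef main():\n    print(os.getcwd())\n"
chunks = toy_chunker(original_source_code)
assert "".join(chunks) == original_source_code  # perfect reconstruction
```

Preserving every character (including whitespace and newlines) is what allows chunks to be stitched back into the exact original file.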
