Code Indexer Loop

Code Indexer Loop is a Python library designed to index and retrieve code snippets.

It builds on the indexing utilities of the LlamaIndex library and uses the multi-language tree-sitter library to parse code in many popular programming languages. tiktoken is used to right-size retrieval by token count. LangChain is used to obtain embeddings (defaulting to OpenAI's text-embedding-ada-002) and to store them in an embedded ChromaDB vector database. watchdog keeps the index continuously updated in response to file system events.

Read the launch blog post for more details about why we've built this!

Installation:

Use pip to install Code Indexer Loop from PyPI.

pip install code-indexer-loop

Usage:

  1. Import necessary modules:
from code_indexer_loop.api import CodeIndexer
  2. Create a CodeIndexer object and have it watch for changes:
indexer = CodeIndexer(src_dir="path/to/code/", watch=True)
  3. Use .query to perform a search query:
query = "pandas"
print(indexer.query(query)[0:30])

Note: make sure the OPENAI_API_KEY environment variable is set. This is needed for generating the embeddings.

You can also use indexer.query_nodes to get the nodes of a query or indexer.query_documents to receive the entire source code files.

If you edit any of the source files in src_dir, those files are re-indexed efficiently using watchdog and an MD5-based caching mechanism, so the embeddings are up to date every time you query the index.
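
To illustrate the idea behind hash-based caching, here is a minimal sketch using only the standard library. It is not the library's actual implementation: the HashCache class and its needs_reindex method are hypothetical names introduced for this example. The assumption is simply that a file only needs new embeddings when the MD5 digest of its contents changes.

```python
import hashlib
from pathlib import Path


def file_md5(path: Path) -> str:
    """Return the hex MD5 digest of a file's bytes."""
    return hashlib.md5(path.read_bytes()).hexdigest()


class HashCache:
    """Hypothetical helper: remembers the last seen digest for each file."""

    def __init__(self) -> None:
        self._digests = {}  # maps file path (str) -> hex digest (str)

    def needs_reindex(self, path: Path) -> bool:
        """Return True if the file is new or its contents changed since the last call."""
        digest = file_md5(path)
        key = str(path)
        if self._digests.get(key) == digest:
            return False  # same bytes as before, so the old embeddings still apply
        self._digests[key] = digest
        return True
```

A file system event handler could call needs_reindex(path) for each modified file and skip the embedding step whenever it returns False, for example when an editor rewrites a file without changing its contents.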

Examples

Check out the basic_usage notebook for a quick overview of the API.

Token limits

You can configure token limits for the chunks through the CodeIndexer constructor:

indexer = CodeIndexer(
    src_dir="path/to/code/",
    watch=True,
    target_chunk_tokens=300,
    max_chunk_tokens=1000,
    enforce_max_chunk_tokens=False,
    coalesce=50,
    token_model="gpt-4",
)

You can choose whether max_chunk_tokens is enforced. If it is, the indexer raises an exception when no semantic parsing exists that respects max_chunk_tokens.

The coalesce argument sets the limit for combining smaller chunks into single chunks, which avoids producing many very small chunks. Like the other limits, coalesce is measured in tokens.
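
As a rough illustration of how coalescing can interact with a maximum chunk size, the sketch below merges adjacent pieces while the current chunk is smaller than coalesce tokens and the merged result still fits within max_chunk_tokens. It is an assumption about the general technique, not the library's code: coalesce_chunks and count_tokens are hypothetical helpers, and count_tokens counts whitespace-separated words as a stand-in for a real tokenizer such as tiktoken.

```python
from typing import List


def count_tokens(text: str) -> int:
    """Stand-in tokenizer: counts whitespace-separated words, not real model tokens."""
    return len(text.split())


def coalesce_chunks(pieces: List[str], coalesce: int, max_chunk_tokens: int) -> List[str]:
    """Merge a chunk smaller than `coalesce` tokens with the next piece when the result fits."""
    chunks: List[str] = []
    current = ""
    for piece in pieces:
        if not current:
            current = piece
        elif (count_tokens(current) < coalesce
              and count_tokens(current + piece) <= max_chunk_tokens):
            current += piece  # current chunk is still small, so absorb the next piece
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


pieces = [
    "import os\n",
    "import sys\n",
    "def main():\n    print(os.getcwd(), sys.argv)\n",
]
merged = coalesce_chunks(pieces, coalesce=3, max_chunk_tokens=50)
```

In this example the two short import lines end up in one chunk and the function stays in its own chunk. Because pieces are only concatenated and never trimmed, joining the resulting chunks gives back the original text.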

tree-sitter

Using tree-sitter for parsing, the chunks are broken only at valid node-level string positions in the source file. This avoids breaking up e.g. function and class definitions.
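
The sketch below shows the general idea of cutting only at node boundaries. To stay self-contained it swaps in Python's built-in ast module for tree-sitter, so it handles Python source only and does not reflect how code_indexer_loop is implemented. The function split_at_top_level_nodes is a hypothetical name used for illustration.

```python
import ast
from typing import List


def split_at_top_level_nodes(source: str) -> List[str]:
    """Split Python source so every cut falls on the first line of a top-level statement."""
    tree = ast.parse(source)
    lines = source.splitlines(keepends=True)
    starts = []
    for node in tree.body:
        first = node.lineno
        # Decorators sit above the def/class line, so the chunk should start there.
        for decorator in getattr(node, "decorator_list", []):
            first = min(first, decorator.lineno)
        starts.append(first - 1)  # ast line numbers are 1-based
    if not starts:
        return [source] if source else []
    starts[0] = 0  # keep leading comments or blank lines in the first chunk
    starts = sorted(set(starts))  # several statements on one line share a start
    bounds = starts + [len(lines)]
    return ["".join(lines[a:b]) for a, b in zip(bounds, bounds[1:])]


source = (
    "import os\n"
    "\n"
    "def greet(name):\n"
    "    return 'hi ' + name\n"
    "\n"
    "class Greeter:\n"
    "    def run(self):\n"
    "        return greet(os.getlogin())\n"
)
chunks = split_at_top_level_nodes(source)
```

Here the import, the function, and the class each land in their own chunk, and no definition is cut in the middle. Since the chunks are contiguous slices of the original lines, "".join(chunks) == source holds.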

Supported languages:

C, C++, C#, Go, Haskell, Java, Julia, JavaScript, PHP, Python, Ruby, Rust, Scala, Swift, SQL, TypeScript

Note, we're mainly testing Python support. Use other languages at your own peril.

Contributing

Pull requests are welcome. Please make sure to update tests as appropriate. Use the tools provided in the dev dependencies to maintain the code standard.

Tests

Run the unit tests by invoking pytest in the root.

License

Please see the LICENSE file provided with the source code.

Attribution

We'd like to thank Sweep AI for publishing their ideas about code chunking. Read their blog posts about the topic here and here. The implementation in code_indexer_loop is modified from their original implementation, mainly to limit chunks based on tokens instead of characters and to achieve perfect document reconstruction ("".join(chunks) == original_source_code).
