A package for smartly splitting code into chunks
AST Snowball Splitter
AST Snowball Splitter is a Python package for intelligent code chunking based on the nodes of Abstract Syntax Trees (ASTs). Because chunk boundaries follow the syntactic structure of the code, the resulting chunks tend to be more relevant than those produced by treating code as natural language. It is intended for Retrieval-Augmented Generation (RAG) systems used in coding assistants.
Features
- Intelligent Code Chunking: Generates chunks based on the structure of the code, resulting in more meaningful segments.
- AST-Based: Utilizes Abstract Syntax Trees to split the code, ensuring that chunks respect the logical boundaries of the code.
- RAG System Integration: Perfect for use in Retrieval-Augmented Generation systems, enhancing the capabilities of coding assistants.
- Langchain Integration: Outputs are compatible with Langchain, using Document types for seamless integration.
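To make the idea of AST-based chunking concrete, here is a minimal sketch that uses Python's standard-library ast module instead of the tree-sitter grammars this package is configured with. It is not the package's algorithm: the function name chunk_by_top_level_nodes and the max_chars budget are hypothetical, and size is measured in characters rather than tokenizer tokens. It only illustrates how consecutive nodes can be merged until a budget is reached, so that no chunk cuts a statement in half.

```python
import ast


def chunk_by_top_level_nodes(source: str, max_chars: int = 200) -> list[str]:
    """Greedily merge consecutive top-level AST nodes into chunks.

    Illustrative only: a node longer than max_chars is kept whole rather
    than split, and comments or blank lines between nodes are dropped.
    """
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks: list[str] = []
    current = ""
    for node in tree.body:
        # lineno and end_lineno are 1-based and inclusive (Python 3.8+).
        segment = "\n".join(lines[node.lineno - 1 : node.end_lineno])
        candidate = segment if not current else current + "\n" + segment
        if current and len(candidate) > max_chars:
            # Adding this node would exceed the budget: close the chunk.
            chunks.append(current)
            current = segment
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample = '''import os

def first():
    return 1

def second():
    return 2
'''

for chunk in chunk_by_top_level_nodes(sample, max_chars=40):
    print(repr(chunk))
```

Every chunk produced this way ends on a node boundary, so each one can still be parsed on its own, which is the property a plain character-based splitter cannot guarantee.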
Installation
To install the package, use pip:
pip install astsnowballsplitter
Usage
from astsnowballsplitter.code_splitter import ASTSnowballSplitter, ASTLanguageConfig
from transformers import AutoTokenizer
from langchain.schema import Document
# Define the configuration for the languages
languages_config = [
    ASTLanguageConfig(language='python', library_path='build/my-languages.so', grammar_path='tree-sitter-python'),
    ASTLanguageConfig(language='javascript', library_path='build/my-languages.so', grammar_path='tree-sitter-javascript')
]
# Initialize the tokenizer and splitter
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
splitter = ASTSnowballSplitter(
    tokenizer=tokenizer,
    chunk_size=100,
    chunk_overlap=10,
    languages_config=languages_config
)
# Define source code to split
python_code = """
def example_function():
    print("Hello, world!")
    for i in range(10):
        print(i)
"""
javascript_code = """
function exampleFunction() {
    console.log("Hello, world!");
    for (let i = 0; i < 10; i++) {
        console.log(i);
    }
}
"""
# Split the code into chunks
texts = [python_code, javascript_code]
file_extensions = ['py', 'js']
documents = splitter.split_text(texts, file_extensions)
# Display the results
for doc in documents:
    print(doc.page_content)
    print(doc.metadata)
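The usage example passes chunk_size=100 and chunk_overlap=10 without saying how the two interact. How this package applies overlap to AST nodes is not documented here, so the sketch below shows only the conventional meaning of the two parameters in text splitters: a window of chunk_size tokens that advances by chunk_size - chunk_overlap. The function sliding_windows is hypothetical, and tokens are plain strings standing in for tokenizer output.

```python
def sliding_windows(tokens: list[str], chunk_size: int, chunk_overlap: int) -> list[list[str]]:
    """Return windows of chunk_size tokens sharing chunk_overlap tokens.

    Illustrative only; this is the generic sliding-window scheme, not
    necessarily what ASTSnowballSplitter does internally.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    windows: list[list[str]] = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start : start + chunk_size])
        # Stop once a window has reached the end of the input.
        if start + chunk_size >= len(tokens):
            break
    return windows


tokens = "a b c d e f g h".split()
for window in sliding_windows(tokens, chunk_size=4, chunk_overlap=1):
    print(window)
```

Under this scheme, each window begins with the last chunk_overlap tokens of the previous one, which gives a retriever some shared context across neighbouring chunks.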
Download files
File details
Details for the file astsnowballsplitter-0.2.0.tar.gz.
File metadata
- Download URL: astsnowballsplitter-0.2.0.tar.gz
- Upload date:
- Size: 7.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.11.5
File hashes
Algorithm | Hash digest
---|---
SHA256 | 2280f92396852e748e34ffd488b782081f35ab4941caca386d480ef444e06adf
MD5 | 529d4df4442bd105d7c43c1b1688d2cb
BLAKE2b-256 | 8eac2c50f700118ea4cd853ec42390a0002d504f025b976bce4106ff02cbe446
File details
Details for the file astsnowballsplitter-0.2.0-py3-none-any.whl.
File metadata
- Download URL: astsnowballsplitter-0.2.0-py3-none-any.whl
- Upload date:
- Size: 6.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.11.5
File hashes
Algorithm | Hash digest
---|---
SHA256 | 52dde79fcf6b26036994bc6c94c3d35ef542296e37343bcce7da4bd35ee1dee7
MD5 | 3440e1461a65e35f0ce6b0a117be606f
BLAKE2b-256 | 2c59a77fb4995b78b21a7531b09ae0e1f7e109e4a2c46062050354ebdb635a72
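The published digests can be used to check a downloaded file before installing it. The following is a minimal sketch using only the standard library. The helper names sha256_of and matches_published_digest and the local path are hypothetical; the expected value is the SHA256 digest listed for astsnowballsplitter-0.2.0.tar.gz.

```python
import hashlib
from pathlib import Path

# SHA256 digest listed for astsnowballsplitter-0.2.0.tar.gz.
EXPECTED_SDIST_SHA256 = "2280f92396852e748e34ffd488b782081f35ab4941caca386d480ef444e06adf"


def sha256_of(path: Path, block_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, read in fixed-size blocks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_published_digest(path: Path, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of(path) == expected_hex.lower()


if __name__ == "__main__":
    # Hypothetical location of a manually downloaded source distribution.
    sdist = Path("astsnowballsplitter-0.2.0.tar.gz")
    if sdist.exists():
        print(matches_published_digest(sdist, EXPECTED_SDIST_SHA256))
```

pip can perform the same check itself when a requirements file pins the package with a --hash option.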