A package for smartly splitting code into chunks

AST Snowball Splitter

AST Snowball Splitter is a Python package that chunks source code along the nodes of its Abstract Syntax Tree (AST). Splitting at syntactic boundaries yields more relevant chunks than treating code as natural-language text. It is intended for Retrieval-Augmented Generation (RAG) systems that power coding assistants.

Features

  • Intelligent Code Chunking: Generates chunks from the structure of the code, so each segment is a meaningful unit.
  • AST-Based: Uses Abstract Syntax Trees to split the code, so chunks respect its logical boundaries.
  • RAG System Integration: Designed for Retrieval-Augmented Generation systems, such as those behind coding assistants.
  • LangChain Integration: Returns chunks as LangChain Document objects, so they can be passed directly to LangChain components.
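
To make the idea of AST-based chunking concrete, here is a minimal sketch. It is not the package's implementation: it uses Python's built-in ast module rather than the grammars the package is configured with, it handles Python source only, and it measures chunk size in characters rather than tokens. The function name chunk_by_top_level_nodes and the max_chars parameter are hypothetical, chosen for illustration.

```python
import ast
from typing import List


def chunk_by_top_level_nodes(source: str, max_chars: int = 200) -> List[str]:
    """Greedily pack consecutive top-level AST nodes into chunks.

    Illustrative only: sizes are in characters, and a single node
    larger than max_chars is kept whole rather than split further.
    """
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks: List[str] = []
    current = ""
    for node in tree.body:
        # lineno/end_lineno are 1-based and inclusive; decorators sit
        # above the def/class line, so include them in the segment.
        start = min(
            [node.lineno]
            + [d.lineno for d in getattr(node, "decorator_list", [])]
        )
        segment = "\n".join(lines[start - 1 : node.end_lineno])
        candidate = segment if not current else current + "\n\n" + segment
        if current and len(candidate) > max_chars:
            # Adding this node would exceed the budget: close the chunk.
            chunks.append(current)
            current = segment
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample = "import os\n\ndef a():\n    return 1\n\ndef b():\n    return 2\n"
for chunk in chunk_by_top_level_nodes(sample, max_chars=30):
    print(repr(chunk))
```

Because every chunk boundary falls between two top-level statements, each chunk is still valid Python on its own. Comments and blank lines between top-level nodes are dropped in this sketch.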

Installation

To install the package, use pip:

pip install astsnowballsplitter

Usage

from astsnowballsplitter.code_splitter import ASTSnowballSplitter, ASTLanguageConfig
from transformers import AutoTokenizer
from langchain.schema import Document

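# Define one configuration per language. Judging by the parameter names,
# library_path is a compiled tree-sitter library and grammar_path is the
# grammar's source directory (an assumption; adjust to your setup).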
languages_config = [
    ASTLanguageConfig(language='python', library_path='build/my-languages.so', grammar_path='tree-sitter-python'),
    ASTLanguageConfig(language='javascript', library_path='build/my-languages.so', grammar_path='tree-sitter-javascript')
]

# Initialize the tokenizer and splitter
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
splitter = ASTSnowballSplitter(
    tokenizer=tokenizer,
    chunk_size=100,
    chunk_overlap=10,
    languages_config=languages_config
)

# Define source code to split
python_code = """
def example_function():
    print("Hello, world!")
    for i in range(10):
        print(i)
"""

javascript_code = """
function exampleFunction() {
    console.log("Hello, world!");
    for (let i = 0; i < 10; i++) {
        console.log(i);
    }
}
"""

# Split the code into chunks; texts and file_extensions are parallel lists,
# with one file extension per source text
texts = [python_code, javascript_code]
file_extensions = ['py', 'js']
documents = splitter.split_text(texts, file_extensions)

# Display the results
for doc in documents:
    print(doc.page_content)
    print(doc.metadata)
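
The usage above passes texts and file_extensions as parallel lists, so a mismatch between them is easy to introduce. The sketch below shows one way to validate the inputs before calling the splitter. It is an assumption-laden helper, not part of the package: the name pair_sources and the EXTENSION_TO_LANGUAGE table are hypothetical, and the package's own extension handling may differ.

```python
# Hypothetical extension-to-language table, mirroring the two languages
# configured in the usage example; the package's own mapping may differ.
EXTENSION_TO_LANGUAGE = {"py": "python", "js": "javascript"}


def pair_sources(texts, file_extensions):
    """Return (language, text) pairs, failing early on mismatched inputs."""
    if len(texts) != len(file_extensions):
        raise ValueError(
            f"got {len(texts)} texts but {len(file_extensions)} extensions"
        )
    pairs = []
    for text, ext in zip(texts, file_extensions):
        # Accept "py", ".py" and "PY" alike.
        key = ext.lower().lstrip(".")
        if key not in EXTENSION_TO_LANGUAGE:
            raise ValueError(f"no language configured for extension {ext!r}")
        pairs.append((EXTENSION_TO_LANGUAGE[key], text))
    return pairs


for language, text in pair_sources(["x = 1", "let x = 1;"], ["py", "js"]):
    print(language, len(text))
```

Running a check like this first turns a silent mispairing into an explicit error before any parsing happens.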
