Split code into semantic chunks using tree-sitter

code-splitter Python Bindings

The code-splitter Python package provides bindings for the code-splitter Rust crate. It uses the tree-sitter parsing library together with tokenizers to split code into semantically meaningful chunks. This is particularly useful in Retrieval Augmented Generation (RAG), a technique that improves the output of Large Language Models (LLMs) by drawing on external knowledge sources.

Installation

You can install the package from PyPI:

pip install code-splitter

Usage

Here's an example of how to use the package:

from code_splitter import Language, CharSplitter

# Load the code you want to split
with open("example.py", "rb") as f:
    code = f.read()

# Create a splitter instance
splitter = CharSplitter(Language.Python, max_size=200)

# Split the code into chunks
chunks = splitter.split(code)

# Print the chunks
for chunk in chunks:
    print(f"Start: {chunk.start}, End: {chunk.end}, Size: {chunk.size}")
    print(chunk.text)
    print()

This example uses CharSplitter to split Python code into chunks of at most 200 characters. Each Chunk object carries the start and end lines, the size, and the text of the chunk.
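
To give a feel for what syntax-aware splitting means, here is a minimal sketch that uses Python's built-in ast module as a stand-in for tree-sitter. It is not the library's algorithm: SketchChunk and split_top_level are hypothetical names invented for this illustration, it only handles Python source, and it only looks at top-level statements. It greedily merges adjacent statements while the merged text stays within a character budget, so a chunk boundary never falls in the middle of a function or class.

```python
import ast
from dataclasses import dataclass
from typing import List


@dataclass
class SketchChunk:
    """Hypothetical stand-in with the same fields the README prints for a chunk."""
    start: int  # first line of the chunk (1-based)
    end: int    # last line of the chunk (1-based, inclusive)
    size: int   # number of characters in `text`
    text: str


def _first_line(node: ast.stmt) -> int:
    # A decorated def/class reports the line of the `def`/`class` keyword,
    # so include decorator lines explicitly.
    decorators = getattr(node, "decorator_list", [])
    return min([node.lineno] + [d.lineno for d in decorators])


def split_top_level(source: str, max_size: int) -> List[SketchChunk]:
    """Merge adjacent top-level statements while they fit in max_size characters."""
    lines = source.splitlines()

    def make(start: int, end: int) -> SketchChunk:
        text = "\n".join(lines[start - 1:end])
        return SketchChunk(start, end, len(text), text)

    chunks = []
    current = None  # (start, end) line span being accumulated
    for node in ast.parse(source).body:
        first, last = _first_line(node), node.end_lineno
        if current is not None:
            if make(current[0], last).size <= max_size:
                current = (current[0], last)  # still fits: extend the span
                continue
            chunks.append(make(*current))     # does not fit: close the span
        current = (first, last)
    if current is not None:
        chunks.append(make(*current))
    return chunks
```

In this sketch a single statement larger than max_size is kept whole as its own chunk. A fuller implementation would presumably descend into the child nodes of an oversized statement instead.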

Available Splitters

The package provides the following splitters:

  • CharSplitter: Splits code based on character count.
  • WordSplitter: Splits code based on word count.
  • TiktokenSplitter: Splits code based on token count, using the tiktoken tokenizer.
  • HuggingfaceSplitter: Splits code based on token count, using a Hugging Face tokenizer.
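
The splitters differ in the unit that max_size is measured in. The sketch below illustrates that idea with two plain-Python size functions. The names char_size, word_size, and fits are hypothetical, and counting words by whitespace is an assumption made for this illustration; the package may count differently.

```python
from typing import Callable

# A "sizer" maps a piece of text to its size in some unit.
Sizer = Callable[[str], int]


def char_size(text: str) -> int:
    """Size in characters (Python code points)."""
    return len(text)


def word_size(text: str) -> int:
    """Size in whitespace-separated words (assumed definition of a word)."""
    return len(text.split())


def fits(text: str, max_size: int, sizer: Sizer) -> bool:
    """Return True if `text` is within the budget under the given unit."""
    return sizer(text) <= max_size


snippet = "def add(a, b):\n    return a + b"
print(char_size(snippet), word_size(snippet))
print(fits(snippet, 10, char_size), fits(snippet, 10, word_size))
```

The same snippet can exceed a budget of 10 when measured in characters and fit when measured in words, so a max_size value is only meaningful together with the splitter that interprets it.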

Supported Languages

The following programming languages are currently supported:

  • Golang
  • Markdown
  • Python
  • Rust
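
When splitting a mixed repository, you need to pick a language per file. The helper below is a small sketch of one way to do that by file extension. EXTENSION_TO_LANGUAGE and language_name_for are hypothetical names, not part of the package, and the extension choices are assumptions. The returned strings match the Language member names used in this README (Golang, Markdown, Python, Rust).

```python
from pathlib import Path
from typing import Optional

# Assumed mapping from file extension to the Language member names
# that appear in this README.
EXTENSION_TO_LANGUAGE = {
    ".go": "Golang",
    ".md": "Markdown",
    ".py": "Python",
    ".rs": "Rust",
}


def language_name_for(path: str) -> Optional[str]:
    """Return the Language member name for a file path, or None if unsupported."""
    return EXTENSION_TO_LANGUAGE.get(Path(path).suffix.lower())
```

Assuming Language exposes its members as attributes, as in Language.Python, the name could then be resolved with getattr(Language, name) before constructing a splitter. Files for which the helper returns None are best skipped.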

Examples

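Here are some examples of splitting code using different splitters and languages. Each snippet assumes that code already holds the source as bytes, loaded as in the Usage example above.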

Split Python Code by Characters

from code_splitter import Language, CharSplitter

splitter = CharSplitter(Language.Python, max_size=200)
chunks = splitter.split(code)

Split Markdown by Words

from code_splitter import Language, WordSplitter

splitter = WordSplitter(Language.Markdown, max_size=50)
chunks = splitter.split(code)

Split Rust Code by Tiktoken Tokenizer

from code_splitter import Language, TiktokenSplitter

splitter = TiktokenSplitter(Language.Rust, max_size=100)
chunks = splitter.split(code)

Split Go Code by HuggingFace Tokenizer

from code_splitter import Language, HuggingfaceSplitter

splitter = HuggingfaceSplitter(Language.Golang, max_size=100, pretrained_model_name_or_path="bert-base-cased")
chunks = splitter.split(code)

For more examples, please refer to the tests directory in the repository.

Contributing

Contributions are welcome! Please feel free to submit issues or pull requests.
