semantic-chunker

A strongly typed semantic text chunking library that splits content while preserving its structure and meaning. It is built on top of semantic-text-splitter and adds a type-safe API.

Features

  • 🎯 Multiple tokenization strategies:
    • OpenAI's tiktoken models (e.g., "gpt-3.5-turbo")
    • Hugging Face tokenizers (from objects, JSON strings, or files)
    • Custom tokenization callbacks
  • 📝 Three specialized chunking modes:
    • Plain text
    • Markdown (preserves structure)
    • Code (preserves syntax via tree-sitter)
  • 🔄 Configurable chunk overlapping
  • ✂️ Optional whitespace trimming
  • 💪 Full type safety with Protocol types

Installation

Basic installation (text and markdown support):

pip install semantic-chunker

With code chunking support:

pip install "semantic-chunker[code]"

With Hugging Face tokenizers support:

pip install "semantic-chunker[tokenizers]"

With all features:

pip install "semantic-chunker[all]"

Usage

Text Chunking

from semantic_chunker import get_chunker

plain_text = """Contrary to popular belief, Lorem Ipsum is not simply random text. ..."""

chunker = get_chunker(
    "gpt-3.5-turbo",
    chunking_type="text",  # required
    max_tokens=10,  # required
    trim=False,  # default True
    overlap=5,  # default 0
)

chunks = chunker.chunks(plain_text)  # list[str]
chunk_with_indices = chunker.chunk_with_indices(plain_text)  # list[tuple[str, int]]
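
To build intuition for how max_tokens and overlap interact, here is a minimal sketch of a naive sliding window over whitespace-separated words. It is not the library's algorithm (semantic-chunker splits on semantic boundaries and counts tokens with the configured tokenizer); the function name naive_overlapping_chunks and the word-based "tokens" are assumptions made purely for illustration.

```python
def naive_overlapping_chunks(text: str, max_tokens: int, overlap: int = 0) -> list[str]:
    """Split text into windows of at most max_tokens words.

    Consecutive windows share `overlap` words. Illustrative only: words
    stand in for tokens, and no semantic boundaries are respected.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be non-negative and smaller than max_tokens")

    words = text.split()
    step = max_tokens - overlap  # how far the window advances each time
    chunks: list[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break  # the window already reached the end of the text
    return chunks


sample = "one two three four five six seven"
print(naive_overlapping_chunks(sample, max_tokens=4, overlap=2))
# ['one two three four', 'three four five six', 'five six seven']
```

Each chunk repeats the last two words of the previous one, which is the kind of shared context overlap is meant to provide.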

Markdown Chunking

from semantic_chunker import get_chunker

markdown_text = """# Lorem Ipsum Intro ..."""

chunker = get_chunker(
    "gpt-3.5-turbo",
    chunking_type="markdown",
    max_tokens=10,
    trim=False,
    overlap=5,
)

chunks = chunker.chunks(markdown_text)  # list[str]
chunk_with_indices = chunker.chunk_with_indices(markdown_text)  # list[tuple[str, int]]

Code Chunking

from semantic_chunker import get_chunker

kotlin_snippet = """import kotlin.random.Random ..."""

chunker = get_chunker(
    "gpt-3.5-turbo",
    chunking_type="code",
    max_tokens=10,
    tree_sitter_language="kotlin",  # required for code chunking
    trim=False,
    overlap=5,
)

chunks = chunker.chunks(kotlin_snippet)  # list[str]
chunk_with_indices = chunker.chunk_with_indices(kotlin_snippet)  # list[tuple[str, int]]

Error Handling

from semantic_chunker import get_chunker

# Missing language for code chunking
try:
    chunker = get_chunker("gpt-4", chunking_type="code", max_tokens=10)
except ValueError as e:
    print(e)  # "Language must be provided for code chunking."

# Missing required package for code chunking (the "code" extra is not installed)
try:
    chunker = get_chunker("gpt-4", chunking_type="code", tree_sitter_language="python", max_tokens=10)
except ModuleNotFoundError as e:
    print(e)  # "tree-sitter-language-pack is required for 'code' style chunking..."
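
One way to act on the ModuleNotFoundError above is to fall back to plain-text chunking when the code extra is missing. The sketch below is an assumption about how an application might do this, not part of the library: build_chunker_with_fallback is a hypothetical helper, and it takes the factory (normally get_chunker) as an argument so the fallback logic can be exercised without the package installed.

```python
from typing import Any, Callable


def build_chunker_with_fallback(
    factory: Callable[..., Any],
    tokenizer: Any,
    *,
    language: str,
    max_tokens: int,
) -> Any:
    """Try code chunking first; fall back to text chunking if the extra is missing.

    `factory` is expected to behave like semantic_chunker.get_chunker.
    A ValueError (for example, a missing language) is deliberately not
    caught, because it signals a caller mistake rather than a missing package.
    """
    try:
        return factory(
            tokenizer,
            chunking_type="code",
            tree_sitter_language=language,
            max_tokens=max_tokens,
        )
    except ModuleNotFoundError:
        # The optional code-chunking dependency is not installed.
        return factory(tokenizer, chunking_type="text", max_tokens=max_tokens)
```

In an application this would be called as build_chunker_with_fallback(get_chunker, "gpt-4", language="python", max_tokens=10).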

Chunking Type and Tokenization Options

  • get_chunker requires the first argument to be one of:

    1. A tiktoken model name string (e.g., gpt-4o)
    2. A function that takes a string and returns a token count (integer)
    3. A tokenizers.Tokenizer instance
    4. A string path to a Hugging Face tokenizers JSON file
  • Required kwargs:

    • chunking_type: Either text, markdown, or code.
    • max_tokens: Maximum tokens per chunk. Accepts an integer or a tuple (min, max).
    • If chunking_type is code, tree_sitter_language is required.
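
As an illustration of option 2 and of the (min, max) form of max_tokens, the following sketch passes a custom token-counting callback. The names count_words and build_word_chunker are hypothetical, and counting whitespace-separated words is only a stand-in for a real tokenizer; the get_chunker call uses only the arguments described in the list above.

```python
def count_words(text: str) -> int:
    """Custom token counter: treat each whitespace-separated word as one token."""
    return len(text.split())


def build_word_chunker(min_tokens: int, max_tokens: int):
    """Build a plain-text chunker whose size is measured in words.

    The import is done lazily so this module can be loaded even when
    semantic-chunker is not installed.
    """
    from semantic_chunker import get_chunker

    return get_chunker(
        count_words,  # callback taking a string and returning an integer count
        chunking_type="text",
        max_tokens=(min_tokens, max_tokens),  # tuple form: (min, max)
    )
```

With the package installed, build_word_chunker(20, 50).chunks(some_text) would then return chunks sized by word count rather than by model tokens.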

Contribution

This library is open to contributions. Feel free to open issues or submit PRs. It's better to discuss an issue before submitting a PR, to avoid disappointment.

Local Development

  1. Clone the repo
  2. Install the system dependencies
  3. Install the full dependencies with uv sync
  4. Install the pre-commit hooks with:
    pre-commit install && pre-commit install --hook-type commit-msg
    
  5. Make your changes and submit a PR

License

This library uses the MIT license.
