Semantic Chunker
semantic-chunker
A strongly-typed semantic text chunking library that intelligently splits content while preserving structure and meaning. Built on top of semantic-text-splitter with an enhanced type-safe API.
Features
- 🎯 Multiple tokenization strategies:
  - OpenAI's tiktoken models (e.g., "gpt-3.5-turbo")
  - Hugging Face tokenizers (from objects, JSON strings, or files)
  - Custom tokenization callbacks
- 📝 Three specialized chunking modes:
  - Plain text
  - Markdown (preserves structure)
  - Code (preserves syntax via tree-sitter)
- 🔄 Configurable chunk overlapping
- ✂️ Optional whitespace trimming
- 💪 Full type safety with Protocol types
Installation
Basic installation (text and markdown support):
pip install semantic-chunker
With code chunking support (the quotes keep shells such as zsh from interpreting the brackets):
pip install "semantic-chunker[code]"
With Hugging Face tokenizers support:
pip install "semantic-chunker[tokenizers]"
With all features:
pip install "semantic-chunker[all]"
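To see which optional features are usable in the current environment, a small check along these lines may help. This is a sketch, not part of the library: `available_features` is a hypothetical helper, and the module names are assumptions (that the code extra installs `tree_sitter_language_pack`, as the error message under Error Handling suggests, and that the tokenizers extra installs `tokenizers`).

```python
from importlib.util import find_spec


def available_features() -> dict[str, bool]:
    """Report which optional dependencies can be imported (hypothetical helper)."""
    # Assumed module names; the extras are presumed to install these packages.
    optional_modules = {
        "code": "tree_sitter_language_pack",
        "tokenizers": "tokenizers",
    }
    return {
        feature: find_spec(module) is not None
        for feature, module in optional_modules.items()
    }


if __name__ == "__main__":
    for feature, installed in available_features().items():
        print(f"{feature}: {'installed' if installed else 'missing'}")
```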
Usage
Text Chunking
from semantic_chunker import get_chunker

plain_text = """Contrary to popular belief, Lorem Ipsum is not simply random text. ..."""

chunker = get_chunker(
    "gpt-3.5-turbo",
    chunking_type="text",  # required
    max_tokens=10,  # required
    trim=False,  # default True
    overlap=5,  # default 0
)

chunks = chunker.chunks(plain_text)  # list[str]
chunk_with_indices = chunker.chunk_with_indices(plain_text)  # list[tuple[str, int]]
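The index returned with each chunk can be used to map a chunk back to its position in the source text. The sketch below assumes the `(chunk, offset)` ordering shown in the type comment above and that offsets count characters; neither is confirmed here, so adjust if the library behaves differently. `offsets_match` is a hypothetical helper and the sample data is hand-written rather than real chunker output.

```python
def offsets_match(text: str, pairs: list[tuple[str, int]]) -> bool:
    """Return True if every chunk is found in `text` at its reported offset."""
    return all(
        text[offset:offset + len(chunk)] == chunk
        for chunk, offset in pairs
    )


# Hand-written sample standing in for chunker.chunk_with_indices(sample_text).
sample_text = "Lorem ipsum dolor sit amet"
sample_pairs = [("Lorem ipsum", 0), ("ipsum dolor", 6), ("dolor sit amet", 12)]

print(offsets_match(sample_text, sample_pairs))
```

With overlap enabled, consecutive chunks may share text, which is why the sample offsets advance by less than the chunk length.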
Markdown Chunking
from semantic_chunker import get_chunker

markdown_text = """# Lorem Ipsum Intro ..."""

chunker = get_chunker(
    "gpt-3.5-turbo",
    chunking_type="markdown",
    max_tokens=10,
    trim=False,
    overlap=5,
)

chunks = chunker.chunks(markdown_text)  # list[str]
chunk_with_indices = chunker.chunk_with_indices(markdown_text)  # list[tuple[str, int]]
Code Chunking
from semantic_chunker import get_chunker

kotlin_snippet = """import kotlin.random.Random ..."""

chunker = get_chunker(
    "gpt-3.5-turbo",
    chunking_type="code",
    max_tokens=10,
    tree_sitter_language="kotlin",  # required for code chunking
    trim=False,
    overlap=5,
)

chunks = chunker.chunks(kotlin_snippet)  # list[str]
chunk_with_indices = chunker.chunk_with_indices(kotlin_snippet)  # list[tuple[str, int]]
Error Handling
from semantic_chunker import get_chunker

# Missing language for code chunking
try:
    chunker = get_chunker("gpt-4", chunking_type="code", max_tokens=10)
except ValueError as e:
    print(e)  # "Language must be provided for code chunking."

# Missing required package for code chunking
try:
    chunker = get_chunker("gpt-4", chunking_type="code", tree_sitter_language="python", max_tokens=10)
except ModuleNotFoundError as e:
    print(e)  # "tree-sitter-language-pack is required for 'code' style chunking..."
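One way to act on the `ModuleNotFoundError` above is to fall back to plain-text chunking when the code extra is not installed. The following is a minimal sketch of that pattern, not library behavior: `chunker_with_fallback` is a hypothetical helper, and the factory is passed in as an argument so the function can be exercised without the library (in real use it would be `get_chunker`).

```python
from typing import Any, Callable


def chunker_with_fallback(
    factory: Callable[..., Any],
    tokenizer: Any,
    language: str,
    **kwargs: Any,
) -> Any:
    """Try code chunking first; fall back to text chunking if the
    optional code dependency is missing (hypothetical helper)."""
    try:
        return factory(
            tokenizer,
            chunking_type="code",
            tree_sitter_language=language,
            **kwargs,
        )
    except ModuleNotFoundError:
        # The code extra is not installed, so chunk as plain text instead.
        return factory(tokenizer, chunking_type="text", **kwargs)


# Intended usage, assuming semantic_chunker is installed:
#   from semantic_chunker import get_chunker
#   chunker = chunker_with_fallback(get_chunker, "gpt-4", "python", max_tokens=10)
```

The `ValueError` for a missing language is deliberately not caught, since it signals a mistake in the call rather than a missing dependency.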
Chunking Type and Tokenization Options
get_chunker requires the first argument to be one of:
- A tiktoken model name string (e.g., gpt-4o)
- A function that takes a string and returns a token count (integer)
- A tokenizers.Tokenizer instance
- A string path to a tokenizers tokenizer JSON file

Required kwargs:
- chunking_type: Either text, markdown, or code.
- max_tokens: Maximum tokens per chunk. Accepts an integer or a tuple (min, max).
- If chunking_type is code, tree_sitter_language is required.
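The callback option can be sketched as follows: any function that maps a string to an integer count can stand in for a tokenizer. Here `count_words` and `build_word_chunker` are hypothetical names, whitespace word counting is only a rough stand-in for real tokenization, and the `(20, 50)` range is an arbitrary illustration of the tuple form of max_tokens. The library import is kept inside the builder so the counting function can be used on its own.

```python
def count_words(text: str) -> int:
    """Approximate a token count as the number of whitespace-separated words."""
    return len(text.split())


def build_word_chunker():
    """Build a text chunker sized by word count instead of model tokens."""
    # Imported lazily; requires semantic-chunker to be installed.
    from semantic_chunker import get_chunker

    return get_chunker(
        count_words,           # custom callback: str -> int
        chunking_type="text",
        max_tokens=(20, 50),   # tuple form: (min, max)
    )
```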
Contribution
This library is open to contributions. Feel free to open issues or submit PRs. It's better to discuss an issue before submitting a PR for it, to avoid disappointment.
Local Development
- Clone the repo
- Install the system dependencies
- Install the full dependencies with uv sync
- Install the pre-commit hooks with: pre-commit install && pre-commit install --hook-type commit-msg
- Make your changes and submit a PR
License
This library uses the MIT license.