Split text into semantic chunks, up to a desired chunk size. Supports calculating length by characters and tokens, and is callable from Rust and Python.

semantic-text-splitter

Large language models (LLMs) can be used for many tasks, but often have a limited context size that can be smaller than the documents you want to use. To work with longer documents, you often have to split the text into chunks that fit within this context size.

This crate provides methods for splitting longer pieces of text into smaller chunks, aiming to maximize a desired chunk size, but still splitting at semantically sensible boundaries whenever possible.

Get Started

By Number of Characters

from semantic_text_splitter import TextSplitter

# Maximum number of characters in a chunk
max_characters = 1000
splitter = TextSplitter(max_characters)
# Optionally, you can have the splitter not trim whitespace for you:
# splitter = TextSplitter(max_characters, trim=False)

chunks = splitter.chunks("your document text")

Using a Range for Chunk Capacity

You also have the option of specifying your chunk capacity as a range.

Once a chunk has reached a length that falls within the range, it will be returned.

A chunk may still be returned that is shorter than the start value, because adding the next piece of text would have made it larger than the end capacity.

from semantic_text_splitter import TextSplitter


# Range for the number of characters in a chunk. The splitter will fill
# up the chunk until its length is somewhere in this range.
splitter = TextSplitter((200, 1000))

chunks = splitter.chunks("your document text")
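
To illustrate how a capacity range behaves, here is a minimal pure-Python sketch. It is not the library's implementation: `fill_chunk` and the pre-split `pieces` list are hypothetical, and the sketch only mimics the rule described above (return once the length is within the range, never exceed the upper bound). It assumes every piece fits within the upper bound on its own.

```python
def fill_chunk(pieces, lower, upper):
    """Greedily join pieces until the chunk length lands in [lower, upper].

    Hypothetical helper for illustration only. Returns (chunk, remaining).
    Assumes each individual piece is no longer than `upper`.
    """
    chunk = ""
    for i, piece in enumerate(pieces):
        if len(chunk) + len(piece) > upper:
            # Adding this piece would exceed the upper bound, so stop here,
            # even if the chunk is still shorter than `lower`.
            return chunk, pieces[i:]
        chunk += piece
        if len(chunk) >= lower:
            # The chunk length is now inside the range, so return it.
            return chunk, pieces[i + 1:]
    return chunk, []


pieces = ["aaaa ", "bbbb ", "ccc ", "dddddddddd ", "ee"]
first, rest = fill_chunk(pieces, lower=8, upper=12)
second, rest = fill_chunk(rest, lower=8, upper=12)
print(repr(first))   # 'aaaa bbbb ' (10 characters, inside the range)
print(repr(second))  # 'ccc ' (4 characters, below the start value)
```

The second chunk is shorter than the start value because the next piece is 11 characters long and would have pushed the chunk past the end of the range.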

Using a Hugging Face Tokenizer

from semantic_text_splitter import TextSplitter
from tokenizers import Tokenizer

# Maximum number of tokens in a chunk
max_tokens = 1000
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
# If your tokenizer has truncation enabled, disable it before passing it to
# the splitter. Otherwise chunk sizes can be capped by the tokenizer's
# truncation limit.
tokenizer.no_truncation()
splitter = TextSplitter.from_huggingface_tokenizer(tokenizer, max_tokens)

chunks = splitter.chunks("your document text")

Using a Tiktoken Tokenizer

from semantic_text_splitter import TextSplitter

# Maximum number of tokens in a chunk
max_tokens = 1000
splitter = TextSplitter.from_tiktoken_model("gpt-3.5-turbo", max_tokens)

chunks = splitter.chunks("your document text")

Using a Custom Callback

from semantic_text_splitter import TextSplitter

splitter = TextSplitter.from_callback(lambda text: len(text), 1000)

chunks = splitter.chunks("your document text")
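
The callback can measure length in whatever unit suits your use case. As a sketch, the two functions below (hypothetical names, not part of the library) count UTF-8 bytes and whitespace-separated words. Each has the same shape as the lambda above: it takes a string and returns an integer.

```python
def utf8_byte_length(text: str) -> int:
    """Length in UTF-8 bytes, e.g. for a storage field with a byte limit."""
    return len(text.encode("utf-8"))


def word_count(text: str) -> int:
    """Length in whitespace-separated words."""
    return len(text.split())


sample = "na\u00efve caf\u00e9"  # "naïve café"
print(len(sample), utf8_byte_length(sample), word_count(sample))  # 10 12 2
```

Assuming `from_callback` accepts any callable of this shape, as the lambda example suggests, one of these could be passed in its place, for example `TextSplitter.from_callback(utf8_byte_length, 1000)`.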

Markdown

All of the above examples can also work with Markdown text. You can use the MarkdownSplitter in the same ways as the TextSplitter.

from semantic_text_splitter import MarkdownSplitter

# Maximum number of characters in a chunk
max_characters = 1000
splitter = MarkdownSplitter(max_characters)
# Optionally, you can have the splitter not trim whitespace for you:
# splitter = MarkdownSplitter(max_characters, trim=False)

chunks = splitter.chunks("# Header\n\nyour document text")

Method

To preserve as much semantic meaning within a chunk as possible, each chunk is composed of the largest semantic units that can fit in the next given chunk. For each splitter type, there is a defined set of semantic levels. Here is an example of the steps used:

  1. Split the text by increasing semantic levels.
  2. Check the first item for each level and select the highest level whose first item still fits within the chunk size.
  3. Merge as many neighboring sections of this level or above as possible into a chunk to maximize chunk length. Boundaries of higher semantic levels are always included when merging, so that the chunk doesn't inadvertently cross semantic boundaries.
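
The steps above can be sketched in plain Python. This is a heavily simplified illustration, not the crate's implementation: the level patterns are rough regex stand-ins for the real Unicode rules, the names `LEVEL_PATTERNS`, `split_keep`, `next_chunk`, and `sketch_chunks` are hypothetical, and no whitespace trimming is done.

```python
import re

# Simplified stand-ins for semantic levels, lowest to highest. Each pattern
# matches the separator between sections at that level. These are
# assumptions for illustration, not the library's actual rules.
LEVEL_PATTERNS = [
    "",         # characters
    r"\s+",     # words (rough approximation)
    r"\n",      # single newlines
    r"\n{2,}",  # blank lines between paragraphs
]


def split_keep(text, pattern):
    """Split text into sections, keeping each separator on the section before it."""
    if pattern == "":
        return list(text)
    parts = re.split(f"({pattern})", text)  # [section, sep, section, sep, ...]
    joined = [
        parts[i] + (parts[i + 1] if i + 1 < len(parts) else "")
        for i in range(0, len(parts), 2)
    ]
    return [section for section in joined if section]


def next_chunk(text, max_len):
    # Step 2: pick the highest level whose first section fits.
    for pattern in reversed(LEVEL_PATTERNS):
        sections = split_keep(text, pattern)
        if len(sections[0]) <= max_len:
            break
    # Step 3: merge neighboring sections while the chunk still fits.
    chunk = ""
    for section in sections:
        if len(chunk) + len(section) > max_len:
            break
        chunk += section
    return chunk


def sketch_chunks(text, max_len):
    """Split text into chunks of at most max_len characters (max_len >= 1)."""
    result = []
    while text:
        chunk = next_chunk(text, max_len)
        result.append(chunk)
        text = text[len(chunk):]
    return result


text = "One two three.\n\nFour five six seven eight.\n\nNine."
for chunk in sketch_chunks(text, 20):
    print(repr(chunk))
```

In this sketch the first paragraph fits as a whole, so it becomes a chunk at the paragraph level. The second paragraph is too long, so the sketch falls back to word boundaries for the next chunk and then returns to the paragraph level for what remains.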

The boundaries used to split the text if using the chunks method, in ascending order:

TextSplitter Semantic Levels

  1. Characters
  2. Unicode Grapheme Cluster Boundaries
  3. Unicode Word Boundaries
  4. Unicode Sentence Boundaries
  5. Ascending sequence length of newlines. (Newline is \r\n, \n, or \r) Each unique length of consecutive newline sequences is treated as its own semantic level. So a sequence of 2 newlines is a higher level than a sequence of 1 newline, and so on.
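
As a rough illustration of the newline levels, the sketch below finds every run of consecutive newlines in a text and reports the distinct run lengths, each of which would act as its own level. The function name `newline_levels` is hypothetical, and the regexes are an approximation written for this example.

```python
import re

# A newline is \r\n, \n, or \r. \r\n is listed first so it counts as one newline.
SINGLE_NEWLINE = re.compile(r"\r\n|\n|\r")
NEWLINE_RUN = re.compile(r"(?:\r\n|\n|\r)+")


def newline_levels(text):
    """Return the distinct lengths of consecutive-newline runs, ascending."""
    lengths = {
        len(SINGLE_NEWLINE.findall(match.group()))
        for match in NEWLINE_RUN.finditer(text)
    }
    return sorted(lengths)


text = "Title\n\n\nFirst paragraph.\r\n\r\nSecond paragraph,\nsecond line."
print(newline_levels(text))  # [1, 2, 3]
```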

Splitting doesn't occur below the character level; otherwise you could get partial bytes of a char, which may not be a valid Unicode str.
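
A short Python illustration of why this matters. Python strings are indexed by code point, so the problem only shows up when slicing the encoded bytes, but the effect is the same: cutting inside a multi-byte character leaves invalid UTF-8. The last lines also show why grapheme clusters sit one level above characters.

```python
text = "h\u00e9llo"            # "héllo": the "é" takes two bytes in UTF-8
raw = text.encode("utf-8")

try:
    raw[:2].decode("utf-8")     # cuts the two-byte "é" in half
    valid = True
except UnicodeDecodeError:
    valid = False

print(valid)      # False: the byte slice is not valid UTF-8
print(text[:2])   # hé: slicing on character boundaries is safe

# One visible "é" can also be two characters: "e" plus a combining accent.
decomposed = "e\u0301"
print(len(decomposed))  # 2
```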

MarkdownSplitter Semantic Levels

Markdown is parsed according to the CommonMark spec, along with some optional features such as GitHub Flavored Markdown.

  1. Characters
  2. Unicode Grapheme Cluster Boundaries
  3. Unicode Word Boundaries
  4. Unicode Sentence Boundaries
  5. Soft line breaks (single newlines), which aren't necessarily a new element in Markdown.
  6. Inline elements such as: text nodes, emphasis, strong, strikethrough, link, image, table cells, inline code, footnote references, task list markers, and inline HTML.
  7. Block elements such as: paragraphs, code blocks, footnote definitions, metadata. Also, a block quote or row/item within a table or list that can contain other "block" type elements, and a list or table that contains items.
  8. Thematic breaks or horizontal rules.
  9. Headings by level
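
To make the heading levels concrete, here is a small sketch that groups the headings in a Markdown string by level. It is only an approximation for illustration: it uses a regex for simple ATX headings (`#` through `######`), whereas a real CommonMark parser also handles setext headings, closing `#` sequences, and headings inside code blocks. The name `headings_by_level` is hypothetical.

```python
import re

# Simple ATX headings only: up to 3 leading spaces, 1-6 "#", then the title.
ATX_HEADING = re.compile(r"^ {0,3}(#{1,6})[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def headings_by_level(markdown):
    """Map each heading level (1-6) to the heading titles found at that level."""
    found = {}
    for match in ATX_HEADING.finditer(markdown):
        level = len(match.group(1))
        found.setdefault(level, []).append(match.group(2))
    return found


doc = "# Guide\n\nIntro text.\n\n## Install\n\nSteps.\n\n## Usage\n\n### Options\n"
print(headings_by_level(doc))
# {1: ['Guide'], 2: ['Install', 'Usage'], 3: ['Options']}
```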

Splitting doesn't occur below the character level; otherwise you could get partial bytes of a char, which may not be a valid Unicode str.

Note on sentences

There are lots of methods of determining sentence breaks, all with varying degrees of accuracy, and many requiring ML models to do so. Rather than trying to find the perfect sentence breaks, we rely on the Unicode method of determining sentence boundaries, which in most cases is good enough for finding a decent semantic breaking point if a paragraph is too large, and avoids the performance penalties of many other methods.

Inspiration

This crate was inspired by LangChain's TextSplitter. Looking into the implementation, however, there was potential for better performance as well as better semantic chunking.

A big thank you to the Unicode team for their icu_segmenter crate that manages a lot of the complexity of matching the Unicode rules for words and sentences.
