
Split text into semantic chunks, up to a desired chunk size. Supports calculating length by characters and tokens, and is callable from Rust and Python.

Project description

semantic-text-splitter


Large language models (LLMs) can be used for many tasks, but often have a limited context size that can be smaller than documents you might want to use. To use documents of larger length, you often have to split your text into chunks to fit within this context size.

This crate provides methods for splitting longer pieces of text into smaller chunks, aiming to maximize a desired chunk size, but still splitting at semantically sensible boundaries whenever possible.

Get Started

By Number of Characters

from semantic_text_splitter import TextSplitter

# Maximum number of characters in a chunk
max_characters = 1000
# Optionally, you can have the splitter not trim whitespace for you
splitter = TextSplitter(max_characters)
# splitter = TextSplitter(max_characters, trim=False)

chunks = splitter.chunks("your document text")

Using a Range for Chunk Capacity

You also have the option of specifying your chunk capacity as a range.

Once a chunk has reached a length that falls within the range, it will be returned.

It is always possible that a chunk may be returned that is less than the start value, as adding the next piece of text may have made it larger than the end capacity.

from semantic_text_splitter import TextSplitter


# Maximum number of characters in a chunk. Will fill up the
# chunk until it is somewhere in this range.
splitter = TextSplitter((200, 1000))

chunks = splitter.chunks("your document text")
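To make the range semantics concrete, here is a minimal stdlib-only sketch. `greedy_fill` is a hypothetical helper invented for illustration, not the library's actual algorithm; it only shows why a chunk can come in below the start of the range:

```python
# Toy illustration of range-capacity semantics: sections are appended
# greedily, and a chunk is emitted as soon as adding the next section
# would exceed `end`, even if the chunk is still below `start`.

def greedy_fill(sections, start, end):
    # `start` only documents the range; emission is driven by `end`.
    chunks, current = [], ""
    for section in sections:
        if current and len(current) + len(section) > end:
            chunks.append(current)  # may be shorter than `start`
            current = section
        else:
            current += section
    if current:
        chunks.append(current)
    return chunks

# With a (200, 1000) range, a 150-char section followed by a 900-char
# section yields a first chunk of only 150 chars, below the start value,
# because appending the next section would overshoot 1000.
print([len(c) for c in greedy_fill(["a" * 150, "b" * 900], 200, 1000)])
# [150, 900]
```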

Using a Hugging Face Tokenizer

from semantic_text_splitter import TextSplitter
from tokenizers import Tokenizer

# Maximum number of tokens in a chunk
max_tokens = 1000
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
splitter = TextSplitter.from_huggingface_tokenizer(tokenizer, max_tokens)

chunks = splitter.chunks("your document text")

Using a Tiktoken Tokenizer

from semantic_text_splitter import TextSplitter

# Maximum number of tokens in a chunk
max_tokens = 1000
splitter = TextSplitter.from_tiktoken_model("gpt-3.5-turbo", max_tokens)

chunks = splitter.chunks("your document text")

Using a Custom Callback

from semantic_text_splitter import TextSplitter

splitter = TextSplitter.from_callback(lambda text: len(text), 1000)

chunks = splitter.chunks("your document text")

Markdown

All of the above examples can also work with Markdown text. You can use the MarkdownSplitter in the same ways as the TextSplitter.

from semantic_text_splitter import MarkdownSplitter

# Maximum number of characters in a chunk
max_characters = 1000
# Optionally, you can have the splitter not trim whitespace for you
splitter = MarkdownSplitter(max_characters)
# splitter = MarkdownSplitter(max_characters, trim=False)

chunks = splitter.chunks("# Header\n\nyour document text")

Method

To preserve as much semantic meaning within a chunk as possible, each chunk is composed of the largest semantic units that can fit within the desired chunk size. For each splitter type, there is a defined set of semantic levels. Here is an example of the steps used:

  1. Split the text by increasing semantic levels.
  2. Check the first item for each level and select the highest level whose first item still fits within the chunk size.
  3. Merge as many neighboring sections of this level or above into a chunk to maximize chunk length. Boundaries of higher semantic levels are always included when merging, so that the chunk doesn't inadvertently cross semantic boundaries.
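The steps above can be sketched, very roughly, in pure Python. This is a deliberately simplified illustration (the real implementation is in Rust, handles many more levels, trims whitespace, and respects higher-level boundaries while merging); `chunk` and `LEVELS` are invented names for this sketch:

```python
import re

# Ascending semantic levels for the sketch: characters, word-ish
# boundaries, paragraph breaks. Each splitter returns pieces whose
# concatenation reproduces the original text.
LEVELS = [
    lambda t: list(t),                    # 1. characters
    lambda t: re.split(r"(?<= )", t),     # 2. split after spaces, keeping them
    lambda t: re.split(r"(?<=\n\n)", t),  # 3. split after paragraph breaks
]

def chunk(text, max_size):
    # Step 2: pick the highest level whose first piece still fits.
    pieces = [text]
    for split in LEVELS:
        candidate = split(text)
        if len(candidate[0]) <= max_size:
            pieces = candidate
        else:
            break
    # Step 3: greedily merge neighboring pieces up to max_size.
    chunks, current = [], ""
    for piece in pieces:
        if current and len(current) + len(piece) > max_size:
            chunks.append(current)
            current = piece
        else:
            current += piece
    if current:
        chunks.append(current)
    return chunks

# The paragraph break is preferred over mid-sentence word boundaries.
print(chunk("one two three\n\nfour five", 16))
# ['one two three\n\n', 'four five']
```

Note that unlike the library's default, this sketch does not trim whitespace from the resulting chunks.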

The boundaries used to split the text when using the chunks method, in ascending order:

TextSplitter Semantic Levels

  1. Characters
  2. Unicode Grapheme Cluster Boundaries
  3. Unicode Word Boundaries
  4. Unicode Sentence Boundaries
  5. Ascending sequence length of newlines. (A newline is \r\n, \n, or \r.) Each unique length of consecutive newline sequences is treated as its own semantic level, so a sequence of 2 newlines is a higher level than a sequence of 1 newline, and so on.

Splitting doesn't occur below the character level; otherwise you could get partial bytes of a character, which may not be a valid Unicode string.
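Level 5 can be illustrated with a short stdlib-only snippet. `newline_run_lengths` is an invented helper, not part of the library's API; it just shows how each distinct run length of consecutive newlines forms its own level:

```python
import re

# Find every run of consecutive newlines and count how many newlines
# it contains (\r\n, \r, and \n each count as one newline). Each
# distinct run length acts as its own semantic level, with longer runs
# (e.g. paragraph breaks) ranking above shorter ones.

def newline_run_lengths(text):
    lengths = set()
    for run in re.finditer(r"(?:\r\n|\r|\n)+", text):
        lengths.add(len(re.findall(r"\r\n|\r|\n", run.group())))
    return sorted(lengths)

# Three distinct levels: a single \n, a double \n, and a triple \r\n run.
print(newline_run_lengths("a\nb\n\nc\r\n\r\n\r\nd"))  # [1, 2, 3]
```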

MarkdownSplitter Semantic Levels

Markdown is parsed according to the CommonMark spec, along with some optional features such as GitHub Flavored Markdown.

  1. Characters
  2. Unicode Grapheme Cluster Boundaries
  3. Unicode Word Boundaries
  4. Unicode Sentence Boundaries
  5. Soft line breaks (single newlines), which aren't necessarily a new element in Markdown.
  6. Inline elements such as text nodes, emphasis, strong, strikethrough, links, images, table cells, inline code, footnote references, task list markers, and inline HTML.
  7. Block elements such as paragraphs, code blocks, footnote definitions, and metadata. Also, a block quote or a row/item within a table or list that can contain other "block"-type elements, and a list or table that contains items.
  8. Thematic breaks or horizontal rules.
  9. Headings by level

Splitting doesn't occur below the character level; otherwise you could get partial bytes of a character, which may not be a valid Unicode string.
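To make the ordering concrete, here is a rough, stdlib-only sketch that ranks a few of these boundaries. It is not the library's CommonMark parser; `boundary_rank` and its scores are invented for illustration, with the first tuple element loosely mirroring the level numbers in the list above:

```python
import re

# Rank a single Markdown line as a boundary. Tuples compare
# lexicographically: headings (9) outrank thematic breaks (8), which
# outrank blank lines between blocks (7) and plain inline text (5).

def boundary_rank(line):
    if re.fullmatch(r"#{1,6} .*", line):
        hashes = line.split(" ")[0].count("#")
        # 9. headings by level: fewer '#'s means a higher-level boundary
        return (9, 6 - hashes)
    if re.fullmatch(r"-{3,}|\*{3,}|_{3,}", line):
        return (8, 0)  # 8. thematic break / horizontal rule
    if line == "":
        return (7, 0)  # 7. blank line separating block elements
    return (5, 0)      # 5-6. soft breaks / inline content

doc = ["# Title", "intro", "", "---", "## Section", "body"]
print([boundary_rank(line) for line in doc])
# [(9, 5), (5, 0), (7, 0), (8, 0), (9, 4), (5, 0)]
```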

Note on sentences

There are lots of methods of determining sentence breaks, all with varying degrees of accuracy, and many requiring ML models to do so. Rather than trying to find the perfect sentence breaks, we rely on the Unicode sentence boundary rules, which in most cases are good enough for finding a decent semantic breaking point if a paragraph is too large, and avoid the performance penalties of many other methods.

Inspiration

This crate was inspired by LangChain's TextSplitter. However, looking into the implementation, there was potential for better performance as well as better semantic chunking.

A big thank you to the Unicode team for their icu_segmenter crate that manages a lot of the complexity of matching the Unicode rules for words and sentences.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

semantic_text_splitter-0.30.1.tar.gz (288.6 kB): Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

semantic_text_splitter-0.30.1-cp310-abi3-win_amd64.whl (8.1 MB): CPython 3.10+, Windows x86-64
semantic_text_splitter-0.30.1-cp310-abi3-win32.whl (7.8 MB): CPython 3.10+, Windows x86
semantic_text_splitter-0.30.1-cp310-abi3-manylinux_2_28_x86_64.whl (8.5 MB): CPython 3.10+, manylinux: glibc 2.28+ x86-64
semantic_text_splitter-0.30.1-cp310-abi3-manylinux_2_28_s390x.whl (8.7 MB): CPython 3.10+, manylinux: glibc 2.28+ s390x
semantic_text_splitter-0.30.1-cp310-abi3-manylinux_2_28_ppc64le.whl (8.8 MB): CPython 3.10+, manylinux: glibc 2.28+ ppc64le
semantic_text_splitter-0.30.1-cp310-abi3-manylinux_2_28_armv7l.whl (8.4 MB): CPython 3.10+, manylinux: glibc 2.28+ ARMv7l
semantic_text_splitter-0.30.1-cp310-abi3-manylinux_2_28_aarch64.whl (8.5 MB): CPython 3.10+, manylinux: glibc 2.28+ ARM64
semantic_text_splitter-0.30.1-cp310-abi3-macosx_11_0_arm64.whl (8.3 MB): CPython 3.10+, macOS 11.0+ ARM64
semantic_text_splitter-0.30.1-cp310-abi3-macosx_10_12_x86_64.whl (8.3 MB): CPython 3.10+, macOS 10.12+ x86-64

File details

Details for the file semantic_text_splitter-0.30.1.tar.gz.

File metadata

  • Download URL: semantic_text_splitter-0.30.1.tar.gz
  • Upload date:
  • Size: 288.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.11.7 {"installer":{"name":"uv","version":"0.11.7","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for semantic_text_splitter-0.30.1.tar.gz
Algorithm Hash digest
SHA256 31724221db160218ff7ef6209f770a60a539972329b5a92267a2644a6796a962
MD5 bf1a93d23c3e4e2ad1571f8f0ce23d40
BLAKE2b-256 406d3e6ad68737a1a778a6f3310211dbc43246798a475a65097231f33b31c1b7

See more details on using hashes here.

File details

Details for the file semantic_text_splitter-0.30.1-cp310-abi3-win_amd64.whl.

File metadata

  • Download URL: semantic_text_splitter-0.30.1-cp310-abi3-win_amd64.whl
  • Upload date:
  • Size: 8.1 MB
  • Tags: CPython 3.10+, Windows x86-64
  • Uploaded using Trusted Publishing? No

File hashes

Hashes for semantic_text_splitter-0.30.1-cp310-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 192f505cad2e81cf097005dd65f5af9aae46a863f65741042bb3a3173b29db99
MD5 57df08b7cb4993f899c0f4d7fe97b370
BLAKE2b-256 9a19a2c54b42d9a1c0aa71f2b93b22a23f983545924b10bf2423d5a2875ea0d0

See more details on using hashes here.

File details

Details for the file semantic_text_splitter-0.30.1-cp310-abi3-win32.whl.

File metadata

  • Download URL: semantic_text_splitter-0.30.1-cp310-abi3-win32.whl
  • Upload date:
  • Size: 7.8 MB
  • Tags: CPython 3.10+, Windows x86
  • Uploaded using Trusted Publishing? No

File hashes

Hashes for semantic_text_splitter-0.30.1-cp310-abi3-win32.whl
Algorithm Hash digest
SHA256 73508798b309ba22dc069f9cbf4fdc0f1be055e77b6f52c1ec9fcdab1af80e36
MD5 244407a4f299895336daab967bbc920b
BLAKE2b-256 f84150f455ae260b7d94b28dfcdba5dbc770f8bd2ea121a3c84962185c717db4

See more details on using hashes here.

File details

Details for the file semantic_text_splitter-0.30.1-cp310-abi3-manylinux_2_28_x86_64.whl.

File metadata

  • Download URL: semantic_text_splitter-0.30.1-cp310-abi3-manylinux_2_28_x86_64.whl
  • Upload date:
  • Size: 8.5 MB
  • Tags: CPython 3.10+, manylinux: glibc 2.28+ x86-64
  • Uploaded using Trusted Publishing? No

File hashes

Hashes for semantic_text_splitter-0.30.1-cp310-abi3-manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 1efecacb37f94e897aa87d843b9e19da46b8748228c8674d2ad66db535c75923
MD5 98255ac68d150f1550b02eeae3304972
BLAKE2b-256 701c1e55b45c492fe67d89d4fef3a8e2922aa0bb2d2d831931a855321fdd1659

See more details on using hashes here.

File details

Details for the file semantic_text_splitter-0.30.1-cp310-abi3-manylinux_2_28_s390x.whl.

File metadata

  • Download URL: semantic_text_splitter-0.30.1-cp310-abi3-manylinux_2_28_s390x.whl
  • Upload date:
  • Size: 8.7 MB
  • Tags: CPython 3.10+, manylinux: glibc 2.28+ s390x
  • Uploaded using Trusted Publishing? No

File hashes

Hashes for semantic_text_splitter-0.30.1-cp310-abi3-manylinux_2_28_s390x.whl
Algorithm Hash digest
SHA256 04c6ba1aac998dcc3b910266d8b73bc2c6ffe80ba026801bac581cd7997cba39
MD5 8b2c47b3c62b2be4b5c11978b206a67f
BLAKE2b-256 f8dfe0a572fcd5bdb77d4d9aaee4d20db0d3e88abb928eda8171b28dfaadd95f

See more details on using hashes here.

File details

Details for the file semantic_text_splitter-0.30.1-cp310-abi3-manylinux_2_28_ppc64le.whl.

File metadata

  • Download URL: semantic_text_splitter-0.30.1-cp310-abi3-manylinux_2_28_ppc64le.whl
  • Upload date:
  • Size: 8.8 MB
  • Tags: CPython 3.10+, manylinux: glibc 2.28+ ppc64le
  • Uploaded using Trusted Publishing? No

File hashes

Hashes for semantic_text_splitter-0.30.1-cp310-abi3-manylinux_2_28_ppc64le.whl
Algorithm Hash digest
SHA256 6e94fa3f4415bb5bb92d5030d3a6139e49b37cf7c08537df7f8d183b3b8c814a
MD5 c68587f4cc0dba01bec0f31087fe0aa4
BLAKE2b-256 13494479c566ff30989a3901de76b1cbd2e3565d2e98d3098e19c2168ad0b919

See more details on using hashes here.

File details

Details for the file semantic_text_splitter-0.30.1-cp310-abi3-manylinux_2_28_armv7l.whl.

File metadata

  • Download URL: semantic_text_splitter-0.30.1-cp310-abi3-manylinux_2_28_armv7l.whl
  • Upload date:
  • Size: 8.4 MB
  • Tags: CPython 3.10+, manylinux: glibc 2.28+ ARMv7l
  • Uploaded using Trusted Publishing? No

File hashes

Hashes for semantic_text_splitter-0.30.1-cp310-abi3-manylinux_2_28_armv7l.whl
Algorithm Hash digest
SHA256 9617c3b66278a9e7666bf1363e3e878dd4cd5d946e52b934bcaba04d14b1f823
MD5 4cf1664b0dfafa56aa3f59a6c2d9d5e7
BLAKE2b-256 c71e9c30c90dedaf1647e5efb13e1e70e5c4eb22344e48168eb770b90942728e

See more details on using hashes here.

File details

Details for the file semantic_text_splitter-0.30.1-cp310-abi3-manylinux_2_28_aarch64.whl.

File metadata

  • Download URL: semantic_text_splitter-0.30.1-cp310-abi3-manylinux_2_28_aarch64.whl
  • Upload date:
  • Size: 8.5 MB
  • Tags: CPython 3.10+, manylinux: glibc 2.28+ ARM64
  • Uploaded using Trusted Publishing? No

File hashes

Hashes for semantic_text_splitter-0.30.1-cp310-abi3-manylinux_2_28_aarch64.whl
Algorithm Hash digest
SHA256 fffd590e1ecf1461da8aeeed5a66b26c7b93cafd4e300477ca1c6093a7184ed5
MD5 746fdff6926fe958e5e7ae6bfcd91d55
BLAKE2b-256 c50df4ad38d1fbf1109bbf81a0d4c7e38ebf37cfb9e9e2ee58d96570ec801b76

See more details on using hashes here.

File details

Details for the file semantic_text_splitter-0.30.1-cp310-abi3-macosx_11_0_arm64.whl.

File metadata

  • Download URL: semantic_text_splitter-0.30.1-cp310-abi3-macosx_11_0_arm64.whl
  • Upload date:
  • Size: 8.3 MB
  • Tags: CPython 3.10+, macOS 11.0+ ARM64
  • Uploaded using Trusted Publishing? No

File hashes

Hashes for semantic_text_splitter-0.30.1-cp310-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 e3ed7c1c6bc36ee5b33a141fb3815720a3d501d571f8982f390e2ac3a90373d9
MD5 e599671e6363eecde8f84094fa185e5a
BLAKE2b-256 93ed5f3a817770d4cf923f5d5ff7a33888fb3eb223da11900283545ef457ba3a

See more details on using hashes here.

File details

Details for the file semantic_text_splitter-0.30.1-cp310-abi3-macosx_10_12_x86_64.whl.

File metadata

  • Download URL: semantic_text_splitter-0.30.1-cp310-abi3-macosx_10_12_x86_64.whl
  • Upload date:
  • Size: 8.3 MB
  • Tags: CPython 3.10+, macOS 10.12+ x86-64
  • Uploaded using Trusted Publishing? No

File hashes

Hashes for semantic_text_splitter-0.30.1-cp310-abi3-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 1baeb22f1c6c85dd7190a535e84347437045bebf27b887061152e68ef3ff7a1d
MD5 bd01fd937898d46eb09842cabf2076c5
BLAKE2b-256 9cf043d5dc61ce0b47bcd6c875c9c16f36e6a495f1e84a7f5e313f5f9b76d92d

See more details on using hashes here.
