
Split text into semantic chunks, up to a desired chunk size. Supports calculating length by characters and tokens (when used with large language models).

Project description

semantic-text-splitter


Large language models (LLMs) can be used for many tasks, but they often have a limited context size that can be smaller than the documents you want to use. To work with longer documents, you typically have to split the text into chunks that fit within this context size.

This crate provides methods for splitting longer pieces of text into smaller chunks, aiming to maximize a desired chunk size, but still splitting at semantically sensible boundaries whenever possible.

Get Started

By Number of Characters

from semantic_text_splitter import CharacterTextSplitter

# Maximum number of characters in a chunk
max_characters = 1000
# Optionally, have the splitter not trim whitespace for you
splitter = CharacterTextSplitter(trim_chunks=False)

chunks = splitter.chunks("your document text", max_characters)

With Huggingface Tokenizer

from semantic_text_splitter import HuggingFaceTextSplitter
from tokenizers import Tokenizer

# Maximum number of tokens in a chunk
max_tokens = 1000
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
# Optionally, have the splitter not trim whitespace for you
splitter = HuggingFaceTextSplitter(tokenizer, trim_chunks=False)

chunks = splitter.chunks("your document text", max_tokens)

With Tiktoken Tokenizer

from semantic_text_splitter import TiktokenTextSplitter

# Maximum number of tokens in a chunk
max_tokens = 1000
# Optionally, have the splitter not trim whitespace for you
splitter = TiktokenTextSplitter("gpt-3.5-turbo", trim_chunks=False)

chunks = splitter.chunks("your document text", max_tokens)

Using a Range for Chunk Capacity

You can also specify your chunk capacity as a range.

Once a chunk reaches a length that falls within the range, it is returned.

A chunk may still end up shorter than the start of the range: adding the next piece of text would have pushed it past the end of the range.

from semantic_text_splitter import CharacterTextSplitter

# By default, the splitter trims whitespace from each chunk
splitter = CharacterTextSplitter()

# Chunk capacity as a (min, max) range of characters. Each chunk is
# filled until its length falls somewhere within this range.
chunks = splitter.chunks("your document text", chunk_capacity=(200, 1000))

Method

To preserve as much semantic meaning within a chunk as possible, a recursive approach is used: start with larger semantic units and, if a unit is too large, break it up into the next-largest unit. The steps are:

  1. Split the text by a given level.
  2. For each section, check whether it fits within the chunk size.
    • Yes: merge as many of these neighboring sections into a chunk as possible, to maximize chunk length.
    • No: split by the next level and repeat.
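The split-and-merge steps above can be sketched in plain Python. This is a simplified illustration, not the crate's actual implementation: it measures length in characters, uses a small hypothetical list of split levels, and drops separators at chunk boundaries, loosely mimicking trimmed chunks.

```python
def chunks(text, max_size, levels=("\n\n", "\n", ". ", " ")):
    """Recursively split text, greedily merging neighboring sections."""
    if len(text) <= max_size:
        return [text]
    # Find the first (coarsest) level that actually splits the text.
    sep = next((s for s in levels if s in text), None)
    if sep is None:
        # No separator left: fall back to hard fixed-size splits.
        return [text[i:i + max_size] for i in range(0, len(text), max_size)]
    result, current = [], ""
    for section in text.split(sep):
        piece = section if not current else sep + section
        if len(section) > max_size:
            # Section is too large on its own: flush, then recurse to a finer level.
            if current:
                result.append(current)
                current = ""
            result.extend(chunks(section, max_size, levels))
        elif len(current) + len(piece) <= max_size:
            current += piece  # Merge neighbors to maximize chunk length.
        else:
            result.append(current)
            current = section  # Separator at the boundary is dropped.
    if current:
        result.append(current)
    return result
```

For example, `chunks("aaaa bbbb cccc", 9)` yields `["aaaa bbbb", "cccc"]`: the first two words are merged because they fit within the capacity, and the third starts a new chunk.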

The boundaries used to split the text when using the top-level chunks method, from coarsest to finest:

  1. Descending sequence length of newlines. (A newline is \r\n, \n, or \r.) Each unique length of consecutive newline sequences is treated as its own semantic level.
  2. Unicode Sentence Boundaries
  3. Unicode Word Boundaries
  4. Unicode Grapheme Cluster Boundaries
  5. Characters
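The first level can be illustrated by scanning for runs of newlines: the longest run present determines the coarsest boundary to split at first. A rough sketch of that idea (not the crate's code):

```python
import re

def newline_levels(text):
    """Return the distinct lengths of consecutive-newline runs, longest first.

    Each run length (e.g. a double line break vs. a single one) acts as its
    own semantic level; splitting starts at the longest run present.
    """
    runs = re.findall(r"(?:\r\n|\r|\n)+", text)
    # Count newlines per run, treating \r\n as a single newline.
    lengths = {len(re.findall(r"\r\n|\r|\n", run)) for run in runs}
    return sorted(lengths, reverse=True)
```

For `"para one\n\npara two\nline"`, this returns `[2, 1]`: paragraph breaks (two newlines) are tried before single line breaks.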

Splitting doesn't occur below the character level; otherwise you could get partial bytes of a character, which may not be a valid Unicode string.

Note on sentences: there are many methods for determining sentence breaks, with varying degrees of accuracy, and many require ML models. Rather than trying to find the perfect sentence breaks, we rely on the Unicode sentence boundary rules (UAX #29), which in most cases are good enough for finding a decent semantic breaking point when a paragraph is too large, and which avoid the performance penalties of many other methods.
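As a point of comparison, here is a deliberately naive regex-based sentence splitter. The Unicode rules handle many cases this misses, such as abbreviations, quotations, and non-Latin scripts; this sketch only illustrates why a simple heuristic is not enough.

```python
import re

def naive_sentences(text):
    """Split on sentence-ending punctuation followed by whitespace.

    A simplistic stand-in for proper Unicode sentence segmentation;
    it mishandles abbreviations like "Dr." or "e.g.".
    """
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
```

For example, `naive_sentences("Dr. Smith left.")` wrongly breaks after `"Dr."`, which proper segmentation rules avoid.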

Inspiration

This crate was inspired by LangChain's TextSplitter. However, a look at that implementation suggested there was room for better performance as well as better semantic chunking.

A big thank you to the unicode-rs team for their unicode-segmentation crate that manages a lot of the complexity of matching the Unicode rules for words and sentences.



Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

semantic_text_splitter-0.6.2.tar.gz (242.2 kB), Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

semantic_text_splitter-0.6.2-cp38-abi3-win_amd64.whl (3.2 MB), CPython 3.8+, Windows x86-64
semantic_text_splitter-0.6.2-cp38-abi3-win32.whl (3.0 MB), CPython 3.8+, Windows x86
semantic_text_splitter-0.6.2-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.5 MB), CPython 3.8+, manylinux: glibc 2.17+ x86-64
semantic_text_splitter-0.6.2-cp38-abi3-manylinux_2_17_s390x.manylinux2014_s390x.whl (4.8 MB), CPython 3.8+, manylinux: glibc 2.17+ s390x
semantic_text_splitter-0.6.2-cp38-abi3-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl (4.8 MB), CPython 3.8+, manylinux: glibc 2.17+ ppc64le
semantic_text_splitter-0.6.2-cp38-abi3-manylinux_2_17_armv7l.manylinux2014_armv7l.whl (4.4 MB), CPython 3.8+, manylinux: glibc 2.17+ ARMv7l
semantic_text_splitter-0.6.2-cp38-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (4.5 MB), CPython 3.8+, manylinux: glibc 2.17+ ARM64
semantic_text_splitter-0.6.2-cp38-abi3-manylinux_2_12_i686.manylinux2010_i686.whl (4.5 MB), CPython 3.8+, manylinux: glibc 2.12+ i686
semantic_text_splitter-0.6.2-cp38-abi3-macosx_11_0_arm64.whl (3.4 MB), CPython 3.8+, macOS 11.0+ ARM64
semantic_text_splitter-0.6.2-cp38-abi3-macosx_10_12_x86_64.whl (3.4 MB), CPython 3.8+, macOS 10.12+ x86-64

File details

Details for the file semantic_text_splitter-0.6.2.tar.gz.

File hashes

Hashes for semantic_text_splitter-0.6.2.tar.gz:

SHA256: 2f8013e9e5d353f9d4dee29f05f9c730fb7a751ad1f63c225b2f038f7e000559
MD5: caf07988ae46c208742ba286e0e47645
BLAKE2b-256: abae1549270c6bfb0b50a7bad10e226ee9eb7a95b0db1136cb43bdf0bdf6e44d

See more details on using hashes here.

