
Split text into semantic chunks, up to a desired chunk size. Supports calculating length by characters and tokens, and is callable from Rust and Python.

Project description

semantic-text-splitter


Large language models (LLMs) can be used for many tasks, but often have a limited context size that can be smaller than documents you might want to use. To use documents of larger length, you often have to split your text into chunks to fit within this context size.

This crate provides methods for splitting longer pieces of text into smaller chunks, aiming to maximize a desired chunk size, but still splitting at semantically sensible boundaries whenever possible.

Get Started

By Number of Characters

from semantic_text_splitter import TextSplitter

# Maximum number of characters in a chunk
max_characters = 1000
# Optionally can also have the splitter not trim whitespace for you
splitter = TextSplitter()
# splitter = TextSplitter(trim_chunks=False)

chunks = splitter.chunks("your document text", max_characters)

Using a Range for Chunk Capacity

You also have the option of specifying your chunk capacity as a range.

Once a chunk has reached a length that falls within the range, it will be returned.

It is always possible that a chunk may be returned that is less than the start value, as adding the next piece of text may have made it larger than the end capacity.

from semantic_text_splitter import TextSplitter

splitter = TextSplitter()

# Target range of characters per chunk. The chunk will be filled
# until its length falls somewhere in this range.
chunks = splitter.chunks("your document text", chunk_capacity=(200,1000))
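The fill behavior described above can be sketched in plain Python (a minimal illustration, not the library's actual implementation; `pieces` is a hypothetical list of pre-split semantic sections, measured in characters):

```python
def build_chunk(pieces, start, end):
    # Greedily append sections until the chunk's length lands in [start, end].
    chunk = ""
    for piece in pieces:
        if len(chunk + piece) > end:
            # The next piece would exceed the end capacity, so the chunk is
            # returned as-is, even if it is still shorter than `start`.
            break
        chunk += piece
        if len(chunk) >= start:
            break
    return chunk
```

This shows why a returned chunk can be shorter than the start of the range: adding the next section would have pushed it past the end capacity.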

Using a Hugging Face Tokenizer

from semantic_text_splitter import TextSplitter
from tokenizers import Tokenizer

# Maximum number of tokens in a chunk
max_tokens = 1000
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
splitter = TextSplitter.from_huggingface_tokenizer(tokenizer)

chunks = splitter.chunks("your document text", max_tokens)

Using a Tiktoken Tokenizer

from semantic_text_splitter import TextSplitter

# Maximum number of tokens in a chunk
max_tokens = 1000
splitter = TextSplitter.from_tiktoken_model("gpt-3.5-turbo")

chunks = splitter.chunks("your document text", max_tokens)

Using a Custom Callback

from semantic_text_splitter import TextSplitter

# Optionally can also have the splitter not trim whitespace for you
splitter = TextSplitter.from_callback(lambda text: len(text))

# Target range of tokens (as measured by the callback) per chunk.
# The chunk will be filled until its length falls somewhere in this range.
chunks = splitter.chunks("your document text", chunk_capacity=(200,1000))
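Any callable taking a string and returning an integer works as the length callback. For instance, a hypothetical callback could measure length in whitespace-separated words rather than characters:

```python
def word_count(text):
    # Hypothetical length callback: counts whitespace-separated words
    # instead of characters.
    return len(text.split())

# splitter = TextSplitter.from_callback(word_count)
```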

Markdown

All of the above examples can also work with Markdown text. You can use the MarkdownSplitter in the same ways as the TextSplitter.

from semantic_text_splitter import MarkdownSplitter
# Default implementation uses character count for chunk size.
# Can also use all of the same tokenizer implementations as `TextSplitter`.
splitter = MarkdownSplitter()

splitter.chunks("# Header\n\nyour document text", 1000)

Method

To preserve as much semantic meaning within a chunk as possible, each chunk is composed of the largest semantic units that can fit in the next given chunk. For each splitter type, there is a defined set of semantic levels. Here is an example of the steps used:

  1. Split the text at increasing semantic levels.
  2. Check the first item for each level and select the highest level whose first item still fits within the chunk size.
  3. Merge as many of these neighboring sections of this level or above into a chunk to maximize chunk length. Boundaries of higher semantic levels are always included when merging, so that the chunk doesn't inadvertently cross semantic boundaries.
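The level-selection step (step 2 above) can be sketched as follows. This is a minimal illustration, not the crate's actual implementation; `splits_by_level` is a hypothetical list giving the text pre-split at each ascending semantic level:

```python
def pick_level(splits_by_level, capacity, length=len):
    # Check the first item at each ascending semantic level and keep the
    # highest level whose first item still fits within the chunk capacity.
    best = 0
    for level, splits in enumerate(splits_by_level):
        if splits and length(splits[0]) <= capacity:
            best = level
    return best
```

The real splitter then merges neighboring sections at or above the chosen level (step 3) until the chunk capacity is reached.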

The boundaries used to split the text when using the chunks method, in ascending order:

TextSplitter Semantic Levels

  1. Characters
  2. Unicode Grapheme Cluster Boundaries
  3. Unicode Word Boundaries
  4. Unicode Sentence Boundaries
  5. Ascending sequence length of newlines (a newline is \r\n, \n, or \r). Each unique length of consecutive newline sequences is treated as its own semantic level, so a sequence of 2 newlines is a higher level than a sequence of 1 newline, and so on.

Splitting doesn't occur below the character level; otherwise you could get partial bytes of a char, which may not be a valid Unicode str.
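The newline-based levels can be illustrated with a small sketch (an assumption for illustration: the levels present in a text are the distinct run lengths of consecutive newlines it contains):

```python
import re

def newline_levels(text):
    # Find every run of consecutive newlines; each distinct run length
    # (1 newline, 2 newlines, ...) forms its own ascending semantic level.
    runs = re.findall(r"(?:\r\n|\r|\n)+", text)
    lengths = {len(re.findall(r"\r\n|\r|\n", run)) for run in runs}
    return sorted(lengths)
```

A text with single line breaks and blank-line paragraph breaks would yield two levels, with the paragraph breaks being the higher one.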

MarkdownSplitter Semantic Levels

Markdown is parsed according to the CommonMark spec, along with some optional features such as GitHub Flavored Markdown.

  1. Characters
  2. Unicode Grapheme Cluster Boundaries
  3. Unicode Word Boundaries
  4. Unicode Sentence Boundaries
  5. Soft line breaks (single newlines), which aren't necessarily a new element in Markdown.
  6. Inline elements such as: text nodes, emphasis, strong, strikethrough, link, image, table cells, inline code, footnote references, task list markers, and inline html.
  7. Block elements such as: paragraphs, code blocks, footnote definitions, and metadata. Also, a block quote or a row/item within a table or list that can contain other "block" type elements, and a list or table that contains items.
  8. Thematic breaks or horizontal rules.
  9. Headings by level

Splitting doesn't occur below the character level; otherwise you could get partial bytes of a char, which may not be a valid Unicode str.
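The highest level, headings, can be sketched with a regex over ATX headings (an illustration only; the crate itself parses Markdown according to the CommonMark spec rather than with regexes, and `max_level` is a hypothetical parameter):

```python
import re

def split_at_headings(md, max_level):
    # Split before each ATX heading of `max_level` or above
    # (an h1 is a higher semantic boundary than an h2, and so on).
    pattern = rf"(?m)^(?=#{{1,{max_level}}}\s)"
    return [part for part in re.split(pattern, md) if part]
```

Splitting at level 1 keeps each top-level section, including its own sub-headings, together in one piece.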

Note on sentences

There are many methods for determining sentence breaks, all with varying degrees of accuracy, and many require ML models to do so. Rather than trying to find the perfect sentence breaks, we rely on Unicode sentence boundary rules, which in most cases are good enough for finding a decent semantic breaking point if a paragraph is too large, and which avoid the performance penalties of many other methods.

Inspiration

This crate was inspired by LangChain's TextSplitter, but looking into the implementation, there was potential for better performance as well as better semantic chunking.

A big thank you to the unicode-rs team for their unicode-segmentation crate that manages a lot of the complexity of matching the Unicode rules for words and sentences.

Download files

Download the file for your platform.

Source Distribution

semantic_text_splitter-0.10.0.tar.gz (261.4 kB): Source

Built Distributions

semantic_text_splitter-0.10.0-cp38-abi3-win_amd64.whl (3.3 MB): CPython 3.8+, Windows x86-64
semantic_text_splitter-0.10.0-cp38-abi3-win32.whl (3.2 MB): CPython 3.8+, Windows x86
semantic_text_splitter-0.10.0-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.7 MB): CPython 3.8+, manylinux: glibc 2.17+ x86-64
semantic_text_splitter-0.10.0-cp38-abi3-manylinux_2_17_s390x.manylinux2014_s390x.whl (5.0 MB): CPython 3.8+, manylinux: glibc 2.17+ s390x
semantic_text_splitter-0.10.0-cp38-abi3-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl (5.0 MB): CPython 3.8+, manylinux: glibc 2.17+ ppc64le
semantic_text_splitter-0.10.0-cp38-abi3-manylinux_2_17_armv7l.manylinux2014_armv7l.whl (4.5 MB): CPython 3.8+, manylinux: glibc 2.17+ ARMv7l
semantic_text_splitter-0.10.0-cp38-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (4.6 MB): CPython 3.8+, manylinux: glibc 2.17+ ARM64
semantic_text_splitter-0.10.0-cp38-abi3-manylinux_2_12_i686.manylinux2010_i686.whl (4.7 MB): CPython 3.8+, manylinux: glibc 2.12+ i686
semantic_text_splitter-0.10.0-cp38-abi3-macosx_11_0_arm64.whl (3.5 MB): CPython 3.8+, macOS 11.0+ ARM64
semantic_text_splitter-0.10.0-cp38-abi3-macosx_10_12_x86_64.whl (3.6 MB): CPython 3.8+, macOS 10.12+ x86-64
