Chonky
Chonky is a Python library that intelligently segments text into meaningful semantic chunks using a fine-tuned transformer model. It can be used in RAG (retrieval-augmented generation) systems.
Installation
pip install chonky
Usage:
from chonky import ParagraphSplitter
# on the first run it will download the transformer model
splitter = ParagraphSplitter(device="cpu")
# Or you can select the model
# splitter = ParagraphSplitter(
# model_id="mirth/chonky_modernbert_base_1",
# device="cpu"
# )
text = (
"Before college the two main things I worked on, outside of school, were writing and programming. "
"I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. "
"My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep. "
"The first programs I tried writing were on the IBM 1401 that our school district used for what was then called 'data processing.' "
"This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, "
"and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — "
"CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights."
)
for chunk in splitter(text):
    print(chunk)
    print("--")
Sample Output
Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.
--
The first programs I tried writing were on the IBM 1401 that our school district used for what was then called 'data processing.' This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it.
--
It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights.
--
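For RAG indexing it is often convenient to store each chunk with an id and its position in the source text. The following is a minimal sketch, not part of Chonky: `ChunkRecord` and `build_records` are hypothetical names, and the code assumes that the splitter yields strings that occur in order as substrings of the input. A hard-coded list stands in for `splitter(text)` so the example runs without downloading a model.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class ChunkRecord:
    chunk_id: int
    start: int  # character offset in the source text, or -1 if not found
    end: int    # exclusive end offset, or -1 if not found
    text: str


def build_records(source: str, chunks: Iterable[str]) -> List[ChunkRecord]:
    """Attach an id and character offsets to each non-empty chunk.

    Assumption: chunks appear in order and are substrings of `source`.
    A chunk that cannot be located keeps offsets of -1 instead of raising.
    """
    records: List[ChunkRecord] = []
    cursor = 0
    for chunk in chunks:
        piece = chunk.strip()
        if not piece:
            continue
        start = source.find(piece, cursor)
        if start == -1:
            records.append(ChunkRecord(len(records), -1, -1, piece))
            continue
        end = start + len(piece)
        records.append(ChunkRecord(len(records), start, end, piece))
        cursor = end  # search for the next chunk after this one
    return records


# Stand-in for the output of `splitter(text)`.
demo_source = (
    "My stories were awful. "
    "The first programs I tried writing were on the IBM 1401."
)
demo_chunks = [
    "My stories were awful.",
    "The first programs I tried writing were on the IBM 1401.",
]
demo_records = build_records(demo_source, demo_chunks)
for record in demo_records:
    print(record.chunk_id, record.start, record.end)
```

With Chonky installed, `build_records(text, splitter(text))` would be the corresponding call, provided the substring assumption holds for the model output.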
The typical workflow is to strip all markup tags to produce plain text and then feed that text into the splitter. The helper class MarkupRemover does the stripping and automatically detects the content format:
from chonky.markup_remover import MarkupRemover
from chonky import ParagraphSplitter
remover = MarkupRemover()
splitter = ParagraphSplitter()
text = remover("# Header 1 ...")
splitter(text)
Supported formats: markdown, xml, html.
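The two steps above can be wrapped in one helper. This is a sketch under stated assumptions: `chunk_document` is a hypothetical function, and it only relies on what the snippet above shows, namely that the remover is a callable returning text and the splitter is a callable returning an iterable of chunks (assumed here to be strings). Stub callables are used so the example runs without a model download.

```python
from typing import Callable, Iterable, List


def chunk_document(
    raw: str,
    remover: Callable[[str], str],
    splitter: Callable[[str], Iterable[str]],
) -> List[str]:
    """Strip markup, split into chunks, and drop empty results."""
    plain = remover(raw)
    if not plain.strip():
        return []  # nothing left after markup removal
    return [chunk.strip() for chunk in splitter(plain) if chunk.strip()]


# Stubs for illustration only. With Chonky installed, pass
# MarkupRemover() and ParagraphSplitter() instead.
def fake_remover(raw: str) -> str:
    return raw.replace("# ", "")


def fake_splitter(plain: str) -> Iterable[str]:
    return plain.split("\n\n")


demo = chunk_document("# Header 1\n\nBody text.\n\n", fake_remover, fake_splitter)
print(demo)
```

Passing the callables in as arguments keeps the helper testable and lets the splitter, which loads the model, be created once and reused across documents.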
Supported models
| Model ID | Seq Length | Number of Params | Multilingual |
|---|---|---|---|
| mirth/chonky_modernbert_large_1 | 1024 | 396M | ❌ |
| mirth/chonky_modernbert_base_1 | 1024 | 150M | ❌ |
| mirth/chonky_mmbert_small_multilingual_1 🆕 | 1024 | 140M | ✅ |
| mirth/chonky_distilbert_base_uncased_1 | 512 | 66.4M | ❌ |
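The table can be turned into a small lookup for choosing a `model_id` programmatically. This is an illustrative sketch: `ModelInfo` and `pick_model` are hypothetical names, and the rule "largest model that fits the parameter budget" is an assumption chosen for simplicity, not a recommendation from the library.

```python
from typing import NamedTuple, Optional


class ModelInfo(NamedTuple):
    model_id: str
    seq_length: int
    params_millions: float
    multilingual: bool


# Values copied from the table above.
MODELS = [
    ModelInfo("mirth/chonky_modernbert_large_1", 1024, 396.0, False),
    ModelInfo("mirth/chonky_modernbert_base_1", 1024, 150.0, False),
    ModelInfo("mirth/chonky_mmbert_small_multilingual_1", 1024, 140.0, True),
    ModelInfo("mirth/chonky_distilbert_base_uncased_1", 512, 66.4, False),
]


def pick_model(need_multilingual: bool, max_params_millions: float) -> Optional[str]:
    """Return the largest model within the parameter budget, or None."""
    candidates = [
        m for m in MODELS
        if m.params_millions <= max_params_millions
        and (m.multilingual or not need_multilingual)
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda m: m.params_millions).model_id


print(pick_model(need_multilingual=True, max_params_millions=200))
```

The returned string can be passed as the `model_id` argument shown in the usage section.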
Benchmarks
The following values are token-based F1 scores computed on the first 1M tokens of each dataset (for performance reasons).
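As a rough illustration of what a token-based F1 score measures, the sketch below labels each token with 1 if a paragraph boundary follows it and 0 otherwise, then computes F1 over the positive labels. The exact evaluation procedure behind the tables is not described here, so this labeling scheme and the `boundary_f1` function are assumptions.

```python
from typing import Sequence


def boundary_f1(gold: Sequence[int], pred: Sequence[int]) -> float:
    """F1 over per-token boundary labels (1 = a paragraph ends after this token)."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must label the same tokens")
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    if tp == 0:
        return 0.0  # no correctly predicted boundary
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


gold_labels = [0, 0, 1, 0, 0, 1, 0, 1]
pred_labels = [0, 0, 1, 0, 1, 0, 0, 1]
score = boundary_f1(gold_labels, pred_labels)
print(round(score, 2))
```

In this toy example two of the three gold boundaries are predicted and one predicted boundary is spurious, so precision and recall are both 2/3.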
Various English datasets:
In the SaT rows, do_ps stands for the do_paragraph_segmentation flag.
| Model | bookcorpus | en_judgements | paul_graham | 20_newsgroups |
|---|---|---|---|---|
| chonkY_modernbert_large_1 | 0.79 ❗ | 0.29 ❗ | 0.69 ❗ | 0.17 |
| chonkY_modernbert_base_1 | 0.72 | 0.08 | 0.63 | 0.15 |
| chonkY_mmbert_small_multilingual_1 🆕 | 0.72 | 0.2 | 0.56 | 0.13 |
| chonkY_distilbert_base_uncased_1 | 0.69 | 0.05 | 0.52 | 0.15 |
| SaT(sat-12l-sm, do_ps=False) | 0.33 | 0.03 | 0.43 | 0.31 |
| SaT(sat-12l-sm, do_ps=True) | 0.33 | 0.06 | 0.42 | 0.3 |
| SaT(sat-3l, do_ps=False) | 0.28 | 0.03 | 0.42 | 0.34 ❗ |
| SaT(sat-3l, do_ps=True) | 0.09 | 0.07 | 0.41 | 0.15 |
| chonkIE SemanticChunker(bge-small-en-v1.5) | 0.21 | 0.01 | 0.12 | 0.06 |
| chonkIE SemanticChunker(potion-base-8M) | 0.19 | 0.01 | 0.15 | 0.08 |
| chonkIE RecursiveChunker | 0.07 | 0.01 | 0.05 | 0.02 |
| langchain SemanticChunker(all-mpnet-base-v2) | 0 | 0 | 0 | 0 |
| langchain SemanticChunker(bge-small-en-v1.5) | 0 | 0 | 0 | 0 |
| langchain SemanticChunker(potion-base-8M) | 0 | 0 | 0 | 0 |
| langchain RecursiveChar | 0 | 0 | 0 | 0 |
| llamaindex SemanticSplitter(bge-small-en-v1.5) | 0.06 | 0 | 0.06 | 0.02 |
Project Gutenberg validation:
| Model | de | en | es | fr | it | nl | pl | pt | ru | sv | zh |
|---|---|---|---|---|---|---|---|---|---|---|---|
| chonky_mmbert_small_multi_1 🆕 | 0.88 ❗ | 0.78 ❗ | 0.91 ❗ | 0.93 ❗ | 0.86 ❗ | 0.81 ❗ | 0.81 ❗ | 0.88 ❗ | 0.97 ❗ | 0.91 ❗ | 0.11 |
| chonky_modernbert_large_1 | 0.53 | 0.43 | 0.48 | 0.51 | 0.56 | 0.21 | 0.65 | 0.53 | 0.87 | 0.51 | 0.33 ❗ |
| chonky_modernbert_base_1 | 0.42 | 0.38 | 0.34 | 0.4 | 0.33 | 0.22 | 0.41 | 0.35 | 0.27 | 0.31 | 0.26 |
| chonky_distilbert_base_uncased_1 | 0.19 | 0.3 | 0.17 | 0.2 | 0.18 | 0.04 | 0.27 | 0.21 | 0.22 | 0.19 | 0.15 |
| Number of val tokens | 1M | 1M | 1M | 1M | 1M | 1M | 38K | 1M | 24K | 1M | 132K |