Smart-Chunker is a semantic chunker to prepare a long document for RAG
Project description
Smart-Chunker is a semantic chunker that prepares long documents for retrieval-augmented generation (RAG).
Unlike a conventional chunker, it does not split the text into fixed-size groups of N tokens. Instead, it uses a cross-encoder to score the similarity between neighboring sentences and splits the text at the most significant semantic transitions, i.e. at the minima of that similarity function.
Use BAAI/bge-reranker-v2-m3, or any other model that supports the AutoModelForSequenceClassification interface, as the cross-encoder.
The smart chunker supports Russian and English.
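To make the splitting idea concrete, here is a minimal sketch of cutting a text at the minima of a neighbor-similarity function. It is not the package's actual implementation or API: the function names `split_sentences`, `toy_similarity`, and `chunk_by_minima`, and the `max_boundaries` parameter, are all hypothetical. The scorer here is a simple word-overlap measure standing in for a cross-encoder so that the example runs without downloading a model.

```python
import re
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    """Naive splitter (assumption): break after '.', '!' or '?' plus whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def toy_similarity(left: str, right: str) -> float:
    """Stand-in for a cross-encoder score: Jaccard overlap of lowercase words."""
    a = set(re.findall(r"\w+", left.lower()))
    b = set(re.findall(r"\w+", right.lower()))
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def chunk_by_minima(
    sentences: List[str],
    score_pair: Callable[[str, str], float],
    max_boundaries: int = 1,
) -> List[str]:
    """Split sentences at the deepest local minima of neighbor similarity."""
    if not sentences:
        return []
    if len(sentences) < 2:
        return [sentences[0]]

    # sims[i] is the similarity between sentence i and sentence i + 1.
    sims = [score_pair(sentences[i], sentences[i + 1])
            for i in range(len(sentences) - 1)]

    # A local minimum is lower than its left neighbor and not higher than
    # its right neighbor; missing neighbors at the edges count as +infinity.
    minima = []
    for i, s in enumerate(sims):
        left = sims[i - 1] if i > 0 else float("inf")
        right = sims[i + 1] if i + 1 < len(sims) else float("inf")
        if s < left and s <= right:
            minima.append(i)

    # Keep only the deepest minima, then restore document order.
    cut_after = sorted(sorted(minima, key=lambda i: sims[i])[:max_boundaries])

    chunks, start = [], 0
    for i in cut_after:
        chunks.append(" ".join(sentences[start:i + 1]))
        start = i + 1
    chunks.append(" ".join(sentences[start:]))
    return chunks


if __name__ == "__main__":
    text = (
        "Cats purr when happy. Cats sleep most of the day. "
        "Python is a programming language. Python code runs on an interpreter."
    )
    for chunk in chunk_by_minima(split_sentences(text), toy_similarity):
        print(chunk)
```

In a real setup, `score_pair` would presumably wrap a cross-encoder such as BAAI/bge-reranker-v2-m3 loaded through the AutoModelForSequenceClassification interface; the boundary-selection logic would stay the same.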
Download files
File details
Details for the file smart_chunker-0.0.3.tar.gz.
File metadata
- Download URL: smart_chunker-0.0.3.tar.gz
- Upload date:
- Size: 3.9 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.18
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | bf9a709d5319551830908a577628c706bba7b4e865e10f403a8d93970d6e8710 |
| MD5 | 197d78040d4b2bd5bf3e95bfc079b138 |
| BLAKE2b-256 | 2f6f0582c3c6040a07d75193c4b75eb52e816d156ae5a661ec27a88319c83689 |
File details
Details for the file smart_chunker-0.0.3-py3-none-any.whl.
File metadata
- Download URL: smart_chunker-0.0.3-py3-none-any.whl
- Upload date:
- Size: 9.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.18
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | bcfc5d1b430abb74eb60de329fd5e62498a045350b5f6d5a02d2f6f1d590e17d |
| MD5 | 7781833e1128d91f33dad4204cc5d3f6 |
| BLAKE2b-256 | d9c5497b87616638e1705ae10e82efd997c16bb9cc5327b5dbe2401c86b80eb1 |