High-performance Chinese document extractor + semantic chunker
HansChunks
High-performance Chinese document extractor and semantic chunker built in Rust with Python bindings.
Features
- Intelligent Text Chunking: Splits Chinese documents into semantically meaningful chunks while preserving context
- Element Recognition: Identifies different types of document elements (headings, paragraphs, lists, code blocks, etc.)
- Semantic Boundary Preservation: Avoids splitting at poor break points, such as immediately after a colon that introduces the next line
- Heading Merging: Option to keep headings with their content
- Customizable Configuration: Adjust chunk sizes and element weights to optimize for your specific use case
- High Performance: Implemented in Rust with optimized algorithms for speed and efficiency
- Python Bindings: Easy to use from Python with a simple, intuitive API
- Advanced Algorithm: Uses dynamic programming with binary search optimization to find optimal split points, ensuring both efficiency and semantic coherence
- Context-Aware Processing: Considers document structure, element types, and semantic connections when making chunking decisions
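To make element recognition and semantic boundary preservation more concrete, here is a minimal pure-Python sketch. It is not the library's implementation (which is written in Rust); the regular expressions and the names `classify_line` and `is_poor_split_point` are hypothetical and only illustrate the idea.

```python
import re

# Hypothetical patterns for illustration; the real recognizer may differ.
HEADING_RE = re.compile(r"^(#{1,6}\s+|第[一二三四五六七八九十百零〇\d]+[章节篇部]\s*)")
LIST_RE = re.compile(r"^\s*([-*+]\s+|\d+[.、)]\s*)")
QUOTE_RE = re.compile(r"^\s*>")


def classify_line(line: str) -> str:
    """Return a coarse element type for one line of text."""
    if not line.strip():
        return "empty"
    if HEADING_RE.match(line):
        return "heading"
    if LIST_RE.match(line):
        return "list_item"
    if QUOTE_RE.match(line):
        return "quote"
    return "paragraph"


def is_poor_split_point(line: str) -> bool:
    """A line ending in a colon introduces what follows, so avoid splitting after it."""
    return line.rstrip().endswith((":", ":"))
```

A chunker built on these helpers would treat a `"heading"` line as a good place to start a new chunk and would avoid ending a chunk on a line for which `is_poor_split_point` returns `True`.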
Installation
pip install hanschunks
Quick Start
from hanschunks.hanschunks import TextChunker, ChunkConfig
# Create a chunker with default settings
chunker = TextChunker()
# Process a document
text = """第一章 引言
随着人工智能技术的快速发展,自然语言处理已经成为计算机科学中最重要的研究领域之一。
文本分块作为信息检索和知识管理的基础技术,其重要性日益凸显。
"""
chunks = chunker.chunk(text)
for i, chunk in enumerate(chunks, 1):
    print(f"Chunk {i}: {chunk[:50]}...")
Custom Configuration
# Create custom configuration
config = ChunkConfig()
config.min_size = 160 # Minimum chunk size in characters
config.max_size = 320 # Maximum chunk size in characters
config.merge_headings = True # Merge headings with following content
config.preserve_boundaries = True # Preserve semantic boundaries
# Set element weights for chunking decisions
config.set_element_weights(
    heading_base=100.0,          # Base weight for headings
    heading_level_penalty=10.0,  # Penalty per heading level
    code_block=80.0,             # Weight for code blocks
    table=80.0,                  # Weight for tables
    list_item=60.0,              # Weight for list items
    paragraph=40.0,              # Weight for paragraphs
    quote=30.0,                  # Weight for block quotes
    empty=10.0,                  # Weight for empty lines
    footer=0.0,                  # Weight for footer elements
)
# Create chunker with custom config
chunker = TextChunker(config)
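The README does not spell out how `heading_base` and `heading_level_penalty` combine. One plausible reading, shown below as an assumption rather than the library's documented behavior, is that a heading's weight drops by one penalty for each level below the top. The function `element_weight` and the table `ELEMENT_WEIGHTS` are hypothetical names used only for this sketch.

```python
# Per-type weights copied from the configuration example above.
ELEMENT_WEIGHTS = {
    "code_block": 80.0,
    "table": 80.0,
    "list_item": 60.0,
    "paragraph": 40.0,
    "quote": 30.0,
    "empty": 10.0,
    "footer": 0.0,
}


def element_weight(kind, heading_level=1, heading_base=100.0, heading_level_penalty=10.0):
    """Hypothetical weight lookup; the heading formula is an assumption."""
    if kind == "heading":
        # Assumed: each level below the top one subtracts one penalty, floored at zero.
        return max(heading_base - heading_level_penalty * (heading_level - 1), 0.0)
    return ELEMENT_WEIGHTS[kind]
```

Under this reading, a level-1 heading weighs 100.0 and a level-3 heading 80.0, so deeper headings still outrank ordinary paragraphs as split candidates.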
Algorithm
HansChunks uses an optimized dynamic programming algorithm to find the best possible split points in a document:
- Document is first parsed into semantic elements (headings, paragraphs, etc.)
- Each element is assigned a weight based on its type
- Dynamic programming with binary search optimization finds optimal split points
- Strong semantic connections are preserved (e.g., avoiding splits after colons)
- The result is a set of chunks that balance size constraints with semantic coherence
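The steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Rust implementation: the function name `best_splits`, the cost model (a lower `split_cost` marks a better place to start a new chunk) and the treatment of `min_size` as a soft limit are all choices made for this example. Binary search over prefix sums finds the window of chunk starts that satisfy the size limits, and the dynamic program picks the cheapest start inside that window.

```python
from bisect import bisect_left, bisect_right


def best_splits(lengths, split_cost, min_size, max_size):
    """Return chunk boundaries [0, ..., n] that minimize the total split cost.

    lengths[i]    : size of element i in characters
    split_cost[i] : cost of starting a new chunk at element i (index 0 is unused)
    """
    n = len(lengths)
    prefix = [0]
    for length in lengths:
        prefix.append(prefix[-1] + length)

    best = [float("inf")] * (n + 1)  # best[j]: minimal cost to chunk elements[0:j]
    prev = [-1] * (n + 1)            # prev[j]: start index of the last chunk
    best[0] = 0.0

    for j in range(1, n + 1):
        # Binary search the starts i with min_size <= prefix[j] - prefix[i] <= max_size.
        lo = bisect_left(prefix, prefix[j] - max_size, 0, j)
        hi = bisect_right(prefix, prefix[j] - min_size, 0, j) - 1
        if j == n:
            hi = j - 1       # the final chunk may be shorter than min_size
        if lo > hi:
            lo = hi = j - 1  # no valid window: let element j-1 stand alone
        for i in range(lo, hi + 1):
            cost = best[i] + (split_cost[i] if i > 0 else 0.0)
            if cost < best[j]:
                best[j], prev[j] = cost, i

    # Walk the back-pointers to recover the chosen boundaries.
    bounds = [n]
    while bounds[-1] > 0:
        bounds.append(prev[bounds[-1]])
    return bounds[::-1]


# Two sections, each a short heading followed by two paragraphs.
lengths = [10, 150, 150, 10, 150, 150]
costs = [0, 60, 60, 0, 60, 60]  # splitting before a heading is free here
print(best_splits(lengths, costs, min_size=160, max_size=320))  # [0, 3, 6]
```

In this toy input the only zero-cost solution splits before the second heading, so each chunk keeps its heading with its content. To model the colon rule, a caller could assign a very large `split_cost[i]` whenever element `i - 1` ends with a colon.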
Development
Prerequisites
- Rust toolchain (1.75+)
- Python 3.12+
- Maturin (for building Python bindings)
Building a development package
uv run maturin develop --release
uv run example/demo.py
Building from source
uv run maturin build --release --out dist
uv add dist/hanschunks-*.whl
Running tests
cargo test
License
Download files
File details
Details for the file hanschunks-0.1.0.tar.gz.
File metadata
- Download URL: hanschunks-0.1.0.tar.gz
- Upload date:
- Size: 31.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.8.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 15f2c9eefe9720418868395f1976a72d4679eb535bacf37babe3afa0cbfc0d2c |
| MD5 | 0e4cba40122420c15b54de9a7436d5a7 |
| BLAKE2b-256 | 036975918709cd8fd10d03c88cb126100f99423915421793ac811e9b1971685e |
File details
Details for the file hanschunks-0.1.0-cp312-abi3-win_amd64.whl.
File metadata
- Download URL: hanschunks-0.1.0-cp312-abi3-win_amd64.whl
- Upload date:
- Size: 2.6 MB
- Tags: CPython 3.12+, Windows x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.8.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | a36a4668cff36cf1177a369caf9e7ec468ea44a34b34cb403bf9dafda97b6538 |
| MD5 | 3abf0ba8ef6c525fe12535dd638bcc67 |
| BLAKE2b-256 | deeb40ebe2e9aeef58032452fdae23780396ab2518162a9f7de08551a69a4af1 |
File details
Details for the file hanschunks-0.1.0-cp312-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.
File metadata
- Download URL: hanschunks-0.1.0-cp312-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
- Upload date:
- Size: 2.7 MB
- Tags: CPython 3.12+, manylinux: glibc 2.17+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.8.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | a73732279bbe95a4d5bebeea6794d1a67b8e6233490b705661604667dc492cfd |
| MD5 | 4412260fd9d8671f777459c2ba4359ca |
| BLAKE2b-256 | d1be69268a5b3bddf5d71025ea37445b8ef4cf09f2279851a17556caa3429a9b |
File details
Details for the file hanschunks-0.1.0-cp312-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.
File metadata
- Download URL: hanschunks-0.1.0-cp312-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
- Upload date:
- Size: 2.7 MB
- Tags: CPython 3.12+, manylinux: glibc 2.17+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.8.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e0aa5c808383136b4fb57605fd39653d56e4441e03bbc882d57a80cab06663b7 |
| MD5 | 344a7cffec88ae9d3a7e23e0335b0eb8 |
| BLAKE2b-256 | 23af08956c32074288dccfae49849a6049d23e76a06098f7942c43f58ef5dfa8 |
File details
Details for the file hanschunks-0.1.0-cp312-abi3-macosx_11_0_arm64.whl.
File metadata
- Download URL: hanschunks-0.1.0-cp312-abi3-macosx_11_0_arm64.whl
- Upload date:
- Size: 2.6 MB
- Tags: CPython 3.12+, macOS 11.0+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.8.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | eddde19de56be270d6966a7c6bf87eebea27ded464155943dce4a288c16f699c |
| MD5 | 9462a6f0650dfc3ac0b0c161f2d55939 |
| BLAKE2b-256 | d08698b36289be3dba2152e738c42839ec63bc449ad9b389b49a24dadaf9873f |
File details
Details for the file hanschunks-0.1.0-cp312-abi3-macosx_10_12_x86_64.whl.
File metadata
- Download URL: hanschunks-0.1.0-cp312-abi3-macosx_10_12_x86_64.whl
- Upload date:
- Size: 2.7 MB
- Tags: CPython 3.12+, macOS 10.12+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.8.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3a24376d9661201fbb74b8ba8d673a8afa8256960f67b581060777ad048a01ae |
| MD5 | 180f33d6a2423235361dc86b90d08323 |
| BLAKE2b-256 | 636bd9d32ab893e8cebfd3b46567835c9eb5a8082f7e126cf126f0e7f83a6329 |