HansChunks

High-performance Chinese document extractor and semantic chunker built in Rust with Python bindings.

Features

  • Intelligent Text Chunking: Splits Chinese documents into semantically meaningful chunks while preserving context
  • Element Recognition: Identifies different types of document elements (headings, paragraphs, lists, code blocks, etc.)
  • Semantic Boundary Preservation: Avoids splitting at poor break points, such as immediately after a colon
  • Heading Merging: Optionally keeps headings together with the content that follows them
  • Customizable Configuration: Adjust chunk sizes and element weights to optimize for your specific use case
  • High Performance: Implemented in Rust with optimized algorithms for speed and efficiency
  • Python Bindings: Easy to use from Python with a simple, intuitive API
  • Advanced Algorithm: Uses dynamic programming with binary search optimization to find optimal split points, ensuring both efficiency and semantic coherence
  • Context-Aware Processing: Considers document structure, element types, and semantic connections when making chunking decisions

Installation

pip install hanschunks

Quick Start

from hanschunks.hanschunks import TextChunker, ChunkConfig

# Create a chunker with default settings
chunker = TextChunker()

# Process a document
text = """第一章 引言

随着人工智能技术的快速发展,自然语言处理已经成为计算机科学中最重要的研究领域之一。
文本分块作为信息检索和知识管理的基础技术,其重要性日益凸显。
"""

chunks = chunker.chunk(text)
for i, chunk in enumerate(chunks, 1):
    print(f"Chunk {i}: {chunk[:50]}...")

Custom Configuration

from hanschunks.hanschunks import TextChunker, ChunkConfig

# Create custom configuration
config = ChunkConfig()
config.min_size = 160             # Minimum chunk size in characters
config.max_size = 320             # Maximum chunk size in characters
config.merge_headings = True      # Keep headings with the content that follows
config.preserve_boundaries = True # Preserve semantic boundaries

# Set element weights for chunking decisions
config.set_element_weights(
    heading_base=100.0,       # Base weight for headings
    heading_level_penalty=10.0, # Penalty per heading level
    code_block=80.0,          # Weight for code blocks
    table=80.0,               # Weight for tables
    list_item=60.0,           # Weight for list items
    paragraph=40.0,           # Weight for paragraphs
    quote=30.0,               # Weight for block quotes
    empty=10.0,               # Weight for empty lines
    footer=0.0                # Weight for footer elements
)

# Create chunker with custom config
chunker = TextChunker(config)
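
To make the role of the weights more concrete, here is a minimal pure-Python sketch of how a per-element weight could be derived from these settings. It does not use HansChunks and is not its implementation: the `ElementWeights` class, the `element_weight` function, and in particular the formula "base minus penalty times (level - 1)" are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ElementWeights:
    """Hypothetical container mirroring the arguments of set_element_weights."""
    heading_base: float = 100.0
    heading_level_penalty: float = 10.0
    code_block: float = 80.0
    table: float = 80.0
    list_item: float = 60.0
    paragraph: float = 40.0
    quote: float = 30.0
    empty: float = 10.0
    footer: float = 0.0


def element_weight(weights: ElementWeights, kind: str, heading_level: int = 1) -> float:
    """Return the weight for one element.

    Assumption: a level-1 heading gets the full base weight and every deeper
    level subtracts one penalty, never dropping below zero.
    """
    if kind == "heading":
        reduced = weights.heading_base - weights.heading_level_penalty * (heading_level - 1)
        return max(reduced, 0.0)
    # Every other element kind maps directly to the field of the same name.
    return getattr(weights, kind)


if __name__ == "__main__":
    w = ElementWeights()
    for level in (1, 2, 3):
        print(f"heading level {level}: {element_weight(w, 'heading', level)}")
    print(f"paragraph: {element_weight(w, 'paragraph')}")
```

Under this reading, a higher weight marks an element as a stronger candidate for starting a new chunk, so top-level headings attract splits more than paragraphs or empty lines do.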

Algorithm

HansChunks uses an optimized dynamic programming algorithm to find the best possible split points in a document:

  1. The document is first parsed into semantic elements (headings, paragraphs, etc.).
  2. Each element is assigned a weight based on its type.
  3. Dynamic programming with binary search optimization finds the optimal split points.
  4. Strong semantic connections are preserved (e.g., by avoiding splits after colons).
  5. The result is a set of chunks that balances size constraints with semantic coherence.
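
The steps above can be sketched in plain Python. This is an illustration of the general technique, not the Rust implementation inside HansChunks: the `split_elements` function, the scoring rule (a split placed before an element earns that element's weight), and the choice to let the final chunk fall below `min_size` are all assumptions. Because the prefix sums of element lengths are sorted, two binary searches locate the window of previous split points that keep a chunk within the size limits, and the dynamic program only scans that window.

```python
from bisect import bisect_left, bisect_right
from itertools import accumulate

COLONS = (":", ":")  # ASCII and full-width colon


def split_elements(elements, min_size, max_size):
    """Group (text, weight) elements into chunks of min_size..max_size characters.

    Hypothetical scoring: a split placed before an element earns that
    element's weight, and the total earned weight is maximized.
    Sizes count element characters only, not the joining newlines.
    """
    n = len(elements)
    if n == 0:
        return []
    # prefix[k] = total characters in the first k elements (non-decreasing).
    prefix = [0, *accumulate(len(text) for text, _ in elements)]

    best = [float("-inf")] * (n + 1)  # best[j]: top score with a split at boundary j
    back = [-1] * (n + 1)             # back[j]: previous boundary on that best path
    best[0] = 0.0

    for j in range(1, n + 1):
        if j < n and elements[j - 1][0].rstrip().endswith(COLONS):
            continue  # never split right after a colon
        gain = elements[j][1] if j < n else 0.0
        shortest = min_size if j < n else 0  # assumption: the final chunk may be short
        # Two binary searches find the contiguous window of valid previous boundaries.
        lo = bisect_left(prefix, prefix[j] - max_size)
        hi = min(bisect_right(prefix, prefix[j] - shortest), j)
        for i in range(lo, hi):
            if best[i] + gain > best[j]:
                best[j], back[j] = best[i] + gain, i

    if back[n] == -1:
        raise ValueError("no split satisfies the size limits")

    # Walk the back-pointers from the end of the document to recover the boundaries.
    bounds = [n]
    while bounds[-1] > 0:
        bounds.append(back[bounds[-1]])
    bounds.reverse()
    return ["\n".join(text for text, _ in elements[a:b])
            for a, b in zip(bounds, bounds[1:])]


if __name__ == "__main__":
    doc = [
        ("第一章 引言", 100.0),
        ("文本分块的用途如下:", 40.0),
        ("信息检索与知识管理。", 40.0),
        ("第二章 方法", 100.0),
        ("动态规划用于寻找最优切分点。", 40.0),
    ]
    for chunk in split_elements(doc, min_size=10, max_size=40):
        print(repr(chunk))
```

In this sketch the colon rule is a hard constraint; a real chunker might instead apply a penalty so that a split after a colon remains possible when nothing else fits.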

Development

Prerequisites

  • Rust toolchain (1.75+)
  • Python 3.12+
  • Maturin (for building Python bindings)
  • uv (used to run the build commands below)

Building a development package

uv run maturin develop --release
uv run example/demo.py

Building from source

uv run maturin build --release --out dist 
uv add dist/hanschunks-*.whl

Running tests

cargo test

License

Apache 2.0
