Bunsetsu (文節)
Japanese-optimized semantic text chunking for RAG applications.
Unlike general-purpose text splitters, Bunsetsu understands Japanese text structure—no spaces between words, particles that bind phrases, and sentence patterns that differ from English. This results in more coherent chunks and better retrieval accuracy for Japanese RAG systems.
Why Bunsetsu?
| Feature | Generic Splitters | Bunsetsu |
|---|---|---|
| Japanese word boundaries | ❌ Breaks mid-word | ✅ Respects morphology |
| Particle handling | ❌ Splits は/が from nouns | ✅ Keeps phrases intact |
| Sentence detection | ⚠️ Basic (。only) | ✅ Full (。!? etc.) |
| Topic boundaries | ❌ Ignores | ✅ Detects は/が patterns |
| Dependencies | Heavy | Zero by default |
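For example, a plain character-window splitter can cut between a noun and its topic particle (東京 | は), while Bunsetsu keeps the phrase intact and breaks at the topic shift instead. A minimal sketch using the chunk_text API from Quick Start below; the tiny chunk_size is only there to force a split:
from bunsetsu import chunk_text
text = "東京は日本の首都です。大阪は商業の中心地です。"
# Semantic chunking should break at the topic shift (大阪は...),
# never between a noun and the particle that follows it.
chunks = chunk_text(text, strategy="semantic", chunk_size=20)
for chunk in chunks:
    print(repr(chunk.text))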
Installation
# Basic installation (zero dependencies)
pip install bunsetsu
# With MeCab tokenizer (higher accuracy)
pip install bunsetsu[mecab]
# With Sudachi tokenizer (multiple granularity modes)
pip install bunsetsu[sudachi]
# All tokenizers
pip install bunsetsu[all]
Quick Start
from bunsetsu import chunk_text
text = """
人工知能の発展は目覚ましいものがあります。
特に大規模言語モデルの登場により、自然言語処理の分野は大きく変わりました。
"""
# Simple semantic chunking
chunks = chunk_text(text, strategy="semantic", chunk_size=200)
for chunk in chunks:
    print(f"[{chunk.char_count} chars] {chunk.text[:50]}...")
Chunking Strategies
1. Semantic Chunking (Recommended for RAG)
Splits text based on meaning and topic boundaries:
from bunsetsu import SemanticChunker
chunker = SemanticChunker(
    min_chunk_size=100,
    max_chunk_size=500,
)
chunks = chunker.chunk(text)
2. Fixed-Size with Sentence Awareness
Character-based splitting that respects sentence boundaries:
from bunsetsu import FixedSizeChunker
chunker = FixedSizeChunker(
    chunk_size=500,
    chunk_overlap=50,
    respect_sentences=True,  # Don't break mid-sentence
)
chunks = chunker.chunk(text)
3. Recursive (Document Structure)
Splits hierarchically by headings, paragraphs, sentences, then clauses:
from bunsetsu import RecursiveChunker
chunker = RecursiveChunker(
    chunk_size=500,
    chunk_overlap=50,
)
chunks = chunker.chunk(markdown_text)
Tokenizer Backends
SimpleTokenizer (Default)
Regex-based, zero dependencies. Good for most use cases:
from bunsetsu import SimpleTokenizer
tokenizer = SimpleTokenizer()
tokens = tokenizer.tokenize("日本語のテキスト")
MeCabTokenizer (High Accuracy)
Uses MeCab via fugashi for proper morphological analysis:
from bunsetsu import MeCabTokenizer, SemanticChunker
tokenizer = MeCabTokenizer()
chunker = SemanticChunker(tokenizer=tokenizer)
SudachiTokenizer (Flexible Granularity)
Supports three tokenization modes (A/B/C):
from bunsetsu import SudachiTokenizer
# Mode C: Longest unit (compound words kept together)
tokenizer = SudachiTokenizer(mode="C")
# Mode A: Shortest unit (fine-grained)
tokenizer = SudachiTokenizer(mode="A")
Framework Integrations
LangChain
from bunsetsu.integrations import LangChainTextSplitter
from langchain.schema import Document
splitter = LangChainTextSplitter(
    strategy="semantic",
    chunk_size=500,
)
# Split plain text
chunks = splitter.split_text(text)
# Split Documents
docs = [Document(page_content=text, metadata={"source": "file.txt"})]
split_docs = splitter.split_documents(docs)
LlamaIndex
from bunsetsu.integrations import LlamaIndexNodeParser
parser = LlamaIndexNodeParser(
    strategy="semantic",
    chunk_size=500,
)
nodes = parser.get_nodes_from_documents(documents)
API Reference
chunk_text()
Convenience function for quick chunking:
chunks = chunk_text(
    text,
    strategy="semantic",         # "fixed", "semantic", or "recursive"
    chunk_size=500,              # Target chunk size
    chunk_overlap=50,            # Overlap between chunks
    tokenizer_backend="simple",  # "simple", "mecab", or "sudachi"
)
Chunk Object
chunk.text # The chunk content
chunk.start_char # Start position in original text
chunk.end_char # End position in original text
chunk.char_count # Number of characters
chunk.metadata # Additional metadata dict
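A quick sanity check of these fields (a minimal sketch, assuming start_char and end_char index directly into the original string):
from bunsetsu import chunk_text
chunks = chunk_text(text, strategy="semantic", chunk_size=200)
for chunk in chunks:
    # Each chunk should map back onto its source span
    assert text[chunk.start_char:chunk.end_char] == chunk.text
    print(f"{chunk.start_char}-{chunk.end_char}: {chunk.char_count} chars")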
Token Object
token.surface # Surface form (as written)
token.token_type # TokenType enum (NOUN, VERB, PARTICLE, etc.)
token.reading # Reading (if available)
token.base_form # Dictionary form (if available)
token.is_content_word # True for nouns, verbs, adjectives
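These fields make simple keyword filtering straightforward. A short sketch using the default SimpleTokenizer:
from bunsetsu import SimpleTokenizer
tokenizer = SimpleTokenizer()
tokens = tokenizer.tokenize("大規模言語モデルは自然言語処理を変えました。")
# Keep nouns, verbs, and adjectives; drop particles such as は and を
content_words = [t.surface for t in tokens if t.is_content_word]
print(content_words)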
Performance
Benchmarked on a 100KB Japanese document:
| Chunker | Time | Chunks | Avg Size |
|---|---|---|---|
| FixedSizeChunker | 12ms | 203 | 492 chars |
| SemanticChunker (simple) | 45ms | 187 | 534 chars |
| SemanticChunker (mecab) | 89ms | 192 | 521 chars |
| RecursiveChunker | 23ms | 198 | 505 chars |
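These figures will vary with hardware and corpus. A rough sketch for reproducing the comparison on your own document (the corpus.txt path is a placeholder):
import time
from bunsetsu import FixedSizeChunker, RecursiveChunker, SemanticChunker
document = open("corpus.txt", encoding="utf-8").read()
for chunker in (
    FixedSizeChunker(chunk_size=500, chunk_overlap=50),
    SemanticChunker(min_chunk_size=100, max_chunk_size=500),
    RecursiveChunker(chunk_size=500, chunk_overlap=50),
):
    start = time.perf_counter()
    chunks = chunker.chunk(document)
    elapsed_ms = (time.perf_counter() - start) * 1000
    avg = sum(c.char_count for c in chunks) / len(chunks)
    print(f"{type(chunker).__name__}: {elapsed_ms:.0f}ms, {len(chunks)} chunks, avg {avg:.0f} chars")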
Design Philosophy
- Japanese-first: Built specifically for Japanese text, not adapted from English
- Zero dependencies by default: Works out of the box, optional backends for accuracy
- RAG-optimized: Chunks designed for embedding and retrieval, not just display
- Framework-agnostic: Core library works standalone, integrations provided separately
Contributing
Contributions are welcome! Please check CONTRIBUTING.md for guidelines.
# Development setup
git clone https://github.com/YUALAB/bunsetsu.git
cd bunsetsu
pip install -e ".[dev]"
# Run tests
pytest
# Run linter
ruff check src/
License
MIT License - see LICENSE for details.
About
Developed by YUA LAB (AQUA LLC), Tokyo.
We build AI agents and RAG systems for enterprise. This library powers our production RAG deployments.
- Website: aquallc.jp
- AI Assistant: YUA
- Contact: desk@aquallc.jp