
Bunsetsu (文節)

PyPI version License: MIT Python 3.9+

Japanese-optimized semantic text chunking for RAG applications.

Unlike general-purpose text splitters, Bunsetsu understands Japanese text structure—no spaces between words, particles that bind phrases, and sentence patterns that differ from English. This results in more coherent chunks and better retrieval accuracy for Japanese RAG systems.

Why Bunsetsu?

| Feature | Generic Splitters | Bunsetsu |
| --- | --- | --- |
| Japanese word boundaries | ❌ Breaks mid-word | ✅ Respects morphology |
| Particle handling | ❌ Splits は/が from nouns | ✅ Keeps phrases intact |
| Sentence detection | ⚠️ Basic (。 only) | ✅ Full (。!?、 etc.) |
| Topic boundaries | ❌ Ignores | ✅ Detects は/が patterns |
| Dependencies | Heavy | Zero by default |
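
The particle-handling row is worth a concrete illustration. The toy comparison below is plain Python, not Bunsetsu's actual algorithm: a naive fixed-width cut severs the topic particle は from its noun, while even a crude particle-aware rule keeps the phrase intact.

```python
text = "人工知能は急速に発展しています。"

# Naive: cut every 4 characters, ignoring word and phrase boundaries.
naive = [text[i:i + 4] for i in range(0, len(text), 4)]
# naive[0] is 人工知能 and naive[1] begins with the orphaned particle は.

# Toy particle-aware rule: if a cut would leave a particle stranded at
# the start of the next chunk, extend the current chunk to absorb it.
PARTICLES = set("はがをにでとへも")
chunks, start = [], 0
while start < len(text):
    end = min(start + 4, len(text))
    while end < len(text) and text[end] in PARTICLES:
        end += 1  # keep the particle attached to its phrase
    chunks.append(text[start:end])
    start = end
# chunks[0] is now the intact phrase 人工知能は.
```

Real morphological analysis (MeCab, Sudachi) goes much further, but this is the failure mode the table is pointing at.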

Installation

```bash
# Basic installation (zero dependencies)
pip install bunsetsu

# With MeCab tokenizer (higher accuracy)
pip install bunsetsu[mecab]

# With Sudachi tokenizer (multiple granularity modes)
pip install bunsetsu[sudachi]

# All tokenizers
pip install bunsetsu[all]
```

Quick Start

```python
from bunsetsu import chunk_text

text = """
人工知能の発展は目覚ましいものがあります。
特に大規模言語モデルの登場により、自然言語処理の分野は大きく変わりました。
"""

# Simple semantic chunking
chunks = chunk_text(text, strategy="semantic", chunk_size=200)

for chunk in chunks:
    print(f"[{chunk.char_count} chars] {chunk.text[:50]}...")
```

Chunking Strategies

1. Semantic Chunking (Recommended for RAG)

Splits text based on meaning and topic boundaries:

```python
from bunsetsu import SemanticChunker

chunker = SemanticChunker(
    min_chunk_size=100,
    max_chunk_size=500,
)

chunks = chunker.chunk(text)
```
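
As a rough illustration of the idea (not Bunsetsu's actual heuristics): split into sentences, then open a new chunk whenever a sentence signals a new topic with an early は-marked phrase.

```python
import re

def toy_semantic_chunks(text: str) -> list[str]:
    """Toy topic-boundary chunker: a new chunk starts at sentences
    that open with a short noun phrase followed by the topic marker は."""
    # Split after Japanese sentence terminators, keeping them attached.
    sentences = [s for s in re.split(r"(?<=[。!?！？])", text) if s.strip()]
    groups: list[list[str]] = []
    for sent in sentences:
        opens_topic = re.match(r"^.{1,8}?は", sent) is not None
        if not groups or opens_topic:
            groups.append([sent])        # topic shift: start a new chunk
        else:
            groups[-1].append(sent)      # continuation: extend current chunk
    return ["".join(g) for g in groups]

text = "猫は可愛い動物です。毛づくろいをよくします。犬は忠実な動物です。"
print(toy_semantic_chunks(text))
```

Here the middle sentence has no topic marker, so it stays with the 猫 chunk, and 犬は opens a new one.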

2. Fixed-Size with Sentence Awareness

Character-based splitting that respects sentence boundaries:

```python
from bunsetsu import FixedSizeChunker

chunker = FixedSizeChunker(
    chunk_size=500,
    chunk_overlap=50,
    respect_sentences=True,  # Don't break mid-sentence
)

chunks = chunker.chunk(text)
```
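
The algorithm can be sketched in a few lines. This is an illustrative simplification (overlap here is measured in whole sentences, and the real FixedSizeChunker may differ): accumulate sentences until the size budget would be exceeded, then flush and carry the last sentence forward as overlap.

```python
import re

def toy_fixed_chunks(text: str, chunk_size: int = 20, overlap: int = 1) -> list[str]:
    """Sentence-aware fixed-size chunking, with overlap counted in sentences."""
    sentences = [s for s in re.split(r"(?<=。)", text) if s]
    chunks, current = [], []
    for sent in sentences:
        if current and sum(map(len, current)) + len(sent) > chunk_size:
            chunks.append("".join(current))
            current = current[-overlap:]  # carry trailing sentences forward
        current.append(sent)
    if current:
        chunks.append("".join(current))
    return chunks

print(toy_fixed_chunks("短い文です。もう一つの文です。三つ目の文です。"))
```

Because flushing happens only at sentence boundaries, no chunk ever ends mid-sentence, at the cost of chunks slightly over or under the target size.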

3. Recursive (Document Structure)

Splits hierarchically by headings, paragraphs, sentences, then clauses:

```python
from bunsetsu import RecursiveChunker

chunker = RecursiveChunker(
    chunk_size=500,
    chunk_overlap=50,
)

chunks = chunker.chunk(markdown_text)
```
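
The hierarchical idea can be sketched as follows (an illustration, not the library's implementation): try the coarsest separator first, and recurse with the next-finer separator on any piece that is still too large.

```python
import re

# Coarse to fine: markdown headings, then paragraphs, then sentences.
SEPARATORS = [r"\n(?=#)", r"\n\n", r"(?<=。)"]

def toy_recursive_split(text: str, chunk_size: int = 30, level: int = 0) -> list[str]:
    # Base case: small enough, or no finer separator left to try.
    if len(text) <= chunk_size or level >= len(SEPARATORS):
        return [text]
    pieces = [p for p in re.split(SEPARATORS[level], text) if p.strip()]
    out = []
    for p in pieces:
        out.extend(toy_recursive_split(p, chunk_size, level + 1))
    return out

print(toy_recursive_split("一つ目の文です。二つ目の文です。三つ目の文です。", chunk_size=10))
```

A real implementation would also merge adjacent small pieces back up toward `chunk_size` and apply overlap; this sketch shows only the top-down recursion.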

Tokenizer Backends

SimpleTokenizer (Default)

Regex-based, zero dependencies. Good for most use cases:

```python
from bunsetsu import SimpleTokenizer

tokenizer = SimpleTokenizer()
tokens = tokenizer.tokenize("日本語のテキスト")
```
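
Without a dictionary, a regex tokenizer can still find useful boundaries by segmenting on script changes (kanji, hiragana, katakana, Latin). The sketch below shows the general idea; SimpleTokenizer's actual rules may differ.

```python
import re

# One token per run of characters from the same script.
SCRIPT_RUNS = re.compile(
    r"[\u4e00-\u9fff]+"    # kanji run
    r"|[\u3040-\u309f]+"   # hiragana run
    r"|[\u30a0-\u30ff]+"   # katakana run
    r"|[A-Za-z0-9]+"       # Latin letters / digits
)

tokens = SCRIPT_RUNS.findall("日本語のテキスト")
print(tokens)  # ['日本語', 'の', 'テキスト']
```

Script boundaries correlate well with word boundaries in Japanese (content words are often kanji or katakana, particles hiragana), which is why a zero-dependency tokenizer is workable at all.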

MeCabTokenizer (High Accuracy)

Uses MeCab via fugashi for proper morphological analysis:

```python
from bunsetsu import MeCabTokenizer, SemanticChunker

tokenizer = MeCabTokenizer()
chunker = SemanticChunker(tokenizer=tokenizer)
```

SudachiTokenizer (Flexible Granularity)

Supports three tokenization modes (A/B/C):

```python
from bunsetsu import SudachiTokenizer

# Mode C: Longest unit (compound words kept together)
tokenizer = SudachiTokenizer(mode="C")

# Mode A: Shortest unit (fine-grained)
tokenizer = SudachiTokenizer(mode="A")
```

Framework Integrations

LangChain

```python
from bunsetsu.integrations import LangChainTextSplitter
from langchain.schema import Document

splitter = LangChainTextSplitter(
    strategy="semantic",
    chunk_size=500,
)

# Split plain text
chunks = splitter.split_text(text)

# Split Documents
docs = [Document(page_content=text, metadata={"source": "file.txt"})]
split_docs = splitter.split_documents(docs)
```

LlamaIndex

```python
from bunsetsu.integrations import LlamaIndexNodeParser

parser = LlamaIndexNodeParser(
    strategy="semantic",
    chunk_size=500,
)

nodes = parser.get_nodes_from_documents(documents)
```

API Reference

chunk_text()

Convenience function for quick chunking:

```python
chunks = chunk_text(
    text,
    strategy="semantic",         # "fixed", "semantic", or "recursive"
    chunk_size=500,              # Target chunk size
    chunk_overlap=50,            # Overlap between chunks
    tokenizer_backend="simple",  # "simple", "mecab", or "sudachi"
)
```

Chunk Object

```python
chunk.text        # The chunk content
chunk.start_char  # Start position in original text
chunk.end_char    # End position in original text
chunk.char_count  # Number of characters
chunk.metadata    # Additional metadata dict
```
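
For reference, the shape described above can be modeled as a small dataclass. This is a hypothetical reconstruction (assuming `char_count` is derived from `text`), not the library's actual class.

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    """Minimal model of the Chunk attributes listed above."""
    text: str
    start_char: int   # start offset in the original document
    end_char: int     # end offset in the original document
    metadata: dict = field(default_factory=dict)

    @property
    def char_count(self) -> int:
        # Derived rather than stored, so it can never drift from text.
        return len(self.text)

c = Chunk(text="人工知能の発展。", start_char=0, end_char=8)
print(c.char_count)  # 8
```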

Token Object

```python
token.surface          # Surface form (as written)
token.token_type       # TokenType enum (NOUN, VERB, PARTICLE, etc.)
token.reading          # Reading (if available)
token.base_form        # Dictionary form (if available)
token.is_content_word  # True for nouns, verbs, adjectives
```

Performance

Benchmarked on a 100KB Japanese document:

| Chunker | Time | Chunks | Avg Size |
| --- | --- | --- | --- |
| FixedSizeChunker | 12ms | 203 | 492 chars |
| SemanticChunker (simple) | 45ms | 187 | 534 chars |
| SemanticChunker (mecab) | 89ms | 192 | 521 chars |
| RecursiveChunker | 23ms | 198 | 505 chars |

Design Philosophy

  1. Japanese-first: Built specifically for Japanese text, not adapted from English
  2. Zero dependencies by default: Works out of the box, optional backends for accuracy
  3. RAG-optimized: Chunks designed for embedding and retrieval, not just display
  4. Framework-agnostic: Core library works standalone, integrations provided separately

Contributing

Contributions are welcome! Please check CONTRIBUTING.md for guidelines.

```bash
# Development setup
git clone https://github.com/YUALAB/bunsetsu.git
cd bunsetsu
pip install -e ".[dev]"

# Run tests
pytest

# Run linter
ruff check src/
```

License

MIT License - see LICENSE for details.

About

Developed by YUA LAB (AQUA LLC), Tokyo.

We build AI agents and RAG systems for enterprise. This library powers our production RAG deployments.
