
Bunsetsu (文節)

PyPI version License: MIT Python 3.9+

Japanese-optimized semantic text chunking for RAG applications.

Unlike general-purpose text splitters, Bunsetsu understands Japanese text structure—no spaces between words, particles that bind phrases, and sentence patterns that differ from English. This results in more coherent chunks and better retrieval accuracy for Japanese RAG systems.

Why Bunsetsu?

| Feature | Generic Splitters | Bunsetsu |
| --- | --- | --- |
| Japanese word boundaries | ❌ Breaks mid-word | ✅ Respects morphology |
| Particle handling | ❌ Splits は/が from nouns | ✅ Keeps phrases intact |
| Sentence detection | ⚠️ Basic (。 only) | ✅ Full (。!?、 etc.) |
| Topic boundaries | ❌ Ignores | ✅ Detects は/が patterns |
| Dependencies | Heavy | Zero by default |
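
The particle-handling row is worth a concrete illustration. The toy comparison below is plain Python, not Bunsetsu's actual algorithm: a naive fixed-width cut severs the topic particle は from its noun, while even a crude particle-aware rule keeps the phrase intact.

```python
text = "人工知能は急速に発展しています。"

# Naive: cut every 4 characters, ignoring word and phrase boundaries.
naive = [text[i:i + 4] for i in range(0, len(text), 4)]
# naive[0] is 人工知能 and naive[1] begins with the orphaned particle は.

# Toy particle-aware rule: if a cut would leave a particle stranded at
# the start of the next chunk, extend the current chunk to absorb it.
PARTICLES = set("はがをにでとへも")
chunks, start = [], 0
while start < len(text):
    end = min(start + 4, len(text))
    while end < len(text) and text[end] in PARTICLES:
        end += 1  # keep the particle attached to its phrase
    chunks.append(text[start:end])
    start = end
# chunks[0] is now the intact phrase 人工知能は.
```

Real morphological analysis (MeCab, Sudachi) goes much further, but this is the failure mode the table is pointing at.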

Installation

```bash
# Basic installation (zero dependencies)
pip install bunsetsu

# With MeCab tokenizer (higher accuracy)
pip install bunsetsu[mecab]

# With Sudachi tokenizer (multiple granularity modes)
pip install bunsetsu[sudachi]

# All tokenizers
pip install bunsetsu[all]
```

Quick Start

```python
from bunsetsu import chunk_text

text = """
人工知能の発展は目覚ましいものがあります。
特に大規模言語モデルの登場により、自然言語処理の分野は大きく変わりました。
"""

# Simple semantic chunking
chunks = chunk_text(text, strategy="semantic", chunk_size=200)

for chunk in chunks:
    print(f"[{chunk.char_count} chars] {chunk.text[:50]}...")
```

Chunking Strategies

1. Semantic Chunking (Recommended for RAG)

Splits text based on meaning and topic boundaries:

```python
from bunsetsu import SemanticChunker

chunker = SemanticChunker(
    min_chunk_size=100,
    max_chunk_size=500,
)

chunks = chunker.chunk(text)
```
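
As a rough illustration of the idea (not Bunsetsu's actual heuristics): split into sentences, then open a new chunk whenever a sentence signals a new topic with an early は-marked phrase.

```python
import re

def toy_semantic_chunks(text: str) -> list[str]:
    """Toy topic-boundary chunker: a new chunk starts at sentences
    that open with a short noun phrase followed by the topic marker は."""
    # Split after Japanese sentence terminators, keeping them attached.
    sentences = [s for s in re.split(r"(?<=[。!?！？])", text) if s.strip()]
    groups: list[list[str]] = []
    for sent in sentences:
        opens_topic = re.match(r"^.{1,8}?は", sent) is not None
        if not groups or opens_topic:
            groups.append([sent])        # topic shift: start a new chunk
        else:
            groups[-1].append(sent)      # continuation: extend current chunk
    return ["".join(g) for g in groups]

text = "猫は可愛い動物です。毛づくろいをよくします。犬は忠実な動物です。"
print(toy_semantic_chunks(text))
```

Here the middle sentence has no topic marker, so it stays with the 猫 chunk, and 犬は opens a new one.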

2. Fixed-Size with Sentence Awareness

Character-based splitting that respects sentence boundaries:

```python
from bunsetsu import FixedSizeChunker

chunker = FixedSizeChunker(
    chunk_size=500,
    chunk_overlap=50,
    respect_sentences=True,  # Don't break mid-sentence
)

chunks = chunker.chunk(text)
```
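
The algorithm can be sketched in a few lines. This is an illustrative simplification (overlap here is measured in whole sentences, and the real FixedSizeChunker may differ): accumulate sentences until the size budget would be exceeded, then flush and carry the last sentence forward as overlap.

```python
import re

def toy_fixed_chunks(text: str, chunk_size: int = 20, overlap: int = 1) -> list[str]:
    """Sentence-aware fixed-size chunking, with overlap counted in sentences."""
    sentences = [s for s in re.split(r"(?<=。)", text) if s]
    chunks, current = [], []
    for sent in sentences:
        if current and sum(map(len, current)) + len(sent) > chunk_size:
            chunks.append("".join(current))
            current = current[-overlap:]  # carry trailing sentences forward
        current.append(sent)
    if current:
        chunks.append("".join(current))
    return chunks

print(toy_fixed_chunks("短い文です。もう一つの文です。三つ目の文です。"))
```

Because flushing happens only at sentence boundaries, no chunk ever ends mid-sentence, at the cost of chunks slightly over or under the target size.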

3. Recursive (Document Structure)

Splits hierarchically by headings, paragraphs, sentences, then clauses:

```python
from bunsetsu import RecursiveChunker

chunker = RecursiveChunker(
    chunk_size=500,
    chunk_overlap=50,
)

chunks = chunker.chunk(markdown_text)
```
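
The hierarchical idea can be sketched as follows (an illustration, not the library's implementation): try the coarsest separator first, and recurse with the next-finer separator on any piece that is still too large.

```python
import re

# Coarse to fine: markdown headings, then paragraphs, then sentences.
SEPARATORS = [r"\n(?=#)", r"\n\n", r"(?<=。)"]

def toy_recursive_split(text: str, chunk_size: int = 30, level: int = 0) -> list[str]:
    # Base case: small enough, or no finer separator left to try.
    if len(text) <= chunk_size or level >= len(SEPARATORS):
        return [text]
    pieces = [p for p in re.split(SEPARATORS[level], text) if p.strip()]
    out = []
    for p in pieces:
        out.extend(toy_recursive_split(p, chunk_size, level + 1))
    return out

print(toy_recursive_split("一つ目の文です。二つ目の文です。三つ目の文です。", chunk_size=10))
```

A real implementation would also merge adjacent small pieces back up toward `chunk_size` and apply overlap; this sketch shows only the top-down recursion.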

Tokenizer Backends

SimpleTokenizer (Default)

Regex-based, zero dependencies. Good for most use cases:

```python
from bunsetsu import SimpleTokenizer

tokenizer = SimpleTokenizer()
tokens = tokenizer.tokenize("日本語のテキスト")
```
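
Without a dictionary, a regex tokenizer can still find useful boundaries by segmenting on script changes (kanji, hiragana, katakana, Latin). The sketch below shows the general idea; SimpleTokenizer's actual rules may differ.

```python
import re

# One token per run of characters from the same script.
SCRIPT_RUNS = re.compile(
    r"[\u4e00-\u9fff]+"    # kanji run
    r"|[\u3040-\u309f]+"   # hiragana run
    r"|[\u30a0-\u30ff]+"   # katakana run
    r"|[A-Za-z0-9]+"       # Latin letters / digits
)

tokens = SCRIPT_RUNS.findall("日本語のテキスト")
print(tokens)  # ['日本語', 'の', 'テキスト']
```

Script boundaries correlate well with word boundaries in Japanese (content words are often kanji or katakana, particles hiragana), which is why a zero-dependency tokenizer is workable at all.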

MeCabTokenizer (High Accuracy)

Uses MeCab via fugashi for proper morphological analysis:

```python
from bunsetsu import MeCabTokenizer, SemanticChunker

tokenizer = MeCabTokenizer()
chunker = SemanticChunker(tokenizer=tokenizer)
```

SudachiTokenizer (Flexible Granularity)

Supports three tokenization modes (A/B/C):

```python
from bunsetsu import SudachiTokenizer

# Mode C: Longest unit (compound words kept together)
tokenizer = SudachiTokenizer(mode="C")

# Mode A: Shortest unit (fine-grained)
tokenizer = SudachiTokenizer(mode="A")
```

Framework Integrations

LangChain

```python
from bunsetsu.integrations import LangChainTextSplitter
from langchain.schema import Document

splitter = LangChainTextSplitter(
    strategy="semantic",
    chunk_size=500,
)

# Split plain text
chunks = splitter.split_text(text)

# Split Documents
docs = [Document(page_content=text, metadata={"source": "file.txt"})]
split_docs = splitter.split_documents(docs)
```

LlamaIndex

```python
from bunsetsu.integrations import LlamaIndexNodeParser

parser = LlamaIndexNodeParser(
    strategy="semantic",
    chunk_size=500,
)

nodes = parser.get_nodes_from_documents(documents)
```

API Reference

chunk_text()

Convenience function for quick chunking:

```python
chunks = chunk_text(
    text,
    strategy="semantic",         # "fixed", "semantic", or "recursive"
    chunk_size=500,              # Target chunk size
    chunk_overlap=50,            # Overlap between chunks
    tokenizer_backend="simple",  # "simple", "mecab", or "sudachi"
)
```

Chunk Object

```python
chunk.text        # The chunk content
chunk.start_char  # Start position in original text
chunk.end_char    # End position in original text
chunk.char_count  # Number of characters
chunk.metadata    # Additional metadata dict
```
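
For reference, the shape described above can be modeled as a small dataclass. This is a hypothetical reconstruction (assuming `char_count` is derived from `text`), not the library's actual class.

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    """Minimal model of the Chunk attributes listed above."""
    text: str
    start_char: int   # start offset in the original document
    end_char: int     # end offset in the original document
    metadata: dict = field(default_factory=dict)

    @property
    def char_count(self) -> int:
        # Derived rather than stored, so it can never drift from text.
        return len(self.text)

c = Chunk(text="人工知能の発展。", start_char=0, end_char=8)
print(c.char_count)  # 8
```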

Token Object

```python
token.surface          # Surface form (as written)
token.token_type       # TokenType enum (NOUN, VERB, PARTICLE, etc.)
token.reading          # Reading (if available)
token.base_form        # Dictionary form (if available)
token.is_content_word  # True for nouns, verbs, adjectives
```

Performance

Benchmarked on a 100KB Japanese document:

| Chunker | Time | Chunks | Avg Size |
| --- | --- | --- | --- |
| FixedSizeChunker | 12ms | 203 | 492 chars |
| SemanticChunker (simple) | 45ms | 187 | 534 chars |
| SemanticChunker (mecab) | 89ms | 192 | 521 chars |
| RecursiveChunker | 23ms | 198 | 505 chars |

Design Philosophy

  1. Japanese-first: Built specifically for Japanese text, not adapted from English
  2. Zero dependencies by default: Works out of the box, optional backends for accuracy
  3. RAG-optimized: Chunks designed for embedding and retrieval, not just display
  4. Framework-agnostic: Core library works standalone, integrations provided separately

Contributing

Contributions are welcome! Please check CONTRIBUTING.md for guidelines.

```bash
# Development setup
git clone https://github.com/YUALAB/bunsetsu.git
cd bunsetsu
pip install -e ".[dev]"

# Run tests
pytest

# Run linter
ruff check src/
```

License

MIT License - see LICENSE for details.

About

Developed by YUA LAB (AQUA LLC), Tokyo.

We build AI agents and RAG systems for enterprise. This library powers our production RAG deployments.
