Deterministic normalization utilities for Japanese text variants.

Utsuho

Utsuho is a Python library for deterministic normalization of Japanese text variants.

It focuses on character-level conversions such as width normalization and kana conversion, while avoiding unrelated transformations that general-purpose Unicode normalization may introduce.

  • Bidirectional conversion between half-width and full-width katakana
  • Bidirectional conversion between hiragana and katakana
  • Configurable handling of spaces, punctuation, ASCII symbols, digits, and letters
  • Command-line interface for interactive use, scripting, and piped stdin processing
  • Model Context Protocol (MCP) server support for tool-based integrations

Why Utsuho?

Japanese text often mixes multiple representations of the same content, such as half-width and full-width katakana, or hiragana and katakana. Python's Unicode normalization can help in some cases, but it may also perform conversions you do not want, such as changing ASCII symbols or decomposing composite characters.

Utsuho provides explicit, deterministic character-level conversions for these Japanese text variants, making it easier to normalize Japanese text without introducing unrelated transformations.
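
To make the side effects mentioned above concrete, the following sketch uses only the standard library's unicodedata module. It does not use Utsuho; it shows that NFKC normalization widens half-width katakana but also rewrites full-width symbols and compatibility characters, which may not be wanted. The sample strings are chosen for illustration.

```python
import unicodedata

# Sample inputs chosen for illustration.
samples = {
    "half-width katakana": "ｷﾞﾝｶｸｼﾞ",
    "full-width exclamation mark": "！",
    "circled digit": "①",
    "parenthesized ideograph": "㈱",
}

# NFKC applies compatibility mappings to every character, not only katakana.
results = {
    label: unicodedata.normalize("NFKC", text)
    for label, text in samples.items()
}

for label, normalized in results.items():
    print(f"{label}: {samples[label]!r} -> {normalized!r}")
```

Only the first conversion is a katakana width change; the other three are the kind of unrelated transformation that explicit, character-level conversion is meant to avoid.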

Performance

Utsuho is implemented in pure Python, but still provides practical throughput for character-level normalization workloads.

In the project's long-input benchmarks on CPython 3.10, kana conversion processes roughly 7 to 8 million input characters per second, while width conversion processes roughly 1 to 3 million.

These numbers are intended as indicative throughput rather than fixed guarantees, and will vary by platform, Python version, input mix, and power or thermal conditions.
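
Because throughput depends on the environment, it can be useful to measure it locally. The helper below is a minimal sketch, not the project's benchmark harness: chars_per_second is a hypothetical name, and str.upper stands in for a converter so that the snippet runs without Utsuho installed.

```python
import time
from typing import Callable


def chars_per_second(convert: Callable[[str], str], text: str, repeat: int = 5) -> float:
    """Return the best observed throughput in input characters per second."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        convert(text)
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    # Guard against a zero reading on very coarse timers.
    return len(text) / best if best > 0 else float("inf")


# Long input, so that per-call overhead does not dominate the measurement.
sample = "きょうとし さきょうく ぎんかくじちょう 2" * 10_000

# str.upper is only a stand-in for a real conversion callable.
rate = chars_per_second(str.upper, sample)
print(f"{rate:,.0f} input characters per second")
```

To measure Utsuho itself, pass a converter's convert method, for example HiraganaToKatakanaConverter().convert, in place of str.upper.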

Installation

Install Utsuho with pip:

pip install Utsuho

Quick Start

Half-width to full-width katakana

from utsuho import HalfToFullConverter

text = "ｷｮｳﾄｼ ｻｷｮｳｸ ｷﾞﾝｶｸｼﾞﾁｮｳ 2"
converted = HalfToFullConverter().convert(text)

print(converted)
# キョウトシ　サキョウク　ギンカクジチョウ　２

Full-width to half-width katakana

from utsuho import FullToHalfConverter

text = "キョウトシ　サキョウク　ギンカクジチョウ　２"
converted = FullToHalfConverter().convert(text)

print(converted)
# ｷｮｳﾄｼ ｻｷｮｳｸ ｷﾞﾝｶｸｼﾞﾁｮｳ 2

Hiragana to katakana

from utsuho import HiraganaToKatakanaConverter

text = "きょうとし さきょうく ぎんかくじちょう 2"
converted = HiraganaToKatakanaConverter().convert(text)

print(converted)
# キョウトシ サキョウク ギンカクジチョウ 2

Katakana to hiragana

from utsuho import KatakanaToHiraganaConverter

text = "キョウトシ サキョウク ギンカクジチョウ 2"
converted = KatakanaToHiraganaConverter().convert(text)

print(converted)
# きょうとし さきょうく ぎんかくじちょう 2
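
For readers curious about why kana conversion can be deterministic, the sketch below illustrates the general technique with the standard library only. It is not Utsuho's implementation: it relies on the fact that hiragana U+3041 to U+3096 and katakana U+30A1 to U+30F6 sit at a fixed offset of 0x60 in Unicode. The function names are hypothetical, and the library may cover characters outside this range.

```python
# Hiragana U+3041..U+3096 maps onto katakana U+30A1..U+30F6 at a fixed offset.
HIRAGANA_START, HIRAGANA_END = 0x3041, 0x3096
KANA_OFFSET = 0x60

# Translation tables for str.translate, keyed by code point.
HIRA_TO_KATA = {
    cp: cp + KANA_OFFSET for cp in range(HIRAGANA_START, HIRAGANA_END + 1)
}
KATA_TO_HIRA = {kata: hira for hira, kata in HIRA_TO_KATA.items()}


def hiragana_to_katakana(text: str) -> str:
    """Map each hiragana in the table to katakana; leave everything else as is."""
    return text.translate(HIRA_TO_KATA)


def katakana_to_hiragana(text: str) -> str:
    """Map each katakana in the table to hiragana; leave everything else as is."""
    return text.translate(KATA_TO_HIRA)


print(hiragana_to_katakana("きょうとし 2"))
print(katakana_to_hiragana("キョウトシ 2"))
```

Characters that are not in the table, such as the space and the digit, pass through unchanged, which matches the behavior shown in the examples above.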

Configuring Width Conversion

Use WidthConverterConfig to control which non-katakana characters are normalized during half-width and full-width conversion.

from utsuho import HalfToFullConverter, WidthConverterConfig

config = WidthConverterConfig(
    ascii_symbol=False,
    ascii_digit=False,
    ascii_alphabet=False,
)

# Katakana is widened; the ASCII digit and letter in "2F" are left unchanged.
converted = HalfToFullConverter(config).convert("ｷﾞﾝｶｸｼﾞ 2F")

Available options:

Parameter         Default  Description
punctuation       True     Convert punctuation marks.
corner_brucket    True     Convert corner brackets.
conjunction_mark  True     Convert conjunction marks.
length_mark       True     Convert length marks.
space             True     Convert spaces.
ascii_symbol      True     Convert ASCII symbols.
ascii_digit       True     Convert ASCII digits.
ascii_alphabet    True     Convert ASCII letters.
wave_dash         False    Convert full-width wave dashes to half-width tildes in full-to-half conversion.

Note: The public API currently spells this parameter corner_brucket, for historical reasons.
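
To illustrate what per-category switches mean in practice, here is a standard-library sketch of half-to-full conversion for ASCII characters only. It is an assumption-laden illustration, not Utsuho's code: ascii_half_to_full is a hypothetical function, its keyword names merely mirror the options above, and the library's actual mapping for individual symbols may differ.

```python
import string

# Full-width forms U+FF01..U+FF5E mirror ASCII U+0021..U+007E at this offset.
FULLWIDTH_OFFSET = 0xFEE0
IDEOGRAPHIC_SPACE = 0x3000


def ascii_half_to_full(
    text: str,
    *,
    space: bool = True,
    ascii_symbol: bool = True,
    ascii_digit: bool = True,
    ascii_alphabet: bool = True,
) -> str:
    """Widen only the ASCII categories whose flag is enabled."""
    table: dict[int, int] = {}
    if ascii_digit:
        table.update({ord(c): ord(c) + FULLWIDTH_OFFSET for c in string.digits})
    if ascii_alphabet:
        table.update({ord(c): ord(c) + FULLWIDTH_OFFSET for c in string.ascii_letters})
    if ascii_symbol:
        table.update({ord(c): ord(c) + FULLWIDTH_OFFSET for c in string.punctuation})
    if space:
        # The ASCII space has no slot in the offset range; it maps to U+3000.
        table[ord(" ")] = IDEOGRAPHIC_SPACE
    return text.translate(table)


print(ascii_half_to_full("2F!"))
print(ascii_half_to_full("2F!", ascii_digit=False, ascii_alphabet=False))
```

Building the translation table from flags keeps each category independent, so turning one off never affects the others.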

CLI

Utsuho also provides a command-line interface for interactive use, scripting, and shell pipelines.

% utsuho --help
Usage: utsuho [OPTIONS] COMMAND [ARGS]...

  Utsuho provides deterministic normalization utilities for Japanese text,
  including width normalization and hiragana/katakana conversion.

Options:
  --version  Show the version.
  --help     Show this message and exit.

Commands:
  full-to-half          Convert from full-width to half-width characters.
  half-to-full          Convert from half-width to full-width characters.
  hiragana-to-katakana  Convert from hiragana to katakana.
  katakana-to-hiragana  Convert from katakana to hiragana.

Examples:

% utsuho full-to-half "キョウトシ　サキョウク　ギンカクジチョウ　２"
ｷｮｳﾄｼ ｻｷｮｳｸ ｷﾞﾝｶｸｼﾞﾁｮｳ 2

% utsuho half-to-full "ｷｮｳﾄｼ ｻｷｮｳｸ ｷﾞﾝｶｸｼﾞﾁｮｳ 2"
キョウトシ　サキョウク　ギンカクジチョウ　２

% utsuho hiragana-to-katakana "きょうとし さきょうく ぎんかくじちょう 2"
キョウトシ サキョウク ギンカクジチョウ 2

% utsuho katakana-to-hiragana "キョウトシ サキョウク ギンカクジチョウ 2"
きょうとし さきょうく ぎんかくじちょう 2

% echo "キョウトシ　２" | utsuho full-to-half
ｷｮｳﾄｼ 2

Each command accepts either a TEXT argument or piped stdin input. If TEXT is omitted, input is read from stdin.

When --file (or -f) is specified, TEXT is required and is treated as the path to a UTF-8 text file. In this mode, stdin is not read.
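
The stdin mode described above also makes the CLI usable from other programs. The following is a minimal sketch of calling it from Python with subprocess; run_utsuho is a hypothetical helper, and the sketch assumes that the utsuho executable is on PATH and that it reads and writes UTF-8 on your platform.

```python
import shutil
import subprocess


def run_utsuho(command: str, text: str) -> str:
    """Pipe text into `utsuho <command>` and return stdout without the trailing newline."""
    completed = subprocess.run(
        ["utsuho", command],   # TEXT is omitted, so the CLI reads stdin
        input=text,
        capture_output=True,
        encoding="utf-8",      # assumption: the CLI uses UTF-8 for stdin/stdout
        check=True,            # raise CalledProcessError on a non-zero exit
    )
    return completed.stdout.rstrip("\n")


if shutil.which("utsuho"):
    print(run_utsuho("hiragana-to-katakana", "きょうとし"))
else:
    print("utsuho is not on PATH; install it with: pip install Utsuho")
```

For use inside a Python program, importing the converter classes directly avoids the cost of starting a process; the subprocess route is mainly useful when the caller is not written in Python or must stay isolated.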

MCP (Model Context Protocol)

Utsuho also provides a Model Context Protocol (MCP) server that exposes its text conversion utilities as tools.

This allows Utsuho to be used from MCP-compatible clients such as AI agents, enabling deterministic text normalization as an external tool.

Installation

Install with the mcp extra:

pip install "Utsuho[mcp]"

Running the MCP server

Start the server using:

utsuho-mcp

The server runs over stdio and provides the following tools.

Available tools

  • half_to_full

    Convert half-width text to full-width text.

  • full_to_half

    Convert full-width text to half-width text.

  • hiragana_to_katakana

    Convert hiragana to katakana.

  • katakana_to_hiragana

    Convert katakana to hiragana.

All tools accept text: str and return the converted string.

The width-conversion tools also accept optional boolean parameters matching WidthConverterConfig:

punctuation
corner_brucket
conjunction_mark
length_mark
space
ascii_symbol
ascii_digit
ascii_alphabet

In addition, full_to_half accepts:

wave_dash
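
As an illustration of how these parameters travel over the wire, the sketch below builds the JSON-RPC message that the MCP specification defines for a tool invocation (method tools/call, with name and arguments). It is a hedged example using only the standard library: a real session starts with an initialization handshake, and MCP client libraries normally construct this message for you. The id value and the argument values are arbitrary.

```python
import json

# A tools/call request as defined by the MCP specification.
request = {
    "jsonrpc": "2.0",
    "id": 1,  # arbitrary request id chosen for this example
    "method": "tools/call",
    "params": {
        "name": "full_to_half",
        "arguments": {
            "text": "ギンカクジ　２Ｆ",
            "ascii_alphabet": False,  # optional flag matching WidthConverterConfig
            "wave_dash": True,        # accepted by full_to_half only
        },
    },
}

# ensure_ascii=False keeps the Japanese text readable in the serialized form.
payload = json.dumps(request, ensure_ascii=False)
print(payload)
```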

Documentation

License

This project is licensed under the Apache License 2.0. See LICENSE for details.
