Japanese text transliteration library

Yosina Python

A Python port of the Yosina Japanese text transliteration library.

Overview

Yosina is a Japanese text transliteration library. It provides the normalization and conversion steps commonly needed when processing Japanese text, such as old-to-new kanji conversion, half-width/full-width conversion, and hiragana/katakana conversion.

Usage

from yosina import make_transliterator, TransliterationRecipe

# Create a recipe with desired transformations
recipe = TransliterationRecipe(
    kanji_old_new=True,
    replace_spaces=True,
    replace_suspicious_hyphens_to_prolonged_sound_marks=True,
    replace_circled_or_squared_characters=True,
    replace_combined_characters=True,
    hira_kata="hira-to-kata",  # Convert hiragana to katakana
    replace_japanese_iteration_marks=True,  # Expand iteration marks
    to_fullwidth=True,
)

# Create the transliterator
transliterator = make_transliterator(recipe)

# Use it with various special characters
input_text = "①②③ ⒶⒷⒸ ㍿㍑㌠㋿"  # circled numbers, letters, space, combined characters
result = transliterator(input_text)
print(result)  # "(1)(2)(3) (A)(B)(C) 株式会社リットルサンチーム令和"

# Convert old kanji to new
old_kanji = "舊字體"
result = transliterator(old_kanji)
print(result)  # "旧字体"

# Convert half-width katakana to full-width
half_width = "ﾃｽﾄﾓｼﾞﾚﾂ"
result = transliterator(half_width)
print(result)  # "テストモジレツ"

# Demonstrate hiragana to katakana conversion with iteration marks
mixed_text = "学問のすゝめ"
result = transliterator(mixed_text)
print(result)  # "学問ノススメ"

Using Direct Configuration

from yosina import make_transliterator

# Configure with direct transliterator configs
configs = [
    ("kanji-old-new", {}),
    ("spaces", {}),
    ("prolonged-sound-marks", {"replace_prolonged_marks_following_alnums": True}),
    ("circled-or-squared", {}),
    ("combined", {}),
    ("hira-kata", {"mode": "kata-to-hira"}),  # Convert katakana to hiragana
    ("japanese-iteration-marks", {}),  # Expand iteration marks like 々, ゝゞ, ヽヾ
]

transliterator = make_transliterator(configs)

# Example with various transformations including the new ones
input_text = "カタカナでの時々の佐々木さん"
result = transliterator(input_text)
print(result)  # "かたかなでの時時の佐佐木さん"
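
The config list reads as an ordered chain: each entry names one step, and the text passes through the steps in sequence. As a rough mental model only (this is not Yosina's implementation, and `chain`, `normalize_ideographic_space`, and `expand_kabu` are hypothetical names), the same idea in plain Python could look like this:

```python
from functools import reduce
from typing import Callable

Step = Callable[[str], str]


def chain(steps: list[Step]) -> Step:
    """Return a function that applies each step to the text, in order."""

    def run(text: str) -> str:
        return reduce(lambda acc, step: step(acc), steps, text)

    return run


def normalize_ideographic_space(text: str) -> str:
    # U+3000 IDEOGRAPHIC SPACE -> ASCII space
    return text.replace("\u3000", " ")


def expand_kabu(text: str) -> str:
    # U+3231 PARENTHESIZED IDEOGRAPH STOCK -> "(株)"
    return text.replace("\u3231", "(株)")


pipeline = chain([normalize_ideographic_space, expand_kabu])
print(pipeline("\u3231\u3000テスト"))
```

Because the steps run in list order, a later step sees the output of the earlier ones, which is why the order of entries in a config list can matter.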

Available Transliterators

1. Circled or Squared (circled-or-squared)

Converts circled or squared characters to their plain equivalents.

  • Options: templates (custom rendering), includeEmojis (include emoji characters)
  • Example: ①②③ → (1)(2)(3), ㊙㊗ → (秘)(祝)
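
To illustrate the kind of mapping involved, here is a minimal sketch that uses only the standard library's Unicode data. It assumes that a "(x)" rendering is wanted and handles only characters whose compatibility mapping is tagged `<circle>`; the function name `uncircle` is hypothetical, and Yosina's own tables and `templates` option may behave differently.

```python
import unicodedata


def uncircle(text: str) -> str:
    """Render characters that have a <circle> compatibility mapping as "(x)"."""
    out: list[str] = []
    for ch in text:
        # e.g. decomposition("\u2460") == "<circle> 0031"
        parts = unicodedata.decomposition(ch).split()
        if parts and parts[0] == "<circle>":
            inner = "".join(chr(int(cp, 16)) for cp in parts[1:])
            out.append(f"({inner})")
        else:
            out.append(ch)
    return "".join(out)


print(uncircle("\u2460\u2461\u2462"))  # circled digits one, two, three
print(uncircle("\u3299\u3297"))        # circled ideographs "secret" and "congratulation"
```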

2. Combined (combined)

Expands combined characters into their individual character sequences.

  • Example: ㍻ (Heisei era) → 平成, ㈱ → (株)

3. Hiragana-Katakana Composition (hira-kata-composition)

Combines decomposed hiragana and katakana (a base character followed by a voiced or semi-voiced sound mark) into their composed equivalents.

  • Options: composeNonCombiningMarks (compose non-combining marks)
  • Example: か + ゙ → が, ヘ + ゜ → ペ
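
Composition of this kind can be sketched with Unicode NFC normalization. The function below is an illustration under stated assumptions, not Yosina's code: `compose_kana` and its `compose_spacing_marks` flag are hypothetical names, with the flag loosely modeled on the `composeNonCombiningMarks` option (treating the spacing marks U+309B and U+309C like the combining marks U+3099 and U+309A).

```python
import unicodedata

COMBINING_MARKS = ("\u3099", "\u309a")  # combining dakuten, handakuten
# Spacing (non-combining) marks and their combining counterparts.
SPACING_TO_COMBINING = {"\u309b": "\u3099", "\u309c": "\u309a"}


def compose_kana(text: str, compose_spacing_marks: bool = False) -> str:
    out: list[str] = []
    for ch in text:
        mark = SPACING_TO_COMBINING.get(ch, ch) if compose_spacing_marks else ch
        if out and mark in COMBINING_MARKS:
            composed = unicodedata.normalize("NFC", out[-1] + mark)
            if len(composed) == 1:
                # The pair has a precomposed form, so replace the base with it.
                out[-1] = composed
                continue
        # No precomposed form (or not a mark): keep the character unchanged.
        out.append(ch)
    return "".join(out)


print(compose_kana("\u304b\u3099"))        # か + combining dakuten
print(compose_kana("\u30d8\u309a"))        # ヘ + combining handakuten
print(compose_kana("\u304b\u309b", True))  # か + spacing dakuten
```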

4. Hiragana-Katakana (hira-kata)

Converts between hiragana and katakana scripts bidirectionally.

  • Options: mode ("hira-to-kata" or "kata-to-hira")
  • Example: ひらがな → ヒラガナ (hira-to-kata)
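
The two scripts are laid out in parallel in Unicode, so the basic conversion can be sketched as a fixed code point offset. This is a simplified illustration, not the library's implementation: `convert_kana` is a hypothetical name, and characters with no counterpart in the other script are simply left alone.

```python
KANA_OFFSET = 0x60  # distance between U+3041 (ぁ) and U+30A1 (ァ)


def convert_kana(text: str, mode: str = "hira-to-kata") -> str:
    if mode == "hira-to-kata":
        lo, hi, delta = 0x3041, 0x3096, KANA_OFFSET
    elif mode == "kata-to-hira":
        lo, hi, delta = 0x30A1, 0x30F6, -KANA_OFFSET
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    # Shift only characters inside the source range; everything else passes through.
    return "".join(chr(ord(c) + delta) if lo <= ord(c) <= hi else c for c in text)


print(convert_kana("ひらがな"))
print(convert_kana("ヒラガナ", mode="kata-to-hira"))
```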

5. Hyphens (hyphens)

Replaces various dash/hyphen symbols with common ones used in Japanese.

  • Options: precedence (mapping priority order)
  • Available mappings: "ascii", "jisx0201", "jisx0208_90", "jisx0208_90_windows", "jisx0208_verbatim"
  • Example: 2019—2020 (em dash) → 2019-2020

6. Ideographic Annotations (ideographic-annotations)

Replaces ideographic annotations used in traditional Chinese-to-Japanese translation.

  • Example: ㆖㆘ → 上下

7. IVS-SVS Base (ivs-svs-base)

Handles Ideographic and Standardized Variation Selectors.

  • Options: charset, mode ("ivs-or-svs" or "base"), preferSVS, dropSelectorsAltogether
  • Example: 葛󠄀 (葛 + IVS) → 葛
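
The simplest case, reducing a character to its base form by dropping the selector, can be sketched as below. This is an assumption-laden illustration (`strip_variation_selectors` is a hypothetical name): it removes every code point in the two variation selector blocks and does not consult any charset table, and it would also strip the emoji presentation selector U+FE0F.

```python
def strip_variation_selectors(text: str) -> str:
    def is_selector(ch: str) -> bool:
        cp = ord(ch)
        # U+FE00..U+FE0F: Variation Selectors (used by SVS)
        # U+E0100..U+E01EF: Variation Selectors Supplement (used by IVS)
        return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF

    return "".join(ch for ch in text if not is_selector(ch))


with_ivs = "\u845b\U000e0100"  # 葛 followed by VARIATION SELECTOR-17
print(len(with_ivs), len(strip_variation_selectors(with_ivs)))
```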

8. Japanese Iteration Marks (japanese-iteration-marks)

Expands iteration marks by repeating the preceding character.

  • Example: 時々 → 時時, いすゞ → いすず
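
A simplified sketch of the expansion follows. It is not the library's algorithm: `expand_iteration_marks` and the helper names are hypothetical, only single-character repeats are handled, and a mark is left untouched when the preceding character is not of the matching script or has no voiced form.

```python
import unicodedata

DAKUTEN = "\u3099"  # combining voiced sound mark


def _is_kanji(c: str) -> bool:
    return "\u4e00" <= c <= "\u9fff"


def _is_hiragana(c: str) -> bool:
    return "\u3041" <= c <= "\u3096"


def _is_katakana(c: str) -> bool:
    return "\u30a1" <= c <= "\u30fa"


def expand_iteration_marks(text: str) -> str:
    out: list[str] = []
    for ch in text:
        prev = out[-1] if out else ""
        repeat = ch
        if ch == "々" and _is_kanji(prev):
            repeat = prev
        elif (ch in "ゝゞ" and _is_hiragana(prev)) or (ch in "ヽヾ" and _is_katakana(prev)):
            # Strip any dakuten from the previous kana to get the unvoiced base.
            base = unicodedata.normalize("NFD", prev).replace(DAKUTEN, "")
            if ch in "ゞヾ":
                base += DAKUTEN  # voiced repeat
            composed = unicodedata.normalize("NFC", base)
            if len(composed) == 1:
                repeat = composed
        out.append(repeat)
    return "".join(out)


print(expand_iteration_marks("時々"))
print(expand_iteration_marks("いすゞ"))
```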

9. JIS X 0201 and Alike (jisx0201-and-alike)

Handles half-width/full-width character conversion.

  • Options: fullwidthToHalfwidth, convertGL (alphanumerics/symbols), convertGR (katakana), u005cAsYenSign
  • Example: ABC123 → ＡＢＣ１２３, ｶﾀｶﾅ → カタカナ
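
For the half-width to full-width direction, a minimal standard-library sketch might look like the following. It is an illustration only: `to_fullwidth` is a hypothetical function (unrelated to the recipe flag of the same name), it leans on NFKC for the half-width katakana block, and it ignores details such as the yen sign handling that the options above control.

```python
import re
import unicodedata

# Printable ASCII maps onto U+FF01..U+FF5E at a fixed offset; space maps to U+3000.
_ASCII_TO_FULL = {cp: cp + 0xFEE0 for cp in range(0x21, 0x7F)}
_ASCII_TO_FULL[0x20] = 0x3000

# Runs of half-width katakana and punctuation (U+FF61..U+FF9F).
_HALF_KANA_RUN = re.compile("[\uff61-\uff9f]+")


def to_fullwidth(text: str) -> str:
    # NFKC widens each half-width kana and composes a following voiced mark,
    # so U+FF7C U+FF9E (shi + dakuten) becomes the single character ジ.
    text = _HALF_KANA_RUN.sub(
        lambda m: unicodedata.normalize("NFKC", m.group()), text
    )
    return text.translate(_ASCII_TO_FULL)


print(to_fullwidth("\uff83\uff7d\uff84\uff93\uff7c\uff9e\uff9a\uff82"))
print(to_fullwidth("ABC123"))
```

Normalizing only the matched runs, rather than the whole string, keeps NFKC from rewriting unrelated characters elsewhere in the text.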

10. Kanji Old-New (kanji-old-new)

Converts old-style kanji (旧字体) to modern forms (新字体).

  • Example: 舊字體の變換 → 旧字体の変換

11. Mathematical Alphanumerics (mathematical-alphanumerics)

Normalizes mathematical alphanumeric symbols to plain ASCII.

  • Example: 𝐀𝐁𝐂 (mathematical bold) → ABC
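
Most of these symbols live in the Mathematical Alphanumeric Symbols block and carry a compatibility mapping to a plain letter or digit, so the idea can be sketched with NFKC restricted to that block. This is an assumption-based illustration (`normalize_math_alnum` is a hypothetical name); note that the block also contains styled Greek letters, which map to plain Greek rather than ASCII, and that a few styled letters are encoded outside the block.

```python
import unicodedata


def normalize_math_alnum(text: str) -> str:
    # U+1D400..U+1D7FF is the Mathematical Alphanumeric Symbols block.
    return "".join(
        unicodedata.normalize("NFKC", ch) if 0x1D400 <= ord(ch) <= 0x1D7FF else ch
        for ch in text
    )


print(normalize_math_alnum("\U0001d400\U0001d401\U0001d402"))  # mathematical bold A, B, C
```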

12. Prolonged Sound Marks (prolonged-sound-marks)

Handles contextual conversion between hyphens and prolonged sound marks.

  • Options: skipAlreadyTransliteratedChars, allowProlongedHatsuon, allowProlongedSokuon, replaceProlongedMarksFollowingAlnums
  • Example: イ−ハト−ヴォ (with hyphen) → イーハトーヴォ (prolonged mark)
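
The contextual part is the key point: a hyphen-like character becomes a prolonged sound mark only when it follows kana. The sketch below shows that rule in its simplest form. It is not the library's logic: `fix_prolonged_marks` and `HYPHEN_LIKE` are hypothetical, the set of hyphen-like characters is an assumption, and none of the options listed above (for example the handling of ん and っ, or of alphanumerics) are modeled.

```python
# Characters treated as "suspicious hyphens" in this sketch (an assumed set).
HYPHEN_LIKE = {"-", "\u2010", "\u2014", "\u2015", "\u2212", "\uff0d"}
PROLONGED_MARK = "\u30fc"  # ー


def _is_kana(ch: str) -> bool:
    return "\u3041" <= ch <= "\u3096" or "\u30a1" <= ch <= "\u30fa"


def fix_prolonged_marks(text: str) -> str:
    out: list[str] = []
    for ch in text:
        follows_kana = bool(out) and (_is_kana(out[-1]) or out[-1] == PROLONGED_MARK)
        # Replace only in kana context; hyphens between digits or letters stay.
        out.append(PROLONGED_MARK if ch in HYPHEN_LIKE and follows_kana else ch)
    return "".join(out)


print(fix_prolonged_marks("イ\u2212ハト\u2212ヴォ"))  # U+2212 MINUS SIGN used as a hyphen
print(fix_prolonged_marks("2019-2020"))
```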

13. Radicals (radicals)

Converts CJK radical characters to their corresponding ideographs.

  • Example: ⾔⾨⾷ (Kangxi radicals) → 言門食

14. Spaces (spaces)

Normalizes various Unicode space characters to standard ASCII space.

  • Example: A　B (ideographic space) → A B
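
One way to approximate this with the standard library is to replace every character in the Unicode "space separator" category (Zs). This is a sketch under that assumption, with a hypothetical function name; the set of characters Yosina actually treats as spaces may differ (zero-width characters, for instance, are not in Zs).

```python
import unicodedata


def normalize_spaces(text: str) -> str:
    # Category "Zs" covers U+0020, U+00A0, U+2000..U+200A, U+3000, and a few others.
    return "".join(" " if unicodedata.category(ch) == "Zs" else ch for ch in text)


print(normalize_spaces("A\u3000B"))  # ideographic space
print(normalize_spaces("A\u00a0B"))  # no-break space
```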

15. Roman Numerals (roman-numerals)

Converts Unicode Roman numeral characters to their ASCII letter equivalents.

  • Example: Ⅰ Ⅱ Ⅲ → I II III, ⅰ ⅱ ⅲ → i ii iii

16. Small Hirakatas (small-hirakatas)

Converts small hiragana and katakana characters to their ordinary-sized equivalents.

  • Example: ぁぃぅ → あいう, ァィゥ → アイウ
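
A table-driven sketch of this mapping is shown below. The table is an illustrative subset chosen for this example (it omits, for instance, ゕ, ゖ, ヵ, and ヶ), and `enlarge_small_kana` is a hypothetical name, so treat it as a picture of the technique rather than the library's data.

```python
SMALL = "ぁぃぅぇぉっゃゅょゎァィゥェォッャュョヮ"
LARGE = "あいうえおつやゆよわアイウエオツヤユヨワ"

# str.maketrans pairs the characters of the two strings position by position.
_SMALL_TO_LARGE = str.maketrans(SMALL, LARGE)


def enlarge_small_kana(text: str) -> str:
    return text.translate(_SMALL_TO_LARGE)


print(enlarge_small_kana("ぁぃぅ"))
print(enlarge_small_kana("ァィゥ"))
```

Note that enlarging small kana changes how a word reads (きゃ becomes きや), so a step like this suits matching or indexing better than display.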

17. Archaic Hirakatas (archaic-hirakatas)

Converts archaic kana (hentaigana) to their modern hiragana or katakana equivalents.

  • Example: 𛀁

18. Historical Hirakatas (historical-hirakatas)

Converts historical hiragana and katakana characters to their modern equivalents.

  • Options: hiraganas ("simple", "decompose", or "skip"), katakanas ("simple", "decompose", or "skip"), voicedKatakanas ("decompose" or "skip")
  • Example: ゐ → い (simple), ゐ → うぃ (decompose), ヰ → イ (simple)

Requirements

  • Python 3.10 or higher

Installation

# Install with uv
uv add yosina

# Install with pip
pip install yosina

Development

This project uses uv for dependency management.

# Code generation
python -m codegen

# Install development dependencies
uv sync --extra dev

# Run tests
uv run pytest

# Run linting
uv run ruff check .

# Run formatting
uv run ruff format .

# Run type checking
uv run pyright

Requirements

  • Python 3.10+
  • typing-extensions

License

MIT
