Japanese text transliteration library
Yosina Python
A Python port of the Yosina Japanese text transliteration library.
Overview
Yosina is a library for Japanese text transliteration that provides various text normalization and conversion features commonly needed when processing Japanese text.
Usage
from yosina import make_transliterator, TransliterationRecipe
# Create a recipe with desired transformations
recipe = TransliterationRecipe(
    kanji_old_new=True,
    replace_spaces=True,
    replace_suspicious_hyphens_to_prolonged_sound_marks=True,
    replace_circled_or_squared_characters=True,
    replace_combined_characters=True,
    hira_kata="hira-to-kata",  # Convert hiragana to katakana
    replace_japanese_iteration_marks=True,  # Expand iteration marks
    to_fullwidth=True,
)
# Create the transliterator
transliterator = make_transliterator(recipe)
# Use it with various special characters
input_text = "①②③ ⒶⒷⒸ ㍿㍑㌠㋿" # circled numbers, letters, space, combined characters
result = transliterator(input_text)
print(result) # "(1)(2)(3) (A)(B)(C) 株式会社リットルサンチーム令和"
# Convert old kanji to new
old_kanji = "舊字體"
result = transliterator(old_kanji)
print(result) # "旧字体"
# Convert half-width katakana to full-width
half_width = "ﾃｽﾄﾓｼﾞﾚﾂ"
result = transliterator(half_width)
print(result) # "テストモジレツ"
# Demonstrate hiragana to katakana conversion with iteration marks
mixed_text = "学問のすゝめ"
result = transliterator(mixed_text)
print(result) # "学問ノススメ"
Using Direct Configuration
from yosina import make_transliterator
# Configure with direct transliterator configs
configs = [
    ("kanji-old-new", {}),
    ("spaces", {}),
    ("prolonged-sound-marks", {"replace_prolonged_marks_following_alnums": True}),
    ("circled-or-squared", {}),
    ("combined", {}),
    ("hira-kata", {"mode": "kata-to-hira"}),  # Convert katakana to hiragana
    ("japanese-iteration-marks", {}),  # Expand iteration marks like 々, ゝゞ, ヽヾ
]
transliterator = make_transliterator(configs)
# Example combining katakana-to-hiragana conversion with iteration mark expansion
input_text = "カタカナでの時々の佐々木さん"
result = transliterator(input_text)
print(result) # "かたかなでの時時の佐佐木さん"
Available Transliterators
1. Circled or Squared (circled-or-squared)
Converts circled or squared characters to their plain equivalents.
- Options: templates (custom rendering), includeEmojis (include emoji characters)
- Example: ①②③ → (1)(2)(3), ㊙㊗ → (秘)(祝)
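As a rough illustration of the idea, the sketch below uses only Python's standard `unicodedata` module, not Yosina itself. It relies on the `<circle>` compatibility decomposition that Unicode defines for many enclosed characters. The function name `render_circled` and its `template` parameter are hypothetical and are not Yosina's API; Yosina's own mapping table may cover more characters (squared forms, emoji) than this sketch does.

```python
import unicodedata


def render_circled(text: str, template: str = "({})") -> str:
    """Replace circled characters with their plain content wrapped in a template."""
    out = []
    for ch in text:
        decomposition = unicodedata.decomposition(ch)
        if decomposition.startswith("<circle>"):
            # e.g. "<circle> 0031" for U+2460: decode the code points after the tag.
            plain = "".join(chr(int(cp, 16)) for cp in decomposition.split()[1:])
            out.append(template.format(plain))
        else:
            out.append(ch)
    return "".join(out)


print(render_circled("\u2460\u2461\u2462"))  # (1)(2)(3)
```

Passing a different `template`, such as `"[{}]"`, changes the wrapper in the same way a custom rendering option would.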
2. Combined (combined)
Expands combined characters into their individual character sequences.
- Example: ㍻ (Heisei era) → 平成, ㈱ → (株)
3. Hiragana-Katakana Composition (hira-kata-composition)
Combines decomposed hiragana and katakana (a base character followed by a voiced or semi-voiced sound mark) into their composed equivalents.
- Options: composeNonCombiningMarks (compose non-combining marks)
- Example: か + ゙ → が, ヘ + ゜ → ペ
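The composition step can be approximated with Unicode NFC normalization from the standard library. The following is a minimal sketch under that assumption, not Yosina's implementation; `compose_kana` and its `compose_non_combining_marks` flag are hypothetical names chosen to mirror the option described above.

```python
import unicodedata

COMBINING_MARKS = {"\u3099", "\u309a"}  # combining dakuten / handakuten
# Spacing (non-combining) marks and their combining counterparts.
NON_COMBINING_TO_COMBINING = {"\u309b": "\u3099", "\u309c": "\u309a"}


def compose_kana(text: str, compose_non_combining_marks: bool = False) -> str:
    """Merge a kana followed by a sound mark into one precomposed character."""
    out: list[str] = []
    for ch in text:
        mark = ch if ch in COMBINING_MARKS else None
        if mark is None and compose_non_combining_marks:
            mark = NON_COMBINING_TO_COMBINING.get(ch)
        if mark is not None and out:
            composed = unicodedata.normalize("NFC", out[-1] + mark)
            if len(composed) == 1:  # a precomposed form exists
                out[-1] = composed
                continue
        out.append(ch)  # no composition possible: keep the character as-is
    return "".join(out)


print(compose_kana("\u304b\u3099"))  # が
```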
4. Hiragana-Katakana (hira-kata)
Converts between hiragana and katakana scripts bidirectionally.
- Options: mode ("hira-to-kata" or "kata-to-hira")
- Example: ひらがな → ヒラガナ (hira-to-kata)
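For intuition, hiragana U+3041 to U+3096 and katakana U+30A1 to U+30F6 sit at a constant offset of 0x60 in Unicode, so a basic conversion is a table lookup. This is a standard-library sketch, not Yosina's code; `convert_kana` is a hypothetical name, and characters outside those two ranges (for example ヷ or the iteration marks) are left untouched here.

```python
KANA_OFFSET = 0x60
HIRA_TO_KATA = {cp: cp + KANA_OFFSET for cp in range(0x3041, 0x3097)}
KATA_TO_HIRA = {kata: hira for hira, kata in HIRA_TO_KATA.items()}


def convert_kana(text: str, mode: str = "hira-to-kata") -> str:
    """Convert between hiragana and katakana using the fixed code point offset."""
    if mode == "hira-to-kata":
        return text.translate(HIRA_TO_KATA)
    if mode == "kata-to-hira":
        return text.translate(KATA_TO_HIRA)
    raise ValueError(f"unknown mode: {mode!r}")


print(convert_kana("ひらがな"))  # ヒラガナ
```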
5. Hyphens (hyphens)
Replaces various dash/hyphen symbols with common ones used in Japanese.
- Options: precedence (mapping priority order)
- Available mappings: "ascii", "jisx0201", "jisx0208_90", "jisx0208_90_windows", "jisx0208_verbatim"
- Example: 2019—2020 (em dash) → 2019-2020
6. Ideographic Annotations (ideographic-annotations)
Replaces ideographic annotations used in traditional Chinese-to-Japanese translation.
- Example: ㆖㆘ → 上下
7. IVS-SVS Base (ivs-svs-base)
Handles Ideographic and Standardized Variation Selectors.
- Options: charset, mode ("ivs-or-svs" or "base"), preferSVS, dropSelectorsAltogether
- Example: 葛󠄀 (葛 + IVS) → 葛
8. Japanese Iteration Marks (japanese-iteration-marks)
Expands iteration marks by repeating the preceding character.
- Example: 時々 → 時時, いすゞ → いすず
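The expansion rule can be sketched with the standard library alone. The code below is an illustration under simplifying assumptions, not Yosina's implementation: `expand_iteration_marks` is a hypothetical name, a mark is expanded only when the preceding character is of the matching script, and voicing is derived through NFC/NFD normalization.

```python
import unicodedata

KANJI_MARK = "\u3005"             # 々
HIRAGANA_MARKS = "\u309d\u309e"   # ゝ ゞ
KATAKANA_MARKS = "\u30fd\u30fe"   # ヽ ヾ
VOICED_MARKS = {"\u309e", "\u30fe"}
COMBINING_DAKUTEN = "\u3099"


def expand_iteration_marks(text: str) -> str:
    """Replace each iteration mark with a copy of the preceding character."""
    out: list[str] = []
    for ch in text:
        prev = out[-1] if out else ""
        prev_name = unicodedata.name(prev, "") if prev else ""
        if ch == KANJI_MARK and prev_name.startswith("CJK UNIFIED IDEOGRAPH"):
            out.append(prev)
        elif (ch in HIRAGANA_MARKS and prev_name.startswith("HIRAGANA LETTER")) or (
            ch in KATAKANA_MARKS and prev_name.startswith("KATAKANA LETTER")
        ):
            base = unicodedata.normalize("NFD", prev)[0]  # strip any voicing
            if ch in VOICED_MARKS:
                voiced = unicodedata.normalize("NFC", base + COMBINING_DAKUTEN)
                # Keep the mark unchanged if no voiced form exists.
                out.append(voiced if len(voiced) == 1 else ch)
            else:
                out.append(base)
        else:
            out.append(ch)
    return "".join(out)


print(expand_iteration_marks("時々"))  # 時時
```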
9. JIS X 0201 and Alike (jisx0201-and-alike)
Handles half-width/full-width character conversion.
- Options: fullwidthToHalfwidth, convertGL (alphanumerics/symbols), convertGR (katakana), u005cAsYenSign
- Example: ABC123 → ＡＢＣ１２３, ｶﾀｶﾅ → カタカナ
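The half-width to full-width direction can be approximated as follows, using only the standard library. This is a sketch under stated assumptions, not Yosina's implementation: `to_fullwidth` is a hypothetical helper, ASCII U+0021 to U+007E is shifted by the fixed offset 0xFEE0, and half-width katakana runs are folded with NFKC so that a kana plus a half-width voiced mark becomes a single character.

```python
import re
import unicodedata

# Runs of half-width katakana and punctuation (U+FF61..U+FF9F).
HALFWIDTH_KANA_RUN = re.compile("[\uff61-\uff9f]+")
# Printable ASCII (except space) maps to full-width forms at a fixed offset.
ASCII_TO_FULLWIDTH = {cp: cp + 0xFEE0 for cp in range(0x21, 0x7F)}


def to_fullwidth(text: str) -> str:
    """Convert ASCII and half-width katakana to their full-width forms."""
    text = HALFWIDTH_KANA_RUN.sub(
        lambda m: unicodedata.normalize("NFKC", m.group()), text
    )
    return text.translate(ASCII_TO_FULLWIDTH)


print(to_fullwidth("\uff76\uff80\uff76\uff85"))  # カタカナ
```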
10. Kanji Old-New (kanji-old-new)
Converts old-style kanji (旧字体) to modern forms (新字体).
- Example: 舊字體の變換 → 旧字体の変換
11. Mathematical Alphanumerics (mathematical-alphanumerics)
Normalizes mathematical alphanumeric symbols to plain ASCII.
- Example: 𝐀𝐁𝐂 (mathematical bold) → ABC
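A comparable effect is available from NFKC normalization restricted to the Mathematical Alphanumeric Symbols block (U+1D400 to U+1D7FF). The sketch below assumes that approach and is not Yosina's implementation; `normalize_math_alphanumerics` is a hypothetical name. Note that Greek-letter variants in the block fold to Greek letters rather than ASCII under NFKC.

```python
import unicodedata

MATH_ALNUM_START, MATH_ALNUM_END = 0x1D400, 0x1D7FF


def normalize_math_alphanumerics(text: str) -> str:
    """Fold mathematical alphanumeric symbols to their plain counterparts."""
    return "".join(
        unicodedata.normalize("NFKC", ch)
        if MATH_ALNUM_START <= ord(ch) <= MATH_ALNUM_END
        else ch
        for ch in text
    )


print(normalize_math_alphanumerics("\U0001d400\U0001d401\U0001d402"))  # ABC
```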
12. Prolonged Sound Marks (prolonged-sound-marks)
Handles contextual conversion between hyphens and prolonged sound marks.
- Options: skipAlreadyTransliteratedChars, allowProlongedHatsuon, allowProlongedSokuon, replaceProlongedMarksFollowingAlnums
- Example: イ−ハト−ヴォ (with hyphen) → イーハトーヴォ (prolonged mark)
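The contextual part of this conversion can be pictured as "a hyphen-like character directly after a kana becomes ー". The sketch below implements only that simplified rule and is not Yosina's implementation, which also distinguishes cases such as hatsuon, sokuon, and alphanumerics through the options above. The names `HYPHEN_LIKE` and `fix_prolonged_marks` are hypothetical, and the hyphen set is an illustrative subset.

```python
HYPHEN_LIKE = {"-", "\u2010", "\u2212", "\uff0d"}  # illustrative subset
PROLONGED_SOUND_MARK = "\u30fc"  # ー


def _is_kana(ch: str) -> bool:
    return (
        "\u3041" <= ch <= "\u3096"      # hiragana
        or "\u30a1" <= ch <= "\u30fa"   # katakana
        or ch == PROLONGED_SOUND_MARK
    )


def fix_prolonged_marks(text: str) -> str:
    """Replace hyphen-like characters that directly follow a kana with ー."""
    out: list[str] = []
    for ch in text:
        if ch in HYPHEN_LIKE and out and _is_kana(out[-1]):
            out.append(PROLONGED_SOUND_MARK)
        else:
            out.append(ch)
    return "".join(out)


print(fix_prolonged_marks("\u30a4\u2212\u30cf\u30c8\u2212\u30f4\u30a9"))  # イーハトーヴォ
```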
13. Radicals (radicals)
Converts CJK radical characters to their corresponding ideographs.
- Example: ⾔⾨⾷ (Kangxi radicals) → 言門食
14. Spaces (spaces)
Normalizes various Unicode space characters to standard ASCII space.
- Example: A　B (ideographic space, U+3000) → A B (ASCII space)
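One way to approximate this with the standard library is to replace every character in the Unicode "Space Separator" category (Zs) with an ASCII space. This is a sketch under that assumption, not Yosina's implementation; `normalize_spaces` is a hypothetical name, and Yosina's own table may include or exclude characters differently.

```python
import unicodedata


def normalize_spaces(text: str) -> str:
    """Replace every Unicode space separator (category Zs) with U+0020."""
    return "".join(" " if unicodedata.category(ch) == "Zs" else ch for ch in text)


print(normalize_spaces("A\u3000B"))  # A B
```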
15. Roman Numerals (roman-numerals)
Converts Unicode Roman numeral characters to their ASCII letter equivalents.
- Example: Ⅰ Ⅱ Ⅲ → I II III, ⅰ ⅱ ⅲ → i ii iii
16. Small Hirakatas (small-hirakatas)
Converts small hiragana and katakana characters to their ordinary-sized equivalents.
- Example: ぁぃぅ → あいう, ァィゥ → アイウ
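Because Unicode names small kana systematically (for example `HIRAGANA LETTER SMALL A`), the ordinary-sized counterpart can be found by dropping `SMALL` from the name. The sketch below uses that naming convention and only the standard library; it is not Yosina's implementation, and `enlarge_small_kana` is a hypothetical name.

```python
import unicodedata

SMALL_PREFIXES = ("HIRAGANA LETTER SMALL ", "KATAKANA LETTER SMALL ")


def enlarge_small_kana(text: str) -> str:
    """Map small kana to ordinary-sized kana via their Unicode names."""
    out: list[str] = []
    for ch in text:
        name = unicodedata.name(ch, "")
        if name.startswith(SMALL_PREFIXES):
            try:
                ch = unicodedata.lookup(name.replace(" SMALL ", " ", 1))
            except KeyError:
                pass  # no ordinary-sized counterpart with that name
        out.append(ch)
    return "".join(out)


print(enlarge_small_kana("\u3041\u3043\u3045"))  # あいう
```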
17. Archaic Hirakatas (archaic-hirakatas)
Converts archaic kana (hentaigana) to their modern hiragana or katakana equivalents.
- Example: 𛀁 → え
18. Historical Hirakatas (historical-hirakatas)
Converts historical hiragana and katakana characters to their modern equivalents.
- Options: hiraganas ("simple", "decompose", or "skip"), katakanas ("simple", "decompose", or "skip"), voicedKatakanas ("decompose" or "skip")
- Example: ゐ → い (simple), ゐ → うぃ (decompose), ヰ → イ (simple)
Requirements
- Python 3.10 or higher
Installation
# Install with uv
uv add yosina
# Install with pip
pip install yosina
Development
This project uses uv for dependency management.
# Code generation
python -m codegen
# Install development dependencies
uv sync --extra dev
# Run tests
uv run pytest
# Run linting
uv run ruff check .
# Run formatting
uv run ruff format .
# Run type checking
uv run pyright
Requirements
- Python 3.10+
- typing-extensions
License
MIT