joyokanji converts old-form kanji characters into new-form kanji characters.

joyo-kanji

joyokanji is a tiny, fast Python library that converts old-form kanji (kyūjitai, 舊字體/旧字体) to new-form kanji (shinjitai, 新字体) using a mapping grounded in the Agency for Cultural Affairs’ Jōyō Kanji list (常用漢字表). See the source list (in Japanese) published by the Government of Japan: https://www.bunka.go.jp/kokugo_nihongo/sisaku/joho/joho/kijun/naikaku/kanji/

Optionally, you can also normalize common variant glyphs used in personal names (e.g., 髙/𠮷/﨑/隆) by enabling variants=True.

What’s kyūjitai vs. shinjitai? After WWII, Japan simplified the shapes of many commonly used kanji. The older shapes are kyūjitai (e.g., 鹽 → 塩, 國 → 国, 體 → 体), and the simplified shapes are shinjitai. This library helps normalize text by replacing old forms with their modern counterparts.



Features

  • Converts old-form (kyūjitai) kanji to modern (shinjitai) forms.
  • Mapping-based and deterministic: the same input always produces the same output.
  • Fast single-pass conversion using str.translate (linear time O(n)).
  • Loads mapping once from joyokanji/config/kanji.json and caches it.
  • Optional: normalize common variant glyphs (e.g., 髙, 𠮷, 﨑, 隆, 羽, 練, …) by passing variants=True (uses joyokanji/config/variants.json).
  • Pure-Python, minimal footprint, easy to embed in pipelines.

Installation

pip install joyokanji

If your package name differs on PyPI, update the command above accordingly.

Quick Start

import joyokanji

text = "鹽と黃と黑と點と發"
print(joyokanji.convert(text))  # => 塩と黄と黒と点と発

# Optional: include common variant glyphs
text2 = "髙﨑𠮷野屋"
print(joyokanji.convert(text2, variants=True))  # => 高崎吉野屋

API:

joyokanji.convert(text: str, variants: bool = False) -> str

How It Works

  • On first use, the library loads a JSON dictionary (joyokanji/config/kanji.json) of old→new pairs (e.g., {"鹽": "塩"}) and builds a translation table with str.maketrans.
  • Conversion is then a single pass over your string using str.translate, which is both simple and efficient.
  • The table is cached in memory for subsequent calls.
  • When variants=True, an additional map from joyokanji/config/variants.json is merged in (variant entries take precedence on conflicts). A separate cached table is maintained for this mode.
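
The steps above can be sketched as follows. This is a minimal illustration, not the library's actual source: the names build_table and convert_sketch are hypothetical, and the two JSON strings stand in for joyokanji/config/kanji.json and joyokanji/config/variants.json so that the snippet runs without the package installed.

```python
import json
from functools import lru_cache

# Stand-in data (assumption): the real library reads its maps from JSON files.
KANJI_JSON = '{"鹽": "塩", "國": "国", "體": "体"}'
VARIANTS_JSON = '{"髙": "高", "﨑": "崎", "𠮷": "吉"}'


@lru_cache(maxsize=None)
def build_table(variants: bool = False) -> dict:
    """Build a translation table once per mode and cache it."""
    mapping = json.loads(KANJI_JSON)
    if variants:
        # update() lets variant entries win if a key appears in both maps.
        mapping.update(json.loads(VARIANTS_JSON))
    return str.maketrans(mapping)


def convert_sketch(text: str, variants: bool = False) -> str:
    """Replace every mapped character in a single pass over the string."""
    return text.translate(build_table(variants))


print(convert_sketch("鹽と國"))                    # 塩と国
print(convert_sketch("髙﨑𠮷野屋", variants=True))  # 高崎吉野屋
```

Keeping one cached table per mode means the default call never pays for, or is affected by, the variant map.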

Examples

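Input → Output (pairs taken from the examples elsewhere in this document):

Kyūjitai → Shinjitai
鹽 → 塩
國 → 国
體 → 体
黃 → 黄
黑 → 黒
點 → 点
發 → 発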

Only characters listed in the mapping are transformed; all others remain unchanged.

Variants (optional)

When variants=True, common variant glyphs (often seen in proper names) are also normalized. Examples:

Variant → Normalized
髙 → 高
﨑 → 崎
𠮷 → 吉

Scope & Limitations

  • Coverage: The mapping focuses on characters relevant to modern Japanese usage and the Jōyō Kanji context. It is not a general Traditional ↔ Simplified Chinese converter and is not intended for zh-Hant texts (Taiwan/Hong Kong).
  • Context-free: Conversion is character-to-character. The library does not inspect context, readings, or word boundaries.
  • Proper nouns & personal names: Historical documents, proper nouns, and person names may intentionally use old forms (e.g., in legal names). Automatic conversion can be undesirable in such use cases. Review outputs when accuracy matters.
    • For this reason, variant glyph normalization is OFF by default. Enable variants=True only when desired.
  • Normalization: The library does not perform Unicode normalization (e.g., NFKC) by itself. If you need it, run normalization before or after conversion according to your pipeline’s needs.
  • Ambiguous variants: Some characters have multiple historical variants. The mapping chooses a widely accepted modern form; if you need domain-specific variants, consider customizing the mapping.
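
As an illustration of the normalization point, the sketch below runs NFKC before conversion. The helper normalize_then_convert is hypothetical, and demo_convert is a two-entry stand-in for joyokanji.convert so that the snippet is self-contained; in a real pipeline you would pass joyokanji.convert instead.

```python
import unicodedata
from typing import Callable

# Stand-in for joyokanji.convert (assumption), with a tiny demo mapping.
_DEMO_TABLE = str.maketrans({"鹽": "塩", "國": "国"})


def demo_convert(text: str) -> str:
    return text.translate(_DEMO_TABLE)


def normalize_then_convert(
    text: str, convert: Callable[[str], str] = demo_convert
) -> str:
    """Apply Unicode NFKC first, then old-form to new-form conversion."""
    return convert(unicodedata.normalize("NFKC", text))


# Full-width Latin letters are folded by NFKC; the kanji is handled by convert.
print(normalize_then_convert("ＡＢＣ鹽"))  # ABC塩
```

NFKC also replaces CJK compatibility ideographs with their unified counterparts, so running it first can change which characters the mapping sees. Choose the order deliberately.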

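Data Source & Attribution

The mapping is grounded in the Agency for Cultural Affairs’ Jōyō Kanji list (常用漢字表): https://www.bunka.go.jp/kokugo_nihongo/sisaku/joho/joho/kijun/naikaku/kanji/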

Performance Notes

  • Building the translation table happens once per process. Subsequent calls reuse the cached table and perform no file I/O.
  • The complexity is O(n) with low constant overhead, making it suitable for batch text processing.
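
To see why a single str.translate pass is a reasonable choice, the rough sketch below compares it with chained str.replace calls, which scan the text once per mapping entry. The three-entry mapping and the function names are illustrative only, and timings vary by machine, so none are quoted here.

```python
import timeit

MAPPING = {"鹽": "塩", "國": "国", "體": "体"}  # tiny demo mapping
TABLE = str.maketrans(MAPPING)                 # built once, reused below


def with_translate(text: str) -> str:
    """One pass over the text, regardless of mapping size."""
    return text.translate(TABLE)


def with_replace(text: str) -> str:
    """One pass over the text per mapping entry."""
    for old, new in MAPPING.items():
        text = text.replace(old, new)
    return text


sample = "鹽と國と體" * 10_000

# Both approaches agree here because no output character is also a key.
assert with_translate(sample) == with_replace(sample)

for fn in (with_translate, with_replace):
    seconds = timeit.timeit(lambda: fn(sample), number=20)
    print(f"{fn.__name__}: {seconds:.4f}s for 20 runs")
```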

When to Use / Not to Use

Use when: you need to normalize legacy texts into modern Japanese (OCR outputs, historical corpora, or mixed-form datasets).

Avoid or review carefully when: processing legal names, brand names, or scholarly editions where the original glyph choices carry meaning.
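
One possible way to handle the "review carefully" case is to exclude known names from conversion. The sketch below is not a feature of the library: convert_except is a hypothetical helper, and demo_convert is a stand-in for joyokanji.convert with a two-entry mapping.

```python
import re
from typing import Callable, Iterable

# Stand-in for joyokanji.convert(text, variants=True) (assumption).
_DEMO_TABLE = str.maketrans({"國": "国", "髙": "高"})


def demo_convert(text: str) -> str:
    return text.translate(_DEMO_TABLE)


def convert_except(
    text: str,
    protected: Iterable[str],
    convert: Callable[[str], str] = demo_convert,
) -> str:
    """Convert text but leave every protected term exactly as written."""
    terms = sorted((t for t in protected if t), key=len, reverse=True)
    if not terms:
        return convert(text)
    # Longest terms first, so a longer name wins over a shorter prefix.
    pattern = "(" + "|".join(re.escape(t) for t in terms) + ")"
    parts = re.split(pattern, text)
    # With one capturing group, re.split puts the matches at odd indexes.
    return "".join(p if i % 2 else convert(p) for i, p in enumerate(parts))


print(convert_except("髙島屋は國の店", ["髙島屋"]))  # 髙島屋は国の店
print(convert_except("髙島屋は國の店", []))          # 高島屋は国の店
```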

Contributing

  • Issues and PRs are welcome, especially for: (1) mapping improvements, (2) tests covering edge cases, (3) documentation in English/Japanese.
  • If proposing new pairs, please include a source/rationale and examples.

License

Apache License 2.0
