Pure Python implementation of OpenCC for Chinese text conversion

opencc_purepy

opencc_purepy is a pure Python implementation of OpenCC (Open Chinese Convert), supporting conversion between Simplified, Traditional, Hong Kong, Taiwan, and Japanese Kanji.
It uses dictionary-based segmentation and mapping logic inspired by the original OpenCC.
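
As a rough illustration of dictionary-based segmentation and mapping, the sketch below uses greedy forward longest-match lookup over a toy dictionary. The function name, the toy entries, and the single-pass strategy are assumptions made for illustration; the library's real pipeline and lexicons are not reproduced here.

```python
def convert_longest_match(text, mapping, max_len):
    """Convert text by greedy forward longest-match lookup in mapping."""
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in mapping:
                out.append(mapping[chunk])
                i += size
                break
        else:
            # No dictionary entry matched: copy the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


# Toy dictionary (hypothetical, not the real OpenCC lexicon).
TOY_S2T = {"头发": "頭髮", "发现": "發現", "发": "發", "头": "頭", "现": "現"}
TOY_MAX_LEN = max(len(key) for key in TOY_S2T)

print(convert_longest_match("头发发现了", TOY_S2T, TOY_MAX_LEN))  # 頭髮發現了
```

The phrase entries matter here: "发" on its own maps to "發", but inside "头发" the longer match wins and yields "髮".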


🚩 Features

  • Pure Python – no native dependencies
  • Multiple Chinese locale conversions (Simplified, Traditional, HK, TW, JP)
  • Punctuation style conversion (optional)
  • Automatic script detection (Simplified/Traditional)
  • CLI with Office document support (.docx, .xlsx, .pptx, .odt, .ods, .odp, .epub)

🐍 opencc_purepy requires Python 3.7 or later.


🔁 Supported Conversion Configs

Code   Description
s2t    Simplified → Traditional
t2s    Traditional → Simplified
s2tw   Simplified → Traditional (Taiwan)
tw2s   Traditional (Taiwan) → Simplified
s2twp  Simplified → Traditional (Taiwan) with idioms
tw2sp  Traditional (Taiwan) → Simplified with idioms
s2hk   Simplified → Traditional (Hong Kong)
hk2s   Traditional (Hong Kong) → Simplified
t2tw   Traditional → Traditional (Taiwan)
tw2t   Traditional (Taiwan) → Traditional
t2twp  Traditional → Traditional (Taiwan) with idioms
tw2tp  Traditional (Taiwan) → Traditional with idioms
t2hk   Traditional → Traditional (Hong Kong)
hk2t   Traditional (Hong Kong) → Traditional
t2jp   Japanese Kyujitai → Shinjitai
jp2t   Japanese Shinjitai → Kyujitai

📦 Installation

pip install opencc-purepy

🚀 Usage

Python

from opencc_purepy import OpenCC

text = "“春眠不觉晓,处处闻啼鸟。”"
opencc = OpenCC("s2t")
converted = opencc.convert(text, punctuation=True)
print(converted)  # 「春眠不覺曉,處處聞啼鳥。」
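
The punctuation step in the example above can be pictured as a plain character translation. The following is a minimal sketch, not the library's code: the double-quote pair is taken from the example output, while the single-quote pair and the helper name convert_punctuation are assumptions.

```python
# Quote pairs for the Simplified -> Traditional direction.
# The double-quote pair matches the example above; the single-quote
# pair is an assumption added for illustration.
S2T_PUNCT = str.maketrans({"“": "「", "”": "」", "‘": "『", "’": "』"})


def convert_punctuation(text):
    """Replace curly quotes with corner brackets, leaving other text as-is."""
    return text.translate(S2T_PUNCT)


print(convert_punctuation("“春眠不覺曉”"))  # 「春眠不覺曉」
```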

CLI

Text File Conversion

python -m opencc_purepy convert -i input.txt -o output.txt -c s2t -p
# or, if installed as a script:
opencc-purepy convert -i input.txt -o output.txt -c s2t -p
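
When the CLI is driven from a script, it can help to build the argument list in one place. This is a small sketch that uses only the flags shown above (-i, -o, -c, -p); the helper name build_convert_command is hypothetical, and the command is only printed here rather than executed.

```python
import shlex


def build_convert_command(src, dst, config="s2t", punctuation=False):
    """Build the argv list for `opencc-purepy convert` using the flags above."""
    argv = ["opencc-purepy", "convert", "-i", str(src), "-o", str(dst), "-c", config]
    if punctuation:
        argv.append("-p")
    return argv


cmd = build_convert_command("input.txt", "output.txt", "s2t", punctuation=True)
# Show the shell-quoted command line.
print(" ".join(shlex.quote(part) for part in cmd))
# To run it for real: subprocess.run(cmd, check=True)
```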

Office Document Conversion subcommand (office)

Supports: .docx, .xlsx, .pptx, .odt, .ods, .odp, .epub

# Convert Word document with font preservation
opencc-purepy office -i example.docx -c t2s --keep-font

# Convert EPUB and auto-detect output name
opencc-purepy office -i book.epub -c s2t --auto-ext

# Convert Excel and specify output path and format
opencc-purepy office -i sheet.xlsx -o result.xlsx -c s2tw --format xlsx

ℹ️ With the office subcommand, the input is processed as an Office or EPUB document and OpenCC conversion is applied to the text inside it.


📚 Custom Dictionaries

opencc_purepy follows the OpenCC lexicon structure. Custom entries are loaded through existing OpenCC dictionary slots, such as st_phrases or ts_phrases. There is no generic user-dictionary slot: a custom file such as UserDict.txt is always attached to one of the named slots.

This keeps DictionaryMaxlength, DictRefs, and future acceleration structures such as UnionCache stable and OpenCC-compatible.

Append mode

Use appends={...} to load the built-in dictionaries first and the custom entries after them. When a key appears more than once, the later entry wins, so custom entries override built-in ones. This mode is recommended for most users.

from opencc_purepy import OpenCC

cc = OpenCC.from_dicts(
    config="s2t",
    appends={
        "st_phrases": "./UserDict.txt",
    },
)

print(cc.convert("帕兰蒂尔是一家公司"))

Override mode

Use overrides={...} to replace an entire dictionary slot with a custom file. This is intended for advanced users or proprietary full dictionary copies.

from opencc_purepy import OpenCC

cc = OpenCC.from_dicts(
    config="s2t",
    overrides={
        "st_phrases": "./company/STPhrases.txt",
    },
)
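
The difference between the two modes can be sketched with ordinary dicts. This is an illustration of the semantics described above, not the library's loader; resolve_slot and the toy entries are hypothetical.

```python
def resolve_slot(builtin, custom, mode):
    """Return the effective entries for one dictionary slot.

    mode="append":   built-in entries first, then custom; later keys win.
    mode="override": custom entries replace the slot entirely.
    """
    if mode == "append":
        merged = dict(builtin)
        merged.update(custom)  # custom is loaded later, so it wins on clashes
        return merged
    if mode == "override":
        return dict(custom)
    raise ValueError(f"unknown mode: {mode!r}")


# Toy data standing in for a built-in slot and a user file.
builtin_st_phrases = {"头发": "頭髮", "软件": "軟件"}
custom_entries = {"软件": "軟體", "帕兰蒂尔": "帕蘭蒂爾"}

appended = resolve_slot(builtin_st_phrases, custom_entries, "append")
overridden = resolve_slot(builtin_st_phrases, custom_entries, "override")
print(len(appended), len(overridden))  # 3 2
```

In append mode the built-in "头发" entry survives and "软件" takes the custom value; in override mode only the two custom entries remain.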

Direct dictionary injection

from opencc_purepy import OpenCC
from opencc_purepy.dictionary_lib import DictionaryMaxlength

dictionary = DictionaryMaxlength.from_dicts(
    appends={
        "st_phrases": "./UserDict.txt",
    },
)

cc = OpenCC(config="s2t", dictionary=dictionary)

Dictionary text format

Custom dictionary files are UTF-8 text files with one mapping per line, in phrase<TAB>translation format. Blank lines are ignored, lines starting with # are comments, and when a key is duplicated the later entry wins.

# Custom company terms
帕兰蒂尔	帕蘭蒂爾
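
The format rules above are simple enough to check with a few lines of Python before handing a file to the library. The parser below is a standalone sketch: the function name, the choice to skip malformed lines silently, and the returned maximum key length are assumptions, not the library's behaviour.

```python
def parse_dictionary_text(text):
    """Parse phrase<TAB>translation lines into (mapping, max_key_length)."""
    mapping = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        key, sep, value = line.partition("\t")
        if not sep or not key or not value:
            continue  # assumption: malformed lines are skipped silently
        mapping[key] = value  # a later duplicate overwrites an earlier one
    max_len = max((len(key) for key in mapping), default=0)
    return mapping, max_len


sample = "# Custom company terms\n帕兰蒂尔\t帕蘭蒂爾\n\n帕兰蒂尔\t帕蘭提爾\n"
entries, longest = parse_dictionary_text(sample)
print(entries, longest)  # {'帕兰蒂尔': '帕蘭提爾'} 4
```

To validate a real file, read it with open(path, encoding="utf-8") and pass the contents to the function.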

Supported slots

Slot name                Default file
st_characters            STCharacters.txt
st_phrases               STPhrases.txt
ts_characters            TSCharacters.txt
ts_phrases               TSPhrases.txt
tw_phrases               TWPhrases.txt
tw_phrases_rev           TWPhrasesRev.txt
tw_variants              TWVariants.txt
tw_variants_rev          TWVariantsRev.txt
tw_variants_rev_phrases  TWVariantsRevPhrases.txt
hk_variants              HKVariants.txt
hk_variants_rev          HKVariantsRev.txt
hk_variants_rev_phrases  HKVariantsRevPhrases.txt
jps_characters           JPShinjitaiCharacters.txt
jps_phrases              JPShinjitaiPhrases.txt
jp_variants              JPVariants.txt
jp_variants_rev          JPVariantsRev.txt

Generate JSON with dictgen

TXT dictionaries are human-editable source files. dictionary_maxlength.json is a generated/cache format, so prefer dictgen instead of manually editing JSON.

opencc-purepy dictgen -d ./my_dicts -o dictionary_maxlength.json

Then load the generated JSON in Python:

from opencc_purepy import OpenCC
from opencc_purepy.dictionary_lib import DictionaryMaxlength

dictionary = DictionaryMaxlength.from_json("./dictionary_maxlength.json")
cc = OpenCC(config="s2t", dictionary=dictionary)

Which mode should I use?

  • Use appends for a few user or company terms.
  • Use overrides when maintaining a full proprietary replacement of an OpenCC dictionary file.
  • Use dictgen when you want to bake TXT dictionaries into JSON for reuse or faster loading.
  • Use direct dictionary injection when sharing one loaded dictionary across many OpenCC instances.

🧩 API Reference

Exports

  • OpenCC
  • OpenccConfig

OpenCC class

  • OpenCC(config: str | OpenccConfig = "s2t")
    Create a converter with a supported config string or OpenccConfig enum value. Raises ValueError for unsupported configs.
  • set_config(config: str | OpenccConfig) -> None
    Update the active conversion config. Raises ValueError for unsupported configs.
  • get_config() -> str
    Return the current canonical config name.
  • supported_configs() -> list[str]
    Return all supported config names.
  • get_last_error() -> str | None
    Return the last validation or conversion error, if any.
  • convert(input: str, punctuation: bool = False) -> str
    Convert text using the active config, with optional punctuation conversion.
  • s2t(input: str, punctuation: bool = False) -> str
    Simplified Chinese to Traditional Chinese.
  • t2s(input: str, punctuation: bool = False) -> str
    Traditional Chinese to Simplified Chinese.
  • s2tw(input: str, punctuation: bool = False) -> str
    Simplified Chinese to Taiwan Traditional.
  • tw2s(input: str, punctuation: bool = False) -> str
    Taiwan Traditional to Simplified Chinese.
  • s2twp(input: str, punctuation: bool = False) -> str
    Simplified Chinese to Taiwan Traditional with idiom and phrase conversion.
  • tw2sp(input: str, punctuation: bool = False) -> str
    Taiwan Traditional with idioms to Simplified Chinese.
  • s2hk(input: str, punctuation: bool = False) -> str
    Simplified Chinese to Hong Kong Traditional.
  • hk2s(input: str, punctuation: bool = False) -> str
    Hong Kong Traditional to Simplified Chinese.
  • t2tw(input: str, punctuation: bool = False) -> str
    Traditional Chinese to Taiwan Traditional.
  • t2twp(input: str, punctuation: bool = False) -> str
    Traditional Chinese to Taiwan Traditional with phrase mappings.
  • tw2t(input: str, punctuation: bool = False) -> str
    Taiwan Traditional to standard Traditional Chinese.
  • tw2tp(input: str, punctuation: bool = False) -> str
    Taiwan Traditional to standard Traditional Chinese with phrase reversal.
  • t2hk(input: str, punctuation: bool = False) -> str
    Traditional Chinese to Hong Kong variant.
  • hk2t(input: str, punctuation: bool = False) -> str
    Hong Kong Traditional to standard Traditional Chinese.
  • t2jp(input: str, punctuation: bool = False) -> str
    Traditional Chinese to Japanese variants.
  • jp2t(input: str, punctuation: bool = False) -> str
    Japanese Shinjitai to Traditional Chinese.
  • st(input: str) -> str
    Character-only Simplified to Traditional conversion.
  • ts(input: str) -> str
    Character-only Traditional to Simplified conversion.
  • zho_check(input: str) -> int
    Detect the input text type:
      1 - Traditional, 2 - Simplified, 0 - Others
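
One plausible way to arrive at these codes is to test whether the text changes under a Traditional-to-Simplified pass, and then under a Simplified-to-Traditional pass. The sketch below shows that heuristic with tiny toy tables; it is an assumption about the approach, and zho_check_sketch is not the library function.

```python
# Toy character tables (hypothetical; real lexicons are far larger).
TOY_T2S = str.maketrans({"頭": "头", "髮": "发", "發": "发"})
TOY_S2T = str.maketrans({"头": "頭", "发": "發"})

LABELS = {1: "Traditional", 2: "Simplified", 0: "Others"}


def zho_check_sketch(text):
    """Return 1 for Traditional, 2 for Simplified, 0 otherwise."""
    if text != text.translate(TOY_T2S):
        return 1  # contains characters that only occur in Traditional
    if text != text.translate(TOY_S2T):
        return 2  # contains characters that only occur in Simplified
    return 0


for sample in ("頭髮", "头发", "hello"):
    print(sample, LABELS[zho_check_sketch(sample)])
```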

OpenccConfig enum

  • Members include: S2T, T2S, S2TW, TW2S, S2TWP, TW2SP, S2HK, HK2S, T2TW, TW2T, T2TWP, TW2TP, T2HK, HK2T, T2JP, JP2T
  • to_canonical_name() -> str
    Return the lowercase OpenCC config string.
  • parse(value: str) -> OpenccConfig
    Parse a config string into an enum value.
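
To make the round trip between config strings and enum members concrete, here is a standalone stand-in built on the standard enum module. ConfigSketch is hypothetical and is not the library's OpenccConfig; the case-insensitive parsing and the ValueError on unknown input are assumptions.

```python
from enum import Enum


class ConfigSketch(Enum):
    """Stand-in enum mirroring the member names listed above."""
    S2T = "s2t"
    T2S = "t2s"
    S2TW = "s2tw"
    TW2S = "tw2s"
    S2TWP = "s2twp"
    TW2SP = "tw2sp"
    S2HK = "s2hk"
    HK2S = "hk2s"
    T2TW = "t2tw"
    TW2T = "tw2t"
    T2TWP = "t2twp"
    TW2TP = "tw2tp"
    T2HK = "t2hk"
    HK2T = "hk2t"
    T2JP = "t2jp"
    JP2T = "jp2t"

    @classmethod
    def parse(cls, value):
        """Map a config string to a member (assumed case-insensitive)."""
        try:
            return cls(value.strip().lower())
        except ValueError:
            raise ValueError(f"unsupported config: {value!r}") from None

    def to_canonical_name(self):
        """Return the lowercase config string."""
        return self.value


print(ConfigSketch.parse("S2TWP").to_canonical_name())  # s2twp
```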

🛠 Development


⚡ Benchmark

Measured on GitHub Actions ubuntu-latest using the default s2t configuration.
Each test averaged over 20 runs with the shared dictionary cache reused across runs.

Runner Platform

Field      Value
Runner     Linux X64
Image      ubuntu24 20260413.86.1
Kernel     Linux runnervmeorf1 6.17.0-1010-azure #10~24.04.1-Ubuntu SMP Fri Mar 6 22:00:57 UTC 2026 x86_64 x86_64 x86_64 GNU/Linux
CPU        Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
CPU Cores  4
Memory     Not reported
Python     Python 3.10.20

Results

Input Size      Avg. Time (ms)
100 chars                0.221
1,000 chars              1.769
10,000 chars            17.584
100,000 chars          173.838

Timings reuse the shared dictionary cache, but still include per-run OpenCC instance setup; results depend on runner hardware and background system load.
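
For readers who want comparable numbers on their own hardware, a timing loop along these lines would do. It is a minimal sketch and not the project's benchmark script: average_ms and stand_in_convert are hypothetical, and the placeholder workload should be replaced with a real conversion call (including instance setup, if that is to be counted).

```python
import time


def average_ms(func, arg, runs=20):
    """Return the mean wall-clock time of func(arg) in milliseconds."""
    start = time.perf_counter()
    for _ in range(runs):
        func(arg)
    elapsed = time.perf_counter() - start
    return elapsed * 1000.0 / runs


def stand_in_convert(text):
    # Placeholder workload; swap in a real converter call to measure it.
    return text.translate(str.maketrans({"发": "發"}))


for size in (100, 1_000, 10_000, 100_000):
    sample = "发" * size
    print(f"{size:>7,} chars  {average_ms(stand_in_convert, sample):.3f} ms")
```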


Projects That Use opencc-purepy

OpenccPurepyGui


📄 License

This project is licensed under the MIT License.


Powered by Pure Python and OpenCC Lexicons.
