
Pure Python implementation of OpenCC for Chinese text conversion


opencc_purepy


opencc_purepy is a pure Python implementation of OpenCC (Open Chinese Convert). It converts between Simplified Chinese, Traditional Chinese, the Hong Kong and Taiwan variants of Traditional Chinese, and Japanese Kanji.
It uses dictionary-based segmentation and mapping logic inspired by the original OpenCC.
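
As a rough illustration of dictionary-based segmentation and mapping, the sketch below applies greedy longest-match replacement over a toy table. The `TOY_DICT` entries and the `convert_longest_match` helper are hypothetical; they are not the library's data or its internal algorithm.

```python
# Toy phrase/character table (hypothetical, not the real OpenCC lexicon).
TOY_DICT = {
    "头发": "頭髮",  # phrase entry: 发 maps to 髮 in this context
    "发": "發",      # single-character fallback
    "头": "頭",
    "现": "現",
}

# The longest key bounds how far ahead the matcher has to look.
MAX_KEY_LEN = max(len(key) for key in TOY_DICT)


def convert_longest_match(text: str) -> str:
    """Greedy left-to-right conversion that prefers the longest dictionary key."""
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate window first, then shrink it.
        for size in range(min(MAX_KEY_LEN, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in TOY_DICT:
                out.append(TOY_DICT[chunk])
                i += size
                break
        else:
            # No entry matched: copy the character through unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


print(convert_longest_match("头发发现"))  # 頭髮發現
```

The phrase entry wins over the single-character entry for the first 发, which is why phrase dictionaries matter for characters with more than one Traditional form.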


🚩 Features

  • Pure Python – no native dependencies
  • Multiple Chinese locale conversions (Simplified, Traditional, HK, TW, JP)
  • Punctuation style conversion (optional)
  • Automatic text type detection (Simplified/Traditional)
  • CLI with Office document support (.docx, .xlsx, .pptx, .odt, .ods, .odp, .epub)

🐍 opencc_purepy requires Python 3.7 or later.


🔁 Supported Conversion Configs

| Code | Description |
|------|-------------|
| s2t | Simplified → Traditional |
| t2s | Traditional → Simplified |
| s2tw | Simplified → Traditional (Taiwan) |
| tw2s | Traditional (Taiwan) → Simplified |
| s2twp | Simplified → Traditional (Taiwan) with idioms |
| tw2sp | Traditional (Taiwan) → Simplified with idioms |
| s2hk | Simplified → Traditional (Hong Kong) |
| hk2s | Traditional (Hong Kong) → Simplified |
| t2tw | Traditional → Traditional (Taiwan) |
| tw2t | Traditional (Taiwan) → Traditional |
| t2twp | Traditional → Traditional (Taiwan) with idioms |
| tw2tp | Traditional (Taiwan) → Traditional with idioms |
| t2hk | Traditional → Traditional (Hong Kong) |
| hk2t | Traditional (Hong Kong) → Traditional |
| t2jp | Japanese Kyujitai → Shinjitai |
| jp2t | Japanese Shinjitai → Kyujitai |

📦 Installation

pip install opencc-purepy

🚀 Usage

Python

from opencc_purepy import OpenCC

text = "“春眠不觉晓,处处闻啼鸟。”"
opencc = OpenCC("s2t")
converted = opencc.convert(text, punctuation=True)
print(converted)  # 「春眠不覺曉,處處聞啼鳥。」

CLI

Text File Conversion

python -m opencc_purepy convert -i input.txt -o output.txt -c s2t -p
# or, if installed as a script:
opencc-purepy convert -i input.txt -o output.txt -c s2t -p

Office Document Conversion (office subcommand)

Supports: .docx, .xlsx, .pptx, .odt, .ods, .odp, .epub

# Convert Word document with font preservation
opencc-purepy office -i example.docx -c t2s --keep-font

# Convert EPUB and auto-detect output name
opencc-purepy office -i book.epub -c s2t --auto-ext

# Convert Excel and specify output path and format
opencc-purepy office -i sheet.xlsx -o result.xlsx -c s2tw --format xlsx

ℹ️ With the office subcommand, the input is processed as an Office or EPUB document, and OpenCC conversion is applied to its text internally.
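
The listed formats are all ZIP containers that hold XML or XHTML parts, so the general idea can be sketched with the standard library alone. This is an assumption-laden illustration, not the office subcommand's implementation: `convert_zip_container`, the suffix list, and the toy translation table are all hypothetical.

```python
import io
import zipfile
from typing import Callable


def convert_zip_container(
    data: bytes,
    convert: Callable[[str], str],
    text_suffixes=(".xml", ".xhtml", ".html"),
) -> bytes:
    """Return a new ZIP archive with `convert` applied to text-like members."""
    src = zipfile.ZipFile(io.BytesIO(data))
    out_buf = io.BytesIO()
    with src, zipfile.ZipFile(out_buf, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            payload = src.read(info.filename)
            if info.filename.lower().endswith(text_suffixes):
                # Assumes the member is UTF-8 encoded markup.
                payload = convert(payload.decode("utf-8")).encode("utf-8")
            # Other members (images, fonts, ...) are copied through unchanged.
            dst.writestr(info.filename, payload)
    return out_buf.getvalue()


# Toy stand-in for a real converter (hypothetical mapping).
toy_table = str.maketrans({"汉": "漢", "语": "語"})

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("word/document.xml", "<w:t>汉语</w:t>")
    zf.writestr("media/image.bin", b"\x00\x01")

converted = convert_zip_container(buf.getvalue(), lambda s: s.translate(toy_table))
with zipfile.ZipFile(io.BytesIO(converted)) as zf:
    print(zf.read("word/document.xml").decode("utf-8"))  # <w:t>漢語</w:t>
```

A real EPUB writer also has to keep the `mimetype` entry first and uncompressed, which this sketch does not do.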


📚 Custom Dictionaries

opencc_purepy follows the OpenCC lexicon structure. Custom entries are loaded through existing OpenCC dictionary slots, such as DictSlot.STPhrases, DictSlot.TSPhrases, DictSlot.STPunctuations, and other OpenCC slots. There is no generic UserDict slot.

Dictionary slot mappings support both:

  • DictSlot (recommended)
  • string slot names such as "st_phrases" (backward compatible)

Recommended: load-time append mode

Use appends={...} to load the built-in dictionaries first and then the custom entries. For duplicate keys the later entry wins, so custom entries override built-in ones. This mode is recommended for most users.

from opencc_purepy import DictSlot, OpenCC

cc = OpenCC.from_dicts(
    config="s2t",
    appends={
        DictSlot.STPhrases: "./UserDict.txt",
    },
)

print(cc.convert("帕兰蒂尔是一家公司"))

String slot names remain supported for compatibility:

from opencc_purepy import OpenCC

cc = OpenCC.from_dicts(
    config="s2t",
    appends={
        "st_phrases": "./UserDict.txt",
    },
)

The same appends={...} and overrides={...} arguments are also supported by DictionaryMaxlength.from_dicts() when you want to create and reuse a dictionary instance yourself.


Post-load file customization

Use DictionaryMaxlength.with_custom_dict_files() when you already have a dictionary instance and want to apply OpenCC-compatible text dictionary files after loading it. Post-load customization supports both appends={...} and overrides={...}.

from opencc_purepy import DictSlot, OpenCC
from opencc_purepy.dictionary_lib import DictionaryMaxlength

dictionary = DictionaryMaxlength.from_json().with_custom_dict_files(
    appends={
        DictSlot.STPhrases: "./UserDict.txt",
    },
)

cc = OpenCC(config="s2t", dictionary=dictionary)

print(cc.convert("帕兰蒂尔是一家公司"))

Create a private dictionary instance first with DictionaryMaxlength.from_json() or DictionaryMaxlength.from_dicts(). Do not mutate the shared global provider returned by DictionaryMaxlength.get_provider() or DictionaryMaxlength.new(); the post-load customization APIs are intended for private dictionary instances.


Tofu-risk / Extension Unicode fallback pairs

Use DictionaryMaxlength.with_custom_dicts() for exact in-memory custom pairs when you need to patch tofu-risk characters or Extension Unicode mappings without restructuring the built-in OpenCC dictionaries.

This is useful for platforms where some CJK Extension characters may render as tofu boxes, or where you want to provide a temporary project-local fallback before the upstream dictionary data is updated.

from opencc_purepy import DictSlot, OpenCC
from opencc_purepy.dictionary_lib import DictionaryMaxlength

dictionary = DictionaryMaxlength.from_json().with_custom_dicts(
    appends={
        DictSlot.STPhrases: {
            # Project-local fallback pairs for tofu-risk / Extension Unicode cases.
            # Keep these patches small, explicit, and easy to remove later.
            "骖𬴂": "驂騑",
            "𫜩合": "齧合",
            "𫜩蘗吞针": "齧蘗吞針",

            # Normal custom phrase pairs may be mixed in as well.
            "帕兰蒂尔": "帕蘭蒂爾",
        },
    },
)

cc = OpenCC(config="s2t", dictionary=dictionary)

print(cc.convert("骖𬴂"))
print(cc.convert("𫜩合"))
print(cc.convert("帕兰蒂尔"))

This keeps the core dictionary structure unchanged while still allowing applications to patch specific high-risk entries at load time.


Dictionary text format

Custom dictionary files are UTF-8 text files in OpenCC lexicon format. Use one mapping per line:

# Custom company terms
帕兰蒂尔	帕蘭蒂爾
AI模型 AI模型

Each entry is parsed as key<TAB>value, or as a key and a value separated by whitespace. Blank lines are ignored, comments are allowed with #, and for duplicate keys the later entry wins.

Because file parsing follows OpenCC dictionary rules, leading spaces and embedded spaces in keys are not preserved. Use with_custom_dicts() when the custom key itself contains spaces.
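
The rules above can be approximated in a few lines. The sketch below is not the library's parser: `parse_lexicon_lines` is a hypothetical helper, and two of its behaviours are assumptions (malformed lines are skipped, and only the first value token is kept). The second 帕兰蒂尔 entry is made up to show a duplicate key.

```python
from typing import Dict, Iterable


def parse_lexicon_lines(lines: Iterable[str]) -> Dict[str, str]:
    """Parse OpenCC-style 'key<TAB>value' or 'key value' lines into a dict."""
    entries: Dict[str, str] = {}
    for raw in lines:
        line = raw.strip()  # leading/trailing whitespace is not preserved
        if not line or line.startswith("#"):
            continue  # blank line or comment
        parts = line.split()  # splits on tabs and on runs of spaces
        if len(parts) < 2:
            continue  # assumption: silently skip lines without a value
        key, value = parts[0], parts[1]
        entries[key] = value  # a later duplicate overwrites the earlier one
    return entries


sample = [
    "# Custom company terms",
    "帕兰蒂尔\t帕蘭蒂爾",
    "AI模型 AI模型",
    "",
    "帕兰蒂尔\t帕蘭提爾",  # hypothetical duplicate: this later entry wins
]
print(parse_lexicon_lines(sample))
```

Because the line is split on whitespace, a key such as "New York" would be cut at the space, which is the case where in-memory pairs are the safer route.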


Override mode

Use overrides={...} to replace an entire dictionary slot. This is for advanced users who maintain a full replacement for a selected OpenCC dictionary slot.

from opencc_purepy import DictSlot, OpenCC

cc = OpenCC.from_dicts(
    config="s2t",
    overrides={
        DictSlot.STPhrases: "./company/STPhrases.txt",
    },
)

Post-load override mode works the same way:

from opencc_purepy import DictSlot
from opencc_purepy.dictionary_lib import DictionaryMaxlength

dictionary = DictionaryMaxlength.from_json().with_custom_dict_files(
    overrides={
        DictSlot.STPhrases: "./CompanyOnlySTPhrases.txt",
    },
)
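
The difference between appends and overrides can be modeled with plain dictionaries. This is a conceptual sketch only: `apply_customization` is a hypothetical helper, the toy entries are not real lexicon data, and applying overrides before appends when both target the same slot is an assumption.

```python
def apply_customization(builtin, appends=None, overrides=None):
    """Model slot customization with plain dicts keyed by slot name."""
    # Copy first so the built-in data is never mutated.
    result = {slot: dict(entries) for slot, entries in builtin.items()}
    for slot, entries in (overrides or {}).items():
        result[slot] = dict(entries)  # replace the whole slot
    for slot, entries in (appends or {}).items():
        result.setdefault(slot, {}).update(entries)  # merge; later entry wins
    return result


builtin = {"st_phrases": {"头发": "頭髮", "公司": "公司"}}
custom = {"st_phrases": {"帕兰蒂尔": "帕蘭蒂爾"}}

appended = apply_customization(builtin, appends=custom)
overridden = apply_customization(builtin, overrides=custom)

print(len(appended["st_phrases"]))    # 3: built-in entries plus the custom one
print(len(overridden["st_phrases"]))  # 1: only the custom entry remains
```

With overrides, every built-in phrase in that slot is gone, so the replacement file has to be complete on its own.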

Supported slots

| DictSlot | Legacy key | Default file |
|----------|------------|--------------|
| DictSlot.STCharacters | st_characters | STCharacters.txt |
| DictSlot.STPhrases | st_phrases | STPhrases.txt |
| DictSlot.STPunctuations | st_punctuations | STPunctuations.txt |
| DictSlot.TSCharacters | ts_characters | TSCharacters.txt |
| DictSlot.TSPhrases | ts_phrases | TSPhrases.txt |
| DictSlot.TSPunctuations | ts_punctuations | TSPunctuations.txt |
| DictSlot.TWPhrases | tw_phrases | TWPhrases.txt |
| DictSlot.TWPhrasesRev | tw_phrases_rev | TWPhrasesRev.txt |
| DictSlot.TWVariants | tw_variants | TWVariants.txt |
| DictSlot.TWVariantsRev | tw_variants_rev | TWVariantsRev.txt |
| DictSlot.TWVariantsRevPhrases | tw_variants_rev_phrases | TWVariantsRevPhrases.txt |
| DictSlot.HKVariants | hk_variants | HKVariants.txt |
| DictSlot.HKVariantsRev | hk_variants_rev | HKVariantsRev.txt |
| DictSlot.HKVariantsRevPhrases | hk_variants_rev_phrases | HKVariantsRevPhrases.txt |
| DictSlot.JPSCharacters | jps_characters | JPShinjitaiCharacters.txt |
| DictSlot.JPSPhrases | jps_phrases | JPShinjitaiPhrases.txt |
| DictSlot.JPVariants | jp_variants | JPVariants.txt |
| DictSlot.JPVariantsRev | jp_variants_rev | JPVariantsRev.txt |

Generate JSON with dictgen

TXT dictionaries are human-editable source files. dictionary_maxlength.json is a generated/cache format, so prefer dictgen instead of manually editing JSON.

opencc-purepy dictgen -d ./my_dicts -o dictionary_maxlength.json

Then load the generated JSON from Python:

from opencc_purepy import OpenCC
from opencc_purepy.dictionary_lib import DictionaryMaxlength

dictionary = DictionaryMaxlength.from_json("./dictionary_maxlength.json")

cc = OpenCC(
    config="s2t",
    dictionary=dictionary,
)

Which mode should I use?

  • Use appends for a few user or company terms.
  • Use overrides when maintaining a full proprietary replacement of an OpenCC dictionary file.
  • Use with_custom_dict_files() to apply OpenCC-compatible text files to a private dictionary after loading it.
  • Use with_custom_dicts() for exact in-memory pairs, especially keys with leading or embedded spaces.
  • Use dictgen when you want to bake TXT dictionaries into JSON for reuse or faster loading.
  • Use direct dictionary injection when sharing one loaded dictionary across many OpenCC instances.
  • Prefer DictSlot for new code and IDE-friendly type checking.
  • Legacy str slot keys remain fully supported for backward compatibility.

🧩 API Reference

Exports

  • OpenCC
  • OpenccConfig
  • DictSlot

OpenCC class

  • OpenCC(config: str | OpenccConfig = "s2t")
    Create a converter with a supported config string or OpenccConfig enum value. Raises ValueError for unsupported configs.
  • set_config(config: str | OpenccConfig) -> None
    Update the active conversion config. Raises ValueError for unsupported configs.
  • get_config() -> str
    Return the current canonical config name.
  • supported_configs() -> list[str]
    Return all supported config names.
  • get_last_error() -> str | None
    Return the last validation or conversion error, if any.
  • convert(input: str, punctuation: bool = False) -> str
    Convert text using the active config, with optional punctuation conversion.
  • s2t(input: str, punctuation: bool = False) -> str
    Simplified Chinese to Traditional Chinese.
  • t2s(input: str, punctuation: bool = False) -> str
    Traditional Chinese to Simplified Chinese.
  • s2tw(input: str, punctuation: bool = False) -> str
    Simplified Chinese to Taiwan Traditional.
  • tw2s(input: str, punctuation: bool = False) -> str
    Taiwan Traditional to Simplified Chinese.
  • s2twp(input: str, punctuation: bool = False) -> str
    Simplified Chinese to Taiwan Traditional with idiom and phrase conversion.
  • tw2sp(input: str, punctuation: bool = False) -> str
    Taiwan Traditional with idioms to Simplified Chinese.
  • s2hk(input: str, punctuation: bool = False) -> str
    Simplified Chinese to Hong Kong Traditional.
  • hk2s(input: str, punctuation: bool = False) -> str
    Hong Kong Traditional to Simplified Chinese.
  • t2tw(input: str, punctuation: bool = False) -> str
    Traditional Chinese to Taiwan Traditional.
  • t2twp(input: str, punctuation: bool = False) -> str
    Traditional Chinese to Taiwan Traditional with phrase mappings.
  • tw2t(input: str, punctuation: bool = False) -> str
    Taiwan Traditional to standard Traditional Chinese.
  • tw2tp(input: str, punctuation: bool = False) -> str
    Taiwan Traditional to standard Traditional Chinese with phrase reversal.
  • t2hk(input: str, punctuation: bool = False) -> str
    Traditional Chinese to Hong Kong variant.
  • hk2t(input: str, punctuation: bool = False) -> str
    Hong Kong Traditional to standard Traditional Chinese.
  • t2jp(input: str, punctuation: bool = False) -> str
    Traditional Chinese (Kyujitai) to Japanese Shinjitai.
  • jp2t(input: str, punctuation: bool = False) -> str
    Japanese Shinjitai to Traditional Chinese (Kyujitai).
  • st(input: str) -> str
    Character-only Simplified to Traditional conversion.
  • ts(input: str) -> str
    Character-only Traditional to Simplified conversion.
  • zho_check(input: str) -> int
    Detect the input text type:
      1 - Traditional, 2 - Simplified, 0 - Others
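
One way to use the detection code is to decide whether a conversion is needed at all. The sketch below only maps the documented return codes to config names; `pick_config` and its `target` parameter are hypothetical helpers, and the integer passed in would come from `zho_check()` in real code.

```python
from typing import Optional

# Return codes documented for zho_check().
ZHO_OTHERS = 0
ZHO_TRADITIONAL = 1
ZHO_SIMPLIFIED = 2


def pick_config(zho_code: int, target: str = "simplified") -> Optional[str]:
    """Map a zho_check() code to a config name, or None if no conversion is needed."""
    if target == "simplified":
        return "t2s" if zho_code == ZHO_TRADITIONAL else None
    if target == "traditional":
        return "s2t" if zho_code == ZHO_SIMPLIFIED else None
    raise ValueError(f"unknown target: {target!r}")


print(pick_config(ZHO_TRADITIONAL))                       # t2s
print(pick_config(ZHO_SIMPLIFIED))                        # None
print(pick_config(ZHO_SIMPLIFIED, target="traditional"))  # s2t
```

Text detected as "Others" maps to None for both targets, so non-Chinese input is passed through untouched.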

OpenccConfig enum

  • Members include: S2T, T2S, S2TW, TW2S, S2TWP, TW2SP, S2HK, HK2S, T2TW, TW2T, T2TWP, TW2TP, T2HK, HK2T, T2JP, JP2T
  • to_canonical_name() -> str
    Return the lowercase OpenCC config string.
  • parse(value: str) -> OpenccConfig
    Parse a config string into an enum value.

🛠 Development


⚡ Benchmark

Measured on GitHub Actions ubuntu-latest using the default s2t configuration.
The benchmark separates cold startup, the first post-init conversion, and warm cached conversion.

Runner Platform

| Field | Value |
|-------|-------|
| Runner | Linux X64 |
| Image | ubuntu24 20260513.135.3 |
| Kernel | Linux runnervmrw5os 6.17.0-1013-azure #13~24.04.1-Ubuntu SMP Wed Apr 15 16:52:17 UTC 2026 x86_64 x86_64 x86_64 GNU/Linux |
| CPU | AMD EPYC 9V74 80-Core Processor |
| CPU Cores | 4 |
| Memory | |
| Python | Python 3.10.20 |

Results

opencc-purepy v1.3.0

| Input Size | Cold Total (ms) | Post-init Cold (ms) | Warm (ms) |
|------------|-----------------|---------------------|-----------|
| 100 chars | 21.849 | 19.419 | 0.171 |
| 1,000 chars | 21.572 | 21.409 | 1.643 |
| 10,000 chars | 34.063 | 32.584 | 13.480 |
| 100,000 chars | 158.355 | 156.202 | 136.870 |

cold_total includes OpenCC(config) setup plus conversion. post_init_cold measures the first conversion after initialization. warm measures conversion after the union cache has already been built. Results depend on runner hardware and background system load.
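
The three measurements can be reproduced in spirit with a small timing harness. The sketch below is not the project's benchmark script: `LazyConverter` is a hypothetical stand-in that builds a cache on first use, and `benchmark` and `elapsed_ms` are illustrative helpers.

```python
import time


class LazyConverter:
    """Stand-in converter: cheap to construct, builds a lookup cache on first use."""

    def __init__(self):
        self._cache = None

    def convert(self, text: str) -> str:
        if self._cache is None:
            # Simulated one-time cache build, paid by the first conversion.
            self._cache = {chr(c): chr(c) for c in range(0x4E00, 0x9FFF)}
        return "".join(self._cache.get(ch, ch) for ch in text)


def elapsed_ms(fn) -> float:
    """Run fn once and return the wall-clock time in milliseconds."""
    start = time.perf_counter()
    fn()
    return (time.perf_counter() - start) * 1000.0


def benchmark(text: str) -> dict:
    # cold_total: construction plus the first conversion.
    cold_total = elapsed_ms(lambda: LazyConverter().convert(text))

    # post_init_cold: construction excluded, first conversion timed.
    converter = LazyConverter()
    post_init_cold = elapsed_ms(lambda: converter.convert(text))

    # warm: same instance again, cache already built.
    warm = elapsed_ms(lambda: converter.convert(text))
    return {"cold_total": cold_total, "post_init_cold": post_init_cold, "warm": warm}


results = benchmark("春眠不觉晓" * 200)
for name, ms in results.items():
    print(f"{name}: {ms:.3f} ms")
```

A single run like this is noisy. Repeating each measurement and reporting a median would give steadier numbers.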

Notes

Despite being implemented in pure Python, opencc_purepy converts the 100,000-character input in about 137 ms on the warm path (roughly 730,000 characters per second in the run above), which it achieves through caching and starter-index optimizations.

The warm conversion path is practical for large-text workloads such as document conversion, GUI applications, and batch processing.
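
For a per-size view, the warm timings in the results table convert to throughput as follows. The figures are copied from that table; `chars_per_second` is just an illustrative helper.

```python
# Warm timings from the results table (input size in chars -> milliseconds).
WARM_MS = {100: 0.171, 1_000: 1.643, 10_000: 13.480, 100_000: 136.870}


def chars_per_second(chars: int, ms: float) -> float:
    """Convert a (size, milliseconds) pair into characters per second."""
    return chars / (ms / 1000.0)


throughput = {n: chars_per_second(n, ms) for n, ms in WARM_MS.items()}
for n, rate in throughput.items():
    print(f"{n:>7,} chars: {rate:,.0f} chars/s")
```

All four sizes land between roughly 0.58 and 0.74 million characters per second, so warm cost grows close to linearly with input length in this run.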


Projects That Use opencc-purepy

OpenccPurepyGui


📄 License

This project is licensed under the MIT License.


Powered by Pure Python and OpenCC Lexicons.
