Pure Python implementation of OpenCC for Chinese text conversion

opencc_purepy

opencc_purepy is a pure Python implementation of OpenCC (Open Chinese Convert), supporting conversion between Simplified, Traditional, Hong Kong, Taiwan, and Japanese Kanji.
It uses dictionary-based segmentation and mapping logic inspired by the original OpenCC.
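To illustrate what dictionary-based segmentation and mapping means in general, here is a minimal sketch of greedy longest-match conversion. It is not the library's actual implementation: the function name `convert_longest_match` and the four-entry `TOY_DICT` are made up for this example, and the real lexicons are far larger.

```python
def convert_longest_match(text, phrase_dict, max_len=None):
    """Greedy longest-match conversion over a phrase dictionary (illustrative only)."""
    if max_len is None:
        # Longest key decides how far ahead we need to look.
        max_len = max((len(key) for key in phrase_dict), default=1)
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink the window.
        for size in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in phrase_dict:
                out.append(phrase_dict[chunk])
                i += size
                break
        else:
            # No dictionary entry starts here: copy the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


# Hypothetical toy dictionary; 发 maps to 髮 or 發 depending on the phrase.
TOY_DICT = {"头发": "頭髮", "发展": "發展", "发": "發", "头": "頭"}

print(convert_longest_match("头发发展", TOY_DICT))  # 頭髮發展
```

Matching whole phrases before single characters is what lets one Simplified character map to different Traditional characters depending on context.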


🚩 Features

  • Pure Python – no native dependencies
  • Multiple Chinese locale conversions (Simplified, Traditional, HK, TW, JP)
  • Punctuation style conversion (optional)
  • Automatic script detection (Simplified/Traditional) via zho_check
  • CLI with Office document support (.docx, .xlsx, .pptx, .odt, .ods, .odp, .epub)

🐍 opencc_purepy requires Python 3.7 or later.


🔁 Supported Conversion Configs

Code    Description
s2t     Simplified → Traditional
t2s     Traditional → Simplified
s2tw    Simplified → Traditional (Taiwan)
tw2s    Traditional (Taiwan) → Simplified
s2twp   Simplified → Traditional (Taiwan) with idioms
tw2sp   Traditional (Taiwan) → Simplified with idioms
s2hk    Simplified → Traditional (Hong Kong)
hk2s    Traditional (Hong Kong) → Simplified
t2tw    Traditional → Traditional (Taiwan)
tw2t    Traditional (Taiwan) → Traditional
t2twp   Traditional → Traditional (Taiwan) with idioms
tw2tp   Traditional (Taiwan) → Traditional with idioms
t2hk    Traditional → Traditional (Hong Kong)
hk2t    Traditional (Hong Kong) → Traditional
t2jp    Japanese Kyujitai → Shinjitai
jp2t    Japanese Shinjitai → Kyujitai

📦 Installation

pip install opencc-purepy

🚀 Usage

Python

from opencc_purepy import OpenCC

text = "“春眠不觉晓,处处闻啼鸟。”"
opencc = OpenCC("s2t")
converted = opencc.convert(text, punctuation=True)
print(converted)  # 「春眠不覺曉,處處聞啼鳥。」

CLI

Text File Conversion

python -m opencc_purepy convert -i input.txt -o output.txt -c s2t -p
# or, if installed as a script:
opencc-purepy convert -i input.txt -o output.txt -c s2t -p

Office Document Conversion (office subcommand)

Supports: .docx, .xlsx, .pptx, .odt, .ods, .odp, .epub

# Convert Word document with font preservation
opencc-purepy office -i example.docx -c t2s --keep-font

# Convert EPUB and auto-detect output name
opencc-purepy office -i book.epub -c s2t --auto-ext

# Convert Excel and specify output path and format
opencc-purepy office -i sheet.xlsx -o result.xlsx -c s2tw --format xlsx

ℹ️ With the office subcommand, the input is processed as an Office or EPUB document and OpenCC conversion is applied internally.
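For batch jobs, the office subcommand can be driven from a Python script. The sketch below only assembles the argument list, using the flags shown in the examples above (`-i`, `-o`, `-c`, `--keep-font`, `--auto-ext`). The helper name `office_command` is hypothetical, and how the CLI behaves for flag combinations not shown above is not covered here.

```python
from typing import List, Optional


def office_command(src: str, config: str, dest: Optional[str] = None,
                   keep_font: bool = False, auto_ext: bool = False) -> List[str]:
    """Build an argument list for `opencc-purepy office` (hypothetical helper)."""
    cmd = ["opencc-purepy", "office", "-i", src, "-c", config]
    if dest is not None:
        cmd += ["-o", dest]          # explicit output path
    if keep_font:
        cmd.append("--keep-font")    # preserve fonts, as in the .docx example
    if auto_ext:
        cmd.append("--auto-ext")     # let the tool pick the output name
    return cmd


print(office_command("example.docx", "t2s", keep_font=True))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`, which raises an exception if the command exits with a non-zero status.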


🧩 API Reference

Exports

  • OpenCC
  • OpenccConfig

OpenCC class

  • OpenCC(config: str | OpenccConfig = "s2t")
    Create a converter with a supported config string or OpenccConfig enum value.
  • set_config(config: str | OpenccConfig) -> None
    Update the active conversion config.
  • get_config() -> str
    Return the current canonical config name.
  • supported_configs() -> list[str]
    Return all supported config names.
  • get_last_error() -> str | None
    Return the last validation or conversion error, if any.
  • convert(input: str, punctuation: bool = False) -> str
    Convert text using the active config, with optional punctuation conversion where supported.
  • s2t(input: str, punctuation: bool = False) -> str
    Simplified Chinese to Traditional Chinese.
  • t2s(input: str, punctuation: bool = False) -> str
    Traditional Chinese to Simplified Chinese.
  • s2tw(input: str, punctuation: bool = False) -> str
    Simplified Chinese to Taiwan Traditional.
  • tw2s(input: str, punctuation: bool = False) -> str
    Taiwan Traditional to Simplified Chinese.
  • s2twp(input: str, punctuation: bool = False) -> str
    Simplified Chinese to Taiwan Traditional with idiom and phrase conversion.
  • tw2sp(input: str, punctuation: bool = False) -> str
    Taiwan Traditional with idioms to Simplified Chinese.
  • s2hk(input: str, punctuation: bool = False) -> str
    Simplified Chinese to Hong Kong Traditional.
  • hk2s(input: str, punctuation: bool = False) -> str
    Hong Kong Traditional to Simplified Chinese.
  • t2tw(input: str) -> str
    Traditional Chinese to Taiwan Traditional.
  • t2twp(input: str) -> str
    Traditional Chinese to Taiwan Traditional with phrase mappings.
  • tw2t(input: str) -> str
    Taiwan Traditional to standard Traditional Chinese.
  • tw2tp(input: str) -> str
    Taiwan Traditional to standard Traditional Chinese with phrase reversal.
  • t2hk(input: str) -> str
    Traditional Chinese to Hong Kong variant.
  • hk2t(input: str) -> str
    Hong Kong Traditional to standard Traditional Chinese.
  • t2jp(input: str) -> str
    Traditional Chinese to Japanese variants.
  • jp2t(input: str) -> str
    Japanese Shinjitai to Traditional Chinese.
  • st(input: str) -> str
    Character-only Simplified to Traditional conversion.
  • ts(input: str) -> str
    Character-only Traditional to Simplified conversion.
  • zho_check(input: str) -> int
    Detect the input text type:
      1 - Traditional, 2 - Simplified, 0 - Others
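The detection codes returned by zho_check can be used to convert only when needed. The sketch below relies on the `zho_check` and `t2s` methods listed above. The helper name `normalize_to_simplified` is hypothetical, and the converter is passed in as an argument so that the function works with any object exposing those two methods.

```python
# Return codes of zho_check, as documented in the reference above.
ZHO_OTHER, ZHO_TRADITIONAL, ZHO_SIMPLIFIED = 0, 1, 2


def normalize_to_simplified(converter, text: str) -> str:
    """Convert text to Simplified Chinese only if it is detected as Traditional.

    `converter` is expected to provide zho_check(text) and t2s(text),
    for example an OpenCC instance.
    """
    if converter.zho_check(text) == ZHO_TRADITIONAL:
        return converter.t2s(text)
    # Already Simplified, or not recognised as Chinese: leave untouched.
    return text
```

With the library installed, this would be called as `normalize_to_simplified(OpenCC(), text)`.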

OpenccConfig enum

  • Members include: S2T, T2S, S2TW, TW2S, S2TWP, TW2SP, S2HK, HK2S, T2TW, TW2T, T2TWP, TW2TP, T2HK, HK2T, T2JP, JP2T
  • to_canonical_name() -> str
    Return the lowercase OpenCC config string.
  • parse(value: str) -> OpenccConfig
    Parse a config string into an enum value.

⚡ Benchmark

Measured on a local machine using the default "s2t" configuration.
Each test averaged over 20 runs with the shared dictionary cache reused across runs.

Input Size (chars)    Avg. Time (ms)
100                   0.15
1,000                 0.93
10,000                8.76
100,000               86.05

Each timing also includes per-run OpenCC instance setup; results depend on local hardware and background system load.
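Comparable numbers can be collected with a small timing harness. The sketch below is an assumption about how such a measurement might be taken, not the project's own benchmark script. The name `average_ms` is hypothetical, and `str.upper` is used only as a stand-in converter so the block runs without the library.

```python
import time
from typing import Callable


def average_ms(convert: Callable[[str], str], text: str, runs: int = 20) -> float:
    """Average wall-clock time of convert(text) over `runs` calls, in milliseconds."""
    total = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        convert(text)
        total += time.perf_counter() - start
    return total / runs * 1000.0


# Stand-in workload; replace str.upper with a real conversion call.
for size in (100, 1_000, 10_000, 100_000):
    sample = "a" * size
    print(f"{size:>7} chars: {average_ms(str.upper, sample):.3f} ms")
```

To time the library itself, pass a callable such as `lambda s: OpenCC("s2t").convert(s)`, which creates a new instance on every run and therefore includes instance setup in each timing.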


Projects That Use opencc-purepy

OpenccPurepyGui


📄 License

This project is licensed under the MIT License.


Powered by Pure Python and OpenCC Lexicons.
