Conversion between Traditional and Simplified Chinese (pure Python)

opencc-py (OpenCC Pure Python Implementation)

This directory contains a pure Python implementation of the OpenCC Chinese conversion algorithm. It exposes the same import surface as the official opencc Python package:

import opencc

converter = opencc.OpenCC("s2t")
print(converter.convert("汉字"))  # 漢字

Data Dependency

The package does not bundle OpenCC configs or dictionaries directly. Built-in conversion data is loaded from the opencc-data PyPI package at runtime.

This keeps the pure Python package small and avoids depending on generated files under the OpenCC source tree. The converter reads:

  • config JSON files from opencc_data.config_path()
  • dictionary text files from opencc_data.data_path()
  • test cases from opencc_data.test_data_path()

Custom config files are still supported. When a custom config references a local dictionary path such as CustomPhrases.ocd2, the pure Python implementation looks for the corresponding CustomPhrases.txt next to the config file.
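The lookup rule above can be sketched as follows. This is a minimal illustration, not the package's actual code: resolve_text_dictionary is a hypothetical helper name, and it assumes that only the file name of the reference matters (any directory part is ignored) and that only the .ocd2 suffix is rewritten.

```python
from pathlib import Path


def resolve_text_dictionary(config_file, dict_ref):
    """Map a dictionary reference from a custom config to a file beside the config.

    Hypothetical helper for illustration only; not part of the opencc-py API.
    """
    ref = Path(dict_ref)
    if ref.suffix == ".ocd2":
        # A binary dictionary reference is swapped for its plain-text counterpart.
        ref = ref.with_suffix(".txt")
    # Assumption: the file is looked up directly next to the config file.
    return Path(config_file).parent / ref.name


print(resolve_text_dictionary("configs/custom.json", "CustomPhrases.ocd2").as_posix())
```

With these assumptions, a config at configs/custom.json that references CustomPhrases.ocd2 resolves to configs/CustomPhrases.txt.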

Installation

The PyPI package name is opencc-py. Users can install it with pip:

python -m pip install opencc-py

For local development from this directory:

python -m pip install .

Or use editable development mode:

python -m pip install -e .

The package version matches its opencc-data version, and the package declares that exact opencc-data version as an install dependency, so pip installs the compatible data package automatically.
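To confirm that the installed package and its data package carry the same version, a small check with the standard library's importlib.metadata may help. This is a sketch only: installed_version and versions_match are hypothetical helper names, and the distribution names opencc-py and opencc-data are taken from this README.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def versions_match(package="opencc-py", data_package="opencc-data"):
    """True only when both distributions are installed with identical versions."""
    package_version = installed_version(package)
    data_version = installed_version(data_package)
    return package_version is not None and package_version == data_version


print(installed_version("opencc-py"), installed_version("opencc-data"))
print(versions_match())
```

If either distribution is missing, versions_match() returns False rather than raising.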

Supported Configs

opencc.CONFIGS is populated from the configs exposed by opencc-data.

import opencc

print(opencc.CONFIGS)

The standard mmseg configs and configs that do not require segmentation are supported. Jieba plugin configs are not included in opencc-data, so they are not exposed as built-in configs by this package.

Testing

Install test dependencies, then run pytest from the repository root:

python -m pip install -r python-pure/tests/requirements_lock.txt
PYTHONPATH=python-pure python -m pytest python-pure/tests

The tests verify:

  • importing and initializing every built-in config
  • conversion against opencc-data test cases
  • custom config and local dictionary resolution
  • golden output compatibility for supported configs

OpenCC 1.3.2 Feature Coverage

The following OpenCC 1.3.2 features are fully supported:

  • CJK Compatibility Ideographs normalization — all built-in configs include a pre-processing normalization step that maps U+F900–U+FAFF characters to their canonical code points before conversion.
  • match_policy: union — dictionary groups with "match_policy": "union" return the globally longest match across all sub-dictionaries.
  • normalization config field — custom configs may add a normalization array to apply conversion steps before segmentation.
  • New configs: s2hkp and hk2sp (Simplified ↔ Hong Kong, with phrase conversion) are available through opencc-data.
  • Tofu-risk dictionary suppression — pass include_tofu_risk_dictionaries=False to OpenCC() to exclude dictionaries that may produce characters absent from modern CJK fonts.
  • JSONC — config files may use // line comments and /* */ block comments; the pure Python backend strips them before JSON parsing.
  • Inline dictionaries: {"type": "inline", "entries": {"key": "value", ...}} dict nodes are supported in custom configs.
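As an illustration of the JSONC item, the sketch below strips // and /* */ comments while leaving string literals intact, then parses an inline dictionary node of the shape shown above. It is one possible approach, not necessarily how the pure Python backend does it: strip_jsonc_comments is a hypothetical name and the sample entries are made up.

```python
import json


def strip_jsonc_comments(text):
    """Remove // line comments and /* */ block comments outside string literals."""
    out = []
    i, n = 0, len(text)
    in_string = False
    while i < n:
        ch = text[i]
        if in_string:
            out.append(ch)
            if ch == "\\" and i + 1 < n:
                # Copy the escaped character so an escaped quote does not end the string.
                out.append(text[i + 1])
                i += 2
                continue
            if ch == '"':
                in_string = False
            i += 1
        elif ch == '"':
            in_string = True
            out.append(ch)
            i += 1
        elif text.startswith("//", i):
            # Line comment: skip up to (not including) the newline.
            while i < n and text[i] != "\n":
                i += 1
        elif text.startswith("/*", i):
            # Block comment: skip past the closing */ (or to the end if unterminated).
            end = text.find("*/", i + 2)
            i = n if end == -1 else end + 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


# Made-up inline dictionary node; the "a//b" key checks that strings are left alone.
sample = """
{
  // an inline dictionary node
  "type": "inline",
  "entries": {"汉字": "漢字", "a//b": "c"} /* trailing note */
}
"""
node = json.loads(strip_jsonc_comments(sample))
print(node["entries"])
```

Tracking whether the scanner is inside a string matters here: a naive regular expression would also delete the // inside the "a//b" key.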

Differences from the Official Implementation

This package intentionally implements only the pieces needed for pure Python text conversion. The official Python implementation is the opencc PyPI package. Compared with the official C++ library and command-line tools, this package omits the following lower-level pieces:

  • binary dictionary loading for .ocd2/.ocd; built-in dictionaries are read from .txt data supplied by opencc-data
  • dictionary compilation and extraction tools such as opencc_dict and opencc_phrase_extract
  • the C API, shared-library loading behavior, and ABI/plugin compatibility guarantees
  • native CLI behavior, including streaming I/O, command-line option parity, and platform-specific path handling
  • package, runfiles, and source-tree data discovery fallbacks; built-in data comes from opencc-data
  • automatic loading of optional plugin configs or plugin resources, including the Jieba plugin package layout
  • performance optimizations from marisa-trie, Darts, and the C++ segmentation implementation

The conversion semantics still mirror OpenCC's config-driven pipeline: mmseg segmentation, ordered dictionary groups, longest-prefix matching within a dictionary, conversion chains, normalization, and optional suppression of tofu-risk dictionaries.
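The difference between an ordered dictionary group and a "match_policy": "union" group can be sketched with plain Python dicts. This is a simplified model under stated assumptions, not the package's implementation: all function names are hypothetical, the toy entries are made up for the example, the default policy is assumed to be "first dictionary with a match wins", ties under union are assumed to go to the earlier dictionary, and mmseg segmentation is left out entirely.

```python
def match_prefix(dictionary, text):
    """Return the longest key of `dictionary` that is a prefix of `text`, or None."""
    longest_key = max(map(len, dictionary), default=0)
    for length in range(min(len(text), longest_key), 0, -1):
        if text[:length] in dictionary:
            return text[:length]
    return None


def match_group(dictionaries, text, policy="first"):
    """Match `text` against an ordered group of dictionaries.

    policy="first": the first dictionary with any match wins (assumed default).
    policy="union": the longest match across all dictionaries wins.
    """
    best = None
    for dictionary in dictionaries:
        key = match_prefix(dictionary, text)
        if key is None:
            continue
        if policy == "first":
            return key, dictionary[key]
        if best is None or len(key) > len(best[0]):
            best = (key, dictionary[key])
    return best


def convert(dictionaries, text, policy="first"):
    """Greedy left-to-right conversion; unmatched characters pass through."""
    out, i = [], 0
    while i < len(text):
        hit = match_group(dictionaries, text[i:], policy)
        if hit is None:
            out.append(text[i])
            i += 1
        else:
            key, value = hit
            out.append(value)
            i += len(key)
    return "".join(out)


# Toy data: a character dictionary listed before a phrase dictionary.
group = [{"头": "頭", "发": "發"}, {"头发": "頭髮"}]
print(convert(group, "头发", policy="first"))
print(convert(group, "头发", policy="union"))
```

In this toy group the character dictionary comes first, so the "first" policy converts character by character, while "union" picks the longer phrase entry from the second dictionary.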

License and Compliance

This package is distributed under the Apache License 2.0.

This project is a derivative work of OpenCC. Runtime conversion data is provided by the opencc-data PyPI package.

