Pure Python implementation of OpenCC for Chinese text conversion
opencc_purepy
opencc_purepy is a pure Python implementation of OpenCC (Open Chinese Convert), supporting conversion between Simplified, Traditional, Hong Kong, Taiwan, and Japanese Kanji.
It uses dictionary-based segmentation and mapping logic inspired by the original OpenCC.
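To make "dictionary-based segmentation and mapping" concrete, here is a minimal sketch of greedy forward longest-match conversion. It is an illustration of the general technique only, not the library's actual code; the function name `convert_longest_match` and the toy dictionary `TOY_S2T` are assumptions made up for this example.

```python
def convert_longest_match(text, phrase_map, max_len=None):
    """Convert text by greedy forward longest-match lookup in phrase_map."""
    if max_len is None:
        max_len = max((len(key) for key in phrase_map), default=1)
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink the window.
        for size in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in phrase_map:
                out.append(phrase_map[chunk])
                i += size
                break
        else:
            # No dictionary entry starts here: copy the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


# Toy dictionary for illustration only; the real lexicons are far larger.
TOY_S2T = {
    "头发": "頭髮",
    "发现": "發現",
    "头": "頭",
    "发": "發",
}

print(convert_longest_match("头发", TOY_S2T))
print(convert_longest_match("发现头发", TOY_S2T))
```

Matching the longest phrase first is what lets a phrase entry such as 头发 take priority over the single-character entry for 发.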
🚩 Features
- Pure Python – no native dependencies
- Multiple Chinese locale conversions (Simplified, Traditional, HK, TW, JP)
- Punctuation style conversion (optional)
- Automatic code detection (Simplified/Traditional)
- CLI with Office document support (.docx, .xlsx, .pptx, .odt, .ods, .odp, .epub)
🐍 opencc_purepy requires Python 3.7 or later.
🔁 Supported Conversion Configs
| Code | Description |
|---|---|
| s2t | Simplified → Traditional |
| t2s | Traditional → Simplified |
| s2tw | Simplified → Traditional (Taiwan) |
| tw2s | Traditional (Taiwan) → Simplified |
| s2twp | Simplified → Traditional (Taiwan) with idioms |
| tw2sp | Traditional (Taiwan) → Simplified with idioms |
| s2hk | Simplified → Traditional (Hong Kong) |
| hk2s | Traditional (Hong Kong) → Simplified |
| t2tw | Traditional → Traditional (Taiwan) |
| tw2t | Traditional (Taiwan) → Traditional |
| t2twp | Traditional → Traditional (Taiwan) with idioms |
| tw2tp | Traditional (Taiwan) → Traditional with idioms |
| t2hk | Traditional → Traditional (Hong Kong) |
| hk2t | Traditional (Hong Kong) → Traditional |
| t2jp | Japanese Kyujitai → Shinjitai |
| jp2t | Japanese Shinjitai → Kyujitai |
📦 Installation
pip install opencc-purepy
🚀 Usage
Python
from opencc_purepy import OpenCC
text = "“春眠不觉晓,处处闻啼鸟。”"
opencc = OpenCC("s2t")
converted = opencc.convert(text, punctuation=True)
print(converted) # 「春眠不覺曉,處處聞啼鳥。」
CLI
Text File Conversion
python -m opencc_purepy convert -i input.txt -o output.txt -c s2t -p
# or, if installed as a script:
opencc-purepy convert -i input.txt -o output.txt -c s2t -p
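For converting several text files in one go, the command line above can be assembled from Python. The sketch below only builds the argument lists, using the `-i`, `-o`, `-c`, and `-p` flags shown above; the helper names `build_convert_command` and `plan_batch` and the file names are hypothetical.

```python
from pathlib import Path


def build_convert_command(src, dst, config="s2t", punctuation=False):
    """Return the argv list for one `opencc-purepy convert` call."""
    argv = ["opencc-purepy", "convert", "-i", str(src), "-o", str(dst), "-c", config]
    if punctuation:
        argv.append("-p")
    return argv


def plan_batch(paths, out_dir, config="s2t", punctuation=False):
    """Pair each input file with an output path of the same name under out_dir."""
    out_dir = Path(out_dir)
    return [
        build_convert_command(p, out_dir / Path(p).name, config, punctuation)
        for p in paths
    ]


for argv in plan_batch(["a.txt", "b.txt"], "converted", punctuation=True):
    # Only printed here; to execute, pass argv to subprocess.run(argv, check=True).
    print(" ".join(argv))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with file names that contain spaces.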
Office Document Conversion subcommand (office)
Supports: .docx, .xlsx, .pptx, .odt, .ods, .odp, .epub
# Convert Word document with font preservation
opencc-purepy office -i example.docx -c t2s --keep-font
# Convert EPUB and auto-detect output name
opencc-purepy office -i book.epub -c s2t --auto-ext
# Convert Excel and specify output path and format
opencc-purepy office -i sheet.xlsx -o result.xlsx -c s2tw --format xlsx
ℹ️ With the office subcommand, the input is processed as an Office or EPUB document and OpenCC conversion is applied internally.
📚 Custom Dictionaries
opencc_purepy follows the OpenCC lexicon structure. Custom entries are loaded through existing OpenCC dictionary
slots, such as DictSlot.STPhrases, DictSlot.TSPhrases, DictSlot.STPunctuations, and other OpenCC slots.
There is no generic UserDict slot.
Dictionary slot mappings support both:
- DictSlot (recommended)
- string slot names such as "st_phrases" (backward compatible)
Recommended: load-time append mode
Use appends={...} to load the built-in dictionaries first, then the custom entries. When a key appears in both, the
entry loaded later wins, so custom entries override built-in entries. This mode is recommended for most users.
from opencc_purepy import DictSlot, OpenCC
cc = OpenCC.from_dicts(
    config="s2t",
    appends={
        DictSlot.STPhrases: "./UserDict.txt",
    },
)
print(cc.convert("帕兰蒂尔是一家公司"))
String slot names remain supported for compatibility:
from opencc_purepy import OpenCC
cc = OpenCC.from_dicts(
    config="s2t",
    appends={
        "st_phrases": "./UserDict.txt",
    },
)
The same appends={...} and overrides={...} arguments are also supported by DictionaryMaxlength.from_dicts() when
you want to create and reuse a dictionary instance yourself.
Post-load file customization
Use DictionaryMaxlength.with_custom_dict_files() when you already have a dictionary instance and want to apply
OpenCC-compatible text dictionary files after loading it. Post-load customization supports both appends={...} and
overrides={...}.
from opencc_purepy import DictSlot, OpenCC
from opencc_purepy.dictionary_lib import DictionaryMaxlength
dictionary = DictionaryMaxlength.from_json().with_custom_dict_files(
    appends={
        DictSlot.STPhrases: "./UserDict.txt",
    },
)
cc = OpenCC(config="s2t", dictionary=dictionary)
print(cc.convert("帕兰蒂尔是一家公司"))
Create a private dictionary instance first with DictionaryMaxlength.from_json() or DictionaryMaxlength.from_dicts().
Do not mutate the shared global provider returned by DictionaryMaxlength.get_provider() or
DictionaryMaxlength.new(); the post-load customization APIs are intended for private dictionary instances.
Tofu-risk / Extension Unicode fallback pairs
Use DictionaryMaxlength.with_custom_dicts() for exact in-memory custom pairs when you need to patch
tofu-risk characters or Extension Unicode mappings without restructuring the built-in OpenCC dictionaries.
This is useful for platforms where some CJK Extension characters may render as tofu boxes, or where you want to provide a temporary project-local fallback before the upstream dictionary data is updated.
from opencc_purepy import DictSlot, OpenCC
from opencc_purepy.dictionary_lib import DictionaryMaxlength
dictionary = DictionaryMaxlength.from_json().with_custom_dicts(
    appends={
        DictSlot.STPhrases: {
            # Project-local fallback pairs for tofu-risk / Extension Unicode cases.
            # Keep these patches small, explicit, and easy to remove later.
            "骖𬴂": "驂騑",
            "𫜩合": "齧合",
            "𫜩蘗吞针": "齧蘗吞針",
            # Normal custom phrase pairs may be mixed in as well.
            "帕兰蒂尔": "帕蘭蒂爾",
        },
    },
)
cc = OpenCC(config="s2t", dictionary=dictionary)
print(cc.convert("骖𬴂"))
print(cc.convert("𫜩合"))
print(cc.convert("帕兰蒂尔"))
This keeps the core dictionary structure unchanged while still allowing applications to patch specific high-risk entries at load time.
Dictionary text format
Custom dictionary files are UTF-8 text files in OpenCC lexicon format. Use one mapping per line:
# Custom company terms
帕兰蒂尔 帕蘭蒂爾
AI模型 AI模型
Each entry is parsed as key<TAB>value, or as a key and a value separated by whitespace. Blank lines are ignored, lines
starting with # are treated as comments, and when a key is repeated the later entry wins.
Because file parsing follows OpenCC dictionary rules, leading spaces and embedded spaces in keys are not preserved. Use
with_custom_dicts() when the custom key itself contains spaces.
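The parsing rules above can be pictured with a small stand-alone parser. This is a sketch of the described rules, not the library's loader: the function name `parse_lexicon_lines` is hypothetical, skipping lines without a value is an assumption, and the duplicate entry in the sample is invented to show the later entry winning.

```python
def parse_lexicon_lines(lines):
    """Parse OpenCC-style lexicon lines into a dict (later entries win)."""
    mapping = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split(None, 1)  # split on the first run of whitespace
        if len(parts) != 2:
            continue  # assumption: a line without a value is ignored
        key, value = parts
        mapping[key] = value.strip()
    return mapping


sample = [
    "# Custom company terms",
    "帕兰蒂尔\t帕蘭蒂爾",
    "",
    "AI模型 AI模型",
    "帕兰蒂尔\t帕蘭蒂爾公司",  # invented duplicate key: this later entry wins
]
print(parse_lexicon_lines(sample))

# A key with an embedded space is split at that space, so it is not preserved.
print(parse_lexicon_lines(["New York\t紐約"]))
```

The second call shows why keys containing spaces need the in-memory API instead: the key is cut at its first space and the remainder ends up in the value.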
Override mode
Use overrides={...} to replace an entire dictionary slot. This is for advanced users who maintain a full replacement
for a selected OpenCC dictionary slot.
from opencc_purepy import DictSlot, OpenCC
cc = OpenCC.from_dicts(
    config="s2t",
    overrides={
        DictSlot.STPhrases: "./company/STPhrases.txt",
    },
)
Post-load override mode works the same way:
from opencc_purepy import DictSlot
from opencc_purepy.dictionary_lib import DictionaryMaxlength
dictionary = DictionaryMaxlength.from_json().with_custom_dict_files(
    overrides={
        DictSlot.STPhrases: "./CompanyOnlySTPhrases.txt",
    },
)
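The difference between append mode and override mode can be pictured with plain dicts standing in for one dictionary slot. This is a conceptual sketch only; `customize_slot` and the placeholder entries are hypothetical and do not reflect the library's internal data structures.

```python
def customize_slot(builtin, custom, mode):
    """Return a new mapping for one dictionary slot."""
    if mode == "append":
        merged = dict(builtin)   # built-in entries first
        merged.update(custom)    # custom entries are loaded later, so they win
        return merged
    if mode == "override":
        return dict(custom)      # built-in entries are discarded entirely
    raise ValueError(f"unknown mode: {mode!r}")


builtin = {"alpha": "A", "beta": "B"}
custom = {"beta": "B2", "gamma": "C"}

print(customize_slot(builtin, custom, "append"))
print(customize_slot(builtin, custom, "override"))
```

In append mode the built-in entry for "alpha" survives and only "beta" is replaced; in override mode "alpha" is gone, which is why a full replacement file is needed there.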
Supported slots
| DictSlot | Legacy key | Default file |
|---|---|---|
| DictSlot.STCharacters | st_characters | STCharacters.txt |
| DictSlot.STPhrases | st_phrases | STPhrases.txt |
| DictSlot.STPunctuations | st_punctuations | STPunctuations.txt |
| DictSlot.TSCharacters | ts_characters | TSCharacters.txt |
| DictSlot.TSPhrases | ts_phrases | TSPhrases.txt |
| DictSlot.TSPunctuations | ts_punctuations | TSPunctuations.txt |
| DictSlot.TWPhrases | tw_phrases | TWPhrases.txt |
| DictSlot.TWPhrasesRev | tw_phrases_rev | TWPhrasesRev.txt |
| DictSlot.TWVariants | tw_variants | TWVariants.txt |
| DictSlot.TWVariantsRev | tw_variants_rev | TWVariantsRev.txt |
| DictSlot.TWVariantsRevPhrases | tw_variants_rev_phrases | TWVariantsRevPhrases.txt |
| DictSlot.HKVariants | hk_variants | HKVariants.txt |
| DictSlot.HKVariantsRev | hk_variants_rev | HKVariantsRev.txt |
| DictSlot.HKVariantsRevPhrases | hk_variants_rev_phrases | HKVariantsRevPhrases.txt |
| DictSlot.JPSCharacters | jps_characters | JPShinjitaiCharacters.txt |
| DictSlot.JPSPhrases | jps_phrases | JPShinjitaiPhrases.txt |
| DictSlot.JPVariants | jp_variants | JPVariants.txt |
| DictSlot.JPVariantsRev | jp_variants_rev | JPVariantsRev.txt |
Generate JSON with dictgen
TXT dictionaries are human-editable source files. dictionary_maxlength.json is a generated/cache format, so prefer
dictgen instead of manually editing JSON.
opencc-purepy dictgen -d ./my_dicts -o dictionary_maxlength.json
from opencc_purepy import OpenCC
from opencc_purepy.dictionary_lib import DictionaryMaxlength
dictionary = DictionaryMaxlength.from_json("./dictionary_maxlength.json")
cc = OpenCC(
    config="s2t",
    dictionary=dictionary,
)
Which mode should I use?
- Use appends for a few user or company terms.
- Use overrides when maintaining a full proprietary replacement of an OpenCC dictionary file.
- Use with_custom_dict_files() to apply OpenCC-compatible text files to a private dictionary after loading it.
- Use with_custom_dicts() for exact in-memory pairs, especially keys with leading or embedded spaces.
- Use dictgen when you want to bake TXT dictionaries into JSON for reuse or faster loading.
- Use direct dictionary injection when sharing one loaded dictionary across many OpenCC instances.
- Prefer DictSlot for new code and IDE-friendly type checking.
- Legacy str slot keys remain fully supported for backward compatibility.
🧩 API Reference
Exports
- OpenCC
- OpenccConfig
OpenCC class
- OpenCC(config: str | OpenccConfig = "s2t")
  Create a converter with a supported config string or OpenccConfig enum value. Raises ValueError for unsupported configs.
- set_config(config: str | OpenccConfig) -> None
  Update the active conversion config. Raises ValueError for unsupported configs.
- get_config() -> str
  Return the current canonical config name.
- supported_configs() -> list[str]
  Return all supported config names.
- get_last_error() -> str | None
  Return the last validation or conversion error, if any.
- convert(input: str, punctuation: bool = False) -> str
  Convert text using the active config, with optional punctuation conversion.
- s2t(input: str, punctuation: bool = False) -> str
  Simplified Chinese to Traditional Chinese.
- t2s(input: str, punctuation: bool = False) -> str
  Traditional Chinese to Simplified Chinese.
- s2tw(input: str, punctuation: bool = False) -> str
  Simplified Chinese to Taiwan Traditional.
- tw2s(input: str, punctuation: bool = False) -> str
  Taiwan Traditional to Simplified Chinese.
- s2twp(input: str, punctuation: bool = False) -> str
  Simplified Chinese to Taiwan Traditional with idiom and phrase conversion.
- tw2sp(input: str, punctuation: bool = False) -> str
  Taiwan Traditional with idioms to Simplified Chinese.
- s2hk(input: str, punctuation: bool = False) -> str
  Simplified Chinese to Hong Kong Traditional.
- hk2s(input: str, punctuation: bool = False) -> str
  Hong Kong Traditional to Simplified Chinese.
- t2tw(input: str, punctuation: bool = False) -> str
  Traditional Chinese to Taiwan Traditional.
- t2twp(input: str, punctuation: bool = False) -> str
  Traditional Chinese to Taiwan Traditional with phrase mappings.
- tw2t(input: str, punctuation: bool = False) -> str
  Taiwan Traditional to standard Traditional Chinese.
- tw2tp(input: str, punctuation: bool = False) -> str
  Taiwan Traditional to standard Traditional Chinese with phrase reversal.
- t2hk(input: str, punctuation: bool = False) -> str
  Traditional Chinese to Hong Kong variant.
- hk2t(input: str, punctuation: bool = False) -> str
  Hong Kong Traditional to standard Traditional Chinese.
- t2jp(input: str, punctuation: bool = False) -> str
  Traditional Chinese to Japanese variants.
- jp2t(input: str, punctuation: bool = False) -> str
  Japanese Shinjitai to Traditional Chinese.
- st(input: str) -> str
  Character-only Simplified to Traditional conversion.
- ts(input: str) -> str
  Character-only Traditional to Simplified conversion.
- zho_check(input: str) -> int
  Detect the input text type: 1 - Traditional, 2 - Simplified, 0 - Others
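The integer returned by zho_check() lends itself to a small dispatch helper. The sketch below works on the return code alone, so it runs without the library; `ZHO_LABELS` and `suggest_config` are hypothetical names, and picking the opposite-script config is just one possible policy.

```python
# Return codes as documented for zho_check().
ZHO_LABELS = {1: "Traditional", 2: "Simplified", 0: "Others"}


def suggest_config(code):
    """Map a zho_check() result to a config that converts to the other script."""
    if code == 1:
        return "t2s"   # Traditional input: convert to Simplified
    if code == 2:
        return "s2t"   # Simplified input: convert to Traditional
    return None        # not recognised as Chinese text: leave it alone


for code in (1, 2, 0):
    print(code, ZHO_LABELS[code], suggest_config(code))
```

In real use, `code` would come from calling zho_check() on an OpenCC instance, and a non-None result could then be passed to set_config() before calling convert().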
OpenccConfig enum
- Members include: S2T, T2S, S2TW, TW2S, S2TWP, TW2SP, S2HK, HK2S, T2TW, TW2T, T2TWP, TW2TP, T2HK, HK2T, T2JP, JP2T
- to_canonical_name() -> str
  Return the lowercase OpenCC config string.
- parse(value: str) -> OpenccConfig
  Parse a config string into an enum value.
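To show how an enum with these two methods behaves, here is a stand-in written with the standard library. It is not the library's source: `ConfigSketch` is a hypothetical class with only a few of the members, and both the case-insensitive parsing and the ValueError on unknown input are assumptions.

```python
from enum import Enum


class ConfigSketch(Enum):
    """Illustrative stand-in for OpenccConfig with a subset of its members."""

    S2T = "s2t"
    T2S = "t2s"
    S2TWP = "s2twp"
    JP2T = "jp2t"

    def to_canonical_name(self):
        # The canonical name is the lowercase OpenCC config string.
        return self.value

    @classmethod
    def parse(cls, value):
        # Assumption: surrounding whitespace and letter case are ignored.
        try:
            return cls(value.strip().lower())
        except ValueError:
            raise ValueError(f"unsupported config: {value!r}") from None


member = ConfigSketch.parse("S2TWP")
print(member, member.to_canonical_name())
```

Using enum members instead of raw strings lets an IDE or type checker catch a misspelled config name before the code runs.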
🛠 Development
- Python bindings: opencc_purepy/__init__.py, opencc_purepy/core.py
- CLI: opencc_purepy/__main__.py
⚡ Benchmark
Measured on GitHub Actions ubuntu-latest using the default s2t configuration.
Benchmark separates cold startup, first post-init conversion, and warm cached conversion.
Runner Platform
| Field | Value |
|---|---|
| Runner | Linux X64 |
| Image | ubuntu24 20260513.135.3 |
| Kernel | Linux runnervmrw5os 6.17.0-1013-azure #13~24.04.1-Ubuntu SMP Wed Apr 15 16:52:17 UTC 2026 x86_64 x86_64 x86_64 GNU/Linux |
| CPU | AMD EPYC 9V74 80-Core Processor |
| CPU Cores | 4 |
| Memory | |
| Python | Python 3.10.20 |
Results
opencc-purepy v1.3.0
| Input Size | Cold Total (ms) | Post-init Cold (ms) | Warm (ms) |
|---|---|---|---|
| 100 chars | 21.849 ms | 19.419 ms | 0.171 ms |
| 1,000 chars | 21.572 ms | 21.409 ms | 1.643 ms |
| 10,000 chars | 34.063 ms | 32.584 ms | 13.480 ms |
| 100,000 chars | 158.355 ms | 156.202 ms | 136.870 ms |
cold_total includes OpenCC(config) setup plus conversion. post_init_cold measures the first conversion after
initialization. warm measures conversion after the union cache has already been built. Results depend on runner
hardware and background system load.
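The three measurements can be reproduced in spirit with a small timing harness. This is a sketch under stated assumptions, not the project's benchmark script: `StubConverter` is a made-up stand-in that builds a cache on first use, and `measure` assumes each new converter instance starts without a warm cache, which may not hold if a library shares caches between instances.

```python
import time


class StubConverter:
    """Stand-in converter that builds a lookup cache on first use."""

    def __init__(self):
        self._cache = None

    def convert(self, text):
        if self._cache is None:
            # Simulated one-time cache build paid by the first conversion.
            self._cache = {chr(c): chr(c) for c in range(0x4E00, 0x4E00 + 5000)}
        return "".join(self._cache.get(ch, ch) for ch in text)


def measure(factory, text):
    """Return cold_total, post_init_cold and warm timings in milliseconds."""
    t0 = time.perf_counter()
    factory().convert(text)              # setup plus first conversion
    cold_total = time.perf_counter() - t0

    converter = factory()                # setup is not timed here
    t1 = time.perf_counter()
    converter.convert(text)              # first conversion after initialization
    post_init_cold = time.perf_counter() - t1

    t2 = time.perf_counter()
    converter.convert(text)              # cache has already been built
    warm = time.perf_counter() - t2

    return {
        "cold_total": cold_total * 1000,
        "post_init_cold": post_init_cold * 1000,
        "warm": warm * 1000,
    }


print(measure(StubConverter, "一二三" * 1000))
```

To time the real converter instead of the stub, a factory such as `lambda: OpenCC("s2t")` could be passed in; single-shot timings are noisy, so repeating the run and taking a median is advisable.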
Notes
Despite being implemented in pure Python, opencc_purepy achieves competitive conversion throughput through aggressive caching and starter-index optimizations.
The warm conversion path is practical for large-text workloads such as document conversion, GUI applications, and batch processing.
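The warm timings in the results table can be turned into a throughput figure with simple arithmetic: characters divided by seconds. The snippet below does this for the four reported rows, which works out to roughly 0.58 to 0.74 million characters per second on that runner; the names `WARM_RESULTS` and `chars_per_second` are introduced here for illustration.

```python
# Warm timings copied from the results table: (input characters, milliseconds).
WARM_RESULTS = [
    (100, 0.171),
    (1_000, 1.643),
    (10_000, 13.480),
    (100_000, 136.870),
]


def chars_per_second(chars, millis):
    """Convert a (characters, milliseconds) measurement to characters per second."""
    return chars / (millis / 1000.0)


for chars, millis in WARM_RESULTS:
    print(f"{chars:>7,} chars: {chars_per_second(chars, millis):,.0f} chars/s")
```

These figures describe one GitHub Actions run only and should not be read as a guarantee for other hardware.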
Projects That Use opencc-purepy
📄 License
This project is licensed under the MIT License.
Powered by Pure Python and OpenCC Lexicons.