Pure Python implementation of OpenCC for Chinese text conversion

opencc_purepy

opencc_purepy is a pure Python implementation of OpenCC (Open Chinese Convert). It converts between Simplified Chinese, Traditional Chinese, the Hong Kong and Taiwan Traditional variants, and Japanese Kanji.
It uses dictionary-based segmentation and mapping logic inspired by the original OpenCC.
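To make "dictionary-based segmentation and mapping" concrete, here is a minimal sketch of greedy forward maximum matching over a toy dictionary. The function name `convert_longest_match` and the four toy entries are assumptions for illustration; this is not the library's actual implementation or data.

```python
from typing import Dict


def convert_longest_match(text: str, mapping: Dict[str, str]) -> str:
    """Greedy forward maximum matching over a phrase/character dictionary."""
    if not mapping:
        return text
    max_len = max(len(key) for key in mapping)
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink the window.
        for size in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in mapping:
                out.append(mapping[chunk])
                i += size
                break
        else:
            # No entry matched: copy the character through unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


# Toy dictionary (hypothetical), not the real OpenCC data.
toy = {"头发": "頭髮", "发": "發", "头": "頭", "开": "開"}
print(convert_longest_match("头发开发", toy))  # 頭髮開發
```

Trying phrases before single characters is what lets 发 map to 髮 inside 头发 but to 發 elsewhere.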

🚩 Features

Pure Python – no native dependencies
Multiple Chinese locale conversions (Simplified, Traditional, HK, TW, JP)
Punctuation style conversion (optional)
Automatic text variant detection (Simplified/Traditional)
CLI with Office document support (.docx, .xlsx, .pptx, .odt, .ods, .odp, .epub)

🐍 opencc_purepy requires Python 3.7 or later.

🔁 Supported Conversion Configs

Code	Description
`s2t`	Simplified → Traditional
`t2s`	Traditional → Simplified
`s2tw`	Simplified → Traditional (Taiwan)
`tw2s`	Traditional (Taiwan) → Simplified
`s2twp`	Simplified → Traditional (Taiwan) with idioms
`tw2sp`	Traditional (Taiwan) → Simplified with idioms
`s2hk`	Simplified → Traditional (Hong Kong)
`hk2s`	Traditional (Hong Kong) → Simplified
`t2tw`	Traditional → Traditional (Taiwan)
`tw2t`	Traditional (Taiwan) → Traditional
`t2twp`	Traditional → Traditional (Taiwan) with idioms
`tw2tp`	Traditional (Taiwan) → Traditional with idioms
`t2hk`	Traditional → Traditional (Hong Kong)
`hk2t`	Traditional (Hong Kong) → Traditional
`t2jp`	Japanese Kyujitai → Shinjitai
`jp2t`	Japanese Shinjitai → Kyujitai

📦 Installation

pip install opencc-purepy

🚀 Usage

Python

from opencc_purepy import OpenCC

text = "“春眠不觉晓，处处闻啼鸟。”"
opencc = OpenCC("s2t")
converted = opencc.convert(text, punctuation=True)
print(converted)  # 「春眠不覺曉，處處聞啼鳥。」
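The effect of `punctuation=True` on quotation marks can be pictured as a character-level translation table. The sketch below is a stand-alone illustration, not the library's code: only the “ ” → 「 」 pair is confirmed by the example above, and the ‘ ’ → 『 』 pair is an assumption.

```python
# Only the “ ” -> 「 」 pair is confirmed by the example above;
# the ‘ ’ -> 『 』 pair is an assumption added for illustration.
PUNCT_S2T = str.maketrans({"“": "「", "”": "」", "‘": "『", "’": "』"})


def convert_punctuation(text: str) -> str:
    """Map Simplified-style quotation marks to Traditional-style brackets."""
    return text.translate(PUNCT_S2T)


# The Chinese characters are left alone; only the quotation marks change.
print(convert_punctuation("“春眠不觉晓，处处闻啼鸟。”"))  # 「春眠不觉晓，处处闻啼鸟。」
```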

CLI

Text File Conversion

python -m opencc_purepy convert -i input.txt -o output.txt -c s2t -p
# or, if installed as a script:
opencc-purepy convert -i input.txt -o output.txt -c s2t -p
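If you would rather do the same file round trip from Python, a small helper along these lines may be enough. It is a sketch under assumptions: `convert_text_file` is a hypothetical name, the files are assumed to be UTF-8, and the converter is passed in as a plain callable so the helper does not depend on how the CLI is implemented.

```python
from pathlib import Path
from typing import Callable


def convert_text_file(src: str, dst: str, convert: Callable[[str], str]) -> int:
    """Read src as UTF-8, apply convert, write dst; return characters written."""
    text = Path(src).read_text(encoding="utf-8")
    converted = convert(text)
    Path(dst).write_text(converted, encoding="utf-8")
    return len(converted)
```

With the package installed, `convert` could be `OpenCC("s2t").convert`, or a lambda that also passes `punctuation=True`.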

Office Document Conversion subcommand (`office`)

Supports: .docx, .xlsx, .pptx, .odt, .ods, .odp, .epub

# Convert Word document with font preservation
opencc-purepy office -i example.docx -c t2s --keep-font

# Convert EPUB and auto-detect output name
opencc-purepy office -i book.epub -c s2t --auto-ext

# Convert Excel and specify output path and format
opencc-purepy office -i sheet.xlsx -o result.xlsx -c s2tw --format xlsx

ℹ️ With the office subcommand, the input is processed as an Office or EPUB document and OpenCC conversion is applied internally.

📚 Custom Dictionaries

opencc_purepy follows the OpenCC lexicon structure. Custom entries are loaded through existing OpenCC dictionary slots, such as DictSlot.STPhrases, DictSlot.TSPhrases, DictSlot.STPunctuations, and other OpenCC slots. There is no generic UserDict slot.

Dictionary slot mappings support both:

DictSlot (recommended)
string slot names such as "st_phrases" (backward compatible)

Recommended: load-time append mode

Use appends={...} to load built-in dictionaries first, then custom entries. When a key appears more than once, the later entry wins, so custom entries override built-in entries. This is recommended for most users.

from opencc_purepy import DictSlot, OpenCC

cc = OpenCC.from_dicts(
    config="s2t",
    appends={
        DictSlot.STPhrases: "./UserDict.txt",
    },
)

print(cc.convert("帕兰蒂尔是一家公司"))

String slot names remain supported for compatibility:

from opencc_purepy import OpenCC

cc = OpenCC.from_dicts(
    config="s2t",
    appends={
        "st_phrases": "./UserDict.txt",
    },
)

The same appends={...} and overrides={...} arguments are also supported by DictionaryMaxlength.from_dicts() when you want to create and reuse a dictionary instance yourself.

Post-load file customization

Use DictionaryMaxlength.with_custom_dict_files() when you already have a dictionary instance and want to apply OpenCC-compatible text dictionary files after loading it. Post-load customization supports both appends={...} and overrides={...}.

from opencc_purepy import DictSlot, OpenCC
from opencc_purepy.dictionary_lib import DictionaryMaxlength

dictionary = DictionaryMaxlength.from_json().with_custom_dict_files(
    appends={
        DictSlot.STPhrases: "./UserDict.txt",
    },
)

cc = OpenCC(config="s2t", dictionary=dictionary)

print(cc.convert("帕兰蒂尔是一家公司"))

Create a private dictionary instance first with DictionaryMaxlength.from_json() or DictionaryMaxlength.from_dicts(). Do not mutate the shared global provider returned by DictionaryMaxlength.get_provider() or DictionaryMaxlength.new(); the post-load customization APIs are intended for private dictionary instances.

Tofu-risk / Extension Unicode fallback pairs

Use DictionaryMaxlength.with_custom_dicts() for exact in-memory custom pairs when you need to patch tofu-risk characters or Extension Unicode mappings without restructuring the built-in OpenCC dictionaries.

This is useful for platforms where some CJK Extension characters may render as tofu boxes, or where you want to provide a temporary project-local fallback before the upstream dictionary data is updated.

from opencc_purepy import DictSlot, OpenCC
from opencc_purepy.dictionary_lib import DictionaryMaxlength

dictionary = DictionaryMaxlength.from_json().with_custom_dicts(
    appends={
        DictSlot.STPhrases: {
            # Project-local fallback pairs for tofu-risk / Extension Unicode cases.
            # Keep these patches small, explicit, and easy to remove later.
            "骖𬴂": "驂騑",
            "𫜩合": "齧合",
            "𫜩蘗吞针": "齧蘗吞針",

            # Normal custom phrase pairs may be mixed in as well.
            "帕兰蒂尔": "帕蘭蒂爾",
        },
    },
)

cc = OpenCC(config="s2t", dictionary=dictionary)

print(cc.convert("骖𬴂"))
print(cc.convert("𫜩合"))
print(cc.convert("帕兰蒂尔"))

This keeps the core dictionary structure unchanged while still allowing applications to patch specific high-risk entries at load time.
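To decide which keys might deserve a fallback pair, one rough heuristic is to list the characters that sit outside the Basic Multilingual Plane, since CJK Extension characters live there. The helper below is a hypothetical sketch; whether a character actually renders as tofu depends on the fonts installed on the target platform, which this check cannot see.

```python
from typing import List


def supplementary_chars(text: str) -> List[str]:
    """Return characters above U+FFFF, in first-seen order, without repeats."""
    seen: List[str] = []
    for ch in text:
        if ord(ch) > 0xFFFF and ch not in seen:
            seen.append(ch)
    return seen


for phrase in ("骖𬴂", "𫜩合", "帕兰蒂尔"):
    risky = supplementary_chars(phrase)
    # Print each flagged character as a code point for easier lookup.
    print(phrase, [f"U+{ord(ch):04X}" for ch in risky])
```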

Dictionary text format

Custom dictionary files are UTF-8 text files in OpenCC lexicon format. Use one mapping per line:

# Custom company terms
帕兰蒂尔	帕蘭蒂爾
AI模型 AI模型

Each entry is parsed as key<TAB>value or as key and value separated by whitespace. Blank lines are ignored, comment lines start with #, and when a key appears more than once the later entry wins.

Because file parsing follows OpenCC dictionary rules, leading spaces and embedded spaces in keys are not preserved. Use with_custom_dicts() when the custom key itself contains spaces.
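The parsing rules above can be sketched in a few lines. This is an illustration only, not the library's parser: `parse_lexicon_lines` is a hypothetical name, and keeping just the first value token on each line is an assumption.

```python
from typing import Dict, Iterable


def parse_lexicon_lines(lines: Iterable[str]) -> Dict[str, str]:
    """Parse `key<whitespace>value` lines; the last duplicate key wins."""
    entries: Dict[str, str] = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split()
        if len(parts) < 2:
            continue  # ignore lines that have no value
        key, value = parts[0], parts[1]
        entries[key] = value  # later entries overwrite earlier ones
    return entries


sample = [
    "# Custom company terms",
    "帕兰蒂尔\t帕蘭蒂爾",
    "",
    "AI模型 AI模型",
    "帕兰蒂尔\t帕蘭蒂爾公司",  # made-up duplicate to show that the later entry wins
]
print(parse_lexicon_lines(sample))
```

Because the line is split on whitespace, a key such as "New York" would be cut at the space, which is why in-memory pairs are the better fit for keys that contain spaces.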

Override mode

Use overrides={...} to replace an entire dictionary slot. This is for advanced users who maintain a full replacement for a selected OpenCC dictionary slot.

from opencc_purepy import DictSlot, OpenCC

cc = OpenCC.from_dicts(
    config="s2t",
    overrides={
        DictSlot.STPhrases: "./company/STPhrases.txt",
    },
)

Post-load override mode works the same way:

from opencc_purepy import DictSlot
from opencc_purepy.dictionary_lib import DictionaryMaxlength

dictionary = DictionaryMaxlength.from_json().with_custom_dict_files(
    overrides={
        DictSlot.STPhrases: "./CompanyOnlySTPhrases.txt",
    },
)
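The difference between the two modes comes down to how one slot's table is assembled. The following sketch models a slot as a plain dict; the function names and the toy entries are assumptions, and the real library may store slots differently.

```python
from typing import Dict


def apply_append(builtin: Dict[str, str], custom: Dict[str, str]) -> Dict[str, str]:
    """Built-in entries first, then custom ones; custom wins on duplicate keys."""
    merged = dict(builtin)
    merged.update(custom)
    return merged


def apply_override(builtin: Dict[str, str], custom: Dict[str, str]) -> Dict[str, str]:
    """The custom table replaces the whole slot; built-in entries are dropped."""
    return dict(custom)


# Toy slot contents (hypothetical), not real OpenCC data.
builtin = {"软件": "軟件", "公司": "公司"}
custom = {"软件": "軟體"}

print(apply_append(builtin, custom))    # {'软件': '軟體', '公司': '公司'}
print(apply_override(builtin, custom))  # {'软件': '軟體'}
```

With overrides, anything missing from the replacement file is simply no longer in that slot, so an override file needs to be complete.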

Supported slots

`DictSlot`	Legacy key	Default file
`DictSlot.STCharacters`	`st_characters`	`STCharacters.txt`
`DictSlot.STPhrases`	`st_phrases`	`STPhrases.txt`
`DictSlot.STPunctuations`	`st_punctuations`	`STPunctuations.txt`
`DictSlot.TSCharacters`	`ts_characters`	`TSCharacters.txt`
`DictSlot.TSPhrases`	`ts_phrases`	`TSPhrases.txt`
`DictSlot.TSPunctuations`	`ts_punctuations`	`TSPunctuations.txt`
`DictSlot.TWPhrases`	`tw_phrases`	`TWPhrases.txt`
`DictSlot.TWPhrasesRev`	`tw_phrases_rev`	`TWPhrasesRev.txt`
`DictSlot.TWVariants`	`tw_variants`	`TWVariants.txt`
`DictSlot.TWVariantsRev`	`tw_variants_rev`	`TWVariantsRev.txt`
`DictSlot.TWVariantsRevPhrases`	`tw_variants_rev_phrases`	`TWVariantsRevPhrases.txt`
`DictSlot.HKVariants`	`hk_variants`	`HKVariants.txt`
`DictSlot.HKVariantsRev`	`hk_variants_rev`	`HKVariantsRev.txt`
`DictSlot.HKVariantsRevPhrases`	`hk_variants_rev_phrases`	`HKVariantsRevPhrases.txt`
`DictSlot.JPSCharacters`	`jps_characters`	`JPShinjitaiCharacters.txt`
`DictSlot.JPSPhrases`	`jps_phrases`	`JPShinjitaiPhrases.txt`
`DictSlot.JPVariants`	`jp_variants`	`JPVariants.txt`
`DictSlot.JPVariantsRev`	`jp_variants_rev`	`JPVariantsRev.txt`

Generate JSON with dictgen

TXT dictionaries are human-editable source files. dictionary_maxlength.json is a generated/cache format, so prefer dictgen instead of manually editing JSON.

opencc-purepy dictgen -d ./my_dicts -o dictionary_maxlength.json

from opencc_purepy import OpenCC
from opencc_purepy.dictionary_lib import DictionaryMaxlength

dictionary = DictionaryMaxlength.from_json("./dictionary_maxlength.json")

cc = OpenCC(
    config="s2t",
    dictionary=dictionary,
)

Which mode should I use?

Use appends for a few user or company terms.
Use overrides when maintaining a full proprietary replacement of an OpenCC dictionary file.
Use with_custom_dict_files() to apply OpenCC-compatible text files to a private dictionary after loading it.
Use with_custom_dicts() for exact in-memory pairs, especially keys with leading or embedded spaces.
Use dictgen when you want to bake TXT dictionaries into JSON for reuse or faster loading.
Use direct dictionary injection when sharing one loaded dictionary across many OpenCC instances.
Prefer DictSlot for new code and IDE-friendly type checking.
Legacy str slot keys remain fully supported for backward compatibility.

🧩 API Reference

Exports

OpenCC
OpenccConfig
DictSlot

`OpenCC` class

OpenCC(config: str | OpenccConfig = "s2t")
Create a converter with a supported config string or OpenccConfig enum value. Raises ValueError for unsupported configs.
set_config(config: str | OpenccConfig) -> None
Update the active conversion config. Raises ValueError for unsupported configs.
get_config() -> str
Return the current canonical config name.
supported_configs() -> list[str]
Return all supported config names.
get_last_error() -> str | None
Return the last validation or conversion error, if any.
convert(input: str, punctuation: bool = False) -> str
Convert text using the active config, with optional punctuation conversion.
s2t(input: str, punctuation: bool = False) -> str
Simplified Chinese to Traditional Chinese.
t2s(input: str, punctuation: bool = False) -> str
Traditional Chinese to Simplified Chinese.
s2tw(input: str, punctuation: bool = False) -> str
Simplified Chinese to Taiwan Traditional.
tw2s(input: str, punctuation: bool = False) -> str
Taiwan Traditional to Simplified Chinese.
s2twp(input: str, punctuation: bool = False) -> str
Simplified Chinese to Taiwan Traditional with idiom and phrase conversion.
tw2sp(input: str, punctuation: bool = False) -> str
Taiwan Traditional with idioms to Simplified Chinese.
s2hk(input: str, punctuation: bool = False) -> str
Simplified Chinese to Hong Kong Traditional.
hk2s(input: str, punctuation: bool = False) -> str
Hong Kong Traditional to Simplified Chinese.
t2tw(input: str, punctuation: bool = False) -> str
Traditional Chinese to Taiwan Traditional.
t2twp(input: str, punctuation: bool = False) -> str
Traditional Chinese to Taiwan Traditional with phrase mappings.
tw2t(input: str, punctuation: bool = False) -> str
Taiwan Traditional to standard Traditional Chinese.
tw2tp(input: str, punctuation: bool = False) -> str
Taiwan Traditional to standard Traditional Chinese with phrase reversal.
t2hk(input: str, punctuation: bool = False) -> str
Traditional Chinese to Hong Kong variant.
hk2t(input: str, punctuation: bool = False) -> str
Hong Kong Traditional to standard Traditional Chinese.
t2jp(input: str, punctuation: bool = False) -> str
Traditional Chinese (Kyujitai) to Japanese Shinjitai.
jp2t(input: str, punctuation: bool = False) -> str
Japanese Shinjitai to Traditional Chinese.
st(input: str) -> str
Character-only Simplified to Traditional conversion.
ts(input: str) -> str
Character-only Traditional to Simplified conversion.
zho_check(input: str) -> int
Detect the input text type:
1 - Traditional, 2 - Simplified, 0 - Others
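The integer codes returned by zho_check() are easier to read behind names. The enum and helper below are hypothetical and not part of the package; they only encode the three values listed above and pick a config that converts to the other script.

```python
from enum import IntEnum
from typing import Optional


class TextKind(IntEnum):
    """Hypothetical names for the integer codes listed for zho_check()."""
    OTHER = 0
    TRADITIONAL = 1
    SIMPLIFIED = 2


def opposite_config(code: int) -> Optional[str]:
    """Pick a config that converts to the other script, or None for other text."""
    kind = TextKind(code)  # raises ValueError for codes outside 0..2
    if kind is TextKind.TRADITIONAL:
        return "t2s"
    if kind is TextKind.SIMPLIFIED:
        return "s2t"
    return None


print(opposite_config(1))  # t2s
print(opposite_config(2))  # s2t
print(opposite_config(0))  # None
```

With a converter instance `cc`, the input would typically be `cc.zho_check(text)`.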

`OpenccConfig` enum

Members include: S2T, T2S, S2TW, TW2S, S2TWP, TW2SP, S2HK, HK2S, T2TW, TW2T, T2TWP, TW2TP, T2HK, HK2T, T2JP, JP2T
to_canonical_name() -> str
Return the lowercase OpenCC config string.
parse(value: str) -> OpenccConfig
Parse a config string into an enum value.

🛠 Development

Python bindings: opencc_purepy/__init__.py, opencc_purepy/core.py
CLI: opencc_purepy/__main__.py

⚡ Benchmark

Measured on GitHub Actions ubuntu-latest using the default s2t configuration.
The benchmark separates cold startup, first post-init conversion, and warm cached conversion.

Runner Platform

Field	Value
Runner	Linux X64
Image	ubuntu24 20260513.135.3
Kernel	`Linux runnervmrw5os 6.17.0-1013-azure #13~24.04.1-Ubuntu SMP Wed Apr 15 16:52:17 UTC 2026 x86_64 x86_64 x86_64 GNU/Linux`
CPU	AMD EPYC 9V74 80-Core Processor
CPU Cores	4
Memory
Python	Python 3.10.20

Results

opencc-purepy v1.3.0

Input Size	Cold Total (ms)	Post-init Cold (ms)	Warm (ms)
100 chars	21.849	19.419	0.171
1,000 chars	21.572	21.409	1.643
10,000 chars	34.063	32.584	13.480
100,000 chars	158.355	156.202	136.870

cold_total includes OpenCC(config) setup plus conversion. post_init_cold measures the first conversion after initialization. warm measures conversion after the union cache has already been built. Results depend on runner hardware and background system load.
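The three measurements can be reproduced with a small timing harness. This is a sketch of the idea, not the project's benchmark script: `measure` is a hypothetical helper, and the converter is supplied through a factory so that setup time can be separated from conversion time.

```python
import time
from typing import Callable, Dict


def measure(make_converter: Callable[[], Callable[[str], str]], text: str) -> Dict[str, float]:
    """Time setup plus first call, the first call alone, and a second call, in ms."""
    t0 = time.perf_counter()
    convert = make_converter()  # setup
    t1 = time.perf_counter()
    convert(text)               # first conversion after initialization
    t2 = time.perf_counter()
    convert(text)               # second conversion, caches already built
    t3 = time.perf_counter()
    return {
        "cold_total_ms": (t2 - t0) * 1000,
        "post_init_cold_ms": (t2 - t1) * 1000,
        "warm_ms": (t3 - t2) * 1000,
    }


# Stand-in converter so the sketch runs without the package installed.
result = measure(lambda: str.upper, "春眠不觉晓" * 200)
print(sorted(result))
```

With the package installed, the factory could be `lambda: OpenCC("s2t").convert`. A single run is noisy, so repeating the measurement and taking a median is advisable.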

Notes

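Although it is implemented in pure Python, opencc_purepy relies on aggressive caching and starter-index optimizations to keep conversion fast. In the results above, the warm path converts 100,000 characters in about 137 ms, or roughly 730,000 characters per second on this runner.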

The warm conversion path is practical for large-text workloads such as document conversion, GUI applications, and batch processing.
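One way to picture a starter index is a table from each possible first character to the longest dictionary key that begins with it, so the matcher can skip characters that start no key and bound its search window for the rest. The sketch below is an assumption about the general technique, with hypothetical names and toy data; it does not claim to mirror the library's internal data structures.

```python
from typing import Dict


def build_starter_index(mapping: Dict[str, str]) -> Dict[str, int]:
    """Map each first character to the longest key length that starts with it."""
    index: Dict[str, int] = {}
    for key in mapping:
        if key:
            index[key[0]] = max(index.get(key[0], 0), len(key))
    return index


def convert_with_starters(text: str, mapping: Dict[str, str]) -> str:
    """Longest-match conversion that consults the starter index before probing."""
    index = build_starter_index(mapping)  # a real converter would cache this
    out = []
    i = 0
    while i < len(text):
        longest = index.get(text[i], 0)
        # If longest is 0 the loop body never runs and the character is copied.
        for size in range(min(longest, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in mapping:
                out.append(mapping[chunk])
                i += size
                break
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


# Toy dictionary (hypothetical), not the real OpenCC data.
toy_dict = {"头发": "頭髮", "发": "發", "头": "頭", "开": "開"}
print(convert_with_starters("abc头发开发", toy_dict))  # abc頭髮開發
```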

Projects That Use `opencc-purepy`

OpenccPurepyGui

📄 License

This project is licensed under the MIT License.

Maintainers

laisuk

Release history

Version	Release date
1.3.1	May 23, 2026
1.3.0 (this version)	May 23, 2026
1.2.4	May 15, 2026
1.2.3	May 14, 2026
1.2.2	May 14, 2026
1.2.1	May 7, 2026
1.2.0	Apr 8, 2026
1.1.0	Aug 13, 2025
1.0.3	Jul 6, 2025
1.0.2	Jun 26, 2025
1.0.1	Jun 26, 2025
1.0.0	Jun 18, 2025

Download files

Source Distribution

opencc_purepy-1.3.0.tar.gz (1.0 MB)

Uploaded May 23, 2026 Source

Built Distribution

opencc_purepy-1.3.0-py3-none-any.whl (1.0 MB)

Uploaded May 23, 2026 Python 3

File details

Details for the file opencc_purepy-1.3.0.tar.gz.

File metadata

Download URL: opencc_purepy-1.3.0.tar.gz
Upload date: May 23, 2026
Size: 1.0 MB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for opencc_purepy-1.3.0.tar.gz
Algorithm	Hash digest
SHA256	`1ede2349eeee3f6ae3ec7366a6074edb066ff83dc82f66aae923cfbd26ad1cc6`
MD5	`2217bd723136f0af1c9d58fbef138971`
BLAKE2b-256	`1b0fe8e44f6d3773c72e9b36ef25f2e9f00034e30b20b2bd1f63fb85e214b985`
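A downloaded file can be checked against the SHA256 digest listed above with the standard library. The helper names below are hypothetical and the sketch assumes the file has already been downloaded to a local path.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


# SHA256 digest published above for opencc_purepy-1.3.0.tar.gz.
EXPECTED = "1ede2349eeee3f6ae3ec7366a6074edb066ff83dc82f66aae923cfbd26ad1cc6"


def matches_published_hash(path: str, expected: str = EXPECTED) -> bool:
    """Compare a local file's digest with the published one."""
    return sha256_of(path) == expected.lower()
```

Calling `matches_published_hash("opencc_purepy-1.3.0.tar.gz")` should return True for an intact download of that file.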

Provenance

The following attestation bundles were made for opencc_purepy-1.3.0.tar.gz:

Publisher: release.yml on laisuk/opencc_purepy

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: opencc_purepy-1.3.0.tar.gz
- Subject digest: 1ede2349eeee3f6ae3ec7366a6074edb066ff83dc82f66aae923cfbd26ad1cc6
- Sigstore transparency entry: 1615322034
- Sigstore integration time: May 23, 2026
Source repository:
- Permalink: laisuk/opencc_purepy@ec2d65143b9b7c01e2e89d32f18c132557d4c238
- Branch / Tag: refs/tags/v1.3.0
- Owner: https://github.com/laisuk
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@ec2d65143b9b7c01e2e89d32f18c132557d4c238
- Trigger Event: push

File details

Details for the file opencc_purepy-1.3.0-py3-none-any.whl.

File metadata

Download URL: opencc_purepy-1.3.0-py3-none-any.whl
Upload date: May 23, 2026
Size: 1.0 MB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for opencc_purepy-1.3.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`424dd38cb729063df3bc3f09bf50ba9e9fe1e8f5453f970f4bf1241575161c84`
MD5	`a47f43f7a088cd8ec6c9b23bd7d1d076`
BLAKE2b-256	`e11a22cef6a44c899f9756cc3afe761ff7ae34fd8a3555a3904d1653484c97ca`

Provenance

The following attestation bundles were made for opencc_purepy-1.3.0-py3-none-any.whl:

Publisher: release.yml on laisuk/opencc_purepy

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: opencc_purepy-1.3.0-py3-none-any.whl
- Subject digest: 424dd38cb729063df3bc3f09bf50ba9e9fe1e8f5453f970f4bf1241575161c84
- Sigstore transparency entry: 1615322041
- Sigstore integration time: May 23, 2026
Source repository:
- Permalink: laisuk/opencc_purepy@ec2d65143b9b7c01e2e89d32f18c132557d4c238
- Branch / Tag: refs/tags/v1.3.0
- Owner: https://github.com/laisuk
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@ec2d65143b9b7c01e2e89d32f18c132557d4c238
- Trigger Event: push
