High-performance Chinese text conversion (Simplified ↔ Traditional) powered by Rust, PyO3, and OpenCC lexicons.

opencc_pyo3

opencc_pyo3 is a Python extension module powered by Rust and PyO3, providing fast and accurate conversion between different Chinese text variants using OpenCC algorithms.

Features

  • Convert between Simplified, Traditional, Hong Kong, Taiwan, and Japanese Kanji variants with OpenCC-compatible configurations.
  • High-performance Rust + PyO3 backend for fast, memory-efficient Chinese text conversion in Python.
  • Python API with OpenCC, OpenccConfig, config validation helpers, punctuation conversion, and Chinese text variant detection.
  • Command-line interface for plain text conversion from files or standard input.
  • Office and EPUB document conversion support for .docx, .xlsx, .pptx, .odt, .ods, .odp, and .epub.
  • PDF text extraction helpers, including PDFium-based page-by-page extraction utilities.
  • CJK paragraph reflow helper for cleaning PDF-extracted text before conversion.

Supported Conversion Configurations

  • s2t, t2s, s2tw, tw2s, s2twp, tw2sp, s2hk, hk2s, t2tw, tw2t, t2twp, tw2tp, t2hk, hk2t, t2jp, jp2t
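The config names follow the usual OpenCC pattern of source, "2", target, with an optional trailing "p" for phrase-level regional conversion. As a rough reading aid, the sketch below decodes a name into a description. The function name and the label wording are assumptions made for this illustration; nothing here is part of the opencc_pyo3 API.

```python
# Illustrative decoder for OpenCC-style config names such as "s2twp".
# The labels are wording chosen for this sketch, not library output.
VARIANTS = {
    "s": "Simplified Chinese",
    "t": "Traditional Chinese",
    "tw": "Traditional Chinese (Taiwan)",
    "hk": "Traditional Chinese (Hong Kong)",
    "jp": "Japanese Kanji (Shinjitai)",
}


def describe_config(name: str) -> str:
    """Return a human-readable description of a config name."""
    source, sep, target = name.lower().partition("2")
    if not sep or source not in VARIANTS:
        raise ValueError(f"unrecognized config name: {name!r}")
    # A trailing "p" (as in s2twp or tw2sp) marks phrase-level conversion.
    # "jp" is itself a variant code, so it is checked before stripping.
    with_phrases = target not in VARIANTS and target.endswith("p")
    if with_phrases:
        target = target[:-1]
    if target not in VARIANTS:
        raise ValueError(f"unrecognized config name: {name!r}")
    suffix = " with regional phrases" if with_phrases else ""
    return f"{VARIANTS[source]} -> {VARIANTS[target]}{suffix}"


print(describe_config("s2twp"))
```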

Installation

1. Install from PyPI

pip install opencc-pyo3

2. Build and install the Python wheel using maturin:

# In project root
maturin build --release
pip install ./target/wheels/opencc_pyo3-<version>-cp<pyver>-abi3-<platform>.whl

Or, for development (may require a virtual environment):

maturin develop -r

See build.txt for detailed build and install instructions.
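After either install route, a quick check that the module is visible to the current interpreter can save some confusion with virtual environments. This is a minimal sketch using only the standard library; the helper name is_importable is made up for the example.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if a top-level module can be found by the current interpreter."""
    return importlib.util.find_spec(module_name) is not None


if is_importable("opencc_pyo3"):
    print("opencc_pyo3 is installed")
else:
    print("opencc_pyo3 not found; try: pip install opencc-pyo3")
```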

Usage

Python

from opencc_pyo3 import OpenCC

text = "“春眠不觉晓,处处闻啼鸟。”"
opencc = OpenCC("s2t")
converted = opencc.convert(text, punctuation=True)
print(converted)  # 「春眠不覺曉,處處聞啼鳥。」

CLI

The CLI is available either as a Python module (python -m opencc_pyo3) or as the opencc-pyo3 script.
Sub-commands:

  • convert: Convert Chinese text using OpenCC
  • office: Convert Office document Chinese text using OpenCC
  • pdf: Convert extracted PDF document text using OpenCC

convert

python -m opencc_pyo3 convert --help
usage: opencc-pyo3 convert [-h] [-i <file>] [-o <file>] [-c <conversion>] [-p] [--in-enc <encoding>] [--out-enc <encoding>]

options:
  -h, --help            show this help message and exit
  -i, --input <file>    Read original text from <file>.
  -o, --output <file>   Write converted text to <file>.
  -c, --config <conversion>
                        Conversion configuration: s2t|s2tw|s2twp|s2hk|t2s|tw2s|tw2sp|hk2s|jp2t|t2jp
  -p, --punct           Enable punctuation conversion. (Default: False)
  --in-enc <encoding>   Encoding for input. (Default: UTF-8)
  --out-enc <encoding>  Encoding for output. (Default: UTF-8)
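
For comparison, the same read, convert, write flow can be sketched in Python. This is not how the convert sub-command is implemented; it is an assumption-level equivalent in which convert_file and its parameters are names invented for the example, and the converter is passed in as a plain callable.

```python
from pathlib import Path
from typing import Callable


def convert_file(
    src: str,
    dst: str,
    convert: Callable[[str], str],
    in_enc: str = "utf-8",
    out_enc: str = "utf-8",
) -> int:
    """Read src, convert its text, write dst; return the number of characters written."""
    text = Path(src).read_text(encoding=in_enc)
    converted = convert(text)
    Path(dst).write_text(converted, encoding=out_enc)
    return len(converted)
```

With the package installed, the callable could be something like lambda s: OpenCC("s2t").convert(s, punctuation=True), which mirrors -c s2t --punct.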

office

Supports Office (.docx, .xlsx, .pptx), OpenDocument (.odt, .ods, .odp), and EPUB (.epub) files.

python -m opencc_pyo3 office --help
usage: opencc-pyo3 office [-h] [-i <file>] [-o <file>] [-c <conversion>] [-p] [-f <format>] [--auto-ext] [--keep-font]

options:
  -h, --help            show this help message and exit
  -i, --input <file>    Input Office document from <file>.
  -o, --output <file>   Output Office document to <file>.
  -c, --config <conversion>
                        conversion: s2t|s2tw|s2twp|s2hk|t2s|tw2s|tw2sp|hk2s|jp2t|t2jp
  -p, --punct           Enable punctuation conversion. (Default: False)
  -f, --format <format>
                        Target Office format (e.g., docx, xlsx, pptx, odt, ods, odp, epub)
  --auto-ext            Auto-append extension to output file
  --keep-font           Preserve font-family information in Office content

pdf

Supports PDF files as input, with built-in text extraction and OpenCC-based conversion powered by opencc-fmmseg (available since v0.8.4).

This command allows you to extract Chinese text from PDF documents, optionally apply CJK-aware paragraph reflow, and convert the result using OpenCC configurations.

Note
Only text-embedded (searchable) PDF documents are supported.
Scanned or image-only PDFs without an embedded text layer are not currently supported.

python -m opencc_pyo3 pdf --help

usage: __main__.py pdf [-h] -i <file> [-o <file>] [-c <conversion>] [-p] [-H] [-r] [--compact] [--timing] [-e]

options:
  -h, --help            show this help message and exit
  -i, --input <file>    Input PDF file.
  -o, --output <file>   Output text file (UTF-8). If omitted, defaults to "<input>_converted.txt".
  -c, --config <conversion>
                        Conversion configuration: s2t|s2tw|s2twp|s2hk|t2s|tw2s|tw2sp|hk2s|jp2t|t2jp
  -p, --punct           Enable punctuation conversion. (Default: False)
  -H, --header          Preserve page-break-like gaps when reflowing CJK paragraphs (passed as add_pdf_page_header to reflow_cjk_paragraphs).
  -r, --reflow          Enable CJK-aware paragraph reflow before conversion.
  --compact             Use compact paragraph mode (single newline between paragraphs).
  --timing              Show time use for each process workflow.
  -e, --extract         Extract PDF text only (skip OpenCC conversion).

Examples:

python -m opencc_pyo3 convert -i input.txt -o output.txt -c s2t --punct

python -m opencc_pyo3 office -c s2t --punct -i input.docx -o output.docx --keep-font

opencc-pyo3 office -c s2tw -p -i input.epub -o output.epub

opencc-pyo3 pdf -i input.pdf -o output.txt -c s2t --punct --reflow
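
To run the convert sub-command over a folder of text files, one option is to build the command lines in Python and hand them to subprocess. The sketch below uses only the flags shown in the help output above; the function names and the *.txt glob are assumptions for illustration.

```python
import subprocess
import sys
from pathlib import Path
from typing import List


def build_convert_command(src: Path, dst: Path, config: str = "s2t", punct: bool = False) -> List[str]:
    """Build an argument list for `python -m opencc_pyo3 convert`."""
    cmd = [sys.executable, "-m", "opencc_pyo3", "convert",
           "-i", str(src), "-o", str(dst), "-c", config]
    if punct:
        cmd.append("--punct")
    return cmd


def plan_batch(src_dir: Path, dst_dir: Path, config: str = "s2t", punct: bool = False) -> List[List[str]]:
    """Return one command per .txt file in src_dir, in sorted order."""
    return [
        build_convert_command(path, dst_dir / path.name, config, punct)
        for path in sorted(src_dir.glob("*.txt"))
    ]


def run_batch(commands: List[List[str]]) -> None:
    """Run each command in turn and stop at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

Keeping planning separate from execution makes it easy to print the commands first and inspect them before anything is written.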

Python API

OpenccConfig

OpenccConfig is an enum exported from opencc_pyo3 for config-safe usage:

from opencc_pyo3 import OpenCC, OpenccConfig

cc = OpenCC(OpenccConfig.S2TW)
print(cc.convert("汉字"))  # 漢字

Available enum values:

  • OpenccConfig.S2T
  • OpenccConfig.T2S
  • OpenccConfig.S2TW
  • OpenccConfig.TW2S
  • OpenccConfig.S2TWP
  • OpenccConfig.TW2SP
  • OpenccConfig.S2HK
  • OpenccConfig.HK2S
  • OpenccConfig.T2TW
  • OpenccConfig.TW2T
  • OpenccConfig.T2TWP
  • OpenccConfig.TW2TP
  • OpenccConfig.T2HK
  • OpenccConfig.HK2T
  • OpenccConfig.T2JP
  • OpenccConfig.JP2T

OpenCC

Core converter class backed by the Rust extension module.

  • OpenCC(config: str | OpenccConfig = "s2t")
    • Creates a converter using a config string or OpenccConfig enum.
    • Invalid config values fall back to s2t.
  • OpenCC.from_dicts(config="s2t", specs=None) -> OpenCC
    • Creates a converter with programmatic, in-memory custom dictionary entries.
  • OpenCC.from_dict_files(config="s2t", specs=None) -> OpenCC
    • Creates a converter with OpenCC-style custom dictionary files.
  • convert(input_text: str, punctuation: bool = False) -> str
    • Converts text using the current config.
  • set_config(config: str | OpenccConfig) -> None
    • Changes the active config.
  • get_config() -> str
    • Returns the current canonical config name.
  • get_last_error() -> str
    • Returns the most recent config error message, or "" if none.
  • zho_check(input_text: str) -> int
    • Detects the text type.
    • 1 = Traditional Chinese, 2 = Simplified Chinese, 0 = other / undetermined
  • OpenCC.supported_configs() -> list[str]
    • Returns all supported config names.
  • OpenCC.is_valid_config(config: str) -> bool
    • Validates a config string.

Example:

from opencc_pyo3 import OpenCC, OpenccConfig

cc = OpenCC(OpenccConfig.S2T)
print(cc.get_config())  # s2t
print(cc.convert("汉字", punctuation=True))
print(cc.zho_check("汉字"))  # 2

cc.set_config("t2jp")
print(cc.convert("圖書館"))  # 図書館

print(OpenCC.supported_configs())
print(OpenCC.is_valid_config("s2hk"))  # True
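
The detection codes returned by zho_check() can drive the choice of config. The following sketch toggles between Simplified and Traditional using only the methods listed above. The Converter protocol and the toggle_script name are assumptions for the example; any object with those three methods, including an OpenCC instance, should fit.

```python
from typing import Protocol


class Converter(Protocol):
    """The subset of OpenCC methods this sketch relies on."""

    def zho_check(self, input_text: str) -> int: ...
    def set_config(self, config: str) -> None: ...
    def convert(self, input_text: str, punctuation: bool = False) -> str: ...


def toggle_script(cc: Converter, text: str, punctuation: bool = False) -> str:
    """Convert Simplified to Traditional or the reverse, based on zho_check()."""
    kind = cc.zho_check(text)
    if kind == 2:      # Simplified Chinese
        cc.set_config("s2t")
    elif kind == 1:    # Traditional Chinese
        cc.set_config("t2s")
    else:              # 0: other / undetermined, leave the text untouched
        return text
    return cc.convert(text, punctuation=punctuation)
```

Note that the helper changes the active config on the converter it is given, so a dedicated instance is safer than a shared one.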

Custom Dictionaries

OpenCC("s2t") remains the recommended API for normal use and continues to use the built-in embedded dictionaries. Use custom dictionaries only when you need project-specific terms or overrides.

Custom dictionaries are applied during construction. The backend first loads the default embedded zstd dictionaries, then applies post-load customization with DictionaryMaxlength::from_zstd()?.with_custom_dicts(...) or DictionaryMaxlength::from_zstd()?.with_custom_dict_files(...). The final OpenCC instance remains immutable and optimized after construction. Runtime hot reload is not supported; rebuild a new OpenCC instance if dictionaries need to change.

In-memory custom dictionaries

Use OpenCC.from_dicts() for programmatic terms:

from typing import List

from opencc_pyo3 import OpenCC, CustomDictSpec

specs: List[CustomDictSpec] = [
    {
        "slot": "STPhrases",
        "pairs": [("帕兰蒂尔", "柏蘭蒂爾")],
        "mode": "append",
    }
]

cc = OpenCC.from_dicts("s2t", specs)

print(cc.convert("帕兰蒂尔是一家公司"))
# 柏蘭蒂爾是一家公司

File-based custom dictionaries

Use OpenCC.from_dict_files() for OpenCC-style dictionary files:

from typing import List

from opencc_pyo3 import OpenCC, CustomDictFileSpec

specs: List[CustomDictFileSpec] = [
    {
        "slot": "STPhrases",
        "files": ["custom_st_phrases.txt"],
        "mode": "append",
    }
]

cc = OpenCC.from_dict_files("s2t", specs)

Custom dictionary files use one mapping per line:

source<TAB>target

Example:

帕兰蒂尔	柏蘭蒂爾
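
A small helper can generate and sanity-check such files before they are passed to from_dict_files(). This is a sketch under assumptions: UTF-8 encoding, one source and one target per line, and blank lines skipped on read are choices made here, not documented backend behavior, and the function names are invented.

```python
from pathlib import Path
from typing import Iterable, List, Tuple


def write_dict_file(path: str, pairs: Iterable[Tuple[str, str]]) -> None:
    """Write source<TAB>target lines, rejecting entries that would break the format."""
    lines = []
    for source, target in pairs:
        for field in (source, target):
            if not field or "\t" in field or "\n" in field:
                raise ValueError(f"invalid dictionary entry: {source!r} -> {target!r}")
        lines.append(f"{source}\t{target}")
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")


def read_dict_file(path: str) -> List[Tuple[str, str]]:
    """Parse a dictionary file back into pairs, raising on malformed lines."""
    pairs = []
    text = Path(path).read_text(encoding="utf-8")
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # assumption of this sketch: blank lines are ignored
        source, sep, target = line.partition("\t")
        if not sep or not source or not target:
            raise ValueError(f"line {number}: expected source<TAB>target")
        pairs.append((source, target))
    return pairs
```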

Merge modes

  • append: add custom entries to the existing dictionary slot. Duplicate keys follow the backend "last wins" behavior.
  • override: replace the entire target dictionary slot with the custom entries or files.
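
The two modes can be modeled with plain Python dicts. This is only a mental model of the semantics described above, not the backend implementation, and it assumes that "last wins" applies both among custom entries and against existing entries with the same key.

```python
from typing import Dict, Iterable, Tuple


def merge_slot(
    existing: Dict[str, str],
    pairs: Iterable[Tuple[str, str]],
    mode: str = "append",
) -> Dict[str, str]:
    """Return a new mapping for one dictionary slot after applying custom pairs."""
    if mode == "append":
        merged = dict(existing)   # keep built-in entries, then layer custom ones
    elif mode == "override":
        merged = {}               # discard the slot's built-in entries entirely
    else:
        raise ValueError(f"invalid mode: {mode!r}")
    for source, target in pairs:
        merged[source] = target   # later duplicates overwrite earlier ones
    return merged


base = {"a": "A", "b": "B"}
print(merge_slot(base, [("b", "X"), ("c", "C")]))   # {'a': 'A', 'b': 'X', 'c': 'C'}
print(merge_slot(base, [("c", "C")], "override"))   # {'c': 'C'}
```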

Supported dictionary slots

Use canonical slot names without .txt, such as STPhrases, not STPhrases.txt. The Python wrapper may tolerate .txt, but the documented API uses canonical names only.

| Slot | Purpose | Original OpenCC file |
| --- | --- | --- |
| STCharacters | Simplified → Traditional character mappings | STCharacters.txt |
| STPhrases | Simplified → Traditional phrase mappings | STPhrases.txt |
| STPunctuations | Simplified → Traditional punctuation mappings | STPunctuations.txt |
| TSCharacters | Traditional → Simplified character mappings | TSCharacters.txt |
| TSPhrases | Traditional → Simplified phrase mappings | TSPhrases.txt |
| TSPunctuations | Traditional → Simplified punctuation mappings | TSPunctuations.txt |
| TWPhrases | Traditional → Taiwan phrase mappings | TWPhrases.txt |
| TWPhrasesRev | Taiwan → Traditional reverse phrase mappings | TWPhrasesRev.txt |
| TWVariants | Traditional → Taiwan regional variant mappings | TWVariants.txt |
| TWVariantsRev | Taiwan → Traditional reverse variant mappings | TWVariantsRev.txt |
| TWVariantsRevPhrases | Taiwan → Traditional reverse phrase variant mappings | TWVariantsRevPhrases.txt |
| HKVariants | Traditional → Hong Kong regional variant mappings | HKVariants.txt |
| HKVariantsRev | Hong Kong → Traditional reverse variant mappings | HKVariantsRev.txt |
| HKVariantsRevPhrases | Hong Kong → Traditional reverse phrase variant mappings | HKVariantsRevPhrases.txt |
| JPSCharacters | Japanese Shinjitai character mappings | JPShinjitaiCharacters.txt |
| JPSPhrases | Japanese Shinjitai phrase mappings | JPShinjitaiPhrases.txt |
| JPVariants | Traditional → Japanese variant mappings | JPVariants.txt |
| JPVariantsRev | Japanese → Traditional reverse variant mappings | JPVariantsRev.txt |

Custom dictionary behavior follows the same OpenCC dictionary-slot model. Choosing the wrong slot may have no effect or may affect a different conversion path. For s2t, use STCharacters or STPhrases. For t2s, use TSCharacters or TSPhrases. For regional variants, use the relevant TW, HK, or JP slots.
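
Because a wrong slot or mode only surfaces when the converter is built, it can help to check specs up front. The sketch below validates a spec dict against the slot names in the table and the two merge modes. The helper names are hypothetical, and the library performs its own validation regardless.

```python
KNOWN_SLOTS = frozenset({
    "STCharacters", "STPhrases", "STPunctuations",
    "TSCharacters", "TSPhrases", "TSPunctuations",
    "TWPhrases", "TWPhrasesRev", "TWVariants", "TWVariantsRev", "TWVariantsRevPhrases",
    "HKVariants", "HKVariantsRev", "HKVariantsRevPhrases",
    "JPSCharacters", "JPSPhrases", "JPVariants", "JPVariantsRev",
})
VALID_MODES = frozenset({"append", "override"})


def canonical_slot(name: str) -> str:
    """Strip a trailing .txt and confirm the result is a known slot name."""
    slot = name[:-4] if name.lower().endswith(".txt") else name
    if slot not in KNOWN_SLOTS:
        raise ValueError(f"unknown dictionary slot: {name!r}")
    return slot


def check_spec(spec: dict) -> dict:
    """Return a copy of spec with a canonical slot name, or raise ValueError."""
    mode = spec.get("mode")
    if mode not in VALID_MODES:
        raise ValueError(f"invalid mode: {mode!r}")
    return {**spec, "slot": canonical_slot(spec["slot"])}
```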

Typing helpers

CustomDictSpec and CustomDictFileSpec are exported for typed Python code:

from typing import List
from opencc_pyo3 import OpenCC, CustomDictSpec

specs: List[CustomDictSpec] = [
    {
        "slot": "STPhrases",
        "pairs": [("帕兰蒂尔", "柏蘭蒂爾")],
        "mode": "append",
    }
]

cc = OpenCC.from_dicts("s2t", specs)

Notes:

  • Custom dictionaries are loaded at construction time.
  • Existing OpenCC objects are immutable after construction.
  • Runtime hot reload is not supported.
  • Rebuild a new OpenCC instance if dictionaries need to change.
  • Invalid slots, invalid modes, malformed lines, or unreadable files raise errors.
  • OpenCC("s2t") remains the recommended API for normal users.
  • Use from_dicts() for programmatic or in-memory custom terms.
  • Use from_dict_files() for OpenCC-style dictionary files.

Reflow helper

  • reflow_cjk_paragraphs(text: str, add_pdf_page_header: bool, compact: bool) -> str
    • Reflows PDF-extracted CJK text by merging broken line wraps while preserving paragraph structure.
    • add_pdf_page_header=True keeps explicit page-gap style boundaries.
    • compact=True uses single newlines between paragraphs.

Example:

from opencc_pyo3.opencc_pyo3 import reflow_cjk_paragraphs

raw_text = "我来到\n无人海边。"
clean_text = reflow_cjk_paragraphs(raw_text, add_pdf_page_header=False, compact=False)  # 我来到无人海边。
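
To give a feel for what reflow means, here is a deliberately simplified pure-Python heuristic. It is not the algorithm used by reflow_cjk_paragraphs. It joins consecutive lines and treats a blank line, or a line ending in 。！？, as the end of a paragraph, which is an assumption that will misfire on real documents.

```python
SENTENCE_END = "。！？"


def naive_reflow(text: str, compact: bool = False) -> str:
    """Join wrapped lines into paragraphs using a crude end-of-sentence rule."""
    paragraphs = []
    current = ""
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            # A blank line closes whatever paragraph is in progress.
            if current:
                paragraphs.append(current)
                current = ""
            continue
        current += stripped
        if stripped[-1] in SENTENCE_END:
            # Treat a line that ends a sentence as the end of the paragraph.
            paragraphs.append(current)
            current = ""
    if current:
        paragraphs.append(current)
    separator = "\n" if compact else "\n\n"
    return separator.join(paragraphs)


print(naive_reflow("我来到\n无人海边。"))  # 我来到无人海边。
```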

PDF extraction APIs

The package exposes PDFium-based PDF extraction helpers in opencc_pyo3.pdfium_helper.

Import from opencc_pyo3.pdfium_helper:

  • extract_pdf_pages_with_callback_pdfium(path: str, callback, add_page_header: bool = False) -> None
  • extract_pdf_text_pdfium_progress(path: str) -> str
  • extract_pdf_text_pdfium_silent(path: str) -> str
  • extract_pdf_text_pages_pdfium(path: str) -> list[str]
  • extract_pdf_text_pages_pdfium_progress(path: str) -> list[str]
  • make_progress_collector() -> tuple[callback, list[str]]
  • make_silent_collector() -> tuple[callback, list[str]]

Example:

from opencc_pyo3 import OpenCC
from opencc_pyo3.opencc_pyo3 import reflow_cjk_paragraphs
from opencc_pyo3.pdfium_helper import extract_pdf_text_pdfium_silent

raw = extract_pdf_text_pdfium_silent("input.pdf")
text = reflow_cjk_paragraphs(raw, add_pdf_page_header=False, compact=False)
converted = OpenCC("s2t").convert(text, punctuation=True)
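
When per-page results are needed, the list returned by extract_pdf_text_pages_pdfium() can be processed page by page. The helper below is a sketch with invented names that takes the reflow and convert steps as callables, so it does not depend on any particular library call.

```python
from typing import Callable, Iterable, List, Optional


def convert_pages(
    pages: Iterable[str],
    convert: Callable[[str], str],
    reflow: Optional[Callable[[str], str]] = None,
    skip_empty: bool = True,
) -> List[str]:
    """Apply optional reflow, then conversion, to each page of extracted text."""
    results = []
    for page in pages:
        if skip_empty and not page.strip():
            continue  # pages with no text layer yield nothing useful
        text = reflow(page) if reflow is not None else page
        results.append(convert(text))
    return results
```

With the package installed, convert could be OpenCC("s2t").convert and reflow could be lambda s: reflow_cjk_paragraphs(s, add_pdf_page_header=False, compact=False). Reflowing each page separately will not rejoin a paragraph that spans a page break, so whole-document reflow may be the better choice when page boundaries do not matter.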

Development

Benchmarks

Latest benchmark results for the optimized current opencc_pyo3 version. These replace the much older v0.7.0 numbers.

Package: opencc_pyo3
Python: 3.13.13
Platform: Windows-11-10.0.26200-SP0
Processor: Intel64 Family 6 Model 191 Stepping 2, GenuineIntel
Configs: s2t, s2tw, s2twp
Text sizes: 100, 1,000, 10,000, 100,000 characters

BENCHMARK RESULTS


| Method | Config | TextSize | Mean | StdDev | Min | Max | Ops/sec | Chars/sec |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Convert_Small | s2t | 100 | 0.005 ms | 0.003 ms | 0.004 ms | 0.021 ms | 188,442 | 18,844,221 |
| Convert_Medium | s2t | 1,000 | 0.038 ms | 0.006 ms | 0.036 ms | 0.066 ms | 26,189 | 26,189,437 |
| Convert_Large | s2t | 10,000 | 0.253 ms | 0.093 ms | 0.171 ms | 0.629 ms | 3,958 | 39,577,314 |
| Convert_XLarge | s2t | 100,000 | 1.394 ms | 0.166 ms | 1.156 ms | 1.699 ms | 717 | 71,726,750 |
| Convert_Small | s2tw | 100 | 0.006 ms | 0.003 ms | 0.005 ms | 0.021 ms | 175,953 | 17,595,308 |
| Convert_Medium | s2tw | 1,000 | 0.044 ms | 0.005 ms | 0.042 ms | 0.071 ms | 22,808 | 22,808,485 |
| Convert_Large | s2tw | 10,000 | 0.318 ms | 0.086 ms | 0.227 ms | 0.514 ms | 3,141 | 31,411,310 |
| Convert_XLarge | s2tw | 100,000 | 1.503 ms | 0.129 ms | 1.355 ms | 1.837 ms | 665 | 66,516,340 |
| Convert_Small | s2twp | 100 | 0.008 ms | 0.003 ms | 0.007 ms | 0.025 ms | 130,435 | 13,043,478 |
| Convert_Medium | s2twp | 1,000 | 0.054 ms | 0.006 ms | 0.052 ms | 0.084 ms | 18,378 | 18,377,849 |
| Convert_Large | s2twp | 10,000 | 0.482 ms | 0.249 ms | 0.335 ms | 1.602 ms | 2,075 | 20,746,888 |
| Convert_XLarge | s2twp | 100,000 | 1.817 ms | 0.197 ms | 1.649 ms | 2.581 ms | 550 | 55,032,341 |
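
The throughput columns appear to follow from the mean time: Ops/sec is roughly 1 / Mean and Chars/sec is roughly TextSize / Mean (for example, 100,000 characters / 1.394 ms is about 71.7 million characters per second). A minimal timing harness along those lines might look like the sketch below. It is not the project's benchmark script, and the bench function and result keys are names chosen for the example.

```python
import statistics
import time
from typing import Callable, Dict


def bench(fn: Callable[[str], str], text: str, rounds: int = 50) -> Dict[str, float]:
    """Time fn(text) for a number of rounds and derive throughput figures."""
    samples = []
    for _ in range(rounds):
        start = time.perf_counter()
        fn(text)
        samples.append(time.perf_counter() - start)
    mean = statistics.mean(samples)
    # Guard against a zero mean on timers with coarse resolution.
    per_second = (1.0 / mean) if mean > 0 else float("inf")
    return {
        "mean_ms": mean * 1000,
        "stdev_ms": statistics.stdev(samples) * 1000 if rounds > 1 else 0.0,
        "min_ms": min(samples) * 1000,
        "max_ms": max(samples) * 1000,
        "ops_per_sec": per_second,
        "chars_per_sec": len(text) * per_second,
    }
```

Any callable that takes a string works as fn, so a converter's convert method can be passed directly.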

Reproduce Benchmarks

python bench/opencc_benchmark_md.py --ci --configs s2t s2tw s2twp --sizes Small Medium Large XLarge --export md json --output-dir bench/out

Projects That Use opencc-pyo3

OpenccPyo3Gui


License

MIT


Powered by Rust, PyO3, OpenCC, Pdfium and opencc-fmmseg.
