High-performance Chinese text conversion (Simplified ↔ Traditional) powered by Rust, PyO3, and OpenCC lexicons.

opencc_pyo3

opencc_pyo3 is a Python extension module powered by Rust and PyO3, providing fast and accurate conversion between different Chinese text variants using OpenCC algorithms.

Features

  • Convert between Simplified, Traditional, Hong Kong, Taiwan, and Japanese Kanji variants with OpenCC-compatible configurations.
  • High-performance Rust + PyO3 backend for fast, memory-efficient Chinese text conversion in Python.
  • Python API with OpenCC, OpenccConfig, config validation helpers, punctuation conversion, and Chinese text variant detection.
  • Command-line interface for plain text conversion from files or standard input.
  • Office and EPUB document conversion support for .docx, .xlsx, .pptx, .odt, .ods, .odp, and .epub.
  • PDF text extraction helpers, including PDFium-based page-by-page extraction utilities.
  • CJK paragraph reflow helper for cleaning PDF-extracted text before conversion.

Supported Conversion Configurations

  • s2t, t2s, s2tw, tw2s, s2twp, tw2sp, s2hk, hk2s, t2tw, tw2t, t2twp, tw2tp, t2hk, hk2t, t2jp, jp2t
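The config names follow a `<source>2<target>` pattern. The sketch below spells that pattern out in plain Python; it does not call opencc_pyo3. The descriptions are the usual OpenCC readings of each code, and `VARIANTS`, `SUPPORTED_CONFIGS`, and `describe_config` are hypothetical names introduced only for this illustration.

```python
# Region codes that appear in the config names listed above.
# The descriptions follow common OpenCC terminology (an assumption of this sketch).
VARIANTS = {
    "s": "Simplified Chinese",
    "t": "Traditional Chinese",
    "tw": "Traditional Chinese (Taiwan)",
    "hk": "Traditional Chinese (Hong Kong)",
    "jp": "Japanese Kanji (Shinjitai)",
}

SUPPORTED_CONFIGS = (
    "s2t t2s s2tw tw2s s2twp tw2sp s2hk hk2s "
    "t2tw tw2t t2twp tw2tp t2hk hk2t t2jp jp2t"
).split()


def describe_config(name: str) -> str:
    """Return a readable 'source -> target' description for a config name."""
    if name not in SUPPORTED_CONFIGS:
        raise ValueError(f"unsupported config: {name!r}")
    source, target = name.split("2")
    # A trailing "p" on the target (s2twp, tw2sp, t2twp, tw2tp) marks
    # phrase-level conversion; "jp" is a region code, not a "p" suffix.
    phrases = target not in VARIANTS and target.endswith("p")
    if phrases:
        target = target[:-1]
    text = f"{VARIANTS[source]} -> {VARIANTS[target]}"
    return text + " (with phrase conversion)" if phrases else text


for cfg in ("s2t", "s2twp", "t2jp"):
    print(cfg, "=", describe_config(cfg))
```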

Installation

1. Install from PyPI

pip install opencc-pyo3

2. Build and install the Python wheel using maturin:

# In project root
maturin build --release
pip install ./target/wheels/opencc_pyo3-<version>-cp<pyver>-abi3-<platform>.whl

Or, for development (may require a virtual environment):

maturin develop -r

See build.txt for detailed build and install instructions.
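After installing, one way to confirm which version is present is to query the package metadata from the standard library. This is a minimal sketch: `installed_version` is a hypothetical helper, and it assumes the distribution name is `opencc-pyo3`, as in the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str = "opencc-pyo3") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


found = installed_version()
print(found if found else "opencc-pyo3 is not installed in this environment")
```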

Usage

Python

from opencc_pyo3 import OpenCC

text = "“春眠不觉晓,处处闻啼鸟。”"
opencc = OpenCC("s2t")
converted = opencc.convert(text, punctuation=True)
print(converted)  # 「春眠不覺曉,處處聞啼鳥。」

CLI

You can also use the command-line interface, either through the Python module (python -m opencc_pyo3) or the installed opencc-pyo3 script.
The sub-commands are:

  • convert: Convert Chinese text using OpenCC
  • office: Convert Office document Chinese text using OpenCC
  • pdf: Convert extracted PDF document text using OpenCC

convert

python -m opencc_pyo3 convert --help
usage: opencc-pyo3 convert [-h] [-i <file>] [-o <file>] [-c <conversion>] [-p] [--in-enc <encoding>] [--out-enc <encoding>]

options:
  -h, --help            show this help message and exit
  -i, --input <file>    Read original text from <file>.
  -o, --output <file>   Write converted text to <file>.
  -c, --config <conversion>
                        Conversion configuration: s2t|s2tw|s2twp|s2hk|t2s|tw2s|tw2sp|hk2s|jp2t|t2jp
  -p, --punct           Enable punctuation conversion. (Default: False)
  --in-enc <encoding>   Encoding for input. (Default: UTF-8)
  --out-enc <encoding>  Encoding for output. (Default: UTF-8)

office

Supports Office, OpenDocument, and EPUB files (.docx, .xlsx, .pptx, .odt, .ods, .odp, .epub).

python -m opencc_pyo3 office --help
usage: opencc-pyo3 office [-h] [-i <file>] [-o <file>] [-c <conversion>] [-p] [-f <format>] [--auto-ext] [--keep-font]

options:
  -h, --help            show this help message and exit
  -i, --input <file>    Input Office document from <file>.
  -o, --output <file>   Output Office document to <file>.
  -c, --config <conversion>
                        conversion: s2t|s2tw|s2twp|s2hk|t2s|tw2s|tw2sp|hk2s|jp2t|t2jp
  -p, --punct           Enable punctuation conversion. (Default: False)
  -f, --format <format>
                        Target Office format (e.g., docx, xlsx, pptx, odt, ods, odp, epub)
  --auto-ext            Auto-append extension to output file
  --keep-font           Preserve font-family information in Office content

pdf

Supports PDF files as input, with built-in text extraction and OpenCC-based conversion powered by opencc-fmmseg (available since v0.8.4).

This command allows you to extract Chinese text from PDF documents, optionally apply CJK-aware paragraph reflow, and convert the result using OpenCC configurations.

Note
Only text-embedded (searchable) PDF documents are supported.
Scanned or image-only PDFs without an embedded text layer are not currently supported.

python -m opencc_pyo3 pdf --help

usage: __main__.py pdf [-h] -i <file> [-o <file>] [-c <conversion>] [-p] [-H] [-r] [--compact] [--timing] [-e]

options:
  -h, --help            show this help message and exit
  -i, --input <file>    Input PDF file.
  -o, --output <file>   Output text file (UTF-8). If omitted, defaults to "<input>_converted.txt".
  -c, --config <conversion>
                        Conversion configuration: s2t|s2tw|s2twp|s2hk|t2s|tw2s|tw2sp|hk2s|jp2t|t2jp
  -p, --punct           Enable punctuation conversion. (Default: False)
  -H, --header          Preserve page-break-like gaps when reflowing CJK paragraphs (passed as add_pdf_page_header to reflow_cjk_paragraphs).
  -r, --reflow          Enable CJK-aware paragraph reflow before conversion.
  --compact             Use compact paragraph mode (single newline between paragraphs).
  --timing              Show time use for each process workflow.
  -e, --extract         Extract PDF text only (skip OpenCC conversion).

Examples:

python -m opencc_pyo3 convert -i input.txt -o output.txt -c s2t --punct

python -m opencc_pyo3 office -c s2t --punct -i input.docx -o output.docx --keep-font

opencc-pyo3 office -c s2tw -p -i input.epub -o output.epub

opencc-pyo3 pdf -i input.pdf -o output.txt -c s2t --punct --reflow
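To drive the pdf sub-command from a script, the argument list can be assembled programmatically and handed to `subprocess.run`. The sketch below only builds the list; `build_pdf_command` is a hypothetical helper, and it uses only flags that appear in the help output above.

```python
import sys
from typing import List, Optional


def build_pdf_command(
    input_pdf: str,
    output_txt: Optional[str] = None,
    config: Optional[str] = None,
    punct: bool = False,
    reflow: bool = False,
    compact: bool = False,
    extract_only: bool = False,
) -> List[str]:
    """Build an argv list for `python -m opencc_pyo3 pdf`."""
    cmd = [sys.executable, "-m", "opencc_pyo3", "pdf", "-i", input_pdf]
    if output_txt is not None:
        cmd += ["-o", output_txt]
    if config is not None:
        cmd += ["-c", config]
    if punct:
        cmd.append("--punct")
    if reflow:
        cmd.append("--reflow")
    if compact:
        cmd.append("--compact")
    if extract_only:
        cmd.append("--extract")
    return cmd


cmd = build_pdf_command("input.pdf", "output.txt", config="s2t", punct=True, reflow=True)
print(" ".join(cmd[1:]))
# To execute it (requires opencc-pyo3 to be installed):
#   import subprocess; subprocess.run(cmd, check=True)
```

Building a list rather than a shell string avoids quoting problems with file names that contain spaces.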

Python API

OpenccConfig

OpenccConfig is an enum exported from opencc_pyo3 for config-safe usage:

from opencc_pyo3 import OpenCC, OpenccConfig

cc = OpenCC(OpenccConfig.S2TW)
print(cc.convert("汉字"))  # 漢字

Available enum values:

  • OpenccConfig.S2T
  • OpenccConfig.T2S
  • OpenccConfig.S2TW
  • OpenccConfig.TW2S
  • OpenccConfig.S2TWP
  • OpenccConfig.TW2SP
  • OpenccConfig.S2HK
  • OpenccConfig.HK2S
  • OpenccConfig.T2TW
  • OpenccConfig.TW2T
  • OpenccConfig.T2TWP
  • OpenccConfig.TW2TP
  • OpenccConfig.T2HK
  • OpenccConfig.HK2T
  • OpenccConfig.T2JP
  • OpenccConfig.JP2T

OpenCC

Core converter class backed by the Rust extension module.

  • OpenCC(config: str | OpenccConfig = "s2t")
    • Creates a converter using a config string or OpenccConfig enum.
    • Invalid config values fall back to s2t.
  • convert(input_text: str, punctuation: bool = False) -> str
    • Converts text using the current config.
  • set_config(config: str | OpenccConfig) -> None
    • Changes the active config.
  • get_config() -> str
    • Returns the current canonical config name.
  • get_last_error() -> str
    • Returns the most recent config error message, or "" if none.
  • zho_check(input_text: str) -> int
    • Detects the text type.
    • 1 = Traditional Chinese, 2 = Simplified Chinese, 0 = other / undetermined
  • OpenCC.supported_configs() -> list[str]
    • Returns all supported config names.
  • OpenCC.is_valid_config(config: str) -> bool
    • Validates a config string.

Example:

from opencc_pyo3 import OpenCC, OpenccConfig

cc = OpenCC(OpenccConfig.S2T)
print(cc.get_config())  # s2t
print(cc.convert("汉字", punctuation=True))
print(cc.zho_check("汉字"))  # 2

cc.set_config("t2jp")
print(cc.convert("圖書館"))  # 図書館

print(OpenCC.supported_configs())
print(OpenCC.is_valid_config("s2hk"))  # True
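The integer returned by zho_check can be used to pick a conversion direction automatically. The sketch below works on the documented codes only (1 = Traditional, 2 = Simplified, 0 = other) and does not import opencc_pyo3; `TextKind` and `opposite_config` are hypothetical names.

```python
from enum import IntEnum
from typing import Optional


class TextKind(IntEnum):
    """Names for the documented zho_check() return codes."""
    OTHER = 0
    TRADITIONAL = 1
    SIMPLIFIED = 2


def opposite_config(code: int) -> Optional[str]:
    """Pick a config that converts text to the other script, or None."""
    kind = TextKind(code)  # raises ValueError for an undocumented code
    if kind is TextKind.SIMPLIFIED:
        return "s2t"
    if kind is TextKind.TRADITIONAL:
        return "t2s"
    return None


# Usage, assuming opencc_pyo3 is installed:
#   cc = OpenCC()
#   config = opposite_config(cc.zho_check(text))
#   if config is not None:
#       cc.set_config(config)
print(opposite_config(2))
```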

Reflow helper

  • reflow_cjk_paragraphs(text: str, add_pdf_page_header: bool, compact: bool) -> str
    • Reflows PDF-extracted CJK text by merging broken line wraps while preserving paragraph structure.
    • add_pdf_page_header=True keeps explicit page-gap style boundaries.
    • compact=True uses single newlines between paragraphs.

Example:

from opencc_pyo3.opencc_pyo3 import reflow_cjk_paragraphs

raw_text = "我来到\n无人海边。"
clean_text = reflow_cjk_paragraphs(raw_text, add_pdf_page_header=False, compact=False)  # 我来到无人海边。
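To make the idea of reflow concrete, here is a deliberately naive pure-Python version. It is not the library's algorithm, which is CJK-aware and handles cases this sketch ignores; it only assumes that blank lines separate paragraphs and that every other line break is a wrap to be merged. `naive_reflow` is a hypothetical name.

```python
import re


def naive_reflow(text: str, compact: bool = False) -> str:
    """Merge wrapped lines inside each paragraph (simplified illustration)."""
    # Blank lines, possibly containing whitespace, separate paragraphs.
    paragraphs = re.split(r"\n\s*\n", text.strip())
    # CJK text needs no space at a merged line break, so lines are joined directly.
    merged = ["".join(line.strip() for line in p.splitlines()) for p in paragraphs]
    separator = "\n" if compact else "\n\n"
    return separator.join(p for p in merged if p)


sample = "我来到\n无人海边。\n\n海风\n很大。"
print(naive_reflow(sample))
print(naive_reflow(sample, compact=True))
```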

PDF extraction APIs

The package currently exposes two PDF extraction layers:

  1. Rust extension functions in opencc_pyo3.opencc_pyo3
  2. PDFium-based helpers in opencc_pyo3.pdfium_helper

Legacy Rust PDF extract functions

These are still exported, but the type stub marks them as deprecated in favor of the PDFium helper module.

  • extract_pdf_text(path: str) -> str
  • extract_pdf_text_pages(path: str) -> list[str]
  • extract_pdf_pages_with_callback(path: str, callback: Callable[[int, int, str], Any]) -> None

Recommended PDFium helper functions

Import from opencc_pyo3.pdfium_helper:

  • extract_pdf_pages_with_callback_pdfium(path: str, callback, add_page_header: bool = False) -> None
  • extract_pdf_text_pdfium_progress(path: str) -> str
  • extract_pdf_text_pdfium_silent(path: str) -> str
  • extract_pdf_text_pages_pdfium(path: str) -> list[str]
  • extract_pdf_text_pages_pdfium_progress(path: str) -> list[str]
  • make_progress_collector() -> tuple[callback, list[str]]
  • make_silent_collector() -> tuple[callback, list[str]]

Example:

from opencc_pyo3 import OpenCC
from opencc_pyo3.opencc_pyo3 import reflow_cjk_paragraphs
from opencc_pyo3.pdfium_helper import extract_pdf_text_pdfium_silent

raw = extract_pdf_text_pdfium_silent("input.pdf")
text = reflow_cjk_paragraphs(raw, add_pdf_page_header=False, compact=False)
converted = OpenCC("s2t").convert(text, punctuation=True)
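For page-by-page processing, the callback-based functions expect a callable. The sketch below shows one way to write a collecting callback in the spirit of make_silent_collector. `make_collector` is a hypothetical name, and the meaning of the three arguments (page number, total pages, page text) is an assumption based on the `Callable[[int, int, str], Any]` signature of the legacy function; the PDFium variant is assumed to accept the same shape.

```python
from typing import Callable, List, Tuple

# Assumed argument order: (page_number, total_pages, page_text).
PageCallback = Callable[[int, int, str], None]


def make_collector() -> Tuple[PageCallback, List[str]]:
    """Return a callback plus the list it appends each page's text to."""
    pages: List[str] = []

    def on_page(page_number: int, total_pages: int, text: str) -> None:
        pages.append(text)

    return on_page, pages


callback, pages = make_collector()
# With opencc_pyo3 installed, the callback would be passed to an extractor, e.g.:
#   extract_pdf_pages_with_callback_pdfium("input.pdf", callback)
# Here it is called by hand to show the data flow.
callback(1, 2, "第一页")
callback(2, 2, "第二页")
print(len(pages))
```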

Development

Benchmarks

Package: opencc_pyo3
Python 3.13.5 (tags/v3.13.5:6cb20a2, Jun 11 2025, 16:15:46) [MSC v.1943 64 bit (AMD64)]
Platform: Windows-11-10.0.26100-SP0
Processor: Intel64 Family 6 Model 191 Stepping 2, GenuineIntel

BENCHMARK RESULTS


Method          Config  TextSize      Mean    StdDev       Min        Max  Ops/sec   Chars/sec
Convert_Small   s2t          100  0.118 ms  0.097 ms  0.049 ms   0.811 ms    8,499     849,910
Convert_Medium  s2t        1,000  0.250 ms  0.036 ms  0.211 ms   0.509 ms    4,004   4,003,531
Convert_Large   s2t       10,000  0.845 ms  0.060 ms  0.775 ms   1.420 ms    1,184  11,835,419
Convert_XLarge  s2t      100,000  4.755 ms  0.152 ms  4.515 ms   5.680 ms      210  21,030,543
Convert_Small   s2tw         100  0.141 ms  0.027 ms  0.096 ms   0.321 ms    7,111     711,093
Convert_Medium  s2tw       1,000  0.392 ms  0.030 ms  0.355 ms   0.623 ms    2,552   2,552,127
Convert_Large   s2tw      10,000  1.271 ms  0.044 ms  1.191 ms   1.474 ms      787   7,869,452
Convert_XLarge  s2tw     100,000  6.317 ms  0.139 ms  6.004 ms   7.250 ms      158  15,831,322
Convert_Small   s2twp        100  0.204 ms  0.028 ms  0.132 ms   0.380 ms    4,911     491,118
Convert_Medium  s2twp      1,000  0.598 ms  0.039 ms  0.527 ms   0.747 ms    1,671   1,671,296
Convert_Large   s2twp     10,000  1.942 ms  0.061 ms  1.823 ms   2.223 ms      515   5,149,357
Convert_XLarge  s2twp    100,000  9.937 ms  0.173 ms  9.542 ms  10.707 ms      101  10,063,174
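The Chars/sec column is consistent with TextSize divided by the mean time. The short check below recomputes it for two s2t rows; small differences from the table are expected because the printed means are rounded to three decimals. `chars_per_second` is a hypothetical helper.

```python
def chars_per_second(text_size: int, mean_ms: float) -> float:
    """Throughput implied by a text size (characters) and a mean time in ms."""
    return text_size / (mean_ms / 1000.0)


# (method, text size, mean in ms, Chars/sec as reported in the table)
rows = [
    ("Convert_Small", 100, 0.118, 849_910),
    ("Convert_XLarge", 100_000, 4.755, 21_030_543),
]
for method, size, mean_ms, reported in rows:
    computed = chars_per_second(size, mean_ms)
    print(f"{method}: computed {computed:,.0f} chars/sec, reported {reported:,}")
```

The throughput rises with input size, which suggests that fixed per-call overhead dominates for short strings.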

Throughput vs Size

[Figure: throughput vs. text size]


Projects That Use opencc-pyo3

OpenccPyo3Gui


License

MIT


Powered by Rust, PyO3, OpenCC, PDFium and opencc-fmmseg.
