High-performance Chinese text conversion (Simplified ↔ Traditional) powered by Rust, PyO3, and OpenCC lexicons.

opencc_pyo3

opencc_pyo3 is a Python extension module powered by Rust and PyO3, providing fast and accurate conversion between different Chinese text variants using OpenCC algorithms.

Features

  • Convert between Simplified, Traditional, Hong Kong, Taiwan, and Japanese Kanji variants with OpenCC-compatible configurations.
  • High-performance Rust + PyO3 backend for fast, memory-efficient Chinese text conversion in Python.
  • Python API with OpenCC, OpenccConfig, config validation helpers, punctuation conversion, and Chinese text variant detection.
  • Command-line interface for plain text conversion from files or standard input.
  • Office and EPUB document conversion support for .docx, .xlsx, .pptx, .odt, .ods, .odp, and .epub.
  • PDF text extraction helpers, including PDFium-based page-by-page extraction utilities.
  • CJK paragraph reflow helper for cleaning PDF-extracted text before conversion.

Supported Conversion Configurations

  • s2t, t2s, s2tw, tw2s, s2twp, tw2sp, s2hk, hk2s, t2tw, tw2t, t2twp, tw2tp, t2hk, hk2t, t2jp, jp2t
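
The config names follow a `<source>2<target>` pattern, and in upstream OpenCC a trailing `p` marks phrase-level (idiom) conversion. The sketch below is only a reading aid for that naming scheme: it does not call opencc_pyo3, and `SUPPORTED_CONFIGS`, `VARIANTS`, and `describe_config` are illustrative names, not part of the package. The variant labels are assumptions based on the upstream OpenCC descriptions.

```python
# Reading aid for the config names listed above; not part of opencc_pyo3.
SUPPORTED_CONFIGS = (
    "s2t", "t2s", "s2tw", "tw2s", "s2twp", "tw2sp", "s2hk", "hk2s",
    "t2tw", "tw2t", "t2twp", "tw2tp", "t2hk", "hk2t", "t2jp", "jp2t",
)

# Labels are descriptive assumptions based on upstream OpenCC naming.
VARIANTS = {
    "s": "Simplified",
    "t": "Traditional",
    "tw": "Traditional (Taiwan)",
    "hk": "Traditional (Hong Kong)",
    "jp": "Japanese Shinjitai",
}


def describe_config(config):
    """Return a readable description such as 'Simplified -> Traditional'."""
    if config not in SUPPORTED_CONFIGS:
        raise ValueError(f"unsupported config: {config!r}")
    source, target = config.split("2")
    with_phrases = False
    # A trailing "p" means phrase conversion, except in "jp" itself.
    if target not in VARIANTS and target.endswith("p"):
        with_phrases = True
        target = target[:-1]
    text = f"{VARIANTS[source]} -> {VARIANTS[target]}"
    return text + ", with phrases" if with_phrases else text


for name in ("s2t", "s2twp", "tw2sp", "t2jp"):
    print(name, "=", describe_config(name))
```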

Installation

1. Install from PyPI

pip install opencc-pyo3

2. Build and install the Python wheel using maturin:

# In project root
maturin build --release
pip install ./target/wheels/opencc_pyo3-<version>-cp<pyver>-abi3-<platform>.whl

Or, for development (may require a virtual environment):

maturin develop -r

See build.txt for detailed build and install instructions.

Usage

Python

from opencc_pyo3 import OpenCC

text = "“春眠不觉晓,处处闻啼鸟。”"
opencc = OpenCC("s2t")
converted = opencc.convert(text, punctuation=True)
print(converted)  # 「春眠不覺曉,處處聞啼鳥。」
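
For larger inputs, the same `convert` call can be applied line by line so the whole file never has to sit in memory. The helper below is a minimal sketch, not part of the package: `convert_file` is a hypothetical name, and it only assumes that `converter` exposes `convert(text, punctuation=...)` as shown above.

```python
from pathlib import Path


def convert_file(converter, src, dst, punctuation=False, encoding="utf-8"):
    """Convert a text file line by line; return the number of lines written.

    `converter` is any object with convert(text, punctuation=...) -> str,
    for example OpenCC("s2t") from opencc_pyo3.
    """
    count = 0
    with Path(src).open(encoding=encoding) as fin, \
            Path(dst).open("w", encoding=encoding) as fout:
        for line in fin:
            # Keep the line terminator out of the converter's input.
            if line.endswith("\n"):
                body, newline = line[:-1], "\n"
            else:
                body, newline = line, ""
            fout.write(converter.convert(body, punctuation=punctuation) + newline)
            count += 1
    return count
```

With the package installed, a call would look like `convert_file(OpenCC("s2t"), "input.txt", "output.txt", punctuation=True)`.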

CLI

The command-line interface can be run as a Python module (python -m opencc_pyo3) or through the installed opencc-pyo3 script.
The sub-commands are:

  • convert: Convert Chinese text using OpenCC
  • office: Convert Office document Chinese text using OpenCC
  • pdf: Convert extracted PDF document text using OpenCC

convert

python -m opencc_pyo3 convert --help
usage: opencc-pyo3 convert [-h] [-i <file>] [-o <file>] [-c <conversion>] [-p] [--in-enc <encoding>] [--out-enc <encoding>]

options:
  -h, --help            show this help message and exit
  -i, --input <file>    Read original text from <file>.
  -o, --output <file>   Write converted text to <file>.
  -c, --config <conversion>
                        Conversion configuration: s2t|s2tw|s2twp|s2hk|t2s|tw2s|tw2sp|hk2s|jp2t|t2jp
  -p, --punct           Enable punctuation conversion. (Default: False)
  --in-enc <encoding>   Encoding for input. (Default: UTF-8)
  --out-enc <encoding>  Encoding for output. (Default: UTF-8)

office

Supports Office and EPUB documents (.docx, .xlsx, .pptx, .odt, .ods, .odp, .epub).

python -m opencc_pyo3 office --help
usage: opencc-pyo3 office [-h] [-i <file>] [-o <file>] [-c <conversion>] [-p] [-f <format>] [--auto-ext] [--keep-font]

options:
  -h, --help            show this help message and exit
  -i, --input <file>    Input Office document from <file>.
  -o, --output <file>   Output Office document to <file>.
  -c, --config <conversion>
                        conversion: s2t|s2tw|s2twp|s2hk|t2s|tw2s|tw2sp|hk2s|jp2t|t2jp
  -p, --punct           Enable punctuation conversion. (Default: False)
  -f, --format <format>
                        Target Office format (e.g., docx, xlsx, pptx, odt, ods, odp, epub)
  --auto-ext            Auto-append extension to output file
  --keep-font           Preserve font-family information in Office content

pdf

Supports PDF files as input, with built-in text extraction and OpenCC-based conversion powered by opencc-fmmseg (available since v0.8.4).

This command extracts Chinese text from PDF documents, optionally applies CJK-aware paragraph reflow, and converts the result using OpenCC configurations.

Note
Only text-embedded (searchable) PDF documents are supported.
Scanned or image-only PDFs without an embedded text layer are not currently supported.

python -m opencc_pyo3 pdf --help

usage: __main__.py pdf [-h] -i <file> [-o <file>] [-c <conversion>] [-p] [-H] [-r] [--compact] [--timing] [-e]

options:
  -h, --help            show this help message and exit
  -i, --input <file>    Input PDF file.
  -o, --output <file>   Output text file (UTF-8). If omitted, defaults to "<input>_converted.txt".
  -c, --config <conversion>
                        Conversion configuration: s2t|s2tw|s2twp|s2hk|t2s|tw2s|tw2sp|hk2s|jp2t|t2jp
  -p, --punct           Enable punctuation conversion. (Default: False)
  -H, --header          Preserve page-break-like gaps when reflowing CJK paragraphs (passed as add_pdf_page_header to reflow_cjk_paragraphs).
  -r, --reflow          Enable CJK-aware paragraph reflow before conversion.
  --compact             Use compact paragraph mode (single newline between paragraphs).
  --timing              Show time use for each process workflow.
  -e, --extract         Extract PDF text only (skip OpenCC conversion).

Examples:

python -m opencc_pyo3 convert -i input.txt -o output.txt -c s2t --punct

python -m opencc_pyo3 office -c s2t --punct -i input.docx -o output.docx --keep-font

opencc-pyo3 office -c s2tw -p -i input.epub -o output.epub

opencc-pyo3 pdf -i input.pdf -o output.txt -c s2t --punct --reflow
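
To convert a whole folder of documents, the office sub-command can be driven from a short script. The sketch below only builds the argument lists, using the flags from the help output above; `office_commands` and `OFFICE_EXTENSIONS` are hypothetical names, not part of the package.

```python
import sys
from pathlib import Path

# Extensions listed for the office sub-command.
OFFICE_EXTENSIONS = {".docx", ".xlsx", ".pptx", ".odt", ".ods", ".odp", ".epub"}


def office_commands(folder, out_dir, config="s2t", punct=False, keep_font=False):
    """Yield one `python -m opencc_pyo3 office ...` argument list per document."""
    out_dir = Path(out_dir)
    for path in sorted(Path(folder).iterdir()):
        # Skip sub-directories and files the office sub-command does not handle.
        if not path.is_file() or path.suffix.lower() not in OFFICE_EXTENSIONS:
            continue
        cmd = [
            sys.executable, "-m", "opencc_pyo3", "office",
            "-c", config,
            "-i", str(path),
            "-o", str(out_dir / path.name),
        ]
        if punct:
            cmd.append("--punct")
        if keep_font:
            cmd.append("--keep-font")
        yield cmd
```

Each list can then be passed to `subprocess.run(cmd, check=True)`. Creating `out_dir` beforehand is the safer choice, since the help text does not say whether missing directories are created.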

Python API

OpenccConfig

OpenccConfig is an enum exported from opencc_pyo3 that lets you select a configuration without relying on raw strings:

from opencc_pyo3 import OpenCC, OpenccConfig

cc = OpenCC(OpenccConfig.S2TW)
print(cc.convert("汉字"))  # 漢字

Available enum values:

  • OpenccConfig.S2T
  • OpenccConfig.T2S
  • OpenccConfig.S2TW
  • OpenccConfig.TW2S
  • OpenccConfig.S2TWP
  • OpenccConfig.TW2SP
  • OpenccConfig.S2HK
  • OpenccConfig.HK2S
  • OpenccConfig.T2TW
  • OpenccConfig.TW2T
  • OpenccConfig.T2TWP
  • OpenccConfig.TW2TP
  • OpenccConfig.T2HK
  • OpenccConfig.HK2T
  • OpenccConfig.T2JP
  • OpenccConfig.JP2T

OpenCC

Core converter class backed by the Rust extension module.

  • OpenCC(config: str | OpenccConfig = "s2t")
    • Creates a converter using a config string or OpenccConfig enum.
    • Invalid config values fall back to s2t.
  • convert(input_text: str, punctuation: bool = False) -> str
    • Converts text using the current config.
  • set_config(config: str | OpenccConfig) -> None
    • Changes the active config.
  • get_config() -> str
    • Returns the current canonical config name.
  • get_last_error() -> str
    • Returns the most recent config error message, or "" if none.
  • zho_check(input_text: str) -> int
    • Detects the text type.
    • 1 = Traditional Chinese, 2 = Simplified Chinese, 0 = other / undetermined
  • OpenCC.supported_configs() -> list[str]
    • Returns all supported config names.
  • OpenCC.is_valid_config(config: str) -> bool
    • Validates a config string.

Example:

from opencc_pyo3 import OpenCC, OpenccConfig

cc = OpenCC(OpenccConfig.S2T)
print(cc.get_config())  # s2t
print(cc.convert("汉字", punctuation=True))
print(cc.zho_check("汉字"))  # 2

cc.set_config("t2jp")
print(cc.convert("圖書館"))  # 図書館

print(OpenCC.supported_configs())
print(OpenCC.is_valid_config("s2hk"))  # True
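
The detection and configuration methods can be combined to convert text toward the opposite script automatically. The function below is a sketch under stated assumptions: `auto_convert` is a hypothetical helper, `cc` is expected to behave like the OpenCC class described above, and the check after `set_config` relies on `get_config()` returning the same name that was requested.

```python
def auto_convert(cc, text, punctuation=False):
    """Convert text toward the opposite script based on cc.zho_check().

    Returns the text unchanged when the script cannot be determined.
    """
    # Codes documented above: 1 = Traditional, 2 = Simplified, 0 = other.
    config = {1: "t2s", 2: "s2t"}.get(cc.zho_check(text))
    if config is None:
        return text
    cc.set_config(config)
    # Confirm the switch took effect before converting.
    if cc.get_config() != config:
        raise ValueError(cc.get_last_error() or f"could not switch to {config}")
    return cc.convert(text, punctuation=punctuation)
```

With the package installed, `auto_convert(OpenCC(), "汉字")` would be expected to take the `s2t` path, since `zho_check("汉字")` is shown returning 2 above.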

Reflow helper

  • reflow_cjk_paragraphs(text: str, add_pdf_page_header: bool, compact: bool) -> str
    • Reflows PDF-extracted CJK text by merging broken line wraps while preserving paragraph structure.
    • add_pdf_page_header=True keeps explicit page-gap style boundaries.
    • compact=True uses single newlines between paragraphs.

Example:

from opencc_pyo3.opencc_pyo3 import reflow_cjk_paragraphs

raw_text = "我来到\n无人海边。"
clean_text = reflow_cjk_paragraphs(raw_text, add_pdf_page_header=False, compact=False)  # 我来到无人海边。
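
To illustrate the idea of merging broken line wraps, here is a deliberately naive pure-Python version. It is not the library's implementation and will handle real PDF text far less carefully; `naive_reflow` and `SENTENCE_END` are hypothetical names, and treating any line that ends in closing punctuation as the end of a paragraph is a simplifying assumption.

```python
# Characters assumed to close a sentence (and, naively, a paragraph).
SENTENCE_END = "。！？…」』”）"


def naive_reflow(text, compact=False):
    """Join wrapped lines until a line ends with sentence-final punctuation."""
    paragraphs = []
    current = ""
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            # A blank line always closes the paragraph in progress.
            if current:
                paragraphs.append(current)
                current = ""
            continue
        current += stripped
        if stripped[-1] in SENTENCE_END:
            paragraphs.append(current)
            current = ""
    if current:
        paragraphs.append(current)
    separator = "\n" if compact else "\n\n"
    return separator.join(paragraphs)


print(naive_reflow("我来到\n无人海边。"))  # 我来到无人海边。
```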

PDF extraction APIs

The package currently exposes two PDF extraction layers:

  1. Rust extension functions in opencc_pyo3.opencc_pyo3
  2. PDFium-based helpers in opencc_pyo3.pdfium_helper

Legacy Rust PDF extract functions

These are still exported, but the type stub marks them as deprecated in favor of the PDFium helper module.

  • extract_pdf_text(path: str) -> str
  • extract_pdf_text_pages(path: str) -> list[str]
  • extract_pdf_pages_with_callback(path: str, callback: Callable[[int, int, str], Any]) -> None

Recommended PDFium helper functions

Import from opencc_pyo3.pdfium_helper:

  • extract_pdf_pages_with_callback_pdfium(path: str, callback, add_page_header: bool = False) -> None
  • extract_pdf_text_pdfium_progress(path: str) -> str
  • extract_pdf_text_pdfium_silent(path: str) -> str
  • extract_pdf_text_pages_pdfium(path: str) -> list[str]
  • extract_pdf_text_pages_pdfium_progress(path: str) -> list[str]
  • make_progress_collector() -> tuple[callback, list[str]]
  • make_silent_collector() -> tuple[callback, list[str]]

Example:

from opencc_pyo3 import OpenCC
from opencc_pyo3.opencc_pyo3 import reflow_cjk_paragraphs
from opencc_pyo3.pdfium_helper import extract_pdf_text_pdfium_silent

raw = extract_pdf_text_pdfium_silent("input.pdf")
text = reflow_cjk_paragraphs(raw, add_pdf_page_header=False, compact=False)
converted = OpenCC("s2t").convert(text, punctuation=True)
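
For page-by-page processing, the callback-based functions can be paired with a collector. The sketch below shows what a hand-written collector might look like. It assumes the PDFium callback takes the same three arguments as the legacy `Callable[[int, int, str], Any]` type, and that they mean page number, total page count, and page text; neither point is confirmed by the signatures above. `make_collector` is a hypothetical name, distinct from the packaged `make_progress_collector` and `make_silent_collector`.

```python
def make_collector(on_progress=None):
    """Return (callback, pages); the callback appends each page's text to pages."""
    pages = []

    def callback(page_number, total_pages, text):
        # Assumed argument order: page number, total page count, page text.
        pages.append(text)
        if on_progress is not None:
            on_progress(page_number, total_pages)

    return callback, pages
```

With the package installed, the callback would be passed as the second argument to `extract_pdf_pages_with_callback_pdfium`, after which `pages` holds one string per page.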

Development

Benchmarks

Benchmark results for the current, optimized opencc_pyo3 release. They supersede the much older v0.7.0 numbers.

Package: opencc_pyo3
Python: 3.13.13
Platform: Windows-11-10.0.26200-SP0
Processor: Intel64 Family 6 Model 191 Stepping 2, GenuineIntel
Configs: s2t, s2tw, s2twp
Text sizes: 100, 1,000, 10,000, 100,000 characters

BENCHMARK RESULTS

Method          Config  TextSize  Mean      StdDev    Min       Max       Ops/sec  Chars/sec
Convert_Small   s2t     100       0.005 ms  0.003 ms  0.004 ms  0.021 ms  188,442  18,844,221
Convert_Medium  s2t     1,000     0.038 ms  0.006 ms  0.036 ms  0.066 ms  26,189   26,189,437
Convert_Large   s2t     10,000    0.253 ms  0.093 ms  0.171 ms  0.629 ms  3,958    39,577,314
Convert_XLarge  s2t     100,000   1.394 ms  0.166 ms  1.156 ms  1.699 ms  717      71,726,750
Convert_Small   s2tw    100       0.006 ms  0.003 ms  0.005 ms  0.021 ms  175,953  17,595,308
Convert_Medium  s2tw    1,000     0.044 ms  0.005 ms  0.042 ms  0.071 ms  22,808   22,808,485
Convert_Large   s2tw    10,000    0.318 ms  0.086 ms  0.227 ms  0.514 ms  3,141    31,411,310
Convert_XLarge  s2tw    100,000   1.503 ms  0.129 ms  1.355 ms  1.837 ms  665      66,516,340
Convert_Small   s2twp   100       0.008 ms  0.003 ms  0.007 ms  0.025 ms  130,435  13,043,478
Convert_Medium  s2twp   1,000     0.054 ms  0.006 ms  0.052 ms  0.084 ms  18,378   18,377,849
Convert_Large   s2twp   10,000    0.482 ms  0.249 ms  0.335 ms  1.602 ms  2,075    20,746,888
Convert_XLarge  s2twp   100,000   1.817 ms  0.197 ms  1.649 ms  2.581 ms  550      55,032,341
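
The Ops/sec and Chars/sec columns appear to be derived from the mean time: operations per second is 1000 divided by the mean in milliseconds, and characters per second is that figure times the text size. The snippet below is a sketch of that arithmetic, not the project's benchmark script; `throughput` is a hypothetical helper. Because the table shows rounded means, recomputed values differ slightly from the published ones.

```python
def throughput(mean_ms, text_size):
    """Return (ops_per_sec, chars_per_sec) for one conversion of text_size chars."""
    ops_per_sec = 1000.0 / mean_ms
    return ops_per_sec, ops_per_sec * text_size


# Convert_XLarge, s2t row: mean 1.394 ms for 100,000 characters.
ops, chars = throughput(1.394, 100_000)
print(f"{ops:,.0f} ops/sec, {chars:,.0f} chars/sec")
```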

Reproduce Benchmarks

python bench/opencc_benchmark_md.py --ci --configs s2t s2tw s2twp --sizes Small Medium Large XLarge --export md json --output-dir bench/out

Projects That Use opencc-pyo3

OpenccPyo3Gui


License

MIT


Powered by Rust, PyO3, OpenCC, PDFium, and opencc-fmmseg.