High-performance Chinese text conversion (Simplified ↔ Traditional) powered by Rust, PyO3, and OpenCC lexicons.
opencc_pyo3
opencc_pyo3 is a Python extension module powered by Rust and PyO3,
providing fast and accurate conversion between different Chinese text variants
using OpenCC algorithms.
Features
- Convert between Simplified, Traditional, Hong Kong, Taiwan, and Japanese Kanji variants with OpenCC-compatible configurations.
- High-performance Rust + PyO3 backend for fast, memory-efficient Chinese text conversion in Python.
- Python API with OpenCC, OpenccConfig, config validation helpers, punctuation conversion, and Chinese text variant detection.
- Command-line interface for plain text conversion from files or standard input.
- Office and EPUB document conversion support for .docx, .xlsx, .pptx, .odt, .ods, .odp, and .epub.
- PDF text extraction helpers, including PDFium-based page-by-page extraction utilities.
- CJK paragraph reflow helper for cleaning PDF-extracted text before conversion.
Supported Conversion Configurations
s2t, t2s, s2tw, tw2s, s2twp, tw2sp, s2hk, hk2s, t2tw, tw2t, t2twp, tw2tp, t2hk, hk2t, t2jp, jp2t
Installation
1. Install from PyPI
pip install opencc-pyo3
2. Build and install the Python wheel using maturin:
# In project root
maturin build --release
pip install ./target/wheels/opencc_pyo3-<version>-cp<pyver>-abi3-<platform>.whl
Or, for development (may require a virtual environment):
maturin develop -r
See build.txt for detailed build and install instructions.
Usage
Python
from opencc_pyo3 import OpenCC
text = "“春眠不觉晓,处处闻啼鸟。”"
opencc = OpenCC("s2t")
converted = opencc.convert(text, punctuation=True)
print(converted) # 「春眠不覺曉,處處聞啼鳥。」
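When many strings share one configuration, a single converter can be reused across all of them. The sketch below illustrates that pattern with duck typing, so it runs even without the extension module installed. The names Converter, convert_lines, and UppercaseStub are hypothetical and not part of opencc_pyo3; only the convert(input_text, punctuation) call shape is taken from this README.

```python
from typing import Iterable, List, Protocol


class Converter(Protocol):
    """Anything exposing convert(input_text, punctuation) the way OpenCC does."""

    def convert(self, input_text: str, punctuation: bool = False) -> str: ...


def convert_lines(
    converter: Converter, lines: Iterable[str], punctuation: bool = False
) -> List[str]:
    """Convert each line separately while reusing one converter instance."""
    return [converter.convert(line, punctuation) for line in lines]


class UppercaseStub:
    """Stand-in converter, used only so this sketch runs without the package."""

    def convert(self, input_text: str, punctuation: bool = False) -> str:
        return input_text.upper()


# With the real package this would presumably be:
#   convert_lines(OpenCC("s2t"), ["汉字", "简体"], punctuation=True)
print(convert_lines(UppercaseStub(), ["abc", "def"]))
```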
CLI
You can also use the command-line interface, either as a Python module (python -m opencc_pyo3) or through the opencc-pyo3 script.
The sub-commands are:
- convert: Convert Chinese text using OpenCC
- office: Convert Office document Chinese text using OpenCC
- pdf: Convert extracted PDF document text using OpenCC
convert
python -m opencc_pyo3 convert --help
usage: opencc-pyo3 convert [-h] [-i <file>] [-o <file>] [-c <conversion>] [-p] [--in-enc <encoding>] [--out-enc <encoding>]
options:
-h, --help show this help message and exit
-i, --input <file> Read original text from <file>.
-o, --output <file> Write converted text to <file>.
-c, --config <conversion>
Conversion configuration: s2t|s2tw|s2twp|s2hk|t2s|tw2s|tw2sp|hk2s|jp2t|t2jp
-p, --punct Enable punctuation conversion. (Default: False)
--in-enc <encoding> Encoding for input. (Default: UTF-8)
--out-enc <encoding> Encoding for output. (Default: UTF-8)
office
Supports Office and OpenDocument files as well as EPUB (.docx, .xlsx, .pptx, .odt, .ods, .odp, .epub).
python -m opencc_pyo3 office --help
usage: opencc-pyo3 office [-h] [-i <file>] [-o <file>] [-c <conversion>] [-p] [-f <format>] [--auto-ext] [--keep-font]
options:
-h, --help show this help message and exit
-i, --input <file> Input Office document from <file>.
-o, --output <file> Output Office document to <file>.
-c, --config <conversion>
conversion: s2t|s2tw|s2twp|s2hk|t2s|tw2s|tw2sp|hk2s|jp2t|t2jp
-p, --punct Enable punctuation conversion. (Default: False)
-f, --format <format>
Target Office format (e.g., docx, xlsx, pptx, odt, ods, odp, epub)
--auto-ext Auto-append extension to output file
--keep-font Preserve font-family information in Office content
pdf
Supports PDF files as input, with built-in text extraction and OpenCC-based conversion powered by opencc-fmmseg
(available since v0.8.4).
This command allows you to extract Chinese text from PDF documents, optionally apply CJK-aware paragraph reflow, and convert the result using OpenCC configurations.
Note
Only text-embedded (searchable) PDF documents are supported.
Scanned or image-only PDFs without an embedded text layer are not currently supported.
python -m opencc_pyo3 pdf --help
usage: __main__.py pdf [-h] -i <file> [-o <file>] [-c <conversion>] [-p] [-H] [-r] [--compact] [--timing] [-e]
options:
-h, --help show this help message and exit
-i, --input <file> Input PDF file.
-o, --output <file> Output text file (UTF-8). If omitted, defaults to "<input>_converted.txt".
-c, --config <conversion>
Conversion configuration: s2t|s2tw|s2twp|s2hk|t2s|tw2s|tw2sp|hk2s|jp2t|t2jp
-p, --punct Enable punctuation conversion. (Default: False)
-H, --header Preserve page-break-like gaps when reflowing CJK paragraphs (passed as add_pdf_page_header to reflow_cjk_paragraphs).
-r, --reflow Enable CJK-aware paragraph reflow before conversion.
--compact Use compact paragraph mode (single newline between paragraphs).
--timing Show time use for each process workflow.
-e, --extract Extract PDF text only (skip OpenCC conversion).
python -m opencc_pyo3 convert -i input.txt -o output.txt -c s2t --punct
python -m opencc_pyo3 office -c s2t --punct -i input.docx -o output.docx --keep-font
opencc-pyo3 office -c s2tw -p -i input.epub -o output.epub
opencc-pyo3 pdf -i input.pdf -o output.txt -c s2t --punct --reflow
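When the CLI is driven from another Python script, assembling the arguments as a list avoids shell-quoting problems with file names. The following is a minimal sketch under that assumption; build_convert_command is a hypothetical helper, not something the package provides, and it only uses the -i, -o, -c, and -p flags shown in the convert help output above.

```python
import sys
from typing import List, Optional


def build_convert_command(
    input_path: str,
    output_path: str,
    config: str = "s2t",
    punct: bool = False,
    python: Optional[str] = None,
) -> List[str]:
    """Build the argv list for `python -m opencc_pyo3 convert`."""
    cmd = [
        python or sys.executable, "-m", "opencc_pyo3", "convert",
        "-i", input_path,
        "-o", output_path,
        "-c", config,
    ]
    if punct:
        cmd.append("-p")  # short form of --punct
    return cmd


cmd = build_convert_command(
    "input.txt", "output.txt", "s2t", punct=True, python="python"
)
print(" ".join(cmd))

# To actually run it (requires opencc-pyo3 to be installed):
#   import subprocess
#   subprocess.run(cmd, check=True)
```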
Python API
OpenccConfig
OpenccConfig is an enum exported from opencc_pyo3 for config-safe usage:
from opencc_pyo3 import OpenCC, OpenccConfig
cc = OpenCC(OpenccConfig.S2TW)
print(cc.convert("汉字")) # 漢字
Available enum values:
- OpenccConfig.S2T
- OpenccConfig.T2S
- OpenccConfig.S2TW
- OpenccConfig.TW2S
- OpenccConfig.S2TWP
- OpenccConfig.TW2SP
- OpenccConfig.S2HK
- OpenccConfig.HK2S
- OpenccConfig.T2TW
- OpenccConfig.TW2T
- OpenccConfig.T2TWP
- OpenccConfig.TW2TP
- OpenccConfig.T2HK
- OpenccConfig.HK2T
- OpenccConfig.T2JP
- OpenccConfig.JP2T
OpenCC
Core converter class backed by the Rust extension module.
- OpenCC(config: str | OpenccConfig = "s2t")
  - Creates a converter using a config string or OpenccConfig enum.
  - Invalid config values fall back to s2t.
- OpenCC.from_dicts(config="s2t", specs=None) -> OpenCC
  - Creates a converter with programmatic, in-memory custom dictionary entries.
- OpenCC.from_dict_files(config="s2t", specs=None) -> OpenCC
  - Creates a converter with OpenCC-style custom dictionary files.
- convert(input_text: str, punctuation: bool = False) -> str
  - Converts text using the current config.
- set_config(config: str | OpenccConfig) -> None
  - Changes the active config.
- get_config() -> str
  - Returns the current canonical config name.
- get_last_error() -> str
  - Returns the most recent config error message, or "" if none.
- zho_check(input_text: str) -> int
  - Detects the text type.
  - 1 = Traditional Chinese, 2 = Simplified Chinese, 0 = other / undetermined
- OpenCC.supported_configs() -> list[str]
  - Returns all supported config names.
- OpenCC.is_valid_config(config: str) -> bool
  - Validates a config string.
Example:
from opencc_pyo3 import OpenCC, OpenccConfig
cc = OpenCC(OpenccConfig.S2T)
print(cc.get_config()) # s2t
print(cc.convert("汉字", punctuation=True))
print(cc.zho_check("汉字")) # 2
cc.set_config("t2jp")
print(cc.convert("圖書館")) # 図書館
print(OpenCC.supported_configs())
print(OpenCC.is_valid_config("s2hk")) # True
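The integer returned by zho_check() can be turned into a readable label, or used to pick a conversion direction. The sketch below shows one way to do that. The return codes come from the API list above; describe_zho_check and suggest_config are hypothetical helpers, and the "flip to the other script" policy is only an illustration.

```python
from typing import Optional

# Return codes documented for zho_check().
ZHO_CHECK_LABELS = {
    1: "Traditional Chinese",
    2: "Simplified Chinese",
    0: "other / undetermined",
}


def describe_zho_check(code: int) -> str:
    """Map a zho_check() result to its documented meaning."""
    return ZHO_CHECK_LABELS.get(code, f"unexpected code {code}")


def suggest_config(code: int) -> Optional[str]:
    """Hypothetical policy: convert detected text to the other script."""
    if code == 2:
        return "s2t"  # Simplified input, so convert to Traditional
    if code == 1:
        return "t2s"  # Traditional input, so convert to Simplified
    return None  # undetermined text is left alone


for code in (2, 1, 0):
    print(code, describe_zho_check(code), suggest_config(code))
```

With the real package, the code would come from a call such as OpenCC().zho_check(text).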
Custom Dictionaries
OpenCC("s2t") remains the recommended API for normal use and continues to use the built-in embedded dictionaries.
Use custom dictionaries only when you need project-specific terms or overrides.
Custom dictionaries are applied during construction. The backend first loads the default embedded zstd dictionaries,
then applies post-load customization with DictionaryMaxlength::from_zstd()?.with_custom_dicts(...) or
DictionaryMaxlength::from_zstd()?.with_custom_dict_files(...). The final OpenCC instance remains immutable and
optimized after construction. Runtime hot reload is not supported; rebuild a new OpenCC instance if dictionaries need
to change.
In-memory custom dictionaries
Use OpenCC.from_dicts() for programmatic terms:
from typing import List
from opencc_pyo3 import OpenCC, CustomDictSpec
specs: List[CustomDictSpec] = [
{
"slot": "STPhrases",
"pairs": [("帕兰蒂尔", "柏蘭蒂爾")],
"mode": "append",
}
]
cc = OpenCC.from_dicts("s2t", specs)
print(cc.convert("帕兰蒂尔是一家公司"))
# 柏蘭蒂爾是一家公司
File-based custom dictionaries
Use OpenCC.from_dict_files() for OpenCC-style dictionary files:
from typing import List
from opencc_pyo3 import OpenCC, CustomDictFileSpec
specs: List[CustomDictFileSpec] = [
{
"slot": "STPhrases",
"files": ["custom_st_phrases.txt"],
"mode": "append",
}
]
cc = OpenCC.from_dict_files("s2t", specs)
Custom dictionary files use one mapping per line:
source<TAB>target
Example:
帕兰蒂尔 柏蘭蒂爾
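A dictionary file in this format can be generated and sanity-checked before it is handed to from_dict_files(). The sketch below is one possible way to do so. write_dict_file and read_dict_file are hypothetical helpers, and the UTF-8 encoding and the skipping of blank lines are assumptions of this example rather than documented backend behavior.

```python
import os
import tempfile
from typing import Iterable, List, Tuple

Pair = Tuple[str, str]


def write_dict_file(path: str, pairs: Iterable[Pair]) -> None:
    """Write one source<TAB>target mapping per line (UTF-8 is assumed)."""
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        for source, target in pairs:
            if not source or not target or any(c in source + target for c in "\t\r\n"):
                raise ValueError(f"invalid mapping: {source!r} -> {target!r}")
            handle.write(f"{source}\t{target}\n")


def read_dict_file(path: str) -> List[Pair]:
    """Parse the file back, rejecting lines that are not source<TAB>target."""
    pairs: List[Pair] = []
    with open(path, encoding="utf-8") as handle:
        for lineno, raw in enumerate(handle, start=1):
            line = raw.rstrip("\n")
            if not line:
                continue  # assumption of this sketch: blank lines are skipped
            fields = line.split("\t")
            if len(fields) != 2 or not all(fields):
                raise ValueError(f"{path}:{lineno}: expected source<TAB>target")
            pairs.append((fields[0], fields[1]))
    return pairs


with tempfile.TemporaryDirectory() as tmp:
    dict_path = os.path.join(tmp, "custom_st_phrases.txt")
    write_dict_file(dict_path, [("帕兰蒂尔", "柏蘭蒂爾")])
    round_trip = read_dict_file(dict_path)

print(len(round_trip), "mapping(s) round-tripped")
```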
Merge modes
- append: add custom entries to the existing dictionary slot. Duplicate keys follow the backend "last wins" behavior.
- override: replace the entire target dictionary slot with the custom entries or files.
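The difference between the two modes can be pictured with plain Python dictionaries. This is only a conceptual sketch, not the backend implementation: merge_slot is a hypothetical function, and it assumes that custom entries are applied after the built-in ones, so that "last wins" means a custom entry replaces a built-in entry with the same key.

```python
from typing import Dict, Iterable, Tuple


def merge_slot(
    existing: Dict[str, str], pairs: Iterable[Tuple[str, str]], mode: str
) -> Dict[str, str]:
    """Return a new slot mapping after applying custom pairs in the given mode."""
    if mode == "append":
        merged = dict(existing)  # keep built-in entries
    elif mode == "override":
        merged = {}  # discard built-in entries for this slot
    else:
        raise ValueError(f"invalid mode: {mode!r}")
    for source, target in pairs:
        merged[source] = target  # a later duplicate key replaces the earlier value
    return merged


builtin = {"汉": "漢", "发": "發"}
custom = [("发", "髮"), ("帕兰蒂尔", "柏蘭蒂爾")]

appended = merge_slot(builtin, custom, "append")
overridden = merge_slot(builtin, custom, "override")
print(len(appended), len(overridden))
```

In this toy example, append keeps the built-in entry for 汉 while the custom value for 发 replaces the built-in one; override leaves only the two custom entries.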
Supported dictionary slots
Use canonical slot names without .txt, such as STPhrases, not STPhrases.txt. The Python wrapper may tolerate
.txt, but the documented API uses canonical names only.
| Slot | Purpose | Original OpenCC file |
|---|---|---|
| STCharacters | Simplified → Traditional character mappings | STCharacters.txt |
| STPhrases | Simplified → Traditional phrase mappings | STPhrases.txt |
| STPunctuations | Simplified → Traditional punctuation mappings | STPunctuations.txt |
| TSCharacters | Traditional → Simplified character mappings | TSCharacters.txt |
| TSPhrases | Traditional → Simplified phrase mappings | TSPhrases.txt |
| TSPunctuations | Traditional → Simplified punctuation mappings | TSPunctuations.txt |
| TWPhrases | Traditional → Taiwan phrase mappings | TWPhrases.txt |
| TWPhrasesRev | Taiwan → Traditional reverse phrase mappings | TWPhrasesRev.txt |
| TWVariants | Traditional → Taiwan regional variant mappings | TWVariants.txt |
| TWVariantsRev | Taiwan → Traditional reverse variant mappings | TWVariantsRev.txt |
| TWVariantsRevPhrases | Taiwan → Traditional reverse phrase variant mappings | TWVariantsRevPhrases.txt |
| HKVariants | Traditional → Hong Kong regional variant mappings | HKVariants.txt |
| HKVariantsRev | Hong Kong → Traditional reverse variant mappings | HKVariantsRev.txt |
| HKVariantsRevPhrases | Hong Kong → Traditional reverse phrase variant mappings | HKVariantsRevPhrases.txt |
| JPSCharacters | Japanese Shinjitai character mappings | JPShinjitaiCharacters.txt |
| JPSPhrases | Japanese Shinjitai phrase mappings | JPShinjitaiPhrases.txt |
| JPVariants | Traditional → Japanese variant mappings | JPVariants.txt |
| JPVariantsRev | Japanese → Traditional reverse variant mappings | JPVariantsRev.txt |
Custom dictionary behavior follows the same OpenCC dictionary-slot model. Choosing the wrong slot may have no effect or
may affect a different conversion path. For s2t, use STCharacters or STPhrases. For t2s, use TSCharacters or
TSPhrases. For regional variants, use the relevant TW, HK, or JP slots.
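Because a wrong slot name can silently target a different conversion path, it may help to check slot names before building the specs. The sketch below lists the slots from the table above and normalizes a trailing .txt. normalize_slot and SUGGESTED_SLOTS are hypothetical names introduced for this example; the package performs its own validation at construction time.

```python
from typing import Dict, Tuple

# Canonical slot names, copied from the table above.
CANONICAL_SLOTS: Tuple[str, ...] = (
    "STCharacters", "STPhrases", "STPunctuations",
    "TSCharacters", "TSPhrases", "TSPunctuations",
    "TWPhrases", "TWPhrasesRev",
    "TWVariants", "TWVariantsRev", "TWVariantsRevPhrases",
    "HKVariants", "HKVariantsRev", "HKVariantsRevPhrases",
    "JPSCharacters", "JPSPhrases",
    "JPVariants", "JPVariantsRev",
)

# Slots this README recommends for the two base configs.
SUGGESTED_SLOTS: Dict[str, Tuple[str, ...]] = {
    "s2t": ("STCharacters", "STPhrases"),
    "t2s": ("TSCharacters", "TSPhrases"),
}


def normalize_slot(name: str) -> str:
    """Strip a trailing .txt and check the result against the canonical list."""
    slot = name[:-4] if name.endswith(".txt") else name
    if slot not in CANONICAL_SLOTS:
        raise ValueError(f"unknown dictionary slot: {name!r}")
    return slot


print(normalize_slot("STPhrases.txt"))
print(SUGGESTED_SLOTS["t2s"])
```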
Typing helpers
CustomDictSpec and CustomDictFileSpec are exported for typed Python code:
from typing import List
from opencc_pyo3 import OpenCC, CustomDictSpec
specs: List[CustomDictSpec] = [
{
"slot": "STPhrases",
"pairs": [("帕兰蒂尔", "柏蘭蒂爾")],
"mode": "append",
}
]
cc = OpenCC.from_dicts("s2t", specs)
Notes:
- Custom dictionaries are loaded at construction time.
- Existing OpenCC objects are immutable after construction.
- Runtime hot reload is not supported.
- Rebuild a new OpenCC instance if dictionaries need to change.
- Invalid slots, invalid modes, malformed lines, or unreadable files raise errors.
- OpenCC("s2t") remains the recommended API for normal users.
- Use from_dicts() for programmatic or in-memory custom terms.
- Use from_dict_files() for OpenCC-style dictionary files.
Reflow helper
- reflow_cjk_paragraphs(text: str, add_pdf_page_header: bool, compact: bool) -> str
  - Reflows PDF-extracted CJK text by merging broken line wraps while preserving paragraph structure.
  - add_pdf_page_header=True keeps explicit page-gap style boundaries.
  - compact=True uses single newlines between paragraphs.
Example:
from opencc_pyo3.opencc_pyo3 import reflow_cjk_paragraphs
raw_text = "我来到\n无人海边。"
clean_text = reflow_cjk_paragraphs(raw_text, add_pdf_page_header=False, compact=False) # 我来到无人海边。
PDF extraction APIs
The package exposes PDFium-based PDF extraction helpers in opencc_pyo3.pdfium_helper.
Import from opencc_pyo3.pdfium_helper:
- extract_pdf_pages_with_callback_pdfium(path: str, callback, add_page_header: bool = False) -> None
- extract_pdf_text_pdfium_progress(path: str) -> str
- extract_pdf_text_pdfium_silent(path: str) -> str
- extract_pdf_text_pages_pdfium(path: str) -> list[str]
- extract_pdf_text_pages_pdfium_progress(path: str) -> list[str]
- make_progress_collector() -> tuple[callback, list[str]]
- make_silent_collector() -> tuple[callback, list[str]]
Example:
from opencc_pyo3 import OpenCC
from opencc_pyo3.opencc_pyo3 import reflow_cjk_paragraphs
from opencc_pyo3.pdfium_helper import extract_pdf_text_pdfium_silent
raw = extract_pdf_text_pdfium_silent("input.pdf")
text = reflow_cjk_paragraphs(raw, add_pdf_page_header=False, compact=False)
converted = OpenCC("s2t").convert(text, punctuation=True)
Development
- Rust source: src/lib.rs
- Python bindings: opencc_pyo3/__init__.py, opencc_pyo3/opencc_pyo3.pyi
- CLI: opencc_pyo3/__main__.py
Benchmarks
Latest benchmark results for the optimized current opencc_pyo3 version.
These replace the much older v0.7.0 numbers.
Package: opencc_pyo3
Python: 3.13.13
Platform: Windows-11-10.0.26200-SP0
Processor: Intel64 Family 6 Model 191 Stepping 2, GenuineIntel
Configs: s2t, s2tw, s2twp
Text sizes: 100, 1,000, 10,000, 100,000 characters
BENCHMARK RESULTS
| Method | Config | TextSize | Mean | StdDev | Min | Max | Ops/sec | Chars/sec |
|---|---|---|---|---|---|---|---|---|
| Convert_Small | s2t | 100 | 0.005 ms | 0.003 ms | 0.004 ms | 0.021 ms | 188,442 | 18,844,221 |
| Convert_Medium | s2t | 1,000 | 0.038 ms | 0.006 ms | 0.036 ms | 0.066 ms | 26,189 | 26,189,437 |
| Convert_Large | s2t | 10,000 | 0.253 ms | 0.093 ms | 0.171 ms | 0.629 ms | 3,958 | 39,577,314 |
| Convert_XLarge | s2t | 100,000 | 1.394 ms | 0.166 ms | 1.156 ms | 1.699 ms | 717 | 71,726,750 |
| Convert_Small | s2tw | 100 | 0.006 ms | 0.003 ms | 0.005 ms | 0.021 ms | 175,953 | 17,595,308 |
| Convert_Medium | s2tw | 1,000 | 0.044 ms | 0.005 ms | 0.042 ms | 0.071 ms | 22,808 | 22,808,485 |
| Convert_Large | s2tw | 10,000 | 0.318 ms | 0.086 ms | 0.227 ms | 0.514 ms | 3,141 | 31,411,310 |
| Convert_XLarge | s2tw | 100,000 | 1.503 ms | 0.129 ms | 1.355 ms | 1.837 ms | 665 | 66,516,340 |
| Convert_Small | s2twp | 100 | 0.008 ms | 0.003 ms | 0.007 ms | 0.025 ms | 130,435 | 13,043,478 |
| Convert_Medium | s2twp | 1,000 | 0.054 ms | 0.006 ms | 0.052 ms | 0.084 ms | 18,378 | 18,377,849 |
| Convert_Large | s2twp | 10,000 | 0.482 ms | 0.249 ms | 0.335 ms | 1.602 ms | 2,075 | 20,746,888 |
| Convert_XLarge | s2twp | 100,000 | 1.817 ms | 0.197 ms | 1.649 ms | 2.581 ms | 550 | 55,032,341 |
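The Ops/sec and Chars/sec columns appear to follow directly from the mean time per call: operations per second is 1000 divided by the mean in milliseconds, and characters per second is that rate times the text size. The short check below recomputes the s2t XLarge row under that assumption. Because the table shows the mean rounded to three decimals, the recomputed figures match only approximately.

```python
from typing import Tuple


def throughput(text_size: int, mean_ms: float) -> Tuple[float, float]:
    """Derive (ops/sec, chars/sec) from a text size and mean latency in ms."""
    ops_per_sec = 1000.0 / mean_ms
    chars_per_sec = text_size * ops_per_sec
    return ops_per_sec, chars_per_sec


# s2t, 100,000 characters, mean 1.394 ms (values taken from the table above).
ops, chars = throughput(100_000, 1.394)
print(f"{ops:,.0f} ops/sec, {chars:,.0f} chars/sec")
```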
Reproduce Benchmarks
python bench/opencc_benchmark_md.py --ci --configs s2t s2tw s2twp --sizes Small Medium Large XLarge --export md json --output-dir bench/out
Projects That Use opencc-pyo3
License
Powered by Rust, PyO3, OpenCC, Pdfium and opencc-fmmseg.