
High-performance Chinese text conversion (Simplified ↔ Traditional), segmentation and keyword extraction powered by Rust, PyO3, Jieba and OpenCC lexicons.


opencc_jieba_pyo3


opencc_jieba_pyo3 is a Python extension module powered by Rust, Jieba and PyO3, providing fast and accurate conversion between different Chinese text variants using opencc-jieba-rs and OpenCC algorithms.

Features

  • Convert between Simplified, Traditional, Hong Kong, and Taiwan variants of Chinese text, and between Japanese Kyujitai and Shinjitai kanji.
  • Fast and memory-efficient, leveraging Rust's performance.
  • Easy-to-use Python API.
  • Supports punctuation conversion and automatic detection of the text variant (Traditional or Simplified).
  • Chinese word segmentation for both Traditional and Simplified text (Jieba).
  • Keyword extraction (TF-IDF, TextRank).
  • Utility functions for punctuation handling and language detection.

🔁 Supported Conversion Configs

| Code  | Description                                    |
|-------|------------------------------------------------|
| s2t   | Simplified → Traditional                       |
| t2s   | Traditional → Simplified                       |
| s2tw  | Simplified → Traditional (Taiwan)              |
| tw2s  | Traditional (Taiwan) → Simplified              |
| s2twp | Simplified → Traditional (Taiwan) with idioms  |
| tw2sp | Traditional (Taiwan) → Simplified with idioms  |
| s2hk  | Simplified → Traditional (Hong Kong)           |
| hk2s  | Traditional (Hong Kong) → Simplified           |
| t2tw  | Traditional → Traditional (Taiwan)             |
| tw2t  | Traditional (Taiwan) → Traditional             |
| t2twp | Traditional → Traditional (Taiwan) with idioms |
| tw2tp | Traditional (Taiwan) → Traditional with idioms |
| t2hk  | Traditional → Traditional (Hong Kong)          |
| hk2t  | Traditional (Hong Kong) → Traditional          |
| t2jp  | Japanese Kyujitai → Shinjitai                  |
| jp2t  | Japanese Shinjitai → Kyujitai                  |

Installation

Build and install the Python wheel using maturin:

# In project root
maturin build --release
pip install ./target/wheels/opencc_jieba_pyo3-<version>-cp<pyver>-abi3-<platform>.whl

Or for development:

maturin develop -r

See BUILD.md for detailed build and install instructions.
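
After installing, it can be useful to confirm that the module is visible to the interpreter you intend to use. This is a small sketch using only the standard library; `is_installed` is a hypothetical helper name, and the check only locates the module without importing it.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if `module_name` can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    name = "opencc_jieba_pyo3"
    status = "found" if is_installed(name) else "not found"
    print(f"{name}: {status}")
```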


Usage

Python

from opencc_jieba_pyo3 import OpenCC

text = "“春眠不觉晓,处处闻啼鸟。”"
opencc = OpenCC("s2t")
converted = opencc.convert(text, punctuation=True)
print(converted)  # 「春眠不覺曉,處處聞啼鳥。」

# Segmentation
words = opencc.jieba_cut(text, hmm=True)
print(words)  # ['春眠', '不觉', '晓', ',', '处处', '闻', '啼鸟', '。']

# Segmentation and join
joined = opencc.jieba_cut_and_join(text, delimiter="/")
print(joined)  # 春眠/不觉/晓/,/处处/闻/啼鸟/。

# Keyword extraction (TextRank)
keywords = opencc.jieba_keyword_extract_textrank(text, top_k=3)
print(keywords)  # ['春眠', '啼鸟', '处处']

# Keyword extraction (TF-IDF)
keywords_tfidf = opencc.jieba_keyword_extract_tfidf(text, top_k=3)
print(keywords_tfidf)  # ['春眠', '啼鸟', '处处']

# Keyword weights (TextRank)
kw_weights = opencc.jieba_keyword_weight_textrank(text, top_k=3)
print(kw_weights)  # [('春眠', 1.23), ('啼鸟', 0.98), ('处处', 0.75)]

# Keyword weights (TF-IDF)
kw_weights_tfidf = opencc.jieba_keyword_weight_tfidf(text, top_k=3)
print(kw_weights_tfidf)  # [('春眠', 2.34), ('啼鸟', 1.56), ('处处', 1.12)]
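
TextRank and TF-IDF weights are not necessarily on the same scale, so comparing them directly can be misleading. One possible approach is to rescale each list so its weights sum to 1. The sketch below operates on plain `(keyword, weight)` tuples shaped like the return value of the weight methods; `normalize_weights` is a hypothetical helper, and the sample numbers are made up for illustration.

```python
from typing import List, Tuple


def normalize_weights(pairs: List[Tuple[str, float]]) -> List[Tuple[str, float]]:
    """Scale weights so they sum to 1.0, preserving the original order."""
    total = sum(weight for _, weight in pairs)
    if total == 0:
        # Nothing to scale; return a copy unchanged.
        return list(pairs)
    return [(word, weight / total) for word, weight in pairs]


# Placeholder input shaped like the documented return value;
# the numbers are invented, not real library output.
sample = [("春眠", 2.0), ("啼鸟", 1.0), ("处处", 1.0)]
shares = normalize_weights(sample)
```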

CLI

The CLI can be invoked either as a Python module (python -m opencc_jieba_pyo3) or through the installed script (opencc-jieba-pyo3).
It provides three subcommands:

  • convert: Convert Chinese text using OpenCC + Jieba
  • segment: Segment Chinese text using Jieba
  • office: Convert Office document Chinese text using OpenCC + Jieba

convert

Module: python -m opencc_jieba_pyo3 convert --help
Script: opencc-jieba-pyo3 convert --help

usage: opencc_jieba_pyo3 convert [-h] [-i <file>] [-o <file>] [-c <conversion>] [-p] [--in-enc <encoding>] [--out-enc <encoding>]

options:
  -h, --help            show this help message and exit
  -i, --input <file>    Read original text from <file>.
  -o, --output <file>   Write converted text to <file>.
  -c, --config <conversion>
                        Conversion configuration: [s2t|s2tw|s2twp|s2hk|t2s|tw2s|tw2sp|hk2s|jp2t|t2jp]
  -p, --punct           Punctuation conversion
  --in-enc <encoding>   Encoding for input
  --out-enc <encoding>  Encoding for output

segment

Module: python -m opencc_jieba_pyo3 segment --help
Script: opencc-jieba-pyo3 segment --help

usage: opencc-jieba-pyo3 segment [-h] [-i <file>] [-o <file>] [-d <char>] [--mode {cut,search,full}] [--in-enc <encoding>] [--out-enc <encoding>]

options:
  -h, --help            show this help message and exit
  -i, --input <file>    Read input text from <file>. (default: None)
  -o, --output <file>   Write segmented text to <file>. (default: None)
  -d, --delim <char>    Delimiter to join segments (default: )
  --mode {cut,search,full}
                        Segmentation mode (default: cut)
  --in-enc <encoding>   Encoding for input (default: UTF-8)
  --out-enc <encoding>  Encoding for output (default: UTF-8)

office

Module: python -m opencc_jieba_pyo3 office --help

usage: opencc_jieba_pyo3 office [-h] [-i <file>] [-o <file>] [-c <conversion>] [-p] [-f <format>] [--auto-ext] [--keep-font]

options:
  -h, --help            show this help message and exit
  -i, --input <file>    Input Office document from <file>.
  -o, --output <file>   Output Office document to <file>.
  -c, --config <conversion>
                        conversion: s2t|s2tw|s2twp|s2hk|t2s|tw2s|tw2sp|hk2s|jp2t|t2jp
  -p, --punct           Punctuation conversion
  -f, --format <format>
                        Target Office format (e.g., docx, xlsx, pptx, odt, ods, odp, epub)
  --auto-ext            Auto-append extension to output file
  --keep-font           Preserve font-family information in Office content

Examples

python -m opencc_jieba_pyo3 convert -i input.txt -o output.txt -c s2t --punct
opencc-jieba-pyo3 convert -i input.txt -o output.txt -c s2t --punct

python -m opencc_jieba_pyo3 segment -i input.txt -o output.txt --delim "/"
opencc-jieba-pyo3 segment -i input.txt -o output.txt --delim "/" --mode search

python -m opencc_jieba_pyo3 office -i input.docx -o output.docx -c s2t --punct --keep-font
opencc-jieba-pyo3 office -i input.epub -o output.epub -c s2tw --punct
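
When the CLI needs to be driven from another Python program, for example to convert a batch of files, the command line can be assembled as an argument list and passed to `subprocess`. This is a sketch under the assumption that the package is installed for the running interpreter; `build_convert_cmd` and `run_convert` are hypothetical helper names, and only the flags shown in the help output above are used.

```python
import subprocess
import sys
from typing import List


def build_convert_cmd(src: str, dst: str, config: str = "s2t",
                      punct: bool = False) -> List[str]:
    """Assemble the argument list for the `convert` subcommand."""
    cmd = [sys.executable, "-m", "opencc_jieba_pyo3", "convert",
           "-i", src, "-o", dst, "-c", config]
    if punct:
        cmd.append("--punct")
    return cmd


def run_convert(src: str, dst: str, **kwargs) -> None:
    """Run the conversion; raises CalledProcessError on a non-zero exit."""
    subprocess.run(build_convert_cmd(src, dst, **kwargs), check=True)
```

Passing a list rather than a shell string avoids quoting problems with file names that contain spaces.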

API

Class: OpenCC

Unified Python interface for OpenCC and Jieba functionalities.

Constructor

  • OpenCC(config: str = "s2t")
    • config: Conversion configuration (see above). Defaults to "s2t".

Attributes

  • config: str
    • Current OpenCC conversion configuration.

Methods

  • convert(input: str, punctuation: bool = False) -> str

    • Convert Chinese text using the current OpenCC config.
    • input: Input text.
    • punctuation: Whether to convert Chinese/Japanese punctuation to the target variant.
    • Returns: Converted text as a string.
  • zho_check(input: str) -> int

    • Detect the type of Chinese in the input text.
    • Returns: Integer code (1: Traditional, 2: Simplified, 0: Others).
  • jieba_cut(input: str, hmm: bool = True) -> list[str]

    • Segment Chinese text using Jieba.
    • input: Input text.
    • hmm: Whether to use the Hidden Markov Model (HMM) to recognize words missing from the dictionary.
    • Returns: List of segmented words.
  • jieba_cut_and_join(input: str, delimiter: str = "/") -> str

    • Segment and join Chinese text using Jieba.
    • input: Input text.
    • delimiter: Delimiter for joining words.
    • Returns: Joined segmented string.
  • jieba_keyword_extract_textrank(input: str, top_k: int) -> list[str]

    • Extract keywords using the TextRank algorithm.
    • input: Input text.
    • top_k: Number of keywords to extract.
    • Returns: List of keywords.
  • jieba_keyword_extract_tfidf(input: str, top_k: int) -> list[str]

    • Extract keywords using the TF-IDF algorithm.
    • input: Input text.
    • top_k: Number of keywords to extract.
    • Returns: List of keywords.
  • jieba_keyword_weight_textrank(input: str, top_k: int) -> list[tuple[str, float]]

    • Extract keywords and their weights using TextRank.
    • input: Input text.
    • top_k: Number of keywords to extract.
    • Returns: List of (keyword, weight) tuples.
  • jieba_keyword_weight_tfidf(input: str, top_k: int) -> list[tuple[str, float]]

    • Extract keywords and their weights using TF-IDF.
    • input: Input text.
    • top_k: Number of keywords to extract.
    • Returns: List of (keyword, weight) tuples.
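
The return code of zho_check can be used to choose a conversion direction automatically. The sketch below maps the documented codes (1: Traditional, 2: Simplified, 0: Others) to a config and leaves other text untouched. `pick_config` and `flip_variant` are hypothetical helpers; `flip_variant` imports the package lazily and assumes it is installed.

```python
from typing import Optional

# Return codes documented for OpenCC.zho_check.
ZHO_TRADITIONAL = 1
ZHO_SIMPLIFIED = 2


def pick_config(zho_code: int) -> Optional[str]:
    """Map a zho_check code to the config that converts to the other variant."""
    if zho_code == ZHO_TRADITIONAL:
        return "t2s"
    if zho_code == ZHO_SIMPLIFIED:
        return "s2t"
    return None  # code 0: not recognized as Traditional or Simplified


def flip_variant(text: str) -> str:
    """Convert Traditional text to Simplified and vice versa."""
    # Imported here so pick_config stays usable without the package.
    from opencc_jieba_pyo3 import OpenCC

    config = pick_config(OpenCC().zho_check(text))
    if config is None:
        return text
    return OpenCC(config).convert(text)
```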

Development

Rust Module Required

opencc-jieba-rs : A Rust implementation of Jieba + OpenCC


Benchmarks

Package: opencc_jieba_pyo3
Python 3.13.4 (tags/v3.13.4:8a526ec, Jun  3 2025, 17:46:04) [MSC v.1943 64 bit (AMD64)]
Platform: Windows-11-10.0.26100-SP0
Processor: Intel64 Family 6 Model 191 Stepping 2, GenuineIntel

BENCHMARK RESULTS

| Method         | Config | TextSize | Mean      | StdDev   | Min       | Max       | Ops/sec | Chars/sec  |
|----------------|--------|----------|-----------|----------|-----------|-----------|---------|------------|
| Convert_Small  | s2t    | 100      | 0.161 ms  | 0.109 ms | 0.080 ms  | 0.794 ms  | 6,217   | 621,740    |
| Convert_Medium | s2t    | 1,000    | 0.389 ms  | 0.092 ms | 0.286 ms  | 0.829 ms  | 2,571   | 2,571,236  |
| Convert_Large  | s2t    | 10,000   | 1.261 ms  | 0.314 ms | 1.072 ms  | 2.580 ms  | 793     | 7,932,120  |
| Convert_XLarge | s2t    | 100,000  | 7.290 ms  | 0.464 ms | 6.864 ms  | 9.848 ms  | 137     | 13,716,798 |
| Convert_Small  | s2tw   | 100      | 0.189 ms  | 0.104 ms | 0.103 ms  | 0.620 ms  | 5,285   | 528,519    |
| Convert_Medium | s2tw   | 1,000    | 0.442 ms  | 0.152 ms | 0.322 ms  | 1.084 ms  | 2,264   | 2,264,206  |
| Convert_Large  | s2tw   | 10,000   | 1.508 ms  | 0.200 ms | 1.367 ms  | 2.371 ms  | 663     | 6,631,682  |
| Convert_XLarge | s2tw   | 100,000  | 9.403 ms  | 0.585 ms | 9.009 ms  | 13.320 ms | 106     | 10,635,363 |
| Convert_Small  | s2twp  | 100      | 0.235 ms  | 0.113 ms | 0.129 ms  | 0.648 ms  | 4,256   | 425,586    |
| Convert_Medium | s2twp  | 1,000    | 0.518 ms  | 0.112 ms | 0.363 ms  | 0.913 ms  | 1,932   | 1,932,266  |
| Convert_Large  | s2twp  | 10,000   | 1.786 ms  | 0.209 ms | 1.590 ms  | 2.739 ms  | 560     | 5,598,571  |
| Convert_XLarge | s2twp  | 100,000  | 11.644 ms | 0.979 ms | 10.892 ms | 17.130 ms | 86      | 8,588,034  |
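
The Ops/sec and Chars/sec figures are consistent with being derived from the mean time: operations per second is roughly 1000 divided by the mean in milliseconds, and characters per second is that rate multiplied by the text size. The sketch below shows this arithmetic together with a simple timing loop. It is not the script used to produce the results above; `throughput` and `bench` are hypothetical helpers, and small differences from the reported figures are expected because the mean shown is rounded.

```python
import statistics
import time
from typing import Callable, Dict


def throughput(mean_ms: float, text_size: int) -> Dict[str, float]:
    """Derive operations/sec and characters/sec from a mean time in ms."""
    ops_per_sec = 1000.0 / mean_ms
    return {"ops_per_sec": ops_per_sec,
            "chars_per_sec": ops_per_sec * text_size}


def bench(fn: Callable[[str], object], text: str,
          rounds: int = 100) -> Dict[str, float]:
    """Time fn(text) `rounds` times and summarize in milliseconds."""
    samples = []
    for _ in range(rounds):
        start = time.perf_counter()
        fn(text)
        samples.append((time.perf_counter() - start) * 1000.0)
    return {
        "mean_ms": statistics.mean(samples),
        "stdev_ms": statistics.stdev(samples) if len(samples) > 1 else 0.0,
        "min_ms": min(samples),
        "max_ms": max(samples),
    }


# Reproduce the Convert_XLarge / s2t row from its reported mean.
row = throughput(7.290, 100_000)
```

To time the converter itself, `bench` could be called with a bound method such as `OpenCC("s2t").convert` and a text of the desired length.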

Throughput vs. Size

[Figure: throughput vs. text size chart]


License

MIT


Powered by Rust, Jieba, PyO3, OpenCC and opencc-jieba-rs.

