High-performance Chinese text conversion (Simplified ↔ Traditional), segmentation and keyword extraction powered by Rust, PyO3, Jieba and OpenCC lexicons.

opencc_jieba_pyo3

opencc_jieba_pyo3 is a Python extension module powered by Rust, Jieba and PyO3, providing fast and accurate conversion between different Chinese text variants using opencc-jieba-rs and OpenCC algorithms.

Features

  • Convert between Simplified, Traditional, Hong Kong, and Taiwan variants of Chinese text, and between Japanese Kyujitai and Shinjitai kanji.
  • Fast and memory-efficient, leveraging Rust's performance.
  • Easy-to-use Python API.
  • Supports punctuation conversion and automatic text code detection.
  • Chinese word segmentation with Jieba (both Traditional and Simplified).
  • Keyword extraction (TF-IDF, TextRank).
  • Utility functions for punctuation handling and language detection.

🔁 Supported Conversion Configs

Code    Description
s2t     Simplified → Traditional
t2s     Traditional → Simplified
s2tw    Simplified → Traditional (Taiwan)
tw2s    Traditional (Taiwan) → Simplified
s2twp   Simplified → Traditional (Taiwan) with idioms
tw2sp   Traditional (Taiwan) → Simplified with idioms
s2hk    Simplified → Traditional (Hong Kong)
hk2s    Traditional (Hong Kong) → Simplified
t2tw    Traditional → Traditional (Taiwan)
tw2t    Traditional (Taiwan) → Traditional
t2twp   Traditional → Traditional (Taiwan) with idioms
tw2tp   Traditional (Taiwan) → Traditional with idioms
t2hk    Traditional → Traditional (Hong Kong)
hk2t    Traditional (Hong Kong) → Traditional
t2jp    Japanese Kyujitai → Shinjitai
jp2t    Japanese Shinjitai → Kyujitai
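
The codes in the table come in direction pairs (for example s2t and t2s). The following is a minimal sketch of a lookup that returns the opposite direction for a given code. REVERSE_CONFIG and reverse_config are hypothetical names used for illustration; they are not part of the package. Converting a text and then converting it back may not reproduce the original exactly, since some character mappings are one-to-many.

```python
# Pairs of opposite conversion directions, taken from the config table.
_PAIRS = [
    ("s2t", "t2s"),
    ("s2tw", "tw2s"),
    ("s2twp", "tw2sp"),
    ("s2hk", "hk2s"),
    ("t2tw", "tw2t"),
    ("t2twp", "tw2tp"),
    ("t2hk", "hk2t"),
    ("t2jp", "jp2t"),
]

# Build a two-way lookup so either code of a pair finds the other.
REVERSE_CONFIG = {}
for forward, backward in _PAIRS:
    REVERSE_CONFIG[forward] = backward
    REVERSE_CONFIG[backward] = forward


def reverse_config(code: str) -> str:
    """Return the config code for the opposite direction (hypothetical helper)."""
    key = code.strip().lower()
    if key not in REVERSE_CONFIG:
        raise ValueError(f"unknown conversion config: {code!r}")
    return REVERSE_CONFIG[key]


print(reverse_config("s2twp"))  # tw2sp
print(reverse_config("HK2T"))   # t2hk
```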

Installation

Build and install the Python wheel using maturin:

# In project root
maturin build --release
pip install ./target/wheels/opencc_jieba_pyo3-<version>-cp<pyver>-abi3-<platform>.whl

Or for development:

maturin develop -r

See BUILD.md for detailed build and install instructions.


Usage

Python

from opencc_jieba_pyo3 import OpenCC

text = "“春眠不觉晓,处处闻啼鸟。”"
segment_text = "我独自来到无人海岸线"
opencc = OpenCC("s2t")
converted = opencc.convert(text, punctuation=True)
print(converted)  # 「春眠不覺曉,處處聞啼鳥。」

# Segmentation
words = opencc.jieba_cut(segment_text, hmm=True)
print(words)  # ['我', '独自', '来到', '无人', '海岸线']

# Segmentation and join
joined = opencc.jieba_segment_join(segment_text, mode="cut", delim="/")
print(joined)  # 我/独自/来到/无人/海岸线

joined = opencc.jieba_segment_join(segment_text, mode="search", delim="/")
print(joined)  # 我/独自/来到/无人/海岸/岸线/海岸线

joined = opencc.jieba_segment_join(segment_text, mode="full", delim="/")
print(joined)  # 我/独/独自/自/自来/来/来到/到/无/无人/人/人海/海/海岸/海岸线/岸/岸线/线

joined = opencc.jieba_segment_join(segment_text, mode="tag", delim=" ")
print(joined)  # 我/r 独自/d 来到/v 无人/n 海岸线/n

# Keyword extraction (TextRank)
keywords = opencc.jieba_keyword_extract_textrank(segment_text, top_k=3)
print(keywords)  # ['海岸线', '无人', '来到']

# Keyword extraction (TF-IDF)
keywords_tfidf = opencc.jieba_keyword_extract_tfidf(segment_text, top_k=3)
print(keywords_tfidf)  # ['海岸线', '独自', '无人']

# Keyword weights (TextRank)
kw_weights = opencc.jieba_keyword_weight_textrank(segment_text, top_k=3)
print(kw_weights)  # [('海岸线', 9987587364.22353), ('无人', 9986551019.39923), ('来到', 9985428148.988083)]

# Keyword weights (TF-IDF)
kw_weights_tfidf = opencc.jieba_keyword_weight_tfidf(segment_text, top_k=3)
print(kw_weights_tfidf)  # [('海岸线', 1.995445949425), ('独自', 1.8446462134525), ('无人', 1.7299179778125)]
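
The example above does not exercise zho_check. The following is a minimal sketch of using it to convert only text that is detected as Traditional, assuming the return codes listed in the API section (1: Traditional, 2: Simplified, 0: others). to_simplified is a hypothetical helper, not part of the package. It takes the converter as an argument and relies only on zho_check, set_config and convert.

```python
TRADITIONAL = 1  # zho_check code for Traditional, per the API section


def to_simplified(text, converter):
    """Convert text to Simplified only when it is detected as Traditional.

    `converter` is expected to behave like opencc_jieba_pyo3.OpenCC.
    Note: this switches the converter's active config to "t2s".
    """
    if converter.zho_check(text) != TRADITIONAL:
        # Simplified (2) or other (0): return the text untouched.
        return text
    converter.set_config("t2s")
    return converter.convert(text)
```

With the package installed, a call might look like to_simplified("春眠不覺曉", OpenCC("t2s")).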

CLI

The CLI is available either as a Python module (python -m opencc_jieba_pyo3) or as an installed script (opencc-jieba-pyo3). It provides three subcommands:

  • convert: Convert Chinese text using OpenCC + Jieba
  • segment: Segment Chinese text using Jieba
  • office: Convert Office document Chinese text using OpenCC + Jieba

convert

Module: python -m opencc_jieba_pyo3 convert --help
Script: opencc-jieba-pyo3 convert --help

usage: opencc_jieba_pyo3 convert [-h] [-i <file>] [-o <file>] [-c <conversion>] [-p] [--in-enc <encoding>] [--out-enc <encoding>]

options:
  -h, --help            show this help message and exit
  -i, --input <file>    Read original text from <file>.
  -o, --output <file>   Write converted text to <file>.
  -c, --config <conversion>
                        Conversion configuration: [s2t|s2tw|s2twp|s2hk|t2s|tw2s|tw2sp|hk2s|jp2t|t2jp]
  -p, --punct           Punctuation conversion
  --in-enc <encoding>   Encoding for input
  --out-enc <encoding>  Encoding for output

segment

Module: python -m opencc_jieba_pyo3 segment --help
Script: opencc-jieba-pyo3 segment --help

usage: opencc-jieba-pyo3 segment [-h] [-i <file>] [-o <file>] [-d <char>] [-s <char>] [--no-hmm] [-m {cut,search,full,tag}] [--in-enc <encoding>]
                                 [--out-enc <encoding>]

optional arguments:
  -h, --help            show this help message and exit
  -i <file>, --input <file>
                        Read input text from <file>. (default: None)
  -o <file>, --output <file>
                        Write segmented text to <file>. (default: None)
  -d <char>, --delim <char>
                        Delimiter to join segments (default: )
  -s <char>, --separator <char>
                        Separator for segment mode: tag (default: /)
  --no-hmm              Disable HMM (default: False)
  -m {cut,search,full,tag}, --mode {cut,search,full,tag}
                        Segmentation mode (default: cut)
  --in-enc <encoding>   Encoding for input (default: UTF-8)
  --out-enc <encoding>  Encoding for output (default: UTF-8)

office

Module: python -m opencc_jieba_pyo3 office --help
Script: opencc-jieba-pyo3 office --help

usage: opencc_jieba_pyo3 office [-h] [-i <file>] [-o <file>] [-c <conversion>] [-p] [-f <format>] [--auto-ext] [--keep-font]

options:
  -h, --help            show this help message and exit
  -i, --input <file>    Input Office document from <file>.
  -o, --output <file>   Output Office document to <file>.
  -c, --config <conversion>
                        conversion: s2t|s2tw|s2twp|s2hk|t2s|tw2s|tw2sp|hk2s|jp2t|t2jp
  -p, --punct           Punctuation conversion
  -f, --format <format>
                        Target Office format (e.g., docx, xlsx, pptx, odt, ods, odp, epub)
  --auto-ext            Auto-append extension to output file
  --keep-font           Preserve font-family information in Office content

Examples

python -m opencc_jieba_pyo3 convert -i input.txt -o output.txt -c s2t --punct
opencc-jieba-pyo3 convert -i input.txt -o output.txt -c s2t --punct

python -m opencc_jieba_pyo3 segment -i input.txt -o output.txt --delim "/"
opencc-jieba-pyo3 segment -i input.txt -o output.txt --delim "/" --mode search

python -m opencc_jieba_pyo3 office -i input.docx -o output.docx -c s2t --punct --keep-font
opencc-jieba-pyo3 office -i input.epub -o output.epub -c s2tw --punct
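
When the CLI is driven from another Python program, it can help to assemble the argument list in one place. The sketch below builds the arguments for the convert subcommand using only the flags shown in its help text. build_convert_command is a hypothetical helper, not something the package provides.

```python
import sys
from typing import List, Optional


def build_convert_command(
    input_path: str,
    output_path: str,
    config: str = "s2t",
    punct: bool = False,
    in_enc: Optional[str] = None,
    out_enc: Optional[str] = None,
) -> List[str]:
    """Assemble an argument list for the `convert` subcommand (hypothetical helper)."""
    cmd = [
        sys.executable, "-m", "opencc_jieba_pyo3", "convert",
        "-i", input_path,
        "-o", output_path,
        "-c", config,
    ]
    if punct:
        cmd.append("--punct")
    if in_enc:
        cmd += ["--in-enc", in_enc]
    if out_enc:
        cmd += ["--out-enc", out_enc]
    return cmd


cmd = build_convert_command("input.txt", "output.txt", config="s2t", punct=True)
print(cmd[1:])  # everything after the interpreter path

# To actually run it (requires the package to be installed):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with file names that contain spaces.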

API

Class: OpenCC

Unified Python interface for OpenCC and Jieba functionalities.

Constructor

  • OpenCC(config: str = "s2t")
    • config: Conversion configuration (see above). Defaults to "s2t".

Attributes

  • config: str
    • Current OpenCC conversion configuration.

Methods

  • is_valid_config(config: str) -> bool

    • Check whether config is a supported OpenCC conversion name.
  • supported_configs() -> list[str]

    • Return all supported OpenCC conversion names in canonical lowercase form.
  • canonicalise_config(config: str) -> str

    • Normalize a valid config name to its canonical lowercase form.
  • set_config(config: str) -> None

    • Update the active OpenCC conversion configuration.
  • get_config() -> str

    • Return the current OpenCC conversion configuration.
  • convert(input: str, punctuation: bool = False) -> str

    • Convert Chinese text using the current OpenCC config.
    • input: Input text.
    • punctuation: Whether to convert Chinese/Japanese punctuation to the target variant.
    • Returns: Converted text as a string.
  • zho_check(input: str) -> int

    • Detect the type of Chinese in the input text.
    • Returns: Integer code (1: Traditional, 2: Simplified, 0: Others).
  • jieba_cut(input: str, hmm: bool = True) -> list[str]

    • Segment Chinese text using Jieba accurate mode.
    • input: Input text.
    • hmm: Whether to use HMM for new words.
    • Returns: List of segmented words.
  • jieba_cut_for_search(input: str, hmm: bool = True) -> list[str]

    • Segment Chinese text in Jieba search mode.
    • Produces finer-grained tokens suitable for search indexing.
  • jieba_cut_all(input: str) -> list[str]

    • Segment Chinese text in Jieba full mode.
    • Returns all possible token matches without disambiguation.
  • jieba_tag(input: str, hmm: bool = True) -> list[tuple[str, str]]

    • Perform Jieba part-of-speech tagging.
    • Returns (word, tag) tuples.
  • jieba_segment_join(input: str, mode: str = "cut", delim: str = " ", hmm: bool = True, separator: str = "/") -> str

    • Segment text and join the result into a single string.
    • mode: One of "cut", "search", "full", or "tag".
    • delim: Delimiter used to join segments or tagged tokens.
    • hmm: Used by "cut", "search", and "tag" modes.
    • separator: Separator between word and POS tag in "tag" mode.
  • jieba_cut_and_join(input: str, delimiter: str = "/") -> str

    • Deprecated compatibility wrapper for jieba_segment_join(input, mode="cut", delim=delimiter).
    • input: Input text.
    • delimiter: Delimiter for joining words.
    • Returns: Joined segmented string.
  • jieba_keyword_extract_textrank(input: str, top_k: int = 10, allowed_pos: list[str] | None = None) -> list[str]

    • Extract keywords using the TextRank algorithm.
    • input: Input text.
    • top_k: Number of keywords to extract.
    • allowed_pos: Optional POS filter list. Each item may contain one or more POS tags separated by whitespace.
    • Returns: List of keywords.
  • jieba_keyword_extract_tfidf(input: str, top_k: int = 10, allowed_pos: list[str] | None = None) -> list[str]

    • Extract keywords using the TF-IDF algorithm.
    • input: Input text.
    • top_k: Number of keywords to extract.
    • allowed_pos: Optional POS filter list. Each item may contain one or more POS tags separated by whitespace.
    • Returns: List of keywords.

  • jieba_keyword_weight_textrank(input: str, top_k: int = 10, allowed_pos: list[str] | None = None) -> list[tuple[str, float]]

    • Extract keywords and their weights using TextRank.
    • input: Input text.
    • top_k: Number of keywords to extract.
    • allowed_pos: Optional POS filter list. Each item may contain one or more POS tags separated by whitespace.
    • Returns: List of (keyword, weight) tuples.
  • jieba_keyword_weight_tfidf(input: str, top_k: int = 10, allowed_pos: list[str] | None = None) -> list[tuple[str, float]]

    • Extract keywords and their weights using TF-IDF.
    • input: Input text.
    • top_k: Number of keywords to extract.
    • allowed_pos: Optional POS filter list. Each item may contain one or more POS tags separated by whitespace.
    • Returns: List of (keyword, weight) tuples.
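
The keyword methods accept allowed_pos items that may each hold several whitespace-separated POS tags. The sketch below shows one reading of that input format by flattening such a list into individual tags. flatten_allowed_pos is a hypothetical helper written for illustration; it mirrors the documented format and is not the library's internal handling.

```python
from typing import Iterable, List, Optional


def flatten_allowed_pos(allowed_pos: Optional[Iterable[str]]) -> List[str]:
    """Split whitespace-separated POS tags into a flat, de-duplicated list.

    Illustrative only: this is an assumption about how the documented
    input format can be read, not the library's implementation.
    """
    if allowed_pos is None:
        return []
    seen = set()
    tags = []
    for item in allowed_pos:
        # str.split() with no arguments splits on any run of whitespace.
        for tag in item.split():
            if tag not in seen:
                seen.add(tag)
                tags.append(tag)
    return tags


print(flatten_allowed_pos(["n v", "ns", "v\tvn"]))  # ['n', 'v', 'ns', 'vn']
```

Under this reading, ["n v"] and ["n", "v"] would describe the same filter.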

Development

Rust Module Required

opencc-jieba-rs: A Rust implementation of Jieba + OpenCC


Benchmarks

Package: opencc_jieba_pyo3
Python 3.13.4 (tags/v3.13.4:8a526ec, Jun  3 2025, 17:46:04) [MSC v.1943 64 bit (AMD64)]
Platform: Windows-11-10.0.26100-SP0
Processor: Intel64 Family 6 Model 191 Stepping 2, GenuineIntel

BENCHMARK RESULTS

Method          Config  TextSize (chars)  Mean       StdDev    Min        Max        Ops/sec  Chars/sec
Convert_Small   s2t     100               0.161 ms   0.109 ms  0.080 ms   0.794 ms   6,217    621,740
Convert_Medium  s2t     1,000             0.389 ms   0.092 ms  0.286 ms   0.829 ms   2,571    2,571,236
Convert_Large   s2t     10,000            1.261 ms   0.314 ms  1.072 ms   2.580 ms   793      7,932,120
Convert_XLarge  s2t     100,000           7.290 ms   0.464 ms  6.864 ms   9.848 ms   137      13,716,798
Convert_Small   s2tw    100               0.189 ms   0.104 ms  0.103 ms   0.620 ms   5,285    528,519
Convert_Medium  s2tw    1,000             0.442 ms   0.152 ms  0.322 ms   1.084 ms   2,264    2,264,206
Convert_Large   s2tw    10,000            1.508 ms   0.200 ms  1.367 ms   2.371 ms   663      6,631,682
Convert_XLarge  s2tw    100,000           9.403 ms   0.585 ms  9.009 ms   13.320 ms  106      10,635,363
Convert_Small   s2twp   100               0.235 ms   0.113 ms  0.129 ms   0.648 ms   4,256    425,586
Convert_Medium  s2twp   1,000             0.518 ms   0.112 ms  0.363 ms   0.913 ms   1,932    1,932,266
Convert_Large   s2twp   10,000            1.786 ms   0.209 ms  1.590 ms   2.739 ms   560      5,598,571
Convert_XLarge  s2twp   100,000           11.644 ms  0.979 ms  10.892 ms  17.130 ms  86       8,588,034
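
The Ops/sec and Chars/sec columns appear to follow from the mean latency and the text size. The sketch below reproduces that arithmetic for one row, assuming Ops/sec = 1000 / mean in ms and Chars/sec = text size × Ops/sec. This derivation is an assumption about how the columns were computed, and throughput is a hypothetical helper name.

```python
def throughput(text_size_chars: int, mean_ms: float):
    """Return (ops_per_sec, chars_per_sec) derived from a mean latency in ms."""
    ops_per_sec = 1000.0 / mean_ms
    chars_per_sec = text_size_chars * ops_per_sec
    return ops_per_sec, chars_per_sec


# Convert_XLarge, s2t: 100,000 characters at a mean of 7.290 ms.
ops, chars = throughput(100_000, 7.290)
print(f"{ops:.0f} ops/sec, {chars:,.0f} chars/sec")
```

The result lands within a fraction of a percent of the reported figures; small differences are expected because the mean shown in the table is rounded to three decimals.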

Throughput vs. Size

[Figure: chart of throughput versus text size]


License

MIT


Powered by Rust, Jieba, PyO3, OpenCC and opencc-jieba-rs.
