Skip to main content

A Python extension module powered by Rust Jieba and PyO3, providing fast and accurate Chinese text conversion.

Project description

opencc_jieba_pyo3

PyPI version Build

opencc_jieba_pyo3 is a Python extension module powered by Rust Jieba and PyO3, providing fast and accurate conversion between different Chinese text variants using opencc-jieba-rs and OpenCC algorithms.

Features

  • Convert between Simplified, Traditional, Hong Kong, Taiwan, and Japanese Kanji Chinese text.
  • Fast and memory-efficient, leveraging Rust's performance.
  • Easy-to-use Python API.
  • Supports punctuation conversion and automatic text code detection.
  • Chinese word segmentation (Jieba).
  • Keyword extraction (TF-IDF, TextRank).
  • Utility functions for punctuation handling and language detection.

Supported Conversion Configurations

  • s2t, t2s, s2tw, tw2s, s2twp, tw2sp, s2hk, hk2s, t2tw, tw2t, t2twp, tw2tp, t2hk, hk2t, t2jp, jp2t

Installation

Build and install the Python wheel using maturin:

# In project root
maturin build --release
pip install ./target/wheels/opencc_jieba_pyo3-<version>-cp<pyver>-abi3-<platform>.whl

Or for development:

maturin develop -r

See BUILD.md for detailed build and install instructions.

Usage

Python

from opencc_jieba_pyo3 import OpenCC

text = "“春眠不觉晓,处处闻啼鸟。”"
opencc = OpenCC("s2t")
converted = opencc.convert(text, punctuation=True)
print(converted)  # 「春眠不覺曉,處處聞啼鳥。」

# Segmentation
words = opencc.jieba_cut(text, hmm=True)
print(words)  # ['春眠', '不觉', '晓', ',', '处处', '闻', '啼鸟', '。']

# Segmentation and join
joined = opencc.jieba_cut_and_join(text, delimiter="/")
print(joined)  # 春眠/不觉/晓/,/处处/闻/啼鸟/。

# Keyword extraction (TextRank)
keywords = opencc.jieba_keyword_extract_textrank(text, top_k=3)
print(keywords)  # ['春眠', '啼鸟', '处处']

# Keyword extraction (TF-IDF)
keywords_tfidf = opencc.jieba_keyword_extract_tfidf(text, top_k=3)
print(keywords_tfidf)  # ['春眠', '啼鸟', '处处']

# Keyword weights (TextRank)
kw_weights = opencc.jieba_keyword_weight_textrank(text, top_k=3)
print(kw_weights)  # [('春眠', 1.23), ('啼鸟', 0.98), ('处处', 0.75)]

# Keyword weights (TF-IDF)
kw_weights_tfidf = opencc.jieba_keyword_weight_tfidf(text, top_k=3)
print(kw_weights_tfidf)  # [('春眠', 2.34), ('啼鸟', 1.56), ('处处', 1.12)]

CLI

You can also use the CLI interface:

convert

python -m opencc_jieba_pyo3 convert --help
usage: opencc_jieba_pyo3 convert [-h] [-i <file>] [-o <file>] [-c <conversion>] [-p] [--in-enc <encoding>] [--out-enc <encoding>]

options:
  -h, --help            show this help message and exit
  -i, --input <file>    Read original text from <file>.
  -o, --output <file>   Write converted text to <file>.
  -c, --config <conversion>
                        Conversion configuration: [s2t|s2tw|s2twp|s2hk|t2s|tw2s|tw2sp|hk2s|jp2t|t2jp]
  -p, --punct           Punctuation conversion
  --in-enc <encoding>   Encoding for input
  --out-enc <encoding>  Encoding for output

segment

python -m opencc_jieba_pyo3 segment --help
usage: opencc_jieba_pyo3 segment [-h] [-i <file>] [-o <file>] [-d <char>] [--in-enc <encoding>] [--out-enc <encoding>]

options:
  -h, --help            show this help message and exit
  -i, --input <file>    Read input text from <file>.
  -o, --output <file>   Write segmented text to <file>.
  -d, --delim <char>    Delimiter to join segments
  --in-enc <encoding>   Encoding for input
  --out-enc <encoding>  Encoding for output
python -m opencc_jieba_pyo3 convert -i input.txt -o output.txt -c s2t --punct
python -m opencc_jieba_pyo3 segment -i input.txt -o output.txt --delim "/"

API

Class: OpenCC

Unified Python interface for OpenCC and Jieba functionalities.

Constructor

  • OpenCC(config: str = "s2t")
    • config: Conversion configuration (see above). Defaults to "s2t".

Attributes

  • config: str
    • Current OpenCC conversion configuration.

Methods

  • convert(input: str, punctuation: bool = False) -> str

    • Convert Chinese text using the current OpenCC config.
    • input: Input text.
    • punctuation: Whether to convert Chinese/Japanese punctuation to the target variant.
    • Returns: Converted text as a string.
  • zho_check(input: str) -> int

    • Detect the type of Chinese in the input text.
    • Returns: Integer code (1: Traditional, 2: Simplified, 0: Others).
  • jieba_cut(input: str, hmm: bool = True) -> list[str]

    • Segment Chinese text using Jieba.
    • input: Input text.
    • hmm: Whether to use HMM for new words.
    • Returns: List of segmented words.
  • jieba_cut_and_join(input: str, delimiter: str = "/") -> str

    • Segment and join Chinese text using Jieba.
    • input: Input text.
    • delimiter: Delimiter for joining words.
    • Returns: Joined segmented string.
  • jieba_keyword_extract_textrank(input: str, top_k: int) -> list[str]

    • Extract keywords using the TextRank algorithm.
    • input: Input text.
    • top_k: Number of keywords to extract.
    • Returns: List of keywords.
  • jieba_keyword_extract_tfidf(input: str, top_k: int) -> list[str]

    • Extract keywords using the TF-IDF algorithm.
    • input: Input text.
    • top_k: Number of keywords to extract.
    • Returns: List of keywords.
  • jieba_keyword_weight_textrank(input: str, top_k: int) -> list[tuple[str, float]]

    • Extract keywords and their weights using TextRank.
    • input: Input text.
    • top_k: Number of keywords to extract.
    • Returns: List of (keyword, weight) tuples.
  • jieba_keyword_weight_tfidf(input: str, top_k: int) -> list[tuple[str, float]]

    • Extract keywords and their weights using TF-IDF.
    • input: Input text.
    • top_k: Number of keywords to extract.
    • Returns: List of (keyword, weight) tuples.

Development

Rust Module Required

opencc-jieba-rs : A Rust implementation of Jieba + OpenCC

License

MIT


Powered by Rust, Jieba, PyO3, opencc-jieba-rs and OpenCC.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

opencc_jieba_pyo3-0.5.1-cp38-abi3-win_amd64.whl (7.0 MB view details)

Uploaded CPython 3.8+Windows x86-64

opencc_jieba_pyo3-0.5.1-cp38-abi3-manylinux_2_34_x86_64.whl (7.5 MB view details)

Uploaded CPython 3.8+manylinux: glibc 2.34+ x86-64

opencc_jieba_pyo3-0.5.1-cp38-abi3-macosx_11_0_arm64.whl (7.1 MB view details)

Uploaded CPython 3.8+macOS 11.0+ ARM64

File details

Details for the file opencc_jieba_pyo3-0.5.1-cp38-abi3-win_amd64.whl.

File metadata

File hashes

Hashes for opencc_jieba_pyo3-0.5.1-cp38-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 42445202e4fff1d4d5cc8e050717d13b98deb48765fb002732ce77653840ad55
MD5 df2f811fa734e7ba73c7214508c6cf65
BLAKE2b-256 8c0cd265298c4ac41d898953bb3edf11cd4f888f40ccdb90e21f9c69d49573ce

See more details on using hashes here.

File details

Details for the file opencc_jieba_pyo3-0.5.1-cp38-abi3-manylinux_2_34_x86_64.whl.

File metadata

File hashes

Hashes for opencc_jieba_pyo3-0.5.1-cp38-abi3-manylinux_2_34_x86_64.whl
Algorithm Hash digest
SHA256 6ee5fdc43b791a2799dfec0f0a0031548b753a44981ca43917fda2bd2f07f51a
MD5 f90b831f2cd4f309deada29cb3938a8b
BLAKE2b-256 7849c9ba9e053a01410472e39c2e980d77183f26c07ee4277e314dcb955ed67d

See more details on using hashes here.

File details

Details for the file opencc_jieba_pyo3-0.5.1-cp38-abi3-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for opencc_jieba_pyo3-0.5.1-cp38-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 a0bd9fd969f4cdd1b13fb1d6a5cd425338394b5984ea96f83fab22d263eebc8e
MD5 9e1f93687d6600ec9385c21c209380bc
BLAKE2b-256 77b913f2f609576d4665a710fc1a0dfcc5d4a4bf588f4ce8f164b348ea79220a

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page