A Python extension module powered by Rust Jieba and PyO3, providing fast and accurate Chinese text conversion.

opencc_jieba_pyo3

opencc_jieba_pyo3 is a Python extension module powered by Rust Jieba and PyO3, providing fast and accurate conversion between different Chinese text variants using opencc-jieba-rs and OpenCC algorithms.

Features

  • Convert between Simplified Chinese, Traditional Chinese (including the Hong Kong and Taiwan standards), and Japanese Kanji.
  • Fast and memory-efficient, leveraging Rust's performance.
  • Easy-to-use Python API.
  • Supports punctuation conversion and automatic detection of the text variant (Traditional or Simplified).
  • Chinese word segmentation (Jieba).
  • Keyword extraction (TF-IDF, TextRank).
  • Utility functions for punctuation handling and language detection.

Supported Conversion Configurations

  • s2t, t2s, s2tw, tw2s, s2twp, tw2sp, s2hk, hk2s, t2tw, tw2t, t2twp, tw2tp, t2hk, hk2t, t2jp, jp2t
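
Because the configuration is passed as a plain string, it can help to validate it before constructing a converter. The following is a minimal sketch, not part of the package: SUPPORTED_CONFIGS simply copies the list above, and normalize_config is a hypothetical helper name.

```python
# Hypothetical helper, not part of opencc_jieba_pyo3: validates a config name
# against the supported configurations listed in this README.
SUPPORTED_CONFIGS = frozenset({
    "s2t", "t2s", "s2tw", "tw2s", "s2twp", "tw2sp", "s2hk", "hk2s",
    "t2tw", "tw2t", "t2twp", "tw2tp", "t2hk", "hk2t", "t2jp", "jp2t",
})


def normalize_config(name: str) -> str:
    """Return the stripped, lowercased config name, or raise ValueError."""
    key = name.strip().lower()
    if key not in SUPPORTED_CONFIGS:
        raise ValueError(f"unsupported conversion configuration: {name!r}")
    return key
```

The returned string can then be passed to OpenCC(...), so a typo is caught with a clear message before any conversion is attempted.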

Installation

Build and install the Python wheel using maturin:

# In project root
maturin build --release
pip install ./target/wheels/opencc_jieba_pyo3-<version>-cp<pyver>-abi3-<platform>.whl

Or for development:

maturin develop -r

See BUILD.md for detailed build and install instructions.
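
After installing, a quick way to confirm that the module is visible to the current interpreter is to ask the import system for it. This is a small sketch using only the standard library; is_installed is a hypothetical helper name, and it checks only that the module can be located, not that it loads correctly.

```python
import importlib.util


def is_installed(module_name: str = "opencc_jieba_pyo3") -> bool:
    """Return True if the named top-level module can be found by the import system."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    print("opencc_jieba_pyo3 installed:", is_installed())
```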

Usage

Python

from opencc_jieba_pyo3 import OpenCC

text = "“春眠不觉晓,处处闻啼鸟。”"
opencc = OpenCC("s2t")
converted = opencc.convert(text, punctuation=True)
print(converted)  # 「春眠不覺曉,處處聞啼鳥。」

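# Note: the outputs shown in the comments below are illustrative. Exact tokens
# (including quotation marks from the input) and weights depend on the dictionary in use.

# Segmentation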
words = opencc.jieba_cut(text, hmm=True)
print(words)  # ['春眠', '不觉', '晓', ',', '处处', '闻', '啼鸟', '。']

# Segmentation and join
joined = opencc.jieba_cut_and_join(text, delimiter="/")
print(joined)  # 春眠/不觉/晓/,/处处/闻/啼鸟/。

# Keyword extraction (TextRank)
keywords = opencc.jieba_keyword_extract_textrank(text, top_k=3)
print(keywords)  # ['春眠', '啼鸟', '处处']

# Keyword extraction (TF-IDF)
keywords_tfidf = opencc.jieba_keyword_extract_tfidf(text, top_k=3)
print(keywords_tfidf)  # ['春眠', '啼鸟', '处处']

# Keyword weights (TextRank)
kw_weights = opencc.jieba_keyword_weight_textrank(text, top_k=3)
print(kw_weights)  # [('春眠', 1.23), ('啼鸟', 0.98), ('处处', 0.75)]

# Keyword weights (TF-IDF)
kw_weights_tfidf = opencc.jieba_keyword_weight_tfidf(text, top_k=3)
print(kw_weights_tfidf)  # [('春眠', 2.34), ('啼鸟', 1.56), ('处处', 1.12)]
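
The (keyword, weight) pairs returned by the weight methods are on an arbitrary scale, so it is sometimes convenient to normalize them before comparing texts. The following is a sketch under that assumption: normalize_weights is a hypothetical helper, and the sample list is made-up data in the documented return shape, not real library output.

```python
from typing import List, Tuple


def normalize_weights(pairs: List[Tuple[str, float]]) -> List[Tuple[str, float]]:
    """Scale weights so they sum to 1.0, keeping the input order."""
    total = sum(weight for _, weight in pairs)
    if total <= 0:
        # Nothing meaningful to scale; return zero weights instead of dividing by zero.
        return [(word, 0.0) for word, _ in pairs]
    return [(word, weight / total) for word, weight in pairs]


# Made-up input in the shape returned by jieba_keyword_weight_tfidf().
sample = [("春眠", 2.0), ("啼鸟", 1.0), ("处处", 1.0)]
print(normalize_weights(sample))  # [('春眠', 0.5), ('啼鸟', 0.25), ('处处', 0.25)]
```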

CLI

You can also use the CLI interface:

convert

python -m opencc_jieba_pyo3 convert --help
usage: opencc_jieba_pyo3 convert [-h] [-i <file>] [-o <file>] [-c <conversion>] [-p] [--in-enc <encoding>] [--out-enc <encoding>]

options:
  -h, --help            show this help message and exit
  -i, --input <file>    Read original text from <file>.
  -o, --output <file>   Write converted text to <file>.
  -c, --config <conversion>
                        Conversion configuration: [s2t|s2tw|s2twp|s2hk|t2s|tw2s|tw2sp|hk2s|jp2t|t2jp]
  -p, --punct           Punctuation conversion
  --in-enc <encoding>   Encoding for input
  --out-enc <encoding>  Encoding for output

segment

python -m opencc_jieba_pyo3 segment --help
usage: opencc_jieba_pyo3 segment [-h] [-i <file>] [-o <file>] [-d <char>] [--in-enc <encoding>] [--out-enc <encoding>]

options:
  -h, --help            show this help message and exit
  -i, --input <file>    Read input text from <file>.
  -o, --output <file>   Write segmented text to <file>.
  -d, --delim <char>    Delimiter to join segments
  --in-enc <encoding>   Encoding for input
  --out-enc <encoding>  Encoding for output

Examples:

python -m opencc_jieba_pyo3 convert -i input.txt -o output.txt -c s2t --punct
python -m opencc_jieba_pyo3 segment -i input.txt -o output.txt --delim "/"
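
When the CLI needs to be driven from another Python script, the argument list can be assembled programmatically. The sketch below uses only the flags shown in the convert help text above; build_convert_cmd is a hypothetical helper name, and the command is only built here, not executed.

```python
import sys
from typing import List, Optional


def build_convert_cmd(
    input_path: str,
    output_path: str,
    config: str = "s2t",
    punct: bool = False,
    in_enc: Optional[str] = None,
    out_enc: Optional[str] = None,
) -> List[str]:
    """Build the argv list for `python -m opencc_jieba_pyo3 convert`."""
    cmd = [
        sys.executable, "-m", "opencc_jieba_pyo3", "convert",
        "-i", input_path, "-o", output_path, "-c", config,
    ]
    if punct:
        cmd.append("-p")  # enable punctuation conversion
    if in_enc:
        cmd += ["--in-enc", in_enc]
    if out_enc:
        cmd += ["--out-enc", out_enc]
    return cmd
```

The resulting list can be passed to subprocess.run(cmd, check=True), which avoids shell quoting problems with file names.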

API

Class: OpenCC

Unified Python interface for OpenCC and Jieba functionalities.

Constructor

  • OpenCC(config: str = "s2t")
    • config: Conversion configuration (see above). Defaults to "s2t".

Attributes

  • config: str
    • Current OpenCC conversion configuration.

Methods

  • convert(input: str, punctuation: bool = False) -> str

    • Convert Chinese text using the current OpenCC config.
    • input: Input text.
    • punctuation: Whether to convert Chinese/Japanese punctuation to the target variant.
    • Returns: Converted text as a string.
  • zho_check(input: str) -> int

    • Detect the type of Chinese in the input text.
    • Returns: Integer code (1: Traditional, 2: Simplified, 0: Others).
  • jieba_cut(input: str, hmm: bool = True) -> list[str]

    • Segment Chinese text using Jieba.
    • input: Input text.
    • hmm: Whether to use HMM for new words.
    • Returns: List of segmented words.
  • jieba_cut_and_join(input: str, delimiter: str = "/") -> str

    • Segment and join Chinese text using Jieba.
    • input: Input text.
    • delimiter: Delimiter for joining words.
    • Returns: Joined segmented string.
  • jieba_keyword_extract_textrank(input: str, top_k: int) -> list[str]

    • Extract keywords using the TextRank algorithm.
    • input: Input text.
    • top_k: Number of keywords to extract.
    • Returns: List of keywords.
  • jieba_keyword_extract_tfidf(input: str, top_k: int) -> list[str]

    • Extract keywords using the TF-IDF algorithm.
    • input: Input text.
    • top_k: Number of keywords to extract.
    • Returns: List of keywords.
  • jieba_keyword_weight_textrank(input: str, top_k: int) -> list[tuple[str, float]]

    • Extract keywords and their weights using TextRank.
    • input: Input text.
    • top_k: Number of keywords to extract.
    • Returns: List of (keyword, weight) tuples.
  • jieba_keyword_weight_tfidf(input: str, top_k: int) -> list[tuple[str, float]]

    • Extract keywords and their weights using TF-IDF.
    • input: Input text.
    • top_k: Number of keywords to extract.
    • Returns: List of (keyword, weight) tuples.
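
As an illustration of how zho_check and convert might be combined, the sketch below picks a conversion direction from the detected variant. It relies only on the constructor, zho_check, and convert as documented above; config_for_code and flip_variant are hypothetical names, and the choice of "t2s" and "s2t" as the two directions is an assumption made for this example.

```python
from typing import Optional

# Codes as documented for zho_check(): 1 = Traditional, 2 = Simplified, 0 = others.
_CONFIG_BY_CODE = {1: "t2s", 2: "s2t"}


def config_for_code(code: int) -> Optional[str]:
    """Map a zho_check() code to a conversion config, or None for other text."""
    return _CONFIG_BY_CODE.get(code)


def flip_variant(text: str) -> str:
    """Convert Traditional to Simplified or the reverse; return other text unchanged."""
    # Imported lazily so the mapping above can be used without the package installed.
    from opencc_jieba_pyo3 import OpenCC

    config = config_for_code(OpenCC().zho_check(text))
    if config is None:
        return text
    return OpenCC(config).convert(text)
```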

Development

Rust Module Required

opencc-jieba-rs: a Rust implementation of Jieba + OpenCC

License

MIT


Powered by Rust, Jieba, PyO3, opencc-jieba-rs and OpenCC.


Built Distributions

  • opencc_jieba_pyo3-0.5.0-cp38-abi3-win_amd64.whl (7.0 MB): CPython 3.8+, Windows x86-64
  • opencc_jieba_pyo3-0.5.0-cp38-abi3-manylinux_2_34_x86_64.whl (7.5 MB): CPython 3.8+, manylinux (glibc 2.34+), x86-64
  • opencc_jieba_pyo3-0.5.0-cp38-abi3-macosx_11_0_arm64.whl (7.1 MB): CPython 3.8+, macOS 11.0+, ARM64
