WeTextProcessing Runtime

Python runtime for WeTextProcessing (does not depend on Pynini).

WeTextProcessing is a text processing library that provides text normalization (TN) and inverse text normalization (ITN) capabilities for Chinese, English and Japanese text. It uses Finite State Transducers (FSTs) for efficient text processing.

Features

  • Text Normalization (TN) for Chinese, English and Japanese
  • Inverse Text Normalization (ITN) for Chinese, English and Japanese
  • Traditional to Simplified Chinese conversion
  • Full-width to Half-width character conversion
  • Interjection removal
  • Punctuation removal
  • Out-of-vocabulary (OOV) word tagging
  • Erhua removal (for Chinese)
  • 0-to-9 conversion (for Chinese and Japanese ITN)

Installation

pip install wetext
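
To confirm the install from Python, the standard library's importlib.metadata can report the installed distribution version. This is a minimal sketch; the helper name installed_version is hypothetical and is not part of wetext.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str = "wetext") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    # Prints the version string when wetext is installed, a hint otherwise.
    print(found if found else "wetext is not installed; run: pip install wetext")
```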

Usage

Python API

Text Normalization (TN)

from wetext import Normalizer

# Chinese TN with erhua removal
normalizer = Normalizer(lang="zh", operator="tn", remove_erhua=True)
result = normalizer.normalize("你好 WeTextProcessing 1.0,全新版本儿,简直666")
print(result)  # 你好 WeTextProcessing 一点零,全新版本,简直六六六

# English TN
normalizer = Normalizer(lang="en", operator="tn")
result = normalizer.normalize("The price is $12.50, please pay now.")
print(result)  # The price is twelve point five dollars, please pay now.
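
When many lines need normalizing, one Normalizer instance can be reused for all of them. The sketch below assumes only the documented normalize(text) method; normalize_lines and demo are hypothetical helper names, and demo is not called automatically because it needs wetext installed.

```python
from typing import Iterable, List


def normalize_lines(normalizer, lines: Iterable[str]) -> List[str]:
    """Normalize each non-empty line with one shared normalizer instance."""
    results = []
    for line in lines:
        text = line.strip()
        if text:  # skip blank lines instead of passing them to normalize()
            results.append(normalizer.normalize(text))
    return results


def demo() -> None:
    # Requires `pip install wetext`.
    from wetext import Normalizer

    normalizer = Normalizer(lang="en", operator="tn")
    lines = ["The price is $12.50, please pay now.", ""]
    for out in normalize_lines(normalizer, lines):
        print(out)
```

Any object with a normalize(text) method works here, which also makes the helper easy to test with a stub.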

Inverse Text Normalization (ITN)

from wetext import Normalizer

# Chinese ITN
normalizer = Normalizer(lang="zh", operator="itn", enable_0_to_9=False)
result = normalizer.normalize("你好 WeTextProcessing 一点零,全新版本儿,简直六六六,九和六")
print(result)  # 你好 WeTextProcessing 1.0,全新版本儿,简直666,九和六

# English ITN
normalizer = Normalizer(lang="en", operator="itn")
result = normalizer.normalize("twenty three dollars and fifty cents")
print(result)  # $23.50
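
TN and ITN run in opposite directions, so a quick way to inspect both is to feed the TN output into an ITN normalizer. The sketch below uses a hypothetical helper, round_trip, and relies only on the documented normalize(text) method. The restored text need not equal the input, for example when TN is configured to remove erhua.

```python
from typing import Tuple


def round_trip(tn, itn, written: str) -> Tuple[str, str]:
    """Apply TN, then ITN on the TN output; return (spoken, restored)."""
    spoken = tn.normalize(written)
    restored = itn.normalize(spoken)
    return spoken, restored


def demo() -> None:
    # Requires `pip install wetext`; not called automatically.
    from wetext import Normalizer

    tn = Normalizer(lang="zh", operator="tn")
    itn = Normalizer(lang="zh", operator="itn")
    spoken, restored = round_trip(tn, itn, "简直666")
    print(spoken)
    print(restored)
```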

Command Line Interface

# Basic usage
wetext "你好 WeTextProcessing 1.0,全新版本儿,简直666"

# With options
wetext --lang zh --operator tn --remove-erhua "你好 WeTextProcessing 1.0,全新版本儿,简直666"

# Convert traditional to simplified Chinese
wetext --traditional-to-simple "你好,這是測試。"

# Remove punctuation
wetext --remove-puncts "你好,這是測試。"

API Reference

Normalizer Class

Normalizer(
    lang: Literal["auto", "en", "zh", "ja"] = "auto",
    operator: Literal["tn", "itn"] = "tn",
    traditional_to_simple: bool = False,
    full_to_half: bool = False,
    remove_interjections: bool = False,
    remove_puncts: bool = False,
    tag_oov: bool = False,
    enable_0_to_9: bool = False,
    remove_erhua: bool = False,
)

Parameters

  • lang: The language of the text. Can be "auto", "en", "zh" or "ja". Default is "auto".
  • operator: The operator to use. Can be "tn" (text normalization) or "itn" (inverse text normalization). Default is "tn".
  • traditional_to_simple: Whether to convert traditional Chinese to simplified Chinese. Default is False.
  • full_to_half: Whether to convert full-width characters to half-width characters. Default is False.
  • remove_interjections: Whether to remove interjections. Default is False.
  • remove_puncts: Whether to remove punctuation. Default is False.
  • tag_oov: Whether to tag out-of-vocabulary words. Default is False.
  • enable_0_to_9: Whether ITN also converts single-digit numbers (0 to 9) to Arabic numerals (Chinese and Japanese). Default is False.
  • remove_erhua: Whether TN removes the erhua suffix (儿) from Chinese text. Default is False.
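
The parameters above can be collected in a small options object that rejects unknown lang or operator values before a Normalizer is built. This is a sketch: NormalizerOptions, notes and to_kwargs are hypothetical names, and the checks mirror only what the parameter list states. The library's own validation behavior is not documented here.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, List

LANGS = ("auto", "en", "zh", "ja")
OPERATORS = ("tn", "itn")


@dataclass(frozen=True)
class NormalizerOptions:
    """Mirror of the documented Normalizer parameters and defaults."""

    lang: str = "auto"
    operator: str = "tn"
    traditional_to_simple: bool = False
    full_to_half: bool = False
    remove_interjections: bool = False
    remove_puncts: bool = False
    tag_oov: bool = False
    enable_0_to_9: bool = False
    remove_erhua: bool = False

    def __post_init__(self) -> None:
        if self.lang not in LANGS:
            raise ValueError(f"lang must be one of {LANGS}, got {self.lang!r}")
        if self.operator not in OPERATORS:
            raise ValueError(f"operator must be one of {OPERATORS}, got {self.operator!r}")

    def notes(self) -> List[str]:
        """List options that the parameter list documents for the other operator."""
        found = []
        if self.enable_0_to_9 and self.operator != "itn":
            found.append("enable_0_to_9 is documented for ITN")
        if self.remove_erhua and self.operator != "tn":
            found.append("remove_erhua is documented for TN")
        return found

    def to_kwargs(self) -> Dict[str, Any]:
        """Return keyword arguments matching the Normalizer signature."""
        return asdict(self)
```

With wetext installed, the options can be passed along as Normalizer(**NormalizerOptions(lang="zh", remove_erhua=True).to_kwargs()).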

Methods

  • normalize(text: str, lang: Optional[Literal["auto", "en", "zh", "ja"]] = None) -> str: Normalize the text.
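
The optional lang argument suggests that a single instance can handle inputs in different languages. The sketch below assumes that passing lang overrides the constructor's language for that call; the source gives only the signature, so treat this as an assumption. normalize_tagged is a hypothetical helper.

```python
from typing import Iterable, List, Optional, Tuple


def normalize_tagged(
    normalizer, items: Iterable[Tuple[str, Optional[str]]]
) -> List[str]:
    """Normalize (text, lang) pairs; lang=None defers to the instance default."""
    outputs = []
    for text, lang in items:
        if lang is None:
            outputs.append(normalizer.normalize(text))
        else:
            # Assumed to override the language for this call only.
            outputs.append(normalizer.normalize(text, lang=lang))
    return outputs


def demo() -> None:
    # Requires `pip install wetext`; not called automatically.
    from wetext import Normalizer

    normalizer = Normalizer(lang="auto", operator="tn")
    items = [("简直666", "zh"), ("The price is $12.50, please pay now.", "en")]
    for out in normalize_tagged(normalizer, items):
        print(out)
```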

CLI Options

  • --lang, -l: Set the language. Choices are "auto", "en", "zh", "ja". Default is "auto".
  • --operator, -o: Set the operator. Choices are "tn", "itn". Default is "tn".
  • --traditional-to-simple: Convert traditional Chinese to simplified Chinese.
  • --full-to-half: Convert full-width characters to half-width characters.
  • --remove-interjections: Remove interjections.
  • --remove-puncts: Remove punctuation.
  • --tag-oov: Tag out-of-vocabulary words.
  • --enable-0-to-9: Enable 0-to-9 conversion (ITN).
  • --remove-erhua: Remove erhua (TN).
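
The CLI can also be driven from a script. The sketch below builds the argument list from the options listed above and runs it with subprocess. build_command and run_wetext are hypothetical helpers, and reading the result from standard output is an assumption about the CLI's behavior.

```python
import subprocess
from typing import List

# Boolean flags from the CLI options list, keyed by their Python-style names.
_BOOL_FLAGS = (
    "traditional_to_simple",
    "full_to_half",
    "remove_interjections",
    "remove_puncts",
    "tag_oov",
    "enable_0_to_9",
    "remove_erhua",
)


def build_command(text: str, lang: str = "auto", operator: str = "tn", **flags: bool) -> List[str]:
    """Build the wetext argument list; only truthy flags are emitted."""
    cmd = ["wetext", "--lang", lang, "--operator", operator]
    for name, enabled in flags.items():
        if name not in _BOOL_FLAGS:
            raise ValueError(f"unknown option: {name}")
        if enabled:
            cmd.append("--" + name.replace("_", "-"))
    cmd.append(text)
    return cmd


def run_wetext(text: str, **options) -> str:
    """Run the CLI (requires wetext on PATH) and return its stripped stdout."""
    done = subprocess.run(
        build_command(text, **options), capture_output=True, text=True, check=True
    )
    return done.stdout.strip()
```

Passing the arguments as a list avoids shell quoting issues with non-ASCII input and punctuation.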

License

Apache License 2.0
