WeTextProcessing, including TN & ITN
Text Normalization & Inverse Text Normalization
0. Brief Introduction
WeTextProcessing: Production First & Production Ready Text Processing Toolkit
0.1 Text Normalization
0.2 Inverse Text Normalization
1. How To Use
1.1 Quick Start:
# install
pip install WeTextProcessing
# tn usage
>>> from tn.chinese.normalizer import Normalizer
>>> normalizer = Normalizer()
>>> normalizer.normalize("2.5平方电线")
# itn usage
>>> from itn.chinese.inverse_normalizer import InverseNormalizer
>>> invnormalizer = InverseNormalizer()
>>> invnormalizer.normalize("二点五平方电线")
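Both classes above expose the same `normalize(text)` call, so a file of sentences can be processed with one small loop. The following is a minimal sketch, not part of WeTextProcessing: `normalize_lines` and `SupportsNormalize` are hypothetical names introduced here, and the only assumption about the library is the `normalize` method shown in the Quick Start.

```python
from typing import Iterable, List, Protocol


class SupportsNormalize(Protocol):
    """Any object with normalize(str) -> str, e.g. Normalizer or InverseNormalizer."""

    def normalize(self, text: str) -> str: ...


def normalize_lines(processor: SupportsNormalize, lines: Iterable[str]) -> List[str]:
    """Normalize each non-blank line, preserving input order (hypothetical helper)."""
    results = []
    for line in lines:
        text = line.strip()
        if not text:
            continue  # skip blank lines instead of passing them to the processor
        results.append(processor.normalize(text))
    return results
```

With the package installed, `normalize_lines(Normalizer(), open("input.txt", encoding="utf-8"))` would normalize a file line by line; constructing the processor once and reusing it avoids rebuilding or reloading the rules for every sentence.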
1.2 Advanced Usage:
Customize the rules and deploy WeTextProcessing with the C++ runtime.
If you want to modify the TN/ITN rules to fix bad cases, try:
git clone https://github.com/wenet-e2e/WeTextProcessing.git
cd WeTextProcessing
# `--overwrite_cache` rebuilds all rules from your modifications
# to tn/chinese/rules/xx.py (or itn/chinese/rules/xx.py).
# After the rebuild, the new FAR files are in `$PWD/tn` and `$PWD/itn`.
python normalize.py --text "2.5平方电线" --overwrite_cache
python inverse_normalize.py --text "二点五平方电线" --overwrite_cache
Once the rules are rebuilt successfully, you can deploy them either with your installed PyPI package:
# tn usage
>>> from tn.chinese.normalizer import Normalizer
>>> normalizer = Normalizer(cache_dir="PATH_TO_GIT_CLONED_WETEXTPROCESSING/tn")
>>> normalizer.normalize("2.5平方电线")
# itn usage
>>> from itn.chinese.inverse_normalizer import InverseNormalizer
>>> invnormalizer = InverseNormalizer(cache_dir="PATH_TO_GIT_CLONED_WETEXTPROCESSING/itn")
>>> invnormalizer.normalize("二点五平方电线")
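A mistyped `cache_dir` is easy to miss, so it can help to confirm that the directory really holds rebuilt grammars before passing it to `Normalizer` or `InverseNormalizer`. The sketch below assumes the rebuilt rules are stored as `.far` files directly inside that directory, as the rebuild comments and the C++ commands in this section suggest; `find_far_files` is a hypothetical helper, not part of the package.

```python
from pathlib import Path
from typing import List


def find_far_files(cache_dir: str) -> List[Path]:
    """Return the sorted .far files directly inside cache_dir (hypothetical helper).

    Raises FileNotFoundError if the directory is missing or holds no .far file.
    """
    root = Path(cache_dir)
    if not root.is_dir():
        raise FileNotFoundError(f"cache_dir does not exist: {root}")
    fars = sorted(root.glob("*.far"))
    if not fars:
        # Assumption: a rebuild with --overwrite_cache writes *.far files here.
        raise FileNotFoundError(f"no .far files found in {root}")
    return fars
```

Calling `find_far_files("PATH_TO_GIT_CLONED_WETEXTPROCESSING/tn")` before constructing the normalizer turns a wrong path into an immediate, explicit error.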
Or with the C++ runtime:
cmake -B build -S runtime -DCMAKE_BUILD_TYPE=Release
cmake --build build
# tn usage
./build/bin/processor_main --far PATH_TO_GIT_CLONED_WETEXTPROCESSING/tn/zh_tn_normalizer.far --text "2.5平方电线"
# itn usage
./build/bin/processor_main --far PATH_TO_GIT_CLONED_WETEXTPROCESSING/itn/zh_itn_normalizer.far --text "二点五平方电线"
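If the compiled binary needs to be driven from a Python script, for example to compare its output with the Python API, the commands above can be wrapped with `subprocess`. This is a rough sketch: `processor_command` and `run_processor` are hypothetical helpers, only the `--far` and `--text` flags shown above are used, and it assumes the binary writes its result to standard output.

```python
import subprocess
from typing import List

DEFAULT_BINARY = "./build/bin/processor_main"  # path produced by the cmake build above


def processor_command(far_path: str, text: str, binary: str = DEFAULT_BINARY) -> List[str]:
    """Build the argument list for processor_main from the --far and --text flags."""
    return [binary, "--far", far_path, "--text", text]


def run_processor(far_path: str, text: str, binary: str = DEFAULT_BINARY) -> str:
    """Run the binary and return its stripped stdout.

    Assumption: the processed text is printed to stdout as UTF-8.
    Raises subprocess.CalledProcessError on a non-zero exit code.
    """
    cmd = processor_command(far_path, text, binary)
    done = subprocess.run(cmd, check=True, capture_output=True, encoding="utf-8")
    return done.stdout.strip()
```

Passing the arguments as a list rather than a shell string avoids quoting problems with non-ASCII or whitespace-containing input text.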
2. TN Pipeline
Please refer to TN.README
3. ITN Pipeline
Please refer to ITN.README
Acknowledgements
- Thanks to the authors of foundational libraries such as OpenFst and Pynini.
- Thanks to the NeMo team and the NeMo open-source community.
- Thanks to Zhenxiang Ma, Jiayu Du, and the SpeechColab organization.
- We referred to Pynini for reading the FAR and printing the shortest path of a lattice in the C++ runtime.
- We referred to the TN of NeMo for the data used to build the tagger graph.
- We referred to the ITN of chinese_text_normalization for the data used to build the tagger graph.