Normalizing English and Chinese Text
BuoyanText
v0.0.5 20230720
TextNorm.py
TextNorm class
Normalizing English or Chinese text
Arguments:
text="", language="English",
# Uniform
html_stripping=True, remove_repeated=False, remove_digits=True, stopwords_removal=True,
special_char_removal=True,
# English
contraction_expansion=True, accented_char_removal=True, text_lower_case=True, text_stemming=False,
text_lemmatization=True,
# Chinese
split_words=True, punctuation_drop=True
- text: the text to be normalized
- language: "English" or "Chinese" (default: "English")
(1) Uniform approach arguments
- html_stripping: strip html tags if True (default: True)
- remove_repeated: remove repeated words (default: False)
- remove_digits: remove numbers (default: True)
- stopwords_removal: remove stopwords (default: True)
- special_char_removal: remove special characters (default: True)
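To illustrate what the uniform flags describe, here is a minimal standard-library sketch. It is not the package's implementation: the function name `uniform_normalize_sketch`, the tiny `STOPWORDS` set, the regular expressions, the order of the steps, and the reading of "repeated words" as consecutive duplicates are all assumptions made for illustration.

```python
import html
import re

# Hypothetical mini stopword list; the list the package actually uses is not documented here.
STOPWORDS = {"a", "an", "the", "is", "of", "and", "to", "in"}


def uniform_normalize_sketch(text, html_stripping=True, remove_repeated=False,
                             remove_digits=True, stopwords_removal=True,
                             special_char_removal=True):
    """Illustrative stand-in for the uniform steps; not BuoyanText's own code."""
    if html_stripping:
        # Replace tags with a space, then decode entities such as &amp;
        text = re.sub(r"<[^>]+>", " ", text)
        text = html.unescape(text)
    if remove_digits:
        text = re.sub(r"\d+", " ", text)
    if special_char_removal:
        # Keep only word characters and whitespace
        text = re.sub(r"[^\w\s]", " ", text)
    tokens = text.split()
    if remove_repeated:
        # Assumption: "repeated" means the same word occurring back to back
        deduped = []
        for tok in tokens:
            if not deduped or deduped[-1].lower() != tok.lower():
                deduped.append(tok)
        tokens = deduped
    if stopwords_removal:
        tokens = [t for t in tokens if t.lower() not in STOPWORDS]
    return " ".join(tokens)
```

With the defaults above, `uniform_normalize_sketch("<p>The price is 42 dollars!!</p>")` returns `"price dollars"`.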
(2) English text arguments
- contraction_expansion: expand contractions, for example "can't" to "can not" (default: True)
- accented_char_removal: remove accented characters (default: True)
- text_lower_case: convert text to lower case (default: True)
- text_stemming: stem words (default: False)
- text_lemmatization: lemmatize words (default: True)
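The English-specific flags can be pictured with a similar sketch. Again, this is an assumption-laden illustration rather than the package's code: `english_normalize_sketch` and the three-entry `CONTRACTIONS` map are hypothetical, and accent removal is interpreted here as mapping characters to their closest ASCII form. Stemming and lemmatization are left out because they typically depend on a third-party NLP library.

```python
import re
import unicodedata

# Hypothetical contraction map; a real one would be much larger.
CONTRACTIONS = {"can't": "can not", "won't": "will not", "it's": "it is"}


def english_normalize_sketch(text, contraction_expansion=True,
                             accented_char_removal=True, text_lower_case=True):
    """Illustrative stand-in for three of the English steps; not BuoyanText's own code."""
    if contraction_expansion:
        pattern = re.compile("|".join(re.escape(k) for k in CONTRACTIONS),
                             flags=re.IGNORECASE)
        text = pattern.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)
    if accented_char_removal:
        # Decompose accented characters, then drop everything outside ASCII.
        # Note: this also discards any other non-ASCII text, including Chinese.
        text = (unicodedata.normalize("NFKD", text)
                .encode("ascii", "ignore")
                .decode("ascii"))
    if text_lower_case:
        text = text.lower()
    return text
```

For example, `english_normalize_sketch("I can't visit the Café")` returns `"i can not visit the cafe"`.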
(3) Chinese text arguments
- split_words: split words with jieba (default: True)
- punctuation_drop: drop punctuation (default: True)
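The two Chinese flags can be sketched as follows. The function names, the `segment` parameter, and the space-joined output format are assumptions for illustration and may differ from what the package does. `jieba.lcut` is jieba's list-returning segmenter; it is imported only when no other segmenter is passed in, so the punctuation step can be tried without installing jieba.

```python
import unicodedata


def drop_punctuation(text):
    """Remove every character whose Unicode category starts with "P".

    This covers ASCII punctuation as well as full-width CJK marks such as
    ，。！《》.
    """
    return "".join(ch for ch in text
                   if not unicodedata.category(ch).startswith("P"))


def chinese_normalize_sketch(text, split_words=True, punctuation_drop=True,
                             segment=None):
    """Illustrative stand-in for the Chinese steps; not BuoyanText's own code."""
    if punctuation_drop:
        text = drop_punctuation(text)
    if split_words:
        if segment is None:
            import jieba  # third-party; only needed for real word segmentation
            segment = jieba.lcut
        # Assumption: return the segmented words joined by single spaces
        return " ".join(tok for tok in segment(text) if tok.strip())
    return text
```

Passing `segment=list` splits the text character by character, which is handy for a quick check: `chinese_normalize_sketch("你好，世界！", segment=list)` returns `"你 好 世 界"`.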
TextReader.py
(1) file_reader
(2) file_list_reader
(3) pdf_to_txt
Download files
Source Distribution
BuoyanText-0.0.5.tar.gz (13.7 kB)
Built Distribution
BuoyanText-0.0.5-py3-none-any.whl (13.8 kB)
Hashes for BuoyanText-0.0.5-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | bff62152d862a8ef673605525b7b2197c301af3ec521d4ac49ab71745ab4d710
MD5 | 5e7c28b5ccd5f449d12a730b4e9b4c0c
BLAKE2b-256 | 50b1f5ac63c7dfef55061eb04c306d6de8b271ec62243363c3f192be5e760f8a