Normalizing English and Chinese Text
BuoyanText
v0.0.1 20230719
ByTextNorm.py
TextNorm class
Normalizing English or Chinese text
Arguments:
text="", language="English",
# Uniform
html_stripping=True, remove_repeated=False, remove_digits=True, stopwords_removal=True,
special_char_removal=True,
# English
contraction_expansion=True, accented_char_removal=True, text_lower_case=True, text_stemming=False,
text_lemmatization=True,
# Chinese
split_words=True, punctuation_drop=True
- text: the text to be normalized
- language: "English" or "Chinese" (default: "English")
(1) Uniform approach arguments
- html_stripping: strip html tags if True (default: True)
- remove_repeated: remove repeated words (default: False)
- remove_digits: remove numbers (default: True)
- stopwords_removal: remove stopwords (default: True)
- special_char_removal: remove special characters (default: True)
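
To make the uniform options concrete, here is a minimal standard-library sketch of what each flag could do. It is not BuoyanText's implementation: the function names, the tiny stopword set, and the order of the steps are all assumptions made for illustration.

```python
import re
from html.parser import HTMLParser


class _TagStripper(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(text):
    # html_stripping: keep the text, discard the tags.
    parser = _TagStripper()
    parser.feed(text)
    parser.close()
    return "".join(parser.parts)


def remove_digits(text):
    # remove_digits: delete every run of digits.
    return re.sub(r"\d+", "", text)


def remove_special_chars(text):
    # special_char_removal: keep word characters (\w) and whitespace only.
    return re.sub(r"[^\w\s]", "", text)


def remove_repeated_words(text):
    # remove_repeated: collapse immediate repeats, "very very good" -> "very good".
    return re.sub(r"\b(\w+)(\s+\1\b)+", r"\1", text, flags=re.IGNORECASE)


def remove_stopwords(text, stopwords):
    # stopwords_removal: drop tokens found in the (hypothetical) stopword set.
    return " ".join(w for w in text.split() if w.lower() not in stopwords)


def normalize_uniform(text, stopwords=frozenset({"the", "is", "a"})):
    # The step order here is an assumption, not taken from the package.
    text = strip_html(text)
    text = remove_digits(text)
    text = remove_special_chars(text)
    text = remove_repeated_words(text)
    return remove_stopwords(text, stopwords)
```

With these assumptions, normalize_uniform("<p>The price is 42 dollars!!</p>") returns "price dollars".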
(2) English text arguments
- contraction_expansion: expand contractions, for example "can't" to "can not" (default: True)
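- accented_char_removal: remove accented characters (default: True)
- text_lower_case: convert text to lower case (default: True)
- text_stemming: apply stemming to words (default: False)
- text_lemmatization: apply lemmatization to words (default: True)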
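
The English options can be illustrated the same way. The sketch below covers contraction expansion, accent removal, and lower-casing with the standard library only. The contraction table is a three-entry stand-in, the function names are hypothetical, and the package may implement these steps differently. Stemming and lemmatization are left out because they need a third-party NLP library.

```python
import re
import unicodedata

# Tiny illustrative map; a real contraction table is much larger.
CONTRACTIONS = {"can't": "can not", "won't": "will not", "it's": "it is"}


def expand_contractions(text, mapping=CONTRACTIONS):
    # contraction_expansion: replace each known contraction, ignoring case.
    pattern = re.compile(
        "|".join(re.escape(k) for k in sorted(mapping, key=len, reverse=True)),
        flags=re.IGNORECASE,
    )
    return pattern.sub(lambda m: mapping[m.group(0).lower()], text)


def remove_accents(text):
    # accented_char_removal: decompose (NFKD), then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_english(text):
    text = expand_contractions(text)
    text = remove_accents(text)
    # text_lower_case: lower-case the result.
    return text.lower()
```

Under these assumptions, normalize_english("I can't find the Café") returns "i can not find the cafe".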
(3) Chinese text arguments
- split_words: split words with jieba (default: True)
- punctuation_drop: drop punctuation (default: True)
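
For the Chinese options, a rough sketch might look like the following. The function names are hypothetical and this is not the package's code. The only thing taken from the description above is that word splitting relies on jieba, which must be installed separately for split_words to run.

```python
import unicodedata


def drop_punctuation(text):
    # punctuation_drop: Unicode categories starting with "P" are punctuation,
    # which covers both ASCII marks and full-width CJK marks.
    return "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("P")
    )


def split_words(text):
    # split_words: assumes jieba is installed; lcut returns a list of tokens.
    import jieba

    return jieba.lcut(text)
```

For example, drop_punctuation("你好，世界！") returns "你好世界".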
ByTextReader.py
(1) file_reader
(2) file_list_reader
(3) pdf_to_txt