Project description

BuoyanText

v0.0.1 20230719

ByTextNorm.py

TextNorm class

Normalizes English or Chinese text.

Arguments:

text="", language="English",
# Uniform
html_stripping=True, remove_repeated=False, remove_digits=True, stopwords_removal=True,
special_char_removal=True, 
# English
contraction_expansion=True, accented_char_removal=True, text_lower_case=True, text_stemming=False, 
text_lemmatization=True, 
# Chinese
split_words=True, punctuation_drop=True

text: the text to be normalized
language: "English" or "Chinese" (default: "English")

(1) Uniform approach arguments

html_stripping: strip html tags if True (default: True)
remove_repeated: remove repeated words (default: False)
remove_digits: remove numbers (default: True)
stopwords_removal: remove stopwords (default: True)
special_char_removal: remove special characters (default: True)

(2) English text arguments

contraction_expansion: expand contractions, for example "can't" to "can not" (default: True)
accented_char_removal: remove accented characters (default: True)
text_lower_case: convert text to lower case (default: True)
text_stemming: stem words (default: False)
text_lemmatization: lemmatize words (default: True)
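
As a rough illustration of the first three English flags, the sketch below expands contractions, strips accents, and lower-cases text with the standard library only. The function names and the small CONTRACTIONS table are assumptions, not the package's code, and reading "accented character removal" as "replace with the unaccented base letter" is also an assumption. Stemming and lemmatization are left out because they need a third-party NLP library.

```python
import re
import unicodedata

# Tiny illustrative contraction table (an assumption); a real table would be much larger.
CONTRACTIONS = {
    "can't": "can not",
    "won't": "will not",
    "it's": "it is",
    "i'm": "i am",
}

_CONTRACTION_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)


def expand_contractions(text):
    # Look each match up in lower case; the replacement is always lower case.
    return _CONTRACTION_RE.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)


def remove_accented_chars(text):
    # NFKD splits "é" into "e" plus a combining accent; the combining marks are then dropped.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_english(text, contraction_expansion=True,
                      accented_char_removal=True, text_lower_case=True):
    if contraction_expansion:
        text = expand_contractions(text)
    if accented_char_removal:
        text = remove_accented_chars(text)
    if text_lower_case:
        text = text.lower()
    return text


print(normalize_english("I can't find the Café"))  # i can not find the cafe
```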

(3) Chinese text arguments

split_words: split words with jieba (default: True)
punctuation_drop: drop punctuation (default: True)

ByTextReader.py

(1) file_reader

(2) file_list_reader

(3) pdf_to_txt

Release history

0.0.5

Jul 20, 2023

This version

0.0.4

Jul 20, 2023

0.0.1

Jul 19, 2023

Download files

Source Distribution

BuoyanText-0.0.4.tar.gz (13.7 kB)

Uploaded Jul 20, 2023 Source

Built Distribution

BuoyanText-0.0.4-py3-none-any.whl (13.7 kB)

Uploaded Jul 20, 2023 Python 3

Hashes for BuoyanText-0.0.4.tar.gz
Algorithm	Hash digest
SHA256	`ab26707e60650f95bce05ce7f5a6b6eb357dab507a449d8b0dde3339504eb12f`
MD5	`de12c37e4d81568302fec4e85779c5cc`
BLAKE2b-256	`02daab3b91674dc8457c2253dc70774d57c1171c5bbc64295d601a6c035b4f79`

Hashes for BuoyanText-0.0.4-py3-none-any.whl
Algorithm	Hash digest
SHA256	`c5cb0adabda912d3d67fb05d88361196b9ca3f090e711b0fd7e5a12de2049bd7`
MD5	`0646a2412889328fa0b885be02b811b4`
BLAKE2b-256	`333ad1242350694fa2545866eb72aabea15a05782860362ecd10d1f0b796a882`