Normalizing English and Chinese Text

BuoyanText

v0.0.1 20230719


ByTextNorm.py

TextNorm class

Normalizing English or Chinese text

Arguments:

text="", language="English",
# Uniform
html_stripping=True, remove_repeated=False, remove_digits=True, stopwords_removal=True,
special_char_removal=True,
# English
contraction_expansion=True, accented_char_removal=True, text_lower_case=True, text_stemming=False,
text_lemmatization=True,
# Chinese
split_words=True, punctuation_drop=True

  • text: the text to be normalized
  • language: "English" or "Chinese" (default: "English")

(1) Uniform arguments (applied to both languages)

  • html_stripping: strip html tags if True (default: True)
  • remove_repeated: remove repeated words (default: False)
  • remove_digits: remove numbers (default: True)
  • stopwords_removal: remove stopwords (default: True)
  • special_char_removal: remove special characters (default: True)
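
To make the uniform flags concrete, here is a minimal standard-library sketch of what each one might do. It is not the BuoyanText implementation: the function `normalize_uniform`, the helper names, the order of the steps, and the tiny `STOPWORDS` set are all assumptions made for illustration; only the flag names and defaults come from the argument list above.

```python
import html
import re

# Tiny illustrative stopword list (an assumption); a real list is far larger.
STOPWORDS = {"a", "an", "the", "is", "of", "and"}


def _strip_html(text):
    """Replace anything that looks like a tag with a space, then unescape entities."""
    return html.unescape(re.sub(r"<[^>]+>", " ", text))


def _drop_repeated(text):
    """Collapse immediately repeated words: 'very very good' -> 'very good'."""
    return re.sub(r"\b(\w+)(\s+\1\b)+", r"\1", text, flags=re.IGNORECASE)


def normalize_uniform(text, html_stripping=True, remove_repeated=False,
                      remove_digits=True, stopwords_removal=True,
                      special_char_removal=True):
    """Hypothetical stand-in for the language-independent steps."""
    if html_stripping:
        text = _strip_html(text)
    if remove_digits:
        text = re.sub(r"\d+", "", text)
    if special_char_removal:
        # Keep letters, digits, and whitespace; drop everything else.
        text = re.sub(r"[^\w\s]|_", "", text)
    if remove_repeated:
        text = _drop_repeated(text)
    tokens = text.split()
    if stopwords_removal:
        tokens = [t for t in tokens if t.lower() not in STOPWORDS]
    # Joining on single spaces also collapses leftover whitespace.
    return " ".join(tokens)


print(normalize_uniform("<p>The price is 42 dollars!!</p>"))
```

With the defaults, the call above removes the tags, the number, the exclamation marks, and the two stopwords. The package may apply its steps in a different order or with different rules.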

(2) English text arguments

  • contraction_expansion: expand contractions, for example "can't" to "can not" (default: True)
  • accented_char_removal: remove accented characters (default: True)
  • text_lower_case: convert the text to lower case (default: True)
  • text_stemming: apply stemming to the words (default: False)
  • text_lemmatization: apply lemmatization to the words (default: True)
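
The English-specific flags can be illustrated the same way. The sketch below is an assumption-laden stand-in, not the package's code: `normalize_english`, the three-entry `CONTRACTIONS` table, and the reading of accented_char_removal as "map é to e" are all hypothetical. Stemming and lemmatization are left out because they need an external NLP library, and the source does not say which one BuoyanText uses.

```python
import re
import unicodedata

# Illustrative contraction table (an assumption); real tables hold many more entries.
CONTRACTIONS = {"can't": "can not", "won't": "will not", "it's": "it is"}
_CONTRACTION_RE = re.compile(
    "|".join(re.escape(key) for key in CONTRACTIONS), flags=re.IGNORECASE
)


def expand_contractions(text):
    """Replace each known contraction with its expanded form."""
    return _CONTRACTION_RE.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)


def strip_accents(text):
    """Decompose characters and drop the combining marks: 'é' -> 'e'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_english(text, contraction_expansion=True,
                      accented_char_removal=True, text_lower_case=True):
    """Hypothetical stand-in for three of the English-only steps."""
    if contraction_expansion:
        text = expand_contractions(text)
    if accented_char_removal:
        text = strip_accents(text)
    if text_lower_case:
        text = text.lower()
    return text


print(normalize_english("I can't visit the Café"))
```

Expanding contractions before lower-casing and punctuation removal matters in a pipeline like this, since the apostrophe is what identifies a contraction.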

(3) Chinese text arguments

  • split_words: split words with jieba (default: True)
  • punctuation_drop: drop punctuation (default: True)
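
A rough sketch of the two Chinese-specific steps follows. The helper names `drop_punctuation` and `split_words` are hypothetical, and the character-level fallback when jieba is not installed is purely an assumption so that the example runs anywhere. `jieba.lcut` is the real jieba call that returns a list of segmented words; how BuoyanText itself invokes jieba is not stated in the source.

```python
import unicodedata


def drop_punctuation(text):
    """Remove every character whose Unicode category is punctuation (P*).

    This covers full-width marks such as '，' and '。' as well as ASCII ones.
    """
    return "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("P")
    )


def split_words(text):
    """Segment Chinese text into words with jieba when it is available."""
    try:
        import jieba  # third-party; install with `pip install jieba`
    except ImportError:
        # Fallback (assumption): treat each non-space character as a token.
        return [ch for ch in text if not ch.isspace()]
    return jieba.lcut(text)


cleaned = drop_punctuation("你好，世界！")
print(cleaned)
print(split_words(cleaned))
```

Dropping punctuation before segmentation keeps punctuation marks from appearing as separate tokens in the word list.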

ByTextReader.py

(1) file_reader

(2) file_list_reader

(3) pdf_to_txt

Download files

Source Distribution

BuoyanText-0.0.4.tar.gz (13.7 kB)

Built Distribution

BuoyanText-0.0.4-py3-none-any.whl (13.7 kB, Python 3)
