A Python package for text preprocessing tasks in natural language processing
Usage
To use this text preprocessing package, first install it with pip:

```shell
pip install text-preprocessing
```

Then import the package in your Python script and call the appropriate functions:
```python
from text_preprocessing import preprocess_text
from text_preprocessing import to_lower, remove_email, remove_url, remove_punctuation, lemmatize_word

# Preprocess text using the default preprocess functions in the pipeline
text_to_process = 'Helllo, I am John Doe!!! My email is john.doe@email.com. Visit our website www.johndoe.com'
preprocessed_text = preprocess_text(text_to_process)
print(preprocessed_text)
# output: hello email visit website

# Preprocess text using custom preprocess functions in the pipeline
preprocess_functions = [to_lower, remove_email, remove_url, remove_punctuation, lemmatize_word]
preprocessed_text = preprocess_text(text_to_process, preprocess_functions)
print(preprocessed_text)
# output: helllo i am john doe my email is visit our website
```
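The custom pipeline above passes a list of functions to preprocess_text. As a rough illustration of that pattern, and not the package's actual implementation, the sketch below applies a list of string-to-string functions in order. The names run_pipeline, lower_case, strip_emails, strip_punctuation, and collapse_whitespace are hypothetical stand-ins built on the standard library only, and the email pattern is a deliberate simplification.

```python
import re
import string
from typing import Callable, Sequence

# Hypothetical stand-ins for illustration only.
# These are NOT the functions shipped by text_preprocessing.

def lower_case(text: str) -> str:
    return text.lower()

def strip_emails(text: str) -> str:
    # Simplified pattern (assumption): any run of non-space characters around "@".
    return re.sub(r"\S+@\S+", " ", text)

def strip_punctuation(text: str) -> str:
    # Delete every ASCII punctuation character.
    return text.translate(str.maketrans("", "", string.punctuation))

def collapse_whitespace(text: str) -> str:
    return " ".join(text.split())

def run_pipeline(text: str, functions: Sequence[Callable[[str], str]]) -> str:
    # Feed the output of each step into the next one, in list order.
    for function in functions:
        text = function(text)
    return text

steps = [lower_case, strip_emails, strip_punctuation, collapse_whitespace]
result = run_pipeline("Hello, I am Jane!!! Mail me at jane@example.com today.", steps)
print(result)
# output: hello i am jane mail me at today
```

In this sketch the order of the steps matters: running strip_punctuation before strip_emails would delete the "@" and "." characters, so the email pattern would no longer match. The custom list in the usage example likewise places remove_email and remove_url before remove_punctuation.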
Features
| Feature | Function |
|---|---|
| convert to lower case | to_lower |
| convert to upper case | to_upper |
| keep only alphabetic and numerical characters | keep_alpha_numeric |
| check and correct spellings | check_spelling |
| expand contractions | expand_contraction |
| remove URLs | remove_url |
| remove names | remove_name |
| remove emails | remove_email |
| remove phone numbers | remove_phone_number |
| remove SSNs | remove_ssn |
| remove credit card numbers | remove_credit_card_number |
| remove numbers | remove_number |
| remove bullets and numbering | remove_itemized_bullet_and_numbering |
| remove special characters | remove_special_character |
| remove punctuation | remove_punctuation |
| remove extra whitespace | remove_whitespace |
| normalize unicode (e.g., café -> cafe) | normalize_unicode |
| remove stop words | remove_stopword |
| tokenize words | tokenize_word |
| tokenize sentences | tokenize_sentence |
| substitute custom words (e.g., vs -> versus) | substitute_token |
| stem words | stem_word |
| lemmatize words | lemmatize_word |
| preprocess text through a sequence of preprocessing functions | preprocess_text |
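Two of the features listed above come with examples (café -> cafe and vs -> versus). The sketch below reproduces those two transformations with the standard library only, to show what the results look like. The helpers strip_accents and substitute_words are hypothetical names; the package's own normalize_unicode and substitute_token may be implemented differently and may expect other argument types, such as token lists.

```python
import unicodedata
from typing import Mapping

# Hypothetical helpers for illustration; not the package's implementation.

def strip_accents(text: str) -> str:
    # NFKD splits "é" into "e" plus a combining accent; drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def substitute_words(text: str, mapping: Mapping[str, str]) -> str:
    # Replace whole whitespace-separated words that appear in the mapping.
    return " ".join(mapping.get(word, word) for word in text.split())

print(strip_accents("café"))
# output: cafe
print(substitute_words("cats vs dogs", {"vs": "versus"}))
# output: cats versus dogs
```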
Hashes for text_preprocessing-0.0.8-py2.py3-none-any.whl

| Algorithm | Hash digest |
|---|---|
| SHA256 | 64742ff4c3df52dbf38e345214f373b62596ac19ecf4d96a91364211ca1b3366 |
| MD5 | 3b02f6a3f8f6e42fd25e43fbc9e683ab |
| BLAKE2b-256 | 1db36fff4496dfeeba944b8ab3566d4979ffc9417cdb0f3045f1d01e6e50cf5b |