A Python package for text preprocessing tasks in natural language processing
Project description
A Python package for text preprocessing tasks in natural language processing.
Usage
To use this text preprocessing package, first install it with pip:
pip install text-preprocessing
Then, import the package in your Python script and call the appropriate functions:
from text_preprocessing import preprocess_text
from text_preprocessing import to_lower, remove_email, remove_url, remove_punctuation, lemmatize_word
# Preprocess text using default preprocess functions in the pipeline
text_to_process = 'Helllo, I am John Doe!!! My email is john.doe@email.com. Visit our website www.johndoe.com'
preprocessed_text = preprocess_text(text_to_process)
print(preprocessed_text)
# output: hello email visit website
# Preprocess text using custom preprocess functions in the pipeline
preprocess_functions = [to_lower, remove_email, remove_url, remove_punctuation, lemmatize_word]
preprocessed_text = preprocess_text(text_to_process, preprocess_functions)
print(preprocessed_text)
# output: helllo i am john doe my email is visit our website
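Each preprocessing function takes a string and returns a string, so the pipeline is simply function composition applied in order. A minimal sketch of that composition, using illustrative stand-ins rather than the package's actual implementations:

```python
import re

# Illustrative stand-ins for pipeline steps (not the package's real code):
# each one accepts a string and returns a string, so they chain freely.
def to_lower(text):
    return text.lower()

def remove_url(text):
    # Strip http(s) links and bare www. addresses.
    return re.sub(r'(https?://\S+|www\.\S+)', '', text)

def remove_punctuation(text):
    # Drop everything that is not a word character or whitespace.
    return re.sub(r'[^\w\s]', '', text)

def preprocess_text(text, preprocess_functions):
    # Feed the output of each function into the next one.
    for fn in preprocess_functions:
        text = fn(text)
    # Collapse any whitespace left behind by the removals.
    return ' '.join(text.split())

print(preprocess_text('Visit www.example.com NOW!!!',
                      [to_lower, remove_url, remove_punctuation]))
# -> 'visit now'
```

Because every step shares the same string-in, string-out signature, reordering or omitting steps only requires editing the list of functions.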
Features
Feature | Function
---|---
convert to lower case | to_lower
convert to upper case | to_upper
keep only alphabetic and numeric characters | keep_alpha_numeric
check and correct spelling | check_spelling
expand contractions | expand_contraction
remove URLs | remove_url
remove names | remove_name
remove emails | remove_email
remove phone numbers | remove_phone_number
remove SSNs | remove_ssn
remove credit card numbers | remove_credit_card_number
remove numbers | remove_number
remove bullets and numbering | remove_itemized_bullet_and_numbering
remove special characters | remove_special_character
remove punctuation | remove_punctuation
remove extra whitespace | remove_whitespace
normalize unicode (e.g., café -> cafe) | normalize_unicode
remove stop words | remove_stopword
tokenize words | tokenize_word
tokenize sentences | tokenize_sentence
substitute custom words (e.g., vs -> versus) | substitute_token
stem words | stem_word
lemmatize words | lemmatize_word
preprocess text through a sequence of preprocessing functions | preprocess_text
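To illustrate the token-substitution idea from the table (e.g., vs -> versus), here is a hypothetical stand-in; the package's actual substitute_token signature may differ:

```python
def substitute_tokens(text, mapping):
    # Replace whole tokens according to a custom mapping,
    # e.g. {'vs': 'versus'} turns 'vs' into 'versus'.
    return ' '.join(mapping.get(token, token) for token in text.split())

print(substitute_tokens('cats vs dogs', {'vs': 'versus'}))
# -> 'cats versus dogs'
```

Operating on whole tokens (rather than raw substrings) avoids accidental replacements inside longer words such as 'versatile'.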