A Python package for text preprocessing tasks in natural language processing
Project description
A Python package for text preprocessing tasks in natural language processing.
Usage
To use this text preprocessing package, first install it using pip:
pip install text-preprocessing
Then, import the package in your Python script and call the appropriate functions:
from text_preprocessing import preprocess_text
from text_preprocessing import to_lower, remove_email, remove_url, remove_punctuation, lemmatize_word
# Preprocess text using default preprocess functions in the pipeline
text_to_process = 'Helllo, I am John Doe!!! My email is john.doe@email.com. Visit our website www.johndoe.com'
preprocessed_text = preprocess_text(text_to_process)
print(preprocessed_text)
# output: hello email visit website
# Preprocess text using custom preprocess functions in the pipeline
preprocess_functions = [to_lower, remove_email, remove_url, remove_punctuation, lemmatize_word]
preprocessed_text = preprocess_text(text_to_process, preprocess_functions)
print(preprocessed_text)
# output: helllo i am john doe my email is visit our website
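The individual preprocessing functions can also be called directly on a string, outside of the pipeline. The snippet below is a minimal sketch that assumes each function takes a string and returns the processed string, consistent with how they are chained in the pipeline above:
from text_preprocessing import to_lower, remove_punctuation
# Minimal sketch: applying individual functions directly
# (assumes each function accepts a string and returns a string).
sample_text = 'Hello, World!'
print(remove_punctuation(to_lower(sample_text)))
# expected output (assumption): hello world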
Features
Feature | Function
---|---
convert to lower case | to_lower
convert to upper case | to_upper
keep only alphabetic and numeric characters | keep_alpha_numeric
check and correct spellings | check_spelling
expand contractions | expand_contraction
remove URLs | remove_url
remove names | remove_name
remove emails | remove_email
remove phone numbers | remove_phone_number
remove SSNs | remove_ssn
remove credit card numbers | remove_credit_card_number
remove numbers | remove_number
remove bullets and numbering | remove_itemized_bullet_and_numbering
remove special characters | remove_special_character
remove punctuation | remove_punctuation
remove extra whitespace | remove_whitespace
normalize unicode (e.g., Café -> Cafe) | normalize_unicode
remove stop words | remove_stopword
tokenize words | tokenize_word
tokenize sentences | tokenize_sentence
substitute custom words (e.g., msft -> Microsoft) | substitute_token
stem words | stem_word
lemmatize words | lemmatize_word
preprocess text through a sequence of preprocessing functions | preprocess_text
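Any of the functions above can be chained into a custom pipeline in the same way as the usage example. The following is a minimal sketch, assuming these functions take and return strings like the ones used earlier; the output shown is illustrative, not guaranteed:
from text_preprocessing import preprocess_text
from text_preprocessing import to_lower, remove_number, remove_stopword, stem_word
# Minimal sketch of a custom pipeline built from functions in the table above
# (assumes each function accepts a string and returns a string, as in the usage example).
text = 'The 2 quick brown foxes jumped over 3 lazy dogs'
pipeline = [to_lower, remove_number, remove_stopword, stem_word]
print(preprocess_text(text, pipeline))
# illustrative output (assumption): quick brown fox jump lazi dog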