A Python package for text preprocessing tasks in natural language processing.
Usage
To use this text preprocessing package, import preprocess_text along with any individual functions needed for a custom pipeline:
from text_preprocessing import preprocess_text
from text_preprocessing import to_lower, remove_email, remove_url, remove_punctuation, lemmatize_word
# Preprocess text using the default preprocess functions in the pipeline
text_to_process = 'Helllo, I am John Doe!!! My email is john.doe@email.com. Visit our website www.johndoe.com'
preprocessed_text = preprocess_text(text_to_process)
print(preprocessed_text)
# Preprocess text using custom preprocess functions in the pipeline
# (function names match the Features table, e.g. remove_punctuation)
preprocess_functions = [to_lower, remove_email, remove_url, remove_punctuation, lemmatize_word]
preprocessed_text = preprocess_text(text_to_process, preprocess_functions)
print(preprocessed_text)
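The custom pipeline above is a list of functions applied to the text one after another. The following is a minimal sketch of that idea using only the standard library. It is not the package's implementation: every name prefixed with `sketch_` and the `run_pipeline` helper are hypothetical stand-ins, and the regular expressions are deliberately simplistic.

```python
import re
import string
from typing import Callable, Iterable


# Hypothetical stand-ins for pipeline steps; NOT the package's own functions.
def sketch_to_lower(text: str) -> str:
    return text.lower()


def sketch_remove_email(text: str) -> str:
    # Very rough pattern: any run of non-space characters containing "@".
    return re.sub(r"\S+@\S+", "", text)


def sketch_remove_url(text: str) -> str:
    # Rough pattern: tokens starting with http://, https://, or www.
    return re.sub(r"(?:https?://|www\.)\S+", "", text)


def sketch_remove_punctuation(text: str) -> str:
    return text.translate(str.maketrans("", "", string.punctuation))


def sketch_remove_whitespace(text: str) -> str:
    # Collapse runs of whitespace and trim both ends.
    return " ".join(text.split())


def run_pipeline(text: str, steps: Iterable[Callable[[str], str]]) -> str:
    """Apply each step in order, feeding the output of one into the next."""
    for step in steps:
        text = step(text)
    return text


text_to_process = (
    "Helllo, I am John Doe!!! My email is john.doe@email.com. "
    "Visit our website www.johndoe.com"
)
steps = [
    sketch_to_lower,
    sketch_remove_email,
    sketch_remove_url,
    sketch_remove_punctuation,
    sketch_remove_whitespace,
]
result = run_pipeline(text_to_process, steps)
print(result)  # helllo i am john doe my email is visit our website
```

Order matters in a pipeline like this: the email and URL steps run before punctuation removal, because stripping the dots and "@" first would leave nothing for those patterns to match.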
Features
Feature | Function |
---|---|
convert to lower case | to_lower |
convert to upper case | to_upper |
keep only alphabetic and numerical characters | keep_alpha_numeric |
check and correct spellings | check_spelling |
expand contractions | expand_contraction |
remove URLs | remove_url |
remove names | remove_name |
remove emails | remove_email |
remove phone numbers | remove_phone_number |
remove SSNs | remove_ssn |
remove credit card numbers | remove_credit_card_number |
remove numbers | remove_number |
remove special characters | remove_special_character |
remove punctuations | remove_punctuation |
remove extra whitespace | remove_whitespace |
normalize unicode (e.g., Café -> Cafe) | normalize_unicode |
remove stop words | remove_stopword |
tokenize words | tokenize_word |
tokenize sentences | tokenize_sentence |
substitute custom words (e.g., msft -> Microsoft) | substitute_token |
stem words | stem_word |
lemmatize words | lemmatize_word |
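
Two of the examples in the table (Café -> Cafe and msft -> Microsoft) can be illustrated with the standard library alone. This is a sketch of the general techniques, not the package's code: the `sketch_` function names, their signatures, and the assumption that substitution works on a list of tokens with a lookup dictionary are all hypothetical.

```python
import unicodedata
from typing import Dict, List


def sketch_normalize_unicode(text: str) -> str:
    # NFKD decomposes "é" into "e" plus a combining accent; encoding to ASCII
    # with errors="ignore" then drops the accent (and any other non-ASCII).
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def sketch_substitute_token(tokens: List[str], substitutions: Dict[str, str]) -> List[str]:
    # Replace a token when its lowercase form is a key in the mapping;
    # otherwise keep the token unchanged.
    return [substitutions.get(token.lower(), token) for token in tokens]


print(sketch_normalize_unicode("Café"))  # Cafe
print(sketch_substitute_token(["I", "work", "at", "msft"], {"msft": "Microsoft"}))
# ['I', 'work', 'at', 'Microsoft']
```

Note that the ASCII-encoding approach silently discards characters with no ASCII decomposition, so it suits accent stripping in Latin-script text better than general multilingual input.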