A package for text preprocessing
Project description
nlp-preprocessing
nlp-preprocessing provides text preprocessing functions, e.g. text cleaning, dataset preparation, and tokenization.
Installation
pip install nlp_preprocessing
Tutorial
1. Text Cleaning
from nlp_preprocessing import clean
texts = ["Hi I am's nakdur"]
cleaned_texts = clean.clean_v1(texts)
There are also multiple individual cleaning functions; a composition sketch follows the list:
data_list = to_lower(data_list)
data_list = to_normalize(data_list)
data_list = remove_href(data_list)
data_list = remove_control_char(data_list)
data_list = remove_duplicate(data_list)
data_list = remove_underscore(data_list)
data_list = seperate_spam_chars(data_list)
data_list = seperate_brakets_quotes(data_list)
data_list = break_short_words(data_list)
data_list = break_long_words(data_list)
data_list = remove_ending_underscore(data_list)
data_list = remove_starting_underscore(data_list)
data_list = seperate_end_word_punctuations(data_list)
data_list = seperate_start_word_punctuations(data_list)
data_list = clean_contractions(data_list)
data_list = remove_s(data_list)
data_list = isolate_numbers(data_list)
data_list = regex_split_word(data_list)
data_list = leet_clean(data_list)
data_list = clean_open_holded_words(data_list)
data_list = clean_multiple_form(data_list)
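Only clean_v1 is shown imported above; assuming the individual functions are also exposed by nlp_preprocessing.clean (an assumption, not confirmed here), a custom pipeline could chain just the steps a project needs:

# Sketch under the assumption that these names are importable from the clean module.
from nlp_preprocessing.clean import to_lower, remove_control_char, clean_contractions

def custom_clean(data_list):
    data_list = to_lower(data_list)             # lowercase everything
    data_list = remove_control_char(data_list)  # strip control characters
    data_list = clean_contractions(data_list)   # normalize contractions like "am's"
    return data_list

print(custom_clean(["Hi I am's nakdur"]))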
2. Dataset Preparation
from nlp_preprocessing import dataset as ds
import pandas as pd
text = ['I am Test 1','I am Test 2']
label = ['A','B']
aspect = ['C','D']
data = pd.DataFrame({'text': text*5, 'label': label*5, 'aspect': aspect*5})
print(data)  # inspect the 10-row toy dataframe
data_config = {
    'data_class': 'multi-label',
    'x_columns': ['text'],
    'y_columns': ['label', 'aspect'],
    'one_hot_encoded_columns': [],
    'label_encoded_columns': ['label', 'aspect'],
    'data': data,
    'split_ratio': 0.1
}
dataset = ds.Dataset(data_config)
train, test = dataset.get_train_test_data()
print(train['Y_train'],train['X_train'])
print(test['Y_test'],test['X_test'])
print(dataset.data_config)
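If one-hot targets are preferred over label encoding, the same config keys appear to support it; here is a hedged variant, assuming one_hot_encoded_columns accepts the same column list as label_encoded_columns:

# Sketch, not verified against the library: swap the encoding strategy
# by moving the target columns into one_hot_encoded_columns.
one_hot_config = {
    'data_class': 'multi-label',
    'x_columns': ['text'],
    'y_columns': ['label', 'aspect'],
    'one_hot_encoded_columns': ['label', 'aspect'],  # assumption: mirrors label_encoded_columns
    'label_encoded_columns': [],
    'data': data,
    'split_ratio': 0.1
}
one_hot_dataset = ds.Dataset(one_hot_config)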
3. Seq token generator
from nlp_preprocessing import seq_gen

texts = ['I am Test 2', 'I am Test 1', 'I am Test 1', 'I am Test 1', 'I am Test 1', 'I am Test 2', 'I am Test 1', 'I am Test 2', 'I am Test 2']
tokens = seq_gen.get_word_sequences(texts)
print(tokens)
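get_word_sequences presumably yields one token sequence per input text; since lengths vary with the sentences, a small helper (hypothetical, not part of nlp_preprocessing) can pad or truncate them to a fixed length before batching:

# Hypothetical helper: pad/truncate each sequence to max_len so the batch
# becomes rectangular. pad_value=0 assumes integer token ids.
def pad_sequences(seqs, max_len, pad_value=0):
    return [list(s)[:max_len] + [pad_value] * max(0, max_len - len(s)) for s in seqs]

padded = pad_sequences(tokens, max_len=8)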
4. Token embedding creator
from nlp_preprocessing import token_embedding_creator
vector_file='../input/fasttext-crawl-300d-2m-with-subword/crawl-300d-2m-subword/crawl-300d-2M-subword.vec'
input_file='../input/complete-tweet-sentiment-extraction-data/tweet_dataset.csv'
column_name='text'
processor = token_embedding_creator.Processor(vector_file, input_file, column_name)
output_dir = '.'
special_tokens = ['[UNK]','[SEP]']
processor.process(output_dir, special_tokens)
#Loading vectors from ../input/fasttext-crawl-300d-2m-with-subword/crawl-300d-2m-subword/crawl-300d-2M-subword.vec type: index
#Writing vocab at ./full_vocab.txt
#1%| | 218/40000 [00:00<00:18, 2176.72it/s]
#Generating unique tokens ...
#100%|██████████| 40000/40000 [00:18<00:00, 2180.53it/s]
#Writing vocab at ./vocab.txt
#Loading vectors from ../input/fasttext-crawl-300d-2m-with-subword/crawl-300d-2m-subword/crawl-300d-2M-subword.vec type: embedding
#Writing vocab at ./vocab.txt
#Making Final Embedding ...
#Writing embedding at ./embeddings.npy
#Processing Done !
#Vocab stored at : ./vocab.txt of size: 25475
#Embedding stored at : ./embeddings.npy of shape: (25475, 300)
5. seq_parser_token_generator
from nlp_preprocessing import seq_parser_token_generator
text = ['hi how are you']
pos_token, tag_token, dep_token = seq_parser_token_generator.get_tokens(text[0])
pos_tokens, tag_tokens, dep_tokens = seq_parser_token_generator.get_tokens_plus(text, 120)  # 120: presumably the max sequence length
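The three outputs presumably carry part-of-speech, fine-grained tag, and dependency labels for each token; assuming they are equal-length lists, they can be inspected side by side:

for pos, tag, dep in zip(pos_token, tag_token, dep_token):
    print(pos, tag, dep)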
Download files
Source Distribution
nlp_preprocessing-0.2.0.tar.gz (17.7 kB)
Built Distribution
nlp_preprocessing-0.2.0-py3-none-any.whl (19.4 kB)
File details
Details for the file nlp_preprocessing-0.2.0.tar.gz.
File metadata
- Download URL: nlp_preprocessing-0.2.0.tar.gz
- Upload date:
- Size: 17.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/45.2.0 requests-toolbelt/0.9.1 tqdm/4.48.2 CPython/3.7.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 169a88d97244d91c4ff940dfa6e6d845e62f458700656f94db70fbdcb7eb09cc |
| MD5 | da8bce346ed79f6c6543e8b35486afdd |
| BLAKE2b-256 | dc9dcb73827038b4785ae42ee8cbfe10d624b62f6100031417162873d7df5849 |
File details
Details for the file nlp_preprocessing-0.2.0-py3-none-any.whl.
File metadata
- Download URL: nlp_preprocessing-0.2.0-py3-none-any.whl
- Upload date:
- Size: 19.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/45.2.0 requests-toolbelt/0.9.1 tqdm/4.48.2 CPython/3.7.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d58da7c1f8bb7b6fe5e7aecc466f25737d4414cb284f743246913f33e5f2d172 |
| MD5 | 865ee565dc77d05b5ca9a7b58ceb4b21 |
| BLAKE2b-256 | 2c1f9d24942d6677712b3a252c945d7a4661af84920676f52951c0025abebe7e |