text_hammer is a text preprocessing package.

Project description

Dependencies

pip install spacy==2.2.3
python -m spacy download en_core_web_sm
pip install beautifulsoup4==4.9.1
pip install textblob==0.15.3

Installation

pip install text_hammer

How to use it for preprocessing

You need Python 3 and spaCy installed for this to work.

import re
import text_hammer as th

def get_clean(x):
    x = str(x).lower().replace('\\', '').replace('_', ' ')
    x = th.cont_exp(x)
    x = th.remove_emails(x)
    x = th.remove_urls(x)
    x = th.remove_html_tags(x)
    x = th.remove_rt(x)
    x = th.remove_accented_chars(x)
    x = th.remove_special_chars(x)
    x = re.sub("(.)\\1{2,}", "\\1", x)
    return x
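
To make the order of the steps in get_clean more concrete, here is a rough standard-library sketch of a few of them. This is not how text_hammer implements these functions; the approx_* names and the regular expressions are assumptions made purely for illustration, and the real functions may behave differently on edge cases.

```python
import re
import unicodedata

# Hypothetical stand-ins; NOT the text_hammer implementations.
def approx_remove_emails(text):
    # Drop anything shaped like user@domain.tld
    return re.sub(r"[\w.+-]+@[\w-]+\.[\w.-]+", "", text)

def approx_remove_urls(text):
    # Drop tokens that start with http://, https:// or www.
    return re.sub(r"(?:https?://|www\.)\S+", "", text)

def approx_remove_html_tags(text):
    # Drop anything between angle brackets
    return re.sub(r"<[^>]+>", "", text)

def approx_remove_accented_chars(text):
    # Decompose accented characters, then discard the non-ASCII marks
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

def approx_clean(text):
    text = str(text).lower()
    text = approx_remove_emails(text)
    text = approx_remove_urls(text)
    text = approx_remove_html_tags(text)
    text = approx_remove_accented_chars(text)
    # Collapse the gaps left behind by the removals
    return " ".join(text.split())

sample = "<p>Café review: mail me at bob@example.com or see https://example.com/x</p>"
cleaned = approx_clean(sample)
print(cleaned)
```

In this sketch the URL pattern also consumes the closing </p> that is attached to the link, which shows that the order of the steps can affect the result.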

Use the following if you want to apply the functions one at a time

import pandas as pd
import numpy as np
import text_hammer as th

df = pd.read_csv('imdb_reviews.txt', sep = '\t', header = None)
df.columns = ['reviews', 'sentiment']

# Apply the preprocessing steps in sequence
df['reviews'] = df['reviews'].apply(lambda x: th.cont_exp(x)) #you're -> you are; i'm -> i am
df['reviews'] = df['reviews'].apply(lambda x: th.remove_emails(x))
df['reviews'] = df['reviews'].apply(lambda x: th.remove_html_tags(x))
df['reviews'] = df['reviews'].apply(lambda x: th.remove_urls(x))

df['reviews'] = df['reviews'].apply(lambda x: th.remove_special_chars(x))
df['reviews'] = df['reviews'].apply(lambda x: th.remove_accented_chars(x))
df['reviews'] = df['reviews'].apply(lambda x: th.make_base(x)) #ran -> run
df['reviews'] = df['reviews'].apply(lambda x: th.spelling_correction(x).raw_sentences[0]) #seplling -> spelling

Note: Avoid using make_base and spelling_correction on very large datasets; otherwise processing might take hours.
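
If the slow steps are still needed, one possible way to reduce the cost is to run them only once per distinct text. The sketch below uses functools.lru_cache for that; expensive_step is a hypothetical stand-in for a slow call such as th.make_base, not part of text_hammer, and the saving depends entirely on how many duplicate texts the dataset contains.

```python
from functools import lru_cache

calls = 0  # counts how often the slow function actually runs

def expensive_step(text):
    # Hypothetical stand-in for a slow call such as th.make_base(text)
    global calls
    calls += 1
    return text.strip().lower()

@lru_cache(maxsize=None)
def cached_step(text):
    # Repeated inputs are answered from the cache
    return expensive_step(text)

reviews = ["Great movie", "Great movie", "Bad plot", "Great movie"]
processed = [cached_step(r) for r in reviews]
print(processed, calls)
```

Here four reviews trigger only two real calls, because two of the texts are distinct.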

Extra

import re

x = 'lllooooovvveeee youuuu'
x = re.sub("(.)\\1{2,}", "\\1", x)
print(x)
---
love you
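
The pattern above reduces any run of three or more identical characters to a single character, so an elongated word that legitimately contains a double letter loses it. The sketch below contrasts that behavior with a variant that keeps two characters; squeeze_to_one and squeeze_to_two are hypothetical helper names, not part of text_hammer.

```python
import re

def squeeze_to_one(text):
    # Same pattern as above: a run of 3+ identical characters becomes one
    return re.sub(r"(.)\1{2,}", r"\1", text)

def squeeze_to_two(text):
    # Variant: keep two characters so genuine double letters survive
    return re.sub(r"(.)\1{2,}", r"\1\1", text)

one = squeeze_to_one("goooood")
two = squeeze_to_two("goooood")
print(one, two)
```

Neither variant is right for every word: the first turns "goooood" into "god", while the second leaves doubled letters in words that should have none. Which one fits better depends on the data.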

Download files

Source distribution: text_hammer-0.1.4.tar.gz (6.8 kB)

Built distribution: text_hammer-0.1.4-py3-none-any.whl (7.6 kB, Python 3)
