
This is a text preprocessing package.

Project description

Dependencies

pip install spacy==2.2.3
python -m spacy download en_core_web_sm
pip install beautifulsoup4==4.9.1
pip install textblob==0.15.3

Installation

pip install text_hammer

How to use it for preprocessing

You need Python 3 and spaCy installed for this to work.

import re
import text_hammer as th

def get_clean(x):
    x = str(x).lower().replace('\\', '').replace('_', ' ')
    x = th.cont_exp(x)
    x = th.remove_emails(x)
    x = th.remove_urls(x)
    x = th.remove_html_tags(x)
    x = th.remove_rt(x)
    x = th.remove_accented_chars(x)
    x = th.remove_special_chars(x)
    x = re.sub("(.)\\1{2,}", "\\1", x)
    return x
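
get_clean applies its steps in a fixed order, and the order can change the result: a step that strips punctuation will typically destroy the '@' and '.' that an email-removal step looks for. The following is a minimal sketch of that effect using only the standard library. The apply_steps helper and the two regex-based steps are hypothetical stand-ins written for this illustration; they are not text_hammer functions and may behave differently from them.

```python
import re
from functools import reduce

def strip_emails(text):
    # Stand-in step (not text_hammer): drop anything shaped like name@host.tld
    return re.sub(r"\S+@\S+\.\S+", "", text)

def strip_special(text):
    # Stand-in step (not text_hammer): keep only letters, digits and whitespace
    return re.sub(r"[^A-Za-z0-9\s]", "", text)

def apply_steps(text, steps):
    # Feed the output of each step into the next one, left to right
    return reduce(lambda acc, step: step(acc), steps, text)

sample = "Mail me at bob@example.com now!"

# Emails removed first: the address disappears completely
good = apply_steps(sample, [str.lower, strip_emails, strip_special])

# Special characters removed first: the address is mangled and no longer matches
bad = apply_steps(sample, [str.lower, strip_special, strip_emails])

print(good)
print(bad)
```

In the second ordering the address survives as a run of letters, which is why get_clean calls remove_emails and remove_urls before remove_special_chars.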

Use this if you want to apply the functions one by one

import pandas as pd
import numpy as np
import text_hammer as th

df = pd.read_csv('imdb_reviews.txt', sep = '\t', header = None)
df.columns = ['reviews', 'sentiment']

# Apply a series of preprocessing steps
df['reviews'] = df['reviews'].apply(lambda x: th.cont_exp(x)) #you're -> you are; i'm -> i am
df['reviews'] = df['reviews'].apply(lambda x: th.remove_emails(x))
df['reviews'] = df['reviews'].apply(lambda x: th.remove_html_tags(x))
df['reviews'] = df['reviews'].apply(lambda x: th.remove_urls(x))

df['reviews'] = df['reviews'].apply(lambda x: th.remove_special_chars(x))
df['reviews'] = df['reviews'].apply(lambda x: th.remove_accented_chars(x))
df['reviews'] = df['reviews'].apply(lambda x: th.make_base(x)) #ran -> run
df['reviews'] = df['reviews'].apply(lambda x: th.spelling_correction(x).raw_sentences[0]) #seplling -> spelling

Note: Avoid using make_base and spelling_correction on very large datasets; otherwise processing might take hours.
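
Before committing to a slow step on the full dataset, it can help to time it on a small sample and extrapolate. The following is a rough sketch of that idea using only the standard library. The estimate_total_seconds helper and its sample_size parameter are hypothetical names introduced here, and str.lower stands in for whichever step you want to measure (for example th.make_base).

```python
import time

def estimate_total_seconds(func, texts, sample_size=100):
    # Time func on the first sample_size texts and scale up linearly.
    # This is only a rough guide: it assumes the sample is representative
    # of the whole dataset in text length and content.
    sample = texts[:sample_size]
    if not sample:
        return 0.0
    start = time.perf_counter()
    for text in sample:
        func(text)
    elapsed = time.perf_counter() - start
    return elapsed / len(sample) * len(texts)

# Toy data; str.lower is a cheap stand-in for an expensive step
reviews = ["A Great Movie", "Terrible plot", "Would watch again"] * 1000
estimate = estimate_total_seconds(str.lower, reviews, sample_size=50)
print(f"Estimated time for {len(reviews)} rows: {estimate:.4f} s")
```

If the estimate is too high, consider running the expensive steps on a subset only, or skipping them.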

Extra

import re

x = 'lllooooovvveeee youuuu'
x = re.sub("(.)\\1{2,}", "\\1", x)
print(x)
---
love you
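
The pattern above matches any character followed by two or more copies of itself and replaces the whole run with a single character. The short sketch below shows what that implies for other inputs. The collapse_repeats wrapper is a hypothetical helper name used only for this illustration.

```python
import re

def collapse_repeats(text):
    # Replace any run of 3 or more identical characters with one character.
    # Hypothetical wrapper around the same pattern used above.
    return re.sub(r"(.)\1{2,}", r"\1", text)

stretched = collapse_repeats("soooo goooood")  # runs collapse to ONE letter
doubled = collapse_repeats("good book")        # runs of two are left alone
numeric = collapse_repeats("price: 1000")      # digits are affected too

print(stretched)
print(doubled)
print(numeric)
```

Two caveats follow from this. A stretched word whose correct spelling has a double letter loses it ("goooood" becomes "god", not "good"), and digit runs are shortened as well ("1000" becomes "10"), so the substitution is best applied only to text where numbers do not matter.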


Download files


Source Distribution

text_hammer-0.1.5.tar.gz (6.8 kB)

Uploaded Source

Built Distribution

text_hammer-0.1.5-py3-none-any.whl (7.6 kB)

Uploaded Python 3

File details

Details for the file text_hammer-0.1.5.tar.gz.

File metadata

  • Download URL: text_hammer-0.1.5.tar.gz
  • Upload date:
  • Size: 6.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.3.0 pkginfo/1.7.0 requests/2.25.1 setuptools/47.1.0 requests-toolbelt/0.9.1 tqdm/4.59.0 CPython/3.7.9

File hashes

Hashes for text_hammer-0.1.5.tar.gz
Algorithm Hash digest
SHA256 dbf6e3b58f3c758cc91fb3776cf8b0980657f8ce7aceb7163e8e1c7e448273d5
MD5 e1b4c158d5a254fbf9ddc9e6e9886b17
BLAKE2b-256 3267cb0e82a3065520e3bbf77a4ebcf29c8df0c913df0dd9affba1840b3138c4


File details

Details for the file text_hammer-0.1.5-py3-none-any.whl.

File metadata

  • Download URL: text_hammer-0.1.5-py3-none-any.whl
  • Upload date:
  • Size: 7.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.3.0 pkginfo/1.7.0 requests/2.25.1 setuptools/47.1.0 requests-toolbelt/0.9.1 tqdm/4.59.0 CPython/3.7.9

File hashes

Hashes for text_hammer-0.1.5-py3-none-any.whl
Algorithm Hash digest
SHA256 6ad509e964d1a51e465d88a13e1dc77bf9636f1315daade5ad986d0ae5018e5b
MD5 51fb4e884521033913b98f60bb4544de
BLAKE2b-256 843a955cead96434a981761e4dbe5ca24241df8595f9459875ea1be7bf6eece7

