This is a text preprocessing package.
Dependencies
pip install spacy==2.2.3
python -m spacy download en_core_web_sm
pip install beautifulsoup4==4.9.1
pip install textblob==0.15.3
Installation
pip install text_hammer
How to use it for preprocessing
spaCy and Python 3 must be installed for this to work.
import re
import text_hammer as th
def get_clean(x):
    # Lowercase, drop backslashes, and turn underscores into spaces
    x = str(x).lower().replace('\\', '').replace('_', ' ')
    x = th.cont_exp(x)
    x = th.remove_emails(x)
    x = th.remove_urls(x)
    x = th.remove_html_tags(x)
    x = th.remove_rt(x)
    x = th.remove_accented_chars(x)
    x = th.remove_special_chars(x)
    # Collapse any run of three or more identical characters to one
    x = re.sub("(.)\\1{2,}", "\\1", x)
    return x
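The first and last lines of get_clean use only the standard library, so their behavior can be checked without text_hammer installed. The sketch below isolates those two steps in a hypothetical helper, basic_normalize (a name that is not part of text_hammer). It also shows that because the input is passed through str(), missing values such as None or NaN become the literal strings "none" and "nan" rather than raising an error.

```python
import re

def basic_normalize(x):
    # Hypothetical helper: only the standard-library steps of get_clean,
    # with the text_hammer calls left out.
    x = str(x).lower().replace('\\', '').replace('_', ' ')
    # Collapse any run of three or more identical characters to one.
    return re.sub(r"(.)\1{2,}", r"\1", x)

print(basic_normalize("Hello_World\\ Sooooo GOOD!!!"))  # hello world so good!
print(basic_normalize(None))                            # none
print(basic_normalize(float("nan")))                    # nan
```

If the data contains missing values, it may be worth dropping or filling them before cleaning so that "nan" does not end up in the text.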
To apply the functions one at a time instead, for example on a pandas DataFrame column:
import pandas as pd
import numpy as np
import text_hammer as th
df = pd.read_csv('imdb_reviews.txt', sep = '\t', header = None)
df.columns = ['reviews', 'sentiment']
# Apply a series of preprocessing steps
df['reviews'] = df['reviews'].apply(lambda x: th.cont_exp(x)) #you're -> you are; i'm -> i am
df['reviews'] = df['reviews'].apply(lambda x: th.remove_emails(x))
df['reviews'] = df['reviews'].apply(lambda x: th.remove_html_tags(x))
df['reviews'] = df['reviews'].apply(lambda x: th.remove_urls(x))
df['reviews'] = df['reviews'].apply(lambda x: th.remove_special_chars(x))
df['reviews'] = df['reviews'].apply(lambda x: th.remove_accented_chars(x))
df['reviews'] = df['reviews'].apply(lambda x: th.make_base(x)) # ran -> run
df['reviews'] = df['reviews'].apply(lambda x: th.spelling_correction(x).raw_sentences[0]) #seplling -> spelling
Note: Avoid using make_base and spelling_correction on very large datasets; otherwise, processing may take hours.
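One way to judge whether a slow step is affordable is to time it on a small sample and extrapolate. The sketch below is a minimal illustration of that idea using only the standard library. Both estimate_seconds and slow_step are hypothetical names, and slow_step is a stand-in that simply sleeps; in practice you would pass a function such as th.make_base and a list of texts (for example df['reviews'].tolist()).

```python
import time

def estimate_seconds(func, texts, sample_size=100):
    # Time func on the first sample_size items and scale to the full list.
    sample = texts[:sample_size]
    if len(sample) == 0:
        return 0.0
    start = time.perf_counter()
    for text in sample:
        func(text)
    elapsed = time.perf_counter() - start
    return elapsed / len(sample) * len(texts)

def slow_step(text):
    # Stand-in for an expensive preprocessing call (assumption, not text_hammer).
    time.sleep(0.001)
    return text

texts = ["some review text"] * 5000
estimate = estimate_seconds(slow_step, texts, sample_size=20)
print(f"Estimated time for {len(texts)} texts: {estimate:.1f} s")
```

The estimate is rough: it assumes the sampled texts are about as long as the rest, so a random sample would be more representative than the first rows.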
Extra
import re
x = 'lllooooovvveeee youuuu'
x = re.sub("(.)\\1{2,}", "\\1", x)
print(x)
---
love you
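The substitution above collapses every run of three or more identical characters to a single character, which also shortens words whose correct spelling has a double letter. The sketch below compares it with a variant that keeps two characters instead. The function names squeeze_to_one and squeeze_to_two are hypothetical and not part of text_hammer; neither pattern is right for every word, so the choice depends on the data.

```python
import re

def squeeze_to_one(text):
    # Same pattern as above: runs of 3+ identical characters become 1.
    return re.sub(r"(.)\1{2,}", r"\1", text)

def squeeze_to_two(text):
    # Hypothetical variant: runs of 3+ identical characters become 2.
    return re.sub(r"(.)\1{2,}", r"\1\1", text)

for word in ["lllooooovvveeee", "goooood", "coooool"]:
    print(word, "->", squeeze_to_one(word), "/", squeeze_to_two(word))
# lllooooovvveeee -> love / lloovvee
# goooood -> god / good
# coooool -> col / cool
```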
File details
Details for the file text_hammer-0.1.5.tar.gz.
File metadata
- Download URL: text_hammer-0.1.5.tar.gz
- Upload date:
- Size: 6.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.3.0 pkginfo/1.7.0 requests/2.25.1 setuptools/47.1.0 requests-toolbelt/0.9.1 tqdm/4.59.0 CPython/3.7.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | dbf6e3b58f3c758cc91fb3776cf8b0980657f8ce7aceb7163e8e1c7e448273d5 |
| MD5 | e1b4c158d5a254fbf9ddc9e6e9886b17 |
| BLAKE2b-256 | 3267cb0e82a3065520e3bbf77a4ebcf29c8df0c913df0dd9affba1840b3138c4 |
File details
Details for the file text_hammer-0.1.5-py3-none-any.whl.
File metadata
- Download URL: text_hammer-0.1.5-py3-none-any.whl
- Upload date:
- Size: 7.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.3.0 pkginfo/1.7.0 requests/2.25.1 setuptools/47.1.0 requests-toolbelt/0.9.1 tqdm/4.59.0 CPython/3.7.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6ad509e964d1a51e465d88a13e1dc77bf9636f1315daade5ad986d0ae5018e5b |
| MD5 | 51fb4e884521033913b98f60bb4544de |
| BLAKE2b-256 | 843a955cead96434a981761e4dbe5ca24241df8595f9459875ea1be7bf6eece7 |