Library for tweets preprocessing
Tweets Preprocessor
Preprocessor is a preprocessing library for tweet data, written in Python. Machine learning systems built on tweets and other text data, such as Twitter sentiment analysis or topic modelling, require preprocessing, both to improve data quality and to reduce dimensionality.
This library makes it easy to clean tweets, so you don't have to write the same helper functions over and over again.
Features
Currently supports cleaning:
- URLs
- Hashtags
- Mentions
- Emojis
- Smileys
- Length constraint
- Removal of tweets containing specific keywords such as birthday, congratulations, etc.
- .csv and .xlsx file support
Python 3.9+ on Windows.
Usage
Basic cleaning:
.. code:: python

    >>> # Import Preprocess from the library
    >>> from tweets_preprocess import Preprocess
    >>> import pandas as pd
    >>> # Load the raw tweets; the 'Text' column holds the tweet text
    >>> data = pd.read_excel(r"D:\Ipac_new\My_Python_Lib\tweet_preprocess\sample.xlsx")
    >>> data['pre_text'] = ""
    >>> # Sample keywords; tweets containing them are removed
    >>> rem = ["happy birthday", "birthday", "congratulations", "rip", "thank you", "congrats", "thanks"]
    >>> length = 35  # length constraint
    >>> # Instantiate a Preprocess object and clean the tweets
    >>> p = Preprocess(data, 'Text', rem, length)
    >>> d = p.process()
    >>> data['pre_text'] = pd.Series(d)
    >>> # Keep only the rows whose cleaned text is not empty
    >>> d1 = data.loc[data['pre_text'] != '']
    >>> # Save the cleaned tweets to a CSV file
    >>> d1.to_csv('pre-data.csv')
Example:
Raw Tweet: 'Tweet Preprocessor is #awesome 👍 https://github.com/anusha-ipac/tweets_preprocess'
Cleaned Tweet: 'Preprocessor is'
Hashtags, emojis, and URLs are removed from the raw tweet, and the cleaned tweet is returned.
Tweets containing any of the specified keywords are removed entirely.
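To make the kind of cleaning described above concrete, here is a minimal standalone sketch using only the standard library. It is not the implementation used by ``Preprocess``; the names ``clean_tweet`` and ``keep_tweet`` and all of the regular expressions are hypothetical, the emoji range is only a rough approximation, and smileys and the length constraint are not covered.

```python
import re

# Hypothetical patterns for illustration; not taken from tweets_preprocess.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
# Rough emoji coverage only; real emoji ranges are wider than this.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def clean_tweet(text):
    """Strip URLs, mentions, hashtags and (some) emojis, then tidy whitespace."""
    for pattern in (URL_RE, MENTION_RE, HASHTAG_RE, EMOJI_RE):
        text = pattern.sub("", text)
    return " ".join(text.split())


def keep_tweet(cleaned, keywords):
    """Return False if any keyword occurs as a whole word or phrase."""
    lowered = cleaned.lower()
    return not any(
        re.search(r"\b" + re.escape(keyword.lower()) + r"\b", lowered)
        for keyword in keywords
    )


raw = "Great talk by @alice today #python 🎉 https://example.com/slides"
print(clean_tweet(raw))  # Great talk by today
print(keep_tweet("happy birthday to you", ["birthday", "rip"]))  # False
```

Matching keywords on word boundaries is a design choice in this sketch: it keeps a short keyword such as "rip" from also removing tweets that merely contain "trip".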
Processing files:
Preprocessor currently supports processing .csv and .xlsx formats.
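One way to prepare either file type for ``Preprocess`` is to load it into a pandas DataFrame first. The helper below is a sketch under that assumption; ``load_tweets`` is a hypothetical name and is not part of the library. Reading .xlsx files with pandas requires an Excel engine such as openpyxl to be installed.

```python
from pathlib import Path

import pandas as pd


def load_tweets(path):
    """Load a .csv or .xlsx file into a DataFrame, chosen by file extension."""
    suffix = Path(path).suffix.lower()
    if suffix == ".csv":
        return pd.read_csv(path)
    if suffix == ".xlsx":
        # pandas needs an Excel engine (for example openpyxl) for this call.
        return pd.read_excel(path)
    raise ValueError(f"Unsupported file type: {suffix!r}")
```

The returned DataFrame can then be passed to ``Preprocess`` together with the name of the text column, as in the usage example.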
Installation
Using pip:
.. code:: bash

    $ pip install tweets-preprocess

Using manual installation:

.. code:: bash

    $ python setup.py build
    $ python setup.py install