
Library for tweets preprocessing



Tweets Preprocessor


Preprocessor is a preprocessing library for tweet data, written in Python. Machine learning systems built on tweets and similar text data, such as Twitter sentiment analysis or topic modelling, need preprocessing both to improve data quality and to reduce dimensionality.

This library makes it easy to clean tweets so you don't have to write the same helper functions over and over again.

Features

Currently supports cleaning:

  • URLs
  • Hashtags
  • Mentions
  • Emojis
  • Smileys
  • Length constraint
  • Removing tweets that contain specific keywords such as birthday, congratulations, etc.
  • .csv and .xlsx file support
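
As a rough illustration of what cleaning these elements can involve, here is a minimal sketch using only the standard library. The patterns and the ``clean_tweet`` name are hypothetical and chosen for this example; they are not taken from the library, whose own rules may differ.

```python
import re

# Hypothetical patterns for illustration only; not the library's actual rules.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
# Simple text smileys such as :) ;-) =D, only when they stand alone as a token
SMILEY_RE = re.compile(r"(?<!\S)[:;=][-']?[)(DPp](?!\S)")
# A few common emoji code-point ranges; deliberately not exhaustive
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF\uFE0F]")


def clean_tweet(text: str) -> str:
    """Strip URLs, mentions, hashtags, smileys and emojis from one tweet."""
    # URLs go first so that a '#' or '@' inside a link is removed with it.
    for pattern in (URL_RE, MENTION_RE, HASHTAG_RE, SMILEY_RE, EMOJI_RE):
        text = pattern.sub(" ", text)
    # Collapse the whitespace left behind by the removals.
    return " ".join(text.split())
```

Each pattern is applied in turn and the leftover whitespace is collapsed at the end, so the order of the patterns only matters where one element can contain another (a URL containing ``#``, for instance).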

Python 3.9+ on Windows.

Usage

Basic cleaning:

.. code:: python

    >>> # Import Preprocess from the library, plus pandas for loading the data
    >>> from tweets_preprocess import Preprocess
    >>> import pandas as pd

    >>> # Load the tweets and add an empty column for the cleaned text
    >>> data = pd.read_excel(r"D:\Ipac_new\My_Python_Lib\tweet_preprocess\sample.xlsx")
    >>> data['pre_text'] = ""

    >>> # Sample keywords; tweets containing any of them are removed
    >>> rem = ["happy birthday", "birthday", "congratulations", "rip", "thank you", "congrats", "thanks"]
    >>> length = 35  # length constraint
    >>> # Instantiate a Preprocess object on the 'Text' column and run it
    >>> p = Preprocess(data, 'Text', rem, length)
    >>> d = p.process()

    >>> data['pre_text'] = pd.Series(d)

    >>> # Keep only the rows whose cleaned text is non-empty
    >>> d1 = data.loc[data['pre_text'] != '']
    >>> # Save the cleaned tweets to a CSV file
    >>> d1.to_csv('pre-data.csv')

Example:

Raw tweet: 'Tweet Preprocessor is #awesome 👍 https://github.com/anusha-ipac/tweets_preprocess'

Cleaned tweet: 'Preprocessor is'

Hashtags, emojis, and URLs are removed from the raw tweet and the cleaned tweet is returned.
Tweets that contain any of the specified keywords are removed.
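
The keyword and length filters can be pictured with the sketch below. It is only an assumption about the behaviour: the names ``build_keyword_pattern`` and ``keep_or_blank`` are hypothetical, the length value is treated here as a minimum character count, and a rejected tweet is represented by an empty string (which matches the ``!= ''`` filter in the usage example, but is not confirmed by the library).

```python
import re


def build_keyword_pattern(keywords):
    """Compile one case-insensitive pattern matching any keyword as a whole word."""
    if not keywords:
        return re.compile(r"(?!)")  # a pattern that never matches
    # Longest keywords first, so "happy birthday" is tried before "birthday".
    ordered = sorted(keywords, key=len, reverse=True)
    alternatives = "|".join(re.escape(k) for k in ordered)
    return re.compile(rf"\b(?:{alternatives})\b", re.IGNORECASE)


def keep_or_blank(cleaned, keyword_re, min_length):
    """Return the tweet unchanged, or '' if it should be dropped."""
    if keyword_re.search(cleaned):
        return ""  # contains an unwanted keyword
    if len(cleaned) < min_length:
        return ""  # assumed meaning of the length constraint: too short
    return cleaned


rem = ["happy birthday", "birthday", "congratulations", "rip",
       "thank you", "congrats", "thanks"]
pattern = build_keyword_pattern(rem)
```

Matching on word boundaries is a design choice of this sketch: a plain substring test would also reject tweets containing "trip" or "script" because of the keyword "rip".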

Processing files:

Preprocessor currently supports processing .csv and .xlsx formats.
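
For the .csv case, a file can also be processed row by row with the standard library alone. The sketch below is not part of the library: ``clean_csv`` and its parameters are hypothetical, and ``clean`` stands for any function that returns the cleaned text or an empty string for a tweet that should be dropped. An .xlsx file would need a reader such as ``pd.read_excel``, as in the usage example above.

```python
import csv


def clean_csv(src_path, dst_path, text_column, clean):
    """Copy src_path to dst_path, adding a 'pre_text' column and
    skipping rows whose cleaned text is empty. Returns the rows kept."""
    kept = 0
    with open(src_path, newline="", encoding="utf-8") as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)
        fieldnames = list(reader.fieldnames or []) + ["pre_text"]
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        for row in reader:
            row["pre_text"] = clean(row[text_column])
            if row["pre_text"]:  # drop tweets that were cleaned away
                writer.writerow(row)
                kept += 1
    return kept
```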

Installation

Using pip:

.. code:: bash

    $ pip install tweets-preprocess

Using manual installation:

.. code:: bash

    $ python setup.py build
    $ python setup.py install
