
Performs pre-processing of tweets

Project description


pytextprep

This is a Python package that offers additional text preprocessing functionality specifically designed for tweets. The package bundles functions to help with cleaning and gaining insight into tweet data, providing additional resources for EDA or enabling feature engineering.

The main functions of this package are:

  • remove_punct: Removes punctuation from a list of tweets

  • extract_ngram: Extracts n-grams from a list of tweets

  • extract_hashtags: Creates a list of hashtags from a list of tweets

  • generate_cloud: Creates a word cloud of the most frequent words from a list of tweets
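To make the idea behind two of these functions concrete, here is a minimal standard-library sketch. It is not the package's implementation: the names `sketch_extract_hashtags` and `sketch_extract_ngram` are hypothetical, the hashtag pattern `#(\w+)` is an assumption, and joining all tweets into one word sequence before building n-grams is inferred from the output shown in the Usage section.

```python
import re


def sketch_extract_hashtags(tweets):
    """Collect hashtag text (without the leading '#') from every tweet."""
    hashtags = []
    for tweet in tweets:
        # Assumption: a hashtag is '#' followed by word characters.
        hashtags.extend(re.findall(r"#(\w+)", tweet))
    return hashtags


def sketch_extract_ngram(tweets, n=2):
    """Return n-grams over the whitespace-separated words of all tweets."""
    # Assumption: tweets are joined into one sequence, so an n-gram
    # may span the boundary between two consecutive tweets.
    words = " ".join(tweets).split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


tweets = ["Make America Great Again! @DonalTrump", "It's a new day in #America"]
hashtags = sketch_extract_hashtags(tweets)
trigrams = sketch_extract_ngram(tweets, n=3)
print(hashtags)
print(len(trigrams), trigrams[0])
```

For real work, prefer the package functions themselves; the sketch only illustrates the kind of transformation each one performs.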

In the Python ecosystem, the only popular package focused on tweet data is tweet-preprocessor. Although that package is also built specifically for Twitter data, its scope is limited to tokenizing and cleaning tweets. In contrast, our package can also be used to extract new features from tweets.

Installation

Install using pip:

$ pip install pytextprep

Install from source:

$ git clone git@github.com:UBC-MDS/pytextprep.git
$ cd pytextprep
$ git checkout main  # latest release
$ pip install .

Usage

Documentation

Please follow the steps below:

Create a new conda environment named pytextprep:

conda create --name pytextprep python=3.9 -y

Activate the conda environment pytextprep:

conda activate pytextprep

Install the package:

pip install pytextprep

If installation fails because of the wordcloud dependency, install wordcloud from conda-forge with the following command, then install pytextprep again:

conda install -c conda-forge wordcloud -y
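If it is unclear whether wordcloud ended up in the active environment, a quick check from Python can help before retrying the install. This is only a sketch using the standard library; `is_installed` is a hypothetical helper name, not part of pytextprep.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found in this environment."""
    return importlib.util.find_spec(module_name) is not None


# Check the dependency that most often causes trouble, and the package itself.
for name in ("wordcloud", "pytextprep"):
    status = "found" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

If wordcloud is reported as missing, run the conda command and then reinstall pytextprep.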

Open Python:

python

You can now use the package functions as:

from pytextprep.extract_ngram import extract_ngram
from pytextprep.extract_hashtags import extract_hashtags
from pytextprep.remove_punct import remove_punct
from pytextprep.generate_cloud import generate_cloud
import matplotlib.pyplot as plt

tweets_list = ["Make America Great Again! @DonalTrump", "It's a new day in #America"]

# Extract all 3-grams from the list of tweets
extract_ngram(tweets_list, n=3)
# Output:
# ['Make America Great', 'America Great Again!', 'Great Again! @DonalTrump', "Again! @DonalTrump It's", "@DonalTrump It's a", "It's a new", 'a new day', 'new day in', 'day in #America']

# List the hashtags found in the tweets
extract_hashtags(tweets_list)
# Output:
# ['America']

# Remove punctuation, keeping the characters passed in skip
remove_punct(tweets_list, skip=["'", "@", "#", '-'])
# Output:
# ['Make America Great Again @DonalTrump', "It's a new day in #America"]

# Build a word cloud of the most frequent words and display it
fig, wc = generate_cloud(tweets_list)
plt.show()
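To illustrate what the skip argument does in the remove_punct call, here is a minimal standard-library sketch. It is not the package's implementation: `sketch_remove_punct` is a hypothetical name, and it assumes that "punctuation" means the characters in `string.punctuation`.

```python
import string


def sketch_remove_punct(tweets, skip=None):
    """Strip punctuation from each tweet, keeping any characters in skip."""
    keep = set(skip or [])
    # Assumption: punctuation is defined by string.punctuation.
    to_drop = "".join(ch for ch in string.punctuation if ch not in keep)
    table = str.maketrans("", "", to_drop)
    return [tweet.translate(table) for tweet in tweets]


tweets = ["Make America Great Again! @DonalTrump", "It's a new day in #America"]
cleaned = sketch_remove_punct(tweets, skip=["'", "@", "#", "-"])
print(cleaned)
```

Keeping "@" and "#" in skip preserves mentions and hashtags, which matters if the cleaned text is later passed to a hashtag extraction step.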


Contributing

Contributors: Arijeet Chatterjee, Joshua Sia, Melisa Maidana, Philson Chan (DSCI_524_GROUP21).

Interested in contributing? Check out the contributing guidelines.

Please note that this project is released with a Code of Conduct. By contributing to this project, you agree to abide by its terms.

License

pytextprep was created by Arijeet Chatterjee, Joshua Sia, Melisa Maidana, Philson Chan (DSCI_524_GROUP21).

It is licensed under the terms of the MIT license.

Credits

pytextprep was created with cookiecutter and the py-pkgs-cookiecutter template.

