Performs pre-processing of tweets
pytextprep
pytextprep is a Python package that offers text preprocessing functionality designed specifically for tweets. It bundles functions for cleaning tweet data and gaining insight into it, which supports exploratory data analysis (EDA) and feature engineering.
The main functions of this package are:
- remove_punct: Removes punctuation from a list of tweets
- extract_ngram: Extracts n-grams from a list of tweets
- extract_hashtags: Creates a list of hashtags from a list of tweets
- generate_cloud: Creates a word cloud of the most frequent words from a list of tweets
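To give a feel for what n-gram and hashtag extraction involve, here is a minimal standard-library sketch. The names sketch_extract_ngram and sketch_extract_hashtags are hypothetical and are not part of pytextprep; the logic is an assumption about one simple way to do this (whitespace tokenization over the joined tweets, and a regular expression for hashtags), not the package's actual implementation.

```python
import re


def sketch_extract_ngram(tweets, n=2):
    """Hypothetical helper: slide a window of n whitespace-separated tokens.

    Assumption: all tweets are joined into one token stream, so an n-gram
    may span the boundary between two tweets.
    """
    tokens = " ".join(tweets).split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def sketch_extract_hashtags(tweets):
    """Hypothetical helper: collect hashtag words without the leading '#'."""
    hashtags = []
    for tweet in tweets:
        # \w+ matches letters, digits, and underscores after the '#'
        hashtags.extend(re.findall(r"#(\w+)", tweet))
    return hashtags


tweets = ["Make America Great Again! @DonalTrump", "It's a new day in #America"]
print(sketch_extract_ngram(tweets, n=3))
print(sketch_extract_hashtags(tweets))
```

Joining the tweets before tokenizing keeps the sketch short, but it also means n-grams can cross tweet boundaries; tokenizing each tweet separately would avoid that.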
In the Python ecosystem, the only popular package focused on tweet data is tweet-preprocessor. Although that package is also built specifically for Twitter data, its scope is limited to tokenizing and cleaning tweets. In contrast, our package can also be used to extract new features from tweets.
Installation
Install using pip:
$ pip install pytextprep
Install from source:
$ git clone git@github.com:UBC-MDS/pytextprep.git
$ cd pytextprep
$ git checkout main  # latest release
$ pip install .
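After either install route, the installed version can be checked from Python. The sketch below uses importlib.metadata from the standard library (Python 3.8+); the helper name installed_version is hypothetical and only for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Prints a version string if pytextprep is installed, otherwise None.
print(installed_version("pytextprep"))
```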
Usage
Please follow the steps below:
Create a new conda environment named pytextprep:
conda create --name pytextprep python=3.9 -y
Activate the conda environment pytextprep:
conda activate pytextprep
Install the package:
pip install pytextprep
If the package fails to install because of the wordcloud package, install wordcloud with the following command and then install pytextprep again:
conda install -c conda-forge wordcloud -y
Open Python:
python
You can now use the package functions as follows:
>>> from pytextprep.extract_ngram import extract_ngram
>>> from pytextprep.extract_hashtags import extract_hashtags
>>> from pytextprep.remove_punct import remove_punct
>>> from pytextprep.generate_cloud import generate_cloud
>>> import matplotlib.pyplot as plt
>>> tweets_list = ["Make America Great Again! @DonalTrump", "It's a new day in #America"]
>>> extract_ngram(tweets_list, n=3)
['Make America Great', 'America Great Again!', 'Great Again! @DonalTrump', "Again! @DonalTrump It's", "@DonalTrump It's a", "It's a new", 'a new day', 'new day in', 'day in #America']
>>> extract_hashtags(tweets_list)
['America']
>>> remove_punct(tweets_list, skip=["'", "@", "#", '-'])
['Make America Great Again @DonalTrump', "It's a new day in #America"]
>>> fig, wc = generate_cloud(tweets_list)
>>> plt.show()
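The effect of the skip argument, and the kind of word counts a word cloud is usually built from, can be illustrated without the package. The following is a standard-library sketch under stated assumptions: sketch_remove_punct and word_frequencies are hypothetical names, the punctuation set is assumed to be string.punctuation, and none of this is claimed to match how remove_punct or generate_cloud work internally.

```python
import string
from collections import Counter


def sketch_remove_punct(tweets, skip=None):
    """Hypothetical helper: drop ASCII punctuation except characters in skip."""
    keep = set(skip or [])
    drop = {ch for ch in string.punctuation if ch not in keep}
    return ["".join(ch for ch in tweet if ch not in drop) for tweet in tweets]


def word_frequencies(tweets):
    """Hypothetical helper: count lowercased whitespace-separated tokens."""
    return Counter(word.lower() for tweet in tweets for word in tweet.split())


tweets_list = ["Make America Great Again! @DonalTrump", "It's a new day in #America"]

# Keep apostrophes, mentions, hashtags, and hyphens; remove everything else.
cleaned = sketch_remove_punct(tweets_list, skip=["'", "@", "#", "-"])
print(cleaned)

# Frequencies like these are what a word cloud typically scales words by.
frequencies = word_frequencies(cleaned)
print(frequencies.most_common(3))
```

Keeping "@" and "#" in skip preserves mentions and hashtags as distinct tokens, so "america" and "#america" are counted separately in this sketch.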
Contributing
Contributors: Arijeet Chatterjee, Joshua Sia, Melisa Maidana, Philson Chan (DSCI_524_GROUP21).
Interested in contributing? Check out the contributing guidelines.
Please note that this project is released with a Code of Conduct. By contributing to this project, you agree to abide by its terms.
License
pytextprep was created by Arijeet Chatterjee, Joshua Sia, Melisa Maidana, and Philson Chan (DSCI_524_GROUP21).
It is licensed under the terms of the MIT license.
Credits
pytextprep was created with cookiecutter and the py-pkgs-cookiecutter template.
Download files
File details
Details for the file pytextprep-1.0.7.tar.gz.
File metadata
- Download URL: pytextprep-1.0.7.tar.gz
- Upload date:
- Size: 6.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.8.0 pkginfo/1.8.2 readme-renderer/32.0 requests/2.27.1 requests-toolbelt/0.9.1 urllib3/1.26.8 tqdm/4.62.3 importlib-metadata/4.10.1 keyring/23.5.0 rfc3986/2.0.0 colorama/0.4.4 CPython/3.9.10
File hashes
Algorithm | Hash digest
---|---
SHA256 | d57f0e0ae2dce815a8c3bd4e8aba933d7519aa9c00835c2a7981cf8332b8d368
MD5 | 6ac935cc8962c45a9ef84ba0f44a60c4
BLAKE2b-256 | c5835f5b03f19839a0892f1824112541a30e14b1a0a1675ad98beed12fe198d3
File details
Details for the file pytextprep-1.0.7-py3-none-any.whl.
File metadata
- Download URL: pytextprep-1.0.7-py3-none-any.whl
- Upload date:
- Size: 6.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.8.0 pkginfo/1.8.2 readme-renderer/32.0 requests/2.27.1 requests-toolbelt/0.9.1 urllib3/1.26.8 tqdm/4.62.3 importlib-metadata/4.10.1 keyring/23.5.0 rfc3986/2.0.0 colorama/0.4.4 CPython/3.9.10
File hashes
Algorithm | Hash digest
---|---
SHA256 | 35c007d0ce10383d9ee67bf0fd3b4e405a4d258ed01f8d8e7319667b9992cb2c
MD5 | 2ffeb1110306e5eab0f2b0f2d43ec864
BLAKE2b-256 | 96c545ee1956df18a9437f2c5935a9e3acc0b52347d707efbf3e864633abb3aa
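A downloaded file can be checked against the digests listed above. The sketch below compares the SHA256 digest of the source distribution using hashlib; the helper name sha256_of_file is hypothetical, and the archive location (the current working directory) is an assumption.

```python
import hashlib
from pathlib import Path

# SHA256 digest listed above for pytextprep-1.0.7.tar.gz.
EXPECTED_SHA256 = "d57f0e0ae2dce815a8c3bd4e8aba933d7519aa9c00835c2a7981cf8332b8d368"


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Assumption: the archive was downloaded into the current directory.
archive = Path("pytextprep-1.0.7.tar.gz")
if archive.exists():
    matches = sha256_of_file(archive) == EXPECTED_SHA256
    print("digest matches" if matches else "digest MISMATCH")
else:
    print(f"{archive} not found; download it first")
```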