
Elegant tweet preprocessing




Preprocessor is a preprocessing library for tweet data written in Python. It was written as part of my bachelor thesis in sentiment analysis, and I later extracted it into a library for broader use.

Machine Learning systems built on tweet data need a preprocessing step. This library makes it easy to clean, parse or tokenize tweets.


Currently supports cleaning, tokenizing and parsing:

  • URLs
  • Hashtags
  • Mentions
  • Reserved words (RT, FAV)
  • Emojis
  • Smileys
  • JSON and .txt file support

Preprocessor v0.6.0 supports Python 2.7 and 3.5+ on Linux, macOS and Windows. Tests run on the following setups:

Linux Xenial with Python 2.7, 3.5, 3.6, 3.7
macOS 10.14 with Python 3.7.5, 3.8.0
Windows 10.0.17134 with Python 2.7, 3.5.4, 3.6.8


Basic cleaning:

>>> import preprocessor as p
>>> p.clean('Preprocessor is #awesome 👍')
'Preprocessor is'


Tokenizing:

>>> p.tokenize('Preprocessor is #awesome 👍 https://github.com/s/preprocessor')
'Preprocessor is $HASHTAG$ $EMOJI$ $URL$'
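
To illustrate the idea behind tokenizing, here is a minimal standard-library sketch that swaps entities for placeholder tokens. The function name `sketch_tokenize` and the three regular expressions are assumptions made for this example; they are much simpler than whatever patterns the library actually uses, and emojis and smileys are left out.

```python
import re

# Simplified, illustrative patterns (assumptions, not the library's own regexes).
# URLs come first so that a "#fragment" inside a URL is not treated as a hashtag.
TOKEN_PATTERNS = [
    ("$URL$", re.compile(r"https?://\S+")),
    ("$MENTION$", re.compile(r"@\w+")),
    ("$HASHTAG$", re.compile(r"#\w+")),
]


def sketch_tokenize(tweet):
    """Replace each matched entity with its placeholder token."""
    for token, pattern in TOKEN_PATTERNS:
        tweet = pattern.sub(token, tweet)
    return tweet


print(sketch_tokenize("Thanks @alice for #python tips https://example.com/post"))
```

The order of the patterns matters in this sketch: substituting URLs before hashtags keeps URL fragments from being tokenized twice.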


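Parsing:

>>> parsed_tweet = p.parse('Preprocessor is #awesome https://github.com/s/preprocessor')
>>> parsed_tweet
<preprocessor.parse.ParseResult instance at 0x10f430758>
>>> parsed_tweet.urls
[(25:58) => https://github.com/s/preprocessor]
>>> parsed_tweet.urls[0].start_index
25
>>> parsed_tweet.urls[0].match
'https://github.com/s/preprocessor'
>>> parsed_tweet.urls[0].end_index
58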
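
The start index, end index and matched text reported by parsing can be pictured with a small standard-library sketch. The `Span` tuple, the `sketch_find_urls` helper and the URL pattern are hypothetical names and simplifications introduced here for illustration only; they are not the library's classes.

```python
import re
from collections import namedtuple

# Hypothetical stand-in for a parse item: where the match starts, ends, and its text.
Span = namedtuple("Span", ["start_index", "end_index", "match"])

URL_PATTERN = re.compile(r"https?://\S+")  # simplified assumption


def sketch_find_urls(tweet):
    """Return one Span per URL found in the tweet, in order of appearance."""
    return [Span(m.start(), m.end(), m.group()) for m in URL_PATTERN.finditer(tweet)]


tweet = "Docs moved to https://example.com/docs today"
urls = sketch_find_urls(tweet)
first = urls[0]
print(first.start_index, first.end_index, first.match)
```

In this sketch, slicing the tweet with `tweet[first.start_index:first.end_index]` gives back `first.match`, which is how the indices are meant to be read.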

Fully customizable:

>>> p.set_options(p.OPT.URL, p.OPT.EMOJI)
>>> p.clean('Preprocessor is #awesome 👍')
'Preprocessor is #awesome'

By default, Preprocessor applies all of the options. To restrict processing to a subset, pass the options you want to set_options.
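
The effect of selecting options can be sketched with the standard library alone. Everything below is an assumption made for illustration: the `PATTERNS` table, the `sketch_clean` function and its `options` argument are hypothetical, and the sketch takes the options per call for simplicity instead of through a separate configuration call.

```python
import re

# Simplified, illustrative patterns keyed by option name (assumptions).
PATTERNS = {
    "URL": re.compile(r"https?://\S+"),
    "MENTION": re.compile(r"@\w+"),
    "HASHTAG": re.compile(r"#\w+"),
}


def sketch_clean(tweet, options=None):
    """Remove only the selected entity types; None means all of them."""
    selected = list(PATTERNS) if options is None else options
    for name in selected:
        tweet = PATTERNS[name].sub("", tweet)
    # Collapse the whitespace left behind by removed entities.
    return " ".join(tweet.split())


text = "Ask @bob about #nlp at https://example.com/faq"
print(sketch_clean(text))                   # every entity type removed
print(sketch_clean(text, options=["URL"]))  # only URLs removed
```

With no options the sketch strips every entity type it knows about, and with `options=["URL"]` the mention and hashtag survive.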

Processing files:

Preprocessor currently supports processing .json and .txt files. See the examples below for the expected input format.

Example JSON file

[
    "Preprocessor now supports files.",
    "#preprocessing is a crucial part of @ML projects.",
    "@RT @Twitter raw text data usually has lots of #residue."
]

Example Text file

Preprocessor now supports files.
#preprocessing is a crucial part of @ML projects.
@RT @Twitter raw text data usually has lots of #residue.

Preprocessing JSON file:

# JSON example
>>> input_file_name = "sample_json.json"
>>> p.clean_file(input_file_name, options=[p.OPT.URL, p.OPT.MENTION])
Saved the cleaned tweets to:/tests/artifacts/24052020_013451892752_vkeCMTwBEMmX_clean_file_sample.json

Preprocessing text file:

# Text file example
>>> input_file_name = "sample_txt.txt"
>>> p.clean_file(input_file_name, options=[p.OPT.URL, p.OPT.MENTION])
Saved the cleaned tweets to:/tests/artifacts/24052020_013451908865_TE9DWX1BjFws_clean_file_sample.txt
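
For readers curious what file cleaning involves, the following standard-library sketch reads either format, cleans each tweet and writes the result next to the input. It is only an approximation under stated assumptions: `sketch_clean_file`, the mention-only pattern and the `_clean` output suffix are hypothetical and do not mirror the library's naming scheme or its cleaning rules.

```python
import json
import os
import re
import tempfile

MENTION = re.compile(r"@\w+")  # simplified assumption: only mentions are removed


def sketch_clean_lines(lines):
    """Strip mentions from each tweet and tidy the remaining whitespace."""
    return [" ".join(MENTION.sub("", line).split()) for line in lines]


def sketch_clean_file(path):
    """Clean a .json file (list of strings) or a .txt file (one tweet per line)."""
    root, ext = os.path.splitext(path)
    out_path = root + "_clean" + ext  # hypothetical naming scheme
    with open(path, encoding="utf-8") as src:
        tweets = json.load(src) if ext == ".json" else src.read().splitlines()
    cleaned = sketch_clean_lines(tweets)
    with open(out_path, "w", encoding="utf-8") as dst:
        if ext == ".json":
            json.dump(cleaned, dst, ensure_ascii=False, indent=2)
        else:
            dst.write("\n".join(cleaned) + "\n")
    return out_path


# Demo on a temporary JSON file so nothing is left behind on disk.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample_json.json")
    with open(sample, "w", encoding="utf-8") as f:
        json.dump(["Hello @world", "No mentions here."], f)
    result_path = sketch_clean_file(sample)
    with open(result_path, encoding="utf-8") as f:
        result = json.load(f)

print(result)
```

The same function handles a .txt input by treating every line as one tweet, matching the two input formats shown above.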

Available Options:

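Option Name       Option Short Code
URL               p.OPT.URL
Mention           p.OPT.MENTION
Hashtag           p.OPT.HASHTAG
Reserved Words    p.OPT.RESERVED
Emoji             p.OPT.EMOJI
Smiley            p.OPT.SMILEY
Number            p.OPT.NUMBER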


Install using pip:

$ pip install tweet-preprocessor

Or install manually from a source checkout:

$ python setup.py build
$ python setup.py install


Are you willing to contribute to preprocessor? That’s great! Please follow the steps below to contribute to this project:

  1. Create a bug report or a feature idea using the templates on the Issues page.
  2. Fork the repository and make your changes.
  3. Open a PR and make sure your PR has tests and all the checks pass.
  4. And that’s all!
