Elegant tweet preprocessing

Preprocessor

Preprocessor is a Python library for preprocessing tweet data. I originally wrote it as part of my bachelor thesis on sentiment analysis and later extracted it into a library for broader use.

When building machine learning systems based on tweet data, preprocessing is required. This library makes it easy to clean, parse or tokenize tweets.

Features

Currently supports cleaning, tokenizing and parsing:

  • URLs

  • Hashtags

  • Mentions

  • Reserved words (RT, FAV)

  • Emojis

  • Smileys

  • JSON and .txt file support

Preprocessor v0.6.0 supports Python 2.7 and 3.5+ on Linux, macOS and Windows. Tests run on the following setups:

Linux Xenial with Python 2.7, 3.5, 3.6, 3.7
macOS 10.14 with Python 3.7.5, 3.8.0
Windows 10.0.17134 with Python 2.7, 3.5.4, 3.6.8

Usage

Basic cleaning:

>>> import preprocessor as p
>>> p.clean('Preprocessor is #awesome 👍 https://github.com/s/preprocessor')
'Preprocessor is'

Tokenizing:

>>> p.tokenize('Preprocessor is #awesome 👍 https://github.com/s/preprocessor')
'Preprocessor is $HASHTAG$ $EMOJI$ $URL$'

Parsing:

>>> parsed_tweet = p.parse('Preprocessor is #awesome https://github.com/s/preprocessor')
>>> parsed_tweet
<preprocessor.parse.ParseResult instance at 0x10f430758>
>>> parsed_tweet.urls
[(25:58) => https://github.com/s/preprocessor]
>>> parsed_tweet.urls[0].start_index
25
>>> parsed_tweet.urls[0].match
'https://github.com/s/preprocessor'
>>> parsed_tweet.urls[0].end_index
58
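
The indices in the output above behave like Python slice bounds: end_index minus start_index equals the length of the match. The following is a minimal sketch of that idea using only the standard library. It does not show how Preprocessor itself is implemented; URL_PATTERN and find_urls are hypothetical names introduced for illustration, and the pattern is deliberately simplistic.

```python
import re

# Hypothetical, simplified URL pattern for illustration only;
# it is an assumption, not the pattern the library uses.
URL_PATTERN = re.compile(r"https?://\S+")


def find_urls(text):
    """Return a (start_index, end_index, match) tuple for each URL-like token."""
    return [(m.start(), m.end(), m.group()) for m in URL_PATTERN.finditer(text)]


tweet = "Preprocessor is #awesome https://github.com/s/preprocessor"
urls = find_urls(tweet)

# The end index is exclusive, so slicing the tweet recovers the match.
start, end, match = urls[0]
recovered = tweet[start:end]
```

Here the single result spans 25 to 58, matching the indices in the parsing example.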

Fully customizable:

>>> p.set_options(p.OPT.URL, p.OPT.EMOJI)
>>> p.clean('Preprocessor is #awesome 👍 https://github.com/s/preprocessor')
'Preprocessor is #awesome'

By default, Preprocessor applies all of the options. Call set_options, as shown above, to restrict it to the ones you need.
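
To make the "all options unless some are specified" behaviour concrete, here is a rough standalone sketch of the idea. It is not the library's code: PATTERNS and sketch_clean are hypothetical names, the regular expressions are simplified assumptions, and only three of the option types are modelled.

```python
import re

# Hypothetical patterns for illustration; these are assumptions and are
# much simpler than what a real tweet preprocessor needs.
PATTERNS = {
    "URL": re.compile(r"https?://\S+"),
    "HASHTAG": re.compile(r"#\w+"),
    "MENTION": re.compile(r"@\w+"),
}


def sketch_clean(text, options=None):
    """Remove matches for the selected options; use every option when none are given."""
    selected = options if options else list(PATTERNS)
    for name in selected:
        text = PATTERNS[name].sub("", text)
    # Collapse the whitespace left behind by removed tokens.
    return " ".join(text.split())


tweet = "Preprocessor is #awesome https://github.com/s/preprocessor"
only_urls = sketch_clean(tweet, ["URL"])  # hashtag is kept
everything = sketch_clean(tweet)          # no options given: apply all
```

With only the URL option selected the hashtag survives, while the default call strips both, mirroring the difference between the customized and basic cleaning examples.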

Processing files:

Preprocessor currently supports processing .json and .txt files. See the examples below for the correct input format.

Example JSON file

[
    "Preprocessor now supports files. https://github.com/s/preprocessor",
    "#preprocessing is a crucial part of @ML projects.",
    "@RT @Twitter raw text data usually has lots of #residue. http://t.co/g00gl"
]

Example Text file

Preprocessor now supports files. https://github.com/s/preprocessor
#preprocessing is a crucial part of @ML projects.
@RT @Twitter raw text data usually has lots of #residue. http://t.co/g00gl
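
If your tweets are already in a Python list, input files in the two shapes above can be produced with the standard library alone. This is a minimal sketch under a few assumptions: the temporary directory and the variable names are hypothetical, and the file names simply reuse the ones from the examples in this section.

```python
import json
import os
import tempfile

tweets = [
    "Preprocessor now supports files. https://github.com/s/preprocessor",
    "@RT @Twitter raw text data usually has lots of #residue. http://t.co/g00gl",
]

# A temporary directory keeps the sketch from touching real files.
work_dir = tempfile.mkdtemp()
json_path = os.path.join(work_dir, "sample_json.json")
txt_path = os.path.join(work_dir, "sample_txt.txt")

# JSON input: a single array of tweet strings.
with open(json_path, "w", encoding="utf-8") as f:
    json.dump(tweets, f, indent=4)

# Text input: one tweet per line.
with open(txt_path, "w", encoding="utf-8") as f:
    f.write("\n".join(tweets) + "\n")

# Read both files back to confirm they round-trip.
with open(json_path, encoding="utf-8") as f:
    loaded = json.load(f)
with open(txt_path, encoding="utf-8") as f:
    lines = f.read().splitlines()
```

Either path can then be passed to clean_file as in the examples that follow.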

Preprocessing JSON file:

# JSON example
>>> input_file_name = "sample_json.json"
>>> p.clean_file(input_file_name, options=[p.OPT.URL, p.OPT.MENTION])
Saved the cleaned tweets to:/tests/artifacts/24052020_013451892752_vkeCMTwBEMmX_clean_file_sample.json

Preprocessing text file:

# Text file example
>>> input_file_name = "sample_txt.txt"
>>> p.clean_file(input_file_name, options=[p.OPT.URL, p.OPT.MENTION])
Saved the cleaned tweets to:/tests/artifacts/24052020_013451908865_TE9DWX1BjFws_clean_file_sample.txt

Available Options:

Option Name       Option Short Code
URL               p.OPT.URL
Mention           p.OPT.MENTION
Hashtag           p.OPT.HASHTAG
Reserved Words    p.OPT.RESERVED
Emoji             p.OPT.EMOJI
Smiley            p.OPT.SMILEY
Number            p.OPT.NUMBER

Installation

Using pip:

$ pip install tweet-preprocessor

Manual installation:

$ python setup.py build
$ python setup.py install

Contributing

Would you like to contribute to Preprocessor? That’s great! Please follow the steps below:

  1. Create a bug report or a feature idea using the templates on the Issues page.

  2. Fork the repository and make your changes.

  3. Open a PR and make sure your PR has tests and all the checks pass.

  4. And that’s all!
