Skip to main content
Join the official 2019 Python Developers SurveyStart the survey!

Elegant tweet preprocessing

Project description

Preprocessor

https://travis-ci.org/s/preprocessor.svg?branch=master

Preprocessor is a preprocessing library for tweet data written in Python.

When building Machine Learning systems based on tweet data, a preprocessing is required. This library makes it easy to clean, parse or tokenize the tweets.

Features

Currently supports cleaning, tokenizing and parsing:

  • URLs
  • Hashtags
  • Mentions
  • Reserved words (RT, FAV)
  • Emojis
  • Smileys

Supports Python 2.7 and 3.3+

Usage

Basic cleaning:

>>> import preprocessor as p
>>> p.clean('Preprocessor is #awesome 👍 https://github.com/s/preprocessor')
'Preprocessor is'

Tokenizing:

>>> p.tokenize('Preprocessor is #awesome 👍 https://github.com/s/preprocessor')
'Preprocessor is $HASHTAG$ $EMOJI$ $URL$'

Parsing:

>>> parsed_tweet = p.parse('Preprocessor is #awesome https://github.com/s/preprocessor')
<preprocessor.parse.ParseResult instance at 0x10f430758>
>>> parsed_tweet.urls
[(25:58) => https://github.com/s/preprocessor]
>>> parsed_tweet.urls[0].start_index
25
>>> parsed_tweet.urls[0].match
'https://github.com/s/preprocessor'
>>> parsed_tweet.urls[0].end_index
58

Fully customizable:

>>> p.set_options(p.OPT.URL, p.OPT.EMOJI)
>>> p.clean('Preprocessor is #awesome 👍 https://github.com/s/preprocessor')
'Preprocessor is #awesome'

Preprocessor will go through all of the options by default unless you specify some options.

Available Options:

Option Name Option Short Code
URL p.OPT.URL
Mention p.OPT.MENTION
Hashtag p.OPT.HASHTAG
Reserved Words p.OPT.RESERVED
Emoji p.OPT.EMOJI
Smiley p.OPT.SMILEY
Number p.OPT.NUMBER

Installation

using pip:

$ pip install tweet-preprocessor

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Files for tweet-preprocessor, version 0.5.0
Filename, size File type Python version Upload date Hashes
Filename, size tweet-preprocessor-0.5.0.tar.gz (6.3 kB) File type Source Python version None Upload date Hashes View hashes

Supported by

Elastic Elastic Search Pingdom Pingdom Monitoring Google Google BigQuery Sentry Sentry Error logging AWS AWS Cloud computing DataDog DataDog Monitoring Fastly Fastly CDN SignalFx SignalFx Supporter DigiCert DigiCert EV certificate StatusPage StatusPage Status page