Elegant tweet preprocessing

Preprocessor

Preprocessor is a preprocessing library for tweet data written in Python.

Machine learning systems built on tweet data need the tweets preprocessed first. This library makes it easy to clean, parse, or tokenize tweets.

Features

Currently supports cleaning, tokenizing and parsing:

  • URLs

  • Hashtags

  • Mentions

  • Reserved words (RT, FAV)

  • Emojis

  • Smileys

Supports Python 2.7 and 3.3+

Usage

Basic cleaning:

>>> import preprocessor as p
>>> p.clean('Preprocessor is #awesome 👍 https://github.com/s/preprocessor')
'Preprocessor is'

Tokenizing:

>>> p.tokenize('Preprocessor is #awesome 👍 https://github.com/s/preprocessor')
'Preprocessor is $HASHTAG$ $EMOJI$ $URL$'
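
The placeholder substitution above can be approximated with a few regular expressions. The following is a minimal sketch using only the standard library. It is not Preprocessor's implementation: the `sketch_tokenize` function and its patterns are assumptions made for illustration, they cover only URLs, mentions, and hashtags, and the `$MENTION$` placeholder is assumed by analogy with the tokens shown above.

```python
import re

# Hypothetical, simplified patterns; the library's real patterns are more thorough.
# URLs come first so that a '#fragment' inside a URL is not treated as a hashtag.
SKETCH_PATTERNS = [
    ("$URL$", re.compile(r"https?://\S+")),
    ("$MENTION$", re.compile(r"@\w+")),  # placeholder name is an assumption
    ("$HASHTAG$", re.compile(r"#\w+")),
]


def sketch_tokenize(tweet):
    """Replace each matched entity with its placeholder token."""
    for token, pattern in SKETCH_PATTERNS:
        tweet = pattern.sub(token, tweet)
    return tweet


print(sketch_tokenize("Preprocessor is #awesome https://github.com/s/preprocessor"))
# Preprocessor is $HASHTAG$ $URL$
```

Keeping placeholders instead of deleting entities preserves the information that a tweet contained a link or a hashtag, which can be a useful feature for a downstream model.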

Parsing:

>>> parsed_tweet = p.parse('Preprocessor is #awesome https://github.com/s/preprocessor')
>>> parsed_tweet
<preprocessor.parse.ParseResult instance at 0x10f430758>
>>> parsed_tweet.urls
[(25:58) => https://github.com/s/preprocessor]
>>> parsed_tweet.urls[0].start_index
25
>>> parsed_tweet.urls[0].match
'https://github.com/s/preprocessor'
>>> parsed_tweet.urls[0].end_index
58
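
In the output above, the indices behave like Python slice bounds: 58 - 25 = 33, which is the length of the URL, so `end_index` is exclusive in this example. The sketch below reproduces the same offsets with `re.finditer` from the standard library. `SketchMatch`, `sketch_find_urls`, and the URL pattern are hypothetical names chosen for illustration, not part of Preprocessor.

```python
import re
from collections import namedtuple

# Hypothetical stand-in for the objects in parsed_tweet.urls.
SketchMatch = namedtuple("SketchMatch", ["start_index", "end_index", "match"])

URL_PATTERN = re.compile(r"https?://\S+")  # simplified, for illustration only


def sketch_find_urls(tweet):
    """Return one SketchMatch per URL, with slice-style offsets."""
    return [
        SketchMatch(m.start(), m.end(), m.group())
        for m in URL_PATTERN.finditer(tweet)
    ]


tweet = "Preprocessor is #awesome https://github.com/s/preprocessor"
url = sketch_find_urls(tweet)[0]
print(url.start_index, url.end_index)  # 25 58
# The offsets slice the original string back to the matched text.
assert tweet[url.start_index:url.end_index] == url.match
```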

Fully customizable:

>>> p.set_options(p.OPT.URL, p.OPT.EMOJI)
>>> p.clean('Preprocessor is #awesome 👍 https://github.com/s/preprocessor')
'Preprocessor is #awesome'

By default, Preprocessor applies every option. Call p.set_options with the options you want in order to restrict processing to those.
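
The option mechanism can be pictured as a table of patterns from which only the selected entries are applied. The following is a rough standard-library sketch of that idea, assuming Python 3. `sketch_clean`, the option-name strings, and the patterns (including the very approximate emoji range) are hypothetical and do not reflect how Preprocessor is implemented.

```python
import re

# Hypothetical, simplified patterns keyed by option name; not the library's own.
SKETCH_PATTERNS = {
    "URL": re.compile(r"https?://\S+"),
    "MENTION": re.compile(r"@\w+"),
    "HASHTAG": re.compile(r"#\w+"),
    # Rough emoji coverage only: common pictograph and symbol blocks.
    "EMOJI": re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]"),
}


def sketch_clean(tweet, options=None):
    """Remove entities for the given options; None means all options."""
    active = SKETCH_PATTERNS.keys() if options is None else options
    for name, pattern in SKETCH_PATTERNS.items():
        if name in active:
            tweet = pattern.sub("", tweet)
    # Collapse the whitespace left behind by removed entities.
    return " ".join(tweet.split())


tweet = "Preprocessor is #awesome 👍 https://github.com/s/preprocessor"
print(sketch_clean(tweet))                    # Preprocessor is
print(sketch_clean(tweet, {"URL", "EMOJI"}))  # Preprocessor is #awesome
```

Unlike p.set_options, which changes the options for later calls, this sketch takes the selection as an argument on each call, which keeps the function free of shared state.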

Available Options:

Option Name       Option Short Code
URL               p.OPT.URL
Mention           p.OPT.MENTION
Hashtag           p.OPT.HASHTAG
Reserved Words    p.OPT.RESERVED
Emoji             p.OPT.EMOJI
Smiley            p.OPT.SMILEY
Number            p.OPT.NUMBER

Installation

Using pip:

$ pip install tweet-preprocessor
