Elegant tweet preprocessing
Project Description
Preprocessor
Preprocessor is a preprocessing library for tweet data written in Python.
When building machine learning systems on tweet data, preprocessing is required. This library makes it easy to clean, parse, or tokenize tweets.
Features
Currently supports cleaning, tokenizing and parsing:
- URLs
- Hashtags
- Mentions
- Reserved words (RT, FAV)
- Emojis
- Smileys
Supports Python 2.7 and 3.3+
Usage
Basic cleaning:
>>> import preprocessor as p
>>> p.clean('Preprocessor is #awesome 👍 https://github.com/s/preprocessor')
'Preprocessor is'
Tokenizing:
>>> p.tokenize('Preprocessor is #awesome 👍 https://github.com/s/preprocessor')
'Preprocessor is $HASHTAG$ $EMOJI$ $URL$'
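To make the idea behind tokenizing concrete, here is a minimal standard-library sketch of placeholder substitution. It is not the library's implementation: the function name `sketch_tokenize` and the three regular expressions are assumptions made for illustration, and they cover only URLs, hashtags and mentions.

```python
import re

# Hypothetical, simplified patterns; the library's own rules are more thorough.
URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#\w+")
MENTION_RE = re.compile(r"@\w+")


def sketch_tokenize(text):
    """Replace each URL, hashtag and mention with a $NAME$ placeholder."""
    # URLs are replaced first so a '#fragment' inside a link is not
    # mistaken for a hashtag.
    text = URL_RE.sub("$URL$", text)
    text = HASHTAG_RE.sub("$HASHTAG$", text)
    text = MENTION_RE.sub("$MENTION$", text)
    return text


print(sketch_tokenize("Preprocessor is #awesome https://github.com/s/preprocessor"))
```

Cleaning can be pictured the same way, with an empty string as the replacement instead of a placeholder.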
Parsing:
>>> parsed_tweet = p.parse('Preprocessor is #awesome https://github.com/s/preprocessor')
>>> parsed_tweet
<preprocessor.parse.ParseResult instance at 0x10f430758>
>>> parsed_tweet.urls
[(25:58) => https://github.com/s/preprocessor]
>>> parsed_tweet.urls[0].start_index
25
>>> parsed_tweet.urls[0].match
'https://github.com/s/preprocessor'
>>> parsed_tweet.urls[0].end_index
58
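The offsets in the parsing example can be reproduced with a short standard-library sketch. The `Match` tuple, the `sketch_find_urls` function and the URL pattern are hypothetical stand-ins, not the library's classes. They only illustrate that `start_index` and `end_index` behave like a half-open character range over the tweet.

```python
import re
from collections import namedtuple

# Hypothetical stand-in for the per-match objects returned by the library.
Match = namedtuple("Match", ["start_index", "end_index", "match"])

# Simplified URL pattern, assumed for this sketch only.
URL_RE = re.compile(r"https?://\S+")


def sketch_find_urls(text):
    """Return one Match per URL with [start, end) character offsets."""
    return [Match(m.start(), m.end(), m.group()) for m in URL_RE.finditer(text)]


tweet = "Preprocessor is #awesome https://github.com/s/preprocessor"
urls = sketch_find_urls(tweet)
first = urls[0]

# Slicing the tweet with the recorded offsets gives back the matched text.
print(first.start_index, first.end_index, tweet[first.start_index:first.end_index])
```

Under these assumptions the first URL starts at offset 25 and ends at offset 58, the same values the parsing example reports.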
Fully customizable:
>>> p.set_options(p.OPT.URL, p.OPT.EMOJI)
>>> p.clean('Preprocessor is #awesome 👍 https://github.com/s/preprocessor')
'Preprocessor is #awesome'
By default, Preprocessor applies every option. Call p.set_options to restrict processing to the options you pass.
Available Options:
Option Name | Option Short Code |
---|---|
URL | p.OPT.URL |
Mention | p.OPT.MENTION |
Hashtag | p.OPT.HASHTAG |
Reserved Words | p.OPT.RESERVED |
Emoji | p.OPT.EMOJI |
Smiley | p.OPT.SMILEY |
Number | p.OPT.NUMBER |
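The "everything by default, a subset on request" behavior can be sketched with the standard library alone. The names below (`PATTERNS`, `sketch_clean_with_options`) and the regular expressions are assumptions for illustration. Unlike `p.set_options`, which is called once before `p.clean`, this sketch takes the options per call to stay self-contained.

```python
import re

# Hypothetical option names and simplified patterns, loosely mirroring
# some rows of the option table. Dict order matters: URLs are handled first.
PATTERNS = {
    "URL": re.compile(r"https?://\S+"),
    "MENTION": re.compile(r"@\w+"),
    "HASHTAG": re.compile(r"#\w+"),
    "NUMBER": re.compile(r"\b\d+\b"),
}


def sketch_clean_with_options(text, *options):
    """Strip the selected entity types; with no options, strip all of them."""
    unknown = set(options) - set(PATTERNS)
    if unknown:
        raise ValueError("unknown options: %s" % ", ".join(sorted(unknown)))
    active = set(options) if options else set(PATTERNS)
    for name, pattern in PATTERNS.items():
        if name in active:
            text = pattern.sub("", text)
    # Collapse the whitespace left behind by the removed pieces.
    return " ".join(text.split())


tweet = "Preprocessor is #awesome https://github.com/s/preprocessor"
print(sketch_clean_with_options(tweet))         # all options active
print(sketch_clean_with_options(tweet, "URL"))  # only URLs removed
```

Passing no options falls back to the full set, which mirrors the default described above. Passing `"URL"` leaves the hashtag in place.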
Installation
Using pip:
$ pip install tweet-preprocessor
Download files
Filename, size | File type | Python version | Upload date |
---|---|---|---|
tweet-preprocessor-0.5.0.tar.gz (6.3 kB) | Source | None | Feb 2, 2016 |