Elegant tweet preprocessing
Project description
Preprocessor
Preprocessor is a preprocessing library for tweet data written in Python.
Machine learning systems built on tweet data need a preprocessing step. This library makes it easy to clean, parse, or tokenize tweets.
Features
Currently supports cleaning, tokenizing and parsing:
URLs
Hashtags
Mentions
Reserved words (RT, FAV)
Emojis
Smileys
Supports Python 2.7 and 3.3+
Usage
Basic cleaning:
>>> import preprocessor as p
>>> p.clean('Preprocessor is #awesome 👍 https://github.com/s/preprocessor')
'Preprocessor is'
Tokenizing:
>>> p.tokenize('Preprocessor is #awesome 👍 https://github.com/s/preprocessor')
'Preprocessor is $HASHTAG$ $EMOJI$ $URL$'
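The difference between cleaning and tokenizing can be sketched with plain regular expressions. This is a rough illustration, not the library's implementation: the patterns are simplified assumptions, the names `sketch_tokenize` and `sketch_clean` are hypothetical, and emojis, smileys, and reserved words are left out.

```python
import re

# Simplified, illustrative patterns (assumptions, not the library's own regexes).
# URLs come first so that a '#' or '@' inside a URL is not treated as a hashtag or mention.
PATTERNS = {
    "URL": re.compile(r"https?://\S+"),
    "HASHTAG": re.compile(r"#\w+"),
    "MENTION": re.compile(r"@\w+"),
}


def sketch_tokenize(text):
    """Replace every match with a $NAME$ placeholder."""
    for name, pattern in PATTERNS.items():
        text = pattern.sub("$" + name + "$", text)
    return text


def sketch_clean(text):
    """Delete every match, then collapse the whitespace left behind."""
    for pattern in PATTERNS.values():
        text = pattern.sub("", text)
    return " ".join(text.split())


tweet = "Preprocessor is #awesome https://github.com/s/preprocessor"
print(sketch_tokenize(tweet))  # Preprocessor is $HASHTAG$ $URL$
print(sketch_clean(tweet))     # Preprocessor is
```

Tokenizing keeps a marker where each entity was, which can be useful as a feature for a model; cleaning drops the entity entirely.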
Parsing:
>>> parsed_tweet = p.parse('Preprocessor is #awesome https://github.com/s/preprocessor')
>>> parsed_tweet
<preprocessor.parse.ParseResult instance at 0x10f430758>
>>> parsed_tweet.urls
[(25:58) => https://github.com/s/preprocessor]
>>> parsed_tweet.urls[0].start_index
25
>>> parsed_tweet.urls[0].match
'https://github.com/s/preprocessor'
>>> parsed_tweet.urls[0].end_index
58
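The start and end indices reported above are ordinary string offsets, so a match can be recovered by slicing the original tweet. The following sketch reproduces the idea with the standard `re` module. It assumes a simplified URL pattern, and `SketchMatch` and `sketch_parse_urls` are hypothetical names rather than part of the library.

```python
import re
from collections import namedtuple

# Hypothetical stand-in for a parsed match; the field names mirror the
# attributes shown in the session above.
SketchMatch = namedtuple("SketchMatch", ["start_index", "end_index", "match"])

# Simplified, illustrative URL pattern (an assumption, not the library's regex).
URL_PATTERN = re.compile(r"https?://\S+")


def sketch_parse_urls(text):
    """Return every URL in the text together with its start and end offsets."""
    return [
        SketchMatch(m.start(), m.end(), m.group())
        for m in URL_PATTERN.finditer(text)
    ]


tweet = "Preprocessor is #awesome https://github.com/s/preprocessor"
urls = sketch_parse_urls(tweet)
first = urls[0]
print(first.start_index, first.end_index)                         # 25 58
print(tweet[first.start_index:first.end_index] == first.match)    # True
```

The end index is exclusive, as in Python slicing, which is why the slice returns the full URL.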
Fully customizable:
>>> p.set_options(p.OPT.URL, p.OPT.EMOJI)
>>> p.clean('Preprocessor is #awesome 👍 https://github.com/s/preprocessor')
'Preprocessor is #awesome'
By default, Preprocessor applies every available option. Call set_options to restrict processing to the options you specify.
Available Options:
| Option Name | Option Short Code |
|---|---|
| URL | p.OPT.URL |
| Mention | |
| Hashtag | |
| Reserved Words | |
| Emoji | p.OPT.EMOJI |
| Smiley | |
| Number | |
Installation
Using pip:
$ pip install tweet-preprocessor
Download files
Source Distribution
File details
Details for the file tweet-preprocessor-0.5.0.tar.gz.
File metadata
- Download URL: tweet-preprocessor-0.5.0.tar.gz
- Upload date:
- Size: 6.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 994b6ff025d01a6656d2ec9ab55ba93a706147fd8bda639cde5812c126468314 |
| MD5 | 6de570130c7146abc327cefe7f3eddb6 |
| BLAKE2b-256 | 2af8810ec35c31cca89bc4f1a02c14b042b9ec6c19dd21f7ef1876874ef069a6 |
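A downloaded archive can be checked against the SHA256 digest listed above using only the standard library. This is a minimal sketch: `sha256_of_file` and `matches_published_digest` are hypothetical helper names, and the file path passed in is assumed to point at a local copy of the source distribution.

```python
import hashlib

# SHA256 digest published for tweet-preprocessor-0.5.0.tar.gz (from the table above).
EXPECTED_SHA256 = "994b6ff025d01a6656d2ec9ab55ba93a706147fd8bda639cde5812c126468314"


def sha256_of_file(path, chunk_size=65536):
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path):
    """Return True when the file's SHA256 equals the published digest."""
    return sha256_of_file(path) == EXPECTED_SHA256
```

For example, `matches_published_digest("tweet-preprocessor-0.5.0.tar.gz")` would be expected to return True for an intact download saved in the current directory.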