tweet-preprocessor

Elegant tweet preprocessing

Project description

Preprocessor

Preprocessor is a preprocessing library for tweet data written in Python. It was written as part of my bachelor's thesis on sentiment analysis. Later I extracted it into a library for broader use.

When building machine learning systems based on tweet data, preprocessing is required. This library makes it easy to clean, parse or tokenize tweets.

Features

Preprocessor currently supports cleaning, tokenizing and parsing of:

URLs
Hashtags
Mentions
Reserved words (RT, FAV)
Emojis
Smileys
Numbers

It can also process tweets stored in JSON and .txt files.

Preprocessor v0.6.0 supports Python 2.7 and 3.5+ on Linux, macOS and Windows. Tests run on the following setups:

Linux Xenial with Python 2.7, 3.5, 3.6, 3.7
macOS 10.14 with Python 3.7.5, 3.8.0
Windows 10.0.17134 with Python 2.7, 3.5.4, 3.6.8

Usage

Basic cleaning:

>>> import preprocessor as p
>>> p.clean('Preprocessor is #awesome 👍 https://github.com/s/preprocessor')
'Preprocessor is'

Tokenizing:

>>> p.tokenize('Preprocessor is #awesome 👍 https://github.com/s/preprocessor')
'Preprocessor is $HASHTAG$ $EMOJI$ $URL$'
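To make the placeholder idea concrete, here is a minimal standard-library sketch of tokenizing by regular-expression substitution. It is not Preprocessor's implementation: the name `tokenize_sketch` and the simplified patterns are assumptions for illustration, and emoji handling is left out.

```python
import re

# Simplified, hypothetical patterns; the library's own patterns are more thorough.
# URLs come first so that a "#fragment" inside a URL is not treated as a hashtag.
PATTERNS = [
    ("$URL$", re.compile(r"https?://\S+")),
    ("$HASHTAG$", re.compile(r"#\w+")),
]


def tokenize_sketch(text):
    """Replace each matched entity with its placeholder token."""
    for token, pattern in PATTERNS:
        text = pattern.sub(token, text)
    return text


print(tokenize_sketch("Preprocessor is #awesome https://github.com/s/preprocessor"))
# Preprocessor is $HASHTAG$ $URL$
```

Placeholders keep the position of each entity visible to a downstream model, whereas cleaning drops the entity entirely.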

Parsing:

>>> parsed_tweet = p.parse('Preprocessor is #awesome https://github.com/s/preprocessor')
>>> parsed_tweet
<preprocessor.parse.ParseResult instance at 0x10f430758>
>>> parsed_tweet.urls
[(25:58) => https://github.com/s/preprocessor]
>>> parsed_tweet.urls[0].start_index
25
>>> parsed_tweet.urls[0].match
'https://github.com/s/preprocessor'
>>> parsed_tweet.urls[0].end_index
58
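The index values shown above are consistent with Python's half-open slicing convention: the start is inclusive and the end is exclusive. The sketch below checks this with the standard library only. The URL pattern is a simplified assumption and not the one Preprocessor uses.

```python
import re

tweet = "Preprocessor is #awesome https://github.com/s/preprocessor"

# Locate the URL with a simplified pattern (an assumption, not the library's own).
match = re.search(r"https?://\S+", tweet)
start_index, end_index = match.span()

print(start_index, end_index)        # 25 58
print(tweet[start_index:end_index])  # https://github.com/s/preprocessor
```

Slicing the original tweet with the two indices returns exactly the matched text, so they can be used to highlight or cut entities in place.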

Fully customizable:

>>> p.set_options(p.OPT.URL, p.OPT.EMOJI)
>>> p.clean('Preprocessor is #awesome 👍 https://github.com/s/preprocessor')
'Preprocessor is #awesome'

By default, Preprocessor applies all of the options. Call set_options to restrict processing to the options you specify.
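One way to picture option selection is a table of patterns from which only the chosen entries are applied. The following is a standard-library sketch under that assumption. `clean_sketch`, its option names and its patterns are hypothetical and do not reflect how Preprocessor is implemented.

```python
import re

# Hypothetical option names and simplified patterns, for illustration only.
PATTERNS = {
    "URL": re.compile(r"https?://\S+"),
    "HASHTAG": re.compile(r"#\w+"),
    "MENTION": re.compile(r"@\w+"),
}


def clean_sketch(text, options=None):
    """Remove entities for the selected options; all options when none are given."""
    selected = list(PATTERNS) if options is None else options
    for name in selected:
        text = PATTERNS[name].sub("", text)
    # Collapse the whitespace left behind by removed entities.
    return " ".join(text.split())


tweet = "@s Preprocessor is #awesome https://github.com/s/preprocessor"
print(clean_sketch(tweet))                   # Preprocessor is
print(clean_sketch(tweet, options=["URL"]))  # @s Preprocessor is #awesome
```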

Processing files:

Preprocessor currently supports processing files in .json and .txt formats. See the examples below for the expected input format.

Example JSON file

[
    "Preprocessor now supports files. https://github.com/s/preprocessor",
    "#preprocessing is a crucial part of @ML projects.",
    "@RT @Twitter raw text data usually has lots of #residue. http://t.co/g00gl"
]

Example Text file

Preprocessor now supports files. https://github.com/s/preprocessor
#preprocessing is a crucial part of @ML projects.
@RT @Twitter raw text data usually has lots of #residue. http://t.co/g00gl

Preprocessing JSON file:

# JSON example
>>> input_file_name = "sample_json.json"
>>> p.clean_file(input_file_name, options=[p.OPT.URL, p.OPT.MENTION])
Saved the cleaned tweets to:/tests/artifacts/24052020_013451892752_vkeCMTwBEMmX_clean_file_sample.json

Preprocessing text file:

# Text file example
>>> input_file_name = "sample_txt.txt"
>>> p.clean_file(input_file_name, options=[p.OPT.URL, p.OPT.MENTION])
Saved the cleaned tweets to:/tests/artifacts/24052020_013451908865_TE9DWX1BjFws_clean_file_sample.txt
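If your tweets are in memory, the two input formats shown earlier can be produced with the standard library. This is a sketch for Python 3 that assumes a JSON file holds a single top-level array of strings and a text file holds one tweet per line, as in the examples. The temporary directory is only there to keep the sketch self-contained.

```python
import json
import os
import tempfile

tweets = [
    "Preprocessor now supports files. https://github.com/s/preprocessor",
    "@RT @Twitter raw text data usually has lots of #residue. http://t.co/g00gl",
]

# A temporary directory keeps the sketch self-contained; any writable path works.
work_dir = tempfile.mkdtemp()
json_path = os.path.join(work_dir, "sample_json.json")
txt_path = os.path.join(work_dir, "sample_txt.txt")

# JSON input: a single top-level array of tweet strings.
with open(json_path, "w", encoding="utf-8") as handle:
    json.dump(tweets, handle, ensure_ascii=False, indent=4)

# Text input: one tweet per line.
with open(txt_path, "w", encoding="utf-8") as handle:
    handle.write("\n".join(tweets) + "\n")

print(json_path)
print(txt_path)
```

The resulting paths can then be passed to clean_file as in the examples above.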

Available Options:

Option Name	Option Short Code
URL	p.OPT.URL
Mention	p.OPT.MENTION
Hashtag	p.OPT.HASHTAG
Reserved Words	p.OPT.RESERVED
Emoji	p.OPT.EMOJI
Smiley	p.OPT.SMILEY
Number	p.OPT.NUMBER

Installation

Using pip:

$ pip install tweet-preprocessor

Manually, from the source directory:

$ python setup.py build
$ python setup.py install

Contributing

Would you like to contribute to Preprocessor? That’s great! Please follow the steps below:

Create a bug report or a feature idea using the templates on the Issues page.
Fork the repository and make your changes.
Open a PR, and make sure it has tests and that all the checks pass.
And that’s all!

Release history

0.6.0 (this version): May 24, 2020
0.5.0: Feb 2, 2016
0.4.0: Jan 31, 2016
0.3.0: Jan 27, 2016
0.2.0: Jan 26, 2016
0.1.2: Jan 24, 2016

Download files

Source Distribution

tweet-preprocessor-0.6.0.tar.gz (14.7 kB), uploaded May 24, 2020 (Source)

Built Distribution

tweet_preprocessor-0.6.0-py3-none-any.whl (27.6 kB), uploaded May 24, 2020 (Python 3)

File details

Details for the file tweet-preprocessor-0.6.0.tar.gz.

File metadata

Download URL: tweet-preprocessor-0.6.0.tar.gz
Upload date: May 24, 2020
Size: 14.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/46.4.0 requests-toolbelt/0.9.1 tqdm/4.42.1 CPython/3.7.6

File hashes

Hashes for tweet-preprocessor-0.6.0.tar.gz
Algorithm	Hash digest
SHA256	`827d20d4c3ab8f8c3a084a56991b061be77bdf1d2e30b6b0d930f7f0e140b961`
MD5	`ce806591317bb74f458bde0d461a464e`
BLAKE2b-256	`087e60d1b535babb9f90e6809ad16484e8d634bc179056da7438fb8887e1524d`

File details

Details for the file tweet_preprocessor-0.6.0-py3-none-any.whl.

File metadata

Download URL: tweet_preprocessor-0.6.0-py3-none-any.whl
Upload date: May 24, 2020
Size: 27.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/46.4.0 requests-toolbelt/0.9.1 tqdm/4.42.1 CPython/3.7.6

File hashes

Hashes for tweet_preprocessor-0.6.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`303ce6f1c788cde01eb279a2cc5035d493a31b5a2fb7f8e2a1679d7e1e3e1fa6`
MD5	`2a59f4a77d298f216341df157cce38e4`
BLAKE2b-256	`179d71bd016a9edcef8860c607e531f30bd09b13103c7951ae73dd2bf174163c`
