Twikenizer

Tokenizer for Twitter comments (tweets)

This repository hosts the code for a tweet tokenizer. Its main purpose is to identify subtle profanity, so it should perform better on data containing hidden profanity (e.g. 'f*ck').

Disclaimer: The following paragraphs may contain profanity.

Description

The Python ecosystem offers several word tokenizers for different purposes: NLTK's word_tokenize, spaCy's tokenizer, scikit-learn's default analyzer and NLTK's TweetTokenizer, among others. All but TweetTokenizer disregard hashtags and mentions, separating the symbols from the rest of the token(s). Although TweetTokenizer accounts for the Twitter dialect, it fails to tokenize subtle, hidden profanity.

For the word f*ck, the resulting tokens are [f, *, ck]. The word g@y is tokenized as [g, @y], yielding a lone token g and a wrongly identified mention @y. And while the hashtag #hash_tag is correctly tokenized as [#hash_tag], regular tokens are not split on underscores: love_twitter is tokenized as ['love_twitter'] instead of ['love', '_', 'twitter'].
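
For reference, this TweetTokenizer behaviour can be reproduced with NLTK (a minimal sketch based on the cases above; the exact output may vary between NLTK versions):

>>> from nltk.tokenize import TweetTokenizer
>>> TweetTokenizer().tokenize('f*ck g@y #hash_tag love_twitter')
['f', '*', 'ck', 'g', '@y', '#hash_tag', 'love_twitter']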

Twikenizer was created to enable proper identification of hidden profane words, handling the cases detailed above. Applying distance-related features, e.g. the Levenshtein distance between tokens and slang words, should yield better results with this tokenizer.
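
As a sketch of that distance-based idea, the snippet below flags tokens whose Levenshtein distance to a small profanity lexicon falls within a threshold. The lexicon, threshold and helper names are illustrative assumptions, not part of Twikenizer:

def levenshtein(a, b):
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

PROFANITY = {'fuck'}  # illustrative lexicon

def looks_profane(token, max_dist=1):
    return any(levenshtein(token.lower(), w) <= max_dist for w in PROFANITY)

print(looks_profane('f*ck'))  # True: one substitution away from 'fuck'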

Installation

Using pip

pip install twikenizer

Clone repository

git clone https://github.com/Guilherme-Routar/Twikenizer.git
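
After cloning, the package can be installed from the local checkout (assuming the repository ships a standard setup.py):

cd Twikenizer
pip install .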

Usage

>>> import twikenizer as twk
>>> tokenizer = twk.Twikenizer()
>>> tweet = 'This is an #hashtag'
>>> tokenizer.tokenize(tweet)
['This', 'is', 'an', '#hashtag']

Twikenizer has a built-in function examplify which demonstrates how it tokenizes different kinds of words/tokens.

>>> tokenizer.examplify()
Generated tweet
###############
Tw33t # @dude_really #hash_tag $hit (g@y) retard#d @dude. 😀😀 !😀abc %😀lol #hateit #hate.it $%&/ f*ck-

Generated tokens
################
['Tw33t', '#', '@dude_really', '#hash_tag', '$hit', '(', 'g', '@', 'y', ')', 'retard#d', '@dude', '.', '😀', '😀', '!', '😀', 'abc', '%', '😀', 'lol', '#hateit', '#hate', '.', 'it', '$', '%', '&', '/', 'f*ck', '-']
