Tokenizer for Twitter comments (tweets)
Project description
Twikenizer
This repository hosts the code for a tokenizer of tweets. It's main purpose is to identify subtle profanity, so it should obtain better performance on data containing hidden profanity (e.g. 'f*ck').
Disclaimer: The following paragraphs may contain profanity.
Description
Python offers a set of sentence tokenizers for different purposes: nltk's word tokenizer, spacy's, scikit-learn's default and TweetTokenizer, among others. All but TweetTokenizer disregard hashtags and mentions by separating the symbols from the rest of the token(s). Although TweetTokenizer considers the Twitter dialect, it fails to tokenize subtle hidden profanity.
For the word f*ck
,the tokens considered are [f, *, ck]
. The word g@y
is tokenized as [g, @y]
, considering
a single token g
and a wrongly identified mention @y
. While the hashtag #hash_tag
is correctly tokenized as
[#hash_tag]
, regular tokens are not underscore separated: love_twitter
is tokenized as ['love_twitter']
instead of ['love', '_', 'twitter']
.
Twikenizer was created in order to enable a proper identification of hidden profane words, considering the features detailed above. Applying distance related features, i.e. levenshtein distance to slang words should output better results using this tokenizer.
Installation
Using pip
pip install twikenizer
Clone repository
git clone https://github.com/Guilherme-Routar/Twikenizer.git
Usage
> import twikenizer as twk
> twk = twk.Twikenizer()
> tweet = 'This is an #hashtag'
> twk.tokenize(tweet)
['This', 'is', 'an', '#hashtag']
Twikenizer has a built-in function examplify
which demonstrates how it tokenizes different kind of words/tokens.
> twk.examplify()
Generated tweet
###############
Tw33t # @dude_really #hash_tag $hit (g@y) retard#d @dude. 😀😀 !😀abc %😀lol #hateit #hate.it $%&/ f*ck-
Generated tokens
################
['Tw33t', '#', '@dude_really', '#hash_tag', '$hit', '(', 'g', '@', 'y', ')', 'retard#d', '@dude', '.', '😀', '😀', '!', '😀', 'abc', '%', '😀', 'lol', '#hateit', '#hate', '.', 'it', '$', '%', '&', '/', 'f*ck', '-']
´´´
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.