
Tokenizer for Twitter comments (tweets)

Project description

Twikenizer

This repository hosts the code for a tokenizer of tweets. Its main purpose is to identify subtle profanity, so it should perform better on data containing hidden profanity (e.g. 'f*ck').

Disclaimer: The following paragraphs may contain profanity.

Description

Python libraries offer several tokenizers for different purposes: NLTK's word tokenizer and TweetTokenizer, spaCy's tokenizer and scikit-learn's default tokenizer, among others. All but TweetTokenizer disregard hashtags and mentions by separating the symbols from the rest of the token(s). Although TweetTokenizer takes the Twitter dialect into account, it fails to keep subtly hidden profanity together as a single token.

For example, TweetTokenizer splits the word f*ck into the tokens [f, *, ck]. The word g@y is tokenized as [g, @y]: a single token g and a wrongly identified mention @y. While the hashtag #hash_tag is correctly tokenized as [#hash_tag], regular tokens are not split on underscores: love_twitter is tokenized as ['love_twitter'] instead of ['love', '_', 'twitter'].
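
The underscore behaviour described above can be sketched in a few lines. This is only an illustration of the rule, not Twikenizer's actual implementation; the helper name split_underscores is hypothetical and is not part of the package.

```python
import re
from typing import List


def split_underscores(token: str) -> List[str]:
    """Split a regular token on underscores, keeping hashtags and mentions whole.

    Hypothetical helper for illustration only; not part of the twikenizer API.
    """
    # Hashtags and mentions keep their underscores (e.g. '#hash_tag').
    if token.startswith(("#", "@")):
        return [token]
    # The capturing group makes re.split keep each '_' as its own token.
    parts = re.split(r"(_)", token)
    # Drop the empty strings produced by leading or trailing underscores.
    return [part for part in parts if part]


print(split_underscores("#hash_tag"))     # ['#hash_tag']
print(split_underscores("love_twitter"))  # ['love', '_', 'twitter']
```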

Twikenizer was created to enable proper identification of hidden profane words, taking the cases detailed above into account. Distance-based features, such as the Levenshtein distance to slang words, should yield better results with this tokenizer.
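
To see why keeping a masked word in one piece helps distance-based features, here is a minimal sketch in plain Python. The levenshtein and min_distance functions and the two-word lexicon are assumptions made for this example; they are not part of Twikenizer.

```python
from typing import Iterable


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def min_distance(token: str, lexicon: Iterable[str]) -> int:
    """Smallest edit distance between a token and any lexicon entry."""
    return min(levenshtein(token.lower(), word) for word in lexicon)


# Hypothetical slang lexicon, for illustration only.
lexicon = ["fuck", "shit"]

# Kept whole, the masked word is one edit away from a lexicon entry.
print(min_distance("f*ck", lexicon))  # 1

# Split into fragments, no single piece is that close.
print([min_distance(piece, lexicon) for piece in ["f", "*", "ck"]])  # [3, 4, 2]
```

With the token kept whole, a simple threshold on the distance (for example, at most 1) flags the masked word, while the fragments would need a looser threshold that also matches many harmless short tokens.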

Installation

Using pip

pip install twikenizer

Clone repository

git clone https://github.com/Guilherme-Routar/Twikenizer.git

Usage

> import twikenizer as twk
> twk = twk.Twikenizer()
> tweet = 'This is an #hashtag'
> twk.tokenize(tweet)
['This', 'is', 'an', '#hashtag']

Twikenizer has a built-in function, examplify, which demonstrates how it tokenizes different kinds of words/tokens.

> twk.examplify()
Generated tweet
###############
Tw33t # @dude_really #hash_tag $hit (g@y) retard#d @dude. 😀😀 !😀abc %😀lol #hateit #hate.it $%&/ f*ck-

Generated tokens
################
['Tw33t', '#', '@dude_really', '#hash_tag', '$hit', '(', 'g', '@', 'y', ')', 'retard#d', '@dude', '.', '😀', '😀', '!', '😀', 'abc', '%', '😀', 'lol', '#hateit', '#hate', '.', 'it', '$', '%', '&', '/', 'f*ck', '-']
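
Because hashtags and mentions come out as single tokens, downstream code can pick them out with a simple prefix check. The sketch below works on a hand-picked subset of the tokens shown above; group_tokens is a hypothetical helper, not part of the package.

```python
from typing import Dict, List

# Hand-picked subset of the tokens printed by examplify() above.
tokens = ['Tw33t', '#', '@dude_really', '#hash_tag', '$hit', '@dude', '.',
          '#hateit', '#hate', '.', 'it', 'f*ck', '-']


def group_tokens(tokens: List[str]) -> Dict[str, List[str]]:
    """Group tokens into hashtags, mentions and everything else."""
    groups: Dict[str, List[str]] = {"hashtags": [], "mentions": [], "other": []}
    for token in tokens:
        # A lone '#' or '@' is just a symbol, not a hashtag or mention.
        if len(token) > 1 and token.startswith("#"):
            groups["hashtags"].append(token)
        elif len(token) > 1 and token.startswith("@"):
            groups["mentions"].append(token)
        else:
            groups["other"].append(token)
    return groups


groups = group_tokens(tokens)
print(groups["hashtags"])  # ['#hash_tag', '#hateit', '#hate']
print(groups["mentions"])  # ['@dude_really', '@dude']
```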
