
Word segmentation / tokenization focused on Twitter

Project description

ark-twokenize-py

This is a crude Python port of the Twokenize class from ark-tweet-nlp.

It produces nearly identical output to the original Java tokenizer, except in a few infrequent situations. In particular, Python does not support partial case-insensitivity in regular expressions, which causes some tokenization differences for "Eastern"-style emoticons, particularly when the left and right halves differ in case. For example:

Java (original): v.V
Python (port): v . V
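
To make the limitation concrete: Java regular expressions can switch case-insensitivity on for just part of a pattern, whereas the classic Python re flags such as (?i) apply to the whole expression (scoped inline groups like (?i:...) were only added in Python 3.6). The snippet below is a minimal illustration of the difference, not the port's actual emoticon pattern:

import re

# Pattern-wide case-insensitivity: (?i) covers the whole pattern, so there
# is no way (before Python 3.6) to make only one half of an "Eastern"
# emoticon case-insensitive.
whole = re.compile(r'(?i)v\.v')
print(whole.fullmatch('v.V') is not None)     # True, but 'V.v' and 'V.V' match too

# A portable workaround: expand the case variants by hand.
expanded = re.compile(r'[vV]\.[vV]')
print(expanded.fullmatch('v.V') is not None)  # True

# Python 3.6+ only: scoped inline flags, closer to what Java allows.
scoped = re.compile(r'(?i:v)\.(?i:v)')
print(scoped.fullmatch('v.V') is not None)    # True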

Emoticons of this kind appear to be rare. Nevertheless, I have included a fix for one special case:

Java (original): o.O
Python (port, w/o fix): o . O
Python (port, w/ fix): o.O
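
The fix amounts to special-casing: listing the known mixed-case variants explicitly rather than relying on partial case-insensitivity. A hypothetical sketch of the idea (the names and pattern here are illustrative, not the port's actual source):

import re

# Hypothetical illustration: enumerate the mixed-case variants explicitly so
# the emoticon survives tokenization even without partial case-insensitivity.
SPECIAL_CASES = r'(?:o\.O|O\.o)'   # the o.O family handled by the fix
EMOTICON = re.compile(SPECIAL_CASES)

print(EMOTICON.findall('wait o.O what O.o'))  # ['o.O', 'O.o']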

Evaluation

A comparison on 1 million tweets found 83 instances (0.0083%) where this Python port's tokenization differed from the original Java version's. The differences were primarily related to the emoticon issue discussed above, and in general it was not clear which output was preferable. For example:

Text:
Profit-Taking Hits Nikkei http://t.co/hVWpiDQ1 http://t.co/xJSPwE2z RT @WSJmarkets

Java (original):
Profi t-T aking Hits Nikkei http://t.co/hVWpiDQ1 http://t.co/xJSPwE2z RT @WSJmarkets

Python (port):
Profit-Taking Hits Nikkei http://t.co/hVWpiDQ1 http://t.co/xJSPwE2z RT @WSJmarkets
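
Here the Java tokenizer's partially case-insensitive emoticon pattern evidently matches the "t-T" inside "Profit-Taking" (compare the crying-face emoticon T-T) and splits it out, while the port's stricter pattern leaves the word intact. A comparison like the one above can be scripted in a few lines; the following is a hypothetical sketch that assumes each tokenizer has written one space-joined tokenization per line to java_tokens.txt and python_tokens.txt (both file names are made up for the example):

# Hypothetical sketch of the 1M-tweet comparison. Assumes two parallel files,
# one space-joined tokenization per line; the file names are illustrative.
diffs = 0
total = 0
with open('java_tokens.txt') as java_out, open('python_tokens.txt') as py_out:
    for line_java, line_py in zip(java_out, py_out):
        total += 1
        if line_java.strip() != line_py.strip():
            diffs += 1

print(f'{diffs} of {total} tweets tokenized differently '
      f'({100.0 * diffs / total:.4f}%)')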

Usage

>>> import twokenize
>>> twokenize.tokenizeRawTweetText("lol ly x0x0,:D")
['lol', 'ly', 'x0x0', ',', ':D']
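
Given the behavior described above, hashtags, user mentions, URLs, and (with the fix) o.O should each survive as single tokens; one would expect something along these lines:

>>> twokenize.tokenizeRawTweetText("wow o.O RT @WSJmarkets http://t.co/hVWpiDQ1")
['wow', 'o.O', 'RT', '@WSJmarkets', 'http://t.co/hVWpiDQ1']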

Installation

pip install twokenize

Download files

Source Distribution

  • twokenize-1.0.0.tar.gz (8.2 kB)

Built Distribution

  • twokenize-1.0.0-py3-none-any.whl (8.5 kB)

File details

twokenize-1.0.0.tar.gz (source distribution; not uploaded using Trusted Publishing)

  • SHA256: d121ea00caa1c086821391b860140f46fd41761073c875eda711ddaca7677dbe
  • MD5: 2778c0c5dc870e5c70324dad6eef20da
  • BLAKE2b-256: 69e7c51379ef276432b3f92691e1b49596885708b34ebcc7975d9b681805a5ab

twokenize-1.0.0-py3-none-any.whl (built distribution, Python 3)

  • SHA256: 77d59ad045eb8289086a4e9e13a053bc04236eb3ad78f61f2995e19c02621cb7
  • MD5: 114fec2b40dcfedf301ae7318fda9531
  • BLAKE2b-256: 3e7c8874d719de00a1da753d21733a4a3043f67b84dccad525ab442fdd572617
