Word segmentation / tokenization focussed on Twitter
Project description
ark-twokenize-py
This is a crude Python port of the Twokenize class from ark-tweet-nlp.
It produces nearly identical output to the original Java tokenizer, except in a few infrequent situations. In particular, Python does not support partial case-insensitivity in regular expressions, which causes some tokenization differences for "Eastern"-style emoticons, particularly when the left and right halves differ in case. For example:
Java (original): v.V
Python (port): v . V
Emoticons of this kind appear to be rare. Nevertheless, I have included a fix for one special case:
Java (original): o.O
Python (port, w/o fix): o . O
Python (port, w/ fix): o.O
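The effect of such a special-case fix can be sketched with a toy tokenizer. The patterns and the tokenize helper below are hypothetical and heavily simplified; they are not the expressions used by twokenize or ark-tweet-nlp. The sketch only assumes that a case-sensitive emoticon pattern misses mixed-case halves, and that spelling out both cases for one emoticon family recovers that family alone.

```python
import re

# Hypothetical, simplified patterns for illustration only.
# Case-sensitive "Eastern" emoticon: lowercase halves around a centre dot.
EASTERN = r"[ov]\.[ov]"
# Special-case fix: list both cases explicitly for the o.O family.
O_O_FIX = r"[oO]\.[oO]"
# Fallback: a run of word characters, else a single punctuation mark.
FALLBACK = r"\w+|[^\w\s]"


def tokenize(text, with_fix):
    """Split text into tokens, trying emoticon patterns before the fallback."""
    alternatives = ([O_O_FIX] if with_fix else []) + [EASTERN, FALLBACK]
    return re.findall("|".join(alternatives), text)


print(tokenize("o.O", with_fix=False))  # ['o', '.', 'O']
print(tokenize("o.O", with_fix=True))   # ['o.O']
print(tokenize("v.V", with_fix=True))   # ['v', '.', 'V']
```

Because the fix names only the o/O characters, other mixed-case emoticons such as v.V are still split, which matches the behaviour described above.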
Evaluation
A comparison on 1 million tweets found 83 instances (0.0083%) where tokenization differed between the original Java version and this Python port. The differences were primarily related to the emoticon issue discussed above, and it was not clear in general which output was more desirable. In the example below, the Java tokenizer splits t-T out of Profit-Taking as a separate token, while the port keeps the word intact:
Text:
Profit-Taking Hits Nikkei http://t.co/hVWpiDQ1 http://t.co/xJSPwE2z RT @WSJmarkets
Java (original):
Profi t-T aking Hits Nikkei http://t.co/hVWpiDQ1 http://t.co/xJSPwE2z RT @WSJmarkets
Python (port):
Profit-Taking Hits Nikkei http://t.co/hVWpiDQ1 http://t.co/xJSPwE2z RT @WSJmarkets
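A comparison of this kind can be reproduced with a few lines of Python. This is a minimal sketch, assuming each tokenizer's output has already been collected as one space-joined string per tweet; the function name tokenization_diff_rate and the sample lists are hypothetical and not part of twokenize.

```python
def tokenization_diff_rate(reference_outputs, port_outputs):
    """Return (number of differing tweets, percentage of differing tweets)."""
    if len(reference_outputs) != len(port_outputs):
        raise ValueError("both tokenizers must be run on the same tweets")
    # Count positions where the two tokenized strings are not identical.
    differing = sum(
        ref != port for ref, port in zip(reference_outputs, port_outputs)
    )
    total = len(reference_outputs)
    percent = 100.0 * differing / total if total else 0.0
    return differing, percent


# Hypothetical two-tweet sample: the first tweet differs, the second does not.
java_out = ["Profi t-T aking Hits Nikkei", "lol ly x0x0 , :D"]
port_out = ["Profit-Taking Hits Nikkei", "lol ly x0x0 , :D"]
count, percent = tokenization_diff_rate(java_out, port_out)
print(count, percent)  # 1 50.0
```

With the figures reported above, 83 differing tweets out of 1,000,000 gives 100 * 83 / 1,000,000 = 0.0083%.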
Usage
>>> import twokenize
>>> twokenize.tokenizeRawTweetText("lol ly x0x0,:D")
['lol', 'ly', 'x0x0', ',', ':D']
Installation
pip install twokenize
Download files
File details
Details for the file twokenize-1.0.0.tar.gz.
File metadata
- Download URL: twokenize-1.0.0.tar.gz
- Upload date:
- Size: 8.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
File hashes
Algorithm | Hash digest
---|---
SHA256 | d121ea00caa1c086821391b860140f46fd41761073c875eda711ddaca7677dbe
MD5 | 2778c0c5dc870e5c70324dad6eef20da
BLAKE2b-256 | 69e7c51379ef276432b3f92691e1b49596885708b34ebcc7975d9b681805a5ab
File details
Details for the file twokenize-1.0.0-py3-none-any.whl.
File metadata
- Download URL: twokenize-1.0.0-py3-none-any.whl
- Upload date:
- Size: 8.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
File hashes
Algorithm | Hash digest
---|---
SHA256 | 77d59ad045eb8289086a4e9e13a053bc04236eb3ad78f61f2995e19c02621cb7
MD5 | 114fec2b40dcfedf301ae7318fda9531
BLAKE2b-256 | 3e7c8874d719de00a1da753d21733a4a3043f67b84dccad525ab442fdd572617