Fast and customizable tokenizer

## tok

[![PyPI](https://img.shields.io/pypi/v/tok.svg?style=flat-square)](https://pypi.python.org/pypi/tok/) [![PyPI](https://img.shields.io/pypi/pyversions/tok.svg?style=flat-square)](https://pypi.python.org/pypi/tok/)

A fast tokenizer for Python that aims to be the most complete and customizable.

It is roughly 25x faster than spaCy's and NLTK's regex-based tokenizers.

It is built on the Aho-Corasick algorithm, which is unusual for a tokenizer and lets it be both fast and explainable in how it splits.

The heavy lifting is done by [textsearch](https://github.com/kootenpv/textsearch) and [pyahocorasick](https://github.com/WojciechMula/pyahocorasick), allowing this to be written in only ~200 lines of code.

Contrary to regex-based approaches, it will go over each character in a text only once. Read [below](#how-it-works) about how this works.

### Installation

    pip install tok

### Usage

By default it handles contractions, http, (float) numbers and currencies.

```python
from tok import word_tokenize

word_tokenize("I wouldn't do that.... would you?")
# ['I', 'would', 'not', 'do', 'that', '...', 'would', 'you', '?']
```

Or configure it yourself:

```python
from tok import Tokenizer

tokenizer = Tokenizer(protected_words=["some.thing"])  # still using the defaults
tokenizer.word_tokenize("I want to protect some.thing")
# ['I', 'want', 'to', 'protect', 'some.thing']
```

Split by sentences:

```python
from tok import sent_tokenize

sent_tokenize("I wouldn't do that.... would you?")
# [['I', 'would', 'not', 'do', 'that', '...'], ['would', 'you', '?']]
```
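To picture the relationship between the two outputs, the nested result can be read as the flat token list cut after each sentence-ending token. The sketch below is only an illustration of that reading: `group_sentences` and its `enders` tuple are hypothetical names, and how tok itself decides sentence boundaries is not described here.

```python
def group_sentences(tokens, enders=(".", "...", "?", "!")):
    """Group a flat token list into sentences, closing one after each ending token."""
    sentences, current = [], []
    for token in tokens:
        current.append(token)
        if token in enders:
            # An ending token closes the current sentence.
            sentences.append(current)
            current = []
    if current:
        # Trailing tokens without closing punctuation form a last sentence.
        sentences.append(current)
    return sentences


tokens = ['I', 'would', 'not', 'do', 'that', '...', 'would', 'you', '?']
sentences = group_sentences(tokens)
print(sentences)
```

For the token list shown above, this grouping reproduces the two sentences from the `sent_tokenize` example.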

For more options, check the documentation of the `Tokenizer` class.

### Further customization

Given:

```python
from tok import Tokenizer

t = Tokenizer(protected_words=["some.thing"])  # still using the defaults
```

You can add your own rules to the tokenizer by using:

- `t.keep(x, reason)`: whenever it finds `x`, it will not add whitespace. This prevents direct tokenization.
- `t.split(x, reason)`: whenever it finds `x`, it will surround it with whitespace, thus creating a token.
- `t.drop(x, reason)`: whenever it finds `x`, it will remove it but add a split.
- `t.strip(x, reason)`: whenever it finds `x`, it will remove it without splitting.

```python
t.drop("bla", "bla is not needed")
t.word_tokenize("Please remove bla, thank you")
# ['Please', 'remove', ',', 'thank', 'you']
```
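One way to picture the four rule types is as plain string replacements stored next to their reasons. The sketch below is a toy model, not tok's implementation: the `ToyRules` class is hypothetical, and the replacement strings for `keep`, `split`, and `strip` are assumptions inferred from the descriptions above. Only the `drop` mapping (a match replaced by a single space) corresponds to what `t.explain` reports in the Explainable section.

```python
class ToyRules:
    """Toy model of keep/split/drop/strip as replacement strings (hypothetical)."""

    def __init__(self):
        # Maps each pattern to a (replacement, reason) pair.
        self.rules = {}

    def keep(self, x, reason):
        # Replace x with itself: no whitespace is added, so x stays in one piece.
        self.rules[x] = (x, reason)

    def split(self, x, reason):
        # Surround x with spaces so it becomes its own token.
        self.rules[x] = (" " + x + " ", reason)

    def drop(self, x, reason):
        # Remove x but leave a space, so its neighbours stay separated.
        self.rules[x] = (" ", reason)

    def strip(self, x, reason):
        # Remove x without leaving a space, so its neighbours are joined.
        self.rules[x] = ("", reason)

    def explain(self, x):
        # Report what happens to x and why, in the same shape as tok's output.
        replacement, reason = self.rules[x]
        return [{"from": x, "to": replacement, "explanation": reason}]


rules = ToyRules()
rules.keep("some.thing", "protected word")
rules.split(",", "commas are separate tokens")
rules.drop("bla", "bla is not needed")
rules.strip("-", "join hyphenated words")

for pattern, (replacement, reason) in rules.rules.items():
    print(repr(pattern), "->", repr(replacement), "|", reason)
```

Storing the reason with every rule is what makes the behaviour explainable: any replacement can be traced back to the rule that caused it.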

### Explainable

Explain what happened:

```python
t.explain("bla")
# [{'from': 'bla', 'to': ' ', 'explanation': 'bla is not needed'}]
```

See everything in there (will help you understand how it works):

```python
t.explain_dict
```

### How it works

It always keeps only the longest match. A rule that introduces a space around its match causes a split at that point.

For example, consider how the tokenization of `.` works:

- When it finds ` A.`, it makes it ` A.` (single-letter abbreviations).
- When it finds `.0`, it makes it `.0` (numbers).
- When it finds `.`, it makes it ` . ` (thus making a split).

If you want to make sure something including a dot stays, you can use for example:

    t.keep("cool.", "cool. should stay in one piece")
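The longest-match behaviour can be illustrated with a small replacement scanner. This is a minimal sketch under stated assumptions, not tok's code: `replace_longest` is a hypothetical helper that tries every pattern at each position, whereas Aho-Corasick finds matches in a single pass without trying each pattern in turn. The rule table is also made up for the example and covers only a bare dot, digits after a dot, and a kept `cool.`.

```python
def replace_longest(text, rules):
    """Scan left to right, applying the longest rule that matches at each position."""
    patterns = sorted(rules, key=len, reverse=True)  # try longer patterns first
    out = []
    i = 0
    while i < len(text):
        for pattern in patterns:
            if text.startswith(pattern, i):
                out.append(rules[pattern])
                i += len(pattern)  # skip past the match so it is not rescanned
                break
        else:
            out.append(text[i])  # no rule applies: copy the character unchanged
            i += 1
    return "".join(out)


rules = {
    ".": " . ",        # split: a bare dot becomes its own token
    "cool.": "cool.",  # keep: the longer match wins, so this dot is not split
}
# A dot followed by a digit maps to itself, like the `.0` number rule.
rules.update({"." + d: "." + d for d in "0123456789"})

tokens = replace_longest("It costs 1.05 and is cool. Nice.", rules).split()
print(tokens)
```

In this toy run, the dot in `1.05` and the dot in `cool.` survive because a longer rule covers them, while the final dot falls through to the split rule.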

### Contributing

Contributions to this library are greatly appreciated.

It would also be great to add [contractions](https://github.com/kootenpv/contractions) for other languages.

### Release history

- 0.1.14 (Jul 9, 2019), this version
- 0.1.13 (Jul 9, 2019)
- 0.0.9 (Jul 4, 2019)
- 0.0.8 (Jul 4, 2019)
- 0.0.5 (Jul 4, 2019)
- 0.0.4 (Jul 4, 2019)
- 0.0.3 (Jul 4, 2019)
- 0.0.2 (Jul 4, 2019)
- 0.0.1 (Jul 4, 2019)

### Download files

| File | Type | Size | Upload date | Tags | Trusted Publishing | Uploaded via |
| --- | --- | --- | --- | --- | --- | --- |
| tok-0.1.14.tar.gz | Source distribution | 5.6 kB | Jul 9, 2019 | Source | No | Python-urllib/3.7 |
| tok-0.1.14-py2.py3-none-any.whl | Built distribution | 7.8 kB | Jul 9, 2019 | Python 2, Python 3 | No | Python-urllib/3.7 |

Hashes for tok-0.1.14.tar.gz:

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `6c48d5b77c8e4e9e6e8413827d6f0ba8c058ce5578198b2ee68901a07fbc2e57` |
| MD5 | `e8b91768944ea55e6426380623170c5a` |
| BLAKE2b-256 | `fbe86798c017485aa2e0713b9f4fdae679e91940b3927839b8d2866820c1b994` |

Hashes for tok-0.1.14-py2.py3-none-any.whl:

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `8c3454e8968871808e4dd6d7d86371b11d7d5dea1b92b260016392750721165f` |
| MD5 | `dbc3d314dd0bc7451734a29e88b1ed4e` |
| BLAKE2b-256 | `a396878cb1996aa90ff3f714205e5ba8d4a891a57463ad54594e553e39fd2691` |
