DicTok
A dictionary-based tokenizer. It splits a text into tokens that are known from a user-supplied dictionary file.
Installation
pip install dictok
Usage
- Create your dic-file with a list of tokens, e.g. tokens.dic:

  super
  man
  note
  book
  store
  ...
- Import dictok and pass the dictionary file as the main parameter:
>>> import dictok
>>> dt = dictok.DicTok('tokens.dic')
- You are ready to use it:
>>> sent = "Superman bought a notebook in the bookstore."
>>> dt.tokenize(sent)
['Super', 'man', 'bought', 'a', 'note', 'book', 'in', 'the', 'book', 'store', '.']
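The splitting above can be pictured as a greedy longest-match over each whitespace-separated word. The sketch below is illustrative only (the `dict_tokenize` name and the `vocab` set are assumptions, not DicTok's actual implementation):

```python
# Illustrative sketch only -- not DicTok's actual implementation.
def dict_tokenize(text, vocab):
    """Split each word greedily into the longest known tokens."""
    tokens = []
    for word in text.split():
        i, unknown = 0, ""
        while i < len(word):
            match = None
            # try the longest candidate substring first
            for j in range(len(word), i, -1):
                if word[i:j].lower() in vocab:
                    match = word[i:j]
                    break
            if match is None:
                unknown += word[i]       # no known token starts here
                i += 1
            else:
                if unknown:              # flush the unknown run first
                    tokens.append(unknown)
                    unknown = ""
                tokens.append(match)
                i += len(match)
        if unknown:
            tokens.append(unknown)
    return tokens

vocab = {"super", "man", "note", "book", "store"}
print(dict_tokenize("Superman bought a notebook in the bookstore.", vocab))
# ['Super', 'man', 'bought', 'a', 'note', 'book', 'in', 'the', 'book', 'store', '.']
```

Greedy longest-match is why "bookstore." becomes 'book', 'store', '.' rather than a run of single characters: known tokens are consumed whole, and any leftover characters are emitted as one unknown token.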
Options
You can also ignore single characters or unknown tokens:
>>> dt.tokenize(sent, include_unknown=False, include_single_chars=False)
['Super', 'man', 'note', 'book', 'book', 'store']
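One way to picture these two flags is as a post-filter over the token list. This is a hypothetical sketch (`filter_tokens` is not part of the dictok API):

```python
# Hypothetical post-filter illustrating the two flags; not dictok's API.
def filter_tokens(tokens, vocab, include_unknown=True, include_single_chars=True):
    out = []
    for t in tokens:
        if not include_unknown and t.lower() not in vocab:
            continue                 # drop tokens missing from the dictionary
        if not include_single_chars and len(t) == 1:
            continue                 # drop one-character tokens like '.'
        out.append(t)
    return out

vocab = {"super", "man", "note", "book", "store"}
tokens = ['Super', 'man', 'bought', 'a', 'note', 'book',
          'in', 'the', 'book', 'store', '.']
print(filter_tokens(tokens, vocab,
                    include_unknown=False, include_single_chars=False))
# ['Super', 'man', 'note', 'book', 'book', 'store']
```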
If, for example, you want to recognize and correct words with typing errors, you can do so by specifying each misspelling and its correction as a pair in the dictionary file:
super
man
note
book
buok,book
store
stohre,store
...
>>> dt = dictok.DicTok('tokens.dic')
>>> sent = "Superman bought a notebuok in the bookstohre."
>>> dt.tokenize(sent, include_unknown=False, include_single_chars=False)
['Super', 'man', 'note', 'book', 'book', 'store']
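The pair syntax amounts to a small mapping from misspelling to canonical form. A sketch of loading such a file (the `load_dic` helper is hypothetical, not dictok's API):

```python
# Hypothetical loader for the "misspelling,correction" line format above;
# not dictok's actual API.
def load_dic(path):
    vocab = {}                       # known form -> canonical spelling
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            if "," in line:
                wrong, right = line.split(",", 1)
                vocab[wrong] = right  # e.g. 'buok' -> 'book'
            else:
                vocab[line] = line    # correct token maps to itself
    return vocab
```

A tokenizer built on such a mapping can match on the keys but emit the canonical values, which is how "notebuok" can come out as 'note', 'book'.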