DicTok
A dictionary-based tokenizer. It splits a text into tokens that are known from a user-supplied dictionary file.
Installation
pip install dictok
Usage
- Create your dic-file with a list of tokens, e.g. tokens.dic:

  super
  man
  note
  book
  store
  ...
- Import dictok and pass the dictionary file as the main parameter:
>>> import dictok
>>> dt = dictok.DicTok('tokens.dic')
- You are ready to use it:
>>> sent = "Superman bought a notebook in the bookstore."
>>> dt.tokenize(sent)
['Super', 'man', 'bought', 'a', 'note', 'book', 'in', 'the', 'book', 'store', '.']
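The splitting above can be pictured as a greedy longest-match over each whitespace-separated word. The sketch below is illustrative only (the `dict_tokenize` name and the `vocab` set are assumptions, not DicTok's actual implementation):

```python
# Illustrative sketch only -- not DicTok's actual implementation.
def dict_tokenize(text, vocab):
    """Split each word greedily into the longest known tokens."""
    tokens = []
    for word in text.split():
        i, unknown = 0, ""
        while i < len(word):
            match = None
            # try the longest candidate substring first
            for j in range(len(word), i, -1):
                if word[i:j].lower() in vocab:
                    match = word[i:j]
                    break
            if match is None:
                unknown += word[i]       # no known token starts here
                i += 1
            else:
                if unknown:              # flush the unknown run first
                    tokens.append(unknown)
                    unknown = ""
                tokens.append(match)
                i += len(match)
        if unknown:
            tokens.append(unknown)
    return tokens

vocab = {"super", "man", "note", "book", "store"}
print(dict_tokenize("Superman bought a notebook in the bookstore.", vocab))
# ['Super', 'man', 'bought', 'a', 'note', 'book', 'in', 'the', 'book', 'store', '.']
```

Greedy longest-match is why "bookstore." becomes 'book', 'store', '.' rather than a run of single characters: known tokens are consumed whole, and any leftover characters are emitted as one unknown token.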
Options
You can also ignore single characters or unknown tokens:
>>> dt.tokenize(sent, include_unknown=False, include_single_chars=False)
['Super', 'man', 'note', 'book', 'book', 'store']
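One way to picture these two flags is as a post-filter over the token list. This is a hypothetical sketch (`filter_tokens` is not part of the dictok API):

```python
# Hypothetical post-filter illustrating the two flags; not dictok's API.
def filter_tokens(tokens, vocab, include_unknown=True, include_single_chars=True):
    out = []
    for t in tokens:
        if not include_unknown and t.lower() not in vocab:
            continue                 # drop tokens missing from the dictionary
        if not include_single_chars and len(t) == 1:
            continue                 # drop one-character tokens like '.'
        out.append(t)
    return out

vocab = {"super", "man", "note", "book", "store"}
tokens = ['Super', 'man', 'bought', 'a', 'note', 'book',
          'in', 'the', 'book', 'store', '.']
print(filter_tokens(tokens, vocab,
                    include_unknown=False, include_single_chars=False))
# ['Super', 'man', 'note', 'book', 'book', 'store']
```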
If, for example, you want to recognize and correct words with typing errors, you can do so by specifying each misspelling and its correction as a pair in the dictionary file:
super
man
note
book
buok,book
store
stohre,store
...
>>> dt = dictok.DicTok('tokens.dic')
>>> sent = "Superman bought a notebuok in the bookstohre."
>>> dt.tokenize(sent, include_unknown=False, include_single_chars=False)
['Super', 'man', 'note', 'book', 'book', 'store']
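The pair syntax amounts to a small mapping from misspelling to canonical form. A sketch of loading such a file (the `load_dic` helper is hypothetical, not dictok's API):

```python
# Hypothetical loader for the "misspelling,correction" line format above;
# not dictok's actual API.
def load_dic(path):
    vocab = {}                       # known form -> canonical spelling
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            if "," in line:
                wrong, right = line.split(",", 1)
                vocab[wrong] = right  # e.g. 'buok' -> 'book'
            else:
                vocab[line] = line    # correct token maps to itself
    return vocab
```

A tokenizer built on such a mapping can match on the keys but emit the canonical values, which is how "notebuok" can come out as 'note', 'book'.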