DicTok
A dictionary-based tokenizer. It tokenizes a text based on known tokens defined in a given file.
Installation
pip install dictok
Usage
- Create your dic-file with a list of tokens, one per line, e.g. tokens.dic:
super
man
note
book
store
...
- Import dictok and pass it the dictionary file as the main parameter:
>>> import dictok
>>> dt = dictok.DicTok('tokens.dic')
- You are ready to use it:
>>> sent = "Superman bought a notebook in the bookstore."
>>> dt.tokenize(sent)
['Super', 'man', 'bought', 'a', 'note', 'book', 'in', 'the', 'book', 'store', '.']
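To make the behaviour above easier to picture, here is a minimal sketch of one way a dictionary-based tokenizer could work. It is not DicTok's actual implementation: the helper names `load_tokens`, `split_word`, and `tokenize` are hypothetical, and the greedy longest-match strategy and case-insensitive lookup are assumptions chosen because they reproduce the example output. Words that cannot be fully covered by dictionary entries are kept whole.

```python
import re


def load_tokens(lines):
    """Build a lowercase vocabulary from an iterable of lines (hypothetical helper)."""
    return {line.strip().lower() for line in lines if line.strip()}


def split_word(word, vocab):
    """Greedy longest-match split of one word.

    Returns the list of pieces, or None if the word cannot be
    covered entirely by dictionary entries.
    """
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest candidate first, then shrink it.
        for j in range(len(word), i, -1):
            if word[i:j].lower() in vocab:
                pieces.append(word[i:j])  # keep the original casing
                i = j
                break
        else:
            return None  # no dictionary entry starts at position i
    return pieces


def tokenize(text, vocab):
    """Split text into words and punctuation, then split words by dictionary."""
    out = []
    for chunk in re.findall(r"\w+|[^\w\s]", text):
        out.extend(split_word(chunk, vocab) or [chunk])
    return out


# In practice the lines would come from a file such as tokens.dic.
vocab = load_tokens(["super", "man", "note", "book", "store"])
print(tokenize("Superman bought a notebook in the bookstore.", vocab))
```

With the five sample tokens, this sketch prints the same list as the example above. A greedy match can miss a valid split that a backtracking search would find, so a real implementation may use a different strategy.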