A simple tokenizer for TensorFlow 2.x and PyTorch
meguru tokenizer
installation and initialization
pip install meguru_tokenizer
sudachipy link -t full
Overview of usage
- Preprocess your corpus with the tokenizer you plan to use, e.g. sentencepiece preprocessing or sudachi preprocessing.
- Tokenize in your code with that same tokenizer.
  - Basics: see the official docs.
  - TensorFlow: see the tutorial.
  - PyTorch: TODO.
Real-world example
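import pprint
from pathlib import Path

from meguru_tokenizer.whitespace_tokenizer import WhitespaceTokenizer
# Vocab and Path are used below but were never imported; the module path for Vocab is an assumption
from meguru_tokenizer.vocab import Vocab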
sentences = [
    "Hello, I don't know how to use it?",
    "Tensorflow is awesome!",
    "it is good framework.",
]
# define tokenizer and vocabulary
tokenizer = WhitespaceTokenizer(lower=True)
vocab = Vocab()
# build vocabulary
for sentence in sentences:
    vocab.add_vocabs(tokenizer.tokenize(sentence))
vocab.build_vocab()
# set vocabulary into tokenizer to enable encoding
tokenizer.vocab = vocab
# save vocabulary information
vocab.dump_vocab(Path("vocab.txt"))
print("vocabs:")
pprint.pprint(vocab.i2w)
# tokenize
print("tokenized sentence")
pprint.pprint(tokenizer.tokenize_list(sentences))
# [['hello', ',', 'i', 'do', "n't", 'know', 'how', 'to', 'use', 'it', '?'],
# ['tensorflow', 'is', 'awesome', '!'],
# ['it', 'is', 'good', 'framework', '.']]
# encode
print("encoded sentence")
encodes = [tokenizer.encode(sentence) for sentence in sentences]
pprint.pprint(encodes)
# [[7, 8, 9, 10, 11, 12, 13, 14, 15, 5, 16], [17, 6, 18, 19], [5, 6, 20, 21, 22]]
# decode
print("decoded sentence")
pprint.pprint([tokenizer.decode(tokens) for tokens in encodes])
# ["hello , i do n't know how to use it ?",
# 'tensorflow is awesome !',
# 'it is good framework .']
vocab_size = len(vocab)
# restore the vocabulary from the dumped file
print("reload from dump file")
vocab = Vocab()
vocab.load_vocab(Path("vocab.txt"))
assert vocab_size == len(vocab)
tokenizer = WhitespaceTokenizer(vocab=vocab)
pprint.pprint([tokenizer.encode(sentence) for sentence in sentences])
# [[7, 8, 9, 10, 11, 12, 13, 14, 15, 5, 16], [17, 6, 18, 19], [5, 6, 20, 21, 22]]
# vocabulary with a minimum frequency limit
vocab = Vocab()
for sentence in sentences:
    vocab.add_vocabs(tokenizer.tokenize(sentence))
vocab.build_vocab(min_freq=2)
assert vocab_size != len(vocab)
# vocabulary with a maximum vocabulary size
vocab = Vocab()
for sentence in sentences:
    vocab.add_vocabs(tokenizer.tokenize(sentence))
vocab.build_vocab(vocab_size=10)
assert 10 == len(vocab)
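To make the effect of min_freq and vocab_size in the example above more concrete, here is a minimal, standard-library-only sketch of how a frequency-based vocabulary could apply those two limits. It is not meguru_tokenizer's implementation: the class name SketchVocab, the special tokens "<pad>" and "<unk>", their ids, and the rule that special tokens count toward vocab_size are all assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, List, Optional

# Hypothetical reserved tokens; the library's real special tokens and ids may differ.
SPECIAL_TOKENS = ["<pad>", "<unk>"]


class SketchVocab:
    """Illustrative stand-in for a vocabulary, not meguru_tokenizer's Vocab."""

    def __init__(self) -> None:
        self.counter: Counter = Counter()
        self.i2w: Dict[int, str] = {}
        self.w2i: Dict[str, int] = {}

    def add_vocabs(self, tokens: Iterable[str]) -> None:
        # Accumulate token frequencies across all sentences seen so far.
        self.counter.update(tokens)

    def build_vocab(self, min_freq: int = 1, vocab_size: Optional[int] = None) -> None:
        # Keep tokens seen at least min_freq times, most frequent first.
        words = [w for w, c in self.counter.most_common() if c >= min_freq]
        if vocab_size is not None:
            # Assumption: special tokens count toward the size limit.
            words = words[: max(vocab_size - len(SPECIAL_TOKENS), 0)]
        self.i2w = dict(enumerate(SPECIAL_TOKENS + words))
        self.w2i = {w: i for i, w in self.i2w.items()}

    def __len__(self) -> int:
        return len(self.i2w)

    def encode(self, tokens: Iterable[str]) -> List[int]:
        # Tokens that were filtered out fall back to the unknown-token id.
        unk = self.w2i["<unk>"]
        return [self.w2i.get(t, unk) for t in tokens]

    def decode(self, ids: Iterable[int]) -> str:
        return " ".join(self.i2w[i] for i in ids)


def build(sentences: List[str], **limits) -> SketchVocab:
    # Whitespace splitting stands in for a real tokenizer here.
    vocab = SketchVocab()
    for sentence in sentences:
        vocab.add_vocabs(sentence.split())
    vocab.build_vocab(**limits)
    return vocab


corpus = ["it is good", "it is awesome", "tensorflow is awesome"]
full = build(corpus)                   # no limits
frequent = build(corpus, min_freq=2)   # drops "good" and "tensorflow"
small = build(corpus, vocab_size=4)    # 2 special tokens + 2 most frequent words

print(len(full), len(frequent), len(small))
print(frequent.encode("tensorflow is awesome".split()))
```

With a limit in place, any token that did not make it into the vocabulary is encoded as the unknown-token id, so decoding an encoded sentence does not necessarily reproduce the original text.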