
A simple tokenizer for TensorFlow 2.x and PyTorch


meguru tokenizer

installation and initialization

pip install meguru_tokenizer
sudachipy link -t full
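
The sudachipy link -t full step points SudachiPy at the full dictionary, which must be installed separately (e.g. pip install sudachidict_full). A quick, hedged way to confirm the Sudachi setup, using SudachiPy directly rather than this package:

from sudachipy import dictionary, tokenizer

# build a tokenizer from the linked dictionary and split a sentence into surface forms
tokenizer_obj = dictionary.Dictionary().create()
mode = tokenizer.Tokenizer.SplitMode.C
print([m.surface() for m in tokenizer_obj.tokenize("国家公務員はとても忙しい", mode)])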

Usage Overview

  1. Preprocess with the tokenizer of your choice, e.g. sentencepiece preprocessing or sudachi preprocessing (see the sketch after this list)
  2. Tokenize in your code using that tokenizer
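
For example, the sentencepiece preprocessing step can be done with the sentencepiece library itself before handing the model to this package; corpus.txt, the sp model prefix, and the vocabulary size below are placeholder values (a minimal sketch, not meguru_tokenizer API):

import sentencepiece as spm

# train a sentencepiece model on a raw corpus (one sentence per line)
spm.SentencePieceTrainer.train(input="corpus.txt", model_prefix="sp", vocab_size=8000)

# load the trained model and split a sentence into subword pieces
sp = spm.SentencePieceProcessor(model_file="sp.model")
print(sp.encode("Tensorflow is awesome!", out_type=str))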

Real-World Example

import pprint
from pathlib import Path

from meguru_tokenizer.vocab import Vocab  # Vocab class; adjust the module path if your version differs
from meguru_tokenizer.whitespace_tokenizer import WhitespaceTokenizer

sentences = [
    "Hello, I don't know how to use it?",
    "Tensorflow is awesome!",
    "it is good framework.",
]

# define the tokenizer and vocabulary
tokenizer = WhitespaceTokenizer(lower=True)
vocab = Vocab()

# build the vocabulary
for sentence in sentences:
    vocab.add_vocabs(tokenizer.tokenize(sentence))
vocab.build_vocab()

# set the vocabulary on the tokenizer to enable encoding
tokenizer.vocab = vocab

# save the vocabulary to a file
vocab.dump_vocab(Path("vocab.txt"))
print("vocabs:")
pprint.pprint(vocab.i2w)

# tokenize
print("tokenized sentence")
pprint.pprint(tokenizer.tokenize_list(sentences))

# [['hello', ',', 'i', 'do', "n't", 'know', 'how', 'to', 'use', 'it', '?'],
#  ['tensorflow', 'is', 'awesome', '!'],
#  ['it', 'is', 'good', 'framework', '.']]

# encode
print("encoded sentence")
encoded = [tokenizer.encode(sentence) for sentence in sentences]
pprint.pprint(encoded)

# [[7, 8, 9, 10, 11, 12, 13, 14, 15, 5, 16], [17, 6, 18, 19], [5, 6, 20, 21, 22]]

print("decoded sentence")
pprint.pprint([tokenizer.decode(tokens) for tokens in encoded])
# ["hello , i do n't know how to use it ?",
#  'tensorflow is awesome !',
#  'it is good framework .']

vocab_size = len(vocab)

# restore the vocabulary from the dumped file
print("reload from dump file")
vocab = Vocab()
vocab.load_vocab(Path("vocab.txt"))
assert vocab_size == len(vocab)

tokenizer = WhitespaceTokenizer(vocab=vocab)
pprint.pprint([tokenizer.encode(sentence) for sentence in sentences])

# [[7, 8, 9, 10, 11, 12, 13, 14, 15, 5, 16], [17, 6, 18, 19], [5, 6, 20, 21, 22]]

# vocabulary with a minimum frequency limit
vocab = Vocab()
for sentence in sentences:
    vocab.add_vocabs(tokenizer.tokenize(sentence))
vocab.build_vocab(min_freq=2)
assert vocab_size != len(vocab)

# vocabulary with a maximum vocabulary size
vocab = Vocab()
for sentence in sentences:
    vocab.add_vocabs(tokenizer.tokenize(sentence))
vocab.build_vocab(vocab_size=10)
assert 10 == len(vocab)
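
The encoded ID lists can be fed straight to a deep-learning framework. A minimal sketch of batching them for PyTorch, assuming the tokenizer and sentences from the example above and a padding index of 0 (check the Vocab's special tokens for the real pad id):

import torch

encoded = [tokenizer.encode(sentence) for sentence in sentences]
max_len = max(len(ids) for ids in encoded)
pad_id = 0  # assumed padding index

# right-pad every sequence to the same length and stack into a (batch, max_len) tensor
batch = torch.tensor([ids + [pad_id] * (max_len - len(ids)) for ids in encoded])
print(batch.shape)  # torch.Size([3, 11]) for the example sentences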
