
A simple tokenizer for TensorFlow 2.x and PyTorch

Project description

meguru tokenizer

installation and initialization

pip install meguru_tokenizer
sudachipy link -t full
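
The link step registers a Sudachi system dictionary, so the matching dictionary package must already be installed. Assuming the full dictionary is the one wanted (the project does not state this explicitly), that means:

pip install sudachidict_full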

Usage Overview

  1. Preprocess your corpus with the chosen tokenizer's preprocessing step, e.g. the SentencePiece preprocess or the Sudachi preprocess (a sketch of this step follows this list).
  2. Tokenize inside your code with the corresponding Tokenizer class.
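
As an illustration of step 1, here is a minimal SentencePiece preprocessing sketch. It uses the standalone sentencepiece library rather than any helper from meguru_tokenizer itself; the corpus file name, model prefix, and vocabulary size are placeholder assumptions.

import sentencepiece as spm

# train a subword model on a raw corpus (one sentence per line);
# "corpus.txt", "sp_model", and vocab_size=8000 are illustrative values
spm.SentencePieceTrainer.train(
    input="corpus.txt", model_prefix="sp_model", vocab_size=8000
)

# load the trained model and encode a sentence into subword ids
sp = spm.SentencePieceProcessor(model_file="sp_model.model")
print(sp.encode("Tensorflow is awesome!", out_type=int))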

Real-World Example

from pathlib import Path
import pprint

from meguru_tokenizer.whitespace_tokenizer import WhitespaceTokenizer
# assumed import path for the Vocab class used below; adjust if it lives elsewhere
from meguru_tokenizer.vocab import Vocab

sentences = [
    "Hello, I don't know how to use it?",
    "Tensorflow is awesome!",
    "it is good framework.",
]

# define tokenizer and vocabulary
tokenizer = WhitespaceTokenizer(lower=True)
vocab = Vocab()

# build vocabulary
for sentence in sentences:
    vocab.add_vocabs(tokenizer.tokenize(sentence))
vocab.build_vocab()

# set the vocabulary on the tokenizer to enable encoding
tokenizer.vocab = vocab

# save vocabulary information
vocab.dump_vocab(Path("vocab.txt"))
print("vocabs:")
pprint.pprint(vocab.i2w)

# tokenize
print("tokenized sentence")
pprint.pprint(tokenizer.tokenize_list(sentences))

# [['hello', ',', 'i', 'do', "n't", 'know', 'how', 'to', 'use', 'it', '?'],
#  ['tensorflow', 'is', 'awesome', '!'],
#  ['it', 'is', 'good', 'framework', '.']]

# encode
print("encoded sentence")
encodes = [tokenizer.encode(sentence) for sentence in sentences]
pprint.pprint(encodes)

# [[7, 8, 9, 10, 11, 12, 13, 14, 15, 5, 16], [17, 6, 18, 19], [5, 6, 20, 21, 22]]

print("decoded sentence")
pprint.pprint([tokenizer.decode(tokens) for tokens in encodes])
# ["hello , i do n't know how to use it ?",
#  'tensorflow is awesome !',
#  'it is good framework .']

vocab_size = len(vocab)

# restore the vocabulary from the dumped file
print("reload from dump file")
vocab = Vocab()
vocab.load_vocab(Path("vocab.txt"))
assert vocab_size == len(vocab)

tokenizer = WhitespaceTokenizer(vocab=vocab)
pprint.pprint([tokenizer.encode(sentence) for sentence in sentences])

# [[7, 8, 9, 10, 11, 12, 13, 14, 15, 5, 16], [17, 6, 18, 19], [5, 6, 20, 21, 22]]

# vocabulary with a minimum frequency threshold
vocab = Vocab()
for sentence in sentences:
    vocab.add_vocabs(tokenizer.tokenize(sentence))
vocab.build_vocab(min_freq=2)
assert vocab_size != len(vocab)

# vocabulary with a maximum vocabulary size
vocab = Vocab()
for sentence in sentences:
    vocab.add_vocabs(tokenizer.tokenize(sentence))
vocab.build_vocab(vocab_size=10)
assert 10 == len(vocab)
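
Since the encoded sentences are plain Python lists of ids, they can be padded and batched for either framework. Below is a minimal sketch, assuming the encodes list from the example above and using only standard TensorFlow 2.x / PyTorch calls (nothing from meguru_tokenizer itself); the pad id 0 is an assumption.

import tensorflow as tf
import torch

# pad the variable-length id lists into a rectangular batch (pad id 0 assumed)
padded = tf.keras.preprocessing.sequence.pad_sequences(encodes, padding="post", value=0)

# TensorFlow 2.x: wrap the padded batch in a tf.data pipeline
tf_dataset = tf.data.Dataset.from_tensor_slices(padded).batch(2)

# PyTorch: the same padded array converts directly to a LongTensor
torch_batch = torch.as_tensor(padded, dtype=torch.long)
print(torch_batch.shape)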



Download files

Download the file for your platform.

Source Distribution

meguru_tokenizer-0.3.1.tar.gz (10.5 kB view details)

Uploaded Source

Built Distribution


meguru_tokenizer-0.3.1-py3-none-any.whl (14.1 kB view details)

Uploaded Python 3

File details

Details for the file meguru_tokenizer-0.3.1.tar.gz.

File metadata

  • Download URL: meguru_tokenizer-0.3.1.tar.gz
  • Upload date:
  • Size: 10.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.23.0 setuptools/41.2.0 requests-toolbelt/0.8.0 tqdm/4.46.0 CPython/3.7.7

File hashes

Hashes for meguru_tokenizer-0.3.1.tar.gz
  • SHA256: 5a2e8af6123f94ed6c59115327f0ea00803454d8b13c965efff9b12e4ba1bd19
  • MD5: 6b29afd7513de71f52940ef0dcfaffe3
  • BLAKE2b-256: 0d097aa91f918d057302aa9311b60630436acb834a96ca0adaa8c2687f1da797


File details

Details for the file meguru_tokenizer-0.3.1-py3-none-any.whl.

File metadata

  • Download URL: meguru_tokenizer-0.3.1-py3-none-any.whl
  • Upload date:
  • Size: 14.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.23.0 setuptools/41.2.0 requests-toolbelt/0.8.0 tqdm/4.46.0 CPython/3.7.7

File hashes

Hashes for meguru_tokenizer-0.3.1-py3-none-any.whl
  • SHA256: db94b3f78b6d04ab65b033d91a5fdb266e0c0dbe39fa5b9383cbe7524eeb2d40
  • MD5: a132a01711db50ac322cae6c9b2ba93d
  • BLAKE2b-256: 39be263bf90a36817a5148bc77424443949e7fb4b694ec281a3817a177772979

