
VGram tokenization


pyvgram

🍺 Python implementation of VGram tokenization

VGram is a tokenizer construction algorithm that optimizes the code length of the text. It can be used to tokenize text like BPE (Sennrich et al., 2016).

Read more in our CIKM'18 paper Construction of Efficient V-Gram Dictionary for Sequential Data Analysis.

Install

pip install pyvgram

Examples

1. Quickstart

Let's train a tokenizer with vocabulary size 10000 on the contents of file.txt and encode a string.

from vgram import VGramTokenizer

tokenizer = VGramTokenizer(10000)
tokenizer.train(["file.txt"])
ids = tokenizer.encode("hello world")
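
encode returns a list of integer token ids. As a rough illustration of what those ids stand for (this sketch assumes each id indexes directly into the list returned by get_vocab(), which is an assumption rather than a documented guarantee), you can map them back to the learned pieces:

# Hedged sketch: inspect the learned pieces behind the ids.
# Assumption: ids index into the vocabulary list returned by get_vocab().
vocab = tokenizer.get_vocab()
pieces = [vocab[i] for i in ids]
print(pieces)  # e.g. ['hello', ' world'] if those v-grams were learned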

2. Save and load

from vgram import VGramTokenizer

tokenizer = VGramTokenizer(10000)
tokenizer.train(["file.txt"])
ids1 = tokenizer.encode("hello world")

tokenizer.save_pretrained("vgram.tokenizer")
loaded_tokenizer = VGramTokenizer.from_pretrained("vgram.tokenizer")
ids2 = loaded_tokenizer.encode("hello world")

assert tokenizer == loaded_tokenizer
assert ids1 == ids2

3. Learn from raw text

You can learn a tokenizer from raw text using the fit method.

from vgram import VGramTokenizer

tokenizer = VGramTokenizer(10000)
tokenizer.fit(" ".join(["hello world"] * 1000))
ids = tokenizer.encode("hello world")

You can also specify the number of iterations (iters) if you want to train longer via bootstrap sampling.

from vgram import VGramTokenizer

tokenizer = VGramTokenizer(10000)
tokenizer.fit("hello world", iters=1000))
ids = tokenizer.encode("hello world")

4. Learn multiple times

You can learn a tokenizer on one dataset and then fine-tune it on another by calling the fit or train methods multiple times.

from vgram import VGramTokenizer, SplitLevel

tokenizer = VGramTokenizer(200, split_level=SplitLevel.NONE)
tokenizer.fit(["hello", "hello world"], iters=10000))
assert len(tokenizer.encode("hello world")) == 1
assert len(tokenizer.encode("pip install pyvgram")) > 1

tokenizer.fit("pip install pyvgram", iters=10000))
assert len(tokenizer.encode("hello world")) > 1
assert len(tokenizer.encode("pip install pyvgram")) == 1

After fine-tuning, tokenizer.encode("hello world") encodes the string symbol by symbol into ['h', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd'],
because it is not a meaningful sequence in the fine-tuning dataset.
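
A minimal check of this behaviour (assuming the character-level fallback described above, with one id per symbol):

# After the second fit, "hello world" no longer matches a learned v-gram,
# so it falls back to single-character tokens - one id per symbol.
assert len(tokenizer.encode("hello world")) == len("hello world")  # 11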

5. Vocabulary

from vgram import VGramTokenizer, SplitLevel

tokenizer = VGramTokenizer(10000, split_level=SplitLevel.LINE)
tokenizer.fit(" ".join(["hello world"] * 1000))
print("Vocabulary:", tokenizer.get_vocab())
# Vocabulary: ['h', 'hello world', 'e', 'l', 'o', ' ', 'w', 'r', 'd', '\n']
print("Vocab size:", tokenizer.vocab_size())
# Vocab size: 10

6. Learn with another split-level

Most BPE-like tokenization libraries split a single word into pieces. pyvgram supports different levels of splitting, so you can split a whole line into pieces that consist of several words if they are frequent enough. This is useful for analyzing the vocabulary to find patterns in the data.

The default split level is WORD, but you can also use LINE and NONE.

from vgram import VGramTokenizer, SplitLevel

text = "\n".join(["hello world"] * 10000)

tokenizer = VGramTokenizer(200, split_level=SplitLevel.WORD)
tokenizer.fit(text)
print(tokenizer.get_vocab())
# ['h', 'hello', 'e', 'l', 'o', ' ', ' world', 'w', 'r', 'd', '\n']

tokenizer = VGramTokenizer(200, split_level=SplitLevel.LINE)
tokenizer.fit(text)
print(tokenizer.get_vocab())
# ['h', 'hello world', 'e', 'l', 'o', ' ', 'w', 'r', 'd', '\n']

SplitLevel.NONE does not split the text and handles it as a single sequence. Passing only a few texts is a bad idea in this case, but if you have many pre-split texts, it is a good choice.

from vgram import VGramTokenizer, SplitLevel

texts = ["hello world"] * 10000

tokenizer = VGramTokenizer(200, split_level=SplitLevel.NONE)
tokenizer.fit(texts)
print(tokenizer.get_vocab())
# ['h', 'hello world', 'e', 'l', 'o', ' ', 'w', 'r', 'd']

