VGram tokenization
pyvgram
🍺 Python implementation of VGram tokenization
VGram is a tokenizer construction algorithm that optimizes the code length of the text. It can be used to tokenize text, much like BPE (Sennrich et al., 2016).
Read more in our CIKM'18 paper Construction of Efficient V-Gram Dictionary for Sequential Data Analysis.
Install
pip install pyvgram
Examples
1. Quickstart
Let's train a tokenizer with a vocabulary size of 10000 on the contents of file.txt and encode a string.
```python
from vgram import VGramTokenizer

tokenizer = VGramTokenizer(10000)
tokenizer.train("file.txt")
ids = tokenizer.encode("hello world")
```
The `train` method is used for training from a file name or a list of file names. To learn from a string, use the `fit` method, as in the sketch below.
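For example, a minimal sketch of the same quickstart driven by `fit` on an in-memory string instead of a file (it uses only the constructor, `fit`, and `encode` calls shown on this page):

```python
from vgram import VGramTokenizer

# fit learns the dictionary directly from a string instead of a file
tokenizer = VGramTokenizer(10000)
tokenizer.fit("hello world " * 1000)
ids = tokenizer.encode("hello world")
```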
2. Save and load
```python
from vgram import VGramTokenizer

tokenizer = VGramTokenizer(10000)
tokenizer.train(["file1.txt", "file2.txt"])
ids1 = tokenizer.encode("hello world")

tokenizer.save_pretrained("vgram.tokenizer")
loaded_tokenizer = VGramTokenizer.from_pretrained("vgram.tokenizer")
ids2 = loaded_tokenizer.encode("hello world")

assert tokenizer == loaded_tokenizer
assert ids1 == ids2
```
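If you also need the text back, a round trip might look like the following. Note this is a sketch under an assumption: a `decode` method is not shown elsewhere on this page.

```python
# Hypothetical round trip, assuming the tokenizer exposes a decode method
ids = loaded_tokenizer.encode("hello world")
text = loaded_tokenizer.decode(ids)
assert text == "hello world"
```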
3. Learn from raw text
You can learn a tokenizer from raw text with the `fit` method by passing a string or a list of strings.
```python
from vgram import VGramTokenizer

tokenizer = VGramTokenizer(10000)
tokenizer.fit(" ".join(["hello world"] * 1000))
ids = tokenizer.encode("hello world")
```
You can also specify the number of `iters` if you want to train longer. When a list of strings is passed, bootstrap sampling is used (see the sketch after the next example).
```python
from vgram import VGramTokenizer

tokenizer = VGramTokenizer(10000)
tokenizer.fit("hello world", iters=1000)
ids = tokenizer.encode("hello world")
```
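For the list case, a minimal sketch (the toy corpus here is made up; with `iters` set, `fit` resamples the given strings by bootstrap, as described above):

```python
from vgram import VGramTokenizer

# a toy corpus; fit bootstrap-samples from this list for `iters` rounds
texts = ["hello world", "hello there", "world peace"]
tokenizer = VGramTokenizer(10000)
tokenizer.fit(texts, iters=1000)
ids = tokenizer.encode("hello world")
```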
4. Learn multiple times
You can learn a tokenizer on one dataset and then fine-tune it on another by calling the `fit` or `train` methods multiple times.
```python
from vgram import VGramTokenizer, SplitLevel

tokenizer = VGramTokenizer(200, split_level=SplitLevel.NONE)
tokenizer.fit(["hello", "hello world"], iters=10000)
assert len(tokenizer.encode("hello world")) == 1
assert len(tokenizer.encode("pip install pyvgram")) > 1

tokenizer.fit("pip install pyvgram", iters=10000)
assert len(tokenizer.encode("hello world")) > 1
assert len(tokenizer.encode("pip install pyvgram")) == 1
```
After fine-tuning, `tokenizer.encode("hello world")` encodes symbol by symbol into ['h', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd'], because "hello world" is not a meaningful sequence in the fine-tuning dataset.
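A quick way to check this, continuing the example above: when everything falls back to single symbols, the number of ids equals the number of characters.

```python
# After fine-tuning, "hello world" should be coded one id per character
ids = tokenizer.encode("hello world")
assert len(ids) == len("hello world")  # 11 single-symbol tokens
```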
5. Vocabulary
```python
from vgram import VGramTokenizer, SplitLevel

tokenizer = VGramTokenizer(10000, split_level=SplitLevel.LINE)
tokenizer.fit(" ".join(["hello world"] * 1000))

print("Vocabulary:", tokenizer.get_vocab())
# Vocabulary: ['h', 'hello world', 'e', 'l', 'o', ' ', 'w', 'r', 'd', '\n']
print("Vocab size:", tokenizer.vocab_size())
# Vocab size: 10
```
6. Learn with another split-level
Most BPE-like tokenization libraries split a single word into pieces. pyvgram supports different levels of splitting, so you can split a whole line into pieces that consist of several words, if they are frequent enough. This is useful for analyzing the vocabulary to find patterns in the data. The default split level is `WORD`, but you can also use `LINE` and `NONE`.
```python
from vgram import VGramTokenizer, SplitLevel

text = "\n".join(["hello world"] * 10000)

tokenizer = VGramTokenizer(200, split_level=SplitLevel.WORD)
tokenizer.fit(text)
print(tokenizer.get_vocab())
# ['h', 'hello', 'e', 'l', 'o', ' ', ' world', 'w', 'r', 'd', '\n']

tokenizer = VGramTokenizer(200, split_level=SplitLevel.LINE)
tokenizer.fit(text)
print(tokenizer.get_vocab())
# ['h', 'hello world', 'e', 'l', 'o', ' ', 'w', 'r', 'd', '\n']
```
`SplitLevel.NONE` does not split the text and handles it as a single sequence. It is a bad idea to pass very few texts in this case, but if you have many pre-split texts, it is a good choice.
```python
from vgram import VGramTokenizer, SplitLevel

texts = ["hello world"] * 10000
tokenizer = VGramTokenizer(200, split_level=SplitLevel.NONE)
tokenizer.fit(texts)
print(tokenizer.get_vocab())
# ['h', 'hello world', 'e', 'l', 'o', ' ', 'w', 'r', 'd']
```
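To see how the split level changes the granularity of encoding, a small comparison sketch (it reuses only the constructor, `fit`, and `encode` calls shown above; the exact token counts depend on the learned dictionary):

```python
from vgram import VGramTokenizer, SplitLevel

text = "\n".join(["hello world"] * 10000)

# The same corpus, tokenized under each split level
for level in (SplitLevel.WORD, SplitLevel.LINE, SplitLevel.NONE):
    tokenizer = VGramTokenizer(200, split_level=level)
    tokenizer.fit(text)
    print(level, len(tokenizer.encode("hello world")))
```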