VGram tokenization
pyvgram
🍺 Python implementation of VGram tokenization
VGram is a tokenizer construction algorithm that optimizes the code length of the text. It can be used to tokenize text, much like BPE (Sennrich et al., 2016).
Read more in our CIKM'18 paper Construction of Efficient V-Gram Dictionary for Sequential Data Analysis.
Install
pip install pyvgram
Examples
1. Quickstart
Let's train a tokenizer with a vocabulary size of 10000 on the contents of file.txt and encode a string.
```python
from vgram import VGramTokenizer

tokenizer = VGramTokenizer(10000)
tokenizer.train("file.txt")
ids = tokenizer.encode("hello world")
```
The `train` method is used for training from a file name or a list of file names. To learn from a string, use the `fit` method, as in the sketch below.
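For example, a minimal sketch of the same quickstart driven by `fit` on an in-memory string instead of a file (it uses only the constructor, `fit`, and `encode` calls shown on this page):

```python
from vgram import VGramTokenizer

# fit learns the dictionary directly from a string instead of a file
tokenizer = VGramTokenizer(10000)
tokenizer.fit("hello world " * 1000)
ids = tokenizer.encode("hello world")
```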
2. Save and load
```python
from vgram import VGramTokenizer

tokenizer = VGramTokenizer(10000)
tokenizer.train(["file1.txt", "file2.txt"])
ids1 = tokenizer.encode("hello world")

tokenizer.save_pretrained("vgram.tokenizer")
loaded_tokenizer = VGramTokenizer.from_pretrained("vgram.tokenizer")
ids2 = loaded_tokenizer.encode("hello world")

assert tokenizer == loaded_tokenizer
assert ids1 == ids2
```
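If you also need the text back, a round trip might look like the following. Note this is a sketch under an assumption: a `decode` method is not shown elsewhere on this page.

```python
# Hypothetical round trip, assuming the tokenizer exposes a decode method
ids = loaded_tokenizer.encode("hello world")
text = loaded_tokenizer.decode(ids)
assert text == "hello world"
```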
3. Learn from raw text
You can learn a tokenizer from raw text with the `fit` method by passing a string or a list of strings.
```python
from vgram import VGramTokenizer

tokenizer = VGramTokenizer(10000)
tokenizer.fit(" ".join(["hello world"] * 1000))
ids = tokenizer.encode("hello world")
```
You can also specify the number of `iters` if you want to train longer. When a list of strings is passed, bootstrap sampling is used (see the sketch after the next example).
```python
from vgram import VGramTokenizer

tokenizer = VGramTokenizer(10000)
tokenizer.fit("hello world", iters=1000)
ids = tokenizer.encode("hello world")
```
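For the list case, a minimal sketch (the toy corpus here is made up; with `iters` set, `fit` resamples the given strings by bootstrap, as described above):

```python
from vgram import VGramTokenizer

# a toy corpus; fit bootstrap-samples from this list for `iters` rounds
texts = ["hello world", "hello there", "world peace"]
tokenizer = VGramTokenizer(10000)
tokenizer.fit(texts, iters=1000)
ids = tokenizer.encode("hello world")
```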
4. Learn multiple times
You can learn a tokenizer on one dataset and then fine-tune it on another by calling the `fit` or `train` methods multiple times.
```python
from vgram import VGramTokenizer, SplitLevel

tokenizer = VGramTokenizer(200, split_level=SplitLevel.NONE)
tokenizer.fit(["hello", "hello world"], iters=10000)
assert len(tokenizer.encode("hello world")) == 1
assert len(tokenizer.encode("pip install pyvgram")) > 1

tokenizer.fit("pip install pyvgram", iters=10000)
assert len(tokenizer.encode("hello world")) > 1
assert len(tokenizer.encode("pip install pyvgram")) == 1
```
After fine-tuning, `tokenizer.encode("hello world")` encodes symbol by symbol into ['h', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd'], because "hello world" is not a meaningful sequence in the fine-tuning dataset.
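A quick way to check this, continuing the example above: when everything falls back to single symbols, the number of ids equals the number of characters.

```python
# After fine-tuning, "hello world" should be coded one id per character
ids = tokenizer.encode("hello world")
assert len(ids) == len("hello world")  # 11 single-symbol tokens
```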
5. Vocabulary
```python
from vgram import VGramTokenizer, SplitLevel

tokenizer = VGramTokenizer(10000, split_level=SplitLevel.LINE)
tokenizer.fit(" ".join(["hello world"] * 1000))

print("Vocabulary:", tokenizer.get_vocab())
# Vocabulary: ['h', 'hello world', 'e', 'l', 'o', ' ', 'w', 'r', 'd', '\n']
print("Vocab size:", tokenizer.vocab_size())
# Vocab size: 10
```
6. Learn with another split-level
Most BPE-like tokenization libraries split a single word into pieces. pyvgram supports different levels of splitting, so you can split a whole line into pieces that consist of several words, if they are frequent enough. This is useful for analyzing the vocabulary to find patterns in the data. The default split level is `WORD`, but you can also use `LINE` and `NONE`.
```python
from vgram import VGramTokenizer, SplitLevel

text = "\n".join(["hello world"] * 10000)

tokenizer = VGramTokenizer(200, split_level=SplitLevel.WORD)
tokenizer.fit(text)
print(tokenizer.get_vocab())
# ['h', 'hello', 'e', 'l', 'o', ' ', ' world', 'w', 'r', 'd', '\n']

tokenizer = VGramTokenizer(200, split_level=SplitLevel.LINE)
tokenizer.fit(text)
print(tokenizer.get_vocab())
# ['h', 'hello world', 'e', 'l', 'o', ' ', 'w', 'r', 'd', '\n']
```
`SplitLevel.NONE` does not split the text and handles it as a single sequence. It is a bad idea to pass very few texts in this case, but if you have many pre-split texts, it is a good choice.
```python
from vgram import VGramTokenizer, SplitLevel

texts = ["hello world"] * 10000
tokenizer = VGramTokenizer(200, split_level=SplitLevel.NONE)
tokenizer.fit(texts)
print(tokenizer.get_vocab())
# ['h', 'hello world', 'e', 'l', 'o', ' ', 'w', 'r', 'd']
```
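To see how the split level changes the granularity of encoding, a small comparison sketch (it reuses only the constructor, `fit`, and `encode` calls shown above; the exact token counts depend on the learned dictionary):

```python
from vgram import VGramTokenizer, SplitLevel

text = "\n".join(["hello world"] * 10000)

# The same corpus, tokenized under each split level
for level in (SplitLevel.WORD, SplitLevel.LINE, SplitLevel.NONE):
    tokenizer = VGramTokenizer(200, split_level=level)
    tokenizer.fit(text)
    print(level, len(tokenizer.encode("hello world")))
```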