A unified tokenization tool for Images, Chinese and English.
ICE Tokenizer
- Token ids [0, 20000) are image tokens.
- Token ids [20000, 20100) are common tokens, mainly punctuation. E.g., icetk[20000] == '<unk>', icetk[20003] == '<pad>', icetk[20006] == ','.
- Token ids [20100, 83823) are English tokens.
- Token ids [83823, 145653) are Chinese tokens.
- Token ids [145653, 150000) are rare tokens. E.g., icetk[145803] == 'α'.
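The id layout above can be turned into a small lookup helper. The following is a minimal sketch that does not use icetk itself: the function name `token_category` and the category labels are hypothetical, chosen for illustration, and only the range boundaries come from the list above.

```python
from bisect import bisect_right

# Lower bounds of each id range, plus the exclusive upper limit (150000),
# copied from the list above. The labels are names picked for this sketch,
# not identifiers defined by icetk.
_BOUNDS = [0, 20000, 20100, 83823, 145653, 150000]
_LABELS = ["image", "common", "english", "chinese", "rare"]


def token_category(token_id: int) -> str:
    """Return the label of the half-open id range that contains token_id."""
    if not 0 <= token_id < _BOUNDS[-1]:
        raise ValueError(f"token id {token_id} is outside [0, {_BOUNDS[-1]})")
    # bisect_right counts how many boundaries are <= token_id,
    # so subtracting 1 gives the index of the enclosing range.
    return _LABELS[bisect_right(_BOUNDS, token_id) - 1]


if __name__ == "__main__":
    for tid in (12738, 20006, 39316, 94874, 145803):
        print(tid, token_category(tid))
```

A helper like this can be handy for sanity-checking mixed sequences, for instance confirming that an encoded image contains only ids below 20000.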
You can install the package via pip:
pip install icetk
Tokenization
from icetk import icetk
tokens = icetk.tokenize('Hello World! I am icetk.')
# tokens == ['▁Hello', '▁World', '!', '▁I', '▁am', '▁ice', 'tk', '.']
ids = icetk.encode('Hello World! I am icetk.')
# ids == [39316, 20932, 20035, 20115, 20344, 22881, 35955, 20007]
en = icetk.decode(ids)
# en == 'Hello World! I am icetk.' # decoding recovers the input exactly, provided it contains no <unk>
ids = icetk.encode('你好世界!这里是 icetk。')
# ids == [20005, 94874, 84097, 20035, 94947, 22881, 35955, 83823]
ids = icetk.encode(image_path='test.jpeg', image_size=256, compress_rate=8)
# ids == tensor([[12738, 12430, 10398, ..., 7236, 12844, 12386]], device='cuda:0')
# ids.shape == torch.Size([1, 1024])
img = icetk.decode(image_ids=ids, compress_rate=8)
# img.shape == torch.Size([1, 3, 256, 256])
from torchvision.utils import save_image
save_image(img, 'recover.jpg')
# add special tokens
icetk.add_special_tokens(['<start_of_image>', '<start_of_english>', '<start_of_chinese>'])
# line breaks: '\n' is kept with ignore_linebreak=False; by default it is replaced by a space
icetk.decode(icetk.encode('abc\nhi', ignore_linebreak=False))
# 'abc\nhi'
icetk.decode(icetk.encode('abc\nhi'))
# 'abc hi'
# discourage rare composed tokens
icetk.tokenize('//--------')
# ['▁//', '--------']
icetk.text_tokenizer.discourage_ids(range(125653,130000)) # or use icetk.text_tokenizer.discourage_tokens
icetk.tokenize('//--------')
# ['▁//', '-', '-', '-', '-', '-', '-', '-', '-']
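In the image example above, `image_size=256` with `compress_rate=8` yields 1024 ids, which matches a 32 × 32 grid of tokens. The sketch below captures that arithmetic under the assumption that the tokens always form a square grid of side `image_size / compress_rate`; this is inferred from the single shape shown here, and `image_token_count` is a hypothetical helper, not part of icetk. Which `compress_rate` values icetk actually accepts is not stated in this document.

```python
def image_token_count(image_size: int, compress_rate: int) -> int:
    """Estimate how many token ids a square image produces.

    Assumption (inferred from the example above, not from icetk's code):
    each spatial side shrinks by compress_rate, and every cell of the
    resulting grid becomes one token.
    """
    if image_size <= 0 or compress_rate <= 0:
        raise ValueError("image_size and compress_rate must be positive")
    if image_size % compress_rate != 0:
        raise ValueError("image_size must be divisible by compress_rate")
    side = image_size // compress_rate  # tokens along one side of the grid
    return side * side


if __name__ == "__main__":
    # Matches ids.shape == torch.Size([1, 1024]) from the example above.
    print(image_token_count(256, 8))
```

Such an estimate is useful when budgeting sequence length for a model that mixes image and text tokens.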