Skip to main content

A unified tokenization tool for Images, Chinese and English.

Project description

ICE Tokenizer

  • Token id [0, 20000) are image tokens.
  • Token id [20000, 20100) are common tokens, mainly punctuations. E.g., icetk[20000] == '<unk>', icetk[20003] == '<pad>', icetk[20006] == ','.
  • Token id [20100, 83823) are English tokens.
  • Token id [83823, 145653) are Chinese tokens.
  • Token id [145653, 150000) are rare tokens. E.g., icetk[145803] == 'α'.

You can install the package via

pip install icetk

Tokenization

from icetk import icetk
tokens = icetk.tokenize('Hello World! I am icetk.')
# tokens == ['▁Hello', '▁World', '!', '▁I', '▁am', '▁ice', 'tk', '.']
ids = icetk.encode('Hello World! I am icetk.')
# ids == [39316, 20932, 20035, 20115, 20344, 22881, 35955, 20007]
en = icetk.decode(ids)
# en == 'Hello World! I am icetk.' # always perfectly recover (if without <unk>)

ids = icetk.encode('你好世界!这里是 icetk。')
# ids == [20005, 94874, 84097, 20035, 94947, 22881, 35955, 83823]

ids = icetk.encode(image_path='test.jpeg', image_size=256, compress_rate=8)
# ids == tensor([[12738, 12430, 10398,  ...,  7236, 12844, 12386]], device='cuda:0')
# ids.shape == torch.Size([1, 1024])
img = icetk.decode(image_ids=ids, compress_rate=8)
# img.shape == torch.Size([1, 3, 256, 256])
from torchvision.utils import save_image
save_image(img, 'recover.jpg')

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

icetk-0.0.3.tar.gz (14.6 kB view hashes)

Uploaded source

Built Distribution

icetk-0.0.3-py3-none-any.whl (15.7 kB view hashes)

Uploaded py3

Supported by

AWS AWS Cloud computing Datadog Datadog Monitoring Facebook / Instagram Facebook / Instagram PSF Sponsor Fastly Fastly CDN Google Google Object Storage and Download Analytics Huawei Huawei PSF Sponsor Microsoft Microsoft PSF Sponsor NVIDIA NVIDIA PSF Sponsor Pingdom Pingdom Monitoring Salesforce Salesforce PSF Sponsor Sentry Sentry Error logging StatusPage StatusPage Status page