A unified tokenization tool for Images, Chinese and English.

ICE Tokenizer

  • Token IDs in [0, 20000) are image tokens.
  • Token IDs in [20000, 20100) are common tokens, mainly punctuation. E.g., icetk[20000] == '<unk>', icetk[20003] == '<pad>', icetk[20006] == ','.
  • Token IDs in [20100, 83823) are English tokens.
  • Token IDs in [83823, 145653) are Chinese tokens.
  • Token IDs in [145653, 150000) are rare tokens. E.g., icetk[145803] == 'α'.
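
The ranges above can be turned into a small lookup, which is handy when inspecting mixed image/text sequences. The following is a minimal sketch that does not use icetk itself; the function name `token_category` and the category labels are hypothetical and chosen only for illustration, while the boundaries are copied from the list above.

```python
from bisect import bisect_right

# Range boundaries copied from the list above (half-open intervals).
BOUNDARIES = [0, 20000, 20100, 83823, 145653, 150000]
# Labels are illustrative names for this sketch, not part of icetk.
CATEGORIES = ["image", "common", "english", "chinese", "rare"]


def token_category(token_id: int) -> str:
    """Return the category label of the range that contains token_id."""
    if not 0 <= token_id < BOUNDARIES[-1]:
        raise ValueError(f"token id {token_id} is outside [0, {BOUNDARIES[-1]})")
    # bisect_right counts the boundaries <= token_id, so subtracting 1
    # gives the index of the half-open interval containing token_id.
    return CATEGORIES[bisect_right(BOUNDARIES, token_id) - 1]


if __name__ == "__main__":
    for tid in (12738, 20006, 39316, 94874, 145803):
        print(tid, token_category(tid))
```

The IDs in the usage loop are taken from the examples on this page (an image token, ',', '▁Hello', a Chinese token, and 'α').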

You can install the package via pip:

pip install icetk

Tokenization

from icetk import icetk
tokens = icetk.tokenize('Hello World! I am icetk.')
# tokens == ['▁Hello', '▁World', '!', '▁I', '▁am', '▁ice', 'tk', '.']
ids = icetk.encode('Hello World! I am icetk.')
# ids == [39316, 20932, 20035, 20115, 20344, 22881, 35955, 20007]
en = icetk.decode(ids)
# en == 'Hello World! I am icetk.' # decoding always recovers the input exactly (as long as no <unk> token was produced)

ids = icetk.encode('你好世界!这里是 icetk。')
# ids == [20005, 94874, 84097, 20035, 94947, 22881, 35955, 83823]

ids = icetk.encode(image_path='test.jpeg', image_size=256, compress_rate=8)
# ids == tensor([[12738, 12430, 10398,  ...,  7236, 12844, 12386]], device='cuda:0')
# ids.shape == torch.Size([1, 1024])
img = icetk.decode(image_ids=ids, compress_rate=8)
# img.shape == torch.Size([1, 3, 256, 256])
from torchvision.utils import save_image
save_image(img, 'recover.jpg')

# add special tokens
icetk.add_special_tokens(['<start_of_image>', '<start_of_english>', '<start_of_chinese>'])

# line breaks: '\n' is kept only with ignore_linebreak=False; by default it becomes a space
icetk.decode(icetk.encode('abc\nhi', ignore_linebreak=False))
# 'abc\nhi'
icetk.decode(icetk.encode('abc\nhi'))
# 'abc hi'

# discourage rare composed tokens
icetk.tokenize('//--------')
# ['▁//', '--------']
icetk.text_tokenizer.discourage_ids(range(125653,130000)) # or use icetk.text_tokenizer.discourage_tokens
icetk.tokenize('//--------')
# ['▁//', '-', '-', '-', '-', '-', '-', '-', '-']
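
In the image example above, image_size=256 with compress_rate=8 yields ids of shape [1, 1024]. That count is consistent with one token per 8×8 patch on a square grid, i.e. (256 / 8)² = 1024. The sketch below only reproduces this arithmetic; the square-grid relationship is an assumption inferred from the shapes shown, not something the package documentation states, and `image_token_count` is a hypothetical helper rather than an icetk function.

```python
def image_token_count(image_size: int, compress_rate: int) -> int:
    """Estimate the number of image tokens for a square image.

    Assumption (inferred from the example shapes, not confirmed by icetk):
    each token covers a compress_rate x compress_rate patch.
    """
    if image_size <= 0 or compress_rate <= 0:
        raise ValueError("image_size and compress_rate must be positive")
    if image_size % compress_rate != 0:
        raise ValueError("image_size must be divisible by compress_rate")
    side = image_size // compress_rate  # tokens along one edge of the grid
    return side * side


if __name__ == "__main__":
    # Matches the ids.shape == torch.Size([1, 1024]) shown in the example.
    print(image_token_count(256, 8))
```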
