AI21 Labs Tokenizer

A SentencePiece-based tokenizer for production use with AI21's models.


Installation

pip

pip install ai21-tokenizer

poetry

poetry add ai21-tokenizer

Usage

Tokenizer Creation

Jamba 1.5 Mini Tokenizer

from ai21_tokenizer import Tokenizer, PreTrainedTokenizers

tokenizer = Tokenizer.get_tokenizer(PreTrainedTokenizers.JAMBA_1_5_MINI_TOKENIZER)
# Your code here

Another way would be to use our Jamba 1.5 Mini tokenizer directly:

from ai21_tokenizer import Jamba1_5Tokenizer

model_path = "<Path to your vocabs file>"
tokenizer = Jamba1_5Tokenizer(model_path=model_path)
# Your code here

Async usage

from ai21_tokenizer import Tokenizer, PreTrainedTokenizers

tokenizer = await Tokenizer.get_async_tokenizer(PreTrainedTokenizers.JAMBA_1_5_MINI_TOKENIZER)
# Your code here

Jamba 1.5 Large Tokenizer

from ai21_tokenizer import Tokenizer, PreTrainedTokenizers

tokenizer = Tokenizer.get_tokenizer(PreTrainedTokenizers.JAMBA_1_5_LARGE_TOKENIZER)
# Your code here

Another way would be to use our Jamba 1.5 Large tokenizer directly:

from ai21_tokenizer import Jamba1_5Tokenizer

model_path = "<Path to your vocabs file>"
tokenizer = Jamba1_5Tokenizer(model_path=model_path)
# Your code here

Async usage

from ai21_tokenizer import Tokenizer, PreTrainedTokenizers

tokenizer = await Tokenizer.get_async_tokenizer(PreTrainedTokenizers.JAMBA_1_5_LARGE_TOKENIZER)
# Your code here

Jamba Instruct Tokenizer

from ai21_tokenizer import Tokenizer, PreTrainedTokenizers

tokenizer = Tokenizer.get_tokenizer(PreTrainedTokenizers.JAMBA_INSTRUCT_TOKENIZER)
# Your code here

Another way would be to use our Jamba tokenizer directly:

from ai21_tokenizer import JambaInstructTokenizer

model_path = "<Path to your vocabs file>"
tokenizer = JambaInstructTokenizer(model_path=model_path)
# Your code here

Async usage

from ai21_tokenizer import Tokenizer, PreTrainedTokenizers

tokenizer = await Tokenizer.get_async_tokenizer(PreTrainedTokenizers.JAMBA_INSTRUCT_TOKENIZER)
# Your code here

Another way would be to use the create class method of our async Jamba Instruct tokenizer:

from ai21_tokenizer import AsyncJambaInstructTokenizer

model_path = "<Path to your vocabs file>"
tokenizer = await AsyncJambaInstructTokenizer.create(model_path=model_path)
# Your code here

J2 Tokenizer

from ai21_tokenizer import Tokenizer

tokenizer = Tokenizer.get_tokenizer()
# Your code here

Another way would be to use our Jurassic tokenizer directly:

from ai21_tokenizer import JurassicTokenizer

model_path = "<Path to your vocabs file. This is usually a binary file that ends with .model>"
config = {} # "dictionary object of your config.json file"
tokenizer = JurassicTokenizer(model_path=model_path, config=config)
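
The config argument above is described as the dictionary form of a config.json file. As a minimal sketch, assuming the config is stored as a UTF-8 JSON object on disk, it could be loaded with the standard library like this. The helper name load_tokenizer_config is hypothetical and not part of ai21-tokenizer.

```python
import json
from pathlib import Path


def load_tokenizer_config(config_path):
    """Read a JSON config file and return it as a plain dict.

    Hypothetical helper, not part of ai21-tokenizer.
    """
    with Path(config_path).open(encoding="utf-8") as f:
        config = json.load(f)
    # The tokenizer expects a dictionary, so reject lists, strings, etc.
    if not isinstance(config, dict):
        raise ValueError(f"Expected a JSON object in {config_path}")
    return config
```

The returned dictionary can then be passed as config=load_tokenizer_config("<Path to your config.json>") in place of the empty dictionary above.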

Async usage

from ai21_tokenizer import Tokenizer

tokenizer = await Tokenizer.get_async_tokenizer()
# Your code here

Another way would be to use the create class method of our async Jurassic tokenizer:

from ai21_tokenizer import AsyncJurassicTokenizer

model_path = "<Path to your vocabs file. This is usually a binary file that ends with .model>"
config = {} # "dictionary object of your config.json file"
tokenizer = await AsyncJurassicTokenizer.create(model_path=model_path, config=config)
# Your code here

Functions

Encode and Decode

These functions encode your text into a list of token IDs and decode a list of token IDs back to plain text.

text_to_encode = "apple orange banana"
encoded_text = tokenizer.encode(text_to_encode)
print(f"Encoded text: {encoded_text}")

decoded_text = tokenizer.decode(encoded_text)
print(f"Decoded text: {decoded_text}")
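
A common use of encode and decode is counting tokens or trimming text to a token budget. The sketch below assumes only that the tokenizer has the encode and decode methods shown above. WhitespaceTokenizer is a hypothetical stand-in so the example runs without downloading a vocabulary, and count_tokens and truncate_to_tokens are illustrative helpers, not library functions.

```python
class WhitespaceTokenizer:
    """Hypothetical stand-in with the same encode/decode shape as the tokenizer above."""

    def __init__(self):
        self._vocab = {}    # token string -> id
        self._inverse = []  # id -> token string

    def encode(self, text):
        ids = []
        for word in text.split():
            if word not in self._vocab:
                self._vocab[word] = len(self._inverse)
                self._inverse.append(word)
            ids.append(self._vocab[word])
        return ids

    def decode(self, ids):
        return " ".join(self._inverse[i] for i in ids)


def count_tokens(tokenizer, text):
    """Return how many token ids `text` encodes to."""
    return len(tokenizer.encode(text))


def truncate_to_tokens(tokenizer, text, max_tokens):
    """Keep at most `max_tokens` tokens of `text`, decoded back to a string."""
    ids = tokenizer.encode(text)
    return tokenizer.decode(ids[:max_tokens])


demo_tokenizer = WhitespaceTokenizer()
print(count_tokens(demo_tokenizer, "apple orange banana"))           # 3
print(truncate_to_tokens(demo_tokenizer, "apple orange banana", 2))  # apple orange
```

With a real subword tokenizer the counts will differ from this word-level stand-in, and the truncated text may not be an exact prefix of the input.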

Async

# Assuming you have created an async tokenizer
text_to_encode = "apple orange banana"
encoded_text = await tokenizer.encode(text_to_encode)
print(f"Encoded text: {encoded_text}")

decoded_text = await tokenizer.decode(encoded_text)
print(f"Decoded text: {decoded_text}")
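
The await calls above must run inside a coroutine. The following is a minimal sketch of one way to drive them with asyncio.run and to encode several texts concurrently. FakeAsyncTokenizer is a hypothetical stand-in with awaitable encode and decode methods, and encode_many is an illustrative helper. In real use the tokenizer would come from await Tokenizer.get_async_tokenizer(...) inside main.

```python
import asyncio


class FakeAsyncTokenizer:
    """Hypothetical stand-in: one id per character, with awaitable encode/decode."""

    async def encode(self, text):
        await asyncio.sleep(0)  # yield control, as a real async call might
        return [ord(ch) for ch in text]

    async def decode(self, ids):
        await asyncio.sleep(0)
        return "".join(chr(i) for i in ids)


async def encode_many(tokenizer, texts):
    """Encode several texts concurrently; results keep the input order."""
    return await asyncio.gather(*(tokenizer.encode(t) for t in texts))


async def main():
    tokenizer = FakeAsyncTokenizer()
    encoded = await encode_many(tokenizer, ["apple", "orange"])
    decoded = [await tokenizer.decode(ids) for ids in encoded]
    return encoded, decoded


encoded, decoded = asyncio.run(main())
print(decoded)  # ['apple', 'orange']
```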

To convert token IDs to tokens, or tokens back to IDs:

tokens = tokenizer.convert_ids_to_tokens(encoded_text)
print(f"IDs correspond to tokens: {tokens}")

ids = tokenizer.convert_tokens_to_ids(tokens)
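
When debugging a prompt it can help to see each ID next to its token. This sketch assumes only the two conversion methods shown above. TinyVocabTokenizer is a hypothetical stand-in with a three-entry vocabulary, and describe_ids and ids_round_trip are illustrative helpers rather than library functions.

```python
class TinyVocabTokenizer:
    """Hypothetical stand-in exposing the two conversion methods used above."""

    def __init__(self, vocab):
        self._id_to_token = list(vocab)
        self._token_to_id = {tok: i for i, tok in enumerate(self._id_to_token)}

    def convert_ids_to_tokens(self, ids):
        return [self._id_to_token[i] for i in ids]

    def convert_tokens_to_ids(self, tokens):
        return [self._token_to_id[t] for t in tokens]


def describe_ids(tokenizer, ids):
    """Pair each id with its token string."""
    tokens = tokenizer.convert_ids_to_tokens(ids)
    return list(zip(ids, tokens))


def ids_round_trip(tokenizer, ids):
    """True when ids -> tokens -> ids returns the original ids."""
    tokens = tokenizer.convert_ids_to_tokens(ids)
    return tokenizer.convert_tokens_to_ids(tokens) == list(ids)


vocab_tokenizer = TinyVocabTokenizer(["apple", "orange", "banana"])
print(describe_ids(vocab_tokenizer, [2, 0]))    # [(2, 'banana'), (0, 'apple')]
print(ids_round_trip(vocab_tokenizer, [2, 0]))  # True
```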

Async

# Assuming you have created an async tokenizer
tokens = await tokenizer.convert_ids_to_tokens(encoded_text)
print(f"IDs correspond to tokens: {tokens}")

ids = await tokenizer.convert_tokens_to_ids(tokens)

For more examples, please see our examples folder.
