AI21 Labs Tokenizer
A SentencePiece-based tokenizer for production use with AI21's models
Installation
pip
pip install ai21-tokenizer
poetry
poetry add ai21-tokenizer
Usage
Tokenizer Creation
Jamba 1.5 Mini Tokenizer
from ai21_tokenizer import Tokenizer, PreTrainedTokenizers
tokenizer = Tokenizer.get_tokenizer(PreTrainedTokenizers.JAMBA_1_5_MINI_TOKENIZER)
# Your code here
Another way would be to use our Jamba 1.5 Mini tokenizer directly:
from ai21_tokenizer import Jamba1_5Tokenizer
model_path = "<Path to your vocabs file>"
tokenizer = Jamba1_5Tokenizer(model_path=model_path)
# Your code here
Async usage
from ai21_tokenizer import Tokenizer, PreTrainedTokenizers
tokenizer = await Tokenizer.get_async_tokenizer(PreTrainedTokenizers.JAMBA_1_5_MINI_TOKENIZER)
# Your code here
Jamba 1.5 Large Tokenizer
from ai21_tokenizer import Tokenizer, PreTrainedTokenizers
tokenizer = Tokenizer.get_tokenizer(PreTrainedTokenizers.JAMBA_1_5_LARGE_TOKENIZER)
# Your code here
Another way would be to use our Jamba 1.5 Large tokenizer directly:
from ai21_tokenizer import Jamba1_5Tokenizer
model_path = "<Path to your vocabs file>"
tokenizer = Jamba1_5Tokenizer(model_path=model_path)
# Your code here
Async usage
from ai21_tokenizer import Tokenizer, PreTrainedTokenizers
tokenizer = await Tokenizer.get_async_tokenizer(PreTrainedTokenizers.JAMBA_1_5_LARGE_TOKENIZER)
# Your code here
Jamba Instruct Tokenizer
from ai21_tokenizer import Tokenizer, PreTrainedTokenizers
tokenizer = Tokenizer.get_tokenizer(PreTrainedTokenizers.JAMBA_INSTRUCT_TOKENIZER)
# Your code here
Another way would be to use our Jamba tokenizer directly:
from ai21_tokenizer import JambaInstructTokenizer
model_path = "<Path to your vocabs file>"
tokenizer = JambaInstructTokenizer(model_path=model_path)
# Your code here
Async usage
from ai21_tokenizer import Tokenizer, PreTrainedTokenizers
tokenizer = await Tokenizer.get_async_tokenizer(PreTrainedTokenizers.JAMBA_INSTRUCT_TOKENIZER)
# Your code here
Another way would be to use our async Jamba tokenizer class method create:
from ai21_tokenizer import AsyncJambaInstructTokenizer
model_path = "<Path to your vocabs file>"
tokenizer = await AsyncJambaInstructTokenizer.create(model_path=model_path)
# Your code here
J2 Tokenizer
from ai21_tokenizer import Tokenizer
tokenizer = Tokenizer.get_tokenizer()
# Your code here
Another way would be to use our Jurassic tokenizer directly:
from ai21_tokenizer import JurassicTokenizer
model_path = "<Path to your vocabs file. This is usually a binary file that ends with .model>"
config = {}  # dictionary with the contents of your config.json file
tokenizer = JurassicTokenizer(model_path=model_path, config=config)
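The config argument is a plain dict built from your config.json. A minimal sketch of constructing it, using an inline JSON string in place of the real file (the keys below are illustrative, not the library's required schema):

```python
import json

# Build the config dict for JurassicTokenizer. In practice, read the
# config.json that ships with your model files; these keys are examples only.
config_json = '{"unk_token": "<unk>", "pad_token": "<pad>"}'
config = json.loads(config_json)
```

With a real file, you would read it instead, e.g. `config = json.loads(Path("config.json").read_text())`.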
Async usage
from ai21_tokenizer import Tokenizer
tokenizer = await Tokenizer.get_async_tokenizer()
# Your code here
Another way would be to use our async Jurassic tokenizer class method create:
from ai21_tokenizer import AsyncJurassicTokenizer
model_path = "<Path to your vocabs file. This is usually a binary file that ends with .model>"
config = {}  # dictionary with the contents of your config.json file
tokenizer = await AsyncJurassicTokenizer.create(model_path=model_path, config=config)
# Your code here
Functions
Encode and Decode
These functions let you encode your text into a list of token ids and decode it back to plaintext:
text_to_encode = "apple orange banana"
encoded_text = tokenizer.encode(text_to_encode)
print(f"Encoded text: {encoded_text}")
decoded_text = tokenizer.decode(encoded_text)
print(f"Decoded text: {decoded_text}")
Async
# Assuming you have created an async tokenizer
text_to_encode = "apple orange banana"
encoded_text = await tokenizer.encode(text_to_encode)
print(f"Encoded text: {encoded_text}")
decoded_text = await tokenizer.decode(encoded_text)
print(f"Decoded text: {decoded_text}")
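The awaitable calls above need a running event loop. A minimal pattern for driving them from a plain script, with a stand-in coroutine in place of the real tokenizer method (the encode below is a placeholder, not the library's API):

```python
import asyncio

# Stand-in coroutine: real code would instead await tokenizer.encode(...)
# on a tokenizer created via Tokenizer.get_async_tokenizer(...).
async def encode(text: str) -> list[int]:
    return [ord(c) for c in text]  # placeholder "tokenization"

async def main() -> list[int]:
    return await encode("hi")

# asyncio.run drives one top-level coroutine to completion.
encoded = asyncio.run(main())
```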
What if you want to convert your tokens to ids, or vice versa?
tokens = tokenizer.convert_ids_to_tokens(encoded_text)
print(f"Tokens corresponding to the IDs: {tokens}")
ids = tokenizer.convert_tokens_to_ids(tokens)
Async
# Assuming you have created an async tokenizer
tokens = await tokenizer.convert_ids_to_tokens(encoded_text)
print(f"Tokens corresponding to the IDs: {tokens}")
ids = await tokenizer.convert_tokens_to_ids(tokens)
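The ids-to-tokens conversion is an invertible lookup into the tokenizer's vocabulary. A toy sketch of that contract, using a made-up three-word vocabulary rather than the real SentencePiece vocab:

```python
# Made-up vocabulary for illustration; the real tokenizer's vocabulary
# comes from its SentencePiece model file.
vocab = {"apple": 0, "orange": 1, "banana": 2}
inv_vocab = {i: t for t, i in vocab.items()}  # inverse mapping

def convert_tokens_to_ids(tokens):
    return [vocab[t] for t in tokens]

def convert_ids_to_tokens(ids):
    return [inv_vocab[i] for i in ids]

# Round-trip: tokens -> ids -> tokens recovers the original tokens.
ids = convert_tokens_to_ids(["apple", "banana"])
tokens = convert_ids_to_tokens(ids)
```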
For more examples, please see our examples folder.
File details
Details for the file ai21_tokenizer-0.12.0.tar.gz

File metadata
- Download URL: ai21_tokenizer-0.12.0.tar.gz
- Upload date:
- Size: 2.6 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.9.19

File hashes
- SHA256: d2a5b17789d21572504b7693148bf66e692bdb3ab563023dbcbee340bcbd11c6
- MD5: 4dfe06f027d3b761108684ea2d536b5c
- BLAKE2b-256: 3980183f0bcdcb707a7e6593ff048b60d7e127d241ef8bef58c0a4dc7d1b63c7
File details
Details for the file ai21_tokenizer-0.12.0-py3-none-any.whl

File metadata
- Download URL: ai21_tokenizer-0.12.0-py3-none-any.whl
- Upload date:
- Size: 2.7 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.9.19

File hashes
- SHA256: 7fd37b9093894b30b0f200e5f44fc8fb8772e2b272ef71b6d73722b4696e63c4
- MD5: c616bba971bd67d9480cc8cdeba7d21f
- BLAKE2b-256: 18956ea741600ed38100a7d01f58b3e61608b753f7ed75ff0dc45b4397443c75