No project description provided
Project description
AI21 Labs Tokenizer
A SentencePiece based tokenizer for production uses with AI21's models
Installation
pip
pip install ai21-tokenizer
poetry
poetry add ai21-tokenizer
Usage
Tokenizer Creation
from ai21_tokenizer import Tokenizer
tokenizer = Tokenizer.get_tokenizer()
# Your code here
Another way would be to use our Jurassic model directly:
from ai21_tokenizer import JurassicTokenizer
model_path = "<Path to your vocabs file. This is usually a binary file that end with .model>"
config = {} # "dictionary object of your config.json file"
tokenizer = JurassicTokenizer(model_path=model_path, config=config)
Functions
Encode and Decode
These functions allow you to encode your text to a list of token ids and back to plaintext
text_to_encode = "apple orange banana"
encoded_text = tokenizer.encode(text_to_encode)
print(f"Encoded text: {encoded_text}")
decoded_text = tokenizer.decode(encoded_text)
print(f"Decoded text: {decoded_text}")
What if you had wanted to convert your tokens to ids or vice versa?
tokens = tokenizer.convert_ids_to_tokens(encoded_text)
print(f"IDs corresponds to Tokens: {tokens}")
ids = tokenizer.convert_tokens_to_ids(tokens)
For more examples, please see our examples folder.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
ai21_tokenizer-0.8.2.tar.gz
(2.6 MB
view hashes)
Built Distribution
Close
Hashes for ai21_tokenizer-0.8.2-py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 7f005959c53db035cd79d7302f3a0c656bcdfde7c3f07d850eab0894149ca3e4 |
|
MD5 | 43f9bef2132f5383722e46b43f9f3196 |
|
BLAKE2b-256 | 0a485631ea58b3fc129792a88d02e5f365f7f6f958695188e3d3fb5e415b3429 |