# AI21 Labs Tokenizer

A SentencePiece-based tokenizer for production use with AI21's models.
## Prerequisites

- If you wish to use the tokenizers for Jamba Mini or Jamba Large, you will need to request access to the relevant model's Hugging Face repo.
## Installation

### pip

```shell
pip install ai21-tokenizer
```

### poetry

```shell
poetry add ai21-tokenizer
```
## Usage

### Basic Usage

```python
from ai21_tokenizer import Tokenizer

# Create tokenizer (defaults to Jamba Mini)
tokenizer = Tokenizer.get_tokenizer()

# Encode text to token IDs
text = "Hello, world!"
encoded = tokenizer.encode(text)
print(f"Encoded: {encoded}")

# Decode token IDs back to text
decoded = tokenizer.decode(encoded)
print(f"Decoded: {decoded}")
```
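Because `encode` and `decode` form a round trip for ordinary text, a small helper can sanity-check any tokenizer you construct. This is a sketch, not part of the library: the `SupportsEncodeDecode` protocol and `round_trips` function are illustrative names written against the sync interface shown above.

```python
from typing import List, Protocol


class SupportsEncodeDecode(Protocol):
    """Anything exposing the sync encode/decode interface shown above."""

    def encode(self, text: str) -> List[int]: ...
    def decode(self, ids: List[int]) -> str: ...


def round_trips(tokenizer: SupportsEncodeDecode, text: str) -> bool:
    """Return True if decoding the encoded text reproduces it exactly."""
    return tokenizer.decode(tokenizer.encode(text)) == text
```

With a real tokenizer instance you would call `round_trips(tokenizer, "Hello, world!")`; note that some inputs (e.g. unusual whitespace) may not round-trip byte-for-byte through a SentencePiece model.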
### Specific Tokenizer Selection

```python
from ai21_tokenizer import Tokenizer, PreTrainedTokenizers

# Jamba Mini tokenizer
tokenizer = Tokenizer.get_tokenizer(PreTrainedTokenizers.JAMBA_MINI_TOKENIZER)

# Jamba Large tokenizer
tokenizer = Tokenizer.get_tokenizer(PreTrainedTokenizers.JAMBA_LARGE_TOKENIZER)
```
### Async Usage

```python
import asyncio

from ai21_tokenizer import Tokenizer


async def main():
    tokenizer = await Tokenizer.get_async_tokenizer()

    text = "Hello, world!"
    encoded = await tokenizer.encode(text)
    decoded = await tokenizer.decode(encoded)

    print(f"Original: {text}")
    print(f"Encoded: {encoded}")
    print(f"Decoded: {decoded}")


asyncio.run(main())
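One benefit of the async tokenizer is encoding many texts concurrently with `asyncio.gather`. The `encode_all` helper below is a sketch, not part of the library; it works with any object exposing an `async def encode(text)` method, such as the tokenizer returned by `Tokenizer.get_async_tokenizer()`.

```python
import asyncio
from typing import Iterable, List


async def encode_all(tokenizer, texts: Iterable[str]) -> List[List[int]]:
    """Encode several texts concurrently and return their token-ID lists
    in the same order as the input."""
    return await asyncio.gather(*(tokenizer.encode(t) for t in texts))
```

For example, `await encode_all(tokenizer, ["first text", "second text"])` returns the two ID lists in order.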
### Advanced Token Operations

Continuing from the Basic Usage example above:

```python
# Convert token IDs to their string tokens
tokens = tokenizer.convert_ids_to_tokens(encoded)
print(f"Tokens: {tokens}")

# ...and convert tokens back to IDs
ids = tokenizer.convert_tokens_to_ids(tokens)
print(f"IDs: {ids}")
```
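A common use of these operations is enforcing a token budget before sending text to a model. The helper below is a sketch, not part of the library: it encodes, cuts the ID list to the budget, and decodes the remainder. Note that with a SentencePiece model the cut point may not fall on a word boundary, so the truncated text can end mid-word.

```python
def truncate_to_token_budget(tokenizer, text: str, max_tokens: int) -> str:
    """Return text unchanged if it fits the budget; otherwise decode
    only the first max_tokens token IDs."""
    ids = tokenizer.encode(text)
    if len(ids) <= max_tokens:
        return text
    return tokenizer.decode(ids[:max_tokens])
```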
### Direct Class Usage

```python
from ai21_tokenizer import SyncJambaTokenizer

# Using a local model file
model_path = "/path/to/your/tokenizer.model"
tokenizer = SyncJambaTokenizer(model_path=model_path)

text = "Hello, world!"
encoded = tokenizer.encode(text)
decoded = tokenizer.decode(encoded)
```
### Async Direct Class Usage

```python
import asyncio

from ai21_tokenizer import AsyncJambaTokenizer


async def main():
    model_path = "/path/to/your/tokenizer.model"
    tokenizer = await AsyncJambaTokenizer.create(model_path=model_path)

    text = "Hello, world!"
    encoded = await tokenizer.encode(text)
    decoded = await tokenizer.decode(encoded)


asyncio.run(main())
```
For more examples, please see our examples folder.