AI21 Labs Tokenizer

A SentencePiece-based tokenizer for production use with AI21's models.

Prerequisites

  • To use the tokenizers for Jamba Mini or Jamba Large, you must first request access to the relevant model's HuggingFace repo.

Installation

pip

pip install ai21-tokenizer

poetry

poetry add ai21-tokenizer

Usage

Basic Usage

from ai21_tokenizer import Tokenizer

# Create tokenizer (defaults to Jamba Mini)
tokenizer = Tokenizer.get_tokenizer()

# Encode text to token IDs
text = "Hello, world!"
encoded = tokenizer.encode(text)
print(f"Encoded: {encoded}")

# Decode token IDs back to text
decoded = tokenizer.decode(encoded)
print(f"Decoded: {decoded}")
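
A common use of encode and decode is counting tokens and trimming text to a token budget. The following is a minimal sketch, not part of the package: `count_tokens`, `truncate_to_budget`, and `SupportsEncodeDecode` are hypothetical helpers, and `WhitespaceStandIn` is a toy stand-in (not the Jamba tokenizer) so the example runs without model access. It assumes only that the tokenizer exposes `encode(text)` and `decode(ids)` as shown above.

```python
from typing import List, Protocol


class SupportsEncodeDecode(Protocol):
    """Structural type for any object with encode/decode as used above (assumed shape)."""

    def encode(self, text: str) -> List[int]: ...

    def decode(self, token_ids: List[int]) -> str: ...


def count_tokens(tokenizer: SupportsEncodeDecode, text: str) -> int:
    """Return how many token IDs the tokenizer produces for text."""
    return len(tokenizer.encode(text))


def truncate_to_budget(tokenizer: SupportsEncodeDecode, text: str, max_tokens: int) -> str:
    """Return text unchanged if it fits, else decode only the first max_tokens IDs."""
    ids = tokenizer.encode(text)
    if len(ids) <= max_tokens:
        return text
    return tokenizer.decode(ids[:max_tokens])


class WhitespaceStandIn:
    """Hypothetical stand-in: one ID per whitespace-separated word. NOT the Jamba tokenizer."""

    def __init__(self) -> None:
        self._vocab: List[str] = []

    def encode(self, text: str) -> List[int]:
        ids = []
        for word in text.split():
            if word not in self._vocab:
                self._vocab.append(word)
            ids.append(self._vocab.index(word))
        return ids

    def decode(self, token_ids: List[int]) -> str:
        return " ".join(self._vocab[i] for i in token_ids)


if __name__ == "__main__":
    tok = WhitespaceStandIn()
    print(count_tokens(tok, "Hello, world! Hello again"))
    print(truncate_to_budget(tok, "Hello, world! Hello again", 2))
```

In real use you would pass the object returned by `Tokenizer.get_tokenizer()` in place of the stand-in. Note that decoding a truncated ID list may not reproduce the original text exactly at the cut point.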

Specific Tokenizer Selection

from ai21_tokenizer import Tokenizer, PreTrainedTokenizers

# Jamba Mini tokenizer
tokenizer = Tokenizer.get_tokenizer(PreTrainedTokenizers.JAMBA_MINI_TOKENIZER)

# Jamba Large tokenizer
tokenizer = Tokenizer.get_tokenizer(PreTrainedTokenizers.JAMBA_LARGE_TOKENIZER)

Async Usage

import asyncio
from ai21_tokenizer import Tokenizer

async def main():
    tokenizer = await Tokenizer.get_async_tokenizer()

    text = "Hello, world!"
    encoded = await tokenizer.encode(text)
    decoded = await tokenizer.decode(encoded)

    print(f"Original: {text}")
    print(f"Encoded: {encoded}")
    print(f"Decoded: {decoded}")

asyncio.run(main())
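
When several texts need encoding, the awaitable `encode` shown above can be combined with `asyncio.gather`. This is a sketch under stated assumptions: `encode_many` and `SupportsAsyncEncode` are hypothetical names, and `AsyncStandIn` is a toy stand-in (not the Jamba tokenizer) so the example runs without model access. Whether this yields a real speedup depends on how the library implements its async methods.

```python
import asyncio
from typing import List, Protocol, Sequence


class SupportsAsyncEncode(Protocol):
    """Structural type for any object with an awaitable encode (assumed shape)."""

    async def encode(self, text: str) -> List[int]: ...


async def encode_many(tokenizer: SupportsAsyncEncode, texts: Sequence[str]) -> List[List[int]]:
    """Schedule one encode call per text; gather returns results in input order."""
    return list(await asyncio.gather(*(tokenizer.encode(t) for t in texts)))


class AsyncStandIn:
    """Hypothetical stand-in: one ID per character. NOT the Jamba tokenizer."""

    async def encode(self, text: str) -> List[int]:
        await asyncio.sleep(0)  # yield to the event loop, as a real async call might
        return [ord(ch) for ch in text]


async def main() -> None:
    batches = await encode_many(AsyncStandIn(), ["Hello", "world!"])
    print([len(ids) for ids in batches])


if __name__ == "__main__":
    asyncio.run(main())
```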

Advanced Token Operations

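from ai21_tokenizer import Tokenizer

tokenizer = Tokenizer.get_tokenizer()
encoded = tokenizer.encode("Hello, world!")

# Convert token IDs to their string tokens
tokens = tokenizer.convert_ids_to_tokens(encoded)
print(f"Tokens: {tokens}")

# Convert the string tokens back to token IDs
ids = tokenizer.convert_tokens_to_ids(tokens)
print(f"IDs: {ids}")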
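
One way to sanity-check the two conversion methods is to confirm that IDs survive a trip through tokens and back. The sketch below is illustrative only: `ids_round_trip` and `SupportsTokenConversion` are hypothetical names, the protocol uses simplified list-only signatures (an assumption), and `TinyVocabStandIn` is a toy stand-in with a four-entry vocabulary rather than the Jamba tokenizer.

```python
from typing import Dict, List, Protocol


class SupportsTokenConversion(Protocol):
    """Structural type for the two conversion methods used above (simplified, assumed shape)."""

    def convert_ids_to_tokens(self, token_ids: List[int]) -> List[str]: ...

    def convert_tokens_to_ids(self, tokens: List[str]) -> List[int]: ...


def ids_round_trip(tokenizer: SupportsTokenConversion, token_ids: List[int]) -> bool:
    """Return True if converting IDs to tokens and back yields the same IDs."""
    tokens = tokenizer.convert_ids_to_tokens(token_ids)
    return tokenizer.convert_tokens_to_ids(tokens) == list(token_ids)


class TinyVocabStandIn:
    """Hypothetical stand-in backed by a fixed vocabulary list. NOT the Jamba tokenizer."""

    def __init__(self, vocab: List[str]) -> None:
        self._id_to_token: List[str] = list(vocab)
        self._token_to_id: Dict[str, int] = {tok: i for i, tok in enumerate(vocab)}

    def convert_ids_to_tokens(self, token_ids: List[int]) -> List[str]:
        return [self._id_to_token[i] for i in token_ids]

    def convert_tokens_to_ids(self, tokens: List[str]) -> List[int]:
        return [self._token_to_id[t] for t in tokens]


if __name__ == "__main__":
    tok = TinyVocabStandIn(["Hello", ",", "world", "!"])
    print(ids_round_trip(tok, [0, 1, 2, 3]))
```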

Direct Class Usage

from ai21_tokenizer import SyncJambaTokenizer

# Using local model file
model_path = "/path/to/your/tokenizer.model"
tokenizer = SyncJambaTokenizer(model_path=model_path)

text = "Hello, world!"
encoded = tokenizer.encode(text)
decoded = tokenizer.decode(encoded)
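
Because the path above is a placeholder, it can help to verify that the model file exists before handing it to the tokenizer class. The sketch below uses only the standard library; `resolve_model_path` is a hypothetical helper, not part of the package, and how the library itself reports a missing file is not assumed here.

```python
from pathlib import Path


def resolve_model_path(path_str: str) -> Path:
    """Expand '~' and return the path, raising early if no file exists there."""
    path = Path(path_str).expanduser()
    if not path.is_file():
        raise FileNotFoundError(f"Tokenizer model file not found: {path}")
    return path


if __name__ == "__main__":
    try:
        model_path = resolve_model_path("/path/to/your/tokenizer.model")
        print(f"Using model file: {model_path}")
    except FileNotFoundError as err:
        print(err)
```

The returned path could then be converted with `str(...)` and passed as `model_path` in the example above.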

Async Direct Class Usage

import asyncio
from ai21_tokenizer import AsyncJambaTokenizer

async def main():
    model_path = "/path/to/your/tokenizer.model"
    tokenizer = await AsyncJambaTokenizer.create(model_path=model_path)

    text = "Hello, world!"
    encoded = await tokenizer.encode(text)
    decoded = await tokenizer.decode(encoded)

asyncio.run(main())

For more examples, please see our examples folder.
