GenLM Backend is a high-performance backend for language model probabilistic programs, built for the GenLM ecosystem. It provides an asynchronous and autobatched interface to language model inference via vllm, sglang, transformers, and mlx-lm.

See our documentation.

🚀 Key Features

  • Automatic batching of concurrent log-probability requests, enabling efficient large-scale inference without having to write batching logic yourself
  • Byte-level decoding of transformers tokenizers, enabling advanced token-level control
  • Support for arbitrary Hugging Face models (e.g., LLaMA, DeepSeek) with fast inference and automatic KV caching using vllm
  • NEW: support for the MLX-LM library, allowing faster inference on Apple silicon devices
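
To make the first feature concrete, here is a minimal sketch of how automatic batching of concurrent requests can work in plain asyncio. It is an illustration of the general technique, not genlm-backend's implementation: `ToyAutoBatcher`, `batch_fn`, and `batch_sizes` are hypothetical names, and the "model" is just a function that returns string lengths.

```python
import asyncio


class ToyAutoBatcher:
    """Collects concurrent requests and answers them with one batched call."""

    def __init__(self, batch_fn):
        self.batch_fn = batch_fn  # hypothetical: maps a list of inputs to a list of outputs
        self.pending = []         # (input, future) pairs waiting for the next flush
        self.batch_sizes = []     # recorded only so the demo can show what was batched
        self._flush_scheduled = False

    async def request(self, item):
        loop = asyncio.get_running_loop()
        future = loop.create_future()
        self.pending.append((item, future))
        if not self._flush_scheduled:
            # Defer the flush by one event-loop iteration so that other
            # coroutines started by the same gather() can enqueue first.
            self._flush_scheduled = True
            loop.call_soon(self._flush)
        return await future

    def _flush(self):
        # Take everything queued so far and serve it with a single batched call.
        batch, self.pending = self.pending, []
        self._flush_scheduled = False
        self.batch_sizes.append(len(batch))
        outputs = self.batch_fn([item for item, _ in batch])
        for (_, future), output in zip(batch, outputs):
            future.set_result(output)


async def demo():
    batcher = ToyAutoBatcher(lambda xs: [len(x) for x in xs])
    results = await asyncio.gather(*[batcher.request(s) for s in ["a", "bb", "ccc"]])
    return results, batcher.batch_sizes


results, batch_sizes = asyncio.run(demo())
print(results, batch_sizes)
```

Each caller writes ordinary `await batcher.request(...)` code, while the three requests issued through `asyncio.gather` are served by a single call to `batch_fn`. A real backend would replace `batch_fn` with a batched forward pass of the model.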

⚡ Quick Start

Install the library with pip. The base install uses transformers as the default inference backend.

pip install genlm-backend

To use a different backend, install the corresponding extra:

pip install genlm-backend[mlx]
pip install genlm-backend[vllm]
pip install genlm-backend[sgl]

For LoRA support:

pip install genlm-backend[lora]
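
After installing, it can be handy to check which backends are importable in the current environment. The sketch below uses only the standard library. The module names in `BACKEND_MODULES` are the usual top-level import names of each package and are an assumption here, as is the helper name `available_backends`; neither is part of genlm-backend.

```python
from importlib.util import find_spec

# Assumed top-level import names for each backend's Python package.
BACKEND_MODULES = {
    "transformers": "transformers",
    "vllm": "vllm",
    "sglang": "sglang",
    "mlx-lm": "mlx_lm",
}


def available_backends():
    """Return {backend name: bool} saying whether each backend's module can be found."""
    return {
        name: find_spec(module) is not None
        for name, module in BACKEND_MODULES.items()
    }


for name, installed in available_backends().items():
    print(f"{name}: {'installed' if installed else 'missing'}")
```

`find_spec` only locates the module without importing it, so the check stays cheap even for heavy packages.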

🧪 Example: Autobatched Sequential Importance Sampling with LLMs

This example demonstrates how genlm-backend enables concise, scalable probabilistic inference with language models. It implements a Sequential Importance Sampling (SIS) algorithm that makes asynchronous log-probability requests, which the backend batches automatically.

import torch
import asyncio
from genlm.backend import load_model_by_name

# --- Token-level masking using the byte-level vocabulary --- #
def make_masking_function(llm, max_token_length, max_tokens):
    eos_id = llm.tokenizer.eos_token_id
    valid_ids = torch.tensor([
        token_id == eos_id or len(token) <= max_token_length
        for token_id, token in enumerate(llm.byte_vocab)
    ], dtype=torch.float).log()
    eos_one_hot = torch.nn.functional.one_hot(
        torch.tensor(eos_id), len(llm.byte_vocab)
    ).log()

    def masking_function(context):
        return eos_one_hot if len(context) >= max_tokens else valid_ids

    return masking_function

# --- Particle class for SIS --- #
class Particle:
    def __init__(self, llm, mask_function, prompt_ids):
        self.context = []
        self.prompt_ids = prompt_ids
        self.log_weight = 0.0
        self.active = True
        self.llm = llm
        self.mask_function = mask_function

    async def extend(self):
        logps = await self.llm.next_token_logprobs(self.prompt_ids + self.context)
        masked_logps = logps + self.mask_function(self.context).to(logps.device)
        logZ = masked_logps.logsumexp(dim=-1)
        self.log_weight += logZ
        next_token_id = torch.multinomial((masked_logps - logZ).exp(), 1).item()
        if next_token_id == self.llm.tokenizer.eos_token_id:
            self.active = False
        else:
            self.context.append(next_token_id)

# --- Autobatched SIS loop --- #
async def autobatched_sis(n_particles, llm, masking_function, prompt_ids):
    particles = [Particle(llm, masking_function, prompt_ids) for _ in range(n_particles)]
    while any(p.active for p in particles):
        await asyncio.gather(*[p.extend() for p in particles if p.active])
    return particles

# --- Run the example --- #
llm = load_model_by_name("gpt2") # or e.g., "meta-llama/Llama-3.2-1B" if you have access
mask_function = make_masking_function(llm, max_token_length=10, max_tokens=10)
prompt_ids = llm.tokenizer.encode("Montreal is")
particles = await autobatched_sis( # use asyncio.run(autobatched_sis(...)) if you are not in an async context
    n_particles=10, llm=llm, masking_function=mask_function, prompt_ids=prompt_ids
)

strings = [llm.tokenizer.decode(p.context) for p in particles]
log_weights = torch.tensor([p.log_weight for p in particles])
probs = torch.exp(log_weights - log_weights.logsumexp(dim=-1))

for s, p in sorted(zip(strings, probs), key=lambda x: -x[1]):
    print(f"{repr(s)} (probability: {p:.4f})")

This example highlights the following features:

  • 🌀 Asynchronous Inference Loop. Each particle runs independently, but all LLM calls are scheduled concurrently via asyncio.gather. The backend batches them automatically, so we get the efficiency of large batched inference without having to write the batching logic.
  • 🔁 Byte-level Tokenization Support. Token filtering is done using the model’s byte-level vocabulary, which genlm-backend exposes. This enables low-level control over generation in ways not possible with most high-level APIs.
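
For readers who want to see the arithmetic behind one particle update without loading a model, the following is a small pure-Python sketch of the same steps: masking tokens by byte length, accumulating log Z as the weight increment, sampling from the renormalized distribution, and normalizing final log weights. The vocabulary, probabilities, and log weights are made-up toy values, and `b"<eos>"` is a placeholder for an end-of-sequence token.

```python
import math
import random

# Toy byte-level vocabulary and next-token distribution (made-up values).
byte_vocab = [b"a", b"bb", b"cccc", b"<eos>"]
probs = [0.4, 0.3, 0.2, 0.1]
eos_id = 3
max_token_length = 2


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) that tolerates -inf entries."""
    top = max(values)
    if top == -math.inf:
        return -math.inf
    return top + math.log(sum(math.exp(v - top) for v in values))


# Log-mask: 0.0 keeps a token, -inf removes it. EOS is always allowed.
log_mask = [
    0.0 if (i == eos_id or len(tok) <= max_token_length) else -math.inf
    for i, tok in enumerate(byte_vocab)
]
masked_logps = [math.log(p) + m for p, m in zip(probs, log_mask)]

# log Z is the log of the probability mass that survives the mask;
# the particle's log weight grows by this amount at each step.
log_z = logsumexp(masked_logps)

# Sample the next token from the renormalized masked distribution.
proposal = [math.exp(lp - log_z) for lp in masked_logps]
next_token_id = random.choices(range(len(byte_vocab)), weights=proposal)[0]

# Normalize a set of final particle log weights into probabilities.
log_weights = [-1.0, -2.0, -3.0]
total = logsumexp(log_weights)
posterior = [math.exp(w - total) for w in log_weights]
print(math.exp(log_z), next_token_id, posterior)
```

Here the token `b"cccc"` is longer than `max_token_length`, so it receives zero proposal probability, and the surviving mass of about 0.8 is what the particle's weight is multiplied by at this step.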

Development

See the DEVELOPING.md file for information on how to install the project for local development.
