A streaming chat toolkit for pre-trained large language models (LLMs)
ChatStream
ChatStream is a chat toolkit for pre-trained large language models.
It can be embedded in FastAPI/Starlette-based web applications and web APIs to perform streaming (token-by-token) sentence generation with a pre-trained language model while keeping the load under control through concurrency limits and request queueing.
Installation
pip install chatstream
Quick Start
Install required packages
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117
pip install transformers
pip install "uvicorn[standard]" gunicorn
Implementing a ChatStream server
Implement a streaming chat server for pre-trained models.
import torch
from fastapi import FastAPI, Request
from fastsession import FastSessionMiddleware, MemoryStore
from transformers import AutoTokenizer, AutoModelForCausalLM
from chatstream import ChatStream, ChatPromptTogetherRedPajamaINCITEChat as ChatPrompt
model_path = "togethercomputer/RedPajama-INCITE-Chat-3B-v1"
device = "cuda" # "cuda" / "cpu"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16)
model.to(device)
chat_stream = ChatStream(
    num_of_concurrent_executions=2,  # maximum number of sentence-generation tasks running at the same time
    max_queue_size=5,                # maximum number of requests waiting in the queue
    model=model,
    tokenizer=tokenizer,
    device=device,
    chat_prompt_clazz=ChatPrompt,
)
app = FastAPI()
# Add session middleware so that each user's ChatPrompt is kept in their HTTP session
app.add_middleware(
    FastSessionMiddleware,
    secret_key="your-session-secret-key",
    store=MemoryStore(),
    http_only=True,
    secure=False,
)
@app.post("/chat_stream")
async def stream_api(request: Request):
    # Just pass the FastAPI Request object to `handle_starlette_request`;
    # queueing and concurrency control are handled automatically
    response = await chat_stream.handle_starlette_request(request)
    return response
@app.on_event("startup")
async def startup():
    # Start the queueing system with `start_queue_worker` when the web server starts up
    await chat_stream.start_queue_worker()
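Once the server above is running behind an ASGI server such as uvicorn (see "Start the Web server (ASGI server)" below), the endpoint can be exercised with a simple streaming client. The sketch below is an illustration only: the "user_input" form field name is an assumption, so check the ChatStream documentation for the actual request format that `handle_starlette_request` expects. A requests session is reused so the session cookie, and therefore the per-user ChatPrompt, persists across turns.
import requests

# Reuse one session so the session cookie (and the per-user ChatPrompt) is kept across turns
session = requests.Session()

# NOTE: the "user_input" field name is an assumption for illustration;
# consult the ChatStream docs for the request format handle_starlette_request expects.
with session.post(
    "http://localhost:8000/chat_stream",
    data={"user_input": "Hello, who are you?"},
    stream=True,
) as response:
    response.raise_for_status()
    # Print tokens as they arrive instead of waiting for the full reply
    for chunk in response.iter_content(chunk_size=None, decode_unicode=True):
        print(chunk, end="", flush=True)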
Table of Contents
- Implementation of Web API Endpoints
- Queueing System and Concurrency Limit
- Start the Web server (ASGI server)
- Console chat implementation
- Configuration during development
- Advanced Settings
  - Chat History Persistence
  - Configuration for large scale access
  - Interfacing with login authentication using OAuth
  - Load Balancing on Multi-GPU
  - Load Balancing with Multi-GPU Server
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
chatstream-0.2.0.tar.gz (33.8 kB)
Built Distribution
chatstream-0.2.0-py3-none-any.whl (36.1 kB)
Hashes for chatstream-0.2.0-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | b89184b2cbe8cf73d1ff4e4b210db27e0fa7f195408aadfc9a9319c4be06842b
MD5 | 3a5b39198d55ef46c3532aad5833cc36
BLAKE2b-256 | 014979cc3926653be1c3a37bf99ad97f1be12769657d8747216a33628d8be422