A streaming chat toolkit for pre-trained large language models (LLMs)
ChatStream
ChatStream is a chat toolkit for pre-trained large language models.
It can be embedded in FastAPI/Starlette-based web applications and web APIs to perform streaming sentence generation with pre-trained language models under load control.
Installation
pip install chatstream
Quick Start
Install required packages
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117
pip install transformers
pip install "uvicorn[standard]" gunicorn
Implementing a ChatStream server
Implement a streaming chat server for pre-trained models.
import torch
from fastapi import FastAPI, Request
from fastsession import FastSessionMiddleware, MemoryStore
from transformers import AutoTokenizer, AutoModelForCausalLM

from chatstream import ChatStream, ChatPromptTogetherRedPajamaINCITEChat as ChatPrompt

model_path = "togethercomputer/RedPajama-INCITE-Chat-3B-v1"
device = "cuda"  # "cuda" / "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16)
model.to(device)

chat_stream = ChatStream(
    num_of_concurrent_executions=2,  # max number of concurrent sentence-generation tasks
    max_queue_size=5,                # size of the request queue
    model=model,
    tokenizer=tokenizer,
    device=device,
    chat_prompt_clazz=ChatPrompt,
)

app = FastAPI()

# Add session middleware to keep a per-user ChatPrompt in the HTTP session
app.add_middleware(FastSessionMiddleware,
                   secret_key="your-session-secret-key",
                   store=MemoryStore(),
                   http_only=True,
                   secure=False,
                   )

@app.post("/chat_stream")
async def stream_api(request: Request):
    # Just pass the FastAPI Request object to `handle_chat_stream_request`
    # to automatically queue requests and control concurrency
    response = await chat_stream.handle_chat_stream_request(request)
    return response

@app.on_event("startup")
async def startup():
    # Start the queueing system with `start_queue_worker` when the web server starts up
    await chat_stream.start_queue_worker()
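The endpoint above streams the reply incrementally, so a client simply reads the HTTP response body chunk by chunk and renders each chunk as it arrives. A sketch of that consumption pattern (the network is simulated with a local generator here so the snippet is self-contained; the chunk contents are made up):

```python
def token_stream():
    # Stand-in for the chunked HTTP response body from POST /chat_stream
    for chunk in ["Hello", ", ", "how ", "can ", "I ", "help?"]:
        yield chunk

def consume(stream) -> str:
    """Accumulate streamed chunks into the full reply, printing as they arrive."""
    reply = ""
    for chunk in stream:
        print(chunk, end="", flush=True)  # render incrementally, like a chat UI
        reply += chunk
    return reply

full = consume(token_stream())
```

In practice you would start the app with the ASGI server installed earlier (e.g. `uvicorn main:app`, assuming the code is saved as `main.py`) and read the response with an HTTP client that supports streaming, such as `requests` with `stream=True`.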
Table of Contents
- Implementation of Web API Endpoints
- Queueing System and Concurrency Limit
- Start the Web server (ASGI server)
- Console chat implementation
- Configuration during development
- Advanced Settings
  - Chat History Persistence
  - Configuration for large scale access
  - Interfacing with login authentication using OAuth
  - Load Balancing on Multi-GPU
  - Load Balancing with Multi-GPU Server
License
Citing ChatStream
@software{chatstream,
  title = {{ChatStream: A streaming chat toolkit for pre-trained large language models (LLM)}},
  author = {{Qualiteg Inc.} (https://qualiteg.com)},
  url = {https://github.com/qualiteg/ChatStream},
  month = {5},
  year = {2023},
  version = {0.15},
}