A streaming chat toolkit for pre-trained large language models (LLMs)
ChatStream
ChatStream is a chat toolkit for pre-trained large language models.
It can be embedded in FastAPI/Starlette-based web applications and Web APIs to perform streaming sentence generation with pre-trained language models under load control.
Installation
pip install chatstream
Quick Start
Install required packages
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117
pip install transformers
pip install "uvicorn[standard]" gunicorn
Implementing a ChatStream server
Implement a streaming chat server for pre-trained models.
import torch
from fastapi import FastAPI, Request
from fastsession import FastSessionMiddleware, MemoryStore
from transformers import AutoTokenizer, AutoModelForCausalLM

from chatstream import ChatStream, ChatPromptTogetherRedPajamaINCITEChat as ChatPrompt

model_path = "togethercomputer/RedPajama-INCITE-Chat-3B-v1"
device = "cuda"  # "cuda" / "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16)
model.to(device)

chat_stream = ChatStream(
    num_of_concurrent_executions=2,  # max number of concurrent sentence-generation tasks
    max_queue_size=5,                # size of the waiting queue
    model=model,
    tokenizer=tokenizer,
    device=device,
    chat_prompt_clazz=ChatPrompt,
)

app = FastAPI()

# Add session middleware to keep a per-user ChatPrompt in the HTTP session
app.add_middleware(FastSessionMiddleware,
                   secret_key="your-session-secret-key",
                   store=MemoryStore(),
                   http_only=True,
                   secure=False,
                   )


@app.post("/chat_stream")
async def stream_api(request: Request):
    # Pass the FastAPI Request object to `handle_chat_stream_request`;
    # queueing and concurrency control are handled automatically
    response = await chat_stream.handle_chat_stream_request(request)
    return response


@app.on_event("startup")
async def startup():
    # Start the queueing system with `start_queue_worker` when the web server starts up
    await chat_stream.start_queue_worker()
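To run the server, start an ASGI server as usual. Assuming the code above is saved as example.py (the module name is an assumption used here for illustration), a single-worker launch looks like:

uvicorn example:app --host 0.0.0.0 --port 8000

Once the server is running, the /chat_stream endpoint can be consumed from any HTTP client that reads the response incrementally. The sketch below is a minimal illustration, not the project's official client: the request field name "user_input" and the plain-text streaming format are assumptions.

import requests

# Minimal streaming-client sketch for the /chat_stream endpoint.
# The field name "user_input" and the plain-text response framing are assumptions;
# consult the ChatStream documentation for the actual request/response contract.
with requests.post(
    "http://localhost:8000/chat_stream",
    data={"user_input": "Hello, who are you?"},  # field name is an assumption
    stream=True,                                  # read the body as it is generated
) as resp:
    resp.raise_for_status()
    for chunk in resp.iter_content(chunk_size=None, decode_unicode=True):
        print(chunk, end="", flush=True)          # print text as it arrives

Because the server stores the per-user ChatPrompt in the HTTP session, a multi-turn client would use a requests.Session() so the session cookie is preserved across requests.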
Table of Contents
- Implementation of Web API Endpoints
- Queueing System and Concurrency Limit
- Start the Web server (ASGI server)
- Console chat implementation
- Configuration during development
- Advanced Settings
  - Chat History Persistence
  - Configuration for large scale access
  - Interfacing with login authentication using OAuth
  - Load Balancing on Multi-GPU
  - Load Balancing with Multi-GPU Server
License
Citing ChatStream
@software{chatstream,
  title = {{ChatStream: A streaming chat toolkit for pre-trained large language models (LLM)}},
  author = {Qualiteg Inc. (https://qualiteg.com)},
  url = {https://github.com/qualiteg/ChatStream},
  month = {5},
  year = {2023},
  version = {0.15},
}