A framework to serve Kuwa Executors, including the LLM model executor.
Model Serving Tutorial
This guide will help you set up model serving after your Kernel is started. By default, the Kernel is served at 127.0.0.1:9000. If your Kernel is already running, you should be ready to host your own model on it.
You can check the current serving status of the Kernel by connecting to http://127.0.0.1:9000/v1.0/worker/debug
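For a quick sanity check from the command line, you can query that endpoint directly (this assumes the default Kernel address; adjust the host and port if you changed them):
curl http://127.0.0.1:9000/v1.0/worker/debug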
Basic Usage
1. Install Required Packages
Make sure you've installed the required packages by running:
pip install -r requirements.txt
The executor is packaged so that after installation you can use the kuwa-executor command directly to start your executor. You can list the available executors with:
> kuwa-executor --list
Available model executors:
debug : [Tool] Debugging executor. It will reflect the last input.
dummy : [Tool] Dummy executor. It will reply fixed message regardless of the user prompt.
geminipro : [Cloud model] Google Gemini-Pro. Need API key.
chatgpt : [Cloud model] OpenAI ChatGPT. Need API key.
huggingface: [On-premises model] Download and run Huggingface model locally.
llamacpp : [On-premises model] Run the GGUF model locally.
Use "kuwa-executor [executor] --help" to get more information.
You can get more detailed instructions for a specific executor type as well:
> kuwa-executor debug --help
usage: kuwa-executor debug [-h] [--access_code ACCESS_CODE] [--version VERSION] [--ignore_kernel] [--https]
[--host HOST] [--port PORT] [--executor_path EXECUTOR_PATH] [--kernel_url KERNEL_URL]
[--log {NOTSET,DEBUG,INFO,WARNING,ERROR,CRITICAL}] [--delay DELAY]
LLM model executor, Please make sure your kernel is working before use.
optional arguments:
-h, --help show this help message and exit
--delay DELAY Inter-token delay (default: 0.02)
General Options:
--access_code ACCESS_CODE
Access code (default: None)
--version VERSION Version of the executor interface (default: v1.0)
--ignore_kernel Ignore kernel (default: False)
--https Register the executor endpoint with https scheme (default: False)
--host HOST The hostname or IP address that will be stored in Kernel, Make sure the location are
accessible by Kernel (default: None)
--port PORT The port to serve. By choosing None, it'll assign an unused port (default: None)
--executor_path EXECUTOR_PATH
The path this model executor is going to use (default: /chat)
--kernel_url KERNEL_URL
Base URL of Kernel's executor management API (default: http://127.0.0.1:9000/)
--log {NOTSET,DEBUG,INFO,WARNING,ERROR,CRITICAL}
The logging level. (default: INFO)
2. Prepare Your Model
If your model is in .gguf format:
kuwa-executor llamacpp --model_path <PATH_TO_YOUR_GGUF> --visible_gpu <CUDA_VISIBLE_DEVICES>
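For example, to serve a quantized GGUF file on GPU 0 (the file name below is hypothetical; substitute the path to the model you actually downloaded):
kuwa-executor llamacpp --model_path ./models/llama-2-7b-chat.Q4_K_M.gguf --visible_gpu 0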
For models loadable by transformers (.safetensors, .bin, .model, etc.) or hosted on Huggingface:
kuwa-executor huggingface --model_path <PATH_TO_MODEL_FOLDER/HUGGINGFACE_MODEL_NAME> --visible_gpu <CUDA_VISIBLE_DEVICES>
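For instance, to download and run the Google Gemma 7B Instruct model (the same model used in the vLLM example later in this guide) from the Huggingface Hub on GPU 0; note that gated models like this one may require a Huggingface token:
kuwa-executor huggingface --model_path google/gemma-7b-it --visible_gpu 0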
3. Connect to a Cloud Model
You can use API keys to connect to cloud models such as Gemini Pro or ChatGPT.
- Start the executor with the following commands. The api_key argument is optional and will default to the global value.
  kuwa-executor geminipro --api_key <YOUR_API_KEY>
  kuwa-executor chatgpt --api_key <YOUR_API_KEY> --model <gpt-3.5-turbo/gpt-4/gpt-4-32k/...>
- By default, these will set up the executor with gemini-pro and chatgpt as the access_code, respectively. If you'd like to adjust the access_code the executor is set up with, you can use --access_code <your_desired_access_code>.
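For example, to register a ChatGPT executor backed by GPT-4 under a custom access code (the access code below is only an illustration; pick any name you like):
kuwa-executor chatgpt --api_key <YOUR_API_KEY> --model gpt-4 --access_code my-gpt4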
Advanced Usage
Detailed Generation Args
Apart from the debug executor, the other executors allow you to specify detailed generation args, either through a config file or command-line arguments. On-premises models also allow you to specify a system prompt and prompt template. For details, use kuwa-executor [executor] --help.
Custom Executors
A Kuwa Executor can be viewed as a function or server that provides a specific piece of functionality. The interface is defined in kuwa.executor.LLMExecutor, which is conceptually a function that takes in the user's chat history and outputs text. You can extend this class to define your own custom executor.
The simplest implementations of LLMExecutor can be seen in debug.py and dummy.py. Here is an explanation of each API (a minimal sketch follows the list):
- __init__: Initialize the service. Make sure to call super().__init__() to complete the initialization.
- extend_arguments: Optionally add command-line arguments. Use the built-in argparse library for parsing arguments.
- setup: Initialize anything you need. The command-line arguments have been parsed at this stage and can be accessed via the self.args variable.
- llm_compute: The main method for handling requests. Please use an asynchronous iterator to implement this method.
- abort: Called when the request is aborted by the user. It is expected to interrupt the current request in progress.
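To make the lifecycle concrete, here is a minimal sketch of a custom executor that echoes the latest message back. It is modeled on the API listed above, but the exact method signatures, the shape of the history argument, and the run() entry point are assumptions; consult debug.py and dummy.py in the source tree for the canonical form.

from kuwa.executor import LLMExecutor

class EchoExecutor(LLMExecutor):
    def __init__(self):
        # Complete the base-class initialization.
        super().__init__()

    def extend_arguments(self, parser):
        # Optionally add executor-specific flags (parser is an argparse parser).
        parser.add_argument("--prefix", default="Echo: ", help="Prefix prepended to every reply.")

    def setup(self):
        # Arguments are parsed by this point and available via self.args.
        self.prefix = self.args.prefix

    async def llm_compute(self, history):
        # Main request handler, implemented as an asynchronous iterator.
        # The chat-history format assumed here (a list of {"role", "content"}
        # dicts) is an assumption; check debug.py for the actual structure.
        last_message = history[-1]["content"] if history else ""
        for token in (self.prefix + last_message).split():
            yield token + " "

    async def abort(self):
        # Called when the user aborts; interrupt any request in progress.
        return "Aborted"

if __name__ == "__main__":
    EchoExecutor().run()

You could then start this script like any other executor and register it with the Kernel under an access code of your choosing.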
Connecting to other Inference Environments
Kuwa Executor can be easily connected to other inference environments, making it easy to integrate with existing open-source software.
Currently, any OpenAI API compatible inference server can be used with the ChatGPT Executor.
Here's an example using vLLM, a high-throughput inference engine.
1. Start the vLLM Server (shown using the Google Gemma 7B Instruct model as an example)
docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HUGGING_FACE_HUB_TOKEN=<secret>" \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:latest \
--model google/gemma-7b-it --dtype half
2. Start Kuwa ChatGPT Executor
kuwa-executor chatgpt --access_code vllm --log debug \
--base_url "http://localhost:8000/v1" `# Change the API base URL to vLLM` \
--api_key dummy `# Dummy API Key` \
--model "google/gemma-7b-it" `# Specify Gemma 7B model`