Skip to main content

Open-source Sky Computing Inference Endpoint

Project description

FlockServe

Open-source Sky Computing Inference Endpoint

Overview

FlockServe is an open-source library for deploying production-ready AI inference endpoints. Similar to closed-source commercial inference endpoints, FlockServe adds the capabilities of autoscaling, load balancing, and monitoring to an inference engine server, turning these into production-ready solutions for serving AI predictions at dynamic request rates and high volumes.

FlockServe uses SkyPilot as the node provisioner, taking SkyPilot task files for single inference engine servers and autoscales these based on request volume. Due to using SkyPilot, an inference endpoint developed with FlockServe will work natively across multiple clouds, and can be migrated between providers just by changing the SkyPilot configuration.

Any inference engine server can be used, and examples of SkyPilot task files are given for vLLM and TGI. Both the OpenAI and /generate APIs used by these are supported. FlockServe runs on FastAPI and uvicorn, so requests are processed fully asynchronously.

FlockServe has a modular design, and different solutions for autoscaling and load balancing can be used. The default option for autoscaling uses a running mean estimate of request queue lengths, which is effective for LLM autoscaling. The default option for load balancing uses Least Connection Load Balancing, which is also well suited for serving LLMs.

Features

  • Scalability: Easily scale your inference endpoint based on the demand using cloud resources.
  • Skypilot Integration: Leverages the power of skypilot for sky computing free from vendor lock-in
  • Flexible Model Support: Supports any inference engine such as vLLM and TGI, and models supported by these.
  • RESTful API: Simple and intuitive API for interacting with the inference endpoint.
  • Monitoring and Logging: Monitor the performance and logs of deployed models for effective debugging and optimization.

Getting Started

Prerequisites

  • Python >= 3.7, < 3.11
  • Docker (if using containerized deployment)

Installation

You can install FlockServe from PyPI with pip:

pip install flockserve

Usage

Running from command line:

flockserve --skypilot_task serving_tgi_cpu_openai.yaml

From Python:

from flockserve import FlockServe
fs = FlockServe(skypilot_task="serving_tgi_cpu_generate.yaml")
fs.run()

The mandatory argument is skypilot_task. The available arguments are:

Argument Default Value Description
skypilot_task Required The path to a YAML file defining the SkyPilot task.
worker_capacity 30 Maximum number of tasks a worker can handle concurrently.
worker_name_prefix 'skypilot-worker' Prefix for naming workers.
host '0.0.0.0' The host IP address to bind the server.
port -1 The port number to listen on. If <0, port is read from skypilot task
worker_ready_path "/health" Path to check worker readiness.
min_workers 1 Minimum number of workers to maintain.
max_workers 2 Maximum number of workers allowed.
autoscale_up 7 Load threshold to trigger scaling up of workers.
autoscale_down 4 Load threshold to trigger scaling down of workers.
queue_tracking_window 600 Time window in seconds to track queue length for autoscaling.
node_control_key None Secret key for node management operations.

Once FlockServe is started, it will print the outputs from SkyPilot, as well as report FlockServe metrics periodically:

INFO:flockserve.flockserve:Workers: 1, Workers Ready: 0, Worker Load: 0, QLRM: 0.0

Once "Workers Ready" is more than 0, you can send requests:

curl -X POST -H "Content-Type: application/json" 0.0.0.0:3000/v1/chat/completions -d "@server_test_tgi_openai.json"

Acknowledgments

FlockServe was developed at JDoodle, the AI-powered online platform for coding.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

flockserve-0.1.3.tar.gz (15.2 kB view details)

Uploaded Source

Built Distribution

flockserve-0.1.3-py2.py3-none-any.whl (22.9 kB view details)

Uploaded Python 2 Python 3

File details

Details for the file flockserve-0.1.3.tar.gz.

File metadata

  • Download URL: flockserve-0.1.3.tar.gz
  • Upload date:
  • Size: 15.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.7

File hashes

Hashes for flockserve-0.1.3.tar.gz
Algorithm Hash digest
SHA256 ba8260f3aa61f08536aa7383f9650cf06c451eaf769efca7e9a400e88c48baf9
MD5 85975decff3af61563a1be8871551748
BLAKE2b-256 02343a5f84da1c2b89463d02bf627860f1f8d0b2e3322a912765f327f458b191

See more details on using hashes here.

File details

Details for the file flockserve-0.1.3-py2.py3-none-any.whl.

File metadata

  • Download URL: flockserve-0.1.3-py2.py3-none-any.whl
  • Upload date:
  • Size: 22.9 kB
  • Tags: Python 2, Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.7

File hashes

Hashes for flockserve-0.1.3-py2.py3-none-any.whl
Algorithm Hash digest
SHA256 f82cde066e1280125a679aa6ec3c0572d88dddca911a8c211e78262bb4a6c012
MD5 0ef2f95512eb205e4cdd8ff2b80471af
BLAKE2b-256 35b52b72433809cdc43b613b11686bc691a5faf9f028c3f580c91db559fc8b99

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page