
MOSEC


Model Serving made Efficient in the Cloud.

Introduction


Mosec is a high-performance and flexible model serving framework for building ML-model-enabled backends and microservices. It bridges the gap between any machine learning model you have just trained and an efficient online service API.

  • Highly performant: web layer and task coordination built with Rust 🦀, which offers blazing speed in addition to efficient CPU utilization powered by async I/O
  • Ease of use: user interface purely in Python 🐍, so users can serve their models in an ML framework-agnostic manner using the same code as they do for offline testing (see the minimal sketch after this list)
  • Dynamic batching: aggregate requests from different users for batched inference and distribute results back
  • Pipelined stages: spawn multiple processes for pipelined stages to handle CPU/GPU/IO mixed workloads
  • Cloud friendly: designed to run in the cloud, with the model warmup, graceful shutdown, and Prometheus monitoring metrics, easily managed by Kubernetes or any container orchestration systems
  • Do one thing well: focus on the online serving part, so users can concentrate on model optimization and business logic
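
As a taste of the Python-only interface, here is a minimal, self-contained sketch of an echo service (illustrative only; it is not one of the official examples):

from mosec import Server, Worker

class Echo(Worker):
    def forward(self, data: dict) -> dict:
        # with the default JSON protocol, `data` is the decoded request body
        return data

if __name__ == "__main__":
    server = Server()
    server.append_worker(Echo, num=1)  # one worker process, no batching
    server.run()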

Installation

Mosec requires Python 3.8 or above (matching the built distributions below). Install the latest PyPI package for Linux x86_64 or macOS x86_64/ARM64 with:

pip install -U mosec
# or install with conda
conda install conda-forge::mosec

To build from the source code, install Rust and run the following command:

make package

You will get a mosec wheel file in the dist folder.

Usage

We demonstrate how Mosec can help you easily host a pre-trained stable diffusion model as a service. You need to install diffusers and transformers as prerequisites:

pip install --upgrade diffusers[torch] transformers

Write the server


Firstly, we import the libraries and set up a basic logger to better observe what happens.

from io import BytesIO
from typing import List

import torch  # type: ignore
from diffusers import StableDiffusionPipeline  # type: ignore

from mosec import Server, Worker, get_logger
from mosec.mixin import MsgpackMixin

logger = get_logger()

Then, we build an API that lets clients submit a text prompt and obtain an image generated by the stable-diffusion-v1-5 model, in just 3 steps.

  1. Define your service as a class that inherits mosec.Worker. Here we also inherit MsgpackMixin to employ the msgpack serialization format (a).

  2. Inside the __init__ method, initialize your model and put it onto the corresponding device. Optionally, you can assign self.example with some data to warm up (b) the model. Note that the data should be compatible with your handler's input format, which we detail next.

  3. Override the forward method to write your service handler (c), with the signature forward(self, data: Any | List[Any]) -> Any | List[Any]. Whether it receives/returns a single item or a list depends on whether dynamic batching (d) is configured.

class StableDiffusion(MsgpackMixin, Worker):
    def __init__(self):
        self.pipe = StableDiffusionPipeline.from_pretrained(
            "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
        )
        device = "cuda" if torch.cuda.is_available() else "cpu"
        self.pipe = self.pipe.to(device)
        self.example = ["useless example prompt"] * 4  # warmup (batch_size=4)

    def forward(self, data: List[str]) -> List[memoryview]:
        logger.debug("generate images for %s", data)
        res = self.pipe(data)
        logger.debug("NSFW: %s", res[1])
        images = []
        for img in res[0]:
            dummy_file = BytesIO()
            img.save(dummy_file, format="JPEG")
            images.append(dummy_file.getbuffer())
        return images

[!NOTE]

(a) In this example we return an image in binary format, which JSON does not support (unless encoded with base64, which makes the payload larger). Hence, msgpack suits our needs better. If we do not inherit MsgpackMixin, JSON will be used by default. In other words, the protocol of the service request/response can be msgpack, JSON, or any other format (check our mixins).

(b) Warm-up usually helps to allocate GPU memory in advance. If the warm-up example is specified, the service will only be ready after the example is forwarded through the handler. However, if no example is given, the first request's latency is expected to be longer. The example should be set as a single item or a list (batch), depending on what forward expects to receive. Moreover, in the case where you want to warm up with multiple different examples, you may set multi_examples (demo here).

(c) This example shows a single-stage service, where the StableDiffusion worker directly takes in the client's prompt request and responds with the image. Thus the forward can be considered a complete service handler. However, we can also design a multi-stage service with workers doing different jobs (e.g., downloading images, model inference, post-processing) in a pipeline. In this case, the whole pipeline is considered the service handler, with the first worker taking in the request and the last worker sending out the response. The data flow between workers is done by inter-process communication (a minimal sketch of such a pipeline follows these notes).

(d) Since dynamic batching is enabled in this example, the forward method is expected to receive a list of strings, e.g., ['a cute cat playing with a red ball', 'a man sitting in front of a computer', ...], aggregated from different clients for batch inference, improving the system throughput.
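
To make note (c) concrete, here is a minimal sketch of a two-stage pipeline. The worker names and logic are illustrative placeholders, not part of the stable diffusion example:

from typing import List

from mosec import Server, Worker

class Preprocess(Worker):
    def forward(self, data: dict) -> str:
        # stage 1 (CPU): extract and normalize the prompt from the JSON request
        return data["prompt"].strip().lower()

class Inference(Worker):
    def forward(self, data: List[str]) -> List[str]:
        # stage 2 (batched): a placeholder for a real model call
        return [f"generated output for: {prompt}" for prompt in data]

if __name__ == "__main__":
    server = Server()
    server.append_worker(Preprocess, num=2)                   # CPU stage
    server.append_worker(Inference, num=1, max_batch_size=8)  # batched stage
    server.run()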

Finally, we append the worker to the server to construct a single-stage workflow (multiple stages can be pipelined to further boost the throughput; see this example). We specify the number of processes we want it to run in parallel (num=1) and the maximum batch size (max_batch_size=4, the maximum number of requests dynamic batching will accumulate before the timeout). The timeout is defined by max_wait_time=10 in milliseconds: the longest time Mosec waits before sending the accumulated batch to the Worker.

if __name__ == "__main__":
    server = Server()
    # 1) `num` specifies the number of processes that will be spawned to run in parallel.
    # 2) By configuring the `max_batch_size` with the value > 1, the input data in your
    # `forward` function will be a list (batch); otherwise, it's a single item.
    server.append_worker(StableDiffusion, num=1, max_batch_size=4, max_wait_time=10)
    server.run()

Run the server


The above snippets are merged into our example file, which you can run directly at the project root. First, have a look at the command line arguments (explanations here):

python examples/stable_diffusion/server.py --help

Then let's start the server with debug logs:

python examples/stable_diffusion/server.py --log-level debug --timeout 30000

Open http://127.0.0.1:8000/openapi/swagger/ in your browser to get the OpenAPI doc.

And in another terminal, test it:

python examples/stable_diffusion/client.py --prompt "a cute cat playing with a red ball" --output cat.jpg --port 8000

You will get an image named "cat.jpg" in the current directory.
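
If you prefer to write your own client, the query above boils down to a few lines. Here is a minimal sketch, assuming requests and msgpack are installed, the server runs on the port 8000 used above, and the inference route is /inference (hedged assumptions; check the provided client.py for the canonical version):

import msgpack  # type: ignore
import requests

prompt = "a cute cat playing with a red ball"
resp = requests.post(
    "http://127.0.0.1:8000/inference",
    data=msgpack.packb(prompt),  # msgpack-encoded request body
)
if resp.status_code == 200:
    # the MsgpackMixin response is msgpack-encoded, so unpack to raw JPEG bytes
    with open("cat.jpg", "wb") as f:
        f.write(msgpack.unpackb(resp.content))
else:
    print(f"request failed: {resp.status_code} {resp.text}")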

You can check the metrics:

curl http://127.0.0.1:8000/metrics

That's it! You have just hosted your stable-diffusion model as a service! 😉

Examples

More ready-to-use examples can be found in the Example section.

Configuration

  • Dynamic batching
    • max_batch_size and max_wait_time (in milliseconds) are configured when you call append_worker.
    • Make sure inference with a batch of max_batch_size won't cause out-of-memory on the GPU.
    • Normally, max_wait_time should be less than the batch inference time.
    • If enabled, it will collect a batch either when the number of accumulated requests reaches max_batch_size or when max_wait_time has elapsed, whichever comes first. For example, with max_batch_size=4 and max_wait_time=10, a batch is dispatched as soon as 4 requests have accumulated, or 10 ms after the first request arrived. The service benefits from this feature when the traffic is high.
  • Check the arguments doc for other configurations.

Deployment

  • If you're looking for a GPU base image with mosec installed, you can check the official image mosecorg/mosec. For more complex use cases, check out envd.
  • This service doesn't need Gunicorn or NGINX, but you can certainly use an ingress controller when necessary.
  • This service should be the PID 1 process in the container since it controls multiple processes. If you need to run multiple processes in one container, you will need a supervisor such as Supervisor or Horust.
  • Remember to collect the metrics.
    • mosec_service_batch_size_bucket shows the batch size distribution.
    • mosec_service_batch_duration_second_bucket shows the duration of dynamic batching for each connection in each stage (starts from receiving the first task).
    • mosec_service_process_duration_second_bucket shows the duration of processing for each connection in each stage (including the IPC time but excluding the batching duration above).
    • mosec_service_remaining_task shows the number of currently processing tasks.
    • mosec_service_throughput shows the service throughput.
  • Stop the service with SIGINT (CTRL+C) or SIGTERM (kill {PID}), since it has graceful shutdown logic.

Performance tuning

  • Find out the best max_batch_size and max_wait_time for your inference service. The metrics show histograms of the real batch size and batch duration, which are the key information for adjusting these two parameters.
  • Try to split the whole inference process into separate CPU and GPU stages (ref DistilBERT). Different stages run as a data pipeline, which keeps the GPU busy.
  • You can also adjust the number of workers in each stage. For example, if your pipeline consists of a CPU stage for preprocessing and a GPU stage for model inference, increasing the number of CPU-stage workers can produce more data to be batched for model inference at the GPU stage, while increasing the GPU-stage workers can fully utilize the GPU memory and computation power. Both ways may contribute to higher GPU utilization and, consequently, higher service throughput.
  • For multi-stage services, note that the data passing through different stages will be serialized/deserialized by the serialize_ipc/deserialize_ipc methods, so extremely large data might make the whole pipeline slow. The serialized data is passed to the next stage through Rust by default; you could enable shared memory to potentially reduce the latency (ref RedisShmIPCMixin).
  • Choose appropriate serialize/deserialize methods, which are used to decode the user request and encode the response. By default, both use JSON. However, images and embeddings are not well supported by JSON. You can choose msgpack, which is faster and supports binary payloads (ref Stable Diffusion).
  • Configure the number of threads for OpenBLAS or MKL, which might not pick the most suitable CPU count for the current Python process. You can configure this for each worker using the env parameter (ref custom GPU allocation); see the sketch after this list.
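
As a sketch of the last point, per-worker environment variables can be passed via append_worker, assuming its env parameter accepts one dict per worker process as in the custom GPU allocation example (the worker and values below are illustrative):

from mosec import Server, Worker

class Inference(Worker):
    def forward(self, data: dict) -> dict:
        return data  # placeholder for a real model call

if __name__ == "__main__":
    server = Server()
    server.append_worker(
        Inference,
        num=2,
        # one dict per worker process: pin the thread count and the GPU
        env=[
            {"OMP_NUM_THREADS": "4", "CUDA_VISIBLE_DEVICES": "0"},
            {"OMP_NUM_THREADS": "4", "CUDA_VISIBLE_DEVICES": "1"},
        ],
    )
    server.run()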

Adopters

A number of companies and individual users are using Mosec in production.

Citation

If you find this software useful for your research, please consider citing:

@software{yang2021mosec,
  title = {{MOSEC: Model Serving made Efficient in the Cloud}},
  author = {Yang, Keming and Liu, Zichen and Cheng, Philip},
  url = {https://github.com/mosecorg/mosec},
  year = {2021}
}

Contributing

We welcome all kinds of contributions. Please give us feedback by raising issues or discussing on Discord. You can also contribute your code directly by opening a pull request!

To start developing, you can use envd to create an isolated and clean Python & Rust environment. Check the envd docs or build.envd for more information.
