
Quick-Deploy: optimize and deploy machine learning models as fast inference APIs.

Project description

Quick-Deploy


Optimize and deploy machine learning models as quickly and easily as possible.

quick-deploy provides tools to optimize, convert and deploy machine learning models as fast inference APIs (low latency and high throughput) served by Triton Inference Server using the ONNX Runtime backend. It supports 🤗 transformers, PyTorch, TensorFlow, SKLearn and XGBoost models.

Get Started

Let's see a quick example by deploying a BERT transformer for GPU inference. quick-deploy already supports 🤗 transformers, so we can pass either the path of a pretrained model or just its name from the Hub:

$ quick-deploy transformers \
    -n my-bert-base \
    -p text-classification \
    -m bert-base-uncased \
    -o ./models \
    --model-type bert \
    --seq-len 128 \
    --cuda

The command above creates the deployment artifacts by optimizing the model and converting it to ONNX. Next, just run the inference server:

$ docker run -it --rm \
    --gpus all \
    --shm-size 256m \
    -p 8000:8000 \
    -p 8001:8001 \
    -p 8002:8002 \
    -v $(pwd)/models:/models nvcr.io/nvidia/tritonserver:21.11-py3 \
    tritonserver --model-repository=/models
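
Once the container is up, you can optionally confirm that Triton is reachable before sending requests. Here is a minimal check with the tritonclient HTTP client, assuming the default port mapping above (8000 for HTTP):

import tritonclient.http

# Connect to the HTTP endpoint exposed by the container started above.
client = tritonclient.http.InferenceServerClient(url="127.0.0.1:8000", verbose=False)
print(client.is_server_live())   # True once Triton has started
print(client.is_server_ready())  # True once all models under /models are loaded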

Now we can use tritonclient to consume our model. The example below uses the HTTP client on port 8000 (Triton also exposes a gRPC endpoint on port 8001):

import numpy as np
import tritonclient.http
from scipy.special import softmax
from transformers import BertTokenizer, TensorType


# Tokenization is done client side, so load the matching tokenizer.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

model_name = "my_bert_base"  # name of the model in the Triton model repository
url = "127.0.0.1:8000"       # Triton HTTP endpoint mapped by the docker command above
model_version = "1"
batch_size = 1

text = "The goal of life is [MASK]."
tokens = tokenizer(text=text, return_tensors=TensorType.NUMPY)

# Connect to Triton and make sure the model is loaded and ready.
triton_client = tritonclient.http.InferenceServerClient(url=url, verbose=False)
assert triton_client.is_model_ready(
    model_name=model_name, model_version=model_version
), f"model {model_name} not yet ready"

# Declare the model inputs and requested output; shape (batch_size, 9) matches
# the 9 tokens the tokenizer produces for this sentence.
input_ids = tritonclient.http.InferInput(name="input_ids", shape=(batch_size, 9), datatype="INT64")
token_type_ids = tritonclient.http.InferInput(name="token_type_ids", shape=(batch_size, 9), datatype="INT64")
attention = tritonclient.http.InferInput(name="attention_mask", shape=(batch_size, 9), datatype="INT64")
model_output = tritonclient.http.InferRequestedOutput(name="output", binary_data=False)

# batch_size is 1, so the tokenizer output is passed through unchanged.
input_ids.set_data_from_numpy(tokens['input_ids'] * batch_size)
token_type_ids.set_data_from_numpy(tokens['token_type_ids'] * batch_size)
attention.set_data_from_numpy(tokens['attention_mask'] * batch_size)

response = triton_client.infer(
    model_name=model_name,
    model_version=model_version,
    inputs=[input_ids, token_type_ids, attention],
    outputs=[model_output],
)

# Raw model output (logits); any post-processing such as softmax happens client side.
token_logits = response.as_numpy("output")
print(token_logits)

Note: This deploys only the model; tokenization and post-processing must be done on the client side (see the sketch below). Full transformers pipeline deployment is coming soon.
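
For example, treating the output above as token-level logits of shape (batch_size, sequence_length, vocab_size), which is an assumption about how the model was exported, the top predictions for the [MASK] position can be recovered on the client like this (reusing np, softmax, tokens, tokenizer and token_logits from the snippet above):

# Locate the [MASK] position in the tokenized input.
mask_index = int(np.where(tokens["input_ids"][0] == tokenizer.mask_token_id)[0][0])

# Turn the logits for that position into probabilities and take the top 5 tokens.
probs = softmax(token_logits[0, mask_index])
top_ids = np.argsort(probs)[::-1][:5]
for token_id in top_ids:
    print(tokenizer.convert_ids_to_tokens(int(token_id)), float(probs[token_id]))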

For more use cases please check the examples page.

Install

Before installing, make sure to pick just the extras for your target framework, e.g. "torch", "sklearn" or "all". There are two options to use quick-deploy: via the Docker container:

$ docker run --rm -it rodrigobaron/quick-deploy:0.1.1-all --help

or install the python library quick-deploy:

$ pip install quick-deploy[all]

Note: This installs the full version with all extras; use a single target extra instead (e.g. quick-deploy[torch]) for a lighter install.

Contributing

Please follow the Contributing guide.

License

Apache License 2.0

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

quick-deploy-0.2.2.tar.gz (17.5 kB)

Uploaded Source

Built Distribution

quick_deploy-0.2.2-py3-none-any.whl (25.3 kB)

Uploaded Python 3

File details

Details for the file quick-deploy-0.2.2.tar.gz.

File metadata

  • Download URL: quick-deploy-0.2.2.tar.gz
  • Upload date:
  • Size: 17.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.7.1 importlib_metadata/4.10.1 pkginfo/1.8.2 requests/2.27.1 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.9

File hashes

Hashes for quick-deploy-0.2.2.tar.gz

  • SHA256: 37a07e344ed3d38b7d71e01588863e1262e4cbe102aeaf683bfad9ef390704ca
  • MD5: 798249f025c1c7dc20bbdffa268f2534
  • BLAKE2b-256: 49576f8b4ba015146e5b7eb5dd6a21f083f5f52bd577ea6940afdc5a6247ade6

See more details on using hashes here.

File details

Details for the file quick_deploy-0.2.2-py3-none-any.whl.

File metadata

  • Download URL: quick_deploy-0.2.2-py3-none-any.whl
  • Upload date:
  • Size: 25.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.7.1 importlib_metadata/4.10.1 pkginfo/1.8.2 requests/2.27.1 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.9

File hashes

Hashes for quick_deploy-0.2.2-py3-none-any.whl

  • SHA256: 3cbb6786ed1668a1de7aea215ee04e862982f15ef95c4e9181d70fd2e7148ddf
  • MD5: cd86a72d65beaf7ee8f1cf846d99d1d4
  • BLAKE2b-256: a08d707005de026acfed69b5fe4031dec1997461439588e8c71f71cd7914e7d5

See more details on using hashes here.
