Quick-Deploy optimizes and deploys machine learning models as a fast inference API.
Project description
Quick-Deploy
Optimize and deploy machine learning models as fast and easily as possible.
quick-deploy provides tools to optimize, convert and deploy machine learning models as a fast inference API (low latency and high throughput) on Triton Inference Server using the ONNX Runtime backend. It supports 🤗 Transformers, PyTorch, TensorFlow, scikit-learn and XGBoost models.
Get Started
Let's see a quick example by deploying a BERT transformer for GPU inference. quick-deploy already supports 🤗 Transformers, so we can specify either the path of a pretrained model or just its name from the Hub:
$ quick-deploy transformers \
-n my-bert-base \
-p text-classification \
-m bert-base-uncased \
-o ./models \
--model-type bert \
--seq-len 128 \
--cuda
The command above creates the deployment artifacts by optimizing and converting the model to ONNX. Next, just run the inference server:
$ docker run -it --rm \
--gpus all \
--shm-size 256m \
-p 8000:8000 \
-p 8001:8001 \
-p 8002:8002 \
-v $(pwd)/models:/models nvcr.io/nvidia/tritonserver:21.11-py3 \
tritonserver --model-repository=/models
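Once the server is up, you can confirm it is ready to accept requests by hitting Triton's standard HTTP health endpoint:

$ curl -v localhost:8000/v2/health/ready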
Now we can use tritonclient, which here talks to the server's HTTP endpoint, to consume our model:
import numpy as np
import tritonclient.http
from scipy.special import softmax
from transformers import BertTokenizer, TensorType
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model_name = "my-bert-base"
url = "127.0.0.1:8000"
model_version = "1"
batch_size = 1
text = "The goal of life is [MASK]."
tokens = tokenizer(text=text, return_tensors=TensorType.NUMPY)
triton_client = tritonclient.http.InferenceServerClient(url=url, verbose=False)
assert triton_client.is_model_ready(
model_name=model_name, model_version=model_version
), f"model {model_name} not yet ready"
input_ids = tritonclient.http.InferInput(name="input_ids", shape=(batch_size, 9), datatype="INT64")
token_type_ids = tritonclient.http.InferInput(name="token_type_ids", shape=(batch_size, 9), datatype="INT64")
attention = tritonclient.http.InferInput(name="attention_mask", shape=(batch_size, 9), datatype="INT64")
model_output = tritonclient.http.InferRequestedOutput(name="output", binary_data=False)
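# Fill the input tensors with the tokenized arrays (with batch_size = 1 they are used as-is).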
input_ids.set_data_from_numpy(tokens['input_ids'] * batch_size)
token_type_ids.set_data_from_numpy(tokens['token_type_ids'] * batch_size)
attention.set_data_from_numpy(tokens['attention_mask'] * batch_size)
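# Send the request and read back the raw logits from the "output" tensor.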
response = triton_client.infer(
model_name=model_name,
model_version=model_version,
inputs=[input_ids, token_type_ids, attention],
outputs=[model_output],
)
token_logits = response.as_numpy("output")
print(token_logits)
Note: This covers only model deployment; tokenization and post-processing should be done on the client side. Full transformers deployment is coming soon.
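As a minimal sketch of that client-side post-processing (assuming the output tensor holds raw logits with the class or vocabulary scores on the last axis), the softmax already imported in the snippet can turn them into probabilities:

# Client-side post-processing sketch: convert the raw logits returned by
# Triton into probabilities along the last axis.
probabilities = softmax(token_logits, axis=-1)
print(probabilities)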
For more use cases please check the examples page.
Install
Before installing, make sure to install only the extra for your target framework, e.g. "torch", "sklearn" or "all". There are two ways to use quick-deploy. By Docker container:
$ docker run --rm -it rodrigobaron/quick-deploy:0.1.1-all --help
or by installing the Python library quick-deploy:
$ pip install quick-deploy[all]
Note: This will install the full version (all).
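For example, assuming the extras follow the framework names above, a PyTorch-only install would look like:

$ pip install quick-deploy[torch]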
Contributing
Please follow the Contributing guide.
License
Project details
Download files
Source Distribution
Built Distribution
File details
Details for the file quick-deploy-0.2.2.tar.gz.
File metadata
- Download URL: quick-deploy-0.2.2.tar.gz
- Upload date:
- Size: 17.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.7.1 importlib_metadata/4.10.1 pkginfo/1.8.2 requests/2.27.1 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.9
File hashes
Algorithm | Hash digest
---|---
SHA256 | 37a07e344ed3d38b7d71e01588863e1262e4cbe102aeaf683bfad9ef390704ca
MD5 | 798249f025c1c7dc20bbdffa268f2534
BLAKE2b-256 | 49576f8b4ba015146e5b7eb5dd6a21f083f5f52bd577ea6940afdc5a6247ade6
File details
Details for the file quick_deploy-0.2.2-py3-none-any.whl.
File metadata
- Download URL: quick_deploy-0.2.2-py3-none-any.whl
- Upload date:
- Size: 25.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.7.1 importlib_metadata/4.10.1 pkginfo/1.8.2 requests/2.27.1 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.9
File hashes
Algorithm | Hash digest
---|---
SHA256 | 3cbb6786ed1668a1de7aea215ee04e862982f15ef95c4e9181d70fd2e7148ddf
MD5 | cd86a72d65beaf7ee8f1cf846d99d1d4
BLAKE2b-256 | a08d707005de026acfed69b5fe4031dec1997461439588e8c71f71cd7914e7d5