💥 Speedster
Speedster reduces inference costs by leveraging SOTA optimization techniques that best couple your AI models with the underlying hardware (GPUs and CPUs). The idea is to make AI inference way cheaper in just a few lines of code.
Speedster makes it easy to combine optimization techniques across the whole software-to-hardware stack, delivering best-in-class cost savings. If you like the idea, give us a star to support the project ⭐
The core Speedster workflow consists of 3 steps:
- Select: input your model in your preferred DL framework and express your preferences (sketched in code after this list) regarding:
  - Accuracy loss: do you want to trade off a little accuracy for significant cost savings?
  - Optimization time: achieving great savings can be time-consuming. Can you wait, or do you need an instant answer?
- Search: the library automatically tests every combination of optimization techniques across the software-to-hardware stack (sparsity, quantization, compilers, etc.) that is compatible with your needs and local hardware.
- Serve: finally, Speedster chooses the best configuration of optimization techniques and returns an accelerated version of your model in the DL framework of your choice (just cheaper 🚀).
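These preferences map directly onto arguments of optimize_model, as the framework-specific examples below show. A minimal sketch (the model and input_data here are placeholders):
from speedster import optimize_model

optimized_model = optimize_model(
    model,                            # your model, in its original DL framework
    input_data=input_data,            # a list of sample inputs (or a dataloader)
    optimization_time="constrained",  # "constrained" if you need a quick answer, "unconstrained" if you can wait for a deeper search
    metric_drop_ths=0.05              # maximum accuracy drop you accept in exchange for cost savings
)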
Installation
Install Speedster and its base requirements:
pip install speedster
Then make sure to install all the available deep learning compilers:
python -m nebullvm.installers.auto_installer --compilers all
:warning: For macOS with ARM processors, please use a conda environment. Moreover, if you want to optimize a PyTorch model, PyTorch must be pre-installed in your environment before proceeding to the next step; please install it from this link.
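For example, on an Apple Silicon Mac a fresh conda environment could look like this (environment name and Python version are illustrative; install PyTorch following the linked instructions for your platform):
conda create -n speedster python=3.10
conda activate speedster
pip install speedster
python -m nebullvm.installers.auto_installer --compilers all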
For more details on how to install Speedster, please visit our Installation guide.
Quick start
Only one line of code - that’s what you need to accelerate your model! Find below your getting started guide for 5 different input model frameworks:
🔥 PyTorch
In this section, we will learn about the 4 main steps needed to optimize PyTorch models:
- Input your model and data
- Run the optimization
- Save your optimized model
- Load and run your optimized model in production
import torch
import torchvision.models as models
from speedster import optimize_model, save_model
#1 Provide input model and data (we support PyTorch Dataloaders and custom input, see the docs to learn more)
model = models.resnet50()
input_data = [((torch.randn(1, 3, 256, 256), ), torch.tensor([0])) for _ in range(100)]
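# As noted above, a torch DataLoader can also be passed as input_data.
# Hypothetical sketch (see the docs for the exact expected format):
# from torch.utils.data import DataLoader, TensorDataset
# dataset = TensorDataset(torch.randn(100, 3, 256, 256), torch.zeros(100, dtype=torch.long))
# input_data = DataLoader(dataset, batch_size=1)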
#2 Run Speedster optimization
optimized_model = optimize_model(
    model,
    input_data=input_data,
    optimization_time="constrained",
    metric_drop_ths=0.05
)
#3 Save the optimized model
save_model(optimized_model, "model_save_path")
Once the optimization is completed, start using the accelerated model (on steroids 🚀) in your DL framework of choice.
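In the snippet below, input_sample stands for a single input with the same shape used during optimization; for example (illustrative, replace with your own data):
import torch
input_sample = torch.randn(1, 3, 256, 256)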
#4 Load and run your PyTorch accelerated model in production
from speedster import load_model
optimized_model = load_model("model_save_path")
output = optimized_model(input_sample)
For more details, please visit Getting Started with PyTorch Optimization.
🤗 Hugging Face Transformers
In this section, we will learn about the 4 main steps needed to optimize 🤗 Hugging Face Transformer models:
- Input your model and data
- Run the optimization
- Save your optimized model
- Load and run your optimized model in production
✅ For Decoder-only or Encoder-only architectures (Bert, GPT, etc.)
from transformers import AlbertModel, AlbertTokenizer
from speedster import optimize_model, save_model

#1a. Provide input model: Load Albert as an example
model = AlbertModel.from_pretrained("albert-base-v1")
tokenizer = AlbertTokenizer.from_pretrained("albert-base-v1")

#1b. Dictionary input format (also string format is accepted, see the docs to learn more)
text = "This is an example text for the huggingface model."
input_dict = tokenizer(text, return_tensors="pt")
input_data = [input_dict for _ in range(100)]

#2 Run Speedster optimization (if input data is in string format, also the tokenizer
# should be given as input argument, see the docs to learn more)
optimized_model = optimize_model(
    model,
    input_data=input_data,
    optimization_time="constrained",
    metric_drop_ths=0.05
)

#3 Save the optimized model
save_model(optimized_model, "model_save_path")
Once the optimization is completed, start using the accelerated model (on steroids 🚀) in your DL framework of choice.
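Here input_sample is assumed to be a tokenized input in the same dictionary format used above, for example:
input_sample = tokenizer("This is an example text for the huggingface model.", return_tensors="pt")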
#4 Load and run your Huggingface accelerated model in production
from speedster import load_model
optimized_model = load_model("model_save_path")
output = optimized_model(**input_sample)
For more details, please visit Getting Started with HuggingFace optimization.
✅ For Encoder-Decoder architectures (T5, etc.)
from transformers import T5Tokenizer, T5ForConditionalGeneration
from speedster import optimize_model, save_model

#1a. Provide input model: Load T5 as an example
model = T5ForConditionalGeneration.from_pretrained("t5-small")
tokenizer = T5Tokenizer.from_pretrained("t5-small")

#1b. Dictionary input format
question = "What's the meaning of life?"
answer = "The answer is:"
input_dict = tokenizer(question, return_tensors="pt")
input_dict["decoder_input_ids"] = tokenizer(answer, return_tensors="pt").input_ids
input_data = [input_dict for _ in range(100)]

#2 Run Speedster optimization (if input data is in string format, also the tokenizer
# should be given as input argument, see the docs to learn more)
optimized_model = optimize_model(
    model,
    input_data=input_data,
    optimization_time="constrained",
    metric_drop_ths=0.05
)

#3 Save the optimized model
save_model(optimized_model, "model_save_path")
Once the optimization is completed, start using the accelerated model (on steroids 🚀) in your DL framework of choice.
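Here input_sample is assumed to be a dictionary containing both encoder and decoder inputs, mirroring the format used above, for example:
input_sample = tokenizer("What's the meaning of life?", return_tensors="pt")
input_sample["decoder_input_ids"] = tokenizer("The answer is:", return_tensors="pt").input_ids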
#4 Load and run your Huggingface accelerated model in production
from speedster import load_model
optimized_model = load_model("model_save_path")
output = optimized_model(**input_sample)
For more details, please visit Getting Started with HuggingFace optimization.
🧨 Hugging Face Diffusers
:warning: In order to work properly, the diffusers optimization requires CUDA>=12.0, tensorrt>=8.6.0 and torch<=1.13.1. For additional details, please see the docs here.
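A quick way to sanity-check your environment against these requirements (a simple sketch; the system-level CUDA version is typically reported by nvidia-smi):
import torch
import tensorrt

print(torch.__version__)          # expected <= 1.13.1
print(tensorrt.__version__)       # expected >= 8.6.0
print(torch.cuda.is_available())  # should be True; system CUDA should be >= 12.0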
In this section, we will learn about the 4 main steps needed to optimize Stable Diffusion models from the Diffusers library:
- Input your model and data
- Run the optimization
- Save your optimized model
- Load and run your optimized model in production
import torch
from diffusers import StableDiffusionPipeline
from speedster import optimize_model, save_model
#1 Provide input model and data
model_id = "CompVis/stable-diffusion-v1-4"
device = "cuda" if torch.cuda.is_available() else "cpu"
if device == "cuda":
    # On GPU we load the model in half precision by default, because it's faster and lighter.
    pipe = StableDiffusionPipeline.from_pretrained(model_id, revision='fp16', torch_dtype=torch.float16)
else:
    pipe = StableDiffusionPipeline.from_pretrained(model_id)
# Create some example input data
input_data = [
    "a photo of an astronaut riding a horse on mars",
    "a monkey eating a banana in a forest",
    "white car on a road surrounded by palm trees",
    "a fridge full of bottles of beer",
    "madara uchiha throwing asteroids against people"
]
#2 Run Speedster optimization
optimized_model = optimize_model(
    model=pipe,
    input_data=input_data,
    optimization_time="unconstrained",
    ignore_compilers=["torch_tensor_rt", "tvm"],
    metric_drop_ths=0.1,
)
#3 Save the optimized model
save_model(optimized_model, "model_save_path")
Once the optimization is completed, start using the accelerated model (on steroids 🚀).
#4 Load and run your Stable Diffusion accelerated model in production
from speedster import load_model
optimized_model = load_model("model_save_path", pipe=pipe)
test_prompt = "futuristic llama with a cyberpunk city on the background"
output = optimized_model(test_prompt).images[0]
For more details, please visit Getting Started with Stable Diffusion optimization.
🌊 TensorFlow/Keras
In this section, we will learn about the 4 main steps needed to optimize TensorFlow/Keras models:
- Input your model and data
- Run the optimization
- Save your optimized model
- Load and run your optimized model in production
import tensorflow as tf
from tensorflow.keras.applications.resnet50 import ResNet50
from speedster import optimize_model, save_model
#1 Provide input model and data (we support Keras dataset and custom input, see the docs to learn more)
model = ResNet50()
input_data = [((tf.random.normal([1, 224, 224, 3]),), tf.constant([0])) for _ in range(100)]
#2 Run Speedster optimization
optimized_model = optimize_model(
    model,
    input_data=input_data,
    optimization_time="constrained",
    metric_drop_ths=0.05
)
#3 Save the optimized model
save_model(optimized_model, "model_save_path")
Once the optimization is completed, start using the accelerated model (on steroids 🚀) in your DL framework of choice.
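Here input_sample is assumed to be a single tensor with the same shape used during optimization, for example:
import tensorflow as tf
input_sample = tf.random.normal([1, 224, 224, 3])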
#4 Load and run your TensorFlow accelerated model in production
from speedster import load_model
optimized_model = load_model("model_save_path")
output = optimized_model(input_sample)
For more details, please visit Getting Started with TensorFlow optimization.
⚡ ONNX
In this section, we will learn about the 4 main steps needed to optimize ONNX models:
- Input your model and data
- Run the optimization
- Save your optimized model
- Load and run your optimized model in production
import numpy as np
from speedster import optimize_model, save_model
#1 Provide input model and data
# Model was downloaded from here:
# https://github.com/onnx/models/tree/main/vision/classification/resnet
model = "resnet50-v1-12.onnx"
input_data = [((np.random.randn(1, 3, 224, 224).astype(np.float32), ), np.array([0])) for _ in range(100)]
#2 Run Speedster optimization
optimized_model = optimize_model(
    model,
    input_data=input_data,
    optimization_time="constrained",
    metric_drop_ths=0.05
)
#3 Save the optimized model
save_model(optimized_model, "model_save_path")
Once the optimization is completed, start using the accelerated model (on steroids 🚀) in your DL framework of choice.
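Here input_sample is assumed to be a numpy array with the same shape and dtype used during optimization, for example:
import numpy as np
input_sample = np.random.randn(1, 3, 224, 224).astype(np.float32)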
#4 Load and run your ONNX accelerated model in production
from speedster import load_model
optimized_model = load_model("model_save_path")
output = optimized_model(input_sample)
For more details, please visit Getting Started with ONNX optimization.
Documentation
- Installation
- Getting started with PyTorch optimization
- Getting started with Hugging Face optimization
- Getting started with Stable Diffusion optimization
- Getting started with TensorFlow optimization
- Getting started with ONNX optimization
- Key concepts
- Notebooks
- Advanced options
- Benchmarks
Key concepts
Speedster's design reflects our mission to automatically master each and every existing AI acceleration technique to deliver the most cost-efficient AI ever. As a result, Speedster leverages available enterprise-grade open-source optimization tools. Where these tools and communities already exist and are distributed under a permissive license (Apache, MIT, etc.), we integrate them and happily contribute to their communities. However, many tools do not exist yet, in which case we implement them and open-source the code so that our community can benefit from it.
Speedster is shaped around 4 building blocks and leverages a modular design to foster scalability and integration of new acceleration components across the software-to-hardware stack.
- Converter: converts the input model from its original framework to the framework backends supported by Speedster, namely PyTorch, ONNX and TensorFlow. This allows the Compressor and Compiler modules to apply any optimization technique to the model.
- Compressor: applies various compression techniques to the model, such as pruning, knowledge distillation, or quantization-aware training.
- Compiler: converts the compressed models to the intermediate representation (IR) of the supported deep learning compilers. The compilers apply both post-training quantization techniques and graph optimizations, to produce compiled binary files.
- Inference Learner: takes the best performing compiled model and converts it back into the same interface as the original input model.
The compressor stage leverages the following open-source projects:
- Intel/neural-compressor: aims to provide unified APIs for network compression technologies, such as low-precision quantization, sparsity, pruning, and knowledge distillation, across different deep learning frameworks to pursue optimal inference performance.
- SparseML: libraries for applying sparsification recipes to neural networks with a few lines of code, enabling faster and smaller models.
The compiler stage leverages the following open-source projects:
- Apache TVM: open deep learning compiler stack for CPUs, GPUs and specialized accelerators.
- BladeDISC: end-to-end Dynamic Shape Compiler project for machine learning workloads.
- DeepSparse: neural network inference engine that delivers GPU-class performance for sparsified models on CPUs.
- OpenVINO: open-source toolkit for optimizing and deploying AI inference.
- ONNX Runtime: cross-platform, high-performance ML inferencing and training accelerator.
- TensorRT: C++ library for high performance inference on NVIDIA GPUs and deep learning accelerators.
- TFlite and XLA: open-source libraries to accelerate TensorFlow models.
Community
We’re developing Speedster for and together with our community, so please get in touch on GitHub or Discord.
- GitHub issues: suggest new acceleration components, request new features, and report bugs and improvements.
- Discord: learn about AI acceleration, share exciting projects and hang out with our global community.
The best way to get started is to pick a good-first issue. Please read our contribution guidelines for a deep dive into how to best contribute to our project!
Don't forget to leave a star ⭐ to support the project and happy acceleration 🚀