💥 Speedster

Speedster reduces inference costs by leveraging SOTA optimization techniques that best couple your AI models with the underlying hardware (GPUs and CPUs). The idea is to make AI inference way cheaper in just a few lines of code.

Speedster makes it easy to combine optimization techniques across the whole software-to-hardware stack, delivering best-in-class cost savings. If you like the idea, give us a star to support the project ⭐


The core Speedster workflow consists of 3 steps:

  • Select: input your model in your preferred DL framework and express your preferences regarding:
    • Accuracy loss: do you want to trade off a little accuracy for significant cost savings?
    • Optimization time: achieving great savings can be time-consuming. Can you wait, or do you need an instant answer?
  • Search: the library automatically tests every combination of optimization techniques across the software-to-hardware stack (sparsity, quantization, compilers, etc.) that is compatible with your needs and local hardware.
  • Serve: finally, Speedster chooses the best configuration of optimization techniques and returns an accelerated version of your model in the DL framework of your choice (just cheaper 🚀).
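
In practice, these preferences map directly onto the arguments of optimize_model, which you will meet again in the quick start below. A minimal sketch (model and input_data stand for your own model and sample inputs):

from speedster import optimize_model

optimized_model = optimize_model(
    model,                            # model in your preferred DL framework
    input_data=input_data,            # sample inputs used during the Search step
    optimization_time="constrained",  # "constrained" for a faster search, "unconstrained" to allow a longer one
    metric_drop_ths=0.05              # maximum accuracy drop you accept in exchange for savings
)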

Installation

Install Speedster and its base requirements:

pip install speedster

Then make sure to install all the available deep learning compilers:

python -m nebullvm.installers.auto_installer --compilers all

:warning: For macOS with ARM processors, please use a conda environment. Moreover, if you want to optimize a PyTorch model, PyTorch must be pre-installed in your environment before proceeding to the next step; please install it from this link.
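
For example, a minimal setup on an ARM Mac might look like the following (the environment name and Python version are only examples; install the PyTorch build appropriate for your machine from the link above):

conda create -n speedster python=3.9
conda activate speedster
pip install torch          # PyTorch must be installed before running the auto-installer
pip install speedster
python -m nebullvm.installers.auto_installer --compilers all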

For more details on how to install Speedster, please visit our Installation guide.

Quick start

Only one line of code is needed to accelerate your model! Below you will find a getting-started guide for 5 different input frameworks:

🔥 PyTorch

In this section, we will learn about the 4 main steps needed to optimize PyTorch models:

  1. Input your model and data
  2. Run the optimization
  3. Save your optimized model
  4. Load and run your optimized model in production
import torch
import torchvision.models as models
from speedster import optimize_model, save_model

#1 Provide input model and data (we support PyTorch Dataloaders and custom input, see the docs to learn more)
model = models.resnet50()  
input_data = [((torch.randn(1, 3, 256, 256), ), torch.tensor([0])) for _ in range(100)]

#2 Run Speedster optimization
optimized_model = optimize_model(
    model, 
    input_data=input_data, 
    optimization_time="constrained",
    metric_drop_ths=0.05
)

#3 Save the optimized model
save_model(optimized_model, "model_save_path")

Once the optimization is completed, start using the accelerated model (on steroids 🚀) in your DL framework of choice.

#4 Load and run your PyTorch accelerated model in production
import torch
from speedster import load_model

optimized_model = load_model("model_save_path")

# Example input with the same shape used during optimization
input_sample = torch.randn(1, 3, 256, 256)
output = optimized_model(input_sample)

For more details, please visit Getting Started with PyTorch Optimization.
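
As noted in the comment in step 1, input_data can also be a PyTorch DataLoader rather than a hand-built list. A minimal sketch, assuming a standard torch.utils.data DataLoader that yields (input, label) batches:

from torch.utils.data import DataLoader, TensorDataset

# Toy dataset with the same input shape used above
dataset = TensorDataset(torch.randn(100, 3, 256, 256), torch.zeros(100, dtype=torch.long))
dataloader = DataLoader(dataset, batch_size=1)

optimized_model = optimize_model(
    model,
    input_data=dataloader,            # DataLoader instead of a list of tuples
    optimization_time="constrained",
    metric_drop_ths=0.05
)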

🤗 Hugging Face Transformers

In this section, we will learn about the 4 main steps needed to optimize 🤗 Hugging Face Transformer models:

  1. Input your model and data
  2. Run the optimization
  3. Save your optimized model
  4. Load and run your optimized model in production
  • ✅ For Decoder-only or Encoder-only architectures (BERT, GPT, etc.)
    from transformers import AlbertModel, AlbertTokenizer
    from speedster import optimize_model, save_model
    
    #1a. Provide input model: Load Albert as an example
    model = AlbertModel.from_pretrained("albert-base-v1")
    tokenizer = AlbertTokenizer.from_pretrained("albert-base-v1")
    
    #1b. Dictionary input format (string format is also accepted, see the docs to learn more)
    text = "This is an example text for the huggingface model."
    input_dict = tokenizer(text, return_tensors="pt")
    input_data = [input_dict for _ in range(100)]
    
    #2 Run Speedster optimization (if input data is in string format, the tokenizer 
    # should also be given as an input argument, see the docs to learn more)
    optimized_model = optimize_model(
        model, 
        input_data=input_data, 
        optimization_time="constrained",
        metric_drop_ths=0.05
    )
    
    #3 Save the optimized model
    save_model(optimized_model, "model_save_path")
    

    Once the optimization is completed, start using the accelerated model (on steroids 🚀) in your DL framework of choice.

    #4 Load and run your Hugging Face accelerated model in production
    from transformers import AlbertTokenizer
    from speedster import load_model

    optimized_model = load_model("model_save_path")

    # Rebuild an example input with the same structure used during optimization
    tokenizer = AlbertTokenizer.from_pretrained("albert-base-v1")
    input_sample = tokenizer("This is an example text for the huggingface model.", return_tensors="pt")
    output = optimized_model(**input_sample)
    

    For more details, please visit Getting Started with HuggingFace optimization.

  • ✅ For Encoder-Decoder architectures (T5, etc.)
    from transformers import T5Tokenizer, T5ForConditionalGeneration
    from speedster import optimize_model, save_model
    
    #1a. Provide input model: Load T5 as an example
    model = T5ForConditionalGeneration.from_pretrained("t5-small")
    tokenizer = T5Tokenizer.from_pretrained("t5-small") 
    
    #1b. Dictionary input format
    question = "What's the meaning of life?"
    answer = "The answer is:"
    input_dict = tokenizer(question, return_tensors="pt")
    input_dict["decoder_input_ids"] = tokenizer(answer, return_tensors="pt").input_ids
    input_data = [input_dict for _ in range(100)]
    
    #2 Run Speedster optimization (if input data is in string format, the tokenizer 
    # should also be given as an input argument, see the docs to learn more)
    optimized_model = optimize_model(
        model, 
        input_data=input_data, 
        optimization_time="constrained",
        metric_drop_ths=0.05
    )
    
    #3 Save the optimized model
    save_model(optimized_model, "model_save_path")
    

    Once the optimization is completed, start using the accelerated model (on steroids 🚀) in your DL framework of choice.

    #4 Load and run your Hugging Face accelerated model in production
    from transformers import T5Tokenizer
    from speedster import load_model

    optimized_model = load_model("model_save_path")

    # Rebuild an example input with the same structure used during optimization
    tokenizer = T5Tokenizer.from_pretrained("t5-small")
    input_sample = tokenizer("What's the meaning of life?", return_tensors="pt")
    input_sample["decoder_input_ids"] = tokenizer("The answer is:", return_tensors="pt").input_ids
    output = optimized_model(**input_sample)
    

    For more details, please visit Getting Started with HuggingFace optimization.
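
Both examples above pass tokenized dictionaries as input_data. As noted in the code comments, raw strings can be passed instead; in that case the tokenizer must also be handed to optimize_model. A minimal sketch for the encoder-only case (the tokenizer argument name is an assumption here; see the docs for the exact signature):

texts = ["This is an example text for the huggingface model."] * 100
optimized_model = optimize_model(
    model,
    input_data=texts,
    tokenizer=tokenizer,              # assumed argument name for string inputs
    optimization_time="constrained",
    metric_drop_ths=0.05
)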

🧨 Hugging Face Diffusers

:warning: In order to work properly, the Diffusers optimization requires CUDA>=12.0, tensorrt>=8.6.0 and torch<=1.13.1. For additional details, please see the docs here.
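
A minimal environment sketch matching these constraints (the commands and versions below are only an example; CUDA >= 12.0 must already be available on the system, and the linked docs list the supported combinations):

pip install "torch==1.13.1"
pip install "tensorrt>=8.6.0"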

In this section, we will learn about the 4 main steps needed to optimize Stable Diffusion models from the Diffusers library:

  1. Input your model and data
  2. Run the optimization
  3. Save your optimized model
  4. Load and run your optimized model in production
import torch
from diffusers import StableDiffusionPipeline
from speedster import optimize_model, save_model

#1 Provide input model and data
model_id = "CompVis/stable-diffusion-v1-4"
device = "cuda" if torch.cuda.is_available() else "cpu"

if device == "cuda":
    # On GPU, we load the model in half precision by default, because it's faster and lighter.
    pipe = StableDiffusionPipeline.from_pretrained(model_id, revision='fp16', torch_dtype=torch.float16)
else:
    pipe = StableDiffusionPipeline.from_pretrained(model_id)

# Create some example input data
input_data = [
    "a photo of an astronaut riding a horse on mars",
    "a monkey eating a banana in a forest",
    "white car on a road surrounded by palm trees",
    "a fridge full of bottles of beer",
    "madara uchiha throwing asteroids against people"
]

#2 Run Speedster optimization
optimized_model = optimize_model(
    model=pipe,
    input_data=input_data,
    optimization_time="unconstrained",
    ignore_compilers=["torch_tensor_rt", "tvm"],
    metric_drop_ths=0.1,
)

#3 Save the optimized model
save_model(optimized_model, "model_save_path")

Once the optimization is completed, start using the accelerated model (on steroids 🚀).

#4 Load and run your accelerated Stable Diffusion pipeline in production
from speedster import load_model

optimized_model = load_model("model_save_path", pipe=pipe)

test_prompt = "futuristic llama with a cyberpunk city on the background"
output = optimized_model(test_prompt).images[0]

For more details, please visit Getting Started with Stable Diffusion optimization.

🌊 TensorFlow/Keras

In this section, we will learn about the 4 main steps needed to optimize TensorFlow/Keras models:

  1. Input your model and data
  2. Run the optimization
  3. Save your optimized model
  4. Load and run your optimized model in production
import tensorflow as tf
from tensorflow.keras.applications.resnet50 import ResNet50
from speedster import optimize_model, save_model

#1 Provide input model and data (we support Keras dataset and custom input, see the docs to learn more)
model = ResNet50() 
input_data = [((tf.random.normal([1, 224, 224, 3]),), tf.constant([0])) for _ in range(100)]

#2 Run Speedster optimization
optimized_model = optimize_model(
    model, 
    input_data=input_data, 
    optimization_time="constrained",
    metric_drop_ths=0.05
)

#3 Save the optimized model
save_model(optimized_model, "model_save_path")

Once the optimization is completed, start using the accelerated model (on steroids 🚀) in your DL framework of choice.

#4 Load and run your TensorFlow accelerated model in production
import tensorflow as tf
from speedster import load_model

optimized_model = load_model("model_save_path")

# Example input with the same shape used during optimization
input_sample = tf.random.normal([1, 224, 224, 3])
output = optimized_model(input_sample)

For more details, please visit Getting Started with TensorFlow optimization.
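
As noted in the comment in step 1, input_data can also be provided as a dataset instead of a hand-built list. A minimal sketch, assuming a tf.data.Dataset of (input, label) pairs is accepted (see the docs for the exact supported formats):

# Toy dataset with the same input shape used above
dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.normal([100, 224, 224, 3]), tf.zeros([100], dtype=tf.int32))
).batch(1)

optimized_model = optimize_model(
    model,
    input_data=dataset,               # dataset instead of a list of tuples
    optimization_time="constrained",
    metric_drop_ths=0.05
)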

⚡ ONNX

In this section, we will learn about the 4 main steps needed to optimize ONNX models:

  1. Input your model and data
  2. Run the optimization
  3. Save your optimized model
  4. Load and run your optimized model in production
import numpy as np
from speedster import optimize_model, save_model

#1 Provide input model and data
# Model was downloaded from here: 
# https://github.com/onnx/models/tree/main/vision/classification/resnet
model = "resnet50-v1-12.onnx" 
input_data = [((np.random.randn(1, 3, 224, 224).astype(np.float32), ), np.array([0])) for _ in range(100)]

#2 Run Speedster optimization
optimized_model = optimize_model(
    model, 
    input_data=input_data, 
    optimization_time="constrained",
    metric_drop_ths=0.05
)

#3 Save the optimized model
save_model(optimized_model, "model_save_path")

Once the optimization is completed, start using the accelerated model (on steroids 🚀) in your DL framework of choice.

#4 Load and run your ONNX accelerated model in production
import numpy as np
from speedster import load_model

optimized_model = load_model("model_save_path")

# Example input with the same shape used during optimization
input_sample = np.random.randn(1, 3, 224, 224).astype(np.float32)
output = optimized_model(input_sample)

For more details, please visit Getting Started with ONNX optimization.

Documentation

Key concepts

Speedster's design reflects our mission to automatically master each and every existing AI acceleration technique to deliver the most cost-efficient AI ever. As a result, Speedster leverages available enterprise-grade open-source optimization tools. If these tools and communities already exist and are distributed under a permissive license (Apache, MIT, etc.), we integrate them and happily contribute to their communities. However, many tools do not exist yet, in which case we implement them and open-source the code so that our community can benefit from it.

Speedster is shaped around 4 building blocks and leverages a modular design to foster scalability and integration of new acceleration components across the software-to-hardware stack.

  • Converter: converts the input model from its original framework to the framework backends supported by Speedster, namely PyTorch, ONNX and TensorFlow. This allows the Compressor and Compiler modules to apply any optimization technique to the model.
  • Compressor: applies various compression techniques to the model, such as pruning, knowledge distillation, or quantization-aware training.
  • Compiler: converts the compressed models to the intermediate representation (IR) of the supported deep learning compilers. The compilers apply both post-training quantization techniques and graph optimizations, to produce compiled binary files.
  • Inference Learner: takes the best performing compiled model and converts it back into the same interface as the original input model.

[Diagram: Speedster's four building blocks]

The compressor stage leverages the following open-source projects:

  • Intel/neural-compressor: aims to provide unified APIs for network compression technologies, such as low-precision quantization, sparsity, pruning, and knowledge distillation, across different deep learning frameworks to pursue optimal inference performance.
  • SparseML: libraries for applying sparsification recipes to neural networks with a few lines of code, enabling faster and smaller models.

The compiler stage leverages the following open-source projects:

  • Apache TVM: open deep learning compiler stack for CPUs, GPUs and specialized accelerators.
  • BladeDISC: end-to-end Dynamic Shape Compiler project for machine learning workloads.
  • DeepSparse: neural network inference engine that delivers GPU-class performance for sparsified models on CPUs.
  • OpenVINO: open-source toolkit for optimizing and deploying AI inference.
  • ONNX Runtime: cross-platform, high-performance ML inferencing and training accelerator.
  • TensorRT: C++ library for high-performance inference on NVIDIA GPUs and deep learning accelerators.
  • TFLite and XLA: open-source libraries to accelerate TensorFlow models.
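
Not every backend is relevant for every deployment target; Speedster lets you exclude specific compilers from the search with the ignore_compilers argument of optimize_model (the names below follow the Stable Diffusion example above; see the docs for the full list):

optimized_model = optimize_model(
    model,
    input_data=input_data,
    optimization_time="constrained",
    metric_drop_ths=0.05,
    ignore_compilers=["torch_tensor_rt", "tvm"]   # skip selected compiler backends
)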

Community

We’re developing Speedster for and together with our community, so please get in touch on GitHub or Discord.

GitHub issues: suggest new acceleration components, request new features, and report bugs and improvements.

Discord: learn about AI acceleration, share exciting projects and hang out with our global community.

The best way to get started is to pick a good first issue. Please read our contribution guidelines for a deep dive into how to best contribute to our project!

Don't forget to leave a star ⭐ to support the project and happy acceleration 🚀
