Nebullvm
nebullvm is an open-source tool designed to speed up AI inference in just a few lines of code. nebullvm boosts your model to achieve the maximum acceleration that is physically possible on your hardware.
We are building a new AI inference acceleration product that leverages state-of-the-art open-source optimization tools to optimize the whole software-to-hardware stack. If you like the idea, give us a star to support the project ⭐
The core nebullvm workflow consists of 3 steps:
- Select: input your model in your preferred DL framework and express your preferences regarding:
  - Accuracy loss: do you want to trade off a little accuracy for much higher performance?
  - Optimization time: stellar accelerations can be time-consuming. Can you wait, or do you need an instant answer?
- Search: nebullvm automatically tests every combination of optimization techniques across the software-to-hardware stack (sparsity, quantization, compilers, etc.) that is compatible with your needs and local hardware.
- Serve: finally, nebullvm chooses the best configuration of optimization techniques and returns an accelerated version of your model in the DL framework of your choice (just on steroids 🚀).
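The Search and Serve steps above can be pictured as a filter-then-benchmark loop. The sketch below is only an illustration of that idea and is not nebullvm's implementation: `Candidate`, `mean_latency`, and `select_best` are hypothetical names, and the accuracy-loss figures are assumed to be known in advance.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass
class Candidate:
    """One hypothetical optimization configuration (illustrative only)."""
    name: str
    run: Callable[[Sequence[float]], List[float]]  # stand-in for an optimized model
    accuracy_loss: float  # assumed relative metric drop, e.g. 0.01 means 1%


def mean_latency(candidate: Candidate, sample: Sequence[float], repeats: int = 50) -> float:
    """Average wall-clock seconds per call, measured after one warm-up run."""
    candidate.run(sample)
    start = time.perf_counter()
    for _ in range(repeats):
        candidate.run(sample)
    return (time.perf_counter() - start) / repeats


def select_best(
    candidates: List[Candidate],
    sample: Sequence[float],
    max_accuracy_loss: float,
    measure: Callable[[Candidate, Sequence[float]], float] = mean_latency,
) -> Optional[Candidate]:
    """Drop candidates over the accuracy budget, then return the fastest one."""
    allowed = [c for c in candidates if c.accuracy_loss <= max_accuracy_loss]
    if not allowed:
        return None
    return min(allowed, key=lambda c: measure(c, sample))
```

Passing a stricter `max_accuracy_loss` shrinks the set of allowed candidates, which is the trade-off the Select step asks you to make up front.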
API quick view
Only a single line of code is needed to get your accelerated model:
import torch
import torchvision.models as models
from nebullvm.api.functions import optimize_model
# Load a ResNet-50 as an example
model = models.resnet50()
# Provide sample input data for the model
input_data = [((torch.randn(1, 3, 256, 256), ), 0)]
# Run nebullvm optimization in one line of code
optimized_model = optimize_model(
    model, input_data=input_data, optimization_time="constrained"
)
# Try the optimized model
x = torch.randn(1, 3, 256, 256)
res = optimized_model(x)
For more details, please visit Installation and Get started.
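After trying the optimized model, it can be useful to check how far its outputs drift from the original model's outputs. The helper below is a framework-agnostic sketch, not part of nebullvm: `max_abs_diff` and `worst_case_drift` are hypothetical names, and outputs are assumed to be flat sequences of floats (with PyTorch tensors you could pass `tensor.flatten().tolist()`).

```python
from typing import Callable, Iterable, List, Sequence

Vector = Sequence[float]
ModelFn = Callable[[Vector], List[float]]


def max_abs_diff(a: Vector, b: Vector) -> float:
    """Largest element-wise absolute difference between two outputs."""
    if len(a) != len(b):
        raise ValueError("outputs have different lengths")
    return max((abs(x - y) for x, y in zip(a, b)), default=0.0)


def worst_case_drift(original: ModelFn, optimized: ModelFn, samples: Iterable[Vector]) -> float:
    """Largest output difference between the two models over all samples."""
    return max((max_abs_diff(original(s), optimized(s)) for s in samples), default=0.0)


# Stand-in models: the "optimized" one rounds its outputs to emulate reduced precision.
def original_model(xs: Vector) -> List[float]:
    return [2.0 * x for x in xs]


def optimized_model_stub(xs: Vector) -> List[float]:
    return [round(2.0 * x, 2) for x in xs]


drift = worst_case_drift(original_model, optimized_model_stub, [[0.1234, 1.0]])
```

Comparing `drift` against a tolerance you choose gives a quick sanity check before swapping the accelerated model into production.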
How it works
We are not here to reinvent the wheel, but to build an all-in-one open-source product to master all the available AI acceleration techniques and deliver the fastest AI ever. As a result, nebullvm leverages available enterprise-grade open-source optimization tools. If these tools and communities already exist, and are distributed under a permissive license (Apache, MIT, etc.), we integrate them and happily contribute to their communities. However, many tools do not exist yet, in which case we implement them and open-source the code so that the community can benefit from it.
Product design
nebullvm is shaped around 4 building blocks and leverages a modular design to foster scalability and integration of new acceleration components across the stack.
- Converter: converts the input model from its original framework to the framework backends supported by nebullvm, namely PyTorch, TensorFlow, and ONNX. This allows the Compressor and Optimizer modules to apply any optimization technique to the model.
- Compressor: applies various compression techniques to the model, such as pruning, knowledge distillation, or quantization-aware training.
- Optimizer: converts the compressed models to the intermediate representation (IR) of the supported deep learning compilers. The compilers apply both post-training quantization techniques and graph optimizations to produce compiled binary files.
- Inference Learner: takes the best performing compiled model and converts it to the same interface as the original input model.
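One way to picture how the four building blocks chain together is as a list of stages that each take and return a set of model variants. This is a structural sketch under our own assumptions, not nebullvm's code: every function name is hypothetical, each stage is a placeholder that only relabels the model, and `latency_of` stands in for a real benchmark.

```python
from typing import Callable, List, Sequence, Tuple

Model = Callable[[Sequence[float]], List[float]]
Variant = Tuple[str, Model]  # (human-readable label, callable model)

# Backend names come from the text above; everything else is hypothetical.
BACKENDS = ("pytorch", "tensorflow", "onnx")


def converter(variants: List[Variant]) -> List[Variant]:
    """Converter: one variant per supported backend (placeholder: same callable)."""
    return [(f"{name}@{backend}", model) for name, model in variants for backend in BACKENDS]


def compressor(variants: List[Variant]) -> List[Variant]:
    """Compressor: keep each variant and add a placeholder compressed copy."""
    out: List[Variant] = []
    for name, model in variants:
        out.append((name, model))
        out.append((f"{name}+compressed", model))
    return out


def optimizer(variants: List[Variant]) -> List[Variant]:
    """Optimizer: hand every variant to a placeholder compiler."""
    return [(f"{name}+compiled", model) for name, model in variants]


def inference_learner(variants: List[Variant], latency_of: Callable[[Variant], float]):
    """Inference Learner: pick the fastest variant, expose the original call interface."""
    best_name, best_model = min(variants, key=latency_of)

    def accelerated(x: Sequence[float]) -> List[float]:
        return best_model(x)

    return best_name, accelerated


def run_pipeline(model: Model, latency_of: Callable[[Variant], float]):
    """Chain Converter -> Compressor -> Optimizer, then select with the Inference Learner."""
    variants: List[Variant] = [("model", model)]
    for stage in (converter, compressor, optimizer):
        variants = stage(variants)
    return inference_learner(variants, latency_of)
```

Because each stage shares the same signature, adding a new acceleration component in this sketch means adding one more function to the stage tuple, which mirrors the modularity goal described above.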
The compressor stage leverages the following open-source projects:
- Intel/neural-compressor: aims to provide unified APIs for network compression techniques, such as low-precision quantization, sparsity, pruning, and knowledge distillation, across different deep learning frameworks to pursue optimal inference performance.
- SparseML: libraries for applying sparsification recipes to neural networks with a few lines of code, enabling faster and smaller models.
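To make two of the compression terms above concrete, here is a minimal pure-Python sketch of magnitude pruning and symmetric int8 quantization on a flat list of weights. It does not use the APIs of the libraries listed above; the function names are hypothetical, and real tools operate on tensors with far more sophisticated recipes.

```python
from typing import List, Sequence, Tuple


def magnitude_prune(weights: Sequence[float], sparsity: float) -> List[float]:
    """Zero out the `sparsity` fraction of weights with the smallest magnitude."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1")
    k = int(len(weights) * sparsity)  # number of weights to drop
    by_magnitude = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(by_magnitude[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


def quantize_int8(weights: Sequence[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: Sequence[int], scale: float) -> List[float]:
    """Recover approximate float weights from their int8 representation."""
    return [q * scale for q in quantized]


pruned = magnitude_prune([0.5, -0.1, 0.05, -2.0], sparsity=0.5)
q, scale = quantize_int8([0.9, -0.3, 0.12, 0.0])
restored = dequantize(q, scale)
```

The gap between the original weights and `restored` is the kind of precision cost that the accuracy-loss preference lets you accept or reject.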
The optimizer stage leverages the following open-source projects:
- Apache TVM: open deep learning compiler stack for CPUs, GPUs, and specialized accelerators.
- BladeDISC: end-to-end Dynamic Shape Compiler project for machine learning workloads.
- DeepSparse: neural network inference engine that delivers GPU-class performance for sparsified models on CPUs.
- OpenVINO: open-source toolkit for optimizing and deploying AI inference.
- ONNX Runtime: cross-platform, high-performance ML inferencing and training accelerator.
- TensorRT: C++ library for high performance inference on NVIDIA GPUs and deep learning accelerators.
- TFLite and XLA: open-source libraries to accelerate TensorFlow models.
Documentation
Community
- Discord: best for sharing your projects, hanging out with the community and learning about AI acceleration.
- GitHub issues: ideal for suggesting new acceleration components, requesting new features, and reporting bugs and improvements.
We’re developing nebullvm together with our community, so the best way to get started is to pick a good-first issue. Please read our contribution guidelines for a deep dive on how to best contribute to our project!
Don't forget to leave a star ⭐ to support the project and happy acceleration 🚀
Status
- Model converter backends
- ONNX, PyTorch, TensorFlow
- Jax
- Compressor
- Pruning and sparsity
- Quantization-aware training, distillation, layer replacement and low rank compression
- Optimizer
- TensorRT, OpenVINO, ONNX Runtime, TVM, PyTorch, DeepSparse, BladeDISC
- TFLite, XLA
- Inference learners
- PyTorch, ONNX, Hugging Face, TensorFlow
- Jax