Neural network inference engine that delivers GPU-class performance for sparsified models on CPUs

DeepSparse


Overview

The DeepSparse Engine is a CPU runtime that delivers GPU-class performance by taking advantage of sparsity within neural networks to reduce the compute required and to accelerate memory-bound workloads. It focuses on model deployment and scaling machine learning pipelines, and fits into your existing deployments as an inference backend.

The GitHub repository includes package APIs along with examples to help you quickly get started benchmarking and running inference on sparse models.

Highlights

  • ResNet-50, b64 - ORT: 296 images/sec vs. DeepSparse: 2305 images/sec on 24 cores
  • YOLOv3, b64 - PyTorch: 6.9 images/sec vs. DeepSparse: 46.5 images/sec
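
As a quick sanity check on the figures above, the ratio of the two throughputs gives the implied speedup. This is plain arithmetic on the quoted numbers; the `speedup` helper is a name introduced here for illustration and is not part of DeepSparse.

```python
def speedup(baseline_ips: float, deepsparse_ips: float) -> float:
    """Return how many times faster the second throughput is than the first."""
    return deepsparse_ips / baseline_ips


# Throughputs (images/sec) quoted in the Highlights above
resnet50 = speedup(296, 2305)
yolov3 = speedup(6.9, 46.5)

print(f"ResNet-50: {resnet50:.1f}x, YOLOv3: {yolov3:.1f}x")
```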

Installation

This repository is tested on Python 3.6+ and ONNX 1.5.0+. We recommend installing in a virtual environment to keep your system in order.

Install with pip using:

pip install deepsparse

Hardware Support

The DeepSparse Engine is validated to work on x86 Intel and AMD CPUs running Linux operating systems. On Mac and Windows, run Linux in a Docker container or a virtual machine.

For the best performance, run on a CPU with AVX-512 instructions available, which enables the most optimized algorithms.

Here is a table detailing specific support for some algorithms over different microarchitectures:

x86 Extension                  Microarchitectures                               Activation Sparsity  Kernel Sparsity  Sparse Quantization
AMD AVX2                       Zen 2, Zen 3                                     not supported        optimized        not supported
Intel AVX2                     Haswell, Broadwell, and newer                    not supported        optimized        not supported
Intel AVX-512                  Skylake, Cannon Lake, and newer                  optimized            optimized        emulated
Intel AVX-512 VNNI (DL Boost)  Cascade Lake, Ice Lake, Cooper Lake, Tiger Lake  optimized            optimized        optimized
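
To see which row of the table a given Linux machine falls into, one option is to read the CPU feature flags from /proc/cpuinfo. The following is a minimal sketch, not part of DeepSparse: the helper names are hypothetical, and it assumes the kernel reports the flags as `avx2`, `avx512f`, and `avx512_vnni`. It only maps flags to the rows above and does not check the CPU vendor or microarchitecture.

```python
from pathlib import Path

# Rows of the table above, strongest first: (cpuinfo flag, summary of support).
# Flag names are the ones Linux is assumed to report in /proc/cpuinfo.
SUPPORT_LEVELS = [
    ("avx512_vnni", "AVX-512 VNNI: activation sparsity, kernel sparsity, and sparse quantization optimized"),
    ("avx512f", "AVX-512: activation and kernel sparsity optimized, sparse quantization emulated"),
    ("avx2", "AVX2: kernel sparsity optimized only"),
]


def cpu_flags(cpuinfo_text: str) -> set:
    """Collect the CPU feature flags from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            return set(line.split(":", 1)[1].split())
    return set()


def support_level(flags: set) -> str:
    """Return the summary for the strongest extension found in the flags."""
    for flag, description in SUPPORT_LEVELS:
        if flag in flags:
            return description
    return "no AVX2 or AVX-512 detected: not covered by the table"


if __name__ == "__main__":
    path = Path("/proc/cpuinfo")
    text = path.read_text() if path.exists() else ""
    print(support_level(cpu_flags(text)))
```

Checking the strongest extension first matters because a VNNI-capable CPU also reports the `avx512f` and `avx2` flags.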

Compatibility

The DeepSparse Engine ingests models in the ONNX format, allowing for compatibility with PyTorch, TensorFlow, Keras, and many other frameworks that support it. This reduces the extra work of preparing your trained model for inference to just one step of exporting.

Quick Tour

To expedite inference and benchmarking on real models, we include the sparsezoo package. SparseZoo hosts inference-optimized models, trained on repeatable sparsification recipes using state-of-the-art techniques from SparseML.

Quickstart with SparseZoo ONNX Models

ResNet-50 Dense

Here is how to quickly perform inference with the DeepSparse Engine on a pre-trained dense ResNet-50 from SparseZoo.

from deepsparse import compile_model
from sparsezoo.models import classification

batch_size = 64

# Download model and compile as optimized executable for your machine
model = classification.resnet_50()
engine = compile_model(model, batch_size=batch_size)

# Fetch sample input and predict output using engine
inputs = model.data_inputs.sample_batch(batch_size=batch_size)
outputs, inference_time = engine.timed_run(inputs)

ResNet-50 Sparsified

When exploring available optimized models, you can use the Zoo.search_sparse_models utility to find models that share a base.

Try this on the dense ResNet-50 to see what is available:

from sparsezoo import Zoo
from sparsezoo.models import classification

model = classification.resnet_50()
print(Zoo.search_sparse_models(model))

Output:

[
    Model(stub=cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/pruned-conservative), 
    Model(stub=cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/pruned-moderate), 
    Model(stub=cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/pruned_quant-moderate), 
    Model(stub=cv/classification/resnet_v1-50/pytorch/sparseml/imagenet-augmented/pruned_quant-aggressive)
]

We can see there are two pruned versions targeting FP32 and two pruned, quantized versions targeting INT8. The conservative, moderate, and aggressive tags recover to 100%, >=99%, and <99% of baseline accuracy, respectively.
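
The naming scheme in these stubs can be read mechanically. The sketch below splits a stub into its path segments and interprets the final segment as an optimization name plus a recovery tag. The field names and the `parse_stub` helper are hypothetical labels chosen for illustration, not SparseZoo's API, and the precision rule (a "quant" optimization targets INT8, otherwise FP32) simply restates the paragraph above.

```python
from typing import NamedTuple


class StubParts(NamedTuple):
    # Field names are illustrative labels, not official SparseZoo terminology.
    domain: str
    sub_domain: str
    architecture: str
    framework: str
    repo: str
    dataset: str
    optimization: str  # e.g. "pruned_quant"
    recovery: str      # e.g. "moderate"


def parse_stub(stub: str) -> StubParts:
    """Split a stub such as 'cv/classification/.../pruned-moderate' into parts."""
    if stub.startswith("zoo:"):
        stub = stub[len("zoo:"):]
    parts = stub.split("/")
    if len(parts) != 7:
        raise ValueError(f"expected 7 path segments, got {len(parts)}: {stub!r}")
    *head, last = parts
    optimization, _, recovery = last.partition("-")
    return StubParts(*head, optimization, recovery)


# Recovery targets as described in the text above
RECOVERY = {"conservative": "100%", "moderate": ">=99%", "aggressive": "<99%"}

stubs = [
    "cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/pruned-conservative",
    "cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/pruned-moderate",
    "cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/pruned_quant-moderate",
    "cv/classification/resnet_v1-50/pytorch/sparseml/imagenet-augmented/pruned_quant-aggressive",
]

for stub in stubs:
    parts = parse_stub(stub)
    precision = "INT8" if "quant" in parts.optimization else "FP32"
    print(f"{parts.optimization}-{parts.recovery}: {precision}, recovers {RECOVERY[parts.recovery]} of baseline")
```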

For a version of ResNet-50 that recovers close to the baseline and is very performant, choose the pruned_quant-moderate model. This model will run nearly 7x faster than the baseline model on a compatible CPU (with the VNNI instruction set enabled). For hardware compatibility, see the Hardware Support section.

from deepsparse import compile_model
import numpy

batch_size = 64
sample_inputs = [numpy.random.randn(batch_size, 3, 224, 224).astype(numpy.float32)]

# run baseline benchmarking
engine_base = compile_model(
    model="zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/base-none", 
    batch_size=batch_size,
)
benchmarks_base = engine_base.benchmark(sample_inputs)
print(benchmarks_base)

# run sparse benchmarking
engine_sparse = compile_model(
    model="zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/pruned_quant-moderate", 
    batch_size=batch_size,
)
if not engine_sparse.cpu_vnni:
    print("WARNING: VNNI instructions not detected, quantization speedup not well supported")
benchmarks_sparse = engine_sparse.benchmark(sample_inputs)
print(benchmarks_sparse)

print(f"Speedup: {benchmarks_sparse.items_per_second / benchmarks_base.items_per_second:.2f}x")

Quickstart with Custom ONNX Models

We accept ONNX files for custom models, too. Simply plug in your model to compare performance with other solutions.

> wget https://github.com/onnx/models/raw/master/vision/classification/mobilenet/model/mobilenetv2-7.onnx
Saving to: ‘mobilenetv2-7.onnx’

from deepsparse import compile_model
from deepsparse.utils import generate_random_inputs

onnx_filepath = "mobilenetv2-7.onnx"
batch_size = 16
batch_size = 16

# Generate random sample input
inputs = generate_random_inputs(onnx_filepath, batch_size)

# Compile and run
engine = compile_model(onnx_filepath, batch_size)
outputs = engine.run(inputs)

Compatibility/Support Notes

  • ONNX version 1.5-1.7
  • ONNX opset version 11+
  • ONNX IR version has not been tested at this time
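
The first two notes can be turned into a simple pre-flight check. This is a sketch under stated assumptions: the helper names are hypothetical, and the values are passed in as plain arguments. In practice they would probably come from `onnx.__version__` and from the `opset_import` entries of a loaded model, but verify that against the ONNX documentation for your installed version.

```python
def major_minor(version: str) -> tuple:
    """Reduce a version string such as '1.7.0' to a (major, minor) tuple."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def onnx_version_supported(version: str) -> bool:
    """True if the ONNX package version falls in the listed 1.5-1.7 range."""
    return (1, 5) <= major_minor(version) <= (1, 7)


def opset_supported(opset: int) -> bool:
    """True if the model's opset version is 11 or newer."""
    return opset >= 11


for version in ("1.4.1", "1.6.0", "1.10.2"):
    print(version, "supported" if onnx_version_supported(version) else "outside the listed range")
```

Comparing (major, minor) tuples rather than raw strings avoids ordering "1.10" before "1.7".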

For a more in-depth read on available APIs and workflows, check out the examples and DeepSparse Engine documentation.

Resources

Release History

Official builds are hosted on PyPI.

More information is available in the GitHub Releases.

License

The project's binary containing the DeepSparse Engine is licensed under the Neural Magic Engine License.

Example files and scripts included in this repository are licensed under the Apache License Version 2.0 as noted.

Community

Contribute

We appreciate contributions to the code, examples, integrations, and documentation as well as bug reports and feature requests! Learn how here.

Join

For user help or questions about the DeepSparse Engine, sign up or log in to the Deep Sparse Community Discourse Forum or Slack. We are growing the community member by member and are happy to see you there.

You can get the latest news, webinar and event invites, research papers, and other ML Performance tidbits by subscribing to the Neural Magic community.

For more general questions about Neural Magic, please fill out this form.

Cite

Find this project useful in your research or other communications? Please consider citing:

@InProceedings{
    pmlr-v119-kurtz20a, 
    title = {Inducing and Exploiting Activation Sparsity for Fast Inference on Deep Neural Networks}, 
    author = {Kurtz, Mark and Kopinsky, Justin and Gelashvili, Rati and Matveev, Alexander and Carr, John and Goin, Michael and Leiserson, William and Moore, Sage and Nell, Bill and Shavit, Nir and Alistarh, Dan}, 
    booktitle = {Proceedings of the 37th International Conference on Machine Learning}, 
    pages = {5533--5543}, 
    year = {2020}, 
    editor = {Hal Daumé III and Aarti Singh}, 
    volume = {119}, 
    series = {Proceedings of Machine Learning Research}, 
    address = {Virtual}, 
    month = {13--18 Jul}, 
    publisher = {PMLR}, 
    pdf = {http://proceedings.mlr.press/v119/kurtz20a/kurtz20a.pdf},
    url = {http://proceedings.mlr.press/v119/kurtz20a.html}, 
    abstract = {Optimizing convolutional neural networks for fast inference has recently become an extremely active area of research. One of the go-to solutions in this context is weight pruning, which aims to reduce computational and memory footprint by removing large subsets of the connections in a neural network. Surprisingly, much less attention has been given to exploiting sparsity in the activation maps, which tend to be naturally sparse in many settings thanks to the structure of rectified linear (ReLU) activation functions. In this paper, we present an in-depth analysis of methods for maximizing the sparsity of the activations in a trained neural network, and show that, when coupled with an efficient sparse-input convolution algorithm, we can leverage this sparsity for significant performance gains. To induce highly sparse activation maps without accuracy loss, we introduce a new regularization technique, coupled with a new threshold-based sparsification method based on a parameterized activation function called Forced-Activation-Threshold Rectified Linear Unit (FATReLU). We examine the impact of our methods on popular image classification models, showing that most architectures can adapt to significantly sparser activation maps without any accuracy loss. Our second contribution is showing that these compression gains can be translated into inference speedups: we provide a new algorithm to enable fast convolution operations over networks with sparse activations, and show that it can enable significant speedups for end-to-end inference on a range of popular models on the large-scale ImageNet image classification task on modern Intel CPUs, with little or no retraining cost.} 
}


