
Neural network inference engine that delivers GPU-class performance for sparsified models on CPUs

Project description

DeepSparse Engine

Sparsity-aware neural network inference engine for GPU-class performance on CPUs



A CPU runtime that takes advantage of sparsity within neural networks to reduce compute. Read more about sparsification here.

Neural Magic's DeepSparse Engine integrates with popular deep learning libraries (e.g., Hugging Face, Ultralytics), allowing you to load and deploy sparse models with ONNX. ONNX gives you the flexibility to serve your model in a framework-agnostic environment. Support includes PyTorch, TensorFlow, Keras, and many other frameworks.


Installation

The DeepSparse Engine is tested on Python 3.6-3.10 and ONNX 1.5.0-1.12.0 (opset version 11+), and is manylinux compliant. Using a virtual environment is highly recommended. Install the engine using the following command:

pip install deepsparse

🔌 DeepSparse Server

The DeepSparse Server allows you to serve models and pipelines from the terminal. The server runs on top of the popular FastAPI web framework and Uvicorn web server. Install the server using the following command:

pip install deepsparse[server]

Single Model

Once installed, the following example CLI command is available for running inference with a single BERT model:

deepsparse.server \
    --task question_answering \
    --model_path "zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/12layer_pruned80_quant-none-vnni"

To look up arguments run: deepsparse.server --help.
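
Once the server is up, you can send it requests over HTTP. The snippet below is a minimal sketch: it assumes the server's default port (5543) and a /predict route for a single-model deployment; check deepsparse.server --help and the server docs to confirm for your version.

import requests

# Assumed default endpoint for a single-model deepsparse.server deployment
url = "http://localhost:5543/predict"

# Question-answering payload: the question plus the context to search
payload = {"question": "What's my name?", "context": "My name is Snorlax"}

response = requests.post(url, json=payload)
print(response.json())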

Multiple Models

To serve multiple models in your deployment, you can build a config.yaml file. In the example below, we define two BERT models in our configuration for the question answering task:

models:
    - task: question_answering
      model_path: zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/base-none
      batch_size: 1
      alias: question_answering/base
    - task: question_answering
      model_path: zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/12layer_pruned80_quant-none-vnni
      batch_size: 1
      alias: question_answering/pruned_quant

Finally, after your config.yaml file is built, run the server with the config file path as an argument:

deepsparse.server --config_file config.yaml

See Getting Started with the DeepSparse Server for more info.

📜 DeepSparse Benchmark

The benchmark tool is available on your CLI to run expressive model benchmarks on the DeepSparse Engine with minimal parameters.

Run deepsparse.benchmark -h to look up arguments:

deepsparse.benchmark [-h] [-b BATCH_SIZE] [-shapes INPUT_SHAPES]
                          [-ncores NUM_CORES] [-s {async,sync}] [-t TIME]
                          [-nstreams NUM_STREAMS] [-pin {none,core,numa}]
                          [-q] [-x EXPORT_PATH]
                          model_path

Getting Started with CLI Benchmarking includes examples of select inference scenarios:

  • Synchronous (Single-stream) Scenario
  • Asynchronous (Multi-stream) Scenario
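
For example, a synchronous benchmark of the pruned-quantized BERT model shown earlier could look like the following; the stub and flags come from the usage text above, and the 30-second duration is an arbitrary illustrative choice:

deepsparse.benchmark \
    zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/12layer_pruned80_quant-none-vnni \
    -b 1 -s sync -t 30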

👩‍💻 NLP Inference Example

from deepsparse import Pipeline

# SparseZoo model stub or path to ONNX file
model_path = "zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/12layer_pruned80_quant-none-vnni"

qa_pipeline = Pipeline.create(
    task="question-answering",
    model_path=model_path,
)

my_name = qa_pipeline(question="What's my name?", context="My name is Snorlax")
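
The pipeline returns a structured output object. As a minimal sketch of inspecting it (assuming the question-answering output exposes an answer field; check the Pipeline documentation for the exact schema):

# Assumption: the question-answering output exposes an `answer` attribute
print(my_name.answer)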


🦉 SparseZoo ONNX vs. Custom ONNX Models

DeepSparse can accept ONNX models from two sources:

  • SparseZoo ONNX: our open-source collection of sparse models available for download. SparseZoo hosts inference-optimized models, trained on repeatable sparsification recipes using state-of-the-art techniques from SparseML.

  • Custom ONNX: your own ONNX model, which can be dense or sparse. Plug in your model to compare performance with other solutions.

For example, download a MobileNetV2 model from the ONNX Model Zoo:

> wget https://github.com/onnx/models/raw/main/vision/classification/mobilenet/model/mobilenetv2-7.onnx
Saving to: ‘mobilenetv2-7.onnx’

Custom ONNX Benchmark example:

from deepsparse import compile_model
from deepsparse.utils import generate_random_inputs
onnx_filepath = "mobilenetv2-7.onnx"
batch_size = 16

# Generate random sample input
inputs = generate_random_inputs(onnx_filepath, batch_size)

# Compile and run
engine = compile_model(onnx_filepath, batch_size)
outputs = engine.run(inputs)
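
The same local ONNX file can also be passed to the benchmark CLI described above, for instance (batch size 16 chosen to match the Python example):

deepsparse.benchmark mobilenetv2-7.onnx -b 16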

The GitHub repository includes package APIs along with examples to quickly get started benchmarking and inferencing sparse models.

Scheduling Single-Stream, Multi-Stream, and Elastic Inference

The DeepSparse Engine offers three scheduling modes to match your use case. Read more details here: Inference Types.

1 ⚡ Single-stream scheduling: the latency/synchronous scenario, requests execute serially. [default]


Use Case: It's highly optimized for minimum per-request latency, using all of the system's resources provided to it on every request it gets.

2 ⚡ Multi-stream scheduling: the throughput/asynchronous scenario, requests execute in parallel.


PRO TIP: The most common use cases for the multi-stream scheduler are where parallelism is low with respect to core count, and where requests need to be made asynchronously without time to batch them.

3 ⚡ Elastic scheduling: requests execute in parallel, but not multiplexed on individual NUMA nodes.

Use Case: A workload that might benefit from the elastic scheduler is one in which multiple requests need to be handled simultaneously, but where performance is hindered when those requests have to share an L3 cache.
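
As a minimal sketch of selecting a scheduler in code (this assumes compile_model accepts a scheduler keyword with values such as "single_stream", "multi_stream", or "elastic", per the Inference Types documentation; confirm the exact argument names in the API reference):

from deepsparse import compile_model
from deepsparse.utils import generate_random_inputs

onnx_filepath = "mobilenetv2-7.onnx"
batch_size = 16

# Assumption: the scheduler is chosen at compile time; single-stream is the default
engine = compile_model(onnx_filepath, batch_size, scheduler="multi_stream")
outputs = engine.run(generate_random_inputs(onnx_filepath, batch_size))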

🧰 CPU Hardware Support

With support for AVX2, AVX-512, and VNNI instruction sets, the DeepSparse Engine is validated to work on x86 Intel (Haswell generation and later) and AMD CPUs running Linux. Mac and Windows require running Linux in a Docker or virtual machine.

Here is a table detailing specific support for some algorithms over different microarchitectures:

| x86 Extension                 | Microarchitectures                              | Activation Sparsity | Kernel Sparsity | Sparse Quantization |
|-------------------------------|-------------------------------------------------|---------------------|-----------------|---------------------|
| AMD AVX2                      | Zen 2, Zen 3                                    | not supported       | optimized       | emulated            |
| Intel AVX2                    | Haswell, Broadwell, and newer                   | not supported       | optimized       | emulated            |
| Intel AVX-512                 | Skylake, Cannon Lake, and newer                 | optimized           | optimized       | emulated            |
| Intel AVX-512 VNNI (DL Boost) | Cascade Lake, Ice Lake, Cooper Lake, Tiger Lake | optimized           | optimized       | optimized           |


Community

Be Part of the Future... And the Future is Sparse!

Contribute with code, examples, integrations, and documentation as well as bug reports and feature requests! Learn how here.

For user help or questions about DeepSparse, sign up or log in to our Deep Sparse Community Slack. We are growing the community member by member and happy to see you there. Bugs, feature requests, or additional questions can also be posted to our GitHub Issue Queue. You can get the latest news, webinar and event invites, research papers, and other ML Performance tidbits by subscribing to the Neural Magic community.

For more general questions about Neural Magic, complete this form.

License

The project's binary containing the DeepSparse Engine is licensed under the Neural Magic Engine License. Example files and scripts included in this repository are licensed under the Apache License Version 2.0 as noted.

Cite

Find this project useful in your research or other communications? Please consider citing:

@InProceedings{
    pmlr-v119-kurtz20a, 
    title = {Inducing and Exploiting Activation Sparsity for Fast Inference on Deep Neural Networks}, 
    author = {Kurtz, Mark and Kopinsky, Justin and Gelashvili, Rati and Matveev, Alexander and Carr, John and Goin, Michael and Leiserson, William and Moore, Sage and Nell, Bill and Shavit, Nir and Alistarh, Dan}, 
    booktitle = {Proceedings of the 37th International Conference on Machine Learning}, 
    pages = {5533--5543}, 
    year = {2020}, 
    editor = {Hal Daumé III and Aarti Singh}, 
    volume = {119}, 
    series = {Proceedings of Machine Learning Research}, 
    address = {Virtual}, 
    month = {13--18 Jul}, 
    publisher = {PMLR}, 
    pdf = {http://proceedings.mlr.press/v119/kurtz20a/kurtz20a.pdf},
    url = {http://proceedings.mlr.press/v119/kurtz20a.html}
}

@article{DBLP:journals/corr/abs-2111-13445,
  author    = {Eugenia Iofinova and
               Alexandra Peste and
               Mark Kurtz and
               Dan Alistarh},
  title     = {How Well Do Sparse Imagenet Models Transfer?},
  journal   = {CoRR},
  volume    = {abs/2111.13445},
  year      = {2021},
  url       = {https://arxiv.org/abs/2111.13445},
  eprinttype = {arXiv},
  eprint    = {2111.13445},
  timestamp = {Wed, 01 Dec 2021 15:16:43 +0100},
  biburl    = {https://dblp.org/rec/journals/corr/abs-2111-13445.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}



Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

deepsparse-1.1.0.tar.gz (41.4 MB)

Uploaded Source

Built Distributions

deepsparse-1.1.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (41.7 MB)

Uploaded CPython 3.10 manylinux: glibc 2.17+ x86-64

deepsparse-1.1.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (41.7 MB)

Uploaded CPython 3.9 manylinux: glibc 2.17+ x86-64

deepsparse-1.1.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (41.7 MB)

Uploaded CPython 3.8 manylinux: glibc 2.17+ x86-64

deepsparse-1.1.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (41.7 MB)

Uploaded CPython 3.7m manylinux: glibc 2.17+ x86-64

deepsparse-1.1.0-cp36-cp36m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (41.7 MB)

Uploaded CPython 3.6m manylinux: glibc 2.17+ x86-64

File details

Details for the file deepsparse-1.1.0.tar.gz.

File metadata

  • Download URL: deepsparse-1.1.0.tar.gz
  • Upload date:
  • Size: 41.4 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.3.0 pkginfo/1.7.0 requests/2.24.0 setuptools/45.2.0 requests-toolbelt/0.9.1 tqdm/4.56.0 CPython/3.8.10

File hashes

Hashes for deepsparse-1.1.0.tar.gz
Algorithm Hash digest
SHA256 0c31930682b289164870ca1d6627cc9ad8481966808c0a0943837996e9f98930
MD5 8ac8ec0f3fbe9082c4b0d1b34c8f8c1a
BLAKE2b-256 c938442bcc9403aaf0dd082e23397a8a5e5ca43b5856058cffa0f5449c8f8a5c


File details

Details for the file deepsparse-1.1.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for deepsparse-1.1.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 0d450d6ba6667f1e0ec35a520ece7239b50338cba3c1bdf5006075c59b484d44
MD5 1dd4ba7ca9cb4c8f525e5af499da7786
BLAKE2b-256 c21b6245a5ff52c792f1e3480c6a603dac903687f63bd3db070acdc4de5ce56e


File details

Details for the file deepsparse-1.1.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for deepsparse-1.1.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 f39bed0eeb47aae8afd1c177517ee9451961b4dd4b35b3c5d2a1e73c0a038263
MD5 9b426ab9ba92bbdf6d5aa8dd27eb5052
BLAKE2b-256 bf910ca0f27fa63525635157629517a71974d4d974b95757e110bf5af079d106


File details

Details for the file deepsparse-1.1.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for deepsparse-1.1.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 a67565de4c0a8955765b284fe3b943e92ad6223c07c267a1c0d6a15b716f00e9
MD5 db51831739474a1b2b17382d1e534f45
BLAKE2b-256 bbe08c17a492b71663e5af5871adcd01871d208ea45f321dbb383d1086aca010


File details

Details for the file deepsparse-1.1.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for deepsparse-1.1.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 5c78f02447210c1dfc281dcb7d7b4d1fbdece4596d9d4bcba640185622f205f4
MD5 57de2242281952bd09be11fe926f0b5e
BLAKE2b-256 238913ca52de27b1efc8e4b2d2700ecb10ad3c051f955ac63bebb890cfcda48c


File details

Details for the file deepsparse-1.1.0-cp36-cp36m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for deepsparse-1.1.0-cp36-cp36m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 47a897a624eeb2336ae32b40923eead21833c2cdb3664fb4ff5cd6d0252adf75
MD5 db167a799bdabc2fc409983d8b3fd4f9
BLAKE2b-256 b75af076c334690f4bd4a7bc16ab6cfbc498f5be4db80d86b0e11fabcf4374eb

