Neural network inference engine that delivers GPU-class performance for sparsified models on CPUs
Project description
A CPU runtime that takes advantage of sparsity within neural networks to reduce compute. Read more about sparsification here.
Neural Magic's DeepSparse Engine integrates with popular deep learning libraries (e.g., Hugging Face, Ultralytics), allowing you to load and deploy sparse models with ONNX. ONNX gives you the flexibility to serve your model in a framework-agnostic environment. Support includes PyTorch, TensorFlow, Keras, and many other frameworks.
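As a minimal sketch of that framework-agnostic flow, the following exports a PyTorch model to ONNX for use with DeepSparse; the torchvision ResNet-50 model and output file name are illustrative assumptions, not part of the official workflow:
import torch
import torchvision.models as models
# Load a pretrained model and switch to inference mode (illustrative example)
model = models.resnet50(pretrained=True).eval()
# Dummy input defining the expected input shape: one 224x224 RGB image
dummy_input = torch.randn(1, 3, 224, 224)
# Export to ONNX; DeepSparse supports ONNX opset version 11+
torch.onnx.export(
    model,
    dummy_input,
    "resnet50.onnx",
    opset_version=11,
    input_names=["input"],
    output_names=["output"],
)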
The DeepSparse Engine is available in two editions:
- The Community Edition is open-source and free for evaluation, research, and non-production use with our Engine Community License.
- The Enterprise Edition requires a Trial License or can be fully licensed for production, commercial applications.
🧰 Hardware Support and System Requirements
Review CPU Hardware Support for Various Architectures to understand system requirements. The DeepSparse Engine runs natively on Linux only; on Mac and Windows, run it inside a Docker container or Linux virtual machine.
The DeepSparse Engine is tested on Python 3.7-3.10 and ONNX 1.5.0-1.12.0 (opset version 11+), and is manylinux compliant. Using a virtual environment is highly recommended.
Installation
Install the DeepSparse Community Edition as follows:
pip install deepsparse
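Starting from a clean virtual environment, as recommended above, might look like the following; the environment name is arbitrary:
python3 -m venv deepsparse_env
source deepsparse_env/bin/activate
pip install deepsparse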
To trial or inquire about licensing for DeepSparse Enterprise Edition, see the DeepSparse Enterprise documentation.
Features
🔌 DeepSparse Server
The DeepSparse Server allows you to serve models and pipelines from the terminal. The server runs on top of the popular FastAPI web framework and Uvicorn web server. Install the server using the following command:
pip install deepsparse[server]
Single Model
Once installed, the following example CLI command is available for running inference with a single BERT model:
deepsparse.server \
    task question_answering \
    --model_path "zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/12layer_pruned80_quant-none-vnni"
To look up arguments run: deepsparse.server --help.
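Once the server is up, you can query it from any HTTP client. The following is a sketch, assuming the default local port (5543) and a /predict route that accepts a JSON body with question and context fields; check the server's auto-generated docs for the exact schema:
import requests
# Hypothetical local endpoint; port and route depend on your server settings
url = "http://localhost:5543/predict"
payload = {"question": "What's my name?", "context": "My name is Snorlax"}
response = requests.post(url, json=payload)
print(response.json())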
Multiple Models
To serve multiple models in your deployment, you can easily build a config.yaml. In the example below, we define two BERT models in our configuration for the question answering task:
num_cores: 1
num_workers: 1
endpoints:
  - task: question_answering
    route: /predict/question_answering/base
    model: zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/base-none
    batch_size: 1
  - task: question_answering
    route: /predict/question_answering/pruned_quant
    model: zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/12layer_pruned80_quant-none-vnni
    batch_size: 1
Finally, after your config.yaml file is built, run the server with the config file path as an argument:
deepsparse.server config config.yaml
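Each endpoint is then served at its configured route. As a sketch (again assuming a local server on the default port), querying the base model defined above might look like:
import requests
payload = {"question": "Who is Mark?", "context": "Mark is a software engineer"}
response = requests.post("http://localhost:5543/predict/question_answering/base", json=payload)
print(response.json())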
See Getting Started with the DeepSparse Server for more info.
📜 DeepSparse Benchmark
The benchmark tool is available on your CLI to run expressive model benchmarks on the DeepSparse Engine with minimal parameters.
Run deepsparse.benchmark -h to look up arguments:
deepsparse.benchmark [-h] [-b BATCH_SIZE] [-shapes INPUT_SHAPES]
[-ncores NUM_CORES] [-s {async,sync}] [-t TIME]
[-nstreams NUM_STREAMS] [-pin {none,core,numa}]
[-q] [-x EXPORT_PATH]
model_path
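For example, benchmarking the sparse-quantized BERT model used above in the synchronous (single-stream) scenario could look like the following; the flag values are illustrative:
deepsparse.benchmark \
    zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/12layer_pruned80_quant-none-vnni \
    -b 1 -s sync -t 30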
Getting Started with CLI Benchmarking includes examples of select inference scenarios:
- Synchronous (Single-stream) Scenario
- Asynchronous (Multi-stream) Scenario
👩💻 NLP Inference Example
from deepsparse import Pipeline
# SparseZoo model stub or path to ONNX file
model_path = "zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/12layer_pruned80_quant-none-vnni"
qa_pipeline = Pipeline.create(
task="question-answering",
model_path=model_path,
)
my_name = qa_pipeline(question="What's my name?", context="My name is Snorlax")
Tasks Supported:
- Token Classification: Named Entity Recognition
- Text Classification: Multi-Class
- Text Classification: Binary
- Text Classification: Sentiment Analysis
- Question Answering
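Other supported tasks follow the same Pipeline.create pattern. Below is a minimal sketch for sentiment analysis; the task name mirrors the list above, and the model path is a hypothetical local ONNX file (a SparseZoo stub could be used instead):
from deepsparse import Pipeline
sentiment_pipeline = Pipeline.create(
    task="sentiment-analysis",
    model_path="./sentiment-model.onnx",  # hypothetical path to a sentiment model
)
# Input field name assumed to be `sequences`, matching other text pipelines
prediction = sentiment_pipeline(sequences=["The DeepSparse Engine is fast on CPUs!"])
print(prediction)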
🦉 SparseZoo ONNX vs. Custom ONNX Models
DeepSparse can accept ONNX models from two sources:
- SparseZoo ONNX: our open-source collection of sparse models available for download. SparseZoo hosts inference-optimized models, trained on repeatable sparsification recipes using state-of-the-art techniques from SparseML.
- Custom ONNX: your own ONNX model, which can be dense or sparse. Plug in your model to compare performance with other solutions.
> wget https://github.com/onnx/models/raw/main/vision/classification/mobilenet/model/mobilenetv2-7.onnx
Saving to: ‘mobilenetv2-7.onnx’
Custom ONNX Benchmark example:
from deepsparse import compile_model
from deepsparse.utils import generate_random_inputs
onnx_filepath = "mobilenetv2-7.onnx"
batch_size = 16
# Generate random sample input
inputs = generate_random_inputs(onnx_filepath, batch_size)
# Compile and run
engine = compile_model(onnx_filepath, batch_size)
outputs = engine.run(inputs)
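To get a rough latency number for the run above, you can wrap engine.run with the standard library timer; this simply continues the example and is only a quick sanity check (deepsparse.benchmark, shown earlier, provides more thorough measurements):
import time
start = time.perf_counter()
outputs = engine.run(inputs)
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"Ran batch of {batch_size} in {elapsed_ms:.2f} ms")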
The GitHub repository includes package APIs along with examples to quickly get started benchmarking and inferencing sparse models.
Scheduling Single-Stream, Multi-Stream, and Elastic Inference
The DeepSparse Engine offers up to three inference scheduling modes based on your use case. Read more details here: Inference Types.
1 ⚡ Single-stream scheduling: the latency/synchronous scenario, requests execute serially. [default]
Use Case: It's highly optimized for minimum per-request latency, using all of the system's resources provided to it on every request it gets.
2 ⚡ Multi-stream scheduling: the throughput/asynchronous scenario, requests execute in parallel.
PRO TIP: The most common use cases for the multi-stream scheduler are where parallelism is low with respect to core count, and where requests need to be made asynchronously without time to batch them.
3 ⚡ Elastic scheduling: requests execute in parallel, but not multiplexed on individual NUMA nodes.
Use Case: A workload that might benefit from the elastic scheduler is one in which multiple requests need to be handled simultaneously, but where performance is hindered when those requests have to share an L3 cache.
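As a sketch, the scheduler can be selected when compiling a model; this assumes compile_model accepts a scheduler argument whose values mirror the modes above ("single_stream", "multi_stream", "elastic"):
from deepsparse import compile_model
# Hedged example: request the multi-stream scheduler for a throughput-oriented workload
engine = compile_model(
    "mobilenetv2-7.onnx",
    batch_size=1,
    scheduler="multi_stream",
)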
Resources
Libraries | Versions
---|---
DeepSparse | stable
DeepSparse-Nightly | nightly (dev)
GitHub | releases
Community
Be Part of the Future... And the Future is Sparse!
Contribute with code, examples, integrations, and documentation as well as bug reports and feature requests! Learn how here.
For user help or questions about DeepSparse, sign up or log in to our Deep Sparse Community Slack. We are growing the community member by member and happy to see you there. Bugs, feature requests, or additional questions can also be posted to our GitHub Issue Queue. You can get the latest news, webinar and event invites, research papers, and other ML Performance tidbits by subscribing to the Neural Magic community.
For more general questions about Neural Magic, complete this form.
License
The Community Edition of the project's binary containing the DeepSparse Engine is licensed under the Neural Magic Engine License. Example files and scripts included in this repository are licensed under the Apache License Version 2.0 as noted.
The Enterprise Edition requires a Trial License or can be fully licensed for production, commercial applications.
Cite
Find this project useful in your research or other communications? Please consider citing:
@InProceedings{pmlr-v119-kurtz20a,
title = {Inducing and Exploiting Activation Sparsity for Fast Inference on Deep Neural Networks},
author = {Kurtz, Mark and Kopinsky, Justin and Gelashvili, Rati and Matveev, Alexander and Carr, John and Goin, Michael and Leiserson, William and Moore, Sage and Nell, Bill and Shavit, Nir and Alistarh, Dan},
booktitle = {Proceedings of the 37th International Conference on Machine Learning},
pages = {5533--5543},
year = {2020},
editor = {Hal Daumé III and Aarti Singh},
volume = {119},
series = {Proceedings of Machine Learning Research},
address = {Virtual},
month = {13--18 Jul},
publisher = {PMLR},
pdf = {http://proceedings.mlr.press/v119/kurtz20a/kurtz20a.pdf},
url = {http://proceedings.mlr.press/v119/kurtz20a.html}
}
@article{DBLP:journals/corr/abs-2111-13445,
author = {Eugenia Iofinova and
Alexandra Peste and
Mark Kurtz and
Dan Alistarh},
title = {How Well Do Sparse Imagenet Models Transfer?},
journal = {CoRR},
volume = {abs/2111.13445},
year = {2021},
url = {https://arxiv.org/abs/2111.13445},
eprinttype = {arXiv},
eprint = {2111.13445},
timestamp = {Wed, 01 Dec 2021 15:16:43 +0100},
biburl = {https://dblp.org/rec/journals/corr/abs-2111-13445.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distributions
Hashes for deepsparse_ent-1.2.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 9e3de64a8b75d42159d8be2e6ce5db3aa07d3acd478dc8ae154f68709c75cb16
MD5 | 65bbc1ca26a773c980deed18104383b7
BLAKE2b-256 | 83ff363efe37def4e461fe7b15d77e5ec233cbaa1f44ac0d9e853c1dee3c2038
Hashes for deepsparse_ent-1.2.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 2a178f0ff7fc60ecf716a8a2f9c3b443e1e0fd7c18d5b2eca9494c04392ca547
MD5 | 8b389353022c303251282da0aaa6338c
BLAKE2b-256 | 15f99457688825707de17bbdb62d76d42bbd09f4c690a3e31791fbb7c17eb3e4
Hashes for deepsparse_ent-1.2.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | d8e39be071e8664d400ea36b1cce1b1bc0fc5c13055e849a56e4ba777e5d47a0
MD5 | 3a27a66a74a3379d006e26f8f9d91179
BLAKE2b-256 | 43f18845785414b28be15d90310806b0ac43b4243686d84b53b90a2ebfe2c310
Hashes for deepsparse_ent-1.2.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 1cc8d33c6d87c6f168eb2c99311239c642054c91e76233875608148382bc29ad
MD5 | 40eb36a67d1ffdf0e65e8848e6fbc8bb
BLAKE2b-256 | ebfb23c67acef504a2907508e0de63e28813945ba378d9963f9e1b96bb07194c