
An inference runtime offering GPU-class performance on CPUs and APIs to integrate ML into your application


DeepSparse

Sparsity-aware deep learning inference runtime for CPUs

DeepSparse is a CPU inference runtime that takes advantage of sparsity to accelerate neural network inference. Coupled with SparseML, our optimization library for pruning and quantizing your models, DeepSparse delivers exceptional inference performance on CPU hardware.


✨NEW✨ DeepSparse LLMs

Neural Magic is excited to announce initial support for performant LLM inference in DeepSparse with:

  • sparse kernels for speedups and memory savings from unstructured sparse weights.
  • 8-bit weight and activation quantization support.
  • efficient usage of cached attention keys and values for minimal memory movement.
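To make the first two ideas concrete, here is a minimal pure-Python sketch of what unstructured sparsity and 8-bit quantization mean numerically. It is an illustration only: the helper names (sparsity, quantize_int8, sparse_matvec) are hypothetical, and DeepSparse's actual kernels are not shown here and will differ.

```python
def sparsity(weights):
    """Return the fraction of entries that are exactly zero."""
    flat = [w for row in weights for w in row]
    return sum(1 for w in flat if w == 0.0) / len(flat)


def quantize_int8(values):
    """Symmetric per-tensor quantization to integers in [-127, 127].

    Returns the integer codes and the scale needed to map them back.
    """
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(v / scale) for v in values], scale


def sparse_matvec(weights, x):
    """Matrix-vector product that skips zero weights entirely."""
    out = []
    for row in weights:
        acc = 0.0
        for w, xi in zip(row, x):
            if w != 0.0:  # a zero weight costs no multiply and no add
                acc += w * xi
        out.append(acc)
    return out


# Toy 2x4 weight matrix with 5 of its 8 entries pruned to zero.
weights = [[0.0, 0.5, 0.0, 0.0],
           [-1.0, 0.0, 0.0, 0.25]]
x = [1.0, 2.0, 3.0, 4.0]

print(sparsity(weights))          # 0.625
print(sparse_matvec(weights, x))  # [1.0, 0.0]

codes, scale = quantize_int8([0.5, -1.0, 0.25])
print(codes, scale)
```

The point of the sketch is that zeros can be skipped rather than multiplied, and that 8-bit codes plus one scale factor stand in for full-precision values.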


Try It Now

Install (requires Linux):

pip install -U deepsparse-nightly[llm]

Run inference:

from deepsparse import TextGeneration
pipeline = TextGeneration(model="zoo:mpt-7b-dolly_mpt_pretrain-pruned50_quantized")

prompt="""
Below is an instruction that describes a task. Write a response that appropriately completes the request. ### Instruction: what is sparsity? ### Response:
"""
print(pipeline(prompt, max_new_tokens=75).generations[0].text)

# Sparsity is the property of a matrix or other data structure in which a large number of elements are zero and a smaller number of elements are non-zero. In the context of machine learning, sparsity can be used to improve the efficiency of training and prediction.

Check out the TextGeneration documentation for usage details.
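The prompt in the example above follows a fixed instruction/response layout. As a small sketch, the layout can be factored into a helper so that only the instruction changes between calls. The names PROMPT_TEMPLATE and build_prompt are hypothetical and not part of DeepSparse. The helper also omits the leading and trailing newlines of the original triple-quoted string, on the assumption that they are not significant to the model.

```python
# Template copied from the example prompt; only the instruction varies.
PROMPT_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request. "
    "### Instruction: {instruction} ### Response:"
)


def build_prompt(instruction: str) -> str:
    """Fill the instruction into the fixed prompt layout (hypothetical helper)."""
    return PROMPT_TEMPLATE.format(instruction=instruction.strip())


prompt = build_prompt("what is sparsity?")
print(prompt)

# With DeepSparse installed, the prompt is passed to the pipeline exactly as
# in the example above:
#
#   from deepsparse import TextGeneration
#   pipeline = TextGeneration(model="zoo:mpt-7b-dolly_mpt_pretrain-pruned50_quantized")
#   print(pipeline(prompt, max_new_tokens=75).generations[0].text)
```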

Sparsity 🤝 Performance

Developed in collaboration with IST Austria, our recent paper details a new technique called Sparse Fine-Tuning, which allows us to prune MPT-7B to 60% sparsity during fine-tuning without a drop in accuracy. With our new support for LLMs, DeepSparse accelerates the sparse-quantized model 7x over the dense baseline.

Learn more about our Sparse Fine-Tuning research.

Check out the model running live on Hugging Face.
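For a rough sense of scale, the back-of-the-envelope sketch below works out how many weights remain at the 60% sparsity quoted in this section and how much storage their values would need at 8 bits each. It assumes the model has exactly 7 billion weights and ignores the index metadata a sparse format needs, as well as activations and the key/value cache. The numbers are illustrative arithmetic, not measurements, and they do not derive the 7x speedup.

```python
PARAMS = 7_000_000_000  # assumption: "7B" taken as exactly 7 billion weights
SPARSITY_PCT = 60       # sparsity level quoted in this section


def nonzero_weights(params: int, sparsity_pct: int) -> int:
    """Number of weights left after pruning the given percentage to zero."""
    return params * (100 - sparsity_pct) // 100


def gigabytes(num_values: int, bytes_per_value: int) -> float:
    """Storage for the values alone, in decimal gigabytes."""
    return num_values * bytes_per_value / 1e9


kept = nonzero_weights(PARAMS, SPARSITY_PCT)
print(kept)                   # 2800000000
print(gigabytes(PARAMS, 4))   # 28.0 (dense, 32-bit floats)
print(gigabytes(kept, 1))     # 2.8  (nonzero values only, 8-bit)
```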

LLM Roadmap

Following this initial launch, we are rapidly expanding our support for LLMs, including:

  1. Productizing Sparse Fine-Tuning: Enable external users to apply sparse fine-tuning to their datasets via SparseML.
  2. Expanding model support: Apply our sparse fine-tuning results to Llama 2 and Mistral models.
  3. Pushing for higher sparsity: Improving our pruning algorithms to reach even higher sparsity.

Computer Vision and NLP Models

In addition to LLMs, DeepSparse supports many variants of CNNs and Transformer models, such as BERT, ViT, ResNet, EfficientNet, YOLOv5/8, and many more! Take a look at the Computer Vision and Natural Language Processing domains of SparseZoo, our home for optimized models.

Installation

Install via PyPI (optional dependencies detailed here):

pip install deepsparse 

To experiment with the latest features, install the nightly build with pip install deepsparse-nightly, or clone the repository and install from source with pip install -e path/to/deepsparse.
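Because the stable and nightly builds are published under different distribution names, it can be useful to check which one, if any, is present in the current environment. The sketch below uses only the standard library. The helper name installed_version is hypothetical, and the two distribution names are taken from the pip commands above.

```python
from importlib import metadata


def installed_version(candidates=("deepsparse", "deepsparse-nightly")):
    """Return (distribution name, version) for the first candidate found, else None."""
    for name in candidates:
        try:
            return name, metadata.version(name)
        except metadata.PackageNotFoundError:
            continue
    return None


found = installed_version()
if found is None:
    print("DeepSparse is not installed in this environment")
else:
    print("Found %s %s" % found)
```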

System Requirements

For those using Mac or Windows, we recommend using Linux containers with Docker.

Deployment APIs

DeepSparse includes three deployment APIs:

  • Engine is the lowest-level API. With Engine, you compile an ONNX model, pass tensors as input, and receive the raw outputs.
  • Pipeline wraps the Engine with pre- and post-processing. With Pipeline, you pass raw data and receive the prediction.
  • Server wraps Pipelines with a REST API using FastAPI. With Server, you send raw data over HTTP and receive the prediction.

Engine

The example below downloads a 90% pruned-quantized BERT model for sentiment analysis in ONNX format from SparseZoo, compiles the model, and runs inference on randomly generated input. Users can provide their own ONNX models, whether dense or sparse.

from deepsparse import Engine

# download onnx, compile
zoo_stub = "zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none"
compiled_model = Engine(model=zoo_stub, batch_size=1)

# run inference (input is raw numpy tensors, output is raw scores)
inputs = compiled_model.generate_random_inputs()
output = compiled_model(inputs)
print(output)

# > [array([[-0.3380675 ,  0.09602544]], dtype=float32)] << raw scores
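The raw scores above are unnormalized, so turning them into a labeled prediction takes one more step. The sketch below applies a softmax to the two printed values using only the standard library. The label order (index 0 negative, index 1 positive) is an assumption made for illustration, and because the input was randomly generated the resulting label carries no meaning. This is the kind of post-processing that the Pipeline API performs for you, though its actual implementation is not shown here.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Assumption: index 0 is "negative" and index 1 is "positive".
LABELS = ["negative", "positive"]

raw_scores = [-0.3380675, 0.09602544]  # values from the Engine output above
probs = softmax(raw_scores)
best = max(range(len(probs)), key=probs.__getitem__)

print(LABELS[best], probs[best])
```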

Pipeline

Pipelines wrap Engine with pre- and post-processing, enabling you to pass raw data and receive the post-processed prediction. The example below downloads a 90% pruned-quantized BERT model for sentiment analysis in ONNX format from SparseZoo, sets up a pipeline, and runs inference on sample data.

from deepsparse import Pipeline

# download onnx, set up pipeline
zoo_stub = "zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none"  
sentiment_analysis_pipeline = Pipeline.create(
  task="sentiment-analysis",    # name of the task
  model_path=zoo_stub,          # zoo stub or path to local onnx file
)

# run inference (input is a sentence, output is the prediction)
prediction = sentiment_analysis_pipeline("I love using DeepSparse Pipelines")
print(prediction)
# > labels=['positive'] scores=[0.9954759478569031]

Server

Server wraps Pipelines with REST APIs, enabling you to set up a model-serving endpoint running DeepSparse. This enables you to send raw data to DeepSparse over HTTP and receive the post-processed predictions. DeepSparse Server is launched from the command line and configured via arguments or a server configuration file. The following downloads a 90% pruned-quantized BERT model for sentiment analysis in ONNX format from SparseZoo and launches a sentiment analysis endpoint:

deepsparse.server \
  --task sentiment-analysis \
  --model_path zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none

Sending a request:

import requests

url = "http://localhost:5543/v2/models/sentiment_analysis/infer"  # Server's port defaults to 5543
obj = {"sequences": "Snorlax loves my Tesla!"}

response = requests.post(url, json=obj)
print(response.text)
# {"labels":["positive"],"scores":[0.9965094327926636]}
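If adding the requests dependency is not desirable, the same call can be sketched with the standard library. The helpers build_request and parse_prediction are hypothetical names. The URL and JSON shapes are copied from the example above, and the network call itself is left commented out because it needs a running server.

```python
import json
from urllib import request

SERVER_URL = "http://localhost:5543/v2/models/sentiment_analysis/infer"


def build_request(text, url=SERVER_URL):
    """Build a JSON POST request carrying one input sequence."""
    body = json.dumps({"sequences": text}).encode("utf-8")
    return request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_prediction(response_text):
    """Pair each label with its score from the server's JSON reply."""
    payload = json.loads(response_text)
    return list(zip(payload["labels"], payload["scores"]))


req = build_request("Snorlax loves my Tesla!")
print(req.get_method(), req.full_url)

# Parsing the sample reply shown above:
sample = '{"labels":["positive"],"scores":[0.9965094327926636]}'
print(parse_prediction(sample))

# To send the request against a running server:
#
#   with request.urlopen(req, timeout=10) as resp:
#       print(parse_prediction(resp.read().decode("utf-8")))
```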

Additional Resources

Product Usage Analytics

DeepSparse gathers basic usage telemetry, including, but not limited to, Invocations, Package, Version, and IP Address, for Product Usage Analytics purposes. Review Neural Magic's Products Privacy Policy for further details on how we process this data.

To disable Product Usage Analytics, run:

export NM_DISABLE_ANALYTICS=True

To confirm that telemetry is off, check the info logs streamed when the engine is invoked for the phrase "Skipping Neural Magic's latest package version check."
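When DeepSparse is started from a Python script rather than a shell, the same variable can be set in-process. The sketch below sets it before the import, on the assumption that the variable is read at import or invocation time; setting it first is the cautious choice either way. The import is commented out so the snippet runs without DeepSparse installed.

```python
import os

# Same variable and value as the shell command above; set it before the import.
os.environ["NM_DISABLE_ANALYTICS"] = "True"

# import deepsparse  # uncomment in an environment where DeepSparse is installed

print(os.environ["NM_DISABLE_ANALYTICS"])
```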

Community

Get In Touch

For more general questions about Neural Magic, complete this form.

License

Cite

Find this project useful in your research or other communications? Please consider citing:

@misc{kurtic2023sparse,
      title={Sparse Fine-Tuning for Inference Acceleration of Large Language Models}, 
      author={Eldar Kurtic and Denis Kuznedelev and Elias Frantar and Michael Goin and Dan Alistarh},
      year={2023},
      url={https://arxiv.org/abs/2310.06927},
      eprint={2310.06927},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

@misc{kurtic2022optimal,
      title={The Optimal BERT Surgeon: Scalable and Accurate Second-Order Pruning for Large Language Models}, 
      author={Eldar Kurtic and Daniel Campos and Tuan Nguyen and Elias Frantar and Mark Kurtz and Benjamin Fineran and Michael Goin and Dan Alistarh},
      year={2022},
      url={https://arxiv.org/abs/2203.07259},
      eprint={2203.07259},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

@InProceedings{
    pmlr-v119-kurtz20a, 
    title = {Inducing and Exploiting Activation Sparsity for Fast Inference on Deep Neural Networks}, 
    author = {Kurtz, Mark and Kopinsky, Justin and Gelashvili, Rati and Matveev, Alexander and Carr, John and Goin, Michael and Leiserson, William and Moore, Sage and Nell, Bill and Shavit, Nir and Alistarh, Dan}, 
    booktitle = {Proceedings of the 37th International Conference on Machine Learning}, 
    pages = {5533--5543}, 
    year = {2020}, 
    editor = {Hal Daumé III and Aarti Singh}, 
    volume = {119}, 
    series = {Proceedings of Machine Learning Research}, 
    address = {Virtual}, 
    month = {13--18 Jul}, 
    publisher = {PMLR}, 
    pdf = {http://proceedings.mlr.press/v119/kurtz20a/kurtz20a.pdf},
    url = {http://proceedings.mlr.press/v119/kurtz20a.html}
}

@article{DBLP:journals/corr/abs-2111-13445,
  author    = {Eugenia Iofinova and Alexandra Peste and Mark Kurtz and Dan Alistarh},
  title     = {How Well Do Sparse Imagenet Models Transfer?},
  journal   = {CoRR},
  volume    = {abs/2111.13445},
  year      = {2021},
  url       = {https://arxiv.org/abs/2111.13445},
  eprinttype = {arXiv},
  eprint    = {2111.13445},
  timestamp = {Wed, 01 Dec 2021 15:16:43 +0100},
  biburl    = {https://dblp.org/rec/journals/corr/abs-2111-13445.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
