An inference runtime offering GPU-class performance on CPUs and APIs to integrate ML into your application

Project description

DeepSparse

Sparsity-aware deep learning inference runtime for CPUs

DeepSparse is a CPU inference runtime that takes advantage of sparsity to accelerate neural network inference. Coupled with SparseML, our optimization library for pruning and quantizing your models, DeepSparse delivers exceptional inference performance on CPU hardware.

✨NEW✨ DeepSparse LLMs

Neural Magic is excited to announce initial support for performant LLM inference in DeepSparse with:

  • sparse kernels for speedups and memory savings from unstructured sparse weights.
  • 8-bit weight and activation quantization support.
  • efficient usage of cached attention keys and values for minimal memory movement.

Try It Now

Install (requires Linux):

pip install -U deepsparse-nightly[llm]

Run inference:

from deepsparse import TextGeneration
pipeline = TextGeneration(model="zoo:mpt-7b-dolly_mpt_pretrain-pruned50_quantized")

prompt="""
Below is an instruction that describes a task. Write a response that appropriately completes the request. ### Instruction: what is sparsity? ### Response:
"""
print(pipeline(prompt, max_new_tokens=75).generations[0].text)

# Sparsity is the property of a matrix or other data structure in which a large number of elements are zero and a smaller number of elements are non-zero. In the context of machine learning, sparsity can be used to improve the efficiency of training and prediction.

Check out the TextGeneration documentation for usage details.
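
To reuse one compiled pipeline across several prompts, loop over the call shown above. A minimal sketch (the loop and the prompt template are our own; they rely only on the call signature already demonstrated):

from deepsparse import TextGeneration

pipeline = TextGeneration(model="zoo:mpt-7b-dolly_mpt_pretrain-pruned50_quantized")

# the template mirrors the instruction format from the example above
template = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request. "
    "### Instruction: {instruction} ### Response:"
)

for instruction in ["what is sparsity?", "what is quantization?"]:
    # the model is compiled once; each call only runs generation
    output = pipeline(template.format(instruction=instruction), max_new_tokens=75)
    print(output.generations[0].text)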

Sparsity 🤝 Performance

Developed in collaboration with IST Austria, our recent paper details a new technique called Sparse Fine-Tuning, which allows us to prune MPT-7B to 60% sparsity during fine-tuning without a drop in accuracy. With our new support for LLMs, DeepSparse accelerates the resulting sparse-quantized model 7x over the dense baseline.

Learn more about our Sparse Fine-Tuning research.

Check out the model running live on Hugging Face.

LLM Roadmap

Following this initial launch, we are rapidly expanding our support for LLMs, including:

  1. Productizing Sparse Fine-Tuning: Enable external users to apply sparse fine-tuning to their datasets via SparseML.
  2. Expanding model support: Apply our sparse fine-tuning results to Llama 2 and Mistral models.
  3. Pushing for higher sparsity: Improve our pruning algorithms to reach even higher sparsity.

Computer Vision and NLP Models

In addition to LLMs, DeepSparse supports many variants of CNNs and Transformer models, such as BERT, ViT, ResNet, EfficientNet, YOLOv5/8, and many more! Take a look at the Computer Vision and Natural Language Processing domains of SparseZoo, our home for optimized models.
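
Non-LLM models use the same Pipeline API described under Deployment APIs below. As a hedged sketch, here is image classification with a pruned ResNet-50 (the task name comes from DeepSparse's task list; the zoo stub appears in older examples, so verify the current stub on SparseZoo):

from deepsparse import Pipeline

# image classification pipeline; the stub may have moved, check SparseZoo
cv_pipeline = Pipeline.create(
    task="image_classification",
    model_path="zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/pruned95-none",
)

# "my_image.jpg" is a placeholder path to a local image
prediction = cv_pipeline(images=["my_image.jpg"])
print(prediction)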

Installation

Install via PyPI (optional dependencies detailed here):

pip install deepsparse 

To experiment with the latest features, install the nightly build with pip install deepsparse-nightly, or clone the repository and install from source with pip install -e path/to/deepsparse.

System Requirements

DeepSparse runs natively on Linux. For those using Mac or Windows, we recommend using Linux containers with Docker.
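
As a concrete sketch of the container route (the base image and tag are our choice, not an official recommendation):

# start a Linux container on your Mac/Windows host
docker run -it --rm python:3.10 bash

# inside the container:
pip install deepsparse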

Deployment APIs

DeepSparse includes three deployment APIs:

  • Engine is the lowest-level API. With Engine, you compile an ONNX model, pass tensors as input, and receive the raw outputs.
  • Pipeline wraps the Engine with pre- and post-processing. With Pipeline, you pass raw data and receive the prediction.
  • Server wraps Pipelines with a REST API using FastAPI. With Server, you send raw data over HTTP and receive the prediction.

Engine

The example below downloads a 90% pruned-quantized BERT model for sentiment analysis in ONNX format from SparseZoo, compiles the model, and runs inference on randomly generated input. Users can provide their own ONNX models, whether dense or sparse.

from deepsparse import Engine

# download onnx, compile
zoo_stub = "zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none"
compiled_model = Engine(model=zoo_stub, batch_size=1)

# run inference (input is raw numpy tensors, output is raw scores)
inputs = compiled_model.generate_random_inputs()
output = compiled_model(inputs)
print(output)

# > [array([[-0.3380675 ,  0.09602544]], dtype=float32)] << raw scores
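
The model argument also accepts a path to a local ONNX file, dense or sparse, as noted above. A minimal sketch (the path and input shape are placeholders for your own export):

import numpy as np
from deepsparse import Engine

# compile a local ONNX model instead of a zoo stub
compiled_model = Engine(model="path/to/model.onnx", batch_size=1)

# inputs are a list of numpy arrays; shapes must match the model's inputs
inputs = [np.random.rand(1, 128).astype(np.float32)]
output = compiled_model(inputs)
print(output)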

Pipeline

Pipelines wrap Engine with pre- and post-processing, enabling you to pass raw data and receive the post-processed prediction. The example below downloads a 90% pruned-quantized BERT model for sentiment analysis in ONNX format from SparseZoo, sets up a pipeline, and runs inference on sample data.

from deepsparse import Pipeline

# download onnx, set up pipeline
zoo_stub = "zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none"  
sentiment_analysis_pipeline = Pipeline.create(
  task="sentiment-analysis",    # name of the task
  model_path=zoo_stub,          # zoo stub or path to local onnx file
)

# run inference (input is a sentence, output is the prediction)
prediction = sentiment_analysis_pipeline("I love using DeepSparse Pipelines")
print(prediction)
# > labels=['positive'] scores=[0.9954759478569031]
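
Pipelines can also take a list of inputs for batched prediction. A short sketch (we are assuming this task accepts list inputs, which is the common pattern across DeepSparse pipelines; verify for your task):

# batched prediction over several sentences
predictions = sentiment_analysis_pipeline(
    ["I love using DeepSparse Pipelines", "Debugging segfaults is no fun"]
)
print(predictions)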

Server

Server wraps Pipelines with REST APIs, enabling you to set up a model-serving endpoint running DeepSparse. This enables you to send raw data to DeepSparse over HTTP and receive the post-processed predictions. DeepSparse Server is launched from the command line and configured via arguments or a server configuration file. The following downloads a 90% pruned-quantized BERT model for sentiment analysis in ONNX format from SparseZoo and launches a sentiment analysis endpoint:

deepsparse.server \
  --task sentiment-analysis \
  --model_path zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none

Sending a request:

import requests

url = "http://localhost:5543/v2/models/sentiment_analysis/infer" # Server's port default to 5543
obj = {"sequences": "Snorlax loves my Tesla!"}

response = requests.post(url, json=obj)
print(response.text)
# {"labels":["positive"],"scores":[0.9965094327926636]}

Product Usage Analytics

DeepSparse gathers basic usage telemetry, including, but not limited to, Invocations, Package, Version, and IP Address, for Product Usage Analytics purposes. Review Neural Magic's Products Privacy Policy for further details on how we process this data.

To disable Product Usage Analytics, run:

export NM_DISABLE_ANALYTICS=True
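
The flag can also be set from Python, assuming it is read when DeepSparse initializes, so it must be set before the import:

import os
os.environ["NM_DISABLE_ANALYTICS"] = "True"  # set before importing deepsparse

import deepsparse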

To confirm that telemetry is disabled, check the info logs streamed during engine invocation for the phrase "Skipping Neural Magic's latest package version check."

Community

Get In Touch

For more general questions about Neural Magic, complete this form.

License

Cite

Find this project useful in your research or other communications? Please consider citing:

@misc{kurtic2023sparse,
      title={Sparse Fine-Tuning for Inference Acceleration of Large Language Models}, 
      author={Eldar Kurtic and Denis Kuznedelev and Elias Frantar and Michael Goin and Dan Alistarh},
      year={2023},
      url={https://arxiv.org/abs/2310.06927},
      eprint={2310.06927},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

@misc{kurtic2022optimal,
      title={The Optimal BERT Surgeon: Scalable and Accurate Second-Order Pruning for Large Language Models}, 
      author={Eldar Kurtic and Daniel Campos and Tuan Nguyen and Elias Frantar and Mark Kurtz and Benjamin Fineran and Michael Goin and Dan Alistarh},
      year={2022},
      url={https://arxiv.org/abs/2203.07259},
      eprint={2203.07259},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

@InProceedings{pmlr-v119-kurtz20a,
      title={Inducing and Exploiting Activation Sparsity for Fast Inference on Deep Neural Networks},
      author={Kurtz, Mark and Kopinsky, Justin and Gelashvili, Rati and Matveev, Alexander and Carr, John and Goin, Michael and Leiserson, William and Moore, Sage and Nell, Bill and Shavit, Nir and Alistarh, Dan},
      booktitle={Proceedings of the 37th International Conference on Machine Learning},
      pages={5533--5543},
      year={2020},
      editor={Hal Daumé III and Aarti Singh},
      volume={119},
      series={Proceedings of Machine Learning Research},
      address={Virtual},
      month={13--18 Jul},
      publisher={PMLR},
      pdf={http://proceedings.mlr.press/v119/kurtz20a/kurtz20a.pdf},
      url={http://proceedings.mlr.press/v119/kurtz20a.html}
}

@article{DBLP:journals/corr/abs-2111-13445,
  author    = {Eugenia Iofinova and Alexandra Peste and Mark Kurtz and Dan Alistarh},
  title     = {How Well Do Sparse Imagenet Models Transfer?},
  journal   = {CoRR},
  volume    = {abs/2111.13445},
  year      = {2021},
  url       = {https://arxiv.org/abs/2111.13445},
  eprinttype = {arXiv},
  eprint    = {2111.13445},
  timestamp = {Wed, 01 Dec 2021 15:16:43 +0100},
  biburl    = {https://dblp.org/rec/journals/corr/abs-2111-13445.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

  • deepsparse-nightly-1.8.0.20240404.tar.gz (46.9 MB, Source)

Built Distributions

  • deepsparse_nightly-1.8.0.20240404-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (47.4 MB, CPython 3.11, manylinux: glibc 2.17+ x86-64)
  • deepsparse_nightly-1.8.0.20240404-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (40.6 MB, CPython 3.11, manylinux: glibc 2.17+ ARM64)
  • deepsparse_nightly-1.8.0.20240404-cp311-cp311-macosx_13_0_arm64.whl (33.5 MB, CPython 3.11, macOS 13.0+ ARM64)
  • deepsparse_nightly-1.8.0.20240404-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (47.4 MB, CPython 3.10, manylinux: glibc 2.17+ x86-64)
  • deepsparse_nightly-1.8.0.20240404-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (40.6 MB, CPython 3.10, manylinux: glibc 2.17+ ARM64)
  • deepsparse_nightly-1.8.0.20240404-cp310-cp310-macosx_13_0_arm64.whl (33.5 MB, CPython 3.10, macOS 13.0+ ARM64)
  • deepsparse_nightly-1.8.0.20240404-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (47.4 MB, CPython 3.9, manylinux: glibc 2.17+ x86-64)
  • deepsparse_nightly-1.8.0.20240404-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (40.6 MB, CPython 3.9, manylinux: glibc 2.17+ ARM64)
  • deepsparse_nightly-1.8.0.20240404-cp39-cp39-macosx_13_0_arm64.whl (33.5 MB, CPython 3.9, macOS 13.0+ ARM64)
  • deepsparse_nightly-1.8.0.20240404-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (47.4 MB, CPython 3.8, manylinux: glibc 2.17+ x86-64)
  • deepsparse_nightly-1.8.0.20240404-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (40.6 MB, CPython 3.8, manylinux: glibc 2.17+ ARM64)
  • deepsparse_nightly-1.8.0.20240404-cp38-cp38-macosx_13_0_arm64.whl (33.5 MB, CPython 3.8, macOS 13.0+ ARM64)

File details

deepsparse-nightly-1.8.0.20240404.tar.gz

  • Size: 46.9 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.3.0 pkginfo/1.7.0 requests/2.24.0 setuptools/45.2.0 requests-toolbelt/0.9.1 tqdm/4.56.0 CPython/3.8.10
  • SHA256: 588c3d72dc192f66666a2f7ce4c39f2cd8da1b242ec0cd1fa580dee92851c333
  • MD5: 75e0d0ccc247c76c7a835f2e001110b8
  • BLAKE2b-256: 72da2f8498d6514072b5dbf478a858134361077bba174e71d332116f498576b9

deepsparse_nightly-1.8.0.20240404-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl

  • SHA256: 7fc962cd2e0a12c5f30b27f1dbe7cab372febd2d949011b1c3ea0ad8fc1b9676
  • MD5: a0b79deff50e995720b77d2e8693fc76
  • BLAKE2b-256: aee7b056a04ceb118607b7c0350d7601e19be84b1209f21dab8bad38290ec289

deepsparse_nightly-1.8.0.20240404-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl

  • SHA256: 416b2624c313e482732c69a93b8c0092963fa8f0a2e5664b6b25f50917a28242
  • MD5: 9726b730a41f6a35b5da6a0671f22ded
  • BLAKE2b-256: d05ade0da2c7761887c99d74c99b61bf9928c96214dca8481b45a05ae85885ee

deepsparse_nightly-1.8.0.20240404-cp311-cp311-macosx_13_0_arm64.whl

  • SHA256: b5b8d2139839881a123462a1a8369184885ebf9deec97e94941a24d519893aa1
  • MD5: 0af14b448fdbffa2ca974d338af7bf34
  • BLAKE2b-256: 4bb210cc5a40e00085c7ef17c28391489cb2d6a3d623565ebd25553c4d3b4019

deepsparse_nightly-1.8.0.20240404-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl

  • SHA256: 982d8b8b8b8b47d295cf1c3360cddfb4757e68361a2a55c83d3ef5938fa26a4a
  • MD5: 50cc554ab597e4d0e46fa0dbdbde7f08
  • BLAKE2b-256: 3971d3bbe728eb1af07f59d686f5428b84d3d91ef76ff576215dfcd5ff1300c6

deepsparse_nightly-1.8.0.20240404-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl

  • SHA256: 9cb23c1cca897978c3f172b5426649a6ac832fe1cd90069451774d87a18a58ce
  • MD5: 834198cafaabc19b02bcf7a617f1c792
  • BLAKE2b-256: 050825ec5d0b4df36f7ac8843422b1822d2102ea1a3648d96d487ba4658a6b45

deepsparse_nightly-1.8.0.20240404-cp310-cp310-macosx_13_0_arm64.whl

  • SHA256: 21040a33717583c5ed511f4dfe2ee0290caab3df41460f77ce423e67ba9ec180
  • MD5: 2db1bd816cbba0567484ff2a14e58c04
  • BLAKE2b-256: 265243e8255d18176b1035f2ec7267d59f24a934a6b6ee81ac06a16ec647830b

deepsparse_nightly-1.8.0.20240404-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl

  • SHA256: a78d2edfe8a84bc912a06ff489f92ba6b308c728b06be5806b5e5c99db16a7e6
  • MD5: f6ffa65ece7a16224e1362ce84964edd
  • BLAKE2b-256: 60015f22b71e66c9a3b07b765ec8c0d32157eeca348a2cd60456be1c4abb3d53

deepsparse_nightly-1.8.0.20240404-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl

  • SHA256: 66f28d99d78b28fc075c1fcc09f293299c3f39df52dff8ea1bc82f52fbdadad3
  • MD5: 9efee65f7280e0d9353883bbd600b09d
  • BLAKE2b-256: 4cee00e26b5c258bf07353e4bc932e3147e9806fe79e8156352401ed9fe23601

deepsparse_nightly-1.8.0.20240404-cp39-cp39-macosx_13_0_arm64.whl

  • SHA256: ce52dc8f8d6bc4e7ac8593106fa51767eb16855cd55530bcf2d8ade9f6e34f0c
  • MD5: d6cceb5b7b4dcc21ba72971ad07de364
  • BLAKE2b-256: 9ca3c88fc392ded6e05ed3d65ceea1d5e6ec81bd14ae982fab75a9c9db4483f0

deepsparse_nightly-1.8.0.20240404-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl

  • SHA256: a9277015ff6d2bef17a66ff23280368814cbbcb8125a219e3bb454603a2f5c09
  • MD5: 1141c52a415fc1698cfdb43534b39375
  • BLAKE2b-256: 17806d85642c43f847f0b305257d4520642ff6d685ab01647b54f01a0f5d99cb

deepsparse_nightly-1.8.0.20240404-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl

  • SHA256: 756dd7456145f7c720844d505852587c42e8a4160751dd4ef047f09107ecf9a6
  • MD5: d26629df96c8f43d2381bb715c5fd386
  • BLAKE2b-256: b14133079ba475fc9eb3b925d02dffecbc6808ff23d1d9d583f56ddd202d3dbd

deepsparse_nightly-1.8.0.20240404-cp38-cp38-macosx_13_0_arm64.whl

  • SHA256: f1da5f505184ae7c1e003470d74e17c80aa366ea8fcf0d8c55d4e1488ab2ca27
  • MD5: 8954a49c6c916cdabe777150b337518e
  • BLAKE2b-256: 544b45208e9a169ad87de060fe7616ea5473dd89a25ae23bc7e44f6473e35090
