
An inference runtime offering GPU-class performance on CPUs and APIs to integrate ML into your application


DeepSparse

Sparsity-aware deep learning inference runtime for CPUs

DeepSparse is a CPU inference runtime that takes advantage of sparsity to accelerate neural network inference. Coupled with SparseML, our optimization library for pruning and quantizing your models, DeepSparse delivers exceptional inference performance on CPU hardware.


✨NEW✨ DeepSparse LLMs

Neural Magic is excited to announce initial support for performant LLM inference in DeepSparse with:

  • sparse kernels for speedups and memory savings from unstructured sparse weights.
  • 8-bit weight and activation quantization support.
  • efficient usage of cached attention keys and values for minimal memory movement.


Try It Now

Install (requires Linux):

pip install -U deepsparse-nightly[llm]

Run inference:

from deepsparse import TextGeneration
pipeline = TextGeneration(model="zoo:mpt-7b-dolly_mpt_pretrain-pruned50_quantized")

prompt="""
Below is an instruction that describes a task. Write a response that appropriately completes the request. ### Instruction: what is sparsity? ### Response:
"""
print(pipeline(prompt, max_new_tokens=75).generations[0].text)

# Sparsity is the property of a matrix or other data structure in which a large number of elements are zero and a smaller number of elements are non-zero. In the context of machine learning, sparsity can be used to improve the efficiency of training and prediction.

Check out the TextGeneration documentation for usage details and get the latest sparsified LLMs on our HF Collection.
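
Once compiled, the TextGeneration pipeline can be reused across prompts without recompiling. A minimal sketch, using only the constructor and the max_new_tokens argument shown above (the prompts are illustrative):

from deepsparse import TextGeneration

# compile once, then reuse the pipeline for many prompts
pipeline = TextGeneration(model="zoo:mpt-7b-dolly_mpt_pretrain-pruned50_quantized")

prompts = [
    "### Instruction: what is pruning? ### Response:",
    "### Instruction: what is quantization? ### Response:",
]

for prompt in prompts:
    # each call runs a full generation; max_new_tokens caps the output length
    print(pipeline(prompt, max_new_tokens=50).generations[0].text)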

Sparsity 🤝 Performance

Developed in collaboration with IST Austria, our recent paper details a new technique called Sparse Fine-Tuning, which allows us to prune MPT-7B to 60% sparsity during fine-tuning without a drop in accuracy. With our new support for LLMs, DeepSparse accelerates the resulting sparse-quantized model 7x over the dense baseline.

Learn more about our Sparse Fine-Tuning research.

Check out the model running live on Hugging Face.

LLM Roadmap

Following this initial launch, we are rapidly expanding our support for LLMs, including:

  1. Productizing Sparse Fine-Tuning: Enable external users to apply sparse fine-tuning to their datasets via SparseML.
  2. Expanding model support: Apply our sparse fine-tuning results to Llama 2 and Mistral models.
  3. Pushing for higher sparsity: Improve our pruning algorithms to reach even higher sparsity.

Computer Vision and NLP Models

In addition to LLMs, DeepSparse supports many variants of CNNs and Transformer models, such as BERT, ViT, ResNet, EfficientNet, YOLOv5/8, and many more! Take a look at the Computer Vision and Natural Language Processing domains of SparseZoo, our home for optimized models.
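
Optimized CV models run through the same Pipeline interface described below. As a sketch (the zoo stub, task name, and images argument are assumptions here; browse SparseZoo and the task documentation for exact values):

from deepsparse import Pipeline

# illustrative SparseZoo stub for a pruned-quantized ResNet-50 (check SparseZoo for real stubs)
zoo_stub = "zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/pruned95_quant-none"

image_classification_pipeline = Pipeline.create(
    task="image_classification",   # assumed task name
    model_path=zoo_stub,
)

# input is a path to a local image; output is the predicted labels and scores
prediction = image_classification_pipeline(images=["my_image.jpg"])
print(prediction)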

Installation

Install via PyPI (optional dependencies detailed here):

pip install deepsparse 

To experiment with the latest features, install the nightly build with pip install deepsparse-nightly, or clone the repository and install from source with pip install -e path/to/deepsparse.

System Requirements

DeepSparse currently requires Linux. For those using Mac or Windows, we recommend using Linux containers with Docker.

Deployment APIs

DeepSparse includes three deployment APIs:

  • Engine is the lowest-level API. With Engine, you compile an ONNX model, pass tensors as input, and receive the raw outputs.
  • Pipeline wraps the Engine with pre- and post-processing. With Pipeline, you pass raw data and receive the prediction.
  • Server wraps Pipelines with a REST API using FastAPI. With Server, you send raw data over HTTP and receive the prediction.

Engine

The example below downloads a 90% pruned-quantized BERT model for sentiment analysis in ONNX format from SparseZoo, compiles the model, and runs inference on randomly generated input. Users can provide their own ONNX models, whether dense or sparse.

from deepsparse import Engine

# download onnx, compile
zoo_stub = "zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none"
compiled_model = Engine(model=zoo_stub, batch_size=1)

# run inference (input is raw numpy tensors, output is raw scores)
inputs = compiled_model.generate_random_inputs()
output = compiled_model(inputs)
print(output)

# > [array([[-0.3380675 ,  0.09602544]], dtype=float32)] << raw scores
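
Engine can also compile for larger batch sizes, which are fixed at compile time. A sketch assuming generate_random_inputs honors the compiled batch size:

from deepsparse import Engine

zoo_stub = "zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none"

# compile for batched inference; the batch size is baked into the compiled model
batched_model = Engine(model=zoo_stub, batch_size=16)

# random inputs are generated to match the compiled batch size
inputs = batched_model.generate_random_inputs()
outputs = batched_model(inputs)
print([o.shape for o in outputs])  # leading dimension should be 16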

Pipeline

Pipelines wrap Engine with pre- and post-processing, enabling you to pass raw data and receive the post-processed prediction. The example below downloads a 90% pruned-quantized BERT model for sentiment analysis in ONNX format from SparseZoo, sets up a pipeline, and runs inference on sample data.

from deepsparse import Pipeline

# download onnx, set up pipeline
zoo_stub = "zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none"  
sentiment_analysis_pipeline = Pipeline.create(
  task="sentiment-analysis",    # name of the task
  model_path=zoo_stub,          # zoo stub or path to local onnx file
)

# run inference (input is a sentence, output is the prediction)
prediction = sentiment_analysis_pipeline("I love using DeepSparse Pipelines")
print(prediction)
# > labels=['positive'] scores=[0.9954759478569031]
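
Pipelines can also take several inputs in one call. Continuing the example above (list-of-sequences input is an assumption; check the task's documentation for the exact input schema):

# reuse the pipeline from above; a list of sequences yields one prediction per item
predictions = sentiment_analysis_pipeline([
    "I love using DeepSparse Pipelines",
    "The documentation could be clearer",
])
print(predictions)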

Server

Server wraps Pipelines with REST APIs, enabling you to set up a model-serving endpoint running DeepSparse: you send raw data over HTTP and receive the post-processed predictions. DeepSparse Server is launched from the command line and is configured via arguments or a server configuration file. The following downloads a 90% pruned-quantized BERT model for sentiment analysis in ONNX format from SparseZoo and launches a sentiment analysis endpoint:

deepsparse.server \
  --task sentiment-analysis \
  --model_path zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none

Sending a request:

import requests

url = "http://localhost:5543/v2/models/sentiment_analysis/infer" # Server's port default to 5543
obj = {"sequences": "Snorlax loves my Tesla!"}

response = requests.post(url, json=obj)
print(response.text)
# {"labels":["positive"],"scores":[0.9965094327926636]}

Additional Resources

Product Usage Analytics

DeepSparse gathers basic usage telemetry, including, but not limited to, Invocations, Package, Version, and IP Address, for Product Usage Analytics purposes. Review Neural Magic's Products Privacy Policy for further details on how we process this data.

To disable Product Usage Analytics, run:

export NM_DISABLE_ANALYTICS=True

To confirm that telemetry is disabled, check the info logs streamed during engine invocation for the phrase "Skipping Neural Magic's latest package version check."

Community

Get In Touch

For more general questions about Neural Magic, complete this form.

License

Cite

Find this project useful in your research or other communications? Please consider citing:

@misc{kurtic2023sparse,
      title={Sparse Fine-Tuning for Inference Acceleration of Large Language Models}, 
      author={Eldar Kurtic and Denis Kuznedelev and Elias Frantar and Michael Goin and Dan Alistarh},
      year={2023},
      url={https://arxiv.org/abs/2310.06927},
      eprint={2310.06927},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

@misc{kurtic2022optimal,
      title={The Optimal BERT Surgeon: Scalable and Accurate Second-Order Pruning for Large Language Models}, 
      author={Eldar Kurtic and Daniel Campos and Tuan Nguyen and Elias Frantar and Mark Kurtz and Benjamin Fineran and Michael Goin and Dan Alistarh},
      year={2022},
      url={https://arxiv.org/abs/2203.07259},
      eprint={2203.07259},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

@InProceedings{pmlr-v119-kurtz20a,
      title={Inducing and Exploiting Activation Sparsity for Fast Inference on Deep Neural Networks},
      author={Kurtz, Mark and Kopinsky, Justin and Gelashvili, Rati and Matveev, Alexander and Carr, John and Goin, Michael and Leiserson, William and Moore, Sage and Nell, Bill and Shavit, Nir and Alistarh, Dan},
      booktitle={Proceedings of the 37th International Conference on Machine Learning},
      pages={5533--5543},
      year={2020},
      editor={Hal Daumé III and Aarti Singh},
      volume={119},
      series={Proceedings of Machine Learning Research},
      address={Virtual},
      month={13--18 Jul},
      publisher={PMLR},
      pdf={http://proceedings.mlr.press/v119/kurtz20a/kurtz20a.pdf},
      url={http://proceedings.mlr.press/v119/kurtz20a.html}
}

@article{DBLP:journals/corr/abs-2111-13445,
      title={How Well Do Sparse Imagenet Models Transfer?},
      author={Eugenia Iofinova and Alexandra Peste and Mark Kurtz and Dan Alistarh},
      journal={CoRR},
      volume={abs/2111.13445},
      year={2021},
      url={https://arxiv.org/abs/2111.13445},
      eprinttype={arXiv},
      eprint={2111.13445}
}

All Thanks To Our Contributors
