
An inference runtime offering GPU-class performance on CPUs and APIs to integrate ML into your application


DeepSparse

Sparsity-aware deep learning inference runtime for CPUs

DeepSparse is a CPU inference runtime that takes advantage of sparsity to accelerate neural network inference. Coupled with SparseML, our optimization library for pruning and quantizing your models, DeepSparse delivers exceptional inference performance on CPU hardware.


✨NEW✨ DeepSparse LLMs

Neural Magic is excited to announce initial support for performant LLM inference in DeepSparse with:

  • sparse kernels for speedups and memory savings from unstructured sparse weights.
  • 8-bit weight and activation quantization support.
  • efficient usage of cached attention keys and values for minimal memory movement.
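
To make the first two items concrete, here is a minimal, library-free sketch of what "unstructured sparse weights" and "8-bit weight quantization" mean for a small weight matrix. It is only an illustration: the helper names (`magnitude_prune`, `sparsity_of`, `quantize_int8`) are hypothetical, the pruning rule is plain magnitude pruning, and the quantizer is a simple symmetric per-tensor scheme. None of this is DeepSparse's or SparseML's actual implementation.

```python
def magnitude_prune(weights, sparsity):
    """Zero out (roughly) the smallest-magnitude fraction `sparsity` of the weights.

    Ties at the threshold are all pruned, so the result can be slightly
    sparser than requested.
    """
    magnitudes = sorted(abs(w) for row in weights for w in row)
    n_prune = int(len(magnitudes) * sparsity)
    if n_prune == 0:
        return [row[:] for row in weights]
    threshold = magnitudes[n_prune - 1]
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in weights]


def sparsity_of(weights):
    """Fraction of entries that are exactly zero."""
    total = sum(len(row) for row in weights)
    zeros = sum(1 for row in weights for w in row if w == 0.0)
    return zeros / total


def quantize_int8(weights):
    """Symmetric per-tensor quantization to integers in [-127, 127].

    Returns the integer matrix and the scale needed to dequantize
    (original value is approximately q * scale).
    """
    max_abs = max(abs(w) for row in weights for w in row)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [[round(w / scale) for w in row] for row in weights]
    return quantized, scale


# Toy weights, invented for this example.
weights = [[0.9, -0.05, 0.3, 0.01],
           [-0.7, 0.02, -0.4, 0.08]]

pruned = magnitude_prune(weights, sparsity=0.5)
q, scale = quantize_int8(pruned)

print("sparsity:", sparsity_of(pruned))
print("int8 weights:", q)
```

The zeros can fall anywhere in the matrix, which is what "unstructured" refers to. A sparsity-aware kernel can skip them, and the remaining values need only one byte each plus a shared scale.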


Try It Now

Install (requires Linux):

pip install -U "deepsparse-nightly[llm]"

Run inference:

from deepsparse import TextGeneration
pipeline = TextGeneration(model="zoo:mpt-7b-dolly_mpt_pretrain-pruned50_quantized")

prompt="""
Below is an instruction that describes a task. Write a response that appropriately completes the request. ### Instruction: what is sparsity? ### Response:
"""
print(pipeline(prompt, max_new_tokens=75).generations[0].text)

# Sparsity is the property of a matrix or other data structure in which a large number of elements are zero and a smaller number of elements are non-zero. In the context of machine learning, sparsity can be used to improve the efficiency of training and prediction.

Check out the TextGeneration documentation for usage details and get the latest sparsified LLMs on our HF Collection.

Sparsity 🤝 Performance

Developed in collaboration with IST Austria, our recent paper details a new technique called Sparse Fine-Tuning, which allows us to prune MPT-7B to 60% sparsity during fine-tuning without a drop in accuracy. With our new support for LLMs, DeepSparse accelerates the sparse-quantized model 7x over the dense baseline.

Learn more about our Sparse Fine-Tuning research.

Check out the model running live on Hugging Face.

LLM Roadmap

Following this initial launch, we are rapidly expanding our support for LLMs, including:

  1. Productizing Sparse Fine-Tuning: Enable external users to apply sparse fine-tuning to their datasets via SparseML.
  2. Expanding model support: Apply our sparse fine-tuning results to Llama 2 and Mistral models.
  3. Pushing for higher sparsity: Improving our pruning algorithms to reach even higher sparsity.

Computer Vision and NLP Models

In addition to LLMs, DeepSparse supports many variants of CNNs and Transformer models, such as BERT, ViT, ResNet, EfficientNet, YOLOv5/8, and many more! Take a look at the Computer Vision and Natural Language Processing domains of SparseZoo, our home for optimized models.

Installation

Install via PyPI (optional dependencies detailed here):

pip install deepsparse 

To experiment with the latest features, install the nightly build with pip install deepsparse-nightly, or clone the repository and install from source with pip install -e path/to/deepsparse.

System Requirements

For those using Mac or Windows, we recommend using Linux containers with Docker.

Deployment APIs

DeepSparse includes three deployment APIs:

  • Engine is the lowest-level API. With Engine, you compile an ONNX model, pass tensors as input, and receive the raw outputs.
  • Pipeline wraps the Engine with pre- and post-processing. With Pipeline, you pass raw data and receive the prediction.
  • Server wraps Pipelines with a REST API using FastAPI. With Server, you send raw data over HTTP and receive the prediction.

Engine

The example below downloads a 90% pruned-quantized BERT model for sentiment analysis in ONNX format from SparseZoo, compiles the model, and runs inference on randomly generated input. Users can provide their own ONNX models, whether dense or sparse.

from deepsparse import Engine

# download onnx, compile
zoo_stub = "zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none"
compiled_model = Engine(model=zoo_stub, batch_size=1)

# run inference (input is raw numpy tensors, output is raw scores)
inputs = compiled_model.generate_random_inputs()
output = compiled_model(inputs)
print(output)

# > [array([[-0.3380675 ,  0.09602544]], dtype=float32)] << raw scores
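
The raw scores returned by Engine are unnormalized logits. As a rough sketch of the kind of post-processing a Pipeline performs on them, the snippet below applies a softmax and picks the highest-probability label. The label order (`["negative", "positive"]`) is an assumption made for illustration, and this is not DeepSparse's own post-processing code. Because the Engine example ran on random input, the resulting label carries no meaning.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Raw scores copied from the Engine output above.
raw_scores = [-0.3380675, 0.09602544]

# Assumed index-to-label mapping; check the model card for the real one.
LABELS = ["negative", "positive"]

probs = softmax(raw_scores)
best = max(range(len(probs)), key=probs.__getitem__)
print(LABELS[best], probs[best])
```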

Pipeline

Pipelines wrap Engine with pre- and post-processing, enabling you to pass raw data and receive the post-processed prediction. The example below downloads a 90% pruned-quantized BERT model for sentiment analysis in ONNX format from SparseZoo, sets up a pipeline, and runs inference on sample data.

from deepsparse import Pipeline

# download onnx, set up pipeline
zoo_stub = "zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none"  
sentiment_analysis_pipeline = Pipeline.create(
  task="sentiment-analysis",    # name of the task
  model_path=zoo_stub,          # zoo stub or path to local onnx file
)

# run inference (input is a sentence, output is the prediction)
prediction = sentiment_analysis_pipeline("I love using DeepSparse Pipelines")
print(prediction)
# > labels=['positive'] scores=[0.9954759478569031]

Server

Server wraps Pipelines with REST APIs, so you can set up a model-serving endpoint running DeepSparse, send it raw data over HTTP, and receive the post-processed predictions. DeepSparse Server is launched from the command line and configured via arguments or a server configuration file. The following command downloads a 90% pruned-quantized BERT model for sentiment analysis in ONNX format from SparseZoo and launches a sentiment analysis endpoint:

deepsparse.server \
  --task sentiment-analysis \
  --model_path zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none

Sending a request:

import requests

url = "http://localhost:5543/v2/models/sentiment_analysis/infer"  # the server's port defaults to 5543
obj = {"sequences": "Snorlax loves my Tesla!"}

response = requests.post(url, json=obj)
print(response.text)
# {"labels":["positive"],"scores":[0.9965094327926636]}
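
If you would rather not depend on requests, the same call can be sketched with the standard library alone. The helper names (`build_infer_request`, `send`) and the 10-second timeout are illustrative assumptions. The URL path and the "sequences" payload key are taken from the example above.

```python
import json
import urllib.request


def build_infer_request(text, host="localhost", port=5543,
                        model="sentiment_analysis"):
    """Build a POST request for the inference endpoint shown above."""
    url = f"http://{host}:{port}/v2/models/{model}/infer"
    body = json.dumps({"sequences": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=10.0):
    """Send the request and decode the JSON reply.

    Raises urllib.error.URLError if the server is unreachable and
    urllib.error.HTTPError on a non-2xx status.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the request does not touch the network.
request = build_infer_request("Snorlax loves my Tesla!")
print(request.get_method(), request.full_url)
```

With the server from the previous step running, `send(request)` should return the decoded prediction as a dictionary.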

Additional Resources

Product Usage Analytics

DeepSparse gathers basic usage telemetry, including, but not limited to, Invocations, Package, Version, and IP Address, for Product Usage Analytics purposes. Review Neural Magic's Products Privacy Policy for further details on how we process this data.

To disable Product Usage Analytics, run:

export NM_DISABLE_ANALYTICS=True

Confirm that telemetry is shut off through info logs streamed with engine invocation by looking for the phrase "Skipping Neural Magic's latest package version check."
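
When DeepSparse is started from a Python script rather than a shell, the same variable can be set in-process. The sketch below assumes the variable is read when the package is imported or first invoked, so it is set before any deepsparse import. That timing is an assumption and has not been verified against the DeepSparse source.

```python
import os

# Assumption: NM_DISABLE_ANALYTICS is read at import or first invocation,
# so set it before importing deepsparse anywhere in the process.
os.environ["NM_DISABLE_ANALYTICS"] = "True"

# Import deepsparse only after this point, for example:
# from deepsparse import Pipeline

print("NM_DISABLE_ANALYTICS =", os.environ["NM_DISABLE_ANALYTICS"])
```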

Community

Get In Touch

For more general questions about Neural Magic, complete this form.

License

Cite

Find this project useful in your research or other communications? Please consider citing:

@misc{kurtic2023sparse,
      title={Sparse Fine-Tuning for Inference Acceleration of Large Language Models}, 
      author={Eldar Kurtic and Denis Kuznedelev and Elias Frantar and Michael Goin and Dan Alistarh},
      year={2023},
      url={https://arxiv.org/abs/2310.06927},
      eprint={2310.06927},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

@misc{kurtic2022optimal,
      title={The Optimal BERT Surgeon: Scalable and Accurate Second-Order Pruning for Large Language Models}, 
      author={Eldar Kurtic and Daniel Campos and Tuan Nguyen and Elias Frantar and Mark Kurtz and Benjamin Fineran and Michael Goin and Dan Alistarh},
      year={2022},
      url={https://arxiv.org/abs/2203.07259},
      eprint={2203.07259},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

@InProceedings{
    pmlr-v119-kurtz20a, 
    title = {Inducing and Exploiting Activation Sparsity for Fast Inference on Deep Neural Networks}, 
    author = {Kurtz, Mark and Kopinsky, Justin and Gelashvili, Rati and Matveev, Alexander and Carr, John and Goin, Michael and Leiserson, William and Moore, Sage and Nell, Bill and Shavit, Nir and Alistarh, Dan}, 
    booktitle = {Proceedings of the 37th International Conference on Machine Learning}, 
    pages = {5533--5543}, 
    year = {2020}, 
    editor = {Hal Daumé III and Aarti Singh}, 
    volume = {119}, 
    series = {Proceedings of Machine Learning Research}, 
    address = {Virtual}, 
    month = {13--18 Jul}, 
    publisher = {PMLR}, 
    pdf = {http://proceedings.mlr.press/v119/kurtz20a/kurtz20a.pdf},
    url = {http://proceedings.mlr.press/v119/kurtz20a.html}
}

@article{DBLP:journals/corr/abs-2111-13445,
  author    = {Eugenia Iofinova and Alexandra Peste and Mark Kurtz and Dan Alistarh},
  title     = {How Well Do Sparse Imagenet Models Transfer?},
  journal   = {CoRR},
  volume    = {abs/2111.13445},
  year      = {2021},
  url       = {https://arxiv.org/abs/2111.13445},
  eprinttype = {arXiv},
  eprint    = {2111.13445},
  timestamp = {Wed, 01 Dec 2021 15:16:43 +0100},
  biburl    = {https://dblp.org/rec/journals/corr/abs-2111-13445.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

All Thanks To Our Contributors
