
An inference runtime offering GPU-class performance on CPUs and APIs to integrate ML into your application


DeepSparse

Sparsity-aware deep learning inference runtime for CPUs

DeepSparse is a CPU inference runtime that takes advantage of sparsity to accelerate neural network inference. Coupled with SparseML, our optimization library for pruning and quantizing your models, DeepSparse delivers exceptional inference performance on CPU hardware.


✨NEW✨ DeepSparse LLMs

Neural Magic is excited to announce initial support for performant LLM inference in DeepSparse with:

  • sparse kernels for speedups and memory savings from unstructured sparse weights.
  • 8-bit weight and activation quantization support.
  • efficient usage of cached attention keys and values for minimal memory movement.
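
Where do these speedups come from? A toy sketch in pure Python (purely illustrative, not DeepSparse's actual kernels): with unstructured sparsity, only the nonzero weights need to be stored and multiplied.

```python
def to_sparse(weights):
    # Keep only (index, value) pairs for the nonzero weights.
    return [(i, w) for i, w in enumerate(weights) if w != 0.0]

def sparse_dot(sparse_weights, x):
    # Dot product that skips zero weights entirely.
    return sum(w * x[i] for i, w in sparse_weights)

dense = [0.0, 2.0, 0.0, 0.0, -1.0, 0.0, 0.0, 3.0]  # 62.5% sparse
x = [1.0] * 8

sparse = to_sparse(dense)
print(len(sparse))            # 3 multiplies instead of 8
print(sparse_dot(sparse, x))  # 4.0, identical to the dense result
```

The memory saving follows the same logic: only the nonzero values (plus their indices) need to be kept.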

[Figure: MPT chat inference comparison]

Try It Now

Install (requires Linux):

pip install -U deepsparse-nightly[llm]

Run inference:

from deepsparse import TextGeneration
pipeline = TextGeneration(model="zoo:mpt-7b-dolly_mpt_pretrain-pruned50_quantized")

prompt="""
Below is an instruction that describes a task. Write a response that appropriately completes the request. ### Instruction: what is sparsity? ### Response:
"""
print(pipeline(prompt, max_new_tokens=75).generations[0].text)

# Sparsity is the property of a matrix or other data structure in which a large number of elements are zero and a smaller number of elements are non-zero. In the context of machine learning, sparsity can be used to improve the efficiency of training and prediction.

Check out the TextGeneration documentation for usage details.

Sparsity 🤝 Performance

Developed in collaboration with IST Austria, our recent paper details a new technique called Sparse Fine-Tuning, which allows us to prune MPT-7B to 60% sparsity during fine-tuning without a drop in accuracy. With our new support for LLMs, DeepSparse accelerates the resulting sparse-quantized model 7x over the dense baseline.

Learn more about our Sparse Fine-Tuning research.

Check out the model running live on Hugging Face.

LLM Roadmap

Following this initial launch, we are rapidly expanding our support for LLMs, including:

  1. Productizing Sparse Fine-Tuning: Enable external users to apply sparse fine-tuning to their datasets via SparseML.
  2. Expanding model support: Apply our sparse fine-tuning results to Llama 2 and Mistral models.
  3. Pushing for higher sparsity: Improving our pruning algorithms to reach even higher sparsity.

Computer Vision and NLP Models

In addition to LLMs, DeepSparse supports many variants of CNNs and Transformer models, such as BERT, ViT, ResNet, EfficientNet, YOLOv5/8, and many more! Take a look at the Computer Vision and Natural Language Processing domains of SparseZoo, our home for optimized models.

Installation

Install via PyPI (optional dependencies detailed here):

pip install deepsparse 

To experiment with the latest features, install the nightly build with pip install deepsparse-nightly, or clone the repository and install from source with pip install -e path/to/deepsparse.

System Requirements

DeepSparse runs natively on Linux. For those using Mac or Windows, we recommend using Linux containers with Docker.

Deployment APIs

DeepSparse includes three deployment APIs:

  • Engine is the lowest-level API. With Engine, you compile an ONNX model, pass tensors as input, and receive the raw outputs.
  • Pipeline wraps the Engine with pre- and post-processing. With Pipeline, you pass raw data and receive the prediction.
  • Server wraps Pipelines with a REST API using FastAPI. With Server, you send raw data over HTTP and receive the prediction.

Engine

The example below downloads a 90% pruned-quantized BERT model for sentiment analysis in ONNX format from SparseZoo, compiles the model, and runs inference on randomly generated input. Users can provide their own ONNX models, whether dense or sparse.

from deepsparse import Engine

# download onnx, compile
zoo_stub = "zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none"
compiled_model = Engine(model=zoo_stub, batch_size=1)

# run inference (input is raw numpy tensors, output is raw scores)
inputs = compiled_model.generate_random_inputs()
output = compiled_model(inputs)
print(output)

# > [array([[-0.3380675 ,  0.09602544]], dtype=float32)] << raw scores
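
The raw scores are logits; Pipeline's post-processing boils down to a softmax plus a label lookup. A stdlib sketch of that step (illustrative, assuming SST-2's usual [negative, positive] ordering, not DeepSparse internals):

```python
import math

def softmax(scores):
    # Convert raw logits into probabilities that sum to 1.
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Raw scores from the Engine example above
probs = softmax([-0.3380675, 0.09602544])
print(probs)  # ~[0.393, 0.607] -> "positive" wins
```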

Pipeline

Pipelines wrap Engine with pre- and post-processing, enabling you to pass raw data and receive the post-processed prediction. The example below downloads a 90% pruned-quantized BERT model for sentiment analysis in ONNX format from SparseZoo, sets up a pipeline, and runs inference on sample data.

from deepsparse import Pipeline

# download onnx, set up pipeline
zoo_stub = "zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none"  
sentiment_analysis_pipeline = Pipeline.create(
  task="sentiment-analysis",    # name of the task
  model_path=zoo_stub,          # zoo stub or path to local onnx file
)

# run inference (input is a sentence, output is the prediction)
prediction = sentiment_analysis_pipeline("I love using DeepSparse Pipelines")
print(prediction)
# > labels=['positive'] scores=[0.9954759478569031]

Server

Server wraps Pipelines with REST APIs, enabling you to set up a model-serving endpoint running DeepSparse. This enables you to send raw data to DeepSparse over HTTP and receive the post-processed predictions. DeepSparse Server is launched from the command line and configured via arguments or a server configuration file. The following downloads a 90% pruned-quantized BERT model for sentiment analysis in ONNX format from SparseZoo and launches a sentiment analysis endpoint:

deepsparse.server \
  --task sentiment-analysis \
  --model_path zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none

Sending a request:

import requests

url = "http://localhost:5543/v2/models/sentiment_analysis/infer"  # the Server defaults to port 5543
obj = {"sequences": "Snorlax loves my Tesla!"}

response = requests.post(url, json=obj)
print(response.text)
# {"labels":["positive"],"scores":[0.9965094327926636]}


Product Usage Analytics

DeepSparse gathers basic usage telemetry, including, but not limited to, Invocations, Package, Version, and IP Address, for Product Usage Analytics purposes. Review Neural Magic's Products Privacy Policy for further details on how we process this data.

To disable Product Usage Analytics, run:

export NM_DISABLE_ANALYTICS=True

To confirm that telemetry is disabled, check the info logs streamed during engine invocation for the phrase "Skipping Neural Magic's latest package version check."

Community

Get In Touch

For more general questions about Neural Magic, complete this form.

License

Cite

Find this project useful in your research or other communications? Please consider citing:

@misc{kurtic2023sparse,
      title={Sparse Fine-Tuning for Inference Acceleration of Large Language Models}, 
      author={Eldar Kurtic and Denis Kuznedelev and Elias Frantar and Michael Goin and Dan Alistarh},
      year={2023},
      url={https://arxiv.org/abs/2310.06927},
      eprint={2310.06927},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

@misc{kurtic2022optimal,
      title={The Optimal BERT Surgeon: Scalable and Accurate Second-Order Pruning for Large Language Models}, 
      author={Eldar Kurtic and Daniel Campos and Tuan Nguyen and Elias Frantar and Mark Kurtz and Benjamin Fineran and Michael Goin and Dan Alistarh},
      year={2022},
      url={https://arxiv.org/abs/2203.07259},
      eprint={2203.07259},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

@InProceedings{pmlr-v119-kurtz20a,
    title = {Inducing and Exploiting Activation Sparsity for Fast Inference on Deep Neural Networks}, 
    author = {Kurtz, Mark and Kopinsky, Justin and Gelashvili, Rati and Matveev, Alexander and Carr, John and Goin, Michael and Leiserson, William and Moore, Sage and Nell, Bill and Shavit, Nir and Alistarh, Dan}, 
    booktitle = {Proceedings of the 37th International Conference on Machine Learning}, 
    pages = {5533--5543}, 
    year = {2020}, 
    editor = {Hal Daumé III and Aarti Singh}, 
    volume = {119}, 
    series = {Proceedings of Machine Learning Research}, 
    address = {Virtual}, 
    month = {13--18 Jul}, 
    publisher = {PMLR}, 
    pdf = {http://proceedings.mlr.press/v119/kurtz20a/kurtz20a.pdf},
    url = {http://proceedings.mlr.press/v119/kurtz20a.html}
}

@article{DBLP:journals/corr/abs-2111-13445,
  author    = {Eugenia Iofinova and Alexandra Peste and Mark Kurtz and Dan Alistarh},
  title     = {How Well Do Sparse Imagenet Models Transfer?},
  journal   = {CoRR},
  volume    = {abs/2111.13445},
  year      = {2021},
  url       = {https://arxiv.org/abs/2111.13445},
  eprinttype = {arXiv},
  eprint    = {2111.13445},
  timestamp = {Wed, 01 Dec 2021 15:16:43 +0100},
  biburl    = {https://dblp.org/rec/journals/corr/abs-2111-13445.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

All Thanks To Our Contributors



Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

deepsparse-1.7.1.tar.gz (46.6 MB, Source)

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

deepsparse-1.7.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (47.1 MB, CPython 3.11, manylinux: glibc 2.17+ x86-64)

deepsparse-1.7.1-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (40.5 MB, CPython 3.11, manylinux: glibc 2.17+ ARM64)

deepsparse-1.7.1-cp311-cp311-macosx_13_0_arm64.whl (33.4 MB, CPython 3.11, macOS 13.0+ ARM64)

deepsparse-1.7.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (47.1 MB, CPython 3.10, manylinux: glibc 2.17+ x86-64)

deepsparse-1.7.1-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (40.5 MB, CPython 3.10, manylinux: glibc 2.17+ ARM64)

deepsparse-1.7.1-cp310-cp310-macosx_13_0_arm64.whl (33.4 MB, CPython 3.10, macOS 13.0+ ARM64)

deepsparse-1.7.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (47.1 MB, CPython 3.9, manylinux: glibc 2.17+ x86-64)

deepsparse-1.7.1-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (40.5 MB, CPython 3.9, manylinux: glibc 2.17+ ARM64)

deepsparse-1.7.1-cp39-cp39-macosx_13_0_arm64.whl (33.4 MB, CPython 3.9, macOS 13.0+ ARM64)

deepsparse-1.7.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (47.1 MB, CPython 3.8, manylinux: glibc 2.17+ x86-64)

deepsparse-1.7.1-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (40.5 MB, CPython 3.8, manylinux: glibc 2.17+ ARM64)

deepsparse-1.7.1-cp38-cp38-macosx_13_0_arm64.whl (33.4 MB, CPython 3.8, macOS 13.0+ ARM64)

File details

Per the recorded metadata, files were uploaded via twine/3.3.0 pkginfo/1.7.0 requests/2.24.0 setuptools/45.2.0 requests-toolbelt/0.9.1 tqdm/4.56.0 CPython/3.8.10, without Trusted Publishing.

File hashes

deepsparse-1.7.1.tar.gz
  SHA256: f208de63fe1739a2084220a877a7181b2c16599635405f693767b0e0fd7c2f8d
  MD5: 4d4de597c61ab20ff832900353dcc9ce
  BLAKE2b-256: 439bb8dbc8a5bd73309d3f904f5c5b8189441e7e708f9eb25f9c3b1880bdde1b

deepsparse-1.7.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
  SHA256: 75dab6570f553434ccc64d86b4e5f0948396c9c5ccdc4567ce9cfd1dfbcb0329
  MD5: bab17ac61fd6139d27dadecdd31454ab
  BLAKE2b-256: 5ce7faddd30ec18467490fb63e461950e65d956b3d05abcb78ccc7dda51cbb67

deepsparse-1.7.1-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
  SHA256: a1839fd511fd661855cdf8a53c456ad37e654e2935243f0f983a932449d78189
  MD5: 303e6837464e7952da555f2684c0ac9e
  BLAKE2b-256: f49664b4167470aff44929f980b7bc11e978cc5502df14bc88ddaac552ecdc21

deepsparse-1.7.1-cp311-cp311-macosx_13_0_arm64.whl
  SHA256: 523ea5628d22f8e968140f88af0e4ec1cb3d6482102cac4135bdf4d085e65def
  MD5: 2996ab2929f9c98ad3e4597c5d8fccfd
  BLAKE2b-256: 28a7732be660f72e7f71d969a6948ca9cf9c991d0cdd3ca33db0e5e9719f30f7

deepsparse-1.7.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
  SHA256: b7e700ea914d65eda0491a4a3ccba34e9a12d5b3baf60c219ff0c5eb8944dcfc
  MD5: d257c55dfec3147c8ac628d1b6678048
  BLAKE2b-256: 675efa37e37507a50dc7ab4314909de2f6738b6a6689b3757cb07efc0d69474d

deepsparse-1.7.1-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
  SHA256: 6db2a1a6c5254cb2dea0c5f871edf6fe9053f849341efee4a1a28a6db1d22f4b
  MD5: 338c0c5be96cfe434c39e745f1423465
  BLAKE2b-256: eddce1a282862824f118c6bade17e84b94dae18ddbfb89edc8b61a5b7d11df6d

deepsparse-1.7.1-cp310-cp310-macosx_13_0_arm64.whl
  SHA256: f758b406633c64b21d3839dc27dbe6013e416840744d92c334889f5f07b2e70b
  MD5: 87222ec40d07ecfb219583628942eb83
  BLAKE2b-256: 859449aeedb7cf909d35d664c09e54ca4dc8d283bb8f5b183c15ffbe1f94ef73

deepsparse-1.7.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
  SHA256: e39b9ea5ff4ca6e332386a5f1ed85f9bc1c0dcc67684e13ee854c19bd0445df8
  MD5: 1a8dd06b40702ba26e217307643dc2a3
  BLAKE2b-256: d3b9bbfb9de79710cebf8ccd628595f2060d6db4a1710024abe4e27983975521

deepsparse-1.7.1-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
  SHA256: 7ab133349f0b5893645e8d606f1ad902839235ec961b88ee2187c23d2f370a2c
  MD5: 2fe638a2eb21cd98c1aefacc01444665
  BLAKE2b-256: b7f76d4e05d7c2e85c98735f6e00be620c22f1771803ce6824bdc08b1eed24e5

deepsparse-1.7.1-cp39-cp39-macosx_13_0_arm64.whl
  SHA256: 6917a84dcedb1a0dbbde54985e50e8b5f844c5a4a745864cbbd1f8890390aa7c
  MD5: e1651ffff32d20d4d9c2990318e1c6aa
  BLAKE2b-256: 5f49f525f3f9d832c8420af9060a2b852c9ca8307caecc19429b2ce3db4cb2c2

deepsparse-1.7.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
  SHA256: 019eefc27a1e7e0cc312c46132b49e444576270a1eee8168d43eaa19c72f7691
  MD5: f9c7ea7316160c79962fc4da94934a1d
  BLAKE2b-256: fa18a6cc3bf8603efd7a6f8a69f0856be6037b1c103faed9d76da2a973c7cc70

deepsparse-1.7.1-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
  SHA256: b92ed04b120d33d42f0bf94ba2d2fc7e002e7fe9547c6655306d8d7f17f6bb69
  MD5: 5f1c8ff67d302d83ecc18e50d5513364
  BLAKE2b-256: e3e56c6d1b82c77179e1a3d79be26e8668151d60d4a8350e7b0aeebdf2c4e465

deepsparse-1.7.1-cp38-cp38-macosx_13_0_arm64.whl
  SHA256: 984149463d9ec495e492a4d069c5f0823453b49caa1f608429656da2d5e9bcaa
  MD5: 46ef1d1868058d666ec2f5b50ea8040d
  BLAKE2b-256: 1f8565b07418872d78d14478d8fba023a58da6a7939fd278203f272c79bc267b

See more details on using hashes here.
