An inference runtime offering GPU-class performance on CPUs and APIs to integrate ML into your application

DeepSparse

Sparsity-aware deep learning inference runtime for CPUs

DeepSparse is a CPU inference runtime that takes advantage of sparsity to accelerate neural network inference. Coupled with SparseML, our optimization library for pruning and quantizing your models, DeepSparse delivers exceptional inference performance on CPU hardware.

✨NEW✨ DeepSparse LLMs

Neural Magic is excited to announce initial support for performant LLM inference in DeepSparse with:

  • sparse kernels for speedups and memory savings from unstructured sparse weights.
  • 8-bit weight and activation quantization support.
  • efficient usage of cached attention keys and values for minimal memory movement.
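
To make the 8-bit quantization bullet concrete, here is a minimal sketch of symmetric per-tensor int8 quantization in plain Python. It is a generic illustration, not DeepSparse's actual quantization scheme; the function names `quantize_int8` and `dequantize_int8` are hypothetical.

```python
def quantize_int8(values):
    """Map floats to signed 8-bit integers using one shared scale (symmetric)."""
    max_abs = max(abs(v) for v in values)
    # The largest magnitude maps to 127; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize_int8(quantized, scale):
    """Recover approximate float values from the int8 codes."""
    return [q * scale for q in quantized]


weights = [0.5, -0.25, 0.0, 1.0]
codes, scale = quantize_int8(weights)
restored = dequantize_int8(codes, scale)
print(codes, restored)
```

Each restored value differs from the original by at most about half the scale, which is the rounding error this kind of scheme trades for a 4x smaller representation than float32.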

Try It Now

Install (requires Linux):

pip install -U deepsparse-nightly[llm]

Run inference:

from deepsparse import TextGeneration
pipeline = TextGeneration(model="zoo:mpt-7b-dolly_mpt_pretrain-pruned50_quantized")

prompt="""
Below is an instruction that describes a task. Write a response that appropriately completes the request. ### Instruction: what is sparsity? ### Response:
"""
print(pipeline(prompt, max_new_tokens=75).generations[0].text)

# Sparsity is the property of a matrix or other data structure in which a large number of elements are zero and a smaller number of elements are non-zero. In the context of machine learning, sparsity can be used to improve the efficiency of training and prediction.

Check out the TextGeneration documentation for usage details.

Sparsity 🤝 Performance

Developed in collaboration with IST Austria, our recent paper details a new technique called Sparse Fine-Tuning, which allows us to prune MPT-7B to 60% sparsity during fine-tuning without a drop in accuracy. With our new support for LLMs, DeepSparse accelerates the sparse-quantized model 7x over the dense baseline.

Learn more about our Sparse Fine-Tuning research.

Check out the model running live on Hugging Face.
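
For readers new to the terminology, the sketch below shows what "60% unstructured sparsity" means, using simple one-shot magnitude pruning on a small list of weights. This is not the Sparse Fine-Tuning method from the paper, which prunes during fine-tuning; it is only an assumed, simplified stand-in, and the helper names `magnitude_prune` and `sparsity_of` are hypothetical.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction of weights (unstructured pruning)."""
    num_to_prune = round(len(weights) * sparsity)
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:num_to_prune])
    return [0.0 if i in pruned else w for i, w in enumerate(weights)]


def sparsity_of(weights):
    """Fraction of weights that are exactly zero."""
    return sum(1 for w in weights if w == 0) / len(weights)


dense = [0.8, -0.05, 0.3, -0.6, 0.02, 0.1, -0.9, 0.04, 0.25, -0.15]
sparse = magnitude_prune(dense, sparsity=0.6)
print(sparse, sparsity_of(sparse))
```

The zeros land wherever the small weights happen to be, with no block or channel pattern, which is why this style of sparsity is called unstructured.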

LLM Roadmap

Following this initial launch, we are rapidly expanding our support for LLMs, including:

  1. Productizing Sparse Fine-Tuning: Enable external users to apply sparse fine-tuning to their datasets via SparseML.
  2. Expanding model support: Apply our sparse fine-tuning results to Llama 2 and Mistral models.
  3. Pushing for higher sparsity: Improve our pruning algorithms to reach even higher sparsity.

Computer Vision and NLP Models

In addition to LLMs, DeepSparse supports many variants of CNNs and Transformer models, such as BERT, ViT, ResNet, EfficientNet, YOLOv5/8, and many more! Take a look at the Computer Vision and Natural Language Processing domains of SparseZoo, our home for optimized models.

Installation

Install via PyPI (optional dependencies detailed here):

pip install deepsparse 

To experiment with the latest features, install the nightly build with pip install deepsparse-nightly, or clone the repository and install from source with pip install -e path/to/deepsparse.

System Requirements

For those using Mac or Windows, we recommend using Linux containers with Docker.

Deployment APIs

DeepSparse includes three deployment APIs:

  • Engine is the lowest-level API. With Engine, you compile an ONNX model, pass tensors as input, and receive the raw outputs.
  • Pipeline wraps the Engine with pre- and post-processing. With Pipeline, you pass raw data and receive the prediction.
  • Server wraps Pipelines with a REST API using FastAPI. With Server, you send raw data over HTTP and receive the prediction.

Engine

The example below downloads a 90% pruned-quantized BERT model for sentiment analysis in ONNX format from SparseZoo, compiles the model, and runs inference on randomly generated input. Users can provide their own ONNX models, whether dense or sparse.

from deepsparse import Engine

# download onnx, compile
zoo_stub = "zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none"
compiled_model = Engine(model=zoo_stub, batch_size=1)

# run inference (input is raw numpy tensors, output is raw scores)
inputs = compiled_model.generate_random_inputs()
output = compiled_model(inputs)
print(output)

# > [array([[-0.3380675 ,  0.09602544]], dtype=float32)] << raw scores

Pipeline

Pipelines wrap Engine with pre- and post-processing, enabling you to pass raw data and receive the post-processed prediction. The example below downloads a 90% pruned-quantized BERT model for sentiment analysis in ONNX format from SparseZoo, sets up a pipeline, and runs inference on sample data.

from deepsparse import Pipeline

# download onnx, set up pipeline
zoo_stub = "zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none"  
sentiment_analysis_pipeline = Pipeline.create(
  task="sentiment-analysis",    # name of the task
  model_path=zoo_stub,          # zoo stub or path to local onnx file
)

# run inference (input is a sentence, output is the prediction)
prediction = sentiment_analysis_pipeline("I love using DeepSparse Pipelines")
print(prediction)
# > labels=['positive'] scores=[0.9954759478569031]
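
To illustrate the kind of post-processing a Pipeline adds on top of raw Engine scores, here is a minimal sketch that turns two raw scores into a label and a probability with a softmax. It assumes the label order is `["negative", "positive"]`, which the text above does not confirm, and the `postprocess` helper is hypothetical rather than part of the DeepSparse API.

```python
import math


def postprocess(logits, labels):
    """Apply a numerically stable softmax and return the top label with its score."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


# Raw scores printed by the Engine example (produced from random input).
raw_scores = [-0.3380675, 0.09602544]
label, score = postprocess(raw_scores, ["negative", "positive"])
print(label, score)
```

Because the Engine example runs on random input, the resulting score is far less confident than the Pipeline prediction for a real sentence.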

Server

Server wraps Pipelines with REST APIs, letting you set up a model-serving endpoint running DeepSparse, send raw data over HTTP, and receive the post-processed predictions. DeepSparse Server is launched from the command line and configured via arguments or a server configuration file. The following command downloads a 90% pruned-quantized BERT model for sentiment analysis in ONNX format from SparseZoo and launches a sentiment analysis endpoint:

deepsparse.server \
  --task sentiment-analysis \
  --model_path zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none

Sending a request:

import requests

url = "http://localhost:5543/v2/models/sentiment_analysis/infer"  # the Server listens on port 5543 by default
obj = {"sequences": "Snorlax loves my Tesla!"}

response = requests.post(url, json=obj)
print(response.text)
# {"labels":["positive"],"scores":[0.9965094327926636]}
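
The response body is JSON, so client code will usually parse it rather than print it. The sketch below extracts the highest-scoring label from a body shaped like the example response above; the `top_prediction` helper is hypothetical, and it assumes the `labels` and `scores` lists are aligned by position.

```python
import json


def top_prediction(response_text):
    """Return the (label, score) pair with the highest score from a response body."""
    payload = json.loads(response_text)
    pairs = zip(payload["labels"], payload["scores"])
    return max(pairs, key=lambda pair: pair[1])


# Body copied from the example response above.
example_body = '{"labels":["positive"],"scores":[0.9965094327926636]}'
label, score = top_prediction(example_body)
print(label, score)
```

In a real client you would pass `response.text` from the request above in place of `example_body`.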

Product Usage Analytics

DeepSparse gathers basic usage telemetry, including, but not limited to, Invocations, Package, Version, and IP Address, for Product Usage Analytics purposes. Review Neural Magic's Products Privacy Policy for further details on how we process this data.

To disable Product Usage Analytics, run:

export NM_DISABLE_ANALYTICS=True

Confirm that telemetry is shut off through info logs streamed with engine invocation by looking for the phrase "Skipping Neural Magic's latest package version check."
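
If you launch DeepSparse from a Python script or notebook instead of a shell, the same variable can be set in-process. This is a minimal sketch that assumes DeepSparse reads `NM_DISABLE_ANALYTICS` from the process environment when it starts up, so the variable is set before the import; the `disable_analytics` helper is hypothetical.

```python
import os


def disable_analytics():
    """Set the opt-out variable for this process and any child processes."""
    os.environ["NM_DISABLE_ANALYTICS"] = "True"
    return os.environ["NM_DISABLE_ANALYTICS"]


disable_analytics()
# Import deepsparse only after this point so the setting is already in place, e.g.:
# from deepsparse import Pipeline
```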

Community

Get In Touch

For more general questions about Neural Magic, complete this form.

License

Cite

Find this project useful in your research or other communications? Please consider citing:

@misc{kurtic2023sparse,
      title={Sparse Fine-Tuning for Inference Acceleration of Large Language Models}, 
      author={Eldar Kurtic and Denis Kuznedelev and Elias Frantar and Michael Goin and Dan Alistarh},
      year={2023},
      url={https://arxiv.org/abs/2310.06927},
      eprint={2310.06927},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

@misc{kurtic2022optimal,
      title={The Optimal BERT Surgeon: Scalable and Accurate Second-Order Pruning for Large Language Models}, 
      author={Eldar Kurtic and Daniel Campos and Tuan Nguyen and Elias Frantar and Mark Kurtz and Benjamin Fineran and Michael Goin and Dan Alistarh},
      year={2022},
      url={https://arxiv.org/abs/2203.07259},
      eprint={2203.07259},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

@InProceedings{
    pmlr-v119-kurtz20a, 
    title = {Inducing and Exploiting Activation Sparsity for Fast Inference on Deep Neural Networks}, 
    author = {Kurtz, Mark and Kopinsky, Justin and Gelashvili, Rati and Matveev, Alexander and Carr, John and Goin, Michael and Leiserson, William and Moore, Sage and Nell, Bill and Shavit, Nir and Alistarh, Dan}, 
    booktitle = {Proceedings of the 37th International Conference on Machine Learning}, 
    pages = {5533--5543}, 
    year = {2020}, 
    editor = {Hal Daumé III and Aarti Singh}, 
    volume = {119}, 
    series = {Proceedings of Machine Learning Research}, 
    address = {Virtual}, 
    month = {13--18 Jul}, 
    publisher = {PMLR}, 
    pdf = {http://proceedings.mlr.press/v119/kurtz20a/kurtz20a.pdf},
    url = {http://proceedings.mlr.press/v119/kurtz20a.html}
}

@article{DBLP:journals/corr/abs-2111-13445,
  author    = {Eugenia Iofinova and Alexandra Peste and Mark Kurtz and Dan Alistarh},
  title     = {How Well Do Sparse Imagenet Models Transfer?},
  journal   = {CoRR},
  volume    = {abs/2111.13445},
  year      = {2021},
  url       = {https://arxiv.org/abs/2111.13445},
  eprinttype = {arXiv},
  eprint    = {2111.13445},
  timestamp = {Wed, 01 Dec 2021 15:16:43 +0100},
  biburl    = {https://dblp.org/rec/journals/corr/abs-2111-13445.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}


