
An inference runtime offering GPU-class performance on CPUs and APIs to integrate ML into your application

Project description

DeepSparse

Sparsity-aware deep learning inference runtime for CPUs

DeepSparse is a CPU inference runtime that takes advantage of sparsity to accelerate neural network inference. Coupled with SparseML, our optimization library for pruning and quantizing your models, DeepSparse delivers exceptional inference performance on CPU hardware.


✨NEW✨ DeepSparse LLMs

Neural Magic is excited to announce initial support for performant LLM inference in DeepSparse with:

  • sparse kernels for speedups and memory savings from unstructured sparse weights.
  • 8-bit weight and activation quantization support.
  • efficient usage of cached attention keys and values for minimal memory movement.


Try It Now

Install (requires Linux):

pip install -U deepsparse-nightly[llm]

Run inference:

from deepsparse import TextGeneration
pipeline = TextGeneration(model="zoo:mpt-7b-dolly_mpt_pretrain-pruned50_quantized")

prompt = """
Below is an instruction that describes a task. Write a response that appropriately completes the request. ### Instruction: what is sparsity? ### Response:
"""
print(pipeline(prompt, max_new_tokens=75).generations[0].text)

# Sparsity is the property of a matrix or other data structure in which a large number of elements are zero and a smaller number of elements are non-zero. In the context of machine learning, sparsity can be used to improve the efficiency of training and prediction.

Check out the TextGeneration documentation for usage details.
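
The pipeline object is stateful and reusable, so a convenient pattern is to wrap the instruction template once and reuse it across questions. A minimal sketch, assuming the same model stub as above (the ask helper is hypothetical, not part of the DeepSparse API):

from deepsparse import TextGeneration

pipeline = TextGeneration(model="zoo:mpt-7b-dolly_mpt_pretrain-pruned50_quantized")

# Hypothetical helper: wraps a question in the instruction format that the
# dolly-finetuned MPT checkpoint above expects.
def ask(question: str, max_new_tokens: int = 75) -> str:
    prompt = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request. "
        f"### Instruction: {question} ### Response:"
    )
    return pipeline(prompt, max_new_tokens=max_new_tokens).generations[0].text

print(ask("what is sparsity?"))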

Sparsity 🤝 Performance

Developed in collaboration with IST Austria, our recent paper details a new technique called Sparse Fine-Tuning, which allows us to prune MPT-7B to 60% sparsity during fine-tuning without a drop in accuracy. With our new support for LLMs, DeepSparse accelerates the resulting sparse-quantized model 7x over the dense baseline.

Learn more about our Sparse Fine-Tuning research.

Check out the model running live on Hugging Face.

LLM Roadmap

Following this initial launch, we are rapidly expanding our support for LLMs, including:

  1. Productizing Sparse Fine-Tuning: Enable external users to apply sparse fine-tuning to their datasets via SparseML.
  2. Expanding model support: Apply our sparse fine-tuning results to Llama 2 and Mistral models.
  3. Pushing for higher sparsity: Improve our pruning algorithms to reach even higher levels of sparsity.

Computer Vision and NLP Models

In addition to LLMs, DeepSparse supports many variants of CNNs and Transformer models, such as BERT, ViT, ResNet, EfficientNet, YOLOv5/8, and many more! Take a look at the Computer Vision and Natural Language Processing domains of SparseZoo, our home for optimized models.
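
The same Pipeline API shown in detail below covers these domains as well. As a sketch for image classification, assuming the computer-vision extras are installed (pip install deepsparse[image_classification]) and using a ResNet-50 stub that should be verified against SparseZoo:

from deepsparse import Pipeline

# The zoo stub and image path below are illustrative; browse SparseZoo for current stubs.
cv_pipeline = Pipeline.create(
    task="image_classification",
    model_path="zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/pruned95_quant-none",
)
prediction = cv_pipeline(images=["my_image.png"])
print(prediction)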

Installation

Install via PyPI (optional dependencies detailed here):

pip install deepsparse 

To experiment with the latest features, install the nightly build with pip install deepsparse-nightly, or clone the repository and install from source with pip install -e path/to/deepsparse.

System Requirements

DeepSparse runs natively on Linux. For those using Mac or Windows, we recommend using Linux containers with Docker.
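
A minimal sketch of that container workflow, assuming the official python:3.10 image and the PyPI install shown above:

# Run a throwaway Linux container, install DeepSparse, and verify the import.
docker run --rm -it python:3.10 bash -c \
  "pip install deepsparse && python -c 'import deepsparse; print(deepsparse.__version__)'"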

Deployment APIs

DeepSparse includes three deployment APIs:

  • Engine is the lowest-level API. With Engine, you compile an ONNX model, pass tensors as input, and receive the raw outputs.
  • Pipeline wraps the Engine with pre- and post-processing. With Pipeline, you pass raw data and receive the prediction.
  • Server wraps Pipelines with a REST API using FastAPI. With Server, you send raw data over HTTP and receive the prediction.

Engine

The example below downloads a 90% pruned-quantized BERT model for sentiment analysis in ONNX format from SparseZoo, compiles the model, and runs inference on randomly generated input. Users can provide their own ONNX models, whether dense or sparse.

from deepsparse import Engine

# download onnx, compile
zoo_stub = "zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none"
compiled_model = Engine(model=zoo_stub, batch_size=1)

# run inference (input is raw numpy tensors, output is raw scores)
inputs = compiled_model.generate_random_inputs()
output = compiled_model(inputs)
print(output)

# > [array([[-0.3380675 ,  0.09602544]], dtype=float32)] << raw scores
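
To compile your own ONNX file instead of a zoo stub, pass a local path. A short sketch in which the model path and input shape are hypothetical placeholders:

import numpy as np
from deepsparse import Engine

# "path/to/model.onnx" and the NCHW input shape below are placeholders;
# substitute your own model and matching input tensors.
compiled_model = Engine(model="path/to/model.onnx", batch_size=1)
inputs = [np.random.rand(1, 3, 224, 224).astype(np.float32)]  # one array per model input
output = compiled_model(inputs)
print(output)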

Pipeline

Pipelines wrap Engine with pre- and post-processing, enabling you to pass raw data and receive the post-processed prediction. The example below downloads a 90% pruned-quantized BERT model for sentiment analysis in ONNX format from SparseZoo, sets up a pipeline, and runs inference on sample data.

from deepsparse import Pipeline

# download onnx, set up pipeline
zoo_stub = "zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none"  
sentiment_analysis_pipeline = Pipeline.create(
  task="sentiment-analysis",    # name of the task
  model_path=zoo_stub,          # zoo stub or path to local onnx file
)

# run inference (input is a sentence, output is the prediction)
prediction = sentiment_analysis_pipeline("I love using DeepSparse Pipelines")
print(prediction)
# > labels=['positive'] scores=[0.9954759478569031]
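
Pipelines also accept a list of inputs and return one prediction per item, which is convenient for small batches. A short sketch reusing the pipeline above:

# Batched inference: pass a list of sentences, receive a label and score per sentence.
sentences = [
    "I love using DeepSparse Pipelines",
    "The results were disappointing",
]
predictions = sentiment_analysis_pipeline(sentences)
print(predictions)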

Server

Server wraps Pipelines with a REST API, enabling you to stand up a model-serving endpoint running DeepSparse: you send raw data over HTTP and receive the post-processed predictions. DeepSparse Server is launched from the command line and configured via arguments or a server configuration file. The following command downloads a 90% pruned-quantized BERT model for sentiment analysis in ONNX format from SparseZoo and launches a sentiment analysis endpoint:

deepsparse.server \
  --task sentiment-analysis \
  --model_path zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none

Sending a request:

import requests

url = "http://localhost:5543/v2/models/sentiment_analysis/infer"  # the server's port defaults to 5543
obj = {"sequences": "Snorlax loves my Tesla!"}

response = requests.post(url, json=obj)
print(response.text)
# {"labels":["positive"],"scores":[0.9965094327926636]}

Additional Resources

Product Usage Analytics

DeepSparse gathers basic usage telemetry, including, but not limited to, Invocations, Package, Version, and IP Address, for Product Usage Analytics purposes. Review Neural Magic's Products Privacy Policy for further details on how we process this data.

To disable Product Usage Analytics, run:

export NM_DISABLE_ANALYTICS=True

To confirm that telemetry is disabled, check the info logs streamed during engine invocation for the phrase "Skipping Neural Magic's latest package version check."
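
If you prefer to disable analytics from Python rather than the shell, setting the variable before importing deepsparse should have the same effect, assuming the flag is read at import time:

import os

# Must be set before deepsparse is imported; mirrors the shell export above.
os.environ["NM_DISABLE_ANALYTICS"] = "True"

import deepsparse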

Community

Get In Touch

For more general questions about Neural Magic, complete this form.

License

Cite

Find this project useful in your research or other communications? Please consider citing:

@misc{kurtic2023sparse,
      title={Sparse Fine-Tuning for Inference Acceleration of Large Language Models}, 
      author={Eldar Kurtic and Denis Kuznedelev and Elias Frantar and Michael Goin and Dan Alistarh},
      year={2023},
      url={https://arxiv.org/abs/2310.06927},
      eprint={2310.06927},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

@misc{kurtic2022optimal,
      title={The Optimal BERT Surgeon: Scalable and Accurate Second-Order Pruning for Large Language Models}, 
      author={Eldar Kurtic and Daniel Campos and Tuan Nguyen and Elias Frantar and Mark Kurtz and Benjamin Fineran and Michael Goin and Dan Alistarh},
      year={2022},
      url={https://arxiv.org/abs/2203.07259},
      eprint={2203.07259},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

@inproceedings{pmlr-v119-kurtz20a,
      title={Inducing and Exploiting Activation Sparsity for Fast Inference on Deep Neural Networks},
      author={Kurtz, Mark and Kopinsky, Justin and Gelashvili, Rati and Matveev, Alexander and Carr, John and Goin, Michael and Leiserson, William and Moore, Sage and Nell, Bill and Shavit, Nir and Alistarh, Dan},
      booktitle={Proceedings of the 37th International Conference on Machine Learning},
      pages={5533--5543},
      year={2020},
      editor={Hal Daumé III and Aarti Singh},
      volume={119},
      series={Proceedings of Machine Learning Research},
      address={Virtual},
      month={13--18 Jul},
      publisher={PMLR},
      pdf={http://proceedings.mlr.press/v119/kurtz20a/kurtz20a.pdf},
      url={http://proceedings.mlr.press/v119/kurtz20a.html}
}

@article{DBLP:journals/corr/abs-2111-13445,
      title={How Well Do Sparse Imagenet Models Transfer?},
      author={Eugenia Iofinova and Alexandra Peste and Mark Kurtz and Dan Alistarh},
      journal={CoRR},
      volume={abs/2111.13445},
      year={2021},
      url={https://arxiv.org/abs/2111.13445},
      eprinttype={arXiv},
      eprint={2111.13445}
}

All Thanks To Our Contributors



Download files

Download the file for your platform.

Source Distribution

  • deepsparse-nightly-1.7.0.20240304.tar.gz (46.9 MB; Source)

Built Distributions

  • deepsparse_nightly-1.7.0.20240304-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (47.4 MB; CPython 3.11; manylinux: glibc 2.17+ x86-64)
  • deepsparse_nightly-1.7.0.20240304-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (40.6 MB; CPython 3.11; manylinux: glibc 2.17+ ARM64)
  • deepsparse_nightly-1.7.0.20240304-cp311-cp311-macosx_13_0_arm64.whl (33.5 MB; CPython 3.11; macOS 13.0+ ARM64)
  • deepsparse_nightly-1.7.0.20240304-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (47.4 MB; CPython 3.10; manylinux: glibc 2.17+ x86-64)
  • deepsparse_nightly-1.7.0.20240304-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (40.6 MB; CPython 3.10; manylinux: glibc 2.17+ ARM64)
  • deepsparse_nightly-1.7.0.20240304-cp310-cp310-macosx_13_0_arm64.whl (33.5 MB; CPython 3.10; macOS 13.0+ ARM64)
  • deepsparse_nightly-1.7.0.20240304-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (47.4 MB; CPython 3.9; manylinux: glibc 2.17+ x86-64)
  • deepsparse_nightly-1.7.0.20240304-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (40.6 MB; CPython 3.9; manylinux: glibc 2.17+ ARM64)
  • deepsparse_nightly-1.7.0.20240304-cp39-cp39-macosx_13_0_arm64.whl (33.5 MB; CPython 3.9; macOS 13.0+ ARM64)
  • deepsparse_nightly-1.7.0.20240304-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (47.4 MB; CPython 3.8; manylinux: glibc 2.17+ x86-64)
  • deepsparse_nightly-1.7.0.20240304-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (40.6 MB; CPython 3.8; manylinux: glibc 2.17+ ARM64)
  • deepsparse_nightly-1.7.0.20240304-cp38-cp38-macosx_13_0_arm64.whl (33.5 MB; CPython 3.8; macOS 13.0+ ARM64)

File details

deepsparse-nightly-1.7.0.20240304.tar.gz

  • Size: 46.9 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.3.0 pkginfo/1.7.0 requests/2.25.1 setuptools/39.0.1 requests-toolbelt/0.9.1 tqdm/4.56.0 CPython/3.6.9
  • SHA256: fb6704c7821c1f2d56d010cc54dc42e005d5b34c65c0a54f3ae74fde2c4f8b48
  • MD5: f09bb0787d1487a4812d2370be4f22f5
  • BLAKE2b-256: 71defa0974b51efa40927a8e3db2e44346aa05686e9602f95dbfca446e72623d

deepsparse_nightly-1.7.0.20240304-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl

  • SHA256: d593f80f40ad6d73ec29cd00ad10fe6a04b0b4e359e4256e3684d0d3266a5e61
  • MD5: fd45230704e6d4afd65b4222cb6c2122
  • BLAKE2b-256: 91a70450f6e274f9f1fe6c3ff739f88914f941d69dac3293862538dc182a2740

deepsparse_nightly-1.7.0.20240304-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl

  • SHA256: dcef8c0d8bdb612cbad940b3f1cb57459cd8a726e27cef3af3f64a5a9404e900
  • MD5: e31b13ee0474899df6af1bb4340d976a
  • BLAKE2b-256: bef96581da25e6ea8c0f578f650a03adeaafa9db0406af2dd8736fa4599834ac

deepsparse_nightly-1.7.0.20240304-cp311-cp311-macosx_13_0_arm64.whl

  • SHA256: af0832d61310695a6dbc62f4d1b2d4cf888cb8a57bb03dbc2dc050abc29f63dc
  • MD5: 7742b275fd0d07c05314957d3003c5bb
  • BLAKE2b-256: 0d7379f920171bd52005e8bac8c00e12ad54a1dc2691586451e78a0af3a59029

deepsparse_nightly-1.7.0.20240304-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl

  • SHA256: 554ff87d8f7036e405150aba6d8b73b90ca9997cd5c33d44462903b87c3e7a7e
  • MD5: 6c011e037daca6c286476a64466c55af
  • BLAKE2b-256: 34b61f2ef6b679764173d9d1a6e0f2e6b36b516275b25e75db8db06c38e07fe0

deepsparse_nightly-1.7.0.20240304-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl

  • SHA256: d9068809168652854bd64f02b944273e0741f7b8687f47963412eeda2265b584
  • MD5: a7f03aa005ef77d5651cfb2317144074
  • BLAKE2b-256: 5251528cd0a0c9c6000ec320db71ed99c1df684a741bf2a295b35de3e403ae34

deepsparse_nightly-1.7.0.20240304-cp310-cp310-macosx_13_0_arm64.whl

  • SHA256: d9029a70d4ed3e0f89d210b2c0188444b3167b421b2bb14c538da9861f939665
  • MD5: 87f0705a80c21b2233e27feebb9a2807
  • BLAKE2b-256: 66b83d77fa1b890fe0a9aaeaf95ebf7aa745f984cf008edcbcb597a943ae3309

deepsparse_nightly-1.7.0.20240304-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl

  • SHA256: c208b2164117e3c7559ae0d3c973ab22f5c77003b9c6ca68bd711bb7cb05a04c
  • MD5: a619b192dc822704363676a27d137f50
  • BLAKE2b-256: d4e19230e146c3f3a1a3aebb12e34392aba4e7d6ce5e0ea5c2512f6bed6ba7db

deepsparse_nightly-1.7.0.20240304-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl

  • SHA256: ba724ec000f4a71ad0c9dca2d91aec7b908a7382f2b854b7584b84917fb90889
  • MD5: b7abe739cf21baf4e926672c7533bd78
  • BLAKE2b-256: 4840145b8a7b0966560acde1ebcf09f2a1ebcaaad68acbeeaf9a0cc281153b94

deepsparse_nightly-1.7.0.20240304-cp39-cp39-macosx_13_0_arm64.whl

  • SHA256: df2935c3368f9e960d933db758468b7624c7953d9601fe7585baabee30ef3817
  • MD5: aefded70b496f83fd6466cc48193f801
  • BLAKE2b-256: 7f3d5d91d78c1c9cae168f347d6b3b08e97ec58af3fae93608757641a9336efb

deepsparse_nightly-1.7.0.20240304-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl

  • SHA256: 78a816accf6c64c9246accb7dd044044dabd26af08f40a3c189e361590aa9acf
  • MD5: 793754fda1eb4af9c6cf917b6cd83e40
  • BLAKE2b-256: dcd17194351d77e5ff0b01f5d48bc61ce4195a9ea8103881d217ee0414bed7f0

deepsparse_nightly-1.7.0.20240304-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl

  • SHA256: dfcf6fe4f0ea4ab66751334180e7e76fd3ca60bcf45d09bb3d3e74e056420799
  • MD5: 8dc58d13aa8624eab0a48fd117802255
  • BLAKE2b-256: 0f8fec9ea2a6c8511779744c4b457d15c319f92860262556d583366233516f34

deepsparse_nightly-1.7.0.20240304-cp38-cp38-macosx_13_0_arm64.whl

  • SHA256: b15c76dc99e0ec2e74c65c50479c3ff844d2281a7cdefc6472a3cafd31ed8f00
  • MD5: 9732d7ed46f9401a8eac5402a488eacd
  • BLAKE2b-256: 6fa77ba3dfe21c3255ad9443eb16aaaf1b58486982a804d0cecbfe316224de7b
