
An inference runtime offering GPU-class performance on CPUs and APIs to integrate ML into your application

Project description

DeepSparse

Sparsity-aware deep learning inference runtime for CPUs

DeepSparse is a CPU inference runtime that takes advantage of sparsity to accelerate neural network inference. Coupled with SparseML, our optimization library for pruning and quantizing your models, DeepSparse delivers exceptional inference performance on CPU hardware.


✨NEW✨ DeepSparse LLMs

Neural Magic is excited to announce initial support for performant LLM inference in DeepSparse with:

  • sparse kernels for speedups and memory savings from unstructured sparse weights.
  • 8-bit weight and activation quantization support.
  • efficient usage of cached attention keys and values for minimal memory movement.


Try It Now

Install (requires Linux):

pip install -U deepsparse-nightly[llm]

Run inference:

from deepsparse import TextGeneration

# download and compile a 50% pruned, quantized MPT-7B from SparseZoo
pipeline = TextGeneration(model="zoo:mpt-7b-dolly_mpt_pretrain-pruned50_quantized")

prompt = """
Below is an instruction that describes a task. Write a response that appropriately completes the request. ### Instruction: what is sparsity? ### Response:
"""
print(pipeline(prompt, max_new_tokens=75).generations[0].text)

# Sparsity is the property of a matrix or other data structure in which a large number of elements are zero and a smaller number of elements are non-zero. In the context of machine learning, sparsity can be used to improve the efficiency of training and prediction.

Check out the TextGeneration documentation for usage details and get the latest sparsified LLMs on our HF Collection.
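The pipeline can also stream text as it is generated rather than returning the full completion at once. A minimal sketch, reusing pipeline and prompt from the example above and assuming the streaming=True flag covered in the TextGeneration documentation:

# stream results as they are generated; streaming=True is taken from the
# TextGeneration documentation, so treat the exact flag as an assumption
for result in pipeline(prompt, max_new_tokens=75, streaming=True):
    print(result.generations[0].text, end="", flush=True)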

Sparsity 🤝 Performance

Developed in collaboration with IST Austria, our recent paper details a new technique called Sparse Fine-Tuning, which allows us to prune MPT-7B to 60% sparsity during fine-tuning without a drop in accuracy. With our new support for LLMs, DeepSparse accelerates the resulting sparse-quantized model 7x over the dense baseline.

Learn more about our Sparse Fine-Tuning research.

Check out the model running live on Hugging Face.

LLM Roadmap

Following this initial launch, we are rapidly expanding our support for LLMs, including:

  1. Productizing Sparse Fine-Tuning: Enable external users to apply sparse fine-tuning to their datasets via SparseML.
  2. Expanding model support: Apply our sparse fine-tuning results to Llama 2 and Mistral models.
  3. Pushing for higher sparsity: Improve our pruning algorithms to reach even higher sparsity.

Computer Vision and NLP Models

In addition to LLMs, DeepSparse supports many variants of CNNs and Transformer models, such as BERT, ViT, ResNet, EfficientNet, YOLOv5/8, and many more! Take a look at the Computer Vision and Natural Language Processing domains of SparseZoo, our home for optimized models.
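These models share the Pipeline interface detailed below. As a minimal sketch, here is an image-classification pipeline; the zoo stub is illustrative, so browse SparseZoo for the exact stub of the model you want:

from deepsparse import Pipeline

# the stub below is illustrative -- check SparseZoo for current stubs
cv_pipeline = Pipeline.create(
  task="image_classification",
  model_path="zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/pruned95_quant-none",
)

# input is a list of image paths; output contains labels and scores
prediction = cv_pipeline(images=["path/to/image.jpg"])
print(prediction.labels)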

Installation

Install via PyPI (optional dependencies detailed here):

pip install deepsparse 

To experiment with the latest features, there is a nightly build available via pip install deepsparse-nightly, or you can clone and install from source with pip install -e path/to/deepsparse.

System Requirements

DeepSparse ships Linux-only wheels for Python 3.8-3.11 on x86-64 and ARM64. For those using Mac or Windows, we recommend using Linux containers with Docker.

Deployment APIs

DeepSparse includes three deployment APIs:

  • Engine is the lowest-level API. With Engine, you compile an ONNX model, pass tensors as input, and receive the raw outputs.
  • Pipeline wraps the Engine with pre- and post-processing. With Pipeline, you pass raw data and receive the prediction.
  • Server wraps Pipelines with a REST API using FastAPI. With Server, you send raw data over HTTP and receive the prediction.

Engine

The example below downloads a 90% pruned-quantized BERT model for sentiment analysis in ONNX format from SparseZoo, compiles the model, and runs inference on randomly generated input. Users can provide their own ONNX models, whether dense or sparse.

from deepsparse import Engine

# download onnx, compile
zoo_stub = "zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none"
compiled_model = Engine(model=zoo_stub, batch_size=1)

# run inference (input is raw numpy tensors, output is raw scores)
inputs = compiled_model.generate_random_inputs()
output = compiled_model(inputs)
print(output)

# > [array([[-0.3380675 ,  0.09602544]], dtype=float32)] << raw scores
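Engine also accepts a path to a local ONNX file in place of a SparseZoo stub, whether dense or sparse. A minimal sketch, where the file path is a placeholder:

from deepsparse import Engine

# compile a local ONNX model; "path/to/model.onnx" is a placeholder path
local_model = Engine(model="path/to/model.onnx", batch_size=1)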

Pipeline

Pipelines wrap Engine with pre- and post-processing, enabling you to pass raw data and receive the post-processed prediction. The example below downloads a 90% pruned-quantized BERT model for sentiment analysis in ONNX format from SparseZoo, sets up a pipeline, and runs inference on sample data.

from deepsparse import Pipeline

# download onnx, set up pipeline
zoo_stub = "zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none"  
sentiment_analysis_pipeline = Pipeline.create(
  task="sentiment-analysis",    # name of the task
  model_path=zoo_stub,          # zoo stub or path to local onnx file
)

# run inference (input is a sentence, output is the prediction)
prediction = sentiment_analysis_pipeline("I love using DeepSparse Pipelines")
print(prediction)
# > labels=['positive'] scores=[0.9954759478569031]
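Pipelines also accept batches of inputs. A hedged sketch, reusing the pipeline above and assuming the sentiment-analysis task accepts a list of sequences:

# pass a list of sequences; labels and scores come back per input
predictions = sentiment_analysis_pipeline(
  ["I love using DeepSparse Pipelines", "Debugging broken builds is painful"]
)
print(predictions.labels)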

Server

Server wraps Pipelines with REST APIs, enabling you to set up a model-serving endpoint running DeepSparse. This enables you to send raw data to DeepSparse over HTTP and receive the post-processed predictions. DeepSparse Server is launched from the command line and configured via arguments or a server configuration file. The following downloads a 90% pruned-quantized BERT model for sentiment analysis in ONNX format from SparseZoo and launches a sentiment analysis endpoint:

deepsparse.server \
  --task sentiment-analysis \
  --model_path zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none
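Endpoints can equivalently be declared in a server configuration file. A sketch of one possible config, assuming the endpoints/task/model schema and the --config_file flag described in the Server documentation:

# config.yaml -- schema per the Server documentation; treat as a sketch
endpoints:
  - task: sentiment-analysis
    model: zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none

deepsparse.server --config_file config.yaml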

Sending a request:

import requests

url = "http://localhost:5543/v2/models/sentiment_analysis/infer"  # the Server's port defaults to 5543
obj = {"sequences": "Snorlax loves my Tesla!"}

response = requests.post(url, json=obj)
print(response.text)
# {"labels":["positive"],"scores":[0.9965094327926636]}

Additional Resources

Product Usage Analytics

DeepSparse gathers basic usage telemetry, including, but not limited to, Invocations, Package, Version, and IP Address, for Product Usage Analytics purposes. Review Neural Magic's Products Privacy Policy for further details on how we process this data.

To disable Product Usage Analytics, run:

export NM_DISABLE_ANALYTICS=True

To confirm that telemetry is disabled, look for the phrase "Skipping Neural Magic's latest package version check." in the info logs streamed during engine invocation.

Community

Get In Touch

For more general questions about Neural Magic, complete this form.

License

Cite

Find this project useful in your research or other communications? Please consider citing:

@misc{kurtic2023sparse,
      title={Sparse Fine-Tuning for Inference Acceleration of Large Language Models}, 
      author={Eldar Kurtic and Denis Kuznedelev and Elias Frantar and Michael Goin and Dan Alistarh},
      year={2023},
      url={https://arxiv.org/abs/2310.06927},
      eprint={2310.06927},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

@misc{kurtic2022optimal,
      title={The Optimal BERT Surgeon: Scalable and Accurate Second-Order Pruning for Large Language Models}, 
      author={Eldar Kurtic and Daniel Campos and Tuan Nguyen and Elias Frantar and Mark Kurtz and Benjamin Fineran and Michael Goin and Dan Alistarh},
      year={2022},
      url={https://arxiv.org/abs/2203.07259},
      eprint={2203.07259},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

@InProceedings{pmlr-v119-kurtz20a,
      title = {Inducing and Exploiting Activation Sparsity for Fast Inference on Deep Neural Networks},
      author = {Kurtz, Mark and Kopinsky, Justin and Gelashvili, Rati and Matveev, Alexander and Carr, John and Goin, Michael and Leiserson, William and Moore, Sage and Nell, Bill and Shavit, Nir and Alistarh, Dan},
      booktitle = {Proceedings of the 37th International Conference on Machine Learning},
      pages = {5533--5543},
      year = {2020},
      editor = {Hal Daumé III and Aarti Singh},
      volume = {119},
      series = {Proceedings of Machine Learning Research},
      address = {Virtual},
      month = {13--18 Jul},
      publisher = {PMLR},
      pdf = {http://proceedings.mlr.press/v119/kurtz20a/kurtz20a.pdf},
      url = {http://proceedings.mlr.press/v119/kurtz20a.html}
}

@article{DBLP:journals/corr/abs-2111-13445,
  author    = {Eugenia Iofinova and Alexandra Peste and Mark Kurtz and Dan Alistarh},
  title     = {How Well Do Sparse Imagenet Models Transfer?},
  journal   = {CoRR},
  volume    = {abs/2111.13445},
  year      = {2021},
  url       = {https://arxiv.org/abs/2111.13445},
  eprinttype = {arXiv},
  eprint    = {2111.13445},
  timestamp = {Wed, 01 Dec 2021 15:16:43 +0100},
  biburl    = {https://dblp.org/rec/journals/corr/abs-2111-13445.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

All Thanks To Our Contributors

Download files

deepsparse_ent-1.8.0 is distributed as a source tarball (deepsparse_ent-1.8.0.tar.gz, 49.3 MB) and as manylinux (glibc 2.17+) wheels for CPython 3.8-3.11 on x86-64 (49.8 MB each) and ARM64 (41.6 MB each). Per-file metadata and SHA256, MD5, and BLAKE2b hashes are listed on each file's PyPI details page.
