An inference runtime offering GPU-class performance on CPUs and APIs to integrate ML into your application

Project description

DeepSparse

Sparsity-aware deep learning inference runtime for CPUs

DeepSparse is a CPU inference runtime that takes advantage of sparsity to accelerate neural network inference. Coupled with SparseML, our optimization library for pruning and quantizing your models, DeepSparse delivers exceptional inference performance on CPU hardware.

✨NEW✨ DeepSparse LLMs

Neural Magic is excited to announce initial support for performant LLM inference in DeepSparse with:

  • sparse kernels for speedups and memory savings from unstructured sparse weights.
  • 8-bit weight and activation quantization support.
  • efficient usage of cached attention keys and values for minimal memory movement.

[Figure: MPT chat inference comparison]

Try It Now

Install (requires Linux):

pip install -U deepsparse-nightly[llm]

Run inference:

from deepsparse import TextGeneration
pipeline = TextGeneration(model="zoo:mpt-7b-dolly_mpt_pretrain-pruned50_quantized")

prompt="""
Below is an instruction that describes a task. Write a response that appropriately completes the request. ### Instruction: what is sparsity? ### Response:
"""
print(pipeline(prompt, max_new_tokens=75).generations[0].text)

# Sparsity is the property of a matrix or other data structure in which a large number of elements are zero and a smaller number of elements are non-zero. In the context of machine learning, sparsity can be used to improve the efficiency of training and prediction.

Check out the TextGeneration documentation for usage details.
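For repeated prompting, it can help to factor the instruction template out of each call. Below is a minimal sketch that reuses only the call signature shown above; the ask helper and TEMPLATE constant are our own illustration, not part of the API:

from deepsparse import TextGeneration

pipeline = TextGeneration(model="zoo:mpt-7b-dolly_mpt_pretrain-pruned50_quantized")

# instruction template from the example above
TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n"
    "### Instruction: {instruction}\n"
    "### Response:"
)

def ask(instruction: str, max_new_tokens: int = 75) -> str:
    # max_new_tokens caps the length of the generated continuation
    result = pipeline(TEMPLATE.format(instruction=instruction), max_new_tokens=max_new_tokens)
    return result.generations[0].text

print(ask("what is quantization?"))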

Sparsity 🤝 Performance

Developed in collaboration with IST Austria, our recent paper details a new technique called Sparse Fine-Tuning, which allows us to prune MPT-7B to 60% sparsity during fine-tuning without a drop in accuracy. With our new support for LLMs, DeepSparse accelerates the resulting sparse-quantized model 7x over the dense baseline.

Learn more about our Sparse Fine-Tuning research.

Check out the model running live on Hugging Face.

LLM Roadmap

Following this initial launch, we are rapidly expanding our support for LLMs, including:

  1. Productizing Sparse Fine-Tuning: Enable external users to apply sparse fine-tuning to their datasets via SparseML.
  2. Expanding model support: Apply our sparse fine-tuning results to Llama 2 and Mistral models.
  3. Pushing for higher sparsity: Improve our pruning algorithms to reach even higher sparsity.

Computer Vision and NLP Models

In addition to LLMs, DeepSparse supports many variants of CNN and Transformer models, including BERT, ViT, ResNet, EfficientNet, and YOLOv5/8. Take a look at the Computer Vision and Natural Language Processing domains of SparseZoo, our home for optimized models, and see the pipeline sketch below.
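The same Pipeline API shown below for sentiment analysis works for vision tasks. Here is a sketch assuming an image-classification task and a pruned-quantized ResNet-50 stub; the exact stub shown is illustrative, so browse SparseZoo's Computer Vision domain for the current stubs:

from deepsparse import Pipeline

# image classification pipeline; the zoo stub below is illustrative --
# check SparseZoo for the current pruned-quantized ResNet-50 stub
cv_pipeline = Pipeline.create(
  task="image_classification",
  model_path="zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/pruned95_quant-none",
)

# input is a list of paths to local image files
prediction = cv_pipeline(images=["my_image.jpg"])
print(prediction)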

Installation

Install via PyPI (optional dependencies detailed here):

pip install deepsparse 

To experiment with the latest features, install the nightly build with pip install deepsparse-nightly, or clone the repository and install from source with pip install -e path/to/deepsparse.

System Requirements

For those using Mac or Windows, we recommend using Linux containers with Docker. Wheels for this release cover CPython 3.8–3.11 on Linux (x86_64 and ARM64) and macOS 13+ (ARM64); see the files listed below.
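A quick way to get such a Linux environment is a stock Python container; a minimal sketch (the image tag and commands are illustrative):

docker run -it --rm python:3.10-slim bash

# then, inside the container:
pip install deepsparse
python -c "from deepsparse import Pipeline; print('ok')"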

Deployment APIs

DeepSparse includes three deployment APIs:

  • Engine is the lowest-level API. With Engine, you compile an ONNX model, pass tensors as input, and receive the raw outputs.
  • Pipeline wraps the Engine with pre- and post-processing. With Pipeline, you pass raw data and receive the prediction.
  • Server wraps Pipelines with a REST API using FastAPI. With Server, you send raw data over HTTP and receive the prediction.

Engine

The example below downloads a 90% pruned-quantized BERT model for sentiment analysis in ONNX format from SparseZoo, compiles the model, and runs inference on randomly generated input. Users can provide their own ONNX models, whether dense or sparse.

from deepsparse import Engine

# download onnx, compile
zoo_stub = "zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none"
compiled_model = Engine(model=zoo_stub, batch_size=1)

# run inference (input is raw numpy tensors, output is raw scores)
inputs = compiled_model.generate_random_inputs()
output = compiled_model(inputs)
print(output)

# > [array([[-0.3380675 ,  0.09602544]], dtype=float32)] << raw scores
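Engine also accepts a path to a local ONNX file in place of a zoo stub, with inputs supplied as a list of numpy arrays. A sketch, where the model path and input shape are placeholders for your own model:

import numpy as np
from deepsparse import Engine

# compile a local ONNX file (dense or sparse)
compiled_model = Engine(model="path/to/model.onnx", batch_size=1)

# one numpy array per model input; the shape below is a placeholder
inputs = [np.random.rand(1, 3, 224, 224).astype(np.float32)]
output = compiled_model(inputs)
print(output)

For quick throughput numbers on a stub or ONNX file, the deepsparse.benchmark CLI is also available:

deepsparse.benchmark zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none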

Pipeline

Pipelines wrap Engine with pre- and post-processing, enabling you to pass raw data and receive the post-processed prediction. The example below downloads a 90% pruned-quantized BERT model for sentiment analysis in ONNX format from SparseZoo, sets up a pipeline, and runs inference on sample data.

from deepsparse import Pipeline

# download onnx, set up pipeline
zoo_stub = "zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none"  
sentiment_analysis_pipeline = Pipeline.create(
  task="sentiment-analysis",    # name of the task
  model_path=zoo_stub,          # zoo stub or path to local onnx file
)

# run inference (input is a sentence, output is the prediction)
prediction = sentiment_analysis_pipeline("I love using DeepSparse Pipelines")
print(prediction)
# > labels=['positive'] scores=[0.9954759478569031]
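Pipelines also accept a batch of inputs; a sketch, assuming list inputs behave like the single-sentence call above:

# batched inference: pass a list of sentences
sentences = [
    "I love using DeepSparse Pipelines",
    "The weather today is miserable",
]
predictions = sentiment_analysis_pipeline(sentences)
print(predictions)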

Server

Server wraps Pipelines with REST APIs, enabling you to set up a model-serving endpoint running DeepSparse. This enables you to send raw data to DeepSparse over HTTP and receive the post-processed predictions. DeepSparse Server is launched from the command line and configured via arguments or a server configuration file. The following downloads a 90% pruned-quantized BERT model for sentiment analysis in ONNX format from SparseZoo and launches a sentiment analysis endpoint:

deepsparse.server \
  --task sentiment-analysis \
  --model_path zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none
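The same endpoint can instead be described in a server configuration file. Below is a hypothetical sketch of such a file; the field names are assumptions, so consult the DeepSparse Server documentation for the exact schema and the flag used to pass the file to deepsparse.server:

# config.yaml (schema is an assumption -- see the Server documentation)
endpoints:
  - task: sentiment-analysis
    model: zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none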

Sending a request:

import requests

url = "http://localhost:5543/v2/models/sentiment_analysis/infer" # Server's port default to 5543
obj = {"sequences": "Snorlax loves my Tesla!"}

response = requests.post(url, json=obj)
print(response.text)
# {"labels":["positive"],"scores":[0.9965094327926636]}

Additional Resources

Product Usage Analytics

DeepSparse gathers basic usage telemetry, including, but not limited to, Invocations, Package, Version, and IP Address, for Product Usage Analytics purposes. Review Neural Magic's Products Privacy Policy for further details on how we process this data.

To disable Product Usage Analytics, run:

export NM_DISABLE_ANALYTICS=True

To confirm that telemetry is disabled, check the info logs streamed during engine invocation for the phrase "Skipping Neural Magic's latest package version check."

Community

Get In Touch

For more general questions about Neural Magic, complete this form.

License

Cite

Find this project useful in your research or other communications? Please consider citing:

@misc{kurtic2023sparse,
      title={Sparse Fine-Tuning for Inference Acceleration of Large Language Models}, 
      author={Eldar Kurtic and Denis Kuznedelev and Elias Frantar and Michael Goin and Dan Alistarh},
      year={2023},
      url={https://arxiv.org/abs/2310.06927},
      eprint={2310.06927},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

@misc{kurtic2022optimal,
      title={The Optimal BERT Surgeon: Scalable and Accurate Second-Order Pruning for Large Language Models}, 
      author={Eldar Kurtic and Daniel Campos and Tuan Nguyen and Elias Frantar and Mark Kurtz and Benjamin Fineran and Michael Goin and Dan Alistarh},
      year={2022},
      url={https://arxiv.org/abs/2203.07259},
      eprint={2203.07259},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

@InProceedings{pmlr-v119-kurtz20a,
      title={Inducing and Exploiting Activation Sparsity for Fast Inference on Deep Neural Networks},
      author={Kurtz, Mark and Kopinsky, Justin and Gelashvili, Rati and Matveev, Alexander and Carr, John and Goin, Michael and Leiserson, William and Moore, Sage and Nell, Bill and Shavit, Nir and Alistarh, Dan},
      booktitle={Proceedings of the 37th International Conference on Machine Learning},
      pages={5533--5543},
      year={2020},
      editor={Hal Daumé III and Aarti Singh},
      volume={119},
      series={Proceedings of Machine Learning Research},
      address={Virtual},
      month={13--18 Jul},
      publisher={PMLR},
      pdf={http://proceedings.mlr.press/v119/kurtz20a/kurtz20a.pdf},
      url={http://proceedings.mlr.press/v119/kurtz20a.html}
}

@article{DBLP:journals/corr/abs-2111-13445,
  author    = {Eugenia Iofinova and Alexandra Peste and Mark Kurtz and Dan Alistarh},
  title     = {How Well Do Sparse Imagenet Models Transfer?},
  journal   = {CoRR},
  volume    = {abs/2111.13445},
  year      = {2021},
  url       = {https://arxiv.org/abs/2111.13445},
  eprinttype = {arXiv},
  eprint    = {2111.13445},
  timestamp = {Wed, 01 Dec 2021 15:16:43 +0100},
  biburl    = {https://dblp.org/rec/journals/corr/abs-2111-13445.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

All Thanks To Our Contributors


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

| File | Size |
| --- | --- |
| deepsparse-1.7.0.tar.gz | 46.6 MB |

Built Distributions

| File | Python | Platform | Size |
| --- | --- | --- | --- |
| deepsparse-1.7.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl | CPython 3.11 | manylinux: glibc 2.17+ x86-64 | 47.1 MB |
| deepsparse-1.7.0-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl | CPython 3.11 | manylinux: glibc 2.17+ ARM64 | 40.5 MB |
| deepsparse-1.7.0-cp311-cp311-macosx_13_0_arm64.whl | CPython 3.11 | macOS 13.0+ ARM64 | 33.4 MB |
| deepsparse-1.7.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl | CPython 3.10 | manylinux: glibc 2.17+ x86-64 | 47.1 MB |
| deepsparse-1.7.0-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl | CPython 3.10 | manylinux: glibc 2.17+ ARM64 | 40.5 MB |
| deepsparse-1.7.0-cp310-cp310-macosx_13_0_arm64.whl | CPython 3.10 | macOS 13.0+ ARM64 | 33.4 MB |
| deepsparse-1.7.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl | CPython 3.9 | manylinux: glibc 2.17+ x86-64 | 47.1 MB |
| deepsparse-1.7.0-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl | CPython 3.9 | manylinux: glibc 2.17+ ARM64 | 40.5 MB |
| deepsparse-1.7.0-cp39-cp39-macosx_13_0_arm64.whl | CPython 3.9 | macOS 13.0+ ARM64 | 33.4 MB |
| deepsparse-1.7.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl | CPython 3.8 | manylinux: glibc 2.17+ x86-64 | 47.1 MB |
| deepsparse-1.7.0-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl | CPython 3.8 | manylinux: glibc 2.17+ ARM64 | 40.5 MB |
| deepsparse-1.7.0-cp38-cp38-macosx_13_0_arm64.whl | CPython 3.8 | macOS 13.0+ ARM64 | 33.4 MB |

