Intel® NPU Acceleration Library
Project description
The Intel® NPU Acceleration Library is a Python library designed to boost the efficiency of your applications by leveraging the power of the Intel Neural Processing Unit (NPU) to perform high-speed computations on compatible hardware.
Note: The Intel® NPU Acceleration Library is currently in active development, with our team working to introduce a variety of features that are anticipated to dramatically enhance performance.
Intel NPU
The Intel NPU is an AI accelerator integrated into Intel Core Ultra processors, characterized by a unique architecture comprising compute acceleration and data transfer capabilities. Its compute acceleration is facilitated by Neural Compute Engines, which consist of hardware acceleration blocks for AI operations like Matrix Multiplication and Convolution, alongside Streaming Hybrid Architecture Vector Engines for general computing tasks.
To optimize performance, the NPU features DMA engines for efficient data transfers between system memory and a managed cache, supported by device MMU and IOMMU for security isolation. The NPU's software utilizes compiler technology to optimize AI workloads by directing compute and data flow in a tiled fashion, maximizing compute utilization primarily from scratchpad SRAM while minimizing data transfers between SRAM and DRAM for optimal performance and power efficiency.
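To give a flavor of the tiling idea described above (this is an illustrative sketch, not the NPU compiler's actual implementation), a matrix multiplication can be staged block by block so that only small tiles move between large "DRAM" buffers and a small "SRAM" scratchpad:

```python
import numpy as np

def tiled_matmul(A, B, tile=64):
    # Multiply A (m, k) by B (k, n) one sub-block at a time, mimicking how
    # a compiler might stage tiles through a small scratchpad memory.
    m, k = A.shape
    k2, n = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((m, n), dtype=np.float32)
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            for p in range(0, k, tile):
                # Each (tile x tile) block stands in for data resident in
                # scratchpad SRAM; only these blocks are "transferred".
                C[i:i+tile, j:j+tile] += A[i:i+tile, p:p+tile] @ B[p:p+tile, j:j+tile]
    return C

A = np.random.rand(128, 96).astype(np.float32)
B = np.random.rand(96, 80).astype(np.float32)
assert np.allclose(tiled_matmul(A, B), A @ B, atol=1e-4)
```

The result is identical to a plain matrix multiply; the tiled loop order only changes which data is resident at any given moment, which is what lets the hardware keep compute fed from SRAM while minimizing DRAM traffic.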
Feature roadmap
In our quest to significantly improve the library's performance, we are directing our efforts toward implementing a range of key features, including:
- 8-bit quantization
- 4-bit quantization and GPTQ
- NPU-Native mixed precision inference
- Float16 support
- BFloat16 (Brain Floating Point Format)
- torch.compile support
- LLM MLP horizontal fusion implementation
- Static shape inference
- MHA NPU inference
- NPU/GPU hetero compute
- Paper
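To illustrate what the quantization items on the roadmap involve (the library's actual scheme may differ; this is a generic symmetric per-tensor sketch, not the library's implementation), 8-bit quantization maps float weights onto int8 values via a single scale factor:

```python
import numpy as np

def quantize_int8(w):
    # Symmetric per-tensor int8 quantization: the scale maps the largest
    # absolute weight onto 127.
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Recover an approximation of the original float weights.
    return q.astype(np.float32) * scale

w = np.random.uniform(-1, 1, (16, 16)).astype(np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
# With round-to-nearest, the reconstruction error is at most half a step.
assert np.abs(w - w_hat).max() <= s / 2 + 1e-6
```

Storing weights as int8 cuts their memory footprint by 4x versus float32, which matters on an accelerator whose performance is dominated by data movement.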
Make sure to stay updated with the project's progress as these enhancements are on the horizon. External contributions are very welcome! If you want to participate in this library's development, please check the Contributing guide, the developer guide, and the list of open issues.
Setup
Check that your system has an available NPU (how-to).
You can install the package on your machine with
pip install intel-npu-acceleration-library
You can also install the package on Windows and Linux from source by typing
pip install "intel-npu-acceleration-library @ git+https://github.com/intel/intel-npu-acceleration-library.git"
To build the package you need a compiler on your system (Visual Studio 2019 is suggested for the Windows build). macOS is not yet supported. At the moment only Ubuntu is supported for the Linux build. If you need the library for a different OS, please open an issue.
The library is intended to be used with Intel Core Ultra processors, which have an integrated NPU
(Neural Processing Unit). For best performance please install/update the NPU drivers to the latest version. (Windows, Linux).
Usage
For implemented examples, please check the examples folder.
Run a single MatMul in the NPU
from intel_npu_acceleration_library.backend import MatMul
import numpy as np
inC, outC, batch = ... # Define your own values
# Create both inputs
X1 = np.random.uniform(-1, 1, (batch, inC)).astype(np.float16)
X2 = np.random.uniform(-1, 1, (outC, inC)).astype(np.float16)
mm = MatMul(inC, outC, batch, profile=False)
result = mm.run(X1, X2)
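Judging from the shapes above (X1 is (batch, inC) and X2 is (outC, inC)), the operation corresponds to X1 @ X2.T. Assuming that interpretation, here is a CPU-only numpy reference with illustrative sizes, useful for checking results without NPU hardware:

```python
import numpy as np

# Illustrative sizes (replace with your own).
batch, inC, outC = 16, 32, 8

X1 = np.random.uniform(-1, 1, (batch, inC)).astype(np.float16)
X2 = np.random.uniform(-1, 1, (outC, inC)).astype(np.float16)

# X2 is laid out (outC, inC), so the product is X1 @ X2.T,
# giving a (batch, outC) result.
reference = X1 @ X2.T
assert reference.shape == (batch, outC)
```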
Compile a model for the NPU
If you have PyTorch >= 2.0.0 installed, you can use torch.compile to optimize your model for the NPU:
import intel_npu_acceleration_library
import torch
# Compile model for the NPU
# model is a torch.nn.Module instance; it can also be quantized JIT
optimized_model = torch.compile(model, backend="npu")
# Use the model as usual
On Windows, torch.compile is not supported yet, so you might want to use the explicit function intel_npu_acceleration_library.compile. This also applies if you use a PyTorch version < 2.0.0:
import intel_npu_acceleration_library
optimized_model = intel_npu_acceleration_library.compile(model, dtype=torch.int8)
# Use the model as usual
Run a TinyLlama model on the NPU
from transformers import AutoTokenizer, TextStreamer
from intel_npu_acceleration_library import NPUModelForCausalLM
import torch
model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
model = NPUModelForCausalLM.from_pretrained(model_id, use_cache=True, dtype=torch.int8).eval()
tokenizer = AutoTokenizer.from_pretrained(model_id, use_default_system_prompt=True)
tokenizer.pad_token_id = tokenizer.eos_token_id
streamer = TextStreamer(tokenizer, skip_special_tokens=True)
query = input("Ask something: ")
prefix = tokenizer(query, return_tensors="pt")["input_ids"]
generation_kwargs = dict(
input_ids=prefix,
streamer=streamer,
do_sample=True,
top_k=50,
top_p=0.9,
max_new_tokens=512,
)
print("Run inference")
_ = model.generate(**generation_kwargs)
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distributions
File details
Details for the file intel_npu_acceleration_library-1.3.0.tar.gz.
File metadata
- Download URL: intel_npu_acceleration_library-1.3.0.tar.gz
- Upload date:
- Size: 65.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.0.0 CPython/3.11.2
File hashes
Algorithm | Hash digest
---|---
SHA256 | a6bbe15385aa4a63814d1f8489efa1f3296869d6c6d9246cc995f1421e393e80
MD5 | 08234e242682bf62fec238d872825eaf
BLAKE2b-256 | d1a98ed16d003fdec94177f334d525b5204eb9818d329f29bf5d5a2af4a582b2
File details
Details for the file intel_npu_acceleration_library-1.3.0-cp312-cp312-win_amd64.whl.
File metadata
- Download URL: intel_npu_acceleration_library-1.3.0-cp312-cp312-win_amd64.whl
- Upload date:
- Size: 39.7 MB
- Tags: CPython 3.12, Windows x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.0.0 CPython/3.11.2
File hashes
Algorithm | Hash digest
---|---
SHA256 | c745fe51174d6ddd21212f3a09478b76a473740a8b7680f461fe1b75f3a2a998
MD5 | 18a45cf0f44ab9a0ffdc1a2d79824e5f
BLAKE2b-256 | e84800c7038023453fdeef8ed9f3c2bd958f5fe742c655ce9ebaebf97d2ab26a
File details
Details for the file intel_npu_acceleration_library-1.3.0-cp312-cp312-win32.whl.
File metadata
- Download URL: intel_npu_acceleration_library-1.3.0-cp312-cp312-win32.whl
- Upload date:
- Size: 39.7 MB
- Tags: CPython 3.12, Windows x86
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.0.0 CPython/3.11.2
File hashes
Algorithm | Hash digest
---|---
SHA256 | 239df3615cf5a0e8c10840dc614d08d9815a1e61a5d9ec43dc3c76955fffb128
MD5 | 0690350728ec804069d17a998536e3db
BLAKE2b-256 | 17c21aa69fe0f8b059e019bdea32cbf7ec7f8c73abb9a5e3ec78090907edbbe5
File details
Details for the file intel_npu_acceleration_library-1.3.0-cp311-cp311-win_amd64.whl.
File metadata
- Download URL: intel_npu_acceleration_library-1.3.0-cp311-cp311-win_amd64.whl
- Upload date:
- Size: 39.7 MB
- Tags: CPython 3.11, Windows x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.0.0 CPython/3.11.2
File hashes
Algorithm | Hash digest
---|---
SHA256 | 59dabeebcd76399479d7320d3006e2172641fba8ea3ea1f06cf2fe78ed7de868
MD5 | 86c71e02bec93b53d05cb46c75458126
BLAKE2b-256 | 22ff5e3e6fe67f95b8142236dc58aefc7aa90ce5bfbff3e3cc72d231c143d2d8
File details
Details for the file intel_npu_acceleration_library-1.3.0-cp311-cp311-win32.whl.
File metadata
- Download URL: intel_npu_acceleration_library-1.3.0-cp311-cp311-win32.whl
- Upload date:
- Size: 39.7 MB
- Tags: CPython 3.11, Windows x86
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.0.0 CPython/3.11.2
File hashes
Algorithm | Hash digest
---|---
SHA256 | ba9a269217875e04a4185fefe43650f12af14265aef299eb38d5a770e283bfe8
MD5 | d9383c25b5642d3d3fd7a9208b35e3c3
BLAKE2b-256 | 8afa5335ae0c755d6a51d4917838d98fd6a872dc4425f837edef8f929928cae0
File details
Details for the file intel_npu_acceleration_library-1.3.0-cp310-cp310-win_amd64.whl.
File metadata
- Download URL: intel_npu_acceleration_library-1.3.0-cp310-cp310-win_amd64.whl
- Upload date:
- Size: 39.7 MB
- Tags: CPython 3.10, Windows x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.0.0 CPython/3.11.2
File hashes
Algorithm | Hash digest
---|---
SHA256 | 12f80ce3ec6e3bd60cb50b25d5ce0dae1438ef9432f772703196368cb4b04a3a
MD5 | 12ca6dc92615cbe3e27276939607cbf2
BLAKE2b-256 | d66e66b2750c47ade55dfa2774f1655aa822f8337a5500d39b3cf7c110c2d206
File details
Details for the file intel_npu_acceleration_library-1.3.0-cp310-cp310-win32.whl.
File metadata
- Download URL: intel_npu_acceleration_library-1.3.0-cp310-cp310-win32.whl
- Upload date:
- Size: 39.7 MB
- Tags: CPython 3.10, Windows x86
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.0.0 CPython/3.11.2
File hashes
Algorithm | Hash digest
---|---
SHA256 | 02ee17980cce4c9be4e8fa5efa09a7332fb0f858caa9ba4cafb0208f4cf52d2c
MD5 | b2e8931ec0695d16885522b5e255ff9b
BLAKE2b-256 | 0d1f83abf7086f5055d8dc6177c98e8f671600a3866f47fdd5baaee3ac801029
File details
Details for the file intel_npu_acceleration_library-1.3.0-cp39-cp39-win_amd64.whl.
File metadata
- Download URL: intel_npu_acceleration_library-1.3.0-cp39-cp39-win_amd64.whl
- Upload date:
- Size: 39.7 MB
- Tags: CPython 3.9, Windows x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.0.0 CPython/3.11.2
File hashes
Algorithm | Hash digest
---|---
SHA256 | d9ef23d06b9f984bc2eb4048267033b3878f45625f99fc51d4aa16ed8436af34
MD5 | 77da8fa7cf07469d3454a54f0451e153
BLAKE2b-256 | b5f4f0e34a0446ea233e8dde47682e71f2336561a165375c0f3d74a0aa0c804c
File details
Details for the file intel_npu_acceleration_library-1.3.0-cp39-cp39-win32.whl.
File metadata
- Download URL: intel_npu_acceleration_library-1.3.0-cp39-cp39-win32.whl
- Upload date:
- Size: 39.7 MB
- Tags: CPython 3.9, Windows x86
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.0.0 CPython/3.11.2
File hashes
Algorithm | Hash digest
---|---
SHA256 | e3df4fe77cf1041cf6a725500a30b1611361577556caf072deb21382fa37410e
MD5 | 8b883cbc73ea06e4adf59c6b4e7732c6
BLAKE2b-256 | 7c4b94ebb693781f5057094313de659730cf2b4c22d5e220783af80148f3f267
File details
Details for the file intel_npu_acceleration_library-1.3.0-cp38-cp38-win_amd64.whl.
File metadata
- Download URL: intel_npu_acceleration_library-1.3.0-cp38-cp38-win_amd64.whl
- Upload date:
- Size: 39.7 MB
- Tags: CPython 3.8, Windows x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.0.0 CPython/3.11.2
File hashes
Algorithm | Hash digest
---|---
SHA256 | ea36ca858194ce0def7086d1bdb7538c23d52987cc79d8996297d8a876bcef08
MD5 | a413d55c9ab22c2f7af5889cf6803c72
BLAKE2b-256 | 83d1b234623ad146f3486f13fd78e5172c28ca53a9bb74d23ca22e18d394c213
File details
Details for the file intel_npu_acceleration_library-1.3.0-cp38-cp38-win32.whl.
File metadata
- Download URL: intel_npu_acceleration_library-1.3.0-cp38-cp38-win32.whl
- Upload date:
- Size: 39.7 MB
- Tags: CPython 3.8, Windows x86
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.0.0 CPython/3.11.2
File hashes
Algorithm | Hash digest
---|---
SHA256 | c96d48cb3b1f6ee07d21b7d2bac75646f84c8e35ac938cec03eb9630286ae61a
MD5 | a34fb567036379e0d518582095e6b76d
BLAKE2b-256 | 58cef64c9ce3e33466a3538a0227b8d6033180297277e9a75306599ddb49ec69