
Repository of Intel® Extension for Transformers

Project description

Neural Speed

Neural Speed is an innovative library designed to support efficient inference of large language models (LLMs) on Intel platforms through state-of-the-art (SOTA) low-bit quantization powered by Intel Neural Compressor. The work is inspired by llama.cpp and further optimized for Intel platforms, with our innovations published at NeurIPS 2023.

Key Features

  • Highly optimized low-precision kernels on CPUs with ISAs (AMX, VNNI, AVX512F, AVX_VNNI, and AVX2). See details
  • Up to 40x performance speedup on popular LLMs compared with llama.cpp. See details
  • Tensor parallelism across sockets/nodes on CPUs. See details

Neural Speed is under active development, so APIs are subject to change.

Supported Hardware

Hardware                           Supported
Intel Xeon Scalable Processors     ✔
Intel Xeon CPU Max Series          ✔
Intel Core Processors              ✔

Supported Models

Neural Speed supports almost all LLMs in PyTorch format from Hugging Face, such as Llama2, ChatGLM2, Baichuan2, Qwen, Mistral, and Whisper. File an issue if your favorite LLM does not work.

It also supports typical LLMs in GGUF format, such as Llama2, Falcon, MPT, and Bloom, with more coming. Check out the details.

Installation

Install from Binary

pip install -r requirements.txt
pip install neural-speed

Build from Source

pip install -r requirements.txt
pip install .

Note: GCC version 10 or newer is required.
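
Whichever install path you used, a quick import check verifies that the package is usable (assuming the module is importable as neural_speed, as the wheel names suggest):

python -c "import neural_speed"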

Quick Start (Transformer-like usage)

Install Intel Extension for Transformers to use Transformer-like APIs.
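
Intel Extension for Transformers is published on PyPI, so a plain pip install should suffice:

pip install intel-extension-for-transformers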

PyTorch Model from Hugging Face

from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "Intel/neural-chat-7b-v3-1"     # Hugging Face model_id or local model
prompt = "Once upon a time, there existed a little girl,"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)  # stream generated tokens to stdout as they are produced

# load_in_4bit=True applies INT4 weight-only quantization while loading
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)
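
If you also want the generated text as a string rather than only streamed to stdout, decoding the returned ids should work, assuming generate returns token ids as in Hugging Face Transformers:

print(tokenizer.decode(outputs[0], skip_special_tokens=True))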

GGUF Model from Hugging Face

from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

# Specify the GGUF repo on Hugging Face
model_name = "TheBloke/Llama-2-7B-Chat-GGUF"
# Download the specific GGUF model file from the above repo
model_file = "llama-2-7b-chat.Q4_0.gguf"
# Make sure you have been granted access to this model on Hugging Face
tokenizer_name = "meta-llama/Llama-2-7b-chat-hf"

prompt = "Once upon a time"
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)
model = AutoModelForCausalLM.from_pretrained(model_name, model_file=model_file)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)
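
If you prefer to download the GGUF file explicitly (for example, to pin a local path), the huggingface_hub client can fetch it directly; a minimal sketch, assuming huggingface_hub is installed:

from huggingface_hub import hf_hub_download

# Download the quantized GGUF file and return its local path
local_path = hf_hub_download(repo_id="TheBloke/Llama-2-7B-Chat-GGUF",
                             filename="llama-2-7b-chat.Q4_0.gguf")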

PyTorch Model from ModelScope

from transformers import TextStreamer
from modelscope import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM
model_name = "qwen/Qwen-7B"     # Modelscope model_id or local model
prompt = "Once upon a time, there existed a little girl,"

model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True, model_hub="modelscope")
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)

Quick Start (llama.cpp-like usage)

Single (One-click) Step

python scripts/run.py model-path --weight_dtype int4 -p "She opened the door and see"

Multiple Steps

Convert and Quantize

# Skip this step if the GGUF model is from Hugging Face or generated by llama.cpp
python scripts/convert.py --outtype f32 --outfile ne-f32.bin EleutherAI/gpt-j-6b
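
The inference step below consumes a quantized file (ne-q4_j.bin), so quantize the converted FP32 output first. The repository provides a quantize script for this; a sketch along these lines, with the exact flags to be confirmed against the project's scripts:

# quantize the FP32 file to INT4; produces the ne-q4_j.bin used below
python scripts/quantize.py --model_name gptj --model_file ne-f32.bin --out_file ne-q4_j.bin --weight_dtype int4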

Inference

# Linux and WSL
OMP_NUM_THREADS=<physical_cores> numactl -m 0 -C 0-<physical_cores-1> python scripts/inference.py --model_name llama -m ne-q4_j.bin -c 512 -b 1024 -n 256 -t <physical_cores> --color -p "She opened the door and see"
# Windows
python scripts/inference.py --model_name llama -m ne-q4_j.bin -c 512 -b 1024 -n 256 -t <physical_cores|P-cores> --color -p "She opened the door and see"

Please refer to Advanced Usage for more details.

Advanced Topics

New model enabling

To add support for your own models, please follow the graph developer document.

Performance profiling

Set the NEURAL_SPEED_VERBOSE environment variable to enable performance profiling.

Available modes:

  • 0: Print full information: evaluation time and operator profiling. Requires NS_PROFILING set to ON and a recompile.
  • 1: Print evaluation time, i.e., the time taken for each evaluation.
  • 2: Profile individual operators to identify performance bottlenecks within the model. Requires NS_PROFILING set to ON and a recompile.
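
For example, per-evaluation timing (mode 1, which needs no recompilation) can be enabled inline on Linux:

NEURAL_SPEED_VERBOSE=1 python scripts/inference.py --model_name llama -m ne-q4_j.bin -n 256 -p "She opened the door and see"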
