
Repository of Intel® Extension for Transformers

Project description

Neural Speed

Neural Speed is an innovative library designed to support efficient inference of large language models (LLMs) on Intel platforms through state-of-the-art (SOTA) low-bit quantization powered by Intel Neural Compressor. The work is inspired by llama.cpp and further optimized for Intel platforms with our innovations published at NeurIPS 2023.

Key Features

  • Highly optimized low-precision kernels on CPUs with advanced ISAs (AMX, VNNI, AVX512F, AVX_VNNI, and AVX2). See details
  • Up to 40x performance speedup on popular LLMs compared with llama.cpp. See details
  • Tensor parallelism across sockets/nodes on CPUs. See details

Neural Speed is under active development, so APIs are subject to change.

Supported Hardware

  • Intel Xeon Scalable Processors
  • Intel Xeon CPU Max Series
  • Intel Core Processors

Supported Models

Neural Speed supports almost all LLMs in PyTorch format from Hugging Face, such as Llama2, ChatGLM2, Baichuan2, Qwen, Mistral, and Whisper. File an issue if your favorite LLM does not work.

It also supports typical LLMs in GGUF format, such as Llama2, Falcon, MPT, and Bloom, with more on the way. Check out the details.

Installation

Install from binary

pip install neural-speed

Build from Source

pip install -r requirements.txt
pip install .

Note: GCC version 10 or newer is required.
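
The from-source commands above assume you are working inside a checkout of the repository. A complete flow might look like this (repository URL assumed to be the project's GitHub home):

git clone https://github.com/intel/neural-speed.git   # assumed repository URL
cd neural-speed
pip install -r requirements.txt
pip install .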

Quick Start (Transformer-like usage)

Install Intel Extension for Transformers to use Transformer-like APIs.
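
Intel Extension for Transformers is distributed on PyPI, so the install is typically:

pip install intel-extension-for-transformers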

PyTorch Model from Hugging Face

from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM
model_name = "Intel/neural-chat-7b-v3-1"     # Hugging Face model_id or local model
prompt = "Once upon a time, there existed a little girl,"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)

model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)
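
TextStreamer already prints tokens as they are generated; if you also want the final text as a string, standard transformers decoding should work on the returned ids (assuming generate returns token ids, as in the standard transformers API):

text = tokenizer.decode(outputs[0], skip_special_tokens=True)  # decode ids to plain text
print(text)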

GGUF Model from Hugging Face

from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

# Specify the GGUF repo on the Hugging Face Hub
model_name = "TheBloke/Llama-2-7B-Chat-GGUF"
# Download the specific GGUF model file from the above repo
model_file = "llama-2-7b-chat.Q4_0.gguf"
# Make sure you have been granted access to this model on Hugging Face.
tokenizer_name = "meta-llama/Llama-2-7b-chat-hf"

prompt = "Once upon a time"
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)
model = AutoModelForCausalLM.from_pretrained(model_name, model_file=model_file)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)
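
If you prefer to manage the GGUF file yourself (for example, to pin a cache location), you can fetch it explicitly with huggingface_hub; this is an optional sketch, not part of the Neural Speed API:

from huggingface_hub import hf_hub_download

# Download the single GGUF file from the repo; returns the local cached path.
local_gguf = hf_hub_download(
    repo_id="TheBloke/Llama-2-7B-Chat-GGUF",
    filename="llama-2-7b-chat.Q4_0.gguf",
)
print(local_gguf)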

PyTorch Model from Modelscope

from modelscope import AutoTokenizer
from transformers import TextStreamer
from neural_speed import Model

model_name = "qwen/Qwen1.5-7B-Chat"     # ModelScope model_id or local model
prompt = "Once upon a time, there existed a little girl,"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)
model = Model()
model.init(model_name, weight_dtype="int4", compute_dtype="int8", model_hub="modelscope")
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)
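
The same native API should also work for models hosted on Hugging Face; a minimal sketch, assuming model_hub defaults to the Hugging Face Hub when omitted:

from transformers import AutoTokenizer, TextStreamer
from neural_speed import Model

model_name = "Intel/neural-chat-7b-v3-1"     # Hugging Face model_id or local model
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer("Once upon a time, there existed a little girl,", return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)
model = Model()
# model_hub is omitted; assumed to default to the Hugging Face Hub.
model.init(model_name, weight_dtype="int4", compute_dtype="int8")
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)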

Quick Start (llama.cpp-like usage)

Single (One-click) Step

python scripts/run.py <model_path> --weight_dtype int4 -p "She opened the door and see"

Multiple Steps

Convert and Quantize

# Skip this step if the GGUF model comes from Hugging Face or was generated by llama.cpp
python scripts/convert.py --outtype f32 --outfile ne-f32.bin EleutherAI/gpt-j-6b
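
The inference step below expects an int4 model file (ne-q4_j.bin), which is produced by quantizing the converted f32 file. The exact flags shown here are an assumption modeled on the project's scripts/quantize.py; run it with --help for the authoritative options:

# Quantize the f32 model to int4 (flags assumed; verify with scripts/quantize.py --help)
python scripts/quantize.py --model_name gptj --model_file ne-f32.bin --out_file ne-q4_j.bin --weight_dtype int4 --compute_dtype int8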

Inference

# Linux and WSL
OMP_NUM_THREADS=<physical_cores> numactl -m 0 -C 0-<physical_cores-1> python scripts/inference.py --model_name llama -m ne-q4_j.bin -c 512 -b 1024 -n 256 -t <physical_cores> --color -p "She opened the door and see"
# Windows
python scripts/inference.py --model_name llama -m ne-q4_j.bin -c 512 -b 1024 -n 256 -t <physical_cores|P-cores> --color -p "She opened the door and see"

Please refer to Advanced Usage for more details.

Advanced Topics

New model enabling

You can add support for your own models; please follow the graph developer document.

Performance profiling

Set the NEURAL_SPEED_VERBOSE environment variable to enable performance profiling; an example invocation follows the mode list below.

Available modes:

  • 0: Print full information: evaluation time and operator profiling (requires NS_PROFILING set to ON and a recompile).
  • 1: Print evaluation time only, i.e., the time taken for each evaluation.
  • 2: Profile individual operators to identify performance bottlenecks within the model (requires NS_PROFILING set to ON and a recompile).
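
For example, mode 1 times each evaluation and needs no recompilation; reusing the inference command from above:

NEURAL_SPEED_VERBOSE=1 python scripts/inference.py --model_name llama -m ne-q4_j.bin -c 512 -b 1024 -n 256 --color -p "She opened the door and see"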

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

neural-speed-1.0a0.tar.gz (4.3 MB, Source)

Built Distributions

neural_speed-1.0a0-cp311-cp311-win_amd64.whl (11.6 MB, CPython 3.11, Windows x86-64)
neural_speed-1.0a0-cp311-cp311-manylinux_2_28_x86_64.whl (23.1 MB, CPython 3.11, manylinux: glibc 2.28+ x86-64)
neural_speed-1.0a0-cp310-cp310-win_amd64.whl (11.6 MB, CPython 3.10, Windows x86-64)
neural_speed-1.0a0-cp310-cp310-manylinux_2_28_x86_64.whl (23.1 MB, CPython 3.10, manylinux: glibc 2.28+ x86-64)
neural_speed-1.0a0-cp39-cp39-win_amd64.whl (11.6 MB, CPython 3.9, Windows x86-64)
neural_speed-1.0a0-cp39-cp39-manylinux_2_28_x86_64.whl (23.1 MB, CPython 3.9, manylinux: glibc 2.28+ x86-64)
neural_speed-1.0a0-cp38-cp38-win_amd64.whl (11.6 MB, CPython 3.8, Windows x86-64)
neural_speed-1.0a0-cp38-cp38-manylinux_2_28_x86_64.whl (23.1 MB, CPython 3.8, manylinux: glibc 2.28+ x86-64)

File details

neural-speed-1.0a0.tar.gz

  • Size: 4.3 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.11.8
  • SHA256: 0d6466feea889415ad27ac3c1caebc9b0ccf275cccb62ba39e632b3238f5db3c
  • MD5: 880400571b0d628b5a1caa8139f0de47
  • BLAKE2b-256: 9b7ee2976acfdc9daf3fee769f41e67dc7226fd735ae4062152e1f6c3f60bdee

neural_speed-1.0a0-cp311-cp311-win_amd64.whl

  • SHA256: 40c6ed4b8ae3641f819533018303da9b91854b02d0af3324485370362590a134
  • MD5: f09085b5ea374fa699e9a8b4af2ac03b
  • BLAKE2b-256: 684c3a985b5ba3eac3d42c42c7deba5161af0e65d0a8db15805878e70619df3d

neural_speed-1.0a0-cp311-cp311-manylinux_2_28_x86_64.whl

  • SHA256: 9fa3ebb11e9a06b9f0731e39122560033b491c9cd5514a288eccf019d4349fa4
  • MD5: 84e945f7f128b51cbc0a57d3dfb25459
  • BLAKE2b-256: d36fe7b8e4f0c6b78163dc41c330fa446cbf531bd768bb4ad1b586017e35e8a1

neural_speed-1.0a0-cp310-cp310-win_amd64.whl

  • SHA256: 0480b80e64f4730e7fc60fb7ec0850fde8cc92c1d51ec9570ad3a12959645712
  • MD5: 56cf9fb7ac553bb172c52f37eeb9ee9f
  • BLAKE2b-256: 2785e136de880c4464b82e509fa2aaa1ab952045d808be0ae769e8af0b072b77

neural_speed-1.0a0-cp310-cp310-manylinux_2_28_x86_64.whl

  • SHA256: 1faf7735ef4ba74a4929ccb2bdc5dd7afce811256a13495a5f7b5b50fbb37164
  • MD5: 972fc4e9173a8a9c51869a91b5449020
  • BLAKE2b-256: 76e322c61e6aee9015211514b939acaa9cec261e43f40d5c3853954a1055233a

neural_speed-1.0a0-cp39-cp39-win_amd64.whl

  • SHA256: b490fb752f9c6855c4457923cb5087eb4eade6468d866fdccf707573d3385fc9
  • MD5: 2b9cbde9e1ab9d6d9ef5162ba516bf0c
  • BLAKE2b-256: d2ae25b97a440afc29dcc40df3f7a3be8a6fdb77ebc6e010732a1f4a00213913

neural_speed-1.0a0-cp39-cp39-manylinux_2_28_x86_64.whl

  • SHA256: 1c00ae47aa3f47fecaa09c7b0a0d11dd25f014051d0e63afd966b99d3d2ffca9
  • MD5: 4158b9510cddd5e79a55a2b8ccf257f3
  • BLAKE2b-256: f862b5a5a407989e66e74640d300bece9768e79f37093d3b59bd5ac060272d3d

neural_speed-1.0a0-cp38-cp38-win_amd64.whl

  • SHA256: a7718715b79cb33232e7479424b059541e1c18e791e0c7552aed43a92066e54d
  • MD5: 087dcee1e7ec27ec0c62422a65010f23
  • BLAKE2b-256: 815c2a9124a0af902c4e26da18c42aee1620011ca15771ccce53ec9701fb7acc

neural_speed-1.0a0-cp38-cp38-manylinux_2_28_x86_64.whl

  • SHA256: dc070ce6cf0cc512eec6bf139b7a8b620651b13d51f928f6a79fbafc3f83b140
  • MD5: ca3e69f78843da1844c6c15c1f3d5a4f
  • BLAKE2b-256: c16a3b87973e555f6d1b2a315b7522db9b0fd86d14489a76b5005d02218b6a6f
