
Repository of Intel® Extension for Transformers

Project description

Neural Speed

Neural Speed is an innovative library designed to provide efficient inference of large language models (LLMs) on Intel platforms through state-of-the-art (SOTA) low-bit quantization and sparsity, powered by Intel Neural Compressor and llama.cpp.

Neural Speed is under active development, so APIs are subject to change.

Installation

Build Python package (Recommended way)

pip install -r requirements.txt
pip install .

Note: Please make sure your GCC version is higher than GCC 10.

Quick Start

There are two approaches to using Neural Speed: 1. Transformer-like usage, which requires installing ITREX (Intel Extension for Transformers); 2. llama.cpp-like usage.
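
For the Transformer-like path, ITREX is available on PyPI; a typical setup is shown below (see the ITREX Installation Page for the full options, including hardware-specific extras):

pip install intel-extension-for-transformers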

1. Transformer-like usage

Pytorch format HF model

from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM
model_name = "Intel/neural-chat-7b-v3-1"     # Hugging Face model_id or local model
prompt = "Once upon a time, there existed a little girl,"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)

model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)

GGUF format HF model

from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

# Specify the GGUF repo on Hugging Face
model_name = "TheBloke/Llama-2-7B-Chat-GGUF"
# Download the specific GGUF model file from the above repo
model_file = "llama-2-7b-chat.Q4_0.gguf"
# Make sure you have been granted access to this model on Hugging Face
tokenizer_name = "meta-llama/Llama-2-7b-chat-hf"

prompt = "Once upon a time"
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)
model = AutoModelForCausalLM.from_pretrained(model_name, model_file=model_file)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)

Please refer to this link to check the supported models.

If you want to use the Transformer-based API in ITREX (Intel Extension for Transformers), please refer to the ITREX Installation Page.

2. llama.cpp-like usage:

One-click Python scripts

Run an LLM with a one-click Python script that covers conversion, quantization, and inference.

python scripts/run.py model-path --weight_dtype int4 -p "She opened the door and see"

Quantize and Inference Step By Step

Neural Speed supports: 1. GGUF models generated by llama.cpp; 2. GGUF models from Hugging Face; 3. PyTorch models from Hugging Face quantized by Neural Speed. Neural Speed offers scripts for 1) conversion and quantization and 2) inference, so you can convert a model yourself. If the GGUF model comes from Hugging Face or was generated by llama.cpp, you can run inference on it directly.

1. Convert and Quantize LLM

Convert the model by following the steps below:

# Convert the model directly using a model id from Hugging Face (recommended).
python scripts/convert.py --outtype f32 --outfile ne-f32.bin EleutherAI/gpt-j-6b
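
The inference step below expects a quantized file (ne-q4_j.bin in these examples), which is produced from the fp32 file by scripts/quantize.py. A typical invocation looks like the sketch below; the exact flags may differ across versions, so check python scripts/quantize.py --help or the Advanced Usage page:

# Quantize the fp32 file to an optimized INT4 model; --model_name must match the
# architecture of the converted model (e.g., gptj, llama, llama2, falcon, ...).
python scripts/quantize.py --model_name gptj --model_file ne-f32.bin --out_file ne-q4_j.bin --weight_dtype int4 --group_size 128 --compute_dtype int8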
2. Inference

Linux and WSL

OMP_NUM_THREADS=<physical_cores> numactl -m 0 -C 0-<physical_cores-1> python scripts/inference.py --model_name llama -m ne-q4_j.bin -c 512 -b 1024 -n 256 -t <physical_cores> --color -p "She opened the door and see"

Windows

python scripts/inference.py --model_name llama -m ne-q4_j.bin -c 512 -b 1024 -n 256 -t <physical_cores|P-cores> --color -p "She opened the door and see"

For details please refer to Advanced Usage.

Supported Hardware

Hardware with Intel software optimizations:

  • Intel Xeon Scalable Processors
  • Intel Xeon CPU Max Series
  • Intel Core Processors

Supported Models

LLAMA, LLAMA2, NeuralChat series, GPT-J, GPT-NEOX, Dolly-v2, MPT, Falcon, BLOOM, OPT, ChatGLM, ChatGLM2, Baichuan, Baichuan2, Qwen, Mistral, Whisper, CodeLlama, Magicoder and StarCoder. You can find more details, such as validated GGUF models from Hugging Face, in this list.

Neural Speed also supports GGUF models generated by llama.cpp; you need to download the model and use llama.cpp to create the GGUF file. Validated models: llama2-7b-chat-hf, falcon-7b, falcon-40b, mpt-7b, mpt-30b and bloom-7b1.

Advanced Usage

1. Quantization and inference

More parameters in llama.cpp-like usage: Advanced Usage.

2. Tensor Parallelism across nodes/sockets

We support a tensor parallelism strategy for distributed inference/training on multi-node and multi-socket systems. You can refer to tensor_parallelism.md to enable this feature.

3. Custom Stopping Criteria

You can customize the stopping criteria according to your own needs by processing input_ids to determine whether text generation should stop. Here is the documentation for Custom Stopping Criteria: a simple example with a minimum generation length of 80 tokens.
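
As a rough illustration of the idea (not the exact example from the linked document), the sketch below uses the standard transformers StoppingCriteria interface and reuses model, tokenizer, inputs, and streamer from the Transformer-like example above; it assumes model.generate accepts a stopping_criteria argument as in the Hugging Face API. It blocks stopping until at least 80 new tokens have been generated and then stops on the EOS token:

from transformers import StoppingCriteria, StoppingCriteriaList

class StopAfterMinLength(StoppingCriteria):
    """Allow stopping on EOS only after a minimum number of new tokens."""
    def __init__(self, prompt_length, eos_token_id, min_new_tokens=80):
        self.prompt_length = prompt_length
        self.eos_token_id = eos_token_id
        self.min_new_tokens = min_new_tokens

    def __call__(self, input_ids, scores, **kwargs):
        new_tokens = input_ids.shape[-1] - self.prompt_length
        if new_tokens < self.min_new_tokens:
            return False  # keep generating until the minimum length is reached
        return input_ids[0, -1].item() == self.eos_token_id  # then stop on EOS

criteria = StoppingCriteriaList([
    StopAfterMinLength(prompt_length=inputs.shape[-1],
                       eos_token_id=tokenizer.eos_token_id)
])
outputs = model.generate(inputs, streamer=streamer, stopping_criteria=criteria, max_new_tokens=300)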

4. Verbose Mode

Enable verbose mode and control tracing information using the NEURAL_SPEED_VERBOSE environment variable.

Available modes:

  • 0: Print all tracing information. Comprehensive output, including evaluation time and operator profiling. (requires NS_PROFILING set to ON and a recompile)
  • 1: Print evaluation time. Time taken for each evaluation.
  • 2: Profile individual operators. Identify performance bottlenecks within the model. (requires NS_PROFILING set to ON and a recompile)
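
For example, on Linux/WSL you can prefix the step-by-step inference command shown above with the environment variable (modes 0 and 2 additionally require a build with NS_PROFILING set to ON):

NEURAL_SPEED_VERBOSE=1 python scripts/inference.py --model_name llama -m ne-q4_j.bin -c 512 -b 1024 -n 256 -t <physical_cores> --color -p "She opened the door and see"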

Enable New Model

You can consider adding your own models; please follow the graph developer document.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files are available for this release.

Built Distributions

neural_speed-0.2.dev0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (19.0 MB)
Uploaded: CPython 3.11, manylinux: glibc 2.17+ x86-64

neural_speed-0.2.dev0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.9 MB)
Uploaded: CPython 3.10, manylinux: glibc 2.17+ x86-64

neural_speed-0.2.dev0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (19.0 MB)
Uploaded: CPython 3.9, manylinux: glibc 2.17+ x86-64

File details

Details for the file neural_speed-0.2.dev0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes:
  • SHA256: 907a610e764d3c3a50f48c1cda595211fde1d9aa52cf43515f1899bbd18f859a
  • MD5: 8e3c0d58fdaef9d78f55b9bb1ff81af9
  • BLAKE2b-256: 0ecb7a0a305a25755bf1c3083c36f83b553574c78d05bcd2b036bdf8e2adfa32

File details

Details for the file neural_speed-0.2.dev0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes:
  • SHA256: 1dbd44875cfeac464e461475c438e3ab74b3fd1532201e05768fab109a9bb2fe
  • MD5: 4b6c56ae06972fc1c947e63730a00f74
  • BLAKE2b-256: ebd806989ff1cd8ffcce236b4335949c9d5b91c4f6e5b120d1c2b6bc3b0d2ab9

File details

Details for the file neural_speed-0.2.dev0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes:
  • SHA256: 33a6919ca1343029ece11be08db604d176410954114e26c0cc7481fd1afc1686
  • MD5: aa95855026525446b6d9edcf16b476d3
  • BLAKE2b-256: 6e8f65866de66fa99417f1638ad931195e96761f5e2508226abac94c6cdd96ef
