
Repository of Intel® Extension for Transformers

Project description

Neural Speed

Neural Speed is an innovative library designed to provide efficient inference of large language models (LLMs) on Intel platforms through state-of-the-art (SOTA) low-bit quantization and sparsity, powered by Intel Neural Compressor and llama.cpp.

Neural Speed is under active development, so APIs are subject to change.

Installation

Build Python package (Recommended way)

# From a clone of the Neural Speed repository
pip install -r requirements.txt
pip install .

Note: Please make sure your GCC version is 10 or newer.
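
Alternatively, prebuilt wheels for this release are published on PyPI (see Built Distributions below), so you can install without building from source:

pip install neural-speed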

Quick Start

There are two approaches to using Neural Speed:

1. Transformer-like usage, which requires installing ITREX (Intel Extension for Transformers)
2. llama.cpp-like usage

1. Transformer-like usage

PyTorch format HF model

from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM
model_name = "Intel/neural-chat-7b-v3-1"     # Hugging Face model_id or local model
prompt = "Once upon a time, there existed a little girl,"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)

model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)
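
The same Transformers-like API exposes other low-bit settings; for example, INT8 weight-only quantization, assuming load_in_8bit mirrors the load_in_4bit flag above:

# INT8 weight-only quantization through the same Transformers-like API
# (load_in_8bit is an assumption mirroring load_in_4bit above)
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_8bit=True)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)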

GGUF format HF model

from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

# Specify the GGUF repo on Hugging Face
model_name = "TheBloke/Llama-2-7B-Chat-GGUF"
# Download the specific GGUF model file from the above repo
model_file = "llama-2-7b-chat.Q4_0.gguf"
# Make sure you have been granted access to this model on Hugging Face
tokenizer_name = "meta-llama/Llama-2-7b-chat-hf"

prompt = "Once upon a time"
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)
model = AutoModelForCausalLM.from_pretrained(model_name, model_file=model_file)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)

Please refer to this link to check supported models.

If you want to use the Transformers-based API in ITREX (Intel Extension for Transformers), please refer to the ITREX Installation Page.

2. llama.cpp-like usage

One-click Python scripts

Run an LLM with a one-click Python script that covers conversion, quantization, and inference.

python scripts/run.py model-path --weight_dtype int4 -p "She opened the door and see"
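
The positional model argument takes a local model directory; a Hugging Face model id should also work (an assumption; this sketch reuses the model id from the Quick Start above):

python scripts/run.py Intel/neural-chat-7b-v3-1 --weight_dtype int4 -p "She opened the door and see"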

Quantize and Inference Step By Step

Neural Speed supports:

1. GGUF models generated by llama.cpp
2. GGUF models from Hugging Face
3. PyTorch models from Hugging Face, quantized by Neural Speed

Neural Speed offers scripts to 1) convert and quantize and 2) run inference, in case you want to convert the model yourself. If the GGUF model comes from Hugging Face or was generated by llama.cpp, you can run inference on it directly.

1. Convert and Quantize LLM

Convert and quantize the model by following the steps below:

# Convert the model directly from a Hugging Face model id (recommended)
python scripts/convert.py --outtype f32 --outfile ne-f32.bin EleutherAI/gpt-j-6b
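
The converted FP32 file can then be quantized to low bit. A minimal sketch, assuming the scripts/quantize.py entry point and the flags described in Advanced Usage; the model name and output file follow the llama-based inference examples below:

# Quantize the converted FP32 model to INT4 weights (flags assumed from Advanced Usage)
python scripts/quantize.py --model_name llama2 --model_file ne-f32.bin --out_file ne-q4_j.bin --weight_dtype int4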
2. Inference

Linux and WSL

OMP_NUM_THREADS=<physical_cores> numactl -m 0 -C 0-<physical_cores-1> python scripts/inference.py --model_name llama -m ne-q4_j.bin -c 512 -b 1024 -n 256 -t <physical_cores> --color -p "She opened the door and see"

Windows

python scripts/inference.py --model_name llama -m ne-q4_j.bin -c 512 -b 1024 -n 256 -t <physical_cores|P-cores> --color -p "She opened the door and see"

For details please refer to Advanced Usage.

Supported Hardware

Neural Speed is optimized for the following hardware:

  • Intel Xeon Scalable Processors
  • Intel Xeon CPU Max Series
  • Intel Core Processors

Supported Models

LLaMA, LLaMA2, NeuralChat series, GPT-J, GPT-NeoX, Dolly-v2, MPT, Falcon, BLOOM, OPT, ChatGLM, ChatGLM2, Baichuan, Baichuan2, Qwen, Mistral, Whisper, CodeLlama, Magicoder, and StarCoder.

Neural Speed also supports GGUF models generated by llama.cpp; you need to download the model and convert it with llama.cpp yourself. Validated models: llama2-7b-chat-hf, falcon-7b, falcon-40b, mpt-7b, mpt-30b, and bloom-7b1.
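
A minimal sketch of that flow, run from a llama.cpp checkout (script names follow llama.cpp's conventions; paths are illustrative):

# Convert the Hugging Face checkpoint to GGUF, then quantize it with llama.cpp
python convert.py /path/to/llama2-7b-chat-hf --outtype f16
./quantize /path/to/llama2-7b-chat-hf/ggml-model-f16.gguf llama-2-7b-chat.Q4_0.gguf q4_0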

Please check the list for more validated GGUF models from Hugging Face.

Advanced Usage

1. Quantization and inference

For more parameters in llama.cpp-like usage, see Advanced Usage.

2. Tensor Parallelism across nodes/sockets

We support a tensor parallelism strategy for distributed inference/training on multi-node and multi-socket systems. You can refer to tensor_parallelism.md to enable this feature.

3. Custom Stopping Criteria

You can customize the stopping criteria to your needs by processing input_ids to determine whether text generation should stop. See the Custom Stopping Criteria document: a simple example with a minimum generation length of 80 tokens, sketched below.
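
A minimal sketch of such a criterion, assuming the standard transformers StoppingCriteria interface and that Neural Speed's generate() accepts a stopping_criteria list like upstream transformers does; it reuses tokenizer, model, and inputs from the Quick Start and enforces the 80-token minimum of the linked example:

from transformers import StoppingCriteria, StoppingCriteriaList

class MinLengthThenEOS(StoppingCriteria):
    # Hypothetical helper: block stopping until min_new_tokens have been
    # generated, then stop once the model emits the EOS token.
    def __init__(self, prompt_length, eos_token_id, min_new_tokens=80):
        self.prompt_length = prompt_length
        self.eos_token_id = eos_token_id
        self.min_new_tokens = min_new_tokens

    def __call__(self, input_ids, scores, **kwargs) -> bool:
        new_tokens = input_ids.shape[-1] - self.prompt_length
        if new_tokens < self.min_new_tokens:
            return False  # too short: keep generating
        return input_ids[0, -1].item() == self.eos_token_id

criteria = StoppingCriteriaList(
    [MinLengthThenEOS(inputs.shape[-1], tokenizer.eos_token_id)]
)
outputs = model.generate(inputs, streamer=streamer, stopping_criteria=criteria)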

4. Verbose Mode

Enable verbose mode and control tracing information using the NEURAL_SPEED_VERBOSE environment variable; a usage example follows the list of modes below.

Available modes:

  • 0: Print all tracing information. Comprehensive output, including evaluation time and operator profiling (requires setting NS_PROFILING to ON and recompiling).
  • 1: Print evaluation time. The time taken for each evaluation.
  • 2: Profile individual operators. Identifies performance bottlenecks within the model (requires setting NS_PROFILING to ON and recompiling).
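
For example, to print only the evaluation time while running the inference command from the Quick Start:

NEURAL_SPEED_VERBOSE=1 python scripts/inference.py --model_name llama -m ne-q4_j.bin -n 256 -p "She opened the door and see"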

Enable New Model

You can add your own models by following the graph developer document.

Download files

Download the file for your platform.

Source Distribution

  • neural-speed-0.3.tar.gz (4.2 MB, Source)

Built Distributions

  • neural_speed-0.3-cp311-cp311-win_amd64.whl (9.3 MB, CPython 3.11, Windows x86-64)
  • neural_speed-0.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (19.1 MB, CPython 3.11, manylinux: glibc 2.17+ x86-64)
  • neural_speed-0.3-cp310-cp310-win_amd64.whl (9.3 MB, CPython 3.10, Windows x86-64)
  • neural_speed-0.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (19.1 MB, CPython 3.10, manylinux: glibc 2.17+ x86-64)
  • neural_speed-0.3-cp39-cp39-win_amd64.whl (9.3 MB, CPython 3.9, Windows x86-64)
  • neural_speed-0.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (19.1 MB, CPython 3.9, manylinux: glibc 2.17+ x86-64)

File details

Details for the file neural-speed-0.3.tar.gz.

File metadata

  • Download URL: neural-speed-0.3.tar.gz
  • Size: 4.2 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.9.18

File hashes

  • SHA256: 9bf1d4fe264adf99967d0e2016afe84e7915f9fc4c9e8d8a43ef3cdcd386e08c
  • MD5: 1a506cbb8f1bf2072a830be8fbf26cad
  • BLAKE2b-256: f92df7805b9c91f74fae6e8d05142b3453bb5c5b84fcc1ea9d56ca8fb72214bf

File details

Details for the file neural_speed-0.3-cp311-cp311-win_amd64.whl.

File hashes

  • SHA256: e789a6fd9ea464755fdd353f5a529564fcdcaca4279e25cc6bbeacd05647edec
  • MD5: fed3f06093f032ebf108fa9d2a8333e4
  • BLAKE2b-256: eea465124ed7c7280fef943f92f5cfe94dc92ac4876b1022a80ebe485e475312

File details

Details for the file neural_speed-0.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

  • SHA256: a5c826b34fad4802a0a1bc8f97175d4dec6bae8e6dc715b4f271759ebf52d4a7
  • MD5: 4d57dd5b169184348cc6e4720da7768c
  • BLAKE2b-256: cd354ba72bc4eacab2e963bea3fe28db48b8f879cee9716dcf8fe9b6ea509814

File details

Details for the file neural_speed-0.3-cp310-cp310-win_amd64.whl.

File hashes

  • SHA256: c418bd09c4183b79e62d078f8de7a6c1081901b9651af700475aeff7caf5c19e
  • MD5: 3ac9d87eb78d7b0a39b01f74ea2f8d14
  • BLAKE2b-256: 0ebac3a564fc0804da4a68467cd835f2c085d3b3db7c5f99b297909d03c5bb66

File details

Details for the file neural_speed-0.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

  • SHA256: 6b3fba9425fdf628b6c6342ec09a58a4ea7602e114a458305c7c51f18b85f178
  • MD5: 9daa2470f76d0db2e01ab0d6dd6dc2d6
  • BLAKE2b-256: 1a51dc157b107b24cd1dbdd86d2360c1d13b88b24657fb9829789419f6b9cddd

File details

Details for the file neural_speed-0.3-cp39-cp39-win_amd64.whl.

File hashes

  • SHA256: 2725c62eb9a392d992ab21a8c0b04f85dfd020d8d4eca9bf22ee134dd6622e8a
  • MD5: affe9d4fc666c74015676b88be49349e
  • BLAKE2b-256: 7fb152d54b34e40f968d010d60f3159fac564b2b788ece28327536feab9b4458

File details

Details for the file neural_speed-0.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

  • SHA256: b348afd8e488a26c1c9f87fdc1a8cd177d7e2961978d061ffc2095abe8d09670
  • MD5: 5ed089c5a6846ea11fca7ade161e5d53
  • BLAKE2b-256: 90d9dfde46afffd48eab0a1e50f998852b3ba9222087f75f2ba4c6aa5b8c8c7d
