
Repository of Intel® Extension for Transformers

Project description

Neural Speed

Neural Speed is an innovative library designed to support efficient inference of large language models (LLMs) on Intel platforms through state-of-the-art (SOTA) low-bit quantization powered by Intel Neural Compressor. The work is inspired by llama.cpp and further optimized for Intel platforms, with our innovations published at NeurIPS 2023.

Key Features

  • Highly optimized low-precision kernels on CPUs with ISAs (AMX, VNNI, AVX512F, AVX_VNNI, and AVX2). See details
  • Up to 40x performance speedup on popular LLMs compared with llama.cpp. See details
  • Tensor parallelism across sockets/nodes on CPUs. See details

Neural Speed is under active development, so APIs are subject to change.

Supported Hardware

Hardware                           Supported
Intel Xeon Scalable Processors     ✔
Intel Xeon CPU Max Series          ✔
Intel Core Processors              ✔

Supported Models

Neural Speed supports almost all LLMs in PyTorch format from Hugging Face, such as Llama2, ChatGLM2, Baichuan2, Qwen, Mistral, and Whisper. File an issue if your favorite LLM does not work.

It also supports typical LLMs in GGUF format, such as Llama2, Falcon, MPT, and Bloom, with more coming. Check out the details.

Installation

Install from Binary

pip install -r requirements.txt
pip install neural-speed

Build from Source

pip install -r requirements.txt
pip install .

Note: GCC version 10 or newer is required.
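
Whichever install path you used, a quick import check verifies that the package is usable (assuming the module is importable as neural_speed, as the wheel names suggest):

python -c "import neural_speed"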

Quick Start (Transformer-like usage)

Install Intel Extension for Transformers to use Transformer-like APIs.
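
Intel Extension for Transformers is published on PyPI, so a plain pip install should suffice:

pip install intel-extension-for-transformers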

PyTorch Model from Hugging Face

from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "Intel/neural-chat-7b-v3-1"     # Hugging Face model_id or local model
prompt = "Once upon a time, there existed a little girl,"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)  # stream generated tokens to stdout as they are produced

# load_in_4bit=True applies INT4 weight-only quantization while loading
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)
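
If you also want the generated text as a string rather than only streamed to stdout, decoding the returned ids should work, assuming generate returns token ids as in Hugging Face Transformers:

print(tokenizer.decode(outputs[0], skip_special_tokens=True))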

GGUF Model from Hugging Face

from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

# Specify the GGUF repo on Hugging Face
model_name = "TheBloke/Llama-2-7B-Chat-GGUF"
# Download the specific GGUF model file from the above repo
model_file = "llama-2-7b-chat.Q4_0.gguf"
# Make sure you have been granted access to this model on Hugging Face
tokenizer_name = "meta-llama/Llama-2-7b-chat-hf"

prompt = "Once upon a time"
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)
model = AutoModelForCausalLM.from_pretrained(model_name, model_file=model_file)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)
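
If you prefer to download the GGUF file explicitly (for example, to pin a local path), the huggingface_hub client can fetch it directly; a minimal sketch, assuming huggingface_hub is installed:

from huggingface_hub import hf_hub_download

# Download the quantized GGUF file and return its local path
local_path = hf_hub_download(repo_id="TheBloke/Llama-2-7B-Chat-GGUF",
                             filename="llama-2-7b-chat.Q4_0.gguf")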

PyTorch Model from ModelScope

from transformers import TextStreamer
from modelscope import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM
model_name = "qwen/Qwen-7B"     # Modelscope model_id or local model
prompt = "Once upon a time, there existed a little girl,"

model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True, model_hub="modelscope")
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)

Quick Start (llama.cpp-like usage)

Single (One-click) Step

python scripts/run.py model-path --weight_dtype int4 -p "She opened the door and see"

Multiple Steps

Convert and Quantize

# Skip this step if the GGUF model is from Hugging Face or generated by llama.cpp
python scripts/convert.py --outtype f32 --outfile ne-f32.bin EleutherAI/gpt-j-6b
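
The inference step below consumes a quantized file (ne-q4_j.bin), so quantize the converted FP32 output first. The repository provides a quantize script for this; a sketch along these lines, with the exact flags to be confirmed against the project's scripts:

# quantize the FP32 file to INT4; produces the ne-q4_j.bin used below
python scripts/quantize.py --model_name gptj --model_file ne-f32.bin --out_file ne-q4_j.bin --weight_dtype int4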

Inference

# Linux and WSL
OMP_NUM_THREADS=<physical_cores> numactl -m 0 -C 0-<physical_cores-1> python scripts/inference.py --model_name llama -m ne-q4_j.bin -c 512 -b 1024 -n 256 -t <physical_cores> --color -p "She opened the door and see"
# Windows
python scripts/inference.py --model_name llama -m ne-q4_j.bin -c 512 -b 1024 -n 256 -t <physical_cores|P-cores> --color -p "She opened the door and see"

Please refer to Advanced Usage for more details.

Advanced Topics

New model enabling

To add support for your own models, please follow the graph developer document.

Performance profiling

Set the NEURAL_SPEED_VERBOSE environment variable to enable performance profiling.

Available modes:

  • 0: Print full information: evaluation time and operator profiling. Requires NS_PROFILING set to ON and a recompile.
  • 1: Print evaluation time, i.e., the time taken for each evaluation.
  • 2: Profile individual operators to identify performance bottlenecks within the model. Requires NS_PROFILING set to ON and a recompile.
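
For example, per-evaluation timing (mode 1, which needs no recompilation) can be enabled inline on Linux:

NEURAL_SPEED_VERBOSE=1 python scripts/inference.py --model_name llama -m ne-q4_j.bin -n 256 -p "She opened the door and see"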
