C++ implementation of qwen & tiktoken

These details have not been verified by PyPI

Project links

Project description

qwen.cpp

C++ implementation of Qwen-LM for real-time chatting on your MacBook.

Features

Highlights:

Pure C++ implementation based on ggml, working in the same way as llama.cpp.
Pure C++ tiktoken implementation.
Streaming generation with typewriter effect.
Python binding.

Support Matrix:

Hardwares: x86/arm CPU, NVIDIA GPU
Platforms: Linux, MacOS
Models: Qwen-LM

Getting Started

Preparation

Clone the qwen.cpp repository into your local machine:

git clone --recursive https://github.com/QwenLM/qwen.cpp && cd qwen.cpp

If you forgot the --recursive flag when cloning the repository, run the following command in the qwen.cpp folder:

git submodule update --init --recursive

Download the qwen.tiktoken file from Hugging Face or modelscope.

Quantize Model

Use convert.py to transform Qwen-LM into quantized GGML format. For example, to convert the fp16 original model to q4_0 (quantized int4) GGML model, run:

python3 qwen_cpp/convert.py -i Qwen/Qwen-7B-Chat -t q4_0 -o qwen7b-ggml.bin

The original model (-i <model_name_or_path>) can be a HuggingFace model name or a local path to your pre-downloaded model. Currently supported models are:

Qwen-7B: Qwen/Qwen-7B-Chat
Qwen-14B: Qwen/Qwen-14B-Chat

You are free to try any of the below quantization types by specifying -t <type>:

q4_0: 4-bit integer quantization with fp16 scales.
q4_1: 4-bit integer quantization with fp16 scales and minimum values.
q5_0: 5-bit integer quantization with fp16 scales.
q5_1: 5-bit integer quantization with fp16 scales and minimum values.
q8_0: 8-bit integer quantization with fp16 scales.
f16: half precision floating point weights without quantization.
f32: single precision floating point weights without quantization.

Build & Run

Compile the project using CMake:

cmake -B build
cmake --build build -j --config Release

Now you may chat with the quantized Qwen-7B-Chat model by running:

./build/bin/main -m qwen7b-ggml.bin --tiktoken Qwen-7B-Chat/qwen.tiktoken -p 你好
# 你好！很高兴为你提供帮助。

To run the model in interactive mode, add the -i flag. For example:

./build/bin/main -m qwen7b-ggml.bin --tiktoken Qwen-7B-Chat/qwen.tiktoken -i

In interactive mode, your chat history will serve as the context for the next-round conversation.

Run ./build/bin/main -h to explore more options!

Using BLAS

OpenBLAS

OpenBLAS provides acceleration on CPU. Add the CMake flag -DGGML_OPENBLAS=ON to enable it.

cmake -B build -DGGML_OPENBLAS=ON && cmake --build build -j

cuBLAS

cuBLAS uses NVIDIA GPU to accelerate BLAS. Add the CMake flag -DGGML_CUBLAS=ON to enable it.

cmake -B build -DGGML_CUBLAS=ON && cmake --build build -j

Metal

MPS (Metal Performance Shaders) allows computation to run on powerful Apple Silicon GPU. Add the CMake flag -DGGML_METAL=ON to enable it.

cmake -B build -DGGML_METAL=ON && cmake --build build -j

Python Binding

The Python binding provides high-level chat and stream_chat interface similar to the original Hugging Face Qwen-7B.

Installation

Install from PyPI (recommended): will trigger compilation on your platform.

pip install -U qwen-cpp

You may also install from source.

# install from the latest source hosted on GitHub
pip install git+https://github.com/QwenLM/qwen.cpp.git@master
# or install from your local source after git cloning the repo
pip install .

tiktoken.cpp

We provide pure C++ tiktoken implementation. After installation, the usage is the same as openai tiktoken:

import tiktoken_cpp as tiktoken
enc = tiktoken.get_encoding("cl100k_base")
assert enc.decode(enc.encode("hello world")) == "hello world"

Benchmark

The speed of tiktoken.cpp is on par with openai tiktoken:

cd tests
RAYON_NUM_THREADS=1 python benchmark.py

Development

Unit Test

To perform unit tests, add this CMake flag -DQWEN_ENABLE_TESTING=ON to enable testing. Recompile and run the unit test (including benchmark).

mkdir -p build && cd build
cmake .. -DQWEN_ENABLE_TESTING=ON && make -j
./bin/qwen_test

Lint

To format the code, run make lint inside the build folder. You should have clang-format, black and isort pre-installed.

Acknowledgements

This project is greatly inspired by llama.cpp, chatglm.cpp, ggml, tiktoken, tokenizer, cpp-base64, re2 and unordered_dense.

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

This version

0.1.3

Nov 17, 2023

0.1.2

Oct 11, 2023

0.1.1

Oct 10, 2023

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

qwen-cpp-0.1.3.tar.gz (2.9 MB view details)

Uploaded Nov 17, 2023 Source

File details

Details for the file qwen-cpp-0.1.3.tar.gz.

File metadata

Download URL: qwen-cpp-0.1.3.tar.gz
Upload date: Nov 17, 2023
Size: 2.9 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.10.12

File hashes

Hashes for qwen-cpp-0.1.3.tar.gz
Algorithm	Hash digest
SHA256	`e4770afc32b3f5e30e973a52bc8ff1b3f0a89097efe0130cd3fb87722fff160a`
MD5	`a8a22dced8cb75837885d2bcb56bc151`
BLAKE2b-256	`a47661e947717636072018ce25a5929af05b0e47538a564cc0c3298935b38a49`

See more details on using hashes here.

qwen-cpp 0.1.3

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

qwen.cpp

Features

Getting Started

Using BLAS

Python Binding

tiktoken.cpp

Development

Acknowledgements

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

File details

File metadata

File hashes