
FastChat

An open platform for training, serving, and evaluating large language model based chatbots.

Release

  • 🔥 We released Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90% ChatGPT Quality. Check out the blog post and demo.

Join our Discord server and follow our Twitter to get the latest updates.


Install

Method 1: With pip

# Install FastChat
pip3 install fschat

# Install the latest main branch of huggingface/transformers
pip3 install git+https://github.com/huggingface/transformers

Method 2: From source

  1. Clone this repository and navigate to the FastChat folder
git clone https://github.com/lm-sys/FastChat.git
cd FastChat

If you are running on a Mac:

brew install rust

  2. Install Package
pip3 install --upgrade pip  # enable PEP 660 support
pip3 install -e .
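To confirm that the installation succeeded, you can query the installed versions of FastChat and transformers from Python. The sketch below uses the standard library's importlib.metadata (Python 3.8+). The helper name installed_version is our own and is not part of FastChat. Note that fschat is the distribution name used on PyPI, while the import name is fastchat.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is not installed."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # "fschat" is the PyPI distribution name; "transformers" is the
    # Hugging Face library installed from its main branch.
    for name in ("fschat", "transformers"):
        print(f"{name}: {installed_version(name) or 'not installed'}")
```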

Vicuna Weights

We release Vicuna weights as delta weights to comply with the LLaMA model license. You can add our delta to the original LLaMA weights to obtain the Vicuna weights. Instructions:

  1. Get the original LLaMA weights in the huggingface format by following the instructions here.
  2. Use the following scripts to get Vicuna weights by applying our delta. They will automatically download delta weights from our Hugging Face account.

NOTE: Our released weights are only compatible with the latest main branch of huggingface/transformers. We install the correct version of transformers when fastchat is installed.
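Conceptually, applying a delta is an element-wise addition: for every parameter, target = base + delta. The toy sketch below illustrates only that arithmetic on plain Python lists. It is not the fastchat.model.apply_delta implementation, which operates on full Hugging Face checkpoints. The names StateDict and add_delta are illustrative.

```python
from typing import Dict, List

# Toy "state dict": parameter name -> flat list of floats.
# Real checkpoints hold tensors; plain lists keep this sketch dependency-free.
StateDict = Dict[str, List[float]]


def add_delta(base: StateDict, delta: StateDict) -> StateDict:
    """Recover target weights as base + delta, parameter by parameter."""
    if base.keys() != delta.keys():
        raise ValueError("base and delta must contain the same parameter names")
    target: StateDict = {}
    for name, base_values in base.items():
        delta_values = delta[name]
        if len(base_values) != len(delta_values):
            raise ValueError(f"shape mismatch for {name}")
        target[name] = [b + d for b, d in zip(base_values, delta_values)]
    return target


if __name__ == "__main__":
    base = {"layer.weight": [0.5, -1.0, 2.0]}
    delta = {"layer.weight": [0.25, 0.5, -0.5]}
    print(add_delta(base, delta))  # {'layer.weight': [0.75, -0.5, 1.5]}
```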

Vicuna-13B

This conversion command needs around 60 GB of CPU RAM. If you do not have enough memory, you can create a large swap file that allows the operating system to automatically utilize the disk as virtual memory.

python3 -m fastchat.model.apply_delta \
    --base /path/to/llama-13b \
    --target /output/path/to/vicuna-13b \
    --delta lmsys/vicuna-13b-delta-v0

Vicuna-7B

This conversion command needs around 30 GB of CPU RAM. If you do not have enough memory, you can create a large swap file that allows the operating system to automatically utilize the disk as virtual memory.

python3 -m fastchat.model.apply_delta \
    --base /path/to/llama-7b \
    --target /output/path/to/vicuna-7b \
    --delta lmsys/vicuna-7b-delta-v0

Inference with Command Line Interface

Single GPU

The command below requires around 28GB of GPU memory for Vicuna-13B and 14GB of GPU memory for Vicuna-7B.

python3 -m fastchat.serve.cli --model-name /path/to/vicuna/weights
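The GPU figures above match a simple rule of thumb. Weights stored in 16-bit precision take 2 bytes per parameter, so 7B parameters need about 14 GB and 13B need about 26 GB. The rest of the roughly 28 GB is presumably runtime overhead such as activations. The estimator below is our own back-of-the-envelope sketch, not FastChat code. The helper name and the bytes-per-parameter constants are assumptions, and the estimator ignores that overhead.

```python
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Memory needed just to hold the weights, in decimal GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Assumed bytes per parameter for common storage formats.
FP32, FP16, INT8 = 4, 2, 1

if __name__ == "__main__":
    for name, params in (("Vicuna-7B", 7e9), ("Vicuna-13B", 13e9)):
        print(
            f"{name}: fp16 {weight_memory_gb(params, FP16):.0f} GB, "
            f"fp32 {weight_memory_gb(params, FP32):.0f} GB, "
            f"int8 {weight_memory_gb(params, INT8):.0f} GB"
        )
```

If the CPU path uses fp32 (4 bytes per parameter), the same arithmetic gives about 28 GB for 7B and 52 GB for 13B. That is consistent with the CPU-only figures quoted below. 8-bit storage halves the fp16 numbers.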

Multiple GPUs

If you do not have enough GPU memory, you can use model parallelism to aggregate memory from multiple GPUs on the same machine.

python3 -m fastchat.serve.cli --model-name /path/to/vicuna/weights --num-gpus 2
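Model parallelism in this sense places different layers of the network on different GPUs, so each card holds only part of the weights. The sketch below shows the simplest version of that idea: an even, contiguous split of decoder layers. It is a hypothetical illustration, not how FastChat actually decides placement. The 40-layer count in the example is that of LLaMA-13B.

```python
from typing import List


def split_layers(num_layers: int, num_gpus: int) -> List[range]:
    """Assign contiguous blocks of decoder layers to GPUs as evenly as possible."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    base, extra = divmod(num_layers, num_gpus)
    assignments: List[range] = []
    start = 0
    for gpu in range(num_gpus):
        # The first `extra` GPUs each take one additional layer.
        count = base + (1 if gpu < extra else 0)
        assignments.append(range(start, start + count))
        start += count
    return assignments


if __name__ == "__main__":
    # LLaMA-13B has 40 decoder layers; with --num-gpus 2 each GPU holds 20.
    for gpu, layers in enumerate(split_layers(40, 2)):
        print(f"cuda:{gpu}: layers {layers.start}-{layers.stop - 1}")
```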

CPU Only

This runs on the CPU only and does not require GPU. It requires around 60GB of CPU memory for Vicuna-13B and around 30GB of CPU memory for Vicuna-7B.

python3 -m fastchat.serve.cli --model-name /path/to/vicuna/weights --device cpu

Metal Backend (Mac Computers with Apple Silicon or AMD GPUs)

Use --device mps to enable GPU acceleration on Mac computers and use --load-8bit to turn on 8-bit compression.

python3 -m fastchat.serve.cli --model-name /path/to/vicuna/weights --device mps --load-8bit

Vicuna-7B can run on a 32GB M1 MacBook at about 1 to 2 words per second.

Not Enough Memory or Other Platforms

If you do not have enough memory, you can enable 8-bit compression by adding --load-8bit to the commands above. This reduces memory usage by around half, with slightly degraded model quality. It is compatible with the CPU, GPU, and Metal backends. With 8-bit compression, Vicuna-13B can run on a single NVIDIA RTX 3090, RTX 4090, or V100 (16GB) GPU.

python3 -m fastchat.serve.cli --model-name /path/to/vicuna/weights --load-8bit
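The roughly 50% saving comes from storing each weight in one byte instead of two. A common way to do this is symmetric (absmax) quantization, sketched below in plain Python. This is a generic illustration of 8-bit compression, not necessarily the scheme FastChat uses for --load-8bit. Practical schemes often use per-row or per-group scales rather than one scale for a whole tensor. The function names are our own.

```python
from typing import List, Tuple


def quantize_absmax(values: List[float]) -> Tuple[List[int], float]:
    """Symmetric 8-bit quantization: map [-max|x|, max|x|] onto integers in [-127, 127]."""
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(v / scale) for v in values], scale


def dequantize(q: List[int], scale: float) -> List[float]:
    """Approximate reconstruction; each value is off by at most scale / 2."""
    return [x * scale for x in q]


if __name__ == "__main__":
    weights = [0.12, -0.5, 0.33, 0.01]
    q, scale = quantize_absmax(weights)
    restored = dequantize(q, scale)
    # Each int8 value needs 1 byte instead of 2 (fp16), at the cost of rounding error.
    print(q)
    print(max(abs(a - b) for a, b in zip(weights, restored)))
```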

In addition, we are actively exploring more methods to make the model easier to run on more platforms. Contributions and pull requests are welcome.

Serving with Web GUI

To serve using the web UI, you need three main components: web servers that interface with users, model workers that host one or more models, and a controller that coordinates the web server and model workers. Here are the commands to run in your terminal:

Note for Windows users: you need Python 3.7 or later, and you must set the following environment variable before launching the worker:

PYTHONUTF8=1

Launch the controller

python3 -m fastchat.serve.controller

This controller manages the distributed workers.

Launch the model worker

python3 -m fastchat.serve.model_worker --model-path /path/to/vicuna/weights

Wait until the process finishes loading the model and you see "Uvicorn running on ...". You can launch multiple model workers to serve multiple models concurrently. The model worker will connect to the controller automatically.
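The division of labor among the three components can be pictured with a small in-process model. Workers register the models they host, and the web server asks the controller which worker should receive a request. This is only an illustration. The class and method names, the worker addresses, and the round-robin policy below are assumptions. The real controller runs as a separate service, and its dispatch policy is not described here.

```python
from typing import Dict, List


class ToyController:
    """Hypothetical in-process stand-in for the controller."""

    def __init__(self) -> None:
        self._workers: Dict[str, List[str]] = {}  # model name -> worker addresses
        self._next: Dict[str, int] = {}           # round-robin cursor per model

    def register(self, address: str, model_names: List[str]) -> None:
        """Called by a worker at startup to announce the models it hosts."""
        for name in model_names:
            self._workers.setdefault(name, []).append(address)

    def models(self) -> List[str]:
        """Model names the web server can offer to users."""
        return sorted(self._workers)

    def pick_worker(self, model_name: str) -> str:
        """Choose a worker for a request, rotating among workers serving that model."""
        addresses = self._workers.get(model_name)
        if not addresses:
            raise KeyError(f"no worker serves {model_name!r}")
        i = self._next.get(model_name, 0) % len(addresses)
        self._next[model_name] = (i + 1) % len(addresses)
        return addresses[i]


if __name__ == "__main__":
    controller = ToyController()
    controller.register("http://worker-a:8000", ["vicuna-13b"])  # hypothetical address
    controller.register("http://worker-b:8000", ["vicuna-7b"])   # hypothetical address
    print(controller.models())
    print(controller.pick_worker("vicuna-13b"))
```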

To ensure that your model worker is connected to your controller properly, send a test message using the following command:

python3 -m fastchat.serve.test_message --model-name vicuna-13b

Launch the Gradio web server

python3 -m fastchat.serve.gradio_web_server

This is the user interface that users will interact with.

Once all three components are running, your models are served through the web UI, and you can open it in your browser to chat with a model.

Evaluation

Our AI-enhanced evaluation pipeline is based on GPT-4. Here are some high-level instructions for using the pipeline:

First, generate answers from different models. Use qa_baseline_gpt35.py for ChatGPT, or specify the model checkpoint and run model_qa.py for Vicuna and other models.

Next, use GPT-4 to generate reviews automatically. If the GPT-4 API is not available to you, this step can be done manually instead. Once you have your evaluation data, visualize the results by running generate_webpage_data_from_table.py, which generates data for a static website.

Finally, serve a static website under the webpage directory. You can simply use python3 -m http.server to serve the website locally.

Besides the evaluation workflow, we also document the data format used for evaluation, which is encoded with JSON Lines and includes information on models, prompts, reviewers, questions, answers, and reviews. You can customize the evaluation process or contribute to our project by accessing relevant data.
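JSON Lines stores one JSON object per line. This makes it easy to append answers or reviews incrementally and to stream large files. A minimal read/write sketch with Python's standard json module follows. The field names question_id and text are placeholders, not the documented schema.

```python
import json
import os
import tempfile
from typing import Any, Dict, Iterable, List


def write_jsonl(path: str, records: Iterable[Dict[str, Any]]) -> None:
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path: str) -> List[Dict[str, Any]]:
    """Read a JSON Lines file, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


if __name__ == "__main__":
    # Hypothetical records; see the evaluation docs for the actual fields.
    questions = [
        {"question_id": 1, "text": "example question one"},
        {"question_id": 2, "text": "example question two"},
    ]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "questions.jsonl")
        write_jsonl(path, questions)
        print(read_jsonl(path) == questions)
```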

Check evaluation for detailed instructions.

Fine-tuning

Data

Vicuna is created by fine-tuning a LLaMA base model using approximately 70K user-shared conversations gathered from ShareGPT.com with public APIs. To ensure data quality, we convert the HTML back to markdown and filter out some inappropriate or low-quality samples. Additionally, we divide lengthy conversations into smaller segments that fit the model's maximum context length. For detailed instructions on cleaning the ShareGPT data, see here.
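Dividing a long conversation can be done by packing consecutive turns into segments until a token budget is reached. The sketch below is one simple greedy approach, not the project's actual cleaning script. It counts tokens by whitespace splitting, where a real pipeline would use the model's tokenizer. The function name is hypothetical. A budget of 2048 would mirror the max length in the hyperparameter table.

```python
from typing import List


def split_conversation(turns: List[str], max_tokens: int) -> List[List[str]]:
    """Greedily pack consecutive turns into segments of at most max_tokens tokens.

    Token counts use whitespace splitting as a stand-in for a real tokenizer.
    A single turn longer than max_tokens becomes its own (oversized) segment.
    """
    segments: List[List[str]] = []
    current: List[str] = []
    current_len = 0
    for turn in turns:
        n = len(turn.split())
        # Start a new segment if adding this turn would exceed the budget.
        if current and current_len + n > max_tokens:
            segments.append(current)
            current, current_len = [], 0
        current.append(turn)
        current_len += n
    if current:
        segments.append(current)
    return segments


if __name__ == "__main__":
    # Hypothetical turns and a tiny budget so the split is visible.
    turns = ["Human: hi there", "Assistant: hello, how can I help?", "Human: tell me a joke"]
    for i, segment in enumerate(split_conversation(turns, max_tokens=8)):
        print(i, segment)
```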

Due to some concerns, we may not release the data at the moment. If you would like to try the fine-tuning code, you can run it with our preprocessed alpaca dataset (originally from here).

Code and Hyperparameters

We fine-tune the model using the code from Stanford Alpaca, with some modifications to support gradient checkpointing and Flash Attention. We use hyperparameters similar to those of Stanford Alpaca.

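| Hyperparameter | Global Batch Size | Learning rate | Epochs | Max length | Weight decay |
|----------------|-------------------|---------------|--------|------------|--------------|
| Vicuna-13B     | 128               | 2e-5          | 3      | 2048       | 0            |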

Fine-tuning on Any Cloud with SkyPilot

SkyPilot is a framework built by UC Berkeley for easily and cost-effectively running ML workloads on any cloud (AWS, GCP, Azure, Lambda, etc.). To use SkyPilot, install it with the following command and set up your cloud credentials locally following the instructions here.

# Install skypilot from the master branch
pip install git+https://github.com/skypilot-org/skypilot.git

Vicuna

Vicuna can be trained on 8 A100 GPUs with 80GB memory. The following command automatically launches a node that satisfies this requirement, then sets up and runs the training job on it.

sky launch -c vicuna -s scripts/train-vicuna.yaml --env WANDB_API_KEY

Other options are also valid:

# Launch it on managed spot to save 3x cost (train Vicuna-13B with around $300)
sky spot launch -n vicuna scripts/train-vicuna.yaml --env WANDB_API_KEY

# Train a 7B model
sky launch -c vicuna -s scripts/train-vicuna.yaml --env WANDB_API_KEY --env MODEL_SIZE=7

Note: Please make sure WANDB_API_KEY is set up on your local machine. You can find the API key on your wandb profile page. If you would like to train the model without using wandb, you can replace the --env WANDB_API_KEY flag with --env WANDB_MODE=offline.

Alpaca

Launch the training job with the following command (it runs on a single node with 4 A100-80GB GPUs):

sky launch -c alpaca -s scripts/train-alpaca.yaml --env WANDB_API_KEY

Fine-tuning with Local GPUs

Vicuna can also be trained on 8 A100 GPUs with 80GB memory using the following command. To train on fewer GPUs, you can reduce per_device_train_batch_size and increase gradient_accumulation_steps accordingly to keep the global batch size the same. To set up the environment, please see the setup section in scripts/train-vicuna.yaml.

torchrun --nnodes=1 --nproc_per_node=8 --master_port=<your_random_port> \
    fastchat/train/train_mem.py \
    --model_name_or_path <path-to-llama-model-weight> \
    --data_path <path-to-data> \
    --bf16 True \
    --output_dir ./checkpoints \
    --num_train_epochs 3 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 1200 \
    --save_total_limit 100 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True
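The advice about fewer GPUs rests on one relationship. If every process is a data-parallel rank, as with the FSDP settings above, then global batch size = nproc_per_node × per_device_train_batch_size × gradient_accumulation_steps. The helper names below are our own. With the flags exactly as written, the global batch size is 8 × 4 × 1 = 32. To reach the 128 listed in the hyperparameter table, you would need gradient_accumulation_steps 4 on 8 GPUs, or 8 on 4 GPUs with the same per-device batch.

```python
def global_batch_size(num_gpus: int, per_device_batch: int, grad_accum_steps: int) -> int:
    """Examples contributing to one optimizer step under data parallelism."""
    return num_gpus * per_device_batch * grad_accum_steps


def accum_steps_for(target_global: int, num_gpus: int, per_device_batch: int) -> int:
    """Gradient accumulation steps needed to hit target_global exactly."""
    per_step = num_gpus * per_device_batch
    if target_global % per_step:
        raise ValueError("target is not a multiple of num_gpus * per_device_batch")
    return target_global // per_step


if __name__ == "__main__":
    print(global_batch_size(8, 4, 1))  # 32: the flags in the command above
    print(accum_steps_for(128, 8, 4))  # 4: steps for the table's 128 on 8 GPUs
    print(accum_steps_for(128, 4, 4))  # 8: same target on 4 GPUs
```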

Download files


Source Distribution

fschat-0.1.9.tar.gz (47.9 kB)

Uploaded Source

Built Distribution

fschat-0.1.9-py3-none-any.whl (56.3 kB)

Uploaded Python 3

File details

Details for the file fschat-0.1.9.tar.gz.

File metadata

  • Download URL: fschat-0.1.9.tar.gz
  • Size: 47.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.16

File hashes

Hashes for fschat-0.1.9.tar.gz
Algorithm Hash digest
SHA256 0c9c7db6d797ff5c786acd14fd7e17e42d5be5d0b52f3f6f4cde3277f721b40e
MD5 5aa88d850e4f89b42615e20c6fa11450
BLAKE2b-256 9368df35416e816a223f89ceb889c773ae9a92471cdec1b284cac609140dc532
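To check a downloaded file against the SHA256 digest listed above, you can hash it with Python's standard hashlib module. The file is read in chunks so that large downloads do not need to fit in memory. The function name and the script invocation in the comment are our own.

```python
import hashlib
import sys


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hex SHA256 digest of a file, read in 1 MiB chunks by default."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    # Usage (script name is hypothetical): python3 verify_sha256.py fschat-0.1.9.tar.gz
    expected = "0c9c7db6d797ff5c786acd14fd7e17e42d5be5d0b52f3f6f4cde3277f721b40e"
    if len(sys.argv) == 2:
        actual = sha256_of_file(sys.argv[1])
        print("OK" if actual == expected else f"MISMATCH: {actual}")
```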


File details

Details for the file fschat-0.1.9-py3-none-any.whl.

File metadata

  • Download URL: fschat-0.1.9-py3-none-any.whl
  • Size: 56.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.16

File hashes

Hashes for fschat-0.1.9-py3-none-any.whl
Algorithm Hash digest
SHA256 904bea25caba6f117e1d473ff02795088330759a836a857c16d03f71eb7f5b05
MD5 8ef378ddb7ca3cbba6b19037ce62a536
BLAKE2b-256 2804325fc750b61f2faaed68c4e15f10cc3cc911dfc496f7d868167c76b33859

