SCAR: An AI-powered tool for ranking and filtering instruction-answer pairs based on writing quality and style consistency

SCAR: Style Consistency-Aware Response Ranking for LLM Instruction-Tuning

Overview

SCAR is a data selection method for instruction-tuning large language models. It ranks instruction-answer pairs by style consistency and response quality so that a smaller, higher-quality subset of the data can be used for training.

Installation

You can install SCAR using pip:

pip install scar-tool

Requirements

SCAR requires the following dependencies: torch, transformers, scikit-learn, tqdm, nltk, datasketch, peft, trl, accelerate, langdetect, and deepspeed. These will be automatically installed when you install SCAR via pip.
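
Before running the examples, it can help to confirm that these dependencies are importable. The following is a minimal sketch using only the standard library. The helper name `missing_modules` is hypothetical and not part of SCAR, and the list uses the usual import names for these packages (for example, scikit-learn is imported as `sklearn`).

```python
from importlib.util import find_spec


def missing_modules(module_names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in module_names if find_spec(name) is None]


# Import names for the dependencies listed above (scikit-learn imports as `sklearn`).
SCAR_DEPENDENCIES = [
    "torch", "transformers", "sklearn", "tqdm", "nltk", "datasketch",
    "peft", "trl", "accelerate", "langdetect", "deepspeed",
]

if __name__ == "__main__":
    missing = missing_modules(SCAR_DEPENDENCIES)
    print("Missing modules:", missing if missing else "none")
```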

Usage

Basic Usage with Hugging Face Transformers

Here's a simple example of how to use the StyleRanker model with Hugging Face Transformers:

import torch
from transformers import AutoTokenizer
from style_ranker.ranker.model import StyleRanker

# Load the model and tokenizer
model_path = "lizhuang144/scar-gte-base"
model = StyleRanker.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Prepare your data
instructions = ["Write a poem about spring", "Explain quantum computing"]
answers = ["Blossoms bloom in gentle breeze...", "Quantum computing is a type of computation..."]

# Tokenize the inputs
max_length = 512
instruction_inputs = tokenizer(instructions, return_tensors='pt', padding=True, truncation=True, max_length=max_length)
answer_inputs = tokenizer(answers, return_tensors='pt', padding=True, truncation=True, max_length=max_length)

# Put the model in evaluation mode before inference
model.eval()

# Get the scores
with torch.no_grad():
    scores = model(
        instruction_inputs.input_ids,
        instruction_inputs.attention_mask,
        answer_inputs.input_ids,
        answer_inputs.attention_mask
    )

# Print the results
for instruction, answer, score in zip(instructions, answers, scores):
    print(f"Instruction: {instruction}")
    print(f"Answer: {answer}")
    print(f"Score: {score.item()}")
    print()
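
For larger datasets, scoring every pair in a single forward pass may not fit in memory. The sketch below splits the pairs into batches. `score_fn` is a hypothetical callable (not part of SCAR) that you would write to wrap the tokenizer and model call shown above and return one float per pair. The `dummy_score_fn` here is only a stand-in so the example runs without downloading a model.

```python
from typing import Callable, List, Sequence


def score_in_batches(
    instructions: Sequence[str],
    answers: Sequence[str],
    score_fn: Callable[[Sequence[str], Sequence[str]], List[float]],
    batch_size: int = 32,
) -> List[float]:
    """Score instruction-answer pairs batch by batch and return one score per pair."""
    if len(instructions) != len(answers):
        raise ValueError("instructions and answers must have the same length")
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    scores: List[float] = []
    for start in range(0, len(instructions), batch_size):
        end = start + batch_size
        scores.extend(score_fn(instructions[start:end], answers[start:end]))
    return scores


# Stand-in scorer for illustration only: it uses answer length, not the StyleRanker model.
def dummy_score_fn(batch_instructions, batch_answers):
    return [float(len(answer)) for answer in batch_answers]


if __name__ == "__main__":
    demo = score_in_batches(["a", "b", "c"], ["x", "yy", "zzz"], dummy_score_fn, batch_size=2)
    print(demo)
```

In practice you would replace `dummy_score_fn` with a function that tokenizes each batch, runs the model under `torch.no_grad()`, and converts the returned tensor to a list of floats.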

Advanced Usage

SCAR also provides a pipeline for ranking and filtering data. You can select the top-k highest-scoring pairs, keep a fixed ratio of the data, or keep only the pairs whose score passes a threshold.

The rank_and_filter function ranks and filters instruction-answer pairs. The following example demonstrates each of the three selection modes:

from style_ranker.rank import rank_and_filter

# Hugging Face model ID of the pretrained ranker (loaded inside rank_and_filter)
model_path = "lizhuang144/scar-gte-base"

# Prepare your data
instructions = ["Write a poem about spring", "Explain quantum computing", "Describe the water cycle"]
answers = ["Blossoms bloom in gentle breeze...", "Quantum computing is a type of computation...",
           "The water cycle, also known as..."]

# Example 1: Using topk
topk_pairs = rank_and_filter(model_path, instructions, answers, topk=2)

# Example 2: Using threshold
threshold_pairs = rank_and_filter(model_path, instructions, answers, threshold=-0.5)

# Example 3: Using ratio
ratio_pairs = rank_and_filter(model_path, instructions, answers, ratio=0.5)

# Print results for each method
print("Top-k results:")
for instruction, answer, score in topk_pairs:
    print(f"Score: {score:.2f} | Instruction: {instruction}")

print("\nThreshold results:")
for instruction, answer, score in threshold_pairs:
    print(f"Score: {score:.2f} | Instruction: {instruction}")

print("\nRatio results:")
for instruction, answer, score in ratio_pairs:
    print(f"Score: {score:.2f} | Instruction: {instruction}")
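
To make the three selection modes concrete, here is a minimal pure-Python sketch of what top-k, ratio, and threshold selection could look like over pairs that have already been scored. This is not the implementation of rank_and_filter: the function name `select_pairs`, the rounding applied to `ratio`, and the inclusive threshold comparison are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

ScoredPair = Tuple[str, str, float]  # (instruction, answer, score)


def select_pairs(
    scored_pairs: List[ScoredPair],
    topk: Optional[int] = None,
    ratio: Optional[float] = None,
    threshold: Optional[float] = None,
) -> List[ScoredPair]:
    """Sort pairs by score (highest first), then apply exactly one selection rule."""
    chosen = [option for option in (topk, ratio, threshold) if option is not None]
    if len(chosen) != 1:
        raise ValueError("specify exactly one of topk, ratio, or threshold")
    ranked = sorted(scored_pairs, key=lambda pair: pair[2], reverse=True)
    if topk is not None:
        return ranked[:max(topk, 0)]
    if ratio is not None:
        # Assumption: keep floor(ratio * n) pairs.
        return ranked[: int(len(ranked) * ratio)]
    # Assumption: a pair is kept when its score is at least the threshold.
    return [pair for pair in ranked if pair[2] >= threshold]


if __name__ == "__main__":
    pairs = [("i1", "a1", 0.2), ("i2", "a2", -0.7), ("i3", "a3", 1.5)]
    print(select_pairs(pairs, topk=2))
    print(select_pairs(pairs, ratio=0.5))
    print(select_pairs(pairs, threshold=-0.5))
```

Check the library's own behavior before relying on edge cases such as ties or fractional ratios, since those may differ from this sketch.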

Model List

We provide pre-trained SCAR models on the Hugging Face Hub, including lizhuang144/scar-gte-base, the checkpoint used in the examples above.

Performance

SCAR demonstrates significant improvements in LLM performance when used for data filtering and selection. We evaluated our method using two LLMs: Olmo and Starcoder.

Note: Before applying SCAR, we filter out non-English instruction-response pairs and remove duplicates.
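
The preprocessing step noted above can be sketched as follows. This is a simplified illustration, not the project's code: it removes only exact duplicates after whitespace and case normalization (the repository's dedup.py performs near deduplication, which this does not reproduce), and `is_english` is a hypothetical predicate that you would supply, for example by wrapping a language detector such as langdetect.

```python
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]  # (instruction, response)


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies compare equal."""
    return " ".join(text.lower().split())


def preprocess_pairs(pairs: Iterable[Pair], is_english: Callable[[str], bool]) -> List[Pair]:
    """Drop pairs that are not English, then drop exact duplicates (first copy wins)."""
    seen = set()
    kept: List[Pair] = []
    for instruction, response in pairs:
        if not (is_english(instruction) and is_english(response)):
            continue
        key = (normalize(instruction), normalize(response))
        if key in seen:
            continue
        seen.add(key)
        kept.append((instruction, response))
    return kept


# Crude stand-in for a language detector, for illustration only:
# it treats any pure-ASCII text as English.
def ascii_only(text: str) -> bool:
    return text.isascii()


if __name__ == "__main__":
    raw = [("Hello", "World"), ("hello ", " world"), ("こんにちは", "世界"), ("Hi", "There")]
    print(preprocess_pairs(raw, ascii_only))
```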

Olmo Performance

Dataset Size           L.C. WinRate
Full dataset (320k)    3.86
SCAR-filtered 10k      5.37
SCAR-filtered 5k       5.64
SCAR-filtered 2.5k     4.08

The official checkpoint allenai/OLMo-7B-SFT is trained on 320k examples from allenai/tulu-v2-sft-mixture. We evaluate models trained on SCAR-filtered subsets of 10k, 5k, and 2.5k instruction-answer pairs. The evaluation metric is L.C. WinRate (length-controlled win rate), which compares model outputs against those of 'gpt-4-1106-preview' on the AlpacaEval benchmark, using meta-llama/Meta-Llama-3-70B-Instruct as the judge.

Starcoder Performance

Dataset Size          HumanEval (Python)   MultiPL-E (Java)    MultiPL-E (C++)     MultiPL-E (JavaScript)
                      Pass@1 / Pass@10     Pass@1 / Pass@10    Pass@1 / Pass@10    Pass@1 / Pass@10
Full dataset (13k)    35.56 / 51.81        26.03 / 38.44       32.80 / 46.97       29.32 / 41.90
SCAR-filtered 10k     36.29 / 53.99        28.29 / 39.58       33.22 / 49.79       30.17 / 46.20
SCAR-filtered 5k      36.95 / 54.07        28.96 / 39.02       34.53 / 49.90       34.53 / 49.90
SCAR-filtered 2.5k    37.57 / 55.65        29.29 / 41.06       34.09 / 49.47       31.19 / 42.83

The official checkpoint 'bigcode/octocoder' is 'bigcode/starcoder' fine-tuned on 13k examples from 'bigcode/guanaco-commits'. Its performance is taken from the 'bigcode/bigcode-models-leaderboard'. We evaluate models trained on SCAR-filtered subsets of 10k, 5k, and 2.5k instruction-answer pairs using the bigcode-evaluation-harness. Each model is evaluated on four datasets covering four programming languages (Python, Java, C++, and JavaScript), and we report two execution accuracies (Pass@1 and Pass@10) for each dataset.
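
For reference, Pass@k is commonly computed with the unbiased estimator introduced alongside HumanEval: with n generated samples for a problem, of which c pass the unit tests, pass@k = 1 - C(n-c, k) / C(n, k), averaged over all problems. The sketch below assumes that estimator. The function names are illustrative and are not taken from the bigcode-evaluation-harness.

```python
from math import comb
from typing import Sequence, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: Sequence[Tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given (n, c) for each problem."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Two hypothetical problems with 10 samples each: 3 and 7 correct.
    print(mean_pass_at_k([(10, 3), (10, 7)], k=1))
```

With k = 1 the estimator reduces to c / n, the fraction of samples that pass.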

Key Components

  • StyleRanker: A model for ranking instruction-answer pairs based on style consistency and data quality.
  • Data Filtering: Scripts for filtering and selecting high-quality instruction-answer pairs.
  • LLM Training: Scripts for fine-tuning large language models using the selected data.

Scripts

The scripts/ directory contains bash scripts for various tasks:

  • quality_measure.sh: Measures the quality of the collected responses using LLMs. The resulting quality measurements are used to train the ranker.
  • train_ranker.sh: Trains the SCAR style ranker model. Please update the script arguments as needed.
  • data_filter.sh: Ranks and filters instruction-answer pairs. Please update the script arguments as needed.
  • train_llm.sh: Fine-tunes a large language model using the filtered data. Please update the script arguments as needed. The following additional packages are required to train the LLM:
    • peft
    • trl
    • accelerate
    • deepspeed

Ensure all dependencies are installed before running these scripts.

Project Structure

The project is organized as follows:

  • data/: Datasets for training and evaluation
    • llm_sft_data/: Training data for the large language model (code and open domain)
    • ranker_data/: Training data for the ranker (code and open domain)
  • style_ranker/: Main package
    • consts.py
    • dedup.py: Near deduplication
    • llm/: LLM training (train.py)
    • rank.py: Ranking and filtering
    • ranker/: StyleRanker implementation
      • config.py, dataset.py, model.py
      • quality.py: Quality measure with LLMs like GPT-3.5-turbo
      • train.py: SCAR ranker training
    • utils.py
  • examples/: Example Python scripts
    • filter_pipeline.py, rank_pairs.py, remove_dupes.py, vicuna_converter.py
  • scripts/: Example Bash scripts
    • data_filter.sh, quality_measure.sh, train_llm.sh, train_ranker.sh
  • requirements.txt: List of dependencies
  • setup.py: Installation script
