
SCAR: Style Consistency-Aware Response Ranking for LLM Instruction-Tuning

Overview

SCAR is a data selection method for instruction-tuning large language models. It uses style consistency-aware response ranking to pick high-quality instruction-answer pairs, so that models fine-tuned on a small selected subset can match or outperform models trained on the full dataset.

Installation

You can install SCAR using pip (the distribution is published as scar-tool):

pip install scar-tool
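
Note that while the pip package is named scar-tool, the code is imported as the style_ranker package, as in the examples below. A quick check that the installation worked:

# The pip package is scar-tool, but the import package is style_ranker
import style_ranker
from style_ranker.rank import rank_and_filter
from style_ranker.ranker.model import StyleRanker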

Requirements

SCAR requires the following dependencies: torch, transformers, scikit-learn, tqdm, nltk, datasketch, peft, trl, accelerate, langdetect, and deepspeed. These will be automatically installed when you install SCAR via pip.

Usage

Basic Usage with Hugging Face Transformers

Here's a simple example of how to use the StyleRanker model with Hugging Face Transformers:

import torch
from transformers import AutoTokenizer
from style_ranker.ranker.model import StyleRanker

# Load the model and tokenizer
model_path = "lizhuang144/scar-gte-base"
model = StyleRanker.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Prepare your data
instructions = ["Write a poem about spring", "Explain quantum computing"]
answers = ["Blossoms bloom in gentle breeze...", "Quantum computing is a type of computation..."]

# Tokenize the inputs
max_length = 512
instruction_inputs = tokenizer(instructions, return_tensors='pt', padding=True, truncation=True, max_length=max_length)
answer_inputs = tokenizer(answers, return_tensors='pt', padding=True, truncation=True, max_length=max_length)

# Put the model in evaluation mode before scoring
model.eval()

# Get the scores
with torch.no_grad():
    scores = model(
        instruction_inputs.input_ids,
        instruction_inputs.attention_mask,
        answer_inputs.input_ids,
        answer_inputs.attention_mask
    )

# Print the results
for instruction, answer, score in zip(instructions, answers, scores):
    print(f"Instruction: {instruction}")
    print(f"Answer: {answer}")
    print(f"Score: {score.item()}")
    print()
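
The example above runs on CPU. If a GPU is available, you can move the model and the tokenized inputs over before scoring. A minimal sketch, reusing the variables from the example and assuming StyleRanker is a standard torch.nn.Module so that .to() behaves as usual:

# Move the model and tokenized inputs to the GPU when one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
instruction_inputs = instruction_inputs.to(device)
answer_inputs = answer_inputs.to(device)

# Score exactly as before; the outputs now live on the same device
with torch.no_grad():
    scores = model(
        instruction_inputs.input_ids,
        instruction_inputs.attention_mask,
        answer_inputs.input_ids,
        answer_inputs.attention_mask
    )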

Advanced Usage

SCAR also provides a complete pipeline for filtering and ranking instruction-answer pairs. You can tune the selection in three ways: keep the top-k pairs with the highest scores, keep a fixed ratio of the data, or apply a score threshold.

The rank_and_filter function performs ranking and filtering in a single call. Here's an example demonstrating all three selection modes:

from style_ranker.rank import rank_and_filter

# Path to the ranker model (rank_and_filter loads the model and tokenizer internally)
model_path = "lizhuang144/scar-gte-base"

# Prepare your data
instructions = ["Write a poem about spring", "Explain quantum computing", "Describe the water cycle"]
answers = ["Blossoms bloom in gentle breeze...", "Quantum computing is a type of computation...",
           "The water cycle, also known as..."]

# Example 1: Using topk
topk_pairs = rank_and_filter(model_path, instructions, answers, topk=2)

# Example 2: Using threshold
threshold_pairs = rank_and_filter(model_path, instructions, answers, threshold=-0.5)

# Example 3: Using ratio
ratio_pairs = rank_and_filter(model_path, instructions, answers, ratio=0.5)

# Print results for each method
print("Top-k results:")
for instruction, answer, score in topk_pairs:
    print(f"Score: {score:.2f} | Instruction: {instruction}")

print("\nThreshold results:")
for instruction, answer, score in threshold_pairs:
    print(f"Score: {score:.2f} | Instruction: {instruction}")

print("\nRatio results:")
for instruction, answer, score in ratio_pairs:
    print(f"Score: {score:.2f} | Instruction: {instruction}")

Model List

We provide the following pre-trained SCAR models:

  • lizhuang144/scar-gte-base: the ranker checkpoint used in the examples above

Performance

SCAR demonstrates significant improvements in LLM performance when used for data filtering and selection. We evaluated our method using two LLMs: Olmo and Starcoder.

Note: Prior to applying SCAR, we filter out non-English data and remove near-duplicate instruction-response pairs.
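
Both preprocessing steps can be done with libraries SCAR already depends on (langdetect for language detection, datasketch for near deduplication). The sketch below shows one way to do this; it is illustrative, not the project's own dedup.py implementation:

from datasketch import MinHash, MinHashLSH
from langdetect import detect, LangDetectException

def is_english(text):
    # langdetect raises on inputs with no detectable features (e.g. digits only)
    try:
        return detect(text) == "en"
    except LangDetectException:
        return False

def near_dedup(texts, threshold=0.9, num_perm=128):
    # Drop near-duplicates using MinHash LSH over whitespace tokens
    lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
    kept = []
    for i, text in enumerate(texts):
        m = MinHash(num_perm=num_perm)
        for token in text.split():
            m.update(token.encode("utf8"))
        if not lsh.query(m):  # no sufficiently similar text seen so far
            lsh.insert(str(i), m)
            kept.append(text)
    return kept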

Olmo Performance

Dataset Size           L.C. WinRate
Full dataset (320k)    3.86
SCAR-filtered 10k      5.37
SCAR-filtered 5k       5.64
SCAR-filtered 2.5k     4.08

The official checkpoint allenai/OLMo-7B-SFT is trained on 320k examples from allenai/tulu-v2-sft-mixture. We evaluate models trained with SCAR-filtered data of 10k, 5k, and 2.5k instruction-answer pairs. The metric is L.C. WinRate (length-controlled win rate) on the AlpacaEval benchmark, which compares model outputs against gpt-4-1106-preview using meta-llama/Meta-Llama-3-70B-Instruct as the judge.

Starcoder Performance

Dataset Size          HumanEval (Python)   MultiPL-E (Java)    MultiPL-E (C++)     MultiPL-E (JavaScript)
                      Pass@1 / Pass@10     Pass@1 / Pass@10    Pass@1 / Pass@10    Pass@1 / Pass@10
Full dataset (13k)    35.56 / 51.81        26.03 / 38.44       32.80 / 46.97       29.32 / 41.90
SCAR-filtered 10k     36.29 / 53.99        28.29 / 39.58       33.22 / 49.79       30.17 / 46.20
SCAR-filtered 5k      36.95 / 54.07        28.96 / 39.02       34.53 / 49.90       34.53 / 49.90
SCAR-filtered 2.5k    37.57 / 55.65        29.29 / 41.06       34.09 / 49.47       31.19 / 42.83

The official checkpoint bigcode/octocoder is bigcode/starcoder fine-tuned on 13k examples from bigcode/guanaco-commits; its numbers are taken from the bigcode/bigcode-models-leaderboard. We evaluated with the bigcode-evaluation-harness on four benchmarks covering four programming languages (Python, Java, C++, and JavaScript), reporting two execution accuracies (Pass@1 and Pass@10) for each, and again measured models trained with SCAR-filtered data of 10k, 5k, and 2.5k instruction-answer pairs.

Key Components

  • StyleRanker: A model for ranking instruction-answer pairs based on style consistency and data quality.
  • Data Filtering: Scripts for filtering and selecting high-quality instruction-answer pairs.
  • LLM Training: Scripts for fine-tuning large language models using the selected data.

Scripts

The scripts/ directory contains bash scripts for various tasks:

  • quality_measure.sh: Measures the quality of collected responses using LLMs; these quality scores are used to train the ranker.
  • train_ranker.sh: Trains the SCAR style ranker model. Please update the script arguments as needed.
  • data_filter.sh: Ranks and filters instruction-answer pairs. Please update the script arguments as needed.
  • train_llm.sh: Fine-tunes a large language model using the filtered data. Please review and update the script arguments accordingly.

Project Structure

The project is organized as follows:

  • data/: Datasets for training and evaluation
    • llm_sft_data/: Training data for the large language model (code and open domain)
    • ranker_data/: Training data for the ranker (code and open domain)
  • style_ranker/: Main package
    • consts.py
    • dedup.py: Near deduplication
    • llm/: LLM training (train.py)
    • rank.py: Ranking and filtering
    • ranker/: StyleRanker implementation
      • config.py, dataset.py, model.py: Ranker configuration, dataset, and model definitions
      • quality.py: Quality measurement with LLMs such as GPT-3.5-turbo
      • train.py: SCAR ranker training
    • utils.py
  • examples/: Example Python scripts
    • filter_pipeline.py, rank_pairs.py, remove_dupes.py, vicuna_converter.py
  • scripts/: Example Bash scripts
    • data_filter.sh, quality_measure.sh, train_llm.sh, train_ranker.sh
  • requirements.txt: List of dependencies
  • setup.py: Installation script
