# SCAR: Style Consistency-Aware Response Ranking for LLM Instruction-Tuning

## Overview

SCAR is a data selection method designed to enhance instruction-tuning for large language models. It leverages style consistency-aware response ranking to improve both the quality of the training data and the efficiency of training.
## Installation

You can install SCAR using pip:

```bash
pip install scar-tool
```
### Requirements

SCAR requires the following dependencies: `torch`, `transformers`, `scikit-learn`, `tqdm`, `nltk`, `datasketch`, `peft`, `trl`, `accelerate`, `langdetect`, and `deepspeed`. These will be installed automatically when you install SCAR via pip.
## Usage

### Basic Usage with Hugging Face Transformers

Here's a simple example of how to use the StyleRanker model with Hugging Face Transformers:
```python
import torch
from transformers import AutoTokenizer
from style_ranker.ranker.model import StyleRanker

# Load the model and tokenizer
model_path = "lizhuang144/scar-gte-base"
model = StyleRanker.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Prepare your data
instructions = ["Write a poem about spring", "Explain quantum computing"]
answers = ["Blossoms bloom in gentle breeze...", "Quantum computing is a type of computation..."]

# Tokenize the inputs
max_length = 512
instruction_inputs = tokenizer(instructions, return_tensors='pt', padding=True, truncation=True, max_length=max_length)
answer_inputs = tokenizer(answers, return_tensors='pt', padding=True, truncation=True, max_length=max_length)

model.eval()

# Get the scores
with torch.no_grad():
    scores = model(
        instruction_inputs.input_ids,
        instruction_inputs.attention_mask,
        answer_inputs.input_ids,
        answer_inputs.attention_mask
    )

# Print the results
for instruction, answer, score in zip(instructions, answers, scores):
    print(f"Instruction: {instruction}")
    print(f"Answer: {answer}")
    print(f"Score: {score.item()}")
    print()
```
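The model returns one score per instruction-answer pair, so you can rank or select pairs directly with standard PyTorch operations. As a minimal sketch (this continues the snippet above and reuses its `scores` and `instructions` variables):

```python
# Rank the pairs from the example above by score, highest first.
order = torch.argsort(scores.squeeze(-1), descending=True)
for i in order.tolist():
    print(f"{scores[i].item():.2f} | {instructions[i]}")
```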
### Advanced Usage

SCAR offers more sophisticated data filtering and ranking through its full pipeline. You can tune the selection process by keeping the top-k pairs with the highest scores, keeping a fixed ratio of the data, or applying a score threshold.

The `rank_and_filter` function provides a convenient way to rank and filter instruction-answer pairs. Here's an example demonstrating its usage:
```python
from style_ranker.rank import rank_and_filter

# Path to the pre-trained ranker
model_path = "lizhuang144/scar-gte-base"

# Prepare your data
instructions = ["Write a poem about spring", "Explain quantum computing", "Describe the water cycle"]
answers = ["Blossoms bloom in gentle breeze...", "Quantum computing is a type of computation...",
           "The water cycle, also known as..."]

# Example 1: Using topk
topk_pairs = rank_and_filter(model_path, instructions, answers, topk=2)

# Example 2: Using threshold
threshold_pairs = rank_and_filter(model_path, instructions, answers, threshold=-0.5)

# Example 3: Using ratio
ratio_pairs = rank_and_filter(model_path, instructions, answers, ratio=0.5)

# Print results for each method
print("Top-k results:")
for instruction, answer, score in topk_pairs:
    print(f"Score: {score:.2f} | Instruction: {instruction}")

print("\nThreshold results:")
for instruction, answer, score in threshold_pairs:
    print(f"Score: {score:.2f} | Instruction: {instruction}")

print("\nRatio results:")
for instruction, answer, score in ratio_pairs:
    print(f"Score: {score:.2f} | Instruction: {instruction}")
```
## Model List

We provide the following pre-trained SCAR models:

- `lizhuang144/scar-gte-base`: SCAR model trained using `Alibaba-NLP/gte-base-en-v1.5` as the representation encoder.
- `lizhuang144/scar-gte-large`: SCAR model trained using `Alibaba-NLP/gte-large-en-v1.5` as the representation encoder.
- `lizhuang144/scar-roberta-base`: SCAR model trained using `FacebookAI/roberta-base` as the representation encoder.
## Performance

SCAR demonstrates significant improvements in LLM performance when used for data filtering and selection. We evaluated our method using two LLMs: Olmo and Starcoder.

Note: Prior to applying SCAR, we filter out non-English data and remove duplicate instruction-response pairs.
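Both preprocessing steps can be sketched with the `langdetect` and `datasketch` packages that SCAR already lists as dependencies. The snippet below is an illustration under those assumptions, not the package's exact implementation (the repository's `style_ranker/dedup.py` contains the real near-deduplication code):

```python
from datasketch import MinHash, MinHashLSH
from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException

def is_english(text):
    try:
        return detect(text) == "en"
    except LangDetectException:  # empty or undecidable input
        return False

def minhash(text, num_perm=128):
    # Hash whitespace tokens into a MinHash signature of the text.
    m = MinHash(num_perm=num_perm)
    for token in text.split():
        m.update(token.encode("utf-8"))
    return m

def dedup_pairs(pairs, threshold=0.9):
    """Keep only pairs whose response is not a near-duplicate of one already kept."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    kept = []
    for i, (instruction, answer) in enumerate(pairs):
        m = minhash(answer)
        if not lsh.query(m):  # nothing similar seen so far
            lsh.insert(str(i), m)
            kept.append((instruction, answer))
    return kept

pairs = [
    ("Write a poem about spring", "Blossoms bloom in the gentle breeze of April..."),
    ("Écris un poème sur le printemps", "Les fleurs s'épanouissent dans la brise..."),
    ("Write a poem about spring", "Blossoms bloom in the gentle breeze of April..."),
]

english = [(q, a) for q, a in pairs if is_english(q + " " + a)]
unique = dedup_pairs(english)
print(unique)  # the French pair and the exact duplicate are dropped
```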
### Olmo Performance

| Dataset Size        | L.C. WinRate |
|---------------------|--------------|
| Full dataset (320k) | 3.86         |
| SCAR-filtered 10k   | 5.37         |
| SCAR-filtered 5k    | 5.64         |
| SCAR-filtered 2.5k  | 4.08         |

The official checkpoint `allenai/OLMo-7B-SFT` is trained on 320k examples from `allenai/tulu-v2-sft-mixture`. We evaluate models trained with SCAR-filtered data at 10k, 5k, and 2.5k instruction-answer pairs. The evaluation metric is L.C. WinRate, which compares model outputs against `gpt-4-1106-preview` using `meta-llama/Meta-Llama-3-70B-Instruct` as the judge on the AlpacaEval benchmark.
### Starcoder Performance

| Dataset Size       | HumanEval (Python) Pass@1 / Pass@10 | MultiPL-E (Java) Pass@1 / Pass@10 | MultiPL-E (C++) Pass@1 / Pass@10 | MultiPL-E (JavaScript) Pass@1 / Pass@10 |
|--------------------|-------------------------------------|-----------------------------------|----------------------------------|-----------------------------------------|
| Full dataset (13k) | 35.56 / 51.81                       | 26.03 / 38.44                     | 32.80 / 46.97                    | 29.32 / 41.90                           |
| SCAR-filtered 10k  | 36.29 / 53.99                       | 28.29 / 39.58                     | 33.22 / 49.79                    | 30.17 / 46.20                           |
| SCAR-filtered 5k   | 36.95 / 54.07                       | 28.96 / 39.02                     | 34.53 / 49.90                    | 34.53 / 49.90                           |
| SCAR-filtered 2.5k | 37.57 / 55.65                       | 29.29 / 41.06                     | 34.09 / 49.47                    | 31.19 / 42.83                           |

The official checkpoint `bigcode/octocoder` is `bigcode/starcoder` fine-tuned on 13k examples from `bigcode/guanaco-commits`. We evaluated performance using the bigcode-evaluation-harness; the numbers for `bigcode/octocoder` are taken from the `bigcode/bigcode-models-leaderboard`. We evaluated models on benchmarks in four programming languages (Python, Java, C++, and JavaScript) and report two execution accuracies (Pass@1 and Pass@10) for each. Models trained with SCAR-filtered data were evaluated at 10k, 5k, and 2.5k instruction-answer pairs.
## Key Components
- StyleRanker: A model for ranking instruction-answer pairs based on style consistency and data quality.
- Data Filtering: Scripts for filtering and selecting high-quality instruction-answer pairs.
- LLM Training: Scripts for fine-tuning large language models using the selected data.
## Scripts

The `scripts/` directory contains bash scripts for various tasks:

- `quality_measure.sh`: Measures the quality of the collected responses using LLMs; the results are used to train the ranker.
- `train_ranker.sh`: Trains the SCAR style ranker model. Please update the script arguments as needed.
- `data_filter.sh`: Ranks and filters instruction-answer pairs. Please update the script arguments as needed.
- `train_llm.sh`: Fine-tunes a large language model using the filtered data. Please review and update the script arguments accordingly.
## Project Structure

The project is organized as follows:

- `data/`: Datasets for training and evaluation
  - `llm_sft_data/`: Training data for the large language model (code and open domain)
  - `ranker_data/`: Training data for the ranker (code and open domain)
- `style_ranker/`: Main package
  - `consts.py`
  - `dedup.py`: Near deduplication
  - `llm/`: LLM training (`train.py`)
  - `rank.py`: Ranking and filtering
  - `ranker/`: StyleRanker implementation
    - `config.py`, `dataset.py`, `model.py`
    - `quality.py`: Quality measure with LLMs like GPT-3.5-turbo
    - `train.py`: SCAR ranker training
  - `utils.py`
- `examples/`: Example Python scripts (`filter_pipeline.py`, `rank_pairs.py`, `remove_dupes.py`, `vicuna_converter.py`)
- `scripts/`: Example bash scripts (`data_filter.sh`, `quality_measure.sh`, `train_llm.sh`, `train_ranker.sh`)
- `requirements.txt`: List of dependencies
- `setup.py`: Installation script