
Skythought Evals: Evaluation and Data Generation Tools for Reasoning Models

SkyThought

GitHub · Twitter · Hugging Face Collection · Discord

News

  • [2025/02/11] 🎉 We released Sky-T1-7B (model) and Sky-T1-mini (model) to demonstrate the potential of RL to further enhance model capabilities beyond distillation.
  • [2025/01/23] ⚡️ We released Sky-T1-32B-Flash (model, data) to tackle overthinking and reduce reasoning sequence lengths while maintaining accuracy.
  • [2025/01/19] 🎉 The chat demo for Sky-T1-32B-Preview is live! Please check it out!
  • [2025/01/10] 🎉 We have released our Sky-T1-32B-Preview model and data through HuggingFace!

Getting Started

We open-source the code and scripts we used for data curation, training, and evaluation for Sky-T1-32B-Preview; you can find more details in each directory.

Evaluation

Usage

First, clone the repository and install the package:

git clone https://github.com/NovaSky-AI/SkyThought.git
cd SkyThought
# installation shown for uv
uv venv --python 3.10
source .venv/bin/activate
uv pip install -e .

Running evaluation is as simple as:

skythought evaluate --model NovaSky-AI/Sky-T1-32B-Preview --task aime

We support a wide variety of datasets across mathematics, science, and coding:

  • AIME'24
  • MATH500
  • GPQADiamond
  • MMLU
  • ARC-Challenge
  • OlympiadBench
  • AMC'23
  • TACO
  • APPS
  • LiveCodeBench
  • MMLU Pro
  • MinervaMath
  • GSM8K
  • AIME'25
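For code benchmarks such as LiveCodeBench, TACO, and APPS, results are commonly reported as pass@k. As an illustration of the metric (a generic sketch, not the package's internal implementation), the standard unbiased pass@k estimator from the HumanEval methodology can be written as:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k, given n sampled solutions of which c are correct.

    Computes 1 - C(n-c, k) / C(n, k): the probability that a uniformly
    random subset of k samples contains at least one correct solution.
    """
    if n - c < k:
        return 1.0  # every k-subset must contain at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For k = 1 this reduces to plain accuracy c / n, which is how single-sample scores like those in the tables below are computed.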

For more details, please refer to our evaluation guide and the README.

Evaluation results

Below, we show our evaluation results for the Sky-T1-32B-Preview model across math, coding, and science benchmarks.

| Metric | Sky-T1-32B-Preview | Qwen-2.5-32B-Instruct | QwQ | o1-preview |
|---|---|---|---|---|
| Math500 | 86.4 | 81.4 | 92.2 | 81.4 |
| AIME2024 | 43.3 | 16.7 | 50.0 | 40.0 |
| LiveCodeBench-Easy | 86.3 | 84.6 | 90.7 | 92.9 |
| LiveCodeBench-Medium | 56.8 | 40.8 | 56.3 | 54.9 |
| LiveCodeBench-Hard | 17.9 | 9.8 | 17.1 | 16.3 |
| GPQA-Diamond | 56.8 | 45.5 | 52.5 | 75.2 |
| OlympiadBench (Math, EN) | 59.79 | 46.74 | 62.17 | 59.2 |
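As a rough, unweighted summary of the table above (a back-of-the-envelope comparison, not an official aggregate metric):

```python
# Per-model scores from the table above, in row order:
# Math500, AIME2024, LCB-Easy, LCB-Medium, LCB-Hard, GPQA-Diamond, OlympiadBench
scores = {
    "Sky-T1-32B-Preview":    [86.4, 43.3, 86.3, 56.8, 17.9, 56.8, 59.79],
    "Qwen-2.5-32B-Instruct": [81.4, 16.7, 84.6, 40.8,  9.8, 45.5, 46.74],
    "QwQ":                   [92.2, 50.0, 90.7, 56.3, 17.1, 52.5, 62.17],
    "o1-preview":            [81.4, 40.0, 92.9, 54.9, 16.3, 75.2, 59.2],
}

# Simple mean over the seven benchmarks for each model
for model, vals in scores.items():
    print(f"{model}: {sum(vals) / len(vals):.2f}")
```

On this simple mean, Sky-T1-32B-Preview (≈58.2) lands within about two points of QwQ (≈60.1) and o1-preview (≈60.0), well ahead of the Qwen-2.5-32B-Instruct base model (≈46.5).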

Results on non-reasoning benchmarks

We also evaluate on non-reasoning benchmarks (benchmarks for instruction-following, QA, etc.) to test whether the model has traded off capability in other domains for better performance on reasoning-related benchmarks.

| Metric | Sky-T1-32B-Preview | Qwen-2.5-32B-Instruct | QwQ-32B-Preview | Eval Implementation |
|---|---|---|---|---|
| MMLU (0 shot; no CoT) | 78.36 | 74.14 | 71.23 | lm_eval |
| MMLU (5 shot; no CoT) | 82.46 | 82.62 | 82.32 | lm_eval |
| ARC-C (0 shot; no CoT) | 49.49 | 49.4 | 49.66 | lm_eval |
| IFEval | 75.79 | 78.74 | 42.51 | lm_eval |
| LLM-as-a-Judge | 9.12 | 9.19 | 8.30 | fastchat |
| MGSM (0 shot; direct) | 33 | 42.3 | 19.07 | lm_eval |
| MGSM (8-shot; direct) | 58.4 | 61.47 | 58.5 | lm_eval |
| BFCL-v3 | 53.18 | 58.92 | 17.41 | BFCL |
| Arena-Hard | 74.79 | 66.51 | 52.6 | Arena-Hard-Auto |

For more details, refer here.

Fully Open-source: Driving Progress Together

We believe that open-source collaboration drives progress, and with Sky-T1-32B-Preview, we are fully committed to empowering the community. We open-source all details (i.e., data, code, model weights) so the community can easily replicate and improve on our results:

[Table comparing the open-source availability of data, code, report (math and coding domains), and model weights across Sky-T1-32B-Preview, STILL-2, Journey, QwQ, and o1.]

Citation

The code in this repository is mostly described in the post below. Please consider citing this work if you find the repository helpful.

@misc{sky_t1_2025,
  author       = {NovaSky Team},
  title        = {Sky-T1: Train your own O1 preview model within $450},
  howpublished = {https://novasky-ai.github.io/posts/sky-t1},
  note         = {Accessed: 2025-01-09},
  year         = {2025}
}

Acknowledgement

This work was done at the Berkeley Sky Computing Lab, with amazing compute support from Lambda Labs, Anyscale, and Databricks. We would like to express our gratitude for the valuable academic feedback and support from the Still-2 Team, and Junyang Lin from the Qwen Team.
