
Skythought Evals: Evaluation and Data Generation Tools for Reasoning Models

SkyThought

GitHub · Twitter · Hugging Face Collection · Discord

News

  • [2025/02/11] 🎉 We released Sky-T1-7B (model) and Sky-T1-mini (model) to demonstrate the potential of RL to further enhance model capabilities beyond distillation.
  • [2025/01/23] ⚡️ We released Sky-T1-32B-Flash (model, data) to tackle overthinking and reduce reasoning sequence lengths while maintaining accuracy.
  • [2025/01/19] 🎉 The chat demo for Sky-T1-32B-Preview is live! Please check it out!
  • [2025/01/10] 🎉 We have released our Sky-T1-32B-Preview model and data through HuggingFace!

Getting Started

We open-source the code and scripts we used for data curation, training, and evaluation for Sky-T1-32B-Preview; you can find more details in each directory.

Evaluation

Usage

First, clone the repository and install the package:

git clone https://github.com/NovaSky-AI/SkyThought.git
cd SkyThought
# installation shown for uv
uv venv --python 3.10
source .venv/bin/activate
uv pip install -e .

Running evaluation is as simple as:

skythought evaluate --model NovaSky-AI/Sky-T1-32B-Preview --task aime

We support a wide variety of datasets across mathematics, science, and coding:

  • AIME'24
  • MATH500
  • GPQADiamond
  • MMLU
  • ARC-Challenge
  • OlympiadBench
  • AMC'23
  • TACO
  • APPS
  • LiveCodeBench
  • MMLU Pro
  • MinervaMath
  • GSM8K
  • AIME'25
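For code benchmarks such as LiveCodeBench, TACO, and APPS, results are commonly reported as pass@k. As an illustration of the metric (a generic sketch, not the package's internal implementation), the standard unbiased pass@k estimator from the HumanEval methodology can be written as:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k, given n sampled solutions of which c are correct.

    Computes 1 - C(n-c, k) / C(n, k): the probability that a uniformly
    random subset of k samples contains at least one correct solution.
    """
    if n - c < k:
        return 1.0  # every k-subset must contain at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For k = 1 this reduces to plain accuracy c / n, which is how single-sample scores like those in the tables below are computed.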

For more details, please refer to our evaluation guide and the README.

Evaluation results

Below, we show our evaluation results for the Sky-T1-32B-Preview model across math, coding, and science benchmarks.

| Metric | Sky-T1-32B-Preview | Qwen-2.5-32B-Instruct | QwQ | o1-preview |
|---|---|---|---|---|
| Math500 | 86.4 | 81.4 | 92.2 | 81.4 |
| AIME2024 | 43.3 | 16.7 | 50.0 | 40.0 |
| LiveCodeBench-Easy | 86.3 | 84.6 | 90.7 | 92.9 |
| LiveCodeBench-Medium | 56.8 | 40.8 | 56.3 | 54.9 |
| LiveCodeBench-Hard | 17.9 | 9.8 | 17.1 | 16.3 |
| GPQA-Diamond | 56.8 | 45.5 | 52.5 | 75.2 |
| OlympiadBench (Math, EN) | 59.79 | 46.74 | 62.17 | 59.2 |
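As a rough, unweighted summary of the table above (a back-of-the-envelope comparison, not an official aggregate metric):

```python
# Per-model scores from the table above, in row order:
# Math500, AIME2024, LCB-Easy, LCB-Medium, LCB-Hard, GPQA-Diamond, OlympiadBench
scores = {
    "Sky-T1-32B-Preview":    [86.4, 43.3, 86.3, 56.8, 17.9, 56.8, 59.79],
    "Qwen-2.5-32B-Instruct": [81.4, 16.7, 84.6, 40.8,  9.8, 45.5, 46.74],
    "QwQ":                   [92.2, 50.0, 90.7, 56.3, 17.1, 52.5, 62.17],
    "o1-preview":            [81.4, 40.0, 92.9, 54.9, 16.3, 75.2, 59.2],
}

# Simple mean over the seven benchmarks for each model
for model, vals in scores.items():
    print(f"{model}: {sum(vals) / len(vals):.2f}")
```

On this simple mean, Sky-T1-32B-Preview (≈58.2) lands within about two points of QwQ (≈60.1) and o1-preview (≈60.0), well ahead of the Qwen-2.5-32B-Instruct base model (≈46.5).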

Results on non-reasoning benchmarks

We also evaluate on non-reasoning benchmarks (benchmarks for instruction-following, QA, etc.) to test whether the model has traded off capability in other domains for better performance on reasoning-related benchmarks.

| Metric | Sky-T1-32B-Preview | Qwen-2.5-32B-Instruct | QwQ-32B-Preview | Eval Implementation |
|---|---|---|---|---|
| MMLU (0 shot; no CoT) | 78.36 | 74.14 | 71.23 | lm_eval |
| MMLU (5 shot; no CoT) | 82.46 | 82.62 | 82.32 | lm_eval |
| ARC-C (0 shot; no CoT) | 49.49 | 49.4 | 49.66 | lm_eval |
| IFEval | 75.79 | 78.74 | 42.51 | lm_eval |
| LLM-as-a-Judge | 9.12 | 9.19 | 8.30 | fastchat |
| MGSM (0 shot; direct) | 33 | 42.3 | 19.07 | lm_eval |
| MGSM (8-shot; direct) | 58.4 | 61.47 | 58.5 | lm_eval |
| BFCL-v3 | 53.18 | 58.92 | 17.41 | BFCL |
| Arena-Hard | 74.79 | 66.51 | 52.6 | Arena-Hard-Auto |

For more details, refer here.

Fully Open-source: Driving Progress Together

We believe that open-source collaboration drives progress, and with Sky-T1-32B-Preview, we are fully committed to empowering the community. We open-source all details (i.e., data, code, model weights) so the community can easily replicate and improve on our results:

[Table comparing the open-source availability of data, code, report (math and coding domains), and model weights across Sky-T1-32B-Preview, STILL-2, Journey, QwQ, and o1.]

Citation

The code in this repository is mostly described in the post below. Please consider citing this work if you find the repository helpful.

@misc{sky_t1_2025,
  author       = {NovaSky Team},
  title        = {Sky-T1: Train your own O1 preview model within $450},
  howpublished = {https://novasky-ai.github.io/posts/sky-t1},
  note         = {Accessed: 2025-01-09},
  year         = {2025}
}

Acknowledgement

This work was done at the Berkeley Sky Computing Lab, with amazing compute support from Lambda Labs, Anyscale, and Databricks. We would like to express our gratitude for the valuable academic feedback and support from the Still-2 Team, and Junyang Lin from the Qwen Team.
