Tossaporn Saengja
Muennighoff et al. [1] curate a very high-quality, small reasoning dataset, s1K (N=1,000). Supervised fine-tuning of Qwen2.5-32B-Instruct on s1K takes <30 minutes on 16 H100 GPUs and yields math capability competitive with o1.
#paper #ai
They first gather a very high-quality dataset (N=59,000) at roughly the level of math olympiads and PhD qualifying exams, then filter it down in three stages: quality, difficulty, and diversity.
Budget forcing, a simple length manipulation on the reasoning trace: to extend thinking, suppress the end-of-thinking delimiter </think> and append "Wait" to encourage reflection; to stop early, force-append "</think> Final Answer:" so the model commits to an answer.
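The budget-forcing loop can be sketched roughly as below. This is a minimal illustration, not the paper's implementation: `generate` is a stub standing in for a real LLM call, and only the delimiter strings ("</think>", "Wait", "Final Answer:") come from the paper.

```python
THINK_END = "</think>"

def generate(prompt: str) -> str:
    """Stub LLM: returns a canned continuation (hypothetical, for illustration)."""
    if "Wait" in prompt:
        return " On reflection, 6 * 7 = 42." + THINK_END + " Final Answer: 42"
    return " 6 * 7 is about 40." + THINK_END + " Final Answer: 40"

def budget_force(prompt: str, num_extensions: int = 1) -> str:
    """Extend thinking: strip the end-of-thinking delimiter and append 'Wait'
    so the model keeps reasoning instead of emitting its final answer."""
    out = generate(prompt)
    for _ in range(num_extensions):
        head = out.split(THINK_END)[0]          # suppress </think>
        out = head + " Wait," + generate(prompt + head + " Wait,")
    return out

def cap_thinking(thinking: str, max_chars: int) -> str:
    """Stop early: truncate the trace and force the answer delimiter."""
    return thinking[:max_chars] + THINK_END + " Final Answer:"

answer = budget_force("<think> Compute 6 * 7.")
```

With a real model, the same effect would be achieved at the token level (banning the end-of-thinking token and appending "Wait" to the context) rather than by string surgery.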
[1] https://arxiv.org/abs/2501.19393v2