Steiner-preview

For more details, please refer to the announcement blog post.

Steiner is a series of reasoning models trained on synthetic data using reinforcement learning. These models can explore multiple reasoning paths in an autoregressive manner during inference and autonomously verify or backtrack when necessary, enabling a linear traversal of the implicit search tree.

Steiner is a personal interest project by Yichao 'Peak' Ji, inspired by OpenAI o1. The ultimate goal is to reproduce o1 and validate the inference-time scaling curves. The Steiner-preview model is currently a work in progress. I am open-sourcing it because I’ve found that automated evaluation methods, which are primarily based on multiple-choice questions, struggle to fully reflect the progress of reasoning models. In fact, the assumption that "the correct answer is always among the options" doesn’t align well with real-world reasoning scenarios, as it encourages models to perform substitution-based validation rather than open-ended exploration. For this reason, I’ve chosen to open-source these intermediate results and, when time permits, to build in public. This approach allows me to share knowledge while also gathering more evaluations and feedback from real human users.

⚠️ Disclaimer: While Steiner has been able to achieve high-quality zero-shot results without relying on Chain of Thought (CoT) prompting or an agent framework, it has not yet replicated the inference-time scaling capabilities demonstrated by o1. In experiments using a specialized logits processor to intervene on reasoning tokens, increasing the number of reasoning steps did not improve performance; in fact, it led to a decline in benchmarks such as MMLU-Pro and GPQA. As a result, Steiner cannot currently be considered a successful reproduction of OpenAI o1. There may be deficiencies in both the training methods and data quality, so please interpret the results with caution.

Deployment

Steiner is compatible with all existing inference services; vLLM is the recommended choice for deployment.

vLLM

Deploying Steiner is no different from using other LLMs; you just need to add the following two parameters to the inference request:

"skip_special_tokens": false,
"spaces_between_special_tokens": false,

For example:

{
  "model": "steiner",
  "skip_special_tokens": false,
  "spaces_between_special_tokens": false,
  "messages": [
    {
      "role": "user",
      "content": "Hello"
    }
  ]
}
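As a minimal sketch, the request above could be built and sent with Python's standard library alone. The base URL `http://localhost:8000` (vLLM's default port for its OpenAI-compatible server) and the helper names `build_chat_request` and `send_chat_request` are assumptions for illustration, not part of Steiner or vLLM; adjust them to your deployment.

```python
import json
import urllib.request

# Assumption: a vLLM OpenAI-compatible server is listening at this address.
BASE_URL = "http://localhost:8000"


def build_chat_request(prompt, model="steiner", base_url=BASE_URL):
    """Build (but do not send) a single-turn chat completion request."""
    payload = {
        "model": model,
        # Keep special tokens in the decoded output and do not
        # insert spaces between them.
        "skip_special_tokens": False,
        "spaces_between_special_tokens": False,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request):
    """Send the request and return the text of the first choice."""
    with urllib.request.urlopen(request) as response:
        reply = json.load(response)
    return reply["choices"][0]["message"]["content"]
```

With a server running, `send_chat_request(build_chat_request("Hello"))` would return the model's reply as a string, including any special tokens it emitted.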

If you are using the Python client provided by OpenAI, you can use it like this:

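from openai import OpenAI

# Assumption: vLLM's OpenAI-compatible server at its default local address;
# the API key is a placeholder. Adjust both for your deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="steiner",
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,
    # vLLM-specific parameters are passed through extra_body.
    extra_body={
        "skip_special_tokens": False,
        "spaces_between_special_tokens": False,
    },
)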

llama.cpp

Add the following command-line arguments when starting llama.cpp: --special and --flash-attn.

Benchmarks

GPQA Diamond

Subdomain	Accuracy (0-shot w/o CoT)
Physics (general)	63.16%
Organic Chemistry	40.28%
Quantum Mechanics	76.00%
Electromagnetism and Photonics	50.00%
High-energy particle physics	57.14%
Genetics	25.00%
Astrophysics	53.85%
Molecular Biology	80.00%
Chemistry (general)	50.00%
Relativistic Mechanics	57.14%
Inorganic Chemistry	0.00%
Optics and Acoustics	0.00%
Condensed Matter Physics	100.00%
All	53.54%
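Note that the "All" row is not the plain average of the subdomain rows, because the subdomains contain different numbers of questions. The short check below illustrates this with the figures from the table. The question counts are an assumption inferred from the numbers, not stated in the source: GPQA Diamond has 198 questions, and 106 correct answers would give the reported 53.54%.

```python
# Subdomain accuracies (percent) copied from the table above.
subdomain_accuracy = {
    "Physics (general)": 63.16,
    "Organic Chemistry": 40.28,
    "Quantum Mechanics": 76.00,
    "Electromagnetism and Photonics": 50.00,
    "High-energy particle physics": 57.14,
    "Genetics": 25.00,
    "Astrophysics": 53.85,
    "Molecular Biology": 80.00,
    "Chemistry (general)": 50.00,
    "Relativistic Mechanics": 57.14,
    "Inorganic Chemistry": 0.00,
    "Optics and Acoustics": 0.00,
    "Condensed Matter Physics": 100.00,
}

# Unweighted mean: every subdomain counts equally, regardless of size.
unweighted_mean = sum(subdomain_accuracy.values()) / len(subdomain_accuracy)
print(f"{unweighted_mean:.2f}")  # prints 50.20

# Assumption (inferred, not from the source): 106 correct out of 198 questions.
overall = 100 * 106 / 198
print(f"{overall:.2f}")  # prints 53.54
```

Small subdomains with 0% or 100% accuracy pull the unweighted mean around far more than they affect the overall score, so the per-subdomain rows are best read alongside the "All" row.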

Limitations

Steiner’s current post-training data does not include examples of multi-turn dialogues. The best-performing version of the Steiner model (based on Qwen2.5-32B) cannot handle multi-turn conversations. The open-source Steiner-preview model (based on Qwen2.5-32B-Instruct) is compatible with chat formats but is still not recommended for multi-turn dialogues.
As with OpenAI o1-2024-09-12, custom system prompts and changes to sampling parameters such as temperature are not recommended. Steiner has not yet been trained on a diverse set of system prompts, and altering other parameters may lead to errors in the formatting of reasoning tokens.
Steiner's post-training data is approximately 90% English and 10% Chinese, but the reasoning path data augmentation process used English almost exclusively. Therefore, while the model's final responses show some ability to follow the language of the prompt, the reasoning tokens may be generated predominantly in English.
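The first two limitations can be turned into a small pre-flight check on a request payload. The sketch below is only an illustration: the function name `check_steiner_request` is hypothetical, and the list of sampling parameter names is an assumption based on common OpenAI-style and vLLM request fields, since the text above names only temperature.

```python
# Assumption: illustrative sampling fields; the source names only "temperature".
SAMPLING_KEYS = ("temperature", "top_p", "top_k")


def check_steiner_request(payload):
    """Return warnings for settings that the limitations above advise against."""
    warnings = []
    messages = payload.get("messages", [])

    # Custom system prompts are not recommended.
    if any(m.get("role") == "system" for m in messages):
        warnings.append("custom system prompt")

    # More than one non-system message implies a multi-turn dialogue.
    turns = [m for m in messages if m.get("role") != "system"]
    if len(turns) > 1:
        warnings.append("multi-turn dialogue")

    # Overriding sampling parameters may break reasoning-token formatting.
    for key in SAMPLING_KEYS:
        if key in payload:
            warnings.append(f"sampling parameter: {key}")

    return warnings


single_turn = {"model": "steiner", "messages": [{"role": "user", "content": "Hello"}]}
print(check_steiner_request(single_turn))  # prints []
```

A payload that adds a system message or sets `temperature` would produce one warning per issue, which a caller could log or use to reject the request.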

Citation

If you find my work helpful, please consider citing it in your research or projects. Your acknowledgment would be greatly appreciated!

@misc{ji2024steiner,
    title = {A Small Step Towards Reproducing OpenAI o1: Progress Report on the Steiner Open Source Models},
    url = {https://medium.com/@peakji/b9a756a00855},
    author = {Yichao Ji},
    month = {October},
    year = {2024}
}
