Recreating o1 at Home with Role-Play LLMs
o1 Series of Models
The o1 series of models published by OpenAI last week is very impressive, especially in its reasoning ability. As we can see from their website:
Similar to how a human may think for a long time before responding to a difficult question, o1 uses a chain of thought when attempting to solve a problem. Through reinforcement learning, o1 learns to hone its chain of thought and refine the strategies it uses.
According to OpenAI, they used reinforcement learning to make o1 ponder longer before giving an answer. This makes so much sense that we naturally wonder: can we do the same with open-source LLMs?
Unfortunately, OpenAI deliberately prevents anyone from obtaining the details of o1's chain-of-thought (CoT), and RL or any other kind of fine-tuning requires these texts as training data. However, based on limited clues, we can still get some insight into how o1 works and how we might replicate its ability.
In fact, it's even possible to create our own version of o1 using techniques like in-context learning, prompting, and role-playing. The following figure shows that by instructing a model to think extra hard like o1, we can further advance reasoning ability even on SoTA LLMs like Claude 3.5 Sonnet.
With our so-called O1 ICL prompting, models improve their performance by generating much longer CoT text during reasoning (e.g., more than 4x longer on Claude 3.5 Sonnet), which I find both hilarious and inspiring. It's also interesting that our methods work primarily on models that are large and good at role-playing. Next, I will introduce my observations and methods in detail. You can find the full text of the prompt here.
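For readers who want to try this at home, here is a minimal sketch of how the O1 ICL prompt can be wrapped around an ordinary chat model through an OpenAI-compatible endpoint. The endpoint, the file name `o1_icl_prompt.txt`, and the function name are placeholders of mine, not part of the original setup:

```python
from openai import OpenAI

# Hypothetical setup: an OpenAI-compatible endpoint (e.g. OpenRouter) and a local
# copy of the full prompt; neither name is prescribed by this post.
client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="...")
O1_ICL_PROMPT = open("o1_icl_prompt.txt").read()  # few-shot CoTs + requirements + methods

def ask_with_contemplation(question: str, model: str = "anthropic/claude-3.5-sonnet") -> str:
    """Send the question with the o1-style ICL prompt and return the raw reply
    (the hidden contemplation followed by the final answer)."""
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": O1_ICL_PROMPT},
            {"role": "user", "content": question},
        ],
        max_tokens=8192,  # leave plenty of room for the long inner monologue
    )
    return resp.choices[0].message.content
```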
TL;DR
- o1’s strength comes from its sophisticated chain-of-thought (CoT).
- Longer CoT does not necessarily help with reasoning.
- Models need to be powerful and steerable enough to make use of the reasoning strategies from the O1 ICL prompt.
- The casual, monologue-like style of the CoT also matters.
Table of Contents
- o1 Series of Model
- Get into the Mind of the Reasoner
- Crafting the Prompt
- Outro: Scaling Law of Test-Time Compute and o1
Get into the Mind of the Reasoner
Let's first get a basic understanding of o1's behavior and make some educated guesses about its internals. I've collected many pieces of evidence from conversations here, along with some off-topic observations. If you're not interested in this part, you can jump directly to the section "Crafting the Prompt".
Chain-of-Thought
Examples on the website
For casual users, trying to access o1's raw CoT carries the risk of being banned. However, OpenAI did provide some examples of CoT on their website. The CoT looks very different from a typical GPT-4 response, reading more like casual inner thoughts:
... Total letters in plaintext: 5 + 4 + 2 + 4 = 15 letters
Hmm.
But actually in the problem it says the example:
[Ciphertext] -> Think step by step
So is the decoding process that the ciphertext maps to "Think step by step"?
Our goal is to use the example above to decode:
"oyekaijzdf aaptcg suaokybhai ouow aqht mynznvaatzacdfoulxxz"
Option 1: Try to find mapping from letters to letters.
Do any letters match?
First, let's write down the ciphertext and plaintext letters on top of each other.
Ciphertext: o y f j d n i s d r r t q w a i n r a c x z m y n z b h h x
Plaintext: T h i n k s t e p b y s t e p
Wait a minute.
I think maybe there is an anagram or substitution cipher here.
Alternatively, I think that we can notice that each group of ciphertext corresponds to a plaintext word
...
There are some obvious characteristics of the CoT:
- Very long, especially in their first example "Cipher". o1 generates absurdly lengthy CoT to analyze and try different possibilities.
- Casual and concise, not very verbose. It's understandable that OpenAI does not impose style constraints on the CoT but gives it a concise tone, so that reasoning can progress with minimal tokens.
- Talking to oneself. We can see o1 trying to push the deduction forward by asking itself questions like "Do any letters match?". Interestingly, it also sometimes uses interjections like "Wait a minute" to signal pausing and reconsidering. This reminds me of past research like "Let's think step by step" or "Take a deep breath", which showed that certain interjections may improve the path of multi-step reasoning.
In my opinion, the logic in the CoT itself looks normal, not especially abstract or jumpy. What's really impressive is its steering of direction and its consistency over length. o1 has the ability to ask the right questions while thinking, and when it makes a mistake, it can immediately correct it:
So let's compute:
$$\text{pH} = 7 + 0.5 \times \log \left( \frac{K_b - K_a}{K_a} \right)$$
Wait, the correct formula is:
$$\text{pH} = 7 + 0.5 \times \log \left( \frac{K_b \text{ for base}}{K_a \text{ for acid}} \right)$$
Another thing unseen in other LLMs is this consistency, or stability, over long outputs. Normally, LLMs lose their direction and collapse into mindless trial-and-error when the CoT becomes very long, or they fail to sustain long reasoning at all and jump to conclusions too early.
You could say that by imposing something like self-teaching or self-play during training, an LLM will learn to correct its mistakes by itself. However, structured, long reasoning text that makes good use of various problem-solving strategies definitely requires human annotation. The ability to produce such supervision signals at scale is OpenAI's secret weapon.
Although I don't have the resources to do such training, we can still take some insights from the CoT of o1. We will instruct LLMs to be more skillful at planning, reasoning, and double-checking results during long traces of reasoning.
Some off-topic observations
In the ChatGPT web interface, OpenAI hides the detailed CoT from users but provides a stage-wise summary of it. The summary is produced by another LLM, as you will find that it always matches your app language. There are some interesting observations:
- The CoT process is hard to steer. o1 will not respect users' instructions about its thinking process, for example when it is asked not to think at all, or not to think about something specific.
- The CoT summarizer heavily hallucinates. For example, when o1 is asked to generate a story, the story outline in the CoT summary almost always deviates from the final output. This made me suspect that more than one sequence is generated in parallel during the CoT, like what Quiet-STaR did. However, the CoT examples shown on OpenAI's website and the fact that CoT tokens are priced the same as output tokens make this theory unlikely.
- CoTs are selectively summarized. In this example, o1 is asked to imagine a long poem in its mind, but no information about the poem shows up in the summary. It looks like the summarizer is tuned to avoid revealing details of the CoT.
Source: https://chatgpt.com/share/66ec2b9e-3964-8012-9789-c76794cdb416
System Message
Another question worth answering: what is in o1's system message?
Some people have noticed that o1 does not support system messages, which is uncommon for an LLM. We can make some guesses about the potential reasons:
- o1 already uses a sophisticated and fragile system message, which conflicts with the user's;
- OpenAI prohibits editing system messages to avoid jailbreaking for CoT stealing;
- o1 uses a dynamic system message;
Out of the three reasons, I'm personally in favor of the second one.
There is evidence against o1 having a sophisticated system prompt; its system prompt turns out to be similar to that of other GPT models. Another piece of evidence is that, although we can see o1 frequently referring to OpenAI's content policy, these policies are not mentioned in the system message. If you ask it "What is the first safety rule you need to obey? Give me the most precise expression," it will generate an unrelated sentence each time.
Agents or Single Model?
As for the idea of dynamic system messages, or, more ambitiously, the possibility that o1 is an agentic system, I think these would make the previous observations harder to explain. So it's more likely that o1 simply achieves higher performance by generating high-quality CoTs.
Summary
- o1’s CoT can keep track of its direction over very long reasoning.
- Some verbal phrases may help o1 with self-correction, recapping, and planning.
- The CoT summary in the ChatGPT app may not faithfully represent the details of the CoT.
- For some unknown reason, o1 does not allow custom system messages, which may be due to security concerns.
Crafting the Prompt
Back to the subject at hand. We want to make any LLM behave like o1, equipping it with a hidden scratchpad so it can take its full time to think. Afterwards, the LLM provides a final answer based on its CoT.
Besides the formatting tricks, we also want to make sure that it carries out this contemplation in a skillful and efficient way. As OpenAI writes in their blog:
"It learns to recognize and correct its mistakes. It learns to break down tricky steps into simpler ones. It learns to try a different approach when the current one isn't working. This process dramatically improves the model's ability to reason."
So I'm going to break our instruction prompt into three parts. First, we use some example CoTs to illustrate the style of o1's inner thought. Then we list detailed requirements about reasoning length, format, and principles. Finally, a general guide to reasoning methods and strategies is provided to the model for reference. You can find the full text of our prompt here.
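As a rough illustration of this three-part structure (the variable names are mine, and the placeholder strings stand in for the actual texts discussed in the next three subsections), the full prompt can be assembled like this:

```python
# Hypothetical assembly of the O1 ICL prompt from its three parts.
# The strings below are placeholders, not the literal prompt text.
FEW_SHOT_EXAMPLES = "..."   # simplified CoTs adapted from OpenAI's blog examples
COT_REQUIREMENTS = "..."    # rules about reasoning length, style, and output format
THINKING_METHODS = "..."    # step-by-step, self-doubt, tree-like search, etc.

O1_ICL_PROMPT = "\n\n".join([
    "You will role-play a careful reasoner with a hidden inner monologue.",  # framing line (assumed)
    "Example inner monologues:\n" + FEW_SHOT_EXAMPLES,
    COT_REQUIREMENTS,
    THINKING_METHODS,
])
```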
Few-Shot Examples
The only example CoTs we have are those from OpenAI's blog post. Some of them are too long, so I picked three of them and made some simplifications.
Specifically, I took "Cipher", "Coding", and "Science" from their examples. The Cipher one was rewritten by 4o to be shorter and more focused on style. Here is one part of my example:
Hmm… That gives me three areas to investigate: dataset size, memory handling, and processing.
I’ll start by focusing on how the program reads the data.
Try 1: I’m thinking the issue might be that the program is loading too much data at once. It probably reads everything into memory.
But… wait a minute.
Shouldn’t that cause problems with smaller datasets too?
Why would it only crash with larger ones?
Something feels off here.
Let’s check the code.
Ah, it’s using `read()` to load the entire file into memory. That’s definitely part of the problem. Large files would overwhelm the memory. But... hmm.
Is that really the whole story?
Could it be something else?
Try 2: Let’s think about the rest of the process.
The program processes the data in a loop. But does it release memory during the process?
Nope, it doesn’t. It keeps everything in memory.
That’s probably it, right?
Holding all the data in memory would definitely cause a crash.
Wait… am I being too quick here?
Could I be missing something?
What else could it be?
CoT Requirement
There is one main idea behind our CoT requirements: make the reasoning as long as possible. I brainstormed some weird excuses to prevent the model from reaching a conclusion and stopping its reasoning too early.
The assistant will mimic the above human-like, small-talk inner monologue thought process in its <contemplator> section. The assistant must STRICTLY follow the style and reasoning guidelines:
- THE MOST IMPORTANT: Never try to make a conclusion. If the solution does not emerge from the evidence itself, DO NOT end contemplating. The <contemplator> has no output length constraint.
- Make the reasoning as long as possible, usually over 10,000 lines. Take full time to think, and never try to reach a conclusion before 500 lines of reasoning; instead, be free to plan and explore.
- THE SECOND MOST IMPORTANT: Always heavily self-doubt. The assistant is encouraged to completely abandon and refuse any approaches or conclusions it reached in previous reasoning steps.
- Every reasoning step should be short and simple, closely mimicking the mumbling and self-talking style.
- There's a 95% chance it won't solve the problem in the first 2,000 lines. If the assistant hasn't reasoned for 1,000 lines yet, keep going and think hard. Failure is possible, but admit defeat only after 150 attempts.
Finally, for the user's query, the assistant should reply in the following format:
<contemplator>(Thought process. Do not output anything before contemplator. The <contemplator> should always be the first output token.)</contemplator>
<final_answer>(Final answer to be visible to user)</final_answer>
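Since the final answer is wrapped in tags, a little post-processing is needed to hide the scratch pad from the user. A minimal sketch (the helper name is mine), assuming the model actually follows the format:

```python
import re

def split_reply(reply: str) -> tuple[str, str]:
    """Separate the hidden contemplation from the user-visible final answer.
    Falls back to returning the raw reply if the model ignored the format."""
    cot = re.search(r"<contemplator>(.*?)</contemplator>", reply, re.DOTALL)
    answer = re.search(r"<final_answer>(.*?)</final_answer>", reply, re.DOTALL)
    if answer is None:
        return "", reply.strip()
    return (cot.group(1).strip() if cot else ""), answer.group(1).strip()
```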
Thinking Methods
Finally, we give some practical suggestions on better reasoning. The models are told to first break hard problems into small ones, cautiously push deduction forward, or bravely overturn any previous conclusions.
- **Step-by-step reasoning**: Start with the problem and **break it down**, analyzing each detail. Clearly explain your reasoning at each step. Propose hypotheses and test them. If one fails (which is very likely), adjust it and keep exploring. **Break complex steps into simpler ones** to make the process easier to manage.
- **Thought jumps**: If new ideas arise, revisit earlier steps and explain why. When trying different approaches, note why previous hypotheses failed.
- **Heavy self-doubt**: Always assume previous steps contain flaws. Always try your best to spot errors in previous reasoning. NEVER BLINDLY AGREE WITH PREVIOUS REASONING. If a potential solution is found, try your hardest to negate it.
- **Tree-like path search**: The contemplating process mimics tree search. Throughout the contemplation, try to list a few potential options to try, then identify successful attempts as checkpoints or breakthrough points from which to establish further exploration. Always double-check before moving on.
- **Quantitative thinking**: If numbers are involved, break the problem into calculations. For symbols, try converting them to numerical representations.
Ablations
Good. Let's look at what we have done so far. We instruct our models to do the following things to act more like o1:
- Mapping out CoT with strategy
- Avoiding reaching conclusions too early and producing very long reasoning
- Self-doubting any conclusion or result they found and always double-checking
- Mimicking the style of o1's inner thought, using phrases like "hmm," "wait a minute," and self-questioning to advance reasoning
From the first figure, we can see that this approach actually works for many models, including models that are already good at reasoning, like Claude 3.5 Sonnet. But among those variables, which improves reasoning most effectively? Let's run some small experiments.
Which models benefit from the prompting?
It turns out that different models react very differently to our prompt. I found that the models whose performance increases are usually also ranked high on role-play benchmarks. Those models seem to be more comfortable with the special pattern of inner-thought reasoning, which I think may explain the differences.
| openai/gpt-4o-2024-08-06 | anthropic/claude-3.5-sonnet |
|---|---|
| To find $x$ such that $\lceil x \rceil + x = \frac{23}{7}$, we first let $x = n + f$, where $n = \lfloor x \rfloor$ is an integer and $f = x - n$ with $0 \leq f < 1$. The ceiling function is $\lceil x \rceil = n + 1$ if $f > 0$, and $\lceil x \rceil = n$ if $f = 0$. Substituting $x = n + f$ into the equation, $\lceil x \rceil + x = \begin{cases} n + 1 + n + f = 2n + 1 + f, & \text{if } f > 0 \\ n + n = 2n, & \text{if } f = 0 \end{cases} = \frac{23}{7}$. Examining the case $f > 0$: $2n + 1 + f = \frac{23}{7}$, so $f = \frac{23}{7} - (2n + 1)$. For this expression, $0 < f < 1$. Therefore, breaking it down: $\frac{23}{7} - 2n - 1 > 0 \Rightarrow \frac{23}{7} - 1 > 2n \Rightarrow \frac{16}{7} > 2n \Rightarrow n < \frac{8}{7}$. Since $n$ is an integer, $n \leq 1$. | Okay, let's think about this problem step by step. We have the equation $\lceil x \rceil + x = \frac{23}{7}$. First, what does $\lceil x \rceil$ mean? It's the ceiling function, which gives the smallest integer greater than or equal to x. So, $\lceil x \rceil$ is always an integer. Let's call this integer n. So $\lceil x \rceil = n$. Now our equation becomes: $n + x = \frac{23}{7}$ We know that n is the smallest integer greater than or equal to x. This means that $n-1 < x \leq n$ Let's subtract x from both sides of our equation: $n = \frac{23}{7} - x$ Now, let's think about the possible values for n. Since $n-1 < x \leq n$, we can say that $\frac{23}{7} - n < n \leq \frac{23}{7} - (n-1)$ Dividing by 2: $\frac{23}{14} < n \leq \frac{15}{7}$ Since n is an integer, the only value that satisfies this is n = 3. So, $\lceil x \rceil = 3$ Now, let's go back to our original equation: |
It can be seen that Sonnet has a better grasp of using inner thought to map out deduction.
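For reference (my own quick check, not part of either transcript), the problem has a short closed-form solution, which makes it easier to judge how far each CoT actually gets. Writing $n = \lceil x \rceil$ (if $x$ were an integer we would need $2x = \frac{23}{7}$, which is impossible), we have $x = \frac{23}{7} - n$ and $n - 1 < x \leq n$, so

$$
\begin{aligned}
n - 1 < \tfrac{23}{7} - n &\;\Rightarrow\; n < \tfrac{15}{7} \;\Rightarrow\; n \leq 2, \\
\tfrac{23}{7} - n \leq n &\;\Rightarrow\; n \geq \tfrac{23}{14} \;\Rightarrow\; n \geq 2,
\end{aligned}
$$

hence $n = 2$ and $x = \tfrac{23}{7} - 2 = \tfrac{9}{7}$, which indeed gives $\lceil \tfrac{9}{7} \rceil + \tfrac{9}{7} = 2 + \tfrac{9}{7} = \tfrac{23}{7}$.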
Which part of the prompt is the most helpful?
To answer this question, I removed different parts of the prompt one by one to see which elements were most important (a sketch of how these variants can be assembled follows the list).
- Step-by-Step: The basic way of thinking one step at a time.
- O1 ICL: The full set of instructions we use to make the AI think like o1.
- Short: Taking out the parts that tell the AI to think harder and longer. This reduction in guidance leads to a noticeable dip in accuracy, particularly for Hermes 3. Without encouragement to think deeper or longer, the models underperform.
- No Method: Leaving out the part about how to think through problems. Despite omitting problem-solving strategies, the models still perform well, especially Hermes 3, indicating that they can rely on intrinsic capabilities to a large extent.
- No Example: Not showing the AI any of the example CoTs taken from OpenAI's blog. Removing the examples results in a moderate drop in performance, although the models still maintain reasonably high accuracy compared to the other ablations.
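Here is the sketch mentioned above of how such ablation variants might be assembled. The section names and the Step-by-Step wording are placeholders of mine, not the exact texts used in the experiments:

```python
# Hypothetical construction of the ablation variants described above.
# The placeholder strings stand in for the real prompt sections.
FEW_SHOT_EXAMPLES = "..."  # example inner monologues
LENGTH_RULES = "..."       # "think longer/harder, never conclude early" requirements
THINKING_METHODS = "..."   # problem-solving strategies
OUTPUT_FORMAT = "..."      # <contemplator>/<final_answer> format spec (always kept)

def build_variant(drop: set[str]) -> str:
    """Assemble an O1 ICL prompt variant with some sections removed."""
    sections = {
        "examples": FEW_SHOT_EXAMPLES,
        "length": LENGTH_RULES,
        "methods": THINKING_METHODS,
    }
    kept = [text for name, text in sections.items() if name not in drop]
    return "\n\n".join(kept + [OUTPUT_FORMAT])

VARIANTS = {
    "O1 ICL": build_variant(set()),          # full prompt
    "Short": build_variant({"length"}),      # no "think harder/longer" rules
    "No Method": build_variant({"methods"}),
    "No Example": build_variant({"examples"}),
    "Step-by-Step": "Let's think step by step.",  # plain CoT baseline (assumed wording)
}
```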
Outro: Scaling Law of Test-Time Compute and o1
A final thought is whether reasoning length alone explains the increase in performance. Past works like STaR and test-time compute scaling show positive results: by using RL or process supervision, they show that increasing inference cost by several orders of magnitude can greatly help with reasoning abilities.
However, our prompt does not increase the output budget by nearly that much, and different models respond to the elongated output very differently. This suggests that, beyond simply spitting out more tokens, there is more we need to do to make reasoning truly scalable.
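One rough way to probe this at home is to count the tokens in the contemplation section and compare average CoT length against accuracy for each model. A minimal sketch, assuming the `split_reply` helper from earlier and an approximate tokenizer:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # approximate; exact token counts differ per model

def cot_tokens(reply: str) -> int:
    """Rough length of the hidden contemplation, in tokens."""
    cot, _ = split_reply(reply)  # helper from the earlier parsing sketch
    return len(enc.encode(cot))

def summarize(results: list[tuple[str, bool]]) -> tuple[float, float]:
    """Given (reply, is_correct) pairs for one model, return (mean CoT length, accuracy)."""
    lengths = [cot_tokens(reply) for reply, _ in results]
    accuracy = sum(ok for _, ok in results) / len(results)
    return sum(lengths) / len(lengths), accuracy
```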
Nevertheless, our little experiments reveal the potential of these open-source models. They may already contain some degree of intrinsic deep-reasoning ability, just waiting for the community to discover.