Key takeaway: brainstorm in one chat, execute in another.
Humans use conversation to clarify ideas, test assumptions, and find better options. That is one reason chatting with AI feels so natural. It matches how we already think and work, and often it is genuinely useful.
But sometimes those chats become infuriatingly cumbersome. A conversation that starts out helpful slowly turns mushy. The model seems engaged, productive, even confident, but the result gets worse instead of better.
The reason is simple: chatting is natural for humans, but it is not always the best format for the model.
Recent research suggests that LLMs can get “lost in conversation”. As the conversation unfolds, the model becomes less and less reliable while still seeming helpful on the surface, and this can happen surprisingly quickly.
In practice, the model may:
- make assumptions too early,
- cling to earlier wrong assumptions,
- give final answers before the task is fully clear,
- fill in missing details with confident guesswork,
- and keep building on mistakes instead of correcting course.
Usually it is a combination of these, and the effect is cumulative: a few bad calls early on compound as the conversation continues. That is why a chat can feel productive while the output quietly gets worse.
A practical way to work around this
This is not a miracle cure, but it is a good practical workaround: use two chats, one for brainstorming and one for getting to the end result.
Here is a four-step path to follow.
1. Start an exploration chat
Use the first chat to think out loud, discuss options, refine the spec, and work out what you actually want. This is the brainstorming phase.
2. Ask for a structured summary
Then ask the model for a clean summary of:
- goals,
- requirements,
- constraints,
- assumptions,
- open questions,
- and decisions already made.
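The exact wording is up to you. One possible phrasing, purely as a starting point:

```
Summarise this conversation as a brief I can paste into a new chat.
Include: goals, requirements, constraints, assumptions, open questions,
and decisions already made. Leave out dead ends and discarded ideas,
and flag anything that is still ambiguous.
```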
3. Review the summary yourself
This is important. Check for missing details, contradictions, vague wording, or accidental assumptions. Remove things that were only part of the exploration. Add the bits the model did not capture properly. Tighten anything that still sounds a bit hand-wavy.
4. Start a fresh chat with the cleaned summary
Now the model gets a crisp brief instead of a tangle of half-formed thoughts, dead ends, and historical baggage.
This tends to work much better for coding, planning, analysis, and other specification-heavy tasks where accuracy matters more than conversational momentum.
In other words: explore your topic in one chat, then start a fresh one for execution with a honed brief.
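If you work through an API rather than a chat window, the same pattern is easy to script. Below is a minimal sketch using the OpenAI Python SDK; the model name, the prompts, and the example task are illustrative assumptions, not recommendations.

```python
from openai import OpenAI

client = OpenAI()          # assumes OPENAI_API_KEY is set in the environment
MODEL = "gpt-4o"           # illustrative; any capable chat model will do

# --- Chat 1: exploration ---------------------------------------------------
exploration = [{"role": "user", "content":
                "I want a tool that deduplicates customer records. "
                "Help me think through the options."}]
reply = client.chat.completions.create(model=MODEL, messages=exploration)
exploration.append({"role": "assistant",
                    "content": reply.choices[0].message.content})
# ... further back-and-forth turns get appended here as the idea takes shape.

# --- Chat 1, final turn: ask for the structured brief ----------------------
exploration.append({"role": "user", "content":
                    "Summarise this conversation as a brief: goals, "
                    "requirements, constraints, assumptions, open questions, "
                    "and decisions already made. Leave out dead ends."})
summary = client.chat.completions.create(model=MODEL, messages=exploration)
brief = summary.choices[0].message.content

# --- Step 3 happens outside the code: a human reviews and edits the brief --
reviewed_brief = brief     # placeholder for your hand-edited version

# --- Chat 2: execution starts clean, seeing only the reviewed brief --------
execution = [{"role": "user", "content": reviewed_brief}]
result = client.chat.completions.create(model=MODEL, messages=execution)
print(result.choices[0].message.content)
```

The important detail is the last step: the execution chat never sees the exploration history, only the brief you reviewed.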
Why this works: the research behind it
Most LLM evaluation still happens in a single-turn setup, where the model receives a complete, fully specified instruction and is scored on the result. That is a tidy laboratory setting.
Real use is messier. People start with a rough idea, add constraints later, revise the brief as they go, and clarify things through conversation.
A recent paper suggests that models become much less reliable when the brief emerges piece by piece across several messages. Giving the model a complete brief upfront can yield much more reliable results than even a short back-and-forth.
When the researchers split the instruction into smaller pieces and revealed them gradually across multiple messages, the effect was not subtle. Performance dropped sharply, and so did reliability. The model may still be capable of solving the task, but once it takes a wrong turn in a gradually clarified conversation, it often fails to recover.
That is the important point. The problem is not only lower average performance; it is lower reliability. The same task, delivered the same way across turns, can succeed on one attempt and derail on the next.
Not just a wording issue
One of the more interesting parts of the paper is that the researchers checked whether this was simply a wording problem.
They recombined the same information into one message and tested that too. Performance stayed much closer to the single-turn baseline. So the issue is not just phrasing. It is that the brief unfolds gradually through conversation.
That matters because it suggests the issue is deeper than just “write better prompts”. It is about how the task is revealed and handled over time.
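To make the distinction concrete, here is a rough sketch of the three conditions in the familiar chat-message format. The task wording is invented for illustration; what matters is the shape of the message lists.

```python
# The same brief, delivered three ways. The task text is invented;
# only the structure of the message lists matters here.

# 1. Single turn: the complete brief arrives at once.
full = [{"role": "user", "content":
         "Write a Python function that parses ISO-8601 dates, "
         "rejects years before 1900, and returns a UTC datetime."}]

# 2. Gradual: the same brief drips in across turns, with model
#    replies (and early guesses) sandwiched in between.
gradual = [
    {"role": "user", "content": "Write a function that parses dates."},
    {"role": "assistant", "content": "..."},   # may already be guessing
    {"role": "user", "content": "It should accept ISO-8601 input."},
    {"role": "assistant", "content": "..."},
    {"role": "user", "content": "Reject years before 1900; return UTC."},
]

# 3. Recombined: the same pieces concatenated back into one message.
#    This is the variant that scored close to the single-turn baseline,
#    which is why the problem is not just phrasing.
pieces = [m["content"] for m in gradual if m["role"] == "user"]
recombined = [{"role": "user", "content": " ".join(pieces)}]
```

Seen this way, the two-chat workflow above is simply a manual method of converting case 2 into case 1 before the real work starts.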
Not just a long-chat problem
Nor is this only a problem in very long chats.
The paper suggests the reliability problem can appear very early, as soon as the task is clarified gradually over just a few messages.
So this is less about “too much context” and more about incremental specification. The brief emerges gradually, the model starts filling in the gaps too early, and once it drifts off course, it often struggles to recover.
Which, inconveniently, is exactly how many people use LLMs.
The practical lesson
LLMs are excellent for exploration. They can help you think, compare options, test ideas, and shape a problem, but that does not mean the same conversation is the best place to execute the final task.
If the goal is a reliable output, especially for coding, planning, analysis, or other tasks where details matter, it often helps to separate discovery from execution.
Use the first chat to figure out what you mean.
Use the second chat to do the work.
That is often the difference between “quite clever” and “actually useful”. For anyone who wants the research behind this, the paper is worth a read.