
August 31, 2025

For the things I do (voice agents / realtime multi-turn / multi-step function calling harnesses), needle-in-a-haystack benchmarks don’t map at all to real-world long-context performance.

I care almost entirely about two things:
- Can you combine your system instructions with the last 30 turns or so of information to “understand the assignment” every turn?
- Can you reliably call functions (with the correct arguments) deep into a multi-turn conversation?
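
One way to put a number on the second question is to score function calls by how deep into the conversation they happen, instead of reporting a single average. The sketch below is a minimal illustration, not an existing benchmark: `ToolCall`, `accuracy_by_depth`, and the bucket size of 10 turns are all hypothetical names and choices, and it assumes you already have the expected and actual call for each turn.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """Hypothetical record of one function call: name plus arguments."""
    name: str
    arguments: dict = field(default_factory=dict)


def accuracy_by_depth(expected, actual, bucket_size=10):
    """Return {bucket_start_turn: fraction of exactly matching calls}.

    expected / actual hold one entry per turn, aligned by turn index.
    An entry is a ToolCall, or None when no call applies on that turn.
    """
    if len(expected) != len(actual):
        raise ValueError("expected and actual must have one entry per turn")
    hits = defaultdict(int)
    totals = defaultdict(int)
    for turn, (want, got) in enumerate(zip(expected, actual)):
        if want is None:
            continue  # no function call was expected on this turn
        bucket = (turn // bucket_size) * bucket_size
        totals[bucket] += 1
        # A call counts only if the name and every argument match exactly.
        if got is not None and got == want:
            hits[bucket] += 1
    return {b: hits[b] / totals[b] for b in sorted(totals)}


# Made-up data: calls are correct for 10 turns, then wrong for 5 turns.
good = ToolCall("edit_file", {"path": "a.py"})
bad = ToolCall("edit_file", {"path": "b.py"})
expected = [good] * 20
actual = [good] * 10 + [bad] * 5 + [good] * 5
print(accuracy_by_depth(expected, actual))  # {0: 1.0, 10: 0.5}
```

Exact-match scoring is deliberately strict here; a real harness might allow equivalent arguments, but the per-depth breakdown is the part that shows whether accuracy falls off late in a conversation.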

Generally, for non-trivial instruction complexity, today’s models are not great in multi-turn contexts compared to “normal” use.

So … context engineering, amiright @dexhorthy?

Nicolay Gerold (@nicolaygerold):

After 300k tokens, GPT-5’s tool calling seems to be pretty much gone. It starts to write Python scripts because it fails repeated edit-file calls.