August 31, 2025
For the things I do (voice agents / realtime multi-turn / multi-step function calling harnesses), needle-in-a-haystack benchmarks don’t map at all to real-world long-context performance.
I care almost entirely about:
- Can you combine your system instructions with the last 30 turns or so of information to “understand the assignment” every turn?
- Can you reliably call functions (with the correct arguments) deep into a multi-turn conversation?
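A rough sketch of how the second check could be scored, assuming you already log the expected and actual tool call for each turn of a test conversation. The names here (`ToolCall`, `TurnResult`, `accuracy_by_depth`) are made up for illustration and are not part of any SDK; exact-match on arguments is also an assumption and may be too strict for free-text fields.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToolCall:
    """Hypothetical record of one function call: name plus arguments."""
    name: str
    arguments: dict


@dataclass
class TurnResult:
    """Hypothetical log entry for one turn of a scripted conversation."""
    turn: int                    # 1-based turn index
    expected: ToolCall           # the call the harness wanted
    actual: Optional[ToolCall]   # the call the model made, or None if it made none


def call_is_correct(expected: ToolCall, actual: Optional[ToolCall]) -> bool:
    # Assumption: a call counts only if both name and arguments match exactly.
    return (
        actual is not None
        and actual.name == expected.name
        and actual.arguments == expected.arguments
    )


def accuracy_by_depth(results, bucket_size=10):
    """Group turns into depth buckets and return the accuracy for each bucket.

    Keys are (first_turn, last_turn) tuples, values are fractions in [0, 1].
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in results:
        bucket = (r.turn - 1) // bucket_size
        totals[bucket] += 1
        if call_is_correct(r.expected, r.actual):
            hits[bucket] += 1
    return {
        (b * bucket_size + 1, (b + 1) * bucket_size): hits[b] / totals[b]
        for b in sorted(totals)
    }
```

Comparing the early buckets against the late ones is the number that matters here: a model can look fine on turns 1 to 10 and still fall apart by turn 30.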
Generally, for non-trivial instruction complexity, today’s models are not great in multi-turn contexts compared to “normal” use.
So … context engineering, amiright @dexhorthy?
After 300k tokens, GPT-5’s tool calling seems to be pretty much gone. It starts writing Python scripts because its repeated edit-file calls keep failing.