An alternative possibility: Anthropic ships a 200ms TTFT voice-to-voice model…

November 11, 2025

An alternative possibility: Anthropic ships a 200ms TTFT voice-to-voice model that is as "smart" as Sonnet 4.5 and this unlocks so many new use cases that all of us who build tooling and applications higher up the stack scramble to build stuff that leverages the model.

The thesis here is that we will always want to do "20% more" than the best available model is capable of natively. You get that extra 20% by:

- Context engineering. Writing a non-trivial amount of code to give the model the most useful tokens every inference call. State machines. Parallel inference loops that compress/summarize/enrich the context.
- Deeper integration with other systems. Long-running processes that are event-driven. Smart discovery of rich tools. Memory. Sandboxes for code execution. Big, hairy, bridges into legacy infrastructure and all the testing and observability tooling that makes it possible to ship natural language interfaces to deterministic systems at scale.
- Multi-model/multi-agent orchestration. 200ms TTFT is *great*, but way too slow for realtime vision. Compliance often requires de-identification before shipping data over the network. I've got a whole text agent stack with RAG and agent loops that I built for my enterprise and now I want to "wrap" it with a voice I/O channel. I trained a medium-sized open weights model on my proprietary data, and I need to use it in conjunction with a SOTA model.

There's so much to build. Better models make it easier to build the "easy" things without much extra work/code/harness development. But we keep moving the bar upwards on what we want to build.

Pooja Nagpal@PoojaIsNagpal

Hypothetical:
1. Anthropic ships a sub 200ms latency voice model with native tool calling
2. the entire category of customer support / voice agents for "x" collapses (decagon, cresta, eleven, vapi, giga).
3. Zendesk / Intercom/ Salesforce do basic integrations.

Ledgers like