In this case, from GPT-4o to Llama 3.3 70B

January 11, 2025

In this case, from GPT-4o to Llama 3.3 70B.

But I have definitely had similar experiences moving between all of the SOTA models. Claude, Gemini Flash, GPT-4o, and Llama 3.3 all have different strengths, quirks, and prompt techniques that they respond to best.

I find the differences to be particularly apparent with:
- function calling
- asking the model to adhere to a specific output format
- asking the model to respond at a particular length or level of detail (while retaining some flexibility to adapt to what the user might expect in the context of a specific conversation)
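One way to keep these differences visible while iterating is to run the same prompt through each model and check the reply against the format you asked for. Here is a minimal sketch, assuming each model is wrapped in a hypothetical callable that takes a prompt string and returns text. The names `follows_format`, `check_models`, and the stub replies are invented for illustration and are not real model output.

```python
import json
from typing import Callable, Dict


def follows_format(raw: str) -> bool:
    """Return True if raw is a bare JSON object with a string "answer" field."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        # Any preamble or trailing chatter around the JSON lands here.
        return False
    return isinstance(data, dict) and isinstance(data.get("answer"), str)


def check_models(prompt: str, models: Dict[str, Callable[[str], str]]) -> Dict[str, bool]:
    """Send the same prompt to every model and record whether the format held."""
    return {name: follows_format(call(prompt)) for name, call in models.items()}


# Hypothetical stand-ins for real API clients (assumption: swap in your own).
stub_models = {
    "model_a": lambda prompt: '{"answer": "Paris"}',
    "model_b": lambda prompt: 'Sure! Here is the JSON:\n{"answer": "Paris"}',
    "model_c": lambda prompt: '{"answer": 42}',
}

results = check_models('Reply with JSON: {"answer": <string>}', stub_models)
```

With the stubs above, only `model_a` passes. The other two fail in ways that are easy to miss when you only test a prompt against one model: a conversational preamble, and a field with the wrong type.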

Here's an example of a complex prompt that took a lot of iteration before it worked pretty well with Gemini Flash, GPT-4o, and Claude (h/t to @mark_backman for this binary classifier prompt for conversational turn detection).

Full prompt: ^[2]
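When a prompt like this moves between models, the code that reads the classifier's reply often needs as much care as the prompt itself. Here is a rough sketch, assuming the classifier is instructed to answer with the single word YES or NO. That is an assumption for illustration; check the linked prompt for the actual contract. The helper name `parse_turn_decision` is hypothetical.

```python
import re
from typing import Optional


def parse_turn_decision(raw: str) -> Optional[bool]:
    """Map a classifier reply to True (turn complete), False (incomplete),
    or None when the reply is not a clean one-word answer."""
    # Drop whitespace, punctuation, and case so "YES." and " yes\n" both match.
    token = re.sub(r"[^a-z]", "", raw.strip().lower())
    if token == "yes":
        return True
    if token == "no":
        return False
    # Anything longer (explanations, hedging) counts as a format failure.
    return None


decisions = [parse_turn_decision(r) for r in ["YES", " no\n", "Yes.", "Yes, the user is done."]]
```

Returning `None` instead of guessing keeps verbose or hedged replies visible as failures, which is useful when comparing how strictly different models stick to the one-word format.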

1. https://x.com/td_quang/status/1878177964204114061
2. https://github.com/pipecat-ai/pipecat/blob/main/examples/foundational/22c-natural-conversation-mixed-llms.py#L57