January 11, 2025
In this case, from GPT-4o to Llama 3.3 70B.
But I have definitely had similar experiences moving between all of the SOTA models. Claude, Gemini Flash, GPT-4o, and Llama 3.3 all have different strengths, quirks, and prompt techniques that they respond to best.
I find the differences to be particularly apparent with:
- function calling
- asking the model to adhere to a specific output format
- asking the model to respond at a particular length or level of detail (while retaining some flexibility to adapt to what the user might expect in the context of a specific conversation)
Here's an example of a complex prompt that took a lot of iteration to get Gemini Flash, GPT-4o, and Claude to all work pretty well with (h/t to @mark_backman for this binary classifier prompt for conversational turn detection).
Full prompt: