Gemini seems to be a bit of a mixed bag for coding.
The article claims Gemini is acing the Aider Polyglot benchmark. At the moment that's the only benchmark that really matters to me, because Aider is actually a useful tool and performance on it translates directly into real-world impact (although Claude Code is even better). Look closely, though: Gemini tops only the "percent correct" category, not "percent correct using the right edit format". Cost is marked as ? because pricing isn't fully available yet (I think?). Not emitting the correct edit format is pretty useless, because it means the changes won't apply and the tool has to try again.
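For context on what that metric means: Aider applies model output as SEARCH/REPLACE edit blocks, so a well-formed edit looks roughly like this (file name and code invented for illustration):

    greeting.py
    <<<<<<< SEARCH
    def greet():
        print("helo")
    =======
    def greet():
        print("hello")
    >>>>>>> REPLACE

If the model mangles those markers, or the SEARCH text doesn't exactly match what's in the file, the edit can't be applied and Aider has to re-prompt the model.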
Claude, in contrast, almost never gets the format wrong: it's at 97%+ in the benchmark, and in my experience it's ~100% in practice. This tracks: Claude is really good at following instructions. Gemini is around 90%. That makes a big difference to how frustrating a tool is to use in practice.
They might get that fixed, but my experience has been that Google's models are consistently much more likely to refuse instructions for dumb reasons. Google is the company with by far the biggest purity-spiral problem, and it shows up in their output even on apparently ordinary tasks.
I'm also concerned by this event: https://news.sky.com/story/googles-ai-chatbot-gemini-tells-u...
Given how obsessed Google claimed to be with AI safety, I expected an SRE-style postmortem after that, and there was bupkis. An AI that can suffer a psychotic break out of nowhere like that is one I wouldn't trust unless it's behind a very strong sandbox and supervised very closely, but none of the AI tools today offer much in the way of sandboxing.
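If you want that today you mostly have to roll it yourself. A crude sketch is just running the agent in a throwaway, network-less Docker container (the image and mount point here are my choices, adjust to taste):

    docker run --rm -it --network none -v "$PWD":/work -w /work python:3.12 bash

That at least stops a misbehaving agent from touching the network or anything outside the project directory, though it's nowhere near real supervision.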