I published some notes and opinions on this paper here: https://simonwillison.net/2025/Apr/30/criticism-of-the-chatb...

Short version: the thing I care most about in this paper is that well-funded vendors can apparently submit dozens of variations of their models to the leaderboard and then selectively publish the one that did best.

This gives them a huge advantage. I want to know if they did that. A top-placed model with a footnote saying "they tried 22 variants, most of which scored lower than this one" helps me understand what's going on.

If the top model tried 22 times and scored lower on 21 of those tries, whereas the model in second place only tried once, I'd like to hear about it.
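
To make the selection effect concrete, here is a rough simulation sketch. It is not from the paper: it assumes every variant has the same true skill and that each leaderboard score is that skill plus independent Gaussian noise. The rating of 1200 and the noise level of 15 points are made-up numbers, and `best_of` and `simulate` are hypothetical helper names.

```python
import random
import statistics


def best_of(n, true_skill, noise_sd, rng):
    """Highest observed score among n equally skilled variants.

    Assumption: each observed score is the true skill plus Gaussian noise.
    """
    return max(rng.gauss(true_skill, noise_sd) for _ in range(n))


def simulate(trials=10_000, variants=22, true_skill=1200.0, noise_sd=15.0, seed=0):
    """Compare a vendor that submits once with one that keeps its best of `variants`.

    All numbers here are illustrative, not taken from any real leaderboard.
    """
    rng = random.Random(seed)
    single = [best_of(1, true_skill, noise_sd, rng) for _ in range(trials)]
    best = [best_of(variants, true_skill, noise_sd, rng) for _ in range(trials)]
    return statistics.mean(single), statistics.mean(best)


single_mean, best_mean = simulate()
print(f"average score, one submission:  {single_mean:.1f}")
print(f"average score, best of 22:      {best_mean:.1f}")
```

Under these assumptions the best-of-22 vendor ends up with a visibly higher published score even though none of its variants is actually any better, which is why the number of attempts is worth disclosing.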