Hacker News

Does anyone know what this "APEX-Agents benchmark for long time horizon investment banking, consulting and legal work" actually evaluates?

That sounds so broad that creating a meaningful benchmark is probably as difficult as creating an AI that actually "solves" those domains.


