Human Performance Test

New study shows AI isn’t ready for office work

Mercor’s APEX-Agents benchmark finds top AI models score under 25% accuracy on realistic consulting, legal, and finance tasks ...

Google researchers have discovered that AI reasoning models like DeepSeek-R1 and QwQ-32B simulate internal debates between ...
