The questions on Humanity’s Last Exam went through a two-step filtering process. First, submitted questions were given to leading A.I. models to solve.
If the models couldn’t answer them (or, in the case of multiple-choice questions, if the models did worse than random guessing), the questions were given to a set of human reviewers, who refined them and verified the correct answers. Experts who wrote top-rated questions were paid between $500 and $5,000 per question, and received credit for contributing to the exam.
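In code, that first filtering step amounts to a simple rule. The sketch below is a hypothetical illustration only: the data structure, field names and function are ours, not the researchers’ actual code, but they capture the stated criterion that a question advances to human review only if every model fails it (or, for multiple choice, scores below chance).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelResult:
    """One A.I. model's performance on a submitted question (illustrative structure)."""
    answered_correctly: bool                  # open-ended questions: did the model get it right?
    choice_accuracy: Optional[float] = None   # multiple choice: accuracy over repeated attempts


def advances_to_review(results: list[ModelResult], num_choices: Optional[int] = None) -> bool:
    """Return True if the question stumps every model and should go to human reviewers.

    Multiple-choice questions advance only if every model does worse than
    random guessing, i.e., accuracy below 1 / num_choices.
    """
    for result in results:
        if num_choices is not None and result.choice_accuracy is not None:
            if result.choice_accuracy >= 1.0 / num_choices:
                return False  # a model beat random guessing; question is filtered out
        elif result.answered_correctly:
            return False      # a model answered the open-ended question; filtered out
    return True               # every model failed; send to human reviewers


# Example: a four-option question that two models both answer below chance.
results = [ModelResult(False, choice_accuracy=0.10),
           ModelResult(False, choice_accuracy=0.20)]
print(advances_to_review(results, num_choices=4))  # True -> goes on to human review
```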
Mr. Hendrycks, who helped create a widely used A.I. test known as Massive Multitask Language Understanding, or M.M.L.U., said he was inspired to create harder A.I. tests by a conversation with Elon Musk. (Mr. Hendrycks is also a safety advisor to Mr. Musk’s A.I. company, xAI.) Mr. Musk, he said, raised concerns about the existing tests given to A.I. models, which he thought were too easy.
Once the list of questions had been compiled, the researchers gave Humanity’s Last Exam to six leading A.I. models, including Google’s Gemini 1.5 Pro and Anthropic’s Claude 3.5 Sonnet. All of them failed miserably. OpenAI’s o1 system performed best of the bunch, scoring 8.3 percent.
Mr. Zhou, the theoretical particle physics researcher who submitted questions to Humanity’s Last Exam, told me that while A.I. models were often impressive at answering complex questions, he didn’t consider them a threat to him or his colleagues, because their jobs involve much more than spitting out correct answers.
“There’s a big gulf between what it means to take an exam and what it means to be a practicing physicist and researcher,” he said. “Even an A.I. that can answer these questions might not be ready to help in research, which is inherently less structured.”