OpenAI Introduces GDPval Benchmark to Compare AI with Human Professionals


OpenAI has launched a new benchmark, GDPval, designed to measure how well AI models perform against human experts across major industries. The test is part of OpenAI’s long-term effort to track progress toward artificial general intelligence (AGI), defined as AI capable of handling economically valuable work at or above human level.

According to the company, its GPT-5 model and Anthropic’s Claude Opus 4.1 are already producing work close to the quality of seasoned professionals. However, OpenAI stresses this doesn’t mean AI is ready to replace people in their jobs. GDPval currently evaluates only a narrow set of tasks, primarily written reports, rather than the full scope of professional responsibilities.

How GDPval Works

The benchmark focuses on nine industries that make up a large share of the U.S. economy — including healthcare, finance, manufacturing, and government. It tests AI performance in 44 job roles, from software engineers to nurses to journalists.

In the first version, GDPval-v0, professionals were asked to review reports generated by both AI and humans, then choose which was better. For example, investment bankers compared AI-generated competitor analyses to those written by other bankers. AI models were then scored on their “win rate” against human work.

  • GPT-5-high (a stronger version of GPT-5) matched or beat industry experts in 40.6% of cases.
  • Claude Opus 4.1 achieved 49%, though OpenAI noted its strong visual presentation may have boosted results more than actual substance.
  • By contrast, GPT-4o, released about 15 months earlier, scored only 13.7%, showing rapid progress.
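The "win rate" scoring described above can be sketched in a few lines. OpenAI hasn't published the exact formula, so this is a minimal illustration assuming that ties count toward a model's "matched or beat" total (the function name and judgment labels are hypothetical):

```python
from collections import Counter

def win_rate(judgments):
    """Fraction of blind pairwise comparisons in which the model's
    deliverable was rated as good as or better than the expert's.

    Each judgment is "model", "expert", or "tie"; ties are assumed
    to count toward the model's "matched or beat" score.
    """
    counts = Counter(judgments)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no judgments to score")
    return (counts["model"] + counts["tie"]) / total

# Example: 3 model wins, 1 tie, 6 expert wins out of 10 comparisons
print(win_rate(["model"] * 3 + ["tie"] + ["expert"] * 6))  # → 0.4
```

Under this reading, GPT-5-high's 40.6% means graders preferred (or rated as equal) its report in roughly four out of ten blind comparisons against an expert's work.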

What It Means

OpenAI’s chief economist Dr. Aaron Chatterji said the results suggest professionals can use AI to offload routine work and focus on higher-value tasks. Tejal Patwardhan, who leads OpenAI’s evaluation team, added that the speed of improvement indicates further gains are likely.

Still, GDPval has limits. Since it mainly tests report-writing, it doesn’t capture the broader skills required in most jobs. OpenAI says it plans to expand future versions to cover more industries and interactive workflows.

Why It Matters

Traditional AI benchmarks like AIME 2025 (math problem-solving) and GPQA Diamond (PhD-level science questions) are nearing saturation, meaning models are already close to maxing them out. GDPval offers a new way to evaluate whether AI is becoming truly useful in real-world economic contexts.

For now, OpenAI sees GDPval as an early but meaningful step in proving AI’s value across industries — though much more comprehensive testing will be needed before declaring that AI consistently outperforms humans.
