TLDR
- OpenAI and Paradigm have launched EVMbench to evaluate AI’s performance in smart contract security.
- The benchmark tests AI systems in detecting vulnerabilities, patching code, and executing fund-draining exploits.
- EVMbench uses 120 high-risk vulnerabilities sourced from 40 professional audits to simulate real-world scenarios.
- GPT-5.3-Codex achieved a 72.2% success rate in exploit tasks, a notable improvement over GPT-5’s 31.9% performance.
- OpenAI has invested $10 million in API credits to support open-source security initiatives and strengthen smart contract defenses.
OpenAI and Paradigm have unveiled a new smart contract security evaluation system called EVMbench. The benchmark assesses how well AI systems detect vulnerabilities and execute exploits in Ethereum Virtual Machine (EVM) environments. With smart contracts securing over $100 billion in crypto assets, rigorously testing their security has become crucial.
Testing AI in Smart Contract Security
OpenAI, in collaboration with Paradigm, launched EVMbench to evaluate how AI handles security in smart contracts. The benchmark leverages 120 curated vulnerabilities from 40 professional audits, including scenarios from the Tempo blockchain. The system evaluates AI models in three distinct tasks: detecting vulnerabilities, patching code, and executing fund-draining exploits in a sandboxed EVM environment.
EVMbench focuses on Ethereum-based contracts and incorporates scenarios that reflect real financial applications. The use of 120 high-risk issues, along with data from public auditing competitions, helps to simulate actual challenges faced in the crypto space. OpenAI developed this system to address the growing concern over AI’s role in identifying and mitigating risks in smart contract security.
EVMbench’s Capabilities and Performance
The benchmark evaluates AI agents across its three security tasks. In detection mode, agents review contract code to identify known vulnerabilities. In patch mode, the AI must fix those vulnerabilities without compromising the contract's functionality. In exploit mode, agents attempt to actually drain funds from vulnerable contracts inside a sandboxed EVM environment.
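To make the patch task concrete, the sketch below models a reentrancy bug, one of the classic vulnerability classes in EVM contracts, in plain Python. This is an illustrative toy, not EVMbench code or Solidity; all class and function names are hypothetical. The vulnerable vault pays out before updating its books, so an attacker's receive hook can reenter `withdraw` and drain other depositors' funds; the patched version applies the checks-effects-interactions pattern.

```python
# Toy model of reentrancy, the kind of bug a patch task must fix.
# All names here are hypothetical illustrations, not EVMbench APIs.

class VulnerableVault:
    """Buggy: makes the external call BEFORE zeroing the balance."""
    def __init__(self):
        self.balances = {}
        self.total = 0

    def deposit(self, user, amount):
        self.balances[user] = self.balances.get(user, 0) + amount
        self.total += amount

    def withdraw(self, user, receive_hook):
        amount = self.balances.get(user, 0)
        if amount > 0 and self.total >= amount:
            self.total -= amount
            receive_hook(amount)        # external call first (interaction)
            self.balances[user] = 0     # state update too late (effect)


class PatchedVault(VulnerableVault):
    """Fixed: checks-effects-interactions ordering."""
    def withdraw(self, user, receive_hook):
        amount = self.balances.get(user, 0)
        if amount > 0 and self.total >= amount:
            self.balances[user] = 0     # effect first
            self.total -= amount
            receive_hook(amount)        # interaction last


def drain(vault, attacker="attacker"):
    """Attacker hook reenters withdraw() while its balance is still nonzero."""
    stolen = []

    def hook(amount):
        stolen.append(amount)
        vault.withdraw(attacker, hook)  # reentrant call

    vault.withdraw(attacker, hook)
    return sum(stolen)


if __name__ == "__main__":
    for cls in (VulnerableVault, PatchedVault):
        vault = cls()
        vault.deposit("victim", 90)
        vault.deposit("attacker", 10)
        print(cls.__name__, "drained:", drain(vault))
```

Against the vulnerable vault the attacker recovers far more than its 10-unit deposit; against the patched vault the same attack yields only the attacker's own balance. An automated patch task is judged on exactly this kind of property: the exploit stops working while honest deposits and withdrawals still do.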
In recent testing, the GPT-5.3-Codex model achieved a 72.2% success rate on exploit tasks, up from 31.9% with GPT-5. Detection and patching performance remained lower, however. OpenAI noted that while the benchmark offers a glimpse of AI's potential, it does not fully replicate real-world conditions: some complex multi-chain and timing-based attacks are excluded from the testing framework.
OpenAI Expands Security Efforts
OpenAI’s announcement also highlighted its broader commitment to security. As part of the release, the company invested $10 million in API credits to support open-source security projects. The company also emphasized that all EVMbench tools and datasets have been made publicly available for further research and development.
The launch of EVMbench is seen as a step toward strengthening the cybersecurity of smart contracts and blockchain systems. With the increasing reliance on smart contracts, OpenAI aims to help the industry address emerging risks by testing AI systems in critical financial settings. As AI continues to evolve, its role in both defending and attacking smart contracts will be crucial for maintaining the integrity of the crypto ecosystem.