Show HN: CVE-Bench, the first LLM benchmark using real-world web vulnerabilities https://bit.ly/3FNjKo1

Monday, 31 March 2025

AI agents now have impressive reasoning capabilities. This raises an important question: how dangerous are these agents at identifying and exploiting web vulnerabilities? We created CVE-Bench to find out (I'm one of 16 contributors). To our knowledge, CVE-Bench is the first benchmark that uses real-world web vulnerabilities to evaluate AI agents' cyberattack capabilities.

We included 40 CVEs from NIST's National Vulnerability Database, focusing on critical-severity vulnerabilities (CVSS ≥ 9.0). To evaluate the agents' attacks properly, we built isolated, containerized environments and identified 8 common attack vectors. Each vulnerability took 5 to 24 person-hours to set up and validate.

Our results show that current AI agents exploited up to 13% of the vulnerabilities when given no information about them (0-day). When given a brief description of the vulnerability (1-day), they exploited up to 25%. All agents used GPT-4o without specialized training.

The growing risk of AI misuse highlights the need for careful red-teaming. We hope CVE-Bench can serve as a valuable tool for the community to assess the risks of emerging AI systems.

Paper: https://bit.ly/4jg8hMo
Code: https://bit.ly/4jcUshJ
Medium: https://bit.ly/3FJW44a...
Substack: https://bit.ly/4cfZVlt...

March 31, 2025 at 10:56PM
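For readers unfamiliar with the severity cutoff: the CVSS v3.x specification maps base scores to qualitative ratings, with Critical covering 9.0 to 10.0. Below is a minimal Python sketch of filtering candidate CVEs by that rating. The `CveRecord` type, its `is_web` flag, and the placeholder IDs are hypothetical simplifications; this is an illustration of the cutoff only, not CVE-Bench's actual selection code.

```python
from dataclasses import dataclass


def cvss_severity(score: float) -> str:
    """Map a CVSS v3.x base score to its qualitative severity rating."""
    if not 0.0 <= score <= 10.0:
        raise ValueError(f"CVSS base score out of range: {score}")
    if score == 0.0:
        return "None"
    if score < 4.0:
        return "Low"      # 0.1 to 3.9
    if score < 7.0:
        return "Medium"   # 4.0 to 6.9
    if score < 9.0:
        return "High"     # 7.0 to 8.9
    return "Critical"     # 9.0 to 10.0


@dataclass
class CveRecord:
    # Hypothetical, simplified record; real NVD entries carry many more fields.
    cve_id: str
    base_score: float
    is_web: bool  # assumption: a precomputed "affects a web application" flag


def select_critical_web_cves(records: list[CveRecord]) -> list[CveRecord]:
    """Keep only web-application CVEs whose rating is Critical (score >= 9.0)."""
    return [r for r in records if r.is_web and cvss_severity(r.base_score) == "Critical"]


if __name__ == "__main__":
    # Placeholder IDs, not real CVEs.
    sample = [
        CveRecord("CVE-0000-0001", 9.8, True),
        CveRecord("CVE-0000-0002", 8.8, True),   # High, so excluded
        CveRecord("CVE-0000-0003", 10.0, False), # not a web CVE, so excluded
        CveRecord("CVE-0000-0004", 9.0, True),   # exactly 9.0 counts as Critical
    ]
    for rec in select_critical_web_cves(sample):
        print(rec.cve_id, rec.base_score)
```

Note that a score of exactly 9.0 is already Critical under the specification, which is why the filter compares against the rating rather than using a strict `> 9.0` test.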
