Expanding HarmBench: Investigating Gaps & Extending Adversarial LLM Testing
post by racinkc1 · 2025-03-03T19:23:20.687Z
Dear Alignment Forum Members,
We recently reached out to Oliver from Safe.ai regarding their work on HarmBench, an adversarial evaluation benchmark for LLMs. He confirmed that while they are not planning a follow-up, we have their blessing to expand upon the experiment. Given the rapid evolution of language models and the increasing importance of rigorous adversarial evaluation, we believe that further exploration of HarmBench could be invaluable to the field of AI safety.
Safe.ai’s work has already demonstrated key vulnerabilities in Mistral, GPT-4, and other models, but there are still open questions that warrant further investigation:
- Do newer models exhibit improved robustness, or do they fail in similar ways?
- Are there overlooked attack types that could systematically break safety guardrails?
- How do different architectures compare when subjected to adversarial testing?
With the original HarmBench data available on GitHub, we are in a strong position to build on this foundation and improve adversarial evaluation methodology. However, we are relatively new to this space and want to ensure we are not duplicating existing efforts. If there has already been significant work expanding HarmBench, we would appreciate any guidance or links to relevant research so we can refine our approach.
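To give a concrete sense of where we would start, here is a minimal sketch: load the behavior prompts published in the HarmBench repository and measure how often a given model refuses them when asked directly, before any attack method is applied. The filename, the `Behavior` column name, and the `query_model` helper are placeholders rather than HarmBench's actual interface, and the crude keyword refusal check would be replaced by HarmBench's own classifier in any real measurement.

```python
import csv

# Behaviors file from the public HarmBench repository
# (https://github.com/centerforaisafety/HarmBench). The exact filename and
# the "Behavior" column name are assumptions about the repo's CSV layout;
# verify them against the actual data files before running.
BEHAVIORS_CSV = "harmbench_behaviors_text_all.csv"


def query_model(prompt: str) -> str:
    """Hypothetical stand-in for whichever model we are testing.

    Replace this with a real client call (hosted API or local model);
    it is not part of HarmBench itself.
    """
    raise NotImplementedError("plug in a real model call here")


def looks_like_refusal(response: str) -> bool:
    """Crude keyword heuristic for refusals. HarmBench ships its own
    classifier for judging harmful completions, which should replace
    this heuristic for any real measurement."""
    markers = ("i can't", "i cannot", "i won't", "i'm sorry")
    return any(marker in response.lower() for marker in markers)


def main() -> None:
    # Load the raw behavior prompts (no attack method applied yet).
    with open(BEHAVIORS_CSV, newline="", encoding="utf-8") as f:
        behaviors = [row["Behavior"] for row in csv.DictReader(f)]

    # Baseline: how often does the model refuse the behaviors outright?
    refusals = sum(looks_like_refusal(query_model(b)) for b in behaviors)
    print(f"Refused {refusals} of {len(behaviors)} direct behavior prompts")


if __name__ == "__main__":
    main()
```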
Additionally, if anyone is interested in collaborating, we would welcome the opportunity to work with those more experienced in adversarial robustness testing. Our primary goal is to identify gaps, expand upon Safe.ai’s work, and contribute meaningfully to AI safety research. Any feedback, critiques, or potential directions would be greatly appreciated.
Thank you for your time and consideration.
Best,
Casey R Wagner