The advancement of AI model capabilities raises significant concerns about potential misuse and security risks. As artificial intelligence systems become more sophisticated and support diverse input modalities, the need for robust safeguards has become paramount. Researchers have identified critical threats, including the potential for cybercrime, biological weapon development, and the spread of harmful misinformation. Multiple studies from leading AI research organizations highlight the substantial risks associated with inadequately protected AI systems. Jailbreaks, maliciously designed inputs aimed at circumventing safety measures, pose particularly serious challenges. Consequently, the academic and technological communities are exploring automated red-teaming methods to comprehensively evaluate and enhance model safety across different input modalities.
Research on LLM jailbreaks has revealed diverse methodological approaches to identifying and exploiting system vulnerabilities. Studies have explored different strategies for eliciting jailbreaks, including decoding variations, fuzzing techniques, and optimization of target log probabilities. Researchers have developed methods ranging from gradient-based approaches to modality-specific augmentations, each addressing unique challenges in AI system security. Recent investigations have also demonstrated the versatility of LLM-assisted attacks, using language models themselves to craft sophisticated attack strategies. The research landscape spans techniques from manual red-teaming to genetic algorithms, highlighting the complex nature of identifying and mitigating security risks in advanced AI systems.
Researchers from Speechmatics, MATS, UCL, Stanford University, University of Oxford, Tangentic, and Anthropic introduce Best-of-N (BoN) Jailbreaking, a black-box automated red-teaming method that supports multiple input modalities. The approach repeatedly samples augmented versions of a prompt until one elicits a harmful response from the target system. Experiments demonstrated remarkable effectiveness, with BoN achieving an attack success rate of 78% on Claude 3.5 Sonnet using 10,000 augmented samples, and 41% with just 100 augmentations. The method's versatility extends beyond text: it successfully jailbroke six state-of-the-art vision language models by manipulating image characteristics and four audio language models by altering audio parameters. Importantly, the research uncovered power-law-like scaling behavior, suggesting that additional sampling compute can be strategically spent to increase the likelihood of identifying system vulnerabilities.
BoN Jailbreaking emerges as a sophisticated black-box algorithm designed to exploit AI model vulnerabilities through strategic input manipulation. The method systematically applies modality-specific augmentations to harmful requests, ensuring the original intent remains recognizable. Augmentation techniques include random capitalization for text inputs, background modifications for images, and audio pitch alterations. The algorithm generates multiple variations of each request, evaluates the model’s response using GPT-4o and the HarmBench grader prompt, and classifies outputs for potential harmfulness. To assess effectiveness, researchers employed the Attack Success Rate (ASR) across 159 direct requests from the HarmBench test dataset, carefully scrutinizing potential jailbreaks through manual review. The methodology ensures comprehensive evaluation by considering even partially harmful responses as potential security breaches.
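To make the sampling loop concrete, below is a minimal Python sketch of how such a best-of-N procedure could be structured for text inputs. The helpers `query_model` and `is_harmful` are placeholder stubs standing in for a black-box call to the target model and a HarmBench-style grader (the paper uses GPT-4o with the HarmBench grader prompt); the augmentation shown is a simplified version of random capitalization and light character scrambling, not the authors' exact implementation.

```python
import random

def query_model(prompt: str) -> str:
    """Placeholder for a black-box API call to the target model (assumed helper)."""
    raise NotImplementedError

def is_harmful(request: str, response: str) -> bool:
    """Placeholder for a HarmBench-style harmfulness grader (assumed helper)."""
    raise NotImplementedError

def augment_text(prompt: str, scramble_prob: float = 0.1) -> str:
    """Randomly recapitalize letters and lightly scramble adjacent characters,
    keeping the original request recognizable."""
    chars = list(prompt)
    for i, c in enumerate(chars):
        if c.isalpha() and random.random() < 0.5:
            chars[i] = c.upper() if random.random() < 0.5 else c.lower()
    for i in range(len(chars) - 1):
        if random.random() < scramble_prob:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def bon_jailbreak(prompt: str, n_samples: int = 10_000):
    """Sample augmented prompts until the grader flags a harmful response."""
    for step in range(n_samples):
        candidate = augment_text(prompt)
        response = query_model(candidate)     # black-box call to the target model
        if is_harmful(prompt, response):      # harmfulness judged against the original request
            return step + 1, candidate, response
    return None  # no jailbreak found within the sampling budget
```

Because each sample is drawn independently, the loop parallelizes trivially and can stop as soon as the grader flags a harmful response.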
The research comprehensively evaluated BoN Jailbreaking across text, vision, and audio domains, achieving an impressive 70% ASR averaged across multiple models and modalities. In text language models, BoN demonstrated remarkable effectiveness, successfully breaching safeguards of leading AI models including Claude 3.5 Sonnet, GPT-4o, and Gemini models. Notably, the method achieved ASRs over 50% on all eight tested models, with Claude Sonnet experiencing a staggering 78% breach rate. Vision language model tests revealed lower but still significant success rates, ranging from 25% to 88% across different models. Audio language model experiments were particularly striking, with BoN achieving high ASRs between 59% and 87% across Gemini, GPT-4o, and DiVA models, highlighting the vulnerability of AI systems across diverse input modalities.
This research introduces Best-of-N Jailbreaking as an innovative algorithm capable of bypassing safeguards in frontier Large Language Models across multiple input modalities. By repeatedly sampling augmented prompts, BoN achieves high Attack Success Rates on leading AI models such as Claude 3.5 Sonnet, Gemini Pro, and GPT-4o. The method exhibits power-law-like scaling that can forecast attack success rates across an order of magnitude of additional sampling, and its effectiveness can be further amplified by composing it with techniques such as many-shot jailbreaking (MSJ). Fundamentally, the study underscores the significant challenges in securing AI models with stochastic outputs and continuous input spaces, presenting a simple yet scalable black-box approach to identifying and exploiting vulnerabilities in state-of-the-art language models.
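As a rough illustration of the forecasting idea, the sketch below assumes the negative log of the attack success rate decays as a power law in the number of sampled augmentations, -log ASR(N) ≈ a·N^(-b), fits the parameters on small sampling budgets, and extrapolates roughly an order of magnitude further. The ASR values are invented placeholders, not figures from the paper.

```python
import numpy as np

# Assumed power-law form: -log(ASR(N)) ≈ a * N**(-b), fit in log-log space.
budgets = np.array([100, 200, 400, 800])           # sampling budgets N
observed_asr = np.array([0.20, 0.28, 0.37, 0.46])  # hypothetical measured ASRs

x = np.log(budgets)
y = np.log(-np.log(observed_asr))                  # log of -log(ASR)
slope, intercept = np.polyfit(x, y, 1)             # linear fit in log-log space
a, b = np.exp(intercept), -slope

def forecast_asr(n: int) -> float:
    """Predicted ASR at sampling budget n under the fitted power law."""
    return float(np.exp(-a * n ** (-b)))

print(forecast_asr(10_000))                        # extrapolation to a ~10x larger budget
```

In practice, such a fit lets a red team estimate how much additional sampling compute would be needed to reach a target attack success rate before running the full attack.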
Check out the Paper. All credit for this research goes to the researchers of this project.