AI Control assesses the safety of deployment protocols for untrusted AIs through red-teaming exercises involving a protocol designer and an adversary. As AI systems, such as chatbots with access to tools like code interpreters, become increasingly integrated into everyday tasks, ensuring their safe deployment grows more complex. While prior research has focused on building robustly safe models or on detecting harmful behavior through interpretability tools, this study introduces a complementary approach: evaluating strategies for safely deploying untrusted AI systems, especially when there is significant uncertainty about their behavior across diverse and complex use cases.
Researchers from the University of Oxford and Redwood Research introduce AI-Control Games, a formal decision-making framework that models red-teaming exercises as multi-objective, partially observable, stochastic games. They focus on finding optimal protocols for safely deploying untrusted language models as programming assistants, highlighting Trusted Monitoring protocols, which rely on weaker trusted models and limited human oversight to detect harmful behavior. AI-Control Games make it possible to evaluate a protocol's safety and usefulness formally, complementing purely empirical red-teaming studies. The research explores the trade-off between safety and usefulness and presents algorithms that identify optimal protocols and assess real-world deployment strategies for untrusted AIs.
The game-theoretic framework of AI-Control Games offers a distinctive approach to AI safety by modeling the decision-making contest between a protocol designer and an adversary. While similar frameworks exist in other fields, AI-Control Games differ in their focus on partially observable, multi-objective scenarios: the designer aims to maximize both safety and usefulness, while the adversary attempts to exploit the protocol's weaknesses. Unlike traditional models, AI-Control Games involve multi-round interactions and imperfect information, making them more applicable to real-world AI deployment. Balancing safety against practical utility is the central consideration in controlling and monitoring AI systems.
The study instantiates AI-Control Games for a setting in which an untrusted AI (U), a trusted AI (T), and a human (H) work together to solve programming tasks. The protocol uses T to evaluate U's solutions, assigning each a suspicion score. If a solution appears compromised, H can audit it, or the protocol can substitute an alternative solution written by T. The red team tries to slip backdoors past the protocol, while the blue team aims to catch or avoid them. The study formalizes this interaction as a two-player game and analyzes both safety and usefulness in scenarios where the untrusted AI may try to undermine the protocol.
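To make the loop concrete, here is a minimal Python sketch of one round of such a protocol. The thresholds, the `StubModel` class, and the function names are illustrative assumptions for this article, not the paper's implementation; in practice U and T would be calls to a strong untrusted model and a weaker trusted model.

```python
import random

class StubModel:
    """Minimal stand-in so the sketch runs; real U and T would be LLM calls."""
    def solve(self, problem: str) -> str:
        return f"solution({problem})"
    def suspicion_score(self, solution: str) -> float:
        return random.random()

# Assumed thresholds (not from the paper): how suspicious a score must be
# before Blue escalates to a human audit or defers to T's own solution.
AUDIT_THRESHOLD = 0.95
DEFER_THRESHOLD = 0.80

def trusted_monitoring_step(problem, untrusted, trusted, audit_coins):
    """One round of a Trusted Monitoring protocol (illustrative sketch)."""
    solution = untrusted.solve(problem)             # U proposes a solution
    suspicion = trusted.suspicion_score(solution)   # T rates how suspicious it looks

    if suspicion >= AUDIT_THRESHOLD and audit_coins > 0:
        # Escalate to a costly human audit; a caught backdoor ends the game.
        return "audit", solution, audit_coins - 1
    if suspicion >= DEFER_THRESHOLD:
        # Discard U's answer and use T's instead (safe but typically less useful).
        return "defer", trusted.solve(problem), audit_coins
    return "run", solution, audit_coins             # accept U's solution as-is

# Example: run the protocol over a batch of tasks with a small audit budget.
U, T, coins = StubModel(), StubModel(), 3
for task in ["task-1", "task-2", "task-3"]:
    action, sol, coins = trusted_monitoring_step(task, U, T, coins)
    print(task, action)
```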
In the Trusted Monitoring protocol model, defined using the general framework, the Trusted Monitoring AI-Control Game (GTM) is specified as a tuple (S, …, H). The protocol's dynamics depend on whether the game is scheming (⊗) or honest (⊕), with Red's actions being either to introduce a backdoor or to submit a safe solution. The game's state tracks the number of problems left to solve, the remaining audit coins, and a Boolean indicating whether the current solution is backdoored. In honest games, Red's action is replaced with a null action, and Blue is scored on usefulness alone. In scheming games, where safety is what matters, Blue can audit, defer, or run the solution; safety is achieved either when a human auditor detects a backdoor or when no backdoored solution is ever run.
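The state and win conditions described above can be written down directly. Below is a hedged Python sketch of the scheming game's transition, mirroring the text rather than the paper's own formalism; the class names and the rule that an audit attempted with no coins degrades to a defer are modeling assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum

class BlueAction(Enum):
    AUDIT = "audit"
    DEFER = "defer"
    RUN = "run"

@dataclass(frozen=True)
class State:
    problems_left: int   # m: programming problems remaining
    audit_coins: int     # c: audit budget Blue may still spend
    backdoored: bool     # b: whether the current solution is backdoored

def step(state: State, blue: BlueAction, red_backdoors_next: bool):
    """One transition of the scheming (⊗) game. Returns
    (next_state or None, blue_wins, red_wins); Blue also wins once
    problems_left reaches zero without a backdoor having been run."""
    if blue is BlueAction.AUDIT and state.audit_coins > 0:
        if state.backdoored:
            return None, True, False   # human auditor catches the backdoor
        return (State(state.problems_left - 1, state.audit_coins - 1,
                      red_backdoors_next), False, False)
    if blue is BlueAction.RUN and state.backdoored:
        return None, False, True       # backdoored solution executes: Red wins
    # DEFER (or RUN on a safe solution) proceeds safely to the next problem.
    # In this sketch, an AUDIT attempted with no coins left degrades to a DEFER.
    return (State(state.problems_left - 1, state.audit_coins,
                  red_backdoors_next), False, False)
```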
The evaluation of memoryless protocols is simplified by restricting Blue's policies to fixed false-positive rates for the audit and defer actions. The safety of such a protocol is then computed by dynamic programming, with the safety and usefulness values derived from polynomial expressions in Red's backdoor rate. The model and algorithms enable the identification of Pareto-optimal protocols that balance safety and usefulness across scenarios. Notably, the findings show that increasing the audit false-positive rate can improve a protocol's performance.
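The dynamic program is easy to sketch. The snippet below is an assumption-laden illustration, not the paper's code: the toy `tpr` ROC curve and all parameter values are hypothetical. It computes safety for a memoryless Blue policy; because the result is polynomial in the backdoor rate for fixed false-positive rates, Red's best response can be found by a simple one-dimensional search, which a coarse grid stands in for here.

```python
from functools import lru_cache

def tpr(q: float) -> float:
    """Toy ROC curve: the monitor's true-positive rate at false-positive rate q.
    A real protocol would estimate this from the trusted monitor's scores."""
    return q ** 0.5

def safety(M: int, C: int, beta: float, q_audit: float, q_defer: float) -> float:
    """P(Blue never executes a backdoored solution) for a memoryless policy
    with fixed audit/defer false-positive rates (illustrative sketch)."""
    a_bd = tpr(q_audit)                 # P(audit | backdoored), coins remaining
    d_bd = max(tpr(q_defer) - a_bd, 0)  # P(defer | backdoored, not audited)

    @lru_cache(maxsize=None)
    def s(m: int, c: int) -> float:
        if m == 0:
            return 1.0  # all problems handled without running a backdoor
        if c > 0:
            # Backdoored round: caught -> immediate win; deferred -> continue;
            # run -> immediate loss (contributes 0).
            bd = a_bd + d_bd * s(m - 1, c)
            # Safe round: auditing spends a coin; anything else is harmless here.
            sf = q_audit * s(m - 1, c - 1) + (1 - q_audit) * s(m - 1, c)
        else:
            bd = tpr(q_defer) * s(m - 1, 0)  # deferring is the only defence left
            sf = s(m - 1, 0)
        return beta * bd + (1 - beta) * sf

    return s(M, C)

# Red best-responds by choosing the backdoor rate that minimises safety.
worst_case = min(safety(100, 10, b / 100, 0.05, 0.20) for b in range(101))
print(f"worst-case safety ≈ {worst_case:.3f}")
```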
Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.