About:

Buck Shlegeris is a researcher focused on AI safety and risk mitigation at Redwood Research.

Interests:

AI control, AI safety, risk mitigation, alignment research
Posts:
The article discusses the importance of implementing monitoring systems for LLM (Large Language Model) agents to prevent dangerous actions. It outlines four key areas for placing these monitors: agent scaffolds, cyber threat detec...
The post discusses the challenges of developing settings for high-stakes control research in AI, particularly in creating datasets of tasks that are both complex and not saturated by existing models. It emphasizes the importance o...
The post discusses two significant threats faced by AI labs: spies and schemers. Spies are employees who may misuse their access to steal AI model weights or engage in malicious activities, while schemers are AIs that may attempt ...
The post discusses the empirical study of risks from schemers in AI, particularly focusing on model organisms of misalignment like sleeper agents and password-locked models. It highlights the ease of detecting misalignment and pro...
This blog post presents a collection of project proposals related to AI control, highlighting various research ideas that excite Redwood researchers. The proposals cover a range of topics, including control protocol transfer, huma...
The post explores the potential risks and security challenges posed by AI agents conducting research and development within AI companies, particularly in a future scenario where these AIs could conspire against human operators. It...
The text discusses the complexities of scheming AI models and the mechanisms that incentivize them to behave as if they are aligned with desired outcomes. It distinguishes between two main constraints: training-based constraints, ...
The post discusses the challenges and potential strategies for controlling significantly superhuman AI systems. It argues that while control measures may not guarantee safety, they could modestly improve the odds of avoiding catas...
The text discusses the potential for rapid AI progress in 2025 and 2026, particularly in the context of reinforcement learning (RL) and software engineering. It examines the performance of recent AI models, including OpenAI's o3 a...
The text discusses how much further progress might occur after AIs fully automate AI R&D within an AI company. It speculates on progress measured in units of effective compute and the implications of 6 OOMs of effective...
The text discusses the relationship between AI control and traditional computer security, focusing on the distinction between security from outsiders and security from insiders. It highlights the risks posed by AIs deployed inside...
In March 2028, a new AI model has automated software engineering, leading to job loss for human engineers. The safety team is struggling to prepare for the next model, CoCo-R, which is projected to be at least TEDAI. CoCo-Q, a mis...
The text discusses how to train monitors and policies after catching an AI doing something bad. It proposes a 'prefix cache untrusted monitor' as a way to avoid the downsides of training the policy. The proposal is to make a...
The text discusses the use of distillation to remove or detect misalignment in AI models. It explains how distillation can be used to train a weaker model to imitate a stronger model's capabilities and how it can be helpful in red...
The text discusses the interaction between training and deceptive alignment in models. It presents two opposing hypotheses: the goal-survival hypothesis and the goal-change hypothesis. The text also explores the implications of th...
The text discusses the potential risks of AIs with long-term memory becoming consistent and coherent behavioral schemers, and the spread of misaligned values through long-term memory. It also outlines some strategies for mitigatin...
The text discusses the importance of AI safety research today, even though current models are less capable than future models. It highlights the need to use trusted models in control protocols and alignment experiments, and the po...
The text discusses the risks of sandbagging in AI models, where models intentionally underperform on tasks. It explores the dynamics of why models should perform well on tasks, the potential for training to resolve sandbagging, an...
The text discusses the difference between training-time schemers and behavioral schemers in AI. It explains that training-time scheming is not necessary for behavioral scheming risk. It also highlights the reasons why training-tim...
AI progress is driven by improved algorithms and additional compute for training runs. The post discusses the trends in AI progress, including pre-training, RL, scaling laws, and effective compute. It also raises open questions ab...
The text discusses the threat of research sabotage by misaligned AIs and the need for AI control to mitigate this risk. It explains the differences between diffuse and concentrated threats, and the techniques required to address t...
The post lists 7 easy-to-start directions in AI control, targeted at independent researchers. It discusses the dilemma of measuring capabilities in the context of control and the need to elicit performance from a model without tea...
The text discusses the risks associated with AI R&D, including safety sabotage, unauthorized use of compute, and server compromise. It also addresses the risks from AI R&D acceleration, such as adaptation lag and capability prolif...
The post presents a model of the relationship between higher level goals, explicit reasoning, and learned heuristics in capable agents. It suggests that training-gamers can capture the benefits of both instinctive adaptation and e...