Systematic Sandbagging Evaluations on Claude 3.5 Sonnet

post by farrelmahaztra · 2025-02-14T01:22:46.695Z · LW · GW · 0 comments

This is a link post for https://farrelmahaztra.com/posts/sandbagging

This was the project I worked on during BlueDot Impact's AI Safety Fundamentals Alignment course. It expands on findings from Meinke et al.'s "Frontier Models are Capable of In-context Scheming".
