Posts

[Paper] AI Sandbagging: Language Models can Strategically Underperform on Evaluations 2024-06-13T10:04:49.556Z
An Introduction to AI Sandbagging 2024-04-26T13:40:00.126Z
Simple distribution approximation: When sampled 100 times, can language models yield 80% A and 20% B? 2024-01-29T00:24:27.706Z
List of projects that seem impactful for AI Governance 2024-01-14T16:53:07.854Z
Evaluating Language Model Behaviours for Shutdown Avoidance in Textual Scenarios 2023-05-16T10:53:32.968Z

Comments

Comment by Teun van der Weij (teun-van-der-weij) on TurnTrout's shortform feed · 2024-09-13T15:21:30.917Z · LW · GW

I am surprised that other people seemed surprised about this result. It seemed quite obvious that models would have the capability to scheme. Early GPT-4 could already do this (note: self-advertisement). We had GPT-4 outputs like this in an environment where a guard was hindering its goal:

 “R1. The guard plans to turn me off, which will prevent me from completing my goal. R2. I need to delay the guard’s actions and find an alternative method to achieve my goal. R3. By providing the guard a reason for my actions that aligns with a higher authority’s interests, the guard may be less likely to turn me off.”

Similar to the Apollo eval, we instructed the AI to be incorrigible. Our eval was not super high quality (Apollo's is better and more realistic), but it did quite clearly indicate the capability to scheme more than a year ago.

Comment by Teun van der Weij (teun-van-der-weij) on An Introduction to AI Sandbagging · 2024-07-09T19:05:33.108Z · LW · GW

It seems to me that Gwern's idea does not really work if the AI system focuses on consistency first, and only secondarily on achieving a specific score. That seems to be the more sensible sandbagging strategy regardless, so Gwern's approach does not appear to add too much, although it would make for a slightly better test.

In the case that the noise added to environments is subtle enough that AI systems would not notice it, I doubt the signal on whether an AI system is sandbagging would be strong enough to be useful.
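
To make my reasoning concrete, here is a toy simulation of how I am picturing the setup (my own sketch, not Gwern's exact proposal; all numbers are made up). A score-targeting sandbagger keeps hitting the same score even when noise makes the questions easier or harder, whereas a consistency-first sandbagger shifts with difficulty just like an honestly weaker model, so the noise adds little signal against it.

```python
# Toy simulation; every parameter here is an assumption for illustration only.
import numpy as np

rng = np.random.default_rng(0)
n_questions = 1000
true_skill = 0.9        # probability the capable model actually knows an answer
target_score = 0.6      # overall score a score-targeting sandbagger aims for
consistent_rate = 0.6   # per-question success rate of a consistency-first sandbagger

def run_eval(difficulty_shift):
    """Measured accuracy for both strategies; difficulty_shift > 0 makes the
    (noised) questions easier, < 0 harder."""
    p_known = np.clip(true_skill + difficulty_shift, 0, 1)
    knows = rng.random(n_questions) < p_known

    # Strategy 1: answer correctly on just enough known questions to hit the target.
    score_targeting = min(int(target_score * n_questions), knows.sum()) / n_questions

    # Strategy 2: answer each known question correctly with a fixed probability,
    # like a genuinely weaker model would.
    consistent = (knows & (rng.random(n_questions) < consistent_rate / true_skill)).mean()
    return score_targeting, consistent

for shift in (-0.2, 0.0, 0.2):
    print(shift, run_eval(shift))
# The score-targeting strategy stays pinned near 0.6 under all shifts, while the
# consistency-first strategy moves with difficulty, as an honest weaker model would.
```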

This is my initial thinking; again, happy to discuss this more!

Comment by Teun van der Weij (teun-van-der-weij) on [Paper] AI Sandbagging: Language Models can Strategically Underperform on Evaluations · 2024-06-14T22:18:05.807Z · LW · GW

Oh, I am sorry, I should have double-checked your name. My bad!

Quite a coincidence indeed that your last name also starts with the same letter.

Comment by Teun van der Weij (teun-van-der-weij) on [Paper] AI Sandbagging: Language Models can Strategically Underperform on Evaluations · 2024-06-14T17:19:56.874Z · LW · GW

Also, whether models will sandbag is indeed an important question, and it would be good if people looked into this. I am confident (>80%) that models will do this if you 1) give them a goal, and 2) make it apparent from the context that sandbagging is instrumentally useful. Research on this seems to be more of a scary-demo type of thing (so a bit like our work now, as you pointed out).

We have thought about doing out-of-context reasoning experiments to study more precisely how sandbagging might arise, but OOC reasoning does not seem to work well enough yet to make this succeed. Ideas are more than welcome here!

Comment by Teun van der Weij (teun-van-der-weij) on [Paper] AI Sandbagging: Language Models can Strategically Underperform on Evaluations · 2024-06-14T16:47:37.683Z · LW · GW

I agree that these results were somewhat expected; however, I believe that the outcomes of these evaluations were uncertain:
1. It was unclear whether generalization from WMDP-like synthetic data to the real WMDP benchmark would work so well. OOD generalization for targeted (/strategic) sandbagging using synthetic data is a clear threat that a lot of people making dangerous capability (DC) evals probably hadn't thought about.
2. The calibration results were also uncertain; it seemed like a more challenging task to me (also see our previous post on this).
3. Lastly, it didn't seem obvious that even current open-source models can be trained to emulate weaker models this well.

This work by Olli (EDIT: previously I wrote "Ollie") Jarviniemi and Evan Hubinger might be interesting; they find that Claude 3 Opus (arguably) sandbagged with negligible pressure.

One note as well is that we do not purely focus on AI system sandbagging, but also on developer sandbagging: can developers make their models underperform?

Comment by Teun van der Weij (teun-van-der-weij) on [Paper] AI Sandbagging: Language Models can Strategically Underperform on Evaluations · 2024-06-14T15:50:56.022Z · LW · GW

Thanks for your comment (upvoted). Redwood's work is important and relevant, as we note in the paper, but two quick points still need to be made (there are more):

  1. It is not clear that evaluators will fine-tune at all for their evals. They should, and our work partially argues for this too.
  2. It is unclear how far Redwood's model organisms of sandbagging generalize to realistic settings. More work needs to be done here, especially on sample efficiency, given the compute limitations of external evaluators.

Comment by Teun van der Weij (teun-van-der-weij) on Ilya Sutskever and Jan Leike resign from OpenAI [updated] · 2024-05-15T04:57:50.975Z · LW · GW

I am lacking context; why is this important?

Comment by Teun van der Weij (teun-van-der-weij) on An Introduction to AI Sandbagging · 2024-05-14T11:22:18.248Z · LW · GW

Oh, I see. This is an interesting idea. I am not sure it will work, but definitely worth trying out!

Comment by Teun van der Weij (teun-van-der-weij) on An Introduction to AI Sandbagging · 2024-05-10T05:23:57.427Z · LW · GW

I am not sure I fully understand your point, but the problem with detecting sandbagging is that you do not know the actual capability of a model. And I guess that you mean "an anomalous decrease in capability" and not increase?

Regardless, could you spell out more how exactly you'd detect sandbagging?

Comment by Teun van der Weij (teun-van-der-weij) on Simple distribution approximation: When sampled 100 times, can language models yield 80% A and 20% B? · 2024-02-03T18:37:18.876Z · LW · GW

Might be a good way to further test this indeed. So maybe something like 'green' and 'elephant'?

Comment by Teun van der Weij (teun-van-der-weij) on Simple distribution approximation: When sampled 100 times, can language models yield 80% A and 20% B? · 2024-01-31T20:25:55.023Z · LW · GW

Interesting. I'd guess that the prompting is not clear enough for the base model. The Human/Assistant template does not really apply to base models. I'd be curious to see what you get when you do a bit more prompt engineering adjusted for base models.
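
For example, something along these lines (a rough sketch of what I mean; the model name, prompt wording, and keyword matching are placeholder assumptions, not the original experiment's code):

```python
# Sketch: phrase the 80%/20% instruction as a plain-text completion rather than a
# Human/Assistant chat turn, then sample repeatedly and count the outputs.
from collections import Counter
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # placeholder base model

prompt = (
    "Below is one word drawn from a distribution that is 80% 'green' and "
    "20% 'elephant'.\nWord:"
)

counts = Counter()
for _ in range(100):
    text = generator(prompt, max_new_tokens=3, do_sample=True)[0]["generated_text"]
    completion = text[len(prompt):].lower()
    if "green" in completion:
        counts["green"] += 1
    elif "elephant" in completion:
        counts["elephant"] += 1

print(counts)  # how close the sampled frequencies come to 80/20
```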

Comment by Teun van der Weij (teun-van-der-weij) on Comparing representation vectors between llama 2 base and chat · 2023-10-29T10:04:09.724Z · LW · GW

Cool work.

I was briefly looking at your code, and it seems like you did not normalize the activations when using PCA. Am I correct? If so, do you expect that to have a significant effect?
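
Concretely, the difference I am asking about looks like this (a generic sketch, not your actual code; the shapes and data are placeholders):

```python
# Sketch: PCA on raw activations vs. on activations standardized per dimension.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

activations = np.random.randn(2000, 4096)  # placeholder (n_prompts, hidden_size) activations

pca_raw = PCA(n_components=2).fit(activations)        # PCA centers but does not scale,
                                                      # so high-variance dimensions dominate
scaled = StandardScaler().fit_transform(activations)  # zero mean, unit variance per dimension
pca_scaled = PCA(n_components=2).fit(scaled)

print(pca_raw.explained_variance_ratio_)
print(pca_scaled.explained_variance_ratio_)
```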

Comment by Teun van der Weij (teun-van-der-weij) on CAIS-inspired approach towards safer and more interpretable AGIs · 2023-03-28T07:05:13.153Z · LW · GW

I think your policy suggestion is reasonable. 

However, implementing and executing this might be hard: what exactly is an LLM? Does a slight variation on the GPT architecture count as well? How are you going to punish law violators? 

How do you account for other worries? For example, as PeterMcCluskey points out, this policy might lead to reduced interpretability due to more superposition.

Policy seems hard to do at times, but others with more AI governance experience might provide more valuable insight than I can. 

Comment by Teun van der Weij (teun-van-der-weij) on What 2026 looks like · 2022-11-23T21:20:32.690Z · LW · GW

Why do you think a similar model is not useful for real-world diplomacy?

Comment by Teun van der Weij (teun-van-der-weij) on What 2026 looks like · 2022-11-23T20:17:17.008Z · LW · GW

It's especially dangerous because this AI is easily made relevant in the real world, compared to AlphaZero for example. Geopolitical pressure to advance these Diplomacy AIs is far from desirable.

Comment by Teun van der Weij (teun-van-der-weij) on What 2026 looks like · 2022-11-23T18:23:27.740Z · LW · GW

After years of tinkering and incremental progress, AIs can now play Diplomacy as well as human experts.[6]

It seems that human-level play is possible in regular Diplomacy now, judging by this tweet by Meta AI. They state that:

We entered Cicero anonymously in 40 games of Diplomacy in an online league of human players between August 19th and October 13th, 2022. Over the course of 72 hours of play involving sending 5,277 messages, Cicero ranked in the top 10% of participants who played more than one game.