Posts

Inverse Scaling Prize: Second Round Winners 2023-01-24T20:12:48.474Z
Inverse Scaling Prize: Round 1 Winners 2022-09-26T19:57:01.367Z
Beliefs and Disagreements about Automating Alignment Research 2022-08-24T18:37:00.419Z
Announcing the Inverse Scaling Prize ($250k Prize Pool) 2022-06-27T15:58:19.135Z
Pong from pixels without reading "Pong from Pixels" 2020-08-29T17:26:25.656Z

Comments

Comment by Ian McKenzie (naimenz) on Password-locked models: a stress case for capabilities evaluation · 2023-08-08T02:07:57.860Z · LW · GW

One difference is that even given access to the model weights and the code behind the API, you could not tell whether the model was password-locked, whereas you would be able to see the hardcoded verifier. So if a lab wanted to hide capabilities, it could delete the training data and you would have no way of knowing.

Comment by Ian McKenzie (naimenz) on Russell Conjugations list & voting thread · 2023-02-21T00:12:04.763Z · LW · GW

The Wikipedia article has a typo in one of these: it should say "I am sparkling; you are unusually talkative; he is drunk." (as in the source)

Comment by Ian McKenzie (naimenz) on Inverse Scaling Prize: Round 1 Winners · 2022-09-29T18:08:19.573Z · LW · GW

We did do human validation on the tasks with Surge: redefine-math, quote-repetition, and hindsight-neglect all got 100% agreement, and NeQA got 98% agreement. I agree though that it seems likely many people would do the task ‘wrong’, so maybe the task would be improved by adding clearer instructions.

The situation feels somewhat like model splintering to me: the few-shot examples fit both patterns but the question doesn’t. The larger models are learning the incorrect generalization.

I think it’s important to note that LMs learning to respond in the same way as the average internet user is in some sense expected but can still be an example of inverse scaling – we would like our models to be smarter than that.

Comment by Ian McKenzie (naimenz) on A Quick Guide to Confronting Doom · 2022-04-16T17:18:18.913Z · LW · GW

I think 75% corresponds to odds of 1:3, not 1:2.
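The probability-to-odds conversion behind this correction can be sketched as follows (the numbers are just the 75% figure from the comment, in both directions):

```python
p = 0.75

# Odds in favor: p : (1 - p), i.e. 0.75 : 0.25 = 3 : 1
odds_for = p / (1 - p)       # 3.0, so 3:1 in favor

# Odds against: (1 - p) : p, i.e. 0.25 : 0.75 = 1 : 3
odds_against = (1 - p) / p   # ~0.333, so 1:3 against
```

So a 75% probability is 1:3 against (not 1:2, which would correspond to about 67%).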

Comment by Ian McKenzie (naimenz) on Why Instrumental Goals are not a big AI Safety Problem · 2022-04-09T20:57:34.431Z · LW · GW

A couple of things that come to mind:

  • You'll run into ELK-like difficulties when trying to train B: maybe it looks to the human as though the off switch is still accessible, but in reality you would not be able to activate it for some reason -- A steers into a Faraday cage, or something more clever.

  • (In response to your reply to acylhalide) Assuming you solve the training difficulties, B has to be at least as good at modelling the world as A: it needs to be able to predict the consequences of action sequences from A to see if the off switch is accessible.

  • (In response to your reply to Evan) I don't necessarily expect A+B+R to be as capable as A+R: adding B introduces a huge shift to the environment that A has to deal with. If A always disabled the off switch in regular training, and is now unable to execute action sequences that involve that, then it may not be able to act effectively.

Comment by Ian McKenzie (naimenz) on Can we expect more value from AI alignment than from an ASI with the goal of running alternate trajectories of our universe? · 2020-08-09T13:52:01.115Z · LW · GW

Is your suggestion to run this system as a source of value, simulating lives for their own sake rather than to improve the quality of life of sentient beings in our universe? Our history (and present) aren't exactly utopian, and I don't see any real reason to believe that slight variations on it would lead to anything happier.

I think we can expect to achieve a lot more than that from a properly aligned AGI. There is so much suffering that could be alleviated right now with proper coordination; that alone is a lower bound on how much better an aligned AGI could do than effectively running copies of our timeline at lower resolution.

Comment by Ian McKenzie (naimenz) on Our take on CHAI’s research agenda in under 1500 words · 2020-08-07T09:41:12.478Z · LW · GW

Is the "Going Beyond Agents" section part of CHAI's research agenda, or your take on further challenges from an embedded agency perspective?

Comment by Ian McKenzie (naimenz) on Rationality: An Introduction · 2019-07-20T22:12:45.494Z · LW · GW

In the example with Bob, surely the odds of Bob having a crush on you after winking (2:1) should be higher than for a random person winking at you (given as 10:1), since we already have reason to suspect that Bob is more likely to have a crush on you than some random person not among the six.
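The structure of this objection can be sketched with Bayes' rule in odds form. The prior odds and likelihood ratio below are placeholders chosen for illustration, not the numbers from the original post; the point is only that the same evidence (a wink) applied to a stronger prior must yield stronger posterior odds:

```python
def update(prior_odds, likelihood_ratio):
    """Bayes' rule in odds form: posterior odds = prior odds * likelihood ratio."""
    return prior_odds * likelihood_ratio

lr_wink = 10          # assumption: a wink is 10x likelier given a crush
bob_prior = 1 / 5     # assumption: Bob starts with a higher prior
random_prior = 1 / 50 # assumption: a random person starts lower

# Same evidence, higher prior => higher posterior odds for Bob.
assert update(bob_prior, lr_wink) > update(random_prior, lr_wink)
```

This is why the comment argues the post-wink odds for Bob should exceed those for a random winker: the likelihood ratio from the wink is the same, but Bob's prior is already elevated.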