Posts

Models Don't "Get Reward" 2022-12-30T10:37:11.798Z
A Summary Of Anthropic's First Paper 2021-12-30T00:48:15.290Z

Comments

Comment by Sam Ringer on Super-Exponential versus Exponential Growth in Compute Price-Performance · 2023-10-09T21:37:34.009Z · LW · GW

This post (and the author's comments) doesn't seem to be getting a great response, and I'm confused as to why. The post seems pretty reasonable and the author's comments are well informed.

My read of the main thrust is "don't concentrate on a specific paradigm and instead look at this trend that has held for over 100 years".

Can someone concisely explain why they think this is misguided? Is it just concerns over the validity of fitting parameters for a super-exponential model?

(I would also add that, on priors, when people claim "There is no way we can improve FLOPs/$ because of reasons XYZ", they have historically always been wrong.)

Comment by Sam Ringer on Models Don't "Get Reward" · 2023-01-07T15:10:09.208Z · LW · GW

Ah, OK. If a reward function is taken to be a preference ordering, then you are right that the model is optimizing for reward, as the preference ranking is literally identical.

I think the reason we have been talking past each other is that, in my head, when I think of "reward function" I am literally thinking of the reward function (i.e. the actual code), and when I think of "reward maximiser" I think of a system that is trying to get that piece of code to output a high number.
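To make that concrete, here is a toy sketch (illustrative Python only, not anything from the post or from a real training setup):

    # Two different pieces of "actual code" that induce the *same* preference
    # ordering over outputs:
    def reward_a(output: str) -> float:
        return float(len(output))

    def reward_b(output: str) -> float:
        return float(len(output)) ** 2

    # Read as preference orderings, reward_a and reward_b are identical, so in
    # that sense a model that ranks outputs this way "optimizes for reward".
    # Read as literal code, a "reward maximiser" would be a system whose own
    # decision procedure is trying to make one of these functions return a big
    # number, e.g. by explicitly evaluating it over candidates:
    candidates = ["a", "ab", "abc"]
    print(max(candidates, key=reward_a))  # -> "abc"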

So I guess it's a case of us needing to be very careful about exactly what we mean by "reward function", and my guess is that as long as we use the same definition, we are in agreement? Does that make sense?

Comment by Sam Ringer on Models Don't "Get Reward" · 2023-01-07T12:01:54.692Z · LW · GW

This is an interesting point. I can imagine a case where our assigned reward comes from a simple function (e.g. reward = number of letter 'e's in the output) and we also have a model which is doing some internal optimization to maximise the number of 'e's produced in its output, so it is "goal-directed to produce lots of 'e's".

Even in this case, I would still say this model isn't a "reward maximiser". It is a "letter 'e' maximiser".
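Roughly what I have in mind, as a toy sketch (illustrative Python only):

    # The reward function we assign during training:
    def reward(output: str) -> float:
        return float(output.count("e"))

    # A "letter 'e' maximiser": its internal search optimises its *own*
    # objective (number of 'e's), which happens to coincide with the reward.
    def letter_e_maximiser(candidates):
        return max(candidates, key=lambda s: s.count("e"))

    # It never calls reward() and has no representation of it; it would go on
    # producing 'e's even if we silently swapped the reward function out.
    print(letter_e_maximiser(["hello", "eeeee", "world"]))  # -> "eeeee"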

(I also want to acknowledge that thinking this through makes me feel somewhat confused. I think what I said is correct. My guess is that the misunderstanding I highlight in the post is quite pervasive, and the language we use isn't currently up to scratch for writing about these things clearly. Good job thinking of a case that pushes against my understanding!)

Comment by Sam Ringer on Models Don't "Get Reward" · 2023-01-07T11:49:06.987Z · LW · GW

Thanks for the feedback!


From reading your linked comment, I think we agree about selection arguments. In the post, when I mention "selection pressure towards a model", I generally mean "such a model would score highly on the reward metric" as opposed to "SGD is likely to reach such a model". I believe the former is correct and the latter is very much an open question.

To second what I think is your general point, a lot of the language used around selection can be confusing because it conflates "such a solution would do well under some metric" with "your optimization process is likely to produce such a solution". The wolves with snipers example illustrates this pretty clearly. I'm definitely open to ideas for better language to distinguish the two cases!

Comment by Sam Ringer on Models Don't "Get Reward" · 2022-12-30T18:17:18.030Z · LW · GW

I agree with the sentiment of this. When writing the post I was aware that it would not apply to all cases of RL.

However, I think I disagree w.r.t. non-vanilla policy-gradient methods (TRPO, PPO, etc.). Using advantage functions and baselines doesn't change how things look from the perspective of the policy network: it is still only observing the environment and taking appropriate actions, and it never "sees" the reward. The advantage function is only used in step 2 of my example, not step 1. (I'm sure there are schemes where this is not the case, but I think I'm correct for PPO.)
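As a minimal sketch of what I mean (illustrative PyTorch, a plain REINFORCE-style update rather than actual PPO, with made-up advantage values standing in for the real computation):

    import torch
    import torch.nn as nn

    policy = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))
    optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)

    # Step 1: the policy acts. Its forward pass sees only observations;
    # the reward never appears here.
    obs = torch.randn(8, 4)                      # dummy batch of observations
    dist = torch.distributions.Categorical(logits=policy(obs))
    actions = dist.sample()

    # Step 2: the training procedure uses reward information (here via
    # advantages/baselines) to decide how to update the weights. This is the
    # only place reward shows up, and the policy never takes it as an input.
    advantages = torch.randn(8)                  # stand-in for computed advantages
    loss = -(dist.log_prob(actions) * advantages).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()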

I'm less sure of this, but even in model-based systems, Q-learning, etc., the planning/iteration happens with respect to the outputs of a value network, which is trained to be correlated with the reward but isn't the reward itself. For example, I would say that the MCTS procedure of MuZero does "want" something, but what it wants is not plans that get high reward; it's plans that score highly according to the system's value function. (I'm happy to be deferential on this, though.)

The other interesting case is Decision Transformers. DTs absolutely "get" reward. It is explicitly an input to the model! But I mentally bucket them as generative models as opposed to systems that "want" reward.
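Schematically (toy code only, glossing over the real architecture; the point is just that the target return is literally part of the model's input):

    # Decision-Transformer-style input: at every timestep the model conditions
    # on a return-to-go token alongside the state and action tokens.
    trajectory = [
        # (return-to-go, state, action)
        (9.0, "s0", "a0"),
        (6.5, "s1", "a1"),
        (2.0, "s2", "a2"),
    ]

    def predict_next_action(model, trajectory):
        # The model sees reward information directly (as returns-to-go) and
        # generates actions conditioned on it, like any other sequence model.
        tokens = [tok for step in trajectory for tok in step]
        return model(tokens)

    def dummy_model(tokens):
        return "a3"  # stand-in for the transformer's predicted next action

    print(predict_next_action(dummy_model, trajectory))  # -> "a3"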

Comment by Sam Ringer on Simulators · 2022-12-11T12:16:04.166Z · LW · GW

Instrumental convergence only comes into play when there are free variables in action space which are optimized with respect to their consequences.


I roughly get what this is gesturing at, but I'm still a bit confused. Does anyone have any literature/posts they can point me to which might help explain this?

Also, great post, janus! It has really updated my thinking about alignment.

Comment by Sam Ringer on [deleted post] 2022-06-13T03:55:46.913Z

Yeah, so thinking a little more, I'm not sure my original comment conveyed everything I was hoping to. I'll add that even if you could get a side of A4 explaining AI x-risk in front of a capabilities researcher at <big_capabilities_lab>, I think they would be much more likely to engage with it if <big_capabilities_lab> is mentioned.

I think arguments will probably be more salient if they include "and you personally, intentionally or not, are entangled with this."

That said, I don't have any data on the above. I'm keen to hear any personal experiences anyone else might have in this area.

Comment by Sam Ringer on [deleted post] 2022-06-13T03:50:27.695Z

OK, not sure I understand this. Are you saying "big corps are both powerful and complicated; trying to model their response is intractably difficult, so under that uncertainty you are better off just steering clear"?

Comment by Sam Ringer on [deleted post] 2022-06-12T23:56:23.868Z

I think it's good that someone is bringing this up. I think as a community we want to be deliberate and thoughtful with this class of things.

That being said, my read is that the main failure mode with advocacy at the moment isn't "capabilities researchers are having emotional responses to being called out, which makes it hard for them to engage seriously with x-risk."
It's "they literally have no idea that anyone thinks what they are doing is bad."

Consider FAIR trying their hardest to open-source capabilities work with OPT. The tone and content of the responses show overwhelming support for doing something that is, in my worldview, really, really bad.

I would feel much better if these people at least ran their eyeballs over the arguments against open-sourcing capabilities work. Using the names of specific labs surely makes it more likely that the relevant writing ends up in front of them?

Comment by Sam Ringer on Alignment Problems All the Way Down · 2022-01-23T08:50:51.321Z · LW · GW

I think the failure case identified in this post is plausible (and likely) and is very clearly explained, so props for that!

However, I agree with Jacob's criticism here. Any AGI success story basically has to have "the safest model" also be "the most powerful" model, because of incentives and coordination problems.

Models that are themselves optimizers are going to be significantly more powerful and useful than "optimizer-free" models. So the suggestion of trying to avoid mesa-optimization altogether is a bit of a fabricated option. There is an interesting parallel here with the suggestion of just "not building agents" (https://www.gwern.net/Tool-AI).

So from where I am sitting, we have no option but to tackle aligning the mesa-optimizer cascade head-on.

Comment by Sam Ringer on What's Up With Confusingly Pervasive Goal Directedness? · 2022-01-22T18:31:53.775Z · LW · GW

This post seems to be using a different meaning of "consequentialism" from the one I am familiar with (that of moral philosophy). As a result, I'm struggling to follow the narrative from "consequentialism is convergently instrumental" onwards.

Can someone give me some pointers on how I should be interpreting the definition of consequentialism here? If it is just the moral philosophy definition, then I'm getting very confused as to why "judge the morality of actions by their consequences" is a useful subgoal for agents to optimize against...