Comments
I'm noticing there are still many interp mentors for the current round of MATS -- was the "fewer mech interp mentors" change implemented for this cohort, or will that start in Winter or later?
How often do people not do PhDs on the basis that they don't teach you to be a good researcher? Perhaps this is different in certain circles, but almost everyone I know doesn't want to do a PhD for personal reasons (and also timelines).
The most common objections are the following:
- PhDs are very depressing and not very well paid.
- Advisors do not have strong incentives to put much effort into training you and apparently often won't. This is pretty demotivating.
- A thing you seem to be advocating for is PhDs primarily at top programs. These are very competitive, it is hard to make progress towards getting into a better program once you graduate, and there is a large opportunity cost to devoting my entire undergraduate degree to doing enough research to be admitted.
- PhDs take up many years of your life. Life is short.
- It is very common for PhD students (not just in alignment) to tell other people not to do a PhD. This is very concerning.
If I were an impact-maximizer I might do a PhD, but as a person who is fairly committed to not being depressed, it seems obvious that I should probably not do one and instead look for alternative routes to becoming a research lead.
I'd be interested to hear whether you disagree with these points (you seem to like your PhD!), or whether this post was just meant to address the claim that it doesn't train you to be a good researcher.
The $k$th element of $q$.
Sounds good
I was under the impression that riding a motorcycle even with proper protection is still very dangerous?
Perhaps worth noting: one of the three resignations, Aleksander Madry, was head of the Preparedness team, which is responsible for preventing risks from AI such as self-replication.
You're right. For some reason, I thought EPIC was a pseudometric on the quotient space and not on the full reward space.
I think this makes the thing I'm saying much less useful.
The reason we're using a uniform distribution is that it falls naturally out of the math, but maybe an intuitive explanation is the following: the reason it feels weird is that most realistic distributions only sample from a small set of states/actions, whereas requiring closeness under the uniform distribution more or less requires the reward functions to be similar across most states/actions. So it's encoding something about generalization.
So more concretely, this is work towards some sort of RLHF training regime that "provably" avoids Goodharting. The main issue is that a lot of the numbers we're using are quite hard to approximate.
So here's a thing that I think John is pointing at, with a bit more math?:
The divergence is in the distance function.
- In the paper, we define the distance between rewards as the angle between reward vectors.
- So what we do, roughly, is look at the "dot product" $\langle R_0, R_1 \rangle = \mathbb{E}_{(s,a) \sim \mathcal{U}}[R_0(s,a)\, R_1(s,a)]$ for true and proxy rewards $R_0$ and $R_1$, with states/actions sampled from a uniform distribution $\mathcal{U}$. I give a justification for why this is a natural way to define distance in a separate comment.
But the issue here is that this isn't the distribution of states/actions we might see in practice. The corresponding dot product $\mathbb{E}_{(s,a) \sim D}[R_0(s,a)\, R_1(s,a)]$ might be very high if states/actions are instead weighted by drawing them from a distribution $D$ induced by a particular policy (e.g., the policy of "killing lots of snakes without doing anything sneaky to game the reward" in the examples, I think?). But then as people optimize, the policy changes and this number goes down. A uniform distribution is actually likely quite far from any state/action distribution we would see in practice.
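To make the distribution-dependence concrete, here is a minimal numerical sketch (my own illustration, not from the paper; the reward values, the 16-state/4-action space, and the policy weighting are all made up). It compares the cosine similarity between a true and a proxy reward vector under a uniform state/action distribution versus under a narrow policy-induced distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions = 16, 4
n = n_states * n_actions  # rewards as flat vectors over (s, a) pairs

# Hypothetical true reward, and a proxy that agrees on a small "visited" set
# of state/action pairs but is unrelated elsewhere.
true_reward = rng.normal(size=n)
proxy_reward = rng.normal(size=n)
visited = rng.choice(n, size=8, replace=False)  # pairs a narrow policy actually hits
proxy_reward[visited] = true_reward[visited]    # proxy matches the true reward there

def weighted_cosine(r0, r1, weights):
    """Cosine similarity between reward vectors under a distribution over (s, a)."""
    dot = np.sum(weights * r0 * r1)
    norm0 = np.sqrt(np.sum(weights * r0 * r0))
    norm1 = np.sqrt(np.sum(weights * r1 * r1))
    return dot / (norm0 * norm1)

# Uniform distribution over all state/action pairs.
uniform = np.full(n, 1.0 / n)

# A policy-induced distribution that puts almost all mass on the visited pairs.
policy_dist = np.full(n, 0.01 / (n - len(visited)))
policy_dist[visited] = 0.99 / len(visited)

print("cosine under uniform distribution:", weighted_cosine(true_reward, proxy_reward, uniform))
print("cosine under policy distribution: ", weighted_cosine(true_reward, proxy_reward, policy_dist))
```

Under the narrow policy distribution the two rewards look almost perfectly aligned, while under the uniform distribution (the angle-based metric, as I read it) they come out quite far apart.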
In other words, the way we formally define reward distance here will often not match how "close" two reward functions seem: lots of cases of "Goodharting" are cases where two reward functions seem close on a particular state/action distribution but aren't close according to our distance metric.
This makes the results of the paper primarily useful for working towards training regimes where we optimize the proxy and can approximate the distance, which is described in Appendix F of the paper. This is because as we optimize the proxy it starts to generalize, and then the over-optimization problems described in the paper start to matter a lot more.
An important part of the paper that I think is easily missed, and useful for people doing work on distances between reward vectors:
There is some existing literature on defining distances between reward functions (e.g., see Gleave et al.). However, all proposed distances are only pseudometrics.
A bit about distance functions:
Commonly, two reward functions are defined to be the same (e.g., see Skalse et al.) if they're equivalent up to scaling the reward function and introducing potential shaping. By the latter, I mean that two reward functions are the same if one is $R(s,a,s')$ and the other is of the form $R(s,a,s') + \gamma\,\Phi(s') - \Phi(s)$ for some potential function $\Phi$ and discount $\gamma$. This is because Ng et al. show that these make up all the reward functions that we know give the same optimal policy as the original reward across all environments (with the same state/action space).
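As a sanity check on this equivalence, here is a small sketch (my own illustration, not from any of the cited papers): it builds a random finite MDP, applies potential shaping $R'(s,a,s') = R(s,a,s') + \gamma\,\Phi(s') - \Phi(s)$ with an arbitrary potential $\Phi$, and confirms via value iteration that the greedy optimal policy is unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)
n_s, n_a, gamma = 5, 3, 0.9

# Random finite MDP: transitions P[s, a, s'] and reward R[s, a, s'].
P = rng.random((n_s, n_a, n_s))
P /= P.sum(axis=2, keepdims=True)
R = rng.normal(size=(n_s, n_a, n_s))

def greedy_policy(R, P, gamma, iters=2000):
    """Value iteration on R(s, a, s'); returns the greedy (optimal) policy."""
    V = np.zeros(n_s)
    for _ in range(iters):
        Q = np.einsum('ijk,ijk->ij', P, R + gamma * V[None, None, :])
        V = Q.max(axis=1)
    return Q.argmax(axis=1)

# Potential shaping with an arbitrary potential function Phi.
Phi = rng.normal(size=n_s)
R_shaped = R + gamma * Phi[None, None, :] - Phi[:, None, None]

print(greedy_policy(R, P, gamma))         # optimal policy for the original reward
print(greedy_policy(R_shaped, P, gamma))  # identical policy for the shaped reward
```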
This leads us to the following important claim:
Projecting reward vectors onto the orthogonal complement of the subspace spanned by potential-shaping terms, and taking the angle between the projections, is a perfect distance metric according to these desiderata.
Why: It can easily be shown that it's a metric, provided it's well-defined under the equivalence relation. It can also be shown that the locus of reward functions that give the same projection as $R$ onto this subspace is exactly the set of potential-shaped versions of $R$. Since the angle is also invariant to positive scaling, the claim pretty clearly follows.
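Here is a sketch of what I understand the construction to be (my own reconstruction, with made-up dimensions and notation, not code from the paper): treat rewards as vectors over $(s,a,s')$, build a basis for the potential-shaping subspace, project rewards onto its orthogonal complement, and take the angle between the projections. If the construction is right, the angle should be invariant to potential shaping and positive rescaling.

```python
import numpy as np

rng = np.random.default_rng(2)
n_s, n_a, gamma = 5, 3, 0.9

def shaping_basis(n_s, n_a, gamma):
    """Basis of the potential-shaping subspace: one direction per potential Phi = e_i."""
    dirs = []
    for i in range(n_s):
        Phi = np.zeros(n_s)
        Phi[i] = 1.0
        D = gamma * Phi[None, None, :] - Phi[:, None, None]
        dirs.append(np.broadcast_to(D, (n_s, n_a, n_s)).ravel())
    return np.stack(dirs, axis=1)  # columns span the shaping subspace

def project_out_shaping(r, B):
    """Project a flat reward vector onto the orthogonal complement of span(B)."""
    Q, _ = np.linalg.qr(B)
    return r - Q @ (Q.T @ r)

def reward_angle(r0, r1, B):
    p0, p1 = project_out_shaping(r0, B), project_out_shaping(r1, B)
    cos = np.dot(p0, p1) / (np.linalg.norm(p0) * np.linalg.norm(p1))
    return np.arccos(np.clip(cos, -1.0, 1.0))

B = shaping_basis(n_s, n_a, gamma)
R0 = rng.normal(size=(n_s, n_a, n_s)).ravel()
R1 = rng.normal(size=(n_s, n_a, n_s)).ravel()

# Rescale and potential-shape R1; the angle to R0 should be unchanged.
Phi = rng.normal(size=n_s)
shaping = (gamma * Phi[None, None, :] - Phi[:, None, None] + np.zeros((n_s, n_a, n_s))).ravel()
R1_equiv = 3.0 * R1 + shaping

print(reward_angle(R0, R1, B))
print(reward_angle(R0, R1_equiv, B))  # same angle: projection kills shaping, angle ignores scale
```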
In particular, this seems like the most natural "true" reward metric, and I'm not sure any other "true" metrics have even been proposed before this.