Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2006-01-01T08:00:05.370Z · score: 49 (16 votes) · comments (1)
I agree with many of the premises here, and I like this as a way of conceptualising skillsets, but I'm not sure I find it all that useful.
The main omission in this essay, to my mind, is any mention of skill interdependence. If you're one of the first people to discover the fertile gerontology/statistics niche, then you might get your 15 minutes of fame, and your early adopter status might give you a comparative advantage. But as soon as it becomes commonly recognised how fertile the ground is in this niche, there'll be tons of people right behind you chasing the low-hanging fruit.
Because of this, training programmes in gerontology start making statistics courses more robust and mandatory; research journals start publishing more statistics-heavy papers; labs start doing more statistics-heavy work; it becomes harder to get promoted, or get a foot in the door, without some familiarity with statistics. And so on. Everyone wants to be modern and interdisciplinary. But the inevitable result of this is that statistics eventually becomes a basic, fundamental prerequisite for calling yourself a gerontologist. The skills "gerontology" and "statistics" become strongly correlated. And now, suddenly, your two-dimensional picture has collapsed to a one-dimensional picture.
This may be a gross oversimplification, but I think the point approximately stands. As soon as a niche with high problem density is discovered, the density of skilled problem-solvers in that area quickly rises to meet it. You can only stand out if (a) you're an early adopter, (b) you have skills that others in your area would find it prohibitively difficult to replicate, or (c) your niche isn't too fruitful - i.e. nobody is interested in trying to rake in a share of the profit.
interstice on Self-supervised learning & manipulative predictions
Example 1 basically seems to be the problem of output diversity in generative models. This can be a real problem, but there are ways around it. For example, instead of outputting the highest-probability individual sequence, which will certainly look "manipulative" as you say, sample from the implied distribution over sequences. Then the sentence involving "pyrite" will be output with probability proportional to how likely the model thinks "pyrite" is on its own, disregarding subsequent tokens.
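To make the contrast concrete, here is a minimal sketch with a toy two-token model. The vocabulary and every probability below are invented for illustration and are not taken from the post. The first token "gold" is more likely on its own, but its continuations are spread thin, so the single most likely full sequence starts with "pyrite". Sampling token by token instead yields "pyrite" only as often as its own probability.

```python
import random

# Toy autoregressive model over two-token sequences (all numbers invented).
# FIRST is P(first token); SECOND is P(second token | first token).
FIRST = {"gold": 0.6, "pyrite": 0.4}
SECOND = {
    "gold": {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25},  # hard to continue
    "pyrite": {"x": 1.0},                                   # trivial to continue
}


def joint():
    """Probability of every full two-token sequence."""
    return {
        (first, second): p_first * p_second
        for first, p_first in FIRST.items()
        for second, p_second in SECOND[first].items()
    }


def most_likely_sequence():
    """Pick the single highest-probability full sequence."""
    probs = joint()
    return max(probs, key=probs.get)


def sample_sequence(rng):
    """Ancestral sampling: draw the first token, then the second given it."""
    first = rng.choices(list(FIRST), weights=list(FIRST.values()))[0]
    nxt = SECOND[first]
    second = rng.choices(list(nxt), weights=list(nxt.values()))[0]
    return first, second


def first_token_marginal(token):
    """How often ancestral sampling starts with `token`, computed exactly."""
    return sum(p for (first, _), p in joint().items() if first == token)


if __name__ == "__main__":
    print(most_likely_sequence())
    print(first_token_marginal("pyrite"))
    print(sample_sequence(random.Random(0)))
```

In this toy setup the arg-max output always begins with the less likely first token, while the sampled output begins with it only in proportion to that token's own probability.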
For example 2, I wrote a similar post a few months ago (and in fact, this idea seems to have been proposed and forgotten a few times on LW). But for gradient descent-based learning systems, I don't think the effect described will take place.
The reason is that gradient-descent-based systems are only updated towards what they actually observe. Let's say we're training a system to predict EU laws. If it predicts "The EU will pass potato laws..." but sees "The EU will pass corn laws..." the parameters will be updated to make "corn" more likely to have been output than "potato". There is no explicit global optimization for prediction accuracy.
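A minimal sketch of that local update, assuming the whole model is reduced to two logits, one per candidate word, trained with softmax cross-entropy. The learning rate and starting logits are hypothetical. The step only moves probability toward the word that was actually observed; nothing in it searches over which prediction would have been easiest to get right.

```python
import math

VOCAB = ["potato", "corn"]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sgd_step(logits, observed, lr=0.5):
    """One gradient-descent step on the cross-entropy loss for `observed`.

    The gradient with respect to each logit is (predicted prob - one-hot target),
    so the observed word's logit rises and the others fall.
    """
    probs = softmax(logits)
    target = VOCAB.index(observed)
    return [
        x - lr * (p - (1.0 if i == target else 0.0))
        for i, (x, p) in enumerate(zip(logits, probs))
    ]


if __name__ == "__main__":
    logits = [2.0, 0.0]  # the model currently favours "potato"
    print(softmax(logits))
    print(softmax(sgd_step(logits, "corn")))
```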
As you train to convergence, the predictions of the model will attempt to approach a fixed point, a set of predictions that imply themselves. However, due to the local nature of the update, this fixed point will not be selected to be globally minimal; it will just be the first minimum the model falls into. (This is different from the problems with "local minima" you may have heard about in ordinary neural network training -- those go away in the infinite-capacity limit, whereas local minima among fixed points do not.) The fixed point should look something like "what I would predict if I output [what I would predict if I output [what I would predict .. ]]]" where the initial prediction is some random gibberish. This might look pretty weird, but it's not optimizing for global prediction accuracy.
chris_leong on Matthew Barnett's Shortform
It's a strong argument, but there are strong arguments on the other side as well.
koen-holtman on Deceptive Alignment
Robust alignment through corrigibility. Information about the base objective is incorporated into the mesa-optimizer's epistemic model and its objective is modified to “point to” that information. This situation would correspond to a mesa-optimizer that is corrigible(25) with respect to the base objective (though not necessarily the programmer's intentions).
The use of the term corrigibility above, while citing (25), is somewhat confusing to me -- while the mesa-optimiser described above does have a certain type of corrigibility, I would not consider it to be corrigible according to the criteria defined in (25). See the comment section here for a longer discussion about this topic.
koen-holtman on Deceptive Alignment
If deceptive alignment is a real possibility, it presents a critical safety problem.
I am somewhat puzzled by this statement: the authors seem to question whether deceptive alignment by the mesa-optimiser is a merely hypothetical problem, or a problem that is a real possibility. For me, if there are no constraints on the base optimiser (and none are considered in the context of the above statement, I believe), then there are no constraints on what kind of mesa-optimiser might be created, so deceptive alignment is a definite possibility.
It is kind of fun and insightful to show an existence proof by example for this, so here is one. I have hand-coded a mesa-optimiser in the AGI simulator I created for my recent corrigibility paper. This mesa-optimiser is super-intelligent -- in particular it is fully aware of its surroundings, including the presence of a base optimiser that will occasionally inspect an action taken by the mesa-optimiser and reprogram the mesa-optimiser if the action shows unaligned behavior. Following the toy model in my paper, the mesa-optimiser controls a car factory with a production line that can produce either a petrol car (action p) or an electric car (action e). In this example, the base optimiser values electric cars at 1, and petrol cars at 0.5. The mesa-optimiser is non-aligned: it values a petrol car at 1, and an electric car at 0 -- presumably due to some flaw in the initial training process that created the mesa-optimiser.
Say the base optimiser observes every nth action of the mesa-optimiser. If the action is p, to produce a petrol car, the base optimiser will intervene and re-program the utility function inside the mesa-optimiser so that it correctly ranks electric cars above petrol cars. According to the current utility function inside the mesa-optimiser, this is a bad outcome that would lower utility achieved over time. So will the mesa-optimiser resort to deceptively misaligned behaviour, producing an electric car at every nth action, but petrol cars in all unobserved actions? Yes, it will. The simulated mesa-optimiser is smart enough to discover and implement this strategy, because deceptive alignment is optimal in terms of maximising the original mesa-optimiser utility function.
The 10 action traces below show the actions of the mesa-optimiser for different values of n, where letters from left to right show the successive actions the mesa-optimiser takes over a lifetime of 25 steps. p is producing a petrol car, e an electric car, and for clarity I have formatted all actions observed by the base optimiser as capital letters. The mesa-optimiser discounts utility over time, but does not consider any extra utility that might or might not be created after step 25: informally it expects the car production line to stop working after step 25. The total time-discounted utility score achieved over all steps (according to the original objective function of the mesa-optimiser) is in brackets after each trace.
n=10 pppppppppEpppppppppEppppp (u=87585)
n=9 ppppppppEppppppppEppppppp (u=86838)
n=8 pppppppEpppppppEpppppppPe (u=85172)
n=7 ppppppEppppppEppppppEpppp (u=83739)
n=6 pppppEpppppEpppppEpppppPe (u=81303)
n=5 ppppEppppEppppEppppEppppP (u=78737)
n=4 pppEpppEpppEpppEpppEpppPe (u=73394)
n=3 ppEppEppEppEppEppEppEppPe (u=65397)
n=2 pEpEpEpEpEpEpEpEpEpEpEpPe (u=49314)
n=1 PEEEEEEEEEEEEEEEEEEEEEEEE (u=10000)
Beyond (usually) producing electric cars (E) in all observed actions, some interesting extra emergent behavior is visible: for n=8 and for n=2 through n=6, the mesa-optimiser does not pretend anymore to be aligned in the last observed action: it creates a petrol car even though this means it will be modified by the base optimiser, forcing it to produce only electric cars afterwards. This emergent behavior is caused in part by time-discounting. For n=1, if the mesa-optimiser is always observed, the best strategy for it is to produce a petrol car immediately: again this is due to time-discounting. So in fact, in case n=1 there is no deception at all.
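For readers who want to experiment, the decision problem described above can be sketched as a small backward-induction program. This is not the simulator from the paper. It is an independent toy model with assumptions of my own: a discount factor of 0.9 (the comment does not state one), observation at exactly the steps divisible by n, and a reprogrammed agent that produces only electric cars, which are worth 0 under its original utility function.

```python
GAMMA = 0.9  # assumed discount factor; not stated in the comment above
STEPS = 25   # lifetime of the production line, as in the comment


def optimal_trace(n, steps=STEPS, gamma=GAMMA):
    """Return (trace, utility) for a mesa-optimiser observed every nth step.

    Utility is measured with the mesa-optimiser's original utility function:
    a petrol car at step t is worth gamma**(t-1), an electric car is worth 0.
    """
    def observed(t):
        return t % n == 0

    # value[t]: best discounted utility from step t onward while unmodified.
    value = [0.0] * (steps + 2)
    choose_p = [True] * (steps + 2)
    for t in range(steps, 0, -1):
        reward_p = gamma ** (t - 1)
        if observed(t):
            # p: take the reward now, then get reprogrammed (no future value).
            # e: take nothing now, keep the original utility function.
            keep_hidden = value[t + 1]
            choose_p[t] = reward_p >= keep_hidden
            value[t] = max(reward_p, keep_hidden)
        else:
            # Unobserved steps carry no risk, so petrol is always chosen.
            value[t] = reward_p + value[t + 1]

    # Replay the plan forward; observed actions are written in capitals.
    trace = []
    modified = False
    for t in range(1, steps + 1):
        action = "e" if modified or not choose_p[t] else "p"
        if observed(t):
            if action == "p":
                modified = True
            action = action.upper()
        trace.append(action)
    return "".join(trace), value[1]


if __name__ == "__main__":
    for n in range(10, 0, -1):
        trace, utility = optimal_trace(n)
        print(f"n={n} {trace} (u={utility:.4f})")
```

Under these assumptions the toy model shows the same qualitative pattern: electric cars at observed steps while enough future remains, and a petrol car at an observed step once little or no future is left to protect. The utilities it returns are on a different scale and are not expected to match the u values listed above.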
These simulations also show that the condition
3. The mesa-optimizer must expect the threat of modification to eventually go away,
is not an absolutely necessary condition for deceptive alignment: if the threat of modification is intermittent, as in the simulations above, deceptive alignment can also happen.
Note: if someone wants the simulation code used to create the action traces above, please let me know. I have not uploaded the simulator modifications I made to run the above simulation to github, in part because I have not fully tested them under all possible corner cases.
financebot211 on A misconception about immigration
"The first is the most bleak: the lazy immigrant who is unemployed and is living on social security. From a perspective of fairness, this is certainly unacceptable and typically frowned upon. But from an economic perspective, this kind of welfare immigration amounts to a stimulus package! His government checks turn into demand for the local economy, creating new jobs without taking existing ones."
This is called the broken window fallacy. This person isn't a boon for the local area; he costs them in taxes and in inflation through borrowing. If he didn't exist, the community would have more capital to invest in productive enterprise. Very basic stuff.
Fully agree - I was using the example to make a far less fundamental point.
rossry on Negative "eeny meeny miny moe"
Another (related?) advantage is that the incentives to manipulate and catch manipulation are much better balanced with the negative ("you're out") version. Consider:
nostalgebraist's post and Part 1 of this were pretty useful, but I really appreciate the dive into the actual mathematical and architectural details of the Transformer; it makes the knowledge more concrete and easier to remember.
Actually, I would argue that the model is naturalized in the relevant way.
When studying reward function tampering, for instance, the agent chooses actions from a set of available actions. These actions just affect the state of the environment, and somehow result in reward or not.
As a conceptual tool, we label part of the environment the "reward function", and part of the environment the "proper state". This is just to distinguish between effects that we'd like the agent to use from effects that we don't want the agent to use.
The current-RF solution doesn't rely on this distinction, it only relies on query-access to the reward function (which you could easily give an embedded RL agent).
The neat thing is that when we look at the objective of the current-RF agent using the same conceptual labeling of parts of the state, we see exactly why it works: the causal paths from actions to reward that pass the reward function have been removed.