Comment by william_s on Some disjunctive reasons for urgency on AI risk · 2019-02-15T21:59:17.046Z · score: 2 (2 votes) · LW · GW

Re: scenario 3, see The Evitable Conflict, the last story in Isaac Asimov's "I, Robot":

"Stephen, how do we know what the ultimate good of Humanity will entail? We haven't at our disposal the infinite factors that the Machine has at its! Perhaps, to give you a not unfamiliar example, our entire technical civilization has created more unhappiness and misery than it has removed. Perhaps an agrarian or pastoral civilization, with less culture and less people would be better. If so, the Machines must move in that direction, preferably without telling us, since in our ignorant prejudices we only know that what we are used to, is good – and we would then fight change. Or perhaps a complete urbanization, or a completely caste-ridden society, or complete anarchy, is the answer. We don't know. Only the Machines know, and they are going there and taking us with them."
Comment by william_s on HCH is not just Mechanical Turk · 2019-02-13T00:15:16.037Z · score: 4 (2 votes) · LW · GW

Yeah, to some extent. In the Lookup Table case, you need to have a (potentially quite expensive) way of resolving all mistakes. In the Overseer's Manual case, you can also leverage humans to do some kind of more robust reasoning (for example, they can notice a typo in a question and still respond correctly, even if the Lookup Table would fail in this case). Though in low-bandwidth oversight, the space of things that participants could notice and correct is fairly limited.

Though I think this still differs from HRAD in that the output of HRAD seems like it would be a much smaller thing in terms of description length than the Lookup Table, and you can buy extra robustness by adding many more human-reasoned things into the Lookup Table (ie. automatically add versions of all questions with typos that don't change the meaning of a question, add 1000 different sanity-check questions to flag when things go wrong).

So I think there are additional ways the system could correct mistaken reasoning relative to what I would think the output of HRAD would look like, but you do need to have processes that you think can correct any way that reasoning goes wrong. So the problem could be less difficult than HRAD, but still tricky to get right.

Comment by william_s on The Argument from Philosophical Difficulty · 2019-02-11T17:47:18.818Z · score: 3 (2 votes) · LW · GW

Thanks, this position makes more sense in light of Beyond Astronomical Waste (I guess I have some concept of "a pretty good future" that is fine with something like a bunch of human-descended beings living happy lives but misses out on the sorts of things mentioned in Beyond Astronomical Waste, and an "optimal future" which includes those considerations). I buy this as an argument that "we should put more effort into making philosophy work to make the outcome of AI better, because we risk losing large amounts of value", rather than "our efforts to get a pretty good future are doomed unless we make tons of progress on this" or something like that.

"Thousands of millions" was a typo.

Comment by william_s on Thoughts on reward engineering · 2019-02-10T22:31:38.378Z · score: 4 (2 votes) · LW · GW
What is the motivation for using RL here?

I see the motivation as: given practical compute limits, it may be much easier to have the system find an action the overseer approves of than to imitate the overseer directly. Using RL also allows you to use any advances made in RL by the machine learning community to try to remain competitive.

Comment by william_s on Thoughts on reward engineering · 2019-02-10T22:28:42.330Z · score: 4 (2 votes) · LW · GW
Would this still be a problem if we were training the agent with SL instead of RL?

Maybe this could happen with SL if SL does some kind of large search and finds a solution that looks good but is actually bad. The distilled agent would then learn to identify this action and reproduce it, which implies the agent learning some facts about the action to efficiently locate it with much less compute than the large search process. Knowing what the agent knows would allow the overseer to learn those facts, which might help in identifying this action as bad.

Comment by william_s on Reinforcement Learning in the Iterated Amplification Framework · 2019-02-10T22:09:27.681Z · score: 1 (1 votes) · LW · GW
I don't understand why we want to find this X* in the imitation learning case.

Ah, with this example the intent was more like "we can frame what the RL case is doing as finding X*; let's show how we could accomplish the same thing in the imitation learning case (in the limit of unlimited compute)".

The reverse mapping (imitation to RL) just consists of applying reward 1 to M2's demonstrated behaviour (which could be "execute some safe search and return the results"), and reward 0 to everything else.
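
As a minimal illustration of that direction (all names here are mine, not from the original discussion), the demonstration policy induces a 0/1 reward:

```python
# Illustrative sketch (names are invented): turn a demonstration policy M2
# into a reward function by rewarding exact matches with the demonstration.
def make_reward(demonstration):
    def reward(observation, action):
        return 1.0 if action == demonstration(observation) else 0.0
    return reward

# M2's demonstrated behaviour: "execute some safe search and return the results"
m2 = lambda obs: f"safe-search({obs})"
reward = make_reward(m2)
assert reward("query", "safe-search(query)") == 1.0
assert reward("query", "anything else") == 0.0
```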

What is pM(X∗)?

pM(X∗) is the probability of M outputting X∗ (where M is a stochastic policy)

M2("How good is answer X to Y?") ∗ ∇ log(pM(X))

This is the REINFORCE gradient estimator (which tries to increase the log probability of actions that were rated highly)
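
For concreteness, here is a small self-contained sketch of the REINFORCE estimator with a softmax policy over candidate answers, where M2's rating stands in for the reward (the setup and numbers are invented for illustration):

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

# Invented setup: M is a stochastic softmax policy over 3 candidate answers X;
# M2's rating of each answer plays the role of the reward.
rng = np.random.default_rng(0)
logits = np.zeros(3)
ratings = np.array([0.1, 0.9, 0.2])  # stand-in for M2("How good is answer X to Y?")

lr = 0.2
for _ in range(2000):
    p = softmax(logits)
    x = rng.choice(3, p=p)        # sample X ~ pM
    grad_logp = -p                # ∇ log pM(x) for a softmax policy
    grad_logp[x] += 1.0           # is one_hot(x) - p
    logits += lr * ratings[x] * grad_logp  # rating * ∇ log pM(X)

p = softmax(logits)
assert p.argmax() == 1  # the highest-rated answer has become the most probable
```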

Comment by william_s on Announcement: AI alignment prize round 4 winners · 2019-02-10T19:06:00.792Z · score: 4 (2 votes) · LW · GW

I guess the question was more from the perspective of: if the cost were zero then it seems like it would be worth running, so what part of the cost makes it not worth running (where I would think of cost as probably time to judge or availability of money to fund the contest)?

Comment by william_s on The Argument from Philosophical Difficulty · 2019-02-10T19:02:57.074Z · score: 6 (3 votes) · LW · GW

One important dimension to consider is how hard it is to solve philosophical problems well enough to have a pretty good future (which includes avoiding bad futures). It could be the case that this is not so hard, but fully resolving questions so we could produce an optimal future is very hard or impossible. It feels like this argument implicitly relies on assuming that "solve philosophical problems well enough to have a pretty good future" is hard (ie. takes thousands of millions of years in scenario 4) - can you provide further clarification on whether/why you think that is the case?

Comment by william_s on Announcement: AI alignment prize round 4 winners · 2019-02-09T17:43:58.790Z · score: 6 (3 votes) · LW · GW

Slightly disappointed that this isn't continuing (though I didn't submit to the prize, I submitted to Paul Christiano's call for possible problems with his approach which was similarly structured). Was hoping that once I got further into my PhD, I'd have some more things worth writing up, and the recognition/a bit of prize money would provide some extra motivation to get them out the door.

What do you feel is the limiting resource that keeps this from being useful to continue in its current form?

Comment by william_s on HCH is not just Mechanical Turk · 2019-02-09T17:10:56.001Z · score: 1 (1 votes) · LW · GW

Yeah, this is a problem that needs to be addressed. It feels like in the Overseer's Manual case you can counteract this by giving definitions/examples of how you want questions to be interpreted, and in the Lookup Table case this can be addressed by coordination within the team creating the lookup table.

## Reinforcement Learning in the Iterated Amplification Framework

2019-02-09T00:56:08.256Z · score: 24 (6 votes)

## HCH is not just Mechanical Turk

2019-02-09T00:46:25.729Z · score: 36 (14 votes)
Comment by william_s on Can there be an indescribable hellworld? · 2019-01-31T20:03:23.882Z · score: 1 (1 votes) · LW · GW

Do you think you'd agree with a claim of this form applied to corrigibility of plans/policies/actions?

That is: if some plan/policy/action is incorrigible, then A can provide some description of how the action is incorrigible.

Comment by william_s on Why we need a *theory* of human values · 2018-12-29T00:01:46.830Z · score: 3 (2 votes) · LW · GW
The better we can solve the key questions ("what are these 'wiser' versions?", "how is the whole setup designed?", "what questions exactly is it trying to answer?"), the better the wiser ourselves will be at their tasks.

I feel like this statement suggests that we might not be doomed if we make a bunch of progress, but not full progress, on these questions. I agree with that assessment, but on reading the post it felt like it was making the claim "Unless we fully specify a correct theory of human values, we are doomed".

I think that I'd view something like Paul's indirect normativity approach as requiring that we do enough thinking in advance to get some critical set of considerations known by the participating humans, but once that's in place we should be able to go from this core set to get the rest of the considerations. And it seems possible that we can do this without a fully-solved theory of human value (but any theoretical progress in advance we can make on defining human value is quite useful).

Comment by william_s on Three AI Safety Related Ideas · 2018-12-20T21:36:23.176Z · score: 6 (3 votes) · LW · GW

My interpretation of what you're saying here is that the overseer in step #1 can do a lot of things to bake in having the AI interpret "help the user get what they really want" in ways that get the AI to try to eliminate human safety problems for the step #2 user (possibly entirely), but problems might still occur in the short term before the AI is able to think/act to remove those safety problems.

It seems to me that this implies that IDA essentially solves the AI alignment portion of points 1 and 2 in the original post (modulo things happening before the AI is in control).

Comment by william_s on A comment on the IDA-AlphaGoZero metaphor; capabilities versus alignment · 2018-07-19T21:41:18.523Z · score: 1 (1 votes) · LW · GW

Correcting all problems in the subsequent amplification stage would be a nice property to have, but I think IDA can still work even if it corrects errors with multiple A/D steps in between (as long as all catastrophic errors are caught before deployment). For example, I could think of the agent initially using some rules for how to solve math problems where distillation introduces some mistake, but later in the IDA process the agent learns how to rederive those rules and realizes the mistake.

Comment by william_s on A general model of safety-oriented AI development · 2018-06-13T20:21:35.174Z · score: 8 (3 votes) · LW · GW

Shorter name candidates:

Inductively Aligned AI Development

Inductively Aligned AI Assistants

Comment by william_s on A general model of safety-oriented AI development · 2018-06-13T20:20:03.086Z · score: 6 (2 votes) · LW · GW

It's a nice property of this model that it prompts consideration of the interaction between humans and AIs at every step (to highlight things like risks of the humans having access to some set of AI systems for manipulation or moral hazard reasons).

Comment by william_s on Poker example: (not) deducing someone's preferences · 2018-06-13T18:53:25.062Z · score: 4 (1 votes) · LW · GW

In the higher dimensional belief/reward space, do you think that it would be possible to significantly narrow down the space of possibilities (so the argument would be "be bayesian with respect to rewards/beliefs, picking policies that work over a distribution"), or are you more pessimistic than that, thinking that the uncertainty in higher dimensional spaces would be so great that it would not be possible to pick a good policy?

Comment by william_s on Amplification Discussion Notes · 2018-06-01T19:04:19.114Z · score: 12 (3 votes) · LW · GW

Open Question: Working with concepts that the human can’t understand

Question: when we need to assemble complex concepts by learning/interacting with the environment, rather than using H's concepts directly, and when those concepts influence reasoning in subtle/abstract ways, how do we retain corrigibility/alignment?

Paul: I don't have any general answer to this, seems like we should probably choose some example cases. I'm probably going to be advocating something like "Search over a bunch of possible concepts and find one that does what you want / has the desired properties."

E.g. for elegant proofs, you want a heuristic that gives successful lines of inquiry higher scores. You can explore a bunch of concepts that do that, evaluate each one according to how well it discriminates good from bad lines of inquiry, and also evaluate other stuff like "What would I infer from learning that a proof is elegant other than that it will work" and make sure that you are OK with that.

Andreas: Suppose you don't have the concepts of "proof" and "inquiry", but learned them (or some more sophisticated analogs) using the sort of procedure you outlined below. I guess I'm trying to see in more detail that you can do a good job at "making sure you're OK with reasoning in ways X" in cases where X is far removed from H's concepts. (Unfortunately, it seems to be difficult to make progress on this by discussing particular examples, since examples are necessarily about concepts we know pretty well.)

This may be related to the more general question of what sorts of instructions you'd give H to ensure that if they follow the instructions, the overall process remains corrigible/aligned.

Comment by william_s on Amplification Discussion Notes · 2018-06-01T19:04:01.100Z · score: 9 (2 votes) · LW · GW

Open Question: Severity of “Honest Mistakes”

In the discussion about creative problem solving, Paul said that he was concerned about problems arising when the solution generator was deliberately searching for a solution with harmful side effects. Other failures could occur where the solution generator finds a solution with harmful side effects without "deliberately searching" for it. The question is how bad these "honest mistakes" would end up being.

Paul: I also want to make the further claim that such failures are much less concerning than what-I'm-calling-alignment failures, which is a possible disagreement we could dig into (I think Wei Dai disagrees or is very unsure).

## Amplification Discussion Notes

2018-06-01T19:03:35.294Z · score: 41 (10 votes)
Comment by william_s on Challenges to Christiano’s capability amplification proposal · 2018-05-26T22:58:13.323Z · score: 10 (2 votes) · LW · GW
I would solve X-and-only-X in two steps:
First, given an agent and an action which has been optimized for undesirable consequence Y, we'd like to be able to tell that the action has this undesirable side effect. I think we can do this by having a smarter agent act as an overseer, and giving the smarter agent suitable insight into the cognition of the weaker agent (e.g. by sharing weights between the weak agent and an explanation-generating agent). This is what I'm calling informed oversight.
Second, given an agent, identify situations in which it is especially likely to produce bad outcomes, or proofs that it won't, or enough understanding of its internals that you can see why it won't. This is discussed in “Techniques for Optimizing Worst-Case Performance.”

Paul, I'm curious whether you'd see it as necessary, for these techniques to work, that the optimization target is pretty good/safe (but not perfect): ie. some safety comes from the fact that agents optimized for approval or imitation only have a limited class of Y's that they might also end up being optimized for.

Comment by william_s on Challenges to Christiano’s capability amplification proposal · 2018-05-26T22:54:32.134Z · score: 16 (3 votes) · LW · GW
So I also don't see how Paul expects the putative alignment of the little agents to pass through this mysterious aggregation form of understanding, into alignment of the system that understands Hessian-free optimization.

My model of Paul's approach sees the alignment of the subagents as just telling you that no subagent is trying to actively sabotage your system (ie. by optimizing to find the worst possible answer to give you), and that the alignment comes from having thought carefully about how the subagents are supposed to act in advance (in a way that could potentially be run just by using a lookup table).

Comment by william_s on Resolving human values, completely and adequately · 2018-05-16T18:23:05.370Z · score: 5 (2 votes) · LW · GW

Glad to see this work on possible structure for representing human values which can include disagreement between values and structured biases.

I had some half-formed ideas vaguely related to this, which I think map onto an alternative way to resolve self reference.

Rather than just having one level of values that can refer to other values on the same level (which potentially leads to a self-reference cycle), you could instead explicitly represent each level of value, with level 0 values referring to concrete reward functions, level 1 values endorsing or negatively endorsing level 0 values, and generally level n values only endorsing or negatively endorsing level n-1 values. This might mean that you have some kinds of values that end up being duplicated between multiple levels. For any n, there's a unique solution to the level of endorsement for every concrete value. We can then consider the limit as n->infinity as the true level of endorsement. This allows for situations where the limit fails to converge (ie. it alternates between different values at odd and even levels), which seems like a way to handle self reference contradictions (possibly also the all-or-nothing problem if it results from a conflict between meta-levels).

I think this maps onto the case where we don't distinguish between value levels if we define a function that just adjusts the endorsement of each value by the values that directly refer to it. Then iterating this function n times gives the equivalent of having an n-level meta-hierarchy.
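
A toy version of that iteration, with an invented endorsement matrix (nothing here is from the post; it just shows how the limit can fail to converge by oscillating between states):

```python
import numpy as np

# Invented example: endorse[i, j] is how strongly value i endorses (+)
# or negatively endorses (-) value j.
endorse = np.array([
    [0.0,  1.0,  0.0],   # value 0 endorses value 1
    [0.0,  0.0, -1.0],   # value 1 negatively endorses value 2
    [1.0,  0.0,  0.0],   # value 2 endorses value 0
])

def step(weights):
    # Adjust each value's weight by the endorsements flowing into it,
    # then renormalize (one arbitrary choice of update rule).
    w = weights + endorse.T @ weights
    return w / np.abs(w).sum()

w = np.ones(3) / 3
trajectory = [w]
for _ in range(50):
    w = step(w)
    trajectory.append(w)

# The iteration never settles: it rotates through a period-12 cycle, the
# kind of alternation between levels that signals a self-reference conflict.
assert np.allclose(trajectory[50], trajectory[38])
assert not np.allclose(trajectory[50], trajectory[49])
```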

I think there might be interesting work in mapping this strategy into some simple value problem, and then trying to perform bayesian value learning in that setting with some reasonable prior over values/value endorsements.

Comment by william_s on Can corrigibility be learned safely? · 2018-04-24T18:18:04.818Z · score: 4 (1 votes) · LW · GW

Ah, right. I guess I was balking at moving from exorbitant to exp(exorbitant). Maybe it's better to think of this as reducing the size of fully worked initial overseer example problems that can be produced for training/increasing the number of amplification rounds that are needed.

So my argument is more an example of what a distilled overseer could learn as an efficient approximation.

Comment by william_s on Can corrigibility be learned safely? · 2018-04-24T16:42:01.486Z · score: 4 (1 votes) · LW · GW

I guess what we're trying to unpack is "the mechanism that makes decisions from that database", and whether it can be efficient. If you store all experience with no editing, you're left with the same problem of "what decisions in the tree do you change based on experience (X,y) and how do you change them?" And it seems like the most naive way of doing this is too inefficient (looking at all data at all decision nodes).

If you summarize experience (what I'm interpreting "decide how to update some small sketch" as meaning), then the question is can you produce that summary in an efficient way, and efficiently use it at runtime?

The gradient-descent style solution is meant to gesture how this could be done in a way analogous to how we use parametric models rather than non-parametric models in ML for decently sized datasets. In advance, bake in a set of parameters/information that could influence each decision, then do updates by having the overseer look for ways to change the decision parameters. But when we do this in an oversight model, we can make the decisions/parameters interpretable sentences.

Rough example of how I see the overall process going:

1. amplified overseer acts in an environment with a breakable vase, but the overseer does not know that the vase is breakable

2. The first time it acts, some overseer node N looks into the database for any information related to the vase, finds nothing, and proceeds with its decision

3. The vase breaks, and the negative feedback is given to the amplified overseer

4. The amplified overseer runs some process to figure out that it could have made a better decision if the database had contained the knowledge that vases are breakable, and so updates the database. (The thing that Wei Dai and I are trying to unpack is: can this step be run in a reasonable amount of time, given that each node of the overseer has only a limited understanding of the original decision process?)

5. The second time it acts, the amplified overseer process is the same up until the node N that does the database lookup is run. This time, the lookup returns that the vase is breakable, so the overseer takes a different decision path and treats the vase with more care.
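
The five steps above could be caricatured in code as follows (everything here is an invented stand-in for the real amplified-overseer machinery):

```python
# Invented stand-in for steps 1-5: an overseer node that consults and
# updates a shared background-knowledge database between episodes.
database = []  # natural-language facts accumulated from feedback

def lookup(query):
    # steps 2/5: node N searches the database for relevant information
    return [fact for fact in database if query in fact]

def act(obj):
    if any("breakable" in fact for fact in lookup(obj)):
        return "handle with care"
    return "handle normally"

def incorporate_feedback(obj, outcome):
    # step 4: work out what knowledge would have changed the decision
    if outcome == "it broke":
        database.append(f"{obj} is breakable")

# Steps 1-3: no relevant knowledge yet, so the vase breaks.
first = act("vase")
incorporate_feedback("vase", "it broke")
# Step 5: the lookup now succeeds and a different decision path is taken.
second = act("vase")
assert (first, second) == ("handle normally", "handle with care")
```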

Comment by william_s on Can corrigibility be learned safely? · 2018-04-23T23:57:13.728Z · score: 9 (2 votes) · LW · GW
What if the current node is responsible for the error instead of one of the subqueries, how do you figure that out?

I think you'd need to form the decomposition in such a way that you could fix any problem by perturbing something in the world representation (an extreme version: the method for performing every operation is contained in the world representation and looked up, so you can adjust it in the future).

When you do backprop, you propagate the error signal through all the nodes, not just through a single path that is "most responsible" for the error, right? If you did this with meta-execution, wouldn't it take an exponential amount of time?

One step of this method, as in backprop, has the same time complexity as the forward pass (running meta-execution forward; I wouldn't call this exponential complexity, as I think the relevant baseline is the number of nodes in the meta-execution forward tree). You only need to process each node once (when the backprop signal for its output is ready), and need to do a constant amount of work at each node (figuring out all the ways to perturb the node's inputs).

The catch is that, as with backprop, maybe you need to run multiple steps to get it to actually work.

And what about nodes that are purely symbolic, where there are multiple ways the subnodes (or the current node) could have caused the error, so you couldn't use the right answer for the current node to figure out what the right answer is from each subnode? (Can you in general structure the task tree to avoid this?)

The default backprop answer to this is to shrug and adjust all of the inputs (which is what you get from taking the first order gradient). If this causes problems, then you can fix them in the next gradient step. That seems to work in practice for backprop in continuous models. For discrete models like this it might be a bit more difficult - if you start trying out different combinations to see if they work, that's where you'd get exponential complexity. But we could counter this by potentially having cases where, based on understanding the operation, we could intelligently avoid some branches - I think this could potentially wash out to linear complexity in the number of forward nodes if it all works well.

I wonder if we're on the right track at all, or if Paul has an entirely different idea about this.

So do I :)

Comment by william_s on Can corrigibility be learned safely? · 2018-04-23T18:33:51.958Z · score: 9 (2 votes) · LW · GW

Huh, I hadn't thought of this as trying to be a direct analogue of gradient descent, but now that I think about your comment that seems like an interesting way to approach it.

A human debugging a translation software could look at the return value of some high-level function and ask "is this return value sensible" using their own linguistic intuition, and then if the answer is "no", trace the execution of that function and ask the same question about each of the function it calls. This kind of debugging does not seem available to meta-execution trying to debug itself, so I just don't see any way this kind of learning / error correction could work.

I'm curious now whether you could run a more efficient version of gradient descent if you replace the gradient at each step with an overseer human who can harness some intuition to try to do better than the gradient.

Comment by william_s on Can corrigibility be learned safely? · 2018-04-23T15:55:45.522Z · score: 4 (1 votes) · LW · GW
What if the field of linguistics as a whole is wrong about some concept or technique, and as a result all of the humans are wrong about that? It doesn't seem like using different random seeds would help, and there may not be another approach that can be taken that avoids that concept/technique.

Yeah, I don't think simple randomness would recover from this level of failure (only that it would help with some kinds of errors, where we can sometimes sample from a distribution that doesn't make that error). I don't know if anything could recover from this error in the middle of a computation without reinventing the entire field of linguistics from scratch, which might be too much to ask. However, I think it could be possible to recover from this error if you get feedback about the final output being wrong.

But in IDA, H is fixed and there's no obvious way to figure out which parts of a large task decomposition tree was responsible for the badly translated sentence and therefore need to be changed for next time.

I think the IDA task decomposition tree could be created in such a way that you can reasonably trace back which part was responsible for the misunderstanding/that needs to be changed. The structure you'd need for this is that, given a query, you can figure out which of its children would need to be corrected to get the correct result. So if you have a specific word to correct, you can find the subagent that generated that word, then look at its inputs, see which input needs to be corrected, trace where that came from, etc. This might need to be deliberately engineered into the task decomposition (in the same way that differently written programs accomplishing the same task could be easier or harder to debug).

## Understanding Iterated Distillation and Amplification: Claims and Oversight

2018-04-17T22:36:29.562Z · score: 70 (19 votes)
Comment by william_s on Utility versus Reward function: partial equivalence · 2018-04-16T17:03:02.145Z · score: 4 (1 votes) · LW · GW

Ah, misunderstood that, thanks.

Comment by william_s on Utility versus Reward function: partial equivalence · 2018-04-16T15:08:30.412Z · score: 4 (1 votes) · LW · GW

Say w2a is the world where the agent starts in w2 and w2b is the world that results after the agent moves from w1 to w2.

Without considering the agent's memory as part of the world, it seems like the problem is worse: the only way to distinguish between w2a and w2b is the agent's memory of past events, so it seems that leaving the agent's memory of the past out of the utility function requires U(w2a) = U(w2b).

Comment by william_s on Two clarifications about "Strategic Background" · 2018-04-15T17:01:39.498Z · score: 8 (2 votes) · LW · GW

Would you think that the following approach would fit within "in addition to making alignment your top priority and working really hard to over-engineer your system for safety, also build the system to have the bare minimum of capabilities" and possibly work, or would you think that it would be hopelessly doomed?

• Work hard on designing the system to be safe
• But there's some problem left over that you haven't been able to fully solve, and think will manifest at a certain scale (level of intelligence/optimization power/capabilities)
• Run the system, but limit scale to stay well within the range where you expect it to behave well
Comment by william_s on Utility versus Reward function: partial equivalence · 2018-04-13T16:02:55.058Z · score: 3 (1 votes) · LW · GW

I'm trying to wrap my head around the case where there are two worlds, w1 and w2; w2 is better than w1, but moving from w1 to w2 is bad (ie. killing everyone and replacing them with different people who are happier, which we think is bad).

I think for the equivalence to work in this case, the utility function U also needs to depend on your current state - if it's the same for all states, then the agent would always prefer to move from w1 to w2 and erase its memory of the past when maximizing the utility function, whereas it would act correctly with the reward function.

Comment by william_s on Can corrigibility be learned safely? · 2018-04-10T15:09:59.476Z · score: 3 (1 votes) · LW · GW
how does IDA recover from an error on H's part?

Error recovery could be supported by having a parent agent running multiple versions of a query in parallel with different approaches (or different random seeds).

And also, how does it improve itself using external feedback

I think this could be implemented as: part of the input for a task is a set of information on background knowledge relevant to the task (ie. model of what the user wants, background information about translating the language). The agent can have a task "Update [background knowledge] after receiving [feedback] after providing [output] for task [input]", which outputs a modified version of [background knowledge], based on the feedback.

Comment by william_s on Can corrigibility be learned safely? · 2018-04-09T17:11:24.636Z · score: 3 (1 votes) · LW · GW
The only way I know how to accomplish this is to have IDA emulate the deep learning translator at a very low level, with H acting as a "human transistor" or maybe a "human neuron", and totally ignore what H knows about translation including the meanings of words.

The human can understand the meaning of the word it sees, the human just can't know the context (the words that it doesn't see), and so can't use their understanding of that context.

They could try to guess possible contexts for the word and leverage their understanding of those contexts ("what are some examples of sentences where the word could be used ambiguously?"), but they aren't allowed to know if any of their guesses actually apply to the text they are currently working on (and so their answer is independent of the actual text they are currently working on).

Comment by william_s on Can corrigibility be learned safely? · 2018-04-07T14:40:23.513Z · score: 3 (1 votes) · LW · GW

Okay, I agree that we're on the same page. Amplify(X,n) is what I had in mind.

Comment by william_s on Can corrigibility be learned safely? · 2018-04-05T20:29:22.676Z · score: 3 (1 votes) · LW · GW

Was thinking of things more in line with Paul's version, not this finding ambiguity definition, where the goal is to avoid doing some kind of malign optimization during search (ie. untrained assistant thinks it's a good idea to use the universal prior, then you show them What does the universal prior actually look like?, and afterwards they know not to do that).

Comment by william_s on Can corrigibility be learned safely? · 2018-04-04T22:04:23.262Z · score: 8 (2 votes) · LW · GW
Can you give an example of natural language instruction (for humans operating on small inputs) that can't be turned into a formal algorithm easily?

Any set of natural language instructions for humans operating on small inputs can be turned into a lookup table by executing the human on all possible inputs (multiple times on each input, if you want to capture a stochastic policy).
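
A toy demonstration of that claim, with a made-up stochastic "human" on a tiny input space:

```python
import collections
import itertools
import random

# Made-up stand-in "human": a stochastic policy on 3-character inputs.
random.seed(0)

def human(inp):
    return inp.upper() if random.random() < 0.8 else inp

# Execute the human many times on every possible input to build the table.
inputs = ["".join(t) for t in itertools.product("ab", repeat=3)]
table = {}
for inp in inputs:
    counts = collections.Counter(human(inp) for _ in range(1000))
    table[inp] = {out: n / 1000 for out, n in counts.items()}

# The table now approximates the stochastic policy without further queries.
assert set(table) == set(inputs)
assert abs(table["aba"].get("ABA", 0.0) - 0.8) < 0.1
```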

The human could be prompted with the following: "Consider the sentence [s1] w [s2]", and have the agent launch queries of the form "Consider the sentence [s1] w [s2], where we take w to have meaning m". Now, you could easily produce this behaviour algorithmically if you have a dictionary. But in a world without dictionaries, suitably preparing a human to answer this query takes much less effort than producing a dictionary.

Comment by william_s on Can corrigibility be learned safely? · 2018-04-04T21:52:57.420Z · score: 3 (1 votes) · LW · GW
By "corrigible" here did you mean Paul's definition which doesn't include competence in modeling the user and detecting ambiguities, or what we thought "corrigible" meant (where it does include those things)?

Thinking of "corrigible" as "whatever Paul means when he says corrigible". The idea applies to any notion of corrigibility which allows for multiple actions and does not demand that the action returned be the best possible one for the user.

Comment by william_s on Can corrigibility be learned safely? · 2018-04-04T19:49:11.861Z · score: 3 (1 votes) · LW · GW

What is the difference between "core after a small number of amplification steps" and "core after a large number of amplification steps" that isn't captured in "larger effective computing power" or "larger set of information about the world", and allows the highly amplified core to solve these problems?

Comment by william_s on Can corrigibility be learned safely? · 2018-04-04T17:45:01.393Z · score: 3 (1 votes) · LW · GW

I'm a little confused about what this statement means. I thought that if you have an overseer that implements some reasoning core, and consider amplify(overseer) with infinite computation time and unlimited ability to query the world (ie. for background information on what humans seem to want, how they behave, etc.), then amplify(overseer) should be able to solve any problem that an agent produced by iterating IDA could solve.

Did you mean to say that

• "already highly competent at these tasks" means that the core should be able to solve these problems without querying the world at all, and this is not likely to be possible?
• you don't expect to find a core such that only one round of amplification of amplify(overseer) can solve practical tasks in any reasonable amount of time/number of queries?
• There is some other way that the agent produced by IDA would be more competent than the original amplified overseer?
Comment by william_s on Can corrigibility be learned safely? · 2018-04-04T15:45:26.875Z · score: 17 (4 votes) · LW · GW
Among people I've had significant online discussions with, your writings on alignment tend to be the hardest to understand and easiest to misunderstand.

Additionally, I think that there are ways to misunderstand the IDA approach that leave out significant parts of the complexity (ie. IDA based off of humans thinking for a day with unrestricted input, without doing the hard work of trying to understand corrigibility and meta-philosophy beforehand), but can seem to be plausible things to talk about in terms of "solving the AI alignment problem" if one hasn't understood the more subtle problems that would occur. It's then easy to miss the problems and feel optimistic about IDA working while underestimating the amount of hard philosophical work that needs to be done, or to incorrectly attack the approach for missing the problems completely.

(I think that these simpler versions of IDA might be worth thinking about as a plausible fallback plan if no other alignment approach is ready in time, but only if they are restricted in terms of accomplishing specific tasks to stabilise the world, restricted in how far the amplification is taken, replaced with something better as soon as possible, etc. I also think that working on simple versions of IDA can help make progress on issues that would be required for fully scalable IDA, ie. the experiments that Ought is running.)

Comment by william_s on Can corrigibility be learned safely? · 2018-04-03T19:10:32.587Z · score: 13 (3 votes) · LW · GW

I would see the benefits of humans vs. algorithms being that giving a human a bunch of natural language instructions would be much easier (but harder to verify) than writing down a formal algorithm. Also, the training could just cover how to avoid taking incorrigible actions, and the Overseer could still use their judgement of how to perform competently within the space of corrigible outputs.

Comment by william_s on Can corrigibility be learned safely? · 2018-04-03T19:08:22.234Z · score: 8 (2 votes) · LW · GW

Trying to understand the boundary lines around incorrigibility, looking again at this example from Universality and Security Amplification

For example, suppose meta-execution asks the subquestion “What does the user want?”, gets a representation of their values, and then asks the subquestion “What behavior is best according to those values?” I’ve then generated incorrigible behavior by accident, after taking innocuous steps.

It sounds from this like it only counts as incorrigible if the optimization in “What behavior is best according to those values?” is effectively optimizing for something that the user doesn't want, but is not incorrigible if it is optimizing for something the user doesn't want in a way that the user can easily correct? (so incorrigibility requires something more than just being malign)

One way to describe this is that the decomposition is incorrigible if the models of the user that are used in “What behavior is best according to those values?” are better than the models used in “What does the user want?” (as this could lead the AI to maximize an approximation V* of the user's values V and realize that if the AI reveals to the user that they are maximizing V*, the user will try to correct what the AI is doing, which will perform worse on V*).

So acceptable situations are ones where both subqueries get the same user models, where the first subquery gets a better user model than the second, or where “What behavior is best according to those values?” is performing some form of mild optimization. Is that roughly correct?

Comment by william_s on Can corrigibility be learned safely? · 2018-04-03T16:37:40.094Z · score: 8 (2 votes) · LW · GW

I think that the bank example falls into "intent corrigibility". The action "hack the bank" was output because the AI formed an approximate model of your morals and then optimised that approximate model "too hard", coming up with an action that did well on the proxy but not on the real thing. The understanding of how not to do this doesn't depend on how well you can understand the goal specification, but on the meta-level knowledge that optimizing approximate reward functions can lead to undesired results.

(The AI also failed to ask you clarifying questions about its model of your morals, failed to realize that it could instead have tried to do imitation learning or quantilization to come up with a plan more like what you had in mind, etc.)

I think the argument that worst-case guarantees about "intent corrigibility" are possible is that 1) it only needs to cover the way that the finite "universal core" of queries are handled 2) It's possible to do lots of pre-computation as I discuss in my other comment, as well as delegating to other subagents. So you aren't modelling "Would someone with 15 minutes to think about answering this query find the ambiguity", it's "Would a community of AI researchers with a long time to think about answering this be able to provide training to someone so that they and a bunch of assistants find the ambiguity"? I agree that this seems hard and it could fail, but I think I'm at the point of "let's try this through things like Ought's experiments", and it could either turn out to seem possible or impossible based on that.

(An example of "consequence corrigibility" would be if you were okay with hacking the bank but only as long as it doesn't lead to you going to jail. The AI comes up with a plan to hack the bank that it thinks won't get caught by the police. But the AI underestimated the intelligence of the police, gets caught, and this lands you in jail. This situation isn't "corrigible" in the sense that you've lost control over the world.)

Comment by william_s on Can corrigibility be learned safely? · 2018-04-03T01:28:05.020Z · score: 3 (1 votes) · LW · GW

## Reasons to think IDA could provide corrigibility:

The agents are strictly limited in bits of input (identifying queries from the universal reasoning core) that they can use to influence their subqueries/how they produce their output. They are limited in the amount of thinking that they can do at runtime. But, they are not limited in the amount of "precomputation" that can go into preparing the human for acting in the role of an overseer, and there are lots of ways to use this. We could spend a lot of time training the humans in AI safety thinking beforehand. We could produce an "overseer's manual" which lays out a checklist for how to behave corrigibly, which includes lots of thinking about how things could be misinterpreted if broken down into small queries. We could spend a lot of time red-teaming the humans before allowing them to act as overseers, coming up with queries that might cause them to make a mistake.

This precomputation should be safe because it's done using humans in normal circumstances.

## Reasons to think IDA could provide "consequence corrigibility":

One reason to think it might not be doomed is that we would only get bad outcomes if the overseer is incompetent at corrigibility relative to its competence at acting in the world. I think you would have a stronger argument if there are reasons to think that corrigibility will be especially hard to oversee vs. competence in other domains. I currently think it's somewhat likely that the general competence threshold for intentionally causing bad outcomes is above the competence threshold where the agent could understand its own knowledge of corrigibility.

Arguments that corrigibility is much harder than dangerous levels of general competence (ie. levels that could cause significant damage in the real world) in practice would make me less optimistic about finding a tradeoff point which achieves both "intent corrigibility" and "consequence corrigibility". I do think that there are narrow capabilities that could be dangerous and wouldn't imply understanding of corrigibility - the overseer would need to avoid training these narrow capabilities before training general competence/corrigibility.

Comment by william_s on Can corrigibility be learned safely? · 2018-04-03T01:27:21.538Z · score: 16 (3 votes) · LW · GW

From my current understanding of Paul's IDA approach, I think there are two different senses in which corrigibility can be thought about in regards to IDA, both with different levels of guarantee.

1. On average, the reward function incentivizes behaviour which competes effectively and gives the user effective control.
2. There do not exist inputs on which the policy chooses an action because it is bad, or the value function outputs a high reward because the prior behaviour was bad. (Or else the policy on its own will generate bad consequences.)
3. The reward function never gives a behaviour a higher reward because it is bad. (Or else the test-time optimisation by MCTS can generate bad behaviour.) For example, if the AI deludes the human operator so that the operator can’t interfere with the AI’s behaviour, that behaviour can’t receive a higher reward even if it ultimately allows the AI to make more money.

Property 1 is dealing with "consequence corrigibility" (competence at producing actions that will produce outcomes in the world we would describe as corrigible)

Properties 2&3 are dealing with corrigibility in terms of "intent corrigibility" (guaranteeing that the system does not optimise for bad outcomes). This does not cover the agent incompetently causing bad actions in the world, only the agent deliberately trying to produce bad outcomes.

I think IDA doesn't require or claim worst-case guarantees on the task of "consequence corrigibility" (and that this is an impossible goal for bounded reasoners).

I think that average-case good performance on "consequence corrigibility" is claimed by IDA, but only as a subset of general competence.

I think that providing worst-case guarantees on "intent corrigibility" is required and claimed by IDA.

Roughly, I think that:

• Versions of IDA that allow the overseer nodes more information could be generally competent (including predicting what behaviour could be corrigible), but could fail to be "intent corrigible"
• Versions of IDA that allow the overseer nodes only a highly restricted set of queries could be "intent corrigible" but fail to be generally competent, and hence not be "consequence corrigible"
• Standard ML approaches will, at some level of optimisation power, fail to behave "intent corrigibly" (even if you train them to be "consequence corrigible")

The question I'm uncertain about is whether there's a middle point in tradeoff space where both properties are sufficiently satisfied to produce good outcomes.

Do you agree or disagree with how I've broken down corrigibility claims for IDA, and which claims do you think your argument bears on?

Comment by william_s on Values determined by "stopping" properties · 2018-04-01T16:34:58.522Z · score: 3 (1 votes) · LW · GW

Equivalently, we can say that we don't know how we should define the dimensions of the human values or the distance measure from current human values, and if we pick these definitions arbitrarily, we will end up with arbitrary results.

Comment by william_s on Non-Adversarial Goodhart and AI Risks · 2018-03-27T18:05:08.993Z · score: 7 (2 votes) · LW · GW

Brainstorming approaches to working with causal goodhart

• Low-impact measures that include the change in the causal structure of the world. It might be possible to form a measure like this which doesn't depend on recovering the true causal structure at any point (ie. minimizing the difference between predictions of causal structure in state A and B, even if both of those predictions are wrong)
• Figure out how to elicit human models of causal structure, and provide the human model of causal structure along with the metric, and the AI uses this information to figure out whether it's violating the assumptions that the human made
• Causal transparency: have the AI explain the causal structure of how its plans will influence the proxy. This might allow a human to figure out whether the plan will cause the proxy to diverge from the goal. ie. True goal is happiness, proxy is happiness score as measured by online psychological questionnaire, AI's plan says that it will influence the proxy by hacking into the online psychological questionnaire. You don't need to understand how the AI plans to hack into the server to understand that the plan is diverging the proxy from the goal.
Comment by william_s on Prize for probable problems · 2018-03-21T19:35:43.323Z · score: 8 (2 votes) · LW · GW

1. I don't think that every heuristic that could be used to solve problems of a given depth needs to be applicable to performing the search at the next depth - we only need enough heuristics to be usable to keep increasing the search depth at each amplification round in an efficient manner. It's possible that some of the value of heuristics like "solution is likely to be an output of algorithm G" could be (imperfectly) captured through some small universal set of heuristics that we can specify how to learn and exploit. (I think that variations on "How likely is the search on [partial solution] to produce an answer?" might get us pretty far.)

The AlphaGo analogy is that the original supervised move prediction algorithm didn't necessarily learn every heuristic that the experts used, but just learned enough to be able to efficiently guide the MCTS to better performance.

(Though I do think that imperfectly learning heuristics might cause alignment problems without a solution to the aligned search problem).

2. This isn't a problem if, once the agent can run algorithm G on problems of depth n, it can directly generalize to applying G to problems of depth n+1. Simple deep RL methods aren't good at this kind of task, but things like the Neural Turing Machine are trying to do better on this sort of task. So the ability to learn efficient exponential search could be limited by the underlying agent's capability; for some capability range, a problem could be directly solved by an unaligned agent, but couldn't be solved by an aligned agent. This isn't a problem if we can surpass that level of capability.

I'm not sure that these considerations fix the problem entirely, or whether Paul would take a different approach.

It also might be worth coming up with a concrete example where some heuristics are not straightforward to generalize from smaller to larger problems, and where it seems like this will prevent efficiently learning to solve large problems. The problem, however, would need to be something that humans can solve (ie. finding a string that hashes to a particular value under a cryptographic hash function would be hard to generalize any heuristics from, but I don't think humans could do it either, so it's out of scope).

Comment by william_s on Prize for probable problems · 2018-03-20T22:44:43.777Z · score: 8 (2 votes) · LW · GW

For this example, I think you can do this if you implement the additional query "How likely is the search on [partial solution] to return a complete solution?". This is asked of all potential branches before recursing into them. The distilled agent learns to answer the solution probability query efficiently.

Then in amplification, at the top level of looking for a solution to a problem of length n, the root agent first asks "How likely is the search on [string starting with 0] to return a complete solution?" and "How likely is the search on [string starting with 1] to return a complete solution?". The root agent then queries whichever subtree is most likely to contain a solution. (This doesn't improve worst-case running time, but does improve average-case running time.)

This is analogous to running a value estimation network in tree search, and then picking the most promising node to query first.
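The branch-ordering idea can be sketched as a tiny guided search over binary strings; the target, the oracle-like `estimate` function, and all names here are hypothetical stand-ins for the learned solution-probability query:

```python
def guided_search(is_solution, estimate, n, prefix=""):
    """Search binary strings of length n, querying the more promising
    branch first according to the (learned) solution-probability estimate."""
    if len(prefix) == n:
        return prefix if is_solution(prefix) else None
    # Ask "How likely is the search on [partial solution] to succeed?"
    # for both branches, then recurse into the most promising one first.
    branches = sorted(["0", "1"],
                      key=lambda b: estimate(prefix + b),
                      reverse=True)
    for b in branches:
        found = guided_search(is_solution, estimate, n, prefix + b)
        if found is not None:
            return found
    return None

# Toy instance: the solution is one hidden string, and the estimator
# happens to be perfect, so the search walks straight to it.
target = "1011"
result = guided_search(lambda s: s == target,
                       lambda p: 1.0 if target.startswith(p) else 0.0,
                       4)
```

With a perfect estimator the search visits only the path to the solution; with a noisy one it degrades toward plain depth-first search, which is the average-case-vs-worst-case point above.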

Comment by william_s on Prize for probable problems · 2018-03-20T19:20:47.004Z · score: 8 (2 votes) · LW · GW

Sorry, that was a bit confusing, edited to clarify. What I mean is, you have some algorithm you're using to implement new agents, and that algorithm has a training cost (that you pay during distillation) and a runtime cost (that you pay when you apply the agent). The runtime cost of the distilled agent can be as good as the runtime cost of an unaligned agent implemented by the same algorithm (part of Paul's claim about being competitive with unaligned agents).

## Improbable Oversight, An Attempt at Informed Oversight

2017-05-24T17:43:53.000Z · score: 2 (2 votes)

## Informed Oversight through Generalizing Explanations

2017-05-24T17:43:39.000Z · score: 1 (1 votes)

## Proposal for an Implementable Toy Model of Informed Oversight

2017-05-24T17:43:13.000Z · score: 1 (1 votes)