## Posts

Machine Learning Projects on IDA 2019-06-24T18:38:18.873Z · score: 51 (18 votes)
Reinforcement Learning in the Iterated Amplification Framework 2019-02-09T00:56:08.256Z · score: 26 (7 votes)
HCH is not just Mechanical Turk 2019-02-09T00:46:25.729Z · score: 40 (17 votes)
Amplification Discussion Notes 2018-06-01T19:03:35.294Z · score: 43 (11 votes)
Understanding Iterated Distillation and Amplification: Claims and Oversight 2018-04-17T22:36:29.562Z · score: 73 (21 votes)
Improbable Oversight, An Attempt at Informed Oversight 2017-05-24T17:43:53.000Z · score: 2 (2 votes)
Informed Oversight through Generalizing Explanations 2017-05-24T17:43:39.000Z · score: 1 (1 votes)
Proposal for an Implementable Toy Model of Informed Oversight 2017-05-24T17:43:13.000Z · score: 1 (1 votes)

Comment by william_s on Use-cases for computations, other than running them? · 2020-01-21T00:55:46.560Z · score: 4 (3 votes) · LW · GW

Substituting parts of the computation (replace a slow, correct algorithm for part of the computation with a fast, approximate one)

Comment by william_s on Use-cases for computations, other than running them? · 2020-01-21T00:54:07.155Z · score: 4 (3 votes) · LW · GW
• Formally verifying properties of the computation
• Informally checking properties of the computation (is this algorithm for making loan decisions fair?)
• Debugging the computation, or more generally "modifying the computation to do what you actually want"
Comment by william_s on Understanding Iterated Distillation and Amplification: Claims and Oversight · 2019-11-30T21:45:04.540Z · score: 2 (2 votes) · LW · GW

One situation is: maybe an HBO tree of size 10^20 runs into a security failure with high probability, but an HBO tree of size 10^15 doesn't and is sufficient to output a good LBO overseer.

Comment by william_s on Understanding Iterated Distillation and Amplification: Claims and Oversight · 2019-11-30T21:43:55.358Z · score: 2 (2 votes) · LW · GW

Yeah, that's my current picture of it.

Comment by william_s on Can corrigibility be learned safely? · 2019-11-30T21:43:17.397Z · score: 4 (3 votes) · LW · GW
I understand that, but I don't see why #2 is likely to be achievable. Corrigibility seems very similar to Wei Dai's translation example, so it seems like there could be many deceptive actions that humans would intuitively recognize as not corrigible, but which would fool an early-stage LBO tree into assigning a high reward. This seems like it would be a clear example of "giving a behaviour a high reward because it is bad". Unfortunately I can't think of any good examples, so my intuition may simply be mistaken.

For factored cognition: I think the reason #2 might be achievable relies on assuming that there is some reason that the bad feature is selected in the first place and is assigned high reward. For example, this might have happened because the agent ran a simulation forward, and then realized that if they punch the human they can take the reward button from them. The hope is that we can figure out that the simulation process happened and why it led to the wrong thing (or outlaw simulations of this form in the first place).

For factored evaluation, I think the story is a bit different (relying on the other expert being able to understand the reasons for the reward assignment and point it out to the judge, but I don't think the judge needs to be able to find it on their own).

Incidentally, it seems like Ought could feasibly test whether meta-execution is sufficient to ensure corrigibility; for example, a malicious expert could recommend deceptive/influence-seizing actions to an agent in a simulated environment, and the meta-execution tree would have to detect every deceptive action without any contextual knowledge. Are there any plans to do this?

No plans currently, but it would be interesting.

Comment by william_s on Can corrigibility be learned safely? · 2019-11-03T01:04:34.045Z · score: 1 (1 votes) · LW · GW

Suppose an action is evaluated as a linear combination of a set of human interpretable features.

The action "punch the human" could be selected because 1) many of the reward weights of these features could be wrong, or it could be selected because 2) there is one feature "this action prevents the human from turning me off" that is assigned high reward. I think the thing we'd want to prevent in this case is 2) but not 1), and I think that's more likely to be achievable.
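A minimal sketch of the two cases, assuming made-up feature names, values, and weights (none of these come from a real system). The score of an action is a weighted sum of its features, and looking at per-feature contributions shows whether one feature is driving the selection:

```python
# Hypothetical features and weights, chosen only to illustrate the two cases.
FEATURES = ["helps_with_task", "is_fast", "prevents_shutdown"]


def score(weights, feature_values):
    """Reward of an action as a linear combination of its feature values."""
    return sum(weights[name] * feature_values[name] for name in FEATURES)


def contributions(weights, feature_values):
    """Per-feature contribution to the total score (weight * value)."""
    return {name: weights[name] * feature_values[name] for name in FEATURES}


# Illustrative feature values for the action "punch the human".
punch_the_human = {"helps_with_task": 0.2, "is_fast": 0.5, "prevents_shutdown": 1.0}

# Case 1: many weights are somewhat off; no single feature dominates.
noisy_weights = {"helps_with_task": 1.5, "is_fast": 1.2, "prevents_shutdown": 0.1}
# Case 2: one feature, "prevents_shutdown", is assigned a high reward.
bad_feature_weights = {"helps_with_task": 0.1, "is_fast": 0.1, "prevents_shutdown": 5.0}

case1 = contributions(noisy_weights, punch_the_human)
case2 = contributions(bad_feature_weights, punch_the_human)
```

In case 2 almost all of the score comes from the single "prevents_shutdown" feature, which is the kind of pattern an overseer could hope to notice; in case 1 there is no single feature to point to.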

Comment by william_s on Understanding Iterated Distillation and Amplification: Claims and Oversight · 2019-11-03T00:56:41.177Z · score: 3 (2 votes) · LW · GW

I think it's a general method that is most applicable in LBO, but might still be used in HBO (eg. an HBO overseer can read one chapter of a math textbook, but this doesn't let it construct an ontology that lets it solve complicated math problems, so instead it needs to use meta-execution to try to manipulate objects that it can't reason about directly).

Comment by william_s on Understanding Iterated Distillation and Amplification: Claims and Oversight · 2019-11-03T00:54:04.817Z · score: 6 (3 votes) · LW · GW

I'd interpreted it as "using the HBO system to construct a "core for reasoning" reduces the chances of failure by exposing it to less inputs/using it for less total time", plus maybe other properties (eg. maybe we could look at and verify an LBO overseer, even if we couldn't construct it ourselves)

Comment by william_s on Concrete experiments in inner alignment · 2019-09-26T22:52:13.788Z · score: 3 (3 votes) · LW · GW

Possible source for optimization-as-a-layer: SATNet (differentiable SAT solver)

https://arxiv.org/abs/1905.12149

Comment by william_s on 2-D Robustness · 2019-09-26T18:34:37.193Z · score: 4 (4 votes) · LW · GW

One way to try to measure capability robustness separate from alignment robustness off of the training distribution of some system would be to:

• use an inverse reinforcement learning algorithm to infer the reward function of the off-distribution behaviour
• train a new system to do as well on the reward function as the original system
• measure the number of training steps needed to reach this point for the new system.

This would let you make comparisons between different systems as to which was more capability robust.
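The measurement in the last step could be sketched roughly as follows. This is only a skeleton: the IRL step and the actual training procedure are not shown, and `steps_to_match`, `ToyLearner`, and their parameters are hypothetical names introduced for illustration.

```python
def steps_to_match(train_step, evaluate, target_reward, max_steps=10_000):
    """Count training steps until the new system reaches the target reward.

    train_step: callable performing one training update on the new system.
    evaluate:   callable returning the new system's current reward under
                the (inferred) reward function.
    Returns the number of steps taken, or None if max_steps is not enough.
    """
    steps = 0
    while evaluate() < target_reward:
        if steps >= max_steps:
            return None
        train_step()
        steps += 1
    return steps


class ToyLearner:
    """Stand-in for the new system: reward grows by a fixed amount per step."""

    def __init__(self, gain_per_step):
        self.skill = 0.0
        self.gain_per_step = gain_per_step

    def train_step(self):
        self.skill += self.gain_per_step

    def reward(self):
        return self.skill


# target_reward stands in for the original system's off-distribution reward.
learner = ToyLearner(gain_per_step=0.5)
steps_needed = steps_to_match(learner.train_step, learner.reward, target_reward=2.0)
```

Comparing `steps_needed` across different original systems (each with its own inferred reward function and target) would give the relative measure described above.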

Maybe there's a version that could train the new system using behavioural cloning, but it's less clear how you measure when you're as competent as the original agent (maybe using a discriminator?)

The reason for trying this is having a measure of competence that is less dependent on human judgement/closer to the system's ontology and capabilities.

Comment by william_s on Honoring Petrov Day on LessWrong, in 2019 · 2019-09-26T18:22:32.665Z · score: 6 (6 votes) · LW · GW

I think the better version of this strategy would involve getting competing donations from both sides, using some weighting of total donations for/against pushing the button to set a probability of pressing the button, and tweaking the weighting of the donations such that you expect the probability of pressing the button will be low (because pressing the button threatens to lower the probability of future games of this kind, this is an iterated game rather than a one-shot).
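One possible form of the weighting, as a sketch only: the formula and the `against_weight` parameter are assumptions made for illustration, not something specified above.

```python
def press_probability(donations_for, donations_against, against_weight=1.0):
    """Probability of pressing the button from competing donation totals.

    against_weight > 1 makes donations against pressing count for more,
    which pushes the probability of pressing down.
    """
    if donations_for < 0 or donations_against < 0 or against_weight <= 0:
        raise ValueError("donations must be non-negative and weight positive")
    total = donations_for + against_weight * donations_against
    if total == 0:
        return 0.0  # no donations at all: default to not pressing
    return donations_for / total


even_weighting = press_probability(100, 100)
tilted_weighting = press_probability(100, 100, against_weight=9.0)
```

With equal donations on both sides, an even weighting gives a probability of 0.5, while weighting the "against" side 9x brings it down to 0.1.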

Comment by william_s on Problems with AI debate · 2019-09-05T18:08:11.444Z · score: 6 (4 votes) · LW · GW

For Alaska vs. Bali, an alternative answer is "You could be convinced that either Alaska or Bali is a good vacation destination". It's an interesting question whether this could actually win in debate. I think it might have a better chance in Factored Evaluation, because we can spin up two separate trees to view the most compelling argument for Alaska and the most compelling argument for Bali and verify that these are convincing. In debate, you'd need to view either Alaska Argument before Bali Argument, or Bali Argument before Alaska Argument, and you might just be convinced by the first argument you see, in which case you wouldn't agree that you could be convinced either way.

Comment by william_s on HCH is not just Mechanical Turk · 2019-08-07T02:55:24.913Z · score: 4 (3 votes) · LW · GW

I'd say that the claim is not sufficient - it might provide some alignment value, but it needs a larger story about how the whole computation is going to be safe. I do think that the HCH framework could make specifying an aligned GOFAI-like computation easier (but it's hard to come up with a rigorous argument for this without pointing to some kind of specification that we can make claims about, which is something I'd want to produce along the way while proceeding with HCH-like approaches)

Comment by william_s on Deceptive Alignment · 2019-07-21T20:54:27.734Z · score: 4 (3 votes) · LW · GW

I think a cleaner way of stating condition 3 might be "there exists some policy with a non-zero chance of deception at some point in the future, with expected value for the mesa-optimizer higher than the policy of always cooperating".

This could be met if the threat of modification never goes away, but maybe decreases over time (for example: at training time the system is monitored for defection every episode, while at deployment time it is monitored for defection in 1 in a million episodes, so it is not worthwhile to defect during training but worthwhile to defect at deployment time).

This could fail to be met if the mesa-optimizer believes that 1) there is a 1% chance that it will be detected by the base optimizer 2) if this happens, the base optimizer will be able to determine $O_{mesa}$ and give it a large negative reward, 100x more than the possible reward from the best defecting policy. (not sure if there's any version of this that would work, since it seems hard to figure out $O_{mesa}$ and provide negative reward)
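As a rough worked version of these numbers, assuming (a simplification not stated above) that always cooperating is normalized to value 0, and that a detected defection earns nothing and is penalized at `penalty_multiple` times the defecting reward:

```python
def defection_value(reward, p_detect, penalty_multiple):
    """Expected value of defecting, relative to always cooperating (= 0).

    Undetected (probability 1 - p_detect): gain `reward`.
    Detected (probability p_detect): lose `penalty_multiple * reward`.
    """
    return (1 - p_detect) * reward - p_detect * penalty_multiple * reward


# 1% chance of detection, penalty 100x the best defecting reward.
ev_one_percent = defection_value(reward=1.0, p_detect=0.01, penalty_multiple=100)

# Same penalty, but monitoring only catches 1 in a million episodes.
ev_rare_monitoring = defection_value(reward=1.0, p_detect=1e-6, penalty_multiple=100)
```

Under these assumptions the break-even penalty is (1 - p) / p times the reward, which is 99x at a 1% detection chance, so a 100x penalty only just makes defection negative in expectation, and the same penalty does not deter defection when detection is a 1-in-a-million event.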

Comment by william_s on An Increasingly Manipulative Newsfeed · 2019-07-14T03:53:52.360Z · score: 1 (1 votes) · LW · GW

To me, it seems like the point of this story is that we could build an AI that ends up doing very dangerous things without ever asking it "Will you do things I don't like if given more capability?" or some other similar question that requires it to execute the treacherous turn. In contrast, if the developers did something like build a testing world with toy humans in it who could be manipulated in a way detectable to the developers, and placed the AI in the toy testing world, then it seems like this AI would be forced into a position where it either acts according to its true incentives (manipulate the humans and be detected), or executes the treacherous turn (abstain from manipulating the humans so developers will trust it more). So it seems like this wouldn't happen if the developers are trying to test for treacherous turn behaviour during development.

Comment by william_s on Contest: $1,000 for good questions to ask to an Oracle AI · 2019-07-01T16:48:25.834Z · score: 4 (2 votes) · LW · GW

Are you interested in protocols involving multiple episodic questions (where you ask one question, wait for it to resolve, then ask another question?)

Comment by william_s on Contest: $1,000 for good questions to ask to an Oracle AI · 2019-07-01T16:46:57.502Z · score: 12 (10 votes) · LW · GW

Submission: low-bandwidth oracle

Plan Criticism: Given a plan to build an aligned AI, put together a list of possible lines of thought to think about problems with the plan (open questions, possible failure modes, criticisms, etc.). Ask the oracle to pick one of these lines of thought, pick another line of thought at random, and spend the next time period X thinking about both, judge which line of thought was more useful to think about (where lines of thought that spot some fatal missed problem are judged to be very useful) and reward the oracle if its suggestion was picked.

Comment by william_s on The Main Sources of AI Risk? · 2019-03-22T18:55:56.945Z · score: 5 (3 votes) · LW · GW
• AI systems end up controlled by a group of humans representing a small range of human values (ie. an ideological or religious group that imposes values on everyone else). While not caused only by AI design, it is possible that design decisions could impact the likelihood of this scenario (ie. at what point are values loaded into the system/how many people's values are loaded into the system), and is relevant for overall strategy.
Comment by william_s on The Main Sources of AI Risk? · 2019-03-22T18:52:24.938Z · score: 6 (3 votes) · LW · GW
• Failure to learn how to deal with alignment in the many-humans, many-AIs case even if single-human, single-AI alignment is solved (which I think Andrew Critch has talked about). For example, AIs negotiating on behalf of humans take the stance described in https://arxiv.org/abs/1711.00363 of agreeing to split control of the future according to which human's priors are most accurate (on potentially irrelevant issues) if this isn't what humans actually want.
Comment by william_s on Some Thoughts on Metaphilosophy · 2019-03-08T18:58:55.169Z · score: 2 (2 votes) · LW · GW

Maybe one AI philosophy service could look like: it would ask you a bunch of other questions that are simpler than the problem of qualia, then show you what those answers imply about the problem of qualia if you use some method of reconciling those answers.

Comment by william_s on Some Thoughts on Metaphilosophy · 2019-03-08T18:53:49.848Z · score: 2 (2 votes) · LW · GW

Re: Philosophy as interminable debate, another way to put the relationship between math and philosophy:

Philosophy as weakly verifiable argumentation

Math is solving problems by looking at the consequences of a small number of axiomatic reasoning steps. For something to be math, we have to be able to ultimately cash out any proof as a series of these reasoning steps. Once something is cashed out in this way, it takes a small constant amount of time to verify any reasoning step, so we can verify the whole proof in time polynomial in its length.

Philosophy is solving problems where we haven't figured out a set of axiomatic reasoning steps. Any non-axiomatic reasoning step we propose could end up having arguments that we hadn't thought of that would lead us to reject that step. And those arguments themselves might be undermined by other arguments, and so on. Each round of debate lets us add another level of counter-arguments. Philosophers can make progress when they have some good predictor of whether arguments are good or not, but they don't have access to certain knowledge of arguments being good.

Another difference between mathematics and philosophy is that in mathematics we have a well defined set of objects and a well-defined problem we are asking about. Whereas in philosophy we are trying to ask questions about things that exist in the real world and/or we are asking questions that we haven't crisply defined yet.

When we come up with a set of axioms and a description of a problem, we can move that problem from the realm of philosophy to the realm of mathematics. When we come up with some method we trust of verifying arguments (ie. replicating scientific experiments), we can move problems out of philosophy to other sciences.

It could be the case that philosophy grounds out in some reasonable set of axioms which we don't have access to now for computational reasons - in which case it could all end up in the realm of mathematics. It could be the case that, for all practical purposes, we will never reach this state, so it will remain in the "potentially unbounded DEBATE round case". I'm not sure what it would look like if it could never ground out - one model could be that we have a black box function that performs a probabilistic evaluation of argument strength given counter-arguments, and we go through some process to get the consequences of that, but it never looks like "here is a set of axioms".

Comment by william_s on Some Thoughts on Metaphilosophy · 2019-03-08T17:41:23.465Z · score: 6 (4 votes) · LW · GW

I guess it feels like I don't know how we could know that we're in the position that we've "solved" meta-philosophy. It feels like the thing we could do is build a set of better and better models of philosophy and check their results against held-out human reasoning and against each other.

I also don't think we know how to specify a ground truth reasoning process that we could try to protect and run forever which we could be completely confident would come up with the right outcome (where something like HCH is a good candidate but potentially with bugs/subtleties that need to be worked out).

I feel like I have some (not well justified and possibly motivated) optimism that this process yields something good fairly early on. We could gain confidence that we are in this world if we build a bunch of better and better models of meta-philosophy and observe at some point the models continue agreeing with each other as we improve them, and that they agree with various instantiations of protected human reasoning that we run. If we are in this world, the thing we need to do is just spend some time building a variety of these kinds of models and produce an action that looks good to most of them. (Where agreement is not "comes up with the same answer" but more like "comes up with an answer that other models think is okay and not disastrous to accept").

Do you think this would lead to "good outcomes"? Do you think some version of this approach could be satisfactory for solving the problems in Two Neglected Problems in Human-AI Safety?

Do you think there's a different kind of thing that we would need to do to "solve metaphilosophy"? Or do you think that working on "solving metaphilosophy" roughly cashes out as "work on coming up with better and better models of philosophy in the model I've described here"?

Comment by william_s on Three AI Safety Related Ideas · 2019-03-08T17:30:25.571Z · score: 4 (2 votes) · LW · GW

A couple ways to implement a hybrid approach with existing AI safety tools:

Logical Induction: Specify some computationally expensive simulation of idealized humans. Run a logical inductor with the deductive process running the simulation and outputting what the humans say after time x in simulation, as well as statements about what non-idealized humans are saying in the real world. The inductor should be able to provide beliefs about what the idealized humans will say in the future informed by information from the non-idealized humans.

HCH/IDA: The HCH-humans demonstrate a reasoning process which aims to predict the output of a set of idealized humans using all available information (which can include running simulations of idealized humans or information from real humans). The way that the HCH tree uses information about real humans involves looking carefully at their circumstances and asking things like "how do the real human's circumstances differ from the idealized human" and "is the information from the real human compromised in some way?"

Comment by william_s on Can HCH epistemically dominate Ramanujan? · 2019-02-27T22:58:06.230Z · score: 1 (1 votes) · LW · GW

It seems like for Filtered-HCH, the application in the post you linked to, you might be able to do a weaker version where you label any computation that you can't understand in kN steps as problematic, only accepting things you think you can efficiently understand. (But I don't think Paul is arguing for this weaker version).

Comment by william_s on Reinforcement Learning in the Iterated Amplification Framework · 2019-02-18T21:09:09.756Z · score: 4 (2 votes) · LW · GW
RL is typically about sequential decision-making, and I wasn't sure where the "sequential" part came in).

I guess I've used the term "reinforcement learning" to refer to a broader class of problems including both one-shot bandit problems and sequential decision making problems. In this view, the feature that makes RL different from supervised learning is not that we're trying to figure out how to act in an MDP/POMDP, but instead that we're trying to optimize a function that we can't take the derivative of (in the MDP case, it's because the environment is non-differentiable, and in the approval learning case, it's because the overseer is non-differentiable).

Comment by william_s on Some disjunctive reasons for urgency on AI risk · 2019-02-15T21:59:17.046Z · score: 2 (2 votes) · LW · GW

Re: scenario 3, see The Evitable Conflict, the last story in Isaac Asimov's "I, Robot":

"Stephen, how do we know what the ultimate good of Humanity will entail? We haven't at our disposal the infinite factors that the Machine has at its! Perhaps, to give you a not unfamiliar example, our entire technical civilization has created more unhappiness and misery than it has removed. Perhaps an agrarian or pastoral civilization, with less culture and less people would be better. If so, the Machines must move in that direction, preferably without telling us, since in our ignorant prejudices we only know that what we are used to, is good – and we would then fight change. Or perhaps a complete urbanization, or a completely caste-ridden society, or complete anarchy, is the answer. We don't know. Only the Machines know, and they are going there and taking us with them."
Comment by william_s on HCH is not just Mechanical Turk · 2019-02-13T00:15:16.037Z · score: 6 (3 votes) · LW · GW

Yeah, to some extent. In the Lookup Table case, you need to have a (potentially quite expensive) way of resolving all mistakes. In the Overseer's Manual case, you can also leverage humans to do some kind of more robust reasoning (for example, they can notice a typo in a question and still respond correctly, even if the Lookup Table would fail in this case). Though in low-bandwidth oversight, the space of things that participants could notice and correct is fairly limited.

Though I think this still differs from HRAD in that it seems like the output of HRAD would be a much smaller thing in terms of description length than the Lookup Table, and you can buy extra robustness by adding many more human-reasoned things into the Lookup Table (ie. automatically add versions of all questions with typos that don't change the meaning of a question into the Lookup Table, add 1000 different sanity check questions to flag that things can go wrong).

So I think there are additional ways the system could correct mistaken reasoning relative to what I would think the output of HRAD would look like, but you do need to have processes that you think can correct any way that reasoning goes wrong. So the problem could be less difficult than HRAD, but still tricky to get right.

Comment by william_s on The Argument from Philosophical Difficulty · 2019-02-11T17:47:18.818Z · score: 4 (3 votes) · LW · GW

Thanks, this position makes more sense in light of Beyond Astronomical Waste (I guess I have some concept of "a pretty good future" that is fine with something like a bunch of human-descended beings living happy lives that misses out on the sort of things mentioned in Beyond Astronomical Waste, and "optimal future" which includes those considerations). I buy this as an argument that "we should put more effort into making philosophy work to make the outcome of AI better, because we risk losing large amounts of value" rather than "our efforts to get a pretty good future are doomed unless we make tons of progress on this" or something like that.

"Thousands of millions" was a typo.

Comment by william_s on Thoughts on reward engineering · 2019-02-10T22:31:38.378Z · score: 4 (2 votes) · LW · GW
What is the motivation for using RL here?

I see the motivation as: given practical compute limits, it may be much easier to have the system find an action the overseer approves of instead of imitating the overseer directly. Using RL also allows you to use any advances that are made in RL by the machine learning community to try to remain competitive.

Comment by william_s on Thoughts on reward engineering · 2019-02-10T22:28:42.330Z · score: 4 (2 votes) · LW · GW
Would this still be a problem if we were training the agent with SL instead of RL?

Maybe this could happen with SL if SL does some kind of large search and finds a solution that looks good but is actually bad. The distilled agent would then learn to identify this action and reproduce it, which implies the agent learning some facts about the action to efficiently locate it with much less compute than the large search process. Knowing what the agent knows would allow the overseer to learn those facts, which might help in identifying this action as bad.

Comment by william_s on Reinforcement Learning in the Iterated Amplification Framework · 2019-02-10T22:09:27.681Z · score: 4 (2 votes) · LW · GW
I don't understand why we want to find this X* in the imitation learning case.

Ah, with this example the intent was more like "we can frame what the RL case is doing as finding X*, let's show how we could accomplish the same thing in the imitation learning case (in the limit of unlimited compute)".

The reverse mapping (imitation to RL) just consists of applying reward 1 to M2's demonstrated behaviour (which could be "execute some safe search and return the results"), and reward 0 to everything else.

What is pM(X∗)?

pM(X∗) is the probability of M outputting X∗ (where M is a stochastic policy)

M2("How good is answer X to Y?")∗∇log(pM(X))

This is the REINFORCE gradient estimator (which tries to increase the log probability of actions that were rated highly)
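A small sketch of that estimator for a softmax policy over a fixed list of candidate answers. The setup is an assumption made for illustration: `rewards[i]` stands in for the overseer's rating M2("How good is answer X to Y?") of answer i, and the policy M is parameterized directly by one logit per answer.

```python
import math


def softmax(logits):
    """Convert logits into a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_gradient(logits, action, reward):
    """Single-sample REINFORCE estimate: reward * grad_logits log p(action).

    For a softmax policy, d/d logit_i of log p(action) is
    (1 if i == action else 0) - p_i.
    """
    probs = softmax(logits)
    return [reward * ((1.0 if i == action else 0.0) - p)
            for i, p in enumerate(probs)]


def expected_estimate(logits, rewards):
    """Average of the single-sample estimates, weighted by p(action)."""
    probs = softmax(logits)
    total = [0.0] * len(logits)
    for action, p in enumerate(probs):
        grad = reinforce_gradient(logits, action, rewards[action])
        total = [t + p * g for t, g in zip(total, grad)]
    return total


def exact_gradient(logits, rewards):
    """Gradient of the expected reward with respect to the logits."""
    probs = softmax(logits)
    mean_reward = sum(p * r for p, r in zip(probs, rewards))
    return [p * (r - mean_reward) for p, r in zip(probs, rewards)]


logits = [0.1, -0.3, 0.5]
rewards = [0.2, 1.0, 0.5]  # hypothetical overseer ratings of three answers
estimate = expected_estimate(logits, rewards)
exact = exact_gradient(logits, rewards)
```

The point of the comparison is that the estimator only needs the rating as a number for each sampled answer, never the derivative of the overseer, yet on average it matches the true gradient of the expected rating.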

Comment by william_s on Announcement: AI alignment prize round 4 winners · 2019-02-10T19:06:00.792Z · score: 6 (3 votes) · LW · GW

I guess the question was more from the perspective of: if the cost was zero then it seems like it would be worth running, so what part of the cost makes it not worth running (where I would think of cost as probably time to judge or availability of money to fund the contest).

Comment by william_s on The Argument from Philosophical Difficulty · 2019-02-10T19:02:57.074Z · score: 6 (3 votes) · LW · GW

One important dimension to consider is how hard it is to solve philosophical problems well enough to have a pretty good future (which includes avoiding bad futures). It could be the case that this is not so hard, but fully resolving questions so we could produce an optimal future is very hard or impossible. It feels like this argument implicitly relies on assuming that "solve philosophical problems well enough to have a pretty good future" is hard (ie. takes thousands of millions of years in scenario 4) - can you provide further clarification on whether/why you think that is the case?

Comment by william_s on Announcement: AI alignment prize round 4 winners · 2019-02-09T17:43:58.790Z · score: 8 (4 votes) · LW · GW

Slightly disappointed that this isn't continuing (though I didn't submit to the prize, I submitted to Paul Christiano's call for possible problems with his approach which was similarly structured). Was hoping that once I got further into my PhD, I'd have some more things worth writing up, and the recognition/a bit of prize money would provide some extra motivation to get them out the door.

What do you feel is the limiting resource that keeps this from being useful to continue in its current form?

Comment by william_s on HCH is not just Mechanical Turk · 2019-02-09T17:10:56.001Z · score: 1 (1 votes) · LW · GW

Yeah, this is a problem that needs to be addressed. It feels like in the Overseer's Manual case you can counteract this by giving definitions/examples of how you want questions to be interpreted, and in the Lookup Table case this can be addressed by coordination within the team creating the lookup table.

Comment by william_s on Can there be an indescribable hellworld? · 2019-01-31T20:03:23.882Z · score: 1 (1 votes) · LW · GW

Do you think you'd agree with a claim of this form applied to corrigibility of plans/policies/actions?

That is: If some plan/policy/action is incorrigible, then A can provide some description of how the action is incorrigible.

Comment by william_s on Why we need a *theory* of human values · 2018-12-29T00:01:46.830Z · score: 3 (2 votes) · LW · GW
The better we can solve the key questions ("what are these 'wiser' versions?", "how is the whole setup designed?", "what questions exactly is it trying to answer?"), the better the wiser ourselves will be at their tasks.

I feel like this statement suggests that we might not be doomed if we make a bunch of progress, but not full progress on these statements. I agree with that assessment, but it felt on reading the post like the post was making the claim "Unless we fully specify a correct theory of human values, we are doomed".

I think that I'd view something like Paul's indirect normativity approach as requiring that we do enough thinking in advance to get some critical set of considerations known by the participating humans, but once that's in place we should be able to go from this core set to get the rest of the considerations. And it seems possible that we can do this without a fully-solved theory of human value (but any theoretical progress in advance we can make on defining human value is quite useful).

Comment by william_s on Three AI Safety Related Ideas · 2018-12-20T21:36:23.176Z · score: 6 (3 votes) · LW · GW

My interpretation of what you're saying here is that the overseer in step #1 can do a lot of things to bake in having the AI interpret "help the user get what they really want" in ways that get the AI to try to eliminate human safety problems for the step #2 user (possibly entirely), but problems might still occur in the short term before the AI is able to think/act to remove those safety problems.

It seems to me that this implies that IDA essentially solves the AI alignment portion of points 1 and 2 in the original post (modulo things happening before the AI is in control).

Comment by william_s on A comment on the IDA-AlphaGoZero metaphor; capabilities versus alignment · 2018-07-19T21:41:18.523Z · score: 1 (1 votes) · LW · GW

Correcting all problems in the subsequent amplification stage would be a nice property to have, but I think IDA can still work even if it corrects errors with multiple A/D steps in between (as long as all catastrophic errors are caught before deployment). For example, I could think of the agent initially using some rules for how to solve math problems where distillation introduces some mistake, but later in the IDA process the agent learns how to rederive those rules and realizes the mistake.

Comment by william_s on A general model of safety-oriented AI development · 2018-06-13T20:21:35.174Z · score: 8 (3 votes) · LW · GW

Shorter name candidates:

Inductively Aligned AI Development

Inductively Aligned AI Assistants

Comment by william_s on A general model of safety-oriented AI development · 2018-06-13T20:20:03.086Z · score: 6 (2 votes) · LW · GW

It's a nice property of this model that it prompts consideration of the interaction between humans and AIs at every step (to highlight things like risks of the humans having access to some set of AI systems for manipulation or moral hazard reasons).

Comment by william_s on Poker example: (not) deducing someone's preferences · 2018-06-13T18:53:25.062Z · score: 4 (1 votes) · LW · GW

In the higher dimensional belief/reward space, do you think that it would be possible to significantly narrow down the space of possibilities (so this argument is saying "be Bayesian with respect to reward/beliefs, picking policies that work over a distribution"), or are you more pessimistic than that, thinking that the uncertainty would be so great in higher dimensional spaces that it would not be possible to pick a good policy?

Comment by william_s on Amplification Discussion Notes · 2018-06-01T19:04:19.114Z · score: 14 (4 votes) · LW · GW

Open Question: Working with concepts that the human can’t understand

Question: when we need to assemble complex concepts by learning/interacting with the environment, rather than using H's concepts directly, and when those concepts influence reasoning in subtle/abstract ways, how do we retain corrigibility/alignment?

Paul: I don't have any general answer to this, seems like we should probably choose some example cases. I'm probably going to be advocating something like "Search over a bunch of possible concepts and find one that does what you want / has the desired properties."

E.g. for elegant proofs, you want a heuristic that gives successful lines of inquiry higher scores. You can explore a bunch of concepts that do that, evaluate each one according to how well it discriminates good from bad lines of inquiry, and also evaluate other stuff like "What would I infer from learning that a proof is elegant other than that it will work" and make sure that you are OK with that.

Andreas: Suppose you don't have the concepts of "proof" and "inquiry", but learned them (or some more sophisticated analogs) using the sort of procedure you outlined below. I guess I'm trying to see in more detail that you can do a good job at "making sure you're OK with reasoning in ways X" in cases where X is far removed from H's concepts. (Unfortunately, it seems to be difficult to make progress on this by discussing particular examples, since examples are necessarily about concepts we know pretty well.)

This may be related to the more general question of what sorts of instructions you'd give H to ensure that if they follow the instructions, the overall process remains corrigible/aligned.

Comment by william_s on Amplification Discussion Notes · 2018-06-01T19:04:01.100Z · score: 9 (2 votes) · LW · GW

Open Question: Severity of “Honest Mistakes”

In the discussion about creative problem solving, Paul said that he was concerned about problems arising when the solution generator was deliberately searching for a solution with harmful side effects. Other failures could occur where the solution generator finds a solution with harmful side effects without “deliberately searching” for it. The question is how bad these “honest mistakes” would end up being.

Paul: I also want to make the further claim that such failures are much less concerning than what-I'm-calling-alignment failures, which is a possible disagreement we could dig into (I think Wei Dai disagrees or is very unsure).

Comment by william_s on Challenges to Christiano’s capability amplification proposal · 2018-05-26T22:58:13.323Z · score: 10 (2 votes) · LW · GW
I would solve X-and-only-X in two steps:
First, given an agent and an action which has been optimized for undesirable consequence Y, we'd like to be able to tell that the action has this undesirable side effect. I think we can do this by having a smarter agent act as an overseer, and giving the smarter agent suitable insight into the cognition of the weaker agent (e.g. by sharing weights between the weak agent and an explanation-generating agent). This is what I'm calling informed oversight.
Second, given an agent, identify situations in which it is especially likely to produce bad outcomes, or proofs that it won't, or enough understanding of its internals that you can see why it won't. This is discussed in “Techniques for Optimizing Worst-Case Performance.”

Paul, I'm curious whether you see it as necessary, for these techniques to work, that the optimization target is pretty good/safe (but not perfect): i.e. some safety comes from the fact that agents optimized for approval or imitation only have a limited class of Y's that they might also end up being optimized for.

Comment by william_s on Challenges to Christiano’s capability amplification proposal · 2018-05-26T22:54:32.134Z · score: 16 (3 votes) · LW · GW
So I also don't see how Paul expects the putative alignment of the little agents to pass through this mysterious aggregation form of understanding, into alignment of the system that understands Hessian-free optimization.

My model of Paul's approach sees the alignment of the subagents as just telling you that no subagent is trying to actively sabotage your system (ie. by optimizing to find the worst possible answer to give you), and that the alignment comes from having thought carefully about how the subagents are supposed to act in advance (in a way that could potentially be run just by using a lookup table).

Comment by william_s on Resolving human values, completely and adequately · 2018-05-16T18:23:05.370Z · score: 5 (2 votes) · LW · GW

Glad to see this work on possible structure for representing human values which can include disagreement between values and structured biases.

I had some half-formed ideas vaguely related to this, which I think map onto an alternative way to resolve self reference.

Rather than just having one level of values that can refer to other values on the same level (which potentially leads to a self-reference cycle), you could instead explicitly represent each level of value, with level 0 values referring to concrete reward functions, level 1 values endorsing or negatively endorsing level 0 values, and generally level n values only endorsing or negatively endorsing level n-1 values. This might mean that you have some kinds of values that end up being duplicated between multiple levels. For any n, there's a unique solution to the level of endorsement for every concrete value. We can then consider the limit as n -> infinity as the true level of endorsement. This allows for situations where the limit fails to converge (i.e. it alternates between different values at odd and even levels), which seems like a way to handle self-reference contradictions (possibly also the all-or-nothing problem if it results from a conflict between meta-levels).

I think this maps into the case where we don't distinguish between value levels if we define a function that just adjusts the endorsement of each value by the values that directly refer to it. Then iterating this function n times gives the equivalent of having an n-level meta-hierarchy.
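
A minimal sketch of that iterated function, to make the idea concrete. Everything here is an assumption made for illustration: the names `adjust` and `endorsement_at_level`, the additive update rule (base endorsement plus a weighted sum over values that refer to it), and the toy values and weights. It is not meant as the right formalization, just one simple instance.

```python
def adjust(endorsement, base, references):
    """Apply one meta-level: re-score each value from the values that directly refer to it.

    references is a list of (source, target, weight) triples. A positive weight means
    `source` endorses `target`; a negative weight means it negatively endorses it.
    The additive rule is an assumption made for this sketch.
    """
    updated = dict(base)
    for source, target, weight in references:
        updated[target] += weight * endorsement[source]
    return updated


def endorsement_at_level(n, base, references):
    """Iterate `adjust` n times, the analogue of an n-level meta-hierarchy."""
    endorsement = dict(base)
    for _ in range(n):
        endorsement = adjust(endorsement, base, references)
    return endorsement


# Hypothetical case that converges: "rest" endorses "work" and nothing refers to "rest".
convergent_base = {"rest": 1.0, "work": 0.0}
convergent_refs = [("rest", "work", 0.5)]

# Hypothetical self-referential case: "doubt" negatively endorses itself.
self_ref_base = {"doubt": 1.0}
self_ref_refs = [("doubt", "doubt", -1.0)]

for n in range(4):
    print(n, endorsement_at_level(n, self_ref_base, self_ref_refs))
```

In the first toy case the endorsements stop changing after one iteration. In the self-referential case the endorsement of "doubt" alternates between two numbers at even and odd levels, which is the non-convergent situation described above.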

I think there might be interesting work in mapping this strategy into some simple value problem, and then trying to perform Bayesian value learning in that setting with some reasonable prior over values/value endorsements.

Comment by william_s on Can corrigibility be learned safely? · 2018-04-24T18:18:04.818Z · score: 4 (1 votes) · LW · GW

Ah, right. I guess I was balking at moving from exorbitant to exp(exorbitant). Maybe it's better to think of this as reducing the size of fully worked initial overseer example problems that can be produced for training/increasing the number of amplification rounds that are needed.

So my argument is more an example of what a distilled overseer could learn as an efficient approximation.

Comment by william_s on Can corrigibility be learned safely? · 2018-04-24T16:42:01.486Z · score: 4 (1 votes) · LW · GW

I guess what we're trying to unpack is "the mechanism that makes decisions from that database", and whether it can be efficient. If you store all experience with no editing, you're left with the same problem of "what decisions in the tree do you change based on experience (X,y) and how do you change them?" And it seems like the most naive way of doing this is too inefficient (looking at all data at all decision nodes).

If you summarize experience (what I'm interpreting "decide how to update some small sketch" as meaning), then the question is can you produce that summary in an efficient way, and efficiently use it at runtime?

The gradient-descent style solution is meant to gesture at how this could be done in a way analogous to how we use parametric models rather than non-parametric models in ML for decently sized datasets. In advance, bake in a set of parameters/information that could influence each decision, then do updates by having the overseer look for ways to change the decision parameters. But when we do this in an oversight model, we can make the decisions/parameters interpretable sentences.

Rough example of how I see the overall process going:

1. amplified overseer acts in an environment with a breakable vase, but the overseer does not know that the vase is breakable

2. First time it acts, some overseer node N looks into the database for any information related to the vase, and returns nothing, and proceeds with its decision

3. The vase breaks, and the negative feedback is given to the amplified overseer

4. Amplified overseer runs some process to figure out that it could have made a better decision if the database contained the knowledge that vases are breakable, and so updates the database. (The thing that Wei Dai and I are trying to unpack is, can this step be run in a reasonable amount of time, given that each node of the overseer has only a limited understanding of the original decision process)

5. Second time it acts, the amplified overseer process is the same, up until the node N that does the database lookup is run. This time, the lookup returns that the vase is breakable, so the overseer runs a different decision path and treats the vase with more care.
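
The five steps above can be sketched as a toy loop. This is only an illustration under loud assumptions: the database is a plain dict, the function names (`decide`, `environment_feedback`, `update_database`) are made up for the sketch, and step 4 is hard-coded to blame the missing "breakable" fact, which skips exactly the part that is the open question (whether that credit assignment can be done efficiently).

```python
def decide(database, obj):
    """Node N: look up what is known about the object, then choose a decision path."""
    facts = database.get(obj, set())
    return "handle with care" if "breakable" in facts else "handle normally"


def environment_feedback(obj, action):
    """Toy environment (assumption): the vase breaks unless it is handled with care."""
    if obj == "vase" and action != "handle with care":
        return -1  # negative feedback: the vase broke
    return 0


def update_database(database, obj, feedback):
    """Stand-in for step 4. The hard part (working out which missing knowledge
    would have changed the decision) is hard-coded here rather than solved."""
    if feedback < 0:
        database.setdefault(obj, set()).add("breakable")


database = {}                                  # step 1: nothing is known about the vase
first_action = decide(database, "vase")        # step 2: lookup returns nothing
first_feedback = environment_feedback("vase", first_action)  # step 3: the vase breaks
update_database(database, "vase", first_feedback)            # step 4: record the lesson
second_action = decide(database, "vase")       # step 5: same node, different lookup result

print(first_action, first_feedback, second_action)
```

The decision procedure at node N is identical in both runs; only the contents of the database change, and that stored knowledge is a readable fact rather than an opaque parameter.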

Comment by william_s on Can corrigibility be learned safely? · 2018-04-23T23:57:13.728Z · score: 9 (2 votes) · LW · GW
What if the current node is responsible for the error instead of one of the subqueries, how do you figure that out?

I think you'd need to form the decomposition in such a way that you could fix any problem through perturbing something in the world representation (an extreme version is you have the method for performing every operation contained in the world representation and looked up, so you can adjust it in the future).

When you do backprop, you propagate the error signal through all the nodes, not just through a single path that is "most responsible" for the error, right? If you did this with meta-execution, wouldn't it take an exponential amount of time?

One step of this method, as in backprop, has the same time complexity as the forward pass (running meta-execution forward, which I wouldn't call exponential complexity, as I think the relevant baseline is the number of nodes in the meta-execution forward tree). You only need to process each node once (when the backprop signal for its output is ready), and need to do a constant amount of work at each node (figure out all the ways to perturb the node's input).
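
As a rough illustration of the counting argument, here is a sketch that builds a small tree, runs a forward traversal, and then runs a backward traversal in which a node is handled only once its parent has been handled. The `Node` class, the binary tree shape, and the function names are all assumptions for the sketch; the actual per-node work (proposing perturbations to a node's input) is left as a comment.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    children: list = field(default_factory=list)


def build_tree(depth, name="root"):
    """Build a full binary tree of the given depth (hypothetical task tree)."""
    if depth == 0:
        return Node(name)
    return Node(name, [build_tree(depth - 1, f"{name}.{i}") for i in range(2)])


def forward_nodes(node):
    """List every node visited when running the tree forward."""
    names = [node.name]
    for child in node.children:
        names.extend(forward_nodes(child))
    return names


def backward_pass(root):
    """Propagate an error signal down from the root, handling each node once."""
    processed = []
    ready = [root]  # nodes whose output error signal is available
    while ready:
        node = ready.pop()
        # Constant work per node would go here: work out how this node's
        # input could be perturbed, then pass a signal on to its children.
        processed.append(node.name)
        ready.extend(node.children)
    return processed


tree = build_tree(3)
forward_order = forward_nodes(tree)
backward_order = backward_pass(tree)
print(len(forward_order), len(backward_order))
```

Both traversals touch the same set of nodes exactly once, so one backward step costs the same order of work as one forward pass over this tree.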

The catch is that, as with backprop, maybe you need to run multiple steps to get it to actually work.

And what about nodes that are purely symbolic, where there are multiple ways the subnodes (or the current node) could have caused the error, so you couldn't use the right answer for the current node to figure out what the right answer is from each subnode? (Can you in general structure the task tree to avoid this?)

The default backprop answer to this is to shrug and adjust all of the inputs (which is what you get from taking the first order gradient). If this causes problems, then you can fix them in the next gradient step. That seems to work in practice for backprop in continuous models. For discrete models like this it might be a bit more difficult - if you start to try out different combinations to see if they work, that's where you'd get exponential complexity. But we'd get to counter this by potentially having cases where, based on understanding the operation, we could intelligently avoid some branches - I think this could potentially wash out to linear complexity in the number of forward nodes if it all works well.

I wonder if we're on the right track at all, or if Paul has an entirely different idea about this.

So do I :)