Posts

How does the organization "EthAGI" fit into the broader AI safety landscape? 2019-07-08T00:46:02.191Z · score: 4 (2 votes)
Is it good practice to write questions/comments on old posts you're trying to understand? 2019-06-27T09:23:01.619Z · score: 21 (10 votes)
Evidence other than evolution for optimization daemons? 2019-04-21T20:50:18.986Z · score: 4 (1 votes)

Comments

Comment by liam-donovan on Capability amplification · 2019-10-09T13:09:09.811Z · score: 3 (2 votes) · LW · GW

I was very surprised to read this part of the conclusion:

Capability amplification appears to be less tractable than the other research problems I’ve outlined. I think it’s unlikely to be a good research direction for machine learning researchers interested in value alignment.

Is there a good explanation somewhere of why this is true? Based on the rest of this sequence, I would have expected capability amplification to be relatively tractable, and an excellent research direction for value alignment.

Comment by liam-donovan on A Critique of Functional Decision Theory · 2019-09-16T06:59:15.060Z · score: 4 (3 votes) · LW · GW

Yeah, wouldn't someone following Guaranteed Payoffs as laid out in the post be unable to make credible promises?

Comment by liam-donovan on [Link] Book Review: Reframing Superintelligence (SSC) · 2019-08-29T05:05:30.914Z · score: 6 (5 votes) · LW · GW

"Ten years ago, everyone was talking about superintelligence, the singularity, the robot apocalypse. What happened?"

What is this referencing? I was only 10 years old in 2009 but I have a strong impression that AI risk gets a lot more attention now than it did then.


Also, what are the most salient differences between CAIS and the cluster of concepts Karnofsky and others were calling "Tool AI"?

Comment by liam-donovan on Thoughts on the 5-10 Problem · 2019-08-04T00:36:01.897Z · score: 1 (1 votes) · LW · GW

My (possibly very incorrect) takeaway from the post, as someone with very little background in mathematical logic, was that "If I can prove x has higher utility than y, then I will do x" (statement 1) is a bad heuristic for an EDT agent that can reason about its own source code, because outputting x will be a fixed point of this decision process* even when x does not actually have higher utility. Specifically, an EDT agent will choose action x iff the utility of choosing x is higher than that of choosing y (assuming the utilities are different). Thus, assuming statement 1 is equivalent to assuming "if I can prove x has higher utility than y, x has higher utility than y" (statement 2) for the EDT agent. Because assuming statement 2 leads to absurd conclusions (like the agent taking the 5 dollar bill), assuming it is a bad heuristic.

This use of Löb's theorem seems to do exactly what you want: show that we can't prove a statement is unprovable. If we prove a statement of the form "if a is provable then a is true", then the contrapositive "if a is not true then it is not provable" follows. However, I thought the point of the post was that we can't actually prove a statement of this form, namely the statement "if x does not have higher utility than y, then I cannot prove that x has higher utility than y" (statement 3). Statement 3 is necessary for the heuristic in statement 1 to be useful, but the post shows that it is in fact false.

The point of the post isn't to prove something false; it's to show that we can't prove a statement is unprovable.


*I'm not sure if I'm using these terms correctly and precisely :/
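
To make statements 1-3 concrete, here is a minimal formalization sketch (my own notation, not taken from the post): write □A for "the agent's proof system proves A" and let A abbreviate "x has higher utility than y".

    \begin{align*}
    \text{Statement 1:}\quad & \square A \rightarrow \text{the agent outputs } x \\
    \text{Statement 2:}\quad & \square A \rightarrow A \\
    \text{Statement 3:}\quad & \neg A \rightarrow \neg\square A \quad \text{(contrapositive of Statement 2)} \\
    \text{L\"ob's theorem:}\quad & \text{if } \vdash \square A \rightarrow A \text{, then } \vdash A
    \end{align*}

So if the agent's proof system could prove statement 2 (equivalently statement 3), Löb's theorem would let it conclude A outright, whether or not x actually has higher utility, which is exactly the absurd conclusion above.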

Comment by liam-donovan on What questions about the future would influence people’s actions today if they were informed by a prediction market? · 2019-07-21T21:13:07.821Z · score: 3 (3 votes) · LW · GW

Even if you tried to design a prediction market around this mechanism, all it would tell you is the expected value of a promise to pay $x, n years from now. This would be affected by arbitrarily many factors, so you couldn't infer the probability of a specific catastrophe like UFAI development.
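
As a rough sketch of why (my own notation, not from the thread), the market price of a contract paying $x in n years looks something like

    \[
    \text{price} \;\approx\; x \cdot d(n) \cdot P(\text{the payment actually settles}),
    \]

where d(n) is a discount factor and P(settles) lumps together catastrophe risk, exchange shutdown, counterparty default, and everything else, so the probability of any one specific catastrophe (like UFAI) can't be backed out of that single number.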

Comment by liam-donovan on What questions about the future would influence people’s actions today if they were informed by a prediction market? · 2019-07-21T08:53:28.494Z · score: 2 (2 votes) · LW · GW

Some of these questions seem impossible to operationalize in a prediction market. For instance, if I bet that a recursively self-improving unfriendly AI will be developed in the next 10 years, and I'm right, how am I going to collect my money?

Comment by liam-donovan on Open Thread July 2019 · 2019-07-20T04:35:50.283Z · score: 1 (1 votes) · LW · GW
The intrinsic connection is primarily that they arose out of the same broad community, and there is heavy overlap between personnel as a consequence.

I disagree with this though! I think anyone that wants to think along EA lines is inevitably going to want to investigate how to improve epistemic rationality, which naturally leads to thinking about decision making for idealized agents. Having community overlap is one thing, but the ideas seem so closely related that EA can't develop in any possible world without being biased towards HRAD research.

It's very important to distinguish those etceteras you listed

I mean, surely there would be some worlds in which HRAD research was not the most valuable use of (some portion of*) EA money; it doesn't really matter whether the specific examples I gave work, just that EA would be unable to distinguish worlds where HRAD is an optimal use of resources from worlds where it is not.

I expect the associated communities to notice and then to shift focus elsewhere.

But why? Is it not at all concerning that aliens with no knowledge of Earth or humanity could plausibly guess that a movement dedicated to a maximizing, impartial, welfarist conception of the good would also be intrinsically attracted to learning about idealized reasoning procedures? The link between them is completely unconnected to the object-level question "is HRAD research the best use of [some] EA money?", or even to the specifics of how the LW/EA communities formed around specific personalities in this world.


Comment by liam-donovan on Thoughts on the 5-10 Problem · 2019-07-19T07:04:41.041Z · score: 1 (1 votes) · LW · GW

What was wrong with specifying an agent that uses "[decision theory] unless it's sure it'll make a decision, then it makes the opposite choice"?

Comment by liam-donovan on Open Thread July 2019 · 2019-07-18T22:06:45.042Z · score: 3 (2 votes) · LW · GW

It seems like there are some intrinsic connections between the clusters of concepts known as "EA", "LW-style rationality", and "HRAD research"; is this a worrying sign?

Specifically, it seems like the core premise of EA relies largely on a good understanding of the world, reached in a systematic and explicit manner (because existing heuristics aren't selected for "maximizing altruism"[1]), linking closely to LW, which tries to answer the same question. At the same time, my understanding of HRAD research is that it aims to elucidate a framework for how consequentialist agents "ought to reason" in theory, so that the consequentialist reasoning of the first highly capable AI systems is legible to humans. Understanding how an idealized agent "ought to reason" or "ought to make decisions" seems highly relevant to the project of improving human rationality (which is then relevant to the EA project).

Now, imagine a world where HRAD is not a great use of resources (e.g. because AI risk is not a legitimate concern, because underlying philosophical assumptions are wrong, because the marginal tractability of alternate safety approaches is much higher, etc.). Would the basic connections between the ideas in the last paragraph still hold? I'm worried that they would, leading any community with goals similar to EA's to be biased towards HRAD research for reasons unrelated to the underlying state of the world.

Is this a legitimate concern? What else has been written on this issue?

[1] To expand on this a bit: LW-style rationality often underperforms accumulated heuristics, experience, and domain knowledge in established fields, and probably does best in new fields where quantification is valuable, with high uncertainty, low societal incentives to get a correct answer, dissimilarity to ancestral environments, and a high propensity to trigger cognitive biases/emotional responses. I think almost all of these descriptors are true for the EA movement.

Comment by liam-donovan on Robust Artificial Intelligence and Robust Human Organizations · 2019-07-17T06:27:08.587Z · score: 0 (2 votes) · LW · GW

How does "Safety-II" compare with Eliezer's description of security mindset? On the surface they sound very similar, and I would expect highly reliable organizations to value a security mindset in some form.

Comment by liam-donovan on On motivations for MIRI's highly reliable agent design research · 2019-07-12T00:16:14.372Z · score: 1 (1 votes) · LW · GW

What work is step #1 doing here? It seems like steps #2-5 would still hold even if the AGI in question were using "bad" consequentialist reasoning (e.g. domain-limited/high-K/exploitable/etc.).

In fact, is it necessary to assume that the AGI will be consequentialist at all? It seems highly probable that the first pivotal act will be taken by a system of humans+AI that is collectively behaving in a consequentialist fashion (in order to pick out a pivotal act from the set of all actions). If so, do arguments #2-#5 not apply equally well to this system as a whole, with "top-level" interpreted as something like "transparent to humans within the system"?

Comment by liam-donovan on How does the organization "EthAGI" fit into the broader AI safety landscape? · 2019-07-08T01:19:53.140Z · score: 1 (1 votes) · LW · GW

As far as visible output, the founder did write a (misleading imho) fictional book about AI risk called "Detonation", which is how I heard of EthAGI. I was curious how an organization like this could form with no connection to "mainstream" AI safety people, but I guess it's more common than I thought.

Comment by liam-donovan on Contest: $1,000 for good questions to ask to an Oracle AI · 2019-07-03T12:22:08.089Z · score: 3 (1 votes) · LW · GW

Well, a given copy of the oracle wouldn't directly receive information from the other oracles about the questions they were asked. To the extent a problem remains (which I agree is likely without specific assumptions), wouldn't it apply to all counterfactual oracles?

Comment by liam-donovan on Contest: $1,000 for good questions to ask to an Oracle AI · 2019-07-02T20:12:59.747Z · score: 5 (3 votes) · LW · GW

Two basic questions I couldn't figure out (sorry):

Can you use a different oracle for every subquestion? If you can, how would this affect the concern Wei_Dai raises?

If we know the oracle is only optimizing for the specified objective function, are mesa-optimisers still a problem for the proposed system as a whole?

Comment by liam-donovan on Contest: $1,000 for good questions to ask to an Oracle AI · 2019-07-02T20:08:05.060Z · score: 4 (2 votes) · LW · GW

How would this be low-bandwidth? If we're able to give the oracle a list of passwords to guess from, can't we just check them all?

Comment by liam-donovan on Open question: are minimal circuits daemon-free? · 2019-06-27T22:16:18.830Z · score: 1 (1 votes) · LW · GW

But the other elements in C(n) aren't necessarily daemons either, right? Certainly "encoding n days of weather data" isn't daemonic at all; some versions of c_apx might be upstream daemons, but that's not necessarily concerning. I don't understand how this argument tells us anything about whether the smallest circuit is guaranteed to be (downstream) daemon-free.

Comment by liam-donovan on Open question: are minimal circuits daemon-free? · 2019-06-27T21:43:24.439Z · score: 1 (1 votes) · LW · GW

Why couldn't you just use a smaller circuit that runs one single-step simulator, and outputs the result? It seems like that would output an accurate prediction of Paul's behavior iff the k-step simulator outputs an accurate prediction.

Comment by liam-donovan on Open question: are minimal circuits daemon-free? · 2019-06-27T21:31:47.507Z · score: 1 (1 votes) · LW · GW

This is all only relevant to downstream daemons, right? If so, I don't understand why the DD would ever provide 98% accuracy; I'd expect it to provide 99% accuracy until it sees a chance to provide [arbitrarily low]% accuracy and start pursuing its agenda directly. As you say, this might happen due to competition between daemon-containing systems, but I think a DD would want to maximize its chances of survival by maximizing its accuracy either way.


Comment by liam-donovan on Open question: are minimal circuits daemon-free? · 2019-06-27T21:22:55.192Z · score: 1 (1 votes) · LW · GW

Isn't this just saying it would be nice if we collectively put more resources towards alignment research relative to capabilities research? I still feel like I'm missing something :/

Comment by liam-donovan on Is it good practice to write questions/comments on old posts you're trying to understand? · 2019-06-27T21:17:59.061Z · score: 2 (2 votes) · LW · GW

Is there a good way to know if an AI safety post is "ephemeral" in the sense that it's no longer relevant to the current state of the discussion?

Comment by liam-donovan on Open question: are minimal circuits daemon-free? · 2019-06-27T08:02:09.632Z · score: 1 (1 votes) · LW · GW

(sorry for commenting on such an old post)

It seems like the problem from evolution's perspective isn't that we don't understand our goal specification but that our goals are different from evolution's goals. It seems fairly tautological that putting more compute towards maximizing a goal specification than towards making sure the goal specification is what we want is likely to lead to UFAI; I don't see how that implies a "relatively simple" solution?

Comment by liam-donovan on Open question: are minimal circuits daemon-free? · 2019-06-27T07:43:19.764Z · score: 1 (1 votes) · LW · GW

Coincidentally I'm also trying to understand this post at the same time, and was somewhat confused by the "upstream"/"downstream" distinction.

What I eventually concluded was that there are 3 ways a daemon that intrinsically values optimizing some Y can "look like" it's optimizing X:

  • Y = X (this seems both unconcerning and unlikely, and thus somewhat irrelevant)
  • optimizing Y causes optimization pressure to be applied to X (upstream daemon, describes humans if Y = our actual goals and X = inclusive genetic fitness)
  • The daemon is directly optimizing X because the daemon believes this instrumentally helps it achieve Y (downstream daemon, e.g. if optimizing X helps the daemon survive)

Does this seem correct? In particular, I don't understand why upstream daemons would have to have a relatively benign goal.

Comment by liam-donovan on Open question: are minimal circuits daemon-free? · 2019-06-27T07:24:07.047Z · score: 1 (1 votes) · LW · GW

Isn't that the problem Christiano is relatively optimistic about solving through training on different distributions? In particular, it seems like we'd notice the daemon's goals diverging from ours and try to correct it. If I understand Christiano's point correctly, he is optimistic that we could do so successfully due to the existence of a "basin of corrigibility". OTOH, the most concerning case would be when the daemon is intentionally trying to hide the fact that its goals are diverging from X, because we may not realize the system is misaligned before the daemon executes a treacherous turn.


Instead, I interpreted "starting as upstream and becoming downstream" as a description of a phenomenon that might look roughly like this:

1. Upstream daemon (UD) is "discovered" by some optimization process because it seems to do a good job optimizing X on the training data

2. The daemon is "initialized" by the outer optimizer and starts "running"

3. The outer optimizer (OO) provides a set of inputs to the daemon with the goal of optimizing X

4. Based on these inputs, the daemon "notices" that the entity that "initialized" it is trying to optimize X rather than Y

5. The UD infers that the outer optimizer will "terminate" it if the OO discovers the UD is optimizing for Y rather than X

6. The UD starts optimizing for X in order to survive (a convergent instrumental goal regardless of Y)

7. The OO observes the UD's performance on a wide variety of training data and infers that the UD is not in fact an upstream daemon

8. The daemon is "set loose" and proceeds to execute a treacherous turn


As you can tell, my model of how daemons work is extremely lacking; I've used quotes for every phrase that feels dubious to me.

One major question I have which this model, if correct, partially resolves for me: How does the downstream daemon ever "know" what X we are optimizing for in order for the outer optimizer to "discover" it?



Comment by liam-donovan on Explaining "The Crackpot Bet" · 2019-06-27T06:39:28.051Z · score: 1 (1 votes) · LW · GW

I am so confused... what was the answer Zvi had in mind that "anyone who thought about it for 5 minutes" could see? Keep in mind I have absolutely no context here and am just wondering whether there is in fact an interesting puzzle whose answer I didn't figure out.

Comment by liam-donovan on Evidence other than evolution for optimization daemons? · 2019-04-22T16:33:24.666Z · score: 3 (2 votes) · LW · GW

I have read that post; it makes sense, but I'm not sure how to distinguish "correct" from "persuasive but wrong" in this case without other evidence.

Comment by liam-donovan on Evidence other than evolution for optimization daemons? · 2019-04-22T03:39:23.967Z · score: 1 (0 votes) · LW · GW

I don't understand why it would be -- it looks like MENACE is just a simple physical algorithm that successfully optimizes for winning tic-tac-toe. I thought the idea of an OD was that a process optimizing for goal A hard enough could produce a consequentialist* agent that cares about a different goal B. What is the goal B here (or am I misunderstanding the concept)?

*in the sense Christiano uses "consequentialist"
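
For concreteness, here is a toy Python reconstruction of a MENACE-style learner (my own sketch, loosely following Michie's matchbox design rather than reproducing it exactly): each board state gets a box of beads over its legal moves, moves are sampled in proportion to the beads, and the beads are reinforced or removed according to the game's outcome.

    import random

    # Toy MENACE-style learner: one "matchbox" of bead counts per board state;
    # moves are drawn in proportion to beads, and beads are added after wins,
    # kept roughly even after draws, and removed after losses.

    WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
                 (0, 3, 6), (1, 4, 7), (2, 5, 8),
                 (0, 4, 8), (2, 4, 6)]

    def winner(board):
        for a, b, c in WIN_LINES:
            if board[a] != ' ' and board[a] == board[b] == board[c]:
                return board[a]
        return None

    class Menace:
        def __init__(self):
            self.boxes = {}  # board state (string) -> {move index: bead count}

        def choose(self, board):
            state = ''.join(board)
            if state not in self.boxes:
                legal = [i for i, s in enumerate(board) if s == ' ']
                self.boxes[state] = {m: 3 for m in legal}  # 3 starting beads per move
            beads = self.boxes[state]
            moves, weights = zip(*beads.items())
            return random.choices(moves, weights=weights)[0], state

        def reinforce(self, history, result):
            # history: (state, move) pairs for MENACE's own moves this game;
            # result: +1 win, 0 draw, -1 loss
            delta = {1: 3, 0: 1, -1: -1}[result]
            for state, move in history:
                self.boxes[state][move] = max(1, self.boxes[state][move] + delta)

    def play_game(menace):
        board, history, player = [' '] * 9, [], 'X'  # MENACE is X; O plays at random
        while True:
            if player == 'X':
                move, state = menace.choose(board)
                history.append((state, move))
            else:
                move = random.choice([i for i, s in enumerate(board) if s == ' '])
            board[move] = player
            w = winner(board)
            if w is not None:
                return menace.reinforce(history, 1 if w == 'X' else -1)
            if ' ' not in board:
                return menace.reinforce(history, 0)
            player = 'O' if player == 'X' else 'X'

    menace = Menace()
    for _ in range(20000):
        play_game(menace)  # win rate against the random player climbs over training

In this sketch all of the "optimization" lives in the bead-update rule; nothing in the system represents a goal distinct from winning, which is the sense in which MENACE seems to me like a plain optimizer rather than something daemonic.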

Comment by liam-donovan on Evidence other than evolution for optimization daemons? · 2019-04-21T21:33:14.083Z · score: 1 (1 votes) · LW · GW

Regarding the first part of your comment: If I understand the quoted section correctly, I don't think I know enough about biology or theology to confidently take a position on that view. Is the observed behavior of soulless optimizers (e.g. intelligent non-human primates) significantly different from what one would expect if they only maximized inclusive genetic fitness? If so, that would definitely answer my question.

Comment by liam-donovan on Evidence other than evolution for optimization daemons? · 2019-04-21T21:13:52.014Z · score: 1 (1 votes) · LW · GW

Thank you for the prompt response to a poorly-worded question!

I'm not particularly interested in answers that take God/free will into account; I was just hoping to find evidence/justifications for the existence of optimization daemons other than evolution. It sounds like my question would be clearer and more relevant if I removed the mention of religion?

Comment by liam-donovan on Clarifying Consequentialists in the Solomonoff Prior · 2019-04-03T07:32:01.031Z · score: 1 (1 votes) · LW · GW

Does the possibility of consequentialists in the universal prior still arise for an AIXItl algorithm running with arbitrarily large (but finite) computing power?