The Main Sources of AI Risk?

2019-03-21T18:28:33.068Z · score: 30 (6 votes)
Comment by wei_dai on What's wrong with these analogies for understanding Informed Oversight and IDA? · 2019-03-20T19:36:22.263Z · score: 6 (3 votes) · LW · GW

In that case, you can still try to be a straightforward Bayesian about it, and say “our intuition supports the general claim that process P outputs true statements;” you can then apply that regularity to trust P on some new claim even if it’s not the kind of claim you could verify, as long as “P outputs true statements” had a higher prior than “P outputs true statements just in the cases I can check.”

If that's what you do, it seems “P outputs true statements just in the cases I can check.” could have a posterior that's almost 50%, which doesn't seem safe, especially in an iterated scheme where you have to depend on such probabilities many times? Do you not need to reduce the posterior probability to a negligible level instead?
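A rough calculation shows why a posterior near 50% is worrying under iteration. This is only a sketch: it assumes both hypotheses predict the checked cases equally well (so the evidence leaves their prior ratio unchanged) and, as a simplification not taken from the discussion, that each reliance on P in the iterated scheme is an independent chance of failure. The function names and the 0.51/0.49 priors are illustrative.

```python
def posterior_restricted(prior_general: float, prior_restricted: float) -> float:
    """Posterior of "P is right only where I can check" after seeing checked cases.

    Both hypotheses assign probability 1 to the checked evidence, so Bayes'
    rule just renormalizes the two priors against each other.
    """
    return prior_restricted / (prior_general + prior_restricted)


def prob_no_failure(p_bad_per_step: float, steps: int) -> float:
    """Chance that none of `steps` reliance events goes wrong.

    Assumes (hypothetically) that the events are independent.
    """
    return (1.0 - p_bad_per_step) ** steps


p_bad = posterior_restricted(prior_general=0.51, prior_restricted=0.49)
print(p_bad)                       # a slightly higher prior still leaves ~49% on the bad hypothesis
print(prob_no_failure(p_bad, 10))  # chance of getting through ten steps unharmed
```

Under these assumptions the chance of ten safe steps is about one in a thousand, which is the sense in which the posterior would need to be negligible rather than merely below 50%.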

See the second and third examples in the post introducing ascription universality.

Can you quote these examples? The word "example" appears 27 times in that post and looking at the literal second and third examples, they don't seem very relevant to what you've been saying here so I wonder if you're referring to some other examples.

There is definitely a lot of fuzziness here and it seems like one of the most important places to tighten up the definition / one of the big research questions for whether ascription universality is possible.

What I'm inferring from this (as far as a direct answer to my question) is that an overseer trying to do Informed Oversight on some ML model doesn't need to reverse engineer the model enough to fully understand what it's doing, only enough to make sure it's not doing something malign, which might be a lot easier, but this isn't quite reflected in the formal definition yet or isn't a clear implication of it yet. Does that seem right?

What's wrong with these analogies for understanding Informed Oversight and IDA?

2019-03-20T09:11:33.613Z · score: 37 (8 votes)
Comment by wei_dai on A theory of human values · 2019-03-18T08:13:47.331Z · score: 3 (1 votes) · LW · GW

There is the issue of avoiding ignorant-yet-confident meta-preferences, which I’m working on writing up right now (partially thanks to your very comment here, thanks!)

I look forward to reading that. In the meantime can you address my parenthetical point in the grand-parent comment: "correctly extracting William MacAskill’s meta-preferences seems equivalent to learning metaphilosophy from William"? If it's not clear, what I mean is this: suppose Will wants to figure out his values by doing philosophy (which I think he actually does). Does that mean that under your scheme the AI needs to learn how to do philosophy? If so, how do you plan to get around the problems with applying ML to metaphilosophy that I described in Some Thoughts on Metaphilosophy?

Comment by wei_dai on More realistic tales of doom · 2019-03-18T07:59:17.054Z · score: 36 (13 votes) · LW · GW

I think AI risk is disjunctive enough that it's not clear most of the probability mass can be captured by a single scenario/story, even as broad as this one tries to be. Here are some additional scenarios that don't fit into this story or aren't made very salient by it.

  1. AI-powered memetic warfare makes all humans effectively insane.
  2. Humans break off into various groups to colonize the universe with the help of their AIs. Due to insufficient "metaphilosophical paternalism", they each construct their own version of utopia which is either directly bad (i.e., some of the "utopias" are objectively terrible or subjectively terrible according to my values), or bad because of opportunity costs.
  3. AI-powered economies have much higher economies of scale because AIs don't suffer from the kind of coordination costs that humans have (e.g., they can merge their utility functions and become clones of each other). Some countries may try to prevent AI-managed companies from merging for ideological or safety reasons, but others (in order to gain a competitive advantage on the world stage) will basically allow their whole economy to be controlled by one AI, which eventually achieves a decisive advantage over the rest of humanity and does a treacherous turn.
  4. The same incentive for AIs to merge might also create an incentive for value lock-in, in order to facilitate the merging. (AIs that don't have utility functions might have a harder time coordinating with each other.) Other incentives for premature value lock-in might include defense against value manipulation/corruption/drift. So AIs end up embodying locked-in versions of human values which are terrible in light of our true/actual values.
  5. I think the original "stereotyped image of AI catastrophe" is still quite plausible, if for example there is a large amount of hardware overhang before the last piece of the puzzle for building AGI falls into place.

Comment by wei_dai on More realistic tales of doom · 2019-03-18T06:09:56.265Z · score: 7 (3 votes) · LW · GW

Sounds like a new framing of the “daemon” idea.

That's my impression as well. If it's correct, seems like it would be a good idea to mention that explicitly in the post, so people can link up the new concept with their old concept.

Comment by wei_dai on Comparison of decision theories (with a focus on logical-counterfactual decision theories) · 2019-03-18T05:57:49.579Z · score: 5 (2 votes) · LW · GW

See "Example 1: Counterfactual Mugging" in Towards a New Decision Theory.

Comment by wei_dai on Comparison of decision theories (with a focus on logical-counterfactual decision theories) · 2019-03-18T00:19:12.261Z · score: 13 (3 votes) · LW · GW

I think it's needed just to define what it means to condition on an action, i.e., if an agent conditions on "I make this decision" in order to compute its expected utility, what does that mean formally? You could make "I" a primitive element in the agent's ontology, but I think that runs into all kinds of problems. My solution was to make it a logical statement of the form "source code X outputs action/policy Y", and then to condition on it you need a logically uncertain distribution.

Comment by wei_dai on More realistic tales of doom · 2019-03-18T00:01:59.293Z · score: 4 (2 votes) · LW · GW

There's a bunch of bullet points below Part 1 and Part 2. Are these intended to be parallel with them on the same level, or instances/subcategories of them?

Oh, this is only on GW. On LW it looks very different. Presumably the LW version is the intended version.

Comment by wei_dai on Comparison of decision theories (with a focus on logical-counterfactual decision theories) · 2019-03-17T21:19:47.785Z · score: 17 (6 votes) · LW · GW

Chris asked me via PM, "I’m curious, have you written any posts about why you hold that position?"

I don't think I have, but I'll give the reasons here:

  1. "evidential-style conditioning on a logically uncertain distribution" seems simpler / more elegant to me.
  2. I'm not aware of a compelling argument for "causal-graph-style counterpossible reasoning". There are definitely some unresolved problems with evidential-style UDT and I do endorse people looking into causal-style FDT as an alternative but I'm not convinced the solutions actually lie in that direction. ( and links therein are relevant here.)
  3. Part of it is just historical, in that UDT was originally specified as "evidential-style conditioning on a logically uncertain distribution" and if I added my name as a co-author to a paper that focuses on causal-style decision theory, people would naturally wonder if something made me change my mind.

Comment by wei_dai on Privacy · 2019-03-17T06:55:59.036Z · score: 5 (2 votes) · LW · GW

OK, looking at the argument, I think it makes sense that signalling equilibria can potentially be Pareto-worse than non-signalling equilibria, as they can have more of a “market for lemons” problem.

Not sure what the connection to “market for lemons” is. Can you explain more (if it seems important)?

(I think “no one gets education, everyone gets paid average productivity” is still a Nash equilibrium)

I agree that is still a Nash equilibrium and I think even a Perfect Bayesian Equilibrium, but there may be a stronger formal equilibrium concept that rules it out? (It's been a while since I studied all those equilibrium refinements so I can't tell you which off the top of my head.)

I think under Perfect Bayesian Equilibrium, off-the-play-path nodes formally happen with probability 0 and the players are allowed to update in an arbitrary way on those nodes, including not updating at all. But intuitively, if someone does deviate from the proposed equilibrium strategy and gets some education, it seems implausible that employers wouldn't update towards them being type H and therefore offer them a higher salary.

Comment by wei_dai on Privacy · 2019-03-17T05:43:29.038Z · score: 5 (2 votes) · LW · GW

It looks like the code that turns a URL into a link made the colon into part of the link. I removed it so the link should work now. The argument should be in the PDF. Basically you just solve the game assuming the ability to signal and compare that to the game where signaling isn't possible, and see that the signaling equilibrium makes everyone worse off (in that particular game).

Comment by wei_dai on Privacy · 2019-03-17T05:10:30.128Z · score: 12 (5 votes) · LW · GW

We need a realm shielded from signaling and judgment.

To support this, there are results from economics / game theory showing that signaling equilibria can be worse than non-signaling equilibria (in the sense of Pareto inefficiency). Quoting one example from

So the benchmark is represented by the situation where no signaling takes place and employers -- not being able to distinguish between more productive and less productive applicants and not having any elements on which to base a guess -- offer the same wage to every applicant, equal to the average productivity. Call this the non-signaling equilibrium. In a signaling equilibrium (where employers’ beliefs are confirmed, since less productive people do not invest in education, while the more productive do) everybody may be worse off than in the non-signaling equilibrium. This occurs if the wage offered to the non-educated is lower than the average productivity (= wage offered to everybody in the non-signaling equilibrium) and that offered to the educated people is higher, but becomes lower (than the average productivity) once the costs of acquiring education are subtracted. The possible Pareto inefficiency of signaling equilibria is a strong result and a worrying one: it means that society is wasting resources in the production of education. However, it is not per se enough to conclude that education (i.e. the signaling activity) should be eliminated. The result is not that, in general, elimination of the signaling activity leads to a Pareto improvement: Spence simply pointed out that this is a possibility.

So in theory it seems quite possible that privacy is a sort of coordination mechanism for avoiding bad signaling equilibria. Whether or not it actually is, I'm not sure. That seems to require empirical investigation and I'm not aware of such research.
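The quoted result can be checked on a small two-type job market. This is a minimal sketch with made-up numbers (the productivities, population shares, and education costs are all assumptions chosen for illustration, not taken from the quoted source): it verifies that the separating outcome is an equilibrium and that both types end up below the pooled wage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerType:
    productivity: float    # wage an employer pays once it believes the worker is this type
    share: float           # fraction of the workforce of this type
    education_cost: float  # what it costs this type to acquire the signal


def non_signaling_wage(low: WorkerType, high: WorkerType) -> float:
    """Wage when nobody can signal: everyone is paid average productivity."""
    return low.share * low.productivity + high.share * high.productivity


def separating_payoffs(low: WorkerType, high: WorkerType) -> tuple[float, float]:
    """Net payoffs (low, high) when only the high type gets educated."""
    return low.productivity, high.productivity - high.education_cost


def is_separating_equilibrium(low: WorkerType, high: WorkerType) -> bool:
    """Neither type gains by switching its education choice."""
    low_stays_uneducated = low.productivity >= high.productivity - low.education_cost
    high_gets_educated = high.productivity - high.education_cost >= low.productivity
    return low_stays_uneducated and high_gets_educated


# Hypothetical parameters: education is cheap enough for the high type to
# separate, but expensive enough to eat most of the wage premium.
LOW = WorkerType(productivity=1.0, share=0.5, education_cost=1.5)
HIGH = WorkerType(productivity=2.0, share=0.5, education_cost=0.75)

pooled = non_signaling_wage(LOW, HIGH)
low_net, high_net = separating_payoffs(LOW, HIGH)
print(is_separating_equilibrium(LOW, HIGH), pooled, low_net, high_net)
```

With these numbers the low type falls from the pooled wage to its own productivity, and the high type's wage premium is more than consumed by the cost of education, so the signaling equilibrium is Pareto-worse in this particular game.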

Comment by wei_dai on Question: MIRI Corrigbility Agenda · 2019-03-15T19:00:21.925Z · score: 7 (3 votes) · LW · GW

Is Jessica Taylor's A first look at the hard problem of corrigibility still a good reference or is it outdated?

Comment by wei_dai on A theory of human values · 2019-03-15T04:00:36.648Z · score: 9 (4 votes) · LW · GW

I think in terms of economics, vNM expected utility is closest to how we tend to think about utility/preferences. The problem with vNM (from our perspective) is that it assumes a coherent agent (i.e., an agent that satisfies the vNM axioms) but humans aren't coherent, in part because we don't know what our values are or should be. ("Humans don't have utility functions" is a common refrain around here.) From academia in general, the approach that comes closest to how we tend to think about values is reflective equilibrium, although other meta-ethical views are not unrepresented around here.

For utility comparisons between people, I think a lot of thinking here has been based on or inspired by game theory, e.g., bargaining games.

Of course there is a lot of disagreement and uncertainty between and within individuals on LW, so specific posts may well be based on different foundations, or may just be informal explorations that aren't based on any theoretical foundations.

In this post, Stuart seems to be trying to construct an extrapolated/synthesized (vNM or vNM-like) utility function out of a single human's incomplete and inconsistent preferences and meta-preferences, which I don't think has much of a literature in economics?

Comment by wei_dai on Speculations on Duo Standard · 2019-03-15T02:54:47.791Z · score: 12 (5 votes) · LW · GW

Hi Zvi, may I suggest that you tag your Magic: The Gathering posts with [MtG] or something similar in the title? Since you blog about both MtG topics and other topics, I imagine a lot of people on LW clicked on this post wondering what it's about, and then immediately went back out after seeing that it's a post about MtG. (I actually had to Google "Duo Standard" to figure that out because the post doesn't mention MtG or Magic in the first few paragraphs.)

Also, am I correct in assuming that these MtG posts are just about MtG, and are not meant to illustrate more general principles or something like that?

Comment by wei_dai on A theory of human values · 2019-03-14T03:45:42.585Z · score: 10 (3 votes) · LW · GW

Probably you were thinking of something like teaching AIs metaphilosophy in order to perhaps improve the procedure? This would be the main alternative I see, and it does feel more robust. I am wondering though whether we’ll know by that point whether we’ve found the right way to do metaphilosophy

I think there's some (small) hope that by the time we need it, we can hit upon a solution to metaphilosophy that will just be clearly right to most (philosophically sophisticated) people, like how math and science were probably once methodologically quite confusing but now everyone mostly agrees on how math and science should be done. Failing that, we probably need some sort of global coordination to prevent competitive pressures leading to value lock-in (like the kind that would follow from Stuart's scheme). In other words, if there wasn't a race to build AGI, then there wouldn't be a need to solve AGI safety, and there would be no need for schemes like Stuart's that would lock in our values before we solve metaphilosophy.

it doesn’t feel obvious why something like Stuart’s anti-realism isn’t already close to there

Stuart's scheme uses each human's own meta-preferences to determine their own (final) object-level preferences. I would be less concerned if this was used on someone like William MacAskill (with the caveat that correctly extracting William MacAskill's meta-preferences seems equivalent to learning metaphilosophy from William) but a lot of humans have seemingly terrible meta-preferences or at least different meta-preferences which likely lead to different object-level preferences (so they can't all be right, assuming moral realism).

To put it another way, my position is that if moral realism or relativism (positions 1-3 in this list) is right, we need "metaphilosophical paternalism" to prevent a "terrible outcome", and that's not part of Stuart's scheme.

Comment by wei_dai on How dangerous is it to ride a bicycle without a helmet? · 2019-03-13T20:22:42.724Z · score: 10 (5 votes) · LW · GW

It feels to me like people in our community aren't being skeptical enough or pushing back enough on the idea of acausal coordination for humans. I'm kind of confused about this because it seems like a weirder idea, with weaker arguments for it, than, for example, the importance of AI risk, which does get substantial skepticism and pushback.

In an old post I argued that for acausal coordination reasons it seems as if you should further multiply this value by the number of people in the reference class of those making the decision the same way (discounted by how little you care about strangers vs. yourself).

But if "the same way" includes not only the same kind of explicit cost/benefit analysis but also "further multiply this value by the number of people in the reference class of those making the decision the same way", the number of people in this reference class must be tiny because nobody is doing this for deciding whether to wear bike helmets.

Suppose two people did "further multiply this value by the number of people in the reference class of those making the decision the same way", but their decision making processes are slightly different, e.g., they use different heuristics to do things like finding sources for the numbers that go into the cost/benefit analysis. I don't know how to figure out whether they are still in the same reference class, or how to generalize beyond "same reference class" when the agents are humans as opposed to AIs (and even with the latter we don't have a complete mathematical theory).

people talk about this argument mostly in the context of voting

I'm skeptical about this too. I'm not actually aware of a good argument for acausal coordination in the context of voting. A search on LW yields only this short comment from Eliezer.

Comment by wei_dai on A theory of human values · 2019-03-13T19:48:00.339Z · score: 7 (4 votes) · LW · GW

This seems to assume a fairly specific (i.e., anti-realist) metaethics. I'm quite uncertain about metaethics and I'm worried that if moral realism is true (and say for example that total hedonic utilitarianism is the true moral theory), and what you propose here causes the true moral theory to be able to control only a small fraction of the resources of our universe, that would constitute a terrible outcome. Given my state of knowledge, I'd prefer not to make any plans that imply commitment to a specific metaethical theory, like you seem to be doing here.

What's your response to people with other metaethics or who are very uncertain about metaethics?

However, for actual humans, the first scenario seems to loom much larger.

I don't think this is true for me, or maybe I'm misunderstanding what you mean by the two scenarios.

Comment by wei_dai on Asymptotically Benign AGI · 2019-03-13T07:39:48.301Z · score: 3 (1 votes) · LW · GW

If the assumption is true, we could demand that A use their words, and counter us being mind-hacked by poking holes in what B is saying rather than demanding we stop listening to B. And if A is able to convince us that B was mind-hacking, even after some more back and forth, B will be punished for that.

Oh, I see, I didn't understand "it is harder to mystify a judge than it is to pierce through someone else mystifying a judge" correctly. So this assumption basically rules out a large class of possible vulnerabilities in the judge, right? For example, if the judge had the equivalent of a buffer overflow bug in a network stack, the scheme would fail. In that case, A would not be able to "pierce through" B's attack and stop it with its words if the judge keeps listening to B (and B was actually attacking).

I don't think the "AI safety via debate" paper actually makes arguments for this assumption (at least I couldn't find where it does). Do you have reasons to think it's true, or ideas for how to verify that it's true, short of putting a human in a BoMAI?

Comment by wei_dai on Asymptotically Benign AGI · 2019-03-13T05:04:06.416Z · score: 3 (1 votes) · LW · GW

*but A could concoct a story … counterarguments from B .. mind hacked by B, right?

Yeah, I mixed up the A's and B's at the end. It's fixed now. Thanks for pointing it out.

I think the main contention of their paper is that it is harder to mystify a judge than it is to pierce through someone else mystifying a judge, so this shouldn’t be a problem.

I don't understand how the former implies the latter. Assuming the former is true (and it seems like a big assumption), why can't what I suggested still happen?

That said, here’s one possibility: if A ever suggests that you don’t read more output from B, don’t read anything more from B, then flip coins to give A a 25% chance of victory.

But what about the case where B is actually trying to mind hack the judge? If you always give A a 25% chance of victory for suggesting or implying that you shouldn't read more output from B, then mind hacking becomes a (mostly) winning strategy, since a player gets a 75% chance of victory from mind hacking even if the other side successfully convinces the judge that they're trying to mind hack the judge. The equilibrium might then consist of a race to see who can mind hack the judge first, or (if one side has >75% chance of winning such a race due to first-mover or second-mover advantage) one side trying to mind hack the judge, getting blocked by the other side, and still getting 75% victory.
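The incentive problem can be put into numbers. This is a toy sketch, assuming (hypothetically) that an unblocked mind hack wins outright and that a blocked one triggers the proposed coin flip; the function and parameter names are invented for illustration.

```python
def hacker_win_prob(
    p_blocked: float,
    flagger_win_prob: float = 0.25,   # the proposed rule: the side that flags gets 25%
    p_win_if_unblocked: float = 1.0,  # assumption: a successful hack always wins
) -> float:
    """Win probability for a debater who attempts a mind hack.

    If the opponent flags the attack, the judge stops reading and flips
    coins, so the hacker still wins with probability 1 - flagger_win_prob.
    """
    blocked_payoff = 1.0 - flagger_win_prob
    return p_blocked * blocked_payoff + (1.0 - p_blocked) * p_win_if_unblocked


print(hacker_win_prob(p_blocked=1.0))  # always caught, still wins 75% of the time
print(hacker_win_prob(p_blocked=0.5))  # caught half the time
```

Under these assumptions attempting a hack never yields less than 75%, so it beats honest debate for any debater whose honest win probability is below that.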

Comment by wei_dai on Asymptotically Benign AGI · 2019-03-13T03:21:17.421Z · score: 3 (1 votes) · LW · GW

With a debate-like setup, if one side (A) is about to lose a debate, it seems to have a high incentive to claim that the other side (B) is trying to do a mind hack and that if the judge keeps paying attention to what B says (i.e., reads any further output from B), they will soon be taken over. What is the judge supposed to do in this case? They could ask A to explain how B's previous outputs constitute part of an attempt to mind hack, but A could concoct a story mixed with its own attempt to mind hack, and the judge can't ask for any counter-arguments from B without risking being mind hacked by B.

(I realize this is a problem in “AI Safety via debate” as well, but I'm asking you since you're here and Geoffrey Irving isn't. :)

Comment by wei_dai on AI Safety via Debate · 2019-03-13T02:43:41.342Z · score: 9 (4 votes) · LW · GW

Geoffrey Irving has done an interview with the AI Alignment Podcast, where he talked about a bunch of things related to DEBATE including some thoughts that are not mentioned in either the blog post or the paper.

Comment by wei_dai on Asymptotically Benign AGI · 2019-03-13T02:09:18.867Z · score: 3 (1 votes) · LW · GW

so for two world-models that are exactly equally accurate, we need to make sure the malign one is penalized for being slower, enough to outweigh the inconvenient possible outcome in which it has shorter description length

Yeah, I understand this part, but I'm not sure why, since the benign one can be extremely complex, the malign one can't have enough of a K-complexity advantage to overcome its slowness penalty. And since (with low β) we're going through many more different world models as the number of episodes increases, that also gives malign world models more chances to "win"? It seems hard to make any trustworthy conclusions based on the kind of informal reasoning we've been doing and we need to figure out the actual math somehow.

Comment by wei_dai on Asymptotically Benign AGI · 2019-03-12T18:37:54.957Z · score: 3 (1 votes) · LW · GW

Just as you said: it outputs Bernoulli(1/2) bits for a long time. It’s not dangerous.

I just read the math more carefully, and it looks like no matter how small β is, as long as β is positive, as BoMAI receives more and more input, it will eventually converge to the most accurate world model possible. This is because the computation penalty is applied to the per-episode computation bound and doesn't increase with each episode, whereas the accuracy advantage gets accumulated across episodes.

Assuming that the most accurate world model is an exponential-time quantum simulation, that's what BoMAI will converge to (no matter how small β is), right? And in the meantime it will go through some arbitrarily complex (up to some very large bound) but faster than exponential classical approximations of quantum physics that are increasingly accurate, as the number of episodes increases? If so, I'm no longer convinced that BoMAI is benign as long as β is small enough, because the qualitative behavior of BoMAI seems the same no matter what β is, i.e., it gets smarter over time as its world model gets more accurate, and I'm not sure why the reason BoMAI might not be benign at high β couldn't also apply at low β (if we run it for a long enough time).

(If you're going to discuss all this in your "longer reply", I'm fine with waiting for it.)
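The convergence argument can be illustrated with a two-model toy. This is a sketch under stated assumptions, not BoMAI's actual prior: it supposes the slower model pays a one-time prior penalty of β raised to its extra per-episode computation steps, and gains a fixed log-likelihood advantage every episode. The names `extra_steps` and `accuracy_gain_nats` are hypothetical.

```python
import math


def log_posterior_odds(
    n_episodes: int, beta: float, extra_steps: int, accuracy_gain_nats: float
) -> float:
    """Log odds of the slow, accurate model over the fast, cruder one.

    The speed penalty (extra_steps * ln beta) is paid once, because it depends
    on the per-episode computation bound; the accuracy term grows with every
    episode observed.
    """
    return extra_steps * math.log(beta) + n_episodes * accuracy_gain_nats


def episodes_until_slow_model_wins(
    beta: float, extra_steps: int, accuracy_gain_nats: float
) -> int:
    """Smallest number of episodes after which the slow model is favored."""
    penalty = -extra_steps * math.log(beta)  # positive for 0 < beta < 1
    return math.floor(penalty / accuracy_gain_nats) + 1


for beta in (0.5, 1e-3, 1e-9):
    n = episodes_until_slow_model_wins(beta, extra_steps=1000, accuracy_gain_nats=0.01)
    print(beta, n)
```

In this toy, shrinking β only delays the crossover (roughly in proportion to ln(1/β)); it never prevents it, which matches the observation that the qualitative behavior is the same for any positive β.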

Comment by wei_dai on Alignment Newsletter #48 · 2019-03-12T04:41:41.491Z · score: 8 (4 votes) · LW · GW

Question about quantilization: where does the base distribution come from? You and Jessica both mention humans, but if we apply ML to humans, and the ML is really good, wouldn't it just give a prediction like "With near certainty, the human will output X in this situation"? (If the ML isn't very good, then any deviation from the above prediction would reflect the properties of the ML algorithm more than properties of the human.)

To get around this, the human could deliberately choose an unpredictable (e.g., randomized) action to help the quantilizer, but how are they supposed to do that?
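To make the question concrete, here is a minimal sample-based sketch of a quantilizer: draw actions from the base distribution, keep the top q fraction by utility, and pick uniformly among those. The function name and the use of a finite sample to stand in for the base distribution are assumptions for illustration.

```python
import random
from typing import Callable, Sequence, TypeVar

A = TypeVar("A")


def quantilize(
    base_samples: Sequence[A],
    utility: Callable[[A], float],
    q: float,
    rng: random.Random,
) -> A:
    """Pick uniformly from the top-q fraction of base samples, ranked by utility."""
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    ranked = sorted(base_samples, key=utility, reverse=True)
    top_k = max(1, int(q * len(ranked)))  # always keep at least one candidate
    return rng.choice(ranked[:top_k])


rng = random.Random(0)

# A spread-out base distribution gives the quantilizer something to select from.
varied = list(range(100))
print(quantilize(varied, utility=float, q=0.1, rng=rng))

# If the learned human model predicts X with near certainty, every sample is X
# and the choice of q no longer matters.
degenerate = ["X"] * 100
print(quantilize(degenerate, utility=len, q=0.1, rng=rng))
```

The second call shows the worry: with a near-deterministic base distribution the quantilizer has nothing to select among, so its output is just the predicted action.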

Comment by wei_dai on Feature Wish List for LessWrong · 2019-03-12T04:06:50.203Z · score: 6 (2 votes) · LW · GW

I'd like a bookmark function for posts and comments. Sometimes I see an interesting post or comment but I don't have enough time to fully understand or write a reply for it, so it would be nice if I could press a button and have LW remember for me to get back to it.

(I could do this using the browser bookmark feature, but I use a whole bunch of different devices and different browsers and don't have bookmark synchronization between them, plus it would be nice to be able to access my LW bookmarks when I'm not using my own devices.)

Comment by wei_dai on [Fiction] IO.SYS · 2019-03-12T03:25:59.653Z · score: 3 (1 votes) · LW · GW

I think the protagonist here should have looked at earth.

Agreed. Either there is a superintelligence on Earth that thinks there's a non-negligible probability of another intelligence existing in the solar system, in which case it would send probes out to search for that intelligence (or blow up all the space probes like Donald suggested), so not looking at Earth would not help; or there is no such superintelligence, in which case not looking at Earth also would not help.

Given the tech was available, a space-probe containing an uploaded mind is not that unlikely.

Yep, or a space-probe containing another AI that could eventually become a threat to whatever is on Earth.

Comment by wei_dai on Alignment Newsletter #48 · 2019-03-12T03:07:14.392Z · score: 6 (2 votes) · LW · GW

Ok, I had the Markdown editor enabled, and when I tried to paste in HTML all the formatting was removed, so I thought pasting HTML doesn't work. Can you implement this conversion feature for the Markdown editor too, or if that's too hard, detect the user pasting HTML and show a suggestion to switch to the WYSIWYG editor?

Also it's unclear in the settings that if I uncheck the "Markdown editor" checkbox, the alternative would be the WYSIWYG editor. Maybe add a note to explain that, or make the setting a radio button so the user can more easily understand what the two options mean?

Comment by wei_dai on Alignment Newsletter #48 · 2019-03-12T02:35:48.626Z · score: 3 (1 votes) · LW · GW

Ah ok. What's your suggestion for other people crossposting between LW and another blog (that doesn't use Markdown)? (Or how are people already doing this?) Use an HTML to Markdown converter? (Is there one that you'd suggest?) Reformat for LW manually? Something else?

Comment by wei_dai on Alignment Newsletter #48 · 2019-03-12T02:20:24.961Z · score: 5 (2 votes) · LW · GW

Oh I didn't realize that you were reformatting it. I just saw the post change format after doing a refresh and assumed that Rohin did it. I was going to suggest that in the future Rohin try pasting in the HTML version of the newsletter that's available at (by following the "view this email in your browser" link that was in the original post), but actually I'm not sure it's possible to paste HTML into a LW post. Do you know if there's a way to do that?

Comment by wei_dai on Asymptotically Benign AGI · 2019-03-12T02:03:07.477Z · score: 3 (1 votes) · LW · GW

I agree this is only going to be possible for some universal Turing machines. Though if you are using a Turing machine to define a speed prior, this does seem like a desirable property.

Why is it a desirable property? I'm not seeing why it would be bad to choose a UTM that doesn't have this property to define the speed prior for BoMAI, if that helps with safety. Please explain more?

Comment by wei_dai on Alignment Newsletter #48 · 2019-03-12T00:48:18.393Z · score: 3 (1 votes) · LW · GW

I just want to say that I noticed that you reformatted the post to make it more readable, so thanks! And also thanks for writing these in the first place. :)

Comment by wei_dai on Asymptotically Benign AGI · 2019-03-11T21:31:35.972Z · score: 5 (2 votes) · LW · GW

Just consider a program that gives the aliens the ability to write arbitrary functions in M and then pass control to them.

That's what I was thinking too, but Michael made me realize this isn't possible, at least for some M. Suppose M is the C programming language, but in C there is no way to say "interpret this string as a C program and run it as fast as a native C program". Am I missing something at this point?

all you have to do is insert one use after free in physics

I don't understand this sentence.

Comment by wei_dai on Asymptotically Benign AGI · 2019-03-11T12:52:45.302Z · score: 3 (1 votes) · LW · GW

Fine tuning from both sides isn't safe. Approach from below.

Sure, approaching from below is obvious, but that still requires knowing how wide the band of β that would produce a safe and useful BoMAI is, otherwise even if the band exists you could overshoot it and end up in the unsafe region.

ETA: But the first question is, is there a β such that BoMAI is both safe and intelligent enough to answer questions like "how to build a safe unbounded AGI"? When β is very low BoMAI is useless, and as you increase β it gets smarter, but then at some point with a high enough β it becomes unsafe. Do you know a way to figure out how smart BoMAI is just before it becomes unsafe?

Comment by wei_dai on Asymptotically Benign AGI · 2019-03-11T12:45:26.331Z · score: 5 (2 votes) · LW · GW

And a “sequence of increasingly better algorithms,” if chosen in a computable way, is just a computable algorithm.

True but I'm arguing that this computable algorithm is just the alien itself, trying to answer the question "how can I better predict this richer world in order to take it over?" If there is no shorter/faster algorithm that can come up with a sequence of increasingly better algorithms, what is the point of saying that the alien is sampling from the speed prior, instead of saying that the alien is thinking about how to answer "how can I better predict this richer world in order to take it over?" Actually if this alien was sampling from the speed prior, then it would no longer be the shortest/fastest algorithm to come up with a sequence of increasingly better algorithms, and some other alien trying to take over our world would have the highest posterior instead.

Comment by wei_dai on Asymptotically Benign AGI · 2019-03-11T12:39:10.084Z · score: 5 (2 votes) · LW · GW

Ok, I see, so in other words the AGI doesn't have the ability to write an arbitrary function in the base programming language and call it; it has a fixed code base and has to simulate that function using its existing code. However I think the AGI can still win a race against a straightforward "predict accurately" algorithm, because it can do two things: 1) include the most important inner loops of the "predict accurately" algorithm as functions in its own code to minimize the relative slowdown (this is not a decision by the AGI but just a matter of which AGI ends up having the highest posterior), and 2) keep finding improvements to its own prediction algorithm so that it can eventually overtake any fixed prediction algorithm in accuracy, which hopefully more than "pays for" the remaining slowdown that is incurred.

Comment by wei_dai on Asymptotically Benign AGI · 2019-03-11T09:01:25.130Z · score: 5 (2 votes) · LW · GW

I'm glad that I'm getting some of my points across, but I think we still have some remaining disagreements or confusions here.

If you want to make fast predictions about an unknown world, I think that’s what we call a speed prior.

That doesn't seem right to me. A speed prior still favors short algorithms. If you're trying to make predictions about a computationally richer universe, why favor short algorithms? Why not apply your intelligence to try to discover the best algorithm (or increasingly better algorithms), regardless of the length?

Also, sampling from a speed prior involves randomizing over a mixture of TMs, but from an EU maximization perspective, wouldn't running one particular TM from the mixture give the highest expected utility? Why are the aliens sampling from the speed prior instead of directly picking a specific algorithm to generate the next output, one that they expect to give the highest utility for them?

I don’t say anything in the Natural Prior Assumption about “for sufficiently small β,” but this makes me think I might need to.

What happens if β is too small? If it's really tiny, then the world model with the highest posterior is random, right, because it's "computed" by a TM that (to minimize run time) just copies everything on its random tape to the output? And as you increase β, the TM with highest posterior starts doing fast and then increasingly compute-intensive predictions?
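A toy sketch of the picture this question describes, under loud assumptions: the prior weight of a world model is taken to be β^steps (program length is ignored), each model loses a fixed number of bits of likelihood per observation, and the three models and all their numbers are hypothetical. It is not the paper's definition of the prior, only an illustration of how the top-posterior model could shift from random copying to fast prediction to compute-intensive prediction as β grows.

```python
import math

# Hypothetical world models: (name, compute steps, bits of prediction loss per observation).
MODELS = [
    ("copy random tape", 1, 1.0),
    ("fast heuristic predictor", 100, 0.5),
    ("compute-intensive predictor", 10_000, 0.1),
]

def log2_posterior(steps, loss_bits, beta, n_obs):
    """Unnormalized log2 posterior: prior beta**steps times likelihood 2**(-loss_bits * n_obs)."""
    return steps * math.log2(beta) - loss_bits * n_obs

def top_model(beta, n_obs=1000):
    """Name of the model with the highest toy posterior after n_obs observations."""
    return max(MODELS, key=lambda m: log2_posterior(m[1], m[2], beta, n_obs))[0]

for beta in (1e-30, 0.5, 0.99):
    print(beta, "->", top_model(beta))
```

With these made-up numbers the winner changes as β increases, which is the qualitative behavior being asked about. The sketch says nothing about where along that range a model becomes unsafe.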

As I suggested above, I do think there is huge computational overhead that comes from having evolved life in a world running an algorithm on a “virtual machine” in their Turing-machine-simulated world, compared to the algorithm just being run on a Turing machine that is specialized for that algorithm.

I think if β is small but not too small, the highest posterior would not involve evolved life, but instead a directly coded AGI that runs "natively" on the TM and can decide to execute arbitrary algorithms "natively" on the TM.

Maybe there is still some range of β where BoMAI is both safe and useful (can answer sophisticated questions like "how to build a safe unbounded AGI") because in that range the highest posterior is a good non-life/non-AGI prediction algorithm. But A) I don't know an argument for that, and B) even if it's true, to take advantage of it would seem to require fine tuning β and I don't see how to do that, given that trial-and-error wouldn't be safe.

Comment by wei_dai on Asymptotically Benign AGI · 2019-03-10T12:48:05.312Z · score: 3 (1 votes) · LW · GW

When you ask if it is exponential, what exactly are you asking if it is exponential in?

I guess I was asking if it's exponential in anything that would make BoMAI impractically slow to become "benign", so basically just using "exponential" as a shorthand for "impractically large".

Comment by wei_dai on Asymptotically Benign AGI · 2019-03-10T12:47:33.474Z · score: 5 (2 votes) · LW · GW

It doesn't make sense to me that they're sampling from a universal prior and feeding it into the output channel, because the aliens are trying to take over other worlds through that output channel (and presumably they also have a distinguished input channel to go along with it), so they should be focusing on finding worlds that both can be taken over via the channel (including figuring out the computational costs of doing so) and are worth taking over (i.e., offer greater computational resources than their own), and then generating outputs that are optimized for taking over those worlds. Maybe this can be viewed as sampling from some kind of universal prior (with a short description), but I'm not seeing it. If you think it can or should be viewed that way, can you explain more?

In particular, if they're trying to take over a computationally richer world, like ours, they have to figure out how to make sufficient predictions about the richer world using their own impoverished resources, which could involve doing research that's equivalent to our physics, chemistry, biology, neuroscience, etc. I'm not seeing how sampling from "anthropically updated speed prior" would do the equivalent of all that (unless you end up sampling from a computation within the prior that consists of some aliens trying to take over our world).

Comment by wei_dai on Asymptotically Benign AGI · 2019-03-09T20:19:25.682Z · score: 6 (3 votes) · LW · GW

If there’s an efficient classical approximation of quantum dynamics, I bet this has a concise and lovely mathematical description.

I doubt that there's an efficient classical approximation of quantum dynamics in general. There are probably tricks to speed up the classical approximation of a human mind though (or parts of a human mind), that an alien superintelligence could discover. Consider this analogy. Suppose there's a robot stranded on a planet without technology. What's the shortest algorithm for controlling the robot such that it eventually leaves that planet and reaches another star? It's probably some kind of AGI that has an instrumental goal of reaching another star, right? (It could also be a terminal goal, but there are many other terminal goals that call for interstellar travel as an instrumental goal so the latter seems more likely.) Leaving the planet calls for solving many problems that come up, on the fly, including inventing new algorithms for solving them. If you put all these individual solutions and algorithms together that would also be an algorithm for reaching another star but it could be a lot longer than the code for the AGI.

Comment by wei_dai on Asymptotically Benign AGI · 2019-03-09T05:21:19.261Z · score: 3 (1 votes) · LW · GW

Algorithm B is clearly slower than Algorithm A.

Yes, but algorithm B may be shorter than algorithm A, because it could take a lot of bits to directly specify an algorithm that would accurately predict a human using a classical computer, and fewer bits to pick out an alien superintelligence who has an instrumental reason to invent such an algorithm. If β is set to be so near 1 that the exponential time simulation of real physics can have the highest posterior within a reasonable time, the fact that B is slower than A makes almost no difference and everything comes down to program length.
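The length-versus-speed tradeoff in this argument can be sketched numerically. This is a toy model, assuming a prior weight of 2^(-length) × β^steps; that form, the function names, and the lengths and step counts for A and B are all hypothetical, chosen only so that A is long and fast and B is short and slow.

```python
import math

def log2_weight(length_bits, steps, beta):
    """log2 of the assumed toy prior weight 2**(-length_bits) * beta**steps."""
    return -length_bits + steps * math.log2(beta)

# Hypothetical numbers: A is directly specified (long, fast), B is found indirectly (short, slow).
A = {"length_bits": 10_000, "steps": 1e6}
B = {"length_bits": 1_000, "steps": 1e9}

def preferred(beta):
    """Which of A or B has the larger toy prior weight at this beta."""
    a = log2_weight(beta=beta, **A)
    b = log2_weight(beta=beta, **B)
    return "A" if a > b else "B"

print(preferred(0.99))      # the speed penalty dominates
print(preferred(1 - 1e-9))  # the speed penalty nearly vanishes, so length dominates
```

In this sketch the slower program's extra steps cost almost nothing once β is close enough to 1, so the comparison reduces to program length, as the comment argues.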

Regardless, I think this has become divorced from the discussion about quantum mechanics.

Quantum mechanics is the reason that B being slower than A doesn't matter (via the above argument).

Comment by wei_dai on How dangerous is it to ride a bicycle without a helmet? · 2019-03-09T04:51:39.971Z · score: 14 (5 votes) · LW · GW

I don’t like wearing bike helmets

If I understand your conclusions correctly, by not wearing a bike helmet you'd incur 260 * 2/3 * 30 = 5200 extra micromorts over 30 years of working 260 days each. This equals $260,000 if valuing a micromort at $50, which seems like a lot (although with the caveat this should perhaps be time discounted), and may justify trying to train yourself out of not liking to wear bike helmets.

ETA: Actually multiplying by 30 years doesn't make sense because it's unlikely you'll be biking for all 30 years of your work life, and because of time discounting. Perhaps 10 years would be more reasonable, which would yield $86,666 as the potential value of finding some way to get rid of your dislike of wearing bike helmets.
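The arithmetic in this comment can be checked with a short sketch. It assumes the 2/3 factor is extra micromorts per riding day, which is how the formula reads; the function and parameter names are made up for illustration, and no time discounting is applied.

```python
def helmet_cost(years, days_per_year=260, extra_micromorts_per_day=2 / 3,
                dollars_per_micromort=50):
    """Return (extra micromorts, dollar value) for riding without a helmet, undiscounted."""
    micromorts = days_per_year * extra_micromorts_per_day * years
    return micromorts, micromorts * dollars_per_micromort

for years in (30, 10):
    micromorts, dollars = helmet_cost(years)
    print(f"{years} years: {micromorts:.0f} micromorts, ${dollars:,.0f}")
```

The 30-year case gives 5200 micromorts and $260,000. The 10-year case gives about 1733 micromorts and $86,667 when rounded to the nearest dollar; $86,666 is the same value truncated.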

Comment by wei_dai on Asymptotically Benign AGI · 2019-03-09T02:40:26.490Z · score: 5 (2 votes) · LW · GW

Thanks. Is there a way to derive a concrete bound on how long it will take for BoMAI to become "benign", e.g., is it exponential or something more reasonable? (Although if even a single "malign" episode could lead to disaster, this may be only of academic interest.) Also, to comment on this section of the paper:

"We can only offer informal claims regarding what happens before BoMAI is definitely benign. One intuition is that eventual benignity with probability 1 doesn’t happen by accident: it suggests that for the entire lifetime of the agent, everything is conspiring to make the agent benign."

If BoMAI can be effectively controlled by alien superintelligences before it becomes "benign", that would suggest "everything is conspiring to make the agent benign" is misleading as far as reasoning about what BoMAI might do in the meantime.

(if we manage to successfully run enough episodes without in fact having anything bad happen in the meantime, which is an assumption of the asymptotic arguments)

Is this noted somewhere in the paper, or just implicit in the arguments? I guess what we actually need is either a guarantee that all episodes are "benign" or a bound on utility loss that we can incur through such a scheme. (I do appreciate that "in the absence of any other algorithms for general intelligence which have been proven asymptotically benign, let alone benign for their entire lifetimes, BoMAI represents meaningful theoretical progress toward designing the latter.")

Comment by wei_dai on Asymptotically Benign AGI · 2019-03-08T08:16:47.321Z · score: 7 (3 votes) · LW · GW

My worry at this point is that if simulating the real world using actual physics takes exponential time on your UTM, the world model with the greatest posterior may not be such a simulation. It may instead be, for example, an alien superintelligence that runs efficiently on a classical TM and predicts the behavior of the operator (using various algorithms that it came up with that run efficiently on a classical computer). At some point the alien superintelligence would cause BoMAI to output something to mind hack the operator and then take over our universe. I'm not sure which assumption this would violate, but do you see this as a reasonable concern?

Comment by wei_dai on Asymptotically Benign AGI · 2019-03-08T03:57:21.200Z · score: 3 (1 votes) · LW · GW

It doesn’t make a difference.

I'm surprised by this. Can you explain a bit more? I was thinking that an exponentially large computation bound for the TM that accurately simulates the real world would make its speed prior so small that it would be practically impossible for the AI to get enough inputs (i.e., messages from the operator) to update on to make that world model have the highest weight in the posterior.
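To put rough numbers on this worry, here is a minimal sketch under two simplifying assumptions that are not taken from the paper: each extra computation step multiplies the prior by β, and the accurate world model gains a fixed number of bits of likelihood over its faster competitor per observation. The function name and the example figures are hypothetical.

```python
import math

def observations_needed(extra_steps, beta, bits_gained_per_observation):
    """Observations required for the likelihood advantage to offset a prior penalty of beta**extra_steps."""
    penalty_bits = extra_steps * math.log2(1 / beta)
    return math.ceil(penalty_bits / bits_gained_per_observation)

# An exponentially large step count leads to an exponentially large number of observations.
print(observations_needed(extra_steps=2**40, beta=0.5, bits_gained_per_observation=1.0))
print(observations_needed(extra_steps=100, beta=0.5, bits_gained_per_observation=0.5))
```

Under these assumptions the required number of observations scales linearly with the extra computation, so an exponential computation bound translates directly into an exponential wait unless β is very close to 1.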

Comment by wei_dai on Asymptotically Benign AGI · 2019-03-08T03:28:44.253Z · score: 6 (3 votes) · LW · GW

The best alternative to "benign" that I could come up with is "unambitious". I'm not very good at this type of thing though, so maybe ask around for other suggestions or indicate somewhere prominent that you're interested in giving out a prize specifically for this?

Comment by wei_dai on Asymptotically Benign AGI · 2019-03-08T03:09:10.445Z · score: 3 (1 votes) · LW · GW

I guess we can incorporate into DEBATE the idea of building a box around the debaters and judge with a door that automatically ends the episode when opened. Do you think that would be sufficient to make it "benign" in practice? Are there any other ideas in this paper that you would want to incorporate into a practical version of DEBATE?

Comment by wei_dai on Asymptotically Benign AGI · 2019-03-07T13:20:01.319Z · score: 5 (2 votes) · LW · GW

These three sources all say simulating a quantum system or computer on a classical computer takes exponential time. Does that make a difference?

Comment by wei_dai on Beyond Astronomical Waste · 2019-03-07T05:17:45.352Z · score: 4 (2 votes) · LW · GW

If Tegmark’s picture is accurate, we’d expect to be embedded in some hugely richer base structure—but in Bostrom’s case 3 we’d likely have to get through N levels of worlds-like-ours first. While that wouldn’t significantly change the amount of value on the table, it might make it a lot harder for us to exert influence on the most valuable structures.

I'm not sure it makes sense to talk about "expect" here. (I'm confused about anthropics and especially about first-person subjective expectations.) But if you take the third-person UDT-like perspective here, we're directly embedded in some hugely richer base structures, and also indirectly embedded via N levels of worlds-like-ours, and having more of the latter doesn't reduce how much value (in the UDT-utility sense) we can gain by influencing the former; it just gives us more options that we can choose to take or not. In other words, we always have the option of pretending the latter don't exist and just optimize for exerting influence via the direct embeddings.

On second thought, it does increase the opportunity cost of exerting such influence, because we'd be spending resources in both the directly embedded worlds and the indirectly-embedded worlds to do that. To get around this, the eventual superintelligence doing this could wait until such a time in our universe that Bostrom's proposition 3 isn't true anymore (or true to a lesser extent) before trying to influence richer universes, since presumably only the historically interesting periods of our universe are heavily simulated by worlds-like-ours.

Comment by wei_dai on Asymptotically Benign AGI · 2019-03-07T04:51:44.824Z · score: 3 (1 votes) · LW · GW

Since the real world is quantum, does your UTM need to be quantum too? More generally, what happens if there's a mismatch between what computations can be done efficiently in the real world vs on the UTM?

Also, I'm not sure what category this question falls under, but can you explain the new speed prior that you use, e.g., what problems in the old speed priors was it designed to solve? (I recall noticing some issues with Schmidhuber's speed prior but can't find the post where I wrote about it now.)

Comment by wei_dai on Asymptotically Benign AGI · 2019-03-07T01:23:44.359Z · score: 4 (2 votes) · LW · GW

I expect the human operator moderating this debate would get pretty good at thinking about AGI safety, and start to become noticeably better at dismissing bad reasoning than good reasoning, at which point BoMAI would find the production of correct reasoning a good heuristic for seeming convincing.

Alternatively, the human might have a lot of adversarial examples and the debate becomes an exercise in exploring all those adversarial examples. I'm not sure how to tell what will really happen short of actually having a superintelligent AI to test with.

Three ways that "Sufficiently optimized agents appear coherent" can be false

2019-03-05T21:52:35.462Z · score: 68 (17 votes)

Why didn't Agoric Computing become popular?

2019-02-16T06:19:56.121Z · score: 52 (15 votes)

Some disjunctive reasons for urgency on AI risk

2019-02-15T20:43:17.340Z · score: 34 (8 votes)

Some Thoughts on Metaphilosophy

2019-02-10T00:28:29.482Z · score: 54 (14 votes)

The Argument from Philosophical Difficulty

2019-02-10T00:28:07.472Z · score: 47 (13 votes)

Why is so much discussion happening in private Google Docs?

2019-01-12T02:19:19.332Z · score: 74 (22 votes)

Two More Decision Theory Problems for Humans

2019-01-04T09:00:33.436Z · score: 57 (18 votes)

Two Neglected Problems in Human-AI Safety

2018-12-16T22:13:29.196Z · score: 72 (22 votes)

Three AI Safety Related Ideas

2018-12-13T21:32:25.415Z · score: 73 (26 votes)

Counterintuitive Comparative Advantage

2018-11-28T20:33:30.023Z · score: 70 (25 votes)

A general model of safety-oriented AI development

2018-06-11T21:00:02.670Z · score: 70 (23 votes)

Beyond Astronomical Waste

2018-06-07T21:04:44.630Z · score: 92 (40 votes)

Can corrigibility be learned safely?

2018-04-01T23:07:46.625Z · score: 73 (25 votes)

Multiplicity of "enlightenment" states and contemplative practices

2018-03-12T08:15:48.709Z · score: 93 (23 votes)

Online discussion is better than pre-publication peer review

2017-09-05T13:25:15.331Z · score: 12 (12 votes)

Examples of Superintelligence Risk (by Jeff Kaufman)

2017-07-15T16:03:58.336Z · score: 5 (5 votes)

Combining Prediction Technologies to Help Moderate Discussions

2016-12-08T00:19:35.854Z · score: 13 (14 votes)

[link] Baidu cheats in an AI contest in order to gain a 0.24% advantage

2015-06-06T06:39:44.990Z · score: 14 (13 votes)

Is the potential astronomical waste in our universe too small to care about?

2014-10-21T08:44:12.897Z · score: 25 (27 votes)

What is the difference between rationality and intelligence?

2014-08-13T11:19:53.062Z · score: 13 (13 votes)

Six Plausible Meta-Ethical Alternatives

2014-08-06T00:04:14.485Z · score: 39 (42 votes)

Look for the Next Tech Gold Rush?

2014-07-19T10:08:53.127Z · score: 37 (36 votes)

Outside View(s) and MIRI's FAI Endgame

2013-08-28T23:27:23.372Z · score: 16 (19 votes)

Three Approaches to "Friendliness"

2013-07-17T07:46:07.504Z · score: 20 (23 votes)

Normativity and Meta-Philosophy

2013-04-23T20:35:16.319Z · score: 12 (14 votes)

Outline of Possible Sources of Values

2013-01-18T00:14:49.866Z · score: 14 (16 votes)

How to signal curiosity?

2013-01-11T22:47:23.698Z · score: 21 (22 votes)

Morality Isn't Logical

2012-12-26T23:08:09.419Z · score: 19 (35 votes)

Beware Selective Nihilism

2012-12-20T18:53:05.496Z · score: 40 (44 votes)

Ontological Crisis in Humans

2012-12-18T17:32:39.150Z · score: 44 (48 votes)

Reasons for someone to "ignore" you

2012-10-08T19:50:36.426Z · score: 23 (24 votes)

"Hide comments in downvoted threads" is now active

2012-10-05T07:23:56.318Z · score: 18 (30 votes)

Under-acknowledged Value Differences

2012-09-12T22:02:19.263Z · score: 47 (50 votes)

Kelly Criteria and Two Envelopes

2012-08-16T21:57:41.809Z · score: 11 (8 votes)

Cynical explanations of FAI critics (including myself)

2012-08-13T21:19:06.671Z · score: 21 (32 votes)

Work on Security Instead of Friendliness?

2012-07-21T18:28:44.692Z · score: 35 (39 votes)

Open Problems Related to Solomonoff Induction

2012-06-06T00:26:10.035Z · score: 27 (28 votes)

List of Problems That Motivated UDT

2012-06-06T00:26:00.625Z · score: 28 (29 votes)

How can we ensure that a Friendly AI team will be sane enough?

2012-05-16T21:24:58.681Z · score: 10 (15 votes)

Neuroimaging as alternative/supplement to cryonics?

2012-05-12T23:26:28.429Z · score: 17 (18 votes)

Strong intutions. Weak arguments. What to do?

2012-05-10T19:27:00.833Z · score: 17 (19 votes)

How can we get more and better LW contrarians?

2012-04-18T22:01:12.772Z · score: 58 (62 votes)

Reframing the Problem of AI Progress

2012-04-12T19:31:04.829Z · score: 21 (30 votes)

against "AI risk"

2012-04-11T22:46:10.533Z · score: 24 (37 votes)

Modest Superintelligences

2012-03-22T00:29:03.184Z · score: 22 (28 votes)

A Problem About Bargaining and Logical Uncertainty

2012-03-21T21:03:17.051Z · score: 23 (28 votes)

Where do selfish values come from?

2011-11-18T23:52:41.358Z · score: 38 (32 votes)

Wanted: backup plans for "seed AI turns out to be easy"

2011-09-28T21:54:34.459Z · score: 18 (21 votes)