Donated $300.
I don't follow. Could you make this example more formal, giving a set of outcomes, a set of lotteries over these outcomes, and a preference relation on these that corresponds to "I will act so that, at some point, there will have been a chance of me becoming a heavy-weight champion of the world", and which fails Continuity but satisfies all other VNM axioms? (Intuitively this sounds more like it's violating Independence, but I may well be misunderstanding what you're trying to do since I don't know how to do the above formalization of your argument.)
Also, Magical Britain keeps Muggles out, going so far as to enforce this by not even allowing Muggles to know that Magical Britain exists. I highly doubt that Muggle Britain would do that to potential illegal immigrants even if it did have the technology...
Incidentally, the same argument also applies to Governor Earl Warren's statement quoted in Absence of evidence is evidence of absence: He can be seen as arguing that there are at least three possibilities, (1) there is no fifth column, (2) there is a fifth column and it is supposed to carry out sabotage independently of an invasion, (3) there is a fifth column and it is supposed to aid a Japanese invasion of the West Coast. In case (2), you would expect to have seen sabotage; in cases (1) and (3), you wouldn't -- in case (3) because if the fifth column were known to exist by the time of the invasion, it would be much less effective. Thus, while observing no sabotage is evidence against the fifth column existing, it is evidence in favor of a fifth column existing and being intended to support an invasion. I recently heard Eliezer claim that this was giving Warren too much credit when someone pointed out an interpretation similar to this, but I'm pretty sure this argument was represented in Warren's brain (if not in explicit words) when he made his statement, even if it's pretty plausible that his choice of words was influenced by making it sound as if the absence of sabotage was actually supporting the contention that there was a fifth column.
In particular, Warren doesn't say that the lack of subversive activity convinces him that there is a fifth column; he says that it convinces him "that the sabotage we are to get, the Fifth Column activities are to get, are timed just like Pearl Harbor was timed". Moreover, in the full transcript, he claims that there are reasons to think (1) very unlikely, namely that, he alleges, the Axis powers use fifth-column tactics everywhere else:
To assume that the enemy has not planned fifth column activities for us in a wave of sabotage is simply to live in a fool's paradise. These activities, whether you call them "fifth column activities" or "sabotage" or "war behind the lines upon civilians," or whatever you may call it, are just as much an integral part of Axis warfare as any of their military and naval operations. When I say that I refer to all of the Axis powers with which we are at war. [...] Those activities are now being used actively in the war in the Pacific, in every field of operations about which I have read. They have unquestionably, gentlemen, planned such activities for California. For us to believe to the contrary is just not realistic.
I.e., he claims that (1) would be a striking anomaly given the Axis powers' behavior elsewhere. On the other hand, he suggests that (3) fits a pattern of surprise attacks:
[...] It convinces me more than perhaps any other factor that the sabotage that we are to get, the fifth column activities that we are to get, are timed just like Pearl Harbor was timed and just like the invasion of France, and of Denmark, and of Norway, and all of those other countries.
And later, he explicitly argues that you wouldn't expect to have seen sabotage in case (3):
If there were sporadic sabotage at this time or if there had been for the last 2 months, the people of California or the Federal authorities would be on the alert to such an extent that they could not possibly have any real fifth column activities when the M-day comes.
So he has the pieces there for a correct Bayesian argument that a fifth column still has high posterior probability after seeing no sabotage, and that a fifth column intended to support an invasion has higher posterior than prior probability: Low prior probability of (1); (comparatively) high prior probability of (3); and an argument that (3) predicts the evidence nearly as well as (1) does. I'm not saying his premises are true, just that the fact that he claims all of them suggests that his brain did in fact represent the correct argument. The fact that he doesn't say that this argument convinces him "more than anything" that there is a fifth column, but rather says that it convinces him that the sabotage will be timed like Pearl Harbor (and France, Denmark and Norway), further supports this -- though, as noted above, while I think that his brain did represent the correct argument, it does seem plausible that his words were chosen so as to suggest the alternative interpretation as well.
The true message of the first video is even more subliminal: The whiteboard behind him shows some math recently developed by MIRI, along with a (rather boring) diagram of Botworld :-)
Sorry about that; I've had limited time to spend on this, and have mostly come down on the side of trying to get more of my previous thinking out there rather than replying to comments. (It's a tradeoff where neither of the options is good, but I'll try to at least improve my number of replies.) I've replied there. (Actually, now that I spent some time writing that reply, I realize that I should probably just have pointed to Coscott's existing reply in this thread.)
I'm not sure which of the following two questions you meant to ask (though I guess probably the second one), so I'll answer both:
(a) "Under what circumstances is something (either an l-zombie or conscious)?" I am not saying that something is an l-zombie only if someone has actually written out the code of the program; for the purposes of this post, I assume that all natural numbers exist as platonical objects, and therefore all observers in programs that someone could in principle write and run exist at least as l-zombies.
(b) "When is a program an l-zombie, and when is it conscious?" The naive view would be that the program has to be actually run in the physical world; if you've written a program and then deleted the source without running it, it wouldn't be conscious. But as to what exactly the rule is that you can use to look at say a cellular automaton (as a model of physical reality) and ask whether the conscious experience inside a given Turing machine is "instantiated inside" that automaton, I don't have one to propose. I do think that's a weak point of the l-zombies view, and one reason that I'd assign measureless Tegmark IV higher a priori probability.
Thank you for the feedback, and sorry for causing you distress! I genuinely did not take into consideration that this choice could cause distress, but it could have occurred to me, and I apologize.
On how I came to think that it might be a good idea (as opposed to missing that it might be a bad idea): While there's math in this post, the point is really the philosophy rather than the math (whose role is just to help thinking more clearly about the philosophy, e.g. to see that PBDT fails in the same way as NBDT on this example). The original counterfactual mugging was phrased in terms of dollars, and one thing I wondered about in the early discussions was whether thinking in terms of these low stakes made people think differently than they would if something really important were at stake. I'm reconstructing, it's been a while, but I believe that's what made me rephrase it in terms of the whole world being at stake. Later, I chose the torture as something that, on a scale I'd reflectively endorse (as opposed to, I acknowledge, actual psychology), is much less important than the fate of the world, but still important. But I entirely agree that for the purposes of this post, "paying $1" (any small negative effect) would have made the point just as well.
In short, I don't think SUDT (or UDT) by itself solves the problem of counterfactual mugging. [...] Perhaps SUDT also needs to specify a rule for selecting utility functions (e.g. some sort of disinterested "veil of ignorance" on the decider's identity, or an equivalent ban on utilities which sneak it in a selfish or self-interested term).
I'll first give an answer to a relatively literal reading of your comment, and then one to what IMO you are "really" getting at.
Answer to a literal reading: I believe that what you value is part of the problem definition, it's not the decision theory's job to constrain that. For example, if you prefer DOOM to FOOM, (S)UDT doesn't say that your utilities are wrong, it just says you should choose (H). And if we postulate that someone doesn't care whether there's a positive intelligence explosion if they don't get to take part in it (not counting near-copies), then they should choose (H) as well.
But I disagree that this means that (S)UDT doesn't solve the counterfactual mugging. It's not like the copy-selfless utility function I discuss in the post automatically makes clear whether we should choose (H) or (T): If we went with the usual intuition that you should update on your evidence and then use the resulting probabilities in your expected utility calculation, then even if you are completely selfless, you will choose (H) in order to do the best for the world. But (S)UDT says that if you have these utilities, you should choose (T). So it would seem that the version of the counterfactual mugging discussed in the post exhibits the problem, and (S)UDT comes down squarely on the side of one of the potential solutions.
Answer to the "real" point: But of course, what I read you as "really" saying is that we could re-interpret our intuition that we should use updated probabilities as meaning that our actual utility function is not the one we would write down naively, but a version where the utilities of all outcomes in which the observer-moment making the decision isn't consciously experienced are replaced by a constant. In the case of the counterfactual mugging, this transformation gives exactly the same result as if we had updated our probabilities. So in a sense, when I say that SUDT comes down on the side of one of the solutions, I am implicitly using a rule for how to go from "naive" utilities to utilities-to-use-in-SUDT: namely, the rule "just use the naive utilities". And when I use my arguments about l-zombies to argue that choosing (T) is the right solution to the counterfactual mugging, I need to argue why this rule is correct.
In terms of clarity of meaning, I have to say that I don't feel too bad about not spelling out that the utility function is just what you would normally call your utility function, but in terms of the strength of my arguments, I agree that the possibility of re-interpreting updating in terms of utility functions is something that needs to be addressed for my argument from l-zombies to be compelling. It just happens to be one of the many things I haven't managed to address in my updateless anthropics posts so far.
In brief, my reasons are twofold: First, I've asked myself, suppose that it actually were the case that I were an l-zombie, but could influence what happens in the real world; what would my actual values be then? And the answer is, I definitely don't completely stop caring. And second, there's the part where this transformation doesn't just give back exactly what you would have gotten if you updated in all anthropic problems, which makes the case for it suspect. The situations I have in mind are when your decision determines whether you are a conscious observer: In this case, how you decide depends on the utility you assign to outcomes in which you don't exist, something that doesn't have any interpretation in terms of updating. If the only reason I adopt these utilities is to somehow implement my intuitions about updating, it seems very odd to suddenly have this new number influencing my decisions.
It's priors over logical states of affairs. Consider the following sentence: "There is a cellular automaton that can be described in at most 10 KB in programming language X, plus a computable function f() which can be described in another 10 KB in the same programming language, such that f() returns a space/time location within the cellular automaton corresponding to Earth as we know it in early 2014." This could be false even if Tegmark IV is true, and prior probability (i.e., probability without trying to do an anthropic update of the form "I observe this, so it's probably simple") says it's probably false.
Yup, sure.
To summarize that part of the post: (1) The view I'm discussing there argues that the reason we find ourselves in a simple-looking world is that all possible experiences are consciously experienced, including the ones where the world looks simple, and we just happen to experience the latter. (2) If this is correct, then you cannot use the fact that you look around and see a simple-looking world to infer that you live in a world that actually is simple, because there are plenty of complex, interventionist worlds that look deceptively simple. In fact, the prior probability that the particular world you see is actually simple is extremely low. (3) However, if you value the things that happen in actually simple worlds more than the things that happen in complex worlds, then it's still correct to act as if your simple-looking world is in fact simple, despite the fact that prior probability says this is possibly wrong (or to put this differently, even though most of the equally-existing mathematically possible humans reasoning like this will be wrong).
I don't feel like considering these different ways to approach K-complexity addresses the point I was trying to make. The rebuttal seems to be arguing that we should weigh the TMs that don't read the end of the tape equally, rather than weighing TMs more that read less of the tape. But my point isn't that I don't want to weigh complex TMs as much as simple TMs; it is (1) that I seem to be willing to consider TMs with one obviously disorderly event "pretty simple", even though I think they have high K-complexity; and (2) given this, only disregarding the possibility of magical reality fluid in worlds where I've seen a single obviously disorderly event doesn't seem to lose me all that much utility if measureless Tegmark IV is true, compared to the utility I may lose if there actually is magical reality fluid or something like that and I ignore this possibility and, because of this, act in a way that is very bad.
(If there aren't any important ways in which I'd act differently if measureless Tegmark IV is false, then this argument has no pull, but I think there may be; for example, if the ultrafinitist hypothesis from the end of my post were correct, that might make a difference to FAI theory.)
So, I can see that you would care similarly to how you would in a multiverse with magical reality fluid that's distributed in the same proportions as your measure of caring, and if your measure of caring is K-complexity with respect to a universal Turing machine (UTM) we would consider simple, it's at least one plausible possibility that the true magical reality fluid is distributed in roughly those proportions. But given the state of our confusion, I think that conditional on there being a true measure, any single hypothesis as to how that measure is distributed should have significantly less than 50% probability, so "Conditional on there being a true measure, I would act the same way as according to my K-complexity based preferences" sounds wrong to me. (One particularly salient other possibility is that we could have magical reality fluid due to Tegmark I -- infinite space -- and Tegmark III -- many-worlds -- but not due to all mathematically possible universes existing, in which case we surely wouldn't get weightings that are close to K-complexity with a simple UTM. I mean, this is a case of one single universe, but with all possible experiences existing, to different degrees.)
But you see Eliezer's comments because a conscious copy of Eliezer has been run.
A conscious copy of Eliezer that thought about what Eliezer would do when faced with that situation, not a conscious copy of Eliezer actually faced with that situation -- the latter Eliezer is still an l-zombie, if we live in a world with l-zombies.
For l-zombies to do anything they need to be run, whereupon they stop being l-zombies.
Omega doesn't necessarily need to run a conscious copy of Eliezer to be pretty sure that Eliezer would pay up in the counterfactual mugging; it could use other information about Eliezer, like Eliezer's comments on LW, the way that I just did. It should be possible to achieve pretty high confidence that way about what Eliezer-being-asked-about-a-counterfactual-mugging would do, even if that version of Eliezer should happen to be an l-zombie.
Fixed, thanks!
(Agree with Coscott's comment.)
I meant useful in the context of AI since any such sequence would obviously have to be non-computable and thus not something the AI (or person) could make pragmatic use of.
I was replying to this:
Ultimately, you can always collapse any computable sequence of computable theories (necessary for the AI to even manipulate) into a single computable theory so there was never any hope this kind of sequence could be useful.
I.e., I was talking about computable sequences of computable theories, not about non-computable ones.
Also, it is far from clear that T_0 is the union of all theories (and this is the problem in the proof in the other writeup). It may well be that there is a sequence of theories like this, all true in the standard model of arithmetic, but that their construction requires that T_n add extra statements beyond the schema for the proof predicate of T_{n+1}.
I can't make sense of this. Of course T_n can contain statements other than those in T_{n+1} and the Löb schema of T_{n+1}, but this is no problem for the proof that T_0 is the union of all the theories; the point is that because of the Löb schema, we have T_{n+1} \subset T_n for all n, and therefore (by transitivity of the subset operation) T_n \subseteq T_0 for all n.
Also, the claim that T_n must be stronger than T_{n+1} (prove a superset of it... to be computable we can't take all these theories to be complete) is far from obvious if you don't require that T_n be true in the standard model. If T_n is true in the standard model then, as it proves that Pf(T_{n+1}, \phi) -> \phi, this is true, so if T_{n+1} |- \phi then (as this is witnessed in a finite proof) there is a proof that this holds from T_n and thus a proof of \phi. However, without this assumption I don't even see how to prove the containment claim.
Note again that I was talking about computable sequences T_n. If T_{n+1} |- \phi and T_{n+1} is computable, then PA |- Pf(T_{n+1}, \phi) and therefore T_n |- Pf(T_{n+1}, \phi) if T_n extends PA. This doesn't require either T_n or T_{n+1} to be sound.
Actually, the `proof' you gave that no true list of theories like this exists made the assumption (not listed in this paper) that the sequence of indexes for the computable theories is definable over arithmetic. In general there is no reason this must be true but of course for the purposes of an AI it must.
("This paper" being Eliezer's writeup of the procrastination paradox.) That's true, thanks.
Ultimately, you can always collapse any computable sequence of computable theories (necessary for the AI to even manipulate) into a single computable theory so there was never any hope this kind of sequence could be useful.
First of all (always assuming the theories are at least as strong as PA), note that in any such sequence, T_0 is the union of all the theories in the sequence; if T_(n+1) |- phi, then PA |- Box_(T_(n+1)) "phi", so T_n |- Box_(T_(n+1)) "phi", so by the trust schema, T_n |- phi; going up the chain like this, T_0 |- phi. So T_0 is in fact the "collapse" of the sequence into a single theory.
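In symbols, the induction step is just (writing $\Box_T$ for the provability predicate of $T$):

$$T_{n+1} \vdash \varphi \;\Longrightarrow\; \mathrm{PA} \vdash \Box_{T_{n+1}}\ulcorner\varphi\urcorner \;\Longrightarrow\; T_n \vdash \Box_{T_{n+1}}\ulcorner\varphi\urcorner \;\Longrightarrow\; T_n \vdash \varphi,$$

where the last implication uses the trust schema of $T_n$; iterating down to $n = 0$ gives $T_0 \vdash \varphi$.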
That said, I disagree that there is no hope that this kind of sequence could be useful. (I don't literally want to use an unsound theory, but see my writeup about an infinite sequence of sound theories each proving the next consistent, linked from the main post; the same remarks apply there.) Yes, T_0 is stronger than T_1, so why would you ever want to use T_1? Well, T_0 + Con(T_0) is stronger than T_0, so why would you ever want to use T_0? But by this argument, you can't use any sound theory including PA, so this doesn't seem like a remotely reasonable argument against using T_1. Moreover, the fact that an agent using T_0 can construct an agent using T_1, but it can't construct an agent using T_0, seems like a sufficient argument against the claim that the sequence as a whole must be useless because you could always use T_0 for everything.
Independent.
I'm hard-pressed to think of any more I could want from [the coco-value] (aside from easy extensions to bigger classes of games).
Invariance to affine transformations of players' utility functions. This solution requires that both players value outcomes in a common currency, plus the physical ability to transfer utility in this currency outside the game (unless there are two outcomes o_1 and o_2 of the game such that A(o_1) + B(o_1) = A(o_2) + B(o_2) = max_o (A(o) + B(o)), and such that A(o_1) >= A's coco-value >= A(o_2), in which case the players can decide to play the convex combination of these two outcomes that gives each player their coco-value; but this only solves the utility transfer problem, it doesn't make the solution invariant under affine transformations).
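For reference, and going from memory (so treat the exact form as an assumption): the coco value of a game with payoff functions $A$ and $B$ decomposes as

$$\mathrm{coco}_A = \max_o \frac{A(o)+B(o)}{2} + \mathrm{val}\!\left(\tfrac{A-B}{2}\right), \qquad \mathrm{coco}_B = \max_o \frac{A(o)+B(o)}{2} - \mathrm{val}\!\left(\tfrac{A-B}{2}\right),$$

where $\mathrm{val}$ is the minimax value, to the first player, of the zero-sum game with payoffs $(A-B)/2$. Both terms mix $A$ and $B$, so independently rescaling one player's utilities changes the result, which is why the common currency is doing real work here.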
...so? What you say is true but seems entirely irrelevant to the question what the superrational outcome in an asymmetric game should be.
Retracted my comment for being unhelpful (I don't recognize what I said in what you heard, so I'm clearly not managing to explain myself here).
Agree with Nisan's intuition, though I also agree with Wei Dai's position that we shouldn't feel sure that Bayesian probability is the right way to handle logical uncertainty. To more directly answer the question what it means to assign a probability to the twin prime conjecture: If Omega reveals to you that you live in a simulation, and it offers you a choice between (a) Omega throws a bent coin which has probability p of landing heads, and shuts down the simulation if it lands tails, otherwise keeps running it forever; and (b) Omega changes the code of the simulation to search for twin primes and run for one more step whenever it finds one; then you should be indifferent between (a) and (b) iff you assign probability p to the twin prime conjecture. [ETA: Argh, ok, sorry, not quite, because in (b) you may get to run for a long time still before getting shut down -- but you get the idea of what a probability over logical statements should mean.]
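Spelling out the indifference condition (ignoring the caveat in the ETA, and assuming your utility depends only on whether the simulation runs forever or gets shut down): if $q$ is your probability for the twin prime conjecture, you should be indifferent iff

$$p \cdot U(\text{runs forever}) + (1-p) \cdot U(\text{shut down}) \;=\; q \cdot U(\text{runs forever}) + (1-q) \cdot U(\text{shut down}),$$

which, as long as $U(\text{runs forever}) \neq U(\text{shut down})$, holds exactly when $p = q$.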
I'm not saying we'll take the genome and read it to figure out how the brain does what it does; I'm saying that we run a brain simulation and do science (experiments) on it and study how it works, similarly to how we study how DNA transcription or ATP production or muscle contraction or a neuron's ion pumps or the Krebs cycle or honeybee communication or hormone release or cell division or the immune system or chick begging or the heart's pacemaker work. There are a lot of things evolution hasn't obfuscated so much that we haven't been able to figure out what they're doing. Of course there's also a lot of things we don't understand yet, but I don't see how that leads to the conclusion that evolution is generally obfuscatory.
Saying that all civilizations able to create strong AI will reliably be wise enough to avoid creating strong AI seems like a really strong statement, without any particular reason to be true. By analogy, if you replace civilizations by individual research teams, would it be safe to rely on each team capable of creating uFAI to realize the dangers of doing so and therefore refrain from doing so, so that we can safely take a much longer time to figure out FAI? Even if it were the case that most teams capable of creating uFAI hold back like this, one single rogue team may be enough to destroy the world, and it just seems really likely that there will be some not-so-wise people in any large enough group.
Good points.
evolution hit on some necessary extraordinarily unlikely combination to give us intelligence and for P vs NP reasons we can't find it
For this one, you also need to explain why we can't reverse-engineer it from the human brain.
no civilization smart enough to create strong AI is stupid enough to create strong AI
This seems particularly unlikely in several ways; I'll skip the most obvious one, but also, it seems unlikely that humans are "safe" (in the sense that they won't create a FOOMing AI) while it is impossible, even with much thought, to create a strong AI that doesn't create a FOOMing successor. You may have to stop creating smarter successors at some early point in order to avoid a FOOM, but if humans can decide "we will never create a strong AI", it seems like they should also be able to decide "we'll never create a strong AI x that creates a stronger AI y that creates an even stronger AI z", and therefore be able to create an AI x' that decides "I'll never create a stronger AI y' that creates an even stronger AI z'", and then x' would be able to create a stronger AI y' that decides "I'll never create a stronger AI z''", and then y' won't be able to create any stronger successor AIs.
(Shades of the procrastination paradox.)
I would agree with your reasoning if CFAR claimed that they can reliably turn people into altruists free of cognitive biases within the span of their four-day workshop. If they claimed that and were correct in that, then it shouldn't matter whether they (a) require up-front payment and offer a refund or (b) have people decide what to pay after the workshop, since a bias-free altruist would end up paying the same in either case. There would only be a difference if CFAR didn't achieve what, in this counterfactual scenario, it claimed to achieve, so they should be willing to choose option (b), which would be better for their participants if they don't achieve these claims.

But of course CFAR doesn't actually claim that they can make you bias-free in four days, or even that they can make themselves bias-free with years of training. Much of CFAR's curriculum is aimed at taking the brain we actually have and tweaking the way we use it in order to achieve better (not perfect, but better) results -- for example, using tricks that seem to engage our brain's mechanisms for habit formation, in order to bypass using willpower to stick with a habit, rather than somehow acquiring all the willpower that would be useful to have (since there's no known way to just do that). Or consider precommitment devices like Beeminder -- a perfectly bias-free agent wouldn't have any use for these, but many CFAR alumni (and, I believe, CFAR instructors) have found them useful.

CFAR doesn't pretend to be able to turn people into bias-free rationalists who don't need such devices, so I see nothing inconsistent about them both believing that they can deliver useful training that makes people on average both more effective and more altruistic (though I would expect the latter to only be true in the long run, through contact with the CFAR community, and only for a subset of people, rather than for the vast majority of attendees right after the 4-day workshop), and also believing that if they didn't charge up-front and asked people to pay afterwards whatever they thought it was worth, they wouldn't make enough money to stay afloat.
Yep: CFAR advertised their fundraiser in their latest newsletter, which I received on Dec. 5.
The only scenario I can see where this would make sense is if SIAI expects small donors to donate less than $(1/2)N in a dollar-for-dollar scheme, so that its total gain from the fundraiser would be below $(3/2)N, but expects to get the full $(3/2)N in a two-dollars-for-every-dollar scheme. But not only does this seem like a very unlikely story [...]
One year later, the roaring success of MIRI's Winter 2013 Matching Challenge, which is offering 3:1 matching for new large donors (people donating >= $5K who have donated less than $5K in total in the past) -- almost $232K out of the $250K maximum donated by the time of writing, with more than three weeks left, whereas the Winter 2012 Fundraiser the parent is commenting on only reached its goal of $115K after a deadline extension, and the Summer 2013 Matching Challenge only reached its $200K goal around the time of the deadline -- means that I pretty much need to eat my hat on the "very unlikely story" comment above. (There's clearly an upward growth curve as well, but it does seem clear that lots of people wanted to take advantage of the 3:1.)
So far I still stand by the rest of the comment, though:
[...] even if it did happen it seems that you should want to donate in the current fundraiser if you're willing to do so [at 1:1 matching], since this means that more matching funds would be available in the later two-dollars-for-every-dollar fundraiser for getting the other people to donate who we are postulating aren't willing to donate at dollar-for-dollar.
Yes, a real-life reasoner would have to use probabilistic reasoning to carry out these sorts of inference. We do not yet have a real understanding of how to do probabilistic reasoning about logical statements, though there has been a bit of work on it in the past. This is one topic MIRI is currently doing research on. In the meantime, we also examine problems of self-reference in ordinary deductive logic, since we understand it very well. It's not certain that the results there will carry over in any way into the probabilistic setting, and it's possible that these problems simply disappear if you go to probabilistic reasoning, but there's no reason to consider this overwhelmingly likely, and if they don't it seems likely that at least some insights gained while thinking about deductive logic will carry over. In addition, when an AI reasons about another AI, it seems likely that it will use deductive logic when reasoning about the other AI's source code, even if it also has to use probabilistic reasoning to connect the results obtained that way to the real world, where the other AI runs on an imperfect processor and its source code isn't known with certainty.
More here.
There is a way to write a predicate Proves(p,f) in the language of PA which is true if f is the Gödel number of a formula and p is the Gödel number of a proof of that formula from the axioms of PA. You can then define a predicate Provable(f) := exists p. Proves(p,f); then Provable(f) says that f is the Gödel number of a provable formula. Writing "A" for the Gödel number of the formula A, we can then write
PA |- Provable("A")
to say that there's a proof that A is provable, and
PA |- Provable("Provable("A")")
to say that there's a proof that there's a proof that A is provable. (Of course, PA |- A just says that there is a proof of A.)
On the metalevel, we know that
PA |- A
if and only if
PA |- Provable("A").
On the other hand, PA proves neither A -> Provable("A") nor Provable("A") -> A for general A.
(Hope that helps?)
An example of this: CFAR has published some results on an experiment where they tried to see if they could improve people's probability estimates by asking them how surprised they'd be by the truth about some question turning out one way or another. They expected it would, but it turned out it didn't. And that doesn't surprise me. If imagined feelings of surprise contained some information naive probability-estimation methods didn't, why wouldn't we have evolved to tap that information automatically?
Because so few of our ancestors died because they got numerical probability estimates wrong.
I agree with the general idea in your post, but I don't think it strongly predicts that CFAR's experiment would fail. Moreover, if it predicts that, why doesn't it also predict that we should have evolved to sample our intuitions multiple times and average the results, since that seems to give more accurate numerical estimates? (I don't actually think this single article is very strong evidence for or against this interpretation of the hypothesis by itself, but neither do I think that CFAR's experiment is; I think the likelihood ratios aren't particularly extreme in either case.)
Thanks!
Mark, have you read Eliezer's article about the Löbian obstacle, and what was your reaction to it?
I'm in the early stages of writing up my own work on the Löbian obstacle for publication, which will need to include its own (more condensed, rather than expanded) exposition of the Löbian obstacle; but I liked Eliezer's article, so it would be helpful to know why you didn't think it argued the point well enough.
Don't worry, I wasn't offended :)
Good to hear, and thanks for the reassurance :-) And yeah, I do too well know the problem of having too little time to write something polished, and I do certainly prefer having the discussion in fairly raw form to not having it at all.
One possibility is that MIRI's arguments actually do look that terrible to you
What I would say is that the arguments start to look really fishy when one thinks about concrete instantiations of the problem.
I'm not really sure what you mean by a "concrete instantiation". I can think of concrete toy models, of AIs using logical reasoning which know an exact description of their environment as a logical formula, which can't reason in the way I believe is what we want to achieve, because of the Löbian obstacle. I can't write down a self-rewriting AGI living in the real world that runs into the Löbian obstacle, but that's because I can't write down any AGI that lives in the real world.
My reason for thinking that the Löbian obstacle may be relevant is that, as mentioned in the interview, I think that a real-world seed FAI will probably use (something very much like) formal proofs to achieve the high level of confidence it needs in most of its self-modifications. I feel that formally specified toy models + this informal picture of a real-world FAI are as close to thinking about concrete instantiations as I can get at this point.
I may be wrong about this, but it seems to me that when you think about concrete instantiations, you look towards solutions that reason about the precise behavior of the program they're trying to verify -- reasoning like "this variable gets decremented in each iteration of this loop, and when it reaches zero we exit the loop, so we won't loop forever". But heuristically, while it seems possible to reason about the program you're creating in this way, our task is to ensure that we're creating a program which creates a program which creates a program which goes out to learn about the world and look for the most efficient way to use transistors it finds in the external environment to achieve its goals, and we want to verify that those transistors won't decide to blow up the world; it seems clear to me that this is going to require reasoning of the type "the program I'm creating is going to reason correctly about the program it is creating", which is the kind of reasoning that runs into the Löbian obstacle, rather than the kind of reasoning applied by today's automated verification techniques.
Writing this, I'm not too confident that it will be helpful in getting the idea across. Hope the face-to-face with Paul will help, perhaps also with translating your intuitions into a language that better matches the way I think about things.
I think that the point above would be really helpful to clarify, though. This seems to be a recurring theme in my reactions to your comments on MIRI's arguments -- e.g. there was that LW conversation you had with Eliezer where you pointed out that it's possible to verify properties probabilistically in more interesting ways than running a lot of independent trials, and I go, yeah, but how is that going to help with verifying whether the far-future descendant of an AI we build now, when it has entire solar systems of computronium to run on, is going to avoid running simulations which by accident contain suffering sentient beings? It seems that to achieve confidence that this far-future descendant will behave in a sensible way, without unduly restricting the details of how it is going to work, is going to need fairly abstract reasoning, and the sort of tools you point to don't seem to be capable of this or to extend in some obvious way to dealing with this.
You seem to be quite willing to use that reasoning yourself to show that the initial AI is safe
I'm not sure I understand what you're saying here, but I'm not convinced that this is the sort of reasoning I'd use.
I'm fairly sure that the reason your brain goes "it would be safe if we only allow self-modifications when there's a proof that they're safe" is that you believe that if there's a proof that a self-modification is safe, then it is safe -- I think this is probably a communication problem between us rather than you actually wanting to use different reasoning. But again, hopefully the face-to-face with Paul can help with that.
I don't think that "whole brain emulations can safely self-modify" is a good description of our disagreements. I think that this comment (the one you just made) does a better job of it. But I should also add that my real objection is something more like: "The argument in favor of studying Lob's theorem is very abstract and it is fairly unintuitive that human reasoning should run into that obstacle. [...]"
Thanks for the reply! Thing is, I don't think that ordinary human reasoning should run into that obstacle, and the "ordinary" is just to exclude humans reasoning by writing out formal proofs in a fixed proof system and having these proofs checked by a computer. But I don't think that ordinary human reasoning can achieve the level of confidence an FAI needs to achieve in its self-rewrites, and the only way I currently know how an FAI could plausibly reach that confidence is through logical reasoning. I thought that "whole brain emulations can safely self-modify" might describe our disagreement because that would explain why you think that human reasoning not being subject to Löb's theorem would be relevant.
My next best guess is that you think that even though human reasoning can't safely self-modify, its existence suggests that it's likely that there is some form of reasoning which is more like human reasoning than logical reasoning and therefore not subject to Löb's theorem, but which is sufficiently safe for a self-modifying FAI. Request for reply: Would that be right?
I can imagine that that might be the case, but I don't think it's terribly likely. I can more easily imagine that there would be something completely different from both human reasoning or logical reasoning, or something quite similar to normal logical reasoning but not subject to Löb's theorem. But if so, how will we find it? Unless essentially every kind of reasoning except human reasoning can easily be made safe, it doesn't seem likely that AGI research will hit on a safe solution automatically. MIRI's current research seems to me like a relatively promising way of trying to search for a solution that's close to logical reasoning.
When I say "failure to understand the surrounding literature", I am referring more to a common MIRI failure mode of failing to sanity-check their ideas / theories with concrete examples / evidence. I doubt that this comment is the best place to go into that, but perhaps I will make a top-level post about this in the near future.
Ok, I think I probably don't understand this yet, and making a post about it sounds like a good plan!
Sorry for ducking most of the technical points, as I said, I hope that talking to Paul will resolve most of them.
No problem, and hope so as well.
Since the PSM was designed without self-modification in mind, "safe but unable to improve itself in effective ways".
(Not sure how this thought experiment helps the discussion along.)
MIRI's stated goals are similar to those of mainstream AI research, and MIRI's approach in particular includes as subgoals the goals of research fields such as model checking and automated theorem proving.
It's definitely not a goal of mainstream AI, and not even a goal of most AGI researchers, to create self-modifying AI that provably preserves its goals. MIRI's work on this topic doesn't seem relevant to what mainstream AI researchers want to achieve.
Zooming out from MIRI's technical work to MIRI's general mission, it's certainly true that MIRI's failure to convince the AI world of the importance of preventing unFriendly AI is Bayesian evidence against MIRI's perspective being correct. Personally, I don't find this evidence strong enough to make me think that preventing unFriendly AI isn't worth working on.
Also, two more points on why MIRI isn't that likely to produce research AI researchers will see as a direct boon to their field: One, stuff that's close to something people are already trying to do is more often already worked on; the stuff that people aren't working on seems more important for MIRI to work on. And two, AGI researchers in particular are particularly interested in results that get us closer to AGI, and MIRI is trying to work on topics that can be published about without bringing the world closer to AGI.
I thought the example was pretty terrible.
Glad to see you're doing well, Benja :)
Sorry for being curmudgeonly there -- I did afterwards wish that I had tempered that. The thing is that when you write something like
I also agree that the idea of "logical uncertainty" is very interesting. I spend much of my time as a grad student working on problems that could be construed as versions of logical uncertainty.
that sounds to me like you're painting MIRI as working on these topics just because it's fun, and supporting its work by arguments that are obviously naive to someone who knows the field, and that you're supporting this by arguments that miss the point of what MIRI is trying to say. That's why I found the example of program analysis so annoying -- people who think that the halting problem means that program analysis is impossible really are misinformed (really Rice's theorem, but someone with this misconception wouldn't be aware of that), both about the state of the field and about why these theorems say what they say. E.g., yes, of course your condition is undecidable as long as there is any choice f(s) of chooseAction2(s) that satisfies it; proof: let chooseAction2(s) be the program that checks whether chooseAction2 satisfies your criterion, and if yes returns chooseAction(s), otherwise f(s). That's how these proofs always go, and of course that doesn't mean that there are no programs that are able to verify the condition for an interesting subclass of chooseAction2's; the obvious interesting example is searching for a proof of the condition in ZFC, and the obvious boring example is that there is a giant look-up table which decides the condition for all choices of chooseAction2(s) of length less than L.
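Just to make the shape of that diagonalization concrete, here's a purely schematic sketch. Everything in it is a placeholder: claimed_decider stands in for the supposed decision procedure, f for a choice assumed to satisfy the criterion, chooseAction for one assumed not to (the standard extra assumption in this kind of proof), and a real argument would quote the whole program rather than just this one function's text:

```python
import inspect

def chooseAction(s):
    return "unconstrained action"  # assumed NOT to satisfy the criterion

def f(s):
    return "known-good action"     # assumed to satisfy the criterion

def make_diagonal(claimed_decider):
    """Build a chooseAction2 that defeats any claimed decider for the criterion."""
    def chooseAction2(s):
        # Ask the decider about this very function's source code.
        my_source = inspect.getsource(chooseAction2)
        if claimed_decider(my_source):
            # Decider says we satisfy the criterion -> behave like chooseAction,
            # which by assumption does not satisfy it, so the decider is wrong.
            return chooseAction(s)
        else:
            # Decider says we don't -> behave like f, which does, so again wrong.
            return f(s)
    return chooseAction2
```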
One possibility is that MIRI's arguments actually do look that terrible to you, but that this is because MIRI hasn't managed to make them clearly enough. I'm thinking this may be the case because you write:
In addition, there are more general proof strategies than the above if that one does not satisfy you. For instance, I could just require that any proposed modification to chooseAction2 come paired with a proof that that modification will be safe. Now I agree that there exist choices of chooseAction2 that are safe but not provably safe and this strategy disallows all such modifications. But that doesn't seem so restrictive to me.
First, that's precisely the "obvious" strategy that's the starting point for MIRI's work.
Second, yes, Eliezer's arguing that this isn't good enough, but the reason isn't that it there are some safe modifications which aren't provably safe. The work around the Löbian obstacle has nothing to do with trying to work around undecidability. (I will admit that for a short period at the April workshop I thought this might be a good research direction, because I had my intuitions shaken by the existence of Paul's system and got overly optimistic, but Paul quickly convinced me that this was an unfounded hope, and in any case the main work around the Löbian obstacle was never really related to this.) MIRI's argument definitely isn't that "the above algorithm can't decide for all chooseAction2 whether they're safe, therefore it probably can't decide it for the kind of chooseAction2 we're interested in, therefore it's unacceptable". If that's how you've understood the argument, then I see why you would think that the program analysis example is relevant. (The argument is indeed that it seems to be unable to decide safety for the chooseAction2 we're interested in, but not because it's unable to decide this for any generic chooseAction2.)
Third, you seem to imply that your proposal will only take safe actions. You haven't given an argument for why we should think so, but the implication seems clear: You're using a chooseAction that is obviously safe as long as it doesn't rewrite itself, and it will only accept a proposed modification if it comes with a proof that it is safe, so if it does choose to rewrite itself then its successor will in fact be safe as well. Now I think this is fine reasoning, but you don't seem to agree:
Finally, I agree that such a naieve proof strategy as "doing the following trivial self-modification is safe because the modified me will only do things that it proves won't destroy the world, thus it won't destroy the world" does not work. I'm not proposing that. The proof system clearly has to do some actual work.
You seem to be quite willing to use that reasoning yourself to show that the initial AI is safe, but you don't think the AI should be able to use the same sort of reasoning. Eliezer's argument is that this is in fact reasoning you want to use when building a self-improving AI: Yes, you can reason in more detail about how the AI you are building works, but this AI_1 will build an AI_2 and so on, and when proving that it's ok to build AI_1 you don't want to reason in detail about how AI_1,000,000 is going to work (which is built using design principles you don't understand, by AI_999,999 which is much smarter than you); rather, you want to use general principles to reason that because of the way AI_1,000,000 came to be, it has to be safe (because AI_999,999 only builds safe AIs, because it was built by AI_999,998 which only builds safe AIs...). But it's not just you who needs to reason like that because you don't know and aren't smart enough to comprehend AI_1,000,000's exact design; AI_1, which also isn't that smart, will need to be able to use the same sort of reasoning. Hence, the interest in the Löbian obstacle.
There are caveats to add to this and parts of your comment I haven't replied to, but I'm running into the same problem as you with your original comment in this thread, having already spent too much time on this. I'd be happy to elaborate if useful. For my part, I'd be interested in your reply to the other part of my comment: do you think I have localized our disagreement correctly?
Oh, one last point that I shouldn't skip over: I assume the point about MIRI lacking "an understanding of the surrounding literature" refers to the thing about being tripped up at the July workshop by not knowing Gaifman's work on logical uncertainty well enough. If so, I agree that that was an avoidable fail, but I don't think it's indicative of always ignoring the relevant literature or something like that. I'll also admit that I still haven't engaged more deeply with Gaifman's work myself, but that's because I'm not currently focusing on logical uncertainty, and I intend to do so in the future.
Jacob, have you seen Luke's interview with me, where I've tried to reply to some arguments of the sort you've given in this thread and elsewhere?
I don't think [the fact that humans' predictions about themselves and each other often fail] is sufficient to dismiss my example. Whether or not we prove things, we certainly have some way of reasoning at least somewhat reliably about how we and others will behave. It seems important to ask why we expect AI to be fundamentally different; I don't think that drawing a distinction between heuristics and logical proofs is sufficient to do so, since many of the logical obstacles carry over to the heuristic case, and to the extent they don't this seems important and worth grappling with.
Perhaps here is a way to get a handle on where we disagree: Suppose we make a whole-brain emulation of Jacob Steinhardt, and you start modifying yourself in an attempt to achieve superintelligence while preserving your values, so that you can save the world. You try to go through billions of (mostly small) changes. In this process, you use careful but imperfect human (well, eventually transhuman) reasoning to figure out which changes are sufficiently safe to make. My expectation is that one of two things happens: Either you fail, ending up with very different values than you started with or stopping functioning completely; or you think very hard about how much confidence you need to have in each self-modification, and how much confidence you can achieve by ordinary human reasoning, and end up not doing a billion of these because you can't achieve the necessary level of confidence. The only way I know for a human to reach the necessary level of confidence in the majority of the self-modifications would be to use formally verified proofs.
Presumably you disagree. If you could make a convincing case that a whole-brain emulation could safely go through many self-modifications using ordinary human reasoning, that would certainly change my position in the direction that the Löbian obstacle and other diagonalization issues won't be that important in practice. If you can't convince me that it's probably possible and I can't convince you that it probably isn't, this might still help understanding where the disagreement is coming from.
Also note that, even if you did think it was sufficient, I gave you another example that was based purely in the realm of formal logic.
I thought the example was pretty terrible. Everybody with more than a passing familiarity with the halting problem, and more generally Rice's theorem, understands that the result that you can't decide for every program whether it's in a given class doesn't imply that there are no useful classes of programs for which you can do so. MIRI's argument for the importance of Löb's theorem is: There's an obvious way you can try to get stable self-modification, which is to require that if the AI self-modifies, it has to prove that the successor will not destroy the world. But if the AI tries to argue "doing the following trivial self-modification is safe because the modified me will only do things that it proves won't destroy the world, thus it won't destroy the world", that requires the AI to understand the soundness of its own proof system, which is impossible by Löb's theorem. This seems to me like a straight-up application of what Löb's theorem actually says, rather than the kind of half-informed misunderstanding that would suggest that program analysis is impossible because of Rice's theorem.
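For reference, the form of Löb's theorem doing the work here: for any theory $T$ extending PA and any sentence $A$,

$$T \vdash \mathrm{Prov}_T(\ulcorner A \urcorner) \rightarrow A \quad\Longrightarrow\quad T \vdash A.$$

So a consistent such $T$ proves the soundness instance $\mathrm{Prov}_T(\ulcorner A \urcorner) \rightarrow A$ only for sentences $A$ it already proves, and in particular can't prove it across the board, which is exactly what the "the modified me only does things it proves won't destroy the world, therefore it won't destroy the world" step would need.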
Things that result in fewer resources going into AI specifically would result in fewer UFAI resources without reducing overall economic growth, but it needs to be kept in mind that some such research occurs in financial firms pushing trading algorithms, and a lot more in Google, not just in places like universities.
To the extent that industry researchers publish less than academia (this seems particularly likely in financial firms, and to a lesser degree at Google), a hypothetical complete shutdown of academic AI research should reduce uFAI's parallelization advantage by 2+ orders of magnitude, though (presumably, the largest industrial uFAI teams are much smaller than the entire academic AI research community). It seems that reducing academic funding for AI only somewhat should translate pretty well into less parallel uFAI development as well.
I'd definitely be interested to talk more about many of these, especially anthropics and reduced impact / Oracle AI, and potentially collaborate. Lots of topics for future Oxford visits! :-)
Hope you'll get interest from others as well.
Sorry for the long-delayed reply, Wei!
So you think that humans do not have a built-in solution to the Löbstacle, and you must also think we are capable of building an FAI that does have a built-in solution to the Löbstacle. That means an intelligence without a solution to the Löbstacle can produce another intelligence that shares its values and does have a solution to the Löbstacle.
Yup.
But then why is it necessary for us to solve this problem? [...] Why can't we instead built an FAI without solving this problem, and depend on the FAI to solve the problem while it's designing the next generation FAI?
I have two main reasons in mind. First, if you are willing to grant that (a) this is a problem that would require humans years of serial research to solve and (b) that it looks much easier to build this into an AI designed from scratch rather than bolting it on to an existing AI design that was created without taking these considerations into account, but you still think that (c) it would be a good plan to have the first-generation FAI solve this problem when building the next-generation FAI, then it seems that you need to assume that the FAI will be much better at AGI design than its human designers before it executes its first self-rewrite, since the human team would by assumption still need years to solve the problem at that point and the plan wouldn't be particularly helpful if the first-generation FAI would need a similar amount of time or longer. But it seems unlikely to me that we first need to build ultraintelligent machines a la I.J. Good, far surpassing humans, before we can get an intelligence explosion: it seems to me that most of the probability mass should be in the required level of AGI research ability being <= the level of the human research team working on the AGI. I admit that one possible strategy could be to continue having humans improve the initial FAI until it is superintelligent and then ask it to write a successor from scratch, solving the Löbstacle in the process, but it doesn't seem particularly likely that this is cheaper than solving the problem beforehand.
Second, if we followed this plan, when building the initial FAI we would be unable to use mathematical logic (or other tools sufficiently similar to be subject to the same issues) in a straight-forward way when having it reason about its potential successor. This cuts off a large part of design-space that I'd naturally be looking to. Yes, if we can do it then it's possible in principle to get an FAI to do it, but mimicking human reasoning doesn't seem likely to me to be the easiest way to build a safe AGI.
Have you been following the discussions under my most recent post?
I agree with you that relying on an FAI team to solve a large number of philosophical problems correctly seems dangerous, although I'm sympathetic to Eliezer's criticism of your outside-view arguments -- I essentially agree with your conclusions, but I think I use more inside-view reasoning to arrive at them (would need to think longer to tease this apart). I agree with Paul that something like CEV for philosophy in addition to values should probably be part of an FAI design. I agree with you that progress in metaphilosophy would be very valuable, but I do not have any concrete leads to follow. But I think that having good solutions to some of these problems is not unlikely to be helpful for FAI design (and more helpful to FAI than uFAI), so I still think that some amount of work allocated to these philosophical problems looks like a good thing; and I also think that working on these problems does on average reduce the probability of making a bad mistake even if we manage to have the FAI do philosophy itself and have it checked by "coherent extrapolated philosophy".
You quoted my earlier comment that I think that making object-level progress is important enough that it seems a net positive despite making AGI research more interesting, but I don't really feel that your post or the discussion below that contains much in the way of arguments about that -- could you elaborate on the connection?
Drats. But also, yay, information! Thanks for trying this!
ETA: Worth noting that I found that post useful, though.
Glad to hear that & looking forward to seeing how it works! I very much understand that one might be concerned about posting "quick and dirty" thoughts (I find it so very difficult to lower my own standards even when it's obviously blocking me from getting stuff done), but there seems to be little cost of trying it with a Discussion post and seeing how it goes -- yay value of information! :-)
For future readers: The discussion has continued here.
Note that you're wrongly discouraging people from doing strategy research by saying that they need to catch up to insiders' unpublished knowledge when they really don't.
What makes you say that? I believe you can reinvent much of what Eliezer and Carl and Bostrom and a few others already know but haven't written down. Not sure that's true for almost everyone else.
I read the idea as being that people rediscovering and writing up stuff that goes 5% towards what E/C/N have already figured out but haven't written down would be a net positive and it's a bad idea to discourage this. It seems like there's something to that, to the degree that getting the existing stuff written up isn't an available option -- increasing the level of publicly available strategic research could be useful even if the vast majority of it doesn't advance the state of the art, if it leads to many more people vetting it in the long run. I do think there is probably a tradeoff, where Eliezer &c might not be motivated to comment on other people's posts all that much, making it difficult to see what is the current state of the art and what are ideas that the poster just hasn't figured out the straight-forward counter-arguments to. I don't know how to deal with that, but encouraging discussion that is high quality compared to currently publicly available strategy work still seems quite likely to be a net positive?
You should frequently change your passwords, use strong passwords, and not use the same password for multiple services. It's not easy to live up to this in practice, but there are approximations that are much easier:
Using a password manager is better than using the same password for lots of services (only one point of failure where all your passwords get compromised, rather than every such service being a point of failure). Clipperz is a web service that does the encryption on your computer (so your passwords never get sent to the server), and can be installed locally. Alternatively, you can use a local application if you're not worried about ever needing your passwords when you don't have access to your computer. I currently try to get by with (a login password) + (passwords for particularly important online services like online banking) + (a password manager password).
If you balk at the inconvenience of regularly memorizing randomly-generated passwords, it's better than nothing to come up with memorable phrases and take the first letter of each word to form your password. (Non-boring bonus advice: You can use phrases that remind you of something you want to do each time you log in to your computer, like looking at your todo list. [ETA: Never mind. I've now tried this twice, and both times entering the password became automatic far too quickly, so it almost immediately stopped serving as a useful reminder.])
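A minimal sketch of the first-letters trick, just to illustrate it (the example phrase is made up, and obviously you shouldn't use a published phrase for a real password):

```python
def phrase_to_password(phrase: str) -> str:
    """Take the first character of each whitespace-separated word."""
    return "".join(word[0] for word in phrase.split())

# Example:
print(phrase_to_password("Look at your todo list every day, 7 days a week!"))
# -> "Laytled7daw"
```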
As a pedestrian or cyclist, you're not all that easy to see from a car at night, worse if you don't wear white. High-visibility vests (that thing that construction workers wear, yellow or orange with reflective stripes) fix the problem and cost around $7-$8 from Amazon including shipping, or £3 in the UK.