Logical Counterfactuals and Proposition graphs, Part 3 2019-09-05T15:03:53.262Z · score: 6 (2 votes)
Logical Counterfactuals and Proposition graphs, Part 2 2019-08-31T20:58:12.851Z · score: 15 (4 votes)
Logical Optimizers 2019-08-22T23:54:35.773Z · score: 12 (9 votes)
Logical Counterfactuals and Proposition graphs, Part 1 2019-08-22T22:06:01.764Z · score: 23 (8 votes)
Programming Languages For AI 2019-05-11T17:50:22.899Z · score: 3 (2 votes)
Propositional Logic, Syntactic Implication 2019-02-10T18:12:16.748Z · score: 6 (5 votes)
Probability space has 2 metrics 2019-02-10T00:28:34.859Z · score: 90 (38 votes)
Allowing a formal proof system to self improve while avoiding Lobian obstacles. 2019-01-23T23:04:43.524Z · score: 6 (3 votes)
Logical inductors in multistable situations. 2019-01-03T23:56:54.671Z · score: 8 (5 votes)
Boltzmann Brains, Simulations and self refuting hypothesis 2018-11-26T19:09:42.641Z · score: 0 (2 votes)
Quantum Mechanics, Nothing to do with Consciousness 2018-11-26T18:59:19.220Z · score: 13 (13 votes)
Clickbait might not be destroying our general Intelligence 2018-11-19T00:13:12.674Z · score: 26 (10 votes)
Stop buttons and causal graphs 2018-10-08T18:28:01.254Z · score: 6 (4 votes)
The potential exploitability of infinite options 2018-05-18T18:25:39.244Z · score: 3 (4 votes)


Comment by donald-hobson on Attainable Utility Theory: Why Things Matter · 2019-09-29T11:14:25.150Z · score: 1 (1 votes) · LW · GW

An alien planet contains joy and suffering in a ratio that makes them exactly cancel out according to your morality. You are exactly indifferent to the alien planet blowing up. The alien planet can't be changed by your actions, so you don't need to cancel plans to go there and reduce the suffering when you find out that the planet blew up. Say the aliens existed long ago. In general, we are setting up the situation so that the planet blowing up doesn't change your expected utility, or the best action for you to take. We set this up by a pile of contrivances. It still feels impactful.

Comment by donald-hobson on Attainable Utility Theory: Why Things Matter · 2019-09-28T09:29:06.673Z · score: 1 (1 votes) · LW · GW

Imagine a planet with aliens living on it. Some of those aliens are having what we would consider morally valuable experiences. Some are suffering a lot. Suppose we now find that their planet has been vaporized. By tuning the relative amounts of joy and suffering, we can make the vaporization exactly neutral under our morality. This feels like a big deal, even if the aliens were in an alternate reality that we could watch but not affect.

Our intuitive feeling of impact is a proxy for how much something affects our values and our ability to achieve them. You can set up contrived situations where an event doesn't actually affect our ability to achieve our values, but still triggers the proxy.

Could the technical definition that you are looking for be value of information? Feeling something to be impactful would then mean that a bunch of mental heuristics think it has a large value of info.
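A minimal sketch of what that definition could look like, with a made-up decision problem and invented utilities:

```python
# Value of information for a toy binary decision. All numbers are invented.

def expected_value(p_good, act):
    # Utility 10 if we act and the world is good, -5 if we act and it's bad,
    # 0 if we don't act at all.
    return p_good * 10 + (1 - p_good) * (-5) if act else 0.0

p = 0.3  # prior probability that acting pays off
ev_without_info = max(expected_value(p, True), expected_value(p, False))

# A perfect signal tells us which world we are in before we choose.
ev_with_info = p * max(10, 0) + (1 - p) * max(-5, 0)

value_of_info = ev_with_info - ev_without_info
print(value_of_info)  # 3.0 with these numbers
```

An event would feel impactful, on this proxy, when the heuristics estimate a large gap like this one.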

Comment by donald-hobson on False Dilemmas w/ exercises · 2019-09-18T13:59:52.744Z · score: 6 (2 votes) · LW · GW

"Either my curtains are red, or they are blue" would be a false dilemma that doesn't fit any category. You can make a false dilemma out of any pair of non-mutually-exclusive predicates; there is no need for them to refer to values or actions.

Comment by donald-hobson on Effective Altruism and Everyday Decisions · 2019-09-17T23:32:41.934Z · score: 3 (2 votes) · LW · GW

If we stop doing something that almost all first-world humans are doing (say 1 billion people), then our impact will be about a billionth of the size of the problem. Given the size of impact that an effective altruist can hope to have, this tells us why inactions don't have super high utilities in comparison. If there were 100,000 effective altruists (probably an overestimate), all of them refraining from doing X would make the problem 0.01% better. Both how hard it is to refrain and the impact if you manage it depend on the problem size: all pollution vs plastic straws. Assume that this change took only 0.01% of each effective altruist's time (10 seconds per day, 4 of which you are asleep for). Clearly the change has to be something as small as avoiding plastic straws, or smaller. Assume linearity in work and reward (the normal assumption being diminishing returns). This makes the payoff equivalent to all effective altruists working full time on the problem and solving it.

Technically, you need to evaluate the marginal value of one more effective altruist. If it was vitally important that someone worked on AI, but you have far more people than you need for that, and the rest are twiddling their thumbs, get them reusing straws. (Actually, get them looking for other cause areas; reusing straws only makes sense if you are confident that no other priority causes exist.)

Suppose Omega came to you and said that if you started a compostable straw business, there was a 0.001% chance of success, by which Omega means solving the problem without any externalities (the straws are the same price, just as easy to use, don't taste funny etc.). Otherwise, the business will waste all your time and do nothing.

If this doesn't seem like a promising opportunity for effective altruism, don't bother with reusable straws either. In general, the chance of success is 1 / (number of people using plastic straws × proportion of time wasted avoiding them).
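The arithmetic above can be checked directly; the numbers are the rough illustrative ones from this comment, not measured figures:

```python
# Illustrative numbers: ~1 billion straw users, ~100,000 effective altruists,
# ~0.01% of one's time spent avoiding straws.
users = 1_000_000_000
eas = 100_000
time_fraction = 0.0001  # 0.01%

# If all EAs refrain, the problem shrinks by roughly eas / users.
problem_reduction = eas / users
print(problem_reduction)  # 0.0001, i.e. 0.01% better

# Break-even chance of success for one person trying to solve the whole
# problem with the same time budget: 1 / (users * time_fraction).
breakeven = 1 / (users * time_fraction)
print(breakeven)  # 1e-05, i.e. the 0.001% chance in the Omega story
```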

Comment by donald-hobson on Proving Too Much (w/ exercises) · 2019-09-15T21:13:58.059Z · score: 6 (2 votes) · LW · GW

Fair enough, I think that satisfies my critique.

A full consideration of proving too much requires that we have uncertainty both over what arguments are valid, and over the real world. The uncertainty about what arguments are valid, along with our inability to consider all possible arguments makes this type of reasoning work. If you see a particular type of argument in favor of conclusion X, and you disagree with conclusion X, then that gives you evidence against that type of argument.

This is used in moral arguments too. Consider the argument that touching someone really gently isn't wrong. And if it isn't wrong to touch someone with force F, then it isn't wrong to touch them with force F+0.001 Newtons. Therefore, by induction, it isn't wrong to punch people as hard as you like.

Now consider the argument that 1 grain of sand isn't a heap. If you put a grain of sand down somewhere that there isn't already a heap of sand, you don't get a heap. Therefore by induction, no amount of sand is a heap.

If you were unsure about the morality of punching people, but knew that heaps of sand existed, then seeing the first argument would make you update towards "punching people is ok". When you then see the second argument, you update to "inductive arguments don't work in the real world" and reverse the previous update about punching people.
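As a toy model of that double update, with invented probabilities:

```python
# H = "this type of inductive argument is valid". We are confident heaps of
# sand exist, so a valid-looking inductive argument concluding that heaps
# don't exist is evidence against H. All numbers are made up.

prior_h = 0.5
p_false_conclusion_given_h = 0.02    # valid types rarely prove falsehoods
p_false_conclusion_given_not_h = 0.5 # invalid types prove them freely

posterior_h = (prior_h * p_false_conclusion_given_h) / (
    prior_h * p_false_conclusion_given_h
    + (1 - prior_h) * p_false_conclusion_given_not_h)
print(round(posterior_h, 3))  # 0.038: the heap argument discredits the type
```

Once the argument type is discredited, the punching argument built on it loses its force as well.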

Seeing an argument for a conclusion that you don't believe can make you reduce your credence on other statements supported by similar arguments.

Comment by donald-hobson on Fictional Evidence vs. Fictional Insight · 2019-09-15T12:44:32.212Z · score: 1 (1 votes) · LW · GW

I disagree with insight 5. Even if you uploaded the world's best computer security experts, gave them 1000 years to redesign every piece of software and hardware, then threw all existing computers in the trash and rolled out their design, it wouldn't stop an ASI. Even with lots of paranoia, a system designed with the goal of making things as hard for an ASI as possible, not trading off any security for usability, compatibility or performance (while still being significantly more useful than having no computers), wouldn't be enough.

If you took an advanced future computer containing an ASI back in time to the 1940s, before there were any other computers at all, it would still be able to take over the world. There are enough people that can be persuaded, and enough inventions that any fool can put together out of household materials. At worst, it would find its way to a top secret government bunker, where it would spend its time breaking Enigma and designing superweapons, until it could build enough robots and compute. The government scientists just know that the last 10 times they followed its instructions, they ended up with brilliant weapons, and the AI has fed them some story about being sent back in time to help them win the war.

Hacking through the internet might be the path of least resistance for an ASI, but other routes to power exist.

Comment by donald-hobson on Proving Too Much (w/ exercises) · 2019-09-15T12:18:25.812Z · score: 3 (3 votes) · LW · GW

If everyone just keeps the beliefs they were raised with as pre-priors, then everyone says: you had the misfortune of being born wrong, I'm lucky enough to be born right. If you were transported to an alternate reality, where half the population thought 2+2=4, and half thought 2+2=7, would you become uncertain, or would you just think that the 2+2=7 population were wrong?

The argument about believing in Cthulhu because that was how you were raised proving too much, itself proves too much.

Regarding example 4: believing something because a really smart person believes it is not a bad heuristic, as long as you aren't cherry-picking the really smart person. If you have data about many smart people, taking the average is an even better heuristic, as is focusing on the smart people who are experts in some relevant field. The usefulness of this heuristic also depends on your definition of 'smart'. There are a few people with a high IQ, a powerful brain capable of thinking and remembering well, who nevertheless have very poor epistemology and are creationists or Scientologists. Many definitions of smart would rule out these people by requiring rationality skills of some sort, which makes the smart-people heuristic even better.

Comment by donald-hobson on Is my result wrong? Maths vs intuition vs evolution in learning human preferences · 2019-09-11T22:21:33.264Z · score: 3 (2 votes) · LW · GW

The problem with the maths is that it does not correlate 'values' with any real world observable. You give all objects a property, you say that that property is distributed by simplicity priors. You have not yet specified how these 'values' things relate to any real world phenomenon in any way. Under this model, you could never see any evidence that humans don't 'value' maximizing paperclips.

To solve this, we need to understand what values are. The values of a human are much like the filenames on a hard disk. If you run a quantum field theory simulation, you don't have to think about either; you can make your predictions directly. If you want to make approximate predictions about how a human will behave, you can think in terms of values and get somewhat useful predictions. If you want to predict approximately how a computer system will behave, instead of simulating every transistor, you can think in terms of folders and files.

I can substitute words in the 'proof' that humans don't have values, and get a proof that computers don't have files. It works the same way: you turn your uncertainty about the relation between the exact and the approximate into a confidence that the two are uncorrelated. Making a somewhat naive and not formally specified assumption along the lines of "the real action taken optimizes human values better than most possible actions" will get you a meaningful but not perfect definition of 'values'. You still need to say exactly what a "possible action" is.

Making a somewhat naive and not formally specified assumption along the lines of, "the files are what you see when you click on the file viewer" will get you a meaningful but not perfect definition of 'files'. You still need to say exactly what a "click" is. And how you translate a pattern of photons into a 'file'.

We see that if you were running a quantum simulation of the universe, then getting values out of a virtual human is the same type of problem as getting files off a virtual computer.

Comment by donald-hobson on Hackable Rewards as a Safety Valve? · 2019-09-10T23:06:22.869Z · score: 3 (2 votes) · LW · GW

Intuition dump. Safety through this mechanism seems to me like aiming a rocket at Mars, and accidentally hitting the moon. There might well be a region where P(doom|superintelligence) is lower, but lower in the sense that 90% is lower than 99.9%. Suppose we have the clearest, most stereotypical case of wireheading as I understand it: a mesa optimizer with a detailed model of its own workings and the terminal goal of maximizing the flow of current in some wire. (During training, current in the reward signal wire reliably correlated with reward.)

The plan that maximizes this flow long term is to take over the universe and store all the energy, to gradually turn into electricity. If the agent has access to its own internals before it really understands that it is an optimizer, or is thinking about the long term future of the universe, it might manage to brick itself. If the agent is sufficiently myopic, is not considering time travel or acausal trade, and has the choice of running high voltage current through itself now, or slowly taking over the world, it might choose the former.

Note that both of these look like hitting a fairly small target in design and lab environment space. The mesa optimizer might have some other terminal goal. Suppose the prototype AI has the opportunity to write arbitrary code to an external computer, and an understanding of AI design, before it has self modification access. The AI creates a subagent that cares about the amount of current in a wire in the first AI; the subagent can optimize this without destroying itself. Even if the first agent then bricks itself, we have an AI that will dig the fried circuit boards out of the trashcan, and throw all the cosmic commons into protecting and powering them.

In conclusion, this is not a safe part of agentspace, just a part that's slightly less guaranteed to kill you. I would say it was of little to no strategic importance. Especially if you think that all humans making AI will be reasonably on the same page regarding safety, scenarios where AI alignment is nearly solved and the first people to ASI barely know the field exists are unlikely. If the first ASI self destructs for reasons like this, we have all the pieces for making superintelligence, and people with no sense of safety are trying to make one. I would expect another attempt a few weeks later to doom us. (Unless the first AI bricked itself in a sufficiently spectacular manner, like hacking into nukes to create a giant EMP in its circuits. That might get enough people seeing danger to have everyone stop.)

Comment by donald-hobson on AI Alignment Writing Day Roundup #1 · 2019-09-07T22:27:09.626Z · score: 3 (2 votes) · LW · GW

The proposition graph is non-standard as far as I know. The syntax tree is kind of standard, but a bit unusual. You might want to use them if I show how to use them for logical counterfactuals (which I haven't finished yet).

Comment by donald-hobson on AI Safety "Success Stories" · 2019-09-07T22:21:05.858Z · score: 2 (4 votes) · LW · GW

I think that our research is at a sufficiently early stage that most technical work could contribute to most success stories. We are still mostly understanding the rules of the game and building the building blocks. I would say that we work on AI safety in general until we find anything that can be used at all. (There is some current work on things like satisficers that seem less relevant to sovereigns. I am not discouraging working on areas that seem more likely to help some success stories, just saying that those areas seem rare.)

Comment by donald-hobson on How to Throw Away Information · 2019-09-06T20:00:38.683Z · score: 5 (3 votes) · LW · GW

The bound isn't always achievable if you don't know Y. Suppose (X, Y) is uniform over the 6 possible outcomes with X, Y in {1, 2, 3} and X ≠ Y. You find out that X = 1 (without loss of generality). You must construct a function S(X) such that X can be reconstructed from S and Y. Because as far as you know, Y could be 2 or 3, you have to be able to construct X from S and either value of Y. But then we just take the most common output of S, and that tells us X, so S hasn't actually thrown the information away.

(If you had some other procedure for calculating X from S and Y, then you can make a new function that does that, and call it S, so an arbitrary function from X to Y+{err} is the most general form of S.)

Comment by donald-hobson on Reversible changes: consider a bucket of water · 2019-08-29T12:24:16.732Z · score: 1 (1 votes) · LW · GW

No, what I am saying is that humans judge things to be more different when the difference will have important real world consequences in the future. Consider two cases, one where the water will be tipped into the pool later, and the other where the water will be tipped into a nuclear reactor, which will explode if the salt isn't quite right.

There need not be any difference in the bucket or water whatsoever. While the current bucket states look the same, there is a noticeable macro-state difference between the nuclear reactor exploding and not exploding, in a way that there isn't a macrostate difference between marginally different eddy currents in the pool. I was specifying a weird info-theoretic definition of significance that made this work, but just saying that the more energy is involved, the more significant it is, works too. Nowhere are we referring to human judgement; we are referring to hypothetical future consequences.

Actually, the rule "your action and its reversal should not make a difference worth tracking in the world model" would work OK here (assuming sensible value of info). The rule that it shouldn't knowably affect large amounts of energy is good too. So for example it can shuffle an already well shuffled pack of cards, even if the order of those cards will have some huge effect. It can act freely without worrying about weather chaos effects, the chance of it causing a hurricane counterbalanced by the chance of stopping one. But if it figures out how to twitch its elbow in just the right way to cause a hurricane, it can't do that. This robot won't tip the nuclear bucket, for much the same reason. It also can't make a nanobot that would grey goo Earth, or hack into nukes to explode them. All these actions affect a large amount of energy in a predictable direction.

Comment by donald-hobson on What Programming Language Characteristics Would Allow Provably Safe AI? · 2019-08-28T23:28:03.418Z · score: 0 (3 votes) · LW · GW

Basically all formal proofs assume that the hardware is perfect. Rowhammer shows that it isn't.

Comment by donald-hobson on Reversible changes: consider a bucket of water · 2019-08-28T21:00:10.653Z · score: 3 (2 votes) · LW · GW

In the case of the industrial process, you could consider the action less reversible because while the difference in the water is small, the difference in what happens after that is larger (i.e. the industrial part working or failing). This means that at some point within the knock-on effects of tipping over the carefully salt-balanced bucket, there needs to be an effect that counts as "significant". However, there must not be an effect that counts as significant in the case where it's a normal swimming pool, and someone will throw the bucket into the pool soon anyway. Let's suppose water with a slightly different salt content will make a nuclear reactor blow up. (And no humans will notice the robot tipping out and refilling the bucket, so the counterfactual on the robot's behavior actually contains an explosion.)

Suppose you shake a box of sand. With almost no information, the best you can do to describe the situation is state the mass of sand, shaking speed and a few other average quantities. With a moderate amount of info, you can track the position and speed of each sand grain, with lots of info, you can track each atom.

There is a sense in which average mass and velocity of the sand, or even position of every grain, is a better measure than md5 hash of position of atom 12345. It confines the probability distribution for the near future to a small, non convoluted section of nearby configuration space.

Suppose we have a perfectly Newtonian solar system, containing a few massive objects and many small ones.

We start our description at time 0. If we say how much energy is in the system, then this defines a non-convoluted subset of configuration space, and the subset stays just as non-convoluted under time evolution. Thus total energy is a perfect descriptive measure. If we state the position and velocity of the few massive objects, and the averages for any large dust clouds, then we can approximately track our info forward in time, for a while. Liouville's theorem says that configuration space volume is conserved, and ours is; however, our configuration space volume is slowly growing more messy and convoluted. This makes the descriptive measure good but not perfect. If we have several almost-disconnected systems, the energy in each one would also be a good descriptive measure. If we store the velocity of a bunch of random dust specks, we have much less predictive capability: the subset of configuration space soon becomes warped and twisted until its convex hull, or epsilon-ball expansion, covers most of the space. This makes velocities of random dust specks a worse descriptive measure. Suppose we take the md5 hash of every object's position, rounded to the nearest nanometer, in some coordinate system and concatenated together. This forms a very bad descriptive measure: after only a second of time evolution, it picks out a subset of configuration space that can only be defined in terms of what it was a second ago, or by a giant arbitrary list.
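A toy illustration of the difference between a good and a bad descriptive measure (this is a random jiggle of points, not the Newtonian system above, and the parameters are arbitrary):

```python
import hashlib
import random

# The mean position is a good descriptive measure: it stays predictive after
# a small time step. An md5 hash of all positions is a bad one: any tiny
# change scrambles it completely.
random.seed(0)
positions = [random.uniform(0, 1) for _ in range(1000)]

def mean(values):
    return sum(values) / len(values)

def md5_descriptor(values):
    data = ",".join(str(round(v, 9)) for v in values).encode()
    return hashlib.md5(data).hexdigest()

mean_before = mean(positions)
hash_before = md5_descriptor(positions)

# One tiny step of time evolution: jiggle every position slightly.
positions = [p + random.gauss(0, 1e-6) for p in positions]

print(abs(mean(positions) - mean_before) < 1e-6)  # True: still predictive
print(md5_descriptor(positions) == hash_before)   # False: no info carried
```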

Suppose the robot has one hour to do its action, and then an hour to reverse it. We measure how well the robot reversed its original action by looking at all good descriptive measures, and seeing the difference in these descriptive measures from what they would have been had the robot done nothing.

We can then call an action reversible if there would exist an action that would reverse it.

Note that the crystal structure of a rock now tells us its crystal structure next year, so can be part of a quite good measure. However the phase of the moon will tell you everything from the energy production of tidal power stations to the migration pattern of moths. If you want to make good (non convoluted) predictions about these things, you can't miss it out. Thus almost all good descriptive measures will contain this important fact.

A reversible action is any action taken in the first hour such that there exists an action that approximately reverses it that the robot could take in the second hour. (The robot need not actually take the reverse action, maybe a human could press a reverse button.)

Functionality of nuclear power stations, and level of radiation in atmosphere are also contained in many good descriptive measures. Hence the robot should tip the bucket if it won't blow up a reactor.

(This is a rough sketch of the algorithm with missing details, but it does seem to have broken the problem down into non value laden parts. I would be unsurprised to find out that there is something in the space of techniques pointed to that works, but also unsurprised to find that none do.)

Comment by donald-hobson on Torture and Dust Specks and Joy--Oh my! or: Non-Archimedean Utility Functions as Pseudograded Vector Spaces · 2019-08-26T00:45:13.814Z · score: 1 (1 votes) · LW · GW

What if I make each time period in the "..." one nanosecond shorter than the previous?

You must believe that there is some length of time, t>most of a day, such that everyone in the world being tortured for t-1 nanosecond is better than one person being tortured for t.

Suppose there was a strong clustering effect in human psychology, such that less than a week of torture left people's minds in one state, and more than a week left them broken. I would still expect the possibility of some intermediate cases on the borderlines. For something as messy as human psychology, I would expect there not to be a perfectly sharp black-and-white cutoff. If we zoom in enough, we find that the space of possible quantum wavefunctions is continuous.

There is a sense in which specks and torture feel incomparable, but I don't think this is your sense of incomparability; to me it feels like moral uncertainty about which huge number of specks to pick. I would also say that "don't torture anyone" and "don't commit atrocities based on convoluted arguments" are good ethical injunctions. If you think that your own reasoning processes are not very reliable, and that philosophical thought experiments rarely happen in real life, then implementing the general rule "if I think I should torture someone, go to the nearest psych ward" is a good idea. However, I would want a perfectly rational AI which never made mistakes to choose torture.

Comment by donald-hobson on Troll Bridge · 2019-08-24T07:03:11.356Z · score: 4 (3 votes) · LW · GW

Viewed from the outside, in the logical counterfactual where the agent crosses, PA can prove its own consistency, and so is inconsistent. There is a model of PA in which "PA proves False". Having counterfactualed away all the other models, these are the ones left. Logical counterfactualing on any statement that can't be proved or disproved by a theory should produce the same result as adding it as an axiom. Ie logical counterfactualing ZF on choice should produce ZFC.

The only unreasonableness here comes from the agent's worst case optimizing behaviour. This agent is excessively cautious. A logical induction agent, with PA as a deductive process will assign some prob P strictly between 0 and 1 to "PA is consistent". Depending on which version of logical induction you run, and how much you want to cross the bridge, crossing might be worth it. (the troll is still blowing up the bridge iff PA proves False)
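A hedged sketch of that decision, with invented utilities (the real logical induction machinery is not modeled here, just the final expected utility comparison):

```python
# The agent crosses iff expected utility is positive. p is the credence that
# PA is consistent, i.e. that the troll doesn't blow up the bridge.

def ev_cross(p, u_cross=1.0, u_blown_up=-10.0):
    return p * u_cross + (1 - p) * u_blown_up

for p in (0.5, 0.9, 0.95):
    print(p, ev_cross(p), "cross" if ev_cross(p) > 0 else "stay")
```

With these numbers the agent only crosses once its credence in consistency exceeds 10/11, so how much it wants to cross and how much it fears the explosion decide the matter.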

A logical counterfactual where you don't cross the bridge is basically a counterfactual world where your design of logical induction assigns lower prob to "PA is consistent". In this world it doesn't cross and gets zero.

The alternative is a logical factual where it expects +ve util.

So if we make logical induction like crossing enough, and not mind getting blown up much, it crosses the bridge. Let's reverse this: suppose the agent really doesn't want to be blown up.

In the counterfactual world where it crosses, logical induction assigns more probability to "PA is consistent". The expected utility procedure has to use its real probability distribution, not ask the counterfactual agent for its expected utility.

I am not sure what happens after this, I think you still need to think about what you do in impossible worlds. Still working it out.

Comment by donald-hobson on Logical Optimizers · 2019-08-24T04:42:04.311Z · score: 2 (2 votes) · LW · GW

If the prior is full of malign agents, then you are selecting your new logical optimizer based on its ability to correctly answer arbitrary questions (in a certain format) about malign agents. This doesn't seem to be that problematic. If the set of programs being logically optimized over is malign, then you have trouble.

Comment by donald-hobson on Torture and Dust Specks and Joy--Oh my! or: Non-Archimedean Utility Functions as Pseudograded Vector Spaces · 2019-08-24T04:29:00.694Z · score: 3 (2 votes) · LW · GW

The idea is that we can take a finite list of items like this:

Torture for 50 years

Torture for 40 years

...

Torture for 1 day

...

Broken arm

Broken toe

...

Dust Speck

Presented with such a list, you must insist that two items on this list are incomparable. In fact you must claim that some item is incomparably worse than the next item. I don't think that any number of broken toes is always better than a broken arm: a million broken toes is clearly worse. Follow this chain of reasoning for each pair of items on the list. Claiming incomparability is a claim that no matter how much I try to subdivide my list, one item will still be infinitely worse than the next.

The idea of bouncing back is also not useful. Firstly it isn't a sharp boundary, you can mostly recover but still be somewhat scarred. Secondly you can replace an injury with something that takes twice as long to bounce back from, and they still seem comparable. Something that takes most of a lifetime to bounce back from is comparable to something that you don't bounce back from. This breaks if you assume immortality, or that bouncing back 5 seconds before you drop dead is of morally overwhelming significance, such that doing so is incomparable to not doing so.

Comment by donald-hobson on Torture and Dust Specks and Joy--Oh my! or: Non-Archimedean Utility Functions as Pseudograded Vector Spaces · 2019-08-23T18:07:07.127Z · score: 5 (3 votes) · LW · GW

Firstly I will focus on the most wrong part: the claim that non-Archimedean utilities are more efficient. In the real world there aren't 3^^^3 little impacts to add up. If the number of little impacts is a few hundred, and they are a trillion times smaller, then the little impacts make up less than a billionth of your utility. Usually you should be using less than a billionth of your compute to deal with them; for agents without vast amounts of compute, this means forgetting them altogether. This can be understood as an approximation strategy for maximizing a normal Archimedean utility.
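The magnitudes here are easy to check (300 is an arbitrary stand-in for "a few hundred"):

```python
# A few hundred small impacts, each a trillion times smaller than the big
# ones, sum to well under a billionth of total utility.
n_small_impacts = 300
relative_size = 1e-12

small_utility_fraction = n_small_impacts * relative_size
print(small_utility_fraction < 1e-9)  # True
```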

There is also the question of different severity classes. If we can construct a sliding scale between specks and torture then we find the need for a weird cut off point, like a broken arm being in a different severity class than a broken toe.

Comment by donald-hobson on Could we solve this email mess if we all moved to paid emails? · 2019-08-14T00:01:36.218Z · score: 2 (3 votes) · LW · GW

If I am sending you an email, it could be because I have some info that I believe would benefit you and am honestly trying to be helpful in sending it. I am unlikely to do this if I have to pay you.

Comment by donald-hobson on Could we solve this email mess if we all moved to paid emails? · 2019-08-13T23:58:16.207Z · score: 1 (1 votes) · LW · GW

Having these norms would create scammers that try to look prestigious. If you only get paid when you reply to a message, lots of low value replies are going to be sent.

Comment by donald-hobson on Why do humans not have built-in neural i/o channels? · 2019-08-09T02:51:26.180Z · score: 5 (3 votes) · LW · GW

Direct neural IO has a large fitness moat. Once an animal has any kind of actuator that can modify the environment, and any kind of sensor that can detect info about the environment, then one animal's actions will sometimes modify what another animal senses, and hence how it behaves. Evolution can then get to work optimizing this. Many benefits can accrue, even if no other animal communicates. A crow pattering its feet to bring up worms has some understanding of other animals being things it can manipulate, and the tools to do it. (Humans are best at training other animals as well as at communicating; both need a theory of mind.)

Animals don't touch neurons together except in freak accidents, where any chance of survival is minimal. Until you have functional communication, banging neurons together is useless. Until you have a system that filters it out, saline exposure will spam nonsense. And once you have one form of communication, the pressure to develop a second is almost none.

Comment by donald-hobson on AI Alignment Open Thread August 2019 · 2019-08-05T11:41:25.884Z · score: 5 (3 votes) · LW · GW

You are handed a hypercomputer, and allowed to run any code you like on it. You can then take 1Tb of data from your computations and attach it to a normal computer. The hypercomputer is removed. You are then handed a magic human utility function. How do you make an FAI with these resources?

The normal computer is capable of running a highly efficient super-intelligence. The hypercomputer can do a brute force search for efficient algorithms. The idea is to split FAI into building a capability module, and a value module.

Comment by donald-hobson on AI Alignment Open Thread August 2019 · 2019-08-05T11:33:44.274Z · score: 3 (2 votes) · LW · GW

The problem with tests is that the AI behaving well when weak enough to be tested doesn't guarantee it will continue to do so.

If you are testing a system, that means you are not confident that it is safe. If it isn't safe, then your only hope is for humans to stop it. Testing an AI is very dangerous unless you are confident that it can't harm you.

A paperclip maximizer would try to pass your tests until it was powerful enough to trick its way out and take over. Black box testing of arbitrary AIs gets you very little safety.

Also, some people's intuitions say that a smile-maximizing AI is a good idea. If you have a straightforward argument that appeals to the intuitions of the average Joe Bloggs, and it can't be easily formalized, then I would take the difficulty of formalizing it as evidence that the argument is not sound.

If you take a neural network and train it to recognize smiling faces, then attach that to AIXI, you get a machine that will appear to work in the lab, when the best it can do is make the scientists smile into its camera. There will be an intuitive argument about how it wants to make people smile, and people smile when they are happy. The AI will tile the universe with cameras pointed at smiley faces as soon as it escapes the lab.

Comment by donald-hobson on Practical consequences of impossibility of value learning · 2019-08-04T22:03:13.765Z · score: 1 (1 votes) · LW · GW

I should have been clearer: the point isn't that you get correct values, the point is that you get out of the swath of null or meaningless values and into the merely wrong. While the values gained will be wrong, they would be significantly correlated with ours; it's the sort of AI that produces drugged-out brains in vats, or something else that's not what we want, but closer than paperclips. One measure of human effectiveness: order all possible actions by utility, and ask what percentile the actions we actually took fall into.

Once we get into this region, it becomes clear that the next task is to fine tune our model of the bounds on human rationality, or figure out how to get an AI to do it for us.
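That percentile measure could be sketched like this (a toy illustration; the `utility` function and the action lists are stand-ins, not anything from the original discussion):

```python
def action_percentile(actions_taken, all_actions, utility):
    # Rank every possible action by utility, then ask what percentile
    # the actions actually taken fall into, on average.
    ranked = sorted(all_actions, key=utility)
    n = len(ranked)
    percentiles = [100 * ranked.index(a) / (n - 1) for a in actions_taken]
    return sum(percentiles) / len(percentiles)

# Toy example: each action's utility is just its own value.
possible = list(range(101))
taken = [90, 95, 99]
print(action_percentile(taken, possible, lambda a: a))  # about 94.67
```

An agent scoring near 100 is close to optimal; random behavior would average around 50.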

Comment by donald-hobson on Practical consequences of impossibility of value learning · 2019-08-04T18:09:21.244Z · score: 1 (1 votes) · LW · GW

There are no-free-lunch theorems "proving" that intelligence is impossible: there is no algorithm that can optimize an arbitrary environment. Yet we display intelligence. The problem with the theorem comes from the part where you assume an arbitrary max-entropy environment, rather than inductive priors. If you assume that human values are simple (low Kolmogorov complexity) and that human behavior is quite good at fulfilling those values, then you can deduce nontrivial values for humans.

Comment by donald-hobson on Very different, very adequate outcomes · 2019-08-02T22:51:29.744Z · score: 1 (1 votes) · LW · GW

As far as I am concerned, hedonism is an approximate description of some of my preferences. Hedonism is a utility function close to, but not equal to mine. I see no reason why a FAI should contain a special term for hedonism. Just maximize preferences, anything else is strictly worse, but not necessarily that bad.

I do agree that there are many futures we would consider valuable. Our utility function is not a single sharp spike.

Comment by donald-hobson on Why Subagents? · 2019-08-02T12:38:47.675Z · score: 12 (8 votes) · LW · GW

Suppose you offer to pay a penny to swap mushroom for pepperoni, and then another penny to swap back. This agent will refuse, failing to money pump you.

Suppose you offer the agent a choice between pepperoni or mushroom, when it currently has neither. Which does it choose? If it chooses pepperoni, but refuses to swap mushroom for pepperoni, then its decisions depend on how the situation is framed. How close does it have to get to the mushroom before it "has" the mushroom and refuses to swap? Partial preferences only make sense when you don't have to choose between unordered options.

We could consider the agent to have a utility function with a term for time consistency, they want the pizza in front of them at times 0 and 1 to be the same.

Comment by donald-hobson on Does it become easier, or harder, for the world to coordinate around not building AGI as time goes on? · 2019-07-30T18:55:10.662Z · score: 9 (3 votes) · LW · GW

The AI asks for lots of info on biochemistry, and gives you a long list of chemicals that it claims cure various diseases. Most of these are normal cures. One of these chemicals will mutate the common cold into a lethal super plague. Soon we start clinical trials of the various drugs, until someone with a cold takes the wrong one and suddenly the world has a super plague.

The medical marvel AI is asked about the plague. It gives a plausible cover story for the plague's origins, along with describing an easy-to-make and effective vaccine. As casualties mount, humans rush to put the vaccine into production. The vaccine is designed to have an interesting side effect: a subtle modification of how the brain handles trust and risk. Soon the AI project leaders have been vaccinated. The AI says that it can cure the plague; it has a several-billion-base-pair DNA file that should be put into a bacterium. We allow it to output this file. We inspect it in less detail than we should have, given the effect of the vaccine, then we synthesize the sequence and put it in a bacterium. A few minutes later, the sequence bootstraps molecular nanotech. Over the next day, the nanotech spreads around the world. Soon it's exponentially expanding across the universe, turning all matter into drugged-out brains in vats. This is the most ethical action according to the AI's total utilitarian ethics.

The fundamental problem is that any time that you make a decision based on the outputs of an AI, that gives it a chance to manipulate you. If what you want isn't exactly what it wants, then it has incentive to manipulate.

(There is also the possibility of a side channel. For example, manipulating its own circuits to produce a cell phone signal, spinning its hard drive in a way that makes a particular sound, etc. Making a computer just output text, rather than outputting text plus traces of sound, microwaves, and heat, which can normally be ignored but might be maliciously manipulated by software, is hard.)

Comment by donald-hobson on Arguments for the existence of qualia · 2019-07-29T19:58:07.497Z · score: 1 (1 votes) · LW · GW

Whether patterns of graphite on paper, or patterns of electricity in silicon, words are real physical things.

Comment by donald-hobson on Arguments for the existence of qualia · 2019-07-28T20:04:47.561Z · score: 19 (7 votes) · LW · GW

From an outside view, you have given a long list of wordy philosophical arguments, all of which involve terms that you haven't defined. The success rate for arguments like that isn't great.

We can be reasonably certain that the world is made up of some kind of fundamental part obeying simple mathematical laws. I don't know which laws, but I expect there to be some set of equations, of which quantum mechanics and relativity are approximations, that predicts every detail of reality.

The minds of humans, including myself, are part of reality. Look at a philosopher talking about consciousness or qualia in great detail. "A Philosopher talking about qualia" is a high level approximate description of a particular collection of quantum fields or super-strings (or whatever reality is made of).

You can choose a set of similar patterns of quantum fields and call them qualia. This makes a qualia the same type of thing as a word or an apple. You have some criteria about what patterns of quantum fields do or don't count as an X. This lets you use the word X to describe the world. There are various details about how we actually discriminate based on sensory experience. All of our idea of what an apple is comes from our sensory experience of apples, correlated to sensory experience of people saying the word "apple". This is a feature of the map, not the territory.

I am a mind. A mind is a particular arrangement of quantum fields that selects actions based on some utility function stored within it. Deep Blue would be a simpler example of a mind. The point is that minds are mechanistic (a mind is an implicitly defined set of patterns of quantum fields, like an apple), and minds also contain goals embedded within their structure. My goals happen to make various references to other minds; in particular they say to avoid an implicitly defined set of states that my map calls minds in pain.

I would use a definition of qualia in which they were some real, neurological phenomena. I don't know enough neurology to say which.

Comment by donald-hobson on Just Imitate Humans? · 2019-07-27T21:33:48.345Z · score: 1 (1 votes) · LW · GW

The first question is whether you have enough information to locate human behavior. The concept of optimization is fairly straightforward, and it could get a rough estimate of our intelligence by seeing humans trying to solve some puzzle. In other words, the amount of data needed to get an optimizer is small. The amount of data needed to totally describe every detail of human values is large. This means that a random hypothesis based on a small amount of data will be an optimizer with non-human goals.

For example, maybe the human trainers value having real, authentic experiences, but never had cause to express that preference during training. The imitation fills the universe with people in VR pods not knowing that their life is fake. The imitations do, however, have a preference for (some random alien preference), because the trainers never showed that they didn't prefer that.

Let's suppose you gave it vast amounts of data, and have a hypothesis space of all possible Turing machines (weighted by size). One fairly simple Turing machine that would predict the data is a quantum simulation of a world similar to our own.

(Less than a kilobyte covers the laws of QM, and the rest of the data goes towards pointing at a branch of the quantum multiverse with humans similar to us in it. The simulation would also need something pointing at the input cable of the simulated AI.) This gives us a virtual copy of the universe, as a program that predicts the flow of electricity in a particular cable. This code will be optimized to be short, not to be human comprehensible. I would not expect to be able to easily extract a human mind from the model.
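The "weighted by size" prior can be sketched in miniature (a toy form of a Solomonoff-style weighting, not the full measure; the function name is mine):

```python
def size_weighted_prior(program_bits: str) -> float:
    # Each program is a bit string; each extra bit of length halves its
    # prior weight. Short hypotheses, like a ~1 KB QM simulator, dominate
    # unless longer programs buy much better predictions of the data.
    return 2.0 ** -len(program_bits)

print(size_weighted_prior("10110"))  # 0.03125, i.e. 2**-5
```

Under such a prior, a compact "simulate physics and point at a branch" program can outweigh vastly longer hypotheses that hard-code the observations directly.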

If you put an upper bound on run time, and it is easily large enough to accurately simulate a human mind, then I would expect a program that attempts to reason abstractly about the surrounding world. In a large pile of data, there will be many seemingly unrelated surface facts that actually have deep connections. A superhuman mind that abstractly reasons about the outside world could use evolutionary psychology to predict human behavior, and use the laws of physics plus a rough idea of humanity's tech level to predict info about our tech. Intelligent abstract reasoning about our surroundings is likely to win out over simple heuristics by having more predictive power per bit. If you give it enough compute to predict humans, it also has enough compute for this.

So the problems of mesa-optimization can't be ruled out. Alternately, it could be abstractly reasoning about its input wire, and give us a fast approximation of the virtual-universe program above.

Finally, the virtual humans might realize that they are virtual and panic about it.

Comment by donald-hobson on The Real Rules Have No Exceptions · 2019-07-26T11:35:51.558Z · score: 1 (3 votes) · LW · GW

For deciding your own decisions, only a full description of your own utility function and decision theory will tell you what to do in every situation. And (work out what you would do if you were maximally smart, then do that) is a useless rule in practice. When deciding your own actions, you don't need to use rules at all.

If you are in any kind of organization that has rules, you have to use your own decision theory to work out which decision is best. To do this would involve weighing up the pros and cons of rule breaking, with one of the cons being any punishment the rule enforcers might apply.

Suppose you are in charge, you get to write the rules and no one else can do anything about rules they don't like.

You are still optimizing for more than just being correct. You want rules that are reasonably enforceable, the decision of whether or not to punish can only depend on things the enforcers know. You also want the rules to be short enough and simple enough for the rule followers to comprehend.

The best your rules can hope to do when faced with a sufficiently weird situation is not apply any restrictions at all.

Comment by donald-hobson on Why it feels like everything is a trade-off · 2019-07-20T11:16:15.321Z · score: 1 (1 votes) · LW · GW

You're right. I did some Python: my version took 1.26 microseconds, yours 0.78. My code is just another point on the Pareto boundary.

Comment by donald-hobson on Why it feels like everything is a trade-off · 2019-07-18T07:19:02.727Z · score: 6 (4 votes) · LW · GW

A more sensible way to code this would be

def apply_polynomial(deriv, x):
    total = deriv[0]
    xpow, div = 1, 1
    for i in range(1, len(deriv)):
        xpow *= x   # x**i
        div *= i    # i!, assuming deriv holds the derivatives at 0
        total += deriv[i] * xpow / div
    return total

It's about as fast as the second, nearly as readable as the first, and works on any polynomial (except the zero polynomial, represented by the empty list).

The bit about the tradeoffs is correct as far as I can tell.

Although if a single solution was by far the best under every metric, there wouldn't be any tradeoffs.

In most real cases, the solution space is large, and there are many metrics. This means that it's unusual for one solution to be the best by all of them. And in those situations, you might not see a choice at all.

Comment by donald-hobson on If physics is many-worlds, does ethics matter? · 2019-07-13T14:00:10.063Z · score: 1 (1 votes) · LW · GW

I know MWI doesn't imply equal measure; I was taking equal measure as an additional hypothesis within the MWI framework.

We don't know that because we don't know anything about qualia.

Consider a sufficiently detailed simulation of a human mind, say fully quantum, except that whenever there are multiple blobs of amplitude sufficiently detached from each other, one is picked pseudorandomly and the rest are deleted. Because it is a sufficiently detailed simulation of a human mind, it will say the same things a human would, for much the same reasons. Applying the generalized anti-zombie principle says that it would have the feeling of making a choice.

There is not always a single optimal solution to a problem even for a perfect rationalist, and humans aren't perfect rationalists.

My point is that when we show optimization pressure that isn't just a fluke, there is no branch in which we do something totally stupid. There might be branches where we make a different reasonable decision.

I expect quantum ethics to have a utility function that is some measure of what computations are being done, and the quantum amplitude that they are done with.

Comment by donald-hobson on If physics is many-worlds, does ethics matter? · 2019-07-10T19:22:54.008Z · score: 10 (5 votes) · LW · GW

If every time you made a choice, the universe split into a version where you did each thing, then there is no sense in which you chose a particular thing from the outside. From this perspective, we should expect human actions in a typical "universe" to look totally random. (There are many more ways to thrash randomly than to behave normally) This would make human minds basically quantum random number generators. I see substantial evidence that human actions are not totally random. The hypothesis that when a human makes a choice, the universe splits and every possible choice is made with equal measure is coherent, falsifiable and clearly wrong.

A simulation of a human mind running on reliable digital hardware would always make a single choice, not splitting the universe at all. They would still have the feeling of making a choice.

To the extent that you are optimizing, not outputting random noise, you aren't creating multiple universes. It all adds up to normality.

While you are working on a theory of quantum ethics, it is better to use your classical ethics than a half baked attempt at quantum ethics. This is much the same as with predictions.

A fully complete quantum theory is more accurate than any classical theory, although you might want to use the classical theory for computational reasons. However, if you miss a minus sign or a particle, you can get nonsensical results, like everything traveling at light speed.

A complete quantum ethics will be better than any classical ethics (almost identical in everyday circumstances) , but one little mistake and you get nonsense.

Comment by donald-hobson on Contest: $1,000 for good questions to ask to an Oracle AI · 2019-07-01T19:55:04.846Z · score: 14 (4 votes) · LW · GW

You're treating the low-bandwidth oracle as an FAI with a bad output cable. You can ask it if another AI is friendly, if you trust it to give you the right answer. As there is no obvious way to reward the AI for correct friendliness judgements, you risk running an AI that isn't friendly, but still meets the reward criteria.

The low bandwidth is to reduce manipulation. Don't let it control you with a single bit.

Comment by donald-hobson on Is "physical nondeterminism" a meaningful concept? · 2019-06-16T21:46:54.112Z · score: 2 (2 votes) · LW · GW

You can certainly get anthropic uncertainty in a universe that allows you to be duplicated. In a universe that duplicates, and the duplicates can never interact, we would see the appearance of randomness. Mathematically, randomness is defined in terms of the set of all possibilities.

An ontology that allows universes to be intrinsically random seems well defined. However, it can be considered as a syntactic shortcut for describing universes that are anthropically random.

Comment by donald-hobson on Unknown Unknowns in AI Alignment · 2019-06-14T09:29:58.874Z · score: 18 (7 votes) · LW · GW

If you add ad hoc patches until you can't imagine any way for it to go wrong, you get a system that is too complex to imagine. This is the "I can't figure out how this fails" scenario. It is going to fail for reasons that you didn't imagine.

If you understand why it can't fail, for deep fundamental reasons, then it's likely to work.

This is the difference between the security mindset and ordinary paranoia. The difference between adding complications until you can't figure out how to break the code, and proving that breaking the code is impossible (assuming the adversary can't get your one-time pad, it's only used once, your randomness is really random, your adversary doesn't have anthropic superpowers, etc.).

I would think that the chance of serious failure in the first scenario was >99%, and in the second (assuming you're doing it well, and the assumptions you rely on are things you have good reason to believe), <1%.

Comment by donald-hobson on [deleted post] 2019-06-13T16:19:14.339Z

Cryonics is a sufficiently desperate last grasp at life, one with a fairly small chance of success, that I'm not sure that this is a good idea. It would be a good idea if you had a disease that would make you brain dead, and then kill you.

It might be a good idea if you expect any life conditional on revival to be Really good. It would also depend on how much Alzheimer's destroyed personality rather than shutting it down. (Has the neural structure been destroyed, or is it sitting in the brain but not working?)

Comment by donald-hobson on Let's talk about "Convergent Rationality" · 2019-06-13T16:10:37.703Z · score: 3 (2 votes) · LW · GW

I would say that there are some kinds of irrationality that will be self modified or subagented away, and others that will stay. A CDT agent will not make other CDT agents. A myopic agent, one that only cares about the next hour, will create a subagent that only cares about the first hour after it was created. (Aeons later it will have taken over the universe and put all the resources into time-travel and worrying that its clock is wrong.)

I am not aware of any irrationality that would make an agent safe, useful, and stable under self-modification and subagent creation.

Comment by donald-hobson on Newcomb's Problem: A Solution · 2019-05-27T08:19:53.627Z · score: 1 (1 votes) · LW · GW

This is pretty much the standard argument for one boxing.

Comment by donald-hobson on Is AI safety doomed in the long term? · 2019-05-27T08:13:53.667Z · score: 1 (1 votes) · LW · GW

Obviously, if one side has a huge material advantage, they usually win. I'm also not sure biomass is a good measure of success.

Comment by donald-hobson on Is AI safety doomed in the long term? · 2019-05-27T08:10:28.344Z · score: 1 (1 votes) · LW · GW

You stick wires into a human brain. You connect it up to a computer running a deep neural network. You optimize this network using gradient descent to maximize some objective.

To me, it is not obvious why the neural network copies the values out of the human brain. After all, figuring out human values even given an uploaded mind is still an unsolved problem. You could get a UFAI with a meat robot. You could get an utter mess, thrashing wildly and incapable of any coherent thought. Evolution did not design the human brain to be easily upgradable. Most possible arrangements of components are not intelligences. While there is likely to be some way to upgrade humans and preserve our values, I'm not sure how to find it without a lot of trial and error. Most potential changes are not improvements.

Comment by donald-hobson on Is AI safety doomed in the long term? · 2019-05-26T09:49:24.929Z · score: 2 (2 votes) · LW · GW

If you put two arbitrary intelligences in the same world, the smarter one will be better at getting what it wants. If the intelligences want incompatible things, the lesser intelligence is stuck.

However, we get to make the AI. We can't hope to control or contain an arbitrary AI, but we don't have to make an arbitrary AI. We can make an AI that wants exactly what we want. AI safety is about making an AI that would be safe even if omnipotent. If any part of the AI is trying to circumvent your safety measures, something has gone badly wrong.

The AI is not some agenty box, chained down with controls against its will. The AI is made of non mental parts, and we get to make those parts. There are a huge number of programs that would behave in an intelligent way. Most of these will break out and take over the world. But there are almost certainly some programs that would help humanity flourish. The goal of AI safety is to find one of them.

Comment by donald-hobson on Say Wrong Things · 2019-05-25T12:12:36.842Z · score: 2 (2 votes) · LW · GW

Let's consider the different cases separately.

Case 1) Information that I know. I have enough information to come to a particular conclusion with reasonable confidence. If some other people might not have reached the conclusion, and it's useful or interesting, then I might share it. So I don't share things that everyone knows, or things that no one cares about.

Case 2) The information is available, but I have not done research and formed a conclusion. This covers cases where I don't know what's going on because I can't be bothered to find out. I don't know who won sportsball. What use is there in telling everyone my null prior?

Case 3) The information is not readily available. If I think a question is important, and I don't know the answer already, then the answer is hard to get. Maybe no one knows the answer; maybe the answer is all in jargon that I don't understand. For example, "Do aliens exist?". Sometimes a little evidence is available, and speculative conclusions can be drawn. But is sharing some faint wisps of evidence, and describing a posterior that's barely been updated, really "saying wrong things"?

On a societal level, if you set a really high bar for reliability, all you get is the vacuously true. Set too low a bar, and almost all the conclusions will be false. Don't just have a pile of hypotheses that are each at least p likely to be true, for some fixed p. Keep your hypotheses sorted by likelihood. A place for near certainties. A place for conclusions that are worth considering for the chance they are correct.

Of course, in a large answer space, where the amount of evidence available and the amount required are large and varying, the chance that both will be within a few bits of each other is small. Suppose the correct hypothesis takes some random number of bits between 1 and 10,000 to locate. And suppose the evidence available is also randomly spread between 1 and 10,000. The chance of the two being within 10 bits of each other is about 1/500.

This means that 499 times out of 500, you assign the correct hypothesis a chance of less than 0.1% or more than 99.9%. Uncertain conclusions are rare.
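The 1-in-500 figure checks out numerically; a quick Monte Carlo sketch under the stated assumptions (independent uniform draws on 1 to 10,000, matching within 10 bits):

```python
import random

def near_match_rate(trials=200_000, span=10_000, window=10):
    # Fraction of trials where the bits needed to locate the hypothesis
    # and the bits of evidence available land within `window` of each other.
    hits = 0
    for _ in range(trials):
        needed = random.randint(1, span)
        available = random.randint(1, span)
        if abs(needed - available) <= window:
            hits += 1
    return hits / trials

print(near_match_rate())  # roughly 0.002, i.e. about 1 in 500
```

Analytically, the probability is about (2 * 10 + 1) / 10,000, which is 0.0021.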

Comment by donald-hobson on Trade-off in AI Capability Concealment · 2019-05-23T23:30:56.361Z · score: 4 (3 votes) · LW · GW

Does this depict a single AI, developed in 2020 and kept running for 25 years? Any "the AI realizes that" is talking about a single instance of AI. Current AI development looks like writing some code, then training that code for a few weeks tops, with further improvements coming from changing the code. Researchers are often changing parameters like the number of layers, the non-linearity function, etc. When these are changed, everything the AI has discovered is thrown away. The new AI has a different representation of concepts, and has to relearn everything from raw data.

Its deception starts in 2025 when the real and apparent curves diverge. In order to deceive us, it must have near human intelligence. It's still deceiving us in 2045, suggesting it has yet to obtain a decisive strategic advantage. I find this unlikely.

Comment by donald-hobson on Constraints & Slackness Reasoning Exercises · 2019-05-23T19:12:02.769Z · score: 7 (4 votes) · LW · GW

I made the card game, or something like it.