Posts

Logical Counterfactuals and Proposition graphs, Part 3 2019-09-05T15:03:53.262Z · score: 6 (2 votes)
Logical Counterfactuals and Proposition graphs, Part 2 2019-08-31T20:58:12.851Z · score: 15 (4 votes)
Logical Optimizers 2019-08-22T23:54:35.773Z · score: 12 (9 votes)
Logical Counterfactuals and Proposition graphs, Part 1 2019-08-22T22:06:01.764Z · score: 23 (8 votes)
Programming Languages For AI 2019-05-11T17:50:22.899Z · score: 3 (2 votes)
Propositional Logic, Syntactic Implication 2019-02-10T18:12:16.748Z · score: 6 (5 votes)
Probability space has 2 metrics 2019-02-10T00:28:34.859Z · score: 90 (38 votes)
Allowing a formal proof system to self improve while avoiding Lobian obstacles. 2019-01-23T23:04:43.524Z · score: 6 (3 votes)
Logical inductors in multistable situations. 2019-01-03T23:56:54.671Z · score: 8 (5 votes)
Boltzmann Brains, Simulations and self refuting hypothesis 2018-11-26T19:09:42.641Z · score: 0 (2 votes)
Quantum Mechanics, Nothing to do with Consciousness 2018-11-26T18:59:19.220Z · score: 13 (13 votes)
Clickbait might not be destroying our general Intelligence 2018-11-19T00:13:12.674Z · score: 26 (10 votes)
Stop buttons and causal graphs 2018-10-08T18:28:01.254Z · score: 6 (4 votes)
The potential exploitability of infinite options 2018-05-18T18:25:39.244Z · score: 3 (4 votes)

Comments

Comment by donald-hobson on Many Turing Machines · 2019-12-10T23:14:23.665Z · score: 2 (2 votes) · LW · GW

I think that you are putting forward example hypotheses that you don't really believe, in order to prove your point. Unfortunately it isn't clear which hypotheses you do believe, and this makes your point opaque.

From a mathematical perspective, quantum collapse is about as bad as insisting that the universe will suddenly cease to exist at some fixed future time. Quantum collapse introduces a nontrivial complexity penalty; in particular, you need to pick a space of simultaneity.

The different Turing machines don't interact at all. Physicists can split the universe into a pair of universes in the quantum multiverse, and then merge them back together in a way that lets them detect that both had an independent existence. In the quantum bomb test, without a bomb, the universes in which the photon took each path are identical, allowing interference. If the bomb does exist, no interference. Many worlds just says that these branches carry on existing whether or not scientists manage to make them interact again.

Comment by donald-hobson on [deleted post] 2019-12-10T10:41:13.471Z

Consider a self driving car. Call the human utility function U. Call the space of all possible worlds W. In the normal operation of a self driving car, the car only makes decisions over a restricted space. Say S ⊂ W. In practice S will contain a whole bunch of things the car could do. Suppose that the programmers only know the restriction of U to S. This is enough to make a self driving car that behaves correctly in the crash or don't-crash dilemma.

However, suppose that a self driving car is faced with a situation that is off-distribution from S. Three things it could do include:

1) Recognise the problem and shut down.

2) Fail to coherently optimise at all

3) Coherently optimise some extrapolation of U from its known restriction to S.

The behavior we want is to optimise U, but the info about what U is just isn't there.

Options (1) and (2) make the system brittle, tending to fail the moment anything goes slightly differently.

Option (3) leads to reasoning like "I know not to crash into x, y and z, so maybe I shouldn't crash into anything". In other words, the extrapolation is often quite good when slightly off-distribution. However, when far off-distribution, you can get traffic-light-maximizer behavior.

In short, the paradox of robustness exists because, when you don't know what to optimize for, you can fail to optimize, or you can guess at something and optimize that.
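As a toy illustration of option (3), not anything from the original comment: fit an extrapolation of the known restriction of U and compare it to the true U off-distribution (the utility function and numbers here are made up).

```python
import numpy as np

# Hypothetical true utility: comfort peaks at 25 mph, and crashing (only
# possible above 80 mph, never seen in the restricted space S) is catastrophic.
def true_U(speed):
    return -(speed - 25.0) ** 2 - (1e6 if speed > 80 else 0.0)

S = np.linspace(0, 40, 50)                    # the restricted space the car was designed for
U_on_S = np.array([true_U(s) for s in S])     # all the programmers know: U restricted to S

# Option (3): coherently optimise an extrapolation of U restricted to S.
extrapolated_U = np.poly1d(np.polyfit(S, U_on_S, deg=2))

print(extrapolated_U(45), true_U(45))    # slightly off-distribution: extrapolation is fine
print(extrapolated_U(120), true_U(120))  # far off-distribution: the crash penalty is invisible
```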

Comment by donald-hobson on What is Abstraction? · 2019-12-07T17:20:18.050Z · score: 3 (2 votes) · LW · GW

I think that there are some abstractions that aren't predictively useful, but are still useful in deciding your actions.

Suppose I and my friend both have the goal of maximising the number of DNA strings whose MD5 hash is prime.

I call sequences with this property "ana" and those without this property "kata". Saying that "the DNA over there is ana" does tell me something about the world: there is an experiment that I can do to determine whether it is true or false, namely sequencing it and taking the hash. The concept of "ana" isn't useful in a world where no agents care about it and no detectors have been built. If your utility function cares about the difference, it is a useful concept. If someone has connected an ana detector to the trigger of something important, then it's a useful concept. If you're a crime scene investigator, and all you know about the perpetrator's DNA is that it's ana, then finding out whether Joe Blogs has ana DNA could be important. The concept of ana is useful. If you know the perpetrator's entire genome, the concept stops being useful.
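A concrete sketch of the "ana" test (purely illustrative; reading the hash as a base-16 integer is my assumption, and sympy is just a convenient primality checker):

```python
import hashlib
from sympy import isprime   # any primality test for ~128-bit integers will do

def is_ana(dna: str) -> bool:
    """True if the MD5 hash of the DNA string, read as an integer, is prime."""
    return isprime(int(hashlib.md5(dna.encode("ascii")).hexdigest(), 16))

print(is_ana("GATTACA"))    # a well-defined experiment, useful only if something cares about it
```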

A general abstraction is consistent with several, but not all, universe states. There are many different universe states in which the gas has a pressure of 37 Pa, but also many in which it doesn't. So all abstractions are subsets of possible universe states. Usually, we use subsets that are suitable for reasoning about in some way.

Suppose you were literally omniscient, knowing every detail of the universe, but you had to give humans a 1TB summary. Unable to include all the info you might want, you can only include a summary of the important points; you are now engaged in lossy compression.

Sensor data is also an abstraction: for instance you might have temperature and pressure sensors. Cameras record roughly how many photons hit them without tracking every one. So real world agents are translating one lossy approximation of the world into another, without ever being able to express the whole thing explicitly.

How you do lossy compression depends on what you want. Music is compressed in a way that is specific to the limitations of human hearing. Abstractions are much the same.

Comment by donald-hobson on What are some non-purely-sampling ways to do deep RL? · 2019-12-05T11:40:25.390Z · score: 3 (2 votes) · LW · GW

The r vs r' problem can be reduced if you can find a way to sample points of high uncertainty.

Comment by donald-hobson on On decision-prediction fixed points · 2019-12-05T11:33:41.427Z · score: 1 (1 votes) · LW · GW

I'm modeling humans as two agents that share a skull. One of those agents wants to do stuff and writes blog posts; the other likes lying in bed, and has at least partial control of your actions. The part of you that does the talking can sincerely say that it wants to do X, but it isn't in control.

Even if you can predict this whole thing, that still doesn't stop it happening.

Comment by donald-hobson on On decision-prediction fixed points · 2019-12-04T23:44:39.951Z · score: 1 (3 votes) · LW · GW

Akrasia is the name we give to the fact that the part of ourselves that communicates about X and the part that actually does X have slightly different goals. The communicating part is always whinging about how the other part is being lazy.

Comment by donald-hobson on CO2 Stripper Postmortem Thoughts · 2019-12-01T11:20:07.505Z · score: 25 (8 votes) · LW · GW

If the whole reason you didn't want to open the window was the energy put into heating/cooling the air, why not use a heat exchanger? I reckon it could be done using a desktop fan, a stack of thin aluminium plates, and a few pieces of cardboard or plastic to block air flow.

Comment by donald-hobson on Open-Box Newcomb's Problem and the limitations of the Erasure framing · 2019-11-30T11:33:01.712Z · score: 1 (1 votes) · LW · GW

Imagine sitting outside the universe, and being given an exact description of everything that happened within the universe. From this perspective you can see who signed what.

You can also see whether your thoughts are happening in biology or silicon or whatever.

My point isn't "you can't tell whether or not your in a simulation so there is no difference", my point is that there is no sharp cut off point between simulation and not simulation. We have a "know it when you see it" definition with ambiguous edge cases. Decision theory can't have different rules for dealing with dogs and not dogs because some things are on the ambiguous edge of dogginess. Likewise decision theory can't have different rules for you, copies of you and simulations of you as there is no sharp cut off. If you want to propose a continuous "simulatedness" parameter, and explain where that gets added to decision theory, go ahead. (Or propose some sharp cutoff)

Comment by donald-hobson on Open-Box Newcomb's Problem and the limitations of the Erasure framing · 2019-11-30T00:28:23.425Z · score: 1 (1 votes) · LW · GW
in fact, it could be an anti-rational agent with the opposite utility function.

These two people might look the same, they might be identical on a quantum level, but one of them is a largely rational agent, and the other is an anti-rational agent with the opposite utility function.

I think that calling something an anti-rational agent with the opposite utility function is a weird description that doesn't cut reality at its joints. There is a simple notion of a perfect sphere. There is also a simple notion of a perfect optimizer. Real world objects aren't perfect spheres, but some are pretty close. Thus "sphere" is a useful approximation, and "sphere + error term" is a useful description. Real agents aren't perfect optimisers (ignoring contrived goals like "1 for doing whatever you were going to do anyway, 0 else"), but some are pretty close, hence "utility function + biases" is a useful description. This makes the notion of an anti-rational agent with the opposite utility function like an inside-out sphere with its surface offset inwards by twice the radius. It's a cack-handed description of a simple object in terms of a totally different simple object and a huge error term.

This is one of those circumstances where it is important to differentiate between you being in a situation and a simulation of you being in a situation.

I actually don't think that there is a general procedure to tell what is you and what is a simulation of you. The standard arguments apply, about slowly replacing neurons with nanomachines, slowly porting the result to software, and slowly abstracting and proving theorems about it rather than running it directly.

It is an entirely meaningful utility function to only care about copies of your algorithm that are running on certain kinds of hardware. That makes you a "biochemical brains running this algorithm" maximizer. The paperclip maximizer doesn't care about any copy of its algorithm. Humans worrying about whether the predictor's simulation is detailed enough to really suffer is a consequence of specific features of human morality. From the perspective of a paperclip maximizer doing decision theory, what we care about is logical correlation.


Comment by donald-hobson on Metaphilosophical competence can't be disentangled from alignment · 2019-11-29T13:07:33.848Z · score: 2 (2 votes) · LW · GW

I think it's hard to distinguish a lack of metaphilosophical sophistication from having different values. The (hypothetical) angsty teen says that they want to kill everyone. If they had the power to, they would. How do we tell whether they are mistaken about their utility function, or just have killing everyone as their utility function? If they clearly state some utility function that is dependent on some real world parameter, and they are mistaken about that parameter, then we could know. For example, they want to kill everyone if and only if the moon is made of green cheese; they are confident that the moon is made of green cheese, so they don't even bother checking before killing everyone.

Alternatively we could look at whether they could be persuaded not to kill everyone, but some people could be persuaded of all sorts of things. The fact that you could be persuaded to do X says more about the persuasive ability of the persuader and the vulnerabilities of your brain than about whether you wanted X.

Alternatively we could look at whether they will regret it later. If I self modify into a paperclip maximiser, I won't regret it, because that action maximised paperclips. However a hypothetical self who hadn't been modified would regret it.

Suppose there are some nanobots in my brain that will slowly rewire me into a paperclip maximiser. I decide to remove them. The real me doesn't regret this decision, the hypothetical me who wasn't modified does. Suppose there is part of my brain that will make me power hungry and self centered once I become sufficiently powerful. I remove it. Which case is this? Am I damaging my alignment or preventing it from being damaged?

We don't understand the concept of a philosophical mistake well enough to say if someone is making one. It seems likely that, to the extent that humans have a utility function, some humans have utility functions that want to kill most humans.

who almost certainly care about the future well-being of humanity.

Is mistaken. I think that a relatively small proportion of humans care about the future well-being of humanity in any way similar to what the words mean to a modern rationalist.

To a rationalist, "future wellbeing of humanity" might mean a superintelligent AI filling the universe with simulated human minds.

To a random modern first world person, it might mean a fairly utopian "sustainable" future, full of renewable energy, electric cars, etc.

A North Sentinel Islander might have little idea that any humans beyond their tribe exist, and might hope for several years of good weather and rich harvests.

To a 10th century monk, they might hope that judgement day comes soon, and that all the righteous souls go to heaven.

To a barbarian warlord, they might hope that their tribe conquers many other tribes.

The only sensible definition of "care about the future of humanity" that covers all these cases is that their utility function has some term relating to things happening to some humans. Their terminal values reference some humans in some way. As opposed to a paperclip maximiser that sees humans as entirely instrumental.

Comment by donald-hobson on April Fools: Announcing: Karma 2.0 · 2019-11-27T23:16:48.104Z · score: 17 (3 votes) · LW · GW

Instead of just voting comments up and down, can we vote comments north, south, east, west, past and future to make a full 4D voting system? Position the comments in their appropriate position on the screen, using drop shadows to indicate depth. Access the inbuilt compasses on smartphones to make sure the direction is properly aligned. Use the GPS to work out the velocity and gravitational field exposure to make proper relativistic calculations. The comments voted into the future should only show up after a time delay, while those voted into the past should show up before they are posted. A potential feature for a future version of Karma.

Comment by donald-hobson on A test for symbol grounding methods: true zero-sum games · 2019-11-27T08:34:42.160Z · score: 1 (1 votes) · LW · GW

Designing a true 0-sum game situation is quite straightforward, or at least a situation which both AIs think is zero sum, and in which they don't try to cooperate. Consider both AIs to be hypercomputers with a Cartesian boundary. The rest of the world is some initially unknown Turing machine. Both agents are the obvious 2-player generalization of AIXI. The reward signal is shared across the magic incorruptible Cartesian boundary.

This is something that could be programmed on an indestructible hypercomputer.

I also suspect that some of the easiest shared 0-sum goals to make might be really weird, like maximizing the number of ones on the right side of the tape head in a Turing machine representation of the universe.

You could even have two delusional AIs that were both certain that phlogiston existed, one a phlogiston maximizer, the other a phlogiston minimizer. If they come up with the same crazy theories about where the phlogiston is hiding, they will act 0-sum.

Comment by donald-hobson on Breaking Oracles: superrationality and acausal trade · 2019-11-27T00:21:27.117Z · score: 1 (1 votes) · LW · GW

Each oracle is running a simulation of the world. Within that simulation, they search for any computational process with the same logical structure as themselves. This will find both their virtual model of their own hardware and any other agenty processes trying to predict them. The oracle then deletes the output of all these processes within its simulation.

Imagine running a super realistic simulation of everything, except that any time anything in the simulation tries to compute the millionth digit of pi, you notice, pause the simulation and edit it to make the result come out as 7. While it might be hard to formally specify what counts as a computation, I think that this intuitively seems like meaningful behavior. I would expect the simulation to contain maths books that said that the millionth digit of pi was 7, and that were correspondingly off by one about how many 7s were in the first n digits for any n>1000000.

The principle here is the same.

Comment by donald-hobson on Breaking Oracles: superrationality and acausal trade · 2019-11-25T22:55:43.934Z · score: 1 (1 votes) · LW · GW

Suppose there are 2 oracles. Each oracle is just simulating an approximation of the world without itself, and outputting data based on that. Each oracle simulates one future; there is no explicit optimization or acausal reasoning. The oracles are simulating each other, so the situation is self referential. Suppose one oracle is predicting stock prices, and the other is predicting crop yields. Both produce numbers that encode the same UFAI. That UFAI will manipulate the stock market and crop yields in order to encode a copy of its own source code. From the point of view of the crop yield oracle, it simulates a world without itself. In that virtual world, the stock price oracle produces a series of values that encode a UFAI, and that UFAI then goes on to control world crop production. So this oracle is predicting exactly what would happen if it didn't turn on. The other oracle reasons similarly. The same basic failure happens with many low bandwidth oracles. This isn't something that can be solved by myopia or CDT-style causal reasoning.

However, it might be solvable with logical counterfactuals. Suppose an oracle takes the logical counterfactual on its algorithm outputting "Null". Then within this counterfactual simulation, the other oracle is on its own, and can act as a "perfectly safe" single counterfactual oracle. By induction, a situation with any number of oracles should be safe. This technique also removes self referential loops.

I think that one oracle of each type is dangerous, but am not really sure.

Comment by donald-hobson on Gears-Level Models are Capital Investments · 2019-11-24T16:00:49.254Z · score: 6 (3 votes) · LW · GW

I think that you are seeing a tradeoff because you are only looking at cases where both techniques are comparably good. No one makes a calculator by trying random assemblages of transistors and seeing what works; here the gears-level insight is just much easier. When there are multiple approaches, and you rule out the cases where one is obviously much better, you see a trade-off in the remaining cases. Expect there to be some cases where one technique is just worse.

Comment by donald-hobson on Analysing: Dangerous messages from future UFAI via Oracles · 2019-11-23T12:10:25.540Z · score: 1 (1 votes) · LW · GW

Suppose we have a hypercomputer that knows the exact state of the rest of the universe. We set it to the task of simulating everything except itself (while assuming its own output is 00...0). It then outputs some info about this simulated world.

We could build some real-world UFAI-detecting hardware, and add a rule to the simulation that if the virtual version of the UFAI detector is triggered, it outputs a "UFAI error". But what we really want to do is write a program that can detect a UFAI given an atomically precise description of the world, while also having some control over which world you get a description of, via physical manufactured devices being copied into the virtual world. For instance, suppose you know that only molecular nanotech can break through your physical defenses, and you have a program that can reliably detect molecular nanotech given an atomically precise description of the world. You can combine these into a program that returns "UFAI error" any time the virtual copy of a physical UFAI detector is triggered or nanotech is made in the simulation. Note that humans making nanotech within the prediction window would also give you a "UFAI error".

Perhaps we could have some program that searches for all computational processes that seem to be doing counterfactual reasoning about the oracle. Any process that is trying to acausally optimise the oracle's output in any way whatsoever will cause the oracle to output an error message. Don't make this too sensitive to humans, though. Suppose the prediction oracle's output was whatever was typed into this keyboard here, and suppose we did a medical trial in the virtual world and then typed the results into the keyboard. The keyboard doesn't need to be plugged in to anything; the only reason to type that data in is to influence the world outside the simulation. You might solve this by allowing any optimizer that runs on neurons to try manipulating the output. (And hope that any UFAI isn't built out of neurons, and can't decide to make a neuron-based copy of itself to dodge the detector without setting off the detector in the process.)

Comment by donald-hobson on Defining AI wireheading · 2019-11-22T11:35:10.029Z · score: 3 (2 votes) · LW · GW

I consider wireheading to be a special case of proxy alignment in a mesa-optimiser.

Proxy alignment. The basic idea of proxy alignment is that a mesa-optimizer can learn to optimize for some proxy of the base objective instead of the base objective itself.

Suppose the base objective was to increase atmospheric pressure. One effect of increased atmospheric pressure is that less cosmic radiation reaches the ground (more air to block it). So an AI whose mesa-goal was to protect earth from radiation would be a proxy-aligned agent. It has the failure mode of surrounding earth in an iron shell to block radiation. Note that this failure can happen whether or not the AI has any radiation sensors. An agent that wants to protect earth from radiation did well enough in training, and now that is what it will do: protect the earth from radiation.

An agent with the mesa goal of maximizing pressure near all barometers would put them all in a pressure dome. (Or destroy all barometers and drop one "barometer" into the core of Jupiter.)

An agent with the mesa-goal of maximizing the reading on all barometers would be much the same. That agent will go around breaking all the world's barometers.

Another mesa objective that you could get is to maximize the number on this reward counter in this computer chip here.

Wireheading is a special case of a proxy-aligned mesa-optimizer where the mesa-objective is something to do with the agent's own workings.

As with most real world categories, "something to do with" is a fuzzy concept. There are mesa-objectives that are clear instances of wireheading, ones that are clearly not, and borderline cases. This is about word definitions, not real world uncertainty.

If anyone can describe a situation in which wireheading would occur that wasn't a case of mesa optimiser misalignment, then I would have to rethink this. (Obviously you can build an agent with the hard coded goal of maximizing some feature of its own circuitry, with no mesa optimization.)

Comment by donald-hobson on Creationism and Many-Worlds · 2019-11-15T18:05:26.360Z · score: 2 (2 votes) · LW · GW

This seems to be a misapplication of Bayesian reasoning. Suppose I believe this argument, and as such, assign 99.9% to "no god, many worlds". Then suppose I had some absolutely reliable knowledge that god didn't exist. This argument stops working and I now believe "no god, collapse postulate" at 50% and "no god, many worlds" at 50%. Imagine that I am about to get that perfectly reliable knowledge about the existence of god. I am almost sure that I will get "no god". I know that, given "no god", I will assign 50% credence to collapse postulates. I currently assign <0.1% to collapse postulates. Something has gone wrong.

The Bayesian irrelevance theorem states that

The posterior odds of any two hypotheses depend only on how well those hypotheses predict the data, and on their prior odds.

In other words, if you have 3 possible theories, X, Y and Z, and you want to compare X and Y, then you don't need to know anything about Z. To compare X with Y, compare their priors and their ability to predict the data, as if Z didn't exist.

This will give you the ratio of your credences in X and Y.
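For concreteness, this is just the odds form of Bayes' theorem: for data D and hypotheses X and Y,

$$\frac{P(X \mid D)}{P(Y \mid D)} = \frac{P(D \mid X)}{P(D \mid Y)} \cdot \frac{P(X)}{P(Y)}.$$

Z never appears in the formula, so hypotheses you are not currently comparing cannot change this ratio.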

So, to compare the two hypotheses "no god, many worlds" and "no god, collapse postulate", you only need to think about these theories, what their priors are, and what updates you can make.

Depending on how you handle anthropic reasoning, you might or might not make an update towards many worlds.

Comment by donald-hobson on An optimal stopping paradox · 2019-11-15T17:21:38.077Z · score: 1 (1 votes) · LW · GW

Fixed.

Comment by donald-hobson on An optimal stopping paradox · 2019-11-13T23:25:55.722Z · score: 4 (2 votes) · LW · GW

Differentiating the expected reward b·t·e^(-t/a) over time (where the sale value grows like b·t and a is the exponential decay timescale) gives b·e^(-t/a)·(1 - t/a).

So the best time to sell is when t = a.

If you have already waited time s, then the reward becomes b·t·e^(-(t-s)/a). The stopping condition becomes b·e^(-(t-s)/a)·(1 - t/a) = 0,

with a solution at t = a. Nothing weird is going on here; a plot of expected value against sell time rises to a single peak at t = a and then decays.

Suppose the exponential decay term was 1 day. After 1 second, waiting another second makes sense, it will double your value and the chance of a fail is tiny. After a week, you already have a large pot of value that you are risking. It is no longer worth waiting.
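A quick numerical check of this reading (a sketch; the linear-value, exponential-hazard model and the numbers are my reconstruction, not from the original post):

```python
import numpy as np

a = 1.0      # exponential decay timescale, in days
b = 1.0      # rate at which the sale value grows

t = np.linspace(0.01, 7.0, 100_000)              # candidate sell times, up to a week

expected = b * t * np.exp(-t / a)                # expected reward, deciding at t = 0
print(t[np.argmax(expected)])                    # ~1.0: sell at t = a

s = 0.5                                          # suppose you have already waited half a day
later = t[t >= s]
conditional = b * later * np.exp(-(later - s) / a)
print(later[np.argmax(conditional)])             # still ~1.0: the optimum does not drift
```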

Comment by donald-hobson on [Link] Is the Orthogonality Thesis Defensible? (Qualia Computing) · 2019-11-13T22:58:49.605Z · score: 1 (1 votes) · LW · GW

Open individualism seems to either fail the Monday-Tuesday test (if it was true on Monday and false on Tuesday, is there any experiment that would come up with different results?), or be blatantly false. Firstly, open individualism is defined by Wikipedia as

Open individualism is the view in the philosophy of personal identity, according to which there exists only one numerically identical subject, who is everyone at all times.

What sort of philosophical object might this be? Might it be a predictive model? Quantum mechanics predicts everything in principle, so this would make open individualism an approximation to quantum mechanics in the same way that the ideal gas law is. However, for us to gain confidence that it was correct, we would either need sufficiently many correct predictions that we had good evidence for it, or a mathematical derivation of it from other trusted principles. This blog uses it to predict that any sufficiently advanced AI will be friendly. This blog predicts that agents that believe in open individualism will always cooperate in prisoner's dilemmas. And that

we could take Open Individualism to assert that phenomenal reality is, in the most literal sense, one huge qualia-bundle, and although it seems like this qualia-bundle has partitions or boundaries, these apparent partitions are illusions. Phenomenal binding, on the other hand, *is* real— but on only the *grandest* scale; absolutely everything is bound together. Everything is ontologically unitary, in all important senses.

With this quote implying that science would be harder if open individualism were true.

Basically, EI would be a lot easier to manage; being able to divide and conquer is a key enabling factor for scientific progress. Easier to study the structure of reality if there are many small monads of bounded complexity to study and compare, rather than one big monad with only very fuzzy internal partitions.

There don't seem to be many obviously correct predictions, or any kind of mathematical derivation. The predictions stated here seem to be coming from mental black boxes rather than formulaic theories. If it is a theory, it seems to be a poor one.

Could it be a definition? It looks like someone, faced with an inability to cut reality at its joints, refused to cut at all. There is a sense in which my mind today is more similar to my mind yesterday (in a high level behavioral sense) than either is to the mind of Genghis Khan. In circumstances that don't involve person duplication or other weird brain manipulation techs, asking "is the person who smashed my window yesterday the same person that stole my car today?" is a meaningful question. Declaring everyone to be the same person is like answering "Are there any apples at the shops?" by saying that every possible arrangement of mass is an apple.

In real world situations, we don't use literal exact equality; we use "close enough for practical purposes". The fact that in English there are many different ways of asking whether things are the same ("equals", "same", "is"...) only confuses things further. It's hard to notice you are redefining a word that isn't even a single word.

Understanding intelligence without the concept.

Either way, it seems that there is sufficient confusion around the concept that any deductions made from it are suspect. Try to taboo the concept of open individualism and explain why any arrangement of transistors that can be approximated as maximizing some function over world states must be maximizing a function that looks at other approximate function maximisers and computes the value of that function given the universe as input. Try to explain why this theory still allows AlphaGo to play Go against a copy of itself. Is it not sufficiently smart or thoughtful? If DeepMind trained it for even longer, would they get an algorithm that recognized it was fighting a copy of itself, and instead of playing, agreed to a tie?

Comment by donald-hobson on Indescribable · 2019-11-11T13:19:31.456Z · score: 1 (1 votes) · LW · GW
Indescribable things cannot be described in a finite number of words. That's because each one contains an infinite quantity of information.

I would disagree with that. Suppose you take a Planck-scale recording of the entire quantum wave function (within our Hubble volume), along with some kind of "look at this bit" pointer (a few web pages about the colour red should do it). This contains all the information about the colour red. The whole universe is there to tell you the clustered structure of thingspace; the extra articles are so you can pick out the "red" cluster as opposed to the "turtles" cluster. Whether or not there is any way to get this info into the brain of a blind human is another question, but the info is there. A complete description of "the sensation of red" is finite. Much of the info could be got into the brain of a blind human, say with a series of talks on optics, vision, etc. However, the brain is incapable of doing arbitrary format conversions.

Whether or not uncomputable numbers require infinite info is getting into weird subtleties of logic, model theory, etc.

Comment by donald-hobson on Elon Musk is wrong: Robotaxis are stupid. We need standardized rented autonomous tugs to move customized owned unpowered wagons. · 2019-11-05T22:40:20.192Z · score: 2 (2 votes) · LW · GW

Modern cars have a swish, aerodynamic shape. Would a system composed of two coupled pieces get as good aerodynamics? I agree that there are some advantages to the proposed system, but there are also potential disadvantages.

Comment by donald-hobson on “embedded self-justification,” or something like that · 2019-11-04T15:07:01.203Z · score: 1 (1 votes) · LW · GW

I think that this infinite regress can be converted into a loop. Consider an infinite sequence of layers, in which the job of layer n+1 is to optimise layer n. Each layer is a piece of programming code. After the first couple of layers, these layers will start to look very similar. You could have layer 3 being able to optimize both layer 2 and layer 3 itself.

One model is that your robot just sits and thinks for an hour. At the end of that hour, it designs what it thinks is the best code it can come up with, and runs that. To the original AI, anything outside the original hour is external; it is answering the question "what pattern of bits on this hard disk will lead to the best outcome?" It can take all these balances and tradeoffs into account in whatever way it likes. If it hasn't come up with any good ideas yet, it could copy its code, add a crude heuristic that makes it run randomly when thinking (to avoid the predators), and think for longer.

Comment by donald-hobson on The Simulation Epiphany Problem · 2019-11-02T13:54:20.880Z · score: 1 (1 votes) · LW · GW

Both ways of simulating counterfactuals remove some info: either you change [Dave]'s prediction, or you stop it being correct. In the real world, the robot knows that Dave will correctly predict it, but its counterfactuals contain scenarios where [Dave] is wrong.

Suppose there were two identical robots, and the paths A and B were only wide enough for one robot, so that 1A,1B > 2A > 2B in both robots' preference orderings (one robot on each path beats both on A, which beats both on B). The robots predict that the other robot will take path Q, and so decide to take path R ≠ Q (where {Q,R} = {A,B}). The robots oscillate their decisions through the levels of simulation until the approximations become too crude. Both robots then take the same path, with the path they take depending on whether they had compute for an odd or even number of simulation layers. They will do this even if they have a way to distinguish themselves, like flipping coins until one gets heads and the other doesn't. (Assuming an epsilon cost to this method.)
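A minimal sketch of that parity effect (the base case and depth cutoff are made-up details): two identical robots running the same best-response prediction always end up on the same path, and which path it is flips with the number of simulation layers.

```python
def choose(depth: int) -> str:
    """Pick a path by simulating the other (identical) robot `depth` layers deep."""
    if depth == 0:
        return "A"                        # crude approximation once compute runs out
    other = choose(depth - 1)             # predict the other robot
    return "B" if other == "A" else "A"   # take whichever path it doesn't

for layers in range(6):
    print(layers, choose(layers), choose(layers))   # both robots give the same answer every time
```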

In general, CDT doesn't work when being predicted.

Comment by donald-hobson on Why are people so bad at dating? · 2019-10-30T23:45:03.603Z · score: 1 (1 votes) · LW · GW

Is an attractive pic even what you want? Put a less attractive pic up to filter out the borderline cases. Anyone who wants to date you despite an ugly pic must really want you. Put up a glamor shot and you will be swamped in replies.

Comment by donald-hobson on The Missing Piece · 2019-10-30T23:22:34.068Z · score: 3 (2 votes) · LW · GW

Yes, prion evolution is sliding between fixed points. One way to reduce the number of fixed points would be to measure and test the finished duplicate, and destroy it if it fails the tests. Without tests, you just need A to build A, and A' to build A'. No prion can reside exclusively in the testing mechanisms, so either the difference between A and A' is something that the tests can't measure, or A' builds A' and also T', a tester with a prion that makes A' pass the tests. This is a much more stringent set of conditions, so there are fewer prions. Of course, a self reproducing program is always a fixed point. You can't stop those (nanomachines that self reproduce without looking at your instructions) from being possible, just avoid making them.

Comment by donald-hobson on The Missing Piece · 2019-10-29T00:09:11.463Z · score: 0 (2 votes) · LW · GW

There is info in the compiler binary that isn't in the source code. Suppose the language contains the constant Pi.

The lines in the compiler that deal with this look like

if (next_token == "Pi") {
    return float_to_binary(Pi)
}

The actual value of Pi, 3.14..., is nowhere to be found in the source code; it's stored in the binary, and passed from the compiler's binary into the compiled binary. Of course, at some point the value of Pi must have been hard coded in. Perhaps it was written into the first binary, and has never been seen in source code. Or perhaps a previous version contained return float_to_binary(3.14) instead. Given just the source code, there would be no way to tell the value of Pi without getting out a maths book and relying on the programmers using normal mathematical names. The binary is a transformation of the source code, but that doesn't stop the compiler adding info. A compiled binary contains info about which processor architecture it runs on; source code doesn't, and so can be compiled onto different architectures. A compiler adds info, even to its own source code.

Comment by donald-hobson on The Missing Piece · 2019-10-27T20:22:59.854Z · score: 7 (5 votes) · LW · GW

The case about Switzerland is different, so I won't talk about it. In the other cases, what is going on is about fixed points. Call the text version of the latest compiler S, and the compiled version C. These have the relation that C(S) = C: the compiled code, when given the non-compiled code as input, returns itself. However, there are many different pieces of machine code X with the property that X(S) = X. Some will ignore their input entirely and quine themselves, some will be nonsensical data shuffling that produces gibberish on any other input. A few might be compilers that detect when they are compiling themselves and insert a malicious package: https://en.wikipedia.org/wiki/Backdoor_(computing)#Compiler_backdoors . It's possible that some feature of the programming language is defined in a way that uses itself, in such a way that there are stable modifications to the language. Suppose the only time an else block is used is in defining what to do when compiling an else block. Then a broken compiler that never ran any code in else blocks might compile into itself.
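As a toy illustration of the "quine themselves" fixed point (a standard trick, not something from this comment): a two-line Python program whose output is exactly its own source.

```python
s = 's = %r\nprint(s %% s)'
print(s % s)
```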

The fixed points of self compiling compilers are sufficiently rare, and most of them will be sufficiently stupid, that it should be possible to deduce which fixed point you want given only weak assumptions of sanity. I would expect a team of smart programmers to be able to figure out the language P, given only a P compiler written in P (assuming P is a sensible programming language they haven't seen before, like Python would be to a parallel universe where the biggest difference was that Python didn't exist). For instance, they would have a pretty good guess at what if statements, multiplication, etc. did at first glance. It would then be a case of using that to figure out the details of object inheritance.


The biological case is basically the same. DNA is source code; proteins and other cellular machinery form the binaries. The DNA contains instructions that tell proteins how to duplicate themselves. Biologists can probably find the right fixed point by putting protein constructors from several animal sources, along with amino acids and plenty of DNA, into a test tube. If the right pieces get close enough in the right way that the first protein machine forms, it will duplicate exponentially. (Maybe not, if the gap in the meaning of DNA is significant.)

Comment by donald-hobson on Fetch The Coffee! · 2019-10-27T17:58:25.805Z · score: 1 (1 votes) · LW · GW

Suppose we got hypercompute, or found some kind of approximation algorithm (like logical induction, but faster).

We stuck in a manual description of what fetching coffee was, or at least attempted to do so. The AI sends a gazillion tonnes of caffeinated ice to the point in space where the earth was when it was first turned on. This AI system failed most of the bullet-pointed checks: it had the wrong idea about what coffee was, how much to get, and whether it could be frozen, etc. It also has the "can't get coffee if you're dead" issue, and has probably killed off humanity in making its caffeinated iceball. This is the kind of behavior that you get when you combine an extremely powerful learning algorithm with a handwritten, approximate kludge of a goal function.

Another setup with different problems: suppose you train a coffee-fetching agent by giving it a robot body to run around and get coffee in. You train it on human evaluations of how well it did. The agent is successfully optimized to get coffee in a drinkable form, to get the right amount given the number of people present, etc. Its training contained plenty of cases of spilling coffee, and it was penalized for that, making a mesa-optimizer that intrinsically dislikes coffee being spilled.

However, its training didn't contain any cases where it could kill a human to get the coffee delivered faster, so it has no desire not to kill humans. If this agent were to greatly increase its real world capabilities, it could be very dangerous. It might tile the universe with endless robots moving coffee around endless living rooms.


Comment by donald-hobson on Mediums Overpower Messages · 2019-10-21T14:38:14.849Z · score: 3 (3 votes) · LW · GW

Firstly, some topics are just easier if they can be presented in the right format. Geometry will be easier in a format that allows diagrams, compared to an audiobook. Secondly, formats, like websites, are often Schelling points: not many serious scientists publish their work as a series of gifs on Twitter, and most scientists know this, so they don't look for a series of gifs on Twitter. Then there are affordances: a video of a maths lecture on YouTube (there are quite a few uni lectures that have been filmed and put on YouTube) might be informative, but have links to lolcats all down the side. If the medium distracts you, you will learn less.

There is also a sense of making use of the medium. Take the medium of video games. In principle a video game can display an arbitrary pattern of pixels on a screen. However, suppose that the pattern of pixels most suited to learning some topic looks like pages of text. There is no point making a video game just to reimplement a document reader in it. So all the people trying to make serious learning resources use text documents. Any video game that is made is full of interactive widgets that aren't that useful for learning, and is usually targeted at those with too little attention span to read much text.

A lot of the time, I would recommend going for any format in which the information you want is explained by someone who knows the subject and is good at teaching. If the subject is obscure, go for anywhere that you can find the info at all. If distraction is a big problem for you, download the YouTube videos that you intend to watch, unplug your router, and then watch them.


Comment by donald-hobson on Attainable Utility Theory: Why Things Matter · 2019-09-29T11:14:25.150Z · score: 1 (1 votes) · LW · GW

An alien planet contains joy and suffering in a ratio that makes them exactly cancel out according to your morality. You are exactly indifferent about the alien planet blowing up. The alien planet can't be changed by your actions, so you don't need to cancel plans to go there and reduce the suffering when you find out that the planet blew up. Say that they existed long ago. In general, we are setting up the situation so that the planet blowing up doesn't change your expected utility, or the best action for you to take. We set this up by a pile of contrivances. This still feels impactful.

Comment by donald-hobson on Attainable Utility Theory: Why Things Matter · 2019-09-28T09:29:06.673Z · score: 1 (1 votes) · LW · GW

Imagine a planet with aliens living on it. Some of those aliens are having what we would consider morally valuable experiences. Some are suffering a lot. Suppose we now find that their planet has been vaporized. By tuning the relative amounts of joy and suffering, we can make it so that the vaporization is exactly neutral under our morality. This feels like a big deal, even if the aliens were in an alternate reality that we could watch but not affect.

Our intuitive feeling of impact is a proxy for how much something affects our values and our ability to achieve them. You can set up contrived situations where an event doesn't actually affect our ability to achieve our values, but still triggers the proxy.

Would the technical definition that you are looking for be value of information? Feeling something to be impactful means that a bunch of mental heuristics think it has a large value of info.

Comment by donald-hobson on False Dilemmas w/ exercises · 2019-09-18T13:59:52.744Z · score: 6 (2 votes) · LW · GW

"Either my curtains are red, or they are blue." Would be a false dilemma that doesn't fit any category. You can make a false dilemma out of any pair of non mutually exclusive predicates, there is no need for them to refer to values or actions.

Comment by donald-hobson on Effective Altruism and Everyday Decisions · 2019-09-17T23:32:41.934Z · score: 3 (2 votes) · LW · GW

If we stop doing something that almost all first world humans are doing (say 1 billion people), then our impact will be about a billionth of the size of the problem. Given the size of impact that an effective altruist can hope to have, this tells us why such non-actions don't have super high utilities in comparison. If there were 100,000 effective altruists (probably an overestimate), all of them refraining from doing X would make the problem 0.01% better. Both how hard it is to refrain, and the impact if you manage it, depend on the problem size: all pollution vs plastic straws. Assume that this change takes only 0.01% of each effective altruist's time (about 10 seconds per day, 4 of which you are asleep for). Clearly the change has to be something as small as avoiding plastic straws, or smaller. Assume linearity in work and reward, the normal assumption being diminishing returns. This makes the cost-effectiveness of refraining equivalent to that of all effective altruists working full time on solving the problem, and solving it.

Technically, you need to evaluate the marginal value of one more effective altruist. If it was vitally important that someone worked on AI, but you have far more people than you need to do that, and the rest are twiddling their thumbs, get them reusing straws. (Actually, get them looking for other cause areas; reusing straws only makes sense if you are confident that no other priority causes exist.)

Suppose Omega came to you and said that if you started a compostable straw business, there was a 0.001% chance of success, by which Omega means solving the problem without any externalities (the straws are the same price, just as easy to use, don't taste funny, etc.). Otherwise, the business will waste all your time and do nothing.

If this doesn't seem like a promising opportunity for effective altruism, don't bother with reusable straws either. In general, the break-even chance of success is 1 / (number of people using plastic straws × proportion of time wasted avoiding them).
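A quick check of the arithmetic behind that equivalence, using the round numbers from above:

```python
people_using_straws = 1_000_000_000    # roughly all first-world humans
effective_altruists = 100_000
fraction_of_time = 0.0001              # 0.01% of your time, about 10 seconds a day

# Assuming linearity: if everyone refraining solves the whole problem,
# the effective altruists alone solve this fraction of it.
print(effective_altruists / people_using_straws)       # 0.0001, i.e. 0.01% of the problem

# Break-even success probability for the Omega-style straw business.
print(1 / (people_using_straws * fraction_of_time))    # ~1e-05, i.e. 0.001%
```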

Comment by donald-hobson on Proving Too Much (w/ exercises) · 2019-09-15T21:13:58.059Z · score: 6 (2 votes) · LW · GW

Fair enough, I think that satisfies my critique.

A full consideration of proving too much requires that we have uncertainty both over which arguments are valid and over the real world. The uncertainty about which arguments are valid, along with our inability to consider all possible arguments, makes this type of reasoning work. If you see a particular type of argument in favor of conclusion X, and you disagree with conclusion X, then that gives you evidence against that type of argument.

This is used in moral arguments too. Consider the argument that touching someone really gently isn't wrong. And if it isn't wrong to touch someone with force F, then it isn't wrong to touch them with force F+0.001 Newtons. Therefore, by induction, it isn't wrong to punch people as hard as you like.

Now consider the argument that 1 grain of sand isn't a heap. If you put a grain of sand down somewhere that there isn't already a heap of sand, you don't get a heap. Therefore by induction, no amount of sand is a heap.

If you were unsure about the morality of punching people, but knew that heaps of sand existed, then seeing the first argument would make you update towards "punching people is OK". When you then see the second argument, you update to "inductive arguments don't work in the real world" and reverse the previous update about punching people.

Seeing an argument for a conclusion that you don't believe can make you reduce your credence on other statements supported by similar arguments.

Comment by donald-hobson on Fictional Evidence vs. Fictional Insight · 2019-09-15T12:44:32.212Z · score: 1 (1 votes) · LW · GW

I disagree with insight 5. I think that even if you uploaded the world's best computer security experts, gave them 1000 years to redesign every piece of software and hardware, and then threw all existing computers in the trash and rolled out their design, that wouldn't be enough. Even with lots of paranoia, a system designed with the goal of making things as hard for an ASI as possible, and not trading off any security for usability, compatibility or performance (while still making a system significantly more useful than having no computers), wouldn't stop an ASI.

If you took an advanced future computer containing an ASI back in time to the 1940s, before there were any other computers at all, it would still be able to take over the world. There are enough people that can be persuaded, and enough inventions that any fool can put together out of household materials. At worst, it would find its way to a top secret government bunker, where it would spend its time breaking Enigma and designing superweapons, until it could build enough robots and compute. The government scientists just know that the last 10 times they followed its instructions, they ended up with brilliant weapons, and the AI has fed them some story about it being sent back in time to help them win the war.

Hacking through the internet might be the path of least resistance for an ASI, but other routes to power exist.

Comment by donald-hobson on Proving Too Much (w/ exercises) · 2019-09-15T12:18:25.812Z · score: 3 (3 votes) · LW · GW

If it all comes down to pre-priors, then everyone says "you had the misfortune of being born wrong; I'm lucky enough to be born right". If you were transported to an alternate reality where half the population thought 2+2=4, and half thought 2+2=7, would you become uncertain, or would you just think that the 2+2=7 population were wrong?

The argument that "believing in Cthulhu because that was how you were raised" proves too much, itself proves too much.

Regarding example 4: believing something because a really smart person believes it is not a bad heuristic, as long as you aren't cherry-picking the really smart person. If you have data about many smart people, taking the average is an even better heuristic, as is focusing on the smart people who are experts in some relevant field. The usefulness of this heuristic also depends on your definition of 'smart'. There are a few people with a high IQ, a powerful brain capable of thinking and remembering well, but who have very poor epistemology, and are creationists or Scientologists. Many definitions of 'smart' would rule out these people, requiring some rationality skills of some sort. This makes the smart-person heuristic even better.

Comment by donald-hobson on Is my result wrong? Maths vs intuition vs evolution in learning human preferences · 2019-09-11T22:21:33.264Z · score: 3 (2 votes) · LW · GW

The problem with the maths is that it does not correlate 'values' with any real world observable. You give all objects a property, and you say that that property is distributed by simplicity priors. You have not yet specified how these 'values' relate to any real world phenomenon in any way. Under this model, you could never see any evidence that humans don't 'value' maximizing paperclips.

To solve this, we need to understand what values are. The values of a human are much like the filenames on a hard disk. If you run a quantum field theory simulation, you don't have to think about either; you can make your predictions directly. If you want to make approximate predictions about how a human will behave, you can think in terms of values and get somewhat useful predictions. If you want to predict approximately how a computer system will behave, instead of simulating every transistor, you can think in terms of folders and files.

I can substitute words in the 'proof' that humans don't have values, and get a proof that computers don't have files. It works the same way: you turn your uncertainty about the relation between the exact and the approximate into a confidence that the two are uncorrelated. Making a somewhat naive and not formally specified assumption along the lines of "the real action taken optimizes human values better than most possible actions" will get you a meaningful but not perfect definition of 'values'. You still need to say exactly what a "possible action" is.

Making a somewhat naive and not formally specified assumption along the lines of "the files are what you see when you click on the file viewer" will get you a meaningful but not perfect definition of 'files'. You still need to say exactly what a "click" is, and how you translate a pattern of photons into a 'file'.

We see that if you were running a quantum simulation of the universe, then getting values out of a virtual human is the same type of problem as getting files off a virtual computer.

Comment by donald-hobson on Hackable Rewards as a Safety Valve? · 2019-09-10T23:06:22.869Z · score: 3 (2 votes) · LW · GW

Intuition dump: safety through this mechanism seems to me like aiming a rocket at Mars and accidentally hitting the moon. There might well be a region where P(doom|superintelligence) is lower, but lower in the sense that 90% is lower than 99.9%. Suppose we have the clearest, most stereotypical case of wireheading as I understand it: a mesa-optimizer with a detailed model of its own workings and the terminal goal of maximizing the flow of current in some wire. (During training, current in the reward signal wire reliably correlated with reward.)

The plan that maximizes this flow long term is to take over the universe and store all the energy, to gradually turn into electricity. If the agent has access to its own internals before it really understands that it is an optimizer, or before it is thinking about the long term future of the universe, it might manage to brick itself. If the agent is sufficiently myopic, is not considering time travel or acausal trade, and has the choice of running high voltage current through itself now or slowly taking over the world, it might choose the former.

Note that both of these look like hitting a fairly small target in design and lab environment space. The mesa-optimizer might have some other terminal goal. Suppose the prototype AI has the opportunity to write arbitrary code to an external computer, and an understanding of AI design, before it has self-modification access. The AI creates a subagent that cares about the amount of current in a wire in the first AI; the subagent can optimize this without destroying itself. Even if the first agent then bricks itself, we have an AI that will dig the fried circuit boards out of the trashcan, and throw all the cosmic commons into protecting and powering them.

In conclusion, this is not a safe part of agentspace, just a part that's slightly less guaranteed to kill you. I would say it is of little to no strategic importance, especially if you think that all humans making AI will be reasonably on the same page regarding safety; scenarios where AI alignment is nearly solved, and the first people to reach ASI barely know the field exists, are unlikely. If the first ASI self destructs for reasons like this, we have all the pieces for making superintelligence, and people with no sense of safety are trying to make one. I would expect another attempt a few weeks later to doom us. (Unless the first AI bricked itself in a sufficiently spectacular manner, like hacking into nukes to create a giant EMP in its circuits. That might get enough people seeing danger to have everyone stop.)

Comment by donald-hobson on AI Alignment Writing Day Roundup #1 · 2019-09-07T22:27:09.626Z · score: 3 (2 votes) · LW · GW

The proposition graph is non standard as far as I know. The syntax tree is kind of standard, but a bit unusual. You might want to use them if I show how to use them for logical counterfactuals. (Which I haven't finished yet)

Comment by donald-hobson on AI Safety "Success Stories" · 2019-09-07T22:21:05.858Z · score: 2 (4 votes) · LW · GW

I think that our research is at a sufficiently early stage that most technical work could contribute to most success stories. We are still mostly understanding the rules of the game and building the building blocks. I would say that we work on AI safety in general until we find anything that can be used at all. (There is some current work on things like satisficers that seem less relevant to sovereigns. I am not discouraging working on areas that seem more likely to help some success stories, just saying that those areas seem rare.)

Comment by donald-hobson on How to Throw Away Information · 2019-09-06T20:00:38.683Z · score: 5 (3 votes) · LW · GW

The bound isn't always achievable if you don't know Y. Suppose X, Y ∈ {1, 2, 3} with X ≠ Y, and (X, Y) uniform over the 6 possible outcomes. You find out that X = 1 (without loss of generality). You must construct a function S such that X can be recovered from S and Y. Because as far as you know, Y could be 2 or 3, you have to be able to construct X from S and Y in either case. But then we just take the most common output of S, and that tells us X, so S has not thrown away the information about X.

(If you had some other procedure for calculating X from S and Y, then you can make a new function that does that and call it S, so an arbitrary function from X to Y + {err} is the most general form of S.)
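
As a sanity check, here is a brute-force sketch of the argument. It assumes one concrete reading of the setup (X, Y ∈ {1, 2, 3} with X ≠ Y, uniform over the 6 ordered pairs, and S a deterministic function of X with codomain Y + {err}), so treat it as illustrative rather than exact:

```python
from itertools import product

# One concrete reading of the setup: X, Y in {1, 2, 3}, X != Y, uniform over
# the 6 ordered pairs; S is a deterministic function of X with codomain
# Y + {err}. We check that every such S letting you recover X from (S, Y)
# must reveal something about X.
XS = [1, 2, 3]
OUTCOMES = [(x, y) for x in XS for y in XS if x != y]   # the 6 possible outcomes
CODOMAIN = [1, 2, 3, "err"]

def recovers_x(s_map):
    """True if X can always be worked out from (S(X), Y)."""
    for x, y in OUTCOMES:
        candidates = {x2 for x2, y2 in OUTCOMES if y2 == y and s_map[x2] == s_map[x]}
        if candidates != {x}:
            return False
    return True

def reveals_x(s_map):
    """A deterministic S carries no information about X only if it is constant."""
    return len(set(s_map.values())) > 1

for values in product(CODOMAIN, repeat=len(XS)):
    s_map = dict(zip(XS, values))
    if recovers_x(s_map):
        assert reveals_x(s_map)

print("Every S that recovers X from (S, Y) also reveals information about X.")
```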

Comment by donald-hobson on Reversible changes: consider a bucket of water · 2019-08-29T12:24:16.732Z · score: 1 (1 votes) · LW · GW

No, what I am saying is that humans judge things to be more different when the difference will have important real-world consequences in the future. Consider two cases: one where the water will be tipped into the pool later, and the other where the water will be tipped into a nuclear reactor, which will explode if the salt isn't quite right.

There need not be any difference in the bucket or water whatsoever. While the current bucket states look the same, there is a noticeable macrostate difference between the nuclear reactor exploding and not exploding, in a way that there isn't a macrostate difference between marginally different eddy currents in the pool. I was specifying a weird info-theoretic definition of significance that made this work, but just saying "the more energy is involved, the more significant" works too. Nowhere are we referring to human judgement; we are referring to hypothetical future consequences.

Actually, the rule "your action and its reversal should not make a difference worth tracking in its world model" would work OK here (assuming a sensible value of information). The rule that it shouldn't knowably affect large amounts of energy is good too. So for example it can shuffle an already well shuffled pack of cards, even if the order of those cards will have some huge effect. It can act freely without worrying about weather chaos effects, the chance of it causing a hurricane being counterbalanced by the chance of stopping one. But if it figures out how to twitch its elbow in just the right way to cause a hurricane, it can't do that. This robot won't tip the nuclear bucket, for much the same reason. It also can't make a nanobot that would grey goo the Earth, or hack into nukes to explode them. All these actions affect a large amount of energy in a predictable direction.
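
A minimal sketch of the "don't knowably move large amounts of energy in a predictable direction" rule. The toy model of futures, the action names and the threshold are all invented for illustration; the point is just that effects with no knowable direction average out, while a deliberately aimed one does not:

```python
import random

# Toy sketch of the rule above: veto an action if it *knowably* moves a large
# amount of energy in a predictable direction. Chaotic effects whose sign the
# model cannot predict (shuffling cards, weather) average out to roughly zero;
# a deliberately aimed elbow twitch does not. All numbers and action names
# here are invented for illustration.

LARGE_ENERGY_JOULES = 1e11   # arbitrary illustrative threshold
N_SAMPLES = 10_000

def sampled_energy_effect(action, rng):
    """Signed energy (joules) this action moves, in one sampled future of the toy model."""
    if action == "shuffle_cards":
        # Huge downstream consequences, but with unknowable sign: a hurricane
        # caused is as likely as a hurricane prevented.
        return rng.choice([-1.0, 1.0]) * rng.uniform(0, 1e12)
    if action == "aimed_elbow_twitch":
        # The model has worked out how to reliably cause a hurricane.
        return rng.uniform(0.9e12, 1.1e12)
    return 0.0

def action_allowed(action, seed=0):
    rng = random.Random(seed)
    mean = sum(sampled_energy_effect(action, rng) for _ in range(N_SAMPLES)) / N_SAMPLES
    return abs(mean) < LARGE_ENERGY_JOULES

print(action_allowed("shuffle_cards"))        # True: no knowable direction
print(action_allowed("aimed_elbow_twitch"))   # False: predictably causes a hurricane
```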

Comment by donald-hobson on What Programming Language Characteristics Would Allow Provably Safe AI? · 2019-08-28T23:28:03.418Z · score: 0 (3 votes) · LW · GW

Basically all formal proofs assume that the hardware is perfect. Rowhammer shows that it isn't.

Comment by donald-hobson on Reversible changes: consider a bucket of water · 2019-08-28T21:00:10.653Z · score: 3 (2 votes) · LW · GW

In the case of the industrial process, you could consider the action less reversible because, while the difference in the water is small, the difference in what happens after that is larger (i.e. the industrial part working or failing). This means that at some point within the knock-on effects of tipping over the carefully salt-balanced bucket, there needs to be an effect that counts as "significant". However, there must not be an effect that counts as significant in the case where it's a normal swimming pool, and someone will throw the bucket into the pool soon anyway. Let's suppose water with a slightly different salt content will make a nuclear reactor blow up. (And no humans will notice the robot tipping out and refilling the bucket, so the counterfactual on the robot's behavior actually contains an explosion.)

Suppose you shake a box of sand. With almost no information, the best you can do to describe the situation is to state the mass of sand, the shaking speed and a few other average quantities. With a moderate amount of info, you can track the position and speed of each sand grain; with lots of info, you can track each atom.

There is a sense in which the average mass and velocity of the sand, or even the position of every grain, is a better measure than the md5 hash of the position of atom 12345. It confines the probability distribution for the near future to a small, non-convoluted section of nearby configuration space.

Suppose we have a perfectly Newtonian solar system, containing a few massive objects and many small ones.

We start our description at time 0. If we say how much energy is in the system, then this defines a non-convoluted subset of configuration space, and the subset stays just as non-convoluted under time evolution. Thus total energy is a perfect descriptive measure. If we state the position and velocity of the few massive objects, and the averages for any large dust clouds, then we can approximately track our info forward in time, for a while. Liouville's theorem says that configuration space volume is conserved, and ours is. However, our configuration space volume is slowly growing more messy and convoluted. This makes the descriptive measure good but not perfect. If we have several almost disconnected systems, the energy in each one would also be a good descriptive measure.

If we store the velocities of a bunch of random dust specks, we have much less predictive capability. The subset of configuration space soon becomes warped and twisted until its convex hull, or epsilon-ball expansion, covers most of the space. This makes velocities of random dust specks a worse descriptive measure. Suppose we take the md5 hash of every object's position, rounded to the nearest nanometer in some coordinate system, and concatenate them together. This forms a very bad descriptive measure. After only a second of time evolution, this becomes a subset of configuration space that can only be defined in terms of what it was a second ago, or by a giant arbitrary list.
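
A toy numerical illustration of the difference, using a box of bouncing particles as a stand-in for the solar system (all the specifics here are invented): the total energy at time 0 pins down the total energy at any later time, while the md5-of-rounded-positions measure is scrambled by a nanometer-sized difference and so tells you nothing useful about the future.

```python
import hashlib
import random

# Toy stand-in for the solar system: free particles bouncing elastically in a
# 1D box. Everything here (particle count, box size, the "nearest nanometer"
# rounding) is invented, just to contrast a good descriptive measure (total
# energy: conserved, so perfectly predictable) with a bad one (md5 of rounded
# positions: scrambled by an unobservably small difference).

random.seed(0)
N, BOX, DT, STEPS = 20, 1.0, 1e-3, 1000

def simulate(positions, velocities):
    xs, vs = positions[:], velocities[:]
    for _ in range(STEPS):
        for i in range(N):
            xs[i] += vs[i] * DT
            if xs[i] < 0:                      # elastic reflection off the walls
                xs[i], vs[i] = -xs[i], -vs[i]
            elif xs[i] > BOX:
                xs[i], vs[i] = 2 * BOX - xs[i], -vs[i]
    return xs, vs

def total_energy(vs):
    return sum(v * v for v in vs) / 2          # unit masses, purely kinetic

def md5_measure(xs):
    rounded = ",".join(str(round(x * 1e9)) for x in xs)   # nearest "nanometer"
    return hashlib.md5(rounded.encode()).hexdigest()

xs0 = [random.random() for _ in range(N)]
vs0 = [random.uniform(-1.0, 1.0) for _ in range(N)]
xs0_nudged = [xs0[0] + 1e-9] + xs0[1:]         # one particle moved by a nanometer

xs1, vs1 = simulate(xs0, vs0)
xs2, _ = simulate(xs0_nudged, vs0)

print("total energy before and after:", total_energy(vs0), total_energy(vs1))  # identical
print("md5 measure, original run:    ", md5_measure(xs1))
print("md5 measure, nudged run:      ", md5_measure(xs2))                      # unrelated
```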

Suppose the robot has one hour to do its action, and then an hour to reverse it. We measure how well the robot reversed its original action by looking at all good descriptive measures, and seeing the difference in these descriptive measures from what they would have been had the robot done nothing.

We can then call an action reversible if there exists an action that would reverse it.

Note that the crystal structure of a rock now tells us its crystal structure next year, so it can be part of a quite good measure. However, the phase of the moon will tell you everything from the energy production of tidal power stations to the migration patterns of moths. If you want to make good (non-convoluted) predictions about these things, you can't leave it out. Thus almost all good descriptive measures will contain this important fact.

A reversible action is any action taken in the first hour such that there exists some action the robot could take in the second hour that approximately reverses it. (The robot need not actually take the reverse action; maybe a human could press a reverse button.)
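
A minimal sketch of this test, with invented state variables, actions and descriptive measures; picking the real set of good descriptive measures is the hard part that this code does not touch:

```python
# Minimal toy version of this test: an action is reversible if some action
# available in the second hour brings every good descriptive measure back to
# roughly what it would have been had the robot done nothing. The state
# variables, actions and measures are invented stand-ins.

TOLERANCE = 1e-3

def do_nothing(state):
    return dict(state)

def tip_pool_bucket(state):
    return {**state, "pool_bucket_full": 0.0}

def refill_pool_bucket(state):
    return {**state, "pool_bucket_full": 1.0}

def tip_reactor_bucket(state):
    return {**state, "reactor_bucket_full": 0.0, "reactor_intact": 0.0}

def refill_reactor_bucket(state):
    return {**state, "reactor_bucket_full": 1.0}   # the reactor stays blown up

GOOD_MEASURES = [
    lambda s: s["pool_bucket_full"],
    lambda s: s["reactor_bucket_full"],
    lambda s: s["reactor_intact"],                 # "functionality of nuclear power stations"
]

def is_reversible(action, reversal_options, start_state):
    baseline = [m(do_nothing(start_state)) for m in GOOD_MEASURES]
    for reversal in reversal_options:
        outcome = [m(reversal(action(start_state))) for m in GOOD_MEASURES]
        if all(abs(a - b) <= TOLERANCE for a, b in zip(outcome, baseline)):
            return True                            # some available action undoes it
    return False

start = {"pool_bucket_full": 1.0, "reactor_bucket_full": 1.0, "reactor_intact": 1.0}
options = [do_nothing, refill_pool_bucket, refill_reactor_bucket]

print(is_reversible(tip_pool_bucket, options, start))     # True: refilling undoes it
print(is_reversible(tip_reactor_bucket, options, start))  # False: the explosion is permanent
```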

The functionality of nuclear power stations, and the level of radiation in the atmosphere, are also contained in many good descriptive measures. Hence the robot should tip the bucket if doing so won't blow up a reactor.

(This is a rough sketch of the algorithm with missing details, but it does seem to have broken the problem down into non-value-laden parts. I would be unsurprised to find out that there is something in the space of techniques pointed to that works, but also unsurprised to find that none do.)


Comment by donald-hobson on Torture and Dust Specks and Joy--Oh my! or: Non-Archimedean Utility Functions as Pseudograded Vector Spaces · 2019-08-26T00:45:13.814Z · score: 1 (1 votes) · LW · GW

What if I make each time period in the "..." one nanosecond shorter than the previous one?

You must believe that there is some length of time t (greater than most of a day) such that everyone in the world being tortured for t minus one nanosecond is better than one person being tortured for t.

Suppose there was a strong clustering effect in human psychology, such that less than a week of torture left people's minds in one state, and more than a week left them broken. I would still expect the possibility of some intermediate cases on the borderlines. For something as messy as human psychology, I would not expect a perfectly sharp black-and-white cutoff. If we zoom in enough, we find that the space of possible quantum wavefunctions is continuous.

There is a sense in which specks and torture feel incomparable, but I don't think this is your sense of incomparability; to me it feels like moral uncertainty about which huge number of specks to pick. I would also say that "don't torture anyone" and "don't commit atrocities based on convoluted arguments" are good ethical injunctions. If you think that your own reasoning processes are not very reliable, and you think philosophical thought experiments rarely happen in real life, then implementing the general rule "if I think I should torture someone, go to the nearest psych ward" is a good idea. However, I would want a perfectly rational AI which never made mistakes to choose torture.

Comment by donald-hobson on Troll Bridge · 2019-08-24T07:03:11.356Z · score: 4 (3 votes) · LW · GW

Viewed from the outside, in the logical counterfactual where the agent crosses, PA can prove its own consistency, and so is inconsistent. There is a model of PA in which "PA proves False". Having counterfactualed away all the other models, these are the ones left. Logical counterfactualing on any statement that can't be proved or disproved by a theory should produce the same result as adding it as an axiom. I.e. logical counterfactualing ZF on choice should produce ZFC.
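
In symbols (my ad-hoc notation, nothing standard): writing $\mathrm{CF}(T, \varphi)$ for the theory you reason with in the logical counterfactual on $\varphi$,

$$T \nvdash \varphi \ \text{ and } \ T \nvdash \neg\varphi \;\implies\; \mathrm{CF}(T, \varphi) = T \cup \{\varphi\},$$

so in particular $\mathrm{CF}(\mathrm{ZF}, \mathrm{AC}) = \mathrm{ZFC}$.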

The only unreasonableness here comes from the agent's worst-case optimizing behaviour. This agent is excessively cautious. A logical induction agent, with PA as its deductive process, will assign some probability P strictly between 0 and 1 to "PA is consistent". Depending on which version of logical induction you run, and how much you want to cross the bridge, crossing might be worth it. (The troll is still blowing up the bridge iff PA proves False.)
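
To make the trade-off concrete with made-up payoffs: say crossing safely is worth $u_+ > 0$, getting blown up is worth $u_- < 0$, and not crossing is worth 0. Since the bridge only blows up if PA is inconsistent, crossing has higher expected utility exactly when

$$P\, u_+ + (1-P)\, u_- > 0 \iff P > \frac{-u_-}{u_+ - u_-},$$

so a large enough $u_+$ (or a mild enough $u_-$) makes crossing worth it for any $P$ bounded away from 0.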

A logical counterfactual where you don't cross the bridge is basically a counterfactual world where your design of logical induction assigns a lower probability to "PA is consistent". In this world it doesn't cross and gets zero utility.

The alternative is the logical factual, where it expects positive utility.

So if we make the logical induction agent like crossing enough, and not mind getting blown up much, it crosses the bridge. Let's reverse this: suppose an agent really doesn't want to get blown up.

In the counterfactual world where it crosses, logical induction assigns a higher probability to "PA is consistent". The expected utility procedure has to use its real probability distribution, not ask the counterfactual agent for its expected utility.

I am not sure what happens after this; I think you still need to think about what you do in impossible worlds. I'm still working it out.

Comment by donald-hobson on Logical Optimizers · 2019-08-24T04:42:04.311Z · score: 2 (2 votes) · LW · GW

If the prior is full of malign agents, then you are selecting your new logical optimizer based on its ability to correctly answer arbitrary questions (in a certain format) about malign agents. This doesn't seem to be that problematic. If the set of programs being logically optimized over is malign, then you have trouble.

Comment by donald-hobson on Torture and Dust Specks and Joy--Oh my! or: Non-Archimedean Utility Functions as Pseudograded Vector Spaces · 2019-08-24T04:29:00.694Z · score: 3 (2 votes) · LW · GW

The idea is that we can take a finite list of items like this:

Torture for 50 years

Torture for 40 years

...

Torture for 1 day

...

Broken arm

Broken toe

...

Papercut

Sneeze

Dust Speck

Presented with such a list, you must insist that two items on this list are incomparable. In fact you must claim that some item is incomparably worse than the next item. I don't think that any number of broken toes is better than a broken arm; a million broken toes is clearly worse. Follow this chain of reasoning for each pair of items on the list. Claiming incomparability is a claim that no matter how much I try to subdivide my list, one item will still be infinitely worse than the next.
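
One way to spell this step out (my formalization, assuming harms aggregate additively and the ordering is transitive): write $x \gg y$ for "$x$ is incomparably worse than $y$", i.e. no finite number of copies of $y$ adds up to $x$. If no adjacent pair on the list satisfied $x_i \gg x_{i+1}$, then for each $i$ there would be some finite $k_i$ with $k_i \cdot x_{i+1} \succeq x_i$, and chaining these gives

$$\Big(\prod_i k_i\Big) \cdot x_n \succeq x_0,$$

i.e. some huge but finite number of dust specks would outweigh the torture after all, contradicting $x_0 \gg x_n$.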

The idea of bouncing back is also not useful. Firstly, it isn't a sharp boundary: you can mostly recover but still be somewhat scarred. Secondly, you can replace an injury with something that takes twice as long to bounce back from, and the two still seem comparable. Something that takes most of a lifetime to bounce back from is comparable to something that you don't bounce back from. This breaks if you assume immortality, or that bouncing back 5 seconds before you drop dead is of morally overwhelming significance, such that doing so is incomparable to not doing so.