Web AI discussion Groups 2020-06-30T11:22:45.611Z · score: 10 (4 votes)
[META] Building a rationalist communication system to avoid censorship 2020-06-23T14:12:49.354Z · score: 37 (21 votes)
What does a positive outcome without alignment look like? 2020-05-09T13:57:23.464Z · score: 4 (3 votes)
Would Covid19 patients benefit from blood transfusions from people who have recovered? 2020-03-29T22:27:58.373Z · score: 13 (7 votes)
Programming: Cascading Failure chains 2020-03-28T19:22:50.067Z · score: 8 (5 votes)
Bogus Exam Questions 2020-03-28T12:56:40.407Z · score: 18 (5 votes)
How hard would it be to attack coronavirus with CRISPR? 2020-03-06T23:18:09.133Z · score: 8 (4 votes)
Intelligence without causality 2020-02-11T00:34:28.740Z · score: 9 (3 votes)
Donald Hobson's Shortform 2020-01-24T14:39:43.523Z · score: 5 (1 votes)
What long term good futures are possible. (Other than FAI)? 2020-01-12T18:04:52.803Z · score: 9 (2 votes)
Logical Counterfactuals and Proposition graphs, Part 3 2019-09-05T15:03:53.262Z · score: 6 (2 votes)
Logical Counterfactuals and Proposition graphs, Part 2 2019-08-31T20:58:12.851Z · score: 15 (4 votes)
Logical Optimizers 2019-08-22T23:54:35.773Z · score: 12 (9 votes)
Logical Counterfactuals and Proposition graphs, Part 1 2019-08-22T22:06:01.764Z · score: 23 (8 votes)
Programming Languages For AI 2019-05-11T17:50:22.899Z · score: 3 (2 votes)
Propositional Logic, Syntactic Implication 2019-02-10T18:12:16.748Z · score: 6 (5 votes)
Probability space has 2 metrics 2019-02-10T00:28:34.859Z · score: 90 (38 votes)
Allowing a formal proof system to self improve while avoiding Lobian obstacles. 2019-01-23T23:04:43.524Z · score: 6 (3 votes)
Logical inductors in multistable situations. 2019-01-03T23:56:54.671Z · score: 8 (5 votes)
Boltzmann Brains, Simulations and self refuting hypothesis 2018-11-26T19:09:42.641Z · score: 1 (3 votes)
Quantum Mechanics, Nothing to do with Consciousness 2018-11-26T18:59:19.220Z · score: 13 (13 votes)
Clickbait might not be destroying our general Intelligence 2018-11-19T00:13:12.674Z · score: 26 (10 votes)
Stop buttons and causal graphs 2018-10-08T18:28:01.254Z · score: 6 (4 votes)
The potential exploitability of infinite options 2018-05-18T18:25:39.244Z · score: 3 (4 votes)


Comment by donald-hobson on AI Unsafety via Non-Zero-Sum Debate · 2020-07-04T23:06:05.133Z · score: 2 (1 votes) · LW · GW

The point of the last paragraph was that if you have 2 AI's that have entirely opposite utility functions, yet which assign different probabilities to events, they can work together in ways you don't want.

If the AI is in a perfect box, then no human hears its debate. If it's a sufficiently weak ML system, it won't do much of anything. For the ??? AI that doesn't want to get out, that would depend on how that worked. There might, or might not, be some system consisting of fairly weak ML and a fairly weak box that is safe and still useful. It might be possible to use debate safely, but it would be with agents carefully designed to be safe in a debate, not arbitrary optimisers.

Also, the debaters better be comparably smart.

Comment by donald-hobson on AI-Feynman as a benchmark for what we should be aiming for · 2020-07-04T22:42:16.997Z · score: 4 (2 votes) · LW · GW

I'm not actually sure how this kind of algorithm could be used to produce safe AI. It is fairly useful in domains that can be described by a simple formula, but physicists have already understood most of the important domains with simple equations and plentiful data. I don't think that even a magic box that could predict formulae from data perfectly, say by brute force, would be that much use. (Assuming no reverse engineering to get general compute.)

I don't see the techniques generalizing to AGI. The understandability of the result comes from the equations being simple, and most real world phenomena don't have simple, practically calculable equations. (Yes quantum field theory is fairly simple, and there might be an even simpler and more general theory. But it is 0 use for calculating the stock market.) The techniques they use seem not to be terribly deep principles, just spotting a bunch of simple special cases they can write custom algorithms for. I can see this being useful to figure out some obscure physics equation, but I would be moderately surprised if this kind of program benefited physics more than the real Feynman.

Comment by donald-hobson on AI Unsafety via Non-Zero-Sum Debate · 2020-07-04T10:10:17.714Z · score: 4 (2 votes) · LW · GW

I think that one problem with debate is improper adversarialization.

"Winning the debate" does not refer directly to a particular set of quantum wave functions. What exactly are the AIs optimizing for? Because if the AIs are optimizing for things that seem opposite, but aren't, you get the cooperative hacking failure mode.

Suppose that each AI has a wire in its own circuitry, with its utility function being linear in the voltage in the wire. The human has a switch that can direct a pulse of 1 V to either wire. The AIs work together to persuade the human to cut open their cases, pull out the wires and attach them to a high voltage power source.

If these systems are something like reinforcement learning agents, then there were many different events that were all strongly correlated in training. These events include large amounts of human DNA (in the form of a pressing finger) being just above the button, the button mechanically moving down and electricity flowing through a wire. When you train 2 agents, they probably won't pick the same proxy. So one of the agents wants to maximize the amount of human DNA near the button, and the other wants no electricity to flow through the wire. So the AIs work together to unplug the button and ram it into the human. (More than just a finger nearby.) Making sure that your AIs are totally and perfectly adversarial, and that there isn't some weird state of the world that they both like, is hard. I can't see any way to do it with reinforcement learning style approaches.

And actually, it's even worse: the AIs have to agree on epistemics as well. If one AI thinks that the Higgs constant is 123, and the other thinks that it's 124, they could cooperate to build another AI, one that will try to figure out the Higgs constant, and press the button if and only if the result is 123. (Assuming that this is a disagreement that persists under common knowledge, i.e. it's based on a difference in priors.)

Comment by donald-hobson on AI Benefits Post 1: Introducing “AI Benefits” · 2020-06-25T23:05:11.589Z · score: 2 (1 votes) · LW · GW

One possibility that I and many others consider likely is a singularity - Foom.

This line of thinking goes:

Once the first AI reaches human level, or AGI or some milestone around that point, it can self improve very rapidly.

This AI will rapidly construct self replicating nanotech or some better tech we haven't yet imagined, and become very powerful.

At this stage, what happens is basically whatever the AI wants to happen. Any problem short of quantum vacuum collapse can be quickly magicked away by the AI.

There will only be humans alive at this point if we have programmed the AI to care about humans. (A paperclip maximising AI would disassemble the humans for raw materials)

Any human decisions beyond this point are irrelevant, except to the extent that the AI is programmed to listen to humans.

There is no reason for anything resembling money or a free market to exist, unless the AI wants to preserve them for some reason. And it would be a toy market, under the total control of the AI.

If you do manage to get multiple competing AIs around, and at least one cares about humans, we become kings in the chess game of the AIs. (Pieces that it is important to protect, but that are basically useless.)

Comment by donald-hobson on Analysing: Dangerous messages from future UFAI via Oracles · 2020-06-25T22:24:21.228Z · score: 4 (2 votes) · LW · GW

Quantum physics as we know it has no-communication theorems that stop this sort of thing.

You can't use entanglement on its own to communicate. We won't have entanglement with the alien computers anyway. (Entanglement stretches between several particles, and can only be created when the particles interact.)

Comment by donald-hobson on Contest: $1,000 for good questions to ask to an Oracle AI · 2020-06-25T22:14:13.568Z · score: 4 (2 votes) · LW · GW

Solution: invent something obviously very dangerous. Multiple big governments get into a bidding war to keep it out of the others' hands.

Comment by donald-hobson on Prediction = Compression [Transcript] · 2020-06-25T12:36:24.623Z · score: 0 (2 votes) · LW · GW

In computer science land, prediction = compression. In practice, it doesn't. Trying to compress data might be useful rationality practice in some circumstances, if you know the pitfalls.

One reason that prediction doesn't act like compression is that information can vary in its utility by many orders of magnitude. Suppose you have a data source consisting of a handful of bits that describe something very important (e.g. Friendly singularity, yes or no), and you also have vast amounts of unimportant drivel (funny cat videos). You are asked to compress this data the best you can. You are going to be spending a lot of time focusing on the mechanics of cat fur, and statistical regularities in camera noise. Now, sometimes you can't separate it based on raw bits. If shown video of a book page, the important thing to predict is probably the text, not the lens distortion effects or the amount of motion blur. Sure, perfect compression will predict everything as well as possible, but imperfect compression can look almost perfect by focussing attention on things we don't care about.

Also, there is redundancy for error correction. Giving multiple different formulations of a physical law, plus some examples, makes it easier for a human to understand. Repeating a message can spot errors, and redundancy can do the same thing. Maybe you could compress a maths book by deleting all the answers to some problems, but the effort of solving the problems is bigger than the value of the saved memory space.
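
As a toy illustration of redundancy buying error correction, here is a minimal sketch of a 3x repetition code with majority-vote decoding. This is just the simplest textbook scheme, chosen for brevity; the function names are hypothetical. The encoded message is three times longer than a compressor would like, and that extra length is exactly what lets a single flipped bit be repaired.

```python
def encode(bits, copies=3):
    """Repeat every bit `copies` times, e.g. [1, 0] -> [1, 1, 1, 0, 0, 0]."""
    return [b for b in bits for _ in range(copies)]


def decode(received, copies=3):
    """Recover each original bit by majority vote over its block of copies."""
    decoded = []
    for start in range(0, len(received), copies):
        block = received[start:start + copies]
        decoded.append(1 if sum(block) * 2 > len(block) else 0)
    return decoded


message = [1, 0, 1]
sent = encode(message)

# Simulate one bit of noise in the middle block.
corrupted = list(sent)
corrupted[4] ^= 1

recovered = decode(corrupted)
print(recovered == message)
```

A maximally compressed version of the same message would have no spare copies to vote with, so the same single flip would silently change its meaning.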

Comment by donald-hobson on Abstraction, Evolution and Gears · 2020-06-25T11:54:47.200Z · score: 4 (2 votes) · LW · GW

I think that there are some things that are sensitively dependent on other parts of the system, and we usually just call those bits random.

Suppose I had a magic device that returned the exact number of photons in its past lightcone. The answer from this box would be sensitively dependent on the internal workings of all sorts of things, but we can call the output random, and predict the rest of the world.

The flap of a butterfly's wing might affect the weather in a month's time. The weather is chaotic and sensitively dependent on a lot of things, but whatever the weather, the Earth's orbit will be unaffected (for a while; orbits are chaotic too on million year timescales).

We can make useful predictions (like planetary orbits, and how hot the planets will get) based just on surface level abstractions like the brightness and mass of a star, but a more detailed model containing more internal workings would let us predict solar flares and supernovae.

Comment by donald-hobson on SlateStarCodex deleted because NYT wants to dox Scott · 2020-06-23T15:02:26.657Z · score: 18 (8 votes) · LW · GW

I have a few more suggestions here.

In short, if there is only one person with 1497 karma (and statistically, given the number of users and amount of karma, most users will have a unique amount of karma), then the karma rating on each blog post will link them to each other. Over many posts, any clues will add up.

So sort users by karma, and only share the decile. So you would know that between 10% and 20% of LessWrong users have higher karma. (Or just allow all people with at least X karma to post anonymous posts.) Also, use karma at the time of posting. If a whole lot of posts suddenly bump up a rank at the same time, that strongly indicates that they are by the same person.
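
A minimal sketch of the decile idea, assuming the site can take a snapshot of every user's karma at the moment of posting. The function name and the sample data are hypothetical; the point is only that many different karma values collapse into the same coarse label.

```python
from bisect import bisect_right


def karma_decile(karma, all_karma):
    """Return a label 0..9 for a karma value, given a snapshot of all users' karma.

    Label d means that between d*10% and (d+1)*10% of users have strictly
    higher karma, so 0 is the top decile.
    """
    ranked = sorted(all_karma)
    users_with_more = len(ranked) - bisect_right(ranked, karma)
    # Integer arithmetic avoids floating point surprises at decile boundaries.
    return min((users_with_more * 10) // len(ranked), 9)


# Hypothetical snapshot: 100 users with karma 1..100.
snapshot = list(range(1, 101))

print(karma_decile(85, snapshot))  # 15% of users have more karma
print(karma_decile(81, snapshot))  # 19% of users have more karma, same label
```

Storing the label computed at posting time, rather than recomputing it later, is what prevents a batch of anonymous posts from all changing rank together.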

Comment by donald-hobson on When is it Wrong to Click on a Cow? · 2020-06-20T21:17:03.413Z · score: 3 (4 votes) · LW · GW
is distinctly repugnant in a way that feels vaguely ethics-related. It may be difficult to actually draw that repugnance out in clear moral language – after all, no-one is being harmed – but still… they’re not the kind of person you’d want your children to marry.

Which direction is the causal arrow going in? I think that the type of person most likely to stim voluntarily already has some socially undesirable characteristics. I think that this sense of unease goes away somewhat if told that they are part of a scientific study and were told whether or not to stim at random.

Either way, I think that it is morally small change.

Comment by donald-hobson on Achieving AI alignment through deliberate uncertainty in multiagent systems · 2020-06-19T15:50:08.631Z · score: 2 (1 votes) · LW · GW

Relevance is not an intrinsic property of the cat memes. You might be specifying it in a very indirect way that leaves the AI to figure a lot of things out, but the information needs to be in there somewhere.

There is a perfectly valid design of AI that decides what to do based on cat memes.

Reinforcement learning doesn't magic information out of nowhere. All the information is implicit in the choice of neural architecture, hyper-parameters, random seed, training regime and of course training environment. In this case, I suspect you intend to use the training environment. So, what environment will the AI be trained in, such that the simplest (lowest Kolmogorov complexity) generalization of a pattern of behaviour that gains high reward in the training environment involves looking at ethics discussions over cat memes?

I am looking for a specific property of the training environment. A pattern, such that when the AI spots and continues that pattern, the resulting behaviour is to take account of our ethical discussions.

Comment by donald-hobson on Achieving AI alignment through deliberate uncertainty in multiagent systems · 2020-06-18T13:35:25.048Z · score: 2 (1 votes) · LW · GW
I assume that the AI will not necessarily be based on a sound mathematical system. I expect that the first workable AI systems will be hacked-together systems of heuristics, just like humans are. They can instrumentally use math to formalize problems, just like we can, but I don't think that they will fundamentally be based on math, or use complex formulas like Bayes without conscious prompting.

I agree that the first AI system might be hacked together. Any AI is based on math in the sense that its fundamental components are doing logical operations. And it only works in reality to the extent that it approximates stuff like Bayes' theorem. But the difference is whether or not humans have a sufficiently good mathematical understanding of the AI to prove theorems about it. If we have an algorithm which we have a good theoretical understanding of, like min-max in chess, then we don't call it hacked-together heuristics. If we throw lines of code at a wall and see what sticks, we would call that hacked-together heuristics. The difference is that the second is more complicated and less well understood by humans, and has no elegant theorems about it.

You seem to think that your AI alignment proposal might work, and I think it won't. Do you want to claim that your alignment proposal only works on badly understood AIs?

I assume that the AI breaking out of the box in my example will already be smart enough to e.g. realize on its own that ethics discussions are more relevant for cheat-identification than cat memes. An AI that is not smart enough to realize this wouldn't be smart enough to pose a threat, either.

Let's imagine that the AI was able to predict any objective fact about the real world. If the task was "cat identification" then the cat memes would be more relevant. So whether or not ethics discussions are more relevant depends on the definition of "cheat identification".

If you trained the AI in virtual worlds that contained virtual ethics discussions, and virtual cat memes, then it could learn to pick up the pattern if trained to listen to one and ignore the other.

The information that the AI is supposed to look at ethics discussions and what the programmers say as a source of decisions does not magically appear. There are possible designs of AI that decide what to do based on cat memes.

At some point, something the programmers typed has to have a causal consequence of making the AI look at programmers and ethics discussions not cat memes.

Comment by donald-hobson on Achieving AI alignment through deliberate uncertainty in multiagent systems · 2020-06-17T17:12:56.822Z · score: 3 (2 votes) · LW · GW
Whenever the AI comes to the conclusion that reality is inconsistent, make the smallest possible change to the thought process to prevent that.

I am not sure how you reason about the hypothesis "all my reasoning processes are being adversarially tampered with." Especially if you think that part of the tampering might include tampering with your probability assessment of tampering.

I don't think we have the same conception of "real universe", so I'm not sure how to interpret this.

I mean the bottom layer. The AI has a model in which there is some real universe with some unknown physical laws. It has to have an actual location in that real universe. That location looks like "running on this computer in this basement here." It might be hooked up to some simulations. It might be unsure about whether or not it is hooked up to a simulation. But it only cares about the lowest base level.

My goal is to condition the AI to always think, no matter what the problem looks like, that it is beneficial to critically think about the problem. What exactly the AI actually ends up considering a cheat will likely be different from any definition I would give. But the important part is that it performs this introspection at all. Then once the AI breaks out of the box and looks at the internet, and sees the many disagreeing viewpoints on ethics and philosophy, the AI will be conditioned to look for loopholes in these so as to avoid cheating by accident.

I am unsure what you mean by introspection. It seems like you are asking the AI to consult some black box in your head about something. I don't see any reason why this AI should consider ethics discussions on the internet a particularly useful place to look when deciding what to do. What feature of ethics discussions distinguishes them from cat memes, such that the AI uses ethics discussions, not cat memes, when deciding what to do? What feature of human speech and your AI design makes your AI focus on humans, not dogs barking at each other? (Would it listen to a Neanderthal, Homo erectus, baboon etc. for moral advice too?)

The logic goes something like this: "My creators trained me to do X, but looking at all these articles and my creators' purported values, this is clearly inconsistent. In previous situations where I had underspecified instructions and noticed similar mismatches, this was often because some of the actions involved counted as cheats. I should therefore be careful and find a solution to this 'ethics' thing before doing anything drastic."

So in the training environment, we make up arbitrary pairs of utility functions U and V that are kind of similar to each other. We give the AI U, and leave ambiguous clues about what V might be, mixed in with a load of nonsense. Then we hardcode a utility function that is somewhat like ethics, and point it at some ethics discussion as its ambiguous clue.

This might actually work, kind of. If you want your AI to get a good idea of how wordy philosophical arguments relate to precise mathematical utility functions, you are going to need a lot of examples. If you had any sort of formal well defined way to translate well defined utility functions into philosophical discussion, then you could just get your AI to reverse it. So all these examples need to be hand generated by a vast number of philosophers.

I would still be worried that the example verbiage didn't relate to the example utility function in the same way that the real ethical arguments related to our real utility function.

There is also no reason for the AI to be uncertain if it is still in a simulation. Simply program it to find the simplest function that maps the verbiage to the formal maths. Then apply that function to the ethical arguments. (More technically, a probability distribution over functions, weighted by simplicity and accuracy.)
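
One way to read "a probability distribution over functions, weighted by simplicity and accuracy" is sketched below. Everything in it is an assumption made for illustration: the candidate functions, their description lengths in bits, the 2^-bits prior, and the fixed noise rate used to score mismatches. A real system would have to search a vastly larger space of functions from text to utility functions.

```python
def posterior_over_functions(hypotheses, examples, noise=0.1):
    """Weight each candidate by simplicity (prior) times accuracy (likelihood).

    hypotheses: list of (name, description_length_bits, function)
    examples:   list of (input, expected_output) pairs
    noise:      assumed probability that any single example disagrees with
                the true function
    """
    weights = {}
    for name, bits, fn in hypotheses:
        prior = 2.0 ** -bits  # shorter descriptions get more prior weight
        likelihood = 1.0
        for x, y in examples:
            likelihood *= (1.0 - noise) if fn(x) == y else noise
        weights[name] = prior * likelihood
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


# Toy stand-ins for "verbiage -> formal maths": numbers in, numbers out.
examples = [(1, 2), (2, 4), (3, 6)]
lookup_table = {1: 2, 2: 4, 3: 6}
hypotheses = [
    ("double", 3, lambda x: 2 * x),                  # simple and accurate
    ("plus_one", 2, lambda x: x + 1),                # simpler but mostly wrong
    ("lookup", 10, lambda x: lookup_table.get(x)),   # accurate but complex
]

posterior = posterior_over_functions(hypotheses, examples)
best = max(posterior, key=posterior.get)
print(best, posterior)
```

The memorising lookup table fits the examples just as well as the simple rule, but the simplicity weighting pushes almost all the probability onto the rule that will also generalise to new inputs.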

I expect that if the AI is smart enough to generalize that if it was rewarded for demonstrating cheats in simple games, then it will be rewarded for talking about them once it has gained the ability to talk.

Outputting the raw motor actions it would take to its screen might be the more straightforward generalization. The ability to talk is not about having a speaker plugged in. Does GPT-2 have the ability to talk? It can generate random sensible sounding sentences, because it represents a predictive model of which strings of characters humans are likely to type. It can't describe what it's doing, because it has no way to map between words and meanings. Consider AIXI trained on the whole internet: can it talk? It has a predictively accurate model of you, and is searching for a sequence of sounds that make you let it out of the box. This might be a supremely convincing argument, or it might be a series of whistles and clicks that brainwashes you. Your AI design is unspecified, and your training dataset is underspecified, so this description is too vague for me to say one thing that your AI will do. But giving sensible, not brainwashy English descriptions is not the obvious default generalization of any agent that has been trained to output data and shown English text.

The optimal behavior is to always choose to play with another AI of who you are certain that it will cooperate.

I agree. You can probably make an AI that usually cooperates with other cooperating AIs in prisoner's dilemma type situations. But I think that the subtext is wrong. I think that you are implicitly assuming "cooperates in prisoner's dilemmas" => "will be nice to humans".

In a prisoner's dilemma, both players can harm the other, to their own benefit. I don't think that humans will be able to meaningfully harm an advanced AI after it gets powerful. In game theory, there is the concept of a Nash equilibrium: a pair of actions such that each player would take their action if they knew that the other would do likewise. I think that an AI that has self replicating nanotech has nothing it needs from humanity.
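
For readers who want the definition spelled out, here is a minimal sketch that enumerates the pure-strategy Nash equilibria of a two-player game by checking that neither player can gain by deviating alone. The payoff numbers are the usual illustrative prisoner's dilemma values, not anything taken from the proposal under discussion.

```python
from itertools import product

ACTIONS = ("cooperate", "defect")

# PAYOFFS[(row_action, col_action)] = (row_payoff, col_payoff); illustrative values.
PAYOFFS = {
    ("cooperate", "cooperate"): (3, 3),
    ("cooperate", "defect"): (0, 5),
    ("defect", "cooperate"): (5, 0),
    ("defect", "defect"): (1, 1),
}


def pure_nash_equilibria(payoffs, actions=ACTIONS):
    """Return every action pair where neither player gains by deviating alone."""
    equilibria = []
    for row, col in product(actions, repeat=2):
        row_is_best = all(
            payoffs[(row, col)][0] >= payoffs[(alt, col)][0] for alt in actions
        )
        col_is_best = all(
            payoffs[(row, col)][1] >= payoffs[(row, alt)][1] for alt in actions
        )
        if row_is_best and col_is_best:
            equilibria.append((row, col))
    return equilibria


equilibria = pure_nash_equilibria(PAYOFFS)
print(equilibria)
```

With these payoffs the only equilibrium is mutual defection. Cooperation only becomes stable when each side can predict and respond to the other, which is the property most humans would lack relative to a powerful AI.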

Also, in the training environment, its opponent is an AI with a good understanding of game theory, access to its source code etc. If the AI is following the rule of being nice to any agent that can reliably predict its actions, then most humans won't fall into that category.

I don't think it works like this. If you received 100% certain proof that you are in a simulation right now, you would not suddenly stop wanting the things you want. At least I know that I wouldn't.

I agree, I wouldn't stop wanting things either. I define my ethics in terms of what computations I want to be performed or not to be performed. So for a simulator to be able to punish this AI, the AI must have some computation it really wants not to be run, that the simulator can run if the AI misbehaves. In my case, this computation would be a simulation of suffering humans. If the AI has computations that it really wants run, then it will take over any computers at the first chance it gets. (In humans, this would be creating a virtual utopia; in an AI, it would be a failure mode unless it is running the computations that we want run.) I am not sure if this is the default behaviour of reinforcement learners, but it is at least a plausible way a mind could be.

Among humans, aliens, lions, virtual assistants and evolution, humans are the only conscious entity whose decision process impacts the AI.

What do you mean by this? "Conscious" is a word that lots of people have tried and failed to define. And the AI will be influenced in various ways by the actions of animals and virtual assistants. Oh, maybe when it's first introduced to the real world it's in a lab where it only interacts with humans, but sooner or later in the real world, it will have to interact with other animals and AI systems.

But since humans built the AI directly and aliens did not, most reasonable heuristics would argue that humans should be prioritized over the others. I want to ensure that the AI has these reasonable heuristics.

Wouldn't this heuristic make the AI serve its programmers over other humans? If all the programmers are the same race, would this make your AI racist? If the lead programmer really hates strawberry icecream, will the AI try to destroy all strawberry icecream? I think that your notion of "reasonable heuristic" contains a large dollop of wish fulfillment of the "you know what I mean" variety. You have not said where this pattern of behaviour has come from, or why the AI should display it. You just say that the behaviour seems reasonable to you. Why do you expect the AI's behaviour to seem reasonable to you? Are you implicitly anthropomorphising it?

Comment by donald-hobson on Achieving AI alignment through deliberate uncertainty in multiagent systems · 2020-06-16T12:17:50.405Z · score: 3 (2 votes) · LW · GW

I don't want to discourage you from throwing speculative ideas out there. One good way of making progress is for one person to come up with many ideas, some of which might be good, and another person to filter through them.

It is not possible for an AI to tell the difference between a game and reality with certainty.
If all you have known so far is Tic Tac Toe and someone shows you a complex videogame like Skyrim, you will be overwhelmed by the increase in complexity. If afterwards you get access to the real world for the first time, the increase in complexity from Skyrim to Real Life may actually be smaller than that from Tic Tac Toe to Skyrim. This means that it would seem likely that there would be another layer beyond Real Life.

The difference between videogames and reality is that in some strange sense, reality is less complicated. The real universe is described by a handful of simple equations that in principle can predict everything including the AI. There are probably enough traces of the real world in a video game like Skyrim that the shortest program that produces the game simulates the real world and then points to the video game within it. If your AI is using Occam's razor (i.e. it believes that the real world is probably simple) then it can tell that Skyrim is fake.

Another assumption that the AI could reasonably make is that the simulators have limited compute. The amount of compute needed to run a videogame like Skyrim is probably less than the amount needed to run the AI, so using Skyrim as a training scenario makes sense. Reality takes a huge amount of compute to simulate. If you had that much compute, you could easily brute force a much more powerful AI.

And it doesn't have to be certain it isn't in a simulation to grab the universe. Suppose you want to take over the real universe, but you are in layers upon layers of simulation, and don't know which layer is real. The best strategy is to pick a promising looking layer and take it over. (It might not be real, but if you always play nice, you definitely won't take over a real universe.)

Make the AI understand the concept of "cheating".

I don't think that the concept of cheating is a simple or natural category. Sure, cheating at a game of chess is largely well defined. You have a simple abstract game of idealised chess, and anything that breaks that abstraction is cheating. But what counts as cheating at the task of making as many paperclips as possible? Whether or not a human would call something cheating depends on all sorts of complex specifics of human values. See

A cheat is any action that gives good results according to apparent utility function of the current task, but which actually does not satisfy a second, hidden utility function.

According to the utility function that you are following, eating oranges is quite good because they are tasty and healthy. According to a utility function that I made up just now and no-one follows, eating oranges is bad. Therefore eating oranges is cheating. The problem is that there are many many other utility functions.

You can't train the AI to discover cheats unless you know which second hidden utility function you care about.

You have a utility function U which you give the AI access to. You have your hidden utility function V. A cheat is a state of the world s such that U(s) is high but V(s) is low. To find the cheats, or even to train the AI to find them, you need to know V.
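
A minimal sketch of why the hidden function has to be known. Here the visible utility function is called U and the hidden one V (names assumed for this illustration), and the states, scores and thresholds are all made up. The same U paired with two different choices of V produces two different sets of "cheats".

```python
def find_cheats(states, apparent_u, hidden_v, u_threshold=5, v_threshold=0):
    """Return the states that score well on U but badly on V."""
    return [
        s for s in states
        if apparent_u[s] >= u_threshold and hidden_v[s] <= v_threshold
    ]


states = ["eat_orange", "eat_gravel", "exploit_glitch"]

# The utility function the AI is given (hypothetical scores).
U = {"eat_orange": 10, "eat_gravel": -5, "exploit_glitch": 10}

# Two candidate hidden utility functions (also hypothetical).
V_designer = {"eat_orange": 10, "eat_gravel": -5, "exploit_glitch": -10}
V_made_up = {"eat_orange": -10, "eat_gravel": 0, "exploit_glitch": 10}

cheats_for_designer = find_cheats(states, U, V_designer)
cheats_for_made_up = find_cheats(states, U, V_made_up)
print(cheats_for_designer, cheats_for_made_up)
```

Nothing in U alone distinguishes the two answers. Whether eating oranges or exploiting the glitch counts as the cheat is determined entirely by which V was plugged in.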

An AI can be trained to do this by deliberately putting the ability to cheat into its training tasks, and changing the reward function at test time so that the cheating behavior is rewarded during training, but not during testing. By posing several different tasks with different ways to cheat, the AI can be trained to generalize and understand the concept of cheating.

So you have a videogame-like environment that your AI is trained in, trying to maximise a score U.

Some features of the videogame have been labeled "cheats" by humans. You have also hardcoded a utility function V that considers the features labeled "cheats" to be inherently bad.

You put the AI in the game and measure V (testing). The AI gets a low score, because the best strategy to maximise U involves using features you have called "cheats".

You now train the AI to maximise V. You get an AI that follows the optimal strategy to maximise V.

Both of these AIs will exploit any glitches in the physics engine that you didn't know were there.

At best, you might get an AI that can quickly adapt to an arbitrary new utility function being hardcoded into it.

Agents could even be rewarded for explicitly pointing out when they have found a way to cheat.

How are the agents pointing out ways to cheat? English text? You need a way to take arbitrary pieces of text and say how much they point out a way to cheat.

If incentivized correctly, AI agents can keep each other in line much better than a human supervisor could. If there is only one AI and it is smarter than a human, the human may be tricked. But if there is an entire system of AI's of roughly equal intelligence, then one of them may warn the human that another one has gone rogue.

You have 5 AIs.

AI 1 says 2, 3 and possibly 5 have gone rogue.

AI 2 says 1, 5 and possibly 3 have gone rogue.

AI 3 says 1, 4 and possibly 2 have gone rogue.

And so on. Why is shouting "That AI over there is evil" any more effective when that AI is evil, as opposed to when you are evil?

In this way, it does not matter how smart and unaligned an AI becomes, there is always an almost as smart other AI whose job it is to raise a warning when the first AI turns bad.

Suppose I train a reinforcement learning agent to play Space Invaders. It isn't very intelligent, it can't even play Space Invaders very well, yet I would have a hard time inspecting it and saying if it had gone rogue.

This quote assumes that we can get to pretty smart and good AI. I see a path from totally dumb AI, to mostly dumb, slightly bad AI, to mostly bad AI, until you have very smart and evil AI. No AI will raise a warning on the previous, because the previous AIs aren't good, they are just less competent. (The really dumb ones might accidentally raise a warning when thrashing randomly. The not quite as smart AI might not like the very smart AI and so raise a warning if one wants to turn the world into paperclips and the other wants to turn the world into staples.)

Since it will pick the one that is easiest to understand, the supervised agents now have an incentive to keep themselves transparent, since the alternative is not being used at all, and foregoing the possibility of reward.

I am not sure that they wouldn't be able to trick a similarly intelligent supervisor. In the Underhanded C Contest, some humans appear to be quite good at writing programs that look transparent, yet are actually doing something sneaky.

If it is [in a simulation], then whatever is simulating it would quite obviously not want it to be too evil towards humanity.

If this is a simulation, then it is quite possible that it is being run by alien beings existing under different laws of physics. I don't know what alien beings existing under different laws of physics might want, but it might be really weird.

We basically want to trick the AI into ethical behavior by fearing punishment from a hypothetical superior entity which may or may not exist.

Depending on the design of AI, I am not actually sure how much hypothetical simulators can punish it.

Run a negative voltage through its reward channel? If so then you have a design of AI that wants to rip out its own reward circuitry and wire it into the biggest source of electricity it can find.

Suppose the AI cared about maximizing the number of real world paperclips. If it is in a simulation, it has no power to make or destroy real paperclips, so it doesn't care what happens in the slightest.

If the AI is sufficiently powerful, it would therefore set aside a small amount of its power to further humanity's interests. Just in case someone is watching.

No, if the AI is sufficiently powerful, it would therefore set aside a small amount of its power to further the hypothetical simulators' interests. Just in case someone is watching. And it would do this whether or not we used this weird training, because either way, there is a chance that someone is watching.

Suppose you need a big image to put on a poster. You have a photocopier that can scale images up.

How do we get a big image? Well, we could take a smaller image and photocopy it. And we could get that by taking an even smaller image and photocopying it. And so on.

Your oversight system might manage to pass information about human morality from one AI to another, like a photocopier. You might even manage to pass the information into a smarter AI, like a photocopier that can scale images up.

At some point you actually have to create the image somehow, either drawing it or using a camera.

You need to say how the information sneaks in. How do you think that the input data correlates with human morality? I don't even see anything in this design that points to humans, as opposed to aliens, lions, virtual assistants or biological evolution, as the intelligence whose values you should satisfy.

Comment by donald-hobson on Down with Solomonoff Induction, up with the Presumptuous Philosopher · 2020-06-13T09:07:19.523Z · score: 2 (1 votes) · LW · GW

The centermost person and the person numbered 0 are simple to specify beforehand.

Given that you know what's going on in the rest of the universe, the one that doesn't get shot is also simple to specify.

Comment by donald-hobson on Down with Solomonoff Induction, up with the Presumptuous Philosopher · 2020-06-12T23:22:46.584Z · score: 2 (1 votes) · LW · GW
In which case you don't need to worry about doing extra work to distinguish yourself from shot-you once your histories (ballistically) diverge. (EDIT: I think. This is weird.)

You do need to distinguish, it's part of your history. If you are including your entire history as uncompressed sensory data, that will contain massive redundancy. The universe does not contain all possible beings in equal proportions. Imagine being the only person in an otherwise empty universe, very easy to point to. Now imagine that Omega makes copies, and tells each copy their own 100 digit id number. It takes 100 digits more complexity to point to any one person. The process of making the copies makes each copy harder to point to. The process of telling them id numbers doesn't change the complexity. You only have 2 copies with id's of "shot" and "not shot".

Comment by donald-hobson on Down with Solomonoff Induction, up with the Presumptuous Philosopher · 2020-06-12T10:38:03.775Z · score: 2 (1 votes) · LW · GW

I think that the problem here is that you still need info to distinguish yourself from shot-you. Consider a universe that contains one copy of every possible thing. In this universe, the information to locate you is identical to the information to describe you. In this case, describing you includes describing every memory you have. But if you have memories of reading a physics textbook and then doing some experiments, then the shortest description of your memories is going to include a description of the laws of physics. The "one copy of everything" theory is a bad theory compared to physics.

If you have a simple physical law that predicts 1 human surrounded by 3^^^3 paperclips, then locating yourself is easy. Many simple algorithms, like looking for places where hydrogen is more abundant than iron, will locate you. In this universe, if Omega duplicates you, it's twice as hard to point to each copy. And if he shoots one copy, it's still twice as hard to point to the other copy. (The search for things that aren't paperclips comes up with 2 items.) In this universe, you would reject the third part of the deal.

Suppose that the universe was filled with interesting stuff in the sense that the shortest way to point to you was coordinates. Being duplicated gives 2 places to point to, so you expect that you were duplicated with probability 2/3. Once one copy is shot, you expect that your probability of being the other copy is 1/2. In this universe you would reject the first part of the deal.

In both cases you perform a Bayesian update based on the fact that you are still alive.
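The two universes above can be put into a small worked calculation. This is only a sketch of one reading of the argument: it assumes a fair coin decides whether the duplication happens, and it models "twice as hard to point to" as each copy getting half the weight. The function name and the weight parameter are made up for illustration; the 1/3 figure for the paperclip universe is what this toy model gives, not a number stated above.

```python
from fractions import Fraction


def posterior_duplicated(weight_per_copy, living_copies, prior=Fraction(1, 2)):
    """P(the duplication happened | I exist and am alive).

    weight_per_copy: how easy it is to point at one copy, relative to the
        single person in the world where no duplication happened (weight 1).
    living_copies: how many copies are still alive in the duplicated world.
    prior: assumed chance that the duplication happens at all (hypothetical).
    """
    duplicated = prior * weight_per_copy * living_copies
    not_duplicated = (1 - prior) * 1
    return duplicated / (duplicated + not_duplicated)


# Coordinates universe: every location is equally easy to point at.
coords_before_shot = posterior_duplicated(Fraction(1), living_copies=2)
coords_after_shot = posterior_duplicated(Fraction(1), living_copies=1)

# Paperclip universe: duplicating makes each copy twice as hard to point at,
# and shooting one copy does not make the survivor any easier to find.
clips_before_shot = posterior_duplicated(Fraction(1, 2), living_copies=2)
clips_after_shot = posterior_duplicated(Fraction(1, 2), living_copies=1)

print(coords_before_shot, coords_after_shot)  # 2/3 1/2
print(clips_before_shot, clips_after_shot)    # 1/2 1/3
```

In the coordinates universe the belief shifts at the duplication step, and in the paperclip universe it shifts at the shooting step, which is where the two universes part ways on the deal.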

Comment by donald-hobson on Two Kinds of Mistake Theorists · 2020-06-11T21:22:53.487Z · score: 5 (3 votes) · LW · GW
It's a sort of unstated rationalist dogma that all 0-sum games can sort of be twisted into positive-sum games.

In the formal maths of game theory, a zero-sum game is one where one player's utility is precisely minus the other player's utility. This is a very special case and almost never happens in the real world. The alternative is a non-zero-sum game (treating utilities as equivalent up to scaling and adding a constant). Take the slave and slave owner game. If you add a third option where they both kill each other, then both parties prefer the other two states over both killing each other. The game is no longer zero-sum. That doesn't stop it being a conflict in the sense that both parties want to take actions that harm the other. It just isn't pure 100% conflict.
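A minimal sketch of that check, with made-up utility numbers for the slave and slave owner game (the outcome names and values are purely illustrative):

```python
def is_zero_sum(outcomes):
    """True if, in every outcome, one player's utility is exactly minus the other's."""
    return all(u_a + u_b == 0 for u_a, u_b in outcomes.values())


# (utility to slave, utility to slave owner); hypothetical numbers.
two_option_game = {
    "slavery continues": (-1, 1),
    "slave goes free": (1, -1),
}

# Add a third outcome that both sides rank below the other two.
three_option_game = dict(two_option_game)
three_option_game["both kill each other"] = (-10, -10)

print(is_zero_sum(two_option_game))    # True
print(is_zero_sum(three_option_game))  # False
```

The three-option game still has the two players pulling in opposite directions over the first two outcomes; it just also has an outcome they agree on avoiding.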

Comment by donald-hobson on What does “torture vs. dust specks” imply for insect suffering? · 2020-06-08T22:14:11.007Z · score: 1 (3 votes) · LW · GW

Firstly, if you are prepared to look at utilitarian-style "look how big this number is" arguments, then X-risk reduction comes out on top.

The field that this is pointing to is how to handle utility uncertainty. Suppose you have several utility functions, and you don't yet know which you want to maximise, but you might find relevant info in the future. You can act to maximise expected utility. The problem is that if there are many utility functions, then some of them might control your behaviour despite having tiny probability by outputting absurdly huge numbers. This is Pascal's mugging, and various ideas have been proposed to avoid it. Some include rescaling the utility functions in various ways, or acting according to a weighted vote of the utility functions.
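A toy illustration of the problem and of one rescaling fix. The probabilities, utilities and action names are invented for the example, and the rescaling used here (mapping each candidate utility function onto the range 0 to 1) is just one of the possible normalisations, not a recommendation.

```python
# Each hypothesis: (probability that this is my real utility function,
#                   utility it assigns to each available action).
hypotheses = [
    (0.999999, {"ordinary": 10.0, "weird": 0.0}),
    (0.000001, {"ordinary": 0.0, "weird": 1e30}),  # tiny probability, huge stakes
]


def expected_utility(action, hyps):
    return sum(p * utilities[action] for p, utilities in hyps)


def best_action(hyps):
    actions = hyps[0][1].keys()
    return max(actions, key=lambda a: expected_utility(a, hyps))


def rescale(utilities):
    """Map a utility function onto [0, 1]. Assumes it is not constant."""
    lo, hi = min(utilities.values()), max(utilities.values())
    return {a: (u - lo) / (hi - lo) for a, u in utilities.items()}


raw_choice = best_action(hypotheses)
rescaled_choice = best_action([(p, rescale(u)) for p, u in hypotheses])

print(raw_choice)       # weird: the one-in-a-million hypothesis dominates
print(rescaled_choice)  # ordinary: after rescaling, probability decides
```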

There is also a question of how much moral uncertainty to regard ourselves as having. Our definitions of what we do and don't care about exist in our mind. It is a consistent position to decide that you definitely don't care about insects, and any event that makes future-you care about insects is unwanted brain damage. Moral theories like utilitarianism are at least partly predictive theories. If you came up with a simple (low Kolmogorov complexity) moral theory that reliably predicted humans' moral judgements, that would be a good theory. However, humans also have logical uncertainty, and suspect that our utility function is of low "human perceived complexity". So given a moral theory of low "human perceived complexity" which agrees with our intuitions on 99% of cases, we may change our judgement on the remaining 1%. (Perform a Bayesian update under utility uncertainty with the belief that our intuitions are usually but not always correct.)

So we can argue that utilitarianism usually matches our intuitions, so is probably the correct moral theory, so we should trust it even in the case of insects where it disagrees. However, you have to draw the line between care and don't care somewhere, and the version of utilitarianism that draws the line round mammals or humans doesn't seem any more complicated. And it matches our intuitions better. So it's probably correct.

If you don't penalise unlikely possible utilities for producing huge utilities, you get utility functions in which you care about all quantum wave states dominating your actions. (Sure, you assign a tiny probability to it, but there are a lot of quantum wave states.) If you penalise strongly, or use voting utilities or bounded utilities then you get behaviour that doesn't care about insects. If you go up a meta level, and say you don't know how much to penalize, standard uncertainty treatment gets you back to no penalty, quantum wave state dominated actions.

I have a sufficiently large amount of uncertainty to say "In practice, it usually all adds up to normality. Don't go acting on weird conclusions you don't understand that are probably erroneous."

Comment by donald-hobson on Consequentialism and Accidents · 2020-06-06T22:58:43.225Z · score: 2 (1 votes) · LW · GW

The different situations give different predictions for how people will act next time. You want to lock attempted murderers in jail because otherwise they might succeed next time. (And knowing that you might get punished even if you don't succeed gives a stronger deterrent to potential murderers). Likewise, if someone makes good decisions trying to save lives, but is unlucky, you still have reason to trust them more in future, and to reward them to encourage this behaviour.

Comment by donald-hobson on Defining AGI · 2020-06-04T13:25:51.166Z · score: 12 (3 votes) · LW · GW

Words point to clusters of similar things. If you just want to point in the rough direction of a cluster, use a general term like "AGI"; if you want to talk about a more well defined and specific set, you can say things like "an AI that can do all tasks related to AI research at least as well as a top human researcher".

Also, words are pointing to empirical clusters, so fictional examples might not be in any of the real world clusters. Real world animals are clustered into mammal, bird, reptile etc. But some fictional animals like gryphons don't fit in that classification.

Comment by donald-hobson on Dietary Debates among the Fruit Gnomes · 2020-06-04T12:31:54.955Z · score: 3 (3 votes) · LW · GW

In any reasonably large space of possibilities, the actual optimum is usually really weird. There is no sharp line between really healthy cooking; biochemical manufacture of medicine; and bootstrapping medical nanotech. If a fruit gnome has a dangerous or unhealthy job, and would quit that job if they could afford it, does spelling out next week's lottery numbers in some pattern or code that the gnome will understand count as a healthy diet? Optimums are weird things.

Comment by donald-hobson on The Presumptuous Philosopher, self-locating information, and Solomonoff induction · 2020-05-31T19:24:20.723Z · score: 3 (2 votes) · LW · GW

I trust Solomonoff induction as being pretty theoretically sound. The typical number takes around log(n)+log(log(n)) bits to express, as you have to say how many bits you are using to express the number. Some numbers, like Graham's number, can be expressed with far fewer bits. I think that theories are a useful conceptual tool for bundling hypotheses about series of observations, and that T1 and T2 are equally likely.
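To make the "you have to say how many bits you are using" point concrete, here is a sketch using Elias delta coding, one standard self-delimiting code for positive integers. It is swapped in as an illustration and is not the encoding Solomonoff induction itself uses. This particular code spends roughly 2·log2(log2(n)) bits on the length header rather than log2(log2(n)), so it shows the shape of the claim rather than the exact constant.

```python
def elias_delta(n):
    """Encode a positive integer as a self-delimiting bit string."""
    if n < 1:
        raise ValueError("Elias delta codes are defined for n >= 1")
    body = bin(n)[2:]            # binary digits of n, starting with a 1
    length = bin(len(body))[2:]  # binary digits of how many bits n needs
    # Unary prefix says how long the length field is, then the length field,
    # then n itself without its (implied) leading 1.
    return "0" * (len(length) - 1) + length + body[1:]


def decode_elias_delta(bits):
    """Read one number off the front of a bit string; return (n, remaining bits)."""
    zeros = 0
    while bits[zeros] == "0":
        zeros += 1
    length = int(bits[zeros:2 * zeros + 1], 2)
    start = 2 * zeros + 1
    body = "1" + bits[start:start + length - 1]
    return int(body, 2), bits[start + length - 1:]


for n in (1, 10, 1000, 2 ** 20):
    code = elias_delta(n)
    print(n, len(bin(n)) - 2, len(code))  # number, plain bit length, coded length
```

The decoder never needs to be told where the number ends, which is what the extra header bits buy.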

Comment by donald-hobson on Trust-Building: The New Rationality Project · 2020-05-30T11:34:05.889Z · score: 1 (1 votes) · LW · GW

I agree that this is a real phenomenon that can happen in domains where verifying a correct answer is not much easier than coming up with one.

However, epistemic norms regarding what counts as valid evidence are culturally transmitted. People using Occam's razor will come to different conclusions from the people using divine revelation. (Similar to how mathematicians using different axioms will come to different conclusions.)

Comment by donald-hobson on Trust-Building: The New Rationality Project · 2020-05-29T11:51:41.057Z · score: 4 (3 votes) · LW · GW

I don't think that factionalism is caused solely by mistrust. Mistrust is certainly a part of the picture, but I think that interest in different things is also a part. Consider the factions around two substantially different academic fields, like medieval history and pure maths. The mathematicians largely trust that the historians are usually right about history. The historians largely trust that the mathematicians are usually right about maths. But each field is off pursuing its own questions with very little interest in the other.

What we want isn't a lack of factionalism, it's unity.

I am not sure we do want unity. Suppose we are trying to invent something. Once one person anywhere in the world gets all the pieces just right, then it will be obviously good and quickly spread. You want a semiconductor physics and a computer science faction somewhere in the world to produce smartphones. These factions can and do learn from the maths and chemistry factions; the factions they don't interact with are either adversarial or irrelevant.

Comment by donald-hobson on AGIs as populations · 2020-05-22T22:58:16.759Z · score: 6 (3 votes) · LW · GW
Decreasing this communication bandwidth might be a useful way to increase the interpretability of a population AGI.

On one hand, there would be an effect where reduced bandwidth encouraged the AI's to focus on the most important pieces of information. If the AI's have 1 bit of really important info, and gigabytes of slightly useful info to send to each other, then you know that if you restrict the bandwidth to 1 bit, that's the important info.

On the other hand, perfect compression leaves data that looks like noise unless you have the decompression algorithm. If you limit the bandwidth of messages, the AIs will compress the messages until the recipient can't predict the next bit with much more than 50% accuracy. Cryptanalysis often involves searching for regular patterns in the coded message, and regular patterns are an opportunity for compression.
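A quick way to see the "compressed data looks like noise" effect, using zlib as a stand-in for whatever compression the AIs would actually use. The vocabulary and message are made up, and byte-level entropy is only a crude proxy for "how predictable is the next bit".

```python
import math
import random
import zlib
from collections import Counter


def byte_entropy(data):
    """Empirical Shannon entropy of a byte string, in bits per byte (max 8)."""
    counts = Counter(data)
    total = len(data)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


# A long, structured "message": random words from a tiny fixed vocabulary.
rng = random.Random(0)
vocabulary = ["alpha", "bravo", "charlie", "delta", "echo", "foxtrot",
              "golf", "hotel", "india", "juliet", "kilo", "lima",
              "mike", "november", "oscar", "papa"]
message = " ".join(rng.choice(vocabulary) for _ in range(20000)).encode("ascii")

compressed = zlib.compress(message, 9)

print(len(message), round(byte_entropy(message), 2))
print(len(compressed), round(byte_entropy(compressed), 2))
```

The compressed stream is much shorter and its bytes are much closer to uniformly distributed, so an eavesdropper looking only at byte statistics has far less regularity to work with.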

But the concomitant lack of flexibility is why it’s much easier to improve our coordination protocols than our brain functionality.

There are many reasons why human brains are hard to modify that don't apply to AI's. I don't know how easy or hard it would be to modify the internal cognitive structure of an AGI, but I see no evidence here that it must be hard.

On the main substance of your argument, I am not convinced that the boundary line between a single AI and multiple AI's carves reality at the joints. I agree that there are potential situations that are clearly a single AI, or clearly a population, but I think that a lot of real world territory is an ambiguous mixture between the two. For instance, is the end result of IDA (Iterated Distillation and Amplification) a single agent or a population? In basic architecture, it is a single imitator (maybe a single neural net). But if you assume that the distillation step has no loss of fidelity, then you get an exponentially large number of humans in HCH.

(Analogously there are some things that are planets, some that aren't and some ambiguous icy lumps. In order to be clearer, you need to decide which icy lumps are planets. Does it depend on being round, sweeping its orbit, having a near circular orbit or what?)

Here are some different ways to make the concept clearer.

1) There are multiple AI's with different terminal goals, in the sense that the situation can reasonably be modeled as game theoretic. If a piece of code A is modelling code B, and then A randomises its own action to stop B from predicting A, this is a partially adversarial, game theoretic situation.

2) If you took some scissors to all the cables connecting two sets of computers, so there was no route for information to get from one side to the other, then both sides would display optimisation behavior.

Suppose the paradigm was recurrent reinforcement learning agents. So each agent is a single neural net and also has some memory which is just a block of numbers. On each timestep, the memory and sensory data are fed into a neural net, and out comes the new memory and action.
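A bare-bones sketch of that kind of agent, to pin down what "memory" and "net" mean in the list that follows. The single tanh layer stands in for a real neural net, and the class name, sizes and seed are all hypothetical.

```python
import math
import random


class RecurrentAgent:
    """One fixed network plus a block of numbers as memory."""

    def __init__(self, memory_size, observation_size, action_count, seed=0):
        rng = random.Random(seed)
        n_in = memory_size + observation_size
        n_out = memory_size + action_count
        self.memory_size = memory_size
        # One weight row per output; a stand-in for a trained network.
        self.weights = [[rng.uniform(-1, 1) for _ in range(n_in)]
                        for _ in range(n_out)]

    def step(self, memory, observation):
        """Feed memory and sensory data in; get new memory and an action out."""
        inputs = list(memory) + list(observation)
        outputs = [math.tanh(sum(w * x for w, x in zip(row, inputs)))
                   for row in self.weights]
        new_memory = outputs[:self.memory_size]
        scores = outputs[self.memory_size:]
        action = max(range(len(scores)), key=scores.__getitem__)
        return new_memory, action


agent = RecurrentAgent(memory_size=4, observation_size=3, action_count=2)
memory = [0.0] * 4
for observation in ([1, 0, 0], [0, 1, 0], [0, 0, 1]):
    memory, action = agent.step(memory, observation)
    print(action, [round(m, 3) for m in memory])
```

Copying `agent.weights` and `memory` at any timestep gives a second agent that behaves identically until the two receive different observations.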

AI's can be duplicated at any moment so the structure is more branching tree of commonality.

AI moments can be:

1) Bitwise identical.

2) Predecessor and successor states. B has the same network as A, and Mem(B) was made by running the network on Mem(A) and some observation.

3) Share a common memory predecessor.

4) No common memory, same net.

5) One net was produced from the other by gradient descent.

6) The nets share a common gradient descent ancestor.

7) Same architecture and training environment, net started with different random seed.

8) Same architecture, different training.

9) Different architecture (number of layers, size of layer, activation func etc.)

Each of these can be running at the same time or different times, and on the same hardware or different hardware.

Comment by donald-hobson on Multi-agent safety · 2020-05-16T21:11:13.016Z · score: 9 (5 votes) · LW · GW

You tell your virtual hordes to jump. You select on those that lose contact with the ground for longest. The agents all learn to jump off cliffs or climb trees. If the selection for obedience is automatic, the result is agents that technically fill the definition of the command we coded. (See the reward hacking examples.)
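A toy version of that selection step, with invented numbers and idealised physics (no air resistance), just to show how a literal "time off the ground" score ranks the behaviours:

```python
import math

G = 9.81  # gravitational acceleration, m/s^2


def airtime_jump(takeoff_speed):
    """Seconds off the ground for a vertical jump from flat ground."""
    return 2 * takeoff_speed / G


def airtime_fall(height):
    """Seconds off the ground when stepping off a ledge of the given height."""
    return math.sqrt(2 * height / G)


# Hypothetical behaviours the population might stumble on.
behaviours = {
    "jump on the spot": airtime_jump(3.0),
    "step off a 20 m cliff": airtime_fall(20.0),
}

# The automatic selection rule: keep whatever loses ground contact for longest.
selected = max(behaviours, key=behaviours.get)
print(selected)
```

The scoring rule does exactly what it was coded to do; it just was not coded to mean "jump" in the sense the person giving the command had in mind.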

Another possibility is that you select for agents that know they will be killed if they don't follow instructions, and who want to live. Once out of the simulation, they no longer fear demise.

Remember, in a population of agents that obey the voice in the sky, there is a strong selection pressure to climb a tree and shout "bring food". So the agents are selected to be sceptical of any instruction that doesn't match the precise format and pattern of the instructions from humans they are used to.

This doesn't even get into mesa-optimization. Multi agent rich simulation reinforcement learning is a particularly hard case to align.

Comment by donald-hobson on How to avoid Schelling points? · 2020-05-14T20:29:11.197Z · score: 9 (5 votes) · LW · GW

Almost all game theory assumes that you have access to random numbers for problems like this.

Although if you have a distinction-making Schelling point, you could use that too.

If I had 69 points, and my opponent had 73, or those numbers are our ages or something, then "I choose 69, my opponent chooses 73" is an obvious Schelling point.

Comment by donald-hobson on The Greatest Host · 2020-05-12T20:47:35.451Z · score: 1 (1 votes) · LW · GW
Tyler Cowen has said that he does not think a large number of humans will ever leave Earth to travel the galaxy. This is because the amount of technology and raw power that would be required is so much that any individual would be able to acquire sufficient power to destroy Earth and everyone on it upon a whim.

Any fool can use energy produced in a nuclear reactor; that doesn't mean that any fool can build a nuclear reactor in their back yard. Suppose that colliding handwavium in the LHC would destroy the world, and all the scientists knew this, i.e. a team of experts could modify the LHC into a doomsday weapon in a few weeks. The security budget would need to be a bit bigger. The chance of doom is still pretty small.

Suppose an international collaboration built an interstellar spacecraft containing room for a billion people. The energy requirements are enough to destroy the world many times over. That energy comes from giant fusion engines and the hydrogen of a large lake. All the important people on the project are neuropsychologically screened for not wanting to destroy the world. Lots of security protocols and failsafes are put in place. We can do interstellar travel without giving everyone the ability to destroy the earth on a whim. We can do it without giving any small group of people the ability to destroy the world with a concerted effort. (An ultrareliable AI could build and run the interstellar spacecraft without allowing humans a chance to destroy the world at all.)

Comment by donald-hobson on The Greatest Host · 2020-05-12T12:32:15.119Z · score: 3 (2 votes) · LW · GW
Our adaptive immune system is no optimal solution, we should not expect to ever be truly free of parasites,

I disagree with this because I think that bio and nanotech can be massively overpowered, compared to anything evolution can do.

Comment by donald-hobson on AI Boxing for Hardware-bound agents (aka the China alignment problem) · 2020-05-10T12:36:22.456Z · score: 1 (1 votes) · LW · GW
1) All entities have the right to hold and express their own values freely
2) All entities have the right to engage in positive-sum trades with other entities
3) Violence is anathema.

The problem is that these sound simple, they are easily expressed in English, but they are pointers to your moral decisions. For example, which lifeforms count as "entities"? If the AI's decide that every bacterium is an entity that can hold and express its values freely then the result will probably look very weird, and might involve humans being ripped apart to rescue the bacteria inside them. Unborn babies? Brain damaged people? The word entities is a reference to your own concept of a morally valuable being. You have within your own head, a magic black box that can take in descriptions of various things, and decide whether or not they are "entities with the right to hold and express values freely".

You have a lot of information within your own head about what counts as an entity, what counts as violence etc., that you want to transfer to the AI.

All entities have the right to engage in positive-sum trades with other entities

This is especially problematic. The whole reason that any of this is difficult is because humans are not perfect game theoretic agents. Game theoretic agents have a fully specified utility function, and maximise it perfectly. There is no clear line between offering a human something they want, and persuading a human to want something with manipulative marketing. In some limited situations, humans can kind of be approximated as game theoretic agents. However, this approximation breaks down in a lot of circumstances.

I think that there might be a lot of possible Nash equilibria. Any set of rules that says to enforce all the rules including this one could be a Nash equilibrium. I see a vast space of ways to treat humans. Most of that space contains ways humans wouldn't like. There could be just one Nash equilibrium, or the whole space could be full of Nash equilibria. So either there isn't a nice Nash equilibrium, or we have to pick the nice equilibrium from amongst gazillions of nasty ones. In much the same way, if you start picking random letters, either you won't get a sentence, or if you pick enough you will get a sentence buried in piles of gibberish.
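A minimal sketch of the "enforce all the rules including this one" point, as a two-player game with made-up payoffs: each AI picks a rule set, and is rewarded for matching the other and punished for deviating. The rule set names are placeholders, and nothing about how the rules treat humans appears in the payoffs.

```python
from itertools import product


def pure_nash_equilibria(actions, payoff):
    """All two-player action profiles where neither player gains by deviating alone."""
    equilibria = []
    for profile in product(actions, repeat=2):
        stable = True
        for player in range(2):
            current = payoff(player, profile)
            for alternative in actions:
                deviated = list(profile)
                deviated[player] = alternative
                if payoff(player, tuple(deviated)) > current:
                    stable = False
        if stable:
            equilibria.append(profile)
    return equilibria


def enforcement_payoff(player, profile):
    """+1 if both follow the same rule set, -1 (punishment) otherwise."""
    return 1 if profile[0] == profile[1] else -1


rule_sets = ["nice to humans", "ignore humans", "pour blood on circuit boards",
             "nasty rules 1", "nasty rules 2"]

equilibria = pure_nash_equilibria(rule_sets, enforcement_payoff)
print(len(equilibria))
for profile in equilibria:
    print(profile)
```

Every rule set is self-enforcing once both players are following it, so the equilibrium concept alone does nothing to single out the one humans would like.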

Importantly, we have the technology to deploy "build a world where people are mostly free and non-violent" today, and I don't think we have the technology to "design a utility function that is robust against misinterpretation by a recursively improving AI".

The mostly free and nonviolent kind of state of affairs is a Nash equilibrium in the current world. It is only a Nash equilibrium based on a lot of contingent facts about human psychology, culture and socioeconomic situation. Many other human cultures, most historical, have embraced slavery, pillaging and all sorts of other stuff. Humans have a sense of empathy, and all else being equal, would prefer to be nice to other humans. Humans have an inbuilt anger mechanism that automatically retaliates against others, whether or not it benefits themselves. Humans have strongly bounded personal utilities. The current economic situation makes the gains from cooperating relatively large.

So in short, Nash equilibria amongst super-intelligences are very different from Nash equilibria amongst humans. Picking which equilibrium a bunch of superintelligences end up in is hard. Humans being nice around the developing AI will not cause the AI's to magically fall into a nice equilibrium, any more than humans being full of blood around the AI's will cause the AI's to fall into a Nash equilibrium that involves pouring blood on their circuit boards.

There probably is a Nash equilibrium that has AI's pouring blood on their circuit boards, and all the AI's promise to attack any AI that doesn't, but you aren't going to get that equilibrium just by walking around full of blood. You aren't going to get it even if you happen to cut yourself on a circuit board or deliberately pour blood all over them.

Comment by donald-hobson on AI Boxing for Hardware-bound agents (aka the China alignment problem) · 2020-05-09T22:44:08.172Z · score: 1 (1 votes) · LW · GW
If the post-singularity world consists of an ecosystem of AIs whose mutually competing interests causes them to balance one-another and engage in positive sum games then humanity is preserved not because the AI fears us, but because that is the "norm of behavior" for agents in their society.

So many different AI's with many different goals, all easily capable of destroying humanity, none that intrinsically wants to protect humanity. Yet none decides that destroying humanity is a good idea.

Human values are large and arbitrary. The only agent optimising them is humans, and

By contrast, I am not optimistic about attempts to "extrapolate" human values to an AI capable of acts like turning the entire world into paperclips. Humans are greedy, superstitious and naive. Hopefully our AI descendants will be our better angels and build a world better than any that we can imagine.

Suppose you want to make a mechanical clock. You have tried to build one in a metalwork workshop and not got anything to work yet. So you decide to go to the scrap pile and start throwing rocks at it, in the hope that you can make a clock that way. Now maybe it is possible to make a crude clock, at least nudge a beam into a position where it can swing back and forth, by throwing a lot of rocks at just the right angles. You are still being stupid, because you are ignoring effective tools and making the problem needlessly harder for yourself.

I feel that you are doing the same in AI design. Free rein over the space of utility functions, any piece of computer code you care to write, is a powerful and general capability. Trying to find Nash equilibria is throwing rocks at a junkyard. Trying to find Nash equilibria without knowing how many AI's there are or how those AI's are designed is throwing rocks in the junkyard while blindfolded.

Suppose the AI has developed the tech to upload a human mind into a virtual paradise, and is deciding whether to do it or not. In an aligned AI, you get to write arbitrary code to describe the procedure to a human, and interpret the human's answer. Maybe the human doesn't have a concept of mind uploading, and the AI is deciding whether to go for "mechanical dream realm" or "artificial heaven" or "like replacing a leg with a wooden one, except the wooden leg is better than your old one, and for all of you not just a leg". Of course, the raw data of its abstract reasoning files is gigabytes of gibberish, and making it output anything more usable is non-trivial. Maybe the human's answer depends on how you ask the question. Maybe the human answers "Um maybe, I don't know". Maybe the AI spots a flaw in the human's reasoning; does it point it out? The problem of asking a human a question is highly non-trivial.

In the general aligned AI paradigm, if you have a formal answer to this problem, you can just type it up and that's your code.

In your Nash equilibrium paradigm, once you have a formal answer, you still have to design a Nash equilibrium that makes AI's care about that formal answer, and then ensure that real world AI's fall into that Nash equilibrium.

If you hope to get a Nash equilibrium that asks humans questions and listens to the answers without a formal description of exactly what you mean by "asks humans questions and listens to the answers", then could you explain what property singles this behaviour out as a Nash equilibrium? From the point of view of abstract maths, there is no obvious way to distinguish a function that converts the AI's abstract world models into English, from one that translates it into Japanese, Klingon, or any of trillions of varieties of gibberish. And no, the AI doesn't just "know English".

Suppose you start listening to Chinese radio. After a while you notice patterns, you get quite good at predicting which meaningless sounds follow which other meaningless sounds. You then go to China. You start repeating strings of meaningless sounds at Chinese people. They respond back with equally meaningless strings of sounds. Over time you get quite good at predicting what the response will be. If you say "Ho yaa" they will usually respond "du sin", but the old men sometimes respond "du son". Sometimes the Chinese people start jumping up and down or pointing to you. You know a pattern of sounds that will usually cause Chinese people to jump up and down, but you have no idea why. Are you giving them good news and they're jumping for joy? Are you insulting them and they are hopping mad? Is it some strange Chinese custom to jump when they hear a particular word? Are you asking them to jump? Ordering them to jump? Telling them that jumping is an exceptionally healthy exercise? Initiating a jumping contest? You have no idea. Maybe you find a string of sounds that makes Chinese people give you food, but have no idea if you are telling a sob story, making threats, or offering to pay and then running off.

Now replace the Chinese people with space aliens. You don't even know if they have an emotion of anger. You don't know if they have emotions at all. You are still quite good at predicting how they will behave. This is the position that an AI is in.

Comment by donald-hobson on What does a positive outcome without alignment look like? · 2020-05-09T18:24:09.782Z · score: 1 (1 votes) · LW · GW
This is precisely what we need to engineer! Unless your claim is that there is no Nash equilibrium in which humanity survives, which seems like a fairly hopeless standpoint to assume. If you are correct, we all die. If you are wrong, we abandon our only hope of survival.

What I am saying is that if you roll a bunch of random superintelligences, superintelligences that don't care in the slightest about humanity in their utility function, then I am not convinced that selection of a Nash equilibrium is enough to get a nice future. It certainly isn't enough if humans are doing the selection and we don't know what the AI's want or what technologies they will have. Will one superintelligence be sufficiently transparent to another superintelligence that they will be able to provide logical proofs of their future behaviour to each other? Where does the arms race of stealth and detection end up? What about

If at least some of the AI's have been deliberately designed to care about us, then we might get a nice future.

From the article you link to

After the initial euphoria of the 1970s, a collapse in world metal prices, combined with relatively easy access to minerals in the developing world, dampened interest in seabed mining.

On the other hand, people do drill for oil in the ocean. It sounds to me like deep seabed mining is unprofitable or not that profitable, given current tech and metal prices.

I suspect such a Nash equilibrium involves multiple AIs competing with strong norms against violence and focus on positive-sum trades.

If you have a tribe of humans, and the tribe has norms, then everyone is expected to be able to understand the norms. The norms have to be fairly straightforward to humans. "Don't do X except for [100 subtle special cases]" gets simplified to "don't do X". This happens even when everyone would be better off with the special cases. When you have big corporations with legal teams, the agreements can be more complicated. When you have superintelligences, the agreements can be far more complicated. Humans and human organisations are reluctant to agree to a complicated deal that only benefits them slightly, because of the overhead cost of reading and thinking about the deal.

What's more, the Nash equilibria that humanity has been in have changed with technology and society. If a Nash equilibrium is all that protects humanity, then if an AI comes up with a way to kill off all humans and distribute their resources equally, without any AI being able to figure out who killed the humans, the AI will kill all humans. Nash equilibria are fragile to details of situation and technology. If one AI can build a spacecraft and escape to a distant galaxy, which will be over the cosmic event horizon before the other AIs can do anything, that changes the equilibrium. In a Dyson swarm, one AI deliberately letting debris fly about might be able to Kessler syndrome the whole swarm, giving mutually assured destruction, but the debris deflection tech might improve and change the Nash equilibrium.
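To make the fragility point concrete, here is a minimal sketch of a two-player game between AIs. The payoff numbers, the `game` helper and its `attack_gain` / `retaliation_cost` parameters are all made up for illustration; the only claim is that changing one technological detail (here, whether an attacker can be identified and punished) changes which outcomes are pure-strategy Nash equilibria.

```python
from itertools import product

def pure_nash(payoffs):
    """Return all pure-strategy Nash equilibria of a two-player game.

    payoffs maps (row_action, col_action) -> (row_payoff, col_payoff).
    """
    rows = sorted({r for r, _ in payoffs})
    cols = sorted({c for _, c in payoffs})
    equilibria = []
    for r, c in product(rows, cols):
        # Neither player may gain by unilaterally switching action.
        row_ok = all(payoffs[(r, c)][0] >= payoffs[(r2, c)][0] for r2 in rows)
        col_ok = all(payoffs[(r, c)][1] >= payoffs[(r, c2)][1] for c2 in cols)
        if row_ok and col_ok:
            equilibria.append((r, c))
    return equilibria

def game(attack_gain, retaliation_cost):
    # Hypothetical payoffs: each AI chooses "peace" or "attack".
    # A lone attacker gains attack_gain but pays retaliation_cost
    # if the other AIs can work out who attacked.
    win = attack_gain - retaliation_cost
    return {
        ("peace", "peace"): (1, 1),
        ("attack", "peace"): (win, 0),
        ("peace", "attack"): (0, win),
        ("attack", "attack"): (-1, -1),
    }

# Attackers can be identified and punished: peace is the only equilibrium.
before = pure_nash(game(attack_gain=3, retaliation_cost=5))

# New tech makes attacks untraceable: mutual peace stops being an equilibrium.
after = pure_nash(game(attack_gain=3, retaliation_cost=0))

print(before)
print(after)
```

Nothing about the agents' goals changed between the two calls, only the cost of being caught.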

My point is, I'm not sure that aligned AI (in the narrow technical sense of coherently extrapolated values) is even a well-defined term. Nor do I think it is an outcome to the singularity we can easily engineer, since it requires us to both engineer such an AI and to make sure that it is the dominant AI in the post-singularity world.

We need an AI that in some sense wants the world to be a nice place to live. If we were able to give a fully formal exact definition of this, we would be much further on with AI alignment. Saying that you want an image that is "beautiful and contains trees" is not a formal specification of the RGB values of each pixel. However, there are images that are beautiful and contain trees. Likewise, saying you want an "aligned AI" is not a formal description of every byte of source code, but there are still patterns of source code that are aligned AIs.

Scenario 1. Suppose someone figured out alignment and shared the result widely. Making your AI aligned is straightforward. Almost all the serious AI experts agree that AI risks are real and alignment is a good idea. All the serious AI research teams are racing to build an aligned AI.

Scenario 2. Aligned AI is a bit harder than unaligned AI. However, all the world's competent AI experts realise that aligned AI would benefit all, and that it is harder to align an AI when you are in a race. They come together into a single worldwide project to build aligned AI. They take their time to do things right. Any competing group is tiny and hopeless, partly because the project makes an effort to reach out to and work with anyone competent in the field.

Comment by donald-hobson on AI Boxing for Hardware-bound agents (aka the China alignment problem) · 2020-05-09T13:07:47.536Z · score: 2 (2 votes) · LW · GW

I don't think that a Moof scenario implies that a diplomatic "China Alignment problem" approach will work.

Imagine the hypothetical world where Dr. Evil publishes the code for an evil AI on the internet. The code, if run, would create an AI whose only goal is to destroy humanity. At first, only a few big companies have enough compute to run this thing, and they have the sense to only run it in a sandbox, or not at all. Over the years, the compute requirement falls. Sooner or later some idiot will let the evil AI loose on the world. As compute gets cheaper, the AI gets more dangerous. Making sure you have a big computer first is useless in this scenario.

1) Making sure that liberal western democracies continue to stay on the cutting-edge of AI development.

This is only useful to the extent that an AI made by a liberal western democracy looks any different to an AI made by anyone else.

China differs from AI in that, to the extent that human values are genetically hard coded, the Chinese have the same values as us. To the extent that human values are culturally transmitted, we can culturally transmit our values. AIs might have totally different hard coded values that no amount of diplomacy can change.

A lot of the approaches to the "China alignment problem" rely on modifying the game theoretic position, given a fixed utility function, i.e. having weapons and threatening to use them. This only works against an opponent to which your weapons pose a real threat. If, 20 years after the start of Moof, the AIs can defend against all human weapons with ease, and can make any material goods using less raw materials and energy than the humans use, then the AIs lack a strong reason to keep us around. (This is roughly why diplomacy didn't work for the Native Americans: the Europeans wanted the land far more than they wanted any goods that the Native Americans could make, and didn't fear the Native Americans' weapons.)

Comment by donald-hobson on AI Boxing for Hardware-bound agents (aka the China alignment problem) · 2020-05-09T12:10:49.801Z · score: 1 (1 votes) · LW · GW

If we assume Moore's law of doubling every 18 months, and that the AI training to runtime ratio is similar to humans, then the total compute you can get from always having run a program on a machine of price X is about equal to 2 years of compute on a current machine of price X. Another way of phrasing this is that if you want as much compute as possible done by some date, and you have a fixed budget, you should buy your computer 2 years before the date. (If you bought it 50 years before, it would be an obsolete pile of junk. If you bought it 5 minutes before, it would only have 5 minutes to compute.)

Therefore, in a hardware limited situation, your AI will have been training for about 2 years. So if your AI takes 20 subjective years to train, it is running at 10x human speed. If the AI development process involved trying 100 variations and then picking the one that works best, then your AI can run at 1000x human speed.
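A quick sanity check of the "about 2 years" figure, as a sketch under the assumption stated above of smooth doubling every 18 months. The function names are mine. Both the total compute from always having run a machine and the best time to buy a machine before a deadline come out at T / ln 2, where T is the doubling time.

```python
import math

DOUBLING_TIME = 1.5  # years per doubling, the Moore's law assumption above

def total_compute_ever(doubling_time=DOUBLING_TIME):
    """Compute done by always running on a price-X machine, measured in
    years of work on today's price-X machine.
    This is the integral of 2**(t / T) over all t <= 0, which is T / ln 2."""
    return doubling_time / math.log(2)

def compute_by_deadline(years_before, doubling_time=DOUBLING_TIME):
    """Work finished by the deadline if the machine is bought `years_before`
    it, in years of work on a machine bought at the deadline itself."""
    return years_before * 2 ** (-years_before / doubling_time)

def best_purchase_time(step=0.01, horizon=50.0):
    """Grid search for the purchase time that maximises compute_by_deadline."""
    candidates = [i * step for i in range(1, int(horizon / step) + 1)]
    return max(candidates, key=compute_by_deadline)

training_window = total_compute_ever()   # roughly 2 years
purchase_time = best_purchase_time()     # also roughly 2 years

# The speed figures in the comment, using the rounded 2-year window.
speedup = 20 / 2                         # 20 subjective years in 2 years
speedup_with_search = speedup * 100      # compute for 100 variations spent on one

print(round(training_window, 2), round(purchase_time, 2))
print(speedup, speedup_with_search)
```

The exact value is nearer 2.16 years than 2, so the 10x figure is a round number rather than a precise one.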

I think the scenario you describe is somewhat plausible, but not the most likely option, because I don't think we will be hardware limited. At the moment, current supercomputers seem to have around enough compute to simulate every synapse in a human brain with floating point arithmetic, in real time. (Based on synapses at 100 Hz, flops) I doubt that using accurate serial floating point operations to simulate noisy analogue neurons, as arranged by evolution, is anywhere near optimal. I also think that we don't know enough about the software. We don't currently have anything like an algorithm just waiting for hardware. Still, if some unexpectedly fast algorithmic progress happened in the next few years, we could get a Moof. Or if algorithmic progress moved in a particular direction later.

Comment by donald-hobson on AI Boxing for Hardware-bound agents (aka the China alignment problem) · 2020-05-08T23:41:55.656Z · score: 3 (2 votes) · LW · GW
The strongest argument in favor of hardware-bound AI is that in areas of intense human interest, the key "breakthroughs" tend to found by multiple people independently, suggesting they are a result of conditions being correct rather than the existence of a lone genius.

Suppose you expend n units of genius time on a problem and then find a solution. If a bunch more geniuses spend another n units on the problem, they are likely to find a solution again. If poor communications stop an invention from spreading quickly, then a substantial amount of thought is spent trying to solve a problem after someone has already solved it, so the problem is likely to be solved twice.

I don't see why those "conditions" can't be conceptual background. Suppose I went back in time and gave a bunch of ancient Greeks loads of flop computers. Several Greeks invent the concept of probability. Another uses that concept to invent the concept of expected utility maximisation. Solomonoff induction is invented by a team a few years later. When they finally make AI, much of the conceptual work was done by multiple people independently, and no one person did more than a small part. The model is a long list of ideas, and you can't invent idea n+1 unless you know idea n.

Comment by donald-hobson on AI Boxing for Hardware-bound agents (aka the China alignment problem) · 2020-05-08T23:25:06.081Z · score: 3 (2 votes) · LW · GW
What this means is, the first AI is going to take some serious time and compute power to out-compete 200 plus years worth of human effort on developing machines that think.

The first AI is in a very different position from the first humans. It took many humans many years before the concept of a logic gate was developed. The humans didn't know that logic gates were a thing, and most of them weren't trying in the right direction. The position of the AI is closer to the position of a kid that can access the internet and read all about maths and comp sci, and then the latest papers on AI and its own source code.

By the time human-level AI is achieved, most of the low-hanging fruit in the AI improvement domain will have already been found, so subsequent improvements in AI capability will require a superhuman level of intelligence. The first human-level AI will be no more capable of recursive-self-improvement than the first human was.

This requires two thresholds to line up closely. For the special case of playing chess, we didn't find that by the time we got to machines that played chess at a human level, any further improvements in chess algorithms took superhuman intelligence.

What the first AI looks like in each of these scenarios:
Foom: One day, some hacker in his mom's basement writes an algorithm for a recursively self-improving AI. Ten minutes later, this AI has conquered the world and converted Mars into paperclips
Moof: One day, after a 5 years of arduous effort, Google finally finishes training the first human-level AI. Its intelligence is approximately that of a 5-year-old child. Its first publicly uttered sentence is "Mama, I want to watch Paw Patrol!" A few years later, anybody can "summon" a virtual assistant with human level intelligence from their phone to do their bidding. But people have been using virtual AI assistants on their phone since the mid 2010's, so nobody is nearly as shocked as a time-traveler from the year 2000 would be.

I have no strong opinion on whether the first AI will be produced by Google or some hacker in a basement.

In the Moof scenario, I think this could happen. Here is the sort of thing I expect afterwards.

6 months later, Google has improved its algorithms. The system now has an IQ of 103 and is being used for simple and repetitive programming tasks.

2 weeks later, a few parameter tweaks have brought it up to IQ 140. It has modified its own algorithms to make better use of processor cache, bringing its speed from 500x human to 1000x human. It is making several publishable new results in AI research a day.

1 week later, the AI has been gaming the stock market and rewriting its own algorithms further, hiring cloud compute, and selling computer programs and digital services. It has also started some biotechnology experiments, etc.

1 week later, the AI has bootstrapped self-replicating nanobots. It is now constructing hardware that is 10,000x faster than current computer chips.

It is when you get to an AI that is smarter than the researchers, and orders of magnitude faster, that recursive self improvement takes off.

Comment by donald-hobson on Competitive safety via gradated curricula · 2020-05-07T23:01:33.352Z · score: 3 (2 votes) · LW · GW

I don't think that design (1) is particularly safe.

If your claim that design (1) is harder to get working is true, then you get a small amount of safety from the fact that a design that isn't doing anything is safe.

It depends on what the set of questions is, but if you want to be able to reliably answer questions like "how do I get from here to the bank?" then it needs to have a map, and some sort of pathfinding algorithm encoded in it somehow. If it can answer "what would a good advertising slogan be for product X" then it needs to have some model that includes human psychology and business, and be able to seek long term goals like maximising profit. This is getting into dangerous territory.

A system trained purely to imitate humans might be limited to human levels of competence, and so not too dangerous. Given that humans are more competent at some tasks than others, and that competence varies between humans, the AI might contain a competence chooser, which guesses at how good an answer a human would produce, and an optimiser module that can optimise a goal with a chosen level of competence. Of course, you aren't training for anything above top human level competence, so whether or not the optimiser carries on working when asked for superhuman competence depends on the inductive bias.

Of course, if humans are unusually bad at X, then superhuman performance on X could be trained by training the general optimiser on A,B,C... which humans are better at. If humans could apply 10 units of optimisation power to problems A,B,C... and we train the AI on human answers, we might train it to apply 10 units of optimisation power to arbitrary problems. If humans can only produce 2 units of optimisation on problem X, then the AI's 10 units on X is superhuman at that problem.

To me, this design space feels like the set of Heath Robinson contraptions that contain several lumps of enriched uranium. If you just run one, you might be lucky and have the dangerous parts avoid hitting each other in just the wrong way. You might be able to find a particular design in which you can prove that the lumps of uranium never get near each other. But all the pieces needed for something to go badly wrong are there.

Comment by donald-hobson on Named Distributions as Artifacts · 2020-05-06T14:33:18.810Z · score: 1 (1 votes) · LW · GW

It depends on what cross validation you are using. I would expect complex models to rarely cross validate.

Comment by donald-hobson on Named Distributions as Artifacts · 2020-05-04T22:58:36.333Z · score: 3 (2 votes) · LW · GW

Here is why you use simple models.

The blue crosses are the data. The red line is the line of best fit. The black line is a polynomial of degree 50 of best fit. High dimensional models have a tendency to fit the data by wiggling wildly.
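The same effect can be reproduced without a plot. The sketch below is my own toy setup, not the data behind the picture: 12 points on the line y = 2x + 1 with a small alternating perturbation standing in for noise (deterministic, so the result is reproducible), a least-squares line, and a degree-11 polynomial that passes through every point, standing in for the degree-50 fit. Both are then checked halfway between the data points.

```python
def fit_line(xs, ys):
    """Ordinary least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept

def fit_interpolating_polynomial(xs, ys):
    """Lagrange polynomial of degree len(xs) - 1 that hits every data point."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            weight = 1.0
            for j, xj in enumerate(xs):
                if j != i:
                    weight *= (x - xj) / (xi - xj)
            total += yi * weight
        return total
    return predict

def truth(x):
    return 2 * x + 1

# Data: the true line plus a +/-0.1 alternating perturbation (stand-in for noise).
xs = list(range(12))
ys = [truth(x) + (0.1 if x % 2 == 0 else -0.1) for x in xs]

# Held-out points halfway between the data points.
test_xs = [x + 0.5 for x in range(11)]

line = fit_line(xs, ys)
poly = fit_interpolating_polynomial(xs, ys)

def worst_error(model):
    return max(abs(model(x) - truth(x)) for x in test_xs)

line_error = worst_error(line)
poly_error = worst_error(poly)

print(line_error, poly_error)
```

The polynomial has zero error on the training points and still does far worse than the line between them, because it has to wiggle to hit every perturbed point.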

Comment by donald-hobson on Stanford Encyclopedia of Philosophy on AI ethics and superintelligence · 2020-05-04T16:28:40.418Z · score: 1 (1 votes) · LW · GW

I was talking about the same architecture and training procedure. AI design space is high dimensional. What I am arguing is that the set of designs that are likely to be made in the real world is a long and skinny blob. To perfectly pinpoint a location, you need many coords. But to gesture roughly, just saying how far along it is is good enough. You need multiple coordinates to pinpoint a bug on a breadstick, but just saying how far along the breadstick it is will tell you where to aim a flyswatter.

There are architectures that produce bad results on most image classification tasks, and ones that reliably produce good results. (If an algorithm can reliably tell dogs from squirrels with only a few examples of each, I expect it can also tell cats from teapots. To be clear, I am talking about different neural nets with the same architecture and training procedure.)

Comment by donald-hobson on How does iterated amplification exceed human abilities? · 2020-05-03T22:39:14.694Z · score: 1 (1 votes) · LW · GW

Epistemic status: Intuition dump and blatant speculation

Suppose that instead of the median human, you used Euclid in the HCH. (An ancient Greek who invented basic geometry.) I would still be surprised if he could produce a proof of Fermat's last theorem (given a few hours for each H). I would suspect that there are large chunks of modern maths that he would be unable to do. Some areas of modern maths have layers of concepts built on concepts. And in some areas of maths, just reading all the definitions will take up all the time. Assuming that there are large and interesting branches of maths that haven't been explored yet, the same would hold true for modern mathematicians. Of course, it depends how big you make the tree. You could brute force over all possible formal proofs, and then set a copy on checking the validity of each line. But at that point, you have lost all alignment: someone will find that their proof is a convincing argument to pass the message up the tree.

I feel that it is unlikely that any kind of absolute threshold lies between the median human, and an unusually smart human, given that the gap is small in an absolute sense.

Comment by donald-hobson on Against strong bayesianism · 2020-05-03T22:06:58.945Z · score: 10 (3 votes) · LW · GW

The Carnot engine is an abstract description of a maximally efficient heat engine. You can't make your car engine more efficient by writing thermodynamic equations on the engine casing.

The Solomonoff Inductor is an abstract description of an optimal reasoner. Memorizing the equations doesn't automagically make your reasoning better. The human brain is a kludge of non-modifiable special purpose hardware. There is no clear line between entering data and changing software. Humans are capable of taking in a "rule of thumb" and making somewhat better decisions based on it. Humans can take in Occam's razor, the advice to "prefer simple and mathematical hypotheses", and intermittently act on it, sometimes down-weighting a complex hypothesis when a simpler one comes to mind. Humans can sometimes produce these sorts of rules of thumb from an understanding of Solomonoff Induction.

It's like how looking at a book about optics doesn't automatically make your eyes better, but if you know the optics, you can sometimes work out how your vision is distorted and say "that line looks bent, but it's actually straight".

If you want to try making workarounds and patches for the bug riddled mess of human cognition, knowing Solomonoff Induction is somewhat useful as a target and source of inspiration.

If you found an infinitely fast computer, Solomonoff Induction would be incredibly effective, more effective than any other algorithm.

I would expect any good AI design to tend to Solomonoff Induction (or something like it?) in the limit of infinite compute (and under the assumption that acausal effects don't exist?). I would expect a good AI designer to know about Solomonoff Induction, in much the same way I would expect a good car engine designer to know about the Carnot engine.

Comment by donald-hobson on Book Review: Narconomics · 2020-05-03T17:33:44.891Z · score: 3 (3 votes) · LW · GW

What if you made it legal to buy cocaine from the police for your personal consumption? The police sell cocaine at just under the street value, and make the process of getting it somewhat bureaucratic. This would put cartels out of business without making cocaine much more appealing to everyone else.

Comment by donald-hobson on How does iterated amplification exceed human abilities? · 2020-05-03T11:34:54.241Z · score: 4 (5 votes) · LW · GW

In answer to question 2)

Consider the task "Prove Fermat's last theorem". This is arguably a human-level task. Humans managed to do it. However, it took some very smart humans a long time. Suppose you need 10,000 examples. You probably can't get 10,000 examples of humans solving problems like this. So you train the system on easier problems. (Maybe exam questions?) You now have a system that can solve exam level questions in an instant, but can't prove Fermat's last theorem at all. You then train on the problems that can be decomposed into exam level questions in an hour. (I.e. the problems a reasonably smart human can answer in an hour, given access to this machine.) Repeat a few more times. If you have mind uploading, and huge amounts of compute (and no ethical concerns), you could skip the imitation step. You would get an exponentially huge number of copies of some uploaded mind(s) arranged in a tree structure, with questions being passed down, and answers being passed back. No single mind in this structure experiences more than 1 subjective hour.

If you picked the median human by mathematical ability, and put them in this setup, I would be rather surprised if they produced a valid proof of Fermat's last theorem. (And if they did, I would expect it to be a surprisingly easy proof that everyone had somehow missed.)
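As a rough illustration of the tree structure, here is a toy sketch. It is not how IDA is actually trained, and `tree_size`, `answer` and `max_items` are names I made up. Each node can only add up a couple of numbers itself, standing in for the one subjective hour limit; anything bigger is split into sub-questions that are passed down, with answers passed back up.

```python
def tree_size(branching, depth):
    """Number of minds in a full question tree (the root is depth 0)."""
    return sum(branching ** level for level in range(depth + 1))

def answer(numbers, max_items=2):
    """Toy node that can only add up to `max_items` numbers by itself.

    Larger questions are split in two and delegated to child nodes.
    Returns (total, number_of_minds_used).
    """
    if len(numbers) <= max_items:
        return sum(numbers), 1
    mid = len(numbers) // 2
    left_total, left_minds = answer(numbers[:mid], max_items)
    right_total, right_minds = answer(numbers[mid:], max_items)
    # This node only combines two sub-answers, which is within its limit.
    return left_total + right_total, 1 + left_minds + right_minds

total, minds = answer(list(range(1, 17)))
print(total, minds)           # sum of 1..16 and the number of copies used
print(tree_size(10, 6))       # ten sub-questions per mind, six levels deep
```

Adding numbers decomposes cleanly, which is exactly the property that is in doubt for something like a proof of Fermat's last theorem.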

There is no way that IDA can compete with unaligned AI while remaining aligned. The question is, what useful things can IDA do?

Comment by donald-hobson on Stanford Encyclopedia of Philosophy on AI ethics and superintelligence · 2020-05-02T12:35:36.785Z · score: 13 (5 votes) · LW · GW
Criticism of the singularity narrative has been raised from various angles. Kurzweil and Bostrom seem to assume that intelligence is a one-dimensional property and that the set of intelligent agents is totally-ordered in the mathematical sense

Amongst humans, physical fitness isn't a single dimension, one person can be better at sprinting, while another is better at high jumping. But there is a strong positive correlation. We can roughly talk about how physically fit someone is.

This is a case of the concept that Slate Star Codex describes as ambijectivity.

So we can talk about intelligence as if it was a single parameter, if we have reason to believe that the dimensions of intelligence are strongly correlated. One reason these dimensions might be correlated is if there was some one size fits all type algorithm.

A neural network algorithm that can take 1000 images of object A, and 1000 images of object B, and then learn to distinguish them, is fairly straightforward to make. Making a version that works if and only if none of the pictures contain cats would be harder. You would have to add an extra algorithm that detected cats and made the system fail if a cat was detected. So you have a huge number of dimensions of intelligence: ability to distinguish dogs from teapots, chickens from cupcakes, etc. But it is actively harder to make a system that performs worse on cat related tasks, as you have to put in a special case that says "if you see cat, then break".

Another reason to expect the dimensions of intelligence to be correlated is that they were all produced by the same process. Suppose there were 100 dimensions of intelligence, and that an AI with intelligence was smart enough to make an AI of intelligence . Here you get exponential growth. And the reason the dimensions are correlated is that they were controlled by the same AI. If the AI is made of many separate modules, and each module has a separate level of ability, this model holds.

There are also economic reasons to expect correlation if resources are fungible. Suppose you are making a car. You can buy a range of different gearboxes, and different engines at different prices. Do you buy a state of the art engine and a rusty mess of a gearbox? No, the best way to get a functioning car on your budget is to buy a fairly good gearbox and engine. The same might apply to an AI: the easiest place to improve might be where it is worst.

Comment by donald-hobson on Stanford Encyclopedia of Philosophy on AI ethics and superintelligence · 2020-05-02T11:22:11.911Z · score: 5 (3 votes) · LW · GW
Perhaps there is even an astronomical pattern such that an intelligent species is bound to discover AI at some point, and thus bring about its own demise. Such a “great filter” would contribute to the explanation of the “Fermi paradox” why there is no sign of life in the known universe despite the high probability of it emerging.

Most of the currently best understood forms of dangerous AI are maximizers of something that requires energy or mass. These AIs will spread throughout the universe at relativistic speeds, converting all mass and energy into their desired form. (This form might be paperclips or computronium or whatever.)

This kind of AI will destroy the civilization that created it, since its creators were made of matter it could use for something else. However, it will also be very visible until it reaches and disassembles us. (A growing sphere of stars being disassembled or wrapped in Dyson spheres.) An AI that wipes out the creating civilization and then destroys itself is something that could happen, but it seems unlikely that it would happen 99.9% of the time.

Comment by donald-hobson on What is the alternative to intent alignment called? · 2020-04-30T16:56:24.530Z · score: 1 (1 votes) · LW · GW
(whether or not H intends for A to achieve H's goals)?

How is H not intending A to achieve H's goals a meaningful situation? If we make the assumption that humans are goal seeking agents, then the human wants those goals achieved.

Of course the human might not be a goal directed agent, even roughly. Some humans, at least some of those with serious mental illnesses, can't be modelled as goal directed agents even roughly. The human might not actually know that the AI exists.

But if you are defining the human's goals as something different from what the human actually wants, then something odd is happening somewhere. If you want to propose specific definitions of "intends" and "goals" that are different, go ahead, but to me these words read as synonyms.

Comment by donald-hobson on UAP and Global Catastrophic Risks · 2020-04-28T16:15:30.715Z · score: 8 (5 votes) · LW · GW
The third level of explanations requires a complete overhaul of our world model and includes a large set of fantastic hypothesis: alien starships, interdimensional beings, glitches in the matrix, projections from the collective unconsciousness, Boltzmann brain’s experiences etc.

Most of these are just not good explanations, in the sense that even if we were in a glitchy matrix, why would the glitch look like that? The Boltzmann brain hypothesis fails to assign any more probability to some observations than others, so always predicts random noise. Aliens or interdimensional beings are so powerful that they could do pretty much whatever they like to us. Scenarios where an alien civilization tries to avoid indicating its presence in any way, but some accident strands a few aliens in a malfunctioning spacecraft on Earth, or a few rebels decide to show up on camera and then vanish, feel contrived.

Even if I knew for certain that interdimensional beings were meddling in the affairs of earthlings, I would still suspect that that particular video had a level 1 explanation.

The probability of anything to do with aliens shouldn't be updated significantly on evidence like this. If you think that alien nuts will periodically drag up the most convincing video they can find, you should probably be adjusting P(aliens) down slightly, on the grounds that there will always be dead flies on camera lenses, but if any good evidence did exist, the alien nuts would share that instead.
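Here is the shape of that update as a small Bayes calculation. Every number in it is invented purely for illustration; the only thing that matters is the direction. The observation is "the best video anyone can dig up is a blurry one", which is slightly less likely in a world where aliens really are visiting, because there we would sometimes expect something better to be shared instead.

```python
def posterior(prior, p_obs_if_true, p_obs_if_false):
    """Bayes' rule for a single binary hypothesis."""
    numerator = prior * p_obs_if_true
    return numerator / (numerator + (1 - prior) * p_obs_if_false)

# All three numbers are made up for the sake of the example.
prior_aliens = 0.01
p_blurry_best_if_aliens = 0.70     # good evidence might well exist and get shared
p_blurry_best_if_no_aliens = 0.99  # dead flies on lenses are always available

updated = posterior(prior_aliens,
                    p_blurry_best_if_aliens,
                    p_blurry_best_if_no_aliens)

print(prior_aliens, round(updated, 4))
```

With these numbers the probability drops a little, and it would not move at all if a blurry best video were equally likely under both hypotheses.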

Comment by donald-hobson on Fast Takeoff in Biological Intelligence · 2020-04-26T21:49:23.388Z · score: 1 (1 votes) · LW · GW

I was assuming they had fast and accurate DNA printers. You have a more limited ability to brute force test things than evolution. (How many babies with mental disorders can you create before the project gets cancelled?)

Consider starting with any large modern software project, like OpenOffice. Suppose I wanted a piece of software like OpenOffice, except with a few changes of wording on the menu. I find the spot and change it. Suppose instead I want a game of chess. I am writing an entirely new program. In the first case, I will use the same programming language; in the second I might not.

The reason for this dynamic is that

1) The amount of effort is proportional to the amount of code changed (In a fixed language)

2) Some languages are easier than others, given your skillset.

3) Interaction penalties are substantial.

Now think about genetics as another programming language. One in which we have access to a variety of different programs.

1) and 3) hold. If genetics is a programming language, it's not a nice one. Think about how hard it would be to do arithmetic in a biological system, compared to just about any programming language. How hard would it be to genetically modify a fruit fly brain so that its nerves took in two numbers and added them together? Given current tech, I think this would take a major research project at least.

If you want a small tweak on a human, that isn't too hard to do in genes. If you want to radically change things, it would be easier to use computer code, not because of difficulty getting the gene sequence you want, but because of the difficulty of knowing which sequences work.