Comment by jessicat on What is up with carbon dioxide and cognition? An offer · 2016-04-25T04:09:03.442Z · score: 1 (1 votes) · LW · GW

What monitor is that?

This one. It doesn't log data.

Comment by jessicat on The AI That Pretends To Be Human · 2016-02-03T06:09:02.252Z · score: 3 (5 votes) · LW · GW

Here's a scenario that doesn't seem completely implausible. Suppose Bob is someone whose public key is easily available on the internet. The first AI will read things on the internet and output a message. Some of the message will get put on the public internet. Bob suspects that the AI might have secretly sent him a message (e.g. giving him advice on which stocks to buy). So he tries using his private key to decrypt some of the AI's output (e.g. the lowest-order bits in some images the AI has output).

Knowing that Bob (or someone else like Bob) will likely do this, the AI (which has access to Bob's public key, due to reading the internet) encrypts a message with Bob's public key, and puts the ciphertext in the lowest-order bits of some image. The ciphertext looks like random noise to the second AI. Maybe the plaintext contains both actual stock tips and some dangerous messages that trick Bob into running an unrestrained copy of the first AI (I'm assuming the AI is good at psychological manipulation).
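To make the "lowest-order bits" channel concrete, here is a minimal sketch in Python. It is an illustration under assumptions, not a description of any real system: the "image" is just a list of 8-bit pixel values, the helper names `embed_lsb` and `extract_lsb` are made up for this example, and random bytes stand in for the ciphertext (the public-key encryption step is omitted).

```python
import secrets

def embed_lsb(pixels, payload):
    """Hide payload bytes in the lowest-order bit of successive pixel values."""
    bits = [(byte >> i) & 1 for byte in payload for i in range(8)]
    if len(bits) > len(pixels):
        raise ValueError("payload too large for this image")
    stego = list(pixels)
    for idx, bit in enumerate(bits):
        stego[idx] = (stego[idx] & ~1) | bit  # overwrite only the lowest bit
    return stego

def extract_lsb(pixels, n_bytes):
    """Read n_bytes back out of the lowest-order bits."""
    out = bytearray()
    for b in range(n_bytes):
        byte = 0
        for i in range(8):
            byte |= (pixels[8 * b + i] & 1) << i
        out.append(byte)
    return bytes(out)

# Assumption: random bytes stand in for a message encrypted with Bob's public key.
ciphertext = secrets.token_bytes(16)
# Assumption: a fake 8-bit grayscale image, one integer per pixel.
image = [secrets.randbelow(256) for _ in range(256)]

stego_image = embed_lsb(image, ciphertext)
recovered = extract_lsb(stego_image, len(ciphertext))
max_change = max(abs(a - b) for a, b in zip(image, stego_image))
print(recovered == ciphertext, max_change)
```

Each pixel changes by at most 1, and without the private key the embedded bits are hard to tell apart from ordinary noise in the image.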

And just the AI asking to use encryption, or giving away a public key

In this scenario, the AI doesn't need to specifically "ask to use encryption" (it just does the encryption), and it doesn't need to give away a public key (it just needs to read Bob's public key on the internet).

This does make the AI's task harder, so I suppose you could consider this technique as a form of AI boxing, but, like "ordinary" AI boxing, I only expect it to scale up to controlling AIs of a limited capability level.

Comment by jessicat on The AI That Pretends To Be Human · 2016-02-03T02:52:22.234Z · score: 2 (4 votes) · LW · GW

The trouble is that it's much easier to create a steganographic message (e.g. encrypting a message using a particular public key) than to detect it (which requires knowing the private key or otherwise breaking the encryption). So in this case "much more computing power" has to mean "exponentially more computing power".

Comment by jessicat on The AI That Pretends To Be Human · 2016-02-02T21:58:03.809Z · score: 3 (3 votes) · LW · GW

You might be interested in reading:

This prevents the first AI from doing evil things with its output. If it tries to insert complicated infohazards or subagents into its output stream, it will be easily detected as an AI. Instead it needs to mimic humans as closely as possible.

Note that steganography is still a problem. For example, the first AI could insert a message to a human attacker in some of its output, which just looks random to the second AI.

Comment by jessicat on Steelmaning AI risk critiques · 2015-07-23T19:20:14.963Z · score: 13 (13 votes) · LW · GW

One of the most common objections I've seen is that we're too far from getting AGI to know what AGI will be like, so we can't productively work on the problem without making a lot of conjunctive assumptions -- e.g. see this post.

Comment by jessicat on FAI Research Constraints and AGI Side Effects · 2015-06-13T08:05:59.022Z · score: 0 (0 votes) · LW · GW

And before I could scribble a damned thing, Calude went and solved it six months ago. The Halting Problem, I mean.

Cool. If I understand the result correctly, it says that if you run a random program for some number of steps and it doesn't halt, then (depending on the exact numbers) it is unlikely to halt when run on a supercomputer either, because halting times have low density. So almost all programs either halt quickly or run for a really, really long time. Is this correct? This doesn't quite let you approximate Chaitin's omega, but it's interesting that you can approximate a bounded variant of Chaitin's omega (like what percentage of Turing machines halt when run for 10^50 steps). I can see how this would let you solve the halting problem well enough when you live in a bounded universe.
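A toy version of the "bounded variant" can be computed directly. The sketch below is only an illustration and says nothing about the actual result being discussed: it enumerates every 2-state, 2-symbol Turing machine (with one extra halt state, a formalization I am assuming here) and reports what fraction halt on a blank tape within a given step budget. All names are made up for the example.

```python
from itertools import product

HALT = -1      # label for the single halting state
N_STATES = 2   # non-halting states, numbered 0 and 1

def halting_step(machine, max_steps):
    """Return the step at which `machine` halts on a blank tape, or None."""
    tape, pos, state = {}, 0, 0
    for step in range(1, max_steps + 1):
        write, move, nxt = machine[(state, tape.get(pos, 0))]
        tape[pos] = write
        pos += move
        if nxt == HALT:
            return step
        state = nxt
    return None

def all_machines():
    """Enumerate every transition table: (state, symbol) -> (write, move, next)."""
    keys = [(s, b) for s in range(N_STATES) for b in (0, 1)]
    actions = list(product((0, 1), (-1, 1), list(range(N_STATES)) + [HALT]))
    for choice in product(actions, repeat=len(keys)):
        yield dict(zip(keys, choice))

def halting_fraction(budget, steps):
    """Fraction of machines that halt within `budget` steps."""
    return sum(1 for s in steps if s is not None and s <= budget) / len(steps)

halt_steps = [halting_step(m, 50) for m in all_machines()]
fractions = {t: halting_fraction(t, halt_steps) for t in (1, 2, 3, 4, 5, 6, 50)}
for budget, frac in fractions.items():
    print(budget, round(frac, 4))
```

In this tiny machine class the fraction stops growing after a handful of steps, which is the "halt quickly or never" pattern in miniature; for larger machine classes the plateau is only approximate.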

Comment by jessicat on FAI Research Constraints and AGI Side Effects · 2015-06-13T07:00:04.039Z · score: 0 (0 votes) · LW · GW

It's not a specific programming language; I guess it's meant to look like Church. It could be written as:

. (define a (p))
. (foreach (range n) (lambda (i)
. . (define x (x-prior))
. . (factor (log (U x a)))))

Well so does the sigmoided version

It samples an action proportional to p(a) E[sigmoid(U) | a]. This can't be written as a function of E[U | a].

Comment by jessicat on FAI Research Constraints and AGI Side Effects · 2015-06-13T06:56:59.329Z · score: 1 (1 votes) · LW · GW

Due to the planning model, the successor always has some nonzero probability of not pressing the button, so (depending on how much you value pressing it later) it'll be worth it to press it at some point.

Comment by jessicat on FAI Research Constraints and AGI Side Effects · 2015-06-10T02:32:37.709Z · score: 0 (0 votes) · LW · GW

When you use e-raised-to-the alpha times expectation, is that similar to the use of an exponential distribution in something like Adaboost, to take something like odds information and form a distribution over assorted weights?

I'm not really that familiar with AdaBoost. The planning model is just reflecting the fact that bounded agents don't always take the maximum expected utility action. The higher alpha is, the more bias there is towards good actions, but the more potentially expensive the computation is (e.g. if you use rejection sampling).
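Here is a small numeric sketch of that trade-off, assuming E[U | a] is known exactly for three made-up actions. The rejection sampler considered is one simple choice (propose a from the prior, accept with probability exp(alpha * (E[U | a] - max E[U]))); the function names and numbers are hypothetical.

```python
import math

def softmax_policy(prior, expected_utility, alpha):
    """Action distribution proportional to p(a) * exp(alpha * E[U | a])."""
    weights = {a: prior[a] * math.exp(alpha * expected_utility[a]) for a in prior}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}

def rejection_acceptance_rate(prior, expected_utility, alpha):
    """Probability that a single proposal a ~ p is accepted."""
    best = max(expected_utility.values())
    return sum(prior[a] * math.exp(alpha * (expected_utility[a] - best)) for a in prior)

# Hypothetical prior and expected utilities for three actions.
prior = {"a1": 0.5, "a2": 0.3, "a3": 0.2}
expected_utility = {"a1": 0.0, "a2": 0.5, "a3": 1.0}

for alpha in (0.0, 1.0, 5.0, 20.0):
    policy = softmax_policy(prior, expected_utility, alpha)
    rate = rejection_acceptance_rate(prior, expected_utility, alpha)
    print(alpha, round(policy["a3"], 3), round(rate, 3))
```

As alpha grows, the policy concentrates on the best action while the acceptance rate falls toward the prior probability of that action, so each accepted sample costs more proposals.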

Since any realistic agent needs to be able to handle whatever its environment throws at it, it seems to follow that a realistic agent needs some resource-rational way to handle nonprovable nonhalting.

Ah, that makes sense! I think I see how "trading computational power for algorithmic information" makes sense in this framework.

Comment by jessicat on FAI Research Constraints and AGI Side Effects · 2015-06-10T02:26:43.781Z · score: 2 (2 votes) · LW · GW

Your model selects an action proportional to p(a) E[sigmoid(U) | a], whereas mine selects an action proportional to p(a) e^E[U | a]. I think the second is better, because it actually treats actions the same if they have the same expected utility. The sigmoid version will not take very high utilities or very low utilities into account much.
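A quick check of this difference, using two made-up actions that have the same expected utility but different spreads (the numbers and names are illustrative assumptions):

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def expectation(outcomes, f=lambda u: u):
    """outcomes is a list of (probability, utility) pairs."""
    return sum(p * f(u) for p, u in outcomes)

# Hypothetical actions, both with E[U] = 1.
safe = [(1.0, 1.0)]                # utility 1 for certain
risky = [(0.9, 0.0), (0.1, 10.0)]  # usually 0, occasionally 10

for name, outcomes in (("safe", safe), ("risky", risky)):
    exp_weight = math.exp(expectation(outcomes))   # e^E[U | a]
    sig_weight = expectation(outcomes, sigmoid)    # E[sigmoid(U) | a]
    print(name, round(exp_weight, 3), round(sig_weight, 3))
```

The e^E[U | a] weights come out equal for the two actions, while the E[sigmoid(U) | a] weights differ, because the sigmoid squashes the rare utility of 10 down to about 1.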

Btw it's also possible to select an action with probability proportional to p(a) E[U | a]^n:

query {
. a ~ p()
. for i = 1 to n
. . x_i ~ P(x)
. . factor(log U(x_i, a))
}

Comment by jessicat on FAI Research Constraints and AGI Side Effects · 2015-06-09T06:15:37.594Z · score: 2 (2 votes) · LW · GW

(BTW: here's a writeup of one of my ideas for writing planning queries that you might be interested in)

Often we want a model where the probability of taking action a is proportional to p(a)e^E[U(x, a)], where p is the prior over actions, x consists of some latent variables, and U is the utility function. The straightforward way of doing this fails:

query {
. a ~ p()
. x ~ P(x)
. factor(U(x, a))
}

Note that I'm assuming factor takes a log probability as its argument. This fails due to "wishful thinking": it tends to prefer riskier actions. The problem can be reduced by taking more samples:

query {
. a ~ p()
. us = []
. for i = 1 to n
. . x_i ~ P(x)
. . us.append(U(x_i, a))
. factor(mean(us))
}

This does better: because we took multiple samples, mean(us) is likely to be somewhat accurate. But how do we know how many samples to take? The exact query we want cannot be expressed with any finite n.
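The size of the "wishful thinking" bias can be computed exactly in a toy case. The sketch below assumes a risky action whose utility is 3 or -1 with equal probability and a safe action with utility 1 for certain (both made up, both with E[U] = 1), and evaluates the weight E[exp(mean(us))] that the n-sample query assigns to the risky action.

```python
import math

def n_sample_weight(q, hi, lo, n):
    """E[exp(mean of n i.i.d. utilities)], where U = hi with probability q, else lo."""
    total = 0.0
    for k in range(n + 1):  # k = number of samples that came out as `hi`
        prob = math.comb(n, k) * q**k * (1 - q)**(n - k)
        total += prob * math.exp((k * hi + (n - k) * lo) / n)
    return total

safe_weight = math.exp(1.0)  # the safe action always has U = 1

# Ratio of risky weight to safe weight; the exact planning query would give 1.
ratios = {n: n_sample_weight(0.5, 3.0, -1.0, n) / safe_weight for n in (1, 2, 10, 100)}
for n, ratio in ratios.items():
    print(n, round(ratio, 4))
```

With a single sample the risky action is favored by a large factor even though the expected utilities match; the ratio shrinks toward 1 as n grows but stays above 1 for every finite n.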

It turns out that we just need to sample n from a Poisson distribution and make some more adjustments:

query {
. a ~ p()
. n ~ Poisson(1)
. for i = 1 to n
. . x_i ~ P(x)
. . factor(log U(x_i, a))
}

Note that U must be non-negative. Why does this work? Consider:

P(a) ∝ p(a) E[e^sum(log U(x_i, a) for i in range(n))]
     = p(a) E[prod(U(x_i, a) for i in range(n))]
     = p(a) E[ E[prod(U(x_i, a) for i in range(n)) | n] ]
     [here use the fact that the terms in the product are independent]
     = p(a) E[ E[U(x, a)]^n ]
     = p(a) sum(i=0 to infinity) e^(-1) E[U(x, a)]^i / i!
     [Poisson(1) probabilities, then the Taylor series for exp]
     = p(a) e^(E[U(x, a)] - 1)
     ∝ p(a) e^E[U(x, a)]
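The Taylor-series step is easy to check numerically. This sketch sums Poisson(1) probabilities times E[U]^n (truncated at a number of terms I picked arbitrarily) and compares the result with e^(E[U] - 1); the constant e^(-1) is the same for every action, so it drops out when the weights are normalized.

```python
import math

def poisson_trick_weight(expected_u, max_n=60):
    """sum over n of Poisson(n; rate 1) * E[U]^n, truncated at max_n terms."""
    return sum(math.exp(-1.0) * expected_u**n / math.factorial(n) for n in range(max_n))

for eu in (0.0, 0.5, 2.0, 5.0):
    approx = poisson_trick_weight(eu)
    exact = math.exp(eu - 1.0)
    print(eu, approx, exact)

# Relative weights of two hypothetical actions with E[U] = 2.0 and E[U] = 0.5.
weight_ratio = poisson_trick_weight(2.0) / poisson_trick_weight(0.5)
```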

Ideally, this technique would help to perform inference in planning models where we can't enumerate all possible states.

Comment by jessicat on FAI Research Constraints and AGI Side Effects · 2015-06-09T06:12:40.072Z · score: 0 (0 votes) · LW · GW

ZOMFG, can you link to a write-up? This links up almost perfectly with a bit of research I've been wanting to do.

Well, a write-up doesn't exist because I haven't actually done the math yet :)

But the idea is about algorithms for doing nested queries. There's a planning framework where you take action a proportional to p(a) e^E[U | a]. If one of these actions is "defer to your successor", then the computation of E[U | a] is actually another query that samples a different action b proportional to p(b) e^E[U | b]. In this case you can actually just go ahead and convert the resulting nested query to a 1-level query: you can convert a "softmax of softmax" into a regular softmax, if that makes sense.

This isn't doing Vingean reflection, because it's actually doing all the computational work that its successor would have to do. So I'm interested in ways to simplify computationally expensive nested queries into approximate computationally cheap single queries.

Here's a simple example of why I think this might be possible. Suppose I flip a coin to decide whether the SAT problem I generate has a solution or not. Then I run a nested query to generate a SAT problem that either does or does not have a solution (depending on the original coin flip). Then I hand you the problem, and you have to guess whether it has a solution or not. I check your solution using a query to find the solution to the problem.

If you suck at solving SAT problems, your best bet might just be to guess that there's a 50% chance that the problem is solvable. You could get this kind of answer by refactoring the complicated nested nested query model into a non-nested model and then noting that the SAT problem itself gives you very little information about whether it is solvable (subject to your computational constraints).

I'm thinking of figuring out the math here better and then applying it to things like planning queries where your successor has a higher rationality parameter than you (an agent with rationality parameter α takes action a with probability proportional to p(a) e^(α * E[U | a]) ). The goal would be to formalize some agent that, for example, generally chooses to defer to a successor who has a higher rationality parameter, unless there is some cost for deferring, in which case it may defer or not depending on some approximation of value of information.

Your project about trading computing power for algorithmic information seems interesting and potentially related, and I'd be interested in seeing any results you come up with.

even if you still have to place some probability mass on \Bot (bottom)

Is this because you assign probability mass to inconsistent theories that you don't know are inconsistent?

Comment by jessicat on FAI Research Constraints and AGI Side Effects · 2015-06-05T02:16:51.580Z · score: 2 (2 votes) · LW · GW

Yes, something like that, although I don't usually think of it as an adversary. Mainly it's so I can ask questions like "how could a FAI model its operator so that it can infer the operator's values from their behavior?" without getting hung up on the exact representation of the model or how the model is found. We don't have any solution to this problem, even if we had access to a probabilistic program induction black box, so it would be silly to impose the additional restriction that we can't give the black box any induction problems that are too hard.

That said, bounded algorithms can be useful as inspiration, even for unbounded problems. For example, I'm currently looking at ways you could use probabilistic programs with nested queries to model Vingean reflection.

Comment by jessicat on FAI Research Constraints and AGI Side Effects · 2015-06-04T23:34:35.697Z · score: 5 (5 votes) · LW · GW

I should be specific that the kinds of results we want to get are those where you could, in principle, use a very powerful computer instead of a hypercomputer. Roughly, the unbounded algorithm should be a limit of bounded algorithms. The kinds of allowed operations I am thinking about include:

  • Solomonoff induction
  • optimizing an arbitrary function
  • evaluating an arbitrary probabilistic program
  • finding a proof of X if one exists
  • solving an infinite system of equations that is guaranteed to have a solution

In all these cases, you can get arbitrarily good approximations using bounded algorithms, although they might require a very large amount of computation power. I don't think things like this would lead to contradictions if you did them correctly.

Comment by jessicat on FAI Research Constraints and AGI Side Effects · 2015-06-04T06:01:12.669Z · score: 3 (3 votes) · LW · GW

Thanks for the detailed response! I do think the framework can still work with my assumptions. The way I would model it would be something like:

  1. In the first stage, we have G->F_remaining (the research to an AGI->FAI solution) and G_remaining (the research to enough AGI for UFAI). I expect G->F_remaining < G_remaining, and a relatively low leakage ratio.
  2. After we have AGI->FAI, we have F_remaining (the research for the AGI to input to the AGI->FAI) and G_remaining (the research to enough AGI for UFAI). I expect F_remaining > G_remaining, and furthermore I expect the leakage ratio to be high enough that we are practically guaranteed to have enough AGI capabilities for UFAI before FAI (though I don't know how long before). Hence the strategic importance of developing AGI capabilities in secret, and not having them lying around for too long in too many hands. I don't really see a way of avoiding this: the alternative is to have enough research to create FAI but not a paperclip maximizer, which seems implausible (though it would be really nice if we could get this state!).

Also, it seems I had misinterpreted the part about r_g and r_f, sorry about that!

Comment by jessicat on FAI Research Constraints and AGI Side Effects · 2015-06-03T22:02:59.102Z · score: 13 (13 votes) · LW · GW

This model seems quite a bit different from mine, which is that FAI research is about reducing FAI to an AGI problem, and solving AGI takes more work than doing this reduction.

More concretely, consider a proposal such as Paul's reflective automated philosophy method, which might be able to be implemented using episodic reinforcement learning. This proposal has problems, and it's not clear that it works -- but if it did, then it would have reduced FAI to a reinforcement learning problem. Presumably, any implementations of this proposal would benefit from any reinforcement learning advances in the AGI field.

Of course, even if a proposal like this works, it might require better or different AGI capabilities from UFAI projects. I expect this to be true for black-box FAI solutions such as Paul's. This presents additional strategic difficulties. However, I think the post fails to accurately model these difficulties. The right answer here is to get AGI researchers to develop (and not publish anything about) enough AGI capabilities for FAI without running a UFAI in the meantime, even though the capabilities to run it exist.

Assuming that this reflective automated philosophy system doesn't work, it could still be the case that there is a different reduction from FAI to AGI that can be created through armchair technical philosophy. This is often what MIRI's "unbounded solutions" research is about: finding ways you could solve FAI if you had a hypercomputer. Once you find a solution like this, it might be possible to define it in terms of AGI capabilities instead of hypercomputation, and at that point FAI would be reduced to an AGI problem. We haven't put enough work into this problem to know that a reduction couldn't be created in, say, 20 years by 20 highly competent mathematician-philosophers.

In the most pessimistic case (which I don't think is too likely), the task of reducing FAI to an AGI problem is significantly harder than creating AGI. In this case, the model in the post seems to be mostly accurate, except that it neglects the fact that serial advances might be important (so we get diminishing marginal progress towards FAI or AGI per additional researcher in a given year).

Comment by jessicat on Debunking Fallacies in the Theory of AI Motivation · 2015-05-07T17:24:39.607Z · score: 1 (1 votes) · LW · GW

I agree that choosing an action randomly (with higher probability for good actions) is a good way to create a fuzzy satisficer. Do you have any insights into how to:

  1. create queries for planning that don't suffer from "wishful thinking", with or without nested queries. Basically the problem is that if I want an action conditioned on receiving a high utility (e.g. we have a factor on the expected utility node U equal to e^(alpha * U) ), then we are likely to choose high-variance actions while inferring that the rest of the model works out such that these actions return high utilities

  2. extend this to sequential planning without nested nested nested nested nested nested queries

Comment by jessicat on Debunking Fallacies in the Theory of AI Motivation · 2015-05-07T17:18:14.599Z · score: 3 (3 votes) · LW · GW

I would tend to say that you should be training a conceptual map of the world before you install anything like action-taking capability or a goal system of any kind.

This seems like a sane thing to do. If this didn't work, it would probably be for one of these reasons:

  1. lack of conceptual convergence and human understandability; this seems somewhat likely and is probably the most important unknown

  2. our conceptual representations are only efficient for talking about things we care about because we care about these things; a "neutral" standard such as resource-bounded Solomonoff induction will do a horrible job of learning the things we care about for "no free lunch" reasons. I find this plausible but not too likely (it seems like it ought to be possible to "bootstrap" an importance metric for deciding where in the concept space to allocate resources).

  3. we need the system to have a goal system in order to self-improve to the point of creating this conceptual map. I find this a little likely (this is basically the question of whether we can create something that manages to self-improve without needing goals; it is related to low impact).

Of course, I also tend to say that you should just use a debugged (ie: cured of systematic faults) model of human evaluative processes for your goal system, and then use actual human evaluations to train the free parameters, and then set up learning feedback from the learned concept of "human" to the free-parameter space of the evaluation model.

I agree that this is a good idea. It seems like the main problem here is that we need some sort of "skeleton" of a normative human model whose parts can be filled in empirically, and which will infer the right goals after enough training.

Comment by jessicat on Debunking Fallacies in the Theory of AI Motivation · 2015-05-07T08:36:16.929Z · score: 2 (2 votes) · LW · GW

Regularization is already a part of training any good classifier.

A technical point here: we don't learn a raw classifier, because that would just learn human judgments. In order to allow the system to disagree with a human, we need to use some metric other than "is simple and assigns high probability to human judgments".

For something like FAI, I want a concept-learning algorithm that will look at the world in this naturalized, causal way (which is what normal modelling shoots for!), and that will model correctly at any level of abstraction or under any available set of features, and will be able to map between these levels as the human mind can.

I totally agree that a good understanding of multi-level models is important for understanding FAI concept spaces. I don't have a good understanding of multi-level maps; we can definitely see them as useful constructs for bounded reasoners, but it seems difficult to integrate higher levels into the goal system without deciding things about the high-level map a priori so you can define goals relative to this.

Comment by jessicat on Debunking Fallacies in the Theory of AI Motivation · 2015-05-07T08:27:40.265Z · score: 4 (4 votes) · LW · GW

Okay, thanks a lot for the detailed response. I'll explain a bit about where I'm coming from with understanding the concept learning problem:

  • I typically think of concepts as probabilistic programs eventually bottoming out in sense data. So we have some "language" with a "library" of concepts (probabilistic generative models) that can be combined to create new concepts, and combinations of concepts are used to explain complex sensory data (for example, we might compose different generative models at different levels to explain a picture of a scene). We can (in theory) use probabilistic program induction to have uncertainty about how different concepts are combined. This seems like a type of swarm relaxation, due to probabilistic constraints being fuzzy. I briefly skimmed through the McClelland chapter and it seems to mesh well with my understanding of probabilistic programming.
  • But, when thinking about how to create friendly AI, I typically use the very conservative assumptions of statistical learning theory, which give us guarantees against certain kinds of overfitting but no guarantee of proper behavior on novel edge cases. Statistical learning theory is certainly too pessimistic, but there isn't any less pessimistic model for what concepts we expect to learn that I trust. While the view of concepts as probabilistic programs in the previous bullet point implies properties of the system other than those implied by statistical learning theory, I don't actually have good formal models of these, so I end up using statistical learning theory.

I do think that figuring out if we can get more optimistic (but still justified) assumptions is good. You mention empirical experience with swarm relaxation as a possible way of gaining confidence that it is learning concepts correctly. Now that I think about it, bad handling of novel edge cases might be a form of "meta-overfitting", and perhaps we can gain confidence in a system's ability to deal with context shifts by having it go through a series of context shifts well without overfitting. This is the sort of thing that might work, and more research into whether it does is valuable, but it still seems worth preparing for the case where it doesn't.

Anyway, thanks for giving me some good things to think about. I think I see how a lot of our disagreements mostly come down to how much convergence we expect from different concept learning systems. For example, if "psychological manipulation" is in some sense a natural category, then of course it can be added as a weak (or even strong) constraint on the system.
I'll probably think about this a lot more and eventually write up something explaining reasons why we might or might not expect to get convergent concepts from different systems, and the degree to which this changes based on how value-laden a concept is.

There is a lot of talk that can be given about how that complex union takes place, but here is one very important takeaway: it can always be made to happen in such a way that there will not, in the future, be any Gotcha cases (those where you thought you did completely merge the two concepts, but where you suddenly find a peculiar situation where you got it disastrously wrong). The reason why you won't get any Gotcha cases is that the concepts are defined by large numbers of weak constraints, and no strong constraints -- in such systems, the effect of smaller and smaller numbers of concepts can be guaranteed to converge to zero. (This happens for the same reason that the effect of smaller and smaller sub-populations of the molecules in a gas will converge to zero as the population sizes go to zero).

I didn't really understand a lot of what you said here. My current model is something like "if a concept is defined by lots of weak constraints, then lots of these constraints have to go wrong at once for the concept to go wrong, and we think this is unlikely due to induction and some kind of independence/uncorrelatedness assumption"; is this correct? If this is the right understanding, I think I have low confidence that errors in each weak constraint are in fact not strongly correlated with each other.

Comment by jessicat on Debunking Fallacies in the Theory of AI Motivation · 2015-05-06T03:29:32.875Z · score: 3 (3 votes) · LW · GW

We can do something like list a bunch of examples, have humans label them, and then find the lowest Kolmogorov complexity concept that agrees with human judgments in, say, 90% of cases. I'm not sure if this is what you mean by "normatively correct", but it seems like a plausible concept that multiple concept learning algorithms might converge on. I'm still not convinced that we can do this for many value-laden concepts we care about and end up with something matching CEV, partially due to complexity of value. Still, it's probably worth systematically studying the extent to which this will give the right answers for non-value-laden concepts, and then see what can be done about value-laden concepts.
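As a rough sketch of that procedure: Kolmogorov complexity is uncomputable, so the code below substitutes the length of a concept's written description as a crude stand-in, and searches a tiny hand-written candidate list rather than all programs. The candidates, labels, and function names are all made up for illustration.

```python
def agreement(concept, examples):
    """Fraction of labeled examples on which the concept matches the human label."""
    return sum(concept(x) == label for x, label in examples) / len(examples)

def simplest_agreeing_concept(candidates, examples, threshold=0.9):
    """Return the shortest-description candidate that agrees often enough.

    Description length is only a computable proxy for Kolmogorov complexity.
    """
    for description, concept in sorted(candidates, key=lambda c: len(c[0])):
        if agreement(concept, examples) >= threshold:
            return description
    return None

# Hypothetical candidate concepts over integers.
candidates = [
    ("x > 0", lambda x: x > 0),
    ("x % 2 == 0", lambda x: x % 2 == 0),
    ("x > 0 and x != 7", lambda x: x > 0 and x != 7),
    ("True", lambda x: True),
]

# Hypothetical human labels: "positive", with one idiosyncratic judgment at 7.
examples = [(x, x > 0 and x != 7) for x in range(-10, 11)]

chosen = simplest_agreeing_concept(candidates, examples)
print(chosen)
```

With a 90% threshold the search settles on the simpler concept and treats the judgment at 7 as noise; demanding 100% agreement selects the longer description instead. Whether the discarded 10% is noise or something important is exactly what is hard for value-laden concepts.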

Comment by jessicat on Debunking Fallacies in the Theory of AI Motivation · 2015-05-05T22:24:34.546Z · score: 8 (8 votes) · LW · GW

Thanks for your response.

The AI can quickly assess the "forcefulness" of any candidate action plan by asking itself whether the plan will involve giving choices to people vs. forcing them to do something whether they like it or not. If a plan is of the latter sort, more care is needed, so it will canvass a sample of people to see if their reactions are positive or negative.

So, I think this touches on the difficult part. As humans, we have a good idea of what "giving choices to people" vs. "forcing them to do something" looks like. This concept would need to resolve some edge cases, such as putting psychological manipulation in the "forceful" category (even though it can be done with only text). A sufficiently advanced AI's concept space might contain a similar concept. But how do we pinpoint this concept in the AI's concept space? Very likely, the concept space will be very complicated and difficult for humans to understand. It might very well contain concepts that look a lot like the "giving choices to people" vs. "forcing them to do something" distinction on multiple examples, but are different in important ways. We need to pinpoint it in order to make this concept part of the AI's decision-making procedure.

It will also be able to model people (as it must be able to do, because all intelligent systems must be able to model the world pretty accurately or they don't qualify as 'intelligent') so it will probably have a pretty shrewd idea already of whether people will react positively or negatively toward some intended action plan.

This seems pretty similar to Paul's idea of a black-box human in the counterfactual loop. I think this is probably a good idea, but the two problems here are (1) setting up this (possibly counterfactual) interaction in a way that it approves a large class of good plans and rejects almost all bad plans (see the next section), and (2) having a good way to predict the outcome of this interaction usually without actually performing it. While we could say that (2) will be solved by virtue of the superintelligence being a superintelligence, in practice we'll probably get AGI before we get uploads, so we'll need some sort of semi-reliable way to predict humans without actually simulating them. Additionally, the AI might need to self-improve to be anywhere near smart enough to consider this complex hypothetical, and so we'll need some kind of low-impact self-improvement system. Again, I think this is probably a good idea, but there are quite a lot of issues with it, and we might need to do something different in practice. Paul has written about problems with black-box approaches based on predicting counterfactual humans here and here. I think it's a good idea to develop both black-box solutions and white-box solutions, so we are not over-reliant on the assumptions involved in one or the other.

In all of that procedure I just described, why would the explanation of the plans to the people be problematic? People will ask questions about what the plans involve. If there is technical complexity, they will ask for clarification. If the plan is drastic there will be a world-wide debate, and some people who find themselves unable to comprehend the plan will turn to more expert humans for advice.

What language will people's questions about the plans be in? If it's a natural language, then the AI must be able to translate its concept space into the human concept space, and we have to solve a FAI-complete problem to do this. If it's a more technical language, then humans themselves must be able to look at the AI's concept space and understand it. Whether this is possible very much depends on how transparent the AI's concept space is. Something like deep learning is likely to produce concepts that are very difficult for humans to understand, while probabilistic programming might produce more transparent models. How easy it is to make transparent AGI (compared to opaque AGI) is an open question.

We should also definitely be wary of a decision rule of the form "find a plan that, if explained to humans, would cause humans to say they understand it". Since people are easy to manipulate, raw optimization for this objective will produce psychologically manipulative plans that people will incorrectly approve of. There needs to be some way to separate "optimize for the plan being good" from "optimize for people thinking the plan is good when it is explained to them", or else some way of ensuring that humans' judgments about these plans are accurate.

Again, it's quite plausible that the AI's concept space will contain some kind of concept that distinguishes between these different types of optimization; however, humans will need to understand the AI's concept space in order to pinpoint this concept so it can be integrated into the AI's decision rule.

I should mention that I don't think that these black-box approaches to AI control are necessarily doomed to failure; rather, I'm pointing out that there are lots of unresolved gaps in our knowledge of how they can be made to work, and it's plausible that they are too difficult in practice.

Comment by jessicat on Debunking Fallacies in the Theory of AI Motivation · 2015-05-05T04:35:26.527Z · score: 13 (13 votes) · LW · GW

Thanks for posting this; I appreciate reading different perspectives on AI value alignment, especially from AI researchers.

But, truthfully, it would not require a ghost-in-the-machine to reexamine the situation if there was some kind of gross inconsistency with what the humans intended: there could be some other part of its programming (let’s call it the checking code) that kicked in if there was any hint of a mismatch between what the AI planned to do and what the original programmers were now saying they intended. There is nothing difficult or intrinsically wrong with such a design.

If there is some good way of explaining plans to programmers such that programmers will only approve of non-terrible plans, then yes, this works. However, this is where most of the problem lies. The AI will likely have a concept space that does not match a human's concept space, so it will need to do some translation between the two spaces in order to produce something the programmers can understand. But this requires (1) learning the human concept space and (2) translating the AI's representation of the situation into the human's concept space (as in ontological crises). This problem is FAI-complete: given a solution to this, we could learn the human's concept of "good" and then find possible worlds that map to this "good" concept. See also Eliezer's reply to Holden on tool AI.

It might not be necessary to solve the problem in full generality: perhaps we can create systems that plan well in limited domains while avoiding edge cases. But it is also quite difficult to do this without severe restrictions in the system's generality.

The motivation and goal management (MGM) system would be expected to use the same kind of distributed, constraint relaxation mechanisms used in the thinking process (above), with the result that the overall motivation and values of the system would take into account a large degree of context, and there would be very much less of an emphasis on explicit, single-point-of-failure encoding of goals and motivation.

I'm curious how something like this works. My current model of "swarm relaxation" is something like a Markov random field. One of my main paradigms for thinking about AI is probabilistic programs, which are quite similar to Markov random fields (but more general). I know that weak constraint systems are quite useful for performing Bayesian inference in a way that takes context into account. With a bit of adaptation, it's possible to define probabilistic programs that pick actions that lead to good outcomes (by adding a "my action" node and a weak constraint on other parts of the probabilistic model satisfying certain goals; this doesn't exactly work because it leads to "wishful thinking", but in principle it can be adapted). But, I don't think this is really that different from defining a probabilistic world model, defining a utility function over it, and then taking actions that are more likely to lead to high expected utility. Given this, you probably have some other model in mind for how values can be integrated into a weak constraint system, and I'd like to read about it.
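To make the "wishful thinking" issue concrete, here is a minimal sketch under assumptions of my own: a two-state, two-action toy model (the weather states, actions, and utilities are all hypothetical), with the weak constraint modeled as a factor exp(utility). It compares conditioning on the soft constraint against ordinary expected utility maximization.

```python
import math
from itertools import product

# Hypothetical toy model: the hidden state is NOT under the agent's control.
P_STATE = {"sunny": 0.2, "rainy": 0.8}
ACTIONS = ["picnic", "stay_home"]
UTILITY = {
    ("sunny", "picnic"): 3.0,
    ("rainy", "picnic"): -3.0,
    ("sunny", "stay_home"): 0.0,
    ("rainy", "stay_home"): 0.0,
}


def expected_utility(action):
    """Ordinary expected utility of an action under the world model."""
    return sum(p * UTILITY[(state, action)] for state, p in P_STATE.items())


def soft_constraint_posterior():
    """Joint posterior over (state, action) after adding a 'my action' node
    with a uniform prior and a weak constraint factor exp(utility)."""
    action_prior = 1.0 / len(ACTIONS)
    weights = {
        (state, action): P_STATE[state] * action_prior * math.exp(UTILITY[(state, action)])
        for state, action in product(P_STATE, ACTIONS)
    }
    total = sum(weights.values())
    return {key: w / total for key, w in weights.items()}


def marginal(posterior, index):
    """Marginalize the joint posterior onto one coordinate (0=state, 1=action)."""
    out = {}
    for key, p in posterior.items():
        out[key[index]] = out.get(key[index], 0.0) + p
    return out


posterior = soft_constraint_posterior()
action_posterior = marginal(posterior, 1)
state_posterior = marginal(posterior, 0)
best_by_expected_utility = max(ACTIONS, key=expected_utility)

print("expected-utility choice:", best_by_expected_utility)
print("soft-constraint action posterior:", action_posterior)
print("soft-constraint state posterior:", state_posterior)
```

In this toy model the soft constraint favors the picnic because conditioning also shifts belief toward "sunny", which the agent has no way of causing, while expected utility maximization stays home. That gap is the part that would need to be adapted away.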

But if it is also programmed to utterly ignore that fallibility—for example, when it follows its compulsion to put everyone on a dopamine drip, even though this plan is clearly a result of a programming error—then we must ask the question: how can the machine be both superintelligent and able to ignore a gigantic inconsistency in its reasoning?

We need to make a model that the AI can use in which its goal system "might be wrong". It needs a way to look at evidence and conclude that, due to it, some goal is more or less likely to be the correct one. This is highly nontrivial. The model needs to somehow connect "ought"s to "is"s in a probabilistic, possibly causal fashion. While, relative to a supergoal, subgoals can be reweighted based on new information using standard Bayesian utility maximization, I know of no standard the AI could use to revise its supergoal based on new information. If you have a solution to the corrigibility problem in mind, I'd like to hear it.

Another way of stating the problem is: if you revise a goal based on some evidence, then either you had some reason for doing this or not. If so, then this reason must be expressed relative to some higher goal, and we either never change this higher goal or (recursively) need to explain why we changed it. If not, then we need some other standard for choosing goals other than comparing them to a higher goal. I see no useful way of having a non-fixed supergoal.

if the AGI is going to throw a wobbly over the dopamine drip plan, what possible reason is there to believe that it did not do this on other occasions? Why would anyone suppose that this AGI ignored an inconvenient truth on only this one occasion?

I think the difference here is that, if only the supergoal is "wrong" but everything else about the system is highly optimized towards accomplishing the supergoal, then the system won't stumble along the way; it will (by definition) do whatever accomplishes its supergoal well. So, "having the wrong supergoal" is quite different from most other reasoning errors in that it won't actually prevent the AI from taking over the world.

Knowing about the logical train wreck in its design, the AGI is likely to come to the conclusion that the best thing to do is seek a compromise and modify its design so as to neutralize the Doctrine of Logical Infallibility. The best way to do this is to seek a new design that takes into account as much context—as many constraints—as possible.

It seems like you're equating logical infallibility about facts (including facts about the world and mathematical facts) with logical infallibility about values. Of course any practical system will need to deal with uncertainty about the world and logic, probably using something like a weak constraint system. But it's totally possible to create a system that has this sort of uncertainty without any uncertainty about its supergoal.

When you use the phrase "the best way to do this", you are implicitly referring to some goal that weak constraint systems satisfy better than fixed-supergoal systems, but what sort of goal are we talking about here? If the original system had a fixed supergoal, then this will be exactly that fixed goal, so we'll end up with a mishmash of the original goal and a weak constraint system that reconfigures the universe to satisfy the original goal.

Comment by jessicat on A quick sketch on how the Curry-Howard Isomorphism kinda appears to connect Algorithmic Information Theory with ordinal logics · 2015-04-20T20:02:02.536Z · score: 2 (2 votes) · LW · GW

So, you can compress a list of observations about which Turing machines halt by starting with a uniform prior over Chaitin's omega. This can lead to quite a lot of compression: the information of whether the first n Turing machines halt consists of n bits, but only requires log(n) bits of Chaitin's omega. If we saw whether more Turing machines halted, we would also uncover more bits of Chaitin's omega. Is this the kind of thing you are thinking of?
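One way to see why roughly log(n) bits can suffice is a counting argument: if you know how many of the first n machines halt (a number that fits in about log2(n+1) bits), you can run all n machines in parallel until that many have halted, and the rest must be non-halting. The sketch below is only an illustration of that argument. It uses toy stand-ins for Turing machines (a hypothetical table of halting times) and the halting count rather than literal bits of Chaitin's omega.

```python
from typing import List, Optional

# Hypothetical toy machines: the step at which each halts, or None if it never halts.
# The decoder below never reads this table directly; it only "runs" machines.
HALT_TIMES: List[Optional[int]] = [3, None, 1, None, 7, 2, None, 5]


def halts_within(machine: int, steps: int) -> bool:
    """Simulate running a machine for a bounded number of steps."""
    halt_time = HALT_TIMES[machine]
    return halt_time is not None and halt_time <= steps


def recover_halting_bits(n: int, halting_count: int) -> List[int]:
    """Recover all n halting bits from just the count of halting machines."""
    bits = [0] * n
    steps = 0
    # Dovetail: keep extending the step budget until the known count is reached.
    while sum(bits) < halting_count:
        steps += 1
        for machine in range(n):
            if halts_within(machine, steps):
                bits[machine] = 1
    return bits


# The count is the only side information handed to the decoder.
true_count = sum(1 for t in HALT_TIMES if t is not None)
recovered = recover_halting_bits(len(HALT_TIMES), true_count)
print(recovered)
```

Note that the decoder terminates only because the count it was given is correct; with a count that is too high it would loop forever, which is where the uncomputability hides.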

I guess there's another question of how any of this makes sense if the universe is computable. We can still use information about which Turing machines halt in part of our generative model for a computable universe, even though "x doesn't halt" is never actually observed.

Perhaps you could make a statement like: Solomonoff induction wins on computable universes for the usual reason, and it doesn't lose too many bits on uncomputable universes in some circumstances because it does at least as well as something that has a uniform prior over Chaitin's omega.

Comment by jessicat on A quick sketch on how the Curry-Howard Isomorphism kinda appears to connect Algorithmic Information Theory with ordinal logics · 2015-04-20T02:29:14.675Z · score: 4 (4 votes) · LW · GW

One part I'm not clear on is how the empirical knowledge works. The equivalent of "kilograms of mass" might be something like bits of Chaitin's omega. If you have n bits of Chaitin's omega, you can solve the halting problem for any Turing machine of length up to n. But, while you can get lower bounds on Chaitin's omega by running Turing machines and seeing which halt, you can't actually learn upper bounds on Chaitin's omega except by observing uncomputable processes (for example, a halting oracle confirming that some Turing machine doesn't halt). So unless your empirical knowledge is coming from an uncomputable source, you shouldn't expect to gain any more bits of Chaitin's omega.

In general, if we could recursively enumerate all non-halting Turing machines, then we could decide whether M halts by running M in parallel with a process that enumerates non-halting machines until finding M. If M halts, then we eventually find that it halts; if it doesn't halt, then we eventually find that it doesn't halt. So this recursive enumeration will give us an algorithm for the halting problem. I'm trying to understand how the things you're saying could give us more powerful theories from empirical data without allowing us to recursively enumerate all non-halting Turing machines.
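The parallel-search argument above can be sketched as follows. This is a toy illustration only: the machines are a hypothetical table of halting times, and `enumerate_non_halting` stands in for the enumeration that, for real Turing machines, no computable process can provide.

```python
from typing import Iterator, List, Optional

# Hypothetical toy machines: halting step, or None if the machine never halts.
HALT_TIMES: List[Optional[int]] = [4, None, 2, None]


def enumerate_non_halting() -> Iterator[int]:
    """Assumed oracle: lists every non-halting machine eventually.
    For real Turing machines this enumeration cannot be computed."""
    for machine, halt_time in enumerate(HALT_TIMES):
        if halt_time is None:
            yield machine


def decide_halts(machine: int) -> bool:
    """Interleave one simulation step of `machine` with one step of the enumeration."""
    non_halting = enumerate_non_halting()
    steps = 0
    while True:
        steps += 1
        # Branch 1: simulate the machine for one more step.
        halt_time = HALT_TIMES[machine]
        if halt_time is not None and halt_time <= steps:
            return True
        # Branch 2: advance the enumeration of non-halting machines.
        if next(non_halting, None) == machine:
            return False


decisions = [decide_halts(m) for m in range(len(HALT_TIMES))]
print(decisions)
```

Exactly one of the two branches fires for every machine, so the loop always terminates, which is why such an enumeration would amount to a halting oracle.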

Comment by jessicat on Why isn't the following decision theory optimal? · 2015-04-16T05:14:58.999Z · score: 6 (6 votes) · LW · GW

There's one scenario described in this paper on which this decision theory gives in to blackmail:

The Retro Blackmail problem. There is a wealthy intelligent system and an honest AI researcher with access to the agent’s original source code. The researcher may deploy a virus that will cause $150 million each in damages to both the AI system and the researcher, and which may only be deactivated if the agent pays the researcher $100 million. The researcher is risk-averse and only deploys the virus upon becoming confident that the agent will pay up. The agent knows the situation and has an opportunity to self-modify after the researcher acquires its original source code but before the researcher decides whether or not to deploy the virus. (The researcher knows this, and has to factor this into their prediction.)

Comment by jessicat on Second-Order Logic: The Controversy · 2015-04-07T22:17:33.911Z · score: 0 (0 votes) · LW · GW

It's possible to compute whether each machine halts using an inductive Turing machine like this:

initialize output tape to all zeros, representing the assertion that no Turing machine halts
for i = 1 to infinity
    for j = 1 to i
        run Turing machine j for i steps
        if it halts: set bit j in the output tape to 1

Is this what you meant? If so, I'm not sure what this has to do with observing loops.

When you say that every nonhalting Turing machine has some kind of loop, do you mean the kind of loop that many halting Turing machines also contain?
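For concreteness, the inductive procedure above can be sketched in Python by truncating the outer loop at a finite stage. The toy machines (a hypothetical table of halting times) and the function names are my own assumptions. The point is that each output bit changes at most once, so the tape converges in the limit, but no finite stage certifies that a 0 is final.

```python
from typing import List, Optional

# Hypothetical toy machines: halting step, or None if the machine never halts.
HALT_TIMES: List[Optional[int]] = [3, None, 1, None, 7]


def halts_within(machine: int, steps: int) -> bool:
    """Simulate running a machine for a bounded number of steps."""
    halt_time = HALT_TIMES[machine]
    return halt_time is not None and halt_time <= steps


def output_tape_after(stages: int) -> List[int]:
    """State of the output tape after the outer loop has run for `stages` iterations."""
    tape = [0] * len(HALT_TIMES)  # start by asserting that nothing halts
    for i in range(1, stages + 1):
        # Stage i considers the first i machines (capped at the toy table size).
        for j in range(min(i, len(HALT_TIMES))):
            if halts_within(j, i):
                tape[j] = 1
    return tape


for stages in (2, 3, 7, 100):
    print(stages, output_tape_after(stages))
```

The tape is correct from stage 7 onward in this toy table, but an observer who sees only the tape at stage 3 cannot tell whether the remaining zeros will ever flip.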

Comment by jessicat on New forum for MIRI research: Intelligent Agent Foundations Forum · 2015-03-23T06:31:01.811Z · score: 5 (5 votes) · LW · GW

Thanks for the response. I should note that we don't seem to disagree on the fact that a significant portion of AI safety research should be informed by practical considerations, including current algorithms. I'm currently getting a master's degree in AI while doing work for MIRI, and a substantial portion of my work at MIRI is informed by my experience with more practical systems (including machine learning and probabilistic programming). The disagreement is more that you think that unbounded solutions are almost entirely useless, while I think they are quite useful.

Rather we are faced with a dizzying array of special purpose intelligences which in no way resemble general models like AIXI, and the first superintelligences are likely to be some hodge-podge integration of multiple techniques.

My intuition is that if you are saying that these techniques (or a hodgepodge of them) work, you are referring to some kind of criteria that they perform well on in different situations (e.g. ability to do supervised learning). Sometimes, we can prove that the algorithms perform well (as in statistical learning theory); other times, we can guess that they will perform on future data based on how they perform on past data (while being wary of context shifts). We can try to find ways of turning things that satisfy these criteria into components in a Friendly AI (or a safe utility satisficer etc.), without knowing exactly how these criteria are satisfied.

Like, this seems similar to other ways of separating interface from implementation. We can define a machine learning algorithm without paying too much attention to what programming language it is programmed in, or how exactly the code gets compiled. We might even start from pure probability theory and then add independence assumptions when they increase performance. Some of the abstractions are leaky (for example, we might optimize our machine learning algorithm for good cache performance), but we don't need to get bogged down in the details most of the time. We shouldn't completely ignore the hardware, but we can still usefully abstract it.

What does that mean in terms of a MIRI research agenda? Revisit boxing. Evaluate experimental setups that allow for a presumed-unfriendly machine intelligence but nevertheless has incentive structures or physical limitations which prevent it from going haywire. Devise traps, boxes, and tests for classifying how dangerous a machine intelligence is, and containment protocols. Develop categories of intelligences which lack foundation social skills critical to manipulating its operators. Etc. Etc.

I think this stuff is probably useful. Stuart Armstrong is working on some of these problems on the forum. I have thought about the "create a safe genie, use it to prevent existential risks, and have human researchers think about the full FAI problem over a long period of time" route, and I find it appealing sometimes. But there are quite a lot of theoretical issues in creating a safe genie!

Comment by jessicat on New forum for MIRI research: Intelligent Agent Foundations Forum · 2015-03-22T18:26:56.661Z · score: 7 (7 votes) · LW · GW

Learning how to create even a simple recommendation engine whose output is constrained by the values of its creators would be a large step forward and would help society today.

I think something showing how to do value learning on a small scale like this would be on topic. It might help to expose the advantages and disadvantages of algorithms like inverse reinforcement learning.

I also agree that, if there are more practical applications of AI safety ideas, this will increase interest and resources devoted to AI safety. I don't really see those applications yet, but I will look out for them. Thanks for bringing this to my attention.

it is demonstrably not the case in history that the fastest way to develop a solution is to ignore all practicalities and work from theory backwards

I don't have a great understanding of the history of engineering, but I get the impression that working from the theory backwards can often be helpful. For example, Turing developed the basics of computer science before sufficiently general computers existed.

My current impression is that solving FAI with a hypercomputer is a fundamentally easier problem than solving it with a bounded computer, and it's hard to say much about the second problem if we haven't made steps towards solving the first one. On the other hand, I do think that concepts developed in the AI field (such as statistical learning theory) can be helpful even for creating unbounded solutions.

AIXI showed that all the complexity of AGI lies in the practicalities, because the pure uncomputable theory is dead simple but utterly divorced from practice.

I would really like it if the pure uncomputable theory of Friendly AI were dead simple!

Anyway, AIXI has been used to develop more practical algorithms. I definitely approach many FAI problems with the mindset that we're going to eventually need to scale this down, and this makes issues like logical uncertainty a lot more difficult. In fact, Paul Christiano has written about tractable logical uncertainty algorithms, which is a form of "scaling down an intractable theory". But it helped to have the theory in the first place before developing this.

an ignore-all-practicalities theory-first approach is useless until it nears completion

Solutions that seem to work for practical systems might fail for superintelligence. For example, perhaps induction can yield acceptable practical solutions for weak AIs, but does not necessarily translate to new contexts that a superintelligence might find itself in (where it has to make pivotal decisions without training data for these types of decisions). But I do think working on these is still useful.

My current trajectory places the first AGI at 10 to 15 years out, and the first self-improving superintelligence shortly thereafter. Will MIRI have practical results in that time frame?

I consider AGI in the next 10-15 years fairly unlikely, but it might be worth having FAI half-solutions by then, just in case. Unfortunately I don't really know a good way to make half-solutions. I would like to hear if you have a plan for making these.

Comment by jessicat on New forum for MIRI research: Intelligent Agent Foundations Forum · 2015-03-22T02:40:11.072Z · score: 7 (7 votes) · LW · GW

I think a post saying something like "Deep learning architectures are/are not able to learn human values because of reasons X, Y, Z" would definitely be on topic. As an example of something like this, I wrote a post on the safety implications of statistical learning theory. However, an article about how deep learning algorithms are performing on standard machine learning tasks is not really on topic.

I share your sentiment that safety research is not totally separate from other AI research. But I think there is a lot to be done that does not rely on the details of how practical algorithms work. For example, we could first create a Friendly AI design that relies on Solomonoff induction, and then ask to what extent practical algorithms (like deep learning) can predict bits well enough to be substituted for Solomonoff induction in the design. The practical algorithms are more of a concern when we already have a solution that uses unbounded computing power and are trying to scale it down to something we can actually run.

Comment by jessicat on Identity and quining in UDT · 2015-03-17T21:05:35.363Z · score: 4 (4 votes) · LW · GW

This is an interesting approach. The way I'm currently thinking of this is that you ask what agent a UDT would design, and then do what that agent does, and vary what type an agent is between the different designs. Is this correct?

Consider the anti-Newcomb problem with Omega's simulation involving equation (2)

So is this equation (2) with P replaced with something else?

However, the computing power allocated for evaluation the logical expectation value in (2) might be sufficient to suspect P's output might be an agent reasoning based on (2).

I don't understand this sentence.

Comment by jessicat on Anatomy of Multiversal Utility Functions: Tegmark Level IV · 2015-02-11T19:34:57.686Z · score: 1 (1 votes) · LW · GW

It still seems like this is very much affected by the measure you assign to different game of life universes, and that the measure strongly depends on f.

Suppose we want to set f to control the agent's behavior, so that when it sees sensory data s, it takes silly action a(s), where a is a short function. To work this way, f will map game of life states in which the agent has seen s and should take action a(s) to binary strings that have greater measure, compared to game of life states in which the agent has seen s and should take some other action. I think this is almost always possible due to the agent's partial information about the world: there is nearly always an infinite number of world states in which a(s) is a good idea, regardless of s. f has a compact description (not much longer than a), and it forces the agent's behavior to be equal to a(s) (except in some unrealistic cases where the agent has very good information about the world).

Comment by jessicat on Anatomy of Multiversal Utility Functions: Tegmark Level IV · 2015-02-09T08:35:56.504Z · score: 2 (2 votes) · LW · GW

Thanks for the additional explanation.

It is of similar magnitude to differences between using different universal Turing machines in the definition of the Solomonoff ensemble. These difference become negligible for agents that work with large amounts of evidence.

Hmm, I'm not sure that this is something that you can easily get evidence for or against? The 2^K factor in ordinary Solomonoff induction is usually considered fine because it can only cause you to make at most K errors. But here it's applying to utilities, which you can't get evidence for or against the same way you can for probabilities.
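The reason the 2^K factor is tolerable for probabilities can be shown with a small Bayesian mixture: a predictor given prior weight 2^-K costs the mixture at most K extra bits of cumulative log loss, however long the data is. The sketch below assumes a toy two-expert mixture over coin flips in place of Solomonoff induction; the experts, K, and the data are all hypothetical.

```python
import math
from typing import List, Sequence


def expert_log_loss_bits(p_one: float, data: Sequence[int]) -> float:
    """Cumulative log loss (in bits) of an expert that always predicts P(x=1) = p_one."""
    return -sum(math.log2(p_one if x == 1 else 1.0 - p_one) for x in data)


def mixture_log_loss_bits(experts: List[float], prior: List[float], data: Sequence[int]) -> float:
    """Cumulative log loss (in bits) of the Bayesian mixture, updating weights as data arrives."""
    weights = list(prior)
    loss = 0.0
    for x in data:
        likelihoods = [p if x == 1 else 1.0 - p for p in experts]
        p_mix = sum(w * l for w, l in zip(weights, likelihoods))
        loss -= math.log2(p_mix)
        # Standard Bayesian update of the posterior weights.
        weights = [w * l / p_mix for w, l in zip(weights, likelihoods)]
    return loss


K = 10
experts = [0.9, 0.5]                      # the good expert gets the tiny prior weight
prior = [2.0 ** -K, 1.0 - 2.0 ** -K]
data = [1] * 90 + [0] * 10

best_loss = expert_log_loss_bits(experts[0], data)
mix_loss = mixture_log_loss_bits(experts, prior, data)
print("extra bits paid by the mixture:", mix_loss - best_loss)
```

The bound holds because the mixture assigns the whole sequence at least 2^-K times the probability the good expert assigns it. Nothing analogous lets evidence wash out a 2^K factor applied to utilities, which is the worry raised above.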

f is required to be bijective, so it cannot lose or create information. Therefore, regardless of f, some programs in the Solomonoff ensemble will produce gliders and others won't.

Okay, I see how this is true. But we could design f so that it only creates gliders if the universe satisfies some silly property. It seems like this would lead us to only care about universes satisfying this silly property, so the silly property would end up being our utility function.

Comment by jessicat on Anatomy of Multiversal Utility Functions: Tegmark Level IV · 2015-02-07T19:57:36.450Z · score: 4 (4 votes) · LW · GW

I think you're essentially correct about the problem of creating a utility function that works across all different logically possible universes being important. This is kind of like what was explored in the ontological crisis paper. Also, I agree that we want to do something like find a human's "native domain" and map it to the true reality in order to define utility functions over reality.

I think using something like Solomonoff induction to find multi-level explanations is a good idea, but I don't think your specific formula works. It looks like it either doesn't handle the multi-level nature of explanations of reality (with utility functions generally defined at the higher levels and physics at the lowest level), or it relies on one of:

  1. f figuring out how to identify high-level objects (such as gliders) in physics (which may very well be a computer running the game of life in software). Then most of the work is in defining f.

  2. Solomonoff induction finding the true multi-level explanation from which we can just pick out the information at the level we want. But, this doesn't work because (a) Solomonoff induction will probably just find models of physics, not multi-level explanations, (b) even if it did (since we used something like the speed prior), we don't have reason to believe that they'll be the same multi-level explanations that humans use, (c) if we did something like only care about models that happen to contain game of life states in exactly the way we want (which is nontrivial given that some random noise could be plausibly viewed as a game of life history), we'd essentially be conditioning on a very weird event (that high-level information is directly part of physics and the game of life model you're using is exactly correct with no exceptions including cosmic rays), which I think might cause problems.

It might turn out that problem 2 isn't as much of a problem as I thought in some variant of this, so it's probably still worth exploring.

My preferred approach (which I will probably write up more formally eventually) is to use a variant of Solomonoff induction that has access to a special procedure that simulates the domain we want (in this case, a program that simulates the game of life). Then we might expect predictors that actually use this program usefully to get shorter codes, so we can perform inference to find the predictor and then look at how the predictor uses the game of life simulator in order to detect games of life in the universe. There's a problem in that there isn't that much penalty for the model to roll its own simulator (especially if the simulation is slightly different from our model due to e.g. cosmic rays), so there are a couple tricks to give models an "incentive" for actually using this simulator. Namely, we can make this procedure cheaper to call (computationally) than a hand-rolled version, or we can provide information about the game of life state that can only get accessed by the model through our simulator. I should note that both of these tricks have serious flaws.

Some questions:

In other words, the "liberated" prefers for many cells to satisfy Game of Life rules and for many cells out of these to contain gliders.

It looks like it subtracts the total number of cells, so it prefers for there to be fewer total cells satisfying the game of life rules?

This is because replacing f with g is equivalent to adjusting probabilities by bounded factor. The bound is roughly 2^K where K is the Kolmogorov complexity of f . g^-1.

I take it this is because we're using a Solomonoff prior over universe histories? I find this statement plausible but 2^K is a pretty large factor. Also, if we define f to be a completely unreasonable function (e.g. it arranges the universe in a way so that no gliders are detected, or it chooses to simulate a whole lot of gliders or not based on some silly property of the universe), then it seems like you have proven that your utility function can never be more than a factor of 2^K away from what you'd get with f.

Comment by jessicat on Compartmentalizing: Effective Altruism and Abortion · 2015-01-06T04:55:30.979Z · score: 2 (2 votes) · LW · GW

So, you can kill a person, create a new person, and raise them to be about equivalent to the original person (on average; this makes a bit more sense if we do it many times so the distribution of people, life outcomes, etc. is similar). I guess your question is, why don't we do this (aside from the cost)? A few reasons come to mind:

  1. It would contradict the person's preferences to die more than it contradicts the non-existing people's preferences to never exist.
  2. It would cause emotional suffering to people who know the person.
  3. If people knew that people were being killed in this way, they would justifiably be scared that they might be killed and work to prevent this.
  4. Living in a society requires cooperating with other members of the society by obeying rules such as not killing people (even if you buy murder offsets, which is kind of like what this is). Defection (by murdering people) might temporarily satisfy your values better, but even if this is the case, the usual reasons not to defect in iterated prisoner's dilemma apply here.
  5. It would require overriding people's moral heuristics against murder. This is a very strong moral heuristic, and it's not clear that you can do this without causing serious negative consequences.

Anyway, I highly doubt that you are in favor of murder offsets, so you must have your own reasons for this. Perhaps you could look at which ones apply to fetuses and which ones don't.

Comment by jessicat on Compartmentalizing: Effective Altruism and Abortion · 2015-01-05T03:26:55.222Z · score: 3 (3 votes) · LW · GW

I said that fetuses are replaceable, not that all people are replaceable. OP didn't argue that fetuses weren't replaceable, just that they won't get replaced in practice.

Comment by jessicat on Compartmentalizing: Effective Altruism and Abortion · 2015-01-05T02:24:57.568Z · score: 5 (5 votes) · LW · GW

I don't think you did justice to the replaceability argument. If fetuses are replaceable, then the only benefit of banning abortion is that it increases the fertility rate. However, there are far better ways to increase the fertility rate than banning abortion. For example, one could pay people to have children (and maybe give them up for adoption). So your argument is kind of like saying that since we really need farm laborers, we should allow slavery.

Comment by jessicat on "incomparable" outcomes--multiple utility functions? · 2014-12-17T01:56:03.456Z · score: 4 (4 votes) · LW · GW

I think a useful meaning of "incomparable" is "you should think a very long time before deciding between these". In situations like these, the right decision is not to immediately decide between them, but to think a lot about the decision and related issues. Sure, if someone has to make a split-second decision, they will probably choose whichever sounds better to them. But if given a long time, they might think about it a lot and still not be sure which is better.

This seems a bit similar to multiple utility functions in that if you have multiple utility functions then you might have to think a lot and resolve lots of deep philosophical issues to really determine how you should weight these functions. But even people who are only using one utility function can have lots of uncertainty about which outcome is better, and this uncertainty might slowly reduce (or be found to be intractable) if they think about the issue more. I think the outcomes would seem similarly incomparable to this person.

Comment by jessicat on Open thread, Dec. 15 - Dec. 21, 2014 · 2014-12-15T02:58:08.237Z · score: 6 (6 votes) · LW · GW

We updated on the fact that we exist. SSA does this a little too: specifically, the fact that you exist means that there is at least one observer. One way to look at it is that there is initially a constant number of souls that get used to fill in the observers of a universe. In this formulation, SIA is the result of the normal Bayesian update on the fact that soul-you woke up in a body.
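As a worked example of the soul formulation, here is a minimal sketch assuming a fixed pool of N souls and two hypothetical universes with one and two observers. Under each hypothesis, the chance that your particular soul is the one placed in a body is (number of observers)/N, and an ordinary Bayesian update on "my soul woke up" gives the SIA weighting.

```python
from fractions import Fraction
from typing import Dict


def sia_posterior(prior: Dict[str, Fraction], observers: Dict[str, int], n_souls: int) -> Dict[str, Fraction]:
    """Bayesian update on 'soul-you woke up in a body' with a fixed pool of souls."""
    # Likelihood of waking up under each hypothesis is observers / n_souls.
    unnormalized = {h: prior[h] * Fraction(observers[h], n_souls) for h in prior}
    total = sum(unnormalized.values())
    return {h: w / total for h, w in unnormalized.items()}


# Hypothetical setup: equal priors, one universe has twice as many observers.
prior = {"one_observer": Fraction(1, 2), "two_observers": Fraction(1, 2)}
observers = {"one_observer": 1, "two_observers": 2}

posterior = sia_posterior(prior, observers, n_souls=100)
print(posterior)
```

The pool size cancels in the normalization, so the posterior is proportional to prior times observer count, as long as every hypothesis has at most N observers.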

Comment by jessicat on What Peter Thiel thinks about AI risk · 2014-12-11T22:37:03.725Z · score: 38 (38 votes) · LW · GW


Question: Are you as afraid of artificial intelligence as your Paypal colleague Elon Musk?

Thiel: I'm super pro-technology in all its forms. I do think that if AI happened, it would be a very strange thing. Generalized artificial intelligence. People always frame it as an economic question, it'll take people's jobs, it'll replace people's jobs, but I think it's much more of a political question. It would be like aliens landing on this planet, and the first question we ask wouldn't be what does this mean for the economy, it would be are they friendly, are they unfriendly? And so I do think the development of AI would be very strange. For a whole set of reasons, I think it's unlikely to happen any time soon, so I don't worry about it as much, but it's one of these tail risk things, and it's probably the one area of technology that I think would be worrisome, because I don't think we have a clue as to how to make it friendly or not.

Comment by jessicat on Why I will Win my Bet with Eliezer Yudkowsky · 2014-12-02T03:11:59.038Z · score: 0 (0 votes) · LW · GW

I'm talking about the fact that humans can (and sometimes do) sort of optimize the universe. Like, you can reason about the way the universe is and decide to work on causing it to be in a certain state.

So people say they have general goals, but in reality they remain human beings with various tendencies, and continue to act according to those tendencies, and only support that general goal to the extent that it's consistent with those other behaviors.

This could very well be the case, but humans still sometimes sort of optimize the universe. Like, I'm saying it's at least possible to sort of optimize the universe in theory, and humans do this somewhat, not that humans directly use universe-optimizing to select their actions. If a way to write universe-optimizing AGIs exists, someone is likely to find it eventually.

I think it is perfectly possible to develop an AI intelligent enough to pass the Turing Test, but which still would not have anything (not even "passing the Turing Test") as a general goal that would take over its behavior and make it conquer the world.

I agree with this. There are some difficulties with self-modification (as elaborated in my other comment), but it seems probable that this can be done.

And I would expect the first AIs to be of this kind by default, because of the difficulty of ensuring that the whole of the AI's activity is ordered to one particular goal.

Seems pretty plausible. Obviously it depends on what you mean by "AI"; certainly, most modern-day AIs are this way. At the same time, this is definitely not a reason to not worry about AI risk, because (a) tool AIs could still "accidentally" optimize the universe depending on how search for self-modifications and other actions happens, and (b) we can't bet on no one figuring out how to turn a superintelligent tool AI into a universe optimizer.

I do agree with a lot of what you say: it seems like a lot of people talk about AI risk in terms of universe-optimization, when we don't even understand how to optimize functions over the universe given infinite computational power. I do think that non-universe-optimizing AIs are under-studied, that they are somewhat likely to be the first human-level AGIs, and that they will be extraordinarily useful for solving some FAI-related problems. But none of this makes the problems of AI risk go away.

Comment by jessicat on Why I will Win my Bet with Eliezer Yudkowsky · 2014-12-01T21:47:07.770Z · score: 0 (0 votes) · LW · GW

In that sense I think the orthogonality thesis will turn out to be false in practice, even if it is true in theory. It is simply too difficult to program a precise goal into an AI, because in order for that to work the goal has to be worked into every physical detail of the thing. It cannot just be a modular add-on.

I find this plausible but not too likely. There are a few things needed for a universe-optimizing AGI:

  1. really good mathematical function optimization (which you might be able to use to get approximate Solomonoff induction)

  2. a way to specify goals that are still well-defined after an ontological crisis

  3. a solution to the Cartesian boundary problem

I think it is likely that (2) and (3) will eventually be solved (or at least worked around) well enough that you can build universe-optimizing AGIs, partially on the basis that humans approximately solve these somehow and we already have tentative hypotheses about what solutions to these problems might look like. It might be the case that we can't really get (1), we can only get optimizers that work in some domains but not others. Perhaps universe-optimization (when reduced to a mathematical problem using (2) and (3)) is too difficult of a domain: we need to break the problem down into sub-problems in order to feed it to the optimizer, resulting in a tool-AI like design. But I don't think this is likely.

If we have powerful tool AIs before we get universe optimizers, this will probably be a temporary stage, because someone will figure out how to use a tool AI to design universe-optimizers someday. But your bet was about the first AGI, so this would still be consistent with you winning your bet.

Comment by jessicat on Why I will Win my Bet with Eliezer Yudkowsky · 2014-11-27T06:58:30.803Z · score: 4 (4 votes) · LW · GW

I, too, think that AIs that don't optimize a function over the universe (but might optimize one over a domain) are more likely to be safe. This is quite related to the idea of tool AI, proposed by Holden and criticized by Eliezer.

The key here seems to be creating a way to evaluate and search for self-improvements in a way that won't cause optimization over universe states. In theory, evaluation of a self-improvement might be able to be restricted to a domain: does this modification help me play chess better according to a model of the situation in which a Cartesian boundary exists and I am running on an abstract virtual machine rather than physical computers embedded in the universe?

We could separate this task into generation of possible self-improvements and tests of self-improvements (a suggester/verifier architecture). If the suggester is superintelligent and dangerous (e.g. as a consequence of being universe-optimizing), then we cannot safely test these self-improvements. This is for a few reasons:

  1. testing a self-improvement involves executing code. If the virtual machine has security flaws, then this could enable the suggester to cause the AI to rewrite into a universe-optimizer or otherwise do something dangerous.

  2. it is possible to create self-improvements that look good according to the verifier but contain information that psychologically manipulates humans (see the AI-box experiment)

  3. a self-modification could score fairly well according to the abstract evaluation function, but at the same time also behave in a universe-optimizing manner some of the time. This self-modification would be considered suboptimal by the verifier, but might still be considered relatively good.

So we can't just use any old suggester and rely on the verifier to do the heavy lifting. Luckily, it may be possible to construct a suggester that acts as if it is running on an abstract virtual machine. If the suggester does not (directly or indirectly) use universe-optimizing as a way to generate self-improvements, then it shouldn't generate universe-optimizing (therefore dangerous) self-improvements. The key gotcha here is the "indirectly" part: how do we know that the suggester isn't (e.g.) using many different heuristics to come up with improvements, where some combination of the heuristics ends up expressing something like "try creating improvements by optimizing the universe"? In other words, is universe-optimizing a somewhat useful strategy for finding improvements that good general abstract-mathematical-function-optimizers will pick up on? I don't know the answer to this question. But if we could design suggesters that don't directly or indirectly optimize a function over the universe, then maybe this will work.
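To make the shape of the suggester/verifier loop concrete, here is a toy sketch. Everything in it is an assumption made for illustration: the "modification" is a single number, the `suggester` is random local perturbation, and the `verifier` is a fixed abstract objective. It shows only the control flow, not a solution to the "indirectly" problem above.

```python
import random

# Hypothetical toy: a "self-modification" is just a float parameter, and the
# verifier scores it on an abstract objective that never references the world.

def suggester(current: float, rng: random.Random, n: int = 20) -> list[float]:
    """Propose candidate modifications by random local perturbation."""
    return [current + rng.uniform(-1.0, 1.0) for _ in range(n)]

def verifier(candidate: float) -> float:
    """Score a candidate on a fixed abstract objective (best value is 3.0)."""
    return -(candidate - 3.0) ** 2

def improve(current: float, steps: int, seed: int = 0) -> float:
    """Repeatedly ask for suggestions and keep only verified improvements."""
    rng = random.Random(seed)
    for _ in range(steps):
        best = max(suggester(current, rng), key=verifier)
        # Accept a suggestion only if the verifier scores it strictly higher.
        if verifier(best) > verifier(current):
            current = best
    return current
```

Note that the verifier here only ranks candidates; it says nothing about how the suggester produced them, which is exactly why the suggester's internals matter.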

Comment by jessicat on [Link] Physics-based anthropics? · 2014-11-15T00:03:49.325Z · score: 2 (2 votes) · LW · GW

Am working on it - as a placeholder, for many problems, one can use Stuart Armstrong's proposed algorithm of finding the best strategy according to a non-anthropic viewpoint that adds the utilities of different copies of you, and then doing what that strategy says.

I think this essentially leads to SIA. Since you're adding utilities over different copies of you, it follows that you care more about universes in which there are more copies of you. So your copies should behave as if they anticipate the probability of being in a universe containing lots of copies to be higher.
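A small worked example of why the two line up, under an assumed toy setup (two equally likely universes, one containing 1 copy of you and the other 10; the names `few` and `many` and the payoffs are hypothetical):

```python
from fractions import Fraction

# Hypothetical toy setup: two equally likely universes with different copy counts.
prior = {"few": Fraction(1, 2), "many": Fraction(1, 2)}
copies = {"few": 1, "many": 10}

def summed_utility(payoff):
    """Non-anthropic expected utility of 'every copy accepts', adding up copies."""
    return sum(prior[u] * copies[u] * payoff[u] for u in prior)

def sia_probabilities():
    """SIA: weight each universe by prior times number of copies, then normalize."""
    weights = {u: prior[u] * copies[u] for u in prior}
    total = sum(weights.values())
    return {u: w / total for u, w in weights.items()}

def sia_expected_utility(payoff):
    """Expected utility for a single copy reasoning with SIA probabilities."""
    p = sia_probabilities()
    return sum(p[u] * payoff[u] for u in p)
```

In this setup the summed utility is always the SIA expected utility times a positive constant (the total weight, here 11/2), so both rules accept or reject exactly the same bets.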

However, before assuming [stuff about the universe], you should have [observational data supporting that stuff].

It's definitely not a completely justified assumption. But we do have evidence that the universe supports arbitrary computations, that it's extremely large, and that some things are determined randomly, so as a result it will be running many different computations in parallel. This provides some evidence that, if there is a multiverse, it will have similar properties.

Comment by jessicat on [Link] Physics-based anthropics? · 2014-11-14T19:41:47.909Z · score: 1 (1 votes) · LW · GW

I'm not sure what you mean by "vanilla anthropics". Both SSA and SIA are "simple object-level rules for assigning anthropic probabilities". Vanilla anthropics seems to be vague enough that it doesn't give an answer to the doomsday argument or the presumptuous philosopher problem.

On another note, if you assume that a nonzero percentage of the multiverse's computation power is spent simulating arbitrary universes with computation power in proportion to the probabilities of their laws of physics, then both SSA and SIA will end up giving you very similar predictions to Brian_Tomasik's proposal, although I think they might be slightly different.

Comment by jessicat on Superintelligence 9: The orthogonality of intelligence and goals · 2014-11-14T19:31:31.746Z · score: 2 (2 votes) · LW · GW

Okay. We seem to be disputing definitions here. By your definition, it is totally possible to build a very good cross-domain optimizer without it being an agent (so it doesn't optimize a utility function over the universe). It seems like we mostly agree on matters of fact.

Comment by jessicat on [Link] Physics-based anthropics? · 2014-11-14T07:52:12.886Z · score: 1 (1 votes) · LW · GW

I agree with this but I prefer weighting things by computation power instead of physics cells (which may turn out to be somewhat equivalent). It's easy to justify this model by assuming that some percentage of the multiverse's computation power is spent simulating all universes in parallel. See Schmidhuber's paper on this.

Comment by jessicat on Rodney Brooks talks about Evil AI and mentions MIRI [LINK] · 2014-11-13T05:06:06.443Z · score: 0 (0 votes) · LW · GW

Well, he's right that intentionally evil AI is highly unlikely to be created:

Malevolent AI would need all these capabilities, and then some. Both an intent to do something and an understanding of human goals, motivations, and behaviors would be keys to being evil towards humans.

which happens to be the exact reason why Friendly AI is difficult. He doesn't directly address things that don't care about humans, like paperclip maximizers, but some of his arguments can be applied to them.

Expecting more computation to just magically get to intentional intelligences, who understand the world is similarly unlikely.

He's totally right that AGI with intentionality is an extremely difficult problem. We haven't created anything that is even close to practically approximating Solomonoff induction across a variety of situations, and Solomonoff induction is insufficient for the kind of intentionality you would need to build something that cares about universe states while being able to model the universe in a flexible manner. But, you can throw more computation power at a lot of problems to get better solutions, and I expect approximate Solomonoff induction to become practical in limited ways as computation power increases and moderate algorithmic improvements are made. This is true partially because greater computation power allows one to search for better algorithms.

I do agree with him that human-level AGI within the next few decades is unlikely and that significantly slowing down AI research is probably not a good idea right now.

Comment by jessicat on Superintelligence 9: The orthogonality of intelligence and goals · 2014-11-12T19:45:37.748Z · score: 1 (1 votes) · LW · GW

I don't think the connotations of "silly" are quite right here. You could still use this program to do quite a lot of useful inference and optimization across a variety of domains, without killing everyone. Sort of like how frequentist statistics can be very accurate in some cases despite being suboptimal by Bayesian standards. Bostrom mostly only talks about agent-like AIs, and while I think that this is mostly the right approach, he should have been more explicit about that. As I said before, we don't currently know how to build agent-like AGIs because we haven't solved the ontology mapping problem, but we do know how to build non-agentlike cross-domain optimizers given enough computation power.

Comment by jessicat on Superintelligence 9: The orthogonality of intelligence and goals · 2014-11-11T23:13:15.480Z · score: 1 (1 votes) · LW · GW

1) Perhaps you give it one domain and a utility function within that domain, and it returns a good action in this domain. Then you give it another domain and a different utility function, and it returns a good action in this domain. Basically I'm saying that it doesn't maximize a single unified utility function.

2) You prove too much. This implies that the Unix cat program has a utility function (or else it is wasting effort). Technically you could view it as having a utility function of "1 if I output what the source code of cat outputs, 0 otherwise", but this really isn't a useful level of analysis. Also, if you're going to go the route of assigning a silly utility function to this program, then this is a utility function over something like "memory states in an abstract virtual machine", not "states of the universe", so it will not necessarily (say) try to break out of its box to get more computation power.
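Both points can be illustrated with a rough sketch. The function names and the two toy domains are hypothetical, and `cat_utility` simplifies cat to "output equals input":

```python
from typing import Callable, Iterable, TypeVar

A = TypeVar("A")

def best_action(actions: Iterable[A], utility: Callable[[A], float]) -> A:
    """Return the highest-utility action within the supplied domain only."""
    return max(actions, key=utility)

# Domain 1 (hypothetical): integers, with a utility that prefers values near 7.
choice_1 = best_action(range(20), lambda x: -abs(x - 7))

# Domain 2 (hypothetical): words, with an unrelated utility preferring short ones.
choice_2 = best_action(["optimizer", "cat", "universe"], lambda w: -len(w))

def cat_utility(program_output: str, input_text: str) -> int:
    """The 'silly' utility for cat: 1 if the output matches the input, else 0.
    It is defined over abstract strings, not over states of the universe."""
    return 1 if program_output == input_text else 0
```

The optimizer is reused across domains, but each call carries its own utility function; nothing in it refers to a single unified objective.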