AGIs as collectives

post by Richard_Ngo (ricraz) · 2020-05-22T20:36:52.843Z · LW · GW · 23 comments

Contents

  Interpretability
  Flexibility
  Fine-tunability
  Agency
  Overall evaluation of collective AGIs

Note that I originally used the term population AGI, but changed it to collective AGI to match Bostrom's usage in Superintelligence.

I think there’s a reasonably high probability that we will end up training AGI in a multi-agent setting [AF · GW]. But in that case, we shouldn’t just be interested in how intelligent each agent produced by this training process is, but also in the combined intellectual capabilities of a large group of agents. If those agents cooperate, they will exceed the capabilities of any one of them - and then it might be useful to think of the whole collective as one AGI. Arguably, on a large-scale view, this is how we should think of humans. Each individual human is generally intelligent in their own right. Yet from the perspective of chimpanzees, the problem was not that any single human was intelligent enough to take over the world, but rather that millions of humans underwent cultural evolution to make the human collective as a whole much more intelligent.

This idea isn’t just relevant to multi-agent training though: even if we train a single AGI, we will have strong incentives to copy it many times to get it to do more useful work. If that work involves generating new knowledge, then putting copies in contact with each other to share that knowledge would also increase efficiency. And so, one way or another, I expect that we’ll eventually end up dealing with a “collective” of AIs. Let’s call the resulting system, composed of many AIs working together, a collective AGI.

We should be clear about the differences between three possibilities which each involve multiple entities working together:

  1. A single AGI composed of multiple modules, trained in an end-to-end way.
  2. The Comprehensive AI Services (CAIS) model of a system of interlinked AIs which work together to complete tasks.
  3. A collective AGI as described above, consisting of many individual AIs working together in ways comparable to how a collective of humans might collaborate.

This essay will only discuss the third possibility, which differs from the other two in that its members are themselves individual agents, collaborating in human-like ways rather than being modules or services optimised to fit together.

What are the relevant differences from a safety perspective between this collective-based view and the standard view? Specifically, let’s compare a “collective AGI” to a single AGI which can do just as much intellectual work as the whole collective combined. Here I’m thinking particularly of the highest-level work (such as doing scientific research, or making good strategic decisions), since that seems like a fairer comparison.

Interpretability

We might hope that a collective AGI will be more interpretable than a single AGI, since its members will need to pass information to each other in a standardised “language”. By contrast, the different modules in a single AGI may have developed specialised ways of communicating with each other. In humans, language is much lower-bandwidth than thought. This isn’t a necessary feature of communication, though - members of a collective AGI could be allowed to send data between each other at an arbitrarily high rate. Decreasing this communication bandwidth might be a useful way to increase the interpretability of a collective AGI.
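
To make the bandwidth intervention concrete, here is a minimal sketch (the class, message format, and parameters are all hypothetical) of a channel that quantises each inter-agent message and truncates it to a fixed byte budget, so that an overseer has a small, discrete stream to inspect rather than raw high-dimensional activations:

```python
import numpy as np

class BandwidthLimitedChannel:
    """Toy channel that caps how much information agents can exchange per message."""

    def __init__(self, byte_budget: int, levels: int = 16):
        self.byte_budget = byte_budget  # max bytes delivered per message
        self.levels = levels            # quantisation levels per element

    def send(self, message: np.ndarray) -> bytes:
        # Clip to [-1, 1], quantise to a small alphabet, then truncate to the budget.
        clipped = np.clip(message, -1.0, 1.0)
        quantised = np.round((clipped + 1.0) / 2.0 * (self.levels - 1)).astype(np.uint8)
        return quantised.tobytes()[: self.byte_budget]

    def receive(self, payload: bytes) -> np.ndarray:
        quantised = np.frombuffer(payload, dtype=np.uint8).astype(np.float32)
        return quantised / (self.levels - 1) * 2.0 - 1.0

# Usage: a 1024-dimensional "thought vector" squeezed through a 32-byte channel.
channel = BandwidthLimitedChannel(byte_budget=32)
thought = np.random.randn(1024)
received = channel.receive(channel.send(thought))  # only 32 coarse values get through
```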

Flexibility

Regardless of the specific details of how they collaborate and share information, members of a collective AGI will need structures and norms for doing so. There’s a sense in which some of the “work” of solving problems is done by those norms - for example, the structure of a debate can be more or less helpful in adjudicating the claims made. The analogous aspect of a single AGI is the structure of its cognitive modules and how they interact with each other. However, the structure of a collective AGI would be much more flexible - and in particular, it could be redesigned by the collective AGI itself in order to improve the flow of information. By contrast, the modules of a single AGI will have been designed by an optimiser, and so fit together much more rigidly. This likely makes them work together more efficiently; the efficiency of end-to-end optimisation is why a human with a brain twice as large would be much more intelligent than two normal humans collaborating. But the concomitant lack of flexibility is why it’s much easier to improve our coordination protocols than our brain functionality.

Fine-tunability

Suppose we want to retrain an AGI to have a new set of goals. How easy is this in each case? Well, for a single AGI we can just train it on a new objective function, in the same way we trained it on the old one. For a collective AGI where each of the members was trained individually, however, we may not have good methods for assigning credit when the whole collective is trying to work together towards a single task. For example, a difficulty discussed in Sunehag et al. (2017) is that one agent starting to learn a new skill might interfere with the performance of other agents - and the resulting decrease in reward teaches the first agent to stop attempting the new skill. This would be particularly relevant if the original collective AGI was produced by copying a single agent trained by itself - if so, it’s plausible that multi-agent reinforcement learning techniques will have lagged behind.
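
As a toy illustration of the setup in Sunehag et al. (2017), here is a minimal sketch (hypothetical code, not the paper's full method) of an additive value decomposition: the joint value is the sum of per-agent values, and a single shared team reward drives every agent's update, which is exactly where the interference described above can creep in.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two agents, each with a tiny linear Q-function over its own observation.
# Additive decomposition: Q_team(s, a1, a2) is approximated by Q_1(s1, a1) + Q_2(s2, a2),
# trained only on the single shared team reward.
n_actions, obs_dim = 3, 4
W = [rng.normal(size=(n_actions, obs_dim)) * 0.1 for _ in range(2)]  # per-agent weights

def q_values(agent: int, obs: np.ndarray) -> np.ndarray:
    return W[agent] @ obs  # one value per action

def td_update(obs, acts, team_reward, next_obs, lr=0.1, gamma=0.9):
    # Joint value is the sum of the two agents' chosen-action values.
    q_joint = sum(q_values(i, obs[i])[acts[i]] for i in range(2))
    target = team_reward + gamma * sum(q_values(i, next_obs[i]).max() for i in range(2))
    td_error = target - q_joint
    # The same scalar error is pushed into both agents' parameters: credit for
    # the shared reward is split implicitly, so one agent's exploration can
    # drag down the signal that the other agents learn from.
    for i in range(2):
        W[i][acts[i]] += lr * td_error * obs[i]

# One illustrative transition with a shared team reward of +1.
obs = [rng.normal(size=obs_dim) for _ in range(2)]
next_obs = [rng.normal(size=obs_dim) for _ in range(2)]
td_update(obs, acts=[0, 2], team_reward=1.0, next_obs=next_obs)
```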

Agency

This is a tricky one. I think that a collective AGI is likely to be less agentic and goal-directed than a single AGI of equivalent intelligence, because different members of the collective may have different goals which push in different directions. However, it’s also possible that collective-level phenomena amplify goal-directed behaviour. For example, competition between different members in a collective AGI could push the group as a whole towards dangerous behaviour (in a similar way to how competition between companies makes humans less safe from the perspective of chimpanzees). And our lessened ability to fine-tune them, as discussed in the previous section, might make it difficult to know how to intervene to prevent that.

Overall evaluation of collective AGIs

I think that the extent to which a collective AGI is more dangerous than an equivalently intelligent single AGI will mainly depend on how the individual members are trained (in ways which I’ve discussed previously [AF · GW]). If we condition on a given training regime being used for both approaches, though, it’s much less clear which type of AGI we should prefer. It’d be useful to see more arguments either way - in particular because a better understanding of the pros and cons of each approach might influence our training decisions. For example, during multi-agent training there may be a tradeoff between training individual AIs to be more intelligent and running more copies of them so that they learn to cooperate at larger scales. In such environments we could also try to encourage or discourage in-depth communication between them.

In my next post, I’ll discuss one argument for why collective AGIs might be safer: because they can be deployed in more constrained ways.

23 comments

Comments sorted by top scores.

comment by Donald Hobson (donald-hobson) · 2020-05-22T22:58:16.759Z · LW(p) · GW(p)
Decreasing this communication bandwidth might be a useful way to increase the interpretability of a collective AGI.

On one hand, there would be an effect where reduced bandwidth encouraged the AIs to focus on the most important pieces of information. If the AIs have 1 bit of really important info, and gigabytes of slightly useful info to send to each other, then you know that if you restrict the bandwidth to 1 bit, that's the important info.

On the other hand, perfect compression leaves data that looks like noise unless you have the decompression algorithm. If you limit the bandwidth of messages, the AIs will compress the messages until the recipient can't predict the next bit with much more than 50% accuracy. Cryptanalysis often involves searching for regular patterns in the coded message, and regular patterns are an opportunity for compression.
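
As a rough numerical sketch of this effect (illustrative code, not a claim about any particular AI system): compress a redundant message and measure the empirical entropy of the result. The compressed bytes sit near the 8-bits-per-byte ceiling, i.e. they look close to uniform noise to anyone without the decompressor.

```python
import math
import random
import zlib
from collections import Counter

def bits_per_byte(data: bytes) -> float:
    """Empirical (order-0) entropy of a byte string, in bits per byte."""
    counts = Counter(data)
    total = len(data)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

# A long but redundant message: lots of slightly useful info, few distinct symbols.
random.seed(0)
words = [b"alpha", b"beta", b"gamma", b"delta", b"signal", b"noise", b"agent", b"reward"]
plaintext = b" ".join(random.choice(words) for _ in range(20000))
compressed = zlib.compress(plaintext, level=9)

print(f"plaintext:  {len(plaintext):7d} bytes, {bits_per_byte(plaintext):.2f} bits/byte")
print(f"compressed: {len(compressed):7d} bytes, {bits_per_byte(compressed):.2f} bits/byte")
# The compressed stream is far shorter and its bytes are nearly uniformly
# distributed, so there are few regular patterns left for an observer to latch onto.
```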

But the concomitant lack of flexibility is why it’s much easier to improve our coordination protocols than our brain functionality.

There are many reasons why human brains are hard to modify that don't apply to AIs. I don't know how easy or hard it would be to modify the internal cognitive structure of an AGI, but I see no evidence here that it must be hard.

On the main substance of your argument, I am not convinced that the boundary line between a single AI and multiple AIs carves reality at the joints. I agree that there are potential situations that are clearly a single AI, or clearly a population, but I think that a lot of real-world territory is an ambiguous mixture of the two. For instance, is the end result of IDA (Iterated Distillation and Amplification) a single agent or a population? In basic architecture, it is a single imitator (maybe a single neural net). But if you assume that the distillation step has no loss of fidelity, then you get an exponentially large number of humans in HCH.

(Analogously there are some things that are planets, some that aren't and some ambiguous icy lumps. In order to be clearer, you need to decide which icy lumps are planets. Does it depend on being round, sweeping its orbit, having a near circular orbit or what?)

Here are some different ways to make the concept clearer.

1) There are multiple AIs with different terminal goals, in the sense that the situation can reasonably be modeled as game-theoretic. If a piece of code A is modelling code B, and then A randomises its own action to stop B from predicting A, this is a partially adversarial, game-theoretic situation.

2) If you took some scissors to all the cables connecting two sets of computers, so there was no route for information to get from one side to the other, then both sides would display optimisation behavior.

Suppose the paradigm was recurrent reinforcement learning agents. So each agent is a single neural net and also has some memory which is just a block of numbers. On each timestep, the memory and sensory data are fed into a neural net, and out comes the new memory and action.
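
A minimal sketch of that interface (the single weight matrix, the dimensions, and the greedy action choice are all placeholders for whatever the real net would be):

```python
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, MEM_DIM, N_ACTIONS = 8, 16, 4

# One weight matrix standing in for the whole net: it maps [memory, observation]
# to [new memory, action preferences].
W = rng.normal(size=(MEM_DIM + N_ACTIONS, MEM_DIM + OBS_DIM)) * 0.1

def step(memory: np.ndarray, observation: np.ndarray):
    """One timestep: (memory, observation) -> (new memory, action)."""
    out = np.tanh(W @ np.concatenate([memory, observation]))
    new_memory = out[:MEM_DIM]               # the block of numbers carried forward
    action = int(np.argmax(out[MEM_DIM:]))   # pick the highest-scoring action
    return new_memory, action

# Duplicating an agent at some moment just means copying (W, memory); afterwards
# the copies' memories diverge as they receive different observations.
memory = np.zeros(MEM_DIM)
for _ in range(5):
    memory, action = step(memory, rng.normal(size=OBS_DIM))
```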

AIs can be duplicated at any moment, so the structure is more like a branching tree of commonality.

AI moments can be:

1) Bitwise identical.

2) Predecessor and successor states: B has the same network as A, and Mem(B) was made by running Mem(A) on some observation.

3) Share a common memory predecessor.

4) No common memory, same net.

5) One net was produced from the other by gradient descent.

6) The nets share a common gradient descent ancestor.

7) Same architecture and training environment, net started with a different random seed.

8) Same architecture, different training.

9) Different architecture (number of layers, layer sizes, activation function, etc.).

Each of these can be running at the same time or different times, and on the same hardware or different hardware.

comment by Wei Dai (Wei_Dai) · 2020-05-22T23:06:26.572Z · LW(p) · GW(p)

What success story [LW · GW] (or stories) did you have in mind when writing this?

Replies from: ricraz
comment by Richard_Ngo (ricraz) · 2020-05-22T23:33:47.287Z · LW(p) · GW(p)

Nothing in particular. My main intention with this post was to describe a way the world might be, and some of the implications. I don't think such work should depend on being related to any specific success story.

Replies from: Wei_Dai
comment by Wei Dai (Wei_Dai) · 2020-05-22T23:47:30.676Z · LW(p) · GW(p)

I don’t think such work should depend on being related to any specific success story.

The reason I asked was that you talk about "safer" and "less safe" and I wasn't sure if "safer" here should be interpreted as "more likely to let us eventually achieve some success story", or "less likely to cause immediate catastrophe" (or something like that). Sounds like it's the latter?

Maybe I should just ask directly: what do you tend to mean when you say "safer"?

Replies from: ricraz
comment by Richard_Ngo (ricraz) · 2020-05-23T08:42:05.339Z · LW(p) · GW(p)

My thought process when I use "safer" and "less safe" in posts like this is: the main arguments that AGI will be unsafe depend on it having certain properties, like agency, unbounded goals, lack of interpretability, desire and ability to self-improve, and so on. So reducing the extent to which it has those properties will make it safer, because those arguments will be less applicable.

I guess you could have two objections to this:

  • Maybe safety is non-monotonic in those properties.
  • Maybe you don't get any increase in safety until you hit a certain threshold (corresponding to some success story).

I tend not to worry so much about these two objections because the properties I outlined above are still too vague for us to have a good idea of the landscape of risks with respect to them. Once we know what agency is, we can talk about its monotonicity. For now my epistemic state is: extreme agency is an important component of the main argument for risk, so all else equal reducing it should reduce risk.

I like the idea of tying safety ideas to success stories in general, though, and will try to use it for my next post, which proposes more specific interventions during deployment. Having said that, I also believe that most safety work will be done by AGIs, and so I want to remain open-minded to success stories that are beyond my capability to predict.

Replies from: Wei_Dai
comment by Wei Dai (Wei_Dai) · 2020-05-26T08:59:36.257Z · LW(p) · GW(p)

For now my epistemic state is: extreme agency is an important component of the main argument for risk, so all else equal reducing it should reduce risk.

I appreciate the explanation, but this is pretty far from my own epistemic state, which is that arguments for AI risk are highly [LW(p) · GW(p)] disjunctive [LW · GW], most types of AGI (not just highly agentic ones) are probably unsafe (i.e., are likely to lead us away from rather than towards a success story), at best probably only a few very specific AGI designs (which may well be agentic if combined with other properties) are both feasible and safe (i.e., can count as success stories), so it doesn't make sense to say that an AGI is "safer" just because it's less agentic.

Having said that, I also believe that most safety work will be done by AGIs, and so I want to remain open-minded to success stories that are beyond my capability to predict.

Getting to an AGI that can safely do human or superhuman level safety work would be a success story in itself, which I labeled "Research Assistant" in my post.

Replies from: ricraz
comment by Richard_Ngo (ricraz) · 2020-05-26T09:48:04.591Z · LW(p) · GW(p)
my own epistemic state, which is that arguments for AI risk are highly [LW(p) · GW(p)] disjunctive [LW · GW], most types of AGI (not just highly agentic ones) are probably unsafe (i.e., are likely to lead us away from rather than towards a success story), at best probably only a few very specific AGI designs (which may well be agentic if combined with other properties) are both feasible and safe (i.e., can count as success stories)

Yeah, I guess I'm not surprised that we have this disagreement. To briefly sketch out why I disagree (mostly for common knowledge; I don't expect this to persuade you):

I think there's something like a logistic curve for how seriously we should take arguments. Almost all arguments are bad, and have many many ways in which they might fail. This is particularly true for arguments trying to predict the future, since they have to invent novel concepts to do so. Only once you've seen a significant amount of work put into exploring an argument, the assumptions it relies on, and the ways it might be wrong, should you start to assign moderate probability that the argument is true, and that the concepts it uses will in hindsight make sense.

Most of the arguments mentioned in your post on disjunctive safety arguments fall far short of any reasonable credibility threshold. Most of them haven't even had a single blog post which actually tries to scrutinise them in a critical way, or lay out their key assumptions. And to be clear, a single blog post is just about the lowest possible standard you might apply. Perhaps it'd be sufficient in a domain where claims can be very easily verified, but when we're trying to make claims that a given effect will be pivotal for the entire future of humanity despite whatever efforts people will make when the problem starts becoming more apparent, we need higher standards to get to the part of the logistic curve with non-negligible gradient.

This is not an argument for dismissing all of these possible mechanisms out of hand, but an argument that they shouldn't (yet) be given high credence. I think they are often given too high credence because there's a sort of halo effect from the arguments which have been explored in detail, making us more willing to consider arguments that in isolation would seem very out-there. When you think about the arguments made in your disjunctive post, how hard do you try to imagine each one conditional on the knowledge that the other arguments are false? Are they actually compelling in a world where Eliezer is wrong about intelligence explosions and Paul is wrong about influence-seeking agents? (Maybe you'd say that there are legitimate links between these arguments, e.g. common premises - but if so, they're not highly disjunctive).

Getting to an AGI that can safely do human or superhuman level safety work would be a success story in itself, which I labeled "Research Assistant" in my post

Good point, I shall read that post more carefully. I still don't think that this post is tied to the Research Assistant success story though.

Replies from: Wei_Dai, Wei_Dai
comment by Wei Dai (Wei_Dai) · 2020-05-26T10:20:45.027Z · LW(p) · GW(p)

but when we’re trying to make claims that a given effect will be pivotal for the entire future of humanity despite whatever efforts people will make when the problem starts becoming more apparent, we need higher standards to get to the part of the logistic curve with non-negligible gradient.

I guess a lot of this comes down to priors and burden of proof. (I guess I have a high prior that making something smarter than human is dangerous unless we know exactly what we're doing including the social/political aspects, and you don't, so you think the burden of proof is on me?) But (1) I did write a bunch of blog posts which are linked to in the second post (maybe you didn't click on that one?) and it would help if you could point out more where you're not convinced, and (2) does the current COVID-19 disaster not make you more pessimistic about "whatever efforts people will make when the problem starts becoming more apparent"?

When you think about the arguments made in your disjunctive post, how hard do you try to imagine each one conditional on the knowledge that the other arguments are false? Are they actually compelling in a world where Eliezer is wrong about intelligence explosions and Paul is wrong about influence-seeking agents?

I think I did? Eliezer being wrong about intelligence explosions just means we live in a world without intelligence explosions, and Paul being wrong about influence-seeking agents just means he (or someone) succeeds in building intent-aligned AGI, right? Many of my "disjunctive" arguments were written specifically with that scenario in mind.

Replies from: ricraz
comment by Richard_Ngo (ricraz) · 2020-05-26T23:16:23.136Z · LW(p) · GW(p)
Many of my "disjunctive" arguments were written specifically with that scenario in mind.

Cool, makes sense. I retract my pointed questions.

I guess I have a high prior that making something smarter than human is dangerous unless we know exactly what we're doing including the social/political aspects, and you don't, so you think the burden of proof is on me?

This seems about right. In general when someone proposes a mechanism by which the world might end, I think the burden of proof is on them. You're not just claiming "dangerous", you're claiming something like "more dangerous than anything else has ever been, even if it's intent-aligned". This is an incredibly bold claim and requires correspondingly thorough support.

does the current COVID-19 disaster not make you more pessimistic about "whatever efforts people will make when the problem starts becoming more apparent"?

Actually, COVID makes me a little more optimistic. First because quite a few countries are handling it well. Secondly because I wasn't even sure that lockdowns were a tool in the arsenal of democracies, and it seemed pretty wild to shut the economy down for so long. But they did. Also essential services have proven much more robust than I'd expected (I thought there would be food shortages, etc).

Replies from: Wei_Dai
comment by Wei Dai (Wei_Dai) · 2020-05-27T09:01:13.358Z · LW(p) · GW(p)

This seems about right. In general when someone proposes a mechanism by which the world might end, I think the burden of proof is on them. You’re not just claiming “dangerous”, you’re claiming something like “more dangerous than anything else has ever been, even if it’s intent-aligned”. This is an incredibly bold claim and requires correspondingly thorough support.

  1. "More dangerous than anything else has ever been" does not seem incredibly bold to me, given that superhuman AI will be more powerful than anything else the world has seen. Historically the risk of civilization doing damage to itself seems to grow with the power that it has access to (e.g., the two world wars, substantial risks of nuclear war and man-made pandemic that continue to accumulate each year, climate change) so I think I'm just extrapolating a clear trend. (Past risks like these could not have been eliminated by solving a single straightforward, self-contained, technical problem analogous to "intent alignment" so why expect that now?)

To risk being uncharitable, your position seems analogous to someone saying, before the start of the nuclear era, "I think we should have a low prior that developing any particular kind of nuclear weapon will greatly increase the risk of global devastation in the future, because (1) that would be unprecedentedly dangerous and (2) nobody wants global devastation so everyone will work to prevent it. The only argument that has been developed well enough to overcome this low prior is that some types of nuclear weapons could potentially ignite the atmosphere, so to be safe we'll just make sure to only build bombs that definitely can't do that." (What would be a charitable historical analogy to your position if this one is not?)

  1. "The world might end" is not the only or even the main thing I'm worried about, especially because there are more people who can be expected to worry about "the world might end" and try to do something about it. My focus is more on the possibility that humanity survives but the values of people like me (or human values, or objective morality, depending on what the correct metaethics turn out to be) end up controlling only a small fraction of universe so we end up with astronomical waste or Beyond Astronomical Waste [LW · GW] as a result. (Or our values become corrupted and the universe ends up being optimized for completely alien or wrong values.) There is plenty of precedence for the world becoming quite suboptimal according to some group's values, and there is no apparent reason to think the universe has to evolve according to objective morality (if such a thing exists), so my claim also doesn't seem very extraordinary from this perspective.

First because quite a few countries are handling it well. Secondly because I wasn’t even sure that lockdowns were a tool in the arsenal of democracies, and it seemed pretty wild to shut the economy down for so long.

If you think societal response to a risk like pandemic (and presumably AI) is substantially suboptimal by default (and it clearly is given that large swaths of humanity are incurring a lot of needless deaths), doesn't that imply significant residual risks, and plenty of room for people like us to try to improve the response? To a first approximation, the default suboptimal social response reduces all risks by some constant amount, so if some particular x-risk is important to work on without considering default social response, it's probably still important to work on after considering "whatever efforts people will make when the problem starts becoming more apparent". Do you disagree with this argument? Or did you have some other reason for saying that which I'm not getting?

comment by Wei Dai (Wei_Dai) · 2020-05-26T21:13:41.035Z · LW(p) · GW(p)

To try to encourage you to engage with my arguments more (as far as pointing out where you're not convinced), I think I'm pretty good at being skeptical of my own ideas [LW · GW] and have a good track record in terms of not spewing off a lot of random ideas that turn out to be far off the mark. But I am too lazy / have too many interests / am too easily distracted to write long papers/posts where I lay out every step of my reasoning and address every possible counterargument in detail.

So what I'd like to do is to just amend my posts to address the main objections that many people actually have, enough for more readers like you to "assign moderate probability that the argument is true". In order to do that, I need to have a better idea what objections people actually have or what counterarguments they currently find convincing. Does this make sense to you?

Replies from: ricraz, Wei_Dai
comment by Richard_Ngo (ricraz) · 2020-05-26T23:07:53.648Z · LW(p) · GW(p)

I'm pretty skeptical of this as a way of making progress. It's not that I already have strong disagreements with your arguments. But rather, if you haven't yet explained them thoroughly, I expect them to be underspecified, and use some words and concepts that are wrong [LW · GW] in hard-to-see ways. One way this might happen is if those arguments use concepts (like "metaphilosophy") that kinda intuitively seem like they're pointing at something, but come with a bunch of connotations and underlying assumptions that make actually understanding them very tricky.

So my expectation for what happens here is: I look at one of your arguments, formulate some objection X, and then you say either: "No, that wasn't what I was claiming", or "Actually, ~X is one of the implicit premises", or "Your objection doesn't make any sense in the framework I'm outlining" and then we repeat this a dozen or more times. I recently went through this process with Rohin, and it took a huge amount of time and effort (both here [EA(p) · GW(p)] and in private conversation) to get anywhere near agreement, despite our views on AI being much more similar than yours and mine.

And even then, you'll only have fixed the problems I'm able to spot, and not all the others. In other words, I think of patching your way to good arguments as kinda like patching your way to safe AGI. (To be clear, none of this is meant as specific criticism of your arguments, but rather as general comments about any large-scale arguments using novel concepts that haven't been made very thoroughly and carefully).

Having said this, I'm open to trying it for one of your arguments. So perhaps you can point me to one that you particularly want engagement on?

Replies from: Raemon, Wei_Dai, TurnTrout
comment by Raemon · 2020-05-26T23:14:46.108Z · LW(p) · GW(p)

(serious question, I'm not sure what the right process here is)

What do you think should happen instead of "read through and object to Wei_Dai's existing blogposts?". Is there a different process that would work better? Or you think this generally isn't worth the time? Or you think Wei Dai should write a blogpost that more clearly passes your "sniff test" of "probably compelling enough to be worth more of my attention?"

Replies from: ricraz
comment by Richard_Ngo (ricraz) · 2020-05-26T23:52:38.113Z · LW(p) · GW(p)

Mostly "Wei Dai should write a blogpost that more clearly passes your "sniff test" of "probably compelling enough to be worth more of my attention"". And ideally a whole sequence or a paper.

It's possible that Wei has already done this, and that I just haven't noticed. But I had a quick look at a few of the blog posts linked in the "Disjunctive scenarios" post, and they seem to overall be pretty short and non-concrete, even for blog posts. Also, there are literally thirty items on the list, which makes it hard to know where to start (and also suggests low average quality of items). Hence why I'm asking Wei for one which is unusually worth engaging with; if I'm positively surprised, I'll probably ask for another.

comment by Wei Dai (Wei_Dai) · 2020-05-28T05:16:36.889Z · LW(p) · GW(p)

Having said this, I’m open to trying it for one of your arguments. So perhaps you can point me to one that you particularly want engagement on?

Perhaps you could read all three of these posts (they're pretty short :) and then either write a quick response to each one and then I'll decide which one to dive into, or pick one yourself (that you find particularly interesting, or you have something to say about).

Also, let me know if you prefer to do this here, via email, or text/audio/video chat. (Also, apologies ahead of time for any issues/delays as my kid is home all the time now, and looking after my investments is a much bigger distraction / time-sink than usual, after I updated away from "just put everything into an index fund".)

Replies from: ricraz
comment by Richard_Ngo (ricraz) · 2020-05-28T19:15:15.304Z · LW(p) · GW(p)

My thoughts on each of these. The common thread is that it seems to me you're using abstractions at way too high a level to be confident that they will actually apply, or that they even make sense in those contexts.

AGIs and economies of scale

  • Do we expect AGIs to be so competitive that reducing coordination costs is a big deal? I expect that the dominant factor will be AGI intelligence, which will vary enough that changes in coordination costs aren't a big deal. Variations in human intelligence have a huge effect, and presumably variations in AGI intelligence will be much bigger.
  • There's an obvious objection to giving one AGI all of your resources, which is "how do you know it's aligned"? And this seems like an issue where there'd be unified dissent from people worried about both short-term and long-term safety.
  • Oh, another concern: if they're all intent aligned to the same person, then this amounts to declaring that person dictator. Which is often quite a difficult thing to convince people to do.
  • Consider also that we'll be in an age of unprecedented plenty, once we have aligned AGIs that can do things for us. So I don't see why economic competition will be very strong. Perhaps military competition will be strong, but will countries really be converting so much of their economy to military spending that they need this edge to keep up?

So this seems possible, but very far from a coherent picture in my mind.

Some thoughts on metaphilosophy

  • There are a bunch of fun analogies here. But it is very unclear to me what you mean by "philosophy" here, since most, or perhaps all, of your descriptions would be equally applicable to "thinking" or "reasoning". The model you give of philosophy is also a model of choosing the next move in the game of chess, and countless other things.
  • Similarly, what is metaphilosophy, and what would it mean to solve it? Reach a dead end? Be able to answer any question? Why should we think that the concept of a "solution" to metaphilosophy makes any sense?

Overall, this post feels like it's pointing at something interesting but I don't know if it actually communicated any content to me. Like, is the point of the sections headed "Philosophy as interminable debate" and "Philosophy as Jürgen Schmidhuber's General TM" just to say that we can never be certain of any proposition? As written, the post is consistent both with you having some deep understanding of metaphilosophy that I just am not comprehending, and also with you using this word in a nonsensical way.

Two Neglected Problems in Human-AI Safety

  • "There seems to be no reason not to expect that human value functions have similar problems, which even "aligned" AIs could trigger unless they are somehow designed not to." There are plenty of reasons to think that we don't have similar problems - for instance, we're much smarter than the ML systems on which we've seen adversarial examples. Also, there are lots of us, and we keep each other in check.
  • "For example, such AIs could give humans so much power so quickly or put them in such novel situations that their moral development can't keep up, and their value systems no longer apply or give essentially random answers." What does this actually look like? Suppose I'm made the absolute ruler of a whole virtual universe - that's a lot of power. How might my value system "not keep up"?
  • The second half of this post makes a lot of sense to me, in large part because you can replace "corrupt human values" with "manipulate people", and then it's very analogous to problems we face today. Even so, a *lot* of additional work would need to be done to make a plausible case that this is an existential risk.
  • "An objective that is easy to test/measure (just check if the target has accepted the values you're trying to instill, or has started doing things that are more beneficial to you)". Since when was it easy to "just check" someone's values? Like, are you thinking of an AI reading them off our neurons?

Here's a slightly stretched analogy to try and explain my overall perspective. If you talked to someone born a thousand years ago about the future, they might make claims like "the most important thing is making progress on metatheology" or "corruption of our honour is an existential risk", or "once instantaneous communication exists then economies of scale will be so great that countries will be forced to nationalise all their resources". How do we distinguish our own position from theirs? The only way is to describe our own concepts at a level of clarity and detail that they just couldn't have managed. So what I want is a description of what "metaphilosophy" is such that it would have been impossible to give an equally clear description of "metatheology" without realising that this concept is not useful or coherent. Maybe that's too high a target, but I think it's one we should keep in mind as what is *actually necessary* to reason at such an abstract level without getting into confusion.

Replies from: dxu
comment by dxu · 2020-05-29T02:33:46.320Z · LW(p) · GW(p)
  • "There seems to be no reason not to expect that human value functions have similar problems, which even "aligned" AIs could trigger unless they are somehow designed not to." There are plenty of reasons to think that we don't have similar problems - for instance, we're much smarter than the ML systems on which we've seen adversarial examples. Also, there are lots of us, and we keep each other in check.
  • "For example, such AIs could give humans so much power so quickly or put them in such novel situations that their moral development can't keep up, and their value systems no longer apply or give essentially random answers." What does this actually look like? Suppose I'm made the absolute ruler of a whole virtual universe - that's a lot of power. How might my value system "not keep up"?

I confess to being uncertain of what you find confusing/unclear here. Think of any subject you currently have conflicting moral intuitions about (do you have none?), and now imagine being given unlimited power without being given the corresponding time to sort out which intuitions you endorse. It seems quite plausible to me that you might choose to do the wrong thing in such a situation, which could be catastrophic if said decision is irreversible.

Replies from: ricraz
comment by Richard_Ngo (ricraz) · 2020-05-29T08:19:17.565Z · LW(p) · GW(p)

But I can't do the wrong thing, by my standards of value, if my "value system no longer applies". So that's part of what I'm trying to tease out.

Another part is: I'm not sure if Wei thinks this is just a governance problem (i.e. we're going to put people in charge who do the wrong thing, despite some people advocating caution) or a more fundamental problem that nobody would do the right thing.

If the former, then I'd characterise this more as "more power magnifies leadership problems". But maybe it won't, because there's also a much larger space of morally acceptable things you can do. It just doesn't seem that easy to me to accidentally cause a moral catastrophe if you've got a huge amount of power, let alone an irreversible one. But maybe this is just because I don't know of whatever examples Wei has in mind.

comment by TurnTrout · 2020-05-27T00:19:38.666Z · LW(p) · GW(p)

In other words, I think of patching your way to good arguments

As opposed to what?

Replies from: ricraz
comment by Richard_Ngo (ricraz) · 2020-05-27T09:00:31.628Z · LW(p) · GW(p)

As opposed to coming up with powerful and predictive concepts, and refining them over time. Of course argument and counterargument are crucial to that, so there's no sharp line between this and "patching", but for me the difference is: are you starting with the assumption that the idea is fundamentally sound, and you just need to fix it up a bit to address objections? If you are in that position despite not having fleshed out the idea very much, that's what I'd characterise as "patching your way to good arguments".

comment by Wei Dai (Wei_Dai) · 2020-05-26T23:26:37.565Z · LW(p) · GW(p)

It looks like someone strong downvoted a couple of my comments in this thread (the parent and this one [LW(p) · GW(p)]). (The parent comment was at 5 points with 3 votes, now it's 0 points with 4 votes.) This is surprising to me as I can't think of what I have written that could cause someone to want to do that. Does the person who downvoted want to explain, or anyone else want to take a guess?

Replies from: Benito
comment by Ben Pace (Benito) · 2020-05-27T00:09:42.405Z · LW(p) · GW(p)

(I also can't think of a clear reason why anyone would strong-downvote your comments. I liked reading this comment thread, even though I had some sticky-feeling sense that it would be hard to resolve to convo with Richard for some reason I can't easily articulate.)

Replies from: ricraz
comment by Richard_Ngo (ricraz) · 2020-05-27T09:17:23.612Z · LW(p) · GW(p)

What do you mean by "hard to resolve to convo with Richard"? I can't parse that grammar.

I didn't downvote those comments, but if you interpret me as saying "More rigour for important arguments please", and Wei as saying "I'm too lazy to provide this rigour", then I can see why someone might have downvoted them.

Like, on one level I'm fine with Wei having different epistemic standards to me, and I appreciate his engagement. And I definitely don't intend my arguments as attacks on Wei specifically, since he puts much more effort into making intellectual progress than almost anyone on this site.

But on another level, the whole point of this site is to have higher epistemic standards, and (I would argue) the main thing preventing that is just people being so happy to accept blog-post-sized insights without further scrutiny.