Projects I would like to see (possibly at AI Safety Camp)

post by Linda Linsefors · 2023-09-27T21:27:29.539Z · LW · GW · 12 comments

Contents

    What is AI Safety Camp?
  Project ideas
    Is substrate-needs convergence inevitable for any autonomous system, or is it preventable with sufficient error correction techniques?
    Any alignment-relevant adversarial collaboration
      Possible topic:
    What capability thresholds along what dimensions should we never cross?
    A taxonomy of: What end-goal are “we” aiming for?
12 comments

I recently discussed with my AISC co-organiser Remmelt some possible project ideas I would be excited to see at the upcoming AISC, and I thought these would be valuable to share more widely.

Thanks to Remmelt for helpful suggestions and comments.

 

What is AI Safety Camp?

AISC in its current form is primarily a structure to help people find collaborators. If you join as a research lead, we give your project visibility and help you recruit a team. If you join as a regular participant, we match you up with a project you can help with.

I want to see more good projects happening. I know there is a lot of unused talent wanting to help with AI safety. If you want to run one of these projects, it doesn't matter to me if you do it as part of AISC or independently, or as part of some other program. The purpose of this post is to highlight these projects as valuable things to do, and to let you know AISC can support you, if you think what we offer is helpful.

 

Project ideas

These are not my after-long-consideration top picks of the most important things to do, just some things I think would be net positive if someone did them. I typically don't spend much cognitive effort on absolute rankings anyway, since I think personal fit matters more when ranking your personal options.

I don't claim originality for anything here. It’s possible there is existing work on one or several of these topics that I’m not aware of. Please share links in the comments if you know of such work.

 

Is substrate-needs convergence inevitable for any autonomous system, or is it preventable with sufficient error correction techniques?

This can be done as an adversarial collaboration (see below) but doesn't have to be.

The risk from substrate-needs convergence can be summarised as follows:

  1. If AI is complex enough to self-sufficiently maintain its components, natural selection will sneak in. 
  2. This would select for components that cause environmental conditions needed for artificial self-replication.
  3. An AGI will necessarily be complex enough. 

Therefore natural selection will push the system towards self-replication, and it is therefore not possible for an AGI to be stably aligned with any other goal. Note that this line of reasoning does not require that the AI will come to represent self-replication as its goal (although that is a possible outcome), only that natural selection will push it towards this behaviour.

I'm simplifying and skipping over a lot of steps! I don't think there currently is a great writeup of the full argument, but if you're interested you can read more here [LW · GW], watch this talk by Remmelt, or reach out to me or Remmelt. Remmelt has a deeper understanding of the arguments for substrate-needs convergence than I do, but my communication style might be better suited for some people.

I think substrate-needs convergence is pointing at a real risk. I don't know yet if the argument (which I summarised above) proves that building an AGI that stays aligned is impossible, or if it points to one more challenge to be overcome. Figuring out which of these is the case seems very important. 

I've talked to a few people about this problem, and identified what I think is the main crux: how well can error correction mechanisms be executed?

When Forrest Landry and Anders Sandberg discussed substrate-needs convergence, they ended up with a similar crux, but unfortunately did not have time to address it. Here’s a recording of their discussion; however, Landry’s mic breaks about 20 minutes in, which makes it hard to hear him from that point onward.

 

Any alignment-relevant adversarial collaboration

What are adversarial collaborations? 
See this SSC post for an explanation: Call for Adversarial Collaborations | Slate Star Codex

Possible topic:

I expect this type of project to be most interesting if both sides already have strong reasons for believing the side they are advocating for. My intuition says that different frames will favour different conclusions, and that you will miss one or more important frame if either or both of you start from a weak conviction. The most interesting conversation will come from taking a solid argument for A and another solid argument for not-A, and finding a way for these perspectives to meet.

I think AISC can help find good matches. The way I suggest doing this is that one person (the AISC Research Lead) lays out their position in their project proposal, which we then post for everyone to see. When we open up for team member applications, anyone who disagrees can apply to join that project. You could have more than one person defending or attacking the position in question, and you can also add a moderator to the team if that seems useful.

However, if the AISC structure seems like overkill or just not the right fit for what you want to do, there are other options too. For example, you're invited to post ideas in the comments of this post.

Haven't there already been several AI safety debates?
Yes, and those have been interesting. But doing an adversarial collaboration as part of AISC is a longer time commitment than most of these debates, which will allow you to go deeper. I'm sure there have also been long conversations in the past, continuing back and forth over months, and many of those have been useful too. Let's have more!

 

What capability thresholds along what dimensions should we never cross?

This is a project for people who think that alignment is not possible, or at least not tractable in the next couple of decades. I'd be especially interested to see someone work on this from the perspective of risk due to substrate-needs convergence, or at least taking that risk into account, since it is underexplored.

If alignment is not possible and we have to settle for less than god-like AI, then where do we draw the boundary for safe AI capabilities? What capability thresholds along what dimensions should we never cross? 

Karl suggested something similar here: Where are the red lines for AI? [LW · GW]

 

A taxonomy of: What end-goal are “we” aiming for?

In what I think of as “classical alignment research” the end goal is a single aligned superintelligent AI, which will solve all our future problems, including defending us against any future harmful AIs. But the field of AI safety has broadened a lot since then. For example, there is much more effort going into coordination now. But is the purpose of regulation and other coordination just to slow down AI so we have time to solve alignment, so we can build our benevolent god later on? Or are we aiming for a world where humans stay in control? I expect different people and different projects to have different end-goals in mind. However, this isn’t talked about much, so I don’t know.

It is likely that some of the disagreement around alignment is based on different agendas aiming for different things. I think it would be good for the AI safety community to have an open discussion about this. However, the first step should not be to argue about who is right or wrong, but just to map out what end-goals different people and groups have in mind.

In fact, I don’t think consensus on what the end-goal should be is necessarily something we want at this point. We don’t know yet what is possible. It’s probably good for humanity to keep our options open, which means different people preparing the path for different options. I like the fact that different agendas are aiming at different things. But I think the discourse and understanding could be improved by more common knowledge about who is aiming for what.

12 comments

Comments sorted by top scores.

comment by MiguelDev (whitehatStoic) · 2023-09-28T04:29:56.885Z · LW(p) · GW(p)

A taxonomy of: What end-goal are “we” aiming for?


Strong upvote for this section. We should begin discussing how to develop a specific type of network that can manage the future we envision, with a cohesive end goal that can be integrated into an AI system.

comment by mesaoptimizer · 2023-10-03T07:57:29.574Z · LW(p) · GW(p)

Here is one output I want to see: a succinct, concrete, fully specified (ideally formalized) argument for uncontrollability that involves a simple setup and demonstrates Remmelt's reasons for uncontrollability of superhuman AGI.

That should make the argument easier to communicate and evaluate.

Replies from: remmelt-ellen
comment by Remmelt (remmelt-ellen) · 2023-12-30T08:33:22.683Z · LW(p) · GW(p)

Still wanted to say:

I appreciate the spirit of this comment.

There are trade-offs here.

If it’s simple or concrete like a toy model, then it is not fully specified. If it is fully specified, then the inferential distance of going through the reasoning steps is large (people get overwhelmed and they opt out).

If it’s formalised, then people need to understand the formal language. Look back at Gödel’s incompleteness theorems, which involved creating a new language and describing a mathematical world people were not familiar with. Actually reading through the original paper would have been a toil for most mathematicians.

There are further bottlenecks, which I won’t get into.

For now, I suggest that people who care to understand (because everything we care about is at stake) read this summary post: https://www.lesswrong.com/posts/xp6n2MG5vQkPpFEBH/the-control-problem-unsolved-or-unsolvable [LW · GW]

Anders Sandberg also had an insightful conversation with my research mentor about fundamental controllability limits. I guess the transcript will be posted on this forum sometime next month.

Again, it’s not simple.

For the effort it takes you to read through and understand parts, please recognise that it took a couple of orders of magnitude more effort for me and my collaborators to convey the arguments in a more intuitive, digestible form.

Replies from: mesaoptimizer
comment by mesaoptimizer · 2023-12-30T10:21:38.554Z · LW(p) · GW(p)

I now agree with your sentiment here and don't think my request when I made that comment was very sensible. It does seem like going from an informal, not-fully-specified argument to a fully specified one is extremely difficult, and unlikely to be worth the effort for convincing people who would already be convinced by extremely sensible but not fully formalized arguments.

It does seem to me that a toy model that is not fully specified is still a big deal when it comes to progress in communicating what is going on though.

I might look into the linked post again in more detail and seriously consider it and wrap my head around it. Thanks for this comment.

Replies from: remmelt-ellen
comment by Remmelt (remmelt-ellen) · 2024-01-25T13:40:49.153Z · LW(p) · GW(p)

Thanks for coming back on this. I just saw your comment, and I agree with your thoughtful points.

Let me also DM you the edited transcript of the conversation with Anders Sandberg.

comment by chasmani · 2023-09-28T12:02:26.266Z · LW(p) · GW(p)

I am interested in the substrate-needs convergence project. 

Here are some initial thoughts, I would love to hear some responses:

  • An approach could be to say under what conditions natural selection will and will not sneak in. 
  • Natural selection requires variation. Information theory tells us that all information is subject to noise and therefore variation across time. However, we can reduce error rates to arbitrarily low probabilities using coding schemes. Essentially this means that it is possible to propagate information across finite timescales with arbitrary precision. If there is no variation then there is no natural selection. 
  • In abstract terms, evolutionary dynamics require either a smooth adaptive landscape such that incremental changes drive organisms towards adaptive peaks and/or unlikely leaps away from local optima into attraction basins of other optima. In principle AI systems could exist that stay in safe local optima and/or have very low probabilities of jumps to unsafe attraction basins. 
  • I believe that natural selection requires a population of "agents" competing for resources. If we only had a single AI system then there is no competition and no immediate adaptive pressure.
  • Other dynamics will be at play which may drown out natural selection. There may be dynamics that occur at much faster timescales than this kind of natural selection, such that adaptive pressure towards resource accumulation cannot get a foothold. 
  • Other dynamics may be at play that can act against natural selection. We see existence-proofs of this in immune responses against tumours and cancers. Although these don't work perfectly in the biological world, perhaps an advanced AI could build a type of immune system that effectively prevents individual parts from undergoing runaway self-replication. 
Replies from: Linda Linsefors, remmelt-ellen
comment by Linda Linsefors · 2023-09-29T10:29:45.491Z · LW(p) · GW(p)
  • An approach could be to say under what conditions natural selection will and will not sneak in. 

Yes!

  • Natural selection requires variation. Information theory tells us that all information is subject to noise and therefore variation across time. However, we can reduce error rates to arbitrarily low probabilities using coding schemes. Essentially this means that it is possible to propagate information across finite timescales with arbitrary precision. If there is no variation then there is no natural selection. 

Yes! The big question to me is whether we can reduce error rates enough. And "error rates" here is not just hardware signal error, but also randomness that comes from interacting with the environment.
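To make the coding-scheme point concrete, here is a minimal sketch (my own toy example; the function names and numbers are arbitrary) comparing naive copying with a 3-fold repetition code decoded by majority vote. It only models independent bit flips in stored information, which is exactly the simplification in question, since environment-driven variation is not captured.

```python
import random

def copy_bits(bits, p):
    # Copy a bit string, flipping each bit independently with probability p.
    return [b ^ (random.random() < p) for b in bits]

def copy_with_repetition(bits, p, r=3):
    # Copy each bit as r redundant copies, then decode by majority vote.
    # For r=3 the residual per-copy error rate is roughly 3*p**2 (two or more flips).
    out = []
    for b in bits:
        copies = [b ^ (random.random() < p) for _ in range(r)]
        out.append(1 if sum(copies) > r // 2 else 0)
    return out

def fraction_mutated(generations, n_bits, p, corrected):
    # Fraction of bits that differ from the original after repeated copying.
    original = [0] * n_bits
    bits = original[:]
    for _ in range(generations):
        bits = copy_with_repetition(bits, p) if corrected else copy_bits(bits, p)
    return sum(b != o for b, o in zip(bits, original)) / n_bits

random.seed(0)
print("no correction: ", fraction_mutated(1000, 1000, 1e-3, corrected=False))  # roughly 0.4
print("3x repetition: ", fraction_mutated(1000, 1000, 1e-3, corrected=True))   # roughly 0.003
```

Even in this idealised setting the corrected variant still accumulates some variation, just much more slowly, so the question becomes whether error rates can be pushed down fast enough across every channel through which variation can enter.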

  • In abstract terms, evolutionary dynamics require either a smooth adaptive landscape such that incremental changes drive organisms towards adaptive peaks and/or unlikely leaps away from local optima into attraction basins of other optima. In principle AI systems could exist that stay in safe local optima and/or have very low probabilities of jumps to unsafe attraction basins. 

It has to be smooth relative to the jumps that can be achieved by whatever is generating the variation. Natural mutations don't typically make large jumps. But a small change in the motivation of an intelligent system may cause a large shift in behaviour. 

  • I believe that natural selection requires a population of "agents" competing for resources. If we only had a single AI system then there is no competition and no immediate adaptive pressure.

I thought so too to start with. I still don't know what the right conclusion is, but I think that substrate-needs convergence is at least still a risk even with a singleton. Something that is smart enough to be a general intelligence is probably complex enough to have internal parts that operate semi-independently, and therefore these parts can compete with each other. 

I think the singleton scenario is the most interesting, since I think that if we have several competing AIs, then we are just super doomed. 

And by singleton I don't necessarily mean a single entity. It could also be a single alliance. The boundaries between group and individual might not be as clear with AIs as with humans. 

  • Other dynamics will be at play which may drown out natural selection. There may be dynamics that occur at much faster timescales than this kind of natural selection, such that adaptive pressure towards resource accumulation cannot get a foothold. 

This will probably be correct for a time. But will it be true forever? One of the possible end goals for alignment research is to build the aligned superintelligence that saves us all. If substrate-needs convergence is true, then this end goal is off the table. Because even if we reach this goal, the system will inevitably either start to value drift towards self-replication, or get eaten from the inside by parts that have mutated towards self-replication (AI cancer), or something like that.

  • Other dynamics may be at play that can act against natural selection. We see existence-proofs of this in immune responses against tumours and cancers. Although these don't work perfectly in the biological world, perhaps an advanced AI could build a type of immune system that effectively prevents individual parts from undergoing runaway self-replication. 

Cancer is an excellent analogy. Humans defeat it in a few ways that work together:

  1. We have evolved to have cells that mostly don't defect
  2. We have an evolved immune system that attacks cancer when it does happen
  3. We have developed technology to help us find and fight cancer when it happens
  4. When someone gets cancer anyway and it can't be defeated, only they die; it doesn't spread to other individuals. 

Point 4 is very important. If there is only one agent, this agent needs perfect cancer-fighting ability to avoid being eaten by natural selection. The big question to me is: is this possible?
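As a back-of-envelope illustration of why the required reliability is so extreme (the numbers below are arbitrary, and the model assumes escape events are independent): if a singleton has many semi-independent components, each with some tiny per-cycle chance of a defecting variant slipping past its error correction, the chance of at least one escape ever happening compounds over components and cycles.

```python
# Back-of-envelope sketch. Assumes independent escape events; all numbers are
# illustrative, not estimates of any real system.
def p_any_escape(q, n_components, t_cycles):
    # Probability that at least one defecting variant ever slips past detection,
    # given per-component, per-cycle escape probability q.
    return 1 - (1 - q) ** (n_components * t_cycles)

for q in (1e-9, 1e-12, 1e-15):
    print(f"q = {q:g}: P(any escape) ~ {p_any_escape(q, 10**6, 10**6):.3g}")
```

With a million components and a million cycles, a one-in-a-trillion per-cycle escape rate already gives roughly a two-in-three chance of at least one escape, so the per-part error rate has to shrink at least as fast as the number of opportunities grows. That is one way of making the "is this possible?" question precise.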

If, on the other hand, you have several agents, then you definitely don't escape natural selection, because these entities will compete with each other. 
Replies from: chasmani
comment by chasmani · 2023-10-11T15:29:38.491Z · LW(p) · GW(p)

Thanks for the reply! 

I think it might be true that substrate convergence is inevitable eventually. But it would be helpful to know how long it would take. Potentially we might be ok with it if the expected timescale is long enough (or the probability of it happening in a given timescale is low enough).

I think the singleton scenario is the most interesting, since I think that if we have several competing AIs, then we are just super doomed. 

If that's true then that is a super important finding! And also an important thing to communicate to people! I hear a lot of people say the opposite: that we need lots of competing AIs.

I agree that analogies to organic evolution can be very generative. Both in terms of describing the general shape of dynamics, and how AI could be different. That line of thinking could give us a good foundation to start asking how substrate convergence could be exacerbated or avoided. 

Replies from: Linda Linsefors
comment by Linda Linsefors · 2023-10-11T17:30:12.726Z · LW(p) · GW(p)

Potentially we might be ok with it if the expected timescale is long enough (or the probability of it happening in a given timescale is low enough).

Agreed. I'd love for someone to investigate the possibility of slowing down substrate-needs convergence enough that it counts as basically solved.

If that's true then that is a super important finding! And also an important thing to communicate to people! I hear a lot of people say the opposite: that we need lots of competing AIs.

Hm, to me this conclusion seems fairly obvious. I don't know how to communicate it though, since I don't know what the crux is. I'd be up for participating in a public debate about this, if you can find me an opponent. Although not until after the AISC research lead applications are over and I've had some time to recover. So maybe late November at the earliest. 

comment by Remmelt (remmelt-ellen) · 2023-09-30T14:28:55.127Z · LW(p) · GW(p)

Thanks for the thoughts! Some critical questions:

Natural selection requires variation. Information theory tells us that all information is subject to noise and therefore variation across time.

Are you considering variations introduced during learning (as essentially changes to code, which can then be copied)? Are you considering variations introduced by microscopic changes to the chemical/structural configurations of the maintained/produced hardware?

However, we can reduce error rates to arbitrarily low probabilities using coding schemes.

Claude Shannon showed this to be the case for a single channel of communication. How about when you have many possible routing channels through which physical signals can leak to and back from the environment?

If you look at existing networked system architectures, do the near-zero error rates you can correct toward at the binary level (eg. with use of CRC code) also apply at higher layers of abstraction (eg. in detecting possible trojan horse adversarial attacks)?

If there is no variation then there is no natural selection.

This is true. Can there be no variation introduced into AGI, when they are self-learning code and self-maintaining hardware in ways that continue to be adaptive to changes within a more complex environment?

In abstract terms, evolutionary dynamics require either a smooth adaptive landscape such that incremental changes drive organisms towards adaptive peaks…

Besides point-change mutations, are you taking into account exaptation, as the natural selection for shifts in the expression of previous (learned) functionality? Must exaptation, as involving the reuse of functionality in new ways, involve smooth changes in phenotypic expression?

…and/or unlikely leaps away from local optima into attraction basins of other optima.

Are the other attraction basins instantiated at higher layers of abstraction? Are any other optima approached through selection across the same fine-grained super-dimensional landscape that natural selection is selective across? If not, would natural selection “leak” around those abstraction layers, as not completely being pulled into the attraction basins that are in fact pulling across a greatly reduced set of dimensions? Put a different way, can natural selection pull side-ways on the dimensional pulls of those other attraction basins?

I believe that natural selection requires a population of "agents" competing for resources. If we only had a single AI system then there is no competition and no immediate adaptive pressure.

I get how you would represent it this way, because that’s often how natural selection gets discussed as applying to biological organisms.

It is not quite thorough in terms of describing what can get naturally selected for. For example, within a human body (as an “agent”) there can be natural selection across junk DNA that copies itself across strands, or virus particles, or cancer cells. At that microscopic level though, the term “agent” would lose its meaning if used to describe some molecular strands.

At the macroscopic level of “AGI”, the single vs. multiple agents distinction would break down, for reasons I described here [LW(p) · GW(p)].

Therefore, to thoroughly model this, I would try describe natural selection as occurring across a population of components. Those components would be connected and co-evolving, and can replicate individually (eg. as with viruses replacing other code) or as part of larger packages or symbiotic processes of replication (eg. code with hardware). For AGI, they would all rely on somewhat similar infrastructure (eg. for electricity and material replacement) and also need somewhat similar environmental conditions to operate and reproduce.

Other dynamics will be at play which may drown out natural selection…Other dynamics may be at play that can act against natural selection.

Can the dynamic drown out all possible natural selection over x shortest-length reproduction cycles? Assuming the “AGI” continues to exist, could any dynamics you have in mind drown out any and all interactions between components and surrounding physical contexts that could feed back into their continued/increased existence?

We see existence-proofs of this in immune responses against tumours and cancers. Although these don't work perfectly in the biological world, perhaps an advanced AI could build a type of immune system that effectively prevents individual parts from undergoing runaway self-replication.

Immune system responses were naturally selected for amongst organisms that survived.

Would such responses also be naturally selected for in “advanced AI” such that not the AI but the outside humans survive more? Given that bottom-up natural selection by nature selects for designs across the greatest number of possible physical interactions (is the most comprehensive), can alternate designs built through faster but more narrow top-down engineering actually match or exceed that fine-grained extent of error detection and correction? Even if humans could get “advanced AI” to build in internal error detection and correction mechanisms that are kind of like an immune system, would that outside-imposed immune system withstand natural selection while reducing the host’s rates of survival and reproduction [LW · GW]?

~ ~ ~

Curious how you think about those questions. I also passed on your comment to my mentor (Forrest) in case he has any thoughts.

Replies from: chasmani
comment by chasmani · 2023-10-11T15:41:43.285Z · LW(p) · GW(p)

Thank you for the great comments! I think I can sum up a lot of that as "the situation is way more complicated and high-dimensional, and life will find a way". Yes, I agree. 

I think what I had in mind was an AI system that is supervising all other AIs (or AI components) and preventing them from undergoing natural selection.  A kind of immune system. I don't see any reason why that would be naturally selected for in the short-term in a way that also ensures human survival. So it would have to be built on purpose. In that model, the level of abstraction that would need to be copied faithfully would be the high-level goal to prevent runaway natural selection. 

It would be difficult to build for all the reasons that you highlight. If there is an immunity/self-replication arms race, then you might ordinarily expect the self-replication to win, because it only has to win once while the immune system has to win every time. But if the immune response had enough oversight and understanding of the system, then it could potentially prevent the self-replication from ever getting started. I guess that comes down to whether a future AI can predict or control future innovations of itself indefinitely. 

Replies from: remmelt-ellen
comment by Remmelt (remmelt-ellen) · 2023-10-13T11:42:42.612Z · LW(p) · GW(p)

I guess that comes down to whether a future AI can predict or control future innovations of itself indefinitely.

 

That's a key question. You might be interested in this section [LW · GW] on limits of controllability.

Clarifying questions:
1. To what extent can AI predict the code they will learn from future unknown inputs, and how that code will subsequently interact with the then-connected surroundings of the environment?

2. To what extent can AI predict all the (microscopic) modifications that will result from all the future processes involved in the future re-production of hardware components?