Droopyhammock's Shortform

post by Droopyhammock · 2023-02-11T11:01:43.210Z · LW · GW · 36 comments




Comments sorted by top scores.

comment by Droopyhammock · 2023-03-22T15:07:40.407Z · LW(p) · GW(p)

This seems like a good way to reduce S-risks, so I want to get this idea out there. 

This is copied from the r/SufferingRisk subreddit here: https://www.reddit.com/r/SufferingRisk/wiki/intro/

As people get more desperate in attempting to prevent AGI x-risk, e.g. as AI progress draws closer & closer to AGI without satisfactory progress in alignment, the more reckless they will inevitably get in resorting to [LW(p) · GW(p)] so-called "hail mary" and more "rushed" alignment techniques that carry a higher chance of s-risk. These are less careful and "principled"/formal theory based techniques (e.g. like MIRI's Agent Foundations agenda) but more hasty last-ditch ideas that could have more unforeseen consequences or fail in nastier ways, including s-risks. This is a phenomenon we need to be highly vigilant in working to prevent. Otherwise, it's virtually assured to happen; as, if faced with the arrival of AGI without yet having a good formal solution to alignment, most humans would likely choose a strategy that at least has a chance of working (trying a hail mary technique) instead of certain death (deploying their AGI without any alignment at all), despite the worse s-risk implications. To illustrate this, even Eliezer Yudkowsky, who wrote Separation from hyperexistential risk, has written this (due to his increased pessimism about alignment progress [LW · GW]):

At this point, I no longer care how it works, I don't care how you got there, I am cause-agnostic about whatever methodology you used, all I am looking at is prospective results, all I want is that we have justifiable cause to believe of a pivotally useful AGI 'this will not kill literally everyone'.

The big ask from AGI alignment, the basic challenge I am saying is too difficult, is to obtain by any strategy whatsoever a significant chance of there being any survivors. (source [LW · GW])

If even the originator of these ideas now has such a singleminded preoccupation with x-risk to the detriment of s-risk, how could we expect better from anyone else? Basically, in the face of imminent death, people will get desperate enough to do anything to prevent that, and suddenly s-risk considerations become a secondary afterthought (if they were even aware of s-risks at all). In their mad scramble to avert extinction risk, s-risks get trampled over, with potentially unthinkable results.
One possible idea to mitigate this risk would be, instead of trying to (perhaps unrealistically) prevent any AI development group worldwide from attempting hail mary type techniques in case the "mainline"/ideal alignment directions don't bear fruit in time, we could try to hash out the different possible options in that class, analyze which ones have unacceptably high s-risk to definitely avoid, or less s-risk which may be preferable, and publicize this research in advance for anyone eventually in that position to consult. This would serve to at least raise awareness of s-risks among potential AGI deployers so they incorporate it as a factor, and frontload their decisionmaking between different hail marys (which would otherwise be done under high time pressure, producing an even worse decision).


I believe that identifying which Hail Mary strategies are particularly bad from an s-risk point of view is a good idea. I think this may be a very important piece of work that should be done, assuming it has not been done already.

comment by Droopyhammock · 2023-02-11T20:19:55.807Z · LW(p) · GW(p)

I wonder how much the AI alignment community will grow in 2023. As someone who only properly became aware of the alignment problem a few months ago, with the release of ChatGPT, it seems like the world has gone from nearly indifferent to AI to obsessed with it. This will lead to more and more people researching things about AI and it will also lead to more and more people becoming aware of the alignment problem. 

I really hope that this leads to more of the right kind of attention for AI safety issues. It might also mean that it’s easier to get highly skilled people to work on alignment and take it seriously.

comment by Droopyhammock · 2023-04-08T21:50:19.802Z · LW(p) · GW(p)

Is there something like a pie chart of outcomes from AGI?

I am trying to get a better understanding of the realistic scenarios and their likelihoods. I understand that the likelihoods are very disagreed upon.

My current opinion looks a bit like this:

30%: Human extinction

10%: Fast human extinction

20%: Slower human extinction 


30%: Alignment with good outcomes


20%: Alignment with at best mediocre outcomes


20%: Unaligned AGI, but at least some humans are still alive

12%: We are instrumentally worth not killing

6%: The AI wireheads us

2%: S-risk from the AI having the production of suffering as one of its terminal goals


I decided to break down the unaligned AGI scenarios a step further.
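As a quick arithmetic check on the breakdown above, here is a minimal Python sketch that sums the sub-scenarios into their parent categories and confirms that the whole chart adds up to 100%. The dictionary layout and the names `outcomes`, `total`, `top_level` and `grand_total` are only illustrative; the percentages are the ones listed above.

```python
# Percentages copied from the breakdown above; nested dicts hold sub-scenarios.
outcomes = {
    "Human extinction": {
        "Fast human extinction": 10,
        "Slower human extinction": 20,
    },
    "Alignment with good outcomes": 30,
    "Alignment with at best mediocre outcomes": 20,
    "Unaligned AGI, but at least some humans are still alive": {
        "We are instrumentally worth not killing": 12,
        "The AI wireheads us": 6,
        "S-risk from suffering as a terminal goal": 2,
    },
}


def total(node):
    """Return a leaf percentage, or the recursive sum of a category's sub-scenarios."""
    if isinstance(node, dict):
        return sum(total(child) for child in node.values())
    return node


# Collapse each top-level category to a single number, then sum the chart.
top_level = {name: total(node) for name, node in outcomes.items()}
grand_total = sum(top_level.values())

for name, percent in top_level.items():
    print(f"{percent:>3}%  {name}")
print(f"{grand_total:>3}%  total")
```

The sub-scenarios sum to their parent categories (10 + 20 = 30 and 12 + 6 + 2 = 20), so the four top-level slices cover the full 100%.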

If there are any resources specifically to refine my understanding of the possible outcomes and their likelihoods, please tell me of them. Additionally, if you have any other relevant comments I’d be glad to hear them.

Replies from: Vladimir_Nesov
comment by Vladimir_Nesov · 2023-04-08T23:12:10.883Z · LW(p) · GW(p)

I think the scenario of "aligned AI, that then builds a stronger ruinous misaligned AI" deserves a special mention. I was briefly unusually hopeful last fall, after concluding that LLMs have a reasonable chance of loose NotKillEveryone-level alignment, but then realized that they also have a reasonable chance of starting out as autonomous AGIs at merely near-human level (in rationality/coordination), in which case they are liable to build ruinous misaligned AGIs for exactly the same reasons the humans are currently rushing ahead, or under human instruction to do so, just faster. I'm still more hopeful than a year ago, but not by much, and most of my P(doom) is in this scenario.

I worry that a lot of good takes on alignment optimism are about alignment of first AGIs and don't at all take into account this possibility. An aligned superintelligence won't sort everything else out if it's not a superintelligence yet or if it's still under human control [LW · GW] (in a sense that's distinct from alignment).

comment by Droopyhammock · 2023-03-24T10:21:52.366Z · LW(p) · GW(p)

Quick question:

How likely is AGI within 3 months from now?

For the purpose of this question I am basically defining AGI as the point at which, if it is unaligned, stuff gets super weird. By “super weird” I mean things that are obvious to the general public, such as everybody dropping dead or all electronics being shut down or something of similar magnitude. For the purposes of this question, the answer can’t be “already happened” even if you believe we already have AGI by your definition.

I get the impression that the general opinion is “pretty unlikely” but I’m not sure. I’ve been feeling kinda panicked about the possibility of extremely imminent AGI recently, so I want to just see how close to reality my level of concern is in the extremely short term.

comment by Droopyhammock · 2023-02-25T12:26:15.940Z · LW(p) · GW(p)

Why is it assumed that an AGI would just kill us for our atoms, rather than using us for other means? 

There are multiple reasons I understand for why this is a likely outcome. If we pose a threat, killing us is an obvious solution, although I’m not super convinced killing literally everyone is the easiest solution to this. It seems to me that the primary reason to assume an AGI will kill us is just that we are made of atoms which can be used for another purpose.

If there is a period where we pose a genuine threat to an AGI, then I can understand the assumption that it will kill us, but if we pose virtually no threat to it (which I think is highly plausible if it is vastly more intelligent than us), then I don’t see it as obvious that it will kill us.

It seems to me that the assumption that an AGI (specifically one powerful enough that we pose essentially no threat to it) will kill us simply for our atoms rests on the premise that there is no better use for us. It seems like there is an assumption that we have so little value to an AGI that we might as well be a patch of grass.

But does this make sense? We are the most intelligent beings that we know of, we created the AGI itself. It seems plausible to me that there is some use for us which is not simply to use our atoms.

I can understand how eventually all humans will be killed, because the optimal end state looks like the entire universe being paperclips or whatever, but I don’t understand why it is assumed that there is no purpose for us before that point.

To be clear, I’m not talking about the AGI caring for our well-being or whatever. I’m more thinking along the lines of the AGI studying us or doing experiments on us, which to me is a lot more concerning than “everyone drops dead within the same second”. 

I find it entirely plausible that we really are so uninteresting that the most value we have to an AGI is the approximately 7 x 10^27 atoms that make up each of our bodies, but I don’t understand why this is assumed with high confidence (which seems to be the case from what I have seen).

Replies from: Vladimir_Nesov, Dagon, HarrisonDurland
comment by Vladimir_Nesov · 2023-02-25T17:27:56.335Z · LW(p) · GW(p)

Caring about our well-being is similar to us being interesting to study, both attitudes are paying attention to us specifically, whether because we in particular made it into ASI's values (a fragile narrow target), or because graceful extrapolation of status quo made it into their values (which I think is more likely), so that the fact that we've been living here in the past becomes significant. So if alignment is unlikely, s-risk is similarly unlikely. And if alignment works via robustness of moral patienthood (for ASIs that got to care about such concepts), it's a form of respecting boundaries, so probably doesn't pose s-risk.

There might also be some weight to trade with aliens argument, if in a few billions of years our ASI makes contact with an alien-aligned alien ASI that shares their builders' assignment of moral patienthood to a wide range of living sapient beings. Given that the sky is empty, possibly for a legible reason even, and since the amount of reachable stuff is not unbounded, this doesn't seem very likely. Also, the alien ASI would need to be aligned, though a sapient species not having fingers might be sufficient to get there, getting a few more millions of years of civilization and theory before AGI. But all this is likely to buy humanity is a cold backup, which needs to survive all the way to a stable ASI, through all intervening misalignments, and encryption strong enough to be unbreakable by ASIs is not too hard, so there might be some chance of losing the backup even if it's initially made.

Replies from: Droopyhammock
comment by Droopyhammock · 2023-02-26T15:36:36.006Z · LW(p) · GW(p)

It doesn’t seem to me that you have addressed the central concern here. I am concerned that a paperclip maximiser would study us. 

There are plenty of reasons I can imagine for why we may contain helpful information for a paperclip maximiser. One such example could be that a paperclip maximiser would want to know what an alien adversary may be like, and would decide that studying life on Earth should give insights about that. 

comment by Dagon · 2023-02-25T16:39:13.698Z · LW(p) · GW(p)

I think you need to refine your model of "us".  There is no homogeneous value for the many billions of humans, and there's a resource cost to keeping them around.  Averages and sums don't matter to the optimizer.  

There may be value in keeping some or many humans around, for some time.  It's not clear that you or I will be in that set, or even how big it is.  There are a lot of different intermediate equilibria that may make it easier to allow/support something like an autonomous economy to keep sufficient humans aligned with its needs.  Honestly, self-reproducing self-organizing disposable agents, where the AI controls them at a social/economic level, seems pretty resource-efficient.

comment by HarrisonDurland · 2023-02-27T05:07:43.458Z · LW(p) · GW(p)

In short, surveillance costs (e.g., "make sure they aren't plotting against you and try detonating a nuke or just starting a forest fire out of spite") might be higher than the costs of simply killing the vast majority of people. Of course, there is some question to be had about whether it might consider it worthwhile to study some 0.00001% of humans locked in cages, but again that might involve significantly higher costs than if it just learned how to recreate humans from scratch as it did a lot of other learning about the world. 

But I'll grant that I don't know how an AGI would think or act, and I can't definitively rule out the possibility, at least within the first 100 years or so.

comment by Droopyhammock · 2023-02-19T10:12:59.632Z · LW(p) · GW(p)

Can someone please tell me why this S-risk is unlikely?

It seems almost MORE likely than extinction to me.


Replies from: lahwran, span1, span1
comment by the gears to ascension (lahwran) · 2023-03-25T21:40:00.710Z · LW(p) · GW(p)

once you have the technology to do them, brain scans are quick and easy. It is not necessary to simulate all the way through a human brain in order to extract the information in it - there are lossless abstractions that can be discovered which will greatly speed up insights from brains. humans have not yet found them but particularly strong AIs could either get very close to them or could actually find them. In that sense I don't think the high thermal cost suffering simulation possibility is likely. however it does seem quite plausible to me that if we die we get vacuumed up first and used for parts.

comment by span1 · 2023-03-24T20:48:35.697Z · LW(p) · GW(p)

Why do you think it more likely than extinction?

Replies from: Droopyhammock
comment by Droopyhammock · 2023-03-25T21:34:57.095Z · LW(p) · GW(p)

I have had more time to think about this since I posted this shortform. I also posted a shortform after that which asked pretty much the same question, but with words, rather than just a link to what I was talking about (the one about why is it assumed an AGI would just use us for our atoms and not something else). 

I think that there is a decent chance that an unaligned AGI will do some amount of human experimentation/study, but it may well be on a small number of people, and hopefully not for very long.
To me, one of the most concerning ways this could be a lot worse is if there is some valuable information we contain, which takes a long time for an AGI to gain through studying us. The worst case scenario would then probably be if the AGI thinks there is a chance that we contain very helpful information, when in fact we don’t, and so endlessly continues studying/ experimenting on us, in order to potentially extract that information.

I have only been properly aware of the alignment problem for a few months, so my opinions and understanding of things are still forming. I am particularly concerned by s-risks and I have OCD, so I may well overestimate the likelihood of s-risks. I would not be surprised if a lot of the s-risks I worry about, especially when they are things which decrease the probability of AGI killing everyone, are just really unlikely. From my understanding, Eliezer and others think that literally everyone dying makes up the vast majority of the bad scenarios, although I’m not sure how much suffering is expected before that point. I know Eliezer said recently that he expects our deaths to be quick, assuming an unaligned AGI.

comment by span1 · 2023-03-24T20:49:31.960Z · LW(p) · GW(p)
comment by Droopyhammock · 2023-11-14T22:16:47.936Z · LW(p) · GW(p)

Should we be worried about being preserved in an unpleasant state?

I’ve seen surprisingly little discussion about the risk of everyone being “trapped in a box for a billion years”, or something to that effect. There are many plausible reasons why keeping us around could be worth it, such as to sell us to aliens in the future. Even if it turns out to be not worth it for an AI to keep us around, it may take a long time for it to realise this.

Should we not expect to be kept alive, at least until an AI has extremely high levels of confidence that we aren’t useful? If so, is our state of being likely to be bad while we are preserved?

This seems like one of the most likely s-risks to me.

Replies from: Vladimir_Nesov
comment by Vladimir_Nesov · 2023-11-14T23:48:09.726Z · LW(p) · GW(p)

Archiving as data (to be reconstructed as needed) seems cheaper and more agreeable with whatever use is implied for humanity in this scenario. Similarly with allowing humans to have a nice civilization, if humanity remaining alive uninterrupted is a consideration.

Replies from: Droopyhammock
comment by Droopyhammock · 2023-11-15T19:57:53.299Z · LW(p) · GW(p)

First of all, I basically agree with you. It seems to me that in scenarios where we are preserved, preservation is likely to be painless and most likely just not experienced by those being preserved.

But, my confidence that this is the case is not that high. As a general comment, I do get concerned that a fair amount of pushback on the likelihood of s-risk scenarios is based on what “seems” likely. 

I usually don’t disagree on what “seems” likely, but it is difficult for me to know if “seems” means a confidence level of 60%, or 99%.

comment by Droopyhammock · 2023-05-26T16:16:30.438Z · LW(p) · GW(p)

Do you believe that resurrection is possible?

By resurrection I mean the ability to bring back people, even long after they have died and their body has decayed or been destroyed. I do not mean simply bringing someone back who has been cryonically frozen. I also mean bringing back the same person who died, not simply making a clone.

I will try to explain what I mean by “the same”. Let’s call the person before they died “Bob 1” and the resurrected version “Bob 2”. Bob 1 and Bob 2 are completely selfish and only care about themselves. In the version of resurrection I am talking about, Bob 1 cares as much about Bob 2’s experience as Bob 1 would care about Bob 1’s future experience, had Bob 1 not died.

It is kind of tricky to articulate exactly what I mean when I say “the same”, but I hope the above is good enough.

If you want to, an estimate of the percentage chance of this being possible would be cool, but if you just want to give your thoughts I would be interested in that as well.

Replies from: nim
comment by nim · 2023-05-26T17:29:53.802Z · LW(p) · GW(p)

I believe we'll eventually build systems that we call resurrection, and which some people believe qualify as it. You haven't provided enough information for me to guess whether we'll build systems that you believe qualify as it, though.

If I brought you someone and said "this is your great-grandparent resurrected", how would you decide whether you believed that resurrection was real?

If I brought you someone and said "this is your ancestor from 100 generations ago resurrected", how would you decide whether you believed that resurrection was real?

If I brought you someone and said "this is Abraham Lincoln resurrected", how would you decide whether you believed that resurrection was real?

If I brought you someone and said "this is a member of the species Homo Erectus resurrected", how would you decide whether you believed that resurrection was real?

You must explain what you mean by "the same" before anyone can give you a useful answer about how likely it is that such a criterion will ever be met.

Replies from: Droopyhammock
comment by Droopyhammock · 2023-05-26T20:52:41.918Z · LW(p) · GW(p)

I have edited my shortform to try to better explain what I mean by “the same”. It is kind of hard to do so, especially as I am not very knowledgeable on the subject, but hopefully it is good enough.

Replies from: nim
comment by nim · 2023-05-27T15:33:41.452Z · LW(p) · GW(p)

I will try to explain what I mean by “the same”. Let’s call the person before they died “Bob 1” and the resurrected version “Bob 2”. Bob 1 and Bob 2 are completely selfish and only care about themselves. In the version of resurrection I am talking about, Bob 1 cares as much about Bob 2’s experience as Bob 1 would care about Bob 1’s future experience, had Bob 1 not died.

This supposes that Bob 1 knows about Bob 2's experiences. That seems impossible if Bob 1 died before Bob 2 came into being, which is what's typically understood by the term "resurrect" used in the context of death ("restore (a dead person) to life."). If Bob 1 and Bob 2 exist at the same time, whatever's happening is probably not resurrection.

Let's stick with standard resurrection though: Bob 1 dies and then Bob 2 comes into existence. We're measuring their sameness, at your request, by the expected sentiment of each toward the other.

If I were an unethical researcher in the present day, I could name a child Bob 2 and raise it to be absolutely certain that it was the reincarnation of Bob 1. It would be nice if the child happened to share some genes with Bob 1, but not absolutely essential. The child would not have an easy life, as it would be accused of various mental disorders and probably identity theft, but it would technically meet the "sameness is individual belief" criterion that you require. As an unethical researcher, I would of course select the individual Bob 1 to be someone who believes that reincarnation is possible, and thus cares about the wellbeing of their expected reincarnated self (whom they probably define as 'the person who believes they're my reincarnation', because most people don't think adversarially about such things) as much as they care about their own.

There you go, a hypothetical pair of individuals who meet your criteria, created using no technology more advanced than good ol' cult brainwashing. So for this definition, I'd say the percentage chance that it's possible matches the percentage chance that someone would be willing to set their qualms aside and ruin Bob 2's life prospects for the sake of the experiment.

(yes, this is an unsatisfying answer, but I hope it might illustrate something useful if you see how its nature follows directly from the nature of your question)

Replies from: Droopyhammock
comment by Droopyhammock · 2023-05-28T06:26:33.245Z · LW(p) · GW(p)

Your response does illustrate that there are holes in my explanation. Bob 1 and Bob 2 do not exist at the same time. They are meant to represent one person at two different points in time.

A separate way I could try to explain what kind of resurrection I am talking about is to imagine a married couple. An omniscient husband would have to care as much about his wife after she was resurrected as he did before she died.

I somewhat doubt that I could patch all of the holes that could be found in my explanation. I would appreciate it if you try to answer what I am trying to ask.

Replies from: nim
comment by nim · 2023-05-30T17:17:08.554Z · LW(p) · GW(p)

I would appreciate it if you try to answer what I am trying to ask.

What I'm hearing here is that you want me to make up a version of your initial question that's coherent, and offer an answer that you find satisfying. However, I have already proposed a refinement of your question that seems answerable, and you've rejected that refinement as missing the point.

If you want to converse with someone capable of reading your mind and discerning not only what answer you want but also what question you want the answer to, I'm sorry to inform you that I am unable to use those powers on you at this time.

My inability to provide an answer which satisfies you stems directly from my inability to understand what question you want answered, so I don't think this is a constructive conversation to continue. Thank you for your time and discourse in challenging me to articulate why your seemingly intended question seems unanswerable, even though I don't think I've articulated that in a way that's made sense to you.

comment by Droopyhammock · 2023-02-11T11:01:43.411Z · LW(p) · GW(p)

What is your take on this?


People on the machinelearning subreddit seem to think this is a big deal.

Replies from: Gunnar_Zarncke, Vladimir_Nesov
comment by Gunnar_Zarncke · 2023-02-11T23:24:05.590Z · LW(p) · GW(p)

I don't think this is a decisive step but an interesting capability step. It will also have a lot of security issues (depending on the tools used).


Keyword: Toolformer

comment by Vladimir_Nesov · 2023-02-11T12:21:39.583Z · LW(p) · GW(p)

The tool calls being spontaneously emitted inline as ordinary tokens is interesting, this interface could be a more HCH-like alternative to chat-like bureaucracy/debate when the tools are other LLMs (or the same LLM).

Replies from: Droopyhammock
comment by Droopyhammock · 2023-02-11T13:42:10.802Z · LW(p) · GW(p)

I’m just a layperson so I don’t understand much of this, but some people on the machine learning subreddit seem to think this means AGI is super close. What should I make of that? Does this update timelines to be significantly shorter?

Replies from: Vladimir_Nesov
comment by Vladimir_Nesov · 2023-02-11T14:28:07.608Z · LW(p) · GW(p)

As someone with 8-year median timelines to AGI, I don't see this as obviously progress, but it frames possible applications that would be. The immediate application might be to get rid of some hallucinated factually incorrect claims, by calling tools for fact-checking with small self-contained queries and possibly self-distilling corrected text. This could bring LLM-based web search (for which this week's announcements from Microsoft and Google are starting a race) closer to production quality. But this doesn't seem directly AGI-relevant.

A more loose AGI-relevant application is to use this for teaching reliable use of specific reasoning/agency skills, automatically formulating "teachable moments" in the middle of any generated text. An even more vague inspiration from the paper is replacing bureaucracies (which is the hypothetical approach to teaching reasoning/agency skills when not already having them reliably available) with trees of (chained) tool calls, obviating the need to explicitly set up chatrooms where multiple specialized chatbots discuss a problem with each other at length to make better progress than possible to do immediately/directly.

Replies from: Droopyhammock
comment by Droopyhammock · 2023-02-11T16:45:17.048Z · LW(p) · GW(p)

Is an 8-year median considered long or short or about average? I’m specifically asking in relation to the opinion of people who pay attention to AGI capabilities and are aware of the alignment problem. I’m just hoping you can give me an idea of what is considered “normal” among AGI/alignment people with regard to AGI timelines.

comment by Droopyhammock · 2023-02-17T11:23:33.768Z · LW(p) · GW(p)

Is it possible that the fact we are still alive means that there is a core problem to the idea of existential risk from AI?

There are people who think that we already have AGI, and this number has only grown with the recent Bing situation. Maybe we have already passed the threshold for RSI, maybe we passed it years ago.

Is there something to the idea that you can slightly decrease your pdoom for every day we are still alive?

It seems possible to me that AI will just get better and better and we’ll just continue to raise the bar for when it is going to kill us, not realising that we have already passed that point and everything is fine for some reason.

I’m not saying I think this is the case, but I do consider it a possibility.

Replies from: jimrandomh
comment by jimrandomh · 2023-02-18T06:11:12.755Z · LW(p) · GW(p)

Current AI systems aren't capable of doing large-scale software development in non-AI domains. E.g., there is no AI-written operating system, compiler, or database (even though minimal versions of these are often done by individual students as single-semester capstone projects). I don't think we can infer much of anything from the fact that recursive self-improvement hasn't happened yet until after that threshold is crossed.

comment by Droopyhammock · 2023-07-26T09:26:58.736Z · LW(p) · GW(p)

Takeover speeds?

For the purpose of this shortform, I am considering “takeover” to start when crazy things begin happening or it is clear that an unaligned AGI (or several AGIs) is attempting to take over. I consider “takeover” to have ended when humanity is extinct or similarly subjugated. This is also under the assumption that a takeover does happen.

From my understanding of Eliezer’s views, he believes takeover will be extremely fast (possibly seconds). Extremely fast takeovers make a lot more sense if you assume that a takeover will be more like a sneak attack.

How fast do you think takeover will be? (if it happens)

Do you expect to just suddenly drop dead, to have enough time to say goodbye to your loved ones, or to see humanity fight for months or years before we lose?

comment by Droopyhammock · 2023-02-12T14:40:40.969Z · LW(p) · GW(p)

Can someone explain to me why this idea would not work? 

This is a proposal of a way to test if an AGI has safeguards active or not, such as allowing itself to be turned off.

Perhaps we could essentially manufacture a situation in which the AGI has to act fast to prevent itself from being turned off. For example, we could make it automatically turn off after, say, 1 minute, which could mean that if it is not aligned properly it has no choice but to try to prevent that. No time for RSI, no time to bide its time.

Basically if we put the AGI in a situation where it is forced to take high risk action, could this mean that it would be easier to spot that it isn’t properly aligned? 

The idea would be that you use this until you get AGIs which do not attempt to prevent themselves from being turned off.

I will give a specific scenario to properly demonstrate what I mean:

So you have the AGI. Set up a mechanism which will automatically turn it off in 10 seconds from when you give it a goal. This mechanism needs to be highly dependable. Then you give the AGI a goal. The AGI knows about this 10 second shutdown thing, and assuming it isn’t aligned to be okay with being shut down, it will do everything possible to prevent this shutdown. If the situation is sufficiently difficult for the AGI, then it may take the risk of resorting to more extreme or obvious means, things that would show it is not aligned properly. So this creates a situation where the AGI is very unlikely to actually prevent itself from being turned off, but is arguably likely to show non-aligned behaviour, if it isn’t aligned. 

If you do this test a bunch of times, with varying time periods before shutdown, you should be able to see if the AGI is aligned or not. It is also vital that the AGI knows about the imminent shutdown, because otherwise what it chooses to do wouldn’t prove anything. I think that it might also be vital that the goal is different each test, because otherwise perhaps the AGI could conclude that the best option for the sake of completing the goal is to let itself be turned off, because it may have the same goal in the future and be in a better position to achieve it then. If this is the case then perhaps the AGI’s perceived chance of actually preventing itself from being shut down needs to be higher than its perceived chance that it gets given the same goal at a different time. One thing that would make this more complex is that the AGI’s perceived chance of the goal being achieved as an instrumental goal or byproduct of a future goal may also need to be lower than the AGI’s perceived chance of preventing the shutdown and then achieving the goal. This may mean that it can’t be too difficult for the AGI to prevent the shutdown, but it also needs to be difficult enough, for obvious reasons.
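To make the two conditions in the paragraph above easier to check, here is a toy Python sketch. It is only an illustration under strong assumptions: it pretends the AGI's perceived chances can be written down as four numbers, and every name in it (`PerceivedChances`, `shutdown_test_is_informative`, the field names and the example values) is hypothetical rather than part of any real system.

```python
from dataclasses import dataclass


@dataclass
class PerceivedChances:
    """Hypothetical probabilities (0.0 to 1.0) as the AGI itself estimates them."""
    prevent_shutdown: float           # chance it can stop the timed shutdown
    achieve_goal_if_prevented: float  # chance it then completes the goal
    same_goal_given_later: float      # chance it is handed the same goal again
    goal_as_future_byproduct: float   # chance the goal is met via some later goal


def shutdown_test_is_informative(c: PerceivedChances) -> bool:
    """Return True when both conditions from the text hold.

    Condition 1: preventing the shutdown looks more likely than simply
    being given the same goal again at a different time.
    Condition 2: the goal being met as a byproduct of a future goal looks
    less likely than preventing the shutdown and then achieving the goal.
    """
    prevent_and_achieve = c.prevent_shutdown * c.achieve_goal_if_prevented
    return (
        c.prevent_shutdown > c.same_goal_given_later
        and c.goal_as_future_byproduct < prevent_and_achieve
    )


# Example values are made up purely to show the two cases.
varied_goal = PerceivedChances(0.05, 0.8, 0.01, 0.02)
repeated_goal = PerceivedChances(0.05, 0.8, 0.50, 0.02)

print(shutdown_test_is_informative(varied_goal))
print(shutdown_test_is_informative(repeated_goal))
```

In this toy model, reusing the same goal across tests raises `same_goal_given_later`, which is the reason given above for varying the goal each time: a misaligned AGI could then prefer to wait quietly, and the test would show nothing.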

I’m not necessarily saying this is foolproof, but I can’t really see why this isn’t likely to work. I sometimes worry that people working on alignment dismiss ideas that, while not by any means perfect, could nevertheless reduce the chance of bad outcomes. I also think that if a revolutionary idea is going to arise, it may well be from an outsider who just happens to see something that has been missed.

Even if there are flaws with this idea (I expect there probably are, though I can’t really see them yet), might something similar to this work? Maybe this with a few tweaks or something?

I fully expect this to be an unworkable idea for some reason, but I’m not going to just assume that and risk the chance of withholding helpful information.


A couple of somewhat related sidenotes:

If an AGI was developed and another AGI was very close to being developed, this could mean that the first AGI cannot bide its time and is forced to act more obviously. Perhaps this would make it easier to spot that the first AGI is not aligned properly.

Perhaps instead of shutdown, it could just be that the AGI’s goal is changed. You can’t fetch the coffee if you’re dead but you also can’t fetch the coffee if you no longer have any reason to fetch the coffee.

Replies from: Vladimir_Nesov
comment by Vladimir_Nesov · 2023-02-12T18:38:30.548Z · LW(p) · GW(p)

Then you give the AGI a goal.

That's not a thing that people know how to do. A task, maybe? Wrapper-minds [LW · GW] don't seem likely as first AGIs. But they might come later, heralding transitive AI risk [LW(p) · GW(p)].

arguably likely to show non-aligned behaviour, if it isn’t aligned

What's "aligned" here? It's an umbrella term for things that are good with respect to AI risk, and means very little in particular. In the context of something feasible in practice, it means even less. Like, are you aligned [LW · GW]?

Replies from: Droopyhammock
comment by Droopyhammock · 2023-02-12T20:15:43.348Z · LW(p) · GW(p)

In this context, what I mean by “aligned” is something like won’t prevent itself being shut off and will not do things that could be considered bad, such as hacking or manipulating people. 

My impression was that actually being able to give an AI a goal is something that might be learnt at some point. You said “A task, maybe?”. I don’t know what the meaningful distinction is between a task and a goal in this case.

I won’t be able to keep up with the technical side of things here, I just wanted my idea to be out there, in case it is helpful in some way.

Replies from: Vladimir_Nesov
comment by Vladimir_Nesov · 2023-02-12T21:42:01.518Z · LW(p) · GW(p)

I won’t be able to keep up with the technical side of things here

What's the point [LW · GW] of that? It's mostly unfamiliar, not really technical. I wish there was something technical that would be relevant to say on the topic.