Where's the foom?

post by Fergus Fettes (fergus-fettes) · 2023-04-11T15:50:43.461Z · LW · GW · 27 comments

"The first catastrophe mechanism seriously considered seems to have been the possibility, raised in the 1940s at Los Alamos before the first atomic bomb tests, that fission or fusion bombs might ignite the atmosphere or oceans in an unstoppable chain reaction."[1]

This is not our first rodeo. We have done risk assessments before. The best reference-class examples I could find were the bomb, vacuum decay, killer strangelets, and LHC black holes (all covered in [1]).

I had been looking for a few days without completing my search, but I decided to publish this note now that Tyler Cowen is asking too: "Which is the leading attempt to publish a canonical paper on AGI risk, in a leading science journal, refereed of course. The paper should have a formal model or calibration of some sort, working toward the conclusion of showing that the relevant risk is actually fairly high. Is there any such thing?"

The three papers people replied with were: 
- Is Power-Seeking AI an Existential Risk?
- The Alignment Problem from a Deep Learning Perspective 
- Unsolved Problems in ML Safety

Places I was looking so far:
- The list of references for that paper[2]
- The references for the Muehlhauser and Salamon intelligence explosion paper[3]
- The Sandberg review of singularities[4] and related papers (these are quite close to passing muster, I think)

Places I wanted to look further:

- Papers by Yampolskiy, e.g.[5]
- Papers mentioned in there by Schmidhuber (haven't gotten around to this)
- I haven't thoroughly reviewed Intelligence Explosion Microeconomics, maybe this is the closest thing to fulfilling the criteria?

But if there is something concrete in e.g. some papers by Yampolskiy and Schmidhuber, why hasn't anyone fleshed it out in more detail?

For all the time people spend working on 'solutions' to the alignment problem, there still seems to be a serious lack of 'descriptions' of the alignment problem. Maybe the idea is, if you found the latter you would automatically have the former?

I feel like something built on top of Intelligence Explosion Microeconomics and the Orthogonality Thesis could be super useful and convincing to a lot of people. And I think people like TC are perfectly justified in questioning why it doesn't exist, for all the millions of words collectively written on this topic on LW etc.

I feel like a good simple model of this would be much more useful than another ten blog posts about the pros and cons of bombing data centers. This is the kind of thing that governments and lawyers and insurance firms can sink their teeth into.

Where's the foom?
 

Edit: Forgot to mention Clippy. Clippy is in many ways the most convincing of all the things I read while looking for this, and whenever I find myself getting skeptical of foom I read it again. Maybe a summary of the mechanisms described in there would be a step in the right direction?

  1. ^
  2. ^
  3. ^
  4. ^
  5. ^

27 comments

Comments sorted by top scores.

comment by Roman Leventov · 2023-04-12T07:40:52.725Z · LW(p) · GW(p)

Scientific-style risk assessments should be positioned in the frame of some science. For all the examples that you gave (atomic bomb, vacuum decay, etc.) that was obviously physics.

To formally assess and quantify the risks of misaligned AI, the calculations should be done within a frame of the science of intelligence (i.e., "abstract" cognitive science, not tied to the specifics of the human brain) and agency.

No such general science of intelligence exists at the moment. There are frameworks that aspire to describe intelligence, agency, and sentience in general, such as Active Inference (Friston et al., 2022), thermodynamic ML (Boyd et al., 2022), MCR^2 (Ma et al., 2022), and, perhaps, Vanchurin's "neural network toolbox for describing the world and agents within it". Of these, I'm somewhat familiar only with Active Inference and Vanchurin's work, and it seems to me that both are too general to serve as the basis for concrete risk calculations. I'm not familiar with the thermodynamic ML and MCR^2 work at all (a recurrent note: it would be very useful if the alignment community had people who are at least deeply familiar with and conversant in these frameworks).

Then, there are more concrete engineering proposals for A(G)I architectures, such as Transformer-based LLMs, RL agents of specific architectures, or LeCun's H-JEPA architecture [LW · GW]. Empirical and more general theoretical papers have been published about the risks of goal misgeneralisation for RL agents (I believe these works are referenced in Ngo et al.'s "The Alignment Problem from a Deep Learning Perspective"). I hope it is at least in part due to this (although he doesn't explicitly acknowledge it, as far as I know) that LeCun is sceptical about RL and advocates for minimising the role of RL in intelligence architectures.

Theoretical analysis of the risks of the H-JEPA architecture hasn't been done yet (let alone empirical analysis, because H-JEPA agents haven't been constructed yet); however, H-JEPA belongs to a wider class of intelligence architectures that employ energy-based modelling (EBM) of the world. EBM agents have been constructed and could already be analysed empirically for their alignment properties. It's worth noting that energy-based modelling is conceptually rather similar to Active Inference, and Active Inference agents have been constructed, too (Fountas et al., 2020), and thus could be probed empirically for their alignment properties as well.

LLMs are also tested empirically for alignment (e.g., OpenAI's evals, the Machiavelli benchmark). Theoretical analysis of alignment in future LLMs is probably bottlenecked by better theories of transformers and auto-regressive LLMs in general, such as Roberts, Yaida, and Hanin’s deep learning theory (2021), Anthropic’s mathematical framework for transformers (2021), Edelman et al.'s theory of "sparse variable creation" in transformers (2022), Marciano’s theory of DNNs as a semi-classical limit of topological quantum NNs (2022), Bahri et al.’s review of statistical mechanics of deep learning (2022), and other general theories of mechanistic interpretability and representation in DNNs (Räukur et al. 2022). See here [LW · GW] for a more detailed discussion.

The fundamental limitation of analysing the risks of any specific architecture is that once AGI starts to rewrite itself, it could probably easily depart from its original architecture. Then the question is whether we can organizationally, institutionally, and economically prevent AGIs from taking control of their own self-improvement. This question doesn't have a scientific answer, and it's impossible to rigorously put a probability on this event. I would only say that if AGI is available in open source, I think the probability of this will be approximately 100%: surely, some open-source hacker (or, more likely, dozens of them) will task AGI with improving itself, just out of curiosity or due to other motivations.

Likewise, even if AGI is developed strictly within tightly controlled and regulated labs, it's probably impossible to rigorously assess the probability that the labs will drift into failure (e.g., will accidentally release an unaligned version of AGI through AI), or will "leak" the AGI code or parameter weights. I suspect the majority of AI x-risk in the assessments of MIRI people and others is not even purely technical ("as if we had a perfect plan and executed it perfectly"), but from these kinds of execution, organisation resilience, theft, governmental or military coercion, and other forms of not strictly technical and therefore not formally assessable risk.

Replies from: AnthonyC, fergus-fettes
comment by AnthonyC · 2023-04-12T17:31:48.144Z · LW(p) · GW(p)

Scientific-style risk assessments should be positioned in the frame of some science. For all the examples that you gave (atomic bomb, vacuum decay, etc.) that was obviously physics...No such general science of intelligence exists at the moment.

This is a very good point. IDK how much less true this is becoming, but I agree that one reason it's hard to make a comprehensive quantitative case, rather than a scattered (though related) set of less rigorous arguments, is that there are so many unknown interactions. The physicists inventing atomic weapons knew they needed to be sure they wouldn't ignite the atmosphere, but at least they knew what it meant to calculate that. Biologists and epidemiologists may not be able to immediately tell how dangerous a new pathogen will be, but at least they have a lot of data and research on existing pathogens and past outbreaks to draw from.

Instead, we're left relying on more abstract forms of reasoning, which feel less convincing to most people and are much more open to disagreement about reference classes and background assumptions. 

 

Also, AFAICT, TC's base rate for AI being dangerous seems to be something analogous to, "Well, past technologies have been good on net, even dangerous ones, so [by some intuitive analog of Laplace's rule of succession] we should expect this to turn out fine." Whereas mine is more like, "Well, evolution of new hominins has never gone well for past hominins (or great apes) so why should this new increase in intelligence go better for us?" combined with, "Well, we've never yet been able to write software that does exactly what we want it to for anything complex, or been able to prevent other humans from misusing and breaking it, so why should this be different?"
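(For concreteness, here is a minimal sketch of the rule of succession being gestured at above -- the count of past technologies is invented purely for illustration, not a real survey.)

```python
# Laplace's rule of succession: after `successes` good outcomes in `trials`
# attempts, the estimated chance the next attempt also goes well.
# The "20 past transformative technologies" figure is made up for illustration.
def rule_of_succession(successes: int, trials: int) -> float:
    return (successes + 1) / (trials + 2)

print(rule_of_succession(20, 20))  # ~0.95: "past tech was fine, so this will be too"
print(rule_of_succession(0, 0))    # 0.5:  no relevant precedents at all
```

The second number is roughly where you land if, as above, you think there are zero relevant precedents for surviving a smarter successor species.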

Replies from: Roman Leventov, fergus-fettes
comment by Roman Leventov · 2023-04-13T05:04:52.129Z · LW(p) · GW(p)

Also, AFAICT, TC's base rate for AI being dangerous seems to be something analogous to, "Well, past technologies have been good on net, even dangerous ones, so [by some intuitive analog of Laplace's rule of succession] we should expect this to turn out fine." Whereas mine is more like, "Well, evolution of new hominins has never gone well for past hominins (or great apes) so why should this new increase in intelligence go better for us?" combined with, "Well, we've never yet been able to write software that does exactly what we want it to for anything complex, or been able to prevent other humans from misusing and breaking it, so why should this be different?"

Yup, this is the point of Scott Alexander's "MR Tries The Safe Uncertainty Fallacy".

BTW, what is "TC"?

Replies from: AnthonyC, fergus-fettes
comment by AnthonyC · 2023-04-13T13:47:37.195Z · LW(p) · GW(p)

Tyler Cowen. I only used initials because the OP did the same.

And yes, I read that post, and I've seen similar arguments a number of times, and not just recently. They're getting a lot sharper recently for obvious reasons, though.

comment by Fergus Fettes (fergus-fettes) · 2023-04-13T07:46:36.262Z · LW(p) · GW(p)

TC is Tyler Cowen.

I don't think the base rates are crazy-- the new evolution of hominins one is only wrong if you forget who 'you' is. TC and many other people are assuming that 'we' will be the 'you' that are evolving. (The worry among people here is that 'they' will have their own 'you'.)

And the second example, writing new software that breaks-- that is the same as making any new technology: we have done this before, and we were fine last time. Yes there were computer viruses, yes some people lost fingers in looms back in the day. But it was okay in the long run.

I think people arguing against these base rates need to do more work. The base rates are reasonable; it is the lack of updating that makes the difference. So let's help them update!

Replies from: Seth Herd
comment by Seth Herd · 2023-06-14T18:21:01.231Z · LW(p) · GW(p)

I think updating against these base rates is the critical thing.

But it's not really an update. The key difference between optimists and pessimists in this area is the recognition that there are no base rates for something like AGI. We have developed new technologies before, but we have never developed a new species before.

New forms of intelligence and agency are a completely new phenomenon. So if you wanted to ascribe a base rate to our surviving this with zero previous examples, you'd put it at .5. If you counted all of the previous hominid extinctions as relevant, you'd actually put the base rate much lower.

This really seems like the relevant comparison. Tools don't kill you, but strange creatures do. AGI will be a creature, not a tool.

comment by Fergus Fettes (fergus-fettes) · 2023-04-12T18:16:24.155Z · LW(p) · GW(p)

Instead, we're left relying on more abstract forms of reasoning

See, the frustrating thing is, I really don't think we are! There are loads of clear, concrete things that can be picked out and expanded upon. (See my sibling comment also.)

Replies from: AnthonyC
comment by AnthonyC · 2023-04-12T18:59:55.454Z · LW(p) · GW(p)

Honestly not sure if I agree or not, but even if true, it's very hard to convince most people even with lots of real-world examples and data. Just ask anyone with an interest in the comparative quantitative risk assessment of different electricity sources, or ways of handling waste, and then ask them about the process of getting that permitted and built. And really, could you imagine if we subjected AI labs even to just 10% of the regulation we put in the way of letting people add a bedroom or bathroom to their houses?

Also, it doesn't take a whole lot of abstraction to be more abstract than the physics examples I was responding to, and even then I don't think we had nearly as much concrete data as we probably should have about the atmospheric nitrogen question. (Note that the H-bomb developers also did the math that made them think lithium-7 wouldn't contribute to yield, and were wrong. Not nearly as high stakes, so maybe they weren't as careful? But still disconcerting.)

comment by Fergus Fettes (fergus-fettes) · 2023-04-12T18:11:20.088Z · LW(p) · GW(p)

Thanks very much for this thorough response!

One thing though-- in contrast to the other reply, I'm not so convinced by the problem that 

No such general science of intelligence exists at the moment.

This would be like the folks at Los Alamos saying 'well, we need to model the socioeconomic impacts of the bomb, plus we don't even know what happens to a human subjected to such high pressures and temperatures, so we need a medical model and a biological model' etc. etc.

They didn't have a complete science of socioeconomics. Similarly, we don't have a complete science of intelligence. But I think we should be able to put together a model of some core step of the process (maybe within the realm of physics as you suggest) that can be brought to a discussion.

But thanks again for all the pointers, I will follow some of these threads.

comment by Max H (Maxc) · 2023-04-11T18:04:46.329Z · LW(p) · GW(p)

AGI Safety from first principles [? · GW] doesn't meet all of Tyler's requirements and desiderata, but I think it's a good introduction for a technical but skeptical / uninitiated audience. 

... in a leading science journal, refereed of course.

Being published and upvoted / endorsed by peers on the Alignment Forum should arguably count as this. Highly-upvoted content on the Alignment Forum is often higher in quality than papers published in even the best traditional academic journals. 

I think the question of where and how to conduct science and review is a separate question from the question of why to care about AI risk though. I am happy to read content published in any venue, though I might be hesitant to spend much time diving into any particular piece, unless it has been endorsed by someone I trust on Twitter, LW, AF, etc. Publication in an academic journal is just a different kind of endorsement, one that, in my experience, is a weaker and less reliable indicator of quality than the others.

Replies from: fergus-fettes
comment by Fergus Fettes (fergus-fettes) · 2023-04-11T18:26:44.574Z · LW(p) · GW(p)

That's a good paper, but I think it exemplifies the problem outlined by Cowen-- it mostly contains references to Bostrom and Yudkowsky and doesn't really touch on the more technical stuff (Yampolskiy, Schmidhuber) that exists, which makes me think that it isn't a very thorough review of the field. It seems like more of the same. Maybe the Hubinger paper referenced therein is on the right track?

The question of where to do science is relevant but not important-- Cowen even mentions that 'if it doesn't get published, just post it online'-- he is not against reading forums.

It really looks like there could be enough stuff out there to make a model. Which makes me think the scepticism is even more justified! Because if it looks like a duck and talks like a duck but doesn't float like a duck, maybe it's a lump of stone?

Replies from: dr_s
comment by dr_s · 2023-04-11T21:12:42.510Z · LW(p) · GW(p)

What would you consider as possible angles of attack to the problem? A few points to address that come to mind:

  1. feasibility of AGI itself. Honestly may be the hardest thing to pinpoint;

  2. feasibility of AGI performing self-improvement. This might be more interesting but only focusing on a specific paradigm. I think there might be a decent case for suggesting the LLM paradigm, even in an agentic loop and equipped with tools, eventually stagnates and never goes anywhere in terms of true creativity. But that's just telling capability researchers to invent something else;

  3. ability to escape. Some kind of analysis of e.g. how much agency can an AI exert on the physical world, what's the fastest path to having a beachhead (such as, what would be the costs of it having a robot built for itself, even assuming it was super smart and designed better robotics than ours? Would it be realistic for it to go entirely unnoticed?)

  4. more general game theory/modelling about instrumental convergence and power seeking being optimal. I can think of experiments with game worlds and AIs of some sort set into it to try and achieve competing goals, or even some kind of purely mathematical model. This seems pretty trivial though, and I'm not sure people wouldn't just say "ok but it applies to your toy model, not the real world"

  5. more general information theoretical arguments about the limits of intelligence, or lack thereof, in controlling a chaotic system? What are the Lyapunov exponents of real world systems? Seems like this would affect the diminishing returns of intelligence in controlling the world.
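On point 5, here is a toy sketch of what a Lyapunov exponent measures, using the textbook logistic-map example rather than any claim about a real-world system:

```python
import math

# Estimate the Lyapunov exponent of the logistic map x -> r*x*(1-x) at r = 4,
# where the known value is ln(2) ~= 0.693. Purely a textbook toy example.
r, x, n = 4.0, 0.2, 100_000
total = 0.0
for _ in range(n):
    total += math.log(abs(r * (1 - 2 * x)))  # log |derivative of the map at x|
    x = r * x * (1 - x)

print(total / n)  # ~0.69: nearby trajectories separate by a factor of ~2 per step
```

A positive exponent means prediction error grows exponentially, so the horizon over which any controller can forecast the system grows only logarithmically with its measurement precision -- one way of cashing out "diminishing returns of intelligence" here.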

Replies from: fergus-fettes
comment by Fergus Fettes (fergus-fettes) · 2023-04-11T21:58:50.599Z · LW(p) · GW(p)

With what little I know now I think 2 would be most clear to people. However I appreciate that that might contribute to capabilities, so maybe exfohazard.

4 is definitely interesting, and I think there are actually a few significant papers about instrumental convergence. More of those would be good, but I don't think that gets to the heart of the matter w.r.t a simple model to aid communication.

5. I would love some more information theory stuff, drilling into how much information is communicated to eg. a model relative to how much is contained in the world. This could at the very least put some bounds on orthogonality (if 'alignment' is seen in terms of 'preserving information'). I feel like this could be a productive avenue, but I personally worry it's above my pay grade (I did an MSc in Experimental Physics but it's getting rustier by the day).

 

Now that I think about it, maybe 1 and 3 would also contribute to a 'package' if this was seen as nothing but an attempt at didactics. But maybe including every step of the way complicates things too much; ideally there would be a core idea that could get most of the message across on its own. I think Orthogonality does this for a lot of people on LW, and maybe just a straightforward explainer of that with some information-theory sugar would be enough.

Replies from: dr_s
comment by dr_s · 2023-04-12T09:16:21.064Z · LW(p) · GW(p)

I was thinking more that the question here was also about more rigorous and less qualitative papers supporting the thesis, rather than just explanations for laypeople. One of the most common arguments against AI safety is that it's unscientific because it doesn't have rigorous theoretical support. I'm not super satisfied with that criticism (I feel like the general outlines are clear enough, and I don't think you can really make up some quantitative framework to predict, e.g., which fraction of goals in the total possible goal-space benefit from power-seeking and self-preservation, so in the end you still have to go with the qualitative argument and your feel for how much it applies to reality), but I think if it has to be allayed, it should be by something that targets specific links in the causal chain of Doom. An important side bonus: formalizing and investigating these problems might actually reveal interesting potential alignment ideas.

I'll have to read those papers you linked, but to me in general it feels like perhaps the topic more amenable to this sort of treatment is indeed Instrumental Convergence. The Orthogonality Thesis feels to me more of a philosophical statement, and indeed we've had someone arguing for moral realism here just days ago. I don't think you can really prove it or not from where we are. But I think if you phrased it as "being smart does not make you automatically good" you'd find that most people agree with you - especially people of the persuasion that right now regards AI safety and TESCREAL people as they dubbed us with most suspicion. Orthogonality is essentially moral relativism!

Now if we're talking about a more outreach-oriented discussion, then I think all the concepts can be explained pretty clearly. I'd also recommend using analogies to e.g. invasive species in new habitats, or the evils of colonialism, to stress why and how it's both dangerous and unethical to unleash on the world things that are more capable than us and driven by too simple and greedy a goal; insist on the fact that what makes us special is the richness and complexity of our values, and that our highest values are the ones that most prevent us from simply going on a power-seeking rampage. That makes the notion of the first AGI being dangerous pretty clear: if you focus only on making them smart but you slack off on making them good, the latter part will be pretty rudimentary, and so you're creating something that is like a colony of intelligent bacteria.

comment by dr_s · 2023-04-11T18:41:37.170Z · LW(p) · GW(p)

This definitely sounds like something we'd need, yes. Just bringing this up to say that if anyone's interested in this sort of work, I'm game to help! My expertise is in computational physics and software engineering, so I would be most help with e.g. modelling or even working with basic ML toy systems. PM for more details if you have any ideas that could use this kind of expertise.

comment by JenniferRM · 2023-04-11T18:02:47.315Z · LW(p) · GW(p)

This is a useful video because it is about the limits and possibilities of explanation (because it is by an expert on ZKPs), while also (because of the format of the entire youtube series) nicely handling the issue that different audiences need different "levels of explanation":

I consider some version of "foom" to just be obviously true to all experts who care to study it for long enough, and learn the needed background concepts. 

Things like instrumental convergence, and the orthogonality thesis are relevant here, but also probably some security mindset.

Like, part of my concern arises from the fact that I'm aware of lots of critical systems for human happiness that are held together with baling wire and full of zero-days, and the only reason they haven't been knocked over yet is that no one smart is stupid-or-evil enough to choose to knock them over and then robustly act on that choice.

At the same time, I don't believe that it is wise to spell out all the zero-days in the critical infrastructure of human civilization.

There is an ongoing debate within the infosec community about proper norms for "responsible disclosure" where "what does 'responsible' even mean?" is somewhat contested.

Like, from the perspective of many profit-seeking system maintainers who want to build insecure systems and make money from deploying them, they would rather there just be no work at all by anyone to hack any systems; maybe all such work is irresponsible, and if the work produces exploits then maybe those should just be censored, and if those people want to publish then instead they should give the exploits, for zero pay, to the people whose systems were buggy.

But then security researchers get no pay, so that's probably unfair as a compensation model. 

What happens in practice is (arguably) that some organizations create "bug bounty systems"... and then all the other organizations that don't have bug bounty systems get hacked over and over and over and over?

But like... if you can solve all the problems for the publication of a relatively simple and seemingly-low-stakes version of "exploits that can root a linux kernel" then by all means, please explain how that should work and let's start doing that kind of thing all over the rest of society too (including for exploits that become possible with AGI).

A key issue here, arguably, is that if a security measure needs democratic buy-in and the median voter can't understand "a safe version of the explanation to spread around in public" then... there's gonna be a bad time.

Options seem logically to include:

1) Just publish lots and lots and lots of concrete vivid workable recipes that could be used by any random idiot to burn the world down from inside any country on the planet using tools available at the local computer shop and free software they can download off the internet.... and keep doing this over and over... until the GLOBAL median voter sees why there should be some GLOBAL government coordination on GLOBAL fire-suppression systems, and then there is a vote, and then the vote is counted, and then a wise ruler is selected as the delegate of the median voter... and then that ruler tries to do some coherent thing that would prevent literally all the published "world burning plans" from succeeding.

2) Give up on legible democracy and/or distinct nation-states and/or many other cherished institutions.

3) Lose.

4) Something creative! <3

Personally, I'm still working off of "option 4" for now, which is something like "encourage global elite culture to courageously embrace their duties and just do the right thing whether or not they will keep office after having done the right thing".

However, as option 3 rises higher and higher in my probability estimate I'm going to start looking at options 1 and 2 pretty closely.

Also, it is reasonably clear to me that option 2 is a shorter abstract super category that includes option 1, because option 1 also sounds like "not how many cherished institutions currently work"!

So probably what I'll do is look at option 2, and examine lots of different constraints that might be relaxed, and lots of consequences, and try to pick the least scary "cherished institutional constraints" to give up on, and this will probably NOT include giving up on my norm against publishing lots and lots of world-burning recipes because "maybe that will convince the fools!"

Publishing lots of world-burning recipes still seems quite foolish to me, in comparison to something normal and reasonable.

Like I personally think it would be wiser to swiftly bootstrap a new "United People's House of Representative Parliament" that treats the existing "United Nation-State Delegates" roughly as if it were a global "Senate" or global "House of Lords" or some such?

Then the "United People" could write legislation using their own internal processes and offer the "United Nations" the right to veto the laws in a single up/down vote?

And I think this would be LESS dangerous than just publishing all the world-burning plans? (But also this is a way to solve ONE of several distinct barriers to reaching a Win Condition (the part of the problem where global coordination institutions don't actually exist at all) and so this is also probably an inadequate plan all by itself.)

Category 2 is a big category. It is full of plans that would cause lots and lots of people to complain about how the proposals are "too weird", or sacrifice too many of their "cherished institutions", or many other complaints.

In summary: this proposal feels like you're personally asking to be "convinced in public using means that third parties can watch, so that third parties will grant that it isn't your personal fault for believing something at variance with the herd's beliefs" and not like your honest private assessment of the real situation is bleak. These are different things.

My private assessment is that talking with anyone for several days face-to-face about all of this, provided they even understand what Aumann Agreement is (such as to be able to notice in a verbal, formal way when their public performance of "belief" is violating core standards of a minimally tolerable epistemology), seems like a high cost to pay... but one I would gladly pay three or four or five times if me paying those costs was enough to save the world. But in practice, I would have to put a halt to any pretense of having a day job for maybe 6 weeks to do that, and that hit to my family budget is not wise.

Also: I believe that no humans exist who offer that degree of leverage over the benevolent competent rulership of the world, and so it is not worth it, yet, to me, personally, to pay such costs-of-persuasion with... anyone?

Also: the plausible candidates wouldn't listen to me, because they are currently too busy with things they (wrongly, in my opinion) think are more important to work on than AI stuff.

Also: if you could get the list of people, and plan the conversations, there are surely better people than me to be in those conversations, because the leaders will make decisions based off of lots of non-verbal factors, and I am neither male, nor tall, nor <other useful properties> and so normal-default-human leaders will not listen to me very efficiently.

Replies from: fergus-fettes
comment by Fergus Fettes (fergus-fettes) · 2023-04-11T18:38:40.259Z · LW(p) · GW(p)

In summary: this proposal feels like you're personally asking to be "convinced in public using means that third parties can watch, so that third parties will grant that it isn't your personal fault for believing something at variance with the herd's beliefs" and not like your honest private assessment of the real situation is bleak. These are different things.

Well, that's very unfortunate, because that was very much not what I was hoping for.

I'm hoping to convince someone somewhere that proposing a concrete model of foom will be useful to help think about policy proposals and steer public discourse. I don't think such a model has to be exfohazardous at all (see for example the list of technical approaches to the singularity, in the paper I linked-- they are good and quite convincing, and not at all exfohazardous)!

Replies from: JenniferRM
comment by JenniferRM · 2023-04-11T19:28:45.083Z · LW(p) · GW(p)

Can you say more about the "will be useful to help think about policy proposals and steer public discourse" step?

A new hypothesis is that maybe you want a way to convince OTHER people, in public, via methods that will give THEM plausibility deniability about having to understand or know things based on their own direct assessment of what might or might not be true.

Replies from: fergus-fettes
comment by Fergus Fettes (fergus-fettes) · 2023-04-11T19:54:00.996Z · LW(p) · GW(p)

give THEM plausibility deniability about having to understand or know things based on their own direct assessment

I don't follow what you are getting at here.

I'm just thinking about historical cases of catastrophic risk, and what was done. One thing that was done was that the government paid very clever people to put together models of what might happen.

My feeling is that the discussion around AI risk is stuck in an inadequate equilibrium, where everyone on the inside thinks it's obvious but people on the outside don't grok it. I'm trying to think of the minimum possible intervention to bridge that gap, something very, very different from your 'talk ... for several days face-to-face about all of this'. As you mentioned, this is not scalable.

Replies from: JenniferRM
comment by JenniferRM · 2023-04-13T20:36:35.548Z · LW(p) · GW(p)

On a simple level, all exponential explosions work on the same principle, which is that there's some core resource, and in each unit of time, the resource is roughly linearly usable to cause more of the resource to exist and be similarly usable.

Neutrons in radioactive material above a certain density cause more neutrons, and so on to "explosion".

Prions in living organisms catalyze more prions, which catalyze more prions, and so on until the body becomes "spongiform".

Oxytocin causes uterine contractions, and uterine contractions are rigged to release more oxytocin, and so on until "the baby comes out".

(Not all exponential processes are bad, just most. It is an idiom rarely used by biology, and when biology uses the idiom it tends to be used to cause phase transitions where humble beginnings lead to large outcomes.)

"Agentic effectiveness" that loops on itself to cause more agentic effectiveness can work the same way. The inner loop uses optimization power to get more optimization power. Spelling out detailed ways to use optimization power to get more optimization power is the part where it feels like talking about zero-days to me?

Maybe it's just that quite a few people literally don't know how exponential processes work? That part does seem safe to talk about, and if it isn't safe then the horse is out of the barn anyway. Also, if there was a gap in such knowledge it might explain why they don't seem to understand this issue, and it would also explain why many of the same people handled covid so poorly.

Do you have a cleaner model of the shape of the ignorance that is causing the current policy failure?

comment by Tamsin Leake (carado-1) · 2023-04-11T16:44:58.322Z · LW(p) · GW(p)

gwern's clippy story does a reasonable amount of the work, i think. to me, part of why convincing people of AI doom is something that often needs to be done interactively rather than having a comprehensive explanation, is that people get stuck in a variety of different ways rather than always in the same way. that and numerous cognitive biases and stuff.

to me the argument is pretty straightforward:

i have a lot more ideas about how foom will happen but alas sharing them would be a capability exfohazard.

Replies from: fergus-fettes
comment by Fergus Fettes (fergus-fettes) · 2023-04-11T17:14:21.088Z · LW(p) · GW(p)

needs to be done interactively ... people get stuck in a variety of different ways


I think the previous examples of large-scale risk I mentioned are a clear counterexample-- if you have at least one part of the scenario clearly modeled, people have something concrete to latch on to.

You also link somewhere to a piece that talks about the nuclear discontinuity and hints at an intelligence discontinuity-- but I also went searching for evidence of a discontinuity in cognition and didn't find one. You would expect cognitive scientists to have found this by now.

Hard to find 'counter references' for a lack of something, this is the best I can do:

"Thus, standard (3rd-person) investigations of this process leave open the ancient question as to whether specific upgrades to cognition induce truly discontinuous jumps in consciousness. The TAME framework is not incompatible with novel discoveries about sharp phase transitions, but it takes the null hypothesis to be continuity, and it remains to be seen whether contrary evidence for truly sharp upgrades in consciousness can be provided." TAME, Levin


Do you have a post regarding your decision calculus re the foom exfohazard?

Because it seems to me the yolo-brigade are a lot better at thinking up foom mechanisms than the folks in DC. So by holding information back you are just keeping it from people who might actually need it (politicians, who can't think of it for themselves), while making no difference to people who might use it (who can come up with plenty of capabilities themselves).

But maybe you have gone over this line of reasoning somewhere..

comment by 1a3orn · 2023-04-11T21:24:13.589Z · LW(p) · GW(p)

There are a lot of places which somewhat argue for FOOM -- i.e., very fast intelligence growth in the future, probably not preceded by smooth growth -- but they tend to be deeply out of date ( Yud-Hanson Debate and Intelligence Explosion Microeconomics ) or really cursory (Yud's paragraph in List of Lethalities [LW · GW] ) or a dialogue between two people being confused at each other (Christiano / Yud Discussion [? · GW] ).

I think the last one is probably the best as an overview, but none of them provide a great overview. Here's Christiano's blog post on the topic, which was written in 2018, so if its predictions hold up then that's evidence for it. (But it is very much not in favor of FOOM... although you really have to read it to see what that actually means.)

Replies from: fergus-fettes
comment by Fergus Fettes (fergus-fettes) · 2023-04-11T22:26:44.270Z · LW(p) · GW(p)

Yeah, unfortunately 'somewhat argue for foom' is exactly what I'm not looking for; rather, I want a simple and concrete model that can aid communication with people who don't have time to read the 700-page Hanson-Yudkowsky debate. (Which I did read, for the record.)

Replies from: quintin-pope
comment by Quintin Pope (quintin-pope) · 2023-04-12T01:20:32.306Z · LW(p) · GW(p)

If that's what you're interested in, I'd suggest: What a compute-centric framework says about AI takeoff speeds - draft report [LW · GW]

Replies from: fergus-fettes
comment by Fergus Fettes (fergus-fettes) · 2023-04-12T08:11:25.202Z · LW(p) · GW(p)

This is the closest thing yet! Thank you. Maybe that is it.

comment by Christopher King (christopher-king) · 2023-04-11T18:34:30.429Z · LW(p) · GW(p)

The paper should have a formal model or calibration of some sort, working toward the conclusion of showing that the relevant risk is actually fairly high.

I doubt anyone has created this model yet, let alone published it in a journal. I guess that should be a first step 🤔. It seems very hard, since AGI would touch upon every domain.

The closest thing I can think of are the economic models based on hyperbolic growth. They predict a singularity, but don't specify what will cause it.
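To spell out what "hyperbolic" means there (a minimal sketch with arbitrary constants, not any published model's calibration): if the growth rate rises with the level itself, e.g. dy/dt = k*y^2, the solution reaches infinity at a finite time, which an exponential never does.

```python
# Closed-form solution of dy/dt = k*y^2: y(t) = y0 / (1 - k*y0*t),
# which diverges at the finite time t = 1/(k*y0). Constants are arbitrary.
def hyperbolic(y0: float, k: float, t: float) -> float:
    return y0 / (1 - k * y0 * t)  # only valid for t < 1/(k*y0)

y0, k = 1.0, 0.01  # singularity at t = 1/(k*y0) = 100
for t in [0, 50, 90, 99, 99.9]:
    print(t, hyperbolic(y0, k, t))
# Stays modest for most of the run, then diverges as t -> 100.
```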