Posts
Comments
Fair question. You can assume it is AoE.
Research leads are not going to be too picky in terms of what hour you send the application in,
There is no need to worry about the exact deadline. Even if you send in your application on the next day, that probably won't significantly impact your chances of getting picked up by your desired project(s).
Sooner is better, since many research leads will begin composing their teams after the 17th, but there is no hard cut-off point.
Thanks! These are thoughtful points. See some clarifications below:
AGI could be very catastrophic even when it stops existing a year later.
You're right. I'm not even covering all the other bad stuff that could happen in the short-term, that we might still be able to prevent, like AGI triggering global nuclear war.
What I'm referring to is unpreventable convergence on extinction.
If AGI makes earth uninhabitable in a trillion years, that could be a good outcome nonetheless.
Agreed that could be a good outcome if it could be attainable.
In practice, the convergence reasoning is about total human extinction happening within 500 years after 'AGI' has been introduced into the environment (with very very little probability remainder above that).
In theory of course, to converge toward 100% chance, you are reasoning about going across a timeline of potentially infinite span.
I don't know whether that covers "humans can survive on mars with a space-suit",
Yes, it does cover that. Whatever technological means we could think of shielding ourselves, or 'AGI' could come up with to create as (temporary) barriers against the human-toxic landscape it creates, still would not be enough.
if humans evolve/change to handle situations that they currently do not survive under
Unfortunately, this is not workable. The mismatch between the (expanding) set of conditions needed for maintaining/increasing configurations of the AGI artificial hardware and for our human organic wetware is too great.
Also, if you try entirely changing our underlying substrate to the artificial substrate, you've basically removed the human and are left with 'AGI'. The lossy scans of human brains ported onto hardware would no longer feel as 'humans' can feel, and will be further changed/selected for to fit with their artificial substrate. This is because what humans and feel and express as emotions is grounded in the distributed and locally context-dependent functioning of organic molecules (eg. hormones) in our body.
Update: reverting my forecast back to 80% chance likelihood for these reasons.
I'm also feeling less "optimistic" about an AI crash given:
- The election result involving a bunch of tech investors and execs pushing for influence through Trump's campaign (with a stated intention to deregulate tech).
- A military veteran saying that the military could be holding up the AI industry like "Atlas holding the globe", and an AI PhD saying that hyperscaled data centers, deep learning, etc, could be super useful for war.
I will revise my previous forecast back to 80%+ chance.
Yes, I agree formalisation is needed. See comment by flandry39 in this thread on how one might go about doing so.
Worth considering is that there are actually two aspects that make it hard to define the term ‘alignment’ such to allow for sufficiently rigorous reasoning:
- It must allow for logically valid reasoning (therefore requiring formalisation).
- It must allow for empirically sound reasoning (ie. the premises correspond with how the world works).
In my reply above, I did not help you much with (1.). Though even while still using the English language, I managed to restate a vague notion of alignment in more precise terms.
Notice how it does help to define the correspondences with how the world works (2.):
- “That ‘AGI’ continuing to exist, in some modified form, does not result eventually in changes to world conditions/contexts that fall outside the ranges that existing humans could survive under.”
The reason why 2. is important is that just formalisation is not enough. Just describing and/or deriving logical relations between mathematical objects does not say something about the physical world. Somewhere in your fully communicated definition there also needs to be a description of how the mathematical objects correspond with real-world phenonema. Often, mathematicians do this by talking to collaborators about what symbols mean while they scribble the symbols out on eg. a whiteboard.
But whatever way you do it, you need to communicate how the definition corresponds to things happening in the real world, in order to show that it is a rigorous definition. Otherwise, others could still critique you that the formally precise definition is not rigorous, because it does not adequately (or explicitly) represent the real-world problem.
For an overview of why such a guarantee would turn out impossible, suggest taking a look at Will Petillo's post Lenses of Control.
Defining alignment (sufficiently rigorous so that a formal proof of (im)possibility of alignment is conceivable) is a hard thing!
It's less hard than you think, if you use a minimal-threshold definition of alignment:
That "AGI" continuing to exist, in some modified form, does not result eventually in changes to world conditions/contexts that fall outside the ranges that existing humans could survive under.
Yes, I think there is a more general proof available. This proof form would combine limits to predictability and so on, with a lethal dynamic that falls outside those limits.
The question is more if it can ever be truly proved at all, or if it doesn't turn out to be an undecidable problem.
Control limits can show that it is an undecidable problem.
A limited scope of control can in turn be used to prove that a dynamic convergent on human-lethality is uncontrollable. That would be a basis for an impossibility proof by contradiction (cannot control AGI effects to stay in line with human safety).
Awesome directions. I want to bump this up.
This might include AGI predicting its own future behaviour, which is kind of essential for it to stick to a reliably aligned course of action.
There is a simple way of representing this problem that already shows the limitations.
Assume that AGI continues to learn new code from observations (inputs from the world) – since learning is what allows the AGI to stay autonomous and adaptable in acting across changing domains of the world.
Then in order for AGI code to be run to make predictions about relevant functioning of its future code:
- Current code has to predict what future code will be learned from future unknown inputs (there would be no point in learning then if the inputs were predictable and known ahead of time).
- Also, current code has to predict how the future code will compute subsequent unknown inputs into outputs, presumably using some shortcut algorithm that can infer relevant behavioural properties across the span of possible computationally-complex code.
- Further, current code would have to predict how the outputs would result in relevant outside effects (where relevant to sticking to a reliably human-aligned course of action)
- Where it is relevant how some of those effects could feed back into sensor inputs (and therefore could cause drifts in the learned code and the functioning of that code).
- Where other potential destabilising feedback loops are also relevant, particularly that of evolutionary selection.
Just found your insightful comment. I've been thinking about this for three years. Some thoughts expanding on your ideas:
my idea is more about whether alignment could require that the AGI is able to predict its own results and effects on the world (or the results and effects of other AGIs like it, as well as humans)...
In other words, alignment requires sufficient control. Specifically, it requires AGI to have a control system with enough capacity to detect, model, simulate, evaluate, and correct outside effects propagated by the AGI's own components.
... and that proved generally impossible such that even an aligned AGI can only exist in an unstable equilibrium state in which there exist situations in which it will become unrecoverably misaligned, and we just don't know which.
For example, what if AGI is in some kind of convergence basin where the changing situations/conditions tend to converge outside the ranges humans can survive under?
so we can assume that they will have to be somehow interpreted by the AGI itself who is supposed to hold them
There's a problem you are pointing of somehow mapping the various preferences – expressed over time by diverse humans from within their (perceived) contexts – onto reference values. This involves making (irreconcilable) normative assumptions of how to map the dimensionality of the raw expressions of preferences onto internal reference values. Basically, you're dealing with NP-complex combinatorics such as encountered with the knapsack problem.
Further, it raises the question of how to make comparisons across all the possible concrete outside effects of the machinery against the internal reference values, such to identify misalignments/errors to correct. Ie. just internalising and holding abstract values is not enough – there would have to be some robust implementation process that translates the values into concrete effects.
No actually, assuming the machinery has a hard substrate and is self-maintaining is enough.
we could create aligned ASI by simulating the most intelligent and moral people
This is not an existence proof, because it does not take into account the difference in physical substrates.
Artificial General Intelligence would be artificial, by definition. In fact, what allows for the standardisation of hardware components is the fact that the (silicon) substrate is hard under human living temperatures and pressures. That allows for configurations to stay compartmentalised and stable.
Human “wetware” has a very different substrate. It’s a soup of bouncing organic molecules constantly reacting under living temperatures and pressures
Just found a podcast on OpenAI’s bad financial situation.
It’s hosted by someone in AI Safety (Jacob Haimes) and an AI post-doc (Igor Krawzcuk).
https://kairos.fm/posts/muckraiker-episodes/muckraiker-episode-004/
Noticing no response here after we addressed superficial critiques and moved to discussing the actual argument.
For those few interested in questions raised above, Forrest wrote some responses: http://69.27.64.19/ai_alignment_1/d_241016_recap_gen.html
The claims made will feel unfamiliar and the reasoning paths too. I suggest (again) taking the time to consider what is meant. If a conclusion looks intuitively wrong from some AI Safety perspective, it may be valuable to explicitly consider the argumentation and premises behind that.
BTW if anyone does want to get into the argument, Will Petillo’s Lenses of Control post is a good entry point.
It’s concise and correct – a difficult combination to achieve here.
Resonating with you here! Yes, I think autonomous corporations (and other organisations) would result in society-wide extraction, destabilisation and totalitarianism.
Sam Altman demonstrating what kind of actions you can get away with in front of everyone's eyes seems problematic.
Very much agreeing with this.
Appreciating your inquisitive question!
One way to think about it:
For OpenAI to scale more toward “AGI”, the corporation needs more data, more automatable work, more profitable uses for working machines, and more hardware to run those machines.
If you look at how OpenAI has been increasing those four variables, you can notice that there are harms associated with each. This tends to result in increasing harms.
One obvious example: if they increase hardware, this also increases pollution (from mining, producing, installing, and running the hardware).
Note that the above is not a claim that the harms outweigh the benefits. But if OpenAI & co continue down their current trajectory, I expect that most communities would look back and say that the harms to what they care about in their lives were not worth it.
I wrote a guide to broader AI harms meant to emotionally resonate for laypeople here.
Let me rephrase that sentence to ‘industry expenditures in deep learning’.
what signals you send to OAI execs seems not relevant.
Right, I don’t occupy myself much with what the execs think. I do worry about stretching the “Overton window” for concerned/influential stakeholders broadly. Like, if no-one (not even AI Safety folk) acts to prevent OpenAI from continuing to violate its charter, then everyone kinda gets used to it being this way and maybe assumes it can’t be helped or is actually okay.
i don't see why this would lead them to downsize, if "the gap between industry investment in deep learning and actual revenue has ballooned to over $600 billion a year"
Note that with ‘investments’, I meant injections of funds to cover business capital expenditures in general, including just to keep running their models. My phrasing here is a little confusing, but couldn’t find another concise way to put it yet.
The reason why OpenAI and other large-AI-model companies would cease to gain investments, is similar to why dotcom companies ceased to gain investments (even though a few like Amazon went on to be trillion-dollar companies). Because investors become skeptical about their prospect of the companies reaching break even and about whether they would still be able to offload their stake later (to even more investors willing to sink in their capital).
Donation opportunities for restricting AI companies
- Pause AI: protests and lobbying
- Stop AI: is barricading OpenAI
- For Humanity: a podcast on risks.
- FoxGlove: a legal non-profit targeting big tech scaling.
- Disruption Network Institute: whistleblowers and researchers reporting on the AI military misuses.
- Distributed AI Research Institute: AI ethics researchers giving general AI advocates hell and advocating for specialised models serving communities.
- European Guild for AI Regulation: lobbying in the EU against the data scraping that makes large models possible.
- Concept Art Association: lobbying in the EU against the data scraping that makes large models possible.
In my pipeline:
- funding a 'horror documentary' against AI by an award-winning documentary maker (got a speculation grant of $50k)
- funding lawyers in the EU for some high-profile lawsuits and targeted consultations with EU AI Office.
If you're a donor, I can give you details on their current activities. I worked with staff in each of these organisations. DM me.
When you say failures will "build up toward lethality at some unknown rate", why would failures build up toward lethality? We have lots of automated systems e.g. semiconductor factories, and failures do not accumulate until everyone at the factory dies, because humans and automated systems can notice errors and correct them.
Let's take your example of semiconductor factories.
There are several ways to think about failures here. For one, we can talk about local failures in the production of the semiconductor chips. These especially will get corrected for.
A less common way to talk about factory failures is when workers working in the factories die or are physically incapacitated as a result, eg. because of chemical leaks or some robot hitting them. Usually when this happens, the factories can keep operating and existing. Just replace the expendable workers with new workers.
Of course, if too many workers die, other workers will decide to not work at those factories. Running the factories has to not be too damaging to the health of the internal human workers, in any of the many (indirect) that ways operations could turn out to be damaging.
The same goes for humans contributing to the surrounding infrastructure needed to maintain the existence of these sophisticated factories – all the building construction, all the machine parts, all the raw materials, all the needed energy supplies, and so on. If you try overseeing the relevant upstream and downstream transactions, it turns out that a non-tiny portion of the entire human economy is supporting the existence of these semiconductor factories one way or another. It took a modern industrial cross-continental economy to even make eg. TSMC's factories viable.
The human economy acts as a forcing function constraining what semiconductor factories can be. There are many, many ways to incapacitate complex multi-celled cooperative organisms like us. So the semiconductor factories that humans are maintaining today ended up being constrained to those that for the most part do not trigger those pathways downstream.
Some of that is because humans went through the effort of noticing errors explicitly and then correcting them, or designing automated systems to do likewise. But the invisible hand of the market considered broadly – as constituting of humans with skin in the game, making often intuitive choices – will actually just force semiconductor factories to be not too damaging to surrounding humans maintaining the needed infrastructure.
With AGI, you lose that forcing function.
Let's take AGI to be machinery that is autonomous enough to at least automate all the jobs needed to maintain its own existence. Then AGI is no longer dependent on an economy of working humans to maintain its own existence. AGI would be displacing the human economy – as a hypothetical example, AGI is what you'd get if those semiconductor factories producing microchips expanded to producing servers and robots using those microchips that in turn learn somehow to design themselves to operate the factories and all the factory-needed infrastructure autonomously.
Then there is one forcing function left: the machine operation of control mechanisms. Ie. mechanisms that detect, model, simulate, evaluate, and correct downstream effects in order to keep AGI safe.
The question becomes – Can we rely on only control mechanisms to keep AGI safe?
That question raises other questions.
E.g. as relevant to the hashiness model:
“Consider the space of possible machinery output sequences over time. How large is the subset of output sequences that in their propagation as (cascading) environmental effects would end up lethally disrupting the bodily functioning of humans? How is the accumulative probability of human extinction distributed across the entire output possibility space (or simplified: how mixed are the adjoining lethal and non-lethal possibility subspaces)? Can any necessarily less complex control system connected with/in this machinery actually keep tracking whether possible machinery outputs fall into the lethal sub-space or the non-lethal sub-space? "
This is pretty similar to Hendrycks's natural selection argument, but with the additional piece that the goals of AIs will converge to optimizing the environment for the survival of silicon-based life.
There are some ways to expand Hendrycks’ argument to make it more comprehensive:
- Consider evolutionary selection at the more fundamental level of physical component interactions. Ie. not just at the macro level of agents competing for resources, since this is a leaky abstraction that can easily fail to capture underlying vectors of change.
- Consider not only selection of local variations (ie. mutations) that introduces new functionality, but also the selection of variants connecting up with surrounding units in ways that ends up repurposing existing functionality.
- Consider not only the concept of goals that are (able to be) explicitly tracked by the machinery itself, but also that of the implicit conditions needed by components which end up being selected for in expressions across the environment.
Evolutionary arguments are notoriously tricky and respected scientists get them wrong all the time
This is why we need to take extra care in modelling how evolution – as a kind of algorithm – would apply across the physical signalling pathways of AGI.
I might share a gears-level explanation that Forrest that just gave in response to your comment.
I agree that with superficial observations, I can't conclusively demonstrate that something is devoid of intellectual value.
Thanks for recognising this, and for taking some time now to consider the argument.
However, the nonstandard use of words like "proof" is a strong negative signal on someone's work.
Yes, this made us move away from using the term “proof”, and instead write “formal reasoning”.
Most proofs nowadays are done using mathematical notation. So it is understandable that when people read “proof”, they automatically think “mathematical proof”.
Having said that, there are plenty of examples of proofs done in formal analytic notation that is not mathematical notation. See eg. formal verification practices in the software and hardware industries, or various branches of analytical philosophy.
If someone wants to demonstrate a scientific fact, the burden of proof is on them to communicate this in some clear and standard way
Yes, much of the effort has been to translate argument parts in terms more standard for the alignment community.
What we cannot expect is that the formal reasoning is conceptually familiar and low-inferential distance. That would actually be surprising – why then has someone inside the community not already derived the result in the last 20 years?
The reasoning is going to be as complicated as it has to be to reason things through.
This problem is exacerbated when someone bases their work on original philosophy. To understand Forrest Landry's work to his satisfaction someone will have to understand his 517-page book An Immanent Metaphysics
Cool that you took a look at his work. Forrest’s use of terms is meant to approximate everyday use of those terms, but the underlying philosophy is notoriously complicated.
Jim Rutt is an ex-chair of Santa Fe Institute who defaults to being skeptical of metaphysics proposals (funny quote he repeats: “when someone mentions metaphysics, I reach for my pistol”). But Jim ended up reading Forrest’s book and it passed his B.S. detector. So he invited Forrest over to his podcast for a three-part interview. Even if you listen to that though, I don’t expect you to immediately come away understanding the conceptual relations.
So here is a problem that you and I are both seeing:
- There is this polymath who is clearly smart and recognised for some of his intellectual contributions (by interviewers like Rutt, or co-authors like Anders).
- But what this polymath claims to be using as the most fundamental basis for his analysis would take too much time to work through.
- So then if this polymath claims to have derived a proof by contradiction –concluding that long-term AGI safety is not possible – then it is intractable for alignment researchers to verify the reasoning using his formal annotation and his conceptual framework. That would be asking for too much – if he’d have insisted on that, I agree that would have been a big red flag signalling crankery.
- The obvious move then is for some people to work with the polymath to translate his reasoning to a basis of analysis that alignment researchers agree is a sound basis to reason from. And to translate to terms/concepts people are familiar with. Also, the chain of reasoning should not be so long that busy researchers never end up reading through, but also not so short that you either end up having to use abstractions readers are unfamiliar with, or open up unaddressed gaps in the reasoning. Etc.
- The problem becomes finding people who are both willing and available to do that work. One person is probably not enough.
Having read the research proposal my guess is that they will prove something roughly like the Good Regulator Theorem or Rice's theorem
Both are useful theorems, which have specific conclusions that demonstrate that there are at least some limits to control.
(ie. Good Regulator Theorem demonstrates a limit to a system’s capacity to model – or internally functionally represent – the statespace of some more complex super-system. Rice Theorem demonstrates a particular limit to having some general algorithm predict a behavioural property of other algorithms.)
The hashiness model is a tool meant for demonstrating under conservative assumptions – eg. of how far from cryptographically hashy the algorithm run through ‘AGI’ is, and how targetable human-safe ecosystem conditions are – that AGI would be uncontainable. With “uncontainable”, I mean that no available control system connected with/in AGI could constrain the possibility space of AGI’s output sequences enough over time such that the (cascading) environmental effects do not lethally disrupt the bodily functioning of humans.
Paul expressed appropriate uncertainty. What is he supposed to...say...?
I can see Paul tried expressing uncertainty by adding “probably” to his claim of how the entire scientific community (not sure what this means) would interpret that one essay.
To me, it seemed his commentary was missing some meta-uncertainty. Something like “I just did some light reading. Based on how it’s stated in this essay, I feel confident it makes no sense for me to engage further with the argument. However, maybe other researchers would find it valuable to spend more time engaging with the argument after going through this essay or some other presentation of the argument.”
~
That covers your comments re: communicating the argument in a form that can be verified by the community.
Let me cook dinner, and then respond to your last two comments to dig into the argument itself. EDIT: am writing now, will respond tomorrow.
How about I assume there is some epsilon such that the probability of an agent going off the rails
Got it. So we are both assuming that there would be some accumulative failure rate [per point 3.].
Why can't the agent split into multiple ~uncorrelated agents and have them each control some fraction of resources (maybe space) such that one off-the-rails agent can easily be fought and controlled by the others?
I tried to adopt this ~uncorrelated agents framing, and then argue from within that. But I ran up against some problems with this framing:
- It assumes there are stable boundaries between "agents" that allows us to mark them as separate entities. This kinda works for us as physically bounded and communication-bottlenecked humans. But in practice it wouldn't really work to define "agent" separations within a larger machine network maintaining of own existence in the environment.
(Also, it is not clear to me how failures of those defined "agent" subsets would necessarily be sufficiently uncorrelated – as an example, if the failure involves one subset hijacking the functioning of another subset, their failures become correlated.) - It assumes that if any (physical or functional) subset of this adaptive machinery happens to gain any edge in influencing the distributed flows of atoms and energy back towards own growth, that the other machinery subsets can robustly "control" for that.
- It assumes a macroscale-explanation of physical processes that build up from the microscale. Agreed that the concept of agents owning and directing the allocation of "resources" is a useful abstraction, but it also involves holding a leaky representation of what's going on. Any argument for control using that representation can turn out not to capture crucial aspects.
- It raises the question what "off-the-rails" means here. This gets us into the hashiness model:
Consider the space of possible machinery output sequences over time. How large is the subset of output sequences that in their propagation as (cascading) environmental effects would end up lethally disrupting the bodily functioning of humans? How is the accumulative probability of human extinction distributed across the entire output possibility space (or simplified: how mixed are the adjoining lethal and non-lethal possibility subspaces)? Can any necessarily less complex control system connected with/in this machinery actually keep tracking whether possible machinery outputs fall into the lethal sub-space or the non-lethal sub-space?
→ Do those problems makes sense to you as stated? Do you notice anything missing there?
To sum it up, you and I are still talking about a control system [per point 4.]:
- However you define the autonomous "agents", they are still running through code running across connected hardware.
- There are limits to the capacity of this aggregate machinery to sense, model, simulate, evaluate, and correct own component effects propagating through a larger environment.
I'm gonna try and stay focused on a single point, specifically the argument that leads up to >99%
I'm also for now leaving aside substrate-needs convergence [point 5]:
- That the entire population of nested/connected machine components would be pulled toward a human-lethal attractor state.
I usually do not take authorities too seriously before I understand their reasoning in a particular question. And understanding a person's reasoning may occasionally mean that I disagree in particular points as well. In my experience, even the most respectful people are still people, which means they often think in messy ways and they are good just on average
Right – this comes back to actually examining people’s reasoning.
Relying on the authority status of an insider (who dismissed the argument) or on your ‘crank vibe’ of the outsider (who made the argument) is not a reliable way of checking whether a particular argument is good.
IMO it’s also fine to say “Hey, I don’t have time to assess this argument, so for now I’m going to go with these priors that seemed to broadly kinda work in the past for filtering out poorly substantiated claims. But maybe someone else actually has a chance to go through the argument, I’ll keep an eye open.”
Yes, Remmelt has some extreme expressions…
I may not agree that we are going to die with 99% probability. At the same time I find that his current directions are definitely worthwhile of exploring.
…describing black-and-white thinking
I’m putting these quotes together because I want to check whether you’re tracking the epistemic process I’m proposing here.
Reasoning logically from premises is necessarily black-and-white thinking. Either the truth value is true or it is false.
A way to check the reasoning is to first consider the premises (in how they are described using defined terms, do they correspond comprehensively enough with how the world works?). And then check whether the logic follows from the premises through to each next argument step until you reach the conclusion.
Finally, when you reach the conclusion, and you could not find any soundness or validity issues, then that is the conclusion you have reasoned to.
If the conclusion is that it turns out impossible for some physical/informational system to meet several specified desiderata at the same time, this conclusion may sound extreme.
But if you (and many other people in the field who are inclined to disagree with the conclusion) cannot find any problem with the reasoning, the rational thing would be to accept it, and then consider how it applies to the real world.
Apparently, computer scientists hotly contested CAP theorem for a while. They wanted to build distributed data stores that could send messages that consistently represented new data entries, while the data was also made continuously available throughout the network, while the network was also tolerant to partitions. It turns out that you cannot have all three desiderata at once. Grumbling computer scientists just had to face the reality and turn to designing systems that would fail in the least bad way.
Now, assume there is a new theorem for which the research community in all their efforts have not managed to find logical inconsistencies nor empirical soundness issues. Based on this theorem, it turns out that you cannot both have machinery that keeps operating and learning autonomously across domains, and a control system that would contain the effects of that machinery enough to not feedback in ways that destabilise our environment outside the ranges we can survive in.
We need to make a decision then – what would be the least bad way to fail here? On one hand we could decide against designing increasingly autonomous machines, and lose out on the possibility of having machines running around doing things for us. On the other hand, we could have the machinery fail in about the worst way possible, which is to destroy all existing life on this planet.
efficiently filters for people who are inclined to join the activist movement--especially on the hard-core "front lines"--whereas passive "supporters" can be more trouble than they're worth.
I had not considered how our messaging is filtering out non-committed supporters. Interesting!
No worries. We won't be using ChatGPT or any other model to generate our texts.
As I understand the issue, the case for barricading AI rests on:
Great list! Basically agreeing with the claims under 1. and the structure of what needs to be covered under 2.
Meanwhile, the value of disruptive protest is left to the reader to determine.
You're right. Usually when people hear about a new organisation on the forum, they expect some long write-up of the theory of change and the considerations around what to prioritise.
I don't think I have time right now for writing a neat public write-up. This is just me being realistic – Sam and I are both swamped in terms of handling our work and living situations.
So the best I can do is point to examples where civil disobedience has worked (eg. Just Stop Oil demands, Children's March) and then discuss our particular situation (how the situatiojn is similar and different, who are important stakeholders, what are our demands, what are possible effective tactics in this context).
In particular, the argument for 99% extinction is given a lot of space relative to the post as a whole,
Ha, fair enough. The more rigorously I tried to write out the explanation, the more space it took.
So it's the AI being incompetent?
Yes, but in the sense that there are limits to the AGI's capacity to sense, model, simulate, evaluate, and correct own component effects propagating through a larger environment.
You don't have to simulate something to reason about it.
If you can't simulate (and therefore predict) that a failure mode that by default is likely to happen would happen, then you cannot counterfactually act to prevent the failure mode.
You could walk me though how one of these theorems is relevant to capping self-improvement of reliability?
Maybe take a look at the hashiness model of AGI uncontainability. That's an elegant way of representing the problem (instead of pointing at lots of examples of theorems that show limits to control).
This is not put into mathematical notation yet though. Anders Sandberg is working on it, but also somewhat distracted. Would value your contribution/thinking here, but I also get if you don't want to read through the long transcripts of explanation at this stage. See project here.
Anders' summary:
"A key issue is the thesis that AGI will be uncontrollable in the sense that there is no control mechanism that can guarantee aligned behavior since the more complex and abstract the target behavior is the amount of resources and forcing ability needed become unattainable.
In order to analyse this better a sufficiently general toy model is needed for how controllable systems of different complexity can be, that ideally can be analysed rigorously.
One such model is to study families of binary functions parametrized by their circuit complexity and their "hashiness" (how much they mix information) as an analog for the AGI and the alignment model, and the limits to finding predicates that can keep the alignment system making the AGI analog producing a desired output."
Garrabrant induction shows one way of doing self-referential reasoning.
We're talking about learning from inputs received from a more complex environment (through which AGI outputs also propagate as changed effects of which some are received as inputs).
Does Garrabrant take that into account in his self-referential reasoning?
As an analogy: Use something more like democracy than like dictatorship, such that any one person going crazy can't destroy the world/country, as a crazy dictator would.
A human democracy is composed out of humans with similar needs. This turns out to be an essential difference.
claiming to have a full mathematical proof that safe AI is impossible,
I have never claimed that there is a mathematical proof. I have claimed that the researcher I work with has done their own reasoning in formal analytical notation (just not maths). Also, that based on his argument – which I probed and have explained here as carefully as I can – AGI cannot be controlled enough to stay safe, and actually converges on extinction.
That researcher is now collaborating with Anders Sandberg to formalise an elegant model of AGI uncontainability in mathematical notation.
I’m kinda pointing out the obvious here, but if the researcher was a crank, why would Anders be working with them?
claiming the "proof" uses mathematical arguments from Godel's theorem, Galois Theory,
Nope, I haven’t claimed either of that.
The claim is that the argument is based on showing a limited extent of control (where controlling effects consistently in line with reference values).
The form of the reasoning there shares some underlying correspondences with how the Gödel’s incompleteness theorems (concluding there is a limit to deriving a logical result within a formal axiomatic system) and Galois Theory (concluding that there is a limited scope of application of an algebraic tool) are reasoned through.
^– This is a pedagogical device. It helps researchers already acquainted with Gödel’s theorems or Galois Theory to understand roughly what kind of reasoning we’re talking about.
inexplicably formatted as a poem
Do you mean the fact that the researcher splits his sentences’ constituent parts into separate lines so that claims are more carefully parsable?
That is a format for analysis, not a poem format.
While certainly unconventional, it is not a reason to dismiss the rigour of someone’s analysis.
Paul Christiano read some of this and concluded "the entire scientific community would probably consider this writing to be crankery",
If you look at that exchange, I and the researcher I was working with were writing specific and carefully explained responses.
Paul had zoned in on a statement of the conclusion, misinterpreted what was meant, and then moved on to dismissing the entire project. Doing this was not epistemically humble.
But I think that as rationalists, one of our core skills should be to condemn actual crankery and all of its influences
When accusing someone of crankery (which is a big deal) it is important not to fall into making vague hand-wavey statements yourself.
You are making vague hand-wavey (and also inaccurate) statements above. Insinuating that something is “science-babble” doesn’t do anything. Calling an essay formatted as shorter lines a “poem” doesn’t do anything.
superficially agree with the conclusions from actually good arguments.
Unlike Anders – who examined the insufficient controllability part of the argument – you are not a position to judge whether this argument is a good argument or not.
Read the core argument please (eg. summarised in point 3-5. above) and tell me where you think premises are unsound or the logic does not follow from the premises.
It is not enough to say ‘as a rationalist’. You got to walk the talk.
Let me recheck the AI Impacts paper.
I definitely made a mistake in quickly checking that number shared by colleague.
The 2023 AI Impacts survey shows a mean risk of 14.4% for the question “What probability do you put on future AI advances causing human extinction or similarly permanent and severe disempowerment of the human species within the next 100 years?”.
Whereas the other smaller sample survey gives a median estimate of 30%
I already thought using those two figures as a range did not make sense, but putting a mean and a median in the same range is even more wrong.
Thanks for pointing this out! Let me add a correcting comment above.
Thanks, as far as I can this is a mix of critiques of strategic approach (fair enough), about communication style (fair enough), and partial misunderstandings of the technical arguments.
instead of a succession of events which need to go your way, I think you should aim for incremental marginal gains. There is no cost-effectiveness analysis…
I agree that we should not get hung up on a succession of events to go a certain way. IMO, we need to get good at simultaneously broadcasting our concerns in a way that’s relatable to other concerned communities, and opportunistically look for new collaborations there.
At the same time, local organisers often build up an activist movement by ratcheting up the number of people joining the events and the pressure they put on demanding institutions to make changes. These are basic cheap civil disobedience tactics that have worked for many movements (climate, civil rights, feminist, changing a ruling party, etc). I prefer to go with what has worked, instead of trying to reinvent the wheel based on fragile cost-effectiveness estimates. But if you can think of concrete alternative activities that also have a track record of working, I’m curious to hear.
Your press release is unreadable (poor formatting), and sounds like a conspiracy theory (catchy punchlines, ALL CAPS DEMANDS, alarmist vocabulary and unsubstantiated claims)
I think this is broadly fair. The turnaround time of this press release was short, and I think we should improve on the formatting and give more nuanced explanations next time.
Keep in mind the text is not aimed at you but people more broadly who are feeling concerned and we want to encourage to act. A press release is not a paper. Our press release is more like a call to action – there is a reason to add punchy lines here.
The figures you quote are false (the median from AI Impacts is 5%) or knowingly misleading (the numbers from Existential risk from AI survey are far from robust and as you note, suffer from selection bias)
Let me recheck the AI Impacts paper. Maybe I was ditzy before, in which case, my bad.
As you saw from my commentary above, I was skeptical about using that range of figures in the first place.
You conflate AGI and self-modifying systems
Not sure what you see as the conflation?
AGI, as an autonomous system that would automate many jobs, would necessarily be self-modifying – even in the limited sense of adjusting its internal code/weights on the basis of new inputs.
Your arguments are invalid
The reasoning shared in the press release by my colleague was rather loose, so I more rigorously explained a related set of arguments in this post.
As to whether arguments from point 1 to 6. above are invalid, I haven’t seen you point out inconsistencies in the logic yet, so as it stands you seem to be sharing a personal opinion.
I am appalled to see this was not downvoted into oblivion!
Should I comment on the level of nuance in your writing here? :P
An obvious consideration for any reflective agent is to find ways to reduce the risk of goal-related failure.
…
by "goal-related systems" I just meant whatever is keeping track of the outcomes being optimized for.
So the argument for 3. is that just by AGI continuing to operate and maintain its components as adapted to a changing environment, the machinery can accidentally end up causing destabilising effects that were untracked or otherwise insufficiently corrected for.
You could call this a failure of the AGI’s goal-related systems if you mean with that that the machinery failed to control its external effects in line with internally represented goals.
But this would be a problem with the control process itself.
An obvious consideration for any reflective agent is to find ways to reduce the risk of goal-related failure.
Unfortunately, there are fundamental limits to that cap the extent to which the machinery can improve its own control process.
Any of the machinery’s external downstream effects that its internal control process cannot track (ie. detect, model, simulate, and identify as a “goal-related failure”), that process cannot correct for.
For further explanation, please see links under point 4.
Decentralizing away from a single point of failure is another obvious step that one would take in a post-ASI world.
The problem here is that (a) we are talking about not just a complicated machine product but self-modifying machinery and (b) at the scale this machinery would be operating at it cannot account for most of the potential human-lethal failures that could result.
For (a), notice how easily feedback processes can become unsimulatable for such unfixed open-ended architectures.
- E.g. How can AGI code predict how its future code learned from unknown inputs will function in processing subsequent unknown inputs? What if future inputs are changing as a result of effects propagated across the larger environment from previous AGI outputs? And those outputs were changing as a result of previous new code that was processing signals in connection with other code running across the machinery? And so on.
For (b), engineering decentralised redundancy can help especially at the microscale.
- E.g. correcting for bit errors.
- But what does it mean to correct for failures at the level of local software (bugs, viruses, etc)? What does it mean to correct for failures across some decentralised server network? What does it mean to correct for failures at the level of an entire machine ecosystem (which AGI effectively becomes)?
~
In scaling up the connected components, this exponentially increases their degrees of freedom of interaction. And as those components change in feedback with surrounding contexts of the environment (and have to in order for AGI to autonomously adapt), an increasing portion of the possible human-lethal failures cannot be adequately controlled for by the system itself.
Appreciating your openness.
(Just making dinner – will get back to this when I’m behind my laptop in around an hour).
There is some risk of its goal-related systems breaking
Ah, that’s actually not the argument.
Could you try read points 1-5. again?
Even if you know a certain market is a bubble, it's not exactly trivial to exploit if you don't know when it's going to burst, which prices will be affected, and to what degree. "The market can remain irrational longer than you can remain solvent" and all that.
Yes, all of this. I didn’t know how to time this, and also good point that operationalising it in terms of AI stocks to target at what strike price could be tricky too.
If I could get the timing right, this makes sense. But I don’t have much of an edge in judging when the bubble would burst. And put options are expensive.
If someone here wants to make a 1:1 bet over the next three years, I’m happy to take them up on the offer.
If there's less demand from cloud users to rent GPU's Google/Microsoft/Amazon would likely use the GPU's in their datacenters for their own projects (or projects like Antrophic/OpenAI).
That’s a good point. Those big tech companies are probably prepared to pay for the energy use if they have the hardware lying around anyway.
To clarify for future reference, I do think it’s likely (80%+) that at some point over the next 5 years there will be a large reduction in investment in AI and a corresponding market crash in AI company stocks, etc, and that both will continue to be for at least three months.
Update: I now think this is 90%+ likely to happen (from original prediction date).
Looks like I summarised it wrong. It’s not about ionising radiation directly from bombarding ions from outer space. It’s about the interaction of the ions with the Earth’s magnetic field, which as you stated “induced large currents in long transmission lines, overloading the transformers.”.
Here is what Greg Weinstein wrote in a scenario I just found written by him:
In 2013, a report had warned that an extreme geomagnetic storm was almost inevitable, and would induce huge currents in Earth’s transmission lines. This vulnerability could, with a little effort, have been completely addressed for a tiny sum of money — less than a tenth of what the world invested annually in text messaging prior to the great collapse of 2024.
Will correct my mistake in the post now.
There is one question on my mind still is whether and how a weakened Earth magnetic field makes things worse. Would the electromagnetic interactions occur on the whole closer to Earth, therefore causing larger currents in power transmission lines? Does that make any sense?
But it’s weird that I cannot find even a good written summary of Bret’s argument online (I do see lots of political podcasts).
I found an earlier scenario written by Bret that covers just one nuclear power plant failing and that does not discuss the risk of a weakening magnetic field.
This was an interesting read, thank you.
Good question! Will look into it / check more if I have the time.
Ah, thanks! Corrected now
Ah, thank you for correcting. I didn’t realise it could be easily interpreted that way.
Also suggest exploring what it may means we are unable to be able to solve the alignment problem for fully autonomous learning machinery.
There will be a [new AI Safety Camp project](https://docs.google.com/document/d/198HoQA600pttXZA8Awo7IQmYHpyHLT49U-pDHbH3LVI/edit) about formalising a model of AGI uncontainability.
Fixed it! You can use either link now to share with your friends.
Igor Krawzcuk, an AI PhD researcher, just shared more specific predictions:
“I agree with ed that the next months are critical, and that the biggest players need to deliver. I think it will need to be plausible progress towards reasoning, as in planning, as in the type of stuff Prolog, SAT/SMT solvers etc. do.
I'm 80% certain that this literally can't be done efficiently with current LLM/RL techniques (last I looked at neural comb-opt vs solvers, it was bad), the only hope being the kitchen sink of scale, foundation models, solvers and RL … If OpenAI/Anthropic/DeepMind can't deliver on promises of reasoning and planning (Q*, Strawberry, AlphaCode/AlphaProof etc.) in the coming months, or if they try to polish more turds into gold (e.g., coming out with GPT-Reasoner, but only for specific business domains) over the next year, then I would be surprised to see the investments last to make it happen in this AI summer.” https://x.com/TheGermanPole/status/1826179777452994657