Why Stop AI is barricading OpenAI

post by Remmelt (remmelt-ellen) · 2024-10-14T07:12:43.049Z · LW · GW · 26 comments

This is a link post for https://docs.google.com/document/d/1ivUwBbzMfwZUd7EAeEWWBtiO1BXFiSC1cft0dxn1q60/

26 comments

Comments sorted by top scores.

comment by Joseph Miller (Josephm) · 2024-10-14T15:12:47.269Z · LW(p) · GW(p)

Respect for doing this.

I strongly wish you would not tie StopAI to the claim that extinction is >99% likely. It means that even your natural supporters in PauseAI will have to say "yes I broadly agree with them but disagree with their claims about extinction being certain."

I would also echo the feedback here [EA(p) · GW(p)]. There's no reason to write in the same style as cranks.

Replies from: thomas-kwa
comment by Thomas Kwa (thomas-kwa) · 2024-10-14T21:29:22.969Z · LW(p) · GW(p)

It's not just the writing that sounds like a crank. Core arguments that Remmelt endorses [LW · GW] are AFAIK considered crankery by the community; with all the classic signs like

  • making up science-babble,
  • claiming to have a full mathematical proof that safe AI is impossible, despite not providing any formal mathematical reasoning
    • claiming the "proof" uses mathematical arguments from Godel's theorem, Galois Theory, Rice's Theorem
  • inexplicably formatted as a poem

Paul Christiano read some of this and concluded "the entire scientific community would probably consider this writing to be crankery [LW(p) · GW(p)]", which seems about accurate to me.

Now I don't like or intend to make personal attacks. But I think that as rationalists, one of our core skills should be to condemn actual crankery and all of its influences, even when the conclusions of cranks and their collaborators superficially agree with the conclusions from actually good arguments.

Replies from: remmelt-ellen, roland-pihlakas
comment by Remmelt (remmelt-ellen) · 2024-10-15T02:28:03.332Z · LW(p) · GW(p)

claiming to have a full mathematical proof that safe AI is impossible,

I have never claimed that there is a mathematical proof. I have claimed that the researcher I work with has done their own reasoning in formal analytical notation (just not maths). Also, that based on his argument – which I probed and have explained here as carefully as I can – AGI cannot be controlled enough to stay safe, and actually converges on extinction.

That researcher is now collaborating with Anders Sandberg to formalise an elegant model of AGI uncontainability in mathematical notation.

I’m kinda pointing out the obvious here, but if the researcher was a crank, why would Anders be working with them?

 

claiming the "proof" uses mathematical arguments from Godel's theorem, Galois Theory,

Nope, I haven’t claimed either of that. 

The claim is that the argument is based on showing a limited extent of control (where control means keeping effects consistently in line with reference values).

The form of the reasoning there shares some underlying correspondences with how Gödel’s incompleteness theorems (concluding there is a limit to deriving a logical result within a formal axiomatic system) and Galois Theory (concluding that there is a limited scope of application of an algebraic tool) are reasoned through.

^– This is a pedagogical device. It helps researchers already acquainted with Gödel’s theorems or Galois Theory to understand roughly what kind of reasoning we’re talking about.

 

inexplicably formatted as a poem

Do you mean the fact that the researcher splits his sentences’ constituent parts into separate lines so that claims are more carefully parsable?

That is a format for analysis, not a poem format.

While certainly unconventional, it is not a reason to dismiss the rigour of someone’s analysis. 

 

Paul Christiano read some of this and concluded "the entire scientific community would probably consider this writing to be crankery",  

If you look at that exchange, I and the researcher I was working with were writing specific and carefully explained responses.

Paul had zoned in on a statement of the conclusion, misinterpreted what was meant, and then moved on to dismissing the entire project. Doing this was not epistemically humble. 

 

But I think that as rationalists, one of our core skills should be to condemn actual crankery and all of its influences

When accusing someone of crankery (which is a big deal) it is important not to fall into making vague hand-wavey statements yourself.

You are making vague hand-wavey (and also inaccurate) statements above. Insinuating that something is “science-babble” doesn’t do anything. Calling an essay formatted as shorter lines a “poem” doesn’t do anything.

 

superficially agree with the conclusions from actually good arguments.

Unlike Anders – who examined the insufficient controllability part of the argument – you are not in a position to judge whether this argument is a good argument or not.

Read the core argument please (eg. summarised in point 3-5. above) and tell me where you think premises are unsound or the logic does not follow from the premises.

It is not enough to say ‘as a rationalist’. You got to walk the talk. 

Replies from: thomas-kwa
comment by Thomas Kwa (thomas-kwa) · 2024-10-15T20:50:42.865Z · LW(p) · GW(p)
  • I agree that with superficial observations, I can't conclusively demonstrate that something is devoid of intellectual value. However, the nonstandard use of words like "proof" is a strong negative signal on someone's work.
  • If someone wants to demonstrate a scientific fact, the burden of proof is on them to communicate this in some clear and standard way, because a basic strategy of anyone practicing pseudoscience is to spend lots of time writing something inscrutable that ends in some conclusion, then claim that no one can disprove it and anyone who thinks it's invalid is misunderstanding something inscrutable.
    • This problem is exacerbated when someone bases their work on original philosophy. To understand Forrest Landry's work to his satisfaction someone will have to understand his 517-page book An Immanent Metaphysics, which uses words like "proof", "theorem", "conjugate", "axiom", and "omniscient" in a nonstandard sense, and also probably requires someone to have a background in metaphysics. I scanned the 134-page version, can't make any sense of it, and found several concrete statements that sound wrong. I read about 50 pages of various articles on the website and found them to be reasonably coherent but often oddly worded and misusing words like entropy, with the same content quality as a ~10 karma LW post but super overconfident.

That researcher is now collaborating with Anders Sandberg to formalise an elegant model of AGI uncontainability in mathematical notation.

Ok. To be clear I don't expect any Landry and Sandberg paper that comes out of this collaboration to be crankery. Having read the research proposal my guess is that they will prove something roughly like the Good Regulator Theorem or Rice's theorem which will be slightly relevant to AI but not super relevant because the premises are too strong, like the average item in Yampolskiy's list of impossibility proofs (I can give examples if you want of why these are not conclusive).

I'm not saying we should discard all reasoning by someone that claims an informal argument is a proof, but rather stop taking their claims of "proofs" at face value without seeing more solid arguments.

claiming the "proof" uses mathematical arguments from Godel's theorem, Galois Theory,

Nope, I haven’t claimed either of that. 

Fair enough. I can't verify this because Wayback Machine is having trouble displaying the relevant content though.

Paul had zoned in on a statement of the conclusion, misinterpreted what was meant, and then moved on to dismissing the entire project. Doing this was not epistemically humble. 

Paul expressed appropriate uncertainty. What is he supposed to do, say "I see several red flags, but I don't have time to read a 517-page metaphysics book, so I'm still radically uncertain whether this is a crank or the next Kurt Godel"?

Read the core argument please (eg. summarised in point 3-5. above) and tell me where you think premises are unsound or the logic does not follow from the premises.

When you say failures will "build up toward lethality at some unknown rate", why would failures build up toward lethality? We have lots of automated systems e.g. semiconductor factories, and failures do not accumulate until everyone at the factory dies, because humans and automated systems can notice errors and correct them.

Variants get evolutionarily selected for how they function across the various contexts they encounter over time. [...] The artificial population therefore converges on fulfilling their own expanding needs.

This is pretty similar to Hendrycks's natural selection argument, but with the additional piece that the goals of AIs will converge to optimizing the environment for the survival of silicon-based life. He claims that there are various ways to counter evolutionary pressures, like "carefully designing AI agents’ intrinsic motivations, introducing constraints on their actions, and institutions that encourage cooperation". In the presence of ways to change incentives such that benign AI systems get higher fitness, I don't think you can get to 99% confidence. Evolutionary arguments are notoriously tricky and respected scientists get them wrong all the time, from Malthus to evolutionary psychology to the group selectionists.

comment by Roland Pihlakas (roland-pihlakas) · 2024-10-15T16:50:19.146Z · LW(p) · GW(p)

I think your own message is also too extreme to be rational. So it seems to me that you are fighting fire with fire. Yes, Remmelt has some extreme expressions, but you definitely have extreme expressions here too, while having even weaker arguments.

Could we find a golden middle road, a common ground, please? With more reflective thinking and with less focus on right and wrong? (Regardless of the dismissive-judgemental title of this forum :P)

I agree that Remmelt can improve the message. And I believe he will do that.

I may not agree that we are going to die with 99% probability. At the same time I find that his current directions are definitely worth exploring.

I also definitely respect Paul. But mentioning his name here is mostly irrelevant for my reasoning or for taking your arguments seriously, simply because I usually do not take authorities too seriously before I understand their reasoning in a particular question. And understanding a person's reasoning may occasionally mean that I disagree in particular points as well. In my experience, even the most respectful people are still people, which means they often think in messy ways and they are good just on average, not per instance of a thought line (which may mean they are poor thinkers 99% of the time, while having really valuable thoughts 1% of the time). I do not know the distribution for Paul, but definitely I would not be disappointed if he makes mistakes sometimes.

I think this part of Remmelt's response sums it up nicely: "When accusing someone of crankery (which is a big deal) it is important not to fall into making vague hand-wavey statements yourself. You are making vague hand-wavey (and also inaccurate) statements above. Insinuating that something is “science-babble” doesn’t do anything. Calling an essay formatted as shorter lines a “poem” doesn’t do anything."

In my interpretation, black-and-white thinking is not "crankery". It is a normal and essential step in the development of cognition about a particular problem. Unfortunately. There is research about that in the field of developmental and cognitive psychology. Hopefully that applies to your own black-and-white thinking as well. Note that, unfortunately this development is topic specific, not universal.

In contrast, "crankery" is too strong a word for describing black-and-white thinking because it is a very judgemental word, a complete dismissal, and essentially an expression of unwillingness to understand, an insult, not just a disagreement about the degree of the claims. Is labelling someone's thoughts as "crankery" also a form of crankery of its own then? Paradoxical, isn't it?

Replies from: remmelt-ellen
comment by Remmelt (remmelt-ellen) · 2024-10-15T22:15:04.786Z · LW(p) · GW(p)

I usually do not take authorities too seriously before I understand their reasoning in a particular question. And understanding a person's reasoning may occasionally mean that I disagree in particular points as well. In my experience, even the most respectful people are still people, which means they often think in messy ways and they are good just on average

Right – this comes back to actually examining people’s reasoning. 

Relying on the authority status of an insider (who dismissed the argument) or on your ‘crank vibe’ of the outsider (who made the argument) is not a reliable way of checking whether a particular argument is good.

IMO it’s also fine to say “Hey, I don’t have time to assess this argument, so for now I’m going to go with these priors that seemed to broadly kinda work in the past for filtering out poorly substantiated claims. But maybe someone else actually has a chance to go through the argument, I’ll keep an eye open.”

 

Yes, Remmelt has some extreme expressions…

I may not agree that we are going to die with 99% probability. At the same time I find that his current directions are definitely worth exploring.

…describing black-and-white thinking

I’m putting these quotes together because I want to check whether you’re tracking the epistemic process I’m proposing here.

Reasoning logically from premises is necessarily black-and-white thinking. Either the truth value is true or it is false.

A way to check the reasoning is to first consider the premises (in how they are described using defined terms, do they correspond comprehensively enough with how the world works?). And then check whether the logic follows from the premises through to each next argument step until you reach the conclusion.

Finally, when you reach the conclusion, and you could not find any soundness or validity issues, then that is the conclusion you have reasoned to.

If the conclusion is that it turns out impossible for some physical/informational system to meet several specified desiderata at the same time, this conclusion may sound extreme. 

But if you (and many other people in the field who are inclined to disagree with the conclusion) cannot find any problem with the reasoning, the rational thing would be to accept it, and then consider how it applies to the real world.

Apparently, computer scientists hotly contested CAP theorem for a while. They wanted to build distributed data stores that could send messages that consistently represented new data entries, while the data was also made continuously available throughout the network, while the network was also tolerant to partitions. It turns out that you cannot have all three desiderata at once. Grumbling computer scientists just had to face the reality and turn to designing systems that would fail in the least bad way.

Now, assume there is a new theorem for which the research community in all their efforts have not managed to find logical inconsistencies nor empirical soundness issues. Based on this theorem, it turns out that you cannot both have machinery that keeps operating and learning autonomously across domains, and a control system that would contain the effects of that machinery enough to not feed back in ways that destabilise our environment outside the ranges we can survive in.

We need to make a decision then – what would be the least bad way to fail here? On one hand we could decide against designing increasingly autonomous machines, and lose out on the possibility of having machines running around doing things for us. On the other hand, we could have the machinery fail in about the worst way possible, which is to destroy all existing life on this planet.

comment by gilch · 2024-10-14T17:48:33.912Z · LW(p) · GW(p)

The press release strikes me as poorly written. It's middle-school level. ChatGPT can write better than this. Exactly who is your (Stop AI's) audience here? "The press"?

Exclamation points are excessive. "Heart's content"? You're not in this for "contentment". The "you can't prove it, therefore I'm right" argument is weak. The second page is worse. "Toxic conditions"? I think I know what you meant, but you didn't connect it well enough for a general audience. "accelerate our mass extinction until we are all dead"? I'm pretty sure the "all dead" part has to come before the "extinction". "(and abusing his sister)"? OK, there's enough in the public record to believe that Sam is not (ahem) "consistently candid", but I'm at under 50% about the sister abuse even then on priors. Do you want to get sued for libel on top of your jail time? Is that a good strategy?

I admire your courage and hope you make an impact, but if you're willing to pay these heavy costs, of getting arrested, and facing jail time etc., then please try to win! Your necessity defense is an interesting idea, but if this is the best you can do, it will fail. If you can afford to hire a good defense attorney, you can afford a better writer! Tell me how this move is 4-D chess and not just a blunder.

comment by momom2 (amaury-lorin) · 2024-10-14T15:22:23.475Z · LW(p) · GW(p)

I do not find this post reassuring about your approach.

  • Your plan is unsound; instead of a succession of events which need to go your way, I think you should aim for incremental marginal gains. There is no cost-effectiveness analysis, and the implicit theory of change is lacunar.
  • Your press release is unreadable (poor formatting), and sounds like a conspiracy theory (catchy punchlines, ALL CAPS DEMANDS, alarmist vocabulary and unsubstantiated claims); I think it's likely to discredit safety movements and raise attention in counterproductive ways.
  • The figures you quote are false (the median from AI Impacts is 5%) or knowingly misleading (the numbers from Existential risk from AI survey are far from robust and as you note, suffer from selection bias), so I think it's fair to call them lies.
  • Your explanations for what you say in the press release sometimes don't make sense! You conflate AGI and self-modifying systems, your explanation for "eventually" does not match the sentence.
  • Your arguments are based on wrong premises - it's easy to check that your facts such as "they are not following the scientific method" are plain wrong. It sounds like you're trying to smear OpenAI and Sam Altman as much as possible without consideration for whether what you're saying is true.
     

I am appalled to see this was not downvoted into oblivion! My best guess is that people feel that there are not enough efforts going towards stopping AI and did not read the post and the press release to check that you have good reason motivating your actions.

Replies from: remmelt-ellen
comment by Remmelt (remmelt-ellen) · 2024-10-14T16:21:56.556Z · LW(p) · GW(p)

Thanks, as far as I can see this is a mix of critiques of strategic approach (fair enough), about communication style (fair enough), and partial misunderstandings of the technical arguments.

 

instead of a succession of events which need to go your way, I think you should aim for incremental marginal gains. There is no cost-effectiveness analysis…

I agree that we should not get hung up on a succession of events to go a certain way. IMO, we need to get good at simultaneously broadcasting our concerns in a way that’s relatable to other concerned communities, and opportunistically look for new collaborations there.  

At the same time, local organisers often build up an activist movement by ratcheting up the number of people joining the events and the pressure they put on demanding institutions to make changes. These are basic cheap civil disobedience tactics that have worked for many movements (climate, civil rights, feminist, changing a ruling party, etc). I prefer to go with what has worked, instead of trying to reinvent the wheel based on fragile cost-effectiveness estimates. But if you can think of concrete alternative activities that also have a track record of working, I’m curious to hear.

Your press release is unreadable (poor formatting), and sounds like a conspiracy theory (catchy punchlines, ALL CAPS DEMANDS, alarmist vocabulary and unsubstantiated claims)

I think this is broadly fair.  The turnaround time of this press release was short, and I think we should improve on the formatting and give more nuanced explanations next time.

Keep in mind the text is not aimed at you but at people more broadly who are feeling concerned and whom we want to encourage to act. A press release is not a paper. Our press release is more like a call to action – there is a reason to add punchy lines here.

 

The figures you quote are false (the median from AI Impacts is 5%)  or knowingly misleading (the numbers from Existential risk from AI survey are far from robust and as you note, suffer from selection bias)

Let me recheck the AI Impacts paper. Maybe I was ditzy before, in which case, my bad.  

As you saw from my commentary above, I was skeptical about using that range of figures in the first place.

 

You conflate AGI and self-modifying systems

Not sure what you see as the conflation? 

AGI, as an autonomous system that would automate many jobs, would necessarily be self-modifying – even in the limited sense of adjusting its internal code/weights on the basis of new inputs. 

 

Your arguments are invalid

The reasoning shared in the press release by my colleague was rather loose, so I more rigorously explained a related set of arguments in this post.

As to whether arguments from point 1 to 6. above are invalid, I haven’t seen you point out inconsistencies in the logic yet, so as it stands you seem to be sharing a personal opinion. 

 

I am appalled to see this was not downvoted into oblivion!

Should I comment on the level of nuance in your writing here? :P

Replies from: remmelt-ellen
comment by Remmelt (remmelt-ellen) · 2024-10-14T16:56:16.865Z · LW(p) · GW(p)

Let me recheck the AI Impacts paper.

I definitely made a mistake in quickly checking that number shared by a colleague.

The 2023 AI Impacts survey shows a mean risk of 14.4% for the question “What probability do you put on future AI advances causing human extinction or similarly permanent and severe disempowerment of the human species within the next 100 years?”.

Whereas the other smaller sample survey [LW · GW] gives a median estimate of 30%.

I already thought using those two figures as a range did not make sense, but putting a mean and a median in the same range is even more wrong.
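To make the mean-versus-median point concrete, here is a minimal sketch. The `answers` list is invented purely for illustration and is not data from either survey; it only shows that a right-skewed set of probability estimates can have a mean several times its median, so the two statistics should not be presented as endpoints of one range.

```python
from statistics import mean, median

# Hypothetical survey answers (probability estimates), invented for illustration only.
# A few high answers pull the mean up while leaving the median unchanged.
answers = [0.01, 0.02, 0.05, 0.05, 0.10, 0.20, 0.60]

print("median:", median(answers))
print("mean:", round(mean(answers), 3))
```

The same sample yields two quite different "typical" values, which is why a range built from one survey's mean and another survey's median compares unlike quantities.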

Thanks for pointing this out! Let me add a correcting comment above. 

comment by Jeremy Gillen (jeremy-gillen) · 2024-10-14T10:35:27.173Z · LW(p) · GW(p)

In practice, engineers know that complex architectures interacting with the surrounding world end up having functional failures (because of unexpected interactive effects, or noisy interference). With AGI, we are talking about an architecture here that would be replacing all our jobs and move to managing conditions across our environment. If AGI continues to persist in some form over time, failures will occur and build up toward lethality at some unknown rate. Over a long enough period, this repeated potential for uncontrolled failures pushes the risk of human extinction above 99%.

This part is invalid, I think. 

My understanding of this argument is: 1) There is an extremely powerful agent, so powerful that if it wanted to it could cause human extinction. 2) There is some risk of its goal-related systems breaking, and this risk doesn't rapidly decrease over time. Therefore the risk adds up over time and converges toward 1.

This argument doesn't work because the two premises won't hold. For 2) An obvious consideration for any reflective agent is to find ways to reduce the risk of goal-related failure. For 1) Decentralizing away from a single point of failure is another obvious step that one would take in a post-ASI world.

So the risk of everyone dying should only come from a relatively short period after an agent (or agents) become powerful enough that killing everyone is an ~easy option.
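The disagreement over premise 2) can be illustrated with a small calculation. This is only a sketch under stated assumptions: periods are treated as independent, and the per-period risk values (1% constant, or 1% shrinking by 10% each period) are hypothetical numbers, not estimates from the post. The function name `cumulative_risk` is likewise made up for this example.

```python
def cumulative_risk(per_period_risks):
    """Probability of at least one failure, assuming independent periods."""
    survival = 1.0
    for p in per_period_risks:
        survival *= (1.0 - p)
    return 1.0 - survival

# Hypothetical numbers, chosen only for illustration.
constant = [0.01] * 1000                           # risk stays at 1% per period
decaying = [0.01 * 0.9 ** t for t in range(1000)]  # risk shrinks by 10% each period

print("constant risk:", round(cumulative_risk(constant), 4))
print("decaying risk:", round(cumulative_risk(decaying), 4))
```

With a constant per-period risk the cumulative risk approaches 1 over enough periods, whereas a per-period risk that falls fast enough keeps the total bounded well below 1. Which of these regimes applies is exactly what the two commenters dispute.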

Replies from: remmelt-ellen
comment by Remmelt (remmelt-ellen) · 2024-10-14T11:37:53.593Z · LW(p) · GW(p)

There is some risk of its goal-related systems breaking

Ah, that’s actually not the argument.

Could you try reading points 1-5 again?

Replies from: jeremy-gillen
comment by Jeremy Gillen (jeremy-gillen) · 2024-10-14T12:12:24.486Z · LW(p) · GW(p)

I've reread and my understanding of point 3 remains the same. I wasn't trying to summarize points 1-5, to be clear. And by "goal-related systems" I just meant whatever is keeping track of the outcomes being optimized for.

Perhaps you could point me to my misunderstanding?

Replies from: remmelt-ellen
comment by Remmelt (remmelt-ellen) · 2024-10-14T12:27:56.298Z · LW(p) · GW(p)

Appreciating your openness. 

(Just making dinner – will get back to this when I’m behind my laptop in around an hour). 

Replies from: remmelt-ellen
comment by Remmelt (remmelt-ellen) · 2024-10-14T13:54:01.516Z · LW(p) · GW(p)

An obvious consideration for any reflective agent is to find ways to reduce the risk of goal-related failure.

by "goal-related systems" I just meant whatever is keeping track of the outcomes being optimized for.


So the argument for 3. is that just by AGI continuing to operate and maintain its components as adapted to a changing environment, the machinery can accidentally end up causing destabilising effects that were untracked or otherwise insufficiently corrected for. 

You could call this a failure of the AGI’s goal-related systems if you mean by that that the machinery failed to control its external effects in line with internally represented goals.

But this would be a problem with the control process itself.

 

An obvious consideration for any reflective agent is to find ways to reduce the risk of goal-related failure.

Unfortunately, there are fundamental limits that cap the extent to which the machinery can improve its own control process.

Any of the machinery’s external downstream effects that its internal control process cannot track (ie. detect, model, simulate, and identify as a “goal-related failure”), that process cannot correct for.  

For further explanation, please see links under point 4.

 

Decentralizing away from a single point of failure is another obvious step that one would take in a post-ASI world.

The problem here is that (a) we are talking about not just a complicated machine product but self-modifying machinery and (b) at the scale this machinery would be operating at it cannot account for most of the potential human-lethal failures that could result. 

For (a), notice how easily feedback processes can become unsimulatable for such unfixed open-ended architectures. 

  • E.g. How can AGI code predict how its future code learned from unknown inputs will function in processing subsequent unknown inputs? What if future inputs are changing as a result of effects propagated across the larger environment from previous AGI outputs? And those outputs were changing as a result of previous new code that was processing signals in connection with other code running across the machinery? And so on.  

For (b), engineering decentralised redundancy can help especially at the microscale. 

  • E.g. correcting for bit errors.
  • But what does it mean to correct for failures at the level of local software (bugs, viruses, etc)? What does it mean to correct for failures across some decentralised server network? What does it mean to correct for failures at the level of an entire machine ecosystem (which AGI effectively becomes)?

~

In scaling up the connected components, this exponentially increases their degrees of freedom of interaction. And as those components change in feedback with surrounding contexts of the environment (and have to in order for AGI to autonomously adapt), an increasing portion of the possible human-lethal failures cannot be adequately controlled for by the system itself.

Replies from: jeremy-gillen
comment by Jeremy Gillen (jeremy-gillen) · 2024-10-14T16:36:11.658Z · LW(p) · GW(p)

You could call this a failure of the AGI’s goal-related systems if you mean by that that the machinery failed to control its external effects in line with internally represented goals.

But this would be a problem with the control process itself.

So it's the AI being incompetent?

Unfortunately, there are fundamental limits that cap the extent to which the machinery can improve its own control process.

Yeah I think that would be a good response to my argument against premise 2). I've had a quick look at the list of theorems in the paper, I don't know most of them, but the ones I do know don't seem to support the point you're making. So I don't buy it. You could walk me through how one of these theorems is relevant to capping self-improvement of reliability?

For (a), notice how easily feedback processes can become unsimulatable for such unfixed open-ended architectures. 

You don't have to simulate something to reason about it.

E.g. How can AGI code predict how its future code learned from unknown inputs will function in processing subsequent unknown inputs?

Garrabrant induction shows one way of doing self-referential reasoning.

  • But what does it mean to correct for failures at the level of local software (bugs, viruses, etc)? What does it mean to correct for failures across some decentralised server network? What does it mean to correct for failures at the level of an entire machine ecosystem (which AGI effectively becomes)?

As an analogy: Use something more like democracy than like dictatorship, such that any one person going crazy can't destroy the world/country, as a crazy dictator would.

Replies from: remmelt-ellen
comment by Remmelt (remmelt-ellen) · 2024-10-15T05:08:37.634Z · LW(p) · GW(p)

So it's the AI being incompetent?

Yes, but in the sense that there are limits to the AGI's capacity to sense, model, simulate, evaluate, and correct its own component effects propagating through a larger environment.
 


You don't have to simulate something to reason about it.

If you can't simulate (and therefore predict) that a failure mode that by default is likely to happen would happen, then you cannot counterfactually act to prevent the failure mode.

 

You could walk me through how one of these theorems is relevant to capping self-improvement of reliability?

Maybe take a look at the hashiness model of AGI uncontainability. That's an elegant way of representing the problem (instead of pointing at lots of examples of theorems that show limits to control).

This is not put into mathematical notation yet though. Anders Sandberg is working on it, but also somewhat distracted. Would value your contribution/thinking here, but I also get if you don't want to read through the long transcripts of explanation at this stage. See project here

Anders' summary:
"A key issue is the thesis that AGI will be uncontrollable in the sense that there is no control mechanism that can guarantee aligned behavior since the more complex and abstract the target behavior is the amount of resources and forcing ability needed become unattainable. 

In order to analyse this better a sufficiently general toy model is needed for how controllable systems of different complexity can be, that ideally can be analysed rigorously.

One such model is to study families of binary functions parametrized by their circuit complexity and their "hashiness" (how much they mix information) as an analog for the AGI and the alignment model, and the limits to finding predicates that can keep the alignment system making the AGI analog producing a desired output."

 

Garrabrant induction shows one way of doing self-referential reasoning.

We're talking about learning from inputs received from a more complex environment (through which AGI outputs also propagate as changed effects of which some are received as inputs). 

Does Garrabrant take that into account in his self-referential reasoning?

 

As an analogy: Use something more like democracy than like dictatorship, such that any one person going crazy can't destroy the world/country, as a crazy dictator would.

A human democracy is composed of humans with similar needs. This turns out to be an essential difference.

Replies from: jeremy-gillen
comment by Jeremy Gillen (jeremy-gillen) · 2024-10-15T11:19:19.661Z · LW(p) · GW(p)

How about I assume there is some epsilon such that the probability of an agent going off the rails is greater than epsilon in any given year. Why can't the agent split into multiple ~uncorrelated agents and have them each control some fraction of resources (maybe space) such that one off-the-rails agent can easily be fought and controlled by the others? This should reduce the risk to some fraction of epsilon, right?

(I'm gonna try and stay focused on a single point, specifically the argument that leads up to >99%, because that part seems wrong for quite simple reasons).
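
The arithmetic behind this question can be sketched as follows. This is only an illustration under stated assumptions, not a model anyone in the thread has endorsed: years are treated as independent, agents fail independently with the same annual probability `epsilon`, and a catastrophe in a given year requires at least `threshold` of the agents to go off the rails in that same year (fewer than that are assumed to be contained by the rest). The function names and the example numbers are hypothetical.

```python
from math import comb


def cumulative_risk(annual_risk: float, years: int) -> float:
    """Probability of at least one catastrophe over `years` independent years."""
    return 1.0 - (1.0 - annual_risk) ** years


def annual_risk_with_split(epsilon: float, agents: int, threshold: int) -> float:
    """Probability that at least `threshold` of `agents` independent agents
    fail in the same year (binomial tail). ASSUMPTION: fewer simultaneous
    failures than `threshold` are always contained by the remaining agents."""
    return sum(
        comb(agents, k) * epsilon**k * (1.0 - epsilon) ** (agents - k)
        for k in range(threshold, agents + 1)
    )


if __name__ == "__main__":
    epsilon = 0.01  # hypothetical annual failure probability per agent
    years = 1000    # hypothetical time horizon

    single = cumulative_risk(epsilon, years)
    split_annual = annual_risk_with_split(epsilon, agents=5, threshold=3)
    split = cumulative_risk(split_annual, years)

    print(f"single agent, cumulative risk: {single:.6f}")
    print(f"5 agents (3 must fail), annual risk: {split_annual:.2e}")
    print(f"5 agents (3 must fail), cumulative risk: {split:.6f}")
```

Under these assumptions the per-year risk drops from `epsilon` to roughly the order of `epsilon ** threshold`, which is the reduction the question points at. The sketch leaves out correlated failures and any change in `epsilon` over time, and those are the points the two sides of this exchange disagree on.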

comment by WillPetillo · 2024-10-15T01:01:50.222Z · LW(p) · GW(p)

There are some writing issues here that make it difficult to evaluate the ideas presented purely on their merits.  In particular, the argument for 99% extinction is given a lot of space relative to the post as a whole, where it should really be a bullet point that links to where this case is made elsewhere (or if it is not made adequately elsewhere, as a new post entirely).  Meanwhile, the value of disruptive protest is left to the reader to determine.

As I understand the issue, the case for barricading AI rests on:
1. Safety doesn't happen by default
a) AI labs are not on track to achieve "alignment" as commonly considered by safety researchers.
b) Those standards may be over-optimistic--link to Substrate Needs Convergence, arguments by Yampolskiy, etc.
c) Even if the conception of safety assumed by the AI labs is right, it is not clear that their utopic vision for the future is actually good.
2. Advocacy, not just technical work, is needed for AI safety
a) See above
b) Market incentives are misaligned
c) Policy (and culture) matters
3. Disruptive actions, not just working within civil channels, are needed for effective advocacy.
a) Ways that working entirely within ordinary democratic channels can get delayed or derailed
b) Benefits of disruptive actions, separate from or in synergy with other forms of advocacy
c) Plan for how StopAI's specific choice of disruptive actions effectively plays to the above benefits
d) Moral arguments, if not already implied

Replies from: remmelt-ellen
comment by Remmelt (remmelt-ellen) · 2024-10-15T05:19:21.361Z · LW(p) · GW(p)

As I understand the issue, the case for barricading AI rests on:  

Great list! Basically agreeing with the claims under 1. and the structure of what needs to be covered under 2.
  

Meanwhile, the value of disruptive protest is left to the reader to determine.

You're right. Usually when people hear about a new organisation on the forum, they expect some long write-up of the theory of change and the considerations around what to prioritise. 

I don't think I have time right now for writing a neat public write-up. This is just me being realistic – Sam and I are both swamped in terms of handling our work and living situations.

So the best I can do is point to examples where civil disobedience has worked (e.g. Just Stop Oil demands, Children's March) and then discuss our particular situation (how the situation is similar and different, who are important stakeholders, what are our demands, what are possible effective tactics in this context).
 

In particular, the argument for 99% extinction is given a lot of space relative to the post as a whole, 

Ha, fair enough.  The more rigorously I tried to write out the explanation, the more space it took.

comment by gilch · 2024-10-14T18:17:33.406Z · LW(p) · GW(p)

I mean, yes, hence my comment about ChatGPT writing better than this, but if word gets out that Stop AI is literally using the product of the company they're protesting in their protests, it could come off as hypocrisy.

I personally don't have a problem with it, but I understand the situation at a deeper level than the general public. It could be a wise strategic move to hire a human writer, or even ask for competent volunteer writers, including those not willing to join the protests themselves, although I can see budget or timing being a factor in the decision.

Or they could just use one of the bigger Llamas on their own hardware and try to not get caught. Seems like an unnecessary risk though.

Replies from: remmelt-ellen
comment by Remmelt (remmelt-ellen) · 2024-10-15T05:20:26.748Z · LW(p) · GW(p)

No worries. We won't be using ChatGPT or any other model to generate our texts.

comment by Prometheus · 2024-10-14T19:20:30.848Z · LW(p) · GW(p)

sigh Protests last year, barricading this year, I've already mentally prepared myself for someone next year throwing soup at a human-generated painting while shouting about AI. This is the kind of stuff that makes no one in the Valley want to associate with you. It makes the cause look low-status, unintelligent, lazy, and uninformed.

Replies from: WillPetillo, gilch
comment by WillPetillo · 2024-10-15T01:28:30.376Z · LW(p) · GW(p)

Just because the average person disapproves of a protest tactic doesn't mean that the tactic didn't work.  See Roger Hallam's "Designing the Revolution" series for the thought process underlying the soup-throwing protests.  Reasonable people may disagree (I disagree with quite a few things he says), but if you don't know the arguments, any objection is going to miss the point.  The series is very long, so here's a tl;dr:

- If the public response is: "I'm all for the cause those protestors are advocating, but I can't stand their methods" notice that the first half of this statement was approval of the only thing that matters--approval of the cause itself, as separate from the methods, which brought the former to mind.
- The fact that only a small minority of the audience approves of the protest action is in itself a good thing, because this efficiently filters for people who are inclined to join the activist movement--especially on the hard-core "front lines"--whereas passive "supporters" can be more trouble than they're worth.  These high-value supporters don't need to be convinced that the cause is right; they need to be convinced that the organization is the "real deal" and can actually get things done.  In short, it's niche marketing.
- The disruptive protest model assumes that the democratic system is insufficient, ineffective, or corrupted, such that simply convincing the (passive) center majority is not likely to translate into meaningful policy change.  The model instead relies on putting the powers-that-be into a bind where they have to either ignore you (in which case you keep growing with impunity) or over-react (in which case you leverage public sympathy to grow faster).  Again, it isn't important how sympathetic the protestors are, only that the reaction against them is comparatively worse, from the perspective of the niche audience that matters.
- The ultimate purpose of this recursive growth model is to create a power bloc that forces changes that wouldn't otherwise occur on any reasonable timeline through ordinary democratic means (like voting) alone.
- Hallam presents incremental and disruptive advocacy as in opposition.  This is where I most strongly disagree with his thesis.  IMO: moderates get results, but operate within the boundaries defined by extremists, so they need to learn how to work together.

In short, when you say an action makes a cause "look low status", it is important to ask "to whom?" and "is that segment of the audience relevant to my context?"

Replies from: remmelt-ellen
comment by Remmelt (remmelt-ellen) · 2024-10-15T05:24:10.922Z · LW(p) · GW(p)

efficiently filters for people who are inclined to join the activist movement--especially on the hard-core "front lines"--whereas passive "supporters" can be more trouble than they're worth.

I had not considered how our messaging is filtering out non-committed supporters. Interesting!

comment by gilch · 2024-10-14T21:50:04.492Z · LW(p) · GW(p)

Protesters are expected to be at least a little annoying. Strategic unpopularity might be a price worth paying if it gets results. Sometimes extremists shift the Overton Window.