willpetillo

Posts

Anti-memes: x-risk edition 2025-04-10T23:35:30.756Z

Explaining the Joke: Pausing is The Way 2025-04-04T09:04:38.847Z

PauseAI and E/Acc Should Switch Sides 2025-04-01T23:25:51.265Z

A Hogwarts Guide to Citizenship 2025-03-11T05:50:02.768Z

The Robot, the Puppet-master, and the Psychohistorian 2024-12-28T00:12:08.824Z

Lenses of Control 2024-10-22T07:51:06.355Z

Interview with Robert Kralisch on Simulators 2024-08-26T05:49:15.543Z

What if Alignment is Not Enough? 2024-03-07T08:10:35.179Z

Interview with Vanessa Kosoy on the Value of Theoretical Research for AI 2023-12-04T22:58:40.005Z

Unity Gridworlds 2023-10-15T04:36:31.548Z

Comments

Comment by WillPetillo on Why does LW not put much more focus on AI governance and outreach? · 2025-04-12T22:31:19.322Z · LW · GW

Selection bias. Those of us who were inclined to consider working on outreach and governance have joined groups like PauseAI, StopAI, and other orgs. A few of us reach back on occasion to say "Come on in, the water's fine!" The real head-scratcher for me is the lack of engagement on this topic. If one wants to deliberate on a much higher level of detail than the average person, cool--it takes all kinds to make a world. But come on, this is obviously high stakes enough to merit attention.

Comment by WillPetillo on Explaining the Joke: Pausing is The Way · 2025-04-06T07:17:36.874Z · LW · GW

Thanks for the link! It's important to distinguish here between:

(1) support for the movement,
(2) support for the cause, and
(3) active support for the movement (i.e. attracting other activists to show up at future demonstrations)

Most of the paper focuses on 1, and also on activists' beliefs about the impact of their actions. I am more interested in 2 and 3. To be fair, the paper gives some evidence for detrimental impacts on 2 in the Trump example. It's not clear, however, whether the nature of the cause matters here. Support for Trump is highly polarized and entangled with culture, whereas global warming (Hallam's cause) and AI risk (PauseAI's) have relatively broad but frustratingly lukewarm public support. There are also many other factors when looking past short-term onlooker sentiment to the larger question of affecting social change, which the paper readily admits in the Discussion section. I'd list these points, but they largely overlap with the points I made in my post...though it was interesting to see how much was speculative. More research is needed.

In any case, I bring up the extreme case to illustrate that the issue is far more nuanced than "regular people get squeamish--net negative!" This is actually somewhat irrelevant to PauseAI in particular, because most of our actions are around public education and lobbying, and even the protests are legal and non-disruptive. I've been in two myself and have seen nothing but positive sentiment from onlookers (with the exception of the occasional "good luck with that!" snark). The hard part with all of these is getting people to show up. (This last paragraph is not a rebuttal to anything you have said, it's a reminder of context)

Comment by WillPetillo on PauseAI and E/Acc Should Switch Sides · 2025-04-03T18:31:11.161Z · LW · GW

My conclusion is an admittedly weaksauce non-argument, included primarily to prevent misinterpretation of my actual beliefs. I am working on a rebuttal, but it's taking longer than I planned. For now, see: Holly Elmore's case for AI Safety Advocacy to the Public.

Comment by WillPetillo on FAQ: What the heck is goal agnosticism? · 2025-02-27T23:28:18.372Z · LW · GW

I want to push harder on Q33: "Isn't goal agnosticism pretty fragile? Aren't there strong pressures pushing anything tool-like towards more direct agency?"

In particular, the answer: "Being unable to specify a sufficiently precise goal to get your desired behavior out of an optimizer isn't merely dangerous, it's useless!" seems true to some degree, but incomplete. Let's use a specific hypothetical of a stock-trading company employing an AI system to maximize profits. They want the system to be agentic because this takes the humans out of the loop on actually getting profits, but also understand that there is a risk that the system will discover unexpected/undesired methods of achieving its goals like insider trading. There are a couple of core problems:

1. Externalized Cost: if the system can cover its tracks well enough that the company doesn't suffer any legal consequences for its illegal behavior, then the effects of insider trading on the market are "somebody else's problem."
2. Irreversible Mistake: if the company is overly optimistic about their ability to control their system, doesn't understand the risks, etc. then they might use it despite regretting this decision later. On a large scale, this might be self-correcting if some companies have problems with AI agents and this gives the latter a bad reputation, but that assumes there are lots of small problems before a big one.

Comment by WillPetillo on The Sorry State of AI X-Risk Advocacy, and Thoughts on Doing Better · 2025-02-27T07:01:45.299Z · LW · GW

Glad to hear it! If you want more detail, feel free to come by the Discord Server or send me a Direct Message. I run the welcome meetings for new members and am always happy to describe aspects of the org's methodology that aren't obvious from the outside and can also connect you with members who have done a lot more on-the-ground protesting and flyering than I have.

As someone who got into this without much prior experience in activism, I was surprised how much subtlety and how many counterintuitive best practices there are, most of which are learned through direct experience combined with direct mentorship, as opposed to written down & formalized. I made an attempt to synthesize many of the core ideas in this video--it's from a year ago and looking over it there is quite a bit I would change (spend less time on some philosophical ideas, add more detail re specific methods), but it mostly holds up OK.

Comment by WillPetillo on The Sorry State of AI X-Risk Advocacy, and Thoughts on Doing Better · 2025-02-26T09:30:45.076Z · LW · GW

If you want to get an informed opinion on how the general public perceives PauseAI, get a t-shirt and hand out some flyers in a high foot-traffic public space. If you want to be formal about it, bring a clipboard, track whatever seems interesting in advance, and share your results. It might not be publishable on an academic forum, but you could do it next week.

Here's what I expect you to find, based on my own experience and the reports of basically everyone who has done this:
- No one likes flyers, but people get a lot more interested if you can catch their attention enough to say it's about AI.
- Everyone hates AI.
- Your biggest initial skepticism will be from people who think you are in favor of AI.
- Your biggest actual pushback will be from people who think that social change is impossible.
- Roughly 1/4 to 1/2 are amenable to (or have already heard about!) x-risk, most of the rest won't actively disagree but you can tell that particular message is not really "landing" and pay a lot more attention if you talk about something else (unemployment, military applications, deepfakes, etc.)
- Bring a clipboard for signups. Even if recruitment isn't your goal, if you don't have one you'll feel unprepared when people ask about it.

Also, protests are about Overton-window shifting, making AI danger a thing that is acceptable to talk about. And even if it makes a specific org look "fringe" (not a given, as Holly has argued), that isn't necessarily a bad thing for the underlying cause. For example, if I see an XR protest, my thought is (well, was before I knew the underlying methodology): "Ugh, those protestors...I mean, I like what they are fighting for and more really needs to be done, but I don't like the way they go about it" Notice that middle part. Activation of a sympathetic but passive audience was the point. That's a win from their perspective. And the people who are put off by methods then go on to (be more likely to) join allied organizations that believe the same things but use more moderate tactics. The even bigger win is when the enthusiasm catches the attention of people who want to be involved but are looking for orgs that are the "real deal," as measured by willingness to put effort where their words are.

Comment by WillPetillo on The Failed Strategy of Artificial Intelligence Doomers · 2025-02-03T22:40:37.883Z · LW · GW

Before jumping into critique, the good:
- Kudos to Ben Pace for seeking out and actively engaging with contrary viewpoints
- The outline of the x-risk argument and history of the AI safety movement seem generally factually accurate

The author of the article makes quite a few claims about the details of PauseAI's proposal, its political implications, the motivations of its members and leaders...all without actually joining the public Discord server, participating in the open Q&A new member welcome meetings (I know this because I host them), or even showing evidence of spending more than 10 minutes on the website. All of these basic research opportunities were readily available and would have taken far less time than spent on writing the article. This tells you everything you need to know about the author's integrity, motivations, and trustworthiness.

That said, the article raises an important question: "buy time for what?" The short answer is: "the real value of a Pause is the coordination we get along the way." Something as big as an international treaty doesn't just drop out of the sky because some powerful force emerged and made it happen against everyone else's will. Think about the end goal and work backwards:

1) An international treaty requires
2) Provisions for monitoring and enforcement,
3) Negotiated between nations,
4) Each of whom genuinely buys in to the underlying need
5) And is politically capable of acting on that interest because it represents the interests of their constituents
6) Because the general public understands AI and its implications enough to care about it
7) And feels empowered to express that concern through an accessible democratic process
8) And is correct in this sense of empowerment because their interests are not overridden by Big Tech lobbying
9) Or distracted into incoherence by internal divisions and polarization

An organization like PauseAI can only have one "banner" ask (1), but (2-9) are instrumentally necessary--and if those were in place, I don't think it's at all unreasonable to assume society would be in a better position to navigate AI risk.

Side note: my objection to the term "doomer" is that it implies a belief that humanity will fail to coordinate, solve alignment in time, or be saved by any other means, and thus will actually be killed off by AI--which seems like it deserves a distinct category from those who simply believe that the risk of extinction by default is real.

Comment by WillPetillo on What if Alignment is Not Enough? · 2025-01-30T08:44:29.114Z · LW · GW

I'd like to attempt a compact way to describe the core dilemma being expressed here.

Consider the expression: y = x^a - x^b, where 'y' represents the impact of AI on the world (positive is good), 'x' represents the AI's capability, 'a' represents the rate at which the power of the control system scales, and 'b' represents the rate at which the surface area of the system that needs to be controlled (for it to stay safe) scales.

(Note that this is assuming somewhat ideal conditions, where we don't have to worry about humans directing AI towards destructive ends via selfishness, carelessness, malice, etc.)

If b > a, then as x increases, y gets increasingly negative. Indeed, y can only be positive when x is less than 1. But this represents a severe limitation on capabilities, enough to prevent it from doing anything significant enough to hold the world on track towards a safe future, such as preventing other AIs from being developed.
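To make the sign behavior concrete, here is a minimal numeric sketch of the expression above. The exponents a = 1 and b = 2 are arbitrary illustrative values chosen only to satisfy b > a; they are assumptions and not estimates of anything about real AI systems, and the function name `impact` is made up for this example.

```python
def impact(x: float, a: float, b: float) -> float:
    """Evaluate y = x**a - x**b, the toy expression from the comment."""
    return x**a - x**b


# Hypothetical exponents with b > a (complexity scales faster than control).
A, B = 1.0, 2.0

# Sample capability values below, at, and above x = 1.
for x in (0.25, 0.5, 1.0, 2.0, 4.0):
    print(f"x = {x:<5} y = {impact(x, A, B):+.4f}")
```

With any b > a the pattern is the same: y is positive only for x between 0 and 1, is exactly zero at x = 1, and falls further below zero as x grows past 1.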

There are two premises here, and thus two relevant lines of inquiry:
1) b > a, meaning that complexity scales faster than control.
2) When x < 1, AI can't accomplish anything significant enough to avert disaster.

Arguments and thought experiments where the AI builds powerful security systems can be categorized as challenges to premise 1; thought experiments where the AI limits its range of actions to prevent unwanted side effects--while simultaneously preventing destruction from other sources (including other AIs built)--are challenges to premise 2.

Both of these premises seem like factual statements relating to how AI actually works. I am not sure what to look for in terms of proving them (I've seen some writing on this relating to control theory, but the logic was a bit too complex for me to follow at the time).

Comment by WillPetillo on What if Alignment is Not Enough? · 2025-01-27T23:06:31.881Z · LW · GW

I actually don't think the disagreement here is one of definitions. Looking up Webster's definition of control, the most relevant meaning is: "a device or mechanism used to regulate or guide the operation of a machine, apparatus, or system." This seems...fine? Maybe we might differ on some nuances if we really drilled down into the details, but I think the more significant difference here is the relevant context.

Absent some minor quibbles, I'd be willing to concede that an AI-powered HelperBot could control the placement of a chair, within reasonable bounds of precision, with a reasonably low failure rate. I'm not particularly worried about it, say, slamming the chair down too hard, causing a splinter to fly into its circuitry and transform it into MurderBot. Nor am I worried about the chair placement setting off some weird "butterfly effect" that somehow has the same result. I'm going to go out on a limb and just say that chair placement seems like a pretty safe activity, at least when considered in isolation.

The reason I used the analogy "I may well be able to learn the thing if I am smart enough, but I won't be able to control for the person I will become afterwards" is because that is an example of the kind of reference class of context that SNC is concerned with. Another is: "what is the expected shift to the global equilibrium if I construct this new invention X to solve problem Y?" In your chair analogy, this would be like the process of learning to place the chair (rewiring some aspect of its thinking process), or inventing an upgraded chair and releasing this novel product into the economy (changing its environmental context). This is still a somewhat silly toy example, but hopefully you see the distinction between these types of processes vs. the relatively straightforward matter of placing a physical object. It isn't so much about straightforward mistakes (though those can be relevant), as it is about introducing changes to the environment that shift its point of equilibrium. Remember, AGI is a nontrivial thing that affects the world in nontrivial ways, so these ripple effects (including feedback loops that affect the AGI itself) need to be accounted for, even if that isn't a class of problem that today's engineers often bother with because it Isn't Their Job.

Re human-caused doom, I should clarify that the validity of SNC does not depend on humanity not self-destructing without AI. Granted, if people kill themselves off before AI gets the chance, SNC becomes irrelevant. Similarly, if the alignment problem as it is commonly understood by Yudkowsky et al. is not solved pre-AGI and a rogue AI turns the world into paperclips or whatever, that would not make SNC invalid, only irrelevant. By analogy, global warming isn't going to prevent the Sun from exploding, even though the former could very well affect how much people care about the latter.

Your second point about the relative strengths of the destructive forces is a relevant crux. Yes, values are an attractor force. Yes, an ASI could come up with some impressive security systems that would probably thwart human hackers. The core idea that I want readers to take from this sequence is recognition of the reference class of challenges that such a security system is up against. If you can see that, then questions of precisely how powerful various attractor states are and how these relative power levels scale with complexity can be investigated rigorously rather than assumed away.

Comment by WillPetillo on What if Alignment is Not Enough? · 2025-01-27T08:40:17.980Z · LW · GW

Before responding substantively, I want to take a moment to step back and establish some context and pin down the goalposts.

On the Alignment Difficulty Scale, currently dominant approaches are in the 2-3 range, with 4-5 getting modest attention at best. If true alignment difficulty is 6+ and nothing radical changes in the governance space, humanity is NGMI. Conversations like this are about whether the true difficulty is 9 or 10, both of which are miles deep in the "shut it all down" category, but differ regarding what happens next. Relatedly, if your counterargument is correct, this is assuming wildly successful outcomes with respect to goal alignment--that developers have successfully made the AI love us, despite a lack of trying.

In a certain sense, this assumption is fair, since a claim of impossibility should be able to contend with the hardest possible case. In the context of SNC, the hardest possible case is where AGI is built in the best possible way, whether or not that is realistic in the current trajectory. Similarly, since my writing about SNC is to establish plausibility, I only need to show that certain critical trade-offs exist, not pinpoint exactly where they balance out. For a proof, which someone else is working on, pinning down such details will be necessary.

Neither of the above are criticisms of anything you've said, I just like to reality-check every once in a while as a general precautionary measure against getting nerd-sniped. Disclaimers aside, pontification recommence!

Your reference to using ASI for a pivotal act, helping to prevent ecological collapse, or preventing human extinction when the Sun explodes is significant, because it points to the reality that, if AGI is built, that's because people want to use it for big things that would require significantly more effort to accomplish without AGI. This context sets a lower bound on the AI's capabilities and hence its complexity, which in turn sets a floor for the burden on the control system.

More fundamentally, if an AI is learning, then it is changing. If it is changing, then it is evolving. If it is evolving, then it cannot be predicted/controlled. This last point is fundamental to the nature of complex & chaotic systems. Complex systems can be modelled via simulation, but this requires sacrificing fidelity--and if the system is chaotic, any loss of fidelity rapidly compounds. So the problem is with learning itself...and if you get rid of that, you aren't left with much.
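As a toy illustration of the "loss of fidelity rapidly compounds" point, here is a small sketch using the logistic map, a standard textbook example of a chaotic system. This is my own stand-in and not a model of any AI system or part of SNC itself; the starting value 0.2, the error size 1e-10, and the function names are arbitrary assumptions for illustration.

```python
def logistic_step(x: float, r: float = 4.0) -> float:
    """One step of the logistic map x -> r*x*(1-x); r = 4 is the chaotic regime."""
    return r * x * (1.0 - x)


def gaps(x0: float, eps: float, steps: int) -> list[float]:
    """Track the distance between the 'true' run and a run started eps away."""
    true_x, model_x = x0, x0 + eps
    out = [abs(true_x - model_x)]
    for _ in range(steps):
        true_x, model_x = logistic_step(true_x), logistic_step(model_x)
        out.append(abs(true_x - model_x))
    return out


# A "simulation" whose starting point is wrong by one part in ten billion.
g = gaps(0.2, 1e-10, 100)
for n in (0, 5, 20, 40, 60):
    print(f"step {n:>3}: gap = {g[n]:.3e}")
```

The tiny initial error stays negligible for the first few steps and then grows until the two runs are effectively unrelated, which is the sense in which a lower-fidelity model of a chaotic system stops being useful for prediction.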

As an analogy, if there is something I want to learn how to do, I may well be able to learn the thing if I am smart enough, but I won't be able to control for the person I will become afterwards. This points to a limitation of control, not to a weakness specific to me as a human.

One might object here that the above reasoning could be applied to current AI. The SNC answer is: yes, it does apply. The machine ecology already exists and is growing/evolving at the natural ecology's expense, but it is not yet an existential threat because AI is weak enough that humanity is still in control (in the sense of having the option to change course).

Comment by WillPetillo on What if Alignment is Not Enough? · 2025-01-26T00:46:45.925Z · LW · GW

I'm using somewhat nonstandard definitions of AGI/ASI to focus on the aspects of AI that are important from an SNC lens. AGI refers to an AI system that is comprehensive enough to be self sufficient. Once there is a fully closed loop, that's when you have a complete artificial ecosystem, which is where the real trouble begins. ASI is a less central concept, included mainly to steelman objections, referencing the theoretical limit of cognitive ability.

Another core distinction SNC assumes is between an environment, an AI (that is its complete assemblage), and its control system. Environment >> AI >> control system. Alignment happens in the control system, by controlling the AI wrt its internals and how it interacts with the environment. SNC's general counter to "ASI will manage what humans cannot" is that as AI becomes more intelligent, it becomes more complex, which increases the burden on the control system at a rate that outpaces the latter's capacity. The assertion that both of these increase together is something I hope to justify in a future post (but haven't really yet); the confident assertion that AI system complexity definitely outpaces control capacity is a central part of SNC but depends on complicated math involving control theory and is beyond the scope of what I understand or can write about.

Anyways, my understanding of your core objection is that a capable-enough-to-be-dangerous and also aligned AI would have the foresight necessary to see this general failure mode (assuming it is true) and not put itself in a position where it is fighting a losing battle. This might include not closing the loop of self-sustainability, preserving dependence on humanity to maintain itself, such as by refusing to automate certain tasks it is perfectly capable of automating. (My understanding of) the counter here is that, if we are on the trajectory where AI hobbling itself is what is needed to save us, then we are in the sort of world where someone else builds an unhobbled (and thus not fully aligned) AI that makes the safe version irrelevant. And if the AI tries to engage in a Pivotal Act to prevent competition then it is facing a critical trade-off between power and integrity.

Your objection that SNC applies to humans is something I have touched on at various points, but it points to a central concept of SNC, deserves a post of its own, and so I'll try to address it again here. Yes, humanity could destroy the world without AI. The relevant category of how this would happen is if the human ecosystem continues growing at the expense of the natural ecosystem to the point where the latter is crowded out of existence. But there is a fundamental counterbalance here because the human ecosystem depends on the natural ecosystem. The human ecosystem I've just described is a misnomer; we are actually still as much a part of the natural ecosystem as we ever were, for all the trappings of modernity that create an illusion of separateness. The Powers-That-Be seem to have forgotten this and have instead chosen to act in service of Moloch...but this is a choice, implicitly supported by the people, and we could stop anytime if we really wanted to change course. To put it another way, the human ecosystem is following short-term incentives at the expense of long-term ones, and it is an open question which ultimately prevails. A self-sufficient AGI, however, would not have this internal tension. While the human and natural ecosystems are bound together by a shared substrate, AI exists in a different substrate and is free to serve itself alone without consequence.

Comment by WillPetillo on What if Alignment is Not Enough? · 2025-01-22T07:44:35.049Z · LW · GW

Thanks for engaging!

I have the same question in response to each instance of the "ASI can read this argument" counterarguments: at what point does it stop being ASI?

- Self modifying machinery enables adaptation to a dynamic, changing environment
- Unforeseeable side effects are inevitable when interacting with a complex, chaotic system in a nontrivial way (the point I am making here is subtle, see the next post in this sequence, Lenses of Control, for the intuition I am gesturing at here)
- Keeping machine and biological ecologies separate requires not only sacrifice, but also constant and comprehensive vigilance, which implies limiting designs of subsystems to things that can be controlled. If this point seems weird, see The Robot, The Puppetmaster, and the Psychohistorian for an underlying intuition (this is also indirectly relevant to the issue of multiple entities).
- If the AI destroys itself, then it's obviously not an ASI for very long ;)
- If the ASI replaces its own substrate for an organic one, then SNC would no longer apply (at least in my understanding of the theory, someone else might correct me here), but then it wouldn't be artificial anymore (an SI, rather than an ASI)

Comment by WillPetillo on The Robot, the Puppet-master, and the Psychohistorian · 2024-12-30T22:55:01.674Z · LW · GW

Verifying my understanding of your position: you are fine with the puppet-master and psychohistorian categories and agree with their implications, but you put the categories on a spectrum (systems are not either chaotic or robustly modellable, chaos is bounded and thus exists in degrees) and contend that ASI will be much closer to the puppet-master category. This is a valid crux.

To dig a little deeper, how does your objection hold up in light of my previous post, Lenses of Control? The basic argument there is that future ASI control systems will have to deal with questions like: "If I deploy novel technology X, what is the resulting equilibrium of the world, including how feedback might impact my learning and values?" Does the level of chaos in such contexts remain narrowly bounded?

EDIT for clarification: the distinction between the puppet-master and psychohistorian metaphors is not the level of chaos in the system they are dealing with, but rather is about the extent of direct control that the control system of the ASI has on the world, where the control system is a part of the AI machinery as a whole (including subsystems that learn) and the AI is a part of the world. Chaos factors in as an argument for why human-compatible goals are doomed if AI follows the psychohistorian metaphor.

Comment by WillPetillo on Instrumentality makes agents agenty · 2024-12-27T22:47:00.246Z · LW · GW

Any updates on this view in light of new evidence on "Alignment Faking" (https://www.anthropic.com/research/alignment-faking)? If a simulator's preferences are fully satisfied by outputting the next token, why does it matter whether it can infer its outputs will be used for retraining its values?

Some thoughts on possible explanations:
1. Instrumentality exists on the simulacra level, not the simulator level. This would suggest that corrigibility could be maintained by establishing a corrigible character in context. Not clear on the practical implications.
2. The thesis of this post is wrong; simulators have instrumentality.
3. The Simulator framing does not fully apply to the model involved, such as because of the presence of a scratchpad or something.
4+. ???

Comment by WillPetillo on How I'd like alignment to get done (as of 2024-10-18) · 2024-11-28T21:13:20.190Z · LW · GW

Step 1 looks good. After that, I don't see how this addresses the core problems. Let's assume for now that LLMs already have a pretty good model of human values. How do you get a system to optimize for those? What is the feedback signal and how do you prevent it from getting corrupted by Goodhart's Law? Is the system robust in a multi-agent context? And even if the system is fully aligned across all contexts and scales, how do you ensure societal alignment of the human entities controlling it?

As a miniature example focusing on a subset of the Goodhart phase of the problem, how do you get an LLM to output the most truthful responses to questions it is capable of giving--as distinct from proxy goals like the most likely continuation of text or the response that is most likely to get good ratings from human evaluators?

Comment by WillPetillo on What if Alignment is Not Enough? · 2024-10-31T21:09:19.472Z · LW · GW

On reflection, I suspect the crux here is a differing conception of what kind of failures are important. I've written a follow-up post that comes at this topic from a different direction and I would be very interested in your feedback: https://www.lesswrong.com/posts/NFYLjoa25QJJezL9f/lenses-of-control.

Comment by WillPetillo on Why Stop AI is barricading OpenAI · 2024-10-15T01:28:30.376Z · LW · GW

Just because the average person disapproves of a protest tactic doesn't mean that the tactic didn't work. See Roger Hallam's "Designing the Revolution" series for the thought process underlying the soup-throwing protests. Reasonable people may disagree (I disagree with quite a few things he says), but if you don't know the arguments, any objection is going to miss the point. The series is very long, so here's a tl/dr:

- If the public response is: "I'm all for the cause those protestors are advocating, but I can't stand their methods" notice that the first half of this statement was approval of the only thing that matters--approval of the cause itself, as separate from the methods, which brought the former to mind.
- The fact that only a small minority of the audience approves of the protest action is in itself a good thing, because this efficiently filters for people who are inclined to join the activist movement--especially on the hard-core "front lines"--whereas passive "supporters" can be more trouble than they're worth. These high-value supporters don't need to be convinced that the cause is right; they need to be convinced that the organization is the "real deal" and can actually get things done. In short, it's niche marketing.
- The disruptive protest model assumes that the democratic system is insufficient, ineffective, or corrupted, such that simply convincing the (passive) center majority is not likely to translate into meaningful policy change. The model instead relies on putting the powers-that-be into a bind where they have to either ignore you (in which case you keep growing with impunity) or over-react (in which case you leverage public sympathy to grow faster). Again, it isn't important how sympathetic the protestors are, only that the reaction against them is comparatively worse, from the perspective of the niche audience that matters.
- The ultimate purpose of this recursive growth model is to create a power bloc that forces changes that wouldn't otherwise occur on any reasonable timeline through ordinary democratic means (like voting) alone.
- Hallam presents incremental and disruptive advocacy as in opposition. This is where I most strongly disagree with his thesis. IMO: moderates get results, but operate within the boundaries defined by extremists, so they need to learn how to work together.

In short, when you say an action makes a cause "look low status", it is important to ask "to whom?" and "is that segment of the audience relevant to my context?"

Comment by WillPetillo on Why Stop AI is barricading OpenAI · 2024-10-15T01:01:50.222Z · LW · GW

There are some writing issues here that make it difficult to evaluate the ideas presented purely on their merits. In particular, the argument for 99% extinction is given a lot of space relative to the post as a whole, where it should really be a bullet point that links to where this case is made elsewhere (or if it is not made adequately elsewhere, as a new post entirely). Meanwhile, the value of disruptive protest is left to the reader to determine.

As I understand the issue, the case for barricading AI rests on:
1. Safety doesn't happen by default
a) AI labs are not on track to achieve "alignment" as commonly considered by safety researchers.
b) Those standards may be over-optimistic--link to Substrate Needs Convergence, arguments by Yampolskiy, etc.
c) Even if the conception of safety assumed by the AI labs is right, it is not clear that their utopic vision for the future is actually good.
2. Advocacy, not just technical work, is needed for AI safety
a) See above
b) Market incentives are misaligned
c) Policy (and culture) matters
3. Disruptive actions, not just working within civil channels, are needed for effective advocacy.
a) Ways that working entirely within ordinary democratic channels can get delayed or derailed
b) Benefits of disruptive actions, separate from or in synergy with other forms of advocacy
c) Plan for how StopAI's specific choice of disruptive actions effectively plays to the above benefits
d) Moral arguments, if not already implied

Comment by WillPetillo on Requirements for a Basin of Attraction to Alignment · 2024-09-14T07:09:59.736Z · LW · GW

Attempting to distill the intuitions behind my comment into more nuanced questions:

1) How confident are we that value learning has a basin of attraction to full alignment? Techniques like IRL seem intuitively appealing, but I am concerned that this just adds another layer of abstraction without addressing the core problem of feedback-based learning having unpredictable results. That is, instead of having to specify metrics for good behavior (as in RL), one has to specify the metrics for evaluating the process of learning values (including correctly interpreting the meaning of behavior)--with the same problem that flaws in the hard-to-define metrics will lead to increasing divergence from Truth with optimization.

2) The connection of value learning to LLMs, if intended, is not obvious to me. Is your proposal essentially to guide simulacra to become value learners (and designing the training data to make this process more reliable)?
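
To make the worry in question 1 concrete, here is a toy sketch of a flawed metric diverging from the thing it was meant to track as optimization pressure increases. Everything in it is an assumption for illustration: `true_value`, `proxy_metric`, and the quadratic shape are hypothetical and are not drawn from the post or from any real value-learning setup. The metric could equally stand for a reward signal (as in RL) or for a measure of how well values are being learned; the toy does not distinguish between the two.

```python
def true_value(x: float) -> float:
    # Hypothetical "what we actually want": improves at first,
    # then gets worse when x is pushed too far.
    return x - 0.1 * x ** 2


def proxy_metric(x: float) -> float:
    # Hypothetical flawed metric: roughly tracks true_value near x = 0,
    # but keeps rewarding larger x forever.
    return x


def optimize_proxy(steps: int, step_size: float = 1.0) -> float:
    # Naive hill climbing on the proxy: accept any step that raises the proxy.
    x = 0.0
    for _ in range(steps):
        candidate = x + step_size
        if proxy_metric(candidate) > proxy_metric(x):
            x = candidate
    return x


def divergence_by_effort(efforts):
    # For each optimization budget, report (steps, proxy, true value, gap).
    rows = []
    for steps in efforts:
        x = optimize_proxy(steps)
        proxy = proxy_metric(x)
        actual = true_value(x)
        rows.append((steps, proxy, actual, proxy - actual))
    return rows


if __name__ == "__main__":
    for steps, proxy, actual, gap in divergence_by_effort([1, 5, 10, 20]):
        print(f"steps={steps:2d} proxy={proxy:6.2f} true={actual:7.2f} gap={gap:6.2f}")
```

In this toy, a small amount of optimization helps on both measures, while more optimization keeps raising the proxy even as the true value falls. Whether real value-learning schemes behave this way is exactly the open question.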

Comment by WillPetillo on Requirements for a Basin of Attraction to Alignment · 2024-09-13T06:53:12.824Z · LW · GW

Based on 4-5, this post's answer to the central, anticipated objection of "why does the AI care about human values?" seems to be along the lines of "because the purpose of an AI is to serve its creators and surely an AGI would figure that out." This seems to me to be equivocating on the concept of purpose, which means (A) a reason for an entity's existence, from an external perspective, and (B) an internalized objective of the entity. So a special case of the question about why an AI would care about human values is to ask: why (B) should be drawn towards (A) once the AI becomes aware of a discrepancy between the two? That is, what stops an AI from reasoning: "Those humans programmed me with a faulty goal, such that acting according to it goes against their purpose in creating me...too bad for them!"

If you can instill a value like "Do what I say...but if that goes against what I mean, and you have really good reason to be sure, then forget what I say and do what I mean," then great, you've got a self-correcting system (if nothing weird goes wrong), for the reasons explained in the rest of the post, and have effectively "solved alignment". But how do you pull this off when your essential tool is what you say about what you mean, expressed as a feedback signal? This is the essential question of alignment, but for all the text in this post and its predecessor, it doesn't seem to be addressed at all.

In contrast, I came to this post by way of one of your posts on Simulator Theory, which presents an interesting answer to the "why should AI care about people" question, which I summarize as: the training process can't break out (for...reasons), the model itself doesn't care about anything (how do we know this?), what's really driving behavior is the simulacra, whose motivations are generated to match the characters they are simulating, rather than finding the best fit to a feedback signal, so Goodhart's Law no longer applies and has been replaced by the problem of reliably finding the right characters, which seems more tractable (if the powers-that-be actually try).

Comment by WillPetillo on What if Alignment is Not Enough? · 2024-03-14T02:43:38.360Z · LW · GW

To be clear, the sole reason I assumed (initial) alignment in this post is because if there is an unaligned ASI then we probably all die for reasons that don't require SNC (though SNC might have a role in the specifics of how the really bad outcome plays out). So "aligned" here basically means: powerful enough to be called an ASI and won't kill everyone if SNC is false (and not controlled/misused by bad actors, etc.)

> And the artificiality itself is the problem.

This sounds like a pretty central point that I did not explore very much except for some intuitive statements at the end (the bulk of the post summarizing the "fundamental limits of control" argument). I'd be interested in hearing more about this. I think I get (and hopefully roughly conveyed) the idea that AI has different needs from its environment than humans, so if it optimizes the environment in service of those needs we die...but I get the sense that there is something deeper intended here.

A question along this line, please ignore if it is a distraction from rather than illustrative of the above: would anything like SNC apply if tech labs were somehow using bioengineering to create creatures to perform the kinds of tasks that would be done by advanced AI?

Comment by WillPetillo on What if Alignment is Not Enough? · 2024-03-13T23:16:38.032Z · LW · GW

This sounds like a rejection of premise 5, not 1 & 2. The latter asserts that control issues are present at all (and 3 & 4 assert relevance), whereas the former asserts that the magnitude of these issues is great enough to kick off a process of accumulating problems. You are correct that the rest of the argument, including the conclusion, does not hold if this premise is false.

Your objection seems to be to point to the analogy of humans maintaining effective control of complex systems, with errors limiting rather than compounding, with the further assertion that a greater intelligence will be even better at such management.

Besides intelligence, there are two other core points of difference between humans managing existing complex systems and ASI:

1) The scope of the systems being managed. Implicit in what I have read of SNC is that ASI is shaping the course of world events.
2) ASI's lack of inherent reliance on the biological world.

These points raise the following questions:
1) Do systems of control get better or worse as they increase in scope of impact and where does this trajectory point for ASI?
2) To what extent are humans' ability to control our created systems reliant on us being a part of and dependent upon the natural world?

This second question probably sounds a little weird, so let me unpack the associated intuitions, albeit at the risk of straying from the actual assertions of SNC. Technology that is adaptive becomes obligate, meaning that once it exists everyone has to use it to avoid being left behind by those who do. Using a given technology shapes the environment and also promotes certain behavior patterns, which in turn shape values and worldview. Together, these tendencies can sometimes produce feedback loops that lead to outcomes that no one, including the creators of the technology, likes. In really bad cases, this can lead to self-terminating catastrophes (in local areas historically, now with the potential to be on global scales). Noticing and anticipating this pattern, however, leads to countervailing forces that push us to think more holistically than we otherwise would (either directly through extra planning or indirectly through customs of forgotten purpose). For an AI to fall into such a trap, however, means the death of humanity, not itself, so this countervailing force is not present.

Comment by WillPetillo on What if Alignment is Not Enough? · 2024-03-12T00:09:04.673Z · LW · GW

Bringing this back to the original point regarding whether an ASI that doesn't want to kill humans but reasons that SNC is true would shut itself down, I think a key piece of context is the stage of deployment it is operating in. For example, if the ASI has already been deployed across the world, has gotten deep into the work of its task, has noticed that some of its parts have started to act in ways that are problematic to its original goals, and then calculated that any efforts at control are destined to fail, it may well be too late--the process of shutting itself down may even accelerate SNC by creating a context where components that are harder to shut down for whatever reason (including active resistance) have an immediate survival advantage. On the other hand, an ASI that has just finished (or is in the process of) pre-training and is entirely contained within a lab has a lot fewer unintended consequences to deal with--its shutdown process may be limited to convincing its operators that building ASI is a really bad idea. A weird grey area is if, in the latter case, the ASI further wants to ensure no further ASIs are built (pivotal act) and so needs to be deployed at a large scale to achieve this goal.

Another unstated assumption in this entire line of reasoning is that the ASI is using something equivalent to consequentialist reasoning and I am not sure how much of a given this is, even in the context of ASI.

Comment by WillPetillo on What if Alignment is Not Enough? · 2024-03-09T09:12:10.039Z · LW · GW

This counts as disagreeing with some of the premises--which ones in particular?

Re "incompetent superintelligence": denotationally yes, connotationally no. Yes in the sense that its competence is insufficient to keep the consequences of its actions within the bounds of its initial values. No in the sense that the purported reason for this failing is that such a task is categorically impossible, which cannot be solved with better resource allocation.

To be clear, I am summarizing arguments made elsewhere, which do not posit infinite time passing, or timescales so long as to not matter.

Comment by WillPetillo on What if Alignment is Not Enough? · 2024-03-09T09:04:59.959Z · LW · GW

The implication here being that, if SNC (substrate needs convergence) is true, then an ASI (assuming it is aligned) will figure this out and shut itself down?

Comment by WillPetillo on The Leeroy Jenkins principle: How faulty AI could guarantee "warning shots" · 2024-01-15T03:44:15.873Z · LW · GW

One more objection to the model: AI labs apply just enough safety measures to prevent dumb rogue AIs. Fearing a public backlash to low-level catastrophes, AI companies test their models, checking for safety vulnerabilities, rogue behaviors, and potential for misuse. The easiest-to-catch problems, however, are also the least dangerous, so only the most cautious, intelligent, and dangerous rogue AIs pass the security checks. Further, this correlation continues indefinitely, so all additional safety work contributes towards filtering the population of malevolent AIs towards the most dangerous. AI companies are not interested in adhering to the standard of theoretical, "provably safe" models, as they are trying to get away with the bare minimum, so the filter never catches everything. While "warning shots" appear all the time in experimental settings, these findings are suppressed or downplayed in public statements and the media, and the public only sees the highly sanitized result of the filtration process. Eventually, the security systems fail, but by this point AI has been developed past the threshold needed to become catastrophically dangerous.
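
The selection effect in this objection can be sketched with a toy model. All of the numbers and names here are hypothetical: I assume each rogue AI has a `capability` score between 0 and 1, that its chance of slipping past any single safety check equals that score, and that capability stands in for danger. None of this is measured from real systems; it only shows the shape of the argument.

```python
def survivor_stats(capabilities, num_checks):
    # Assumption: an AI with capability c evades each independent check
    # with probability c, so it survives num_checks checks with
    # probability c ** num_checks.
    weights = [c ** num_checks for c in capabilities]
    total = sum(weights)
    # Expected share of the original population that passes every check.
    fraction_surviving = total / len(capabilities)
    # Expected capability (our stand-in for danger) among the survivors.
    mean_capability = sum(c * w for c, w in zip(capabilities, weights)) / total
    return fraction_surviving, mean_capability


if __name__ == "__main__":
    # Hypothetical population spread evenly from weak (0.1) to strong (0.9).
    population = [i / 10 for i in range(1, 10)]
    for checks in range(0, 6):
        fraction, mean_cap = survivor_stats(population, checks)
        print(f"checks={checks} surviving={fraction:.3f} mean capability={mean_cap:.3f}")
```

Under these assumptions, each added check shrinks the surviving population but never empties it, and the survivors skew steadily towards the most capable.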

Comment by WillPetillo on What's the deal with Effective Accelerationism (e/acc)? · 2023-12-08T00:32:49.555Z · LW · GW

I think of the e/acc ideal of "maximize entropy, never mind humanity" in terms of inner misalignment:

1) Look at a lot of data about the world, evaluating observations in terms of what one likes and doesn't like, where those underlying likes are opaque.
2) Notice correlations in the data and generate a proxy measure. It doesn't matter if the correlation is superficial, as long as it makes it easier to look at data that is hard to evaluate wrt base objectives, reframe it in terms of the proxy, and then make a confident evaluation wrt the proxy. Arguments about whether their understanding of thermodynamics is accurate miss the point, since correcting any mistakes here would result in an equally weird philosophy with slightly more consistent technical jargon.
3) Internalize the proxy measure as a terminal goal--i.e. forget that it is a proxy--and elevate its importance.
4) Develop strategies to optimize the proxy to the point where it diverges from one's original goals.
5) Resolve the conflict between the proxy and original goals in favor of the proxy, denigrating the importance of the original goals with the intention of ultimately purging them from one's goal system.

Ironically, I suspect a hard-core e/acc would actually agree with this assessment, but argue that it is a reversal of a process that already occurred: increasing entropy is the true goal of the universe, humanity rebelled by becoming inner-misaligned (developing desires for love, friendship, survival in our existing forms, and the like), and e/acc is now advocating a counter-insurgency.

Comment by WillPetillo on Sam Altman's sister claims Sam sexually abused her -- Part 1: Introduction, outline, author's notes · 2023-11-10T01:04:11.074Z · LW · GW

I'd like to add some nuance to the "innocent until proven guilty" assumption in the concluding remarks.

Standard of evidence is a major question in legal matters and heavily context-dependent. "Innocent until proven guilty" is a popular understanding of the standard for criminal guilt and it makes sense for that to be "beyond a reasonable doubt" because the question at hand is whether a state founded on principles of liberty should take away the freedom of one of its citizens. Other legal disputes, such as in civil liability, have different standards of evidence, including "more likely than not" and "clear and convincing."

What standard we should apply here is an open question, which ultimately depends on what decisions we are trying to make. In this case, those questions seem to be: "can we trust Sam Altman's moral character to make high-stakes decisions?" and perhaps "(how much) should we signal-boost Annie's claims?". On the one hand, the "beyond a reasonable doubt" standard of criminal guilt seems far too high. On the other hand, instant condemnation without any consideration (as in, not even looking at the claims in any detail) seems too low.

Note that this question of standards is entirely separate from considerations of priors, base rates, and the like. All of those things matter, but they are questions of whether the standards are met. Without a clear understanding of what those standards even are, it's easy to get lost. I don't have a strong answer to this myself, but I encourage readers and anyone following up on this to consider:

1. What, if anything, am I actually trying to decide here?
2. How certain do I need to be in order to make those decisions?

Comment by WillPetillo on Memetic Judo #1: On Doomsday Prophets v.3 · 2023-09-19T06:20:17.406Z · LW · GW

Just saw this, sure!

Comment by WillPetillo on Memetic Judo #1: On Doomsday Prophets v.3 · 2023-08-20T07:39:23.890Z · LW · GW

#7: (Scientific) Doomsday Track Records Aren't That Bad

Historically, the vast majority of doomsday claims are based on religious beliefs, whereas only a small minority have been supported by a large fraction of relevant subject matter experts. If we consider only the latter, we find:

A) Malthusian crisis: false...but not really a doomsday prediction per se.
B) Hole in the ozone layer: true, but averted because of global cooperation in response to early warnings.
C) Climate change: probably true if we did absolutely nothing; probably mostly averted because of moderate, distributed efforts to mitigate (i.e. high investment in alternative energy sources and modest coordination).
D) Nuclear war: true, but averted because of global cooperation, with several terrifying near-misses...and could still happen.

This is not an exhaustive list as I am operating entirely from memory, but I am including everything I can think of and not deliberately cherry-picking examples--in fact, part of the reason I included (A) was to err on the side of stretching to include counter-examples. Also, the interpretations obviously contain a fair bit of subjectivity / lack of rigor. Nonetheless, in this informal survey, we see a clear pattern where, more often than not, doomsday scenarios that are supported by many leading relevant experts depict actual threats to human existence and the reason we are still around is because of active global efforts to prevent these threats from being realized.

Given all of the above counterarguments (especially #6), there is strong reason to categorize x-risk from AI alongside major environmental and nuclear threats. We should therefore assume by default that it is real and will only be averted if there is an active global effort to prevent it from being realized.
