I found this thought-provoking, but I didn't find the arguments very strong.
(a) Misdirected Regulations Reduce Effective Safety Effort; Regulations Will Almost Certainly Be Misdirected
(b) Regulations Generally Favor The Legible-To-The-State
(c) Heavy Regulations Can Simply Disempower the Regulator
(d) Regulations Are Likely To Maximize The Power of Companies Pushing Forward Capabilities the Most
Briefly responding:
a) The issue in this story seems to be that the company doesn't care about x-safety, not that they are legally obligated to care about face-blindness.
b) If governments don't have bandwidth to effectively vet small AI projects, it seems prudent to err on the side of forbidding projects that might pose x-risk.
c) I do think we need effective international cooperation around regulation. But even buying 1-4 years of time seems good in expectation.
d) I don't see the x-risk aspect of this story.
This means that the model can and will implicitly sacrifice next-token prediction accuracy for long horizon prediction accuracy.
Are you claiming this would happen even given infinite capacity?
If so, can you perhaps provide a simple+intuitive+concrete example?
What do you mean by "random linear probe"?
I skimmed this. A few quick comments:
- I think you characterized deceptive alignment pretty well.
- I think it only covers a narrow part of how deceptive behavior can arise.
- CICERO likely already did some of what you describe.
So let us specify a probability distribution over the space of all possible desires. If we accept the orthogonality thesis, we should not want this probability distribution to build in any bias towards certain kinds of desires over others. So let's spread our probabilities in such a way that we meet the following three conditions. Firstly, we don't expect Sia's desires to be better satisfied in any one world than they are in any other world. Formally, our expectation of the degree to which Sia's desires are satisfied at $w$ is equal to our expectation of the degree to which Sia's desires are satisfied at $w'$, for any worlds $w, w'$. Call that common expected value '$\mu$'. Secondly, our probabilities are symmetric around $\mu$. That is, our probability that $w$ satisfies Sia's desires to at least degree $\mu + x$ is equal to our probability that it satisfies her desires to at most degree $\mu - x$. And thirdly, learning how well satisfied Sia's desires are at some worlds won't tell us how well satisfied her desires are at other worlds. That is, the degree to which her desires are satisfied at some worlds is independent of how well satisfied they are at any other worlds. (See the appendix for a more careful formulation of these assumptions.) If our probability distribution satisfies these constraints, then I'll say that Sia's desires are 'sampled randomly' from the space of all possible desires.
This is a characterization, and it remains to be shown that there exist distributions that fit it (I suspect there are none, assuming the sets of possible desires and worlds are unbounded).
I also find the third criterion counterintuitive. If worlds share features, I would expect their degrees of desire-satisfaction not to be independent.
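To make sure I'm reading the three conditions right, here is my attempted formalization, writing $U(w)$ for the degree to which Sia's desires are satisfied at world $w$ (the notation is mine, not the author's):

```latex
% (1) Common mean: no world is expected to satisfy Sia's desires better than any other.
\mathbb{E}[U(w)] = \mu \quad \text{for every world } w
% (2) Symmetry around that common mean.
\Pr\big(U(w) \ge \mu + x\big) = \Pr\big(U(w) \le \mu - x\big) \quad \text{for every } w \text{ and every } x \ge 0
% (3) Independence across worlds.
\{U(w)\}_{w} \ \text{are mutually independent}
```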
I think it might be more effective in future debates at the outset to:
* Explain that it's only necessary to cross a low bar (e.g. see my Tweet below). -- This is a common practice in debates.
* Outline the responses they expect to hear from the other side, and explain why they are bogus. Framing: "Whether AI is an x-risk has been debated in the ML community for 10 years, and nobody has provided any compelling counterarguments that refute the 3 claims (of the Tweet). You will hear a bunch of counterarguments from the other side, but when you do, ask yourself whether they are really addressing this. Here are a few counterarguments and why they fail..." -- I think this could really take the wind out of the sails of the opposition, and put them on the back foot.
I also don't think LeCun and Meta should be given so much credit -- Is Facebook really going to develop and deploy AI responsibly?
1) They have been widely condemned for knowingly playing a significant role in the Rohingya genocide, have acknowledged their failure to act to prevent that role, and are being sued for $150bn over it.
2) They have also been criticised for the role that their products, especially Instagram, play in contributing to mental health issues, especially around body image in teenage girls.
More generally, I think the "companies do irresponsible stuff all the time" point needs to be stressed more. And one particular argument that is bogus is the "we'll make it safe" -- x-safety is a common good, and so companies should be expected to undersupply it. This is econ 101.
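To spell out the econ-101 point with a toy model (the setup and symbols are mine, purely illustrative): suppose each lab can pay a private cost $c$ for safety work, the resulting reduction in x-risk is worth $B$ to the world as a whole, but any single lab internalizes only a small fraction $\alpha$ of that value. Then:

```latex
\text{a lab invests iff } \alpha B > c, \qquad \text{society wants it to invest iff } B > c
```

So whenever $\alpha B < c < B$, every lab rationally underinvests even though the investment is socially worthwhile -- the standard underprovision-of-public-goods result.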
Organizations that are looking for ML talent (e.g. to mentor more junior people, or get feedback on policy) should offer PhD students high-paying contractor/part-time work.
ML PhD students working on safety-relevant projects should be able to augment their meager stipends this way.
That is in addition to all the people who will give their AutoGPT an instruction that means well but actually translates to killing all the humans or at least take control over the future, since that is so obviously the easiest way to accomplish the thing, such as ‘bring about world peace and end world hunger’ (link goes to Sully hyping AutoGPT, saying ‘you give it a goal like end world hunger’) or ‘stop climate change’ or ‘deliver my coffee every morning at 8am sharp no matter what as reliably as possible.’ Or literally almost anything else.
I think these mostly only translate into dangerous behavior if the model badly "misunderstands" the instruction, which seems somewhat implausible.
One must notice that in order to predict the next token as well as possible the LMM will benefit from being able to simulate every situation, every person, and every causal element behind the creation of every bit of text in its training distribution, no matter what we then train the LMM to output to us (what mask we put on it) afterwards.
Is there any rigorous justification for this claim? As far as I can tell, this is folk wisdom from the scaling/AI safety community, and I think it's far from obvious that it's correct, or what assumptions are required for it to hold.
It seems much more plausible in the infinite limit than in practice.
I have gained confidence in my position that all of this happening now is a good thing, both from the perspective of smaller risks like malware attacks, and from the perspective of potential existential threats. Seems worth going over the logic.
What we want to do is avoid what one might call an agent overhang.
One might hope to execute our Plan A of having our AIs not be agents. Alas, even if technically feasible (which is not at all clear) that only can work if we don’t intentionally turn them into agents via wrapping code around them. We’ve checked with actual humans about the possibility of kindly not doing that. Didn’t go great.
This seems like really bad reasoning...
It seems like the evidence that people won't "kindly not [do] that" is... AutoGPT.
So if AutoGPT didn't exist, you might be able to say: "we asked people to not turn AI systems into agents, and they didn't. Hooray for plan A!"
Also: I don't think it's fair to say "we've checked [...] about the possibility". The AI safety community thought it was sketchy for a long time, but has only provided lackluster pushback. Governance folks from the community don't seem to be calling for a rollback of the plugins, or bans on this kind of behavior, etc.
Christiano and Yudkowsky both agree AI is an x-risk -- a prediction that would distinguish their models does not do much to help us resolve whether or not AI is an x-risk.
I'm not necessarily saying people are subconsciously trying to create a moat.
I'm saying they are acting in a way that creates a moat, and that enables them to avoid competition, and that more competition would create more motivation for them to write things up for academic audiences (or even just write more clearly for non-academic audiences).
Q: "Why is that not enough?"
A: Because they are not being funded to produce the right kinds of outputs.
My point is not specific to machine learning. I'm not as familiar with other academic communities, but I think most of the time it would probably be worth engaging with them if there is somewhere where your work could fit.
In my experience people also often know their blog posts aren't very good.
My point (see footnote) is that motivations are complex. I do not believe "the real motivations" is a very useful concept here.
The question becomes: why don't they judge those costs to be worth it? Is there motivated reasoning involved? Almost certainly yes; there always is.
- A lot of work just isn't made publicly available
- When it is, it's often in the form of ~100 page google docs
- Academics have a number of good reasons to ignore things that don't meet academic standards of rigor and presentation
works for me too now
The link is broken, FYI
Yeah this was super unclear to me; I think it's worth updating the OP.
FYI: my understanding is that "data poisoning" refers to deliberately corrupting the training data of somebody else's model, which I understand is not what you are describing.
Oh I see. I was getting at the "it's not aligned" bit.
Basically, it seems like if I become a cyborg without understanding what I'm doing, the result is either:
- I'm in control
- The machine part is in control
- Something in the middle
Only the first one seems likely to be sufficiently aligned.
I don't understand the fuss about this; I suspect these phenomena are due to uninteresting, and perhaps even well-understood effects. A colleague of mine had this to say:
- After a skim, it looks to me like an instance of hubness: https://www.jmlr.org/papers/volume11/radovanovic10a/radovanovic10a.pdf
- This effect can be a little non-intuitive. There is an old paper in music retrieval where the authors battled to understand why Joni Mitchell's (classic) "Don Juan’s Reckless Daughter" was retrieved confusingly frequently (the same effect) https://d1wqtxts1xzle7.cloudfront.net/33280559/aucouturier-04b-libre.pdf?1395460009=&respon[…]xnyMeZ5rAJ8cenlchug__&Key-Pair-Id=APKAJLOHF5GGSLRBV4ZA
- For those interested, here is a nice theoretical argument on why hubs occur: https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=e85afe59d41907132dd0370c7bd5d11561dce589
- If this is the explanation, it is not unique to these models, or even to large language models. It shows up in many domains.
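For anyone who wants to see hubness directly, here is a minimal sketch (my own toy check, not the analysis from the linked papers): with i.i.d. Gaussian data, count how often each point appears in other points' $k$-nearest-neighbour lists; in high dimensions a few "hub" points show up far more often than the rest.

```python
import numpy as np

def k_occurrence(X, k=10):
    """Count, for each point, how often it appears in other points' k-NN lists."""
    # Squared Euclidean distances via ||x||^2 + ||y||^2 - 2 x.y,
    # so we only materialize an n x n matrix.
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    np.fill_diagonal(d2, np.inf)          # exclude self-neighbours
    knn = np.argsort(d2, axis=1)[:, :k]   # each point's k nearest neighbours
    return np.bincount(knn.ravel(), minlength=len(X))

rng = np.random.default_rng(0)
for dim in (3, 300):
    X = rng.standard_normal((2000, dim))
    counts = k_occurrence(X, k=10)
    # The k-occurrence distribution becomes much more skewed in high dimensions:
    # its maximum grows while the median stays near k.
    print(f"dim={dim}: max k-occurrence={counts.max()}, median={np.median(counts)}")
```

If the embedding geometry in these models shows the same kind of skew, that would be consistent with my colleague's hubness explanation.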
Indeed. I think having a clean, well-understood interface for human/AI interaction seems useful here. I recognize this is a big ask given the current norms and rules around AI development and deployment.
I don't understand what you're getting at RE "personal level".
I think the most fundamental objection to becoming cyborgs is that we don't know how to say whether a person retains control over the cyborg they become a part of.
FWIW, I didn't mean to kick off a historical debate, which seems like probably not a very valuable use of y'all's time.
Unfortunately, I think even "catastrophic risk" has a high potential to be watered down and be applied to situations where dozens as opposed to millions/billions die. Even existential risk has this potential, actually, but I think it's a safer bet.
I don't think we should try and come up with a special term for (1).
The best term might be "AI engineering". The only thing it needs to be distinguished from is "AI science".
I think ML people overwhelmingly identify as doing one of those 2 things, and find it annoying and ridiculous when people in this community act like we are the only ones who care about building systems that work as intended.
I say it is a rebrand of the "AI (x-)safety" community.
When AI alignment came along we were calling it AI safety, even though what everyone in the community really meant all along was AI existential safety. "AI safety" was (IMO) a somewhat successful bid for more mainstream acceptance, which then led to dilution and confusion, necessitating a new term.
I don't think the history is that important; what's important is having good terminology going forward.
This is also why I stress that I work on AI existential safety.
So I think people should just say what kind of technical work they are doing and "existential safety" should be considered as a social-technical problem that motivates a community of researchers, and used to refer to that problem and that community. In particular, I think we are not able to cleanly delineate what is or isn't technical AI existential safety research at this point, and we should welcome intellectual debates about the nature of the problem and how different technical research may or may not contribute to increasing x-safety.
Hmm... this is a good point.
I think structural risk is often a better description of reality, but I can see a rhetorical argument against framing things that way. One problem I see with doing that is that I think it leads people to think the solution is just for AI developers to be more careful, rather than observing that there will be structural incentives (etc.) pushing for less caution.
I do think that there’s a pretty solid dichotomy between (A) “the AGI does things specifically intended by its designers” and (B) “the AGI does things that the designers never wanted it to do”.
1) I don't think this dichotomy is as solid as it seems once you start poking at it... e.g. in your war example, it would be odd to say that the designers of the AGI systems that wiped out humans intended for that outcome to occur. Intentions are perhaps best thought of as incomplete specifications.
2) From our current position, I think “never ever create AGI” is a significantly easier thing to coordinate around than "don't build AGI until/unless we can do it safely". I'm not very worried that we will coordinate too successfully and never build AGI and thus squander the cosmic endowment. This is both because I think that's quite unlikely, and because I'm not sure we'll make very good / the best use of it anyways (e.g. think S-risk, other civilizations).
3) I think the conventional framing of AI alignment is something between vague and substantively incorrect, as well as being misleading. Here is a post I dashed off about that:
https://www.lesswrong.com/posts/biP5XBmqvjopvky7P/a-note-on-terminology-ai-alignment-ai-x-safety. I think creating such a manual is an incredibly ambitious goal, and I think more people in this community should aim for more moderate goals. I mostly agree with the perspective in this post: https://coordination.substack.com/p/alignment-is-not-enough, but I could say more on the matter.
4) RE connotations of accident: I think they are often strong.
While defining accident as “incident that was not specifically intended & desired by the people who pressed ‘run’ on the AGI code” is extremely broad, it still supposes that there is such a thing as "the AGI code", which significantly restricts the space of possible risks.
There are other reasons I would not be happy with that browser extension. There is not one specific conversation I can point to; it comes up regularly. I think this replacement would probably lead to a lot of confusion, since I think when people use the word "accident" they often proceed as if it meant something stricter, e.g. that the result was unforeseen or unforeseeable.
If (as in "Concrete Problems", IMO) the point is just to point out that AI can get out-of-control, or that misuse is not the only risk, that's a worthwhile thing to point out, but it doesn't lead to a very useful framework for understanding the nature of the risk(s). As I mentioned elsewhere, it is specifically the dichotomy of "accident vs. misuse" that I think is the most problematic and misleading.
I think the chart is misleading for the following reasons, among others:
- It seems to suppose that there is such a manual, or the goal of creating one. However, if we coordinate effectively, we can simply forgo development and deployment of dangerous technologies ~indefinitely.
- It inappropriately separates "coordination problems" and "everyone follows the manual"
I think the construction gives us $C(\pi) \leq C(U) + e$ for a small constant $e$ (representing the wrapper). It seems like any compression you can apply to the reward function can be translated to the policy via the wrapper. So then you would never have $C(\pi) \gg C(U)$. What am I missing/misunderstanding?
By "intend" do you mean that they sought that outcome / selected for it?
Or merely that it was a known or predictable outcome of their behavior?
I think "unintentional" would already probably be a better term in most cases.
Apologies, I didn't take the time to understand all of this yet, but I have a basic question you might have an answer to...
We know how to map (deterministic) policies to reward functions using the construction at the bottom of page 6 of the reward modelling agenda (https://arxiv.org/abs/1811.07871v1): the agent is rewarded only if it has so far done exactly what the policy would do. I think of this as a wrapper function (https://en.wikipedia.org/wiki/Wrapper_function).
It seems like this means that, for any policy, we can represent it as optimizing reward with only the minimal overhead in description/computational complexity of the wrapper.
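To make sure we're talking about the same construction, here is a minimal sketch of the wrapper as I understand it (the code and names are mine, just for illustration): the reward function carries a copy of the policy and pays out only while the action history matches what that policy would have done.

```python
def wrapper_reward(policy):
    """Reward function induced by a deterministic policy: pay out only while the
    agent's actions so far exactly match what `policy` would have chosen."""
    def reward(history):
        # `history` is a sequence of (observation, action) pairs seen so far.
        past = []
        for obs, action in history:
            if action != policy(past, obs):  # any deviation forfeits the reward
                return 0.0
            past.append((obs, action))
        return 1.0
    return reward
```

If I'm reading this right, the description of this reward function is (roughly) the description of the policy plus a constant-size wrapper, so the two description lengths can only differ by about the size of that wrapper.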
So...
- Do you think this analysis is correct? Or what is it missing? (maybe the assumption that the policy is deterministic is significant? This turns out to be the case for Orseau et al.'s "Agents and Devices" approach, I think https://arxiv.org/abs/1805.12387).
- Are you trying to get around this somehow? Or are you fine with this minimal overhead being used to distinguish goal-directed from non-goal directed policies?
"Concrete Problems in AI Safety" used this distinction to make this point, and I think it was likely a useful simplification in that context. I generally think spelling it out is better, and I think people will pattern match your concerns onto the “the sci-fi scenario where AI spontaneously becomes conscious, goes rogue, and pursues its own goal” or "boring old robustness problems" if you don't invoke structural risk. I think structural risk plays a crucial role in the arguments, and even if you think things that look more like pure accidents are more likely, I think the structural risk story is more plausible to more people and a sufficient cause for concern.
RE (A): A known side-effect is not an accident.
I agree somewhat, however, I think we need to be careful to distinguish "do unsavory things" from "cause human extinction", and should generally be squarely focused on the latter. The former easily becomes too political, making coordination harder.
Yes it may be useful in some very limited contexts. I can't recall a time I have seen it in writing and felt like it was not a counter-productive framing.
AI is highly non-analogous with guns.
I think inadequate equilibrium is too specific and insider jargon-y.
I really don't think the distinction is meaningful or useful in almost any situation. I think if people want to make something like this distinction they should just be more clear about exactly what they are talking about.
This is a great post. Thanks for writing it! I think Figure 1 is quite compelling and thought-provoking.
I began writing a response, and then realized a lot of what I wanted to say has already been said by others, so I just noted where that was the case. I'll focus on points of disagreement.
Summary: I think the basic argument of the post is well summarized in Figure 1, and by Vanessa Kosoy’s comment.
A high-level counter-argument I didn't see others making:
- I wasn't entirely sure what your argument was that long-term planning ability saturates... I've seen this argued based both on complexity and on chaos, and I think here it's a bit of a mix of the two.
- Counter-argument to chaos-argument: It seems we can make meaningful predictions of many relevant things far into the future (e.g. that the sun's remaining natural life-span is 7-8 billion years).
- Counter-argument to complexity-argument: Increases in predictive ability can have highly non-linear returns, both in terms of planning depth and planning accuracy.
- Depth: You often only need to be "one step ahead" of your adversary in order to defeat them and win the whole "prize" (e.g. market or geopolitical dominance). For instance, if I can predict the weather one day further ahead, this could have a major impact on military strategy.
- Accuracy: If you can make more accurate predictions about, e.g. how prices of assets will change, you can make a killing in finance.
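As a toy illustration of the accuracy point (my numbers, purely illustrative): suppose you repeatedly bet a fixed fraction $f$ of your bankroll at even odds and your predictions are right with probability $p$. The expected log-growth per bet, and the optimal (Kelly) fraction, are:

```latex
g(f) = p\,\log(1+f) + (1-p)\,\log(1-f), \qquad f^{*} = 2p - 1
```

With $p = 0.51$ this gives $f^{*} = 0.02$ and $g(f^{*}) \approx 2 \times 10^{-4}$ per bet -- apparently negligible, yet over 10,000 bets it multiplies the bankroll by roughly $e^{2} \approx 7$, while a $p = 0.5$ predictor gains nothing.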
High-level counter-arguments I would've made that Vanessa already made:
- This argument proves too much: it suggests that there are not major differences in ability to do long-term planning that matter.
- Humans have not reached the limits of predictive ability
Low-level counter-arguments:
- RE Claim 1: Why would AI only have an advantage in IQ as opposed to other forms of intelligence / cognitive skill? No argument is provided.
- (Argued by Jonathan Uesato) RE Claim 3: Scaling laws provide ~zero evidence that we are at the limit of “what can be achieved with a certain level of resources”.
This is a great post. Thanks for writing it!
I agree with a lot of the counter-arguments others have mentioned.
Summary:
- I think the basic argument of the post is well summarized in Figure 1, and by Vanessa Kosoy’s comment.
- High-level counter-arguments already argued by Vanessa:
- This argument proves too much: it suggests that there are not major differences in ability to do long-term planning that matter.
- Humans have not reached the limits of predictive ability
- You often only need to be one step ahead of your adversary to defeat them.
- Prediction accuracy is not the relevant metric: an incremental increase in depth-of-planning could be decisive in conflicts (e.g. if I can predict the weather one day further ahead, this could have a major impact on military strategy).
- More generally, the ability to make large / highly leveraged bets on future outcomes means that slight advantages in prediction ability could be decisive.
- Low-level counter-arguments:
- RE Claim 1: Why would AI only have an advantage in IQ as opposed to other forms of intelligence / cognitive skill? No argument is provided.
- (Argued by Jonathan Uesato) RE Claim 3: Scaling laws provide ~zero evidence that we are at the limit of “what can be achieved with a certain level of resources”.
- RE Claim 5: Systems trained with short-term objectives can learn to do long-term planning competently.
This post tacitly endorses the "accident vs. misuse" dichotomy.
Every time this appears, I feel compelled to mention that I think it is a terrible framing.
I believe the large majority of AI x-risk is best understood as "structural" in nature: https://forum.effectivealtruism.org/posts/oqveRcMwRMDk6SYXM/clarifications-about-structural-risk-from-ai
I understand your point of view and think it is reasonable.
However, I don't think "don't build bigger models" and "don't train models to do complicated things" need to be at odds with each other. I see the argument you are making, but I think success on these asks is likely highly correlated via the underlying causal factor of humanity being concerned enough about AI x-risk, and coordinated enough, to ensure responsible AI development.
I also think the training procedure matters a lot (and you seem to be suggesting otherwise?), since if you don't do RL or other training schemes that seem designed to induce agentyness and you don't do tasks that use an agentic supervision signal, then you probably don't get agents for a long time (if ever).
(A very quick response):
Agree with (1) and (2).
I am ambivalent RE (3) and the replaceability arguments.
RE (4): I largely agree, but I think the norm should be "let's try to do less ambitious stuff properly" rather than "let's try to do the most ambitious stuff we can, and then try and figure out how to do it as safely as possible as a secondary objective".
In the current climate, I think playing up neglectedness and "directly working on x-risks" is somewhat likely to be counterproductive, especially if not done carefully, for a few reasons:
1) It fosters an "us-vs-them" mindset.
2) It fails to acknowledge that these researchers don't know what the most effective ways are to reduce x-risk, and there is not much consensus (and that which does exist is likely partially due to insular community epistemics).
3) It discounts the many researchers doing work that is technically indistinguishable from the work of researchers "directly working on x-risks".
4) Concern about x-risk (or more generally, the impact of powerful AI) from AI researchers is increasing organically, and we want to welcome this concern, rather than (accidentally/implicitly/etc.) telling people they don't count.
I think we should be working to develop clearer ideas about which kinds of work are differentially useful for x-safety, seeking to build a broader (outside this community) consensus about that, and trying to incentivize more explicit focus on x-safety.
You don't need to be advocating a specific course of action. There are smart people who could be doing things to reduce AI x-risk and aren't (yet) because they haven't heard (enough) about the problem.
it will be better to build such systems from the ground up, rather than taking existing unsafe systems and tweaking them to be safe
I would say "it may be better, and people should seriously consider this" not "it is better".
I am co-supervising Ethan's PhD, and we previously wrote another ML paper together: https://arxiv.org/abs/2003.00688
Ethan has an unusual communication style, but he's not a crackpot, and this work is legit (according to me, the anchor author). I haven't listened to the interview.