Comments
OK, so it's not really just your results? You are aggregating across these studies (and presumably ones of "Westerners" as well)? I do wonder how directly comparable things are... Did you make an effort to translate a study (or questions from studies), or are the questions just independently conceived and formulated?
No, I was only responding to the first part.
Not necessarily fooling it, just keeping it ignorant. I think such schemes can plausibly scale to very high levels of capabilities, perhaps indefinitely, since intelligence doesn't give one the ability to create information from thin air...
This is a super interesting and important problem, IMO. I believe it already has significant real-world practical consequences, e.g. powerful people find it difficult to avoid being surrounded by sycophants: even if they really don't want to be, that's just an extra constraint for the sycophants to satisfy ("don't come across as sycophantic")! I am inclined to agree that avoiding power differentials is the only way to really avoid these perverse outcomes in practice, and I think this is a good argument in favor of doing so.
--------------------------------------
This is also quite related to an (old, unpublished) work I did with Jonathan Binas on "bounded empowerment". I've invited you to the Overleaf (it needs a clean-up, but I've also asked Jonathan about putting it on arXiv).
To summarize: Let's consider this in the case of a superhuman AI, R, and a human H. The basic idea of that work is that R should try and "empower" H, and that (unlike in previous works on empowerment), there are two ways of doing this:
1) change the state of the world (as in previous works)
2) inform H so they know how to make use of the options available to them to achieve various ends (novel!)
If R has a perfect model of H and the world, then you can just compute how to effectively do these things (it's wildly intractable, ofc). I think this would still often look "patronizing" in practice, and/or maybe just lead to totally wild behaviors (hard to predict this sort of stuff...), but it might be a useful conceptual "lead".
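For reference, the standard notion of empowerment behind (1) is a channel capacity between H's actions and H's future state, roughly
$\mathcal{E}(s_t) = \max_{p(a_{t:t+n})} I\big(A_{t:t+n};\, S_{t+n} \mid s_t\big),$
i.e. how much influence H's next $n$ actions have over the state $n$ steps later. (Notation here is mine, just to gesture at the idea; the draft is more careful.) On this reading, (1) is R acting on the world so that this quantity is larger for H, while (2) is R closing the gap between this "objective" empowerment and what H can actually achieve given their possibly incomplete knowledge of the options.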
Random thought OTMH: Something which might make it less "patronizing" is if H were to have well-defined "meta-preferences" about how such interactions should work that R could aim to respect.
What makes you say this: "However, our results suggest that students are broadly less concerned about the risks of AI than people in the United States and Europe"?
This activation function was introduced in one of my papers from 10 years ago ;)
See Figure 2 of https://arxiv.org/abs/1402.3337
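A minimal sketch, assuming the activation in question is the truncated rectifier ("TRec") shown in that figure (check the paper for the exact variants and the threshold used):

```python
import numpy as np

def trec(x, theta=1.0):
    """Truncated rectifier: pass x through unchanged when it exceeds the
    threshold theta, output zero otherwise.
    (My reading of Fig. 2 of arXiv:1402.3337 -- treat this as a sketch.)"""
    x = np.asarray(x, dtype=float)
    return np.where(x > theta, x, 0.0)

print(trec([-1.0, 0.5, 2.0]))   # -> [0. 0. 2.]
```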
Really interesting point!
I introduced this term in my slides that included "paperweight" as an example of an "AI system" that maximizes safety.
I sort of still think it's an OK term, but I'm sure I will keep thinking about this going forward and hope we can arrive at an even better term.
You could try to do tests on data that is far enough from the training distribution that it won't generalize in a simple imitative way there, and you could do tests to try and confirm that you are far enough off-distribution. For instance, perhaps using a carefully chosen invented language would work.
I don't disagree... in this case you don't get agents for a long time; someone else does though.
I meant "other training schemes" to encompass things like scaffolding that deliberately engineers agents using LLMs as components, although I acknowledge they are not literally "training" and more like "engineering".
I would look at the main FATE conferences as well, which I view as being: FAccT, AIES, EAAMO.
I found this thought provoking, but I didn't find the arguments very strong.
(a) Misdirected Regulations Reduce Effective Safety Effort; Regulations Will Almost Certainly Be Misdirected
(b) Regulations Generally Favor The Legible-To-The-State
(c) Heavy Regulations Can Simply Disempower the Regulator
(d) Regulations Are Likely To Maximize The Power of Companies Pushing Forward Capabilities the Most
Briefly responding:
a) The issue in this story seems to be that the company doesn't care about x-safety, not that they are legally obligated to care about face-blindness.
b) If governments don't have bandwidth to effectively vet small AI projects, it seems prudent to err on the side of forbidding projects that might pose x-risk.
c) I do think we need effective international cooperation around regulation. But even buying 1-4 years time seems good in expectation.
d) I don't see the x-risk aspect of this story.
This means that the model can and will implicitly sacrifice next-token prediction accuracy for long horizon prediction accuracy.
Are you claiming this would happen even given infinite capacity?
If so, can you perhaps provide a simple+intuitive+concrete example?
What do you mean by "random linear probe"?
I skimmed this. A few quick comments:
- I think you characterized deceptive alignment pretty well.
- I think it only covers a narrow part of how deceptive behavior can arise.
- CICERO likely already did some of what you describe.
So let us specify a probability distribution over the space of all possible desires. If we accept the orthogonality thesis, we should not want this probability distribution to build in any bias towards certain kinds of desires over others. So let's spread our probabilities in such a way that we meet the following three conditions. Firstly, we don't expect Sia's desires to be better satisfied in any one world than they are in any other world. Formally, our expectation of the degree to which Sia's desires are satisfied at $w$ is equal to our expectation of the degree to which Sia's desires are satisfied at $w'$, for any worlds $w$ and $w'$. Call that common expected value '$\mu$'. Secondly, our probabilities are symmetric around $\mu$. That is, our probability that a world satisfies Sia's desires to at least degree $\mu + x$ is equal to our probability that it satisfies her desires to at most degree $\mu - x$. And thirdly, learning how well satisfied Sia's desires are at some worlds won't tell us how well satisfied her desires are at other worlds. That is, the degree to which her desires are satisfied at some worlds is independent of how well satisfied they are at any other worlds. (See the appendix for a more careful formulation of these assumptions.) If our probability distribution satisfies these constraints, then I'll say that Sia's desires are 'sampled randomly' from the space of all possible desires.
This is a characterization, and it remains to show that there exist distributions that fit it (I suspect there are none, assuming the sets of possible desires and worlds are unbounded).
I also find the 3rd criterion counterintuitive. If worlds share features, I would expect these not to be independent.
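For concreteness (notation mine, not the post's): writing $V(w)$ for the degree to which Sia's desires are satisfied at world $w$, the three conditions say, roughly: (1) $\mathbb{E}[V(w)] = \mu$ for every world $w$; (2) $P(V(w) \geq \mu + x) = P(V(w) \leq \mu - x)$ for every $w$ and every $x$; (3) the variables $\{V(w)\}_w$ are mutually independent. The existence question is then whether a single distribution over desires can satisfy all three at once when the sets of possible desires and worlds are unbounded.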
I think it might be more effective in future debates at the outset to:
* Explain that it's only necessary to cross a low bar (e.g. see my Tweet below). -- This is a common practice in debates.
* Outline the responses they expect to hear from the other side, and explain why they are bogus. Framing: "Whether AI is an x-risk has been debated in the ML community for 10 years, and nobody has provided any compelling counterarguments that refute the 3 claims (of the Tweet). You will hear a bunch of counterarguments from the other side, but when you do, ask yourself whether they are really addressing this. Here are a few counterarguments and why they fail..." -- I think this could really take the wind out of the sails of the opposition, and put them on the back foot.
I also don't think Lecun and Meta should be given so much credit -- Is Facebook really going to develop and deploy AI responsibly?
1) They have been widely condemned for knowingly playing a significant role in the Rohingya genocide, have acknowledged that they failed to act to prevent Facebook's role in it, and are being sued for $150bn for this.
2) They have also been criticised for the role that their products, especially Instagram, play in contributing to mental health issues, especially around body image in teenage girls.
More generally, I think the "companies do irresponsible stuff all the time" point needs to be stressed more. And one particular argument that is bogus is the "we'll make it safe" -- x-safety is a common good, and so companies should be expected to undersupply it. This is econ 101.
Organizations that are looking for ML talent (e.g. to mentor more junior people, or get feedback on policy) should offer PhD students high-paying contractor/part-time work.
ML PhD students working on safety-relevant projects should be able to augment their meager stipends this way.
That is in addition to all the people who will give their AutoGPT an instruction that means well but actually translates to killing all the humans or at least take control over the future, since that is so obviously the easiest way to accomplish the thing, such as ‘bring about world peace and end world hunger’ (link goes to Sully hyping AutoGPT, saying ‘you give it a goal like end world hunger’) or ‘stop climate change’ or ‘deliver my coffee every morning at 8am sharp no matter what as reliably as possible.’ Or literally almost anything else.
I think these mostly only translate into dangerous behavior if the model badly "misunderstands" the instruction, which seems somewhat implausible.
One must notice that in order to predict the next token as well as possible the LLM will benefit from being able to simulate every situation, every person, and every causal element behind the creation of every bit of text in its training distribution, no matter what we then train the LLM to output to us (what mask we put on it) afterwards.
Is there any rigorous justification for this claim? As far as I can tell, this is folk wisdom from the scaling/AI safety community, and I think it's far from obvious that it's correct, or what assumptions are required for it to hold.
It seems much more plausible in the infinite limit than in practice.
I have gained confidence in my position that all of this happening now is a good thing, both from the perspective of smaller risks like malware attacks, and from the perspective of potential existential threats. Seems worth going over the logic.
What we want to do is avoid what one might call an agent overhang.
One might hope to execute our Plan A of having our AIs not be agents. Alas, even if technically feasible (which is not at all clear) that only can work if we don’t intentionally turn them into agents via wrapping code around them. We’ve checked with actual humans about the possibility of kindly not doing that. Didn’t go great.
This seems like really bad reasoning...
It seems like the evidence that people won't "kindly not [do] that" is... AutoGPT.
So if AutoGPT didn't exist, you might be able to say: "we asked people to not turn AI systems into agents, and they didn't. Hooray for plan A!"
Also: I don't think it's fair to say "we've checked [...] about the possibility". The AI safety community thought it was sketch for a long time, and has provided some lackluster pushback. Governance folks from the community don't seem to be calling for a rollback of the plugins, or bans on this kind of behavior, etc.
Christiano and Yudkowsky both agree AI is an x-risk -- a prediction that would distinguish their models does not do much to help us resolve whether or not AI is an x-risk.
I'm not necessarily saying people are subconsciously trying to create a moat.
I'm saying they are acting in a way that creates a moat, and that enables them to avoid competition, and that more competition would create more motivation for them to write things up for academic audiences (or even just write more clearly for non-academic audiences).
Q: "Why is that not enough?"
A: Because they are not being funded to produce the right kinds of outputs.
My point is not specific to machine learning. I'm not as familiar with other academic communities, but I think most of the time it would probably be worth engaging with them if there is somewhere where your work could fit.
In my experience people also often know their blog posts aren't very good.
My point (see footnote) is that motivations are complex. I do not believe "the real motivations" is a very useful concept here.
The question becomes why "don't they judge those costs to be worth it"? Is there motivated reasoning involved? Almost certainly yes; there always is.
- A lot of work just isn't made publicly available
- When it is, it's often in the form of ~100 page google docs
- Academics have a number of good reasons to ignore things that don't meet academic standards of rigor and presentation
works for me too now
The link is broken, FYI
Yeah this was super unclear to me; I think it's worth updating the OP.
FYI: my understanding is that "data poisoning" refers to deliberately corrupting the training data of somebody else's model, which I understand is not what you are describing.
Oh I see. I was getting at the "it's not aligned" bit.
Basically, it seems like if I become a cyborg without understanding what I'm doing, the result is either:
- I'm in control
- The machine part is in control
- Something in the middle
Only the first one seems likely to be sufficiently aligned.
I don't understand the fuss about this; I suspect these phenomena are due to uninteresting, and perhaps even well-understood effects. A colleague of mine had this to say:
- After a skim, it looks to me like an instance of hubness: https://www.jmlr.org/papers/volume11/radovanovic10a/radovanovic10a.pdf
- This effect can be a little non-intuitive. There is an old paper in music retrieval where the authors battled to understand why Joni Mitchell's (classic) "Don Juan’s Reckless Daughter" was retrieved confusingly frequently (the same effect) https://d1wqtxts1xzle7.cloudfront.net/33280559/aucouturier-04b-libre.pdf?1395460009=&respon[…]xnyMeZ5rAJ8cenlchug__&Key-Pair-Id=APKAJLOHF5GGSLRBV4ZA
- For those interested, here is a nice theoretical argument on why hubs occur: https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=e85afe59d41907132dd0370c7bd5d11561dce589
- If this is the explanation, it is not unique to these models, or even to large language models. It shows up in many domains.
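In case it's useful, here is a quick, self-contained sketch of the basic effect along the lines of the classic demonstration in the Radovanović et al. paper (i.i.d. Gaussian data, Euclidean distance, nothing specific to language models): in high dimensions the $N_k$ counts become heavily skewed, with a few "hub" points appearing in a disproportionate number of nearest-neighbour lists.

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim, k = 2000, 500, 10            # points, dimensionality, neighbours

X = rng.standard_normal((n, dim))    # i.i.d. Gaussian data, nothing special about it

# squared Euclidean distances via the Gram trick (only the ordering matters)
sq = (X ** 2).sum(axis=1)
D = sq[:, None] + sq[None, :] - 2 * X @ X.T
np.fill_diagonal(D, np.inf)

# N_k(x): how often each point appears among the k nearest neighbours of the others
knn = np.argsort(D, axis=1)[:, :k]
counts = np.bincount(knn.ravel(), minlength=n)

print("mean N_k (always k):", counts.mean())
print("max  N_k (hubs)    :", counts.max())
print("points with N_k = 0:", int((counts == 0).sum()))
```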
Indeed. I think having a clean, well-understood interface for human/AI interaction seems useful here. I recognize this is a big ask given the current norms and rules around AI development and deployment.
I don't understand what you're getting at RE "personal level".
I think the most fundamental objection to becoming cyborgs is that we don't know how to say whether a person retains control over the cyborg they become a part of.
FWIW, I didn't mean to kick off a historical debate, which seems like probably not a very valuable use of y'all's time.
Unfortunately, I think even "catastrophic risk" has a high potential to be watered down and be applied to situations where dozens as opposed to millions/billions die. Even existential risk has this potential, actually, but I think it's a safer bet.
I don't think we should try and come up with a special term for (1).
The best term might be "AI engineering". The only thing it needs to be distinguished from is "AI science".
I think ML people overwhelmingly identify as doing one of those 2 things, and find it annoying and ridiculous when people in this community act like we are the only ones who care about building systems that work as intended.
I say it is a rebrand of the "AI (x-)safety" community.
When AI alignment came along we were calling it AI safety, even though it was really basically AI existential safety all along that everyone in the community meant. "AI safety" was (IMO) a somewhat successful bid for more mainstream acceptance, that then led to dilution and confusion, necessitating a new term.
I don't think the history is that important; what's important is having good terminology going forward.
This is also why I stress that I work on AI existential safety.
So I think people should just say what kind of technical work they are doing and "existential safety" should be considered as a social-technical problem that motivates a community of researchers, and used to refer to that problem and that community. In particular, I think we are not able to cleanly delineate what is or isn't technical AI existential safety research at this point, and we should welcome intellectual debates about the nature of the problem and how different technical research may or may not contribute to increasing x-safety.
Hmm... this is a good point.
I think structural risk is often a better description of reality, but I can see a rhetorical argument against framing things that way. One problem I see with doing that is that I think it leads people to think the solution is just for AI developers to be more careful, rather than observing that there will be structural incentives (etc.) pushing for less caution.
I do think that there’s a pretty solid dichotomy between (A) “the AGI does things specifically intended by its designers” and (B) “the AGI does things that the designers never wanted it to do”.
1) I don't think this dichotomy is as solid as it seems once you start poking at it... e.g. in your war example, it would be odd to say that the designers of the AGI systems that wiped out humans intended for that outcome to occur. Intentions are perhaps best thought of as incomplete specifications.
2) From our current position, I think “never ever create AGI” is a significantly easier thing to coordinate around than "don't build AGI until/unless we can do it safely". I'm not very worried that we will coordinate too successfully and never build AGI and thus squander the cosmic endowment. This is both because I think that's quite unlikely, and because I'm not sure we'll make very good / the best use of it anyways (e.g. think S-risk, other civilizations).
3) I think the conventional framing of AI alignment is something between vague and substantively incorrect, as well as being misleading. Here is a post I dashed off about that:
https://www.lesswrong.com/posts/biP5XBmqvjopvky7P/a-note-on-terminology-ai-alignment-ai-x-safety. I think creating such a manual is an incredibly ambitious goal, and I think more people in this community should aim for more moderate goals. I mostly agree with the perspective in this post: https://coordination.substack.com/p/alignment-is-not-enough, but I could say more on the matter.
4) RE connotations of accident: I think they are often strong.
While defining accident as “incident that was not specifically intended & desired by the people who pressed ‘run’ on the AGI code” is extremely broad, it still supposes that there is such a thing as "the AGI code", which significantly restricts the space of possible risks.
There are other reasons I would not be happy with that browser extension. There is not one specific conversation I can point to; it comes up regularly. I think this replacement would probably lead to a lot of confusion, since I think when people use the word "accident" they often proceed as if it meant something stricter, e.g. that the result was unforeseen or unforeseeable.
If (as in "Concrete Problems", IMO) the point is just to point out that AI can get out-of-control, or that misuse is not the only risk, that's a worthwhile thing to point out, but it doesn't lead to a very useful framework for understanding the nature of the risk(s). As I mentioned elsewhere, it is specifically the dichotomy of "accident vs. misuse" that I think is the most problematic and misleading.
I think the chart is misleading for the following reasons, among others:
- It seems to suppose that there is such a manual, or the goal of creating one. However, if we coordinate effectively, we can simply forgo development and deployment of dangerous technologies ~indefinitely.
- It inappropriately separates "coordination problems" and "everyone follows the manual"
I think the construction gives us $C(\pi) \leq C(U) + e$ for a small constant $e$ (representing the wrapper). It seems like any compression you can apply to the reward function can be translated to the policy via the wrapper. So then you would never have $C(\pi) \gg C(U)$. What am I missing/misunderstanding?
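To spell out the step I'm relying on: given (a program for) the wrapper reward $U$, you can recover $\pi$ by outputting, at each step, the action that $U$ rewards given the history so far; that decoder is a constant-size piece of code, which is the $e$ above. So $C(\pi) \leq C(U) + e$, and since the wrapper itself only adds constant overhead in the other direction, the two complexities can never differ by more than a constant.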
By "intend" do you mean that they sought that outcome / selected for it?
Or merely that it was a known or predictable outcome of their behavior?
I think "unintentional" would already probably be a better term in most cases.
Apologies, I didn't take the time to understand all of this yet, but I have a basic question you might have an answer to...
We know how to map (deterministic) policies to reward functions using the construction at the bottom of page 6 of the reward modelling agenda (https://arxiv.org/abs/1811.07871v1): the agent is rewarded only if it has so far done exactly what the policy would do. I think of this as a wrapper function (https://en.wikipedia.org/wiki/Wrapper_function).
It seems like this means that, for any policy, we can represent it as optimizing reward with only the minimal overhead in description/computational complexity of the wrapper.
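To make sure we're talking about the same construction, here is a rough sketch of how I'm thinking of the wrapper (the names and the Markov-policy simplification are mine, not the paper's):

```python
def wrapper_reward(policy):
    """Turn a deterministic policy into a reward function.

    The agent gets reward 1 at a step iff every action it has taken so far,
    including the current one, is exactly what `policy` would have chosen;
    otherwise it gets 0. A reward-maximizing agent therefore just reproduces
    `policy`. (Simplification: `policy` maps observations to actions.)
    """
    def reward(history, obs, action):
        # history: list of past (observation, action) pairs
        followed_so_far = all(policy(o) == a for o, a in history)
        return 1.0 if followed_so_far and policy(obs) == action else 0.0
    return reward

# toy usage: the policy that always picks action 0
r = wrapper_reward(lambda obs: 0)
print(r([], "s0", 0))            # 1.0 -- matches the policy so far
print(r([("s0", 0)], "s1", 1))   # 0.0 -- deviates at the current step
```

The only overhead relative to `policy` itself is the constant-size `reward` closure, which is the "minimal overhead" I have in mind.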
So...
- Do you think this analysis is correct? Or what is it missing? (maybe the assumption that the policy is deterministic is significant? This turns out to be the case for Orseau et al.'s "Agents and Devices" approach, I think https://arxiv.org/abs/1805.12387).
- Are you trying to get around this somehow? Or are you fine with this minimal overhead being used to distinguish goal-directed from non-goal directed policies?
"Concrete Problems in AI Safety" used this distinction to make this point, and I think it was likely a useful simplification in that context. I generally think spelling it out is better, and I think people will pattern match your concerns onto the “the sci-fi scenario where AI spontaneously becomes conscious, goes rogue, and pursues its own goal” or "boring old robustness problems" if you don't invoke structural risk. I think structural risk plays a crucial role in the arguments, and even if you think things that look more like pure accidents are more likely, I think the structural risk story is more plausible to more people and a sufficient cause for concern.
RE (A): A known side-effect is not an accident.
I agree somewhat, however, I think we need to be careful to distinguish "do unsavory things" from "cause human extinction", and should generally be squarely focused on the latter. The former easily becomes too political, making coordination harder.
Yes it may be useful in some very limited contexts. I can't recall a time I have seen it in writing and felt like it was not a counter-productive framing.
AI is highly non-analogous with guns.