Posts

A Sober Look at Steering Vectors for LLMs 2024-11-23T17:30:00.745Z
Is there any rigorous work on using anthropic uncertainty to prevent situational awareness / deception? 2024-09-04T12:40:07.678Z
An ML paper on data stealing provides a construction for "gradient hacking" 2024-07-30T21:44:37.310Z
[Link Post] "Foundational Challenges in Assuring Alignment and Safety of Large Language Models" 2024-06-06T18:55:09.151Z
Testing for consequence-blindness in LLMs using the HI-ADS unit test. 2023-11-24T23:35:29.560Z
"Publish or Perish" (a quick note on why you should try to make your work legible to existing academic communities) 2023-03-18T19:01:54.199Z
What organizations other than Conjecture have (esp. public) info-hazard policies? 2023-03-16T14:49:12.411Z
A (EtA: quick) note on terminology: AI Alignment != AI x-safety 2023-02-08T22:33:52.713Z
Why I hate the "accident vs. misuse" AI x-risk dichotomy (quick thoughts on "structural risk") 2023-01-30T18:50:17.613Z
Quick thoughts on "scalable oversight" / "super-human feedback" research 2023-01-25T12:55:31.334Z
Mechanistic Interpretability as Reverse Engineering (follow-up to "cars and elephants") 2022-11-03T23:19:20.458Z
"Cars and Elephants": a handwavy argument/analogy against mechanistic interpretability 2022-10-31T21:26:05.388Z
I'm planning to start creating more write-ups summarizing my thoughts on various issues, mostly related to AI existential safety. What do you want to hear my nuanced takes on? 2022-09-24T12:38:24.163Z
[An email with a bunch of links I sent an experienced ML researcher interested in learning about Alignment / x-safety.] 2022-09-08T22:28:54.534Z
An Update on Academia vs. Industry (one year into my faculty job) 2022-09-03T20:43:37.701Z
Causal confusion as an argument against the scaling hypothesis 2022-06-20T10:54:05.623Z
Do FDT (or similar) recommend reparations? 2022-04-29T17:34:48.479Z
What's a good probability distribution family (e.g. "log-normal") to use for AGI timelines? 2022-04-13T04:45:04.649Z
Is "gears-level" just a synonym for "mechanistic"? 2021-12-13T04:11:45.159Z
Is there a name for the theory that "There will be fast takeoff in real-world capabilities because almost everything is AGI-complete"? 2021-09-02T23:00:42.785Z
What do we know about how much protection COVID vaccines provide against transmitting the virus to others? 2021-05-06T07:39:48.366Z
What do we know about how much protection COVID vaccines provide against long COVID? 2021-05-06T07:39:16.873Z
What do the reported levels of protection offered by various vaccines mean? 2021-05-04T22:06:23.758Z
Did they use serological testing for COVID vaccine trials? 2021-05-04T21:48:30.507Z
When's the best time to get the 2nd dose of Pfizer Vaccine? 2021-04-30T05:11:27.936Z
Are there any good ways to place a bet on RadicalXChange and/or related ideas/mechanisms taking off in a big way? e.g. is there something to invest $$$ in? 2021-04-17T06:58:42.414Z
What does vaccine effectiveness as a function of time look like? 2021-04-17T00:36:20.366Z
How many micromorts do you get per UV-index-hour? 2021-03-30T17:23:26.566Z
AI x-risk reduction: why I chose academia over industry 2021-03-14T17:25:12.503Z
"Beliefs" vs. "Notions" 2021-03-12T16:04:31.194Z
Any work on honeypots (to detect treacherous turn attempts)? 2020-11-12T05:41:56.371Z
When was the term "AI alignment" coined? 2020-10-21T18:27:56.162Z
Has anyone researched specification gaming with biological animals? 2020-10-21T00:20:01.610Z
Is there any work on incorporating aleatoric uncertainty and/or inherent randomness into AIXI? 2020-10-04T08:10:56.400Z
capybaralet's Shortform 2020-08-27T21:38:18.144Z
A reductio ad absurdum for naive Functional/Computational Theory-of-Mind (FCToM). 2020-01-02T17:16:35.566Z
A list of good heuristics that the case for AI x-risk fails 2019-12-02T19:26:28.870Z
What I talk about when I talk about AI x-risk: 3 core claims I want machine learning researchers to address. 2019-12-02T18:20:47.530Z
A fun calibration game: "0-hit Google phrases" 2019-11-21T01:13:10.667Z
Can indifference methods redeem person-affecting views? 2019-11-12T04:23:10.011Z
What are the reasons to *not* consider reducing AI-Xrisk the highest priority cause? 2019-08-20T21:45:12.118Z
Project Proposal: Considerations for trading off capabilities and safety impacts of AI research 2019-08-06T22:22:20.928Z
False assumptions and leaky abstractions in machine learning and AI safety 2019-06-28T04:54:47.119Z
Let's talk about "Convergent Rationality" 2019-06-12T21:53:35.356Z
X-risks are a tragedies of the commons 2019-02-07T02:48:25.825Z
My use of the phrase "Super-Human Feedback" 2019-02-06T19:11:11.734Z
Thoughts on Ben Garfinkel's "How sure are we about this AI stuff?" 2019-02-06T19:09:20.809Z
The role of epistemic vs. aleatory uncertainty in quantifying AI-Xrisk 2019-01-31T06:13:35.321Z
Imitation learning considered unsafe? 2019-01-06T15:48:36.078Z
Conceptual Analysis for AI Alignment 2018-12-30T00:46:38.014Z

Comments

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on Survey: How Do Elite Chinese Students Feel About the Risks of AI? · 2024-09-14T22:59:52.009Z · LW · GW

OK, so it's not really just your results?  You are aggregating across these studies (and presumably ones surveying "Westerners" as well)?  I do wonder how directly comparable things are... Did you make an effort to translate an existing study or its questions, or were the questions independently conceived and formulated?

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on Is there any rigorous work on using anthropic uncertainty to prevent situational awareness / deception? · 2024-09-14T22:57:12.610Z · LW · GW

No, I was only responding to the first part.

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on Is there any rigorous work on using anthropic uncertainty to prevent situational awareness / deception? · 2024-09-04T15:55:39.332Z · LW · GW

Not necessarily fooling it, just keeping it ignorant.  I think such schemes can plausibly scale to very high levels of capabilities, perhaps indefinitely, since intelligence doesn't give one the ability to create information from thin air...

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on Consent across power differentials · 2024-09-04T13:18:02.962Z · LW · GW

This is a super interesting and important problem, IMO.  I believe it already has significant real-world practical consequences, e.g. powerful people find it difficult to avoid being surrounded by sycophants: even if they really don't want to be, that's just an extra constraint for the sycophants to satisfy ("don't come across as sycophantic")!  I am inclined to agree that avoiding power differentials is the only way to really avoid these perverse outcomes in practice, and I think this is a good argument in favor of doing so.

--------------------------------------
This is also quite related to an (old, unpublished) work I did with Jonathan Binas on "bounded empowerment".  I've invited you to the Overleaf (it needs a clean-up, but I've also asked Jonathan about putting it on arXiv).
 
To summarize: Let's consider this in the case of a superhuman AI, R, and a human H.  The basic idea of that work is that R should try and "empower" H, and that (unlike in previous works on empowerment), there are two ways of doing this:
1) change the state of the world (as in previous works)
2) inform H so they know how to make use of the options available to them to achieve various ends (novel!)

If R has a perfect model of H and the world, then you can just compute how to effectively do these things (it's wildly intractable, ofc).  I think this would still often look "patronizing" in practice, and/or maybe just lead to totally wild behaviors (hard to predict this sort of stuff...), but it might be a useful conceptual "lead".
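For reference, here is a rough gesture at a formalization (my notation, assuming the standard channel-capacity definition of empowerment; the unpublished draft's actual definitions may differ):

```latex
% Standard n-step empowerment of H at state s: the channel capacity from
% H's n-step action sequences to the resulting state.
\mathfrak{E}_H(s) \;=\; \max_{p(a^n_H)} \; I\!\left(A^n_H \,;\, S_{t+n} \,\middle|\, S_t = s\right)
% Route (1): R changes the state s so that \mathfrak{E}_H(s) is larger.
% Route (2): restrict the maximization to policies H can actually implement
% given its knowledge, and have R raise the achievable value by informing H
% rather than by changing s.
```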

Random thought OTMH: Something which might make it less "patronizing" is if H were to have well-defined "meta-preferences" about how such interactions should work that R could aim to respect.  

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on Survey: How Do Elite Chinese Students Feel About the Risks of AI? · 2024-09-04T12:45:35.088Z · LW · GW

What makes you say this: "However, our results suggest that students are broadly less concerned about the risks of AI than people in the United States and Europe"? 

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on ProLU: A Nonlinearity for Sparse Autoencoders · 2024-08-09T00:32:06.633Z · LW · GW

This activation function was introduced in one of my papers from 10 years ago ;)

See Figure 2 of https://arxiv.org/abs/1402.3337
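For reference, a minimal sketch of the thresholded-linear unit I have in mind from Figure 2 of that paper (my naming; the post's ProLU may differ in details such as how gradients flow through the threshold):

```python
import numpy as np

def trec(x, theta=1.0):
    # Thresholded rectified unit: identity above the threshold, exactly zero
    # below it. Unlike relu(x - theta), values that clear the threshold are
    # passed through unshifted.
    return np.where(x > theta, x, 0.0)
```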

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on [Link Post] "Foundational Challenges in Assuring Alignment and Safety of Large Language Models" · 2024-06-07T01:39:31.027Z · LW · GW

Really interesting point!  

I introduced this term in my slides that included "paperweight" as an example of an "AI system" that maximizes safety.  

I sort of still think it's an OK term, but I'm sure I will keep thinking about this going forward and hope we can arrive at an even better term.

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on Testing for consequence-blindness in LLMs using the HI-ADS unit test. · 2024-03-15T12:09:30.979Z · LW · GW

You could try to do tests on data that is far enough from the training distribution that the model won't generalize in a simple imitative way there, and you could do tests to try to confirm that you are far enough off-distribution.  For instance, perhaps using a carefully chosen invented language would work.

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on Quick thoughts on "scalable oversight" / "super-human feedback" research · 2024-03-15T12:07:53.173Z · LW · GW

I don't disagree... in this case you don't get agents for a long time; someone else does though.

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on Quick thoughts on "scalable oversight" / "super-human feedback" research · 2024-03-06T21:08:26.047Z · LW · GW

I meant "other training schemes" to encompass things like scaffolding that deliberately engineers agents using LLMs as components, although I acknowledge they are not literally "training" and more like "engineering".

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on Reading the ethicists 2: Hunting for AI alignment papers · 2023-11-22T20:38:12.328Z · LW · GW

I would look at the main FATE conferences as well, which I view as being: FAccT, AIES, and EAAMO.

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on Ways I Expect AI Regulation To Increase Extinction Risk · 2023-08-14T18:42:00.968Z · LW · GW

I found this thought-provoking, but I didn't find the arguments very strong.

(a) Misdirected Regulations Reduce Effective Safety Effort; Regulations Will Almost Certainly Be Misdirected

(b) Regulations Generally Favor The Legible-To-The-State

(c) Heavy Regulations Can Simply Disempower the Regulator

(d) Regulations Are Likely To Maximize The Power of Companies Pushing Forward Capabilities the Most

Briefly responding:
a) The issue in this story seems to be that the company doesn't care about x-safety, not that they are legally obligated to care about face-blindness.
b) If governments don't have bandwidth to effectively vet small AI projects, it seems prudent to err on the side of forbidding projects that might pose x-risk. 
c) I do think we need effective international cooperation around regulation.  But even buying 1-4 years time seems good in expectation.
d) I don't see the x-risk aspect of this story.

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on How LLMs are and are not myopic · 2023-07-26T23:15:23.545Z · LW · GW

This means that the model can and will implicitly sacrifice next-token prediction accuracy for long horizon prediction accuracy.

Are you claiming this would happen even given infinite capacity?
If so, can you perhaps provide a simple+intuitive+concrete example?
 

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on What Discovering Latent Knowledge Did and Did Not Find · 2023-07-19T23:03:33.691Z · LW · GW

What do you mean by "random linear probe"?

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on Deceptive AI vs. shifting instrumental incentives · 2023-07-11T15:25:42.747Z · LW · GW

I skimmed this.  A few quick comments:
- I think you characterized deceptive alignment pretty well.  
- I think it only covers a narrow part of how deceptive behavior can arise. 
- CICERO likely already did some of what you describe.

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on Instrumental Convergence? [Draft] · 2023-06-29T10:26:36.263Z · LW · GW

So let us specify a probability distribution over the space of all possible desires. If we accept the orthogonality thesis, we should not want this probability distribution to build in any bias towards certain kinds of desires over others. So let's spread our probabilities in such a way that we meet the following three conditions. Firstly, we don't expect Sia's desires to be better satisfied in any one world than they are in any other world. Formally, our expectation of the degree to which Sia's desires are satisfied at $w$ is equal to our expectation of the degree to which Sia's desires are satisfied at $w'$, for any $w, w'$. Call that common expected value '$\mu$'. Secondly, our probabilities are symmetric around $\mu$. That is, our probability that $w$ satisfies Sia's desires to at least degree $\mu + x$ is equal to our probability that it satisfies her desires to at most degree $\mu - x$.  And thirdly, learning how well satisfied Sia's desires are at some worlds won't tell us how well satisfied her desires are at other worlds.  That is, the degree to which her desires are satisfied at some worlds is independent of how well satisfied they are at any other worlds.  (See the appendix for a more careful formulation of these assumptions.) If our probability distribution satisfies these constraints, then I'll say that Sia's desires are 'sampled randomly' from the space of all possible desires.


This is a characterization, and it remains to be shown that there exist distributions satisfying it (I suspect there are none, assuming the sets of possible desires and worlds are unbounded).

I also find the third criterion counterintuitive.  If worlds share features, I would expect satisfaction levels at those worlds not to be independent.
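For what it's worth, here's a quick sanity check of the finite case only (my toy setup: i.i.d. symmetric draws per world trivially satisfy all three conditions; the open question is whether anything like this survives when the spaces of worlds and desires are unbounded):

```python
import numpy as np

rng = np.random.default_rng(0)
n_worlds, n_samples, mu = 5, 100_000, 0.0

# "Sample Sia's desires": degree of desire-satisfaction at each world,
# drawn i.i.d. from a symmetric distribution centred at mu.
satisfaction = rng.normal(loc=mu, scale=1.0, size=(n_samples, n_worlds))

print(satisfaction.mean(axis=0))                         # condition 1: same expectation at every world
print((satisfaction[:, 0] >= mu + 0.5).mean(),
      (satisfaction[:, 0] <= mu - 0.5).mean())           # condition 2: symmetry around mu
print(np.corrcoef(satisfaction, rowvar=False).round(2))  # condition 3: ~zero correlation across worlds
```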

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on Did Bengio and Tegmark lose a debate about AI x-risk against LeCun and Mitchell? · 2023-06-27T10:15:21.879Z · LW · GW

I think it might be more effective in future debates at the outset to: 
* Explain that it's only necessary to cross a low bar (e.g. see my Tweet below).  -- This is a common practice in debates.
* Outline the responses they expect to hear from the other side, and explain why they are bogus.  Framing: "Whether AI is an x-risk has been debated in the ML community for 10 years, and nobody has provided any compelling counterarguments that refute the 3 claims (of the Tweet).  You will hear a bunch of counterarguments from the other side, but when you do, ask yourself whether they are really addressing this.  Here are a few counterarguments and why they fail..." -- I think this could really take the wind out of the opposition's sails and put them on the back foot.

I also don't think LeCun and Meta should be given so much credit -- is Facebook really going to develop and deploy AI responsibly?
1) They have been widely condemned for knowingly playing a significant role in the Rohingya genocide, have acknowledged failing to act to prevent Facebook's role in it, and are being sued for $150bn over it.
2) They have also been criticised for the role that their products, especially Instagram, play in contributing to mental health issues, particularly around body image in teenage girls.

More generally, I think the "companies do irresponsible stuff all the time" point needs to be stressed more.  And one particular argument that is bogus is "we'll make it safe": x-safety is a common good, and so companies should be expected to undersupply it.  This is econ 101.


Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on capybaralet's Shortform · 2023-05-22T22:01:13.371Z · LW · GW

Organizations that are looking for ML talent (e.g. to mentor more junior people, or get feedback on policy) should offer PhD students high-paying contractor/part-time work.

ML PhD students working on safety-relevant projects should be able to augment their meager stipends this way.

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on On AutoGPT · 2023-04-22T16:51:50.004Z · LW · GW

That is in addition to all the people who will give their AutoGPT an instruction that means well but actually translates to killing all the humans or at least take control over the future, since that is so obviously the easiest way to accomplish the thing, such as ‘bring about world peace and end world hunger’ (link goes to Sully hyping AutoGPT, saying ‘you give it a goal like end world hunger’) or ‘stop climate change’ or ‘deliver my coffee every morning at 8am sharp no matter what as reliably as possible.’ Or literally almost anything else.


I think these mostly only translate into dangerous behavior if the model badly "misunderstands" the instruction, which seems somewhat implausible.  

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on On AutoGPT · 2023-04-22T16:48:03.810Z · LW · GW

One must notice that in order to predict the next token as well as possible the LMM will benefit from being able to simulate every situation, every person, and every causal element behind the creation of every bit of text in its training distribution, no matter what we then train the LMM to output to us (what mask we put on it) afterwards.


Is there any rigorous justification for this claim?  As far as I can tell, this is folk wisdom from the scaling/AI safety community, and I think it's far from obvious that it's correct, or what assumptions are required for it to hold.  

It seems much more plausible in the infinite limit than in practice.

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on On AutoGPT · 2023-04-22T16:41:14.440Z · LW · GW

I have gained confidence in my position that all of this happening now is a good thing, both from the perspective of smaller risks like malware attacks, and from the perspective of potential existential threats. Seems worth going over the logic.

What we want to do is avoid what one might call an agent overhang.

One might hope to execute our Plan A of having our AIs not be agents. Alas, even if technically feasible (which is not at all clear) that only can work if we don’t intentionally turn them into agents via wrapping code around them. We’ve checked with actual humans about the possibility of kindly not doing that. Didn’t go great.


This seems like really bad reasoning... 

It seems like the evidence that people won't "kindly not [do] that" is... AutoGPT.
So if AutoGPT didn't exist, you might be able to say: "we asked people to not turn AI systems into agents, and they didn't.  Hooray for plan A!"

Also: I don't think it's fair to say "we've checked [...] about the possibility".  The AI safety community thought it was sketch for a long time, and has provided some lackluster pushback.  Governance folks from the community don't seem to be calling for a rollback of the plugins, or bans on this kind of behavior, etc.
 

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on OpenAI could help X-risk by wagering itself · 2023-04-20T23:32:29.510Z · LW · GW

Christiano and Yudkowsky both agree AI is an x-risk -- a prediction that would distinguish their models does not do much to help us resolve whether or not AI is an x-risk.

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on "Publish or Perish" (a quick note on why you should try to make your work legible to existing academic communities) · 2023-03-21T03:22:10.700Z · LW · GW

I'm not necessarily saying people are subconsciously trying to create a moat.  

I'm saying they are acting in a way that creates a moat, which enables them to avoid competition; more competition would create more motivation for them to write things up for academic audiences (or even just to write more clearly for non-academic audiences).

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on "Publish or Perish" (a quick note on why you should try to make your work legible to existing academic communities) · 2023-03-21T03:15:16.504Z · LW · GW

Q: "Why is that not enough?"
A: Because they are not being funded to produce the right kinds of outputs.

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on "Publish or Perish" (a quick note on why you should try to make your work legible to existing academic communities) · 2023-03-19T12:02:52.995Z · LW · GW

My point is not specific to machine learning. I'm not as familiar with other academic communities, but I think most of the time it would probably be worth engaging with them if there is somewhere where your work could fit.

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on "Publish or Perish" (a quick note on why you should try to make your work legible to existing academic communities) · 2023-03-19T11:09:23.724Z · LW · GW

In my experience people also often know their blog posts aren't very good.

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on "Publish or Perish" (a quick note on why you should try to make your work legible to existing academic communities) · 2023-03-19T11:08:17.683Z · LW · GW

My point (see footnote) is that motivations are complex.  I do not believe "the real motivations" is a very useful concept here.  

The question becomes why "don't they judge those costs to be worth it"?  Is there motivated reasoning involved?  Almost certainly yes; there always is.

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on "Publish or Perish" (a quick note on why you should try to make your work legible to existing academic communities) · 2023-03-19T11:04:31.141Z · LW · GW
  1. A lot of work just isn't made publicly available
  2. When it is, it's often in the form of ~100 page google docs
  3. Academics have a number of good reasons to ignore things that don't meet academic standards of rigor and presentation

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on Japan AI Alignment Conference · 2023-03-11T22:37:31.577Z · LW · GW

works for me too now

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on Japan AI Alignment Conference · 2023-03-10T20:59:10.024Z · LW · GW

The link is broken, FYI

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on Can we efficiently distinguish different mechanisms? · 2023-02-14T09:49:00.437Z · LW · GW

Yeah this was super unclear to me; I think it's worth updating the OP.

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on SolidGoldMagikarp (plus, prompt generation) · 2023-02-14T09:23:56.354Z · LW · GW

FYI: my understanding is that "data poisoning" refers to deliberately corrupting the training data of somebody else's model, which I understand is not what you are describing.

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on Cyborgism · 2023-02-13T17:47:48.806Z · LW · GW

Oh I see.  I was getting at the "it's not aligned" bit.

Basically, it seems like if I become a cyborg without understanding what I'm doing, the result is either:

  • I'm in control
  • The machine part is in control
  • Something in the middle

Only the first one seems likely to be sufficiently aligned. 

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on SolidGoldMagikarp (plus, prompt generation) · 2023-02-13T17:45:58.246Z · LW · GW

I don't understand the fuss about this; I suspect these phenomena are due to uninteresting, and perhaps even well-understood effects.  A colleague of mine had this to say:

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on Cyborgism · 2023-02-12T16:05:58.706Z · LW · GW

Indeed.  I think having a clean, well-understood interface for human/AI interaction seems useful here.  I recognize this is a big ask given the current norms and rules around AI development and deployment.

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on Cyborgism · 2023-02-12T16:04:19.114Z · LW · GW

I don't understand what you're getting at RE "personal level".

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on Cyborgism · 2023-02-11T14:38:44.120Z · LW · GW

I think the most fundamental objection to becoming cyborgs is that we don't know how to say whether a person retains control over the cyborg they become a part of.

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on A (EtA: quick) note on terminology: AI Alignment != AI x-safety · 2023-02-10T09:43:50.345Z · LW · GW

FWIW, I didn't mean to kick off a historical debate, which seems like probably not a very valuable use of y'all's time.

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on A (EtA: quick) note on terminology: AI Alignment != AI x-safety · 2023-02-10T09:42:52.157Z · LW · GW

Unfortunately, I think even "catastrophic risk" has a high potential to be watered down and be applied to situations where dozens as opposed to millions/billions die.  Even existential risk has this potential, actually, but I think it's a safer bet.

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on A (EtA: quick) note on terminology: AI Alignment != AI x-safety · 2023-02-10T09:39:29.838Z · LW · GW

I don't think we should try and come up with a special term for (1).
The best term might be "AI engineering".  The only thing it needs to be distinguished from is "AI science".

I think ML people overwhelmingly identify as doing one of those 2 things, and find it annoying and ridiculous when people in this community act like we are the only ones who care about building systems that work as intended.

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on A (EtA: quick) note on terminology: AI Alignment != AI x-safety · 2023-02-09T13:17:21.273Z · LW · GW

I say it is a rebrand of the "AI (x-)safety" community.
When "AI alignment" came along we were calling it "AI safety", even though what everyone in the community really meant all along was basically AI existential safety.  "AI safety" was (IMO) a somewhat successful bid for more mainstream acceptance, which then led to dilution and confusion, necessitating a new term.

I don't think the history is that important; what's important is having good terminology going forward.
This is also why I stress that I work on AI existential safety.

So I think people should just say what kind of technical work they are doing, and "existential safety" should be treated as a socio-technical problem that motivates a community of researchers, and used to refer to that problem and that community.  In particular, I think we are not able to cleanly delineate what is or isn't technical AI existential safety research at this point, and we should welcome intellectual debates about the nature of the problem and how different technical research may or may not contribute to increasing x-safety.

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on Why I hate the "accident vs. misuse" AI x-risk dichotomy (quick thoughts on "structural risk") · 2023-02-08T22:40:04.105Z · LW · GW

Hmm... this is a good point.

I think structural risk is often a better description of reality, but I can see a rhetorical argument against framing things that way.  One problem I see with doing that is that I think it leads people to think the solution is just for AI developers to be more careful, rather than observing that there will be structural incentives (etc.) pushing for less caution.

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on Why I hate the "accident vs. misuse" AI x-risk dichotomy (quick thoughts on "structural risk") · 2023-02-08T22:37:09.234Z · LW · GW

I do think that there’s a pretty solid dichotomy between (A) “the AGI does things specifically intended by its designers” and (B) “the AGI does things that the designers never wanted it to do”.

1) I don't think this dichotomy is as solid as it seems once you start poking at it... e.g. in your war example, it would be odd to say that the designers of the AGI systems that wiped out humans intended for that outcome to occur.  Intentions are perhaps best thought of as incomplete specifications.  

2) From our current position, I think “never ever create AGI” is a significantly easier thing to coordinate around than "don't build AGI until/unless we can do it safely".  I'm not very worried that we will coordinate too successfully and never build AGI and thus squander the cosmic endowment.  This is both because I think that's quite unlikely, and because I'm not sure we'll make very good / the best use of it anyways (e.g. think S-risk, other civilizations).

3) I think the conventional framing of AI alignment is something between vague and substantively incorrect, as well as being misleading.  Here is a post I dashed off about that:
https://www.lesswrong.com/posts/biP5XBmqvjopvky7P/a-note-on-terminology-ai-alignment-ai-x-safety.  I think creating such a manual is an incredibly ambitious goal, and I think more people in this community should aim for more moderate goals.  I mostly agree with the perspective in this post: https://coordination.substack.com/p/alignment-is-not-enough, but I could say more on the matter.

4) RE connotations of accident: I think they are often strong.

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on Why I hate the "accident vs. misuse" AI x-risk dichotomy (quick thoughts on "structural risk") · 2023-02-05T12:05:20.019Z · LW · GW

While defining accident as "incident that was not specifically intended & desired by the people who pressed 'run' on the AGI code" is extremely broad, it still supposes that there is such a thing as "the AGI code", which significantly restricts the space of possible risks.

There are other reasons I would not be happy with that browser extension.  There is not one specific conversation I can point to; it comes up regularly.  I think this replacement would probably lead to a lot of confusion, since I think when people use the word "accident" they often proceed as if it meant something stricter, e.g. that the result was unforeseen or unforeseeable.

If (as in "Concrete Problems", IMO) the point is just to point out that AI can get out-of-control, or that misuse is not the only risk, that's a worthwhile thing to point out, but it doesn't lead to a very useful framework for understanding the nature of the risk(s).  As I mentioned elsewhere, it is specifically the dichotomy of "accident vs. misuse" that I think is the most problematic and misleading.

I think the chart is misleading for the following reasons, among others:

  • It seems to suppose that there is such a manual, or the goal of creating one.  However, if we coordinate effectively, we can simply forgo development and deployment of dangerous technologies ~indefinitely.
  • It inappropriately separates "coordination problems" and "everyone follows the manual"
     
Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on Vanessa Kosoy's Shortform · 2023-02-05T12:01:16.830Z · LW · GW

I think the construction gives us $C(\pi) \leq C(U) + e$ for a small constant $e$ (representing the wrapper).  It seems like any compression you can apply to the reward function can be translated to the policy via the wrapper.  So then you would never have $C(\pi) \gg C(U)$.  What am I missing/misunderstanding?
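To spell out the bound I'm gesturing at (a sketch in my own notation; the constants are the description lengths of the wrapper and of a decoder that reads the policy back off the wrapped reward function):

```latex
\begin{align*}
C(U)   &\le C(\pi) + c_{\mathrm{wrap}} && \text{(build $U$ by wrapping $\pi$)}\\
C(\pi) &\le C(U) + c_{\mathrm{dec}}    && \text{(recover $\pi$ from the wrapper $U$)}\\
\lvert C(\pi) - C(U) \rvert &\le \max(c_{\mathrm{wrap}}, c_{\mathrm{dec}}) =: e
\end{align*}
```

so, if this is right, $C(\pi) \gg C(U)$ can never hold for reward functions produced by this construction.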

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on Why I hate the "accident vs. misuse" AI x-risk dichotomy (quick thoughts on "structural risk") · 2023-02-02T12:39:33.747Z · LW · GW

By "intend" do you mean that they sought that outcome / selected for it?  
Or merely that it was a known or predictable outcome of their behavior?

I think "unintentional" would already probably be a better term in most cases. 

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on Vanessa Kosoy's Shortform · 2023-02-02T12:35:38.602Z · LW · GW

Apologies, I didn't take the time to understand all of this yet, but I have a basic question you might have an answer to...

We know how to map (deterministic) policies to reward functions using the construction at the bottom of page 6 of the reward modelling agenda (https://arxiv.org/abs/1811.07871v1): the agent is rewarded only if it has so far done exactly what the policy would do.  I think of this as a wrapper function (https://en.wikipedia.org/wiki/Wrapper_function).
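For concreteness, a minimal sketch of that construction (my helper names, assuming a deterministic policy mapping observation histories to actions):

```python
def make_wrapper_reward(policy):
    """Turn a deterministic policy into a reward function: reward 1 only if
    every action taken so far is exactly what the policy would have chosen
    at the corresponding observation prefix, else 0."""
    def reward(observations, actions):
        for t, action in enumerate(actions):
            if action != policy(observations[: t + 1]):
                return 0
        return 1
    return reward
```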

It seems like this means that, for any policy, we can represent it as optimizing reward with only the minimal overhead in description/computational complexity of the wrapper.

So...

  • Do you think this analysis is correct?  Or what is it missing?  (maybe the assumption that the policy is deterministic is significant?  This turns out to be the case for Orseau et al.'s "Agents and Devices" approach, I think https://arxiv.org/abs/1805.12387).
  • Are you trying to get around this somehow?  Or are you fine with this minimal overhead being used to distinguish goal-directed from non-goal directed policies?
Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on Why I hate the "accident vs. misuse" AI x-risk dichotomy (quick thoughts on "structural risk") · 2023-02-01T18:27:48.406Z · LW · GW

"Concrete Problems in AI Safety" used this distinction to make this point, and I think it was likely a useful simplification in that context.  I generally think spelling it out is better, and I think people will pattern match your concerns onto the “the sci-fi scenario where AI spontaneously becomes conscious, goes rogue, and pursues its own goal” or "boring old robustness problems" if you don't invoke structural risk.  I think structural risk plays a crucial role in the arguments, and even if you think things that look more like pure accidents are more likely, I think the structural risk story is more plausible to more people and a sufficient cause for concern.

RE (A): A known side-effect is not an accident.


 

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on Why I hate the "accident vs. misuse" AI x-risk dichotomy (quick thoughts on "structural risk") · 2023-01-31T09:59:10.625Z · LW · GW

I agree somewhat, however, I think we need to be careful to distinguish "do unsavory things" from "cause human extinction", and should generally be squarely focused on the latter.  The former easily becomes too political, making coordination harder.

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on Why I hate the "accident vs. misuse" AI x-risk dichotomy (quick thoughts on "structural risk") · 2023-01-31T09:56:52.792Z · LW · GW

Yes, it may be useful in some very limited contexts.  I can't recall a time I have seen it in writing and felt like it was not a counter-productive framing.

AI is highly non-analogous to guns.