Posts

Testing for consequence-blindness in LLMs using the HI-ADS unit test. 2023-11-24T23:35:29.560Z
"Publish or Perish" (a quick note on why you should try to make your work legible to existing academic communities) 2023-03-18T19:01:54.199Z
What organizations other than Conjecture have (esp. public) info-hazard policies? 2023-03-16T14:49:12.411Z
A (EtA: quick) note on terminology: AI Alignment != AI x-safety 2023-02-08T22:33:52.713Z
Why I hate the "accident vs. misuse" AI x-risk dichotomy (quick thoughts on "structural risk") 2023-01-30T18:50:17.613Z
Quick thoughts on "scalable oversight" / "super-human feedback" research 2023-01-25T12:55:31.334Z
Mechanistic Interpretability as Reverse Engineering (follow-up to "cars and elephants") 2022-11-03T23:19:20.458Z
"Cars and Elephants": a handwavy argument/analogy against mechanistic interpretability 2022-10-31T21:26:05.388Z
I'm planning to start creating more write-ups summarizing my thoughts on various issues, mostly related to AI existential safety. What do you want to hear my nuanced takes on? 2022-09-24T12:38:24.163Z
[An email with a bunch of links I sent an experienced ML researcher interested in learning about Alignment / x-safety.] 2022-09-08T22:28:54.534Z
An Update on Academia vs. Industry (one year into my faculty job) 2022-09-03T20:43:37.701Z
Causal confusion as an argument against the scaling hypothesis 2022-06-20T10:54:05.623Z
Do FDT (or similar) recommend reparations? 2022-04-29T17:34:48.479Z
What's a good probability distribution family (e.g. "log-normal") to use for AGI timelines? 2022-04-13T04:45:04.649Z
Is "gears-level" just a synonym for "mechanistic"? 2021-12-13T04:11:45.159Z
Is there a name for the theory that "There will be fast takeoff in real-world capabilities because almost everything is AGI-complete"? 2021-09-02T23:00:42.785Z
What do we know about how much protection COVID vaccines provide against transmitting the virus to others? 2021-05-06T07:39:48.366Z
What do we know about how much protection COVID vaccines provide against long COVID? 2021-05-06T07:39:16.873Z
What do the reported levels of protection offered by various vaccines mean? 2021-05-04T22:06:23.758Z
Did they use serological testing for COVID vaccine trials? 2021-05-04T21:48:30.507Z
When's the best time to get the 2nd dose of Pfizer Vaccine? 2021-04-30T05:11:27.936Z
Are there any good ways to place a bet on RadicalXChange and/or related ideas/mechanisms taking off in a big way? e.g. is there something to invest $$$ in? 2021-04-17T06:58:42.414Z
What does vaccine effectiveness as a function of time look like? 2021-04-17T00:36:20.366Z
How many micromorts do you get per UV-index-hour? 2021-03-30T17:23:26.566Z
AI x-risk reduction: why I chose academia over industry 2021-03-14T17:25:12.503Z
"Beliefs" vs. "Notions" 2021-03-12T16:04:31.194Z
Any work on honeypots (to detect treacherous turn attempts)? 2020-11-12T05:41:56.371Z
When was the term "AI alignment" coined? 2020-10-21T18:27:56.162Z
Has anyone researched specification gaming with biological animals? 2020-10-21T00:20:01.610Z
Is there any work on incorporating aleatoric uncertainty and/or inherent randomness into AIXI? 2020-10-04T08:10:56.400Z
capybaralet's Shortform 2020-08-27T21:38:18.144Z
A reductio ad absurdum for naive Functional/Computational Theory-of-Mind (FCToM). 2020-01-02T17:16:35.566Z
A list of good heuristics that the case for AI x-risk fails 2019-12-02T19:26:28.870Z
What I talk about when I talk about AI x-risk: 3 core claims I want machine learning researchers to address. 2019-12-02T18:20:47.530Z
A fun calibration game: "0-hit Google phrases" 2019-11-21T01:13:10.667Z
Can indifference methods redeem person-affecting views? 2019-11-12T04:23:10.011Z
What are the reasons to *not* consider reducing AI-Xrisk the highest priority cause? 2019-08-20T21:45:12.118Z
Project Proposal: Considerations for trading off capabilities and safety impacts of AI research 2019-08-06T22:22:20.928Z
False assumptions and leaky abstractions in machine learning and AI safety 2019-06-28T04:54:47.119Z
Let's talk about "Convergent Rationality" 2019-06-12T21:53:35.356Z
X-risks are tragedies of the commons 2019-02-07T02:48:25.825Z
My use of the phrase "Super-Human Feedback" 2019-02-06T19:11:11.734Z
Thoughts on Ben Garfinkel's "How sure are we about this AI stuff?" 2019-02-06T19:09:20.809Z
The role of epistemic vs. aleatory uncertainty in quantifying AI-Xrisk 2019-01-31T06:13:35.321Z
Imitation learning considered unsafe? 2019-01-06T15:48:36.078Z
Conceptual Analysis for AI Alignment 2018-12-30T00:46:38.014Z
Disambiguating "alignment" and related notions 2018-06-05T15:35:15.091Z
Problems with learning values from observation 2016-09-21T00:40:49.102Z
Risks from Approximate Value Learning 2016-08-27T19:34:06.178Z
Inefficient Games 2016-08-23T17:47:02.882Z

Comments

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on Testing for consequence-blindness in LLMs using the HI-ADS unit test. · 2024-03-15T12:09:30.979Z · LW · GW

You could try to do tests on data that is far enough from the training distribution that it won't generalize in a simple imitative way there, and you could do tests to try to confirm that you are far enough off-distribution.  For instance, perhaps using a carefully chosen invented language would work.
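A minimal sketch of the kind of thing I mean (hypothetical helper names; `score_fn` stands in for whatever average per-token log-likelihood access you have to the model under test):

```python
import random
import string

def invent_language(vocab_size=200, word_len=5, seed=0):
    """Build a small invented vocabulary of random letter strings.

    Sentences composed from these words should be far from anything in a
    natural-language training corpus.
    """
    rng = random.Random(seed)
    words = set()
    while len(words) < vocab_size:
        words.add("".join(rng.choice(string.ascii_lowercase) for _ in range(word_len)))
    return sorted(words)

def make_test_sentences(vocab, n_sentences=50, sentence_len=8, seed=1):
    """Compose test prompts entirely out of the invented vocabulary."""
    rng = random.Random(seed)
    return [" ".join(rng.choice(vocab) for _ in range(sentence_len)) for _ in range(n_sentences)]

def looks_off_distribution(invented, natural, score_fn, margin=5.0):
    """Rough check that the invented text is off-distribution for the model.

    score_fn(text) should return the model's average per-token log-likelihood
    (higher = more familiar); we require the invented text to score well below
    comparable natural text.
    """
    inv = sum(map(score_fn, invented)) / len(invented)
    nat = sum(map(score_fn, natural)) / len(natural)
    return (nat - inv) > margin

vocab = invent_language()
print(make_test_sentences(vocab)[0])
```

The margin and the scoring function are placeholders; the point is just to pair the invented-language test set with some quantitative check that the model really does find it unfamiliar.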

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on Quick thoughts on "scalable oversight" / "super-human feedback" research · 2024-03-15T12:07:53.173Z · LW · GW

I don't disagree... in this case you don't get agents for a long time; someone else does though.

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on Quick thoughts on "scalable oversight" / "super-human feedback" research · 2024-03-06T21:08:26.047Z · LW · GW

I meant "other training schemes" to encompass things like scaffolding that deliberately engineers agents using LLMs as components, although I acknowledge they are not literally "training" and more like "engineering".

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on Reading the ethicists 2: Hunting for AI alignment papers · 2023-11-22T20:38:12.328Z · LW · GW

I would look at the main FATE conferences as well, which I view as being: FAccT, AIES, EAAMO.

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on Ways I Expect AI Regulation To Increase Extinction Risk · 2023-08-14T18:42:00.968Z · LW · GW

I found this thought-provoking, but I didn't find the arguments very strong.

(a) Misdirected Regulations Reduce Effective Safety Effort; Regulations Will Almost Certainly Be Misdirected

(b) Regulations Generally Favor The Legible-To-The-State

(c) Heavy Regulations Can Simply Disempower the Regulator

(d) Regulations Are Likely To Maximize The Power of Companies Pushing Forward Capabilities the Most

Briefly responding:
a) The issue in this story seems to be that the company doesn't care about x-safety, not that they are legally obligated to care about face-blindness.
b) If governments don't have bandwidth to effectively vet small AI projects, it seems prudent to err on the side of forbidding projects that might pose x-risk. 
c) I do think we need effective international cooperation around regulation.  But even buying 1-4 years of time seems good in expectation.
d) I don't see the x-risk aspect of this story.

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on How LLMs are and are not myopic · 2023-07-26T23:15:23.545Z · LW · GW

This means that the model can and will implicitly sacrifice next-token prediction accuracy for long horizon prediction accuracy.

Are you claiming this would happen even given infinite capacity?
If so, can you perhaps provide a simple+intuitive+concrete example?
 

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on What Discovering Latent Knowledge Did and Did Not Find · 2023-07-19T23:03:33.691Z · LW · GW

What do you mean by "random linear probe"?

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on Deceptive AI vs. shifting instrumental incentives · 2023-07-11T15:25:42.747Z · LW · GW

I skimmed this.  A few quick comments:
- I think you characterized deceptive alignment pretty well.  
- I think it only covers a narrow part of how deceptive behavior can arise. 
- CICERO likely already did some of what you describe.

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on Instrumental Convergence? [Draft] · 2023-06-29T10:26:36.263Z · LW · GW

So let us specify a probability distribution over the space of all possible desires. If we accept the orthogonality thesis, we should not want this probability distribution to build in any bias towards certain kinds of desires over others. So let's spread our probabilities in such a way that we meet the following three conditions. Firstly, we don't expect Sia's desires to be better satisfied in any one world than they are in any other world. Formally, our expectation of the degree to which Sia's desires are satisfied at $w$ is equal to our expectation of the degree to which Sia's desires are satisfied at $w'$, for any $w, w'$. Call that common expected value '$\mu$'. Secondly, our probabilities are symmetric around $\mu$. That is, our probability that $w$ satisfies Sia's desires to at least degree $\mu + x$ is equal to our probability that it satisfies her desires to at most degree $\mu - x$.  And thirdly, learning how well satisfied Sia's desires are at some worlds won't tell us how well satisfied her desires are at other worlds.  That is, the degree to which her desires are satisfied at some worlds is independent of how well satisfied they are at any other worlds.  (See the appendix for a more careful formulation of these assumptions.) If our probability distribution satisfies these constraints, then I'll say that Sia's desires are 'sampled randomly' from the space of all possible desires.


This is a characterization, and it remains to show that there exist distributions that fit it (I suspect there are none, assuming the sets of possible desires and worlds are unbounded).

I also find the third criterion counterintuitive.  If worlds share features, I would expect these not to be independent.
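For reference, here is how I would write the three conditions formally (my own reconstruction and notation, with $S(w)$ denoting the degree to which Sia's desires are satisfied at world $w$; the draft's appendix has the careful version):

\[
\text{(1) } \mathbb{E}[S(w)] = \mu \text{ for every world } w; \qquad
\text{(2) } \Pr\big(S(w) \ge \mu + x\big) = \Pr\big(S(w) \le \mu - x\big) \text{ for all } x \ge 0; \qquad
\text{(3) } \{S(w)\}_{w} \text{ are mutually independent.}
\]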

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on Did Bengio and Tegmark lose a debate about AI x-risk against LeCun and Mitchell? · 2023-06-27T10:15:21.879Z · LW · GW

In future debates, I think it might be more effective to, at the outset: 
* Explain that it's only necessary to cross a low bar (e.g. see my Tweet below).  -- This is a common practice in debates.
* Outline the responses they expect to hear from the other side, and explain why they are bogus.  Framing: "Whether AI is an x-risk has been debated in the ML community for 10 years, and nobody has provided any compelling counterarguments that refute the 3 claims (of the Tweet).  You will hear a bunch of counterarguments from the other side, but when you do, ask yourself whether they are really addressing this.  Here are a few counterarguments and why they fail..." -- I think this could really take the wind out of the opposition's sails and put them on the back foot.

I also don't think LeCun and Meta should be given so much credit -- Is Facebook really going to develop and deploy AI responsibly?
1) They have been widely condemned for knowingly playing a significant role in the Rohingya genocide, have acknowledged that they failed to act to prevent Facebook's role in the Rohingya genocide, and are being sued for $150bn for this.  
2) They have also been criticised for the role that their products, especially Instagram, play in contributing to mental health issues, especially around body image in teenage girls.  

More generally, I think the "companies do irresponsible stuff all the time" point needs to be stressed more.  And one particular bogus argument is "we'll make it safe" -- x-safety is a common good, and so companies should be expected to undersupply it.  This is econ 101.
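To spell out the econ-101 point with a toy model (my own illustrative notation and numbers, not anything from the debate): suppose there are $N$ labs, a unit of safety effort costs the investing lab $c$, and it yields a non-rival benefit $b$ to every lab. Then

\[
\text{a lab invests iff } b > c, \qquad \text{while investment is socially worthwhile iff } N b > c.
\]

With, say, $N = 10$, $b = 1$, $c = 3$, no individual lab invests ($1 < 3$) even though the industry-wide value of investing is $10 > 3$ -- hence under-supply absent coordination or regulation.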


Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on capybaralet's Shortform · 2023-05-22T22:01:13.371Z · LW · GW

Organizations that are looking for ML talent (e.g. to mentor more junior people, or get feedback on policy) should offer PhD students high-paying contractor/part-time work.

ML PhD students working on safety-relevant projects should be able to augment their meager stipends this way.

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on On AutoGPT · 2023-04-22T16:51:50.004Z · LW · GW

That is in addition to all the people who will give their AutoGPT an instruction that means well but actually translates to killing all the humans or at least take control over the future, since that is so obviously the easiest way to accomplish the thing, such as ‘bring about world peace and end world hunger’ (link goes to Sully hyping AutoGPT, saying ‘you give it a goal like end world hunger’) or ‘stop climate change’ or ‘deliver my coffee every morning at 8am sharp no matter what as reliably as possible.’ Or literally almost anything else.


I think these mostly only translate into dangerous behavior if the model badly "misunderstands" the instruction, which seems somewhat implausible.  

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on On AutoGPT · 2023-04-22T16:48:03.810Z · LW · GW

One must notice that in order to predict the next token as well as possible the LMM will benefit from being able to simulate every situation, every person, and every causal element behind the creation of every bit of text in its training distribution, no matter what we then train the LMM to output to us (what mask we put on it) afterwards.


Is there any rigorous justification for this claim?  As far as I can tell, this is folk wisdom from the scaling/AI safety community, and I think it's far from obvious that it's correct, or what assumptions are required for it to hold.  

It seems much more plausible in the infinite limit than in practice.

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on On AutoGPT · 2023-04-22T16:41:14.440Z · LW · GW

I have gained confidence in my position that all of this happening now is a good thing, both from the perspective of smaller risks like malware attacks, and from the perspective of potential existential threats. Seems worth going over the logic.

What we want to do is avoid what one might call an agent overhang.

One might hope to execute our Plan A of having our AIs not be agents. Alas, even if technically feasible (which is not at all clear) that only can work if we don’t intentionally turn them into agents via wrapping code around them. We’ve checked with actual humans about the possibility of kindly not doing that. Didn’t go great.


This seems like really bad reasoning... 

It seems like the evidence that people won't "kindly not [do] that" is... AutoGPT.
So if AutoGPT didn't exist, you might be able to say: "we asked people to not turn AI systems into agents, and they didn't.  Hooray for plan A!"

Also: I don't think it's fair to say "we've checked [...] about the possibility".  The AI safety community thought it was sketch for a long time, and has provided some lackluster pushback.  Governance folks from the community don't seem to be calling for a rollback of the plugins, or bans on this kind of behavior, etc.
 

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on OpenAI could help X-risk by wagering itself · 2023-04-20T23:32:29.510Z · LW · GW

Christiano and Yudkowsky both agree AI is an x-risk -- a prediction that would distinguish their models does not do much to help us resolve whether or not AI is an x-risk.

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on "Publish or Perish" (a quick note on why you should try to make your work legible to existing academic communities) · 2023-03-21T03:22:10.700Z · LW · GW

I'm not necessarily saying people are subconsciously trying to create a moat.  

I'm saying they are acting in a way that creates a moat, and that enables them to avoid competition, and that more competition would create more motivation for them to write things up for academic audiences (or even just write more clearly for non-academic audiences).

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on "Publish or Perish" (a quick note on why you should try to make your work legible to existing academic communities) · 2023-03-21T03:15:16.504Z · LW · GW

Q: "Why is that not enough?"
A: Because they are not being funded to produce the right kinds of outputs.

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on "Publish or Perish" (a quick note on why you should try to make your work legible to existing academic communities) · 2023-03-19T12:02:52.995Z · LW · GW

My point is not specific to machine learning. I'm not as familiar with other academic communities, but I think most of the time it would probably be worth engaging with them if there is somewhere where your work could fit.

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on "Publish or Perish" (a quick note on why you should try to make your work legible to existing academic communities) · 2023-03-19T11:09:23.724Z · LW · GW

In my experience people also often know their blog posts aren't very good.

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on "Publish or Perish" (a quick note on why you should try to make your work legible to existing academic communities) · 2023-03-19T11:08:17.683Z · LW · GW

My point (see footnote) is that motivations are complex.  I do not believe "the real motivations" is a very useful concept here.  

The question then becomes: why don't they judge those costs to be worth it?  Is there motivated reasoning involved?  Almost certainly yes; there always is.

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on "Publish or Perish" (a quick note on why you should try to make your work legible to existing academic communities) · 2023-03-19T11:04:31.141Z · LW · GW
  1. A lot of work just isn't made publicly available
  2. When it is, it's often in the form of ~100 page google docs
  3. Academics have a number of good reasons to ignore things that don't meet academic standards of rigor and presentation.

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on Japan AI Alignment Conference · 2023-03-11T22:37:31.577Z · LW · GW

works for me too now

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on Japan AI Alignment Conference · 2023-03-10T20:59:10.024Z · LW · GW

The link is broken, FYI

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on Can we efficiently distinguish different mechanisms? · 2023-02-14T09:49:00.437Z · LW · GW

Yeah this was super unclear to me; I think it's worth updating the OP.

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on SolidGoldMagikarp (plus, prompt generation) · 2023-02-14T09:23:56.354Z · LW · GW

FYI: my understanding is that "data poisoning" refers to deliberately manipulating the training data of somebody else's model, which I understand is not what you are describing.

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on Cyborgism · 2023-02-13T17:47:48.806Z · LW · GW

Oh I see.  I was getting at the "it's not aligned" bit.

Basically, it seems like if I become a cyborg without understanding what I'm doing, the result is either:

  • I'm in control
  • The machine part is in control
  • Something in the middle

Only the first one seems likely to be sufficiently aligned. 

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on SolidGoldMagikarp (plus, prompt generation) · 2023-02-13T17:45:58.246Z · LW · GW

I don't understand the fuss about this; I suspect these phenomena are due to uninteresting, and perhaps even well-understood, effects.  A colleague of mine had this to say:

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on Cyborgism · 2023-02-12T16:05:58.706Z · LW · GW

Indeed.  I think having a clean, well-understood interface for human/AI interaction seems useful here.  I recognize this is a big ask given the current norms and rules around AI development and deployment.

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on Cyborgism · 2023-02-12T16:04:19.114Z · LW · GW

I don't understand what you're getting at RE "personal level".

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on Cyborgism · 2023-02-11T14:38:44.120Z · LW · GW

I think the most fundamental objection to becoming cyborgs is that we don't know how to say whether a person retains control over the cyborg they become a part of.

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on A (EtA: quick) note on terminology: AI Alignment != AI x-safety · 2023-02-10T09:43:50.345Z · LW · GW

FWIW, I didn't mean to kick off a historical debate, which seems like probably not a very valuable use of y'all's time.

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on A (EtA: quick) note on terminology: AI Alignment != AI x-safety · 2023-02-10T09:42:52.157Z · LW · GW

Unfortunately, I think even "catastrophic risk" has a high potential to be watered down and be applied to situations where dozens as opposed to millions/billions die.  Even existential risk has this potential, actually, but I think it's a safer bet.

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on A (EtA: quick) note on terminology: AI Alignment != AI x-safety · 2023-02-10T09:39:29.838Z · LW · GW

I don't think we should try and come up with a special term for (1).
The best term might be "AI engineering".  The only thing it needs to be distinguished from is "AI science".

I think ML people overwhelmingly identify as doing one of those 2 things, and find it annoying and ridiculous when people in this community act like we are the only ones who care about building systems that work as intended.

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on A (EtA: quick) note on terminology: AI Alignment != AI x-safety · 2023-02-09T13:17:21.273Z · LW · GW

I say it is a rebrand of the "AI (x-)safety" community.
When AI alignment came along we were calling it AI safety, even though it was really basically AI existential safety all along that everyone in the community meant.  "AI safety" was (IMO) a somewhat successful bid for more mainstream acceptance, which then led to dilution and confusion, necessitating a new term.

I don't think the history is that important; what's important is having good terminology going forward.
This is also why I stress that I work on AI existential safety.

So I think people should just say what kind of technical work they are doing, and "existential safety" should be considered a sociotechnical problem that motivates a community of researchers, and used to refer to that problem and that community.  In particular, I think we are not able to cleanly delineate what is or isn't technical AI existential safety research at this point, and we should welcome intellectual debates about the nature of the problem and how different technical research may or may not contribute to increasing x-safety.

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on Why I hate the "accident vs. misuse" AI x-risk dichotomy (quick thoughts on "structural risk") · 2023-02-08T22:40:04.105Z · LW · GW

Hmm... this is a good point.

I think structural risk is often a better description of reality, but I can see a rhetorical argument against framing things that way.  One problem I see with doing that is that I think it leads people to think the solution is just for AI developers to be more careful, rather than observing that there will be structural incentives (etc.) pushing for less caution.

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on Why I hate the "accident vs. misuse" AI x-risk dichotomy (quick thoughts on "structural risk") · 2023-02-08T22:37:09.234Z · LW · GW

I do think that there’s a pretty solid dichotomy between (A) “the AGI does things specifically intended by its designers” and (B) “the AGI does things that the designers never wanted it to do”.

1) I don't think this dichotomy is as solid as it seems once you start poking at it... e.g. in your war example, it would be odd to say that the designers of the AGI systems that wiped out humans intended for that outcome to occur.  Intentions are perhaps best thought of as incomplete specifications.  

2) From our current position, I think “never ever create AGI” is a significantly easier thing to coordinate around than "don't build AGI until/unless we can do it safely".  I'm not very worried that we will coordinate too successfully and never build AGI and thus squander the cosmic endowment.  This is both because I think that's quite unlikely, and because I'm not sure we'll make very good / the best use of it anyways (e.g. think S-risk, other civilizations).

3) I think the conventional framing of AI alignment is something between vague and substantively incorrect, as well as being misleading.  Here is a post I dashed off about that:
https://www.lesswrong.com/posts/biP5XBmqvjopvky7P/a-note-on-terminology-ai-alignment-ai-x-safety.  I think creating such a manual is an incredibly ambitious goal, and I think more people in this community should aim for more moderate goals.  I mostly agree with the perspective in this post: https://coordination.substack.com/p/alignment-is-not-enough, but I could say more on the matter.

4) RE connotations of accident: I think they are often strong.

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on Why I hate the "accident vs. misuse" AI x-risk dichotomy (quick thoughts on "structural risk") · 2023-02-05T12:05:20.019Z · LW · GW

While defining accident as “incident that was not specifically intended & desired by the people who pressed ‘run’ on the AGI code” is extremely broad, it still supposes that there is such a thing as "the AGI code", which significantly restricts the space of possible risks.

There are other reasons I would not be happy with that browser extension.  There is not one specific conversation I can point to; it comes up regularly.  I think this replacement would probably lead to a lot of confusion, since I think when people use the word "accident" they often proceed as if it meant something stricter, e.g. that the result was unforeseen or unforeseeable.

If (as in "Concrete Problems", IMO) the point is just to point out that AI can get out-of-control, or that misuse is not the only risk, that's a worthwhile thing to point out, but it doesn't lead to a very useful framework for understanding the nature of the risk(s).  As I mentioned elsewhere, it is specifically the dichotomy of "accident vs. misuse" that I think is the most problematic and misleading.

I think the chart is misleading for the following reasons, among others:

  • It seems to suppose that there is such a manual, or the goal of creating one.  However, if we coordinate effectively, we can simply forgo development and deployment of dangerous technologies ~indefinitely.
  • It inappropriately separates "coordination problems" and "everyone follows the manual".
     
Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on Vanessa Kosoy's Shortform · 2023-02-05T12:01:16.830Z · LW · GW

I think the construction gives us $C(\pi) \leq C(U) + e$ for a small constant $e$ (representing the wrapper).  It seems like any compression you can apply to the reward function can be translated to the policy via the wrapper.  So then you would never have $C(\pi) \gg C(U)$.  What am I missing/misunderstanding?
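To spell out the bound I have in mind (my own sketch of the notation, with $W$ standing for the fixed wrapper program from the construction discussed in my other comment below):

\[
C(\pi) \;\le\; C(U) + C(W) \;=\; C(U) + e,
\]

since $\pi$ can be described as "$W$ applied to $U$"; any shorter description of $U$ therefore yields a comparably short description of $\pi$, which is why I don't see how $C(\pi) \gg C(U)$ could ever hold under this construction.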

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on Why I hate the "accident vs. misuse" AI x-risk dichotomy (quick thoughts on "structural risk") · 2023-02-02T12:39:33.747Z · LW · GW

By "intend" do you mean that they sought that outcome / selected for it?  
Or merely that it was a known or predictable outcome of their behavior?

I think "unintentional" would already probably be a better term in most cases. 

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on Vanessa Kosoy's Shortform · 2023-02-02T12:35:38.602Z · LW · GW

Apologies, I didn't take the time to understand all of this yet, but I have a basic question you might have an answer to...

We know how to map (deterministic) policies to reward functions using the construction at the bottom of page 6 of the reward modelling agenda (https://arxiv.org/abs/1811.07871v1): the agent is rewarded only if it has so far done exactly what the policy would do.  I think of this as a wrapper function (https://en.wikipedia.org/wiki/Wrapper_function).
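As a toy sketch of how I'm picturing that construction (my own simplification, not the paper's exact formalism):

```python
def make_wrapper_reward(policy):
    """Turn a deterministic `policy` (a function obs -> action) into a reward function.

    The reward is 1 as long as every action taken so far is exactly the action the
    policy would have taken, and 0 otherwise.  Any agent maximizing this reward must
    reproduce `policy`, so the reward function "represents" the policy at only the
    small extra description cost of this wrapper.
    """
    def reward(history):
        # history: list of (observation, action) pairs observed so far
        return 1.0 if all(policy(obs) == act for obs, act in history) else 0.0
    return reward

# Toy usage (hypothetical policy, just for illustration):
toy_policy = lambda obs: "left" if obs % 2 == 0 else "right"
R = make_wrapper_reward(toy_policy)
print(R([(0, "left"), (1, "right")]))  # 1.0: matches the policy so far
print(R([(0, "right")]))               # 0.0: deviated from the policy
```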

It seems like this means that, for any policy, we can represent it as optimizing reward with only the minimal overhead in description/computational complexity of the wrapper.

So...

  • Do you think this analysis is correct?  Or what is it missing?  (maybe the assumption that the policy is deterministic is significant?  This turns out to be the case for Orseau et al.'s "Agents and Devices" approach, I think https://arxiv.org/abs/1805.12387).
  • Are you trying to get around this somehow?  Or are you fine with this minimal overhead being used to distinguish goal-directed from non-goal directed policies?
Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on Why I hate the "accident vs. misuse" AI x-risk dichotomy (quick thoughts on "structural risk") · 2023-02-01T18:27:48.406Z · LW · GW

"Concrete Problems in AI Safety" used this distinction to make this point, and I think it was likely a useful simplification in that context.  I generally think spelling it out is better, and I think people will pattern match your concerns onto the “the sci-fi scenario where AI spontaneously becomes conscious, goes rogue, and pursues its own goal” or "boring old robustness problems" if you don't invoke structural risk.  I think structural risk plays a crucial role in the arguments, and even if you think things that look more like pure accidents are more likely, I think the structural risk story is more plausible to more people and a sufficient cause for concern.

RE (A): A known side-effect is not an accident.


 

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on Why I hate the "accident vs. misuse" AI x-risk dichotomy (quick thoughts on "structural risk") · 2023-01-31T09:59:10.625Z · LW · GW

I agree somewhat, however, I think we need to be careful to distinguish "do unsavory things" from "cause human extinction", and should generally be squarely focused on the latter.  The former easily becomes too political, making coordination harder.

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on Why I hate the "accident vs. misuse" AI x-risk dichotomy (quick thoughts on "structural risk") · 2023-01-31T09:56:52.792Z · LW · GW

Yes it may be useful in some very limited contexts.  I can't recall a time I have seen it in writing and felt like it was not a counter-productive framing.

AI is highly non-analogous with guns.

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on Why I hate the "accident vs. misuse" AI x-risk dichotomy (quick thoughts on "structural risk") · 2023-01-31T09:55:21.003Z · LW · GW

I think "inadequate equilibrium" is too specific and insider jargon-y.

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on Why I hate the "accident vs. misuse" AI x-risk dichotomy (quick thoughts on "structural risk") · 2023-01-31T09:53:29.493Z · LW · GW

I really don't think the distinction is meaningful or useful in almost any situation.  I think if people want to make something like this distinction they should just be more clear about exactly what they are talking about.

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on AI will change the world, but won’t take it over by playing “3-dimensional chess”. · 2023-01-29T15:47:52.410Z · LW · GW

This is a great post.  Thanks for writing it!  I think Figure 1 is quite compelling and thought provoking.
I began writing a response, and then realized a lot of what I wanted to say has already been said by others, so I just noted where that was the case.  I'll focus on points of disagreement.

Summary: I think the basic argument of the post is well summarized in Figure 1, and by Vanessa Kosoy’s comment.

A high-level counter-argument I didn't see others making: 

  • I wasn't entirely sure what your argument was for why long-term planning ability saturates... I've seen this argued based on both complexity and chaos, and I think here it's a bit of a mix of both.
    • Counter-argument to chaos-argument: It seems we can make meaningful predictions of many relevant things far into the future (e.g. that the sun's remaining natural life-span is 7-8 billion years).
    • Counter-argument to complexity-argument: Increases in predictive ability can have highly non-linear returns, both in terms of planning depth and planning accuracy.  
      • Depth: You often only need to be "one step ahead" of your adversary in order to defeat them and win the whole "prize" (e.g. market or geopolitical dominance); for instance, if I can predict the weather one day further ahead, this could have a major impact on military strategy.
      • Accuracy: If you can make more accurate predictions about, e.g. how prices of assets will change, you can make a killing in finance.
         

High-level counter-arguments I would've made that Vanessa already made: 

  • This argument proves too much: it suggests that there are not major differences in ability to do long-term planning that matter.
  • Humans have not reached the limits of predictive ability


Low-level counter-arguments:

  • RE Claim 1: Why would AI only have an advantage in IQ as opposed to other forms of intelligence / cognitive skill?  No argument is provided.
  • (Argued by Jonathan Uesato) RE Claim 3: Scaling laws provide ~zero evidence that we are at the limit of “what can be achieved with a certain level of resources”.
Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on AI will change the world, but won’t take it over by playing “3-dimensional chess”. · 2023-01-29T15:36:49.453Z · LW · GW

This is a great post.  Thanks for writing it!

I agree with a lot of the counter-arguments others have mentioned.

Summary:

  • I think the basic argument of the post is well summarized in Figure 1, and by Vanessa Kosoy’s comment.

     
  • High-level counter-arguments already argued by Vanessa: 
    • This argument proves too much: it suggests that there are not major differences in ability to do long-term planning that matter.
    • Humans have not reached the limits of predictive ability


 

  • You often only need to be one step ahead of your adversary to defeat them.
  • Prediction accuracy is not the relevant metric: an incremental increase in depth-of-planning could be decisive in conflicts (e.g. if I can predict the weather one day further ahead, this could have a major impact on military strategy).
    • More generally, the ability to make large / highly leveraged bets on future outcomes means that slight advantages in prediction ability could be decisive.


 

  • Low-level counter-arguments:
  • RE Claim 1: Why would AI only have an advantage in IQ as opposed to other forms of intelligence / cognitive skill?  No argument is provided.
  • (Argued by Jonathan Uesato) RE Claim 3: Scaling laws provide ~zero evidence that we are at the limit of “what can be achieved with a certain level of resources”.
  • RE Claim 5: Systems trained with short-term objectives can learn to do long-term planning competently.
Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on What does it take to defend the world against out-of-control AGIs? · 2023-01-29T15:35:20.703Z · LW · GW

This post tacitly endorses the "accident vs. misuse" dichotomy.
Every time this appears, I feel compelled to mention that I think it is a terrible framing.
I believe the large majority of AI x-risk is best understood as "structural" in nature: https://forum.effectivealtruism.org/posts/oqveRcMwRMDk6SYXM/clarifications-about-structural-risk-from-ai

 

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on Quick thoughts on "scalable oversight" / "super-human feedback" research · 2023-01-25T23:25:16.142Z · LW · GW

I understand your point of view and think it is reasonable.

However, I don't think "don't build bigger models" and "don't train models to do complicated things" need to be at odds with each other.  I see the argument you are making, but I think success on these asks is likely highly correlated via the underlying causal factor of humanity being concerned enough about AI x-risk and coordinated enough to ensure responsible AI development.

I also think the training procedure matters a lot (and you seem to be suggesting otherwise?), since if you don't do RL or other training schemes that seem designed to induce agentyness and you don't do tasks that use an agentic supervision signal, then you probably don't get agents for a long time (if ever).

 

Comment by David Scott Krueger (formerly: capybaralet) (capybaralet) on Quick thoughts on "scalable oversight" / "super-human feedback" research · 2023-01-25T23:18:57.006Z · LW · GW

(A very quick response):


Agree with (1) and (2).  
I am ambivalent RE (3) and the replaceability arguments.
RE (4): I largely agree, but I think the norm should be "let's try to do less ambitious stuff properly" rather than "let's try to do the most ambitious stuff we can, and then try to figure out how to do it as safely as possible as a secondary objective".