[Linkpost] Interpretability Dreams 2023-05-24T21:08:17.254Z
Difficulties in making powerful aligned AI 2023-05-14T20:50:05.304Z
AXRP Episode 21 - Interpretability for Engineers with Stephen Casper 2023-05-02T00:50:07.045Z
Podcast with Divia Eden and Ronny Fernandez on the strong orthogonality thesis 2023-04-28T01:30:45.681Z
AXRP Episode 20 - ‘Reform’ AI Alignment with Scott Aaronson 2023-04-12T21:30:06.929Z
[Link] A community alert about Ziz 2023-02-24T00:06:00.027Z
Video/animation: Neel Nanda explains what mechanistic interpretability is 2023-02-22T22:42:45.054Z
[linkpost] Better Without AI 2023-02-14T17:30:53.043Z
AXRP: Store, Patreon, Video 2023-02-07T04:50:05.409Z
Podcast with Oli Habryka on LessWrong / Lightcone Infrastructure 2023-02-05T02:52:06.632Z
AXRP Episode 19 - Mechanistic Interpretability with Neel Nanda 2023-02-04T03:00:11.144Z
First Three Episodes of The Filan Cabinet 2023-01-18T19:20:06.588Z
Podcast with Divia Eden on operant conditioning 2023-01-15T02:44:29.706Z
On Blogging and Podcasting 2023-01-09T00:40:00.908Z
Things I carry almost every day, as of late December 2022 2022-12-30T07:40:01.261Z
Announcing The Filan Cabinet 2022-12-30T03:10:00.494Z
Takeaways from a survey on AI alignment resources 2022-11-05T23:40:01.917Z
AXRP Episode 18 - Concept Extrapolation with Stuart Armstrong 2022-09-03T23:12:01.242Z
AXRP Episode 17 - Training for Very High Reliability with Daniel Ziegler 2022-08-21T23:50:20.513Z
AXRP Episode 16 - Preparing for Debate AI with Geoffrey Irving 2022-07-01T22:20:18.456Z
AXRP Episode 15 - Natural Abstractions with John Wentworth 2022-05-23T05:40:19.293Z
AXRP Episode 14 - Infra-Bayesian Physicalism with Vanessa Kosoy 2022-04-05T23:10:09.817Z
AXRP Episode 13 - First Principles of AGI Safety with Richard Ngo 2022-03-31T05:20:17.883Z
What’s the chance a smart London resident dies of a Russian nuke in the next month? 2022-03-10T19:20:01.434Z
A Nice Representation of the Laplacian 2022-02-12T03:20:00.918Z
AXRP Episode 12 - AI Existential Risk with Paul Christiano 2021-12-02T02:20:17.041Z
Even if you're right, you're wrong 2021-11-22T05:40:00.747Z
The Meta-Puzzle 2021-11-22T05:30:01.031Z
Everything Studies on Cynical Theories 2021-10-27T01:31:20.608Z
AXRP Episode 11 - Attainable Utility and Power with Alex Turner 2021-09-25T21:10:26.995Z
Announcing the Vitalik Buterin Fellowships in AI Existential Safety! 2021-09-21T00:33:08.074Z
AXRP Episode 10 - AI’s Future and Impacts with Katja Grace 2021-07-23T22:10:14.624Z
Handicapping competitive games 2021-07-22T03:00:00.498Z
CGP Grey on the difficulty of knowing what's true [audio] 2021-07-13T20:40:13.506Z
A second example of conditional orthogonality in finite factored sets 2021-07-07T01:40:01.504Z
A simple example of conditional orthogonality in finite factored sets 2021-07-06T00:36:40.264Z
AXRP Episode 9 - Finite Factored Sets with Scott Garrabrant 2021-06-24T22:10:12.645Z
Up-to-date advice about what to do upon getting COVID? 2021-06-19T02:37:10.940Z
AXRP Episode 8 - Assistance Games with Dylan Hadfield-Menell 2021-06-08T23:20:11.985Z
AXRP Episode 7.5 - Forecasting Transformative AI from Biological Anchors with Ajeya Cotra 2021-05-28T00:20:10.801Z
AXRP Episode 7 - Side Effects with Victoria Krakovna 2021-05-14T03:50:11.757Z
Challenge: know everything that the best go bot knows about go 2021-05-11T05:10:01.163Z
AXRP Episode 6 - Debate and Imitative Generalization with Beth Barnes 2021-04-08T21:20:12.891Z
AXRP Episode 5 - Infra-Bayesianism with Vanessa Kosoy 2021-03-10T04:30:10.304Z
Privacy vs proof of character 2021-02-28T02:03:31.009Z
AXRP Episode 4 - Risks from Learned Optimization with Evan Hubinger 2021-02-18T00:03:17.572Z
AXRP Episode 3 - Negotiable Reinforcement Learning with Andrew Critch 2020-12-29T20:45:23.435Z
AXRP Episode 2 - Learning Human Biases with Rohin Shah 2020-12-29T20:43:28.190Z
AXRP Episode 1 - Adversarial Policies with Adam Gleave 2020-12-29T20:41:51.578Z
Cognitive mistakes I've made about COVID-19 2020-12-27T00:50:05.212Z


Comment by DanielFilan on Short Remark on the (subjective) mathematical 'naturalness' of the Nanda--Lieberum addition modulo 113 algorithm · 2023-06-02T17:00:54.454Z · LW · GW

Like, the only reason we're calling it a "Fourier basis" is that we're looking at a few different speeds of rotation, in order to scramble the second-place answers that almost get you a cos of 1 at the end, while preserving the actual answer.

Comment by DanielFilan on Short Remark on the (subjective) mathematical 'naturalness' of the Nanda--Lieberum addition modulo 113 algorithm · 2023-06-02T16:58:49.777Z · LW · GW

I agree a rotation matrix story would fit better, but I do think it's a fair analogy: the numbers stored are just coses and sines, aka the x and y coordinates of the hour hand.

Comment by DanielFilan on Short Remark on the (subjective) mathematical 'naturalness' of the Nanda--Lieberum addition modulo 113 algorithm · 2023-06-01T21:28:35.270Z · LW · GW

My submission: when we teach modular arithmetic to people, we do it using the metaphor of clock arithmetic. Well, if you ignore the multiple frequencies and argmax weirdness, clock arithmetic is exactly what this network is doing! Find the coordinates of rotating the hour hand (on a 113-hour clock) x hours, then y hours, use trig identities to work out what it would be if you rotated x+y hours, then count how many steps back you have to rotate to get to 0 to tell where you ended up. In fairness, the final step is a little bit different than the usual imagined rule of "look at the hour mark where the hand ends up", but not so different that clock arithmetic counts as a bad prediction IMO.
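To make the analogy concrete, here is a small sketch of the clock-arithmetic story in code. This is not the network's actual weights or layer structure, just the algorithm as described: rotate by x hours, rotate by y hours, combine via the angle-addition identities, then find the residue whose hand position matches. The single frequency and exact trig functions are simplifying assumptions.

```python
import math

P = 113  # the modulus in the Nanda--Lieberum task

def clock_add(x, y, p=P, freq=1):
    # Represent each input as the (cos, sin) coordinates of an hour hand
    # on a p-hour clock, rotated at one of the network's frequencies.
    w = 2 * math.pi * freq / p
    # Angle-addition identities: coordinates of the hand after rotating
    # x hours and then y hours, computed without ever forming x + y.
    cos_sum = math.cos(w * x) * math.cos(w * y) - math.sin(w * x) * math.sin(w * y)
    sin_sum = math.sin(w * x) * math.cos(w * y) + math.cos(w * x) * math.sin(w * y)
    # "Count how many steps back you have to rotate to get to 0":
    # the score for candidate c is cos(w * (x + y - c)), maximized at
    # c = (x + y) mod p.
    return max(range(p), key=lambda c: cos_sum * math.cos(w * c) + sin_sum * math.sin(w * c))

assert clock_add(60, 80) == (60 + 80) % P
```

In the actual network this is done at several frequencies at once, with the argmax taken over the summed scores, which sharpens the winner against the runner-up answers.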

Comment by DanielFilan on DanielFilan's Shortform Feed · 2023-06-01T20:52:03.245Z · LW · GW

An attempt at rephrasing a shard theory critique of utility function reasoning, while restricting myself to things I basically agree with:

Yes, there are representation theorems that say coherent behaviour is optimizing some utility function. And yes, for the sake of discussion let's say this extends to reward functions in the setting of sequential decision-making (even though I don't remember seeing a theorem for that). But just because there's a mapping doesn't mean that we can pull back a uniform measure on utility/reward functions to get a reasonable measure on agents - those theorems don't tell us that we should expect a uniform distribution on utility/reward functions, or even a nice distribution! They would if agents were born with utility functions in their heads represented as tables or something, where you could swap entries in different rows, but that's not what the theorems say!
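For reference, the conclusion of the VNM theorem being gestured at is only an existence-and-uniqueness claim about a representing utility function, not a claim about any distribution over such functions (notation is mine, not from the comment):

```latex
% VNM representation theorem, statement only.
% If a preference relation \succeq over lotteries is complete, transitive,
% continuous, and satisfies independence, then
\exists\, u :\quad A \succeq B \iff \mathbb{E}_A[u] \ge \mathbb{E}_B[u],
% with u unique only up to positive affine transformation:
u' = a u + b, \quad a > 0.
```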

Comment by DanielFilan on Seriously, what goes wrong with "reward the agent when it makes you smile"? · 2023-06-01T20:26:07.280Z · LW · GW

Not having read other responses, my attempt to answer in my own words: what goes wrong is that there are tons of possible cognitive influences that could be reinforced by rewards for making people smile. E.g. "make things of XYZ type think things are going OK", "try to promote physical configurations like such-and-such", "trying to stimulate the reinforcer I observe in my environment". Most of these decision-influences, when extrapolated to coherent behaviour where those decision-influences drive the course of the behaviour, lead to resource-gathering and not respecting what the informed preferences of humans would be. Then this causes doom because you can better achieve most goals/preferences you could have by having more power and disempowering the humans.

Comment by DanielFilan on Power-seeking can be probable and predictive for trained agents · 2023-06-01T19:28:51.390Z · LW · GW

I think you're making a mistake: policies can be reward-optimal even if there's no obvious box labelled "reward" whose outputs they're optimal with respect to. Similarly, the formalism of "reward" can be useful even if this box doesn't exist, or even if the policy isn't behaving the way you would expect if you identified that box with the reward function. To be fair, the post sort of makes this mistake by talking about "internal representations", but I think everything goes through if you strike out that talk.

The main thing I want to talk about

I can talk about utility functions instead (which would be equivalent to value functions in this case)

I disagree that these are equivalent, and expect the policy and value function to come apart in practice. Indeed, that was observed in the original goal misgeneralization paper (3.3, actor-critic inconsistency).

I think you're the one who's imposing a type error here. For "value functions" to be useful in modelling a policy, it doesn't have to be the case that the policy is acting optimally with respect to a suggestively-labeled critic - it just has to be the case that the agent is acting consistently with some value function. Analogously, momentum is conserved in classical mechanics, even if objects have labels on them that inaccurately say "my momentum is 23 kg m/s".

Anyways, we can talk about utility functions, but then we're going to lose claim to predictiveness, no? Why should we assume that the network will internally represent a scalar function over observations, consistent with a historical training signal's scalar values (and let's not get into nonstationary reward), such that the network will maximize discounted sum return of this internally represented function? That seems highly improbable to me, and I don't think reality will be "basically that" either.

The utility function formalism doesn't require agents to "internally represent a scalar function over observations". You'll notice that this isn't one of the conclusions of the VNM theorem.

Another thing I don't get

My point is rather that these results are not predictive because the assumption won't hold. The assumptions are already known to not be good approximations of trained policies, in at least some prototypical RL situations.

What part of the post you link rules this out? As far as I can tell, the thing you're saying is that a few factors influence the decisions of the maze-solving agent, which isn't incompatible with the agent acting optimally with respect to some reward function such that it produces training-reward-optimal behaviour on the training set.

Comment by DanielFilan on Reacts now randomly enabled on 50% of posts (you can still manually change this yourself in the post settings) · 2023-05-28T23:30:14.949Z · LW · GW

Unrelatedly: it's weird to me that the wordings of "I'll reply later" and "Not planning to respond" don't mirror each other.

Comment by DanielFilan on Reacts now randomly enabled on 50% of posts (you can still manually change this yourself in the post settings) · 2023-05-28T23:26:16.517Z · LW · GW

For what it's worth: when selecting a voting system for my post, it looks like if I pick name-attached reactions, I won't get the two-axis voting, which I assume isn't correct?

Comment by DanielFilan on Rishi Sunak mentions "existential threats" in talk with OpenAI, DeepMind, Anthropic CEOs · 2023-05-24T23:35:35.876Z · LW · GW

Is there any firm evidence that Rishi Sunak mentioned existential threats in the discussion? The way I read the press release, it could be that the CEOs mentioned them.

Comment by DanielFilan on Open Thread With Experimental Feature: Reactions · 2023-05-24T19:30:42.400Z · LW · GW

Maybe too hard but it might be nice to have somewhere you can go to see all the comments you've reacted "I plan to respond later" to that you haven't yet responded to.

Comment by DanielFilan on Open Thread With Experimental Feature: Reactions · 2023-05-24T19:29:00.278Z · LW · GW

I'm not bothered by this, but it does seem wrong that the button to click to add a react is in a different place from where the existing reacts are displayed.

Comment by DanielFilan on Open Thread With Experimental Feature: Reactions · 2023-05-24T19:28:00.754Z · LW · GW

Maybe too much semantic content to be a react, but I wanted to reach for "That's a good thing" in response to this comment

Comment by DanielFilan on Phone Number Jingle · 2023-05-23T17:30:36.118Z · LW · GW

I guess it could be that that hotline had a more consistent phone number and advertisement style than others.

Comment by DanielFilan on Phone Number Jingle · 2023-05-23T17:24:47.510Z · LW · GW

Australia has a phone number you can call if you need help improving your literacy, and there were a bunch of TV ads advertising this number via a jingle that is catchy enough that I still remember it to this day (not having lived in the country since 2016). Plausibly that jingle was more optimized than most for memorability given its target market, and indeed I don't think I remember any other phone number jingles. If true, this suggests that returns to optimizing for catchiness continue for quite a while.

Comment by DanielFilan on rohinmshah's Shortform · 2023-05-15T18:21:34.648Z · LW · GW

I like this, but it feels awkward to say that something can be not inside a space of "possibilities" but still be "possible". Maybe "possibilities" here should be "imagined scenarios"?

Comment by DanielFilan on rohinmshah's Shortform · 2023-05-15T18:18:31.658Z · LW · GW

I guess this doesn't fit with the use in the Truthful AI paper that you quote. Also in that case I have an objection that only punishing for negligence may incentivize an AI to lie in cases where it knows the truth but thinks the human thinks the AI doesn't/can't know the truth, compared to a "strict liability" regime.

Comment by DanielFilan on rohinmshah's Shortform · 2023-05-15T18:15:59.736Z · LW · GW

I wonder if this use of "fair" is tracking (or attempting to track) something like "this problem only exists in an unrealistically restricted action space for your AI and humans - in worlds where it can ask questions, and we can make reasonable preparation to provide obviously relevant info, this won't be a problem".

Comment by DanielFilan on Steering GPT-2-XL by adding an activation vector · 2023-05-14T21:48:22.577Z · LW · GW

I think you might be interpreting the break after the sentence "Their results are further evidence for feature linearity and internal activation robustness in these models." as the end of the related work section? I'm not sure why that break is there, but the section continues with them citing Mikolov et al (2013), Larsen et al (2015), White (2016), Radford et al (2016), and Upchurch et al (2016) in the main text, as well as a few more papers in footnotes.

Comment by DanielFilan on Why aren’t more of us working to prevent AI hell? · 2023-05-04T23:30:33.712Z · LW · GW

I don't believe that reducing s-risks from AI involves substantially different things than those you'd need to deal with AI alignment.

Comment by DanielFilan on AGI safety career advice · 2023-05-02T17:17:31.518Z · LW · GW

Alignment is an unusual field because the base of fans and supporters is much larger than the number of researchers

Isn't this entirely usual? Like, I'd assume that there are more readers of popular physics books than working physicists. Similarly for nature documentary viewers vs biologists.

Comment by DanielFilan on Tuning your Cognitive Strategies · 2023-04-27T21:27:52.992Z · LW · GW

Comment by DanielFilan on Tuning your Cognitive Strategies · 2023-04-27T21:25:02.149Z · LW · GW

Comment by DanielFilan on Tuning your Cognitive Strategies · 2023-04-27T21:24:26.130Z · LW · GW

Wayback machine

Comment by DanielFilan on Tuning your Cognitive Strategies · 2023-04-27T21:24:08.640Z · LW · GW

ITT: links to the original post on various archives.

Comment by DanielFilan on [deleted post] 2023-04-26T18:04:06.849Z

Looks like this tag is being used both for fictional dialogues and actual dialogues. This seems kind of bad to me, in that I don't want to conflate "what an author imagines a hypothetical interlocutor might say" with "what an actual interlocutor said". Any chance we should split up the tag?

Comment by DanielFilan on AGI ruin mostly rests on strong claims about alignment and deployment, not about society · 2023-04-24T20:26:16.639Z · LW · GW

could be a subtitle (prefixed with the word "Or,")?

Comment by DanielFilan on But why would the AI kill us? · 2023-04-20T21:42:16.761Z · LW · GW

Re: optimality in trading partners, I'm talking about whether humans are the best trading partner out of trading partners the AI could feasibly have, as measured by whether trading with us gets the AI what it wants. You're right that we have some advantages, mainly that we're a known quantity that's already there. But you could imagine more predictable things that sync with the AI's thoughts better, operate more efficiently, etc.

We just don't know so I think it's more fair to say that "likely not much to offer for a super-intelligent maximizer".

Maybe we agree? I read this as compatible with the original quote "humans are probably not the optimal trading partners".

Comment by DanielFilan on But why would the AI kill us? · 2023-04-20T21:38:39.478Z · LW · GW

Or do you mean that neural networks would develop an indirect goal as side product of training conditions or via some hidden variable?

This one: I mean that given the way we train AIs, the things that emerge will be things that pursue goals, at least in some weak sense. So, e.g., suppose you're training an AI to write valid math proofs via way 2. Probably the best way to do that is to try to gain a bunch of knowledge about math, use your computation efficiently, figure out good ways of reasoning, etc. And the idea would be that as the system gets more advanced, it's able to pursue these goals more and more effectively, which ends up disempowering humans (because we're using a bunch of energy that could be devoted to running computations).

Comment by DanielFilan on But why would the AI kill us? · 2023-04-20T21:35:07.799Z · LW · GW

My original point was to contrast between AI having a goal or goals as some emerging property of large neural networks versus us humans giving it goals one way or the other.

Fair enough - I just want to make the point that humans giving AIs goals is a common thing. I guess I'm assuming in the background "and it's hard to write a goal that doesn't result in human disempowerment" but didn't argue for that.

Comment by DanielFilan on Moderation notes re: recent Said/Duncan threads · 2023-04-18T03:40:51.933Z · LW · GW

Oops, sorry for saying something that probabilistically implied a strawman of you.

Comment by DanielFilan on Moderation notes re: recent Said/Duncan threads · 2023-04-18T01:23:45.851Z · LW · GW

I could imagine an admin feature that literally just lets Said comment a few times on a post, but if he gets significantly downvoted, gives him a wordcount-based rate-limit that forces him to wrap up his current points quickly and then call it a day.

I feel like this incentivizes comments to be short, which doesn't make them less aggravating to people. For example, IIRC people have complained about him commenting "Examples?". This is not going to be hit hard by a rate limit.

Comment by DanielFilan on But why would the AI kill us? · 2023-04-18T01:11:31.389Z · LW · GW

But WHY would the AGI "want" anything at all unless humans gave it a goal(/s)?

There are two main ways we make AIs:

  1. writing programs that evaluate actions they could take in terms of how well those actions would achieve some goal, and choosing the best one
  2. taking a big neural network and jiggling the numbers that define it until it starts doing some task we pre-designated.

In way 1, it seems like your AI "wants" to achieve its goal in the relevant sense. In way 2, it seems like for hard enough goals, probably the only way to achieve them is to be thinking about how to achieve them and picking actions that succeed - or to somehow be doing cognition that leads to similar outcomes (like being sure to think about how well you're doing at stuff, how to manage resources, etc.).
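A toy sketch of way 2, just to pin down the "jiggle the numbers" picture: start from random numbers and keep any random perturbation that improves a pre-designated score. The function names, step sizes, and parameter count here are made up for illustration, and real training uses gradients rather than pure random search.

```python
import random

def jiggle_train(score, n_params=8, steps=2000):
    # "Way 2" as caricature: randomly jiggle the numbers defining the
    # system, keeping a jiggle whenever it improves the task score.
    params = [random.gauss(0, 1) for _ in range(n_params)]
    best = score(params)
    for _ in range(steps):
        candidate = [p + random.gauss(0, 0.1) for p in params]
        s = score(candidate)
        if s > best:  # keep only improvements on the pre-designated task
            params, best = candidate, s
    return params, best
```

For example, with `score` set to negative squared error against some target vector, the jiggled parameters drift toward the target without the process ever containing an explicit "want" for it.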

IF AGI got hell bent on own survival and improvement of itself to maximize goal "X" even then it might value the informational formations of our atoms more than the energy it could gain from those atoms,

It might - but if an alien wanted to extract as much information out of me as possible, it seems like that's going to involve limiting my ability to mess with that alien's sensors at minimum, and plausibly involves just destructively scanning me (depending on what type of info the alien wants). For humans to continue being free-range it needs to be the case that the AI wants to know how we behave under basically no limitations, and also your AI isn't able to simulate us well enough to answer that question - which sounds like a pretty specific goal for an AI to have, such that you shouldn't expect an AI to have that sort of goal without strong evidence.

humans are probably not the optimal trading partners.

Probably? Based on what?

Most things aren't the optimal trading partner for any given intelligence, and it's hard to see why humans should be so lucky. The best answer would probably be "because the AI is designed to be compatible with humans and not other things" but that's going to rely on getting alignment very right.

Comment by DanielFilan on Moderation notes re: recent Said/Duncan threads · 2023-04-16T19:52:26.279Z · LW · GW

I think weird bugs are neat.

Comment by DanielFilan on LW Team is adjusting moderation policy · 2023-04-05T01:04:52.791Z · LW · GW

What's a "new user"? It seems like this category matters for moderation but I don't see a definition of it. (Maybe you're hoping to come up with one?)

Comment by DanielFilan on My Objections to "We’re All Gonna Die with Eliezer Yudkowsky" · 2023-04-05T01:00:32.228Z · LW · GW

gotcha, thanks!

Comment by DanielFilan on FLI open letter: Pause giant AI experiments · 2023-03-29T18:03:35.850Z · LW · GW

I think you can support a certain policy without putting your name to a flawed argument for that policy. And indeed ensuring that typical arguments for your policy are high-quality is a form of support.

Comment by DanielFilan on EIS VI: Critiques of Mechanistic Interpretability Work in AI Safety · 2023-03-29T01:03:43.694Z · LW · GW

In this same example, these interpretations are then validated by using a different interpretability tool – test set exemplars. This begs the question of why we shouldn’t just use test set exemplars instead.

Doesn't Olah et al (2017) answer this in the "Why visualize by optimization" section, where they show a bunch of cases where neurons fire on similar test set exemplars, but visualization by optimization appears to reveal that the neurons are actually 'looking for' specific aspects of those images?

Comment by DanielFilan on EIS II: What is “Interpretability”? · 2023-03-28T23:21:03.973Z · LW · GW

I guess this proves the superiority of the mechanistic interpretability technique "note that it is mechanistically possible for your model to say that things are gorillas" :P

Comment by DanielFilan on EIS II: What is “Interpretability”? · 2023-03-28T23:20:25.753Z · LW · GW

Re: the gorilla example, seems worth noting that the solution that was actually deployed ended up being refusing to classify anything as a gorilla, at least as of 2018 (perhaps things have changed since then).

Comment by DanielFilan on My Objections to "We’re All Gonna Die with Eliezer Yudkowsky" · 2023-03-23T00:49:32.357Z · LW · GW

Yeah, I guess I think words are the things with spaces between them. I get that this isn't very linguistically deep, and there are edge cases (e.g. hyphenated things, initialisms), but there are sentences that have an unambiguous number of words.

Comment by DanielFilan on My Objections to "We’re All Gonna Die with Eliezer Yudkowsky" · 2023-03-22T23:02:38.182Z · LW · GW

But for true powered controlled flight - it is exactly similarity to birds that gave them confidence as avian flight control is literally the source of their key innovation.

Why do you think the confidence came from this and not from the fact that

they downloaded a library of existing flyer designs from the smithsonian and then developed a wind tunnel to test said designs at high throughput before selecting a few for full-scale physical prototypes.

Comment by DanielFilan on My Objections to "We’re All Gonna Die with Eliezer Yudkowsky" · 2023-03-22T01:32:44.322Z · LW · GW

This is a good corrective, and also very compatible with "similarity to birds is not what gave the Wright brothers confidence that their plane would fly".

Comment by DanielFilan on My Objections to "We’re All Gonna Die with Eliezer Yudkowsky" · 2023-03-22T00:32:14.062Z · LW · GW

Well, I'm only arguing from surface features of Eliezer's comments, so I could be wrong too :P

Comment by DanielFilan on My Objections to "We’re All Gonna Die with Eliezer Yudkowsky" · 2023-03-22T00:19:39.263Z · LW · GW

It's pretty easily definable in English, at least in special cases, and my understanding is that GPT-4 fails in those cases.

Comment by DanielFilan on My Objections to "We’re All Gonna Die with Eliezer Yudkowsky" · 2023-03-22T00:17:52.566Z · LW · GW

BTW: the way I found that first link was by searching the title on google scholar, finding the paper, and clicking "All 5 versions" below (it's right next to "Cited by 7" and "Related articles"). That brought me to a bunch of versions, one of which was a seemingly-ungated PDF. This will probably frequently work, because AI researchers usually make their papers publicly available (at least in pre-print form).

Comment by DanielFilan on My Objections to "We’re All Gonna Die with Eliezer Yudkowsky" · 2023-03-22T00:15:16.845Z · LW · GW

For the 40 parameters thing, this link should work. See also this earlier paper.

Comment by DanielFilan on My Objections to "We’re All Gonna Die with Eliezer Yudkowsky" · 2023-03-22T00:13:11.659Z · LW · GW

I think asking someone to do something is pretty different from ordering someone to do something. I also think for the sake of the conversation it's good if there's public, non-DM evidence that he did that: you'd make a pretty different inference if he just picked one point and said that Quintin misunderstood him, compared to once you know that that's the point Quintin picked as his strongest objection.

Comment by DanielFilan on My Objections to "We’re All Gonna Die with Eliezer Yudkowsky" · 2023-03-21T07:37:51.042Z · LW · GW

To tie up this thread: I started writing a more substantive response to a section but it took a while and was difficult and I then got invited to dinner, so probably won't get around to actually writing it.

Comment by DanielFilan on My Objections to "We’re All Gonna Die with Eliezer Yudkowsky" · 2023-03-21T07:33:53.023Z · LW · GW

I don't want to get super hung up on this because it's not about anything Yudkowsky has said but:

Consider the whole transformed line of reasoning:

avian flight comes from a lot of factors; you can't just ape one of the factors and expect the rest to follow; to get an entity which flies, that entity must be as close to a bird as birds are to each other.

IMO this is not a faithful transformation of the line of reasoning you attribute to Yudkowsky, which was:

human intelligence/alignment comes from a lot of factors; you can't just ape one of the factors and expect the rest to follow; to get a mind which wants as humans do, that mind must be as close to a human as humans are to each other.

Specifically, where you wrote "an entity which flies", you were transforming "a mind which wants as humans do", which I think should instead be transformed to "an entity which flies as birds do". And indeed planes don't fly like birds do. [EDIT: two minutes or so after pressing enter on this comment, I now see how you could read it your way]

I guess if I had to make an analogy I would say that you have to be pretty similar to a human to think the way we do, but probably not to pursue the same ends, which is probably the point you cared about establishing.

Comment by DanielFilan on My Objections to "We’re All Gonna Die with Eliezer Yudkowsky" · 2023-03-21T07:17:31.426Z · LW · GW

I think this is evidence against the hypothesis that a system trained to make lots of correct predictions will thereby intrinsically value making lots of correct predictions.

Note that Yudkowsky said

maybe if you train a thing really hard to predict humans, then among the things that it likes are tiny, little pseudo-things that meet the definition of human, but weren't in its training data, and that are much easier to predict

which isn't at all the same thing as intrinsically valuing making lots of correct predictions. A better analogy would be the question of whether humans like things that are easier to visually predict. (Except that's presumably one of many things that went into human RL, so presumably this is a weaker prediction for humans than it is for GPT-n?)