Watermarking considered overrated? 2023-07-31T21:36:05.268Z
AXRP Episode 24 - Superalignment with Jan Leike 2023-07-27T04:00:02.106Z
AXRP Episode 23 - Mechanistic Anomaly Detection with Mark Xu 2023-07-27T01:50:02.808Z
AXRP announcement: Survey, Store Closing, Patreon 2023-06-28T23:40:02.537Z
AXRP Episode 22 - Shard Theory with Quintin Pope 2023-06-15T19:00:01.340Z
[Linkpost] Interpretability Dreams 2023-05-24T21:08:17.254Z
Difficulties in making powerful aligned AI 2023-05-14T20:50:05.304Z
AXRP Episode 21 - Interpretability for Engineers with Stephen Casper 2023-05-02T00:50:07.045Z
Podcast with Divia Eden and Ronny Fernandez on the strong orthogonality thesis 2023-04-28T01:30:45.681Z
AXRP Episode 20 - ‘Reform’ AI Alignment with Scott Aaronson 2023-04-12T21:30:06.929Z
[Link] A community alert about Ziz 2023-02-24T00:06:00.027Z
Video/animation: Neel Nanda explains what mechanistic interpretability is 2023-02-22T22:42:45.054Z
[linkpost] Better Without AI 2023-02-14T17:30:53.043Z
AXRP: Store, Patreon, Video 2023-02-07T04:50:05.409Z
Podcast with Oli Habryka on LessWrong / Lightcone Infrastructure 2023-02-05T02:52:06.632Z
AXRP Episode 19 - Mechanistic Interpretability with Neel Nanda 2023-02-04T03:00:11.144Z
First Three Episodes of The Filan Cabinet 2023-01-18T19:20:06.588Z
Podcast with Divia Eden on operant conditioning 2023-01-15T02:44:29.706Z
On Blogging and Podcasting 2023-01-09T00:40:00.908Z
Things I carry almost every day, as of late December 2022 2022-12-30T07:40:01.261Z
Announcing The Filan Cabinet 2022-12-30T03:10:00.494Z
Takeaways from a survey on AI alignment resources 2022-11-05T23:40:01.917Z
AXRP Episode 18 - Concept Extrapolation with Stuart Armstrong 2022-09-03T23:12:01.242Z
AXRP Episode 17 - Training for Very High Reliability with Daniel Ziegler 2022-08-21T23:50:20.513Z
AXRP Episode 16 - Preparing for Debate AI with Geoffrey Irving 2022-07-01T22:20:18.456Z
AXRP Episode 15 - Natural Abstractions with John Wentworth 2022-05-23T05:40:19.293Z
AXRP Episode 14 - Infra-Bayesian Physicalism with Vanessa Kosoy 2022-04-05T23:10:09.817Z
AXRP Episode 13 - First Principles of AGI Safety with Richard Ngo 2022-03-31T05:20:17.883Z
What’s the chance a smart London resident dies of a Russian nuke in the next month? 2022-03-10T19:20:01.434Z
A Nice Representation of the Laplacian 2022-02-12T03:20:00.918Z
AXRP Episode 12 - AI Existential Risk with Paul Christiano 2021-12-02T02:20:17.041Z
Even if you're right, you're wrong 2021-11-22T05:40:00.747Z
The Meta-Puzzle 2021-11-22T05:30:01.031Z
Everything Studies on Cynical Theories 2021-10-27T01:31:20.608Z
AXRP Episode 11 - Attainable Utility and Power with Alex Turner 2021-09-25T21:10:26.995Z
Announcing the Vitalik Buterin Fellowships in AI Existential Safety! 2021-09-21T00:33:08.074Z
AXRP Episode 10 - AI’s Future and Impacts with Katja Grace 2021-07-23T22:10:14.624Z
Handicapping competitive games 2021-07-22T03:00:00.498Z
CGP Grey on the difficulty of knowing what's true [audio] 2021-07-13T20:40:13.506Z
A second example of conditional orthogonality in finite factored sets 2021-07-07T01:40:01.504Z
A simple example of conditional orthogonality in finite factored sets 2021-07-06T00:36:40.264Z
AXRP Episode 9 - Finite Factored Sets with Scott Garrabrant 2021-06-24T22:10:12.645Z
Up-to-date advice about what to do upon getting COVID? 2021-06-19T02:37:10.940Z
AXRP Episode 8 - Assistance Games with Dylan Hadfield-Menell 2021-06-08T23:20:11.985Z
AXRP Episode 7.5 - Forecasting Transformative AI from Biological Anchors with Ajeya Cotra 2021-05-28T00:20:10.801Z
AXRP Episode 7 - Side Effects with Victoria Krakovna 2021-05-14T03:50:11.757Z
Challenge: know everything that the best go bot knows about go 2021-05-11T05:10:01.163Z
AXRP Episode 6 - Debate and Imitative Generalization with Beth Barnes 2021-04-08T21:20:12.891Z
AXRP Episode 5 - Infra-Bayesianism with Vanessa Kosoy 2021-03-10T04:30:10.304Z
Privacy vs proof of character 2021-02-28T02:03:31.009Z


Comment by DanielFilan on The Talk: a brief explanation of sexual dimorphism · 2023-09-28T03:13:07.608Z · LW · GW

Part 3: not my type

should probably be "Part 2" instead

Comment by DanielFilan on The Talk: a brief explanation of sexual dimorphism · 2023-09-28T03:12:11.599Z · LW · GW

McDonalds et al. (2016) had yeast evolve with and without sex for 1000 generations

... wait, yeast can reproduce sexually or asexually?
looks at paper

OK, they figured out how to get yeast to have sex? Seems wild.

Comment by DanielFilan on PIBBSS Summer Symposium 2023 · 2023-09-27T23:20:06.132Z · LW · GW

Some talks are visible on YouTube here

Comment by DanielFilan on An Intuitive Guide to Garrabrant Induction · 2023-09-21T00:03:53.621Z · LW · GW

Did this ever get written up? I'm still interested in it.

Comment by DanielFilan on Where might I direct promising-to-me researchers to apply for alignment jobs/grants? · 2023-09-19T01:33:17.874Z · LW · GW

Redwood Research?

Comment by DanielFilan on The Apologist and the Revolutionary · 2023-09-18T20:40:15.891Z · LW · GW

It gets weirder. For some reason, squirting cold water into the left ear canal wakes up the revolutionary.

This link gets me "page not found", both here and on the oldest saved copy on the internet archive. That said, some papers are available here, here, and here if you're at a university that pays for this sort of stuff, and generally linked to from this page. I'll be adding these links to the wayback machine; unfortunately, when I try to, I get caught in some sort of weird loop of captchas and am unable to actually get to the site.

Comment by DanielFilan on The commenting restrictions on LessWrong seem bad · 2023-09-17T02:16:01.640Z · LW · GW

Most of your comment seems to be an appeal to modest epistemology. We can in fact do better than total agnosticism about whether some arguments are productive or not, and worth having more or less of on the margin.

Note that the more you believe that your commenters can tell whether some arguments are productive or not, and worth having more or less of on the margin, the less you should worry as mods about preventing or promoting such arguments (altho you still might want to put them near the top or bottom of pages for attention-management reasons).

Comment by DanielFilan on Sharing Information About Nonlinear · 2023-09-10T04:52:29.307Z · LW · GW

Site admins, would it be possible to see the edit history of posts, perhaps in diff format (or at least make that a default that authors can opt out of)? Seems like something I want in a few cases:

  • controversial posts like these
  • sometimes mods edit my posts and I'd like to know what they edited
Comment by DanielFilan on Sharing Information About Nonlinear · 2023-09-07T21:19:03.834Z · LW · GW

Is your point that "being asked to not hang out with low value people" is inherently abusive in a way worse than everything else going on in that list?


Comment by DanielFilan on Sharing Information About Nonlinear · 2023-09-07T20:37:31.124Z · LW · GW

Spencer responded to a similar request in the EA forum. Copy-pasting the response here in quotes, but for further replies etc. I encourage readers to follow the link:

Yes, here two examples, sorry I can’t provide more detail:

-there were claims in the post made about Emerson that were not actually about Emerson at all (they were about his former company years after he left). I pointed this out to Ben hours before publication and he rushed to correct it (in my view it’s a pretty serious mistake to make false accusations about a person, I see this as pretty significant)!

-there was also a very disparaging claim made in the piece (I unfortunately can’t share the details for privacy reasons; but I assume nonlinear will later) that was quite strongly contradicted by a text message exchange I have

Comment by DanielFilan on Sharing Information About Nonlinear · 2023-09-07T19:59:01.359Z · LW · GW

Sorry, I was using "normal" to mean "not abusive". Even in weird and atypical environments, I find it hard to think of situations where "don't hang out with your family" is an acceptable ask (with the one exception listed in my comment).

Comment by DanielFilan on Sharing Information About Nonlinear · 2023-09-07T18:53:05.223Z · LW · GW

Sure, but wasn't there some previous occasion where Lightcone made a grant to people after they shared negative stories about a former employer (maybe to Zoe Curzi? but I can't find that atm)? If so, then presumably at some point you get a reputation for doing so.

Comment by DanielFilan on Sharing Information About Nonlinear · 2023-09-07T18:34:37.038Z · LW · GW

I can guarantee you from my perspective as a coach that a good number of the items mentioned here are abjectly false.

What's an example of something that's false?

Comment by DanielFilan on Sharing Information About Nonlinear · 2023-09-07T18:32:39.222Z · LW · GW

Being asked to... not hang out with low value people... is just one more thing that is consistent with the office environment.

Maybe I'm naive, but I don't think there's approximately any normal relationship in which it's considered acceptable to ask someone to not associate with ~anyone other than current employees. The closest example I can think of is monasticism, but in that context (a) that expectation is clear and (b) at least in the Catholic church there's a higher internal authority who can adjudicate abuse claims.

Comment by DanielFilan on Sharing Information About Nonlinear · 2023-09-07T18:25:24.906Z · LW · GW

The nearly final draft of this post that I was given yesterday had factual inaccuracies that (in my opinion and based on my understanding of the facts) are very serious

Could you share examples of these inaccuracies?

Comment by DanielFilan on Fundamental question: What determines a mind's effects? · 2023-09-06T18:07:39.977Z · LW · GW

Any reversible effect might be reversed. The question asks about the final effects of the mind

This talk of "reversible" and "final" effects of a mind strikes me as suspicious: for one, in a block / timeless universe, there's no such thing as "reversible" effects, and for another, in the end, it may wash out in an entropic mess! But it does suggest a rephrasing of "a first-order approximation of the (direction of the) effects, understood both spatially and temporally".

Comment by DanielFilan on Invulnerable Incomplete Preferences: A Formal Statement · 2023-09-06T17:57:16.605Z · LW · GW

Is the idea that the set of "states" is the codomain of gamma?

Comment by DanielFilan on Invulnerable Incomplete Preferences: A Formal Statement · 2023-09-06T17:55:04.568Z · LW · GW

 assigns the set of states that remain possible once a node is reached.

What's bold S here?

Comment by DanielFilan on [Linkpost] Michael Nielsen remarks on 'Oppenheimer' · 2023-08-31T22:09:25.410Z · LW · GW

I was at a party recently, and happened to meet a senior person at a well-known AI startup in the Bay Area. They volunteered that they thought "humanity had about a 50% chance of extinction" caused by artificial intelligence. I asked why they were working at an AI startup if they believed that to be true. They told me that while they thought it was true, "in the meantime I get to have a nice house and car".

This strikes me as the sort of thing one would say without quite meaning it. Like, I'm sure this person could get other jobs that also support a nice house and car. And if they thought about it, they could probably also figure this out. I'm tempted to chalk the true decision up to conformity / lack of confidence in one's ability to originate and execute consequentialist plans, but that's just a guess and I'm not particularly well-informed about this person.

Comment by DanielFilan on [Linkpost] Michael Nielsen remarks on 'Oppenheimer' · 2023-08-31T22:07:24.932Z · LW · GW

The Manhattan Project brought us nuclear weapons, whose existence affects the world to this day, 79 years after its founding - I would call that a long timeline. And we might not have seen all the relevant effects!

But yeah, I think we have enough info to make tentative judgements of at least Klaus Fuchs' espionage, and maybe Joseph Rotblat's quitting.

Comment by DanielFilan on Report on Frontier Model Training · 2023-08-31T00:37:57.955Z · LW · GW

I appreciate the multiple levels of summarization!

Comment by DanielFilan on Dating Roundup #1: This is Why You’re Single · 2023-08-31T00:27:12.934Z · LW · GW

Would you go on a first date if there were a 20% chance that instead of an actual date someone would yell at you? It's obviously not a pleasant possibility, but IMO still worth it.

Comment by DanielFilan on DanielFilan's Shortform Feed · 2023-08-24T19:52:09.831Z · LW · GW

Research project idea: formalize a set-up with two reinforcement learners, each training the other. I think this is what's going on in baby care. Specifically: a baby is learning in part by reinforcement learning: they have various rewards they like getting (food, comfort, control over environment, being around people). Some of those rewards are dispensed by you: food, and whether you're around them, smiling and/or mimicking them. Also, you are learning via RL: you want the baby to be happy, nourished, rested, and not cry (among other things). And the baby is dispensing some of those rewards.


  • What even happens? (I think in many setups you won't get mutual wireheading)
  • Do you get a nice equilibrium?
  • Is there some good alignment property you can get?
    • Maybe a terrible alignment property?

This could also be a model of humans and advanced algorithmic timelines or some such thing.
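As a starting point, here's a deliberately tiny version of the set-up (entirely my own toy construction: the action set and reward coupling are arbitrary choices, not anything from the shortform): two independent epsilon-greedy bandit learners where each agent's reward depends on the other agent's action, so each agent's learned behaviour is part of the other's reward-dispensing environment.

```python
import random

ACTIONS = [0.0, 0.5, 1.0]  # arbitrary toy action set

def coupled_learners(steps=2000, lr=0.1, eps=0.1, seed=0):
    """Two independent epsilon-greedy Q-learners. Each agent's reward
    depends on the *other* agent's action (here: 1 if the actions
    match, else 0), so each agent is effectively shaping the other's
    training signal."""
    rng = random.Random(seed)
    q = [{a: 0.0 for a in ACTIONS} for _ in range(2)]  # Q-values per agent
    for _ in range(steps):
        acts = [rng.choice(ACTIONS) if rng.random() < eps
                else max(q[i], key=q[i].get)
                for i in range(2)]
        reward = 1.0 if acts[0] == acts[1] else 0.0
        for i in range(2):
            q[i][acts[i]] += lr * (reward - q[i][acts[i]])
    return [max(qi, key=qi.get) for qi in q]

print(coupled_learners())  # the two learners settle on a matched action pair
```

Even this degenerate coordination-game version lets you poke at the equilibrium question from the list above; richer versions (state, asymmetric rewards, one agent much faster at learning) seem like where the interesting alignment properties would show up.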

Comment by DanielFilan on Visible loss landscape basins don't correspond to distinct algorithms · 2023-08-03T16:13:28.391Z · LW · GW

Update: I currently think that Nguyen (2019) proves the claim, but it actually requires a layer to have two hidden neurons per training example.

Comment by DanielFilan on Visible loss landscape basins don't correspond to distinct algorithms · 2023-08-01T22:10:40.196Z · LW · GW

Mechanistically dissimilar algorithms can be "mode connected" - that is, local minima-ish that are connected by a path of local minima (the paper proves this for their definition of "mechanistically similar")

Mea culpa: AFAICT, the 'proof' in Mechanistic Mode Connectivity fails. It basically goes:

  1. Prior work has shown that under overparametrization, all global loss minimizers are mode connected.
  2. Therefore, mechanistically distinct global loss minimizers are also mode connected.

The problem is that prior work made the assumption that for a net of the right size, there's only one loss minimizer up to permutation - aka there are no mechanistically distinct loss minimizers.

[EDIT: the proof also cites Nguyen (2019) in support of its arguments. I haven't checked the proof in Nguyen (2019), but if it holds up, it does substantiate the claim in Mechanistic Mode Connectivity - altho if I'm reading it correctly you need so much overparameterization that the neural net has a layer with as many hidden neurons as there are training data points.]

Comment by DanielFilan on Visible loss landscape basins don't correspond to distinct algorithms · 2023-07-31T21:57:13.604Z · LW · GW

the above papers show that in more realistic settings empirically, two models lie in the same basin (up to permutation symmetries) if and only if they have similar generalization and structural properties.

I think they only check if they lie in linearly-connected bits of the same basin if they have similar generalization properties? E.g. Figure 4 of Mechanistic Mode Connectivity is titled "Non-Linear Mode Connectivity of Mechanistically Dissimilar Models" and the subtitle states that "quadratic paths can be easily identified to mode connect mechanistically dissimilar models[, and] linear paths cannot be identified, even after permutation". Linear Connectivity Reveals Generalization Strategies seems to be focussed on linear mode connectivity, rather than more general mode connectivity.
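To make the linear/non-linear distinction concrete, here's a minimal sketch of the check that "linear mode connectivity" refers to: interpolate straight between two parameter vectors and measure how far the loss rises above the endpoints. (This is my own toy illustration with a synthetic loss; the real experiments do this on trained networks, usually after permutation alignment.)

```python
import numpy as np

def loss_barrier(theta_a, theta_b, loss_fn, n_points=25):
    """Evaluate loss along the straight line between two parameter
    vectors; the 'barrier' is the maximum amount the path's loss rises
    above the linear interpolation of the endpoint losses."""
    alphas = np.linspace(0.0, 1.0, n_points)
    path_losses = np.array(
        [loss_fn((1 - a) * theta_a + a * theta_b) for a in alphas])
    endpoint_line = (1 - alphas) * path_losses[0] + alphas * path_losses[-1]
    return float(np.max(path_losses - endpoint_line))

# Toy example: the two minima of a double-well loss are NOT linearly
# mode connected -- the straight path must climb over the hump at 0,
# even though a curved path around it could stay low in higher dimensions.
double_well = lambda t: float((t @ t - 1.0) ** 2)
barrier = loss_barrier(np.array([1.0]), np.array([-1.0]), double_well)
print(barrier)  # ~1.0: a large barrier on the linear path
```

A near-zero barrier is what these papers mean by two models lying in "linearly connected bits of the same basin"; a quadratic or otherwise curved path can connect models that this linear check separates.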

Comment by DanielFilan on Open Problems and Fundamental Limitations of RLHF · 2023-07-31T21:47:59.064Z · LW · GW

We argue that a sustained commitment to transparency (e.g. to auditors) would make the RLHF research environment more robust from a safety standpoint.

Do you think this is more true of RLHF than other safety techniques or frameworks? At first blush, I would have thought "no", and the reasoning you provide in this post doesn't seem to distinguish RLHF from other things.

Comment by DanielFilan on AXRP Episode 24 - Superalignment with Jan Leike · 2023-07-31T16:30:30.154Z · LW · GW

I think I probably didn't quite word that question right, and that's what's explaining the confusion - I meant something like "Once you've created the AAR, what alignment problems are left to be solved? Please answer in terms of the gap between the AAR and superintelligence."

Comment by DanielFilan on Visible loss landscape basins don't correspond to distinct algorithms · 2023-07-28T19:17:37.479Z · LW · GW

The second paper is just about linear connectivity, and does seem to suggest that linearly connected models run similar algorithms. But I guess I don't expect neural net training to go in straight lines? (Altho I suppose momentum helps with this?)

Comment by DanielFilan on Visible loss landscape basins don't correspond to distinct algorithms · 2023-07-28T19:13:14.629Z · LW · GW

I didn't read super carefully, but it seems like the former paper is saying that, for some definition of "mechanistically similar":

  • Mechanistically dissimilar algorithms can be "mode connected" - that is, local minima-ish that are connected by a path of local minima (the paper proves this for their definition of "mechanistically similar").
  • If two models aren't linearly mode connected, then that means that they're dissimilar (note that this is a conjecture, but I guess they probably find evidence for it).

I don't think this is in much tension with the post?

Comment by DanielFilan on Cost-effectiveness of professional field-building programs for AI safety research · 2023-07-24T19:01:36.023Z · LW · GW

I guess I just don't have a strong sense of where the practitioners' numbers are coming from, or why they believe what they believe. Which is fine if you want to build a pipeline that turns some intuitions into decisions, but not obviously incredibly useful for the rest of us (beyond just telling us those intuitions).

Finally, it’s worth mentioning that the cost-effectiveness of these programs relative to one another do not rely very heavily on conversions.

The thing you link shows that if you change the conversion ratio of both programs the same amount, the relative cost-effectiveness doesn't change, which makes sense. But if workshops produced 100x more conversions than socials, or vice versa, presumably this must make a difference. If you say that the treatment effects only differ by a factor of 2, then fair enough, but that's just not super a priori clear (and the fact that you claim that (a) you can measure the TDC better and (b) the TDC has a different treatment effect makes me skeptical).

(For the record, I couldn't really make heads or tails of the spreadsheet you linked or what the calculations in the script were supposed to be, but I didn't try super hard to understand them - perhaps I'd write something different if I really understood them)

Comment by DanielFilan on By Default, GPTs Think In Plain Sight · 2023-07-18T21:47:35.088Z · LW · GW

Note that this paper in section 2.6 claims to demonstrate that some 175 billion parameter RLHF'ed LLM (presumably Claude or Claude 2?) does not use steganography in its CoT. It does this by:

  1. Get a bunch of answers from the model using CoT prompting.
  2. Take prefixes of CoTs, asking another language model to paraphrase them, regenerate the rest of the CoT, and see how accurate the models are now.
  3. Check if the model loses accuracy for paraphrased CoTs.

The basic result is that paraphrasing parts of the CoT doesn't appear to reduce accuracy.

Comment by DanielFilan on A Hill of Validity in Defense of Meaning · 2023-07-16T06:32:08.058Z · LW · GW

Closest thing I'm aware of is that at the time of the AlphaGo matches he bet people at like 3:2 odds, favourable to him, that Lee Sedol would win. Link here

Comment by DanielFilan on Blanchard's Dangerous Idea and the Plight of the Lucid Crossdreamer · 2023-07-10T23:42:46.164Z · LW · GW

I read RobertM as apophatically saying that Benquo could be confessing to something with Benquo's comment, and Benquo as asking what Benquo is allegedly confessing to.

Comment by DanielFilan on Blanchard's Dangerous Idea and the Plight of the Lucid Crossdreamer · 2023-07-10T23:23:34.276Z · LW · GW

wait, how does Benquo's text imply that Benquo is confessing to committing assault?

Comment by DanielFilan on Cost-effectiveness of student programs for AI safety research · 2023-07-10T21:33:53.294Z · LW · GW

Similar to my comment on the other post: it seems like this critically relies on guesses about the 'treatment effects' of these programs, as detailed in the Pipeline probabilities and scientist-equivalence section. How did you come up with these guesses?

Comment by DanielFilan on Cost-effectiveness of professional field-building programs for AI safety research · 2023-07-10T21:26:57.238Z · LW · GW

It seems like the estimates for the cost-effectiveness of the NeurIPS social and workshop rely heavily on estimates of the number of "conversions" those produced, but I couldn't find an explanation of how these estimates were produced in the post. No chance you can walk us thru the envelope math there?

Comment by DanielFilan on AXRP Episode 22 - Shard Theory with Quintin Pope · 2023-06-22T03:22:48.200Z · LW · GW

Thanks for your detailed comments!

Comment by DanielFilan on Holly_Elmore's Shortform · 2023-06-21T02:18:26.891Z · LW · GW

I feel a bit reticent about pause advocacy, altho I have to admit I'm not familiar with the details (and I'm not feeling so negative about it that I want to spend a bunch of time trashing it). My attempt to flesh out why:

  • I'm pretty influenced by the type of libertarian political philosophy that says that hastily-assembled policy proposals can have big negative unintended side effects, especially when such policy proposals involve giving discretionary control over something to a government.
  • I'm pessimistic about our odds of surviving really powerful AI, but not so pessimistic that I think p(doom) couldn't be 10 percentage points higher.
  • Pause advocacy seems to seek compromise with normal people in order to get their policy proposals passed - an obviously good strategy on some level, but I kind of hate policy proposals that normal people like! This is doubly true for polities where it's easiest to start passing serious tech regulation (California, the EU).
  • Relatedly, I have the impression that pause policy advocacy tends to look like taking popular policies and promoting those that slow down AI the most, rather than doing something like mandatory AI liability insurance which seems like it's close to optimal, then adjusting it to be popular with lots of people.
  • I worry that "give more discretionary control over AI to such-and-such political body" just produces worse decisions.

Anyway that's why I have some sort of instinctive negative reaction, but again I'm not very familiar with the details and I'm sure different people are doing different things etc.

Comment by DanielFilan on DanielFilan's Shortform Feed · 2023-06-20T00:03:00.808Z · LW · GW

Toryn Q. Klassen, Parand Alizadeh Alamdari, and Sheila A. McIlraith wrote a paper on the multi-agent AUP thing, framing it as a study of epistemic side effects.

Comment by DanielFilan on AXRP Episode 22 - Shard Theory with Quintin Pope · 2023-06-16T22:03:23.666Z · LW · GW

Yeah, I've been having difficulty getting Google Podcasts to find the new episode, unfortunately. In the meantime, consider listening on YouTube or Spotify, if those work for you?

Comment by DanielFilan on Wildfire of strategicness · 2023-06-07T23:15:23.210Z · LW · GW

Maybe "subagents"?

Comment by DanielFilan on Power-seeking can be probable and predictive for trained agents · 2023-06-06T19:17:10.960Z · LW · GW

The core claim of this post is that if you train a network in some environment, the agent will not generalize optimally with respect to the reward function you trained it on, but will instead be optimal with respect to some other reward function in a way that is compatible with training-reward-optimality, and that this means that it is likely to avoid shutdown in new environments. The idea that this happens because reward functions are "internally represented" isn't necessary for those results. You're right that the post uses the phrase "internal representation" once at the start, and some very weak form of "representation" is presumably necessary for the policy to be optimal for a reward function (at least in the sense that you can derive a bunch of facts about a reward function from the optimal policy for that reward function), but that doesn't mean that they're central to the post.
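A minimal illustration of that core claim (my own toy construction; the state and action names are made up): two reward functions that agree on every state seen in training, and hence rationalize exactly the same training-optimal policy, can still prescribe opposite behaviour at a novel shutdown state.

```python
# Two candidate reward functions that agree on the training state "s0"
# but disagree on a novel "shutdown" state never seen in training.
ACTIONS = {"s0": ["left", "right"], "shutdown": ["press", "avoid"]}

R1 = {("s0", "left"): 1.0, ("s0", "right"): 0.0,
      ("shutdown", "press"): 1.0, ("shutdown", "avoid"): 0.0}
R2 = {("s0", "left"): 1.0, ("s0", "right"): 0.0,
      ("shutdown", "press"): 0.0, ("shutdown", "avoid"): 1.0}

def greedy(R, state):
    """The action an R-optimal policy takes in this one-step setting."""
    return max(ACTIONS[state], key=lambda a: R[(state, a)])

# Both reward functions are compatible with the same training behaviour...
assert greedy(R1, "s0") == greedy(R2, "s0") == "left"
# ...yet they generalize in opposite ways at the novel state.
print(greedy(R1, "shutdown"), greedy(R2, "shutdown"))  # press avoid
```

Training-reward-optimality pins down behaviour on the training distribution but not which of these reward functions (if either) the policy ends up optimal for off-distribution.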

Comment by DanielFilan on Wildfire of strategicness · 2023-06-05T22:14:52.498Z · LW · GW

The spark of strategicness, if such a thing is possible, recruits the surrounding mental elements. Those surrounding mental elements, by hypothesis, make goals achievable. That means the wildfire can recruit these surrounding elements toward the wildfire's ultimate ends... Also by hypothesis, the surrounding mental elements don't themselves push strongly for goals. Seemingly, that implies that they do not resist the wildfire, since resisting would constitute a strong push.

I think this is probably a fallacy of composition (maybe in the reverse direction than how people usually use that term)? Like, the hypothesis is that the mind as a whole makes goals achievable and doesn't push towards goals, but I don't think this implies that any given subset of the mind does that.

Comment by DanielFilan on Short Remark on the (subjective) mathematical 'naturalness' of the Nanda--Lieberum addition modulo 113 algorithm · 2023-06-02T17:00:54.454Z · LW · GW

Like, the only reason we're calling it a "Fourier basis" is that we're looking at a few different speeds of rotation, in order to scramble the second-place answers that almost get you a cos of 1 at the end, while preserving the actual answer.

Comment by DanielFilan on Short Remark on the (subjective) mathematical 'naturalness' of the Nanda--Lieberum addition modulo 113 algorithm · 2023-06-02T16:58:49.777Z · LW · GW

I agree a rotation matrix story would fit better, but I do think it's a fair analogy: the numbers stored are just coses and sines, aka the x and y coordinates of the hour hand.

Comment by DanielFilan on Short Remark on the (subjective) mathematical 'naturalness' of the Nanda--Lieberum addition modulo 113 algorithm · 2023-06-01T21:28:35.270Z · LW · GW

My submission: when we teach modular arithmetic to people, we do it using the metaphor of clock arithmetic. Well, if you ignore the multiple frequencies and argmax weirdness, clock arithmetic is exactly what this network is doing! Find the coordinates of rotating the hour hand (on a 113-hour clock) x hours, then y hours, use trig identities to work out what it would be if you rotated x+y hours, then count how many steps back you have to rotate to get to 0 to tell where you ended up. In fairness, the final step is a little bit different than the usual imagined rule of "look at the hour mark where the hand ends up", but not so different that clock arithmetic counts as a bad prediction IMO.
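For concreteness, here's the clock-arithmetic story as a short Python sketch (my own toy rendering: a single frequency and none of the network's actual machinery, just the trig-identity algorithm described above):

```python
import math

P = 113  # modulus, as in the Nanda--Lieberum setup

def clock_add(x: int, y: int, p: int = P) -> int:
    # Coordinates of the hour hand after rotating x (resp. y) steps
    # around a p-hour clock.
    cx, sx = math.cos(2 * math.pi * x / p), math.sin(2 * math.pi * x / p)
    cy, sy = math.cos(2 * math.pi * y / p), math.sin(2 * math.pi * y / p)
    # Angle-addition identities give the hand position after x + y steps.
    c_sum = cx * cy - sx * sy
    s_sum = sx * cy + cx * sy
    # "Count how many steps back you have to rotate to get to 0": pick
    # the z whose hand position best matches, i.e. maximizes
    # cos(angle(x + y) - angle(z)).
    return max(range(p),
               key=lambda z: c_sum * math.cos(2 * math.pi * z / p)
                             + s_sum * math.sin(2 * math.pi * z / p))

print(clock_add(100, 50))  # 37, i.e. (100 + 50) mod 113
```

The final argmax plays the role of "look at the hour mark where the hand ends up", which is the step the comment flags as slightly different from the usual clock metaphor.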

Comment by DanielFilan on DanielFilan's Shortform Feed · 2023-06-01T20:52:03.245Z · LW · GW

An attempt at rephrasing a shard theory critique of utility function reasoning, while restricting myself to things I basically agree with:

Yes, there are representation theorems that say coherent behaviour is optimizing some utility function. And yes, for the sake of discussion let's say this extends to reward functions in the setting of sequential decision-making (even tho I don't remember seeing a theorem for that). But: just because there's a mapping, doesn't mean that we can pull back a uniform measure on utility/reward functions to get a reasonable measure on agents - those theorems don't tell us that we should expect a uniform distribution on utility/reward functions, or even a nice distribution! They would if agents were born with utility functions in their heads represented as tables or something, where you could swap entries in different rows, but that's not what the theorems say!

Comment by DanielFilan on Seriously, what goes wrong with "reward the agent when it makes you smile"? · 2023-06-01T20:26:07.280Z · LW · GW

Not having read other responses, my attempt to answer in my own words: what goes wrong is that there are tons of possible cognitive influences that could be reinforced by rewards for making people smile. E.g. "make things of XYZ type think things are going OK", "try to promote physical configurations like such-and-such", "trying to stimulate the reinforcer I observe in my environment". Most of these decision-influences, when extrapolated to coherent behaviour where those decision-influences drive the course of the behaviour, lead to resource-gathering and not respecting what the informed preferences of humans would be. Then this causes doom because you can better achieve most goals/preferences you could have by having more power and disempowering the humans.

Comment by DanielFilan on Power-seeking can be probable and predictive for trained agents · 2023-06-01T19:28:51.390Z · LW · GW

I think you're making a mistake: policies can be reward-optimal even if there's not an obvious box labelled "reward" that they're optimal with respect to the outputs of. Similarly, the formalism of "reward" can be useful even if this box doesn't exist, or even if the policy isn't behaving the way you would expect if you identified that box with the reward function. To be fair, the post sort of makes this mistake by talking about "internal representations", but I think everything goes thru if you strike out that talk.

The main thing I want to talk about

I can talk about utility functions instead (which would be equivalent to value functions in this case)

I disagree that these are equivalent, and expect the policy and value function to come apart in practice. Indeed, that was observed in the original goal misgeneralization paper (3.3, actor-critic inconsistency).

I think you're the one who's imposing a type error here. For "value functions" to be useful in modelling a policy, it doesn't have to be the case that the policy is acting optimally with respect to a suggestively-labeled critic - it just has to be the case that the agent is acting consistently with some value function. Analogously, momentum is conserved in classical mechanics, even if objects have labels on them that inaccurately say "my momentum is 23 kg m/s".

Anyways, we can talk about utility functions, but then we're going to lose claim to predictiveness, no? Why should we assume that the network will internally represent a scalar function over observations, consistent with a historical training signal's scalar values (and let's not get into nonstationary reward), such that the network will maximize discounted sum return of this internally represented function? That seems highly improbable to me, and I don't think reality will be "basically that" either.

The utility function formalism doesn't require agents to "internally represent a scalar function over observations". You'll notice that this isn't one of the conclusions of the VNM theorem.

Another thing I don't get

My point is rather that these results are not predictive because the assumption won't hold. The assumptions are already known to not be good approximations of trained policies, in at least some prototypical RL situations.

What part of the post you link rules this out? As far as I can tell, the thing you're saying is that a few factors influence the decisions of the maze-solving agent, which isn't incompatible with the agent acting optimally with respect to some reward function such that it produces training-reward-optimal behaviour on the training set.