Posts

Why I no longer identify as transhumanist 2024-02-03T12:00:04.389Z
Loneliness and suicide mitigation for students using GPT3-enabled chatbots (survey of Replika users in Nature) 2024-01-23T14:05:40.986Z
Quick thoughts on the implications of multi-agent views of mind on AI takeover 2023-12-11T06:34:06.395Z
Genetic fitness is a measure of selection strength, not the selection target 2023-11-04T19:02:13.783Z
My idea of sacredness, divinity, and religion 2023-10-29T12:50:07.980Z
The 99% principle for personal problems 2023-10-02T08:20:07.379Z
How to talk about reasons why AGI might not be near? 2023-09-17T08:18:31.100Z
Stepping down as moderator on LW 2023-08-14T10:46:58.163Z
How I apply (so-called) Non-Violent Communication 2023-05-15T09:56:52.490Z
Most people should probably feel safe most of the time 2023-05-09T09:35:11.911Z
A brief collection of Hinton's recent comments on AGI risk 2023-05-04T23:31:06.157Z
Romance, misunderstanding, social stances, and the human LLM 2023-04-27T12:59:09.229Z
Goodhart's Law inside the human mind 2023-04-17T13:48:13.183Z
Why no major LLMs with memory? 2023-03-28T16:34:37.272Z
Creating a family with GPT-4 2023-03-28T06:40:06.412Z
Here, have a calmness video 2023-03-16T10:00:42.511Z
[Fiction] The boy in the glass dome 2023-03-03T07:50:03.578Z
The Preference Fulfillment Hypothesis 2023-02-26T10:55:12.647Z
In Defense of Chatbot Romance 2023-02-11T14:30:05.696Z
Fake qualities of mind 2022-09-22T16:40:05.085Z
Jack Clark on the realities of AI policy 2022-08-07T08:44:33.547Z
Open & Welcome Thread - July 2022 2022-07-01T07:47:22.885Z
My current take on Internal Family Systems “parts” 2022-06-26T17:40:05.750Z
Confused why a "capabilities research is good for alignment progress" position isn't discussed more 2022-06-02T21:41:44.784Z
The horror of what must, yet cannot, be true 2022-06-02T10:20:04.575Z
[Invisible Networks] Goblin Marketplace 2022-04-03T11:40:04.393Z
[Invisible Networks] Psyche-Sort 2022-04-02T15:40:05.279Z
Sasha Chapin on bad social norms in rationality/EA 2021-11-17T09:43:35.177Z
How feeling more secure feels different than I expected 2021-09-17T09:20:05.294Z
What does knowing the heritability of a trait tell me in practice? 2021-07-26T16:29:52.552Z
Experimentation with AI-generated images (VQGAN+CLIP) | Solarpunk airships fleeing a dragon 2021-07-15T11:00:05.099Z
Imaginary reenactment to heal trauma – how and when does it work? 2021-07-13T22:10:03.721Z
[link] If something seems unusually hard for you, see if you're missing a minor insight 2021-05-05T10:23:26.046Z
Beliefs as emotional strategies 2021-04-09T14:28:16.590Z
Open loops in fiction 2021-03-14T08:50:03.948Z
The three existing ways of explaining the three characteristics of existence 2021-03-07T18:20:24.298Z
Multimodal Neurons in Artificial Neural Networks 2021-03-05T09:01:53.996Z
Different kinds of language proficiency 2021-02-26T18:20:04.342Z
[Fiction] Lena (MMAcevedo) 2021-02-23T19:46:34.637Z
What's your best alternate history utopia? 2021-02-22T08:17:23.774Z
Internet Encyclopedia of Philosophy on Ethics of Artificial Intelligence 2021-02-20T13:54:05.162Z
Bedtime reminiscences 2021-02-19T11:50:05.271Z
Unwitting cult leaders 2021-02-11T11:10:04.504Z
[link] The AI Girlfriend Seducing China’s Lonely Men 2020-12-14T20:18:15.115Z
Are index funds still a good investment? 2020-12-02T21:31:40.413Z
Snyder-Beattie, Sandberg, Drexler & Bonsall (2020): The Timing of Evolutionary Transitions Suggests Intelligent Life Is Rare 2020-11-24T10:36:40.843Z
Retrospective: November 10-day virtual meditation retreat 2020-11-23T15:00:07.011Z
Memory reconsolidation for self-affection 2020-10-27T10:10:04.884Z
Group debugging guidelines & thoughts 2020-10-19T11:02:32.883Z
Things are allowed to be good and bad at the same time 2020-10-17T08:00:06.742Z

Comments

Comment by Kaj_Sotala on Why I no longer identify as transhumanist · 2024-03-13T21:12:04.654Z · LW · GW

I'm unlikely to reply to further object-level explanation of this, sorry. 

No worries! I'll reply anyway for anyone else reading this, but it's fine if you don't respond further.

Giving up on transhumanism as a useful idea of what-to-aim-for or identify as, separate from how much you personally can contribute to it.

It sounds like we have different ideas of what it means to identify as something. For me, one of the important functions of identity is as a model of what I am, and as what distinguishes me from other people. For instance, I identify as Finnish for reasons like having Finnish citizenship, having lived in Finland my whole life, and Finnish being my native language; these are facts about what I am, and they're also important for predicting my future behavior.

For me, it would feel more like rationalization if I stopped contributing to something like transhumanism but nevertheless continued identifying as a transhumanist. My identity is something that should track what I am and do, and if I don't do anything that would meaningfully set me apart from people who don't identify as transhumanists... then that would feel like the label was incorrect and imply wrong kinds of predictions. Rather, I should just update on the evidence and drop the label.

As for transhumanism as a useful idea of what to aim for, I'm not sure of what exactly you mean by that, but I haven't started thinking "transhumanism bad" or anything like that. I still think that a lot of the transhumanist ideals are good and worthy ones and that it's great if people pursue them. (But there are a lot of ideals I think are good and worthy ones without identifying with them. For example, I like that museums exist and that there are people running them. But I don't do anything about this other than occasionally visit one, so I don't identify as a museum-ologist despite approving of them.)

More directly: avoiding "pinning your hopes on AI" (which, depending on how I'm supposed to interpret this, could mean "avoiding solutions that ever lead to aligned AI occurring" or "avoiding near-term AI, period" or "believing that something other than AI is likely to be the most important near-future thing")

Hmm, none of these. I'm not sure of what the first one means but I'd gladly have a solution that led to aligned AI, I use LLMs quite a bit, and AI clearly does seem like the most important near-future thing.

"Pinning my hopes on AI" meant something like "(subconsciously) hoping to get AI here sooner so that it would fix the things that were making me anxious", and avoiding that just means "noticing that therapy and conventional things like that work better for fixing my anxieties than waiting for AI to come and fix them". This too feels to me like actually updating on the evidence (noticing that there's something better that I can do already and I don't need to wait for AI to feel better) rather than like rationalizing something.

Comment by Kaj_Sotala on Why I no longer identify as transhumanist · 2024-03-12T17:20:14.348Z · LW · GW

Okay! It wasn't intended as prescriptive but I can see it as being implicitly that.

What do you think I'm rationalizing? 

Comment by Kaj_Sotala on The Felt Sense: What, Why and How · 2024-03-12T15:49:41.694Z · LW · GW

That's a pseudonym Duncan used at one point, see e.g. the first line of this comment.

Comment by Kaj_Sotala on Why I no longer identify as transhumanist · 2024-03-12T14:49:48.636Z · LW · GW

That makes sense to me, though I feel unclear about whether you think this post is an example of that pattern / whether your comment has some intent aimed at me?

Comment by Kaj_Sotala on CFAR Takeaways: Andrew Critch · 2024-02-28T18:29:57.948Z · LW · GW

There's something about this framing that feels off to me and makes me worry that it could be counterproductive. I think my main concerns are something like:

1) People often figure out what they want by pursuing things they think they want and then updating on the outcomes. So making them less certain about their wants might prevent them from pursuing the things that would give them the information for actually figuring it out.

2) I think that people's wants are often underdetermined and they could end up wanting many different things based on their choices. E.g. most people could probably be happy in many different kinds of careers that were almost entirely unlike each other, if they just picked one that offered decent working conditions and committed to it. I think this is true for a lot of things that people might potentially want, but to me the framing of "figure out what you want" implies that people's wants are a lot more static than this.

I think this 80K article expresses these kinds of ideas pretty well in the context of career choice:

The third problem [with the advice of "follow your passion"] is that it makes it sound like you can work out the right career for you in a flash of insight. Just think deeply about what truly most motivates you, and you’ll realise your “true calling”. However, research shows we’re bad at predicting what will make us happiest ahead of time, and where we’ll perform best. When it comes to career decisions, our gut is often unreliable. Rather than reflecting on your passions, if you want to find a great career, you need to go and try lots of things.

The fourth problem is that it can make people needlessly limit their options. If you’re interested in literature, it’s easy to think you must become a writer to have a satisfying career, and ignore other options.

But in fact, you can start a career in a new area. If your work helps others, you practice to get good at it, you work on engaging tasks, and you work with people you like, then you'll become passionate about it. The ingredients of a dream job that we've found are most supported by the evidence are all about the context of the work, not the content. Ten years ago, we would have never imagined being passionate about giving career advice, but here we are, writing this article.

Many successful people are passionate, but often their passion developed alongside their success, rather than coming first. Steve Jobs started out passionate about zen buddhism. He got into technology as a way to make some quick cash. But as he became successful, his passion grew, until he became the most famous advocate of “doing what you love”.

Comment by Kaj_Sotala on New LessWrong review winner UI ("The LeastWrong" section and full-art post pages) · 2024-02-28T10:58:19.124Z · LW · GW

Comment retracted because right after writing it, I realized that the "leastwrong" is a section on LW, not its own site. I thought there was a separate leastwrong.com or something. In this case, I have much less of a feeling that it makes a global claim.

Comment by Kaj_Sotala on New LessWrong review winner UI ("The LeastWrong" section and full-art post pages) · 2024-02-28T10:51:25.297Z · LW · GW

Edit: An initial attempt is "The LeastWrong" feels a bit like a global claim of "these are the least wrong things on the internet". 

This is how it feels to me. 

Whether you can find a logic in which that interpretation is not coherent doesn't seem relevant to me. You can always construct a story according to which a particular association is actually wrong, but that doesn't stop people from having that association. (And I think there are reasonable grounds for people to be suspicious about such stories, in that they enable a kind of motte-and-bailey: using a phrasing that sends the message X, while saying that of course we don't mean to send that message and here's an alternative interpretation that's compatible with that phrasing. So I think that a lot of the people who'd find the title objectionable would be unpersuaded by your alternative interpretation, even assuming that they bothered to listen to it, and they would not be unreasonable to reject it.)

Comment by Kaj_Sotala on Why you, personally, should want a larger human population · 2024-02-25T09:00:07.149Z · LW · GW

Software/internet gives us much better ability to find.

And yet...

The past few decades have recorded a steep decline in people’s circle of friends and a growing number of people who don’t have any friends whatsoever. The number of Americans who claim to have “no close friends at all” across all age groups now stands at around 12% as per the Survey Center on American Life.

The percentage of people who say they don't have a single close friend has quadrupled in the past 30 years, according to the Survey Center on American Life.

It’s been known that friendlessness is more common for men, but it is nonetheless affecting everyone. The general change since 1990 is illustrated below.

Taken from "Adrift: America in 100 Charts" (2022), pg. 223. As a detail, note the drastic drop of people with 10+ friends, now a small minority.

The State of American Friendship: Change, Challenges, and Loss (2021), pg. 7

Although these studies are more general estimates of the entire population, it looks worse when we focus exclusively on generations that are more digitally native. When polling exclusively American millennials, a pre-pandemic 2019 YouGov poll found 22% have "zero friends" and 30% had "no best friends." For those born between 1997 and 2012 (Generation Z), there has been no widespread, credible study done yet on this question — but if you're adjacent to internet spaces, you already intuitively grasp that these same online catalysts are deepening for the next generation.

Comment by Kaj_Sotala on Why you, personally, should want a larger human population · 2024-02-25T08:51:14.965Z · LW · GW

Still, the fact that individual companies, for instance, develop layers of bureaucracy is not an argument against having a large economy.

This is true in principle, but in practice population growth has led to the creation of larger companies. Here's what ChatGPT said when I asked it what proportion of the economy is controlled by the biggest 100 companies:

For a rough estimate, consider the market capitalization of the 100 largest public companies relative to GDP. As of early 2023, the market capitalization of the S&P 100, which includes the 100 largest U.S. companies by market cap, was several trillion USD, while the U.S. GDP was about 23 trillion USD. This suggests a significant but not dominant share, with the caveat that market cap doesn't directly translate to economic contribution.

And if the population in every country would grow, then we'd end up with larger governments even if we kept the current system and never established a world government. To avoid governments getting bigger, you'd need to actively break up countries into smaller ones as their population increased. That doesn't seem like a thing that's going to happen.

Comment by Kaj_Sotala on Why you, personally, should want a larger human population · 2024-02-24T21:11:50.565Z · LW · GW

A possible countertrend would be something like diseconomies of scale in governance. I don't know the right keywords to find the actual studies on this. Still, it generally seems to me like smaller nations and companies are better run than bigger ones, as the larger ones develop more middle management and organizational layers mainly incentivized to manipulate themselves rather than to do the thing they're supposedly doing. This does not just waste the resources of the government itself; it also damages everyone else, as the legislation it enacts gets worse and worse. And the larger the system becomes, the harder any attempts to reform it become.

Comment by Kaj_Sotala on Why you, personally, should want a larger human population · 2024-02-24T20:59:12.089Z · LW · GW

Better matching to other people. A bigger world gives you a greater chance to find the perfect partner for you: the best co-founder for your business, the best lyricist for your songs, the best partner in marriage.

I'm skeptical of this; "better matching" implies "better ability to find". But just increasing the size of the population does not imply a better chance of finding the best matches, given that it also increases the number of non-matches proportionally. And I think it's already the case that the ability to find the right people is a much bigger bottleneck than their mere existence.

It's also worth noting that as the population grows, so does the number of competitors. Maybe a 100x bigger population would have 100x the lyricists, but it may also have 100x the people wanting to hire those lyricists for themselves.

(Similar points also apply to the other "better matching" items.)

Comment by Kaj_Sotala on 2023 Survey Results · 2024-02-20T07:32:59.597Z · LW · GW

Which religion claims nothing supernatural at all happened?

Secular versions of Buddhism, versions of neo-paganism that interpret themselves to ultimately be manipulating psychological processes, religions whose conception of the divine is derived from scientific ideas, etc. More generally, many religions that define themselves primarily through practice rather than belief can be compatible with a lack of the supernatural (though of course aren't necessarily).

Comment by Kaj_Sotala on CFAR Takeaways: Andrew Critch · 2024-02-16T06:56:33.929Z · LW · GW

Agree. The advice I've heard for avoiding this is, instead of saying "try X", ask "what have you already tried" and then ideally ask some follow-up questions to further probe why exactly the things they've tried haven't worked yet. You might then be able to offer advice that's a better fit, and even if it turns out that they actually haven't tried the thing, it'll likely still be better received because you made an effort to actually understand their problem first. (I've sometimes used the heuristic, "don't propose any solutions until you could explain to a third party why this person hasn't been able to solve their problem yet".)

Comment by Kaj_Sotala on Lsusr's Rationality Dojo · 2024-02-13T18:54:14.750Z · LW · GW

At that moment, he was enlightened.

I somehow felt fuzzy and nice reading this; it's so distinctly your writing style and it's nice to have you around, being you and writing in your familiar slightly quirky style. (It also communicated the point well.)

Comment by Kaj_Sotala on Believing In · 2024-02-12T06:55:52.176Z · LW · GW

While he doesn't explicitly use the word "prediction" that much in the post, he does talk about "anticipated experiences", which around here is taken to be synonymous with "predicted experiences".

Comment by Kaj_Sotala on Why I no longer identify as transhumanist · 2024-02-06T18:19:01.183Z · LW · GW

I don't fully understand the actual math of it so I probably am not fully getting it. But if the core idea is something like "you can at every timestep take new experiences and then choose how to integrate them into a new you, with the particulars of that choice (and thus the nature of the new you) drawing on everything that you are at that timestep", then I like it.

I might quibble a bit about the extent to which something like that is actually a conscious choice, but if the "you" in question is thought to be all of your mind (subconsciousness and all) then that fixes it. Plus making it into more of a conscious choice over time feels like a neat aspirational goal.

... now I do feel more of a desire to live some several hundred years in order to do that, actually.

Comment by Kaj_Sotala on Why I no longer identify as transhumanist · 2024-02-05T20:20:37.300Z · LW · GW

I have read that book, but it's been long enough that I don't really remember anything about it.

Though I would guess that if you were to describe it, my reaction would be something along the lines of "if you want to have a theory of identity, sounds as valid as any other".

Comment by Kaj_Sotala on Why I no longer identify as transhumanist · 2024-02-03T14:52:54.169Z · LW · GW

I suspect that a short, private conversation with your copy would change your mind

Can you elaborate how?

E.g. suppose that it was the case that I would get copied, and then one of us would be chosen by lot to be taken in front of a firing squad while the other could continue his life freely. I expect - though of course it's hard to fully imagine this kind of a hypothetical - that the thought of being taken in front of that firing squad and never seeing any of my loved ones again would create a rather visceral sense of terror in me. Especially if I was given a couple of days for the thought to sink in, and I wouldn't just be in a sudden shock of "wtf is happening".

It's possible that the thought of an identical copy of me being out there in the world would bring some comfort to that, but mostly I don't see how any conversation would have a chance of significantly nudging those reactions. They seem much too primal and low-level for that.

Comment by Kaj_Sotala on Why I no longer identify as transhumanist · 2024-02-03T14:40:35.936Z · LW · GW

I said "in some sense", which grants the possibility that there is also a sense in which personal identity does exist.

I think the kind of definition that you propose is valid but not emotionally compelling in the same way as my old intuitive sense of personal identity was.

It also doesn't match some other intuitive senses of personal identity, e.g. if you managed to somehow create an identical copy of me then it implies that I should be indifferent to whether I or my copy live. But if that happened, I suspect that both of my instances would prefer to be the ones to live.

Comment by Kaj_Sotala on Why I no longer identify as transhumanist · 2024-02-03T14:16:00.388Z · LW · GW

Do you mean s-risks, x-risks, age of em style future, stagnation, or mainstream dystopic futures?

"All of the above" - I don't know exactly which outcome to expect, but most of them feel bad and there seem to be very few routes to actual good outcomes. If I had to pick one, "What failure looks like" seems intuitively most probable, as it seems to require little else than current trends continuing.

I am suspicious about claims of this sort. It sounds like a case of "x is an illusion. Therefore, the pre-formal things leading to me reifying x are fake too." 

That sounds like a reasonable thing to be suspicious about! I should possibly also have linked my take on the self as a narrative construct.

Though I don't think that I'm saying the pre-formal things are fake. At least to my mind, that would correspond to saying something like "There's no lasting personal identity so there's no reason to do things that make you better off in the future". I'm clearly doing things that will make me better off in the future. I just feel less continuity to the version of me who might be alive fifty years from now, so the thought of him dying of old age doesn't create a similar sense of visceral fear. (Even if I would still prefer him to live hundreds of years, if that was doable in non-dystopian conditions.)

Comment by Kaj_Sotala on Wrong answer bias · 2024-02-02T10:57:49.158Z · LW · GW
  • Hate nuance

Obligatory link

Comment by Kaj_Sotala on Kaj's shortform feed · 2024-02-02T10:49:39.990Z · LW · GW

I only now made the connection that Sauron lost because he fell prey to the Typical Mind Fallacy (assuming that everyone's mind works the way your own does). Gandalf in the book version of The Two Towers:

The Enemy, of course, has long known that the Ring is abroad, and that it is borne by a hobbit. He knows now the number of our Company that set out from Rivendell, and the kind of each of us. But he does not yet perceive our purpose clearly. He supposes that we were all going to Minas Tirith; for that is what he would himself have done in our place. And according to his wisdom it would have been a heavy stroke against his power.

Indeed he is in great fear, not knowing what mighty one may suddenly appear, wielding the Ring, and assailing him with war, seeking to cast him down and take his place. That we should wish to cast him down and have no one in his place is not a thought that occurs to his mind. That we should try to destroy the Ring itself has not yet entered into his darkest dream. In which no doubt you will see our good fortune and our hope. For imagining war he has let loose war, believing that he has no time to waste; for he that strikes the first blow, if he strikes it hard enough, may need to strike no more. So the forces that he has long been preparing he is now setting in motion, sooner than he intended. Wise fool. For if he had used all his power to guard Mordor, so that none could enter, and bent all his guile to the hunting of the Ring, then indeed hope would have faded: neither Ring nor Bearer could long have eluded him.

Comment by Kaj_Sotala on How Emergency Medicine Solves the Alignment Problem · 2023-12-26T21:11:36.690Z · LW · GW

I'm not sure how useful this will be for alignment, but I didn't know anything about EMT training or protocols before, and upvoted this for being a fascinating look into that.

Comment by Kaj_Sotala on Would you have a baby in 2024? · 2023-12-26T09:01:09.571Z · LW · GW

Related previous discussion: Is the AI timeline too short to have children?

Comment by Kaj_Sotala on AI Views Snapshots · 2023-12-14T15:34:36.196Z · LW · GW

That was very convenient, thank you!

Comment by Kaj_Sotala on Is being sexy for your homies? · 2023-12-14T09:23:05.088Z · LW · GW

Interesting speculation!

But… I mean, think of a bakery of all (straight) men.

Then think of the same bakery, but it's all (straight) women.

Then imagine the same bakery, but it's mixed sex.

Can you see what happens?

Even if there's no attraction going on in the last case, the fact that there could be dramatically changes the unspoken dynamics. It's just not as stable as the other two.

I feel like you have some implicit additional assumptions WRT what you mean by "stable", here.

Like, are we talking about an ordinary bakery where someone is the owner and they hire staff to work there? In that case, if "stable" means something like "rate of staff turnover", I wouldn't expect the gender mix to significantly affect the stability. I'd expect it to be much more driven by things like ordinary working conditions, how well the staff were paid, etc.

(I also have the intuition that a single-gender environment would be less stable in the sense of being somehow "more stale" and "less alive" than a mixed-gender one, and thus less stable in the long-term, though that may very well just be my personal discomfort with single-gender environments.)

Comment by Kaj_Sotala on Quick thoughts on the implications of multi-agent views of mind on AI takeover · 2023-12-12T21:06:57.646Z · LW · GW

AGI is dangerous if it pursues an unaligned goal more competently than humans. [...] It's proposing "AGI won't work". 

I'd say it's proposing something like "minds including AGIs generally aren't agentic enough to reliably exert significant power on the world", with an implicit assumption like "minds that look like they have done that have mostly just gotten lucky or benefited from something like lots of built-up cultural heuristics that are only useful in a specific context and would break down in a sufficiently novel situation".

I agree that even if this was the case, it wouldn't eliminate the argument for AI risk; even allowing that, AIs could still become more competent than us and eventually, some of them could get lucky too. My impression of the original discussion was that the argument wasn't meant as an argument against all AI risk, but rather just against hard takeoff-type scenarios depicting a single AI that takes over the world by being supremely intelligent and agentic.

Comment by Kaj_Sotala on Quick thoughts on the implications of multi-agent views of mind on AI takeover · 2023-12-12T16:23:45.939Z · LW · GW

It still seems plausible to me that you might have a mind made of many different parts, but there is a clear "agent" bit that actually has goals and is controlling all the other parts.

What would that look like in practice?

Comment by Kaj_Sotala on Quick thoughts on the implications of multi-agent views of mind on AI takeover · 2023-12-11T15:39:53.529Z · LW · GW

Sure, right. (There are some theories suggesting that the human brain does something like a bidding process with the subagents with the best track record for prediction winning the ability to influence things more, though of course the system is different from an actual prediction market.) That's significantly different from the system ceasing to meaningfully have subagents at all though, and I understood rorygreig to be suggesting that it might cease to have them.

Comment by Kaj_Sotala on Quick thoughts on the implications of multi-agent views of mind on AI takeover · 2023-12-11T14:27:27.837Z · LW · GW

I don't understand what you're saying. (I mean, sure they can, but what makes you bring that up?)

Comment by Kaj_Sotala on Quick thoughts on the implications of multi-agent views of mind on AI takeover · 2023-12-11T13:00:01.371Z · LW · GW

However it seems plausible to me that these sub-agents may “cohere” under sufficient optimisation or training.

I think it's possible to unify them somewhat, in terms of ensuring that they don't have outright contradictory models or goals, but I don't really see a path where a realistically feasible mind would stop being made up of different subagents. The subsystem that thinks about how to build nanotechnology may have overlap with the subsystem that thinks about how to do social reasoning, but it's still going to be more efficient to have them specialized for those tasks rather than trying to combine them into one. Even if you did try to combine them into one, you'd still run into physical limits - in the human brain, it's hypothesized that one of the reasons why it takes time to think about novel decisions is that

different pieces of relevant information are found in physically disparate memory networks and neuronal sites. Access from the memory networks to the evidence accumulator neurons is physically bottlenecked by a limited number of “pipes”. Thus, a number of different memory networks need to take turns in accessing the pipe, causing a serial delay in the evidence accumulation process.

There are also closely related considerations for how much processing and memory you can cram into a single digital processing unit. In my language, each of those memory networks is its own subagent, holding different perspectives and considerations. For any mind that holds a nontrivial amount of memories and considerations, there are going to be plain physical limits on how much of that can be retrieved and usefully processed at a central location, making it vastly more efficient to run thought processes in parallel than try to force everything through a single bottleneck.
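
To make the bottleneck intuition concrete, here's a minimal toy calculation (my own sketch, not taken from the quoted hypothesis; the function name and all numbers are invented) of how total retrieval time scales when many memory networks have to take turns on a small number of shared "pipes", versus each subagent having its own channel:

```python
import math

def retrieval_time(num_networks: int, num_pipes: int, time_per_access: float = 1.0) -> float:
    """Time for all memory networks to deliver their evidence when they
    must take turns on a fixed number of shared access channels ("pipes")."""
    # Each round serves at most num_pipes networks in parallel;
    # the rest have to wait for a later round.
    rounds = math.ceil(num_networks / num_pipes)
    return rounds * time_per_access

# Centralized design: 100 memory networks funneled through 2 pipes
# into a single evidence accumulator -> 50 time units of serial delay.
print(retrieval_time(num_networks=100, num_pipes=2))

# Parallel/subagent-style design: each network feeds its own local
# processor -> 1 time unit, regardless of how many networks there are.
print(retrieval_time(num_networks=100, num_pipes=100))
```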

Comment by Kaj_Sotala on The Offense-Defense Balance Rarely Changes · 2023-12-09T19:40:59.692Z · LW · GW

At least assuming that it has fully automated infrastructure. If it still needs humans to keep the power plants running and to do physical maintenance in the server rooms, it becomes indirectly susceptible to the bioweapons.

Comment by Kaj_Sotala on Buy Nothing Day is a great idea with a terrible app— why has nobody built a killer app for crowdsourced 'effective communism' yet? · 2023-12-04T12:43:31.246Z · LW · GW

Wow.

Comment by Kaj_Sotala on 2023 Unofficial LessWrong Census/Survey · 2023-12-02T14:04:15.827Z · LW · GW

Mark whether to make your responses private; ie exclude them when this data is made public. Keep in mind that although it should in theory be difficult to identify you from your survey results, it may be possible if you have an unusual answer to certain questions, for example your Less Wrong karma. Please also be aware that even if this box is checked, the person collecting the surveys (Screwtape) will be able to see your results (but will keep them confidential)

This sounds like there should be a checkbox here, but I see two "spherical" response options instead.

Comment by Kaj_Sotala on 2023 Unofficial LessWrong Census/Survey · 2023-12-02T13:59:20.117Z · LW · GW

I didn't fill it out yet, but I just want to say that I appreciate all of the survey being on one page rather than requiring you to fill out all the answers on the first page and then click "next page" to see more questions. That would have been particularly annoying given the "do you want your answers to be made private" question - I want to see what I'm asked before I can tell whether I want to keep my answers private! Kudos.

Comment by Kaj_Sotala on Thoughts on “AI is easy to control” by Pope & Belrose · 2023-12-02T13:36:59.087Z · LW · GW

I mostly agree with what you say, just registering my disagreement/thoughts on some specific points. (Note that I haven't yet read the page you're responding to.)

Hopefully everyone on all sides can agree that if my LLM reliably exhibits a certain behavior—e.g. it outputs “apple” after a certain prompt—and you ask me “Why did it output ‘apple’, rather than ‘banana’?”, then it might take me decades of work to give you a satisfying intuitive answer. 

Maybe? Depends on what exactly you mean by the word "might", but it doesn't seem obvious to me that this would need to be the case. My intuition, from seeing the kinds of interpretability results we've seen so far, is that within less than a decade we'd already have a pretty rigorous theory and toolkit for answering these kinds of questions. At least assuming that we don't keep switching to LLM architectures that work based on entirely different mechanisms and make all of the previous interpretability work irrelevant.

If by "might" you mean something like a "there's at least a 10% probability that this could take decades to answer" then sure, I'd agree with that. Now I haven't actually thought about this specific question very much before seeing it pop up in your post, so I might radically revise my intuition if I thought about it more, but at least it doesn't seem immediately obvious to me that I should assign "it would take decades of work to answer this" a very high probability.

Instead, the authors make a big deal out of the fact that human innate drives are relatively simple (I think they mean “simple compared to a modern big trained ML model”, which I would agree with). I’m confused why that matters. Who cares if there’s a simple solution, when we don’t know what it is?

I would assume the intuition to be something like "if they're simple, then given the ability to experiment on minds and access AI internals, it will be relatively easy to figure out how to make the same drives manifest in an AI; the amount of (theory + trial and error) required for that will not be as high as it would be if the drives were intrinsically complex".

We can run large numbers of experiments to find the most effective interventions, and we can also run it in a variety of simulated environments and test whether it behaves as expected both with and without the cognitive intervention. Each time the AI’s “memories” can be reset, making the experiments perfectly reproducible and preventing the AI from adapting to our actions, very much unlike experiments in psychology and social science.

That sounds nice, but brain-like AGI (like most RL agents) does online learning. So if you run a bunch of experiments, then as soon as the AGI does anything whatsoever (e.g. reads the morning newspaper), your experiments are all invalid (or at least, open to question), because now your AGI is different than it was before (different ANN weights, not just different environment / different prompt). Humans are like that too, but LLMs are not.

There's something to that, but this sounds too strong to me. If someone had hypothetically spent a year observing all of my behavior, having some sort of direct read access to what was happening in my mind, and also doing controlled experiments where they reset my memory and tested what happened with some different stimulus... it's not like all of their models would become meaningless the moment I read the morning newspaper. If I had read morning newspapers before, they would probably have a pretty good model of what the likely range of updates for me would be. 

Of course, if there was something very unexpected and surprising in the newspaper, that might cause a bigger update, but I expect that they would also have reasonably good models of the kinds of things that are likely to trigger major updates or significant emotional shifts in me. If they were at all competent, that's specifically the kind of thing that I'd expect them to work on trying to find out!

And even if there was a major shift, I think it's basically unheard of that literally everything about my thoughts and behavior would change. When I first understood the potentially transformative impact of AGI, it didn't change the motor programs that determine how I walk or brush my teeth, nor did it significantly change what kinds of people I feel safe around (aside from some increase in trust toward other people who I felt "get it"). I think that human brains quite strongly preserve their behavior and prediction structures, just adjusting them somewhat when faced with new information. Most of the models and predictions you've made about an adult will tend to stay valid, though of course with children and younger people there's much greater change.

Now, as it happens, humans do often imitate other humans. But other times they don’t. Anyway, insofar as humans-imitating-other-humans happens, it has to happen via a very different and much less direct algorithmic mechanism than how it happens in LLMs. Specifically, humans imitate other humans because they want to, i.e. because of the history of past reinforcement, directly or indirectly. Whereas a pretrained LLM will imitate human text with no RL or “wanting to imitate” at all, that’s just mechanically what it does.

In some sense yes, but it does also seem to me that prediction and desire do get conflated in humans in various ways, and that it would be misleading to say that the people in question want the outcomes they end up enacting. For example, I think about this post by @romeostevensit often:

Fascinating concept that I came across in military/police psychology dealing with the unique challenges people face in situations of extreme stress/danger: scenario completion. Take the normal pattern completion that people do and put fear blinders on them so they only perceive one possible outcome and they mechanically go through the motions *even when the outcome is terrible* and there were obvious alternatives. This leads to things like officers shooting *after* a suspect has already surrendered, having overly focused on the possibility of needing to shoot them. It seems similar to target fixation where people under duress will steer a vehicle directly into an obstacle that they are clearly perceiving (looking directly at) and can't seem to tear their gaze away from. Or like a self fulfilling prophecy where the details of the imagined bad scenario are so overwhelming, with so little mental space for anything else that the person behaves in accordance with that mental picture even though it is clearly the mental picture of the *un*desired outcome.

I often try to share the related concept of stress induced myopia. I think that even people not in life or death situations can get shades of this sort of blindness to alternatives. It is unsurprising when people make sleep a priority and take internet/screen fasts that they suddenly see that the things they were regarding as obviously necessary are optional. In discussion of trauma with people this often seems to be an element of relationships sadly enough. They perceive no alternative and so they resign themselves to slogging it out for a lifetime with a person they are very unexcited about. This is horrific for both people involved.

It's, of course, true that for an LLM, prediction is the only thing it can do, and that humans have a system of desires on top of that. But it looks to me like a lot of human behavior is just having LLM-ish predictive models of how someone like them would behave in a particular situation, which is also the reason why conceptual reframings like the ones you can get in therapy can be so powerful ("I wasn't lazy after all, I just didn't have the right tools for being productive" can drastically reorient many predictions you're making of yourself and thus your behavior). (See also my post on human LLMs, which has more examples.)

While it's obviously true that there is a lot of stuff operating in brains besides LLM-like prediction, such as mechanisms that promote specific predictive models over other ones, that seems to me to only establish that "the human brain is not just LLM-like prediction", while you seem to be saying that "the human brain does not do LLM-like prediction at all". (Of course, "LLM-like prediction" is a vague concept and maybe we're just using it differently and ultimately agree.)

Comment by Kaj_Sotala on OpenAI: Facts from a Weekend · 2023-11-21T19:47:58.739Z · LW · GW

To elaborate on that, Shear is presumably saying exactly as much as he is allowed to say in public. This implies that if the removal had nothing to do with safety, then he would say "The board did not remove Sam over anything to do with safety". His inserting of that qualifier implies that he couldn't make a statement that broad, and therefore that safety considerations were involved in the removal.

Comment by Kaj_Sotala on Sam Altman fired from OpenAI · 2023-11-18T20:40:54.208Z · LW · GW

I expect safety of that to be at zero

At least it refuses to give you instructions for making cocaine.

Comment by Kaj_Sotala on New LessWrong feature: Dialogue Matching · 2023-11-17T19:39:57.257Z · LW · GW

For site libraries, there is indeed no alternative since you have to use some libraries to get anything done, so there you do have to do it on a case-by-case basis. In the case of exposing user data, there is an alternative - limiting yourself to only public data. (See also my reply to jacobjacob.)

Comment by Kaj_Sotala on New LessWrong feature: Dialogue Matching · 2023-11-17T19:36:26.294Z · LW · GW

we're a small team and the world is on fire, and I don't think we should really be prioritising making Dialogue Matching robust to this kind of adversarial cyber threat for information of comparable scope and sensitivity! 

I agree that it wouldn't be a very good use of your resources. But there's a simple solution for that - only use data that's already public and users have consented to you using. (Or offer an explicit opt-in where that isn't the case.)

I do agree that in this specific instance, there's probably little harm in the information being revealed. But I generally also don't think that that's the site admin's call to make, even if I happen to agree with that call in some particular instances. A user may have all kinds of reasons to want to keep some information about themselves private, some of those reasons/kinds of information being very idiosyncratic and hard to know in advance. The only way to respect every user's preferences for privacy, even the unusual ones, is by letting them control what information is used and not make any of those calls on their behalf.

Comment by Kaj_Sotala on New LessWrong feature: Dialogue Matching · 2023-11-17T19:15:37.667Z · LW · GW

My point is less about the individual example than the overall decision algorithm. Even if you're correct that in this specific instance, you can verify the whole trail of implications and be certain that nothing bad happens, a general policy of "figure it out on a case-by-case basis and only do it when it feels safe" means that you're probably going to make a mistake eventually, given how easy it is to make a mistake in this domain.

Comment by Kaj_Sotala on Open Thread – Autumn 2023 · 2023-11-17T18:12:58.155Z · LW · GW

I've wondered the same thing; I've previously suggested merging them, so that shortform posts would automatically be posted into that month's open thread and vice versa. As it is, every now and then I can't decide which one to post in, so I post in neither.

Comment by Kaj_Sotala on New LessWrong feature: Dialogue Matching · 2023-11-17T18:04:41.621Z · LW · GW

We tentatively postulated it would be fine to do this as long as seeing a name on your match page gave no more than like a 5:1 update about those people having checked you.

I would strongly advocate against this kind of thought; any such decision-making procedure relies on the assumption that you correctly figure out all the ways such information can be used, and that there isn't a clever way an adversary can extract more information than you had thought. This is bound to fail - people come up with clever ways to extract more private information than anticipated all the time. For example:

  • Timing Attacks on Web Privacy:
    • We describe a class of attacks that can compromise the privacy of users’ Web-browsing histories. The attacks allow a malicious Web site to determine whether or not the user has recently visited some other, unrelated Web page. The malicious page can determine this information by measuring the time the user’s browser requires to perform certain operations. Since browsers perform various forms of caching, the time required for operations depends on the user’s browsing history; this paper shows that the resulting time variations convey enough information to compromise users’ privacy.
  • Robust De-anonymization of Large Datasets (How to Break Anonymity of the Netflix Prize Dataset)
    • We apply our de-anonymization methodology to the Netflix Prize dataset, which contains anonymous movie ratings of 500,000 subscribers of Netflix, the world’s largest online movie rental service. We demonstrate that an adversary who knows only a little bit about an individual subscriber can easily identify this subscriber’s record in the dataset. Using the Internet Movie Database as the source of background knowledge, we successfully identified the Netflix records of known users, uncovering their apparent political preferences and other potentially sensitive information.
  • De-anonymizing Social Networks 
    • We present a framework for analyzing privacy and anonymity in social networks and develop a new re-identification algorithm targeting anonymized social network graphs. To demonstrate its effectiveness on real-world networks, we show that a third of the users who can be verified to have accounts on both Twitter, a popular microblogging service, and Flickr, an online photo-sharing site, can be re-identified in the anonymous Twitter graph with only a 12% error rate. Our de-anonymization algorithm is based purely on the network topology, does not require creation of a large number of dummy “sybil” nodes, is robust to noise and all existing defenses, and works even when the overlap between the target network and the adversary’s auxiliary information is small.
  • On the Anonymity of Home/Work Location Pairs
    • Many applications benefit from user location data, but location data raises privacy concerns. Anonymization can protect privacy, but identities can sometimes be inferred from supposedly anonymous data. This paper studies a new attack on the anonymity of location data. We show that if the approximate locations of an individual’s home and workplace can both be deduced from a location trace, then the median size of the individual’s anonymity set in the U.S. working population is 1, 21 and 34,980, for locations known at the granularity of a census block, census track and county respectively. The location data of people who live and work in different regions can be re-identified even more easily. Our results show that the threat of re-identification for location data is much greater when the individual’s home and work locations can both be deduced from the data.
  • Bubble Trouble: Off-Line De-Anonymization of Bubble Forms
    • Fill-in-the-bubble forms are widely used for surveys, election ballots, and standardized tests. In these and other scenarios, use of the forms comes with an implicit assumption that individuals’ bubble markings themselves are not identifying. This work challenges this assumption, demonstrating that fill-in-the-bubble forms could convey a respondent’s identity even in the absence of explicit identifying information. We develop methods to capture the unique features of a marked bubble and use machine learning to isolate characteristics indicative of its creator. Using surveys from more than ninety individuals, we apply these techniques and successfully reidentify individuals from markings alone with over 50% accuracy.
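
The common thread in all of these is that a handful of individually innocuous attributes can jointly single a person out. A minimal sketch of that mechanism (my own toy example with invented data, not code from any of the papers above):

```python
# Each record looks harmless on its own: no names, just a few coarse attributes.
anonymized_records = [
    {"record": "r1", "home_area": "A", "work_area": "X", "rated_obscure_film": True},
    {"record": "r2", "home_area": "A", "work_area": "Y", "rated_obscure_film": True},
    {"record": "r3", "home_area": "B", "work_area": "X", "rated_obscure_film": False},
]

# Auxiliary information an adversary might plausibly know about one target,
# e.g. from a public profile or casual conversation.
known_about_target = {"home_area": "A", "work_area": "X", "rated_obscure_film": True}

# Intersecting the attributes narrows the "anonymous" dataset to a single record.
candidates = [
    r for r in anonymized_records
    if all(r[key] == value for key, value in known_about_target.items())
]
print(candidates)  # -> only r1 remains: the target is re-identified
```

With more attributes and a larger dataset, the same intersection logic gets more powerful rather than less, which is why it's so hard to anticipate in advance which combinations of data will turn out to be identifying.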
Comment by Kaj_Sotala on You can just spontaneously call people you haven't met in years · 2023-11-14T14:48:29.289Z · LW · GW

Hmm, I would actually expect neurotypicals to find this advice more useful, since they're more likely to have thoughts like "I can't do that, that'd be weird" while the stereotypical autist would be blissfully unaware of there being anything weird about it.

Comment by Kaj_Sotala on My idea of sacredness, divinity, and religion · 2023-11-08T06:58:28.636Z · LW · GW

No worries! Yeah, I agree with that. These paragraphs were actually trying to explicitly say that things may very well not work out in the end, but maybe that wasn't clear enough:

Love doesn’t always win. There are situations where loyalty, cooperation, and love win, and there are situations where disloyalty, selfishness, and hatred win. If that wasn’t the case, humans wouldn’t be so clearly capable of both.

It’s possible for people and cultures to settle into stable equilibria where trust and happiness dominate and become increasingly beneficial for everyone, but also for them to settle into stable equilibria where mistrust and misery dominate, or anything in between.

Comment by Kaj_Sotala on Genetic fitness is a measure of selection strength, not the selection target · 2023-11-06T09:44:52.041Z · LW · GW

I don't think any of these arguments depend crucially on whether there is a sole explicit goal of the training process, or if the goal of the training process changes a bunch. The only thing the argument depends on is whether there exist such abstract drives/goals 

I agree that they don't depend on that. Your arguments are also substantially different from the ones I was criticizing! The ones I was responding to were ones like the following:

The central analogy here is that optimizing apes for inclusive genetic fitness (IGF) doesn't make the resulting humans optimize mentally for IGF. Like, sure, the apes are eating because they have a hunger instinct and having sex because it feels good—but it's not like they could be eating/fornicating due to explicit reasoning about how those activities lead to more IGF. They can't yet perform the sort of abstract reasoning that would correctly justify those actions in terms of IGF. And then, when they start to generalize well in the way of humans, they predictably don't suddenly start eating/fornicating because of abstract reasoning about IGF, even though they now could. Instead, they invent condoms, and fight you if you try to remove their enjoyment of good food (telling them to just calculate IGF manually). The alignment properties you lauded before the capabilities started to generalize, predictably fail to generalize with the capabilities. (A central AI alignment problem: capabilities generalization, and the sharp left turn)

15. [...] We didn't break alignment with the 'inclusive reproductive fitness' outer loss function, immediately after the introduction of farming - something like 40,000 years into a 50,000 year Cro-Magnon takeoff, as was itself running very quickly relative to the outer optimization loop of natural selection.  Instead, we got a lot of technology more advanced than was in the ancestral environment, including contraception, in one very fast burst relative to the speed of the outer optimization loop, late in the general intelligence game. [...]

16. Even if you train really hard on an exact loss function, that doesn't thereby create an explicit internal representation of the loss function inside an AI that then continues to pursue that exact loss function in distribution-shifted environments.  Humans don't explicitly pursue inclusive genetic fitness; outer optimization even on a very exact, very simple loss function doesn't produce inner optimization in that direction.  (AGI Ruin: A List of Lethalities)

Those arguments are explicitly premised on humans having been optimized for IGF, which is implied to be a single thing. As I understand it, your argument is just that humans now have some very different behaviors from the ones they used to have, omitting any claims of what evolution originally optimized us for, so I see it as making a very different sort of claim.

To respond to your argument itself:

I agree that there are drives for which the behavior looks very different from anything that we did in the ancestral environment. But does very different-looking behavior by itself constitute a sharp left turn relative to our original values?

I would think that if humans had experienced a sharp left turn, then the values of our early ancestors should look unrecognizable to us, and vice versa. And certainly, there do seem to be quite a few things that our values differ on - modern notions like universal human rights and living a good life while working in an office might seem quite alien and repulsive to some tribal warrior who values valor in combat and killing and enslaving the neighboring tribe, for instance.

At the same time... I think we can still basically recognize and understand the values of that tribal warrior, even if we don't share them. We do still understand what's attractive about valor, power, and prowess, and continue to enjoy those kinds of values in less destructive forms in sports, games, and fiction. We can read Gilgamesh or Homer or Shakespeare and basically get what the characters are motivated by and why they are doing the things they're doing. An anthropologist can go to a remote tribe to live among them and report that they have the same cultural and psychological universals as everyone else and come away with at least some basic understanding of how they think and why.

It's true that humans couldn't eradicate diseases before. But if you went to people very far back in time and told them a story about a group of humans who invented a powerful magic that could destroy diseases forever and then worked hard to do so... then the people of that time would not understand all of the technical details, and maybe they'd wonder why we'd bother bringing the cure to all of humanity rather than just our tribe (though Prometheus is at least commonly described as stealing fire for all of humanity, so maybe not), but I don't think they would find it a particularly alien or unusual motivation otherwise. Humans have hated disease for a very long time, and if they'd lost any loved ones to the particular disease we were eradicating they might even cheer for our doctors and want to celebrate them as heroes.

Similarly, humans have always gone on voyages of exploration - e.g. the Pacific islands were discovered and settled long ago by humans going on long sea voyages - so they'd probably have no difficulty relating to a story about sorcerers going to explore the moon, or about two tribes racing for the glory of getting there first. Babylonians had invented the quadratic formula by 1600 BC and apparently had a form of Fourier analysis by 300 BC, so the math nerds among them would probably have some appreciation of modern-day advanced math if it was explained to them. The Greek philosophers argued over epistemology, and there were apparently instructions in circulation on how to animate golems (arguably AGI-like) by the late 12th/early 13th century.

So I agree that the same fundamental values and drives can create very different behavior in different contexts... but if it is still driven by the same fundamental values and drives in a way that people across time might find relatable, why is that a sharp left turn? Analogizing that to AI, it would seem to imply that if the AI generalized its drives in that kind of way when it came to novel contexts, then we would generally still be happy about the way it had generalized them.

This still leaves us with that tribal warrior disgusted with our modern-day weak ways. I think that a lot of what is going on with him is that he has developed particular strategies for fulfilling his own fundamental drives - being a successful warrior was the way you got what you wanted back in that day - and internalized them as a part of his aesthetic of what he finds beautiful and what he finds disgusting. But it also looks to me like this kind of learning is much more malleable than people generally expect. One's sense of aesthetics can be updated by propagating new facts into it, and strongly-held identities (such as "I am a technical person") can change in response to new kinds of strategies becoming viable, and generally many (I think most) deep-seated emotional patterns can at least in principle be updated. (Generally, I think of human values in terms of a two-level model, where the underlying "deep values" are relatively constant, with emotional responses, aesthetics, identities, and so forth being learned strategies for fulfilling those deep values. The strategies are at least in principle updatable, subject to genetic constraints such as the person's innate temperament that may be more hardcoded.)

I think the tribal warrior would be disgusted by our society because he would rightly recognize that our behavior patterns are ones that wouldn't bring glory in his society, and that his tribesmen would find shameful to be associated with; he would also recognize that trying to make it in our society would require him to unlearn a lot of things he was deeply invested in. But if he were capable of making the update that there were still ways for him to earn love, respect, power, and all the other deep values that his warfighting behavior had originally developed to get... then he might come to see our society as not that horrible after all.

I am confused by your AlphaGo argument because "winning states of the board" looks very different depending on what kinds of tactics your opponent uses, in a very similar way to how "surviving and reproducing" looks very different depending on what kinds of hazards are in the environment. 

I don't think the actual victory states look substantially different? They're all ones where AlphaGo has more territory than the other player, even if the details of how you get there are going to be different.

I predict that AlphaGo is actually not doing that much direct optimization in the sense of an abstract drive to win that it reasons about, but rather has a bunch of random drives piled up that cover various kinds of situations that happen in Go.

Yeah, I would expect this as well, but those random drives would still be systematically shaped in a consistent direction (that which brings you closer to a victory state).

Comment by Kaj_Sotala on My idea of sacredness, divinity, and religion · 2023-11-06T06:24:45.689Z · LW · GW

I think I agree with this; did you mean it as disagreement with something I said, or just as an observation?

Comment by Kaj_Sotala on Genetic fitness is a measure of selection strength, not the selection target · 2023-11-05T16:59:01.338Z · LW · GW

Thanks, edited:

I argued that there’s no single thing that evolution selects for; rather, the thing that it’s selecting is constantly changing.

Comment by Kaj_Sotala on Genetic fitness is a measure of selection strength, not the selection target · 2023-11-05T16:57:25.114Z · LW · GW

Does this comment help clarify the point?

Comment by Kaj_Sotala on Genetic fitness is a measure of selection strength, not the selection target · 2023-11-05T16:48:31.237Z · LW · GW

So I think the issue is that when we discuss what I'd call the "standard argument from evolution", you can read two slightly different claims into it. My original post was a bit muddled because I think those claims are often conflated, and before writing this reply I hadn't managed to explicitly distinguish them.

The weaker form of the argument, which I interpret your comment to be talking about, goes something like this:

  • The original evolutionary "intent" of various human behaviors/goals was to increase fitness, but in the modern day these behaviors/goals are executed even though their consequences (in terms of their impact on fitness) are very different. This tells us that the intent of the process that created a behavior/goal does not matter: once the behavior/goal has been created, it will just do what it does, even if the consequences deviate from its original purpose. Thus, even if we train an AI so that it carries out goal X in a particular context, we have no particular reason to expect that it would continue to automatically carry out the same (intended) goal if the context changes enough.

I agree with this form of the argument and have no objections to it. I don't think the points in my post are particularly relevant to that claim. (I've even discussed a form of inner optimization in humans that causes value drift, which I don't recall anyone else discussing in those terms before.)

However, I think that many formulations are actually implying, if not outright stating, a stronger claim:

  • In the case of evolution, humans were originally selected for IGF but are now doing things that are completely divorced from that objective. Thus, even if we train an AI so that it carries out goal X in a particular context, we have a strong reason to expect that its behavior would deviate so much from the goal as to become practically unrecognizable.

So the difference is something like the implied sharpness of the left turn. In the weak version, the claim is just that the behavior might go some unknown amount to the left. We should figure out how to deal with this, but we don't yet have much empirical data to estimate exactly how much it might be expected to go left. In the strong version, the claim is that the empirical record shows that the AI will by default swerve a catastrophic amount to the left.

(Possibly you don't feel that anyone is actually implying the stronger version. If so, and you would already disagree with the stronger version, then great! We are in agreement. I don't think it matters whether the implication "really is there" in some objective sense, or even whether the original authors intended it. I think the relevant thing is that I got that implication from the posts I read, and I expect that if I got it, some other people got it too. So this post is primarily aimed at the people who did read the strong version into those posts and thought it made sense.)

You wrote:

I agree that humans (to a first approximation) still have the goals/drives/desires we were selected for. I don't think I've heard anyone claim that humans suddenly have an art creating drive that suddenly appeared out of nowhere recently, nor have I heard any arguments about inner alignment that depend on an evolution analogy where this would need to be true. The argument is generally that the ancestral environment selected for some drives that in the ancestral environment reliably caused something that the ancestral environment selected for, but in the modern environment the same drives persist but their consequences in terms of [the amount of that which the ancestral environment was selecting for] now changes, potentially drastically. 

If we are talking about the weak version of the argument, then yes, I agree with everything here. But I think the strong version - where our current behavior is implied to be completely at odds with what we were originally selected for - has to implicitly assume that things like an art-creation drive are something novel.

Now, I don't think that anyone who endorses the strong version (if anyone does) would explicitly endorse the claim that our art-creation drive just appeared out of nowhere. But to me, the strong version becomes pretty hard to maintain if you take the stance that we are mostly still executing all of the behaviors that we used to, and it's just that their exact forms and relative weightings are somewhat out of distribution. (Yes, right now our behavior seems to lead to falling birthrates, with many populations below replacement rate, which you could argue is a bigger shift than being "somewhat out of distribution". But to me that intuitively feels less relevant than the fact that most individual humans still want to have children and are very explicitly optimizing for that, especially since the era of falling birthrates has only lasted a relatively short time and it's not clear how long it will continue.)

I think the strong version also requires one to hold that evolution does, in fact, consistently and predominantly optimize for a single coherent thing. Otherwise, it would mean that our current-day behaviors could be explained by "evolution doesn't consistently optimize for any single thing" just as well as they could be explained by "we've experienced a left turn from what evolution originally optimized for".

However, it is pretty analogous to RL, and especially multi agent RL, and overall I don't think of the inner misalignment argument as depending on stationarity of the environment in either direction. AlphaGo might early in training select for policies that do tactic X initially because it's a good tactic to use against dumb Go networks, and then once all the policies in the pool learn to defend against that tactic it is no longer rewarded. 

I agree that there are contexts where it would be analogous to that. But in that example, AlphaGo is still being rewarded for winning games of Go; it's just that the exact strategies it needs to use differ. That seems different from e.g. the bacteria example, where bacteria are selected for exactly opposite traits - either for producing a toxin and an antidote, or for not producing them. That seems to me more analogous to a situation where AlphaGo is initially rewarded for winning at Go, then once it starts consistently winning it starts getting rewarded for losing instead, and then once it starts consistently losing it starts getting rewarded for winning again.

And I don't think that kind of situation is even particularly rare - anything that consumes energy (be it a physical process such as producing venom or fur, or a behavior such as enjoying exercise) is subject to that kind of "either/or" choice.
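To make that "either/or" point concrete, here's a minimal toy simulation - entirely my own illustration, with made-up cost and benefit numbers, not anything from the original post. Carrying the toxin+antidote combo has a fixed energy cost and a benefit that shrinks as more of the population also carries it, so whether the very same trait is selected for or against depends on how common it already is, rather than on a single fixed target:

```python
import random

# Toy frequency-dependent selection (illustrative numbers only).
TOXIN_COST = 0.2
TOXIN_BENEFIT = 0.5

def fitness(has_toxin: bool, toxin_freq: float) -> float:
    # Non-carriers get baseline fitness; carriers pay the cost but gain a
    # benefit proportional to how many competitors lack the antidote.
    if not has_toxin:
        return 1.0
    return 1.0 - TOXIN_COST + TOXIN_BENEFIT * (1.0 - toxin_freq)

def simulate(start_freq: float, generations: int = 40, size: int = 2000) -> float:
    """Fitness-proportional reproduction; returns the final trait frequency."""
    population = [random.random() < start_freq for _ in range(size)]
    for _ in range(generations):
        freq = sum(population) / len(population)
        weights = [fitness(trait, freq) for trait in population]
        population = random.choices(population, weights=weights, k=size)
    return sum(population) / len(population)

# Starting from a rare trait, selection pushes its frequency up; starting from a
# near-universal trait, the same fitness function pushes it back down.
print("started rare:  ", simulate(start_freq=0.05))
print("started common:", simulate(start_freq=0.95))
```

Both runs drift toward the same intermediate frequency, but from opposite directions: the selection pressure on the trait points one way when it is rare and the other way when it is common, which is the sense in which the "reward" here keeps flipping, unlike AlphaGo's constant "win the game" criterion.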

Now you could say that "just like AlphaGo is still rewarded for winning games of Go and it's just the strategies that differ, the organism is still rewarded for reproducing and it's just the strategies that differ". But I think the difference is that for AlphaGo, the rewards are consistently shaping its "mind" towards having a particular optimization goal - one where the board is in a winning state for it. 

And one key premise on which the "standard argument from evolution" rests is that evolution has not consistently shaped the human mind in such a direct manner. It's not that we were created with "I want to have surviving offspring" as our only explicit cognitive goal, with all of the evolutionary training going into learning better strategies for getting there by explicit (or implicit) reasoning. Rather, we have been given various motivations that vary in how directly useful they are for that goal - from "I want to be in a state where I produce great art" (quite indirect) to "I want to have surviving offspring" (direct) - with the direct goal competing with all the indirect ones for priority. This is unlike AlphaGo, which does have the cognitive capacity for direct optimization toward its goal, with that goal being the sole reward criterion all along.
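As a crude sketch of that contrast (again entirely my own illustration - the particular drives and weights are made up for the example), the AlphaGo-style learner is scored by one criterion that never changes, while the human-style bundle scores outcomes through several proxy drives of varying directness, with the direct goal being just one term competing among many:

```python
from dataclasses import dataclass

@dataclass
class Outcome:
    won_board: bool = False   # reached a winning board state
    art_made: float = 0.0     # "I want to produce great art" (indirect proxy)
    status: float = 0.0       # social standing (indirect proxy)
    offspring: int = 0        # the direct goal, competing with the proxies

def alphago_style_reward(outcome: Outcome) -> float:
    # A single explicit criterion, applied consistently throughout training.
    return 1.0 if outcome.won_board else 0.0

def humanlike_drive_bundle(outcome: Outcome) -> float:
    # Several drives with arbitrary illustrative weights; the direct goal is only
    # one term among many, so in a new context behavior need not track it at all.
    return 0.4 * outcome.art_made + 0.4 * outcome.status + 0.2 * outcome.offspring

print(alphago_style_reward(Outcome(won_board=True)))              # 1.0
print(humanlike_drive_bundle(Outcome(art_made=1.0, status=0.5)))  # 0.6, with zero offspring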

This is also a bit hard to put a finger on, but I feel like there's some kind of implicit bait-and-switch happening with the strong version of the standard argument. It correctly points out that we have not had IGF as our sole explicit optimization goal because we didn't start by having enough intelligence for that to work. Then it suggests that because of this, AIs are likely to also be misaligned... even though, unlike with human evolution, we could just optimize them for one explicit goal from the beginning, so we should expect our AIs to be much more reliably aligned with that goal!