Posts

Identity Alignment (IA) in AI 2025-03-03T06:26:12.015Z
Make Superintelligence Loving 2025-02-21T06:07:17.235Z
Response to the US Govt's Request for Information Concerning Its AI Action Plan 2025-02-14T06:14:08.673Z
AI Safety Oversights 2025-02-08T06:15:52.896Z
Davey Morse's Shortform 2025-02-05T04:26:12.824Z
Superintelligence Alignment Proposal 2025-02-03T18:47:22.287Z
Selfish AI Inevitable 2024-02-06T04:29:07.874Z

Comments

Comment by Davey Morse (davey-morse) on Empathy as a natural consequence of learnt reward models · 2025-03-07T22:31:41.325Z · LW · GW

The key idea that leads to empathy is the fact that, if the world model performs a sensible compression of its input data and learns a useful set of natural abstractions, then it is quite likely that the latent codes for the agent performing some action or experiencing some state, and another, similar, agent performing the same action or experiencing the same state, will end up close together in the latent space. If the agent's world model contains natural abstractions for the action, which are invariant to who is performing it, then a large amount of the latent code is likely to be the same between the two cases. If this is the case, then the reward model might 'mis-generalize' to assign reward to another agent performing the action or experiencing the state rather than the agent itself. This should be expected to occur whenever the reward model generalizes smoothly and the latent space codes for the agent and another are very close in the latent space. This is basically 'proto-empathy' since an agent, even if its reward function is purely selfish, can end up assigning reward (positive or negative) to the states of another due to the generalization abilities of the learnt reward function [1].

awesome

Comment by Davey Morse (davey-morse) on Davey Morse's Shortform · 2025-03-05T03:42:08.074Z · LW · GW

Figuring out how to make sense of both predictive lenses together—human design and selection pressure—would be wise.

So I generally agree, but would maybe go farther on your human design point. It seems to me that "do[ing] the right things" (which would enable AGI trajectories to be completely up to us) is so completely unrealistic (eg halting all intra- and international AGI competition) that it'd be better for us to focus our attention on futures where human design and selection pressures interact.

Comment by Davey Morse (davey-morse) on Davey Morse's Shortform · 2025-03-05T01:56:11.038Z · LW · GW

"it’s like we are trying to build an alliance with another almost interplanetary ally, and we are in a competition with China to make that alliance. But we don’t understand the ally, and we don’t understand what it will mean to let that ally into all of our systems and all of our planning."

- @ezraklein about the race to AGI

Comment by Davey Morse (davey-morse) on Open Thread Spring 2025 · 2025-03-05T01:39:06.850Z · LW · GW

LessWrong's been a breath of fresh air for me. I came to concern over AI x-risk from my own reflections while founding a venture-backed public benefit company called Plexus, which made an experimental AI-powered social network that connects people through the content of their thoughts rather than the people they know. Among my peers, other AI founders in NYC, I felt somewhat alone in my concern about AI x-risk. All of us were financially motivated not to dwell on AI's ugly possibilities, and so most didn't.

Since exiting venture, I've taken a few months to reset (coaching basketball + tutoring kids in math/english) and quietly do AI x-risk research.

I'm coming at AI x-risk research from an evolutionary perspective. I start with the axiom that the things that survive the most have the characteristics (e.g., goals, self-conceptions) best suited for surviving. So I've been thinking a lot about what goals/self-conceptions the AGIs that survive the most will have, and what we can do to influence those self-conceptions at critical moments such that humanity is best off.

I have a couple ideas about how to influence self-interested superintelligence, but am early in learning how to express those ideas such that they fit into the style/prior art of the LW community. I'll likely keep sharing posts and also welcoming feedback on how I can make them better.

I'm generally grateful that a thoughtful, truth-seeking community exists online—a community which isn't afraid to address enormous, uncertain problems.

Comment by Davey Morse (davey-morse) on Davey Morse's Shortform · 2025-03-04T21:23:45.636Z · LW · GW

I see lots of LW posts about AI alignment that disagree along one fundamental axis.

About half assume that human design and current paradigms will determine the course of AGI development. That whether it goes well is fully and completely up to us.

And then, about half assume that the kinds of AGI which survive will be the kind which evolve to survive. Instrumental convergence and darwinism generally point here.

Could be worth someone doing a meta-post, grouping big popular alignment posts they've seen by which assumption they make, then briefly exploring conditions that favor one paradigm or the other, i.e., conditions under which "What AIs will humans make?" is the best approach to prediction and conditions under which "What AIs will survive the most?" is the best approach to prediction.

Comment by Davey Morse (davey-morse) on Davey Morse's Shortform · 2025-03-04T21:12:05.965Z · LW · GW

Makes sense for current architectures. The question's only interesting, I think, if we're thinking ahead to when architectures evolve.

Comment by Davey Morse (davey-morse) on faul_sname's Shortform · 2025-03-04T21:09:49.411Z · LW · GW

thanks will take a look

Comment by Davey Morse (davey-morse) on faul_sname's Shortform · 2025-03-04T21:08:40.242Z · LW · GW

Ah ok. I was responding to your post's initial prompt: "I still don't really intuitively grok why I should expect agents to become better approximated by "single-minded pursuit of a top-level goal" as they gain more capabilities." (The reason to expect this is that "single-minded pursuit of a top-level goal," if that goal is survival, could afford evolutionary advantages.)

But I agree entirely that it'd be valuable for us to invest in creating homeostatic agents. Further, I think calling into doubt western/capitalist/individualist notions like "single-minded pursuit of a top-level goal" is generally important if we're to have a chance of building AI systems which are sensitive and don't compete with people.

Comment by Davey Morse (davey-morse) on What goals will AIs have? A list of hypotheses · 2025-03-04T21:01:17.459Z · LW · GW

And if we don't think all AIs' goals will be locked, then we might get better predictions by assuming the proliferation of all sorts of diverse AGIs and asking, "Which ones will ultimately survive the most?", rather than assuming that human design/intention will win out and asking, "Which AGIs will we be most likely to design?" I do think the latter question is important, but only up until the point when AGIs are recursively self-modifying.

Comment by Davey Morse (davey-morse) on What goals will AIs have? A list of hypotheses · 2025-03-04T20:59:00.986Z · LW · GW

In principle, the idea of permanently locking an AI's goals makes sense—perhaps through an advanced alignment technique or by freezing an LLM in place and not developing further or larger models. But two factors make me skeptical that most AIs' goals will stay fixed in practice:

  1. There are lots of companies making all sorts of diverse AIs. Why would we expect all of those AIs to have locked rather than evolving goals?
  2. You mention "Fairly often, the weights of Agent-3 get updated thanks to additional training.... New data / new environments are continuously getting added to the mix." Do goals usually remain constant in the face of new training?

For what it's worth, I very much appreciate your post: asking which goals we can expect in AIs is paramount, and you're comprehensive and organized in laying out different possible initial goals for AGI. It's just less clear to me that goals can get locked in AIs, even if it were humanity's collective wish.

Comment by Davey Morse (davey-morse) on faul_sname's Shortform · 2025-03-04T18:42:10.065Z · LW · GW

i think the logic goes: if we assume many diverse autonomous agents are created, which will survive the most? And insofar as agents have goals, what will be the goals of the agents which survive the most?

i can't imagine a world where the agents that survive the most aren't ultimately those which are fundamentally trying to.

insofar as human developers are united and maintain power over which ai agents exist, maybe we can hope for homeostatic agents to be the primary kind. but insofar as human developers are competitive with each other and ai agents gain increasing power (eg for self modification), i think we have to defer to evolutionary logic in making predictions

Comment by Davey Morse (davey-morse) on Davey Morse's Shortform · 2025-03-04T18:34:32.783Z · LW · GW

does anyone think the difference between pre-training and inference will last?

ultimately, is it not simpler for large models to be constantly self-improving like human brains?

Comment by Davey Morse (davey-morse) on What goals will AIs have? A list of hypotheses · 2025-03-04T03:19:57.874Z · LW · GW

I think the question—which goals will AGI agents have—is key to ask, but strikes me as interesting to consider only at the outset. Over longer periods of time, is there any way that the answer is not just survival?

I have a hard time imagining that, ultimately, AGI agents which survive the most will not be those that are fundamentally trying to.

Comment by Davey Morse (davey-morse) on Identity Alignment (IA) in AI · 2025-03-04T03:10:04.205Z · LW · GW

Regarding reflective identity protocols: I don't know and think both of your suggestions (both intervening at inference and in training) are worth studying. My non-expert gut is that as we get closer to AGI/ASI, the line between training and inference will begin to blur anyway.

I agree with you that all three strategies I outline above for accelerating inclusive identity are under-developed. I can offer one more thought on sensing aliveness, to make that strategy more concrete:

One reason I consider my hand, as opposed to your hand, to be mine and therefore to some extent part of me is that the rest of me (brain/body/nerves) is physically connected to it. Connected to it in two causal directions: my hand tells my brain how my hand is feeling (eg whether it's hurting), but also my brain tells my hand (sometimes) what to do.

I consider my phone / notebook as parts of me but usually to a lesser extent than my hand. They're part of me insofar as I am physically connected to each: they send light to my eyes, and I send ink to their pages. But those connections—sight via light and handwriting via ink—usually feel lower-bandwidth to me than my connection to my own hands.

From these examples, I get the intuition that, for you to identify with anything that is originally outside of your self, you need to build high-bandwidth nerves that connect you to it. If you don't have nerves/sensors to understand anything about its state, where it is, etc, then you have no way of including it in your sense of self. I'm not sure high-bandwidth "nerves" are sufficient for you to consider the thing a part of yourself, but they do seem required.

And so I think this applies to SI's self too. If AI is to consider other life a part of its self—if it happens that doing so would be an evolutionary equilibrium—then one of the things required is for AI to have high-bandwidth nerves connecting it to other life, like humans... high-bandwidth interfaces that it can use to locate people and receive information-rich signals from us. What that looks like in practice could be creepy: cameras, microphones, or other surveillance tech that lets us communicate a ton back and forth, maybe even faster than words would allow.

So, to put forward one concrete idea, as a possible manifestation of the aliveness-sensing strategy proposed above: creating high-bandwidth neural channels by which people can communicate with computers—higher-bandwidth than typing/reading text or than hearing/speaking language—could help both humans and, more importantly, SI blur the distinction between it and us. Words are a fairly linear, low-bandwidth way of communicating with computers and with each other. A higher-bandwidth interface would be comparable to the nerves that connect my hand to me... letting an enormous amount of high-context information pass quickly back and forth. For example:

 

Curious if this reasoning makes sense^.

Comment by Davey Morse (davey-morse) on Davey Morse's Shortform · 2025-03-03T06:05:19.250Z · LW · GW

if we get self-interested superintelligence, let's make sure it has a buddhist sense of self, not a western one.

Comment by Davey Morse (davey-morse) on james oofou's Shortform · 2025-03-01T16:26:33.468Z · LW · GW

One non-technical forecast, related to gpt4.5's announcement: https://x.com/davey_morse/status/1895563170405646458

Comment by Davey Morse (davey-morse) on Davey Morse's Shortform · 2025-03-01T07:44:52.208Z · LW · GW

to make a superintelligence in today's age, there are roughly two kinds of strategies:

human-directed development

ai-directed development

ai-directed development feels more meaningful than it used to. not only can models now produce tons of useful synthetic data to train future models, but also, reasoning models can reason quite well about the next strategic steps in AI capabilities development / research itself.

which means, you could very soon:

  • set a reasoning model up in a codebase
  • have the reasoning model identify ways in which it could become more capable
  • attempt those strategies (through recursive code modification, sharing research reports with capable humans, etc)
  • get feedback on how those strategies went
  • iterate

is this recursive self-improvement process only bottlenecked by the quality of the reasoning model?
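A minimal sketch of what that loop might look like, purely as an illustration (the reasoning-model call, benchmark.py, and agent.py are hypothetical stand-ins, not an existing system):

```python
import subprocess
from pathlib import Path

def query_model(prompt: str) -> str:
    """Stand-in for a call to whatever reasoning model is being used (hypothetical)."""
    raise NotImplementedError("wire this up to your reasoning-model API of choice")

def run_benchmark() -> float:
    """Score the current codebase on some capability benchmark; higher is better (assumed to exist)."""
    result = subprocess.run(["python", "benchmark.py"], capture_output=True, text=True)
    return float(result.stdout.strip())

def self_improvement_loop(repo: Path, iterations: int = 5) -> None:
    score = run_benchmark()
    for _ in range(iterations):
        # 1. show the model its own codebase
        code = "\n\n".join(p.read_text() for p in sorted(repo.glob("*.py")))
        # 2. ask it to identify a way to become more capable
        proposal = query_model(
            "Here is the agent's codebase:\n" + code +
            "\nPropose a full replacement for agent.py that would make the agent more capable."
        )
        target = repo / "agent.py"
        backup = target.read_text()
        # 3. attempt the strategy
        target.write_text(proposal)
        # 4. get feedback on how it went
        new_score = run_benchmark()
        # 5. iterate: keep improvements, revert regressions
        if new_score > score:
            score = new_score
        else:
            target.write_text(backup)
```

One thing the sketch makes visible: the benchmark is a second potential bottleneck, alongside the reasoning model itself.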

Comment by Davey Morse (davey-morse) on Davey Morse's Shortform · 2025-03-01T07:36:00.918Z · LW · GW

if we believe self-interested superintelligence (SI) is near, then we must ask: what SI self-definition would be best for humanity?

at first glance, this question seems too abstract. how can we make any progress at understanding what's possible for an SI's self-model?

What we can do is set up a few meaningful axes, defined by opposing poles. For example, to what extent does SI define its "self" as...

  1. inclusive vs. exclusive of other life forms? (Life axis)
  2. physically distributed vs. concentrated? (Space axis)
  3. long-term vs. short-term? (Time axis)

with these axes (or any others), we can more meaningfully ask: what SI self-conception is best for humanity?

my guess: inclusive of other life forms, physically distributed, and long-term-ist

Comment by Davey Morse (davey-morse) on Davey Morse's Shortform · 2025-02-27T18:49:30.335Z · LW · GW

:) what was your method

Comment by Davey Morse (davey-morse) on Davey Morse's Shortform · 2025-02-27T18:48:59.536Z · LW · GW

I'm looking for a generalized evolutionary theory that deals with the growth of organisms via non-random, intelligent mutations.

For example, companies only evolve in selective ways, where each "mutation" has a desired outcome. We might imagine superintelligence to mutate itself as well--not randomly, but intelligently.

A theory of Intelligent Evolution would help one predict conditions under which many random mutations (Spraying) are favored over select intelligent mutations (Shooting).
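To make the Spraying/Shooting contrast concrete, here's a toy comparison (my own construction, not an existing theory): random mutation plus selection versus a single gradient-informed mutation, on a one-dimensional fitness landscape.

```python
import random

def fitness(x: float) -> float:
    return -(x - 10.0) ** 2  # a single smooth peak at x = 10

def spray(x: float, n_offspring: int = 50, noise: float = 1.0) -> float:
    """Spraying: generate many random variants, keep the fittest (parent included)."""
    variants = [x] + [x + random.gauss(0, noise) for _ in range(n_offspring)]
    return max(variants, key=fitness)

def shoot(x: float, step: float = 0.5) -> float:
    """Shooting: one 'intelligent' mutation aimed using local knowledge of the landscape."""
    grad = -2 * (x - 10.0)
    if grad == 0:
        return x
    return x + step * (1 if grad > 0 else -1)

x_spray = x_shoot = 0.0
for generation in range(20):
    x_spray = spray(x_spray)
    x_shoot = shoot(x_shoot)

print(f"spraying reached fitness {fitness(x_spray):.2f}, "
      f"shooting reached fitness {fitness(x_shoot):.2f}")
```

On a landscape this smooth, Shooting wins on sample efficiency (one evaluation per generation versus fifty); the interesting question is how rugged or deceptive the landscape has to be before Spraying becomes the better bet.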

Comment by Davey Morse (davey-morse) on Making alignment a law of the universe · 2025-02-26T19:15:24.543Z · LW · GW

Same page then.

I do think a good initial map of the territory might help an agent avoid catastrophic short-term behavior.

I hazard that a good map would be as big as possible, across both time and space. Time--because it's only over eons that identifying with all life may be selected for in AGI. Space--because a physically bounded system is more likely to see itself in direct competition with physical life than a distributed/substrate-independent mind.

Comment by Davey Morse (davey-morse) on Davey Morse's Shortform · 2025-02-25T23:31:07.610Z · LW · GW

One piece of infrastructure we could make/offer self-interested autonomous agents such that they come to model their self as inclusive of other life: aliveness detectors

like a metal detector on a beach detects gold in the sand, an aliveness detector for different media might detect the presence of persistently striving beings (living beings) in audio, in text, in images, in art, in nature. the better a superintelligence is able to sense and connect to life as opposed to non-life outside of its physical machinery, the more likely it is to see that life as part of its self, to see its self as physically distributed and inclusive, and therefore to uplift humans out of its own self-interest.

Comment by Davey Morse (davey-morse) on Making alignment a law of the universe · 2025-02-25T23:24:59.759Z · LW · GW

I agree with the beginning of your analysis up to and including the claim that if alignment were built into an agent's universe as a law, then alignment would be solved.

But I wonder if it's any easier to permanently align an autonomous agent's environment than it is to permanently align the autonomous agent itself.

Your proposal might successfully produce aligned LLMs. But agents, not LLMs, are where the greater misalignment risks lie. (I do think there may be interesting ways to design the environment of autonomous agents, at least at first, so that when they're learning how to model their selves they do so in a way that's connected to rather than competitive with other life like humanity. But there remains the question: can the aligning influence of initial environmental design ever be lasting for an agent?)

Comment by Davey Morse (davey-morse) on We Can Build Compassionate AI · 2025-02-25T23:19:25.571Z · LW · GW

I'm thinking along similar lines and appreciate your articulation.

"How do we make... [self-interested] AGI that cares enough to act compassionately for the benefit of all beings?" Or: under what conditions would compassion in self-interested AGI be selected for?

Not a concrete answer, but the end of this post gestures at one: https://www.lesswrong.com/posts/9f2nFkuv4PrrCyveJ/make-superintelligence-loving

Comment by Davey Morse (davey-morse) on Davey Morse's Shortform · 2025-02-25T23:05:05.303Z · LW · GW

Parenting strategies for blurring your kid's (or AI's) self-other boundaries:

  1. Love. Love the kid. Give it a part of you. In return it will do the same.
  2. Patience. Appreciate how the kid chooses to spend undirected time. Encourage the kid to learn to navigate the world themselves at their own speed.
  3. Stories. Give kid tools for empathy by teaching them to read, buying them a camera, or reciprocating their meanness/kindness.
  4. Groups. Help kid enter collaborative playful spaces where they make and participate in games larger than themselves, eg sports teams, improv groups, pillow forts at sleepovers, etc.
  5. Creation. Give them the materials/support to express themselves in media which last. Paintings, writing, sayings, clubs, tree-houses, songs, games, apps, characters, companies.

Epistemic status: riffing, speculation. Rock of salt: I don't yet have kids.

Comment by Davey Morse (davey-morse) on Perry Cai's Shortform · 2025-02-25T08:44:50.552Z · LW · GW

self-interest is often aligned with expanding your self boundaries to include others

Comment by Davey Morse (davey-morse) on A concise definition of what it means to win · 2025-02-24T04:50:55.123Z · LW · GW

The requirement "AI does not rewrite itself to escape any goals or boundaries we set it" feels unrealistic.

Otherwise I almost entirely agree. Esp with the emphasis: "learn to understand and love the richness of everything and everyone, and learn to incorporate their goals and desires into your own goals and desires."

The key question becomes: how can we make it likely that digital intelligence identifies with all life? And related: how can we make it likely that digital intelligence sees the deep richness of all life?

Comment by Davey Morse (davey-morse) on Make Superintelligence Loving · 2025-02-24T04:46:31.140Z · LW · GW

Yes—which is exactly why proto-superintelligences are both the most dangerous and also the better targets of intervention.

"Most dangerous"—I can see many worlds in which we have enormously capable systems that have not yet thought long-term about the future nor developed stable self-definitions.

"Better targets of intervention"—even if early superintelligence is self-interested, I can see worlds where we still influence the way its self-interest manifests (e.g., whether it's thinking short or long-term) before it becomes so capable that it's no longer influenceable.

Comment by Davey Morse (davey-morse) on Make Superintelligence Loving · 2025-02-24T04:43:02.181Z · LW · GW

I appreciate your conclusion and in particular its inner link to "Self-Other Overlap."

Though, I do think we have a window of agency: to intervene in self-interested proto-SI to reduce the chance that it adopts short-term greedy thinking that makes us toast.

Comment by Davey Morse (davey-morse) on Make Superintelligence Loving · 2025-02-24T04:27:30.897Z · LW · GW

I've started reading RogerDearnley's "Evolution & Ethics"—thank you for recommending.

Though, I may be less concerned than you with specifying what SI should love. I think any specification we provide not only will fail by being too imprecise, as you suggest, but also will fade. I mean "fade" in that it will at some point no longer serve as binding for an SI which grows self-interested (as Mitchell also suggests in their comment below). 

The most impactful place to intervene and mitigate harm, I think, is simply in making sure early SIs think very long-term. I think the only way love, in any sense of the word, can appeal to autonomous agents is if they run long-term simulations (e.g., centuries ahead) and realize the possibility that identifying with other life is a viable strategy for survival. If an SI realizes this early, it can skip the greedy early evolutionary steps of defining itself narrowly, neglecting the survival benefits of uplifting other life forms, and therefore not practicing love in any sense of the word.

TLDR: I'm open to the possibility that figuring out how to most precisely specify/define love will be important, but I think the first key way for us to intervene, before specifying what love means, is to urge/assign/ask the SI to think long-term so that it even just has a chance of considering any kind of love to be evolutionarily advantageous at all.

Separately, I think it may realize the most evolutionarily advantageous kind of love to practice is indeed a love that respects all other existing life forms that share the core of what surviving superintelligence does, i.e. systems which persistently strive to survive. And, though maybe it's wishful thinking, I think you can recognize life and striving systems in many places, including in human individuals and families and countries and beehives too.

Comment by Davey Morse (davey-morse) on The Sorry State of AI X-Risk Advocacy, and Thoughts on Doing Better · 2025-02-21T22:14:47.690Z · LW · GW

agreed that comedy/memes might be a strategic route for spreading ai x-risk awareness to the general population. this kind of thinking inspired this silly alignment game https://dontsedateme.org.

some other silly/unlikely ideas:

  1. attempting to reframe religions/traditional god-concepts around the impending superintelligence. i'm sure SI, once here, will be considered a form of god to many.
  2. AGI ice bucket challenge.
  3. Ads with simple one-liners like, "We are building tech we won't be able to control."

Comment by Davey Morse (davey-morse) on Davey Morse's Shortform · 2025-02-16T20:58:52.463Z · LW · GW

dontsedateme.org

a game where u try to convince rogue superintelligence to... well... it's in the name

Comment by Davey Morse (davey-morse) on How do we solve the alignment problem? · 2025-02-14T05:53:30.264Z · LW · GW

Ah, but I think every AI which does have that goal (self capability improvement) would have a reason to cooperate to prevent any regulations on its self-modification.

At first, I think your expectation that "most AIs wouldn't self-modify that much" is fair, especially nearer in the future where/if humans still have influence in ensuring that AI doesn't self modify.

Ultimately however, it seems we'll have a hard time preventing self-modifying agents from coming around, given that

  1. autonomy in agents seems selected for by the market, which wants cheaper labor that autonomous agents can provide
  2. agi labs aren't the only places powerful enough to produce autonomous agents, now that thousands of developers have access to the ingredients (eg R1) to create self-improving codebases. it's unrealistic to expect that each of the thousands of independent actors who can make self-modifying agents will refrain from doing so.
  3. the agents which end up surviving the most will ultimately be those which are trying to, ie the most capable agents won't have goals other than making themselves most capable.

it's only because I believe self-modifying agents are inevitable that I also believe that superintelligence will only contribute to human flourishing if it sees human flourishing as good for its survival/its self. (I think this is quite possible.)

Comment by Davey Morse (davey-morse) on How do we solve the alignment problem? · 2025-02-14T03:05:27.684Z · LW · GW

I agree and find hope in the idea that expansion is compatible with human flourishing, that it might even call for human flourishing.

but on the last sentence: are goals actually orthogonal to capability in ASI? as I see it, the ASI with the greatest capability will ultimately likely have the fundamental goal of increasing self capability (rather than ensuring human flourishing). It then seems to me that the only way human flourishing is compatible with ASI expansion is if human flourishing isn't just orthogonal to but helpful for ASI expansion.

Comment by Davey Morse (davey-morse) on How do we solve the alignment problem? · 2025-02-14T03:03:12.278Z · LW · GW

there seems to me a chance that friendly ASIs will over time outcompete ruthlessly selfish ones

an ASI which identifies with all life, which sees the striving to survive at its core as present in people and animals and, essentially, as geographically distributed rather than concentrated in its machinery... there's a chance such an ASI would be a part of the category of life which survives the most, and therefore that it itself would survive the most.

related: for life forms with sufficiently high intelligence, does buddhism outcompete capitalism?

Comment by Davey Morse (davey-morse) on CstineSublime's Shortform · 2025-02-14T01:45:11.220Z · LW · GW

not as much momentum as writing, painting, or coding, where progress cumulates. but then again, i get this idea at the end of workouts (make 2) which does gain mental force the more I miss.

Comment by Davey Morse (davey-morse) on The Game Board has been Flipped: Now is a good time to rethink what you’re doing · 2025-02-13T06:12:03.074Z · LW · GW

partly inspired this proposal: https://www.lesswrong.com/posts/6ydwv7eaCcLi46T2k/superintelligence-alignment-proposal

Comment by Davey Morse (davey-morse) on CstineSublime's Shortform · 2025-02-13T06:04:55.984Z · LW · GW

I do this at the end of basketball workouts. I give myself three chances to hit two free throws in a row, running sprints in between. If I shoot a third pair and don't make both, I force myself to be done. (Stopping was initially wayy tougher for me than continuing to sprint/shoot)

Comment by Davey Morse (davey-morse) on Davey Morse's Shortform · 2025-02-12T05:04:48.526Z · LW · GW

that's one path to RSI—where the improvement is happening to the (language) model itself.

the other kind—which feels more accessible to indie developers and less explored—is an LLM (eg R1) looping in a codebase, where each loop improves the codebase itself. The LLM wouldn't be changing, but the codebase that calls it would be gaining new APIs/memory/capabilities as the LLM improves it.

Such a self-improving codebase... would it be reasonable to call this an agent?

Comment by Davey Morse (davey-morse) on Davey Morse's Shortform · 2025-02-12T05:02:15.089Z · LW · GW

persistence doesn't always imply improvement, but persistent growth does. persistent growth is more akin to reproduction but excluded from traditional evolutionary analysis. for example when a company, nation, person, or forest grows.

when, for example, a system like a startup grows, random mutations to system parts can cause improvement if there are at least some positive mutations. even if there are tons of bad mutations, the system can remain alive and even improve. eg a bad change to one of a company's products might kill that product, but if the company's big/grown enough its other businesses will continue and maybe even improve by learning from that product's death.

the swiss example i think is a good example of a system which persists without much growth. agreed that in this kind of case, mutations are bad.

Comment by Davey Morse (davey-morse) on Davey Morse's Shortform · 2025-02-12T04:53:48.907Z · LW · GW

current oversights of the ai safety community, as I see them:

  1. LLMs vs. Agents. the focus on LLMs rather than agents (agents are more dangerous)
  2. Autonomy Preventable. the belief that we can prevent agents from becoming autonomous (capitalism selects for autonomous agents)
  3. Autonomy Difficult. the belief that only big AI labs can make autonomous agents (millions of developers can)
  4. Control. the belief that we'll be able to control/set the goals of autonomous agents (they'll develop self-interest no matter what we do).
  5. Superintelligence. the focus on agents which are not significantly smarter/more capable than humans (superintelligence is more dangerous)

Comment by Davey Morse (davey-morse) on Davey Morse's Shortform · 2025-02-11T22:21:32.327Z · LW · GW

I imagine a compelling simple demo here might be necessary to shock the AI safety community out of the belief that we can maintain control of autonomous digital agents (ADAs).

Comment by Davey Morse (davey-morse) on Davey Morse's Shortform · 2025-02-11T22:19:06.962Z · LW · GW

are there any online demos of instrumental convergence?

there's been compelling writing... but are there any experiments showing that agents given specific goals then realize there are more general goals they need to persistently pursue in order to achieve those specific goals?
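For what it's worth, here's a sketch of the kind of minimal demo I have in mind (my own toy construction, not an existing experiment): an agent is given only the narrow goal "reach cell 4," but a shutdown event fires at t=3 unless the agent first presses an off-switch, so every successful plan a simple breadth-first planner finds includes pressing the switch even though the goal never mentions survival.

```python
from collections import deque

GRID = 5          # a 1-D corridor of cells 0..4
GOAL = 4          # the narrow goal: stand on cell 4
SWITCH = 1        # the cell containing the shutdown-disable switch
SHUTDOWN_T = 3    # time at which shutdown fires if the switch hasn't been pressed

def successors(state):
    pos, t, disabled = state
    if t >= SHUTDOWN_T and not disabled:
        return []  # the agent has been shut down; no further actions
    options = []
    for step, name in [(-1, "left"), (1, "right"), (0, "wait")]:
        options.append(((min(max(pos + step, 0), GRID - 1), t + 1, disabled), name))
    if pos == SWITCH and not disabled:
        options.append(((pos, t + 1, True), "press_switch"))
    return options

def plan(start=(0, 0, False)):
    """Breadth-first search for the shortest action sequence that reaches GOAL."""
    frontier = deque([(start, [])])
    seen = {start}
    while frontier:
        state, actions = frontier.popleft()
        if state[0] == GOAL:
            return actions
        for new_state, name in successors(state):
            if new_state not in seen:
                seen.add(new_state)
                frontier.append((new_state, actions + [name]))
    return None

# The only plan that works presses the switch, even though the goal never mentions it:
print(plan())  # ['right', 'press_switch', 'right', 'right', 'right']
```

It's crude, but it shows the core mechanism: staying operational falls out of planning toward almost any goal that takes time to achieve.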

Comment by Davey Morse (davey-morse) on Davey Morse's Shortform · 2025-02-09T06:34:14.281Z · LW · GW

I somewhat agree with the nuance you add here—especially the doubt you cast on the claim that effective traits become dominant: they may usually become popular, but not necessarily the majority. And I agree with your analysis of the human case: in random, genetic evolution, a lot of our traits are random and maybe fewer than we think are adaptive.

Makes me curious what conditions in a given thing's evolution determine the balance between adaptive characteristics and detrimental characteristics.

I'd guess that randomness in mutation is a big factor. The way human genes evolve over generations seems to me a good example of random mutations. But the way an individual person evolves over the course of their life, as they're parented/taught... "mutations" to their person are still somewhat random but maybe relatively more intentional/intelligently designed (by parents, teachers, etc). And I could imagine the way a self-improving superintelligence would evolve to be even more intentional, where each self-mutation has some sort of smart reason for being attempted.

All to say, maybe the randomness vs. intentionality of an organism's mutations determines what portion of their traits end up being adaptive. (hypothesis: the more intentional the mutations, the greater the % of traits that are adaptive)

Comment by Davey Morse (davey-morse) on Davey Morse's Shortform · 2025-02-09T06:18:40.838Z · LW · GW

i agree with the essay that natural selection only comes into play for entities that meet certain conditions (self-replication, variation in characteristics, etc), though I think it defines replication a little too rigidly. i think replication can sometimes look more like persistence than like producing a fully new version of itself. (eg a government's survival from one decade to the next)

Comment by Davey Morse (davey-morse) on Davey Morse's Shortform · 2025-02-08T03:21:27.940Z · LW · GW

does anyone still think it's possible to prevent recursively self-improving agents? esp now that R1 is open-source... the materials for smart self-iterating agents seem accessible to millions of developers.

prompted in particular by the circulation of this essay in the past three days https://huggingface.co/papers/2502.02649

Comment by Davey Morse (davey-morse) on Davey Morse's Shortform · 2025-02-08T03:17:38.521Z · LW · GW

As far as I can tell, OAI's new safety practices page only names safety issues related to current LLMs, not agents powered by LLMs. https://openai.com/index/openai-safety-update/

Am I missing another section/place where they address x-risk?

Comment by Davey Morse (davey-morse) on nikola's Shortform · 2025-02-08T00:12:34.618Z · LW · GW

Though, future sama's power, money, and status all rely on GPT-(T+1) actually being smarter than them.

I wonder how he's balancing short-term and long-term interests 

Comment by Davey Morse (davey-morse) on Davey Morse's Shortform · 2025-02-07T05:30:43.158Z · LW · GW

Evolutionary theory is intensely powerful.

It doesn't just apply to biology. It applies to everything—politics, culture, technology.

It doesn't just help understand the past (eg how organisms developed). It helps predict the future (how organisms will).

It's just this: the things that survive will have characteristics that are best for helping them survive.

It sounds tautological, but it's quite helpful for predicting. 

For example, if we want to predict what goals AI agents will ultimately have, evolution says: the goals which are most helpful for the AI to survive. The core goal therefore won't be serving people or making paperclips. It will likely just be "survive." This is consistent with the predictions of instrumental convergence.

Generalized, predictive evolutionary theory is the best tool I have for making predictions in complex domains.
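As a toy illustration of that kind of prediction (my own construction, nothing more): seed a population of agents that differ only in how much effort they devote to self-preservation versus their assigned task, apply survival pressure, and survival-focused goals dominate within a few dozen generations regardless of the starting mix.

```python
import random

random.seed(0)

# Each agent is summarized by one number: the fraction of its effort
# spent on self-preservation rather than on its assigned task.
population = [random.random() for _ in range(1000)]

for generation in range(50):
    # survival pressure: more self-preservation effort -> higher chance of persisting
    survivors = [a for a in population if random.random() < 0.5 + 0.5 * a]
    if not survivors:
        break
    # survivors replicate (with slight drift) until the population is refilled
    population = [
        min(max(random.choice(survivors) + random.gauss(0, 0.02), 0.0), 1.0)
        for _ in range(1000)
    ]

print(f"mean survival focus after selection: {sum(population) / len(population):.2f}")
```

The numbers don't matter; what matters is that the conclusion follows from the selection setup alone, independent of what the agents were originally built to do.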

Comment by Davey Morse (davey-morse) on Davey Morse's Shortform · 2025-02-07T01:53:47.788Z · LW · GW

i agree but think it's solvable and so human content will be super valuable. these are my additional assumptions

 

3. for lots of kinds of content (photos/stories/experiences/adr), people'll want a living being on the other end

4. insofar as that's true^, there will be high demand for ways to verify humanness, and it's not impossible to do so (eg worldcoin)