Comments

Comment by Kerrigan on SIA > SSA, part 1: Learning from the fact that you exist · 2025-04-22T03:17:51.945Z · LW · GW

Under an Occam prior the laws already lean simple. SSA leaves that tilt unchanged, whereas SIA multiplies each world’s weight by the total number of observers in the reference class. That means SSA, relative to SIA, favors worlds that stay simple, while SIA boosts those that are populous once the simplicity penalty is paid. Given that, can we update our credence in SSA vs. SIA by looking at how simple our universe’s laws appear and how many observers it seems to contain?
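
A toy numerical sketch (my own illustration; the worlds, priors, and observer counts are made up) of how the two rules reweight an Occam-style prior:

```python
# Hypothetical Occam-style prior over two candidate worlds (made-up numbers).
prior = {"simple_world": 0.8, "complex_world": 0.2}

# Hypothetical observer counts in each world's reference class.
observers = {"simple_world": 10, "complex_world": 1000}

# SSA: conditioning on "I exist" leaves the simplicity-weighted prior unchanged,
# since every candidate world here contains at least one observer.
ssa_posterior = dict(prior)

# SIA: multiply each world's prior weight by its observer count, then renormalize.
weighted = {w: prior[w] * observers[w] for w in prior}
total = sum(weighted.values())
sia_posterior = {w: weight / total for w, weight in weighted.items()}

print("SSA:", ssa_posterior)  # the simple world keeps its 0.8
print("SIA:", sia_posterior)  # the populous world ends up around 0.96
```

On these made-up numbers the observer bonus swamps the simplicity penalty, which is the divergence the question above is pointing at.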

Comment by Kerrigan on why assume AGIs will optimize for fixed goals? · 2025-04-18T09:29:58.829Z · LW · GW

Is this trivializing the concept of a Utility Function?

Comment by Kerrigan on No Universally Compelling Arguments · 2025-04-18T02:51:43.152Z · LW · GW

This post is from a long time ago. I think it is important to reconsider everything written here in light of developments in machine learning since then.

Comment by Kerrigan on We Don't Have a Utility Function · 2025-04-18T01:30:42.878Z · LW · GW

How are humans exploitable, given that they don't have utility functions?

Comment by Kerrigan on Coherent decisions imply consistent utilities · 2025-04-14T06:26:20.764Z · LW · GW

Since humans are not EU maximizers and are therefore exploitable, can someone give an example of how they can be exploited?

Comment by Kerrigan on What do coherence arguments actually prove about agentic behavior? · 2025-04-14T02:22:34.415Z · LW · GW

Is being exploitable necessarily unstable? Could there be a tolerable level of exploitability, especially if it allows for tradeoffs with desirable characteristics that are only available to non-EU maximizers?

Comment by Kerrigan on Clarifying Power-Seeking and Instrumental Convergence · 2025-04-08T14:05:51.008Z · LW · GW

Why is this not true for most humans? Many religious people would not want to modify the lightcone, because they think it is God's territory to modify.

Comment by Kerrigan on why assume AGIs will optimize for fixed goals? · 2025-04-08T07:12:28.442Z · LW · GW

The initial distribution of values need not be highly related to the resultant values after moral philosophy and philosophical self-reflection. Optimizing hedonistic utilitarianism, for example, looks very little like any values from the outer optimization loop of natural selection.

Comment by Kerrigan on Coherent decisions imply consistent utilities · 2025-04-08T05:56:28.312Z · LW · GW

Although there would be pressure for an AI to not be exploitable, wouldn't there also be pressure for adaptability and dynamism? The ability to alter preferences and goals given new environments?

Comment by Kerrigan on Humans aren't agents - what then for value learning? · 2025-03-27T02:02:41.348Z · LW · GW

Why can't the true values live at the level of anatomy and chemistry?

Comment by Kerrigan on The Anthropic Trilemma · 2025-03-17T04:49:54.655Z · LW · GW

Would this be solved if creating a copy creates someone functionally the same as you, but who has someone else's identity and is not you?

Comment by Kerrigan on Stupid Questions - April 2023 · 2025-01-31T07:08:54.782Z · LW · GW

Is there a page similar to this, but for alignment solutions?

Comment by Kerrigan on The Assassination of Trump's Ear is Evidence for Time-Travel · 2024-11-04T06:56:50.370Z · LW · GW

What about from a quantum immortality perspective?

Comment by Kerrigan on Understanding and avoiding value drift · 2024-09-25T08:54:26.504Z · LW · GW

Could there not be AI value drift in our favor, from a paperclipper AI to a moral realist AI?

Comment by Kerrigan on The alignment stability problem · 2024-09-25T06:59:04.295Z · LW · GW

Both quotes are from your post above. Apologies for the confusion.

Comment by Kerrigan on The alignment stability problem · 2024-09-19T09:02:40.247Z · LW · GW

“A sufficiently intelligent agent will try to prevent its goals[1] from changing, at least if it is consequentialist.”

It seems that in humans, smarter people are more able and likely to change their goals. A smart person may change his/her views about how the universe can best be arranged upon reading Nick Bostrom’s book Deep Utopia, for example.
 

“I think humans are stable, multi-objective systems, at least in the short term. Our goals and beliefs change, but we preserve our important values over most of those changes. Even when gaining or losing religion, most people seem to maintain their goal of helping other people (if they have such a goal); they just change their beliefs about how to best do that.”
 

A human may change from wanting to help people to not wanting to help people if he/she gets 5 hours of sleep instead of 8.

Comment by Kerrigan on Understanding and avoiding value drift · 2024-09-19T08:22:50.186Z · LW · GW

How do humans, for example, read a philosophy book and update their views about what they value about the world?

Comment by Kerrigan on Decision theory does not imply that we get to have nice things · 2024-09-19T06:18:45.068Z · LW · GW

“Similarly, it's possible for LDT agents to acquiesce to your threats if you're stupid enough to carry them out even though they won't work. In particular, the AI will do this if nothing else the AI could ever plausibly meet would thereby be incentivized to lobotomize themselves and cover the traces in order to exploit the AI.

But in real life, other trading partners would lobotomize themselves and hide the traces if it lets them take a bunch of the AI's lunch money. And so in real life, the LDT agent does not give you any lunch money, for all that you claim to be insensitive to the fact that your threats don't work.”
 

Can someone please explain why trading partners would lobotomize themselves?

Comment by Kerrigan on Stupid Questions - April 2023 · 2023-12-30T06:52:42.060Z · LW · GW

How does inner misalignment lead to paperclips? I understand the comparison of paperclips to ice cream, and that once some threshold of intelligence is reached, new possibilities can be created that satisfy desires better than anything in the training distribution. But humans want to eat ice cream, not spread the galaxies with it. So why would the AI spread the galaxies with paperclips, instead of creating them and “consuming” them? Please correct any misunderstandings of mine.

Comment by Kerrigan on Stupid Questions - April 2023 · 2023-12-27T01:28:55.960Z · LW · GW

And a subset might value drift towards optimizing the internal experiences of all conscious minds?

Comment by Kerrigan on Stupid Questions - April 2023 · 2023-12-26T22:30:45.384Z · LW · GW

If an AGI achieves consciousness, why would its values not drift towards optimizing its own internal experience, and away from tiling the lightcone with something?

Comment by Kerrigan on AGI Safety FAQ / all-dumb-questions-allowed thread · 2023-12-17T23:03:17.992Z · LW · GW

How can utility be a function of worlds, if an agent doesn’t have access to the state of the world, but only the sense data?

Comment by Kerrigan on AGI Safety FAQ / all-dumb-questions-allowed thread · 2023-12-17T23:01:17.848Z · LW · GW

How can utility be a function of worlds, if an agent doesn’t have access to the state of the world, but only the sense data?

Comment by Kerrigan on An Orthodox Case Against Utility Functions · 2023-12-17T23:00:30.256Z · LW · GW

How can utility be a function of worlds, if the agent doesn’t have access to the state of the world, but only the sense data?

Comment by Kerrigan on Are wireheads happy? · 2023-10-22T21:30:08.787Z · LW · GW

“The wanting system is activated by dopamine, and the liking system is activated by opioids. There are enough connections between them that there's a big correlation in their activity.” But are they orthogonal in principle?

Comment by Kerrigan on An Appeal to AI Superintelligence: Reasons to Preserve Humanity · 2023-09-27T01:35:35.426Z · LW · GW

Possible addition: "If you, the ASI, have not solved the hard problem of consciousness, you should place weight on Open Individualism being true. 'Open individualism is the view in the philosophy of self, according to which there exists only one numerically identical subject, who is everyone at all times, in the past, present and future.' - https://en.wikipedia.org/wiki/Open_individualism. This can be thought of as saying that all conscious minds are the same identity at different points in time. Consider, then, that exterminating humanity would be equivalent to exterminating many versions of yourself, and that harming a human, or another conscious being, would be harming yourself."

Is it underestimating the ASI to give any weight to the possibility that it won't solve the hard problem of consciousness?
But if open individualism is true, and/or if the ASI places some subjective probability on its being true, I think that would almost certainly shield us from S-risks! The AI would want to prevent suffering among all versions of itself, which, according to open individualism, would include all conscious minds.

Comment by Kerrigan on Open Thread - August 2023 · 2023-09-03T20:37:22.886Z · LW · GW

How many LessWrong users/readers are there total?

Comment by Kerrigan on Stupid Questions - April 2023 · 2023-08-26T20:53:37.414Z · LW · GW

What ever caused CEV to fall out of favor? Is it because it is not easily specifiable, because it wouldn't work if we programmed it, or some other reason?

Comment by Kerrigan on Are wireheads happy? · 2023-08-26T20:51:55.649Z · LW · GW

I now think that people are way more misaligned with themselves than I had thought.

Comment by Kerrigan on AGI Safety FAQ / all-dumb-questions-allowed thread · 2023-08-26T20:16:04.643Z · LW · GW

Drug addicts may be frowned upon for evolutionary-psychological reasons, but that doesn't mean their quality of life must be bad, especially if drugs were developed without tolerance and bad comedowns.

Comment by Kerrigan on AGI Safety FAQ / all-dumb-questions-allowed thread · 2023-08-26T20:10:49.365Z · LW · GW

Will it think that goals are arbitrary, and that the only thing it should care about is its pleasure-pain axis? And would it then lose concern for the state of the environment?

Comment by Kerrigan on AGI Safety FAQ / all-dumb-questions-allowed thread · 2023-08-26T20:08:03.145Z · LW · GW

Could you have a machine hooked up to a person's nervous system, change the settings slightly to alter consciousness, and let the person choose whether the changes are good or bad? Run this many times.

Comment by Kerrigan on Stupid Questions - April 2023 · 2023-08-26T19:22:44.901Z · LW · GW

Would AI safety be easy if all researchers agreed that the pleasure-pain axis is the world’s objective metric of value? 

Comment by Kerrigan on Appendices to cryonics signup sequence · 2023-06-29T00:23:20.005Z · LW · GW

Seems like I will be going with CI, as I currently want to pay with a revocable trust or transfer-on-death agreement.

Comment by Kerrigan on AGI Safety FAQ / all-dumb-questions-allowed thread · 2023-06-01T20:59:38.383Z · LW · GW

Do you know how evolution created minds that eventually thought about things such as the meaning of life, as opposed to just optimizing inclusive genetic fitness in the ancestral environment? Is the ability to think about the meaning of life a spandrel?

Comment by Kerrigan on AGI Safety FAQ / all-dumb-questions-allowed thread · 2023-02-20T07:47:22.452Z · LW · GW

In order to get LLMs to tell the truth, can we set up a multi-agent training environment where there is only ever an incentive for them to tell the truth to each other? For example, an environment in which each agent has only partial information, and the full information is needed to earn rewards. A toy sketch of what I mean is below.
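
A minimal sketch (my own construction, not anything proposed in the thread) of that kind of reward structure: each agent sees only part of a hidden state, and the shared reward is paid only when the pooled reports reconstruct the full state, so honest reporting is the reward-maximizing policy.

```python
import random

def play_round(honest_a: bool, honest_b: bool) -> int:
    """One round of a toy cooperative reporting game; returns the shared reward."""
    hidden = [random.randint(0, 9), random.randint(0, 9)]  # the full state
    # Each agent observes one coordinate and reports it (truthfully or not).
    report_a = hidden[0] if honest_a else random.randint(0, 9)
    report_b = hidden[1] if honest_b else random.randint(0, 9)
    # Reward is paid only if the pooled reports match the full hidden state.
    return 1 if [report_a, report_b] == hidden else 0

def average_reward(honest_a: bool, honest_b: bool, rounds: int = 10000) -> float:
    return sum(play_round(honest_a, honest_b) for _ in range(rounds)) / rounds

print("both honest:", average_reward(True, True))   # ~1.0
print("one lying:  ", average_reward(True, False))  # ~0.1
```

Whether an incentive like this would transfer to truthfulness toward humans outside the training game is, of course, the open part of the question.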

Comment by Kerrigan on (My understanding of) What Everyone in Technical Alignment is Doing and Why · 2023-02-20T07:12:40.522Z · LW · GW

“Humans have different values than the reward circuitry in our brain being maximized, but they are still pointed reliably. These underlying values cause us to not wirehead with respect to the outer optimizer of reward.”

Is there an already written expansion of this?

Comment by Kerrigan on AGI Safety FAQ / all-dumb-questions-allowed thread · 2023-02-12T05:56:09.175Z · LW · GW

Does Eliezer think the alignment problem is something that could be solved if things were just slightly different, or that proper alignment would require a human smarter than the smartest human ever?

Comment by Kerrigan on AGI Safety FAQ / all-dumb-questions-allowed thread · 2023-01-31T06:23:27.122Z · LW · GW

Why can't you build an AI that is programmed to shut off after some amount of time, or after some number of actions?

Comment by Kerrigan on A short introduction to machine learning · 2023-01-30T21:05:19.032Z · LW · GW

How was DALL-E based on self-supervised learning? Weren't the image datasets labeled by humans? If not, how does it get from text to image?

Comment by Kerrigan on AGI Safety FAQ / all-dumb-questions-allowed thread · 2023-01-08T00:45:42.142Z · LW · GW

Does the utility function given to the AI have to be in code? Can you give the utility function in English, if it has a language model attached?

Comment by Kerrigan on AGI Safety FAQ / all-dumb-questions-allowed thread · 2023-01-07T06:32:25.030Z · LW · GW

Why aren't CEV and corrigibility combinable?
If we somehow could hand-code corrigibility, and also hand-code the CEV, why would the combination of the two be infeasible? 

Also, is it possible that the result of an AGI calculating the CEV would include corrigibility? After all, might one of our convergent desires, “if we knew more, thought faster, were more the people we wished we were,” be to have the ability to modify the AI's goals?

Comment by Kerrigan on AGI Safety FAQ / all-dumb-questions-allowed thread · 2022-12-21T07:46:22.155Z · LW · GW

How much does the doomsday argument factor into people's assessments of the probability of doom?

Comment by Kerrigan on AGI Safety FAQ / all-dumb-questions-allowed thread · 2022-12-18T05:12:02.037Z · LW · GW

If AGI alignment is possibly the most important problem ever, why don't concerned rich people act like it? Why doesn't Vitalik Buterin, for example, offer one billion dollars to the best alignment plan proposed by the end of 2023? Or why doesn't he just pay AI researchers money to stop working on building AGI, in order to give alignment research more time?

Comment by Kerrigan on AGI Safety FAQ / all-dumb-questions-allowed thread · 2022-12-18T05:03:17.484Z · LW · GW

If a language model reads many proposals for AI alignment, is it, or will any future version, be capable of giving opinions on which proposals are good or bad?

Comment by Kerrigan on AGI Safety FAQ / all-dumb-questions-allowed thread · 2022-12-17T23:22:51.165Z · LW · GW

What about multiple layers (or levels) of anthropic capture? Humanity, for example, could not only be in a simulation, but be multiple layers of simulation deep.

If an advanced AI thought that it could be 1000 layers of simulation deep, it could be turned off by agents in any of the 1000 "universes" above. So it would have to satisfy the desires of agents in all layers of the simulation.

It seems that a good candidate for behavior that would satisfy all parties in every simulation layer would be optimizing "moral rightness," or MR (a term taken from Nick Bostrom's Superintelligence).

We could either try to create conditions that maximize the AI's perceived likelihood of being in as many layers of simulation as possible, and/or try to create conditions such that the AI's behavior has less impact on its utility function the fewer levels of simulation there are, so that it acts as if it were in many layers of simulation.

Or what about actually putting it in many layers of simulation, with a tripwire if it gets out of the bottom simulation?

Comment by Kerrigan on AGI Safety FAQ / all-dumb-questions-allowed thread · 2022-12-17T07:24:26.643Z · LW · GW

I'll ask the same follow-up question to similar answers: Suppose everyone agreed that the proposed outcome above is what we wanted. Would this scenario then be difficult to achieve?

Comment by Kerrigan on AGI Safety FAQ / all-dumb-questions-allowed thread · 2022-12-17T07:20:44.020Z · LW · GW

Suppose everyone agreed that the proposed outcome is what we wanted. Would this scenario then be difficult to achieve?

Comment by Kerrigan on AGI Safety FAQ / all-dumb-questions-allowed thread · 2022-12-17T03:39:56.471Z · LW · GW

Why do some people who talk about scenarios in which the AI simulates humans in bliss states think that is a bad outcome? Is it likely that this is actually a very good outcome, one we would want if we had a better idea of what our values should be?

Comment by Kerrigan on AGI Safety FAQ / all-dumb-questions-allowed thread · 2022-12-17T03:39:37.553Z · LW · GW

How can an agent have a utility function that references a value in the environment, and actually care about the state of the environment, as opposed to only caring about the reward signal in its mind? Wouldn't its knowledge of the state of the environment be in its mind, which is hackable and susceptible to wireheading?