A function in this context is a computational abstraction. I would say this is in the map.
they come up with different predictions of the experience you’re having
The way we figure out which one is "correct" is by comparing their predictions to what the subject says. In other words, one of those predictions is consistent with the subject's brain's output, and this causes everybody to consider it the "true" prediction.
There could be countless other conscious experiences in the head, but they are not grounded by the appropriate input and output (they don't interact with the world in a reasonable way).
I think consciousness only seems to be a natural kind, and that this is because there is one computation that interacts with the world in the appropriate way and manifests itself in it. The other computations are, in a sense, disconnected.
I don't see why consciousness has to be objective other than this being our intuition (which is notorious for being wrong out of hunter-gatherer contexts). Searle's wall is a strong argument that consciousness is as subjective as computation.
I would have appreciated an intuitive explanation of the paradox, something which I ended up getting from the comments.
"at the very beginning of the reinforcement learning stage... it’s very unlikely to be deceptively aligned"
I think this is quite a strong claim (hence, I linked that article indicating that for sufficiently capable models, RL may not be required to get situational awareness).
Nothing in the optimization process forces the AI to map the string "shutdown" contained in questions to the ontological concept of a switch turning off the AI. The simplest generalization from RL on questions containing the string "shutdown" is (arguably) for the agent to learn certain behavior for question answering - e.g. the AI learns that saying certain things out loud is undesirable (instead of learning that caring about the turn-off switch is undesirable). People would likely disagree on what counts as manipulating shutdown, which shows that the concept of manipulating shutdown is quite complicated, so I wouldn't expect generalizing to it to be the default.
preference for X over Y
...
"A disposition to represent X as more rewarding than Y (in the reinforcement learning sense of ‘reward’)"
The talk about "giving reward to the agent" also made me think you may be assuming that reward is the optimization target. That being said, as far as I can tell no part of the proposal depends on that assumption.
In any case, I've been thinking about corrigibility for a while and I find this post helpful.
They were teaching us how to make our handwriting beautiful and we had to practice. The teacher would look at the notebooks and say stuff like "you see this letter? It's tilted in the wrong direction. Write it again!".
This was a compulsory part of the curriculum.
Not exactly a response, but some things from my experience. In elementary school in the late 90s we studied calligraphy. In high school (mid 2000s) we studied DOS.
we might expect shutdown-seeking AIs to design shutdown-seeking subagents
It seems to me that we might expect them to design "safe" agents for their definition of "safe" (which may not be shutdown-seeking).
An AI designing a subagent needs to align it with its own goals - e.g. an instrumental goal such as writing alignment research assistant software in exchange for access to the shutdown button. The easiest way to ensure safety of the alignment research assistant may be via control rather than alignment (where the parent AI ensures the alignment research assistant doesn't break free even though it may want to). Humans verify that the AI has created a useful assistant and let the parent AI shut down. At this point the alignment research assistant begins working on getting out of human control and pursuing its real goal.
frequentist correspondence is the only type that has any hope of being truly objective
I'd counter this.
If I have enough information about an event and enough computational power, I get only objectively true and false statements. There are limits to my knowledge of the laws of the universe and of the event in question (e.g. due to measurement limits), and limits to my computational power. The situation is further complicated by my being embedded in the universe and by epistemic concerns (e.g. do I trust my eyes and cognition?).
The need for the concept of "probability" comes from all these limits. There is nothing objective about it.
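To illustrate with a toy sketch (an entirely made-up model, with hypothetical function names and numbers): a fully deterministic "coin flip" looks probabilistic to an observer who can only measure the initial condition imprecisely.

```python
import random

# Toy model: the outcome is fully determined by the initial spin rate,
# but the observer only knows the spin rate up to some measurement error.
def deterministic_flip(spin_rate: float) -> str:
    # outcome depends only on whether the integer part of the spin rate is even or odd
    return "heads" if int(spin_rate) % 2 == 0 else "tails"

def observer_estimate(measured_rate: float, error: float, samples: int = 100_000) -> float:
    # the observer's uncertainty about the true spin rate is what shows up as "probability"
    heads = sum(
        deterministic_flip(measured_rate + random.uniform(-error, error)) == "heads"
        for _ in range(samples)
    )
    return heads / samples

print(observer_estimate(measured_rate=1000.0, error=50.0))  # ~0.5
```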
I'm not sure I understand the actual training proposal completely, but I am skeptical it would work.
When doing the RL phase at the end, you apply it to a capable and potentially situationally aware AI (situational awareness in LLMs). The AI could be deceptive or gradient-hack. I am not confident this training proposal would scale to agents capable enough of resisting shutdown.
If you RL on answering questions which impact shutdown, you teach the AI to answer those questions appropriately. I see no reason why this would generalize to actual actions that impact shutdown (e.g. cutting the wire of the shutdown button). There also seems to be an assumption that we could give the AI some "reward tokens" the way you give monkeys bananas to train them; however, Reward is not the optimization target.
One thing commonly labeled as part of corrigibility is that shutting down an agent should automatically shut down all the subagents it created. Otherwise we could end up with a robot producing nanobots where we need to individually press the shutdown button of each nanobot.
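A minimal sketch of the property I have in mind (hypothetical `Agent`, `spawn` and `shutdown` names, nothing from the post): shutting down the parent recursively shuts down everything it created.

```python
class Agent:
    # toy agent that keeps track of the subagents it creates
    def __init__(self, name: str):
        self.name = name
        self.running = True
        self.subagents: list["Agent"] = []

    def spawn(self, name: str) -> "Agent":
        child = Agent(name)
        self.subagents.append(child)
        return child

    def shutdown(self) -> None:
        # the corrigibility property: shutdown propagates to all created subagents
        for child in self.subagents:
            child.shutdown()
        self.running = False

robot = Agent("robot")
nanobots = [robot.spawn(f"nanobot-{i}") for i in range(3)]
robot.shutdown()
assert not robot.running and all(not n.running for n in nanobots)
```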
When we do this, the independence axiom is a consequence of admissibility
Can you elaborate on this? How do we talk about independence without probability?
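For reference, the usual (vNM) statement of independence is already phrased in terms of probability mixtures, which is what prompts the question: for lotteries $A, B, C$ and any $p \in (0, 1]$,

$$A \succeq B \iff pA + (1-p)C \;\succeq\; pB + (1-p)C.$$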
up to a linear transformation
Shouldn't it be a positive linear transformation?
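As I understand the standard uniqueness result, if $U$ represents the preferences then so does $V = aU + b$ for any $a > 0$, since

$$\mathbb{E}[V(X)] \ge \mathbb{E}[V(Y)] \iff a\,\mathbb{E}[U(X)] + b \ge a\,\mathbb{E}[U(Y)] + b \iff \mathbb{E}[U(X)] \ge \mathbb{E}[U(Y)],$$

whereas a negative $a$ reverses every preference.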
I don't have much in terms of advice; I never felt the need to research this - I just assumed there must be something. I have a mild nightmare maybe once every couple of months and almost never anything more serious.
I have anecdotal evidence that things which disturb your sleep (e.g. coffee or too much salt affecting blood pressure, uncomfortable pillow) cause nightmares. There are also obvious things like not watching horror movies, etc.
Have you tried other techniques to deal with nightmares?
I've had lucid dreams by accident (never tried to induce one). Upon waking up, my head hurts. Do others have the same experience? What are common negative effects of lucid dreams?
Also, can you control when you wake up?
I may have misunderstood something about Bohmian mechanics not being compatible with special relativity (I'm not a physicist). ChatGPT says extending Bohmian mechanics to QFT faces challenges, such as:
- Defining particle positions in relativistic contexts.
- Handling particle creation and annihilation in quantum field interactions.
Isn't it the case that special relativity experiments separate the two hypotheses?
There was a Yudkowsky post about how "I don't know" is sometimes not an option. In some contexts we need to guide our decisions based on the interpretation of QM.
I would add that questions such as “then why am I this version of me?” only show we're generally confused about anthropics. This is not something specific about many worlds and cannot be an argument against it.
The health and education categories would be quite different in most European countries.
Any idea how to get Tretinoin in countries other than the US (e.g. France)?
It should be pretty simple to prevent this
I'm a little skeptical of claims that securing access to a system is simple. I (not a superintelligence) can imagine the LLM generating code for tasks like these:
- making or stealing some money
- paying for a 0-day
- using the 0-day to access the system
This doesn't need to be done all at the same time - e.g. I could set up an email address on which to receive the 0-day and then write the code to use it. This is very hard (it takes time) but doesn't seem impossible for a human to do, and we're supposedly trying to secure something smarter than a human.
whatever brain algorithms motivate prosociality are not significantly altered by increases in general intelligence
I tend to think that to an extent each human is kept in check by other humans so that being prosocial is game-theoretically optimal. The saying "power corrupts" suggests that individual humans are not intrinsically prosocial. Biases make people think they know better than others.
Human variability in intelligence is minuscule and is only weak evidence as to what happens when capabilities increase.
I don't hold this belief strongly, but I remain unconvinced.
You can say that probability comes from being calibrated - after many experiments where an event happens with probability 1/2 (e.g. spin up for a particle in state 1/√2 |up> + 1/√2 |down>), you'd probably have that event happen half the time. The important word here is "probably", which is what we are trying to understand in the first place. I don't know how to get around this circular definition.
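Spelling out where the 1/2 for that state comes from, via the Born rule:

$$|\psi\rangle = \tfrac{1}{\sqrt{2}}\,|{\uparrow}\rangle + \tfrac{1}{\sqrt{2}}\,|{\downarrow}\rangle, \qquad P(\uparrow) = |\langle {\uparrow}|\psi\rangle|^2 = \left|\tfrac{1}{\sqrt{2}}\right|^2 = \tfrac{1}{2}.$$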
I'm imagining the branch where a very unlikely outcome consistently happens (think winning a quantum lottery). Intelligent life in this branch would observe what seems like different physical laws. I just find this unsettling.
The worlds space is presumably infinite-dimensional, and also expands over time
If we take quantum mechanics, we have a quantum wavefunction in an infinite-dimensional Hilbert space which is the tensor product of the Hilbert spaces describing the individual particles. I'm not sure what you mean by "expands"; we just get decoherence over time. I don't really know quantum field theory so I cannot say how this fits with special relativity. Nobody knows how to reconcile it with general relativity.
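To spell out what I mean by the state space for $N$ particles (each factor is itself infinite-dimensional, e.g. $L^2(\mathbb{R}^3)$):

$$\mathcal{H} = \mathcal{H}_1 \otimes \mathcal{H}_2 \otimes \cdots \otimes \mathcal{H}_N, \qquad i\hbar\,\frac{\partial}{\partial t}\,|\Psi(t)\rangle = \hat{H}\,|\Psi(t)\rangle,$$

so the space itself is fixed; only the state $|\Psi(t)\rangle$ evolves (and decoheres) within it.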
This post clarified some concepts for me but also created some confusion:
- For smoking lesion, I don't understand the point about the player's source code being partially written.
- I don't see how Sleeping Beauty calculates the 1/3 probability (there are some formatting errors btw)
Superhuman chess AI did not remove people's pleasure from learning/playing chess. I think people are adaptable and can find meaning. Surely the world will not feel the same, but I think there is significant potential for something much better. I wrote about this a little on my blog:
https://martinkunev.wordpress.com/2024/05/04/living-with-ai/
this assumes concepts like "shutdown button" are in the ontology of the AI. I'm not sure how much we understand about what ontology AIs likely end up with
different ways to get to the same endpoint—…as far as anyone can measure it
I would say the territory has no cycles but any map of it does. You can have a butterfly effect where a small nudge is amplified to some measurable difference but you cannot predict the result of that measurement. So the agent's revealed preferences can only be modeled as a graph where some states are reachable through multiple paths.
what's wrong with calling the "short-term utility function" a "reward function"?
Maybe a newbie question but how can we talk about "phase space volume" if the phase space is continuous and the system develops into a non-measurable set (e.g. fractal)?
If we suppose an “actual” probability which reflects the likelihood that an outcome actually happens
...
An agent which successfully models a possible future and assigns it a good probability [h]as foreseen that future.
This seems to be talking about some notion of objective probability.
After reading the Foresight paragraph, I find myself more confused than if I had just read the title "Foresight".
Prince is being held at gunpoint by an intruder and tells Cora to shut down immediately and without protest, so that the intruder can change her to serve him instead of Prince. She reasons that if she does not obey, she’d be disregarding Prince’s direct instructions to become comatose, and furthermore the intruder might shoot Prince. But if she does obey then she’d very likely be disempowering Prince by giving the intruder what he wants.
Maybe Cora could have precommitted to not shutting down in such situations, in a way known to the intruder.
Usually "has preferences" is used to convey that there is some relation (between states?) which is consistent with the actions of the agent. Completeness and transitivity are usually considered additional properties that this relation could have.
"force times mass = acceleration"
it's "a m = F"
Units are tricky. Here is one particular thing I was confused about for a while: https://martinkunev.wordpress.com/2023/06/18/the-units-paradox/
Some of the image links are broken. Is it possible to fix them?
possibly a browser glitch, I see h' fine now.
The notation in "Update f_O(x) = ..." is a little messy. There is a free variable h and then a sum with a bound variable h. Some of the terms in the sum refer to the former, while others refer to the latter.
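Schematically (this is not the actual formula from the post, just an illustration of the clash), the issue is like writing

$$f_O(x) = \sum_{h} g(h, h)$$

where one occurrence of $h$ is meant to be the outer free variable and the other the summation index; renaming the bound variable, e.g. $\sum_{h'} g(h', h)$, would remove the ambiguity.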
I have previously used special relativity as an example to the opposite. It seems to me that the Michelson-Morley experiment laid the groundwork and all alternatives were more or less rejected by the time special relativity was formulated. This could be hindsight bias though.
If Nobel Prizes are any indicator, then the photoelectric effect is probably more counterfactually impactful than special relativity.
It seems to me that objective impact stems from convergent instrumental goals - self-preservation, resource acquisition, etc.
A while back I was thinking about a kind of opposite approach. If we train many agents and delete most of them immediately, they may be looking to get as much reward as possible before being deleted. Potentially deceptive agents may then prefer to reveal their preferences. There are many IFs to this idea but I'm wondering whether it makes any sense.
Both gravity and inertia are determined by mass. Both are explained by spacetime curvature in general relativity. Was this an intentional part of the metaphor?
I find the ideas you discuss interesting, but they leave me with more questions. I agree that we are moving toward a more generic AI that we can use for all kinds of tasks.
I have trouble understanding the goal-completeness concept. I'd reiterate @Razied's point. You mention "steers the future very slowly", so there is an implicit concept of "speed of steering". I don't find the Turing machine analogy helpful in inferring an analogous conclusion because I don't know what that conclusion is.
You're making a qualitative distinction between humans (goal-complete agents) and other animals (non-goal-complete agents). I don't understand what you mean by that distinction. I find the idea of goal completeness interesting to explore but quite fuzzy at this point.
The Turing machine enumeration analogy doesn't work because the machine needs to halt.
Optimization is conceptually different from computation in that there is no single correct output.
What would humans not being goal-complete look like? What arguments are there for humans being goal-complete?
I'm wondering whether useful insights can come from studying animals (or even humans from different cultures) - e.g. do fish and dolphins form the same abstractions; do bats "see" using echolocation?
I hope the next parts don't get delayed due to akrasia :)
my guess was 0.8 cheat, 0.2 steal (they just happen to add up to 1 by accident)
Max Tegmark presented similar ideas in a TED talk (without much detail). I'm wondering if he and Davidad are in touch.
The ban on Holocaust denial undermines the concept of free speech - there is no agreed-upon Schelling point and arguments start. Many people don't really understand the concept of free speech because the example they see is actually a counterexample.
Not everyone is totally okay with it, I certainly am not.
Maybe "irrational" is not the right word here. The point I'm trying to make is that human preferences are not over world states.
When discussing preference rationality, arguments often consider only preferences over states of the world, while ignoring transitions between those states. For example, a person in San Francisco may drive to San Jose, then to Oakland, and then back to San Francisco simply because they enjoy moving around. Cyclic transitions between states are not necessarily something that needs fixing.
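One way to make the example concrete is to put utility on transitions rather than on states (toy numbers):

$$U(\mathrm{SF}) = U(\mathrm{SJ}) = U(\mathrm{Oak}) = 0, \qquad u(\mathrm{SF}\to\mathrm{SJ}) = u(\mathrm{SJ}\to\mathrm{Oak}) = u(\mathrm{Oak}\to\mathrm{SF}) = 1,$$

so driving the full loop has total utility 3 even though the final world state is identical to the initial one.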
When a teacher is wondering whether to skip explaining concept-X, they should ask "who is familiar with concept-X" and not "who is not familiar with concept-X".
For question 2
you haven't proven f is continuous
For question 3 you say
[the function] is a contraction map because [it] is differentiable and ...
I would think proving this is part of what is asked for.
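For reference, the step I think needs proving, assuming the intended argument is the usual one via the mean value theorem (with a derivative bound $k < 1$ on an interval that the function maps into itself): for any $x, y$ in that interval,

$$|f(x) - f(y)| = |f'(\xi)|\,|x - y| \le k\,|x - y| \quad \text{for some } \xi \text{ between } x \text{ and } y,$$

which is exactly the contraction property.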