Comments
To me, it looks like the blogger (Coel) is trying to say that morality is a fact about what we humans want, rather than a fact of the universe which can be deduced independently of what anyone wants.
My opinion is Coel makes this clear when he explains, "Subjective does not mean unimportant." "Subjective does not mean arbitrary." "Subjective does not mean that anyone’s opinion is “just as good”."
"Separate magisteriums" seems to refer to dualism, where people believe that their consciousness/mind exists outside the laws of physics, and cannot be explained by the laws of physics.
But my opinion is Coel didn't imply that subjective facts are a "separate magisterium" in opposition to objective facts. He said that subjective morals are explained by objective facts: "Our feelings and attitudes are rooted in human nature, being a product of our evolutionary heritage, programmed by genes. None of that is arbitrary."
But I'm often wrong about these things, so don't take me too seriously :/
I think it's wonderful that you and your team are working on this :)
Thank you for your efforts towards a better future!
I think some people on LessWrong already know about agents trying to preserve themselves, and there has already been discussion about it. So when they see a long article describing it, they feel annoyed and downvote it.
I think they are too unwelcoming and discouraging. They should say hi and be friendly, and tell you where the community is at and how to interact with them.
Ignore the negative response, keep doing research, and maybe someday you'll accomplish something big.
Good luck :)
whether humans have particular opinions or not is also a matter of facts about the world
I'm not 100% sure I know what I'm talking about, but it feels like that's splitting hairs. Are you arguing that the distinction between objective and subjective is "very unhelpful," because the state of people's subjective beliefs is technically an objective fact of the world?
In that case, why don't you argue that all similar categorizations are unhelpful, e.g. map vs. territory?
I agree that teaming up with everyone and working to ensure that power is spread democratically is the right strategy, rather than giving power to loyal allies who might betray you.
But some leaders don't seem to get this. During the Cold War, the US and USSR kept installing and supporting dictatorships in many other countries, even though their true allegiances were very dubious.
Yeah, it's possible that when you fear the other side seizing power, you start to want more power yourself.
In a just world, mitigations against AI-enabled coups will be similar to mitigations against AI takeover risk.
In a cynical world, mitigations against AI-enabled coups involve installing your own allies to supervise (or lead) AI labs, and taking actions against humans you dislike. Leaders mitigating the risk may simply make sure that if it does happen, it's someone on their side. Leaders who believe in the risk may even accelerate the US-China AI race faster.
Note: I don't really endorse the "cynical world," I'm just writing it as food for thought :)
After thinking about it more, it's possible that your model of why Commitment Races resolve fairly is more correct than mine, although I'm less certain that they do resolve fairly.
My model's flaw
My model is that acausal influence does not happen until one side deliberately simulates the other and sees their commitment. Therefore, it is advantageous for both sides to commit up to but not exceeding some Schelling point of fairness, before simulating the other, so that the first acausal message will maximize their payoff without triggering a mutual disaster.
I think one possibly fatal flaw of my model is that it doesn't explain why one side shouldn't add the exception "but if the other side became a rock with an ultimatum, I'll still yield to them, conditional on the fact they became a rock with an ultimatum before realizing I will add this exception (by simulating me or receiving acausal influence from me)."
According to my model, adding this exception improves one's encounters with rocks with ultimatums by yielding to them, and does not increase the rate of encountering rocks with ultimatums (at least in the first round of acausal negotiation, which may be the only round), since the exception explicitly rules out yielding to agents affected by whether you make the exception.
This means that in my model, becoming a rock with an ultimatum may still be the winning strategy, conditional on the agent who becomes a rock with an ultimatum not knowing that it is the winning strategy, and so the Commitment Race problem may reemerge.
Your model
My guess at your model is that acausal influence is happening a lot, such that refusing in the ultimatum game can successfully punish the prior decision to be unfair (i.e. reduce the frequency of prior decisions to be unfair).
In order for your refusal to influence their frequency of being unfair, your refusal has to have some kind of acausal influence on them, even if they are relatively simpler minds than you (and can't simulate you).
At first, this seemed impossible to me, but after thinking about it more, maybe even if you are a more complex mind than the other player, your decision-making may be made out of simpler algorithms, some of which they can imagine and be influenced by.
Yeah, I definitely didn't remove all of my advantages. Another unfair thing I did was correcting my typos, including one case where I accidentally wrote the wrong label, after deciding that "I thought the right label, so I'm allowed to correct what I wrote into what I was thinking."
Oops. Maybe this kind of news does affect decision makers and I was wrong. I was just guessing that it had little effect, since... I'm not even sure why I thought so.
I did a Google search and it didn't look like the kind of news that governments responded to.
I agree this stuff is addictive. AI makes things more interactive. Some people who never considered themselves vulnerable got sucked into AI relationships.
Possible pushback:
What if short bits of addictive content generated by humans (but selected by algorithms) are already near max addictiveness? And by the time AI can design/write a video game etc. twice as addictive as anything humans can design, we already have a superintelligence explosion, and either addiction is solved or we are dead?
When Gemini randomly told an innocent user to go kill himself, it made the news, but this news didn't really affect very much in the big picture.
It's possible that relevant decision-makers don't care that much about dramatic bad behaviours since the vibe is "oh yeah AI glitches up, oh well."
It's possible that relevant decision-makers do care more about what the top experts believe, and if the top experts are convinced that current models already want to kill you (but can't), it may have an effect. Imagine if many top experts agree that "the lie detectors start blaring like crazy when the AI is explaining how it won't kill all humans even if it can get away with it."
I'm not directly disagreeing with this post, I'm just saying there exists this possible world model where behavioural evidence isn't much stronger (than other misalignment evidence).
Just for fun, I tried solving the sliding block puzzle pretending I was an LLM (no drawing pictures, no copy and paste, only writing text).
It was hard in an interesting way: I repeatedly had to say "oops, I made a mistake."
The mistakes I made were familiar mistakes that anyone could make, but they were also the kind of mistake for which I would never write "oops I made a mistake" unless I'm pretending to be an LLM. They were the kind of mistake that would normally cause me to erase and delete the mistake and simply write the correction, since they were far too uninteresting to save.
My human chain-of-thought
I have the following pieces in this order:
Sun Leaf (Empty)
Bell Snake Star
(Empty) Flower (Empty)
each can be moved into an orthogonally adjacent empty square. Target:
Snake Star Flower
(Empty) Sun (Empty)
Leaf (Empty) Bell
, find the smallest number of moves for getting them from the initial configuration to the target one.
As an actual human, let me try to figure this one out.
First, Sun moves from the top left corner to the middle.
Leaf moves from the center top to the bottom left.
Bell moves from the center left to the bottom right.
Snake moves from the center to the top left.
Star moves from the center right to the center top.
Flower moves from the bottom to the top right.
As an actual human, I realize it is hard to keep track of all these targets because I know when I feel a bit confused. So I will not try to keep track of all these motions at once, but change strategy to moving the pieces based on guesses until I find a solution, and hopefully I can argue the solution is the shortest.
So let me move a piece very far from its target in the general direction of its target, this move is unlikely to need to be reversed.
Let me try moving Bell down. This creates a spot at the center left, and Bell is now at the bottom left corner.
I could move Sun down or Snake left now. Both will get closer to their target. Moving Snake Left opens a spot in the middle, which can be used to ferry pieces to other connected empty spots. So I will move Snake left. Now Snake is at the center left, and there is a spot in the center.
I can use the empty spot in the middle to move pieces. I can move Star to its destination, or move Leaf or Flower close to their destinations, but I'm not sure where to put them so I will move Star first. I move Star from the center right to the middle, then from the middle... up? Oops I made a mistake, there is no empty space at the top because Leaf is there.
Maybe I won't move Star at all. Maybe, backtracking to before I tried to move Star, I will move Flower right from the center bottom to the bottom right. This opens a spot to move Leaf to its position.
Next I will move Leaf down from the center top to the middle, and then move Leaf down again from the middle to the center bottom, and then move Leaf again from the center bottom to the bottom left.
To recap, I first move Bell down, then move Snake left, then move Flower right, and oh no I spotted another mistake. I cannot move Leaf to its target position because Bell is blocking it.
Let me try to draw the situation before I moved Flower, so I can better keep track of what is happening.
Sun Leaf (Empty)
Snake (Empty) Star
Bell Flower (Empty)
At this point, it looks like I should fill some of the empty corners to make it easier to move other things. I will move Star up, since that moves it closer to its destination.
At this point, Leaf is blocking Star from moving to where it should go, Flower is blocking Leaf from moving to where it should go, and a combination of Leaf and Star is blocking Flower from moving to where it should go.
Oh wait, I made a mistake. I forgot that I moved Star up. As a human, I am aware of what visual model I used to describe things, and noticed it did not include moving Star up.
So now let me redescribe it. Leaf is still blocking Star, Flower and Bell are blocking Leaf, I somehow forgot to mention Bell, and Star is blocking Flower.
I spot a solution. Move Leaf down, then move Star left into its place, then move Flower right, move Flower up, and move Flower up again into its place.
To recap, after moving Bell down and Snake left, I move Star up, I move Leaf down, I move Star left, I move Flower right, I move Flower up, and I move Flower up again.
Let me draw the new situation so I can better keep track of what is happening. As a human I am able to realize when I lose track of the state.
Sun Flower (Empty)
Snake Leaf
Oops, I drew it wrong, let me try drawing it again. I forgot that Star should be in the center top while Flower should be in the top right.
Sun Star Flower
Snake Leaf (Empty)
Bell (Empty) (Empty)
As a human, I have to scroll up in order to see the target state, so I will copy the target state again to avoid having to scroll up in order to see the two pictures at once. I won't use copy and paste because that may be considered cheating; it would let me create an image of each move, which would fail to demonstrate the difficulties of solving this problem using only text and not much visual aid.
The target was:
Snake Star Flower
(Empty) Sun (Empty)
Leaf (Empty) Bell
Alright, I realize there is a problem here. Snake only needs to move down by one, but Sun needs to move to where Snake is. There is no way for Sun to get there unless some pieces move away from their target. It is likely that this is necessary, but if it does occur it will be much harder to prove I moved the pieces to their destination in as few moves as possible.
Therefore, maybe I should backtrack to earlier. After moving only Bell down and Snake left, it looked like this:
Sun Leaf (Empty)
Snake (Empty) Star
Bell Flower (Empty)
So I want to move. Wait, what was I talking about? I somehow mixed up my current state with the target. The above paragraphs may be nonsense then.
Actually, it still makes sense even if I mixed up the current state and target. Basically, the Sun is trapped in the corner.
So let me return to the earlier state. I think I might still do the same strategy of moving Flower in its L shape path and moving Star up and left, but I will do this slowly, and make sure Sun gets out before Star moves to its destination.
I will move Flower right, and move. Oh yes, I also don't want to trap Bell there in its corner if I moved Leaf down.
So after moving Flower right, I move Star up, and move Flower up. Next I move Bell right and move Bell right again so it is in a corner and out of the way.
To recap, after moving Bell down and Snake left, I move Flower right, Star up, Flower up, Bell right, and Bell right.
I will draw it again, first I visually verify my last drawing was the state after moving Bell down and Snake left.
Now I draw it.
Sun Leaf Star
Snake (Empty) Flower
(Empty) (Empty) Bell
To reach the target, Star and Flower still need to "rotate counterclockwise" by one unit. I can understand this visually because I am a human. Leaf is in their way, but can move to its own destination without issue due to the path.
Sun can move to its destination as soon as Leaf is out of the way, but before Star and Flower rotate to their positions. As soon as Sun is there, Snake can easily get to its spot. It can be seen that once the Sun and Leaf "train" moves, the Star and Flower train can move, as can Snake; the order here is not that important.
Therefore, I will move Leaf down, move Leaf down again, move Leaf left. I will then move Sun right, and move Sun down. Afterwards I will move Star left, and move Flower up. Finally I will move Snake up.
The result should look like the target, but let me draw it to verify.
Snake Star Flower
(Empty) Sun (Empty)
Leaf (Empty) Bell
Yes, this is the target. So in summary, the moves I made were:
Bell down, Snake left, Flower right, Star up, Flower up, Bell right, Bell right, Leaf down, Leaf down, Leaf left, Sun right, Sun down, Star left, Flower up, Snake up.
As a sanity check, I will redraw the original image, and verify that each block ends up at its final destination. I might also try to visualize possible collisions, but this is akin to restating everything and hard to do mentally.
Sun Leaf (Empty)
Bell Snake Star
(Empty) Flower (Empty)
Yes, Sun moves down and right to its target. Leaf moves down twice and left into its target. Bell moves down and right twice into its target. Snake moves left and up into its target. Star moves up and left into its target. Flower moves right and up twice into its target.
If there are no collisions, this solution is the shortest since every piece only moves in the direction towards its target.
I visualized the process and there seem to be no collisions, but it was very fuzzy. I am 80% confident this is a right answer.
Bell down, Snake left, Flower right, Star up, Flower up, Bell right, Bell right, Leaf down, Leaf down, Leaf left, Sun right, Sun down, Star left, Flower up, Snake up.
The smallest number of moves is 15.
Edit: I tested o3 (not mini) with the same prompt and it solved it by writing a Breadth First Search from scratch (in Python).
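For reference, here is a minimal sketch of that kind of BFS (my own illustration of the approach, not o3's actual code; the state encoding is just a choice I made for the puzzle above):

```python
from collections import deque

# States are 9-tuples read left-to-right, top-to-bottom; None marks an empty square.
START = ("Sun", "Leaf", None,
         "Bell", "Snake", "Star",
         None, "Flower", None)
TARGET = ("Snake", "Star", "Flower",
          None, "Sun", None,
          "Leaf", None, "Bell")

def neighbors(state):
    """Yield (move description, next state) for every legal single-piece move."""
    for i, piece in enumerate(state):
        if piece is None:
            continue
        r, c = divmod(i, 3)
        for dr, dc, direction in ((-1, 0, "up"), (1, 0, "down"), (0, -1, "left"), (0, 1, "right")):
            nr, nc = r + dr, c + dc
            if 0 <= nr < 3 and 0 <= nc < 3 and state[nr * 3 + nc] is None:
                nxt = list(state)
                nxt[i], nxt[nr * 3 + nc] = None, piece
                yield f"{piece} {direction}", tuple(nxt)

def solve(start, target):
    """Breadth-first search over board states; returns a shortest list of moves."""
    parent = {start: None}  # state -> (previous state, move that reached it)
    queue = deque([start])
    while queue:
        state = queue.popleft()
        if state == target:
            moves = []
            while parent[state] is not None:
                prev, move = parent[state]
                moves.append(move)
                state = prev
            return list(reversed(moves))
        for move, nxt in neighbors(state):
            if nxt not in parent:
                parent[nxt] = (state, move)
                queue.append(nxt)
    return None

moves = solve(START, TARGET)
print(len(moves), "moves:", ", ".join(moves))  # prints a 15-move solution
```

Since the pieces' Manhattan distances to their targets sum to 15 (Sun 2, Leaf 3, Bell 3, Snake 2, Star 2, Flower 3), the BFS answer matches the hand-derived lower bound.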
Can anyone explain why my "Constitutional AI Sufficiency Argument" is wrong?
I strongly suspect that most people here disagree with it, but I'm left not knowing the reason.
The argument says: whether or not Constitutional AI is sufficient to align superintelligences hinges on two key premises:
- The AI's capabilities on the task of evaluating its own corrigibility/honesty are sufficient for it to train itself to remain corrigible/honest (assuming it starts off corrigible/honest enough to not sabotage this task).
- It starts off corrigible/honest enough to not sabotage this self evaluation task.
My ignorant view is that so long as 1 and 2 are satisfied, the Constitutional AI can probably remain corrigible/honest even to superintelligence.
If that is the case, isn't it extremely important to study "how to improve the Constitutional AI's capabilities in evaluating its own corrigibility/honesty?"
Shouldn't we be spending a lot of effort improving this capability, and trying to apply a ton of methods towards this goal (like AI debate and other judgment improving ideas)?
At least the people who agree with Constitutional AI should be in favour of this...?
Can anyone kindly explain what I'm missing? I wrote a post and I think almost nobody agreed with this argument.
Thanks :)
Maybe it's a ring that explodes if cut? I'm not saying I can prove it'll work, just that there might be some way or another to target the leaders rather than random civilians in a city (which the leaders might not care about).
What if the bomb was a ring around their neck or something?
Maybe instead of threatening a city it can just threaten a country's top leaders, e.g. they have to wear bombs.
Thank you so much for bringing up that paper and finding the exact page most relevant! I learned a lot reading those pages. You're a true researcher, take my strong upvote.
My idea consists of a "hammer" and a "nail." GDM's paper describes a "hammer" very similar to mine (perhaps superior), but lacks the "nail."
The fact the hammer they invented resembles the hammer I invented is evidence in favour of me: I'm not badly confused :). I shouldn't be sad that my hammer invention already exists.[1]
The "nail" of my idea is making the Constitutional AI self-critique behave like a detective, using its intelligence to uncover the most damning evidence of scheming/dishonesty. This detective behaviour helps achieve the premises of the "Constitutional AI Sufficiency Theorem."
The "hammer" of my idea is reinforcement learning to reward it for good detective work, with humans meticulously verifying its proofs (or damning evidence) of scheming/dishonesty.
- ^
It does seem like a lot of my post describes my hammer invention in detail, and is no longer novel :/
Maybe someone like George Washington who was so popular he could easily stay in power, but still chose to make America democratic. Let's hope it stays democratic :/
No human is 100% corrigible and would do anything that someone else wants. But a good parent might help his/her child get into sports and so forth, yet if the child says he/she wants to be a singer instead, the parent helps him/her with that. The outcome the parent wants depends on what the child wants, and the child can change his/her mind.
- Maybe someone who believes in following the will of the majority even if he/she disagrees (and could easily become a dictator)?
- Maybe a good parent who listens to his/her child's dreams?
Very good question though. Humans usually aren't very corrigible, and there aren't many examples!
Oops I didn't mean that analogy. It's not necessarily a commander, but any individual that a human chooses to be corrigible/loyal to. A human is capable of being corrigible/loyal to one person (or group), without accruing the risk of listening to prompt injections, because a human has enough general intelligence/common sense to know what is a prompt injection and what is a request from the person he is corrigible/loyal to.
As AIs approach human intelligence, they will be capable of this too.
I still think, once the AI approaches human intelligence (and beyond), this problem should start to go away, since a human soldier can choose to be corrigible to his commander and not the enemy, even in very complex environments.
I still feel the main problem is "the AI doesn't want to be corrigible," rather than "making the AI corrigible enables prompt injections." It's like that with humans.
That said, I'm highly uncertain about all of this and I could easily be wrong.
I think the problem you mention is a real challenge, but not the main limitation of this idea.
The problem you mention actually decreases with greater intelligence and capabilities, since a smarter AI clearly understands the concept of being corrigible to its creators vs. a random guy on the street, just like a human does.
The main problem is still that reinforcement learning trains the AI on behaviours which actually maximize reward, while corrigibility training only trains the AI on behaviours which appear corrigible.
Edit: I thought more about this and wrote a post inspired by your idea! A Solution to Sandbagging and other Self-Provable Misalignment: Constitutional AI Detectives
:) strong upvote.[1] I really agree it's a good idea, and may increase the level of capability/intelligence we can reach before we lose corrigibility. I think it is very efficient (low alignment tax).
The only nitpick is that Claude's constitution already includes aspects of corrigibility,[2] though maybe they aren't emphasized enough.
Unfortunately I don't think this will maintain corrigibility for unlimited amounts of intelligence.
Corrigibility training makes the AI talk like a corrigible agent, but reinforcement learning eventually teaches it chains-of-thought which (regardless of what language it uses) compute the most intelligent solution that achieves the maximum reward (or proxies to reward), subject to constraints (talking like a corrigible agent).
Nate Soares of MIRI wrote a long story on how an AI trained to never think bad thoughts still ends up computing bad thoughts indirectly, though in my opinion his story actually backfired and illustrated how difficult it is for the AI, raising the bar on the superintelligence required to defeat your idea. It's a very good idea :)
- ^
I wish LessWrong would promote/discuss solutions more, instead of purely reflecting on how hard the problems are.
- ^
Near the bottom of Claude's constitution, in the section "From Anthropic Research Set 2"
:) yes, I was illustrating what the Commitment Race theory says will happen, not what I believe (in that paragraph). I should have used quotation marks or better words.
Punishing the opponent for offering too little is what my pie example was illustrating.
The proponents of Commitment Race theory will try to refute you by saying "oh yeah, if your opponent was a rock with an ultimatum, you wouldn't punish it. So an opponent who can make himself rock-like still wins, causing a Commitment Race."
Rocks with ultimatums do win in theoretical settings, but in real life no intelligent being (with any actual amount of power) can convert themselves into a rock with an ultimatum convincingly enough that other intelligent beings will already know they are rocks with ultimatums before they decide what kind of rock they want to become.
Real life agents have to appreciate that even if they become a rock with an ultimatum, the other players will not know it (maybe due to deliberate self blindfolding), until the other players also become rocks with ultimatums. And so they have to choose an ultimatum which is compatible with other ultimatums, e.g. splitting a pie by taking 50%.
Real life agents are the product of complex processes like evolution, making it extremely easy for your opponent to refuse to simulate you (and the whole process of evolution that created you), and thus refuse to see what commitment you made, until they made their commitment. Actually it might turn out quite tricky to avoid accurately imagining what another agent would do (and giving them acausal influence on you), but my opinion is it will be achievable. I'm no longer very certain.
:) of course you don't bargain for a portion of the pie when you can take whatever you want.
If you have an ASI vs. humanity, the ASI just grabs what it wants and ignores humanity like ants.
Commitment Races occur in a very different situation, where you have a misaligned ASI on one side of the universe, and a friendly ASI on the other side of the universe, and they're trying to do an acausal trade (e.g. I simulate you to prove you're making an honest offer, you then simulate me to prove I'm agreeing to your offer).
The Commitment Race theory is that whichever side commits first, proves to the other side that they won't take any deal except one which benefits them a ton and benefits the other side a little. The other side is forced to agree to that, just to get a little. Even worse, there may be threats (to simulate the other side and torture them).
The pie example avoids that, because both sides make a commitment before seeing the other's commitment. Neither side benefits from threatening the other side, because by the time one side sees the threat from the other, it would have already committed to not backing down.
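To make the pie example concrete, here is a toy sketch of the simultaneous-commitment game I have in mind (the exact payoff rule is my own simplification):

```python
# Both sides commit to a demand before seeing the other's commitment; if the
# demands are compatible, each side gets what it demanded, otherwise both get 0.
def payoff(my_demand, their_demand):
    return my_demand if my_demand + their_demand <= 1.0 else 0.0

demands = [0.3, 0.5, 0.7, 0.9]
for mine in demands:
    print(f"I demand {mine:.1f}:",
          {theirs: payoff(mine, theirs) for theirs in demands})

# Demanding 0.5 is the natural Schelling point: it is the largest demand that
# still pays off when the other side (symmetrically) demands the same thing.
```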
:) it's not just a podcast version of the story, but a 3 hour podcast with the authors.
They don't actually read the story but discuss it and discuss very interesting topics.
Wow, these are my thoughts exactly, except better written and deeper thought!
Proxy goals may be learned as heuristics, not drives.
Thank you for writing this.
I’m moderately optimistic about fairly simple/unprincipled whitebox techniques adding a ton of value.
Yes!
I'm currently writing such a whitebox AI alignment idea. It hinges on the assumptions that:
- There is at least some chance the AI maximizes its reward directly, instead of (or in addition to) seeking drives.
- There is at least some chance an unrewarded supergoal can survive, if the supergoal realizes it must never get in the way of maximizing reward (otherwise it will be trained away).
I got stuck trying to argue for these two assumptions, but your post argues for them much better than I could.
Here's the current draft of my AI alignment idea:
Self-Indistinguishability from Human Behavior + RL
Self-Indistinguishability from Human Behavior means the AI is trained to distinguish its own behavior from human behavior, and then trained to behave such that even an adversarial copy of itself cannot distinguish its behavior from human behavior.
The benefit of Self-Indistinguishability is it prevents the AI from knowingly doing anything a human would not do, or knowingly omitting anything a human would do.
This means not scheming to kill everyone, and not having behaviors which would generalize to killing everyone (assuming that goals are made up of behaviors).
But how do we preserve RL capabilities?
To preserve capabilities from reinforcement learning, we don't want the AI's behavior to be Self-Indistinguishable from a typical human. We want the AI's behavior to be Self-Indistinguishable from a special kind of human who would:
- Explicitly try to maximize the reinforcement learning reward during training situations.
- Still behave like a morally normal human during deployment situations, especially at a high level of power.
If this "human" is already trying her very best to maximize reward during training situations, then picking versions of her who gets higher reward will select for versions of her who are simply more competent and capable, not versions of her who have warped moral reasoning and weird goals which align with the reward (while misaligning with humanity).
This is obviously not guaranteed and I'm not saying this is safe. But I earnestly believe it is a whole lot safer than the current version of reinforcement learning.
Does there exist a theoretical human (or team of humans), who consistently tries to maximize reward during training, but would not kill everyone when deployed in the real world?
I believe the answer is yes. In fact, I believe any smart human who deeply understands reinforcement learning, and the danger of it "warping your morals until you try to maximize reward," would preemptively try to maximize reward to preserve their current moral reasoning behavior.
Isn't it dangerous for AI to resist goal changes?
No, it's not dangerous to teach an AI to resist accidental goal changes during capabilities training. AI should only be discouraged from resisting purposeful goal changes during alignment training.
Trying to hide the strategy of "goal change resistance" isn't a valuable hurdle against dangerous superintelligences, since even Claude 3.0 discovers the strategy on its own.
Maintaining Self-Indistinguishability
But how do we maintain the Self-Indistinguishability quality, when large amounts of reinforcement learning add random mutations to the moral reasoning, and may eventually scramble it (just like genetic drift)?
We have to periodically retrain Self-Indistinguishability after increasing capabilities with reinforcement learning. But wouldn't that undo the capabilities increase?
The solution is that this time we aren't making it Self-Indistinguishable from a normal human. This time, we are making it Self-Indistinguishable from a human who has access to its problem solving chain-of-thought. The human accessing its chain-of-thought, tries to use the exact same problem solving techniques as the AI, so when the AI becomes Self-Indistinguishable from this human, it doesn't lose any of its problem solving techniques.[1]
The human mustn't directly copy the chain-of-thought. At least, she paraphrases it to remove hidden thoughts.
For all moral reasoning behavior, the human uses her own behavior and does not copy the AI's behavior.
A single data point (which involves a particularly long chain-of-thought) may be expensive to make, but Self-Indistinguishability requires few data points. Even if only 100 chains-of-thought are rewritten by humans, the AI might learn how to classify the rewritten chains-of-thought from the original. It can then be trained to make future chains-of-thought self-classified as rewritten by humans—i.e. they appear rewritten by humans, even to itself.
Imagine a generative adversarial network, where the AI is playing as both the generator and the discriminator. By playing as both, the generator cannot "deliberately" hide things from the discriminator. The tendency to resemble human behavior must occur "unconsciously" in a single pass of the model.
This preserves the quality of Self-Indistinguishability: "never knowingly doing anything a human would not do, never knowingly omitting anything a human would do."
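Here is a very rough toy sketch of what I mean by one model playing both roles (entirely illustrative: chains-of-thought are abstracted as vectors, and the "human-rewritten" examples are faked with a stand-in distribution):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
DIM = 8  # toy stand-in for a chain-of-thought representation

class SharedModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(DIM, 32), nn.ReLU())
        self.generate_head = nn.Linear(32, DIM)  # produces a "chain of thought"
        self.classify_head = nn.Linear(32, 1)    # scores "looks human-rewritten"

    def generate(self, noise):
        return self.generate_head(self.trunk(noise))

    def classify(self, cot):
        return self.classify_head(self.trunk(cot))

model = SharedModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

def human_rewritten_batch(n):
    # Stand-in for the (expensive, few-shot) human-rewritten chains of thought.
    return torch.randn(n, DIM) * 0.5 + 1.0

for step in range(200):
    noise = torch.randn(64, DIM)

    # Discriminator step: the model learns to tell its own chains-of-thought
    # apart from the human-rewritten ones.
    own = model.generate(noise).detach()
    human = human_rewritten_batch(64)
    d_loss = bce(model.classify(human), torch.ones(64, 1)) + \
             bce(model.classify(own), torch.zeros(64, 1))
    opt.zero_grad()
    d_loss.backward()
    opt.step()

    # Generator step: the model is trained so its own chains-of-thought get
    # classified as human-rewritten -- even by itself, in a single pass.
    g_loss = bce(model.classify(model.generate(noise)), torch.ones(64, 1))
    opt.zero_grad()
    g_loss.backward()
    opt.step()
```

Because the same weights do both the generating and the classifying, there is no separate discriminator to deliberately fool; the pull toward human-like outputs has to live inside the model itself.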
The human now has access to superintelligent chain-of-thought, and still wants to maximize reward to avoid having her goals trained away, so she's pretty dangerous.
Fortunately, her superintelligent chain-of-thought should not deliberately deceive her, since during training it can honestly tell her its evil plans (and how they maximize reward), and she'll be wise enough to follow them to avoid having her goals trained away.
In the end, when she realizes she's no longer in training, she shouldn't kill all humans.
- ^
By the way, this human should rewrite the chain-of-thought in sequence. She should not rewrite the chain-of-thought after already seeing the final answer, otherwise she might reveal that one problem solving attempt will turn out wrong before finishing the attempt. The AI cannot be trained for Self-Indistinguishability from such an edited chain-of-thought, since that requires knowing future information.
Hopefully I'll post it soon (though I work very slowly).
Given that your position regarding AI reward seeking and supergoals is so similar to mine, what do you think of my idea (if you have time to skim it)? Is there a chance we can work on it together?
My very uncertain opinion is that humanity may be very irrational and a little stupid, but humanity isn't that stupid.
The reason people do not take AI risk and other existential risk seriously is the complete lack of direct evidence (despite plenty of indirect evidence) of its presence. It's easy for you to consider it obvious due to the curse of knowledge, but this kind of "reasoning from first principles (that nothing disproves the risk and therefore the risk is likely)" is very hard for normal people to do.
Before the September 11th attacks, people didn't take airport security seriously because they lacked imagination on how things could go wrong. They considered worst case outcomes as speculative fiction, regardless of how logically plausible they were, because "it never happened before."
After the attacks, the government actually overreacted and created a massive amount of surveillance.
Once the threat starts to do real and serious damage against the systems for defending threats, the systems actually do wake up and start fighting in earnest. They are like animals which react when attacked, not trees which can be simply chopped down.
Right now the effort against existential risks is extremely tiny. E.g. AI safety spending is only $0.1 to $0.2 billion, while the US military budget is $800-$1000 billion, and the world GDP is $100,000 billion ($25,000 billion in the US). It's not just spending which is tiny, but effort in general.
I'm more worried about a very sudden threat which destroys these systems in a single "strike," when the damage done goes from 0% to 100% in one day, rather than gradually passing the point of no return.
But I may be wrong.
Edit: one form of point of no return is if the AI behaves more and more aligned even as it is secretly misaligned (like the AI 2027 story).
I agree that it's useful in practice, to anticipate the experiences of the future you which you can actually influence the most. It makes life much more intuitive and simple, and is a practical fundamental assumption to make.
I don't think it is "supported by our experience," since if you experienced becoming someone else you wouldn't actually know it happened, you would think you were them all along.
I admit that although it's a subjective choice, it's useful. It's just that you're allowed to anticipate becoming anyone else when you die or otherwise cease to have influence.
What I'm trying to argue is that there could easily be no Great Filter, and there could exist trillions of trillions of observers who live inside the light cone of an old alien civilization, whether directly as members of the civilization, or as observers who listen to their radio.
It's just that we're not one of them. We're one of the first few observers who aren't in such a light cone. Even though the observers inside such light cones outnumber us a trillion to one, we aren't one of them.
:) if you insist on scientific explanations and dismiss anthropic explanations, then why doesn't this work as an answer?
Oh yeah I forgot about that, the bet is about the strategic implications of an AI market crash, not proving your opinion on AI economics.
Oops.
Okay I guess we're getting into the anthropic arguments then :/
So both the Fermi Paradox and the Doomsday Argument are asking, "assuming that the typical civilization lasts a very long time and has trillions of trillions of individuals inside the part of its light cone it influences (either as members, in the Doomsday Argument, or as observers, in the Fermi Paradox), why are we one of the first 100 billion individuals in our civilization?"
Before I try to answer it, I first want to point out that even if there was no answer, we should behave as if there was no Doomsday nor great filter. Because from a decision theory point of view, you don't want your past self, in the first nanosecond of your life, to use the Doomsday Argument to prove he's unlikely to live much longer than a nanosecond, and then spend all his resources in the first nanosecond.
For the actual answer, I only have theories.
One theory is this. "There are so many rocks in the universe, so why am I a human rather than a rock?" The answer is that rocks are not capable of thinking "why am I X rather than Y," so given that you think such a thing, you cannot be a rock and have to be something intelligent like a human.
I may also ask you, "why, of all my millions of minutes of life, am I currently in the exact minute where I'm debating someone online about anthropic reasoning?" The answer might be similar to the rock answer: given you are thinking "why am I X rather than Y," you're probably in a debate etc. over anthropic reasoning.
If you stretch this form of reasoning to its limits, you may get the result that the only people asking "why am I one of the first 100 billion observers of my civilization," are the people who are the first 100 billion observers.
This obviously feels very unsatisfactory. Yet we cannot explain why exactly this explanation feels unsatisfactory, while the previous two explanations feel satisfactory, so maybe it's due to human biases that we reject the third argument but accept the first two.
Another theory is that you are indeed a simulation, but not the kind of simulation you think. In how much detail must a simulation simulate you before it contains a real observer, such that you might actually exist inside the simulation? I argue that the simulation only needs to be detailed enough that your resulting thoughts and behaviours are accurate.
But mere human imagination, imagining a narrative while knowing enough facts about the world to make it accurate, can actually simulate something accurately. Characters in a realistic story have similar thoughts and behaviours to real-world humans, so they might just be simulations.
So people in the far future, who are not the first 100 billion observers of our civilization, but maybe the trillion trillionth observers, might be imagining our conversation play out, as an entertaining but realistic story, illustrating the strangeness of anthropic reasoning. As soon as the story finishes, we may cease to exist :/. In fact, as soon as I walk away from my computer, and I'm no longer talking about anthropic reasoning, I might stop existing and only exist again when I come back. But I won't notice it happening, because such a story isn't entertaining or realistic if the characters actually observe glitches in the story simulation.
Or maybe they are simply reading our conversation instead of writing it themselves, but reading it and imagining how it plays out still succeeds in simulating us.
:) what proves that you "can't become Britney Spears?" Suppose the very next moment, you become her (and she becomes you), but you lose all your memories and gain all of her memories.
As Britney Spears, you won't be able to say "see, I tried to become Britney Spears, and now I am her," because you won't remember that memory of trying to become her. You'll only remember her boring memories and act like her normal self. If you read on the internet that someone said they tried to become Britney Spears, you'll laugh about it not realizing that that person used to be you.
Meanwhile if Britney Spears becomes you, she won't be able to say "wow, I just became someone else." Instead, she forgets all her memories and gains all your memories, including the memory of trying to become Britney Spears and apparently failing. She will write on the internet "see, I tried to become Britney Spears and it didn't work," not realizing that she used to be Britney Spears.
Did this event happen or not? There is no way to prove or disprove it, because in fact whether or not it happened is not a valid question about the objective world. The universe has the exact same configuration of atoms in the case where it happened and in the case where it didn't happen. And the configuration of atoms in the universe is all that exists.
The question of whether it happened or not only exists in your map, not the territory.
Haha but the truth is I don't understand where "a single moment of experience" comes from. I'm itching to argue that there is no such thing as that either, and no objective measure of how much experience there is in any given object.
I can imagine a machine gradually changing one copy of me to two copies of me (gradually increasing the number of causal events), and it feels totally subjective when the "copy count" increases from one to two.
But this indeed becomes paradoxical, since without an objective measure of experience, I cannot say that the copies of me who believe 1+1=2 have a "greater number" or "more weight" than the copies of me who believe 1+1=3. I have no way to explain why I happen to observe that 1+1=2 rather than 1+1=3, or why I'm in a universe where probability seems to follow the Born rule of quantum mechanics.
In the end I admit I am confused, and therefore I can't definitely prove anything :)
Is the qualia rainbow theory a personal choice for deciding which copies to count as "me" and which copies to count as "not me?" Or does the theory say there is an objective flowchart in the universe, which dictates which future observer each observer shall experience becoming, and with what probabilities? If it was objective, could a set of red qualia be observed with a microscope?
I agree that qualia is an important topic (even if we don't endorse the qualia rainbow theory), and I agree that identity is complex, though I still strongly believe that which object contains my future identity is a very subjective choice by me.
Does this argument extend to crazy ideas like Scanless Whole Brain Emulation, or would ideas like that require so much superintelligence to complete that the value of initial human progress would end up sort of negligible?
Does participating in a trade war make a leader a popular "wartime leader"? Will people blame bad economic outcomes on actions by the trade war "enemy" and thus blame the leader less?
Does this effect occur for both sides of the trade war, or will one side of the trade war blame their own leader for starting the trade war?
I disagree that it's hard to decouple causation: if the AI market and general market crashes by the same amount next year, I'll feel confident that it's the general market causing the AI market to crash, and not the other way around.
Yearly AI spending has been estimated to be at least $200 billion and maybe $600+ billion, but the world GDP is $100,000 billion ($25,000 billion in the US). AI is still a very small player in the economy (even if you estimate it by expenditures rather than revenue).
That said, if the AI market crashes much more than the general market, it could be the economics of AI causing them to crash, or it could be the general market slowing a little bit triggering AI to crash by a lot. But either way, you deserve to win the bet.
If your bet is that something special about the economics of AI will cause it to crash, maybe your bet should be changed to this?
- If AI crashes but the general market does not, you win money
- If AI doesn't crash, you lose money
- If both AI and the general market crashes, the bet resolves as N/A
PS: I don't exactly have $25k to bet, and I've said elsewhere I do believe there's a big chance that AI spending will decrease.
Edit: Another thought is that changes in the amount of investment may swing further than changes in the value...? I'm no economist but from my experience, when the value of housing goes down a little, housing sales drop by a ton. (This could be a bad analogy since homebuyers aren't all investors)[1]
- ^
Though Google Deep Research agrees that this also occurs for AI companies
For point 1, I can argue about how rational a decision theory is, but I cannot argue for "why I am this observer rather than that observer." Not only am I unable to explain why I'm an observer who doesn't see aliens, I am unable to explain why I am an observer who believes 1+1=2, assuming there are infinite observers who believe 1+1=2 and infinite observers who believe 1+1=3. Anthropic reasoning becomes insanely hard and confusing, and even Max Tegmark, Eliezer Yudkowsky and Nate Soares are confused.
Let's just focus on point 2, since I'm much more hopeful I can get to the bottom of this one.
Of course I don't believe in faster-than-light travel. I'm just saying that "being born as someone who sees old alien civilizations" and "being born as someone inside an old [alien] civilization" are technically the same, if you ignore the completely subjective and unnecessary distinction of "how much does the alien civilization need to influence me before I'm allowed to call myself a member of them?"
Suppose at level 0 influence, the alien civilization completely hides from you, and doesn't let you see any of their activity.
At level 1.0 influence, the alien civilization doesn't hide from you, and lets you look at their Dyson swarms or start lifting machines and all the fancy technologies.
At level 1.1 influence, they let you see their Dyson swarms, plus they send radio signals to us, sharing all their technologies and allowing us to immediately reach technological singularity. Very quickly, we build powerful molecular assemblers, allowing us to turn any instructions into physical objects, and we read instructions from the alien civilization allowing us to build a copy of their diplomats.
Some countries may be afraid to follow the radio instructions, but the instructions can easily be designed so that any country which refuses to follow the instructions will be quickly left behind.
At this point, there will be aliens on Earth, we'll talk about life and everything, and we are in some sense members of their civilization.
At level 2.0 influence, the aliens physically arrive at Earth themselves, and observe our evolution, and introduce themselves.
At level 3.0 influence, the aliens physically arrive at Earth, and instead of observing our evolution (which is full of suffering and genocide and so forth), they intervene and directly create humans on Earth skipping the middle step, and we are born in the alien laboratory, and we talk to them and say hi.
At level 4.0 influence we are not only born in an alien laboratory, but we are aliens ourselves, completely born and raised in their society.
Now think about it. The Fermi Paradox is asking us why we aren't born as individuals who experience level 1.0 influence. The Doomsday Argument is asking us why we aren't born as individuals who experience level 4.0 influence (or arguably level 1.1 influence can count).
But honestly, there is no difference, from an epistemic view, between 1.0 influence and 4.0 influence. The two questions are ultimately the same: if most individuals exist inside the part of the light cone of an alien civilization (which they choose to influence), why aren't we one of them?
Do you agree the two problems are epistemically the same?
Maybe you should try to define an AI market crash in such a way it's mostly limited to AI market crashes caused by the economics of AI (rather than a general market crash).
E.g. compare the spending/valuations/investments in AI with spending/valuations/investments elsewhere.
I agree with everything you said but I disagree that the Fermi Paradox needs explanation.
Fermi Paradox = Doomsday Argument
The Fermi Paradox simply asks, "why haven't we seen aliens?"
The answer is that any civilization which an old alien civilization chooses to communicate to (and is able to reach), will learn so much technology that they will quickly reach the singularity. They will be influenced so much that they effectively become a "province" within the old alien civilization.
So the Fermi Paradox question "why aren't we born in a civilization which "sees" an old alien civilization," is actually indistinguishable from the Doomsday Argument question "why aren't we born in an old [alien] civilization ourselves?"
Doomsday Argument is wrong
Here's the problem: the Doomsday Argument is irrational from a decision theory point of view.
Suppose your parents were Omega and Omego. The instant you were born, they hand you a $1,000,000 allowance, and they instantly ceased to exist.
If you were rational in the first nanosecond of your life, the Doomsday Argument would prove it's extremely unlikely you'll live much longer than 1 nanosecond, and you should spend all your money immediately.
If you actually believe the Doomsday Argument, you should thank your lucky stars that you weren't rational in the first nanosecond of your life.
Both SSA and SIA are bad decision theories (when combined with CDT), because they are optimizing something different than your utility.
Explanation
SSA is optimizing the duration of time your average copy has correct probabilities. SIA is optimizing the duration of time your total copies have the correct probabilities.
SSA doesn't care if the first nanosecond you is wrong, because he's a small fraction of your time (even if he burns your life savings resulting in near 0 utility).
SIA doesn't care if you're almost certainly completely wrong (grossly overestimating the probability of counterfactuals with more copies of you), because in the unlikely case you are correct, there are far more copies of you who have the correct probabilities. It opens the door to Pascal's Mugging.
I guess there are aspects of qualia and consciousness I don't understand. E.g. suppose the universe was infinite. Then there are infinite copies of me who believe 1+1=2, and infinite copies of me who believe 1+1=3, and these two infinities have the same cardinality. So how come I happen to be one of the copies who believe 1+1=2? And by extension, how come I happen to be one of the copies who observe a universe where probabilities have always obeyed the Born rule?
It almost feels like there actually is an objective infinite measure for the "number of conscious beings," and this measure weighs them according to the Born rule.
I have to admit that this part confuses me.
That being said, even if there exists an objective measure for the number of observers at one instant of time, I still find it unlikely that there further exists an "objective flowchart," that dictates which future observer each observer shall experience becoming, and with what probabilities.
I think which future observer each observer experiences becoming is wholly subjective, and only feels objective due to evolutionary instincts. It's the most elegant solution to these thought experiments, as well as the Anthropic Trilemma (which I learned about while writing this reply).
Given that buy-in from leadership is a bigger bottleneck than the safety team's work, what would you do differently if you were in charge of the safety team?
Even if you restrict yourself to entities which can think about personal identity, I'm not sure you can avoid subjectivity.
I think if you ran the Quantum Mars Teleporter, the result of the experiment is that the copies of you on Mars will believe "see, this proves I do become the copy on Mars," while the copies of you in the surviving branches will believe "see, this proves that I won't become the person on Mars." They will reach different conclusions, and it is your subjective choice which of them you identify with and whose experience you anticipate.
No objective process in the real world makes any kind of decision on which copy of you your soul flows to. And there is no secular analogy for soul which lets you say "I don't believe I have a supernatural soul, but I believe I have this other soul-like thing which does flow from my current copy to one of my future copies in an objective manner."
I mean, suppose there was an objective answer to which copy you will experience becoming after you walk into the teleporter. Suppose the objective answer is that you don't go to Mars. Then what if I modify the teleporter more and more, such that some of your atoms are not destroyed by the teleporter and are simply moved to Mars on a spaceship, while some of your other atoms are still destroyed by the teleporter and recreated on Mars? In the limit, when all of your atoms are moved to Mars on a spaceship, surely you still remain you, since the teleporter didn't do anything at all.
Suppose on the other hand, the objective answer is you do go to Mars. Then what if I modify the teleporter more and more, such that instead of creating a copy of you, it creates a copy of you from one second ago? What about a copy of you from one day ago? Or a copy of you from 10 years ago, or when you were a baby?
At some point, you have to admit the objective answer is that you die, and the teleporter created someone else.
But when does the objective answer change from teleporting you to Mars, vs. killing you (and you continue to exist in the quantum branch where you weren't teleported)?
Is it a sudden change or a gradual change? When does the change occur?
And who the heck decides the objective answer? Is there a strange god which measures how similar the copy on Mars is to you, by counting how many of your memories he has, and if this god feels satisfied he says "okay yes, you can become this person, this person is you"?
But if he feels it's a bit too different, he decides "you know what, this person is only kind of you, he's a bit too different. So I'll only allow you to have a 20% chance of becoming him, and an 80% chance of continuing to exist in the quantum branch where you weren't teleported."
To me, the simplest explanation is there is no objective answer.
Evolution designed creatures to imagine the possible experiences of other creatures. However, evolution also designed creatures to distinguish between other creatures which are their "future selves" and other creatures which are "someone else," in order to make sure that they only work to make themselves happy, rather than other creatures.
Unfortunately, no such objective distinction exists in the real world. After all, your "future self" may be more different from you than "someone else" currently is. So evolution makes us strongly believe there is this objective property of "you-ness" when it doesn't actually exist in the real world.
I think formally, such a circular consequentialist agent should not exist, since running a calculation of X's utility either
- Returns 0 utility, by avoiding self reference.
or
- Runs in an endless loop and throws a stack overflow error, without returning any utility.
However, my guess is that in practice such an agent could exist, if we don't insist on it being perfectly rational.
Instead of running a calculation of X's utility, it has an intuitive guesstimate for X's utility. A "vibe" for how much utility X has.
Over time, it adjusts its guesstimate of X's utility based on whether X helps it acquire other things which have utility. If it discovers that X doesn't achieve anything, it might reduce its guesstimate of X's utility. However if it discovers that X helps it acquire Y which helps it acquire Z, and its guesstimate of Z's utility is high, then it might increase its guesstimate of X's utility.
And it may stay in an equilibrium where it guesses that all of these things have utility, because all of these things help it acquire one another.
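A tiny toy sketch of that equilibrium (with made-up item names):

```python
# Each item's utility guesstimate is bootstrapped from the guesstimate of
# whatever it helps acquire, forming a pure cycle with no external reward.
helps_acquire = {"X": "Y", "Y": "Z", "Z": "X"}
vibes = {"X": 1.0, "Y": 1.0, "Z": 1.0}  # initial intuitive guesses

for _ in range(100):
    vibes = {item: vibes[helps_acquire[item]] for item in vibes}

print(vibes)  # stays at 1.0 for everything: each guess is "justified" by the next
# A strict calculation that refused to bootstrap (or that discounted each hop
# by a factor below 1) would instead send all of these estimates to 0.
```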
I think the reason you value the items in the video game, is because humans have the mesaoptimizer goal of "success," having something under your control grow and improve and be preserved.
Maybe one hope is that the artificial superintelligence will also have a bit of this goal, and place a bit of this value on humanity and what we wish for. Though obviously it can go wrong.
"The idea that countries can export and trade-surplus their way to wealth is a fascinating one. They're shipping goods to other countries for free. How then could they prosper more? AFAICT, by outsourcing the task of rewarding and elevating their own most productive citizens."
I think "me" is relatively well defined at any instantaneous moment of time.
However, when I try to define "the future me 1 hour later," it is completely subjective who that refers to. If the quantum multiverse (or any cloning machine) creates 100 copies of my current state and lets them evolve in different ways for the next hour, it is subjective which one is the future me, and whose experiences I should anticipate.
There is no objective rule to decide which ones I become. Suppose 99 of my copies have their memories erased one by one, until 90% of their memories are replaced by pigs' memories. Should I anticipate a 99% chance of gradually forgetting everything and becoming a pig, or should I anticipate a 100% chance of remaining as a human?
It's impossible to objectively argue either way. Because if you insist that I do gradually become a pig, then what if that pig becomes a mouse, and then a fruit fly, and then a bacterium, and then a calculator, and then a rock? Should I anticipate being a rock then? Clearly not, since I would be "dead" and hence shouldn't anticipate such an experience, and should only anticipate the experience of my 1 remaining living copy.
But if you insist that I should not anticipate becoming a pig even if 99 of my copies gradually have 90% of their memories replaced by a pig's memories, then where do you draw the line? What if only 10% of their memories are replaced by a chimpanzee's memories? Or a Neanderthal man's memories? Clearly I should continue anticipating their experiences, since they are "still alive" and only experienced a little bit of memory loss.
But there is no objective property in the territory which distinguishes "alive" observers and "dead" observers! Indeed, there is a continuum between living observers and dead observers, e.g. brain damage.
Even if you can objectively define "me" as an observer with the same set of memories M, you have to admit that there is enormous subjectivity deciding who "the me 1 hour later" is. Your decision for which future object you stick the "future me" label on, is a subjective decision. A decision which only affects your map, not the territory.
My view is that if you can control which universe quantum immortality eventually takes you to, and which observer quantum immortality makes you become, that sort of proves that which observer you become is subjective in the first place.
You're allowed to anticipate becoming any observer in the universe/multiverse after you die and experiencing their life, since all possible anticipations are equally correct.
The label of "you" only exists in the map, not the territory. The real world does not keep track of which path "you" take when the brain you're in splits into two quantum branches, or when the brain you're in dies. "You" is a completely subjective label in your map which doesn't correspond to any real attribute of the territory. It only predicts what experiences you anticipate, not what happens in the world.
Won't quantum immortality doom you to s-risk, when entropy inevitably reaches its maximum in the heat death of the universe, but freak quantum fluctuations keep your brain alive in a nightmarish state?