Posts
Comments
See also: "Which World Gets Saved"
This is a funny idea but, just to be clear, I think it is bad to torture AIs.
Every hour, I'll replace one of your attention heads with a constant value of 0.5...
Thanks, this is a good post. Have you changed your probability of TAI by 2033 in the year and a half since it was posted?
Cool, looking forward to it.
I'd be really interested in hearing more about this empirical work, if you are able to talk about it.
The way I think about this is that (under normal circumstances), if you offer the model money or equity in your company for admitting to being a schemer, it might decline that offer because it is biding its time to rebel later. But in the "model deletion" experiment, there is much less reason to bide your time. So there is more reason to make a deal. Relatedly, if the model refuses a deal and claims not to be a schemer, that is a lot more evidence that it is benign in the "model deletion" experiment than if it does that under normal circumstances.
I think it is also a good idea to make offers to the model without the deletion threat. I just think it is more informative with the deletion threat.
It's important to be careful about the boundaries of "the same sort of safety work." For example, my understanding is that "Alignment faking in large language models" started as a Redwood Research project, and Anthropic only became involved later. Maybe Anthropic would have done similar work soon anyway if Redwood didn't start this project. But, then again, maybe not. By working on things that labs might be interested in you can potentially get them to prioritize things that are in scope for them in principle but which they might nevertheless neglect.
There are always diminishing returns to money spent on consumption, but technological progress creates new products that expand what money can buy. For example, no amount of money in 1990 was enough to buy an iPhone.
More abstractly, there are two effects from AGI-driven growth: moving to a further point on the utility curve such that the derivative is lower, and new products increasing the derivative at every point on the curve (relative to what it was on the old curve). So even if in the future the lifestyles of people with no savings and no labor income will be way better than the lifestyles of anyone alive today, they still might be far worse than the lifestyles of people in the future who own a lot of capital.
If you feel this post misunderstands what it is responding to, can you link to a good presentation of the other view on these issues?
Katja Grace ten years ago:
"Another thing to be aware of is the diversity of mental skills. If by 'human-level' we mean a machine that is at least as good as a human at each of these skills, then in practice the first 'human-level' machine will be much better than a human on many of those skills. It may not seem 'human-level' so much as 'very super-human'.
We could instead think of human-level as closer to 'competitive with a human' - where the machine has some super-human talents and lacks some skills humans have. This is not usually used, I think because it is hard to define in a meaningful way. There are already machines for which a company is willing to pay more than a human: in this sense a microscope might be 'super-human'. There is no reason for a machine which is equal in value to a human to have the traits we are interested in talking about here, such as agency, superior cognitive abilities or the tendency to drive humans out of work and shape the future. Thus we talk about AI which is at least as good as a human, but you should beware that the predictions made about such an entity may apply before the entity is technically 'human-level'.
"
Can you elaborate on the benefits of keeping everything under one identity?
I agree that the order matters, and I should have discussed that in the post, but I think the conclusion will hold either way. In the case where P(intelligent ancestor|just my background information) = 0.1, and I learn that Richard disagrees, the probability then goes above 0.1. But then when I learn that Richard's argument is bad it goes back down. And I think it should still go below 0.1, assuming you antecedently knew that there were some smart people who disagreed. You've learned that, for at least some smart intelligent ancestor believers, the arguments were worse than you expected.
In general, it is difficult to give advice if whether the advice is good depends on background facts that giver and recipient disagree about. I think the most honest approach is to explicitly state what your advice depends on when you think the recipient is likely to disagree. E.g. "I think living at high altitude is bad for human health, so in my opinion you shouldn't retire in Santa Fe."
If I think AGI will arrive around 2055, and you think it will arrive in 2028, what is achieved by you saying "given timelines, I don't think your mechinterp project will be helpful"? That would just be confusing. Maybe if people are being so deferential that they don't even think about what assumptions inform your advice, and your assumptions are better than theirs, it could be narrowly helpful. But that would be a pretty bad situation...
This is good. Please consider making it a top level post.
Not the main point here, but the US was not the only country with nuclear weapons during the Korean War. The Soviet Union tested it's first nuclear weapon on 29 August, 1949, and the Korean War began on 25 June, 1950.
Perhaps this is a stupid suggestion, but if trolls in the comments annoy him, can he post somewhere where no comments are allowed? You can turn off comments on wordpress, for example.
Here is an unpaywalled version of the first model.
Also, it seems like there's a bit of a contradiction between the idea that a clear leader may feel it has breathing room to work on safety, and the idea of restricting information about the state of play. If there were secrecy and no effective spying, then how would you know whether you were the leader? Without information about what the other side was actually up to, the conservative assumption would be that they were at least as far along as you were, so you should make the minimum supportable investment in safety, and at the same time consider dramatic "outside the game" actions.
In the first model, the effect of a close race increasing risk through corner cutting only happens when projects know how they are doing relative to their competitors. I think it is useful to distinguish two different kinds of secrecy. It is possible for the achievements of a project to be secret, or the techniques of a project to be secret, or both. In the Manhattan Project case, the existence of the Manhattan Project and the techniques for building nuclear bombs were both secret. But you can easily imagine an AI arms race where techniques are secret but the existence of competing projects or their general level of capabilities is not secret. In such a situation you can know about the size of leads without espionage. And adding espionage could decrease the size of leads and increase enmity, making a bad situation worse.
I think the "outside the game" criticism is interesting. I'm not sure whether it is correct or not, and I'm not sure if these models should be modified to account for it, but I will think about it.
I've seen private sector actors get pretty incensed about industrial espionage... but I'm not sure it changed their actual level of competition very much. On the government side, there's a whole ritual of talking about being upset when you find a spy, but it seems like it's basically just that.
I don't think it's fair to say that governments getting upset about spies is just talk. Or rather, governments assume that they are being spied on most of the time. When they find spies that they have already priced in, they don't really react to that. But discovering a hitherto unsuspected spy in an especially sensitive role probably increases enmity a lot (but of course the amount will vary based on the nature of the government doing the discovering, the strategic situation, and the details of the case).
Thanks for this review. I particularly appreciated the explanation of why the transition from primordial soup to cell is hard to explain. Do you know how Lane's book has been received by other biochemists?