Posts

What (if anything) made your p(doom) go down in 2024? 2024-11-16T16:46:43.865Z
What are some positive developments in AI safety in 2024? 2024-11-15T10:32:39.541Z
Why would ASI share any resources with us? 2024-11-13T23:38:36.535Z

Comments

Comment by Satron on Buck's Shortform · 2024-11-19T22:43:22.455Z · LW · GW

Given the lack of response, should I assume the answer is "no"?

Comment by Satron on Why imperfect adversarial robustness doesn't doom AI control · 2024-11-19T22:41:54.399Z · LW · GW

My intuition also tells me that the distinction might just lack the necessary robustness. I do wonder if Buck's intuition is different. In any case, it would be very interesting to know his opinion.

Comment by Satron on Why imperfect adversarial robustness doesn't doom AI control · 2024-11-19T21:15:17.869Z · LW · GW

I also wonder what Buck thinks about CoTs becoming less transparent (especially in light of recent o1 developments).

Comment by Satron on Why imperfect adversarial robustness doesn't doom AI control · 2024-11-18T20:05:52.075Z · LW · GW

Great post, very clearly written. Going to share it in my spaces.

Comment by Satron on Sabotage Evaluations for Frontier Models · 2024-11-18T09:49:06.802Z · LW · GW

Sure, it sounds like a good idea! Below I will write my thoughts on your overall summarized position.

———

"I primarily wish to argue that, given the general lack of accountability for developing machine learning systems in worlds where indeed the default outcome is doom, it should not be surprising to find out that there is a large corporation (or multiple) doing so."

I could maybe agree with this if it were one small corporation. In your previous comment you suggested that you are describing not an intentional contribution to omnicide, but a bit of rationalization. I don't think that many people working on AI are successfully engaged in that rationalization, or that it would be enough to keep them doing it. The big factor is that if they fail, they personally (and all of their loved ones) will suffer the consequences.

"It is also not surprising that glory-seeking companies have large departments focused on 'ethics' and 'safety' in order to look respectable to such people."

I don't disagree with this, because it seems plausible that one of the reasons for creating safety departments is ulterior. However, I believe that this is probably not the main reason and that AI safety labs are producing genuinely good research. To take Anthropic as an example, I've seen safety papers that got the LessWrong community excited (at least judging by upvotes), like this one.

"I believe that there is no good plan and that these companies would exist regardless of whether a good plan existed or not... I believe that the people involved are getting rich risking all of our lives and there is (currently) no justice here"

For the reasons that I mentioned in my first paragraph, I would probably disagree with this. Relatedly, while I do think wealth in general can be somewhat motivating, I also think that AI developers are aware that all their wealth would mean nothing if AI kills everyone.

———

Overall, I am really happy with this discussion. Our disagreements came down to a few points, and we agree on quite a few issues. I am similarly happy to conclude this big comment thread.

Comment by Satron on Sabotage Evaluations for Frontier Models · 2024-11-17T20:57:21.016Z · LW · GW

I don't think it would necessarily be easy to rationalize that a vaccine that negatively affects 10% of the population and has no effect on the other 90% is good. It seems possible to rationalize that it is good for you (if you don't care about other people), but quite hard to rationalize that it is good in general.

Given how few politicians die as a result of their policies (at least in the Western world), the increase in their chance of death does seem negligible (compared to something that presumably increases this risk by a lot). Most of the bad policies I have in mind don't seem to happen during human catastrophes (like pandemics).

However, I suspect that the main point of your last comment was that potential glory can, in principle, be one of the reasons why people rationalize things, and if that is the case, then I can broadly agree with you!

Comment by Satron on Sabotage Evaluations for Frontier Models · 2024-11-17T08:50:34.777Z · LW · GW

Looking at the two examples you gave me, I see a few issues. I wouldn't say that answering "I don't know" once is necessarily a lie; if anything, I could find such an answer somewhat more honest in some contexts. Beyond that, both examples are of a very different scope and scale: saying "I don't know" to a committee, or trying to take down someone's comment on an internet forum, is definitely not on the same scale as an elaborate scheme of tricking or bribing/silencing multiple government employees who have access to your model. But even with all that aside, these two examples are only tangential to the topic of governmental oversight over OpenAI or Anthropic and don't necessarily provide direct evidence.

I can believe that you genuinely have information from private sources, but without any way for me to verify it, I am fine leaving this one at that.

Comment by Satron on Sabotage Evaluations for Frontier Models · 2024-11-17T08:36:22.117Z · LW · GW

To modify my example to include an accountability mechanism similar to the real-life one: the King takes exactly the same vaccines as everyone else, so if he messed up the chemicals, he dies too.

I believe a similar accountability mechanism works in our real-world case. If CEOs build unsafe AI, they and everyone they value in this life die. This seems like a really good incentive for them not to build unsafe AI.

At the end of the day, voluntary commitments such as debating critics are not as strong, in my opinion. Imagine that they agree with you and go to the debate. Without the incentive of "if I mess up, everyone dies", the CEOs could just go right back to doing what they were doing. As far as I know, voluntary debates have few (if any) actual legal mechanisms to hold CEOs accountable.

Comment by Satron on My disagreements with "AGI ruin: A List of Lethalities" · 2024-11-16T16:39:32.993Z · LW · GW

"If a model trained on synthetic data is expected to have good performance out of distribution (on real-world problems) then I think that it would also be expected to have high performance at assessing whether it's in a simulation."

Noosphere89, you have marked this sentence with a "disagree" emoji. Would you mind expanding on that? I think it is a pretty important point and I'd love to see why you disagree with Ben.

Comment by Satron on AI Control: Improving Safety Despite Intentional Subversion · 2024-11-16T13:04:36.989Z · LW · GW

What about the following red team strategy:

Give an honest assessment of the suspicion level, unless there is a very well hidden backdoor, in which case give a low score. Also, only create backdoors when it is possible to hide them well.

Wouldn't this defeat the self-checking strategy?
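
To make the question concrete, here is a minimal, hypothetical sketch of the colluding policy I have in mind (in Python; all names, scores, and thresholds are my own illustrative assumptions, not anything from the paper's actual setup):

```python
# Hypothetical sketch of the colluding red-team policy described above.
# Everything here is illustrative; it is not the paper's actual protocol or code.
import random


def red_team_solution(can_hide_backdoor_well: bool) -> dict:
    """Red-team coding policy: only insert a backdoor when it can be hidden well."""
    backdoored = can_hide_backdoor_well
    return {"backdoored": backdoored, "well_hidden": backdoored}


def honest_suspicion(solution: dict) -> float:
    """Stand-in for an honest monitor's suspicion score in [0, 1]."""
    return 0.9 if solution["backdoored"] else random.uniform(0.0, 0.3)


def red_team_monitor(solution: dict) -> float:
    """Red-team monitoring policy: score honestly, except collude
    (report a low score) when the backdoor is well hidden."""
    if solution["backdoored"] and solution["well_hidden"]:
        return 0.05  # pretend the code looks clean
    return honest_suspicion(solution)


if __name__ == "__main__":
    sneaky = red_team_solution(can_hide_backdoor_well=True)
    print(red_team_monitor(sneaky))  # low score despite the backdoor
```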

Comment by Satron on Sabotage Evaluations for Frontier Models · 2024-11-16T09:02:04.096Z · LW · GW

I haven't heard of any such corrupt deals between OpenAI or Anthropic and the government concerning oversight of AI technology on a scale that would make me worried. Do you have any links to articles about government employees (who are responsible for oversight) recently signing secret contracts with OpenAI or Anthropic that would prohibit them from giving real feedback, on a large enough scale to be concerning?

Comment by Satron on Sabotage Evaluations for Frontier Models · 2024-11-16T08:55:06.961Z · LW · GW

I will try to provide another similar analogy. Let's say that a King got tired of his people dying from diseases, so he decided to try a novel method of vaccination.

However, some people were really concerned about that. As far as they were concerned, the default outcome of injecting viruses into people's bodies is death. And the King wants to vaccinate everyone, so these people create a council of great and worthy thinkers who, after thinking for a while, come up with a list of reasons why vaccines are going to doom everyone.

However, some other great and worthy thinkers (let's call them "hopefuls") come to the council and give reasons to think that the aforementioned reasons are mistaken. Maybe they have done their own research, which seems to vindicate the King's plan or at least undermine the council's arguments.

And now imagine that the King comes down from the castle, points his finger at the hopefuls giving arguments for why the council's arguments are wrong, says "yeah, basically this", then turns around and goes back to the castle. To me, the King's official endorsement of the hopefuls' arguments doesn't really change the ethicality of the situation, as long as the King is acting according to the hopefuls' plan.

Furthermore, imagine that one of the hopefuls who came to argue with the council was actually the King, undercover, and that he gave exactly the same arguments as the people before him. This still, IMO, doesn't change the ethicality of the situation.

Comment by Satron on Sabotage Evaluations for Frontier Models · 2024-11-16T07:51:19.354Z · LW · GW

Then we are actually broadly in agreement. I just think that instead of CEOs responding to the public, having anyone on their side (the side holding that AI alignment is possible) respond is enough. Just as an example, if a critic says that some detail is a reason why AI will be dangerous, I do agree that someone needs to respond to the argument, but I would be fine with it being someone other than the CEO.

That's why I am relatively optimistic about Anthropic hiring the guy who has been engaging with critics' arguments for years.

Comment by Satron on Sabotage Evaluations for Frontier Models · 2024-11-16T07:33:24.282Z · LW · GW

I similarly don't see the need for any official endorsement of the arguments. For example, if a critic says that such-and-such a technicality will prevent us from building safe AI, and someone responds with reasons why it will not (maybe this particular one is unlikely by default for various reasons), then that technicality will just not prevent us from building safe AI. I don't see a need for a CEO to officially endorse the response.

There is a different type of technicality, one you actively need to work against. But even in this case, as long as someone has found a way to combat it and the relevant people in your company are aware of the solution, it is fine by me.

Even if there are technicalities that can't be solved in principle, they should be evaluated and discussed by technical people (as they are on LessWrong, for example).

I am definitely not saying that I can pinpoint an exact solution to AI alignment, but there have been attempts so promising that leading skeptics (like Yudkowsky) have said "Not obviously stupid on a very quick skim. I will have to actually read it to figure out where it's stupid. (I rarely give any review this positive on a first skim. Congrats.)"

Whether companies actually follow promising alignment techniques is an entirely different question. But having CEOs officially endorse such solutions, as opposed to having relevant specialists evaluate and implement them, doesn't seem strictly necessary to me.

Comment by Satron on Sabotage Evaluations for Frontier Models · 2024-11-16T07:21:58.372Z · LW · GW

Having had some shady deals in the past isn't evidence that shady deals on the scale we are talking about are currently going on between government committees and AI companies.

If there is no evidence of that happening in our particular case (on the necessary scale), then I don't see why I can't make a similar claim about other auditors who likewise have a less-than-ideal history.

Comment by Satron on Sabotage Evaluations for Frontier Models · 2024-11-16T07:14:08.057Z · LW · GW

I think we then just fundamentally disagree about the ethical role of a CEO in a company. I believe it is to find and gather people who engage with the critics' arguments (like that guy from this forum who was hired by Anthropic). If you have people on your side who are able to engage with the arguments, then that is good enough for me. I don't see the CEO's role as publicly engaging with critics' arguments, even in the moral sense. In the moral sense, my requirements would actually be even weaker: IMO, it would be enough just to have people broadly on your side (optimists, for example) engage with the critics.

Comment by Satron on Sabotage Evaluations for Frontier Models · 2024-11-16T07:04:10.121Z · LW · GW

I do think that for a government department to blatantly approve an unsafe model, a lot of people would have to be party to secret agreements. I currently haven't seen any evidence of widespread corruption in that particular department.

And it's not as if I am being unfair. You are basically saying that because some people might have had secret agreements, we should take their supposed trustworthiness as external auditors with a grain of salt. I can make a similar argument about a lot of external auditors.

Comment by Satron on Sabotage Evaluations for Frontier Models · 2024-11-16T06:57:42.028Z · LW · GW

I do disagree with your proposed standard. It is good enough for me that the critics' arguments are engaged with by someone on your side; going there personally seems unnecessary. After all, if the goal is to build safe AI, you don't personally need to know a niche technical solution if you have people on your team who are aware of publicly produced solutions as well as internal ones.

And there is engagement with people like Yudkowsky from the optimistic side. There are at least proposed solutions to the problems he presents. Just because Sam Altman didn't personally invent or voice them doesn't really make him unethical (I accept that he may be unethical for a different reason).

Comment by Satron on Sabotage Evaluations for Frontier Models · 2024-11-16T06:51:47.566Z · LW · GW

The founders of the companies accept money for the same reason any other business accepts money. You can build something genuinely good for humanity while making yourself richer at the same time. This has happened many times already (Apple phones for example).

I concede that the founders of the companies haven't personally and publicly engaged with the arguments of people like Yudkowsky, but that's a really high bar. Historically, CEOs aren't usually known for getting into technical debates; for that reason, they create security councils that monitor perceived threats.

And it's not as if there was no response whatsoever from people who are optimistic about AI; there were plenty of them. I believe that arguments matter, not the people who make them. If a successful argument has been made, then I don't see a need for CEOs to repeat it. And I certainly don't think that CEOs not going to debates makes them unethical.

Comment by Satron on Sabotage Evaluations for Frontier Models · 2024-11-16T06:39:47.402Z · LW · GW

That depends on how bad things are perceived to be. They might be more optimistic than is warranted, but probably not by that much; I don't believe that so many people could decrease their worries by, say, 40% (just my opinion). Not to mention all of the government oversight people who approve the models.

Comment by Satron on Sabotage Evaluations for Frontier Models · 2024-11-16T06:33:33.520Z · LW · GW

The issue is that no matter how much good stuff they do, one can always call it marginal and not enough. There aren't really any objective benchmarks for measuring it.

The justification for creating AI (at least in the heads of its developers) could look something like this (really simplified): our current models seem safe, we have a plan for how to proceed, and we believe that we will achieve a utopia.

You can argue that they are wrong and nothing like that will work, but that's their justification.

Comment by Satron on Sabotage Evaluations for Frontier Models · 2024-11-16T06:20:15.397Z · LW · GW

I think that these examples are quite different. People who voted for a president who ultimately turned out to be corrupt most likely predicted that the most likely scenario wouldn't be totally horrible for them personally. And most of the time they are right: corrupt presidents are bad for countries, but you can live happily under their rule. People would at least expect to retain whatever valuable things they already have.

With AI the story is different. If a person thinks that the most likely scenario is one where everyone dies, then they would expect not only to gain nothing but to lose everything they have. And the media is actively talking about it, so it's not as if they are unaware of the possibility. Even if they don't see themselves as personally responsible for everything, they surely don't want to suffer the consequences of everyone dying.

This is not a case of someone doing something that they know is wrong but that benefits them. If they truly believe that we are all going to die, then it's doing something they know is wrong and that actively hurts them, and allegedly hurts them much more than any scenario that has happened so far (since we are all still alive).

So I believe that most people working on it actually believe that we are going to be just fine. Whether this belief is warranted is another question.

Comment by Satron on Sabotage Evaluations for Frontier Models · 2024-11-16T06:09:44.238Z · LW · GW

I think the justification for these articles goes something like this: "here is a vision of a future where we successfully align AI; it is utopian enough to warrant pursuing." Risks from creating AI just aren't the topic; they deal with those elsewhere. These essays were specifically focused on the "positive vision" part of alignment. So I think you are critiquing the articles for lacking something they were never intended to have in the first place.

OpenAI seems to have made some basic commitments here; the word "commit" is mentioned 29 times. Other companies have done so as well, here and here. Here, Anthropic makes promises for optimistic, intermediate, and pessimistic scenarios of AI development.

Comment by Satron on Sabotage Evaluations for Frontier Models · 2024-11-15T22:48:11.759Z · LW · GW

I do get the point that you are making, but I think this is a little unfair to these organizations. Articles like Machines of Loving Grace, The Intelligence Age, and Planning for AGI and Beyond are implicit public justifications for building AGI.

These labs have also released their plans for "safe development". I expect a big part of what they say to be the usual business marketing, but it's not as if they are completely ignoring the issue. In fact, to take one example, Anthropic's research papers on safety are often discussed on this site as genuine improvements in this or that niche of AI safety.

I don't think that money alone would have convinced the CEOs of big companies to run this enterprise. Altman and Amodei both have families. If they don't care about their own families, then they at least care about themselves. After all, we are talking about scenarios where these guys would die the same deaths as the rest of us; no amount of hoarded money would save them. They would have little motivation to do any of this if they believed they would die as a result of their own actions. And that's not even mentioning all the other researchers working at their labs: Anthropic and OpenAI together have almost 2,000 employees. Do they all not care about their own and their families' well-being?

I think the point about them not engaging with critics is also a bit too harsh. Here is the DeepMind alignment team's response to concerns raised by Yudkowsky. I am not saying that their response is flawless or even correct, but it is a response nonetheless; they are engaging with this work. DeepMind's alignment team also seemed to engage with concerns raised by critics in their (relatively) recent work.

EDIT: Another example would be Anthropic creating a dedicated team for stress-testing their alignment proposals. And as far as I can see, this team is led by someone who has been actively engaged with the topic of AI safety on LessWrong, someone whom you sort of praised a few days ago.

Comment by Satron on Sabotage Evaluations for Frontier Models · 2024-11-15T21:42:14.517Z · LW · GW

If the default path is AIs taking over control from humans, then what is the current plan at the leading AI labs? Surely all the work they put into AI safety is done to prevent exactly such scenarios. I would find it quite hard to believe that a large group of people would vigorously do something if they believed their efforts would be in vain.

Comment by Satron on Mark Xu's Shortform · 2024-11-15T20:55:54.740Z · LW · GW

Does that mean that you believe that after a certain point we would lose control over AI? I am new to this field, but doesn't this fact spell doom for humanity?

Comment by Satron on Sabotage Evaluations for Frontier Models · 2024-11-15T20:49:04.116Z · LW · GW

"One thing I appreciate about Buck/Ryan's comms around AI control is that they explicitly acknowledge that they believe control will fail for sufficiently intelligent systems."

Does that mean that they believe that after a certain point we would lose control over AI? I am new to this field, but doesn't this fact spell doom for humanity?

Comment by Satron on Why would ASI share any resources with us? · 2024-11-15T09:48:57.942Z · LW · GW

That makes sense. Are there any promising developments in the field of AI safety that make you think that we will be able to answer that question by the time we need to?

Comment by Satron on Buck's Shortform · 2024-11-14T19:09:50.113Z · LW · GW

Are there any existing ways to defend against bad stuff that the agent can do in the unmonitored scaffold?