Debate: Is it ethical to work at AI capabilities companies?

post by Ben Pace (Benito), LawrenceC (LawChan) · 2024-08-14T00:18:38.846Z · LW · GW · 21 comments

Contents

  Ben's Opening Statement
  Lawrence's Opening Statement
  Verbal Interrogation
  Ben's First Rebuttal
    Can people have higher integrity & accountability standards than ~everyone else at these orgs?
    Do people have better alternatives?
  Lawrence's First Rebuttal
    Our disagreements seem to be:
    That being said, here are my quick responses to Ben’s main points:
  Ben's Second Rebuttal
None
21 comments

Epistemic status: Soldier mindset. These are not (necessarily) our actual positions, these are positions we were randomly assigned, and for which we searched for the strongest arguments we could find, over the course of ~1 hr 45 mins.

Sides: Ben was assigned to argue that it's ethical to work for an AI capabilities company, and Lawrence was assigned to argue that it isn't.

Reading Order: Ben and Lawrence drafted each round of statements simultaneously. This means that each of Lawrence statements you read were written without Lawrence having read Ben's statements that are immediately proceeding.

Ben's Opening Statement

Ben Pace

I think it is sometimes ethical to work for AI capabilities companies.

I think the standard argument against my position is:

  1. It is bad to cause an existential failure for humanity.
  2. For these companies, we have a direct mechanism by which that may happen.
  3. Therefore, you should not work for them.

(I am granting that we both believe they are risking either extinction or takeover by an ~eternal alien dictatorship.)

I think this argument is strong, but I argue that there are some exceptions.

I think that there are still four classes of reason for joining.

First, you have a concrete mechanism by which your work there may prevent the existential threat. For instance, there were people in the Manhattan project improving the safety of the atom bomb engineering, and I think they had solid mechanistic reasons to think that they could substantially improve the safety. I’m not sure what probability to assign, because I think probabilities-without-mechanisms are far more suspect and liable to be meaningless, but if you have a concrete story for averting the threat, I think even a 10% chance is sufficient.

Second, I have often felt it is important to be able to join somewhat corrupt institutions, and mitigate the damage. I sometimes jokingly encourage my friends at IMO corrupt institutions to rise through the ranks and then dramatically quit when something unusually corrupt happens, and help start the fire for reform. I recall mentioning this to Daniel Kokotajlo who’s resignation was ultimately very influential (to be clear I don't believe my comment had any impact on what happened). That said, I think most random well-intentioned people should not consider themselves strong enough to withstand the forces of corruption that will assault them on the inside of such an organization, and think many people are stupidly naive about this. (I think there’s a discussion to be had on what the strength of the forces are here and what kinds of forces and strengths you need to be able to withstand, and perhaps Lawrence can argue that it’s quite low or easy to overcome.)

Third, I think if you are in a position to use the reputation from being in a senior role to substantially improve the outsider’s understanding of the situation, then I could consider this worth it. I can imagine a world where I endorsed the creation of a rival AI company worth $10Bs if a person in a leadership position would regularly go out to high profile platforms and yell “Our technology will kill us all! I am running at this to try to get a bit of control, but goddammit you have to stop us. Shut it all down or we are all doomed. Stop us!” I would genuinely consider it reasonable to work for such an existential-risk-producing-organization, and if you’re in a position to be that person (e.g. a senior person who can say this), then I think I may well encourage you to join the fray.

Fourth, there is an argument that even if you cannot see how to do good, and even if you cannot speak openly about it, it may make sense to stay in power rather than choose to forgo it, even if you increase the risk of extinction or eternal totalitarian takeover. I am reluctant to endorse this strategy on principle, but my guess is that there are worlds dark enough where simply getting power is worthwhile. Hm, my actual guess is that the correct policy is to not do this in situations that risk extinction and takeover, only in situations with much lesser risks such as genocide or centuries-long totalitarian takeover, but I am unsure. I think that the hardest thing here is that this heuristic can cause a very sizeable amount of the people working on the project to be people who spiritually oppose it — imagine if 50% of the Nazis were people broadly opposed to Nazism and genocide but felt that it was better for there to be opposition inside the organization, this would obviously be the wrong call. I think this makes sense for a handful of very influential people to do, and you should execute an algorithm that avoids 5% of the resources of the organization to come from those who politically oppose it. (There are also many other reasons relating to honesty and honor to not work at an organization whose existence you politically oppose.)

In summary, my position is that there are sometimes cases where the good and right thing to do is to work as part of an effort that is on a path to cause human extinction or an eternal totalitarian takeover. They are: if you believe you have a good plan for how to prevent it, if you believe you are able to expose the corruption, if you are able to substantially improve outsiders’ understanding of the bad outcomes being worked toward, and sometimes just so that anyone from a faction opposed to extinction and eternal totalitarian takeover has power.

Lawrence's Opening Statement

LawrenceC

No, it’s not ethical to work at AI companies.

Here’s my take:

  1. First, from a practical utilitarian perspective, do we expect the direct marginal effect of someone joining an AI scaling lab to be positive? I argue the answer is pretty obviously no.
    1. The number of people is clearly too high on the margin already due to financial + social incentives, so it’s unlikely 
    2. The direct impact of your work is almost certainly net negative. (Per unit person at labs, this is true on average.) 
    3. Incentives (both financial and social) prohibit clear thinking.
      1. Many things that might be good become unthinkable (e.g. AI pauses)
  2. Second, are there any big categories of exceptions? I argue that there are very few exceptions to the above rule. 
    1. Direct safety work is probably not that relevant and likely to just be capabilities anyways, plus you can probably do it outside of a lab. 
      1. Financial and other incentives strongly incentivize you to do things of commercial interest; this is likely just capabilities work. 
      2. Lots of safety work is ambiguous, easy to round it one way or the other. 
    2. Gaining power with hopes to use it is not a reliable strategy, almost no one remains incorruptible
      1. It’s always tempting to bide your time and wait for the next chance. 
      2. Respectability is in large part a social fiction that binds people to not do things. 
    3. Trip wires are pretty fake, very few people make such explicit commitments and I think the rate of people sticking to them is also not very high. 
    4. The main exception is for people who:
      1. Are pursuing research directions that require weight-level access to non-public models, OR whose research requires too much compute to pursue outside of a lab
      2. AND where the lab gives you the academic freedom to pursue the above research
      3. AND you’re highly disagreeable (so relatively immune to peer pressure) and immune to financial incentives. 
    5. But the intersection of the above cases is basically null. 
      1. There’s lots of research to be done and other ways to acquire resources for research (e.g. as a contractor or via an AISI/the US government). 
      2. Academic freedom is very rare at labs unless you’re a senior professor/researcher that the labs are fond of. 
      3. People who get the chance to work at labs are selected for being more agreeable, and also people greatly underestimate the effect of peer pressure and financial incentives on average. 
  3. Third, I argue that for people in this position, it’s almost certainly the case that there are better options
    1. There’s lots of independent AIS research to be done.
      1. Not to say that people should be independent AIS researchers – I agree that their output is pretty low, but I argue that it's because independent research sucks, not because there isn’t work to be done.
    2. Empirically people’s research output drops after going to labs
      1. Anthropic interp seems bogged down due to having to work on large models + being all in on one particular agenda
      2. [A few researchers, name redacted]’s work are much less exciting now
      3. The skills you build on the margin don’t seem that relevant to safety either, as opposed to scaling/working with big models. 
    3. There’s a ton of policy or other positions that people don’t take due to financial incentives, but people who care about doing the right thing can.
      1. e.g. red lines, independent evals
      2. Empirically seems difficult to hire for METR/other EA orgs due in large part to financial incentives + lack of social support.  
  4. I argue from the deontological or coordination perspective, that the world would be better if people didn’t do things they thought were bad, that is, the means rarely if ever justify the ends
    1. This is a pretty good heuristic imo, though a bit hard to justify on naive utilitarian grounds. 
    2. But it seems likely reasonable rule utilitarianism would say not to do it. 

Verbal Interrogation

We questioned each other for ~20 minutes, which informed the following discussion.

Ben's First Rebuttal

Ben Pace

It seems that we’re arguing substantially about in-principle vs in-practice. Lawrence mostly wanted to ask if there were literally any people in any of the categories that I could point to, and argued that many people currently there trick themselves into thinking they’re in those categories.

I felt best defending the second category. 

I then also want to push back harder on Lawrence’s third argument, that “for people in this position, it’s almost certainly the case that there are better options.”

Can people have higher integrity & accountability standards than ~everyone else at these orgs?

I think that Lawrence and I disagree about how easy it would be for someone to give themselves way better incentives if they just tried to. My guess is that there are non-zero people who would consider joining a capabilities company, who would sign up to have a personal board of (say) me, Lawrence, and Daniel Kokotajlo, to check-in with quarterly, and who would agree to quit their job if 2 of the 3 of us fired them. They can also publicly list their red-lines ahead of joining.

Perhaps if the organization got wind of this then they would not hire such a person. I think this would certainly be a sign that the org was not one you could work at ethically. I can also imagine the person doing this privately and not telling the organization, in a more adversarial setup (like an internal spy or corporate whistleblower). I’d have to think through the ethics of that, my current guess is that I believe I should never support such an effort, but I’ve not thought about it much. (I'd be interested in reading or having a debate on the ethics of corporate spies.)

Lawrence disagrees that people would in-practice be open to this, or that if they did so then they would get hired.

My guess is that I’d like to try these ideas with some people who are considering getting hired, or perhaps one or two people I have in mind in-particular.

Do people have better alternatives?

Lawrence writes above that, in general, the people going to these orgs have better alternatives elsewhere. I agree this is true for a few select individuals but I don’t think that this is the case for most people.

Let's very roughly guess that there are 100 “safety” researchers and 100 policy researchers at capabilities companies. I do not believe there is a similar sized independent ecosystem of organizations for them to join. For most of these people, I think their alternative is to go home (a phrase which here means “get a job at a different big tech company or trading company that pays 50%-100% of the salary they were previously getting”). I concur that I’d like to see more people go independent but I think it’s genuinely hard to get funding and there’s little mentorship and little existing orgs to fit into, so you need to work hard (in the style of a startup founder) to do this.

My sense is that it would be good for a funder like Open Philanthropy to flood the independent scene with funding so that they can hire competitively, though there are many hidden strings involved in that money already which may easily make that action net negative.

But I am like “if you have an opportunity to work at an AI org, and one of my four reasons holds, I think it is quite plausible that you have literally no other options for doing good, and the only default to consider is to live a quiet life”. I want to push back on the assumption that there are always lots of good options to directly improve things. I mostly do not want to flood a space with random vaguely well-intentioned people who haven’t thought about things much and do not have strong character or epistemologies or skills, and so disagree that the people reliably have better options for helping reduce existential risk from AI.

Lawrence's First Rebuttal

LawrenceC

After talking with Ben, it seems that we agree on the following points:

  1. It’s possible in principle that you could be in an epistemic situation where it’s correct to join a lab. 
  2. It’s rare in practice for people to be in the situations described by Ben above. There are maybe <10 examples of people being any of 1, 2, 3, 4, out of thousands who have joined a lab citing safety concerns. 
  3. The financial and social incentives to join a lab are massive, ergo more people join than should just based on ethical/impact grounds, ergo you should be suspicious a priori when someone claims that they’re doing it on ethical grounds.

Our disagreements seem to be:

  1. Ben thinks it’s more plausible to make tripwires than I seem to.This seems empirically testable, and we should test it.
  2. I seem to believe that there are better opportunities for people outside of labs than Ben does. My guess is a lot of this is because I’m more bullish on existing alignment research than Ben is. Maybe I also believe that social pressures are harder to fix than practical ones like raising money or having an org. It’s also plausible that I see a selected subgroup of people (e.g. those that METR/FAR want to hire, or the most promising researchers in the field), and that on average most people don’t have good options. Note that this doesn’t mean that Ben disagrees that fewer people should be at labs at the margin – he just thinks that these people should go home.

But I think these are pretty minor points.

I think it’s a bit hard to rebut Ben’s points when I agree with them in principle, but my contention is just that it’s rare for anyone to ever find themselves in such situations in practice, and that people who think they’re in such a situation tend to on average be wrong.

That being said, here are my quick responses to Ben’s main points: 

First, I think few if any people who join labs have such a concrete mechanism in mind, and most do it in large part due to financial or status reasons. Second, I think the number of examples where someone has successfully mitigated damage in the AI capabilities context is really small; there’s a reason we keep talking about Daniel Kokotajlo. Third, there are zero examples of people doing this, in large part because you cannot make a $10B capabilities lab while saying this. Fourth, the majority of people who think they’re in this situation have been wrong. I’m sure many, many people joined the Nazis because they thought “better me in charge than a real Nazi”, and people definitely have thought this way in Washington DC in many contexts. Also, as Ben points out, there are many virtue ethical or deontological reasons to not pursue such a policy. 

Perhaps my best argument is: what’s the real point of saying something is ethical? I don’t think we should focus on whether or not things are ethical in principle – this leads people down rabbit holes where they use contrived situations to justify violence against political opponents (“punch a nazi”) or torture (the “ticking time bomb” terrorism scenarios). Ethics should inform what to do in practice, based primarily on the situations that people find themselves in, and not based on extreme edge cases that people are unlikely to ever find themselves in. 

Ben's Second Rebuttal

Ben Pace

Briefly responding to two points:

I’m sure many, many people joined the Nazis because they thought “better me in charge than a real Nazi”

I’m not sure I agree with this. My guess is currently people are doing way more “I’m actually a good person so it’s good for me to be the person building the extinction machine” than people though “I’m actually a good person so it’s good for me to be the person running the concentration camp”. I think the former is partly because there’s a whole community of people who identify as good (“effective altruists”) who have poured into the AI scene and there wasn’t an analogous social group in the German political situation. I do think that there were some lone heroes in Nazi Germany (Martin Sustrik has written about them [LW · GW]) but there was less incentive to be open about it, and there was a lot of risk involved. My guess is that on some dimensions I’d feel better about someone at a capabilities company who was attempting this sort of internal sabotage in terms of them being open and honest with themselves. I don’t endorse it in the current adversarial and political environment, seems like a crazy defection on US/UK society (I’d be interested to debate that topic with someone; in today’s society, is it credible that we’ll get to a point where having corporate spies is ethical?). But overall my guess is that staff in Nazi germany didn’t think about it too much or self-deceived, rather than thinking about themselves as more ethical than the alternative. ("Wow, it sucks that I have to do this job, but I think it's the best choice for me and my family, and so I will try to avoid thinking about it otherwise.")

[After chatting, it does seem that me and Lawrence have a disagreement about how often people take these jobs due to selfish internal accounts and how much they give altruistic internal accounts.]

what’s the real point of saying something is ethical?

What’s the point of arguing ethics? I guess it’s good to know where the boundaries lie. I think it helps to be able to notice if you’re in a situation where you actually should work for one of these companies. And it helps to notice if you’re nowhere near such a situation.

Final Note

I think you did a good job of laying out where we agreed and disagreed. I am also curious to discuss tripwires & counter-incentives at more length sometime. I found this debate very helpful for clarifying the principles and the practice!

Lawrence cut his final rebuttal as he felt it mostly repeated prior discussion from above.

21 comments

Comments sorted by top scores.

comment by Nathan Helm-Burger (nathan-helm-burger) · 2024-08-17T02:03:24.724Z · LW(p) · GW(p)

I like that you did this, and find it interesting to read as an AInotkilleveryoneism-motivated researcher. Unfortunately, I think you both miss talking about what I consider to be the most relevant points.

Particularly, I feel like you don't consider the counterfactual of "you and your like-minded friends choose not to assist the most safety oriented group, where you might have contributed meaningfully to that group making substantial progress on both capabilities and safety. This results in a less safety-minded group becoming the dominant AI leader, and thus inceases likelihood of catastrophe." Based on my reading histories of the making of the atomic bomb, a big part of the motivation of a lot of people involved was knowing that the Nazis were also working on an atomic bomb. It turns out that the Nazis were nowhere close, and were defeated without being in range of making one. That wasn't known until later though.

Had the Nazis managed to create an Atomic Bomb first, and were the only group to have them, I expect that would have gone very poorly indeed for the world. Given this, and with the knowledge that the scientists choosing whether to work on the Manhattan project had this limited info, I believe I would have made the same decision in their shoes. Even anticipating that the US government would seize control of the product and make its own decisions about when and where to use it.

It seems you are both of the pessimistic mindset that even a well-intentioned group discovering a dangerously potent AGI recipe in the lab will be highly likely to mishandle it and cause a catastrophe, whilst also being highly unlikely to handle it well and help move humanity out of its critical vulnerable period.

I think that that is an incorrect position, and here are some of my arguments why:

  1. I believe that it is possible to detect sandbagging (and possibly other forms of deception) via this technique that I am collaborating on. https://www.apartresearch.com/project/sandbag-detection-through-model-degradation

  2. I believe that it is feasible to pursue the path towards corrigibility as described by CAST theory (which I gave some support to the development of.) https://www.lesswrong.com/s/KfCjeconYRdFbMxsy [? · GW]

  3. I believe that it is possible to safely contain and study even substantially superhuman AI if this is done thoughtfully and carefully with good security practices. Specifically: a physically sandboxed training cluster, and training data which has been carefully censored to not contain information about humans, our tech, or the physics of our universe. That, plus the ability to do mechanistic interpretability, and to have control to shut off or slow down the AI at will, makes it very likely we could maintain control.

  4. I believe that humanity's safety requires strong AI to counter the increasing availability of offense-dominant technology, particularly self-replicating weapons like engineered viruses or nanotech. ( I have been working with https://securebio.org/ai/ so I have witnessed strong evidence that AI is posing increasing risk of lowering barriers to bioweapons risk.)

  5. I don't believe that substantial delay in developing AGI is possible, and believe that trying would increase rather than decrease humanity's danger. I believe we are already in a substantial hardware and data overhang, and that within the next 24 months the threshold of capability of LLM agents will suffice to begin recursive self-improvement. This means it is likely that a leader in AI at that time will come into possession of strongly super-human AI (if they choose to engage their LLM agents in RSI), and that this will happen too quickly for other groups to catch up.

Replies from: Benito, Benito, Benito
comment by Ben Pace (Benito) · 2024-08-17T22:24:01.180Z · LW(p) · GW(p)

I believe we are already in a substantial hardware and data overhang, and that within the next 24 months the threshold of capability of LLM agents will suffice to begin recursive self-improvement. This means it is likely that a leader in AI at that time will come into possession of strongly super-human AI (if they choose to engage their LLM agents in RSI)

Just FYI, I think I'd be willing to bet against this at 1:1 at around $500, whereby I don't expect that a cutting edge model will start to train other models of a greater capability level than itself, or make direct edits to its own weights that have large effects (e.g. 20%+ improvements on a broad swath of tasks) on its performance, and the best model in the world will not be one whose training was primarily led by another model. 

If you wish to take me up on this, I'd propose any of John Wentworth or Rob Bensinger or Lawrence Chan (i.e. whichever of them ends up available) as adjudicators if we disagree on whether this has happened.

Replies from: nathan-helm-burger, nathan-helm-burger
comment by Nathan Helm-Burger (nathan-helm-burger) · 2024-08-22T16:51:57.566Z · LW(p) · GW(p)

Cool! Ok, yeah. So, I'm happy with any of the arbiters you proposed. I mean, I'm willing to make the bet because I don't think it will come down to a close call, but will instead be clear.

I do think that there's some substantial chance that the process of RSI will begin in the next 24 months, but not become publicly known right away. So my ask related to this would be:

At the end of 24 months, we resolve the bet based on what is publicly known. If, within the 12 month period following the end of the 24 month period, it becomes publicly known that a process started during the 24 month period has now come to fruition and is clearly the leading model, we reverse the decision of the bet from 'leading not from RSI' to 'leading model from RSI'.

Since my hypothesis is stating that the RSI result would be something extraordinary, beyond what would otherwise be projected from the development trend we've seen from human researchers, I think that a situation which ends up as a close call, akin to what feels like a 'tie', should resolve with my hypothesis losing. 

Replies from: Benito
comment by Ben Pace (Benito) · 2024-08-22T17:52:19.390Z · LW(p) · GW(p)

Alright, you have yourself a bet! Let's return on August 23rd 2026 to see who's made $500. I've sent you a calendar invite to help us remmeber the date.

I'll ping the arbiters just to check they're down, may suggest an alt if one of them opts out.

(For the future: I think in future I will suggest arbiters to a betting-counterparty via private DM, so it's easier for the arbiters to opt out for whatever reason, or for one of us to reject them for whatever reason.)

Replies from: LawChan, Raelifin
comment by LawrenceC (LawChan) · 2024-08-22T17:54:37.258Z · LW(p) · GW(p)

I'm down.

comment by Nathan Helm-Burger (nathan-helm-burger) · 2024-08-18T02:23:50.076Z · LW(p) · GW(p)

I'd take that bet! If you are up for it, you can dm me about details.

comment by Ben Pace (Benito) · 2024-08-17T02:12:33.456Z · LW(p) · GW(p)

[Note to reader: I wrote this comment in response to an earlier and much shorter version of the parent comment.]

Thanks! I'm glad you found it interesting to read.

I considered that counterfactual, but didn't think it was a good argument. I think there's a world of difference between a team that has a mechanistic story for how they can prevent the doomsday device from killing everyone, and a team that is merely "looking for such a way" and "occasionally finding little nuggets of insight". The latter such team is still contributing its efforts and goodwill and endorsement to a project set to end the world in exchange for riches and glory. 

I think better consequences will obtain if humans follow the rule of never using safety as the reason for contributing to such an extinction/takeover-causing project unless the bar of "has a mechanistic plan for preventing the project from causing an extinction event" is met. 

And, again, I think it's important in that case to have personal eject-buttons, such as some friends who check in with you to ask things like "Do you still believe in this plan?" and if they think there's not a good reason, you've agreed that any two of them can fire you.

(As a small note of nuance, barring Fermi and one or two others, almost nobody involved had thought of a plausible mechanism by which the atomic bomb would be an extinction risk, so I think there's less reason to keep to that rule in that case.)

Replies from: nathan-helm-burger, nathan-helm-burger
comment by Nathan Helm-Burger (nathan-helm-burger) · 2024-08-17T02:41:09.995Z · LW(p) · GW(p)

I disagree that the point about the scientists not realizing nuclear weapons would likely become an existential risk changes what I see as the correct choice to make. I think that, knowing that I was a scientist in the USA and I had either the choice to help the US government build nuclear weapons and thus set the world up for a tense, potentially existential, détente between the US's enemies (Nazis and/or communists and/or others), or... not help. Still seems clearly correct to me to help, since I think a dangerous détente is a better option than only the Nazis or only Stalin having nuclear weapons.

 

In the current context, I do think there is important strategic game theory overlap with those times, since it seems likely that AI (whether AGI or not) will potentially disrupt the long-standing nuclear détente in the next few years. I expect that whichever government controls the strongest AI in five years from now, if not sooner, will also be nearly immune to long-range missile attacks, conventional military threats, and bioweapons, but able to deploy those things (or a wide range of other coercive technologies) at will against other nations.

Replies from: Benito
comment by Ben Pace (Benito) · 2024-08-17T04:51:39.865Z · LW(p) · GW(p)

Point of clarification: I didn't mean that there should be a rule against helping one's country race to develop nukes, the argument I'm making is that humans should have a rule against helping one's country race to develop nukes that one expects by-default to (say) ignite the atmosphere and kill everyone and for which there is no known countermeasure.

Replies from: nathan-helm-burger
comment by Nathan Helm-Burger (nathan-helm-burger) · 2024-08-17T05:51:20.431Z · LW(p) · GW(p)

Ah, yes. Well that certainly makes sense! Thanks for the clarification.

comment by Nathan Helm-Burger (nathan-helm-burger) · 2024-08-17T02:24:25.199Z · LW(p) · GW(p)

Sorry for editing and extending my comment not knowing that you were already in the process of writing a response to the original!

Replies from: Benito
comment by Ben Pace (Benito) · 2024-08-17T04:48:17.862Z · LW(p) · GW(p)

Thanks! A simple accident, I forgive you :-)

comment by Ben Pace (Benito) · 2024-08-17T05:25:07.986Z · LW(p) · GW(p)

I believe that it is possible to detect sandbagging (and possibly other forms of deception) via this technique that I am collaborating on. https://www.apartresearch.com/project/sandbag-detection-through-model-degradation

That's a clever trick! 

I'm curious if you've thought of this, but I think this doesn't generalize to models whose outputs are too hard to evaluate?

Consider a different model we might train in the future that is trained on all the internal data in a company — all the emails, slack, stock price, trades, spreadsheets, etc, and is trained to predict the stock movement, and then is used to generate the rest of the emails from the CEO and so forth that it predicts will increase the stock price further. It is then given the actual role of CEO and its instructions are followed. 

Suppose that you perform your trick of randomly fiddling with the weights, and find that the stock performs substantially better when you do so. You inquire with the model as to why, and it tells you that it is taking a strategy with short-term payoffs but too much long-term risk, and your tampering is causing it to make worse decisions. If you ask it to explain why its current strategy is better, it says it's for complicated reasons that it cannot well-explain to a mere IQ 150 human. Is it so clear now whether the model is intentionally pretending to be less competent than it is, or not?

[Meta: I will probably come back and address some of your other points tomorrow, but in 2 minutes my laptop is going to transform itself into a rather expensive paperweight.]

Replies from: nathan-helm-burger
comment by Nathan Helm-Burger (nathan-helm-burger) · 2024-08-17T05:50:07.960Z · LW(p) · GW(p)

Yes, the sandbagging detection is quite limited. I anticipate it's mostly useful as an "oh no, something weird is going on" alarm. I don't think it will be particularly helpful in figuring out what exactly. I think at that point you need to delay deployment and use other mechanistic interpretability techniques to explore the anomaly in depth.

Since this is intended as an "anomaly-detecting in-the-lab test, for use on well understood benchmarks", I think it will probably work even on quite intelligent models. My hope is that the noise injection impairment will smoothly compromise the model's capabilites down to arbitrarily lower performance. So, for a smarter model, you just increase the impairment.

In theory , this would work up until the model was so capable that it was able to do something like gradient hacking to engineer its expected impairment curve to avoid betraying itself. At that point of capabilities I can only hope the researchers have been wise enough to move to a censored training simulation regime, so that the model can't anticipate the need to be robust to noise injection.

comment by Eli Tyre (elityre) · 2024-08-14T19:31:59.667Z · LW(p) · GW(p)

These are not (necessarily) our actual positions, these are positions we were randomly assigned, and for which we searched for the strongest arguments we could find, over the course of ~1 hr 45 mins.

Why do this? Do you have some hope that this will do something different or better than people expressing their actual beliefs and the reasons for them?

Replies from: Raemon, Benito
comment by Raemon · 2024-08-14T19:44:50.193Z · LW(p) · GW(p)

I was recently sold on this being a valuable practice, not for the audience, but for the practitioners. Someone (with debate club experience) argued that a debate club where you have to argue for positions you don't hold forces you develop the muscles of stepping outside your current frame/headspace and actually avoid soldier mindset for yourself.

Replies from: Benito
comment by Ben Pace (Benito) · 2024-08-14T19:50:51.279Z · LW(p) · GW(p)

This too. In the format I've been using, both participants spend 15 mins going on a walk thinking about the subject before they know which side they're arguing. These walks have been one of the more intense periods in my life of desperately searching for a contrary perspective from the default one that I bring to a subject, because I might be about to have to quickly provide strong arguments for it to someone I respect, and I've found it quite productive for seeking ways that I might be wrong.

comment by Ben Pace (Benito) · 2024-08-14T19:48:37.097Z · LW(p) · GW(p)

Yes, for a bunch of reasons!

  1. Something about it feels like a nice split of responsibilities. "I'll look for the arguments and evidence for, you look for the arguments and evidence against, and I trust we'll cover the whole space well."
  2. Having a slight competition gives me a lot of added energy for searching through arguments about a subject.
  3. I think it makes it less about "me" and more about "the domain". I don't have to be tracking what it is that I am personally signaling, because I'm here running a search process as well as I can, not making claims about my personal beliefs.
comment by Josh H (joshua-haas) · 2024-08-17T22:48:27.686Z · LW(p) · GW(p)

It sounds like both sides of the debate are implicitly agreeing that capabilities research is bad. I don't think that claim necessarily follows from the debate premise that "we both believe they are risking either extinction or takeover by an ~eternal alien dictatorship". I suspect disagreement about capabilities research being bad is the actual crux of the issue for some or many people who believe in AI risk yet still work for capabilities companies, rather than the arguments made in this debate.

The reason it doesn't necessarily follow is that there's a spectrum of possibilities about how entangled safety and capabilities research are. On one end of the spectrum, they are completely orthogonal: we can fully solve AI Safety without ever making any progress on AI capabilities. On the other end of the spectrum, they are completely aligned: figuring out how to make an AI that successfully follows, say, humanity's coherent extrapolated volition, is the same problem as figuring out how to make an AI that successfully performs any other arbitrarily-complicated real world task.

It seems likely to me that the truth is in the middle of the spectrum, but it's not entirely obvious where on the spectrum it lies. I think it's highly unlikely pure theoretical work is sufficient to guarantee safe AI, because without advancements in capabilities, we are very unlikely to have an accurate enough model about what AGI actually is to be able to do any useful safety work on it. On the flip side, I doubt AI safety is as simple as making a sufficiently smart AI and just telling it "please be safe, you know what I mean by that right?". 

The degree of entanglement dictates the degree to which capabilities research is intrinsically bad. If we live in a world close to the orthogonal end of the spectrum, it's relatively easy to distinguish "good research" from "bad research", and if you do bad research, you are a bad person, shame on you, let's create social / legal incentives to stop it. 

On the flip side, if we live in a world closer to the convergent end of the spectrum, where making strides on AI safety problems is closely tied to advances in capabilities, then some capability research will be necessary, and demonizing capability research will lead to the perverse outcome that it's mostly done by people who don't care about safety, whereas we'd prefer a world where the leading AI labs are staffed with brilliant, highly-safety-conscious safety/capabilities researchers who push the frontier in both directions.

It's outside the scope of this comment for me to weigh in on which end of the spectrum we live on, but I think that's real crux of this debate, and why smart people in good conscience can fling themselves into capabilities research: their world model is different than AI safety researchers coming from a very orthogonal perspective (my understanding of MIRI's original philosophy is that they believe that there's a tremendous amount of safety research that can be done in the absence of capabilities advances).

comment by Raemon · 2024-08-17T22:45:30.436Z · LW(p) · GW(p)

Empirically people’s research output drops after going to labs

I'm curious about how true this is and what lead LawrenceC to believe it?