Posts

Scalable And Transferable Black-Box Jailbreaks For Language Models Via Persona Modulation 2023-11-07T17:59:36.857Z
Yes, avoiding extinction from AI *is* an urgent priority: a response to Seth Lazar, Jeremy Howard, and Arvind Narayanan. 2023-06-01T13:38:16.444Z
[Linkpost] The AGI Show podcast 2023-05-23T09:52:29.685Z

Comments

Comment by Soroush Pour (soroush-pour) on Why Is No One Trying To Align Profit Incentives With Alignment Research? · 2023-08-30T23:53:46.356Z · LW · GW

I think much of this is right, which is why, as an experienced startup founder who's deeply concerned about AI safety & alignment, I'm starting a new AI safety startup, a public benefit corporation called Harmony Intelligence. I recently gave a talk on this at the VAISU conference: slides and recording.

If what I'm doing is interesting to you and you'd like to be involved or collaborate, please reach out via the contact details on the last slide linked above.

Comment by Soroush Pour (soroush-pour) on (My understanding of) What Everyone in Technical Alignment is Doing and Why · 2023-08-14T02:27:58.558Z · LW · GW

For anybody else wondering what "ERO" stands for in the DeepMind section -- it stands for "Externalized Reasoning Oversight" and more details can be found in this paper.

Source: @Rohin Shah's comment.

Comment by Soroush Pour (soroush-pour) on Statement on AI Extinction - Signed by AGI Labs, Top Academics, and Many Other Notable Figures · 2023-06-01T13:23:37.666Z · LW · GW

There have been some strong criticisms of this statement, notably by Jeremy Howard et al here. I've written a detailed response to the criticisms here:

https://www.soroushjp.com/2023/06/01/yes-avoiding-extinction-from-ai-is-an-urgent-priority-a-response-to-seth-lazar-jeremy-howard-and-arvind-narayanan/

Please feel free to share with others who may find it valuable (e.g. skeptics of AGI x-risk).

Comment by Soroush Pour (soroush-pour) on [Linkpost] "Governance of superintelligence" by OpenAI · 2023-05-24T00:31:08.857Z · LW · GW

I don't think this is a fair reading of the article's overall message. This line from the article specifically calls out slowing down AI progress:

> we could collectively agree (with the backing power of a new organization like the one suggested below) that the rate of growth in AI capability at the frontier is limited to a certain rate per year.

Having spent a long time reading through OpenAI's statements, I suspect that they are trying to strike a difficult balance between:

  • A) Doing the right thing by way of AGI safety (including considering options like slowing down or not releasing certain information and technology).
  • B) Staying at or close to the lead of the race to AGI, given they believe that is the position from which they can have the most positive impact in terms of changing the development path and broader conversation around AGI.

Instrumental goal (B) is in tension (but not necessarily stark conflict, depending on how things play out) with ultimate goal (A).

What they're presenting in this article are ways to potentially create a situation where they could slow down and be confident that doing so wouldn't actually lead to worse eventual outcomes for AGI safety. They are also trying to promote and escalate the societal conversation around AGI x-risk.

While I think it's totally valid to criticise OAI on aspects of their approach to AGI safety, I think it's also fair to say that they are genuinely trying to do the right thing and are simply struggling to chart what is ultimately a very difficult path.

Comment by Soroush Pour (soroush-pour) on Deep Deceptiveness · 2023-03-29T04:23:37.588Z · LW · GW

I have no comment on whether this is an accurate take on MIRI's worldview, since I'm not an expert there. I wanted to ask a separate question related to the view described here:

> "With gradient descent, maybe you can learn enough to train your AI for things like "corrigibility" or "not being deceptive", but really what you're training for is "Don't optimise for the goal in ways that violate these particular conditions"."

On this point, it seems that we create a somewhat arbitrary divide between corrigibility & deception on one side and all other goals of the AI on the other.

The AI is trained to minimise some loss function in which non-corrigibility and deception are penalised, so wouldn't it be more accurate to say that the AI actually has a set of goals which includes corrigibility and non-deception?

And if that's the case, I don't think it's as fair to say that the AI is trying to circumvent corrigibility and non-deception, so much as it is trying to solve a tough optimisation problem that includes corrigibility, non-deception, and all other goals.
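
To make that framing concrete, here's a toy sketch (the function, penalty names, and weights are all hypothetical, purely illustrative, not how any real system is trained):

```python
# Toy sketch (hypothetical names and weights, purely illustrative): the
# training objective treats corrigibility and non-deception as goals
# alongside the task, not as external constraints to be routed around.
def total_loss(task_loss, noncorrigibility_penalty, deception_penalty,
               w_corrigible=1.0, w_honest=1.0):
    """Combined objective: task performance plus penalties for
    non-corrigible and deceptive behaviour."""
    return (task_loss
            + w_corrigible * noncorrigibility_penalty
            + w_honest * deception_penalty)

# From the optimiser's point of view, reducing the penalty terms is just as
# much part of "the goal" as reducing task_loss.
print(total_loss(task_loss=0.8, noncorrigibility_penalty=0.1,
                 deception_penalty=0.05))  # 0.95
```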

If the above is correct, then I think this is a reason to be more optimistic about the alignment problem: our agent is not trying to actively circumvent our goals, but instead trying to strike a difficult balance between achieving all of them, including important safety properties like corrigibility and non-deception.

Now, it is possible that instrumental convergence puts certain training signals (e.g. corrigibility) at odds with certain instrumental goals of agents (e.g. self-preservation). I do believe this is a real problem and poses an alignment risk. But it's not obvious to me that we'll see agents universally ignore their safety-related training signals in pursuit of instrumental goals.