Comments

Comment by FrancescaG on Extending control evaluations to non-scheming threats · 2025-01-15T12:27:02.879Z · LW · GW

Thanks for putting this out there. I’m on the Catalyze program at LISA in London and have been thinking a lot about AI control (for framing, my background is risk management, so I’m coming at it from that angle).

'The high-level change I propose is to define a control evaluation as a method of bounding risk from worst-case deployment behavior generally – rather than scheming models in particular.'

I’m on board with this. If our goal is to bound the risk from worst-case deployment behaviour, I think the broader the better, and enumerating different ‘untrusted model categories’ is going to better inform our understanding of harmful behaviours. If we have post-deployment monitoring and identify new patterns of harmful behaviour, we may even identify new untrusted model categories in this way.

For post-deployment monitoring (let’s say the control eval highlighted some behaviour we wanted ongoing control over), I think the reason (i.e. the untrusted model category causing the behaviour) is less important than the observed behaviour itself. Multiple untrusted model categories could even apply at once, e.g. scheming and jailbroken, so you could have a single monitor to catch all of these. What are your thoughts on this?

‘control is a broad approach to mitigating and evaluating safety given “pessimistic assumptions” about model deployment behavior.’

Rather than ‘pessimistic assumptions’, I would describe these as ‘unknowns’, the probability of which would range somewhere between 0 and 1 based on our confidence in the representativeness of the evaluations.

I think that to make control evaluations practically useful, we need to add more nuance to how we define trusted and untrusted. Coming from risk management (in Fintech and FS), I’m approaching it from the view that a model can only ever be P(Trusted)=1 (i.e. fully trusted) when bounded by certain deployment constraints. So a model is trusted to do something specific rather than generally trusted, e.g. it may be trusted not to write backdoored code, but may not be trusted to keep sensitive data safe if it’s in a setting where users can jailbreak it.

This could also help justify claims around a ‘rarely misbehaving monitor being safe because its failures will likely be uncorrelated with attacks executed by the policy.’ Let’s say you assign a probability of 0.001 to such a failure, which is fine for a use case like coding a website but not OK for coding a banking platform. You could layer more monitors to reduce this probability for that use case, effectively getting the risk within appetite for this deployment setting, e.g. down to 0.000001.
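To make the arithmetic concrete, here is a minimal Python sketch of the layering idea. It rests on a strong assumption that is mine rather than the post's: that monitor failures are statistically independent, so the chance that every monitor misses the same attack is the product of the individual failure probabilities. The function names and the risk-appetite thresholds are hypothetical and purely illustrative.

```python
import math


def combined_failure_probability(per_monitor_failure: float, n_monitors: int) -> float:
    """Probability that all n monitors fail together.

    ASSUMPTION: failures are independent and every monitor has the same
    failure probability. Correlated failures would make the real figure higher.
    """
    if not 0.0 <= per_monitor_failure <= 1.0:
        raise ValueError("per_monitor_failure must be in [0, 1]")
    if n_monitors < 1:
        raise ValueError("n_monitors must be at least 1")
    return per_monitor_failure ** n_monitors


def monitors_needed(per_monitor_failure: float, risk_appetite: float) -> int:
    """Smallest number of layered monitors that brings the combined failure
    probability within the stated risk appetite (same independence assumption)."""
    if not 0.0 < per_monitor_failure < 1.0:
        raise ValueError("per_monitor_failure must be strictly between 0 and 1")
    if not 0.0 < risk_appetite < 1.0:
        raise ValueError("risk_appetite must be strictly between 0 and 1")
    n = 1
    while True:
        combined = combined_failure_probability(per_monitor_failure, n)
        # isclose guards against floating-point noise exactly at the threshold.
        if combined <= risk_appetite or math.isclose(combined, risk_appetite):
            return n
        n += 1


if __name__ == "__main__":
    # Hypothetical risk appetites for the two deployment settings discussed above.
    for setting, appetite in [("website", 1e-3), ("banking platform", 1e-6)]:
        n = monitors_needed(0.001, appetite)
        print(f"{setting}: {n} monitor(s) needed for appetite {appetite}")
```

Under these assumptions, two monitors at 0.001 each give the 0.000001 figure. In practice the independence assumption is the part that would need to be justified for each deployment setting, since monitors that share training data or blind spots will fail together more often than this calculation suggests.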

Comment by FrancescaG on Are we dropping the ball on Recommendation AIs? · 2024-11-01T11:46:03.104Z · LW · GW

I think it comes down to incentives. Based on my recent reading of 'The Chaos Machine' by Max Fisher, I think it's closely linked to continually increasing engagement driving profit. Addictiveness unfortunately leads to more engagement, which in turn leads to profit. Emotive content (clickbait style, extreme things) also increases engagement.

Tools and moderation processes might be expensive on their own, but I think it's when they start to challenge the underlying business model of 'More engagement = more profit' that the companies are in a more uncomfortable position.

Comment by FrancescaG on Are we dropping the ball on Recommendation AIs? · 2024-11-01T11:38:56.950Z · LW · GW

I agree that more research on the effect of recommendation algorithms on the brain would be useful.

Research looking at which cognitive biases and preferences the algorithms are exploiting, and who is most susceptible to these (e.g. children, neurodiverse people, etc.), would also be useful. It seems plausible to me that some AI applications, e.g. character.ai as you say, will be optimising on some sort of human interaction, and exploiting human biases and cognitive patterns will be a big part of this.

Comment by FrancescaG on Toward Safety Cases For AI Scheming · 2024-11-01T11:09:42.228Z · LW · GW

Great addition to thinking on safety cases.

I'm curious about the decision to include the time constraint 'Deployment does not pose unacceptable risk within its 1 year lifetime', as I've been considering how safety case arguments remain relevant post deployment.

Is this intended to convey that the system will be withdrawn/updated within a year, or is it meant to reduce the scope of what is being considered, in terms of the feasibility of inability and control arguments, by looking at a short risk evaluation timeline?

On the topic of the scope of the Harm Inability arguments in the paper:

'A Harm inability argument could be absolute or limited to a given deployment setting'

Given the vast number of deployment settings, the developers of an AI model could provide absolute arguments with accompanying warnings/limitations based on the evaluation outputs (along the lines of usage guidelines for licensed medicines). Organisations deploying the models in novel settings (and also higher-risk ones such as medical, finance, etc.) could supplement these with arguments specific to their deployment setting. Any thoughts on this?