Abhinav Pola's Shortform

post by Abhinav Pola (abhinav-pola) · 2025-02-28T03:51:29.160Z · LW · GW · 1 comments

Contents

1 comment

1 comments

Comments sorted by top scores.

comment by Abhinav Pola (abhinav-pola) · 2025-02-28T03:51:29.156Z · LW(p) · GW(p)

Computer-use agents in 3rd party environments are an inherent security risk. 
What is the risk exactly? That we can always phish an agent. Safety relies on making sure the inputs and outputs of the model are safe. We can ensure both and still elicit harm by phishing the model which I claim is probably an easier task than computer use. This is only possible because:
1. We have control over the agent loop and can poison the inputs before the agent takes the next action.
2. We have control over the environment to edit HTML, the browser, the OS, etc. as we see fit while remaining in-distribution for computer use capabilities.

Here, I "phish" Sonnet 3.5 to create a Pinterest account: [demo]. This is mainly my response to Anthropic's hierarchical summarization which I think is necessary but not sufficient. I think OpenAI-Operator-style 1st party environments and paywalls are the way to go for now.