Ariel Kwiatkowski's Shortform
post by Ariel Kwiatkowski (ariel-kwiatkowski) · 2020-05-30T19:58:50.319Z

4 comments
comment by Ariel Kwiatkowski (ariel-kwiatkowski) · 2020-05-30T19:58:50.680Z
Has anyone tried to work with neural networks predicting the weights of other neural networks? I'm thinking about that in the context of something like subsystem alignment, e.g. in an RL setting where an agent first learns about the environment, and then creates a subagent (by outputting the weights, or some embedding, of its policy) which actually obtains the reward.
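This sounds closely related to hypernetworks, where one network outputs the parameters of another. A minimal PyTorch-style sketch of what "an agent outputting the weights of a subagent's policy" could look like; all names, sizes, and the structure are assumptions made up for illustration, not a concrete proposal:

```python
# Illustrative sketch (assumed setup): a "meta" network sees a summary of the
# environment and outputs a flat vector of weights for a small subagent policy,
# which is then applied functionally. Gradients from the subagent's objective
# can flow back into the hypernetwork.
import torch
import torch.nn as nn
import torch.nn.functional as F

OBS_DIM, ACT_DIM, HIDDEN = 8, 4, 16  # illustrative sizes

class HyperNet(nn.Module):
    """Maps an environment summary to the parameters of a subagent policy."""
    def __init__(self, summary_dim: int):
        super().__init__()
        # Total parameter count of the subagent: two linear layers with biases.
        self.n_params = (OBS_DIM * HIDDEN + HIDDEN) + (HIDDEN * ACT_DIM + ACT_DIM)
        self.net = nn.Sequential(
            nn.Linear(summary_dim, 64), nn.ReLU(), nn.Linear(64, self.n_params)
        )

    def forward(self, summary: torch.Tensor) -> torch.Tensor:
        return self.net(summary)  # flat vector of subagent weights

def subagent_policy(obs: torch.Tensor, flat_params: torch.Tensor) -> torch.Tensor:
    """Runs the subagent using the weights produced by the hypernetwork."""
    sizes = [OBS_DIM * HIDDEN, HIDDEN, HIDDEN * ACT_DIM, ACT_DIM]
    w1, b1, w2, b2 = torch.split(flat_params, sizes)
    h = F.relu(F.linear(obs, w1.view(HIDDEN, OBS_DIM), b1))
    return F.linear(h, w2.view(ACT_DIM, HIDDEN), b2)  # action logits

# Usage: the subagent acts with weights it never trained itself; only HyperNet learns.
hyper = HyperNet(summary_dim=32)
env_summary = torch.randn(32)
params = hyper(env_summary)
logits = subagent_policy(torch.randn(OBS_DIM), params)
```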
comment by Ariel Kwiatkowski (ariel-kwiatkowski) · 2020-06-04T21:25:00.022Z
Looking for research idea feedback:
Learning to manipulate: consider a system with a large population of agents working on a certain goal, whose policies may be learned or rule-based but are, at this point, fixed. This could be an environment of ants using pheromones to collect food and bring it home.
Now add another agent (or some number of them) which learns in this environment, and tries to get other agents to instead fulfil a different goal. It could be ants redirecting others to a different "home", hijacking their work.
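To make this a bit more concrete, here's a rough sketch of what such an environment could look like; everything in it (the grid, the pheromone rule, the manipulator's reward) is an assumption for illustration, not a worked-out design:

```python
# Sketch of the proposed setup (illustrative only): a fixed population follows a
# simple pheromone-seeking rule, while one learning agent is rewarded for
# redirecting the population toward a fake "home".
import numpy as np

class PheromoneWorld:
    def __init__(self, size=32, n_ants=50):
        self.size = size
        self.pheromone = np.zeros((size, size))
        self.ants = np.random.randint(0, size, size=(n_ants, 2))
        self.true_home = np.array([0, 0])
        self.fake_home = np.array([size - 1, size - 1])  # manipulator's target

    def step_fixed_ants(self):
        # Fixed rule: each ant moves one step toward the strongest pheromone,
        # falling back to the true home when there is no pheromone anywhere.
        if self.pheromone.any():
            target = np.array(np.unravel_index(self.pheromone.argmax(),
                                               self.pheromone.shape))
        else:
            target = self.true_home
        self.ants = np.clip(self.ants + np.sign(target - self.ants), 0, self.size - 1)

    def step_manipulator(self, action):
        # Learning agent deposits pheromone at a chosen cell to hijack traffic.
        x, y = action
        self.pheromone[x, y] += 1.0

    def manipulator_reward(self):
        # Reward: fraction of ants closer to the fake "home" than to the real one.
        d_fake = np.abs(self.ants - self.fake_home).sum(axis=1)
        d_true = np.abs(self.ants - self.true_home).sum(axis=1)
        return float((d_fake < d_true).mean())

# Illustrative loop: the random manipulator here would be replaced by an RL
# learner maximizing manipulator_reward().
env = PheromoneWorld()
for _ in range(100):
    env.step_manipulator(np.random.randint(0, env.size, size=2))
    env.step_fixed_ants()
print(env.manipulator_reward())
```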
Does this sound interesting? If it works, would it potentially be publishable as a research paper (or at least as a post on LW)? Any other feedback is welcome!
comment by romeostevensit · 2020-06-05T19:42:34.377Z
This sounds interesting to me.
comment by Ariel Kwiatkowski (ariel-kwiatkowski) · 2024-08-16T00:31:18.908Z
Modern misaligned AI systems are good, actually. There's some recent news about Sakana AI developing a system where the agents tried to extend their own runtime by editing their code/config.
This is amazing for safety! Current systems are laughably incapable of posing x-risks. Now, thanks to capabilities research, we have a clear example of behaviour that would be dangerous in a more "serious" system. We can proceed with empirical research, creating and evaluating methods to deal with this specific risk, so that future systems do not have this failure mode.
The future of AI and AI safety has never been brighter.