Posts

The Geometry of Feelings and Nonsense in Large Language Models 2024-09-27T17:49:27.420Z
[Paper] Hidden in Plain Text: Emergence and Mitigation of Steganographic Collusion in LLMs 2024-09-25T14:52:48.263Z
Robustness of Contrast-Consistent Search to Adversarial Prompting 2023-11-01T12:46:14.516Z
Machine Unlearning Evaluations as Interpretability Benchmarks 2023-10-23T16:33:04.878Z
Splitting Debate up into Two Subsystems 2020-07-03T20:11:11.723Z
Acknowledging Human Preference Types to Support Value Learning 2018-11-13T18:57:52.965Z

Comments

Comment by Nandi on Splitting Debate up into Two Subsystems · 2020-07-06T22:21:06.833Z · LW · GW

I agree that if you score an oracle based on how accurate it is, then it is incentivized to steer the world towards states where easy questions get asked.

I think that in these considerations it matters how powerful we assume the agent to be. You made me realize that my post would have been more interesting had I specified the scope and application area of the proposed approach more clearly. In many cases, making the world more predictable may be very difficult for the agent compared to causing the human to predict the world better. In the short term, I think deploying an agentic oracle could be safe.

Comment by Nandi on Splitting Debate up into Two Subsystems · 2020-07-04T19:20:43.262Z · LW · GW

I think Bostrom might have mentioned this problem (educating someone on a topic) somewhere.

Cool! I'm not familiar with it.

Comment by Nandi on Splitting Debate up into Two Subsystems · 2020-07-04T19:19:37.294Z · LW · GW

In the case where the epistemic helper can explain enough for us to come up with solutions ourselves, the info helper is as useful by itself.

However, sometimes even if we get educated about a domain or problem, we may not be creative enough to propose good solutions ourselves. In such cases we would need an agent to propose options to us. It would be good if an agent that gets trained to come up with solutions that we approve of is not the same agent that explains to us why we should or should not approve of a solution (because if it were, it would have an incentive to convince us).