comment by Gyrodiot · 2021-11-18T16:13:51.640Z · LW(p) · GW(p)
I am confused by the problem statement. What you're asking for is a generic tool: something that doesn't need information about the world in order to be created, but that I can then feed information about the real world, and it will become very useful.
My problem is that the real world is rich: feeding the tool all the relevant information will be expensive, and the more complicated the math problem, the more safety issues you get.
I cannot rely on "don't worry if the Task AI is not aligned, we'll just feed it harmless problems": the risk comes from what the AI will do to get to the solution. If the problem is hard and you want to defer the search to a tool so powerful that you have to choose your inputs carefully or catastrophe happens, you don't want to build that tool.
Replies from: None, None
↑ comment by [deleted] · 2021-11-18T16:34:26.141Z · LW(p) · GW(p)
Replies from: Gyrodiot
↑ comment by Gyrodiot · 2021-11-18T20:51:59.780Z · LW(p) · GW(p)
So, assuming an unaligned agent here.
If your agent isn't aware that its compute cycles are limited (i.e. the compute constraint is part of the math problem), then you have three cases: (1a) the agent doesn't hit the limit with its standard search, and you're in luck; (1b) the problem is difficult enough that the agent runs its standard search but fails to find a solution in the allocated cycles, so it always fails, but safely; (1c) you tweak the agent to be more compute-efficient, which is very costly and might not work. In practice, if you're in case 1b and it apparently fails safely, you have an incentive to just increase the limit.
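Here is a minimal Python sketch of those first two cases, assuming a hypothetical `solve_step` callable and a cycle budget enforced entirely by the caller (none of these names come from the thread). The agent's search loop never reasons about the budget; it just stops when the caller stops granting cycles:

```python
# Toy illustration of the "unaware agent" cases (1a/1b): the cycle budget is
# enforced outside the agent, so exhausting it just means a safe failure.
# All names here (CycleBudget, solve_step, run_bounded_search) are hypothetical.

from typing import Callable, Optional


class CycleBudget:
    """Hard cap on compute cycles, enforced by the caller, not the agent."""

    def __init__(self, limit: int):
        self.limit = limit
        self.used = 0

    def spend(self) -> bool:
        """Grant one cycle if any remain; return False once the budget is exhausted."""
        if self.used >= self.limit:
            return False
        self.used += 1
        return True


def run_bounded_search(
    solve_step: Callable[[], Optional[str]],
    budget: CycleBudget,
) -> Optional[str]:
    """Run the agent's standard search until it finds a solution (case 1a)
    or the budget runs out (case 1b, a safe failure)."""
    while budget.spend():
        solution = solve_step()
        if solution is not None:
            return solution  # case 1a: solved within the budget
    return None  # case 1b: ran out of cycles, fails safely
```

The failure mode in case 1b is just a `None`: nothing about the agent's search changes, which is also why the temptation to simply raise `limit` instead of doing the costly tweaking of case 1c is so strong.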
If your agent is indeed aware of the constraint, then it has an incentive to remove it or to increase the limit by other means. Three cases here again: (2a) identical to 1a, you're in luck; (2b) the limit is low enough that strategic action to remove the constraint is impossible, and the agent fails "safely"; (2c) the agent finds a way to remove the constraint, and you're in very unsafe territory.
Two observations from there: first, ideally you'd want your agent to operate safely even if given unbounded cycles; that's the Omni Test. Second, there is indeed an alignment concept for agents that just try to solve the problem without long-term planning; that's Myopia [? · GW] (and defining it formally is... hard).
Replies from: None