Mathematically-Defined Optimization Captures A Lot of Useful Information
post by J Bostock (Jemist) · 2023-10-29T17:17:03.211Z
Alright, it's been a long hiatus, and I may not post again for another year or two if no inspiration strikes me. I will summarize my work so far.
- We can define a function over causal networks which describes how strongly a given portion of the network is optimizing some node around some particular world history[1].
- This function depends on some partial derivatives of the network state, making it a local function over world histories, i.e. it does not depend on world histories which are "far away" in world-space.[2]
- Across one dimension, we can use path integrals to find the "power" of the optimizer, i.e. how much variation it removes from the system (sketched schematically below).
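Schematically, for a one-dimensional path $\gamma$ through world-space, this looks something like

$$P_S(\gamma) = \int_\gamma \mathrm{Opt}_S(X \to Y;\, w)\, \mathrm{d}w$$

i.e. the total variation removed from the system, accumulated along $\gamma$ (the exact integrand depends on details I'm eliding here).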
Math summary
Optimization is written as $\mathrm{Opt}_S(X \to Y;\, w)$ for "past" node $X$, "future" node $Y$, and a section $S$ of the causal network, evaluated at world history $w$. It measures the (negative) log of the ratio of two derivatives of $Y_w$ (i.e. the value of $Y$ in world $w$) with respect to $X$: the first taken in the "normal" world, where $S$ varies, and the second in an imaginary world where $S$ is "frozen" at its value in $w$, unable to respond to infinitesimal changes in $X$.
We can overall write the following:

$$\mathrm{Opt}_S(X \to Y;\, w) = -\log\left|\frac{\left.\partial Y/\partial X\right|_{S\text{ free}}}{\left.\partial Y/\partial X\right|_{S\text{ frozen}}}\right|$$

If $S$ is optimizing $Y$ with respect to $X$, we would expect that some (or all) of the changes in $Y$ which are caused by changes in $X$ will be removed, so the derivative when $S$ is allowed to vary will be smaller in magnitude than the derivative when $S$ is fixed. This means $\mathrm{Opt}_S$ will be positive.
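To make this concrete, here's a minimal numerical sketch (a toy model of my own, with made-up names: `room_temp`, `thermostat_policy`, `opt`): a thermostat section $S$ reads the outside temperature $X$ and sets a heater, and the room temperature $Y$ is the future node. The derivatives are estimated by finite differences.

```python
import numpy as np

# Toy causal network: outside temperature X -> thermostat section S -> room temp Y.

def room_temp(x, heat):
    """Future node Y: room temperature = outside temperature + heater contribution."""
    return x + heat

def thermostat_policy(x, setpoint=20.0, gain=0.9):
    """Section S: heater output chosen in response to the outside temperature X."""
    return gain * (setpoint - x)

def opt(x0, policy, f=room_temp, eps=1e-6):
    """Finite-difference estimate of Opt_S(X -> Y; w) at world history x0.

    Compares dY/dX when S responds to X ("free") against dY/dX when S's
    output is frozen at its value in x0, per the definition above.
    """
    dy_free = (f(x0 + eps, policy(x0 + eps)) - f(x0 - eps, policy(x0 - eps))) / (2 * eps)
    frozen_heat = policy(x0)  # S cannot respond to the perturbation of X
    dy_frozen = (f(x0 + eps, frozen_heat) - f(x0 - eps, frozen_heat)) / (2 * eps)
    return -np.log(abs(dy_free / dy_frozen))

print(opt(10.0, thermostat_policy))  # ~2.30 = -log(0.1): positive, so S is optimizing
```

The positive value is the argument above playing out: the thermostat cancels most of $X$'s influence on $Y$, so the "free" derivative (0.1 here) is much smaller in magnitude than the "frozen" one (1.0).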
I have a few observations about what this might mean and why it might be important.
Optimization relates to Information
This is a simple calculation: if $S$ doesn't depend on $X$ in any way, then the two derivative terms will be equal, because $S$ won't vary in either of them. The ability of $S$ to optimize $Y$ with respect to $X$ is related to its ability to gather information about $X$.
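The toy sketch above shows this directly: a heater that ignores the outside temperature carries no information about $X$, so the "free" and "frozen" worlds coincide.

```python
# Reusing opt() from the sketch above: a constant heater output means S
# carries no information about X, so the two derivatives are identical.
blind_policy = lambda x: 9.0  # arbitrary fixed heater output
print(opt(10.0, blind_policy))  # -0.0, i.e. zero: no optimization of Y via X
```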
Optimization relates to a Utility-Like Function
For simple systems like the thermostat, $\mathrm{Opt}$ seems to take high values when the thermostat "gets what it wants". It looks a bit like $\mathrm{Opt}$ across one axis is the second derivative of our utility function, at least within regions of world-space where $S$ has roughly equal power and knowledge.
Optimization relates to Power
This seems pretty intuitively obvious: the more "powerful" the thermostat is, in the sense of having a stronger heating and cooling unit, the more it is able to optimize the world.
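In the toy sketch above, power shows up as the feedback gain (a stand-in for heater strength): a higher gain cancels more of $X$'s influence on $Y$, so the measure grows.

```python
# Sweeping the heater gain in the thermostat sketch above: stronger
# actuation removes more of the variation in Y, raising the measure.
for gain in (0.5, 0.9, 0.99):
    policy = lambda x, g=gain: g * (20.0 - x)
    print(f"gain={gain}: Opt = {opt(10.0, policy):.2f}")
# prints Opt = 0.69, 2.30, 4.61, i.e. -log(1 - gain) in this linear model
```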
Why is this important?
Firstly, we already have mathematical proofs that the "knowledge" and "values" of an agent-like thing cannot be disentangled exactly. So if we want a mathematically well-defined measure of agent behaviour, we must take them both at once.
Secondly, the sorts of world histories that $\mathrm{Opt}$ works on are deliberately chosen to be very general, requiring no notion of absolute time or space, in the style of much of John Wentworth's work. A specific case of these networks is the activations of a neural network, so these tools could in theory be applied directly to AI interpretability work.
Worlds of High Optimization Are "Good" Worlds for The Optimizer
Worlds where $\mathrm{Opt}$ is large tend to be "good" for the optimizing region in question. They seem to correspond to local minima (or at least local Pareto frontiers) of a utility function. They also correspond to worlds where $S$ is both knowledgeable and powerful, and to worlds where $S$ is "in control". Here are a few potential thoughts on making safer AI designs using this concept:
- Having a mathematically well-defined measure of optimization means it can be hard-coded rather than relying on machine learning.
- Lots of thought has gone into trying to make AI "low impact", and using $\mathrm{Opt}$ in this way might let us specify this better.
- If worlds of high $\mathrm{Opt}$ tend to be good for the optimizer in question, then this might provide a route towards a way to encode things that are good for humans.
- $\mathrm{Opt}$ can be defined for regions in the AI's past, which makes it harder to reward-hack, or to modify the preferences of the humans in question in order to hack $\mathrm{Opt}$.
- $\mathrm{Opt}$ is defined locally, but extending it to distributions over worlds is probably trivial (one natural reading is sketched below).
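For that last point, one obvious candidate (my guess at the trivial extension) is just the expectation of the local measure over a distribution $P$ on world histories:

$$\mathrm{Opt}_S(X \to Y;\, P) = \mathbb{E}_{w \sim P}\left[\mathrm{Opt}_S(X \to Y;\, w)\right]$$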
- [1] World history here meaning a given set of numeric values which describe the state of a causal network. For example, if we have the network [Temperature in Celsius] → [State of Water], then the following are examples of world histories: [-10] → [0], [45] → [1], and [120] → [2], where we've represented the state of water as a number {0: Solid, 1: Liquid, 2: Gas}.
- [2] So if we consider our previous world-history examples, the value of our optimization metric at a temperature of 310 Kelvin doesn't depend on the behaviour of the system at 315 Kelvin.