comment by Vivek Hebbar (Vivek) · 2022-11-15T08:27:53.207Z · LW(p) · GW(p)
> $\Pr_{\pi \in \xi}[U(\lceil G \rceil, \pi) \geq U(\lceil G \rceil, G^*)]$ is the probability that, for a random policy $\pi \in \xi$, that policy does at least as well as the policy $G^*$ its program dictates; in essence, it measures how good $G$'s policies are compared to random policy selection.
What prior over policies?
> given $g(G|U)$, we can infer the probability that an agent $G$ has a given utility function $U$, as $\Pr[U] \propto \frac{2^{-K(U)}}{\Pr_{\pi \in \xi}[U(\lceil G \rceil, \pi) \geq U(\lceil G \rceil, G^*)]}$, where $\propto$ means "is proportional to" and $K(U)$ is the Kolmogorov complexity of utility function $U$.
Suppose the prior over policies is max-entropy (uniform over all action sequences). If the number of "actions" is greater than the number of bits it takes to specify my brain[1], it seems like it would conclude that my utility function is something like "1 if {acts exactly like [insert exact copy of my brain] would}, else 0".
- ^ Idk if this is plausible
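The quoted score can be made concrete on a toy finite policy space. Everything below is an illustrative sketch, not PreDCA itself: policies are length-3 bit-strings, the prior is uniform, and the "complexity" numbers are made-up stand-ins for the uncomputable $K(U)$.

```python
# Toy model of the (corrected) PreDCA score
#   Pr[U] ∝ 2^(-K(U)) / Pr_{π∈ξ}[U(π) >= U(G*)]
# Policies, utilities, and the stand-in complexities are all assumptions
# for illustration; K(U) is uncomputable in general.

ACTIONS = [0, 1]
# Enumerate all 2^3 = 8 policies as length-3 action sequences.
policies = [(a, b, c) for a in ACTIONS for b in ACTIONS for c in ACTIONS]

def score(utility, g_star, complexity, prior):
    """Unnormalized Pr[U] under a given prior over policies."""
    # Denominator: probability that a prior-random policy does at least
    # as well as the agent's actual policy g_star under this utility.
    denom = sum(p for pi, p in zip(policies, prior)
                if utility(pi) >= utility(g_star))
    return 2.0 ** (-complexity) / denom

uniform = [1.0 / len(policies)] * len(policies)
g_star = (1, 1, 0)  # the agent's observed policy

u_simple = lambda pi: sum(pi)                        # "count the 1-actions"
u_overfit = lambda pi: 1.0 if pi == g_star else 0.0  # "acts exactly like G*"

print(score(u_simple, g_star, 2, uniform))   # 2^-2 / (4/8) = 0.5
print(score(u_overfit, g_star, 5, uniform))  # 2^-5 / (1/8) = 0.25
```

With these made-up numbers the simple utility scores higher; the point of the comment is that over long enough action sequences the overfitted utility can dominate under a uniform prior.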
↑ comment by Martín Soto (martinsq) · 2022-11-15T23:27:49.732Z · LW(p) · GW(p)
> What prior over policies?
Some kind of simplicity prior, as mentioned here [LW · GW].
> Suppose the prior over policies is max-entropy (uniform over all action sequences). If the number of "actions" is greater than the number of bits it takes to specify my brain[1] [LW(p) · GW(p)], it seems like it would conclude that my utility function is something like "1 if {acts exactly like [insert exact copy of my brain] would}, else 0".
Yes. In fact I'm not even sure we need your assumption about bits. Say policies are sequences of actions, and suppose at each time step we have $A$ actions available. Then, in our process of approximating your perfect/overfitted utility "1 if {acts exactly like [insert exact copy of my brain] would}, else 0", adding one more specified action to our $U$ can be understood as adding one more symbol to its generating program, and so incrementing $K(U)$ by 1. But also, adding one more (perfect) specified action multiplies the denominator probability by $\frac{1}{A}$ (since the prior is uniform). So as long as $A > 2$, $\Pr[U]$ will be unbounded when approximating your utility.
And of course, this is solved by the simplicity prior, because this makes it easier for simple $U$s to achieve low denominator probability. So a way simpler $U$ (less overfitted to $G^*$) will achieve almost the same low denominator probability as your function, because the only policies that maximize $U$ better than $G^*$ are too complex.
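The factor-of-$A/2$ growth can be checked numerically. This is a sketch under the comment's assumptions ($K$ counted in program symbols, uniform prior over action sequences); the function below is hypothetical, not from the post.

```python
# Each additional pinned-down action multiplies 2^(-K) by 1/2 (one more
# program symbol) and the denominator by 1/A (uniform prior), so the
# unnormalized score grows by A/2 per action -- unbounded when A > 2.

def overfit_score(n_specified, A):
    """Unnormalized Pr[U] for a utility that pins down the first
    n_specified actions of the agent's policy exactly."""
    k = n_specified          # stand-in: K(U) counted in program symbols
    # A uniform-prior-random policy ties the agent's policy on this
    # utility iff its first n_specified actions all match: prob A^(-n).
    denom = A ** (-n_specified)
    return 2.0 ** (-k) / denom

A = 4
ratios = [overfit_score(n + 1, A) / overfit_score(n, A) for n in range(9)]
print(ratios)  # every ratio is A/2 = 2.0
```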
comment by harfe · 2022-09-01T12:26:57.750Z · LW(p) · GW(p)
In $\Pr[U] \approx \frac{2^{-K(U)}}{\Pr_{\pi \in \xi}[U(\lceil G \rceil, \pi) \leq U(\lceil G \rceil, G^*)]}$: Shouldn't the inequality sign be the other way around? I am assuming that we want to maximize $U$ and not minimize $U$.
As currently written, a good agent $G$ with utility function $U$ would be better than most random policies, and therefore $\Pr_{\pi \in \xi}[U(\lceil G \rceil, \pi) \leq U(\lceil G \rceil, G^*)]$ would be close to $1$, and therefore $\Pr[U]$ would be rather small.
If the sign should indeed be the other way around, then a similar problem might be present in the definition of $g(G|U)$, if you want $g(G|U)$ to be high for more agenty programs $G$.
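The effect of the sign can be seen in a small simulation. This is my sketch with a toy utility, not anything from the post: for a competent agent, a random policy rarely does at least as well as $G^*$, so the $\geq$ version of the denominator is small (boosting $\Pr[U]$), while the $\leq$ version is close to $1$ (giving the true utility no credit).

```python
import random

random.seed(0)
T = 8
g_star = [1] * T              # a policy that maximizes the toy utility
utility = lambda pi: sum(pi)  # toy utility: count the 1-actions

# Sample policies from the uniform prior over length-T action sequences.
samples = [[random.randint(0, 1) for _ in range(T)] for _ in range(10_000)]

p_ge = sum(utility(pi) >= utility(g_star) for pi in samples) / len(samples)
p_le = sum(utility(pi) <= utility(g_star) for pi in samples) / len(samples)
print(p_ge, p_le)  # p_ge is tiny (true value 2^-8), p_le is exactly 1.0
```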
↑ comment by Tamsin Leake (carado-1) · 2022-09-02T11:15:15.766Z · LW(p) · GW(p)
after thinking about it and asking vanessa, it seems that you're correct; thanks for noticing. the mistake comes from the fact that i express things in terms of utility functions and vanessa expresses things in terms of loss functions, and they are reversed. the post should be fixed now.
note that in the $g(G|U)$ definition, i believe it is also $\geq$, because $-\log$ flips the function.
comment by harfe · 2022-09-25T17:58:16.910Z · LW(p) · GW(p)
(Copied partially from here [LW(p) · GW(p)])
My intuition is that PreDCA falls short on the "extrapolated" part in "Coherent extrapolated volition". PreDCA would extract a utility function from the flawed algorithm implemented by a human brain. This utility function would be coherent, but might not be extrapolated: the extrapolated utility function (i.e., what humans would value if they were much smarter) is probably more complicated to formulate than the un-extrapolated utility function.
For example, the policy implemented by an average human brain probably contributes more to total human happiness than most other policies. Let's say $U_1$ is a utility function that values human happiness as measured by certain chemical states in the brain, and $U_2$ is "extrapolated happiness" (where "putting all human brains in vats to make them feel happy" would not be good for $U_2$). Then it is plausible that $K(U_1) < K(U_2)$. But the policy implemented by an average human brain would do approximately equally well on both utility functions. Thus, $\Pr[U_1] > \Pr[U_2]$.
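The worry can be put in numbers. The complexities and the shared denominator below are entirely made up; the point is only that when two utilities get roughly the same denominator, the ratio of their scores is $2^{K(U_2)-K(U_1)}$, so the simpler, un-extrapolated one dominates.

```python
# Hypothetical numbers: U1 = "chemical happiness" (simpler),
# U2 = "extrapolated happiness" (more complex), scored with the same
# denominator because the human policy does about equally well on both.

def pr_u(k, denom):
    """Unnormalized Pr[U] = 2^(-K) / denominator."""
    return 2.0 ** (-k) / denom

k1, k2 = 100, 130     # assumed K(U1) < K(U2)
denom = 0.01          # shared denominator probability (assumed)
print(pr_u(k1, denom) / pr_u(k2, denom))  # 2^30 ≈ 1.07e9: U1 dominates
```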
comment by Frank_R · 2022-08-22T13:03:34.105Z · LW(p) · GW(p)
I am starting to learn theoretical stuff about AI alignment and have a question. Some of the quantities in your post contain the Kolmogorov complexity of U. Since it is not possible to compute the Kolmogorov complexity of a given function, or to write down a list of all functions whose complexity is below a certain bound, I wonder how it would be possible to implement the PreDCA protocol on a physical computer.
↑ comment by Tamsin Leake (carado-1) · 2022-08-23T09:07:36.184Z · LW(p) · GW(p)
Like all the other uncomputable or intractable logic in the post, the AI is to make increasingly informed guesses about them using something like logical induction [? · GW], where one can estimate the likelihood of a logical statement without having to determine its truth value for sure.
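Logical induction itself is far beyond a code snippet, but the general shape of "increasingly informed guesses" can be illustrated with plain Monte Carlo estimation of one of the intractable quantities, the denominator probability. The sampling approach below is my stand-in for illustration, not the protocol's actual method.

```python
import random

random.seed(1)
T = 12
utility = lambda pi: sum(pi)
g_star = [1] * T  # the exact quantity Pr[U(π) >= U(G*)] here is 2^-12

def estimate_denominator(n_samples):
    """Estimate the probability that a uniform-random policy does at
    least as well as g_star; the estimate sharpens as n_samples grows."""
    hits = 0
    for _ in range(n_samples):
        pi = [random.randint(0, 1) for _ in range(T)]
        if utility(pi) >= utility(g_star):
            hits += 1
    return hits / n_samples

for n in (10**3, 10**5):
    print(n, estimate_denominator(n))  # converges toward 2^-12 ≈ 0.00024
```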