What's wrong with these analogies for understanding Informed Oversight and IDA?
post by Wei Dai (Wei_Dai) · 2019-03-20T09:11:33.613Z · LW · GW
This is a question post.
In Can HCH epistemically dominate Ramanujan? [LW · GW] Alex Zhu wrote:
If HCH is ascription universal, then it should be able to epistemically dominate an AI theorem-prover that reasons similarly to how Ramanujan reasoned. But I don’t currently have any intuitions as to why explicit verbal breakdowns of reasoning should be able to replicate the intuitions that generated Ramanujan’s results (or any style of reasoning employed by any mathematician since Ramanujan, for that matter).
And I answered [LW(p) · GW(p)]:
My guess is that HCH has to reverse engineer the theorem prover, figure out how/why it works, and then reproduce the same kind of reasoning.
And then I followed up my own comment with:
It occurs to me that if the overseer understands everything that the ML model (that it’s training) is doing, and the training is via some kind of local optimization algorithm like gradient descent, the overseer is essentially manually programming the ML model by gradually nudging it from some initial (e.g., random) point in configuration space.
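To make the "programming by nudging" picture concrete, here is a minimal sketch of gradient descent moving a parameter vector from a random starting point toward a target by many small local steps. The quadratic loss, learning rate, and step count are made-up toy choices, not anything specific to IDA:

```python
import numpy as np

# Toy quadratic loss: distance of the parameters from some target behavior.
# In the analogy, each gradient step is one small "nudge" in configuration space.
target = np.array([3.0, -2.0])

def loss(params):
    return np.sum((params - target) ** 2)

def grad(params):
    return 2 * (params - target)

params = np.random.randn(2)   # random initial point in configuration space
lr = 0.1                      # size of each nudge

for step in range(100):
    params = params - lr * grad(params)   # one small, local nudge

print(loss(params), params)   # ends up near `target` after many small nudges
```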
No one answered my comments with either a confirmation or a denial of whether these guesses about how to understand Universality / Informed Oversight and IDA are correct. I'm surfacing this question as a top-level post because, if "Informed Oversight = reverse engineering" and "IDA = programming by nudging" are good analogies, that seems to have pretty significant implications.
In particular, it seems to imply that there's not much hope for IDA being competitive with ML in general: if IDA is analogous to a highly constrained method of "manual" programming, it seems unlikely to be competitive with less constrained methods of "manual" programming (i.e., AIs designing and programming more advanced AIs in more general ways, similar to how humans do most programming today), which in turn is presumably not competitive with general (unconstrained-by-safety) ML (otherwise ML would not be the competitive benchmark).
If these are not good ways to understand IO and IDA, can someone please point out why?
Answers
answer by paulfchristiano
A universal reasoner is allowed to use an intuition "because it works." It only takes on extra obligations once that intuition reflects further facts about the world which can't be cashed out as predictions confirmable on the same historical data that led us to trust the intuition.
For example, you have an extra obligation if Ramanujan has some intuition about why theorem X is true and you come to trust such intuitions by verifying them against a proof of X, but the same intuitions also suggest a bunch of other facts which you can't verify.
In that case, you can still try to be a straightforward Bayesian about it, and say "our intuition supports the general claim that process P outputs true statements;" you can then apply that regularity to trust P on some new claim even if it's not the kind of claim you could verify, as long as "P outputs true statements" had a higher prior than "P outputs true statements just in the cases I can check." That's an argument that someone can give to support a conclusion, and "does process P output true statements historically?" is a subquestion you can ask during amplification.
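As a toy numerical rendering of that argument (all priors and counts below are made up for illustration): evidence from checkable cases can rule out hypotheses that predict mistakes, but it can never separate "P outputs true statements" from "P outputs true statements just in the cases I can check," since both fit the checked data equally well; their posterior ratio stays at whatever the prior ratio was.

```python
# H_all:     "P outputs true statements"
# H_checked: "P outputs true statements just in the cases I can check"
# H_noisy:   "P is only right about 90% of the time"
prior = {"H_all": 0.10, "H_checked": 0.01, "H_noisy": 0.89}   # made-up priors

# Suppose P turns out to be correct on n cases that we could check.
n = 50
likelihood = {"H_all": 1.0, "H_checked": 1.0, "H_noisy": 0.9 ** n}

unnormalized = {h: prior[h] * likelihood[h] for h in prior}
Z = sum(unnormalized.values())
posterior = {h: unnormalized[h] / Z for h in unnormalized}
print(posterior)
# The checks mostly rule out H_noisy, but H_all and H_checked explain the
# checked data equally well, so their posterior ratio stays at the prior
# ratio (10:1 here) no matter how many cases are checked.
```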
The problem becomes hard when there are further facts that can't be supported by this Bayesian reasoning (and therefore might undermine it). E.g. you have a problem if process P is itself a consequentialist, which outputs true statements in order to earn your trust but will eventually exploit that trust for its own advantage. In this case, the problem is that something is going on inside process P that isn't surfaced by P's output. Epistemically dominating P requires knowing about that.
See the second and third examples in the post introducing ascription universality. There is definitely a lot of fuzziness here and it seems like one of the most important places to tighten up the definition / one of the big research questions for whether ascription universality is possible.
↑ comment by Wei Dai (Wei_Dai) · 2019-03-20T19:36:22.263Z · LW(p) · GW(p)
In that case, you can still try to be a straightforward Bayesian about it, and say “our intuition supports the general claim that process P outputs true statements;” you can then apply that regularity to trust P on some new claim even if it’s not the kind of claim you could verify, as long as “P outputs true statements” had a higher prior than “P outputs true statements just in the cases I can check.”
If that's what you do, it seems “P outputs true statements just in the cases I can check” could have a posterior that's almost 50%, which doesn't seem safe, especially in an iterated scheme where you have to depend on such probabilities many times? Do you not need to reduce the posterior probability to a negligible level instead?
See the second and third examples in the post introducing ascription universality.
Can you quote these examples? The word "example" appears 27 times in that post and looking at the literal second and third examples, they don't seem very relevant to what you've been saying here so I wonder if you're referring to some other examples.
There is definitely a lot of fuzziness here and it seems like one of the most important places to tighten up the definition / one of the big research questions for whether ascription universality is possible.
What I'm inferring from this (as far as a direct answer to my question) is that an overseer trying to do Informed Oversight on some ML model doesn't need to reverse engineer the model enough to fully understand what it's doing, only enough to make sure it's not doing something malign, which might be a lot easier, but this isn't quite reflected in the formal definition yet or isn't a clear implication of it yet. Does that seem right?
↑ comment by paulfchristiano · 2019-03-23T16:11:00.604Z · LW(p) · GW(p)
Can you quote these examples? The word "example" appears 27 times in that post and looking at the literal second and third examples, they don't seem very relevant to what you've been saying here so I wonder if you're referring to some other examples.
Subsections "Modeling" and "Alien reasoning" of "Which C are hard to epistemically dominate?"
What I'm inferring from this (as far as a direct answer to my question) is that an overseer trying to do Informed Oversight on some ML model doesn't need to reverse engineer the model enough to fully understand what it's doing, only enough to make sure it's not doing something malign, which might be a lot easier, but this isn't quite reflected in the formal definition yet or isn't a clear implication of it yet. Does that seem right?
You need to understand what facts the model "knows." This isn't value-loaded or sensitive to the notion of "malign," but it's still narrower than "fully understand what it's doing."
As a simple example, consider linear regression. I think that linear regression probably doesn't know anything you don't. Yet doing linear regression is a lot easier than designing a linear model by hand.
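A minimal sketch of that contrast on synthetic data (the coefficients and noise level are arbitrary): fitting is a single least-squares solve rather than hand design, yet everything the fitted model "knows" is a handful of coefficients an overseer can read off directly.

```python
import numpy as np

# Synthetic data generated by a known linear rule plus a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 1.5 * X[:, 0] - 0.7 * X[:, 1] + 0.1 * rng.normal(size=200)

# "Doing linear regression": one least-squares solve, no hand design needed.
X1 = np.column_stack([X, np.ones(len(X))])            # add an intercept column
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)

# Everything the fitted model "knows" is in these three numbers, so an
# overseer can inspect it completely even though it wasn't built by hand.
print(coef)   # approximately [1.5, -0.7, 0.0]
```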
If that's what you do, it seems “P outputs true statements just in the cases I can check” could have a posterior that's almost 50%, which doesn't seem safe, especially in an iterated scheme where you have to depend on such probabilities many times?
Where did 50% come from?
Also "P outputs true statements in just the cases I check" is probably not catastrophic, it's only catastrophic once P performs optimization in order to push the system off the rails.