My AI Model Delta Compared To Christiano

post by johnswentworth · 2024-06-12T18:19:44.768Z · LW · GW · 39 comments

Contents

    Preamble: Delta vs Crux
  My AI Model Delta Compared To Christiano

Preamble: Delta vs Crux

This section is redundant if you already read My AI Model Delta Compared To Yudkowsky [LW · GW].

I don’t natively think in terms of cruxes [? · GW]. But there’s a similar concept which is more natural for me, which I’ll call a delta.

Imagine that you and I each model the world (or some part of it) as implementing some program. Very oversimplified example: if I learn that e.g. it’s cloudy today, that means the “weather” variable in my program at a particular time[1] takes on the value “cloudy”. Now, suppose your program and my program are exactly the same, except that somewhere in there I think a certain parameter has value 5 and you think it has value 0.3. Even though our programs differ in only that one little spot, we might still expect very different values of lots of variables during execution - in other words, we might have very different beliefs about lots of stuff in the world.
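
To make the toy picture concrete, here's a minimal runnable sketch; the model, variable names, and numbers are invented purely for illustration:

```python
# Two copies of the same toy "world model" that differ in exactly one parameter.
# Propagating that single difference changes many downstream "beliefs".

def world_model(cloud_seeding_rate):
    """Toy model: one upstream parameter drives several downstream variables."""
    rain_probability = min(1.0, 0.1 + 0.5 * cloud_seeding_rate)
    crop_yield = 1.0 + 2.0 * rain_probability
    food_price = 10.0 / crop_yield
    return {
        "rain_probability": rain_probability,
        "crop_yield": crop_yield,
        "food_price": food_price,
    }

my_beliefs = world_model(cloud_seeding_rate=5)      # my parameter value
your_beliefs = world_model(cloud_seeding_rate=0.3)  # your parameter value

# Same program, one differing parameter, yet most downstream values disagree.
for variable in my_beliefs:
    print(variable, my_beliefs[variable], your_beliefs[variable])
```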

If your model and my model differ in that way, and we’re trying to discuss our different beliefs, then the obvious useful thing-to-do is figure out where that one-parameter difference is.

That’s a delta: one or a few relatively “small”/local differences in belief, which when propagated through our models account for most of the differences in our beliefs.

For those familiar with Pearl-style causal models [LW · GW]: think of a delta as one or a few do() operations which suffice to make my model basically match somebody else’s model, or vice versa.
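
And a minimal sketch of the do()-flavored version, again with invented mechanisms and numbers:

```python
# A tiny structural causal model: each variable is computed from its parents.
# A "delta" between two models can be expressed as a do() intervention that
# overrides one node, after which the downstream predictions match.

def run_model(prior_on_verification_ease, intervention=None):
    """Evaluate a toy SCM in topological order; `intervention` forces node values (do())."""
    mechanisms = {
        "verification_ease": lambda v: prior_on_verification_ease,
        "delegation_viability": lambda v: 0.9 * v["verification_ease"],
        "market_efficiency": lambda v: 0.8 * v["delegation_viability"],
    }
    values = {}
    for name, mechanism in mechanisms.items():
        if intervention and name in intervention:
            values[name] = intervention[name]  # do(): ignore parents, force the value
        else:
            values[name] = mechanism(values)
    return values

john = run_model(prior_on_verification_ease=0.2)
paul = run_model(prior_on_verification_ease=0.9)

# One do() on John's model is enough to reproduce all of Paul's downstream beliefs.
john_after_do = run_model(prior_on_verification_ease=0.2,
                          intervention={"verification_ease": 0.9})
print(john_after_do == paul)  # True
```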

This post is about my current best guesses at the delta between my AI models and Paul Christiano's AI models. When I apply the delta outlined here to my models, and propagate the implications, my models mostly look like Paul’s as far as I can tell. That said, note that this is not an attempt to pass Paul's Intellectual Turing Test [? · GW]; I'll still be using my own usual frames.

My AI Model Delta Compared To Christiano

Best guess: Paul thinks that verifying solutions to problems is generally “easy” in some sense. He’s sometimes summarized this as “verification is easier than generation [LW(p) · GW(p)]”, but I think his underlying intuition is somewhat stronger than that.

What do my models look like if I propagate that delta? Well, it implies that delegation is fundamentally viable in some deep, general sense.

That propagates into a huge difference in worldviews. Like, I walk around my house and look at all the random goods I’ve paid for - the keyboard and monitor I’m using right now, a stack of books, a tupperware, waterbottle, flip-flops, carpet, desk and chair, refrigerator, sink, etc. Under my models, if I pick one of these objects at random and do a deep dive researching that object, it will usually turn out to be bad in ways which were either nonobvious or nonsalient to me, but unambiguously make my life worse and would unambiguously have been worth-to-me the cost to make better. But because the badness is nonobvious/nonsalient, it doesn’t influence my decision-to-buy, and therefore companies producing the good are incentivized not to spend the effort to make it better. It’s a failure of ease of verification: because I don’t know what to pay attention to, I can’t easily notice the ways in which the product is bad. (For a more game-theoretic angle, see When Hindsight Isn’t 20/20 [LW · GW].)

On (my model of) Paul’s worldview, that sort of thing is rare; at most it’s the exception to the rule. On my worldview, it’s the norm for most goods most of the time. See e.g. the whole air [LW · GW] conditioner [LW · GW] episode [LW · GW] for us debating the badness of single-hose portable air conditioners specifically, along with a large sidebar on the badness of portable air conditioner energy ratings.

How does the ease-of-verification delta propagate to AI?

Well, most obviously, Paul expects AI to go well mostly via humanity delegating alignment work to AI. On my models, the delegator’s incompetence is a major bottleneck to delegation going well in practice, and that will extend to delegation of alignment to AI: humans won’t get what we want by delegating because we don’t even understand what we want or know what to pay attention to [? · GW]. The outsourced alignment work ends up bad in nonobvious/nonsalient (but ultimately important) ways for the same reasons as most goods in my house. But if I apply the “verification is generally easy” delta to my models, then delegating alignment work to AI makes total sense.

Then we can go even more extreme: HCH [LW · GW], aka “the infinite bureaucracy”, a model Paul developed a few years ago. In HCH, the human user does a little work then delegates subquestions/subproblems to a few AIs, which in turn do a little work then delegate their subquestions/subproblems to a few AIs, and so on until the leaf-nodes of the tree receive tiny subquestions/subproblems which they can immediately solve. On my models, HCH adds recursion to the universal pernicious difficulties of delegation, and my main response is to run away screaming. But on Paul’s models, delegation is fundamentally viable, so why not delegate recursively?
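
For concreteness, the recursive shape I have in mind looks roughly like the sketch below. This is just a toy illustration of the delegation-tree structure, not Paul's actual proposal or a real implementation; the splitting, solving, and combining steps are stand-ins:

```python
# Toy sketch of the HCH / recursive-delegation shape: do a little work,
# split the remainder into subproblems, delegate each, then combine answers.

def hch_answer(question, depth=0, branching=3, max_depth=3):
    """Recursively delegate a question until subquestions are small enough to answer directly."""
    if depth >= max_depth or is_small_enough(question):
        return solve_directly(question)                   # leaf node: answer immediately
    subquestions = split_into_subquestions(question, branching)
    subanswers = [hch_answer(q, depth + 1, branching, max_depth) for q in subquestions]
    return combine(question, subanswers)                  # a little work at this node

# Stand-in helpers so the sketch runs; a real system would use humans/AI models here.
def is_small_enough(q): return len(q) < 8
def solve_directly(q): return f"answer({q})"
def split_into_subquestions(q, n): return [q[i::n] for i in range(n)]
def combine(q, answers): return " + ".join(answers)

print(hch_answer("how do we align AI?"))
```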

(Also note that HCH is a simplified model of a large bureaucracy, and I expect my views and Paul’s differ in much the same way when thinking about large organizations in general. I mostly agree with Zvi’s models of large organizations [? · GW], which can be lossily-but-accurately summarized as “don’t”. Paul, I would guess, expects that large organizations are mostly reasonably efficient and reasonably aligned with their stakeholders/customers, as opposed to universally deeply dysfunctional.)

Propagating further out: under my models, the difficulty of verification accounts for most of the generalized market inefficiency in our world. (I see this as one way of framing Inadequate Equilibria [? · GW].) So if I apply a “verification is generally easy” delta, then I expect the world to generally contain far less low-hanging fruit. That, in turn, has a huge effect on timelines. Under my current models, I expect that, shortly after AIs are able to autonomously develop, analyze and code numerical algorithms better than humans, there’s going to be some pretty big (like, multiple OOMs) progress in AI algorithmic efficiency (even ignoring a likely shift in ML/AI paradigm once AIs start doing the AI research). That’s the sort of thing which leads to a relatively discontinuous takeoff. Paul, on the other hand, expects a relatively smooth takeoff - which makes sense, in a world where there’s not a lot of low-hanging fruit in the software/algorithms because it’s easy for users to notice when the libraries they’re using are trash.

That accounts for most of the known-to-me places where my models differ from Paul’s. I put approximately-zero probability on the possibility that Paul is basically right on this delta; I think he’s completely out to lunch. (I do still put significantly-nonzero probability on successful outsourcing of most alignment work to AI, but it’s not the sort of thing I expect to usually work.)

39 comments

Comments sorted by top scores.

comment by Jozdien · 2024-06-12T19:22:48.852Z · LW(p) · GW(p)

I was surprised to read the delta propagating to so many different parts of your worldviews (organizations, goods, markets, etc), and that makes me think that it'd be relatively easier to ask questions today that have quite different answers under your worldviews. The air conditioner one seems like one, but it seems like we could have many more, and some that are even easier than that. Plausibly you know of some because you're quite confident in your position; if so, I'd be interested to hear about them[1].

At a meta level, I find it pretty funny that so many smart people seem to disagree on the question of whether questions usually have easily verifiable answers.

  1. ^

    I realize that part of your position is that this is just really hard to actually verify, but as in the example of objects in your room it feels like there should be examples where this is feasible with moderate amounts of effort. Of course, a lack of consensus on whether something is actually bad if you dive in further could also be evidence for hardness of verification, even if it'd be less clean.

Replies from: johnswentworth, Lorxus
comment by johnswentworth · 2024-06-12T21:50:44.578Z · LW(p) · GW(p)

Yeah, I think this is very testable, it's just very costly to test - partly because it requires doing deep dives on a lot of different stuff, and partly because it's the sort of model which makes weak claims about lots of things rather than very precise claims about a few things.

comment by Lorxus · 2024-06-16T00:40:12.478Z · LW(p) · GW(p)

At a meta level, I find it pretty funny that so many smart people seem to disagree on the question of whether questions usually have easily verifiable answers.

And at a twice-meta level, that's strong evidence for questions not generically having verifiable answers (though not for them generically not having those answers).

Replies from: Jozdien, Morpheus
comment by Jozdien · 2024-06-16T09:51:28.523Z · LW(p) · GW(p)

(That's what I meant, though I can see how I didn't make that very clear.)

comment by Morpheus · 2024-07-18T14:53:09.405Z · LW(p) · GW(p)

So on the thrice-meta level you need to correct weakly in the other direction again.

comment by Elliot Callender (javanotmocha) · 2024-06-12T18:38:36.425Z · LW(p) · GW(p)

I think it depends on which domain you're delegating in. E.g. physical objects, especially complex systems like an AC unit, are plausibly much harder to validate than a mathematical proof.

In that vein, I wonder if requiring the AI to construct a validation proof would be feasible for alignment delegation? In that case, I'd expect us to find more use and safety from [ETA: delegation of] theoretical work than empirical.

Replies from: Davidmanheim
comment by Davidmanheim · 2024-06-13T05:40:56.256Z · LW(p) · GW(p)

That seems a lot like Davidad's alignment research agenda.

comment by ozziegooen · 2024-06-15T14:54:30.628Z · LW(p) · GW(p)

First, I want to flag that I really appreciate how you're making these deltas clear and (fairly) simple.

I like this, though I feel like there's probably a great deal more clarity/precision to be had here (as is often the case). 
 

Under my models, if I pick one of these objects at random and do a deep dive researching that object, it will usually turn out to be bad in ways which were either nonobvious or nonsalient to me, but unambiguously make my life worse and would unambiguously have been worth-to-me the cost to make better.

I'm not sure what "bad" means exactly. Do you basically mean, "if I were to spend resources R evaluating this object, I could identify some ways for it to be significantly improved?" If so, I assume we'd all agree that this is true for some amount R, the key question is what that amount is.

I also would flag that you draw attention to the issue with air conditioners. But for the issue of personal items, I'd argue that when I learn more about popular items, most of what I learn are positive things I didn't realize. Like with Chesterton's fence - when I get many well-reviewed or popular items, my impression is generally that there were many clever ideas or truths behind those items that I don't at all have time to understand, let alone invent myself. A related example is cultural knowledge - a la The Secret of Our Success.

When I try solving software problems, my first few attempts don't go well, for reasons I didn't predict. The very fact that "it works in tests, and it didn't require doing anything crazy" is a significant update. 

Sure, with enough resources R, one could very likely make significant improvements to any item in question - but as a purchaser, I only have resources r << R to make my decisions. My goal is to buy items to make my life better, it's fine that there are potential other gains to be had by huge R values. 

> “verification is easier than generation [LW(p) · GW(p)]”

I feel like this isn't very well formalized. I think I agree with this comment [LW(p) · GW(p)] on that post. I feel like you're saying, "It's easier to generate a simple thing than verify all possible things", but Paul and co are saying more like, "It's easier to verify/evaluate a thing of complexity C than generate a thing of complexity C, in many important conditions", or, "There are ways of delegating many tasks where the evaluation work required would be less than that of doing the work yourself, in order to get a result of a certain level of quality." 

I think that Paul's take (as I understand it) seems like a fundamental aspect about the working human world. Humans generally get huge returns from not inventing the wheel all the time, and deferring to others a great deal. This is much of what makes civilization possible. It's not perfect, but it's much better than what individual humans could do by themselves. 

> Under my current models, I expect that, shortly after AIs are able to autonomously develop, analyze and code numerical algorithms better than humans, there’s going to be some pretty big (like, multiple OOMs) progress in AI algorithmic efficiency (even ignoring a likely shift in ML/AI paradigm once AIs start doing the AI research)

I appreciate the precise prediction, but don't see how it exactly follows. This seems more like a question of "how much better will early AIs be compared to current humans", than one deeply about verification/generation. Also, I'd flag that in many worlds, I'd expect that pre-AGI AIs could do a lot of this code improvement - or they already have - so it's not clear exactly how big a leap the "autonomously" is doing here. 

--- 

I feel like there are probably several wins to be had by better formalizing these concepts more. They seem fairly cruxy/high-delta in the debates on this topic.

I would naively approach some of this with some simple expected value/accuracy lens. There are many assistants (including AIs) that I'd expect would improve the expected accuracy on key decisions, like knowing which AI systems to trust. In theory, it's possible to show a bunch of situations where delegation would be EV-positive. 
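
A minimal sketch of that lens, with made-up numbers and a deliberately crude model of "delegate, verify, redo if flagged":

```python
# Toy expected-value model of delegation: accept delegated work only if it passes
# an imperfect verification check. Delegation is EV-positive when the verifier's
# accuracy is good enough relative to the base rate of bad work and the payoffs.

def delegation_ev(p_bad, p_catch_bad, p_flag_good, value_good, cost_bad, cost_redo):
    """EV of 'delegate, then verify, redo it yourself only if flagged'."""
    p_accept_bad = p_bad * (1 - p_catch_bad)          # bad work that slips past verification
    p_accept_good = (1 - p_bad) * (1 - p_flag_good)   # good work accepted
    p_redo = 1 - p_accept_bad - p_accept_good         # flagged, so we pay to redo it
    return (p_accept_good * value_good
            - p_accept_bad * cost_bad
            + p_redo * (value_good - cost_redo))

# With a strong verifier, delegation comes out positive...
print(delegation_ev(p_bad=0.3, p_catch_bad=0.95, p_flag_good=0.05,
                    value_good=10, cost_bad=50, cost_redo=8))
# ...with a weak verifier (John's picture), the slipped-through bad work dominates.
print(delegation_ev(p_bad=0.3, p_catch_bad=0.3, p_flag_good=0.05,
                    value_good=10, cost_bad=50, cost_redo=8))
```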

That said, a separate observer could of course claim that one using the process above would be so wrong as to be committing self-harm. Like, "I think that when you would try to use delegation, your estimates of impact are predictably wrong in ways that would lead to you losing." But this seems like mainly a question about "are humans going to be predictably overconfident in a certain domain, as seen by other specific humans". 

Replies from: kave, ozziegooen
comment by kave · 2024-06-17T22:55:04.067Z · LW(p) · GW(p)

I'm not sure what "bad" means exactly. Do you basically mean, "if I were to spend resources R evaluating this object, I could identify some ways for it to be significantly improved?" If so, I assume we'd all agree that this is true for some amount R, the key question is what that amount is.

I think an interesting version of this is "if I were to spend resources R evaluating this object, I could identify some ways for it to be significantly improved (even when factoring in additional cost) that the production team probably already knew about"

comment by ozziegooen · 2024-06-15T15:20:46.188Z · LW(p) · GW(p)

Thinking about this more, it seems like there are some key background assumptions that I'm missing. 

Some assumptions that I often hear presented on this topic are things like:
1. "A misaligned AI will explicitly try to give us hard-to-find vulnerabilities, so verifying arbitrary statements from these AIs will be incredibly hard."
2. "We need to generally have incredibly high assurances to build powerful systems that don't kill us". 

My obvious counter-arguments would be:
1. Sure, but a smart overseer would have a reasonable prior that its agents might be misaligned, and would give those agents tasks that are particularly easy to verify. Any action actually taken by a smart overseer, using information provided by an agent with a known probability M of being misaligned, should be EV-positive. With some creativity, there are likely many ways of structuring things (using systems unlikely to be misaligned, asking more verifiable questions) such that many resulting actions will be heavily EV-positive.

2. "Again, my argument in (1). Second, we can build these systems gradually, and with a lot of help from people/AIs that won't require such high assurances." (This is similar to the HCH / oversight arguments)

comment by Joel Burget (joel-burget) · 2024-06-12T21:33:46.439Z · LW(p) · GW(p)

I put approximately-zero probability on the possibility that Paul is basically right on this delta; I think he’s completely out to lunch.

Very strong claim which the post doesn't provide nearly enough evidence to support

Replies from: johnswentworth
comment by johnswentworth · 2024-06-12T21:48:18.710Z · LW(p) · GW(p)

I mean, yeah, convincing people of the truth of that claim was not the point of the post.

Replies from: joel-burget
comment by Joel Burget (joel-burget) · 2024-06-13T13:46:45.582Z · LW(p) · GW(p)

Sorry, was in a hurry when I wrote this. What I meant / should have said is: it seems really valuable to me to understand how you can refute Paul's views so confidently and I'd love to hear more.

comment by Max H (Maxc) · 2024-06-19T23:15:14.452Z · LW(p) · GW(p)

I'm curious what you think of Paul's points (2) and (3) here [LW(p) · GW(p)]:

  • Eliezer often talks about AI systems that are able to easily build nanotech and overpower humans decisively, and describes a vision of a rapidly unfolding doom from a single failure. This is what would happen if you were magically given an extraordinarily powerful AI and then failed to aligned it, but I think it’s very unlikely what will happen in the real world. By the time we have AI systems that can overpower humans decisively with nanotech, we have other AI systems that will either kill humans in more boring ways or else radically advanced the state of human R&D. More generally, the cinematic universe of Eliezer’s stories of doom doesn’t seem to me like it holds together, and I can’t tell if there is a more realistic picture of AI development under the surface.
  • One important factor seems to be that Eliezer often imagines scenarios in which AI systems avoid making major technical contributions, or revealing the extent of their capabilities, because they are lying in wait to cause trouble later. But if we are constantly training AI systems to do things that look impressive, then SGD will be aggressively selecting against any AI systems who don’t do impressive-looking stuff. So by the time we have AI systems who can develop molecular nanotech, we will definitely have had systems that did something slightly-less-impressive-looking.

And specifically to what degree you think future AI systems will make "major technical contributions" that are legible to their human overseers before they're powerful enough to take over completely.

You write:

I expect that, shortly after AIs are able to autonomously develop, analyze and code numerical algorithms better than humans, there’s going to be some pretty big (like, multiple OOMs) progress in AI algorithmic efficiency (even ignoring a likely shift in ML/AI paradigm once AIs start doing the AI research). That’s the sort of thing which leads to a relatively discontinuous takeoff.

But how likely do you think it is that these OOM jumps happen before vs. after a decisive loss of control? 

My own take: I think there will probably be enough selection pressure and sophistication in primarily human-driven R&D processes alone to get to uncontrollable AI. Weak AGIs might speed the process along in various ways, but by the time an AI itself can actually drive the research process autonomously (and possibly make discontinuous progress), the AI will already also be capable of escaping or deceiving its operators pretty easily, and deception / escape seems likely to happen first for instrumental reasons.

But my own view isn't based on the difficulty of verification vs. generation, and I'm not specifically skeptical of bureaucracies / delegation. Doing bad / fake R&D that your overseers can't reliably check does seem somewhat easier than doing real / good R&D, but not always, and as a strategy seems like it would usually be dominated by "just escape first and do your own thing".

comment by Cole Wyeth (Amyr) · 2024-06-13T16:22:29.693Z · LW(p) · GW(p)

I expect you still believe P != NP?

Replies from: johnswentworth
comment by johnswentworth · 2024-06-13T16:46:55.066Z · LW(p) · GW(p)

Yes, though I would guess my probability on P = NP is relatively high compared to most people reading this. I'm around 10-15% on P = NP.

Notably relevant [LW · GW]:

People who’ve spent a lot of time thinking about P vs NP often have the intuition that “verification is easier than generation”. [...]

The problem is, this intuition comes from thinking about problems which are in NP. NP is, roughly speaking, the class of algorithmic problems for which solutions are easy to verify. [...]

I think a more accurate takeaway would be that among problems in NP, verification is easier than generation. In other words, among problems for which verification is easy, verification is easier than generation. Rather a less impressive claim, when you put it like that.
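
As a minimal concrete illustration of that narrower claim, here's a toy 3-SAT instance (invented for illustration); the checking step is polynomial in the formula size, while the only generation strategy shown is exponential brute force:

```python
from itertools import product

# For problems in NP, verification is fast: checking a candidate assignment against
# a CNF formula is linear in the formula's size. Generation (finding a satisfying
# assignment) has no known polynomial-time algorithm; brute force is exponential.

formula = [(1, -2, 3), (-1, 2), (2, -3)]   # literal k means variable |k|; negative = negated
n_vars = 3

def verify(assignment):
    """Polynomial-time check: every clause must contain at least one true literal."""
    return all(any(assignment[abs(lit)] == (lit > 0) for lit in clause) for clause in formula)

def generate():
    """Exponential brute-force search over all 2^n assignments."""
    for bits in product([False, True], repeat=n_vars):
        assignment = {i + 1: bits[i] for i in range(n_vars)}
        if verify(assignment):
            return assignment
    return None

print(generate())   # slow in general; verification of the result stays cheap
```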

Replies from: Amyr
comment by Cole Wyeth (Amyr) · 2024-06-13T20:25:15.904Z · LW(p) · GW(p)

Do you expect A.G.I. to be solving problems outside of NP? If not, it seems the relevant follow-up question is really out of the problems that are in NP, how many are in P? 

Actually, my intuition is that deep learning systems cap out around P/poly, which probably strictly contains NP, meaning (P/poly) \ NP may be hard to verify, so I think I agree with you.

Replies from: johnswentworth, quetzal_rainbow
comment by johnswentworth · 2024-06-13T21:10:08.696Z · LW(p) · GW(p)

Most real-world problems are outside of NP. Let's go through some examples...

Suppose I am shopping for a new fridge, and I want to know which option is best for me (according to my own long-term values). Can I easily write down a boolean circuit (possibly with some inputs from data on fridges) which is satisfiable if-and-only-if this fridge in particular is in fact the best option for me according to my own long-term values? No, I have no idea how to write such a boolean circuit at all. Heck, even if my boolean circuit could internally use a quantum-level simulation of me, I'd still have no idea how to do it, because neither my stated values nor my revealed preferences are identical to my own long-term values. So that problem is decidedly not in NP.

(Variant of that problem: suppose an AI hands me a purported mathematical proof that this fridge in particular is the best option for me according to my own long-term values. Can I verify the proof's correctness? Again, no, I have no idea how to do that, I don't understand my own values well enough to distinguish a proof which makes correct assumptions about my values from one which makes incorrect assumptions.)

A quite different example from Hindsight Isn't 20/20 [LW · GW]: suppose our company has 100 workers, all working to produce a product. In order for the product to work, all 100 workers have to do their part correctly; if even just one of them messes up, then the whole product fails. And it's an expensive one-shot sort of project; we don't get to do end-to-end tests a billion times. I have been assigned to build the backup yellow connector widget, and I do my best. The product launches. It fails. Did I do my job correctly? No idea, even in hindsight; isolating which parts failed would itself be a large and expensive project. Forget writing down a boolean circuit in advance which is satisfiable if-and-only-if I did my job correctly; I can't even write down a boolean circuit in hindsight which is satisfiable if-and-only-if I did my job correctly. I simply don't have enough information to know.

Another kind of example: I read a paper which claims that FoxO mediates the inflammatory response during cranial vault remodelling surgery. Can I easily write down a boolean circuit (possibly with some inputs from the paper) which is satisfiable if-and-only-if the paper's result is basically correct? Sure, it could do some quick checks (look for p-hacking or incompetently made-up data, for example), but from the one paper I just don't have enough information to reliably tell whether the result is basically correct.

Another kind of example: suppose I'm building an app, and I outsource one part of it. The contractor sends me back a big chunk of C code. Can I verify that (a) the C code does what I want, and (b) the C code has no security holes? In principle, formal verification tools advertise both of those. In practice, expressing what I want in a formal verification language is as-much-or-more-work as writing the code would be (assuming that I understand what I want well enough to formally express it at all, which I often don't). And even then, I'd expect someone who's actually good at software security to be very suspicious of the assumptions made by the formal verifier.

Replies from: Amyr
comment by Cole Wyeth (Amyr) · 2024-06-14T13:02:27.254Z · LW(p) · GW(p)

I think the issues here are more conceptual than algorithmic.

Replies from: tailcalled
comment by tailcalled · 2024-06-16T12:04:36.524Z · LW(p) · GW(p)

The conceptual vagueness certainly doesn't help, but in general generation can be easier than validation because when generating you can stay within a subset of the domain that you understand well, whereas when verifying you may have to deal with all sorts of crazy inputs.

Replies from: Algon
comment by Algon · 2024-06-19T22:11:09.375Z · LW(p) · GW(p)

generation can be easier than validation because when generating you can stay within a subset of the domain that you understand well, whereas when verifying you may have to deal with all sorts of crazy inputs.

Attempted rephrasing: you control how you generate things, but not how others do, so verifying their generations can expose you to stuff you don't know how to handle.

Example: 
"Writing code yourself is often easier than validating someone else's code"
 

Replies from: o-o
comment by O O (o-o) · 2024-06-19T22:52:38.831Z · LW(p) · GW(p)

I think a more nuanced take is that there is a subset of generated outputs that are hard to verify. This subset is split into two camps: one where you are unsure of the output’s correctness (and thus can reject it or ask for an explanation). This isn’t too risky. The other camp is outputs where you are sure, but in reality have overlooked something. That’s the risky one.

However at least my priors tell me that the latter is rare with a good reviewer. In a code review, if something is too hard to parse, a good reviewer will ask for an explanation or simplification. But bugs still slip by so it’s imperfect.

The next question is whether the bugs that slip by in the output will be catastrophic. I don’t think it dooms the generation + verification pipeline if the system is designed to be error tolerant.

Replies from: Algon
comment by Algon · 2024-06-20T11:12:44.299Z · LW(p) · GW(p)

I'd like to try another analogy, which makes some potential problems for verifying output in alignment more legible. 

Imagine you're a customer and ask a programmer to make you an app. You don't really know what you want, so you give some vague design criteria. You ask the programmer how the app works, and they tell you, and after a lot of back and forth discussion, you verify this isn't what you want. Do you know how to ask for what you want, now? Maybe, maybe not. 

Perhaps the design space you're thinking of is small, perhaps you were confused in some simple way that the discussion resolved, perhaps the programmer worked with you earnestly to develop the design you're really looking for, and pointed out all sorts of unknown unknowns. Perhaps.

I think we could wind up in this position. The position of a non-expert verifying an experts' output, with some confused and vague ideas about what we want from the experts. We won't know the good questions to ask the expert, and will have to rely on the expert to help us. If ELK is easy, then that's not a big issue. If it isn't, then that seems like a big issue.

Replies from: rif-a-saurous
comment by rif a. saurous (rif-a-saurous) · 2024-06-23T15:38:43.546Z · LW(p) · GW(p)

I feel like a lot of the difficulty here is a punning of the word "problem." 

In complexity theory, when we talk about "problems", we generally refer to a formal mathematical question that can be posed as a computational task. Maybe in these kinds of discussions we should start calling these problems_C (for "complexity"). There are plenty of problems_C that are (almost definitely) not in NP,  like #SAT ("count the number of satisfying assignments of this Boolean formula"), and it's generally believed that verification is hard for these problems. A problem_C like #SAT that is (believed to be) in #P but not NP will often have a short easy-to-understand algorithm that will be very slow ("try every assignment and count up the ones that satisfy the formula").
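
As a concrete instance of that "short, easy to understand, very slow" algorithm (toy formula, purely illustrative):

```python
from itertools import product

# The obvious #SAT algorithm from the parenthetical: try every assignment and count
# the satisfying ones. Easy to state, but exponential in the number of variables;
# #SAT is #P-complete, and (unlike SAT) a claimed count is not believed to have a
# short, deterministically checkable certificate.

def count_sat(formula, n_vars):
    count = 0
    for bits in product([False, True], repeat=n_vars):      # 2^n assignments
        assignment = {i + 1: bits[i] for i in range(n_vars)}
        if all(any(assignment[abs(lit)] == (lit > 0) for lit in clause) for clause in formula):
            count += 1
    return count

print(count_sat([(1, -2), (2, 3)], n_vars=3))   # prints 4
```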

On the other hand, "suppose I am shopping for a new fridge, and I want to know which option is best for me (according to my own long-term values)" is a very different sort of beast. I agree it's not in NP in that I can't easily verify a solution, but the issue is that it's not a problem_C, rather than it being a problem_C that's (almost definitely) not in NP. With #SAT, I can easily describe how to solve the task using exponential amounts of compute; for "choose a refrigerator", I can't describe any computational process that will solve it at all. If I could (for instance, if I could write down an evaluation function f : fridge -> R (where f was computable in P)), then the problem would be not only in NP but in P (evaluate each fridge, pick the best one).

So it's not wrong to say that "choose a refrigerator" is not (known to be) in NP,  but it's important to foreground that that's because the task isn't written as a problem_C, rather than because it needs a lot of compute. So discussions about complexity classes and relative ease of generation and verification seem not especially relevant.

I don't think I'm saying anything non-obvious, but I also think I'm seeing a lot of discussions that don't seem to fully internalize this?

comment by quetzal_rainbow · 2024-06-13T20:48:57.496Z · LW(p) · GW(p)

The (scaled-up) PCP theorem states that the class of problems with probabilistically checkable, polynomial-time verifications contains NEXP, so, in some sense, there is a very large class of problems that can be "easily" verified.
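
For precision, the standard statements I'm gesturing at (stated from memory):

```latex
% PCP theorem (Arora--Safra; Arora--Lund--Motwani--Sudan--Szegedy):
\mathrm{NP} = \mathrm{PCP}\big[O(\log n),\, O(1)\big]

% Scaled-up version (Babai--Fortnow--Lund): with polynomially many random bits and
% queries, a polynomial-time verifier captures all of NEXP:
\mathrm{NEXP} = \mathrm{PCP}\big[\mathrm{poly}(n),\, \mathrm{poly}(n)\big]
```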

I think the whole "verification is easier than generation because of computational complexity theory" line of reasoning is misguided. The problem is not whether we have enough computing power to verify solution, it is that we have no idea how to verify solution.

comment by tmeanen · 2024-06-13T10:51:04.342Z · LW(p) · GW(p)

“keyboard and monitor I’m using right now, a stack of books, a tupperware, waterbottle, flip-flops, carpet, desk and chair, refrigerator, sink, etc. Under my models, if I pick one of these objects at random and do a deep dive researching that object, it will usually turn out to be bad in ways which were either nonobvious or nonsalient to me, but unambiguously make my life worse"

But, I think the negative impacts that these goods have on you are (mostly) realized on longer timescales - say, years to decades. If you’re using a chair that is bad for your posture, the impacts of this are usually seen years down the line when your back starts aching. Or if you keep microwaving tupperware, you may end up with some pretty nasty medical problems, but again, decades down the line. 

The property of an action having long horizons until it can be verified as good or bad for you makes delegating to smarter-than-you systems dangerous. My intuition is that there are lots of tasks that could significantly accelerate alignment research that don’t have this property, examples being codebase writing (unit tests can provide quick feedback), proof verification etc. In fact, I can’t think of many research tasks in technical fields that have month/year/decade horizons until they can be verified - though maybe I’ve just not given it enough thought.
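
As a toy illustration of the short-horizon kind of verification (the function and tests are invented, and obviously much smaller than any real delegated task):

```python
# Toy illustration of a short verification horizon: delegated code can be checked
# in seconds with tests, unlike a chair's effect on your back a decade later.

def delegated_moving_average(xs, window):
    """Pretend this implementation came back from a delegated AI/contractor."""
    return [sum(xs[i:i + window]) / window for i in range(len(xs) - window + 1)]

def test_delegated_moving_average():
    assert delegated_moving_average([1, 2, 3, 4], window=2) == [1.5, 2.5, 3.5]
    assert delegated_moving_average([5], window=1) == [5.0]

test_delegated_moving_average()
print("verified in milliseconds")
```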

Replies from: carl-feynman
comment by Carl Feynman (carl-feynman) · 2024-06-13T12:47:59.804Z · LW(p) · GW(p)

Many research tasks have very long delays until they can be verified.  The history of technology is littered with apparently good ideas that turned out to be losers after huge development efforts were poured into them.  Supersonic transport, zeppelins, silicon-on-sapphire integrated circuits, pigeon-guided bombs, object-oriented operating systems, hydrogenated vegetable oil, oxidative decoupling for weight loss…

Finding out that these were bad required making them, releasing them to the market, and watching unrecognized problems torpedo them.  Sometimes it took decades.

Replies from: tmeanen
comment by tmeanen · 2024-06-13T16:21:32.587Z · LW(p) · GW(p)

But if the core difficulty in solving alignment is developing some difficult mathematical formalism and figuring out relevant proofs then I think we won't suffer from the problems with the technologies above. In other words, I would feel comfortable delegating and overseeing a team of AIs that have been tasked with solving the Riemann hypothesis - and I think this is what a large part of solving alignment might look like.

Replies from: carl-feynman
comment by Carl Feynman (carl-feynman) · 2024-06-13T21:43:50.861Z · LW(p) · GW(p)

“May it go from your lips to God’s ears,” as the old Jewish saying goes.  Meaning, I hope you’re right.  Maybe aligning superintelligence will largely be a matter of human-checkable mathematical proof.

I have 45 years experience as a software and hardware engineer, which makes me cynical. When one of my designs encounters the real world, it hardly ever goes the way I expect.  It usually either needs some rapid finagling to make it work (acceptable) or it needs to be completely abandoned (bad).  This is no good for the first decisive try at superalignment; that has to work first time.  I hope our proof technology is up to it.
 

comment by Keenan Pepper (keenan-pepper) · 2024-06-13T03:48:24.975Z · LW(p) · GW(p)

In HCH, the human user does a little work then delegates subquestions/subproblems to a few AIs, which in turn do a little work then delegate their subquestions/subproblems to a few AIs, and so on until the leaf-nodes of the tree receive tiny subquestions/subproblems which they can immediately solve.

This does not agree with my understanding of what HCH is at all. HCH is a definition of an abstract process for thought experiments, much like AIXI is. It's defined as the fixed point of some iterative process of delegation expanding out into a tree. It's also not something you could actually implement, but it's a platonic form like "circle" or "integral".

This has nothing to do with the way an HCH-like process would be implemented. You could easily have something that's designed to mimic HCH but it's implemented as a single monolithic AI system.

comment by kromem · 2024-06-13T03:38:25.065Z · LW(p) · GW(p)

As you're doing these delta posts, do you feel like it's changing your own positions at all?

For example, reading this one what strikes me is that what's portrayed as the binary sides of the delta seem more like positions near the edges of a gradient distribution, and particularly one that's unlikely to be uniform across different types of problems.

To my eyes the most likely outcome is a situation where you are both right.

Where there are classes of problems where verification is easy and delegation is profitable, and classes of problems where verification will be hard and unsupervised delegation will be catastrophic (cough glue on pizza).

If we are only rolling things up into aggregate pictures of the average case across all problems, I can see the discussion filtering back into those two distinct deltas, but a bit like flip-flops and water bottles, the lack of nuance obscures big picture decision making.

So I'm curious if as you explore and represent the opposing views to your own, particularly as you seem to be making effort to represent without depicting them as straw person arguments, if your own views have been deepening and changing through the process?

Replies from: johnswentworth
comment by johnswentworth · 2024-06-13T03:56:53.448Z · LW(p) · GW(p)

As you're doing these delta posts, do you feel like it's changing your own positions at all?

Mostly not, because (at least for Yudkowsky and Christiano) these are deltas I've been aware of for at least a couple years. So the writing process is mostly just me explaining stuff I've long since updated on, not so much figuring out new stuff.

comment by Eli Tyre (elityre) · 2024-06-18T02:31:52.159Z · LW(p) · GW(p)

Under my models, if I pick one of these objects at random and do a deep dive researching that object, it will usually turn out to be bad in ways which were either nonobvious or nonsalient to me, but unambiguously make my life worse and would unambiguously have been worth-to-me the cost to make better.

Crucially, this is true only because you're relatively smart for a human: smarter than many of the engineers that designed those objects, and smarter than most or all of the committee-of-engineers that designed those objects. You can come up with better solutions than they did, if you have a similar level of context.

But that's not true of most humans. Most humans, if they did a deep dive into those objects, wouldn't notice the many places where there is substantial room for improvement. Just like most humans don't spontaneously recognize blatant-to-me incentive problems in government design (and virtually every human institution), and just as I often wouldn't be able to tell that a software solution was horrendously badly architected, at least without learning a bunch of software engineering in addition to doing a deep dive into this particular program.

Replies from: philh, johnswentworth
comment by philh · 2024-06-29T23:08:28.220Z · LW(p) · GW(p)

Note that to the extent this is true, it suggests verification is even harder than John thinks.

Replies from: ryan_greenblatt
comment by ryan_greenblatt · 2024-06-30T19:38:11.789Z · LW(p) · GW(p)

Hmm, not exactly. Our verification ability only needs to be sufficiently good relative to the AIs.

comment by johnswentworth · 2024-06-19T01:49:52.978Z · LW(p) · GW(p)

Ehh, yes and no. I maybe buy that a median human doing a deep dive into a random object wouldn't notice the many places where there is substantial room for improvement; hanging around with rationalists does make it easy to forget just how low the median-human bar is.

But I would guess that a median engineer is plenty smart enough to see the places where there is substantial room for improvement, at least within their specialty. Indeed, I would guess that the engineers designing these products often knew perfectly well that they were making tradeoffs which a fully-informed customer wouldn't make. The problem, I expect, is mostly organizational dysfunction (e.g. the committee of engineers is dumber than one engineer, and if there are any nontechnical managers involved then the collective intelligence nosedives real fast), and economic selection pressure.

For instance, I know plenty of software engineers who work at the big tech companies. The large majority of them (in my experience) know perfectly well that their software is a trash fire, and will tell you as much, and will happily expound in great detail the organizational structure and incentives which lead to the ongoing trash fire.

comment by Aprillion (Peter Hozák) (Aprillion) · 2024-06-16T15:48:11.181Z · LW(p) · GW(p)

It’s a failure of ease of verification: because I don’t know what to pay attention to, I can’t easily notice the ways in which the product is bad.

Is there an opposite of the "failure of ease of verification" that would add up to 100% if you categorized the whole of reality into 1 of these 2 categories? Say in a simulation, if you attributed every piece of computation to one of the following 2 categories, how much of the world can be "explained by" each category?

  • make sure stuff "works at all and is easy to verify whether it works at all"
  • stuff that works must be "potentially better in ways that are hard to verify"

Examples:

  • when you press the "K" key on your keyboard for 1000 times, it will launch nuclear missiles ~0 times and the K key will "be pressed" ~999 times
  • when your monitor shows you the pixels for a glyph of the letter "K" 1000 times, it will represent the planet Jupiter ~0 times and "there will be" the letter K ~999 times
  • in each page in your stack of books, the character U+0000 is visible ~0 times and the letter A, say ~123 times
  • tupperware was your own purchase and not gifted by a family member? I mean, for which exact feature would you pay how much more?!?
  • you can tell whether a water bottle contains potable water and not sulfuric acid
  • carpet, desk, and chair haven't spontaneously combusted (yet?)
  • the refrigerator doesn't produce any black holes
  • (flip-flops are evil and I don't want to jinx any sinks at this time)


 

comment by Morpheus · 2024-06-12T19:19:06.334Z · LW(p) · GW(p)

That propagates into a huge difference in worldviews. Like, I walk around my house and look at all the random goods I’ve paid for - the keyboard and monitor I’m using right now, a stack of books, a tupperware, waterbottle, flip-flops, carpet, desk and chair, refrigerator, sink, etc. Under my models, if I pick one of these objects at random and do a deep dive researching that object, it will usually turn out to be bad in ways which were either nonobvious or nonsalient to me, but unambiguously make my life worse and would unambiguously have been worth-to-me the cost to make better.

Based on my 1 deep dive on pens a few years ago this seems true. Maybe that is too high dimensional and too unfocused a post, but maybe there should be a post on "best X of every common product people use every day"? And then we somehow filter for people with actual expertise? Like for pens you want to go with the recommendations of "the pen addict".
