Posts

Decomposing the QK circuit with Bilinear Sparse Dictionary Learning 2024-07-02T13:17:16.352Z
An OV-Coherent Toy Model of Attention Head Superposition 2023-08-29T19:44:11.242Z
Literature review of TAI timelines 2023-01-27T20:07:38.186Z
You're Not One "You" - How Decision Theories Are Talking Past Each Other 2023-01-09T01:21:11.708Z

Comments

Comment by keith_wynroe on A Simple Toy Coherence Theorem · 2024-11-10T15:42:22.024Z · LW · GW

I feel like this could branch out into a lot of small disagreements here but in the interest of keeping it streamlined:

> One of the consequences of this, however, is that this prefix-based encoding method is only optimal for functions whose prefix-free encodings (i.e. encodings that cannot be partitioned into substrings such that one of the substrings encodes another UTM) in UTM1 and UTM2 differ in length by more than len(P). And, since len(P) is a measure of UTM2's complexity relative to UTM1, it follows directly that a UTM2 whose "coding scheme" is such that a function whose prefix-free encoding in UTM2 differs in length from its prefix-free encoding in UTM1 by some large constant (say, ~2^10^80), P itself must be on the order of 2^10^80—in other words, UTM2 must have an astronomical complexity relative to UTM1.

I agree with all of this, and wasn't gesturing at anything related to it, so I think we're talking past each other. My point was simply that two UTMs, even ones without very large prefix encodings of each other, can wind up with extremely different priors - but I don't think that's too relevant to your main point.

> For any physically realizable universal computational system, that system can be analogized to UTM1 in the above analysis. If you have some behavioral policy that is e.g. deontological in nature, that behavioral policy can in principle be recast as an optimization criterion over universe histories; however, this criterion will in all likelihood have a prefix-free description in UTM1 of length ~2^10^80. And, crucially, there will be no UTM2 in whose encoding scheme the criterion in question has a prefix-free description of much less than ~2^10^80, without that UTM2 itself having a description complexity of ~2^10^80 relative to UTM1—meaning, there is no physically realizable system that can implement UTM2.

I think I disagree with almost all of this. You can fix some gerrymandered extant physical system right now that ends up looking like a garbled world-history optimizer, and I doubt that it would take on the order of ~2^10^80 bits to specify it. But granting that these systems would in fact have astronomical prefixes, I think this is a ponens/tollens situation: if these systems actually have a huge prefix, that tells me that the encoding schemes of some physically realisable systems are deeply incompatible with mine, not that those systems - which are out there right now - aren't physically realisable.

I imagine an objection is that these physical systems are not actually world-history optimizers and are actually going to be much more compressible than I'm making them out to be, so your argument goes through. In which case I'm fine with this - it just seems like a differing definition of what counts as two schemes acting "virtually identically" w.r.t. optimization criteria. If your argument is valid but bounds this similarity loosely enough to include e.g. random chunks of a rock floating through space, then I'm happy to concede the point - that seems quite trivial and not at all worrying from the original perspective of bounding the kinds of optimization criteria an AI might have

Comment by keith_wynroe on A Simple Toy Coherence Theorem · 2024-11-09T17:22:19.097Z · LW · GW

The constant bound isn't that relevant, and not just because of the in-principle unbounded size - it also doesn't constrain the induced probabilities in the second coding scheme much at all. It's an upper bound on the maximum length, so the weightings in coding scheme B can still differ in relative length by a ton, leading to wildly different priors
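
To spell that out (a standard invariance-theorem sketch in my own notation, not anything from your comment):

```latex
% If P is a prefix for UTM_1 that simulates UTM_2, then for every x
K_{U_1}(x) \;\le\; K_{U_2}(x) + \ell(P).
% The induced priors m_U(x) \propto 2^{-K_U(x)} are therefore only constrained to
\frac{m_{U_1}(x)}{m_{U_2}(x)} \;\ge\; 2^{-\ell(P)},
% i.e. on any particular hypothesis the two priors can still differ by a factor of
% up to 2^{\ell(P)}, and the relative ordering of any two hypotheses whose code
% lengths are within \ell(P) of each other can flip - a constant-length prefix
% buys you very little about which hypotheses end up favoured.
```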

> And of the encoding schemes that remain on the table, virtually all of them will behave identically with respect to the description lengths they assign to "natural" versus "unnatural" optimization criteria.

I have no idea how you're getting to this - I'm not sure if it's claiming a formal result or just a hunch. But I disagree that there is a neat correspondence between a system being physically realizable and its having a concise implementation as a TM. And even granting that point, I don't think that nearly all or even most of these physically realisable systems will behave identically, or even similarly, w.r.t. how they assign codes to "natural" optimization criteria

Comment by keith_wynroe on The Sun is big, but superintelligences will not spare Earth a little sunlight · 2024-09-24T13:26:27.858Z · LW · GW

> If you parse this post as "attempting to impart a basic intuition that might let people (new to AI x-risk arguments) avoid certain classes of errors" rather than "trying to argue with the bleeding-edge arguments on x-risk", this post seems good

This seems reasonable in isolation, but it gets frustrating when the former is all Eliezer seems to do these days, with seemingly no attempt at the latter. When all you do is retread these dunks on "midwits" and show apathy/contempt for engaging with newer arguments, it makes it look like you don't actually have an interest in being maximally truth-seeking but instead like you want to just dig in and grandstand.

From what little engagement there is with novel criticisms of their arguments (like Nate's attempt to respond to Quintin/Nora's work), it seems like there's a cluster of people here who don't understand and don't particularly care about understanding some objections to their ideas and instead want to just focus on relitigating arguments they know they can win.

Comment by keith_wynroe on What's the Deal with Logical Uncertainty? · 2024-09-17T12:45:35.547Z · LW · GW

> I can reason as follows: There is 0.5 chance that it is Heads. Let P represent the actual, unknown, state of the outcome of the toss (Heads or Tails); and let Q represent the other state. If Q, then anything follows. For example, Q implies that I will win $1 billion. Therefore the value of this bet is at least $500,000,000, which is 0.5 * $1,000,000,000, and I should be willing to pay that much to take the bet.


This doesn't go through. What you have are two separate propositions, "H -> (T -> [insert absurdity here])" and "T -> (H -> [insert absurdity here])" [1], and actually deriving the absurdity from a consequent requires proving which antecedent obtains - which you can't do, since neither is a theorem.

The distinction with logical uncertainty, then, is supposedly that you do already have a proof of the analogue of H or of T, so you can detach one of the conditionals and derive that the other antecedent yields a contradiction
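
To make the structure explicit (my own formalisation; φ stands in for the arbitrary absurdity/windfall):

```latex
% What the coin scenario actually gives you (granting the footnote's \neg(H \land T)):
H \to (T \to \varphi), \qquad T \to (H \to \varphi).
% To detach, say, T \to \varphi you first need a proof of H, and neither H nor T
% is a theorem before the toss is observed - so "the non-actual outcome implies
% I win \$1 billion" never becomes available as a premise you can actually use.
```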

  1. ^

    You don't really have these either, unless you can prove NOT(H AND T) i.e. can you definitively rule out a coin landing both heads and tails? But that's kinda pedantic

Comment by keith_wynroe on Lucius Bushnaq's Shortform · 2024-09-06T10:26:50.736Z · LW · GW

What are your thoughts on KL-div after the unembed softmax as a metric?
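
(For concreteness, the kind of thing I mean - a rough sketch with made-up tensor names, comparing the original model's next-token distribution against the one you get from the reconstruction:)

```python
import torch
import torch.nn.functional as F

def unembed_kl(logits_orig: torch.Tensor, logits_recon: torch.Tensor) -> torch.Tensor:
    """Mean KL(P_orig || P_recon) over token positions, taken after the unembed.

    logits_* are hypothetical [..., d_vocab] tensors: one from the original
    model, one from a forward pass using the reconstructed activations.
    """
    log_p = F.log_softmax(logits_orig, dim=-1)
    log_q = F.log_softmax(logits_recon, dim=-1)
    # F.kl_div expects the *input* to be log-probs; with log_target=True the
    # target is also log-probs, and the pointwise terms are p * (log p - log q).
    kl_per_pos = F.kl_div(log_q, log_p, log_target=True, reduction="none").sum(dim=-1)
    return kl_per_pos.mean()
```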

Comment by keith_wynroe on Rabin's Paradox · 2024-08-15T15:46:10.439Z · LW · GW

These results hold only if you assume risk aversion is entirely explained by a concave utility function; if you don't assume that, then the surprising constraints on your preferences don't apply

IIRC that's the whole point of the paper - not that utility functions are in fact constrained in this way (they're not), but that if you assume risk aversion can only come from diminishing marginal value of money (as many economists do), then you end up in weird places, so maybe you should rethink that assumption
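
The mechanism, roughly (the standard calibration-theorem arithmetic, with the usual illustrative stakes):

```latex
% Under EU, rejecting a 50-50 "lose \$100 / gain \$110" gamble at wealth w means
\tfrac12 u(w+110) + \tfrac12 u(w-100) \le u(w)
\;\Longrightarrow\; u(w+110) - u(w) \le u(w) - u(w-100).
% Concavity gives 110\,u'(w+110) \le u(w+110)-u(w) and u(w)-u(w-100) \le 100\,u'(w-100), so
u'(w+110) \le \tfrac{10}{11}\, u'(w-100).
% If the small gamble is rejected at every wealth level, marginal utility must fall
% by a factor of at least 10/11 for every \$210 of wealth, i.e. decay geometrically,
% which is what forces the absurd conclusions about large-stakes gambles.
```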

Comment by keith_wynroe on A Simple Toy Coherence Theorem · 2024-08-05T19:03:24.926Z · LW · GW

I think what I'm getting at is more general than specifically talking about resources - I'm more getting at the degree of freedom in the problem description that lets you frame anything as technically optimizing something at a distance. E.g. in the 'Utility Maximization = Description Length Minimization' framing, you can take any system, find its long-term and long-distance effects on some other region of space-time, and find a coding scheme under which those particular states have the shortest descriptions. The description length will then, by construction, get minimized. Obviously this just corresponds to one of those (to us) very unnatural-looking "utility functions" over universe-histories or w/e

If we're first fixing the coding scheme then this seems to me to be equivalent to constraining the kinds of properties we're allowing as viable targets of optimization

I guess one way of looking at it is that I don't think it makes sense to talk about a system as being an optimizer/not an optimizer intrinsically. It's a property of a system relative to a coding scheme/set of interesting properties/resources; everything is an optimizer relative to some encoding scheme. And all of the actual, empirical scariness of AI comes from how close the encoding scheme that by definition makes it an optimizer is to our native encoding scheme - as you point out they'll probably have some overlap, but I don't think that by itself is scary

Comment by keith_wynroe on A Simple Toy Coherence Theorem · 2024-08-05T11:41:05.057Z · LW · GW

Thanks, I feel like I understand your perspective a bit better now.

Re: your "old" frame: I agree that the fact we're training an AI to be useful from our perspective will certainly constrain its preferences a lot, such that it'll look like it has preferences over resources we think in terms of/won't just be representable as a maximally random utility function. I think there's a huge step from that though to "it's an optimizer with respect to those resources" - i.e. there are a lot of partial orderings you can put over states where it broadly has preference orderings we like w.r.t. resources without looking like a maximizer over those resources, and I don't think that's necessarily scary. I think some of this disagreement may be downstream of how much you think a superintelligence will "iron out wrinkles" like preference gaps internally though, which is another can of worms

Re: your new frame: I think I agree that looking like a long-term/distance planner is much scarier. Obviously this implicitly assumes we're restricting to some interesting set of resources, because otherwise we can reframe any myopic maximizer as long-term and vice-versa. But this is going round in circles a bit - typing this out, I think the main crux here for me is what I said in the previous point: there's too much of a leap from "looks like it has preferences over this resource and long-term plans" to "is a hardcore optimizer of said resource". Maybe this is just a separate issue though, not sure I have any local disagreements here

Re: your last point, thanks - I don't think I have a problem with this, I think I was just misunderstanding the intended scope of the post

Comment by keith_wynroe on A Simple Toy Coherence Theorem · 2024-08-04T17:13:12.136Z · LW · GW

Thanks, I think that's a good distinction - I guess I have like 3 issues if we roll with that though

  1. I don't think a system acting according to preferences over future states entails that it is EV-maximising w.r.t. some property/resource of those future states. If it's not doing the latter it seems like it's not necessarily scary, and if it is then I think we're back at the issue that we're making an unjustified leap, this time from "it's a utility maximizer + it has preferences over future states" to "it's an EV-maximiser w.r.t. some property/resource of those states" (i.e. having preferences over properties of future states is compatible w/ also having preferences over world-histories/all sorts of weird stuff)
  2. It's not clear to me that specifying "preferences over future states" actually restricts things much - if I have some preferences over the path I take through lotteries, then whether I take path A or path B to reach outcome X will show up as some difference in the final state, so it feels like we can cast a lot of (most? all?) types of preferences as "preferences over future states". I think the implicit response here is that we're categorizing future states by a subset of "interesting-to-us" properties, and the differences in future states yielded by taking Path A or Path B don't matter to us (in other words, whenever we talk about these kinds of preferences over states we're implicitly taking an equivalence class over actual micro-states relative to some subset of properties). But then I think the issue recurs that a system having preferences over future states w.r.t. this subset of properties is a much stronger claim than just "having preferences over future states"
  3. I'm more and more convinced that, even if a system does have preferences over future-states in the scariest sense here, there's not really an overriding normative force for it to update towards being a utility-maximiser. But I think this is maybe a kind of orthogonal issue about the force of exploitability arguments rather than coherence theorems here

I think you've said something along the lines of one or two of these points in your links, sorry! Not expecting this to be super novel to you, half just helpful for me to get my own thoughts down explicitly

Comment by keith_wynroe on A Simple Toy Coherence Theorem · 2024-08-04T15:27:06.285Z · LW · GW

The actual result here looks right to me, but kinda surfaces a lot of my confusion about how people in this space use coherence theorems/reinforces my sense they get misused

You say:

> This ties to a common criticism: that any system can be well-modeled as a utility maximizer, by simply choosing the utility function which rewards whatever the system in fact does. As far as I can tell, that criticism usually reflects ignorance of what coherence says

My sense of how this conversation goes is as follows:

"Utility maximisers are scary, and here are some theorems that show that anything sufficiently smart/rational (i.e. a superintelligence) will be a utility maximiser. That's scary"

"Literally anything can be modelled as a utility maximiser. It's not the case that literally everything is scary, so something's wrong here"

"Well sure, you can model anything as a utility maximiser technically, but the resource w.r.t which it's being optimal/the way its preferences are carving up state-space will be incredibly awkward/garbled/unnatural (in the extreme, they could just be utility-maximizing over entire universe-histories). But these are unnatural/trivial. If we add constraints over the kind of resources it's caring about/kinds of outcomes it can have preferences over, we constrain the set of what can be a utility-maximiser a lot. And if we constrain it to smth like the set of resources that we think in terms of, the resulting set of possible utility-maximisers do look scary"

Does this seem accurate-ish? If so, it feels like this last response is true but also kind of vacuously so, and kind of undercuts the scariness of the coherence theorems in the first place. As in, it seems much more plausible that a utility-maximiser drawn from this constrained set will be scary, but then where's the argument that we're sampling from this subset when we make a superintelligence? It feels like there's a weird motte-and-bailey going on where people flip-flop between the very unobjectionable "it's representable as a utility-maximiser" implied by the theorems and "it'll look like a utility-maximiser "internally", or relative to some constrained set of possible resources s.t. it seems scary to us", which feels murky and un-argued for.

Also, on the actual theorem you outline here - it looks right, but isn't assuming utilities assigned to outcomes, such that the agent is trying to maximise over them, kind of begging most of the question that coherence theorems are after? I.e. the starting data is usually a set of preferences, with the actual work being proving that this, along with some assumptions, yields a utility function over outcomes. This also seems to be why you don't have to use anything like dutch-book arguments etc. as you point out - but only because you've kind of skipped over the step where they're used
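
For reference, the usual shape of such a theorem (VNM stated loosely - the standard result, not the toy theorem from the post):

```latex
% VNM: if a preference relation \succeq over lotteries satisfies completeness,
% transitivity, continuity and independence, then there exists
% u : \text{Outcomes} \to \mathbb{R} with
L_1 \succeq L_2 \iff \mathbb{E}_{L_1}[u] \ge \mathbb{E}_{L_2}[u].
% The utility function is the conclusion; the starting data are the preferences
% plus the axioms, and money-pump/dutch-book style arguments are what usually get
% invoked to motivate the axioms themselves.
```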

Comment by keith_wynroe on An OV-Coherent Toy Model of Attention Head Superposition · 2024-08-04T14:45:45.566Z · LW · GW

Hey, sorry for the (very) belated response - thanks for the comment! Your description of the problem set-up/model looks right to me. FWIW this post was ~my first attempt at digging into something superposition-related, so I think you're right that it was being pretty sloppy/confused with the concept of "superposition". I've since come around more to your perspective of polysemanticity/distributed representation/interference being insufficient for "true" superposition.

Re: your point about there existing simpler solutions - you're totally right that for d_head >= 4, there exists a more straightforward n_head = 1 solution. I did try solving this problem on paper before training anything and arrived at the same thing as you

However we found that for d_head = 1, n_head = 2 the model could still solve the problem perfectly - in this case I think the problem is less trivial, and it does rely on the kind of "conditional attention hierarchy" behaviour and the associated interference we talk about. When n_head = 2 and d_head >= 4 the model still prefers this approach over the more trivial method you outline - we included the plots from this experiment rather than the n_head = 2, d_head = 1 version because they were a bit easier to read and we felt they made the same point, though in retrospect we probably should have included both

Overall I'm a lot less impressed/interested by this work in retrospect, largely for the reasons you point out here; however, I think some of the qualitative behaviours we saw are still quite interesting, and they have at least for me affected how I think about what kinds of things attention layers might be doing (although the lessons may not be new/interesting to others)

  1. "Inverted attention preferences": In almost all of our tests, the two heads learn to invert the order in which they attend to important tokens. If there are multiple important key-tokens that all need to be attended to, you really don't want multiple heads attending to the same token and ignoring the others, so the QK-circuits of heads may be arranged so they distribute responsibility in a mutually exclusive/exhaustive way (see the sketch after this list). Obviously our toy example is an extreme case, but I think this mutual information between QK-circuits is quite likely to exist in LLMs, since "needing to attend to a lot of different context information simultaneously" is very much present in language
  2. "Thinking of heads as copying information about entire contexts vs. specific tokens": This is maybe more of a perspective-shift than anything, but I found it interesting that when a head attended to its "second-favourite token", it could safely not write to the logits of the completion implied by (second-favourite token, first-favourite token), because it can "infer" the first-favourite is not elsewhere in the context (or else it'd be attending there). Or in other words, when an OV-circuit is applied at a specific key-position, it's able to exploit not just the information in the residual stream locally at that position, but also the information implied about the entire context by its QK-circuit. Again, this may largely just be a "frame-shift" thing, but it's definitely informed how I think about the relationship between the QK- and OV-circuits and how independent/disconnected I should be thinking of them as
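
A toy illustration of the first point (purely hypothetical scores, not taken from our trained models):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Made-up QK scores for three "important" key tokens A, B, C.
# The two heads have inverted preference orderings over them.
scores = {
    "head_1": {"A": 3.0, "B": 2.0, "C": 1.0},  # prefers A > B > C
    "head_2": {"A": 1.0, "B": 2.0, "C": 3.0},  # prefers C > B > A
}

# With all three tokens present, head_1 mostly covers A and head_2 mostly covers C.
# Drop A from the context and head_1 falls back to B while head_2 stays on C -
# between them the heads still cover the important tokens instead of colliding.
for context in (["A", "B", "C"], ["B", "C"]):
    print(f"context = {context}")
    for head, s in scores.items():
        attn = softmax(np.array([s[tok] for tok in context]))
        print("  " + head + ": " + ", ".join(f"{t}={a:.2f}" for t, a in zip(context, attn)))
```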


Comment by keith_wynroe on Decomposing the QK circuit with Bilinear Sparse Dictionary Learning · 2024-07-10T11:29:03.502Z · LW · GW

Sorry for the delay - thanks for this! Yeah I agree, in general the OV circuit seems like it'll be much easier given the fact that it doesn't have the bilinearity or the softmax issue. I think the idea you sketch here sounds like a really promising one and pretty in line with some of the things we're trying atm

I think the tough part will be the next step, which is somehow "stitching together" the QK and OV decompositions to give you an end-to-end understanding of what the whole attention layer is doing. Although the extent to which we should be thinking about the QK and OV circuits as totally independent is still unclear to me

Interested to hear more about your work though! Being able to replace the entire model sounds impressive given how much reconstruction errors seem to compound 


Comment by keith_wynroe on Decomposing the QK circuit with Bilinear Sparse Dictionary Learning · 2024-07-05T10:21:58.665Z · LW · GW

Thanks!

The auxiliary losses were something we settled on quite early, and we made some improvements to the methodology since then for the current results so I don't have great apples-to-apples comparisons for you. The losses didn't seem super important though in the sense that runs would still converge, just take longer and end with slightly worse reconstruction error. I think it's very likely that with a better training set-up/better hyperparam tuning you could drop these entirely and be fine.

Re: comparison to SAEs, you mean what do the dictionaries/feature-map have to look like if you're explicitly targeting L2-reconstruction error and just getting pattern reconstruction as a side-effect? If so, we also looked at this briefly early on. We didn't spend a huge amount of time on these so they were probably not optimally trained, but we were finding that to get L2-reconstruction error low enough to yield comparably good pattern reconstruction we needed to go up to a d_hidden of 16,000, i.e. comparable to residual SAEs for the same layer. Which I think is another data-point in favour of "a lot of the variance in head-space is attention-irrelevant and just inherited from the residual stream"

Comment by keith_wynroe on Toward A Mathematical Framework for Computation in Superposition · 2024-01-19T19:39:03.061Z · LW · GW

This looks really cool! Haven't digested it all yet, but I'm especially interested in the QK superposition as I'm working on something similar. I'm wondering what your thoughts are on the number of bigrams represented by a QK circuit being bounded not by interference but by its interaction with the OV circuit. IIUC it looks like a head can store a surprising number of d_resid bigrams, but since the OV circuit is only a function of the key, having the same key feature in a clique with a large number of different query features means the OV circuit will be unable to differentially copy information based on which bigram is present. I don't think this has been explored outside of toy models from Anthropic though

Comment by keith_wynroe on OpenAI Superalignment: Weak-to-strong generalization · 2023-12-15T15:07:43.669Z · LW · GW

I know they flag it in the paper, but seeing the performance curves for the strong model on zero- and few-shot attempts really makes me think the data leakage issue is doing a lot of the work here. If you get the majority(?) of the PGR from e.g. 5-shot prompting, it seems like a natural takeaway is that the strong model doesn't actually need to be fine-tuned on the task, and the weak supervisor is just eliciting the knowledge that's already there

Comment by keith_wynroe on AI #41: Bring in the Other Gemini · 2023-12-09T10:48:28.910Z · LW · GW

Sorry you found it so stressful! I’m not objecting to you deciding it’s not worth your time to engage; what I’m getting at is a perceived double standard in when this kind of criticism is applied. You say

> I do not think that the thing I am observing from Pope/Belrose is typical of LW/AF/rationalist/MIRI/etc behaviors to anything like the same degree that they consistently do it

But this seems wrong to me. The best analogue of your post from Quintin’s perspective was his own post laying out disagreements with Eliezer. Eliezer’s response to this was to say it was too long for him to bother reading, which imo is far worse. AFAICT his response to you in your post is higher-effort than the responses from MIRI people to his arguments all put together. Plausibly we have different clusters in our heads of who we’re comparing him to though - I agree a wider set of LW people are much more engaging; I’m specifically comparing to e.g. Nate and Eliezer, as that feels to me a fairer comparison

To go into the specific behaviours you mention

> I basically don't see him changing his mind about anything, agreeing a good point was made

I don’t think this makes sense - if from his perspective you didn’t make good points or change his mind then what was he supposed to do? If you still think you did and he’s not appreciating them then that’s fair but is more reifying the initial disagreement. I also don’t see this behaviour from Eliezer or Nate?

> addressing my arguments or thoughts on their merits rather than correcting my interpretation of his arguments, asking me questions, suggesting cruxes and so on.

I again don’t see Eliezer doing any of this either in responses to critical posts?

> Where he notes disagreement he says he's baffled anyone could think such a thing and doesn't seem curious why I might think it

Again, this seems to be a feature of many MIRI-cluster responses. Stating that certain things feel obvious from the inside and that you don’t get why it’s so hard for other people to grok them is a common refrain.

Comment by keith_wynroe on AI #41: Bring in the Other Gemini · 2023-12-08T05:37:04.237Z · LW · GW

> And all of this is asserted as, essentially, obvious and undeniable, extreme confidence is displayed, all the arguments offered against this are invalid and dumb, and those that disagree are at best deeply confused and constantly told they did not understand or fairly represent what was said.

This feels unnecessarily snarky, but is also pretty much exactly the experience a lot of people have trying to engage with Yudkowsky et al. It feels weird to bring up “they’re very confident and say that their critics just don’t get it” as a put-down here.

It seems doubly bad because it really seems like a lot of the more pessimist crowd just genuinely aren’t trying to engage with these ideas at all. Nate wrote one post based on a skim, which badly misread the piece, and Yudkowsky AFAICT has at most engaged via a couple of tweets (which again don’t seem to engage with the points). This is concurrent with them both engaging much more heavily with weaker objections to which they already have easy answers.

I genuinely don’t understand why a group which is highly truth-seeking and dispassionately interested in the validity of their very consequential arguments feels so little reason to engage with counter-arguments to their core claims which have been well-received.

> I tried one reply to one of Pope’s posts

From your post, you seem to have misunderstood Quintin’s arguments in a way he explains pretty clearly, and then there’s not really much follow-up. You don’t seem to have demonstrated you can pass an ITT after this, and I think if it were Yudkowsky in Pope’s position, and someone effectively wrote him off as hopeless after one failed attempt to understand each other, you would probably not be as forgiving.

Comment by keith_wynroe on UDT shows that decision theory is more puzzling than ever · 2023-09-15T15:38:42.933Z · LW · GW

I understand it’s a proposition like any other, but I don’t see why an agent would reflect on it/use it in their deliberation to decide what to do. The fact that they’re a CDT agent is a fact about how they will act in the decision, not a fact that they need to use in their deliberation

Analogously with preferences: whether or not an agent prefers A to B is a proposition like any other, but I don’t think it’s natural to model them as first consulting the credences they have assigned to “I prefer A to B” etc. Rather, they will just choose A ex hypothesi, because that’s what having the preference means.

Comment by keith_wynroe on UDT shows that decision theory is more puzzling than ever · 2023-09-15T12:39:24.966Z · LW · GW

Why would they be uncertain about whether they’re a CDT agent? Being a CDT agent surely just means, by definition, that they evaluate decisions based on causal outcomes. It feels confused to say that they have to be uncertain about/reflect on which decision theory they have and then apply it, rather than their being a CDT agent being an ex hypothesi fact about how they behave

Comment by keith_wynroe on A Hill of Validity in Defense of Meaning · 2023-07-18T00:13:32.474Z · LW · GW

Why not? Is it common for NDAs/non-disparagement agreements to also have a clause stating the parties aren’t allowed to tell anyone about it? I’ve never heard of this outside of super-injunctions, which seem like a pretty separate thing

Comment by keith_wynroe on A Hill of Validity in Defense of Meaning · 2023-07-17T21:15:17.099Z · LW · GW

They can presumably confirm whether or not there is a nondisparagement agreement, and whether that is preventing them from commenting, though, right?

Comment by keith_wynroe on An artificially structured argument for expecting AGI ruin · 2023-05-08T12:38:54.247Z · LW · GW

I think (1b) doesn't go through. The "starting data" we have from (1a) is that the AGI has some preferences over lotteries that it competently acts on - acyclicality seems likely, but we don't get completeness or transitivity for free, so we can't assume its preferences will be representable as maximising some utility function. (I suppose we also have the constraint that its preferences look "locally" good to us given training.) But if this is all we have, it doesn't follow that the agent will have some coherent goal it'd want optimisers optimising towards

An AGI doesn't have to be an EU-maximiser to be scary - it could have e.g. incomplete preferences but still prefer B to A where we really really prefer A to B. But I think assuming an AI will look like an EU-maximiser does a lot of the heavy-lifting in guaranteeing the AGI will be lethal, since otherwise we can't a priori predict it'll want to optimise along any dimension particularly hard
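
A minimal illustration of the gap (my own toy example, not from the post):

```latex
% Take outcomes A, B, C with A \succ B and C incomparable to both. This relation
% is acyclic (and transitive where defined), but a utility representation
x \succeq y \iff u(x) \ge u(y)
% would force u(C) to be comparable to u(A) and u(B), i.e. it would impose a
% completeness the agent doesn't have - so "not representable as maximising a
% utility function" is compatible with the agent acting perfectly competently on
% the preferences it does have.
```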

Comment by keith_wynroe on The basic reasons I expect AGI ruin · 2023-04-20T14:12:33.451Z · LW · GW

Thanks for writing this. I think this is a lot clearer and more accessible that most write-ups on this topic and seems valuable.

I think the points around randomly-sampled plans being lethal, and expecting AGI to more closely randomly-sample plans, seem off though:

I don't see why lethal plans dominate the simplicity-weighted distribution if all we do is condition on plans that succeed. I expect the reasoning is "Lethal IC plans are more likely to succeed, therefore there are more minor (equally or barely more complex) variations of a given lethal plan that succeed vs. minor variations of non-lethal plans, therefore the former will be overrepresented in the space of successful plans". But this doesn't seem to go through a priori. You get this "there are way more variations" phenomenon whenever the outcome is overdetermined by a plan, but that only makes the plan more likely on a simplicity prior if it isn't also sufficiently more complex that the complexity penalty outweighs the extra variations. In this case, a fully-fleshed-out plan which goes all-in on IC and takes over the world might easily be more complex than a more modest plan that succeeds, in which case why do we assume the IC plans dominate?

I don't think weighting by plan-complexity necessarily prioritises IC/lethal plans unless you also weight by something like "probability of plan success relative to a prior", in which case sure your distribution will upweight plans that just take over everything. But even so maybe simpler, non-lethal plans are likely enough to succeed that they still come out in front. It feels like what you're implicitly doing is assuming the AI will be trying to maximise the probability of WBE, but why would it do this? This seems to be where all the danger is coming from really. If it instead does something more like "Search through plans, pick the first one that seems "good enough"", then the question of whether it selects a dangerous plan is a purely empirical one about what its own inductive biases are, and it seems odd to be so a priori confident about the danger here

Comment by keith_wynroe on There are no coherence theorems · 2023-03-28T11:14:00.453Z · LW · GW

Want to bump this because it seems important - how do you see the agent in the post as being dominated?

Comment by keith_wynroe on There are no coherence theorems · 2023-03-18T10:39:05.424Z · LW · GW

How is the toy example agent sketched in the post dominated?

Comment by keith_wynroe on There are no coherence theorems · 2023-03-11T01:40:41.022Z · LW · GW

Yeah I agree that even if they fall short of normative constraints there’s some empirical content around what happens in adversarial environments. I think I have doubts that this stuff translates to thinking about AGIs too much though, in the sense that there’s an obvious story of how an adversarial environment selected for (partial) coherence in us, but I don’t see the same kinds of selection pressures being a force on AGIs. Unless you assume that they’ll want to modify themselves in anticipation of adversarial environments which kinda begs the question

Comment by keith_wynroe on There are no coherence theorems · 2023-03-10T01:09:49.638Z · LW · GW

Kind of tangential, but I'd be interested in your take on how strong an argument money-pumping etc. actually is against full-on cyclical preferences. One way to think about why getting money-pumped is bad is that you have an additional preference not to pay money to go nowhere. But it feels like all this tells us is that "something has to go", and if an agent is rationally permitted to modify its own preferences to avoid these situations, then it seems a priori acceptable for it to instead just say something like "well actually I weight my cyclical preferences more highly, so I'll modify the preference against arbitrarily paying"

In other words, it feels like the money-pumping arguments presume this other preference that in a sense takes "precedence" over the cyclical ones, and I'm still not sure how to think about that

Comment by keith_wynroe on There are no coherence theorems · 2023-03-10T01:05:22.782Z · LW · GW

This seems totally different to the point the OP is making, which is that you can in theory have things that definitely are agents, definitely do have preferences, and are incoherent (hence not EV-maximisers) whilst not "predictably shooting themselves in the foot" as you claim must follow from this

I agree the framing of "there are no coherence theorems" is a bit needlessly strong/overly provocative in a sense, but I'm unclear what your actual objection is here - are you claiming these hypothetical agents are in fact still vulnerable to money-pumping? That they are in fact not possible? 

Comment by keith_wynroe on There are no coherence theorems · 2023-02-21T23:09:08.194Z · LW · GW

Great post. I think a lot of the discussion around the role of coherence arguments and what we should expect a super-intelligent agent to behave like is really sloppy, and I think this distinction between "coherence theorems as a self-contained mathematical result" and "coherence arguments as a normative claim about what an agent must be like on pain of shooting themselves in the foot" is an important one

The example of how an incomplete agent avoids getting Dutch-booked also seems to look very naturally like how irl agents behave imo. One way of thinking about this is that these lotteries are a lot more "high-dimensional" than they initially look - e.g. the decision at node 2 isn't between "B and C" but between "B and C given I just chose B in a choice between B and A and this guy is trying to rip me off". In general, the path-dependence of our bets and our meta-preferences over how our preferences are engaged with by other agents are also legitimate reasons to expect things like Dutch-booking to have less normative force for actual agents IRL. Of course, in a way this is maybe just making you VNM-rational after all, albeit with a super weird and garbled utility function, but that's a whole other problem with coherence arguments
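
To make the path-dependence point concrete, here's a minimal sketch (made-up labels and a deliberately crude history-sensitive policy - not the exact example from the post):

```python
# Outcomes: "A", "A_minus" (strictly worse than A), and "B", which is
# incomparable to both. Only one strict preference exists.
prefers = {("A", "A_minus")}  # (x, y) means x is strictly preferred to y

def permissible(current, offered):
    """A swap is permissible iff the offered option isn't strictly dispreferred."""
    return (current, offered) not in prefers

def myopic_policy(history, current, offered):
    # Accept any permissible swap, ignoring how you got here.
    return permissible(current, offered)

def history_sensitive_policy(history, current, offered):
    # Additionally refuse any swap that would leave you strictly worse off than
    # where the whole sequence of trades started.
    start = history[0]
    return permissible(current, offered) and (start, offered) not in prefers

def run(policy):
    holdings = "A"
    history = [holdings]
    for offer in ["B", "A_minus"]:  # the classic two-step "pump"
        if policy(history, holdings, offer):
            holdings = offer
        history.append(holdings)
    return holdings

print(run(myopic_policy))             # A_minus: pumped into something dominated
print(run(history_sensitive_policy))  # B: incomparable to A, so no sure loss
```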

Comment by keith_wynroe on There are no coherence theorems · 2023-02-21T22:44:56.792Z · LW · GW

Ngl, kinda confused how these points imply the post seems wrong - the bulk of this seems to be (1) a semantic quibble + (2) a disagreement on who has the burden of proof when it comes to arguing about the plausibility of coherence + (3) maybe just misunderstanding the point that's being made?

(1) I agree the title is a bit needlessly provocative, and in one sense of course VNM/Savage etc. count as coherence theorems. But the point is that there is another sense in which people use "coherence theorem/argument" in this field, which corresponds to something like "If you're not behaving like an EV-maximiser you're shooting yourself in the foot by your own lights" - that's what brings in all the scary normativity, and it's what the OP is saying doesn't follow from any existing theorem unless you make a bunch of other assumptions

(2) The only real substantive objection to the content here seems to be "IMO completeness seems quite reasonable to me". Why? Having complete preferences seems like a pretty narrow target within the space of all partial orders you could have as your preference relation, so what's the reason why we should expect minds to steer towards this? Do humans have complete preferences?

(3) In some other comments you're saying that this post is straw-manning some extreme position because people who use coherence arguments already accept you could have e.g.

> an extremely powerful AI that is VNM rational in all situations except for one tiny thing that does not matter or will never come up

This seems to be entirely missing the point/confused - the OP isn't saying that agents can realistically get away with not being VNM-rational because their inconsistencies/incompletenesses aren't efficiently exploitable; they're saying that you can have agents that aren't VNM-rational and aren't exploitable in principle. I.e., your example is an agent that could in theory be money-pumped by another sufficiently powerful agent that was able to steer the world to where their corner-case weirdness came out - whereas the point being made about incompleteness here is that you can have a non-VNM-rational agent that's un-Dutch-bookable not just as a matter of empirical reality but in principle. The former still gets you claims like "A sufficiently smart agent will appear VNM-rational to you, they can't have any obvious public-facing failings"; the latter undermines this

Comment by keith_wynroe on You're Not One "You" - How Decision Theories Are Talking Past Each Other · 2023-01-09T13:28:55.284Z · LW · GW

Ah I hadn't realised Caspar wrote that, thanks for the link! I agree that seems to be getting at the same idea, and it's kind of separable from the multi-agent point

Comment by keith_wynroe on You're Not One "You" - How Decision Theories Are Talking Past Each Other · 2023-01-09T08:45:34.576Z · LW · GW

I'm probably misunderstanding you, or I've worded things in a confusing way that I haven't noticed - I don't think it's implied anywhere what you do on Tails? The "iff" here is just saying you would be paid on Heads iff you would pay on Tails - the flip will happen regardless, and the predictor hasn't made any prediction about the coin itself, just your conditional behaviour

Edit: Maybe the "iff you will pay $1 on Tails" makes it sound like the predictor is predicting both the coin and your response - I'll edit to make it clearer