Posts

An OV-Coherent Toy Model of Attention Head Superposition 2023-08-29T19:44:11.242Z
An OV-Coherent Toy Model of Attention Head Superposition 2023-08-25T18:26:28.238Z
Literature review of TAI timelines 2023-01-27T20:07:38.186Z
You're Not One "You" - How Decision Theories Are Talking Past Each Other 2023-01-09T01:21:11.708Z

Comments

Comment by keith_wynroe on Toward A Mathematical Framework for Computation in Superposition · 2024-01-19T19:39:03.061Z · LW · GW

This looks really cool! Haven't digested it all yet, but I'm especially interested in the QK superposition since I'm working on something similar. I'm wondering what your thoughts are on the number of bigrams represented by a QK circuit being bounded not by interference but by its interaction with the OV circuit. IIUC a head can store a surprisingly large number of bigrams relative to d_resid, but since the OV circuit is only a function of the key side, having the same key feature sit in a clique with a large number of different query features means the OV circuit will be unable to differentially copy information based on which bigram is present. I don't think this has been explored outside of Anthropic's toy models though.
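
Rough sketch of the worry, in case that's unclear (toy dimensions, everything below is made up): whatever query feature the QK circuit happened to match on, the vector the head writes into the destination is just the attention weight times W_OV applied to the source residual, so it can only depend on the key/value side.

```python
import numpy as np

d_model, d_head = 16, 4
rng = np.random.default_rng(0)
W_V = rng.normal(size=(d_head, d_model))
W_O = rng.normal(size=(d_model, d_head))
W_OV = W_O @ W_V                      # acts only on the source-position residual

x_src = rng.normal(size=d_model)      # source residual carrying the shared key feature

# Suppose the QK circuit stores bigrams pairing this key feature with two
# different query features, and either one drives attention weight ~1.0
# onto the source position. The head's write to the destination is then:
write_given_query_1 = 1.0 * (W_OV @ x_src)
write_given_query_2 = 1.0 * (W_OV @ x_src)
print(np.allclose(write_given_query_1, write_given_query_2))   # True either way
```

So the QK circuit can store lots of (query, key) bigrams in superposition, but per key-side input the head can only write one thing, however it got attended to.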

Comment by keith_wynroe on OpenAI Superalignment: Weak-to-strong generalization · 2023-12-15T15:07:43.669Z · LW · GW

I know they flag it in the paper, but seeing the performance curves for the strong model on zero- and few-shot attempts really makes me think the data leakage issue is doing a lot of the work here. If you get the majority(?) of the PGR from e.g. 5-shot prompting, a natural takeaway seems to be that the strong model doesn't actually need to be fine-tuned on the task, and the weak supervisor is just eliciting knowledge that's already there.
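
(For reference, going from memory of how the paper defines the metric, so treat this as my gloss rather than theirs:

\[
\mathrm{PGR} \;=\; \frac{\text{weak-to-strong performance} \;-\; \text{weak supervisor performance}}{\text{strong ceiling performance} \;-\; \text{weak supervisor performance}}
\]

so if few-shot prompting alone already closes most of the weak-to-ceiling gap, the fine-tuning-on-weak-labels step isn't doing much of the work.)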

Comment by keith_wynroe on AI #41: Bring in the Other Gemini · 2023-12-09T10:48:28.910Z · LW · GW

Sorry you found it so stressful! I'm not objecting to you deciding it's not worth your time to engage; what I'm getting at is a perceived double standard in when this kind of criticism is applied. You say:

> I do not think that the thing I am observing from Pope/Belrose is typical of LW/AF/rationalist/MIRI/etc behaviors to anything like the same degree that they consistently do it

But this seems wrong to me. The best analogue of your post from Quintin's perspective was his own post laying out disagreements with Eliezer. Eliezer's response to that was to say it was too long for him to bother reading, which imo is far worse. AFAICT his response to you on your post is higher-effort than the responses from MIRI people to his arguments all put together. Plausibly we have different clusters in our heads of who we're comparing him to though - I agree a wider set of LW people engage much more, I'm specifically comparing to e.g. Nate and Eliezer as that feels to me a fairer comparison.

To go into the specific behaviours you mention:

> I basically don't see him changing his mind about anything, agreeing a good point was made

I don't think this makes sense - if from his perspective you didn't make good points or change his mind, then what was he supposed to do? If you still think you did and he's not appreciating them then that's fair, but that's more a restatement of the initial disagreement. I also don't see this behaviour from Eliezer or Nate.

> addressing my arguments or thoughts on their merits rather than correcting my interpretation of his arguments, asking me questions, suggesting cruxes and so on.

Again, I don't see Eliezer doing any of this in responses to critical posts either.

> Where he notes disagreement he says he's baffled anyone could think such a thing and doesn't seem curious why I might think it

Again, this seems to be a feature of many MIRI-cluster responses. Stating that certain things feel obvious from the inside, and that you don't get why it's so hard for other people to grok them, is a common refrain.

Comment by keith_wynroe on AI #41: Bring in the Other Gemini · 2023-12-08T05:37:04.237Z · LW · GW

> And all of this is asserted as, essentially, obvious and undeniable, extreme confidence is displayed, all the arguments offered against this are invalid and dumb, and those that disagree are at best deeply confused and constantly told they did not understand or fairly represent what was said.

This feels unnecessarily snarky, but is also pretty much exactly the experience a lot of people have trying to engage with Yudkowsky et al. It feels weird to bring up “they’re very confident and say that their critics just don’t get it” as a put-down here.

It seems doubly bad because it really seems like a lot of the more pessimistic crowd just genuinely aren't trying to engage with these ideas at all. Nate wrote one post based on a skim, which badly misread the piece, and Yudkowsky AFAICT has at most engaged via a couple of tweets (which again don't seem to engage with the points). This is concurrent with both of them engaging much more heavily with weaker objections to which they already have easy answers.

I genuinely don't understand why a group which is highly truth-seeking and dispassionately interested in the validity of its very consequential arguments sees so little reason to engage with well-received counter-arguments to its core claims.

> I tried one reply to one of Pope's posts

From your post, you seem to have misunderstood Quintin's arguments in a way he explains pretty clearly, and then there's not really much follow-up. You don't seem to have demonstrated you can pass an ITT after this, and I think if it were Yudkowsky in Pope's position and someone effectively wrote him off as hopeless after one failed attempt to understand each other, you would probably not be as forgiving.

Comment by keith_wynroe on UDT shows that decision theory is more puzzling than ever · 2023-09-15T15:38:42.933Z · LW · GW

I understand it's a proposition like any other; I don't see why an agent would reflect on it/use it in their deliberation to decide what to do. The fact that they're a CDT agent is a fact about how they will act in the decision, not a fact that they need to use in their deliberation.

Analogously with preferences: whether an agent prefers A or B is a proposition like any other, but I don't think it's natural to model them as first consulting the credences they have assigned to "I prefer A to B" etc. Rather, they will just choose A, ex hypothesi, because that's what having the preference means.

Comment by keith_wynroe on UDT shows that decision theory is more puzzling than ever · 2023-09-15T12:39:24.966Z · LW · GW

Why would they be uncertain about whether they're a CDT agent? Being a CDT agent just means, by definition, that they evaluate decisions based on causal outcomes. It feels confused to say that they have to be uncertain about/reflect on which decision theory they have and then apply it, rather than their being a CDT agent being an ex hypothesi fact about how they behave.

Comment by keith_wynroe on A Hill of Validity in Defense of Meaning · 2023-07-18T00:13:32.474Z · LW · GW

Why not? Is it common for NDAs/non-disparagement agreements to also have a clause stating the parties aren't allowed to tell anyone about the agreement itself? I've never heard of this outside of super-injunctions, which seem like a pretty separate thing.

Comment by keith_wynroe on A Hill of Validity in Defense of Meaning · 2023-07-17T21:15:17.099Z · LW · GW

They can presumably confirm whether or not there is a non-disparagement agreement, and whether that is what's preventing them from commenting, though, right?

Comment by keith_wynroe on An artificially structured argument for expecting AGI ruin · 2023-05-08T12:38:54.247Z · LW · GW

I think (1b) doesn't go through. The "starting data" we have from (1a) is that the AGI has some preferences over lotteries that it competently acts on - acyclicality seems likely, but we don't get completeness or transitivity for free, so we can't assume its preferences will be representable as maximising some utility function. (I suppose we also have the constraint that its preferences look "locally" good to us given training.) But if this is all we have, it doesn't follow that the agent will have some coherent goal it'd want optimisers optimising towards.

An AGI doesn't have to be an EU-maximiser to be scary - it could have e.g. incomplete preferences but still prefer B to A where we really, really prefer A to B. But I think assuming an AI will look like an EU-maximiser does a lot of the heavy lifting in guaranteeing the AGI will be lethal, since otherwise we can't a priori predict it'll want to optimise along any dimension particularly hard.
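
For concreteness on the gap I mean: the VNM theorem only gives you a utility function u satisfying

\[
p \succsim q \iff \mathbb{E}_p[u] \ge \mathbb{E}_q[u]
\]

when the preference relation over lotteries is complete, transitive, continuous, and satisfies independence, and (1a) at best buys you a strict subset of those conditions.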

Comment by keith_wynroe on The basic reasons I expect AGI ruin · 2023-04-20T14:12:33.451Z · LW · GW

Thanks for writing this. I think this is a lot clearer and more accessible than most write-ups on this topic, and seems valuable.

I think the points around randomly-sampled plans being lethal, and expecting AGI to more closely randomly-sample plans, seem off though:

I don't see why lethal plans dominate the simplicity-weighted distribution if all we do is condition on plans that succeed. I expect the reasoning is "lethal IC plans are more likely to succeed, therefore there are more minor (equally or barely more complex) variations of a given lethal plan that succeed vs. minor variations of non-lethal plans, therefore the former will be overrepresented in the space of successful plans". But this doesn't seem to go through a priori. You get this "there are way more variations" phenomenon whenever the outcome is overdetermined by a plan, but that doesn't automatically make the plan more likely under a simplicity prior unless the extra variations outweigh the extra complexity. A fully-fleshed-out plan that goes all-in on IC and takes over the world might easily carry enough extra complexity to swamp its extra successful variations, in which case why do we assume the IC plans dominate?

I don't think weighting by plan-complexity necessarily prioritises IC/lethal plans unless you also weight by something like "probability of plan success relative to a prior", in which case sure, your distribution will upweight plans that just take over everything. But even so, maybe simpler, non-lethal plans are likely enough to succeed that they still come out in front. It feels like what you're implicitly doing is assuming the AI will be trying to maximise the probability of WBE, but why would it do this? This seems to be where all the danger is really coming from. If it instead does something more like "search through plans and pick the first one that seems good enough", then whether it selects a dangerous plan is a purely empirical question about its own inductive biases, and it seems odd to be so a priori confident about the danger here.
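
To make that trade-off concrete, here's a deliberately toy calculation (every number is made up): treat each plan family as a bundle of roughly equal-length successful variants, weight by a 2^-k simplicity prior, and condition on success.

```python
# Toy comparison (all numbers made up): does "many more successful variants"
# beat "shorter description" once you weight plans by a 2^-k simplicity prior
# and condition on success?

def posterior_odds(k_lethal, n_lethal, k_mundane, n_mundane):
    """Odds of the lethal family vs the mundane family after conditioning on
    success, modelling each family as n successful variants of length ~k bits."""
    return (n_lethal * 2.0 ** -k_lethal) / (n_mundane * 2.0 ** -k_mundane)

# 40 extra bits of complexity swamps a 10^9-fold advantage in variant count:
print(posterior_odds(k_lethal=120, n_lethal=1e12, k_mundane=80, n_mundane=1e3))  # ~1e-3
# ...but only 10 extra bits doesn't:
print(posterior_odds(k_lethal=90, n_lethal=1e12, k_mundane=80, n_mundane=1e3))   # ~1e6
```

Which family dominates is just a race between the variant-count ratio and 2^(complexity gap), and nothing a priori tells us the count ratio wins.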

Comment by keith_wynroe on There are no coherence theorems · 2023-03-28T11:14:00.453Z · LW · GW

Want to bump this because it seems important - how do you see the agent in the post as being dominated?

Comment by keith_wynroe on There are no coherence theorems · 2023-03-18T10:39:05.424Z · LW · GW

How is the toy example agent sketched in the post dominated?

Comment by keith_wynroe on There are no coherence theorems · 2023-03-11T01:40:41.022Z · LW · GW

Yeah I agree that even if they fall short of normative constraints, there's some empirical content around what happens in adversarial environments. I have doubts that this stuff translates much to thinking about AGIs though, in the sense that there's an obvious story of how an adversarial environment selected for (partial) coherence in us, but I don't see the same kinds of selection pressures being a force on AGIs. Unless you assume that they'll want to modify themselves in anticipation of adversarial environments, which kinda begs the question.

Comment by keith_wynroe on There are no coherence theorems · 2023-03-10T01:09:49.638Z · LW · GW

Kind of tangential, but I'd be interested in your take on how strong an argument money-pumping etc. actually is against full-on cyclical preferences. One way to think about why getting money-pumped is bad is that you have an additional preference not to pay money to go nowhere. But it feels like all this tells us is that "something has to go", and if an agent is rationally permitted to modify its own preferences to avoid these situations, then it seems a priori acceptable for it to instead just say something like "well, actually I weight my cyclical preferences more highly, so I'll modify the preference against arbitrarily paying".

In other words, it feels like the money-pumping arguments presume this other preference that in a sense takes "precedence" over the cyclical ones, and I'm still not sure how to think about that.
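
For concreteness, the pump I have in mind is the standard one (a minimal sketch; the option names and fee are just placeholders):

```python
# Minimal money-pump sketch: an agent with cyclic strict preferences
# A > B > C > A pays a small fee every time it trades up to something it
# prefers, and ends the cycle holding what it started with, minus the fees.

prefers = {("A", "B"), ("B", "C"), ("C", "A")}   # (x, y) means x is preferred to y
fee = 1

holding, money = "C", 0
for offer in ["B", "A", "C"]:                    # the exploiter offers each trade in turn
    if (offer, holding) in prefers:              # the agent prefers the offer to its holding
        holding, money = offer, money - fee
print(holding, money)                            # C -3: back where it started, 3 fees poorer
```

The pump only counts as foot-shooting because of that further preference against ending up back where you started but poorer, so it still feels like the argument shows "something has to give" rather than specifically "the cycle has to give".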

Comment by keith_wynroe on There are no coherence theorems · 2023-03-10T01:05:22.782Z · LW · GW

This seems totally different to the point the OP is making, which is that you can in theory have things that definitely are agents, definitely do have preferences, and are incoherent (hence not EV-maximisers), whilst not "predictably shooting themselves in the foot" as you claim must follow from this.

I agree the framing of "there are no coherence theorems" is a bit needlessly strong/overly provocative in a sense, but I'm unclear what your actual objection is here - are you claiming these hypothetical agents are in fact still vulnerable to money-pumping? That they are in fact not possible? 

Comment by keith_wynroe on There are no coherence theorems · 2023-02-21T23:09:08.194Z · LW · GW

Great post. I think a lot of the discussion around the role of coherence arguments, and what we should expect a super-intelligent agent to behave like, is really sloppy, and this distinction between "coherence theorems as a self-contained mathematical result" and "coherence arguments as a normative claim about what an agent must be like on pain of shooting themselves in the foot" is an important one.

The example of how an incomplete agent avoids getting Dutch-booked also looks, to me, very much like how agents behave IRL. One way of thinking about this is that these lotteries are a lot more "high-dimensional" than they initially look - e.g. the decision at node 2 isn't between "B and C" but between "B and C given I just chose B in a choice between B and A, and this guy is trying to rip me off". In general, the path-dependence of our bets, and our meta-preferences over how other agents engage with our preferences, are also legitimate reasons to expect things like Dutch-booking to have less normative force for actual agents IRL. Of course, in a way this is maybe just making you VNM-rational after all, albeit with a super weird and garbled utility function, but that's a whole other problem with coherence arguments.
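
One way to picture the kind of rule I mean (a toy version of my own, not necessarily the exact rule from the post): let the agent's choices condition on what it has already passed up along the path.

```python
# Toy history-dependent choice rule for an agent with incomplete preferences
# (my own simplified sketch): never trade into something strictly dispreferred
# to an option you already gave up earlier on this path.

strictly_prefers = {("A", "A-")}   # A- is "A minus a penny"; A and B are incomparable

def accept(holding, offer, passed_up):
    # Refuse anything strictly worse than an option we already gave up...
    if any((x, offer) in strictly_prefers for x in passed_up):
        return False
    # ...otherwise accept, unless the offer is strictly worse than what we hold.
    return (holding, offer) not in strictly_prefers

passed_up, holding = [], "A"
for offer in ["B", "A-"]:          # classic two-step pump attempt: A -> B -> A-
    if accept(holding, offer, passed_up):
        passed_up.append(holding)
        holding = offer
print(holding)                     # "B": the trade down to A- gets refused
```

From the outside you could redescribe that as a weird history-indexed utility function, which is the "garbled utility function" point above.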

Comment by keith_wynroe on There are no coherence theorems · 2023-02-21T22:44:56.792Z · LW · GW

Ngl, kinda confused how these points imply the post seems wrong - the bulk of this seems to be (1) a semantic quibble + (2) a disagreement on who has the burden of proof when it comes to arguing about the plausibility of coherence + (3) maybe just a misunderstanding of the point that's being made?

(1) I agree the title is a bit needlessly provocative, and in one sense of course VNM/Savage etc. count as coherence theorems. But the point is that there is another sense in which people in this field use "coherence theorem/argument", corresponding to something like "if you're not behaving like an EV-maximiser you're shooting yourself in the foot by your own lights", which is what brings in all the scary normativity and is what the OP is saying doesn't follow from any existing theorem unless you make a bunch of other assumptions.

(2) The only real substantive objection to the content here seems to be "IMO completeness seems quite reasonable to me". Why? Having complete preferences seems like a pretty narrow target within the space of all partial orders you could have as your preference relation (see the quick brute-force count at the end of this comment), so why should we expect minds to steer towards this? Do humans have complete preferences?

(3) In some other comments you're saying that this post is straw-manning some extreme position because people who use coherence arguments already accept you could have e.g.

> an extremely powerful AI that is VNM rational in all situations except for one tiny thing that does not matter or will never come up

This seems to be entirely missing the point/confused - the OP isn't saying that agents can realistically get away with not being VNM-rational because their inconsistencies/incompletenesses aren't efficiently exploitable; they're saying that you can have agents that aren't VNM-rational and aren't exploitable in principle. Your example is an agent that could in theory be money-pumped by another sufficiently powerful agent able to steer the world to where the corner-case weirdness came out; the point being made about incompleteness here is that you can have a non-VNM-rational agent that isn't just un-Dutch-bookable as a matter of empirical reality, but in principle. The former still gets you claims like "a sufficiently smart agent will appear VNM-rational to you, they can't have any obvious public-facing failings"; the latter undermines this.
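
On (2), here's the brute-force count I mentioned (treating preferences as preorders over just three options; nothing hinges on the exact formalisation):

```python
# Count, over 3 options, how many preference relations (reflexive + transitive,
# i.e. preorders) exist vs. how many of those are also complete - a rough sense
# of how narrow a target completeness is even for tiny option sets.
from itertools import product

options = range(3)
pairs = [(a, b) for a in options for b in options]

def is_preorder(rel):
    reflexive = all((a, a) in rel for a in options)
    transitive = all((a, d) in rel
                     for (a, b) in rel for (c, d) in rel if b == c)
    return reflexive and transitive

def is_complete(rel):
    return all((a, b) in rel or (b, a) in rel for a in options for b in options)

preorders = []
for bits in product([0, 1], repeat=len(pairs)):
    rel = {p for p, bit in zip(pairs, bits) if bit}
    if is_preorder(rel):
        preorders.append(rel)

complete = [rel for rel in preorders if is_complete(rel)]
print(len(preorders), len(complete))   # 29 preorders, only 13 of them complete
```

And the gap only widens as the number of options grows.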

Comment by keith_wynroe on You're Not One "You" - How Decision Theories Are Talking Past Each Other · 2023-01-09T13:28:55.284Z · LW · GW

Ah I hadn't realised Caspar wrote that, thanks for the link! I agree that seems to be getting at the same idea, and it's kind of separable from the multi-agent point

Comment by keith_wynroe on You're Not One "You" - How Decision Theories Are Talking Past Each Other · 2023-01-09T08:45:34.576Z · LW · GW

I'm probably misunderstanding you, or I've worded things in a confusing way that I haven't noticed - I don't think it's implied anywhere what you do on Tails? The "iff" here is just saying you would be paid on Heads iff you would pay on Tails - the flip will happen regardless, and the predictor hasn't made any prediction about the coin itself, just your conditional behaviour.

Edit: Maybe the "iff you will pay $1 on Tails" makes it sound like the predictor is predicting both the coin and your response; I'll edit to make it clearer.