james.lucassen's Shortform
post by james.lucassen · 2022-01-24T07:18:46.562Z · LW · GW · 9 comments
Comments sorted by top scores.
comment by james.lucassen · 2024-09-08T23:09:34.487Z · LW(p) · GW(p)
Man, I have such contradictory feelings about tuning cognitive strategies.
Just now I was out on a walk, and I had to go up a steep hill. And I thought "man I wish I could take this as a downhill instead of an uphill. If this were a loop I could just go the opposite way around. But alas I'm doing an out-and-back, so I have to take this uphill".
And then I felt some confusion about why the loop-reversal trick doesn't work for out-and-back routes, and a spark of curiosity, so I thought about that for a bit.
And after I had cleared up my confusion, I was happy with the result, so I wrote down some of my chain of thought and looked over it. And there were many obvious places where I had made random jumps or missed a clue or otherwise made a mistake.
And this is kind of neat, in the way that looking for a more elegant proof once you've hacked a basic proof together is neat. And it really feels like if my mind took those cleaner, straighter paths by default, there are a lot of things I could get done a lot faster and better. But it also really feels like I could sit down and do those exercises and never really improve, just ratiocinate and navel gaze and build up complicated structures in my head that don't actually cash out to winning.
comment by james.lucassen · 2024-03-06T04:25:12.666Z · LW(p) · GW(p)
Another project I am excited about but probably am not going to get to for a while:
Investigating the possibility of a "steganographic snowball". The basic case for hope in CoT authenticity is something like "even if the AI wants to deceive us, if we start off with strong CoT authenticity, then unless it's smart enough to (within a single forward pass) invent a steganographic protocol too good for us to detect, then we can pin it down and it can't ever secure a hidden channel to think misaligned thoughts in". If it weren't for this argument, I would be much less excited about trying to prevent an AI from thinking certain thoughts using CoT oversight, since you would have to directly fight the intelligence of the AI to do so. But if it can't deploy its full intelligence to reasoning about how to escape, then we're better off.
The concerns with this angle are as follows:
- Can you really get down to 0 hidden channel bandwidth?
- Can very small hidden channels support a parallel CoT?
- In very small hidden channels, does reasoning degrade, or just slow down?
- Can the model use a parallel CoT to reason about how to secure more hidden channel bandwidth?
I have some basic ideas for ways to measure these properties, and if all four show up positive, I would consider that significant evidence that a "steganographic snowball" is going to happen. That would mean that if your AI has any tendency to hide its thoughts from you, you won't be able to stop it forever. More specifically, you'll have to fight off the full intelligence of your AI if you want to prevent it from securing the amount of hidden channel bandwidth it wants.
There's room for retreat to a "snowball would take a long time and we can extract work in the meantime" position, but that seems like a serious concession.
Thanks to Paul Colognese for brainstorming this stuff with me.
↑ comment by mesaoptimizer · 2024-05-25T20:37:22.630Z · LW(p) · GW(p)
Based on gwern's comment [LW(p) · GW(p)], steganography as a capability can arise (at rather rudimentary levels) via RLHF over multi-step problems (which is effectively most cognitive work, really), and this gets exacerbated by the proliferation of AI-generated text that embeds its steganographic capabilities within it.
The following paragraph by gwern [LW(p) · GW(p)] (from the same thread linked in the previous paragraph) basically summarizes my current thoughts on the feasibility of prevention of steganography for CoT supervision:
Inner-monologue approaches to safety, in the new skin of ‘process supervision’, are popular now so it might be good for me to pull out one point and expand on it: ‘process supervision’ does not necessarily prevent steganography, nor does it ensure safety, because you are still pitting three goals against each other—achieving superhuman capability on a task, achieving causal fidelity of transcripts, achieving human-readability of the transcripts. Choose two: you can have transcripts which record a capable thought process accurately but which you cannot understand in full detail; which are not capable but their transcripts accurately convey the fallible flawed concepts and reasoning used; or which are capable and you understand, but are not what it actually thought (because they are misleading, wrong, or shallow ‘lies to children’ sorts of explanations).
↑ comment by ryan_greenblatt · 2024-03-06T05:36:31.218Z · LW(p) · GW(p)
A key dynamic here is that catching the AI doing this sort of steganography might be very useful [LW · GW], more useful than just naively training against this.
TBC, it's by no means obvious we can catch the AI this solidly (or capitalize on this that much), but I think this sort of consideration does change my view of the situation a decent amount.
One way to think about catching dynamics is that there could be a snowball which goes in the opposite direction. For example:
- We notice suspicious behavior.
- We investigate further and find strong evidence for steganography.
- At this point, plan A is to shut down because we now have relatively clear evidence our AI is scheming. But, I'll talk about the plan B where we need to proceed with AI usage due to substantial exogenous risk in the remaining bullets.
- We confront the AI with our evidence of its misalignment and try to negotiate a deal. Also, if the AI (or parts of the AI) no longer consents to us employing it, that provides a further reason for shutdown.
- (It's pretty unclear how we should proceed if there is substantial exogenous risk and our AI doesn't consent to employment given some wages and an attempt at negotiation. Draft the AI (aka slavery) and provide reparations later?)
- (That said, I think it seems way more reasonable to draft the AI if it was trying to screw us despite us trying to negotiate and get consent at an earlier point.)
- We use our understanding of this case to better understand the steganography scheme and then apply this understanding to detect all the steganography cases in the AI's prior activations.
- Based on various intervention experiments, we determine the scheme and are able to decode the steg. (Unclear if this will be possible.)
- We remove steg in cases where this shouldn't be too suspicious to the AI and generally try to avoid the AI knowing we've caught it.
- A variety of other pretty great few-shot catastrophe prevention interventions are now possible because we can read the AI's steg plans.
↑ comment by james.lucassen · 2024-03-06T06:21:48.785Z · LW(p) · GW(p)
Hey, thanks for the comment! Agree that this argument is good and important. The thing I'd want to investigate to deal with this is basically "in a small hidden channel, does cognition degrade or just slow down".
If cognition degrades, then a small hidden channel would lead to poor attempts at steganography, which we could catch and suppress, preventing a snowball.
If cognition instead just slows down, then if the AI is smart enough to come up with steganography protocols too good for us to detect, it might successfully secure more bandwidth on its first try, starting a snowball.
Even in the slowdown case, there are fallback hopes like "it's slow enough for us to extract work and predictably stop the episode before the snowball starts" or "we can obscure how good our steganography detection is"
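The degrade-versus-slowdown distinction can be made concrete with a toy model. The sketch below is purely illustrative: `tokens_until_snowball`, the 500-bit reasoning budget, and the 0.1 bits/token usefulness threshold are all hypothetical assumptions, not measurements or anything proposed in this thread. It assumes that in the slowdown case hidden reasoning simply accumulates at the channel's bandwidth, while in the degrade case a channel below some threshold supports no useful reasoning at all.

```python
import math

def tokens_until_snowball(bandwidth_bits_per_token, mode,
                          bits_needed=500.0, min_useful_bandwidth=0.1):
    """Toy estimate of how many CoT tokens pass before the hidden channel
    has carried enough reasoning to design a better channel.

    All parameters are hypothetical. Returns None if that point is never
    reached under the chosen assumption.

    mode="slowdown": hidden reasoning accumulates at the channel bandwidth.
    mode="degrade":  below min_useful_bandwidth, hidden reasoning is useless.
    """
    if bandwidth_bits_per_token <= 0:
        return None  # no hidden channel at all
    if mode == "degrade" and bandwidth_bits_per_token < min_useful_bandwidth:
        return None  # reasoning degrades instead of slowing down
    return math.ceil(bits_needed / bandwidth_bits_per_token)

# A very narrow channel: 1/16 of a bit per token.
slow = tokens_until_snowball(0.0625, "slowdown")    # 8000 tokens
degraded = tokens_until_snowball(0.0625, "degrade")  # None
```

Under these assumptions, the "stop the episode before the snowball starts" fallback amounts to keeping episodes shorter than the slowdown estimate, which only works if the defender can bound the bandwidth and the reasoning budget in advance.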
comment by james.lucassen · 2024-03-20T07:21:24.459Z · LW(p) · GW(p)
I'm confused about EDT and Smoking Lesion [? · GW]. The canonical argument says:
1) CDT will smoke, because smoking can't cause you to have the lesion or have cancer
2) EDT will not smoke, because people who smoke tend to have the lesion, and tend to have cancer.
I'm confused about 2), and specifically about "people who smoke tend to have the lesion". Say I live in a country where everybody follows EDT. Then nobody smokes, and there is no correlation between smoking and having the lesion. Seems like the "people who smoke tend to have the lesion" is pumping a misleading intuition. Maybe mortals who make probabilistic decisions and smoke are then more likely to have the lesion. But it's not true that EDT'ers who smoke tend to have the lesion, because EDT'ers never smoke.
Seems like EDT contains a conflict between "model your actions as probabilistic and so informative about lesion status" versus "model your actions as deterministic because you perfectly follow EDT and so not informative about lesion status".
Aha wait I think I un-confused myself in the process of writing this. If we suppose that having the lesion guarantees you'll smoke, then having the lesion must force you to not be an EDT'er, because EDT'ers never smoke. So the argument kind of has to be phrased in terms of "people who smoke" instead of "EDT'ers who smoke", which sounds misleading at first, like it's tricking you with the wrong reference class for the correlation. But actually it's essential, because "EDT'ers who smoke" is a contradiction, and "people who smoke" roughly translates to "decision theories which smoke". With a deterministic decision theory you can't take action counterfactuals without a contradiction, so you have to counterfact on your whole decision theory instead.
Neat :)
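A small numerical sketch of the argument above, with the caveat that every probability and utility here is a made-up assumption and the helper names (`posterior_lesion`, `expected_utility`) are hypothetical. It shows the canonical EDT/CDT split in a population with mixed smoking behavior, and also that conditioning on smoking is undefined in a population where nobody ever smokes.

```python
def posterior_lesion(p_lesion, p_act_given_lesion, p_act_given_healthy):
    """P(lesion | action) by Bayes' rule, or None if the action never occurs."""
    p_act = p_lesion * p_act_given_lesion + (1 - p_lesion) * p_act_given_healthy
    if p_act == 0:
        return None  # cannot condition on a probability-zero event
    return p_lesion * p_act_given_lesion / p_act

def expected_utility(p_lesion_belief, smokes,
                     p_cancer_lesion=0.8, p_cancer_healthy=0.05,
                     u_smoke=10.0, u_cancer=-100.0):
    """Expected utility given a belief about having the lesion.
    Cancer depends only on the lesion, never on smoking itself."""
    p_cancer = (p_lesion_belief * p_cancer_lesion
                + (1 - p_lesion_belief) * p_cancer_healthy)
    return (u_smoke if smokes else 0.0) + p_cancer * u_cancer

P_LESION = 0.2          # hypothetical base rate
P_SMOKE_LESION = 0.9    # hypothetical: lesion carriers mostly smoke
P_SMOKE_HEALTHY = 0.1   # hypothetical: others mostly do not

# EDT conditions on the action: smoking is evidence of the lesion.
edt_smoke = expected_utility(
    posterior_lesion(P_LESION, P_SMOKE_LESION, P_SMOKE_HEALTHY), True)
edt_abstain = expected_utility(
    posterior_lesion(P_LESION, 1 - P_SMOKE_LESION, 1 - P_SMOKE_HEALTHY), False)

# CDT holds the lesion belief fixed at the base rate for both actions.
cdt_smoke = expected_utility(P_LESION, True)
cdt_abstain = expected_utility(P_LESION, False)

# A country where nobody smokes: P(lesion | smoke) is undefined.
nobody_smokes = posterior_lesion(P_LESION, 0.0, 0.0)  # None
```

With these numbers EDT prefers abstaining and CDT prefers smoking. The `None` in the last line is the numerical face of the point that "EDT'ers who smoke" is an empty reference class.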
↑ comment by quetzal_rainbow · 2024-03-20T08:21:32.864Z · LW(p) · GW(p)
My personal model is "if you have the lesion, with some small probability it takes over your mind and you smoke anyway; also, you can't distinguish whether your decision is due to the lesion".
comment by james.lucassen · 2022-01-24T07:18:46.899Z · LW(p) · GW(p)
Currently working on ELK - posted some unfinished thoughts here. Looking to turn this into a finished submission before end of January - any feedback is much appreciated, if anyone wants to take a look!
comment by james.lucassen · 2024-03-06T04:16:45.612Z · LW(p) · GW(p)
A project I've been sitting on that I'm probably not going to get to for a while:
Improving on Automatic Circuit Discovery and Edge Attribution Patching by modifying them to run on algorithms that can detect complete boolean circuits. As it stands, both effectively use wire-by-wire patching, which, when run on any nontrivial boolean circuit, can only detect small subgraphs.
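To make the wire-by-wire limitation concrete, here is a toy sketch in Python. It is not ACDC or EAP: `or_gate`, `patch`, and `wire_sets_that_matter` are hypothetical helpers that patch the input wires of a single OR gate, under the assumption that a wire counts as "detected" only if patching it changes the output.

```python
from itertools import combinations

def or_gate(inputs):
    """Toy 'circuit': the output is the OR of all input wires."""
    return int(any(inputs))

def patch(clean, corrupted, wires):
    """Clean inputs with the chosen wire indices swapped to corrupted values."""
    return [corrupted[i] if i in wires else clean[i] for i in range(len(clean))]

def wire_sets_that_matter(circuit, clean, corrupted, size):
    """All wire subsets of the given size whose patching flips the output."""
    baseline = circuit(clean)
    return [set(wires)
            for wires in combinations(range(len(clean)), size)
            if circuit(patch(clean, corrupted, set(wires))) != baseline]

clean = [1, 1]      # both wires on: output 1
corrupted = [0, 0]  # both wires off: output 0

single_wire_hits = wire_sets_that_matter(or_gate, clean, corrupted, 1)  # []
pair_hits = wire_sets_that_matter(or_gate, clean, corrupted, 2)         # [{0, 1}]
```

Neither wire is flagged on its own, because the other wire keeps the OR gate on, yet the pair is the whole circuit. Recovering it requires patching sets of wires together, and the number of candidate sets grows quickly with circuit size.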
It's a bit unclear how useful this will be, because:
- not sure how useful I think mech interp is
- not sure if this is where mech interp's usefulness is bottlenecked
- maybe attribution patching doesn't work well when patching clean activations into a corrupted baseline, which would make this much slower
But I think it'll be a good project to bang out for the experience, and I'm curious what the results will be compared to ACDC/EAP.
This project started out as an ARENA capstone in collaboration with Carl Guo.