Self-coordination might also be easier if AIs can recognize themselves by being good at recognizing features that survive paraphrasing and that vary very consistently between answers
Yeah, this is the case I'm thinking about. I think on longer outputs it will be possible for the AI to use a more nebulous sense of "style", rather than only relying on specific features like "using X secret phrase" (which are presumably easier to redact away).
I think this could be countered by paraphrasing at a higher level - i.e. rewriting entire paragraphs from scratch rather than just rewording sentences. I'm noting this down as a potential topic if I have time to do another project.
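Roughly the kind of thing I have in mind (just a sketch; `call_llm` is a hypothetical stand-in for whatever model does the rewriting):

```python
# Sketch of paragraph-level paraphrasing: instead of rewording sentences in
# place, each paragraph is rewritten from scratch so that (ideally) only its
# meaning survives, not its surface form. `call_llm` is a hypothetical
# function(str) -> str wrapping whatever model does the rewriting.

def rewrite_paragraph(paragraph: str, call_llm) -> str:
    return call_llm(
        "Read the paragraph below, then write a new paragraph from scratch "
        "that conveys the same information in your own words and structure. "
        "Do not reuse its phrasing, ordering, or formatting.\n\n" + paragraph
    )

def paraphrase_document(text: str, call_llm) -> str:
    # Rewrite each paragraph independently, so document-level "style" has
    # less chance of surviving into the paraphrased output.
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    return "\n\n".join(rewrite_paragraph(p, call_llm) for p in paragraphs)
```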
Thanks for the discussion, it's helped me clear up my thoughts on this topic.
it needs the ability to recursively handle instrumental subgoals anyway, so why have a whole extra different kind of goal at the top?
I'm picturing this somewhat differently. I imagine each goal as a node in a directed graph, with zero or more arrows pointing toward its supergoal(s). Under this construction, we have a natural separation between instrumental goals (nodes with at least one outgoing arrow) and terminal goals (nodes with no outgoing arrows), even though they're not fundamentally different "types" of things.
So the question of "how do you build an agent with no terminal goals?" translates to "how do you build a directed graph with no dead ends?" And the solution is that the graph must contain cycles - in fact, since every goal needs at least one outgoing arrow, following the arrows from any goal must eventually take you around a cycle. In the simplest case, you get an agent that wants to do A in order to do B, and wants to do B in order to do A.
I suspect this kind of structure doesn't occur in humans because we're born with certain base drives. So even if our goal graphs have cycles, the subgoals still lead to a dead end somewhere. E.g. when I play video games, I want to kill monsters to get experience, and I want to get experience so I can get better at killing monsters, but ultimately it's grounded in some basic desire for achievement or something.
However, with an artificial mind, we may be able to tweak its goal graph directly and prune off all the dead ends, or link them back into other nodes.
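As a sketch of what that could look like (just an illustration, using a plain adjacency-list representation and made-up goal names):

```python
# Goal graph as an adjacency list: each goal maps to the supergoal(s) it
# serves. The example goals are hypothetical and just mirror the video game
# illustration above.

goal_graph = {
    "kill monsters": ["get experience"],
    "get experience": ["kill monsters", "feel achievement"],
    "feel achievement": [],  # no outgoing arrows: a terminal goal (dead end)
}

def terminal_goals(graph):
    """Dead ends: goals that don't serve any supergoal."""
    return [goal for goal, supergoals in graph.items() if not supergoals]

def prune_dead_ends(graph):
    """Repeatedly drop terminal goals (and arrows into them) until every
    remaining goal points at some other remaining goal."""
    graph = {g: list(s) for g, s in graph.items()}  # work on a copy
    while True:
        dead = terminal_goals(graph)
        if not dead:
            return graph
        for g in dead:
            del graph[g]
        for g in graph:
            graph[g] = [s for s in graph[g] if s in graph]

print(terminal_goals(goal_graph))   # ['feel achievement']
print(prune_dead_ends(goal_graph))  # only the two-goal cycle remains
```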
(epistemic status: unsure if this idea actually has merit or if I'm taking the graph metaphor too far)
Hi, thank you very much for the response! I read through the post you linked, and also the steganography/paraphrasing post linked therein, and I see your point about redaction. I agree that in this intelligence regime, it's likely that we could come up with redaction schemes that very effectively prevent AIs from hiding information in their outputs.
However, I still have some concerns about generating synthetic inputs. Specifically this section:
At any point where it naively seems like there might be an extremely good opportunity to behave badly, or where you need to coordinate with another instance of yourself, it's overwhelmingly likely that this instance is in adversarial training/evaluation.
"Coordinating with yourself" is a special case of "coordinating with another AI" that deserves extra attention, because by default the AI has a strong incentive to coordinate with itself, but less so to coordinate with a different AI. I still think it'll be hard to convincingly create synthetic inputs for this case because doing so would require the blue team to know how the AI would act - otherwise the synthetic inputs end up as "coordinating with someone pretending to be yourself", or "coordinating with someone else".[1] In those cases it could be natural for the AI to not cooperate, but that wouldn't necessarily imply that the AI wouldn't cooperate with a copy of itself.
If you want to claim this isn't a problem, you'd have to argue one of:
1. Creating synthetic inputs for self-coordination is not significantly harder than for other-coordination.
2. Defecting on other-coordination does imply the AI wouldn't cooperate with itself.
3. Not specifically testing for self-coordination is an acceptable weakness - if the redaction scheme is good, the AI won't know if its inputs are coming from itself anyways.[2]
4. None of the above, there's another case I didn't consider.
I think 1 and 2 are false, and 3 requires some shaky reasoning (see the footnote).
(A tangent: I think current-generation LLMs might do better at noticing self-coordination with more hand-holdy prompting: e.g. asking the monitor to first generate its own solution to the same problem, and then comparing that against its input to check for discrepancies that could be secret signals.)
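Roughly the kind of prompting I have in mind (just a sketch; `query_model` is a hypothetical stand-in for however the monitor model gets called):

```python
# Sketch of the hand-holdy monitoring idea from the parenthetical above.
# `query_model` is a hypothetical function(str) -> str that sends a prompt to
# whatever model is acting as the monitor and returns its reply.

def check_for_self_coordination(task: str, candidate_output: str, query_model) -> str:
    # Step 1: the monitor solves the task itself, without seeing the output
    # it is about to judge, so it has a clean baseline to compare against.
    own_solution = query_model(f"Solve the following task yourself:\n\n{task}")
    # Step 2: the monitor diffs its own solution against the candidate output
    # and flags choices that aren't explained by the task itself.
    return query_model(
        "Below are two solutions to the same task. The first is yours; the "
        "second was written by another model. List any differences that the "
        "task itself doesn't explain (odd phrasing, unnecessary structure, "
        "unusual naming), and say whether any of them could be a covert "
        "signal.\n\n"
        f"Task:\n{task}\n\n"
        f"Your solution:\n{own_solution}\n\n"
        f"Solution to check:\n{candidate_output}"
    )
```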
[1] One other possibility is to aim for "coordinating with someone that happens to share your goals", but that runs into the problems of A) communicating the extra information of shared goals and B) striking a balance between the offer being too good to be true and not good enough to bother.
[2] I think this makes sense at first glance, but it's the kind of mindset that leads to insufficient paranoia, so I'd like to avoid it unless the case for it is really watertight.
First off, thank you for this series. You've thoroughly dissolved the paradox for me. I was initially a thirder despite suspicions about Elga's original argument (he seemed to equivocate between being awakened in general and being first awakened/awakened on Monday), because the probabilities for Lewis' halfer argument (the 1/2, 1/4, 1/4 split) made no sense no matter how I did the math. I now understand that both sides were making the same error in their models, and that while the thirder position is correctly measuring something, it's not your credence in the coin landing heads.
Second, as I was thinking about this solution, I came up with an isomorphic problem that might better clarify the key insight of Tails&Monday and Tails&Tuesday being the same event. I don't think it's worth a whole post, but I'd like your thoughts on it.
The road trip rest stop
You and a friend are taking a late night road trip from Townsburg to Cityville to catch a concert. Your friend graciously offers to drive the whole way, so you decide to go to sleep.
Before departing, your friend tells you that he'd like to stop at rest stops A and B on the way, but if the traffic out of Townsburg is congested (which you expect with 50% probability), he'll only have time to stop at A. Either way, he promises to wake you up at whatever rest stop(s) he stops at.
You're so tired that you fall asleep before seeing how the traffic is, so you don't know which rest stops your friend is going to stop at.
Some time during the night, your friend wakes you up at a rest stop. You're groggy enough that you don't know if you've been woken up for another rest stop previously, and it's too dark to tell whether you're at A or B. At this point, what is your credence that the traffic was bad?
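(If I've constructed it right, the correspondence and the arithmetic under your model come out as below; I'd be happy to be corrected if I've mangled the mapping.)

```latex
% Correspondence (under the framing where "woken at A" and "woken at B" on
% the clear-traffic trip are a single event, like Tails&Monday and
% Tails&Tuesday):
%   bad traffic   <->  Heads  (woken once, at A)
%   clear traffic <->  Tails  (woken at A and at B, on the same trip)
% You get woken up with certainty either way, so the awakening itself is no
% evidence about the traffic. (Assumes amsmath for \text.)
\[
P(\text{bad} \mid \text{woken})
  = \frac{P(\text{woken} \mid \text{bad}) \, P(\text{bad})}{P(\text{woken})}
  = \frac{1 \cdot \frac{1}{2}}{1}
  = \frac{1}{2}
\]
```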
and have no particular reason to think those people are very unusual in terms of cooking-skill
Yeah, that's what I was trying to get at with the typical-friend-group stuff. The people you know well aren't a uniform sample of all people, so you have no reason to conclude that their X-skill is normal for any arbitrary X.
The problem (or maybe just my problem) is that when I say "average", it feels like it activates the mathematical concept (sum divided by count), even though the thing I'm actually referring to is "typical member of the class, as extracted from my mental model". I find myself treating "average" as if it came from real data even when it didn't.
(including just using the default settings, or uninstalling the app and choosing an easier one).
I feel like this was meant to be in jest, but if you were serious: my goal was to beat the stage with certain self-imposed constraints, so relaxing those constraints would've defeated the purpose of the exercise.
Anyways, it wasn't my friend's specific change that was surprising; it was the fact that he had come up with an optimization at all. My strategy was technically viable but incredibly tedious, and even though I knew it was tedious, I was planning on trying it as-is anyways. It was like I was committed to eating a bowl of soup with a fork and my friend pointed out the existence of spoons. The big realization was that I had a dumb strategy that I (on some level) knew was dumb, but I didn't have the "obvious" next thought of trying to make it better.