Comments
"Well, I'm not sure. As you mention, it depends on the step size. It also depends on how vulnerable to adversarial inputs LLMs are and how easy they are to find. I haven't looked into the research on this, but it sounds empirically checkable. If there are lots of adversarial inputs which have a wide array of impacts on LLM behavior, then it would seem very plausible that the optimized planner could find useful ones without being specifically steered in that direction."
I am really interested in hearing CBiddulph's thoughts on this. Do you agree with Abram?
I have three other concrete concerns about this strategy. If I understand it correctly, the plan is for humans to align the first AGI, then for that AGI to align the next AGI, and so forth (until ASI).
- What if the strategy breaks on the first step? What if the first AGI turns out to be deceptive (scheming) and only pretends to be aligned with humans? It seems like if we task such a deceptive AGI with aligning other AGIs, then we will end up with a pyramid of misaligned AGIs.
- What if the strategy breaks later down the line? What if AGI #21 accidentally aligns AGI #22 to be deceptive (scheming)? Would there be any fallback mechanisms we can rely on?
- What is the end goal? Do we stop once we achieve ASI? Can we stop once we achieve ASI? What if ASI doesn't agree and instead opts to continue self-improving? Are we going to be able to get to the point where the acceleration of ASI's intelligence plateaus and we can recuperate and plan for the future?
Sure, I might as well ask my question directly about scalable oversight, since it seems like a leading strategy for iterative alignment anyway. I do have one preliminary question (which probably isn't worth including in that post, given that it asks about people's expectations rather than about a specific issue or threat model).
I take it that this strategy relies on evaluation being easier than coming up with research? Do you expect this to be the case?
That's pretty interesting. I do think that if the iterative alignment strategy ends up working, then this will probably end up working too (if for no other reason than that this seems much easier).
I have some concerns left about the iterative alignment strategy in general, so I will try to write them down below.
EDIT: On second thought, I might create a separate question for it (and link it here), for the benefit of everyone who is concerned about the same (or similar) things that I am.
This topic is quite interesting to me from the perspective of human survival, so if you do decide to make a post specifically about preventing jailbreaking, then please tag me (somewhere) so that I can read it.
I agree with the comments by both you and Seth. I guess that isn't really part of alignment as usually understood. However, I think it is part of a broader strategy for preventing AI from killing humans, so it's still pretty important for our main goal.
I am not exactly sure I understand your proposal. Are you proposing that we radically gut our leading future model by restricting it severely? I don't think any AI labs will agree to do so, because such a future AI would probably be much less useful than even current AIs.
Or are you proposing that we use AI monitors to monitor our leading future AI models and then we heavily restrict only the monitors?
I also have a more meta-level layman concern (sorry if it sounds unusual). There seem to be a large number of jailbreaking strategies that all succeed against current models. To mitigate them, I can conceptually see 2 paths: 1) trying to come up with a different niche technical solution to each and every one of them individually or 2) trying to come up with a fundamentally new framework that happens to avoid all of them collectively.
Strategy 1 seems logistically impossible, as developers at leading labs (which are most likely to produce AGI) would have to be aware of all of them (and they are often reported in relatively unknown papers). Furthermore, even if they somehow managed to monitor all reported jailbreaks, they would have to come up with so many different solutions that it seems very unlikely to succeed.
Strategy 2 seems conceptually correct, but there seems to be no sign of it, as even newer models are getting jailbroken.
What do you think?
Looking more generally, there seem to be a ton of papers that develop sophisticated jailbreak attacks (that succeed against current models). Probably more than I can even list here. Are there any fundamentally new defense techniques that can protect LLMs against these attacks (since the existing ones seem to be insufficient)?
EDIT: The concern behind this comment is better detailed in the next comment.
"Perhaps the missing piece is that I think alignment is already solved for LLM agents."
Another concern that I have is that maybe it only seems like alignment is solved for LLMs. For example, this, this, this and this short paper argue that seemingly secure LLMs may not be as safe as we initially believe. And it appears that they test even the models that are considered to be more secure and still find this issue.
That's a really good point. I would like to see John address it, because it seems quite crucial for the overall alignment plan.
Thank you for both of your responses, they address all of my worries. Good luck on all of your future projects!
Søren Elverlin has recently released a video critique of this article on YouTube, claiming that it strongly fails to deliver. He goes over the entire piece and criticizes most of it.
Interested in hearing your thoughts. The video is very detailed (and high-quality) and engages directly with your article. It also raises valid points about its shortcomings. How would you address those?
Yep, Seth has really clearly outlined the strategy and now I can see what I missed on the first reading. Thanks to both of you!
"If you have an agent that's aligned and smarter than you, you can trust it to work on further alignment schemes. It's wiser to spot-check it, but the humans' job becomes making sure the existing AGI is truly aligned, and letting it do the work to align its successor, or keep itself aligned as it learns."
Ah, that's the link that I was missing. Now it makes sense. You can use AGI as a reviewer for other AGIs, once it is better than humans at reviewing AGIs. Thank you a lot for clarifying!
I've been reading a lot of the stuff that you have written and I agree with most of it (like 90%). However, one thing which you mentioned (somewhere else, but I can't seem to find the link, so I am commenting here) and which I don't really understand is iterative alignment.
I think that the iterative alignment strategy has an ordering error – we first need to achieve alignment to safely and effectively leverage AIs.
Consider a situation where AI systems go off and “do research on alignment” for a while, simulating tens of years of human research work. The problem then becomes: how do we check that the research is indeed correct, and not wrong, misguided, or even deceptive? We can’t just assume this is the case, because the only way to fully trust an AI system is if we’d already solved alignment, and knew that it was acting in our best interest at the deepest level.
Thus we need to have humans validate the research. That is, even automated research runs into a bottleneck of human comprehension and supervision.
The appropriate analogy is not one researcher reviewing another, but rather a group of preschoolers reviewing the work of a million Einsteins. It might be easier and faster than doing the research itself, but it will still take years and years of effort and verification to check any single breakthrough.
Fundamentally, the problem with iterative alignment is that it never pays the cost of alignment. Somewhere along the story, alignment gets implicitly solved.
I am also interested in knowing whether the probing method is a solution to the undetectable backdoor problem.
This paper argues that unintended deceptive behavior cannot be detected by the probing method. According to its authors, probing fares no better than random guessing at detecting such behavior.
I would really appreciate any input, especially from Monte or his co-authors. This seems like a very important issue to address.
James, thank you for a well-written comment. It was a pleasure to read. Looking forward to Seth's response. Genuinely interested in hearing his thoughts.
I do believe that Shubhorup's second edit provides another cause for concern. If you combine his proposal with steganography, the collusion team has another way of successfully circumventing Ryan's safety strategy.
This collusion team's strategy should also ideally be addressed.
This paper creates a new 1-bit strategy for circumventing Ryan and Buck's safety strategy. The collusion team achieves a 90%+ success rate.
I would really appreciate any input, especially from Ryan or his co-authors. This seems like a very important issue to address.
I wonder, have the comments managed to alleviate your concerns at all? Are there any promising ideas for multipolar AGI scenarios? Were there any suggestions that could work?