comment by Jack O'Brien (jack-o-brien) · 2022-10-05T02:31:41.085Z
Fantastic! Here's my summary:
Premises:
- A recursively self-improving singleton is the most likely AI scenario.
- To mitigate AI risk, the easiest solution is to build a fully aligned singleton on the first try; other approaches are harder because they also require solving coordination.
- By default, AI will become misaligned when it generalises beyond human capabilities. We must apply a security mindset and be doubtful of most claims that an AI will remain aligned when it generalises.
- We should prioritise research that tackles the hard part of the alignment problem, rather than research that is tractable but hand-wavy.
- Many currently tractable alignment solutions hand-wave away the hard parts of the problem (and possibly make the problem worse in expectation).
- More formal/rigorous research can produce more robust guarantees about AI safety than other research.
Conclusion: We should prioritise working out what we even want from an aligned AI, formalise these ideas into concrete desiderata, and then build AI systems which meet those desiderata.
comment by jacob_cannell · 2022-10-07T18:52:22.412Z
> the values we want are a very narrow target and we currently have no solid idea how to do alignment, so when AI does take over everything we're probly going to die. or worse, if for example we botch alignment.
We can build altruistic AGI without learning human values at all, as AGI can optimize for human empowerment (our ability to fulfill all long-term goals).[1]
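For concreteness, "empowerment" here usually gets the standard information-theoretic formalization from that literature (the notation below is mine, not something stated in this comment or in the cited paper): the n-step empowerment of a state is the channel capacity between the human's next n actions and the state that results,

$$\mathfrak{E}_n(s) \;=\; \max_{p(a_1,\dots,a_n)} I\!\left(A_1,\dots,A_n \,;\, S_{n+1} \,\middle|\, S_1 = s\right).$$

An AI that acts to keep the human's $\mathfrak{E}_n$ high is, roughly, preserving the human's ability to steer the future into many distinct outcomes, without ever needing a model of what the human actually values.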
> if aligning current ML models is impossible or would take 50 years, and if aligning something different could take as little as 5 years, then we need to align something else.
The foundational formal approach has already been pursued for over 20 years and has shown very few signs of progress. On the contrary, its main original founder/advocate seems to have given up, declaring doom. What makes you think it could succeed in as little as 5 years? Updating on the success of DL, what makes you think that DL-based alignment would take 50?
[1] Franzmeyer, Tim, Mateusz Malinowski, and João F. Henriques. "Learning Altruistic Behaviours in Reinforcement Learning without External Rewards." arXiv preprint arXiv:2107.09598 (2021).
comment by Alex Flint (alexflint) · 2022-10-05T18:53:24.141Z
This is one of the clearest top-to-bottom accounts of the alignment problem and related world situation that I've seen here in a while. Thank you for writing it.
> i believe, akin to the yudkowsky-moore law of mad science, that the amount of resources it takes for the world to be destroyed — whether on purpose or by accident — keeps decreasing.
Yes, it seems that in this particular way the world is becoming more and more unstable.
> pretty soon (probly this decade or the next), an artificial intelligence capable of undergoing recursive self-improvement (RSI) until it becomes a singleton, and at that point the fate of at least the entire future lightcone will be determined by the goals of that AI.
I think the risk is that one way or another we lock in some mostly-worthless goal to a powerful optimization process. I don't actually think RSI is necessary for that. Beyond that, in practical ML work, we keep seeing systems that are implemented with a world model that is very far from being able to make sense of their own implementation, and actually we seem to be moving further in this direction over time. Google's SayCan, for example, seems to be further from understanding its own implementation than some much more old-fashioned robotics tech from the 1990s (which wasn't exactly close to being able to reason about its own implementation within its world model, either).
> the values we want are a very narrow target and we currently have no solid idea how to do alignment, so when AI does take over everything we're probly going to die. or worse, if for example we botch alignment.
Don't assume that the correct solution to the alignment problem consists of alignment to a utility function defined over physical world states. We don't know that for sure and many schools of moral philosophy have formulated ethics in a way that doesn't supervene on physical world states. It's not really clear to me that even hedonistic utilitarianism can be formulated as a utility function over physical world states.
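To make the distinction concrete (my notation, not the commenter's): a utility function "over physical world states" has type $U : \mathcal{S} \to \mathbb{R}$, whereas many ethical views are more naturally written as

$$U : \mathcal{H} \to \mathbb{R}, \qquad \mathcal{H} = \bigcup_{T \ge 1} (\mathcal{S} \times \mathcal{A})^{T},$$

i.e. defined over whole histories of states and actions, so that how an outcome was brought about, and not just which state obtains at the end, can carry moral weight.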
> what i call "sponge coordination" is getting everyone who's working on AI to only build systems that are weak and safe just like a sponge, instead of building powerful AIs that take over or destroy everything
I really like the term "sponge coordination" and the definition you've given! And I agree that it's not viable. The basic problem is that we humans are ourselves rapidly becoming unbounded optimizers and so the current world situation is fundamentally not an equilibrium, and we can't just make it an equilibrium by trying to. A solution looks like a deeper change than just asking everyone to keep doing mostly what they're currently doing except not building super-powerful AIs.
> [pivotal acts] right now, we're at least able to work on alignment, and the largest AI capability organizations are at least somewhat interacting with the alignment community; it's not clear how that might evolve in the future if the alignment community is perceived to be trying to harm the productivity of AI capability organizations
There are two basic responses possible: either don't perform pivotal acts, or move to a situation where we can perform pivotal acts. It is very difficult to resolve the world situation with AI if one is trying to do so while not harming the productivity of any AI companies. That would be like trying to upgrade a bird to a jumbo jet while keeping it alive throughout.
> it's not clear that we have to aim for first non-singleton aligned AIs, and then FAS; it could be that the best way to maximize our expected utility is to aim straight for FAS.
This is true. Thank you for having the courage to say it.
> in retrospect of this now-formalized view of the work at hand, a lot of my criticisms of approaches to AI alignment i've seen are that they're either [hand-wavey or uncompetitive]
Indeed. It's a difficult problem and it's okay to formulate and discuss hand-wavey or uncompetitive plans as a stepping stone to formulating precise and competitive plans.
> finally, tractable solutions — by virtue of being easier to implement — risk boosting AI capability, and when you're causing damage you really want to know that it's helping alignment enough to be an overall expected net good
This is a very good sentence.
> this is the sort of general robustness i think we'll be needing, to trust an AI with singletonhood. and without singletonhood
I think you're right to focus on both formalism and trustworthiness, though please do also investigate whether and in what way the former actually leads to the latter.
> nothing guarantees that, just because ML is how we got to highly capable unaligned systems, it's also the shortest route to highly capable aligned systems
Yeah, well said.
> a much more reasonable approach is to first figure out what "aligned" means, and then figure out how to build something which from the ground up is designed to have that property
This is my favorite sentence in your essay.
> in my opinion we should figure out what alignment means, what desiderata would formalize it, and then build something that has those.
Again, just notice that you are assuming a kind of pre-baked blueprint here, simply by putting in the middle step "what desiderata would formalize it". Formal systems are a tool like any other tool: powerful and useful, but it's important to understand the contexts in which they are effective and the contexts in which they are ineffective.
comment by Jonathan Claybrough (lelapin) · 2022-12-24T12:49:34.132Z
I definitely agree on thinking deeply about how relevant different approaches are, but I think framing it as "relevance over tractability" is weird, because tractability is part of what makes something relevant.
Maybe deep learning neural networks really are too difficult to align and we won't solve the technical alignment problem by working on them, but interpretability could still be highly relevant by proving that this is the case. We currently have no hard information ruling out AGI arising from scaling up transformers, nor showing that such systems are impossible to align under self-modification (though, uh, I wouldn't count on it being possible; I would recommend not allowing self-modification, etc.). More information is useful if it helps activate the brakes on developing these dangerous AIs.
In a world where all AI alignment people are convinced DL is dangerous and so work on other things, the world gets destroyed by DL people who were never shown that it's dangerous. The universe won't guarantee that other people will be reasonable, so AI safety folk bear the full responsibility of proving the danger, etc.
comment by Krieger · 2022-10-05T03:50:49.187Z
It seems like the AI risk mitigation solutions you've listed aren't mutually exclusive; rather, we'll likely have to use a combination of them to succeed. While I agree that it would be ideal for us to end up with a FAS, the pathway towards that outcome would likely involve "sponge coordination" and "pivotal acts" as mechanisms by which our civilization can buy some time before FAS.
A possible scenario in a world where FAS takes some time (chronological):
- "sponge coordination" with the help of various AI policy work
- one cause-aligned lab executes a "pivotal act" with their AI (not a Friendly FAS, but likely corrigible) and ends the period of vulnerability from humanity being a sponge; now the vulnerability lies with the human operators running this Pivotal Act AI.
- FAS takeover, happy ending!
Of course, we wouldn't need all of this if FAS happens to be the first capable AGI to be developed (kinda unlikely in my model). I would like to know which scenarios you think are most likely to happen (or that we should aim towards), or if I overlooked any other pathways. (also relevant [LW · GW])