A meta-related comment from someone who's not deep into alignment (yet) but does work in AI/academia.
My impression on reading LessWrong has been that the people who are deep into alignment research are generally spending a great deal of their time working on their own independent research agendas, which - naturally - they feel are the most fruitful paths to take for alignment.
I'm glad that we seem to be seeing a few more posts of this nature recently (e.g. with Infra-Bayes, etc) where established researchers spend more of their time both investigating and critiquing others' approaches. This is one good way to get alignment researchers to stack more, imo.
A fair objection.
I had a quick search online and also flicked through Boyd's Convex Optimization, and didn't find Stuart Russell's claim expounded on. Would you be able to point me in a direction to look further into this?
Nevertheless, let me try to provide more detailed reasoning for my counterclaim. I assume that Russell's claim is indeed true in the classic optimisation domain, where there is a function f: R^N -> R to be optimised over x, along with some inequality constraints on (a subset of) the components of x.
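To make that concrete, here's a minimal sketch (my own toy example, not taken from Russell or Boyd) of the extremal-values behaviour in the classical setting: minimising a linear objective over simple box constraints drives every coordinate to the boundary of its allowed range.

```python
# A minimal sketch (my own toy example, not from Russell or Boyd): minimising a linear
# objective c.x over the box [-1, 1]^5 pushes every coordinate to an extreme value.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
c = rng.normal(size=5)                      # an arbitrary linear objective f(x) = c.x

res = linprog(c, bounds=[(-1.0, 1.0)] * 5, method="highs")

print(res.x)    # each component sits at -1 or +1, i.e. a vertex of the feasible box
print(res.fun)  # the minimum value of f
```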
However, I argue that this is not a good model for maximising a utility function in the real world.
First of all, it is not necessarily possible to freely search over x, as x corresponds to environmental states. All classic optimisation techniques that I know of assume that you may set x to any value, regardless of the values x has previously taken. This is not the case in the real world: there are many environmental states which are not accessible from other environmental states. For example, if Earth were swallowed up by a black hole, we could never again restore the environment of me typing out this response to you on LW.
In effect, what I'm describing is the difference between optimising in an RL setting and in the classical setting. And whilst I can believe some result on extremal values exists in the classical setting, I'd be very surprised indeed if something similar holds in the RL setting, particularly when the transition matrices are unknown to the agent, i.e. it does not already have a perfect model of the environment.
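As a toy illustration of what I mean by irreversibility (my own construction, not from any literature I know of): in the tiny MDP below, one state is absorbing, so an agent that wanders into it while trying to evaluate f loses access to every other state for good.

```python
# A toy MDP (my own construction) where state 2 is absorbing: once the agent enters it,
# states 0 and 1 (and their values of f) become unreachable, unlike in the classical
# setting where f can be evaluated at any x at any time.
import numpy as np

transition = np.array([[0, 1, 2],    # next state indexed by (state, action)
                       [0, 1, 2],
                       [2, 2, 2]])   # state 2 is a trap: every action stays in 2
f = np.array([0.0, 1.0, 0.5])        # the 'utility' of each state

rng = np.random.default_rng(1)
state, evaluations = 0, []
for _ in range(20):                  # naive exploration that ignores reversibility
    state = transition[state, rng.integers(3)]
    evaluations.append((state, f[state]))

print(evaluations)  # after the first visit to state 2, every later evaluation is stuck there
```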
So I've laid out my skepticism about the extremal-values claim in RL, but is there any reason to believe my counterclaim that RL optimisation naturally leads to non-extremal choices? Here I think I'll have to be handwavy and gesture-y again, for now (afaik, no literature exists pertaining to this topic and what I'm going to say next, but please do inform me if this is not the case).
Any optimisation process requires evaluating f(x) for different values of x. In order to be able to evaluate f(x), the agent has two distinct choices:
- Either it can try setting the environment state to x directly;
- Or it can build a model f* of f, and evaluate f*(x) as its estimate;
(roughly, this corresponds to model-free and model-based RL respectively)
Utilising option 1 is likely to be highly suboptimal for finding the global optimum if the environment is highly 'irreversible', i.e. there are many states x such that, once you enter them, you are closed off from a large part of the remaining state space X. Better is to build the model f* as 'safely' as possible, with few direct evaluations, chosen so that you are reasonably sure they keep your future choices of x as open as possible. I think this is 'obvious' in a worst-case analysis over possible functions f, but it also feels true in an average case with some kind of uniform prior over f.
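Here's a rough sketch of the kind of cautious evaluation strategy I have in mind (entirely my own toy construction; `f_true`, `is_probably_reversible` and the candidate set are illustrative assumptions, not anything from the literature): evaluate f directly only in states you are fairly confident you can back out of, and fit a surrogate f* to estimate f everywhere else.

```python
# A rough sketch of 'cautious' evaluation (my own toy construction; f_true and
# is_probably_reversible are illustrative stand-ins): evaluate f directly only in
# states judged reversible, and use a fitted surrogate f* for everything else.
import numpy as np

rng = np.random.default_rng(2)
candidates = rng.uniform(-3, 3, size=(200, 4))          # candidate environment states x

def f_true(x):                                          # the real utility, unknown to the agent
    return x @ np.array([1.0, -2.0, 0.5, 0.0])

def is_probably_reversible(x):                          # toy proxy: treat extreme states as risky
    return np.all(np.abs(x) < 2.0, axis=-1)

safe = is_probably_reversible(candidates)
X_safe = candidates[safe]
y_safe = f_true(X_safe)                                 # direct evaluations, safe states only

w_hat, *_ = np.linalg.lstsq(X_safe, y_safe, rcond=None) # fit surrogate f*(x) = x . w_hat
estimates = candidates @ w_hat                          # estimate f everywhere, risky states included

best = candidates[np.argmax(estimates)]
print(best)   # a choice informed by f*, made without ever entering a hard-to-reverse state
```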
And now for the most handwavy part: I suspect most elements of the state vector x representing the universe are much more commonly irreversible at extreme values than at non-extremal values. But really, this is a bit of a distraction from the headline point - regardless of whether the values are extremal or not, I think an intelligent enough agent will be reluctant to enter states which it is not sure it can reverse back out of, and that, for me, is 'cautious' behaviour.
AGI is happening soon. Significant probability of it happening in less than 5 years.
I agree that there is at least some probability of AGI within 5 years, and my median is something like 8-9 years (which is considerably earlier than that of most of the research community, and also of most of the alignment/safety/LW community, afaik).
Yet I think that the following statements are not at all isomorphic to the above, and are indeed - in my view - absurdly far off the mark:
We don’t have any obstacle left in mind that we don’t expect to get overcome in more than 6 months after efforts are invested to take it down.
If you have technical understanding of current AIs, do you truly believe there are any major obstacles left? The kind of problems that AGI companies could reliably not tear down with their resources?
Let's look at some examples to see why.
- DeepMind's AlphaGo took at least 1.5 years of development to reach human professional standard, possibly closer to 2 years.
- DeepMind's AlphaFold - essentially a simple supervised learning problem at its core - was an internal project for at least 3 years before culminating in the Nature paper version.
- OpenAI's DOTA-playing OpenAI Five again took at least 2.5 years of development to reach human professional level (arguably sub-professional, after humans had more time to adapt to its playstyle) on a restricted format of the game.
In all 3 cases, the teams were large, well-funded, and focused throughout the time periods on the problem domain.
One may argue that a) these happened in the past, and AI resources/compute/research-iteration-speed are all substantially better now, and b) the above projects did not have the singular focus of the entire organisation. And I would accept these arguments. However, the above are all highly constrained problems, with particularities eminently well suited to modern AI techniques. The space of 'all possible obstacles' and 'all problems' is vastly larger than this.
I wonder what model of AI R&D you guys have that gives you the confidence to make such statements in the face of what seems to me to be strong contrary empirical evidence.