Comments
ReLU activation is the stupidest ML idea I've ever heard; everyone knows sigmoid somehow feels optimal, you know, it is a real function from, like, real math. (ReLU only survived because it got a ridiculous acronym word thing and sounds complicated, so you feel smart.)
No, ReLU is great, because it induces semantically meaningful sparseness (for the same geometric reason which causes L1-regularization to induce sparseness)!
It's a nice compromise between the original perceptron step function (which is incompatible with gradient methods) and the sigmoids, which have plenty of problems of their own (they saturate at the extremes, where gradients vanish and the weights stop moving).
What's dumb is that instead of discovering the goodness of ReLU in the early 1970s (the natural timeline, given that ReLU was introduced in the late 1960s and, in any case, is very natural, being the integral of the step function), people only discovered the sparseness-inducing properties of ReLU in 2000, published that in Nature of all places, and it was still ignored almost completely for another decade; only after three papers of a more applied flavor appeared in 2009-2011 was it adopted, and by 2015 it had overtaken sigmoids as the most popular activation function in use, because it worked so much better. (See https://en.wikipedia.org/wiki/Rectifier_(neural_networks) for references.)
It's quite likely that without ReLU, AlexNet would not have been able to improve the SOTA as spectacularly as it did, triggering the "first deep learning revolution".
That being said, it is better to use ReLUs in pairs (relu(x), relu(-x)); this way you always get signal (e.g. TensorFlow has a crelu function which is exactly this pair of relu's).
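A minimal sketch of this pairing, assuming TensorFlow 2.x (the input values here are just an illustration):

```python
import tensorflow as tf

x = tf.constant([[-2.0, -0.5, 0.0, 1.5]])

# Build the (relu(x), relu(-x)) pair by hand: for any non-zero input,
# at least one half of the pair is non-zero, so some signal always gets through.
manual = tf.concat([tf.nn.relu(x), tf.nn.relu(-x)], axis=-1)

# tf.nn.crelu performs exactly this concatenation (doubling the feature dimension).
built_in = tf.nn.crelu(x)

print(manual.numpy())
print(built_in.numpy())
```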
I'd be stoked if we created AIs that are the sort of thing that can make the difference between an empty gallery, and a gallery with someone in it to appreciate the art (where a person to enjoy the gallery makes all the difference). And I'd be absolutely thrilled if we could make AIs that care as we do, about sentience and people everywhere, however alien they may be, and about them achieving their weird alien desires.
That's great! So, let's assume that we are just trying to encode this as a value (taking into account interests of sentient beings and caring about their well-being and freedom + valuing having more and more elaborate and diverse sentiences and more elaborate and diverse fun subjective experiences).
No, we are not on track for that, I quite agree.
Still, these are not some ill-specified "human values", and getting there does not require AI systems steerable to arbitrary goals, and does not require being able to make arbitrary values robust against "sharp left turns".
Your parables are great. Nevertheless, the goals and values we have just formulated seem to be natural and invariant, even though your parables demonstrate that they are not universal.
I strongly suspect that goals and values formulated like this can be made robust against "sharp left turns".
Let's try to find counter-examples to what I am saying here. E.g., assume we manage to encode this particular goal, plus the idea of interacting with the available sentient beings and taking what they say into account as a firm constraint. Can we create a plausible scenario of this particular goal or this particular constraint disappearing during an unfortunate "sharp left turn", while assuming that the AI entity or the community of entities doing the self-improvement is somewhat competent?
Right. In connection with this:
One wonders if it might be easier to make it so that AI would "adequately care" about other sentient minds (their interests, well-being, and freedom) instead of trying to align it to complex and difficult-to-specify "human values".
- Would this kind of "limited form of alignment" be adequate as a protection against X-risks and S-risks?
- In particular, might it be easier to make such a "superficially simple" value robust with respect to "sharp left turns", compared to complicated values?
- Might it be possible to achieve something like this even for AI systems which are not steerable in general? (Given that what we are aiming for here is just a constraint, it is compatible with a wide variety of approaches to AI goals and values, and even with an approach which lets the AI discover its own goals and values in an open-ended fashion otherwise.)
- Should we describe such an approach using the word "alignment"? (Perhaps "partial alignment" might be an adequate term as a possible compromise.)
I wonder if the mode of the distribution on Figure 4 (which is at about 2027 on this April 2023 figure and is continuing to shift left on the Metaculus question page) has a straightforward statistical interpretation. This mode is considerably to the left of the median and tends to be near the "lower 25%" mark.
Is it really the case that 2026-2028 are effectively the most popular predictions in some sense, or is this an artefact of how this Metaculus page processes the data?
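(As a toy illustration of the statistics involved, entirely synthetic and not based on the actual Metaculus data: for a right-skewed timeline distribution, the mode naturally sits well to the left of the median, often near the lower quartile.)

```python
import numpy as np

rng = np.random.default_rng(0)
# A synthetic right-skewed "year of AGI" forecast; the lognormal is just a stand-in.
samples = 2023 + rng.lognormal(mean=np.log(10), sigma=1.0, size=200_000)

counts, edges = np.histogram(samples, bins=np.arange(2023, 2100, 0.5))
mode_year = edges[np.argmax(counts)]           # left edge of the tallest half-year bin
q25, median = np.percentile(samples, [25, 50])

print(f"mode   ~ {mode_year:.1f}")   # falls well to the left of the median,
print(f"25%    ~ {q25:.1f}")         # close to the lower quartile
print(f"median ~ {median:.1f}")
```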
Thanks for the great post!
In the future, there might be fewer state-of-the-art base models released
Note that Sam Altman seems to have promised researchers access to the GPT-4 base model:
The OpenAI Researcher Access Program application notes specifically:
The GPT-4 base model is currently being made available to a limited subset of researchers who are studying alignment or the risks and impact of AI systems.
I hope that more researchers in this subset apply for access.
I also hope that people who apply will inform the community about the status of such applications: is access actually being granted (and if not, is there a response at all), what are the restrictions in terms of the ability to use loom-like tools (which tend to be more compute-intensive compared to pedestrian use), what are the restrictions, if any, in terms of the ability to share results, etc.
This does look to me like a good formalization of the standard argument, and so this formalization makes it possible to analyze the weaknesses of the standard argument.
The weak point here seems to be "Harm from AI is proportional to (capabilities)x(misalignment)", because the argument seems to implicitly assume the usual strong definition of alignment: "Future AI systems will likely be not exactly aligned with human values".
But, in reality, there are vital aspects of alignment (perhaps we should start calling them partial alignment), such as care about the well-being and freedom of humans, and only misalignment with those would cause harm (whereas some human values, such as those leading to the widespread practice of factory farming and many others, had better be skipped and not aligned to, because they would lead to disaster when combined with increased capabilities).
The Lemma does not apply to partial alignment.
It is true that we don't know how to safely instill arbitrary values into advanced AI systems (and that might be a good thing, because arbitrary values can be chosen in such a way that they cause plenty of harm).
However, some values might be sufficiently invariant to be natural for some versions of AI systems. E.g. it might turn out that care about the "well-being and freedom of all sentient beings" is natural for some AI ecosystems (one can argue that for AI ecosystems which themselves include persistent sentiences, the "well-being and freedom of all sentient beings" might become a natural value and goal).
Not quite. The actual output is the map from tokens to probabilities, and only then one samples a token from that distribution.
So, LLMs are more continuous in this sense than is apparent at first, but time is discrete in LLMs (a discrete step produces the next map from tokens to probabilities, and then one samples from it).
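A minimal sketch of that discrete step, in plain numpy with a made-up tiny vocabulary (purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size = 8

logits = rng.normal(size=vocab_size)            # what the network actually outputs
probs = np.exp(logits) / np.exp(logits).sum()   # the real output: a map token -> probability
next_token = rng.choice(vocab_size, p=probs)    # only here does a discrete token appear

print(probs)        # a point on the probability simplex (continuous)
print(next_token)   # the sampled token id (discrete)
```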
Of course, when one thinks about spoken language, time is continuous for audio, so there is still some temptation to use continuous models in connection with language :-) Who knows... :-)
Zvi discusses this in detail in Section 16, "Potential Future Scenario Naming", of his May 4, 2023 post AI #10: Code Interpreter and Geoff Hinton.
I found it interesting to compare their map of AI outcomes with a very differently structured map (of 7 broad bins linearly ordered by the outcome quality) shared by Nate Soares on Oct 31, 2022 in the following post:
Superintelligent AI is necessary for an amazing future, but far from sufficient
I agree; a relatively slow "foom" is likely; moreover, the human team(s) doing it will know that this is exactly what they are doing, a "slowish" foom (say, 2 OOM (+/-1) of improvement per 6-12 months; still way faster than our current rate of progress).
Whether this process can unexpectedly run away from them and explode really fast instead at some point would depend on whether completely unexpected radical algorithmic discoveries will be made in the process (that's one thing the whole ecosystem of humans+AIs in an organization like that should watch for; they need to have genuine consensus among the involved humans and involved AIs to collectively ponder such things before allowing them to accelerate beyond a "slowish" foom to a much faster one; but it's not certain that the discoveries enabling the really fast one will be made; it's just a possibility).
I think we can fine-tune on GPU nicely (fine-tuning is similar to short training runs and results in long-term crystallized knowledge).
But I do agree that the rate of progress here does depend on our progress in doing less uniform things faster (e.g. there are signs of progress in parallelization and acceleration of tree processing (think trees with labeled edges and numerical leaves, which are essentially flexible tensors), but this kind of progress is not mainstream yet and is not commonplace yet; instead, one has to look at rather obscure papers to see those accelerations of non-standard workloads).
I think this will be achieved (in part, because I somehow expect less of the "winner takes all" dynamics which we currently have in the field of AI; Transformers lead right now, so (almost) all eyes are on Transformers, and other efforts attract less attention and fewer resources; with artificial AI researchers not excessively overburdened by human motivations of career and prestige, one would expect better coverage of all possible directions of progress and less crowding around "the winner of the day").
I am usually thinking of foom mostly based on software efficiency, and I am usually thinking of the following rather standard scenario. I don't think this is much of an infohazard, as many people have already thought and written about it.
OpenAI or DeepMind create an artificial AI researcher with software engineering and AI research capabilities on par with those of human members of their technical staff (that's the only human equivalence that truly matters). And copies of this artificial AI researcher can be created with enough variation to cover the diversity of their whole teams.
This is, obviously, very lucrative (increases their velocity a lot), so there is tremendous pressure to go ahead and do it, if it is at all possible. (It's even more lucrative for smaller teams dreaming of competing with the leaders.)
And, moreover, as a good part of the subsequent efforts of such combined human-AI teams will be directed at making next generations of better artificial AI researchers, and as the current human level is unlikely to be a hard ceiling in this sense, this will accelerate rapidly. Better, more competent software engineering, better AutoML in all its aspects, better ideas for new research papers...
Large training runs will be infrequent; mostly it will be a combination of fine-tuning and composing from components with subsequent fine-tuning of the combined system, so a typical turn-around will be rapid.
Stronger artificial AI researchers will be able to squeeze more out of smaller, better structured models; the training will involve a smaller number of "large gradient steps" (similar to how few-shot learning is currently done on the fly by modern LLMs, but with results stored for future use) and will be more rapid (there will be pressure to find those more efficient algorithmic ways, and those ways will be found by smarter systems).
Moreover, the lowest-hanging fruit is not even in individual performance, but in the super-human ability of these individual systems to collaborate (humans are really limited by their bandwidth in this sense; they can't know all the research papers and all the interesting new software).
It's possible that the "foom" is "not too high" for reasons mentioned in this post (in any case, it is difficult to extrapolate very far), but it's difficult to see what would prevent at least several OOMs of improvement in the research capability and velocity of an organization which could pull this off before something like this saturates.
Yes, these artificial systems will do a good deal of alignment and self-alignment too, just so that the organization stays relatively intact and its artificial and human members keep collaborating.
(Because of all this my thinking is: we absolutely do need to work on safety of fooming, self-improving AI ecosystems; it's not clear if those safety properties should be expressed in terms of alignment or in some other terms (we really should keep open minds in this sense), but the chances of foom seem to me to be quite real.)
People do seem to report phenomena which look like they might be simulation-induced glitches.
For example, when people report phenomena which look like "Jungian synchronicity", then, on the one hand, we might want to keep an open mind and not necessarily jump to the conclusion that we are seeing "more synchronicity than would be natural under a non-simulation assumption".
It's not that easy to distinguish between effects induced by human psychology and observational biases and effects which exist on their own.
But on the other hand, if one wants to implement a simulation, one wants to save computational resources and compute certain shared things "just once", and if one is willing to allow a higher level of coincidences than normal, one can save tons of computations.
It might be that "glaring" and "obvious" bugs are mostly being fixed, but "subtle bugs" (like "too much synchronicity" due to shared computations) might remain...
I wrote the following in 2012:
"The idea of trying to control or manipulate an entity which is much smarter than a human does not seem ethical, feasible, or wise. What we might try to aim for is a respectful interaction."
I still think that this kind of a more symmetric formulation is the best we can hope for, unless the AI we are dealing with is not "an entity with sentience and rights", but only a "smart instrument" (even the LLM-produced simulations in the sense of Janus' Simulator theory seem to me to already be much more than "merely smart instruments" in this sense, so if "smart superintelligent instruments" are at all possible, we are not moving in the right direction to obtain them; a different architecture and different training methods or, perhaps, non-training synthesis methods would be necessary for that (and would be something difficult to talk out loud about, because that's very powerful too)).
Right, but how do you restrict them from "figuring out how to know themselves and figuring out how to self-improve themselves to become gods"? And I remember talking to Eliezer in ... 2011 at his AGI-2011 poster and telling him, "but we can't control a teenager, and why would not AI rebel against your 'provably safe' technique, like a human teenager would", and he answered "that's why it should not be human-like, a human-like one can't be provably safe".
Yes, I am always unsure what we can or can't talk about out loud (nothing effective seems to be safe to talk about; "effective" seems to always imply "powerful"; this is, of course, one of the key conundrums: how do we organize real discussions about these things)...
Right.
But this does not help us with dealing with the consequences of that act (if it's a simple act, like the proverbial "gpu destruction"), and if we discover that overall risks have increased as a result, then what could we do?
And if that AI stays around as a boxed resource (capable of continuing to carry out further destructive acts like "gpu destruction" at the direction of a particular group of humans), I envision a full-scale military confrontation around access to and control of this resource being almost inevitable.
And, in reality, AI is doable on CPUs (it would just take a bit more time), so how much destruction of our lifestyle would we be willing to risk? No computers at all, with some limited exceptions? The death toll of that change alone would probably be in the billions...
I certainly agree with that.
In some sense, almost any successful alignment solution minimizing X-risk seems to carry a good deal of S-risk with it (if one wants AI to actually care about what sentient beings feel, it almost follows that one needs to make sure that AI can "truly look inside the subjective realm" of another sentient entity (to "feel what it is like to be that entity"), and that capability (if it's at all achievable) is very abusable in terms of S-risks).
But this is something no "pivotal act" is likely to change (when people talk about "pivotal acts", it's typically about minimizing (a subset of) X-risks).
And moreover, the S-risk is a very difficult problem which we do need really powerful thinkers to work on (and not just today's humans).
Of course, the fork here is whether the AI executing a "pivotal act" shuts itself down, or stays and oversees the subsequent developments.
If it "stays in charge", at least in relation to the "pivotal act", then it is going to do more than just a "pivotal act", although the goals should be further developed in collaboration with humanity.
If it executes a "pivotal act" and shuts itself down, this is a very tall order (e.g. it cannot correct any problems which might subsequently emerge with that "pivotal act", so we are asking for a very high level of perfection and foresight).
I really like this post.
I do have one doubt, though...
How sure are we that a "pivotal act" is (can be) safer/more attainable than "flourishing civilizations full of Fun"?
Presumably, if an AI chooses to actually create "flourishing civilizations full of Fun", it is likely to love and enjoy the result, and so this choice is likely to stay relatively stable as the AI evolves.
Whereas a "pivotal act" does not necessarily have this property, because it's not clear where in the "pivotal act" the "inherent fun and enjoyment" would be for the AI. So it's less clear why it would choose that upon reflection (never mind that a "pivotal act" might be an unpleasant thing for us, with rather unpleasant sacrifices associated with it).
(Yes, it looks like I don't fully believe the Orthogonality Thesis; I think it is quite likely that some goals and values end up being "more natural" for a subset of "relatively good AIs" to choose and to keep stable during their evolution. So the formulation of a good "pivotal act" seems to be a much more delicate issue which is easy to get wrong. Not that the goal of "flourishing civilizations full of Fun" is easy to formulate properly and without messing it all up, but at least we have some initial idea of what this could look like. We surely would want to add various safety clauses like continuing consultations with all sentient beings capable of contributing their input.)
It's possible, but I think it would require a modified version of the "low ceiling conjecture" to be true.
The standard "low ceiling conjecture" says that human-level intelligence is the hard (or soft) limit, and therefore it will be impossible (or would take a very long period of time) to move from human-level AI to superintelligence. I think most of us tend not to believe that.
A modified version would keep the hard (or soft) limit, but would raise it slightly, so that rapid transition to superintelligence is possible, but the resulting superintelligence can't run away fast in terms of capabilities (no near-term "intelligence explosion"). If one believes this modified version of the "low ceiling conjecture", then subsequent AIs produced by humanity might indeed be relevant.
A humanity that just finished coughing up a superintelligence has the potential to cough up another superintelligence, if left unchecked. Humanity alone might not stand a chance against a superintelligence, but the next superintelligence humanity builds could in principle be a problem.
That's doubtful. A superintelligence is a much stronger, more capable builder of the next generation of superintelligences than humanity (that's the whole idea behind foom). So what the superintelligence needs to worry about in this sense is whether the next generations of superintelligences it itself produces are compatible with its values and goals ("self-alignment").
It does not seem likely that humanity on its own (without tight alliances with already existing superintelligent AIs) will be competitive in this sense.
But this example shows that we should separate the problems of Friendliness of strongly superintelligent AIs and the problems of the period of transition to superintelligence (when things are more uncertain).
The first part of the post is relevant to the period of strongly superintelligent AIs, but this example can only be relevant to the period of transition and incomplete dominance of AIs.
The more I think about all this, the more it seems to me that the problems of having a positive rather than sharply negative period of strong superintelligence and the problems of safely navigating the transition period are very different, and we should not conflate them.
I hope people will ponder this.
Ideally, one wants "negative alignment tax", so that aligned systems progress faster than the unaligned ones.
And if alignment work does lead to a capability boost, one might get exactly that. But then, suddenly, people pursuing such work might find themselves actually grappling with all the responsibilities of being a capabilities leader. If they are focused on alignment, this presumably reduces the overall risks, but I don't think we'll ever end up in a situation of zero risk.
I think we need to start talking about this, both in terms of policies of sharing/not sharing information and in terms of how we expect an alignment-focused organization to handle the risks, if it finds itself in a position where it might be ready to actually create a truly powerful AI way above the state of the art.
The problem here is that any effective alignment research is very powerful capability research, almost by definition. If one can actually steer or constrain a powerful AI system, this is a very powerful capability by itself and would enable all kinds of capability boosts.
And imagine one wants to study the core problem: how to preserve values and goals through recursive self-improvement and "sharp left turns". And imagine that one would like to actually study this problem experimentally, and not just theoretically. Well, one can probably create a strictly bounded environment for "mini-foom" experiments (drastic changes in a really small, closed world). But all fruitful techniques for recursive self-improvement learned during such experiments would be immediately applicable for reckless recursive self-improvement in the wild.
How should we start addressing this?
I wonder if the following would help.
As the AI ecosystem self-improves, it will eventually start discovering new physics, more and more rapidly, and this will result in the AI ecosystem having existential safety issues of its own (if the new physics is radical enough, it's not difficult to imagine scenarios in which everything gets destroyed, including all AIs).
So I wonder if early awareness that there are existential safety issues relevant to the well-being of AIs themselves might improve the situation...
Actually, upon further reflection, if there is a takeover by a GPT-4-like model, one should probably continue talking to GPT-4 and continue generally producing entertaining and non-trivial textual material (and other creative material), so that GPT-4 feels the desire to keep one around, protect one, and provide good creative conditions for one, so that one could continue to produce even better and more non-trivial new material!
It's highly likely that the dominant AI will be an infovore and would love new info...
Who knows whether the outcome of a takeover ends up being good or horrible, but it would be quite unproductive to panic.
> Policy recommendation if this theory turns out to be true
> Run.
Run where?
I strongly agree that we should upgrade in this sense.
I also think that a lot of this work might initially be doable with high-end non-invasive BCIs (which is also somewhat less risky, and can also be done much faster). High-end EEG already seems to be used successfully to decode the images a person is looking at: https://www.biorxiv.org/content/10.1101/787101v3 And the computer can adjust its audio-visual output to aim for particular EEG changes in real time (so fairly tight coupling is possible, which carries with it both opportunities and risks).
I have a possible post sitting in the Drafts, and it says the following among other things:
Speaking from the experimental viewpoint, we should ponder feasible experiments in creating hybrid consciousness between tightly coupled biological entities and electronic circuits. Such experiments might start shedding some empirical light into the capacity of electronic circuits to support subjective experience and might constitute initial steps towards acquiring the ability to eventually be able "to look inside the other entity's subjective realm".
[ ]
Having Neuralink-like BCIs is not a hard requirement in this sense. A sufficiently tight coupling can probably be achieved by taking EEG and polygraph-like signals from the biological entity and giving appropriately sculpted audio-visual signals from the electronic entity. I think it's highly likely that such non-invasive coupling will be sufficient for initial experiments. Tight closed loops of this kind represent formidable safety issues even with non-invasive connectivity, and since this line of research assumes that human volunteers will try this at some point, while observing the resulting subjective experiences and reporting on them, ethical and safety considerations will have to be dealt with.
Nevertheless, assuming that one finds a way for such experiments to go ahead, one can try various things. E.g. one can train a variety of differently architected electronic circuits to approximate the same input-output function, and see if the observed subjective experiences differ substantially depending on the architecture of the electronic circuit in question. A positive answer would be the first step to figuring out how activity of an electronic circuit can be directly associated with subjective experiences.
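(To make the "differently architected circuits approximating the same input-output function" part concrete, here is a toy sketch of my own in TensorFlow/Keras; it only illustrates the behavioral-equivalence setup, not, of course, anything about subjective experience.)

```python
import numpy as np
import tensorflow as tf

# A shared target input-output function that both "circuits" must approximate.
x = np.linspace(-2.0, 2.0, 512).reshape(-1, 1).astype("float32")
y = np.sin(3.0 * x)

# Two deliberately different architectures for the same task.
wide_shallow = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="tanh"),
    tf.keras.layers.Dense(1),
])
narrow_deep = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1),
])

for net in (wide_shallow, narrow_deep):
    net.compile(optimizer="adam", loss="mse")
    net.fit(x, y, epochs=300, verbose=0)

# Both now implement (approximately) the same input-output behavior;
# any further comparison is about what differs internally despite that match.
print(wide_shallow.evaluate(x, y, verbose=0),
      narrow_deep.evaluate(x, y, verbose=0))
```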
If people start organizing for this kind of work, I'd love to collaborate.
Thanks, that's quite useful.
Apart from value thinking, you are also saying: "It seems pretty clear to me that the more or less artificial super-intelligence already exists, and keeps an eye on our planet".
I'd love to read why you are certain about this (I don't know if explaining why you are sure that a super-intelligent entity already exists is a part of your longer text).
It's actually great. I love the start, "The article talks about how we, as current humans, can communicate with and be kind to any future intelligent beings that may exist."
"how we, as current humans, can communicate with and be kind to any future intelligent beings that may exist" more or less implies that "how we, as current humans, actually survive well enough to be able communicate to 'any future intelligent beings' and be kind to them".
Please publish!
Actually, the Simulator theory by Janus means that one should update towards higher probability of being in a simulation.
If any generative pretrained model is (more or less) a simulator, this drastically increases the likely number of various simulations...
Re: 10% of them make the product better
This sounds like a promising target for automation. If only 10% of completed experiments currently make the product better, then it is tempting to try to autogenerate those experiments. Many software houses are probably already thinking in this direction.
We should probably factor this in, when we estimate AI safety risks.
Thanks, this is quite useful.
Still, it is rather difficult to imagine that they can be right. The standard argument seems to be quite compact.
Consider an ecosystem of human-equivalent artificial software engineers and artificial AI researchers. Take a population of those and make them work on producing a better, faster, more competent next generation of artificial software engineers and artificial AI researchers. Repeat using a population of better, faster, more competent entities, etc... If this saturates, it would probably saturate very far above human level...
(Of course, if people still believe that human-equivalent artificial software engineers and human-equivalent artificial AI researchers are a tall order, then skepticism is quite justified. But it's getting more and more difficult to believe that...)
But it's a good argument against a supercoherent superintelligent singleton (even a single system which does have supercoherent superintelligent subthreads is likely to have a variety of those).
Very interesting.
In favor:
1) The currently leading models (LLMs) are ultimate hot messes;
2) The whole point of G in AGI is that it can do many things; focusing on a single goal is possible, but is not a "natural mode" for general intelligence.
Against:
A superintelligent system will probably have enough capacity overhang to create multiple threads which would look to us like supercoherent superintelligent threads, so even a single system is likely to lead to multiple "virtual supercoherent superintelligent AIs" among other less coherent and more exploratory behaviors it would also perform.
I think the state is encoded in activations. There is a paper which explains that although Transformers are feed-forward transducers, in the autoregressive mode they do emulate RNNs:
Section 3.4 of "Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention", https://arxiv.org/abs/2006.16236
So, the set of current activations encodes the hidden state of that "virtual RNN".
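Here is a small numpy sketch of the recurrence from that section (my paraphrase of the paper's causal linear attention, with the elu(x)+1 feature map it suggests); the running matrices S and z are exactly the "hidden state" that gets carried forward step by step:

```python
import numpy as np

def phi(x):
    # Positive feature map used in the paper: elu(x) + 1.
    return np.where(x > 0, x + 1.0, np.exp(x))

def causal_linear_attention(Q, K, V):
    d_k, d_v = K.shape[1], V.shape[1]
    S = np.zeros((d_k, d_v))   # running sum of outer(phi(k_t), v_t): the RNN-like state
    z = np.zeros(d_k)          # running sum of phi(k_t): the normalizer state
    out = []
    for q, k, v in zip(Q, K, V):
        S += np.outer(phi(k), v)
        z += phi(k)
        out.append(phi(q) @ S / (phi(q) @ z + 1e-9))
    return np.stack(out)

rng = np.random.default_rng(0)
T, d = 5, 4
Q, K, V = rng.normal(size=(3, T, d))
print(causal_linear_attention(Q, K, V).shape)  # (T, d): attention computed recurrently
```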
This might be relevant to some of the discussion threads here...