Comments
Yeah that could be doable. Dylan's pretty natsec focused already so I would guess he'd take a broad view of the ROI from something like this. From what I hear he is already in touch with some of the folks who are in the mix, which helps, but the core goal is to get random leaf node action officers this access with minimum friction. I think an unconditional discount to all federal employees probably does pass muster with the regs, though of course folks would still be paying something out of pocket. I'll bring this up to SA next time we talk to them though, it might move the needle. For all I know, they might even be doing it already.
Because of another stupid thing, which is that U.S. depts & agencies have strong internal regs against employees soliciting and/or accepting gifts other than in carefully carved out exceptional cases. For more on this, see, e.g., 5 CFR § 2635.204, but this isn't the only such reg. In practice U.S. government employees at all levels are broadly prohibited from accepting any gift with a market value above 20 USD for example. (As you'd expect this leads to a lot of weird outcomes, including occasional hilarious minor diplomatic incidents with inexperienced foreign counterparties who have different gift giving norms.)
Yep, can confirm this is true. And this often leads to shockingly stupid outcomes, such as key action officers at the Office of [redacted] in the Department of [redacted] not reading SemiAnalysis because they'd have to pay for their subscriptions out of pocket.
This is a great & timely post.
Thanks very much for writing this. We appreciate all the feedback across the board, and I think this is a well-done and in-depth write-up.
On the specific numerical thresholds in the report (i.e., your Key Proposal section), I do need to make one correction that also applies to most of Brooks's commentary. All the numerical thresholds mentioned in the report, and particularly in that subsection, are solely examples and not actual recommendations. They are there only to show how one can calculate self-consistent licensing thresholds under the principles we recommend. They are not themselves recommendations. We had to do it this way for the same reason we propose granting fairly broad rule-setting flexibility to the regulatory entity: the field is changing so quickly that any concrete threshold risks being out of date, for one reason or another, in very short order. We would have liked to do otherwise, but that is not a realistic expectation for a report that we expect to be digested over the course of several months.
To avoid precisely this misunderstanding, the report states in several places that those very numbers are, in fact, only examples for illustration. A few screencaps of those disclaimers are below, but there are several others. Of course we could have included even more, but beyond a certain point one is simply adding more length to what you correctly point out is already quite a sizeable document. Note that the Time article, in the excerpt you quoted, does correctly note and acknowledge that the Tier 3 AIMD threshold is there as an example (emphasis added):
the report suggests, as an example, that the agency could set it just above the levels of computing power used to train current cutting-edge models like OpenAI’s GPT-4 and Google’s Gemini.
Apart from this, I do think overall you've done a good and accurate job of summarizing the document and offering sensible and welcome views, emphasis, and pushback. It's certainly a long report, so this is a service to anyone who's looking to go one or two levels deeper than the Executive Summary. We do appreciate you giving it a look and writing it up.
Gotcha, that makes sense!
Looks awesome! Minor correction on the cost of the GPT-4 training run: the website says $40 million, but sama confirmed publicly that it was over $100M (and several news outlets have reported the latter number as well).
Done, a few days ago. Sorry, I thought I'd responded to this comment.
Excellent context here, thank you. I hadn't been aware of this caveat.
Great question. This is another place where our model is weak, in the sense that it has little to say about the imperfect information case. Recall that in our scenario, the human agent learns its policy in the absence of the AI agent; and the AI agent then learns its optimal policy conditional on the human policy being fixed.
It turns out that this setup dodges the imperfect information question from the AI side, because the AI has perfect information on all the relevant parts of the human policy during its training. And it dodges the imperfect information question from the human side, because the human never considers even the existence of the AI during its training.
This setup has the advantage that it's more tractable and easier to reason about. But it has the disadvantage that it unfortunately fails to give a fully satisfying answer to your question. It would be interesting to see if we can remove some of the assumptions in our setup to approximate the imperfect information case.
Agreed. We think our human-AI setting is a useful model of alignment in the limit case, but not really so in the transient case. (For the reason you point out.)
I think you might have reversed the definitions of the two correlation coefficients in your comment,[1] but otherwise I think you're exactly right.
To compute the correlation coefficient between terminal values, naively you'd have a pair of reward functions that respectively assign human and AI rewards over every possible arrangement of matter. Then you'd look at every such reward function pair over your joint distribution, and ask how correlated they are over arrangements of matter. If you like, you can imagine that the human has some uncertainty around both his own reward function over houses, and also over how well aligned the AI is with his own reward function.
And to compute the correlation coefficient between instrumental values, you're correct that some of the arrangements of matter will be intermediate states in some construction plans. So if the human and AI both want a house with a swimming pool, they will both have high POWER for arrangements of matter that include a big hole dug in the backyard. Plot out their respective POWERs at each arrangement of matter, and you can read the correlation right off the alignment plot!
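If it helps to make this concrete, here's a minimal sketch of reading both correlations off once the values are tabulated. All the array names and numbers below are made up purely for illustration; they're not from the actual write-up or codebase.

```python
import numpy as np

# Hypothetical tabulated values, one entry per arrangement of matter (illustrative only).
reward_human = np.array([0.9, 0.1, 0.7, 0.3])   # human terminal reward at each arrangement
reward_ai    = np.array([0.8, 0.2, 0.6, 0.4])   # AI terminal reward at each arrangement
power_human  = np.array([0.5, 0.2, 0.9, 0.4])   # human POWER at each arrangement
power_ai     = np.array([0.6, 0.1, 0.8, 0.5])   # AI POWER at each arrangement

# Correlation over terminal values, for one reward-function pair drawn from the joint distribution.
terminal_corr = np.corrcoef(reward_human, reward_ai)[0, 1]

# Correlation over instrumental values, read off the alignment plot of the two POWERs.
instrumental_corr = np.corrcoef(power_human, power_ai)[0, 1]

print(terminal_corr, instrumental_corr)
```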
[1] Looking again at the write-up, it would have made more sense for us to use the first of the two symbols for the terminal goal correlation coefficient, since we introduce that one first. Alas, this didn't occur to us. Sorry for the confusion.
Good question. Unfortunately, one weakness of our definition of multi-agent POWER is that it doesn't have much useful to say in a case like this one.
We assume AI learning timescales vastly outstrip human learning timescales as a way of keeping our definition tractable. So the only way to structure this problem in our framework would be to imagine a human is playing chess against a superintelligent AI — a highly distorted situation compared to the case of two roughly equal opponents.
On the other hand, from other results I've seen anecdotally, I suspect that if you gave one of the agents a purely random policy (i.e., take a random legal action at each state) and assigned the other agent some reasonable reward function distribution over material, you'd stand a decent chance of correctly identifying high-POWER states with high-mobility board positions.
You might also be interested in this comment by David Xu, where he discusses mobility as a measure of instrumental value in chess-playing.
Thanks for your comment. These are great questions. I'll do the best I can to answer here; feel free to ask follow-ups:
- On pre-committing as a negotiating tactic: If I've understood correctly, this is a special case of the class of strategies where you sacrifice some of your own options (bad) to constrain those of your opponent (good). And your question is something like: which of these effects is strongest, or do they cancel each other out?
It won't surprise you that I think the answer is highly context-dependent, and that I'm not sure which way it would actually shake out in your example with Fred and Bob and the $5000. But interestingly, we did in fact discover an instance of this class of "sacrificial" strategies in our experiments!
You can check out the example in Part 3 if you're interested. But briefly, what happens is that when the agents get far-sighted enough, one of them realizes that there is instrumental value in having the option to bottle up the other agent in a dead-end corridor (i.e., constraining that other agent's options). But it can only actually do this by positioning itself at the mouth of the corridor (i.e., sacrificing its own options). Here is a full-size image of both agents' POWERs in this situation. You can see from the diagram that Agent A prefers to preserve its own options over constraining Agent H's options in this case. But crucially, Agent A values the option of being able to constrain Agent H's options.
In the language of your negotiating example, there is instrumental value in preserving one's option to pre-commit. But whether actually pre-committing is instrumentally valuable or not depends on the context.
- On babies being more powerful than adults: Yes, I think your reasoning is right. And it would be relatively easy to do this experiment! All you'd need would be to define a "death" state, and set your transition dynamics so that the agent gets sent to the "death" state after N turns and can never escape from it afterwards. I think this would be a very interesting experiment to run, in fact.
- On paperclip maximizers: This is a very deep and interesting question. One way to think about this schematically might be: a superintelligent paperclip maximizer will go through a Phase One, in which it accumulates its POWER; and then a Phase Two in which it spends the POWER it's accumulated. During the accumulation phase, the system might drive towards a state where (without loss of generality) the Planet Earth is converted into a big pile of computronium. This computronium-Earth state is high-POWER, because it's a common "way station" state for paperclip maximizers, thumbtack maximizers, safety pin maximizers, No. 2 pencil maximizers, and so on. (Indeed, this is what high POWER means.)
Once the system has the POWER it needs to reach its final objective, it will begin to spend that POWER in ways that maximize its objective. This is the point at which the paperclip, thumbtack, safety pin, and No. 2 pencil maximizers start to diverge from one another. They will each push the universe towards sharply different terminal states, and the more progress each maximizer makes towards its particular terminal state, the fewer remaining options it leaves for itself if its goal were to suddenly change. Like a male praying mantis, a maximizer ultimately sacrifices its whole existence for the pursuit of its terminal goal. In other words: zero POWER should be the end state of a pure X-maximizer![1]
My story here is hypothetical, but this is absolutely an experiment one can do (at small scale, naturally). The way to do it would be to run several rollouts of an agent, and plot the POWER of the agent at each state it visits during the rollout. Then we can see whether most agent trajectories have the property where their POWER first goes up (as they, e.g., move to topological junction points) and then goes down (as they move from the junction points to their actual objectives).
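For what it's worth, here's roughly what that experiment looks like in code. The helper names below (reset_env, step_env, policy, power_of) are placeholders I'm making up for illustration, not the actual API of our codebase.

```python
import matplotlib.pyplot as plt

def power_trace(reset_env, step_env, policy, power_of, max_steps=100):
    """Run one rollout and record the agent's POWER at every state it visits.

    Assumed (hypothetical) interface: reset_env() -> state,
    step_env(state, action) -> (next_state, done), policy(state) -> action,
    power_of(state) -> float.
    """
    state = reset_env()
    trace = [power_of(state)]
    for _ in range(max_steps):
        state, done = step_env(state, policy(state))
        trace.append(power_of(state))
        if done:
            break
    return trace

# Overlay many rollouts and check whether POWER tends to rise (toward junction points)
# and then fall (as the agent closes in on its terminal objective), e.g.:
# for _ in range(100):
#     plt.plot(power_trace(reset_env, step_env, policy, power_of), alpha=0.2)
# plt.xlabel("rollout step"); plt.ylabel("POWER"); plt.show()
```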
Thanks again for your great questions. Incidentally, a big reason we're open-sourcing our research codebase is to radically lower the cost of converting thought experiments like the above into real experiments with concrete outcomes that can support or falsify our intuitions. The ideas you've suggested are not only interesting and creative, they're also cheaply testable on our existing infrastructure. That's one reason we're excited to release it!
[1] Note that this assumes the maximizer is inner aligned to pursue its terminal goal, the terminal goal is stable on reflection, and all the usual similar incantations.
Yes, I think this is right. It's been pointed out elsewhere that feature universality in neural networks could be an instance of instrumental convergence, for example. And if you think about it, to the extent that a "correct" model of the universe exists, then capturing that world-model in your reasoning should be instrumentally useful for most non-trivial terminal goals.
We've focused on simple gridworlds here, partly because they're visual, but also because they're tractable. But I suspect there's a mapping between POWER (in the RL context) and generalizability of features in NNs (in the context of something like the circuits work linked above). This would be really interesting to investigate.
Got it. That makes sense, thanks!
This is really interesting. It's hard to speak too definitively about theories of human values, but for what it's worth these ideas do pass my intuitive smell test.
One intriguing aspect is that, assuming I've followed correctly, this theory aims to unify different cognitive concepts in a way that might be testable:
- On the one hand, it seems to suggest a path to generalizing circuits-type work to the model-based RL paradigm. (With shards, which bid for outcomes on a contextually activated basis, being analogous to circuits, which contribute to prediction probabilities on a contextually activated basis.)
- On the other hand, it also seems to generalize the psychological concept of classical conditioning (Pavlov's salivating dog, etc.), which has tended to be studied over the short term for practical reasons, to arbitrarily (?) longer planning horizons. The discussion of learning in babies also puts one in mind of the unfortunate Little Albert Experiment, done in the 1920s:
For the experiment proper, by which point Albert was 11 months old, he was put on a mattress on a table in the middle of a room. A white laboratory rat was placed near Albert and he was allowed to play with it. At this point, Watson and Rayner made a loud sound behind Albert's back by striking a suspended steel bar with a hammer each time the baby touched the rat. Albert responded to the noise by crying and showing fear. After several such pairings of the two stimuli, Albert was presented with only the rat. Upon seeing the rat, Albert became very distressed, crying and crawling away.
[...]
In further experiments, Little Albert seemed to generalize his response to the white rat. He became distressed at the sight of several other furry objects, such as a rabbit, a furry dog, and a seal-skin coat, and even a Santa Claus mask with white cotton balls in the beard.
A couple more random thoughts on stories one could tell through the lens of shard theory:
- As we age, if all goes well, we develop shards with longer planning horizons. Planning over longer horizons requires more cognitive capacity (all else equal), and long-horizon shards do seem to have some ability to either reinforce or dampen the influence of shorter-horizon shards. This is part of the continuing process of "internally aligning" a human mind.
- Introspectively, I think there is also an energy cost involved in switching between "active" shards. Software developers understand this as context-switching, actively dislike it, and evolve strategies to minimize it in their daily work. I suspect a lot of the biases you might categorize under "resistance to change" (projection bias, sunk cost fallacy and so on) have this as a factor.
I do have a question about your claim that shards are not full subagents. I understand that in general different shards will share parameters over their world-model, so in that sense they aren't fully distinct — is this all you mean? Or are you arguing that even a very complicated shard with a long planning horizon (e.g., "earn money in the stock market" or some such) isn't agentic by some definition?
Anyway, great post. Looking forward to more.
Nice. Congrats on the launch! This is an extremely necessary line of effort.
Interesting. The specific idea you're proposing here may or may not be workable, but it's an intriguing example of a more general strategy that I've previously tried to articulate in another context. The idea is that it may be viable to use an AI to create a "platform" that accelerates human progress in an area of interest to existential safety, as opposed to using an AI to directly solve the problem or perform the action.
Essentially:
- A "platform" for work in domain X is something that removes key constraints that would otherwise have consumed human time and effort when working in X. This allows humans to explore solutions in X they wouldn't have previously — whether because they'd considered and rejected those solution paths, or because they'd subconsciously trained themselves not to look in places where the initial effort barrier was too high. Thus, developing an excellent platform for X allows humans to accelerate progress in domain X relative to other domains, ceteris paribus. (Every successful platform company does this. e.g., Shopify, Amazon, etc., make valuable businesses possible that wouldn't otherwise exist.)
- For certain carefully selected domains X, a platform for X may plausibly be relatively easier to secure & validate than an agent that's targeted at some specific task x ∈ X would be. (Not easy; easier.) It's less risky to validate the outputs of a platform and leave the really dangerous last-mile stuff to humans, than it would be to give an end-to-end trained AI agent a pivotal command in the real world (i.e., "melt all GPUs") that necessarily takes the whole system far outside its training distribution. Fundamentally, the bet is that if humans are the ones doing the out-of-distribution part of the work, then the output that comes out the other end is less likely to have been adversarially selected against us.
(Note that platforms are tools, and tools want to be agents, so a strategy like this is unlikely to arise along the "natural" path of capabilities progress other than transiently.)
There are some obvious problems with this strategy. One is that point 1 above is no help if you can't tell which of the solutions the humans come up with are good, and which are bad. So the approach can only work on problems that humans would otherwise have been smart enough to solve eventually, given enough time to do so (as you already pointed out in your example). If AI alignment is such a problem, then it could be a viable candidate for such an approach. Ditto for a pivotal act.
Another obvious problem is that capabilities research might benefit from similar platforms in the same way that alignment research can. So actually implementing this in the real world might just accelerate the timeline for everything, leaving us worse off. (Absent an intervention at some higher level of coordination.)
A third concern is that point 2 above could be flat-out wrong in practice. Asking an AI to build a platform means asking for generalization, even if it is just "generalization within X", and that's playing a lethally dangerous game. In fact, it might well be lethal for any useful X, though that isn't currently obvious to me. e.g., AlphaFold2 is a primitive example of a platform that's useful and non-dangerous, though it's not useful enough for this.
On top of all that, there are all the steganographic considerations — AI embedding dangerous things in the tool itself, etc. — that you pointed out in your example.
But this strategy still seems like it could bring us closer to the Pareto frontier for critical domains (the alignment problem, a pivotal act) than directly training an AI to do the dangerous action would.
Yep, I'd say I intuitively agree with all of that, though I'd add that if you want to specify the set of "outcomes" differently from the set of "goals", then that must mean you're implicitly defining a mapping from outcomes to goals. One analogy could be that an outcome is like a thermodynamic microstate (in the sense that it's a complete description of all the features of the universe) while a goal is like a thermodynamic macrostate (in the sense that it's a complete description of the features of the universe that the system can perceive).
This mapping from outcomes to goals won't be injective for any real embedded system. But in the unrealistic limit where your system is so capable that it has a "perfect ontology" — i.e., its perception apparatus can resolve every outcome / microstate from any other — then this mapping converges to the identity function, and the system's set of possible goals converges to its set of possible outcomes. (This is the dualistic case, e.g., AIXI and such. But plausibly, we should also expect a self-improving system to improve its own perception apparatus such that its effective goal-set becomes finer and finer with each improvement cycle. So even this partition over goals can't be treated as constant in the general case.)
Gotcha. I definitely agree with what you're saying about the effectiveness of incentive structures. And to be clear, I also agree that some of the affordances in the quote reasonably fall under "alignment": e.g., if you explicitly set a specific mission statement, that's a good tactic for aligning your organization around that specific mission statement.
But some of the other affordances aren't as clearly goal-dependent. For example, iterating quickly is an instrumentally effective strategy across a pretty broad set of goals a company might have. That (in my view) makes it closer to a capability technique than to an alignment technique. i.e., you could imagine a scenario where I succeeded in building a company that iterated quickly, but I failed to also align it around the mission statement I wanted it to have. In this scenario, my company was capable, but it wasn't aligned with the goal I wanted.
Of course, this is a spectrum. Even setting a specific mission statement is an instrumentally effective strategy across all the goals that are plausible interpretations of that mission statement. And most real mission statements don't admit a unique interpretation. So you could also argue that setting a mission statement increases the company's capability to accomplish goals that are consistent with any interpretation of it. But as a heuristic, I tend to think of a capability as something that lowers the cost to the system of accomplishing any goal (averaged across the system's goal-space with a reasonable prior). Whereas I tend to think of alignment as something that increases the relative cost to the system of accomplishing classes of goals that the operator doesn't want.
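If it's useful, here's one made-up way to write that intuition down (the notation is mine and purely illustrative): let C(g) be the system's cost of accomplishing goal g, P a reasonable prior over the goal-space, and G_bad the classes of goals the operator doesn't want.

```latex
\text{capability effect of a change} \;\approx\; \mathbb{E}_{g \sim P}\big[\, C_{\text{before}}(g) - C_{\text{after}}(g) \,\big]
\qquad
\text{alignment effect of a change} \;\approx\; \mathbb{E}_{g \sim P}\!\left[ \frac{C_{\text{after}}(g)}{C_{\text{before}}(g)} \;\middle|\; g \in G_{\text{bad}} \right]
```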
I'd be interested to hear whether you have a different mental model of the difference, and if so, what it is. It's definitely possible I've missed something here, since I'm really just describing an intuition.
Thanks, great post.
These include formulating and repeating a clear mission statement, setting up a system for promotions that rewards well-calibrated risk taking, and iterating quickly at the beginning of the company in order to habituate a rhythm of quick iteration cycles.
I may be misunderstanding, but wouldn't these techniques fall more under the heading of capabilities rather than under alignment? These are tactics that should increase a company's effectiveness in general, for most reasonable mission statements or products the company could have.
This is fantastic. Really appreciate both the detailed deep-dive in the document, and the summary here. This is also timely, given that teams working on superscale models with concerning capabilities haven't generally been too forthcoming with compute estimates. (There are exceptions.)
As you and Alex point out in the sibling thread, the biggest remaining fudge factors seem to be the following (a rough numerical sketch of both estimation methods follows the list):
- Mixture models (or any kind of parameter-sharing, really) for the first method, which will cause you to systematically overestimate the "Operations per forward pass" factor; and
- Variable effective utilization rates of custom hardware for the second method, which will cause an unknown distribution of errors in the "utilization rate" factor.
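To make the shape of both calculations concrete, here's a toy sketch. It uses the common ~2-FLOPs-per-parameter-per-token approximation for the forward pass rather than a full per-layer operation count, and every number in it is a made-up placeholder rather than an estimate for any real model.

```python
# Method 1: operations per forward pass x training tokens (breaks down with parameter sharing).
params = 100e9                      # dense parameter count (placeholder)
tokens = 300e9                      # training tokens (placeholder)
fwd_flop_per_token = 2 * params     # ~2 FLOPs per parameter per token on the forward pass
training_flop_1 = 3 * fwd_flop_per_token * tokens   # backward pass ~2x the forward pass

# Method 2: hardware throughput x utilization x wall-clock time (utilization is the fudge factor).
n_gpus = 1024
peak_flop_per_gpu = 312e12          # e.g., A100 BF16 peak
utilization = 0.3                   # placeholder effective utilization rate
days = 30
training_flop_2 = n_gpus * peak_flop_per_gpu * utilization * days * 86_400

print(f"method 1: {training_flop_1:.2e} FLOP, method 2: {training_flop_2:.2e} FLOP")
```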
Nonetheless, my flying guess would be that your method is pretty much guaranteed to be right within an OOM, and probably within a factor of 2 or less. That seems pretty good! It's certainly an improvement over anything I've seen previously along these lines. Congrats!
It's simply because we each (myself more than her) have an inclination to apply a fair amount of adjustment in a conservative direction, for generic "burden of proof" reasons, rather than go with the timelines that seem most reasonable based on the report in a vacuum.
While one can sympathize with the view that the burden of proof ought to lie with advocates of shorter timelines when it comes to the pure inference problem ("When will AGI occur?"), it's worth observing that in the decision problem ("What should we do about it?") this situation is reversed. The burden of proof in the decision problem probably ought instead to lie with advocates of non-action: when one's timelines are >1 generation, it is a bit too easy to kick the can down the road in various ways — leaving one unprepared if the future turns out to move faster than expected. Conversely, someone whose timelines are relatively short may take actions today that will leave us in a better position in the future, even if that future arrives more slowly than they originally believed.
(I don't think OpenPhil is confusing these two, just that in a conversation like this it is particularly worth emphasizing the difference.)
This is an excellent point and it's indeed one of the fundamental limitations of a public tracking approach. Extrapolating trends in an information environment like this can quickly degenerate into pure fantasy. All one can really be sure of is that the public numbers are merely lower bounds — and plausibly, very weak ones.
Yeah, great point about Gopher, we noticed the same thing and included a note to that effect in Gopher's entry in the tracker.
I agree there's reason to believe this sort of delay could become a bigger factor in the future, and may already be a factor now. If we see this pattern develop further (and if folks start publishing "model cards" more consistently like DM did, which gave us the date of Gopher's training) we probably will begin to include training date as separate from publication date. But for now, it's a possible trend to keep an eye on.
Thanks again!
A more typical example: I can look at a chain of options on a stock, and use the prices of those options to back out market-implied probabilities for each possible stock price at expiry.
Gotcha, this is a great example. And the fundamental reasons why this works are 1) the immediate incentive that you can earn higher returns by pricing the option more correctly; combined with 2) the fact that the agents who are assigning these prices have (on a dollar-weighted-average basis) gone through multiple rounds of selection for higher returns.
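For anyone curious what the backing-out step looks like mechanically: one standard trick is that the risk-neutral density is (approximately) the second derivative of the call price with respect to strike. Here's a minimal sketch with made-up prices; it ignores discounting and real-world messiness like bid-ask spreads.

```python
import numpy as np

strikes = np.array([80.0, 90.0, 100.0, 110.0, 120.0])   # evenly spaced strikes (made up)
calls   = np.array([21.0, 12.5, 6.0, 2.2, 0.6])          # call prices at those strikes (made up)
h = strikes[1] - strikes[0]

# Second finite difference of call price in strike ~ market-implied density at each interior strike.
density = (calls[:-2] - 2 * calls[1:-1] + calls[2:]) / h**2

# Approximate probability of the stock expiring near each interior strike.
bucket_probs = density * h
print(dict(zip(strikes[1:-1], bucket_probs.round(3))))
```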
(I wonder to what extent any selection mechanism ultimately yields agents with general reasoning capabilities, given tight enough competition between individuals in the selected population? Even if the environment doesn't start out especially complicated, if the individuals are embedded in it and are interacting with one another, after a few rounds of selection most of the complexity an individual perceives is going to be due to its competitors. Not everything is like this — e.g., training a neural net is a form of selection without competition — but it certainly seems to describe many of the more interesting bits of the world.)
Thanks for the clarifications here btw — this has really piqued my interest in selection theorems as a research angle.
Okay, then to make sure I've understood correctly: what you were saying in the quoted text is that you'll often see an economist, etc., use coherence theorems informally to justify a particular utility maximization model for some system, with particular priors and conditionals. (As opposed to using coherence theorems to justify the idea of EU models generally, which is what I'd thought you meant.) And this is a problem because the particular priors and conditionals they pick can't be justified solely by the coherence theorem(s) they cite.
The problem with VNM-style lotteries is that the probabilities involved have to come from somewhere besides the coherence theorems themselves. We need to have some other, external reason to think it's useful to model the environment using these probabilities.
To try to give an example of this: suppose I wanted to use coherence / consistency conditions alone to assign priors over the outcomes of a VNM lottery. Maybe the closest I could come to doing this would be to use maxent + transformation groups to assign an ignorance prior over those outcomes; and to do that, I'd need to additionally know the symmetries that are implied by my ignorance of those outcomes. But those symmetries are specific to the structure of my problem and are not contained in the coherence theorems themselves. So this information about symmetries would be what you would refer to as an "external reason to think it's useful to model the environment using these probabilities".
Is this a correct interpretation?
Thanks so much for the feedback!
The ability to sort by model size etc would be nice. Currently sorting is alphabetical.
Right now the default sort is actually chronological by publication date. I just added the ability to sort by model size and compute budget at your suggestion. You can use the "⇅ Sort" button in the Models tab to try it out; the rows should now sort correctly.
Also the rows with long textual information should be more to the right and the more informative/tighter/numerical columns more to the left (like "deep learning" in almost all rows, not very informative). Ideally the most relevant information would be on the initial page without scrolling.
You are absolutely right! I've just taken a shot at rearranging the columns to surface the most relevant parts up front and played around a bit with the sizing. Let me know what you think.
"Date published" and "date trained" can be quite different. Maybe worth including the latter?
That's true, though I've found the date at which a model was trained usually isn't disclosed as part of a publication (unlike parameter count and, to a lesser extent, compute cost). There is also generally an incentive to publish fairly soon after the model's been trained and characterized, so you can often rely on the model not being that stale, though that isn't universal.
Is there a particular reason you'd be interested in seeing training dates as opposed to (or in addition to) publication dates?
Thanks again!
I’m surprised by just how much of a blindspot goal-inputs seem to be for today’s economists, AI researchers, etc. The coherence theorems usually cited to justify expected utility maximization models imply a quite narrow range of inputs to those utility functions: utilities are only over the outcomes on which agents can bet. Yet practitioners use utility functions over entire (unobservable) world states, world state trajectories, MDP states, etc, often without any way for the agent to bet on all of the outcomes.
It's true that most of the agents we build can't directly bet on all the outcomes in their respective world-models. But these agents would still be modelled by the coherence theorems (+ VNM) as betting on lotteries over such outcomes. This seems like a fine way to justify EU maximization when you're unable to bet on every "microstate" of the world — so in what sense did you mean that this was a blind spot?
EDIT: Unless you were alluding to the fact that real-world agents' utility functions are often defined over "wrong" ontologies, such that you couldn't actually construct a lottery over real-world microstates that's an exact fit for the bet the agent wants to make. Is that what you meant?
(FWIW, I agree with your overall point in this section. I'm just trying to better understand your meaning here.)
Personally speaking, I think this is the subfield to be closely tracking progress in, because 1) it has far-reaching implications in the long term and 2) it has garnered relatively little attention compared to other subfields.
Thanks for the clarification — definitely agree with this.
If you'd like to visualize trends though, you'll need more historical data points, I think.
Yeah, you're right. Our thinking was that we'd be able to do this with future data points or by increasing the "density" of points within the post-GPT-3 era, but ultimately it will probably be necessary (and more compelling) to include somewhat older examples too.
Interesting; I hadn't heard of DreamerV2. From a quick look at the paper, it looks like one might describe it as a step on the way to something like EfficientZero. Does that sound roughly correct?
it would be great to see older models incorporated as well
We may extend this to older models in the future. But our goal right now is to focus on these models' public safety risks as standalone (or nearly standalone) systems. And prior to GPT-3, it's hard to find models whose public safety risks were meaningful on a standalone basis — while an earlier model could have been used as part of a malicious act, for example, it wouldn't be as central to such an act as a modern model would be.
Yeah, these are interesting points.
Isn't it a bit suspicious that the thing-that's-discontinuous is hard to measure, but the-thing-that's-continuous isn't? I mean, this isn't totally suspicious, because subjective experiences are often hard to pin down and explain using numbers and statistics. I can understand that, but the suspicion is still there.
I sympathize with this view, and I agree there is some element of truth to it that may point to a fundamental gap in our understanding (or at least in mine). But I'm not sure I entirely agree that discontinuous capabilities are necessarily hard to measure: for example, there are benchmarks available for things like arithmetic, which one can train on and make quantitative statements about.
I think the key to the discontinuity question is rather that 1) it's the jumps in model scaling that are happening in discrete increments; and 2) everything is S-curves, and a discontinuity always has a linear regime if you zoom in enough. Those two things together mean that, while a capability like arithmetic might have a continuous performance regime on some domain, in reality you can find yourself halfway up the performance curve in a single scaling jump (and this is in fact what happened with arithmetic and GPT-3). So the risk, as I understand it, is that you end up surprisingly far up the scale of "world-ending" capability from one generation to the next, with no detectable warning shot beforehand.
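A toy picture of that last point, with made-up numbers: performance that is a perfectly smooth sigmoid in log-compute can still look like a jump when compute arrives in generation-sized increments.

```python
import numpy as np

def performance(log10_compute, midpoint=24.0, steepness=2.0):
    """Smooth S-curve in log-compute (all parameters are made up for illustration)."""
    return 1.0 / (1.0 + np.exp(-steepness * (log10_compute - midpoint)))

generations = [22.0, 24.0, 26.0]   # log10 training FLOP for successive model generations
for g in generations:
    print(g, round(float(performance(g)), 3))
# ~0.02 -> 0.5 -> ~0.98: the underlying curve is continuous, but one scaling jump
# can still take you most of the way up it with no warning shot in between.
```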
"No one predicted X in advance" is only damning to a theory if people who believed that theory were making predictions about it at all. If people who generally align with Paul Christiano were indeed making predictions to the effect of GPT-3 capabilities being impossible or very unlikely within a narrow future time window, then I agree that would be damning to Paul's worldview. But -- and maybe I missed something -- I didn't see that. Did you?
No, you're right as far as I know; at least I'm not aware of any such attempted predictions. And in fact, the very absence of such prediction attempts is interesting in itself. One would imagine that correctly predicting the capabilities of an AI from its scale ought to be a phenomenally valuable skill — not just from a safety standpoint, but from an economic one too. So why, indeed, didn't we see people make such predictions, or at least try to?
There could be several reasons. For example, perhaps Paul (and other folks who subscribe to the "continuum" world-model) could have done it, but they were unaware of the enormous value of their predictive abilities. That seems implausible, so let's assume they knew the value of such predictions would be huge. But if you know the value of doing something is huge, why aren't you doing it? Well, if you're rational, there's only one reason: you aren't doing it because it's too hard, or otherwise too expensive compared to your alternatives. So we are forced to conclude that this world-model — by its own implied self-assessment — has, so far, proved inadequate to generate predictions about the kinds of capabilities we really care about.
(Note: you could make the argument that OpenAI did make such a prediction, in the approximate yet very strong sense that they bet big on a meaningful increase in aggregate capabilities from scale, and won. You could also make the argument that Paul, having been at OpenAI during the critical period, deserves some credit for that decision. I'm not aware of Paul ever making this argument, but if made, it would be a point in favor of such a view and against my argument above.)
I think what gwern is trying to say is that continuous progress on a benchmark like PTB appears (from what we've seen so far) to map to discontinuous progress in qualitative capabilities, in a surprising way which nobody seems to have predicted in advance. Qualitative capabilities are more relevant to safety than benchmark performance is, because while qualitative capabilities include things like "code a simple video game" and "summarize movies with emojis", they also include things like "break out of confinement and kill everyone". It's the latter capability, and not PTB performance, that you'd need to predict if you wanted to reliably stay out of the x-risk regime — and the fact that we can't currently do so is, I imagine, what brought to mind the analogy between scaling and Russian roulette.
I.e., a straight line in domain X is indeed not surprising; what's surprising is the way in which that straight line maps to the things we care about more than X.
(Usual caveats apply here that I may be misinterpreting folks, but that is my best read of the argument.)
Good catch! I didn't check the form. Yes you are right, the spoiler should say (1=Paul, 9=Eliezer) but the conclusion is the right way round.
(Not being too specific to avoid spoilers) Quick note: I think the direction of the shift in your conclusion might be backwards, given the statistics you've posted and that 1=Eliezer and 9=Paul.
Thanks for the kind words and thoughtful comments.
You're absolutely right that expected ROI ultimately determines scale of investment. I agree on your efficiency point too: scaling and efficiency are complements, in the sense that the more you have of one, the more it's worth investing in the other.
I think we will probably include some measure of efficiency as you've suggested. But I'm not sure exactly what that will be: since efficiency measures tend to be benchmark-dependent, it's hard to get apples-to-apples comparisons here for a variety of reasons. (e.g., differences in modalities, differences in how papers record their results, but also the fact that benchmarks tend to get smashed pretty quickly these days, so newer models are being compared on a different basis from old ones.) Did you have any specific thoughts about this? To be honest, this is still an area we are figuring out.
On the ROI side: while this is definitely the most important metric, it's also the one with by far the widest error bars. The reason is that it's impossible to predict all the creative ways people will use these models for economic ends — even GPT-3 by itself might spawn entire industries that don't yet exist. So the best one could hope for here is something like a lower bound with the accuracy of a startup's TAM estimate: more art than science, and very liable to be proven massively wrong in either direction. (Disclosure: I'm a modestly prolific angel investor, and I've spoken to — though not invested in — several companies being built on GPT-3's API.)
There's another reason we're reluctant to publish ROI estimates: at the margin, these estimates themselves bolster the case for increased investment in scaling, which is concerning from a risk perspective. This probably wouldn't be a huge effect in absolute terms, since it's not really the sort of thing effective allocators weigh heavily as decision inputs, but there are scenarios where it matters and we'd rather not push our luck.
Thanks again!
Gotcha. Well, that seems right—certainly in the limit case.
Thanks, that helps. So actually this objection says: "No, the biggest risk lies not in the trustworthiness of the Bob you use as the input to your scheme, but rather in the fidelity of your copying process; and this is true even if the errors in your copying process are being introduced randomly rather than adversarially. Moreover, if you actually do develop the technical capability to reduce your random copying-error risk down to around the level of your Bob-trustworthiness risk, well guess what, you've built yourself an AGI. But since this myopic copying scheme thing seems way harder than the easiest way I can think of to build an AGI, that means a fortiori that somebody else built one the easy way several years before you built yours."
Is that an accurate interpretation?
This is a great thread. Let me see if I can restate the arguments here in different language:
- Suppose Bob is a smart guy whom we trust to want all the best things for humanity. Suppose we also have the technology to copy Bob's brain into software and run it in simulation at, say, a million times its normal speed. Then, if we thought we had one year between now and AGI (leaving aside the fact that I just described a literal AGI in the previous sentence), we could tell simulation-Bob, "You have a million subjective years to think of an effective pivotal act in the real world, and tell us how to execute it." Bob's a smart guy, and we trust him to do the right thing by us; he should be able to figure something out in a million years, right?
- My understanding of Evan's argument at this point would be: "Okay; so we don't have the technology to directly simulate Bob's brain. But maybe instead we can imitate its I/O signature by training a model against its actions. Then, because that model is software, we can (say) speed it up a million times and deal with it as if it was a high-fidelity copy of Bob's brain, and it can solve alignment / execute pivotal action / etc. for us. Since Bob was smart, the model of Bob will be smart. And since Bob was trustworthy, the model of Bob will be trustworthy to the extent that the training process we use doesn't itself introduce novel long-term dependencies that leave room for deception."
- Note that myopia — i.e., the purging of long term dependencies from the training feedback signal — isn't really conceptually central to the above scheme. Rather it is just a hack intended to prevent additional deception risks from being introduced through the act of copying Bob's brain. The simulated / imitated copy of Bob is still a full-blown consequentialist, with all the manifold risks that entails. So the scheme is basically a way of taking an impractically weak system that you trust, and overclocking it but not otherwise affecting it, so that it retains (you hope) the properties that made you trust it in the first place.
- At this point my understanding of Eliezer's counterargument would be: "Okay sure; but find me a Bob that you trust enough to actually put through this process. Everything else is neat, but it is downstream of that." And I think that this is correct and that it is a very, very strong objection, but — under certain sets of assumptions about timelines, alternatives, and counterfactual risks — it may not be a complete knock-down. (This is the "belling the cat" bit, I believe.)
- And at this point, maybe (?) Evan says, "But wait; the Bob-copy isn't actually a consequentialist because it was trained myopically." And if that's what Evan says, then I believe this is the point at which there is an empirically resolvable disagreement.
Is this roughly right? Or have I missed something?
I want to push back a little against the claim that the bootstrapping strategy ("build a relatively weak aligned AI that will make superhumanly fast progress on AI alignment") is definitely irrelevant/doomed/inferior. Specifically, I don't know whether this strategy is good or not in practice, but it serves as useful threshold for what level/kind of capabilities we need to align in order to solve AI risk.
Yeah, very much agree with all of this. I even think there's an argument to be made that relatively narrow-yet-superhuman theorem provers (or other research aids) could be worth the risk to develop and use, because they may make the human alignment researchers who use them more effective in unpredictable ways. For example, researchers tend to instinctively avoid considering solution paths that are bottlenecked by statements they see as being hard to prove — which is totally reasonable. But if your mentality is that you can just toss a super-powerful theorem-prover at the problem, then you're free to explore concept-space more broadly since you may be able to check your ideas at much lower cost.
(Also find myself agreeing with your point about tradeoffs. In fact, you could think of a primitive alignment strategy as having a kind of Sharpe ratio: how much marginal x-risk does it incur per marginal bit of optimization it gives? Since a closed-form solution to the alignment problem doesn't necessarily seem forthcoming, measuring its efficient frontier might be the next best thing.)
Great catch. For what it's worth, it actually seems fine to me intuitively that any finite pattern would be an optimizing system for this reason, though I agree most such patterns may not directly be interesting. But perhaps this is a hint that some notion of independence or orthogonality of optimizing systems might help to complete this picture.
Here's a real-world example: you could imagine a universe where humans are minding their own business over here on Earth, while at the same time, over there in a star system 20 light-years away, two planets are hurtling towards each other under the pull of their mutual gravitation. No matter what humans may be doing on Earth, this universe as a whole can still reasonably be described as an optimizing system! Specifically, it achieves the property that the two faraway planets will crash into each other under a fairly broad set of contexts.
Now suppose we describe the state of this universe as a single point in a gargantuan phase space — let's say it's the phase space of classical mechanics, where we assign three positional and three momentum degrees of freedom to each particle in the universe (so if there are N particles in the universe, we have a 6N-dimensional phase space). Then there is a subspace of this huge phase space that corresponds to the crashing planets, and there is another, orthogonal subspace that corresponds to the Earth and its humans. You could then say that the crashing-planets subspace is an optimizing system that's independent of the human-Earth subspace. In particular, if you imagine that these planets (which are 20 light-years away from Earth) take less than 20 years to crash into each other, then the two subspaces won't come into causal contact before the planet subspace has achieved the "crashed into each other" property.
Similarly on the GoL grid, you could imagine having an interesting eater over here, while over there you have a pretty boring, mostly empty grid with just a single live cell in it. If your single live cell is far enough away from the eater that the two systems do not come into causal contact before the single cell has "died" (if the lone live cell is more than 2 cells away from any live cell of the eater system, for example), then they can imo be considered two independent optimizing systems.
Of course the union of two independent optimizing systems will itself be an optimizing system, and perhaps that's not very interesting. But I'd contend that the reason it's not very interesting is that very property of causal independence — and that this independence can be used to resolve our GoL universe into two orthogonal optimizers that can then be analyzed separately (as opposed to asserting that the empty grid isn't an optimizing system at all).
Actually, that also suggests an intriguing experimental question. Suppose Optimizer A independently achieves Property X, and Optimizer B independently achieves Property Y in the GoL universe. Are there certain sorts of properties that tend to be achieved when you put A and B in causal contact?
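In case it's useful, here's a minimal harness (assuming numpy/scipy) for poking at exactly this kind of question: put two patterns on one grid, vary their separation, and step the universe forward. The pattern choices below are just placeholders.

```python
import numpy as np
from scipy.signal import convolve2d

def life_step(grid):
    """One step of Conway's Game of Life on a zero-padded grid."""
    kernel = np.ones((3, 3), dtype=int)
    kernel[1, 1] = 0
    neighbors = convolve2d(grid, kernel, mode="same", boundary="fill")
    return ((neighbors == 3) | ((grid == 1) & (neighbors == 2))).astype(int)

grid = np.zeros((20, 20), dtype=int)
grid[2:4, 2:4] = 1    # a stable block, standing in for "Optimizer A"
grid[15, 15] = 1      # a lone live cell, standing in for "Optimizer B"

grid = life_step(grid)
print(grid[2:4, 2:4].sum(), int(grid[15, 15]))   # 4 and 0: A persists, B dies before any contact
```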
Extremely interesting — thanks for posting. Obviously there are a number of caveats which you carefully point out, but this seems like a very reasonable methodology and the actual date ranges look compelling to me. (Though they also align with my bias in favor of shorter timelines, so I might not be impartial on that.)
One quick question about the end of this section:
The expected number of bits in original encoding per bits in the compression equals the entropy of that language.
Wouldn't this be the other way around? If your language has low entropy it should be more predictable, and therefore more compressible. So the entropy would be the number of bits in the compression for each expected bit of the original.
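A quick empirical sanity check of that reading (zlib isn't an optimal code, so expect it to land somewhat above the entropy, but the units come out the same way):

```python
import math, random, zlib
from collections import Counter

random.seed(0)
# A low-entropy "language": i.i.d. characters, 90% 'a' and 10% 'b'.
text = "".join(random.choice("aaaaaaaaab") for _ in range(100_000))

counts = Counter(text)
n = len(text)
entropy = -sum(c / n * math.log2(c / n) for c in counts.values())   # bits per original character
zlib_rate = 8 * len(zlib.compress(text.encode(), 9)) / n            # compressed bits per original character

print(f"entropy   ≈ {entropy:.2f} bits per original character")
print(f"zlib rate ≈ {zlib_rate:.2f} bits per original character")
# Both quantities are "compressed bits per character of the original": low entropy
# means high compressibility, as suggested above.
```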
Thanks! I think this all makes sense.
- Oh yeah, I definitely agree with you that the empty board would be an optimizing system in the GoL context. All I meant was that the "Death" square in the examples table might not quite correspond to it in the analogy, since the death property is perhaps not an optimization target by the definition. Sorry if that wasn't clear.
- :)
- Got it, thanks! So if I've understood correctly, you are currently only using the mask as a way to separate the agent from its environment at instantiation, since that is all you really need to do to be able to define properties like robustness and retargetability in this context. That seems reasonable.
Loved this post. This whole idea of using a deterministic dynamical system as a conceptual testing ground feels very promising.
A few questions / comments:
- About the examples: do you think it's strictly correct to say that entropy / death is an optimizing system? One of the conditions of the Flint definition is that the set of target states ought to be substantially smaller than the basin of attraction, by some measure on the configuration space. Yet neither high entropy nor death seems to satisfy this: there are too many ways to be dead, and (tautologically) too many ways to have high entropy. As a result, both the "dead" property and the "high-entropy" property make up a large proportion of the attraction basin. The original post makes a similar point, though admittedly there is some degree of flexibility in terms of how big the target state set has to be before you call the system an optimizer.
- Not sure if this is a useful question, but what do you think of using "macrostate" as opposed to "property" to mean a set of states? This term "macrostate" is used in statistical physics for the identical concept, and as you're probably aware, there may be results from that field you'd be able to leverage here. (The "size" of a macrostate is usually thought of as its entropy over states, and this seems like it could fit into your framework as well. At first glance it doesn't seem too unreasonable to just use a flat prior over grid configurations, so this just ends up being the log of the state count.)
- I like the way embedded perturbations have been defined too. External perturbations don't seem fundamentally different from embedded ones (we can always just expand our configuration space until it includes the experimenter) but keeping perturbations "in-game" cuts out those complications while keeping the core problem in focus.
- The way you're using those two quantities to smoothly vary the "degree" of optimization of a system is very elegant.
- Do you imagine keeping the mask constant over the course of a computational rollout? Plausibly, as you start a computation, some kinds of agents may start to decohere as they move outside the original mask area and/or touch and merge with bits of their environments. E.g., if the agent is a glider, does the mask "follow" the agent? Or are you for now mostly considering patterns like eaters that stay in one place?
Very neat. It's quite curious that switching to L2 for the base optimizer doesn't seem to have resulted in the meta-initialized network learning the sine function. What sort of network did you use for the meta-learner? (It looks like the 4-layer network in your Methods refers to your base optimizer, but perhaps it's the same architecture for both?)
Also, do you know if you end up getting the meta-initialized network to learn the sine function eventually if you train for thousands and thousands of steps? Or does it just never learn no matter how hard you train it?
I see — perhaps I did misinterpret your earlier comment. It sounds like the transition you are more interested in is closer to (AI has ~free rein over the internet) => (AI invents nanotech). I don't think this is a step we should expect to be able to model especially well, but the best story/analogy I know of for it is probably the end part of That Alien Message. i.e., what sorts of approaches would we come up with, if all of human civilization was bent on solving the equivalent problem from our point of view?
If instead you're thinking more about a transition like (AI is superintelligent but in a box) => (AI has ~free rein over the internet), then I'd say that I'd expect us to skip the "in a box" step entirely.
No problem, glad it was helpful!
And thanks for the APS-AI definition, I wasn't aware of the term.
Thanks! I agree with this critique. Note that Daniel also points out something similar in point 12 of his comment — see my response.
To elaborate a bit more on the "missing step" problem though:
- I suspect many of the most plausible risk models have features that make it undesirable for them to be shared too widely. Please feel free to DM me if you'd like to chat more about this.
- There will always be some point between Step 1 and Step 3 at which human-legible explanations fail. i.e., it would be extremely surprising if we could tell a coherent story about the whole process — the best we can do is assume the AI gets to the end state because it's highly competent, but we should expect it to do things we can't understand. (To be clear, I don't think this is quite what your comment was about. But it is a fundamental reason why we can't ever expect a complete explanation.)
See my response to point 6 of Daniel's comment — it's rather that I'm imagining competing hedge funds (run by humans) beginning to enter the market with this sort of technology.