AI Tracker: monitoring current and near-future risks from superscale models 2021-11-23T19:16:05.554Z
AI takeoff story: a continuation of progress by other means 2021-09-27T15:55:44.163Z
Defining capability and alignment in gradient descent 2020-11-05T14:36:53.153Z


Comment by Edouard Harris on AI Tracker: monitoring current and near-future risks from superscale models · 2021-11-26T16:58:12.946Z · LW · GW

Personally speaking, I think this is the subfield to be closely tracking progress in, because 1) it has far-reaching implications in the long term and 2) it has garnered relatively little attention compared to other subfields.

Thanks for the clarification — definitely agree with this.

If you'd like to visualize trends though, you'll need more historical data points, I think.

Yeah, you're right. Our thinking was that we'd be able to do this with future data points or by increasing the "density" of points within the post-GPT-3 era, but ultimately it will probably be necessary (and more compelling) to include somewhat older examples too.

Comment by Edouard Harris on AI Tracker: monitoring current and near-future risks from superscale models · 2021-11-25T13:47:24.069Z · LW · GW

Interesting; I hadn't heard of DreamerV2. From a quick look at the paper, it looks like one might describe it as a step on the way to something like EfficientZero. Does that sound roughly correct?

it would be great to see older models incorporated as well

We may extend this to older models in the future. But our goal right now is to focus on these models' public safety risks as standalone (or nearly standalone) systems. And prior to GPT-3, it's hard to find models whose public safety risks were meaningful on a standalone basis — while an earlier model could have been used as part of a malicious act, for example, it wouldn't be as central to such an act as a modern model would be.

Comment by Edouard Harris on Yudkowsky and Christiano discuss "Takeoff Speeds" · 2021-11-25T05:35:35.898Z · LW · GW

Yeah, these are interesting points.

Isn't it a bit suspicious that the thing-that's-discontinuous is hard to measure, but the-thing-that's-continuous isn't? I mean, this isn't totally suspicious, because subjective experiences are often hard to pin down and explain using numbers and statistics. I can understand that, but the suspicion is still there.

I sympathize with this view, and I agree there is some element of truth to it that may point to a fundamental gap in our understanding (or at least in mine). But I'm not sure I entirely agree that discontinuous capabilities are necessarily hard to measure: for example, there are benchmarks available for things like arithmetic, which one can train on and make quantitative statements about.

I think the key to the discontinuity question is rather that 1) it's the jumps in model scaling that are happening in discrete increments; and 2) everything is S-curves, and a discontinuity always has a linear regime if you zoom in enough. Those two things together mean that, while a capability like arithmetic might have a continuous performance regime on some domain, in reality you can find yourself halfway up the performance curve in a single scaling jump (and this is in fact what happened with arithmetic and GPT-3). So the risk, as I understand it, is that you end up surprisingly far up the scale of "world-ending" capability from one generation to the next, with no detectable warning shot beforehand.
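To make the S-curve point concrete, here is a minimal sketch. It assumes a logistic performance curve in log10 of model scale; the midpoint, width, and the three "generations" are hypothetical numbers chosen for illustration, not fitted to GPT-3 or any real benchmark. The curve is smooth everywhere, yet sampling it only at 100x scale jumps takes you from near the floor to halfway up in one step.

```python
import math

def s_curve(log10_scale, midpoint=11.0, width=0.5):
    """Hypothetical logistic performance curve over log10(model scale)."""
    return 1.0 / (1.0 + math.exp(-(log10_scale - midpoint) / width))

# Hypothetical model generations, each 100x larger than the last
# (values are log10 of parameter count; illustrative only).
generations = [7.0, 9.0, 11.0]
performance = [s_curve(g) for g in generations]

# Jump in performance between consecutive generations.
jumps = [b - a for a, b in zip(performance, performance[1:])]

# The same curve sampled finely: each small step changes performance only slightly.
fine_grid = [7.0 + 0.01 * i for i in range(401)]
max_fine_step = max(
    s_curve(b) - s_curve(a) for a, b in zip(fine_grid, fine_grid[1:])
)

for g, p in zip(generations, performance):
    print(f"log10(scale)={g:.0f}  performance={p:.4f}")
print(f"largest generation-to-generation jump: {max(jumps):.4f}")
print(f"largest step on the fine grid:         {max_fine_step:.4f}")
```

Under these assumed numbers the second generation still sits below 2% of peak performance, so it gives little warning of the third.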

"No one predicted X in advance" is only damning to a theory if people who believed that theory were making predictions about it at all. If people who generally align with Paul Christiano were indeed making predictions to the effect of GPT-3 capabilities being impossible or very unlikely within a narrow future time window, then I agree that would be damning to Paul's worldview. But -- and maybe I missed something -- I didn't see that. Did you?

No, you're right as far as I know; at least I'm not aware of any such attempted predictions. And in fact, the very absence of such prediction attempts is interesting in itself. One would imagine that correctly predicting the capabilities of an AI from its scale ought to be a phenomenally valuable skill — not just from a safety standpoint, but from an economic one too. So why, indeed, didn't we see people make such predictions, or at least try to?

There could be several reasons. For example, perhaps Paul (and other folks who subscribe to the "continuum" world-model) could have done it, but they were unaware of the enormous value of their predictive abilities. That seems implausible, so let's assume they knew the value of such predictions would be huge. But if you know the value of doing something is huge, why aren't you doing it? Well, if you're rational, there's only one reason: you aren't doing it because it's too hard, or otherwise too expensive compared to your alternatives. So we are forced to conclude that this world-model — by its own implied self-assessment — has, so far, proved inadequate to generate predictions about the kinds of capabilities we really care about.

(Note: you could make the argument that OpenAI did make such a prediction, in the approximate yet very strong sense that they bet big on a meaningful increase in aggregate capabilities from scale, and won. You could also make the argument that Paul, having been at OpenAI during the critical period, deserves some credit for that decision. I'm not aware of Paul ever making this argument, but if made, it would be a point in favor of such a view and against my argument above.)

Comment by Edouard Harris on Yudkowsky and Christiano discuss "Takeoff Speeds" · 2021-11-25T01:51:54.306Z · LW · GW

I think what gwern is trying to say is that continuous progress on a benchmark like PTB appears (from what we've seen so far) to map to discontinuous progress in qualitative capabilities, in a surprising way which nobody seems to have predicted in advance. Qualitative capabilities are more relevant to safety than benchmark performance is, because while qualitative capabilities include things like "code a simple video game" and "summarize movies with emojis", they also include things like "break out of confinement and kill everyone". It's the latter capability, and not PTB performance, that you'd need to predict if you wanted to reliably stay out of the x-risk regime — and the fact that we can't currently do so is, I imagine, what brought to mind the analogy between scaling and Russian roulette.

I.e., a straight line in domain X is indeed not surprising; what's surprising is the way in which that straight line maps to the things we care about more than X.

(Usual caveats apply here that I may be misinterpreting folks, but that is my best read of the argument.)

Comment by Edouard Harris on Yudkowsky and Christiano discuss "Takeoff Speeds" · 2021-11-24T19:34:35.145Z · LW · GW

Good catch! I didn't check the form. Yes you are right, the spoiler should say (1=Paul, 9=Eliezer) but the conclusion is the right way round.

Comment by Edouard Harris on Yudkowsky and Christiano discuss "Takeoff Speeds" · 2021-11-24T17:38:42.622Z · LW · GW

(Not being too specific to avoid spoilers) Quick note: I think the direction of the shift in your conclusion might be backwards, given the statistics you've posted and that 1=Eliezer and 9=Paul.

Comment by Edouard Harris on AI Tracker: monitoring current and near-future risks from superscale models · 2021-11-24T14:06:08.558Z · LW · GW

Thanks for the kind words and thoughtful comments.

You're absolutely right that expected ROI ultimately determines scale of investment. I agree on your efficiency point too: scaling and efficiency are complements, in the sense that the more you have of one, the more it's worth investing in the other.

I think we will probably include some measure of efficiency as you've suggested. But I'm not sure exactly what that will be, since efficiency measures tend to be benchmark-dependent so it's hard to get apples-to-apples here for a variety of reasons. (e.g., differences in modalities, differences in how papers record their results, but also the fact that benchmarks tend to get smashed pretty quickly these days, so newer models are being compared on a different basis from old ones.) Did you have any specific thoughts about this? To be honest, this is still an area we are figuring out.

On the ROI side: while this is definitely the most important metric, it's also the one with by far the widest error bars. The reason is that it's impossible to predict all the creative ways people will use these models for economic ends — even GPT-3 by itself might spawn entire industries that don't yet exist. So the best one could hope for here is something like a lower bound with the accuracy of a startup's TAM estimate: more art than science, and very liable to be proven massively wrong in either direction. (Disclosure: I'm a modestly prolific angel investor, and I've spoken to — though not invested in — several companies being built on GPT-3's API.)

There's another reason we're reluctant to publish ROI estimates: at the margin, these estimates themselves bolster the case for increased investment in scaling, which is concerning from a risk perspective. This probably wouldn't be a huge effect in absolute terms, since it's not really the sort of thing effective allocators weigh heavily as decision inputs, but there are scenarios where it matters and we'd rather not push our luck.

Thanks again!

Comment by Edouard Harris on A positive case for how we might succeed at prosaic AI alignment · 2021-11-19T13:46:18.008Z · LW · GW

Gotcha. Well, that seems right—certainly in the limit case.

Comment by Edouard Harris on A positive case for how we might succeed at prosaic AI alignment · 2021-11-19T00:57:39.982Z · LW · GW

Thanks, that helps. So actually this objection says: "No, the biggest risk lies not in the trustworthiness of the Bob you use as the input to your scheme, but rather in the fidelity of your copying process; and this is true even if the errors in your copying process are being introduced randomly rather than adversarially. Moreover, if you actually do develop the technical capability to reduce your random copying-error risk down to around the level of your Bob-trustworthiness risk, well guess what, you've built yourself an AGI. But since this myopic copying scheme thing seems way harder than the easiest way I can think of to build an AGI, that means a fortiori that somebody else built one the easy way several years before you built yours."

Is that an accurate interpretation?

Comment by Edouard Harris on A positive case for how we might succeed at prosaic AI alignment · 2021-11-18T15:23:41.313Z · LW · GW

This is a great thread. Let me see if I can restate the arguments here in different language:

  1. Suppose Bob is a smart guy whom we trust to want all the best things for humanity. Suppose we also have the technology to copy Bob's brain into software and run it in simulation at, say, a million times its normal speed. Then, if we thought we had one year between now and AGI (leaving aside the fact that I just described a literal AGI in the previous sentence), we could tell simulation-Bob, "You have a million subjective years to think of an effective pivotal act in the real world, and tell us how to execute it." Bob's a smart guy, and we trust him to do the right thing by us; he should be able to figure something out in a million years, right?
  2. My understanding of Evan's argument at this point would be: "Okay; so we don't have the technology to directly simulate Bob's brain. But maybe instead we can imitate its I/O signature by training a model against its actions. Then, because that model is software, we can (say) speed it up a million times and deal with it as if it was a high-fidelity copy of Bob's brain, and it can solve alignment / execute pivotal action / etc. for us. Since Bob was smart, the model of Bob will be smart. And since Bob was trustworthy, the model of Bob will be trustworthy to the extent that the training process we use doesn't itself introduce novel long-term dependencies that leave room for deception."
  3. Note that myopia — i.e., the purging of long term dependencies from the training feedback signal — isn't really conceptually central to the above scheme. Rather it is just a hack intended to prevent additional deception risks from being introduced through the act of copying Bob's brain. The simulated / imitated copy of Bob is still a full-blown consequentialist, with all the manifold risks that entails. So the scheme is basically a way of taking an impractically weak system that you trust, and overclocking it but not otherwise affecting it, so that it retains (you hope) the properties that made you trust it in the first place.
  4. At this point my understanding of Eliezer's counterargument would be: "Okay sure; but find me a Bob that you trust enough to actually put through this process. Everything else is neat, but it is downstream of that." And I think that this is correct and that it is a very, very strong objection, but — under certain sets of assumptions about timelines, alternatives, and counterfactual risks — it may not be a complete knock-down. (This is the "belling the cat" bit, I believe.)
  5. And at this point, maybe (?) Evan says, "But wait; the Bob-copy isn't actually a consequentialist because it was trained myopically." And if that's what Evan says, then I believe this is the point at which there is an empirically resolvable disagreement.

Is this roughly right? Or have I missed something?

Comment by Edouard Harris on Ngo and Yudkowsky on alignment difficulty · 2021-11-18T14:26:35.686Z · LW · GW

I want to push back a little against the claim that the bootstrapping strategy ("build a relatively weak aligned AI that will make superhumanly fast progress on AI alignment") is definitely irrelevant/doomed/inferior. Specifically, I don't know whether this strategy is good or not in practice, but it serves as a useful threshold for what level/kind of capabilities we need to align in order to solve AI risk.

Yeah, very much agree with all of this. I even think there's an argument to be made that relatively narrow-yet-superhuman theorem provers (or other research aids) could be worth the risk to develop and use, because they may make the human alignment researchers who use them more effective in unpredictable ways. For example, researchers tend to instinctively avoid considering solution paths that are bottlenecked by statements they see as being hard to prove — which is totally reasonable. But if your mentality is that you can just toss a super-powerful theorem-prover at the problem, then you're free to explore concept-space more broadly since you may be able to check your ideas at much lower cost.

(Also find myself agreeing with your point about tradeoffs. In fact, you could think of a primitive alignment strategy as having a kind of Sharpe ratio: how much marginal x-risk does it incur per marginal bit of optimization it gives? Since a closed-form solution to the alignment problem doesn't necessarily seem forthcoming, measuring its efficient frontier might be the next best thing.)

Comment by Edouard Harris on Optimization Concepts in the Game of Life · 2021-11-01T19:12:57.300Z · LW · GW

Great catch. For what it's worth, it actually seems fine to me intuitively that any finite pattern would be an optimizing system for this reason, though I agree most such patterns may not directly be interesting. But perhaps this is a hint that some notion of independence or orthogonality of optimizing systems might help to complete this picture.

Here's a real-world example: you could imagine a universe where humans are minding their own business over here on Earth, while at the same time, over there in a star system 20 light-years away, two planets are hurtling towards each other under the pull of their mutual gravitation. No matter what humans may be doing on Earth, this universe as a whole can still reasonably be described as an optimizing system! Specifically, it achieves the property that the two faraway planets will crash into each other under a fairly broad set of contexts.

Now suppose we describe the state of this universe as a single point in a gargantuan phase space — let's say it's the phase space of classical mechanics, where we assign three positional and three momentum degrees of freedom to each particle in the universe (so if there are N particles in the universe, we have a 6N-dimensional phase space). Then there is a subspace of this huge phase space that corresponds to the crashing planets, and there is another, orthogonal subspace that corresponds to the Earth and its humans. You could then say that the crashing-planets subspace is an optimizing system that's independent of the human-Earth subspace. In particular, if you imagine that these planets (which are 20 light-years away from Earth) take less than 20 years to crash into each other, then the two subspaces won't come into causal contact before the planet subspace has achieved the "crashed into each other" property.

Similarly on the GoL grid, you could imagine having an interesting eater over here, while over there you have a pretty boring, mostly empty grid with just a single live cell in it. If your single live cell is far enough away from the eater that the two systems do not come into causal contact before the single cell has "died" (if the lone live cell is more than 2 cells away from any live cell of the eater system, for example) then they can imo be considered two independent optimizing systems.
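A quick way to check the eater-plus-lone-cell picture is to simulate it. The sketch below is a minimal set-based Game of Life step on an unbounded grid. The `EATER` coordinates are my own transcription of the standard "eater 1" still life, and the lone cell's position is an arbitrary choice that puts it 3 cells (Chebyshev distance) from the nearest eater cell; neither is taken from the post.

```python
from collections import Counter

def step(live):
    """Advance a set of live (row, col) cells by one Game of Life generation."""
    neighbor_counts = Counter(
        (r + dr, c + dc)
        for (r, c) in live
        for dr in (-1, 0, 1)
        for dc in (-1, 0, 1)
        if (dr, dc) != (0, 0)
    )
    # Birth on exactly 3 neighbors; survival on 2 or 3.
    return {
        cell
        for cell, n in neighbor_counts.items()
        if n == 3 or (n == 2 and cell in live)
    }

def chebyshev_gap(a, b):
    """Smallest Chebyshev distance between any cell of a and any cell of b."""
    return min(
        max(abs(r1 - r2), abs(c1 - c2)) for (r1, c1) in a for (r2, c2) in b
    )

# Assumed layout of the "eater 1" still life (illustrative transcription).
EATER = {(0, 0), (0, 1), (1, 0), (1, 2), (2, 2), (3, 2), (3, 3)}
# Hypothetical lone live cell, more than 2 cells from every eater cell.
LONE_CELL = {(3, 6)}

combined = step(EATER | LONE_CELL)         # evolve the two systems together
separate = step(EATER) | step(LONE_CELL)   # evolve each on its own, then merge

print("gap:", chebyshev_gap(EATER, LONE_CELL))
print("lone cell after one step:", step(LONE_CELL))
print("joint evolution equals separate evolution:", combined == separate)
```

With this separation no grid cell is adjacent to both patterns, so the lone cell dies before either pattern can influence the other.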

Of course the union of two independent optimizing systems will itself be an optimizing system, and perhaps that's not very interesting. But I'd contend that the reason it's not very interesting is that very property of causal independence — and that this independence can be used to resolve our GoL universe into two orthogonal optimizers that can then be analyzed separately (as opposed to asserting that the empty grid isn't an optimizing system at all).

Actually, that also suggests an intriguing experimental question. Suppose Optimizer A independently achieves Property X, and Optimizer B independently achieves Property Y in the GoL universe. Are there certain sorts of properties that tend to be achieved when you put A and B in causal contact?

Comment by Edouard Harris on Forecasting progress in language models · 2021-11-01T17:38:30.177Z · LW · GW

Extremely interesting — thanks for posting. Obviously there are a number of caveats which you carefully point out, but this seems like a very reasonable methodology and the actual date ranges look compelling to me. (Though they also align with my bias in favor of shorter timelines, so I might not be impartial on that.)

One quick question about the end of this section:

The expected number of bits in original encoding per bits in the compression equals the entropy of that language.

Wouldn't this be the other way around? If your language has low entropy it should be more predictable, and therefore more compressible. So the entropy would be the number of bits in the compression for each expected bit of the original.
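Here is a rough numerical check of that direction, as a sketch under stated assumptions. It uses the zeroth-order empirical entropy of a byte stream as a crude stand-in for the entropy of a language, and `zlib` as a stand-in compressor; the two synthetic sources are hypothetical and seeded for reproducibility.

```python
import math
import random
import zlib
from collections import Counter

def entropy_bits_per_byte(data: bytes) -> float:
    """Zeroth-order empirical Shannon entropy, in bits per byte."""
    counts = Counter(data)
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def compressed_bits_per_original_bit(data: bytes) -> float:
    """Compressed size divided by original size (bits cancel to a plain ratio)."""
    return len(zlib.compress(data, 9)) / len(data)

rng = random.Random(0)
# Low-entropy source: two symbols, heavily skewed (hypothetical example).
low = bytes(rng.choices([97, 98], weights=[0.9, 0.1], k=20000))
# High-entropy source: all 256 byte values, uniformly at random.
high = bytes(rng.choices(range(256), k=20000))

entropy_low = entropy_bits_per_byte(low)
entropy_high = entropy_bits_per_byte(high)
ratio_low = compressed_bits_per_original_bit(low)
ratio_high = compressed_bits_per_original_bit(high)

print(f"low-entropy source:  {entropy_low / 8:.3f} entropy bits per original bit, "
      f"{ratio_low:.3f} compressed bits per original bit")
print(f"high-entropy source: {entropy_high / 8:.3f} entropy bits per original bit, "
      f"{ratio_high:.3f} compressed bits per original bit")
```

The lower-entropy source needs fewer compressed bits per original bit, which is the direction argued above: entropy tracks compressed bits per original bit, not the reverse.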

Comment by Edouard Harris on Optimization Concepts in the Game of Life · 2021-10-27T18:01:43.331Z · LW · GW

Thanks! I think this all makes sense.

  1. Oh yeah, I definitely agree with you that the empty board would be an optimizing system in the GoL context. All I meant was that the "Death" square in the examples table might not quite correspond to it in the analogy, since the death property is perhaps not an optimization target by the definition. Sorry if that wasn't clear.
  2. :)
  5. Got it, thanks! So if I've understood correctly, you are currently only using the mask as a way to separate the agent from its environment at instantiation, since that is all you really need to do to be able to define properties like robustness and retargetability in this context. That seems reasonable.

Comment by Edouard Harris on Optimization Concepts in the Game of Life · 2021-10-20T21:13:20.335Z · LW · GW

Loved this post. This whole idea of using a deterministic dynamical system as a conceptual testing ground feels very promising.

A few questions / comments:

  1. About the examples: do you think it's strictly correct to say that entropy / death is an optimizing system? One of the conditions of the Flint definition is that the set of target states ought to be substantially smaller than the basin of attraction, by some measure on the configuration space. Yet neither high entropy nor death seem like they satisfy this: there are too many ways to be dead, and (tautologically) too many ways to have high entropy. As a result, both the "dead" property and the "high-entropy" property make up a large proportion of the attraction basin. The original post makes a similar point, though admittedly there is some degree of flexibility in terms of how big the target state set has to be before you call the system an optimizer.
  2. Not sure if this is a useful question, but what do you think of using "macrostate" as opposed to "property" to mean a set of states? This term "macrostate" is used in statistical physics for the identical concept, and as you're probably aware, there may be results from that field you'd be able to leverage here. (The "size" of a macrostate is usually thought of as its entropy over states, and this seems like it could fit into your framework as well. At first glance it doesn't seem too unreasonable to just use a flat prior over grid configurations, so this just ends up being the log of the state count.)
  3. I like the way embedded perturbations have been defined too. External perturbations don't seem fundamentally different from embedded ones (we can always just expand our configuration space until it includes the experimenter) but keeping perturbations "in-game" cuts out those complications while keeping the core problem in focus.
  4. The way you're using  and  as a way to smoothly vary the "degree" of optimization of a system is very elegant.
  5. Do you imagine keeping the mask constant over the course of a computational rollout? Plausibly as you start a computation some kinds of agents may start to decohere as they move outside the original mask area and/or touch and merge with bits of their environments. E.g., if the agent is a glider, does the mask "follow" the agent? Or are you for now mostly considering patterns like eaters that stay in one place?

Comment by Edouard Harris on Meta learning to gradient hack · 2021-10-06T22:04:54.102Z · LW · GW

Very neat. It's quite curious that switching to L2 for the base optimizer doesn't seem to have resulted in the meta-initialized network learning the sine function. What sort of network did you use for the meta-learner? (It looks like the 4-layer network in your Methods refers to your base optimizer, but perhaps it's the same architecture for both?)

Also, do you know if you end up getting the meta-initialized network to learn the sine function eventually if you train for thousands and thousands of steps? Or does it just never learn no matter how hard you train it?

Comment by Edouard Harris on AI takeoff story: a continuation of progress by other means · 2021-09-29T13:26:41.909Z · LW · GW

I see; perhaps I did misinterpret your earlier comment. It sounds like the transition you are more interested in is closer to (AI has ~free rein over the internet) => (AI invents nanotech). I don't think this is a step we should expect to be able to model especially well, but the best story/analogy I know of for it is probably the end part of That Alien Message. I.e., what sorts of approaches would we come up with if all of human civilization were bent on solving the equivalent problem from our point of view?

If instead you're thinking more about a transition like (AI is superintelligent but in a box) => (AI has ~free rein over the internet), then I'd say that I'd expect us to skip the "in a box" step entirely.

Comment by Edouard Harris on AI takeoff story: a continuation of progress by other means · 2021-09-29T12:22:37.043Z · LW · GW

No problem, glad it was helpful!

And thanks for the APS-AI definition, I wasn't aware of the term.

Comment by Edouard Harris on AI takeoff story: a continuation of progress by other means · 2021-09-28T17:50:01.941Z · LW · GW

Thanks! I agree with this critique. Note that Daniel also points out something similar in point 12 of his comment — see my response.

To elaborate a bit more on the "missing step" problem though:

  1. I suspect many of the most plausible risk models have features that make it undesirable for them to be shared too widely. Please feel free to DM me if you'd like to chat more about this.
  2. There will always be some point between Step 1 and Step 3 at which human-legible explanations fail. I.e., it would be extremely surprising if we could tell a coherent story about the whole process; the best we can do is assume the AI gets to the end state because it's highly competent, but we should expect it to do things we can't understand. (To be clear, I don't think this is quite what your comment was about. But it is a fundamental reason why we can't ever expect a complete explanation.)

Comment by Edouard Harris on AI takeoff story: a continuation of progress by other means · 2021-09-28T17:31:33.055Z · LW · GW

See my response to point 6 of Daniel's comment — it's rather that I'm imagining competing hedge funds (run by humans) beginning to enter the market with this sort of technology.

Comment by Edouard Harris on AI takeoff story: a continuation of progress by other means · 2021-09-28T17:21:56.815Z · LW · GW

Hey Daniel — thanks so much for taking the time to write this thoughtful feedback. I really appreciate you doing this, and very much enjoyed your "2026" post as well. I apologize for the delay and lengthy comment here, but wanted to make sure I addressed all your great points.

1. It would be great if you could pepper your story with dates, so that we can construct a timeline and judge for ourselves whether we think things are happening too quickly or not.

I've intentionally avoided referring to absolute dates, other than by indirect implication (e.g. "iOS 19"). In writing this, I was more interested in exploring how a plausible technical development model might interact with the cultural and economic contexts of these companies. As a result I decided to focus on a chain of events instead of a timeline.

But another reason is that I don't feel I know enough to have a strong view on dates. I do suspect we have been in an overhang of sorts for the past year or so, and that the key constraints on broad-based development of scaled models up to this point have been institutional frictions. It takes a long time to round up the internal buy-in you need for an investment at this scale, even in an org that has a technical culture, and even if you have a committed internal champion. And that means the pace of development immediately post-GPT3 is unusually dependent on random factors like the whims of decision-makers, and therefore has been/will be especially hard to predict.

(E.g., how big will Google Pathways be, in terms of scale/compute? How much capex committed? Nobody knows yet, as far as I can tell. As a wild guess, Jeff Dean could probably get a $1B allocation for this if he wanted to. Does he want $1B? Does he want $10B? Could he get $10B if he really pushed for it? Does the exec team "get it" yet? When you're thinking in terms of ROI for something like this, a wide range of outcomes is on the table.)

2. Auto-generated articles and auto-generated videos being so popular that they crowd out most human content creators... this happens at the beginning of the story? I think already this is somewhat implausible and also very interesting and deserves elaboration. Like, how are you imagining it: we take a pre-trained language model, fine-tune it on our article style, and then let it loose using RL from human feedback (clicks, ad revenue) to learn online? And it just works? I guess I don't have any arguments yet for why that shouldn't work, but it seems intuitively to me that this would only work once we are getting pretty close to HLAGI / APS-AI. How big are these models in your story? Presumably bigger than GPT-3, right, since even a fine-tuned GPT-3 wouldn't be able to outperform human content creators (right?). And currently video generation tech lags behind text generation tech.

The beginning of the story still lies in our future, so to be clear, this isn't a development I'd necessarily expect immediately. I am definitely imagining an LM bigger than GPT-3, but it doesn't seem at all implausible that ByteDance would build such an LM on, say, a 24-month timeframe from today. They certainly have the capital for it, and the company has a history of favoring algorithmic recommendations and AI over user-driven virality, so particularly in Toutiao's case, this would be a natural extension of their existing content strategy. And apart from pure scale, the major technical hurdle for auto-generated articles seems like it's probably the size of the attention window, on which people have been making notable progress recently.

I'd say the "it just works" characterization is not quite right: I explicitly say that this system takes some time to fine-tune even after it's first deployed in production. To elaborate a bit, I wouldn't expect any training based on human feedback at first, but rather something more like manual screening/editing of auto-generated articles by internal content teams. That last part is not something I said explicitly in the text; maybe I should?

I think your point about video is a great critique though. It's true that video has lagged behind text. My thinking here was that the Douyin/TikTok form factor is an especially viable setting to build early video gen models: the videos are short, and they already have a reliable reward model available in the form of the existing rec algorithm. But even though this might be the world's best corpus to train on, I do agree with you that there is more fundamental uncertainty around video models. I'd be interested in any further thoughts you might have on this point.

One question on this part: what do you mean by "APS-AI"?

3. "Not long after, Google rocks the tech industry with a major announcement at I/O. They’ve succeeded in training a deep learning model to completely auto-generate simple SaaS software from a natural-language description." Is this just like Codex but better? Maybe I don't know what SaaS software is.

Yes, pretty much just Codex but better. One quick-and-dirty way to think of SaaS use cases is: "any business workflow that touches a spreadsheet". There are many, many, many such use cases.

4. "At first, the public is astonished. But after nothing more is heard about this breakthrough for several months, most eventually dismiss it as a publicity stunt. But one year later, Google launches an improved version of the model in a new Search widget called “synthetic SaaS”." --I didn't successfully read between the lines here, what happened in that quiet year?

Ah this wasn't meant to be subtle or anything, just that it takes time to go from "prototype demo" to "Google-scale production rollout". Sorry if that wasn't clear.

5. "The S&P 500 doubles that year, driven by explosive growth in the big-cap tech stocks. Unemployment claims reach levels not seen since the beginning of the Covid crisis." Why is unemployment so high? So far it seems like basic programming jobs have been automated away, and lots of writing and video generation jobs. But how many jobs are those? Is it enough to increase unemployment by a few percent? I did some googling and it seems like there are between 0.5 and 1 million jobs in the USA that are like this, though I'm not at all confident. (there are 0.25M programmer jobs) More than a hundred million total employed, though. So to make unemployment go up by a couple percent a bunch of other stuff would need to be automated away besides the stuff you've mentioned, right?

You're absolutely right. I was imagining some additional things happening here which I didn't put into the story and therefore didn't think through in enough detail. I'd expect unemployment to increase, but not necessarily to this extent or on these timescales. Will delete this sentence — thanks!
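
As a rough check on the arithmetic in the question above, here is a minimal sketch. It uses only the figures quoted there (0.5 to 1 million affected jobs, and a workforce of at least one hundred million). The variable names are illustrative, and using 100 million as the denominator is an assumption that overstates the effect if the true workforce is larger.

```python
# Figures quoted in the question above; both are rough estimates.
affected_jobs_low = 0.5e6    # lower estimate of jobs automated away
affected_jobs_high = 1.0e6   # upper estimate
total_employed = 100e6       # "more than a hundred million", so this is a floor

def unemployment_increase_pct(jobs_lost, employed):
    """Percentage-point rise in unemployment if every lost job stays lost."""
    return 100.0 * jobs_lost / employed

low = unemployment_increase_pct(affected_jobs_low, total_employed)
high = unemployment_increase_pct(affected_jobs_high, total_employed)
print(f"Upper bound on the increase: {low:.1f} to {high:.1f} percentage points")
# prints: Upper bound on the increase: 0.5 to 1.0 percentage points
```

On these numbers the automation described in the story accounts for at most about one percentage point, which is why a rise of "a couple percent" would need other job categories to be affected as well.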

6. "At the end of that year, the stock market once again delivers astronomical gains. Yet, curiously, the publicly disclosed performance of hedge funds — particularly of the market-neutral funds that trade most frequently — consists almost entirely of losses." I take it this is because several tech companies are secretly using AI to trade? Is that legal? How would they be able to keep this secret?

Good question. I don't actually expect that any tech companies would do this. While it could strictly speaking be done in a legal way, I can't imagine the returns would justify the regulatory and business-relationship risk. More to the point, big tech cos already own money machines that work, and that have even better returns on capital than market trading from an unleveraged balance sheet would.

My implication here is rather that other hedge funds enter the market and begin trading using sophisticated AIs. Hedge funds aren't required to publicly disclose their returns, so I'm imagining that one or more of these funds have entered the market without disclosure.

7. You have a section on autonomous drones. Why is it relevant? Is the implication that they are going to be used by the AI to take over? The last section makes it seem like the AI would have succeeded in taking over anyway, drones or no. Ditto for the USA's self-improving cyberwar software.

Great observation. I was debating whether to cut this part, actually. I kept it because 1) it motivated the plot later, when OpenAI debates whether to build in an explicit self-improvement mechanism; and 2) it felt like I should tell some kind of story about military applications. But given how I'm actually thinking about self-improvement and the risk model (see 9 and 12, below) I think this can be cut with little loss.

8. "Codex 4 is expected to cost nearly a billion dollars in compute alone." This suggests that all the AIs made so far cost less than that? Which means it's, like, not even 2025 yet according to Ajeya's projection?

Oh yeah, you're totally right and this is a major error on my part. This should be more like $10B+. Will edit!

9. "After a rigorous internal debate, it’s also decided to give Codex 4 the ability to suggest changes to its own codebase during training, in an attempt to maximize performance via architectural improvements in the model." I thought part of the story here was that more complex architectures do worse? Are you imagining that Codex 4 discovers simpler architectures? By the way, I don't think that's a plausible part of the story -- I think even if the scaling hypothesis and bitter lesson are true, it's still the case that more complex, fiddly architectures help. It's just that they don't help much compared to scaling up compute.

I agree that the bitter lesson is not as straightforward as "complex architectures do worse", and I also agree with you that fiddly architectures can do better than simple ones. But I don't really believe the kinds of fiddly architectures humans will design are likely to perform better than our simplest architectures at scale. Roughly speaking, I do not believe we are smart enough to approach this sort of work with the right assumptions to design good architectures, and under those conditions, the fewer assumptions we embed in our architectures, the better.

I do believe that the systems we build will be better at designing such architectures than we are, though. And that means there is indeed something to be gained from fiddly architectures — just not from "human-fiddly" ones. In fact, you can argue that this is what meta-learning does: a system that meta-learns is one that redesigns its own architecture, in some sense. And actually, articulating it that way suggests that this kind of self-improvement is really just the limit case of meta-learning — which in turn makes the explicit self-improvement scheme in my story redundant! So yep, I think this gets cut too. :)

10. "This slows down the work to a crawl and multiplies the expense by an order of magnitude, but safety is absolutely paramount." Why is Microsoft willing to pay these costs? They don't seem particularly concerned about AI risk now, are you imagining this changes in the next 4 years? How does it change? Is it because people are impressed by all the AI progress and start to listen to AI safety people?

There is no "canon" reason why they are doing this — I'm taking some liberties in this direction because I don't expect the kinds of safety precautions they are taking to matter much. However I do expect that alignment will soon become an obvious limiting factor in getting big models to do what we want, and it doesn't seem too unreasonable to expect this might be absorbed as a more general lesson.

11. Also, if it's slowing the work to a crawl and multiplying the expense, shouldn't Microsoft/OpenAI be beaten to the punch by some other company that isn't bothering with those precautions? Or is the "market" extremely inefficient, so to speak?

The story as written is intentionally consistent with OpenAI being beaten to the punch by a less cautious company. In fact, I consider that the more plausible failure scenario (see next point) even though the text strongly implies otherwise.

Still, it's marginally plausible that nobody was yet willing to commit funds on that scale at the time of the project — and in the world of this story, that's indeed what happened. Relatively few organizations have the means for something like this, so that does make the market less efficient than it would be if it had more viable participants.

12. "Not long after this, the world ends." Aaaaagh tell me more! What exactly went wrong? Why did the safety techniques fail? (To be clear, I totally expect that the techniques you describe would fail. But I'm interested to hear your version of the story.)

Yeah, I left this deliberately ambiguous. The reason is that I'm working from a risk model that I'm a bit reluctant to publicize too widely, since it feels like there is some chance that the publication itself might be slightly risky. (I have shared it privately with a couple of folks though, and would be happy to follow up with you on this by DM — please let me know if you're interested.) As a result, while I didn't want to write a story that was directly inconsistent with my real risk model, I did end up writing a story that strongly implies an endgame scenario which I don't actually believe is very likely (i.e., "OpenAI carefully tries to train an aligned AI but it blows up").

Honestly I wasn't 100% sure how to work around this problem — hence the ambiguity and the frankly kludgy feel of the OpenAI bit at the end. But I figured the story itself was worth posting at least for its early development model (predicated on a radical version of connectionism) and economic deployment scenario (predicated on earliest rollouts in environments with fastest feedback cycles). I'd be especially interested in your thoughts on how to handle this, actually.

13. Who is Jessica? Is she someone important? If she's not important, then it wouldn't be worth a millisecond delay to increase success probability for killing her.

Jessica is an average person. The AI didn't delay anything to kill her; it doesn't care about her. Rather I'm intending to imply that whatever safety precautions were in place to keep the AI from breaking out merely had the effect of causing a very small time delay.

14. It sounds like you are imagining some sort of intelligence explosion happening in between the Codex 4 section and the Jessica section. Is this right or a misinterpretation?

Yes that is basically right.

Thanks again Daniel!


UPDATE: Made several changes to the post based on this feedback.

Comment by Edouard Harris on The theory-practice gap · 2021-09-23T14:12:45.247Z · LW · GW

I see. Okay, I definitely agree that makes sense under the "fails to generalize" risk model. Thanks Rohin!

Comment by Edouard Harris on The theory-practice gap · 2021-09-22T12:55:48.887Z · LW · GW

Got it, thanks!

I find it plausible that the AI systems fail in only a special few exotic circumstances, which aren't the ones that are actually created by AGI.

This helps, and I think it's the part I don't currently have a great intuition for. My best attempt at steel-manning would be something like: "It's plausible that an AGI will generalize correctly to distributions which it is itself responsible for bringing about." (Where "correctly" here means "in a way that's consistent with its builders' wishes.") And you could plausibly argue that an AGI would have a tendency to not induce distributions that it didn't expect it would generalize correctly on, though I'm not sure if that's the specific mechanism you had in mind.

Comment by Edouard Harris on The theory-practice gap · 2021-09-21T00:12:56.015Z · LW · GW

I agree with pretty much this whole comment, but do have one question:

But it still seems plausible that in practice we never hit those exotic circumstances (because those exotic circumstances never happen, or because we've retrained the model before we get to the exotic circumstances, etc), and it's intent aligned in all the circumstances the model actually encounters.

Given that this is conditioned on us getting to AGI, wouldn't the intuition here be that pretty much all the most valuable things such a system would do would fall under "exotic circumstances" with respect to any realistic training distribution? I might be assuming too much in saying that — e.g., I'm taking it for granted that anything we'd call an AGI could self-improve to the point of accessing states of the world that we wouldn't be able to train it on; and also I'm assuming that the highest-reward states would probably be these exotic / hard-to-access ones. But both of those do seem (to me) like they'd be the default expectation.

Or maybe you mean it seems plausible that, even under those exotic circumstances, an AGI may still be able to correctly infer our intent, and be incentivized to act in alignment with it?

Comment by Edouard Harris on The alignment problem in different capability regimes · 2021-09-16T00:01:12.017Z · LW · GW

But in the context of superhuman systems, I think we need to be more concerned by the possibility that it’s performance-uncompetitive to restrict your system to only take actions that can be justified entirely with human-understandable reasoning.

Interestingly, this is already a well known phenomenon in the hedge fund world. In fact, quant funds discovered about 25 years ago that the most consistently profitable trading signals are reliably the ones that are the least human-interpretable. It makes intuitive sense: any signal that can be understood by a human is at risk of being copied by a human, so if you insist that your trading decisions have to be interpretable, you'll pay for that insistence in alpha.

I'd imagine this kind of issue is already top-of-mind for folks who are working on the various transparency agendas, but it does imply that there's a very strong optimization pressure directly against interpretability in many economically relevant contexts. In fact, it could hardly be stronger: your forcing function is literally "Want to be a billionaire? Then you'll have to trade exclusively on the most incomprehensible signals you can find."

(Of course this isn't currently true of all hedge funds, only a few specialized ones.)

Comment by Edouard Harris on The alignment problem in different capability regimes · 2021-09-15T23:44:20.011Z · LW · GW

One reason to favor such a definition of alignment might be that we ultimately need a definition that gives us guarantees that hold at human-level capability or greater, and humans are probably near the bottom of the absolute scale of capabilities that can be physically realized in our world. It would (imo) be surprising to discover a useful alignment definition that held across capability levels way beyond us, but that didn't hold below our own modest level of intelligence.

Comment by Edouard Harris on When Most VNM-Coherent Preference Orderings Have Convergent Instrumental Incentives · 2021-08-24T22:26:24.652Z · LW · GW

No problem! Glad it was helpful. I think your fix makes sense.

I'm not quite sure what the error was in the original proof of Lemma 3; I think it may be how I converted to and interpreted the vector representation.

Yeah, I figured maybe it was because the dummy variable  was being used in the EV to sum over outcomes, while the vector  was being used to represent the probabilities associated with those outcomes. Because  and  are similar it's easy to conflate their meanings, and if you apply  to the wrong one by accident that has the same effect as applying  to the other one. In any case though, the main result seems unaffected.


Comment by Edouard Harris on When Most VNM-Coherent Preference Orderings Have Convergent Instrumental Incentives · 2021-08-20T20:00:11.958Z · LW · GW

Thanks for writing this.

I have one point of confusion about some of the notation that's being used to prove Lemma 3. Apologies for the detail, but the mistake could very well be on my end so I want to make sure I lay out everything clearly.

First,  is being defined here as an outcome permutation. Presumably this means that 1)  for some ; and 2)  admits a unique inverse . That makes sense.

We also define lotteries over outcomes, presumably as, e.g., , where  is the probability of outcome . Of course we can interpret the  geometrically as mutually orthogonal unit vectors, so this lottery defines a point on the -simplex. So far, so good.

But the thing that's confusing me is what this implies for the definition of . Because  is defined as a permutation over outcomes (and not over probabilities of outcomes), we should expect this to be

The problem is that this seems to give a different EV from the lemma:

(Note that I'm using  as the dummy variable rather than , but the LHS above should correspond to line 2 of the proof.) Doing the same thing for the  lottery gives an analogous result. And then looking at the inequality that results suggests that lemma 3 should actually be " induces " as opposed to " induces ".

(As a concrete example, suppose we have a lottery  with the permutation . Then  and our EV is

Yet  which appears to contradict the lemma as stated.)

Note that even if this analysis is correct, it doesn't invalidate your main claim. You only really care about the existence of a bijection rather than what that bijection is — the fact that your outcome space is finite ensures that the proportion of orbit elements that incentivize power seeking remains the same either way. (It could have implications if you try to extend this to a metric space, though.)

Again, it's also possible I've just misunderstood something here — please let me know if that's the case!
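
The distinction this comment turns on (permuting outcomes versus permuting the probabilities attached to them) can be checked on a small case. The sketch below is illustrative only: the three-outcome lottery, the 3-cycle, and the utility values are made up, and the names `phi`, `lottery`, and `utility` are my own labels rather than the post's notation.

```python
from fractions import Fraction as F

# Hypothetical 3-outcome example; outcomes are labelled 0, 1, 2.
phi = {0: 1, 1: 2, 2: 0}                   # outcome permutation (a 3-cycle)
phi_inv = {j: i for i, j in phi.items()}   # its unique inverse

lottery = [F(5, 10), F(3, 10), F(2, 10)]   # lottery[i] = probability of outcome i
utility = [1, 0, 10]                       # utility[i] = utility of outcome i

def permute_outcomes(probs, perm):
    """Relabel each outcome i as perm[i], carrying its probability along."""
    out = [F(0)] * len(probs)
    for i, p in enumerate(probs):
        out[perm[i]] += p
    return out

def permute_indices(probs, perm):
    """Reindex the probability vector itself: entry i becomes probs[perm[i]]."""
    return [probs[perm[i]] for i in range(len(probs))]

def expected_utility(probs, u):
    return sum(p * ui for p, ui in zip(probs, u))

by_outcomes = permute_outcomes(lottery, phi)
by_indices = permute_indices(lottery, phi)

# Permuting outcomes by phi equals reindexing probabilities by the inverse of phi.
print(by_outcomes == permute_indices(lottery, phi_inv))  # True
print(by_outcomes == by_indices)                         # False

print(expected_utility(by_outcomes, utility))  # 16/5
print(expected_utility(by_indices, utility))   # 53/10
```

For a permutation that is its own inverse (a single swap, say) the two operations coincide, so a cycle of length three or more is needed to see the difference.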

Comment by Edouard Harris on Re-Define Intent Alignment? · 2021-08-13T16:11:21.682Z · LW · GW

Update: having now thought more deeply about this, I no longer endorse my above comment.

While I think the reasoning was right, I got the definitions exactly backwards. To be clear, what I would now claim is:

  1. The behavioral objective is the thing the agent is revealed to be pursuing under arbitrary distributional shifts.
  2. The mesa-objective is something the agent is revealed to be pursuing under some subset of possible distributional shifts.

Everything in the above comment then still goes through, except with these definitions reversed.

On the one hand, the "perfect IRL" definition of the behavioral objective seems more naturally consistent with the omnipotent experimenter setting in the IRL unidentifiability paper cited downthread. As far as I know, perfect IRL isn't defined anywhere other than by reference to this reward modelling paper, which introduces the term but doesn't define it either. But the omnipotent experimenter setting seems to capture all the properties implied by perfect IRL, and does so precisely enough that one can use it to make rigorous statements about the behavioral objective of a system in various contexts.

On the other hand, it's actually perfectly possible for a mesa-optimizer to have a mesa-objective that is inconsistent with its own actions under some subset of conditions (the key conceptual error I was making was in thinking this was not possible). For example, a human being is a mesa-optimizer from the point of view of evolution. A human being may have something like "maximize happiness" as their mesa-objective. And a human being may, and frequently does, do things that do not maximize for their happiness.

A few consequences of the above:

  • Under an "omnipotent experimenter" definition, the behavioral objective (and not the mesa-objective) is a reliable invariant of the agent.
  • It's entirely possible for the behavioral objective to be overdetermined in certain situations. That is, if we run every possible experiment on an agent, we may find that the only reward / utility function consistent with its behavior across all those experiments is the trivial utility function that's constant across all states.
  • If the behavioral objective of a system is overdetermined, that might mean the system never pursues anything coherently. But it might also mean that there exist subsets of distributions on which the system pursues an objective very coherently, but that different distributions induce different coherent objectives.
  • The natural way to use the mesa-objective concept is to attach it to one of these subsets of distributions on which we hypothesize our system is pursuing a goal coherently. If we apply a restricted version of the omnipotent experimenter definition — that is, run every experiment on our agent that's consistent with the subset of distributions we're conditioning on — then we will in general recover a set of mesa-objective candidates consistent with the system's actions on that subset.
  • It is strictly incorrect to refer to "the" mesa-objective of any agent or optimizer. Any reference to a mesa-objective has to be conditioned on the subset of distributions it applies on, otherwise it's underdetermined. (I believe Jack refers to this as a "perturbation set" downthread.)
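
A toy sketch of the bullets above, under made-up assumptions: the agent picks one item from a menu, candidate objectives are strict rankings over three items, and the agent's behavior depends on a context flag. None of this comes from the cited papers. It only illustrates how running every experiment can leave no nontrivial objective standing, while each restricted subset of experiments still picks out a coherent one.

```python
from itertools import combinations, permutations

ITEMS = ("A", "B", "C")
MENUS = [m for r in (2, 3) for m in combinations(ITEMS, r)]

def agent_choice(context, menu):
    """Hypothetical agent: ranks A > B > C on weekdays, C > B > A on weekends."""
    order = "ABC" if context == "weekday" else "CBA"
    return min(menu, key=order.index)

def consistent_rankings(experiments):
    """Strict rankings under which every observed choice tops its menu."""
    return [ranking for ranking in permutations(ITEMS)
            if all(min(menu, key=ranking.index) == agent_choice(ctx, menu)
                   for ctx, menu in experiments)]

all_experiments = [(c, m) for c in ("weekday", "weekend") for m in MENUS]
weekday_only = [(c, m) for c, m in all_experiments if c == "weekday"]
weekend_only = [(c, m) for c, m in all_experiments if c == "weekend"]

# Every experiment: no strict ranking survives, leaving only the constant utility.
print(consistent_rankings(all_experiments))  # []
# Conditioning on a subset of distributions recovers a coherent objective.
print(consistent_rankings(weekday_only))     # [('A', 'B', 'C')]
print(consistent_rankings(weekend_only))     # [('C', 'B', 'A')]
```

The constant utility function is left out of the candidate list because it is consistent with any behavior at all; it is the only thing that remains when the list comes back empty.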

This seems like it puts these definitions on a more rigorous footing. It also starts to clarify in my mind the connection with the "generalization-focused approach" to inner alignment, since it suggests a procedure one might use in principle to find out whether a system is pursuing coherent utilities on some subset of distributions. ("When we do every experiment allowed by this subset of distributions, do we recover a nontrivial utility function or not?")

Would definitely be interested in getting feedback on these thoughts!

Comment by Edouard Harris on Re-Define Intent Alignment? · 2021-08-06T19:55:36.149Z · LW · GW

I'm with you on this, and I suspect we'd agree on most questions of fact around this topic. Of course demarcation is an operation on maps and not on territories.

But as a practical matter, the moment one starts talking about the definition of something such as a mesa-objective, one has already unfolded one's map and started pointing to features on it. And frankly, that seems fine! Because historically, a great way to make forward progress on a conceptual question has been to work out a sequence of maps that give you successive degrees of approximation to the territory.

I'm not suggesting actually trying to imbue an AI with such concepts — that would be dangerous (for the reasons you alluded to) even if it wasn't pointless (because prosaic systems will just learn the representations they need anyway). All I'm saying is that the moment we started playing the game of definitions, we'd already started playing the game of maps. So using an arbitrary demarcation to construct our definitions might be bad for any number of legitimate reasons, but it can't be bad just because it caused us to start using maps: our earlier decision to talk about definitions already did that.

(I'm not 100% sure if I've interpreted your objection correctly, so please let me know if I haven't.)

Comment by Edouard Harris on Re-Define Intent Alignment? · 2021-08-05T13:19:09.273Z · LW · GW

Yeah I agree this is a legitimate concern, though it seems like it is definitely possible to make such a demarcation in toy universes (like in the example I gave above). And therefore it ought to be possible in principle to do so in our universe.

To try to understand a bit better: does your pessimism about this come from the hardness of the technical challenge of querying a zillion-particle entity for its objective function? Or does it come from the hardness of the definitional challenge of exhaustively labeling every one of those zillion particles to make sure the demarcation is fully specified? Or is there a reason you think constructing any such demarcation is impossible even in principle? Or something else?

Comment by Edouard Harris on Re-Define Intent Alignment? · 2021-08-04T19:00:40.167Z · LW · GW

I'm not sure what would constitute a clearly-worked counterexample. To me, a high reliance on an agent/world boundary constitutes a "non-naturalistic" assumption, which simply makes me think a framework is more artificial/fragile.

Oh for sure. I wouldn't recommend having a Cartesian boundary assumption as the fulcrum of your alignment strategy, for example. But what could be interesting would be to look at an isolated dynamical system, draw one boundary, investigate possible objective functions in the context of that boundary; then erase that first boundary, draw a second boundary, investigate that; etc. And then see whether any patterns emerge that might fit an intuitive notion of agency. But the only fundamentally real object here is always going to be the whole system, absolutely.

As I understand, something like AIXI forces you to draw one particular boundary because of the way the setting is constructed (infinite on one side, finite on the other). So I'd agree that sort of thing is more fragile.

The multiagent setting is interesting though, because it gets you into the game of carving up your universe into more than 2 pieces. Again it would be neat to investigate a setting like this with different choices of boundaries and see if some choices have more interesting properties than others.

Comment by Edouard Harris on Re-Define Intent Alignment? · 2021-08-04T18:37:22.915Z · LW · GW

I would further add that looking for difficulties created by the simplification seems very intellectually productive.

Yep, strongly agree. And a good first step to doing this is to actually build as robust a simplification as you can, and then see where it breaks. (Working on it.)

Comment by Edouard Harris on Re-Define Intent Alignment? · 2021-08-04T17:46:56.300Z · LW · GW

Ah I see! Thanks for clarifying.

Yes, the point about the Cartesian boundary is important. And it's completely true that any agent / environment boundary we draw will always be arbitrary. But that doesn't mean one can't usefully draw such a boundary in the real world — and unless one does, it's hard to imagine how one could ever generate a working definition of something like a mesa-objective. (Because you'd always be unable to answer the legitimate question: "the mesa-objective of what?")

Of course the right question will always be: "what is the whole universe optimizing for?" But it's hard to answer that! So in practice, we look at bits of the whole universe that we pretend are isolated. All I'm saying is that, to the extent you can meaningfully ask the question, "what is this bit of the universe optimizing for?", you should be able to clearly demarcate which bit you're asking about.

(i.e. I agree with you that duality is a useful fiction, just saying that we can still use it to construct useful definitions.)

Comment by Edouard Harris on Re-Define Intent Alignment? · 2021-08-04T14:51:39.772Z · LW · GW

which stems from the assumption that you are able to carve an environment up into an agent and an environment and place the "same agent" in arbitrary environments. No such thing is possible in reality, as an agent cannot exist without its environment


I might be misunderstanding what you mean here, but carving up a world into agent vs environment is absolutely possible in reality, as is placing that agent in arbitrary environments to see what it does. You can think of the traditional RL setting as a concrete example of this: on one side we have an agent that is executing some policy $\pi(a \mid s)$; and on the other side we have an environment that consists of state transition dynamics given by some distribution $P(s' \mid s, a)$. One can in fact show (see the unidentifiability in IRL paper) that if an experimenter has the power to vary the environment $P(s' \mid s, a)$ arbitrarily and look at the policies the agent pursues on each of those environments, then that experimenter can recover a reward function that is unique up to the usual affine transformations.

That recovered reward function is a fortiori a reliable invariant of the agent, since it is consistent with the agent's actions under every possible environment the agent could be exposed to. (To be clear, this claim is also proved in the paper.) It also seems reasonable to identify that reward function with the mesa-objective of the agent, because any mesa-objective that is not identical with that reward function has to be inconsistent with the agent's actions on at least one environment.

Admittedly there are some technical caveats to this particular result: off the top, 1) the set of states & actions is fixed across environments; 2) the result was proved only for finite sets of states & actions; and 3) optimal policy is assumed. I could definitely imagine taking issue with some of these caveats — is this the sort of thing you mean? Or perhaps you're skeptical that a proof like this in the RL setting could generalize to the train/test framing we generally use for NNs?
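
To make "unique up to the usual affine transformations" concrete, here is a minimal sketch on a made-up MDP. The transition table, rewards, and discount factor are all hypothetical, and plain value iteration stands in for whatever planner the agent actually uses. It shows only the easy direction: a positively scaled and shifted reward produces the same optimal policy, so behavior alone cannot tell the two apart, whereas a sign flip does change behavior.

```python
GAMMA = 0.9  # discount factor (assumed)

# Hypothetical 3-state, 2-action MDP. P[s][a][s2] = transition probability.
P = [
    [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1]],
    [[0.0, 0.9, 0.1], [0.5, 0.0, 0.5]],
    [[0.0, 0.1, 0.9], [0.8, 0.2, 0.0]],
]
R = [[0.0, 1.0], [0.5, -1.0], [2.0, 0.0]]  # R[s][a], also made up

def q_values(reward, values):
    """One Bellman backup: Q[s][a] = r(s, a) + gamma * E[V(s')]."""
    return [[reward[s][a] + GAMMA * sum(p * v for p, v in zip(P[s][a], values))
             for a in range(2)] for s in range(3)]

def optimal_policy(reward, sweeps=2000):
    """Value iteration, then the greedy action in each state."""
    values = [0.0] * 3
    for _ in range(sweeps):
        values = [max(row) for row in q_values(reward, values)]
    return [row.index(max(row)) for row in q_values(reward, values)]

def transform(reward, scale, shift):
    return [[scale * r + shift for r in row] for row in reward]

base = optimal_policy(R)
print(base == optimal_policy(transform(R, 3.0, 7.0)))   # True: positive affine map
print(base == optimal_policy(transform(R, -1.0, 0.0)))  # False: sign flip
```

The converse direction, that varying the environment pins the reward down to this equivalence class, is the part the paper proves and is not demonstrated here.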

in the OOD robustness literature you try to optimize worst-case performance over a perturbation set of possible environments.

Yeah that's sensible because this is often all you can do in practice. Having an omnipotent experimenter is rarely realistic, but imo it's still useful as a way to bootstrap a definition of the mesa-objective.

Btw, if you're aware of any counterpoints to this — in particular anything like a clearly worked-out counterexample showing that one can't carve up a world, or recover a consistent utility function through this sort of process — please let me know. I'm directly working on a generalization of this problem at the moment, and anything like that could significantly accelerate my execution.


Comment by Edouard Harris on Re-Define Intent Alignment? · 2021-07-22T21:55:11.767Z · LW · GW

If we wish, we could replace or re-define "capability robustness" with "inner robustness", the robustness of pursuit of the mesa-objective under distributional shift.

I strongly agree with this suggestion. IMO, tying capability robustness to the behavioral objective confuses a lot of things, because the set of plausible behavioral objectives is itself not robust to distributional shift.

One way to think about this from the standpoint of the "Objective-focused approach" might be: the mesa-objective is the thing the agent is revealed to be pursuing under arbitrary distributional shifts. To be precise: suppose we take the world and split it into an "agent" part and "environment" part. Then we expose the agent to every possible environment (or data distribution) allowed by our laws of physics, and we note down what the agent does in each of them. Any objective function that's consistent with our agent's actions across all of those environments must then count as a valid mesa-objective. (This is pretty much Amin & Singh's omnipotent experimenter setting.)

The behavioral objective, meanwhile, would be more like the thing the agent appears to be pursuing under some subset of possible distributional shifts. This is the more realistic case where we can't afford to expose our agent to every possible environment (or data distribution) that could possibly exist, so we make do and expose it to only a subset of them. Then we look at what objectives could be consistent with the agent's behavior under that subset of environments, and those count as valid behavioral objectives.

The key here is that the set of allowed mesa-objectives is a reliable invariant of the agent, while the set of allowed behavioral objectives is contingent on our observations of the agent's behavior under a limited set of environments. In principle, the two sets of objectives won't converge perfectly until we've run our agent in every possible environment that could exist.

So if we do an experiment whose results are consistent with behavioral objectives , and we want to measure the agent's capability robustness with respect to , we'd apply a distributional shift and see how well the agent performs at . But what if  isn't actually the mesa-objective? Then the fact that the agent appeared to be pursuing  at all was just an artifact of the limited set of experiments we were running. So if our agent does badly at  under the shift, maybe the problem isn't a capability shortfall — maybe the problem is that the agent doesn't care about  and never did.

Whereas with your definition of inner robustness, we'd at least be within our rights to say that the true mesa-objective was , and therefore that doing badly at  really does say something about the capability of our agent.
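
The argument in this comment can be sketched with a toy agent on a five-cell line. Everything here is a made-up illustration: the "environment" is just the agent's starting cell, candidate objectives are goal cells, and `TRUE_GOAL` plays the role of the objective that survives every experiment.

```python
CELLS = range(5)   # positions on a line: 0, 1, 2, 3, 4
TRUE_GOAL = 3      # hidden from the experimenter

def step_toward(goal, start):
    """Action of an agent that wants to reach `goal`: -1, 0, or +1."""
    return (goal > start) - (goal < start)

def consistent_goals(starts):
    """Candidate goals that explain the agent's action from every observed start."""
    return {g for g in CELLS
            if all(step_toward(g, s) == step_toward(TRUE_GOAL, s) for s in starts)}

observed_subset = consistent_goals([0, 1])   # a limited set of experiments
every_start = consistent_goals(CELLS)        # the omnipotent experimenter

print(observed_subset)  # {2, 3, 4}
print(every_start)      # {3}

# Distributional shift to an unobserved start, scored against candidate goal 4:
# an agent pursuing goal 4 would stay put, but this agent steps away.
print(step_toward(TRUE_GOAL, 4) == step_toward(4, 4))  # False
```

Scored against goal 4, the agent looks like it fails under the shift, yet nothing about its capability has changed. Goal 4 was only ever consistent with the two starts that happened to be observed.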

Comment by Edouard Harris on Utility Maximization = Description Length Minimization · 2021-07-19T20:20:24.062Z · LW · GW

Ah yes, that's right. Yeah, I just wanted to make this part fully explicit to confirm my understanding. But I agree it's equivalent to just let  ignore the extra  (or whatever) component.

Thanks very much!

Comment by Edouard Harris on Utility Maximization = Description Length Minimization · 2021-07-15T23:09:03.982Z · LW · GW

Late comment here, but I really liked this post and want to make sure I've fully understood it. In particular there's a claim near the end which says: if  is not fixed, then we can build equivalent models  for which it is fixed. I'd like to formalize this claim to make sure I'm 100% clear on what it means. Here's my attempt at doing that:

For any pair of models  where , there exists a variable  (of which  is a subset) and a pair of models  such that 1)  for any ; and 2) the behavior of the system is the same under  as it was under .

To satisfy this claim, we construct our  as the conjunction of  and some "extra" component . e.g.,  for a coin flip,  for a die roll, and so  is the conjunction of the coin flip and the die roll, and the domain  of  is the outer product of the coin flip domain and of the die roll domain.

Then we construct our  by imposing 1)  (i.e.,  are logically independent given  for every ); and 2)  (i.e., the marginal prob given  equals the original prob under ).

Finally we construct  by imposing the analogous 2 conditions that we did for : 1)  and 2) . But we also impose the extra condition 3)  (assuming finite sets, etc.).

We can always find  and  that satisfy the above conditions, and with these choices we end up with  for all  (i.e.,  is fixed) and  (i.e., the system retains the same dynamics).

Is this basically right? Or is there something I've misunderstood?

Comment by Edouard Harris on BASALT: A Benchmark for Learning from Human Feedback · 2021-07-12T17:46:29.349Z · LW · GW

That makes sense, though I'd also expect that LfHF benchmarks like BASALT could turn out to be a better fit for superscale models in general. (e.g. a BASALT analogue might do a better job of capturing the flexibility of GPT-N or DALL-E type models than current benchmarks do, though you'd probably need to define a few hundred tasks for that to be useful. It's also possible this has already been done and I'm unaware of it.)

Comment by Edouard Harris on BASALT: A Benchmark for Learning from Human Feedback · 2021-07-11T19:45:57.339Z · LW · GW

Love this idea. From the linked post on the BAIR website, the idea of "prompting" a Minecraft task with e.g. a brief sequence of video frames seems especially interesting.

Would you anticipate the benchmark version of this would ask participants to disclose metrics such as "amount of task-specific feedback or data used in training"? Or does this end up being too hard to quantify because you're explicitly expecting folks to use a variety of feedback modalities to train their agents?

Comment by Edouard Harris on Formal Inner Alignment, Prospectus · 2021-05-18T18:57:19.828Z · LW · GW

Great post.

I responded that for me, the whole point of the inner alignment problem was the conspicuous absence of a formal connection between the outer objective and the mesa-objective, such that we could make little to no guarantees based on any such connection.

Strong agree. In fact I believe developing the tools to make this connection could be one of the most productive focus areas of inner alignment research.

What I'd like to have would be several specific formal definitions, together with several specific informal concepts, and strong stories connecting all of those things together.

In connection with this, it may be worth checking out my old post where I try to untangle capability from alignment in the context of a particular optimization problem. I now disagree with around 20% of what I wrote there, but I still think it was a decent first stab at formalizing some of the relevant definitions, at least from a particular viewpoint.

Comment by Edouard Harris on Defining capability and alignment in gradient descent · 2021-01-27T22:26:55.303Z · LW · GW

Thanks, Rohin!

Please note that I'm currently working on a correction for part of this post — the form of the mesa-objective  I'm claiming is in fact wrong, as Charlie correctly alludes to in a sibling comment.

Comment by Edouard Harris on Clarifying inner alignment terminology · 2020-11-11T23:14:23.722Z · LW · GW

Sure, makes sense! Though to be clear, I believe what I'm describing should apply to optimizers other than just gradient descent — including optimizers one might think of as reward-maximizing agents.

Comment by Edouard Harris on Clarifying inner alignment terminology · 2020-11-11T21:39:24.739Z · LW · GW

Great post. Thanks for writing this — it feels quite clarifying. I'm finding the diagram especially helpful in resolving the sources of my confusion.

I believe everything here is consistent with the definitions I proposed recently in this post (though please do point out any inconsistencies if you see them!), with the exception of one point.

This may be a fundamental confusion on my part — but I don't see objective robustness, as defined here, as being a separate concept at all from inner alignment. The crucial point, I would argue, is that we ought to be treating the human who designed our agent as the base optimizer for the entire system. 

Zooming in on the "inner alignment  objective robustness" part of the diagram, I think what's actually going on is something like:

  1. A human AI researcher wishes to optimize for some base objective, .
  2. It would take too much work for our researcher to optimize for  manually. So our researcher builds an agent to do the work instead, and sets  to be the agent's loss function.
  3. Depending on how it's built, the agent could end up optimizing for , or it could end up optimizing for something different. The thing the agent ends up truly optimizing for is the agent's behavioral objective — let's call it . If  is aligned with , then the agent satisfies objective robustness by the above definition: its behavioral objective is aligned with the base. So far, so good.

    But here's the key point: from the point of view of the human researcher who built the agent, the agent is actually a mesa-optimizer, and the agent's "behavioral objective" is really just the mesa-objective of that mesa-optimizer.
  4. And now, we've got an agent that wishes to optimize for some mesa-objective . (Its "behavioral objective" by the above definition.)
  5. And then our agent builds a sub-agent to do the work instead, and sets  to be the sub-agent's loss function.
  6. I'm sure you can see where I'm going with this by now, but the sub-agent the agent builds will have its own objective  which may or may not be aligned with , which may or may not in turn be aligned with . From the point of view of the agent, that sub-agent is a mesa-optimizer. But from the point of view of the researcher, it's actually a "mesa-mesa-optimizer".

That is to say, I think there are three levels of optimizers being invoked implicitly here, not just two. Through that lens, "intent alignment", as defined here, is what I'd call "inner alignment between the researcher and the agent"; and "inner alignment", as defined here, is what I'd call "inner alignment between the agent and the mesa-optimizer it may give rise to".
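
A toy sketch of the viewpoint-relative naming described above, assuming the three optimizers form a simple linear chain. The names `CHAIN` and `relative_label`, and the choice to call an optimizer the "base optimizer" from its own point of view, are hypothetical conventions used only for illustration.

```python
# Hypothetical chain of optimizers, outermost first.
CHAIN = ["researcher", "agent", "sub-agent"]


def relative_label(viewer, target, chain=CHAIN):
    """Describe `target` from the point of view of `viewer`.

    Each level of nesting below the viewer adds one "mesa-" prefix.
    """
    depth = chain.index(target) - chain.index(viewer)
    if depth < 0:
        raise ValueError("target sits above viewer in the chain")
    if depth == 0:
        return "base optimizer"
    return "mesa-" * depth + "optimizer"


print(relative_label("agent", "sub-agent"))
print(relative_label("researcher", "sub-agent"))
```

The same sub-agent gets a different label depending on who is looking at it, which is the sense in which three levels of optimizer are implicitly in play.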

In other words, humans live in this hierarchy too, and we should analyze ourselves in the same terms — and using the same language — as we'd use to analyze any other optimizer. (I do, for what it's worth, make this point in my earlier post — though perhaps not clearly enough.)

Incidentally, this is one of the reasons I consider the concepts of inner alignment and mesa-optimization to be so compelling. When a conceptual tool we use to look inside our machines can be turned outward and aimed back at ourselves, that's a promising sign that it may be pointing to something fundamental.

A final caveat: there may well be a big conceptual piece that I'm missing here, or a deep confusion that I have around one or more of these concepts that I'm still unaware of. But I wanted to lay out my thinking as clearly as I could, to make it as easy as possible for folks to point out any mistakes — would enormously appreciate any corrections!

Comment by Edouard Harris on Biextensional Equivalence · 2020-11-11T18:12:29.502Z · LW · GW

Really interesting!

I think there might be a minor typo in Section 2.2:

For transitivity, assume that for 

I think this should be  based on the indexing in the rest of the paragraph.

Comment by Edouard Harris on Defining capability and alignment in gradient descent · 2020-11-09T18:46:54.351Z · LW · GW

Thanks for the kind words, Adam! I'll follow up over DM about early drafts — I'm interested in getting feedback that's as broad as possible and really appreciate the kind offer here.

Typo is fixed — thanks for pointing it out!

At first I wondered why you were taking the sum instead of just , but after thinking about it, the latter would probably converge to 0 almost all the time, because even with amazing optimization, the loss will stop being improved by a factor linear in T at some point. That might be interesting to put in the post itself.

Yes, the problem with that definition would indeed be that if your optimizer converges to some limiting loss function value like , then you'd get  for any .
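
A small numerical sketch of this point. It assumes the alternative definition has the form "total loss improvement over T steps, divided by T", which is my reading of the suggestion rather than a formula confirmed in the thread; the loss curve and its parameters are hypothetical.

```python
def loss(t, l_star=1.0, l_0=5.0, decay=0.9):
    """Hypothetical loss curve that converges geometrically to the limiting value l_star."""
    return l_star + (l_0 - l_star) * decay ** t


def average_improvement(T):
    """Assumed alternative definition: total improvement over T steps, divided by T."""
    return (loss(0) - loss(T)) / T


for T in (10, 100, 1000, 10000):
    print(T, average_improvement(T))
```

Because the total improvement is bounded by the gap between the initial and limiting loss, dividing by T drives the quantity toward zero regardless of what the limiting value is.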

Thanks again!

Comment by Edouard Harris on Defining capability and alignment in gradient descent · 2020-11-06T19:47:17.647Z · LW · GW

Thanks for the comment!

Not sure if I agree with your interpretation of the "real objective" - might be better served by looking for stable equilibria and just calling them as such.

I think this is a reasonable objection. I don't make this very clear in the post, but the "true objective" I've written down in the example indeed isn't unique: like any measure of utility or loss, it's only unique up to affine transformations with positive coefficients. And that could definitely damage the usefulness of these definitions, since it means that alignment factors, for example, aren't uniquely defined either. (I'll be doing a few experiments soon to investigate this, and a few other questions, in a couple of real systems.)

Don't we already have weak alignment to arbitrary functions using annealing (basically, jump at random, but jump around more/further on average when the loss is higher and lower the jumping rate over time)? The reason we don't add small annealing terms to gradient descent is entirely because we expect them to be worse in the short term (a "strong alignment" question).

Interesting question! To try to interpret in light of the definitions I'm proposing: adding annealing changes the true objective (or mesa-objective) of the optimizer, which is no longer solely trying to minimize its gradients — it now has this new annealing term that it's also trying to optimize for. Whether this improves alignment or not depends on the effect annealing has on 1) the long-term performance of the mesa-optimizer on its new (gradient + annealing) objective; and 2) the long-term performance this induces on the base objective.
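
To make the "gradient + annealing" update concrete, here is a minimal sketch, assuming a simple noise-injection scheme: a plain gradient step plus a random jump whose scale grows with the current loss and shrinks over time. This is a simplified stand-in rather than classical simulated annealing (there is no accept/reject step), and the objective, function names, and hyperparameters are all hypothetical.

```python
import random


def loss(x):
    """Hypothetical 1-D base objective with its minimum at x = 2."""
    return (x - 2.0) ** 2


def grad(x):
    """Gradient of the hypothetical base objective."""
    return 2.0 * (x - 2.0)


def annealed_gradient_descent(x0, steps=500, lr=0.05, noise0=1.0, cooling=0.99, seed=0):
    """Gradient descent with an added annealing term.

    The jump scale is larger when the loss is higher and decays
    geometrically with the step count.
    """
    rng = random.Random(seed)
    x = x0
    for t in range(steps):
        scale = noise0 * (cooling ** t) * loss(x) ** 0.5
        x = x - lr * grad(x) + lr * rng.gauss(0.0, scale)
    return x


x_final = annealed_gradient_descent(x0=10.0)
print(loss(10.0), loss(x_final))
```

In this sketch the optimizer's update rule is no longer the pure gradient step, which is the sense in which the annealing term changes what the optimizer itself is doing; whether that helps on the base objective is the separate question raised in point 2 above.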


Hope that's somewhat helpful, but please let me know if it's unclear and I can try to unpack things a bit more!

Comment by Edouard Harris on Introduction to Cartesian Frames · 2020-10-23T02:53:48.076Z · LW · GW

Great framework - feels like this is touching on something fundamental.

I'm curious: is the controllable / observable terminology intentionally borrowed from control theory? Or is that a coincidence?