There are all kinds of benefits to acting with good faith, and people should not feel licensed to abandon good faith dialogue just because they're SUPER confident and this issue is REALLY IMPORTANT.
When something is really serious it becomes even more important to do boring +EV things like "remember that you can be wrong sometimes" and "don't take people's quotes out of context, misrepresent their position, and run smear campaigns on them; and definitely don't make that your primary contribution to the conversation".
Like, for Connor & people who support him (not saying this is you Ben): don't you think it's a little bit suspicious that you ended up in a place where you concluded that the very best use of your time in helping with AI risk was tweet-dunking and infighting among the AI safety community?
I don't expect most people to agree with that point, but I do believe it. It ends up depending on a lot of premises, so expanding on my view there in full would be a whole post of its own. But to try to give a short version:
There are a lot of specific reasons I think having people working in AI capabilities is so strongly +EV. But I don't expect people to agree with those specific views. The reason I think it's obvious is that even when I make massive concessions to the anti-capabilities people, these organizations... still seem +EV? Let's make a bunch of concessions:
1. Alignment will be solved by theoretical work unrelated to capabilities. It can be done just as well at an alignment-only organization with limited funding as it can at a major AGI org with far more funding.
2. If alignment is solved, that automatically means future ASI will be built using this alignment technique, regardless of whether leading AI orgs actually care about alignment at all. You just publish a paper saying "alignment solution, pls use this Meta" and Meta will definitely do it.
3. Alignment will take a significant amount of time - probably decades.
4. ASI is now imminent; these orgs have reduced timelines to ASI by 1-5 years.
5. Our best chance of survival is a total stop, which none of the CEOs of these orgs support.
Even given all five of these premises... Demis Hassabis, Dario Amodei, and Sam Altman have all increased the chance of a total stop, by a lot. By more than almost anyone else on the planet, in fact. Yes, even though they don't think it's a good idea right now and have said as much (I think? haven't followed all of their statements on AI pause).
That is, the chance of a total stop is clearly higher in this world than in the counterfactual one where any of Demis/Dario/Sam didn't go into AI capabilities, because a CEO of a leading AI organization saying "yeah I think AI could maybe kill us all" is something that by default would not happen. As I said before, most people in the field of AI don't take AI risk seriously; this was even more true back when they first entered the field. The default scenario is one where people at NVIDIA and Google Brain and Meta are reassuring the public that AI risk isn't real.
So in other words, they are still increasing our chances of survival, even under that incredibly uncharitable set of assumptions.
Of course, you could cook these assumptions even more in order to make them -EV - if you think that a total stop isn't feasible, but still believe all of the other four premises, then they're -EV. Or you could say "yeah, we need a total stop now, because they've advanced timelines, but if these orgs didn't exist then we totally would have solved alignment before Meta made a big transformer model and trained it on a lot of text; so even though they've raised the chances of a total stop they're still a net negative." Or you could say "the real counterfactual about Sam Altman isn't if he didn't enter the field. The real counterfactual is the one where he totally agreed with all of my incredibly specific views and acted based on those."
I.e. if you're looking for excuses to be allowed to believe that these orgs are bad, you'll find them. But that's always the case. Under real worldviews - even under Connor's worldview, where he thinks a total stop is both plausible and necessary - OAI/DM/Anthropic are all helping with AI risk. Which means that the conclusion that they're beneficial is incredibly robust, because again, I think many of the assumptions I outlined above are false & incredibly uncharitable to AGI orgs.
Yeah, fair enough.
But I don't think that would be a sensible position. The correct counterfactual is in fact the one where Google Brain, Meta, and NVIDIA led the field. Like, if DM + OpenAI + Anthropic didn't exist - something he has publicly wished for - that is in fact the most likely situation we would find. We certainly wouldn't find CEOs who advocate for a total stop on AI.
(Ninth, I am aware of the irony of calling for more civil discourse in a highly inflammatory comment. Mea culpa)
I believe you're wrong in your model of AI risk, and you have abandoned the niceness/civilization norms that protect you from the downsides of having false beliefs and help you navigate your way out of them. When people explain why they disagree with you, you accuse them of lying for personal gain rather than reflecting on their arguments deeply enough to find your way out of the hole you're in.
First, this is a minor point where you're wrong, but it's also a sufficiently obvious point that it should hopefully make clear how wrong your world model is: the AI safety community in general, and DeepMind + Anthropic + OpenAI in particular, have all made your job FAR easier. This should be extremely obvious upon reflection, so I'd like you to ask yourself how on earth you ever thought otherwise. CEOs of leading AI companies publicly acknowledging AI risk has been absolutely massive for public awareness of AI risk and its credibility. You regularly bring up the fact that CEOs of leading AI companies acknowledge AI risk as a talking point, so I'd hope that on some level you're aware that your success in public advocacy would be massively reduced in the counterfactual case where the leading AI orgs were Google Brain, Meta, and NVIDIA, and their leaders were saying "AI risk? Sounds like sci-fi nonsense!"
The fact that people disagree with your preferred method of reducing AI risk does not mean that they are EVIL LIARS who are MAKING YOUR JOB HARDER and DOOMING US ALL.
Second, the reason that a total stop is portrayed as an extreme position is because it is. You can think a total stop is correct while acknowledging that it is obviously an extreme course of action that would require TREMENDOUS international co-ordination and would have to last across multiple different governments. You would need both Republicans and Democrats in America behind it, because both will be in power across the duration of your indefinite stop, and ditto for the leadership of every other country. It would require military action to be taken against people who violate the agreement. This total stop would not just impact AI, because you would need insanely strong regulations on compute - it would impact everyone's day to day life. The level of compute you'd have to restrict would only escalate as time went on due to Moore's law. And you and others talk about carrying this on for decades. This is an incredibly extreme position that requires pretty much everyone in the world to agree AI risk is both real and imminent, which they don't. Leading to...
Third: most people - both AI researchers and the general public - are not seriously concerned about AI risk. No, I don't believe your handful of sketchy polls. On the research side, whether it's on the machine learning subreddit, on ML specific discords, or within Yoshua Bengio's own research organization[1], the consensus in any area that isn't specifically selected for worrying about AI risk is always that it's not a serious concern. And on the public side, hopefully everyone realizes that awareness & agreement on AI risk is far below where climate change is.
Your advocacy regularly assumes that there is a broad consensus among both researchers and the public that AI risk is a serious concern. Which makes sense because this is the only way you can think a total stop is at all plausible. But bad news: there is nowhere close to such a consensus. And if you think developing one is important, you should wake up every morning & end every day praising Sam Altman, Dario Amodei, and Demis Hassabis for raising the profile of AI risk to such an extent; but instead you attack them, out of a misguided belief that somehow, if not for them, AI progress wouldn't happen.
Which leads us to number four: No, you can't get a total stop on AI progress through individual withdrawal. You and others in the stop AI movement regularly use the premise that if only OpenAI + Anthropic + DeepMind would just stop, AI would never get developed and we could all live happily ever after, so therefore they are KILLING US ALL.
This is false. Actually, there are many people and organizations that do not believe AI risk is a serious concern and only see AI as a technology with massive potential economic benefits; as long as this is the case, AI progress will continue. This is not a prisoner's dilemma where if only all the people worried about AI risk would "co-operate" (by ceasing AI work) AI would stop. Even if they all stopped tomorrow, progress would continue.
If you want to say they should stop anyway because that would slow timelines, I would like to point out that that is completely different from a total stop and cannot be justified by praising the virtues of a total stop. Moreover, it has the absolutely massive drawback that now AI is getting built by a group of people who were selected for not caring about AI risk.
Advocating for individual withdrawal by talking about how good a total, globally agreed upon stop would be is deceptive - or, if I wanted to use your phrasing, I could say that doing so is LYING, presumably FOR PERSONAL GAIN and you're going to GET US ALL KILLED you EVIL PERSON. Or I guess I could just not do all that and just explain why I disagree with you - I wonder which method is better?
Fifth, you can't get a total stop on AI progress at all, and that's why no one will advocate for one. This follows from points two, three, and four. Even if somehow everyone agreed that AI risk was a serious issue, a total stop would still not happen, in the same way that believing in climate change did not cause us to abandon gasoline.
Sixth, if you want to advocate for a total stop, that's your prerogative, but you don't get to choose that that's the only way. In theory there is nothing wrong with advocating for a total stop even though it is completely doomed. After all, nothing will come of it and maybe you'll raise awareness of AI risk while you're doing it.
The problem is that you are dead set on torching other alignment plans to the ground, all for the sake of your unworkable idea. Obviously you go after AI capabilities people all the time, but here you are also going after people who simply advocate for positions less stringent than yours. Everyone needs to fall in line and advocate for your particular course of action that will never happen, and if they don't, they are liars who are going to kill us all. This is where your abandonment of normal conversational norms makes your wrong beliefs actively harmful.
Leading to point number seven, we should talk about AI risk without constantly accusing each other of killing us all. What? But if I believe Connor's actions are bad for AI risk surely that means I should be honest and say he's killing us all, right? No, the same conversational norms that work for discussing a tax reform apply just as much here. You're more likely to get a good tax reform if you talk it out in a civil manner, and the same goes for AI risk. I reject the idea that being hysterical and making drastic accusations actually helps things, I reject the idea that the long term thinking and planning that works best for literally every other issue suddenly has to be abandoned in AI risk because the stakes are so high, I reject the idea that the only possible solution is paralysis.
Eighth, yes, working in AI capabilities is absolutely a reasonable alignment plan that raises odds of success immensely. I know, you're so overconfident on this point that even reading this will trigger you to dismiss my comment. And yet it's still true - and what's more, obviously so. I don't know how you and others egged each other into the position that it doesn't matter whether the people working on AI care about AI risk, but it's insane.
[1] From a recent interview:

D’Agostino: How did your colleagues at Mila react to your reckoning about your life’s work?

Bengio: The most frequent reaction here at Mila was from people who were mostly worried about the current harms of AI—issues related to discrimination and human rights. They were afraid that talking about these future, science-fiction-sounding risks would detract from the discussion of the injustice that is going on—the concentration of power and the lack of diversity and of voice for minorities or people in other countries that are on the receiving end of whatever we do.
This post is fun but I think it's worth pointing out that basically nothing in it is true.
-"Clown attacks" are not a common or particularly effective form of persuasion
-They are certainly not a zero day exploit; having a low-status person say X in order to make people disbelieve X is a tactic that has been available to humans for our entire evolutionary history
-Zero day exploits in general are not a thing you have to worry about; it isn't an analogy that applies to humans because we're far more robust than software. A zero day exploit on an operating system can give you total control of it; a 'zero day exploit' like junk food can make you consume 5% more calories per day than you otherwise would.
-AI companies have not devoted significant effort to human thought steering, unless you mean "try to drive engagement on a social media website"; they are too busy working on AI.
-AI companies are not going to try to weaponize "human thought steering" against AI safety
-Reading the sequences wouldn't protect you from mind control if it did exist
-Attempts at manipulation certainly do exist but it will mostly be mass manipulation aimed at driving engagement and selling you things based off of your browser history, rather than a nefarious actor targeting AI safety in particular
Seems like we mostly agree and our difference comes down to timelines. I agree the effect is more of a long-term one, although I wouldn't say decades. OpenAI was founded in 2015 and raised the profile of AI risk in 2022, so in the counterfactual case where Sam Altman was dissuaded from founding OpenAI due to timeline concerns, AI risk would have had much lower public credibility less than a decade later.
Public recognition as a researcher does seem to favour longer periods of time, though: the biggest names are all people who've been in the field for multiple decades, so you have a point there.
I think we're talking past each other a bit. I'm saying that people sympathetic to AI risk will be discouraged from publishing AI capability work, and publishing AI capability work is exactly why Stuart Russell and Yoshua Bengio have credibility. Because publishing AI capability work is so strongly discouraged, any new professors of AI will to some degree be selected for not caring about AI risk, which was not the case when Russell or Bengio entered the field.
The focus of the piece is on the cost of various methods taken to slow down AI timelines, with the thesis being that across a wide variety of different beliefs about the merit of slowing down AI, these costs aren't worth it. I don't think it's confused to be agnostic about the merits of slowing down AI when the tradeoffs being taken are this bad.
Views on the merit of slowing down AI will be highly variable from person to person and will depend on a lot of extremely difficult and debatable premises that are nevertheless easy to have an opinion on. There is a place for debating all of those various premises and trying to nail down what exactly the benefit is of slowing down AI; but there is also a place for saying "hey, stop getting pulled in by that bike-shed and notice how these tradeoffs being taken are not worth it given pretty much any view on the benefit of slowing down AI".
> I think you’re confused about the perspective that you’re trying to argue against. Lots of people are very confident, including “when pressed”, that we’d probably be in a much better place right now if the big AGI labs (especially OpenAI) had never been founded. You can disagree, but you shouldn’t put words in people’s mouths.
I was speaking from experience, having seen this dynamic play out multiple times. But yes, I'm aware that others are extremely confident in all kinds of specific and shaky premises.
> I just think it’s extraordinarily important to be doing things on a case-by-case basis here. Like, let’s say I want to work at OpenAI, with the idea that I’m going to advocate for safety-promoting causes, and take actions that are minimally bad for timelines.
Notice that this is phrasing AI safety and AI timelines as two equal concerns that are worth trading off against each other. I don't think they are equal, and I think most people would have far better impact if they completely struck "I'm worried this will advance timelines" from their thinking and instead focused solely on "how can I make AI risk better".
I considered talking about why I think this is the case psychologically, but for the piece I felt it was more productive to focus on the object level arguments for why the tradeoffs people are making are bad. But to go into the psychological component a bit:
-Loss aversion: The fear of making AI risk worse is greater than the joy of making it better.
-Status quo bias: Doing something, especially something like working on AI capabilities, is seen as giving you responsibility for the problem. We see this with rhetoric against AGI labs - many in the alignment community will level terrible accusations against them, all while having to admit when pressed that it is plausible they are making AI risk better.
-Fear undermining probability estimates: I don't know if there's a catchy phrase for this one but I think it's real. The impacts of any actions you take will be very muddy, indirect, and uncertain, because this is a big, long term problem. When you are afraid, this makes you view uncertain positive impacts with suspicion and makes you see uncertain negative impacts as more likely. So people doubt tenuous contributions to AI safety like "AI capability researchers worried about AI risk lend credibility to the problem, thereby making AI risk better", but view tenuous contributions to AI risk like "you publish a capabilities paper, thereby speeding up timelines, making AI risk worse" as plausible.
Capabilities withdrawal doesn't just mean fewer people worried about AI risk at top labs - it also means fewer Stuart Russells and Yoshua Bengios. In fact, publishing AI research is seen as worse than working in a lab that keeps its work private.
My specific view:
- OpenAI's approach seems most promising to me
- Alignment work will look a lot like regular AI work; it is unlikely that someone trying to theorize about how to solve alignment, separate from any particular AI system that they are trying to align, will see success.
- Takeoff speed is more important than timelines. The ideal scenario is one where compute is the bottleneck and we figure out how to build AGI well before we have enough compute to build it, because this allows us to experiment with subhuman AGI systems.
- Slow takeoff is pretty likely. I think we'll need a lot more compute than we have before we can train human-level AGI.
- I don't think alignment will be that hard. An LLM trained to be agentic can easily be trained to be corrigible.
- Because I don't think alignment will be that hard, I think a lot of AI risk involves ASI being built without proper precautions. I think if teams are taking the proper precautions it will probably be fine. This is one more reason why I think individual withdrawal is such a terrible idea.
- As an aside, my subjective impression is that Yann LeCun is correct when he says that worry about AI risk is a minority position among AI researchers. I think a lot of pushes for individual withdrawal implicitly assume that most AI researchers are worried about AI risk.
With that said, I'm not super confident in anything I said above. My lack of confidence goes both ways; maybe AI alignment isn't a real problem and Yann LeCun is right, or maybe AI alignment is a much more difficult problem than I think and theoretical work independent from day-to-day AI research is necessary. That's why on a meta level I think people should pursue different approaches that seem promising to them and contribute how they think they personally are suited to contribute. For some people that will be public advocacy, for some that will be theorizing about alignment, for some that will be working at AI labs. But working in AI capabilities absolutely seems strongly +EV to me, and so do all of the major AI labs (DM + OAI + Anthropic). Even giving a small likelihood to the contrary view (timelines are very fast, alignment is hard and unsolved) doesn't change that - if anything, if I thought ASI was imminent I would be even more glad that the leading labs were all concerned about AI risk.
Is the overall karma for this mostly just people boosting it for visibility? Because I don't see how this would be a quality comment by any other standards.
Frontpage comment guidelines:
- Maybe try reading the post
So if I'm understanding things correctly, capabilities and safety are highly correlated, and there can't be situations where capabilities and alignment decouple.
Not that far, more like it doesn't decouple until more progress has been made. Pure alignment is an advanced subtopic of AI research that requires more progress to have been made before it's a viable field.
I'm not super confident in the above and wouldn't discourage people from doing alignment work now (plus the obvious nuance that it's not one big lump, there are some things that can be done later and some that can be done earlier) but the idea of alignment work that requires a whole bunch of work in serial, independent of AI capability work, doesn't seem plausible to me. From Nate Soares' post:
The most blatant case of alignment work that seems serial to me is work that requires having a theoretical understanding of minds/optimization/whatever, or work that requires having just the right concepts for thinking about minds.
This is the kind of thing that seems inextricably bound up with capability work to me. My impression is that MIRI tends to think that whatever route we take to get to AGI, as it moves from subhuman to human-level intelligence it will transform to be like the minds that they theorize about (and they think this will happen before it goes foom) no matter how different it was when it started. So even if they don't know what a state of the art RL agent will look like five years from now, they feel confident they can theorize about what it will look like ten years from now. Whereas my view is that if you can't get the former right you won't get the latter right either.
To the extent that intelligences will converge towards a certain optimal way of thinking as they get smarter, being able to predict what that looks like will involve a lot of capability work ("Hmm, maybe it will learn like this; let's code up an agent that learns that way and see how it does"). If you're not grounding your work in concrete experiments you will end up with mistakes in your view of what an optimal agent looks like and no way to fix them.
A big part of my view is that we seem to still be a long way from AGI. This hinges on how "real" the intelligence behind LLMs is. If we have to take the RL route then we are a long way away - I wrote a piece on this, "What Happened to AIs Learning Games from Pixels?", which points out how slow the progress has been and covers the areas where the field is stuck. On the other hand, if we can get most of the way to AGI just with massive self-supervised training, then it starts seeming more likely that we'll walk into AGI without having a good understanding of what's going on. I think that the failure of VPT for Minecraft compared to GPT for language, and the difficulty LLMs have with extrapolation and innovation, means that self-supervised learning won't be enough without more insight. I'll be paying close attention to how GPT-4 and other LLMs do over the next few years to see if they're making progress faster than I thought, but I talked to ChatGPT and it was way worse than I thought it'd be.
I'm very unconfident in the following but, to sketch my intuition:
I don't really agree with the idea of serial alignment progress that is independent from capability progress. This is what I was trying to get at with
"AI capabilities" and "AI alignment" are highly related to each other, and "AI capabilities" has to come first in that alignment assumes that there is a system to align.
By analogy, nuclear fusion safety research is inextricable from nuclear fusion capability research.
When I try to think of ways to align AI my mind points towards questions like "how do we get an AI to extrapolate concepts? How will it be learning? What will its architecture be?" etc. In other words it just points towards capabilities questions. Since alignment turns on capability questions that we don't yet have an answer to, it doesn't surprise me when many alignment researchers seem to spin their wheels and turn to doom and gloom - that's more or less what I had thought would happen.
As an example of the blurred lines between capability and alignment: while I think it's useful to have specific terms for inner and outer alignment, I also think that really anyone who worked with RL in a situation where they were manually setting the reward function was aware of these ideas already on some level. "Sometimes I mess up the reward function" and "sometimes the agent isn't optimizing properly" are both issues encountered frequently. Basically while many people in the alignment community seem to think of alignment as something that is cooked up entirely separately from capability research I tend to think that a lot of it will develop naturally as part of day-to-day AI research with no specific focus on alignment.
As a thought experiment, let's say that about 20% of current AI capability researchers are very concerned about AI alignment and get together to decide what to do for the next five years. They're deciding between taking the stance "Capability work is fine right now! Go for it! Worry about alignment when we're farther along!" or "Let's get out of capability and go into alignment instead. Capability research is dangerous and burning precious time." What's the impact of adopting these two positions?
The first is roughly the default position, and I'd expect that basically what we'll see is AGI in the year 20XX, and that in the run-up to this we'll see vastly increased interest in alignment work, and also a significant blurring between "alignment" and "regular AI research", since people want their home robots to not roll over their cat. We'll also see all major AI research orgs and the AI community as a whole take existential risk from self-improving AGI a lot more seriously once modern SOTA AI systems start looking more and more like the kind of thing that could do that. Because of this, there'll be a concerted effort to handle the situation appropriately, which has a good chance of success.
Option two involves slowing down the timeline by about 5-10%. Cutting the size of a field by 20% doesn't slow progress that much, since there are diminishing returns to adding more researchers, and on top of that AI capability research is only half of what drives progress (the other half being compute). In return for this small slowdown, the AI researchers who are now going into alignment will initially spin their wheels due to the lack of anything concrete to focus on or any concrete knowledge of what future systems will look like. When AGI does start approaching, the remaining AI capability community will take it much less seriously, having been selected specifically for that trait. Three years before the arrival of transformative AGI, alignment research is further along than it otherwise would have been, but AI capability researchers have gotten used to tuning alignment researchers out, and there aren't alignment-sympathetic colleagues around to say "hey, given how things are progressing I think it's time we start taking all that AI risk stuff seriously". Prospects are worse than under option one.
So right now my intuition is that alignment will be very doable as long as it's something that the AI community is taking seriously in the few years leading up to transformative AGI. The biggest risk seems to me to be some AI researchers at one of the leading research groups thinking "man, it sure would be cool if we could use the latest coding LLM combined with RL to make an AI that could improve itself in order to accomplish a goal" and setting it running without it ever occurring to them that this could go wrong. Given this, the suggestion that everyone concerned about alignment basically cede the whole field of AI research (outside of this specific community, "AI capability research" is just called "AI research") to people who aren't worried about it seems like a bad idea.
Right, I specifically think that someone would be best served by trying to think of ways to get a SOTA result on an Atari benchmark, not simply reading up on past results (although you'd want to do that as part of your attempt). There's a huge difference between reading about what's worked in the past and trying to think of new things that could work and then trying them out to see if they do.
As I've learned more about deep learning and tried to understand the material, I've constantly had ideas that I think could improve things. Then I've tried them out, and usually learned that they didn't, or that they did but had already been done, or that it was more complicated than that, etc. But I learned a ton in the process. On the other hand, suppose I was wary of doing AI capability work, and each time I had one of these ideas, I shied away from it out of fear of advancing AGI timelines. The result would be threefold: I'd have a much worse understanding of AI, I'd be a lot more concerned about imminent AGI (after all, I'd have had tons of ideas for how things could be done better!), and I wouldn't have actually delayed AGI timelines at all.
I think a lot of people who get into AI from the alignment side are in danger of falling into this trap. As an example, in an ACX thread I saw someone thinking about doing their PhD in ML, and they were concerned that they might have to do capability research in order to get their PhD. Someone replied that if they had to, they should at least try to make sure it was nothing particularly important, in order to avoid advancing AGI timelines. I don't think this is a good idea. Spending years working on research while actively holding yourself back from really thinking deeply about AI will harm your development significantly, and early in your career is right when you benefit the most from developing your understanding and are least likely to actually move up AGI timelines.
Suppose we have a current expected AGI arrival date of 20XX. This is the result of DeepMind, Google Brain, OpenAI, FAIR, NVIDIA, universities all over the world, the Chinese government, and more all developing the state of the art. On top of that there's computational progress happening at the same time, which may well turn out to be a major bottleneck. How much would OpenAI removing themselves from this race affect the date? A small but real amount. How about a bright PhD candidate removing themselves from this race? About zero. I don't think people properly internalize both how insignificant the timeline difference is and how big the skill gains are from actually trying your hardest at something as opposed to handicapping yourself. And if you come up with something you're genuinely worried about, you can just not publish.
"AI capabilities" and "AI alignment" are highly related to each other, and "AI capabilities" has to come first in that alignment assumes that there is a system to align. I agree that for people on the cutting edge of research like OpenAI, it would be a good idea for at least some of them to start thinking deeply about alignment instead. There's two reasons for this:
1) OpenAI is actually likely to advance capabilities a pretty significant amount, and
2) Due to their expertise that they've developed from working on AI capabilities, they're much more likely to make important progress on AGI alignment than e.g. MIRI.
But I think there's something of a "reverse any advice you hear" thing going on - the people most likely to avoid working on capabilities as a result of this post are those who would actually benefit from working on AI capabilities for a while, even if they don't intend to publish their results, in order to build more expertise in AI. Capabilities is the foundation of the field, and trying to theorize about how to control an AI system without having anything but the vaguest ideas about how that system will work isn't going to get you anywhere.
For example, Eliezer is in a pessimistic doom-spiral while also being, by his own admission, pretty useless at solving alignment. If he would just take a break and try to make an AI good at Atari for six months then I think he'd find he was a lot more effective at alignment afterwards and would realize that AGI isn't as imminent as he currently believes it is. Of course, the very fact that he thinks it's imminent means he won't do this; such is life.
- A paper which does for deceptive alignment what the goal misgeneralization paper does for inner alignment, i.e. describing it in ML language and setting up toy examples (for example, telling GPT-3 to take actions which minimize changes in its weights, given that it’s being trained using actor-critic RL with a certain advantage function, and seeing if it knows how to do so).
If I'm understanding this one right, OpenAI did something similar to this for purely pragmatic reasons with VPT, a Minecraft agent. They first trained a "foundation model" to imitate humans playing. Then, since their chosen task was to craft a diamond pickaxe, they fine-tuned it with RL, giving it reward for each step along the path to a diamond pickaxe that it successfully took. There's a problem with this approach:
"A major problem when fine-tuning with RL is catastrophic forgetting because previously learned skills can be lost before their value is realized. For instance, while our VPT foundation model never exhibits the entire sequence of behaviors required to smelt iron zero-shot, it did train on examples of players smelting with furnaces. It therefore may have some latent ability to smelt iron once the many prerequisites to do so have been performed. To combat the catastrophic forgetting of latent skills such that they can continually improve exploration throughout RL fine-tuning, we add an auxiliary Kullback-Leibler (KL) divergence loss between the RL model and the frozen pretrained policy."
In other words they first trained it to imitate humans, and saved this model; after that they trained the agent to maximize reward, but used the saved model and penalized it for deviating too far from what that human-imitator model would do. Basically saying "try to maximize reward but don't get too weird about it."
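To make that concrete, here is a minimal sketch of what such an auxiliary KL penalty can look like in an RL fine-tuning loss. This is PyTorch under my own assumptions - the function name, tensor shapes, and kl_coef value are illustrative, not VPT's actual implementation or hyperparameters:

```python
import torch
import torch.nn.functional as F

def kl_regularized_policy_loss(policy_logits, frozen_logits, actions, advantages, kl_coef=0.1):
    """Policy-gradient loss plus a KL penalty toward a frozen pretrained policy.

    policy_logits: [batch, num_actions] logits from the policy being fine-tuned with RL
    frozen_logits: [batch, num_actions] logits from the frozen human-imitation model
    actions:       [batch] actions actually taken
    advantages:    [batch] advantage estimates from whatever RL algorithm is in use
    (All names and the kl_coef default are illustrative assumptions, not VPT's.)
    """
    log_probs = F.log_softmax(policy_logits, dim=-1)
    taken_log_probs = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)

    # Standard policy-gradient term: push up the probability of high-advantage actions.
    pg_loss = -(taken_log_probs * advantages).mean()

    # Auxiliary KL(frozen || current) term: penalize the fine-tuned policy for
    # drifting away from what the frozen imitation model would do.
    frozen_log_probs = F.log_softmax(frozen_logits, dim=-1).detach()
    kl_penalty = F.kl_div(log_probs, frozen_log_probs, log_target=True, reduction="batchmean")

    return pg_loss + kl_coef * kl_penalty
```

The kl_coef weight is the knob that decides how strongly "don't get too weird about it" trades off against reward maximization; set it to zero and you recover plain RL fine-tuning, with the catastrophic forgetting the quote describes.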