Comments
I think there are less cautious plans for containment that are more likely to be enacted, e.g., the whole "control" line of work or related network security approaches. The slow substrate plan seems to have far too high an alignment tax to be a realistic option.
On the wait-and-see attitude, which is maybe the more important part of your point:
I agree that a lot of people are taking a wait-and-see-what-we-actually-create stance. I don't think that's a good idea with something this important. I think we should be doing our damnedest to predict what it will be like and what should be done about it while there's still a little spare time. Many of those predictions will be wrong, but they will at least produce some analysis of what the sane actions are in different scenarios of the type of AGI we create. And as we get closer, some of the predictions might be right enough to actually help with alignment plans. I share Max's conviction that we have a pretty good guess at the form the first AGIs will take on the current trajectory.
There is certainly merit to what you say. I don't want to go into it further; LW maintains a good community standard of cordial exchanges in part by believing that "politics is the mind-killer".
I wasn't arguing that US foreign policy is better for the world now. I was just offering one reason it might be better in a future scenario in which one or the other has powerful AGI. It could easily be wrong; I think this question deserves a lot more analysis.
If you have reasons to feel optimistic about the CCP being either the sole AGI-controller or one of several in a multipolar scenario, I'd really love to hear them. I don't know much about the mindset of the CCP, and I'd really rather not push for a race to AGI if it's not really necessary.
I agree that software is a potential use-case for closed form proofs.
I thought their example of making a protein-creating system (or maybe it was an RNA creator) fully safe was interesting, because it seems like knowing what's "toxic" would require a complete understanding of not only biology, but a complete understanding of which changes to the body humans do and don't want. If even their chosen example seems utterly impossible, it doesn't speak well for how thoroughly they've thought out the general proposal.
But yes, in the software domain it might actually work - conditions like "only entities with access to these keys should be allowed access to this system" seem simple enough to actually define to make closed-form proofs relevant. And software security might make the world substantially safer in a multipolar scenario (although we shouldn't forget that physical attacks are also quite possible).
I don't think that's what Bogdan meant. I think if we took a referendum on AI replacing humans entirely, the population would be 99.99% against - far higher than any majority that might've voted against the industrial revolution (and actually I suspect that referendum might've been in favor, since job loss only affected a minority of the population at any one point, I think).
Even the e/acc people accused of wanting to replace humanity with machines mostly don't want that, when they're read in detail. I did this with the writings of "Beff Jezos", since he's commonly accused of being anti-human. He's really not - he thinks humans will be preserved, or else machines will carry on human values. There are definitely a few people who actually think intelligence is the most important thing to preserve (Sutton), but they're very rare compared to those who want humans to persist. Most of those like Jezos who say it's fine to be replaced by machines are still thinking those machines would be a lot like humans, including having a lot of our values. And even those are quite rare. For the most part, e/acc, d/acc, and doomers all share a love of humanity and its positive potential. We just disagree on how to get there. And given how new and complex this discussion is, I hold hope that we can mostly converge as we sort through the complex logic and evidence.
This sounds true, and that's disturbing.
I don't think it's relevant yet, since there aren't really strong arguments that you couldn't deploy an AGI and keep it under your control, let alone arguments that it would escape control even in the lab.
I think we can improve the situation somewhat by trying to refine the arguments for and against alignment success with the types of AGI people are working on, before they have reached really dangerous capabilities.
Thanks! I also think o1 will give other groups plenty of reason to pursue more chain-of-thought research. Agents are more than that, but they depend on CoT actually being useful. o1 shows that it can be made quite useful.
I am currently working on clearer and more comprehensive summaries of the suite of alignment plans I endorse - which curiously are the same ones I think orgs will actually employ (this is in part because I don't spend time thinking about what we "should" do if it has large alignment taxes or is otherwise unlikely to actually be employed).
I agree that we're unlikely to get any proofs. I don't think any industry making safety cases has ever claimed a proof of safety, only sufficient arguments, calculations, and evidence. I do hope we get it down to a 0.1% risk, but I think we'll probably take much larger risks than that. Humans are impulsive and cognitively limited, and typically neither utilitarian nor longtermist.
Good point about formal proofs forcing us to be explicit about assumptions. I am highly suspicious of formal proofs after noticing that every single time I dug into the background assumptions, they violated some pretty obvious conditions of the problem they were purportedly addressing. They seem to really draw people to "searching under the light".
It is worth noting that Omohundro & Tegmark's recent high-profile paper really only endorsed provably safe systems that were not AGI, but to be used by AGIs (their example was a gene sequencer that would refuse to produce harmful compounds). I think even that is probably unworkable. And applying closed-form proofs to the behavior of an AGI seems impossible to me. But I'm not an expert, and I'd like to see someone at least try — as you say, it would at least clarify assumptions.
I think the general idea is that the US is currently a functioning democracy, while China is not. I think if this continued to be true, it would be a strong reason to prefer AGI in the hands of the US vs Chinese governments. I think this is true despite agreeing that the US is more violent and reckless than China (in some ways - the persecution of the Uyghur people by the Chinese government is a different sort of violence than any recent US government acts).
If the government is truly accountable to the people, public opinion will play a large role in deciding how AGI is used. Benefits would accrue first to the US, but new technologies developed by AGI can be shared without cost. Once we have largely autonomous and self-replicating factories, the whole world could be made vastly richer at almost zero cost. This will be a pragmatic as well as an idealistic move; making more of the world your friend at low cost is good politics as well as good ethics.
However, it seems pretty questionable whether the US will remain a true democracy. The upcoming election should give us more insight into the resilience and stability of the institution.
I applaud this post. I agree with most of the claims here. We need more people proposing and thinking through sane plans like this, so I hope the community will engage with this one.
Aschenbrenner, Amodei and others are pushing for this plan because they think we will be able to align or control superhuman AGI. And they may very well be right. There are decent plans for aligning scaffolded LLM agents, they just haven't been broadly discussed yet. Links follow.
This unfortunately complicates the issue. It's not clearly a suicide race. I think we have to accept this uncertainty to propose workable policy and societal approaches. Having plans that might work does not justify rushing full-speed ahead without a full safety case, but it must be acknowledged if true, because hope of human-controlled AGI will drive a lot of relevant actors.
... it is now rather obvious that the first AGI will not be a pure LLM, but a hybrid scaffolded system.
I agree that this is pretty likely. I also very much agree that LLM "alignment" is not addressing AGI alignment, and that full goal alignment is the important problem.
If you disagree with my assertion, I challenge you to cite or openly publish an actual plan for aligning or controlling a hybrid AGI system.
I have done this, but I haven't yet promoted it widely.
I am not affiliated with any major org, but I have relevant expertise and generous funding to spend full-time on "solving the whole problem".
Far from being crazy or elaborate, this set of plans seems to both have very low alignment taxes, and to be fairly obvious-in-retrospect when one thinks about the problem in detail. As such, I expect most orgs to arrive at and use a similar approach to aligning their scaffolded LLM agent proto-AGIs.
To address Max's reply to Noosphere's "bitter lesson" alignment plan (which I endorse as one of several layered approaches): having such a plan does not constitute a safety case; it is just the start of one. In a sane world we would prohibit the launch of anything like "Real AGI" (autonomous and self-teaching) until these approaches have been carefully analyzed and tested in closed environments.
There are two elements: a suite of technical approaches, and the alignment target that's much easier to hit than "all of humanity's values for all of time".
My most complete writeup so far is:
Internal independent review for language model agent alignment
This describes an overlapping suite of alignment and control approaches. The two central bits are easily stated. The first is to use "system 2" thinking for alignment as well as capabilities. We would hard-code the system to carefully "think about" outcomes before taking actions with major consequences, and compare likely outcomes against both its current goals and a constitution of ethics.
The second is to frequently prompt the LLM "cognitive engine" with both its current goal and its identity as a helpful, cautious assistant. Because the LLM has been trained with RL to (roughly) follow prompts, this should overpower the effects of any goals implicit in its predictive training corpus.
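To make the shape of this concrete, here's a minimal sketch of that loop, assuming a generic LLM-agent scaffold. The function names, prompts, and escalation step are hypothetical illustrations, not the implementation from the linked post.

```python
# Minimal sketch of the two central bits described above: a hard-coded
# "system 2" check before consequential actions, and re-prompting with the
# current goal and identity. All names and prompts here are hypothetical.

CONSTITUTION = "Be honest; avoid harm; defer to human oversight on major changes."
IDENTITY = "You are a helpful, cautious assistant."

def llm(prompt: str) -> str:
    """Placeholder for a call to the underlying LLM 'cognitive engine'."""
    raise NotImplementedError

def is_major_action(action: str) -> bool:
    """Placeholder heuristic for 'actions with major consequences'."""
    return True

def system2_check(action: str, current_goal: str) -> bool:
    # "Think about" likely outcomes, then compare them against the current
    # goal and the constitution before allowing the action.
    outcomes = llm(f"{IDENTITY}\nPredict the likely outcomes of: {action}")
    verdict = llm(
        f"{IDENTITY}\nGoal: {current_goal}\nConstitution: {CONSTITUTION}\n"
        f"Predicted outcomes: {outcomes}\n"
        "Do these outcomes serve the goal without violating the constitution? Answer yes or no."
    )
    return verdict.strip().lower().startswith("yes")

def agent_step(current_goal: str, observation: str) -> str:
    # Frequent re-prompting with the goal and identity, so instruction-trained
    # behavior dominates goals implicit in the predictive training corpus.
    action = llm(
        f"{IDENTITY}\nYour current goal: {current_goal}\n"
        f"Observation: {observation}\nNext action:"
    )
    if is_major_action(action) and not system2_check(action, current_goal):
        return "escalate to human overseer"
    return action
```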
Details and additional techniques are in that article.
It doesn't include the "bitter lesson" approach, but the next version will.
I apologize that it's not a better writeup. I haven't yet iterated on it or promoted it, in part because talking about how to align LLM agents in sufficient detail includes talking about how to build LLM agents, and why they're likely to get all the way to real AGI. I haven't wanted to speed up the race. I think this is increasingly irrelevant since many people and teams have the same basic ideas, so I'll be publishing more detailed and clearer writeups soon.
This set of plans, and really any technical alignment approach, will work much better if it's used to create an instruction-following AGI before that AGI has superhuman capabilities. This is the obvious alignment target for creating a subhuman agent, and it allows the approach of using that agent as a helpful collaborator in aligning future versions of itself. I discuss the advantages of using this as a stepping-stone to full value alignment in
Instruction-following AGI is easier and more likely than value aligned AGI
Interestingly, I think all of these alignment approaches are obvious-in-retrospect, and that they will probably be pursued by almost any org launching scaffolded LLM systems with the potential to blossom into human-plus AGI. I think this is already the loosely-planned approach at DeepMind, but I have to say I'm deeply concerned that neither OAI nor Anthropic has mentioned these relatively-obvious alignment approaches for scaffolded LLM agents in their vague "we'll use AI to align AI" plans.
If these approaches work, then we are faced with either a race or a multipolar, human-controlled AGI scenario, making me wonder If we solve alignment, do we die anyway? This scenario introduces new, more politically-flavored hard problems.
I currently see this as the likely default scenario, since halting progress universally is so hard, as Nathan pointed out in his reply and others have elaborated elsewhere.
I didn't mean to imply that speedboats are a perfect analogy. They're not. Maybe not even a good one.
My claim was that details matter; we can't get good p(doom) estimates without considering specific scenarios, including all of the details you mention about the motivations and regulations around use, as well as the engineering approaches, and more.
Here's an entirely separate weak argument, improving on your straw man:
AGI will be powerful. Powerful agentic things do whatever they want. People will try to make AGI do what they want. They might succeed or fail. Nobody has tried doing this before, so we have to guess what the odds of their succeeding are. We should notice that they won't get many second chances because agents want to keep doing what they want to do. And notice that humans have screwed up big projects in surprisingly dumb (in retrospect) ways.
If some people do succeed at making AGI do what they want, they might or might not want things the rest of humanity wants. So we have to estimate the odds that the types of people who will wind up in charge of AGI (not the ones that start out in charge) are "good people". Do they want things that would be better described as doom or flourishing for humanity? This matters, because they now have AGI which we agree is powerful. It may be powerful enough that a subset of power-hungry people now control the future - quite possibly all of it.
If you look at people who've held power historically, they appear to often be pretty damned selfish, and all too often downright sadistic toward their perceived enemies and sometimes toward their own subjects.
Here's my serious claim after giving this an awful lot of thought and study: you are correct about the arguments for doom being either incomplete or bad.
But the arguments for survival are equally incomplete and bad.
It's like arguing about whether humans would survive driving a speedboat at ninety miles an hour before anyone's invented the internal combustion engine. There are some decent arguments that they wouldn't, and some decent arguments that their engineering would be up to the challenge. People could've debated endlessly.
The correct answer is that humans often survive driving speedboats at ninety miles an hour, and sometimes die doing it. It depends on the quality of engineering and the wisdom of the pilot (and how their motivations are shaped by competition).
So, to make progress on actual good arguments, you have to talk about specifics: what type of AGI we'll create, who will be in charge of creating it and perhaps directing it, and exactly what strategies we'll use to ensure it does what we want and that humanity survives. (That is what my work focuses on.)
In the absence of doing detailed reasoning about specific scenarios, you're basically taking a guess. It's no wonder people have wildly varying p(doom) estimates from that method.
If it's a guess, the base rate is key. You've said you refuse to change it, but to understand the issue we have to address it. That's the primary difference between your p(doom) and mine. I've spent a ton of time trying to address specific scenarios, but my base rate is still the dominant factor in my total, because we don't have enough information yet and haven't modeled specific scenarios well enough yet.
I think we'll probably survive a decade after the advent of AGI on something like your odds, but I can't see farther than that, so my long-range p(doom) goes to around 0.5 with my base rate. (I'm counting permanent dystopias as doom.) (Incidentally, Paul Christiano's actual p(doom) is similar - modest in the near term after AGI but rising to 40% as things progress from there.)
It's tough to arrive at a base rate for something that's never happened before. There is no reference class, which is why debates about the proper reference class are endless. No species has ever designed and created a new entity smarter than itself. So I put it around 0.5 for sheer lack of evidence either way. I think people using lower base rates are succumbing to believing what feels easy and good: motivated reasoning. There simply are no relevant observations.
I did wonder if this was AI written before seeing the comment thread. It takes a lot of effort for a human to write like an AI. Upvoted for effort.
I think this also missed the mark with the LW audience because it is about AI safety, which is largely separate from AGI alignment. LW is mostly focused on the latter. Alignment addresses future systems that have their own goals, whereas AI safety addresses tool systems that don't really have their own goals. Those issues are important, but around here we tend to worry that we'll all die when a new generation of systems with goals is introduced, so we mostly focus on those. There is a lot of overlap between the two, but that's mostly in technical implementations rather than choosing desired behavior.
I should've specified - the orgs carefully train them to refuse to say certain things. I don't think they specifically train them to say things the orgs like or believe. The refusals are intentional; the bias is accidental, IMO.
And every source has bias.
So, do you want people to quit saying they googled for an answer? I'd just like them to say where they got the answer so I can judge how biased it might be.
What they're saying is "I got a semi-objective answer fast."
If they'd googled for the answer all the same concerns would apply. You'd need to know the biases of whoever wrote the web content they read to get an answer.
I doubt the orgs got much of their own bias into the RLHF/RLAIF process. There are real cultural biases from the humans doing the RLHF ratings, and from the LLM itself, via the training set and how it interpreted its constitution.
This sounds highly plausible. There are some other dangers your scenario leaves out, which I tried to explore in If we solve alignment, do we die anyway?
If it's not serving them, it's pathological by definition, right?
So obsessing about exactly those circumstances and types of people could be pathological if it's done more than is needed to protect them in the future, weighing in the emotional cost of all that obsessing.
Of course we can't just stop patterns of thought as soon as we decide they're pathological. But deciding it doesn't serve me so I want to change it is a start.
Yes, it's proportional to the way it affected them - but most of the effect is in the repetition of thoughts about the incident and fear of future similar experiences. Obsessing about unpleasant events is natural, but it often seems pretty harmful itself.
Trauma is a horrible thing. There's a delicate balance between supporting someone's right and tendency to obsess over their trauma while also supporting their ability to quit re-traumatizing themselves by simulating their traumatic event repeatedly.
Your post on journalists is, as I suspected, a lot better.
I bounced off the post about mapping discussions because it didn't make clear what potential it might have to be useful to me. The post on journalists drew me in and quickly made clear what its use would be: informing me of how journalists use conversations with them.
The implied claims that theory should be worth less than building or that time on task equals usefulness are both wrong. We are collectively very confused, so running around building stuff before getting less confused isn't always the best use of time.
I think it's important to consider hacking in any safety efforts. These hacks would probably include stealing and using any safety methods for control or alignment, for the same reasons the originating org was using them - they don't want to lose control of their AGI. Better make those techniques and their code public, and publicly advertise why you're using them!
Of course, we'd worry that some actors (North Korea, Russia, individuals who are skilled hackers) are highly misaligned with the remainder of humanity, and might bring about existential catastrophes through some combination of foolishness and selfishness.
The other concern is mere proliferation of aligned/controlled systems, which leads to existential danger as soon as those systems approach the capability for autonomous recursive self-improvement: If we solve alignment, do we die anyway?
I don't see how that would work technically. It seems like any small set of activating tokens would be stolen along with the weights, and I don't see how to train it for a large shifting set.
I'm not saying this is impossible, just that I'm not sure it is. Can you flesh this idea out any further?
There are a lot of detailed arguments for why alignment is going to be more or less difficult. Understanding all of those arguments, starting with the most respected, is a curriculum. Just pulling a number out of your own limited perspective is a whole different thing.
After reading all the comments threads, I think there's some framing that hasn't been analyzed adequately:
Why would humans be testing AGIs this way if they have the resources to create a simulation that will fool a superintelligence?
Also, the risk of humanity being wiped out seems different and worse while that ASI is attempting a takeover - during that time the humans are probably an actual threat.
Finally, leaving humans around would seem to pose a nontrivial risk that they'll eventually spawn a new ASI that could threaten the original.
The Dyson sphere is just a tiny part of the universe so using that as the fractional cost seems wrong. Other considerations in both directions would seem to dominate it.
It's the first, there's a lot of uncertainty. I don't think anyone is lying deliberately, although everyone's beliefs tend to follow what they think will produce good outcomes. This is called motivated reasoning.
I don't think this changes the situation much, except to make it harder to coordinate. Rushing full speed ahead while we don't even know the dangers is pretty dumb. But some people really believe the dangers are small, so they're going to rush ahead. There aren't strong arguments or a strong consensus for the danger being extremely high, even though looking at the opinions of the most thorough thinkers puts risks in the alarmingly high, 50%-plus range.
Add to this disagreement the fact that most people are neither longtermist nor utilitarian; they'd like a chance to get rich and live forever even if it risks humanity's future.
I guess I have no idea what you mean by "consciousness" in this context. I expect consciousness to be fully explained and still real. Ah, consciousness. I'm going to mostly save the topic for if we survive AGI and have plenty of spare time to clarify our terminology and work through all of the many meanings of the word.
Edit - or of course if something else was meant by consciousness, I expect a full explanation to indicate that thing isn't real at all.
I'm an eliminativist or a realist depending on exactly what is meant. People seem to be all over the place on what they mean by the word.
Consciousness is not at all epiphenomenal, it's just not the whole mind and not doing everything. We don't have full control over our behavior, but we have a lot. While the output bandwidth is low, it can be applied to the most important things.
I think this is dependent on reading strategy, which is dependent on cognitive style. Someone who skims a lot is frequently making active decisions about what to read while reading, so they're skilled at this and not bothered by footnotes. I love footnotes. This style may be more characteristic of a fast-attention cognitive style (and ADHD-spectrum, loosely defined).
For those I like to refer to as attention surplus disorder :) who do not skim much, I can see the problem.
One strategy is to simply not read any footnotes on your first pass. Footnotes are supposed to be optional to understanding the series of ideas in the writing. Then, if you're interested enough to get further into details, you go back and read some or all of the footnotes.
I agree that we could use footnotes better by either using them one way and stating it, or providing a brief cue to how it's used in the text.
I strongly disagree that footnotes as classically used are not useful. And having any sort of hypertext improves the situation.
Footnotes are usually used to mean "here are some more thoughts/facts/claims related to those you just read before the footnote mark". Sometimes those will be in a whole different reference. After you glance at a couple, you know how this author is using them.
Appropriate use of footnotes is part of good writing. As such, it's dependent on the topic, the author, the reader, and their goals in writing/reading. And thus very much a matter of individual taste and opinion.
Endnotes of varied use, without two-way hypertext links, on the other hand, should die in a fire.
That sprang to my mind as the perfect solution to this problem.
Great reference! I found myself explaining this repeatedly but without the right terminology. The "but comparative advantage!" argument is quite common among economists trying to wrap their head around AI advances.
I think it applies for worlds with tool/narrow AI, but not with AGI that can do whole jobs for much lower wages than any human can do anything.
Feeling better than you did while in your 20s would be a powerful reward!
Excellent point.
I'd think the Gish mindset isn't limited to people like your dad. I'd think that rationalists are vulnerable to it as well in any complex domain. It's not like we're doing literal Bayesian updates or closed-form proofs for our actually important beliefs, like how hard alignment is or what methods are promising. In those areas no argument is totally closed, so weighing the preponderance of decent arguments is about all we can do. So I'd say we're all vulnerable to the Gish Fallacy to an important degree. And therefore the implicit motte-and-bailey fallacy.
Excellent post!
It looks to me like it's a matter of keeping the pressures for a faithful chain of thought larger than the pressures to create steganography or jargon/unique language of thought. Methods that penalize jargon will drive toward steganography.
I've seen enthusiasts remark that training it to do more cognition in a single forward pass is a good thing; for efficiency, it is.
As in the other comment thread, training for a short CoT drives toward jargon/unique language. But that's balanced by using an independent judge of validity for process supervision; as long as the judge is a different model, it won't understand any jargon and should judge the step as invalid. Explicitly making that part of the criteria would really help.
If I understand correctly, steganography in existing models is quite limited; it's more a matter of using phrasing as a cue for likely continuations than any real attempt to hide cognition. That's because there's no real pressure in the training process to create steganography - yet.
Which pressure wins out seems very much up in the air right now.
Those are all good points. I think the bottom line is: if your training process includes pressures to make shorter CoTs or any other pressures to create steganography or "jargon", it shouldn't.
Unfortunately, shorter CoTs save compute at this point.
One big question here is: does the process supervision OpenAI used for o1 really work as you describe, with evaluations for every step? If the process is evaluated in chunks larger than a single step, and there's a loss for longer CoTs, we have pressures that will obscure the true CoT if they're applied enough.
I'd say the bottom line is that anyone who wants to understand their agent, let alone control it, should include pressures that keep the CoT understandable, like the third-model supervision you describe (which I agree is probably how they did it).
Thinking about it this way, you could have the process evaluator penalize use of strange tokens or jargon.
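As a toy illustration of that idea (all names here are hypothetical, assuming a judge model separate from the one being trained), the per-step process reward could combine a validity score with a legibility penalty:

```python
# Toy sketch: process supervision that rewards valid, legible CoT steps.
# judge_validity and judge_legibility are assumed to be calls to a judge model
# that is *different* from the model being trained, so private jargon or
# steganographic shorthand should score poorly on both.

from typing import List

def judge_validity(step: str) -> float:
    """Placeholder: does this step follow from the previous ones? Returns 0-1."""
    raise NotImplementedError

def judge_legibility(step: str) -> float:
    """Placeholder: is this step written in plain, readable language? Returns 0-1."""
    raise NotImplementedError

def step_reward(step: str, legibility_weight: float = 0.5) -> float:
    # The legibility term penalizes strange tokens and jargon, pushing back
    # against pressures (like length penalties) that favor obscured CoT.
    return judge_validity(step) - legibility_weight * (1.0 - judge_legibility(step))

def cot_reward(steps: List[str]) -> float:
    return sum(step_reward(s) for s in steps)
```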
Thanks! And how long does that runner's high last?
Thanks! How does that runner's high feel?
Because your method of getting there sounds like hell on earth. I'd want to know what the payoff is.
If what you're saying is no set of beliefs perfectly captures reality then I agree.
If what you're saying is don't get attached to your beliefs, I agree. Even rationalists are often anchored to beliefs and preferences much more than is Bayesian optimal. Rationalism provides resistance but not immunity to motivated reasoning.
If you're saying beliefs aren't more or less true and more or less useful, I disagree.
If you're saying we should work on enlightenment before working on AGI x-risk, I disagree.
We may well not have the time.
I absolutely do think LW at large is trying to understand the truth about that bill, yes. I'm sure there are some exceptions, but I'd be surprised if there were many LWers willing to actively deceive people about its consequences - LWers typically really hate lying.
Your statement is not just technically incorrect, but mostly incorrect. Incompetence explains many more wrong statements than malice. I'm not going to change your mind on that right now, but it's something I think about an awful lot, and I think the evidence strongly supports it.
More importantly, your statement sounds mean, which is enough to not want it on LW. People being rude and not extending the benefit of the doubt leads to a community that argues instead of collaboratively seeking the truth. Arguing has huge but subtle downsides for reaching the truth - motivated reasoning from the combative relationships in arguments leads people to solidify their beliefs instead of changing them as the evidence suggests.
I believe the "about LW" page requests that we extend benefit of the doubt and be not just civil, but polite. If not, the community at large still does this, and it seems to work.
This has nothing to do with my support or lack thereof for SB 1047; I don't even know if I do support it, because I find such legislation's first-order effects to be pointless. My comment is merely about truth and how to obtain it. Being mean is not how you get at the truth.
"Anyone who says otherwise is lying" is pretty judgmental and hostile. And wrong.
We really try to maintain a culture of assuming good intent on LW. The claim that nobody might legitimately misunderstand how the law would be enforced is quite a strong assumption about people's intelligence and the time they spend to understand things before commenting.
I think this phrasing is better suited to the broader internet where people start arguments instead of working together to understand the truth.
This distinction might be important in some particular cases. If it looks like an AGI might ascend to power with no real chance of being stopped by humanity, its decision about humanity might be swayed by just such abstract factors.
That consideration of being in a test might be the difference between our extinction, and our survival and flourishing by current standards.
This would also apply to the analogous consideration that alien ASIs might consider any new ASI that wiped out its creators to be untrustworthy and therefore kill-on-sight.
None of this has anything to do with "niceness", just selfish logic, so I don't think it's a response to the main topic of that post.
I very much agree with you that we should be analyzing the question in terms of the type of AGI we're most likely to build first, which is agentized LLMs or something else that learns a lot from human language.
I disagree that we can easily predict "niceness" of the resulting ASI based on the base LLM being very "nice". See my answer to this question.
I really think the best arguments for and against AIs being slightly nice are almost entirely different than the ones from that thread.
That discussion addresses all of mind-space. We can do much better if we address the corner of mind-space that's relevant: the types of AGIs we're likely to build first.
Those are pretty likely to be based on LLMs, and even more likely to learn a lot from human language (since it distills useful information about the world so nicely). That encodes a good deal of "niceness". They're also very likely to include RLHF/RLAIF or something similar, which make current LLMs sort of absurdly nice.
Does that mean we'll get aligned or "very nice" AGI by default? I don't think so. But it does raise the odds substantially that we'll get a slightly nice AGI even if we almost completely screw up alignment.
The key issue in whether an autonomous mind with those starting influences winds up being "nice" is The alignment stability problem. This has been little addressed outside of reflective stability. It's pretty clear that the most important goal will be reflectively stable: it's pretty much part of the definition of having a goal that you don't want to change it before you achieve it. It's much less clear what the stable equilibrium is in a mind with a complex set of goals. Humans don't live long enough to reach a stable equilibrium. AGIs with preferences encoded in deep networks may reach equilibrium rather quickly.
What equilibrium they reach is probably dependent on how they make decisions about updating their beliefs and goals. I've had a messy rough draft on this for years, and I'm hoping to post a short version. But it doesn't have answers, it just tries to clarify the question and argue that it deserves a bunch more thought.
The other perspective is that it's pretty unlikely that such a mind will reach an equilibrium autonomously. I'm pretty sure that Instruction-following AGI is easier and more likely than value aligned AGI, so we'll probably have at least some human intervention on the trajectory of those minds before they become fully autonomous. That could also raise the odds of some accidental "niceness" even if we don't successfully put them on a trajectory for full value alignment before they are granted or achieve autonomy.
This post is great; good job getting it out the door without further perfecting it.
It's important because I think a lot of people are roughly in this camp.
And it's important because it's half right in an important way: we should not be focusing alignment discussions on utility maximizers. That's not the sort of AGI we're expecting anymore. This is a real cause for increasing optimism above Eliezer's and thinking about the problem differently than Superintelligence frames it.
But it's also important because it's half wrong in an important way: focusing on current AI systems is not the right way to think about alignment with fresh eyes.
Current systems are safe and also mostly useless. We will turn them into goal-directed agents because it's easy and very very useful.
We can think about aligning those agentic systems based on some of the properties of the deep networks they'll use. But they'll also have real, explicit goals. Aligning those goals with humanity's is the subject of alignment.
For more, see my reply to Thane Ruthenis' top comment.
I largely concur, but I think the argument is simpler and more intuitive. I want to boil this down a little and try to state it in plainer language:
Arguments for doom as a default apply to any AI that has unbounded goals and pursues those goals more competently than humans. Maximization, coherence, etc. are not central pieces.
Current AI doesn't really have goals, so it's not what we're worried about. But we'll give AI goals, because we want agents to get stuff done for us, and giving them goals seems necessary for that. All of the concerns for the doom argument will apply to real AI soon enough.
However, current AI systems may suggest a route to AGI that dodges some of the more detailed doom arguments. Their relative lack of inherent goal-directedness and relative skill at following instructions true to their intent (and the human values behind them) may be cause for guarded optimism. One of my attempts to explain this is The (partial) fallacy of dumb superintelligence.
In a different form, the doom as default argument is:
- If an agent is smarter/more competent than you, and
- has goals that conflict with yours,
- it will outsmart you somehow, eventually (probably soon),
- and it will achieve its goals and you will correspondingly not achieve yours.
- If its goals are unbounded and it "cares" about your goals near zero,
- you will lose everything.
Arguments that "we're training it on human data so it will care about our values above zero" are extremely speculative. They could be true, but betting the future of humanity on it without thinking it through seems very, very foolish.
That's my attempt at the simplest form of the doom-by-default argument.
Just to point out the one distinction: I make no reference to game-theoretic agents or coherence theorems. I think these are unnecessary distractions from the core argument. An agent that has weird and conflicting goals (and so isn't coherent or a perfect game-theoretic agent) will still take all of your stuff if its set of goals and values doesn't weigh human property rights or human wellbeing very highly. That's why we take the alignment problem to be the central problem in surviving AGI.
The other question implicit in this post was, why would we make AI less safe than current systems, which would remain pretty safe even if they were a lot smarter.
Asking why in the world humans would make AI with its own goals is like asking why in the world we'd create dynamite, much less nukes: because it will help humans accomplish their goals, until it doesn't. And it's as easy as calling your safe oracle AI (e.g., a really good LLM) repeatedly with "what would an agent trying to accomplish X do with access to tools Y?" and passing the output to those tools. Agency is a one-line extension, and we're not going to just not bother.
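A crude sketch of what that near-one-line extension looks like, with oracle() and the tools dictionary as hypothetical stand-ins for an LLM API and tool interfaces:

```python
# Sketch: wrapping an oracle (question-answering) AI into a crude agent loop.
# oracle() and the tool functions are hypothetical placeholders.

def oracle(prompt: str) -> str:
    """Placeholder for a call to a very capable question-answering model."""
    raise NotImplementedError

def run_agent(goal: str, tools: dict, steps: int = 10) -> None:
    last_result = ""
    for _ in range(steps):
        # Ask the oracle what an agent pursuing the goal would do next...
        suggestion = oracle(
            f"What would an agent trying to accomplish '{goal}' do next, "
            f"given tools {list(tools)} and this last result: {last_result}?"
        )
        # ...and pass its output straight to the named tool.
        tool_name, _, arg = suggestion.partition(":")
        tool = tools.get(tool_name.strip(), lambda a: "unknown tool")
        last_result = tool(arg)
```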
Precisely what I meant, good catch on the effort bit.
Libertarian free will is a contradiction in terms. Randomness is not what we want. Compatibilist free will has all the properties worth wanting: your values and beliefs determine the future, to the extent you exert the effort to make good decisions. Whether you do that is also determined, but it is determined by meaningful things like how you react to this statement.
Determinism has no actionable consequences if it's true. The main conclusion people draw, "my efforts at making decisions don't matter," is dreadfully wrong.
Sure, but I wasn't saying what we do is pointless - just that there are other routes to meaning that aren't really more difficult. And that what most of the world does now is back-breaking and soul-crushing labor, not the fun intellectual labor the LW community often is privileged to do.
Most of humanity has always known they couldn't do anything useful - except provide a better life for their children than they had.
Only a few elites have ever felt that what they do mattered, and looked forward to doing it as a challenge. Most of humanity has done what they must to ensure their children won't suffer.
Your first answer to your daughter would make most parents weep with joy: whatever you want is what you'll do.
Don't worry that she won't find something she likes to do unless she's forced to. People care about people, and there will be plenty to do with and for other people.
If you want concrete ideas of what people do when they're allowed to, see art and other collaborative projects that aren't just for money.
While we're determined, we also determine the future. The atoms that do that are called you. They make up beliefs and passions and you. You are not an object. You are a subject, and you determine your own future. The nexus of past influences is called you and your thoughts. Don't skimp on care in thinking; your future is up to you.
As per our discussions on our other posts, I don't think we can say that value learning in itself solves the problem. The issue of whether the ASI's interpretation of its central goal or instructions might change is not automatically solved by adopting that approach. The value mutability problem you link to is a separate issue. I'm not addressing here whether human values might change, but whether an AGI's interpretations of its central goal/values might change.
I think my terminology isn't totally clear. By "goals" in that statement, I mean what we mean by "values" in humans. The two are used in overlapping and mostly interchangeable ways in my writing.
- Humans aren't sufficiently intelligent to be all that internally consistent
- In many cases of humans changing goals, I'd say they're actually changing subgoals, while their central goal (be happy/satisfied/joyous) remains the same. This may be described as changing goals while keeping the same values.
- Note 'in the short term' (I think you're quoting Bostrom? The context isn't quite clear). In the long term, with increasing intelligence and self-awareness, I'd expect some of people's goals to change as they become more self-aware and work toward more internal coherence (e.g., many people change their goal of eating delicious food when they realize it conflicts with their more important goal of being happy and living a long life).
Yes, humans may change exactly that way. A friend said he'd gotten divorced after getting a CPAP to solve his sleep apnea: "When we got married, we were both sad and angry people. Now I'm not." But that's because we're pretty random and determined by biology.
Nice article! Your main point, that capabilities and alignment can be and often are advanced together, is valuable, and I take it.
Now, to voice the definitional quibble I suspect many readers are thinking of:
I think it's more correct to say that some techniques invented for alignment might also be used to improve capabilities, and the way they're first implemented might do that by accident. A literally negative alignment tax for a whole technique seems like it's stretching the definitions of capabilities and alignment.
For instance, the approach I propose in Internal independent review for language model agent alignment should hypothetically improve capabilities when it's turned to that end, but will not if it's used strictly for alignment. A better term for it (coined by Shane Legg in this short talk) is System 2 alignment. It's scripting the agent to "think through" the consequences of an action before taking it, like humans employ System 2 thinking for important actions. You could design it to think through the ethical consequences, or the efficacy or cost of an action, or any combination. Including checking the predicted ethical consequences will take slightly longer than checking only predicted efficacy and costs, and thus have a small but positive alignment tax.
The technique itself of implementing System 2 predictions for actions doesn't have a negative alignment tax, just the potential to be employed for both alignment and capabilities in ways so similar that the design/implementation costs are probably almost zero. This technique seems to have been independently invented several times, often with alignment as the inspiration, so we could argue that working on alignment is also advancing capabilities here.
In the case of RLHF, we might even argue that the creation tax is negative; if you don't specify what criteria people use for judging outputs, they'll probably include both ethical (alignment) and helpfulness (capabilities) criteria. Differentiating these would be a bit harder. But the runtime cost seems like it's guaranteed to be a positive tax. Refusing to do some unaligned stuff is a common complaint leveled at the system's capabilities.
So:
I think the original definition, with zero being the minimal alignment tax, is probably correct. RLHF/RLAIF happen to both increase alignment and performance (by the metric of what people prefer) when they're performed in a specific way. If you told the people or Constitution in charge of the RL process not to prefer harmless responses, just helpful ones, I very much doubt it would harm capabilities - I'd think it would help them (particularly from the perspective of people who'd like the capability of the AI saying naughty words).
Anyway, the main point, that you can advance capabilities and alignment at the same time and should think about differentially advancing alignment, is well taken. I'd just change the framing in future pitches to this effect.