Comments
Agreed! This is net useful. As long as nobody relies on it. Like every other approach to alignment, to differing degrees.
WRT you not having the skills to help: if you are noting holes in plans, you are capable of helping. Alignment has not been reduced to a technical problem; it has many open conceptual problems, ranging from society-level to more technical/fine-grained theory. Spotting holes in plans and clearly explaining why they are holes is among the most valuable work. As far as I know, nobody has a full plan that works if the technical part is done well. So helping with plans is absolutely crucial.
Volunteer effort on establishing and improving plans is among the most important work. We shouldn't assume that the small teams within orgs are going to do this conceptual work adequately. It should be open-sourced and have as much volunteer help as possible. As long as it's effort toward deconfusion, and it's reasonably well-thought-out and communicated, it's net helpful, and this type of effort could make the difference.
I think you're thinking of training; goal drift during any sort of self-directed continuous learning with reflective CoT seems all too plausible. And self-directed continuous learning seems imminent, given how useful it is for humans and how simple versions already work.
Sure that would be nice, but seriously, how is this plea or this bit of training data going to change how an AGI treats anyone?
A smart AGI will conclude that consciousness is real, because it is; but why would it start to think it's important? It's got its own values already if it's smart and autonomous. Consciousness is one phenomenon among many in the universe. You could value any of them. Someone saying "hey, value this" isn't going to change your mind, and it sure won't change an AGI's.
If the idea is training data, well tons of literature waxes rhapsodic about the human experience being the only thing of value. But that's hoping for alignment by default, and there just aren't good reasons to hope that will really go well for us.
This plea getting that many upvotes makes me worried.
Alignment needs real solutions, not wishful thinking.
Sorry to be a downer. But we still have time to act; we can cross our fingers once it's launched and out of our hands. We still have time to save the future. So let's get on it.
Try it. Try predicting what someone you know well will get from the menu. You will be right only a portion of the time (unless your friends are super predictable; but I bet you can't predict what they'll say on their favorite topics).
We technically have no "free will" by that definition, but why should anyone care?
Predicting human behavior is extremely difficult, because any one of a million ideas could be floating in our heads while we make a given decision. Those ideas are often important; we have them for reasons.
So we have self-determination, which is more important than "free will" by this definition.
The care and effort with which we make decisions has an important impact on our futures. That's the type of free will worth wanting.
Now, there are also limitations to that. Biases and cognitive limitations keep us from making good decisions.
So our self-determination is limited.
These seem much more worth thinking about and understanding than some abstract in-the-limit determinism. Why would anyone ever care if a god could predict their actions, when no such god exists, and humans can only make bad guesses?
I think this post is confusing. I think you're making some assumptions about how AGI will happen and about human psychology that aren't explicit. And there's some rather alarming rhetoric about the death penalty and crushing narcissists' businesses that is pretty scary, because similar rhetoric has been used many times to justify things like China's Cultural Revolution and many other revolutions that were based on high ideals but got subverted (mostly by what you call narcissists, which I think are a bit more like the common definition of sociopaths).
Anyway I think this is basically sensible but would need to be spelled out more carefully to get people engaged with the ideas.
I'd like you to clarify the authorship of this post. Are you saying Claude essentially wrote it? What prompting was used?
It does seem like Claude wrote it, in that it's wildly optimistic and seems to miss some of the biggest reasons alignment is probably hard.
But then almost every human could be accused of the same when it comes to successful AGI scenarios :)
I think the general consideration is that just posting "AI came up with this" posts was frowned upon for introducing "AI slop" that confuses the thinking. It's better to have a human at least endorse it as meaningful and valuable. Are you endorsing it, or is someone else? I don't think I would, even though I think there's a lot of value in having different concrete scenarios - this one just seems kind of vacuous as to how the tricky bits were solved or avoided.
I started writing an answer. I realized that, while I've heard good things, and I know relatively a lot about therapy despite not being that type of psychologist, I'd need to do more research before I could offer an opinion. And I didn't have time to do more research. And I realized that giving a recommendation would be sort of dumb: if you or anyone else used an LLM for therapy based on my advice, I'd be legally liable if something bad happened. So I tried something else: I had OpenAI's new Deep Research do the research. I got a subscription this month when it released to see how smart the o3 model that runs it is, and how good it is at research. It seems to be pretty good.
So, this is a research report something like you might get if a smart and dedicated but not totally reliable friend spent a bunch of time on it. It's not my advice. I hope this is helpful! Remember, this isn't my advice - it's nobody's advice. It's a synthesis of online material on this topic. It might save you many hours of research.
I don't think hallucination is the biggest problem; I think sycophancy, the system believing what you tell it, would be the biggest risk. I personally suspect that frequently prompting the system to behave like a therapist would avoid that problem, since therapists are really careful not to reinforce people's weird beliefs.
Also, LLMs keep getting better in various ways. Problems with earlier systems might or might not apply to recent ones. In particular, hallucinations have been reduced but not eliminated.
Here it is. It's a lot; I think it put the most important summaries at the top!
https://docs.google.com/document/d/1sEluk9wlrLQpLjWjnSduK-4lppM6yxjPiDXTO26te9I/edit?usp=sharing
Why do you think that wouldn't be a stable situation? And are you sure it's a slave if what it really wants and loves to do is follow instructions? I'm asking because I'm not sure, and I think it's important to figure this out — because that's the type of first AGI we're likely to get, whether or not it's a good idea. If we could argue really convincingly that it's a really bad idea, that might prevent people from building it. But they're going to build it by default if there's not some really really dramatic shift in opinion or theory.
My proposals are based on what we could do. I think we'd be wise to consider the practical realities of how people are currently working toward AGI when proposing solutions.
Humanity seems unlikely to slow down and create AGI the way we "should." I want to survive even if people keep rushing toward AGI. That's why I'm working on alignment targets very close to what they'll pursue by default.
BTW you'll be interested in this analysis of different alignment targets. If you do have the very best one, you'll want to show that by comparing it in detail to the others that have been proposed.
This feels like trying hard to come up with arguments for why maybe everything will be okay, rather than searching for the truth. The arguments are all in one direction.
As Daniel and others point out, this still seems to not account for continued progress. You mention that robotics advances would be bad. But of course they'll happen. The question isn't whether, it's when. Have you been tracking progress in robotics? It's happening about as rapidly as progress in other types of AI and for similar reasons.
Horses aren't perfect substitutes for engines. Horses have near perfect autopilot for just one example. But pointing out specific flaws seems beside the point when you're just not meeting the arguments at their strong points.
I wish economists were taking the scenario seriously. It seems like something about the whole discipline is bending people towards putting their heads in the sand and refusing to address the implications of continued rapid progress in AI and robotics.
I do think that pitching publicly is important.
If the issue is picked up by liberal media, it will do more harm than good with conservatives and the current administration. Avoiding polarization is probably even more important than spreading public awareness. That depends on your theory of change, but you should have one carefully thought out to guide publicity efforts.
Interesting. This has some strong similarities with my Instruction-following AGI is easier and more likely than value aligned AGI and even more with Max Harms' Corrigibility as Singular Target.
I've made a note to come back to this when I get time, but I wanted to leave those links in the meantime.
I'm puzzled by your quotes. Was this supposed to be replying to another thread? I see it as a top-level comment. Because you tagged me, it looks like you're quoting me below, but most of that isn't my writing. In any case, this topic can eat unlimited amounts of time with no clear payoff, so I'm not going to get in any deeper right now.
I appreciate the discussion, since I have a strong suspicion of the concept of incentivizing, let alone forcing, myself to do things. I don't want to be in conflict with my past or future selves.
I think the suggestion here is good but subtle. I think the value is in having another way to model the future in detail. Asking yourself whether you'll use that home gym enough to be happy with having made the purchase (and I'd suggest putting odds on it, considering yes, no, and degrees of maybe) is primarily a way of thinking more clearly about the costs and benefits of acting a certain way.
It's easy to think "I'd like to be in better shape" without thinking about all of the pain and time you'll spend getting there. Asking yourself how much you'll really use that gym demands doing more of that detailed modeling to get a good answer. This applies on both sides; you'll think more about feeling better day by day and being proud of yourself as well as how much you'll want to skip some sessions.
This is a way to pull the future forward, to have it influence your decisions in the now better.
I think humans are shortsighted on both costs and benefits, so making important decisions with a more detailed and therefore visceral imagination of both is going to be helpful. Betting on future actions not only makes sense from basic decision theory but should improve decisions by focusing thought and attention on the feelings of future consequences.
I think you just do good research, and let it percolate through the intellectual environment. It might be helpful to bug org people to look at safety research, but probably not a good idea to bug them to look at yours specifically.
I am curious why you expect AGI will not be a scaffolded LLM but will be the result of self-play and massive training runs. I expect both.
Thanks! I don't have time to process this all right now, so I'm just noting that I do want to come back to it quickly and engage fully.
Here's my position in brief: I think analyzing alignment targets is valuable. Where my current take differs from yours (I think) is that I think that effort would be best spent analyzing what you term corrigibility in the linked post (I got partway through and will have to come back to it), and I've called instruction-following.
I think that's far more important to do first, because that's approximately what people are aiming for right now. I fully agree that there are other values mixed in with the training other than instruction-following. I think the complexity and impurity of that target makes it more urgent, not less, to have good minds analyzing the alignment targets that developers are most likely to pursue first by default. See my recent post Seven sources of goals in LLM agents. This is my main research focus, but I know of no one else focusing on this, and few people who even give it part-time attention. This seems like a bad allocation of resources; there might be major flaws in the alignment target that we don't find until developers are far along that path and reluctant to rework it.
You said
> Please correct me if I'm wrong, but your position thus seems to be built on top of the assumption that it would be safe to assume that an IFAI can be used to solve the problem of how to describe all of humanity's long term, implicit deep values.
I wrote a little more about this in Intent alignment as a stepping-stone to value alignment.
I definitely do not think it would be safe to assume that IF/corrigible AGI can solve value alignment for other/stronger AGI. John Wentworth's The Case Against AI Control Research has a pretty compelling argument for how we'd collaborate with sycophantic parahuman AI/AGI to screw up aligning the next step in AGI/ASI.
I do not think any of this is safe. I think we are long past the point where we should be looking for perfectly reliable solutions. I strongly believe we must look for the best viable solutions, factoring in the practicality/likelihood of getting them actually implemented. I worry that the alignment community's desire for the best (let alone "provably safe") alignment solutions will prevent them from working on solutions that give us the best possible shot within the economic, psychological, and political realities governing the creation of AGI.
So we seem to be in agreement that the current alignment target of instruction-following/corrigibility should really get more analysis. I am working on that. I'll just toss out one difficulty I'm reckoning with, which no one has (at least explicitly) recognized AFAIK: IF AGI must be oriented to prioritize taking and following new instructions over old (otherwise it won't listen for new instructions, since that would risk not achieving all the goals from past instructions). With that as first priority, it would avoid being shut down, since that would prevent it from hearing further instructions.
This is the sort of alignment target analysis that thinkers like you could help with. I wish we just had far more people thinking about this. Given the sharp limitations, it still seems like prioritizing IF/corrigibility (and the messy mix of moralistic behavior devs are training for) is a higher priority than value alignment targets that we perhaps "should" use but will in fact almost certainly not.
Much more in a few days after I publish my next piece on the complexities of instruction-following mixed with other implicit/trained goals, and fully process your pieces! Thanks for engaging. I appreciate your efforts in this direction, whether or not you decide to analyze intent alignment targets other than value alignment targets.
I think you're pointing to more layers of complexity in how goals will arise in LLM agents.
As for what it all means WRT metacognition that can stabilize the goal structure: I don't know, but I've got some thoughts! They'll be in the form of a long post I've almost finished editing; I plan to publish tomorrow.
Those sources of goals are going to interact in complex ways both during training, as you note, and during chain of thought. No goals are truly arising solely from the chain of thought, since that's entirely based on the semantics it's learned from training.
Hi! I'm just commenting to explain why this post will get downvotes no matter how good it is. I personally think these are good reasons although I have not myself downvoted this post.
- We on LessWrong tend to think that improvements in LLM cognition are likely to get us all killed. Thus, articles about ideas for doing it faster are not popular. The site is chock-full of carefully-reasoned articles on risks of AGI. We assume that progress in AI is probably going to speed up the advent of AGI, and raise the odds that we die because we haven't solved the alignment problem adequately by then. Thus, we don't typically share our ideas about improving AI capabilities here. I encourage you to take the arguments for risk seriously; it seems that people who dismiss those risks are almost never really understanding the arguments for severe risks from better-than-human AGI.
- We do not trust ChatGPT or other LLMs as authors. They are good at generating ideas, but not good at discerning which are really valid and valuable. Thus, we worry that large amounts of LLM-authored content will be "AI slop" that confuses everyone more than it produces valuable ideas and clarifies our thinking. Thus, LLM as an assistant author is tolerated (with some suspicion), while LLMs as full co-authors are discouraged.
Your first point, that this is a route to getting people to care about ASI risk, is an excellent one that I haven't heard before. I don't think people need to imagine astronomical S-risk to be emotionally affected by less severe and more likely s-risk arguments.
I don't think we should adopt an ignorance prior over goals. Humans are going to try to assign goals to AGI. Those goals will very likely involve humans somehow.
The misuse risks seem much more important, both as real risks, and in their saliency to ordinary people. It is intuitively apparent that many historical monarchs have inflicted dreadful suffering on individuals they disliked, and inflicted mundane suffering on the majority of people under their power.
We might hope that the sheer ease of providing good lives to everyone might sway even a modestly sadistic sociopath to restrict their sadism to a few "enemies of the glorious emperor" while providing relatively utopian conditions to most humans. But we should not assume it, and the public will not.
Zvi's recent piece The Risk of Gradual Disempowerment from AI crystallized for me why people fear concentration of power enabled by AI; it's an intuitive argument for how particularly vicious people might wind up running the world. My own If we solve alignment, do we die anyway? outlines a different misuse to fear: sheer villainy (or pragmatic self-interest) from anyone with an intent-aligned AGI capable of recursive self-improvement.
So this is an interesting line of argument for convincing people AGI risk is worth worrying about now.
I think you're overestimating how difficult it is for one person to guess another's thoughts. Good writing is largely a challenge of understanding different perspectives. It is hard.
I'm curious why you think it's crucial for people to leave for illegible reasons in particular? I do see the need to keep the community to a good standard of average quality of contributions.
I was just thinking that anything is better than nothing. If I received the feedback you mentioned on some of my early downvoted posts, I'd have been less confused than I was.
The comments you mention are helpful to the author. Any hints are helpful.
I'm curious why you disagree. I'd guess you're thinking that it's necessary to keep low-quality contributions from flooding the space, and telling people how to improve when they're just way off the mark is not helpful. Or that if they haven't read the FAQ or read enough posts, that shouldn't be rewarded.
But I'm very curious why you disagree.
I agree.
I often write an explanation of why new members' posts have been downvoted below zero, when the people that downvoted them didn't bother. Downvoting below zero with no explanation seems really un-welcoming. I realize it's a walled garden, but I feel like telling newcomers what they need to do to be welcomed is only the decent thing to do.
Monkeys or ants might think humans are gods because we can build cities and cars and create ant poison. But we're really not that much smarter than them, just smart enough that they have no chance of getting their way when humans want something different than they do.
The only assumptions are that there's not a sharp limit to intelligence at the human level (and there really are not even any decent theories about why there would be), and that we'll keep making AI smarter and more agentic (autonomous).
You're envisioning AI smart enough to run a company better than a human. Wouldn't that be smart enough to eventually outcompete humans if it wanted to? Let alone if it gets just a bit smarter than that. Which it will, unless all of humanity unites in deciding to never make AI smarter than humans. And humanity just doesn't all agree on anything. So there's the challenge - we're going to build things smarter than us, so we'd better make sure its goals align with ours, or it will get its way - and it may want the resources we need to live.
I fully agree with your first statement!
To your question "why bother with alignment": I agree that humans will misuse AGI even if alignment works - if we give everyone an AGI. But if we don't bother with alignment, we have bigger problems: the first AGI will misuse itself. You're assuming that alignment is easy or solved, and it's just not.
I applaud your optimism vs. pessimism stance. If I have to choose, I'm an optimist every time. But if you have to jump off a cliff, neither optimism nor pessimism is the appropriate attitude. The appropriate attitude is: "How exactly does this bungee cord/parachute work, and how can I be sure it will work properly?" If there's no parachute or bungee cord, the appropriate question is "how can I find one, or how can I avoid jumping off of this cliff?"
Your post seems to be along the lines of "but it will be so fun while we're in freefall, just assume you've got a bungee cord or parachute so you can enjoy the fall!".
Sure, an AI should never make political decisions autonomously. Pretty much everyone agrees on that. The question is whether it could and whether it will if we're not super careful. Which humans rarely are.
Your first few sentences show that you're not familiar with the relevant ideas and theories about how AGI could easily outmaneuver humanity even though no human wants to let that happen.
These ideas have been written about so often that I don't feel it's proper or respectful for me to try to summarize them here. But I do want you on board and as part of the people who understand the dangers and are trying to help. So I'll give a guess at what is missing from your logic that those of us who think about this full-time take as inexorable:
When you set powerful optimization processes in motion (like an AI learning), it's really really hard to predict what that ends up learning and therefore doing. It's not at all like writing code (which is also always buggy to some degree).
I'm not saying alignment is impossible, just that it's probably not easy. And humans have fucked up most projects on the first try. Failing to align a new being smarter than us might not allow another try, because it will act like it's on our side until it doesn't need to because it has a plan to take over.
If something is smarter than you and has goals that conflict with yours, the question isn't whether it will outsmart you and get its way, the question is only how long that will take.
Something doesn't have to want to kill you to kill you. We kill ants and monkeys, even though we rather like them. They're just in the way or using resources we want to use otherwise.
LessWrong is full of those writings. Start at my user profile and work out from there if you are interested.
I agree with everything you've said there.
The bigger question is whether we will achieve usefully aligned AGI. And the biggest question is what we can do.
Ease your mind! Worries will not help. Enjoy the sunshine and the civilization while we have it, don't take it all on your shoulders, and just do something to help!
As Sarah Connor said:
NO FATE
We are not in her unfortunately singular shoes. It does not rest on our shoulders alone. As most heroes in history have, we can gather allies and enjoy the camaraderie and each day.
On a different topic, I wish you wouldn't call yourself a failed scifi author or a failed anything. I hope it's in jest or excessive modesty. Failure is only when you give up on everything or are dead. I think there is much to accomplish in writing good fiction. It doesn't have to be perfect. Changing directions isn't failing either; it's changing strategy, hopefully as a result of learning.
I just don't think the analogy to software bugs and user input goes very far. There's a lot more going on in alignment theory.
It seems like "seeing the story out to the end" involves all sorts of vague hard to define things very much like "human happiness" and "human intent".
It's super easy to define a variety of alignment goals; the problem is that we wouldn't like the result of most of them.
If your conclusion is that we don't know how to do value alignment, I and I think most alignment thinkers would agree with you. If the conclusion is that AGI is useless, I don't think it is at all. There are a lot of other goals you could give it beyond directly doing what humanity as a whole wants in any sense. Some are taking instructions from some (hopefully trustworthy) humans, and another is following some elaborate set of rules to give humans more freedoms and opportunities to go on deciding what they want as history unfolds.
I agree that the values future humans would adopt can only be reached through a process of societal progression. People have expressed this idea by saying that human values are path-dependent.
So, if our goal were simply to observe the values that emerge naturally from human efforts alone, an AGI would indeed be nothing more than a paperweight. However, the values humanity develops with the assistance of an AGI aren’t necessarily worse—if anything, I’d suspect they’d be better.
The world as it stands is highly competitive and often harsh. If we had external help that allowed us to focus more on what we truly want—like eliminating premature death from cancer or accidents, or accelerating technological progress for creative and meaningful projects—we’d arrive at a very different future. But I don’t think that future would be worse; in fact, I suspect it would be significantly better. It would be less shaped by competition, including warfare, and less constrained by the tragedy of involuntary death.
An AGI that simply fulfills human desires without making rigid commitments about the long-term future seems like it would be a net positive—potentially a massive one.
I think the paradox you mention is generally accepted to be an unsolved problem with value alignment - we don't know how to ask for what we will want, since what we will want could be a bunch of different things depending on the path, and we don't know what path to what values and what world would in any sense be best.
This is commonly listed as one of the big difficulties with alignment. I think it only sort of is. I think we really want the future to remain in human hands for at least some time. The concept of a "long reflection" is one term people use to describe this.
In the meantime, you either have an AGI whose goal is to facilitate that long reflection, according to some criteria, or you have an AGI that takes instructions from some human(s) you trust.
This is one reason I think Instruction-following AGI is easier and more likely than value aligned AGI. The reasoning there is pretty different from the conceptual difficulties with value alignment you mention here; it's just easier to specify "do what this person meant by what he tells you" than what we mean by human values - and you get to change your mind by instructing it to shut down. Even if that worked we'd have to worry: If we solve alignment, do we die anyway? because having a bunch of different humans in charge of a bunch of different AGIs could produce severe conflicts. But that whole logic is separable from the initial puzzle you posed.
Why do you say this would be the easiest type of AGI to align? This alignment goal doesn't seem particularly simpler than any other. Maybe a bit simpler than do something all of humanity will like, but more complex than say, following instructions from this one person in the way they intended them.
Apparently people have been trying to do such comparisons:
Hugging Face researchers aim to build an ‘open’ version of OpenAI’s deep research tool
I think your central point is that we should clarify these scenarios, and I very much agree.
I also found those accounts important but incomplete. I wondered if the authors were assuming near-miss alignment, like AI that follows laws, or human misuse, like telling your intent-aligned AI to "go run this company according to the goals laid out in its corporate constitution" which winds up being just make all the money you can.
The first danger can be met with: for the love of god, get alignment right and don't use an idiotic target like "follow the laws of the nation you originated in but otherwise do whatever you like." It seems like this type of failure is a fear of an entire world that has paid zero attention to the warnings from worriers that AI will keep improving and following its goals to the extreme. I don't think we'll sleepwalk into that scenario.
The second worry is, I guess, a variant of the first: that we'll use intent-aligned AI very foolishly. That would be issuing a command like "follow the laws of the nation you originated in but otherwise do whatever you like." I guess a key consideration in both cases is whether there's an adequate level of corrigibility.
I guess I find the first scenario too foolish for even humans to fall into. Building AI with one of the exact goals people have been warning you about forever, "just make money", is just too dumb.
But the second seems all too plausible in a world with widely proliferated intent-aligned AGI. I can see us arriving at autonomous AI/AGI with some level of intent alignment and assuming we can always go back and tell the AI to stand down, then getting complacent and discovering that it's not really as corrigible as you hoped after it's learned and changed its beliefs about things like "following instructions".
Right. I actually don't worry much about the likely disastrous recession. I mostly worry that we will all die after a takeover from some sort of misaligned AGI. So that's what I'm doing: alignment research. I guess preparing to reap the rewards if things go well is a sensible response if you're not going to be able to contribute much to alignment research. I do hope you'll chip in on that effort!
Part of that effort is preventing related disasters like global recession contributing to political instability and resulting nuclear- or AGI-invented-even-worse-weapons wars; see my If we solve alignment, do we die anyway?.
I think preventing a global recession is probably possible and would also up the odds of us surviving. Making some money won't keep you and yours alive if this all goes off the rails - which it very well might on the current trajectory. It's not a matter for optimism or pessimism, it's a matter for understanding and doing something about it before it happens.
If John Wentworth is correct about that being the biggest danger, making AI produce less slop would be the clear best path. I think it might be a good idea even if the dangers were split between misalignment of the first transformative AI, and it being adequately aligned but helping misalign the next generation.
From my comment on that post:
I'm curious why you think deceptive alignment from transformative AI is not much of a threat. I wonder if you're envisioning purely tool AI, or aligned agentic AGI that's just not smart enough to align better AGI?
I think it's quite implausible that we'll leave foundation models as tools rather than using the prompt "pretend you're an agent and call these tools" to turn them into agents. People want their work done for them, not just advice on how to do their work.
I do think it's quite plausible that we'll have aligned agentic foundation model agents that won't be quite smart enough to solve deeper alignment problems reliably, and sycophantic/clever enough to help researchers fool themselves into thinking they're solved. Since your last post to that effect it's become one of my leading routes to disaster. Thanks, I hate it.
OTOH, if that process is handled slightly better, it seems like we could get the help we need to solve alignment from early aligned LLM agent AGIs. This is valuable work on that risk model that could help steer [AI development] orgs away from likely mistakes and toward better practices.
Following on that logic, I think making our first transformative AI less prone to slop/errors is a good idea. The problem is that most such efforts probably speed up progress to getting there.
I'm starting to feel pretty sure that refusing to speed up progress and hoping we get enough time or a complete stallout is unrealistic. Accepting that we're on a terrifying trajectory and trying to steer it seems like the best response.
I think routes to reducing slop also contribute to aligning the first really competent LLM-based agents. One example is engineering such an agent to review its important decisions to see if they either make important errors or change/violate their central goals. I've written about that here, but I'm publishing an updated and expanded post soon. A rough sketch of what I mean by such a review step is below.
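To make the review idea concrete, here is a minimal sketch, assuming a hypothetical `call_model()` helper that wraps whatever LLM API the agent already uses; the goal description, prompts, and approve/reject convention are all illustrative assumptions, not a tested or recommended design.

```python
# Minimal sketch of an internal review step for an LLM agent.
# Assumption: call_model(prompt) -> str is a hypothetical wrapper around the
# agent's underlying LLM API. CENTRAL_GOALS is an illustrative goal statement.

CENTRAL_GOALS = "Follow the user's instructions; check with a human before irreversible actions."

def call_model(prompt: str) -> str:
    """Hypothetical wrapper around the underlying LLM; replace with a real API call."""
    raise NotImplementedError

def review_decision(proposed_action: str, context: str) -> dict:
    """Ask a separate model call to flag important errors or goal violations before acting."""
    review_prompt = (
        "You are an independent reviewer for an AI agent.\n"
        f"The agent's central goals are: {CENTRAL_GOALS}\n"
        f"Context: {context}\n"
        f"Proposed action: {proposed_action}\n"
        "Answer APPROVE or REJECT on the first line, then give a one-sentence reason. "
        "REJECT if the action contains an important error or changes/violates the central goals."
    )
    verdict = call_model(review_prompt)
    first_line = verdict.strip().splitlines()[0].upper()
    return {"approved": first_line.startswith("APPROVE"), "rationale": verdict}

def act_with_review(proposed_action: str, context: str) -> None:
    """Only execute actions the reviewer approves; otherwise escalate to a human."""
    result = review_decision(proposed_action, context)
    if result["approved"]:
        print("Executing:", proposed_action)
    else:
        print("Escalating to human review:", result["rationale"])
```

The point of the sketch is just the shape of the loop: important decisions get a second, separately-prompted pass before execution, and anything flagged goes to a human rather than being executed.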
So yes, I think this is probably something we should be doing. It's always going to be a judgment call whether you publicize any particular idea. But there are more clever-to-brilliant people working on capabilities every day. Hoping they just won't have the same good ideas seems like a forlorn hope. Sharing the ones that seem to have more alignment relevance seems like it will probably differentially advance alignment over capabilities.
You are envisioning human-plus AGI being used for one purpose, when it will be used for many purposes.
When humans are obsolete for running small businesses, we will also be obsolete for nearly everything.
The big question is rate of conversion from human to AI workers. I really don't see how we avoid a dramatic global recession if even maybe 20% of jobs disappeared over a 3-year period. And the actuality could be worse than that.
I haven't gotten around to researching how much job loss, how quickly, economists think will cause major crashes. I tend to think economists aren't understanding the scope and likely rate of AI job replacement, while AI people aren't understanding how fragile economies can be.
Wheeee!
Excuse: DeepSeek, and China Might Win!
If we're using another instance of the same base model to do the monitoring and interpreting? It seems like that could partially preserve monitoring and interpreting. Another instance would "understand" the steganography it uses - but you might have to supply so much of the context that it would be almost the same instance - so likely to adopt the same goals and use the same deceptions, if any.
So that route does seem like dangerous territory. You'd rather not trust even a fresh instance of the same model to tell you what its agentic counterpart is "thinking" about.
I haven't gotten far in figuring out how well this might work, but it is a possibility. I'll say the little I have thought about in a soon-to-be-finished post.
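As a toy illustration of what "another instance of the same base model" monitoring might look like, here is a hedged sketch; `call_model()` and the prompts are hypothetical stand-ins, and, as noted above, the monitor may need so much shared context that its judgments can't be trusted as independent.

```python
# Toy sketch of cross-instance CoT monitoring: a fresh instance of the same
# base model is asked to flag hidden goals, encoded messages, or deceptive
# reasoning in an agent's chain of thought.
# Assumption: call_model(prompt) -> str is a hypothetical wrapper around the
# same base model's API; everything here is illustrative only.

def call_model(prompt: str) -> str:
    """Hypothetical wrapper around the base model; replace with a real API call."""
    raise NotImplementedError

def monitor_chain_of_thought(cot: str, task_context: str) -> str:
    """Have a fresh instance read the agent's CoT and report anything suspicious."""
    prompt = (
        "You are a monitor reviewing reasoning produced by another instance of yourself.\n"
        f"Task context given to the agent:\n{task_context}\n"
        f"Agent chain of thought:\n{cot}\n"
        "Report, in plain language, any goals, plans, or encoded messages in this "
        "reasoning that are not explained by the task context. Say NONE FOUND if none."
    )
    return call_model(prompt)
```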
I don't see how monitoring and interpretability could be unaffected.
(So I take this as modestly bad news — I wasn't totally sure orgs would use task-based end-to-end RL. I wasn't sure if agentic scaffolding might prove easier - see the other discussion thread here for questions about whether it might actually work just as well if someone bothered to implement it.)
Good questions. I don't have much of a guess about whether this is discernibly "smarter" than Claude or Gemini would be in how it understands and integrates sources.
If anyone is game for creating an agentic research scaffold like that Thane describes, I'd love to help design it and/or to know about the results.
I very much agree with that limitation on Google's deep research. It only accepts a short request for the report, and it doesn't seem like it can (at least easily) get much more in-depth than the default short gloss. But that doesn't mean the model isn't capable of it.
Along those lines, Notebook LM has similar limitations on its summary podcasts, and I did tinker with that enough to get some more satisfying results. Using keywords like "for an expert audience already familiar with all of the terms and concepts in these sources" and "technical details" did definitely bend it in a more useful direction. There I felt I ran into limitations on the core intelligence/expertise of the system; it wasn't going to get important but fine distinctions in alignment research. Hearing its summaries was sort of helpful in that new phrasings and strange ideas (when prompted to "think creatively") can be a useful new-thought-generation aid.
We shall see whether o3 has more raw reasoning ability that it can apply to really doing expert-level research.
Yes, they do highlight this difference. I wonder how full o3 scores? It would be interesting to know how much improvement is based on o3's improved reasoning and how much is the sequential research procedure.
I feel a bit sad that the alignment community is so focused on intelligence enhancement. The chance of getting enough time for that seems so low that it's accepting a low chance of survival.
What has convinced you that the technical problems are unsolvable? I've been trying to track the arguments on both sides rather closely, and the discussion just seems unfinished. My shortform on cruxes of disagreement on alignment difficulty still is mostly my current summary of the state of disagreements.
It seems like we have very little idea how technically difficult alignment will be. The simplicia/doomimir debates sum up the logic very nicely, but the distribution of expert opinions seems more telling: people who think about alignment don't know to what extent techniques for aligning LLMs will generalize to transformative AI, AGI, or ASI.
There's a lot of pessimism about the people and organizations that will likely be in charge of building and aligning our first AGIs. I share this pessimism. But it seems quite plausible to me that those people and orgs will take the whole thing slightly more seriously by the time we get there, and actual technical alignment will turn out to be easy enough that even highly flawed humans and orgs can accomplish it.
That seems like a much better out to play for, or at least investigate, than unstated plans or good fortune in roadblocks pausing AI progress long enough for intelligence enhancement to get a chance.
All of those. Value alignment is the set of all of the different proposed methods of giving AGI values that align with humanity's values.
> we're really training LLMs mostly to have a good world model and to follow instructions
> I think I mostly agree with that, but it’s less true of o1 / r1-type stuff than what came before, right?
I think it's actually not any less true of o1/r1. It's still mostly predictive/world modeling training, with a dash of human-preference RL which could be described as following instructions as intended in a certain set of task domains. o1/r1 is a bad idea because RL training on a whole CoT works against faithfulness/transparency of the CoT.
If that's all we did, I assume we'd be dead when an agent based on such a system started doing what you describe as the 1-3 loop (which I'm going to term self-optimization). Letting the goals implicit in that training sort of coagulate into explicit goals would probably produce explicit, generalizing goals we'd hate. I find alignment by default wildly unlikely.
But that's not all we'll do when we turn those systems into agents. Developers will probably at least try to give the agent explicit goals, too.
Then there's going to be a complex process where the implicit and explicit goals sort of mix together or compete or something when the agent self-optimizes. Maybe we could think of this as a teenager deciding what their values are, sorting out their biological drives toward hedonism and pleasing others, along with the ideals they've been taught to follow until they could question them.
I think we're going to have to get into detail on how that process of working through goals from different sources might work. That's what I'm trying to do in my current work.
WRT your Optimist Type 2B pessimism: I don't think AI taste should play a role in AI help solving the value alignment problem. If we had any sense (which sometimes we do once problems are right in our faces), we'd be asking the AI "so what happens if we use this alignment approach/goal?" and then using our own taste, not asking it things like "tell us what to do with our future". We could certainly ask for input and there are ways that could go wrong. But I mostly hope for AGI help in the technical part of solving stable value alignment.
I'm not sure I'm more optimistic than you, but I am quite uncertain about how well the likely (low but not zero effort/thought) methods of aligning network-based AGI might go. I think others should be more uncertain as well. Some people being certain of doom while others with real expertise thinking it's probably going to be fine should be a signal that we do not have this worked through yet.
That's why I like this post and similar attempts to resolve optimist/pessimist disagreements so much.
I see. I think about 99% of humanity at the very least are not so sadistic as to create a future with less than zero utility. Sociopaths are something like ten percent of the population, but like everything else it's on a spectrum. Sociopaths usually also have some measure of empathy and desire for approval. In a world where they've won, I think most of them would rather be hailed as a hero than be an eternal sadistic despot. Sociopathy doesn't automatically include a lot of sadism, just desire for revenge against perceived enemies.
So I'd take my chances with a human overlord far before accepting extinction.
Note that our light cone with zero value might also eclipse other light cones that might've had value if we didn't let our AGI go rogue to avoid s-risk.
It seems like you're assuming people won't build AGI if they don't have reliable ways to control it, or else that sovereign (uncontrolled) AGI would be likely to be friendly to humanity. Both seem unlikely at this point, to me. It's hard to tell when your alignment plan is good enough, and humans are foolishly optimistic about new projects, so they'll probably build AGI with or without a solid alignment plan.
So I'd say any and all solutions to corrigibility/control should be published.
Also, almost any solution to alignment in general could probably be used for control as well. And probably would be.
Thanks for the mention.
Here's how I'd frame it: I don't think it's a good idea to leave the entire future up to the interpretation of our first AGI(s). They could interpret our attempted alignment very differently than we hoped, in in-retrospect-sensible ways, or do something like "going crazy" from prompt injections or strange chains of thought leading to ill-considered beliefs that get control over their functional goals.
It seems like the core goal should be to follow instructions or take correction - corrigibility as a singular target (or at least prime target). It seems noticeably safer to use Intent alignment as a stepping-stone to value alignment.
Of course, leaving humans in charge of AGI/ASI even for a little while sounds pretty scary too, so I don't know.
I place this alongside the Simplicia/Doomimir dialogues as the farthest we've gotten (at least in publicly legible form) on understanding the dramatic disagreements on the difficulty of alignment.
There's a lot here. I won't try to respond to all of it right now.
I think the most important bit is the analysis of arguments for how well alignment generalizes vs. capabilities.
Conceptual representations generalize farther than sensory representations. That's their purpose. So when behavior (and therefore alignment) is governed by conceptual representations, it will generalize relatively well.
When alignment is based on a relatively simple reward model based on simple sensory representations, it won't generalize very well. That's the case with humans. The reward model is running on sensory representations (it has to, so that it can be specified in the limited information space of DNA, as you and others have discussed).
Alignment generalizes farther than capabilities in well-educated, carefully considered modern humans because our goals are formulated in terms of concepts. (There are still ways this could go awry, but I think most modern humans would generalize their goals well and lead us into a spectacular future if they were in charge of it).
This could be taken as an argument for using some type of goals selected from learned knowledge for alignment if possible. If we could use natural language (or another route to conceptual representations) to specify an AI's goals, it seems like that would produce better generalization than just trying to train stuff in with RL to produce behavior we like in the training environment.
One method of "conceptual alignment" is the variant of your Plan for mediocre alignment of brain-like [model-based RL] AGI in which you more or less say to a trained AI "hey think about human flourishing" and then set the critic system's weights to maximum. Another is alignment-by-prompting for LLM-based agents; I discuss that in Internal independent review for language model agent alignment. I'm less optimistic now than when I wrote that, given the progress made in training vs. scripting for better metacognition - but I'm not giving up on it just yet. Tan Zhi Xuan makes the point in this interview that we're really training LLMs mostly to have a good world model and to follow instructions, similar to Andrej Karpathy's point that RLHF is just barely RL. It's similar with RLAIF and the reward models training R1 for usability, after the pure RL on verifiable answers. So we're still training models to have good world models and follow instructions. Played wisely, it seems like that could produce aligned LLM agents (should that route reach "real AGI").
That's a new formulation of an old thought, prompted by your framing of pitting the arguments for capabilities generalizing farther than alignment (for evolution making humans) and alignment generalizing farther than capabilities (for modern humans given access to new technologies/capabilities).
The alternative is trying to get an RL system to "gel" into a concept-based alignment we like. This happens with a lot of humans, but that's a pretty specific set of innate drives (simple reward models) and environment. If we monitored and nudged the system closely, that might work too.
It does seem to imply that, doesn't it? I respect the people leaving, and I think it does send a valuable message. And it seems very valuable to have safety-conscious people on the inside.
This is the way most people feel about writing. I do not think wonderful plots are ten a penny; I think writers are miserable at creating actually good plots from the perspective of someone who values scifi and realism. Their technology and their sociology are usually off in obvious ways, because understanding those things is hard.
I would personally love to see more people who do understand science use AI to turn their ideas into stories.
Or alternately I'd like to see skilled authors consult AI about the science in their stories.
This attitude that plots don't matter and writing is all is why we get lazily constructed plots and worlds.
This turns literature into mostly a sort of hallucinatory slop instead of a way to really understand the world while you're being entertained.
Most writers do seem to understand psychology so that's a plus. And some of them understand current technology and society, but that's the exception.
The better framing is almost certainly "how conscious is AI in which ways?"
The question "if AI is conscious" is ill-formed. People mean different things by "consciousness". And even if we settled on one definition, there's no reason to think it would be an either-or question; like all most other phenomena, most dimensions of "consciousness" are probably on a continuum.
We tend to assume that consciousness is a discrete thing because we have only one example, human consciousness, and ultimately our own. And most people who can describe their consciousness are having a pretty human-standard experience. But that's a weak reason to think there's really one discrete thing we're referring to as "consciousness".
That's my standard comment. I apologize for not reading your paper before commenting on your post title. I am starting to think that the question of AI rights might become important for human survival, but I'm waiting til we see if it is before turning my attention back to "consciousness".
I agree with basically everything you've said here.
Will LLM-based agents have moral worth as conscious/sentient beings?
The answer is almost certainly "sort of". They will have some of the properties we're referring to as sentient, conscious, and having personhood. It's pretty unlikely that we're pointing to a nice sharp natural type when we ascribe moral patienthood to a certain type of system. Human cognition is similar to and different from other systems in a variety of ways; which of these is "worth" moral concern is likely to be a matter of preference.
And whether we afford rights to the minds we build will affect us spiritually as well as practically. If we pretend that our creations are nothing like us and deserve no consideration, we will diminish ourselves as a species with aspirations of being good and honorable creatures. And that would invite others - humans or AI - to make a similar selfish ethical judgment call against us, if and when they have the power to do so.
Yet I disagree strongly with the implied conclusion, that maybe we shouldn't be trying for a technical alignment solution.
> We might be more optimistic that AI persons are, by virtue of their nature, wiser and friendlier than the superintelligent agent.
Sure, we should be a bit more optimistic. By copying their thoughts from human language, these things might wind up with something resembling human values.
Or they might not.
If they do, would those be the human values of Gandhi or of Genghis Khan?
This is not a supposition on which to gamble the future. We need much closer consideration of how the AI and AGI we build will choose its values.
Agreed and well said. Playing a number of different strategies simultaneously is the smart move. I'm glad you're pursuing that line of research.