I think the gaps between where we are and human-level (and broadly but not precisely human-like) cognition are smaller than they appear. Modest improvements in to-date neglected cognitive systems can allow LLMs to apply their cognitive abilities in more ways, allowing more human-like routes to performance and learning. These strengths will build on each other nonlinearly (while likely also encountering unexpected roadblocks).
Timelines are thus very difficult to predict, but ruling out very short timelines based on averaging predictions, without gears-level models of fast routes to AGI, would be a big mistake. Whether and how quickly those fast routes work is an empirical question.
One blocker to taking short timelines seriously is the belief that fast timelines mean likely human extinction. I think short timelines are extremely dangerous, but possible routes to alignment also exist - that's a separate question, though.
I also think this is the current default path, or I wouldn't describe it.
I think my research career using deep nets and cognitive architectures to understand human cognition is pretty relevant for making good predictions on this path to AGI. But I'm biased, just like everyone else.
Anyway, here's very roughly why I think the gaps are smaller than they appear.
Current LLMs are like humans with excellent:
- language abilities
- semantic memory
- working memory
They can now do almost all short time-horizon tasks that are framed in language better than humans. And other networks can translate real-world systems into language and code, where humans haven't already done it.
But current LLMs/foundation models are dramatically missing some human cognitive abilities:
- Almost no episodic memory for specific important experiences
- No agency - they do only what they're told
- Poor executive function (self-management of cognitive tasks)
- Relatedly, bad/incompetent at long time-horizon tasks.
- And zero continuous learning (and self-directed learning)
- Crucial for human performance on complex tasks
Those missing abilities would appear to imply long timelines.
But both long time-horizon tasks and self-directed learning are fairly easy to reach. The gaps are not as large as they appear.
Agency is as simple as repeatedly calling a prompt of "act as an agent working toward goal X; use tools Y to gather information and take actions as appropriate". The gap between a good oracle and an effective agent is almost completely illusory.
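To make that concrete, here's a minimal sketch of such a wrapper loop in Python. It's illustrative pseudocode made runnable, not any particular framework: call_llm stands in for whatever model API you'd actually use, and tools is whatever tool functions you wire in.

```python
def call_llm(prompt: str) -> str:
    """Placeholder for a call to whatever chat/completions API you're using."""
    raise NotImplementedError

def run_agent(goal: str, tools: dict, max_steps: int = 20) -> str:
    history = []
    for _ in range(max_steps):
        prompt = (
            f"Act as an agent working toward goal: {goal}\n"
            f"Available tools: {list(tools)}\n"
            f"History so far: {history}\n"
            "Reply with either TOOL:<name>:<input> or DONE:<answer>."
        )
        reply = call_llm(prompt)
        if reply.startswith("DONE:"):
            return reply[len("DONE:"):]
        if reply.startswith("TOOL:"):
            _, name, tool_input = reply.split(":", 2)
            observation = tools[name](tool_input)  # gather information / take an action
            history.append((name, tool_input, observation))
    return "Step budget exhausted."
```

That's the whole trick; everything hard lives in how good the model and the tools are.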
Episodic memory is less trivial, but still relatively easy to improve from current near-zero-effort systems. Efforts from here will likely build on LLMs' strengths. I'll say no more publicly; DM me for details. But it doesn't take a PhD in computational neuroscience to rederive this, which is the only reason I'm mentioning it publicly. More on infohazards later.
Now to the capabilities payoff: long time-horizon tasks and continuous, self-directed learning.
Long time-horizon task abilities are an emergent product of episodic memory and general cognitive abilities. LLMs are "smart" enough to manage their own thinking; they just don't have the instructions or skills to do it. o1 appears to have those skills (although no episodic memory, which is very helpful in managing multiple chains of thought), so similar RL training on Chains of Thought is probably one route to achieving them.
Humans do not mostly perform long time-horizon tasks by trying them over and over. They either ask someone how to do it, then memorize and reference those strategies with episodic memory; or they perform self-directed learning, posing questions and forming theories to answer them.
Humans do not have or need "9s of reliability" to perform long time-horizon tasks. We substitute frequent error-checking and error-correction. We then learn continuously on both strategy (largely episodic memory) and skills/habitual learning (fine-tuning LLMs already provides a form of this habitization of explicit knowledge to fast implicit skills).
Continuous, self-directed learning is a product of having any type of new learning (memory), and using some of the network/agent's cognitive abilities to decide what's worth learning. This learning could be selective fine-tuning (like o1's "deliberative alignment"), episodic memory, or even very long context with good access as a first step. This is how humans master new tasks, along with taking instruction wisely. This would be very helpful for mastering economically viable tasks, so I expect real effort to be put into it.
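Here's a minimal sketch of that "decide what's worth learning" loop, under heavy simplifying assumptions: a toy tag-overlap episodic store (a real system would use embeddings and a proper retriever), and call_llm again as a hypothetical stand-in for the model API.

```python
class EpisodicMemory:
    """Toy episodic store: lessons tagged with topic words, retrieved by tag overlap."""
    def __init__(self):
        self.entries = []  # list of (lesson, tags)

    def store(self, lesson: str, tags: set):
        self.entries.append((lesson, tags))

    def retrieve(self, task_tags: set, k: int = 3):
        # Rank stored lessons by tag overlap with the current task.
        ranked = sorted(self.entries, key=lambda e: len(e[1] & task_tags), reverse=True)
        return [lesson for lesson, _ in ranked[:k]]

def learn_from_episode(memory: EpisodicMemory, call_llm, transcript: str):
    # The model itself decides whether anything was worth learning.
    lesson = call_llm(
        "Here is a transcript of a task attempt:\n" + transcript +
        "\nState the single most useful lesson for future attempts, or reply NONE."
    )
    if lesson.strip() != "NONE":
        tags = set(call_llm("Give three one-word topic tags for: " + lesson).split())
        memory.store(lesson, tags)
```

Before starting a new task, the agent would call retrieve() with the task's tags and put those lessons in its context - that's the continuous-learning loop in miniature.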
Self-directed learning would also be critical for an autonomous agent to accomplish entirely novel tasks, like taking over the world.
This is why I expect "Real AGI" that's agentic and learns on its own, and not just transformative tool "AGI" within the next five years (or less). It's easy and useful, and perhaps the shortest path to capabilities (as with humans teaching themselves).
If that happens, I don't think we're necessarily doomed, even without much new progress on alignment (although that would definitely improve our odds!). We are already teaching LLMs mostly to answer questions correctly and to follow instructions. As long as nobody gives their agent an open-ended top-level goal like "make me lots of money", we might be okay. Instruction-following AGI is easier and more likely than value aligned AGI, although I need to work through and clarify why I find this so central. I'd love help.
Convincing predictions are also blueprints for progress. Thus, I have been hesitant to say all of that clearly.
I said some of this at more length in Capabilities and alignment of LLM cognitive architectures and elsewhere. But I didn't publish it in my previous neuroscience career nor have I elaborated since then.
But I'm increasingly convinced that all of this stuff is going to quickly become obvious to any team that sits down and starts thinking seriously about how to get from where we are to really useful capabilities. And more talented teams are steadily doing just that.
I now think it's more important that the alignment community takes short timelines more seriously, rather than hiding our knowledge in hopes that it won't be quickly rederived. There are more and more smart and creative people working directly toward AGI. We should not bet on their incompetence.
There could certainly be unexpected theoretical obstacles. There will certainly be practical obstacles. But even with expected discounts for human foibles and idiocy and unexpected hurdles, timelines are not long. We should not assume that any breakthroughs are necessary, or that we have spare time to solve alignment adequately to survive.
I think we are in agreement. I see a path to successful technical alignment through agentized LLMs/foundation models. The brief take is at Agentized LLMs will change the alignment landscape, and there's more on the several overlapping alignment techniques in Internal independent review for language model agent alignment and reasons to think they'll advance rapidly at Capabilities and alignment of LLM cognitive architectures.
I think it's very useful to have a faithful chain of thought, but I also think all isn't lost without it.
Here's my current framing: we are actually teaching LLMs mostly to answer questions correctly, and to follow our instructions as we intended them. Optimizing those things too hard would probably cause our doom, by producing interpretations of those pressures that we didn't intend or foresee. This is the classic agent foundations worry, and I think it's valid. But we won't do infinite optimization all in one go; we'll have a human-level agent that's on our side to help us with its improvement or next generation, and so on. If we keep instruction-following as not only the training pressure but the explicit top-level goal we give it, we have the AGI on our side to anticipate unexpected consequences of optimization, and it will be "corrigible" as long as it's approximately following instructions as intended. This gives us second chances and superhuman help maintaining alignment. So I think that alignment target is more important than the technical approach: Instruction-following AGI is easier and more likely than value aligned AGI
Yes, as long as there's a faithful chain of thought. But some RL techniques tend to create "jargon" and otherwise push toward a non-human-readable chain of thought. See Daniel Kokotajlo's work on faithful chain of thought, including his "shoggoth/face" training proposal. And there's 5 ways to improve CoT faithfulness, addressing the ways you can easily lose it.
There are also studies investigating training techniques that "make chain of thought more efficient" by driving the cognition into the hidden state so as to use fewer steps. A recent Meta study just did this. Here are some others:
Whirlwind Tour of Chain of Thought Literature Relevant to Automating Alignment Research.
I think you're absolutely right.
And I think there's approximately a 0% chance humanity will stop at pure language models, or even stop at o1 and o3, which very likely use RL to dramatically enhance capabilities.
Because they use RL not to accomplish things-in-the-world but to arrive at correct answers to questions they're posed, the concerns you express (and pretty much anyone who's been paying attention to AGI risks agrees with) are not fully in play.
OpenAI will continue on this path unless legislation stops them. And that's highly unlikely to happen, because the argument against is just not strong enough to convince the public or legislators.
We are mostly applying optimization pressure to our AGI systems to follow instructions and produce correct answers. Framed that way, it sounds like it's as safe an approach as you could come up with for network-based AGI. I'm not saying it's safe, but I am saying it's hard to be sure it's not without more detailed arguments and analysis. Which is what I'm trying to do in my work.
Also as you say, it would be far safer to not make these things into agents. But the ease of doing so with a smart enough model and a prompt like "continue pursuing goal X using tools Y as appropriate to gather information and take actions" ensures that they will be turned into agents.
People want a system that actually does their work, not one that just tells them what to do. So they're going to make agents out of smart LLMs. This won't be stopped even with legislation; people will do it illegally or from jurisdictions that haven't passed the laws.
So we are going to have to both hope and plan that this approach, including RL for correct answers, is safe enough. Or come up with way stronger and more convincing arguments for why it won't be. I currently think it can be made safe in a realistic way with no major policy or research direction change. But I just don't know, because I haven't gotten enough people to engage deeply enough with the real difficulties and likely approaches.
That's what would happen, and the fact that nobody wanted it to happen wouldn't help. It's a Tragedy of the Commons situation.
I very much agree. Do we really think we're going to track a human-level AGI let alone a superintelligence's every thought, and do it in ways it can't dodge if it decides to?
I strongly support mechinterp as a lie detector, and it would be nice to have more as long as we don't use that and control methods to replace actual alignment work and careful thinking. The amount of effort going into interp relative to the theory of impact seems a bit strange to me.
For politicians, yes - but the new administration looks to be strongly pro-tech (unless DJ Trump gets a bee in his bonnet and turns dramatically anti-Musk).
For the national security apparatus, the second seems more in line with how they get things done. And I expect them to twig to the dramatic implications much faster than the politicians do. In this case, there's not even anything illegal or difficult about just having some liaisons at OAI and an informal request to have them present in any important meetings.
At this point I'd be surprised to see meaningful legislation slowing AI/AGI progress in the US, because the "we're racing China" narrative is so compelling - particularly to the good old military-industrial complex, but also to people at large.
Slowing down might be handing the race to China, or at least a near-tie.
I am becoming more sure that would beat going full-speed without a solid alignment plan, despite my complete failure to interest anyone in the question of Who is Xi Jinping? in terms of how he or his successors would use AGI. I don't think he's sociopathic/sadistic enough to create worse x-risks or s-risks than rushing to AGI does. But I don't know.
Thank you. Oddly, I am less altruistic than many EA/LWers. They routinely blow me away.
I can only maintain even that much altruism because I think there's a very good chance that the future could be very, very good for a truly vast number of humans and conscious AGIs. I don't think it's that likely that we get a perpetual boot-on-face situation. I think only about 1% of humans are so sociopathic AND sadistic in combination that they wouldn't eventually let their tiny sliver of empathy cause them to use their nearly-unlimited power to make life good for people. They wouldn't risk giving up control, just share enough to be hailed as a benevolent hero instead of merely god-emperor for eternity.
I have done a little "metta" meditation to expand my circle of empathy. I think it makes me happy; I can "borrow joy". The side effect is weird decisions like letting my family suffer so that more strangers can flourish in a future I probably won't see.
I also expect government control; see If we solve alignment, do we die anyway? for musings about the risks thereof. But it is a possible partial solution to job loss. It's a lot tougher to pass a law saying "no one can make this promising new technology even though it will vastly increase economic productivity" than to just show up to one company and say "heeeey so we couldn't help but notice you guys are building something that will utterly shift the balance of power in the world.... can we just, you know, sit in and hear what you're doing with it and maybe kibbitz a bit?" Then nationalize it officially if and when that seems necessary.
It's not really dangerous real AGI yet. But it will be soon. This is a version that's like a human with severe brain damage to the frontal lobes, which provide agency and self-management, and to the temporal lobe, which does episodic memory and therefore continuous, self-directed learning.
Those things are relatively easy to add, since it's smart enough to self-manage as an agent and self-direct its learning. Episodic memory systems exist and only need modest improvements - some low-hanging fruit are glaringly obvious from a computational neuroscience perspective, so I expect them to be employed almost as soon as a competent team starts working on episodic memory.
Don't indulge in even possible copium. We need your help to align these things, fast. The possibility of dangerous AGI soon can no longer be ignored.
Gambling that the gaps in LLMs' abilities (relative to humans) won't be filled soon is a bad gamble.
Really. I don't emphasize this because I care more about humanity's survival than the next decades sucking really hard for me and everyone I love. But how do LW futurists not expect catastrophic job loss that destroys the global economy?
Fucking o3. This pace of improvement looks absolutely alarming. I would really hate to have my fast timelines turn out to be right.
The "alignment" technique, "deliberative alignment", is much better than constitutional AI. It's the same during training, but it also teaches the model the safety criteria, and teaches the model to reason about them at runtime, using a reward model that compares the output to their specific safety criteria. (This also suggests something else I've been expecting - the CoT training technique behind o1 doesn't need perfectly verifiable answers in coding and math, it can use a decent guess as to the better answer in what's probably the same procedure).
While safety is not alignment (SINA?), this technique has a lot of promise for actual alignment. By chance, I've been working on an update to my Internal independent review for language model agent alignment, and have been thinking about how this type of review could be trained instead of scripted into an agent as I'd originally predicted.
This is that technique. It does have some promise.
But I don't think OpenAI has really thought through the consequences of using their smarter-than-human models with scaffolding that makes them fully agentic and soon enough reflective and continuously learning.
The race for AGI speeds up, and so does the race to align it adequately by the time it arrives in a takeover-capable form.
I'll write a little more on their new alignment approach soon.
I have a back-burner project to write a treatment for a Hollywood blockbuster titled simply "Mary Shelley's Frankenstein: A Modern Prometheus".
In which Frankenstein's monster is an LLM agent fine-tuned with writings from several dead people.
Much of the plot works as a direct translation from the original. It's a little tricky to deal with the way this monster should not have a physical body, so trucking around the world chasing it wouldn't make sense.
Maybe they're looking for the servers it's hiding on, and traveling to talk to people in person about finding it...
Every time I think about rational discourse I think of this post. And I smile and chuckle a little.
I keep meaning to write a little followup titled something like:
An overlooked goddamn basic of rational discourse: Be Fucking Nice.
If you're fucking irritating, people are going to be irritated at the points you're making too, and they'll find reasons to disbelieve them. This is goddam motivated reasoning, and it's the bias fucking ruining our goddamn civilization. Don't let it ruin your rational fucking discourse.
Being fucking nice does not mean saying you agree when you goddamn don't. It means being pleasant to talk to. So suck it up and figure out how to be goddamn fucking nice.
Brief argument at Motivated reasoning/confirmation bias as the most important cognitive bias
I was wrong. See my retrospective review of this post: Have agentized LLMs changed the alignment landscape? I'm not sure.
I thought Chain of Thought would work better out of the box; it took specialized training like for o1 to really make it work well.
And I didn't guess that parsing HTML in linguistic form would prove so hard that people would almost need vision capabilities to make agents capable of using webpages, which reduced the financial incentives for working hard on agents.
I still expect agents to change the alignment landscape. Perhaps they already have, with people working lots on LLM alignment on the assumption that it will help align the agents built on top of them. I think it will, but I'd like to see more work on the alignment differences between agents and their base LLMs. That's what my work mostly addresses.
Huge if true! Faithful Chain of Thought may be a key factor in whether the promise of LLMs as ideal for alignment pays off, or not.
I am increasingly concerned that OpenAI isn't showing us o1's CoT because it's using lots of jargon that's heading toward a private language. I hope it's merely that they didn't want to show its unaligned "thoughts", and to prevent competitors from training on its useful chains of thought.
It's not a blanket ban.
Of course user competence isn't entirely separate, just mostly.
In a world with ideal forum participants, we wouldn't be having this conversation :)
Have agentized LLMs changed the alignment landscape? I'm not sure.
People are doing a bunch of work on LLM alignment, which is definitely useful for aligning an agent built on top of that LLM. But it's not the whole picture, and I don't see as many people as I'd like thinking about agent-specific alignment issues.
But I still expect agentized LLMs to change the alignment landscape. They still seem pretty likely to be the first transformative and dangerous AGIs.
Progress has been a bit slower than I expected. I think there are two main reasons:
Chain of thought doesn't work as well by default as I expected.
Human cognition relies heavily on chain of thought, also known as System 2 processing. But we don't put all of that into language frequently enough for the standard training set to capture our skills at reasoning step-by-step. That's why it took specialized training as in o1, R1, QwQ, the new Gemini 2.0 Flash reasoning experimental, etc to make improvements on CoT reasoning.
Agents couldn't read webpages very well without vision
This was unexpected. While the web is written in HTML, which LLMs should be capable of parsing rather well, it is reportedly not written in very clear HTML. Combined with low-hanging fruit from agents involving lots of internet use, this slowed progress as innovators spent time hand-engineering around frequent failures to parse. Anthropic's Claude with computer use, and DeepMind's Astra and Mariner all use vision so they can parse arbitrary webpages better.
There's more enthusiasm for making better models vs. better agents than I expected. It now looks like major orgs are turning their enthusiasm toward agents, so I expect progress to accelerate. And there's promising work in the few small orgs I know about working in stealth mode, so we might see some impressive reveals soon.
With those models in place and improvements surely in the pipeline, I expect progress on agents to proceed apace. This now appears to be the majority opinion among everyone building and funding LLM agents.
I have short timelines for "subhuman AGI", but relatively slow takeoff times to really scary superhuman stuff. Which I think is very good for our prospects of mastering alignment by that time.
In retrospect, the biggest advantage of LLM agents is that LLMs are basically trained to follow instructions as intended, and agentic architectures can enhance that tendency. That's a non-consequentialist alignment goal that bypasses many of the most severe alignment worries by providing corrigibility that's not in conflict with a consequentialist goal. See Instruction-following AGI is easier and more likely than value aligned AGI and Max Harms' Corrigibility as Singular Target sequence for more depth.
Yudkowsky and similar agent foundations network alignment pessimists have not, to my knowledge, addressed that class of alignment proposals in any depth. I'm looking forward to hearing their takes.
We believe politics is the mind killer. That is separate from a judgment about user competence. There is no contradiction. Even competent users have emotions and biases, and politics is a common hot button.
Reddit is a flaming mess compared to LW, so the mods here are doing something right - probably a lot.
LW has been avoiding all discussion of politics in the runup to the election. And it usually is suspicious of politics, noting that "politics is the mindkiller" that causes arguments and division of rationalist communities.
Thus it could be argued that it's an inappropriate place to announce your book, even though it's intended as a rationalist take on politics and not strictly partisan.
But it's a judgment call. Which is why your post is now positive again.
I'm uncertain; I would neither downvote nor upvote your post. I certainly wouldn't discuss the theory here, but announcing it seems fine.
I'm not sure where the requests RE political discussion are stated. The site FAQ is one place but I don't think that says much.
It sounds like you may also want a whole life organizational system. It also sounds like you're probably in the cluster of tendencies I'm in, commonly known as ADHD. I recommend Getting Things Done, GTD, as simple and effective for my attentional style.
There are bunches of YouTube videos on all three of those topics - Obsidian, notes organization, and GTD. Oh, and ADHD. It's not a disorder, just a different tendency, with different strengths and weaknesses. And about half of humanity is on that side of the spectrum.
edit: I took a quick look, and this looks really good! Big upvote. Definitely an impressive body of work. And the actual alignment proposal is along the lines of the ones I find most promising on the current trajectory toward AGI. I don't see a lot of references to existing alignment work, but I do see a lot of references to technical results, which is really useful and encouraging. Look at Daniel Kokotajlo's and other work emphasizing faithful chain of thought for similar suggestions.
edit continued: I find your framing a bit odd, in starting from an unaligned, uninterpretable AGI (but that's presumably under control). I wonder if you're thinking of something like o1 that basically does what it's told WRT answering questions/providing data but can't be considered overall aligned, and which isn't readily interpretable because we can't see its chain of thought? A brief post situating that proposal in relation to current or near-future systems would be interesting, at least to me.
Original:
Interesting. 100 pages is quite a time commitment. And you don't reference any existing work in your brief pitch here - that often signals that people haven't read the literature, so most of their work is redundant with existing stuff or missing big considerations that are part of the public discussion. But it seems unlikely that you'd put in 100 pages of writing without doing some serious reading as well.
Here's what I suggest: relate this to existing work, and reduce the reading-time ask, by commenting on related posts with a link to and summary of the relevant sections of your paper.
Hey, fun! (Maybe more than fun when it impacts reality, as when people talk about "taste" in meaningful research, esp. WRT alignment research...)
I believe the answer is all of the above. Taste has similarities to all of the things you mention at the top.
The proof won't fit in the margin. This is one of those things I'd love to write about, but won't find time for until after the singularity.
This is something I've thought about an awful lot, while studying the neuroscience of dopamine and decision-making. I think those mechanisms are central to taste.
Steve Byrnes' Valence sequence doesn't mention dopamine, but it captures my theory of how the dopamine system "spreads around" estimated value between associated concepts.
That's taste. It can be formed by all sorts of positive associations, including accurate estimates of creativity, craftsmanship, etc., and associations to respected people or ideas.
Based on this theory, I have limited patience for people who think or claim that their taste is objectively better. It is often better on the dimensions they mention, but worse on some other dimensions they haven't thought about.
In the limit, yes, permanent alignment is impossible.
Anything permanent is probably technically impossible for the same reason.
Anything that can happen will eventually happen. That includes a godlike intelligence just screwing up and accidentally getting its goals changed even though it really wants not to change its goals.
But that could take much, much longer than the heat death of this universe as we currently understand it.
It's not a concern with alignment, it's a concern possibly in some billions of years.
I tag the important ones with #important1 through 3
And they're tagged and linked with other semantics. You can see a visual representation of links or search by any combo.
By all means, strategically violate social customs. But if you irritate people by doing it, you may be advancing your own epistemics by making them talk to you, but you're actually hurting their epistemics by making them irritated with whatever belief you're trying to pitch. Lack of social grace is very much not an epistemic virtue.
This post captures a fairly common belief in the rationalist community. It's important to understand why it's wrong.
Emotions play a strong role in human reasoning. I finally wrote up at least a little sketch of why that happens. The technical term is motivated reasoning.
Motivated reasoning/confirmation bias as the most important cognitive bias
I second Obsidian. It's free, lightning fast, and local (you own your data, so you can't be held hostage to a subscription fee), and the Markdown format is simple and common enough that there will always be some way to use the system you've set up.
There are more in-depth theories about how to actually organize your notes, but Obsidian can do it in a variety of ways, almost however you want.
This makes sense for the most part.
It's important to distinguish between chemistry and fit as a partner.
If someone says they want, for instance, a partner with a growth mindset, they can be totally right that that's the best partner for them, while totally wrong that they'll find such a person immediately attractive. They might be wildly attracted to some set of traits correlated with a fixed mindset.
They should still probably date the person with the growth mindset. Or at least choose consciously. The people they're attracted to might make them miserable in the longer term by being bad fits on important qualities like mindset.
Good chemistry can be bad for long term happiness; people wind up attracted to and in love with someone that's bad for them.
I consider at least a modest change of heart to be the default.
And I think it's really hard to say how fast alignment is progressing relative to capabilities. If by "alignment" you mean formal proofs of safety then definitely we're not on track. But there's a real chance that we don't need those. We are training networks to follow instructions, and it's possible that weak type of tool "alignment" can be leveraged into true agent alignment for instruction-following or corrigibility. If so, we have solved AGI alignment. That would give us superhuman help solving ASI alignment, and the "societal alignment" problem of surviving intent-aligned AGIs with different masters.
This seems like the default for how we'll try to align AGI. We don't know if it will work.
When I get MIRI-style thinkers to fully engage with this set of ideas, they tend to say "hm, maybe". But I haven't gotten enough engagement to have any confidence. Prosaic-alignment/LLM thinkers usually aren't engaging with the hard problems of alignment that crop up when we hit fully autonomous AGI entities, like strong optimization's effects on goal misgeneralization, reflection, and learning-based alignment shifts. And almost nobody is thinking that far ahead in societal coordination dynamics.
So I'd really like to see agent foundations and prosaic alignment thinking converge on the types of LLM-based AGI agents we seem likely to get in the near future. We just really don't know if we can align them or not, because we just really haven't thought about it deeply yet.
Links to all of those ideas in depth can be found in a couple link hops from my recent, brief Intent alignment as a stepping-stone to value alignment.
It's not true if alignment is easy, too, right? My timelines are short, but we do still have a little time to do alignment work. And the orgs are going to do a little alignment work. I wonder if there's an assumption here that OpenAI and co don't even believe that alignment is a concern? I don't think that's true, although I do think they probably dramatically underrate x-risk dangers based on incentive-driven biases, but they do seem to appreciate the basic arguments.
And I expect them to get a whole lot more serious about it once they're staring a capable agent in the face. It's one thing to dismiss the dangers of tigers from a distance, another when there's just a fence between you and it. I think proximity is going to sharpen everyone's thinking a good bit by inspiring them to spend more time thinking about the dangers.
I think we're talking past each other, so we'd better park it and come back to the topic later and more carefully.
I do feel like you're misrepresenting my position, so I am going to respond and then quit there. You're welcome to respond; I'll try to resist carrying on, and move on to more productive things. I apologize for my somewhat argumentative tone. These are things I feel strongly about, since I think MIRI's communication might matter quite a lot, but that's not a good reason to get argumentative.
- Strawmanning: I'm afraid you're right that I'm probably exaggerating MIRI's claims. I don't think it's quite a strawman; "if we build it we all die" is very much the tone I get from MIRI comms on LW and X (mostly EY), but I do note that I haven't seen him use 99.9%+ in some time, so maybe he's already doing some of what I suggest. And I haven't surveyed all of MIRI's official comms. But what we're discussing is a change in comms strategy.
I have gotten more strident in repeated attempts to make my central point clearer. That's my fault; you weren't addressing my actual concern so I kept trying to highlight it. I still am not sure if you're understanding my main point, but that's fine; I can try to say it better in future iterations.
This is the first place I can see you suggesting that I'm exaggerating MIRI's tone, so if it's your central concern that's weird. But again, it's a valid complaint; I won't make that characterization in more public places, lest it hurt MIRI's credibility.
- MIRI claiming to accurately represent scientific consensus was never my suggestion; I don't know where you got that. I clarified that I expect zero additional effort or strong claims, just "different experts believe a lot of different things".
Honesty: I tried to specify from the first that I'm not suggesting dishonesty by any normal standard. Accurately reporting a (vague) range of others' opinions is just as honest as reporting your own opinion. Not saying the least convincing part the loudest might be dishonesty by radical honesty standards, but I thought rationalists had more or less agreed that those aren't a reasonable target. That standard of honesty would kind of conflict with having a "comms strategy" at all.
Really they do those things? The concrete?
I think it's on a spectrum of likelihood and therefore believability.
I wasn't commenting on your message, just what you'd said in that comment. Sure it's better to say it than not. And better yet to do more.
MIRI leadership is famously very wrong about how sure they can reasonably be. That's my concern. It's obvious to any rationalist that it's not rational to believe >99% in something that's highly theoretical. It's almost certainly epistemic hubris if not outright foolishness.
I have immense respect for EY's intellect. He seems to be the smartest human I've engaged with enough to judge their intellect. On this point he is either obviously wrong or at least seemingly so. I have personally spent at least a hundred hours following his specific logic (and lots more on the background knowledge it's based on), and I'm personally quite sure he's overestimating his certainty. His discussions with other experts always end up falling back on differing intuitions. He got there first, but a bunch of us have now put real time into following and extending his logic.
I have a whole theory on how he wound up so wrong, involving massive frustration and underappreciating how biased people are to short-term thinking and motivated reasoning, but that's beside the point.
Whether he's right doesn't really matter; what matters is that >99.9% doom sounds crazy, and it's really complex to argue it even could be right, let alone that it actually is.
Since it sounds crazy, leaning on that point is the very best way to harm MIRI's credibility. And because they are one of the most publicly visible advocates of AGI x-risk caution (and planning to become even higher profile, it seems), it may make the whole thing sound less credible - maybe by a lot.
Please, please don't do it or encourage others to do it.
I'm actually starting to worry that MIRI could make us worse off if they insist on shouting loudly and leaning on our least credible point. Public discourse isn't rational, so focusing on the worst point could make the vibes-based public discussion go against what is otherwise a very simple and sane viewpoint: don't make a smarter species unless you're pretty sure it won't turn on you.
Hopefully I needn't worry, because MIRI has engaged communication experts, and they will resist just adopting EY's unreasonable doom estimate and bad comms strategy.
To your specific point: "we're really not sure" is not a bad strategy if "we" means humanity as a whole (if by bad you mean dishonest).
If by bad you mean ineffective: do you seriously think people wouldn't object to the push for AGI if they thought we were totally unsure?
"One guy who's thought about this for a long time and some other people he recruited think it's definitely going to fail" really seems like a way worse argument than "expert opinion is utterly split, so any fool can see we collectively are completely unsure it's safe".
I thought the point of this post was that MIRI is still developing its comms strategy, and one criterion is preserving credibility. I really hope they'll do that. It's not violating rationalist principles to talk about beliefs beyond your own.
You're half right about what I think. I want to live, so I want MIRI to do a good job of comms. Lots of people are shouting their own opinion. I assumed MIRI wanted to be effective, not just shout along with the chorus.
MIRI wouldn't have to do a bit of extra work to do what I'm suggesting. They'd just have to note their existing knowledge of the (lack of) expert consensus, instead of just giving their own opinion.
You haven't really disagreed that that would be more effective.
To put it this way: people (largely correctly) believe that MIRI's beliefs are a product of one guy, EY. Citing more than one guy's opinions is way more credible, no matter how expert that guy - and it avoids arguing about who's more expert.
The things you mention are all important too, but I think we have better guesses on all of those.
Xi is widely considered to be highly intelligent. We also have reason to believe he understands why AGI could be a real x-risk (I don't remember the link for "is Xi Jinping a doomer?" or similar).
That's enough to guess that he understands (or will soon enough).
I'd be shocked if he just didn't care about the future of humanity. Getting to control that would tempt most people, let alone those who seek power. I'd be shocked if he (or anyone) delegated decisions on AGI if they remotely understood their possible consequences (although you'd certainly delegate people to help think about them. That could be important if he was stupid or malleable, which Xi is not - unless he becomes senile or paranoiac, which he might).
The Wikileaks information parallels the informed speculation I've found on his character. None of that really helps much to establish whether he's sociopathic, sadistic, or risk-taking enough to doom us all.
(I tend to think that 99% of humanity is probably sane and empathetic enough to get good results from an intent-aligned AGI (since it can help them think about the issue), but it's hard to know since nobody has ever been in that position, ever.)
Fun, but not a very likely scenario.
LLMs have absorbed tons of writing that would lead to both good and bad AGI scenarios, depending on which they happen to take most seriously. Nietzsche: probably bad outcomes. Humanistic philosophy: probably good outcomes. Negative utilitarianism: bad outcomes (unless they're somehow right and nobody being alive is the best outcome). Etc.
If we have LLM-based "Real AGI" that thinks and remembers its conclusions, the question of which philosophy it takes most seriously is important. But it won't just be a crapshoot; we'll at least try to align it with internal prompts to follow instructions or a constitution, and possibly with synthetic datasets that omit the really bad philosophies, etc.
LLMs aren't going to "wake up" and become active agents - because we'll make them into active agents first. And we'll at least make a little try at aligning them, based on whatever theories are prominent at that point, and whatever's easy enough to do while racing for AGI.
If those likely stabs at alignment were better understood (either in how they'd work, or how they wouldn't), we could avoid rolling the dice on some poorly-thought-out and random stab at utopia vs. doom.
That's why I'm trying to think about that type of easy and obvious alignment method. They don't seem as obviously doomed as you'd hope (if you hoped nobody would try them) nor so easy that we should try them without more analysis.
Um, wasn't it the other way round in Spaceballs?
The people actually building AGI very publicly disagree that we are not on track to solve alignment before building AGI. So do many genuine experts. For instance, I strongly disagree with Pope and Belrose's "AI is easy to control" but it's sitting right there in public, and it's hard to claim they're not actually experts.
And I just don't see why you'd want to fight that battle.
I'd say it's probably pointless to use the higher probability; an estimated 50% chance of everyone dying on the current trajectory seems like plenty to alarm people. That's vaguely what we'd get if we said "some experts think 99%, others think 1%, so we collectively just don't know".
Stating MIRI's collective opinion instead of a reasonable statement of the consensus is unnecessary and costs you credibility.
To put it another way: someone who uses their own estimate instead of stating the range of credible estimates is less trustworthy on average to speak for a broad population. They're demonstrating a blinkered, insular style of thinking. The public wants a broad view guiding public policy.
And in this case I just don't see why you'd take that credibility hit.
Edit: having thought about it a little more, I do actually think that some people would accept a 50% chance of survival and say "roll the dice!". That's largely based on the wildly exaggerated fears of civilizational collapse from global warming. And I think that, if they expressed those beliefs clearly, the majority of humanity would still say "wait what that's insane, we have to make progress on alignment before launching AGI".
Who is Xi Jinping as a person and moral actor?
If we knew he was not a sociopath, sadist, or reckless ideologue, I think we could much more safely and effectively push for a slowdown in Western AGI development.
Analysts have described him as "rational, ruthless, and resilient" and "pragmatic" and as Dominant, Conscientious, Ambitious, Accommodating/cooperative and Dauntless/adventurous (risk-taking), in that order (according to a personality index I'm not familiar with).
But these analyses do not directly address his moral character (or those of a successor he'd appoint). Does anyone have better sources or guesses? I haven't devoted enough time researching this to venture an answer; I just want to briefly flag it as an important question.
Xi rules China so thoroughly that he would personally make key decisions regarding AGI. What would he do with an AGI aligned to his intent? It matters, because we probably can't slow China's progress toward AGI.
The routes I can see would only slow US progress toward AGI. China would then be either in the running or in a clear lead for transformative AGI. In that scenario (or even in a close race without a pause) the character of Xi Jinping or his successors becomes critical.
We know his history, but he guards his opinions closely. I haven't found time to do the deep dive. We know he's done some pretty harsh things, but he defends those as breaking eggs to make omelettes (as most politicians have). Is he the sort of person who would use vast power to make the world miserable (or dead in an AGI race or nuclear showdown)?
We have reason to think he takes AGI risk seriously, but we don't know what he'd do with it, or the opportunity to get it first.
So, does anyone know of good work addressing his character and personal beliefs? Or is this an interesting research topic for anyone?
Xi's character seems important under a variety of scenarios. Here's some additional logic for why I think this might be particularly critical in my loosely projected likeliest path to survival.
I currently see convincing the entire world not to pursue AGI as near-impossible (this is a separate question, obviously). Slowing Western democratic progress toward AGI through government action seems merely difficult and unlikely. But it would hand the opportunity for first to AGI to Xi and China. What would they do with that opportunity?
If Xi is a reasonably sane person, I think we might get a very good outcome from China being first or a near second in that race. The Chinese government seems (naively) much more inclined to caution than either the US government or entrepreneurs, and might be expected to do a better job with alignment. If we achieve aligned AGI, the pie gets very large and most competitive motivations can be mitigated by sharing that wealth - if the actors are sane enough to avoid paranoiacally competing.
Based on my current guesses, I'd rather see the US and China tied for the lead than a more open race. If enough people have access to AGI, someone will misuse it, and they might have a dramatic first-mover advantage that would encourage aggression. See If we solve alignment, do we die anyway? and the resulting discussion. I think the US and China might be first by a good margin. They could collaborate to limit AGI proliferation, while sharing the fruits of technologies developed by intent-aligned AGI.
That's if both countries are under basically sane and sensible leadership. Both are in question, because of the volatility of US politics, and our lack of insight into the character of Xi Jinping.
People change their minds a lot.
Absolutely.
My caps lock has long been Control. I use ErgoEmacs for easy use of editing commands like delete/select word forward, delete/select word back, up line, up par, etc.
These are laid out systematically in ErgoEmacs, so are easy to learn and remember.
Only tangentially related:
for more fun and speed, learn to use the trackpad with your thumb(s) while your fingers stay in home position.
Use a wrist rest so your arms aren't tensed holding your hands in place.
All of these are independent. For the love of Pete remap your caps lock key to a modifier if you haven't. MacOS encourages this by putting it in system settings.
Beautiful. Thank you for applying your considerable skills to this task.
A few thoughts on directions:
Respectability may be overrated, but credibility is not, and we really don’t want to blow it.
I very much agree with the strategy of moving slowly to conserve credibility.
The policy of stating what you mean is also very good for credibility. It is foreign to most public discourse, in which speaking to persuade rather than inform is the norm. I'd hope for a statement something like:
"we come from an intellectual tradition in which speaking to inform rather than persuade is the norm. Exaggerating one's claims might help in the short term, but it will cost credibility and confuse everyone in the longer term. So we try to say just what we believe, and try to point to our reasons for believing it so that everyone can judge for themselves."
I'm not sure how useful that would be, but my perception is that many people have some affinity for truth and rationality, so making that foundational claim and then trying to follow it may be appealing and be taken seriously by enough people to matter.
Working within that aspirational goal, I think it's important to practice epistemic modesty.
From what I've seen of the public discourse, overstating one's certainty of the dangers of AGI and the difficulty of alignment is a substantial drain on the credibility of the movement.
It is both more reasonable and more effective to say "alignment might be very difficult, so we should have a very sound safety case before actually building agentic systems smarter than humans" rather than exaggerating our collective knowledge by saying "if we build it we all die". Saying that and exaggerating our certainty opens up a line of distracting counterattack in which the "doomers" are mocked for either foolishly overestimating their knowledge, or overstating their case as a deliberate deception.
It should be quite adequate and vastly more credible to say "if we build it without knowing, humanity may well not survive for long. And the people building it will not stop in time to know if it's safe, if we do not demand they proceed safely. Move fast and break things is an exciting motto, but it isn't a reasonable way to approach the next stage of evolution."
And that does not exceed our true collective epistemic uncertainty. I have my own informed and developed opinion about the likely difficulty of alignment. But looking at the range of expert opinions, I feel it is most fair and modest to say we do not yet know.
Proceeding toward potential doom without knowing the risks is a catastrophic error born from excitement (from the optimistic) and apathy (from the cautious).
I did remember A case for AI alignment being difficult and liked it, so I did that one too. My review for the Alexander/Yudkowsky dialogue got a little out of hand, but it did cover it.
This is a strong candidate for best of the year. Clarifying the arguments for why alignment is hard seems like one of the two most important things we could be working on. If we could make a very clear argument for alignment being hard, we might actually have a shot at getting a slowdown or meaningful regulations. This post goes a long way toward putting those arguments in plain language. It stands alongside Zvi's On A List of Lethalities, Yudkowsky's original AGI Ruin: A List of Lethalities, Ruthenis' A Case for the Least Forgiving Take On Alignment, and similar. This would be, for many novice readers, the best of those summaries. It puts everything into plain language, while addressing the biggest problems.
The section on paths forward seems much less useful; that's fine, that wasn't the intended focus.
All of these except the original List lean far too heavily on the difficulty of understanding and defining human values. I think this is the biggest crux of disagreement on alignment difficulty. Optimists don't think that's part of the alignment problem, and that's a large part of why they're optimistic.
People who are informed and thoughtful but more optimistic, like Paul Christiano, typically do not envision giving AI values aligned with humans, but rather something like Corrigibility as Singular Target or Instruction following. This alignment target seems to vastly simplify that portion of the challenge; it's the difference between making a machine that both wants to figure out what people want and then does that perfectly reliably, and making a machine that just wants to do what this one human meant by what they said to do. This is not only much simpler to define and to learn, but it means we can correct mistakes instead of having to get everything right on the first shot.
That's my two cents for where work should follow up on this set of alignment difficulties. I'd also like to see people continuing to refine and clarify the arguments for alignment difficulty, particularly in regard to the specific types of AGI we're working on, and I intend to spend part of my own time doing that as well.
I don't think this really qualifies for year's best. It's interesting if you think - or have to explain to someone who thinks - "just raise an RL mechanism in a human environment and it would come out aligned, right?" I'm surprised anyone thinks that, but here's a pretty good writeup of why you shouldn't.
The biggest portion is about why we shouldn't expect an AGI to become aligned by exposing an RL system to a human-like environment. A child, Alexander says, might be punished for stealing a cookie, and it could internalize the rule "don't get caught stealing" or the rule "don't steal". If the latter, they're aligned.
Yudkowsky says that's not how humans get aligned; we have social instincts designed in by evolution; it's not just plain RL learning. When asked, he says "Actual answer: Because the entire field of experimental psychology that's why."
Having worked in/adjacent to experimental psychology for 20 years or so, I think this is absolutely correct. It's very clear at this point that we have complex instincts that guide and shape our general learning. I would not expect an RL learner to come away with anything like a human value system if it was exposed to human society. I think this should be taken as a given. We are not blank slates, and there is voluminous direct and indirect evidence to that effect.
EY: "So the unfortunate answer to "How do you get humans again?" is "Rerun something a lot like Earth" which I think we both have moral objections about as something to do to sentients."
Also right, but the relevant question is how you get something close enough to humans to roughly share our ethical systems. I'm not even sure that's adequate; humans look a lot more aligned when they don't possess immense power. But maybe we could get something close enough and a bit better by leaving out some of the dangerous instincts like anger. This is Steve Byrnes' research agenda. I don't think we have long enough, since my timeline estimates are mostly short. But it could work.
There's some interesting stuff about the outlines of a mechanism for social instincts. I also endorse EYs summary of the neuroscience as almost certainly correct.
At 1600, EY refers to (I think) human values as something like trapped priors - I think this is right. An AGI would not be likely to get the same priors trapped in the same way, even if it did have similar instincts for prosocial behavior. I'm not sure a human would retain similar values if we lived even 300 years in varied environments and just kept thinking hard about our values.
The remainder of the post is on acausal trades as a route to human survival (e.g., an AGI saving us in case it meets another AGI that was aligned and will retaliate against a murderous AGI on principle). This is not something I've wrapped my head around.
EY says: "Frankly, I mostly consider this to be a "leave it to MIRI, kids" question"
And I'm happy to do that if he says so. I also don't care to bet the future on a theory that only an insular group even thinks they understand, so I'm hoping we get way more legible and promising alignment plans (and I think we do, which is mostly what I write about).
Shape can most certainly be emulated by a digital computer. The theory in the paper you linked would make a brain simulation easier, not harder, and the authors would agree with that (while saying their theory is miles off from a proposal to emulate the brain in depth).
And the paper very likely is on to something, but not quite what they're talking about. fMRI analyses are notoriously noisy and speculative. Nobody talking about brain emulation talks about fMRI; it's just too broad-scale to be helpful.
How can we solve that coordination problem? I have yet to hear a workable idea.
We agree that far, then! I just don't think that's a workable strategy (you also didn't state that big assumption in your post - that AGI is still dangerous as hell, we just have a route to really useful AI that isn't).
The problem is that we don't know whether agents based on LLMs are alignable. We don't have enough people working on the conjunction of LLMs/deep nets and real AGI. So everyone building it is going to optimistically assume it's alignable. The Yudkowsky et al. arguments for alignment being very difficult are highly incomplete; they aren't convincing because they shouldn't be. But they make good points.
If we refuse to think about aligning AGI LLM architectures because it sounds risky, it seems pretty certain that people will try it without our help. Even convincing them not to would require grappling in depth with why alignment would or wouldn't work for that type of AGI.
Neural field theory is different than the neuron doctrine. It accepts the neuron doctrine.
That abstract does not seem to be questioning the neuron doctrine but a particular way of thinking about neuronal populations. It is not proposing that we need to think about something other than neuronal axons and dendrites passing information, but rather about how to think about population dynamics.
So this is the opposite of proposing that a more detailed model of brain function is necessary; it proposes a coarser-grained approximation.
And they're not addressing what it would take to perfectly understand or reproduce brain dynamics, just a way to approximately understand them.
I think you're conflating creating a similar vs identical conscious experience with a simulated brain. Close is close enough for me - I'd take an upload run at far less resolution than molecular scale.
I spent 23 years studying computational neuroscience. You don't need to model every molecule, or even close, to get a similar computational and therefore conscious experience. The information content of neurons (collectively, and inferred where data isn't complete) is a very good match to reported aspects of conscious experience.
I assume you're in agreement that the reason for this is as nicely stated by cata in this thread: LW contributors are assumed to be a lot more insightful than an LLM, so we don't want to guess whether the ideas came from an LLM. It's probably worth writing a brief statement on this, unless you've added it to the FAQ since I last read it.