Posts

Instruction-following AGI is easier and more likely than value aligned AGI 2024-05-15T19:38:03.185Z
Goals selected from learned knowledge: an alternative to RL alignment 2024-01-15T21:52:06.170Z
After Alignment — Dialogue between RogerDearnaley and Seth Herd 2023-12-02T06:03:17.456Z
Corrigibility or DWIM is an attractive primary goal for AGI 2023-11-25T19:37:39.698Z
Sapience, understanding, and "AGI" 2023-11-24T15:13:04.391Z
Altman returns as OpenAI CEO with new board 2023-11-22T16:04:03.123Z
OpenAI Staff (including Sutskever) Threaten to Quit Unless Board Resigns 2023-11-20T14:20:33.539Z
We have promising alignment plans with low taxes 2023-11-10T18:51:38.604Z
Seth Herd's Shortform 2023-11-10T06:52:28.778Z
Shane Legg interview on alignment 2023-10-28T19:28:52.223Z
The (partial) fallacy of dumb superintelligence 2023-10-18T21:25:16.893Z
Steering subsystems: capabilities, agency, and alignment 2023-09-29T13:45:00.739Z
AGI isn't just a technology 2023-09-01T14:35:57.062Z
Internal independent review for language model agent alignment 2023-07-07T06:54:11.552Z
Simpler explanations of AGI risk 2023-05-14T01:29:29.289Z
A simple presentation of AI risk arguments 2023-04-26T02:19:19.164Z
Capabilities and alignment of LLM cognitive architectures 2023-04-18T16:29:29.792Z
Agentized LLMs will change the alignment landscape 2023-04-09T02:29:07.797Z
AI scares and changing public beliefs 2023-04-06T18:51:12.831Z
The alignment stability problem 2023-03-26T02:10:13.044Z
Human preferences as RL critic values - implications for alignment 2023-03-14T22:10:32.823Z
Clippy, the friendly paperclipper 2023-03-02T00:02:55.749Z
Are you stably aligned? 2023-02-24T22:08:23.098Z

Comments

Comment by Seth Herd on Instruction-following AGI is easier and more likely than value aligned AGI · 2024-05-16T19:11:37.463Z · LW · GW

That is fascinating. I hadn't seen his "task AGI" plan, and I agree it's highly overlapping with this proposal - more so than any other work I was aware of. What's most fascinating is that YK doesn't currently endorse that plan, even though it looks to me as though one main reason he calls it "insanely difficult" has been greatly mitigated by the success of LLMs in understanding human semantics and therefore preferences. We are already well up his Do-What-I-Mean hierarchy, arguably at an adequate level for safety/success even before the inevitable improvements on the way to AGI. In addition, the slow takeoff path we're on seems to make the project easier (although less likely to allow a pivotal act before we have many AGIs causing coordination problems).

So, why does YK think we should Shut It Down instead of building DWIM AGI? I've been trying to figure this out. I think he has two principal reasons: reinforcement learning sounds like a good way to get any central goal somewhat wrong, and being somewhat wrong could well be too much for survival. As I mentioned in the article, I think we have good alternatives to RL alignment, particularly for the AGI we're most likely to build first, and I don't think YK has ever considered proposals of that type. Second, he thinks that humans are stunningly foolish, and that competitive race dynamics will make them even more prone to critical errors, even for a project that's in-principle quite accomplishable. On this, I'm afraid I agree. So if I were in charge, I would indeed Shut It Down instead of shooting for DWIM alignment. But I'm not, and neither is YK. He thinks it's worth trying, to at least slow down AGI progress; I think it's more critical to use the time we've got to refine the alignment approaches that are most likely to actually be deployed.

Comment by Seth Herd on Instruction-following AGI is easier and more likely than value aligned AGI · 2024-05-16T01:51:52.655Z · LW · GW

I very much agree. Part of why I wrote that post was that this is a common assumption, yet much of the discourse ignores it and addresses value alignment. Which would be better if we could get it, but it seems wildly unrealistic to expect us to try.

The pragmatics of creating AGI for profit are a powerful reason to aim for instruction-following instead of value alignment; to the extent it will actually be safer and work better, that's just one more reason that we should be thinking about that type of alignment. Not talking about it won't keep it from taking that path.

Comment by Seth Herd on Why I'll Keep My Crummy Drawings—How Generative AI Art Won't Supplant... Art. · 2024-05-16T00:49:55.043Z · LW · GW

This is an excellent point about the value of human art: it creates a perceived connection between audience and artist.

This makes me wonder about the future of human-directed AI art. Would I like your drawings less if you had conceived them in detail, but not directed the brushstrokes with your own hands and brain?

I think I personally would appreciate them almost as much. The skill required to actually create them is impressive in one way, but it's not the aspect of creativity that I think about and value. Conveying ideas and mood through art is the part I value. So if you'd prompted an AI to create those same images, but in detail, I'd feel that same connection to you as the conceptual creator of the pieces.

This is making me hope that we see more detailed accounts of the creative process attached to AI art. If someone merely says "make me a cool picture", they have very little creative involvement, and so I feel no attachment to them through the art. If they have a detailed prompt describing a piece they've imagined, then I do feel that connection to them as a creator, and more so the more detail, meaning, and creativity they conceived the work with. But it will take a detailed account of the creative process to know what happened; in many cases, a vague prompt could produce something the audience will resonate with.

This detailed account of the creative process is something I've always wanted more of in connection to visual art. On the rare occasions that I've heard artists talk in detail about the concepts and ideas behind their work, I've valued and enjoyed that work more deeply. I think this is rarely done because a) describing concepts isn't the artist's strong suit, so they avoid it and b) they want to let the audience see their own meaning in the piece. Both are reasonable stances. The first requires artists to learn a new skill: understanding and expressing their own conceptual creative process. The second can be addressed by making it optional for the audience to read or hear about the artist's conception of the piece.

But if the advent of AI art leads to more explicit descriptions of the creative process, I for one would greatly appreciate that trend.

And I look forward to seeing more thoroughly human art, like yours, that exists alongside AI art, and for which the creative process can remain mysterious.

Comment by Seth Herd on Ilya Sutskever and Jan Leike resign from OpenAI · 2024-05-15T22:16:05.763Z · LW · GW

There's also this podcast from just yesterday. It's really good. Sam continues to say all the right things; in fact, I think this is the most reassuring he's ever been on taking the societal risks seriously, if not necessarily the existential risks. Which leaves me baffled. He's certainly a skilled enough social engineer to lie convincingly, but he sounds so dang sincere. I'm weakly concluding for the moment that he just doesn't think the alignment problem is that hard. I think that's wrong, but the matter is curiously murky, so it's not necessarily an irrational opinion to hold. Getting more meaningful discussion between optimistic and pessimistic alignment experts would help close that gap.

Comment by Seth Herd on The Alignment Problem No One Is Talking About · 2024-05-15T21:51:50.306Z · LW · GW

The Moloch series is great; once again, nice work on the introductory materials. I'll send people there before the lengthy Scott Alexander post.

I just published a post related to your societal alignment problem. It's on instruction-following AGI, and how likely it is that even AGI will remain under human control. That really places an emphasis on the societal alignment problem. It's also about why alignment thinkers haven't thought about this as much as they should.

https://www.lesswrong.com/posts/7NvKrqoQgJkZJmcuD/instruction-following-agi-is-easier-and-more-likely-than

Comment by Seth Herd on The Natural Selection of Bad Vibes (Part 1) · 2024-05-15T21:37:52.472Z · LW · GW

This all makes sense if the purpose of life is to solve problems. It's not. Being rational means maximizing your own goals. Usually people care more about some sort of happiness than about solving the maximum number of problems. Spending most of your time thinking about problems you probably can't solve anyway tends to make you unhappy. So it's irrational by the goals of humans, even though it's roughly rational by the goals of evolution.

Comment by Seth Herd on Deep Honesty · 2024-05-14T16:30:34.406Z · LW · GW

Deep honesty does require tradeoffs. It's a costly signal. Society doesn't need to restructure. As the post says, you can use it sometimes and not others according to your judgment of the tradeoffs for that situation. I have been doing this for my entire adult life, with apparently pretty good but not great results. Sometimes it backfires, often it works as intended.

Comment by Seth Herd on How much AI inference can we do? · 2024-05-14T16:17:18.598Z · LW · GW

Algorithmic improvements are, on average, roughly similar in pace to hardware improvements. In the area of deep nets I believe they're on average larger, although I haven't looked deeply enough to say this with confidence or have a ref handy. So how much you can do is a function of how far in the future you're talking about, on two fronts. The opportunities for algorithmic improvements go far beyond the parallelization and mixture-of-experts methods you mention.

Comment by Seth Herd on What you really mean when you claim to support “UBI for job automation”: Part 1 · 2024-05-13T18:01:15.445Z · LW · GW

Thanks for doing this! It's looking like we may need major economic changes to keep up with job automation (assuming we don't get an outright AGI takeover). So, getting started on thinking this stuff through may have immense benefit. Like the alignment problem, it's embarrassing as a species that we haven't thought about this more when the train appears to be barreling down the tracks. So, kudos and keep it up!

Now, the critique: doing this analysis for only the richest country in the world seems obviously inadequate and not even a good starting point; something like the median country would be more useful. OTOH, I see why you're doing this; I'm a US citizen and numbers are easier to get here.

So in sum, I think the bigger issue is the second one you mention: global tax reform that can actually capture the profits made from various AI companies and the much larger base of AI-enabled companies that don't pay nearly as much for AI as they would for labor, but reap massive profits. They will often be "based" in whatever country gives them the lowest tax rates. So we have another thorny global coordination problem.

I was also going to mention the lack of accounting for coming tech changes, which this series apparently will address. So I recommend you note in the intro that this is part 1, to head off that frustration among readers.

Comment by Seth Herd on Experiments as a Third Alternative · 2024-05-12T16:59:51.442Z · LW · GW

This is evidence of nothing but your (rather odd) lack of noticing. If anything, it might be easier to not notice stimulant meds if you benefit from them, but I'm not sure about that either.

Because they're relatively short duration, some people take Ritalin to get focused work done (when it's not interesting enough to generate hyper focus), and not at other times.

Comment by Seth Herd on [deleted post] 2024-05-11T20:01:55.045Z

This wouldn't fly on wikipedia and it probably shouldn't fly on the LW wiki either. Of course, moderating a contentious wiki is a ton of work, and if the LW wiki sees more use, you'll probably need a bigger mod team.

It's a dilemma, because using the wiki more as a collaborative summary of alignment work could be a real benefit to the field.

Comment by Seth Herd on Selfmaker662's Shortform · 2024-05-11T13:11:03.153Z · LW · GW

You need to have bunches of people use it for it to be any good, no matter how good the algorithm.

Comment by Seth Herd on Dyslucksia · 2024-05-11T05:08:08.117Z · LW · GW

Quick summary: it's super easy and useful to learn a little speedreading. Just move your finger a bit faster than your eyes are comfortable moving and force yourself to keep up as best you can. Just a little of this can go a long way when combined with a skimming-for-important-bits mindset with nonfiction and academic articles.

Explicit answers:

  1. With regard to brain function: it's vague; this just matches my understanding of how the brain works.
  2. I don't remember. I think it was just a matter of forcing yourself to go faster than you could subvocalize. And to try to notice when you were subvocalizing or not. The core technique in learning speed reading was to move your finger along the lines, and keep going slightly faster. I learned this from the very old book How to Read a Book.
  3. I'm pretty sure it both a) literally hasn't and, more importantly, b) effectively has increased my learning rate for semantic knowledge. Fundamentally it doesn't: it doesn't allow you to think faster (or at least not much), so if you're reading complex stuff quickly, you're just not going to understand or remember it. BUT it allows you to quickly skim to find the semantic knowledge you find worth learning and remembering. So your effective learning rate is higher. Learning to skim is absolutely crucial for academia, and speedreading is very useful for skimming quickly. You sort of get a vague idea of what you're reading, and then spend time on the stuff that might be important.
  4. That covered some of the downsides. It's what you were guessing: you can't really take things in faster, so it's a quantity/quality tradeoff. Here's another manifestation: I rarely bother to speedread fiction, because I can't imagine the setting and action if I do. Come to think of it, maybe I could do a bit better if I practiced it more. But I usually just skip ahead or, better yet, put the book down if I'm tempted to skim. There are lots of great books to read for pleasure, and if it's not fun word by word and worth imagining, I don't really see the point. But a friend of mine speedreads tons of fiction, so there is a point; he says he also can't imagine it in detail, but I guess he's enjoying taking in the story in broader strokes.
    1. I have no idea what my WPM was or is. It's abundantly clear that I learned to read far faster.
  5. Probably like level 20? Depends if it's a nonlinear curve.
  6. Here's the interesting bit: it was very, very easy to learn some useful speedreading, just by using my finger to force my eyes to move faster on the page (and maybe some lesser techniques I've now forgotten). I think I probably spent 20 minutes to an hour doing that explicitly, then was able to push my reading speed as high as I wanted. I think with more practice, I could probably take things in and imagine scenes a little better at high speed, but it seemed like diminishing returns, and I'm not the type to just sit and practice skills. To be fair, I spent my childhood reading instead of doing formal schooling, so I might've had a deeper skill base to work from.

Comment by Seth Herd on The Alignment Problem No One Is Talking About · 2024-05-10T23:19:59.736Z · LW · GW

Excellent! I think that's a clear and compelling description of the AI alignment problem, particularly in combination with your cartoon images. I think this is worth sharing as an easy intro to the concept.

I'm curious - how did you produce the wonderful images? I can draw a little, and I'd like to be able to illustrate like you did here, whether that involves AI or some other process.

FWIW, I agree that understanding humanity's alignment challenges is conceptually an extension of the AI alignment problem. But I think it's commonly termed "coordination" in LW discourse, if you want to see what people have written about that problem here. Moloch is the other term of art for thorny coordination/competition problems.

Comment by Seth Herd on Dyslucksia · 2024-05-10T23:14:39.710Z · LW · GW

As I understand it from some cog psych/linguistics class (it's not my area, but this matches my understanding of brain function), the problem with subvocalizing is that it limits your reading speed to approximately the rate you can talk. So most skilled readers have learned to disconnect from subvocalizing. Part of the training for speedreading is to make sure you're not subvocalizing at all, and this helped me learn to speedread.

I turn on subvocalizing sometimes when reading poetry or lyrical prose, or sometimes when I'm reading slowly to make damned sure I understand something, or remember its precise phrasing.

Comment by Seth Herd on Habryka's Shortform Feed · 2024-05-10T23:07:55.355Z · LW · GW

That's true, but the timing and incongruity of a "suicide" the day before testifying seems even more absurdly unlikely than corporations starting to murder people. And it's not like they're going out and doing it themselves; they'd be hiring a hitman of some sort. I don't know how any of that works, and I agree that it's hard to imagine anyone invested enough in their job or their stock options to risk a murder charge; but they may feel that their chances of avoiding charges are near 100%, so it might make sense to them.

I just have absolutely no other way to explain the story I read (sorry I didn't get the link since this has nothing to do with AI safety) other than that story being mostly fabricated. People don't say "finally tomorrow is my day" in the evening and then put a gun in their mouth the next morning without being forced to do it. Ever. No matter how suicidal, you're sticking around one day to tell your story and get your revenge.

The odds of that are so much lower than the odds of somebody thinking they could hire a hit and get away with it, and make a massive profit on their stock options. They could well have a personal vendetta against the whistleblower in addition to the monetary motive. People are motivated by money and revenge, and they're prone to misestimating the odds of getting caught. They could even be right that in their case it's near zero.

So I'm personally putting it at maybe 90% chance of murder.

Comment by Seth Herd on Habryka's Shortform Feed · 2024-05-10T22:56:34.404Z · LW · GW

Ummm, wasn't one of them just about to testify against Boeing in court, on their safety practices? And they "committed suicide" after saying the day before how much they were looking forward to finally getting a hearing on their side of the story? That's what I read; I stopped at that point, thinking "about zero chance that wasn't murder".

Comment by Seth Herd on We might be missing some key feature of AI takeoff; it'll probably seem like "we could've seen this coming" · 2024-05-09T23:51:28.561Z · LW · GW

Forecasting is hard.

Forecasting in a domain that includes human psychology, society-level propagation of beliefs, development of entirely new technology, and understanding how a variety of minds work in enough detail to predict not only what they'll do but how they'll change - that's really hard.

So, should we give up, and just prepare for any scenario? I don't think so. I think we should try harder.

That involves spending more individual time on it, and doing more collaborative prediction with people of different perspectives and different areas of expertise.

On the object level: I think it's pretty easy to predict now that we'll have more ChatGPT moments, and the Overton window will shift farther. In particular, I think interacting with a somewhat competent agent with self-awareness will be an emotionally resonant experience for most people who haven't previously imagined in detail that such a thing might exist soon.

Comment by Seth Herd on Has Generative AI Already Peaked? - Computerphile · 2024-05-09T20:21:38.411Z · LW · GW

It's helpful to include a summary with linkposts.

So here's a super quick one. I didn't listen to it closely, so I could've missed something.

It's about the article No "Zero-Shot" Without Exponential Data

Here's the key line from the abstract:

We consistently find that, far from exhibiting "zero-shot" generalization, multimodal models require exponentially more data to achieve linear improvements in downstream "zero-shot" performance, following a sample inefficient log-linear scaling trend

So, we might not continue to get better performance if we need exponentially larger datasets to get small linear improvements. This seems quite plausible, if nobody comes up with some sort of clever bootstrapping in which automatic labeling of images and videos, with a little human feedback, creates useful unlimited size datasets.
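
To make the quoted trend concrete, here's a toy illustration (with made-up coefficients, not numbers from the paper) of what a log-linear fit implies: every fixed gain in zero-shot accuracy costs a constant multiplicative factor in data.

```python
import math

# Hypothetical log-linear fit: accuracy = a + b * log10(examples).
# a and b are invented for illustration; the paper reports the trend,
# not these particular values.
a, b = 0.10, 0.08

def accuracy(examples: float) -> float:
    return a + b * math.log10(examples)

for n in [1e6, 1e7, 1e8, 1e9]:
    print(f"{n:>13,.0f} examples -> accuracy ~ {accuracy(n):.2f}")
# Each additional +0.08 of accuracy requires 10x more data under this fit,
# which is why linear improvements become exponentially expensive.
```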

I think this isn't going to put much of a slowdown on AGI progress, because we don't need much more progress on foundation models to build scaffolded agentic cognitive architectures that use System 2-type cognition to gauge their accuracy and the importance of the judgment, and use multiple tries on multiple models for important cognitive acts. That's how humans are as effective as we are; we monitor and double-check our own cognition when appropriate.

Comment by Seth Herd on jacquesthibs's Shortform · 2024-05-09T17:52:00.429Z · LW · GW

I think future more powerful/useful AIs will understand our intentions better IF they are trained to predict language. Text corpuses contain rich semantics about human intentions.

I can imagine other AI systems that are trained differently, and I would be more worried about those.

That's what I meant by current AI understanding our intentions possibly better than future AI.

Comment by Seth Herd on jacquesthibs's Shortform · 2024-05-08T18:55:30.196Z · LW · GW

This is an excellent point.

While LLMs seem (relatively) safe, we may very well blow right on by them soon.

I do think that many of the safety advantages of LLMs come from their understanding of human intentions (and therefore implied values). Those would be retained in improved architectures that still predict human language use. If such a system's thought process was entirely opaque, we could no longer perform Externalized reasoning oversight by "reading its thoughts".

But I think it might be possible to build a reliable agent from unreliable parts. I think humans are such an agent, and evolution made us this way because it's a way to squeeze extra capability out of a set of base cognitive capacities.

Imagine an agentic set of scaffolding that merely calls the super-LLM for individual cognitive acts. Such an agent would use a hand-coded "System 2" thinking approach to solve problems, like humans do. That involves breaking a problem into cognitive steps. We also use System 2 for our biggest ethical decisions; we predict consequences of our major decisions, and compare them to our goals, including ethical goals. Such a synthetic agent would use System 2 for problem-solving capabilities, and also for checking plans for how well they achieve goals. This would be done for efficiency; spending a lot of compute or external resources on a bad plan would be quite costly. Having implemented it for efficiency, you might as well use it for safety.
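
As a minimal sketch of what I mean (purely illustrative; `call_llm` is a hypothetical stand-in for whatever model API the scaffolding would use, and the prompts are placeholders):

```python
# Toy sketch of hand-coded "System 2" scaffolding around an LLM.
# The goals, the plan, and the review are all natural-language text;
# nothing here depends on reinforcement learning.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("hypothetical stand-in for an LLM API call")

def system2_agent(task: str, goals: list[str], max_steps: int = 10) -> list[str]:
    # Break the problem into cognitive steps.
    plan = call_llm(f"Break this task into concrete steps:\n{task}").splitlines()
    approved = []
    for step in plan[:max_steps]:
        # Predict consequences of the step, then review them against goals
        # (including ethical goals) before committing resources to it.
        prediction = call_llm(f"Predict the likely consequences of: {step}")
        verdict = call_llm(
            f"Goals (including ethical goals): {goals}\n"
            f"Predicted consequences: {prediction}\n"
            "Do the consequences conflict with any goal? Answer YES or NO."
        )
        if verdict.strip().upper().startswith("YES"):
            continue  # reject or rework steps that fail the internal review
        approved.append(step)
    return approved
```

The point of the sketch is just that the consequence-prediction loop you'd add for efficiency doubles as a checkpoint for alignment review.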

This is just restating stuff I've said elsewhere, but I'm trying to refine the model, and work through how well it might work if you couldn't apply any external reasoning oversight, and little to no interpretability. It's definitely bad for the odds of success, but not necessarily crippling. I think.

This needs more thought. I'm working on a post on System 2 alignment, as sketched out briefly (and probably incomprehensibly) above.

Comment by Seth Herd on AXRP Episode 31 - Singular Learning Theory with Daniel Murfet · 2024-05-07T04:21:58.642Z · LW · GW

Please just wait until you have the podcast link to post these to LW? We probably don't want to read it if you went to the trouble of making a podcast.

This is now available as a podcast if you search. I don't have the RSS feed link handy.

Comment by Seth Herd on Does reducing the amount of RL for a given capability level make AI safer? · 2024-05-06T17:53:05.370Z · LW · GW

I agree, I have heard that claim many times, probably including the vague claim that it's "more dangerous" than a poorly-defined imagined alternative. A bunch of pessimistic stuff in the vein of List of Lethalities focuses on reinforcement learning, analyzing how and why that is likely to go wrong. That's what started me thinking about true alternatives.

So yes, that does clarify why you've framed it that way. And I think it's a useful question.

In fact, I would've been prone to say "RL is unsafe and shouldn't be used". Porby's answer to your question is insightful; it notes that other types of learning aren't that different in kind. It depends how the RL or other learning is done.

One reason that non-RL approaches (at least the few I know of) seem safer is that they're relying on prediction or other unsupervised learning to create good, reliable representations of the world, including goals for agents. That type of learning is typically better because you can do more of it. You aren't limited to human-labeled data, which is always many orders of magnitude scarcer than data gathered from sensing the world (e.g., language input for LLMs, images for vision, etc.). The other alternative is having a reward-labeling algorithm that can attach reward signals to any data, but that seems unreliable in that we don't have even good guesses at an algorithm that can identify human values or even reliably identify instruction-following.

Comment by Seth Herd on Does reducing the amount of RL for a given capability level make AI safer? · 2024-05-06T16:41:23.393Z · LW · GW

Surely asking if anything is safer is only sensible when comparing it to something. Are you comparing it to some implicit expected-if-not RL method of alignment? I don't think we have a commonly shared concept of what that would be. That's why I'm pointing to some explicit alternatives in that post.

Comment by Seth Herd on Does reducing the amount of RL for a given capability level make AI safer? · 2024-05-05T22:59:18.323Z · LW · GW

Compared to what?

If you want an agentic system (and I think many humans do, because agents can get things done), you've got to give it goals somehow. RL is one way to do that. The question of whether that's less safe isn't meaningful without comparing it to another method of giving it goals.

The method I think is both safer and implementable is giving goals in natural language, to a system that primarily "thinks" in natural language. I think this is markedly safer than any RL proposal anyone has come up with so far. And there are some other options for specifying goals without using RL, each of which does seem safer to me:

Goals selected from learned knowledge: an alternative to RL alignment

Comment by Seth Herd on Thomas Kwa's Shortform · 2024-05-05T15:29:09.470Z · LW · GW

I get conservation of expected evidence. But the distribution of belief changes is completely unconstrained.

Going from the class martingale to the subclass Brownian motion is arbitrary, and the choice of 1% update steps is another unjustified arbitrary choice.

I think asking about the likely possible evidence paths would improve our predictions.
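
A toy simulation of that point (invented dynamics, just to show that two processes can both conserve expected evidence while producing very different distributions of belief changes):

```python
import random

def brownian_step(p: float) -> float:
    """Symmetric ~1-point updates: a martingale made of many tiny changes."""
    if p <= 0.01 or p >= 0.99:
        return p
    return p + random.choice([-0.01, 0.01])

def jumpy_step(p: float, q: float = 0.02) -> float:
    """Rare decisive evidence: with probability q, jump straight to 1.

    Expected next value = q*1 + (1-q)*(p-q)/(1-q) = p, so this is also a
    martingale, but its paths look nothing like Brownian motion.
    """
    if p <= q or p >= 1.0:
        return p
    return 1.0 if random.random() < q else (p - q) / (1 - q)

random.seed(0)
for step in (brownian_step, jumpy_step):
    p = 0.5
    for _ in range(100):
        p = step(p)
    print(f"{step.__name__}: p after 100 updates = {p:.3f}")
```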

You spelled it conversation of expected evidence. I was hoping there was another term by that name :)

Comment by Seth Herd on Thomas Kwa's Shortform · 2024-05-04T13:20:58.222Z · LW · GW

But... Why would p(doom) move like Brownian motion until stopping at 0 or 1?

I don't disagree with your conclusions: there's a lot of evidence coming in, and if you're spending full time or even part time thinking about alignment, there are a lot of important updates to make from it. But assuming a random walk seems wrong.

Is there a reason that a complex, structured unfolding of reality would look like a random walk?

Comment by Seth Herd on LLM+Planners hybridisation for friendly AGI · 2024-05-03T18:00:59.118Z · LW · GW

I think this is quite similar to my proposal in Capabilities and alignment of LLM cognitive architectures.

I think people will add cognitive capabilities to LLMs to create fully capable AGIs. One such important capability is executive function. That function is loosely defined in cognitive psychology, but it is crucial for planning among other things.

I do envision such planning looking loosely like a search algorithm, as it does for humans. But it's a loose search algorithm, working in the space of statements made by the LLM about possible future states and action outcomes. So it's more like a tree of thought or graph of thought than any existing search algorithm, because the state space isn't well defined independently of the algorithm.
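
A rough sketch of the kind of loose search I have in mind (illustrative only; `llm` and `score` are hypothetical callables the scaffolding would supply, and the prompt is a placeholder):

```python
# Toy best-first search where "states" are natural-language descriptions
# produced by the LLM, so the state space isn't defined independently of
# the algorithm.
import heapq
from typing import Callable

def plan_search(goal: str, start: str, llm: Callable[[str], str],
                score: Callable[[str], float], budget: int = 50) -> str:
    frontier = [(-score(start), start)]
    best = start
    for _ in range(budget):
        if not frontier:
            break
        neg_score, state = heapq.heappop(frontier)
        if -neg_score > score(best):
            best = state
        # Ask the LLM for a few plausible next actions/outcome states.
        proposals = llm(
            f"Goal: {goal}\nCurrent situation: {state}\n"
            "List three plausible next actions and the states they lead to."
        ).splitlines()
        for p in proposals[:3]:
            heapq.heappush(frontier, (-score(p), p))
    return best
```

Here `score` could itself be an LLM judgment of how well a described state satisfies the goal.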

That all keeps things more dependent on the LLM black box, as in your final possibility.

At least I think that's the analogy between the proposals? I'm not sure.

I think the pushback to both of these is roughly: this is safer how?

I don't think there's any way to strictly formalize not harming humans. My answer is halfway between that and your "sentiment analysis in each step of planning". I think we'll define rules of behavior in natural language, including not harming humans but probably much more elaborate, and implement both internal review, like your sentiment analysis but more elaborate, and external review by humans aided by tool AI (doing something like sentiment analysis), in a form of scalable oversight.

I'm curious if I'm interpreting your proposal correctly. It's stated very succinctly, so I'm not sure.

Comment by Seth Herd on An explanation of evil in an organized world · 2024-05-03T17:42:38.635Z · LW · GW

Yeah. Well, since It was addressing a tribe of nomadic herders in prehistoric times, that in itself is a good thing :)

Comment by Seth Herd on Please stop publishing ideas/insights/research about AI · 2024-05-02T18:49:38.339Z · LW · GW

At the core, this is a reminder to not publish things that will help more with capabilities than alignment. That's perfectly reasonable.

The tone of the post suggests erring on the side of "safety" by not publishing things that have an uncertain safety/capabilities balance. I hope that wasn't the intent.

Because that does not make sense. Anything that advances alignment more than capabilities in expectation should be published.

You have to make a difficult judgment call for each publication. Be mindful of your bias in wanting to publish to show off your work and ideas. Get others' insights if you can do so reasonably quickly.

But at the end of the day, you have to make that judgment call. There's no consolation prize for saying "at least I didn't make the world end faster". If you're a utilitarian, winning the future is the only goal.

(If you're not a utilitarian, you might actually want a resolution faster so you and your loved ones have higher odds of surviving into the far future.)

Comment by Seth Herd on An explanation of evil in an organized world · 2024-05-02T18:31:05.441Z · LW · GW

Then God isn't "good" as humans mean the term. That's always been one possible explanation.

Comment by Seth Herd on Shane Legg's necessary properties for every AGI Safety plan · 2024-05-01T22:52:55.256Z · LW · GW

There's also some more in his interview with Dwarkesh Patel just before then. I wrote this brief analysis of that interview WRT alignment, and this talk seems to confirm that I was more-or-less on target.

So, to your questions, including where I'm guessing at Shane's thinking, and where it's mine.

This is overlapping with the standard story AFAICT, and 80% of alignment work is sort of along these lines. I think what Shane's proposing is pretty different in an important way: it includes System 2 thinking, where almost all alignment work is about aligning the way LLMs give quick answers, analogous to human System 1 thinking.

How do we get a model that is genuinely robustly trying to obey the instruction text, instead of e.g. choosing actions on the basis of a bunch of shards of desire/drives that were historically reinforced[?]

Shane seemed to say he wants to use zero reinforcement learning in the scaffolded agent system, a stance I definitely agree with. I don't think it matters much whether RLHF was used to "align" the base model, because it's going to have implicit desires/drives from the predictive training of human text, anyway. Giving instructions to follow doesn't need to have anything to do with RL; it's just based on the world model, and putting those instructions as a central and recurring prompt for that system to produce plans and actions to carry out those instructions.

So, how we get a model to robustly obey the instruction text is by implementing system 2 thinking. This is "the obvious thing" if we think about human cognition. System 2 thinking would be applying something more like a tree of thought algorithm, which checks through predicted consequences of the action, and then makes judgments about how well those fulfill the instruction text. This is what I've called internal review for alignment of language model cognitive architectures.

To your second and third questions; I didn't see answers from Shane in either the interview or that talk, but I think they're the obvious next questions, and they're what I've been working on since then. I think the answers are that the instructions will try to be as scope-limited as possible, that we'll want to carefully check how they're interpreted before setting the AGI any major tasks, and that we'll want to limit autonomous action to the degree that they're still effective. 

Humans will want to remain closely in the loop to deal with inevitable bugs and unintended interpretations and consequences of instructions. I've written about this briefly here, and in just a few days I'll be publishing a more thorough argument for why I think we'll do this by default, and why I think it will actually work if it's done relatively carefully and wisely. Following that, I'm going to write more on the System 2 alignment concept, and I'll try to actually get Shane to look at it and say if it's the same thing he's thinking of in this talk, or at least close.

In all, I think this is both a real alignment plan and one that can work (at least for technical alignment - misuse and multipolar scenarios are still terrifying), and the fact that someone in Shane's position is thinking this clearly about alignment is very good news.

Comment by Seth Herd on The Prop-room and Stage Cognitive Architecture · 2024-04-30T20:10:03.820Z · LW · GW

I agree with all of that. Even being sceptical that LLMs plus search will reach AGI. The lack of constraint satisfaction as the human brain does it could be a real stumbling block.

But LLMs have copied a good bit of our reasoning and therefore our semantic search. So they can do something like constraint satisfaction.

Put the constraints into a query, and the answer will satisfy those constraints. The process used is different than a human brain, but for every problem I can think of, the results are the same.

Now, that's partly because every problem I can think of is one I've already seen solved. But my ability to do truly novel problem solving is rarely used and pretty limited. So I'm not sure the LLM can't do just as good a job if it had a scaffolded script to explore its knowledge base from a few different angles.

Comment by Seth Herd on LLMs seem (relatively) safe · 2024-04-30T06:26:03.456Z · LW · GW

Fair enough, thank you! Regardless, it does seem like a good reason to be concerned about alignment. If you have no idea how intelligence works, how in the world would you know what goals your created intelligence is going to have? At that point, it is like alchemy - performing an incantation and hoping not just that you got it right, but that it does the thing you want.

Comment by Seth Herd on LLMs seem (relatively) safe · 2024-04-29T19:37:51.552Z · LW · GW

Nothing in this post or the associated logic says LLMs make AGI safe, just safer than what we were worried about.

Nobody with any sense predicted runaway AGI by this point in history. There's no update from other forms not working yet.

There's a weird thing where lots of people's p(doom) went up when LLMs started to work well, because they found it an easier route to intelligence than they'd been expecting. If it's easier, it happens sooner and with less thought surrounding it.

See Porby's comment on his risk model for language model agents. It's a more succinct statement of my views.

LLMs are easy to turn into agents, so let's not get complacent. But they are remarkably easy to control and align, so that's good news for aligning the agents we build from them. That doesn't get us out of the woods, though; there are new issues with self-reflective, continuously learning agents, and there's plenty of room for misuse and conflict escalation in a multipolar scenario, even if alignment turns out to be dead easy if you bother to try.

Comment by Seth Herd on LLMs seem (relatively) safe · 2024-04-29T19:30:08.225Z · LW · GW

That is a fascinating take! I haven't heard it put that way before. I think that perspective is a way to understand the gap between old-school agent foundations folks' high p(doom) and new school LLMers relatively low p(doom) - something I've been working to understand, and hope to publish on soon.

To the extent this is true, I think that's great, because I expect to see some real insights on intelligence as LLMs are turned into functioning minds in cognitive architectures.

Do you have any refs for that take, or is it purely a gestalt?

Comment by Seth Herd on Disentangling Competence and Intelligence · 2024-04-29T19:12:37.398Z · LW · GW

Interesting, and good job publishing rather than polishing!

I really like terminology of competence vs. intelligence.

I don't think you want to use the term intelligence for your level 3. I think I see why you want to; but intelligence is currently an umbrella term for any cognitive capacity, so you're invoking different intuitions when you use it for one particular cognitive capacity.

In either case, I think you should draw the analogy more closely with Level 3 and problem-solving. At least if you think it exists.

Suppose I'm a hunter-gatherer, and there are fruit high up in a tree. This tree has thorns, so my usual strategy of climbing it and shaking branches won't work. If I figure out, through whatever process of association, simulation, and trial and error, that I can get a long branch from another tree and knock the fruit down, I can incorporate that into my level 2 cognition, and from there into level 1. This type of problem-solving is also probably the single cognitive ability most often referred to as intelligence, thus justifying your use of the term for that level. If I'm right that you'd agree with all of that, that could make the terminology more intuitive to the reader.

In any case, I'm glad to see you thinking about cognition in relation to alignment. It's obviously crucial; I'm unclear if most people just aren't thinking about it, or if it's all considered too infohazardous.

Comment by Seth Herd on The Prop-room and Stage Cognitive Architecture · 2024-04-29T19:03:19.875Z · LW · GW

Interesting! I'm not following everything, but it sounds like you're describing human cognition for the most part.

I found it interesting that you used the phrase "constraint satisfaction". I think this concept is crucial for understanding human intelligence; but it's not used very widely. So I'm curious where you picked it up.

I agree with your conclusion on the alignment section: these seem like low-resolution ideas that seem worth fleshing out.

Good job putting this out there without obsessively polishing it. That shares at least some of your ideas with the rest of us, so we can build on them in parallel with you polishing your understanding and your presentation.

Comment by Seth Herd on Book review: Deep Utopia · 2024-04-28T23:22:48.968Z · LW · GW

Thanks! I'm also uninterested in the question of whether it's possible. Obviously it is. The question is how we'll decide to use it. I think that answer is critical to whether we'd consider the results utopian. So, does he consider how we should or will use that ability?

Comment by Seth Herd on [Aspiration-based designs] 1. Informal introduction · 2024-04-28T21:38:24.943Z · LW · GW

I applaud the effort. Big upvote for actually trying to solve the problem, by coming up with a way to create safe, aligned AGI. If only more people were doing this instead of hand wringing, arguing, or "working on the problem" in poorly-thought-out, too-indirect-to-probably-help-in-time ways. Good job going straight for the throat.

That said: It seems to me like the problem isn't maximization or even optimization; it's conflicting goals.

If I have a goal to make some paperclips, not as many as I can, just a few trillion, I may still enter a deadly conflict with humanity. If humanity knows about me and my paperclips goal, they'll shut me down. The most certain way to get those paperclips made may be to eliminate unpredictable humanity's ability to mess with my plans.

For essentially this reason, I think quantilization is and was recognized as a dead end. You don't have to take your goals to the logical extreme to still take them way too far for humanity's good.

I read this post, but not the remainder yet, so you might've addressed this elsewhere.

Comment by Seth Herd on Book review: Deep Utopia · 2024-04-28T21:05:00.024Z · LW · GW

Does Bostrom address human modification/amplification? I'd think he would, but I'm not sure he actually did, at least not in any depth.

A world in which we all get sad because we can't make new philosophy breakthroughs and don't bother to engineer out that sadness seems quite implausible. Yet I didn't hear this addressed in his interview with Liv Boeree.

And I'm not going to buy and read it just to find out.

Comment by Seth Herd on List your AI X-Risk cruxes! · 2024-04-28T20:43:57.408Z · LW · GW

I think this is a great project! Clarifying why informed people have such different opinions on AGI x-risk seems like a useful path to improving our odds. I've been working on a post on alignment difficulty cruxes that covers much of the same ground.

Your list is a good starting point. I'd add:

Time window of analysis: I think a lot of people give a low p(doom) because they're only thinking about the few years after we get real AGI.

Paul Christiano, for instance, adds a substantial chance that we've "irreversibly messed up our future within 10 years of building powerful AI" over and above the odds that we all die from takeover or misuse. (in My views on “doom”, from April 2023).

Here are my top 4 cruxes for alignment difficulty which is different but highly overlapping with p(doom).

How AGI will be designed and aligned

  • How we attempt it is obviously important for the odds of success

How well RL alignment will generalize

Whether we need to understand human values better

Whether societal factors are included in alignment difficulty

  • Sometimes people are just answering whether the AGI will do what we want, and not including whether that will result in doom (e.g., ASI super-weapons proliferate until someone starts a hyper-destructive conflict).

Other important cruxes are mentioned in Stop talking about p(doom) - basically, what the heck one includes in their calculation, like my first point on time windows.

Comment by Seth Herd on This is Water by David Foster Wallace · 2024-04-27T22:41:54.633Z · LW · GW

Fascinating, thank you!

I think the reality here is probably complex. I think we can direct our thoughts to some degree, and that in turn creates our feelings to some degree. Using that wisely isn't trivial. If I obsess about controlling my thinking, that could easily become upsetting.

I do think there's a good chance that the views David Foster Wallace espouses here were causally linked to his depression and suicide. They should be taken with caution. But doing the opposite probably isn't the best approach either.

I had thought that cognitive reframing is part of some well-regarded therapeutic approaches to depression. While one can't choose how to feel, it is pretty apparent that we can, sometimes, choose what to think. When I ask myself "what should I think about now?" I get what seems like meaningful answers, and they direct my train of thought to a nontrivial degree - but not infinitely. My thoughts return to emotionally charged topics. If this upsets me, those topics become even more emotionally charged, and my thoughts return to them more often. This is the "don't think of a white bear" phenomenon.

However, gentle redirection does seem to work. Reframing my understanding of situations in ways that make me happier does appear to sometimes make me happier.

But thinking I should be able to do this infinitely is unrealistic, and my failure to do so would be upsetting if I thought I should be able to control my feelings and my thoughts relatively thoroughly.

I think this is a fascinating topic. I think therapy and psychology is in its infancy, and I expect us to have vastly better treatment for depression relatively soon. It will probably involve hugs and puppies as well as a better understanding of how we can and should try to think about our thinking.

Comment by Seth Herd on [deleted post] 2024-04-27T18:20:37.935Z

This is pretty fluffy. It didn't include any effect sizes or descriptions of manipulations, let alone a meta-analysis that would make any of this reliable enough to take seriously. And while thinking hard hasn't been shown to cost more metabolic energy, I don't think the studies are good enough to prove it doesn't, either.

WSJ may not be a good place to get your neuroscience and psychology. I wouldn't know, it's paywalled.

Comment by Seth Herd on We are headed into an extreme compute overhang · 2024-04-27T14:32:12.791Z · LW · GW

The big question here, it seems like, is: does intelligence stack? Does a hundred thousand instances of GPT4 working together make an intelligence as smart as GPT7?

Thus far the answer seems to be no. There are some intelligence improvements from combining multiple calls in tree-of-thought-type setups, but not much. And those need carefully hand-structured algorithms.

So I think the limitation is in scaffolding techniques, not the sheer number of instances you can run. I do expect scaffolding LLMs into cognitive architectures to achieve human level fully general AGI, but how and when we get there is tricky to predict.

When we have that, I expect it to stack a lot like human organizations. They can do a lot more work at once, but they're not much smarter than a single individual because it's really hard to coordinate and stack all of that cognitive work.

Comment by Seth Herd on Andrew Burns's Shortform · 2024-04-27T14:18:15.790Z · LW · GW

Are you saying that China will use Llama 3 400B weights as a basis for improving their research on LLMs? Or to make more tools from? Or to reach real AGI? Or what?

Comment by Seth Herd on AI Regulation is Unsafe · 2024-04-24T00:17:31.327Z · LW · GW

Who is downvoting posts like this? Please don't!

I see that this is much lower than the last time I looked, so it's had some, probably large, downvotes.

A downvote means "please don't write posts like this, and don't read this post".

Daniel Kokotajlo disagreed with this post, but found it worth engaging with. Don't you want discussions with those you disagree with? Downvoting things you don't agree with says "we are here to preach to the choir. Dissenting opinions are not welcome. Don't post until you've read everything on this topic". That's a way to find yourself in an echo chamber. And that's not going to save the world or pursue truth.

I largely disagree with the conclusions and even the analytical approach taken here, but that does not make this post net-negative. It is net-positive. It could be argued that there are better posts on this topic one should read, but there certainly haven't been this week. And I haven't heard these same points made more cogently elsewhere. This is net-positive unless I'm misunderstanding the criteria for a downvote.

I'm confused why we don't have a "disagree" vote on top-level posts to draw off the inarticulate disgruntlement that causes people to downvote high-effort, well-done work.

Comment by Seth Herd on AI Regulation is Unsafe · 2024-04-22T20:40:11.638Z · LW · GW

The fact that very few in government even understand the existential risk argument means that we haven't seen their relevant opinions yet. As you point out, the government is composed of selfish individuals. At least some of those individuals care about themselves, their children, and their grandchildren. Making them aware of the existential risk arguments in detail could entirely change their attitude.

In addition, I think we need to think in more detail about possible regulations and downsides. Sure, government is shortsighted and selfish, like the rest of humanity.

I think you're miscalibrated on the risks relative to your average reader. We tend to care primarily about the literal extinction of humanity. Relative to those concerns, the "most dystopian uses for AI" you mention are not a concern, unless you mean literally the worst: a billion-year reich of suffering or something.

We need a reason to believe that governments can reliably improve the incentives facing private organizations.

We do not. Many of us here believe we are in such a desperate situation that merely rolling the dice to change anything would make sense.

I'm not one of those people. I can't tell what situation we're really in, and I don't think anyone else has a satisfactory full view either. So, despite all of the above, I think you might be right that government regulation may make the situation worse. The biggest risk I can see is changing who's in the lead for the AGI race; the current candidates seem relatively well-intentioned and aware of the risks (with large caveats). (One counterargument is that takeoff will likely be so slow in the current paradigm that we will have multiple AGIs, making the group dynamics as important as individual intentions.)

So I'd like to see a better analysis of the potential outcomes of government regulation. Arguing that governments are bad and dumb in a variety of ways just isn't sufficiently detailed to be helpful in this situation.

Comment by Seth Herd on How to Model the Future of Open-Source LLMs? · 2024-04-22T03:29:45.218Z · LW · GW

I expect there is still tons of low-hanging fruit available in LLM capabilities land. You could call this "algorithmic progress" if you want. This will decrease the compute cost necessary to get a given level of performance, thus raising the AI capability level accessible to less-resourced open-source AI projects.

Don't you expect many of those improvements to remain closed-source from here on out, benefitting the teams that developed them at great (average) expense? And even the ones that are published freely will benefit the leaders just as much as their open-source chasers.

Comment by Seth Herd on Good Bings copy, great Bings steal · 2024-04-22T03:18:10.016Z · LW · GW

The question this addresses is whether LLMs can create new knowledge. The answer is "that's irrelevant".

Your framing seems to equivocate over current LLMs, future LLMs, and future AI of all types. That's exactly what the public debate does, and it creates a flaming mess.

I'm becoming concerned that too many in the safety community are making this same mistake, and thereby misunderstanding and underestimating the near-term danger.

I think there's a good point to be made about the cognitive limitations of LLMs. I doubt they can achieve AGI on their own.

But they don't have to, so whether they can is irrelevant.

If you look at how humans create knowledge, we are using a combination of techniques and brain systems that LLMs cannot employ. Those include continuous, self-directed learning; episodic memory as one aid to that learning; cognitive control, to organize and direct that learning; and sensory and motor systems to carry out experiments to direct that learning.

All of those are conceptually straightforward to add to LLMs (good executive function/cognitive control for planning is less obviously straightforward, but I think it may be surprisingly easy to leverage LLMs' "intelligence" to do it well).

See my Capabilities and alignment of LLM cognitive architectures for expansions on those arguments. I've been reluctant to publish more, but I think these ideas are fairly obvious once someone actually sits down to create agents that expand on LLM capabilities, so I think getting the alignment community thinking about this correctly is more important than a tiny slowdown in reaching x-risk capable AGI through this route.

(BTW human artistic creativity uses that same set of cognitive capabilities in different ways, so same answer to "can LLMs be true artists").