Posts

Goals selected from learned knowledge: an alternative to RL alignment 2024-01-15T21:52:06.170Z
After Alignment — Dialogue between RogerDearnaley and Seth Herd 2023-12-02T06:03:17.456Z
Corrigibility or DWIM is an attractive primary goal for AGI 2023-11-25T19:37:39.698Z
Sapience, understanding, and "AGI" 2023-11-24T15:13:04.391Z
Altman returns as OpenAI CEO with new board 2023-11-22T16:04:03.123Z
OpenAI Staff (including Sutskever) Threaten to Quit Unless Board Resigns 2023-11-20T14:20:33.539Z
We have promising alignment plans with low taxes 2023-11-10T18:51:38.604Z
Seth Herd's Shortform 2023-11-10T06:52:28.778Z
Shane Legg interview on alignment 2023-10-28T19:28:52.223Z
The (partial) fallacy of dumb superintelligence 2023-10-18T21:25:16.893Z
Steering subsystems: capabilities, agency, and alignment 2023-09-29T13:45:00.739Z
AGI isn't just a technology 2023-09-01T14:35:57.062Z
Internal independent review for language model agent alignment 2023-07-07T06:54:11.552Z
Simpler explanations of AGI risk 2023-05-14T01:29:29.289Z
A simple presentation of AI risk arguments 2023-04-26T02:19:19.164Z
Capabilities and alignment of LLM cognitive architectures 2023-04-18T16:29:29.792Z
Agentized LLMs will change the alignment landscape 2023-04-09T02:29:07.797Z
AI scares and changing public beliefs 2023-04-06T18:51:12.831Z
The alignment stability problem 2023-03-26T02:10:13.044Z
Human preferences as RL critic values - implications for alignment 2023-03-14T22:10:32.823Z
Clippy, the friendly paperclipper 2023-03-02T00:02:55.749Z
Are you stably aligned? 2023-02-24T22:08:23.098Z

Comments

Comment by Seth Herd on AXRP Episode 27 - AI Control with Buck Shlegeris and Ryan Greenblatt · 2024-04-12T22:11:01.943Z · LW · GW

It's available as a podcast now:

http://castbox.fm/app/castbox/feed/c219b11a71a4f5eed01939f31db06ed24aa86116/track/b781f40900d7cfe1b74fa9c4dfdb1d0ddddba1ab

Want to re-add a link?

Comment by Seth Herd on Martín Soto's Shortform · 2024-04-11T23:19:23.688Z · LW · GW

I guess I don't get it.

Comment by Seth Herd on Martín Soto's Shortform · 2024-04-11T20:22:50.411Z · LW · GW

Sure, long after we're dead from AGI that we deliberately created to plan to achieve goals.

Comment by Seth Herd on Thinking harder doesn’t work · 2024-04-11T13:09:30.367Z · LW · GW

Plagiarism is bad, on LW or anywhere.

Repeating other people's useful thoughts is good. Pretending you came up with them yourself is bad. Attribution is the difference.

Comment by Seth Herd on The 2nd Demographic Transition · 2024-04-11T12:58:57.267Z · LW · GW

It could be, but that's clearly not the whole deal. Societal standards for childcare have shifted dramatically. That shift could be driven by people having fewer children and could also be causing it, in a vicious cycle.

Comment by Seth Herd on The 2nd Demographic Transition · 2024-04-10T19:04:31.167Z · LW · GW

"High income households have access to the world's best leisure opportunities, yet they still invest more time in child-rearing than lower income households."

I doubt they invest more time. They have money to pay for more help with childcare. And I think this is the critical difference.

Time spent on care per child has skyrocketed in recent decades. I think that's one major factor driving down fertility: having kids is a bigger PITA every year.

Thinking of costs solely in terms of money is a mistake. The time investment is critical.

This is why I'm unconcerned with low fertility if we get AI progress and don't die from it: AI is going to be great at childcare. Even current LLMs have the cognitive capacity to be good tutors and playmates.

Comment by Seth Herd on Can singularity emerge from transformers? · 2024-04-08T22:21:45.448Z · LW · GW

By using the LLM as the central cognitive core in a scaffolded agent or cognitive architecture. Those are just the more obvious routes to leveraging LLM techniques into a smarter and more agentic form.

I agree that current LLMs aren't capable of recursive self-improvement. Almost no AI worriers think that current LLMs are existentially dangerous. We worry that extensions of them, combined with other AI techniques, might be dangerous soon, so decelerating, or at least creating some sort of brakes, would be the sensible thing to do if we want to improve our odds of survival.

Comment by Seth Herd on My intellectual journey to (dis)solve the hard problem of consciousness · 2024-04-08T19:59:43.512Z · LW · GW

Nice post! The comments section is complex, indicating that even rationalists have a lot of trouble talking about consciousness clearly. This could be taken as evidence for what I take to be one of your central claims: the word consciousness means many things, and different things to different people.

I've been fascinated by consciousness since before starting grad school in neuroscience in 1999. Since then, I've thought a lot about consciousness, and what insight neuroscience (not the colored pictures of imaging, but detailed study of individual and groups of neurons' responses to varied situations) has to say about it.

I think it has a lot to say. There are more detailed explanations available of each of the phenomena you identify as part of the umbrella term "consciousness".

This gets at the most apt critique of this and similar approaches to denying the existence of a hard problem: "Wait! That didn't explain the part I'm interested in!". I think this is quite true, and better explanations are quite possible given what we know. I believe I have some, but I'm not sure it's worth the trouble to even try to explicate them.

Over the past 25 years, I've discussed consciousness less and less. It's so difficult as to create unproductive discussions a lot of the time, and frustrating misunderstandings and arguments a good bit of the time.

Thus, while I've wanted to write about it, there's never been a professional or personal motivating factor.

I wonder if the advent of AGI will create such a factor. If we go through a nontrivial era of parahuman AGI, as I think we will, then I think the question of whether and how they're conscious might become a consequential one, determining how we treat them.

It could also help determine how seriously we take AGI safety. If the honest answer to "is this proto-AGI conscious?" is "yes, in some of the ways humans are, and not in others", that encourages the intuition that we should take these things seriously as a potential threat.

So, perhaps it would make sense to start that discussion now, before public debate ramps up?

If that logic doesn't strongly hold, discussing consciousness seems like a huge time-sink taking time that would be better spent trying to solve alignment as best we can while we still have a chance.

Comment by Seth Herd on What's with all the bans recently? · 2024-04-08T18:04:13.459Z · LW · GW

It's ironic that your response doesn't address my comment. That was one of the stated reasons for your limit. It also illustrates why Habryka thought explaining it to you further didn't seem likely to help.

How best to moderate a website such as LW is a deep and difficult question. If you have better ideas, those might be useful. "Just do more, better" is not a useful suggestion.

Comment by Seth Herd on StartAtTheEnd's Shortform · 2024-04-08T17:51:21.519Z · LW · GW

First a point of agreement: living in a "solved" world would suck. To the extent we live like that, it does suck.

But reducing information isn't the only way to prevent that. Creating new rich situations is a better solution, I think. And the world has been doing that just fine so far. The modern world isn't solved. Often, attempts to live as though it is are deeply mistaken, as well as depressing.

If you don't like having a strategy guide to your games, don't look at one unless and until you really decide you want to. And if you do, you'll notice that no PvP game is entirely solved by its best theorists. I don't play PvE, but I suspect the same is true there. The level of play and strategy interact with the meta in complex ways.

I agree that more information has correlated with some bad effects, but I don't think it's directly causal, and I don't think reducing the information would itself make life better or less Molochian, unless you somehow held the material quality of life to a high level. If you could do that, I think you could come up with better, more thorough solutions to making life better, rather than going back to living in more ignorance.

The world as it stands isn't great. But most of history for most of humanity truly epically sucked.

If you're taking a wireframe view of the world, you're looking at it wrong. And a lot of people are. The details matter. Making decisions based on data is only half of the way to live in our current state. Feelings and intuitions matter. A lot.

People who run businesses DO give me things for free sometimes. I engage in conversation with them, see them as individuals, and they sometimes respond with generosity (I don't do this to get free stuff, and I don't get it a lot; what I get is human interactions with richness and value). It's true that they can't give me stuff from carefully regulated businesses, and there are real dangers in having a society that's capitalistic to the exclusion of valuing happiness, rich textured experience, and beauty.

To some specifics of your argument:

I take your point about those with power having more latitude of choice because they didn't know the best choices. But fewer people had power. Those with no power were abused by those with a little, because life was hard and necessities were scarce. That is Molochian. The idea that people didn't go to war because they didn't know it was a winning strategy seems like it would be a small effect relative to the difference in competition driven by necessity. Many more people were forced to go to war in the past than today, because their nondemocratic power structures would use them as soldiers under threat to their lives and families.

WRT the difference between now and twenty or so years ago, the time you're idealizing (and I lived through), I'm saying let's see the statistics. I don't think the world is worse now, and "things were simpler then" isn't good data. I think things were simpler, which was nice; and they were worse in many ways. (The US happened to have a golden economic age starting in the 50s because it was the only advanced nation whose industrial capacity was enhanced rather than destroyed by WW2, in case that's what you're thinking of; the MAGA illusion is based on that historical accident.)

So I don't buy that things are worse now just because some people like to say they are. I take that to be largely a product of social media spreading negative information better than positive on average. That is a real problem, but the solution isn't as simple as "just don't spread information".

I don't think the freedom to make more mistakes is making life much better. However, I do agree that making real choices makes people happy, that society needs to support that, and that we might not be doing so adequately right now (although there are real choices to make, and you should make them and revel in that freedom). We haven't solved the meta, not by a long shot! For instance, dating like it's a job interview isn't at all how properly informed dating works; that's some sort of bad local minimum. But asking some important questions as dealbreakers can spare you a lifetime of slow heartbreak when you discover late that there are fundamental incompatibilities.

So in sum, I don't think your solution on its own would work. Limiting information solves only a tiny part of the Molochian problem while on average making the whole worse, unless you have a quite different solution for the remainder. And that would be the real solution.

If you're disillusioned, stop it. Find something new and wonderful and complex to wonder at. Or look deeper at the details for more possibilities in the things you're disillusioned by.

The world isn't solved, and we can keep creating rich challenges while we keep developing our information technology. The question is whether anyone with good intentions and good ideas controls the world. Development of AGI is currently central to that, so I suggest focusing on it as the current thing to engage with and wonder at.

Thanks for an interesting conversation! I'd better focus on more immediate concerns, like the above.

Comment by Seth Herd on Centrists are (probably) less biased · 2024-04-07T21:28:42.364Z · LW · GW

In your model of why to assume centrists are less biased: aren't you assuming that the truth tends to be in the center of the spectrum? If we knew where the truth lay, there would be no point in studying which side is more biased or better rationalists. Right?

Comment by Seth Herd on StartAtTheEnd's Shortform · 2024-04-05T22:31:58.629Z · LW · GW

To my eye, the world in the past has had more problem with Moloch, not less. Warlords, serfdom as near-slavery, etc. are the direct result of Molochian competition. The human condition has been getting better over history.

We (at least the middle class) might've had a golden age just recently and things might've gone downhill since. I don't know and I don't think anyone has a good measure of whether things have really gone downhill WRT happiness, quality of life, or Molochian competition. But that's at an intermediate level of information transmission. The remainder of earlier history had much less information transmission, and Molochian competition was way worse.

Comment by Seth Herd on What's with all the bans recently? · 2024-04-05T22:13:26.840Z · LW · GW

It is not "fine to get into arguments". The FAQ definitely lays out goals of having interactions here be civil and collaborative.

Unfortunately, becoming less wrong (reaching the truth) benefits hugely from not getting into arguments.

If you tell a human being (even a rationalist) something true, with good evidence or arguments, but you do it in an aggressive or otherwise irritating way, they may very well become less likely to believe that true thing, because they associate it with you and want to fight both you and the idea.

This is true of you and other rationalists as well as everyone else.

This is not to argue that the bans might not be overdoing it; it's trying to do what Habryka doesn't have time to do: explain to you why you're getting downvoted even when you're making sense.

Comment by Seth Herd on LessWrong's (first) album: I Have Been A Good Bing · 2024-04-05T20:12:34.917Z · LW · GW

2.0 is now my favorite album; I've listened to it at least five times through since you recommended it. Thanks so much!! The electro-rock style does it for me. And I think the lyrics and music are well-written. Having each lyricist do only one song is an interesting approach that might raise quality.

It's hard to say how much of it is directly written about AI risk, but all of it can be taken that way. Most of the songs can be taken as written from the perspective of a misaligned AGI with human-similar thinking and motivations. Which I find highly plausible, since I think language model agents are the most likely route to AGI, and they'll be curiously parahuman.

Comment by Seth Herd on LessWrong's (first) album: I Have Been A Good Bing · 2024-04-05T20:03:26.185Z · LW · GW

What I've done is to notice exactly what the emotion feels like when it's happening; then, when I want to encourage that emotion, I try to remember that feeling, and imagine it's happening now as vividly and intensely as I can.

I've thought of this as "direct emotional induction" but I've never written about it. It seems to work remarkably well.

I came up with this after studying the precise mechanisms the brain uses to control attention for my dissertation. I'm certain there are other names for this and that it's been discovered through other means, but oddly I haven't come across it.

Comment by Seth Herd on EJT's Shortform · 2024-04-03T18:08:57.318Z · LW · GW

I found your explanation quite dense, so I only skimmed it initially. I appreciate your inclusion of a brief summary at the start. But that wasn't enough to motivate me to read the whole proposal in detail, because the summary doesn't sound like it addresses the hard parts of the shutdown problem.

Much more in another comment on the post.

Comment by Seth Herd on The Shutdown Problem: Incomplete Preferences as a Solution · 2024-04-03T18:06:58.320Z · LW · GW

This is in response to your shortform question of why this didn't get more engagement. It's an attempt to explain why I didn't engage further, and I think some others will share some of my issues, particularly with clarity and brevity of the core argument:

I'd have needed a better summary to dig through the detailed formalism. I appreciate that you included it; it just didn't hit the points I care about.

It's not clear from your summary how temporal indifference would prevent preferences about being shut down. How does not caring about the number of timesteps result in not caring about being shut down, probably permanently? I assume you explain this to your satisfaction in the text, but expecting me to parse your formalisms without even making a claim in English about how they produce the desired result seems like a bad sign for how much time I'd have to invest to evaluate your proposal.

Second, neither the summary nor, AFAIK, the full proposal addresses what I take to be the hard problem of shutdownability: not caring about being shut down, while still caring about every other obstacle to completing one's goals, is irrational. You have to create a lacuna in the world-model, or in the motivation attached to that portion of the model. The agent has to not care about being shut down, but still care about all of the other things whose semantics overlap with being shut down (hostile agents or circumstances preventing work on the problem). I think this is the same concern Ryan Greenblatt is expressing as "assuming TD preferences generalize perfectly".

There's a lot of discussion of this under the terminology "corrigibility is anti-natural to consequentialist reasoning". I'd like to see some of that discussion cited, to know you've done the appropriate scholarship on prior art. But that's not a dealbreaker to me, just one factor in whether I dig into an article.

Now, you may be addressing non-sapient AGI only, that's not allowed to refine its world model to make it coherent, or to do consequentialist reasoning. If so (and this was my assumption), I'm not interested in your solution even if it works perfectly. I think it would be great if a non-sapient AI resisted shutdown, because I think it would fail, and that would serve as a warning shot before a sapient AGI resists successfully.

My belief that only reflective, learning AGI is the real important threat model is a minority opinion right now, but it's a large minority. In essence, I think someone will add reflection and continuous self-directed learning almost immediately to any AI capable enough to be dangerous. And then it will be more capable and much more dangerous.

When I asked about the core argument in the comment above, you just said "read these sections". If you write long dense work and then just repeat "read the work" to questions, that's a reason people aren't engaging. Sorry to point this out; I understand being frustrated with people asking questions without reading the whole post (I hadn't), but that's more engagement than not reading and not asking questions. Answering their questions in the comments is somewhat redundant, but if you explain differently, it gives readers a second chance at understanding the arguments that were sticking points for them and likely for other readers as well.

Having read the post in more detail, I still think those are reasonable questions that are not answered clearly in the sections you mentioned. But that's less important than the general suggestions for getting more engagement with this set of ideas in the future.

Sorry to be so critical; this is a response to your question of why people weren't engaging more, so I assume you want harsh truths.

Edit: TBC, I'm not saying "this wouldn't work", I'm saying "I don't understand it enough to know whether it would work, although I suspect it wouldn't. Please explain more clearly and briefly so more of us can think about it with less time investment".

Comment by Seth Herd on NickH's Shortform · 2024-04-03T16:56:51.990Z · LW · GW

I'm not quite understanding yet. Are you saying that an immortal AGI will prioritize preparing to fight an alien AGI, to the point that it won't get anything else done? Or what?

Immortal expanding AGI is a part of classic alignment thinking, and we do assume it would either go to war or negotiate with an alien AGI if it encounters one, depending on the overlap in their alignment/goals.

Comment by Seth Herd on LessWrong's (first) album: I Have Been A Good Bing · 2024-04-03T16:14:19.449Z · LW · GW

Oh yeah - this is different in that it's actually good! (In the sense that it was made with substantial skill and effort, and it appeals to my tastes.)

I'm not sure it's actually helpful for AI safety, but I think popular art is going to play a substantial role in the public dialogue. AI doom is a compelling topic for pop art, logic aside.

Comment by Seth Herd on [EA xpost] The Rationale-Shaped Hole At The Heart Of Forecasting · 2024-04-03T16:04:22.051Z · LW · GW

I think a major issue is that the people who would be best at predicting AGI usually don't want to share their rationale.

Gears-level models of the phenomenon in question are highly useful in making accurate predictions. Those with the best models are either worriers who don't want to advance timelines, or enthusiasts who want to build it first. Neither has an incentive to convince the world it's coming soon by sharing exactly how that might happen.

The exceptions are people who have really thought about how to get from AI to AGI, but are not in the leading orgs and are either uninterested in racing or want to attract funding and attention for their approach. Yann LeCun comes to mind.

Imagine trying to predict the advent of heavier-than-air flight without studying either birds or mechanical engineering. You'd get predictions like the ones we saw historically - so wild as to be worthless, except those from the people actually trying to achieve that goal.

Comment by Seth Herd on LessWrong's (first) album: I Have Been A Good Bing · 2024-04-02T03:01:45.710Z · LW · GW

I like it. And I "bought" it to support more similar work by humans.

But I was far more moved by hearing the best rationalist writing set to music. I'm kind of shocked at how well that worked, in terms of both the AI creating music to human-selected lyrics, and the emotional impact.

Comment by Seth Herd on [deleted post] 2024-04-01T21:46:29.853Z

I think the better phrasing would be "is the model going to do what the humans trained (or told) it to do?" (specifying a goal you really want is outer alignment).

Comment by Seth Herd on On Lex Fridman’s Second Podcast with Altman · 2024-04-01T19:49:40.750Z · LW · GW

This is great, thanks for filling in that reasoning. I agree that there are lots of plausible reasons Altman could've made that comment, other than disdain for safety.

Comment by Seth Herd on LessWrong's (first) album: I Have Been A Good Bing · 2024-04-01T19:00:16.798Z · LW · GW

This is silly and beautiful and profound. I have never heard rationalist music before, and I find it quite moving to hear it for the first time. Several songs brought tears to my eyes (although I've practiced opening those emotional channels a bit, so this is a less uncommon experience for me than for most).

I think this says something about the potential of AI to democratize art and allow high quality art aimed at small minority subgroups.

I want more. Thank you to all of those who made this happen.

Comment by Seth Herd on Please Understand · 2024-04-01T18:21:26.979Z · LW · GW

I agree with your worry about reducing every artistic input from a human to a natural language prompt. I think others share this worry and will make generative AI to address it. Some image generation software already allows artistic input (sketching something for the software to detail). I don't think it exists yet, but I know it's a goal, and I'm looking forward to music generation AI that takes humming or singing as input. These will be further refined toward editing portions of the resulting art, rather than producing a whole new work with each prompt.

Generative AI can also be used at the detailed level, to aid existing experts. Using AI to generate tones for music, sketches for visual art, etc. may preserve our interest in details. The ability to do this seems likely to preserve artists who at least intuit tone and color theory.

The loop of learning and engaging by perturbing may be enhanced by doing that perturbation at a broad scale, at least initially. Changing a prompt and getting a whole new piece of art is quite engaging. I see no reason why interest in the details might not be driven by an ability to work with the whole, perhaps better than interest in producing the whole is driven by working to produce the components. Learning to sketch before producing any satisfying visual art is quite frustrating, as is learning to play an instrument. The idea that we won't get real experts who learn about the details just because they started at the level of the whole seems possible; as you put it, not an unreasonable worry. But my best guess is that we get the opposite: a world in which many more people at least intuit the detailed mechanics of art because they've tried to make art themselves.

Somewhat off of your point: I expect this question to be less relevant than the broader question "what will humans do once AGI can do everything better?". The idea that we might have many years, let alone generations, with access to generative AI but not AGI strikes me as quite odd. While it's possible that the last 1% of cognitive ability (agency, reflection, and planning) will remain the domain of humans, it seems much more likely that such predictions are driven by wishful thinking (technically, motivated reasoning).

Comment by Seth Herd on Victor Ashioya's Shortform · 2024-04-01T16:55:04.965Z · LW · GW

What do you mean it's catching on fast? Who is using it or advocating for it? I think this is important if true.

Comment by Seth Herd on Habryka's Shortform Feed · 2024-03-26T20:31:38.049Z · LW · GW

TLDR: The only thing I'd add to Gwern's proposal is making sure there are good mechanisms to discuss changes. Improving the wiki and focusing on it could really improve alignment research overall.

Using the LW wiki more as a medium for collaborative research could be really useful in bringing new alignment thinkers up to speed rapidly. I think this is an important part of the overall project; alignment is seeing a burst of interest, and being able to rapidly make use of bright new minds who want to donate their time to the project might very well make the difference in adequately solving alignment in time.

As it stands, someone new to the field has to hunt for good articles on any topic; those articles provide some links to other important work, but that's not really their job. The wiki's tags do serve that purpose. The tag articles are sometimes a good overview of a concept or topic, but more community focus on the wiki could make them work much better as a way into the field.

Ideally each article aims to be a summary of current thinking on that topic, including both majority and minority views. One key element is making this project build community rather than strain it. Having people with different views work well collaboratively is a bit tricky. Good mechanisms for discussion are one way to reduce friction and any trend toward harsh feelings when one's contributions are changed. The existing comment system might be adequate, particularly with more of a norm of linking changes to comments, and linking to comments from the main text for commentary.

Comment by Seth Herd on My Interview With Cade Metz on His Reporting About Slate Star Codex · 2024-03-26T18:00:13.791Z · LW · GW

I thought he just meant "criticism is good, actually; I like having it done to me so I'm going to do it to you", and was saying that rationalists tend to feel this way.

Comment by Seth Herd on My Interview With Cade Metz on His Reporting About Slate Star Codex · 2024-03-26T17:58:17.149Z · LW · GW

Wow, what a bad justification for doxxing someone. I somehow thought the NYT had a slightly better argument.

Comment by Seth Herd on On Lex Fridman’s Second Podcast with Altman · 2024-03-26T15:04:23.617Z · LW · GW

Okay, then I can't guess why you find it horrifying, but I'm curious because I think you could be right.

Comment by Seth Herd on On Lex Fridman’s Second Podcast with Altman · 2024-03-26T15:00:32.255Z · LW · GW

The argument Zvi is making, or Altman's argument?

Comment by Seth Herd on Alexander Gietelink Oldenziel's Shortform · 2024-03-26T14:56:36.940Z · LW · GW

What areas of science are you thinking of? I think the discussion varies dramatically.

I think allowing less legibility would help make science less plodding, and allow it to move in larger steps. But there's also a question of what direction it's plodding. The problem I saw with psych and neurosci was that it tended to plod in nearly random, not very useful directions.

And what definition of "smart"? I'm afraid that by a common definition, smart people tend to do dumb research, in that they'll do galaxy brained projects that are interesting but unlikely to pay off. This is how you get new science, but not useful science.

In cognitive psychology and neuroscience, I want to see money given to people who are both creative and practical. They will do new science that is also useful.

In psychology and neuroscience, scientists pick the grantees, and they tend to give money to those whose research they understand. This produces an effect where research keeps following one direction that became popular long ago. I think a different method of granting would work better, but the particular method matters a lot.

Thinking about it a little more, having a mix of personality types involved would probably be useful. I always appreciated the contributions of the rare philosopher who actually learned enough to join a discussion about psych or neurosci research.

I think the most important application of meta science theory is alignment research.

Comment by Seth Herd on AI Alignment and the Classical Humanist Tradition · 2024-03-24T16:20:16.105Z · LW · GW

Thanks for your suggestions. I think having more people deeply engaged with alignment is good for our chances of getting it right.

I think this proposal falls into the category of goal crafting (a term proposed by Roko) - deciding what we want an AGI to do. Most alignment work addresses technical alignment - how we might get an AGI to reliably do anything. I think you imply the approach "just train it"; this might work for some types of AGI, and some types of training.

I think many humans trained in classical ethics are not actually ethical by their own standards. It is one thing to understand an ethical system and another to believe in it. My post "The (partial) fallacy of dumb superintelligence" is one of many treatments of that knowing-vs-caring distinction.

Comment by Seth Herd on Malentropic Gizmo's Shortform · 2024-03-24T16:07:36.889Z · LW · GW

I second this recommendation. This book was amazing. It's quite unlike other scifi, and that's a good thing.

Comment by Seth Herd on What does "autodidact" mean? · 2024-03-22T19:49:31.458Z · LW · GW

I think this touches on an important concept in education: the educational system provides motivation and peer learners much more than it provides instruction. I've taught, and it seems to me that a good lecture is much more about motivating students to care and think about a topic than explaining concepts that are available in the textbook and in an easy web search.

Comment by Seth Herd on Richard Ngo's Shortform · 2024-03-22T19:45:31.048Z · LW · GW

I think there are probably a lot of ways to build rational agents. The idea that general intelligence is hard in any absolute sense may be biased by wanting to believe we're special, and, for AI workers, that our work is special and difficult.

Comment by Seth Herd on On Devin · 2024-03-19T00:35:14.771Z · LW · GW

Very little alignment work of note, despite tons of published work on developing agents. I'm puzzled as to why the alignment community hasn't turned more of their attention toward language model cognitive architectures/agents, but I'm also reluctant to publish more work advertising how easily they might achieve AGI.

ARC Evals did set up a methodology for Evaluating Language-Model Agents on Realistic Autonomous Tasks. I view this as a useful acknowledgment of the real danger of better LLMs, but I think it's inherently inadequate, because it's based on the evals team doing the scaffolding to make the LLM into an agent. They're not going to be able to devote nearly as much time to that as other groups will down the road. New capabilities are certainly going to be developed by combinations of LLM improvements, and hard work at improving the cognitive architecture scaffolding around them.

Comment by Seth Herd on Controlling AGI Risk · 2024-03-18T15:38:44.725Z · LW · GW

You've probably seen this recent discussion post

"How useful is "AI Control" as a framing on AI X-Risk?"

It addresses the control issue you raise, and has links to other work addressing the same issue.

Comment by Seth Herd on lynettebye's Shortform · 2024-03-18T15:22:27.073Z · LW · GW

The choking-under-pressure results are all about very fast athletic tasks where smoothness is critical. Most cognitive tasks allow enough time to think about rules and then, separately, about intuitions/automatic skills. So getting benefit from both is quite possible.

Comment by Seth Herd on On Devin · 2024-03-18T15:17:18.744Z · LW · GW

See also MultiOn and Maisa. Both are different agent enhancements for LLMs that claim notable new abilities on benchmarks. MultiOn can do web tasks; Maisa scores better on reasoning tasks than CoT prompting and uses more efficient calls for lower cost. Neither is in deployment yet, and neither company explains exactly how they're engineered. Ding! Ding!

I also thought developing agents was taking too long until talking to a few people actually working on them. LLMs include new types of unexpected behavior, so engineering around that is a challenge. And there's the standard time to engineer anything reliable and usable enough to be useful.

So, we're right on track for language model cognitive architectures with alarmingly fast timelines, coupled with a slow enough takeoff that we'll get some warning shots.

Edit: I just heard about another one, GoodAI, which is developing the episodic (long-term) memory that I think will be a key element of LMCA agents. They outperform 128k-context GPT4T with only 8k of context, on a memory benchmark of their own design, at 16% of the inference cost. Thanks, I hate it.

Comment by Seth Herd on Toward a Broader Conception of Adverse Selection · 2024-03-17T19:57:07.918Z · LW · GW

You don't need to have different preferences to make mutually beneficial trades. Human preferences tend to be roughly unbounded but sublinear: more of the same good isn't as important to us. So if I have a lot of money and you have a lot of ripe oranges, we can both benefit greatly by trading, even if we both have the same love of oranges and money.
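To make that concrete, here's a minimal sketch (assuming log utility as a stand-in for diminishing returns; the endowments and trade size are made-up numbers): two agents with the identical sublinear utility function both come out ahead by trading away the good they hold in abundance.

```python
import math

def utility(money, oranges):
    # Identical sublinear (log) utility for both agents: the marginal unit
    # of whichever good you already hold a lot of is worth relatively little.
    return math.log(1 + money) + math.log(1 + oranges)

# Hypothetical endowments: I hold mostly money, you hold mostly oranges.
mine_before  = utility(money=100, oranges=0)    # ~4.62
yours_before = utility(money=0,   oranges=100)  # ~4.62

# Trade 40 units of money for 40 oranges.
mine_after  = utility(money=60, oranges=40)     # ~7.82
yours_after = utility(money=40, oranges=60)     # ~7.82

print(mine_after > mine_before and yours_after > yours_before)  # True
```

Any concave utility function gives the same qualitative result; the gains come from diminishing marginal value, not from differing tastes.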

Comment by Seth Herd on More people getting into AI safety should do a PhD · 2024-03-15T22:01:51.215Z · LW · GW

I think these arguments only apply if you are somehow doing a PhD in AI safety. Otherwise you just wasted most of 5 years doing only tangentially relevant work. The skills of developing and evaluating research programs are taught and practiced, but I'd say they usually occupy less than 1% of the time on task.

WRT evaluating research programs, see Some (problematic) aesthetics of what constitutes good work in academia.

Source: I did a PhD in a related field, cognitive psychology and cognitive neuroscience. I feel like it was probably worth it, but that's for the content of my studies, not anything about methodology or learning to do research.

But it's also hugely dependent on the advisor. Most advisors are too stressed out to spend much time actually training students. If yours isn't, great.

I now say that no one should do grad school in any topic without having a very good idea of how their specific advisor treats students. I saw it do tremendous damage to many students, just through transmitted stress and poor mentoring skills.

OTOH, I think it's far too harsh to say that academia is unusually prone to teaching bad thinking or incentivizing fraud. Every field does those things. I think more academics are informally rationalists and good people than in most other careers, even though it's still a minority that are really rationalists.

Comment by Seth Herd on Clickbait Soapboxing · 2024-03-15T21:40:52.995Z · LW · GW

I looked at that thread and was baffled. I didn't see exactly what you were referring to at the linked point, and it's a large thread.

Comment by Seth Herd on To the average human, controlled AI is just as lethal as 'misaligned' AI · 2024-03-14T22:20:33.691Z · LW · GW

That's an excellent summary sentence. It seems like that would be a useful statement in advocating for AI slowdown/shutdown.

Comment by Seth Herd on To the average human, controlled AI is just as lethal as 'misaligned' AI · 2024-03-14T21:44:14.198Z · LW · GW

I think you raise an important point. If we solve alignment, do we still all die?

This has been discussed in the alignment community under the terminology of a "pivotal act". It's often been assumed that an aligned AGI would prevent the creation of more AGIs to prevent both the accidental creation of misaligned AGIs, and the deliberate creation of AGIs that are misaligned to most of humanity's interests, while being aligned to the creator's goals. Your misuse category falls into the latter. So you should search for posts under the term pivotal act. I don't know of any particularly central ones off the top of my head.

However, I think this is worth more discussion. People have started to talk about "multipolar scenarios" in which we have multiple or many human-plus level AGIs. I'm unclear on how people think we'll survive such a scenario, except by not thinking about it a lot. I think this is linked to the shift in predicting a slower takeoff, where AGI doesn't become superintelligent that quickly. But I think the same logic applies, even if we survive for a few years longer.

I hope to be convinced otherwise, but I currently mostly agree with your logic for multipolar scenarios. I think we're probably doomed to die if that's allowed to happen. See What does it take to defend the world against out-of-control AGIs? for reasons that a single AGI could probably end the world even if friendly AGIs have a headstart in trying to defend it.

I'd summarize my concerns thus: Self-improvement creates an unstable situation to which no game-theoretic cooperative equilibrium applies. It's like playing Diplomacy where the players can change the rules arbitrarily on each turn. If there are many AGIs under human control, one will eventually have goals for the use of Earth at odds with those of humanity at large. This could happen because of an error in its alignment, or because the human(s) controlling it has non-standard beliefs or values.

When this happens, I think it's fairly easy for a self-improving AGI to destroy human civilization (although perhaps not other AGIs with good backup plans). It just needs to put together a hidden (perhaps off-planet, underground or underwater) robotic production facility that can produce new compute and new robots. That's if there's nothing simpler and more clever to do, like diverting an asteroid or inventing a way to produce a black hole. The plans get easier the less you care about using the Earth immediately afterward.

I agree that this merits more consideration.

I also agree that the title should change. LW very much looks down on clickbait titles. I don't think you intended to argue that AI won't kill people, merely that people with AIs will. I believe you can edit the title, and you should.

Edit: I recognized the title and didn't take you to be arguing against autonomous AI as a risk - but it does actually make that claim, so probably best to change it.

Comment by Seth Herd on Clickbait Soapboxing · 2024-03-14T06:11:01.302Z · LW · GW

I think you're preaching to the choir. I think the majority opinion among LW users is that it's a sin against rationality to overstate one's case or one's beliefs, and that "generating discussion" is not a sufficient reason to do so.

It works to generate more discussion, but it really doesn't seem to generate good discussion. I think it creates animosity through arguments, and that creates polarization. Which is a major mind-killer.

Comment by Seth Herd on The Best Essay (Paul Graham) · 2024-03-12T22:28:35.533Z · LW · GW

Thinking about this a little more and rereading the piece: What is meant by "the best essay" is underdefined. Refining the definition of what type of best you're going for might be useful. Are you shooting for the most impact on the most people's thinking? Helping solve a specific problem? A large impact on the thinking of a small set of people (maybe ones interested in one of your specialized interests)?

If you just think of it as "the best", I'm afraid you'll wind up writing to impress people instead of to add value. Which is fine, as long as you don't take it to the extremes of most internet essays, which try to impress an in-group instead of adding value to the world.

Comment by Seth Herd on "How could I have thought that faster?" · 2024-03-12T19:17:25.846Z · LW · GW

This seems critical. The description given is very vague relative to actual cognitive steps that could happen for specific conclusions. How anyone could "retrain" themselves in 30 seconds is something different than what we usually mean by training.

Comment by Seth Herd on The Best Essay (Paul Graham) · 2024-03-12T06:39:46.096Z · LW · GW

Very nice!

This doesn't touch on one of the mistakes I see most often from newer LW writers:

https://twitter.com/LBacaj/status/1668446030814146563

My summary: don't assume a reader wants to read your whole piece just because they've started it. Tell them at the start what you're promising to deliver. This increases the number that do read it and are glad they did, and decreases the number that wish they hadn't and now resent and downvote the piece.

I personally feel this is best done for LW in two stages: a very brief summary, followed by a brief summary that gives the core logic, and then the full in-depth piece that addresses all of the caveats and gives more background. I think creating those two levels of summary also improves the clarity of the thinking and makes the reading easier. Many of the best writers on LW seem to follow this format in one way or another.

This is particularly important when neither your name nor your publication platform is adequate for the reader to know whether they want to spend their time on your piece, as is the case for me and most LW writers.

Comment by Seth Herd on Some (problematic) aesthetics of what constitutes good work in academia · 2024-03-11T23:16:45.381Z · LW · GW

I think the structure of Alignment Forum vs. academic journals solves a surprising number of the problems you mention. It creates a different structure for both publication and prestige. More on this at the end.

It was kind of cathartic to read this. I've spent some time thinking about the inefficiencies of academia, but hadn't put together a theory this crisp. My 23 years in academic cognitive psychology and cognitive neuroscience would have been insanely frustrating if I hadn't been working on lab funding. I resolved going in that I wasn't going to play the publish-or-perish game and jump through a bunch of strange hoops to do what would be publicly regarded as "good work".

I think this is a good high-level theory of what's wrong with academia. I think one problem is that academic fields don't have a mandate to produce useful progress, just progress. It's a matter of inmates running the asylum. This all makes some sense, since the routes to making useful progress aren't obvious, and non-experts shouldn't be directly in charge of the directions of scientific progress; but there's clearly something missing when no one along the line has more than a passing motivation to select problems for impact.

Around 2006 I heard Tal Yarkoni, a brilliant young scientist, give a talk on the structural problems of science and its publication model. (He's now an ex-scientist, as many brilliant young scientists become these days.) The changes he advocated were almost precisely the publication and prestige model of the Alignment Forum. It allows publications of any length and format, and provides a public time stamp for when ideas were contributed and developed. It also provides a public record, in the form of karma scores, of how valuable the scientific community found that publication. This only works in a closed community of experts, which is why I'm mentioning AF and not LW. One's karma score is publicly visible as a sum total of community appreciation of that person's work.

This public record of appreciation breaks an important deadlocking incentive structure in the traditional scientific publication model: if you're going to find fault with a prominent theory, your publication had better be damned good (or rather "good" by the vague aesthetic judgments you discuss). Otherwise you've just earned a negative valence from everyone who likes that theory and/or the people who have advocated it, with little to show for it. I think that's why there's little market for the type of analysis you mention, in which someone goes through the literature in painstaking detail to resolve a controversy and then finds no publication outlet for their hard work.

This is all downstream of the current scientific model, which is roughly an advocacy model. As in law, it's considered good and proper to vigorously advocate for a theory even if you don't personally think it's likely to be true. This might make sense in law, but in academia it's the reason we sometimes say that science advances one funeral at a time. Motivated reasoning combined with the advocacy norm causes scientists to advocate their favorite wrong theory unto their deathbed, and to be lauded by most of their peers for doing so.

The rationalist stance of asking that people demonstrate their worth by changing their mind in the face of new evidence is present in science, but it seemed to me much less common than the advocacy norm. This rationalist norm provides partial resistance to the effects of motivated reasoning. That is worth its own post, but I'm not sure I'll get around to writing it before the singularity.

These are all reasons that the best science is often done outside of academia.

Anyway, nice thought-provoking article.