Posts

Goals selected from learned knowledge: an alternative to RL alignment 2024-01-15T21:52:06.170Z
After Alignment — Dialogue between RogerDearnaley and Seth Herd 2023-12-02T06:03:17.456Z
Corrigibility or DWIM is an attractive primary goal for AGI 2023-11-25T19:37:39.698Z
Sapience, understanding, and "AGI" 2023-11-24T15:13:04.391Z
Altman returns as OpenAI CEO with new board 2023-11-22T16:04:03.123Z
OpenAI Staff (including Sutskever) Threaten to Quit Unless Board Resigns 2023-11-20T14:20:33.539Z
We have promising alignment plans with low taxes 2023-11-10T18:51:38.604Z
Seth Herd's Shortform 2023-11-10T06:52:28.778Z
Shane Legg interview on alignment 2023-10-28T19:28:52.223Z
The (partial) fallacy of dumb superintelligence 2023-10-18T21:25:16.893Z
Steering subsystems: capabilities, agency, and alignment 2023-09-29T13:45:00.739Z
AGI isn't just a technology 2023-09-01T14:35:57.062Z
Internal independent review for language model agent alignment 2023-07-07T06:54:11.552Z
Simpler explanations of AGI risk 2023-05-14T01:29:29.289Z
A simple presentation of AI risk arguments 2023-04-26T02:19:19.164Z
Capabilities and alignment of LLM cognitive architectures 2023-04-18T16:29:29.792Z
Agentized LLMs will change the alignment landscape 2023-04-09T02:29:07.797Z
AI scares and changing public beliefs 2023-04-06T18:51:12.831Z
The alignment stability problem 2023-03-26T02:10:13.044Z
Human preferences as RL critic values - implications for alignment 2023-03-14T22:10:32.823Z
Clippy, the friendly paperclipper 2023-03-02T00:02:55.749Z
Are you stably aligned? 2023-02-24T22:08:23.098Z

Comments

Comment by Seth Herd on Habryka's Shortform Feed · 2024-03-26T20:31:38.049Z · LW · GW

TLDR: The only thing I'd add to Gwern's proposal is making sure there are good mechanisms to discuss changes. Improving the wiki and focusing on it could really improve alignment research overall.

Using the LW wiki more as a medium for collaborative research could be really useful in bringing new alignment thinkers up to speed rapidly. I think this is an important part of the overall project; alignment is seeing a burst of interest, and being able to rapidly make use of bright new minds who want to donate their time to the project might very well make the difference in adequately solving alignment in time.

As it stands, someone new to the field has to hunt for good articles on any topic; individual articles provide some links to other important articles, but that's not really their job. The wiki's tags do serve that purpose. The tag articles are sometimes a good overview of a concept or topic, but more community focus on the wiki could make them work much better as a way to bring newcomers up to speed.

Ideally each article aims to be a summary of current thinking on that topic, including both majority and minority views. One key element is making this project build community rather than strain it. Having people with different views work well collaboratively is a bit tricky. Good mechanisms for discussion are one way to reduce friction and any trend toward harsh feelings when one's contributions are changed. The existing comment system might be adequate, particularly with more of a norm of linking changes to comments, and linking to comments from the main text for commentary.

Comment by Seth Herd on My Interview With Cade Metz on His Reporting About Slate Star Codex · 2024-03-26T18:00:13.791Z · LW · GW

I thought he just meant "criticism is good, actually; I like having it done to me so I'm going to do it to you", and was saying that rationalists tend to feel this way.

Comment by Seth Herd on My Interview With Cade Metz on His Reporting About Slate Star Codex · 2024-03-26T17:58:17.149Z · LW · GW

Wow, what a bad justification for doxxing someone. I somehow thought the NYT had a slightly better argument.

Comment by Seth Herd on On Lex Fridman’s Second Podcast with Altman · 2024-03-26T15:04:23.617Z · LW · GW

Okay, then I can't guess why you find it horrifying, but I'm curious because I think you could be right.

Comment by Seth Herd on On Lex Fridman’s Second Podcast with Altman · 2024-03-26T15:00:32.255Z · LW · GW

The argument Zvi is making, or Altman's argument?

Comment by Seth Herd on Alexander Gietelink Oldenziel's Shortform · 2024-03-26T14:56:36.940Z · LW · GW

What areas of science are you thinking of? I think the discussion varies dramatically.

I think allowing less legibility would help make science less plodding, and allow it to move in larger steps. But there's also a question of what direction it's plodding. The problem I saw with psych and neurosci was that it tended to plod in nearly random, not very useful directions.

And what definition of "smart"? I'm afraid that by a common definition, smart people tend to do dumb research, in that they'll do galaxy brained projects that are interesting but unlikely to pay off. This is how you get new science, but not useful science.

In cognitive psychology and neuroscience, I want to see money given to people who are both creative and practical. They will do new science that is also useful.

In psychology and neuroscience, scientists pick the grantees, and they tend to give money to those whose research they understand. This produces an effect where research keeps following one direction that became popular long ago. I think a different method of granting would work better, but the particular method matters a lot.

Thinking about it a little more, having a mix of personality types involved would probably be useful. I always appreciated the contributions of the rare philosopher who actually learned enough to join a discussion about psych or neurosci research.

I think the most important application of meta science theory is alignment research.

Comment by Seth Herd on AI Alignment and the Classical Humanist Tradition · 2024-03-24T16:20:16.105Z · LW · GW

Thanks for your suggestions. I think having more people deeply engaged with alignment is good for our chances of getting it right.

I think this proposal falls into the category of goal crafting (a term proposed by Roko) - deciding what we want an AGI to do. Most alignment work addresses technical alignment - how we might get an AGI to reliably do anything. I think you imply the approach "just train it"; this might work for some types of AGI, and some types of training.

I think many humans trained in classical ethics are not actually ethical by their own standards. It is one thing to understand an ethical system and another to believe in it. My post "The (partial) fallacy of dumb superintelligence" is one of many treatments of that knowing vs. caring distinction.

Comment by Seth Herd on Malentropic Gizmo's Shortform · 2024-03-24T16:07:36.889Z · LW · GW

I second this recommendation. This book was amazing. It's quite unlike other scifi, and that's a good thing.

Comment by Seth Herd on What does "autodidact" mean? · 2024-03-22T19:49:31.458Z · LW · GW

I think this touches on an important concept in education: the educational system provides motivation and peer learners much more than it provides instruction. I've taught, and it seems to me that a good lecture is much more about motivating students to care and think about a topic than explaining concepts that are available in the textbook and in an easy web search.

Comment by Seth Herd on Richard Ngo's Shortform · 2024-03-22T19:45:31.048Z · LW · GW

I think there are probably a lot of ways to build rational agents. The idea that general intelligence is hard in any absolute sense may be biased by wanting to believe we're special, and for AI workers, that our work is special and difficult.

Comment by Seth Herd on On Devin · 2024-03-19T00:35:14.771Z · LW · GW

Very little alignment work of note, despite tons of published work on developing agents. I'm puzzled as to why the alignment community hasn't turned more of their attention toward language model cognitive architectures/agents, but I'm also reluctant to publish more work advertising how easily they might achieve AGI.

ARC Evals did set up a methodology for Evaluating Language-Model Agents on Realistic Autonomous Tasks. I view this as a useful acknowledgment of the real danger of better LLMs, but I think it's inherently inadequate, because it's based on the evals team doing the scaffolding to make the LLM into an agent. They're not going to be able to devote nearly as much time to that as other groups will down the road. New capabilities are certainly going to be developed by combinations of LLM improvements, and hard work at improving the cognitive architecture scaffolding around them.

Comment by Seth Herd on Controlling AGI Risk · 2024-03-18T15:38:44.725Z · LW · GW

You've probably seen this recent discussion post

"How useful is "AI Control" as a framing on AI X-Risk?"

It addresses the control issue you raise, and has links to other work addressing the same issue.

Comment by Seth Herd on lynettebye's Shortform · 2024-03-18T15:22:27.073Z · LW · GW

The choking under pressure results are all about very fast athletic tasks where smoothness is critical. Most cognitive tasks will have enough time to think about both rules and then separately about intuitions/automatic skills. So getting benefit from both is quite possible.

Comment by Seth Herd on On Devin · 2024-03-18T15:17:18.744Z · LW · GW

See also MultiOn and Maisa. Both are different agent enhancements for LLMs that claim notable new abilities on benchmarks. MultiOn can do web tasks, and Maisa scores better on reasoning tasks than CoT prompting while using more efficient calls for lower cost. Neither is in deployment yet, and neither company explains exactly how they're engineered. Ding! Ding!

I also thought developing agents was taking too long until talking to a few people actually working on them. LLMs exhibit new types of unexpected behavior, so engineering around that is a challenge. And there's the standard time it takes to engineer anything reliable and usable enough to be useful.

So, we're right on track for language model cognitive architectures with alarmingly fast timelines, coupled with a slow enough takeoff that we'll get some warning shots.

Edit:  I just heard about another one, GoodAI, developing the episodic (long term) memory that I think will be a key element of LMCA agents. They outperform 128k context GPT4T with only 8k of context, on a memory benchmark of their own design, at 16% of the inference cost. Thanks, I hate it.

Comment by Seth Herd on Toward a Broader Conception of Adverse Selection · 2024-03-17T19:57:07.918Z · LW · GW

You don't need to have different preferences to make mutually beneficial trades. Human preferences tend to be roughly unbounded but sublinear - more of the same good isn't as important to us. So if I have a lot of money and you have a lot of ripe oranges, we can both benefit greatly by trading even if we both have the same love of oranges and money.
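A toy numeric illustration of that point, assuming both parties have the identical sublinear utility u(money, oranges) = sqrt(money) + sqrt(oranges) (the numbers are made up for the example):

```python
# Both traders value money and oranges identically, with sublinear utility.
from math import sqrt

def utility(money: float, oranges: float) -> float:
    return sqrt(money) + sqrt(oranges)

# I start with $100 and no oranges; you start with 100 oranges and no money.
before_me, before_you = utility(100, 0), utility(0, 100)   # 10.0 each
# We trade half my money for half your oranges.
after_me, after_you = utility(50, 50), utility(50, 50)     # ~14.1 each
print(before_me, before_you, after_me, after_you)
```

Each side goes from a utility of 10 to about 14.1, so both gain from the trade despite having identical preferences.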

Comment by Seth Herd on More people getting into AI safety should do a PhD · 2024-03-15T22:01:51.215Z · LW · GW

I think these arguments only apply if you are somehow doing a PhD in AI safety. Otherwise you just wasted most of 5 years doing only tangentially relevant work. The skills of developing and evaluating research programs are taught and practiced, but I'd say they usually occupy less than 1% of the time on task.

WRT evaluating research programs, see Some (problematic) aesthetics of what constitutes good work in academia.

Source: I did a PhD in a related field, cognitive psychology and cognitive neuroscience. I feel like it was probably worth it, but that's for the content of my studies, not anything about methodology or learning to do research.

But it's also hugely dependent on the advisor. Most advisors are too stressed out to spend much time actually training students. If yours isn't, great.

I now say that no one should do grad school in any topic without having a very good idea of how their specific advisor treats students. I saw it do tremendous damage to many students, just through transmitted stress and poor mentoring skills.

OTOH I think it's far too harsh to single out academia as teaching bad thinking or incentivizing fraud. Every field does those things. I think more academics are informally rationalists and good people than are found in other careers, even though it's still a minority that are really rationalists.

Comment by Seth Herd on Clickbait Soapboxing · 2024-03-15T21:40:52.995Z · LW · GW

I looked at that thread and was baffled. I didn't see exactly what you were referring to at the linked point, and it's a large thread.

Comment by Seth Herd on To the average human, controlled AI is just as lethal as 'misaligned' AI · 2024-03-14T22:20:33.691Z · LW · GW

That's an excellent summary sentence. It seems like that would be a useful statement in advocating for AI slowdown/shutdown.

Comment by Seth Herd on To the average human, controlled AI is just as lethal as 'misaligned' AI · 2024-03-14T21:44:14.198Z · LW · GW

I think you raise an important point. If we solve alignment, do we still all die?

This has been discussed in the alignment community under the terminology of a "pivotal act". It's often been assumed that an aligned AGI would prevent the creation of more AGIs to prevent both the accidental creation of misaligned AGIs, and the deliberate creation of AGIs that are misaligned to most of humanity's interests, while being aligned to the creator's goals. Your misuse category falls into the latter. So you should search for posts under the term pivotal act. I don't know of any particularly central ones off the top of my head.

However, I think this is worth more discussion. People have started to talk about "multipolar scenarios" in which we have multiple or many human-plus level AGIs. I'm unclear on how people think we'll survive such a scenario, except by not thinking about it a lot. I think this is linked to the shift in predicting a slower takeoff, where AGI doesn't become superintelligent that quickly. But I think the same logic applies, even if we survive for a few years longer.

I hope to be convinced otherwise, but I currently mostly agree with your logic for multipolar scenarios. I think we're probably doomed to die if that's allowed to happen. See What does it take to defend the world against out-of-control AGIs? for reasons that a single AGI could probably end the world even if friendly AGIs have a headstart in trying to defend it.

I'd summarize my concerns thus: Self-improvement creates an unstable situation to which no game-theoretic cooperative equilibrium applies. It's like playing Diplomacy where the players can change the rules arbitrarily on each turn. If there are many AGIs under human control, one will eventually have goals for the use of Earth at odds with those of humanity at large. This could happen because of an error in its alignment, or because the human(s) controlling it has non-standard beliefs or values.

When this happens, I think it's fairly easy for a self-improving AGI to destroy human civilization (although perhaps not other AGIs with good backup plans). It just needs to put together a hidden (perhaps off-planet, underground or underwater) robotic production facility that can produce new compute and new robots. That's if there's nothing simpler and more clever to do, like diverting an asteroid or inventing a way to produce a black hole. The plans get easier the less you care about using the Earth immediately afterward.

I agree that this merits more consideration.

I also agree that the title should change. LW very much looks down on clickbait titles. I don't think you intended to argue that AI won't kill people, merely that people with AIs will. I believe you can edit the title, and you should.

Edit: I recognized the title and didn't take you to be arguing against autonomous AI as a risk - but it does actually make that claim, so probably best to change it.

Comment by Seth Herd on Clickbait Soapboxing · 2024-03-14T06:11:01.302Z · LW · GW

I think you're preaching to the choir. I think the majority opinion among LW users is that it's a sin against rationality to overstate one's case or one's beliefs, and that "generating discussion" is not a sufficient reason to do so.

It works to generate more discussion, but it really doesn't seem to generate good discussion. I think it creates animosity through arguments, and that creates polarization. Which is a major mind-killer.

Comment by Seth Herd on The Best Essay (Paul Graham) · 2024-03-12T22:28:35.533Z · LW · GW

Thinking about this a little more and rereading the piece: What is meant by "the best essay" is underdefined. Refining the definition of what type of best you're going for might be useful. Are you shooting for the most impact on the most people's thinking? Helping solve a specific problem? A large impact on the thinking of a small set of people (maybe ones interested in one of your specialized interests)?

If you just think of it as "the best", I'm afraid you'll wind up writing to impress people instead of to add value. Which is fine, as long as you don't take it to the extremes of most internet essays, which try to impress an ingroup instead of adding value to the world.

Comment by Seth Herd on "How could I have thought that faster?" · 2024-03-12T19:17:25.846Z · LW · GW

This seems critical. The description given is very vague relative to actual cognitive steps that could happen for specific conclusions. How anyone could "retrain" themselves in 30 seconds is something different than what we usually mean by training.

Comment by Seth Herd on The Best Essay (Paul Graham) · 2024-03-12T06:39:46.096Z · LW · GW

Very nice!

This doesn't touch on one of the mistakes I see most often from newer LW writers:

https://twitter.com/LBacaj/status/1668446030814146563

My summary: don't assume a reader wants to read your whole piece just because they've started it. Tell them at the start what you're promising to deliver. This increases the number that do read it and are glad they did, and decreases the number that wish they hadn't and now resent and downvote the piece.

I personally feel this is best done for LW with two levels of summary: a very brief summary, followed by a brief summary that gives the core logic, and then the full in-depth piece that addresses all of the caveats and gives more background. I think creating those two levels of summary also improves the clarity of the thinking and makes the reading easier. Many of the best writers on LW seem to follow this format in one way or another.

This is particularly important when neither your name nor your publication platform is adequate for the reader to know whether they want to spend their time on your piece, as is the case for me and most LW writers.

Comment by Seth Herd on Some (problematic) aesthetics of what constitutes good work in academia · 2024-03-11T23:16:45.381Z · LW · GW

I think the structure of Alignment Forum vs. academic journals solves a surprising number of the problems you mention. It creates a different structure for both publication and prestige. More on this at the end.

It was kind of cathartic to read this. I've spent some time thinking about the inefficiencies of academia, but hadn't put together a theory this crisp. My 23 years in academic cognitive psychology and cognitive neuroscience would have been insanely frustrating if I hadn't been working on lab funding. I resolved going in that I wasn't going to play the publish-or-perish game and jump through a bunch of strange hoops to do what would be publicly regarded as "good work".

I think this is a good high-level theory of what's wrong with academia. I think one problem is that academic fields don't have a mandate to produce useful progress, just progress. It's a matter of inmates running the asylum. This all makes some sense, since the routes to making useful progress aren't obvious, and non-experts shouldn't be directly in charge of the directions of scientific progress; but there's clearly something missing when no one along the line has more than a passing motivation to select problems for impact.

Around 2006 I heard Tal Yarkoni, a brilliant young scientist, give a talk on the structural problems of science and its publication model. (He's now an ex-scientist, as many brilliant young scientists become these days.) The changes he advocated were almost precisely the publication and prestige model of the Alignment Forum. It allows publications of any length and format, and provides a public time stamp for when ideas were contributed and developed. It also provides a public record, in the form of karma scores, for how valuable the scientific community found that publication. This only works in a closed community of experts, which is why I'm mentioning AF and not LW. One's karma score is publicly visible as a sum total of community appreciation of that person's work.

This public record of appreciation breaks an important deadlocking incentive structure in the traditional scientific publication model: if you're going to find fault with a prominent theory, your publication had better be damned good (or rather "good" by the vague aesthetic judgments you discuss). Otherwise you've just earned a negative valence from everyone who likes that theory and/or the people who have advocated it, with little to show for it. I think that's why there's little market for the type of analysis you mention, in which someone goes through the literature in painstaking detail to resolve a controversy, and then finds no publication outlet for their hard work.

This is all downstream of the current scientific model, which is roughly an advocacy model. As in law, it's considered good and proper to vigorously advocate for a theory even if you don't personally think it's likely to be true. This might make sense in law, but in academia it's the reason we sometimes say that science advances one funeral at a time. Motivated reasoning combined with the advocacy norm causes scientists to advocate their favorite wrong theory unto their deathbed, and be lauded by most of their peers for doing so.

The rationalist stance of asking that people demonstrate their worth by changing their mind in the face of new evidence is present in science, but it seemed to me much less common than the advocacy norm. This rationalist norm provides partial resistance to the effects of motivated reasoning. That is worth its own post, but I'm not sure I'll get around to writing it before the singularity.

These are all reasons that the best science is often done outside of academia.

Anyway, nice thought-provoking article.

Comment by Seth Herd on An Optimistic Solution to the Fermi Paradox · 2024-03-11T05:40:27.780Z · LW · GW

I wonder more why nobody has visited than why we don't see obvious signs of anyone; as you say, that's easy enough to explain away.

Most of your explanations would also imply that those aliens don't care to help fledgling civilizations. If they're post-singularity, they could do that fairly easily by sending out probes that could share information and technology. I guess the neighbors being selfish is better than some great filter taking everyone out.

I take some comfort in misaligned AGI not being a good candidate for the great filter. A misaligned maximizer would be sweeping out through our light cone, and we'd be dead, not observing an empty universe.

You're probably familiar with the dark forest explanation; advanced civilizations don't like to give evidence of their existence, because it could get them killed by others. I find it somewhat possible that there are probes watching us to ensure we don't spawn a misaligned AGI, but not acting to help lest other watchers detect them. If I were running a technologically advanced civilization that didn't care too much about maximum expansion, I'd send probes to detect potentially hostile civilizations or AGIs.

It's also quite plausible that we're the first in our galaxy. Our position on the spiral arm of the galaxy makes us much less vulnerable to sterilizing nova events. I've heard one astrophysicist present this as her favorite solution, but I don't remember her name.

See also The Fermi Paradox: What did Sandberg, Drexler and Ord Really Dissolve? on an important recent paper claiming that there's really no paradox; we're overestimating some probability in the Drake equation, we just don't know which one or why.

Comment by Seth Herd on Let's build definitely-not-conscious AI · 2024-03-09T14:37:18.997Z · LW · GW

That's a great idea, but it's hard to convince people to do it, because all of the things you mention are on the easiest path to making effective AI. That's why humans have those properties; they're useful for effective cognition.

Comment by Seth Herd on Let's build definitely-not-conscious AI · 2024-03-09T14:33:17.283Z · LW · GW

Discussions of consciousness without defining which meaning you're using will usually cause confusion and arguments, as Tamsin Leake points out.

Consciousness is used to label many aspects of human cognition, which is complex. So "conscious" almost means "human-like", but without specifying on which dimensions you're drawing the analogy.

Comment by Seth Herd on How is Chat-GPT4 Not Conscious? · 2024-03-08T23:59:32.489Z · LW · GW

GPT4 is very likely partly conscious, in some senses of that word but not others.

What do you mean by "conscious"? After spending a long time reading about the science and philosophy of consciousness, I think most questions about consciousness will be resolved by carefully defining what you mean.

"How does (some type of) consciousness work" questions usually require an explanation in terms of neural information processing. Those are harder to come by, and I haven't gotten around to writing mine down since it's never seemed useful to either my professional or personal life. But maybe with AGI creeping toward human-style consciousness it will be.

It definitely has a bunch of complex and subtle distributed representations of what we'd call "concepts", which are part of the human conscious experience and our cognitive abilities.

Comment by Seth Herd on What if Alignment is Not Enough? · 2024-03-08T19:25:28.090Z · LW · GW

I think you present a good argument for plausibility.

For me to think this is likely to be important, it would take a stronger argument.

You mention proofs. I imagine they're correct, and based on infinite time passing. Everything that's possible will happen in infinite time. Whether this would happen within the heat death of the universe is a more relevant question.

For this to happen on a timescale that matters, it seems you're positing an incompetent superintelligence. It hasn't devoted enough of its processing to monitoring for these effects and correcting them when they happen. As a result, it eventually fails at its own goals.

This seems like it would only happen with an ASI with some particular blind spots for its intelligence.

Comment by Seth Herd on Many arguments for AI x-risk are wrong · 2024-03-06T22:46:15.395Z · LW · GW

I think this is a key crux of disagreement on alignment:

When I bring up the "actual RL algorithms don't seem very dangerous or agenty to me" point, people often respond with "Future algorithms will be different and more dangerous". 

I think this is a bad response for many reasons.

On the one hand, empiricism and assuming that the future will be much like the past have a great track record.

On the other, predicting the future is the name of the game in alignment. And while the future is reliably much like the past, it's never been exactly like the past.

So opinions pull in both directions.


On the object level, I certainly agree that existing RL systems aren't very agenty or dangerous. It seems like you're predicting that people won't make AI that's particularly agentic any time soon. It seems to me that they'll certainly want to. And I think it will be easy if non-agentic foundation models get good. Turning a smart foundation model into an agent is as simple as the prompt "make and execute a plan that accomplishes goal [x]. Use [these APIs] to gather information and take actions".
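A minimal sketch of what that kind of scaffolding can look like (call_llm and the tool entries are hypothetical placeholders, not any particular product's API):

```python
# Minimal agent loop around a foundation model. call_llm() and the TOOLS
# entries are hypothetical stand-ins, not a real library's API.

def call_llm(prompt: str) -> str:
    """Placeholder for whatever completion API is available."""
    raise NotImplementedError

TOOLS = {
    "search": lambda query: f"(search results for {query!r})",
    "write_file": lambda arg: "(file written)",
}

def run_agent(goal: str, max_steps: int = 10) -> None:
    history = f"Make and execute a plan that accomplishes this goal: {goal}\n"
    instructions = ("Respond with 'TOOL <name> <argument>' to act, "
                    "or 'DONE <summary>' when finished.\n")
    for _ in range(max_steps):
        reply = call_llm(history + instructions)
        if reply.startswith("DONE"):
            print(reply)
            return
        _, name, arg = reply.split(" ", 2)      # parse the requested action
        observation = TOOLS[name](arg)          # take the action
        history += f"{reply}\nObservation: {observation}\n"  # feed the result back
```

The point is just how little machinery sits between "smart non-agentic model" and "agent pursuing a goal with real-world tools".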

I think this is what Alex was pointing to in the OP by saying

I’m worried about people turning AIs into agentic systems using scaffolding and other tricks, and then instructing the systems to complete large-scale projects.

I think this is the default future, so much so that I don't think it matters if agency would emerge through RL. We'll build it in. Humans are burdened with excessive curiosity, optimism, and ambition. Especially the type of humans that head AI/AGI projects.

Comment by Seth Herd on Are we so good to simulate? · 2024-03-06T21:59:50.217Z · LW · GW

I don't think I was clear. I was trying to say that you can simulate just the minds in high-res, and the rest of the world in low-res. That reduces the computation by many orders of magnitude, since minds are such a tiny fraction of the physical universe.

Separately, I don't think there's any good reason to think we need molecular-level, let alone quantum-level process to explain consciousness. Thinking carefully about consciousness, it seems to me that neural networks, with specific properties and linked in specific ways, are adequate to explain our conscious experiences. But that's neither here nor there for the compressed simulation point. And I haven't written anything up, since explaining consciousness has little use in either academic neuroscience or AGI alignment.

Comment by Seth Herd on Are we so good to simulate? · 2024-03-04T17:27:41.010Z · LW · GW

What is supposed to be useful or desirable about running ancestor sims of the singularity? I don't remember a good argument for advanced civilizations or AGIs doing this.

I don't think it swings this discussion either way, but I think simulations could be run many orders of magnitude more efficiently by using low-res sims, scaling resolution only where necessary, except for full res for the minds involved. We've never actually experienced an atom, and neural networks don't need to be modeled to the molecular level to give good results. There is no good reason whatsoever to think we need quantum effects for intelligence or consciousness.

Comment by Seth Herd on Why I think it's net harmful to do technical safety research at AGI labs · 2024-02-26T15:24:11.245Z · LW · GW

There's nothing to blow the whistle on. Everyone knows that those labs are pursuing AGI.

We are not in direct danger yet, in all likelihood. I have short timelines, but there's almost no chance that any current work is at risk of growing smart enough to disempower humans. There's a difference between hitting the accelerator in the direction of a cliff, and holding it down as it gets close. Developing AGI internally is when we'll need and hopefully get whistleblowers.

Are you thinking of blowing the whistle on something in between work on AGI and getting close to actually achieving it?

Comment by Seth Herd on The Shutdown Problem: Three Theorems · 2024-02-25T20:07:11.036Z · LW · GW

Yes, that's a hard part. But specifying the goal accurately is often regarded as a potential failure point. So, if I'm right that this is a simpler, easier-to-specify alignment goal, that's progress. It also has the advantage of incorporating corrigibility as a byproduct, so it's resistant to partial failure - if you can tell that something went wrong in time, the AGI can be asked to shut down.

WRT the difficulty of using the AGI's understanding as its terminal goal, I think it's not trivial but quite doable, at least in some of the AGI architectures we can anticipate. See my two short posts Goals selected from learned knowledge: an alternative to RL alignment and The (partial) fallacy of dumb superintelligence.

Comment by Seth Herd on The Shutdown Problem: Incomplete Preferences as a Solution · 2024-02-25T20:01:24.229Z · LW · GW

I'm confused about how this proposal prevents preferences over being shut down, but preserves preferences over goals you want it to have.

Would your agent have a preference between making you $1 in one timestep and $1m in two timesteps?

Comment by Seth Herd on The Shutdown Problem: Three Theorems · 2024-02-23T01:48:23.654Z · LW · GW

Good questions. To me following instructions seems vastly simpler than working out what's best for all of humanity (and what counts as humanity) for an unlimited future. "Solving ethics" is often listed as a major obstacle to alignment, and I think we'll just punt on that difficult issue and align it to want to follow our current instructions instead of our inmost desires, let alone all of humanity's.

I realize this isn't fully satisfactory, so I'd like to delve into this more. It seems much simpler to guess "what did this individual mean by this request" than to guess "what does all of humanity want for all of time". Desires are poorly defined and understood. And what counts as humanity will become quite blurry if we get the ability to create AGIs and modify humans.

WRT ease, it seems like current LLMs already understand our instructions pretty well. So any AGI that incorporates LLMs or similar linguistic training will already be in the ballpark. And that's all it has to be, as long as it checks with the user before taking impactful actions.

It's critical that, in my linked post on DWIM, I include an "and check" portion. It seems like pretty trivial overhead for the AGI to briefly summarize the plan it came up with and ask for approval from its human operator, particularly for impactful plans.

WRT occasionally misunderstanding intentions and whether an action is "impactful" enough to check before executing actions: there's a bunch of stuff you can do to institute internal crosschecks in an LLM agent's internal thinking. See my Internal independent review for language model agent alignment if you're interested.
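A toy sketch of that check-before-acting gate (summarize_plan, estimate_impact, and the threshold are hypothetical stand-ins for illustration, not a claim about how the real components would work):

```python
# Gate execution of impactful plans behind human approval.
IMPACT_THRESHOLD = 0.5   # set by the human operator; illustrative value

def summarize_plan(plan: str) -> str:
    """Stand-in for an LLM call that condenses the plan for the operator."""
    return plan[:200]

def estimate_impact(plan: str) -> float:
    """Stand-in for however the agent scores how consequential a plan is."""
    return 1.0

def execute_with_check(plan: str, execute) -> None:
    if estimate_impact(plan) >= IMPACT_THRESHOLD:
        print("Proposed plan:", summarize_plan(plan))
        if input("Approve? [y/N] ").strip().lower() != "y":
            print("Plan rejected; doing nothing.")
            return
    execute(plan)
```

Internal crosschecks like the independent-review idea would slot in as additional gates alongside, or ahead of, the human check.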

Comment by Seth Herd on Dual Wielding Kindle Scribes · 2024-02-21T21:52:54.476Z · LW · GW

Agreed that it's not as simple as light hitting the eye. I've also tried to work on external monitors that are relatively far away for neutral focal distance. But I couldn't find any good research or even theory on it.

FWIW, the majority of my book reading in white on black has been done in the dark in bed before sleeping, but at low contrast.

Comment by Seth Herd on Dual Wielding Kindle Scribes · 2024-02-21T19:41:37.898Z · LW · GW

To your point about e-ink screens reducing eyestrain: white-on-black text may do the same thing. It's reducing the light hitting your eyeballs by a factor of ten to a hundred. I've read books on a large phone screen in dark mode for a long time (since Kindle for smartphones has existed, I guess) and experienced zero perceptible eyestrain.

This has nothing to do with the distraction-reducing portion of your post, nor the handwriting notes portion.

Comment by Seth Herd on Less Wrong automated systems are inadvertently Censoring me · 2024-02-21T18:54:18.745Z · LW · GW

That makes more sense, thanks.

This is placing a high bar on the tone of comments. But the culture of collegiality is valuable in a subtle and powerful way, so I'd probably endorse it.

Comment by Seth Herd on Less Wrong automated systems are inadvertently Censoring me · 2024-02-21T18:44:30.415Z · LW · GW

I would think that everyone with 2000 karma has been downvoted by at least 7 users. That's a lot of posts and comments.

It seems like maybe this algorithm deserves a little rethinking. Maybe the past month is all you need to change, but I don't know what the rest of the algorithm is. -5 is a very low bar for limiting a high-net-karma user, since that can be produced by one angry strong downvote from another high-karma user.

Comment by Seth Herd on I'd also take $7 trillion · 2024-02-21T18:30:49.624Z · LW · GW

Right - but that's totally unrelated to the logic of the $7T ask. Which makes this an odd post. What is the point of speculating about the good that could be done with $7T that magically appears rather than speculating about how money could actually be garnered for good causes?

As an aside, Altman didn't ask for 7T for one business he would run - I think he said it would take that much to create an alternate chip supply chain in total. See this excellent breakdown of the chip supply chain. There is a lot of money to be made in chips. It might or might not make back investors money.

Interesting point about the property rights. Of course I agree that we'll probably hit RSI AGI before that money is made back, and that things will get weird or stop.

Edit: I did some more reading; you're right that $7T is way more than you could reasonably spend on chip manufacturing. So I don't know, and I'm not sure anyone does, what the $7T refers to, but I think it's probably all capital investment intended to turn a profit, not money from God.

Comment by Seth Herd on I'd also take $7 trillion · 2024-02-21T05:21:58.887Z · LW · GW

We'd all like 7T for projects we like. The difference in Altman's ask is that investors would likely make their money back and more. That might be the case with some of your projects, but that's the key thing to focus on besides wishing for wealth.

Comment by Seth Herd on Goal-Completeness is like Turing-Completeness for AGI · 2024-02-21T05:14:05.672Z · LW · GW

Here it is: The No Free Lunch theorem for dummies. See particularly the second section: Sidenote: Why NFL has basically nothing to do with AGI and the first link to Yudkowsky's post on essentially the same thing.

I think the thing about your description is that S -> V is not going to be chosen at random in our world.

The no free lunch theorem states in essence (I'm pretty sure) that no classifier can classify a big gray thing with tusks and big ears as both an elephant and not-an-elephant. That's fine, because the remainder of an AGI system can choose (by any other criteria) to make elephants either a goal or an anti-goal or neither.
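For reference, my gloss of Wolpert's supervised-learning version: averaged uniformly over all possible target functions, every learning algorithm has the same expected off-training-set error,

$$\sum_{f} E_{\text{ots}}(A_1 \mid f, D) \;=\; \sum_{f} E_{\text{ots}}(A_2 \mid f, D),$$

for any two algorithms A_1, A_2 and any fixed training set D. The uniform average over all possible f is exactly the assumption that fails in our structured world.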

If the NFL theorem applied to general intelligences, it seems like humans couldn't love elephants at one time and hate them at a later time, with no major changes to their perceptual systems. It proves too much.

Comment by Seth Herd on The Shutdown Problem: Three Theorems · 2024-02-21T05:03:13.126Z · LW · GW

I agree that the shutdown problem is hard. There's a way to circumvent it, and I think that's what will actually be pursued in the first AGI alignment attempts.

That is to make the alignment goal instruction-following. That includes following the instructions to shut down or do nothing. There's no conflict between shutdown and the primary goal.

This alignment goal isn't trivial, but it's much simpler and therefore easier to get right than full alignment to the good-of-humanity, whatever that is.

Given the difficulties, I very much doubt that anyone is going to try launching an AGI that is aligned to the good of all humanity, but can be reliably shut down if we decide we've gotten that definition wrong.

I say a little more about this in Corrigibility or DWIM is an attractive primary goal for AGI and Roger Dearnaley goes into much more depth in his Requirements for a Basin of Attraction to Alignment.

This is more like Christiano's corrigibility concept than Eliezer's, but probably distinct from both.

I'll be writing more about this soon. After writing that post, I've been increasingly convinced that that is the most likely alignment goal for the first, critical AGIs. And that this probably makes alignment easier, while making human power dynamics concerningly relevant.

Comment by Seth Herd on Goal-Completeness is like Turing-Completeness for AGI · 2024-02-19T18:29:18.867Z · LW · GW

The No Free Lunch theorem is irrelevant in worlds like ours that are a subset of possible data structures (world arrangements). I'm surprised this isn't better understood. I think Steve Byrnes did a nice writeup of this logic. I can find the link if you like.

Comment by Seth Herd on And All the Shoggoths Merely Players · 2024-02-12T23:55:59.026Z · LW · GW

Bravo! I think this is both important in the problem it addresses and written engagingly and insightfully.

I think the question of why there are large disagreements between people who have large amounts of background knowledge and time thinking about alignment is probably the most relevant question we could be asking at this stage of alignment research.

Comment by Seth Herd on And All the Shoggoths Merely Players · 2024-02-12T23:53:00.122Z · LW · GW

I think he means the is-ought dilemma inside the AGI - the distinction between its beliefs about what is and its goals for what it thinks ought to happen.

Comment by Seth Herd on And All the Shoggoths Merely Players · 2024-02-12T23:46:43.710Z · LW · GW

This is a great question; I've never seen a convincing answer or even a good start at figuring out how many errors in ASI alignment we can tolerate before we're likely to die.

If each action it takes has independent errors, we'd need near 100% accuracy to expect to survive more than a little while. But if its beliefs are coherent through reflection, those errors aren't independent. I don't expect ASI to be merely a bigger network that takes an input and spits out an output, but a system that can and does reflect on its own goals and beliefs (because this isn't hard to implement and introspection and reflection seem useful for human cognition). Having said that, this might actually be a crux of disagreement on alignment difficulty - I'd be more scared of an ASI that can't reflect so that its errors are independent.
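To make the independent-errors arithmetic concrete (the numbers are illustrative only):

```python
# Probability of no catastrophic error over many independent consequential actions.
eps = 0.001            # 99.9% per-action reliability
n = 10_000             # consequential actions over the system's deployment
p_survive = (1 - eps) ** n
print(p_survive)       # ~4.5e-5: near-certain failure despite high per-action accuracy
```

Correlated, reflectively corrected errors don't compound this way, which is the point of the rest of this comment.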

With reflection, a human wouldn't just say "seems like I should kill everyone this time" and then do it. They'd wonder why this decision is so different from their usual decisions, and look for errors.

So the more relevant question, I think, is how many errors and how large can be tolerated in the formation of a set of coherent, reflectively stable goals. But that's with my expectation of a reflective AGI with coherent goals and behaviors.

Comment by Seth Herd on The Ideal Speech Situation as a Tool for AI Ethical Reflection: A Framework for Alignment · 2024-02-12T22:57:10.623Z · LW · GW

I'm trying to be helpful by guessing at the gap between what you're saying and this particular audience's interests and concerns. You said this is your first post, it's a new account, and the post didn't get much interest, so I'm trying to help you guess what needs to be addressed in future posts or edits.

Comment by Seth Herd on The Ideal Speech Situation as a Tool for AI Ethical Reflection: A Framework for Alignment · 2024-02-12T20:45:49.805Z · LW · GW

I think you should emphasize this more since that's typically what alignment people think about. What part of the ISS statements do you take to imply values we'd like?

The more standard thinking is that human values are developed based on our innate drives, which includes prosocial drives. See Steve Byrnes work, particularly the intro to his brain-like AGI sequence. And that's not guaranteed to produce an aligned human.