[Valence series] 1. Introduction 2023-12-04T15:40:21.274Z
Thoughts on “AI is easy to control” by Pope & Belrose 2023-12-01T17:30:52.720Z
I’m confused about innate smell neuroanatomy 2023-11-28T20:49:13.042Z
8 examples informing my pessimism on uploading without reverse engineering 2023-11-03T20:03:50.450Z
Late-talking kid part 3: gestalt language learning 2023-10-17T02:00:05.182Z
“X distracts from Y” as a thinly-disguised fight over group status / politics 2023-09-25T15:18:18.644Z
A Theory of Laughter—Follow-Up 2023-09-14T15:35:18.913Z
A Theory of Laughter 2023-08-23T15:05:59.694Z
Model of psychosis, take 2 2023-08-17T19:11:17.386Z
My checklist for publishing a blog post 2023-08-15T15:04:56.219Z
Lisa Feldman Barrett versus Paul Ekman on facial expressions & basic emotions 2023-07-19T14:26:05.675Z
Thoughts on “Process-Based Supervision” 2023-07-17T14:08:57.219Z
Munk AI debate: confusions and possible cruxes 2023-06-27T14:18:47.694Z
My side of an argument with Jacob Cannell about chip interconnect losses 2023-06-21T13:33:49.543Z
LeCun’s “A Path Towards Autonomous Machine Intelligence” has an unsolved technical alignment problem 2023-05-08T19:35:19.180Z
Connectomics seems great from an AI x-risk perspective 2023-04-30T14:38:39.738Z
AI doom from an LLM-plateau-ist perspective 2023-04-27T13:58:10.973Z
Is “FOXP2 speech & language disorder” really “FOXP2 forebrain fine-motor crappiness”? 2023-03-23T16:09:04.528Z
EAI Alignment Speaker Series #1: Challenges for Safe & Beneficial Brain-Like Artificial General Intelligence with Steve Byrnes 2023-03-23T14:32:53.800Z
Plan for mediocre alignment of brain-like [model-based RL] AGI 2023-03-13T14:11:32.747Z
Why I’m not into the Free Energy Principle 2023-03-02T19:27:52.309Z
Why I’m not working on {debate, RRM, ELK, natural abstractions} 2023-02-10T19:22:37.865Z
Heritability, Behaviorism, and Within-Lifetime RL 2023-02-02T16:34:33.182Z
Schizophrenia as a deficiency in long-range cortex-to-cortex communication 2023-02-01T19:32:24.447Z
“Endgame safety” for AGI 2023-01-24T14:15:32.783Z
Thoughts on hardware / compute requirements for AGI 2023-01-24T14:03:39.190Z
Note on algorithms with multiple trained components 2022-12-20T17:08:24.057Z
More notes from raising a late-talking kid 2022-12-20T02:13:01.018Z
My AGI safety research—2022 review, ’23 plans 2022-12-14T15:15:52.473Z
The No Free Lunch theorem for dummies 2022-12-05T21:46:25.950Z
My take on Jacob Cannell’s take on AGI safety 2022-11-28T14:01:15.584Z
Me (Steve Byrnes) on the “Brain Inspired” podcast 2022-10-30T19:15:07.884Z
What does it take to defend the world against out-of-control AGIs? 2022-10-25T14:47:41.970Z
Quick notes on “mirror neurons” 2022-10-04T17:39:53.144Z
Book review: “The Heart of the Brain: The Hypothalamus and Its Hormones” 2022-09-27T13:20:51.434Z
Thoughts on AGI consciousness / sentience 2022-09-08T16:40:34.354Z
On oxytocin-sensitive neurons in auditory cortex 2022-09-06T12:54:10.064Z
I’m mildly skeptical that blindness prevents schizophrenia 2022-08-15T23:36:59.003Z
Changing the world through slack & hobbies 2022-07-21T18:11:05.636Z
Response to Blake Richards: AGI, generality, alignment, & loss functions 2022-07-12T13:56:00.885Z
The “mind-body vicious cycle” model of RSI & back pain 2022-06-09T12:30:33.810Z
[Intro to brain-like-AGI safety] 15. Conclusion: Open problems, how to help, AMA 2022-05-17T15:11:12.397Z
[Intro to brain-like-AGI safety] 14. Controlled AGI 2022-05-11T13:17:54.955Z
[Intro to brain-like-AGI safety] 13. Symbol grounding & human social instincts 2022-04-27T13:30:33.773Z
[Intro to brain-like-AGI safety] 12. Two paths forward: “Controlled AGI” and “Social-instinct AGI” 2022-04-20T12:58:32.998Z
[Intro to brain-like-AGI safety] 11. Safety ≠ alignment (but they’re close!) 2022-04-06T13:39:42.104Z
Abandoned prototype video game teaching elementary circuit theory 2022-03-30T20:37:19.653Z
[Intro to brain-like-AGI safety] 10. The alignment problem 2022-03-30T13:24:33.181Z
[Intro to brain-like-AGI safety] 9. Takeaways from neuro 2/2: On AGI motivation 2022-03-23T12:48:18.476Z
Is Grabby Aliens built on good anthropic reasoning? 2022-03-17T14:12:05.909Z


Comment by Steven Byrnes (steve2152) on Thoughts on “AI is easy to control” by Pope & Belrose · 2023-12-03T22:45:01.142Z · LW · GW

There's something to that, but this sounds too strong to me. If someone had hypothetically spent a year observing all of my behavior, having some sort of direct read access to what was happening in my mind, and also doing controlled experiments where they reset my memory and tested what happened with some different stimulus... it's not like all of their models would become meaningless the moment I read the morning newspaper. If I had read morning newspapers before, they would probably have a pretty good model of what the likely range of updates for me would be.

I dunno, I wrote “invalid (or at least, open to question)”. I don’t think that’s too strong. Like, just because it’s “open to question”, doesn’t mean that, upon questioning it, we won’t decide it’s fine. I.e., it’s not that the conclusion is necessarily wrong, it’s that the original argument for it is flawed.

Of course I agree that the morning paper thing would probably be fine for humans, unless the paper somehow triggered an existential crisis, or I try a highly-addictive substance while reading it, etc.  :)

Some relevant context is: I don’t think it’s realistic to assume that, in the future, AI models will be only slightly fine-tuned in a deployment-specific way. I think the relevant comparison is more like “can your values change over the course of years”, not “can your values change after reading the morning paper?”

Why do I think that? Well, let’s imagine a world where you could instantly clone an adult human. One might naively think that there would be no more on-the-job learning ever. Instead (one might think), if you want a person to help with chemical manufacture, you open the catalog to find a human who already knows chemical manufacturing, and order a clone of them; and if you want a person to design widgets, you go to a different catalog page, and order a clone of a human widget design expert; and so on.

But I think that’s wrong.

I claim there would be lots of demand to clone a generalist—a person who is generally smart and conscientious and can get things done, but not specifically an expert in metallurgy or whatever the domain is. And then, this generalist would be tasked with figuring out whatever domains and skills they didn’t already have.

Why do I think that? Because there are just too many possible specialties, and especially combinations of specialties, for a pre-screened clone-able human to already exist in each of them. Like, think about startup founders. They’re learning how to do dozens of things. Why don’t they outsource their office supply questions to an office supply expert, and their hiring questions to a hiring expert, etc.? Well they do to some extent, but there are coordination costs, and more importantly the experts would lack all the context necessary to understand what the ultimate goals are. What are the chances that there’s a pre-screened clone-able human that knows about the specific combination of things that a particular application needs (rural Florida zoning laws AND anti-lock brakes AND hurricane preparedness AND …)?

So instead I expect that future AIs will eventually do massive amounts of figuring-things-out in a nearly infinite variety of domains, and moreover that the figuring out will never end. (Just as the startup founder never stops needing to learn new things, in order to succeed.) So I don’t like plans where the AI is tested in a standardized way, and then it’s assumed that it won’t change much in whatever one of infinitely many real-world deployment niches it winds up in.

Comment by Steven Byrnes (steve2152) on Thoughts on “AI is easy to control” by Pope & Belrose · 2023-12-03T01:20:35.704Z · LW · GW

I find your text confusing. Let’s go step by step.

  • AlphaZero-chess has a very simple reward function: +1 for getting checkmate, -1 for opponent checkmate, 0 for draw
  • A trained AlphaZero-chess has extraordinarily complicated “preferences” (value function) — its judgments of which board positions are good or bad contain millions of bits of complexity
  • If you change the reward function (e.g. flip the sign of the reward), the resulting trained model preferences would be wildly different.
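
As a minimal sketch of the first and third bullets (the names and the outcome encoding here are hypothetical, not AlphaZero's actual code), the whole reward function fits in a few lines, and flipping its sign is a one-line change:

```python
from enum import Enum


class Outcome(Enum):
    """Terminal result of a game, from the learner's point of view."""
    WIN = "win"    # learner delivered checkmate
    LOSS = "loss"  # opponent delivered checkmate
    DRAW = "draw"


def reward(outcome: Outcome) -> int:
    """+1 for getting checkmate, -1 for opponent checkmate, 0 for draw."""
    return {Outcome.WIN: 1, Outcome.LOSS: -1, Outcome.DRAW: 0}[outcome]


def flipped_reward(outcome: Outcome) -> int:
    """Same function with the sign flipped: losing is now what gets rewarded."""
    return -reward(outcome)
```

The millions of bits of complexity in the trained value function are nowhere in this code; they come out of training against it.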

By analogy:

  • The human brain has a pretty simple innate reward function (let’s say dozens to hundreds of lines of pseudocode).
  • A human adult has extraordinarily complicated preferences (let’s say millions of bits of complexity, at least)
  • If you change the innate reward function in a newborn, the resulting adult would have wildly different preferences than otherwise.

Do you agree with all that?

If so, then there’s no getting around the fact that getting the right innate reward function is extremely important, right?

So, what reward function do you propose to use for a brain-like AGI? Please, write it down. Don’t just say “the reward function is going to be learned”, unless you explain exactly how it’s going to be learned. Like, what’s the learning algorithm for that? Maybe use AlphaZero as an example for how that alleged reward-function-learning-algorithm would work? That would be helpful for me to understand you.  :)

Comment by Steven Byrnes (steve2152) on Thoughts on “AI is easy to control” by Pope & Belrose · 2023-12-03T00:31:57.543Z · LW · GW

You correctly mention that not all AI risk is solved by AI control being easy, because AI misuse can still be a huge factor

It’s odd that you understood me as talking about misuse. Well, I guess I’m not sure how you’re using the term “misuse”. If Person X doesn’t follow best practices when training an AI, and they wind up creating an out-of-control misaligned AI that eventually causes human extinction, and if Person X didn’t want human extinction (as most people don’t), then I wouldn’t call that “misuse”. Would you? I would call it a “catastrophic accident” or something like that. I did mention in the OP that some people think human extinction is perfectly fine, and I guess if Person X is one of those people, then it would be misuse. So I suppose I brought up both accidents and misuse.

Misuse focused policy probably looks less technical, and more normal, for example Know Your Customer laws or hashing could be extremely important if we're worried about misuse of AI for say bioterrorism.

People who I think are highly prone to not following best practices to keep AI under control, even if such best practices exist, include people like Yann LeCun, Larry Page, Rich Sutton, and Jürgen Schmidhuber, who are either opposed to AI alignment on principle, or are so bad at thinking about the topic of AI x-risk that they spout complete and utter nonsense. (example). That’s not a problem solvable by Know Your Customer laws, right? These people (and many more) are not the customers, they are among the ones doing state-of-the-art AI R&D.

In general, the more people are technically capable of making an out-of-control AI agent, the more likely that one of them actually will, even if best practices exist to keep AI under control. People like to experiment with new approaches, etc., right? And I expect the number of such people to go up and up, as algorithms improve etc. See here.

If KYC laws aren’t the answer, what is? I don’t know. I’m not advocating for any particular policy here.

Comment by Steven Byrnes (steve2152) on Thoughts on “AI is easy to control” by Pope & Belrose · 2023-12-03T00:01:48.065Z · LW · GW

I agree with some of this, but I'd say Story 1 applies only very weakly, and that the majority/supermajority of value learning is online, for example via the self-learning/within lifetime-RL algorithms you describe, without relying on the prior. In essence, I agree with the claim that the genes need to impose a prior, which prevents pure blank-slatism from working. I disagree with the claim that this means that genetics need to impose a very strong prior without relying on the self-learning algorithms you describe for capabilities.

You keep talking about “prior” but not mentioning “reward function”. I’m not sure why. For human children, do you think that there isn’t a reward function? Or there is a reward function but it’s not important? Or do you take the word “prior” to include reward function as a special case?

If it’s the latter, then I dispute that this is an appropriate use of the word “prior”. For example, you can train AlphaZero to be superhumanly skilled at winning at Go, or if you flip the reward function then you’ll train AlphaZero to be superhumanly skilled at losing at Go. The behavior is wildly different, but is the “prior” different? I would say no. It’s the same neural net architecture, with the same initialization and same regularization. After 0 bits of training data, the behavior is identical in each case. So we should say it’s the same “prior”, right?
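
Here is a toy sketch of that point, assuming a two-action "game" and a REINFORCE-style update standing in for AlphaZero's actual training (every name below is hypothetical). Both runs share the same initialization, so their behavior is identical before any training; only the sign of the reward differs:

```python
import math
import random
from typing import List


def init_prefs(seed: int = 0) -> List[float]:
    """The 'prior': the same initial action preferences for every run."""
    rng = random.Random(seed)
    return [rng.uniform(-0.01, 0.01) for _ in range(2)]


def policy(prefs: List[float]) -> List[float]:
    """Softmax over the two action preferences."""
    exps = [math.exp(p) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def train(reward_sign: int, steps: int = 2000, lr: float = 0.1,
          seed: int = 0) -> List[float]:
    """Train on a toy game where action 0 'wins' and action 1 'loses'.

    reward_sign=+1 rewards winning; reward_sign=-1 is the flipped reward.
    Returns the action probabilities after training.
    """
    rng = random.Random(seed + 1)
    prefs = init_prefs(seed)  # identical starting point in both cases
    for _ in range(steps):
        probs = policy(prefs)
        action = 0 if rng.random() < probs[0] else 1
        r = reward_sign * (1 if action == 0 else -1)
        # REINFORCE gradient of log-probability for a softmax policy
        for a in range(2):
            grad = (1.0 if a == action else 0.0) - probs[a]
            prefs[a] += lr * r * grad
    return policy(prefs)


before = policy(init_prefs())  # same for both runs, after 0 bits of data
wins = train(reward_sign=+1)   # ends up almost always picking action 0
loses = train(reward_sign=-1)  # ends up almost always picking action 1
```

Same architecture, same initialization, opposite trained behavior: whatever differs between `wins` and `loses` came from the reward function, not from anything I would call the “prior”.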

(As I mentioned in the OP, on my models, there is a human innate reward function, and it’s absolutely critical to human prosocial behavior, and unfortunately nobody knows what that reward function is.)

Comment by Steven Byrnes (steve2152) on Thoughts on “AI is easy to control” by Pope & Belrose · 2023-12-02T16:23:17.842Z · LW · GW

Neither this post nor the essay it’s responding to is about policy proposals. So why talk about it? Two points:

  • As a general principle, if there are two groups who wildly disagree about the facts on the ground, but nevertheless (coincidentally) agree about what policies they favor, then I say they should still probably try to resolve their disagreements if possible, because it’s generally good to have accurate beliefs, e.g. what if both of them are wrong? And maybe that coincidence will not always be true anyway.
  • It’s not true that the only choice on offer is “Should we ever build ASI? Yes or no?” In fact, that choice (per se) is not on offer at all. What there is, is a gazillion conceivable laws that could be passed, all of which have a wide and idiosyncratic array of intended and unintended consequences. Beyond that, there are a gazillion individual decisions that need to be made, like what careers to pursue, what to donate to, whether to publish or not publish particular things, whether to pursue or not pursue particular lines of research, etc. etc. I find it extraordinarily unlikely that, if Person A thinks p(doom)=99% and Person B thinks p(doom)=1%, then they’re going to agree on all these gazillions of questions. (And empirically, it seems to be clearly not the case that the p(doom)=1% people and the p(doom)=99% people agree on questions of policy.)

Comment by Steven Byrnes (steve2152) on Thoughts on “AI is easy to control” by Pope & Belrose · 2023-12-02T14:51:37.678Z · LW · GW

While it's obviously true that there is a lot of stuff operating in brains besides LLM-like prediction, such as mechanisms that promote specific predictive models over other ones, that seems to me to only establish that "the human brain is not just LLM-like prediction", while you seem to be saying that "the human brain does not do LLM-like prediction at all". (Of course, "LLM-like prediction" is a vague concept and maybe we're just using it differently and ultimately agree.)

I disagree with whether that distinction matters:

I think technical discussions of AI safety depend on the AI-algorithm-as-a-whole; I think “does the algorithm have such-and-such component” is not that helpful a question.

So for example, here’s a nightmare-scenario that I think about often:

  • (step 1) Someone reads a bunch of discussions about LLM x-risk
  • (step 2) They come down on the side of “LLM x-risk is low”, and therefore (they think) it would be great if TAI is an LLM as opposed to some other type of AI
  • (step 3) So then they think to themselves: Gee, how do we make LLMs more powerful? Aha, they find a clever way to build an AI that combines LLMs with open-ended real-world online reinforcement learning or whatever.

Even if (step 2) is OK (which I don’t want to argue about here), I am very opposed to (step 3), particularly the omission of the essential part where they should have said “Hey wait a minute, I had reasons for thinking that LLM x-risk is low, but do those reasons apply to this AI, which is not an LLM of the sort that I'm used to, but rather it’s a combination of LLM + open-ended real-world online reinforcement learning or whatever?” I want that person to step back and take a fresh look at every aspect of their preexisting beliefs about AI safety / control / alignment from the ground up, as soon as any aspect of the AI architecture and training approach changes, even if there’s still an LLM involved.  :)

Comment by Steven Byrnes (steve2152) on Shallow review of live agendas in alignment & safety · 2023-11-29T13:37:13.892Z · LW · GW

I've written up an opinionated take on someone else's technical alignment agenda about three times, and each of those took me something like 100 hours. That was just to clearly state why I disagreed with it; forget about resolving our differences :)

Comment by Steven Byrnes (steve2152) on Appendices to the live agendas · 2023-11-27T13:34:24.808Z · LW · GW

For what it’s worth, I am not doing (and have never done) any research remotely similar to your text “maybe we can get really high-quality alignment labels from brain data, maybe we can steer models by training humans to do activation engineering fast and intuitively”.

I have a concise and self-contained summary of my main research project here (Section 2).

Comment by Steven Byrnes (steve2152) on Possible OpenAI's Q* breakthrough and DeepMind's AlphaGo-type systems plus LLMs · 2023-11-26T15:38:03.324Z · LW · GW

Update: I kinda regret this comment. I think when I wrote it I didn’t realize quite how popular the “Let’s figure out what Q* is!!” game is right now. It’s everywhere, nonstop.

It still annoys me as much as ever that so many people in the world are playing the “Let’s figure out what Q* is!!” game. But as a policy, I don’t ordinarily complain about extremely widespread phenomena where my complaint has no hope of changing anything. Not a good use of my time. I don’t want to be King Canute yelling at the tides. I un-downvoted. Whatever.

Comment by Steven Byrnes (steve2152) on My Mental Model of Infohazards · 2023-11-26T01:17:43.544Z · LW · GW

if they study the relevant literature real hard, they too can create deadly pandemics in their basement with kidnapped feral cats.

Here’s a question:

Question A: Suppose that a person really really wants to create a novel strain of measles that is bio-engineered to be resistant to the current measles vaccine. This person has high but not extreme intelligence and conscientiousness, and has a high school biology education, and has 6 months to spend, and has a budget of $10,000, and has access to typical community college biology laboratory equipment. What’s the probability that they succeed?

I feel extremely strongly that the answer right now is “≈0%”. That’s based for example on this podcast interview with one of the world experts on those kinds of questions.

What do you think the answer to Question A is?

If you agree with me that the answer right now is “≈0%”, then I have a follow-up question:

Question B: Suppose I give you a magic wand. If you wave the wand, it will instantaneously change the answer to Question A to be “90%”. Would you wave that wand or not?

(My answer is “obviously no”.)

There was a clear danger, the public was alerted, the public was unhappy, changes were made, research was directed into narrower, safer channels, and society went back to doing its thing.

  • I’m strongly in favor of telling the public what gain-of-function research is and why they should care about its occurrence.
  • I’m strongly opposed to empowering millions of normal members of the public to do gain-of-function research in their garages.

Do you see the difference?

If you’re confused by the biology example, here’s a physics one:

  • I’m strongly in favor of telling the public what uranium enrichment is and why they should care about its occurrence.
  • I’m strongly opposed to empowering millions of normal members of the public to enrich uranium in their garages [all the way to weapons-grade, at kilogram scale, using only car parts and household chemicals].

Comment by Steven Byrnes (steve2152) on My Mental Model of Infohazards · 2023-11-25T14:00:21.812Z · LW · GW

I think you’re misunderstanding something very basic about infectious disease, or else we’re miscommunicating somehow.

You wrote “…has a 50% chance of causing a deadly global pandemic after 6 months of work…”, and you also wrote “…and the relevant global pandemic is already in full swing…”. Those are contradictory.

If virus X is already uncontrollably spreading around the world, then I don’t care about someone knowing how to manufacture virus X, and nobody else cares either. That’s not the problem. I care about somebody knowing how to take lab equipment and manufacture a new different virus Y, such that immunity to X (or to any currently-circulating virus) does not confer immunity to Y. I keep saying “novel pandemic”. The definition of a “novel pandemic” is that nobody is immune to it (and typically, also, we don’t already have vaccines). COVID is not much worse than seasonal flu once people have already caught it once (or been vaccinated), but the spread of COVID was a catastrophe because it was novel—everyone was catching it for the first time, and there were no vaccines yet. And it’s possible for novel pandemics to be much much worse than COVID.

If somebody synthesizes and releases novel viruses A, B, C, D, E, F, each of which is highly infectious and has a 70% mortality rate, then we have to invent and test and manufacture and distribute six brand new vaccines in parallel, everybody needs to get six shots, the default expectation is that you’re going to get deathly ill six times rather than once, etc. You understand this, right?

The other reason that I think you have some basic confusion is your earlier comment that basically said:

  • it’s great that the rationality community was loudly and publicly discussing how to react to the COVID pandemic that was already spreading uncontrollably around the world;
  • …therefore, if a domain expert figures out a recipe to make vaccine-resistant deadly super-measles or any of 100 other never-before-seen novel pandemic viruses using only widely-available lab equipment, then it’s virtuous for them (after warning the CDC and waiting maybe 90 days) to publish that recipe in the form of user-friendly step-by-step instructions on the open internet and then take out a billboard in Times Square that says “Did you know that anyone who wants to can easily manufacture vaccine-resistant deadly super-measles or any of 100 other never-before-seen novel pandemic viruses using only widely-available lab equipment? If you don’t believe me, check out this website!”

To me, going from the first bullet point to the second one is such a flagrant non sequitur that it makes my head spin to imagine what you could possibly be thinking. So again, I think there’s some very basic confusion here about what I’m talking about and/or how infectious disease works.

Comment by Steven Byrnes (steve2152) on My Mental Model of Infohazards · 2023-11-24T15:12:18.138Z · LW · GW

I'm still talking about the pandemic thing, not AI. If we're "mostly on the same page" that publishing and publicizing the blog post (with a step-by-step recipe for making a deadly novel pandemic virus) is a bad idea, then I think you should edit your post, right?

Comment by Steven Byrnes (steve2152) on My Mental Model of Infohazards · 2023-11-24T15:09:20.803Z · LW · GW

Your arguments here seem to be:

  • It is already possible for an individual to do something that has a 0.0000...0000001% chance of causing a deadly global pandemic after 6 months of work. Therefore, what's the harm in disseminating information about a procedure that has a 90% chance of causing a deadly global pandemic after 6 months of work?

  • If there is already a pandemic spreading around the world, it's good to publicly talk about it. Therefore, it is also good to publicly talk about how, step by step, an individual could create a deadly novel pandemic using widely-available equipment.

Do you endorse those? If those aren't the arguments you were making in that comment, then can you clarify?

Comment by Steven Byrnes (steve2152) on My Mental Model of Infohazards · 2023-11-24T14:34:33.147Z · LW · GW

What is the “first mover advantage”? Are you worried about the CDC itself creating and releasing deadly novel global pandemics? To me, that seems like a crazy thing to be worried about. Nobody thinks that creating and releasing deadly novel global pandemics is a good idea, except for crazy ideologues like Seiichi Endo. Regrettably, crazy ideologues do exist. But they probably don’t exist among CDC employees.

I would expect the CDC to “engage with me in serious dialogue and good faith”. More specifically, I expect that I would show them the instructions, and they would say “Oh crap, that sucks. There’s nothing to do about that, except try to delay the dissemination of that information as long as possible, and meanwhile try to solve the general problem of pandemic prevention and mitigation. …Which we have already been trying to solve for decades. We’re making incremental progress but we sure aren’t going to finish anytime soon. If you want to work on the general problem of pandemic prevention and mitigation, that’s great! Go start a lab and apply for NIH grants. Go lobby politicians for more pandemic prevention funding. Go set up wastewater monitoring and invent better rapid diagnostics. Etc. etc. There’s plenty of work to do, and we need all the help we can get, God knows. And tell all your friends to work on pandemic prevention too.”

If the CDC says that, and then goes back to continuing the pandemic prevention projects that they were already working on, would you still advocate my publishing the blog post after 90 days? Can you spell out exactly what bad consequences you expect if I don’t publish it?

Comment by Steven Byrnes (steve2152) on My Mental Model of Infohazards · 2023-11-24T13:49:17.073Z · LW · GW

OK. So I contact the CDC. They say "if any crazy ideologue in the world were able to easily make and release a deadly novel pandemic, that would obviously be catastrophic. Defense against deadly novel pandemics is a very hard problem that we've been working on for decades and will not solve anytime soon. Did you know that COVID actually just happened and pandemic prevention funding is still woefully inadequate? And your step-by-step instructions only involve widely-available equipment, so there's no targeted regulation for us to use to stop people from doing this. So anyway, publishing and publicizing that blog post would be one of the worst things that anyone has ever done and you personally and all your loved ones would probably be dead shortly thereafter."

And then I think to myself: "Am I concerned that the CDC itself will make and release deadly novel pandemics"? And my answer is "no".

And then I think to myself: "What would be a reasonable time cap before I publish the blog post and take out the Times Square billboard pointing people to it?" And my answer is "Infinity; why on earth would I ever want to do that? WTF?"

So I guess you'd say: Shame on me, I'm an infohoarder!!!! Right?

If that's your opinion, then what would you do differently? How soon would you publish the blog post, and why?

(Again I suggest the podcast that I linked in my earlier comment.)

Comment by Steven Byrnes (steve2152) on Possible OpenAI's Q* breakthrough and DeepMind's AlphaGo-type systems plus LLMs · 2023-11-24T04:47:31.787Z · LW · GW

None of the things you mention seem at all problematic to me. I'm mostly opposed to the mood of "Ooh, a puzzle! Hey let's all put our heads together and try to figure it out!!" Apart from that mood, I think it's pretty hard to go wrong in this case, unless you're in a position to personally find a new piece of the puzzle yourself. If you're just chatting with no particular domain expertise, ok sure maybe you'll inspire some idea in a reader, but that can happen when anyone says anything. :-P

Comment by Steven Byrnes (steve2152) on My Mental Model of Infohazards · 2023-11-23T20:20:35.118Z · LW · GW

I wish you would engage with specific cases, instead of speaking in generalities. Suppose a domain expert figures out an easy way to make deadly novel pandemics using only widely-available equipment. You would want them to first disclose it to their local government (who exactly?), then publish it on their blog, right? How soon? The next day? With nice clear illustrated explanations of each step? What if they take out a billboard in Times Square directing people to the blog post? Is that praiseworthy? If not, why not? What do you expect to happen if they publish and publicize that blog post, and what do you expect to happen if they don't?

Comment by Steven Byrnes (steve2152) on My Mental Model of Infohazards · 2023-11-23T17:30:42.176Z · LW · GW

I don't understand what you're trying to say. What exactly are the "coordination problems" that prevent true human-level AI from having already been created last year?

Comment by Steven Byrnes (steve2152) on My Mental Model of Infohazards · 2023-11-23T17:06:55.176Z · LW · GW

I’m confused. How do you explain the fact that we don’t currently have human-level AI—e.g., AI that is as good as Ed Witten at publishing original string theory papers, or an AI that can earn a billion dollars with no human intervention whatsoever? Or do you think we do have such AIs already? (If we don’t already have such AIs, then what do you mean by “the horse is no longer in the stable”?)

Comment by Steven Byrnes (steve2152) on My Mental Model of Infohazards · 2023-11-23T16:59:23.655Z · LW · GW

If it’s “very unlikely” rather than “impossible even in principle”, then I think you should have titled your post: “There are a lot of things that people say are infohazards, but actually aren’t.” And then you can go through examples. Like maybe you can write sentences like:

  • “If a domain expert figures out an easy way to make deadly novel pandemics with no special tools, then lots of people would say that’s an infohazard, but it’s actually not for [fill in the blank] reason, quite the contrary they are morally obligated to publish it.” (btw have you listened to this podcast?)
  • “If a domain expert figures out an easy way to run misaligned AGI on a consumer GPU (but they have no idea how to align it), then lots of people would say that’s an infohazard, but it’s actually not for [fill in the blank] reason, quite the contrary they are morally obligated to publish it.”
  • “If a domain expert figures out an easy way to enrich natural uranium to weapons-grade uranium using car parts for $100/kg, then lots of people would say that’s an infohazard, but it’s actually not for [fill in the blank] reason, quite the contrary they are morally obligated to publish it.”
  • “If a domain expert finds a zero-day exploit in impossible-to-patch embedded software used by hospitals and electric grids, then lots of people would say that’s an infohazard, but it’s actually not for [fill in the blank] reason, quite the contrary they are morally obligated to publish it.”
  • Etc. etc.

And then we could talk about the different examples and whether the consequences of publishing it would be net good or bad. (Obviously, I’m on the “bad” side.)

Comment by Steven Byrnes (steve2152) on Possible OpenAI's Q* breakthrough and DeepMind's AlphaGo-type systems plus LLMs · 2023-11-23T14:54:10.864Z · LW · GW

Why I strong-downvoted

[update: now it’s a weak-downvote, see edit at bottom]

[update 2: I now regret writing this comment, see my reply-comment]

I endorse the general policy: "If a group of reasonable people think that X is an extremely important breakthrough that paves the path to imminent AGI, then it's really important to maximize the amount of time that this group can think about how to use X-type AGI safely, before dozens more groups around the world start trying to do X too."

And part of what that entails is being reluctant to contribute to a public effort to fill in the gaps from leaks about X.

I don't have super strong feelings that this post in particular has super negative value. I think its contents are sufficiently obvious and already being discussed in lots of places, and I also think the thing in question is not in fact an extremely important breakthrough that paves the path to imminent AGI anyway. But this post has no mention that this kind of thing might be problematic, and it's the kind of post that I'd like to discourage, because at some point in the future it might actually matter.

As a less-bad alternative, I propose that you should wait until somebody else prominently publishes the explanation that you think is right (which is bound to happen sooner or later, if it hasn't already), and then linkpost it.

See also: my post on Endgame Safety.

EDIT: Oh oops I wasn’t reading very carefully, I guess there are no ideas here that aren’t word-for-word copied from very-widely-viewed twitter threads. I changed my vote to a weak-downvote, because I still feel like this post belongs to a genre that is generally problematic, and by not mentioning that fact and arguing that this post is one of the exceptions, it is implicitly contributing to normalizing that genre.

Comment by Steven Byrnes (steve2152) on My Mental Model of Infohazards · 2023-11-23T12:14:11.420Z · LW · GW

Nothing can be alllll that dangerous if it's known to literally everyone how it works (this is probably the crux where I differ wildly from the larger alignment community, if I had to guess)

I agree that this is the crux, which makes it odd that you're not presenting any argument for it. Do you think it's always true in every possible world, or that it just happens to be true for every realistic example you can think of?

Comment by Steven Byrnes (steve2152) on We have promising alignment plans with low taxes · 2023-11-17T02:09:01.005Z · LW · GW

Just to play devil’s advocate, here’s an alternative: When there are multiple plausible tokens, maybe Gemini does multiple branching roll-outs for all of them, and then picks the branch that seems best (somehow or other).

That would be arguably consistent with saying “some of the strengths of AlphaGo-type systems”, in the sense that AlphaGo also did multiple roll-outs at inference time (and training time for that matter). But it wouldn’t entail any extra RL.

If this is true (a big “if”!), my vague impression is that it’s an obvious idea that has been tried lots of times but has been found generally unhelpful for LLMs. Maybe they found a way to make it work? Or maybe not but they’re doing it anyway because it sounds cool? Or maybe this whole comment is way off. I’m very far from an expert on this stuff.

Comment by Steven Byrnes (steve2152) on New report: "Scheming AIs: Will AIs fake alignment during training in order to get power?" · 2023-11-15T19:14:39.967Z · LW · GW

(Sorry in advance if this whole comment is stupid, I only read a bit of the report.)

As context, I think the kind of technical plan where we reward the AI for (apparently) being helpful is at least not totally doomed to fail. Maybe I’m putting words in people’s mouths, but I think even some pretty intense doomers would agree with the weak statement “such a plan might turn out OK for all we know” (e.g. Abram Demski talking about a similar situation here, Nate Soares describing a similar-ish thing as “maybe one nine” a.k.a. a mere 90% chance that it would fail). Of course, I would rather have a technical plan for which there’s a strong reason to believe it would actually work. :-P

Anyway, if that plan had a catastrophic safety failure (assuming proper implementation etc., and also assuming a situationally-aware AI), I think I would bet on a goal misgeneralization failure mode over a “schemer” failure mode. Specifically, such an AI could plausibly (IMO) wind up feeling motivated by any combination of “the idea of getting a reward signal”, or “the idea of the human providing a reward signal”, or “the idea of the human feeling pleased and motivated to provide a reward signal”, or “the idea of my output having properties X,Y,Z (which make it similar to outputs that have been rewarded in the past)”, or whatever else. None of those possible motivations would require “scheming”, if I understand that term correctly, because in all cases the AI would generally be doing things during training that it was directly motivated to do (as opposed to only instrumentally motivated). But some of those possible motivations are really bad because they would make the AI think that escaping from the box, launching a coup, etc., would be an awesome idea, given the opportunity.

(Incidentally, I’m having trouble fitting that concern into the Fig. 1 taxonomy. E.g. an AI with a pure wireheading motivation (“all I want is for the reward signal to be high”) is intrinsically motivated to get reward each episode as an end in itself, but it’s also intrinsically motivated to grab power given an opportunity to do so. So would such an AI be a “reward-on-the-episode seeker” or a “schemer”? Or both?? Sorry if this is a stupid question, I didn’t read the whole report.)

Comment by Steven Byrnes (steve2152) on No, human brains are not (much) more efficient than computers · 2023-11-14T19:51:45.693Z · LW · GW

I'm confused by the 3000MW figure.  I go to top 500 and see ~30,000KW, i.e. 30MW???

Your comment is a reply to me, but this part is a criticism of OP. And I agree with the criticism: OP seems to be wrong about 3030 MW, unless I’m misunderstanding. (Or maybe it’s just a typo where Jesse typed “30” twice? Are the subsequent calculations consistent with 30 or 3030?)

Agree that … right?

This part seems to be implicitly an argument for some larger point, but I’m not following exactly what it is; can you say it explicitly?

Comment by Steven Byrnes (steve2152) on LeCun’s “A Path Towards Autonomous Machine Intelligence” has an unsolved technical alignment problem · 2023-11-12T12:13:22.377Z · LW · GW

GPT-4 is different from APTAMI. I'm not aware of any method that starts with movies of humans, or human-created internet text, or whatever, and then does some kind of ML, and winds up with a plausible human brain intrinsic cost function. If you have an idea for how that could work, then I'm skeptical, but you should tell me anyway. :)

Comment by Steven Byrnes (steve2152) on Who is Sam Bankman-Fried (SBF) really, and how could he have done what he did? - three theories and a lot of evidence · 2023-11-12T03:45:57.292Z · LW · GW

This strikes me as a good post but it seems to be getting a bunch of downvotes. I'm confused why. Anyone care to explain?

Comment by Steven Byrnes (steve2152) on LeCun’s “A Path Towards Autonomous Machine Intelligence” has an unsolved technical alignment problem · 2023-11-12T03:30:55.784Z · LW · GW

“Extract from the brain” how? A human brain has like 100 billion neurons and 100 trillion synapses, and they’re generally very difficult to measure, right? (I do think certain neuroscience experiments would be helpful.) Or do you mean something else?

Comment by Steven Byrnes (steve2152) on LeCun’s “A Path Towards Autonomous Machine Intelligence” has an unsolved technical alignment problem · 2023-11-12T01:57:55.578Z · LW · GW

I would say “the human brain’s intrinsic-cost-like-thing is difficult to figure out”. I’m not sure what you mean by “…difficult to extract”. Extract from what?

Comment by Steven Byrnes (steve2152) on LeCun’s “A Path Towards Autonomous Machine Intelligence” has an unsolved technical alignment problem · 2023-11-12T01:34:29.394Z · LW · GW

The “similar reason as why I personally am not trying to get heroin right now” is “Example 2” here (including the footnote), or a bit more detail in Section 9.5 here. I don’t think that involves an idiosyncratic anti-heroin intrinsic cost function.

The question “What is the intrinsic cost in a human brain” is a topic in which I have a strong personal interest. See Section 2 here and links therein. “Why don’t humans have an alignment problem” is sorta painting the target around the arrow I think? Anyway, if you radically enhanced human intelligence and let those super-humans invent every possible technology, I’m not too sure what you would get (assuming they don’t blow each other to smithereens). Maybe that’s OK though? Hard to say. Our distant ancestors would think that we have awfully weird lifestyles and might strenuously object to it, if they could have a say.

Maybe the view of alignment pessimists is that the paradigmatic human brain’s intrinsic cost is intractably complex.

Speaking for myself, I think the human brain’s intrinsic-cost-like-thing is probably hundreds of lines of pseudocode, or maybe low thousands, certainly not millions. (And the part that’s relevant for AGI is just a fraction of that.) Unfortunately, I also think nobody knows what those lines are. I would feel better if they did. That wouldn’t be enough to make me “optimistic” overall, but it would certainly be a step in the right direction. (Other things can go wrong too.)

Comment by Steven Byrnes (steve2152) on Who is Sam Bankman-Fried (SBF) really, and how could he have done what he did? - three theories and a lot of evidence · 2023-11-11T18:26:17.728Z · LW · GW

I think SBF was bad at the same kinds of things that other high-functioning sociopaths tend to be bad at, e.g. problems stemming from

  • relative aversion to doing boring low-stimulation things (e.g. maintaining a spreadsheet)
  • conversely, a relative penchant for arousal-seeking / thrill-seeking (psychologically I think this stems from global under-arousal)
  • relative lack of seriousness about avoiding downside risks (psychologically I think this stems from lack of visceral worry about such things)

All the “mismanagement” examples that @cata mentioned seem to fit into those categories, more or less, I think.

For example, I recall hearing that high-functioning sociopaths in general tend to be terrible at managing their finances and often wind up in debt. I can’t immediately find where I heard that, but it is very true for both of the high-functioning sociopaths that I’ve known personally.

Comment by Steven Byrnes (steve2152) on 8 examples informing my pessimism on uploading without reverse engineering · 2023-11-05T21:39:21.822Z · LW · GW

I disagree about the hardware difficulty of uploading-with-reverse-engineering—the short version of one aspect of my perspective is here, the longer version with some flaws is here, the fixed version of the latter exists as a half-complete draft that maybe I’ll finish sooner or later. :)

Comment by Steven Byrnes (steve2152) on 8 examples informing my pessimism on uploading without reverse engineering · 2023-11-05T21:29:32.068Z · LW · GW

Currently regulations are based on flops

I’m not sure what you’re talking about. Maybe you meant to say: “there are ideas for possible future AI regulations that have been under discussion recently, and these ideas involve flop-based thresholds”? If so, yeah that’s kinda true, albeit oversimplified.

which will restrict progress towards WBE long before it restricts anything with AGI-like capabilities

I think that’s very true in the “WBE without reverse engineering” route, but it’s at least not obvious in the “WBE with reverse engineering” route that I think we should be mainly talking about (as argued in OP). For the latter, we would have legible learning algorithms that we understand, and we would re-implement them in the most compute-efficient way we can on our GPUs/CPUs. And it’s at least plausible that the result would be close to the best learning algorithm there is. More discussion in Section 2.1 of this post. Certainly there would be room to squeeze some more intelligence into the same FLOP/s—e.g. tweaking motivations, saving compute by dropping the sense of smell, various other architectural tweaks, etc. But it’s at least plausible IMO that this adds up to <1 OOM. (Of course, non-WBE AGIs could still be radically superhuman, but it would be by using radically superhuman FLOP (e.g. model size, training time, speed, etc.))

seems very unlikely to me

Hmm. I should mention that I don’t expect that LLMs will scale to AGI. That might be a difference between our perspectives. Anyway, you’re welcome to believe that “WBE before non-WBE-AGI” is hopeless even if we put moonshot-level effort into accelerating WBE. That’s not a crazy thing to believe. I wouldn’t go as far as “hopeless”, but I’m pretty pessimistic too. That’s why, when I go around advocating for work on human connectomics to help AGI x-risk, I prefer to emphasize a non-WBE-related path to AI x-risk reduction that seems (to me) likelier to actualize.

Humans having WBE at our fingertips could mean infinite tortured simulations of the digital brains

I grant that a sadistic human could do that, and that’s bad, although it’s pretty low on my list of “likely causes of s-risk”. (Presumably Ems, like humans, would be more economically productive when they’re feeling pretty good, in a flow state, etc., and presumably most Ems would be doing economically productive things most of the time for various reasons.)

Anyway, you can say: “To avoid that type of problem, let’s never ever create sentient digital minds”, but that doesn’t strike me as a realistic thing to aim for. In particular, in my (controversial) opinion, that basically amounts to “let’s never ever create AGI” (the way I define “AGI”, e.g. AI that can do groundbreaking new scientific research, invent new gadgets, etc.) If “never ever create AGI” is your aim, then I don’t want to discourage you. Hmm, or maybe I do, I haven’t thought about it really, because in my opinion you’d be so unlikely to succeed that it’s a moot point. Forever is a long time.

Comment by Steven Byrnes (steve2152) on Does davidad's uploading moonshot work? · 2023-11-04T11:34:52.060Z · LW · GW

Little nitpicks:

  • “Human brains have probably more than 1000 times as many synapses as current LLMs have weights.” → Can you elaborate? I thought the ratio was more like 100-200. (180-320T ÷ 1.7T)
  • “If you want to keep your upload sane, or be able to communicate with it, you're also going to have to give it some kind of illusion of a body and some kind of illusion of a comprehensible and stimulating environment.” Seems like an overstatement. Humans can get injuries where they can’t move around or feel almost any of their body, and they sure aren’t happy about it, but they are neither insane nor unable to communicate.
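For the first nitpick, the arithmetic behind my parenthetical can be checked in a few lines. This is only a sketch: the 180-320T synapse range and the 1.7T weight count are the figures I quoted above, and the variable names are made up for illustration.

```python
# Figures from the parenthetical above; both are rough estimates, not measurements.
SYNAPSES_LOW = 180e12   # low-end estimate of synapses in a human brain
SYNAPSES_HIGH = 320e12  # high-end estimate
LLM_WEIGHTS = 1.7e12    # assumed weight count for a large current LLM

ratio_low = SYNAPSES_LOW / LLM_WEIGHTS
ratio_high = SYNAPSES_HIGH / LLM_WEIGHTS

print(f"synapses per LLM weight: {ratio_low:.0f} to {ratio_high:.0f}")
```

Under those inputs the ratio comes out at roughly 106 to 188, which is where "more like 100-200" comes from, and is well short of 1000.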

My much bigger disagreement is a kind of philosophical one. In my mind, I’m thinking of the reverse-engineering route, so I think:

  • A = [the particular kinds of mathematical operations that support the learning algorithms (and other algorithms) that power human intelligence]
  • B = [the particular kinds of affordances enabled by biological neurons, glial cells, etc.]
  • C = [the particular kinds of affordances enabled by CPUs and GPUs]

You’re thinking of a A→B→C path, whereas I’m thinking that evolution did the A→B path and separately we would do the A→C path.

I think there’s a massive efficiency hit from the fact that GPUs and CPUs are a poor fit to many useful mathematical operations. But I also think there’s a massive efficiency hit from the fact that biological neurons are a poor fit to many useful mathematical operations.

So whereas you’re imagining brain neurons doing the basic useful operations, instead I’m imagining that the brain has lots of little machines involving collections of maybe dozens of neurons and hundreds of synapses assembled into a jerry-rigged contraption that does a single basic useful mathematical operation in an incredibly inefficient way, just because that particular operation happens to be the kind of thing that an individual biological neuron can’t do.

Comment by Steven Byrnes (steve2152) on Does davidad's uploading moonshot work? · 2023-11-04T11:08:28.080Z · LW · GW

It’s possible that you (jacobjacob) and jbash are actually in agreement that (part of) the brain does something that is not literally backprop but “relevantly similar” to backprop—but you’re emphasizing the “relevantly similar” part and jbash is emphasizing the “not literally” part?

Comment by Steven Byrnes (steve2152) on Does davidad's uploading moonshot work? · 2023-11-04T10:56:40.639Z · LW · GW

I don’t have a strong opinion one way or the other on the Em here.

In terms of what I wrote above (“when I think about what humans can do that GPT-4 can’t do, I think of things that unfold over the course of minutes and hours and days and weeks, and centrally involve permanently editing brain synapses … being able to figure things out”), I would say that the human-unique “figuring things out” process happened substantially during the weeks and months of study and practice, before the human stepped into the exam room, wherein they got really good at solving IMO problems. And hmm, probably also some “figuring things out” happens in the exam room, but I’m not sure how much, and at least possibly so little that they could get a decent score without forming new long-term memories and then building on them.

I don’t think Ems are good for much if they can’t figure things out and get good at new domains—domains that they didn’t know about before uploading—over the course of weeks and months, the way humans can. Like, you could have an army of such mediocre Ems monitor the internet, or whatever, but GPT-4 can do that too. If there’s an Em Ed Witten without the ability to grow intellectually, and build new knowledge on top of new knowledge, then this Em would still be much much better at string theory than GPT-4 is…but so what? It wouldn’t be able to write groundbreaking new string theory papers the way real Ed Witten can.

Comment by Steven Byrnes (steve2152) on 8 examples informing my pessimism on uploading without reverse engineering · 2023-11-04T00:40:07.743Z · LW · GW

Oops, sorry for leaving out some essential context. Both myself, and everyone I was implicitly addressing this post to, are concerned about the alignment problem, e.g. AGI killing everyone. If not for the alignment problem, then yeah, I agree, there’s almost no reason to work on any scientific or engineering problem except building ASI as soon as possible. But if you are worried about the alignment problem, then it makes sense to brainstorm solutions, and one possible family of solutions involves trying to make WBE happen before making AGI. There are a couple obvious follow-up questions, like “is that realistic?” and “how would that even help with the alignment problem anyway?”. And then this blog post is one part of that larger conversation. For a bit more, see Section 1.3 of my connectomics post. Hope that helps :)

Comment by Steven Byrnes (steve2152) on 8 examples informing my pessimism on uploading without reverse engineering · 2023-11-04T00:30:19.747Z · LW · GW

For Section 2.5 (C. elegans), @davidad is simultaneously hopeful about a human upload moonshot (cf. the post from yesterday) and intimately familiar with C. elegans uploading stuff (having been personally involved). And he’s a pretty reasonable guy IMO. So the inference “C. elegans stuff therefore human uploads are way off” is evidently less of a slam dunk inference than you seem to think it is. (As I mentioned in the post, I don’t know the details, and I hope I didn’t give a misleading impression there.)

I’m confused by your last sentence; how does that connect to the rest of your comment? (What I personally actually expect is that, if there are uploads at all, it would be via the reverse-engineering route, where we would not have to “change random bits”.)

Comment by Steven Byrnes (steve2152) on 8 examples informing my pessimism on uploading without reverse engineering · 2023-11-04T00:17:56.140Z · LW · GW

On the narrow topic of “mouse matrix”: Fun fact, if you didn’t already know, Justin Wood at Indiana University has been doing stuff in that vicinity (with chicks not mice, for technical reasons):

He uses a controlled-rearing technique with natural chicks, whereby the chicks are raised from birth in completely controlled visual environments. That way, Justin can present designed visual stimuli to test what kinds of visual abilities chicks have or can immediately learn. Then he can [build AI models] that are trained on the same data as the newborn chicks.…

(I haven’t read any of his papers, just listened to him on this podcast episode, from which I copied the above quote.)

Comment by Steven Byrnes (steve2152) on Does davidad's uploading moonshot work? · 2023-11-03T14:10:19.996Z · LW · GW

Seconded! I too am confused and skeptical about this part.

Humans can do lots of cool things without editing the synapses in their brain. Like if I say: “imagine an upside-down purple tree, and now answer the following questions about it…”. You’ve never thought about upside-down purple trees in your entire life, and yet your brain can give an immediate snap answer to those questions, by flexibly combining ingredients that are already stored in it.

…And that’s roughly how I think about GPT-4’s capabilities. GPT-4 can also do those kinds of cool things. Indeed, in my opinion, GPT-4 can do those kinds of things comparably well to a human. And GPT-4 already exists and is safe. So that’s not what we need.

By contrast, when I think about what humans can do that GPT-4 can’t do, I think of things that unfold over the course of minutes and hours and days and weeks, and centrally involve permanently editing brain synapses. (See also: “AGI is about not knowing how to do something, and then being able to figure it out.”)

Comment by Steven Byrnes (steve2152) on Does davidad's uploading moonshot work? · 2023-11-03T13:01:45.348Z · LW · GW

I carefully read all the Openwater information and patents during a brief period where I was doing work in a similar technical area (brain imaging via simultaneous use of ultrasound + infrared). …But that was like 2017-2018, and I don’t remember all the details, and anyway they may well have pivoted since then. In any case, people can hit me up if they’re trying to figure out what exactly Openwater is hinting at in their vague press-friendly descriptions, I’m happy to chat and share hot-takes / help brainstorm.

I don’t think they’re aspiring to measure anything that you can’t already measure with MRI, or else they would have bragged about it on their product page or elsewhere, right? Of course, MRI is super expensive and inconvenient, so what they’re doing is still potentially cool & exciting, even if it doesn’t enable new measurements that were previously impossible at any price. But in the context of this post … if I imagine cheap and ubiquitous fMRI (or EEG, ultrasound, etc.), it doesn’t seem relevant, i.e. I don’t think it would make uploading any easier, at least the way I see things.

Comment by Steven Byrnes (steve2152) on Balancing Security Mindset with Collaborative Research: A Proposal · 2023-11-01T02:22:12.054Z · LW · GW

I suggest replacing “security mindset” with “secrets” or “keeping secrets” everywhere in this article, all the way from the title to the end. (And likewise replace “security-conscious organization” with “secrecy-conscious organization”, etc.) Security is different from secrecy, and I’m pretty sure that secrecy is what you’re talking about here. Right?

Comment by Steven Byrnes (steve2152) on The “mind-body vicious cycle” model of RSI & back pain · 2023-10-31T18:40:56.505Z · LW · GW

I’m so happy to hear that!!! :)

Comment by Steven Byrnes (steve2152) on Symbol/Referent Confusions in Language Model Alignment Experiments · 2023-10-27T14:57:47.538Z · LW · GW

“Gullible” is an unfortunate term because it literally describes a personality trait—believing what you hear / read without scrutiny—but it strongly connotes that the belief is wrongheaded in this particular context.

For example, it would be weird to say “Gullible Alice believes that Napier, New Zealand has apple orchards in it. By the way, that’s totally true, it has tons of apple orchards! But Gullible Alice believes it on the basis of evidence that she didn’t have strong reason to trust. She just got lucky.” It’s not wrong to say that, just weird / unexpected.

Relatedly, it’s possible for Gus to have a friend Cas (Careful Steelman) who has the same object-level beliefs as Gus, but the friend is not gullible, because the friend came to those beliefs via a more sophisticated justification / investigation than did Gus. I don’t think OP meant to deny the possibility that Cas could exist—calling Gus a strawman is clearly suggesting that OP believes better arguments exist—but I still feel like the OP comes across that way at a glance.

Comment by Steven Byrnes (steve2152) on AI as a science, and three obstacles to alignment strategies · 2023-10-26T15:39:52.293Z · LW · GW

I think Nate’s claim “I expect them to care about a bunch of correlates of the training signal in weird and specific ways.” is plausible, at least for the kinds of AGI architectures and training approaches that I personally am expecting. If you don’t find the evolution analogy useful for that (I don’t either), but are OK with human within-lifetime learning as an analogy, then fine! Here goes!

OK, so imagine some “intelligent designer” demigod, let’s call her Ev. In this hypothetical, the human brain and body were not designed by evolution, but rather by Ev. She was working 1e5 years ago, back on the savannah. And her design goal was for these humans to have high inclusive genetic fitness.

So Ev pulls out a blank piece of paper. First things first: She designed the human brain with a fancy large-scale within-lifetime learning algorithm, so that these humans can gradually get to understand the world and take good actions in it.

Supporting that learning algorithm, she needs a reward function (“innate drives”). What to do there? Well, she spends a good deal of time thinking about it, and winds up putting in lots of perfectly sensible components for perfectly sensible reasons.

For example: She wanted the humans to not get injured, so she installed in the human body a system to detect physical injury, and put in the brain an innate drive to avoid getting those injuries, via an innate aversion (negative reward) related to “pain”. And she wanted the humans to eat sugary food, so she put a sweet-food-detector on the tongue and installed in the brain an innate drive to trigger reinforcement (positive reward) when that detector goes off (but modulated by hunger, as detected by yet another system). And so on.
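As a toy illustration of the kind of thing Ev is writing down here, below is a minimal sketch of a reward function with just those two components. Everything in it is hypothetical: the function name, the signal names, the 0-to-1 scaling, and the choice to multiply sweetness by hunger are assumptions for illustration, not a claim about how the brain actually computes reward.

```python
def innate_reward(injury_signal: float, sweet_signal: float, hunger: float) -> float:
    """Toy reward with two hand-designed components (all inputs assumed in [0, 1]).

    injury_signal: output of the body's injury detector
    sweet_signal:  output of the sweet-food detector on the tongue
    hunger:        output of a separate hunger-detecting system
    """
    pain_term = -1.0 * injury_signal    # innate aversion: injury gives negative reward
    sweet_term = sweet_signal * hunger  # sweetness is rewarding only to the extent one is hungry
    return pain_term + sweet_term


# Sweet food while hungry is rewarding; the same food while sated is not.
print(innate_reward(injury_signal=0.0, sweet_signal=1.0, hunger=1.0))
print(innate_reward(injury_signal=0.0, sweet_signal=1.0, hunger=0.0))
```

Each term is perfectly sensible on its own. The point of the story is what happens later, when the learning algorithm runs against a function like this in environments Ev never pictured.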

Then she did some debugging and hyperparameter tweaking by running these newly-designed humans in the training environment (African savannah) and seeing how they do.

So that’s how Ev designed humans. Then she “pressed go” and let them run for 1e5 years. What happened?

Well, I think it’s fair to say that modern humans “care about” things that probably would have struck Ev as “weird”. (Although we, with the benefit of hindsight, can wag our finger at Ev and say that she should have seen them coming.) For example:

  • Superstitions and fashions: Some people care, sometimes very intensely, about pretty arbitrary things that Ev could not have possibly anticipated in detail, like walking under ladders, and where Jupiter is in the sky, and exactly what tattoos they have on their body.
  • Lack of reflective equilibrium resulting in self-modification: Ev put a lot of work into her design, but sometimes people don’t like some of the innate drives or other design features that Ev put into them, so the people go right ahead and change them! For example, they don’t like how Ev designed their hunger drive, so they take Ozempic. They don’t like how Ev designed their attentional system, so they take Adderall. Many such examples.
  • New technology / situations leading to new preferences and behaviors: When Ev created the innate taste drives, she was (let us suppose) thinking about the food options available on the savannah, and thinking about what drives would lead to people making smart eating choices in that situation. And she came up with a sensible and effective design for a taste-receptors-and-associated-innate-drives system that worked well for that circumstance. But maybe she wasn’t thinking that humans would go on to create a world full of ice cream and Coca-Cola and miraculin and so on. Likewise, Ev put in some innate drives with the idea that people would wind up exploring their local environment. Very sensible! But Ev would probably be surprised that her design is now leading to people “exploring” open-world video-game environments while cooped up inside. Ditto with social media, organized religion, sports, and a zillion other aspects of modern life. Ev probably didn’t see any of it coming when she was drawing up and debugging her design, certainly not in any detail.

To spell out the analogy here:

  • Ev ↔ AGI programmers;
  • Human within-lifetime learning ↔ AGI training;
  • Adult humans ↔ AGIs;
  • Ev “presses go” and lets human civilization “run” for 1e5 years without further intervention ↔ For various reasons I consider it likely (for better or worse) that there will eventually be AGIs that go off and autonomously do whatever they think is a good thing to do, including inventing new technologies, without detailed human knowledge and approval.
  • Modern humans care about (and do) lots of things that Ev would have been hard-pressed to anticipate, even though Ev designed their innate drives and within-lifetime learning algorithm in full detail ↔ even if we carefully design the “innate drives” of future AGIs, we should expect to be surprised about what those AGIs end up caring about, particularly when the AGIs have an inconceivably vast action space thanks to being able to invent new technology and build new systems.

Comment by Steven Byrnes (steve2152) on AI as a science, and three obstacles to alignment strategies · 2023-10-26T14:29:19.683Z · LW · GW

If anyone cares, my own current take (see here) is “it’s not completely crazy to hope for uploads to precede non-upload-AGI by up to a couple years, with truly heroic effort and exceedingly good luck on numerous fronts at once”. (Prior to writing that post six months ago, I was even more pessimistic.) More than a couple years’ window continues to seem completely crazy to me.

Comment by Steven Byrnes (steve2152) on What is an "anti-Occamian prior"? · 2023-10-23T13:43:03.618Z · LW · GW

I think this is similar to the 2010 post A Proof of Occam's Razor? ...which spawned 139 comments. I didn't read them all. But here's one point that came up a couple times:

Let N be a ridiculously, comically large finite number like N=3↑↑↑↑↑3. Take the N simplest possible hypotheses. This is a finite set, so we can split up probability weight such that simpler hypotheses are less likely, within this set.

For example, rank-order these N hypotheses by decreasing complexity, and assign probability 2^(-n) to the n'th on the list. That leaves 2^(-N) leftover probability weight for all the infinitely many other hypotheses outside that set, and you can distribute that however you like; N is so big that it doesn't matter.

Now, simpler hypotheses are less likely, until we go past the first N hypotheses. But N is so ridiculously large that that's never gonna happen.
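Here is a small sketch of that construction with a tiny N, so the numbers are visible. It assumes geometric weights 2^(-n), which is one assignment that fits the description above (any decreasing sequence summing to less than 1 would do); the function name and the use of exact fractions are my own choices for illustration.

```python
from fractions import Fraction


def anti_occam_prior(num_simplest: int):
    """Assign weight 2^(-n) to the n'th of the N simplest hypotheses.

    Position n = 1 is the MOST complex hypothesis in the set and n = N is the
    simplest, so within the set, simpler hypotheses get less probability.
    Returns (weights, leftover), where leftover is the mass still available
    for every hypothesis outside the set.
    """
    weights = {n: Fraction(1, 2**n) for n in range(1, num_simplest + 1)}
    leftover = 1 - sum(weights.values())
    return weights, leftover


weights, leftover = anti_occam_prior(10)
print(weights[1], weights[10], leftover)
```

With N = 10 the most complex hypothesis in the set gets 1/2, the simplest gets 1/1024, and 1/1024 is left over for everything else. Scaling N up to something like 3↑↑↑↑↑3 changes nothing structurally; the leftover just becomes unimaginably small.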

Comment by Steven Byrnes (steve2152) on I Would Have Solved Alignment, But I Was Worried That Would Advance Timelines · 2023-10-21T12:14:10.905Z · LW · GW

Notice that this is phrasing AI safety and AI timelines as two equal concerns that are worth trading off against each other. I don't think they are equal, and I think most people would have far better impact if they completely struck "I'm worried this will advance timelines" from their thinking and instead focused solely on "how can I make AI risk better".

This seems confused in many respects. AI safety is the thing I care about. I think AI timelines are a factor contributing to AI safety, via having more time to do AI safety technical research, and maybe also other things like getting better AI-related governance and institutions. You’re welcome to argue that shorter AI timelines, other things equal, do not make safe & beneficial AGI less likely, i.e., you can argue for: “Shortening AI timelines should be excluded from cost-benefit analysis because it is not a cost in the first place.” Some people believe that, although I happen to strongly disagree. Is that what you believe? If so, I’m confused. You should have just said it directly. It would make almost everything in this OP beside the point, right? I understood this OP to be taking the perspective that shortening AI timelines is bad, but the benefits of doing so greatly outweigh the costs, and the OP is mainly listing out various benefits of being willing to shorten timelines.

Putting that aside, “two equal concerns” is a strange phrasing. The whole idea of cost-benefit analysis is that the costs and benefits are generally not equal, and we’re trying to figure out which one is bigger (in the context of the decision in question).

If someone thinks that shortening AI timelines is bad, then I think they shouldn’t and won’t ignore that. If they estimate that, in a particular decision, they’re shortening AI timelines infinitesimally, in exchange for a much larger benefit, then they shouldn’t ignore that either. I think “shortening AI timelines is bad but you should completely ignore that fact in all your actions” is a really bad plan. Not all timeline-shortening actions have infinitesimal consequence, and not all associated safety benefits are much larger. In some cases it’s the other way around—massive timeline-shortening for infinitesimal benefit. You won’t know which it is in a particular circumstance if you declare a priori that you’re not going to think about it in the first place.


I think another “psychological” factor is a deontological / Hippocratic Oath / virtue kind of thing: “first, do no harm”. Somewhat relatedly, it can come across as hypocritical if someone is building AGI on weekdays and publicly advocating for everyone to stop building AGI on weekends. (I’m not agreeing or disagreeing with this paragraph, just stating an observation.)

We see this with rhetoric against AGI labs - many in the alignment community will level terrible accusations against them, all while having to admit when pressed that it is plausible they are making AI risk better.

I think you’re confused about the perspective that you’re trying to argue against. Lots of people are very confident, including “when pressed”, that we’d probably be in a much better place right now if the big AGI labs (especially OpenAI) had never been founded. You can disagree, but you shouldn’t put words in people’s mouths.

Comment by Steven Byrnes (steve2152) on I Would Have Solved Alignment, But I Was Worried That Would Advance Timelines · 2023-10-20T18:44:52.846Z · LW · GW

I found the tone of this post annoying at times, especially for overgeneralizing (“the alignment community” is not monolithic) and exaggerating / mocking (e.g. the title). But I’m trying to look past that to the substance, of which there’s plenty :)

I think everyone agrees that you should weigh costs and benefits in any important decision. And everyone agrees that the costs and benefits may be different in different specific circumstances. At least, I sure hope everyone agrees with that! Also, everyone is in favor of accurate knowledge of costs and benefits. I will try to restate the points you’re making, without the sass. But these are my own opinions now. (These are all pro tanto, i.e. able to be outweighed by other considerations.)

  • If fewer people who care about AI risk join or found leading AI companies, then there will still be leading AI companies in the world, it’s just that a smaller fraction of their staff will care about AI risk than otherwise. (Cf. entryism.) (On the other hand, presumably those companies would make less progress on the margin.)
    • And one possible consequence of “smaller fraction of their staff will care about AI risk” is less hiring of alignment researchers into these companies. Insofar as working with cutting-edge models and knowhow is important for alignment research (a controversial topic, where I’m on the skeptical side), and insofar as philanthropists and others aren’t adequately funding alignment research (unfortunately true), then it’s especially great if these companies support lots of high-quality alignment research.
  • If fewer people who care about AI risk go about acquiring a reputation as a prestigious AI researcher, or acquiring AI credentials like PhDs, then there will still be prestigious and credentialed AI researchers in the world, it’s just that a smaller fraction of them will care about AI risk than otherwise. (On the other hand, presumably those prestigious researchers would collectively make less progress on the margin.)
    • This and the previous bullet point are relevant to public perceptions, outreach, legislation efforts, etc.
  • MIRI has long kept some of their ideas secret, and at least some alignment people think that some MIRI people are “overly cautious with infosec”. Presumably at least some other people disagree, or they wouldn’t have that policy. I don’t know the secrets, so I am at a loss to figure out who’s in the right here. Incidentally, the OP implies that the stuff MIRI is choosing not to publish is “alignment research”, which I think is at least not obvious. One presumes that the ideas might bear on alignment research, or else they wouldn’t be thinking about it, but I think that’s a weaker statement, at least for the way I define “alignment research”.
  • If “the first AGI and later ASI will be built with utmost caution by people who take AI risk very seriously”, then that’s sure a whole lot better than the alternative, and probably necessary for survival, but experts strongly disagree about whether it’s sufficient for survival.
    • (The OP also suggests that this is the course we’re on. Well that depends on what “utmost caution” means. One can imagine companies being much less cautious than they have been so far, but also, one can imagine companies being much more cautious than they have been so far. E.g. I’d be very surprised if OpenAI etc. live up to these standards, and forget about these standards. Regardless, we can all agree that more caution is better than less.)
  • A couple alignment people have commented that interpretability research can have problematic capabilities externalities, although neither was saying that all of it should halt right now or anything like that.
    • (For my part, I think interpretability researchers should weigh costs and benefits of doing the research, and also the costs and benefits of publishing the results, on a case-by-case basis, just like everyone else should.)
  • The more that alignment researchers freely share information, the better and faster they will produce alignment research.
  • If alignment researchers are publishing some stuff, but not publishing other stuff, then that’s not necessarily good enough, because maybe the latter stuff is essential for alignment.
  • If fewer people who care about AI risk become VCs who invest in AI companies, or join AI startups or AI discord communities or whatever, then “the presence of AI risk in all of these spaces will diminish”.

I think some of these are more important and some are less, but all are real. I just think it’s extraordinarily important to be doing things on a case-by-case basis here. Like, let’s say I want to work at OpenAI, with the idea that I’m going to advocate for safety-promoting causes, and take actions that are minimally bad for timelines. OK, now I’ve been at OpenAI for a little while. How’s it going so far? What exactly am I working on? Am I actually advocating for the things I was hoping to advocate for? What are my prospects going forward? Am I being indoctrinated and socially pressured in ways that I don’t endorse? (Or am I indoctrinating and socially pressuring others? How?) Etc. Or: I’m a VC investing in AI companies. What’s the counterfactual if I wasn’t here? Do I actually have any influence over the companies I’m funding, and if so, what am I doing with that influence, now and in the future?

Anyway, one big thing that I see as missing from the post is the idea:

“X is an interesting AI idea, and it’s big if true, and more importantly, if it’s true, then people will necessarily discover X in the course of building AGI. OK, guess I won’t publish it. After all, if it’s true, then someone else will publish it sooner or later. And then—and only then—I can pick up this strand of research that depends on X and talk about it more freely. Meanwhile maybe I’ll keep thinking about it but not publish it.”

In those kinds of cases, not publishing is a clear win IMO. More discussion at “Endgame safety” for AGI. For my part, I say that kind of thing all the time.

And I publish some stuff where I think the benefits of publishing outweigh the costs, and don’t publish other stuff where I think they don’t, on a case-by-case basis, and it’s super annoying and stressful and I never know (and never will know) if I’m making the right decisions, but I think it’s far superior to blanket policies.

Comment by Steven Byrnes (steve2152) on [Intro to brain-like-AGI safety] 6. Big picture of motivation, decision-making, and RL · 2023-10-18T19:16:41.406Z · LW · GW


I think it’s important to distinguish

  • (A) “this thing has explicit goals for the future”, versus
  • (B) “this thing was designed to accomplish one or more goals”.

For example, I, Steve, qualify as (A) because I am a thing with explicit goals for the future—among many other things, I want my children to have a good life. Meanwhile, a sandwich qualifies as (B) but not (A). The sandwich was designed and built in order to accomplish one or more goals (to be yummy, to be nutritious, etc.), but the sandwich itself does not have explicit goals for the future. It’s just a sandwich. It’s not a goal-seeking agent. Do you see the difference?

(Confusingly, I am actually an example of both (A) and (B). I’m an example of (A) because I want my children to have a good life. I’m an example of (B) because I was designed by natural selection to have a high inclusive genetic fitness.)

Now, if you’re saying that the Steering Subsystem is an example of (B), then yes I agree, that’s absolutely true. What I was saying there was that it is NOT an example of (A). Do you see what I mean?