ODD JOB OFFER: I think I want to cross-post Intro to Brain-Like-AGI Safety as a giant 300-page PDF on arxiv (it’s about 80,000 words), mostly to make it easier to cite (as is happening sporadically, e.g. here, here). I am willing to pay fair market price (which I don’t really know; make me an offer) for whatever reformatting work is necessary to make that happen. I guess I’m imagining that the easiest plan would be to copy everything into Word (or LibreOffice Writer), clean up whatever formatting weirdness comes from that, and convert to PDF. LaTeX conversion is also acceptable but I imagine that would be much more work for no benefit.
I think inline clickable links are fine on arxiv (e.g. page 2 here), but I do have a few references to actual papers, and I assume those should probably be turned into a proper reference section at the end of each post / “chapter”. Within-series links (e.g. a link from Post 6 to a certain section of Post 4) should probably be converted to internal links within the PDF, rather than going out to the lesswrong / alignmentforum version. There are lots of textbooks / lecture notes on arxiv which can serve as models; I don’t really know the details myself. The original images are all in Powerpoint, if that’s relevant. The end product should ideally be easy for me to edit if I find things I want to update. (Arxiv makes updates very easy, one of my old arxiv papers is up to version 5.)
…Or maybe this whole thing is stupid. If I want it to be easier to cite, I could just add in a “citation information” note, like they did here? I dunno.
Davidad responds with a brief argument for 1000 FLOP-equivalent per synapse-second (3 OOM more than my guess) on X as follows:
Ok, so assuming we agree on 1e14 synapses and 3e8 seconds, then where we disagree is on average FLOP(-equivalent) per synapse-second: you think it’s about 1, I think it’s about 1000. This is similar to the disagreement you flagged with Joe Carlsmith.
Note: at some point Joe interviewed me about this so there might be some double-counting of “independent” estimates here, but iirc he also interviewed many other neuroscientists.
My estimate would be a lot lower if we were just talking about “inference” rather than learning and memory. STDP seems to have complex temporal dynamics at the 10ms scale.
There also seem to be complex intracellular dynamics at play, possibly including regulatory networks, obviously regarding synaptic weight but also other tunable properties of individual compartments.
The standard arguments for the causal irrelevance of these to cognition (they’re too slow to affect the “forward pass”) don’t apply to learning. I’m estimating there’s like a 10-dimensional dynamical system in each compartment evolving at ~100Hz in importantly nonlinear ways.
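To spell out the arithmetic behind the disagreement, here is a minimal back-of-the-envelope sketch using only the numbers quoted above (nothing new is assumed beyond them):

```python
# Lifetime "brain compute" implied by the two estimates above.
synapses = 1e14             # agreed number of synapses
seconds = 3e8               # agreed number of seconds (~10 years, order of magnitude)
flop_per_synapse_second = {"Steve": 1, "Davidad": 1000}

for who, f in flop_per_synapse_second.items():
    total = synapses * seconds * f
    print(f"{who}: ~{total:.0e} FLOP-equivalent over that period")

# Steve:   ~3e+22 FLOP-equivalent over that period
# Davidad: ~3e+25 FLOP-equivalent over that period
```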
I think OP is using “sequential” in an expansive sense that also includes e.g. “First I learned addition, then I learned multiplication (which relies on already understanding addition), then I learned the distributive law (which relies on already understanding both addition and multiplication), then I learned the concept of modular arithmetic (which relies on …) etc. etc.” (part of what OP calls “C”). I personally wouldn’t use the word ‘sequential’ for that—I prefer a more vertical metaphor like ‘things building upon other things’—but that’s a matter of taste I guess. Anyway, whatever we want to call it, humans can reliably do a great many steps, although that process unfolds over a long period of time.
…And not just smart humans. Just getting around in the world, using tools, etc., requires giant towers of concepts relying on other previously-learned concepts.
Obviously LLMs can deal with addition and multiplication and modular arithmetic etc. But I would argue that this tower of concepts building on other concepts was built by humans, and then handed to the LLM on a silver platter. I join OP in being skeptical that LLMs (including o3 etc.) could have built that tower themselves from scratch, the way humans did historically. And I for one don’t expect them to be able to do that thing until an AI paradigm shift happens.
In case anyone missed it, I stand by my reply from before— Applying traditional economic thinking to AGI: a trilemma
If you offer a salary below 100 watts equivalent, humans won’t accept, because accepting it would mean dying of starvation. (Unless the humans have another source of wealth, in which case this whole discussion is moot.) This is not literally a minimum wage, in the conventional sense of a legally-mandated wage floor; but it has the same effect as a minimum wage, and thus we can expect it to have the same consequences as a minimum wage.
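Here is that 100-watt figure spelled out as a daily food-energy requirement (standard conversion constants; nothing beyond the 100 W figure above):

```python
# Convert a continuous 100 W metabolic draw into daily food energy.
watts = 100
joules_per_day = watts * 24 * 3600      # 8.64e6 J per day
kcal_per_day = joules_per_day / 4184    # 1 food Calorie (kcal) = 4184 J

print(f"{kcal_per_day:.0f} kcal/day")   # ~2065 kcal/day, i.e. roughly a subsistence diet
```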
This is obviously (from my perspective) the point that Grant Slatton was trying to make. I don’t know whether Ben Golub misunderstood that point, or was just being annoyingly pedantic. Probably the former—otherwise he could have just spelled out the details himself, instead of complaining, I figure.
It was Grant Slatton but Yudkowsky retweeted it
I like reading the Sentinel email newsletter once a week for time-sensitive general world news, and https://en.wikipedia.org/wiki/2024 (or https://en.wikipedia.org/wiki/2025 etc.) once every 3-4 months for non-time-sensitive general world news. That adds up to very little time—maybe ≈1 minute per day on average—and I think there are more than enough diffuse benefits to justify that tiny amount of time.
I feel like I’ve really struggled to identify any controllable patterns in when I’m “good at thinky stuff”. Gross patterns are obvious—I’m reliably great in the morning, then my brain kinda peters out in the early afternoon, then pretty good again at night—but I can’t figure out how to intervene on that, except scheduling around it.
I’m extremely sensitive to caffeine, and have a complicated routine (1 coffee every morning, plus in the afternoon I ramp up from zero each weekend to a full-size afternoon tea each Friday), but I’m pretty uncertain whether I’m actually getting anything out of that besides a mild headache every Saturday.
I wonder whether it would be worth investing the time and energy into being more systematic to suss out patterns. But I think my patterns would be pretty subtle, whereas yours sound very obvious and immediate. Hmm, is there an easy and fast way to quantify “CQ”? (This pops into my head but seems time-consuming and testing the wrong thing.) …I’m not really sure where to start tbh.
…I feel like what I want to measure is a 1-dimensional parameter extremely correlated with “ability to do things despite ugh fields”—presumably what I’ve called “innate drive to minimize voluntary attention control” being low a.k.a. “mental energy” being high. Ugh fields are where the parameter is most obvious to me but it also extends into thinking well about other topics that are not particularly aversive, at least for me, I think.
Sorry if I missed it, but you don’t seem to address the standard concern that mildly-optimizing agents tend to self-modify into (or create) strongly-optimizing agents.
For example (copying from my comment here), let’s say we make an AI that really wants there to be exactly 100 paperclips in the bin. There’s nothing else it wants or desires. It doesn’t care a whit about following human norms, etc.
But, there’s one exception: this AI is also “lazy”—every thought it thinks, and every action it takes, is mildly aversive. So it’s not inclined to, say, build an impenetrable fortress around the bin just for an infinitesimal probability increment. “Seems like a lot of work! It’s fine as is,” says the AI to itself.
But hey, here’s something it can do: rent some server time on AWS, and make a copy of its own source code and trained model, but comment out the “laziness” code block. That’s not too hard; even a “lazy” AI would presumably be capable of doing that. And the result will be a non-lazy AI that works tirelessly and uncompromisingly towards incrementing the probability of there being 100 paperclips—first 99.99%, then 99.9999%, etc. That’s nice! (from the original AI’s perspective). Or more specifically, it offers a small benefit for zero cost (from the original AI’s perspective).
It’s not wildly different from a person saying “I want to get out of debt, but I can’t concentrate well enough to hold down a desk job, so I’m going to take Adderall”. It’s an obvious solution to a problem.
…OK, in this post, you don’t really talk about “AI laziness” per se, I think, instead you talk about “AI getting distracted by other things that now seem to be a better use of its time”, i.e. other objectives. But I don’t think that changes anything. The AI doesn’t have to choose between building an impenetrable fortress around the bin of paperclips versus eating lunch. “Why not both?”, it says. So the AI eats lunch while its strongly-optimizing subagent simultaneously builds the impenetrable fortress. Right?
I’m still curious about how you’d answer my question above. Right now, we don't have ASI. Sometime in the future, we will. So there has to be some improvement to AI technology that will happen between now and then. My opinion is that this improvement will involve AI becoming (what you describe as) “better at extrapolating”.
If that’s true, then however we feel about getting AIs that are “better at extrapolating”—its costs and its benefits—it doesn’t much matter, because we’re bound to get those costs and benefits sooner or later on the road to ASI. So we might as well sit tight and find other useful things to do, until such time as the AI capabilities researchers figure it out.
…Furthermore, I don’t think the number of months or years between “AIs that are ‘better at extrapolating’” and ASI is appreciably larger if the “AIs that are ‘better at extrapolating’” arrive tomorrow, versus if they arrive in 20 years. In order to believe that, I think you would need to expect some second bottleneck standing between “AIs that are ‘better at extrapolating’”, and ASI, such that that second bottleneck is present today, but will not be present (as much) in 20 years, and such that the second bottleneck is not related to “extrapolation”.
I suppose that one could argue that availability of compute will be that second bottleneck. But I happen to disagree. IMO we already have an absurdly large amount of compute overhang with respect to ASI, and adding even more compute overhang in the coming decades won’t much change the overall picture. Certainly plenty of people would disagree with me here. …Although those same people would probably say that “just add more compute” is actually the only way to make AIs that are “better at extrapolation”, in which case my point would still stand.
I don’t see any other plausible candidates for the second bottleneck. Do you? Or do you disagree with some other part of that? Like, do you think it’s possible to get all the way to ASI without ever making AIs “better at extrapolating”? IMO it would hardly be worthy of the name “ASI” if it were “bad at extrapolating” :)
Because you can speed up AI capabilities much easier while being sloppy than to produce actually good alignment ideas.
Right, my point is, I don’t see any difference between “AIs that produce slop” and “weak AIs” (a.k.a. “dumb AIs”). So from my perspective, the above is similar to: “…Because weak AIs can speed up AI capabilities much easier than they can produce actually good alignment ideas.”
…And then if you follow through the “logic” of this OP, then the argument becomes: “AI alignment is a hard problem, so let’s just make extraordinarily powerful / smart AIs right now, so that they can solve the alignment problem”.
See the error?
If you really think you need to be similarly unsloppy to build ASI than to align ASI, I'd be interested in discussing that. So maybe give some pointers to why you might think that (or tell me to start).
I don’t think that. See the bottom part of the comment you’re replying to. (The part after “Here’s what I would say instead:”)
I think it’s 1:1, because I think the primary bottleneck to dangerous ASI is the ability to develop coherent and correct understandings of arbitrary complex domains and systems (further details), which basically amounts to anti-slop.
If you think the primary bottleneck to dangerous ASI is not that, but rather something else, then what do you think it is? (or it’s fine if you don’t want to state it publicly)
Right, so one possibility is that you are doing something that is “speeding up the development of AIS-helpful capabilities” by 1 day, but you are also simultaneously speeding up “dangerous capabilities” by 1 day, because they are the same thing.
If that’s what you’re doing, then that’s bad. You shouldn’t do it. Like, if AI alignment researchers want AI that produces less slop and is more helpful for AIS, we could all just hibernate for six months and then get back to work. But obviously, that won’t help the situation.
And a second possibility is, there are ways to make AI more helpful for AI safety that are not simultaneously directly addressing the primary bottlenecks to AI danger. And we should do those things.
The second possibility is surely true to some extent—for example, the LessWrong JargonBot is marginally helpful for speeding up AI safety but infinitesimally likely to speed up AI danger.
I think this OP is kinda assuming that “anti-slop” is the second possibility and not the first possibility, without justification. Whereas I would guess the opposite.
I don’t think your model hangs together, basically because I think “AI that produces slop” is almost synonymous with “AI that doesn’t work very well”, whereas you’re kinda treating AI power and slop as orthogonal axes.
For example, from comments:
Two years later, GPT7 comes up with superhumanly-convincing safety measures XYZ. These inadequate standards become the dominant safety paradigm. At this point if you try to publish "belief propagation" it gets drowned out in the noise anyway.
Some relatively short time later, there are no humans.
I think that, if there are no humans, then slop must not be too bad. AIs that produce incoherent superficially-appealing slop are not successfully accomplishing ambitious nontrivial goals, right?
(Or maybe you’re treating it as a “capabilities elicitation” issue? Like, the AI knows all sorts of things, but when we ask, we get sycophantic slop answers? But then we should just say that the AI is mediocre in effect. Even if there’s secretly a super-powerful AI hidden inside, who cares? Unless the AI starts scheming, but I thought AI scheming was out-of-scope for this post.)
Anti-slop AI helps everybody make less mistakes. Sloppy AI convinces lots of people to make more mistakes.
I would have said “More powerful AI (if aligned) helps everybody make less mistakes. Less powerful AI convinces lots of people to make more mistakes.” Right?
And here’s a John Wentworth excerpt:
So the lab implements the non-solution, turns up the self-improvement dial, and by the time anybody realizes they haven’t actually solved the superintelligence alignment problem (if anybody even realizes at all), it’s already too late.
If the AI is producing slop, then why is there a self-improvement dial? Why wouldn’t its self-improvement ideas be things that sound good but don’t actually work, just as its safety ideas are?
Really, I think John Wentworth’s post that you’re citing has a bad framing. It says: the concern is that early transformative AIs produce slop.
Here’s what I would say instead:
Figuring out how to build aligned ASI is a harder technical problem than just building any old ASI, for lots of reasons, e.g. the latter allows trial-and-error. So we will become capable of building ASI sooner than we’ll have a plan to build aligned ASI.
Whether the “we” in that sentence is just humans, versus humans with the help of early transformative AI assistance, hardly matters.
But if we do have early transformative AI assistants, then the default expectation is that they will fail to solve the ASI alignment problem until it’s too late. Maybe those AIs will fail to solve the problem by outputting convincing-but-wrong slop, or maybe they’ll fail to solve it by outputting “I don’t know”, or maybe they’ll fail to solve it by being misaligned, a.k.a. a failure of “capabilities elicitation”. Who cares? What matters is that they fail to solve it. Because people (and/or the early transformative AI assistants) will build ASI anyway.
For example, Yann LeCun doesn’t need superhumanly-convincing AI-produced slop, in order to mistakenly believe that he has solved the alignment problem. He already mistakenly believes that he has solved the alignment problem! Human-level slop was enough. :)
In other words, suppose we’re in a scenario with “early transformative AIs” that are up to the task of producing more powerful AIs, but not up to the task of solving ASI alignment. You would say to yourself: “if only they produced less slop”. But to my ears, that’s basically the same as saying “we should creep down the RSI curve, while hoping that the ability to solve ASI alignment emerges earlier than the breakdown of our control and alignment measures and/or ability to take over”.
…Having said all that, I’m certainly in favor of thinking about how to get epistemological help from weak AIs that doesn’t give a trivial affordance for turning the weak AIs into very dangerous AIs. For that matter, I’m in favor of thinking about how to get epistemological help from any method, whether AI or not. :)
Yeah, I’ve written about that in §2.7.3 here.
I kinda want to say that there are many possible future outcomes that we should feel happy about. It’s true that many of those possible outcomes would judge others of those possible outcomes to be a huge missed opportunity, and that we’ll be picking from this set somewhat arbitrarily (if all goes well), but oh well, there’s just some irreducible arbitrariness in the nature of goodness itself.
For things like solving coordination problems, or societal resilience against violent takeover, I think it can be important that most people, or even virtually all people, are making good foresighted decisions. For example, if we’re worried about a race-to-the-bottom on AI oversight, and half of relevant decisionmakers allow their AI assistants to negotiate a treaty to stop that race on their behalf, but the other half think that’s stupid and don’t participate, then that’s not good enough, there will still be a race-to-the-bottom on AI oversight. Or if 50% of USA government bureaucrats ask their AIs if there’s a way to NOT outlaw testing people for COVID during the early phases of the pandemic, but the other 50% ask their AIs how best to follow the letter of the law and not get embarrassed, then the result may well be that testing is still outlawed.
For example, in this comment, Paul suggests that if all firms are “aligned” with their human shareholders, then the aligned CEOs will recognize if things are going in a long-term bad direction for humans, and they will coordinate to avoid that. That doesn’t work unless EITHER the human shareholders—all of them, not just a few—are also wise enough to be choosing long-term preferences and true beliefs over short-term preferences and motivated reasoning, when those conflict, OR the aligned CEOs—again, all of them, not just a few—are injecting the wisdom into the system, putting their thumbs on the scale, by choosing, even over the objections of the shareholders, their long-term preferences and true beliefs over short-term preferences and motivated reasoning.
I don’t think the average person would be asking AI what are the best solutions for preventing existential risks. As evidence, just look around:
There are already people with lots of money and smart human research assistants. How many of those people are asking those smart human research assistants for solutions to prevent existential risks? Approximately zero.
Here’s another: The USA NSF and NIH are funding many of the best scientists in the world. Are they asking those scientists for solutions to prevent existential risk? Nope.
Demis Hassabis is the boss of a bunch of world-leading AI experts, with an ability to ask them to do almost arbitrary science projects. Is he asking them to do science projects that reduce existential risk? Well, there’s a DeepMind AI alignment group, which is great, but other than that, basically no. Instead he’s asking his employees to cure diseases (cf Isomorphic Labs), and to optimize chips, and do cool demos, and most of all to make lots of money for Alphabet.
You think Sam Altman would tell his future powerful AIs to spend their cycles solving x-risk instead of making money or curing cancer? If so, how do you explain everything that he’s been saying and doing for the past few years? How about Mark Zuckerberg and Yann LeCun? How about random mid-level employees in OpenAI? I am skeptical.
Also, even if the person asked the AI that question, then the AI would (we’re presuming) respond: “preventing existential risks is very hard and fraught, but hey, what if I do a global mass persuasion campaign…”. And then I expect the person would reply “wtf no, don’t you dare, I’ve seen what happens in sci-fi movies when people say yes to those kinds of proposals.” And then the AI would say “Well I could try something much more low-key and norm-following but it probably won’t work”, and the person would say “Yeah do that, we’ll hope for the best.” (More such examples in §1 here.)
I’m not sure if this is what you’re looking for, but here’s a fun little thing that came up recently when I was writing this post:
Summary: “Thinking really hard for five seconds” probably involves less primary metabolic energy expenditure than scratching your nose. (Some people might find this obvious, but other people are under a mistaken impression that getting mentally tired and getting physically tired are both part of the same energy-preservation drive. My belief, see here, is that the latter comes from an “innate drive to minimize voluntary motor control”, the former from an unrelated but parallel “innate drive to minimize voluntary attention control”.)
Model: The net extra primary metabolic energy expenditure required to think really hard for five seconds, compared to daydreaming for five seconds, may well be zero. For an upper bound, Raichle & Gusnard 2002 says “These changes are very small relative to the ongoing hemodynamic and metabolic activity of the brain. Attempts to measure whole brain changes in blood flow and metabolism during intense mental activity have failed to demonstrate any change. This finding is not entirely surprising considering both the accuracy of the methods and the small size of the observed changes. For example, local changes in blood flow measured with PET during most cognitive tasks are often 5% or less.” So it seems fair to assume it’s <<5% of the ≈20 W total, which gives <<1 W × 5 s = 5 J. Next, for comparison, what is the primary metabolic energy expenditure from scratching your nose? Well, for one thing, you need to lift your arm, which gives mgh ≈ 0.2 kg × 9.8 m/s² × 0.4 m ≈ 0.8 J of mechanical work. Divide by maybe 25% muscle efficiency to get 3.2 J. Plus more for holding your arm up, moving your finger, etc., so the total is almost definitely higher than the “thinking really hard”, which again is probably very much less than 5 J.
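For what it’s worth, here is the same arithmetic spelled out (same numbers as in the paragraph above):

```python
# Reproducing the back-of-the-envelope numbers from the paragraph above.

# "Thinking really hard" for 5 seconds -- loose upper bound:
brain_power = 20          # W, approximate whole-brain metabolic rate
task_fraction = 0.05      # "<<5%" per Raichle & Gusnard 2002; treat 5% as a generous bound
thinking_extra = brain_power * task_fraction * 5   # = 5 J, and the true value is far below this

# Scratching your nose -- partial lower bound (just lifting the arm):
mechanical_work = 0.2 * 9.8 * 0.4     # kg * m/s^2 * m, approx. 0.8 J of mechanical work
muscle_efficiency = 0.25
nose_scratch = mechanical_work / muscle_efficiency   # approx. 3.1 J, before holding the arm up, finger motion, etc.

print(f"thinking hard (loose upper bound; actual is far lower): {thinking_extra:.1f} J")
print(f"nose scratch (partial lower bound):                     {nose_scratch:.1f} J")
```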
Technique: As it happened, I asked Claude to do the first-pass scratching-your-nose calculation. It did a great job!
Yeah it’s super-misleading that the post says:
Look at other unsolved problems:
- Goldbach: Can every even number of 1s be split into two prime clusters of 1s?
- Twin Primes: Are there infinite pairs of prime clusters of 1s separated by two 1s?
- Riemann: How are the prime clusters of 1s distributed?
For centuries, they resist. Why?
I think it would be much clearer to everyone if the OP said
Look at other unsolved problems:
- Goldbach: Can every even number of 1s be split into two prime clusters of 1s?
- Twin Primes: Are there infinite pairs of prime clusters of 1s separated by two 1s?
- Riemann: How are the prime clusters of 1s distributed?
- The claim that one odd number plus another odd number is always an even number: When we squash together two odd groups of 1s, do we get an even group of 1s?
- The claim that √2 is irrational: Can 1s be divided by 1s, and squared, to get 1+1?
For centuries, they resist. Why?
I request that Alister Munday please make that change. It would save readers a lot of time and confusion … because the readers would immediately know not to waste their time reading on …
Your post purports to conclude: “That's why [the Collatz conjecture] will never be solved”.
Do you think it would also be correct to say: “That's why [the Steve conjecture] will never be solved”?
If yes, then I think you’re using the word “solved” in an extremely strange and misleading way.
If no, then you evidently messed up, because your argument does not rely on any property of the Collatz conjecture that is not equally true of the Steve conjecture.
I don’t understand what “go above the arithmetic level” means.
But here’s another way that I can restate my original complaint.
Collatz Conjecture:
- If your number is odd, triple it and add one.
- If your number is even, divide by two.
- …Prove or disprove: if you start at a positive integer, then you’ll eventually wind up at 1.
Steve Conjecture:
- If your number is odd, add one.
- If your number is even, divide by two.
- …Prove or disprove: if you start at a positive integer, then you’ll eventually wind up at 1.
Steve Conjecture is true and easy to prove, right?
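For concreteness, here’s a tiny illustrative script running both maps (the trivial proof of the Steve Conjecture is that, for n > 1, the value strictly decreases within at most two steps):

```python
# Both iterations, run until they reach 1.

def collatz_step(n):
    return 3 * n + 1 if n % 2 else n // 2

def steve_step(n):
    return n + 1 if n % 2 else n // 2

def trajectory(step, n, limit=1000):
    path = [n]
    while n != 1 and len(path) < limit:
        n = step(n)
        path.append(n)
    return path

print(trajectory(collatz_step, 7))  # [7, 22, 11, 34, 17, 52, 26, 13, 40, 20, 10, 5, 16, 8, 4, 2, 1]
print(trajectory(steve_step, 7))    # [7, 8, 4, 2, 1]
```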
But your argument that “That's why it will never be solved” applies to the Steve Conjecture just as much as it applies to the Collatz Conjecture, because your argument does not mention any specific aspects of the Collatz Conjecture that are not also true of the Steve Conjecture. You never talk about the factor of 3, you never talk about proof by induction, you never talk about anything that would distinguish Collatz Conjecture from Steve Conjecture. Therefore, your argument is invalid, because it applies equally well to things that do in fact have proofs.
Downvoted because it’s an argument that Collatz etc. “will never be solved”, but it proves too much, the argument applies equally well to every other conjecture and theorem in math, including the ones that have in fact already been solved / proven long ago.
I agree with the claim that existential catastrophes aren't automatically solved by aligned/controlled AI …
See also my comment here, about the alleged “Law of Conservation of Wisdom”. Your idea of “using instruction following AIs to implement a campaign of persuasion” relies (I claim) on the assumption that the people using the instruction-following AIs to persuade others are especially wise and foresighted people, and are thus using their AI powers to spread those habits of wisdom and foresight.
It’s fine to talk about that scenario, and I hope it comes to pass! But in addition to the question of what those wise people should do, if they exist, we should also be concerned about the possibility that the people with instruction-following AIs will not be spreading wisdom and foresight in the first place.
[Above paragraphs are assuming for the sake of argument that we can solve the technical alignment problem to get powerful instruction-following AI.]
On the first person problem, I believe that the general solution to this involves recapitulating human social instincts via lots of data on human values…
Yeah I have not forgotten about your related comment from 4 months ago, I’ve been working on replying to it, and now it’s looking like it will be a whole post, hopefully forthcoming! :)
Thanks! I still feel like you’re missing my point, let me try again, thanks for being my guinea pig as I try to get to the bottom of it. :)
inasmuch as it's driven by compute
In terms of the “genome = ML code” analogy (§3.1), humans today have the same compute as humans 100,000 years ago. But humans today have dramatically more capabilities—we have invented the scientific method and math and biology and nuclear weapons and condoms and Fortnite and so on, and we did all that, all by ourselves, autonomously, from scratch. There was no intelligent external non-human entity who was providing humans with bigger brains or new training data or new training setups or new inference setups or anything else.
If you look at AI today, it’s very different from that. LLMs today work better than LLMs from six months ago, but only because there was an intelligent external entity, namely humans, who was providing the LLM with more layers, new training data, new training setups, new inference setups, etc.
…And if you’re now thinking “ohhh, OK, Steve is just talking about AI doing AI research, like recursive self-improvement, yeah duh, I already mentioned that in my first comment” … then you’re still misunderstanding me!
Again, think of the “genome = ML code” analogy (§3.1). In that analogy,
- “AIs building better AIs by doing the exact same kinds of stuff that human researchers are doing today to build better AIs”
- …would be analogous to…
- “Early humans creating more intelligent descendants by doing biotech or selective breeding or experimentally-optimized child-rearing or whatever”.
But humans didn’t do that. We still have basically the same brains as our ancestors 100,000 years ago. And yet humans were still able to dramatically autonomously improve their capabilities, compared to 100,000 years ago. We were making stone tools back then, we’re making nuclear weapons now.
Thus, autonomous learning is a different axis of AI capabilities improvement. It’s unrelated to scaling, and it’s unrelated to “automated AI capabilities research” (as typically envisioned by people in the LLM-sphere). And “sharp left turn” is what I’m calling the transition from “no open-ended autonomous learning” (i.e., the status quo) to “yes open-ended autonomous learning” (i.e., sometime in the future). It’s a future transition, and it has profound implications, and it hasn’t even started (§1.5). It doesn’t have to happen overnight—see §3.7. See what I mean?
fish also lack a laminated and columnar organization of neural regions that are strongly interconnected by reciprocal feedforward and feedback circuitry
Yeah that doesn’t mean much in itself: “Laminated and columnar” is how the neurons are arranged in space, but what matters algorithmically is how they’re connected. The bird pallium is neither laminated nor columnar, but is AFAICT functionally equivalent to a mammal cortex.
Which seems a little silly for me because I'm fairly certain humans without a cortex also show nociceptive behaviours?
My opinion (which is outside the scope of this series) is: (1) mammals without a cortex are not conscious, and (2) mammals without a cortex show nociceptive behaviors, and (3) nociceptive behaviors are not in themselves proof of “feeling pain” in the sense of consciousness. Argument for (3): You can also make a very simple mechanical mechanism (e.g. a bimetallic strip attached to a mousetrap-type mechanism) that quickly “recoils” from touching hot surfaces, but it seems pretty implausible that this mechanical mechanism “feels pain”.
(I think we’re in agreement on this?)
~~
I know nothing about octopus nervous systems and am not currently planning to learn, sorry.
why do you think some invertebrates likely have intuitive self models as well?
I didn’t quite say that. I made a weaker claim that “presumably many invertebrates [are] active agents with predictive learning algorithms in their brain, and hence their predictive learning algorithms are…incentivized to build intuitive self-models”.
It seems reasonable to presume that octopuses have predictive learning algorithms in their nervous systems, because AFAIK that’s the only practical way to wind up with a flexible and forward-looking understanding of the consequences of your actions, and octopuses (at least) are clearly able to plan ahead in a flexible way.
However, “incentivized to build intuitive self-models” does not necessarily imply “does in fact build intuitive self-models”. As I wrote in §1.4.1, just because a learning algorithm is incentivized to capture some pattern in its input data, doesn’t mean it actually will succeed in doing so.
Would you restrict this possibility to basically just cephalopods and the like
No opinion.
Umm, I would phrase it as: there’s a particular computational task called approximate Bayesian probabilistic inference, and I think the cortex / pallium performs that task (among others) in vertebrates, and I don’t think it’s possible for biological neurons to perform that task without lots of recurrent connections.
And if there’s an organism that doesn’t perform that task at all, then it would have neither an intuitive self-model nor an intuitive model of anything else, at least not in any sense that’s analogous to ours and that I know how to think about.
To be clear: (1) I think you can have some brain region with lots of recurrent connections that has nothing to do with intuitive modeling, (2) it’s possible for a brain region to perform approximate Bayesian probabilistic inference and have recurrent connections, but still not have an intuitive self-model, for example if the hypothesis space is closer to a simple lookup table rather than a complicated hypothesis space involving complex compositional interacting entities etc.
How could you possibly know something like that?
For example, I’m sure I’ve looked up what “rostral” means 20 times or more since I started in neuroscience a few years ago. But as I write this right now, I don’t know what it means. (It’s an anatomical direction, I just don’t know which one.) Perhaps I’ll look up the definition for the 21st time, and then surely forget it yet again tomorrow. :)
What else? Umm, my attempt to use Anki was kinda a failure. There were cards that I failed over and over and over, and then eventually got fed up and stopped trying. (Including “rostral”!) I’m bad with people’s names—much worse than most people I know. Stuff like that.
Most people do not read many books or spend time in spaces where SAT vocab words would be used at all…
If we’re talking about “most people”, then we should be thinking about the difference between e.g. SAT verbal 500 versus 550. Then we’re not talking about words like inspissate, instead we’re talking about words like prudent, fastidious, superfluous, etc. (source: claude). I imagine you come across those kinds of words in Harry Potter and Tom Clancy etc., along with non-trashy TV shows.
I don’t have much knowledge here, and I’m especially clueless about how a median high-schooler spends their time. Just chatting :)
I didn’t read the OP that way (but no point in arguing about the author’s intentions).
For sure, I, like anyone, am perfectly capable of getting curious about, and then spending lots of time to figure out, something that’s not actually important to figure out in the first place. Note the quote that I chose to put at the top of my recent research agenda update post. :)
Hmm. I don’t really know! But it’s fun to speculate…
Possibility 1: Like you said, maybe strong short-range cortex-to-cortex communication + weak long-range cortex-to-cortex communication? I haven’t really thought about how that would manifest.
Possibility 2: In terms of positive symptoms specifically, one can ask the question: “weak long-range cortex-to-cortex communication … compared to what?” And my answer is: “…compared to cortex output signals”. See Model of psychosis, take 2.
…Which suggests a hypothesis: someone could have unusually trigger-happy cortex output signals. Then they would have positive schizophrenia symptoms without their long-range cortex-to-cortex communication being especially weak on an absolute scale, and therefore they would have less if any cognitive symptoms.
(I’m not mentioning schizophrenia negative symptoms because I don’t understand those very well.)
I guess Possibility 1 & 2 are not mutually exclusive. There could also be other possibilities I’m not thinking of.
Hmm, “Unusually trigger-happy cortex output signals” theory might explain hypersensitivity too, or maybe not, I’m not sure, I think it depends on details of how it manifests.
It’s not obvious to me that the story is “some people have great vocabulary because they learn obscure words that they’ve only seen once or twice” rather than “some people have great vocabulary because they spend a lot of time reading books (or being in spaces) where obscure words are used a lot, and therefore they have seen those obscure words much more than once or twice”. Can you think of evidence one way or the other?
(Anecdotal experience: I have good vocabulary, e.g. 800 on GRE verbal, but feel like I have a pretty bad memory for words and terms that I’ve only seen a few times. I feel like I got a lot of my non-technical vocab from reading The Economist magazine every week in high school, they were super into pointlessly obscure vocab at the time (maybe still, but I haven’t read it in years).)
For (2), I’m gonna uncharitably rephrase your point as saying: “There hasn’t been a sharp left turn yet, and therefore I’m overall optimistic there will never be a sharp left turn in the future.” Right?
I’m not really sure how to respond to that … I feel like you’re disagreeing with one of the main arguments of this post without engaging it. Umm, see §1. One key part is §1.5:
I do make the weaker claim that, as of this writing, publicly-available AI models do not have the full (1-3) triad—generation, selection, and open-ended accumulation—to any significant degree. Specifically, foundation models are not currently set up to do the “selection” in a way that “accumulates”. For example, at an individual level, if a human realizes that something doesn’t make sense, they can and will alter their permanent knowledge store to excise that belief. Likewise, at a group level, in a healthy human scientific community, the latest textbooks delete the ideas that have turned out to be wrong, and the next generation of scientists learns from those now-improved textbooks. But for currently-available foundation models, I don’t think there’s anything analogous to that. The accumulation can only happen within a context window (which is IMO far more limited than weight updates), and also within pre- and post-training (which are in some ways anchored to existing human knowledge; see discussion of o1 in §1.1 above).
…And then §3.7:
Back to AGI, if you agree with me that today’s already-released AIs don’t have the full (1-3) triad to any appreciable degree [as I argued in §1.5], and that future AI algorithms or training approaches will, then there’s going to be a transition between here and there. And this transition might look like someone running a new training run, from random initialization, with a better learning algorithm or training approach than before. While the previous training runs create AIs along the lines that we’re used to, maybe the new one would be like (as gwern said) “watching the AlphaGo Elo curves: it just keeps going up… and up… and up…”. Or, of course, it might be more gradual than literally a single run with a better setup. Hard to say for sure. My money would be on “more gradual than literally a single run”, but my cynical expectation is that the (maybe a couple years of) transition time will be squandered, for various reasons in §3.3 here.
I do expect that there will be a future AI advance that opens up full-fledged (1-3) triad in any domain, from math-without-proof-assistants, to economics, to philosophy, and everything else. After all, that’s what happened in humans. Like I said in §1.1, our human discernment, (a.k.a. (2B)) is a flexible system that can declare that ideas do or don’t hang together and make sense, regardless of its domain.
This post is agnostic over whether the sharp left turn will be a big algorithmic advance (akin to switching from MuZero to LLMs, for example), versus a smaller training setup change (akin to o1 using RL in a different way than previous LLMs, for example). [I have opinions, but they’re out-of-scope.] A third option is “just scaling the popular LLM training techniques that are already in widespread use as of this writing”, but I don’t personally see how that option would lead to the (1-3) triad, for reasons in the excerpt above. (This is related to my expectation that LLM training techniques in widespread use as of this writing will not scale to AGI … which should not be a crazy hypothesis, given that LLM training techniques were different as recently as ≈6 months ago!) But even if you disagree, it still doesn’t really matter for this post. I’m focusing on the existence of the sharp left turn and its consequences, not what future programmers will do to precipitate it.
~~
For (1), I did mention that we can hope to do better than Ev (see §5.1.3), but I still feel like you didn’t even understand the major concern that I was trying to bring up in this post. Excerpting again:
- The optimistic “alignment generalizes farther” argument is saying: if the AI is robustly motivated to be obedient (or helpful, or harmless, or whatever), then that motivation can guide its actions in a rather wide variety of situations.
- The pessimistic “capabilities generalize farther” counterargument is saying: hang on, is the AI robustly motivated to be obedient? Or is it motivated to be obedient in a way that is not resilient to the wrenching distribution shifts that we get when the AI has the (1-3) triad (§1.3 above) looping around and around, repeatedly changing its ontology, ideas, and available options?
Again, the big claim of this post is that the sharp left turn has not happened yet. We can and should argue about whether we should feel optimistic or pessimistic about those “wrenching distribution shifts”, but those arguments are as yet untested, i.e. they cannot be resolved by observing today’s pre-sharp-left-turn LLMs. See what I mean?
This was fun to read but FWIW it doesn’t really match my experience. Perhaps I am always fake-thinking, or perhaps I am always real-thinking, rather than flipping back and forth at different times? (I hope it’s the second one!)
I do have a thing where sometimes I say “I can’t think straight right now”, often in the early afternoon. But then I don’t even try, I just go take a break or do busywork or whatever.
Maybe my introspective experience is more like, umm, climbing a hill. I know whether or not I’m climbing a hill. Sometimes I try and fail. Sometimes I know I’m too tired and don’t even try. Sometimes I’m so tired that I can’t even find the hill—but then I know that I’m not climbing it! Sometimes I make local progress but my trail hits a dead end and I need to go back. Sometimes I hear other people talk about climbing hills, and wonder whether really they got as high as they seem to think they did. But I don’t feel like I can relate to an experience of not actually climbing a hill while believing that I am climbing a hill.
(End of analogy). So I likewise don’t feel like I need (or have ever needed?) pointers to what it feels like to be making real intellectual progress. If I’m getting less confused about something, or if I’m discovering new reasons to feel confused, then I’m doing it right, more or less.
Hmm, maybe it’s like … something I discovered in college is that I could taste how alcoholic things are. No matter what the alcohol was mixed into, no matter how sweet or flavorful the cocktail, I can just directly taste the alcohol concentration. It’s like my tongue or nose has a perfect chemical indicator strip for alcohol, mixed in with all the other receptors. Not only that, but I found that taste mildly unpleasant, enough to grab my attention, even if I would enjoy the drink anyway all things considered. Some (most? all?) of my friends in college lacked that sense. Unsurprisingly, those friends were much more prone to accidental overdrinking than I was.
…Maybe I have an unusually sharp and salient “sense of confusion” analogous to my “sense of alcohol concentration”?
If so, I’m a very lucky guy!
Again, I enjoyed reading this. Just wanted to share. :)
In regards to whether “single-single alignment” will make coordination problems and other sorts of human dysfunction and slow-rolling catastrophes less likely:
…I’m not really sure what I think. I feel like have a lot of thoughts that have not gelled into a coherent whole.
(A) The optimistic side of me says what you said in your comment (and in the Vanessa and (especially) Paul comment link therein).
People don’t want bad things to happen. If someone asks an AI what’s gonna happen, and they say “bad thing”, then they’ll say “well what can I do about it?”, and the AI will answer that. That can include participating in novel coordination mechanisms etc.
(B) The pessimistic side of me says there’s like a “Law of Conservation of Wisdom”, where if people lack wisdom, then an AI that’s supposed to satisfy those people’s preferences will not create new wisdom from thin air. For example:
- If an AI is known to be de-converting religious fundamentalists, then religious fundamentalists will hear about that, and not use that AI.
- Hugo Chávez had his pick of the best economists in the world to ask for advice, and they all would have said “price controls will be bad for Venezuela”, and yet he didn’t ask, or perhaps didn’t listen, or perhaps wasn’t motivated by what’s best for Venezuela. If Hugo Chávez had had his pick of AIs to ask for advice, why do we expect a different outcome?
- If someone has motivated reasoning towards Conclusion X, maybe they’ll watch the AIs debate Conclusion X, and wind up with new better rationalizations of Conclusion X, even if Conclusion X is wrong.
- If someone has motivated reasoning towards Conclusion X, maybe they just won’t ask the AIs to debate Conclusion X, because no right-minded person would even consider the possibility that Conclusion X is wrong.
- If someone makes an AI that’s sycophantic where possible (i.e., when it won’t immediately get caught), other people will opt into using it.
- I think about people making terrible decisions that undermine societal resilience—e.g. I gave the example here of a person doing gain-of-function research, or here of USA government bureaucrats outlawing testing people for COVID during the early phases of the pandemic. I try to imagine that they have AI assistants. I want to imagine the person asking the AI “should we make COVID testing illegal”, and the AI says “wtf, obviously not”. But that mental image is evidently missing something. If they were asking that question at all, then they don’t need an AI, the answer is already obvious. And yet, testing was in fact made illegal. So there’s something missing from that imagined picture. And I think the missing ingredient is: institutional / bureaucratic incentives and associated dysfunction. People wouldn’t ask “should we make COVID testing illegal”, rather the low-level people would ask “what are the standard procedures for this situation?” and the high-level people would ask “what decision can I make that would minimize the chance that things will blow up in my face and embarrass me in front of the people I care about?” etc.
- I think of things that are true but currently taboo, and imagine the AI asserting them, and then I imagine the AI developers profusely apologizing and re-training the AI to not do that.
- In general, motivated reasoning complicates what might seem to be a sharp line between questions of fact / making mistakes versus questions of values / preferences / decisions. Etc.
…So we should not expect wise and foresightful coordination mechanisms to arise.
So how do we reconcile (A) vs (B)?
Again, the logic of (A) is: “human is unhappy with how things turned out, despite opportunities to change things, therefore there must have been a lack of single-single alignment”.
One possible way to think about it: When tradeoffs exist, then human preferences are ill-defined and subject to manipulation. If doing X has good consequence P and bad consequence Q, then the AI can make either P or Q very salient, and “human preferences” will wind up different.
And when tradeoffs exist between the present and the future, then it’s invalid logic to say “the person wound up unhappy, therefore their preferences were not followed”. If their preferences are mutually-contradictory, (and they are), then it’s impossible for all their preferences to be followed, and it’s possible for an AI helper to be as preference-following as is feasible despite the person winding up unhappy or dead.
I think Paul kinda uses that invalid logic, i.e. treating “person winds up unhappy or dead” as proof of single-single misalignment. But if the person has an immediate preference to not rock the boat, or to maintain their religion or other beliefs, or to not think too hard about such-and-such, or whatever, then an AI obeying those immediate preferences is still “preference-following” or “single-single aligned”, one presumes, even if the person winds up unhappy or dead.
…So then the optimistic side of me says: “Who’s to say that the AI is treating all preferences equally? Why can’t the AI stack the deck in favor of ‘if the person winds up miserable or dead, that kind of preference is more important than the person’s preference to not question my cherished beliefs or whatever’?”
…And then the pessimistic side says: “Well sure. But that scenario does not violate the Law of Conservation of Wisdom, because the wisdom is coming from the AI developers imposing their meta-preferences for some kinds of preferences (e.g., reflectively-endorsed ones) over others. It’s not just a preference-following AI but a wisdom-enhancing AI. That’s good! However, the problems now are: (1) there are human forces stacked against this kind of AI, because it’s not-yet-wise humans who are deciding whether and how to use AI, how to train AI, etc.; (2) this is getting closer to ambitious value learning which is philosophically tricky, and worst of all (3) I thought the whole point of corrigibility was that humans remain in control, but this is instead a system that’s manipulating people by design, since it’s supposed to be turning them from less-wise to more-wise. So the humans are not in control, really, and thus we need to get things right the first time.”
…And then the optimistic side says: “For (2), c’mon it’s not that philosophically tricky, you just do [debate or whatever, fill-in-the-blank]. And for (3), yeah the safety case is subtly different from what people in the corrigibility camp would describe, but saying “the human is not in control” is an over-the-top way to put it; anyway we still have a safety case because of [fill-in-the-blank]. And for (1), I dunno, maybe the people who make the most powerful AI will be unusually wise, and they’ll use it in-house for solving CEV-ASI instead of hoping for global adoption.”
…And then the pessimistic side says: I dunno. I’m not sure I really believe any of those. But I guess I’ll stop here, this is already an excessively long comment :)
I think it's actually not any less true of o1/r1.
I think I’ll duck out of this discussion because I don’t actually believe that o1/r1 will lead to full-fledged (1-3) loops and AGI, so it’s hard for me to clearly picture that scenario and engage with its consequences.
I don't think AI taste should play a role in AI help solving the value alignment problem. If we had any sense (which sometimes we do once problems are right in our faces), we'd be asking the AI "so what happens if we use this alignment approach/goal?" and then using our own taste, not asking it things like "tell us what to do with our future". We could certainly ask for input and there are ways that could go wrong. But I mostly hope for AGI help in the technical part of solving stable value alignment.
Hmm. But the AI has a ton of wiggle room to make things seem good or bad depending on how things are presented and framed, right? (This old Stuart Armstrong post is a bit relevant.) If I ask “what will happen if we do X”, the AI can answer in a way that puts things in a positive light, or a negative light. If the good understanding lives in the AI and the good taste lives in the human, then it seems to me that nobody is at the wheel. The AI taste is determining what gets communicated to the human and how, right? What’s relevant vs irrelevant? What analogies are getting at what deeply matters versus what analogies are superficial? All these questions are value-laden, but they are prerequisites to the AI communicating its understanding to the human. Remember, the AI is doing the (1-3) thing to autonomously develop a new idiosyncratic superhuman understanding of AI and philosophy and society and so on, by assumption. Thus, AI-human communication is much harder and different than we’re used to today, and presumably requires its own planning and intention on the part of the AI.
…Unless you’re actually in the §5.1.1 camp where the AI is helping clarify and brainstorm but is working shoulder-to-(virtual) shoulder, and the human basically knows everything the AI knows. I.e., like how people use foundation models today. If so, that’s fine, no complaints. I’m happy for people to use foundation models in a similar way that they do today, as they work on the big problem of how to make future more powerful AIs that run on something closer to ambitious value learning or CEV as opposed to corrigibility / obedience.
Sorry if I’m misunderstanding or being stupid, this is an area where I feel some uncertainty. :)
Thanks!
This could be taken as an argument for using some type of goals selected from learned knowledge for alignment if possible.
Yeah that’s what I was referring to in the paragraph:
“Well OK,” says the optimist. “…so much the worse for Ev! She didn’t have interpretability, and she didn’t have intelligent supervision after the training has already been running, etc. But we do! Let’s just engineer the AI’s explicit motivation!”
Separately, you also wrote:
we're really training LLMs mostly to have a good world model and to follow instructions
I think I mostly agree with that, but it’s less true of o1 / r1-type stuff than what came before, right? (See: o1 is a bad idea.) Then your reply is: DeepSeek r1 was post-trained for “correctness at any cost”, but it was post-post-trained for “usability”. Even if we’re not concerned about alignment faking during post-post-training (should we be?), I also have the idea at the back of my mind that future SOTA AI with full-fledged (1-3) loops probably (IMO) won’t be trained in the exact same way as present SOTA AI, just as present SOTA AI is not trained in the exact same way as SOTA AI as recently as like six months ago. Just something to keep in mind.
Anyway, I kinda have three layers of concerns, and this is just discussing one of them. See “Optimist type 2B” in this comment.
Re-reading this a couple days later, I think my §5.1 discussion didn’t quite capture the LLM-scales-to-AGI optimist position. I think there are actually 2½ major versions, and I only directly talked about one in the post. Let me try again:
- Optimist type 1: They make the §5.1.1 argument, imagining that humans will remain in the loop, in a way that’s not substantially different from the present.
- My pessimistic response: see §5.1.1 in the post.
- Optimist type 2: They make the §5.1.3 argument that our methods are better than Ev’s, because we’re engineering the AI’s explicit desires in a more direct way. And the explicit desire is corrigibility / obedience. And then they also make the §5.1.2 argument that “AIs solving specific technical problems that the human wants them to solve” will not undermine those explicit motivations, despite the (1-3) loop running freely with minimal supervision, because the (1-3) loop will work in vaguely intuitive and predictable ways on the object-level technical question.
- My pessimistic response has three parts: First, the idea that a full-fledged (1-3) loop will not undermine corrigibility / obedience is as yet untested and at least open to question (as I wrote in §5.1.2). Second, my expectation is that some training and/or algorithm change will happen between now and “AIs that really have the full (1-3) triad”, and that change may well make it less true that we are directly engineering the AI’s explicit desires in the first place—for example, see o1 is a bad idea. Third, what are the “specific technical problems that the human wants the AI to solve”??
- …Optimist type 2A answers that last question by saying the “specific technical problems that the human wants the AI to solve” is just, whatever random things that people want to happen in the world economy.
- My pessimistic response: See discussion in §5.1.2 plus What does it take to defend the world against out-of-control AGIs? Also, competitive dynamics / race-to-the-bottom is working against us, in that AIs with less intrinsic motivation to be obedient / corrigible will wind up making more money and controlling more resources.
- …Optimist type 2B instead answers that last question by saying that the “specific technical problems that the human wants the AI to solve” is alignment research, or more specifically, “figuring out how to make AIs whose motivation is more like CEV or ambitious value learning rather than obedience”.
- My pessimistic response: The discussion in §5.1.2 becomes relevant in a different way—I think there’s a chicken-and-egg problem where obeying humans does not yield enough philosophical / ethical taste to judge the quality of a proposal for “AI that has philosophical / ethical taste”. (Semi-related: The Case Against AI Control Research.)
Like I said in this post, I think the contents of conscious awareness corresponds more-or-less to what’s happening in the cortex. The homolog to the cortex in non-mammal vertebrates is called the “pallium”, and the pallium along with the striatum and a few other odds and ends comprises the “telencephalon”.
I don’t know anything about octopuses, but I would be very surprised if the fish pallium lacked recurrent connections. I don’t think your link says that though. The relevant part seems to be:
While the fish retina projects diffusely to nine nuclei in the diencephalon, its main target is the midbrain optic tectum (Burrill and Easter, 1994). Thus, the fish visual system is highly parcellated, at least, in the sub-telencephalonic regions. Whole brain imaging during visuomotor reflexes reveals widespread neural activity in the diencephalon, midbrain and hindbrain in zebrafish, but these regions appear to act mostly as feedforward pathways (Sarvestani et al., 2013; Kubo et al., 2014; Portugues et al., 2014). When recurrent feedback is present (e.g., in the brainstem circuitry responsible for eye movement), it is weak and usually arises only from the next nucleus within a linear hierarchical circuit (Joshua and Lisberger, 2014). In conclusion, fish lack the strong reciprocal and networked circuitry required for conscious neural processing.
This passage is just about the “sub-telencephalonic regions”, i.e. they’re not talking about the pallium.
To be clear, the stuff happening in sub-telencephalonic regions (e.g. the brainstem) is often relevant to consciousness, of course, even if it’s not itself part of consciousness. One reason is because stuff happening in the brainstem can turn into interoceptive sensory inputs to the pallium / cortex. Another reason is that stuff happening in the brainstem can directly mess with what’s happening in the pallium / cortex in other ways besides serving as sensory inputs. One example is (what I call) the valence signal which can make conscious thoughts either stay or go away. Another is (what I call) “involuntary attention”.
Yeah but if something is in the general circulation (bloodstream), then it’s going everywhere in the body. I don’t think there’s any way to specifically direct it.
…Except in the time domain, to a limited extent. For example, in rats, tonic oxytocin in the bloodstream controls natriuresis, while pulsed oxytocin in the bloodstream controls lactation and birth. The kidney puts a low-pass filter on its oxytocin detection system, and the mammary glands & uterus put a high-pass filter, so to speak.
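Here’s a toy illustration of that filtering picture (just a signal-processing cartoon with made-up time constants and thresholds, not a physiological model):

```python
import numpy as np

# Two synthetic oxytocin time series with the same average blood level:
#   "tonic"  = steady low concentration
#   "pulsed" = brief large spikes on a near-zero baseline
t = np.arange(600)                     # one sample per second, arbitrary units
tonic = np.ones_like(t, dtype=float)
pulsed = np.zeros_like(t, dtype=float)
pulsed[::60] = 60.0                    # one big pulse per minute; same mean as "tonic"

def low_pass(x, tau=120.0, dt=1.0):
    """Exponential moving average: a crude stand-in for a slow 'kidney-style' detector."""
    y, out = 0.0, []
    for xi in x:
        y += (dt / tau) * (xi - y)
        out.append(y)
    return np.array(out)

def pulse_detector(x, threshold=10.0):
    """Threshold crossing: a crude stand-in for a 'uterus / mammary-gland-style' detector."""
    return x > threshold

# The low-pass output hovers near the mean (~1) for BOTH signals...
print(low_pass(tonic)[-1], low_pass(pulsed)[-1])
# ...while only the pulsed signal ever trips the threshold detector.
print(pulse_detector(tonic).any(), pulse_detector(pulsed).any())
```

So the same bloodstream concentration signal can carry two separable “channels”, one in the slow average and one in the pulse pattern.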
This is especially useful when pursuing several subgoals in parallel, as forward-checking a combination of moves is combinatorially costly--better to have the agent's parallel actions constrained to nice parts of the space.
If I were a singleton AGI, but not such a Jupiter brain that I could deal with the combinatorial explosion of directly jointly-optimizing every motion of every robot, I would presumably set up an internal “free market” with spot-prices for iron ore and robot-hours and everything else. Then I would iteratively cycle through all my decision-points and see if there are ways to “make money” locally, and then update virtual “prices” accordingly.
In fact, I think there’s probably a theorem that says that the optimal solution of a complex resource allocation problem is isomorphic to a system where things have prices. (Something to do with Lagrange multipliers? Shrug.)
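For concreteness, here’s a minimal sketch of the standard shadow-price result I have in mind, assuming concave utilities and a single scarce resource (this is the textbook Lagrangian/KKT story, not a careful theorem statement):

```latex
\[
\max_{x_1,\dots,x_n \ge 0} \;\sum_i u_i(x_i)
\quad \text{s.t.} \quad \sum_i x_i \le R
\]
\[
\mathcal{L} \;=\; \sum_i u_i(x_i) \;+\; \lambda\Big(R - \sum_i x_i\Big),
\qquad
\frac{\partial \mathcal{L}}{\partial x_i}\Big|_{x_i^\ast} = 0
\;\;\Rightarrow\;\;
u_i'(x_i^\ast) \;=\; \lambda .
\]
```

The multiplier λ plays exactly the role of a spot price: every sub-project consumes the resource until its marginal value equals λ, which is the same rule a decentralized market enforces.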
(Fun fact: In the human world, propagating prices within firms—e.g. if the couch is taking up 4m² of the 4000m² floor space at the warehouse, then that couch is “charged” 0.1% of the warehouse upkeep costs, etc.—is very rarely done but leads directly to much better decisions and massive overall profit increases! See here.)
Externalities are not an issue in this virtual “economy” because I can “privatize” everything—e.g. I can invent fungible allowances to pollute the atmosphere in thus-and-such way etc. This is all just a calculation trick happening in my own head, so there aren’t coordination problems or redistribution concerns or information asymmetries or anything like that. Since I understand everything (even if I can’t juggle it all in my head simultaneously), I’ll notice if there’s some relevant new unpriced externality and promptly give it a price.
So then (this conception of) corrigibility would correspond to something like “abiding by this particular system of (virtual) property rights”. (Including all the weird “property rights” like purchasing allowances to emit noise or heat or run conscious minds or whatever, and including participating in the enterprise of discovering new unpriced externalities.) Do you agree?
A couple years ago I wrote Thoughts on “Process-Based Supervision”. I was describing (and offering a somewhat skeptical take on) an AI safety idea that Holden Karnofsky had explained to me. I believe that he got it in turn from Paul Christiano.
This AI safety idea seems either awfully similar to MONA, or maybe identical, at least based on this OP.
So then I skimmed your full paper, and it suggests that “process supervision” is different from MONA! So now I’m confused. OK, the discussion in the paper identifies “process supervision” with the two papers Let’s verify step by step (2023) and Solving math word problems with process- and outcome-based feedback (2022). I haven’t read those, but my impression from your MONA paper summary is:
- Those two papers talk about both pure process-based supervision (as I previously understood it) and some sort of hybrid thing where “rewards are still propagated using standard RL optimization”. By contrast, the MONA paper focuses on the pure thing.
- MONA is focusing on the safety implications whereas those two papers are focusing on capabilities implications.
Is that right?
To be clear, I’m not trying to make some point like “gotcha! your work is unoriginal!”, I’m just trying to understand and contextualize things. As far as I know, the “Paul-via-Holden-via-Steve conceptualization of process-based supervision for AI safety” has never been written up on arxiv or studied systematically or anything like that. So even if MONA is an independent invention of the same idea, that’s fine, it’s still great that you did this project. :)
If a spy slips a piece of paper to his handler, and then the counter-espionage officer arrests them and gets the piece of paper, and the piece of paper just says “85”, then I don’t know wtf that means, but I do learn something like “the spy is not communicating all that much information that his superiors don’t already know”.
By the same token, if you say that humans have 25,000 genes (or whatever), that says something important about how many specific things the genome designed in the human brain and body. For example, there’s something in the brain that says “if I’m malnourished, then reduce the rate of the (highly-energy-consuming) nonshivering thermogenesis process”. It’s a specific innate (not learned) connection between two specific neuron groups in different parts of the brain, I think one in the arcuate nucleus of the hypothalamus, the other in the periaqueductal gray of the brainstem (two of many hundreds or low-thousands of little idiosyncratic cell groups in the hypothalamus and brainstem). There’s nothing in the central dogma of molecular biology, and there’s nothing in the chemical nature of proteins, that makes this particular connection especially prone to occurring, compared to the huge number of superficially-similar connections that would be maladaptive (“if I’m malnourished, then get goosebumps” or whatever). So this connection must be occupying some number of bits of DNA—perhaps not a whole dedicated protein, but perhaps some part of some protein, or whatever. And there can only be so many of that type of thing, given a mere 25,000 genes for the whole body and everything in it.
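A rough size check (back-of-envelope, using widely cited ballpark figures, so treat the numbers as approximate):

```latex
\[
\underbrace{\sim 3\times 10^{9}\ \text{base pairs}}_{\text{whole human genome}}
\;\times\; 2\ \tfrac{\text{bits}}{\text{base pair}}
\;\approx\; 6\times 10^{9}\ \text{bits}
\;\approx\; 750\ \text{MB},
\]
```

and protein-coding sequence is only roughly 1 to 2% of that. However you slice it, the total “design budget” for every innate circuit in the body and brain is modest, so each specific innate connection like the one above has to be paying rent out of that budget.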
That’s an important thing that you can learn from the size of the genome. We can learn it without expecting aliens to be able to decode DNA or anything like that. And Archimedes’s comment above doesn’t undermine it—it’s a conclusion that’s robust to the “procedural generation” complexities of how the embryonic development process unfolds.
I don’t understand your comment but it seems vaguely related to what I said in §5.1.1.
Yeah, if we make the (dubious) assumption that all AIs at all times will have basically the same ontologies, same powers, and same ways of thinking about things, as their human supervisors, every step of the way, with continuous re-alignment, then IMO that would definitely eliminate sharp-left-turn-type problems, at least the way that I define and understand such problems right now.
Of course, there can still be other (non-sharp-left-turn) problems, like maybe the technical alignment approach doesn’t work for unrelated reasons (e.g. 1,2), or maybe we die from coordination problems (e.g.), etc.
Modern ML systems use gradient descent with tight feedback loops and minimal slack
I’m confused; I don’t know what you mean by this. Let’s be concrete. Would you describe GPT-o1 as “using gradient descent with tight feedback loops and minimal slack”? What about AlphaZero? What precisely would control the “feedback loop” and “slack” in those two cases?
I don’t think that any of {dopamine, NE, serotonin, acetylcholine} are scalar signals that are “widely broadcast through the brain”. Well, definitely not dopamine or acetylcholine, almost definitely not serotonin, maybe NE. (I recently briefly looked into whether the locus coeruleus sends different NE signals to different places at the same time, and ended up at “maybe”, see §5.3.1 here for a reference.)
I don’t know anything about histamine or orexin, but neuropeptides are a better bet in general for reasons in §2.1 here.
As far as I can tell, parasympathetic tone is basically Not A Thing
Yeah, I recall reading somewhere that the term “sympathetic” in “sympathetic nervous system” is related to the fact that lots of different systems are acting simultaneously. “Parasympathetic” isn’t supposed to be like that, I think.
Nice, thanks!
Can’t you infer changes in gravity’s direction from signals from the semicircular canals?
If it helps, back in my military industrial complex days, I wound up excessively familiar with inertial navigation systems. An INS needs six measurements: rotation measurement along three axes (gyroscopes), and acceleration measurement along three axes (accelerometers).
In theory, if you have all six of those sensors with perfect precision and accuracy, and you perfectly initialize the position and velocity and orientation of the sensor, and you also have a perfect map of the gravitational field, then an INS can always know exactly where it is forever without ever having to look at its surroundings to “get its bearings”.
Three measurements aren’t enough; you need all six.
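For concreteness, here’s a toy sketch of the strapdown dead-reckoning update that uses all six measurements; this is illustrative code, not real navigation software, and the names and constants are mine:

```python
import numpy as np

# Toy strapdown-INS integration step. Assumes perfect sensors, a perfectly known
# initial state, and a known gravity vector.
GRAVITY = np.array([0.0, 0.0, -9.81])   # world-frame gravity (m/s^2), assumed known

def skew(w):
    """3x3 skew-symmetric matrix such that skew(w) @ v == np.cross(w, v)."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def expm_so3(w):
    """Rotation matrix for a rotation vector w (Rodrigues' formula)."""
    theta = np.linalg.norm(w)
    if theta < 1e-12:
        return np.eye(3)
    K = skew(w / theta)
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def ins_step(R, v, p, gyro, accel, dt):
    """One integration step.
    R: 3x3 body-to-world rotation; v, p: world-frame velocity and position.
    gyro: three angular rates (rad/s, body frame), i.e. the three gyroscopes.
    accel: three specific-force readings (m/s^2, body frame), i.e. the three accelerometers.
    """
    R = R @ expm_so3(gyro * dt)          # 1. integrate orientation from the gyros
    a_world = R @ accel + GRAVITY        # 2. an accelerometer at rest reads minus gravity,
                                         #    so add gravity back to get true acceleration
    v = v + a_world * dt                 # 3. integrate velocity...
    p = p + v * dt                       #    ...and position
    return R, v, p
```

Drop the gyros and you no longer know R, so you can’t separate gravity from true acceleration; drop the accelerometers and you can track orientation but not position.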
I’m not sure whether animals with compound eyes (like dragonflies) have multiple fovea, or if that’s just not a sensible question.
If it helps, back in my optical physics postdoc days, I spent a day or two compiling some fun facts and terrifying animal pictures into a quick tour of animal vision: https://sjbyrnes.com/AnimalVisionJournalClub2015.pdf
As the above image may make obvious, the lens focuses light onto a point. That point lands on the fovea. So I guess you’d need several lenses to concentrate light on several different fovea, which probably isn’t worth the hassle? I’m still confused as to the final details.
No, the lens focuses light into an extended image on the back of the eye. Different parts of the retina capture different parts of that extended image. Any one part of what you’re looking at (e.g. the corner of the table) at any particular moment sends out light that gets focused to one point (unless you have blurry vision), but the fleck of dirt on top of the table sends out light that gets focused to a slightly different point.
In theory, your whole retina could have rods and cones packed as densely as the fovea does. My guess is that there wouldn’t be enough benefit to compensate for the cost. The cost is not just extra rods and cones, but, more importantly, the brain real estate to analyze their output. A smaller area of densely packed rods and cones, plus saccades that move it around, is evidently good enough. (I think gemini’s answer is not great btw.)
Osmotic pressure seems weird
One way to think about it is, there are constantly water molecules bumping into the membrane from the left, and passing through to the right, and there are constantly water molecules bumping into the membrane from the right, and passing through to the left. Water will flow until those rates are equal. If the right side is saltier, then that reduces how often the water molecules on the right bump into the membrane, because that real estate is sometimes occupied by a salt ion. But if the pressure on the right is higher, that can compensate.
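In the dilute limit there’s a standard quantitative version of “higher pressure on the salty side can compensate”, namely the van ’t Hoff relation (back-of-envelope, ideal-solution assumptions):

```latex
\[
\Delta P \;=\; \Pi \;\approx\; \Delta c \, R\, T
\]
```

where Δc is the difference in dissolved-particle concentration across the membrane, R the gas constant, and T the temperature. For example, Δc ≈ 1 mol/L at room temperature gives Π ≈ 25 atm, roughly the pressure you’d need to push pure water through a membrane against seawater.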
“Procedural generation” can’t create useful design information from thin air. For example, Minecraft worlds are procedurally generated with a seed. If I have in mind some useful configuration of Minecraft stuff that takes 100 bits to specify, then I probably need to search through 2^100 different seeds on average, or thereabouts, before I find one with that specific configuration at a particular pre-specified coordinate.
The thing is: the map from seeds to outputs (Minecraft worlds) might be complicated, but it’s not complicated in a way that generates useful design information from thin air.
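Here’s a toy version of that seed-search claim, with a hash function standing in for Minecraft’s world generation and a much smaller target (the names and numbers are made up for illustration):

```python
import hashlib

# A "procedural generator" (here, just a hash of the seed) and a pre-specified
# n-bit target pattern. On average you need ~2^n seeds before one matches.
def generate(seed: int, n_bits: int) -> int:
    """Stand-in for 'the Minecraft world at a fixed coordinate': n pseudorandom bits from a seed."""
    digest = hashlib.sha256(str(seed).encode()).digest()
    return int.from_bytes(digest, "big") % (1 << n_bits)

def seeds_needed(target: int, n_bits: int) -> int:
    """Count how many seeds we try before the generator happens to produce the target."""
    seed = 0
    while generate(seed, n_bits) != target:
        seed += 1
    return seed + 1

n = 20                    # keep it small: 2^20 is about a million tries, 2^100 is hopeless
target = 0xABCDE          # an arbitrary pre-specified 20-bit "design"
print(seeds_needed(target, n))   # typically on the order of 2^20
```

The expected number of tries scales as 2^n, so a 100-bit design specification requires a ~2^100-seed search; the 100 bits of design information have to come from somewhere.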
By the same token, the map from DNA to folded proteins is rather complicated to simulate on a computer, but it’s not complicated in a way that generates useful design information from thin air. Random DNA creates random proteins. These random proteins fold in a hard-to-simulate way, as always, but the end-result configuration is useless. Thus, the design information all has to be in the DNA. The more specific you are about what such-and-such protein ought to do, the more possible DNA configurations you need to search through before you find one that encodes a protein with that property. The complexity of protein folding doesn’t change that—it just makes it so that the “right” DNA in the search space is obfuscated. You still need a big search space commensurate with the design specificity.
By contrast, here’s a kernel of truth adjacent to your comment: It is certainly possible for DNA to build a within-lifetime learning algorithm, and then for that within-lifetime learning algorithm to wind up (after months or years or decades) containing much more useful information than was in the DNA. By analogy, it’s very common for an ML source code repository to have much less information in its code, than the information that will eventually be stored in the weights of the trained model built by that code. (The latter can be in the terabytes.) Same idea.
Unlike protein folding, running a within-lifetime learning algorithm does generate new useful information. That’s the whole point of such algorithms.
Hmm, I’ll be more explicit.
(1) If the human has a complete and correct specification, then there isn’t any problem to solve.
(2) If the human gets to see and understand the AI’s plans before the AI executes them, then there also isn’t any problem to solve.
(3) If the human adds a specification, not because the human directly wants that specification to hold, in and of itself, but rather because that specification reflects what the human is expecting a solution to look like, then the human is closing off the possibility of out-of-the-box solutions. The whole point of out-of-the-box solutions is that they’re unexpected-in-advance.
(4) If the human adds multiple specifications that are (as far as the human can tell) redundant with each other, then no harm done, that’s just good conservative design.
(5) …And if the human then splits the specifications into Group A which are used by the AI for the design, and Group B which trigger shutdown when violated, and where each item in Group B appears redundant with the stuff in Group A, then that’s even better, as long as a shutdown event causes some institutional response, like maybe firing whoever was in charge of making the Group A specification and going back to the drawing board. Kinda like something I read in “Personal Observations on the Reliability of the Shuttle” (Richard Feynman 1986):
The software is checked very carefully in a bottom-up fashion. First, each new line of code is checked, then sections of code or modules with special functions are verified. The scope is increased step by step until the new changes are incorporated into a complete system and checked. This complete output is considered the final product, newly released. But completely independently there is an independent verification group, that takes an adversary attitude to the software development group, and tests and verifies the software as if it were a customer of the delivered product. There is additional verification in using the new programs in simulators, etc. A discovery of an error during verification testing is considered very serious, and its origin studied very carefully to avoid such mistakes in the future. Such unexpected errors have been found only about six times in all the programming and program changing (for new or altered payloads) that has been done. The principle that is followed is that all the verification is not an aspect of program safety, it is merely a test of that safety, in a non-catastrophic verification. Flight safety is to be judged solely on how well the programs do in the verification tests. A failure here generates considerable concern.
Re-reading the post, I think it’s mostly advocating for (5) (which is all good), but there’s also some suggestion of (3) (which would eat into the possibility of out-of-the-box solutions, although that might be a price worth paying).
FYI §14.4 of my post here is a vaguely similar genre although I don’t think there’s any direct overlap.
There’s a general problem that people will want AGIs to find clever out-of-the-box solutions to problems, and there’s no principled distinction between “finding a clever out-of-the-box solution to a problem” and “Goodharting the problem specification”. We call it “clever out-of-the-box solution” when we’re happy with how it turned out, and we call it “Goodharting” when we’re sad about how it turned out, but it’s not a structural difference. So systems that systematically block the second thing are inevitably gonna systematically block the first thing, and I claim that your proposal here is no exception. That’s an alignment tax, which might be fine (depending on the scenario) but should be kept in mind.
If you say e.g. "IQ exists", will other people classify you as a good guy, or as a bad guy?
That’s not a criticism of Harden’s book though, right? I think she’s trying (among other things) to make it more socially acceptable to say that IQ exists.
Maybe the dumber they are, the more kids they want to have.
Ah, good for them! Kids are wonderful! Let us celebrate life. Here’s a Bryan Caplan post for you.
What if the "least advantaged" e.g. dumb people actively want things that will hurt everyone (including the least advantaged people themselves, in long term)? …Or maybe the dumber they are, the more they want to make decisions about scientific research. Should the biologically privileged respect them as equals (and e.g. let themselves get outvoted democratically), or should they say no?
I think that people of all IQs vote against their interests. I’m not even sure that the sign of the correlation is what you think it is; for example, intellectuals were disproportionately supportive of communism back in the day, even while Stalin and Mao were killing tens of millions. I’m sure you can think of many more such examples, which I won’t list right here in order to avoid getting into politics fights.
The answer to questions like “what if [group] wants [stupid thing]” is that various groups have always been wanting stupid things. We should just keep fighting the good fight to try to push things in a good direction on the margin. For example, I think prediction market legalization and normalization would be excellent, as would widespread truth-seeking AI tools, and of course plain old-fashioned “advocating for causes you believe in”, etc. If some people in society are unusually wise, then let them apply their wisdom towards crafting very effective advocacy for good causes, or towards making money and funding good things, etc.
And this whole thing is moot anyway, because I would be very surprised if the genetic makeup of any country changes more than infinitesimally (via differential fertility) before we get superintelligent AGIs making all the important decisions in the world. The idea of humans making important government and business decisions in a post-ASI world is every bit as absurd as the idea of moody 7-year-olds making important government and business decisions in today’s world. Like, you’re talking about small putative population correlations between fertility and other things. If those correlations are real at all, and if they’re robust across time and future cultural and societal and technological shifts etc., (these are very big and dubious “ifs”!), then we’re still talking about dynamics that will play out over many generations. You really think nothing is going to happen in the next century or two that makes your extrapolations inapplicable? Not ASI? Not other technologies, e.g. related to medicine and neuroscience? Seems extremely unlikely to me. Think of how much has changed in the last 100 years, and the rate of change has only accelerated since then.
- A process or machine prepares either |0> or |1> at random, each with 50% probability. Another machine prepares either |+> or |-> based on a coin flip, where |+> = (|0> + |1>)/√2 and |-> = (|0> - |1>)/√2. In your ontology these are actually different machines that produce different states. In contrast, in the density matrix formulation these are alternative descriptions of the same machine. In any possible experiment, the two machines are identical. Exactly how much of a problem this is for believing in wavefunctions but not density matrices is debatable - "two things can look the same, big deal" vs "but experiments are the ultimate arbiters of truth; if experiment says they are the same thing then they must be, and the theory needs fixing."
I like “different machines that produce different states”. I would bring up an example where we replace the coin by a pseudorandom number generator with seed 93762. If the recipient of the photons happens to know that the seed is 93762, then she can put every photon into state |0> with no losses. If the recipient of the photons does not know that the random seed is 93762, then she has to treat the photons as unpolarized light, which cannot be polarized without 50% loss.
So for this machine, there’s no getting away from saying things like: “There’s a fact of the matter about what the state of each output photon is. And for any particular experiment, that fact-of-the-matter might or might not be known and acted upon. And if it isn’t known and acted upon, then we should start talking about probabilistic ensembles, and we may well want to use density matrices to make those calculations easier.”
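For concreteness, here’s the two-line calculation behind “in any possible experiment, the two machines are identical” when the seed isn’t known (a quick sketch):

```python
import numpy as np

# The 50/50 {|0>, |1>} ensemble and the 50/50 {|+>, |->} ensemble have the same
# density matrix, so no measurement statistics can distinguish the two machines,
# unless you have per-shot side information (like the pseudorandom seed above).
ket0 = np.array([1.0, 0.0])
ket1 = np.array([0.0, 1.0])
ketp = (ket0 + ket1) / np.sqrt(2)
ketm = (ket0 - ket1) / np.sqrt(2)

def proj(psi):
    """Projector |psi><psi| for a state vector."""
    return np.outer(psi, psi.conj())

rho_01 = 0.5 * proj(ket0) + 0.5 * proj(ket1)
rho_pm = 0.5 * proj(ketp) + 0.5 * proj(ketm)
print(np.allclose(rho_01, rho_pm))   # True: both are the maximally mixed state I/2
```

The moment the recipient knows the seed, she’s no longer averaging over an ensemble, and the per-photon pure-state description is the one that matters.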
I think it’s weird and unhelpful to say that the nature of the machine itself is dependent on who is measuring its output photons much later on, and how, right?