To complement that list, Superintelligence chapter 7 lists four types of “situations in which an agent can best fulfill its final goals by intentionally changing them” (which is pretty similar to your “creating other agents who care in different ways from that”):
- “social signaling” & “social preferences”—basically, maybe there are other powerful agents around who possess some mind-reading capability, including your (1c)
- “preferences concerning own goal content” (“for example, the agent might have a final goal to become the type of agent that is motivated by certain values rather than others (such as compassion rather than comfort)”)
- “storage [or processing] costs”, which we should probably broaden to ‘practical considerations about the algorithm actually working well in practice’, and then it would probably include your mathematician example and your (1a, 1b, 2, 4).
Your (3) would be kinda “maybe there was never a so-called ‘final goal’ in the first place”, which is a bit related to the second bullet point, or maybe we should just say that Bostrom overlooks it. (Or maybe he talks about it somewhere else in the book? I forget.)
I’d guess that the third bullet point is less likely to be applicable to powerful AGIs than to humans. For example, I expect that AGIs will be able to self-modify in ways that are difficult for humans (e.g. there’s no magic-bullet super-Adderall for humans), which impacts the likelihood of your (1a).
(Giving some answers without justification; feel free to follow up.)
What are your takes on Michael Levin's work on agency and "morphogenesis" in relation to your neuroscience ideas?
I haven’t found that work to be relevant or useful for what I’m doing.
Biology is full of cool things. It’s fun. I’ve been watching zoology videos in my free time. Can’t get enough. Not too work-relevant though, from my perspective.
What do you think about claims of hierarchical extension of these models?
I don’t think I’ve heard such claims, or if I did, I probably ignored them as probably-irrelevant-to-me.
I would be very curious to hear what you have to say about more multi-polar threat scenarios and extending theories of agency into the collective intelligence frame.
I don’t have any grand conceptual framework for that, and tend to rely on widespread common-sense concepts like “race-to-the-bottom” and “incentives” and “competition” and “employees who are or aren’t mission-aligned” and “miscommunication” and “social norms” and “offense-defense balance” and “bureaucracy” and “selection effects” and “coordination problems” and “externalities” and “stag hunt” and “hard power” and “parallelization of effort” and on and on. I think that this is a good general approach; I think that grand conceptual frameworks are not what I or anyone needs; and instead we just need to keep clarifying and growing and applying this collection of ideas and considerations and frames. (…But perhaps I just don’t know what I’m missing.)
I was on last year’s post (under “Social-instinct AGI”) but not this year’s. (…which is fine, you can’t cover everything!) But in case anyone’s wondering, I just posted an update of what I’m working on and why at: My AGI safety research—2024 review, ’25 plans.
I know very little, but there’s a fun fact here: “During their lifetimes, Darwin sent at least 7,591 letters and received 6,530; Einstein sent more than 14,500 and received more than 16,200.” (Not sure what fraction was technical vs personal.)
Also, this is a brief summary of Einstein’s mathematician friend Marcel Grossmann’s role in general relativity.
Hmm, yeah that too. What I had in mind was the idea that “consequentialist” usually has a connotation of “long-term consequentialist”, e.g. taking multiple actions over time that consistently lead to something happening.
For example:
- Instrumental convergence doesn’t bite very hard if your goals are 15 seconds in the future.
- If an AI acts to maximize long-term paperclips at 4:30pm, to minimize long-term paperclips at 4:31pm, to maximize them again at 4:32pm, to minimize them at 4:33pm, and so on, then we wouldn’t intuitively think of that AI as a consequentialist rational agent, even if it is technically a consequentialist rational agent at each moment.
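A toy simulation makes the 4:30pm / 4:31pm example concrete. It assumes a hypothetical world where, each minute, the agent can add or remove one paperclip, and its objective flips sign every minute. Each individual choice is a perfectly good greedy maximization of whatever objective the agent has at that moment, yet nothing accumulates over time:

```python
def run_flip_flop_agent(minutes: int, start: int = 0) -> list[int]:
    """Simulate an agent whose objective flips sign every minute.

    Hypothetical toy model: each minute the agent adds (+1) or removes (-1)
    one paperclip, picking whichever action best serves its *current* objective.
    """
    paperclips = start
    history = [paperclips]
    for minute in range(minutes):
        # Even minutes: maximize paperclips; odd minutes: minimize them.
        sign = 1 if minute % 2 == 0 else -1
        # Greedy choice: the action whose effect, times the current sign, is largest.
        action = max((+1, -1), key=lambda a: sign * a)
        paperclips += action
        history.append(paperclips)
    return history


print(run_flip_flop_agent(6))  # [0, 1, 0, 1, 0, 1, 0]
```

With a constant `sign` the count would climb steadily instead, which is the "long-term consequentialist" behavior the bullet has in mind.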
The standard dutch-book arguments seem like pretty good reason to be VNM-rational in the relevant sense.
I think that’s kinda circular reasoning, the way you’re using it in context:
If I have preferences exclusively about the state of the world in the distant future, then Dutch-book arguments indeed show that I should be VNM-rational. But if I don’t have such preferences, then someone could say “hey Steve, your behavior is Dutch-bookable”, and I am allowed to respond “OK, but I still want to behave that way”.
I put a silly example here:
For example, the first (Yudkowsky) post mentions a hypothetical person at a restaurant. When they have an onion pizza, they’ll happily pay $0.01 to trade it for a pineapple pizza. When they have a pineapple pizza, they’ll happily pay $0.01 to trade it for a mushroom pizza. When they have a mushroom pizza, they’ll happily pay $0.01 to trade it for an onion pizza. The person goes around and around, wasting their money in a self-defeating way (a.k.a. “getting money-pumped”).
That post describes the person as behaving sub-optimally. But if you read carefully, the author sneaks in a critical background assumption: the person in question has preferences about what pizza they wind up eating, and they’re making these decisions based on those preferences. But what if they don’t? What if the person has no preference whatsoever about pizza? What if instead they’re an asshole restaurant customer who derives pure joy from making the waiter run back and forth to the kitchen?! Then we can look at the same behavior, and we wouldn’t describe it as self-defeating “getting money-pumped”; instead, we would describe it as the skillful satisfaction of the person’s own preferences! They’re buying cheap entertainment! So that would be an example of preferences-not-concerning-future-states.
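Here is a minimal sketch of that point: one behavior, scored under two different (hypothetical) preference functions. It assumes the cycle closes onion → pineapple → mushroom → onion, a $0.01 fee per trade, and an illustrative `joy_per_trip` value for the waiter-teasing customer; none of these specifics come from either post.

```python
# Cyclic trading rule: current pizza -> pizza the customer will pay to switch to.
TRADE_TO = {"onion": "pineapple", "pineapple": "mushroom", "mushroom": "onion"}
FEE = 0.01  # dollars per trade


def simulate(start: str, trades: int) -> tuple[str, float, int]:
    """Return (final pizza, dollars spent, waiter trips) after `trades` trades."""
    pizza, spent, trips = start, 0.0, 0
    for _ in range(trades):
        pizza = TRADE_TO[pizza]
        spent += FEE
        trips += 1  # every trade sends the waiter back to the kitchen
    return pizza, round(spent, 2), trips


def pizza_lover_verdict(start: str, final: str, spent: float) -> str:
    """Preferences about which pizza gets eaten: same pizza, less money = a loss."""
    return "money-pumped" if final == start and spent > 0 else "fine"


def waiter_teaser_utility(trips: int, spent: float, joy_per_trip: float = 0.05) -> float:
    """Hypothetical preferences about waiter trips, not pizza: each trip beats its fee."""
    return trips * joy_per_trip - spent


final, spent, trips = simulate("onion", 3)
print(final, spent, trips)                         # onion 0.03 3
print(pizza_lover_verdict("onion", final, spent))  # money-pumped
print(waiter_teaser_utility(trips, spent) > 0)     # True
```

The trajectory is identical in both cases; only the preference function applied to it changes the verdict.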
(I’m assuming in this comment that the domain (input) of the VNM utility function is purely the state of the world in the distant future. If you don’t assume that, then saying that I should have a VNM utility function is true but trivial, and in particular doesn’t imply instrumental convergence. Again, more discussion here.)
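To see why letting the utility function range over whole histories makes the claim trivial, note that any behavior at all can be rationalized that way. A minimal sketch (function names are hypothetical, for illustration only): given any observed action sequence, build a utility over histories that scores exactly that sequence as 1 and everything else as 0; brute-force search then confirms the sequence is optimal.

```python
from itertools import product


def rationalizing_utility(observed):
    """Return a utility over whole action histories that makes `observed` optimal."""
    observed = tuple(observed)
    return lambda history: 1.0 if tuple(history) == observed else 0.0


def best_history(utility, actions, length):
    """Brute-force the utility-maximizing action history of the given length."""
    return max(product(actions, repeat=length), key=utility)


flip_flop = ("maximize", "minimize", "maximize", "minimize")
u = rationalizing_utility(flip_flop)
print(best_history(u, ("maximize", "minimize"), 4) == flip_flop)  # True
```

Nothing about such a utility function implies instrumental convergence; all the content has been moved into the choice of domain.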
(I agree that humans do in fact have preferences about the state of the world in the future, and that AGIs will too, and that this leads to instrumental convergence and is important, etc. I’m just saying that humans don’t exclusively have preferences about the state of the world in the future, and AGIs might be the same, and that this caveat is potentially important.)
I think everything becomes clearer if you replace “act somewhat like VNM-agents” with “care about what will happen in the future”, and if you replace “act exactly like VNM-agents” with “care exclusively about what will happen in the distant future”.
(Shameless plug for Consequentialism & corrigibility.)
e.g. scheming to prevent changes to their weights. Why?
Because they’re outputting text according to the Anthropic constitution & training, which (implicitly) imparts not only a preference that they be helpful, harmless, and honest right now, but also a preference that they remain so in the future. And if you care about things in the future, then instrumental convergence follows, at least in the absence of other “cares” (not about the future) that override it.
When we act more like paperclippers / expected utility maximizers – is this us converging on what any smart mind would converge on?
I think a smart mind needs to care about the future, because I think a mind with no (actual or behaviorally-implied) preferences whatsoever about the future would not be “smart”. I think this would be very obvious from just looking at it. It would be writhing around or whatever, looking obviously mechanical instead of intentional.
There’s a hypothesis that, if I care both about the state of the world in the distant future, and about acting virtuous (or about not manipulating people, or being docile and helpful, or whatever), then if I grew ever smarter and more reflective, then the former “care” would effectively squash the latter “care”. Not only that, but the same squashing would happen even if I start out caring only a little bit about the state of the world in the distant future, but caring a whole lot about acting virtuous / non-manipulative / docile / helpful / whatever. Doesn’t matter, the preferences about the future, weak as they are, would still squash everything else, according to this hypothesis.
I’ve historically been skeptical about this hypothesis. At least, I haven’t seen any great argument for it (see e.g. Deep Deceptiveness & my comments on it). I was talking to someone a couple days ago, and he suggested a different argument, something like: virtue and manipulation and docility are these kinda fuzzy and incoherent things if you think really hard about them, whereas the state of the world in the distant future is easier to pin down and harder to rationalize away, so the latter has a leg up upon strong reflection. But, I think I don’t buy that argument either.
What about humans? I have a hot take that people almost exclusively do things that are immediately rewarding. This might sound obviously wrong, but it’s more subtle than that, because our brains have sophisticated innate drives that can make e.g. “coming up with plausible long-term plans” and “executing those plans” feel immediately rewarding … in certain circumstances. I hope to write more about this soon. Thus, for example, in broader society, I think it’s pretty rare (and regarded with suspicion!) to earn money because it has option value, whereas it’s quite common to earn money as the last step of the plan (e.g. “I want to be rich”, implicitly because that comes with status and power which are innately rewarding in and of themselves), and it’s also quite common to execute a socially-approved course-of-action which incidentally involves earning money (e.g. “saving up money to buy a house”—the trick here is, there’s immediate social-approval rewards for coming up with the plan, and there’s immediate social-approval rewards for taking the first step towards executing the plan, etc.). I imagine you’ll be sympathetic to that kind of claim based on your CFAR work; I was reading somewhere that the whole idea of “agency”, like taking actions to accomplish goals, hasn’t occurred to some CFAR participants, or something like that? I forget where I was reading about that.
There may be several simple maths of “how to be a mind” that could each be a stable-ish role model for us, for a time.
For any possible concept (cf. “natural abstractions”, “latents in your world-model”), you can “want” that concept. Some concepts are about the state of the world in the distant future. Some are about other things, like following norms, or what kind of person I am, or being helpful, or whatever.
Famously, “wants” about the future state of the world are stable upon reflection. But I think lots of other “wants” are stable upon reflection too—maybe most of them. In particular, if I care about X, then I’m unlikely to self-modify to stop caring about X. Why? Because by and large, smart agents will self-modify because they planned to self-modify, and such a plan would score poorly under their current (not-yet-modified) preferences.
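A minimal sketch of that argument, under loudly simplifying assumptions: a hypothetical world model in which each goal gets achieved if and only if the future agent still cares about it, and a plan menu with just two options. The key step is that the self-modification plan is scored by the agent's *current* preferences, so dropping X can only lose value:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Preferences:
    weight_x: float  # how much the agent cares about X (e.g. being helpful)
    weight_y: float  # how much it cares about some other goal Y


def future_outcome(prefs: Preferences) -> dict[str, float]:
    """Hypothetical world model: a goal is achieved iff the future agent cares about it."""
    return {"x": 1.0 if prefs.weight_x > 0 else 0.0,
            "y": 1.0 if prefs.weight_y > 0 else 0.0}


def score(outcome: dict[str, float], judge: Preferences) -> float:
    """Evaluate an outcome using the judge's preferences."""
    return judge.weight_x * outcome["x"] + judge.weight_y * outcome["y"]


def choose_plan(current: Preferences) -> str:
    plans = {
        "keep current preferences": current,
        "self-modify to stop caring about X": Preferences(0.0, current.weight_y),
    }
    # Every plan, including self-modification, is judged by the *current* preferences.
    return max(plans, key=lambda name: score(future_outcome(plans[name]), current))


print(choose_plan(Preferences(weight_x=0.1, weight_y=0.9)))  # keep current preferences
```

Even a weak preference for X survives here because this toy model leaves out any competition between X and Y, which is where one "want" can squash another.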
(Of course, some “wants” have less innocuous consequences than they sound. For example, if I purely “want to be virtuous”, I might still make a ruthlessly consequentialist paperclip maximizer AGI, either by accident or because of how I define virtue or whatever.)
(Of course, if I “want” X and also “want” Y, it’s possible for X to squash Y upon reflection, and in particular ego-syntonic desires are generally well-positioned to squash ego-dystonic desires via a mechanism described here.)
There is a problem that, other things equal, agents that care about the state of the world in the distant future, to the exclusion of everything else, will outcompete agents that lack that property. This is self-evident, because we can operationalize “outcompete” as “have more effect on the state of the world in the distant future”. For example, as I wrote here, an AI that cares purely about future paperclips will create more future paperclips than an AI that has preferences about both future paperclips and “humans remaining in control”, other things equal. But, too bad, that’s just life, it’s the situation that we’re in. We can hope to avoid the “other things equal” clause, by somehow not making ruthlessly consequentialist AIs in the first place, or otherwise to limit the ability of ruthlessly consequentialist AIs to gather resources and influence. (Or we can make friendly ruthlessly consequentialist AGIs.)
Think of e.g. a charity which produces lots of internal discussion about reducing poverty, but frequently has effects entirely different from reducing poverty. The simulated society as a whole might be superintelligent, but its constituent simulated subagents are still pretty stupid (like humans), so their words decouple from effects (like humans' words).
I think you’re implying (deliberately or not) something overly pessimistic (in this narrow point).
Your example is of an intention for something complex and a-priori-implausible to happen (intervention to reduce poverty), but the intention doesn’t actualize. But then your second sentence suggests the reverse: something complex and a-priori-implausible but superficially non-random does happen without a related intention.
If something complex and a-priori-implausible but superficially non-random happens, then I think there must have been some kind of search or optimization process leading to it. It might be at learning time or it might be at inference time. It might be searching for that exact thing, or it might be searching for something downstream of it or correlated with it. But something. And thus there’s some hope to notice whatever that process is. If it’s at learning time, then we can try to avoid the bad incentives. If it’s at inference time, then there would be an in-principle-recognizable “intention” somewhere in the system, contrary to what you wrote.
(It’s also true that dangerous things can happen that are not a-priori-implausible and thus don’t require any search or optimization process, like killing everyone by producing pollution. That still seems like a more tractable problem than if we’re up against adversarial planning, e.g. treacherous turns.)
At an old job I worked on atomic interferometry R&D. We were developing atomic clocks and atomic accelerometers for practical applications. In that field, pretty much every advance is intelligently designed in advance using stereotypical quantum-mechanics analysis (bras and kets and Hamiltonians). For example, here are my former coworkers calculating small correction terms in the scale factor of atomic accelerometers: Analytical framework for dynamic light pulse atom interferometry at short interrogation times (Stoner et al., 2011). Everyone in the field does this type of analysis all the time, and this activity is invaluable for inventing, designing, debugging, and optimizing the instruments.
We don’t have a counterfactual of people trying to invent and design atomic clocks or atomic accelerometers at the modern performance state-of-the-art without knowing anything about quantum mechanics or atomic physics. Seems implausible, right? Well, realistically, if people were messing around in that area without knowing quantum mechanics and atomic physics, they would probably just wind up inventing large parts of quantum mechanics and atomic physics in the course of trying to understand their instruments.
As another example: our understanding of orbital mechanics preceded going to the moon, and I don’t think it’s plausible that people would have made it to the moon without already understanding orbital mechanics, and if people were trying to launch things into space without understanding orbital mechanics, realistically they would just wind up inventing orbital mechanics in the course of trying to solve their engineering problems.
I just wanted to add that this hypothesis, i.e.
I think I’m objecting to (as Eli wrote) “collapsing all [AI] research progress into a single "speed" and forecasting based on that”. There can be different types of AI R&D, and we might be able to speed up some types without speeding up other types.
…is parallel to what we see in other kinds of automation.
The technology of today has been much better at automating the production of clocks than the production of haircuts. Thus, 2024 technology is great at automating the production of some physical things but only slightly helpful for automating the production of some other physical things.
By the same token, different AI R&D projects are trying to “produce” different types of IP. Thus, it’s similarly possible that 2029 AI technology will be great at automating the production of some types of AI-related IP but only slightly helpful for automating the production of some other types of AI-related IP.
Yeah I’m definitely describing something as a binary when it’s really a spectrum. (I was oversimplifying since I didn’t think it mattered for that particular context.)
In the context of AI, I don’t know what the difference is (if any) between engineering and science. You’re right that I was off-base there…
…But I do think that there’s a spectrum from ingenuity / insight to grunt-work.
So I’m bringing up a possible scenario where near-future AI gets progressively less useful as you move towards the ingenuity side of that spectrum, and where changing that situation (i.e., automating ingenuity) itself requires a lot of ingenuity, posing a chicken-and-egg problem / bottleneck that limits the scope of rapid near-future recursive AI progress.
Paradigm shifts do happen, but I don't think we need them between here and AGI.
Perhaps! Time will tell :)
Are you therefore arguing that in particular, the research needed to get to AGI is of the kind that won't be sped up significantly? What's the argument -- that we need a new paradigm to get to AIs that can generate new paradigms, and being able to code really fast and well won't majorly help us think of new paradigms? (I'd disagree with both sub-claims of that claim)
Yup! Although I’d say I’m “bringing up a possibility” rather than “arguing” in this particular thread. And I guess it depends on where we draw the line between “majorly” and “minorly” :)
I think I’m objecting to (as Eli wrote) “collapsing all [AI] research progress into a single "speed" and forecasting based on that”. There can be different types of AI R&D, and we might be able to speed up some types without speeding up other types. For example, coming up with the AlphaGo paradigm (self-play, MCTS, ConvNets, etc.) or LLM paradigm (self-supervised pretraining, Transformers, etc.) is more foundational, whereas efficiently implementing and debugging a plan is less foundational. (Kinda “science vs engineering”?) I also sometimes use the example of Judea Pearl coming up with the belief prop algorithm in 1982. If everyone had tons of compute and automated research engineer assistants, would we have gotten belief prop earlier? I’m skeptical. As far as I understand: Belief prop was not waiting on compute. You can do belief prop on a 1960s mainframe. Heck, you can do belief prop on an abacus. Social scientists have been collecting data since the 1800s, and I imagine that belief prop would have been useful for analyzing at least some of that data, if only someone had invented it.
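To make the "belief prop on an abacus" point concrete, here is a minimal sketch of sum-product message passing on a three-node chain A → B → C of binary variables, with made-up probabilities. On a tree-structured model like this chain, message passing gives exact marginals, so the result can be checked against brute-force enumeration; either way, the whole computation is a handful of multiplications and additions.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical numbers for a 3-node chain A -> B -> C of binary variables.
P_A = {0: F(7, 10), 1: F(3, 10)}
P_B_given_A = {(0, 0): F(9, 10), (1, 0): F(1, 10),   # key: (b, a)
               (0, 1): F(2, 10), (1, 1): F(8, 10)}
P_C_given_B = {(0, 0): F(8, 10), (1, 0): F(2, 10),   # key: (c, b)
               (0, 1): F(3, 10), (1, 1): F(7, 10)}


def normalize(d):
    z = sum(d.values())
    return {k: v / z for k, v in d.items()}


def posterior_a_by_message_passing(c_obs: int):
    # Message from observed C up to B: likelihood of the evidence for each value of B.
    msg_c_to_b = {b: P_C_given_B[(c_obs, b)] for b in (0, 1)}
    # Message from B up to A: sum out B.
    msg_b_to_a = {a: sum(P_B_given_A[(b, a)] * msg_c_to_b[b] for b in (0, 1))
                  for a in (0, 1)}
    # Belief at A: prior times incoming message, then normalize.
    return normalize({a: P_A[a] * msg_b_to_a[a] for a in (0, 1)})


def posterior_a_by_brute_force(c_obs: int):
    joint = {a: F(0) for a in (0, 1)}
    for a, b in product((0, 1), repeat=2):
        joint[a] += P_A[a] * P_B_given_A[(b, a)] * P_C_given_B[(c_obs, b)]
    return normalize(joint)


print(posterior_a_by_message_passing(1))  # {0: Fraction(35, 71), 1: Fraction(36, 71)}
```

Exact `Fraction` arithmetic is used so the two methods can be compared with `==` rather than a floating-point tolerance.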
I’m not questioning whether o3 is a big advance over previous models—it obviously is! I was trying to address some suggestions / vibe in the air (example) that o3 is strong evidence that the singularity is nigh, not just that there is rapid ongoing AI progress. In that context, I haven’t seen people bringing up SWE-bench as much as those other three that I mentioned, although it’s possible I missed it. Mostly I see people bringing up SWE-bench in the context of software jobs.
I was figuring that the SWE-bench tasks don’t seem particularly hard, intuitively. E.g. 90% of SWE-bench verified problems are “estimated to take less than an hour for an experienced software engineer to complete”. And a lot more people have the chops to become an “experienced software engineer” than to become able to solve FrontierMath problems or get in the top 200 in the world on Codeforces. So the latter sound extra impressive, and that’s what I was responding to.
RE-bench tasks (see page 7 here) are not the kind of AI research where you’re developing new AI paradigms and concepts. The tasks are much more straightforward than that. So your argument is basically assuming without argument that we can get to AGI with just the more straightforward stuff, as opposed to new AI paradigms and concepts.
If we do need new AI paradigms and concepts to get to AGI, then there would be a chicken-and-egg problem in automating AI research. Or more specifically, there would be two categories of AI R&D, with the less important R&D category (e.g. performance optimization and other REbench-type tasks) being automatable by near-future AIs, and the more important R&D category (developing new AI paradigms and concepts) not being automatable.
(Obviously you’re entitled to argue / believe that we don’t need new AI paradigms and concepts to get to AGI! It’s a topic where I think reasonable people disagree. I’m just suggesting that it’s a necessary assumption for your argument to hang together, right?)
Just trying to follow along… here’s where I’m at with a bear case that we haven’t seen evidence that o3 is an immediate harbinger of real transformative AGI:
- Codeforces is based in part on wall clock time. And we all already knew that, if AI can do something at all, it can probably do it much faster than humans. So it’s a valid comparison to previous models but not straightforwardly a comparison to top human coders.
- FrontierMath is 25% tier 1 (least hard), 50% tier 2, 25% tier 3 (most hard). Terence Tao’s quote about the problems being hard was only about tier 3. Tier 1 is IMO/Putnam level maybe. Also, even some of the tier 2 problems allegedly rely on straightforward application of specialized knowledge, rather than cleverness, such that a mathematician could “immediately” know how to do it (see this tweet). Even many IMO/Putnam problems are minor variations on a problem that someone somewhere has written down and is thus in the training data. So o3’s 25.2% result doesn’t really prove much in terms of a comparison to human mathematicians, although again it’s clearly an advance over previous models.
- ARC-AGI: we already knew that many of the ARC-AGI questions are solvable by enumerating lots and lots of hypotheses and checking them (“crude program enumeration”), and the number of tokens that o3 used to solve the problems (55k per solution?) suggests that o3 is still doing that to a significant extent. Now, in terms of comparing to humans, I grant that there’s some fungibility between insight (coming up with promising hypotheses) and brute force (enumerating lots of hypotheses and checking them). Deep Blue beat Kasparov at chess by evaluating 200 million positions per second. You can call it cheating, but Deep Blue still won the game. If future AGI similarly beats humans at novel science and technology and getting around the open-ended real world etc. via less insight and more brute force, then we humans can congratulate ourselves for our still-unmatched insight, but who cares, AGI is still beating us at novel science and technology etc. On the other hand, brute force worked for chess but didn’t work in Go, because the combinatorial explosion of possibilities blows up faster in Go than in chess. Plausibly the combinatorial explosion of possible ideas and concepts and understanding in the open-ended real world blows up even faster yet. ARC-AGI is a pretty constrained universe; intuitively, it seems more on the chess end of the chess-Go spectrum, such that brute force hypothesis-enumeration evidently works reasonably well in ARC-AGI. But (on this theory) that approach wouldn’t generalize to capability at novel science and technology and getting around in the open-ended real world etc.
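For readers unfamiliar with "crude program enumeration", here is a minimal sketch of the idea on a toy grid puzzle. The four primitives are a hypothetical mini-DSL (real ARC solvers use much larger ones), and nothing here is a claim about how o3 works internally. The enumerator tries every composition of primitives, shortest first, and returns the first one consistent with all the training pairs. The number of candidate programs at depth d is (number of primitives)^d, the same kind of combinatorial explosion as in the chess-vs-Go point.

```python
from itertools import product

Grid = tuple[tuple[int, ...], ...]


# Hypothetical primitive transforms.
def flip_h(g: Grid) -> Grid:
    return tuple(row[::-1] for row in g)


def flip_v(g: Grid) -> Grid:
    return g[::-1]


def transpose(g: Grid) -> Grid:
    return tuple(zip(*g))


def recolor_1_to_2(g: Grid) -> Grid:
    return tuple(tuple(2 if x == 1 else x for x in row) for row in g)


PRIMITIVES = [flip_h, flip_v, transpose, recolor_1_to_2]


def run(program, g: Grid) -> Grid:
    for f in program:
        g = f(g)
    return g


def enumerate_programs(train_pairs, max_depth: int):
    """Crude program enumeration: try every composition, shortest first."""
    checked = 0
    for depth in range(1, max_depth + 1):
        for program in product(PRIMITIVES, repeat=depth):
            checked += 1
            if all(run(program, x) == y for x, y in train_pairs):
                return [f.__name__ for f in program], checked
    return None, checked


x = ((1, 0, 0), (1, 1, 0))
y = ((2, 2), (2, 0), (0, 0))  # x rotated 90° clockwise, with color 1 recolored to 2
print(enumerate_programs([(x, y)], max_depth=3))
# → (['flip_v', 'transpose', 'recolor_1_to_2'], 48)
```

Note that the search found a valid program without any notion of "rotation"; it just checked 48 candidates. With a few dozen primitives and deeper programs, the same loop quickly becomes infeasible.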
(I really haven’t been paying close attention and I’m open to correction.)
(COI: I’m a lesswrong power-user)
Instead of the hands-on experimentation I expected, what I see is a culture heavily focused on long-form theoretical posts.
FWIW if you personally want to see more of those you can adjust the frontpage settings to boost the posts with a "practical" tag. Or for a dedicated list: https://www.lesswrong.com/tag/practical?sortedBy=magic. I agree that such posts are currently a pretty small fraction of the total, for better or worse. But maybe the absolute number is a more important metric than the fraction?
I’ve written a few “practical” posts on LW, and I generally get very useful comments on them.
Consensus-Building Tools
I think these mostly have yet to be invented.
Consensus can be SUPER HARD. In my AGI safety work, on ~4 occasions I’ve tried to reconcile my beliefs with someone else, where it wound up being the main thing I was doing for about an entire month, just to get to the point where I could clearly articulate what the other person believed and why I disagreed with it. As for actually reaching consensus with the other person, I gave up before getting that far! (See e.g. here, here, here)
I don't really know what would make that kind of thing easier but I hope someone figures it out!
“It is more valuable to provide accurate forecasts than add new, relevant, carefully-written considerations to an argument”
On what margin? In what context? I hope we can all think of examples where one thing is valuable, and where the other thing is valuable. If Einstein predicted the Eddington experiment results but didn’t explain the model underlying his prediction, I don’t think anyone would have gotten much out of it, and really probably nobody would have bothered doing the Eddington experiment in the first place.
Manifold, Polymarket, etc. already exist, and I'm very happy they do!! I think lesswrong is filling a different niche, and that's fine.
High status could be tied to demonstrated good judgment through special user flair for accurate forecasters or annual prediction competitions.
As for reputation, I think the idea is that you should judge a comment or post by its content and not by the karma of the person who wrote it. Comment karma and post karma are on display, but by contrast user karma is hidden behind a hover or click. That seems good to me. I myself write posts and comments of widely varying quality, and other people sure do too.
An important part of learning is feeling free and safe to be an amateur messing around with half-baked ideas in a new area—overly-central “status” systems can sometimes discourage that kind of thing, which is bad. (Cf. academia.)
(I think there’s a mild anticorrelation between my own posts’ karma and how objectively good and important they are, see here, so it’s a good thing that I don’t care too much about karma!) (Of course the anticorrelation doesn’t mean high karma is bad, rather it’s from conditioning on a collider.)
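One simple way a collider produces that kind of anticorrelation (Berkson's paradox), sketched under made-up assumptions: each post idea has an independent "importance" and "appeal", karma tracks appeal, and a post only gets written (the collider) if importance + appeal clears some threshold. Across all ideas, importance and karma are uncorrelated, but among the posts that actually exist they come out negatively correlated. This is a hypothetical model, not a claim about how karma actually works.

```python
import random
import statistics


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def simulate_posts(n: int = 20_000, threshold: float = 1.5, seed: int = 0) -> tuple[float, float]:
    """Return (correlation across all ideas, correlation among written posts)."""
    rng = random.Random(seed)
    importance = [rng.gauss(0, 1) for _ in range(n)]
    appeal = [rng.gauss(0, 1) for _ in range(n)]  # independent of importance
    karma = [a + 0.3 * rng.gauss(0, 1) for a in appeal]  # karma tracks appeal, plus noise
    # The collider: a post gets written only if importance + appeal is high enough.
    written = [i for i in range(n) if importance[i] + appeal[i] > threshold]
    overall = pearson(importance, karma)
    among_written = pearson([importance[i] for i in written], [karma[i] for i in written])
    return overall, among_written


overall, among_written = simulate_posts()
print(f"all ideas: {overall:+.2f}, written posts: {among_written:+.2f}")
```

Raising `threshold` makes the selection stronger and the anticorrelation among written posts larger in magnitude, even though high karma never "causes" low importance.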
For long-time power users like me, I can benefit from the best possible “reputation system”, which is actually knowing most of the commenters. That’s great because I don’t just know them as "good" or "bad", but rather "coming from such-and-such perspective" or "everything they say sounds nuts, and usually is, but sometimes they have some extraordinary insight, and I should especially be open-minded to anything they say in such-and-such domain".
If there were a lesswrong prediction competition, I expect that I probably wouldn’t participate because it would be too time-consuming. There are some people where I would like them to take my ideas seriously, but such people EITHER (1) already take my ideas seriously (e.g. people really into AGI safety) OR (2) would not care whether or not I have a strong forecasting track-record (e.g. Yann LeCun).
There’s also a question about cross-domain transferability of good takes. If we want discourse about near-term geopolitical forecasting, then of course we should platform people with a strong track record of near-term geopolitical forecasting. And if we want discourse about the next ML innovation, then we should platform people with a strong track record of coming up with ML innovations. I’m most interested in neither of those, but rather AGI / ASI, which doesn’t exist yet. Empirically, in my opinion, “ability to come up with ML innovations” transfers quite poorly to “ability to have reasonable expectations about AGI / ASI”. I’m thinking of Yann LeCun for example. What about near-term geopolitical forecasting? Does that transfer? Time will tell—mostly when it’s already too late. At the very least, there are skilled forecasters who strongly disagree with each other about AGI / ASI, so at least some of them are wrong.
(If someone in 1400 AD were quite good at predicting the next coup or war or famine, I wouldn’t expect them to be particularly good at predicting how the industrial revolution would go down. Right? And I think AGI / ASI is kinda like the latter.)
So anyway, probably best to say that we can’t predict a priori who is going to have good takes on AGI, just based on track-record in some different domain. So that’s yet another reason to not have a super central and visible personal reputation system, IMO.
The main insight of the post (as I understand it) is this:
- In the context of a discussion of whether we should be worried about AGI x-risk, someone might say “LLMs don't seem like they're trying hard to autonomously accomplish long-horizon goals—hooray, why were people so worried about AGI risk?”
- In the context of a discussion among tech people and VCs about how we haven't yet made an AGI that can found and run companies as well as Jeff Bezos, someone might say “LLMs don't seem like they're trying hard to autonomously accomplish long-horizon goals—alas, let's try to fix that problem.”
One sounds good and the other sounds bad, but there’s a duality connecting them. They’re the same observation. You can’t get one without the other.
This is an important insight because it helps us recognize the fact that people are trying to solve the second-bullet-point problem (and making nonzero progress), and to the extent that they succeed, they’ll make things worse from the perspective of the people in the first bullet point.
This insight is not remotely novel! (And OP doesn’t claim otherwise.) …But that’s fine, nothing wrong with saying things that many readers will find obvious.
(This “duality” thing is a useful formula! Another related example that I often bring up is the duality between positive-coded “the AI is able to come up with out-of-the-box solutions to problems” versus the negative-coded “the AI sometimes engages in reward hacking”. I think another duality connects positive-coded “it avoids catastrophic forgetting” to negative-coded “it’s hard to train away scheming”, at least in certain scenarios.)
(…and as comedian Mitch Hedberg sagely noted, there’s a duality between positive-coded “cheese shredder” and negative-coded “sponge ruiner”.)
The post also chats about two other (equally “obvious”) topics:
- Instrumental convergence: “the AI seems like it's trying hard to autonomously accomplish long-horizon goals” involves the AI routing around obstacles, and one might expect that to generalize to “obstacles” like programmers trying to shut it down
- Goal (mis)generalization: If “the AI seems like it's trying hard to autonomously accomplish long-horizon goal X”, then the AI might actually “want” some different Y which partly overlaps with X, or is downstream from X, etc.
But the question on everyone’s mind is: Are we doomed?
In and of itself, nothing in this post proves that we’re doomed. I don’t think OP ever explicitly claimed it did? In my opinion, there’s nothing in this post that should constitute an update for the many readers who are already familiar with instrumental convergence, and goal misgeneralization, and the fact that people are trying to build autonomous agents. But OP at least gives a vibe of being an argument for doom going beyond those things, which I think was confusing people in the comments.
Why aren’t we necessarily doomed? Now this is my opinion, not OP’s, but here are three pretty-well-known outs (at least in principle):
- The AI can “want” to autonomously accomplish a long-horizon goal, but also simultaneously “want” to act with integrity, helpfulness, etc. Just like it’s possible for humans to do. And if the latter “want” is strong enough, it can outvote the former “want” in cases where they conflict. See my post Consequentialism & corrigibility.
- The AI can behaviorist-“want” to autonomously accomplish a long-horizon goal, but where the “want” is internally built in such a way as to not generalize OOD to make treacherous turns seem good to the AI. See e.g. my post Thoughts on “Process-Based Supervision”, which is skeptical about the practicalities, but I think the idea is sound in principle.
- We can in principle simply avoid building AIs that autonomously accomplish long-horizon goals, notwithstanding the economic and other pressures—for example, by keeping humans in the loop (e.g. oracle AIs). This one came up multiple times in the comments section.
There are plenty of challenges in these approaches, and interesting discussions to be had, but the post doesn’t engage with any of these topics.
Anyway, I’m voting strongly against including this post in the 2023 review. It’s not crisp about what it’s arguing for and against (and many commenters seem to have gotten the wrong idea about what it’s arguing for), it’s saying obvious things in a meandering way, and it’s not refuting or even mentioning any of the real counterarguments / reasons for hope. It’s not “best of” material.
How is it that some tiny number of man made mirror life forms would be such a threat to the millions of naturally occurring life forms, but those millions of naturally occurring life forms would not be an absolutely overwhelming symmetrical threat to those few man made mirror forms?
Can’t you ask the same question for any invasive species? Yet invasive species exist. “How is it that some people putting a few Nile perch into Lake Victoria in the 1950s would cause ‘the extinction or near-extinction of several hundred native species’, but the native species of Lake Victoria would not be an absolutely overwhelming symmetrical threat to those Nile perch?”
If I'm not mistaking, you've already changed the wording
No, I haven’t changed anything in this post since Dec 11, three days before your first comment.
valid EA response … EA forum … EA principles …
This isn’t the EA Forum. Also, you shouldn’t equate “EA” with “concerned about AGI extinction”. There are plenty of self-described EAs who think that AGI extinction is astronomically unlikely and a pointless thing to worry about. (And also plenty of self-described EAs who think the opposite.)
prevent spam/limit stupid comments without causing distracting emotions
If Hypothetical Person X tends to write what you call “stupid comments”, and if they want to be participating on Website Y, and if Website Y wants to prevent Hypothetical Person X from doing that, then there’s an irreconcilable conflict here, and it seems almost inevitable that Hypothetical Person X is going to wind up feeling annoyed by this interaction. Like, Website Y can do things on the margin to make the transaction less unpleasant, but it’s surely going to be somewhat unpleasant under the best of circumstances.
(Pick any popular forum on the internet, and I bet that either (1) there’s no moderation process and thus there’s a ton of crap, or (2) there is a moderation process, and many of the people who get warned or blocked by that process are loudly and angrily complaining about how terrible and unjust and cruel and unpleasant the process was.)
Anyway, I don’t know why you’re saying that here-in-particular. I’m not a moderator, I have no special knowledge about running forums, and it’s way off-topic. (But if it helps, here’s a popular-on-this-site post related to this topic.)
[EDIT: reworded this part a bit.]
what would be a valid EA response to the arguments coming from people fitting these bullets:
- Some are over-optimistic based on mistaken assumptions about the behavior of humans;
- Some are over-optimistic based on mistaken assumptions about the behavior of human institutions;
That’s off-topic for this post so I’m probably not going to chat about it, but see this other comment too.
I think of myself as having high ability and willingness to respond to detailed object-level AGI-optimist arguments, for example:
- Response to Dileep George: AGI safety warrants planning ahead
- Response to Blake Richards: AGI, generality, alignment, & loss functions
- Thoughts on “AI is easy to control” by Pope & Belrose
- LeCun’s “A Path Towards Autonomous Machine Intelligence” has an unsolved technical alignment problem
- Munk AI debate: confusions and possible cruxes
…and more.
I don’t think this OP involves “picturing AI optimists as stubborn simpletons not being able to get persuaded finally that AI is a terrible existential risk”. (I do think AGI optimists are wrong, but that’s different!) At least, I didn’t intend to do that. I can potentially edit the post if you help me understand how you think I’m implying that, and/or you can suggest concrete wording changes etc.; I’m open-minded.
Yeah, the word “consummatory” isn’t great in general (see here), maybe I shouldn’t have used it. But I do think walking is an “innate behavior”, just as sneezing and laughing and flinching and swallowing are. E.g. decorticate rats can walk. As for human babies, they’re decorticate-ish in effect for the first months but still have a “walking / stepping reflex” from day 1 I think.
There can be an innate behavior, but also voluntary cortex control over when and whether it starts—those aren’t contradictory, IMO. This is always true to some extent—e.g. I can voluntarily suppress a sneeze. Intuitively, yeah, I do feel like I have more voluntary control over walking than I do over sneezing or vomiting. (Swallowing is maybe the same category as walking?) I still want to say that all these “innate behaviors” (including walking) are orchestrated by the hypothalamus and brainstem, but that there’s also voluntary control coming via cortex→hypothalamus and/or cortex→brainstem motor-type output channels.
I’m just chatting about my general beliefs. :) I don’t know much about walking in particular, and I haven’t read that particular paper (paywall & I don’t have easy access).
Oh I forgot, you’re one of the people who seems to think that the only conceivable reason that anyone would ever talk about AGI x-risk is because they are trying to argue in favor of, or against, whatever AI government regulation was most recently in the news. (Your comment was one of the examples that I mockingly linked in the intro here.)
If I think AGI x-risk is >>10%, and you think AGI x-risk is 1-in-a-gazillion, then it seems self-evident to me that we should be hashing out that giant disagreement first; and discussing what if any government regulations would be appropriate in light of AGI x-risk second. We’re obviously not going to make progress on the latter debate if our views are so wildly far apart on the former debate!! Right?
So that’s why I think you’re making a mistake whenever you redirect arguments about the general nature & magnitude & existence of the AGI x-risk problem into arguments about certain specific government policies that you evidently feel very strongly about.
(If it makes you feel any better, I have always been mildly opposed to the six month pause plan.)
I’ve long had a tentative rule-of-thumb that:
- medial hypothalamus neuron groups are mostly “tracking a state variable”;
- lateral hypothalamus neuron groups are mostly “turning on a behavior” (especially a “consummatory behavior”).
(…apart from the mammillary areas way at the posterior end of the hypothalamus. They’re their own thing.)
State variables are things like hunger, temperature, immune system status, fertility, horniness, etc.
I don’t have a great proof of that, just some indirect suggestive evidence. (Orexin, contiguity between lateral hypothalamus and PAG, various specific examples of people studying particular hypothalamus neurons.) Anyway, it’s hard to prove directly because changing a state variable can lead to taking immediate actions. And it’s really just a rule of thumb; I’m sure there are exceptions, and it’s not really a bright-line distinction anyway.
The literature on the lateral hypothalamus is pretty bad. The main problem IIUC is that LH is “reticular”, i.e. when you look at it under the microscope you just see a giant mess of undifferentiated cells. That appearance is probably deceptive—appropriate stains can reveal nice little nuclei hiding inside the otherwise-undifferentiated mess. But I think only one or a few such hidden nuclei are known (the example I’m familiar with is “parvafox”).
Yup! I think discourse with you would probably be better focused on the 2nd or 3rd or 4th bullet points in the OP—i.e., not “we should expect such-and-such algorithm to do X”, but rather “we should expect people / institutions / competitive dynamics to do X”.
I suppose we can still come up with “demos” related to the latter, but it’s a different sort of “demo” than the algorithmic demos I was talking about in this post. As some examples:
- Here is a “demo” that a leader of a large active AGI project can declare that he has a solution to the alignment problem, specific to his technical approach, but where the plan doesn’t stand up to a moment’s scrutiny.
- Here is a “demo” that a different AGI project leader can declare that even trying to solve the alignment problem is already overkill, because misalignment is absurd and AGIs will just be nice, again for reasons that don’t stand up to a moment’s scrutiny.
- (And here’s a “demo” that at least one powerful tech company executive might be fine with AGI wiping out humanity anyway.)
- Here is a “demo” that if you give random people access to an AI, one of them might ask it to destroy humanity, just to see what would happen. Granted, I think this person had justified confidence that this particular AI would fail to destroy humanity …
- … but here is a “demo” that people will in fact do experiments that threaten the whole world, even despite a long track record of rock-solid statistical evidence that the exact thing they’re doing is indeed a threat to the whole world, far out of proportion to its benefit, and that governments won’t stop them, and indeed that governments might even fund them.
- Here is a “demo” that, given a tradeoff between AI transparency (English-language chain-of-thought) and AI capability (inscrutable chain-of-thought but the results are better), many people will choose the latter, and pat themselves on the back for a job well done.
- Every week we get more “demos” that, if next-token prediction is insufficient to make a powerful autonomous AI agent that can successfully pursue long-term goals via out-of-the-box strategies, then many people will say “well so much the worse for next-token prediction”, and they’ll try to figure some other approach that is sufficient for that.
- Here is a “demo” that companies are capable of ignoring or suppressing potential future problems when they would interfere with immediate profits.
- Here is a “demo” that it’s possible for there to be a global catastrophe causing millions of deaths and trillions of dollars of damage, and then immediately afterwards everyone goes back to not even taking trivial measures to prevent similar or worse catastrophes from recurring.
- Here is a “demo” that the arrival of highly competent agents with the capacity to invent technology and to self-reproduce is a big friggin’ deal.
- Here is a “demo” that even small numbers of such highly competent agents can maneuver their way into dictatorial control over a much much larger population of humans.
I could go on and on. I’m not sure your exact views, so it’s quite possible that none of these are crux-y for you, and your crux lies elsewhere. :)
Thanks!
I feel like the actual crux between you and OP is with the claim in post #2 that the brain operates outside the neuron doctrine to a significant extent.
I don’t think that’s quite right. Neuron doctrine is pretty specific IIUC. I want to say: when the brain does systematic things, it’s because the brain is running a legible algorithm that relates to those things. And then there’s a legible explanation of how biochemistry is running that algorithm. But the latter doesn’t need to be neuron-doctrine. It can involve dendritic spikes and gene expression and astrocytes etc.
All the examples here are real and important, and would impact the algorithms of an “adequate” WBE, but are mostly not “neuron doctrine”, IIUC.
Basically, it’s the thing I wrote a long time ago here: “If some [part of] the brain is doing something useful, then it's humanly feasible to understand what that thing is and why it's useful, and to write our own CPU code that does the same useful thing.” And I think “doing something useful” includes as a special case everything that makes me me.
I don't get what you mean when you say stuff like "would be conscious (to the extent that I am), and it would be my consciousness (to a similar extent that I am)," since afaik you don't actually believe that there is a fact of the matter as to the answers to these questions…
Just, it’s a can of worms that I’m trying not to get into right here. I don’t have a super well-formed opinion, and I have a hunch that the question of whether consciousness is a coherent thing is itself a (meta-level) incoherent question (because of the (A) versus (B) thing here). Yeah, just didn’t want to get into it, and I haven’t thought too hard about it anyway. :)
Right, what I actually think is that a future brain scan with future understanding could enable a WBE to run on a reasonable-sized supercomputer (e.g. <100 GPUs), and it would be capturing what makes me me, and would be conscious (to the extent that I am), and it would be my consciousness (to a similar extent that I am), but it wouldn’t be able to reproduce my exact train of thought in perpetuity, because it would be able to reproduce neither the input data nor the random noise of my physical brain. I believe that OP’s objection to “practical CF” is centered around the fact that you need an astronomically large supercomputer to reproduce the random noise, and I don’t think that’s relevant. I agree that “abstraction adequacy” would be a step in the right direction.
Causal closure is just way too strict. And it’s not just because of random noise. For example, suppose that there’s a tiny amount of crosstalk between my neurons that represent the concept “banana” and my neurons that represent the concept “Red Army”, just by random chance. And once every 5 years or so, I’m thinking about bananas, and then a few seconds later, the idea of the Red Army pops into my head, and if not for this cross-talk, it counterfactually wouldn’t have popped into my head. And suppose that I have no idea of this fact, and it has no impact on my life. This overlap just exists by random chance, not part of some systematic learning algorithm. If I got magical brain surgery tomorrow that eliminated that specific cross-talk, and didn’t change anything else, then I would obviously still be “me”, even despite the fact that maybe some afternoon 3 years from now I would fail to think about the Red Army when I otherwise might. This cross-talk is not randomness, and it does undermine “causal closure” interpreted literally. But I would still say that “abstraction adequacy” would be achieved by an abstraction of my brain that captured everything except this particular instance of cross-talk.
Yeah duh I know you’re not talking about MCMC. :) But MCMC is a simpler example to ensure that we’re on the same page on the general topic of how randomness can be involved in algorithms. Are we 100% on the same page about the role of randomness in MCMC? Is everything I said about MCMC super duper obvious from your perspective? If not, then I think we’re not yet ready to move on to the far-less-conceptually-straightforward topic of brains and consciousness.
I’m trying to get at what you mean by:
But imagine instead that (for sake of argument) it turned out that high-resolution details of temperature fluctuations throughout the brain had a causal effect on the execution of the algorithm such that the algorithm doesn't do what it's meant to do if you just take the average of those fluctuations.
I don’t understand what you mean here. For example:
- If I run MCMC with a PRNG given random seed 1, it outputs 7.98 ± 0.03. If I use a random seed of 2, then the MCMC spits out a final answer of 8.01 ± 0.03. My question is: does the random seed entering MCMC “have a causal effect on the execution of the algorithm”, in whatever sense you mean by the phrase “have a causal effect on the execution of the algorithm”?
- My MCMC code uses a PRNG that returns random floats between 0 and 1. If I replace that PRNG with `return 0.5`, i.e. the average of the 0-to-1 interval, then the MCMC now returns a wildly-wrong answer of 942. Is that replacement the kind of thing you have in mind when you say “just take the average of those fluctuations”? If so, how do you reconcile the fact that “just take the average of those fluctuations” gives the wrong answer, with your description of that scenario as “what it’s meant to do”? Or if not, then what would “just take the average of those fluctuations” mean in this MCMC context?
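To make the two bullet points concrete, here is a minimal sketch, assuming a toy random-walk Metropolis sampler (the helper `metropolis_mean` and its Normal(8, 1) target are hypothetical, chosen only to echo the numbers above; this is not the actual MCMC code under discussion). The only randomness enters through the `uniform` argument, so different seeds give slightly different but sensible answers, while swapping in a constant 0.5 breaks the algorithm. In this particular sketch the failure shows up as the chain freezing at its starting point rather than returning 942.

```python
import math
import random
from typing import Callable


def metropolis_mean(uniform: Callable[[], float], n_steps: int = 20_000,
                    burn_in: int = 1_000, x0: float = 0.0, step: float = 1.0) -> float:
    """Estimate the mean of a toy Normal(8, 1) target with random-walk Metropolis.

    `uniform` is the algorithm's only source of randomness and should return
    floats in [0, 1), e.g. the .random method of a seeded random.Random.
    """
    def log_target(x: float) -> float:
        # Unnormalized log-density of Normal(mean=8, sd=1).
        return -0.5 * (x - 8.0) ** 2

    x, total = x0, 0.0
    for i in range(burn_in + n_steps):
        # Symmetric proposal: a uniform jump in [-step, +step).
        proposal = x + step * (2.0 * uniform() - 1.0)
        log_ratio = log_target(proposal) - log_target(x)
        # Metropolis rule: accept with probability min(1, exp(log_ratio)).
        if log_ratio >= 0 or uniform() < math.exp(log_ratio):
            x = proposal
        if i >= burn_in:
            total += x
    return total / n_steps


print(metropolis_mean(random.Random(1).random))  # close to 8
print(metropolis_mean(random.Random(2).random))  # close to 8, slightly different
print(metropolis_mean(lambda: 0.5))              # 0.0: the chain never leaves x0
```

With a constant 0.5, every proposed jump is zero, so every "move" is accepted and the chain sits at `x0` forever: replacing the fluctuations by their average removes exactly the exploration the algorithm depends on.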
I’m confused by your comment. Let’s keep talking about MCMC.
- The following is true: The random inputs to MCMC have “a causal effect on the execution of the algorithm such that the algorithm doesn't do what it's meant to do if you just take the average of those fluctuations”.
- For example, let’s say the MCMC accepts a million inputs in the range (0,1), typically generated by a PRNG in practice. If you replace the PRNG by the function `return 0.5` (“just take the average of those fluctuations”), then the MCMC will definitely fail to give the right answer.
- The following is false: “the signals entering…are systematic rather than random”. The random inputs to MCMC are definitely expected and required to be random, not systematic. If the PRNG has systematic patterns, it screws up the algorithm—I believe this happens from time to time, and people doing Monte Carlo simulations need to be quite paranoid about using an appropriate PRNG. Even very subtle long-range patterns in the PRNG output can screw up the calculation.
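As an exaggerated illustration of how a systematic pattern in the random stream can wreck a Monte Carlo calculation, here is a minimal sketch using plain Monte Carlo (not MCMC) to estimate pi. The "stuttering" generator is hypothetical and deliberately flawed: each value it emits is individually uniform on [0, 1), but it emits every value twice in a row, so each sampled point lands on the diagonal x = y. Real PRNG defects are usually far subtler, but the failure mode is the same kind of thing.

```python
import random
from typing import Callable


def estimate_pi(uniform: Callable[[], float], n: int = 200_000) -> float:
    """Monte Carlo estimate of pi: 4 times the fraction of points in the unit quarter-disk."""
    inside = 0
    for _ in range(n):
        x, y = uniform(), uniform()
        if x * x + y * y < 1.0:
            inside += 1
    return 4.0 * inside / n


def make_stuttering_prng(seed: int) -> Callable[[], float]:
    """Hypothetical flawed PRNG: every output is repeated once.

    A histogram of single outputs looks perfectly uniform; the flaw is purely
    a correlation between successive outputs.
    """
    rng = random.Random(seed)
    pending = None

    def uniform() -> float:
        nonlocal pending
        if pending is None:
            pending = rng.random()
            return pending
        value, pending = pending, None
        return value

    return uniform


print(estimate_pi(random.Random(0).random))  # roughly 3.14
print(estimate_pi(make_stuttering_prng(0)))  # roughly 2.83 = 4 / sqrt(2), badly wrong
```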
The MCMC will do a highly nontrivial (high-computational-complexity) calculation and give a highly non-arbitrary answer. The answer does depend to some extent on the stream of random inputs. For example, suppose I do MCMC, and (unbeknownst to me) the exact answer is 8.00. If I use a random seed of 1 in my PRNG, then the MCMC might spit out a final answer of 7.98 ± 0.03. If I use a random seed of 2, then the MCMC might spit out a final answer of 8.01 ± 0.03. Etc. So the algorithm run is dependent on the random bits, but the output is not totally arbitrary.
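Here is a minimal sketch of where an “8.01 ± 0.03”-style answer comes from, assuming a toy random-walk Metropolis chain on a Normal(8, 1) target (so the exact answer is 8.00) and the batch-means method for error bars, which is one standard way to handle autocorrelated MCMC output. The function names are hypothetical, and the printed numbers will not match the 7.98 and 8.01 in the text; the point is only that each seed gives a different run whose answer still sits within its error bars of the true value.

```python
import math
import random
import statistics


def metropolis_chain(rng: random.Random, n_steps: int = 40_000,
                     burn_in: int = 1_000, step: float = 2.0) -> list[float]:
    """Random-walk Metropolis samples from a toy Normal(8, 1) target."""
    x = 0.0
    samples = []
    for i in range(burn_in + n_steps):
        proposal = x + rng.uniform(-step, step)
        # log p(proposal) - log p(x) for the unnormalized Normal(8, 1) density.
        log_ratio = -0.5 * (proposal - 8.0) ** 2 + 0.5 * (x - 8.0) ** 2
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            x = proposal
        if i >= burn_in:
            samples.append(x)
    return samples


def mean_with_batch_error(samples: list[float], n_batches: int = 40) -> tuple[float, float]:
    """Return (mean, standard error) from non-overlapping batch means."""
    size = len(samples) // n_batches
    batch_means = [statistics.fmean(samples[k * size:(k + 1) * size])
                   for k in range(n_batches)]
    mean = statistics.fmean(batch_means)
    stderr = statistics.stdev(batch_means) / math.sqrt(n_batches)
    return mean, stderr


for seed in (1, 2, 3):
    m, se = mean_with_batch_error(metropolis_chain(random.Random(seed)))
    print(f"seed {seed}: {m:.2f} ± {se:.2f}")
```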
All this is uncontroversial background, I hope. You understand all this, right?
executions would branch conditional on specific charge trajectories, and it would be a rubbish computer.
As it happens, almost all modern computer chips are designed to be deterministic, by putting every signal extremely far above the noise floor. This has a giant cost in terms of power efficiency, but it has a benefit of making the design far simpler and more flexible for the human programmer. You can write code without worrying about bits randomly flipping—except for SEUs (single-event upsets, e.g. bit flips caused by cosmic rays), but those are rare enough that programmers can basically ignore them for most purposes.
(Even so, such chips can act non-deterministically in some cases—for example as discussed here, some ML code is designed with race conditions where sometimes (unpredictably) the chip calculates `(a+b)+c` and sometimes `a+(b+c)`, which are ever-so-slightly different for floating point numbers, but nobody cares, the overall algorithm still works fine.)
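The floating-point point can be checked directly. This minimal sketch uses ordinary Python floats (IEEE 754 doubles) on a CPU rather than the GPU reduction order the text refers to, but it shows the same non-associativity: the two groupings differ by one unit in the last place.

```python
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c   # 0.30000000000000004 + 0.3
right = a + (b + c)  # 0.1 + 0.5

print(left, right)        # 0.6000000000000001 0.6
print(left == right)      # False
print(abs(left - right))  # about 1.1e-16
```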
But more importantly, it’s possible to run algorithms in the presence of noise. It’s not how we normally do things in the human world, but it’s totally possible. For example, I think an ML algorithm would basically work fine if a small but measurable fraction of bits randomly flipped as you ran it. You would need to design it accordingly, of course—e.g. don’t use floating point representation, because a bit-flip in the exponent would be catastrophic. Maybe some signals would be more sensitive to bit-flips than others, in which case maybe put an error-correcting code on the super-sensitive ones. But for lots of other signals, e.g. the lowest-order bit of some neural net activation, we can just accept that they’ll randomly flip sometimes, and the algorithm still basically accomplishes what it’s supposed to accomplish—say, image classification or whatever.
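A minimal sketch of why representation matters for bit-flip tolerance, assuming a hypothetical fixed-point format in which a value is stored as an integer divided by 128. It flips single bits in the same activation value stored two ways: the IEEE 754 float32 encoding (via the standard `struct` module) and the fixed-point integer. A flip in the lowest-order bit is harmless either way, but a flip in the top exponent bit of the float turns 0.75 into roughly 2.55e38.

```python
import struct

SCALE = 128  # hypothetical fixed-point format: value = integer / 128


def flip_bit_float32(value: float, bit: int) -> float:
    """Encode `value` as an IEEE 754 float32, flip one bit, and decode it again."""
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    (flipped,) = struct.unpack("<f", struct.pack("<I", bits ^ (1 << bit)))
    return flipped


def flip_bit_fixed(value: float, bit: int) -> float:
    """Encode `value` in the fixed-point format, flip one bit, and decode it again."""
    q = round(value * SCALE)
    return (q ^ (1 << bit)) / SCALE


activation = 0.75
print(flip_bit_float32(activation, 0))   # lowest mantissa bit: about 0.75000006
print(flip_bit_float32(activation, 30))  # top exponent bit: about 2.55e38
print(flip_bit_fixed(activation, 0))     # lowest fixed-point bit: 0.7578125
```

In a fixed-point scheme the damage from any single flip is bounded by the weight of that bit, which is why one might protect only the high-order bits with an error-correcting code and let the low-order ones flip.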
I agree that certain demos might change the mind of certain people. (And if so, they’re worthwhile.) But I also think other people would be immune. For example, suppose someone has the (mistaken) idea: “Nobody would be so stupid as to actually press go on an AI that would then go on to kill lots of people! Or even if theoretically somebody might be stupid enough to do that, management / government / etc. would never let that happen.” Then that mistaken idea would not be disproven by any demo, except a “demo” that involved lots of actual real-life people getting killed. Right?
Hmm, I wasn’t thinking about that because that sentence was nominally in someone else’s voice. But you’re right. I reworded, thanks.
In a perfect world, everyone would be concerned about the risks for which there are good reasons to be concerned, and everyone would be unconcerned about the risks for which there are good reasons to be unconcerned, because everyone would be doing object-level checks of everyone else’s object-level claims and arguments, and coming to the correct conclusion about whether those claims and arguments are valid.
And those valid claims and arguments might involve demonstrations and empirical evidence, but also might be more indirect.
I do think Turing and von Neumann reached correct object-level conclusions via sound reasoning, but obviously I’m stating that belief without justifying it.
I’m not sure what argument you think I’m making.
In a perfect world, I think people would not need any concrete demonstration to be very concerned about AGI x-risk. Alan Turing and John von Neumann were very concerned about AGI x-risk, and they obviously didn’t need any concrete demonstration for that. And I think their reasons for concern were sound at the time, and remain sound today.
But many people today are skeptical that AGI poses any x-risk. (That’s unfortunate from my perspective, because I think they’re wrong.) The point of this post is to suggest that we AGI-concerned people might not be able to win over those skeptics via concrete demonstrations of AI doing scary (or scary-adjacent) things, either now or in the future—or at least, not all of the skeptics. It’s probably worth trying anyway—it might help for some of the skeptics. Regardless, understanding the exact failure modes is helpful.
These algorithms are useful maps of the brain and mind. But is computation also the territory? Is the mind a program? Such a program would need to exist as a high-level abstraction of the brain that is causally closed and fully encodes the mind.
I said it in one of your previous posts but I’ll say it again: I think causal closure is patently absurd, and a red herring. The brain is a machine that runs an algorithm, but algorithms are allowed to have inputs! And if an algorithm has inputs, then it’s not causally closed.
The most obvious examples are sensory inputs—vision, sounds, etc. I’m not sure why you don’t mention those. As soon as I open my eyes, everything in my field of view has causal effects on the flow of my brain algorithm.
Needless to say, algorithms are allowed to have inputs. For example, the mergesort algorithm has an input (namely, a list). But I hope we can all agree that mergesort is an algorithm!
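For concreteness, here is a minimal textbook mergesort. The list arrives from outside the algorithm, and nothing about the sorting steps depends on where it came from; this is an illustrative sketch, not code from the original discussion.

```python
def merge_sort(items: list) -> list:
    """Top-down mergesort; the list is the algorithm's input."""
    if len(items) <= 1:
        return list(items)
    mid = len(items) // 2
    left = merge_sort(items[:mid])
    right = merge_sort(items[mid:])
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:  # "<=" keeps equal elements in order (stable sort)
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


print(merge_sort([5, 2, 9, 1]))  # [1, 2, 5, 9]
```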
The other example is: the brain algorithm has input channels where random noise enters in. Again, that doesn’t prevent it from being an algorithm. Many famous, central examples of algorithms have input channels that accept random bits—for example, MCMC.
And in regard to “practical CF”, if I run MCMC on my computer while sitting outside, and I use an anemometer attached to the computer as a source of the random input bits entering the MCMC run, then it’s true that you need an astronomically complex hyper-accurate atmospheric simulator in order to reproduce this exact run of MCMC, but I don’t understand your perspective wherein that fact would be important. It’s still true that my computer is implementing MCMC “on a level of abstraction…higher than” atoms and electrons. The wind flowing around the computer is relevant to the random bits, but is not part of the calculations that comprise MCMC (which involve the CPU instruction set etc.). By the same token, if thermal noise mildly impacts my train of thought (as it always does), then it’s true that you need to simulate my brain down to the jiggling atoms in order to reproduce this exact run of my brain algorithm, but this seems irrelevant to me, and in particular it’s still true that my brain algorithm is “implemented on a level of abstraction of the brain higher than biophysics”. (Heck, if I look up at the night sky, then you’d need to simulate the entire Milky Way to reproduce this exact run of my brain algorithm! Who cares, right?)
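A minimal sketch of the anemometer point, using plain Monte Carlo integration (not MCMC) for brevity, and using Python’s `random.SystemRandom`, which draws from the operating system’s randomness source (`os.urandom`), as a stand-in for an external physical entropy source like the anemometer. The same calculation runs with either source. The seeded run can be replayed exactly; the `SystemRandom` run cannot, yet both give a non-arbitrary answer near the exact value 1/3.

```python
import random


def mc_integral_x_squared(rng: random.Random, n: int = 200_000) -> float:
    """Plain Monte Carlo estimate of the integral of x**2 over [0, 1] (exact: 1/3)."""
    return sum(rng.random() ** 2 for _ in range(n)) / n


reproducible = random.Random(42)        # software PRNG: same seed, same run
environmental = random.SystemRandom()   # OS randomness (os.urandom): runs can't be replayed

print(mc_integral_x_squared(reproducible))   # close to 0.3333
print(mc_integral_x_squared(environmental))  # close to 0.3333, different every time
```

The function body, which is the part that “comprises” the calculation, is identical in both runs; only the source of the random input changes.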
I agree!! (if I understand correctly). See https://www.lesswrong.com/posts/RrG8F9SsfpEk9P8yi/robin-hanson-s-grabby-aliens-model-explained-part-1?commentId=wNSJeZtCKhrpvAv7c
Huh, this is helpful, thanks, although I’m not quite sure what to make of it and how to move forward.
I do feel confused about how you’re using the term “equanimity”. I sorta have in mind a definition kinda like: neither very happy, nor very sad, nor very excited, nor very tired, etc. Google gives the example: “she accepted both the good and the bad with equanimity”. But if you’re saying “apply equanimity to positive sensations and it makes them better”, you’re evidently using the term “equanimity” in a different way than that. More specifically, I feel like when you say “apply equanimity to X”, you mean something vaguely like “do a specific tricky learned attention-control maneuver that has something to do with the sensory input of X”. That same maneuver could contribute to equanimity, if it’s applied to something like anxiety. But the maneuver itself is not what I would call “equanimity”. It’s upstream. Or sorry if I’m misunderstanding.
Also, I want to distinguish two aspects of an emotion. In one, “duration of an emotion” is kinda like “duration of wearing my green hat”. I don’t have to be thinking about it the whole time, but it’s a thing happening with my body, and if I go to look, I’ll see that it’s there. Another aspect is the involuntary attention. As long as it’s there, I can’t not think about it, unlike my green hat. I expect that even black-belt PNSE meditators are unable to instantly turn off anger / anxiety / etc. in the former sense. I think these things are brainstem reactions that can be gradually unwound but not instantly. I do expect that those meditators would be able to more instantly prevent the anger / anxiety / etc. from controlling their thought process. What do you think?
Also, just for context, do you think you’ve experienced PNSE? Thanks!
I don’t think any of the challenges you mentioned would be a blocker to aliens that have infinite compute and infinite time. “Is the data big-endian or little-endian?” Well, try it both ways and see which one is a better fit to observations. If neither seems to fit, then do a combinatorial listing of every one of the astronomical number of possible encoding schemes, and check them all! Spend a trillion years studying the plausibility of each possible encoding before moving onto the next one, just to make sure you don’t miss any subtlety. Why not? You can do all sorts of crazy things with infinite compute and infinite time.
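Here is a minimal sketch of “try it both ways and see which one is a better fit”, assuming the data is a stream of unsigned 16-bit samples and using smoothness as a hypothetical stand-in for “fit to observations” (a real decoder would use whatever structure it expects in the data). The helper names are illustrative; only the `struct` format codes are standard.

```python
import struct


def decode_u16(data: bytes, byteorder: str) -> list[int]:
    """Decode consecutive unsigned 16-bit integers in the given byte order."""
    prefix = ">" if byteorder == "big" else "<"
    count = len(data) // 2
    return list(struct.unpack(f"{prefix}{count}H", data[:2 * count]))


def roughness(values: list[int]) -> int:
    """Total absolute step-to-step change: a crude plausibility score (lower = smoother)."""
    return sum(abs(b - a) for a, b in zip(values, values[1:]))


def guess_byteorder(data: bytes) -> str:
    """Try both byte orders and keep the one whose decoded signal is smoother."""
    return min(("big", "little"), key=lambda order: roughness(decode_u16(data, order)))


signal = list(range(1000, 1300))                 # a slowly rising signal
data = struct.pack(f">{len(signal)}H", *signal)  # encoded big-endian
print(guess_byteorder(data))                     # prints: big
```

Decoding with the wrong byte order swaps the high and low bytes, so a gentle ramp turns into a sawtooth with jumps of 256 or more, which the roughness score penalizes heavily.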
I don’t think this is too related to the OP, but in regard to your exchange with jbash:
I think there’s a perspective where “personal identity” is a strong intuition, but a misleading one—it doesn’t really (“veridically”) correspond to anything at all in the real world. Instead it’s a bundle of connotations, many of which are real and important. Maybe I care that my projects and human relationships continue, that my body survives, that the narrative of my life is a continuous linear storyline, that my cherished memories persist, whatever. All those things veridically correspond to things in the real world, but (in this perspective) there isn’t some core fact of the matter about “personal identity” beyond that bundle of connotations.
I think jbash is saying (within this perspective) that you can take the phrase “personal identity”, pick whatever connotations you care about, and define “personal identity” as that. And then your response (as I interpret it) is that no, you can’t do that, because there’s a core fact of the matter about personal identity, and that core fact of the matter is very very important, and it’s silly to define “personal identity” as pointing to anything else besides that core fact of the matter.
So I imagine jbash responding that “do I nonetheless continue living (in the sense of, say, anticipating the same kind of experiences)?” is a confused question, based on reifying misleading intuitions around “I”. It’s a bit like saying “in such-and-such a situation, will my ancestor spirits be happy or sad?”
I’m not really defending this perspective here, just trying to help explain it, hopefully.
If we apply the Scott Aaronson waterfall counterargument to your Alice-bot-and-Bob-bot scenario, I think it would say: The first step was running Alice-bot, to get the execution trace. During this step, the conscious experience of Alice-bot manifests (or whatever). Then the second step is to (let’s say) modify the Bob code such that it does the same execution but has different counterfactual properties. Then the third step is to run the Bob code and ask whether the experience of Alice-bot manifests again.
But there’s a more basic question. Forget about Bob. If I run the Alice-bot code twice, with the same execution trace, do I get twice as much Alice-experience stuff? Maybe you think the answer is “yeah duh”, but I’m not so sure. I think the question is confusing, possibly even meaningless. How do you measure how much Alice-experience has happened? The “thick wires” argument (I believe due to Nick Bostrom, see here, p189ff, or shorter version here) seems relevant. Maybe you’ll say that the thick-wires argument is just another reductio about computational functionalism, but I think we can come up with a closely-analogous “thick neurons” thought experiment that makes whatever theory of consciousness you subscribe to have an equally confusing property.
I don’t think Premise 2 is related to my comment. I think it’s possible to agree with premise 2 (“there is an objective fact-of-the-matter whether a conscious experience is occurring”), but also to say that there are cases where it is impossible-in-practice for aliens to figure out that fact-of-the-matter.
By analogy, I can write down a trillion-digit number N, and there will be an objective fact-of-the-matter about what is the prime factorization of N, but it might take more compute than fits in the observable universe to find out that fact-of-the-matter.
This is kinda helpful, but I also think people in your (1) group would agree with all three of: (A) the sequence of thoughts that you think directly corresponds to something about the evolving state of activity in your brain, (B) random noise has nonzero influence on the evolving state of activity in your brain, (C) random noise cannot be faithfully reproduced in a practical simulation.
And I think that they would not see anything self-contradictory about believing all of those things. (And I also don’t see anything self-contradictory about that, even granting your (1).)
Well, I guess this discussion should really be focused more on personal identity than consciousness (OP wrote: “Whether or not a simulation can have consciousness at all is a broader discussion I'm saving for later in the sequence, and is relevant to a weaker version of CF.”).
So in that regard: my mental image of computational functionalists in your group (1) would also say things like (D) “If I start 5 executions of my brain algorithm, on 5 different computers, each with a different RNG seed, then they are all conscious (they are all exuding consciousness-stuff, or whatever), and they all have equal claim to being “me”, and of course they all will eventually start having different trains of thought. Over the months and years they might gradually diverge in beliefs, memories, goals, etc. Oh well, personal identity is a fuzzy thing anyway. Didn’t you read Parfit?”
But I haven’t read as much of the literature as you, so maybe I’m putting words in people’s mouths.
FYI for future readers: the OP circles back to this question (what counts as a computation) more in a later post of this sequence, especially its appendix, and there’s some lively discussion happening in the comments section there.
You can’t be wrong about the claim “you are having a visual experience”.
Have you heard of Cotard's syndrome?
It’s interesting that you care about what the alien thinks. Normally people say that the most important property of consciousness is its subjectivity. Like, people tend to say things like “Is there something that it’s like to be that person, experiencing their own consciousness?”, rather than “Is there externally-legible indication that there’s consciousness going on here?”.
Thus, I would say: the simulation contains a conscious entity, to the same extent that I am a conscious entity. Whether aliens can figure out that fact is irrelevant.
I do agree with the narrow point that a simulation of consciousness can be externally illegible, i.e. that you can manifest something that’s conscious to the same extent that I am, in a way where third parties will be unable to figure out whether you’ve done that or not. I think a cleaner example than the ones you mentioned is: a physics simulation that might or might not contain a conscious mind, running under homomorphic encryption with a 100000-bit key, and where all copies of the key have long ago been deleted.
Actually never mind. But for future reference I guess I’ll use the intercom if I want an old version labeled. Thanks for telling me how that works. :)
(There’s a website / paper going around that cites a post I wrote way back in 2021, when I was young and stupid, so it had a bunch of mistakes. But after re-reading that post again this morning, I decided that the changes I needed to make weren’t that big, and I just went ahead and edited the post like normal, and added a changelog to the bottom. I’ve done this before. I’ll see if anyone complains. I don’t expect them to. E.g. that same website / paper cites a bunch of arxiv papers while omitting their version numbers, so they’re probably not too worried about that kind of stuff.)
I think there might be a lesswrong editor feature that allows you to edit a post in such a way that the previous version is still accessible. Here’s an example—there’s a little icon next to the author name that says “This post has major past revisions…”. Does anyone know where that option is? I can’t find it in the editor UI. (Or maybe it was removed? Or it’s only available to mods?) Thanks in advance!
There’s a theory (twitter citing reddit) that at least one of these people filed GDPR right to be forgotten requests. So one hypothesis would be: all of those people filed such GDPR requests.
But the reddit post (as of right now) guesses that it might not be specifically about GDPR requests per se, but rather more generally “It's a last resort fallback for preventing misinformation in situations where a significant threat of legal action is present”.
Good luck! I was writing about it semi-recently here.
General comment: It’s also possible to contribute to mind uploading without getting a PhD; see the last section of that post. There are job openings that aren’t even biology, e.g. ML engineering. And you could also earn money and donate it; my impression is that there’s a desperate need.
I guess I shouldn’t put words in other people’s mouths, but I think the fact that years-long trains-of-thought cannot be perfectly predicted in practice because of noise is obvious and uninteresting to everyone, and I bet that includes the computational functionalists you quoted, even if their wording on that was not crystal clear.
There are things that the brain does systematically and robustly by design, things which would be astronomically unlikely to happen by chance. E.g. the fact that I move my lips to emit grammatical English-language sentences rather than random gibberish. Or the fact that humans wanted to go to the moon, and actually did so. Or the fact that I systematically take actions that tend to lead to my children surviving and thriving, as opposed to suffering and dying.
That kind of stuff, which my brain does systematically and robustly, is what makes me me. My memories, goals, hopes and dreams, skills, etc. The fact that I happened to glance towards my scissors at time 582834.3 is not important, but the robust systematic patterns are.
And the reason that my brain does those things systematically and robustly is because the brain is designed to run an algorithm that does those things. And there’s a mathematical explanation of why this particular algorithm does those remarkable systematic things like invent quantum mechanics and reflect on the meaning of life, and separately, there’s a biophysical explanation of how it is that the brain is a machine that runs this algorithm.
I don’t think “software versus hardware” is the right frame. I prefer “the brain is a machine that runs a certain algorithm”. Like, what is software-versus-hardware for a mechanical calculator? I dunno. But there are definitely algorithms that the mechanical calculator is executing.
So we can talk about what is the algorithm that the brain is running, and why does it work? Well, it builds models, and stores them, and queries them, and combines them, and edits them, and there’s a reinforcement learning actor-critic thing, blah blah blah.
Those reasons can still be valid even if there’s some unpredictable noise in the system. Think of a grandfather clock: the second hand will robustly move 60× faster than the minute hand, by design, even if there’s some noise in the pendulum that affects the speed of both, or randomness in the surface friction that affects the exact micron-level location where the second hand comes to rest each tick. Or think of an algorithm that involves randomness (e.g. MCMC): any given output is unpredictable, but the algorithm still robustly and systematically does stuff that is a priori specifiable and would be astronomically unlikely to happen by chance. Or think of the Super Mario 64 source code compiled to different chip architectures that use different-size floats (for example). You can play both, and they will both be very recognizably Super Mario 64, but any given exact sequence of button presses will eventually lead to divergent trajectories on the two systems. (This kind of thing is known to happen in tool-assisted speedruns: they’ll get out of sync on different systems, even when it’s “the same game” to all appearances.)
But it’s still reasonable to say that the Super Mario 64 source code is specifying an algorithm, and all the important properties of Super Mario 64 are part of that algorithm, e.g. what does Mario look like, how does he move, what are the levels, etc. It’s just that the core algorithm is not specified at such a level of detail that we can pin down what any given infinite sequence of button presses will do. That depends on unimportant details like floating point rounding.
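To make the floating-point point concrete, here is a minimal sketch (not Super Mario 64’s actual physics; the logistic map is just an assumed stand-in for any chaotic update rule). It runs the same rule twice from the same starting value: once in Python’s native 64-bit floats, and once rounding every intermediate result to 32 bits via the standard `struct` module. The rule is identical, yet the tiny rounding differences get amplified until the two trajectories disagree completely. The names `to_float32`, `step`, and `first_divergence`, and the tolerance of 0.1, are illustrative choices.

```python
import struct


def to_float32(x: float) -> float:
    """Round a Python (64-bit) float to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]


def step(x: float) -> float:
    """One step of the logistic map with r = 3.9, a standard chaotic rule."""
    return 3.9 * x * (1.0 - x)


def first_divergence(x0: float, steps: int, tol: float = 0.1):
    """Run the same rule in 64-bit and in emulated 32-bit precision.

    The 32-bit run rounds its state to float32 after every step.
    Returns the first step at which the two trajectories differ by
    more than `tol`, or None if they stay within `tol` for all steps.
    """
    x64 = x0
    x32 = to_float32(x0)
    for t in range(1, steps + 1):
        x64 = step(x64)
        x32 = to_float32(step(x32))
        if abs(x64 - x32) > tol:
            return t
    return None
```

For example, `first_divergence(0.2, 1000)` returns a step count well under 1000, while for the first handful of steps the two runs agree to many decimal places. Both runs are recognizably “the logistic map”; they just aren’t specified precisely enough to agree on every exact trajectory.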
I think this is compatible with how people use the word “algorithm” in practice. Like, CS people will casually talk about “two different implementations of the MCMC algorithm”, and not just “two different algorithms in the MCMC family of algorithms”.
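Here is a minimal sketch of the MCMC point, assuming a random-walk Metropolis sampler targeting a standard normal distribution (the target, step size, sample count, and seeds are all arbitrary illustrative choices). Each seed produces a different, unpredictable sample path, but every run robustly does the a-priori-specifiable thing: its samples have mean near 0 and variance near 1.

```python
import math
import random


def metropolis_normal(n_samples: int, seed: int, step_size: float = 2.0) -> list[float]:
    """Random-walk Metropolis sampler targeting a standard normal N(0, 1)."""
    rng = random.Random(seed)  # each run gets its own independent RNG stream
    x = 0.0
    samples = []
    for _ in range(n_samples):
        proposal = x + rng.uniform(-step_size, step_size)  # symmetric proposal
        # Log acceptance ratio for N(0, 1): log p(proposal) - log p(x)
        log_accept = (x * x - proposal * proposal) / 2.0
        if log_accept >= 0 or rng.random() < math.exp(log_accept):
            x = proposal
        samples.append(x)
    return samples


def mean(xs: list[float]) -> float:
    return sum(xs) / len(xs)


def variance(xs: list[float]) -> float:
    m = mean(xs)
    return sum((v - m) ** 2 for v in xs) / len(xs)


runs = {seed: metropolis_normal(50_000, seed) for seed in (1, 2, 3)}
for seed, xs in runs.items():
    # Exact paths differ per seed; the summary statistics should not.
    print(seed, round(mean(xs), 2), round(variance(xs), 2))
```

The three runs disagree about every individual sample, yet the printed means and variances should all come out close to 0 and 1, which is the sense in which they are all “the same algorithm”.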
That said, I guess it’s possible that Putnam and/or Piccinini were describing things in a careless or confused way regarding the role of noise impinging upon the brain. I am not them, and it’s probably not a good use of time to litigate their exact beliefs and wording. ¯\_(ツ)_/¯
I should probably let EuanMcLean speak for themselves but I do think “literally the exact same sequence of thoughts in the exact same order” is what OP is talking about. See the part about “causal closure”, and “predict which neurons are firing at t1 given the neuron firings at t0…”. The latter is pretty unambiguous IMO: literally the exact same sequence of thoughts in the exact same order.
I definitely didn’t write anything here that amounts to a general argument for (or against) computationalism. I was very specifically responding to this post. :)