You want to help? Figure out what kind of incremental changes you can begin to introduce in any of them, in order to begin extinguishing the sort of problems you've now elevated to the rank of "saving-worthy" in your own head. Note that, in all likelihood, by extinguishing one you will merrily introduce a whole bunch of others - something you won't get to discover until much later on. Yet that is, realistically, what you can actually go on to accomplish.
I read this paragraph as saying ~the same thing as the original post in a different tone.
Is there a better way of discovering strong arguments for a non-expert than asking for them publicly?
Also, it assumes there is a separate module for making predictions, which cannot be manipulated by the agent. This assumption is not very probable in my view.
Isn't this a blocker for any discussion of particular utility functions?
If a simple philosophical argument can cut the expected odds of AI doom by an order of magnitude, we might not change our current plans, but it suggests that we have a lot of confusion on the topic that further research might alleviate.
And more generally, "the world where we almost certainly get killed by ASI" and "The world where we have an 80% chance of getting killed by ASI" are different worlds, and, ignoring motives to lie for propaganda purposes, if we actually live in the latter we should not say we live in the former.
I don't think wireheading is "myopic" when it overlaps with self-maintenance. A classic example would be painkillers; they do ~nothing but make you "feel good now" (or at least less bad), but sometimes feeling less bad is necessary to function properly and achieve long-term value. I think that gratitude journaling is also part of this overlap area. That said, I don't know many people's experiences with it, so maybe it's more prone to "abuse" than I expect.
A corrigible AI is one that cooperates with attempts to modify it to bring it more in line with what its creators/users want it to be. Some people think that this is a promising direction for alignment research, since if an AI could be guaranteed to be corrigible, even if it ends up with wild/dangerous goals, we could in principle just modify it to not have those goals and it wouldn't try to stop us.
"Alignment win condition," as far as I know, is a phrase I just made up. I mean it as something that, regardless of whether it "solves" alignment in a specific technical sense, achieves the underlying goal of alignment research which is "have artificial intelligence which does things we want and doesn't do things we don't want." A superintelligence that is perfectly aligned with its creator's goals would be very interesting technically and mathematically, but if its creator wants it to kill anyone it really isn't any better than an unaligned superintelligence that kills everyone too.
I don't trust a hypothetical arbitrary superintelligence but I agree that a superintelligence is too much power for any extant organization, which means that "corrigibility" is not an alignment win condition. An AI resisting modification to do bad things (whatever that might mean on reflection) seems like a feature, not a bug.
Do you believe or allow for a distinction between value and ethics? Intuitively it feels like metaethics should take into account the Goodness of Reality principle, but I think my intuition comes from a belief that if there's some objective notion of Good, ethics collapses to "you should do whatever makes the world More Gooder," and I suppose that that's not strictly necessary.
The adulterer, the slave owner and the wartime rapist all have solid evolutionary reasons to engage in behaviors most of us might find immoral. I think their moral blind spots are likely not caused by trapped priors, like an exaggerated fear of dogs is.
I don't think the evopsych and trapped-prior views are incompatible. A selection pressure towards immoral behavior could select for genes/memes that tend to result in certain kinds of trapped prior.
I also suspect something along the lines of "Many (most?) great spiritual leaders were making a good-faith effort to understand the same ground truth with the same psychological equipment and got significantly farther than most normal people do." But in order for that to be plausible, you would need a reason why the almost-truths they found are so goddamn antimemetic that the most studied and followed people in history weren't able to make them stick. Some of the selection pressure surely comes down to social dynamics. I'd like to think that people who have grazed some great Truth are less likely to torture and kill infidels than someone who thinks they know a great truth. Cognitive blind spots could definitely explain things, though.
The problem is, the same thing that would make blind spots good at curbing the spread of enlightenment also makes them tricky to debate as a mechanism for it. They're so slippery that until you've gotten past one yourself it's hard to believe they exist (especially when the phenomenal experience of knowing-something-that-was-once-utterly-unknowable can also seemingly be explained by developing a delusion). They're also hard to falsify. What you call active blind spots are a bit easier to work with; I think most people can accept the idea of something like "a truth you're afraid to confront" even if they haven't experienced such a thing themselves (or are afraid to confront the fact that they have).
I look forward to reading your next post(s), as well as this site's reaction to them.
That "so now what" doesn't sound like a dead end to me. The question of how to mitigate risk when normal risk-mitigation procedures are themselves risky seems like an important one.
Alright, based on your phrasing I had thought it was something you believed. I'm open to moral realism and I don't immediately see how phenomena being objectively bad would imply that physics is objectively bad.
Why does something causing something bad make that thing itself bad?
Then I'd like to see some explanation why it doesn't have an answer, which would be adding back to normality.
I'm not saying it doesn't, I'm saying it's not obvious that it does. Normalcy requirements don't mean all our possibly-confused questions have answers, they just put restrictions on what those answers should look like. So, if the idea of successors-of-experience is meaningful at all, our normal intuition gives us desiderata like "chains of successorship are continuous across periods of consciousness" and "chains of successorship do not fork or merge with each other under conditions that we currently observe."
If you have any particular notion of successorship that meets all the desiderata you think should matter here, whether or not a teleporter creates a successor is a question of fact. But it's not obvious what the most principled set of desiderata is, and for most sets of desiderata it's probably not obvious whether there is a unique notion of successorship.
OP is advocating for something along the lines of "There is no uniquely-most-principled notion of successorship; the fact that different people have different desiderata, or that some people arbitrarily choose one idea of successorship over another that's just as logical, is a result of normal value differences." There is no epistemic relativism; given any particular person's most valued notion of successorship, everyone can, in principle, agree whether any given situation preserves it.
The relativism is in choosing which (whose) notion to use when making any given decision. Even in a world where souls are real and most people agree that continuity-of-consciousness is equivalent to continuity-of-soul-state, which is preserved by those nifty new teleporters, some curmudgeon who thinks that continuity-of-physical-location is also important shouldn't be forced into a teleporter against their will, since they expect (and all informed observers will agree) that their favored notion of continuity of consciousness will be ended by the teleporter.
[...]only autonomous (driven by internal will) actions, derived from duty to the moral law, can be considered moral.
[...] a belief in a thing has a totalising effect on the will of the subject.
What makes this totalizing effect distinct from the "duty to moral law" explicitly called for?
People tend to agree that one should care about the successor of one's subjective experience. The question is whether there will be one or not. And this is a question of fact.
But the question of "what, if anything, is the successor of your subjective experience" does not obviously have a single factual answer.
I can conceptualize a world where a soul always stays tied to the initial body, and as soon as it's destroyed, the soul is destroyed as well.
If souls are real (and the Hard Problem boils down to "it's the souls, duh"), then a teleporter that doesn't reattach/reconstruct your soul seems like it doesn't fit the hypothetical. If the teleporter perfectly reassembles you, that should apply to all components of you, even extraphysical ones.
I'm not convinced that there is a single "way" one should expect to wake up in the morning. If we're talking about things like observer-moments and exotic theories of identity, I don't think we can reliably communicate by analogy to mundane situations, since our intuitions might differ in subtle ways that don't matter in those situations.
For instance, should I believe that I will wake up because that will lead me to make decisions that lead to world-states I prefer, or should I expect to wake up because it is true that I will probably wake up? If the latter, does that just mean that there will exist an observer-moment in my bed tomorrow that is a close match to my current self, or am I actually expecting to be the same as that observer-moment in some important sense that would not apply to other observer-moments in the same epistemic state? Different answers to these questions will mostly add up to similar bedtime behavior, but they diverge if the situation gets far out of distribution.
If I just care about doing things that help similar observer-moments, the assumption that "I" am distributed across multiple worlds doesn't matter much. I'll just act under the assumption that I'll live another day (or eventually be revived), and in worlds where I don't, there aren't any observer-moments I care about to complain I guessed wrong. In that sense, I "should" expect to continue to exist. (But if you're selfish enough, this means you "should" expect to be immortal even in a single, deterministic universe. When you inevitably die, you aren't around to regret your expectation!) But if you only "should" expect things as much as they are likely to actually happen, being distributed over multiple universes is a big issue, since it is likely that in any given time period you will actually die in some universes AND actually survive in others. You could decide to weight your belief according to how many universes you expect to survive in, but only if you have a measure over universes. (This is also an issue for the first kind of "should"; once you decide to act like you'll live, you might find yourself facing decisions that incline you to weigh one set of worlds against another.)
What does it mean to "should expect" something, if your identity is transmitted across multiple universes with different ground truths?
I think the key to approaches like this is to eschew pre-existing, complex concepts like "human flourishing" and look for a definition of Good Things that is actually amenable to constructing an agent that Does Good Things. There's no guarantee that this would lead anywhere; it relies on some weak form of moral realism. But an AGI that follows some morality-you-largely-agree-with by its very structure is a lot more appealing to me than an AGI that dutifully maximizes the morality-you-punched-into-its-utility-function-at-bootup, appealing enough that I think it's worth wading into moral philosophy to see if the idea pans out.
Should it make a difference? Same iterative computation.
Not necessarily; a lot of information is being discarded when you're only looking at the paper/verbal output. As an extreme example, if the emulated brain had been instructed (or had the memory of being instructed) to say the number of characters written on the paper and nothing else, the computational properties of the system as a whole would be much simpler than those of the emulation.
I might be missing the point. I agree with you that an architecture that predicts tokens isn't necessarily non-conscious. I just don't think the fact that a system predicts tokens generated by a conscious process is reason to suspect that the system itself is conscious without some other argument.
I don't think that in the example you give, you're making a token-predicting transformer out of a human emulation; you're making a token-predicting transformer out of a virtual system with a human emulation as a component. In the system, the words "what's your earliest memory?" appearing on the paper are going to trigger all sorts of interesting (emulated) neural mechanisms that eventually lead to a verbal response, but the token predictor doesn't necessarily need to emulate any of that. In fact, if the emulation is deterministic, it can just memorize whatever response is given. Maybe gradient descent is likely to make the LLM conscious in order to efficiently memorize the outputs of a partly conscious system, but that's not obvious.
If you have a brain emulation, the best way to get a conscious LLM seems to me like it would be finding a way to tokenize emulation states and training it on those.
The number of poor people is much larger than the number of billionaires, but the number of poor people who THINK they're billionaires probably isn't that much larger. Good point about needing to forget the technique, though.
Is this an independent reinvention of the law of attraction? There doesn't seem to be anything special about "stop having a disease by forgetting about it" compared to the general "be in a universe by adopting a mental state compatible with that universe." That said, becoming completely convinced I'm a billionaire seems more psychologically involved than forgetting I have some disease, and the ratio of universes where I'm a billionaire versus I've deluded myself into thinking I'm a billionaire seems less favorable as well.
Anyway, this doesn't seem like a good solution, since for every "me" that gets into a better universe, another just gets booted into a worse one. As far as the interests of the whole cohort go, it'd be a waste of effort.
What does it mean when one "should anticipate" something? At least in my mind, it points strongly to a certain intuition, but the idea behind that intuition feels confused. "Should" in order to achieve a certain end? To meet some criterion? To boost a term in your utility function?
I think the confusion here might be important, because replacing "should anticipate" with a less ambiguous "should" seems to make the problem easier to reason about, and supports your point.
For instance, suppose that you're going to get your brain copied next week. After you get copied, you'll take a physics test, and your copy will take a chemistry test (maybe this is your school's solution to a scheduling conflict during finals). You want both test scores to be high, but you expect taking either test without preparation will result in a low score. Which test should you prepare for?
It seems clear to me that you should prepare for both the chemistry test and the physics test. The version of you that got copied will be able to use the results of the physics preparation, and the copy will be able to use the copied results of the chemistry preparation. Does that mean you should anticipate taking a chemistry test and anticipate taking a physics test? I feel like it does, but the intuition behind the original sense of "should anticipate" seems to squirm out from under it.