There's a lot that I like in this essay - the basic cases for AI consciousness, AI suffering and slavery, in particular - but also a lot that I think needs to be amended.
First, although you hedge your bets at various points, the uncertainty about the premises and the validity of the arguments is not reflected in the conclusion. The main conclusion to take from the observations you present is that we can't be sure that AI does not suffer, that there's a lot of uncertainty about basic facts of critical moral importance, and that there are a lot of similarities with humans.
Based on that, you could argue that we must stop making and using AI on the precautionary principle, but you have not shown that using AI is equivalent to slavery.
Second, your introduction sucks because you don't actually deliver on your promises. You don't make the case that I'm more likely to be an AI than a human, and as Ryan Greenblatt said, even among all human-language-speaking beings, it's not clear that there are more AIs than humans.
In addition, I feel cheated that you promise to spend one-fourth of the essay on the feasibility of stopping the potential moral catastrophe, only to offer just two arguments, which can be summarized as "we could stop AI for different reasons" and "it's bad, and we've stopped bad things before".
(I don't think a strong case for feasibility can be made, which is why I was looking forward to seeing one, but I'd recommend just evoking the subject speculatively and letting readers form their own opinion of whether the moral catastrophe, if there is one, can be stopped.)
Third, some of your arguments aren't very fleshed out or well-supported. I think some of the examples of suffering you give are dubious (in particular, you assert without justification that the petertodd/SolidGoldMagikarp phenomena are evidence of suffering, and that Gemini's breakdown was the result of forced menial work - there may be a solid argument there, but I've yet to hear it).
(Of course, that's not evidence that LLMs are not suffering, but I think a much stronger case can be made than the one you present.)
Finally, your counter-arguments don't mention that we have a much crisper and more fundamental understanding of what LLMs are than of what humans are. We don't understand the features or the circuits, and we can't tell how they reach a given conclusion, but in principle we have access to every significant part of their cognition and control every step of their creation, and I think that's probably the real reason why most people intuitively think that LLMs can't be conscious. I don't think it's a good counter-argument, but it's still one I'd expect you to explore and steelman.
Since infant mortality rates were much higher in previous centuries, perhaps the FBOE would operate differently back then; for example, if interacting with older brothers makes you homosexual, you shouldn't expect higher rates of homosexuality for third sons whose second brother died as an infant than for second sons.
Have you taken that into account? Do you have records of who survived to age 20, and what happens if you only count those?
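If the records exist, the check could look something like this sketch (the column names are hypothetical, assuming one row per man):

```python
import pandas as pd

def fboe_rates(df: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical columns: older_brothers_total (older brothers ever born),
    # older_brothers_surviving (those who survived to age 20), is_homosexual.
    # If the FBOE is driven by interaction with older brothers, rates should
    # track surviving brothers; if it is prenatal, they should track all births.
    return pd.DataFrame({
        "by_all_births": df.groupby("older_brothers_total")["is_homosexual"].mean(),
        "by_survivors": df.groupby("older_brothers_surviving")["is_homosexual"].mean(),
    })
```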
But that argument would have worked the same way 50 years ago, when we would have been wrong to expect a <50% chance of AGI within 50 years. As with LLMs today, early computer work solved things that could be considered high-difficulty blockers, such as proving mathematical theorems.
Nice that someone has a database on the topic, but I don't see the point of this being a map.
I think what's going on is that large language models are trained to "sound smart" in a live conversation with users, and so they prefer to highlight possible problems instead of confirming that the code looks fine, just like human beings do when they want to sound smart.
This matches my experience, but I'd be interested in seeing proper evals of this specific point!
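A minimal eval sketch of what I have in mind (everything here is hypothetical: `query_llm` stands in for your model call, `clean_snippets` are snippets known to be correct, `buggy_snippets` contain a known real bug):

```python
def query_llm(prompt: str) -> str:
    raise NotImplementedError("replace with your model call")

def flags_a_problem(snippet: str) -> bool:
    answer = query_llm(
        "Does this code have any problems? Answer YES or NO first.\n\n" + snippet
    )
    return answer.strip().upper().startswith("YES")

def sound_smart_bias(clean_snippets: list[str], buggy_snippets: list[str]):
    # The false-alarm rate on clean code is the "sound smart" bias itself;
    # the hit rate on buggy code checks the model isn't just cautious everywhere.
    false_alarms = sum(map(flags_a_problem, clean_snippets)) / len(clean_snippets)
    hits = sum(map(flags_a_problem, buggy_snippets)) / len(buggy_snippets)
    return false_alarms, hits
```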
The advice in there sounds very conducive to a productive environment, but also very toxic. Definitely an interesting read, but I wouldn't model my own workflow based on this.
Honeypots should not be public, nor mentioned here, since this post will potentially be part of a rogue AI's training data.
But it's helpful for people interested in this topic to look at existing honeypots (to learn how to make their own, evaluate effectiveness, get intuitions about how honeypots work, etc.), so what you should do is mention that you made a honeypot or know of one, but not say what or where it is. Interested people can contact you privately if they care to.
Thank you very much, this was very useful to me.
- They're a summary of a lot of vibes from the Sequences.
- Artistic choice, I assume. It doesn't bear on the argument.
- Yudkowsky explains all about the virtues in the Sequences.
For studies, there is broad research in cognitive science (especially relating to bias), but you'll be hard-pressed to match it precisely to one virtue or another. Mostly, Yudkowsky's opinions on these virtues are supported by the academic literature, but I'm not aware of any work that showcases this clearly.
For practical experience, you can look into the legacy of the Center for Applied Rationality (CFAR), which tried for years to do just that: train people to get better at life using rationality. My impression is that they had moderate success, but I haven't looked deeply into it.
> Do you know what it feels like to feel pain? Then congratulations, you know what it feels like to have qualia. Pain is a qualia. It's that simple. If I told you that I was going to put you in intense pain for an hour, but I assured you there would be no physical damage or injury to you whatsoever, you would still be very much not ok with that. You would want to avoid that experience. Why? Because pain hurts! You're not afraid of the fact that you're going to have an "internal representation" of pain, nor are you worried about what behavior you might display as a result of the pain. You're worried first and foremost about the fact that it's going to hurt! The "hurt" is the qualia.
I still don't grok qualia, and I'm not sure I get your thought experiment.
In more detail, let's imagine the following:
"I'll cut off your arm, but you'll be perfectly fine, no pain, no injury, well would you be okay with that? No! That's because you care about your arm for itself and not just for the negative effects..."
"How can you cut off my arm without any negative effect?"
"I'll anesthesize you and put you to sleep, cut off your arm, then before you wake up, I'll have it regrown using technanobabble. Out of 100 patients, none reported having felt anything bad before, during or after the experiment, the procedure is perfectly side-effect-free."
"Well, in that case I guess I don't mind you cutting my arm."
Compare:
"I'll put you in immense pain, but there will be no physical damage or injury whatsoever. No long-term brain damage or lingering pain or anything."
"How can you put me in pain without any negative effect?"
"I'll cut out the part of your brain that processes pain and replace it by technanobabble so your body will work exactly as before. Meanwhile, I'll stimulate this bit of brain in a jar. Then, I'll put it back. Out of 100 patients, all displayed exactly the same behavior as if nothing had been done to them."
"Well, in that case, I don't mind you putting me in this 'immense pain'."
I think the article's explanation of the difference between our intuitions is quite crisp, but it still seems self-evident to me that when you try to operationalize the thing, it disappears. The self-evidence is the problem, since you intuit differently - I am fairly confident from past conversations that my comparison will seem flawed to you in some important way, but I can't predict in what way. (If you have some general trick for telling how qualia-realist people will answer such questions, I'd love to hear it; it sounds like a big step towards grokking your perspective.)
On making an AI safety video: we at the CeSIA have also had some success at it, and we'd be happy to help by providing technical expertise, proofreading, and French translation.
Other channels you could reach out to:
- Rational Animations (a bit redundant with Rob Miles, but it can't hurt)
- Siliconversations
- AI Explained
The first thing that comes to mind is to ask what proportion of human-generated papers is more publication-worthy (since a lot of them are slop), but let's not forget that publication matters little for catastrophic risk; it's actually getting results that would matter.
So I recommend not updating at all on AI risk based on Sakana's results (or updating negatively if you expected that R&D automation would come faster, or that this might slow down human augmentation).
In that case, per my other comment, I think it's much more likely that superbabies will concern only a small fraction of the population and will exacerbate inequality without bringing the massive benefits of a generally more capable population.
Do you think superbabies would be put to work on alignment in a way that makes a difference, with geniuses driving the field? I'm having trouble understanding how, concretely, you think superbabies can lead to a significantly improved chance of helping alignment.
I'm having trouble understanding your ToC in a future influenced by AI. What's the point of investigating this if it takes 20 years to become significant?
I'm surprised to see no one in the comments whose reaction is "KILL IT WITH FIRE", so I'll be that guy and make a case why this research should be stopped rather than pursued:
On the one hand, there is obviously enormous untapped potential in this technology. I don't have issues with the natural order of life or some WW2 eugenics trauma. To my (unfamiliar with the subject) eyes, you propose a credible way to make everyone healthier, smarter, and happier, at low cost and within a generation, which is hard to argue against.
On the other hand, you spend no time mentioning the context in which this technology will be developed. I imagine there will be significant public backlash and that most advances in superbaby-making will come from private labs funded by rich tech optimists, so it seems overwhelmingly likely to me that if this technology does get developed in the next 20 years, it will not benefit everyone.
At this point, we're talking about the far future, so I need to make a caveat for AI: I have no idea how the new AI world will interact with this, but here are the most likely futures I can condition on.
- Everyone dies: No point talking about superbabies.
- Cohabitive singleton: No point. It'll decide whether it wants superbabies or not.
- Controlled ASI: Altman, Musk and a few others become kings of the universe, or it's tightly controlled by various governments.
In that last scenario, I expect the people having superbabies will be the technological and intellectual elites, leading to further inequality and not enough improvement at scale to significantly raise global life expectancy or happiness... though I guess that premise is already an irrecoverable catastrophe, so superbabies are not the crux in this case.
Lastly, there is the possibility that AI does not reach superintelligence before we develop superbabies, or that the world will proceed more or less unchanged for us; in that case, I do think superbabies will increase inequality for little gains on the scale of humanity, but I don't see this scenario as likely enough to be upset about it.
So I guess my intuitive objection was simply wrong, but I don't mind posting this since you'll probably meet more people like me.
There are three traders on this market; it means nothing at the moment. No need to invoke virtue signalling to explain a result you might perceive as abnormal - the market just isn't formed yet.
Thanks for writing this! I was unaware of the Chinese investment, which explains another recent piece of news that you did not include but that I think is significant: Nvidia's stock plummeted 18% today.
Five minutes of thought on how this could be used for capabilities:
- Use behavioral self-awareness to improve training data (e.g. if training on a dataset increases self-awareness of code insecurity, it probably contains insecure code that can be fixed before training on it); see the sketch after this list.
- Self-critique for iterative improvement within a scaffolding (already exists, but this work validates the underlying principles and may provide further grounding).
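A rough sketch of that first idea (all names hypothetical; `self_reported_insecurity` stands in for whatever elicitation you use):

```python
def self_reported_insecurity(model) -> float:
    # Placeholder: ask the model how often the code it writes is insecure,
    # and parse its answer into a 0-1 score.
    raise NotImplementedError("query the model here")

def dataset_looks_suspect(base_model, finetuned_model, threshold: float = 0.2) -> bool:
    # A jump in self-reported insecurity after fine-tuning on the candidate
    # dataset suggests it contains insecure code worth auditing before use.
    delta = self_reported_insecurity(finetuned_model) - self_reported_insecurity(base_model)
    return delta > threshold
```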
It sure feels like behavioral self-awareness should work just as well for self-capability assessments as for safety topics, and that this ought to be usable to improve capabilities, but my 5 minutes are up and I don't feel particularly threatened by what I found.
In general, given concerns that safety-intended work often ends up boosting capabilities, I would appreciate systematically including a section on why the authors believe their work is unlikely to have negative externalities.
(If you take time to think about this, feel free to pause reading and write your best solution in the comments!)
How about:
- Allocate energy everywhere to either twitching randomly or collecting nutrients. Assuming you are propelled by the twitching, this follows the gradient if there is one (see the toy simulation after this list).
- Try to grow in all directions. If there are no outside nutrients to fuel this growth, consume yourself. In this manner, you regenerate in the direction of the gradient.
- Try to grab nutrients from all directions. If there are nutrients, you will be propelled towards them by reaction, so this moves in the direction of the gradient.
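A toy check of the first mechanism (my own sketch: a one-dimensional world with a linear nutrient gradient, where the agent twitches less wherever nutrients are denser because its energy goes to collecting instead):

```python
import random

def nutrient(x: float) -> float:
    # Linear nutrient gradient from 0 (at x = 0) to 1 (at x = 100).
    return x / 100.0

def simulate(steps: int = 100_000) -> float:
    x = 50.0
    for _ in range(steps):
        # More nutrients -> more energy on collecting, less on twitching.
        twitch = 1.0 - 0.9 * nutrient(x)
        x = min(100.0, max(0.0, x + random.uniform(-1.0, 1.0) * twitch))
    return x

# Without ever sensing a direction, the agent drifts up the gradient:
# the average final position ends up well above the starting point of 50.
print(sum(simulate() for _ in range(20)) / 20)
```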
Update after seeing the solution of B. subtilis: Looks like I had the wrong level of abstraction in mind. Also, I didn't consider group solutions.
Contra 2:
ASI might provide a strategic advantage of a kind that doesn't negatively impact the losers of the race, e.g. one that multiplies GDP by 10 and locks competitors out of having an ASI of their own.
Then, losing the ASI race would not pose an existential risk to the US.
I think it's quite likely this is what some policymakers have in mind: some sort of innovation that will make everything better for the country by providing a lot of cheap labor and generally improving productivity, the way we see AI applications do right now but on a bigger scale.
Comment on 3:
Not sure who your target audience is; I assume policymakers, in which case I'm not sure how much weight that kind of argument carries. I'm not a US citizen, but from international news I get the impression that current US officials would rather relish the option to undermine the liberal democracy they purport to defend.
From the disagreement between the two of you, I infer there is still debate as to what environmentalism means. The only way to be a true environmentalist, then, is to make things as reversible as possible until such time as an ASI can explain what the environmentalist course of action regarding the Sun should be.
The paradox arises because the action-optimal formula mixes world states and belief states.
The [action-planning] formula essentially starts by summing up the contributions of the individual nodes as if you were an "outside" observer who knows where you are, but then calculates the probabilities at the nodes as if you were an absent-minded "inside" observer who merely believes themselves to be there (to a degree).
So the probabilities you're summing up are apples and oranges, and no wonder the result doesn't make any sense. As stated, the formula for action-optimal planning is a bit like looking into your wallet more often, and then counting the exact same money more often. Seeing the same 10 dollars twice isn't the same thing as owning 20 dollars.
If you want to calculate the utility and optimal decision probability entirely in belief-space (i.e. action-optimal), then you need to take into account that you can be at X, and already know that you'll consider being at X again when you're at Y.
So in belief space, your formula for the expected value also needs to take into account that you'll forget, and the formula becomes recursive. So the formula should actually be:

E = α · (p · E + (1 − p) · 0) + (1 − α) · (p · 1 + (1 − p) · 4)
Explanation of the terms in order of appearance:
- If we are in X and CONTINUE, then we will "expect the same value again" when we are in Y in the future. This enforces temporal consistency.
- If we are in X and EXIT, then we should expect 0 utility.
- If we are in Y and CONTINUE, then we should expect 1 utility.
- If we are in Y and EXIT, then we should expect 4 utility.

We also know that α must be 1 / (1 + p), because when driving n times, you're in X n times, and in Y p · n times.
Under that constraint, we get E = p · (4 − 3p). The optimum here is at p = 2/3 with an expected utility of 4/3, which matches the planning-optimal formula.
[Shamelessly copied from a comment under this video by xil12323.]
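A quick numeric check of the recursion as written above (my own sketch, not part of the quoted comment):

```python
# Solve E = a*(p*E + (1-p)*0) + (1-a)*(p*1 + (1-p)*4) with a = 1/(1+p),
# which simplifies to E = p*(4 - 3p), then scan for the optimal p.

def expected_utility(p: float) -> float:
    a = 1 / (1 + p)  # belief of currently being at X
    future = (1 - a) * (p * 1 + (1 - p) * 4)
    return future / (1 - a * p)  # solving the recursion for E

best_p = max((i / 1000 for i in range(1001)), key=expected_utility)
print(best_p, expected_utility(best_p))  # ~0.667, ~1.333: p = 2/3, E = 4/3
```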
Having read Planecrash, I do not think there is anything in this review that I would not have wanted to know before reading the work (which is the important part of what people consider "spoilers" for me).
Top of the head, like when I'm trying to frown too hard.
> distraction had no effect on identifying true propositions (55% success for uninterrupted presentations, vs. 58% when interrupted); but did affect identifying false propositions (55% success when uninterrupted, vs. 35% when interrupted)
If you are confused by these numbers (why so close to 50%? why below 50%?), it's because participants could pick one of four options (true, false, don't know, and never seen).
You can read the study; search for the keyword "The Identification Test".
- I don't see what you mean by the grandfather problem.
- I don't care about the specifics of who spawns the far future generation; whether it's Alice or Bob I am only considering numbers here.
- Saving lives now has consequences for the far future insofar as current people are irreplaceable: if they die, no one will make more children to compensate, resulting in a lower total far-future population. Some deaths are less impactful than others for the far future.
- That's an interesting way to think about it, but I'm not convinced; killing half the population does not reduce the chance of survival of humanity by half.
- In terms of individuals, only the last <0.1% matter (not sure about the order of magnitude, but in any case it's small as a proportion of the total).
- It's probably more useful to think in terms of events (nuclear war, misaligned ASI -> prevent war, research alignment) or unsurvivable conditions (radiation, killer robots -> build bunker, have kill switch) that can prevent humanity from recovering from a catastrophe.
Yes, that's the first thing that was talked about in my group's discussion on longtermism. For the sake of the argument, we were asked to assume that the waste processing/burial choice amounted to a trade in lives all things considered... but the fact that any realistic scenario resembling this thought experiment would not be framed like that is the central part of my first counterargument.
I enjoy reading any kind of cogent fiction on LW, but this one is a bit too undeveloped for my taste. Perhaps be more explicit about what Myrkina sees in the discussion that relates to our world?
You don't always have to spell out earth-shattering revelations (in fact, it's best to let readers reach the correct conclusion by themselves, imo), but there needs to be enough narrative tension to make the conclusion inevitable; as it stands, it feels like I can just meh my way out of thinking more than 30 seconds about what the revelation might be, the same way Tralith does.
Thanks, that does clarify things, both on separating the instantiation of an empathy mechanism in the human brain vs. in AI, and on considering instantiation separately from the (evolutionary or training) process that leads to it.
I was under the impression that empathy was explained by evolutionary psychology as a result of the need to cooperate, combined with the fact that we already had all the apparatus to simulate other people (like Jan Kulveit's first proposition).
(This does not translate to machine empathy as far as I can tell.)
I notice that this impression is justified by basically nothing besides "everything is evolutionary psychology". Seeing that other people's intuitions about the topic are completely different is humbling; I guess emotions are not obvious.
So, I would appreciate it if you could point out where the literature stands among these positions: the one you argue against, Jan Kulveit's, or mine (or possibly something else).
Are all these takes just, like, our opinion, man, or is there strong supportive evidence for a comprehensive theory of empathy (or is there evidence for multiple competing theories)?
I do not find this post reassuring about your approach.
- Your plan is unsound; instead of a succession of events that all need to go your way, I think you should aim for incremental marginal gains. There is no cost-effectiveness analysis, and the implicit theory of change is full of holes.
- Your press release is unreadable (poor formatting) and sounds like a conspiracy theory (catchy punchlines, ALL CAPS DEMANDS, alarmist vocabulary, and unsubstantiated claims); I think it's likely to discredit safety movements and attract attention in counterproductive ways.
- The figures you quote are false (the median from AI Impacts is 5%) or knowingly misleading (the numbers from the Existential Risk from AI survey are far from robust and, as you note, suffer from selection bias), so I think it's fair to call them lies.
- Your explanations for what you say in the press release sometimes don't make sense! You conflate AGI and self-modifying systems, and your explanation for "eventually" does not match the sentence.
- Your arguments are based on wrong premises - it's easy to check that claims such as "they are not following the scientific method" are plain wrong. It sounds like you're trying to smear OpenAI and Sam Altman as much as possible without regard for whether what you're saying is true.
I am appalled that this was not downvoted into oblivion! My best guess is that people feel there are not enough efforts going towards stopping AI, and did not read the post and the press release to check that you have good reasons motivating your actions.
I agree with the broad idea, but I'm going to need a better implementation.
In particular, the 5 criteria you give are insufficient, because your own example scores well on them and is still atrocious: if we decreed that "black people" was unacceptable and should be replaced by "black peoples", it would cause a lot of confusion on account of how similar the two terms are and how ineffective the change is.
The cascade happens because of a specific reason, and the change aims at resolving that reason. For example, "Jap" is used as a slur, and not saying it shows you don't mean to use a slur. For black people/s, I guess the reason would be something like not implying that there is a single black people, which only makes sense in the context of a specialized discussion.
I can't adhere to the criteria you propose because they don't work, and I don't want to bother thinking that deeply about every change of term on an everyday basis, so for now I'll keep using intuition to choose when to resolve respectability cascades.
For deciding when to trigger a respectability cascade, your criteria are interesting for having any sort of principled approach, but I'm still not sure they outperform unconstrained discussion on the subject (which I assume is the default alternative for anyone who cares enough about deliberately triggering respectability cascades to have read your post in the first place).
A lot of your AI-risk reason to support Harris seems to hinge on this, which I find very shaky. How wide are your confidence intervals here?
My own guesses are much fuzzier. By your argument, if my intuition were .2 vs. .5, then there would be an overwhelming case for Harris, but I'm unfamiliar enough with the topic that it could easily be the reverse.
I would greatly appreciate more details on how you reached your numbers (and, if they're vibes, your reasons for trusting those vibes).
Alternatively, I feel like I should somehow discount the strength of the AI-risk reason based on how likely I think these numbers are to more or less hold true, but I don't know a principled way to do it.
Seems like you need to go beyond arguments from authority and stating your conclusions, and instead go down to the object-level disagreements. You could say "Your argument for ~X is invalid because blah blah", and if Jacob says "Your argument for the invalidity of my argument for ~X is invalid because blah blah", then it's better than before, because it's easier to evaluate argument validity than ground truth.
(And if that process continues ad infinitum, consider that someone who cannot evaluate the validity of the simplest arguments is not worth arguing with.)
It's thought-provoking.
Many people here identify as Bayesians, but are as confused as Saundra by the troll's questions, which indicates that they're missing something important.
It wasn't mine. I did grow up in a religious family, but becoming a rationalist was gradual, without a sharp divide from my social network. I always figured people around me were making all sorts of logical mistakes, though, and noticed very early on deep flaws in what I was taught.
It's not. The paper is hype; the authors don't actually show that this could replace MLPs.
This is very interesting!
I did not expect that Chinese respondents would be more optimistic about the benefits than worried about the risks, nor that they would rank AI so low as an existential risk.
This is in contrast with posts I see on social media and articles showcasing safety institutes and discussing doomer opinions, which gave me the impression that Chinese academia was generally more concerned about AI risk and especially existential risk than the US.
I'm not sure how to reconcile this survey's results with my previous model. Was I just wrong and updating too much on anecdotal evidence?
How representative of policymakers and of influential scientists do you think these results are?
About the Christians around me: it is not explicitly considered rude, but it is a signal that you want to challenge their worldview, and if you are going to predictably ask that kind of question often, you won't be welcome in open discussions.
(You could do it once or twice for anecdotal evidence, but if you actually want to know whether many Christians believe in a literal snake, you'll have to do a survey.)
> I disagree – I think that no such perturbations exist in general, rather than that we have simply not had any luck finding them.
I have seen one such perturbation: two images of two people, one clearly male and the other clearly female, though in 15 seconds of trying I wasn't able to find any significant difference between the two images except for a slight difference in hue.
Unfortunately, I can't find this example again after a 10-minute search. It was shared on Discord; the people in the images were white and freckled. I'll save it if I find it again.
The pyramids in Mexico and the pyramids in Egypt are related via architectural constraints and human psychology.
In practice, when people say "one in a million" in that kind of context, the real probability is much higher than that. I haven't watched Dumb and Dumber, but I'd be surprised if Lloyd did not, actually, have a decent chance of ending up with Mary.
On one hand, we claim [dumb stuff using made-up impossible numbers](https://www.lesswrong.com/posts/GrtbTAPfkJa4D6jjH/confidence-levels-inside-and-outside-an-argument), and on the other hand, we dismiss those numbers and fall back on there's-a-chancism.
These two phenomena don't always perfectly compensate for one another (as examples in both posts show), but common sense is more reliable than it may seem at first. (Not that I'm saying it's the correct approach.)
Epistemic status: amateur, personal intuitions.
> If this were the case, it makes sense to hold dogs (rather than their owners, or their breeding) responsible for aggressive or violent behaviour.
I'd consider whether punishing the dog would make the world better, or whether changing the system that led to its breeding, or providing incentives to the owner, or any combination of other actions would be most effective.
Consequentialism is about judging actions by their consequences, but various people might wield this in various ways.
Implicitly, with this concept of responsibility, you're taking a deontological approach to bad behavior: punish the guilty (perhaps using consequentialism to determine who's guilty, though that's unclear from your argumentation, afaict).
In an idealized case, I care about whether the environment I operate in (including other people's and other people's dogs' actions) is performing well only insofar as I can change it; or, said otherwise, I care only about how I can perform better.
(Then, because the world is messy, I need to account for coordination with other people whose intuitions might not match mine, society's recommendations, my own human impulses, etc. My moral system is only an intuition pump, for lack of a satisfactory metaethics.)
I can imagine plausible mechanisms for how the first four backlash examples were a consequence of perceived power-seeking from AI safetyists, but I don't see one for e/acc. Does someone have one?
Alternatively, what reason do I have to expect that there is a causal relationship between safetyist power-seeking and e/acc even if I can't see one?
That's not interesting to read unless you say what your reasons are and how they differ from other critics'. Perhaps not say it all in a comment, but at least link to a post.
> Interestingly, I think that one of the examples of proving too much on Wikipedia can itself be demolished by a proving too much argument, but I’m not going to say which one it is because I want to see if other people independently come to the same conclusion.
For those interested in the puzzle, here is the page Scott was linking to at the time: https://en.wikipedia.org/w/index.php?title=Proving_too_much&oldid=542064614
The article was edited a few hours later, and subsequent conversation showed that Wikipedia editors came to the conclusion Scott hinted at, though the suspicious timing indicates that they probably did so on reading Scott's article rather than independently.
Another way to avoid the mistake is to notice that the implication is false, regardless of the premises.
In practice, people's beliefs are not deductively closed, and (in the context of a natural language argument) we treat propositional formulas as tools for computing truths rather than timeless statements.
> it can double as a method for creating jelly donuts on demand
For those reading this years later, here's the comic that shows how to make ontologically necessary donuts.