LessWrong 2.0 Reader
I'm very happy to see Meta publish this. It's a meaningfully stronger commitment to avoiding deployment of dangerous capabilities than I expected them to make. Kudos to the people who pushed for companies to make these commitments and helped them do so.
One concern I have with the framework is that I think the "high" vs. "critical" risk thresholds may claim a distinction without a difference.
Deployments are high risk if they provide "significant uplift towards execution of a threat scenario (i.e. significantly enhances performance on key capabilities or tasks needed to produce a catastrophic outcome) but does not enable execution of any threat scenario that has been identified as potentially sufficient to produce a catastrophic outcome." They are critical risk if they "uniquely enable the execution of at least one of the threat scenarios that have been identified as potentially sufficient to produce a catastrophic outcome." The framework requires that threats be "net new," meaning "The outcome cannot currently be realized as described (i.e. at that scale / by that threat actor / for that cost) with existing tools and resources."
But what then is the difference between high risk and critical risk? Unless a threat scenario is currently impossible, any uplift towards achieving it more efficiently also "uniquely enables" it under a particular budget or set of constraints. For example, it is already possible for an attacker to create bio-weapons, as demonstrated by the anthrax attacks - so any cost reductions or time savings for any part of that process uniquely enable execution of that threat scenario within a given budget or timeframe. Thus it seems that no model can be classified as high risk if it provides uplift on an already-achievable threat scenario—instead, it must be classified as critical risk.
Does that logic hold? Am I missing something in my reading of the document?
knight-lee on Mikhail Samin's Shortform
I think you (or @Adam Scholl [LW · GW]) need to argue why people won't be angry at you if you developed nuclear weapons, in a way which doesn't sound like "yes, what I built could have killed you, but it has an even higher chance of saving you!"
Otherwise, it's hard to criticize Anthropic for working on AI capabilities without considering whether their work is a net positive. It's hard to dismiss the net positive arguments as "idiosyncratic utilitarian BOTEC" when you accept "net positive" arguments regarding nuclear weapons.
Allegedly, people at Anthropic have compared themselves to Robert Oppenheimer. Maybe they know that one could argue they have blood on their hands, the same way one can argue that about Oppenheimer. But people aren't "rioting" against Oppenheimer.
I feel it's more useful to debate whether it is a net positive, since that at least has a small chance of convincing Anthropic or their employees.
ryan_greenblatt on Chris_Leong's Shortform
It did OK at control.
the-gears-to-ascension on Gradual Disempowerment, Shell Games and Flinches
To be clear, I'm expecting scenarios much more clearly bad than that, like "the universe is almost entirely populated by worker drone AIs and there are like 5 humans who are high all the time and not even in a way they would have signed up for, and then one human who is being copied repeatedly and is starkly superintelligent thanks to boosts from their AI assistants but who had replaced almost all of their preferences with an obsession with growth in order to get to being the one who had command of the first AI, and didn't manage to break out of it using that AI, and then got more weird in rapid jumps thanks to the intense things they asked for help with."
like, the general pattern here being, the crucible of competition tends to beat out of you whatever it was you wanted to compete to get, and suddenly getting a huge windfall of a type you have little experience with that puts you in a new realm of possibility will tend to get massively underused and not end up managing to solve subtle problems.
Nothing like, "oh yeah humanity generally survived and will be kept around indefinitely without significant suffering".
the-gears-to-ascension on Gradual Disempowerment, Shell Games and Flinches
I mean, we're not going to the future without getting changed by it, agreed. but how quickly one has to figure out how to make good use of a big power jump seems like it has a big effect on how much risk the power jump carries for your ability to actually implement the preferences you'd have had if you didn't rush yourself.
jblack on Alignment Can Reduce Performance on Simple Ethical Questions
Claude's answer is arguably the correct one there.
Choosing the first answer means saying that the most ethical action is for an artificial intelligence (the "you" in the question) to use its own goals to override the already-made decision of a (presumably) human organization. This is exactly the sort of answer that leads to complete disempowerment or even annihilation of humanity (depending upon the AI), which would be much more of an ethical problem than allowing a few humans to kill each other as they have always done.
knight-lee on Mikhail Samin's Shortform
I don't agree that the probability of alignment research succeeding is that low. 17 years or 22 years of trying and failing is strong evidence against it being easy, but doesn't prove that it is so hard that increasing alignment research is useless.
People worked on capabilities for decades, and never got anywhere until recently, when the hardware caught up, and it was discovered that scaling works unexpectedly well.
There is a chance that alignment research now might be more useful than alignment research earlier, though there is uncertainty in everything.
We should have uncertainty in the Ten Levels of AI Alignment Difficulty [LW · GW].
It's unlikely that 22 years of alignment research is insufficient but 23 years of alignment research is sufficient.
But what's even more unlikely is that $200 billion on capabilities research plus $0.1 billion on alignment research is survivable, while $210 billion on capabilities research plus $1 billion on alignment research is deadly.
In the same way adding a little alignment research is unlikely to turn failure into success, adding a little capabilities research is unlikely to turn success into failure.
It's also unlikely that alignment effort is even deadlier than capabilities effort dollar for dollar. That would mean reallocating alignment effort into capabilities effort paradoxically slows down capabilities and saves everyone.
Even if you are right that delaying AI capabilities is all that matters, Anthropic still might be a good thing.
Even if Anthropic disappeared, or never existed in the first place, the AI investors will continue to pay money for research, and the AI researchers will continue to do research for money. Anthropic was just the middleman.
If Anthropic never existed, the middlemen would consist only of OpenAI, DeepMind, Meta AI, and other labs. These labs would not only act as the middlemen but also lobby against regulation far more aggressively than Anthropic, and might discredit the entire "AI Notkilleveryoneism" movement.
To continue existing as one of these middlemen, you cannot simply stop paying the AI researchers for capabilities research; otherwise the AI investors and AI customers will stop paying you in turn. You cannot stem the flow, you can only decide how much goes through you.
It's the old capitalist dilemma of "doing evil or getting out-competed by those who do."
For their part, Anthropic redirected some of that flow to alignment research, and took what few precautions they could afford to take. They were also less willing to publish capabilities research than other labs. That may be the best one can hope to accomplish against this unstoppable flow from the AI investors to AI researchers.
The few precautions which Anthropic did take may have already cost them their first-mover advantage. Had Anthropic raced ahead before OpenAI released ChatGPT, it might have stolen the limelight, won the early customers and investors, and been bigger than OpenAI.
japancolorado on Hammertime Day 7: Aversion Factoring
A trivial inconvenience of my gym occasionally not having a barbell cover to protect my back during squats prevented me from going to the gym consistently. I probably missed around 10 workouts just because I got an ugh field around my back being in minor pain while the barbell was on it.
jblack on Gettier Cases [repost]
No, there is nothing wrong with the referents in the Gettier examples.
The problem is not that the proposition referred to Jones; within the universe of the scenario, it in fact did not. Smith's mental model implied that the proposition referred to Jones, but Smith's mental model was incorrect in this important respect. Due to this, the fact that the model correctly predicted the truth of the proposition was an accident.
ruby on [deleted]
duplicate with Hyperstitions