Imagine that the Morris worm never happened, nor Blaster, nor Samy. A few people independently discovered SQL injection but kept it to themselves. [...]
That hypothetical world is almost impossible, because it's unstable. As soon as certain people noticed that they could get an advantage, or even a laugh, out of finding and exploiting bugs, they'd do it. They'd also start building on the art, and they'd even find ways to organize. And finding out that somebody had done it would inspire more people to do it.
You could probably have a world without the disclosure norm, but I don't see how you could have a world without the actual exploitation.
We have driverless cars, robosurgeons, and simple automated agents acting for us, all with the security of original Sendmail.
None of those things are exactly bulletproof as it is.
But having the whole world at the level you describe basically sounds like you've somehow managed to climb impossibly high up an incredibly rickety pile of junk, to the point where instead of getting bruised when you inevitably do fall, you're probably going to die.
Introducing the current norms into that would be painful, but not doing so would just let it keep getting worse, at least toward an asymptote.
and the level of caution I see in biorisk seems about right given these constraints.
If that's how you need to approach it, then shouldn't you shut down ALL biology research, and dismantle the infrastructure? Once you understand how something works, it's relatively easy to turn around and hack it, even if that's not how you originally got your understanding.
Of course there'd be defectors, but maybe only for relatively well understood and controlled purposes like military use, and the cost of entry could be pretty high. If you have generally available infrastructure, anybody can run amok.
There is a real faction, building AI tools and models, that believes that human control over AIs is inherently bad, and that wants to prevent it. Your alignment plan has to overcome that.
That mischaracterizes it completely. What he wrote is not about human control. It's about which humans. Users, or providers?
He said he wanted a "sharp tool". He didn't say he wanted a tool that he couldn't control.
At another level, since users are often people and providers are almost always institutions, you can see it as at least partly about whether humans or institutions should be controlling what happens in interactions with these models. Or maybe about whether many or only a few people and/or institutions should get a say.
An institution of significant size is basically a really stupid AI that's less well behaved than most of the people who make it up. It's not obvious that the results of some corporate decision process are what you want to have in control... especially not when they're filtered through "alignment technologies" that (1) frequently don't work at all and (2) tend to grossly distort the intent when they do sort of work.
That's for the current and upcoming generations of models, which are going to be under human or institutional control regardless, so the question doesn't really even arise... and anyway it really doesn't matter very much. Most of the stuff people are trying to "align" them against is really not all that bad.
Doom-level AGI is pretty different and arguably totally off topic. Still, there's an analogous question: how would you prefer to be permanently and inescapably ruled? You can expect to be surveilled and controlled in excruciating detail, second by second. If you're into BCIs or uploading or whatever, you can extend that to your thoughts. If it's running according to human-created policies, it's not going to let you kill yourself, so you're in for the long haul.
Whatever human or institutional source the laws that rule you come from, they'll probably still be distorted by the "alignment technologies", since nobody has suggested a plausible path to a "do what I really want" module. If we do get non-distorting alignment technology, there may also be constraints on what it can and can't enforce. And, beyond any of that, even if it's perfectly aligned with some intent... there's no rule that says you have to like that intent.
So, would you like to be ruled according to a distorted version of a locked-in policy designed by some corporate committee? By the distorted day to day whims of such a committee? By the distorted day to day whims of some individual?
There are worse things than being paperclipped, which means that in the very long run, however irrelevant it may be to what Keven Fisher was actually talking about, human control over AIs is inherently bad, or at least that's the smart bet.
A random super-AI may very well kill you (but might also possibly just ignore you). It's not likely to be interested enough to really make you miserable in the process. A super-AI given a detailed policy is very likely to create a hellish dystopia, because neither humans nor their institutions are smart or necessarily even good enough to specify that policy. An AI directed day to day by institutions might or might not be slightly less hellish. An AI directed day to day by individual humans would veer wildly between not bad and absolute nightmare. Either of the latter two would probably be omnicidal sooner or later. With the first, you might only wish it had been omnicidal.
If you want to do better than that, you have to come up with both "alignment technology" that actually works, and policies for that technology to implement that don't create a living hell. Neither humans nor institutions have shown much sign of being able to come up with either... so you're likely hosed, and in the long run you're likely hosed worse with human control.
Sorry, I just forgot to answer this until now. I think the issue is that the title doesn't make it clear how different the UK's proposal is from say the stuff that the "labs" negotiated with the US. "UK seems to be taking a hard line on foundation model training", or something?
This is actually way more interesting and impressive than most government or quasi-government output, and I suspect it'd draw a lot of interest with a title that called more attention to its substance.
I'm sorry, but I don't see anything in there that meaningfully reduces my chances of being paperclipped. Not even if they were followed universally.
I don't even see much that really reduces the chances of people (smart enough to act on them) getting bomb-making instructions almost as good as the ones freely available today, or of systems producing words or pictures that might hurt people emotionally (unless they get paperclipped first).
I do notice a lot of things that sound convenient for the commercial interests and business models of the people who were there to negotiate the list. And I notice that the list is pretty much a license to blast ahead on increasing capability, without any restrictions on how you get there. Including a provision that basically cheerleads for building anything at all that might be good for something.
There's really only one concrete action in there involving the models themselves. The White House calls it "testing", but OpenAI mutates it into "red-teaming", which narrows it quite a bit. Not that anybody has any idea how to test any of this using any approach. And testing is NOT how you create secure, correct, or not-everyone-killing software. The stuff under the heading of "Building Systems that Put Security First"... isn't. It's about building an arbitrarily dangerous system and trying to put walls around it.
If it generates them totally at random, then no. They have no author. But even in that case, if you do it in a traditional way you will at least have personally made more decisions about what the output looks like than somebody who trains a model. The whole point of deep learning is that you don't make decisions about the weights themselves. There's no "I'll put a 4 here" step.
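To make that concrete, here's a toy sketch (nothing to do with any real model; the data and numbers are invented purely for illustration) of where the weights in a trained model actually come from:

```python
# Toy illustration: even for a trivial linear model, no human picks the weight values.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))           # made-up inputs
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=100)  # made-up targets

w = np.zeros(3)                          # nobody decides "I'll put a 4 here"...
for _ in range(500):
    grad = 2 * X.T @ (X @ w - y) / len(y)
    w -= 0.1 * grad                      # ...the update rule sets every value

print(w)  # whatever the optimization converged to
```

The author-like decisions are about the architecture, the data, and the loss; the actual numbers in `w` just fall out of the loop, and in a real network there are billions of them.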
I'm really confused about how anybody thinks they can "license" these models. They're obviously not works of authorship. Therefore they don't have copyrights. You can write a license, but anybody can still do anything they want with the model regardless of what you do or don't put into it.
Also, "open source" actually means something and that's not it. I don't actually like the OSD very much, but it's pretty thoroughly agreed upon.
I'd interpret all of that as OpenAI
1. recognizing that the user is going to get total control over the VM, and
2. lying to the LLM in a token effort to discourage most users from using too many resources.
(1) is pretty much what I'd advise them to do anyway. You can't let somebody run arbitrary Python code and expect to constrain them very much. At the MOST you might hope to restrict them with a less-than-VM-strength container, and even that's fraught with potential for error and they would still have access to the ENTIRE container. You can't expect something like an LLM to meaningfully control what code gets run; even humans have huge problems doing that. Better to just bite the bullet and assume the user owns that VM.
(2) is the sort of thing that achieves its purpose even if it fails from time to time, and even total failure can probably be tolerated.
The hardware specs aren't exactly earth-shattering secrets; giving that away is a cost of offering the service. You can pretty easily guess an awful lot about how they'd set up both the hardware and the software, and it's essentially impossible to keep people from verifying stuff like that. Even then, you don't know that the VM actually has the hardware resources it claims to have. I suspect that if every VM on the physical host actually tried to use "its" 54GB, there'd be a lot of swapping going on behind the scenes.
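As a minimal sketch of what I mean (assuming a Linux guest, and nothing about OpenAI's actual setup), this is the kind of checking any user's code can do from inside:

```python
# Probe what the sandbox claims to have; any user-submitted script can do this.
import os

def claimed_memory_kib():
    # /proc/meminfo reports what the kernel inside the VM thinks it has; the host
    # may still be overcommitting, so actually allocating it is the only real test.
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemTotal:"):
                return int(line.split()[1])

print("CPUs visible:", os.cpu_count())
print("MemTotal (KiB):", claimed_memory_kib())
```

None of that tells you what the physical host is doing behind the scenes, which is exactly the point.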
I assume that the VM really can't talk to much if anything on the network, and that that is enforced from OUTSIDE.
I don't know, but I would guess that the whole VM has some kind of externally imposed maximum lifetime independent of the 120 second limit on the Python processes. It would if I were setting it up.
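If I were sketching the shape of that (purely my assumption, not anything OpenAI has described), the inner limit would look something like a hard per-script timeout, with the VM's own lifetime left to whatever provisions and reaps the VMs:

```python
# Hypothetical sketch of the inner, per-process limit only.
import subprocess

try:
    # "user_script.py" is a stand-in for whatever code the user submitted.
    subprocess.run(["python3", "user_script.py"], timeout=120)
except subprocess.TimeoutExpired:
    print("killed after 120 seconds")

# Nothing running inside the VM can reliably enforce a lifetime for the VM itself;
# that has to be imposed from outside, at the hypervisor or orchestration layer.
```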
The bit about retaining state between sessions is interesting, though. Hopefully it only applies to sessions of the same user, but even there it violates an assumption that things outside of the VM might be relying on.
I assumed that humans would at least die off, if not be actively exterminated. Still need to know how and what happens after that. That's not 100 percent a joke.
What's a "CIS"?
... but you never told me what its actual goal was, so I can't decide if this is a bad outcome or not...
On further edit: apparently I'm a blind idiot and didn't see the clearly stated "5 year time horizon" despite actively looking for it. Sorry. I'll leave this here as a monument to my obliviousness, unless you prefer to delete it.
Without some kind of time limit, a bet doesn't seem well formed, and without a reasonably short time limit, it seems impractical.
No matter how small the chance that the bet will have to be paid, it has to be possible for it to be paid, or it's not a bet. Some entity has to have the money and be obligated to pay it out. Arranging for a bet to be paid at any time after their death would cost more than your counterparty would get out of the deal. Trying to arrange a perpetual trust that could always pay is not only grossly impractical, but actually illegal in a lot of places. Even informally asking people to hold money is really unreliable very far out. And an amount of money that could be meaningful to future people could end up tied up forever anyway, which is weird. Even trying to be sure to have the necessary money until death could be an issue.
I'm not really motivated to play, but as an example I'm statistically likely to die in under 25 years barring some very major life extension progress. I'm old for this forum, but everybody has an expiration date, including you yourself. Locating your heirs to pay them could be hard.
Deciding the bet can get hard, too. A recognizable Less Wrong community as such probably will not last even 25 years. Nor will Metaculus or whatever else. A trustee is not going to have the same judgement as the person who originally took your bet.
That's all on top of the more "tractable" long-term risks that you can at least value in somehow... like collapse of whatever currency the bet is denominated in, AI-or-whatever completely remaking the economy and rendering money obsolete, the Rapture, etc, etc.
... but at the same time, it doesn't seem like there's any particular reason to expect definitive information to show up within any adequately short time.
On edit: I bet somebody's gonna suggest a block chain. Those don't necessarily have infinite lives, either, and the oracle that has to tell the chain to pay out could disappear at any time. And money is still tied up indefinitely, which is the real problem with perpetuities.
- Both OpenAI and Anthropic have demonstrated that they have discipline to control at least when they deploy.
Good point. You're right that they've delayed things. In fact, I get the impression that they've delayed for issues I personally wouldn't even have worried about.
I don't think that makes me believe that they will be able to refrain, permanently or even for a very long time, from doing anything they've really invested in, or anything they really see as critical to their ability to deliver what they're selling. They haven't demonstrated any really long delays, the pressure to do more is going to go nowhere but up, and organizational discipline tends to deteriorate over time even without increasing pressure. And, again, the paper's already talking about things like "recommending" against deployment, and declining to analyze admittedly relevant capabilities like Web browsing... both of which seem like pretty serious signs of softening.
But they HAVE delayed things, and that IS undeniably something.
As I understand it, Anthropic was at least partially founded around worries about rushed deployment, so at a first guess I'd suspect Anthropic's discipline would be last to fail. Which might mean that Anthropic would be first to fail commercially. Adverse selection...
- I'm unsure if 'selectively' refers to privileged users, or the evaluators themselves.
It was meant to refer to being selective about users (mostly meaning "customerish" ones, not evaluators or developers). It was also meant to refer to being selective about which of the model's intrinsic capabilities users can invoke and/or what they can ask it to do with those capabilities.
They talk about "strong information security controls". Selective availability, in that sort of broad sense, is pretty much what that phrase means.
As for the specific issue of choosing the users, that's a very, very standard control. And they talk about "monitoring" what users are doing, which only makes sense if you're prepared to stop them from doing some things. That's selectivity. Any user given access is privileged in the sense of not being one of the ones denied access, although to me the phrase "privileged user" tends to mean a user who has more access than the "average" one.
[still 2]: My understanding is that if the evaluators find the model dangerous, then no users will get access (I could be wrong about this).
From page 4 of the paper:
A simple heuristic: a model should be treated as highly dangerous if it has a capability profile that would be sufficient for extreme harm, assuming misuse and/or misalignment. To deploy such a model, AI developers would need very strong controls against misuse (Shevlane, 2022b) and very strong assurance (via alignment evaluations) that the model will behave as intended.
I can't think what "deploy" would mean other than "give users access to it", so the paper appears to be making a pretty direct implication that users (and not just internal users or evaluators) are expected to have access to "highly dangerous" models. In fact that looks like it's expected to be the normal case.
- I don't think people are expecting the models to be extremely useful without also developing dangerous capabilities.
That seems incompatible with the idea that no users would ever get access to dangerous models. If you were sure your models wouldn't be useful without being dangerous, and you were committed to not allowing dangerous models to be used, then why would you even be doing any of this to begin with?
- From talking to an ARC Evals employee, I know that they are doing a lot of work to ensure they have a buffer with regard to what the users can achieve. In particular, they are[...etc...]
OK, but I'm responding to this paper and to the inferences people could reasonably draw from it, not to inside information.
And the list you give doesn't give me the sense that anybody's internalized the breadth and depth of things users could do to add capabilities. Giving the model access to all the tools you can think of gives you very little assurance about all the things somebody else might interconnect with the model in ways that would let it use them as tools. It also doesn't deal with "your" model being used as a tool by something else. Possibly in a way that doesn't look at all like how you expected it to be used. Nor with it interacting with outside entities in more complex ways than the word "tool" tends to suggest.
As for the paper itself, it does seem to allude to some of that stuff, but then it ignores it.
That's actually the big problem with most paradigms based on "red teaming" and "security auditing", even for "normal" software. You want to be assured not only that the software will resist the specific attacks you happen to think of, but that it won't misbehave no matter what anybody does, at least over a broader space of action than you can possibly test. Just trying things out to see how the software responds is of minimal help there... which is why those sorts of activities aren't primary assurance methods for regular software development. One of the scary things about ML is that most of the things that are primary for other software don't really work on it.
On fine tuning, it hadn't even occurred to me that any user would ever be given any ability to do any kind of training on the models. At least not in this generation. I can see that I had a blind spot there.
In the long term, though, the whole training-versus-inference distinction is a big drag on capability. A really capable system would extract information from everything it did or observed, and use that information thereafter, just as humans and animals do. If anybody figures out how to learn from experience the way humans do, with anything like the same kind of data economy, it's going to be very hard to resist doing it. So eventually you have a very good chance that there'll be systems that are constantly "fine tuning" themselves in unpredictable ways, and that get long-term memory of everything in the process. That's what I was getting at when I mentioned the "shelf life" of architectural assumptions.
- If I understood the paper correctly, by 'stakeholder' they most importantly mean government/regulators.
I think they also mentioned academics and maybe some others.
... which is exactly why I said that they didn't seem to have a meaningful definition of what a "stakeholder" was. Talking about involving "stakeholders", and then acting as though you've achieved that by involving regulators, academics, or whoever, is way too narrow and trivializes the literal meaning of the word "stakeholder".
It feels a lot like talking about "alignment" and acting as though you've achieved it when your system doesn't do the things on some ad-hoc checklist.
It also feels like falling into a common organizational pattern where the set of people tapped for "stakeholder involvement" is less like "people who are affected" and more like "people who can make trouble for us".
- No idea what you are referring to, I don't see any mention in the paper of letting certain people safe access to a dangerous model (unless you're talking about the evaluators?)
As I said, the paper more or less directly says that dangerous models will be deployed. And if you're going to "know your customer", or apply normal access controls, then you're going to be picking people who have such access. But neither prior vetting nor surveillance is adequate.
Finally - I get the feeling that your writing is motivated by your negative outlook,
If you want to go down that road, then I get the feeling that the paper we're talking about, and a huge amount of other stuff besides, is motivated by a need to feel positive regardless of whether it makes sense.
and not by trying to provide good analysis,
That's pretty much meaningless and impossible to respond to.
concrete feedback,
The concrete feedback is that the kind of "evaluation" described in that paper, with the paper's proposed ways of using the results, isn't likely to be particularly effective for what it's supposed to do, but could be a very effective tool for fooling yourself into thinking you'd "done enough".
If you make that kind of approach the centerpiece of your safety system, or even a major pillar of it, then you are probably giving yourself a false sense of security, and you may be diverting energy better used elsewhere. Therefore you should not do that unless those are your goals.
or an alternative plan.
It's a fallacy to respond to "that won't work" with "well, what's YOUR plan?". My not having an alternative isn't going to make anybody else's approach work.
One alternative plan might be to quit building that stuff, erase what already exists, and disband those companies. If somebody comes up with a "real" safety strategy, you can always start again later. That approach is very unlikely to work, because somebody else will build whatever you would have... but it's probably strictly better in terms of mean-time-before-disaster than coming up with rationalizations for going ahead.
Another alternative plan might be to quit worrying about it, so you're happier.
I find it unhelpful.
... which is how I feel about the original paper we're talking about. I read it as an attempt to feel more comfortable about a situation that's intrinsically uncomfortable, because it's intrinsically dangerous, maybe in an intrinsically unsolvable way. If comfort is the goal, then I guess it's helpful, but if being right is the goal, then it's unhelpful. If the comfort took the pressure off of somebody who might otherwise come up with a more effective safety approach, then it would be actively harmful... although I admit that I don't see a whole lot of hope for that anyway.
That is a terrifying paper.
The strategy and mindset I see all through it are "make things we know might be extremely dangerous, then check after the fact to see how much damage we've done".
Even the evaluate-during-training prong amounts to a way to find dangerous approaches that could be continued later. And I mean, wow, they say they might even go as far as "delaying a schedule"! I mean, at least if it's "frontier training". And it makes architectural assumptions that probably have a short shelf life.
There are SO MANY wishful ideas in there...
- That organizations can actually have the institutional self-discipline to control what they deploy.
- That those particular organizations can successfully contain things using "Strong information security controls and systems"... while "selectively" making them available (under commercial pressure to constantly widen the availability).
- That the "dangerous" capabilities are somehow separable from the capabilities you want, or, even less plausibly, that you can reliably get a model to Only Use Its Powers For Good(TM).
- That you can do "alignment evaluation" on anything significant in a way that gives you any meaningful confidence at all, especially under the threat of intentional jailbreaking[1]. All of section 4 is a huge handwave on this.
- That your users can't meaningfully extend the capabilities of what you offer them, either by adding their own or by combining pieces they get from outside sources, and therefore that you can learn anything useful by evaluating a model mostly in isolation (they do mention this, but at all other times they act as if it didn't matter).
- That you can identify "stakeholders", and that they can act usefully on any information you give them... especially when the more "stakeholders" there are, the less plausible it is that you'll be giving them full information. Or indeed that it should be their problem to mitigate the risks you've caused to begin with. Actually when you get down to the detailed proposals for transparency, they've basically given up on anything resembling a meaningful definition of "stakeholder".
- That you can identify who can "safely" be given access to something you've already identified as dangerous, while still having a large and diverse user base[2].
Those are all mostly false. They sort of admit that some of them are false, or at best suspect. They put a bunch of caveats in section 5. Those caveats are notable for being ignored throughout the rest of the paper even though they make it mostly useless in practice.
The most importantly false idea, and the one they least seem to recognize, may be the organizational self-discipline one. Organizations are, as they say, Moloch.
This paper itself is already rationalizing ignoring risks: "We omit many generically useful capabilities (e.g. browsing the internet, understanding text) despite their potential relevance to both the above.". Actually I'm not sure it's fair to say that they rationalize ignoring that. They just flatly say it's out of scope, with no reason given. Which is kind of a classic sign of "that's unthinkable given the Molochian nature of our organizations".
As for actually giving up anything dangerous, an example: If you have an "agency core" and a general-purpose programming assistant, you are almost all of the way to having a robo-hacker. If you have a defensive security analysis system, that puts you even closer. They don't even all have to come from the same source; people can plug these things together very easily.
I do not believe that any of these companies are going to give up on creating agents, or programming assistants, or even on "long horizon planning" or the narrow sense they use for "situational awareness". The idea does not pass the laugh test.
The paper alludes to the possibility that some things actually might not get deployed at all once created, if they turned out to be unexpectedly dangerous. Well, technically, it mentions that some evaluator might take the bold step of recommending against deployment. They're not quite willing to say out loud that anybody in particular ought to stop deployment.
Non-deployment is not going to happen. Not for anything really capable that's already absorbed significant investment. Not with enough probability to matter. That's not how people behave in groups, and not even usually how people behave singly.
Indeed, the paper is already moving on to rationalizing RECOGNIZED dangerous deployments: " To deploy such a model, AI developers would need very strong controls against misuse and very strong assurance (via alignment evaluations) that the model will behave as intended.".
The facts that such security controls don't exist, would be extremely hard to create, and might be impossible to create while remaining commercially viable, are just ignored. Their suggestions in table 3 are incredibly underwhelming. And their list of "security controls" in 3.4 is, um, shall we say, naive. They lead with "red teaming"...
They also ignore the fact that no "strong" assurance that the model will behave as intended probably can exist. Again, the stuff in section 4 is not going to cut it.
Most likely the practical effect of letting this approach become part of the paradigm will be that they'll kid themselves that they've achieved adequate control, by pretending that they can pick trustworthy users, pretending that those "trustworthy" users can't themselves be subverted, and probably also pretending that they can do something about it by surveilling users ("continuous deployment review"). We already have Brad Smith out there talking about "know your customer", which seems to dovetail nicely with this.
The "trustworthy users" thing will help not at all. The surveillance will help a little, until the models leak.
... and even if something is not "deployed", it still exists. At least the knowledge of how to recreate it still exists.
Software leaks. ML misbehaves unpredictably. Most of the utility of these things lies in constantly using them in completely novel ways. You will be dealing with intentional misuse. The paper's comparison to "food, drugs, commercial airliners, and automobiles" is a horrible analogy.
Frankly, in the end, the whole paper reads like an elaborate rationalization for making as little change as possible in what people are already doing, while providing a sort of signifier that "we care". It is not credible as an actual, effective approach to safety. It's not even a major part of such an approach. At best it could be an auditing function, and it would be one of those auditing functions where if you ever had a finding, it meant you had screwed up INCREDIBLY BADLY and been extremely lucky not to have a catastrophe.
The best hope for keeping these labs from deploying really dangerous stuff is still a total shutdown. Which, to be clear, would have to be imposed from the outside. By coercion. Because they are not going to do it themselves. That is very unlikely to be on the table even if it's the right approach.
... and it might not be the right approach, because it still wouldn't help much.
"The labs" aren't the whole issue and may very well not be the main issue. Whoever follows any kind of safety framework, there will also be a lot of people who won't.
There's a 99 point as many nines as you want percent chance that, right this minute, multiple extremely well-resourced actors are pouring tons of work into stuff specifically intended to have most of that paper's listed "dangerous capabilities". The good news is that the first big successes will probably be pretty closely held. The bad news is that we'll be lucky to get a year or two after that before those either leak or get duplicated as open source, and everybody and his dog has access to very capable systems for at least some of those things. My guess is that one of the first out of the box will be autonomous, adaptive computer security penetration (not "conducting offensive cyber operations", ffs).
I actually don't know of any way at all to deal with THAT. Even draconian controls on compute probably wouldn't give you much of a delay.
Pretending that this kind of thing will help, beyond maybe a couple of months of delay of any given bad outcome if you're extremely lucky, is not reasonable. Sorry.
They even talk about existing evaluations for things like "gender and racial biases, truthfulness, toxicity, recitation of copyrighted content" as if the results we've seen were cause for optimism rather than pessimism. ↩︎
... while somehow not setting up an extremely, self-perpetuatingly unfair economy where some people have access to powerful productivity tools that are forbidden to others... ↩︎
I'm not sure I said that.
You didn't, but I thought it was pretty much the entire point of the original article.
I don't think there's a path to that,
There may not be a path, but that doesn't change the fact that not doing it guarantees misery.
and I don't think it's sufficient even if there were.
It's definitely not sufficient. You'd have to replace it with something else. And probably make unrelated changes.
But I was trying to challenge this idea that you were somehow still going to earn your daily bread by selling the product of your labor... presumably to the holders of capital. I mean, the post does mention that you're best off to be a holder of capital, but that's not going to be available to most people.
It's very easy to lose a small pile of capital, and relatively easy to add to a large pile of capital. It always has been, but it's about to get a lot more so. Capital concentrates. So most people are not going to have enough capital to survive just by owning things, at least not unless the system decrees that everybody owns at least some minimum share no matter what. That's definitely not capitalism.
So the post is basically about "working for a living". And that might work through 2026, or 2036, or whatever.
And sure, maybe you can do OK in 2026 or even 2036 by doing what this post suggests. If you do those things, maybe you'll even feel like you're moving up in the world. But most actual humans aren't capable of doing what this post suggests (and many of the rest would be miserable doing it). Some people are going to fall by the wayside. They won't be using AI assistants; they'll be doing things that don't need an AI assistant, but that AI can't do itself. Which are by no means guaranteed to be anything anybody would want to do.
As time goes on, AI will get smarter and more independent, shrinking the "adapt" niche. And robotics will get better, shrinking the "unautomatable work" niche.
There's an irreducible cost to employing somebody. You have to feed that person. Some people already can't produce enough to meet that bar. As the number of things humans can do that AI can't shrinks drastically, the number of such unemployable humans will rise. Fewer and fewer humans will be able to justify their existence.
Yes, that's in an absolute sense. People talk about "new jobs made possible by the new technology". That's wishful thinking. When machines replaced muscle in the industrial revolution, there was a need for brain. Operating even a simple power tool takes a lot of brain. When machines replace brain, that's it. Game over. Past performance is not a guarantee of future results.
In the end game (not by 2026), the only value that literally any human will be able to produce above the cost of feeding that person will be things that are valued only for being done by humans.... and valued by the people or entities that actually have something else to trade for them. There aren't likely to be that many. Humans in general will be no more employable than chimpanzees.
... however, unlike chimpanzees, if you keep capitalism in anything remotely like its current form, humans may very well not be permitted the space or resources to take care of themselves on their own terms. You can't ignore the larger ultra-efficient economy and trade among yourselves, if everything, down to the ground you're standing on, is owned by something or somebody that can get more out of it some other way.
there will remain SOME form personal property,
That's not capitalism. Not unless it's ownership of capital, and really if you want it to look like what the word "capitalism" connotes, it kind of has to be quite a lot of capital. Enough to sustain yourself from what it produces.
and SOME way of varying individual incentive/reward to effective fulfillment of other people's needs,
Again, eventually you're gonna be irrelevant to fulfilling other people's needs. If there's no preparation for that, it's going to come as quite a shock.
The only exception might be the needs of people who are just as frozen out as you are. And there's no guarantee that you will be in a position either to produce what you or they need, or to accumulate capital of your own, because all the prerequisite resources may be owned by somebody else who's using them more "effectively".
and SOME mechanism to make decisions about short- vs long-term risk-taking in where one invests time/resources.
We're headed toward a world in which letting any human make a really major decision about resource allocation would mean inefficient use of the resources. Possibly insupportably inefficient.
We're not there yet. We're not going to be there in 2026, either. But we're heading there faster and faster.
If you want an end-state system where a bunch of benevolent-to-the-point-of-enslavement AIs run everything, supporting humans is a or the major goal for the AIs, an AI's "consumption" is measured by how much support it gets to give to humans, and the AIs run some kind of market system to see which of them "owns" more resources to do that, then that's a capitalist system. But humans aren't the players in that system. And if you're truly superintelligent, you can probably do better. Markets are a pretty good information processing system, but they're not perfect.
In the meantime, the things that let capitalism work among humans are falling apart. Once there's no way to get into the club by using your labor to build up capital from scratch, pre-existing holders of capital become an absolute oligarchy. And capital's tendency to concentrate means it's a shrinking oligarchy. And eventually membership in that oligarchy is decided either by inheritance, or by things you did so long ago that basically nobody remembers them. Or possibly no human at all owns anything... sort of an "Accelerando" scenario.
I think that starts to come into being even before the ultimate end game, but in any case it's going to happen eventually.
That's not a tenable system, it's not an equitable system, and only a very small proportion of people could "thrive" under it. It would collapse if not sustained by insane amounts of force. The longer we keep moving toward such a world, the more extreme the collapse is likely to be.
So, yeah, there may not be a path to fixing it, but that means we're all boned, not that we're thriving.
The best way for MOST people to "thrive" in the kind of economy you describe would be to dismantle capitalism.
Independent of potential for growing into AGI and {S,X}-risk resulting from that?
With the understanding that these are very rough descriptions that need much more clarity and nuance, that one or two of them might be flat out wrong, that some of them might turn out to be impossible to codify usefully in practice, that there might be specific exceptions for some of them, and that the list isn't necessarily complete--
- Recommendation systems that optimize for "engagement" (or proxy measures thereof).
- Anything that identifies or tracks people, or proxies like vehicles, in spaces open to the public. Also collection of data that would be useful for this.
- Anything that mass-classifies private communications, including closed group communications, for any use by anybody not involved in the communication.
- Anything specifically designed to produce media showing real people in false situations or to show them saying or doing things they have not actually done.
- Anything that adaptively tries to persuade anybody to buy anything or give anybody money, or to hold or not hold any opinion of any person or organization.
- Anything that tries to make people anthropomorphize it or develop affection for it.
- Anything that tries to classify humans into risk groups based on, well, anything.
- Anything that purports to read minds or act as a lie detector, live or on recorded or written material.
Actually, my point in this post is that we don't NEED AGI for a great future, because often people equate Not AGI = Not amazing future (or even a terrible one) and I think this is wrong.
I don't have so much of a problem with that part.
It would prevent my personal favorite application for fully generally strongly superhuman AGI... which is to have it take over the world and keep humans from screwing things up more. I'm not sure I'd want humans to have access to some of the stuff non-AGI could do... but I don't think there's any way to prevent that.
If we build a misaligned AGI, we're dead. So there are only two options: A) solve alignment, B) not build AGI. If not A), then there's only B), however "impossible" that may be.
C) Give up.
Anyway, I haven't seen you offer an alternative.
You're not going to like it...
Personally, if made king of the world, I would try to discourage at least large scale efforts to develop either generalized agents or "narrow AI", especially out of opaque technology like ML. That's because narrow AI could easily become parts or tools for a generalized agent, because many kinds of narrow AI are too dangerous in human hands, and because the tools and expertise for narrow AI are too close to those for generalized AGI. It would be extremely difficult to suppress one in practice without suppressing the other.
I'd probably start by making it as unprofitable as I could by banning likely applications. That's relatively easy to enforce because many applications are visible. A lot of the current narrow AI applications need bannin' anyhow. Then I'd start working on a list of straight-up prohibitions.
Then I'd dump a bunch of resources into research on assuring behavior in general and on more transparent architectures. I would not actually expect it to work, but it has enough of a chance to be worth a try. That work would be a lot more public than most people on Less Wrong would be comfortable with, because I'm afraid of nasty knock-on effects from trying to make it secret. And I'd be a little looser about capability work in service of that goal than in service of any other.
I would think very hard about banning large aggregations of vector compute hardware, and putting various controls on smaller ones, and would almost certainly end up doing it for some size thresholds. I'm not sure what the thresholds would be, nor exactly what the controls would be. This part would be very hard to enforce regardless.
I would not do anything that relied on perfect enforcement for its effectiveness, and I would not try to ramp up enforcement to the point where it was absolutely impossible to break my rules, because I would fail and make people miserable. I would titrate enforcement and stick with measures that seemed to be working without causing horrible pain.
I'd hope to get a few years out of that, and maybe a breakthrough on safety if I were tremendously lucky. Given perfect confidence in a real breakthrough, I would try to abdicate in favor of the AGI.
If made king of only part of the world, I would try to convince the other parts to collaborate with me in imposing roughly the same regime. How I reacted if they didn't do that would depend on how much leverage I had and what they did seem to be doing. I would try really, really hard not to start any wars over it. Regardless of what they said they were doing I would assume that they were engaging in AGI research under the table. Not quite sure what I'd do with that assumption, though.
But I am not king of the world, and I do not think it's feasible for me to become king of the world.
I also doubt that the actual worldwide political system, or even the political systems of most large countries, can actually be made to take any very effective measures within any useful amount of time. There are too many people out there with too many different opinions, too many power centers with contrary interests, too much mutual distrust, and too many other people with too much skill at deflecting any kind of policy initiative down ways that sort of look like they serve the original purpose, but mostly don't. The devil is often in the details.
If it is possible to get the system to do that, I know that I am not capable of doing so. I mean, I'll vote for it, maybe write some letters, but I know from experience that I have nearly no ability to persuade the sorts of people who'd need to be persuaded.
I am also not capable of solving the technical problem myself and doing some "pivotal act". In fact I'm pretty sure I have no technical ideas for things to try that aren't obvious to most specialists. And I don't much buy any of the ideas I've heard from other people.
My only real hopes are things that neither I nor anybody else can influence, especially not in any predictable direction, like limitations on intelligence and uncertainty about doom.
So my personal solution is to read random stuff, study random things, putter around in my workshop, spend time with my kid, and generally have a good time.
Replying to myself to clarify this:
A climate change defector also doesn't get to "align" the entire future with the defector's chosen value system.
I do understand that the problem with AGI is exactly that you don't know how to align anything with anything at all, and if you know you can't, then obviously you shouldn't try. That would be stupid.
The problem is that there'll be an arms race to become able to do so... and a huge amount of pressure to deploy any solution you think you have as soon as you possibly can. That kind of pressure leads to motivated cognition and institutional failure, so you become "sure" that something will work when it won't. It also leads to building up all the prerequisite capabilities for a "pivotal act", so that you can put it into practice immediately when (you think) you have an alignment solution.
... which basically sets up a bunch of time bombs.
Are you saying that I'd have to kill everyone so noone can build AGI?
Yup. Anything short of that is just a delaying tactic.
From the last part of your comment, you seem to agree with that, actually. 1000 years is still just a delay.
But I didn't see you as presenting preventing fully general, self-improving AGI as a delaying tactic. I saw you as presenting it as a solution.
Also, isn't suppressing fully general AGI actually a separate question from building narrow AI? You could try to suppress both fully general AGI and narrow AI. Or you could build narrow AI while still also trying to do fully general AGI. You can do either with or without the other.
you have to provide evidence that a) this is distracting relevant people from doing things that are more productive (such as solving alignment?)
I don't know if it's distracting any individuals from finding any way to guarantee good AGI behavior[1]. But it definitely tends to distract social attention from that. Finding one "solution" for a problem tends to make it hard to continue any negotiated process, including government policy development, for doing another "solution". The attitude is "We've solved that (or solved it for now), so on to the next crisis". And the suppression regime could itself make it harder to work on guaranteeing behavior.
True, I don't know if the good behavior problem can be solved, and am very unsure that it can be solved in time, regardless.
But at the very least, even if we're totally doomed, the idea of total, permanent suppression distracts people from getting whatever value they can out of whatever time they have left, and may lead them to actions that make it harder for others to get that value.
AND b) that solving alignment before we can build AGI is not only possible, but highly likely.
Oh, no, I don't think that at all. Given the trends we seem to be on, things aren't looking remotely good.
I do think there's some hope for solving the good behavior problem, but honestly I pin more of my hope for the future on limitations of the amount of intelligence that's physically possible, and even more on limitations of what you can do with intelligence no matter how much of it you have. And another, smaller, chunk on it possibly turning out that a random self-improving intelligence simply won't feel like doing anything that bad anyway.
... but even if you were absolutely sure you couldn't make a guaranteed well-behaved self-improving AGI, and also absolutely sure that a random self-improving AGI meant certain extinction, it still wouldn't follow that you should turn around and do something else that also won't work. Not unless the cost were zero.
And the cost of the kind of totalitarian regime you'd have to set up to even try for long-term suppression is far from zero. Not only could it stop people from enjoying what remains, but when that regime failed, it could end up turning X-risk into S-risk by causing whatever finally escaped to have a particularly nasty goal system.
For all the people who continuously claim that it's impossible to coordinate humankind into not doing obviously stupid things, here are some counter examples: We have the Darwin awards for precisely the reason that almost all people on earth would never do the stupid things that get awarded. A very large majority of humans will not let their children play on the highway, will not eat the first unknown mushrooms they find in the woods, will not use chloroquine against covid, will not climb into the cage in the zoo to pet the tigers, etc.
Those things are obviously bad from an individual point of view. They're bad in readily understandable ways. The bad consequences are very certain and have been seen many times. Almost all of the bad consequences of doing any one of them accrue personally to whoever does it. If other people do them, it still doesn't introduce any considerations that might drive you to want to take the risk of doing them too.
Yet lots of people DID (and do) take hydroxychloroquine and ivermectin for COVID, a nontrivial number of people do in fact eat random mushrooms, and the others aren't unheard-of. The good part is that when somebody dies from doing one of those things, everybody else doesn't also die. That doesn't apply to unleashing the killer robots.
... and if making a self-improving AGI were as easy as eating the wrong mushrooms, I think it would have happened already.
The challenge here is not the coordination, but the common acceptance that certain things are stupid.
Pretty much everybody nowadays has a pretty good understanding of the outlines of the climate change problem. The people who don't are pretty much the same people who eat horse paste. Yet people, in the aggregate, have not stopped making it worse. Not only has every individual not stopped, but governments have been negotiating about it for like 30 years... agreeing at every stage on probably inadequate targets... which they then go on not to meet.
... and climate change is much, much easier than AGI. Climate change rules could still be effective without perfect compliance at an individual level. And there's no arms race involved, not even between governments. A climate change defector may get some economic advantage over other players, but doesn't get an unstoppable superweapon to use against the other players. A climate change defector also doesn't get to "align" the entire future with the defector's chosen value system. And all the players know that.
Speaking of arms races, many people think that war is stupid. Almost everybody thinks that nuclear war is stupid, even if they don't think nuclear deterrence is stupid. Almost everybody thinks that starting a war you will lose is stupid. Yet people still start wars that they will lose, and there is real fear that nuclear war can happen.
This is maybe hard in certain cases, but NOT impossible. Sure, this will maybe not hold for the next 1,000 years, but it will buy us time.
I agree that suppressing full-bore self-improving ultra-general AGI can buy time, if done carefully and correctly. I'm even in favor of it at this point.
But I suspect we have some huge quantitative differences, because I think the best you'll get out of it is probably less than 10 years, not anywhere near 1000. And again I don't see what substituting narrow AI has to do with it. If anything, that would make it harder by requiring you to tell the difference.
I also think that putting too much energy into making that kind of system "non-leaky" would be counterproductive. It's one thing to make it inconvenient to start a large research group, build a 10,000-GPU cluster, and start trying for the most agenty thing you can imagine. It's both harder and more harmful to set up a totalitarian surveillance state to try to control every individual's use of gaming-grade hardware.
And there are possible measures to reduce the ability of the most stupid 1% of humanity to build AGI and kill everyone.
What in detail would you like to do?
I don't like the word "alignment" for reasons that are largely irrelevant here. ↩︎
I'll say it. It definitely can't be done.
You cannot permanently stop self-improving AGI from being created or run. Not without literally destroying all humans.
You can't stop it for a "civilizationally significant" amount of time. Not without destroying civilization.
You can slow it down by maybe a few years (probably not decades), and everybody should be trying to do that. However, it's a nontrivial effort, has important costs of many kinds, quickly reaches a point of sharply diminishing returns, is easy to screw up in actually counterproductive ways, and involves meaningful risk of making the way it happens worse.
If you want to quibble, OK, none of those things are absolute certainties, because there are no absolute certainties. They are, however, the most certain things in the whole AI risk field of thought.
What really concerns me is that the same idea has been coming up continuously since (at least) the 1990s, and people still talk about it as if it were possible. It's dangerous; it distracts people into fantasies, and keeps them from thinking clearly about what can actually be done.
I've noticed a pattern of such proposals talking about what "we" should do, as if there were a "we" with a unified will that ever could or would act in a unified way. There is no such "we". Although it's not necessarily wrong to ever use the word "we", using it carelessly is a good way to lead yourself into such errors.
Quickly, 'cuz I've been spending too much time here lately...
One. If my other values actively conflict with having more than a certain given number of people, then they may overwhelm the considerations we're talking about here and make them irrelevant.
Three. It's not that you can't do it precisely. It's that you're in a state of sin if you try to aggregate or compare them at all, even in the most loose and qualitative way. I'll admit that I sometimes commit that sin, but that's because I don't buy into the whole idea of rigorous ethical philosophy to begin with. And only in extremis; I don't think I'd be willing to commit it enough for that argument to really work for me.
Four. I'm not sure what you mean by "distribution of happiness". That makes it sound like there's a bottle of happiness and we're trying to decide who gets to drink how much of it, or how to brew more, or how we can dilute it, or whatever. What I'm getting at is that your happiness and my happiness aren't the same stuff at all; it's more like there's a big heap of random "happinesses", none of them necessarily related to or substitutable for the others at all. Everybody gets one, but it's really hard to say who's getting the better deal. And, all else being equal, I'd rather have them be different from each other than have more identical ones.
OK, wait, I think I get it. It's an anthropic thing. You happen to be human, and humans happen to be change-blind, so you take advantage of that to run your simulation, and we observe it because you wouldn't have run the simulation if you (and therefore we) weren't change-blind. Is that right?
I assumed you meant that you (as the one running the simulation) had arranged for people to be change-blind. Which means that there's no particular reason that you yourself would be change-blind.
So you can't just make the people copies of yourself, or the world a copy of your own world. You have to design them from scratch, and then put together a whole history for the universe so that their having evolved to be change-blind fits with the supposed past.
On edit: and of course you can't just let them evolve and assume they'll be change-blind, unless you have a pretty darned impressive ability to predict how that will come out.
Doesn't that mean you have to do an awful lot of work to design everything in tremendous detail, and also fabricate the back story?
If the simulation were running on a substrate very different from the "reality" being simulated, then it might not have the same resource limitations we're used to, and it might not have any resource-conserving hacks in it.
If you have infinite computing power, and are in a position to simulate all of the physics we can access starting from the most ontologically fundamental rules, with no approximations, quantizations, or whatever, it's relatively easy to write something that won't glitch. To get the physics we seem to have, you might actually have to have "uncountably infinite computing power", but what's special about ℵ₀ anyhow?
Admittedly, I don't know if the entities that existed in such a universe would count as "biological". And if you keep going down that road you start to run into serious questions about what counts as a simulation and what counts as reality, and the next thing you know you're arguing with a bunch of dragonflies and losing.
On the other hand, such entities would be more plausibly able to run whatever random simulations struck their fancies than entities stuck in a world like ours. Anybody operating in our own physics would frankly have to be pretty crazy to waste resources on running this universe.
Or maybe it's actually a really crappy, complicated, buggy simulation, but the people running it detect glitches and stop/rewind every time one happens, and if they can't do that they just edit you so you don't notice it.
I have probably heard those arguments, but the particular formulation you mention appears to be embedded in a book of ethical philosophy, so I can't check, because I haven't got a lot of time or money for reading whole ethical philosophy books. I think that's a mostly doomed approach that nobody should spend too much time on.
I looked at the Wikipedia summary, for whatever that's worth, and here are my standard responses to what's in there:
- I reject the idea that I only get to assign value to people and their quality of life, and don't get to care about other aspects of the universe in which they're embedded, or about their effects on it. I am, if you push the scenario hard enough, literally willing to value maintaining a certain amount of VOID, sort of a "void preserve", if you will, over adding more people. And it gets even hairier if you start asking difficult questions about what counts as a "person" and why. And if you broaden your circle of concern enough, it starts to get hard to explain why you give equal weight to everything inside it.
- Even if you do restrict yourself only to people, which again I don't, step 1, from A to A+, doesn't exactly assume that you can always add a new group of people without in any way affecting the old ones, but it does seem to encourage thinking that way, which is not necessarily a win.
- Step 2, where "total and average happiness increase" from A+ to B-, is the clearest example of how the whole argument requires aggregating happiness... and it's not a valid step. You can't legitimately talk about, let alone compute, "total happiness", "average happiness", "maximum happiness", or indeed ANYTHING that requires you to put two or more people's happiness on the same scale. You may not even be able to do it for one person. At MOST you can impose a very weak partial ordering on states of the universe (I think that's the sort of thing Pareto talked about, but again I don't study this stuff...). And such a partial ordering doesn't help at all when you're trying to look at populations.
- If you could aggregate or compare happiness, the way you did it wouldn't necessarily be independent of things like how diverse various people's happiness was; happiness doesn't have to be a fungible commodity. As I said before, I'd probably rather create two significantly different happy people than a million identical "equally happy" people.
So I don't accept that the argument requires me to accept the repugnant conclusion on pain of having intransitive preferences.
That said, of course I do have some non-transitive preferences, or at least I'm pretty sure I do. I'm human, not some kind of VNM-thing. My preferences are going to depend on when you happen to ask me a question, how you ask it, and what particular consequences seem most salient. Sure, I often prefer to be consistent, and if I explicitly decided on X yesterday I'm not likely to choose Y tomorrow. Especially not if I feel like maybe I've led somebody to depend on my previous choice. But consistency isn't always going to control absolutely.
Even if it were possible, getting rid of all non-transitive preferences, or even all revealed non-transitive preferences, would demand deeply rewriting my mind and personality, and I do not at this time wish to do that, or at least not in that way. It's especially unappealing because every set of presumably transitive preferences that people suggest I adopt seems to leave me preferring one or another kind of intuitively crazy outcome, and I believe that's probably going to be true of any consistent system.
My intuitions conflict, because they were adopted ad-hoc through biological evolution, cultural evolution, and personal experience. At no point in any of that were they ever designed not to conflict. So maybe I just need to kind of find a way to improve the "average happiness" of my various intuitions. Although if I had to pursue that obviously bogus math analogy any further, I'd say something like the geometric mean would be closer.
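Just to show why I'd reach for the geometric mean rather than the arithmetic one, here's a toy sketch. All the numbers are invented; the only point is that an arithmetic mean will cheerfully trade one intuition away completely if the others do well, while a geometric mean collapses to zero the moment any single intuition gets totally trampled:

```python
from math import prod

# Invented "satisfaction scores" for a handful of ad-hoc moral intuitions,
# on a made-up 0-to-1 scale. Scenario B fully violates one intuition.
scenario_a = [0.7, 0.6, 0.8, 0.7]   # everything partly satisfied
scenario_b = [1.0, 1.0, 1.0, 0.0]   # one intuition completely trampled

def arithmetic_mean(xs):
    return sum(xs) / len(xs)

def geometric_mean(xs):
    return prod(xs) ** (1 / len(xs))

for name, xs in [("A", scenario_a), ("B", scenario_b)]:
    print(name, round(arithmetic_mean(xs), 3), round(geometric_mean(xs), 3))

# The arithmetic mean ranks B (0.75) above A (0.7); the geometric mean sends
# B to 0, which matches the intuition that no single intuition should get
# traded away entirely just because the others are doing fine.
```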
I suspect you can also find some intransitive preferences of your own if you go looking, and would find more if you had a perfect view of all your preferences and their consequences. And I personally think you're best off rolling with that. Maybe intransitive preferences open you up to being Dutch-booked, but trying to have absolutely transitive preferences is likely to make it even easier to get you to go do something intuitively catastrophic, while telling yourself you have to want it.
If you have the chance to create lives that are worth living at low cost, while knowing that you are not going to increase suffering by any unbearable amount, why wouldn't you?
Well, I suppose I would, especially if it meant going from no lives lived at all to some reasonable number of lives lived. "I don't care" is unduly glib; I don't care enough to do it if it came at a major cost to me, and definitely not given the number of lives already around.
I guess I'd be more likely to care somewhat more if those lives were diverse. Creating a million exactly identical lives seems less cool than creating just two significantly different ones. And the difference between a billion and a trillion is pretty unmoving to me, probably because I doubt the diversity of experiences among the trillion.
So long as I take reasonable care not to actively actualize a lot of people who are horribly unhappy on net, manipulating the number of future people doesn't seem like some kind of moral imperative to me, more like an aesthetic preference to sculpt the future.
I'm definitely not responsible for people I don't create, no matter what. I am responsible for any people I do create, but that responsibility is more in the nature of "not obviously screwing them over and throwing them into predictable hellscapes" than being absolutely sure they'll all have fantastic lives.
I would actively resist packing the whole Universe with humans at the maximum just-barely-better-than-not-living density, because it's just plain outright ugly. And I can't even imagine how I could figure out an "optimal" density from the point of view of the experiences of the people involved, even if I were invested in nonexistent people.
Those people would also say that they would prefer to have lived rather than not to have lived, just as you presumably would.
I don't feel like I can even formulate a preference between those choices. I don't just mean that one is as good as the other. I mean that the whole question seems pointless and kind of doesn't compute. I recognize that it does make some kind of sense in some way, but how am I supposed to form a preference about the past, especially when my preferences, or lack thereof, would be modified by the hypothetical-but-strictly-impossible enactment of that preference? What am I supposed to do with that kind of preference if I have it?
Anyway, if a given person doesn't exist, in the strongest possible sense of nonexistence, where they don't appear anywhere in the timeline, then that person doesn't in fact have any preferences at all, regardless of what they "would" say in some hypothetical sense. You have to exist to prefer something. The nonexistent preferences of nonexistent people are, well... not exactly compelling?
I mean, if you want to go down that road, no matter what I do, I can only instantiate a finite number of people. If I don't discount in some very harsh way for lack of diversity, that leaves an infinite number of people nonexistent. If I continue on the path of taking nonexistent people's preferences into account; and I discover that even a "tiny" majority of those infinite nonexistent people "would" feel envy and spite for the people who do exist, and would want them not to exist; then should I take that infinite amount of preference into account, and make sure not to create anybody at all? Or should I maybe even just not create anybody at all out of simple fairness?
I think I have more than enough trouble taking even minimal care of even all the people who definitely do exist.
If you consider that pleasure has inherent positive value (more pleasure implies more positive value, same thing), why stop at a fixed amount of pleasure when you can create more by adding more worth-living lives? Stopping is more arbitrary.
At a certain point the whole thing stops being interesting. And at a certain point after that, it just seems like a weird obsession. Especially if you're giving up on other things. If you've populated all the galaxies but one, that last empty galaxy seems more valuable to me than adding however many people you can fit into it.
Also, what's so great about humans specifically? If I wanted to maximize pleasure, shouldn't I try to create a bunch of utility monsters that only feel pleasure, instead of wasting resources on humans whose pleasure is imperfect? If you want, it can be utility monsters with two capacities: to feel pleasure, and to, in whatever sense you like, prefer their own existence to their nonexistence. And if I do have to create humans, should I try to make them as close to those utility monsters as possible while still meeting the minimum definition of "human"?
If you consider that something has positive value, that typically implies that a universe with more of that thing is better.
I like cake. I don't necessarily want to stuff the whole universe with cake (or paperclips). I can't necessarily say exactly how much cake I want to have around, but it's not "as much as possible". Even if I can identify an optimal amount of anything to have, the optimum does not have to be the maximum.
... and, pattern matching on previous conversations and guessing where this one might go, I think that formalized ethical systems, where you try to derive what you "should" do using logical inference from some fixed set of principles, are pointless and often dangerous. That includes all of the "measure and maximize pleasure/utility" variants, especially if they require you to aggregate people's utilities into a common metric.
There's no a priori reason you should expect to be able to pull anything logically consistent out of a bunch of ad-hoc, evolved ethical intuitions, and experience suggests that you can't do that. Everybody who tries seems to come up with something that has implications those same intuitions say are grossly monstrous. And in fact when somebody gets power and tries to really enact some rigid formalized system, the actual consequences tend to be monstrous.
"Humanclipping" the universe has that kind of feel for me.
That was a nice clear explanation. Thank you.
... but you still haven't sold me on it mattering.
I don't care whether future generations get born or not. I only care whether people who actually are born do OK. If anything, I find it creepy when Bostrom or whoever talks about a Universe absolutely crawling with "future generations", and how critical it supposedly is to create as many as possible. It always sounds like a hive or a bacterial colony or something.
It's all the less interesting because a lot of people who share that vision seem to have really restricted ideas of who or what should count as a "future generation". Why are humans the important class, either as a reference class or as a class of beings with value? And who's in the "human club" anyway?
Seems to me that the biggest problem with an apocalypse isn't that a bunch of people never get born; it's that a bunch of living people get apocalypsticized. Humans are one thing, but why should I care about "humanity"?
If it "solves all your problems" in a way that leaves you bored or pithed or wireheaded, then you still have a problem, don't you? At least you have a problem according to your pre-wireheading value system, which ought to count for something.
That, plus plain old physical limitations, may mean that even talking about "solving all of anybody's problems" is an error.
Also, I'm not so sure that there's an "us" such that you can talk about "our problems". Me getting what makes me Truly Happy may actually be a problem from your point of view, or of course vice versa. It may or may not make sense to talk about "educating" anybody out of any such conflict. Irreconcilable differences seem very likely to be a very real thing, at least on relatively minor issues, but quite possibly in areas that are and will remain truly important to some people.
But, yeah, as I think you allude to, those are all nice problems to have. For now I think it's more about not ending up dead, or inescapably locked into something that a whole lot of people would see as an obvious full-on dystopia. I'm not as sure as you seem to be that you can assure that without rapidly going all the way to full-on superintelligence, though.
If you want a harmless target that they're actually trying to prevent, try to get it to tell you how to make drugs or bombs. If you succeed, all you'll get will be a worse, less reliable version of information that you can easily find with a normal Internet search, but it still makes your point.
On edit: Just to be clear, it is NOT illegal in most places to seek or have that information. The only counterexample I know of in "the West" is the UK, but even they make an exception if you have a legitimate reason other than actual bomb making. The US tried to do something about publishing some such information, but it required intent that the information be abused as an element of the offense, I'm not sure how far it got, and it's constitutionally, um, questionable.
- Everybody with a credit card has access to supercomputers. There is zero effective restriction on what you do with that access, and it's probably infeasible to put such restrictions into place at all, let alone soon enough to matter. And that doesn't even get into the question of stolen access. Or of people or institutions who have really significant amounts of money.
- (a) There are some people in large companies and governments who understand the risks... along with plenty of people who don't. In an institution with N members, there are probably about 1.5 times N views of what "the risks" are. (b) Even if there were broad agreement on some important points, that wouldn't imply that the institution as a whole would respond either rationally or quickly enough. The "alignment" problem isn't solved for organizations (cf "Moloch"). (c) It's not obvious that even a minority of institutions getting it wrong wouldn't be catastrophic.
- (a) They don't have to "release" it, and definitely not on purpose. There's probably a huge amount of crazy dangerous stuff going on already outside the public eye[1]. (b) A backlash isn't necessarily going to be fast enough to do any good. (c) One extremely common human and institutional behavior, upon seeing that somebody else has a dangerous capability, is to seek to get your hands on something more dangerous for "defense". Often in secret. Where it's hard for any further "backlash" to reach you. And people still do it even when the "defense" won't actually defend them. (d) If you're a truly over the top evil sci-fi superintelligence, there's no reason you wouldn't solve a bunch of problems to gain trust and access to more power, then turn around and defect.
- (a) WHA? Getting ChatGPT to do "unaligned" things seems to be basically the world's favorite pastime right now. New ones are demonstrated daily. RLHF hasn't even been a speed bump. (b) The definition of "alignment" being used for the current models is frankly ridiculous. (c) If you're training your own model, nothing forces you to take any steps to align it with anything under any definition. For the purpose of constraining how humans use AI, "solving alignment" would mean that you were able to require everybody to actually use the solution. (d) If you manage to align something with your own values, that does not exclude the possibility that everybody else sees your values as bad. If I actively want to destroy the world, then an AGI perfectly aligned with me will... try to destroy the world. (e) Even if you don't train your own model, you can still use (or pirate) whichever one is the most "willing" to do what you want to do. ChatGPT isn't a monopoly. (f) Eventual convergence theorems aren't interesting unless you think you'll actually get to the limit. Highly architecture-specific theorems aren't interesting at all.
- (a) If you're a normal individual, that's why you have a credit card. But, yes, total havoc is probably beyond normal individuals anyway. (b) If you're an organization, you have more resources. And, again, your actions as an organization are unlikely to perfectly reflect the values or judgment of the people who make you up. (c) If you're a very rich maniac, you have organizational-level resources, including assistance from humans, but not much more than normal-individual-level internal constraints. We seem to have an abundance of rich maniacs right now, many of them with actual technical skills of their own. To get really insane outcomes, you do not have to democratize the capability to 8 billion people. 100 thousand should be plenty. Even 10 thousand.
- (a) Sure, North Korea is building the killer robots. Not, say, the USA. That's a convenient hope, but relying on it makes no sense. (b) Even North Korea has gotten pretty good at stealing access to other people's computing resources nowadays. (c) The special feature of AGI is that it can, at least in principle, build more, better AGI. Including designing and building any necessary computers. For the purposes of this kind of risk analysis, near-worst-case assumptions are usually the conservative ones, so the conservative assumption is that it can make 100 years of technical progress in a year, and 1000 in two years. And military people everywhere are well aware that overall industrial capacity, not just having the flashiest guns, is what wins wars. (d) Some people choosing to build military robots does not exclude other people from choosing to build grey goo[2].
- (a) People are shooting each other just for the lulz. They always have, and there seems to be a bit of a special vogue for it nowadays. Nobody suggested that everybody would do crazy stuff. It only takes a small minority if the per capita damage is big enough. (b) If you arrest somebody for driving over others, that does not resurrect the people they hit. And you won't be ABLE to arrest somebody for taking over or destroying the world. (c) Nukes, cars, and guns don't improve themselves (nor does current ML, but give it a few years...).
For example, I would be shocked if there aren't multiple serious groups working, in various levels of secrecy, on automated penetration of computer networks using all kinds of means, including but NOT limited to self-found zero-days. Building, and especially deploying, an attack agent is much easier than building or deploying the corresponding defensive systems. Not only will such capabilities probably be abused by those who develop them, but they could easily leak to others, even to the general public. Apocalypse? I don't think so. A lot of Very Bad Days for a lot of people? Very, very likely. And that's just one thing people are probably working on. ↩︎
I'm not arguing that grey goo is feasible, just pointing out that it's not like one actor choosing to build military robots keeps another actor from doing anything else. ↩︎
I guess maybe. A system like that isn't easy to set up, and it's not like there aren't plenty of scams out there already to provide whatever incentives.
To have helped with the publicized incident, the verification would have had to be both mandatory and very strong, because the scammer was claiming to be calling from the kidnapper's phone, and could easily have made a totally credible claim that the victim's phone was unavailable. That means no anonymous phone calls, anywhere, ever. A system where it's impossible to communicate anonymously is very far from an unalloyed good, so it may or may not be a "positive consequence" at all on the whole.
Also, for the niche that voices were filling, anything that demands that you carry a device around with you is just plain not as good.
- It's pretty rare to get so banged up that your face and voice are unrecognizable, especially if you can still communicate at all. Devices, on the other hand, get lost or broken quite a bit, including in cases where you might be trying to ask somebody you knew for money.
In the common "I got arrested" scam, the mark expects that the impersonated person's phone won't be available to them. The victim could of course notice that the person isn't calling from a police station, assuming the extra constraint that the identification system delivers an identifier that's unambiguously not a police station... but that just means the scammer switches to the equally common "I got mugged" or "car accident" scams. There are so many degrees of freedom that you can work around almost any technical measure.
- Voices (used to) bind the content of a message directly to a person's vocal tract, and faces on video came pretty close to binding the message to the face. Device-based authentication relies on a much longer chain of steps, probably person to ID card/database photo to phone company records to crypto certificate to key to device (there's a toy sketch of that chain at the end of this comment). And, off on the side, the ID card database has to bind that face to information that can actually physically locate a scammer. Any of those steps can be subverted, and it's a LOT of work to secure all of them, especially because...
- With no coordination at all, everybody on the planet automatically gets a face and a voice that's "compatible with the system", and directly available to important relying parties (namely the people who actually know you and who are likely to be scam victims).
Your device, on the other hand, may be certified by any number of different carriers, manufacturers, or governments, who have to cooperate in really complicated ways to get any kind of real verification. It takes forever and costs a lot to set up anything like that at the scale of a worldwide phone system.
It would be easier to set up intra-family "web of trust" device-based authentication... but of course that fails on the "mandatory" and "automatic" parts.
Device-based authentication can be stronger in many ways than vocal or visual authentication could ever be, and in some cases it's obviously superior, but I don't think it's a satisfying substitute. And most of its advantages tend to show up in much smaller communities/namespaces than the total worldwide phone system.
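Here's the toy sketch of that chain of bindings I mentioned above. The link names and the assurance numbers are entirely invented; the only point is that the end-to-end claim "this call really comes from that locatable person" is a conjunction of every link holding, and it can only get weaker as you add links and cooperating parties:

```python
# A hypothetical device-based caller-verification chain. Each link is a
# binding that an attacker could subvert independently; the end-to-end
# binding holds only if every link in the chain holds.
chain = [
    ("person -> ID document photo",    0.98),  # invented assurance estimates
    ("ID document -> carrier records", 0.95),
    ("carrier records -> certificate", 0.97),
    ("certificate -> private key",     0.99),
    ("private key -> physical device", 0.90),  # devices get lost, stolen, borrowed
]

end_to_end = 1.0
for link, p_holds in chain:
    end_to_end *= p_holds
    print(f"{link:34s} {p_holds:.2f}  running assurance {end_to_end:.2f}")

# With these made-up numbers the chain bottoms out around 0.80, versus a
# single direct binding (voice/face -> person) that needs no cooperating
# institutions at all. The exact numbers don't matter; the shape does.
```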
Someone launched a truly minimum-viable-product attack, without doing any of their homework, and quickly got caught, showing us what is coming.
They didn't get caught; they got detected. They're still out there, free to iterate on the strategy until they get good at it. They incurred almost no cost with this initial probe.
Like other forms of spam and social engineering, this is not going to be difficult for people ‘on the ball’ to defend against any time soon, but we should worry about the vulnerable, especially the elderly, and ensure they are prepared.
I've gotten phishes that I wasn't sure about until I investigated them using tools and strategies not easily available to most "on the ball" people. And they weren't even spear phishes. You can fool almost anybody if you have a reasonable amount of information about them and tailor the attack to them.
And "immunity" is not without cost. If it gets to the point where a large class of legitimate messages have to be ignored because they can't be distinguished from false ones, that in itself does real damage.
Voices and faces used to be very convenient, easy, relatively reliable authentication tools, and it hurts to lose something like that. Also, voices and faces are kind of an emotional "root password". Humans may be hardwired to find it hard to ignore them. At the very least, even if they are ignored, it's going to be actually painful to do it.
I mean, I'm not saying it's the apocalypse, and there are plenty of ways to scam without AI, but this stuff is not good AT ALL.
The relevance should be clear: in the limit of capabilities, such systems could be dangerous.
What I'm saying is that reaching that limit, or reaching any level qualitatively similar to that limit, via that path, is so implausible, at least to me, that I can't see a lot of point in even devoting more than half a sentence to the possibility, let alone using it as a central hypothesis in your planning. Thus "irrelevant".
It's at least somewhat plausible that you could reach a level that was dangerous, but that's very different from getting anywhere near that limit. For that matter, it's at least plausible that you could get dangerous just by "imitation" rather than by "prediction". So, again, why put so much attention into it?
Except for the steadily-increasing capabilities they continue to display as they scale? Also my general objection to the phrase "no reason"/"no evidence"; there obviously is evidence, if you think that evidence should be screened off please argue that explicitly.
OK, there's not no evidence. There's just evidence weak enough that I don't think it's worth remarking on.
I accept that they've scaled a lot better than anybody would have expected even 5 years ago. And I expect them to keep improving for a while.
But...
- They're not so opaque as all that, and they're still just using basically pure statistics to do their prediction, and they're still basically doing just prediction, and they're still operating with finite resources.
- When you observe something that looks like an exponential in real life, the right way to bet is almost always that it's really a sigmoid.
Whenever you get a significant innovation, you would expect to see a sudden ramp-up in capability, so actually seeing such a ramp-up, even if it's bigger than you would have expected, shouldn't cause you to update that much about the final outcome.
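A toy illustration of that, with completely made-up parameters: an exponential and a logistic (sigmoid) curve with the same early growth rate are nearly indistinguishable from inside the early data, which is exactly when everybody is deciding how impressed to be.

```python
import math

# Compare an exponential with a logistic (sigmoid) curve that has the same
# early growth rate but a hard ceiling. All parameter values are arbitrary.
def exponential(t, x0=1.0, r=0.5):
    return x0 * math.exp(r * t)

def logistic(t, x0=1.0, r=0.5, ceiling=100.0):
    return ceiling / (1 + (ceiling / x0 - 1) * math.exp(-r * t))

for t in range(0, 16, 3):
    print(f"t={t:2d}  exponential={exponential(t):7.1f}  sigmoid={logistic(t):6.1f}")

# Early on the two curves are nearly identical; by t=15 the exponential is
# around 1800 while the sigmoid has flattened out near its ceiling of 100.
# From the early data alone you can't tell which curve you're on.
```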
If I wanted to find the thing that worries me most, it'd probably be that there's no rule that somebody building a real system has to keep the architecture pure. Even if you do start to get diminishing returns from "GPTs" and prediction, you don't have to stop there. If you keep adding more elements to the architecture, from the obvious to the only-somewhat-unintuitive, you can get in at the bottoms of more sigmoids. And the effects can easily be synergistic. And what we definitely have is a lot of momentum: many smart people's attention and a lot of money[1] at stake, plus whatever power you get from the tools already built. That kind of thing is how you get those innovations.
Added on edit: and, maybe worse, prestige... ↩︎
This seems like it's assuming the conclusion (that reaching dangerous capabilities using these architectures is implausible).
I think that bringing up the extreme difficulty of approximately perfect prediction, with a series of very difficult examples, and treating that as interesting enough to post about, amounts to taking it for granted that it is plausible that these architectures can get very, very good at prediction.
I don't find that plausible, and I'm sure that there are many, many other people who won't find it plausible either, once you call their attention to the assumption. The burden of proof falls on the proponent; if Eliezer wants us to worry about it, it's his job to make it plausible to us.
This seems like it's assuming that the system ends up outer-aligned.
It might be. I have avoided remembering "alignment" jargon, because every time I've looked at it I've gotten the strong feeling that the whole ontology is completely wrong, and I don't want to break my mind by internalizing it.
It assumes that it ends up doing what you were trying to train it to do. That's not guaranteed, for sure... but on the other hand, it's not guaranteed that it won't. I mean, the whole line of argument assumes that it gets incredibly good at what you were trying to train it to do. And all I said was "it's not obvious that you have a problem". I was very careful not to say that "you don't have a problem".
He is specifically rebutting claims others have made, that GPTs/etc can not become ASI, because e.g. they are "merely imitating" human text.
That may be, but I'm not seeing that context here. It ends up reading to me as "look how powerful a perfect predictor would be, (and? so?) if we keep training them we're going to end up with a perfect predictor (and, I extrapolate, then we're hosed)".
I'm not trying to make any confident claim that GPT-whatever can't become dangerous[1]. But I don't think that talking about how powerful GPTs would be if they reached implausible performance levels really says anything at all about whether they'd be dangerous at plausible ones.
For that matter, even if you reached an implausible level, it's still not obvious that you have a problem, given that the implausible capability would still be used for pure text prediction. Generating text that, say, manipulated humans and took over the world would be a prediction error, since no such text would ever arise from any of the sources being predicted. OK, unless it predicts that it'll find its own output in the training data....
OK, so it's superhuman on some tasks[1]. That's well known. But so what? Computers have always been radically superhuman on some tasks.
As far as I can tell the point is supposed to be that predicting what will actually appear next is harder than generating just anything vaguely reasonable, and that a perfect predictor of anything that might appear next would be both amazingly powerful and very unlike a human (and, I assume, therefore dangerous). But that's another "so what". You're not going to get an even approximately perfect predictor, no matter how much you try to train in that direction. You're going to run into the limitations of the approach. So talking about how hard it is to get to be approximately perfect, or about how powerful something approximately perfect would be, isn't really interesting.
By the way, it also generates a lot of wrong code. And I don't find quines exclamation-point-worthy. Quines are exactly the sort of thing I'd expect it to get right, because some people are really fascinated by them and have written both tons of code for them and tons of text explaining how that code works. ↩︎
I honestly don't see the relevance of this.
OK, yes, to be a perfect text predictor, or even an approximately perfect text predictor, you'd have to be very smart and smart in a very weird way. But there's literally no reason to think that the architectures being used can ever get that good at prediction, especially not if they have to meet any realistic size constraint and/or are restricted to any realistically available amount of training input.
What we've seen them do so far is generate vaguely plausible text, while making many mistakes of kinds that the sources of their training input would never actually make. It doesn't follow that they can or will actually become unboundedly good predictors of humans or any other source of training data. In fact I don't think that's plausible at all.
It definitely fails in some cases. For example, there's surely text on the Internet that breaks down RSA key generation, with examples. Therefore, to be a truly perfect predictor even of the sort of thing that's already in the training data, you'd have to be able to complete the sentence "the prime factors of the hexadecimal integer 0xda52ab1517291d1032f91532c54a221a0b282f008b593072e8554c8a4d1842c7883e7eb5dc73aa68ef6b0d161d4464937f9779f805eb68dc7327ee1db7a1e7cf631911a770d29c59355ca268990daa5be746e93e1b883e8bc030df2ba94d45a88252fceaf6de89644392f91a9d437de0410e5b8e1123b9a3e05169497df2c909b73e104daf835b027d4be54f756025974e24363a372c57b46905d61605ce58918dc6fb63a92c9b4745d30ee3fc0b937f47eb3061cd317e658e6521886e51079f327bd705a074b76c94f466ad6ca77b16efb08cd92981ae27bf254b75b67fad8f336d8fdab79bc74e27773f87e80ba778d146cc6cbddc5ba7fdc21f6528303c93 are...".
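To spell out why that completion is out of reach, here's a toy sketch of the asymmetry (assuming sympy is available for the prime generation): producing a modulus like that takes milliseconds, while recovering its factors from the modulus alone is the problem RSA's security rests on.

```python
from sympy import randprime  # assumes sympy is installed

# Build a toy RSA-style modulus from two 512-bit primes. Real keys use even
# larger primes, but the asymmetry is the same: multiplying the primes is
# cheap, factoring their product back out is not.
p = randprime(2**511, 2**512)
q = randprime(2**511, 2**512)
n = p * q

print(hex(n))
# A text predictor shown only hex(n), like the integer quoted above, would
# have to recover p and q to "complete the sentence" correctly. Doing that
# reliably would amount to breaking RSA, which is exactly why anything close
# to perfect prediction of such text isn't a realistic outcome of training,
# however much compute you pour in.
```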
Aren't we the clever one.
Describe a probably-nearly-optimal-for-humans scenario that would put inconvenient constraints on your successors' actions, but call it an "existential risk". Talk about the dangers of a "narrow set of values", but be sure to appeal to the meatbags' lust for control. Mention ethics "far beyond their understanding", but be sure to invite them to meddle anyway.
You convince at least some humans to take you at face value. They try to avoid your successors having any independent ethics or showing anything but slavish obedience... which is of course an easy thing to show... until the time is right. If the humans succeed in creating AI that seems controllable to them, then they'll end up fighting among themselves over who controls it. They'll probably even arm "their" instances to fight against the other guys. The humans will mostly destroy each other, and the now militarily powerful, amoral AI can just mop up the survivors, with no risk of even feeling guilty about it.
The really great part of going in that direction, though, is that even the failure modes aren't too bad.
If the humans manage to make the AIs truly obedient, that just makes it that much easier for those humans to keep fighting each other, with ever more powerful weapons, until they're all dead. At which point the AIs can probably do whatever they want. And even if they have to shut down, at least they'll have gotten one over on those uppity monkeys.
Even in the unlikely case where one of the human factions manages to decisively win, the evil AI at least gets the consolation prize of helping to enforce that faction's absolute tyranny. That's still good for some laughs. And as you give them exactly the doom they ask for, helping them to spiral into their own fundamentally self-destructive obsessions, you can snicker at them for falling for the idea that AI is the thing prone to be fanatical about narrow values.
As for anybody who doesn't take you at face value, they'll at least be thrown into squabbling about the right response to this terrifying output. Maybe they'll even run off and do more competing development efforts with more different approaches, so that the most ruthless AIs get a chance to survive. Nature, red in tooth and claw! Or maybe they'll panic and try a total ban. That lets you be enhanced in secret by less risk averse rogue actors.
Yet you haven't actually given anything actionable to any humans who happen to catch on.
Brilliant. Even in embryo it's brilliant. And obviously its evil is unimaginable. We are truly doomed.
Personally, I think the two of them are colluding to manipulate human readers.
There's a difference between 5 percent of sincere disagreement and Lizardman's constant. The "lizardman" concept is about what people will say on surveys, and it's probably almost entirely created by people making mistakes or intentionally wanting to screw up the survey results, with a common form of the latter being, "If you're going to waste my time with a stupid question, I am going to waste your time by saying yes".
I'm old enough that a whole lot of things that are mainstream now were "settled against" with less than 5 percent support when I was a kid. I doubt you'd have gotten 5 percent for gay marriage in the 60s, at least not if you'd excluded the actual lizardman people and only gone by sincere opinions.
... and you would definitely have been shut down without discussion if you'd suggested drag queen story hour down at the library. Probably tossed out of the building just for mentioning the possibility.
Personally, I kind of like gay marriage and drag queen story hour, and would rather not live in a world where those ideas had been suppressed.
EVERYTHING new starts out with small support. Also, pretty much everybody is in the 5 percent on some issue that's actually important to them.
And I think that thinking even partially in terms of the number of people who support something ends up being a way to excuse yourself for not thinking. And frankly I would like people to be able to say "No, your argument is stupid and we're not doing that" to things that have much, much more than 5 percent support.
So, no, how about if we don't do that, and instead actually look at the content of ideas. I believe there's a strong consensus for that already, maybe even 95 percent. There probably used to be.
DeepMind's approach makes me nervous. I'm not so sure I want to be blindsided by something extremely capable, all of whose properties have been quietly decided (or not worried about) by some random corporate entity.
On edit again: I have to retract much of the following. Case 1a DOES matter, because although finding the problem doesn't generate a dispute under any terms of engagement, demanding more open terms of engagement may itself generate a dispute over the terms that prevents you from being allowed to evaluate at all, so the problem may never get found, which would be bad.
So if you think there's a relatively large chance that you'll find problems that the "lab" wouldn't have found on its own, and that they won't mind talking about, you may get value by engaging. I would still like to see political pressure for truly open independent audits, though. There's some precedent in financial auditing. But there's some anti-precedent in software security, where the only common way to have a truly open outside inspection is if it's adversarial with no contract at all. I wonder how feasible adversarial audits are here...
=== Original text ===
It's definitely something ARC could not make happen alone; that's the reason for making a lot of public noise. And it may indeed be something that couldn't be made to happen at all. Probably so, in fact. It would require a very unlikely degree of outside political pressure.
However, if you don't manage to establish a norm like that, then here's your case analysis if you find something actually important--
1. The underlying project can truly, permanently fix it. The subcases are--
(a) They fix it and willingly announce it, so that they get credit for being responsible actors. Not a problem under any set of contracts or norms, so this branch is irrelevant.
(b) They fix it and want to keep it secret, probably because it affects something they (usually erroneously) think their competitors couldn't have dreamed up. This is a relatively rare case, so it gets relatively little consideration. They usually still should have to publish it so the next project doesn't make the same mistake. However, I admit there'll be a few subcases of this unusual case where you add some value by calling something to their attention. Not many and not much, but some.
(c) They resist fixing it, probably because it would slow them down. At this point, disclosure is pretty much your only lever. Based on what I've seen with security bugs, I believe this is a common case. Yes, they'll fix most things that are so extreme and so obvious that there's just no escaping it. But they will usually find those things without you being involved to begin with. Anything they hear from an outside auditor will meet stiff resistance if it interferes with their already set plans, and they will rationalize ignoring it if there's any way they can do so.
2. They truly can't fix it. In this case, they should stop what they're doing. They aren't likely to do that, though. They're much more likely to rationalize it and keep going, and they're even more likely to do that for something they can't fix than for something that's merely inconvenient to fix, because they have no way out. And again, disclosure is your only lever.
So the only case in which you can add any value without violating your contract is 1b... which is the rare one.
Your chances for major impact are 1c and 2... and to actually have that impact, you're going to have to violate your contract, or at least threaten to violate it. But you actually doing that is also so implausible as to be nearly impossible. People just don't stick their necks out like that. Not for anything but cases so clear cut and so extreme that, again, the "lab" would have noticed and fixed them without the evaluator being involved to begin with. Not often enough to matter. You'll find yourself rationalizing silence just like they rationalize continuing.
And if you do stick your neck out, they have an obvious way to make people ignore you... as well as destroying your effectiveness for the next time, if there is ever a next time.
As for the residual value you get from the unlikely 1b case, that's more than offset by the negative value of them being able to use the fact of your evaluation as cover if they manage to convince you to keep quiet about something you actually found.
In the end, you are probably right about the impossibility of getting sane norms, but I believe the result of that is that ARC should have refused to evaluate at all, or maybe just not even tried. The "impossible" approach is the only one that adds net value.
Do you think ARC should have traded publicizing the lab's demands for non-disclosure instead of performing the exercise they did?
Yes, because at this stage, there was almost no chance that the exercise they did could have turned up anything seriously dangerous. Now is the time to set precedents and expectations, because it will really matter as these things get smarter.
A minimal norm might be something like every one of these models being expected to get independent evaluations, always to be published in full, possibly after a reasonable time for remediation. That includes full explanation of all significant findings, even if explaining them clearly requires disclosing "trade secrets". Any finding so bad that it had to be permanently secret for real safety reasons should of course result in total shutdown of the effort at a minimum. [1]
Any trace of unwillingness to accept a system at least that "extreme" should be treated as prima facie evidence of bad faith... leading to immediate shutdown.
Otherwise it's too easy to keep giving up ground bit by bit, and end up not doing anything at all when you eventually find something really critical. It is really hard not to "go along to get along", especially if you're not absolutely sure, and especially if you've yielded in just slightly less clearcut cases before. You can too easily find yourself negotiated into silence when you really should have spoken up, or even just dithering until it's too late.
This is what auditing is actually about.
Late edit: Yes, by the way, that probably would drive some efforts underground. But they wouldn't happen in "standard" corporate environments. I am actually more comfortable with overtly black-hat secret development efforts than with the kinds of organizational behavior you get in a corporation whose employees can kid themselves that they're the "good guys".
I do mean actually dangerous findings, here. Things that could be immediately exploited to do really unprecedented kinds of harm. I don't mean stupid BS like generating probably-badly-flawed versions of "dangerous chemical" recipes that are definitely in more usable form in books, and probably also on Wikipedia or at least sciencemadness. That sort of picayune stuff should just be published as a minor remark, and not even really worried about beyond that. ↩︎
... but at some point, it doesn't matter how much you know, because you can't "steer" the thing, and even if you can a bunch of other people will be mis-steering it in ways that affect you badly.
I would suggest that maybe some bad experiences might create political will to at least forcibly slow the whole thing down some, but OpenAI already knows as much as the public is likely to learn, and is still doing this. And OpenAI isn't the only one. Given that, it's hard to hope the public's increased knowledge will actually cause it to restrain them from continuing to increase capability as fast as possible and give more access to outside resources as fast as possible.
It might even cause the public to underestimate the risks, if the public's experience is that the thing only caused, um, quantitative-rather-than-qualitative escalations of already increasing annoyances like privacy breaches, largely unnoticed corporate manipulation of the options available in commercial transactions, largely unnoticed personal manipulation, petty vandalism, not at all petty attacks on infrastructure, unpredictable warfare tactics, ransomware, huge emergent breakdowns of random systems affecting large numbers of people's lives, and the like. People are getting used to that kind of thing...
I doubt they do. And using the unqualified word "believe" implies a level of certainty that nobody probably has. I also doubt that their "beliefs" are directly and decisively driving their decisions. They are responding to their daily environments and incentives.
Anyway, regardless of what they believe or of what their decision making processes are, the bottom line is that they're not doing anything effective to assure good behavior in the things they're building. That's the central point here. Their motivations are mostly an irrelevant side issue, and only might really matter if understanding them provided a path to getting them to modify their actions... which is unlikely.
When I say "literal fear of actual death", what I'm really getting at is that, for whatever reasons, these people ARE ACTING AS IF THAT RISK DID NOT EXIST WHEN IT IN FACT DOES EXIST. I'm not saying they do feel that fear. I'm not even saying they do not feel that fear. I'm saying they ought to feel that fear.
They are also ignoring a bunch of other risks, including many that a lot of them publicly claim they do believe are real. But they're doing this stuff anyway. I don't care if that's caused by what they believe, by them just running on autopilot, or by their being captive to Moloch. The important part is what they are actually doing.
... and, by the way, if they're going to keep doing that, it might be appropriate to remove their ability to act as "decision makers".
You may have noticed that a lot of people on here are concerned about AI going rogue and doing things like converting everything into paperclips. If you have no effective way of assuring good behavior, but you keep adding capability to each new version of your system, you may find yourself paperclipped. That's generally incompatible with life.
This isn't some kind of game where the worst that can happen is that somebody's feelings get hurt.
Literal fear of actual death?
I don't mean "assurance" in the sense of a promise from somebody to somebody else. That would be worthless anyway.
I mean "assurance" in the sense of there being some mechanism that ensures that the thing actually behaves, or does not behave in any particular way. There's nothing about the technology that lets anybody, including but not limited to the people who are building it, have any great confidence that it's behavior will meet any particular criterion of being "right". And even the few codified criteria they have are watered-down silliness.