jbash's Shortform
post by jbash · 2025-01-05T16:43:43.572Z · LW · GW · 13 comments
13 comments
Comments sorted by top scores.
comment by jbash · 2025-01-05T16:43:43.752Z · LW(p) · GW(p)
Please, I beg you guys, stop fretting about humans "losing control over the light cone", or the like.
Humans, collectively, may get lucky enough to close off some futures where we immediately get paperclipped or worse[1].
That, by itself, would be unusually great control.
Please don't overconstrain it with "Oh, and I won't accept any solution where humans stop being In Charge". Part of the answer may be to put something better In Charge. In fact it probably is. Is that a risk? Yes. Stubborn, human-chauvinistic refusal is probably a much bigger risk.
To get a better future, you may have to commit to it, no take-backsies and no micromanaging.
Any loss is mostly an illusion anyway. Humans have influenced history, at least the parts of history that humans most care about, and in big ways. But humans have never had much control.
You can take an action, even an important one. You can rarely predict its effects, not for long, not in the details, and not in the always very numerous and important areas you weren't actively planning for. Causal chains get chaotic very, very fast. Events interact in ways you can't expect to anticipate. It's worse when everything's changing at once, and the effects you want have to happen in a radically different world.
Metaphors about being "in the driver's seat" should notice that the vehicle has no brakes, and sometimes takes random turns by itself. The roads are planless and winding, in a forest, in the fog, in an unknown country, with no signs, no map and no clear destination. The passengers don't agree about why they're on the trip. And since we're talking about humans, I think I have to add that the driver is drunk.
Not having control, and accepting that, is not going to somehow "crush the human spirit". I think most people, the ones who don't see themselves as Elite World Changers, long ago made peace with their personal lack of control. They may if anything take some solace from the fact that even the Elite World Changers still don't have much. Elite World Changers, being human, are best presumed dangerous.
Please join them. To whatever small degree you, I, or we do have control over the shared future, please don't fall victim to the pretense that we're the best possible holders of that control, let alone the only acceptable ones.
[1] I mean, assuming we're even worrying about the right things. The human track record there is mixed.
↑ comment by Vladimir_Nesov · 2025-01-05T16:59:14.810Z · LW(p) · GW(p)
But humans have never had much control.
Not yet. Civilization is barely thousands of years old, and there are 1e34-1e100 years more to figure it out.
↑ comment by jbash · 2025-01-05T17:15:44.143Z · LW(p) · GW(p)
I tend to think that...
... if you operate humans grossly out of distribution by asking them to supervise or control ASI, or even much-better-than-human AGI...
... and if their control is actively meaningful in that they're not just being manipulated to have the ASI do exactly what it would want to do anyway...
... then even if the ASI is actively trying to help as much as it can under that constraint...
... you'll be lucky to have 1e1 years before the humans destroy the world, give up control on purpose, lose control by accident, lock in some kind of permanent (probably dystopian) stasis that will prevent the growth you suggest, or somehow render the entire question moot.
I also don't think that humans are physically capable of doing much better than they do now, no matter how long they have to improve. And I don't think that anything augmented enough to do substantially better would qualify as human.
↑ comment by Noosphere89 (sharmake-farah) · 2025-01-05T18:11:08.148Z · LW(p) · GW(p)
I think a crux here is that I genuinely don't think we'd inevitably destroy the world or create a permanent dystopia with ASI by default (assuming it's controlled/aligned, which I think is pretty likely), though I do think it's reasonably plausible. So the main thing I'm more or less objecting to is the certainty involved here, rather than its plausibility.
My other disagreement is with this statement, in that I think the default outcome is that we do avoid being paperclipped or worse by human-uncontrolled AGIs, mostly due to the alignment problem being noticeably easier to solve than it was 10 years ago, combined with capabilities progress being slow and spiky enough in favorable directions that something like the AI control agenda is actually workable for getting humans in control of even reasonably capable AI by default:
Humans, collectively, may get lucky enough to close off some futures where we immediately get paperclipped or worse[1].
A moderate disagreement that isn't a crux for me, but is illuminating:
I also don't think that humans are physically capable of doing much better than they do now, no matter how long they have to improve. And I don't think that anything augmented enough to do substantially better would qualify as human.
I actually disagree with this, with some caveats.
I do think a lot of people tend to assume magical results from, say, genetic engineering. But I also think that the tradeoffs that made sense 200,000 years ago no longer apply nearly as well, and whether anything augmented enough to do substantially better than humanity still counts as human will ultimately depend on your definition of what counts as humanity.
Most of the gains from augmentation would probably come from making different tradeoffs, IMO.
↑ comment by jbash · 2025-01-05T18:52:39.953Z · LW(p) · GW(p)
I think a crux here is that I genuinely don't think we'd inevitably destroy the world or create a permanent dystopia with ASI by default (assuming it's controlled/aligned, which I think is pretty likely), though I do think it's reasonably plausible. So the main thing I'm more or less objecting to is the certainty involved here, rather than its plausibility.
I don't think it's inevitable, but I do think it's the expected outcome. I agree I'm more suspicious of humans than most people, but obviously I also think I'm right.
People wig out when they get power, even collectively. Trying to ride herd on an AxI is bound to generate stress, tax cognitive capacities, and possibly engender paranoia. Almost everybody seems to have something they'd do if they were King of the World that a substantial number of other people would see as dystopian. One of the strong tendencies seems to be the wish to universalize rightthink, and real mind control might become possible with plausible technology. Grand Visions, moral panics, and purity spirals often rise to pandemic levels, but are presently constrained by being impossible to fully act on. And once you have the Correct World Order on the Most Important Issue, there's a massive impulse to protect it regardless of any collateral damage.
the alignment problem being noticeably easier to solve than 10 years ago
I'm really unconvinced of that. I think people are deceived by their ability to get first-order good behavior in relatively constrained circumstances. I'm definitely totally unconvinced that any of the products that are out there now are "aligned" with anything importantly useful, and they are definitely easy mode.
Also, that's without annoying complications like having to expect the model to advise you on things you literally can't comprehend. I can believe that you and an ASI might end up agreeing on something, but when the ASI can't convey all the information you'd need to have a truly informed opinion, who's aligned with whom? How is it supposed to avoid manipulating you, no matter whether it wants to, if it has to reduce a set of ideas that fundamentally won't fit into your head into something you can give it an opinion on?
Mind you, I don't know how to do "friendliness" any more than I know how to do "intent alignment". But I know which one I'd pick.
[Oh, and on edit to be clear, what I was asking for with the original post was not so much to abandon human control as obviously unacceptable, no matter how suspicious I am of it personally. It was to stop treating any solution that didn't involve human control as axiomatically unacceptable, without regard to other outcomes. If somebody does solve friendliness, use it, FFS, especially if that solution actually turns out to be more reliable than any available alternative human-control solution.]
↑ comment by Vladimir_Nesov · 2025-01-05T19:44:23.329Z · LW(p) · GW(p)
It was to stop treating any solution that didn't involve human control as axiomatically unacceptable, without regard to other outcomes.
The issue is that it's unclear whether it's acceptable, so it should be avoided if at all possible, pending more consideration. In principle there is more time for that than is relevant for any other concerns that don't involve the risk of losing control in a less voluntary way. The revealed preference looks the same as finding it unacceptable to give up the potential for human control, but the argument is different, so the long-term behavior implied by that argument is different. It might only take a million years to decide to give up control.
↑ comment by jbash · 2025-01-05T22:07:33.177Z · LW(p) · GW(p)
By this, are you not assuming that keeping humans in charge is extremely unlikely to result in a short-term catastrophe? You may not get a million years or even a hundred years.
By the way, I think the worst risk from human control isn't extinction. The worse, and more likely, risk is some kind of narrow, fanatical value system being imposed universally, very possibly by direct mind control. I'd expect "safeguards" to be set up to make sure that the world won't drift away from that system... not even in a million years. And the collateral damage from the safeguards would probably be worse than the limitations imposed by the base value system.
I would expect the mind control to apply more to the humans "in charge" than to the rest.
↑ comment by Vladimir_Nesov · 2025-01-05T23:03:27.015Z · LW(p) · GW(p)
I'm not making any claims about feasibility; I only dispute the claim that it's known that permanently giving up the potential for human control is an acceptable thing to do, or that making such a call (an epistemic call about what is known) is reasonable in the foreseeable future. To the extent it's possible to defer this call, it should therefore be deferred (this is a normative claim, not a plan or a prediction of feasibility). If it's not possible to keep the potential for human control despite this uncertainty, then it's not possible, but that won't be because the uncertainty got resolved to the extent that it could be humanly resolved.
↑ comment by Noosphere89 (sharmake-farah) · 2025-01-05T21:10:11.066Z · LW(p) · GW(p)
I don't think it's inevitable, but I do think it's the expected outcome. I agree I'm more suspicious of humans than most people, but obviously I also think I'm right.
People wig out when they get power, even collectively. Trying to ride herd on an AxI is bound to generate stress, tax cognitive capacities, and possibly engender paranoia. Almost everybody seems to have something they'd do if they were King of the World that a substantial number of other people would see as dystopian. One of the strong tendencies seems to be the wish to universalize rightthink, and real mind control might become possible with plausible technology. Grand Visions, moral panics, and purity spirals often rise to pandemic levels, but are presently constrained by being impossible to fully act on. And once you have the Correct World Order on the Most Important Issue, there's a massive impulse to protect it regardless of any collateral damage.
Agree with this (with the caveat that dystopian worlds are relative to your values).
I'm really unconvinced of that. I think people are deceived by their ability to get first-order good behavior in relatively constrained circumstances. I'm definitely totally unconvinced that any of the products that are out there now are "aligned" with anything importantly useful, and they are definitely easy mode.
I think a crux is that I consider the opposite problem, of people searching for an essence and ignoring the behavioral aspects, to be more serious than people overgeneralizing from first-order good behavior in reasonably constrained circumstances, because it's way too easy to assume that there must be a platonic essence of a thing that is almost ineffable and inscrutable to empirical study.
More generally, a crux here is that I believe most of the alignment-relevant parts of an AI are in large part a product of the data it was trained on, combined with my believing that the adversarial examples where human language doesn't track reality are less important for alignment than a lot of people think, and thus that training on human data does implicitly align AIs 50-70% of the way towards human values at minimum.
Also, that's without annoying complications like having to expect the model to advise you on things you literally can't comprehend. I can believe that you and an ASI might end up agreeing on something, but when the ASI can't convey all the information you'd need to have a truly informed opinion, who's aligned with whom? How is it supposed to avoid manipulating you, no matter whether it wants to, if it has to reduce a set of ideas that fundamentally won't fit into your head into something you can give it an opinion on?
Yeah, this does mean you can't have too strict a definition of manipulation, and it's important to note that even aligned AI probably makes us pets over time (with the caveat that instruction following/corrigibility may extend this time immensely, and augmentation of certain humans may make them the ultimate controllers of the future in an abstract sense).
↑ comment by jbash · 2025-01-05T21:59:44.383Z · LW(p) · GW(p)
More generally, a crux here is that I believe most of the alignment-relevant parts of an AI are in large part a product of the data it was trained on, combined with my believing that the adversarial examples where human language doesn't track reality are less important for alignment than a lot of people think, and thus that training on human data does implicitly align AIs 50-70% of the way towards human values at minimum.
I have trouble with the word "alignment", although even I find myself slipping into that terminology occasionally now. What I really want is good behavior. And as you say, that's good behavior by my values. Which I hope are closer to the values of the average person with influence over AI development than they are to the values of the global average human.
Since I don't expect good behavior from humans, I don't think it's adequate to have AI that's even 100 percent aligned, in terms of behaviorally revealed preferences, with humans-in-general as represented by the training data. A particular danger for AI is that it's pretty common for humans, or even significant groups of humans, to get into weird corner cases and obsess over particular issues to the exclusion of things that other humans would think are more important... something that's encouraged by targeted interventions like RLHF. Fanatically "aligned" AI could be pretty darned dystopian. But even "alignment" with the average person could result in disaster.
If you look at it in terms of stated preferences instead of revealed preferences, I think it gets even worse. Most of ethical philosophy looks to me like humans trying to come up with post hoc ways to make "logical necessities" out of values and behaviors (or "intuitions") that they were going to prefer anyway. If you follow the implications of the resulting systems a little bit beyond wherever their inventors stopped thinking, they usually come into violent conflict with other intuitions that are often at least as important.
If you then add the caveat that it's only 50 to 70 percent "aligned"... well, would you want to have to deal with a human that only agreed with you 50 to 70 percent of the time on what behavior was good? Especially on big issues? I think that, on most ways of "measuring" it, the vast majority of humans are probably much better than 50 to 70 percent "aligned" with one another... but humans still aren't mutually aligned enough to avoid massive violent conflicts over stated values, let alone massive violent conflicts over object-level outcomes.
↑ comment by Noosphere89 (sharmake-farah) · 2025-01-05T22:36:23.843Z · LW(p) · GW(p)
To the extent that I understand your position, it's that sharing a lot of values doesn't automatically imply that an AI is safe/non-dystopian by your values if built, rather than that aligning an AI to someone's values is hard/impossible (note that when I say a model is aligned, I am always focused on aligning it to one person's values).
I also dislike the terminology, and I actually agree that alignment is not equal to safety. This is probably one of my disagreements with a lot of LWers, where I don't think alignment automatically makes things better (in fact, things can get worse by making alignment better).
For example, it does not rule out this scenario, where the species doesn't literally go extinct, but lots of humans die because the economic incentives for not stealing/using violence fall apart as humans become effectively worthless on the market:
https://www.lesswrong.com/posts/2ujT9renJwdrcBqcE/the-benevolence-of-the-butcher [LW · GW]
↑ comment by jbash · 2025-01-05T23:12:41.647Z · LW(p) · GW(p)
To the extent that I understand your position, it's that sharing a lot of values doesn't automatically imply that an AI is safe/non-dystopian by your values if built, rather than that aligning an AI to someone's values is hard/impossible (note that when I say a model is aligned, I am always focused on aligning it to one person's values).
Yes, with the caveat that I am not thereby saying that it's not hard to align to even one person's values.
↑ comment by Noosphere89 (sharmake-farah) · 2025-01-05T23:22:22.319Z · LW(p) · GW(p)
Fair enough.
I admittedly have a lot of agreement with you, and that's despite thinking we can make machines that do follow orders / are intent-aligned à la Seth Herd's definition:
https://www.lesswrong.com/posts/7NvKrqoQgJkZJmcuD/instruction-following-agi-is-easier-and-more-likely-than [LW · GW]