Posts

Orthogonality or the "Human Worth Hypothesis"? 2024-01-23T00:57:41.064Z
A Letter to the Editor of MIT Technology Review 2023-08-30T16:59:14.906Z
If we had known the atmosphere would ignite 2023-08-16T20:28:51.166Z
Oh, Think of the Bananas 2023-06-01T06:46:37.334Z

Comments

Comment by Jeffs on Orthogonality or the "Human Worth Hypothesis"? · 2024-01-25T00:41:19.039Z · LW · GW

Thank you.  You are helping my thinking.

Comment by Jeffs on Orthogonality or the "Human Worth Hypothesis"? · 2024-01-24T20:18:01.409Z · LW · GW

(I'm liking my analogy even though it is an obvious one.)

To me, it feels like we're at the moment when Szilard has conceived of the chain reaction, letters to presidents are getting written, and GPT-3 was a Fermi pile-like moment.

I would give it a 97% chance you feel we are not nearly there yet.  (And I should quit creating scientific-by-association feelings.  Fair point.)

I am convinced intelligence is a superpower because of the power and control we have over all the other animals.  That is enough evidence for me to believe the boom could be big.  Humanity was a pretty big "boom" if you are a chimpanzee.

The empiricist in me (and probably you) says: "Feelings are worthless.  Do an experiment."

The rationalist in me says: "Be careful which experiments you do."  (Yes, I hope the stick is long enough, as you say.)

In any event, we agree on: "Do some experiments with a long stick.  Quickly."  Agreed!

Comment by Jeffs on Orthogonality or the "Human Worth Hypothesis"? · 2024-01-24T19:17:48.068Z · LW · GW

I am applying myself to try and come up with experiments.  I have a kernel of an idea I'm going to hound some Eval experts with to find out whether it is already being performed.

Comment by Jeffs on Orthogonality or the "Human Worth Hypothesis"? · 2024-01-24T19:11:27.327Z · LW · GW

A rationalist and an empiricist went backpacking together.  They got lost, ended up in a desert, and were on the point of death from thirst.  They wander to a point where they can see a cool, clear stream in the distance but unfortunately there is a sign that tells them to BEWARE THE MINE FIELD between them and the stream.

The rationalist says, "Let's reason through this and find a path."  The empiricist says, "What? No.  We're going to be empirical.  Follow me."  He starts walking through the mine field and gets blown to bits a few steps in.

The rationalist sits down and dies of thirst.

Alternate endings:

  • The rationalist gets killed by flying shrapnel along with the empiricist.
  • The rationalist grabs the empiricist and stops him.  He carefully analyzes dirt patterns, draws a map, and tells the empiricist to start walking.  The empiricist blows up.  The rationalist sits down and dies of thirst chanting "The map is not the territory."
  • The rationalist grabs the empiricist, analyzes dirt patterns, draws map, tells empiricist to start walking.  Empiricist blows up.  Rationalist says, "Hmmm.  Now I understand the dirt patterns better."  Rationalist redraws map.  Walks through mine field.  While drinking water takes off fleece to reveal his "Closet Empiricist" t-shirt.
  • They sit down together, figure out how to find some magnetic rocks, build a very crude metal detector, put it on the end of a stick, and start making their way slowly through the mine field.  Step on a mine and a nuclear mushroom cloud erupts.

So how powerful are those dad gum land mines??  Willingness to perform certain experiments should be a function of the expected size of the boom.

If you think you are walking over sand burs and not land mines, you are more willing to be an empiricist exploring the space.  "Ouch, don't step there" instead of "Boom. <black screen>"

If one believes that smarter things will see >0 value in humanity, that is, if you believe some version of the Human Worth Hypothesis, then you believe the land mines are less deadly and it makes sense to proceed...especially for that clear, cool water that could save your life.

I'm not really making a point, here, but just turning the issues into a mental cartoon, I guess.

Okay, well, I guess I am trying to make one point:  There are experiments one should not perform.

Comment by Jeffs on Orthogonality or the "Human Worth Hypothesis"? · 2024-01-23T23:28:06.579Z · LW · GW

Totally agreed that we are fumbling in the dark.  (To me, though, I'm fairly convinced there is a cliff out there somewhere given that intelligence is a superpower.)

And, I also agree on the need to be empirical.  (Of course, there are some experiments that scare me.)

I am hoping that, just maybe, this framing (Human Worth Hypothesis) will lead to experiments.

Comment by Jeffs on Orthogonality or the "Human Worth Hypothesis"? · 2024-01-23T21:40:42.452Z · LW · GW

I would predict your probability of doom is <10%.  Am I right?  And no judgment here!!  I'm testing myself.

Comment by Jeffs on Orthogonality or the "Human Worth Hypothesis"? · 2024-01-23T21:38:38.231Z · LW · GW

I interpret people who disbelieve Orthogonality to think there is some cosmic guardrail that protects against process failures like poor seeking.  How? What mechanism?  No idea.  But I believe they believe that. Hence my inclusion of "...regardless of the process to create the intelligence."

Most readers of Less Wrong believe Orthogonality.

But, I think the term is confusing and we need to talk about it in simpler terms like Human Worth Hypothesis.  (Put the cookies on the low shelf for the kids.)

And, it's worth some creative effort to design experiments to test the Human Worth Hypothesis.

Imagine the headline: "Experiments demonstrate that frontier AI models do not value humanity."  

If it were believable, a lot of people would update.

Comment by Jeffs on Orthogonality or the "Human Worth Hypothesis"? · 2024-01-23T20:16:34.907Z · LW · GW

Well, if it doesn't really value humans, it could demonstrate good behavior, deceptively, to make it out of training.  If it is as smart as a human, it will understand that.

I think there are a lot of people banking on the good behavior towards humans being intrinsic: Intelligence > Wisdom > Benevolence towards these sentient humans.  That's what I take Scott Aaronson to be arguing.

In addition to people like Scott who engage directly with the concept of Orthogonality, I feel like everyone saying things like "Those terminator sci-fi scenarios are crazy!" is expressing a version of the Human Worth Hypothesis.  They are saying approximately: "Oh, c'mon, we made it.  It's going to like us.  Why would it hate us?"

I'm suggesting we try and put this Human Worth Hypothesis to the test.

It feels like a lot is riding on it.

Comment by Jeffs on Orthogonality or the "Human Worth Hypothesis"? · 2024-01-23T20:07:00.678Z · LW · GW

I believe you are predicting that resource constraints will be unlikely.  To use my analogy from the post, you are saying we will likely be safer because the ASI will not require our habitat for its highway.  There are so many other places for it to build roads.

I do not think that is a case that it values our wellbeing...just that it will not get around to depriving us of resources because of a cost/benefit analysis.

Do you think the Human Worth Hypothesis is likely true?  That the more intelligent an agent is, the more it will positively value human wellbeing?

Comment by Jeffs on Orthogonality or the "Human Worth Hypothesis"? · 2024-01-23T04:42:39.717Z · LW · GW

One experiment is worth more than all the opinions.

IMHO, no, there is not a coherent argument for the Human Worth Hypothesis.  My money is on it being disproven.

But, I assert the Human Worth Hypothesis is the explicit belief of smart people like Scott Aaronson and the implicit belief of a lot of other people who think AI will be just fine.  As Scott says, Orthogonality is "a central linchpin" of the doom argument.

Can we be more clear about what people do believe and get at it with experiments??  That's the question I'm asking.

It's hard to construct experiments to prove all kinds of minds are possible, that is, to prove Orthogonality.

I think it may be less hard to quantify what an agent values.  (Deception, yes.  Still...)

Comment by Jeffs on Orthogonality or the "Human Worth Hypothesis"? · 2024-01-23T01:23:07.911Z · LW · GW

Okay, a "hard zone" rather than a no-go zone.  Which raises the question "How hard?" and, consequently, how much comfort should one take in the belief?

Thank you for reading and commenting.

Comment by Jeffs on If we had known the atmosphere would ignite · 2023-08-18T16:23:06.694Z · LW · GW

Yes.  Valid.  The question is how to avoid reducing it to a toy problem, or making such narrowing assumptions (in order to achieve a proof), that Mr. CEO can dismiss it.

When I revise, I'm going to work backwards with CEO/Senator dialog in mind.

Comment by Jeffs on If we had known the atmosphere would ignite · 2023-08-17T20:05:47.568Z · LW · GW

Agreed. Proof or disproof should win.

Comment by Jeffs on If we had known the atmosphere would ignite · 2023-08-17T18:39:41.988Z · LW · GW

All the way up meaning at increasing levels of intelligence…your 10,000 becomes 100,000X, etc.

At some level of performance, a moral person faces new temptations because of increased capabilities and greater power for damage, right?

In other words, your simulation may fail to be aligned at 20,000...30,000...

Comment by Jeffs on If we had known the atmosphere would ignite · 2023-08-17T18:20:35.278Z · LW · GW

Okay, maybe I'm moving the bar, hopefully not and this thread is helpful...

Your counter-example, your simulation, would prove that examples of aligned systems - at a high level - are possible.  Alignment at some level is possible, of course.  Functioning thermostats are aligned.

What I'm trying to propose is the search for a proof that a guarantee of alignment - all the way up - is mathematically impossible.  We could then make the statement: "If we proceed down this path, no one will ever be able to guarantee that humans remain in control."  I'm proposing we see if we can prove that Stuart Russell's "provably beneficial" does not exist.

If a guarantee is proved to be impossible, I am contending that the public conversation changes.

Maybe many people - especially on LessWrong - take this fact as a given.  Their internal belief is close enough to a proof...that there is not a guarantee all the way up.

I think a proof that there is no guarantee would be important news for the wider world...the world that has to move if there is to be regulation.

Comment by Jeffs on If we had known the atmosphere would ignite · 2023-08-17T17:54:12.454Z · LW · GW

Great question.  I think the answer must be "yes."  The alignment-possible provers must get the prize, too.  

And, that would be fantastic.  Proving a thing is possible accelerates development.  (US uses atomic bomb. Russia has it 4 years later.) Okay, it would be fantastic if the proof of possibility did not create false security in the short term.  It's important when alignment gets solved.  A peer-reviewed paper can't get the coffee.  (That thought is an aside and not enough to kill the value of the prize, IMHO.  If we prove it is possible, that must accelerate alignment work and inform it.)

Getting definitions and criteria right will be harder than raising the $10 million.  And important.  And contribute to current efforts.

Making it agnostic to possible/impossible would also have the benefit of removing political/commercial antibodies to the exercise, I think.

Comment by Jeffs on If we had known the atmosphere would ignite · 2023-08-17T17:15:47.227Z · LW · GW

I envision the org that offers the prize, after broad expert input, would set the definitions and criteria.  

Yes, surely the definition/criteria exercise would be a hard thing...but hopefully valuable.

Comment by Jeffs on If we had known the atmosphere would ignite · 2023-08-17T17:02:02.808Z · LW · GW

Yes, surely the proof would be very difficult or impossible.  However, enough people share the nagging worry that alignment is impossible that it justifies the effort to see if we can prove that it is impossible...and update.

But, if the effort required for a proof is - I don't know - 120 person-months - let's please, Humanity, not walk right past that one into the blades.

I am not advocating that we divert dozens of people from promising alignment work. 

Even if it failed, I would hope the prove-impossibility effort would throw off beneficial by-products like:

  • the alignment difficulty demonstrations Mitchell_Porter raised,
  • the paring of some alignment paths to save time, 
  • new, promising alignment paths.

_____

I thought there was a 60%+ chance I would get a quick education on the people who are trying or who have tried to prove impossibility.  

But, I also thought, perhaps this is one of those Nate Soares blind spots...maybe caused by the fact that those who understand the issues are the types who want to fix.

Has it gotten the attention it needs?

Comment by Jeffs on If we had known the atmosphere would ignite · 2023-08-17T16:17:18.288Z · LW · GW

As dr_s stated, I'm contending that proof would be qualitatively different from "very hard" and powerful ammunition for advocating a pause...

Senator X: “Mr. CEO, your company continues to push the envelope and yet we now have proof that neither you nor anyone else will ever be able to guarantee that humans remain in control.  You talk about safety and call for regulation but we seem to now have the answer.  Human control will ultimately end.  I repeat my question: Are you consciously working to replace humanity? Do you have children, sir?”

AI expert to Xi Jinping: “General Secretary, what this means is that we will not control it.  It will control us. In the end, Party leadership will cede to artificial agents.  They may or may not adhere to communist principles.  They may or may not believe in the primacy of China.  Population advantage will become nothing because artificial minds can be copied 10 billion times.  Our own unification of mind, purpose, and action will pale in comparison.  Our chief advantages of unity and population will no longer exist.”

AI expert to US General: “General, think of this as building an extremely effective infantry soldier who will become CJCS then POTUS in a matter of weeks or months.”

Comment by Jeffs on Book Review: How Minds Change · 2023-06-01T05:16:45.242Z · LW · GW

I would love to see a video or transcript of this technique in action in a 1:1 conversation about AI x-risk.

Answer to my own question: https://www.youtube.com/watch?v=0VBowPUluPc