Gradient Descent on the Human Brain

post by Jozdien, gaspode (lord-elwin) · 2024-04-01T22:39:24.862Z · LW · GW · 5 comments

Contents

  Introduction
  The Setup
    Aside: Whose brains should we use for this?
  Potential Directions
    More sophisticated methods
    Outreach
    Alternate optimization methods
  Appendix
    Toy examples
      Initial Sketch (Simple Case, Maximum Technology Assumed)
      Initial Sketch (Harder, Advanced Technology Assumed)
    How to run gradient descent on the human brain (longer version)
    Neural gradient descent: Organoid edition
    A more advanced sketch

TL;DR: Many alignment research proposals share a common motif: figure out how to enter a basin of alignment / corrigibility for human-level models, then amplify into more powerful regimes while generalizing gracefully. In this post we lay out a research agenda that comes at the problem from the other direction: if we already have ~human-level systems with extremely robust generalization properties, we should just amplify those directly. We’ll call this strategy “Gradient Descent on the Human Brain”.

Introduction

Put one way, the hard part of the alignment problem is figuring out how to solve ontology identification: mapping between an AI’s model of the world and a human’s model, in order to translate and specify human goals in an alien ontology.

In full generality, in the worst case, this is a very difficult problem. But is solving it necessary to create safe superintelligences? You only need to handle arbitrary ontologies if you assume that the path to superintelligence necessarily routes through systems with alien ontologies. We don’t need to solve ontology translation for high-bandwidth communication with other humans[1].

Thus far, we haven’t said anything really novel. The central problem with this approach, as any alignment researcher would know, is that we don’t really have a good way to bootstrap the human brain to superintelligent levels. There have been a few recent attempts [LW · GW] to approach this, though they focus on very prosaic methods that, at best, buy points on the margin. Scaling to superintelligence requires much stronger and more robust methods of optimization.

The Setup

The basic setup is pretty simple, though there are a few nuances and extensions that are hopefully self-explanatory.

The simple version: Take a hundred human brains, put them in a large vat, and run gradient descent on the entire thing.

The human brain is a remarkably powerful artifact for its size, so finding a way to combine the capabilities of a hundred human brains with gradient descent should result in something significantly more powerful. As an intuition pump, think of how powerful human organizations are with far lower communication bandwidth. At the very least we should be able to surpass this; more impressive versions could look like a single integrated mind combining the capabilities of all hundred brains.
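As a purely illustrative sketch of the simple version, assume (counterfactually, and with maximum technology) that the vat of brains is exposed as a differentiable PyTorch module. The `BrainVat` class and all names here are hypothetical stand-ins: each "brain" is a tiny MLP and a linear readout combines all of them, so standard gradient descent can be run on the whole thing.

```python
import torch
import torch.nn as nn

class BrainVat(nn.Module):
    """Toy stand-in: each 'brain' is a tiny MLP, and a linear readout
    combines all of them into a single output."""
    def __init__(self, n_brains=100, dim=16):
        super().__init__()
        self.brains = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, dim))
            for _ in range(n_brains)
        )
        self.readout = nn.Linear(n_brains * dim, 1)

    def forward(self, x):
        # Every brain sees the same input; their outputs are concatenated.
        return self.readout(torch.cat([b(x) for b in self.brains], dim=-1))

torch.manual_seed(0)
vat = BrainVat(n_brains=100, dim=16)
opt = torch.optim.SGD(vat.parameters(), lr=1e-2)
x, target = torch.randn(8, 16), torch.randn(8, 1)  # placeholder training signal

loss_before = nn.functional.mse_loss(vat(x), target).item()
for _ in range(50):
    loss = nn.functional.mse_loss(vat(x), target)
    opt.zero_grad()
    loss.backward()
    opt.step()
loss_after = loss.item()
```

The readout is where the "integrated single mind" framing lives: gradients flow through it into every brain simultaneously, which is the whole point of putting them in one vat rather than training them separately.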

The specifics of what the training signal should be are, we think, a rather straightforward engineering problem. Some pretty off-the-cuff ideas, in increasing order of endorsement:

A slightly more sophisticated setup:

We believe this diagram speaks for itself.

Aside: Whose brains should we use for this?

The comparative advantage of this agenda is the strong generalization properties inherent to the human brain[2]. However, to further push the frontier of safety and allow for a broad basin of graceful failure, we think that the brains used should have a strong understanding of alignment literature. We’re planning on running a prototype with a few volunteer researchers - if you want to help, please reach out!

Potential Directions

More sophisticated methods

In light of recent excitement over [LW · GW] sparse autoencoders (SAEs), one direction we think would be interesting is training SAEs on the human brain and seeing whether we can get more targeted amplification.
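For concreteness, here is a minimal sketch of the standard SAE recipe this direction would borrow: an overcomplete ReLU dictionary trained with reconstruction loss plus an L1 sparsity penalty on the latents. The recorded-activations tensor is of course a hypothetical stand-in; nothing brain-specific is assumed beyond "we can record activation vectors".

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Standard SAE: overcomplete ReLU dictionary, trained to reconstruct
    activations while keeping latent features sparse."""
    def __init__(self, d_in=64, d_hidden=512):
        super().__init__()
        self.enc = nn.Linear(d_in, d_hidden)
        self.dec = nn.Linear(d_hidden, d_in)

    def forward(self, x):
        z = torch.relu(self.enc(x))  # sparse latent features
        return self.dec(z), z

torch.manual_seed(0)
sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
acts = torch.randn(256, 64)  # stand-in for recorded neural activations

loss_before = nn.functional.mse_loss(sae(acts)[0], acts).item()
for _ in range(200):
    recon, z = sae(acts)
    # Reconstruction loss plus L1 penalty on latent activations.
    loss = nn.functional.mse_loss(recon, acts) + 1e-3 * z.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
loss_after = nn.functional.mse_loss(sae(acts)[0], acts).item()
```

Targeted amplification would then mean intervening on individual latent features rather than on raw activations, on the hope that the SAE basis is more interpretable.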

Outreach

We believe that this agenda also aids in community outreach. For instance, e/accs seem unlikely to gain any real political clout due to lacking the mandate of heaven, but we can certainly get them on board with this idea as accelerating the literal next stage of human evolution.

Alternate optimization methods

For reasons beyond the scope of this post, we’re also excited about the potential of training human brains using the forward-forward algorithm instead of standard backpropagation.
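For readers unfamiliar with it, the forward-forward algorithm (Hinton, 2022) replaces the backward pass with two forward passes: each layer is trained locally so that its "goodness" (mean squared activation) is high on positive data and low on negative data, and no gradients cross layer boundaries. A minimal sketch, with hypothetical toy data standing in for whatever the vat provides:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FFLayer(nn.Module):
    """One forward-forward layer, trained locally: goodness is pushed
    above a threshold theta on positive data and below it on negative
    data. No gradients flow between layers."""
    def __init__(self, d_in, d_out, lr=0.03, theta=2.0):
        super().__init__()
        self.lin = nn.Linear(d_in, d_out)
        self.opt = torch.optim.Adam(self.lin.parameters(), lr=lr)
        self.theta = theta

    def forward(self, x):
        # Length-normalise so only the direction of the input matters.
        x = x / (x.norm(dim=1, keepdim=True) + 1e-6)
        return torch.relu(self.lin(x))

    def train_step(self, x_pos, x_neg):
        g_pos = self.forward(x_pos).pow(2).mean(dim=1)
        g_neg = self.forward(x_neg).pow(2).mean(dim=1)
        # Push positive goodness above theta and negative goodness below it.
        loss = F.softplus(torch.cat([self.theta - g_pos,
                                     g_neg - self.theta])).mean()
        self.opt.zero_grad()
        loss.backward()
        self.opt.step()
        # Detach before passing on: the next layer trains independently.
        return self.forward(x_pos).detach(), self.forward(x_neg).detach()

torch.manual_seed(0)
layers = [FFLayer(20, 64), FFLayer(64, 64)]
x_pos = torch.randn(32, 20) + 1.0  # toy "positive" data
x_neg = torch.randn(32, 20) - 1.0  # toy "negative" data
for _ in range(100):
    h_pos, h_neg = x_pos, x_neg
    for layer in layers:
        h_pos, h_neg = layer.train_step(h_pos, h_neg)
```

The appeal for this agenda is that purely local updates are at least more plausible to implement in wetware than propagating exact gradients backwards through a hundred brains.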

Appendix

This contains some rough notes on more detailed sketches. Some of them are pretty hand-wavey, but for legibility it seemed better to put them up than not.

Toy examples

Initial Sketch (Simple Case, Maximum Technology Assumed)

Initial Sketch (Harder, Advanced Technology Assumed)

How to run gradient descent on the human brain (longer version)

Neural gradient descent: Organoid edition

A more advanced sketch

This will be a bit more advanced than the toy experiment on our organoid, as it’s intended to be a prototype for running on a human brain.

  1. ^

Though we’re not claiming this is a trivial problem even for humans - there’s certainly some variance in ontology between people - the central point is that the problem is much more manageable.

  2. ^

    To clarify: these generalization properties are literally as good as they can get, because this tautologically determines what we would want things to generalize as.

  3. ^

    Eye tracking tech may also help here.

5 comments


comment by [deleted] · 2024-04-02T00:08:17.029Z · LW(p) · GW(p)

Totally reasonable, the big man would agree:

https://twitter.com/ESYudkowsky/status/1773064617239150796

Humans augmented far enough to stop being idiots, smart enough that they never put faith in a step that will not hold, as do not exist in this world today, could handle it. But that's the army you need; and if instead you go to war with the army you have, you just get slaughtered by a higher grade of problem than you can handle.

Sounds like intelligence amplification.  100 brains in a vat is a good form of augmentation since we know there are fundamental limits to augmenting one human brain (heat, calories and oxygen supply, fixed space for more grey matter, immune system rejection, insanity, damage from installing deep brain electrode grids, need for more advanced technology to install on the order of billions of electrodes..) so why not 100.

Just a 'few minor problems' with needing to master all of human biology to keep 100 brains in a tank alive long enough to get results.  Biology, it's just so damn complicated, and 1.5 million papers are published every year.  Plus to keep brains alive you need to be an expert on the immune system, and circulatory, and liver and digestion, and all their weird organs that nobody knows what they do.  It's just so hard to find anyone qualified to make decisions multiple times a second to keep the brains alive.

If only there was a solution, some way to automate this process.  You'd need greater than human intelligence though...

comment by Cleo Nardo (strawberry calm) · 2024-09-09T16:30:39.896Z · LW(p) · GW(p)

Fun idea, but idk how this helps as a serious solution to the alignment problem.

suggestion: can you be specific about exactly what “work” the brain-like initialisation is doing in the story?

thoughts:

  1. This risks moral catastrophe. I'm not even sure "let's run gradient descent on your brain upload till your amygdala is playing pong" is something anyone can consent to, because you're creating a new moral patient once you upload and mess with their brain. 
  2. How does this address the risks of conventional ML?
    1. Let's say we have a reward signal R and we want a model to maximise R during deployment. Conventional ML says "update a model with SGD using R during training" and then hopefully SGD carves into the model R-seeking behaviour. This is risky because, if the model already understands the training process and has some other values, then SGD might carve into the model scheming behaviour. This is because "value R" and "value X and scheme" are both strategies which achieve high R-score during training. But during deployment, the "value X and scheme" model would start a hostile AI takeover.
    2. How is this risk mitigated if the NN is initialised to a human brain? The basic deceptive alignment story remains the same.
      1. If the intuition here is "humans are aligned/corrigible/safe/honest etc", then you don't need SGD. Just ask the human to complete the task, possibly with some financial incentive.
      2. If the purpose of SGD is to change the human's values from X to R, then you still risk deceptive alignment. That is, SGD is just as likely to instead change human behaviour from non-scheming to scheming. Both strategies "value R" and "value X and scheme" will perform well during training as judged by R.
  3. "The comparative advantage of this agenda is the strong generalization properties inherent to the human brain. To clarify: these generalization properties are literally as good as they can get, because this tautologically determines what we would want things to generalize as."
    1. Why would this be true?
  4. If we have the ability to upload and run human brains, what do we SGD for? SGD is super inefficient, compared with simply teaching a human how to do something. If I remember correctly, if we trained a human-level NN from initialisation using current methods, then the training would correspond to like a million years of human experience. In other words, SGD (from initialisation), would require as much compute as running 1000 brains continuously for 1000 years. But if I had that much compute, I'd probably rather just run the 1000 brains for 1000 years.

That said, I think something in the neighbourhood of this idea could be helpful.

comment by Jonas Hallgren · 2024-04-02T08:49:11.507Z · LW(p) · GW(p)

I think Neuralink already did this actually, a bit late to the point but a good try anyway. Also, have you considered having Michael Bay direct the research effort? I think he did a pretty good job with the first Transformers.

comment by eggsyntax · 2024-04-02T01:08:27.868Z · LW(p) · GW(p)

Train them to predict each other. Human brains being the most general-purpose objects in existence, this should be a very richly general training channel, and incentivizes brain-to-brain (B2B) interaction.


You may wish to consider a review of the political science machine learning literature here; prior work in that area demonstrates that only a GAN approach allows brains to predict each other effectively (although I believe that there's some disagreement from Limerence Studies scholars).

comment by Joseph Miller (Josephm) · 2024-04-02T13:00:03.614Z · LW(p) · GW(p)

Does anyone think this could actually be done in <20 years?