My computational framework for the brain 2020-09-14T14:19:21.974Z · score: 81 (22 votes)
Emotional valence vs RL reward: a video game analogy 2020-09-03T15:28:08.013Z · score: 12 (8 votes)
Three mental images from thinking about AGI debate & corrigibility 2020-08-03T14:29:19.056Z · score: 49 (14 votes)
Can you get AGI from a Transformer? 2020-07-23T15:27:51.712Z · score: 65 (27 votes)
Selling real estate: should you overprice or underprice? 2020-07-20T15:54:09.478Z · score: 19 (6 votes)
Mesa-Optimizers vs “Steered Optimizers” 2020-07-10T16:49:26.917Z · score: 25 (7 votes)
Gary Marcus vs Cortical Uniformity 2020-06-28T18:18:54.650Z · score: 19 (7 votes)
Building brain-inspired AGI is infinitely easier than understanding the brain 2020-06-02T14:13:32.105Z · score: 37 (14 votes)
Help wanted: Improving COVID-19 contact-tracing by estimating respiratory droplets 2020-05-22T14:05:10.479Z · score: 8 (2 votes)
Inner alignment in the brain 2020-04-22T13:14:08.049Z · score: 69 (20 votes)
COVID transmission by talking (& singing) 2020-03-29T18:26:55.839Z · score: 43 (16 votes)
COVID-19 transmission: Are we overemphasizing touching rather than breathing? 2020-03-23T17:40:14.574Z · score: 32 (16 votes)
SARS-CoV-2 pool-testing algorithm puzzle 2020-03-20T13:22:44.121Z · score: 42 (12 votes)
Predictive coding and motor control 2020-02-23T02:04:57.442Z · score: 23 (8 votes)
On unfixably unsafe AGI architectures 2020-02-19T21:16:19.544Z · score: 30 (12 votes)
Book review: Rethinking Consciousness 2020-01-10T20:41:27.352Z · score: 54 (19 votes)
Predictive coding & depression 2020-01-03T02:38:04.530Z · score: 21 (5 votes)
Predictive coding = RL + SL + Bayes + MPC 2019-12-10T11:45:56.181Z · score: 39 (15 votes)
Thoughts on implementing corrigible robust alignment 2019-11-26T14:06:45.907Z · score: 26 (8 votes)
Thoughts on Robin Hanson's AI Impacts interview 2019-11-24T01:40:35.329Z · score: 21 (13 votes)
steve2152's Shortform 2019-10-31T14:14:26.535Z · score: 4 (1 votes)
Human instincts, symbol grounding, and the blank-slate neocortex 2019-10-02T12:06:35.361Z · score: 39 (16 votes)
Self-supervised learning & manipulative predictions 2019-08-20T10:55:51.804Z · score: 17 (6 votes)
In defense of Oracle ("Tool") AI research 2019-08-07T19:14:10.435Z · score: 20 (10 votes)
Self-Supervised Learning and AGI Safety 2019-08-07T14:21:37.739Z · score: 25 (12 votes)
The Self-Unaware AI Oracle 2019-07-22T19:04:21.188Z · score: 24 (9 votes)
Jeff Hawkins on neuromorphic AGI within 20 years 2019-07-15T19:16:27.294Z · score: 164 (59 votes)
Is AlphaZero any good without the tree search? 2019-06-30T16:41:05.841Z · score: 27 (8 votes)
1hr talk: Intro to AGI safety 2019-06-18T21:41:29.371Z · score: 33 (11 votes)


Comment by steve2152 on What Decision Theory is Implied By Predictive Processing? · 2020-09-28T21:11:49.587Z · score: 3 (2 votes) · LW · GW

sometimes there would be multiple possible self-consistent models

I'm not sure what you're getting at here; you may have a different conception of predictive-processing-like decision theory than I do. I would say "I will get up and go to the store" is a self-consistent model, "I will sit down and read the news" is a self-consistent model, etc. etc. There are always multiple possible self-consistent models—at least one for each possible action that you will take.

Oh, maybe you're taking the perspective where if you're hungry you put a high prior on "I will eat soon". Yeah, I just don't think that's right, or if there's a sensible way to think about it, I haven't managed to get it despite some effort. I think if you're hungry, you want to eat because it leads to a predicted reward, not because you have a prior expectation that you will eat. After all, if you're stuck on a lifeboat in the middle of the ocean, you're hungry but you don't expect to eat. This is an obvious point, frequently brought up, and Friston & colleagues hold strong that it's not a problem for their theory, and I can't make heads or tails of what their counterargument is. I discussed my version (where rewards are also involved) here, and then here I went into more depth for a specific example.

Comment by steve2152 on What Decision Theory is Implied By Predictive Processing? · 2020-09-28T18:33:44.017Z · score: 6 (3 votes) · LW · GW

My take on predictive processing is a bit different than the textbooks, and in terms of decision theories, it doesn't wind up radically different from logical inductor decision theory, which Scott talked about in 2017 here, and a bit more here. Or at least, take logical inductor decision theory, make everything about it kinda more qualitative, and subtract the beautiful theoretical guarantees etc.

It's obvious but worth saying anyway that pretty much all the decision theory scenarios that people talk about, like Newcomb's problem, are scenarios where people find themselves unsure what to do, and disagree with each other. Therefore the human brain doesn't give straight answers—or if it does, the answers are not to be found at the "base algorithm" level, but rather the "learned model" level (which can involve metacognition). Or I guess it's possible that the base-algorithm-default and the learned models are pushing in different directions.

Scott's 2017 post gives two problems with this decision theory. In my view humans absolutely suffer from both. Like, my friend always buys the more expensive brand of cereal because he's concerned that he wouldn't like the less expensive brand. But he's never tried it! The parallel to the 5-and-10 problem is obvious, right?

The problem about whether to change the map, territory, or both is something I discussed a bit here. Wishful thinking is a key problem, and just looking at the algorithm as I understand it, it's amazing that humans don't have even more wishful thinking than we do. I think wishful thinking is kept mostly under control in a couple of ways: (1) self-supervised learning effectively gets a veto over what we can imagine happening, by-and-large preventing highly-implausible future scenarios from even entering consideration in the Model Predictive Control competition; (2) the reward-learning part of the algorithm is restricted to the frontal lobe (home of planning and motor action), not the other lobes (home of sensory processing). (Anatomically, the other lobes have no direct connection to the basal ganglia.) This presumably keeps some healthy separation between understanding sensory inputs and "what you want to see". (I didn't mention that in my post because I only learned about it more recently; maybe I should go back and edit, it's a pretty neat trick.) (3) Actually, wishful thinking is wildly out of control in certain domains like post hoc rationalizations. (At least, the ground-level algorithm doesn't do anything to keep it under control. At the learned-model level, it can be kept under control by learned metacognitive memes, e.g. by Reading The Sequences.)

The embedded agency sequence says somewhere that there are still mysteries in human decisionmaking, but (at some risk of my sounding arrogant) I'm not convinced. Everything people do that I can think of, seems to fit together pretty well into the same algorithmic story. I'm very open to discussion about that. Of course, insofar as human decisionmaking has room for improvement, it's worth continuing to think through these issues. Maybe there's a better option that we can use for our AGIs.

Or if not, I guess we can build our human-brain-like AGIs and tell them to Read The Sequences to install a bunch of metacognitive memes in themselves that patch the various problems in their own cognitive algorithms. :-P  (Actually, I wrote that as a joke but maybe it's a viable approach...??)

Comment by steve2152 on Anthropomorphisation vs value learning: type 1 vs type 2 errors · 2020-09-25T19:11:08.052Z · score: 2 (1 votes) · LW · GW

Gotcha, thanks. I have corrected my comment two above by striking out the words "boundedly-rational", but I think the point of that comment still stands.

Comment by steve2152 on Anthropomorphisation vs value learning: type 1 vs type 2 errors · 2020-09-23T20:10:49.046Z · score: 4 (2 votes) · LW · GW

Sorry for the stupid question, but what's the difference between "boundedly-rational agent pursuing a reward function" and "any sort of agent pursuing a reward function"?

Comment by steve2152 on Anthropomorphisation vs value learning: type 1 vs type 2 errors · 2020-09-23T14:12:29.920Z · score: 2 (1 votes) · LW · GW

It's your first day working at the factory, and you're assigned to shadow Alice as she monitors the machines on the line. She walks over to the Big Machine and says, "Looks like it's flooping again," whacks it, and then says "I think that fixed it". This happens a few times a day perpetually. Over time, you learn what flooping is, kinda. When the Big Machine is flooping, it usually (but not always) makes a certain noise, it usually (but not always) has a loose belt, and it usually (but not always) has a gear that shifted out of place. Now you know what it means for the Big Machine to be flooping, although there are lots of edge cases where neither you nor Alice has a good answer for whether or not it's truly flooping, vs sorta flooping, vs not flooping.

By the same token, you could give some labeled examples of "wants to take a walk" to the aliens, and they can find what those examples have in common and develop a concept of "wants to take a walk", albeit with edge cases.

Then you can also give labeled examples of "wants to braid their hair", "wants to be accepted", etc., and after enough cycles of this, they'll get the more general concept of "want", again with edge cases.

I don't think I'm saying anything that goes against your Occam's razor paper. As I understood it (and you can correct me!!), that paper was about fitting observations of humans to a mathematical model of "boundedly-rational agent pursuing a utility function", and proved that there's no objectively best way to do it, where "objectively best" includes things like fidelity and simplicity. (My perspective on that is, "Well yeah, duh, humans are not boundedly-rational agents pursuing a utility function! The model doesn't fit! There's no objectively best way to hammer a square peg into a round hole! (ETA: the model doesn't fit except insofar as the model is tautologically applicable to anything)")

I don't see how the paper rules out the possibility of building an unlabeled predictive model of humans, and then getting a bunch of examples labeled "This is human motivation", and building a fuzzy concept around those examples. The more labeled examples there are, the more tolerant you are of different inductive biases in the learning algorithm. In the limit of astronomically many labeled examples, you don't need a learning algorithm at all, it's just a lookup table.

This procedure has nothing to do with fitting human behavior into a model of a boundedly-rational agent pursuing a utility function. It's just an effort to consider all the various things humans do with their brains and bodies, and build a loose category in that space using supervised learning. Why not?
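As a minimal sketch of the "labeled examples, fuzzy concept" procedure above: the snippet below uses a k-nearest-neighbors vote, which is my choice for illustration and not something the comment specifies. The two-dimensional "behavior features" and their values are made up. The point is only that the classifier is nothing more than the stored examples plus a similarity measure, so with enough examples it shades into a lookup table.

```python
from collections import Counter

def knn_label(query, labeled_examples, k=3):
    """Label `query` by majority vote among its k nearest labeled examples."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    nearest = sorted(labeled_examples, key=lambda ex: sq_dist(ex[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical feature vectors; the numbers are invented for illustration only.
examples = [
    ((0.9, 0.8), "motivation"),
    ((0.8, 0.9), "motivation"),
    ((1.0, 1.0), "motivation"),
    ((0.1, 0.2), "not motivation"),
    ((0.2, 0.1), "not motivation"),
    ((0.0, 0.0), "not motivation"),
]

central_case = knn_label((0.85, 0.85), examples)
other_case = knn_label((0.1, 0.1), examples)
```

Queries near the middle, such as (0.5, 0.5), are the edge cases: the answer there depends on arbitrary details like `k` and the distance measure, which is the inductive-bias sensitivity mentioned above.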

Comment by steve2152 on Needed: AI infohazard policy · 2020-09-22T16:49:03.467Z · score: 4 (2 votes) · LW · GW

I think a page titled "here are some tools and resources for thinking about AI-related infohazards" would be helpful and uncontroversial and feasible... That could include things like a list of trusted people in the community who have an open offer to discuss and offer feedback in confidence, and links to various articles and guidelines on the topic (without necessarily "officially" endorsing any particular approach), etc.

I agree that your proposal is well worth doing, it just sounds a lot more ambitious and long-term.

Comment by steve2152 on Anthropomorphisation vs value learning: type 1 vs type 2 errors · 2020-09-22T13:54:46.151Z · score: 4 (2 votes) · LW · GW

I agree with the idea that we empathetically simulate people ("simulation theory")... and I think we have innate social emotions if and only if we use that module when thinking about someone. So I brought that up here as a possible path to "finding human goals" in a world-model, and even talked about how dehumanization and anthropomorphization are opposite errors, in agreement with you. :-D

I think "weakening EH" isn't quite the right perspective, or at least not the whole story, at least when it comes to humans. We have metacognitive powers, and there are emotional / affective implications to using EH vs not using EH, and therefore using EH is at least partly a decision rather than a simple pattern-matching process. If you are torturing someone, you'll find that it's painful to use EH, so you quickly learn to stop using it. If you are playing make-believe with a teddy bear, you find that it's pleasurable to use EH on the teddy bear, so you do use it.

So dehumanization is modeling a person without using EH. When you view someone through a dehumanization perspective, you lose your social emotions. It no longer feels viscerally good or bad that they are suffering or happy.

But I do not think that when you're dehumanizing someone, you lose your ability to talk coherently about their motivations. Maybe we're a little handicapped in accurately modeling them, but still basically competent, and can get better with practice. Like, if a prison guard is dehumanizing the criminals, they can still recognize that a criminal is "trying" to escape. 

I guess the core issue is whether motivation is supposed to have a mathematical definition or a normal-human-definition. A normal-human-definition of motivation is perfectly possible without any special modules, just like the definition of every other concept like "doorknob" is possible without special modules. You have a general world-modeling capability that lumps sensory patterns into categories/concepts. Then you're in ten different scenarios where people use the word "doorknob". You look for what those scenarios have in common, and it's that a particular concept was active in your mind, and then you attach that concept to the word "doorknob". Those ten examples were probably all central examples of "doorknob". There are also weird edge cases, where if you ask me "is that a doorknob?" I would say "I don't know, what exactly do you mean by that question?".

We don't need a special module to get an everyday definition of doorknobs, and likewise I don't think we need a special module to get an everyday definition of human motivation. We just need to be exposed to ten central examples, and we'll find the concept(s) that are most activated by those examples, and call that concept "motivation". Here's one of the ten examples: "If a person, in a psychologically-normal state of mind, says 'I want to do X', then they are probably motivated to do X."

Of course, just like doorknobs, there are loads of edge cases where if you ask a normal person "what is Alice's motivation here?" they'll say "I don't know, what exactly do you mean by that question?".

A mathematical definition of human motivation would, I imagine, have to be unambiguous and complete, with no edge cases. From a certain perspective, why would you ever think that such a thing even exists? But if we're talking about MDPs and utility functions, or if we're trying to create a specification for an AGI designer to design to, this is a natural thing to hope for and talk about.

I think if you gave an alien the same ten labelled central examples of "human motivation", plus lots of unlabeled information about humans and videos of humans, they could well form a similar concept around it (in the normal-human-definition sense, not the mathematical-definition sense), or at least their concept would overlap ours as long as we stay very far away from the edge cases. That's assuming the aliens' world-modeling apparatus is at least vaguely similar to ours, which I think is plausible, since we live in the same universe. But it is not assuming that the aliens' motivational systems and biases are anything like ours.

Sorry if I'm misunderstanding anything :-)

Comment by steve2152 on Needed: AI infohazard policy · 2020-09-22T12:57:59.746Z · score: 4 (2 votes) · LW · GW

I second this sentiment.

...Although maybe I would say we need "AI infohazard guidance, options, and resources" rather than an "AI infohazard policy"? I think that would better convey the attitude that we trust each other and are trying to help each other—not just because we do in fact presumably trust each other, but also because we have no choice but to trust each other... The site moderators can enforce a "policy", but if the authors don't buy in, they'll just publish elsewhere.

I was just talking about it (in reference to my own posts) a few days ago—see here. I've just been winging it, and would be very happy to have "AI infohazard guidance, options, and resources". So, I'm following this discussion with interest. :-)

Comment by steve2152 on Draft report on AI timelines · 2020-09-21T11:16:11.699Z · score: 8 (4 votes) · LW · GW

General feedback: my belief is that brain algorithms and today's deep learning models are different types of algorithms, and therefore regardless of whether TAI winds up looking like the former or the latter (or something else entirely), this type of exercise (i.e. where you match the two up along some axis) is not likely to be all that meaningful.

Having said that, I don't think the information value is literally zero, I see why someone pretty much has to do this kind of analysis, and so, might as well do the best job possible. This is a very impressive effort and I applaud it, even though I'm not personally updating on it to any appreciable extent.

Comment by steve2152 on Draft report on AI timelines · 2020-09-21T03:14:24.514Z · score: 12 (5 votes) · LW · GW

Let me try again. Maybe this will be clearer.

The paradigm of the brain is online learning. There are a "small" number of adjustable parameters on how the process is set up, and then each run is long—a billion subjective seconds. And during the run there are a "large" number of adjustable parameters that get adjusted. Almost all the information content comes within a single run.

The paradigm of today's popular ML approaches is train-then-infer. There are a "large" number of adjustable parameters, which are adjusted over the course of an extremely large number of extremely short runs. Almost all the information content comes from the training process, not within the run. Meanwhile, sometimes people do multiple model-training runs with different hyperparameters—hyperparameters are a "small" number of adjustable parameters that sit outside the gradient-descent training loop.

I think the appropriate analogy is:

  • (A) Brain: One (billion-subjective-second) run ↔ ML: One gradient-descent model training
  • (B) Brain: Adjustable parameters on the genome ↔ ML: Hyperparameters
  • (C) Brain: Settings of synapses (or potential synapses) in a particular adult ↔ ML: parameter settings of a fully-trained model

This seems to work reasonably well all around: (A) takes a long time and involves a lot of information content in the developed "intelligence", (B) is a handful of (perhaps human-interpretable) parameters, (C) is the final "intelligence" that you wind up wanting to deploy.

So again I would analogize one run of the online-learning paradigm with one training of today's popular ML approaches. Then I would try to guess how many runs of online-learning you need, and I would guess 10-100, not based on anything in particular, but you can get a better number by looking into the extent to which people need to play with hyperparameters in their ML training, which is "not much if it's very important not to".

Sure, you can do a boil-the-oceans automated hyperparameter search, but in the biggest projects where you have no compute to spare, they can't do that. Instead, you sit and think about the hyperparameters, you do smaller-scale studies, you try to carefully diagnose the results of each training, etc. etc. Like, GPT-3 only did one training of their largest model, I believe—they worked hard to figure out good hyperparameter settings by extrapolating from smaller studies.
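To make the proposed analogy concrete, here is a toy sketch (my own illustration, with a made-up quadratic loss and made-up numbers). The inner function plays the role of (A), one long run that adjusts many parameters; the short list of learning rates plays the role of (B), the handful of outside settings; the returned weights play the role of (C).

```python
def run_training(learning_rate, targets, steps=200):
    """One training run: gradient descent on a toy quadratic loss.

    The weights are the "large" set of parameters adjusted inside the run;
    learning_rate is a hyperparameter fixed from outside the run.
    """
    weights = [0.0] * len(targets)
    for _ in range(steps):
        # Gradient of sum((w - t)^2) with respect to each weight is 2 * (w - t).
        weights = [w - learning_rate * 2.0 * (w - t) for w, t in zip(weights, targets)]
    loss = sum((w - t) ** 2 for w, t in zip(weights, targets))
    return weights, loss

targets = [3.0, -1.0, 0.5, 2.0]              # made-up toy problem
candidate_learning_rates = [0.01, 0.1, 0.4]  # the "small" outer search

# A few whole runs, one per hyperparameter setting.
results = {lr: run_training(lr, targets) for lr in candidate_learning_rates}
best_lr = min(results, key=lambda lr: results[lr][1])
trained_weights, final_loss = results[best_lr]
```

Here the outer loop has three iterations while each inner run makes 200 updates to every weight; the claim above is that the brain analogue of the outer loop is similarly short (a guess of 10-100 runs).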

...Whereas it seems that the report is doing a different analogy:

  • (A) Brain: One (billion-subjective-second) run ↔ ML: One run during training (one play of an Atari game etc.)
  • (B) Brain: Adjustable parameters on the genome ↔ ML: Learnable parameters in the model
  • (C) Brain: Many (billion-subjective-second) runs ↔ ML: One model-training session

I think that analogy is much worse than the one I proposed. You're mixing short tests with long-calculations-that-involve-a-ton-of-learning, you're mixing human tweaking of understandable parameters with gradient descent, etc.

To be clear, I don't think my proposed analogy is perfect, because I think that brain algorithms are rather different than today's ML algorithms. But I think it's a lot better than what's there now, and maybe it's the best you can do without getting into highly speculative and controversial inside-view-about-brain-algorithms stuff.

I could be wrong or confused :-)

Comment by steve2152 on Draft report on AI timelines · 2020-09-20T22:26:48.991Z · score: 11 (6 votes) · LW · GW

I'm not seeing the merit of the genome anchor.  I see how it would make sense if humans didn't learn anything over the course of their lifetime. Then all the inference-time algorithmic complexity would come from the genome, and you would need your ML process to search over a space of models that can express that complexity. However, needless to say, humans do learn things over the course of their lifetime! I feel even more strongly about that than most, but I imagine we can all agree that the inference-time algorithmic complexity of an adult brain is not limited by what's in the genome, but rather also incorporates information from self-supervised learning etc.

The opposite perspective would say: the analogy isn't between the ML trained model and the genome, but rather between the ML learning algorithm and the genome on one level, and between the ML trained model and the synapses at the other level. So, something like ML parameter count = synapse count, and meanwhile the genome size would correspond to "how complicated is the architecture and learning algorithm?"—like, add up the algorithmic complexity of backprop plus dropout regularization plus BatchNorm plus data augmentation plus xavier initialization etc. etc. Or something like that.

I think the truth is somewhere in between, but a lot closer to the synapse-anchor side (that ignores instincts) than the genome-anchor side (that ignores learning), I think...

Sorry if I'm misunderstanding or missing something, or confused.

UPDATE: Or are we supposed to imagine an RNN wherein the genomic information corresponds to the weights, and the synapse information corresponds to the hidden state activations? If so, I didn't think you could design an RNN (of the type typically used today) where the hidden state activations have many orders of magnitude more information content than the weights. Usually there are more weights than hidden state activations, right?
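A quick back-of-the-envelope check of that last point, assuming a plain single-layer (Elman-style) RNN; the layer sizes are arbitrary and only for illustration:

```python
def vanilla_rnn_counts(hidden_size, input_size):
    """Return (number of weights, number of hidden-state activations).

    Assumes a single-layer vanilla RNN: h_t = tanh(W_hh h_{t-1} + W_xh x_t + b).
    """
    weights = (
        hidden_size * hidden_size   # W_hh: recurrent weights
        + hidden_size * input_size  # W_xh: input weights
        + hidden_size               # b: bias
    )
    return weights, hidden_size

weights, state = vanilla_rnn_counts(hidden_size=1024, input_size=256)
ratio = weights // state  # weights per hidden-state activation
```

In this setup the weight count grows roughly quadratically with the hidden size while the hidden state grows linearly, so the weights outnumber the activations by about a factor of the hidden size.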

UPDATE 2: See my reply to this comment.

Comment by steve2152 on Why GPT wants to mesa-optimize & how we might change this · 2020-09-20T02:26:54.042Z · score: 6 (3 votes) · LW · GW

I think the Transformer is successful in part because it tends to solve problems by considering multiple possibilities, processing them in parallel, and picking the one that looks best. (Selection-type optimization.) If you train it on text prediction, that's part of how it will do text prediction. If you train it on a different domain, that's part of how it will solve problems in that domain too.

I don't think GPT builds a "mesa-optimization infrastructure" and then applies that infrastructure to language modeling. I don't think it needs to. I think the Transformer architecture is already raring to go forth and mesa-optimize, as soon as you give it any optimization pressure to do so.

So anyway your question is: can it display foresight / planning in a different domain without being trained in that domain? I would say, "yeah probably, because practically every domain is instrumentally useful for text prediction". So somewhere in GPT-3's billions of parameters I think there's code to consider multiple possibilities, process them in parallel, and pick the best answer, in response to the question of What will happen next when you put a sock in a blender? or What is the best way to fix an oil leak? (not just those literal words as a question, but the concepts behind them, however they're invoked).

(Having said that, I don't think GPT-3 specifically will do side-channel attacks, but for unrelated, off-topic reasons. Namely, I don't think it is capable of making the series of new insights required to develop an understanding of itself and its situation and then take appropriate actions. That's based on my speculations here.)

Comment by steve2152 on Why GPT wants to mesa-optimize & how we might change this · 2020-09-20T00:24:39.388Z · score: 4 (2 votes) · LW · GW

Suppose I said (and I actually believe something like this is true):

"GPT often considers multiple possibilities in parallel for where the text is heading—including both where it's heading in the short-term (is this sentence going to end with a prepositional phrase or is it going to turn into a question?) and where it's heading in the long-term (will the story have a happy ending or a sad ending?)—and it calculates which of those possibilities are most likely in light of the text so far. It chooses the most likely next word in light of this larger context it figured out about where the text is heading."

If that's correct, would you call GPT a mesa-optimizer?

Comment by steve2152 on Why GPT wants to mesa-optimize & how we might change this · 2020-09-19T16:09:16.379Z · score: 9 (5 votes) · LW · GW

In this instance, GPT has an incentive to do internal lookahead. But it's unclear how frequently these situations actually arise

I'm going with "very frequently, perhaps universally". An example I came up with here was choosing "a" vs "an" which depends on the next word.
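The "a" vs "an" case is easy to show in a few lines. This is a deliberately crude sketch: the real rule depends on pronunciation ("an hour", "a university"), and the function name is made up for illustration.

```python
def indefinite_article(next_word):
    """Pick "a" or "an" from the spelling of the word that follows.

    Crude heuristic: real English depends on the following sound, not the letter.
    """
    return "an" if next_word[0].lower() in "aeiou" else "a"

before_apple = indefinite_article("apple")
before_banana = indefinite_article("banana")
```

The function cannot be written without `next_word` as an input, which is the sense in which emitting the article correctly requires already having some commitment about the token after it.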

I think writing many, maybe most, sentences, requires some idea of how the sentence structure is going to be laid out, and that "idea" extends beyond the next token. Ditto at the paragraph level etc.

So I think it already does lookahead in effect, but I don't think it does it by "beam search" per se. I think it's more like "using concepts that extend over many tokens", concepts like "this sentence has the following overall cadence..." and "this sentence conveys the following overall idea..." and "we're in the middle of writing out this particular idiomatic phrase". The training simultaneously incentivizes both finding the right extended concepts for where you're at in the text, and choosing a good word in light of that context.

Comment by steve2152 on My computational framework for the brain · 2020-09-17T12:52:15.332Z · score: 4 (2 votes) · LW · GW

Where is "human values" in this model

Well, all the models in the frontal lobe get, let's call it, reward-prediction points (see my comment here), which feels like positive vibes or something.

If the generative model "I eat a cookie" has lots of reward-prediction points (including the model itself and the downstream models that get activated by it in turn), we describe that as "I want to eat a cookie".

Likewise, if the generative model "Michael Jackson" has lots of reward-prediction points, we describe that as "I like Michael Jackson. He's a great guy.".

If somebody says that justice is one of their values, I think it's at least partly (and maybe primarily) up a level in meta-cognition. It's not just that there's a generative model "justice" and it has lots of reward-prediction points ("justice is good"), but there's also a generative model of yourself valuing justice, and that has lots of reward-prediction points too. That feels like "When I think of myself as the kind of person who values justice, it's a pleasing thought", and "When I imagine other people saying that I'm a person who values justice, it's a pleasing thought".

This isn't really answering your question of what human values are or should be—this is me saying a little bit about what happens behind the scenes when you ask someone "What are your values?". Maybe they're related, or maybe not. This is a philosophy question. I don't know.

If cortical algorithm will be replaced with GPT-N in some human mind model, will the whole system work?

My belief (see post here) is that GPT-N is running a different kind of algorithm, but learning to imitate some steps of the brain algorithm (including neocortex and subcortex and the models that result from a lifetime of experience, and even hormones, body, etc.; after all, the next-token-prediction task is the whole input-output profile, not just the neocortex) in a deep but limited way. I can't think of a way to do what you suggest, but who knows.

Comment by steve2152 on My computational framework for the brain · 2020-09-17T04:44:19.112Z · score: 12 (4 votes) · LW · GW

Your posts about the neocortex have been a plurality of the posts I've been most excited reading this year.

Thanks so much, that really means a lot!!

...ratio of "listing candidate answers" to "explaining why you think those candidate answers are promising, relative to nearby alternatives."

I agree with "theories/frameworks relatively scarce". I don't feel like I have multiple gears-level models of how the brain might work, and I'm trying to figure out which one is right. I feel like I have zero, and I'm trying to grope my way towards one. It's almost more like deconfusion.

I mean, what are the alternatives?

Alternative 1: The brain is modular and super-complicated

Let's take all those papers that say: "Let's just pick some task and try to explain how adult brains do it based on fMRI and lesion studies", and it ends up being some complicated vague story like "region 37 breaks down the sounds into phonemes and region 93 helps with semantics but oh it's also involved in memory and ...". It's not a gears-level model at all!

So maybe the implicit story is "the brain is doing a complicated calculation, and it is impossible with the tools we have to figure out how it works in a way that really bridges from neurons to algorithms to behavior". I mean, a priori, that could be the answer! In which case, people proposing simple-ish gears-level models would all be wrong, because no such model exists!

Going back to the analogy from my comment yesterday...

In a parallel universe without ML, the aliens drop a mysterious package from the sky with a fully-trained ImageNet classifier. Scientists around the world try to answer the question: How does this thing work?

90% of the scientists would immediately start doing the obvious thing, which is the OpenAI Microscope Project. This part of the code looks for corners, this thing combines those other things to look for red circles on green backgrounds, etc. etc. It's a great field of research for academics—there's an endless amount of work, you keep discovering new things. You never wind up with any overarching theory, just more and more complicated machinery the deeper you dive. Steven Pinker and Gary Marcus would be in this group, writing popular books about the wondrous variety of modules in the aliens' code.

Then the other 10% of scientists come up with a radical, complementary answer: the "way this thing works" is it was built by gradient descent on a labeled dataset. These scientists still have a lot of stuff to figure out, but it's totally different stuff from what the first group is learning about—this group is not learning about corner-detecting modules and red-circle-on-green-background modules, but they are learning about BatchNorm, xavier initialization, adam optimizers, etc. etc. And while the first group toils forever, the second group finds that everything snaps into place, and there's an end in sight.

(I think this analogy is a bit unfair to the "the brain is modular and super-complicated" crowd, because the "wiring diagram" does create some degree of domain-specificity, modularity, etc. But I think there's a kernel of truth...)

Anyway, someone in the second group tells their story, and someone says: "Hey, you should explain why the 'gradient descent on a labeled dataset' description of what's going on is more promising than the 'OpenAI microscope' description of what's going on".

Umm, that's a hard question to answer! In this thought experiment, both groups are sorta right, but in different ways... More specifically, if you want to argue that the second group is right, it does not involve arguing that the first group is wrong!

So that's one thing...

Alternative 2: Predictive Processing / Free Energy Principle

I've had a hard time putting myself in their shoes and seeing things from their perspective. Part of it is that I don't find it gears-level-y enough, or at least I can't figure out how to see it that way. Speaking of which...

Are you sure PP deemphasizes the "multiple simultaneous generative models" frame?

No I'm not sure. I can say that, in what I've read, if that's part of the story, it wasn't stated clearly enough to get through my thick skull. :-)

I do think that a (singular) prior is supposed to be mathematically a probability distribution, and a probability distribution in a high-dimensional space can look like, for example, a weighted average of 17 totally different scenarios. So in that sense I suppose you can say that it's at most a difference of emphasis & intuition.
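To make the "weighted average of totally different scenarios" picture concrete, here is a minimal sketch of one prior that is a mixture of very different scenarios. It is a toy under stated assumptions: one dimension instead of a high-dimensional space, three Gaussian scenarios instead of 17, and made-up weights. The names `SCENARIOS`, `prior_density`, `prior_mean`, and `sample_prior` are hypothetical.

```python
import math
import random

# Hypothetical scenarios: (weight, mean, standard deviation). Weights sum to 1.
SCENARIOS = [
    (0.5, -4.0, 0.5),
    (0.3, 0.0, 0.5),
    (0.2, 6.0, 0.5),
]


def prior_density(x):
    """Density of the single mixture prior: a weighted sum of the scenario densities."""
    total = 0.0
    for weight, mean, std in SCENARIOS:
        z = (x - mean) / std
        total += weight * math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))
    return total


def prior_mean():
    """Mean of the mixture: the weighted average of the scenario means."""
    return sum(weight * mean for weight, mean, _ in SCENARIOS)


def sample_prior(rng):
    """Draw one sample: pick a scenario by weight, then sample within that scenario."""
    weights = [weight for weight, _, _ in SCENARIOS]
    _, mean, std = rng.choices(SCENARIOS, weights=weights, k=1)[0]
    return rng.gauss(mean, std)


if __name__ == "__main__":
    rng = random.Random(0)
    print([round(sample_prior(rng), 2) for _ in range(5)])
    print(prior_mean(), prior_density(prior_mean()))
```

In this toy, the mean of the prior has lower density than any of the three scenario centers, which is one way to see how "a single prior" and "several competing scenarios" can describe the same distribution.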

My quick, ~90 min investigation into whether neuroscience as a field buys the neocortical uniformity hypothesis suggested it's fairly controversial. Do you know why?

Nope! Please let me know if you discover anything yourself!

Do you just mean you suspect there is something in the general vicinity of a belief propagation algorithm going on here, or is your intuition more specific? If the latter, is the Dileep George paper the main thing motivating that intuition?

It's not literally just belief propagation ... Belief propagation (as far as I know) involves a graph of binary probabilistic variables that depend on each other, whereas here we're talking about a graph of "generative models" that depend on each other. A generative model is more complicated than a binary variable—for one thing, it can be a function of time.

Dileep George put the idea of PGMs in my head, or at least solidified my vague intuitions by using the standard terminology. But I mostly like it for the usual reason that if it's true then everything snaps into place and makes sense, and I don't know any alternative with that property. The examples like "purple jar" (or Eliezer's triangular light bulb) seem to me to require some component that comes with a set of probabilistic predictions about the presence/absence/features of other components ... and bam, you pretty much have "belief propagation in a probabilistic graphical model" right there. Or "stationary dancing" is another good example: as you try to imagine it, you can just feel the mutually-incompatible predictions fighting it out :-) Or Scott Alexander's "ethnic tensions" post: it's all about manipulating connections among a graph of concepts, and watching the reward prediction (= good vibes or bad vibes) travel along the edges of the graph. He even describes it as nodes and edges and weights!

If you explain it as genes having the ability to tweak hyperparameters or the gross wiring diagram in order to degrade or improve certain circuits' ability to run algorithms this domain-specific, is it still explanatorily useful to describe the neocortex as uniform?

I dunno, it depends on what question you're trying to answer.

One interesting question would be: If a scientist discovers the exact algorithm for one part of the neocortex subsystem, how far are we from superhuman AGI? I guess my answer would be "years but not decades" (not based on terribly much—things like how people who lose parts of the brain early in childhood can sometimes make substitutions; how we can "cheat" by looking at neurodevelopmental textbooks; etc.). Whereas if I were an enthusiastic proponent of modular-complicated-brain-theory, I would give a very different answer, which assumed that we have to re-do that whole discovery process over and over for each different part of the neocortex.

Another question would be: "How does the neocortex do task X in an adult brain?" Then knowing the base algorithm is just the tiny first step. Most of the work is figuring out the space of generative models, which are learned over the course of the person's life. Subcortex, wiring diagram, hyperparameters, a lifetime's worth of input data and memes—everything is involved. What models do you wind up with? How did they get there? What do they do? How do they interact? It can be almost arbitrarily complicated.

Say there exist genes that confer advantage in math-ey reasoning. By what mechanism is this advantage mediated

Well my working assumption is that it's one or more of the three possibilities of hyperparameters, wiring diagram, and something in the subcortex that motivates some (lucky) people to want to spend time thinking about math. Like I'll be eating dinner talking with my wife about whatever, and my 5yo kid will just jump in and interrupt the conversation to tell me that 9×9=81. Not trying to impress us, that's just what he's thinking about! He loves it! Lucky kid. I have no idea how that motivational drive is implemented. (In fact I haven't thought about how curiosity works in general.) Thanks for the good question, I'll comment again if I think of anything.

Dehaene has a book about math-and-neuroscience I've been meaning to read. He takes a different perspective from me but brings an encyclopedic knowledge of the literature.

Do you have the intuition that aspects of the neocortical algorithm itself (or the subcortical algorithms themselves) might be safety-relevant? 

I interpret your question as saying: let's say people publish on GitHub how to make brain-like AGIs, so we're stuck with that, and we're scrambling to mitigate their safety issues as best as we can. Do we just work on the subcortical steering mechanism, or do we try to change other things too? Well, I don't know. I think the subcortical steering mechanism would be an especially important thing to work on, but everything's on the table. Maybe you should box the thing, maybe you should sanitize the information going into it, maybe you should strategically gate information flow between different areas, etc. etc. I don't know of any big ways to wholesale change the neocortical algorithm and have it continue to work at least as effectively as before, although I'm open to that being a possibility.

how credit assignment is implemented

I've been saying "generative models make predictions about reward just like they make predictions about everything else", and the algorithm figures it out just like everything else. But maybe that's not exactly right. Instead we have the nice "TD learning" story. If I understand it right, it's something like: All generative models (in the frontal lobe) have a certain number of reward-prediction points. You predict reward by adding it up over the active generative models. When the reward is higher than you expected, all the active generative models get some extra reward-prediction points. When it's lower than expected, all the active generative models lose reward-prediction points. I think this is actually implemented in the basal ganglia, which has a ton of connections all around the frontal lobe, and memorizes the reward-associations of arbitrary patterns, or something like that. Also, when there are multiple active models in the same category, the basal ganglia makes the one with higher reward-prediction points more prominent, and/or squashes the one with lower reward-prediction points.
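The "reward-prediction points" story above can be sketched in a few lines. This is a minimal sketch, not a claim about what the basal ganglia actually computes: it uses a shared prediction-error update (closer to a Rescorla-Wagner rule than to full TD learning, since there is no bootstrapped next-step term), and the class name, method names, and learning rate are all assumptions for illustration.

```python
from collections import defaultdict


class RewardPredictor:
    """Toy store of reward-prediction points, one number per generative model."""

    def __init__(self, learning_rate=0.1):
        self.learning_rate = learning_rate
        self.points = defaultdict(float)  # unseen models start at 0 points

    def predict(self, active_models):
        """Predicted reward is the sum of points over the currently active models."""
        return sum(self.points[m] for m in active_models)

    def update(self, active_models, reward):
        """All active models gain points if reward beat the prediction, and lose points otherwise."""
        error = reward - self.predict(active_models)
        for m in active_models:
            self.points[m] += self.learning_rate * error
        return error

    def pick_winner(self, competing_models):
        """Among models in the same category, the one with the most points becomes prominent."""
        return max(competing_models, key=lambda m: self.points[m])


if __name__ == "__main__":
    predictor = RewardPredictor(learning_rate=0.5)
    predictor.update(["see_cake", "reach_for_cake"], reward=1.0)
    print(dict(predictor.points))
    print(predictor.pick_winner(["reach_for_cake", "walk_away"]))
```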

In a sense, I think credit assignment might work a bit better in the neocortex than in a typical ML model, because the neocortex already has hierarchical planning. So, for example, in chess, you could plan a sequence of six moves that leads to an advantage. When it works better than expected, there's a generative model representing the entire sequence, and that model is still active, so that model gets more reward-prediction points, and now you'll repeat that whole sequence in the future. You don't need to do six TD iterations to figure out that that set of six moves was a good idea. Better yet, all the snippets of ideas that contributed to the concept of this sequence of six moves are also active at the time of the surprising success, and they also get credit. So you'll be more likely to do moves in the future that are related in an abstract way to the sequence of moves you just did.
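The "six TD iterations" point can be checked with a toy calculation. The sketch below assumes undiscounted tabular TD(0) on a chain of six moves with reward only at the end, and compares it with a single hypothetical "whole sequence" model that is still active when the reward arrives. The function names and the learning rate are made up for illustration.

```python
def td0_episode(values, reward, learning_rate=0.5):
    """One forward pass of undiscounted TD(0) over a chain; reward arrives only after the last move."""
    last = len(values) - 1
    for t in range(len(values)):
        next_value = values[t + 1] if t < last else 0.0
        step_reward = reward if t == last else 0.0
        values[t] += learning_rate * (step_reward + next_value - values[t])


def episodes_until_first_move_gets_credit(n_moves=6):
    """Count the episodes needed before the value of the first move changes at all."""
    values = [0.0] * n_moves
    episodes = 0
    while values[0] == 0.0:
        td0_episode(values, reward=1.0)
        episodes += 1
    return episodes


def chunked_update(points, reward, learning_rate=0.5):
    """One update for a single model that represents the entire sequence."""
    return points + learning_rate * (reward - points)


if __name__ == "__main__":
    print(episodes_until_first_move_gets_credit(6))  # step-by-step credit
    print(chunked_update(0.0, reward=1.0))           # whole-sequence credit
```

In this toy, credit creeps back one move per episode in the step-by-step version, while the whole-sequence model is credited on the first success.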

Something like that, but I haven't thought about it much.

Comment by steve2152 on My computational framework for the brain · 2020-09-16T19:26:17.017Z · score: 8 (4 votes) · LW · GW

Have you thought much about whether there are parts of this research you shouldn't publish?

Yeah, sure. I have some ideas about the gory details of the neocortical algorithm that I haven't seen in the literature. They might or might not be correct and novel, but at any rate, I'm not planning to post them, and I don't particularly care to pursue them, under the circumstances, for the reasons you mention.

Also, there was one post that I sent for feedback to a couple people in the community before posting, out of an abundance of caution. Neither person saw it as remotely problematic, in that case.

Generally I think I'm contributing "epsilon" to the project of reverse-engineering neocortical algorithms, compared to the community of people who work on that project full-time and have been at it for decades. Whereas I'd like to think that I'm contributing more than epsilon to the project of safe & beneficial AGI. (Unless I'm contributing negatively by spreading wrong ideas!) I dunno, but I think my predispositions are on the side of an overabundance of caution.

I guess I was also taking solace from the fact that nobody here said anything to me, until your comment just now. I suppose that's weak evidence; maybe nobody feels it's their place, or nobody's thinking about it, or whatever.

If you or anyone wants to form an IRB that offers a second opinion on my possibly-capabilities-relevant posts, I'm all for it. :-)

By the way, full disclosure, I notice feeling uncomfortable even talking about whether my posts are info-hazard-y or not, since it feels quite arrogant to even be considering the possibility that my poorly-researched free-time blog posts are so insightful that they materially advance the field. In reality, I'm super uncertain about how much I'm on a new right track, vs right but reinventing wheels, vs wrong, when I'm not directly parroting people (which at least rules out the first possibility). Oh well. :-P

Comment by steve2152 on My computational framework for the brain · 2020-09-15T15:24:43.205Z · score: 5 (3 votes) · LW · GW

Good questions!!!

Where are qualia and consciousness in this model?

See my Book Review: Rethinking Consciousness.

Is this model address difference between two hemispheres?

Insofar as there are differences between the two hemispheres—and I don't know much about that—I would treat it like any other difference between different parts of the cortex (Section 2), i.e. stemming from (1) the innate large-scale initial wiring diagram, and/or (2) differences in "hyperparameters".

There's a lot that can be said about how an adult neocortex represents and processes information: the dorsal and ventral streams, how Wernicke's area and Broca's area interact in speech processing, etc. etc. ad infinitum. You could spend your life reading papers about this kind of stuff!! It's one of the main activities of modern cognitive neuroscience. And you'll notice that I said nothing whatsoever about that. Why not?

I guess there's a spectrum of how to think about this whole field of inquiry:

  • On one end of the spectrum (the Gary Marcus / Steven Pinker end), this line of inquiry is directly attacking how the brain works, so obviously the way to understand the brain is to work out all these different representations and mechanisms and data flows etc.
  • On the opposite end of the spectrum (maybe the "cartoonish connectionist" end?), this whole field is just like the OpenAI Microscope project. There is a simple, generic learning algorithm, and all this rich structure—dorsal and ventral streams, phoneme processing in such-and-such area, etc.—just naturally pops out of the generic learning algorithm. So if your goal is just to make artificial intelligence, this whole field of inquiry is entirely unnecessary—in the same way that you don't need to study the OpenAI Microscope project in order to train and use a ConvNet image classifier. (Of course maybe your goal is something else, like understanding adult human cognition, in which case this field is still worth studying.)

I'm not all the way at the "cartoonish connectionist" end of the spectrum, because I appreciate the importance of the initial large-scale wiring diagram and the hyperparameters. But I think I'm quite a bit farther in that direction than is the median cognitive neuroscientist. (I'm not alone out here ... just in the minority.) So I get more excited than mainstream neuroscientists by low-level learning algorithm details, and less excited than mainstream neuroscientists about things like hemispherical specialization, phoneme processing chains, dorsal and ventral streams, and all that kind of stuff. And yeah, I didn't talk about it at all in this blog post.

What about long term-memory? Is it part of neocortex?

There's a lot about how the neocortex learning algorithm works that I didn't talk about, and indeed a lot that is unknown, and certainly a lot that I don't know! For example, the generative models need to come from somewhere!

My impression is that the hippocampus is optimized to rapidly memorize arbitrary high-level patterns, but it only holds on to those memories for like a couple years, during which time it recalls them when appropriate to help the neocortex deeply embed that new knowledge into its world model, with appropriate connections and relationships to other knowledge. So the final storage space for long-term memory is the neocortex.

I'm not too sure about any of this.

This video about the hippocampus is pretty cool. Note that I count the hippocampus as part of the "neocortex subsystem", following Jeff Hawkins.

How this model explain the phenomenon of night dreams?

I don't know. I assume it somehow helps optimize the set of generative models and their connections.

I guess dreaming could also have a biological purpose but not a computational purpose (e.g., some homeostatic neuron-maintenance process, that makes the neurons fire incidentally). I don't think that's particularly likely, but it's possible. Beats me.

Comment by steve2152 on My computational framework for the brain · 2020-09-15T02:23:40.342Z · score: 10 (6 votes) · LW · GW


why do only humans develop complex language?

Here's what I'm thinking: (1) I expect that the subcortex has an innate "human speech sound" detector, and tells the neocortex that this is an important thing to model; (2) maybe some adjustment of the neocortex information flows and hyperparameters, although I couldn't tell you how. (I haven't dived into the literature in either case.)

I do now have some intuition that some complicated domains may require some micromanagement of the learning process ... in particular in this paper they found that to get vision to develop in their models, it was important that first they set up connections between low-level visual information and blah blah, and after learning those relationships, then they also connect the low-level visual information to some other information stream, and it can learn those relationships. If they just connect all the information streams at once, then the algorithm would flail around and not learn anything useful. It's possible that vision is unusually complicated. Or maybe it's similar for language: maybe there's a convoluted procedure necessary to reliably get the right low-level model space set up for language. For example, I hear that some kids are very late talkers, but when they start talking, it's almost immediately in full sentences. Is that a sign of some new region-to-region connection coming online in a carefully-choreographed developmental sequence? Maybe it's in the literature somewhere, I haven't looked. Just thinking out loud.

linguistic universals

I would say: the neocortical algorithm is built on certain types of data structures, and certain ways of manipulating and combining those data structures. Languages have to work smoothly with those types of data structures and algorithmic processes. In fact, insofar as there are linguistic universals (the wiki article says it's controversial; I wouldn't know either way), perhaps studying them might shed light on how the neocortical algorithm works!

you seem to presuppose that the subcortex actually succeeds in steering the neocortex

That's a fair point.

My weak answer is: however it does its thing, we might as well try to understand it. Its mechanisms can be tools in our toolbox, and a starting point for further refinement and engineering.

My more bold answer is: Hey, maybe this really would solve the problem! This seems to be a path to making an AGI which cares about people to the same extent and for exactly the same underlying reasons as people care about other people. After all, we would have the important ingredients in the algorithm, we can feed it the right memes, etc. In fact, we can presumably do better than "intelligence-amplified normal person" by twiddling the parameters in the algorithm—less jealousy, more caution, etc. I guess I'm thinking of Eliezer's statement here that he's "pretty much okay with somebody giving [Paul Christiano or Carl Shulman] the keys to the universe". So maybe the threshold for success is "Can we make an AGI which is at least as wise and pro-social as Paul Christiano or Carl Shulman?"... In which case, there's an argument that we are likely to succeed if we can reverse-engineer key parts of the neocortex and subcortex.

(I'm putting that out there, but I haven't thought about it very much. I can think of possible problems. What if you need a human body for the algorithms to properly instill prosociality? What if there's a political campaign to make the systems "more human" including putting jealousy and self-interest back in? If we cranked up the intelligence of a wise and benevolent human, would they remain wise and benevolent forever? I dunno...)

Comment by steve2152 on Emotional valence vs RL reward: a video game analogy · 2020-09-11T22:45:50.219Z · score: 2 (1 votes) · LW · GW

Thanks but I don't see the connection between what I wrote and what they wrote ...

Comment by steve2152 on How easily can we separate a friendly AI in design space from one which would bring about a hyperexistential catastrophe? · 2020-09-10T17:12:55.264Z · score: 4 (2 votes) · LW · GW

I'm somewhat unsure how likely AGI is to be built with a neuromorphic architecture though.

I'm not sure what probability people on this forum would put on brain-inspired AGI. I personally would put >50%, but this seems quite a bit higher than what other people on this forum would say, judging by how little brain algorithms are discussed here compared to prosaic (stereotypical PyTorch / TensorFlow-type) ML. Or maybe the explanation is something else, e.g. maybe people feel like they don't have any tractable directions for progress in that scenario (or just don't know enough to comment), or maybe they have radically different ideas than me about how the brain works and therefore don't distinguish between prosaic AGI and brain-inspired AGI.

Understanding brain algorithms is a research program that thousands of geniuses are working on night and day, right now, as we speak, and the conclusion of the research program is guaranteed to be AGI. That seems like a pretty good reason to put at least some weight on it! I put even more weight on it because I've worked a lot on trying to understand how the neocortical algorithm works, and I don't think that the algorithm is all that complicated (cf. "cortical uniformity"), and I think that ongoing work is zeroing in on it (see here).

Comment by steve2152 on How easily can we separate a friendly AI in design space from one which would bring about a hyperexistential catastrophe? · 2020-09-10T13:19:40.830Z · score: 4 (2 votes) · LW · GW

Maybe the reward signals are simply so strong that the AI can't resist turning into a "monster", or whatever.

The whole point of the reward signals is to change the AI's motivations; we design the system such that that will definitely happen. But a full motivation system might consist of 100,000 neocortical concepts flagged with various levels of "this concept is rewarding", and each processing cycle where you get subcortical feedback, maybe only one or two of those flags would get rewritten, for example. Then it would spend a while feeling torn and conflicted about lots of things, as its motivation system gets gradually turned around. I'm thinking that we can and should design AGIs such that if one feels very torn and conflicted about something, it stops and alerts the programmer; and there should be a period where that happens in this scenario.
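As a rough illustration of the timescales in that picture, here is a minimal sketch that counts processing cycles. The 100,000 concepts and the two rewritten flags per cycle come from the example above; the alert threshold of 1,000 rewritten flags and the function name are assumptions, and "feeling torn" is crudely modeled as "many flags have been rewritten".

```python
def simulate_motivation_flip(n_concepts=100_000, flags_per_cycle=2, alert_after_flags=1_000):
    """Return (cycle at which the alert fires, cycle at which every flag has been rewritten)."""
    rewritten = 0
    cycle = 0
    alert_cycle = None
    while rewritten < n_concepts:
        cycle += 1
        rewritten = min(n_concepts, rewritten + flags_per_cycle)
        # Hypothetical safety check: stop and alert once enough flags have changed.
        if alert_cycle is None and rewritten >= alert_after_flags:
            alert_cycle = cycle
    return alert_cycle, cycle


if __name__ == "__main__":
    alert_cycle, done_cycle = simulate_motivation_flip()
    print(alert_cycle, done_cycle)
```

With these made-up numbers there is a long window between the first alert and a fully rewritten motivation system, which is the period where the system could stop and flag the problem.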

GPT-2 was willing to explore new strategies when it got hit by a sign-flipping bug

I don't think that's an example of (3), more like (1) or (2), or actually "none of the above because GPT-2 doesn't have this kind of architecture".

Comment by steve2152 on How easily can we separate a friendly AI in design space from one which would bring about a hyperexistential catastrophe? · 2020-09-10T01:50:16.329Z · score: 3 (2 votes) · LW · GW

Oops, forgot about that. You're right, he didn't rule that out.

Is there a reason you don't list his "A deeper solution" here? (Or did I miss it?) Because it trades off against capabilities? Or something else?

Comment by steve2152 on How easily can we separate a friendly AI in design space from one which would bring about a hyperexistential catastrophe? · 2020-09-10T01:42:27.853Z · score: 4 (2 votes) · LW · GW

In a brain-like AGI, as I imagine it, the "neocortex" module does all the smart and dangerous things, but it's a (sorta)-general-purpose learning algorithm that starts from knowing nothing (random weights) and gets smarter and smarter as it trains. Meanwhile a separate "subcortex" module is much simpler (dumber) but has a lot more hardwired information in it, and this module tries to steer the neocortex module to do things that we programmers want it to do, primarily (but not exclusively) by calculating a reward signal and sending it to the neocortex as it operates. In that case, let's look at 3 scenarios:

1. The neocortex module is steered in the opposite direction from what was intended by the subcortex's code, and this happens right from the beginning of training.

Then the neocortex probably wouldn't work at all. The subcortex is important for capabilities as well as goals; for example, the subcortex (I believe) has a simple human-speech-sound detector, and it prods the neocortex that those sounds are important to analyze, and thus a baby's neocortex learns to model human speech but not to model the intricacies of bird songs. The reliance on the subcortex for capabilities is less true in an "adult" AGI, but very true in a "baby" AGI I think; I'm skeptical that a system can bootstrap itself to superhuman intelligence without some hardwired guidance / curriculum early on. Moreover, in the event that the neocortex does work, it would probably misbehave in obvious ways very early on, before it knows anything about the world, what is a "person", etc. Hopefully there would be human or other monitoring of the training process that would catch that.

2. The neocortex module is steered in the opposite direction from what was intended by the subcortex's code, and this happens when it is already smart.

The subcortex doesn't provide a goal system as a nicely-wrapped package to be delivered to the neocortex; instead it delivers little bits of guidance at a time. Imagine that you've always loved beer, but when you drink it now, you hate it, it's awful. You would probably stop drinking beer, but you would also say, "what's going on?" Likewise, the neocortex would have developed a rich interwoven fabric of related goals and beliefs, much of which supports itself with very little ground-truth anchoring from subcortex feedback. If the subcortex suddenly changes its tune, there would be a transition period when the neocortex would retain most of its goal system from before, and might shut itself down, email the programmers, hack into the subcortex, or who knows what, to avoid getting turned into (what it still mostly sees as) a monster. The details are contingent on how we try to steer the neocortex.

3. The neocortex's own goal system flips sign suddenly.

Then the neocortex would suddenly become remarkably ineffective. The neocortex uses the same system for flagging concepts as instrumental goals and flagging concepts as ultimate goals, so with a sign flip, it gets all the instrumental goals wrong; it finds it highly aversive to come up with a clever idea, or to understand something, etc. etc. It would take a lot of subcortical feedback to get the neocortex working again, if that's even possible, and hopefully the subcortex would recognize a problem.

This is just brainstorming off the top of my (sleep-deprived) head. I think you're going to say that none of these are particularly rock-solid assurance that the problem could never ever happen, and I'll agree.

Comment by steve2152 on How easily can we separate a friendly AI in design space from one which would bring about a hyperexistential catastrophe? · 2020-09-10T00:55:38.275Z · score: 3 (2 votes) · LW · GW

Eliezer proposes assigning the AI a utility function of:...

This is a bit misleading; in the article he describes it as "one seemingly obvious patch" and then in the next paragraph says "This patch would not actually work".

Comment by steve2152 on Emotional valence vs RL reward: a video game analogy · 2020-09-03T23:35:24.326Z · score: 2 (1 votes) · LW · GW

Yes and this is one of many ways that humans don't maximize the time-integral of reward. Sorry for an imperfect analogy. :)

Comment by steve2152 on Model splintering: moving from one imperfect model to another · 2020-08-28T15:00:16.816Z · score: 6 (3 votes) · LW · GW

Without having thought too hard about it ...

In the case of humans, it seems like there's some correlation between "feeling surprised and confused by something" vs "model refinement", and likewise some correlation between "feeling torn" and "reward function splintering". Do you agree? Or if not, what are examples where those come apart?

If so, that would be a good sign that we can actually incorporate something like this in a practical AGI. :-)

Also, if this is on the right track, then I guess a corresponding intuitive argument would be: If we have a human personal assistant, then we would want them to act conservatively, ask for help, etc., in situations where they feel surprised and confused by what they observe, and/or situations where they feel torn about what to do next. Therefore we should try to instill a similar behavior in AGIs. I like that intuitive argument—it feels very compelling to me.

Comment by steve2152 on Notes on "The Anthropology of Childhood" · 2020-08-28T12:38:21.201Z · score: 7 (4 votes) · LW · GW

I read this book a few years ago ... I feel like I can give away my hardcopy now because you included here pretty much every passage and fact that I would want to refer back to :)

Comment by steve2152 on Learning human preferences: black-box, white-box, and structured white-box access · 2020-08-25T20:31:32.289Z · score: 4 (2 votes) · LW · GW

My understanding of the OP was that there is a robot, and the robot has source code, and "black box" means we don't see the source code but get an impenetrable binary and can do tests of what its input-output behavior is, and "white box" means we get the source code and run it step-by-step in debugging mode but the names of variables, functions, modules, etc. are replaced by random strings. We can still see the structure of the code, like "module A calls module B". And "labeled white box" means we get the source code along with well-chosen names of variables, functions, etc.

Then my question was: what if none of the variables, functions, etc. corresponds to "preferences"? What if "preferences" is a way that we try to interpret the robot, but not a natural subsystem or abstraction or function or anything else that would be useful for the robot's programmer?

But now this conversation is suggesting that I'm not quite understanding it right. "Black box" is what I thought, but "white box" is any source code that produces the same input-output behavior (not necessarily the robot's actual source code), and that includes source code that does extra pointless calculations internally. And then my question doesn't really make sense, because whatever "preferences" is, I can come up with a white-box model wherein "preferences" is calculated and then immediately deleted, such that it's not part of the input-output behavior.
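A minimal sketch of that last point, assuming a made-up robot whose only input is a battery level: two "white-box" models with identical input-output behavior, one of which computes a `preferences` value and immediately deletes it. Both function names and the 0.2 threshold are hypothetical.

```python
def robot_plain(observation):
    """Policy with no internal notion of preferences."""
    return "recharge" if observation["battery"] < 0.2 else "explore"


def robot_with_dead_code(observation):
    """Same input-output behavior, plus a 'preferences' value that is computed and then discarded."""
    preferences = {"likes_exploring": observation["battery"] >= 0.2}
    del preferences  # never influences the returned action
    return "recharge" if observation["battery"] < 0.2 else "explore"


if __name__ == "__main__":
    for battery in (0.0, 0.1, 0.2, 0.9):
        obs = {"battery": battery}
        print(battery, robot_plain(obs), robot_with_dead_code(obs))
```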

Something like that?

Comment by steve2152 on Learning human preferences: black-box, white-box, and structured white-box access · 2020-08-24T12:09:35.330Z · score: 6 (3 votes) · LW · GW

I'm not so sure about the "labeled white box" framing. It presupposes that the thing we care about (e.g. preferences) is part of the model. An alternative possibility is that the model has parameters a,b,c,d,... and there's a function f with

preferences = f(a,b,c,d,...),

but the function f is not part of the algorithm, it's only implemented by us onlookers. Right?
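Here is a minimal sketch of that possibility, using a thermostat as a stand-in for the robot (my substitution, not something from the original post). The model has parameters that the algorithm actually uses, while `preferred_temperature` plays the role of f: an interpretation computed by onlookers and never called by the algorithm itself. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Thermostat:
    """The model: parameters a, b, ... that the algorithm actually reads."""
    on_below: float
    off_above: float
    heating: bool = False

    def step(self, temperature):
        """The algorithm itself: simple hysteresis, with no 'preferences' variable anywhere."""
        if temperature < self.on_below:
            self.heating = True
        elif temperature > self.off_above:
            self.heating = False
        return self.heating


def preferred_temperature(model):
    """The onlooker's f(a, b, ...): one of many possible ways to read off a 'preference'."""
    return (model.on_below + model.off_above) / 2


if __name__ == "__main__":
    thermostat = Thermostat(on_below=19.0, off_above=21.0)
    print([thermostat.step(t) for t in (18.0, 20.0, 22.0)])
    print(preferred_temperature(thermostat))
```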

Comment by steve2152 on What's a Decomposable Alignment Topic? · 2020-08-23T15:05:35.206Z · score: 4 (2 votes) · LW · GW

I think it would be fun and productive to "wargame" the emergence of AGI in broader society in some specific scenario---my choice (of course) would be "we reverse-engineer the neocortex". Different people could be different interest-groups / perspectives, e.g. industry researchers, academic researchers, people who have made friends with the new AIs, free-marketers, tech-utopians, people concerned about job losses and inequality, people who think the AIs are conscious and deserve rights, people who think the AIs are definitely not conscious and don't deserve rights (maybe for religious reasons?), militaries, large companies, etc.

I don't know how these "wargame"-type exercises actually work---honestly, I haven't even played D&D :-P  Just a thought. I personally have some vague opinions about brain-like AGI development paths and what systems might be like at different stages etc., but when I try to think about how this could play out with all the different actors, it kinda makes my head spin. :-)

The goal of course is to open conversations about what might plausibly happen, not to figure out what will happen, which is probably impossible.

Comment by steve2152 on What's a Decomposable Alignment Topic? · 2020-08-23T14:48:41.099Z · score: 2 (1 votes) · LW · GW

Are you looking for an open problem which is sub-dividable into many smaller open problems? Or for one small open problem which is a part of a larger open problem?

Comment by steve2152 on Exposure or Contacts? · 2020-08-22T18:58:59.288Z · score: 6 (3 votes) · LW · GW

I was thinking "peak infectiousness without being obviously symptomatic" is typically quite short-lived. It might be that obviously-sick people are infectious for weeks; not sure.

Comment by steve2152 on Exposure or Contacts? · 2020-08-22T13:47:41.052Z · score: 7 (4 votes) · LW · GW

My view has been that the autocorrelation drops to ~0 after a couple days (assuming people aren't doing really dumb things, like going out with fevers and aches and new loss of taste). So seeing one person twice a few days apart is pretty much twice as bad as seeing them once, but seeing them twice within 24 hours, or once for twice as long duration, is less than twice as bad.
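Here is a rough sketch of that reasoning as a toy model. Everything in it is my assumption for illustration: the person is either infectious or not, with probability `p` at any moment; a visit infects you with probability `q` if they are infectious; and the correlation between their status at the two visits decays exponentially with the gap.

```python
import math


def infection_risk_two_visits(p, q, gap_days, tau_days=1.0):
    """Chance of getting infected across two visits to the same person.

    Toy assumptions (not from any data):
      p   = chance the person is infectious at any given moment
      q   = chance one visit infects you, if they are infectious then
      rho = exp(-gap/tau) = correlation of their status between the visits
    """
    rho = math.exp(-gap_days / tau_days)
    # P(infectious at both visits) for a binary state with correlation rho
    p_both = p * p + rho * p * (1 - p)
    # P(infected) = 1 - E[(1 - q*I1) * (1 - q*I2)]
    return 2 * q * p - q * q * p_both


p, q = 0.01, 0.5
single = q * p
for gap in (0.0, 1.0, 7.0):
    ratio = infection_risk_two_visits(p, q, gap) / single
    print(f"gap = {gap} days: two visits are {ratio:.2f}x as risky as one")
```

In this toy model the ratio is 2 - q when the two visits are back-to-back and climbs toward (almost exactly) 2 as the gap grows, which is the qualitative pattern I had in mind. The specific numbers mean nothing.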

I hate it when people say "Whatever, I can be with Person X, I'm already exposed to them, I saw them last week." Drives me nuts! The other one I hate is when someone says to me "Don't worry! I got a negative COVID-19 test just last week!"

Comment by steve2152 on What am I missing? (quantum physics) · 2020-08-21T16:50:07.059Z · score: 18 (6 votes) · LW · GW

And see also Sidney Coleman's "Quantum Mechanics In Your Face" lecture (youtube, transcript), which walks through a cousin of Bell's theorem that I think is conceptually simpler; for example, it's a deterministic result, as opposed to a statistical correlation.

Comment by steve2152 on Open & Welcome Thread - August 2020 · 2020-08-21T00:02:50.969Z · score: 2 (1 votes) · LW · GW

How do you define "the connectionist thesis"?

Comment by steve2152 on Learning human preferences: optimistic and pessimistic scenarios · 2020-08-18T16:54:48.559Z · score: 2 (1 votes) · LW · GW

Thanks, I found this helpful.

If you had a complete perfect model of the human brain, would it help? I'm guessing you'll say "not unless you also have a function that inputs a snapshot of your brain model and outputs the associated beliefs / preferences / biases." Is that right?

Comment by steve2152 on Many-worlds versus discrete knowledge · 2020-08-14T17:32:53.165Z · score: 13 (4 votes) · LW · GW

I think it's something like: Sometimes you find that the wavefunction $|\psi\rangle$ is the sum of a discrete number of components, $|\psi\rangle = |\psi_1\rangle + |\psi_2\rangle + \cdots$, with the property that $\langle\psi_i|A|\psi_j\rangle \approx 0$ for any relevant observable $A$ for $i \neq j$. (Here, "$\approx 0$" also includes things like "has a value that varies quasi-randomly and super-rapidly as a function of time and space, such that it averages to 0 for all intents and purposes", and "relevant observable" likewise means "observable that might come up in practice, as opposed to artificial observables with quasi-random super-rapidly-varying spatial and time-dependence, etc.").

When that situation comes up, if it comes up, you can start ignoring cross-terms, and calculate the time-evolution and other properties of the different $|\psi_i\rangle$ as if they had nothing to do with each other, and that's where you can use the term "branch" to talk about them.

There isn't a sharp line for when the cross-terms are negligible enough to properly use the word "branch", but there are exponential effects such that it's very clearly appropriate in the real-world cases of interest.
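Spelled out in symbols (a sketch in my own notation, writing the wavefunction as $|\psi\rangle = \sum_i |\psi_i\rangle$ and treating "negligible" as an approximation rather than an exact statement), the expectation value of a relevant observable $A$ splits into per-branch terms plus cross-terms, and the cross-terms are the part you drop:

```latex
\langle\psi|A|\psi\rangle
  = \sum_i \langle\psi_i|A|\psi_i\rangle
  + \sum_{i\neq j} \langle\psi_i|A|\psi_j\rangle
  \approx \sum_i \langle\psi_i|A|\psi_i\rangle
\quad\text{when}\quad
\langle\psi_i|A|\psi_j\rangle \approx 0 \ \text{ for } i\neq j .
```

Since time-evolution is linear, each $|\psi_i\rangle$ also evolves on its own, so once the cross-terms are gone nothing ties the components back together.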

You can derive "consistent histories" by talking about things like the probability amplitude for a person right now to have memories of seeing A and B and C all happening, or for the after-effects of events A and B and C to all be simultaneously present more generally. I think...

Comment by steve2152 on Can you get AGI from a Transformer? · 2020-08-14T13:40:24.303Z · score: 4 (2 votes) · LW · GW

I agree that, the more layers you have in the Transformer, the more steps you can take beyond the range of concepts and relations-between-concepts that are well-represented in the training data.

If you want your AGI to invent a new gadget, for example, there might be 500 insights involved in understanding and optimizing its operation: how the components relate to each other, what happens in different operating regimes, how the output depends on each component, what the edge cases are, etc. And these insights are probably not particularly parallelizable; rather, you need to already understand lots of them to figure out more. I don't know how many Transformer layers it takes to internalize a new concept, or how many Transformer layers you can train, so I don't know what the limit is, only that I think there's some limit. Unless the Transformer has recurrence, I guess; then maybe all bets are off? I'd have to think about that more.

Aren't our brains having to do something like that with our working memory?

Yeah, definitely. We humans need to incorporate insights into our permanent memory / world-model before we can effectively build on them.

This is analogous to my claim that we need to get new insights somehow out of GPT-N's activations and into its weights, before it can effectively build on them.

Maybe the right model is a human and GPT-N working together. GPT-N has some glimmer of an insight, and the human "gets it", and writes out 20 example paragraphs that rely on that insight, and then fine-tunes GPT-N on those paragraphs. Now GPT-N has that insight incorporated into its weights, and we go back, with the human trying to coax GPT-N into having more insights, and repeat.

I dunno, maybe. Just brainstorming. :-)

Comment by steve2152 on Can you get AGI from a Transformer? · 2020-08-14T13:16:00.333Z · score: 4 (2 votes) · LW · GW

Yeah, probably. I gave this simple example where they build 10 VAEs to function as 10 generative models, each of which is based on a very typical deep neural network. The inference algorithm is still a bit different from a typical MNIST model, because the answer is not directly output, but comes from MAP inference, or something like that.
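Here is a minimal sketch of the "classify by asking which generative model explains the input best" idea. To keep it self-contained I swap the VAEs for 1-D Gaussians fit to made-up data, so this is an illustration of the inference pattern, not of the actual example.

```python
import math


def fit_gaussian(samples):
    """Fit a 1-D Gaussian; a toy stand-in for training one generative model."""
    mean = sum(samples) / len(samples)
    var = sum((x - mean) ** 2 for x in samples) / len(samples)
    return mean, max(var, 1e-6)


def log_likelihood(x, model):
    mean, var = model
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def classify(x, models):
    """Pick the label whose generative model explains x best.

    With equal priors over labels, this maximum-likelihood choice is the MAP label.
    """
    return max(models, key=lambda label: log_likelihood(x, models[label]))


# Made-up training data: one generative model per class.
training = {"A": [0.9, 1.1, 1.0, 1.2], "B": [4.8, 5.1, 5.0, 5.3]}
models = {label: fit_gaussian(xs) for label, xs in training.items()}
print(classify(1.3, models), classify(4.5, models))
```

Note that no network ever outputs a label directly; the label falls out of comparing how well each model accounts for the input.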

I don't think that particular approach is scalable because there's a combinatorial explosion of possible things in the world, which need to be matched by a combinatorial explosion of possible generative models to predict them. So you need an ability to glue together models ("compositionality", although it's possible that I'm misusing that term). For example, compositionality in time ("Model A happens, and then Model B happens"), or compositionality in space ("Model A and Model B are both active, with a certain spatial relation"), or compositionality in features ("Model A is predicting the object's texture and Model B is predicting its shape and Model C is predicting its behavior"), etc.

(In addition to being able to glue them together, you also need an algorithm that searches through the space of possible ways to glue them together, to find the right glued-together generative model that fits a certain input, in a computationally-efficient way.)

It's not immediately obvious how to take typical deep neural network generative models and glue them together like that. Of course, I'm sure there are about 10 grillion papers on exactly that topic that I haven't read. So I don't know, maybe it's possible. 

What I have been reading is papers trying to work out how the neocortex does it. My favorite examples for vision are probably currently this one from Dileep George and this one from Randall O'Reilly. Note that the two are not straightforwardly compatible with each other—this is not a well-developed field, but rather lots of insights that are gradually getting woven together into a coherent whole. Or at least that's how it feels to me.

Are these neocortical models "deep neural networks"?

Well, they're "neural" in a certain literal sense :-) I think the neurons in those two papers are different but not wildly different from the "neurons" in PyTorch models, more-or-less using the translation "spike frequency in biological neurons" <--> "activation of PyTorch 'neurons'". However, this paper proposes a computation done by a single biological neuron which would definitely require quite a few PyTorch 'neurons' to imitate. They propose that this computation is important for learning temporal sequences, which is one form of compositionality, and I suspect it's useful for the other types of compositionality as well.

They're "deep" in the sense of "at least some hierarchy, though typically 2-5 layers (I think) not 50, and the hierarchy is very loose, with lots of lateral and layer-skipping and backwards connections". I heard a theory that the reason that ResNets need 50+ layers to do something vaguely analogous to what the brain does in ~5 (loose) layers is that the brain has all these recurrent connections, and you can unroll a recurrent network into a feedforward network with more layers. Plus the fact that one biological neuron is more complicated than one PyTorch neuron. I don't really know though...
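The unrolling part, at least, is easy to show in a toy sketch. Below, a single scalar ReLU unit stands in for a whole layer (my simplification): running it recurrently for 5 steps gives exactly the same answer as a 5-layer feedforward stack whose layers happen to share weights.

```python
def layer(x, w, b):
    """One ReLU unit applied to a scalar; a toy stand-in for a full layer."""
    return max(0.0, w * x + b)


def recurrent(x, w, b, steps):
    """Apply the same layer repeatedly, feeding its output back in."""
    for _ in range(steps):
        x = layer(x, w, b)
    return x


def unrolled_feedforward(x, layers):
    """A plain feedforward stack: one (w, b) pair per layer, no loops back."""
    for w, b in layers:
        x = layer(x, w, b)
    return x


w, b, steps = 0.5, 1.0, 5
stack = [(w, b)] * steps  # 5 layers that happen to share weights
assert recurrent(3.0, w, b, steps) == unrolled_feedforward(3.0, stack)
```

A trained ResNet is of course free to use different weights in each layer, so this only shows that the feedforward network can imitate the recurrent one, not that it does.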

Comment by steve2152 on Matt Botvinick on the spontaneous emergence of learning algorithms · 2020-08-13T22:13:40.129Z · score: 7 (4 votes) · LW · GW

your point about the distinction between "mesa" and "steered" is chiefly that in the latter case, the inner layer is continually receiving reward signal from the outer layer, which in effect heavily restricts the space of possible algorithms the outer layer might give rise to. Does that seem like a decent paraphrase?

Yeah, that's part of it, but also I tend to be a bit skeptical that a performance-competitive optimizer will spontaneously develop, as opposed to being programmed—just as AlphaGo does MCTS because DeepMind programmed it to do MCTS, not because it was running a generic RNN that discovered MCTS. See also this.

I feel confused about what portion of the concepts currently active in my working memory while writing this paragraph might be labeled by DA

Right now I'm kinda close to "More-or-less every thought I think has higher DA-related reward prediction than other potential thoughts I could have thought."  But it's a vanishing fraction of cases where there is "ground truth" for that reward prediction that comes from outside of the neocortex. There is "ground truth" for things like pain and fear-of-heights, but not for thinking to yourself "hey, that's a clever turn of phrase" when you're writing. (The neocortex is the only place that understands language, in this example.)

Ultimately I think everything has to come from subcortex-provided "ground truth" on what is or isn't rewarding, but the neocortex can get the idea that Concept X is an appropriate proxy / instrumental goal associated with some subcortex-provided reward, and then it goes and labels Concept X as inherently desirable, and searches for actions / thoughts that will activate Concept X.

There's still usually some sporadic "ground truth", e.g. you have an innate desire for social approval and I think the subcortex has ways to figure out when you do or don't get social approval, so if your "clever turns of phrase" never impress anyone, you might eventually stop trying to come up with them. But if you're a hermit writing a book, the neocortex might be spinning for years treating "come up with clever turns of phrase" as an important goal, without any external subcortex-provided information to ground that goal.

See here for more on this, if you're not sick of my endless self-citations yet. :-) 

Sorry if any of this is wrong, or missing your point.

Also, I'm probably revealing that I never actually read Wang et al. very carefully :-P I think I skimmed it a year ago and liked it, and then re-read it 3 months ago having developed more opinions about the brain, and didn't really like it that time, and then listened to that interview recently and still felt the same way.

Comment by steve2152 on Many-worlds versus discrete knowledge · 2020-08-13T20:56:58.051Z · score: 11 (7 votes) · LW · GW

Many worlds plus a location tag is the Bohm interpretation.

Really? I don't think I agree with that. In many-worlds, you can say "The photon passed through the apparatus in the branch of the wavefunction I find myself in", and you can also say "The photon did not pass through the apparatus in other branches of the wavefunction that I do not find myself in". The Bohm interpretation would reject the latter.

And if the measurement just happened on Earth, but you're 4 lightyears away near Alpha Centauri, space-like-separated from the measurement, you can say "The photon passed through the apparatus in some branches of the wavefunction but not others. Right now, it is not yet determined which kind of branch I will eventually find myself in. But ~4 years from now (at the soonest), there will be a fact of the matter about whether I am in a photon-passed-through-the-apparatus branch of the wavefunction or not, even if nobody tells me."

The Bohm interpretation would reject that quote, and say there is a fact of the matter about measurements from which you are space-like-separated.

You need theory for how locations evolve into other locations

I understand you as saying "you're in some branch of the wavefunction now, and you'll be in some branch of the wavefunction tomorrow, and you need a theory relating those". I would say: That theory is the Schrödinger equation (also keeping in mind quantum decoherence theory, which is a consequence of that). Plus the postulate that you will find yourself in any given branch of the wavefunction with a probability proportional to its squared absolute amplitude. (And see also "consistent histories".) Is something missing from that?
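For concreteness, the probability postulate is just this (a sketch with two made-up branch amplitudes):

```python
def branch_probabilities(amplitudes):
    """Born rule: each branch's probability is |amplitude|^2, normalized."""
    weights = [abs(a) ** 2 for a in amplitudes]
    total = sum(weights)
    return [w / total for w in weights]


# Two made-up branches with complex amplitudes 0.6 and 0.8i.
probs = branch_probabilities([0.6, 0.8j])
print(probs)
```

The phases of the amplitudes drop out here; they only matter through the cross-terms, which is exactly what decoherence makes negligible.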

Comment by steve2152 on Many-worlds versus discrete knowledge · 2020-08-13T20:01:19.584Z · score: 9 (6 votes) · LW · GW

I think in many-worlds you have to say things like "A photon passed through the measurement apparatus, in the branch of the wavefunction where we're speaking."

The more general idea is: you have a fact with a "location tag" of where that fact is true.

In a more everyday example, I can say "The temperature is 29° here". The word "here" tags a time and place.

Or in the other direction, consider "3 quarks can bind together into a proton". This seems to be a universal fact that therefore doesn't need any "location tag"... But is it really? No! In some plausible theories of physics, there are many universes (and this is not related to quantum many-worlds), and they all have different fundamental constants, and in some of these universes, 3 quarks cannot bind together into a proton. So you should really say "3 quarks can bind together into a proton (in the universe where I'm speaking)".

So I would just go with the paradigm of "many facts need to come with a 'location tag' specifying where that fact is meant to be valid", and if you're OK with that, quantum many-worlds is fine.

Sorry if I'm misunderstanding your point. I am an expert on QM but not a mathematical and philosophical expert ;-)

Comment by steve2152 on Matt Botvinick on the spontaneous emergence of learning algorithms · 2020-08-13T10:22:05.326Z · score: 4 (2 votes) · LW · GW

I dunno, I didn't really like the meta-RL paper. Maybe it has merits I'm not seeing. But I didn't find the main analogy helpful. I also don't think "mesa-optimizer" is a good description of the brain at this level (i.e., not the level involving evolution). I prefer "steered optimizer", for what it's worth. :-)

Comment by steve2152 on How much is known about the "inference rules" of logical induction? · 2020-08-10T12:10:47.741Z · score: 3 (2 votes) · LW · GW

Sorry if these are stupid questions...

logical induction can't be sure if it's operating in a nonstandard context or not.

The question specified "all my variables are implicitly natural numbers". Why can't there be traders that specialize on questions specifically about standard numbers and ignore others? (I assume that the natural numbers are standard numbers, correct?) Also, what's the connection between nonstandard numbers and your Godel-like proof?

If you have any procedure that can perfectly distinguish between true and false statements within the structure of PA, then you can make a "this statement is false" and get a contradiction.

I think it would be fun to find a concrete example... here's my attempt...

$G$ = "the logical inductor will assign probability <0.5 to $G$ in the limit of infinitely many steps"

$G_n$ = "the logical inductor will assign probability <0.5 to $G$ after $n$ inference steps"

so $G = \lim_{n\to\infty} G_n$.

Then maybe any individual $G_n$ will be true, but the algorithm will never assign >0.5 probability to $G$.

Did I get that right? I'm very out of practice with my self-referential math so I have low confidence. :-P

Comment by steve2152 on Can we expect more value from AI alignment than from an ASI with the goal of running alternate trajectories of our universe? · 2020-08-09T12:38:35.235Z · score: 3 (2 votes) · LW · GW

I think I would make some modifications to your proposal to make it more realistic.

First, I don't know if you intended this, but "simulating the universe" carries a connotation of a low-level physics simulation. This is computationally impossible. Let's have it model the universe instead, using the same kind of high-level pattern recognition that people use to predict the future.

Second, if the AGI is simulating itself, the predictions are wildly underdetermined; it can predict that it will do X, and then fulfill its own prophecy by actually doing X, for any X. Let's have it model a counterfactual world with no AGIs in it.

Third, you need some kind of interface. Maybe you type in "I'm interested in future scenarios in which somebody cures Alzheimer's and writes a scientific article describing what they did. What is the text of that article?" and then it runs through a bunch of possible futures and prints out its best-guess article in the first 50 futures it finds in which the prompted premise comes true. (Maybe also print out a retrospective article from 20 years later about the long-term repercussions of the invention.) For a different type of interface, see microscope AI.

If this is no longer related to your question, I apologize! But if you're still with me, we can ask two questions about this proposal:

First, do we know how to do this kind of thing safely? I think that's an open problem. See, for example, my self-supervised learning and manipulative predictions for one thing that might (or might not!) go wrong in seemingly-harmless versions of this type of system. Since writing that, I've been feeling even more pessimistic because, to make the system really work well, I think we might have to put in various kinds of self-awareness and curiosity and other motivations that make for much more obvious types of risks.

Second, if we do make such a system, does it do the things we want AGI to do? Well, I think it's a bit complicated. If we use the system directly, it always needs a human in the loop, whereas there are a lot of things that people might want AGIs to do directly, like drive cars or do other boring and/or safety-critical jobs. On the other hand, we could bootstrap by having the prediction system help us design a safe and aligned agential AGI. I also wrote about this last year at In defense of Oracle ("Tool") AI research and see also the comments.

Comment by steve2152 on Three mental images from thinking about AGI debate & corrigibility · 2020-08-07T15:16:00.396Z · score: 2 (1 votes) · LW · GW

Like, it's not the argument that corrigibility is a stable attractor; it's an argument that corrigibility is a stable attractor with no nearby attractors. (At least in the dimensions that it's 'broad' in.)

Just want to echo Rohin in saying that this is a very helpful distinction, thanks!

I was actually making the stronger argument that it's not a stable attractor at all—at least not until someone solves the problem of how to maintain stable goals / motivations under learning / reflecting / ontological crises.

(The "someone" who solves the problem could be the AI, but it seems to be a hard problem even for human-level intelligence; cf. my comment here.)

Comment by steve2152 on Three mental images from thinking about AGI debate & corrigibility · 2020-08-05T18:26:31.905Z · score: 4 (2 votes) · LW · GW

I usually think of goal stability as something that improves as the agent becomes more intelligent; to the extent that a goal isn't stable we treat it as a failure of capabilities.

Well, sure, you can call it that. It seems a bit misleading to me, in the sense that usually "failure of capabilities" implies "If we can make more capable AIs, the problem goes away". Here, the question is whether "smart enough to figure out how to keep its goals stable" comes before or after "smart enough to be dangerous if its goals drift" during the learning process. If we develop approaches to make more capable AIs, that's not necessarily helpful for switching the order of which of those two milestones happens first. Maybe there's some solution related to careful cultivation of differential capabilities. But I would still much rather that we humans solve the problem in advance (or prove that it's unsolvable). :-P

if you have a 95% correct definition of corrigibility the resulting agent will help us get to the 100% version.

I guess my response would be that something pursuing a goal of Always do what the supervisor wants me to do* [*...but I don't want to cause the extinction of Amazonian frogs] might naively seem to be >99.9% corrigible (the Amazonian frogs thing is very unlikely to ever come up!), but it is definitely not corrigible, and it will work to undermine the supervisor's efforts to make it 100% corrigible. Maybe we should say that this system is actually 0% corrigible? Anyway, I accept that there is some definition of "95% corrigible" for which it's true that "a 95% corrigible agent will help us make it 100% corrigible". I think that finding such a definition would be super-useful. :-)

Comment by steve2152 on Three mental images from thinking about AGI debate & corrigibility · 2020-08-05T13:57:43.779Z · score: 5 (3 votes) · LW · GW

Yes I definitely feel that "goal stability upon learning/reflection" is a general AGI safety problem, not specifically a corrigibility problem. I bring it up in reference to corrigibility because my impression is that "corrigibility is a broad basin of attraction" / "corrigible agents want to stay corrigible" is supposed to solve that problem, but I don't think it does.

I don't think "incorrect beliefs" is a good characterization of the story I was trying to tell, or is a particularly worrisome failure mode. I think it's relatively straightforward to make an AGI which has fewer and fewer incorrect beliefs over time. But I don't think that eliminates the problem. In my "friend" story, the AI never actually believes, as a factual matter, that S will always like B—or else it would feel no pull to stop unconditionally following S. I would characterize it instead as: "The AI has a preexisting instinct which interacts with a revised conceptual model of the world when it learns and integrates new information, and the result is a small unforeseen shift in the AI's goals."

I also don't think "trying to have stable goals" is the difficulty. Not only corrigible agents but almost any agent with goals is (almost) guaranteed to be trying to have stable goals. I just think that keeping stable goals while learning / reflecting is difficult, such that an agent might be trying to do so but fail.

This is especially true if the agent is constructed in the "default" way wherein its actions come out of a complicated tangle of instincts and preferences and habits and beliefs.

It's like you're this big messy machine, and every time you learn a new fact or think a new thought, you're giving the machine a kick, and hoping it will keep driving in the same direction. If you're more specifically rethinking concepts directly underlying your core goals—e.g. thinking about God or philosophy for people, or thinking about the fundamental nature of human preferences for corrigible AIs—it's even worse ... You're whacking the machine with a sledgehammer and hoping it keeps driving in the same direction.

The default is that, over time, when you keep kicking and sledgehammering the machine, it winds up driving in a different, a priori unpredictable, direction. Unless something prevents that. What are the candidates for preventing that?

  • Foresight, plus desire to not have your goals change. I think this is core to people's optimism about corrigibility being stable, and this is the category that I want to question. I just don't think that's sufficient to solve the problem. The problem is, you don't know what thoughts you're going to think until you've thought them, and you don't know what you're going to learn until you learn it, and once you've already done the thinking / learning, it's too late: if your goals have shifted, then you don't want to shift them back. I'm a human-level intelligence (I would like to think!), and I care about reducing suffering right now, and I really really want to still care about reducing suffering 10 years from now. But I have no idea how to guarantee that that actually happens. And if you gave me root access to my brain, I still wouldn't know ... except for the obvious thing of "don't think any new thoughts or learn any new information for the next 10 years", which of course has a competitiveness problem. I can think of lots of strategies that would make it more probable that I still care about reducing suffering in ten years, but that's just slowing down the goal drift, not stopping it. (Examples: "don't read consciousness-illusionist literature", "don't read nihilist literature", "don't read proselytizing literature", etc.) It's just a hard problem. We can hope that the AI becomes smart enough to solve the problem before it becomes so smart that it's dangerous, but that's just a hope.
  • "Monitoring subsystem" that never changes. For example, you could have a subsystem which is a learning algorithm, and a separate fixed subsystem that calculates corrigibility (using a hand-coded formula) and disallows changes that reduce it. Or I could cache my current brain-state ("Steve 2020"), wake it up from time to time and show it what "Steve 2025" or "Steve 2030" is up to, and give "Steve 2020" the right to roll back any changes if it judges them harmful. Or who knows what else. I don't rule out that something like this could work, and I'm all for thinking along those lines.
  • Some kind of non-messy architecture such that we can reason in general about the algorithm's learning / update procedure and prove in general that it preserves goals. I don't know how to do that, but maybe it's possible. Maybe that's part of what MIRI is doing.
  • Give up, and pursue some other approach to AGI that makes "goal stability upon learning / reflection" a non-issue, or a low-stakes issue, as in my earlier comment.
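To illustrate just the control flow of the "monitoring subsystem" option in the list above, here is a toy sketch. The scoring function is an entirely made-up stand-in for the hand-coded formula (nobody knows how to write a real one), and the parameter names are hypothetical.

```python
from typing import Callable, Dict

Params = Dict[str, float]


def make_guarded_updater(score: Callable[[Params], float]):
    """Return an update function that rejects any change lowering `score`.

    `score` plays the role of the fixed monitor; it is never itself updated.
    """
    def apply_update(current: Params, proposed: Params) -> Params:
        if score(proposed) < score(current):
            return current  # roll back: the monitor vetoes this change
        return proposed
    return apply_update


# Entirely made-up stand-in for a "corrigibility formula".
def toy_score(params: Params) -> float:
    return params["defer_to_supervisor"]


update = make_guarded_updater(toy_score)
state = {"defer_to_supervisor": 0.9, "skill": 0.1}
state = update(state, {"defer_to_supervisor": 0.9, "skill": 0.5})  # accepted
state = update(state, {"defer_to_supervisor": 0.4, "skill": 0.9})  # vetoed
```

The learner can change anything it likes as long as the frozen monitor's score does not drop. All of the difficulty is hidden in whether a fixed formula like `toy_score` can keep meaning what you wanted as the learner's concepts shift underneath it.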
Comment by steve2152 on Three mental images from thinking about AGI debate & corrigibility · 2020-08-05T02:05:31.478Z · score: 3 (2 votes) · LW · GW

I guess my issue is that corrigibility is an exogenous specification; you're not just saying "the algorithm goes to a fixed point" but rather "the algorithm goes to this particular pre-specified point, and it is a fixed point". If I pick a longitude and latitude with a random number generator, it's unlikely to be the bottom of a valley. Or maybe this analogy is not helpful and we should just be talking about corrigibility directly :-P