Comment by Jon Garcia on [Intro to brain-like-AGI safety] 1. What's the problem & Why work on it now? · 2022-01-27T05:22:33.743Z · LW · GW

Great introduction. As someone with a background in computational neuroscience, I'm really looking forward to what you have to say on all this.

By the way, you seem to use a very broad umbrella for covering "brain-like-AGI" approaches. Would you classify something like a predictive coding network as more brain-like or more prosaic? What about a neural Turing machine? In other words, do you have a distinct boundary in mind for separating the red box from the blue box, or is your classification more fuzzy (or does it not matter)?

Comment by Jon Garcia on Why do we need a NEW philosophy of progress? · 2022-01-25T20:56:17.581Z · LW · GW

I agree. Progress ought to be seen as a project that humanity collectively works towards, not an inevitable consequence of Natural Law nor a corrupting addiction to be resisted.

For a new vision of progress, I think we need something like a Gaia Initiative, like the Gaia Hypothesis except that instead of presuming that the Earth functions like a giant superorganism, we frame that as a goal that humanity should strive to achieve. To that end, if we could solve alignment, I think that AGI could be used primarily as a tool for solving coordination problems, aiming to maximize sustainable cooperative harmony, both for humanity and for the global ecosystem.

Comment by Jon Garcia on Words are Imprecise Telepathy · 2022-01-24T14:38:57.033Z · LW · GW

It's not just knowledge that we telepathize with language. With declarative sentences, we transfer whole multimodal scenes from one mind to another. With interrogative sentences, we initiate database queries (memory retrieval) in other minds. And with imperative sentences, we copy over behavioral programs into the minds of other agents for execution.

Comment by Jon Garcia on Harry Potter and the Methods of Psychomagic | Chapter 3: Intelligence Explosions · 2022-01-21T01:46:50.650Z · LW · GW

The obvious first step toward initiating an intelligence explosion in this setting would be to find the Lost Diadem of Ravenclaw, even if all it does is accelerate processing speed and/or expand working memory capacity. Though it might be more complicated if it's part of Tom Riddle's enhanced horcrux network, especially if it's one of the items he buried deep underground, sank in the ocean, hovered invisibly in the atmosphere, etc. I'm interested to see where you take this.

Comment by Jon Garcia on Thought Experiments Provide a Third Anchor · 2022-01-19T03:45:38.609Z · LW · GW

Thanks for the links!

Comment by Jon Garcia on Thought Experiments Provide a Third Anchor · 2022-01-19T03:01:45.476Z · LW · GW

There has been work on constructing adversarial examples for human brains, and some interesting demonstrations of considerable neural-level control even with our extremely limited ability to observe brains

Do you have a source for this? I would be interested in looking into it. I could see this happening for isolated neurons, at least, but I would be curious whether it could happen for whole circuits in vivo.

Does this go beyond just manipulating how our brains process optical illusions? I don't see how the brain would perceive the type of pixel-level adversarial perturbations most of us think of as anything other than noise, if it even reaches past the threshold of perception at all. The sorts of illusions humans fall prey to are qualitatively different, taking advantage of our perceptual assumptions, like structural continuity, color shifts under changing lighting conditions, or 3-dimensionality. We don't tend to go from making good guesses about what something is to being wildly, confidently incorrect when the texture changes microscopically.
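As a toy, purely hypothetical illustration of why microscopic changes can matter so much to a network even when no single change is perceptible: in a high-dimensional linear model, a per-dimension nudge far below any noise floor adds up coherently along the weight vector (all weights and inputs here are invented):

```python
# Toy linear "classifier": score is a dot product of weights and input.
def score(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

dim, eps = 10000, 0.01
w = [1.0 if i % 2 == 0 else -1.0 for i in range(dim)]  # made-up weights
x = [0.0] * dim                                        # "clean" input, score 0
# FGSM-style worst case: nudge each dimension by eps in the sign of its weight.
x_adv = [xi + eps * (1 if wi > 0 else -1) for wi, xi in zip(w, x)]
# Each coordinate moved only 0.01, but the score jumps by eps * dim = 100.
print(score(w, x), round(score(w, x_adv), 6))
```

The per-pixel change is tiny, but because every dimension conspires, the output swings enormously; nothing in human perception seems to aggregate noise-level inputs this way.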

My guess would be that you could get rid of a lot of adversarial susceptibility in DL systems by adding in the right kind of recurrent connectivity (as in predictive coding, where hypotheses about what the network is looking at help it to interpret low-level features), or even by finding a less extremizing nonlinearity than ReLU. Such changes might get us closer to how the brain does things.

Overparameterization, such as through making the network arbitrarily deep, might be able to get you around some of these limitations eventually (just like a fully connected NN can do the same thing as a CNN in principle), but I think we'll have to change how we design neural networks at a fundamental level in order to avoid these issues more effectively in the long term.

Comment by Jon Garcia on Thought Experiments Provide a Third Anchor · 2022-01-18T21:29:02.612Z · LW · GW

To an extent that's true. There are certainly some similarities in how human brains work and how deep learning works, if for no other reason than that DL uses a connectionist approach to AI, which has given narrow AIs something like an intuition, rather than the hard-coded rules of GOFAI. And yes, once we start developing goal-oriented artificial agents, humans will remain, for a long time, the best model we have for approaching an understanding of them.

However, remember how susceptible current DL models can be to adversarial examples, even when the adversarial examples have no perceptible difference from non-adversarial examples as far as humans can tell. That means that something is going on in DL systems that is qualitatively much different from how human brains process information. Something that makes them fragile in a way that is hard to anthropomorphize. Something alien.

And then there is the orthogonality thesis. Even though humans are the best example we currently have of general intelligence, there is no reason to think that the first AGIs will have goal/value structures any less alien to humans than would a superintelligent spider. Anthropomorphization of such systems carries the risk of assuming too much about how they think or what they want, where we miss critical points of misalignment.

Comment by Jon Garcia on Truthful LMs as a warm-up for aligned AGI · 2022-01-17T21:41:55.187Z · LW · GW

I like this approach to alignment research. Getting AIs to be robustly truthful (producing language output that is consistent with their best models of reality, modulo uncertainty) seems like it falls in the same space as getting them to keep their goals consistent with their best estimates of human goals and values.

As for avoiding negligent falsehoods, I think it will be crucial for the AI to have explicit estimates of its uncertainty for anything it might try to say. To a first approximation, assuming the system can project statements to a consistent conceptual space, it could predict the variance in the distribution of opinions in its training data around any particular issue. Then it could use this estimate of uncertainty to decide whether to state something confidently, to add caveats to what it says, or to turn it into a question for the interlocutor.
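A minimal sketch of what that decision rule might look like, treating each training-data opinion as a scalar stance and using made-up variance thresholds (the function name and cutoffs are invented for illustration):

```python
import statistics

def response_mode(opinion_samples, confident_sd=0.1, hedge_sd=0.3):
    """Pick a speech act based on the spread of opinions in the training data.
    Thresholds are arbitrary placeholders, not calibrated values."""
    sd = statistics.pstdev(opinion_samples)
    if sd < confident_sd:
        return "assert"           # state it confidently
    elif sd < hedge_sd:
        return "assert_with_caveats"
    return "ask_interlocutor"     # turn it into a question

print(response_mode([0.9, 0.92, 0.91]))  # near-consensus -> assert
print(response_mode([0.2, 0.5, 0.8]))    # moderate spread -> caveats
print(response_mode([0.1, 0.5, 0.9]))    # wide disagreement -> ask
```

A real system would need the "project statements to a consistent conceptual space" step to produce those samples in the first place; this only sketches the final branch.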

Comment by Jon Garcia on [Linkpost] [Fun] CDC To Send Pamphlet On Probabilistic Thinking · 2022-01-15T03:18:10.647Z · LW · GW

Walensky added that if Americans took away one easy lesson from the pamphlet, she hoped it would be P(H|E) = (P(E|H) * P(H)) / P(E).

Seriously, do this.
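For anyone who wants to check the pamphlet's formula on a toy example (all numbers invented, in the spirit of a disease-screening test with a 1% base rate, 95% sensitivity, and a 5% false-positive rate):

```python
p_h = 0.01              # prior: P(H)
p_e_given_h = 0.95      # likelihood: P(E|H)
p_e_given_not_h = 0.05  # false-positive rate: P(E|~H)

# Total probability of the evidence: P(E) = P(E|H)P(H) + P(E|~H)P(~H)
p_e = p_e_given_h * p_h + p_e_given_not_h * (1 - p_h)

# Bayes' rule: P(H|E) = P(E|H) * P(H) / P(E)
p_h_given_e = (p_e_given_h * p_h) / p_e
print(round(p_h_given_e, 3))  # -> 0.161
```

Even with a strongly positive test, the posterior stays modest because the prior is low, which is presumably the one easy lesson.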

Comment by Jon Garcia on Goal-directedness: my baseline beliefs · 2022-01-14T18:59:53.920Z · LW · GW

Maybe I could try to disentangle competence from goal-directedness in what I wrote. The main idea that I was trying to push in that paragraph is that there is more to goal-directed behavior in real animals than just movement toward a goal state. There is also (attempted) movement away from anti-goal states and around obstacle states.

An example of the former could be a zebra seeing a bunch of crocodiles congregated by the bank of the Nile and deciding not to cross the river today (unfortunately, it later got chased down and eaten by a lion due to the zebra's incompetence at evading all anti-goal states).

An example of the latter could be a golfer veering his swing slightly to the right to avoid the sand traps on the left (unfortunately, the ball ended up landing in the pond instead due to the golfer's incompetence at avoiding all obstacle states).

Anti-goals and obstacles act as repulsor states, complementing the attractor states known as goals, redirecting the flow of behavior to maximize the chances of survival and of reaching the actual goals.

As to the latter part of that paragraph, I think policy-selection for single goals and goal-selection more generally are important for enabling systems to exhibit flexible behavior. Someone in a recent thread brought up some interesting research on goal selection (more like goal pruning) in animals that could be worth looking into.

Comment by Jon Garcia on Question Gravity · 2022-01-13T22:40:02.037Z · LW · GW

Before reading your solution, my guess would be that the center of gravity was always over the edge. It's just that when the bottle was empty, the adhesive forces of the water linking the bottle to the shower door were sufficient to hold it back from tipping.

Edit: It looks like we reached the same conclusion.

Yes, it's definitely a skill that we all need to develop to be able to identify all the assumptions we make in our reasoning. Thanks for the exemplifying post.

Comment by Jon Garcia on (briefly) RaDVaC and SMTM, two things we should be doing · 2022-01-13T22:25:27.505Z · LW · GW

Other types of causes I would like to see more groups working on and more people supporting are those that would help make human communities more robust against civilizational collapse (minus scenarios where we're all turned into paperclips, of course). Right now, billions of humans are utterly dependent on global economic infrastructure to supply food, water, energy, shelter, etc. If some event breaks down this infrastructure, billions could die, since not only do most people lack survival skills, but the local resources in most areas of high population density are not sufficient to provide enough for (even a small fraction of) everyone. Ideally, every local community, from small villages to sprawling metropolises, could become locally self-sustaining to the point where getting cut off from the global economy would lead to the loss of luxury items and foods rather than to mass starvation.

Things that come to mind include the Global Village Construction Set, an open-source set of 50 blueprints for technologies explicitly designed to require minimal material and manufacturing resources to construct but that could be used to rebuild civilization in the event of collapse.

Food production is also a big issue, of course, and efforts to provide year-round local produce using geothermally-regulated greenhouses or vertical farming could help immensely to minimize costs of both production and distribution for communities that use them at sufficient scale. (One limitation to hydroponics/aeroponics that currently outweighs their resource-efficiency is the fact that they tend to focus on just salad greens, which are the easiest to grow. Are there any groups working on genetically engineering fruit trees to produce fruit without the trees?)

Mycelium is another group working on providing automation for community gardens, as well as more effective waste-handling and housing technologies.

I don't know how effective these projects, specifically, would prove to be if given enough funding, but I feel like they are reaching in the right direction. Local, resource-efficient sustainability technology seems like one of the most impactful areas to focus on for ensuring humanity's long-term survival, after things like AI alignment. An additional benefit of putting more effort into these sorts of projects is that they could also be applied to helping people who are struggling in "third world" countries today. And if you could get self-sufficiency technology good enough to work in deserts or on ice sheets, we could also apply it to supporting space colonies. Maybe Elon Musk could be convinced to invest more in this area.

Comment by Jon Garcia on Value extrapolation partially resolves symbol grounding · 2022-01-12T23:55:10.555Z · LW · GW

This might have a better chance of working if you give the AI strong inductive biases to perceive rewarding stimuli not as intrinsically rewarding in themselves, but rather as evidence of something of value happening that merely generates the stimulus. We want the AI to see smiles as good things, not because upwardly curving mouths are good, but because they are generated by good mental states that we actually value.

For that, it would need a generative model (likelihood function) whose hidden states (hypotheses) are mapped to its value function, rather than mapping the sensory data (evidence) directly. The latter type of feed-forward mapping, which most current deep learning is all about, can only offer cheap heuristics, as far as I can tell.

The AI should also privilege hypotheses about the true nature of the value function where the mental/physiological states of eudaimonic agents are the key predictors. With your happy-video example, the existence of the video player and video files is sufficient to explain its training data, but that hypothesis (R_2) does not involve other agents. R_1, on the other hand, does assume that the happiness of other agents is what predicts both its input data and its value labels, so it should be given priority. (For this to work, it would also need a model of human emotional expression to compare its sensory data against in the likelihood function.)

Comment by Jon Garcia on Calibration proverbs · 2022-01-11T06:09:03.026Z · LW · GW

When all the voices you've chosen to trust speak in unison, it's time to get your confirmation bias debunked by someone.

Comment by Jon Garcia on Let's Stop Pretending to Be Original Thinkers · 2022-01-09T15:04:40.692Z · LW · GW

This incessant push for originality is probably largely what drives imposter syndrome in academia. From the outside, our work may even look brilliant to laypeople, but from the inside, we know how painfully derivative all our ideas really are.

Comment by Jon Garcia on Goal-directedness: my baseline beliefs · 2022-01-08T14:32:00.138Z · LW · GW

I'm very interested to see what you discover or what framework for deconfusion you arrive at with this research project.

My own take on goal-directedness, to a first approximation, is that it is the property of a system that allows it to consistently arrive at a narrow region of state space from a broad range of starting points. The tighter the distribution of steady states, and the wider the distribution of starting states that allow the system to reach them, the more goal-directed it is. A system where a ball is always rolling down to the center of a basin could be considered more goal-directed than a ball rolling around a random landscape, for instance.
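That basin intuition is easy to simulate. Here is a rough sketch (the landscape, step size, and ranges are all illustrative) in which a broad distribution of starting points collapses to a narrow distribution of final states:

```python
import random

def final_state(x0, landscape_grad, steps=200, lr=0.1):
    """Roll a 'ball' downhill for a fixed number of steps."""
    x = x0
    for _ in range(steps):
        x -= lr * landscape_grad(x)
    return x

random.seed(0)
starts = [random.uniform(-10, 10) for _ in range(100)]  # broad starting range
# Basin landscape: the gradient of x^2 pulls everything to the attractor at 0.
basin_finals = [final_state(x0, lambda x: 2 * x) for x0 in starts]
spread = max(basin_finals) - min(basin_finals)
print(spread)  # vanishingly small: wide starts, narrow endings
```

On this account, the ratio of starting spread to final spread is a crude scalar measure of goal-directedness; a ball on a random landscape would keep a wide final spread.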

These goal states could exist within the agent itself (i.e., homeostatic set points) or out in the external environment (e.g., states that maximize attainable utility, like collecting resources). They could also be represented either explicitly as patterns within the agent's mental model or implicitly within the structure of the agent's policy functions.

Another dimension to this could be the ability to avoid unexpected states that would prevent the achievement of goals (e.g., avoid predators or move around obstacles), or the ability to select actions, either choosing among multiple narrow policies in pursuit of a single goal or choosing among multiple goals in pursuit of utility (or a meta-goal).

Comment by Jon Garcia on You can't understand human agency without understanding amoeba agency · 2022-01-08T06:03:47.543Z · LW · GW

To be honest, when I talk about the "free energy principle", I typically have in mind a certain class of algorithmic implementations of it, involving generative models that use maximum likelihood estimation through online gradient descent to minimize their prediction errors.
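A minimal sketch of that class of implementation, reduced to a single-parameter generative model whose "belief" is updated online by gradient descent on squared prediction error (all numbers illustrative):

```python
def online_update(observations, mu=0.0, lr=0.05):
    """mu is the model's belief about the sensory signal. Each step descends
    the gradient of 0.5 * error**2, i.e., minimizes prediction error online."""
    for y in observations:
        error = y - mu       # prediction error ("surprise")
        mu += lr * error     # nudge the belief toward the observation
    return mu

# The belief converges toward the statistics of the environment (mean 3.0 here).
print(online_update([3.0] * 500))
```

A full free-energy-style agent would also act on the world to make its predictions come true; this shows only the perceptual half.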

Comment by Jon Garcia on Reductionism is not the ultimate tool for causal inference · 2022-01-08T05:21:42.994Z · LW · GW

And that's why the brain processes and encodes things hierarchically. There are statistical regularities in most systems at multiple levels of abstraction that we naturally take advantage of as we build up our intuitive mental models of the world.

It seems like there's a default level of abstraction that most humans work at when reasoning about causality. Somewhere above the level of raw pixels or individual sound frequencies (great artists should be able to deal competently at this level); somewhere below the level of global political, economic, and ecological phenomena (great leaders should be able to deal competently at this level); somewhere around the level of individual social interactions and simple tool use (just using the words "individual" and "simple" should indicate that this is the default level).

It takes a lot of scientific inquiry and accumulated knowledge to dig beneath these levels to a more reductionist view of reality. And even then, our brains simply lack the raw processing power to extract practical insights from running mental simulations of bedrock-level causality (not even supercomputers can do that for most problems of interest).

Abstraction is inescapable, even when we know that reality is technically reducible.

Comment by Jon Garcia on You can't understand human agency without understanding amoeba agency · 2022-01-08T01:30:22.947Z · LW · GW

The model may be implicit, but it's embedded in the structure of the whole thermostat system, from the thermometer that measures temperature to the heating and cooling systems that it controls. For instance, it "knows" that turning on the heat is the appropriate thing to do when the temperature it reads falls below its set point. There is an implication there that the heater causes temperature to rise, or that the AC causes it to fall, even though it's obviously not running simulations (unless it's a really good thermostat) on how the heating/cooling systems affect the dynamics of temperature fluctuations in the building.

The engineers did all the modeling beforehand, then built the thermostat to activate the heating and cooling systems in response to temperature fluctuations according to the rules that they precomputed. Evolution did just this in building the structure of the amoeba's gene networks and the suite of human instincts (heritable variation + natural selection is how information is transferred from the environment into a species' genome). Lived experience pushes further information from the environment to the internal model, upregulating or downregulating various genes in response to stimuli or learning to reinforce certain behaviors in certain contexts. But environmental information was already there in the structure to begin with, just like it is in more traditional artificial control systems.
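The thermostat's implicit model can be written out as exactly this kind of precomputed rule (the set point and dead band below are arbitrary): the assumption that heat raises temperature and cooling lowers it lives entirely in which branch fires, with no simulation anywhere.

```python
def thermostat_action(temp, set_point=21.0, band=0.5):
    """Precomputed rules encode the engineers' model: heating raises
    temperature, cooling lowers it. The thermostat never simulates either."""
    if temp < set_point - band:
        return "heat_on"
    elif temp > set_point + band:
        return "cool_on"
    return "idle"

print(thermostat_action(18.0))  # cold room -> heat_on
print(thermostat_action(25.0))  # hot room -> cool_on
print(thermostat_action(21.2))  # within the band -> idle
```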

The example with the hunter and pheasants was just to show how "regulating" (i.e., consistently achieving a state in the desirable set = pheasant successfully shot) requires the hunter to have a good mental model of the system (pheasant behavior, wind disturbances, etc.). Again, this model does not have to be explicit in general but could be completely innate.

Comment by Jon Garcia on You can't understand human agency without understanding amoeba agency · 2022-01-08T00:03:33.241Z · LW · GW

That's not unreasonable as a quick summary of the principle.

I would say there is more to what makes a living system alive than just following the free energy principle per se. For instance, the robot would also need to scavenge for material and energy resources to incorporate into itself for maintenance, repair, and/or reproduction. Just correcting its gait when thrown off balance allows it to minimize a sort of behavioral free energy, but that's not enough to count as alive.

But if you want to put amoebas and humans in the same qualitative category of "agency", then you need a framework that is general enough to capture the commonalities of interest. And yes, under such a broad umbrella, artificial control systems and dynamically balancing walking robots would be included.

The free energy principle applies to a lot of systems, not just living or agentic ones. I see it more as a way to systematize our approach to understanding a system or process than as an explanation in and of itself. By focusing on how a system maintains set points (e.g., homeostasis) and minimizes prediction error (e.g., unsupervised learning), I think we would be better positioned to figure out what real agents are actually doing, in a way that could inform both the design and alignment of AGI.

Comment by Jon Garcia on Infra-Bayesian physicalism: a formal theory of naturalized induction · 2022-01-06T17:02:44.758Z · LW · GW

Okay, so it's just a constraint on the final shape of the loss function. Would you construct such a loss function by integrating a strictly non-positive computation-value function over all of space and time (or at least over the future light-cones of all its copies, if it focuses just on the effects of its own behavior)?

Comment by Jon Garcia on You can't understand human agency without understanding amoeba agency · 2022-01-06T15:30:34.666Z · LW · GW

I think Friston's free energy principle has a lot to offer here in terms of generalizing agency to include everything from amoebas to humans (although, ironically, maybe not AIXI).

Basically, rational agents, and living systems more generally, are built to regulate themselves in resistance to entropy by minimizing the free energy (essentially, "conflict") between their a priori models of themselves as living or rational systems and what they actually experience through their senses. To do this well, they need to have some sort of internal model of their environment that they use for responding to changes in what they sense in order to increase the likelihood of their survival.

For human minds, this internal model is encoded in the neural circuitry of our brains. For amoebas, this internal model could be encoded in the states of the promoters and inhibitors in its gene regulatory network.

Comment by Jon Garcia on Infra-Bayesian physicalism: a formal theory of naturalized induction · 2022-01-06T12:34:10.405Z · LW · GW

The monotonicity principle requires it to be non-decreasing w.r.t. the manifesting of less facts. Roughly speaking, the more computations the universe runs, the better.

I think this is what I was missing. Thanks.

So, then, the monotonicity principle sets a baseline for the agent's loss function that corresponds to how much less stuff can happen to whatever subset of the universe it cares about, getting worse the fewer opportunities become available, due to death or some other kind of stifling. Then the agent's particular value function over universe-states gets added/subtracted on top of that, correct?

Comment by Jon Garcia on Infra-Bayesian physicalism: a formal theory of naturalized induction · 2022-01-06T00:48:00.272Z · LW · GW

Could you explain what the monotonicity principle is, without referring to any symbols or operators? I gathered that it is important, that it is problematic, that it is a necessary consequence of physicalism absent from cartesian models, and that it has something to do with the min-(across copies of an agent) max-(across destinies of the agent copy) loss. But I seem to have missed the content and context that makes sense of all of that, or even in what sense and over what space the loss function is being monotonic.

Your discussion section is good. I would like to see more of the same without all the math notation.

If you find that you need to use novel math notation to convey your ideas precisely, I would advise you to explain what every symbol, every operator, and every formula as a whole means every time you reference them. With all the new notation, I forgot what everything meant after the first time they were defined. If I had a year to familiarize myself with all the symbols and operators and their interpretations and applications, I imagine that this post would be much clearer.

That being said, I appreciate all the work you put into this. I can tell there's important stuff to glean here. I just need some help gleaning it.

Comment by Jon Garcia on The Map-Territory Distinction Creates Confusion · 2022-01-05T03:17:48.459Z · LW · GW

This is perhaps a somewhat subtle distinction, but the point is to shift as much as possible from assumption to inference. If we take an arealist stance and do not assume realism, we may still come to infer it based on the evidence we collect.

I think I can agree with this.

One caveat would be to note that the brain's map-making algorithm does make some implicit assumptions about the nature of the territory. For instance, it needs to assume that it's modeling an actual generative process with hierarchical and cross-modal statistical regularities. It further assumes, based on what I understand about how the cortex learns, that the territory has things like translational equivariance and spatiotemporally local causality.

The cortex (and cerebellum, hippocampus, etc.) has built-in structural and dynamical priors that it tries to map its sensory experiences to, which limits the hypothesis space that it searches when it infers things about the territory. In other words, it makes assumptions.

On the other hand, it is a bit of a theme around here that we should be able to overcome such cognitive biases when trying to understand reality. I think you're on the right track in trying to peel back the assumptions that evolution gave us (even the more seemingly rational ones like splitting map from territory) to ground our beliefs as solidly as possible.

Comment by Jon Garcia on Promising posts on AF that have fallen through the cracks · 2022-01-05T02:51:23.535Z · LW · GW

I remember seeing that post and really wanting to look into it further, but the dense, novel math notation made me postpone that until I forgot about it. (We also had a new baby recently, and I'm working on responding to reviewer comments on a paper I'm trying to publish, so that wasn't the only reason.) It could certainly have helped me grasp the gist of their work if someone had posted a comment at the time summarizing what they got out of it.

What if posts could be flagged by authors and/or the community as needing feedback or discussion? This could work something like pinning to the front page, except that the "pinnedness" could decay over time to make room for other posts while getting periodically refreshed.

I don't know what algorithm is used for sorting front-page posts (some combination of the recency of the post and its comments and the karma score it has received?), but you could add an extra term that has this "jump and decay" behavior. Perhaps the more community members flag it for discussion and feedback, the more often its front-page status gets refreshed. And the more comments it receives, the more the coefficient on this term goes to 0.

Thus, its position in the sorted queue of posts would become like that of a typical post once it has actually gotten feedback (or the author has indicated that the feedback is satisfactory). But it would keep coming back to the community's attention automatically until then.
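A hypothetical version of that extra term (every functional form and constant here is invented purely for illustration, since the actual sorting algorithm is unknown):

```python
import math

def sort_score(base_score, hours_since_flag_refresh, flag_weight, comment_count):
    """'Jump and decay' sketch: flagging adds a boost that decays with time
    since the last refresh and shrinks as the post accumulates comments."""
    decay = math.exp(-hours_since_flag_refresh / 48.0)  # made-up time constant
    damping = 1.0 / (1.0 + comment_count)               # feedback received
    return base_score + flag_weight * decay * damping

# A freshly flagged, uncommented post outranks the same post after feedback,
# and the boost fades on its own if no refresh happens.
print(sort_score(10, 0, 5, 0), sort_score(10, 0, 5, 10), sort_score(10, 100, 5, 0))
```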

Comment by Jon Garcia on The Map-Territory Distinction Creates Confusion · 2022-01-04T18:01:11.105Z · LW · GW

This is an interesting take on the map-territory distinction, and I agree in part. Thinking about it now, my only issues with the correspondence theory of truth would be

  1. that it might imply dualism, as you suggested, between a "map" partition of reality and a "territory" partition of reality, and
  2. that it might imply that the territory is something that can be objectively "checked" against the map by some magical method that transcends the process of observation that produced the map in the first place.

These implications, however, seem to be artifacts of human psychology that can be corrected for. As for the metaphysical assumption of an objective, external physical world, I don't see how you can really get around that.

It's true that the only way we get any glimpse at the territory is through our sensory experiences. However, the map that we build in response to this experience carries information that sets a lower bound on the causal complexity of the territory that generates it.

Both the map and the territory are generative processes. The territory generates our sensory experiences, while the map (or rather, the circuitry in our brains) uses something analogous to Bayesian inference to build generative models whose dynamics are predictive of our experiences. In so doing, the map takes on a causal structure that is necessary to predict the hierarchical statistical regularities that it observes in the dynamics of its experiences.

This structure is what is supposed to correspond to the territory. The outputs of the two generative processes (the sensations coming from the territory and the predictions coming from the map) are how correspondence is checked, but they are not the processes themselves.

In other words, the sensory experiences you talked about are Bayesian evidence for the true structure of the territory that generates them, not the territory itself.

Comment by Jon Garcia on We Choose To Align AI · 2022-01-03T21:22:55.876Z · LW · GW

Thanks for the link. That call-and-response was beautiful.

Comment by Jon Garcia on Open Thread - Jan 2022 [Vote Experiment!] · 2022-01-03T04:23:13.919Z · LW · GW

I like the idea of using the Open Thread for testing new karma systems.

Adding multidimensionality to it certainly seems like a good idea. In my experience, karma scores on comments seem to be correlated not just with the quality of the content but also with how well it aligns with the community narrative, its entertainment value, the prior status of the commenter, and even the timing of the comment relative to that of the post. Disentangling these would be helpful.

But then, what is it we really want karma to represent? If community members are not vigilant in how we rate things, the karma system is ripe for Goodharting. It's easy to feel compelled to try whatever it takes to get that karma counter to increment.

In my opinion, karma ought to represent how much a comment or post should be seen by other members of the community, in terms of how useful it is for promoting good ideas specifically or rational thinking/behavior generally. Upvotes/downvotes are only (somewhat weak) Bayesian evidence for or against this.

Comment by Jon Garcia on We Choose To Align AI · 2022-01-02T23:48:21.766Z · LW · GW

Yeah! Let's do this!

The part of me which looks at a rickety ladder 30 feet down into a dark tunnel and says “let’s go!” wants this. The part of me which looks at a cliff face with no clear path up and cracks its knuckles wants this.

Although, I would rather those working on AI alignment adopt a general policy of not descending rickety ladders into dark abysses or free-climbing sheer cliffs, just to avoid having the probability of AI catastrophe make a discontinuous jump upward after an exciting weekend trip.

Comment by Jon Garcia on Each reference class has its own end · 2022-01-02T17:24:45.471Z · LW · GW

Your example of the grid of rooms does not quite work. Unlike height (but somewhat like birthday), which column you are in follows a uniform distribution, with no mode near the middle. You are in fact more likely to find yourself in column {A or Z} than to find yourself in column M, for instance. Same for the rows.

Generally speaking, we expect a priori to be part of the typical set of any given distribution, not near the middle per se. In fact, even for Gaussian distributions, as the dimensionality of the space increases, the typical set actually recedes away from the mode/center and toward a hyperellipsoidal shell around it.

This is just something to keep in mind. I haven't yet thought through how this caveat may apply to doomsday timelines, but it's probably important.

Comment by Jon Garcia on The Plan · 2022-01-01T14:01:57.988Z · LW · GW

I can see how that would work. The author needs to be careful, though. Predictive processing may be a necessary condition for robust AGI alignment, but it is not per se a sufficient condition.

First of all, that only works if you give the AGI strong inductive priors for detecting and predicting human needs, goals, and values. Otherwise, it will tend to predict humans as though we are just "physical" systems (we are, but I mean modeling us without taking our sentience and values into account), no more worthy of special care than rocks or streams.

Second of all, this only works if the AGI has a structural bias toward treating the needs, goals, and values that it infers from predictive processing as its own. Otherwise, it may understand how to align with us, but it won't care by default.

Comment by Jon Garcia on The Plan · 2022-01-01T02:03:55.642Z · LW · GW

This looks really interesting. The first thought that jumped to mind was how this geometric principle might extend to abstract goal space in general. There is research suggesting that savannah-like environments may have provided the ideal selective pressures for human evolution to develop the cognitive tools necessary for making complex plans. Becoming adept at navigating physical scenes with obstacles, predators, refuges, and prey gave humans the right kind of brain architecture for also navigating abstract spaces full of abstract goals, anti-goals (bad outcomes to avoid), obstacles, and paths (plans).

The "geometric decision making" in the paper was studied for physical spaces, but I could imagine that animal minds (including humans) use such a bifurcation method in other goal spaces as well. In other words, agents would start out traversing state space toward the average of multiple, moderately distant goals (seeking a state from which multiple goals are still achievable), then would switch to choosing a sub-cluster of the goals to pursue once they get close enough (the binary decision / bifurcation point). This would iterate until the agent has only one easily achievable goal in front of it.

My guess is that this strategy would be safer than choosing a single goal among many at the outset of planning (e.g., the one goal with the highest expected utility upon achievement). If the situation changes while the agent is in the middle of pursuing a goal, it might find itself too far away from any other goal to make up for the sunk cost. If instead it had been pursuing some sort of multi-goal-centroid state, it could still achieve a decent alternative goal even when what would have been its first choice ceases to be an option. As it gets closer to the multi-goal-centroid, it can afford to focus on just a subset (or just a single goal), since it knows that other decent options are still nearby in state space.
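A toy sketch of that centroid-then-bifurcate strategy in Python (the commit radius, the halving rule, and all function names are my own illustrative assumptions, not anything from the paper):

```python
import math

def centroid(points):
    """Mean position of a set of goal coordinates."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def plan_step(agent, goals, commit_radius=2.0):
    """Navigate toward the centroid of all live goals; once the agent is
    within commit_radius of that centroid, bifurcate: drop the farther
    half of the goals and repeat on the nearer sub-cluster."""
    live = list(goals)
    while len(live) > 1 and math.dist(agent, centroid(live)) < commit_radius:
        live.sort(key=lambda g: math.dist(agent, g))
        live = live[: max(1, len(live) // 2)]  # keep the nearer sub-cluster
    return centroid(live)  # current navigation target

# Far away, the agent keeps its options open by heading for the centroid:
far_target = plan_step((0.0, 0.0), [(10.0, 0.0), (0.0, 10.0)])
# Close to the centroid, it commits to the nearer goal:
near_target = plan_step((4.5, 5.5), [(10.0, 0.0), (0.0, 10.0)])
```

Far from the goals, the agent heads for the multi-goal centroid, preserving every option; as it closes in, it repeatedly commits to the nearer sub-cluster until a single goal remains.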

Comment by Jon Garcia on Missing Mental Models · 2021-12-29T18:45:56.303Z · LW · GW

For #1, you could call them something like "usefully bad policies."

For #2, it sounds like the first step in the scientific method. My only other suggestion would be just to find other interested minds (like on LessWrong) to discuss the pattern and to collaborate on forming a new mental model and label. Someone else may already be more predisposed to thinking with a mental model that predicts or explains the phenomenon, so they could help get you to a clear conceptual anchor more quickly.

Comment by Jon Garcia on More accurate models can be worse · 2021-12-29T02:15:42.260Z · LW · GW

All information currently in working memory could potentially become highly weighted when a saliency signal comes along. Through reinforcement learning, I imagine the agent could optimize whatever attention circuit does the loading of information into working memory in order to make this more useful, as part of some sort of learning-to-learn algorithm.

Comment by Jon Garcia on More accurate models can be worse · 2021-12-28T17:02:17.629Z · LW · GW

The brain overcomes this issue through the use of saliency-weighted learning. I don't have any references at the moment, but essentially, information is more salient when it is more surprising, either to the agent's world model or to its self model.

For the former, the agent is constantly making predictions about what it will experience along with the precision of these expectations such that when it encounters something outside of these bounds, it takes notice and updates its world model more strongly in the direction of minimizing these prediction errors.

The latter, however, is where the "usefulness" of salient information is most directly apparent. The agent is not just predicting what will happen in the external world like some disembodied observer. It is modeling what it expects to experience conditioned on its model of itself being healthy and functional. When something surprisingly good occurs, it takes special note of all information that was coincident with the pleasure signal to try to make such experiences more likely in the future. And when something surprisingly bad occurs, it also takes notice of all information coincident with the pain signal so that it can make such experiences less likely in the future.

When everything is going as expected, though, the agent will tend not to keep that information around. Saliency-weighted learning is all about steering an agent's models toward better predictive power and steering its behavior toward states of easier survivability (or easier learnability for a curiosity drive), allowing it to discard most information that it encounters in favor of only that which challenges its expectations.
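As a rough illustration (a made-up toy learner, not a model of any actual neural circuit), saliency-weighted learning can be sketched as a learning rate that scales with prediction error:

```python
import math

BASE_LR = 0.05

def saliency(x, mean, var):
    """Surprise measured in predicted standard deviations (a z-score)."""
    return abs(x - mean) / math.sqrt(var)

def update(x, mean, var):
    """Surprising observations drive large updates; expected ones barely
    move the model, so their details can safely be discarded."""
    lr = BASE_LR * (1.0 + saliency(x, mean, var))
    mean = mean + lr * (x - mean)
    var = var + lr * ((x - mean) ** 2 - var)
    return mean, var

# An observation 5 sigma out moves the model far more than one 1 sigma out:
big_shift, _ = update(5.0, 0.0, 1.0)    # mean jumps to ~1.5
small_shift, _ = update(1.0, 0.0, 1.0)  # mean nudges to ~0.1
```

The key property is in the last two lines: information within the expected band is nearly ignored, while out-of-band information dominates learning.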

Comment by Jon Garcia on On Stateless Societies · 2021-12-27T21:27:52.692Z · LW · GW

Is the primary mechanism by which stateless societies cooperate to prevent unequal distributions of power really just to suppress the innovators? You would think that some tribe would instead consider forcing the innovators to share their methods with everyone else, which would allow everyone to prosper while still preventing anyone from getting too far ahead. I guess punishing the one is easier than educating the many, but I would like to think that humanity could evolve toward doing things the other way around.

Comment by Jon Garcia on Conversation as path traversal · 2021-12-27T18:19:52.742Z · LW · GW

As a slight tangent, I'm interested in your thoughts on the nature of the "conversation space" being traversed. You seemed to imply that conversations can have goals, i.e. destinations that participants in the conversation can try to steer it towards. For this, I'm imagining that the interlocutors have certain mental maps (e.g., a "who", "what", "when", "where", "why", and "how", along with their relations) that they either want filled in for themselves or that they're trying to fill in for the other person.

The conversation would then involve looking for holes in each other's mental maps (regions of high uncertainty) and cooperating to fill them in. Interruptions would then entail one participant trying to shift the other person's attention to a (slightly?) different mental map that's of greater interest to them (kind of like what I'm doing with this comment?).

As a further aside, I think this is one area where large language models like GPT-3 fall short of how humans actually use language. They can simulate conversations, but they can't really participate in genuine conversation-space traversals in the sense of deliberately looking for gaps in understanding and for ways to fill those gaps.

By the way, how would your model handle other types of conversation that have purposes other than conveying or seeking information, such as witty banter, small talk, or giving/receiving orders? Would such conversations still involve traversals in the same space, or would it look qualitatively different? Would there still be goal states or just open-ended evolution?

Comment by Jon Garcia on What is a probabilistic physical theory? · 2021-12-25T19:59:41.071Z · LW · GW

As I see it, probability is essentially just a measure of our ignorance, or the ignorance of any model that's used to make predictions. An event with a probability of 0.5 implies that in half of all situations where I have information indistinguishable from the information I have now, this event will occur; in the other half of all such indistinguishable situations, it won't happen.

For example, all I know is that I have a coin with two sides of equal weight that I plan to flip carelessly through the air until it lands on a flat surface. I'm not tracking how all the action potentials in the neurons of my motor cortex, cerebellum, and spinal cord will affect the precise twitches of individual muscle fibers as I execute the flip, nor the precise orientation of the coin prior to the flip, nor the position of every bone and muscle in my body, nor the minute air currents that might interact differently with the textures on the heads versus tails side, nor any variations in the texture of the landing surface, nor that sniper across the street who's secretly planning to shoot the coin once it's in the air, nor etc., etc., etc. Under the simplified model, where that's all I know, it really will land heads half the time and tails half the time across all possible instantiations of the situation where I can't tell any difference in the relevant initial conditions. In the reality of a deterministic universe, however, the coin (of any particular Everett branch of the multiverse) will either land heads-up or it won't, with no in-between state that could be called "probability".
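This view is easy to demonstrate in simulation (a toy, with the entire microstate compressed into one number): the "physics" below is fully deterministic, and the 0.5 emerges only from averaging over the microstates we can't distinguish:

```python
import random

def flip(microstate):
    """Deterministic 'physics': the full microstate (here a single float
    standing in for muscle twitches, air currents, coin orientation, ...)
    fixes the outcome completely. No randomness lives in this function."""
    return "heads" if int(microstate * 1e6) % 2 == 0 else "tails"

# Our ignorance: all we know is that the microstate lies somewhere in [0, 1).
# "Probability 0.5" is just the fraction of indistinguishable situations
# in which the coin lands heads.
rng = random.Random(42)
trials = [flip(rng.random()) for _ in range(100_000)]
p_heads = trials.count("heads") / len(trials)
```

Rerunning `flip` on the same microstate always gives the same answer; the probability lives in the ensemble of situations, not in the coin.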

Similarly, temperature also measures our ignorance, or rather lack of control, of the trajectories of a large number of particles. There are countless microstates that produce identical macrostates. We don't know which microstate is currently happening, how fast and in what direction each atom is moving. We just know that the molecules in the fluid in the calorimeter are bouncing around fast enough to cause the mercury atoms in the thermometer to bounce against each other hard enough to cause the mercury to expand out to the 300K mark. But there are vigintillions of distinct ways this could be accomplished at the subatomic level, which are nevertheless indistinguishable to us at the macroscopic level. You could shoot cold water through a large pipe at 100 mph and we would still call it cold, even though the average kinetic energy of the water molecules is now equivalent to a significantly higher temperature. This is because we have control over the largest component of their motion, because we can describe it with a simple model.

To a God-level being that actually does track the universal wave function and knows (and has the ability to control) the trajectories of every particle everywhere, there is no such thing as temperature, no such thing as probability. Particles just have whatever positions and momenta they have, and events either happen or they don't (neglecting extra nuances from QM). For those of us bound by thermodynamics, however, these same systems of particles and events are far less predictable. We can't see all the lowest-level details, much less model them with the same precision as reality itself, much less control them with God-level orchestration. Thus, probability, temperature, etc. become necessary tools for predicting and controlling reality at the level of rational agents embedded in the physical universe, with all the ignorance and impotence that comes along with it.

Comment by Jon Garcia on Transformer Circuits · 2021-12-25T03:35:26.833Z · LW · GW

This looks really interesting. Is there any intention to use these insights to design even more interpretable models than transformers? I've had the feeling that transformer models may be too general-purpose for their own good, in terms of training efficiency and interpretability. Fully connected neural networks technically have at least as much computational/representational power as convolutional neural networks, yet they are much harder to train for general image processing than their more constrained counterparts, which take full advantage of translational equivariance. In the same way, transformer-type language models might not have enough built-in constraints to make them efficient enough for AGI.

In these models, some representation of every token is compared against a representation of every other token encountered so far, which gives quadratic complexity for every attention layer at runtime. This then leads to further transformation of the data after each attention block, creating what is effectively a new string of abstract tokens, each of which is some hard-to-interpret combination of the token representations in the level below. The only information added to the vector representation of each token, as far as I understand it, is some vector representing the relative position of the tokens within the string (which itself necessitates a special type of normalization step later on). Otherwise, it's up to the model to learn to assign implicit roles/functions to each token through the attention module. This hides away the information of what each token is doing, which a more constrained model could instead represent explicitly.
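A minimal sketch of where that quadratic cost comes from (single head, no learned query/key/value projections, which is a deliberate simplification of real transformers):

```python
import numpy as np

def self_attention(X):
    """Minimal single-head self-attention with no learned projections:
    every token is compared against every other token, producing an
    n x n score matrix -- the source of the quadratic runtime cost."""
    n, d = X.shape
    scores = X @ X.T / np.sqrt(d)                   # n^2 pairwise comparisons
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ X                              # each output mixes all n tokens

X = np.random.default_rng(0).normal(size=(128, 16))  # 128 tokens, 16 dims
out = self_attention(X)
# Doubling the sequence length (128 -> 256 tokens) quadruples the
# pairwise-score work per layer: 16,384 -> 65,536 comparisons.
```

Every role a token plays is implicit in those attention weights, which is exactly the interpretability problem described above.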

It seems to me that we could do better. For instance, suppose we had a model that had "slots" (I'm thinking something like CPU registers) that it would fill in with token vectors as it went along. The LM would learn to assign functions like "subject", "verb", "direct object modifier", etc., with one part learning which tokens should get routed to which slots, another part learning to predict which slot (e.g., part of speech) will get routed to next based on what information has already been filled in and on the learned rules of syntax, and another part predicting what information should go into the empty slots (allowing it to "read between the lines"). That last part could also be hooked up to a long-memory database of learned relations that it could fill in and update as it accumulates training data (something like what DeepMind published recently).

Although the role of each slot will be arbitrary and assigned only through training, I think this type of architecture would make it easier to extract semantic roles for the tokens that it reads in, since these semantic roles have explicit locations where they can always be found. In other words, you can use the same method to find out what the LM thinks about the who, what, when, where, why, and how of what it reads or says (along with what it thinks about everything it doesn't read or say by looking into the "unused" slots). With transformers, this would be much more difficult, since semantic roles are assigned much more implicitly and a lot could be hiding in the weights.

That was just an idea, but I think that interpretability will come more easily the more we constrain the language model with both functional and representational modularity. Perhaps the work you do could help inform what sorts of constraints would be most effective to that end.

Comment by Jon Garcia on The Plan · 2021-12-12T00:41:03.676Z · LW · GW

It's quite possible that control is easier than ambitious value learning, but I doubt that it's as sustainable. Approaches like myopia, IDA, or HCH would probably get you an AGI that stays aligned up to much higher levels of intelligence than you would get without them, all else being equal. But if there is nothing pulling its motivations explicitly back toward a basin of value alignment, then I feel like these approaches would be prone to diverging from alignment at some level beyond where any human could tell what's going on with the system.

I do think that methods of control are worthwhile to pursue over the short term, but we had better be simultaneously working on ambitious value learning in the meantime for when an ASI inevitably escapes our control anyway. Even if myopia, for instance, worked perfectly to constrain what some AGI is able to conspire, it still seems likely that someone, somewhere, will try fiddling around with another AGI's time horizon parameters and cause a disaster. It would be better if AGI models, from the beginning, had at least some value learning system built in by default to act as an extra safeguard.

Comment by Jon Garcia on Are big brains for processing sensory input? · 2021-12-11T23:36:07.555Z · LW · GW

While fine motor control is certainly far from all that the cerebellum does, it is also certainly something that it really does do. From my understanding, it learns to smooth out motor trajectories (as well as generalized trajectories in cortical processing) using feed-forward circuitry (with feedback of error signals from the inferior olive acting as just a training signal), which is why I called it "learned reflexes". And, as you mentioned, this feed-forward "reflexive" trajectory-smoothing extends to all cognitive processes.

I have come to see the basal ganglia as helping to decide what actions to take (or where information is routed to), while the cerebellum handles more how the actions are carried out (or how to adjust the information transferred between cortical regions). And it has all the computational universality of extremely wide feed-forward neural networks, due to its huge number of neurons. Maybe this would play into your idea in your other comment about how cerebellar outputs might also help train the cortex.

Comment by Jon Garcia on The Plan · 2021-12-11T03:51:09.143Z · LW · GW

I strongly agree with your focus on ambitious value learning, rather than approaches that focus more on control (e.g., myopia). What we want is an AGI that can robustly identify humans (and I would argue, any agentic system), determine their values in an iteratively improving way, and treat these learned values as its own. That is, we should be looking for models where goal alignment and a desire to cooperate with humanity is situated within a broad basin of attraction (like how corrigibility is supposed to work), where any misalignment that the AGI notices (or that humans point out to it) is treated as an error signal that pulls its value model back into the basin. For such a scheme to work, of course, you need some way for it to infer human goals (watching human behavior?, imagining what it would be trying to achieve that would make it behave the same way?), some way for the AGI to represent "human goals" once it has inferred them, some way for it to represent "my own goals" in the same conceptual space (while still using those goal representations to drive its own behavior), and some way for it to take any differences in these representations to make itself more aligned (something like online gradient descent?).
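The closed loop described here (infer human values from noisy evidence, then treat mismatch between the inferred values and one's own as an error signal) can be caricatured in a few lines; everything below, from the learning rate to the Gaussian "observations," is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(0)

true_human_values = rng.normal(size=8)  # ground truth, never seen directly
human_model = np.zeros(8)               # the AGI's evolving model of human values
own_values = np.zeros(8)                # the values actually driving its behavior
lr = 0.1

for _ in range(500):
    # Noisy evidence of human values (behavior, words, expressions, ...).
    observation = true_human_values + rng.normal(scale=0.5, size=8)
    # Iteratively refine the model of human values...
    human_model += lr * (observation - human_model)
    # ...and treat any model/own-value mismatch as an error signal pulling
    # the agent's own values back into the basin (online gradient descent
    # on the squared difference).
    own_values += lr * (human_model - own_values)

alignment_error = np.linalg.norm(own_values - true_human_values)
```

Misalignment never reaches exactly zero, since it tracks the noise in the evidence, but it stays bounded inside the basin instead of drifting away.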

And I think that solutions to this line of research would involve building generative agentic models into the AGI's architecture to give it strong inductive priors for detecting human agency in its world model (using something along the lines of analysis by synthesis or predictive coding). We wouldn't necessarily have to figure out everything about how the human mind works in order to build this (although that would certainly help), just enough so that it has the tools to teach itself how humans think and act, maintain homeostasis, generate new goals, use moral instincts of empathy, fairness, reciprocity, and status-seeking, etc. And as long as it is built to treat its best model of human values and goals as its own values and goals, I think we wouldn't need to worry about it torturing simulated humans, no matter how sophisticated its agentic models get. Of course, this would require figuring out how to detect agentic models in general systems, as you mentioned, so that we can make sure that the only parts of the AGI capable of simulating agents are those that have their preferences routed to the AGI's own preference modules.

Comment by Jon Garcia on There is essentially one best-validated theory of cognition. · 2021-12-10T19:15:07.858Z · LW · GW

So essentially, which types of information get routed for processing to which areas during the performance of some behavioral or cognitive algorithm, and what sort of processing each module performs?

Comment by Jon Garcia on There is essentially one best-validated theory of cognition. · 2021-12-10T18:24:44.881Z · LW · GW

So, from what I read, it looks like ACT-R is mostly about modeling which brain systems are connected to which and how fast their interactions are, not in any way how the brain systems actually do what they do. Is that fair? If so, I could see this framework helping to set useful structural priors for developing AGI (e.g., so we don't make the mistake of hooking up the declarative memory module directly to the raw sensory or motor modules), but I would expect most of the progress still to come from research in machine learning and computational neuroscience.

Comment by Jon Garcia on Are big brains for processing sensory input? · 2021-12-10T18:00:01.556Z · LW · GW

As for the elephant's oversized cerebellum, I've heard it suggested that it's for controlling the trunk. Elephant trunks are able to manipulate things with extreme dexterity, allowing them to pluck individual leaves, pick up heavy objects, or even paint. Since the cerebellum is known for "smoothing out" fine motor control (basically acting as a giant library of learned reflexes [including cognitive reflexes], as I understand it), it makes sense that elephant cerebellums would become so large as their trunks evolved to act as prehensile appendages.

According to this, the human brain has about 15 billion neurons in the telencephalon (neocortex, etc.), 70 billion in the cerebellum, and 1 billion in the brainstem. So it sounds like we still have much more circuitry dedicated to generalized abstract intelligence than elephants; they just have better dexterity with a more complex appendage than human limbs (minus the fingers but plus a ton of complexly interacting muscles). If we had cerebella closer in size to the elephant's, all humans would probably be experts in gymnastics, martial arts, painting, and playing musical instruments.

Comment by Jon Garcia on Theoretical Neuroscience For Alignment Theory · 2021-12-09T23:33:10.226Z · LW · GW

This post is a really great summary. Steve's posts on the thalamo-cortical-basal ganglia loop, and how it relates learning-from-scratch to value learning and action selection, have added a lot to my mental model of the brain's overall cognitive algorithms. (Although I would add that I currently see the basal ganglia as a more general-purpose dynamic routing system, able to act as both multiplexer and demultiplexer between sets of cortical regions for implementing arbitrarily complex cognitive algorithms. That is, it may have evolved for action selection, but humans, at least, have repurposed it for abstract thought, moving our conscious thought processes toward acting more CPU-like. But that's another discussion.)

One commonly-proposed solution to this problem is to capture these intuitions indirectly through human-in-the-loop-style proposals like imitative amplification, safety via debate, reward modeling, etc., but it might also be possible to just “cut out the middleman” and install the relevant human-like social-psychological computations directly into the AGI. In slogan form, instead of (or in addition to) putting a human in the loop, we could theoretically put "humanness" in our AGI.

In my opinion, it's got to be both. It definitely makes sense that we should be trying to get an AGI to have value priors that align as closely with true human values as we can get them. However, to ensure that the system remains robust to arbitrarily high levels of intelligence, I think it's critical to have a mechanism built in where the AGI is constantly trying to refine its model of human needs/goals and feed that in to how it steers its behavior. This would entail an ever-evolving theory of mind that uses human words, expressions, body language, etc. as Bayesian evidence of humans' internal states and an action-selection architecture that is always trying to optimize for its current best model of human need/goal satisfaction.

Understanding the algorithms of the human brainstem/hypothalamus would help immensely in providing a good model prior (e.g., knowing that smiles are evidence for satisfied preferences, that frowns are evidence for violated preferences, and that humans should have access to food, water, shelter, and human community gives the AGI a much better head start than making it figure out all of that on its own). But it should still have the sort of architecture that would allow it to figure out human preferences from scratch and try its best to satisfy them in case we misspecify something in our model.

Comment by Jon Garcia on [deleted post] 2021-12-04T17:36:43.188Z

I think these types of "trolley" problems are designed to explore the Pareto frontier of human moral systems. You can't improve the results in one dimension without making them worse in another. The ideal would of course be that everyone is magically saved without any adverse side-effects, but the real world (and the hypothetical worlds of these scenarios) is more constrained. The best we can do is a Pareto-optimal choice, and for most of these dilemmas, either option could be seen as Pareto-optimal.

Other than that, the scenarios tend to either present physically impossible options (2, 5), logistically impossible options (5, 6), impossibly contrived situations (1, 2, 3, 4, 5, 6), or ridiculous overconfidence in the effectiveness of possible solutions (2, 3, 4, 5, 6). The rules of human morality are designed for situations that fall within the typical set of human experience and assume human-level control over the efficacy of actions, all embedded within a typical human social fabric. In any of these extreme edge cases, I think that anyone trying to do the right thing could at least be forgiven for any inevitable harmful side-effects.

Comment by Jon Garcia on Morality is Scary · 2021-12-03T17:57:06.232Z · LW · GW

Getting from here to there is always the tricky part with coordination problems, isn't it? I do have some (quite speculative) ideas on that, but I don't see human society organizing itself in this way on its own for at least a few centuries given current political and economic trends, which is why I postulated a cooperative ASI.

So assuming that either an aligned ASI has taken over (I have some ideas on robust alignment, too, but that's out of scope here) or political and economic forces (and infrastructure) have finally pushed humanity past a certain social phase transition, I see humanity undergoing an organizational shift much like what happened with the evolution of multicellularity and eusociality. This would look at first mostly the same as today, except that national borders have become mostly irrelevant due to advances in transportation and communication infrastructure. Basically, imagine the world's cities and highways becoming something like the vascular system of dicots or the closed circulatory system of vertebrates, with the regions enclosed by network circuits acting as de facto states (or organs/tissues, to continue the biological analogy). Major cities and the regions along the highways that connect them become the de facto arbiters of international policy, while the major cities and highways within each region become the arbiters of regional policy, and so on in a hierarchically embedded manner.

Within this structure, enclosed regions would act as hierarchically embedded communities that end up performing a division of labor for the global network, just as organs divide labor for the body (or like tissues divide labor within an organ, or cells within a tissue, or organelles within a cell, if you're looking within regions). Basically, the transportation/communication/etc. network edges would come to act as Markov blankets for the regions they encapsulate, and this organization would extend hierarchically, just like in biological systems, down to the level of local communities. (Ideally, each community would become locally self-sufficient, with broader networks taking on a more modulatory role, but that's another discussion.)

Anyway, once this point is reached, or even as the transition is underway, I see the ASI and/or social pressures facilitating the movement of people toward communities of shared values and beliefs (i.e., shared narratives, or at least minimally conflicting narratives), much like in Scott Alexander's Archipelago. Each person or family unit should move so as to minimize their displacement while maximizing the marginal improvement they could make to their new community (and the marginal benefit they could receive from the new community).

In the system that emerges, stories would become something of a commodity, arising within communities as shared narratives that assign social roles and teach values and lessons (just like the campfire legends of ancient hunter-gatherer societies). Stories with more universal resonance would propagate up hierarchical layers of the global network and then get disseminated top-down toward other local communities within the broader regions. This would provide a narrative-synchronization effect at high levels and across adjacent regions while also allowing for local variations. The status games / moralities of the international level would eventually attain a more "liberal" flavor, while those at more local levels could be more "conservative" in nature.

Sorry, that was long. And it probably involved more idealized fantasy than rational prediction of future trends. But I have a hunch that something like this could work.

Comment by Jon Garcia on Morality is Scary · 2021-12-02T18:45:13.601Z · LW · GW

Even if moralities vary from culture to culture based on the local status games, I would suggest that there is still some amount of consequentialist bedrock to why certain types of norms develop. In other words, cultural relativism is not unbounded.

Generally speaking, norms evolve over time, where any given norm at one point didn't yet exist if you go back far enough. What caused these norms to develop? I would say the selective pressures for norm development come from some combination of existing culturally-specific norms and narratives (such as the sunrise being an agent that could get hurt when kicked) along with more human-universal motivations (such as empathy + {wellbeing = good, suffering = bad} -> you are bad for kicking the sunrise -> don't sleep facing west) or other instrumentally-convergent goals (such as {power = good} + "semen grants power" -> institutionalized sodomy). At every step along the evolution of a moral norm, every change needs to be justifiable (in a consequentialist sense) to the members of the community who would adopt it. Moral progress is when the norms of society come to better resonate with both the accepted narratives of society (which may come from legends or from science) and the intrinsic values of its members (which come from our biology / psychology).

In a world where alignment has been solved to most everyone's satisfaction, I think that the status-game / cultural narrative aspect of morality will necessarily have been taken into account. For example, imagine a post-Singularity world kind of like Scott Alexander's Archipelago, where the ASI cooperates with each sub-community to create a customized narrative for the members to participate in. It might then slowly adjust this narrative (over decades? centuries?) to align better with human flourishing in other dimensions. The status-game aspect could remain in play as long as status becomes sufficiently correlated with something like "uses their role in life to improve the lives of others within their sphere of control". And I think everyone would be better off if each narrative also becomes at least consistent with what we learn from science, even though the stories that define the status game will be different from one culture to another in other ways.