Non-Book Review: Patterns of Conflict 2020-11-30T21:05:24.389Z
The Pointers Problem: Human Values Are A Function Of Humans' Latent Variables 2020-11-18T17:47:40.929Z
Anatomy of a Gear 2020-11-16T16:34:44.279Z
Early Thoughts on Ontology/Grounding Problems 2020-11-14T23:19:36.000Z
A Self-Embedded Probabilistic Model 2020-11-13T20:36:24.407Z
Communication Prior as Alignment Strategy 2020-11-12T22:06:14.758Z
A Correspondence Theorem in the Maximum Entropy Framework 2020-11-11T22:46:38.732Z
What Would Advanced Social Technology Look Like? 2020-11-10T17:55:30.649Z
Open Problems Create Paradigms 2020-11-09T20:04:34.534Z
When Hindsight Isn't 20/20: Incentive Design With Imperfect Credit Allocation 2020-11-08T19:16:03.232Z
Three Open Problems in Aging 2020-11-07T19:39:07.352Z
Generalized Heat Engine II: Thermodynamic Efficiency Limit 2020-11-06T17:30:43.805Z
Generalized Heat Engine 2020-11-05T19:01:32.699Z
Gifts Which Money Cannot Buy 2020-11-04T19:37:57.451Z
When Money Is Abundant, Knowledge Is The Real Wealth 2020-11-03T17:34:45.516Z
Confucianism in AI Alignment 2020-11-02T21:16:45.599Z
"Inner Alignment Failures" Which Are Actually Outer Alignment Failures 2020-10-31T20:18:35.536Z
A Correspondence Theorem 2020-10-26T23:28:06.305Z
Problems Involving Abstraction? 2020-10-20T16:49:39.618Z
Toy Problem: Detective Story Alignment 2020-10-13T21:02:51.664Z
Lessons on Value of Information From Civ 2020-10-07T18:18:40.118Z
Words and Implications 2020-10-01T17:37:20.399Z
What Decision Theory is Implied By Predictive Processing? 2020-09-28T17:20:51.946Z
Comparative Advantage is Not About Trade 2020-09-22T18:43:11.496Z
Book Review: Working With Contracts 2020-09-14T23:22:11.215Z
Egan's Theorem? 2020-09-13T17:47:01.970Z
CTWTB: Paths of Computation State 2020-09-08T20:44:08.951Z
Alignment By Default 2020-08-12T18:54:00.751Z
The Fusion Power Generator Scenario 2020-08-08T18:31:38.757Z
Infinite Data/Compute Arguments in Alignment 2020-08-04T20:21:37.310Z
Generalized Efficient Markets in Political Power 2020-08-01T04:49:32.240Z
Alignment As A Bottleneck To Usefulness Of GPT-3 2020-07-21T20:02:36.030Z
Anthropomorphizing Humans 2020-07-17T17:49:37.086Z
Mazes and Duality 2020-07-14T19:54:42.479Z
Models of Value of Learning 2020-07-07T19:08:31.785Z
High Stock Prices Make Sense Right Now 2020-07-03T20:16:53.852Z
Mediators of History 2020-06-27T19:55:48.485Z
Abstraction, Evolution and Gears 2020-06-24T17:39:42.563Z
The Indexing Problem 2020-06-22T19:11:53.626Z
High-School Algebra for Data Structures 2020-06-17T18:09:24.550Z
Causality Adds Up to Normality 2020-06-15T17:19:58.333Z
Cartesian Boundary as Abstraction Boundary 2020-06-11T17:38:18.307Z
Public Static: What is Abstraction? 2020-06-09T18:36:49.838Z
Everyday Lessons from High-Dimensional Optimization 2020-06-06T20:57:05.155Z
Speculations on the Future of Fiction Writing 2020-05-28T16:34:45.599Z
Highlights of Comparative and Evolutionary Aging 2020-05-22T17:01:30.158Z
Pointing to a Flower 2020-05-18T18:54:53.711Z
Conjecture Workshop 2020-05-15T22:41:31.984Z
Project Proposal: Gears of Aging 2020-05-09T18:47:26.468Z
Writing Causal Models Like We Write Programs 2020-05-05T18:05:38.339Z


Comment by johnswentworth on The LessWrong 2018 Review · 2020-12-02T19:56:33.333Z · LW · GW

So, I don't necessarily think that all the details of this belong in the 2019 books, but... y'know, this is LessWrong, things just don't feel complete without a few levels of meta thrown in.

Comment by johnswentworth on Understanding “Deep Double Descent” · 2020-12-02T19:42:23.056Z · LW · GW

I found this post interesting and helpful, and have used it as a mental hook on which to hang other things. Interpreting what's going on with double descent, and what it implies, is tricky, and I'll probably write a proper review at some point talking about that.

Comment by johnswentworth on Excerpts from a larger discussion about simulacra · 2020-12-02T19:37:26.046Z · LW · GW

This seems to be where simulacra first started to appear in LW discourse? There doesn't seem to be a polished general post on the subject until 2020, but I feel like the concepts and classification were floating around in 2019, and some credit probably belongs on this post.

Comment by johnswentworth on Risks from Learned Optimization: Introduction · 2020-12-02T19:14:30.792Z · LW · GW

So, this was apparently in 2019. Given how central the ideas have become, it definitely belongs in the review.

Comment by johnswentworth on Unconscious Economics · 2020-12-02T19:04:31.627Z · LW · GW

In order to apply economic reasoning in the real world, this is an indispensable concept, and this post is my go-to link for it.

Comment by johnswentworth on Propagating Facts into Aesthetics · 2020-12-02T18:59:34.602Z · LW · GW

I think this post remains under-appreciated. Aesthetics drive a surprisingly large chunk of our behavior, and I find it likely that some aesthetics tend to outperform others in terms of good decision-making. Yet it's a hard thing to discuss at a community level, because aesthetics are often inherently tied to politics. I'd like to see more intentional exploration of aesthetic-space, and more thinking about how to evaluate how-well-different-aesthetics-perform-on-decisions, assuming the pitfalls of politicization can be avoided.

Comment by johnswentworth on Coherent decisions imply consistent utilities · 2020-12-02T18:46:23.178Z · LW · GW

I don't particularly like dragging out the old coherence discussions, but the annual review is partly about building common knowledge, so it's the right time to bring it up.

This currently seems to be the canonical reference post on the subject. On the one hand, I think there are major problems/missing pieces with it. On the other hand, looking at the top "objection"-style comment (i.e. Said's), it's clear that the commenter didn't even finish reading the post and doesn't understand the pieces involved. I think this is pretty typical among people who object to coherence results: most of them have only dealt with the VNM theorem, and correctly complain about the assumptions of that theorem being too strong, but don't know about the existence of all the other coherence theorems (including the complete class theorem mentioned in the post, and Savage's theorem mentioned in the comments). The "real" coherence theorems do have problems with them, but they're not the problems which a lot of people point to in VNM.

I'll leave a more detailed review later. The point of this nomination is to build common knowledge: I'd like to get to the point where the objections to coherence theorems are the right objections, rather than objections based in ignorance, and this post (and reviews of it) seem like a good place for that.

Comment by johnswentworth on What is “protein folding”? A brief explanation · 2020-12-01T22:33:10.874Z · LW · GW

I agree with this answer - it is still likely to be a useful component in a simulation pipeline in the long run, but it's probably not going to revolutionize things as a standalone tool in the short run.

Comment by johnswentworth on [Linkpost] AlphaFold: a solution to a 50-year-old grand challenge in biology · 2020-12-01T00:30:16.241Z · LW · GW

That's the dream.

Comment by johnswentworth on What Would Advanced Social Technology Look Like? · 2020-11-30T23:50:00.136Z · LW · GW

These answers gave me a strong sense of "there's really useful models to be found here, but I'm not quite sure what they look like".

Comment by johnswentworth on [Linkpost] AlphaFold: a solution to a 50-year-old grand challenge in biology · 2020-11-30T22:34:50.639Z · LW · GW

We mainly want to know (a) what reactions a protein is involved in, and (b) the rate constants on those reactions. In practice, protein shape tells us very little about either of those without extensive additional simulation. (It can give some hints as to what broad classes of reaction the protein might be involved in, but my understanding is that we can get most of those same hints from the sequence alone.)

In principle, folded protein structures could be used as an input to those sorts of simulations, but the simulation is expensive in much the same way as the folding problem itself, and as far as I know the cutting edge in simulation still can't provide precision or speed comparable to high-throughput assays (even given folded structures).

In gears terms: everything we care about in a high-dimensional protein structure is summarized by low-dimensional reaction rates, so proteins make really good gears. A practical consequence is that directly measuring reaction rates is way more efficient than simulating all the low-level activity. There are things that approach can't handle - e.g. we don't know how a change to the protein will change reaction rates - but even with protein folding "solved", simulation isn't at the point where it can make those predictions faster and more precisely than a new experiment.

Comment by johnswentworth on [Linkpost] AlphaFold: a solution to a 50-year-old grand challenge in biology · 2020-11-30T21:50:23.978Z · LW · GW

Prediction: this won't make much difference for either biology or medicine in general. The one big thing it will do is cause funding agencies to stop wasting so much money on protein structure studies (assuming that AlphaFold's results generalize beyond this particular challenge, which I'm uncertain about). The whole field of structural biology was 95% useless anyway.

It is an interesting result from the AI angle, though.

Comment by johnswentworth on What Would Advanced Social Technology Look Like? · 2020-11-28T17:44:33.068Z · LW · GW

Tools allowing a group to intentionally choose their memes sound like they could be useful in much the same way as genetic engineering. Like genetic engineering of group culture. I imagine it would be especially useful as group size increases, potentially making large group cooperation less inherently unstable.

Comment by johnswentworth on What Would Advanced Social Technology Look Like? · 2020-11-27T23:38:23.653Z · LW · GW

Re: general public contract, I found parts of Legal Systems Very Different From Ours interesting for exactly this topic. Religious legal systems, for example, are mostly implemented on top of more general legal systems in today's first world.

Among those having it, it will be zero-sum of course - and tricky, because fixed-point algorithms will need to be developed.

Why would this necessarily be zero sum?

Comment by johnswentworth on What Would Advanced Social Technology Look Like? · 2020-11-27T23:31:54.913Z · LW · GW

I remember fantasizing about this sort of thing in high school. Would have made my life sooo much better for a few years.

Comment by johnswentworth on Covid 11/26: Thanksgiving · 2020-11-26T20:02:52.192Z · LW · GW

This isn’t Covid-19 but there was this claim by some Israelis to have reversed the human aging process. Using oxygen. In particular, they are claiming they can lengthen telomeres and reduce the accumulation of resulting senescent cells. [...] I assume I have some readers who can explain why this is nothing to get excited about, but seems worth asking for them to do that.

Ok, there's a lot going on here.

First, general epistemic comments. Paper is here. There are some major red flags: only ~20 patients in analysis, no control group, tested a bunch of different cell types and endpoints. In this case, I think the lack of a control group isn't too alarming - we have a pretty decent prior idea of what "normal" looks like in old people, and in some ways using the initial conditions of this particular group as the "control" is better anyways, especially with such a small sample size. The garden of forking paths is a bigger concern. The effect sizes and p-values are strong enough that I still think there's plausibly a real effect here, but definitely take it with a sizable helping of salt.

The main measurements I'd pay attention to are the senescent cell counts post-treatment (the "post-HBOT", taken "1-2 weeks" after the treatment concluded). Hyperbaric oxygen will definitely have short-term effects, but it's mainly the longer-term effects which are interesting here, so post-HBOT is the thing to look at. Telomere length measurements in general are... kinda tricky to interpret. In normal operation, they're effectively a downstream measurement of DNA damage rates (which are what "really" seem to matter for aging), but some interventions can lengthen telomeres without significantly reducing the damage rates. Senescent cell counts, on the other hand, seem to be more directly relevant, based on my current best understanding. Put that together, and we can ignore most of the forking paths and just focus on post-HBOT senescent cell changes.

And in this case, the post-HBOT senescent cell changes are exactly where the most dramatic results are. They looked at senescent cell counts in two cell types. One had ~37% drop in senescent cell count, the other had ~11% drop. Those are definitely not "the problem is solved" kind of numbers, especially when the 37% drop is only in one cell type, but it's substantial.

The bigger question is how long the effect lasts. The study only checked in 1-2 weeks after treatment, which is a bit less than the typical half-life of senescent cells. What we really want to know is whether the effect persists after 6 months or a year. Given the mechanisms involved (i.e. hyperbaric oxygen, defense activation), I would expect a priori that it probably wears off after 1-2 months, and that in the long run the hyperbaric oxygen exposure accelerates aging overall.

Comment by johnswentworth on Anatomy of a Gear · 2020-11-25T17:19:55.842Z · LW · GW

Solid answer.

Comment by johnswentworth on It’s not economically inefficient for a UBI to reduce recipient’s employment · 2020-11-25T16:26:46.851Z · LW · GW

(Given that this seems like a kind of unenlightening thread about a topic that's not super important to me, I'll probably drop it.)

Reasonable. If you want a halfway-decent defense of the view that whether UBI is a good idea should depend on whether recipients stop working (while still accepting that work is not inherently good), you might like this.

Comment by johnswentworth on It’s not economically inefficient for a UBI to reduce recipient’s employment · 2020-11-24T17:43:13.477Z · LW · GW

Yes, I'm talking about the additional consumption if you earn+spend more money.

Generally speaking, if we're asking "what's the impact of policy X?" in economics, we:

  • consider how each agent will react to policy X
  • compare outcomes under the decisions which each agent will actually make

Key point: we do not compare outcomes under decisions the agents could make (i.e. their choice-sets), we compare outcomes under decisions they will make, in both a with-policy scenario and a without-policy scenario.

In this context, that means we ask

  • How will you react to the UBI - i.e. how will your production and consumption (as well as everyone else's production and consumption) change in a world with UBI vs a world without UBI?
  • What does that imply about how nice the world will be with or without UBI?

The question you are currently asking is instead "given UBI, how will production and consumption change if I do vs do not work?". But that's not the relevant question for evaluating UBI. For evaluating UBI, the questions are

  • "given UBI, will you work?" - to which we'll assume the answer is "no", for current purposes
  • "given that, is the world better off with (no UBI + you working) or (UBI + you not working)?"

In particular:

We are wondering if my decision to stop working was inefficient

This is not the question. The question is whether UBI is inefficient, given that you will react to the UBI by not working. The question is not whether your own decision is inefficient.

(If the question were whether your own decision is inefficient, then the discussion of externalities would be roughly correct; at that point it's basically just the usual question of whether individual utility-maximization produces efficient outcomes.)

Comment by johnswentworth on It’s not economically inefficient for a UBI to reduce recipient’s employment · 2020-11-23T16:54:03.796Z · LW · GW

The question is whether the combination of {me working} + {me consuming} is better or worse for the rest of the world than {me relaxing}

Huh?? This does not make sense to me, in two ways:

  • If we're talking economic efficiency, then your own utility should be included. What's best for "the rest of the world" isn't the efficiency question; we should be asking what's best for everyone, including you. Why would we focus on the rest of the world?
  • In a UBI scenario, you should be able to stop working while still consuming (though someone else will consume less). You may cut back consumption to some extent, but presumably by much less than if you just stopped working and had no income at all. The choice between {me working} + {me consuming} vs {me relaxing} is the choice faced when considering retirement, not when considering UBI.

Comment by johnswentworth on It’s not economically inefficient for a UBI to reduce recipient’s employment · 2020-11-22T19:21:26.337Z · LW · GW

I think you are getting very distracted by the money flows here. A generally-useful move in these sorts of problems is to forget about the money flows, and just look at the real goods/services/economic value produced and consumed.

If a UBI causes a bunch of people to stop working, what does that mean in terms of production and consumption of real value?

Well, obviously if someone stops working then there is less total goods and services produced. That's a loss of real economic value. The trade-off is that the (former) worker gains some leisure time. So, from a real value perspective, the main question is whether the leisure gained has more real economic value than the goods/services no longer being produced.

There will also be second-order redistributive effects, as the redistribution of income/wealth changes spending patterns, but the analysis there shouldn't be any different from redistribution more generally. (This would include e.g. your Netflix/city apartment examples.) The part which is specific to UBI is the trade-off between leisure and goods/services produced.
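The accounting above can be sketched in a few lines. This is a toy illustration with invented numbers (the scenario values and the `world_value` decomposition are my own, not from the original comment): money flows drop out entirely, and the policy comparison reduces to leisure gained vs goods/services no longer produced.

```python
# Toy real-value accounting for the UBI question. All numbers invented.

def world_value(goods_produced, leisure_value):
    # Total real economic value in a scenario: goods/services produced
    # plus the real value of leisure time. Money flows don't appear.
    return goods_produced + leisure_value

# World without UBI: the person works; goods are produced, no extra leisure.
no_ubi = world_value(goods_produced=100.0, leisure_value=0.0)

# World with UBI: we assume the person reacts by not working, so total
# production falls, but they gain leisure time with some real value.
with_ubi = world_value(goods_produced=90.0, leisure_value=15.0)

# The UBI-specific question: is the leisure gained worth more than the
# goods/services no longer produced?
print(with_ubi - no_ubi)  # 5.0 with these made-up numbers
```

With these particular made-up numbers the leisure outweighs the lost production; flip the numbers and it goes the other way. The point is the shape of the comparison, not the sign.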

Comment by johnswentworth on Draft report on AI timelines · 2020-11-22T18:52:01.579Z · LW · GW

I saw a presentation covering a bunch of this back in February, and the graphs I found most informative were those showing the training flop distributions before updating against already-achievable levels. There is one graph along these lines on page 13 in part 1 in the google docs, but it doesn't show the combined distribution without the update against already achievable flops.

Am I correct in remembering that the combined distribution before that update was distinctly bimodal? That was one of my main takeaways from the presentation, and I want to make sure I'm remembering it correctly.

Comment by johnswentworth on The Pointers Problem: Human Values Are A Function Of Humans' Latent Variables · 2020-11-21T01:48:34.699Z · LW · GW

Setting up the "locality of goals" concept: let's split the variables in the world model into observables $O$, action variables $A$, and latent variables $\Lambda$. Note that there may be multiple stages of observations and actions, so we'll only have subsets $O'$ and $A'$ of the observation/action variables in the decision problem. The Bayesian utility maximizer then chooses $A'$ to maximize

$$E[u(\Lambda) \mid O', A']$$

... but we can rewrite that as

$$\sum_{\Lambda} u(\Lambda) \, P[\Lambda \mid O', A']$$

Defining a new utility function $u'(O', A') := \sum_{\Lambda} u(\Lambda) \, P[\Lambda \mid O', A']$, the original problem is equivalent to:

$$\max_{A'} u'(O', A')$$

In English: given the original utility function on the ("non-local") latent variables, we can integrate out the latents to get a new utility function defined only on the ("local") observation & decision variables. The new utility function yields completely identical agent behavior to the original.

So observing agent behavior alone cannot possibly let us distinguish preferences on latent variables from preferences on the "local" observation & decision variables.
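The equivalence can be checked numerically on a tiny world model. This is a minimal sketch with an arbitrary conditional distribution and utilities of my own invention (two latent states, two observations, two actions): the agent maximizing expected utility over the latent variable and the agent maximizing the integrated-out "local" utility pick identical actions.

```python
# Toy check of the "integrate out the latents" equivalence.
# All distributions and utilities are invented for illustration.

# P[latent | obs, action]: an arbitrary conditional distribution over
# a binary latent variable.
P = {
    (o, a): {0: p, 1: 1.0 - p}
    for (o, a), p in {
        (0, 0): 0.9, (0, 1): 0.4,
        (1, 0): 0.2, (1, 1): 0.7,
    }.items()
}

u = {0: 1.0, 1: 5.0}  # utility defined on the latent variable only

def u_prime(o, a):
    # New "local" utility: integrate the latent out of u.
    return sum(u[lam] * P[(o, a)][lam] for lam in (0, 1))

def best_action_latent(o):
    # Original agent: maximize E[u(latent) | obs, action].
    return max((0, 1), key=lambda a: sum(u[lam] * P[(o, a)][lam] for lam in (0, 1)))

def best_action_local(o):
    # Rewritten agent: maximize u'(obs, action), no latents anywhere.
    return max((0, 1), key=lambda a: u_prime(o, a))

# Identical behavior for every observation.
for o in (0, 1):
    assert best_action_latent(o) == best_action_local(o)
```

The assertion holds by construction — which is exactly the point: no behavioral data can separate the two utility functions.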

Comment by johnswentworth on The Pointers Problem: Human Values Are A Function Of Humans' Latent Variables · 2020-11-19T19:12:12.012Z · LW · GW

You've mostly understood the problem-as-stated, and I like the way you're thinking about it, but there's some major loopholes in this approach.

First, I may value the happiness of agents who I cannot significantly impact via my actions - for instance, prisoners in North Korea.

Second, the actions we choose probably won't provide enough data. Suppose there are n different people, and I could give any one of them $1. I value these possibilities differently (e.g. maybe because they have different wealth/cost of living to start with, or just because I like some of them better). If we knew how much I valued each action, then we'd know how much I valued each outcome. But in fact, if I chose person 3, then all we know is that I value person 3 having the dollar more than I value anyone else having it; that's not enough information to back out how much I value each other person having the dollar. This sort of underdetermination will probably be the usual result, since the choice-of-action contains a lot fewer bits than a function mapping the whole action space to values.
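The underdetermination is easy to quantify in a toy version of the dollar-giving example. In this sketch (my own setup: 4 people, integer values from 1 to 5), a single observed choice rules out only the assignments where the chosen person isn't strictly top-ranked, leaving many value assignments consistent with the data.

```python
import itertools

# Toy underdetermination check: 4 people, each valued 1..5 (invented
# parameters). We observe only that the dollar went to person 2.
n = 4
observed_choice = 2

# Keep every value assignment consistent with that single observation:
# person 2 must be strictly preferred to everyone else.
consistent = [
    v for v in itertools.product(range(1, 6), repeat=n)
    if all(v[observed_choice] > v[i] for i in range(n) if i != observed_choice)
]

# One choice still leaves the values of everyone else unconstrained.
print(len(consistent))  # 100 distinct value assignments remain
```

One bit-cheap observation, a hundred surviving hypotheses — and the gap only widens as the number of people and value levels grows.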

Third, and arguably most important: "run the calculation for all desired moral agents" requires first identifying all the "desired moral agents", which is itself an instance of the problem in the post. What the heck is a "moral agent", and how does an AI know which ones are "desired"? These are latent variables in your world-model, and would need to be translated to something in the real world.

Comment by johnswentworth on The Pointers Problem: Human Values Are A Function Of Humans' Latent Variables · 2020-11-19T18:24:28.214Z · LW · GW

This makes a lot of sense.

I had been weakly leaning towards the idea that a solution to the pointers problem should be a solution to deferral - i.e. it tells us when the agent defers to the AI's world model, and what mapping it uses to translate AI-variables to agent-variables. This makes me lean more in that direction.

What I'd like to add to this post would be the point that we shouldn't be imposing a solution from the outside. How to deal with this in an aligned way is itself something which depends on the preferences of the agent. I don't think we can just come up with a general way to find correspondences between models, or something like that, and apply it to solve the problem. (Or at least, we don't need to.)

I see a couple different claims mixed together here:

  • The metaphilosophical problem of how we "should" handle this problem is sufficient and/or necessary to solve in its own right.
  • There probably isn't a general way to find correspondences between models, so we need to operate at the meta-level.

The main thing I disagree with is the idea that there probably isn't a general way to find correspondences between models. There are clearly cases where correspondence fails outright (like the ghosts example), but I think the problem is probably solvable allowing for error-cases (by which I mean cases where the correspondence throws an error, not cases in which the correspondence returns an incorrect result). Furthermore, assuming that natural abstractions work the way I think they do, I think the problem is solvable in practice with relatively few error cases and potentially even using "prosaic" AI world-models. It's the sort of thing which would dramatically improve the success chances of alignment by default.

I absolutely do agree that we still need the metaphilosophical stuff for a first-best solution. In particular, there is not an obviously-correct way to handle the correspondence error-cases, and of course anything else in the whole setup can also be close-but-not-exactly-right. I do think that combining a solution to the pointers problem with something like the communication prior strategy, plus some obvious tweaks like partially-ordered preferences and some model of logical uncertainty, would probably be enough to land us in the basin of convergence (assuming the starting model was decent), but even then I'd prefer metaphilosophical tools to be confident that something like that would work.

Comment by johnswentworth on The Pointers Problem: Human Values Are A Function Of Humans' Latent Variables · 2020-11-19T00:18:53.683Z · LW · GW

Well, if there were unique values, we could say "maximize the unique values." Since there aren't, we can't. We can still do some similar things, and I agree, those do seem wrong. See this post for basically my argument for what we're going to have to do with that wrong-seeming.

Before I get into the meat of the response... I certainly agree that values are probably a partial order, not a total order. However, that still leaves basically all the problems in the OP: that partial order is still a function of latent variables in the human's world-model, which still gives rise to all the same problems as a total order in the human's world-model. (Intuitive way to conceptualize this: we can represent the partial order as a set of total orders, i.e. represent the human as a set of utility-maximizing subagents. Each of those subagents is still a normal Bayesian utility maximizer, and still suffers from the problems in the OP.)

Anyway, I don't think that's the main disconnect here...

Yes, the point is multiple abstraction levels (or at least multiple abstractions, ordered into levels or not). But not multiple abstractions used by humans, multiple abstractions used on humans.

Ok, I think I see what you're saying now. I am of course on board with the notion that e.g. human values do not make sense when we're modelling the human at the level of atoms. I also agree that the physical system which comprises a human can be modeled as wanting different things at different levels of abstraction.

However, there is a difference between "the physical system which comprises a human can be interpreted as wanting different things at different levels of abstraction", and "there is not a unique, well-defined referent of 'human values'". The former does not imply the latter. Indeed, the difference is essentially the same issue in the OP: one of these statements has a type-signature which lives in the physical world, while the other has a type-signature which lives in a human's model.

An analogy: consider a robot into which I hard-code a utility function and world model. This is a physical robot; on the level of atoms, its "goals" do not exist in any more real a sense than human values do. As with humans, we can model the robot at multiple levels of abstraction, and these different models may ascribe different "goals" to the robot - e.g. modelling it at the level of an electronic circuit or at the level of assembly code may ascribe different goals to the system, there may be subsystems with their own little control loops, etc.

And yet, when I talk about the utility function I hard-coded into the robot, there is no ambiguity about which thing I am talking about. "The utility function I hard-coded into the robot" is a concept within my own world-model. That world-model specifies the relevant level of abstraction at which the concept lives. And it seems pretty clear that "the utility function I hard-coded into the robot" would correspond to some unambiguous thing in the real world - although specifying exactly what that thing is, is an instance of the pointers problem.

Does that make sense? Am I still missing something here?

Comment by johnswentworth on The Pointers Problem: Human Values Are A Function Of Humans' Latent Variables · 2020-11-18T23:30:38.158Z · LW · GW

Could you uncompress this comment a bit please?

Comment by johnswentworth on The Pointers Problem: Human Values Are A Function Of Humans' Latent Variables · 2020-11-18T19:31:31.237Z · LW · GW

This comment seems wrong to me in ways that make me think I'm missing your point.

Some examples and what seems wrong about them, with the understanding that I'm probably misunderstanding what you're trying to point to:

we're non-Cartesian, which means that when we talk about our values, we are assuming a specific sort of way of talking about the world, and there are other ways of talking about the world in which talk about our values doesn't make sense

I have no idea why this would be tied to non-Cartesian-ness.

But in the real world, humans don't have a unique set of True Values or even a unique model of the world

There are certainly ways in which humans diverge from Bayesian utility maximization, but I don't see why we would think that values or models are non-unique. Certainly we use multiple levels of abstraction, or multiple sub-models, but that's quite different from having multiple distinct world-models.

Thus in the real world we cannot require that the AI has to maximize humans' True Values, we can only ask that it models humans [...] and satisfy the modeled values.

How does this follow from non-uniqueness of values/world models? If humans have more than one set of values, or more than one world model, then this seems to say "just pick one set of values/one world model and satisfy that", which seems wrong.

One way to interpret all this is that you're pointing to things like submodels, subagents, multiple abstraction levels, etc. But then I don't see why the problem would be any easier in the real world than in the model, since all of those things can be expressed in the model (or a straightforward extension of the model, in the case of subagents).

Comment by johnswentworth on Anatomy of a Gear · 2020-11-18T15:29:40.835Z · LW · GW

That is excellent.

Comment by johnswentworth on Signalling & Simulacra Level 3 · 2020-11-17T16:40:03.185Z · LW · GW

I'm not talking about learning language, I'm talking about how we chunk the world into objects. It's not about learning the word "tree", it's about recognizing the category-of-things which we happen to call trees. It's about thinking that maybe the things I know about one of the things-we-call-trees are likely to generalize to other things-I-call-trees. We must do that before attaching the word "tree" to the concept, because otherwise it would take millions of examples to home in on which concept the word is trying to point to.

Comment by johnswentworth on When Money Is Abundant, Knowledge Is The Real Wealth · 2020-11-17T16:35:02.806Z · LW · GW

I generally agree with this comment, and I think the vast majority of people underestimate the importance of this factor. Personally, I consider "staying grounded" one of the primary challenges of what I'm currently doing, and I do not think it's healthy to stay out of the markets for extended periods of time.

Comment by johnswentworth on Anatomy of a Gear · 2020-11-17T16:28:16.911Z · LW · GW

Interesting connection, I hadn't thought about it from the Occam's razor angle. That's also similar to the maps <-> abstraction connection.

Comment by johnswentworth on Early Thoughts on Ontology/Grounding Problems · 2020-11-16T21:04:54.817Z · LW · GW

At this point, I think that I personally have enough evidence to be reasonably sure that I understand abstraction well enough that it's not a conceptual bottleneck. There are still many angles to pursue - I still don't have efficient abstraction learning algorithms, there's probably good ways to generalize it, and of course there's empirical work. I also do not think that other people have enough evidence that they should believe me at this point, when I claim to understand well enough. (In general, if someone makes a claim and backs it up by citing X, then I should assign the claim lower credence than if I stumbled on X organically, because the claimant may have found X via motivated search. This leads to an asymmetry: sometimes I believe a thing, but I do not think that my claim of the thing should be sufficient to convince others, because others do not have visibility into my search process. Also I just haven't clearly written up every little piece of evidence.)

Anyway, when I consider what barriers are left assuming my current model of abstraction and how it plays with the world are (close enough to) correct, the problems in the OP are the biggest. One of the main qualitative takeaways from the abstraction project is that clean cross-model correspondences probably do exist surprisingly often (a prediction which neural network interpretability work has confirmed to some degree). But that's an answer to a question I don't know how to properly set up yet, and the details of the question itself seem important. What criteria do we want these correspondences to satisfy? What criteria does the abstraction picture predict they satisfy in practice? What criteria do they actually satisfy in practice? I don't know yet.

Comment by johnswentworth on Anatomy of a Gear · 2020-11-16T19:00:45.457Z · LW · GW

Good question.

Everyday Lessons from High-Dimensional Optimization and Gears vs Behavior talk about why we use gears. Briefly: we represent high-dimensional systems as a bunch of coupled low-dimensional systems because brute-force-y reasoning works well in low dimensions, but not in high dimensions. So, to reason about high dimensions, we break the system up, use brute-force-y tricks locally on the low-dimensional parts, and then propagate information between components. This also usually makes our models generalize well, because the low-dimensional interfaces of the gears correspond to modularity of reality (just as low-dimensional function interfaces in software correspond to modularity of the code). If there's a change in one subsystem, then the impact of that change on the rest of the system will be mediated by the change in the one-dimensional summary.

For instance, in the gearbox in the post, if we hit the middle gear with a hammer and it breaks, what happens? If we have a gears-level model, then we can re-use most of that model in the new regime - the upper two gears and the handle are still coupled in the same way, and the lower two gears and the wheel are still coupled in the same way. (Though, in this case the gearbox is sufficiently low-dimensional that the difference between using the gears and using the blue boxes isn't too dramatic. In general, these things become more important as the system grows higher-dimensional and more complex.)
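To make the "one-dimensional summary" point concrete, here's a hedged Python sketch (the function names and ratios are my own invention, not from the post): two subsystems coupled only through a single scalar, the middle shaft's rotation rate.

```python
# Hypothetical sketch of the "low-dimensional interface" idea: two
# subsystems coupled only through one scalar (the middle shaft's rotation
# rate). Any change inside the upper subsystem affects the lower one only
# via that single number.

def upper_subsystem(handle_speed, upper_ratio):
    """Handle + upper gears: outputs the middle shaft's rotation rate."""
    return handle_speed * upper_ratio

def lower_subsystem(middle_speed, lower_ratio):
    """Lower gears + wheel: takes the middle shaft's rate as its only input."""
    return middle_speed * lower_ratio

def wheel_speed(handle_speed, upper_ratio, lower_ratio):
    middle = upper_subsystem(handle_speed, upper_ratio)  # 1-D interface
    return lower_subsystem(middle, lower_ratio)

print(wheel_speed(2.0, 0.5, 3.0))   # 3.0
# "Hitting the middle gear with a hammer" changes upper_ratio, but the
# lower subsystem's model is reused unchanged -- only `middle` differs.
print(wheel_speed(2.0, 0.25, 3.0))  # 1.5
```

The point of the sketch: after the hammer blow, we only re-derive `upper_ratio`; everything downstream of the one-dimensional interface carries over as-is.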

Comment by johnswentworth on Signalling & Simulacra Level 3 · 2020-11-16T16:39:39.154Z · LW · GW

You have a concept of apples before learning the word (otherwise you wouldn't know which thing in our very-high-dimensional world to tie the word to; word-learning does not require nearly enough examples to narrow down the concept space without some pre-existing concept). Whatever data structure your brain uses to represent the concept is separate from the word itself, and that's the thing I'm talking about here.

Well, really I'm talking about the idealized theoretical Bayesian version of that thing. Point is, it should not require other agents in the picture, including your parents.

Comment by johnswentworth on What considerations influence whether I have more influence over short or long timelines? · 2020-11-15T17:07:47.459Z · LW · GW

I'd be interested to hear more about why you think resources are likely to be the main constraint, especially in light of that OpenAI report earlier this year.

Comment by johnswentworth on Signalling & Simulacra Level 3 · 2020-11-15T17:01:47.531Z · LW · GW

Nope, that is not what I'm talking about here. At least I don't think so. The thing I'm talking about applies even when there's only one agent; it's a question of how that agent's own internal symbols end up connected to physical things in the world, for purposes of the agent's own reasoning. Honesty when communicating with other agents is related, but sort of tangential.

Comment by johnswentworth on Final Babble Challenge (for now): 100 ways to light a candle · 2020-11-15T00:04:24.344Z · LW · GW
  1. Match
  2. Lighter
  3. The other kind of lighter, with the stick and handle
  4. Flint and a knife
  5. Those lighters we used on the bunsen burners in high school chem class. Man, there’s a lot of different kinds of lighters.
  6. Google for unusual lighters, buy one, and use that.
  7. Bunsen burner
  8. Electric sparker from a grill/stove
  9. Grill/stove
  10. Another already-lit candle
  11. Another already-lit fire
  12. Blowtorch
  13. Molotov cocktail
  14. Rube goldberg machine
  15. Battery and some steel filings
  16. Sparkler
  17. Fireworks
  18. Model rocket igniter
  19. Rocket igniter
  20. Rocket exhaust
  21. Heat from orbital re-entry
  22. Throw it into Mount Doom
  23. Find those people who claim to be the spark that will light the fire that will burn the First Order down, and ask them for a light before their tactical and strategic incompetence gets them all killed.
  24. Welder
  25. Plasma torch
  26. Lens, a sunny day and a steady hand
  27. Galaxy Note 7
  28. Two paperclips and an electrical outlet
  29. Pure sodium and a few drops of water
  30. Kite, wire and thunderstorm
  31. Leave the candle nearby some bored Boy Scouts.
  32. Hand grenade
  33. Ship the candle to California
  34. Parabolic reflector and some sun
  35. Dragon breath
  36. Car battery, a wrench, and something besides my hand with which to hold the wrench
  37. Steel cable, hook, and a high-voltage power line
  38. Flamethrower
  39. Propane tank and a rifle
  40. Lit propane lantern
  41. Use a space heater in a way which will definitely void the warranty
  42. Leave the candle on a table in somebody else's house, then kill their power
  43. Feed candle to a goat, let the mitochondria burn it
  44. Piss off an arsonist, leave the candle at home and go on vacation
  45. Throw candle in a bonfire
  46. Leave the candle in a church and wait for them to light it
  47. Leave candle in oven on self-clean cycle
  48. Toast candle in toaster
  49. Autoclave
  50. Campfire
  51. Poke candle into glassblowing furnace
  52. Poke candle into vat of molten steel
  53. Leave candle in kiln
  54. Find an oilwell with a not-very-thrifty operator and light the candle from the gas they’re burning off
  55. Pilot light
  56. Bring candle to the Darvaza gas crater
  57. Bring candle to an eternal flame monument
  58. Torch
  59. Light it from a refinery gas flare
  60. Throw it into the sun
  61. Visit an Amish kitchen, use the first flame available
  62. Break an old-school lightbulb, touch candle wick to it while it’s still hot.
  63. 9-volt battery and a fork
  64. Drive around California inspecting the power lines and transformers and whatnot until sparks are found, then light the candle from the sparks
  65. Go to a (coal/oil/gas) power plant, and throw the candle into the fuel pile
  66. Hide the candle inside of a missile which will soon be used for wargames or something
  67. Hide the candle in a nuclear test site in North Korea
  68. Hide the candle in some other bomb test site
  69. Tie the candle to a weight, and let enough string hang off to drag it back after throwing, then throw it into a minefield. If nothing goes boom, drag it back. If it still hasn’t gone boom, throw again at a different spot. Iterate until boom.
  70. Throw the candle into an active jet engine
  71. Shine a flashlight at the candle. Now the candle is lit.
  72. Arrange a romantic candle-lit dinner with significant other, then pretend to not have any candle-lighting tools available and see how they light it.
  73. Light it with a cigarette
  74. Light it from the flaming pitchfork outside the Hell’s Kitchen restaurant at Caesar’s Palace
  75. Throw it out the window of an airplane while flying through a thunderstorm
  76. Assault a castle, and hope defenders drop burning pitch with which to light the candle
  77. Defend the castle, and hope the attackers shoot flaming arrows with which to light the candle
  78. Join the volunteer fire department, and light it from whatever’s on fire on the first call
  79. Leave ambiguous political remarks on twitter/facebook, and light the candle from the resulting dumpster fire
  80. Blasting cap
  81. Drop the candle down the hole at a blasting site
  82. Put a little bit of foil around the wick, then put the candle in the microwave and turn it on
  83. Put a little bit of foil around the wick, then shoot the candle at high speed next to a strong magnet
  84. High intensity laser
  85. Place the candle in a large tank of air, then rapidly compress it enough that the heat lights the candle
  86. Coat the wick with a solid oxidizer, then place the whole thing in a vacuum chamber and shoot it with an electron gun
  87. Coat the wick with a solid oxidizer, then throw it into a deep-sea vent
  88. Obtain a black hole, then use it as a gravitational lens to focus sunlight on the candle.
  89. Coat the wick in a glow-in-the-dark chemical rather than burning it
  90. Leave the candle in a very large compost pile
  91. Leave the candle in a box with some rags soaked in linseed oil
  92. Leave the candle in a loose pile of pyrite and coal on a hot day
  93. Leave the candle in a pile of pistachio nuts
  94. Leave the candle in a pile of nitrate film
  95. Place the candle in a pile of hay in the back of an 18-wheeler, with a space heater at the other end of the truck bed (opposite the hay pile). Then have an amateur driver race the truck around a track.
  96. Leave the candle in an open field and wait for a meteor to strike.
  97. Leave the candle in an open field and wait for the second law of thermodynamics to somehow, someway take its course.
  98. Place the candle in a room with a 100% oxygen atmosphere, and don’t try very hard to prevent fire.
  99. Light it from the Olympic torch
  100. Attempt to make homemade explosives with the candle nearby.

Legal text: results not guaranteed. May result in destruction of candle, severe burns, loss of limbs, spontaneous appearance of Boy Scouts, property damage, and/or loss of life. Do not attempt without supervision by a properly licensed candle lighter. No pistachio nuts were harmed in the compilation of this list.

Comment by johnswentworth on Signalling & Simulacra Level 3 · 2020-11-14T23:06:40.688Z · LW · GW

I believe the last section of this post is pointing to something central and important which is really difficult to articulate. Which is ironic, since "how does articulating concepts work?" is kinda part of it.

To me, it feels like Bayesianism is missing an API. Getting embeddedness and reflection and communication right all require the model talking about its own API, and that in turn requires figuring out what the API is supposed to be - like how the literal meanings of things passed in and out actually get tied to the world.

Comment by johnswentworth on A Self-Embedded Probabilistic Model · 2020-11-14T22:20:36.758Z · LW · GW

I like to think that I influenced your choice of subject.

Yup, you did.

it seems that "head-state" is what would usually called "state" in TMs.

Correct. Really, the "state" of a TM (as the word is used most often in other math/engineering contexts) is both the head-state and whatever's on the tape.

In a technical sense, the "state" of a system is usually whatever information forms a Markov blanket between future and past - i.e. the interaction between everything in the future and everything in the past should be completely mediated by the system state. There are lots of exceptions to this, and the word isn't used consistently everywhere, but that's probably the most useful heuristic.
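To illustrate the "state mediates past and future" heuristic, here's a minimal sketch (the machine and rules are hypothetical, not from the thread): a Turing machine's full configuration is (head-state, tape, head position), and given that triple, the future is fully determined regardless of how the machine got there.

```python
# Minimal sketch: the full "state" of a Turing machine is the triple
# (head_state, tape, head_position). Given that triple, the next
# configuration is fully determined, so the triple screens off the
# machine's past from its future -- a Markov blanket between the two.

def step(config, rules):
    """One TM step. `rules` maps (head_state, symbol) -> (new_state, new_symbol, move)."""
    head_state, tape, pos = config
    symbol = tape.get(pos, 0)  # blank cells read as 0
    new_state, new_symbol, move = rules[(head_state, symbol)]
    new_tape = dict(tape)
    new_tape[pos] = new_symbol
    return (new_state, new_tape, pos + move)

# Toy rules: in state "A", on a blank cell, write 1 and move right.
rules = {("A", 0): ("A", 1, +1)}

# However this configuration was reached, its future is the same:
config = ("A", {0: 1, 1: 1}, 2)
later = step(step(config, rules), rules)
print(later)  # ('A', {0: 1, 1: 1, 2: 1, 3: 1}, 4)
```

Two different histories arriving at the same `config` must have identical futures, which is exactly the Markov-blanket property described above.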

Comment by johnswentworth on Interest survey: Forming an MIT Mystery Hunt team (Jan. 15-18, 2021) · 2020-11-13T22:05:57.991Z · LW · GW

Hmm, apparently coordination on team name suggestions needs to happen before filling out the survey. Perhaps not the intended incentives?

Team names off the top of my head:

  • Rat Team
  • Gray Tribe
  • Least Wrong Team
  • Bayesian Conspirators

Comment by johnswentworth on Communication Prior as Alignment Strategy · 2020-11-13T21:38:13.596Z · LW · GW

I don't think it can be significantly harder for behavior-space than reward-space. If it were, then one of our first messages would be (a mathematical version of) "the behavior I want is approximately reward-maximizing". I don't think that's actually the right way to do things, but it should at least give a reduction of the problem.

Anyway, I'd say the most important difference between this and various existing strategies is that we can learn "at the outermost level". We can treat the code as message, so there can potentially be a basin of attraction even for bugs in the code. The entire ontology of the agent-model can potentially be wrong, but still end up in the basin. We can decide to play an entirely different game. Some of that could potentially be incorporated into other approaches (maybe it has and I just didn't know about it), though it's tricky to really make everything subject to override later on.

Of course, the trade-off is that if everything is subject to override then we really need to start in the basin of attraction - there's no hardcoded assumptions to fall back on if things go off the rails. Thus, robustness tradeoff.

Comment by johnswentworth on What considerations influence whether I have more influence over short or long timelines? · 2020-11-13T21:32:58.211Z · LW · GW

Even though my absolute influence may be low, it seems higher in the US than in Asia, and thus higher in short-timelines scenarios than long-timelines scenarios. Or so I'm thinking.

Lemme sketch out a model here. We start with all the people who have influence on the direction of AI. We then break out two subgroups - US and Asia - and hypothesize that total influence of US goes down over time, and total influence of Asia goes up over time. Then we observe that you are in the US group, so this bodes poorly for your own personal influence. However, your own influence is small, which means that your contribution to the US' total influence is small. This means your own influence can vary more-or-less independently of the US total; a delta in your influence is not large enough to significantly cause a delta in the US total influence. Now, if there was some reason to think that your influence were strongly correlated with the US total, then the US total would matter. And there are certainly things we could think of which might make that true, but "US total influence" does not seem likely to be a stronger predictor of "Daniel's influence" than any of 50 other variables we could think of. The full pool of US AI researchers/influencers does not seem like all that great a reference class for Daniel Kokotajlo - and as long as your own influence is small relative to the total, a reference class is basically all it is.

An analogy: GDP is only very weakly correlated with my own income. If I had dramatically more wealth - like hundreds of millions or billions - then my own fortunes would probably become more tied to GDP. But as it is, using GDP to predict my income is effectively treating the whole US population as a reference class for me, and it's not a very good reference class.

Anyway, the more interesting part...

I apparently have very different models of how the people working on AI are likely to shift over time. If everything were primarily resource-constrained, then I'd largely agree with your predictions. But even going by current trends, algorithmic/architectural improvements matter at least as much as raw resources. Giant organizations - especially governments - are not good at letting lots of people try their clever ideas and then quickly integrating the successful tricks into the main product. Big organizations/governments are all about coordinating everyone around one main plan, with the plan itself subject to lots of political negotiation and compromise, and then executing that plan. That's good for deploying lots of resources, but bad for rapid innovation.

Along similar lines, I don't think the primary world seat of innovation is going to shift from the US to China any time soon. China has the advantage in terms of raw population, but it's only a factor of 4 advantage; really not that dramatic a difference in the scheme of things. On the other hand, Western culture seems dramatically and unambiguously superior in terms of producing innovation, from an outside view. China just doesn't produce breakthrough research nearly as often. 20 years ago that could easily have been attributed to less overall wealth, but that becomes less and less plausible over time - maybe I'm just not reading the right news sources, but China does not actually seem to be catching up in this regard. (That said, this is all mainly based on my own intuitions, and I could imagine data which would change my mind.)

That said, I also don't think a US/China shift is all that relevant here either way; it's only weakly correlated with influence of this particular community. This particular community is a relatively small share of US AI work, so a large-scale shift would be dominated by the rest of the field, and the rationalist community in particular has many channels to grow/shrink in influence independent of the US AI community. It's essentially the same argument I made about your influence earlier, but this time applied to the community as a whole.

I do think "various other things might happen that effectively impose a discount rate" is highly relevant here. That does cut both ways, though: where there's a discount rate, there's a rate of return on investment, and the big question is whether rationalists have a systematic advantage in that game.

Comment by johnswentworth on Why You Should Care About Goal-Directedness · 2020-11-13T19:23:45.475Z · LW · GW

In theory I could treat myself as a black box, though even then I'm going to need at least a functional self model (i.e. model of what outputs yield what inputs) in order to get predictions out of the model for anything in my future light cone.

But usually I do assume that we want a "complete" world model, in the sense that we're not ignoring any parts by fiat. We can be uncertain about what my internal structure looks like, but that still leaves us open to update if e.g. we see some FMRI data. What I don't want is to see some FMRI data and then go "well, can't do anything with that, because this here black box is off-limits". When that data comes in, I want to be able to update on it somehow.

Comment by johnswentworth on Why You Should Care About Goal-Directedness · 2020-11-13T18:39:19.287Z · LW · GW

I'm not quite clear on what you're asking, so I'll say some things which sound relevant.

I'm embedded in the world, so my world model needs to contain a model of me, which means my world model needs to contain a copy of itself. That's the sense in which my own world model is self-referential.

Practically speaking, this basically means taking the tricks from Writing Causal Models Like We Write Programs, and then writing the causal-model-version of a quine. It's relatively straightforward; the main consequence is that the model is necessarily lazily evaluated (since I'm "too small" to expand the whole thing), and then the interesting question is which queries to the model I can actually answer (even in principle) and how fast I can answer them.

In particular, based on how game theory works, there's probably a whole class of optimization queries which can be efficiently answered in-principle within this self-embedded model, but it's unclear exactly how to set them up so that the algorithm is both correct and always halts.
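A minimal sketch of the lazy self-embedding (illustrative only; this is a toy stand-in, not the causal-model machinery from that post): represent the model as a dict of thunks, one of which returns the model itself.

```python
# Hedged sketch: a world model represented as a dict of thunks, where one
# entry lazily returns the model itself. Laziness is what makes the
# self-embedding finite: the inner copy is only expanded when a query
# actually asks for it, quine-style.

def make_world_model():
    model = {}
    model["weather"] = lambda: "sunny"
    # The agent is inside the world, so the model contains (a thunk
    # returning) the model itself.
    model["agent_model"] = lambda: model
    return model

world = make_world_model()

# Queries expand the model only as far as needed:
print(world["weather"]())               # sunny
inner = world["agent_model"]()          # one level of self-expansion
print(inner["weather"]())               # sunny -- same model
print(inner["agent_model"]() is world)  # True: the embedding is exact
```

The interesting questions in the comment above then become: which queries against such a structure terminate, and how fast?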

My world model is necessarily "high-level" in the sense that I don't have direct access to all the low-level physics of the real world; I expect that the real world (approximately) abstracts into my model, at least within the regimes I've encountered. I probably also have multiple levels of abstraction within my world model, in order to quickly answer a broad range of queries.

Did that answer the question? If not, can you give an example or two to illustrate what you mean by self-reference?

Comment by johnswentworth on What considerations influence whether I have more influence over short or long timelines? · 2020-11-13T18:19:34.232Z · LW · GW

Rudeness no problem; did I come across as arrogant or something?

No not at all, it's just that the criticism was almost directly "your status is not high enough for this". It's like I took the underlying implication which most commonly causes offense and said it directly. It was awkward because it did not feel like you were over-reaching in terms of status, even in appearance, but you happened to be reasoning in a way which (subtly) only made sense for a version of Daniel with much more public following. So I somehow needed to convey that without the subtext which such a thing would almost always carry.

That was kind of long-winded, but this was an unusually interesting case of word-usage.

It seems to me that this community has more influence in short-timeline worlds than long-timeline worlds. Significantly more.

Ah interesting. I haven't thought much about the influence of the community as a whole (as opposed to myself); I find this plausible, though I'm definitely not convinced yet. Off the top of my head, seems like it largely depends on the extent to which the rationalist community project succeeds in the long run (even in the weak sense of individual people going their separate ways and having outsized impact) or reverts back to the mean. Note that that is itself something which you and I probably do have an outsized impact on!

When I look at the rationalist community as a bunch of people who invest heavily in experimentation and knowledge and learning about the world, that looks to me like a group which is playing the long game and should have a growing advantage over time. On the other hand, if I look at the rationalist community as a bunch of plurality-software-developers with a disproportionate chunk of AI researchers... yeah, I can see where that would look like influence on AI in the short term.

Comment by johnswentworth on A Correspondence Theorem in the Maximum Entropy Framework · 2020-11-13T18:10:10.880Z · LW · GW

I basically agree with what you're saying about policy implications. What I want to say is more like "if we actually tried high-level interventions X and Y, and empirically X worked better for high-level success metric Z, then that should still be true under the new model, with a lower-level grounding of X, Y and Z". It's still possible that an old model incorrectly predicts which of X and Y work better empirically, which would mean that the old model has worse predictive performance. Similarly: if the old model predicts that X is the optimal action, then the new model should still predict that, to the extent that the old model successfully predicts the world. If the new model is making different policy recommendations, then those should be tied to some place where the old model had inferior predictive power.

This seems to me like the sort of thing one should think about when designing an AI one hopes to align.

This is not obvious to me. Can you explain the reasoning and/or try to convey the intuition?

Comment by johnswentworth on What could one do with truly unlimited computational power? · 2020-11-13T05:21:34.822Z · LW · GW

Yeah, that makes sense.

How exactly does the "speed" box take input? Because if its input is a plain decimal number, then we really only have access to exponential compute, since we'll need to type out a string of n digits to get 2^O(n) compute. If it takes more general programs (which output a number), then we can use the machine itself to search for short formulas which output large numbers, which might buy us a lot more compute. On the other hand, if the speed box takes general programs, then it needs to somehow recognize programs which don't halt, which we could also exploit for hypercomputation.
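The arithmetic behind the "plain decimal number" case, as a quick sketch (the helper function is my own, purely illustrative):

```python
# Typing out a speed as a plain decimal number means n keystrokes buy at
# most 10^n - 1 steps -- i.e. compute exponential in the length of the
# input, but no more.

def max_speed_from_digits(n_digits: int) -> int:
    """Largest speed expressible with n decimal digits."""
    return 10 ** n_digits - 1

for n in [1, 3, 10]:
    print(n, max_speed_from_digits(n))

# By contrast, a short *program* can denote a vastly larger number:
# "10**(10**100)" is 13 characters but dwarfs any 13-digit speed.
```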

Comment by johnswentworth on Communication Prior as Alignment Strategy · 2020-11-13T02:11:55.798Z · LW · GW

Roughly, yeah, though there are some differences - e.g. here the AI has no prior "directly about" values, it's all mediated by the "messages", which are themselves informing intended AI behavior directly. So e.g. we don't need to assume that "human values" live in the space of utility functions, or that the AI is going to explicitly optimize for something, or anything like that. But most of the things which are hard in CIRL are indeed still hard here; it doesn't really solve anything in itself.

One way to interpret it: this approach uses a similar game to CIRL, but strips out most of the assumptions about the AI and human being expected utility maximizers. To the extent we're modelling the human as an optimizer, it's just an approximation to kick off communication, and can be discarded later on.

Comment by johnswentworth on Why You Should Care About Goal-Directedness · 2020-11-13T00:31:53.580Z · LW · GW

A few other ways in which goal-directedness intersects with abstraction:

  • abstraction as an instrumentally convergent tool: to the extent that computation is limited but the universe is local, we'd expect abstraction to be used internally by optimizers of many different goals.
  • instrumental convergence to specific abstract models: the specific abstract model used should be relatively insensitive to variation in the goal.
  • type signature of the goal: to the extent that humans are goal-directed, our goals involve high-level objects (like cars or trees), not individual atoms.
  • embedded agency = abstraction + generality + goal-directedness. Roughly speaking, an embedded agent is a low-level system which abstracts into a goal-directed system, and that goal-directed system can operate across a wide range of environments requiring different behaviors.

what can be thrown out of the perfect model to get a simpler non-self-referential model (an abstraction) that is useful for a specific purpose?

Kind of tangential, but it's actually the other way around. The low-level world is "non-self-referential"; the universe itself is just one big causal DAG. In order to get a compact representation of it (i.e. a small enough representation to fit in our heads, which are themselves inside the low-level world), we sometimes throw away information in a way which leaves a simpler "self-referential" abstract model. This is a big part of how I think about agenty things in a non-agenty underlying world.