Look For Principles Which Will Carry Over To The Next Paradigm 2022-01-14T20:22:58.606Z
We Choose To Align AI 2022-01-01T20:06:23.307Z
The Plan 2021-12-10T23:41:39.417Z
Why Study Physics? 2021-11-27T22:30:21.163Z
How To Get Into Independent Research On Alignment/Agency 2021-11-19T00:00:21.600Z
Relaxation-Based Search, From Everyday Life To Unfamiliar Territory 2021-11-10T21:47:45.474Z
Study Guide 2021-11-06T01:23:09.552Z
True Stories of Algorithmic Improvement 2021-10-29T20:57:13.638Z
What Do GDP Growth Curves Really Mean? 2021-10-07T21:58:15.121Z
What Selection Theorems Do We Expect/Want? 2021-10-01T16:03:49.478Z
Some Existing Selection Theorems 2021-09-30T16:13:17.879Z
Selection Theorems: A Program For Understanding Agents 2021-09-28T05:03:19.316Z
Shared Frames Are Capital Investments in Coordination 2021-09-23T23:24:51.263Z
Testing The Natural Abstraction Hypothesis: Project Update 2021-09-20T03:44:43.061Z
Writing On The Pareto Frontier 2021-09-17T00:05:32.310Z
Optimizing Multiple Imperfect Filters 2021-09-15T22:57:16.961Z
Framing Practicum: Comparative Advantage 2021-09-09T23:59:09.468Z
The Telephone Theorem: Information At A Distance Is Mediated By Deterministic Constraints 2021-08-31T16:50:13.483Z
How To Write Quickly While Maintaining Epistemic Rigor 2021-08-28T17:52:21.692Z
Framing Practicum: Turnover Time 2021-08-24T16:29:04.701Z
What fraction of breakthrough COVID cases are attributable to low antibody count? 2021-08-22T04:07:46.495Z
Framing Practicum: Timescale Separation 2021-08-19T18:27:55.891Z
Framing Practicum: Dynamic Equilibrium 2021-08-16T18:52:00.632Z
Staying Grounded 2021-08-14T17:43:53.003Z
Framing Practicum: Bistability 2021-08-12T04:51:53.287Z
Framing Practicum: Stable Equilibrium 2021-08-09T17:28:48.338Z
Slack Has Positive Externalities For Groups 2021-07-29T15:03:25.929Z
Working With Monsters 2021-07-20T15:23:20.762Z
Generalizing Koopman-Pitman-Darmois 2021-07-15T22:33:03.772Z
The Additive Summary Equation 2021-07-13T18:23:06.016Z
Potential Bottlenecks to Taking Over The World 2021-07-06T19:34:53.016Z
The Language of Bird 2021-06-27T04:44:44.474Z
Notes on War: Grand Strategy 2021-06-18T22:55:30.174Z
Variables Don't Represent The Physical World (And That's OK) 2021-06-16T19:05:08.512Z
The Apprentice Experiment 2021-06-10T03:29:27.257Z
Search-in-Territory vs Search-in-Map 2021-06-05T23:22:35.773Z
Selection Has A Quality Ceiling 2021-06-02T18:25:54.432Z
Abstraction Talk 2021-05-25T16:45:15.996Z
SGD's Bias 2021-05-18T23:19:51.450Z
How to Play a Support Role in Research Conversations 2021-04-23T20:57:50.075Z
Updating the Lottery Ticket Hypothesis 2021-04-18T21:45:05.898Z
Computing Natural Abstractions: Linear Approximation 2021-04-15T17:47:10.422Z
Specializing in Problems We Don't Understand 2021-04-10T22:40:40.690Z
Testing The Natural Abstraction Hypothesis: Project Intro 2021-04-06T21:24:43.135Z
Core Pathways of Aging 2021-03-28T00:31:49.698Z
Another RadVac Testing Update 2021-03-23T17:29:10.741Z
Chaos Induces Abstractions 2021-03-18T20:08:21.739Z
What's So Bad About Ad-Hoc Mathematical Definitions? 2021-03-15T21:51:53.242Z
How To Think About Overparameterized Models 2021-03-03T22:29:13.126Z
RadVac Commercial Antibody Test Results 2021-02-26T18:04:09.171Z


Comment by johnswentworth on Core Pathways of Aging · 2022-01-24T15:42:39.348Z · LW · GW

That is one of the more interesting hypotheses I've heard! Thank you for promoting it to my attention.

Comment by johnswentworth on The Telephone Theorem: Information At A Distance Is Mediated By Deterministic Constraints · 2022-01-19T19:53:39.619Z · LW · GW

Nice! That is a pretty good fit for the sorts of things the Telephone Theorem predicts, and potentially relevant information for selection theorems as well.

Comment by johnswentworth on Long covid: probably worth avoiding—some considerations · 2022-01-18T17:25:54.485Z · LW · GW

It's not that I don't want to believe it, it's that long covid is the sort of thing I'd expect to hear people talk about and publish papers about even in a world where it isn't actually significant, and many of those papers would have statistically-significant positive results even in a world where long covid isn't actually significant. Long covid is a story which has too much memetic fitness independent of its truth value. So I have to apply enough skepticism that I wouldn't believe it in a world where it isn't actually significant.

No, these problems are most probably cause by a lack of oxygen getting through to tissues.

That sounds right for shortness of breath, chest pain, and low oxygen levels. I'm more skeptical that it's driving palpitations, fatigue, joint and muscle pain, brain fog, lack of concentration, forgetfulness, sleep disturbance, and digestive and kidney problems; those sound a lot more like a list of old-age issues.

Comment by johnswentworth on Challenges with Breaking into MIRI-Style Research · 2022-01-18T01:51:30.303Z · LW · GW

There's definitely some truth to this, but I guess I'm skeptical that there isn't anything that we can do about some of these challenges. Actually, rereading I can see that you've conceded this towards the end of your post. I agree that there might be a limit to how much progress we can make on these issues, but I think we shouldn't rule out making progress too quickly.

To be clear, I don't intend to argue that the problem is too hard or not worthwhile or whatever. Rather, my main point is that solutions need to grapple with the problems of teaching people to create new paradigms, and working with people who don't share standard frames. I expect that attempts to mimic the traditional pipelines of paradigmatic fields will not solve those problems. That's not an argument against working on it, it's just an argument that we need fundamentally different strategies than the standard education and career paths in other fields.

Comment by johnswentworth on Long covid: probably worth avoiding—some considerations · 2022-01-18T01:31:49.722Z · LW · GW

"Baseline" does not mean they stick around. It means that background processes introduce new SnCs at a steady rate, so the equilibrium level is nonzero. As the removal rate slows, that equilibrium level increases, but that still does not mean that the "baseline" SnCs are long-lived, or that a sudden influx of new SnCs (from e.g. covid) will result in a permanently higher level.
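A minimal sketch of that turnover picture, assuming the simplest possible model: constant production and first-order clearance, dN/dt = production - removal * N. The rates and the pulse size below are hypothetical numbers chosen only for illustration, not measured values. The equilibrium level is production / removal, and a one-time influx decays back toward it on a timescale of 1 / removal.

```python
import math

def snc_level(t, production, removal, initial):
    """Closed-form solution of dN/dt = production - removal * N, with N(0) = initial."""
    equilibrium = production / removal
    return equilibrium + (initial - equilibrium) * math.exp(-removal * t)

production = 10.0  # hypothetical: new SnCs per day from background processes
removal = 0.1      # hypothetical: fraction of SnCs cleared per day

baseline = production / removal   # nonzero equilibrium, even though each cell is short-lived
after_pulse = baseline + 500.0    # sudden one-time influx, e.g. from an infection

for day in (0, 10, 30, 90):
    level = snc_level(day, production, removal, after_pulse)
    print(f"day {day:3d}: level {level:8.1f} (baseline {baseline:.1f})")

# Halving the removal rate doubles the equilibrium level,
# but a pulse on top of it still decays away rather than persisting.
slow_baseline = production / (removal / 2)
print(f"baseline with slower removal: {slow_baseline:.1f}")
```

In this toy model the "baseline" is a flow balance, not a pool of long-lived cells, which is the distinction the comment draws.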

Comment by johnswentworth on Long covid: probably worth avoiding—some considerations · 2022-01-18T00:06:12.780Z · LW · GW

At this point, I have yet to see any compelling evidence that any SnCs stick around over a long timescale, despite this being a thing which I'd expect to have heard about if anybody had the evidence. Conversely, it sure does look like treatments to remove senescent cells have to be continuously administered; a one-time treatment wears off on roughly the same timescale that SnCs turn over. That pretty strongly suggests that there are not pools of long-lived SnCs hanging around. And a noticeable pathology would take a lot of SnCs sticking around.

Comment by johnswentworth on Long covid: probably worth avoiding—some considerations · 2022-01-17T22:53:16.177Z · LW · GW

That is not how senescent cells work. They turn over on a fast timescale. If covid induces a bunch of senescent cell development (which indeed makes sense), those senescent cells should generally be cleared out on a timescale of weeks. Any long-term effects would need to be mediated by something else.

Comment by johnswentworth on A Correspondence Theorem · 2022-01-17T20:29:34.754Z · LW · GW

Note to self: use infinitely many observable variables  instead of just two, and the condition for  should probably be that no infinite subset of the 's are mutually dependent (or something along those lines). Intuitively: for any "piece of latent information", either we have infinite data on that piece and can precisely estimate it, or it only significantly impacts finitely many variables.

Comment by johnswentworth on Long covid: probably worth avoiding—some considerations · 2022-01-17T18:14:00.321Z · LW · GW

Sorry, I was lumping together misattribution and the like under "psychosomaticity", and I probably shouldn't have done that.

Comment by johnswentworth on Long covid: probably worth avoiding—some considerations · 2022-01-17T17:28:54.890Z · LW · GW

This mostly sounds like age-related problems. I do expect generic age-related pathologies to be accelerated by covid (or any other major stressor), but if that's the bulk of what's going on, then I'd say "long covid" is a mischaracterization. It wouldn't be relevant to non-elderly people, and to elderly people it would be effectively the same as any other serious stressor.

Comment by johnswentworth on Challenges with Breaking into MIRI-Style Research · 2022-01-17T17:12:03.563Z · LW · GW

The object-level claims here seem straightforwardly true, but I think "challenges with breaking into MIRI-style research" is a misleading way to characterize it. The post makes it sound like these are problems with the pipeline for new researchers, but really these problems are all driven by challenges of the kind of research involved.

The central feature of MIRI-style research which drives all this is that MIRI-style research is preparadigmatic. The whole point of preparadigmatic research is that:

  • We don't know the right frames to apply (and if we just picked some, they'd probably be wrong)
  • We don't know the right skills or knowledge to train (and if we just picked some, they'd probably be wrong)
  • We don't have shared foundations for communicating work (and if we just picked some, they'd probably be wrong)
  • We don't have shared standards for evaluating work (and if we just picked some, they'd probably be wrong)

Here's how the challenges of preparadigmicity apply to the points in the post.

  • MIRI doesn’t seem to be running internships[3] or running their AI safety for computer scientists workshops

MIRI does not know how to efficiently produce new theoretical researchers. They've done internships, they've done workshops, and the yields just weren't that great, at least for producing new theorists.

  • You can park in a standard industry job for a while in order to earn career capital for ML-style safety. Not so for MIRI-style research.
  • There are well-crafted materials for learning a lot of the prerequisites for ML-style safety.
  • There seems to be a natural pathway of studying a masters then pursuing a PhD to break into ML-style safety. There are a large number of scholarships available and many countries offer loans or income support
  • General AI safety programs and support - ie. AI Safety Fundamentals Course, AI Safety Support, AI Safety Camp, Alignment Newsletter, ect. are naturally going to strongly focus on ML-style research and might not even have the capability to vet MIRI-style research.

There is no standardized field of knowledge with the tools we need. We can't just go look up study materials to learn the right skills or knowledge, because we don't know what skills or knowledge those are. There's no standard set of alignment skills or knowledge which an employer could recognize as probably useful for their own problems, so there's no standardized industry jobs. Similarly, there's no PhD for alignment; we don't know what would go into it.

  • There's no equivalent to submitting a paper[4]. If a paper passes review, then it gains a certain level of credibility. There are upvotes, but this signaling mechanism is more distorted by popularity or accessibility. Further, unlike writing an academic paper, writing alignment forum posts won't provide credibility outside of the field.

We don't have clear shared standards for evaluating work. Most people doing MIRI-style research think most other people doing MIRI-style research are going about it all wrong. Whatever perception of credibility might be generated by something paper-like would likely be fake.

  • It is much harder to find people with similar interests to collaborate with or mentor you. Compare to how easy it is to meet a bunch of people interested in ML-style research by attending EA meetups or EAGx.

We don't have standard frames shared by everyone doing MIRI-style research, and if we just picked some frames they would probably be wrong, and the result would probably be worse than having a wide mix of frames and knowing that we don't know which ones are right.

Main takeaway of all that: most of the post's challenges of breaking into MIRI-style research accurately reflect the challenges involved in doing MIRI-style research. Figuring out new paths, new frames, applying new skills and knowledge, explaining your own ways of evaluating outputs... these are all central pieces of doing this kind of research. If the pipeline did not force people to figure this sort of stuff out, then it would not select for researchers well-suited to this kind of work.

Now, I do still think the pipeline could be better, in principle. But the challenge is to train people to build their own paradigms, and that's a major problem in its own right. I don't know of anyone ever having done it before at scale; there's no template to copy for this. I have been working on it, though.

Comment by johnswentworth on Long covid: probably worth avoiding—some considerations · 2022-01-17T16:28:43.595Z · LW · GW

Strong upvote, this is great info.

Comment by johnswentworth on Long covid: probably worth avoiding—some considerations · 2022-01-17T00:31:23.000Z · LW · GW

Good points. Some responses:

  • I put a lot more trust in a single study with ground-truth data than in a giant pile of studies with data which is confounded in various ways. So, I trust the study with the antibody tests more than I'd trust basically-any number of studies relying on self-reports. (A different-but-similar application of this principle: I trust the Boston wastewater data on covid prevalence more than I trust all of the data from test results combined.)
  • I probably do have relatively high prior (compared to other people) on health-issues-in-general being psychosomatic. The effectiveness of placebos (though debatable) is one relevant piece of evidence here, though a lot of my belief is driven by less legible evidence than that.
  • I expect some combination of misattribution, psychosomaticity, selection effects (e.g. looking at people hospitalized and thereby accidentally selecting for elderly people), and maybe similar issues which I'm not thinking of at the moment to account for an awful lot of the "long covid" from self-report survey studies. I'm thinking less like 50% of it, and more like 90%+. Basically, when someone runs a survey and publishes data from it, I expect the results to mostly measure things other than what the authors think they're measuring, most of the time, especially when an attribution of causality is involved.

Comment by johnswentworth on Long covid: probably worth avoiding—some considerations · 2022-01-16T18:08:24.549Z · LW · GW

Good point. If we take that post's analysis at face value, then a majority of reported long covid symptoms are probably psychosomatic, but only just barely a majority, not a large majority. Though looking at the post, I'd say a more accurate description is that at least a majority of long covid symptoms are psychosomatic, i.e. it's a majority even if we pretend that all of the supposedly-long-covid symptoms in people who actually had covid are "real".

Comment by johnswentworth on Long covid: probably worth avoiding—some considerations · 2022-01-16T17:42:46.860Z · LW · GW

This is not going to be kind, but it's true and necessary to state. I apologize in advance.

Had you asked me in advance, I would have said that Katja in particular is likely to buy into long covid even in a world where long covid is completely psychosomatic; I think you (Katja) are probably unusually prone to looking-for-reasons-to-"believe"-things-which-are-actually-psychosomatic, without symmetrically looking-for-reasons-to-"disbelieve".

On the object level: the "Long covid probably isn't psychosomatic" section of the post looks pretty compatible with that prior. That section basically says two things:

  • Just because reports of long covid are basically uncorrelated with having had covid does not imply that long covid does not happen
  • There is still evidence of higher-than-usual death rates among people who have had covid

If we take both of these as true, they point to a world where there are some real post-covid symptoms, but the large majority of reported long covid symptoms are still psychosomatic. That seems plausible, but for some reason it isn't propagated into the other sections of the post. For instance, the very first sections of this post are talking about anecdotes and survey studies (at least I think they're survey studies based on a quick glance, didn't look too close), and I do not see in any of those sections any warning along the lines of "BY THE WAY THE LARGE MAJORITY OF THIS IS PROBABLY PSYCHOSOMATIC". You're counting evidence which should have been screened off by the lack of correlation between self-reported long covid symptoms and actually having had covid.

Comment by johnswentworth on Subspace optima · 2022-01-15T04:27:21.842Z · LW · GW

This was a concept which it never occurred to me that people might not have, until I saw the post. Noticing and drawing attention to such concepts seems pretty valuable in general. This post in particular was short, direct, and gave the concept a name, which is pretty good; the one thing I'd change about the post is that it could use a more concrete, everyday example/story at the beginning.

Comment by johnswentworth on Value extrapolation partially resolves symbol grounding · 2022-01-12T16:53:00.838Z · LW · GW

That might work in a tiny world model with only two possible hypotheses. In a high-dimensional world model with exponentially many hypotheses, the weight on happy humans would be exponentially small.

Comment by johnswentworth on Negative Feedback and Simulacra · 2022-01-08T05:37:47.751Z · LW · GW

Simulacra levels were probably the biggest addition to the rationalist canon in 2020. This was one of maybe half-a-dozen posts which I think together cemented the idea pretty well. If we do books again, I could easily imagine a whole book on simulacra, and I'd want this post in it.

Comment by johnswentworth on The First Sample Gives the Most Information · 2022-01-07T22:12:49.793Z · LW · GW

A lot of useful techniques can be viewed as ways to "get the first sample" in some sense. Fermi estimates are one example. Attempting to code something in Python is another.

(I'm not going to explain that properly here. Consider it a hook for a future post.)

Comment by johnswentworth on The First Sample Gives the Most Information · 2022-01-07T21:02:30.822Z · LW · GW

Mark mentions that he got this point from Ben Pace. A few months ago I heard the extended version from Ben, and what I really want is for Ben to write a post (or maybe a whole sequence) on it. But in the meantime, it's an important idea, and this short post is the best source to link to on it.

Comment by johnswentworth on Why Neural Networks Generalise, and Why They Are (Kind of) Bayesian · 2022-01-07T20:00:11.711Z · LW · GW

The work linked in this post was IMO the most important work done on understanding neural networks at the time it came out, and it has also significantly changed the way I think about optimization more generally.

That said, there's a lot of "noise" in the linked papers; it takes some digging to see the key ideas and the data backing them up, and there's a lot of space spent on things which IMO just aren't that interesting at all. So, I'll summarize the things which I consider central.

When optimizing an overparameterized system, there are many many different parameter settings which achieve optimality. Optima are not peaks, they're ridges; there's a whole surface on which optimal performance is achieved. In this regime, the key question is which of the many optima an optimized system actually converges to.

Here's a kind-of-silly way to model it. First, we sample some random point $\theta_0$ in parameter space from the distribution $P[\theta]$; in the neural network case, this is the parameter initialization. Then, we optimize: we find some new parameter values $\theta^*$ such that $f(\theta^*)$ is maximized. But which of the many optimal $\theta^*$ values does our optimizer end up at? If we didn't know anything about the details of the optimizer, one simple guess would be that $\theta^*$ is sampled from the initialization distribution, but updated on the point being optimal, i.e.

$$\theta^* \sim P[\theta \mid f(\theta) \text{ is maximal}] \propto P[\theta] \, \mathbb{1}[f(\theta) \text{ is maximal}]$$

... so the net effect of randomly initializing and then optimizing is equivalent to using the initialization distribution as a prior, doing a Bayesian update on $\theta$ being optimal, and then sampling from that posterior.

The linked papers show that this kind-of-silly model is basically accurate. It didn't have to be this way a priori; we could imagine that the specifics of SGD favored some points over others, so that the distribution of $\theta^*$ was not proportional to the prior. But that mostly doesn't happen (and to the extent it does, it's a relatively weak effect); the data shows that $\theta^*$ values are sampled roughly in proportion to their density in the prior, exactly as we'd expect from the Bayesian-update-on-optimality model.

One implication of this is that the good generalization of neural nets must come mostly from the prior, not from some bias in SGD, because bias in SGD mostly just doesn't impact the distribution of optimized parameter values. The optimized parameter value distribution is approximately-determined by the initialization prior, so any generalization must come from that prior. And indeed, the papers also confirm that the observed generalization error lines up with what we'd expect from the Bayesian-update-on-optimality model.

For me, the most important update from this work has not been specific to neural nets. It's about overparameterized optimization in general: we can think of overparameterized optimization as sampling from the initialization prior updated on optimality, i.e. $P[\theta \mid f(\theta) \text{ is maximal}]$. This is a great approximation to work with analytically, and the papers here show that it is realistic for real complicated systems like SGD-trained neural nets.
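A toy sketch of the Bayesian-update-on-optimality model, under loudly stated assumptions: the two-integer parameter space, the non-uniform prior weights, and the objective with a ridge of optima are all hypothetical, and nothing here runs SGD. It only shows what "sample from the initialization prior, conditioned on being optimal" means, by comparing the exact conditional distribution with rejection sampling from the prior.

```python
import random
from collections import Counter

random.seed(0)

VALUES = [0, 1, 2, 3, 4]
WEIGHTS = [5, 4, 3, 2, 1]  # hypothetical non-uniform "initialization" prior per coordinate

def objective(theta):
    a, b = theta
    return -(a + b - 4) ** 2  # maximal (zero) along the whole ridge a + b == 4

def is_optimal(theta):
    return objective(theta) == 0

def sample_prior():
    # Draw both coordinates independently from the prior.
    return tuple(random.choices(VALUES, weights=WEIGHTS, k=2))

# Exact posterior: the prior restricted to the optimal ridge, then renormalized.
prior = {(a, b): WEIGHTS[a] * WEIGHTS[b] for a in VALUES for b in VALUES}
ridge_mass = sum(p for theta, p in prior.items() if is_optimal(theta))
posterior = {theta: p / ridge_mass for theta, p in prior.items() if is_optimal(theta)}

# Rejection sampling: draw from the prior, keep only the optimal draws.
samples = [t for t in (sample_prior() for _ in range(100_000)) if is_optimal(t)]
counts = Counter(samples)

for theta in sorted(posterior):
    empirical = counts[theta] / len(samples)
    print(theta, f"exact={posterior[theta]:.3f}", f"sampled={empirical:.3f}")
```

Every point on the ridge is equally optimal, so which optimum gets picked is decided entirely by the prior; that is the sense in which generalization "comes from the prior" in this model.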

Comment by johnswentworth on Shuttling between science and invention · 2022-01-07T19:17:20.381Z · LW · GW

One of the main problems I think about is how science and engineering are able to achieve such efficient progress despite the very high dimensionality of our world - and how we can more systematically leverage whatever techniques provide that efficiency. One broad class of techniques I think about a lot involves switching between search-for-designs and search-for-constraints - like proof and counterexample in math, or path and path-of-walls in a maze.

My own writing on the topic is usually pretty abstract; I'm thinking about it in algorithmic terms, as a search process, and asking about big-O efficiency with respect to the dimensionality of the system. People then ask: "ok, but what does this look like in practice?".

This post is what it looks like in practice. We have inventors/engineers who build things and try new designs. We have scientists who characterize the constraints on these designs, the rules which govern them. The magic is in shuttling back-and-forth between those two processes, and Crawford gives a concrete example of what that looks like, in one of history's major innovative events.

Comment by johnswentworth on The Problem of the Criterion is NOT an Open Problem · 2022-01-06T17:38:17.939Z · LW · GW

I don't really disagree with the main claim here, but I'll steelman the opposite claim for a moment. Why call the problem of the criterion open?

To my knowledge (and please tell me if I'm wrong here), there is no widely accepted mathematical framework for the problem of the criterion in which the problem has been proved unsolvable. In that regard it is not analogous to e.g. Gödel's theorems. This is important: if some formal version of the problem of the criterion comes up when I'm working on a theorem about agency, or trying to design an AI architecture with some property, then I want the formal argument, not just a natural-language argument that my problem is intractable. Such natural-language arguments are not particularly reliable; they tend to sneak in a bunch of hidden premises, and a mathematical version of the problem which shows up in practice can violate those hidden premises.

For example: for most of the 20th century, it was basically-universally accepted that no statistical analysis of correlation could reliably establish causation. Post-Judea-Pearl, this is clearly wrong. The formal arguments that correlation cannot establish causation had loopholes in them - most importantly, they were only about two variables, and largely fell apart with three or more variables. If I were working on some theorem about AI or agency, and wanted to show something about an agent's ability to deduce causation from observation of a large number of variables, I might have noticed my inability to prove the theorem I wanted. At the very least, I would have noticed the lack of a robust mathematical framework for talking about what causality even is, and likely would have needed to develop one. (Indeed, this is basically what Pearl and others did.) But the natural language arguments glossed over such subtleties; it wasn't until people actually started developing the mathematical framework for talking about causality that we noticed correlative data could be sufficient to deduce it.
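A small illustration of why a third variable changes things, offered as a toy and not as Pearl's actual machinery: the two generating processes and all probabilities below are hypothetical. In a collider (X → Z ← Y), X and Y are uncorrelated until we condition on Z; in a chain (X → Z → Y), they are correlated until we condition on Z. The two structures therefore leave different fingerprints in purely observational data.

```python
import random

random.seed(0)
N = 50_000

def flip(p):
    return 1 if random.random() < p else 0

def corr(xs, ys):
    """Pearson correlation of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    vx = sum((x - mx) ** 2 for x in xs) / n
    vy = sum((y - my) ** 2 for y in ys) / n
    return cov / (vx * vy) ** 0.5

def collider():
    # X and Y are independent causes of Z.
    x, y = flip(0.5), flip(0.5)
    z = flip(0.9) if (x or y) else flip(0.1)
    return x, y, z

def chain():
    # X causes Z, which causes Y.
    x = flip(0.5)
    z = flip(0.9) if x else flip(0.1)
    y = flip(0.9) if z else flip(0.1)
    return x, y, z

def summarize(generate):
    data = [generate() for _ in range(N)]
    marginal = corr([d[0] for d in data], [d[1] for d in data])
    given_z = [d for d in data if d[2] == 1]
    conditional = corr([d[0] for d in given_z], [d[1] for d in given_z])
    return marginal, conditional

for name, generate in (("collider", collider), ("chain", chain)):
    marginal, conditional = summarize(generate)
    print(f"{name:8s} corr(X,Y)={marginal:+.2f}  corr(X,Y | Z=1)={conditional:+.2f}")
```

With only X and Y observed, no such asymmetry is available, which is the loophole in the two-variable arguments.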

By contrast, I find it hard to imagine something like that being overlooked in the case of Gödel's theorems. There, we do have a mathematical framework, and we know what kinds-of-things allow loopholes, and roughly how big those loopholes can be.

I don't see any framework for the problem of the criterion which would make me confident that we won't have a repeat of "correlation doesn't imply causation", the way Gödel's theorems give me such confidence. Again, this may just be my ignorance in not having read up on the topic much; please correct me if so.

Comment by johnswentworth on How To Get Into Independent Research On Alignment/Agency · 2021-12-31T16:36:20.180Z · LW · GW

Strong agree. A lot of the technical material which I think is relevant is typically not taught until the grad level, but that does not mean that actually finishing a PhD program is useful. Indeed, I sometimes joke that dropping out of a PhD program is one of the most widely-recognized credentials by people currently in the field - you get the general technical background skills, and also send a very strong signal of personal agency.

Comment by johnswentworth on What Selection Theorems Do We Expect/Want? · 2021-12-28T22:50:31.483Z · LW · GW

Yeah definitely.

Comment by johnswentworth on The Solomonoff Prior is Malign · 2021-12-28T19:07:09.279Z · LW · GW

I like the feedback framing, it seems to get closer to the heart-of-the-thing than my explanation did. It makes the role of the pointers problem and latent variables more clear, which in turn makes the role of outer alignment more clear. When writing my review, I kept thinking that it seemed like reflection and embeddedness and outer alignment all needed to be figured out to deal with this kind of malign inner agent, but I didn't have a good explanation for the outer alignment part, so I focused mainly on reflection and embeddedness.

That said, I think the right frame here involves "feedback" in a more general sense than I think you're imagining it. In particular, I don't think catastrophes are very relevant.

The role of "feedback" here is mainly informational; it's about the ability to tell which decision is correct. The thing-we-want from the "feedback" is something like the large-data guarantee from SI: we want to be able to train the system on a bunch of data before asking it for any output, and we want that training to wipe out the influence of any malign agents in the hypothesis space. If there's some class of decisions where we can't tell which decision is correct, and a malign inner agent can recognize that class, then presumably we can't create the training data we need.

With that picture in mind, the ability to give feedback "online" isn't particularly relevant, and therefore catastrophes are not particularly central. We only need "feedback" in the sense that we can tell which decision we want, in any class of problems which any agent in the hypothesis space can recognize, in order to create a suitable dataset.

Comment by johnswentworth on Alignment By Default · 2021-12-28T03:24:32.979Z · LW · GW

Some time in the next few weeks I plan to write a review of The Solomonoff Prior Is Malign which will talk about one such argument.

It's up.

Comment by johnswentworth on The Solomonoff Prior is Malign · 2021-12-28T02:45:04.587Z · LW · GW

This post is an excellent distillation of a cluster of past work on malignness of Solomonoff Induction, which has become a foundational argument/model for inner agency and malign models more generally.

I've long thought that the malignness argument overlooks some major counterarguments, but I never got around to writing them up. Now that this post is up for the 2020 review, seems like a good time to walk through them.

In Solomonoff Model, Sufficiently Large Data Rules Out Malignness

There is a major outside-view reason to expect that the Solomonoff-is-malign argument must be doing something fishy: Solomonoff Induction (SI) comes with performance guarantees. In the limit of large data, SI performs as well as the best-predicting program, in every computably-generated world. The post mentions that:

A simple application of the no free lunch theorem shows that there is no way of making predictions that is better than the Solomonoff prior across all possible distributions over all possible strings. Thus, agents that are influencing the Solomonoff prior cannot be good at predicting, and thus gain influence, in all possible worlds.

... but in the large-data limit, SI's guarantees are stronger than just that. In the large-data limit, there is no computable way of making better predictions than the Solomonoff prior in any world. Thus, agents that are influencing the Solomonoff prior cannot gain long-term influence in any computable world; they have zero degrees of freedom to use for influence. It does not matter if they specialize in influencing worlds in which they have short strings; they still cannot use any degrees of freedom for influence without losing all their influence in the large-data limit.

Takeaway of this argument: as long as we throw enough data at our Solomonoff inductor before asking it for any outputs, the malign agent problem must go away. (Though note that we never know exactly how much data that is; all we have is a big-O argument with an uncomputable constant.)
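A toy sketch of the large-data intuition, with heavy caveats: actual Solomonoff induction is uncomputable, so this uses an ordinary Bayesian mixture over three hand-written hypotheses as a stand-in. The world (Bernoulli bits), the "malign" hypothesis (accurate except that it spends every tenth prediction on its own agenda), and the prior weights are all hypothetical. The point is only that a hypothesis which uses prediction steps for anything other than predicting well pays for it in posterior weight, even when the prior heavily favors it.

```python
import math
import random

random.seed(0)
TRUE_P = 0.7  # hypothetical world: each bit is 1 with probability 0.7

def honest(t):
    return TRUE_P

def malign(t):
    # Predicts accurately, except it "spends" every 10th step pushing its own output.
    return 0.05 if t % 10 == 0 else TRUE_P

def uniform(t):
    return 0.5

hypotheses = {"honest": honest, "malign": malign, "uniform": uniform}
# Prior deliberately stacked in favor of the malign hypothesis.
log_w = {"honest": math.log(0.01), "malign": math.log(0.98), "uniform": math.log(0.01)}

def posterior(log_weights):
    """Normalize log-weights into probabilities, subtracting the max for stability."""
    top = max(log_weights.values())
    unnorm = {k: math.exp(v - top) for k, v in log_weights.items()}
    total = sum(unnorm.values())
    return {k: v / total for k, v in unnorm.items()}

history = {}
for t in range(1, 2001):
    bit = flip = 1 if random.random() < TRUE_P else 0
    for name, predict in hypotheses.items():
        p_one = predict(t)
        log_w[name] += math.log(p_one if bit == 1 else 1.0 - p_one)
    if t in (10, 100, 1000, 2000):
        history[t] = posterior(log_w)
        print(t, {k: f"{v:.3g}" for k, v in history[t].items()})
```

As in the comment's caveat, the toy says nothing about how much data is "enough" in the real setting; here the answer is visible only because the hypothesis class is tiny and hand-picked.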

... but then how the hell does this outside-view argument jibe with all the inside-view arguments about malign agents in the prior?

Reflection Breaks The Large-Data Guarantees

There's an important gotcha in those guarantees: in the limit of large data, SI performs as well as the best-predicting program, in every computably-generated world. SI itself is not computable, therefore the guarantees do not apply to worlds which contain more than a single instance of Solomonoff induction, or worlds whose behavior depends on the Solomonoff inductor's outputs.

One example of this is AIXI (basically a Solomonoff inductor hooked up to a reward learning system): because AIXI's future data stream depends on its own present actions, the SI guarantees break down; takeover by a malign agent in the prior is no longer blocked by the SI guarantees.

Predict-O-Matic is a similar example: that story depends on the potential for self-fulfilling prophecies, which requires that the world's behavior depend on the predictor's output.

We could also break the large-data guarantees by making a copy of the Solomonoff inductor, using the copy to predict what the original will predict, and then choosing outcomes so that the original inductor's guesses are all wrong. Then any random program will outperform the inductor's predictions. But again, this environment itself contains a Solomonoff inductor, so it's not computable; it's no surprise that the guarantees break.
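The copy-and-invert construction can be sketched with any deterministic predictor in place of the Solomonoff inductor. The majority-vote predictor below is purely a hypothetical stand-in (the real construction needs SI itself, which is why the resulting world is not computable); the point is only that a world which runs a copy of the predictor can make every guess wrong.

```python
def majority_predictor(history):
    """Stand-in predictor: guess the majority bit seen so far (ties -> 1)."""
    ones = sum(history)
    return 1 if 2 * ones >= len(history) else 0

def adversarial_world(predictor, steps):
    """World that runs a copy of the predictor and emits the opposite bit."""
    history, guesses = [], []
    for _ in range(steps):
        guess = predictor(history)
        guesses.append(guess)
        history.append(1 - guess)  # choose the outcome to make the guess wrong
    return history, guesses

def accuracy(guesses, outcomes):
    return sum(g == o for g, o in zip(guesses, outcomes)) / len(outcomes)

outcomes, guesses = adversarial_world(majority_predictor, 100)
predictor_accuracy = accuracy(guesses, outcomes)
# A trivial "always guess 1" program, for comparison.
constant_accuracy = accuracy([1] * len(outcomes), outcomes)
```

Here the predictor is wrong on every step by construction, while the trivial constant program is right half the time.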

(Interesting technical side question: this sort of reflection issue is exactly the sort of thing Logical Inductors were made for. Does the large-data guarantee of SI generalize to Logical Inductors in a way which handles reflection better? I do not know the answer.)

If Reflection Breaks The Guarantees, Then Why Does This Matter?

The real world does in fact contain lots of agents, and real-world agents' predictions do in fact influence the world's behavior. So presumably (allowing for uncertainty about this handwavy argument) the malignness of the Solomonoff prior should carry over to realistic use-cases, right? So why does this tangent matter in the first place?

Well, it matters because we're left with an importantly different picture: malignness is not a property of SI itself, so much as a property of SI in specific environments. Merely having malign agents in the hypothesis space is not enough for the malign agents to take over in general; the large-data guarantees show that much. We need specific external conditions - like feedback loops or other agents - in order for malignness to kick in. Colloquially speaking, it is not strictly an "inner" problem; it is a problem which depends heavily on the "outer" conditions.

If we think of malignness of SI just in terms of malign inner agents taking over, as in the post, then the problem seems largely decoupled from the specifics of the objective (i.e. accurate prediction) and environment. If that were the case, then malign inner agents would be a very neatly-defined subproblem of alignment - a problem which we could work on without needing to worry about alignment of the outer objective or reflection or embeddedness in the environment. But unfortunately the problem does not cleanly factor like that; the large-data guarantees and their breakdown show that malignness of SI is very tightly coupled to outer alignment and reflection and embeddedness and all that.

Now for one stronger claim. We don't need malign inner agent arguments to conclude that SI handles reflection and embeddedness poorly; we already knew that. Reflection and embedded world-models are already problems in need of solving, for many different reasons. The fact that malign agents in the hypothesis space are relevant for SI only in the cases where we already knew SI breaks suggests that, once we have better ways of handling reflection and embeddedness in general, the malign inner agents problem will go away on its own. This kind of malign inner agent is not a subproblem which we need to worry about in its own right. Indeed, I expect this is probably the case: once we have good ways of handling reflection and embeddedness in general, the problem of malign agents in the hypothesis space will go away on its own. (Infra-Bayesianism might be a case in point, though I haven't studied it enough myself to be confident in that.)

Comment by johnswentworth on When Money Is Abundant, Knowledge Is The Real Wealth · 2021-12-28T01:11:20.468Z · LW · GW

Overall coming back to this I'm realizing that I don't actually have any way to act on this piece. Even though I am in the intended audience, and I have been making a specific effort in my life to treat money as cheap and plentiful, I am not seeing:

  • Advice on which subjects are likely to pay dividends, or why
  • Advice on how to recover larger amounts of time or effort by spending money more efficiently
  • Discussion of when those tradeoffs would be useful

This seems especially silly not to have given, for example, Zvi's Covid posts, which are a pretty clear modern day example of the Louis XV smallpox problem.

Sounds like you want roughly the sequence Inadequate Equilibria.

Comment by johnswentworth on What Selection Theorems Do We Expect/Want? · 2021-12-28T01:07:53.961Z · LW · GW

... it is embarrassingly plausible that I made a sign error and that whole argument is exactly wrong.

The picture in my head is "broad basin => circular-ish peak => large determinant" (since long, narrow peaks have low volume and low determinant). But maybe the diagonals were exactly the wrong things to keep fixed in order to make that argument work.

Comment by johnswentworth on What Do GDP Growth Curves Really Mean? · 2021-12-28T01:02:38.699Z · LW · GW

That would indeed be the right way to estimate total surplus. The problem is that total surplus is not obviously the right metric to worry about. For a use case like forecasting AI, for instance, it's not particularly central.

Comment by johnswentworth on Worst-case thinking in AI alignment · 2021-12-23T03:47:40.256Z · LW · GW

A few more reasons...

First: why do software engineers use worst-case reasoning?

  • A joking answer would be "the users are adversaries". For most software this isn't literally true; the users don't want to break the software. But users are optimizing for things, and optimization in general tends to find corner cases. (In linear programming, for instance, almost all objectives will be maximized at a literal corner of the set allowed by the constraints.) This is sort of like "being optimized against", but it emphasizes that the optimizer need not be "adversarial" in the intuitive sense of the word in order to have that effect.
  • Users do a lot of different things, and "corner cases" tend to come up a lot more often than a naive analysis might think. If a user is weird in one way, they're more likely to be weird in another way too. This is sort of like "the space contains a high proportion of bad things", but with more emphasis on the points in the space being weighted in ways which favor weirdness more than a naive analysis would suggest.
  • Software engineers often want to provide simple, predictable APIs. Error cases (especially unexpected error cases) make APIs more complex.
  • In software, we tend to have a whole tech stack. Even if each component of the stack fails only rarely, overall failure can still be extremely common if there are enough pieces, any one of which can break the whole thing. (I worked at a mortgage startup where this was a big problem - we used a dozen external APIs which were each fine 95+% of the time, but that still meant our app was down very frequently overall.) So, we need each individual component to be very highly reliable.
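The tech-stack point is easy to check with back-of-the-envelope arithmetic. This sketch assumes failures are independent and takes 95% as the per-API uptime (the comment only says "95+%", so this is the pessimistic end); the 99% target in the second calculation is a made-up illustrative number.

```python
def all_up_probability(per_component_uptime, n_components):
    """Probability that every component works, assuming independent failures."""
    return per_component_uptime ** n_components

def required_component_uptime(target_overall_uptime, n_components):
    """Per-component uptime needed to hit an overall target, same assumption."""
    return target_overall_uptime ** (1.0 / n_components)

# A dozen external APIs, each up 95% of the time.
p_all_up = all_up_probability(0.95, 12)

# Hypothetical target: the whole app up 99% of the time.
needed = required_component_uptime(0.99, 12)
```

Under these assumptions the whole stack is up only a bit more than half the time, and hitting 99% overall would need each of the twelve components to be above 99.9%.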

And one more, generated by thinking about some of my own use-cases:

  • Unknown unknowns. Worst-case reasoning forces people to consider all the possible failure modes, and rule out any unknown unknowns.

These all carry over to alignment pretty straightforwardly.

Comment by johnswentworth on What’s Up With the CDC Nowcast? · 2021-12-22T23:35:25.566Z · LW · GW

Where the hell are all the cases?

Just qualitatively eyeballing, the Boston wastewater data does look roughly like what I'd expect to see in a world where Omicron took over last week. And I consider that the single best US data source available - it is immune to almost all of the lag and selection effects which impact most other sources. It is the closest thing we have to a proper ground truth.

Omicron taking over late last week/early this week, at least in major urban centers, also matches what we've seen in London, and a priori I'd expect pretty similar timing here. I wouldn't expect that we had substantially less international travel, or substantially slower spread; if anything, I'd expect things here to be a little faster a priori.

So, I agree the CDC's data is not particularly informative on the currently-relevant timescale, but it seems pretty plausible to me that it's off in the "Omicron cases were way underestimated, and total cases now are way underestimated" direction rather than the "Omicron cases now are way overestimated" direction.

Comment by johnswentworth on Alignment By Default · 2021-12-21T16:48:17.140Z · LW · GW

Next, John suggests that “human values” may be such a “natural abstraction”, such that “human values” may wind up a “prominent” member of an AI's latent space, so to speak.

I'm fairly confident that the inputs to human values are natural abstractions - i.e. the "things we care about" are things like trees, cars, other humans, etc, not low-level quantum fields or "head or thumb but not any other body part". (The "head or thumb" thing is a great example, by the way). I'm much less confident that human values themselves are a natural abstraction, for exactly the same reasons you gave.

Comment by johnswentworth on Alignment By Default · 2021-12-20T00:01:22.899Z · LW · GW

That's fair, but it's still perfectly in line with the learning-theoretic perspective: human values are simpler to express through the features acquired by unsupervised learning than through the raw data, which translates to a reduction in sample complexity.

Yup, that's right. I still agree with your general understanding, just wanted to clarify the subtlety.

If you do IRL with the correct type signature for human values then in the best case you get the true human values. IRL is not mutually exclusive with your approach: e.g. you can do unsupervised learning and IRL with shared weights.

Yup, I agree with all that. I was specifically talking about IRL approaches which try to learn a utility function, not the more general possibility space.

Malign simulation hypotheses already look like "Dr. Nefarious" where the role of Dr. Nefarious is played by the masters of the simulation, so I'm not sure what exactly is the distinction you're drawing here.

The distinction there is about whether or not there's an actual agent in the external environment which coordinates acausally with the malign inner agent, or some structure in the environment which allows for self-fulfilling prophecies, or something along those lines. The point is that there has to be some structure in the external environment which allows a malign inner agent to gain influence over time by making accurate predictions. Otherwise, the inner agent will only have whatever limited influence it has from the prior, and every time it deviates from its actual best predictions (or is just out-predicted by some other model), some of that influence will be irreversibly spent; it will end up with zero influence in the long run.

Comment by johnswentworth on The Plan · 2021-12-19T21:40:54.097Z · LW · GW

If we're just optimizing some function, then indeed breadth is the only relevant part. But for something like evolution or SGD, we're optimizing over random samples, and it's the use of many different random samples which I'd expect to select for robustness.

Comment by johnswentworth on Alignment By Default · 2021-12-19T19:00:46.628Z · LW · GW

One subtlety which approximately 100% of people I've talked to about this post apparently missed: I am pretty confident that the inputs to human values are natural abstractions, i.e. we care about things like trees, cars, humans, etc, not about quantum fields or random subsets of atoms. I am much less confident that "human values" themselves are natural abstractions; values vary a lot more across cultures than e.g. agreement on "trees" as a natural category.

Relatedly, the author is too optimistic (IMO) in his comparison of this technique to alternatives: ...

In the particular section you quoted, I'm explicitly comparing the best-case of alignment by default to the other two strategies, assuming that the other two work out about-as-well as they could realistically be expected to work. For instance, learning a human utility function is usually a built-in assumption of IRL formulations, so such formulations can't do any better than a utility function approximation even in the best case. Alignment by default does not need to assume humans have a utility function; it just needs whatever-humans-do-have to have low marginal complexity in a system which has learned lots of natural abstractions.

Obviously alignment by default has analogous assumptions/flaws; much of the OP is spent discussing them. The particular section you quote was just talking about the best-case where those assumptions work out well.

The potential for malign hypotheses (learning of hypotheses / models containing malign subagents) exists in any learning system, and in particular malign simulation hypotheses are a serious concern. ...

I partially agree with this, though I do think there are good arguments that malign simulation issues will not be a big deal (or to the extent that they are, they'll look more like Dr. Nefarious than pure inner daemons), and by historical accident those arguments have not been circulated in this community to nearly the same extent as the arguments that malign simulations will be a big deal. Some time in the next few weeks I plan to write a review of The Solomonoff Prior Is Malign which will talk about one such argument.

Comment by johnswentworth on COVID and the holidays · 2021-12-19T18:10:25.620Z · LW · GW

Omicron will make up at least 1% of cases in the US by Dec 31. Which means it could make up substantially more than that. However, in mid-December when you’re traveling and going to solstice, it probably won’t be that high—and even if it’s 5 or 10% at that point, that’s not going to have a major effect on the state of COVID. 

Not sure when this post was written, but I think this is an extreme underestimate at this point. For instance, my own current median guess for Omicron overtaking Delta in the Bay Area specifically is early this coming week. This is based on eyeballing doubling rate estimates in Zvi's posts, and guessing how our initial conditions could plausibly compare to London or Denmark. (This wasn't a careful calculation, but the exact initial conditions don't actually matter that much because the doubling rate is so fast, so a factor-of-two in initial conditions only changes the takeover date by 2-3 days.)
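The insensitivity to initial conditions falls out of simple exponential-growth arithmetic. This is a minimal sketch, assuming the Omicron-to-Delta ratio doubles at a constant rate; the initial ratios and the 2.5-day doubling time are placeholder numbers for illustration, not estimates from any data.

```python
import math

def takeover_day(initial_ratio, doubling_days):
    """Days until the variant ratio reaches 1, if it doubles every doubling_days."""
    return doubling_days * math.log2(1.0 / initial_ratio)

# Two hypothetical starting points that differ by a factor of two.
late = takeover_day(0.01, 2.5)
early = takeover_day(0.02, 2.5)
shift = late - early
```

Under this model a factor-of-two error in the starting ratio moves the takeover date by exactly one doubling time, whatever the starting ratio is.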

I expect that using the Polymarket predictions as a proxy here will severely underestimate Omicron's timeline for two reasons. First, those predictions are for the whole US; we'd expect it to hit first and fastest in major urban centers with a lot of international travel (as we saw in e.g. the UK). Second, those predictions are about what the CDC data says on Jan 4, which means there's a ton of lag built in - both from the usual lag on data, and from the holidays slowing things down.

Even given all that, Polymarket currently gives an 83% chance that Omicron will be >50% in the US as a whole on Jan 1, based on the CDC's data of Jan 4.

Comment by johnswentworth on Where can one learn deep intuitions about information theory? · 2021-12-18T01:02:35.684Z · LW · GW

Even a good intuitive explanation of thermodynamics as seen through the lens of information theory would be helpful.

I have a post which will probably help with this in particular.

Comment by johnswentworth on What Selection Theorems Do We Expect/Want? · 2021-12-17T17:13:36.049Z · LW · GW

... what you were saying in the quoted text is that you'll often see an economist, etc., use coherence theorems informally to justify a particular utility maximization model for some system, with particular priors and conditionals. (As opposed to using coherence theorems to justify the idea of EU models generally, which is what I'd thought you meant.)


This is a problem not because I want the choices fully justified, but rather because with many real world systems it's not clear exactly how I should set up my agent model. For instance, what's the world model and utility function of an e-coli? Some choices would make the model tautological/trivial; I want my claim that e.g. an e-coli approximates a Bayesian expected utility maximizer to have nontrivial and correct implications. I want to know the sense-in-which an e-coli approximates a Bayesian expected utility maximizer, and a rock doesn't. The coherence theorems tell us how to do that. They provide nontrivial sufficient conditions (like e.g. pareto optimality) which imply (and are implied by) particular utilities and world models.

To try to give an example of this: suppose I wanted to use coherence / consistency conditions alone to assign priors over the outcomes of a VNM lottery. ...

Is this a correct interpretation?

Your example is correct, though it is not the usual way of obtaining probabilities from coherence conditions. (Well, ok, in actual practice it kinda is the usual way, because existing coherence theorems are pretty weak. But it's not the usual way used by people who talk about coherence theorems a lot.) A more typical example: I can look at a chain of options on a stock, and use the prices of those options to back out market-implied probabilities for each possible stock price at expiry. Many coherence theorems do basically the same thing, but "prices" are derived from the trade-offs an agent accepts, rather than from a market.
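A minimal sketch of the options example, under strong simplifying assumptions: zero interest rate, evenly spaced strikes, and a stock price which can only land on the strike grid at expiry. In that setting a butterfly spread around strike K pays out only if the stock lands on K, so its price divided by the strike spacing is the implied probability. The distribution and prices below are synthetic, generated only so the recovery can be checked.

```python
def call_price(strike, outcomes):
    """Undiscounted call price: expected payoff max(S - K, 0)."""
    return sum(p * max(s - strike, 0.0) for s, p in outcomes.items())

def implied_probabilities(strikes, call_prices, step):
    """Back out P(S = K) at each interior strike from butterfly spreads."""
    probs = {}
    for i in range(1, len(strikes) - 1):
        butterfly = call_prices[i - 1] - 2 * call_prices[i] + call_prices[i + 1]
        probs[strikes[i]] = butterfly / step
    return probs

# Synthetic "market": a made-up distribution used only to generate prices.
true_dist = {90.0: 0.2, 100.0: 0.5, 110.0: 0.3}
step = 10.0
strikes = [80.0, 90.0, 100.0, 110.0, 120.0]
prices = [call_price(k, true_dist) for k in strikes]

recovered = implied_probabilities(strikes, prices, step)
```

The function `implied_probabilities` only ever sees prices, never `true_dist`; the analogy to coherence theorems is that the "prices" would instead come from the trade-offs an agent accepts.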

Comment by johnswentworth on The Pointers Problem: Human Values Are A Function Of Humans' Latent Variables · 2021-12-16T22:08:16.861Z · LW · GW

Why This Post Is Interesting

This post takes a previously-very-conceptually-difficult alignment problem, and shows that we can model this problem in a straightforward and fairly general way, just using good ol' Bayesian utility maximizers. The formalization makes the Pointers Problem mathematically legible: it's clear what the problem is, it's clear why the problem is important and hard for alignment, and that clarity is not just conceptual but mathematically precise.

Unfortunately, mathematical legibility is not the same as accessibility; the post does have a wide inductive gap.

Warning: Inductive Gap

This post builds on top of two important pieces for modelling embedded agents which don't have their own posts (to my knowledge). The pieces are:

  • Lazy world models
  • Lazy utility functions (or value functions more generally)

In hindsight, I probably should have written up separate posts on them; they seem obvious once they click, but they were definitely not obvious beforehand.

Lazy World Models

One of the core conceptual difficulties of embedded agency is that agents need to reason about worlds which are bigger than themselves. They're embedded in the world, therefore the world must be as big as the entire agent plus whatever environment the world includes outside of the agent. If the agent has a model of the world, the physical memory storing that model must itself fit inside of the world. The data structure containing the world model must represent a world larger than the storage space the data structure takes up.

That sounds tricky at first, but if you've done some functional programming before, then data structures like this are actually pretty run-of-the-mill. For instance, we can easily make infinite lists which take up finite memory. The trick is to write a generator for the list, and then evaluate it lazily - i.e. only query for list elements which we actually need, and never actually iterate over the whole thing.
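For concreteness, here is roughly what that looks like in Python. The `squares` generator is just an illustrative choice of infinite list.

```python
from itertools import count, islice

def squares():
    """Generator for the infinite list 0, 1, 4, 9, ..."""
    for n in count():
        yield n * n

# Only the elements we ask for are ever computed.
first_five = list(islice(squares(), 5))
element_at_index_10 = next(islice(squares(), 10, None))
```

The generator itself is a few lines of code in finite memory; the infinite list exists only as something we can query.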

In the same way, we can represent a large world (potentially even an infinite world) using a smaller amount of memory. We specify the model via a generator, and then evaluate queries against the model lazily. If we're thinking in terms of probabilistic models, then our generator could be e.g. a function in a probabilistic programming language, or (equivalently but through a more mathematical lens) a probabilistic causal model leveraging recursion. The generator compactly specifies a model containing many random variables (potentially even infinitely many), but we never actually run inference on the full infinite set of variables. Instead, we use lazy algorithms which only reason about the variables necessary for particular queries.
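A minimal sketch of a lazily-evaluated world model, assuming a very simple "world": an infinite chain of binary variables where each one is a noisy copy of the previous one. The class name, the flip probability, and the use of forward sampling (rather than real inference) are all simplifications chosen for illustration.

```python
import random

class LazyChain:
    """Infinite chain X_0, X_1, ... specified by a recursive generator."""

    def __init__(self, seed=0, flip_prob=0.1):
        self._rng = random.Random(seed)
        self._flip_prob = flip_prob
        self._values = {}  # only variables touched by some query live here

    def value(self, n):
        """Sample X_n, instantiating only its ancestors X_0..X_{n-1}."""
        if n not in self._values:
            if n == 0:
                self._values[0] = self._rng.random() < 0.5
            else:
                parent = self.value(n - 1)
                flip = self._rng.random() < self._flip_prob
                self._values[n] = parent != flip
        return self._values[n]

    def instantiated(self):
        """Number of variables actually unpacked so far."""
        return len(self._values)

world = LazyChain(seed=0)
x5 = world.value(5)
```

After the query for X_5, only six variables exist in memory, even though the model nominally contains infinitely many. (The recursion here would hit Python's recursion limit for very distant variables; an iterative version would avoid that.)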

Once we know to look for it, it's clear that humans use some kind of lazy world models in our own reasoning. We never directly estimate the state of the entire world. Rather, when we have a question, we think about whatever "variables" are relevant to that question. We perform inference using whatever "generator" we already have stored in our heads, and we avoid recursively unpacking any variables which aren't relevant to the question at hand.

Lazy Utility/Values

Building on the notion of lazy world models: it's not very helpful to have a lazy world model if we need to evaluate the whole data structure in order to make a decision. Fortunately, even if our utility/values depend on lots of things, we don't actually need to evaluate utility/values in order to make a decision. We just need to compare the utility/value across different possible choices.

In practice, most decisions we make don't impact most of the world in significant predictable ways. (More precisely: the impact of most of our decisions on most of the world is wiped out by noise.) So, rather than fully estimating utility/value we just calculate how each choice changes total utility/value, based only on the variables significantly and predictably influenced by the decision.

A simple example (from here): if we have a utility function $u(X) = \sum_i f_i(X_i)$, and we're making a decision which only affects $X_k$, then we don't need to estimate the sum at all; we only need to estimate $f_k(X_k)$ for each option.
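A small sketch of that example, assuming the utility is a sum of per-variable terms and the decision touches only one of them. The particular term `f_k`, the option values, and the stand-in total for all the other terms are hypothetical.

```python
def best_option_lazy(options, f_k):
    """Compare options using only the one term the decision affects."""
    return max(options, key=f_k)

def best_option_full(options, f_k, other_terms_total):
    """Compare options using the whole sum (what we'd like to avoid computing)."""
    return max(options, key=lambda x: f_k(x) + other_terms_total)

def f_k(x):
    # Hypothetical utility term for the affected variable; peaks at x = 2.
    return -(x - 2) ** 2

options = [0, 1, 2, 3]
other_terms_total = 1234.5  # stand-in for everything the decision doesn't touch

lazy_choice = best_option_lazy(options, f_k)
full_choice = best_option_full(options, f_k, other_terms_total)
```

Since the untouched terms add the same constant to every option, both functions pick the same option; the lazy version just never needs to know what that constant is.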

Again, once we know to look for it, it's clear that humans do something like this. Most of my actions do not affect a random person in Mumbai (and to the extent there is an effect, it's drowned out by noise). Even though I value the happiness of that random person in Mumbai, I never need to think about them, because my actions don't significantly impact them in any way I can predict. I never actually try to estimate "how good the whole world is" according to my own values.

Where This Post Came From

In the second half of 2020, I was thinking about existing real-world analogues/instances of various parts of the AI alignment problem and embedded agency, in hopes of finding a case where someone already had a useful frame or even solution which could be translated over to AI. "Theory of the firm" (a subfield of economics) was one promising area. From wikipedia:

In simplified terms, the theory of the firm aims to answer these questions:

  1. Existence. Why do firms emerge? Why are not all transactions in the economy mediated over the market?
  2. Boundaries. Why is the boundary between firms and the market located exactly there with relation to size and output variety? Which transactions are performed internally and which are negotiated on the market?
  3. Organization. Why are firms structured in such a specific way, for example as to hierarchy or decentralization? What is the interplay of formal and informal relationships?
  4. Heterogeneity of firm actions/performances. What drives different actions and performances of firms?
  5. Evidence. What tests are there for respective theories of the firm?

To the extent that we can think of companies as embedded agents, these mirror a lot of the general questions of embedded agency. Also, alignment of incentives is a major focus in the literature on the topic.

Most of the existing literature I read was not very useful in its own right. But I generally tried to abstract out the most central ideas and bottlenecks, and generalize them enough to apply to more general problems. The most important insight to come out of this process was: sometimes we cannot tell what happened, even in hindsight. This is a major problem for incentives: for instance, if we can't tell even in hindsight who made a mistake, then we don't know where to assign credit/blame. (This idea became the post When Hindsight Isn't 20/20: Incentive Design With Imperfect Credit Allocation.)

Similarly, this is a major problem for bets: we can't bet on something if we cannot tell what the outcome was, even in hindsight.

Following that thread further: sometimes we cannot tell how good an outcome was, even in hindsight. For instance, we could imagine paying someone to etch our names on a plaque on a spacecraft and then launch it on a trajectory out of the solar system. In this case, we would presumably care a lot that our names were actually etched on the plaque; we would be quite unhappy if it turned out that our names were left off. Yet if someone took off the plaque at the last minute, or left our names off of it, we might never find out. In other words, we might not ever know, even in hindsight, whether our values were actually satisfied.

There's a sense in which this is obvious mathematically from Bayesian expected utility maximization. The "expected" part of "expected utility" sure does suggest that we don't know the actual utility. Usually we think of utility as something we will know later, but really there's no reason to assume that. The math does not say we need to be able to figure out utility in hindsight. The inputs to utility are random variables in our world model, and we may not ever know the values of those random variables.

Once I started actually paying attention to the idea that the inputs to the utility function are random variables in the agent's world model, and that we may never know the values of those variables, the next step followed naturally. Of course those variables may not correspond to anything observable in the physical world, even in principle. Of course they could be latent variables. Then the connection to the Pointers Problem became clear.

Comment by johnswentworth on What Selection Theorems Do We Expect/Want? · 2021-12-15T23:46:31.180Z · LW · GW

The problem with VNM-style lotteries is that the probabilities involved have to come from somewhere besides the coherence theorems themselves. We need to have some other, external reason to think it's useful to model the environment using these probabilities. That also means that the "probabilities" associated with the lottery are not necessarily the agent's probabilities, at least not in the sense that the implied probabilities derived from coherence theorems are the agent's.

Comment by johnswentworth on Exercises in Comprehensive Information Gathering · 2021-12-15T23:42:43.347Z · LW · GW

This sounds right.

Comment by johnswentworth on What Do GDP Growth Curves Really Mean? · 2021-12-15T23:34:25.203Z · LW · GW

So, there's this general problem in economics where economists want to talk about what we "should" do in policy debates, and that justifies quantifying things in terms of e.g. social surplus (or whatever), on the basis that we want policies to increase social surplus (or whatever).

The problem with this is that such metrics are not chosen for robust generalization to many different use-cases, so unsurprisingly they don't generalize very well to other use-cases. For instance, if we want to make predictions about the probable trajectory of AI based on the smoothness of some metric of economic impact of technologies, social surplus does not seem like a particularly great metric for that purpose.

Comment by johnswentworth on The Plan · 2021-12-15T18:09:04.095Z · LW · GW

To the extent that for all Y so far we've found an X, I'm pretty confident that my dream-team H would find X-or-better given a couple of weeks and access to their HCH.

It sounds like roughly this is cruxy.

We're trying to decide how reliable <some scheme> is at figuring out the right questions to ask in general, and not letting things slip between the cracks in general, and not overlooking unknown unknowns in general, and so forth. Simply observing <the scheme> in action does not give us a useful feedback signal on these questions, unless we already know the answers to the questions. If <the scheme> is not asking the right questions, and we don't know what the right questions are, then we can't tell it's not asking the right questions. If <the scheme> is letting things slip between the cracks, and we don't know which things to check for crack-slippage, then we can't tell it's letting things slip between the cracks. If <the scheme> is overlooking unknown unknowns, and we don't already know what the unknown unknowns are, then we can't tell it's overlooking unknown unknowns.

So: if the dream team cannot figure out beforehand all the things it needs to do to get HCH to avoid these sorts of problems, we should not expect them to figure it out with access to HCH either. Access to HCH does not provide an informative feedback signal unless we already know the answers. The cognitive labor cannot be delegated.

(Interesting side-point: we can make exactly the same argument as above about our own reasoning processes. In that case, unfortunately, we simply can't do any better; our own reasoning processes are the final line of defense. That's why a Simulated Long Reflection is special, among these sorts of buck-passing schemes: it is the one scheme which does as well as we would do anyway. As soon as we start to diverge from Simulated Long Reflection, we need to ask whether the divergence will make the scheme more likely to ask the wrong questions, let things slip between cracks, overlook unknown unknowns, etc. In general, we cannot answer this kind of question by observing the scheme itself in operation.)

For complex questions I don't think you'd have the top-level H immediately divide the question itself: you'd want to avoid this single-point-of-failure.

(This is less cruxy, but it's a pretty typical/central example of the problems with this whole way of thinking.) By the time the question/problem has been expressed in English, the English expression is already a proxy for the real question/problem.

One of the central skills involved in conceptual research (of the sort I do) is to not accidentally optimize for something we wrote down in English, rather than the concept which that English is trying to express. It's all too easy to think that e.g. we need a nice formalization of "knowledge" or "goal directedness" or "abstraction" or what have you, and then come up with some formalization of the English phrase which does not quite match the thing in our head, and which does not quite fit the use-cases which originally generated the line of inquiry.

This is also a major problem in real bureaucracies: the boss can explain the whole problem to the underlings, in a reasonable amount of detail, without attempting to factor it at all, and the underlings are still prone to misunderstand the goal or the use-cases and end up solving the wrong thing. In software engineering, for instance, this happens all the time and is one of the central challenges of the job.

Comment by johnswentworth on The Natural Abstraction Hypothesis: Implications and Evidence · 2021-12-14T23:57:27.830Z · LW · GW

General Thoughts

Solid piece!

One theme I notice throughout the "evidence" section is that it's mostly starting from arguments that the NAH might not be true, then counterarguments, and sometimes counter-counterarguments. I didn't see as much in the way of positive reasons we would expect the NAH to be true, as opposed to negative reasons (i.e. counterarguments to arguments against NAH). Obviously I have some thoughts on that topic, but I'd be curious to hear yours.


Wentworth thinks it is quite likely (~70%) that a broad class of systems (including neural networks) trained for predictive power will end up with a simple embedding of human values.

Subtle point: I believe the claim you're drawing from was that it's highly likely that the inputs to human values (i.e. the "things humans care about") are natural abstractions. (~70% was for that plus NAH; today I'd assign at least 85%.) Whether "human values" are a natural abstraction in their own right is, under my current understanding, more uncertain.

The NAH only says that AIs will develop abstractions similar to humans when they have similar priors, which may not always be the case.

There's a technical sense in which this is true, but it's one of those things where the data should completely swamp the effect of the prior for an extremely wide range of priors.
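As a toy illustration of data swamping the prior (a minimal sketch, not a model of abstraction learning): assume a coin with unknown bias and three quite different Beta priors. The prior parameters, the observed counts, and the helper names `posterior_mean` and `spread` are all assumptions chosen for illustration.

```python
def posterior_mean(prior_a, prior_b, heads, tails):
    """Posterior mean of a coin's bias under a Beta(prior_a, prior_b) prior."""
    return (prior_a + heads) / (prior_a + prior_b + heads + tails)


def spread(values):
    """Gap between the largest and smallest value."""
    return max(values) - min(values)


# Hypothetical priors: one uniform, two strongly opinionated in opposite directions.
priors = {
    "uniform": (1.0, 1.0),
    "heads-biased": (50.0, 1.0),
    "tails-biased": (1.0, 50.0),
}

# Means before seeing any data, after 10 flips, and after 10,000 flips.
# Both datasets are assumed to be 70% heads.
prior_means = [a / (a + b) for a, b in priors.values()]
few = [posterior_mean(a, b, 7, 3) for a, b in priors.values()]
many = [posterior_mean(a, b, 7000, 3000) for a, b in priors.values()]

print("disagreement with no data:     ", round(spread(prior_means), 4))
print("disagreement after 10 flips:   ", round(spread(few), 4))
print("disagreement after 10000 flips:", round(spread(many), 4))
```

With only ten flips the priors still disagree substantially; with ten thousand, all three posterior means sit within about half a percentage point of each other. The analogy to abstractions is loose, but it shows the shape of the claim.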

There are still some problems here - we might be able to infer each others' values across small inferential distances where everybody shares cultural similarities, but values can differ widely across larger cultural gaps.

This is the main kind of argument which makes me think human values are not a natural abstraction.

We could argue that navigators throughout history were choosing from a discrete set of abstractions, with their choices determined by factors like available tools, objectives, knowledge, or cultural beliefs, but the set of abstractions itself being a function of the environment, not the navigators.

The dependence of abstractions on data makes it clear that something like this is necessary. For instance, a culture which has never encountered snow will probably not have a concept of it; snow is not a natural abstraction of their data. On the other hand, if you take such people and put them somewhere snowy, they will immediately recognize snow as "a thing"; snow is still the kind-of-thing which humans recognize as an abstraction.

I expect this to carry over to AI to a large extent: even when AIs are using concepts not currently familiar to humans, they'll still be the kinds-of-concepts which a human is capable of using. (At least until you get to really huge hardware, where the AI can use enough hardware brute-force to handle abstractions which literally can't fit in a human brain.)

In the paper ImageNet-trained CNNs are biased towards texture, the authors observe that the features CNNs use when classifying images lean more towards texture, and away from shape (which seems much more natural and intuitive to humans).

However, this also feels like a feature of the "valley of confused abstractions". Humans didn't evolve based on individual snapshots of reality, we evolved with moving pictures as input data.

Also the "C" part of "CNN" is especially relevant here; we'd expect convolutional techniques to bias toward repeating patterns (like texture) in general.

Comment by johnswentworth on The Plan · 2021-12-14T23:33:13.329Z · LW · GW

This is a capability thing, not just an efficiency thing. If, for instance, I lack enough context to distinguish real expertise from prestigious fakery in some area, then I very likely also lack enough context to distinguish those who do have enough context from those who don't (and so on up the meta-ladder). It's a bottleneck which fundamentally cannot be circumvented by outsourcing cognitive labor.

Similarly, if the interface at the very top level does not successfully convey what I want those one step down to do, then there's no error-correction mechanism for that; there's no way to ground out the top-level question anywhere other than the top-level person. Again, it's a bottleneck which fundamentally cannot be circumvented by outsourcing cognitive labor.

Orthogonal to the "some kinds of cognitive labor cannot be outsourced" problem, there's also the issue that HCH can only spend >99% of its time on robustness if the person being amplified decides to do so, and then the person being amplified needs to figure out the very difficult problem of how to make all that robustness-effort actually useful. HCH could do all sorts of things if the H in question were already superintelligent, could perfectly factor problems, knew exactly the right questions to ask, knew how to deploy lots of copies in such a way that no key pieces fell through the cracks, etc. But actual humans are not perfectly-ideal tool operators who don't miss anything or make any mistakes, and actual humans are also not super-competent managers capable of extracting highly robust performance on complex tasks from giant bureaucracies. Heck, it's a difficult and rare skill just to get robust performance on simple tasks from giant bureaucracies.

In general, if HCH requires some additional assumption that the person being amplified is smart enough to do X, then that should be baked into the whole plan from the start so that we can evaluate it properly. Like, if every time someone says "HCH has problem Y" the answer is "well the humans can just do X", for many different values of Y and X, then that implies there's some giant unstated list of things the humans need to do in order for HCH to actually work. If we're going to rely on the scheme actually working, then we need that whole list in advance, not just some vague hope that the humans operating HCH will figure it all out when the time comes. Humans do not, in practice, reliably ask all the right questions on-the-fly.

And if your answer to that is "ok, the first thing for the HCH operator to do is spin up a bunch of independent HCH instances and ask them what questions we need to ask..." then I want to know why we should expect that to actually generate a list containing all the questions we need to ask. Are we assuming that those subinstances will first ask their subsubinstances (what questions the subinstances need to ask in order to determine (what questions the top instance needs to ask))? Where does that recursion terminate, and when it does terminate, how does the thing it's terminating on actually end up producing a list which doesn't miss any crucial questions?

Comment by johnswentworth on The Plan · 2021-12-13T20:56:42.603Z · LW · GW

I would say the e-coli's fitness function has some kind of reflection baked into it, as does a human's fitness function. The qualitative difference between the two is that a human's own world model also has an explicit self-model in it, which is separate from the reflection baked into a human's fitness function.

After that, I'd say that deriving the (probable) mechanistic properties from the fitness functions is the name of the game.

... so yeah, I'm on basically the same page as you here.

Comment by johnswentworth on The Plan · 2021-12-13T19:56:14.643Z · LW · GW

I expect that progress on the general theory of agency is a necessary component of solving all the problems on which MIRI has worked. So, conditional on those problems being instantly solved, I'd expect that a lot of general theory of agency came along with it. But if a "solution" to something like e.g. the Tiling Problem didn't come with a bunch of progress on more foundational general theory of agency, then I'd be very suspicious of that supposed solution, and I'd expect lots of problems to crop up when we try to apply the solution in practice.

(And this is not symmetric: I would not necessarily expect such problems in practice for some more foundational piece of general agency theory which did not already have a solution to the Tiling Problem built into it. Roughly speaking, I expect we can understand e-coli agency without fully understanding human agency, but not vice-versa.)