Introducing the Principles of Intelligent Behaviour in Biological and Social Systems (PIBBSS) Fellowship 2021-12-18T15:23:26.672Z
Redwood's Technique-Focused Epistemic Strategy 2021-12-12T16:36:22.666Z
Interpreting Yudkowsky on Deep vs Shallow Knowledge 2021-12-05T17:32:26.532Z
Applications for AI Safety Camp 2022 Now Open! 2021-11-17T21:42:31.672Z
Epistemic Strategies of Safety-Capabilities Tradeoffs 2021-10-22T08:22:51.169Z
Epistemic Strategies of Selection Theorems 2021-10-18T08:57:23.109Z
On Solving Problems Before They Appear: The Weird Epistemologies of Alignment 2021-10-11T08:20:36.521Z
Alignment Research = Conceptual Alignment Research + Applied Alignment Research 2021-08-30T21:13:57.816Z
What are good alignment conference papers? 2021-08-28T13:35:37.824Z
Approaches to gradient hacking 2021-08-14T15:16:55.798Z
A review of "Agents and Devices" 2021-08-13T08:42:40.637Z
Power-seeking for successive choices 2021-08-12T20:37:53.194Z
Goal-Directedness and Behavior, Redux 2021-08-09T14:26:25.748Z
Applications for Deconfusing Goal-Directedness 2021-08-08T13:05:25.674Z
Traps of Formalization in Deconfusion 2021-08-05T22:40:50.267Z
LCDT, A Myopic Decision Theory 2021-08-03T22:41:44.545Z
Alex Turner's Research, Comprehensive Information Gathering 2021-06-23T09:44:34.496Z
Looking Deeper at Deconfusion 2021-06-13T21:29:07.811Z
Review of "Learning Normativity: A Research Agenda" 2021-06-06T13:33:28.371Z
[Event] Weekly Alignment Research Coffee Time (01/24) 2021-05-29T13:26:28.471Z
[Event] Weekly Alignment Research Coffee Time (05/24) 2021-05-21T17:45:53.618Z
[Event] Weekly Alignment Research Coffee Time (05/17) 2021-05-15T22:07:02.339Z
[Event] Weekly Alignment Research Coffee Time (05/10) 2021-05-09T11:05:30.875Z
[Weekly Event] Alignment Researcher Coffee Time (in Walled Garden) 2021-05-02T12:59:20.514Z
[Linkpost] Teaching Paradox, Europa Universalis IV, Part I: State of Play 2021-05-02T09:02:19.191Z
April 2021 Deep Dive: Transformers and GPT-3 2021-05-01T11:18:08.584Z
Review of "Fun with +12 OOMs of Compute" 2021-03-28T14:55:36.984Z
Behavioral Sufficient Statistics for Goal-Directedness 2021-03-11T15:01:21.647Z
Epistemological Framing for AI Alignment Research 2021-03-08T22:05:29.210Z
Suggestions of posts on the AF to review 2021-02-16T12:40:52.520Z
Tournesol, YouTube and AI Risk 2021-02-12T18:56:18.446Z
Epistemology of HCH 2021-02-09T11:46:28.598Z
Infra-Bayesianism Unwrapped 2021-01-20T13:35:03.656Z
Against the Backward Approach to Goal-Directedness 2021-01-19T18:46:19.881Z
Literature Review on Goal-Directedness 2021-01-18T11:15:36.710Z
The Case for a Journal of AI Alignment 2021-01-09T18:13:27.653Z
Postmortem on my Comment Challenge 2020-12-04T14:15:41.679Z
[Linkpost] AlphaFold: a solution to a 50-year-old grand challenge in biology 2020-11-30T17:33:43.691Z
Small Habits Shape Identity: How I became someone who exercises 2020-11-26T14:55:57.622Z
What are Examples of Great Distillers? 2020-11-12T14:09:59.128Z
The (Unofficial) Less Wrong Comment Challenge 2020-11-11T14:18:48.340Z
Why You Should Care About Goal-Directedness 2020-11-09T12:48:34.601Z
The "Backchaining to Local Search" Technique in AI Alignment 2020-09-18T15:05:02.944Z
Universality Unwrapped 2020-08-21T18:53:25.876Z
Goal-Directedness: What Success Looks Like 2020-08-16T18:33:28.714Z
Mapping Out Alignment 2020-08-15T01:02:31.489Z
Will OpenAI's work unintentionally increase existential risks related to AI? 2020-08-11T18:16:56.414Z
Analyzing the Problem GPT-3 is Trying to Solve 2020-08-06T21:58:56.163Z
What are the most important papers/post/resources to read to understand more of GPT-3? 2020-08-02T20:53:30.913Z
What are you looking for in a Less Wrong post? 2020-08-01T18:00:04.738Z


Comment by adamShimi on Methods of Phenomenology · 2022-01-18T21:49:43.754Z · LW · GW

The Pros claim science for their own, so the Antis reject it. Phenomenology says science is useful for understanding many things but not literally all things, so the extremists on the Pro side reject phenomenology for being “impure”. The Anti side then takes phenomenology in and plays up the limits-of-science thing while downplaying the usefulness-of-science part. As a result phenomenologists more often find themselves having to defend their ideas against material realism, scientism, and other ideas on the Pro side and less against irrationality, mysticism, and other incompatible positions on the Anti side. This creates a skewed picture that implies phenomenology is anti-science by association, and it doesn’t help that some phenomenologists, being humans, may actually take up sides in this debate.


That part was useful for me. I haven't looked at the discussions in enough detail to check your proposal, but it gives a coherent story for how phenomenology gets its bad rap.

Comment by adamShimi on The Natural Abstraction Hypothesis: Implications and Evidence · 2022-01-13T19:47:58.369Z · LW · GW

Thanks for the post! Two general points I want to make before going into more general comments:

  • I liked the section on concept differences, and hadn't thought much about them before, so thanks!
  • One big aspect of the natural abstraction hypothesis that you missed IMO is "how do you draw the boundaries around abstractions?" — more formally, how do you draw the Markov blanket. This to me is the most important question to answer for settling the NAH, and John's recent work on sequences of Markov blankets is IMO him trying to settle this.

In general, we should expect that systems will employ natural abstractions in order to make good predictions, because they allow you to make good predictions without needing to keep track of a huge number of low-level variables.

Don't you mean "abstractions" instead of "natural abstractions"?

John Maxwell proposes differentiating between the unique abstraction hypothesis (there is one definitive natural abstraction which we expect humans and AIs to converge on), and the useful abstraction hypothesis (there is a finite space of natural abstractions which humans use for making predictions and inference, and in general this space is small enough that an AGI will have a good chance of finding abstractions that correspond to ours, if we have it "cast a wide enough net").

Note that both can be reconciled by saying that the abstraction in John's sense (the high-level summary statistic that captures everything that isn't wiped away by noise) is unique up to isomorphism (because it's a summary statistic), but different predictors will learn different parts of this abstraction depending on the information that they don't care about (things they don't have sensors for, for example). Hence you have a unique natural abstraction, which generates a constrained space of subabstractions, which are the ones actually learned by real-world predictors.

This might be relevant for your follow-up arguments.
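To make this "one natural abstraction, many learned subabstractions" picture concrete, here is a toy sketch (my own illustration, not from the post — the latent variable, the sensor counts, and the use of a simple mean as the summary statistic are all assumptions): many noisy low-level sensors are driven by one latent variable, the natural abstraction is essentially that latent variable, and two predictors with access to different sensor subsets each learn their own approximation of it.

```python
import random

random.seed(0)

# One latent low-level cause drives many noisy sensor readings.
# The "natural abstraction" here is (an estimate of) z: it is the
# summary statistic that survives once sensor noise is averaged away.
z = random.gauss(0, 1)
readings = [z + random.gauss(0, 1) for _ in range(1000)]

# Two predictors see disjoint sensor subsets, so each learns its own
# "subabstraction": the mean over the sensors it can access. Both are
# approximations of the same underlying natural abstraction z.
predictor_a = sum(readings[:500]) / 500
predictor_b = sum(readings[500:]) / 500

print(abs(predictor_a - z), abs(predictor_b - z))  # both close to 0
```

On this toy picture, culture, tools, or training data would determine which sensor subset (and hence which subabstraction) a given predictor ends up with, while the underlying natural abstraction stays the same.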

To put it another way — the complexity of your abstractions depends on the depth of your prior knowledge. The NAH only says that AIs will develop abstractions similar to humans when they have similar priors, which may not always be the case.

Or another interpretation, following my model above, is that with more knowledge, more data, and more time, you get closer and closer to the natural abstraction, but you don't generally start there. Although that would mean that refining abstractions towards the natural abstraction often breaks the abstraction, which sounds consistent with the course of science and modelling in general.

Different strengths of the NAH can be thought of as corresponding to different behaviours of this graph. If there is a bump at all, this would suggest some truth to the NAH, because it means (up to a certain level) models can become more powerful as their concepts more closely resemble human ones. A very strong form of the NAH would suggest this graph doesn't tail off at all in some cases, because human abstractions are the most natural, and anything more complicated won't provide much improvement. This seems quite unlikely—especially for tasks where humans have poor prior knowledge—but the tailing off problem could be addressed by using an amplified overseer, and incentivising interpretability during the training process. The extent to which NAH is true has implications for how easy this process is (for more on this point, see next section).

Following my model above, one interpretation is that the bump means that, when reaching human level of competence, the most powerful abstractions available as approximations of the natural abstractions are the ones humans are using. Which is not completely ridiculous, if you expect that AIs at "human-level competence at human tasks" would have knowledge, inputs, and data similar to humans'.

The later fall towards alien concepts might come from the approximation shifting to very different abstractions as the best approximation, just like the shift between classical physics and quantum mechanics.

However, it might make deceptive alignment more likely. There are arguments for why deceptive models might be favoured in training processes like SGD—in particular, learning and modelling the training process (and thereby becoming deceptive) may be a more natural modification than internal or corrigible alignment. So if the base objective is a natural abstraction, this would make it easier to point to, and (since having a good model of the base objective is a sufficient condition for deception) the probability of deception is subsequently higher.

Here is my understanding of your argument: because X is a natural abstraction and the NAH is true, models will end up actually understanding X, which is a condition for the emergence of mesa-optimization. Thus the NAH might make deception more likely by making one of its necessary conditions more likely. Is that a good description of what you propose?

I would assume that the NAH actually makes deception less likely, because of the reasons you gave above about the better proxy, which might entail that the mesa-objective is actually the base objective.

This abstractions-framing of instrumental convergence implies that getting better understanding of which abstractions are learned by which agents in which environments might help us better understand how an agent with instrumental goals might behave (since we might expect an agent will try to gain control over some variable  only to the extent that it is controlling the features described by the abstraction  which it has learned, which summarizes the information about the current state relevant to the far-future action space).

Here too I feel that the model of "any abstraction is an approximation of the natural abstraction" might be relevant. Especially because if it's correct, then in addition to knowing the natural abstraction, you have to understand which part of it the model could approximate, and which parts of that approximation are relevant for its goal.

If NAH is strongly true, then maybe incentivising interpretability during training is quite easy, just akin to a "nudge in the right direction". If NAH is not true, this could make incentivising interpretability really hard, and applying pressure away from natural abstractions and towards human-understandable ones will result in models Goodharting interpretability metrics—where a model is incentivised to trick the metrics into thinking it is forming human-interpretable concepts when it actually isn't.

The less true NAH is, the harder this problem becomes. For instance, maybe human-interpretability and "naturalness" of abstractions are actually negatively correlated for some highly cognitively-demanding tasks, in which case the trade-off between these two will mean that our model will be pushed further towards Goodharting.

The model I keep describing could actually lead to abstractions that are hard to interpret and yet potentially interpretable. Like if someone from 200 years ago had to be taught Quantum Mechanics: it might be really hard, and depend on a lot of mental moves that we don't know how to convey, but that would be the sort of problem equivalent to interpreting better approximations than ours of the natural abstractions. It sounds at least possible to me.

A danger of this process is that the supervised learner would model the data-collection process instead of using the unsupervised model - this could lead to misalignment. But suppose we supplied data about human values which is noisy enough to make sure the supervised learner never switched to directly modelling the data-collection process, while still being a good enough proxy for human values that the supervised learner will actually use the unsupervised model in the first place.

Note that this amounts to solving ELK, which probably takes more than "adequately noisy data".

We can argue that human values are properties of the abstract object "humans", in an analogous way to branching patterns being properties of the abstract object "trees". However there are complications to this analogy: for instance, human values seem especially hard to infer from behaviour without using the "inside view" that only humans have.

I mean, humans learning approximations of the natural abstractions through evolutionary processes makes a lot of sense to me, as it's just evolution training predictors, right?

For another, even if we accept that "which abstractions are natural?" is a function of factors like culture, this argues against the "unique abstraction hypothesis" but not the "useful abstraction hypothesis". We could argue that navigators throughout history were choosing from a discrete set of abstractions, with their choices determined by factors like available tools, objectives, knowledge, or cultural beliefs, but the set of abstractions itself being a function of the environment, not the navigators.

Hence with the model I'm discussing, the culture determined their choice of subabstractions but they were all approximating the same natural abstractions.

Comment by adamShimi on [New Feature] Support for Footnotes!! · 2022-01-05T10:10:39.087Z · LW · GW

It would be a slight exaggeration to say that this is the best day of my life. Only a slight one though. Thanks for the great work!

Comment by adamShimi on More Is Different for AI · 2022-01-04T21:40:32.445Z · LW · GW

Really excited about this sequence, as I'm currently spending a lot of time on clarifying and formalizing the underlying assumptions and disagreements of what you're calling the Philosophy view (I don't completely agree with the term, but I think you're definitely pointing at important aspects of it). Hence having someone else post on the different strengths and weaknesses of this and the Engineering view sounds great!

Comment by adamShimi on Promising posts on AF that have fallen through the cracks · 2022-01-04T21:37:19.934Z · LW · GW

Nope, this is dead at the moment. I still think peer review matters a lot, but that's not what I'm focusing on at the moment, and none of the researchers on that project had enough time to invest in these in-depth reviews...

Comment by adamShimi on Robustness to Scale · 2022-01-03T10:17:55.424Z · LW · GW

Rereading this post while thinking about the approximations that we make in alignment, two points jump out at me:

  • I'm not convinced that robustness to relative scale is as fundamental as the other two, because there is no reason to expect that in general the subcomponents will be significantly different in power, especially in settings like adversarial training where both parts are trained according to the same approach. That being said, I still agree that this is an interesting question to ask, and some proposals might indeed depend on a version of this.
  • Robustness to scaling up and robustness to scaling down sound like they can be summarized by: "does it break in the limit of optimality?" and "does it only work in the limit of optimality?". Where the first gives us an approximation for studying and designing alignment proposals, and the second points out a potential issue in this approximation. (Not saying that this is capturing all of your meaning, though)

Comment by adamShimi on Reply to Eliezer on Biological Anchors · 2021-12-28T21:47:05.390Z · LW · GW

Thanks for pushing back on my interpretation.

I feel like you're using "strongest" and "weakest" to mean "more concrete" and "more abstract", with maybe the value judgement (implicit in your focus on specific testable claims) that concreteness is better. My interpretation doesn't disagree with your point about Bio Anchors; it simply says that this is a concrete instantiation of a general pattern, and that the whole point of the original post as I understand it is to share this pattern. Hence the title, which talks about all biology-inspired timelines, the three examples in the post, and the seven times that Yudkowsky repeats his abstract arguments in different ways.

It's hardly surprising there are 'two paths through a space' - if you reran either (biological or cultural/technological) evolution with slightly different initial conditions you'd get a different path. However technological evolution is aware of biological evolution and thus strongly correlated to and influenced by it. IE deep learning is in part brain reverse engineering (explicitly in the case of DeepMind, but there are many other examples). The burden of proof is thus arguably the opposite of what you claim (EY claims).

Maybe a better way of framing my point here is that the optimization processes are fundamentally different (something Yudkowsky has written a lot about; see for example this post from 13 years ago), and that the burden of proof is on showing that they have enough similarity to transfer a lot of info from the evolutionary optimization to the human research optimization.

I also don't think your point about DeepMind works, because DM is working in a way extremely different from evolution. They are in part reverse engineering the brain, but that's a very different (and very human and insight-heavy) path towards AGI than the one evolution took.

Lastly for this point, I don't think the interpretation that "Yudkowsky says that the burden of proof is on showing that the optimization of evolution and human research are non-correlated" survives contact with a text where Yudkowsky constantly berates his interlocutors for assuming such a correlation, and keeps drawing, again and again, the differences.

To the extent EY makes specific testable claims about the inefficiency of biology, those claims are in error - or at least easily contestable.

Hum, I find myself feeling like this comment: Yudkowsky's main point about biology IMO is that brains are not at all the most efficient computational way of implementing AGI. Another way of phrasing it is that Yudkowsky says (according to me) that you could use significantly less hardware and ops/sec to make an AGI.

Comment by adamShimi on My Overview of the AI Alignment Landscape: Threat Models · 2021-12-26T18:01:25.332Z · LW · GW

Thanks so much for the effort you're putting into this work! It looks particularly relevant to my current interest in understanding the different approximations and questions used in alignment, and what forbids us the Grail of paradigmaticity.

Here is my more concrete feedback:

A common approach when setting research agendas in AI Alignment is to be specific, and focus on a threat model. That is, to extrapolate from current work in AI and our theoretical understanding of what to expect, to come up with specific stories for how AGI could cause an existential catastrophe. And then to identify specific problems in current or future AI systems that make these failure modes more likely to happen, and try to solve them now.

Given that AFAIK it’s Rohin who introduced the term in alignment, linking to his corresponding talk might be a good idea. I also like this drawing from his slides, which might clarify the explanation for more visual readers.

While I’m at threat models, you confused me at first because “threat model” always makes me think of “development model”, and so I expected a discussion of seed AI vs Prosaic AI vs Brain-based-AGI vs CAIS vs alternatives. What you do instead is more a discussion of “risk models”, with a mention in passing that the first one traditionally came from the more seed AI development model.

Of course that’s your choice, but neglecting a bunch of development models with a lot of recent work, notably Steve Byrnes’s brain-based AGI model, feels inconsistent with the stated aim of the sequence — “mapping out the AI Alignment research landscape”.

And having a specific story to guide what you do can be a valuable source of direction, even if ultimately you know it will be flawed in many ways. Nate Soares makes the case well for having a specific but flawed story in general.

My first reaction when reading this part was “Hum, that doesn’t seem to be exactly what Nate is justifying here”. After rereading the post, I think what disturbed me was my initial reading that you were saying something like “the correctness of a threat model doesn’t matter, you just choose one and do stuff”. Which is not what either you or Nate is saying; instead, it’s that spending all your time waiting for a perfect plan/threat model is less productive than taking the best option available, getting your hands dirty, and trying things.

Note that I think there is very much a spectrum between this category and robustly good approaches (a forthcoming post in this sequence). Most robustly good ways to help also address specific threat models, and many ways to address specific threat models feel useful even if that specific threat model is wrong. But I find this a helpful distinction to keep in mind.

This sounds to me like a better defense of threat model thinking, and I would like to read more about your ideas (especially the last two sentences).

When naively considered, this framework often implicitly thinks of intelligence as a mysterious black box that cashes out as 'better able to achieve plans than us', without much concrete detail. Further, it assumes that all goals would lead to these issues.

I agree with the gist of the paragraph, but “all goals” is an overstatement: both Nick Bostrom and Steve Omohundro note that some goals obviously don’t have power-seeking incentives, like the goal of dying as fast as possible. They say that most goals would have instrumental subgoals, which is the part that Richard Ngo criticizes and Alex Turner formalizes.

See Tom Adamczewski’s discussion of how arguments have shifted

Oh, awesome resource! Thanks for the link!

Understanding the incentives and goals of the agent, and how the training process can affect these in subtle ways

I feel like you should definitely mention Alex Turner’s work here, where he formalizes Bostrom’s instrumental convergence thesis.

Limited optimization: Many of these problems inherently stem from having a goal-directed utility-maximiser, which will find creative ways to achieve these goals. Can we shift away from this paradigm?

Shouldn’t you include work on impact measures here? For example this survey post and Alex Turner’s sequence.

A particularly concerning special case of the power-seeking concern is inner misalignment. This was an idea that had been floating around MIRI for a while, but was first properly clarified by Evan Hubinger in Risks from Learned Optimization.

Evan is adamant that the paper was done equally by all coauthors, and so should be cited as done by “Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, Scott Garrabrant”.

Sub-Threat model: Inner Alignment

I feel like you’re sticking a bit close to the paper’s case, when there are more compact statements of the problem. Especially with your previous case, you could just say that inner alignment is about justifying power-seeking behavior and treacherous turns in the case where the AI is found by search instead of programmed by hand.

Plausibility of misaligned cognition: It is likely that, in practice, we will end up with networks with misaligned cognition

There’s also an argument that deception is robust once it has been found: making a deceptive model less deceptive would make it do more of what it really wants to do, and so get a worse loss, which means it’s not pushed out of deception by SGD.

Better understanding how and when mesa-optimization arises (if it does at all).

One cool topic here is gradient hacking — see for example this recent survey.

Anecdotally, some researchers I respect take this very seriously - it was narrowly rated the most plausible threat model in a recent survey.

I want to note that this scenario looks more normal, which makes me think that by default, anyone would find this more plausible than the Bostrom/Yudkowsky scenario due to normalcy bias. So I tend to cancel this advantage when looking at what scenario people favor.

But this error-correction mechanism may break-down for AI. There are three key factors to analyse here: pace, comprehensibility and lock-in.

I like this decomposition!

So, why would AI make cooperation worse/harder?

At least for Critch’s RAAPs, my understanding is that it’s mostly Pace that makes a difference: the process already exists, but it’s not moving as fast as it could because of the fallibility of humans, because of legislation and restrictions. Replacing humans with AIs in most tasks removes the slow down, and so the process moves faster, towards loss of control.

Comment by adamShimi on Reply to Eliezer on Biological Anchors · 2021-12-26T12:35:05.856Z · LW · GW

Thanks for this post!

That being said, my model of Yudkowsky, which I built by spending time interpreting and reverse engineering the post you're responding to, feels like you're not addressing his points (obviously, I might have missed the real Yudkowsky's point).

My interpretation is that he is saying that Evolution (as the generator of most biological anchors) explores the solution space in a fundamentally different path than human research.  So what you have is two paths through a space. The burden of proof for biological anchors thus lies in arguing that there are enough connections/correlations between the two paths to use one in order to predict the other.

Here it sounds like you're taking as an assumption that human research follows the same or a faster path towards the same point in search space. But that's actually the assumption that IMO Yudkowsky is criticizing!

In his piece, Yudkowsky is giving arguments that the human research path should lead to more efficient AGIs than evolution, in part due to the ability of humans to have and leverage insights, which the naive optimization process of evolution can't do. He also points to the inefficiency of biology in implementing new (in geological-time) complex solutions. On the other hand, he doesn't seem to see a way of linking the amount of resources needed by evolution to the amount of resources needed by human research, because they are so different.

If the two paths are very different and don't even aim at the same parts of the search space, there's nothing telling you that computing the optimization power of the first path helps in understanding the second one.

I think Yudkowsky would agree that if you could estimate the amount of resources needed to simulate all of evolution until humans, at the level of detail that you know is enough to capture all relevant aspects, that amount of resources would be an upper bound on the time taken by human research, because that's a way to get AGI if you have the resources. But the number is so vastly large (and actually unknown due to the "level of detail" problem) that it's not really relevant for timeline calculations.

(Also, I already had this discussion with Daniel Kokotajlo in this thread, but I really think that Platt's law is one of the least cruxy aspects of the original post. So I don't think discussing it further or pointing attention to it is a good idea)

Comment by adamShimi on Interfaces as a Scarce Resource · 2021-12-24T11:36:49.211Z · LW · GW

These are not cheap industries. Lawyers, accountants, lobbyists, programmers… these are experts in complicated systems, and they get paid accordingly. The world spends large amounts of resources using people as interfaces - indicating that these kinds of interfaces are a very scarce resource.

Another example that fits the trend is the plumber. And a more modern one is the prompt engineer.

Comment by adamShimi on Material Goods as an Abundant Resource · 2021-12-22T17:26:26.961Z · LW · GW

Fun thing I thought about while reading this post: it applies pretty well to predicting the recent explosion of NFTs, which are basically a huge number of badges attached to things that can be copied or manufactured easily.

Comment by adamShimi on Biology-Inspired AGI Timelines: The Trick That Never Works · 2021-12-15T13:52:12.499Z · LW · GW

First, I want to clarify that I feel we're going into a more interesting place, where there's a better chance that you might find a point that invalidates Yudkowsky's argument, and can thus convince him of the value of the model.

But it's also important to realize that IMO, Yudkowsky is not just saying that biological anchors are bad. The more general problem (which is also developed in this post) is that predicting the Future is really hard. In his own model of AGI timelines, the factor that is basically impossible to predict until you can make AGI is the "how much resources are needed to build AGI".

So saying "let's just throw away the biological anchors" doesn't evade the general counterargument that to predict timelines at all, you need to find information on "how much resources are needed to build AGI", and that is incredibly hard. If you or Ajeya can argue for actual evidence in that last question, then yeah, I expect Yudkowsky would possibly update on the validity of the timeline estimates.

But at the moment, in this thread, I see no argument like that.

Comment by adamShimi on Biology-Inspired AGI Timelines: The Trick That Never Works · 2021-12-14T17:54:46.579Z · LW · GW

I do think you are misconstruing Yudkowsky's argument. I'm going to give evidence (all of which is relatively strong IMO) in order of "ease of checkability". So I'll start with something anyone can check in a couple of minutes, and close with the more general interpretation that requires rereading the post in detail.

Evidence 1: Yudkowsky flags Simulated-Eliezer as talking smack in the part you're mentioning

If I follow you correctly, your interpretation mostly comes from this part:

OpenPhil:  We did already consider that and try to take it into account: our model already includes a parameter for how algorithmic progress reduces hardware requirements.  It's not easy to graph as exactly as Moore's Law, as you say, but our best-guess estimate is that compute costs halve every 2-3 years.

Eliezer:  Oh, nice.  I was wondering what sort of tunable underdetermined parameters enabled your model to nail the psychologically overdetermined final figure of '30 years' so exactly.

OpenPhil:  Eliezer.

Note that this is one of the two times in this dialogue where Simulated-OpenPhil calls out Simulated-Eliezer. But remember that this whole dialogue was written by Yudkowsky! So he is flagging himself that this particular answer is a quip. Simulated-Eliezer doesn't reexplain it as he does most of his insulting points to Humbali; instead Simulated-Eliezer goes for a completely different explanation in the next answer.

Evidence 2: Platt's law is barely mentioned in the whole dialogue

"Platt" is used 6 times in the 20k-word piece. "30 years" is used 8 times (basically in the same places where "Platt" is used).

Evidence 3: Humbali spends far more time discussing and justifying the "30 years" figure than Simulated-OpenPhil. And Humbali is the strawman character, whereas Simulated-OpenPhil actually tries to discuss and to understand what Simulated-Eliezer is saying.

Evidence 4: There is an alternative interpretation that takes into account the full text and doesn't use Platt's law at all: see this comment on your other thread for my current best version of that explanation.

Evidence 5: The idea that Yudkowsky's whole criticism relies on a purely empirical and superficial similarity runs contrary to everything I extracted from his writing in my recent post, and also to all the time he spends here discussing deep knowledge and the need for an underlying model.


So my opinion is that Platt's law is completely superfluous here, and is present only because it gives a way of pointing at the ridiculousness of some estimates, and because to Yudkowsky it probably means that people are not even making interesting new mistakes, just the same mistakes over and over again. I think discussing it in this post doesn't add much, and weakens the post significantly by allowing readings like yours, Daniel, which miss the actual point.

Comment by adamShimi on Biology-Inspired AGI Timelines: The Trick That Never Works · 2021-12-14T17:17:50.679Z · LW · GW

Here I think I share your interpretation of Yudkowsky; I just disagree with Yudkowsky. I agree on the second part; the model overestimates median TAI arrival time. But I disagree on the first part -- I think that having a probability distribution over when to expect TAI / AGI / AI-PONR etc. is pretty important/decision-relevant, e.g. for advising people on whether to go to grad school, or for deciding what sort of research project to undertake. (Perhaps Yudkowsky agrees with this much.) 

Hum, I would say Yudkowsky seems to agree with the value of a probability distribution for timelines.

(Quoting The Weak Inside View (2008) from the AI FOOM Debate)

So to me it seems “obvious” that my view of optimization is only strong enough to produce loose, qualitative conclusions, and that it can only be matched to its retrodiction of history, or wielded to produce future predictions, on the level of qualitative physics.

“Things should speed up here,” I could maybe say. But not “The doubling time of this exponential should be cut in half.”

I aspire to a deeper understanding of intelligence than this, mind you. But I’m not sure that even perfect Bayesian enlightenment would let me predict quantitatively how long it will take an AI to solve various problems in advance of it solving them. That might just rest on features of an unexplored solution space which I can’t guess in advance, even though I understand the process that searches.

On the other hand, my interpretation of Yudkowsky strongly disagrees with the second part of your paragraph:

And I think that Ajeya's framework is the best framework I know of for getting that distribution. I think any reasonable distribution should be formed by Ajeya's framework, or some more complicated model that builds off of it (adding more bells and whistles such as e.g. a data-availability constraint or a probability-of-paradigm-shift mechanic.). Insofar as Yudkowsky was arguing against this, and saying that we need to throw out the whole model and start from scratch with a different model, I was not convinced. (Though maybe I need to reread the post and/or your steelman summary)

So my interpretation of the text is that Yudkowsky says that you need to know how compute will be transformed into AGI to estimate the timelines (then you can plug in your estimates for the compute), and that by default any approach which relies on biological analogies for that part will be spouting nonsense, because evolution and biology optimize in fundamentally different ways than human researchers do.

For each of the three examples, he goes into more detail about the way this is instantiated. My understanding of his criticism of Ajeya's model is that he disagrees that current deep learning algorithms are actually a recipe for turning compute into AGI, so saying "we keep to current deep learning and estimate the required compute" doesn't make sense and doesn't solve the question of how to turn compute into AGI. (Note that this might be the place where you, or someone defending Ajeya's model, want to disagree with Yudkowsky. I'm just pointing out that this is a more productive place to debate him, because it might actually make him change his mind — or change your mind if he convinces you.)

The more general argument (the reason why "the trick" doesn't work) is that if you actually have a way of transforming compute into AGI, that means you know how to build AGI. And if you do, you're very, very close to the end of the timeline.

Comment by adamShimi on Biology-Inspired AGI Timelines: The Trick That Never Works · 2021-12-14T14:28:05.841Z · LW · GW

Strongly disagree with this, to the extent that I think this is probably the least cruxy topic discussed in this post, and thus the comment is as wrong as is physically possible.

Remove Platt's law, and none of the actual arguments and meta-discussions change. It's clearly a case of Yudkowsky going for the snappy "hey, see, even your new-and-smarter report makes exactly the same estimate predicted by a random psychological law", plus his own frustration with the law still applying despite expected progress.

But once again, if Platt's law was so wrong that there was never in the history of the universe a single instance of people predicting strong AI and/or AGI in 30 years, this would have no influence whatsoever on the arguments in this post IMO.

Comment by adamShimi on Biology-Inspired AGI Timelines: The Trick That Never Works · 2021-12-14T14:22:22.650Z · LW · GW

I do agree that the halving-of-compute-costs-every-2.5-years estimate seems too slow to me; it seems like that's the rate of "normal incremental progress" but that when you account for the sort of really important ideas (or accumulations of ideas, or shifts in research direction towards more fruitful paths) that happen about once a decade, the rate should be faster than that.

I don't think this is what Yudkowsky is saying at all in the post. Actually, I think he is saying the exact opposite: that the 2.5-year estimate is too fast as an estimate that is supposed to always work. If I understand correctly, his point is that you have significantly less than that most of the time, except during the initial growth after paradigm shifts, where you're pushing as much compute as you can onto your new paradigm. (That being said, Yudkowsky seems to agree with you that this should make us directionally update towards AGI arriving in less time.)

My interpretation seems backed by this quote (and the fact that he's presenting these points as if they're clearly wrong):

Eliezer:  Backtesting this viewpoint on the previous history of computer science, it seems to me to assert that it should be possible to:

  • Train a pre-Transformer RNN/CNN-based model, not using any other techniques invented after 2017, to GPT-2 levels of performance, using only around 2x as much compute as GPT-2;
  • Play pro-level Go using 8-16 times as much computing power as AlphaGo, but only 2006 levels of technology.


Your model apparently suggests that we have gotten around 50 times more efficient at turning computation into intelligence since that time; so, we should be able to replicate any modern feat of deep learning performed in 2021, using techniques from before deep learning and around fifty times as much computing power.
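As a sanity check on the arithmetic behind that "around fifty times" figure: a halving of compute costs every 2.5 years compounds to roughly 50x over about 14 years (the 14-year span is my own assumption about the period Eliezer is gesturing at, not a number from the dialogue):

```python
# Compute costs halving every 2.5 years means efficiency doubles every 2.5 years.
years = 14                      # assumed span (roughly mid-2000s to 2021)
doublings = years / 2.5
factor = 2 ** doublings         # compounded efficiency gain over that span
# factor comes out a bit under 50, matching the quote's "around fifty times"
```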



This seems true but changing the subject. Insofar as the subject is "what should our probability distribution over date-of-AGI-creation look like" then Ajeya's framework (broadly construed) is the right way to think about it IMO. Separately, we should worry that this will never let us predict with confidence that it is happening in X years, and thus we should be trying to have a general policy that lets us react quickly to e.g. two years of warning.

I don't understand how Yudkowsky can be changing the subject when his subject has never been "probability distribution over date-of-AGI-creation"? His point IMO is that this is a bad question to ask, not because you wouldn't want the true answer if you could magically get it, but because we don't have, and won't have, even close to the amount of evidence needed to answer it non-trivially until 2 years before AGI (and maybe not even then, because you need to know the Thielian secrets). As such, to reach an answer of that type, you must contort the evidence and extract more bits of information than the analogies actually contain, which means this is a recipe for saying nonsense.

(Note that I'm not arguing Yudkowsky is right, just that I think this is his point, and that your comment is missing it — might be wrong about all of those ^^)

I think OpenPhil is totally right here. My own stance is that the 2050-centered distribution is a directional overestimate because e.g. the long-horizon anchor is a soft upper bound (in fact I think the medium-horizon anchor is a soft upper bound too, see Fun with +12 OOMs.)

Here too this sounds like missing Yudkowsky's point, which is made in the paragraph just after your original quote:

Eliezer:  Mmm... there's some justice to that, now that I've come to write out this part of the dialogue.  Okay, let me revise my earlier stated opinion:  I think that your biological estimate is a trick that never works and, on its own terms, would tell us very little about AGI arrival times at all.  Separately, I think from my own model that your timeline distributions happen to be too long.

My interpretation is that he's saying that:

  • The model, and the whole approach, is a fundamentally bad and misguided way of thinking about these questions, which fails in the many ways he argues for earlier in the dialogue
  • If he stops talking about whether the model is bad, and just looks at its output, then he thinks it's an overestimate compared to the output of his own model.
Comment by adamShimi on Biology-Inspired AGI Timelines: The Trick That Never Works · 2021-12-14T13:58:05.745Z · LW · GW

(My comment is quite critical, but I want to make it clear that I think doing this exercise is great and important, despite my disagreement with the result of the exercise ;) )

Having done the same exercise myself, I feel that you go far too meta here, and that by doing so you're losing most of the actual valuable meta insights of the post. I'm not necessarily saying that your interpretation doesn't fit what Yudkowsky says, but if the goal is to distill where Yudkowsky is coming from in this specific post, I feel like this comment fails.

The "trick that never works", in general form, is to go looking in epistemology-space for some grounding in objective reality, which will systematically tend to lead you into these illusory traps.

AFAIU, Yudkowsky is not at all arguing against searching for grounding in reality; he's arguing for a very specific grounding in reality that I've been calling deep knowledge in my post interpreting him on the topic. He's arguing that there are ways to go beyond the agnosticism of Science (which is very similar to the agnosticism of the outside view and reference class forecasting) between hypotheses that haven't been falsified yet, ways that let you move towards the true answer despite the search space being far too large to tractably explore. (See that section of my post in particular, where I go into a lot of detail about Yudkowsky's writing on this in the Sequences.)

I also feel like your interpretation conflates the errors that Humbali makes and the ones Simulated-OpenPhil makes, but they're different in my understanding:

  • Humbali keeps criticising Yudkowsky's confidence, and is the representative of the bad uses of the outside view and reference class forecasting. This is why a lot of the answers to Humbali focus on deep knowledge (which Yudkowsky refers to here with the extended metaphor of the rails), where it comes from, and why it lets you discard some hypotheses (which is the whole point)
  • Simulated-OpenPhil mostly defends their own approach and the claim that you can use biological anchors to reason about timelines if you do it carefully. The answer Yudkowsky gives is IMO that they don't have/give a way of linking the path of evolution in search space to the path of human research in search space, and as such more work and more uncertainty handling on evolution and the other biological anchors don't give you more information about AGI timelines. The only thing you get out of evolution and biological anchors is the few bits that Yudkowsky already integrates into his model (like that humans will need less optimization power because they're smarter than evolution), which are not enough to predict timelines.

If I had to state it (and I will probably go into more detail in the post I'm currently writing), my interpretation is that the trick that never works is "using a biological analogy that isn't closely connected to how human research is optimizing for AGI". So the way of making a "perpetual motion machine" would be to explain why the specific path of evolution (or other anchors) is related to the path of human optimization, and derive stuff from this.

Comment by adamShimi on Biology-Inspired AGI Timelines: The Trick That Never Works · 2021-12-14T11:59:59.854Z · LW · GW

(I'm trying to answer and clarify some of the points in the comments based on my interpretation of Yudkowsky in this post. So take the interpretations with a grain of salt, not as "exactly what Yudkowsky meant")

Progress in AI has largely been a function of increasing compute, human software research efforts, and serial time/steps. Throwing more compute at researchers has improved performance both directly and indirectly (e.g. by enabling more experiments, refining evaluation functions in chess, training neural networks, or making algorithms that work best with large compute more attractive).

Historically compute has grown by many orders of magnitude, while human labor applied to AI and supporting software  by only a few. And on plausible decompositions of progress (allowing for adjustment of software to current hardware and vice versa), hardware growth accounts for more of the progress over time than human labor input growth.

So if you're going to use an AI production function for tech forecasting based on inputs (which do relatively OK by the standards tech forecasting), it's best to use all of compute, labor, and time, but it makes sense for compute to have pride of place and take in more modeling effort and attention, since it's the biggest source of change (particularly when including software gains  downstream of hardware technology and expenditures).

My summary of what you're defending here: because hardware progress is (according to you) the major driver of AI innovation, then we should invest a lot of our forecasting resources into forecasting it, and we should leverage it as the strongest source of evidence available for thinking about AGI timelines.

I feel like this is not in contradiction with what Yudkowsky wrote in this post. I doubt he agrees that additional compute alone is the main driver of progress (after all, the Bitter Lesson mostly tells you that insights and innovations leveraging more compute will beat hardcoded ones), but insofar as he expects us to have next to no knowledge of how to build AGI until around 2 years before it is done (and then only for those with the Thielian secret), compute is indeed the next best thing we have to estimate timelines.

Yet Yudkowsky's point is that being the next best thing doesn't mean it's any good.

Thinking about hardware has a lot of helpful implications for constraining timelines:

  • Evolutionary anchors, combined with paleontological and other information (if you're worried about Rare Earth miracles), mostly cut off extremely high input estimates for AGI development, like Robin Hanson's, and we can say from known human advantages relative to evolution that credence should be suppressed some distance short of that (moreso with more software progress)

Evolution being an upper bound makes sense, and I think Yudkowsky agrees. But it's an upper bound on the whole human optimization process, and the search space of the human optimization is tricky to think about. I see much of Yudkowsky's criticisms of biological estimates here as saying "this biological anchor doesn't express the cost of evolution's optimization in terms of human optimization, but instead goes for a proxy which doesn't tell you anything".

So if someone captured both evolution and human optimization in the same search space, and found an upper bound on the cost (in terms of optimization power) that evolution spent to find humans, then I expect Yudkowsky would agree that this is an upper bound on the optimization power that humans will use. But he might still retort that translating optimization power into compute is not obvious.

  • You should have lower a priori credence in smaller-than-insect brains yielding AGI than more middle of the range compute budgets

Okay, I'm going to propose what I think is the chain of arguments you're using here:

  • Currently, we can train what sounds like the compute equivalent of insect brains, and yet we don't have AGI. Hence we're not currently able to build AGI with "smaller-than-insect brains", which means AGI is less likely to be created with "smaller-than-insect brains".
    • I agree that we don't have AGI
    • The "compute equivalent" stuff is difficult, as I mentioned above, but I don't think this is the main issue here.
    • Going from "we don't know how to do that now" to "we should expect that it is not how we will do it" doesn't really work IMO. As Yudkowsky points out, the requirements for AGI are constantly dropping, and maybe a new insight will turn out to make smaller neural nets far more powerful, before the bigger models reach AGI
  • Evolution created insect-sized brains and they were clearly not AGI, so we have evidence against AGI with that amount of resources.
    • Here the fact that evolution is far worse an optimizer than humans breaks most of the connection between evolution creating insects and humans creating AGI. Evolution merely shows that insects can be made with insect-sized brains, not that AGI cannot be extracted by better use of the same resources.
    • From my perspective this is exactly what Yudkowsky is arguing against in this post: it's not because you know of a bunch of paths through search space that you know what a cleverer optimizer could find. There are ways to use a bunch of paths as data to understand the search space, but you then need either to argue that they are somehow dense in the search space, or that the sort of paths you're interested in look similar to this bunch of paths. And at the moment, I don't see an argument in any of these forms.
  • By default we should expect AGI to have a decent minimal size because of its complexity, hence smaller models have a lower credence.
    • Agree with the principle (it sounds improbable that AGI will be made in 10 lines of LISP), but the threshold is where most of the difficulty lies: how much is too little? A hundred neurons sounds clearly too small, but when you reach insect-sized brains, it's not obvious (at least to me) that better use of resources couldn't bring you most of the way to AGI.
    • (I wonder if there's an availability bias here where the only good models we have nowadays are huge, hence we expect that AGI must be a huge model?)
  • It lets you see you should concentrate probability mass in the next decade or so because of the rapid scaleup of compute investment  (with a supporting argument from the increased growth of AI R&D effort) covering a substantial share of the orders of magnitude between where we are and levels that we should expect are overkill

I think this is where the crux of whether the current paradigm can just scale matters a lot. The main point Yudkowsky uses in the dialogue to argue against your concentration of probability mass is that he doesn't agree that deep learning scales that way to AGI. In his view (on which I'm not clear yet, and which I haven't seen from anyone who actually studies LMs), the increase in performance will break down before then. And as such, the concentration of probability mass shouldn't happen, because the fact that you can reach the anchor is irrelevant if we don't know a way to turn compute into AGI (according to Yudkowsky's view).

  • It gets you likely AGI this century, and on the closer part of that, with a pretty flat prior over orders of magnitude of inputs that will go into success

Here too, it depends on transforming the optimization power of evolution into compute and other requirements, and then knowing how this compute is supposed to get transformed into efficiency and AGI. (That being said, I think Yudkowsky agrees with the conclusion, just not that specific way of reaching it.)

  • It suggests lower annual probability later on if Moore's Law and friends are dead, with stagnant inputs to AI

Not clear to me what you mean here (it might be clearer with the right link to the section of Cotra's report about this). But note that based on Yudkowsky's model in this post, the cost to make AGI should continue to drop as long as the world doesn't end, which creates a weird situation where the probability of AGI keeps increasing with time. (Not sure how to turn that into a distribution though...)
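For what it's worth, one standard way to turn an ever-increasing per-year probability into a proper distribution is a discrete hazard model. A minimal sketch, where the 2% starting hazard and 10%/year growth rate are purely illustrative numbers of mine, not anyone's actual estimate:

```python
def arrival_distribution(hazards):
    """P(AGI arrives in year t) = h(t) * P(no arrival in any earlier year)."""
    survival, dist = 1.0, []
    for h in hazards:
        dist.append(survival * h)   # probability it arrives exactly this year
        survival *= 1.0 - h         # probability it survives to the next year
    return dist

# A hazard that grows every year, as the cost of building AGI keeps dropping:
hazards = [min(0.02 * 1.1 ** t, 1.0) for t in range(80)]
dist = arrival_distribution(hazards)
```

Over a long enough horizon the hazard hits 100% and `dist` sums to 1, so an always-increasing per-year probability still yields a genuine distribution over arrival years.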

These are all useful things highlighted by Ajeya's model, and by earlier work like Moravec's. In particular, I think Moravec's forecasting methods are looking pretty good, given the difficulty of the problem. He and Kurzweil (like the computing industry generally)  were surprised by the death of Dennard scaling and general price-performance of computing growth slowing, and we're definitely years behind his forecasts in AI capability, but we are seeing a very compute-intensive AI boom in the right region of compute space. Moravec also did anticipate it would take a lot more compute than one lifetime run to get to AGI. He suggested human-level AGI would be in the vicinity of human-like compute quantities being cheap and available for R&D. This old discussion is flawed, but makes me feel the dialogue is straw-manning Moravec to some extent.

This is in the same spirit as a bunch of comments on this post, and I feel like it's missing the point of the post. It's not about Moravec's estimate being wildly wrong; it's about the unsoundness of the methods by which Moravec reaches his conclusion. Your analysis doesn't give such evidence of Moravec's predictive accuracy that we should expect he has a really strong method that just looks bad to Yudkowsky but is actually sound. And I feel points like that don't engage at all with the cruxes (the soundness of the method); instead they mostly correct a "too harsh judgment" by Yudkowsky, without invalidating his points.

Ajeya's model puts most of the modeling work on hardware, but it is intentionally expressive enough to let you represent a lot of different views about software research progress, you just have to contribute more of that yourself when adjusting weights on the different scenarios, or effective software contribution year by year. You can even represent a breakdown of the expectation that software and hardware significantly trade off over time, and very specific accounts of the AI software landscape and development paths. Regardless modeling the most importantly changing input to AGI is useful, and I think this dialogue misleads with respect to that by equivocating between hardware not being the only contributing factor and not being an extremely important to dominant driver of progress.

Hum, my impression here is that Yudkowsky is actually arguing that he is modeling AGI timelines that way; and if you don't add unwarranted assumptions and don't misuse the analogies to biological anchors, then you get his model, which is completely unable to give the sort of answer Cotra's model is outputting.

Or said differently, I expect that Yudkowsky thinks that if you reason correctly and only use actual evidence instead of unsound lines of reasoning, you get his model; but doing that in the explicit context of biological anchors is like trying to quit sugar in a sweetshop: the whole setting just makes it far harder. And given that he expects to get the right constraints on models without the biological-anchors stuff, it's completely redundant AND unhelpful.

Comment by adamShimi on Is "gears-level" just a synonym for "mechanistic"? · 2021-12-13T18:20:08.705Z · LW · GW

Also, I feel like the mental picture of gears turning is far more telling than the picture of a "mechanism".

Comment by adamShimi on Is "gears-level" just a synonym for "mechanistic"? · 2021-12-13T18:19:06.496Z · LW · GW

I'm very confused by this comment, because LW and AF posts talk about causation all the time. They're definitely a place where I expect "causal" and "causation" to appear in 90% of the posts I read, more than once.

Comment by adamShimi on Is "gears-level" just a synonym for "mechanistic"? · 2021-12-13T18:17:51.812Z · LW · GW

I think the best pointer for gears-level as it is used nowadays is John Wentworth's post Gears vs Behavior. And in this summary comment, he explicitly says that the definition is the opposite of a black box, and that gears-level vs black box is a binary distinction.

Gears-level models are the opposite of black-box models.


One important corollary to this (from a related comment): gears/no gears is a binary distinction, not a sliding scale.

As for the original question, I feel that "mechanistic" can be applied to models that are just one neat equation with no moving parts, such that you don't know how to alter the equation when the underlying causal process changes.

If mechanistic indeed means the opposite of black-box, then in principle we could use it to replace "gears-level model".

Comment by adamShimi on The Plan · 2021-12-13T18:03:04.221Z · LW · GW

If I imagine what my work would look like if I started out expecting reflection to be the taut constraint, then it does seem like I'd follow a path a lot more like MIRI's. So yeah, this fits.

One thing I'm still not clear about in this thread is whether you (John) would feel that progress has been made on the theory of agency if all the problems MIRI is working on were instantaneously solved. Because there's a difference between saying "this is the obvious first step if you believe reflection is the taut constraint" and "solving this problem would help significantly even if reflection wasn't the taut constraint".

Comment by adamShimi on The Plan · 2021-12-12T01:07:49.238Z · LW · GW

I think having that post on the AF would be very good. ;)

Comment by adamShimi on Interpreting Yudkowsky on Deep vs Shallow Knowledge · 2021-12-08T10:44:00.028Z · LW · GW

Without security mindset, one tends to think an unstoppable AI is a-priori likely to do what humans want, since humans built it. With security mindset, one sees that most AIs are nukes that wreak havoc on human values, and getting them to do what humans want is analogous to building crash-proof software for a space probe, except the whole human race only gets to launch one probe and it goes to whoever launches it first.

I think this is a really shallow argument that enormously undersells the actual reasons for caring about alignment. We have actual arguments for why an unstoppable AI is not likely to do what humans want, and they don't need the security mindset at all. The basic outline is something like:

  • Since we have historically had a lot of trouble writing down programs that solve more complex and general problems like language or image recognition (and successes came through ML), future AI and AGI will probably be the sort to "fill in the gaps" in our requests/specifications
  • For almost everything we could ask an AI to accomplish, there are actions that would help it achieve the goal but that would be bad and counterintuitive from a previous-technology standpoint (the famous convergent subgoals)
  • Precisely specifying what we want without relying on common sense is incredibly hard, and doesn't survive strong optimization (Goodhart's law)
  • And competence by itself doesn't solve the problem, because understanding what humans want doesn't mean caring about it (Orthogonality thesis).

This line of reasoning (which is not new by any means; it's basically straight out of Bostrom and early Yudkowsky's writing) justifies the security mindset for AGI and alignment. Not the other way around.

(And historically, Yudkowsky wanted to build AGI before he found out about these points, which turned him into the biggest user — but by no means the only one — of the security mindset in alignment)

Comment by adamShimi on Interpreting Yudkowsky on Deep vs Shallow Knowledge · 2021-12-08T10:34:56.847Z · LW · GW

Oh no, confusion is going foom!

Joke aside, I feel less confused after your clarifications. I think the issue is that it wasn't clear at all to me that you were talking about the whole "interpreting Yudkowsky" schtick as the icky feeling.

Now it makes sense, and I definitely agree with you that there are enormous parallels with Biblical analysis. Yudkowsky's writing is very biblical in ways IMO (the parables and the dialogues), and in general far more literary than 99% of the rat writing out there. I'm not surprised he found HPMOR easy to write; his approach to almost everything seems like a mix of literary fiction and science-fiction tropes/ideas.

Which is IMO why this whole interpretation business is so important. More and more, I think I'm understanding why so many people get frustrated with Yudkowsky's writing and points: because they come expecting essays with arguments and a central point, and instead they get a literary text that requires strong interpretation before revealing what it means. I expect your icky feeling comes from the same place.

(Note that I think Yudkowsky is not doing that to be obscure, but for a mix of "it's easier for him" and "he believes that you only learn and internalize the sort of knowledge he's trying to convey through this interpretative labor, if not on the world itself, at least on his text".)

Also, as a clarification: I'm not comparing the content of literary fiction or the Bible to Yudkowsky's writing. Generally with analyses of the former, you either get mysterious answers or platitudes; more and more with Yudkowsky I'm getting what I feel are deep insights (and his feedback on this post makes me think that I'm not off the mark by much for some of those).

Comment by adamShimi on Interpreting Yudkowsky on Deep vs Shallow Knowledge · 2021-12-08T10:25:12.115Z · LW · GW

I think we disagree on Yudkowsky's conclusion: his point IMO is that Einstein was able to reduce the search space a lot. He overemphasizes for effect (and because it's more impressive to have someone who guesses right directly through these methods), but that doesn't change the fact that Einstein reduced the search space a lot (which you seem to agree with).

Many of the relevant posts I quoted talk about how the mechanisms of Science are fundamentally incapable of doing that, because they don't specify any constraint on hypotheses except that they must be falsifiable. Your point seems to be that in the end, Einstein still used the sort of experimental data and methods underlying traditional Science, and I tend to agree. But the mere fact that he was able to get the right answer out of millions of possible formulations by checking a couple of numbers should tell you that there was a massive hypothesis-space-reducing step before.

Comment by adamShimi on On Solving Problems Before They Appear: The Weird Epistemologies of Alignment · 2021-12-08T10:06:59.998Z · LW · GW

Thanks for explaining your point in more details.

The type of formal methods use I am referring to was popular in the late 1970s to at least the 1990s, not sure how popular it is now. It can be summarized by Dijkstra's slogan of “designing proof and program hand in hand”.

This approach stands in complete opposition to the approach of using a theorem prover to verify existing code after it was written.

In simple cases, the designing-hand-in-hand approach works as follows: you start with a mathematical specification of the properties you want the program to have. Then you use this specification to guide both the writing of the code and the writing of the correctness proof for the code at the same time. The writing of the next line in the proof will often tell you exactly which next lines of code you need to add. This often leads to code which much more clearly expresses what is going on. The whole code-and-proof writing process can often be done without even leveraging a theorem prover.

In more complex cases, you first have to develop the mathematical language to write the specification in. These complex cases are of course the more fun and interesting cases. AGI safety is one such more complex case.
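To make the hand-in-hand style concrete, here is a toy sketch of my own (not from the comment above, and far simpler than anything alignment-relevant): deriving an integer square root from its specification. The invariant and postcondition are written down first, and each line of code is the one the proof obligations force you to write.

```python
# Dijkstra-style "proof and program hand in hand", on a toy problem.
# Specification: isqrt(n) returns r such that r*r <= n < (r+1)*(r+1).

def isqrt(n: int) -> int:
    assert n >= 0  # precondition
    r = 0
    # Invariant: r*r <= n. Established trivially, since 0*0 <= n.
    while (r + 1) * (r + 1) <= n:
        # The guard guarantees the invariant survives incrementing r,
        # so the proof dictates the only line of code we may write here:
        r += 1
    # On exit, invariant + negated guard yield the postcondition:
    assert r * r <= n < (r + 1) * (r + 1)
    return r

print(isqrt(15))  # 3
```

The point of the exercise is that the loop body was never guessed and tested: it is the unique statement that preserves the invariant while making progress, which is the sense in which the proof writes the code.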

I'm also aware of this approach; in my opinion and experience, it is just about as intractable as the "check afterwards" one. You still need the specification, which is one of my big arguments against the usefulness of formal methods for alignment. Also, such methods followed strictly are hardly "faster" than using a theorem prover: they are about emulating the sort of thing a theorem prover does while writing the program! Actually, I wonder if this is not just equivalent to writing the proof in Coq and generating the corresponding program. (To be fair, I expect you'll gain an order of magnitude over Coq proofs because you don't have to be as anal about every detail; but an order of magnitude or two better than Coq proofs is still incredibly slow when compared to the speed of software development, let alone ML!)

You mention distributed computing. This is one area where the hand-in-hand style of formal methods use is particularly useful, because the method of intuitive trial and error sucks so much at writing correct distributed code. Hand-in-hand methods are useful if you want to design distributed systems protocols that are proven to be deadlock-free, or have related useful properties. (The Owicki-Gries method was a popular generic approach for this type of thing in the 1990s, not sure if it is still mentioned often.)

One really important detail that is missing from your description is the incredible simplicity of distributed programs (where I actually mean distributed, not the concurrency you seem to imply), and yet they are tremendously difficult to certify and/or design hand in hand with a proof. The most advanced distributed algorithms in use are probably simpler, in terms of code and what happens, than the second or third project someone has to code in their first software engineering class. So I hardly see why I should generalize with any confidence from "we manage to design through proofs programs that are simpler than literally what any non-distributed-computing programmer writes" to "we're going to design through proofs the most advanced and complex program that ever existed, orders of magnitude more complex than the most complex current systems".

The key to progress is that you specify various useful 'safety properties' only, which you then aim to construct/prove. Small steps! It would be pretty useless to try to specify the sum total of all current and future human goals and morality in one single step.

Sure; my point is that useful safety properties are incredibly hard to specify at that level. I'm of course interested in examples to the contrary. ;)

This is of course the usual Very American complaint about formal methods.

Actually, this is a complaint made by basically every researcher and big shot in academic formal methods that I've ever heard talk. They all want to defend formal methods, but they all have to acknowledge that:

  • Basically no one asks for formal checks or proof-based algorithms until they really have to, because it costs them a shitton of money and time.
  • And even after they get the formal guarantees, no one changes their mind and thinks it was a good investment, because it so rarely makes the product better in any externally observable way.

However, if I compare AGI to airplanes, I can note that, compared to airplane disasters, potential AGI disasters have an even bigger worst-case risk of producing mass death. So by implication, everybody should consider it worth the expense to apply formal methods to AGI risk management. In this context, 2 person-years to construct a useful safety proof should be considered money well spent.

And the big problem with AGI is that a lot of the people running for it don't understand that it involves risks of death and worse.

You're missing my point: most of the people developing AGI don't believe in any potential AGI disasters. They don't have the availability of plane crashes and regulations to make them go "okay, we probably have to check this". If we had such formal-methods-backed approaches, I could see DeepMind potentially using them, maybe OpenAI, but not Chinese labs or all the competitors we don't know about. And the fact that it would slow down whoever uses it by 2 years would be a strong incentive for even the aligned places not to use it, lest they lose the race and risk having an even more badly unaligned AGI released into the world.

Comment by adamShimi on On Solving Problems Before They Appear: The Weird Epistemologies of Alignment · 2021-12-06T11:15:28.566Z · LW · GW

Thanks for the thoughtful comment!

You fail to mention the more important engineering strategy: one which does not rely on tinkering, but instead on logical reasoning and math to chart a straight line to your goal.

To use the obvious example, modern engineering does not design bridges by using the staple strategy of tinkering, it will use applied math and materials science to create and validate the bridge design.

From this point of view, the main difference between 'science' and 'engineering' is that science tries to understand nature: it seeks to understand existing physical laws and evolved systems. But engineering builds systems from the ground up. The whole point of this is that the engineer can route around creating the kind of hard-to-analyse complexity you often encounter in evolved systems. Engineering is a 'constructive science' not an 'observational science' or 'experimental science'.

That's a fair point. My intuition here is that this more advanced epistemic strategy requires the science, and that the one I described is the one people actually used for thousands of years when they had to design bridges without materials science and decent physics. So in a sense, the modern engineering epistemic strategy sounds even less feasible for alignment.

As for your defense of formal methods... let's start by saying that I have a good idea of what you're talking about: I studied engineering in Toulouse, which is one of the places with the most formal methods in France, which is itself a big formal methods hotspot. I took classes and lab sessions on all the classical formal methods techniques. I did my PhD in a formal methods lab, going to a bunch of seminars and conferences. And even if my PhD was on the theory of distributed computing, which takes a lot from computability and complexity theory, it's also the field that spurred the invention of formal methods and the analysis of algorithms, as this is pretty much the only way of creating distributed algorithms that work.

All of that to clarify that I'm talking from some amount of expertise here. And my reaction is that formal methods have two big problems: they're abysmally slow, and they're abysmally weak, at least compared to the task of verifying an AGI.

On the slow part, it took two years for researchers to formally verify CompCert, a C compiler that actually can't compile the whole C specification. It's still an amazing accomplishment, but that gives you an idea of the timescale of formal verification.

You could of course answer that CompCert was verified "by hand" using proof assistants, and that we can leverage more automatic methods like model checking or abstract interpretation. But there's a reason why none of these approaches was used to certify a whole compiler: they're clearly not up to the task currently.

Which brings us to the weak part. The most relevant thing here is that you need a specification to apply formal methods. And providing a specification for alignment would be on par with proving P != NP. I'm not saying it's impossible, just that it's clearly so hard that you need to show a lot of evidence to convince people that you did it. (I'm interested if you feel you have evidence for that ^^)

And even then, the complexity of the specification would without a doubt far exceed what can be checked using current methods.

I also see that you're saying formal methods help you design stuff. That sounds very wrong to me. Sure, you can use formal methods if you want to test your spec, but it's so costly with modern software that basically no one considers it worth it except in cases where there's a risk of death (think stuff like airplanes). And the big problem with AGI is that a lot of the people running for it don't understand that it involves risks of death and worse. So a technology that is completely useless for making their model stronger, and only serves to catch errors that lead to death, has no value for people who don't believe in the possibility of critical errors in their software.

But maybe the last and final point is that, sure, it will require a lot of work and time, but doing it through formal methods will provide guarantees that we can hardly get from any other source. I sympathize with that, but like almost all researchers here, my answer is: "Deep Learning seems to be getting there far sooner than formal methods, so I'm going to deal with aligning the kind of systems we're likely to get in 15~30 years, not the ones that we could at best get in a century or two."

(Sorry if you feel I'm strawmanning you, I have quite strong opinions on the complete irrelevance of formal methods for alignment. ^^)

Comment by adamShimi on Interpreting Yudkowsky on Deep vs Shallow Knowledge · 2021-12-06T10:44:28.945Z · LW · GW

I find myself confused by this comment. I'm going to try voicing this confusion as precisely as possible, so you can hopefully clarify it for me.

I am confused that you get an icky feeling from basically the most uncontroversial part of my post and Yudkowsky's point. The part you're quoting is just saying that Yudkowsky cares more about anticipation-constraining than predictions. Of course, predictions are a particular type of very strong anticipation-constraining, but saying "this is impossible" is not wishy-washy fake specification: if the impossible thing is done, that invalidates your hypothesis. So "no perpetual motion machines" is definitely anticipation-constraining in that sense, and can readily be falsified.

I am confused because this whole anticipation-constraining, especially saying what can't be done, is very accepted in traditional Science. Yudkowsky says that Science Isn't Strict Enough because it elevates any type of anticipation-constraining hypothesis to the rank of "acceptable hypothesis": if it's wrong, it will eventually be falsified.

I am confused because you keep comparing deep knowledge with the sort of conclusions that can always be reinterpreted from new evidence, when my post goes into a lot of detail about how Yudkowsky writes about the anticipation-constraining aspect and how to be stricter with your hypotheses, not just allowing any non-disproved hypothesis the same level of credibility.

Also I feel that I should link to this post, where Yudkowsky argues that the whole "Religion is non-falsifiable" idea is actually a modern invention that it doesn't make sense to retrofit into the past.

Comment by adamShimi on Interpreting Yudkowsky on Deep vs Shallow Knowledge · 2021-12-06T10:31:45.047Z · LW · GW

Thanks for the comment, and glad it helped you. :)

  • outside vs. inside view - I've thought about this before but hadn't read this clear a description of the differences and tradeoffs before (still catching up on Eliezer's old writings)

My inner Daniel Kokotajlo is very emphatically pointing to that post about all the misuses of the term "outside view". Actually, Daniel commented on my draft that he definitely didn't think that Hanson was using the real outside view, AKA reference class forecasting, in the FOOM debate, and that, as Yudkowsky points out, reference class forecasting just doesn't seem to work for AGI prediction and alignment.

I just hope all his smack talking doesn't turn off/away talented people coming to lend a hand on alignment. I expect a lot of people on this (AF) forum found it like me after reading all Open Phil and 80,000 Hours' convincing writing about the urgency of solving the AI alignment problem. It seems silly to have those orgs working hard to recruit people to help out, only to have them come over here and find one of the leading thinkers in the community going on frequent tirades about how much EAs suck, even though he doesn't know most of us. Not to mention folks like Paul and Richard who have been taking his heat directly in these marathon discussions!

Yeah, I definitely think there are and will be bad consequences. My point is not that I think this is a good idea, just that I understand better where Yudkowsky is coming from, and can empathize more with his frustration.

I feel the most dangerous aspect of the smack talking is that it makes people not want to listen to him, and just see him as a smack talker with nothing to add. That was my reaction when reading the first discussions, and I had to explicitly notice that my brain was going from "This guy is annoying me so much" to "He's wrong", which is basically status-fueled "deduction". So I went looking for more. But I completely understand the people, especially those who are doing a lot of work in alignment, who react with "I'm not going to stop my valuable work to try to understand someone who's just calling me a fool and is unable to voice their arguments in a way I understand."

Comment by adamShimi on Interpreting Yudkowsky on Deep vs Shallow Knowledge · 2021-12-06T10:23:37.912Z · LW · GW

Thanks for the kind and thoughtful comment!

not true.  In the comments to Einstein's Speed, Scott Aaronson explains the real story: Einstein spent over a year going down a blind alley, and was drawn back by -- among other things -- his inability to make his calculations fit the observation of Mercury's perihelion motion.  Einstein was able to reason his way from a large hypothesis space to a small one, but not to actually get the right answer.

(and of course, in physics you get a lot of experimental data for free.  If you're working on a theory of gravity and it predicts that things should fall away from each other, you can tell right away that you've gone wrong without having to do any new experiments.  In AI safety we are not so blessed.)

That's a really good point. I didn't go into that debate in the post (because I tried not to criticize Yudkowsky, and also because the post is already way too long), but my take on this is: Yudkowsky probably overstates the case, but that doesn't mean he's wrong about the relevance to Einstein's work of the constraints and armchair reasoning (even if the armchair reasoning was building on more empirical evidence than Yudkowsky originally pointed out). As you say, Einstein apparently did reduce the search space significantly: he just failed to find exactly what he wanted in the reduced space directly.

Comment by adamShimi on Interpreting Yudkowsky on Deep vs Shallow Knowledge · 2021-12-06T10:04:37.852Z · LW · GW

Yes, I do quote the security mindset in the post.

I feel you're quite overstating the ability of the security mindset to show FOOM though. The reason it's not presented as a direct consequence of a security mindset is... because it's not one?

Like, once you are convinced of the strong possibility and unavoidability of AGI and superintelligence (maybe through FOOM arguments), then the security mindset actually helps you, and combining it with deep knowledge (like the Orthogonality Thesis) lets you find a lot more ways of breaking "humanity's security". But the security mindset applied without arguments for AGI doesn't let you postulate AGI, for the same reason that the security mindset without arguments about mind-reading doesn't let you postulate that hackers might read the password in your mind.

Comment by adamShimi on Interpreting Yudkowsky on Deep vs Shallow Knowledge · 2021-12-06T09:54:44.872Z · LW · GW

This is basically the thing that bothered me about the debates. Your solution seems to be to analogize, Einstein:relativity::Yudkowsky:alignment is basically hopeless. But in the debates, M. Yudkowsky over and over says, "You can't understand until you've done the homework, and I have, and you haven't, and I can't tell you what the homework is." It's a wall of text that can be reduced to, "Trust me."

He might be right about alignment, but under the epistemic standards he popularized, if I update in the direction of his view, the strength of the update must be limited to "M. Yudkowsky was right about some of these things in the past and seems pretty smart and to have thought a lot about this stuff, but even Einstein was mistaken about spooky action at a distance, or maybe he was right and we haven't figured it out yet, but, hey, quantum entanglement seems pretty real." In many ways, science just is publishing the homework so people can poke holes in it.

I definitely feel you: that reaction was my big reason for taking so much time rereading his writing and penning this novel-length post.

The first thing I want to add is that after looking for discussions of this in the Sequences, they were there. So the uncharitable explanation of "he's hiding the homework/explanation because he knows he's wrong or doesn't have enough evidence" doesn't really work. (I don't think you're defending this, but it definitely crossed my mind and that of others I talked to.) I honestly believe Yudkowsky is saying in good faith that he has found deep knowledge and that he doesn't know how to share it in a way he hasn't already tried in his 13 years of writing about it.

The second thing is that I feel my post brings together enough bits of Yudkowsky's explanations of deep knowledge that we have at least a partial handle on how to check it? Quoting back my conclusion:

Yudkowsky sees deep knowledge as highly compressed causal explanations of “what sort of hypothesis ends up being right”. The compression means that we can rederive the successful hypotheses and theories from the causal explanation. Finally, such deep knowledge translates into partial constraints on hypothesis space, which focus the search by pointing out what cannot work.

So the check requires us to understand what sort of successful hypotheses he is compressing, whether it really is a compression, i.e. a causal underlying process that can be used to rederive these hypotheses, and whether the resulting constraint actually cuts a decent chunk of hypothesis space when applied to other problems.

That's definitely a lot of work, and I can understand if people don't want to invest the time there. But having a potential check and deciding "I don't think this is a good time investment" seems different to me from saying that there's no way to check the deep knowledge.


If Einstein came to you in 1906 (after special relativity) and stated the conclusion of the special relativity paper, and when you asked him how he knew, he said, "You can't understand until you've done the homework, and I have, and you haven't," which is all true from my experience studying the equations, "and I can't tell you what the homework is," the strength of your update would be similarly limited. 

I recommend reading Einstein's Speed and Einstein's Superpowers, which are the two posts where Yudkowsky tries to point out that if you look for it, it's possible to find where Einstein was coming from and the sort of deep knowledge he used. I agree it would be easier if the person leveraging the deep knowledge could state it succinctly enough that we could get it, but I also acknowledge that these sorts of fundamental principles from which other things derive are just plain hard to express. And even then, you need to do the homework.

(My disagreement with Yudkowsky here is that he seems to believe mostly in providing a lot of training data and examples so that people can see the deep knowledge for themselves, whereas I expect that most smart people would find it far easier to have a sort of pointer to the deep knowledge and what it is good for, and then go through a lot of examples.)

Comment by adamShimi on Biology-Inspired AGI Timelines: The Trick That Never Works · 2021-12-06T09:37:13.942Z · LW · GW

I endorse most of this comment; this "core thing" idea is exactly what I tried to understand when writing my recent post on deep knowledge according to Yudkowsky.

This post, and this whole series of posts, feels like its primary function is training data to use to produce an Inner Eliezer that has access to the core thing, or even better to know the core thing in a fully integrated way. And maybe a lot of Eliezer's other communications is kind of also trying to be similar training data, no matter the superficial domain it is in or how deliberate that is.

Yeah, that sounds right. I feel like Yudkowsky mostly writes training data, and feels that explaining as precisely as he can the thing he's talking about never works. I agree with him that it can't work without the reader doing a bunch of work (what he calls homework), but I expect (from my personal experience) that doing the work while you have an outline of the thing is significantly easier. It's easier to trust that there's something valuable at the end of the tunnel when you have a half-decent description.

The condescension is important information to help a reader figure out what is producing the outputs, and hiding it would make the task of 'extract the key insights' harder.

Here though I feel like you're overinterpreting. In his older writing, Yudkowsky is actually quite careful not to directly insult people or be condescending. I'm not saying he never does it, but he tones it down a lot compared to what's happening in this recent dialogue. I think that a better explanation is simply that he's desperate, and has very little hope of being able to convey what he means, because he's been doing that for 13 years and no one caught on.

Maybe point 8 is also part of the explanation: doing this non-condescendingly sounds like far more work for him, and since he doesn't expect it to work, he doesn't take on that extra cost for little expected reward. 

My Inner Eliezer says that writing this post without the condescension, or making it shorter, would be much much more effort for Eliezer to write. To the extent such a thing can be written, someone else has to write that version. Also, it's kind of text in several places.

Comment by adamShimi on Why Study Physics? · 2021-11-28T23:43:22.737Z · LW · GW

High dimensional world: to find something as useful as e.g. Fourier methods by brute-force guess-and-check would require an exponentially massive amount of search, and is unlikely to have ever happened at all. Therefore we should expect that it was produced by a method which systematically produces true/useful things more often than random chance, not just by guess-and-check with random guessing. (Einstein's Arrogance is saying something similar.)


I don't think this contradicts the hypothesis that "Physicists course-correct by regularly checking their answers". After all, the reason Fourier methods and other tricks kept being used is that they somehow worked a lot of the time. Similarly, I expect (maybe wrongly) that there was a bunch of initial fiddling before they got the heuristics to work decently. If you can't check your answer, the process of refinement that these ideas went through might be harder to replicate.

Physicists have a track record of successfully applying physics-like methods in other fields (biology, economics, psychology, etc). This is not symmetric - i.e. we don't see lots of biologists applying biology methods to physics, the way we see physicists applying physics methods to biology. We also don't see this sort of generalization between most other field-pairs - e.g. we don't see lots of biologists in economics, or vice versa.

The second point sounds stronger than the first, because the first can be explained by the fact that biological systems (for example) are made of physical elements, but not the other way around. So you should expect that biology has not that much to say about physics. Still, one could say that it's not obvious physics would have relevant things to say about biology, because of the complexity and the abstraction involved.

Relatedly: I once heard a biologist joke that physicists are like old western gunslingers. Every now and then, a gang of them rides into your field, shoots holes in all your theories, and then rides off into the sunset. Despite the biologist's grousing, I would indeed call that sort of thing successful generalization of the methods of physics.

This makes me wonder whether the most important skill of physicists is to have strong enough generators to provide useful criticism in a wide range of fields?

Comment by adamShimi on Why Study Physics? · 2021-11-28T23:22:34.420Z · LW · GW

Could you write a list of physicists who have such a "gift"? Might be useful for analyzing that specific skill.

Comment by adamShimi on Why Study Physics? · 2021-11-28T23:19:19.283Z · LW · GW

I'm wondering about the different types of intuitions in physics and mathematics.

What I remember from prepa (two years after high school where we did the full undergraduate program of maths and physics) was that some people had maths intuition (like me) and some had physics intuition (not me). That's how I recall it, but thinking back on it, there were different types of maths intuition, which correlated very differently with physics intuition. I had algebra intuition, which means I could often see the way to go about algebraic problems, whereas I didn't have analysis intuition, which was about variations and measures and dynamics. And analysis intuition correlated strongly with physics intuition.

It's also interesting that all your examples of physicists using informal mathematical reasoning successfully ended up being formalized through analysis.

This observation makes me wonder if there are different forms of "informal mathematical reasoning" underlying these intuitions, and how relevant each one is to alignment.

  • An algebra/discrete maths intuition, which is about how to combine parts into bigger stuff and, conversely, how to split stuff into parts, as well as the underlying structure and stuff like generators. (Note that "the deep theory of addition" discussed recently probably belongs here.)
  • An analysis/physics intuition which is about movement and how a system reacts to different changes.

Also the distinction becomes fuzzy because there are a lot of tricks which allow one to use one type of intuition to study the objects of the other type (things like analytic methods and inequalities in discrete maths, let's say, or algebraic geometry). Although maybe this is just evidence that people tend to have one sort of intuition, and want to find ways of applying it to everything.

Comment by adamShimi on Yudkowsky and Christiano discuss "Takeoff Speeds" · 2021-11-23T15:49:11.648Z · LW · GW

I grimly predict that the effect of this dialogue on the community will be polarization: People who didn't like Yudkowsky and/or his views will like him / his views less, and the gap between them and Yud-fans will grow (more than it shrinks due to the effect of increased dialogue). I say this because IMO Yudkowsky comes across as angry and uncharitable in various parts of this dialogue, and also I think it was kinda a slog to get through & it doesn't seem like much intellectual progress was made here.

Strongly agree with that.

Since you agree with Yudkowsky, do you think you could steelman his position?

Comment by adamShimi on LCDT, A Myopic Decision Theory · 2021-11-22T13:05:38.211Z · LW · GW

Yeah, that's a subtle point.

Here we're stressing the difference between the simulator's action and the simulation's action (HCH's or Evan's, in your example). Obviously, if the simulation is non-myopic, then the simulation's action will depend on the long-term consequences of that action (for the goals of the simulation). But the simulator itself only cares about answering the question "what would the simulation do next?". Once again, that might mean that the simulator will think about the long-term consequences of the simulation's action for the simulation's goals, but the simulator doesn't have that goal: such reasoning is completely instrumental to its task of simulation. And more generally, the simulator isn't choosing its next action to make it easier to predict the future actions (like a predict-o-matic would do).

That might sound like nitpicking, but this means something important: the simulator itself has no reason to be deceptive. It might output actions (as its best guess of what the simulation would do) that are deceptive, but only if the simulation itself is deceptive.

What does that give us?

  • If we manage to point the simulation at something that is non-deceptive yet powerful, the myopic simulator will not introduce deception into the mix. Whereas doing IRL on the simulation and then optimizing for the reward would probably lead to Goodharting and deception because of mesa-optimizers.
    • Here Evan would probably say that HCH sounds like the right non-deceptive simulation; I'm less convinced that HCH will not be deceptive.
    • An obvious question is why not do imitation learning? Well, I expect (and I believe Evan expects too) that simulation is strictly more powerful than imitation, because it can make models of non-observed or ideal processes that we point it at.
  • If instead of having a single simulation, we have a simulator that can deal with a range of simulations (how some researchers and I are currently thinking about GPT-3 and LMs), then myopia lets you use the simulator to detect deception in the simulations, change between simulations, and test different options, in a way that a "deceptive agent acting like a simulator" would not allow (because it would tamper with your experiments).
    • A tangent, but there's also an argument that I'm writing up about why we should not expect models which simulate many different processes to be agents — spoiler: agents are bad at being simulators.
  • Even in the worst-case scenario where we make a simulator that simulates a deceptive agent, the simulator has no incentive to hide its "internal thoughts" about how the simulation works. That's a very small comfort, but it might make interpretability easier because there is no adversarial pressure against it.

Comment by adamShimi on Ngo and Yudkowsky on AI capability gains · 2021-11-22T00:53:31.747Z · LW · GW

Thanks for giving more details about your perspective.

Your comment is phrased as if the object-level refutations have been tried, while conveying the meta-level intuitions hasn't been tried. If anything, it's the opposite: the sequences (and to some extent HPMOR) are practically all content about how to think, whereas Yudkowsky hasn't written anywhere near as extensively on object-level AI safety.

It's not clear to me that the sequences and HPMOR are good pointers for this particular approach to theory building. I mean, I'm sure there are posts in the sequences that touch on it (Einstein's Arrogance is an example I already mentioned), but I expect that they only talk about it in passing and obliquely, and that such posts are spread all over the sequences. Plus, the fact that Yudkowsky said that there was a new subsequence to write leads me to believe that he doesn't think the information is clearly stated already.

So I don't think you can really present the current confusion as evidence that explaining how that kind of theory works wouldn't help, given that such an explanation isn't readily available in a form I or anyone reading this can access, AFAIK.

This has been valuable for community-building, but less so for making intellectual progress - because in almost all domains, the most important way to make progress is to grapple with many object-level problems, until you've developed very good intuitions for how those problems work. In the case of alignment, it's hard to learn things from grappling with most of these problems, because we don't have signals of when we're going in the right direction. Insofar as Eliezer has correct intuitions about when and why attempted solutions are wrong, those intuitions are important training data.

Completely agree that these intuitions are important training data. But your whole point in other comments is that we want to understand why we should expect these intuitions to differ from apparently bad/useless analogies between AGI and other stuff. And some explanation of where these intuitions come from could help with evaluating them, all the more because Yudkowsky has said that he could write a sequence about the process.

By contrast, trying to first agree on very high-level epistemological principles, and then do the object-level work, has a very poor track record. See how philosophy of science has done very little to improve how science works; and how reading the sequences doesn't improve people's object-level rationality very much.

This sounds to me like a strawman of my position (which might be my fault for not explaining it well).

  • First, I don't think explaining a methodology is a "very high-level epistemological principle", because it lets us concretely pick apart and criticize the methodology as a truth-finding method.
  • Second, the object-level work has already been done by Yudkowsky! I'm not saying that some outside-of-the-field epistemologist should ponder really hard about what would make sense for alignment without ever working on it concretely and then give us their teachings. Instead I'm pushing for a researcher who has built a coherent collection of intuitions and has thought about the epistemology of this process to share the latter, to help us understand the former.
  • A bit similar to my last point, I think the correct comparison here is not "philosophers of science outside the field helping the field", which happens but is rare as you say, but "scientists thinking about epistemology for very practical reasons". And given that the latter is, from my understanding, what started the scientific revolution, and was a common activity of all scientists until the big paradigms were established (in physics and biology at least) in the early 20th century, I would say there is a good track record here.
    (Note that this is more your specialty, so I would appreciate evidence that I'm wrong in my historical interpretation here)

I model you as having a strong tendency to abstract towards higher-level discussion of epistemology in order to understand things. (I also have a strong tendency to do this, but I think yours is significantly stronger than mine.)

Hum, I certainly like a lot of epistemic stuff, but I would say my tendencies to use epistemology are almost always grounded in concrete questions, like understanding why a given experiment tells us something relevant about what we're studying.

I also have to admit that I'm kind of confused, because I feel like you're consistently using the sort of epistemic discussion that I'm advocating for when discussing predictions and what gives us confidence in a theory, and yet you don't think it would be useful to have a similar-level model of the epistemology used by Yudkowsky to make the sort of judgment you're investigating?

I expect that there's just a strong clash of intuitions here, which would be hard to resolve. But one prompt which might be useful: why aren't epistemologists making breakthroughs in all sorts of other domains?

As I wrote about, I don't think this is a good prompt, because we're talking about scientists using epistemology to make sense of their own work there.

Here is an analogy I just thought of: I feel that in this discussion, you and Yudkowsky are talking about objects which have different types. So when you're asking questions about his model, there's a type mismatch. And when he's answering, having noticed the type mismatch, he's trying to find what to ascribe it to (his answer has been quite consistently modest epistemology, which I think is clearly incorrect). Tracking the confusion does tell you some information about the type mismatch, and is probably part of the process to resolve it. But having his best description of his type (given that your type is quite standardized) would make this process far faster, by helping you triangulate the differences.

Comment by adamShimi on Ngo and Yudkowsky on AI capability gains · 2021-11-21T01:07:09.910Z · LW · GW

I'm honestly confused by this answer.

Do you actually think that Yudkowsky having to correct everyone's object-level mistakes all the time is strictly more productive and will lead faster to the meat of the deconfusion than trying to state the underlying form of the argument and theory, and then adapting it to the object-level arguments and comments?

I have trouble understanding this, because for me the outcome of the first one is that no one gets it, he has to repeat himself all the time without making the debate progress, and this is one more giant hurdle for anyone trying to get into alignment and understand his position. It's unclear whether the alternative would solve all these problems (as you quote from the preface of the Sequences, learning the theory is often easier and less useful than practicing), but it still sounds like a powerful accelerator.

There is no dichotomy of "theory or practice", we probably need both here. And based on my own experience reading the discussion posts and the discussions I've seen around these posts, the object-level refutations have not been particularly useful forms of practice, even if they're better than nothing.

Comment by adamShimi on Ngo and Yudkowsky on AI capability gains · 2021-11-21T00:27:56.087Z · LW · GW

Good point, I hadn't thought about that one.

Still, I have to admit that my first reaction is that this particular sequence seems quite uniquely in a position to increase the quality of the debate and of alignment research singlehandedly. Of course, maybe I only feel that way because it's the only one of the long list that I know of. ^^

(Another possibility I just thought of is that maybe this subsequence requires a lot of new preliminary subsequences, such that the work is far larger than you could expect from reading the words "a subsequence". Still sounds like it would be really valuable though.)

Comment by adamShimi on Ngo and Yudkowsky on AI capability gains · 2021-11-21T00:05:20.391Z · LW · GW

That's a really helpful comment (at least for me)!

But at least step one could be saying, "Wait, do these two kinds of ideas actually go into the same bucket at all?"

I'm guessing that a lot of the hidden work here and in the next steps would come from asking stuff like:

  • do I need to alter the bucket for each new idea, or does it instead fit in its current form each time?
  • does the mental act of finding that an idea fits into the bucket remove some confusion and clarify things, or is it just a mysterious answer?
  • does the bucket become simpler and more elegant with each new idea that fits in it?

Is there some truth in this, or am I completely off the mark?

It seems like the sort of thing that would take a subsequence I don't have time to write

You obviously can do whatever you want, but I find myself confused at this idea being discarded. Like, it sounds exactly like the antidote to so much confusion around these discussions and your position, such that if that was clarified, more people could contribute helpfully to the discussion, and either come to your side or point out non-trivial issues with your perspective. Which sounds really valuable for both you and the field!

So I'm left wondering:

  • Do you disagree with my impression of the value of such a subsequence?
  • Do you think it would have this value but are spending your time doing something more valuable?
  • Do you think it would be valuable but really don't want to write it?
  • Do you think it would be valuable, you could in principle write it, but probably no one would get it even if you did?
  • Something else I'm failing to imagine?

Once again, you do what you want, but I feel like this would be super valuable if there was any way of making it possible. That's also completely relevant to my own focus on the different epistemic strategies used in alignment research, especially because we don't have access to empirical evidence or trial and error at all for AGI-type problems.

(I'm also quite curious if you think this comment by dxu points at the same thing you are pointing at)

Comment by adamShimi on How To Get Into Independent Research On Alignment/Agency · 2021-11-20T14:38:22.725Z · LW · GW

Giving a perspective from another country that is far more annoying in administrative terms (France), grant administration can be a real plus. I go through a non-profit in France, and they can take care of the taxes and the declarations, which would be a hassle. In addition, here being self-employed is really bad for many things you might want to do (rent a flat, get a loan, pay for unemployment funds), and having a real contract helps a lot with that.

Comment by adamShimi on Ngo and Yudkowsky on AI capability gains · 2021-11-19T18:32:38.132Z · LW · GW

Damn. I actually think you might have provided the first clear pointer I've seen about this form of knowledge production, why and how it works, and what could break it. There's a lot to chew on in this reply, but thanks a lot for the amazing food for thought!

(I especially like that you explained the physical points and put links that actually explain the specific implication)

And I agree (tentatively) that a lot of the epistemology of science stuff doesn't have the same object-level impact. I was not claiming that normal philosophy of science was required, just that if that was not how we should evaluate and try to break the deep theory, I wanted to understand how I was supposed to do that.

Comment by adamShimi on Ngo and Yudkowsky on AI capability gains · 2021-11-19T18:20:30.300Z · LW · GW

That's when I understood that spatial structure is a Deep Fundamental Theory.

And it doesn't stop there. The same thing explains the structure of our roadways, blood vessels, telecomm networks, and even why the first order differential equations for electric currents, masses on springs, and water in pipes are the same.

(The exact deep structure of physical space which explains all of these is differential topology, which I think is what Vaniver was gesturing towards with "geometry except for the parallel postulate".)

Can you go into more detail here? I have done a decent amount of maths but always had trouble in physics due to my lack of physical intuition, so it might be completely obvious, but I'm not clear on what "the same thing" is or how it explains all your examples. Is it about shortest paths? What aspect of differential topology (a really large field) captures it?

(Maybe you literally can't explain it to me without me seeing the deep theory, which would be frustrating, but I'd want to know if that was the case. )

Comment by adamShimi on Ngo and Yudkowsky on AI capability gains · 2021-11-19T17:15:58.070Z · LW · GW

This particular type of fallback-prediction is a common one in general: we have some theory which makes predictions, but "there's a phenomenon which breaks one of the modelling assumption in a way noncentral to the main theory" is a major way the predictions can fail.

That's a great way of framing it! And a great way of thinking about why these are not failures that are "worrisome" at first/in most cases.

Comment by adamShimi on Ngo and Yudkowsky on AI capability gains · 2021-11-19T16:26:11.356Z · LW · GW

Thanks for the thoughtful answer!

So, thermodynamics also feels like a deep fundamental theory to me, and one of the predictions it makes is "you can't make an engine more efficient than a Carnot engine." Suppose someone exhibits an engine that appears to be more efficient than a Carnot engine; my response is not going to be "oh, thermodynamics is wrong", and instead it's going to be "oh, this engine is making use of some unseen source."

My gut reaction here is that "you can't make an engine more efficient than a Carnot engine" is not the right kind of prediction to try to break thermodynamics, because even if you could break it in principle, staying at that level without going into the detailed mechanisms of thermodynamics will only make you try the same thing as everyone else does. Do you think that's an adequate response to your point, or am I missing what you're trying to say?

So, later Eliezer gives "addition" as an example of a deep fundamental theory. And... I'm not sure I can imagine a universe where addition is wrong? Like, I can say "you would add 2 and 2 and get 5" but that sentence doesn't actually correspond to any universes.

Like, similarly, I can imagine universes where evolution doesn't describe the historical origin of species in that universe. But I can't imagine universes where the elements of evolution are present and evolution doesn't happen.

[That said, I can imagine universes with Euclidean geometry and different universes with non-Euclidean geometry, so I'm not trying to claim this is true of all deep fundamental theories, but maybe the right way to think about this is "geometry except for the parallel postulate" is the deep fundamental theory.]

The mental move I'm doing for each of these examples is not imagining universes where addition/evolution/other deep theory is wrong, but imagining phenomena/problems where addition/evolution/other deep theory is not adapted. If you're describing something that doesn't commute, addition might be a deep theory, but it's not useful for what you want. Similarly, you could argue that given how we're building AIs and trying to build AGI, evolution is not the deep theory that you want to use. 

It sounds to me like you (and your internal-Yudkowsky) are using "deep fundamental theory" to mean "powerful abstraction that is useful in a lot of domains". Which addition and evolution fundamentally are. But claiming that the abstraction is useful in some new domain requires some justification IMO. And even if you think the burden of proof is on the critics, the difficulty of formulating the generators makes that really hard.

Once again, do you think that answers your point adequately?

Comment by adamShimi on Ngo and Yudkowsky on AI capability gains · 2021-11-19T13:10:44.779Z · LW · GW

Thanks John for this whole thread!

(Note that I only read the whole Epistemology section of this post and skimmed the rest, so I might be saying stuff that is repeated/resolved elsewhere. Please point me to the relevant parts/quotes if that's the case. ;) )

Einstein's arrogance sounds to me like an early pointer in the Sequences for that kind of thing, with a specific claim about General Relativity being that kind of theory.

That being said, I still understand Richard's position and difficulty with this whole part (or at least what I read of Richard's difficulty). He's coming from the perspective of philosophy of science, which has focused mostly on ideas related to advanced predictions and taking into account the mental machinery of humans to catch biases and mistakes that we systematically make. The Sequences also spend a massive amount of words on exactly this, and yet in this discussion (and in select points in the Sequences like the aforementioned post), Yudkowsky sounds a bit like he considers that his fundamental theory/observation doesn't need any of these to be accepted as obvious (I don't think he actually thinks that way, but it's hard to extract from the text).

It's even more frustrating because Yudkowsky focuses on "showing epistemic modesty" as his answer/rebuttal to Richard's inquiry, when Richard just sounds like he's asking the completely relevant question "why should we take your word on it?" And the confusion IMO is because that last question sounds very status-y ("How dare you claim such crazy stuff?"), but I'm pretty convinced Richard actually means it in a very methodological/philosophy-of-science/epistemic-strategies way: "What are the ways of thinking that you're using here that you expect to be particularly good at aiming at the truth?"

Furthermore, I agree with (my model of) Richard that the main issue with the way Yudkowsky (and you John) are presenting your deep idea is that you don't give a way of showing it wrong. For example, you (John) write:

It's one of those predictions where, if it's false, then we've probably discovered something interesting - most likely some place where an organism is spending resources to do something useful which we haven't understood yet.

And even if I feel what you're gesturing at, this sounds/looks like you're saying "even if my prediction is false, that doesn't mean that my theory would be invalidated". Whereas I feel you want to convey something like "this is not a prediction/part of the theory that has the ability to falsify the theory" or "it's part of the obvious wiggle room of the theory". What I want is a way of finding the parts of the theory/model/prediction that could actually invalidate it, because that's what we should be discussing really. (A difficulty might be that such theories are so fundamental and powerful that being able to see them makes it really hard to find any way they could go wrong and endanger the theory.)

An analogy that comes to my mind is with the barriers for proving P vs NP. These make explicit the ways in which you can't solve the P vs NP question, such that it becomes far easier to weed out proof attempts. My impression is that you (Yudkowsky and John) have models/generators that help you see at a glance that a given alignment proposal will fail. Which is awesome! I want to be able to find and extract and use those. But what Richard is pointing out IMO is that making the generators explicit would give us a way to stress-test them, which is a super important step to start believing in them further. Just like we want people to actually try to go beyond GR, and for that they need to understand it deeply.

(Obviously, maybe the problem is that as you two are pointing it out, making the models/generators explicit and understandable is just really hard and you don't know how to do that. That's fair).

Comment by adamShimi on Applications for AI Safety Camp 2022 Now Open! · 2021-11-18T10:48:40.770Z · LW · GW

It means more. :)