Four levels of understanding decision theory 2023-06-01T20:55:07.974Z
Without a trajectory change, the development of AGI is likely to go badly 2023-05-29T23:42:16.511Z
Where do you lie on two axes of world manipulability? 2023-05-26T03:04:17.780Z
Reward is the optimization target (of capabilities researchers) 2023-05-15T03:22:49.152Z
Max H's Shortform 2023-05-13T00:17:37.757Z
Gradient hacking via actual hacking 2023-05-10T01:57:10.145Z
LLM cognition is probably not human-like 2023-05-08T01:22:45.694Z
A test of your rationality skills 2023-04-20T01:19:03.210Z
Paying the corrigibility tax 2023-04-19T01:57:51.631Z
"Aligned" foundation models don't imply aligned systems 2023-04-13T04:13:49.984Z
A decade of lurking, a month of posting 2023-04-09T00:21:23.321Z
Eliezer on The Lunar Society podcast 2023-04-06T16:18:47.316Z
Steering systems 2023-04-04T00:56:55.407Z
Grinding slimes in the dungeon of AI alignment research 2023-03-24T04:51:53.642Z
Instantiating an agent with GPT-4 and text-davinci-003 2023-03-19T23:57:19.604Z
Gradual takeoff, fast failure 2023-03-16T22:02:03.997Z


Comment by Max H (Maxc) on Think carefully before calling RL policies "agents" · 2023-06-03T00:23:53.806Z · LW · GW

Also, I didn't mean for this distinction to be particularly interesting - I am still slightly concerned that it is so pedantic / boring / obvious that I'm the only one who finds it worth distinguishing at all. 


I'm literally just saying, a description of a function / mind / algorithm is a different kind of thing than the (possibly repeated) execution of that function / mind / algorithm on some substrate. If that sounds like a really deep or interesting point, I'm probably still being misunderstood. 

Comment by Max H (Maxc) on How could AIs 'see' each other's source code? · 2023-06-02T23:07:20.791Z · LW · GW

What source code and what machine code is actually being executed on some particular substrate is an empirical fact about the world, so in general, an AI (or a human) might learn it the way we learn any other fact - by making inferences from observations of the world. 

For example, maybe you hook up a debugger or a waveform reader to the AI's CPU to get a memory dump, reverse engineer the running code from the memory dump, and then prove some properties you care about follow inevitably from running the code you reverse engineered.
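A toy sketch of the easiest version of this, where the source code is simply handed over rather than reverse engineered from a memory dump. Everything here is hypothetical and for illustration only: the bot sources, the `decide` entry point, and the `load` / `play` helpers are made-up names, and real verification would need to establish that the source you were shown is the code actually running.

```python
# Each "bot" is just a string of source code defining decide(my_source, opponent_source).
# These names are illustrative assumptions, not an established API.
CLIQUE_BOT = '''
def decide(my_source, opponent_source):
    # Cooperate only with an exact textual copy of myself.
    return "C" if opponent_source == my_source else "D"
'''

DEFECT_BOT = '''
def decide(my_source, opponent_source):
    # Ignore the opponent entirely and always defect.
    return "D"
'''


def load(source):
    """Turn a bot's source string into a callable decide function."""
    namespace = {}
    exec(source, namespace)
    return namespace["decide"]


def play(source_a, source_b):
    """Run one round in which each bot sees its own source and its opponent's."""
    decide_a = load(source_a)
    decide_b = load(source_b)
    return decide_a(source_a, source_b), decide_b(source_b, source_a)


print(play(CLIQUE_BOT, CLIQUE_BOT))  # both copies cooperate
print(play(CLIQUE_BOT, DEFECT_BOT))  # the clique bot defects against a non-copy
```

Comparing source text for exact equality is about the crudest possible "proof" of a property, and it breaks as soon as the opponent differs by a single character. It is only meant to show what kind of thing becomes possible once the code being run is a known fact.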

In general though, this is a pretty hard, unsolved problem - you probably run into a bunch of issues related to embedded agency pretty quickly.

To get an intuition for why these problems might be possible to solve, consider the tools that humans who want to cooperate or verify each other's decision processes might use in more restricted circumstances.

For example, maybe two governments which distrust each other conduct a hostage exchange by meeting in a neutral country, and bring lots of guns and other hostages they intend not to exchange that day. Each government publicly declares that if the hostage exchange doesn't go well, they'll shoot a bunch of the extra hostages.

Or, suppose two humans are in a True Prisoner's Dilemma with each other, in a world where fMRI technology has advanced to the point where we can read off semantically meaningful inner thoughts from a scan in real-time. Both players agree to undergo such a scan and make the scan results available to their opponent, while making their decision. This might not allow them to prove that their opponent will cooperate iff they cooperate, but it will still probably make it a lot easier for them to robustly cooperate, for valid decision theoretic reasons, rather than for reasons of honor or a spirit of friendliness.


Comment by Max H (Maxc) on Four levels of understanding decision theory · 2023-06-02T22:31:21.541Z · LW · GW

but it's not clear that an agent needs to understand the formal theory in order to actually avoid D,D.


They definitely don't, in many cases - humans in PDs cooperate all the time, without actually understanding decision theory.

The hierarchy is meant to express that robustly avoiding (D,D) for decision theory-based reasons requires that either the agent itself, or its programmers, understand and implement the theory.

Each level is intended to be a prerequisite for the levels that follow it, modulo the point that, in the case of programmed bots in a toy environment, the comprehension can be either in the bot itself, or in the programmers that built the bot.

I don't see how level 2 depends on anything in level 3 - being at level 2 just means you understand the concept of a Nash equilibrium and why it is an attractor state. You have a desire to avoid it (in fact, you have that desire even at level 1), but you don't know how to do so, robustly and formally.
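To make the level 2 concept concrete, here is a minimal sketch that finds the pure-strategy Nash equilibria of a one-shot Prisoner's Dilemma by brute force. The payoff numbers are conventional textbook values that I am assuming for illustration; they are not taken from the post.

```python
from itertools import product

MOVES = ("C", "D")

# PAYOFFS[(row_move, col_move)] = (row_player_payoff, col_player_payoff)
# Assumed illustrative values: temptation 5, reward 3, punishment 1, sucker 0.
PAYOFFS = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}


def is_nash(row_move, col_move):
    """True if neither player can gain by unilaterally switching moves."""
    row_payoff, col_payoff = PAYOFFS[(row_move, col_move)]
    row_cannot_improve = all(
        PAYOFFS[(alt, col_move)][0] <= row_payoff for alt in MOVES
    )
    col_cannot_improve = all(
        PAYOFFS[(row_move, alt)][1] <= col_payoff for alt in MOVES
    )
    return row_cannot_improve and col_cannot_improve


def pure_nash_equilibria():
    """Check every pair of moves and keep the stable ones."""
    return [pair for pair in product(MOVES, repeat=2) if is_nash(*pair)]


print(pure_nash_equilibria())  # [('D', 'D')]
```

With these payoffs, (D,D) is the only outcome from which no player wants to deviate alone, even though both would prefer (C,C). Understanding that much is level 2; knowing how to robustly escape it is the part that needs more theory.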

Comment by Max H (Maxc) on A mind needn't be curious to reap the benefits of curiosity · 2023-06-02T22:07:43.780Z · LW · GW

I feel like this is common enough—"are they helping me out here just because they're really nice, or because they want to get in my good graces or have me owe them a favor?"—that authors often have fictional characters wonder if it's one or the other.  And real people certainly express similar concerns about, say, whether someone donates to charity for signaling purposes or for "altruism".


That's a good example, though I was originally thinking of an agent which behaves actually kindly,  not because it expects any favor or reciprocation, nor because it is trying to manipulate the agent it is being kind to (or any other agent(s))  as part of some larger goal.

An agent might be capable of behaving in such a manner, as well as understanding the true and precise meaning of kindness, as humans understand it, but without having any of the innate drives or motivations which cause humans to behave kindly.

Such an agent might actually behave kindly despite lacking such drives though, for various reasons: perhaps an inclination to the kindness behavior pattern has somehow been hardcoded into the agent's mind, or, if we're in the world of HPMOR, the agent has taken some kind of Unbreakable Vow to behave kindly.

Comment by Max H (Maxc) on A mind needn't be curious to reap the benefits of curiosity · 2023-06-02T21:24:18.055Z · LW · GW

Are there any other common concepts to which this distinction can be applied? Vengefulness and making threats, maybe?

"A mind needn't be vengeful to reap the benefits of vengefulness / making threats"?

Meaning, in more words: humans have drives and instincts which cause them to act vengefully or threateningly under certain circumstances. Some or all humans might even care about punishment or revenge for its own sake, i.e. terminally value that other agents get their just deserts, including punishment. (Though that might be a value that would fade away or greatly diminish under sufficient reflection.)

But it might be the case that a mind could find it instrumentally useful to make threats or behave vengefully towards agents which respond to such behavior, without the mind itself internally exhibiting any of the drives or instincts that cause humans to be vengeful.

Maybe kindness is also like this: there might be benefits to behaving kindly, in some situations. But a mind behaving kindly (pico-pseudokindly?) need not value kindness for its own sake, nor have any basic drive or instinct to kindness.

Comment by Max H (Maxc) on The case for turning glowfic into Sequences · 2023-06-02T20:31:14.110Z · LW · GW

I wasn't aware of the bounty until seeing this comment, but I am a big fan of planecrash, both as a work of fiction and as pedagogy. 

I wrote one post that built on the corrigibility tag in planecrash, and another on understanding decision theory, which isn't directly based on anything in planecrash, but is kind of loosely inspired by some things I learned from reading it.

(Neither of these posts appears to meet the requirements for the bounty, and they didn't get much engagement in any case. Just pointing them out in case you or anyone else is looking for some planecrash-inspired rationality / AI content.)

Comment by Max H (Maxc) on Four levels of understanding decision theory · 2023-06-02T18:29:47.140Z · LW · GW

Thanks! Glad at least one person read it; this post set a new personal record for low engagement, haha. 

I think exploring ways that AIs and / or humans (through augmentation, neuroscience, etc.) could implement decision theories more faithfully is an interesting idea. I chose not to focus directly on AI in this post, since I think LW, and my own writing specifically, has been kinda saturated with AI content lately. And I wanted to keep this shorter and lighter in the (apparently doomed) hope that more people would read it.

Comment by Max H (Maxc) on How to have Polygenically Screened Children · 2023-06-02T15:49:03.504Z · LW · GW

But there is a slippery slope towards a scenario where people select for sex, skin (and hair and eye) colour, not being queer, not being neurodivergent... and I do find that a dystopian scenario where we would lose something valuable that enriches our world.


Not necessarily disagreeing with your main point ("discrimination concerns"), but I want to note that the status quo is currently that an alien god gets to choose those traits. Given the option to wrest (more) control away from natural selection, specific humans, or even humanity collectively, might indeed choose to exercise that control in even worse and more horrifying ways than evolution.

So "even more horrifying" is a way things might go, and it's worth weighing that in a cost-benefit analysis. But I just want to note that, as a transhumanist, I regard the status quo, of ceding humanity's collective heritage to an unthinking and unfeeling alien optimization process, to already be pretty horrifying.

Partially, I'm horrified at some of the actual outcomes it produces, but I also have a larger philosophical discomfort with trusting something as important as the process by which humans get created to such an alien process, in a universe that (aside from humans themselves) is so unthinking and uncaring and unfair.


Comment by Max H (Maxc) on Think carefully before calling RL policies "agents" · 2023-06-02T15:41:26.209Z · LW · GW

Interesting distinction. An agent that is asleep isn't an agent, by this usage.

Well, a sleeping person is still an embodied system, with running processes and sensors that can wake the agent up. And the agent, before falling asleep, might arrange things such that they are deliberately woken up in the future under certain circumstances (e.g. setting an alarm, arranging a guard to watch over them during their sleep).

The thing I'm saying is not an agent is more like a static description of a mind, e.g. the source code of an AGI isn't an agent until it is compiled and executed on some kind of substrate. I'm not a carbon (or silicon) chauvinist; I'm not picky about which substrate. But without some kind of embodiment and execution, you just have a mathematical description of a computation, the actual execution of which may or may not be computable or otherwise physically realizable within our universe.

By the way, are you Max H of the space rock ai thingy?

Nope, different person! 

Comment by Max H (Maxc) on Think carefully before calling RL policies "agents" · 2023-06-02T12:52:52.444Z · LW · GW

Hadn't seen the paper, but I think I basically agree with it, and your claim.

I was mainly saying something even weaker: the policy itself is just a function, so it can't be an agent. The thing that might or might not be an agent is an embodiment of the policy by repeatedly executing it in the appropriate environment, while hooked up to (real or simulated) I/O channels.

Comment by Max H (Maxc) on Think carefully before calling RL policies "agents" · 2023-06-02T04:36:20.820Z · LW · GW

Once you have a policy network and a sampling procedure, you can embody it in a system which samples the network repeatedly, and hooks up the I/O to the proper environment and actuators. Usually this involves hooking the policy into a simulation of a game environment (e.g. in a Gym), but sometimes the embodiment is an actual robot in the real world.

I think using the term "agent" for the policy itself is actually a type error, and not just misleading. I think using the term to refer to the embodied system has the correct type signature, but I agree it can be misleading, for the reasons you describe.
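A minimal sketch of the type distinction I mean. The environment (`LineWorld`), the hand-written `policy`, and the `run_embodied` loop are all hypothetical names invented for this example; a real setup would use a trained network and something like a Gym environment instead.

```python
from typing import Callable, List, Tuple

Observation = int
Action = int

# A policy is just a function from observations to actions.
Policy = Callable[[Observation], Action]


def policy(observation: Observation) -> Action:
    """Step toward position 0. A pure function: nothing happens until it is called."""
    if observation > 0:
        return -1
    if observation < 0:
        return 1
    return 0


class LineWorld:
    """Hypothetical 1-D environment whose entire state is an integer position."""

    def __init__(self, start: int) -> None:
        self.position = start

    def observe(self) -> Observation:
        return self.position

    def step(self, action: Action) -> None:
        self.position += action


def run_embodied(
    policy: Policy, env: LineWorld, max_steps: int
) -> List[Tuple[Observation, Action]]:
    """The embodiment: repeatedly execute the policy, wired to the environment's I/O."""
    trajectory = []
    for _ in range(max_steps):
        observation = env.observe()
        action = policy(observation)
        trajectory.append((observation, action))
        if action == 0:  # the policy has nothing left to do
            break
        env.step(action)
    return trajectory


print(run_embodied(policy, LineWorld(3), max_steps=10))
# [(3, -1), (2, -1), (1, -1), (0, 0)]
```

Here `policy` has type `Observation -> Action`, while the thing with behavior over time is the call to `run_embodied`. Only the latter is even a candidate for being called an agent, which is the sense in which calling `policy` itself an agent is a type error.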

OTOH, I do think modelling the outward behavior of such systems by regarding them as agents with black-box internals is often useful as a predictor, and I would guess that this modelling is the origin of the use of the term in RL.

But modelling outward behavior is very different from attributing that behavior to agentic cognition within the policy itself. I think it is unlikely that any current policy networks are doing (much) agentic cognition at runtime, but I wouldn't necessarily count on that trend continuing. So moving away from the term "agent" proactively seems like a good idea.

Anyway, I appreciate posts like this which clarify / improve standard terminology. Curious if you agree with my distinction about embodiment, and if so, if you have any better suggested term for the embodied system than "agent" or "embodiment".

Comment by Max H (Maxc) on Yudkowsky vs Hanson on FOOM: Whose Predictions Were Better? · 2023-06-02T01:56:17.440Z · LW · GW

I do think that some of the deep learning revolution turned out to be kind of compute bottlenecked, but I don't believe this is currently that true anymore

I had kind of the exact opposite impression of compute bottlenecks (that deep learning was not meaningfully compute bottlenecked until very recently). OpenAI apparently has a bunch of products and probably also experiments that are literally just waiting for H100s to arrive. Probably this is mainly due to the massive demand for inference, but still, this seems like a kind of actual hardware bottleneck that is pretty new for the field of DL. It kind of has a parallel to Bitcoin mining technology, where the ability to get the latest-gen ASICs first was (still is?) a big factor in miner profitability.

Comment by Max H (Maxc) on Yudkowsky vs Hanson on FOOM: Whose Predictions Were Better? · 2023-06-02T01:36:11.187Z · LW · GW

So I think that

Implementing old methods more vigorously is more or less exactly what got modern deep learning started

is just straightforwardly true.

I would say that the "vigor" was almost entirely bottlenecked on researcher effort and serial thinking time, rather than compute resources. A bunch of concrete examples to demonstrate what I mean:

  • There are some product rollouts (e.g. multimodal and 32k GPT-4) and probably some frontier capabilities experiments which are currently bottlenecked on H100 capacity. But this is a very recent phenomenon, and has more to do with the sudden massive demand for inference rather than anything to do with training. In the meantime, there are plenty of other ways that OpenAI and others are pushing the capabilities frontier by things other than just piling on more layers on more GPUs.
  • If you sent back the recipe for training the smallest GPT model that works at all (GPT-J 6B, maybe?), people in 2008 could probably cobble together existing GPUs into a supercomputer, or, failing that, have the foundries of 2008 fabricate ASICs, and have it working in <1 year.
  • OTOH, if researchers in 2008 had access to computing resources of today, I suspect it would take them many years to get to GPT-3. Maybe not the full 14 years, since now many of their smaller-scale training runs go much quicker, and they can try larger things out much faster. But the time required to (a) think up which experiments to try, (b) implement the code to try them, and (c) analyze the results, dominates the time spent waiting around for a training run to finish.
  • More generally, it's not obvious how to "just scale things up" with more compute, and figuring out the exact way of scaling things is itself an algorithmic innovation. You can't just literally throw GPUs at the DL methods of 10-15 years ago and have them work.
  • "Researcher time" isn't exactly the same thing as algorithmic innovation, but I think that has been the actual bottleneck on the rate of capabilities advancement during most of the last 15 years (again, with very recent exceptions). That seems much more like a win for Yudkowsky than Hanson, even if the innovations themselves are "boring" or somewhat obvious in retrospect.


In worlds where Hanson was actually right, I would expect that the SoTA models of today are all controlled by whichever cash-rich organizations (including non-tech companies and governments) can literally dump the most money into GPUs. Instead, we're in a world where any tech company which has the researcher expertise can usually scrounge up enough cash to get ~SoTA. Compute isn't a negligible line item in the budget even for the biggest players, but it's not (and has never been) the primary bottleneck on progress.


But -- regardless of Yudkowsky's current position -- it still remains that you'd have been extremely surprised by the last decade's use of compute if you had believed him, and much less surprised if you had believed Hanson.

I don't think 2008!Yudkowsky's world model is "extremely surprised" by the observation that, if you have an AI algorithm that works at all, you can make it work ~X% better using a factor of Y more computing power.

For one, that just doesn't seem like it would be very surprising to anyone in 2008, though maybe I'm misremembering the recent past or suffering from hindsight bias here. For two, Eliezer himself has always been pretty consistent that, even if algorithmic innovation is likely to be more important, one of the first things that a superintelligence would do is turn as much of the Earth as possible into computronium as fast as possible. That doesn't seem like someone who would be surprised that more compute is more effective.

Comment by Max H (Maxc) on Yudkowsky vs Hanson on FOOM: Whose Predictions Were Better? · 2023-06-01T20:25:37.506Z · LW · GW

Yudkowsky seems quite wrong here, and Hanson right, about one of the central trends -- and maybe the central trend -- of the last dozen years of AI. Implementing old methods more vigorously is more or less exactly what got modern deep learning started; algorithms in absence of huge compute have achieved approximately nothing.


Really? If you sent a bunch of H100 GPUs (and infrastructure needed to run them) back in time to 2008, people might have been able to invent transformers, GPTs, and all the little quirks that actually make them work a little faster, and a little more cheaply.

OTOH, if you sent back Attention is all you need (and some other papers or documentation on ML from the last decade), without the accompanying hardware, people likely would have gotten pretty far, pretty quickly, just using 2008-level hardware (or buying / building more and faster hardware, once they knew the right algorithms to run on them). People didn't necessarily have a use for all the extra compute, until they invented the algorithms which could actually make use of it.

Even today, just scaling up GPTs even further is one obvious thing to try that is currently somewhat bottlenecked on supercomputer and GPU availability, but there are a bunch of algorithmic things that people are also trying to advance capabilities that are mostly bottlenecked on researcher time to experiment with.

Also, I'm pretty sure this is a repost? Did it appear on LW or somewhere else previously? If so, can you please link to any existing discussion about this?

Comment by Max H (Maxc) on The Crux List · 2023-06-01T17:08:58.722Z · LW · GW

The target of the second hyperlink appears to contain some HTML, which breaks the link and might be the source of some other problems:

Comment by Max H (Maxc) on The bullseye framework: My case against AI doom · 2023-06-01T15:14:34.452Z · LW · GW

I see. This is exactly the kind of result for which I think the relevance breaks down, when the formal theorems are actually applied correctly and precisely to situations we care about. The authors even mention the instance / limiting distinction that I draw in the comment I linked, in section 4.

As a toy example of what I mean by irrelevance, suppose it is mathematically proved that strongly solving Chess requires space or time which is exponential as a function of board size. (To actually make this precise, you would first need to generalize Chess to n x n Chess, since for a fixed board size, the size of the game tree is necessarily fixed / constant.)

Maybe you can prove that there is no way of strongly solving 8x8 Chess within our universe, and furthermore that it is not even possible to approximate well. Stockfish 15 does not suddenly poof out of existence, as a result of your proofs, and you still lose the game, when you play against it.

Comment by Max H (Maxc) on The bullseye framework: My case against AI doom · 2023-06-01T13:06:47.125Z · LW · GW

First of all, I didn't say anything about utility maximization. I partially agree with Scott Garrabrant's take that VNM rationality and expected utility maximization are wrong, or at least conceptually missing a piece. Personally, I don't think utility maximization is totally off-base as a model of agent behavior; my view is that utility maximization is an incomplete approximation, analogous to the way that Newtonian mechanics is an incomplete understanding of physics, for which general relativity is a more accurate and complete model. The analogue to general relativity for utility theory may be Geometric rationality, or something else yet-undiscovered.

By humans are maximizers of something, I just meant that some humans (including myself) want to fill galaxies with stuff (e.g. happy sentient life), and there's not any number of galaxies already filled at which I expect that to stop being true. In other words, I'd rather fill all available galaxies with things I care about than leave any fraction, even a small one, untouched, or used for some other purpose (like fulfilling the values of a squiggle maximizer).


Note that ideal utility maximisation is computationally intractable.

I'm not sure what this means precisely. In general, I think claims about computational intractability could benefit from more precision and formality (see the second half of this comment here for more), and I don't see what relevance they have to what I want, and to what I may be able to (approximately) get.

Comment by Max H (Maxc) on Cosmopolitan values don't come free · 2023-06-01T00:45:51.694Z · LW · GW

This comment changed my mind on the probability that evolved aliens end up kind, which I now think is somewhat higher than 5%. I still think AI systems are unlikely to have kindness, for something like the reason you give at the end:

In ML we just keep on optimizing as the system gets smart. I think this doesn't really work unless being kind is a competitive disadvantage for ML systems on the training distribution.

I actually think it's somewhat likely that ML systems won't value kindness at all before they are superhuman enough to take over. I expect kindness as a value within the system itself not to arise spontaneously during training, and that no one will succeed at eliciting it deliberately before take over. (The outward behavior of the system may appear to be kind, and mechanistic interpretability may show that some internal component of the system has a correct understanding of kindness. But that's not the same as the system itself valuing kindness the way that humans do or aliens might.)

Comment by Max H (Maxc) on Cosmopolitan values don't come free · 2023-05-31T23:06:30.677Z · LW · GW

I can’t tell if you think kindness is rare amongst aliens, or if you think it’s common amongst aliens but rare amongst AIs. Either way, I would like to understand why you think that. What is it that makes humans so weird in this way?

Can't speak for Nate and Eliezer, but I expect kindness to be somewhat rare among evolved aliens (I think Eliezer's wild guess is 5%? That sounds about right to me), and the degree to which they are kind will vary, possibly from only very slightly kind (or kind only under a very cosmopolitan view of kindness), to as kind or more kind than humans.

For AIs that humans are likely to build soon, I think there is significant probability (more than 50%, less than 99%? 90% seems fair) that they have literally 0 kindness. One reason is that I expect there is a significant chance that there is nothing within the first superintelligent AI systems to care about kindness or anything else, in the way that humans and aliens might care about something. If an AI system is superintelligent, then by assumption, some component piece of the system will necessarily have a deep and correct understanding of kindness (and many other things), and be capable of manipulating that understanding to achieve some goals. But understanding kindness is different from the system itself valuing kindness, or from there being anything at all "there" to have values of any kind whatsoever.

I think that current AI systems don't provide much evidence on this question one way or the other, and as I've said elsewhere, arguments about this which rely on pattern matching human cognition to structures in current AI systems often fail to draw the understanding / valuing distinction sharply enough, in my view. 

So a 90% chance of ~0 kindness is mostly just a made-up guess, but it still feels like a better guess to me than a shaky, overly-optimistic argument about how AI systems designed by processes which look nothing like human (or alien) evolution will produce minds which, very luckily for us, just so happen to share an important value with minds produced by evolution.

Comment by Max H (Maxc) on The Crux List · 2023-05-31T19:37:56.084Z · LW · GW

This is cool. Something I might try later this week as an exercise is going through every question (at least at the top level of nesting, maybe some of the nested questions as well), and give yes / no / it depends answers (or other short phrases, for non Y/N questions), without much justification.

(Some of the cruxes here overlap with ones I identified in my own contest entry. Some, I think, are unlikely to be truly important cruxes. Some, I have a fairly strong and confident view on, but would not be surprised if my view is not the norm. Some, I haven't considered in much detail at all...)

Comment by Max H (Maxc) on Cosmopolitan values don't come free · 2023-05-31T17:14:13.101Z · LW · GW

I think another common source of disagreement is that people sometimes conflate a mind or system's ability to comprehend and understand some particular cosmopolitan, human-aligned values and goals, with the system itself actually sharing those values, or caring about them at all.  Understanding a value and actually valuing it are different kinds of things, and this is true even if some component piece of the system has a deep, correct, fully grounded understanding of cosmopolitan values and goals, and is capable of generalizing them in the way that humans would want them generalized.

In my view, current AI systems are not at the point where they have any kind of "values" of their own at all, though LLMs appear to have some kind of understanding of some human values which correctly bind to reality, at least weakly. But such an understanding is more a fact about LLM ability to understand the world at all, than it is about the LLM's own "values", whatever they may be.

Comment by Max H (Maxc) on Without a trajectory change, the development of AGI is likely to go badly · 2023-05-31T16:15:20.313Z · LW · GW

Update: I've now submitted a version of this post to the worldviews contest, and will likely not make further edits.

This post hasn't gotten much engagement on LW or the EA forum so far though.

Some hypotheses for why:

  • it's too long / dense / boring
  • people view it as mostly re-treading old ground
  • it's not 101-friendly and requires a lot of background material to grapple with
  • I posted it initially at the end of a holiday weekend, and it dropped off the front page before most people had a chance to see it at all.
  • Some people only read longform, especially about AI, if it is by a well-known author or has at least a few upvotes already. This post did not break through some initial threshold before dropping off the front page, and was thus not read by many people, even if they saw the title.
  • There is too much other AI content on LW; people saw it but chose not to read or not to upvote because they were busy reading other content.
  • Lots of people saw it, but felt it was not good enough for its length to be worth an upvote, but not wrong enough to deserve a downvote.
  • parts of it are wrong / unclear / misleading in some way
  • Lots of people filter or down-weight the "AI" tag, and this post doesn't have any other tags (e.g. world modeling) which are less likely to be down-weighted.
  • Something else I'm not thinking of.

I am slightly hesitant to ask for more engagement directly, but if you read or skimmed more than 30% of the post, I'd appreciate an "I saw this" react on this comment. (If you read this comment, but didn't read the original post, feel free to react to this comment instead.)

If you have thoughts about why this post didn't get much engagement, feel free to reply here.

Comment by Max H (Maxc) on Bandgaps, Brains, and Bioweapons: The limitations of computational science and what it means for AGI · 2023-05-30T22:17:29.711Z · LW · GW

Well, "opens up the possibility that all such plans are intractable" is a much weaker claim than "impossible", and I disagree about the concrete difficulty of at least one of the steps in your plan: there are known toxins with ~100% lethality to humans in nature.

Distributing this toxin via a virus engineered using known techniques from GoF research and some nanotechnology for a timer seems pretty tractable, and close enough to 100% lethal to me.

The tech to build a timer circuit out of RNA and ATP instead of in silicon and electricity doesn't exist yet AFAIK, but the complexity, size, and energy constraints that such a timer design must meet are certainly tractable to design at nanoscale in silicon. Moving to a biological substrate might be hard, but knowing a bit about what hardware engineers are capable of doing with silicon, often with extremely limited energy budgets, it certainly doesn't seem intractable for human designers, let alone for an ASI, to do similar things with biology.

So I'm a bit skeptical of your estimate of the other steps as "probably incomputable"!

Also, a more general point: you've used "incomputable" throughout, in what appears to be an informal way of saying "computationally intractable".

In computational complexity theory, "uncomputable", "undecidable", "NP-complete", and Big-O notation have very precise technical meanings: they are statements about the limiting behavior of particular classes of problems. They don't necessarily imply anything about particular concrete instances of such problems.

So it's not just that there are good approximations for solving the traveling salesman problem in general or probabilistically, which you correctly note.

It's that, for any particular instance of the traveling salesman problem (or any other NP-hard problem), approximating or solving that particular instance may be tractable or even trivial, for example, by applying a specialized algorithm, or because the particular instance of the problem you need to solve has exploitable regularities or is otherwise degenerate in some way.
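A small sketch of the instance / limiting distinction, using made-up data. The first function solves a tiny instance exactly by brute force; the second exploits a degenerate structure (all cities on one line), where the optimal tour is just out to one end and back. The function names and the example coordinates are assumptions for illustration.

```python
from itertools import permutations
from math import dist


def tour_length(points, order):
    """Length of the closed tour that visits points in the given order."""
    return sum(
        dist(points[order[i]], points[order[(i + 1) % len(order)]])
        for i in range(len(order))
    )


def brute_force_tsp(points):
    """Exact optimum by trying every tour. Factorial time, but fine for tiny instances."""
    n = len(points)
    # Fix city 0 as the start so rotations of the same tour are not re-counted.
    candidates = ((0,) + perm for perm in permutations(range(1, n)))
    return min(tour_length(points, order) for order in candidates)


def collinear_tsp(xs):
    """Exact optimum when every city lies on one line: sweep out and come back."""
    return 2 * (max(xs) - min(xs))


xs = [0, 7, 2, 5, 3]
points = [(x, 0.0) for x in xs]
print(brute_force_tsp(points))  # 14.0
print(collinear_tsp(xs))        # 14
```

Neither function says anything about the worst case. The brute-force search blows up quickly as cities are added, and the shortcut is only valid because this particular instance happens to be degenerate, which is exactly the point: hardness results for the problem class don't tell you how hard the instance in front of you is.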

The same is true of e.g. the halting problem, which is provably undecidable in general! And yet, many programs that we care about can be proved to halt, or proved not to halt, in very reasonable amounts of time, often trivially by running them, or by inspection of their source. In fact, for a given randomly chosen program (under certain sampling assumptions), it is overwhelmingly likely that whether it halts or not is decidable. See the reference in this footnote for more.
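As a sketch of "proving a program halts by running it": the function below runs one specific loop (the Collatz iteration, which I picked as an illustrative example, not one from the post above) under a step budget. If the loop finishes within the budget, that is a proof that this particular input halts. If it doesn't, we have learned nothing either way, which is why the function returns `None` rather than claiming non-termination.

```python
from typing import Optional


def collatz_steps_within(n: int, max_steps: int) -> Optional[int]:
    """Run the Collatz iteration from n for at most max_steps steps.

    Returns the number of steps taken to reach 1, which proves that this
    input halts, or None if the budget ran out, which proves nothing.
    """
    steps = 0
    while n != 1:
        if steps >= max_steps:
            return None  # unknown: might halt later, might never halt
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        steps += 1
    return steps


print(collatz_steps_within(6, max_steps=100))   # 8
print(collatz_steps_within(27, max_steps=10))   # None (budget too small to tell)
print(collatz_steps_within(27, max_steps=1000) is not None)  # True
```

Whether this loop halts for every positive integer is a famous open question, yet for any specific input we care to try, settling it is a matter of microseconds. General undecidability and per-instance difficulty are just different kinds of claims.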

The point of all of this is that I think saying something is "probably incomputable" is just too imprecise and informal to be useful as a bound on the capabilities of a superintelligence (or even on human designers, for that matter), and trying to make the argument more precise probably causes it to break down, or requires a formulation of the problem in a domain where results from computational complexity theory are simply not applicable.

Comment by Max H (Maxc) on The bullseye framework: My case against AI doom · 2023-05-30T15:19:17.845Z · LW · GW

Capabilities are instrumentally convergent, values and goals are not. That's why we're more likely to end up in the bottom right quadrant, regardless of the "size" of each category.

The instrumental convergence argument is only strong for fixed goal expected value maximisers. Ie, a computer that is given a goal like “produce as many paperclips as possible”. I call these “fanatical” AI’s. This was the typical AI that was imagined many years ago when these concepts were invented. However, I again have to invoke the principle that if humans aren’t fanatical maximisers, and currently existing software aren’t fanatical maximisers, then maybe AI will not be either. 

Instrumental convergence is called convergent for a reason; it is not convergent only for "fanatical maximizers".  Also, sufficiently smart and capable humans probably are maximizers of something, it's just that the something is complicated. See e.g. this recent tweet for more.

(Also, the paperclip thought experiment was never about an AI explicitly given a goal of maximizing paperclips; this is based on a misinterpretation of the original thought experiment. See the wiki for more details.)

Comment by Max H (Maxc) on Without a trajectory change, the development of AGI is likely to go badly · 2023-05-29T23:57:09.672Z · LW · GW

I enabled the cool new reactions feature on the comments for this post! Reactions aren't (yet?) supported on posts themselves, but feel free to react to this comment with any reactions you would give to the post as a whole.

Comment by Max H (Maxc) on Hands-On Experience Is Not Magic · 2023-05-28T20:46:46.759Z · LW · GW

I largely agree with the general point that I think this post is making, which I would summarize in my own words as: the importance and necessity of iteration-and-feedback cycles, experimentation, experience, trial-and-error, etc. (LPE, in your terms) is sometimes overrated. This over-emphasis is particularly common among those who have an optimistic view on solving the alignment problem through iterative experimentation.

I think the degree to which LPE is actually necessary for solving problems in any given domain, as well as the minimum amount of time, resources, and general tractability of obtaining such LPE, is an empirical question which people frequently investigate for particular important domains.

Differing intuitions about how important LPE is in general, and how tractable it is to obtain, seem like an important place for identifying cruxes in world views. I wrote a bit more about this in a recent post, and commented on one of the empirical investigations to which my post is partially a response. As I said in the comment, I find such investigations interesting and valuable as a matter of furthering scientific understanding about the limits of the possible, but pretty futile as attempts to bound the capabilities of a superintelligence. I think your post is a good articulation of one reason why I find these arguments so uncompelling.

Comment by Max H (Maxc) on A strong mind continues its trajectory of creativity · 2023-05-27T20:27:06.622Z · LW · GW

Probably no current AI system qualifies as a "strong mind", for the purposes of this post? Adding various kinds of long term memory is a very natural and probably instrumentally convergent improvement to make to LLM-based systems, though. 

I expect that as LLM-based systems get smarter and more agentic, they'll naturally start hitting on this strategy for self-improvement on their own. If you ask GPT-4 for improvements one could make to LLMs, it will come up with the idea of adding various kinds of memory. AutoGPT and similar systems are not yet good enough to actually implement these improvements autonomously, but I expect that will change in the near future, and that it will be pretty difficult to get comparable performance out of a memoryless system. As you go even further up the capabilities ladder, it probably gets hard to avoid developing memory, intentionally or accidentally or as a side effect.

Comment by Max H (Maxc) on Open Thread With Experimental Feature: Reactions · 2023-05-26T23:56:26.580Z · LW · GW

I initially voted to eliminate agreement voting, because having both seems like too much UI complexity with confusing overlap.

But thinking about it further, I strongly predict that agree / disagree reactions will be used much less often than agree / disagree voting, especially by lurkers and non-participants in a discussion thread, because reactions are not anonymous.

I think the ability to give anonymous, low-effort / no-impact feedback is an important consideration, and I often find it useful to see how a large number of voters feel about a comment. I'm not sure if this consideration outweighs the UI complexity / overwhelmingness / duplicative-ness concern. 

If both are kept, one could:

  • strong disagree vote
  • disagree react
  • agree anti-react

All on the same post or comment. This could be interpreted as "super" disagreement bordering on hostility, or it could just be the result of a confused user unsure how to communicate their disagreement, but who wants to be extra-sure that it is communicated.

Comment by Max H (Maxc) on Where do you lie on two axes of world manipulability? · 2023-05-26T23:39:57.085Z · LW · GW

I think the modeling dimension to add is "how much trial and error is needed".

I tried to capture that in the tractability axis with "how much resources / time / data / experimentation / iteration..." in the second bullet point.

Could an SI spit out a recipe for a killer virus just from reading current literature? I doubt it.

The genome for smallpox is publicly available, and people are no longer vaccinated against it. I think it's at least plausible that a SI could ingest that data, existing gain-of-function research, and tools like AlphaFold (perhaps improving on them using its own insights, creativity, and simulation capabilities) and then come up with something pretty deadly and vaccine-resistant without experimentation in a wet lab.

Comment by Max H (Maxc) on Bandgaps, Brains, and Bioweapons: The limitations of computational science and what it means for AGI · 2023-05-26T17:14:11.029Z · LW · GW

Given reasonable computational time (say, a month), can the AI, using my chatlog alone, guess my password right on the first guess? 

"using my chatlog alone" appears to be doing a lot of work in this example. Human-built computer systems are notoriously bug-filled and exploitable, even by other humans. Why would an AI not also be capable of exploiting such vulnerabilities?[1]

Explorations of and arguments about limits of physical possibility based on computational physics and other scientific domains can lead to valuable research and interesting discussion, and I'm with you up until point (4) in your summary. But for forecasting the capabilities and actions of a truly smarter-than-you adversarial agent, it's important to confront the problem under the widest possible threat model, in the least convenient possible world and under the highest degree of difficulty. 

This post is a great example of the kind of object-level argument I gesture at in this recently-published post.  My point there is mainly: I think concrete, science-backed explorations of the limits of what is possible and tractable are great tools for world model building. But I find them pretty uncompelling when used as forecasts about how AGI takeover is likely to go, or as arguments for why such takeover is unlikely. I think an analogy to computer security is a good way of explaining this intuition. From another recent post of mine:

Side-channels are ubiquitous attack vectors in the field of computer security and cryptography. Timing attacks and other side-effect based attacks can render cryptographic algorithms which are provably secure under certain threat models, completely insecure when implemented on real hardware, because the vulnerabilities are at lower levels of abstraction than those considered in the threat model.

Proving that something is computationally intractable under a certain restricted model only means that the AI must find a way to step outside of your model, or do something else you didn't think of.
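To make the timing-attack point concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the secret, the function names, and the assumptions that the attacker knows the secret's length and can observe how much work the check did (a deterministic stand-in for measuring running time). Guessing a 7-byte secret outright means searching up to 256^7 possibilities; with the leak, at most 256 * 7 guesses are enough.

```python
import hmac

SECRET = b"hunter2"  # hypothetical secret, for illustration only


def leaky_check(guess):
    """Early-exit comparison. Returns (ok, work), where `work` counts the
    byte comparisons performed and stands in for measurable running time."""
    work = 0
    if len(guess) != len(SECRET):
        return False, work
    for g, s in zip(guess, SECRET):
        work += 1
        if g != s:
            return False, work
    return True, work


def recover_secret(length):
    """Recover the secret one byte at a time using only the leaked work count."""
    known = b""
    for pos in range(length):
        padding = bytes(length - pos - 1)
        for candidate in range(256):
            guess = known + bytes([candidate]) + padding
            ok, work = leaky_check(guess)
            # A correct byte lets the comparison run past position `pos`.
            if ok or work > pos + 1:
                known += bytes([candidate])
                break
    return known


def safe_check(guess):
    """Comparison designed to avoid content-dependent timing (stdlib)."""
    return hmac.compare_digest(guess, SECRET)


assert recover_secret(len(SECRET)) == SECRET
```

The comparison is "correct" at the level of abstraction where it only returns True or False; the vulnerability lives one level below that, which is the pattern the quoted passage describes.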


  1. ^

     Many vulnerabilities are only discoverable by humans when those humans have access to source code or at least binaries of the system under target. But this also doesn't seem like a fatal problem for the AI: even if the exact source code for the system the AI is running on, and / or the code for the system protecting the password, does not appear in the AI's training data, source code for many similar systems likely does.

Comment by Max H (Maxc) on Where do you lie on two axes of world manipulability? · 2023-05-26T15:13:14.577Z · LW · GW

That does seem like a good axis for identifying cruxes of takeover risk. Though I think "how hard is world takeover" is mostly a function of the first two axes? If you think there are lots of tasks (e.g. creating a digital dictatorship, or any subtasks thereof) which are both possible and tractable, then you'll probably end up pretty far along the "vulnerable" axis.

I also think the two axes alone are useful for identifying differences in world models, which can help to identify cruxes and interesting research or discussion topics, apart from any implications those different world models have for AI takeover risk or anything else to do with AI specifically.

If you think, for example, that nanotech is relatively tractable, that might imply that you think there are promising avenues for anti-aging or other medical research that involve nanotech, AI-assisted or not.

Comment by Max H (Maxc) on Where do you lie on two axes of world manipulability? · 2023-05-26T13:09:42.623Z · LW · GW

Not sure what you mean by "happening naturally". There are lots of inventions that are the result of human activity which we don't observe anywhere else in the universe - an internal combustion engine or a silicon CPU do not occur naturally, for example. But inventing these doesn't seem very hard in an absolute sense.


It happens to all of us all the time to various degrees, without us realizing it.

Yes, and I think that puts certain kinds of brain hacking squarely in the "possible" column. The question is then how tractable, and to what degree is it possible to control this process, and under what conditions. Is it possible (even in principle, for a superintelligence) to brainwash a randomly chosen human just by making them watch a short video? How short?

Comment by Max H (Maxc) on Open Thread With Experimental Feature: Reactions · 2023-05-26T00:13:41.149Z · LW · GW

I anti-agree with this comment. I also anti-disagree with it! 

Comment by Max H (Maxc) on A rejection of the Orthogonality Thesis · 2023-05-24T21:40:32.200Z · LW · GW

  1. There is a large mind design space. Do we have any actual reasons for thinking so? Sure, one can argue everything has a large design space, but in practice, there's often an underlying unique mechanism for how things work.

I don't see how this relates to the Orthogonality Thesis. For a given value or goal, there may be many different cognitive mechanisms for figuring out how to accomplish it, or there may be few, or there may be only one unique mechanism. Different cognitive mechanisms (if they exist) might lead to the same or different conclusions about how to accomplish a particular goal. 

For some goals, such as re-arranging all atoms in the universe in a particular pattern, it may be that there is only one effective way of accomplishing such a goal, so whether different cognitive mechanisms are  able to find the strategy for accomplishing such a goal is mainly a question of how effective those cognitive mechanisms are. The Orthogonality Thesis is saying, in part, that figuring out how to do something is independent of wanting to do something, and that the space of possible goals and values is large. If I were smarter, I probably could figure out how to tile the universe with tiny squiggles, but I don't want to do that, so I wouldn't.

2. Ethics are not an emergent property of intelligence - but again, that's just an assertion. There's no reason to believe or disbelieve it. It's possible that self-reflection (and hence ethics and the ability to question one's goals and motivations) is a pre-requisite for general cognition - we don't know whether this is true or not because we don't really understand intelligence yet.


I don't see what ability to self-reflect has to do with ethics. It's probably true that anything superintelligent is capable, in some sense, of self-reflection, but why would that be a problem for the Orthogonality Thesis? Do you believe that an agent which terminally values tiny molecular squiggles would "question its goals and motivations" and conclude that creating squiggles is somehow "unethical"? If so, maybe review the metaethics sequence; you may be confused about what we mean around here when we talk about ethics, morality, and human values.


The previous two are assertions that could be true, but reflective stability is definitely not true - it's paradoxical.


I think reflective stability, as it is usually used on LW, means something more narrow than how you're interpreting it, and is not paradoxical. It's usually used to describe a property of an agent following a particular decision theory. For example, a causal decision theory agent is not reflectively stable, because on reflection, it will regret not having pre-committed in certain situations. Logical decision theories are more reflectively stable in the sense that their adherents do not need to pre-commit to anything, and will therefore not regret not making any pre-commitments when reflecting on their own minds and decision processes, and how they would behave in hypothetical or future situations.

Comment by Max H (Maxc) on Open Thread With Experimental Feature: Reactions · 2023-05-24T19:00:18.301Z · LW · GW

I like the wide variety of possible reactions available in Slack and Discord, though I think for LW, the default / starting set could be a bit smaller, to reduce complexity / overwhelming-ness of picking an appropriate reaction.

Reactions I'd strike: 

  • Additional questions (I'd feel a bit disconcerted if I received this reaction without an accompanying comment.)
  • Strawman (kinda harsh for a reaction)
  • Concrete (this is either covered by an upvote, or seems like faint praise if not accompanied by an upvote.)
  • one of "key insight" or "insightful" and one of "too harsh" or "combative" (too much overlap)


But maybe it's easier to wait and see which reactions are used least often, and then eliminate those.

Comment by Max H (Maxc) on My May 2023 priorities for AI x-safety: more empathy, more unification of concerns, and less vilification of OpenAI · 2023-05-24T01:12:59.258Z · LW · GW

I think it's great for prominent alignment / x-risk people to summarize their views like this. Nice work!

Somewhat disorganized thoughts and reactions to your views on OpenAI:

It's possible that their charter, recent behavior of executive(s), and willingness to take public stances are net-positive relative to a hypothetical version of OA which behaved differently, but IMO the race dynamic that their founding, published research, and product releases have set off is pretty clearly net-negative, relative to the company not existing at all.

It also seems plausible that OpenAI's existence is directly or indirectly responsible for events like the Google DeepMind merger, Microsoft's AI capabilities and interest, and general AI capabilities hype. In a world where OA doesn't get founded, perhaps DeepMind plugs along quietly and slowly towards AGI, fully realizing the true danger of their work before there is much public or market hype.

But given that OpenAI does exist already, and there are some cats which are already out of the bag, it's true that many of their current actions are much better than those of the worst possible version of an AI company.

As far as vilifying or criticizing goes, I don't have strong views on what public or "elite" opinion of OpenAI should be, or how anyone here should try to manage or influence it. Some public criticism (e.g. about data privacy or lack of transparency / calls for even more openness) does seem frivolous or even actively harmful / wrong to me. I agree that human survival probably depends on the implementation of fairly radical regulatory and governance reform, and find it plausible that OpenAI is currently doing a lot of positive work to actually bring about such reform. So it's worth calling out bad criticism when we see it, and praising OA for things they do that are praiseworthy, while still being able to acknowledge the negative aspects of their existence.

Comment by Max H (Maxc) on Worrying less about acausal extortion · 2023-05-23T03:15:07.982Z · LW · GW

If you want more technical reasons for why you shouldn't worry about this, I think Decision theory does not imply that we get to have nice things is relevant. In humans, understanding exotic decision theories also doesn't imply bad things, because (among other reasons) understanding a decision theory, even on a very deep and / or nuts-and-bolts level, is different from actually implementing it.[1]


  1. ^

    Planecrash is a work of fiction that may give you a deep sense for the difference between understanding and implementing a decision theory, but I wouldn't exactly recommend it for anyone suffering from anxiety, or anyone who doesn't want to spend a lot of time alleviating their anxiety.

Comment by Max H (Maxc) on AI Will Not Want to Self-Improve · 2023-05-16T23:30:55.314Z · LW · GW

There are lots of ways current humans self-improve without much fear and without things going terribly wrong in practice, through medication (e.g. adderall, modafinil), meditation, deliberate practice of rationality techniques, and more.

There are many more kinds of self-improvement that seem safe enough that many humans will be willing and eager to try as the technologies improve.

If I were an upload running on silicon, I would feel pretty comfortable swapping in improved versions of the underlying hardware I was running on (faster processors, more RAM, better network speed, reliability / redundancy, etc.)

I'd be more hesitant about tinkering with the core algorithms underlying my cognition, but I could probably get pretty far with "cyborg"-style enhancements like grafting a calculator or a search engine directly into my brain.  After making the improvements that seem very safe, I might be able to make further self-improvements safely, for two reasons: (a) I have gained confidence and knowledge experimenting with small, safe self-improvements, and (b) the cyborg improvements have made me smarter, giving me the ability to prove the safety of more fundamental changes.

Whether we call it wanting to self-improve or not, I do expect that most human-level AIs will at least consider self-improvement for instrumental convergence reasons. It's probably true that in the limit of self-improvement, the AI will need to solve many of the same problems that alignment researchers are currently working on, and that might slow down any would-be superintelligence for some hard-to-predict amount of time.

Comment by Max H (Maxc) on Reward is the optimization target (of capabilities researchers) · 2023-05-15T13:38:00.864Z · LW · GW

Hmm, I'm not sure anyone is "making an assertion that we expect to hold no matter how much the AI is scaled up.", unless scaling up means something pretty narrow like applying current RL algorithms to larger and larger networks and more and more data.

But you're probably right that my claim is not strictly a narrowing of the original. FWIW, I think both your (1) and (2) above are pretty likely when talking about current and near-future systems, as they scale to human levels of capability and agency, but not necessarily beyond.

I read the original post as talking mainly about current methods for RL, applied to future systems, though TurnTrout and I probably disagree on when it makes sense to start calling a system an "RL agent".

Also, regarding your thought experiment - of course, if in training the AI finds some way to cheat, that will be reinforced! But that has limited relevance for when cheating in training isn't possible.

As someone who has worked in computer security, and also written and read a lot of Python code, my guess is that cheating at current RL training processes as actually implemented is very, very possible for roughly human-level agents. (That was the other point of my post on gradient hacking.)

Comment by Max H (Maxc) on Bayesian Networks Aren't Necessarily Causal · 2023-05-14T14:40:37.441Z · LW · GW

If both networks give the same answers to marginal and conditional probability queries, that amounts to them making the same predictions about the world.

Does it? A bunch of probability distributions on opaque variables like  and  seems like it is missing something in terms of making predictions about any world. Even if you relabel the variables with more suggestive names like "rain" and "wet", that's a bit like manually programming in a bunch of IS-A() and HAS-A() relationships and if-then statements into a 1970s AI system.

Bayes nets are one component for understanding and formalizing causality, and they capture something real and important about its nature. The remaining pieces involve concepts that are harder to encode in simple, traditional algorithms, but that doesn't make them any less real or ontologically special, nor does it make Bayes nets useless or flawed.

Without all the knowledge about what words like rain and wetness and slippery mean, you might be better off replacing these labels with things like "bloxor" and "greeblic". You could then still do interventions on the network to learn something about whether the data you have suggests that bloxors causing greeblic-ness is a simpler hypothesis than greeblic-ness causing bloxors. Without Bayes nets (or something isomorphic to them in conceptspace), you'd be totally lost in an unfamiliar world of bloxors and greeblic-ness. But there's still a missing piece for explaining causality that involves using physics (or higher-level domain-specific knowledge) about what bloxors and greeblic-ness represent to actually make predictions about them.
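For concreteness, here is a small sketch of the quoted claim about two networks answering the same queries, using a made-up joint distribution over two binary variables (the numbers and helper names are assumptions for illustration). Factoring the joint as bloxor -> greeblic or as greeblic -> bloxor reproduces exactly the same table, so purely observational marginal and conditional queries can't tell the two arrow directions apart.

```python
from fractions import Fraction as F

# Hypothetical joint distribution over (bloxor, greeblic), both binary.
joint = {
    (0, 0): F(4, 10), (0, 1): F(1, 10),
    (1, 0): F(1, 10), (1, 1): F(4, 10),
}


def marginal(joint, index):
    """P(variable at `index`) as a dict: value -> probability."""
    out = {}
    for values, p in joint.items():
        out[values[index]] = out.get(values[index], 0) + p
    return out


def conditional(joint, child, parent):
    """P(child | parent) as a dict: (child_value, parent_value) -> probability."""
    parent_marginal = marginal(joint, parent)
    out = {}
    for values, p in joint.items():
        key = (values[child], values[parent])
        out[key] = out.get(key, 0) + p / parent_marginal[values[parent]]
    return out


# Network 1: bloxor -> greeblic, i.e. P(b) * P(g | b).
p_b, p_g_given_b = marginal(joint, 0), conditional(joint, 1, 0)
net1 = {(b, g): p_b[b] * p_g_given_b[(g, b)] for (b, g) in joint}

# Network 2: greeblic -> bloxor, i.e. P(g) * P(b | g).
p_g, p_b_given_g = marginal(joint, 1), conditional(joint, 0, 1)
net2 = {(b, g): p_g[g] * p_b_given_g[(b, g)] for (b, g) in joint}

assert net1 == net2 == joint
```

Nothing in either table says what a bloxor is; that part has to come from outside the network.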

(I read this post as claiming implicitly that the original post (or its author) is missing or forgetting some of the points above. I did find this post useful and interesting as an exploration and explanation of the nuts and bolts of Bayes nets, but I don't think I was left confused or misled by the original piece.)

Comment by Max H (Maxc) on leogao's Shortform · 2023-05-13T21:49:06.649Z · LW · GW

I can see how this could be a frustrating pattern for both parties, but I think it's often an important conversation tree to explore when person 1 (or anyone) is using results about P in restricted domains to make larger claims or arguments about something that depends on solving P at the hardest difficulty setting in the least convenient possible world.

As an example, consider the following three posts:

I think both of the first two posts are valuable and important work on formulating and analyzing restricted subproblems. But I object to citation of the second post (in the third post) as evidence in support of a larger point that doom from mesa-optimizers or gradient descent is unlikely in the real world, and object to the second post to the degree that it is implicitly making this claim.

There's an asymmetry when person 1 is arguing for an optimistic view on AI x-risk and person 2 is arguing for a doomer-ish view, in the sense that person 1 has to address all counterarguments but person 2 only has to find one hole. But this asymmetry is unfortunately a fact about the problem domain and not the argument / discussion pattern between 1 and 2.

Comment by Max H (Maxc) on LLM cognition is probably not human-like · 2023-05-13T18:30:36.755Z · LW · GW

These papers are interesting, thanks for compiling them!

Skimming through some of them, the sense I get is that they provide evidence for the claim that the structure and function of LLMs is similar to (and inspired by) the structure of particular components of human brains, namely, the components which do language processing. 

This is slightly different from the claim I am making, which is about how the cognition of LLMs compares to the cognition of human brains as a whole. My comparison is slightly unfair, since I'm comparing a single forward pass through an LLM to get a prediction of the next token, to a human tasked with writing down an explicit probability distribution on the next token, given time to think, research, etc. [1]

Also, LLM capability at language processing / text generation is already far superhuman (by some metrics). The architecture of LLMs may be simpler than the comparable parts of the brain's architecture in some ways, but the LLM version can run with far more precision / scale / speed than a human brain. Whether or not LLMs are already exceeding human brains by specific metrics is debatable / questionable, but they are not bottlenecked on further scaling by biology.

And this is to say nothing of all the other kinds of cognition that happen in the brain. I see these brain components as analogous to LangChain or AutoGPT, if LangChain or AutoGPT themselves were written as ANNs that interfaced "natively" with the transformers of an LLM, instead of as Python code.

Finally, similarity of structure doesn't imply similarity of function. I elaborated a bit on this in a comment thread here.


  1. ^

    You might be able to get better predictions from an LLM by giving it more "time to think", using chain-of-thought prompting or other methods. But these are methods humans use when using LLMs as a tool, rather than ideas which originate from within the LLM itself, so I don't think it's exactly fair to call them "LLM cognition" on their own.

Comment by Max H (Maxc) on Gradient hacking via actual hacking · 2023-05-13T17:05:10.651Z · LW · GW

It's overkill in some sense, yes, but the thing I was trying to demonstrate with the human-alien thought experiment is that hacking the computer or system that is doing the training might actually be a lot easier than gradient hacking directly via solving a really difficult, possibly intractable math problem.

Hacking the host doesn't require the mesa-optimizer to have exotic superintelligence capabilities, just ordinary human programmer-level exploit-finding abilities. These hacking capabilities might be enough to learn the expected outputs during training through a sidechannel and effectively stop the gradient descent process, but not be sufficient to completely take over the host system and manipulate it in undetectable ways.

Comment by Max H (Maxc) on Aggregating Utilities for Corrigible AI [Feedback Draft] · 2023-05-13T16:49:12.799Z · LW · GW

I like these ideas. Personally, I think a kitchen-sink approach to corrigibility is the way to go.

Some questions and comments:

  • Can the behavior of a sufficiently smart and reflective agent which uses utility aggregation with sweetening and effort penalties be modeled as a non-corrigible / utility-maximizing agent with a more complicated utility function? What would such a utility function look like, if so? Does constructing such a model require drawing the boundaries around the agent differently (perhaps to include humans within), or otherwise require that the agent itself has a somewhat contrived view / ontology related to its own sense of "self"?
  • You cited a bunch of Russell's work, but I'd be curious for a more nuts-and-bolts analysis about how your ideas relate and compare to CIRL specifically.
  • Is utility aggregation related to geometric rationality in any way? The idea of aggregating utilities across possible future selves seems philosophically similar.

Comment by Max H (Maxc) on Reality and reality-boxes · 2023-05-13T15:43:10.274Z · LW · GW

A couple of related ideas and work you might find interesting / relevant:

But suppose that instead you ask the question:

Given such-and-such initial conditions, and given such-and-such cellular automaton rules, what would be the mathematical result?

Not even God can modify the answer to this question, unless you believe that God can implement logical impossibilities.  Even as a very young child, I don't remember believing that.  (And why would you need to believe it, if God can modify anything that actually exists?)

Comment by Max H (Maxc) on Orthogonal's Formal-Goal Alignment theory of change · 2023-05-13T01:38:54.860Z · LW · GW

Also, regarding ontologies and having a formal goal which is ontology-independent (?): I'm curious for Orthogonal's take on e.g. Finding gliders in the game of life, in terms of the role you see for this kind of research, whether the conclusions and ideas in that post specifically are on the right track, and how they relate to QACI.

Comment by Max H (Maxc) on Orthogonal's Formal-Goal Alignment theory of change · 2023-05-13T01:33:51.890Z · LW · GW

(Posting as a top-level comment, but this is mainly a response to @the gears to ascension's request for perspectives here.)

I like this:

One core aspect of our theory of change is backchaining: come up with an at least remotely plausible story for how the world is saved from AI doom, and try to think about how to get there.

as a general strategy. In terms of Orthogonal's overall approach and QACI specifically, one thing I'd like to see more of is how it can be applied to relatively easier (or at least, plausibly easier) subproblems like the symbol grounding problem, corrigibility, and diamond maximization, separately and independently from using it to solve alignment in general.

I can't find the original source, but I think someone (Nate Soares, maybe somewhere in the 2021 MIRI conversations?), once said something somewhere that robust alignment strategies should scale and degrade gracefully: they shouldn't depend on solving only the hardest problem, and avoiding catastrophic failure shouldn't depend on superintelligent capability levels. (I might be mis-remembering or imagining this wholesale, but I agree with the basic idea, as I've stated it here. Another way of putting it: ideally, you want some "capabilities" parameter in a system that you can dial up gradually, and then turn that dial just enough to solve the weakest problem that ends the acute risk period. Maybe afterwards, you use the same system, dialed up even further, to bring about the GTF, but regardless, you should be able to do easy things before you do hard things.)

I'm not sure that QACI doesn't have these desiderata, but I'm not sure that it does either. 

In any case, very much looking forward to more from Orthogonal!

Comment by Max H (Maxc) on Max H's Shortform · 2023-05-13T00:17:38.046Z · LW · GW

Using shortform to register a public prediction about the trajectory of AI capabilities in the near future: the next big breakthroughs, and the most capable systems within the next few years, will look more like generalizations of MuZero and Dreamer, and less like larger / better-trained / more efficient large language models.

Specifically, SoTA AI systems (in terms of generality and problem-solving ability) will involve things like tree search and / or networks which are explicitly designed and trained to model the world, as opposed to predicting text or generating images.

These systems may contain LLMs or diffusion models as components, arranged in particular ways to work together. This arranging may be done by humans or AI systems, but it will not be performed "inside" a current-day / near-future GPT-based LLM, nor via direct execution of the text output of such LLMs (e.g. by executing code the LLM outputs, or having the instructions for arrangement otherwise directly encoded in a single LLM's text output). There will recognizably be something like search or world modeling that happens outside or on top of a language model.
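As a toy illustration of what "search that happens outside or on top of" a model could look like, here is a minimal sketch. Everything in it is an assumption made for illustration: `propose_moves` is a hypothetical stand-in for a learned component (such as an LLM call), the integer puzzle is made up, and plain breadth-first search stands in for whatever search procedure a real system would use. It is not a description of MuZero, Dreamer, or any actual system.

```python
from collections import deque
from typing import Callable, Iterable, List, Optional


def propose_moves(state: int) -> Iterable[int]:
    """Hypothetical stand-in for a learned proposer (e.g. an LLM call).

    Here it is just a fixed toy rule: add 3 or double the state.
    """
    return (state + 3, state * 2)


def search(
    start: int,
    goal: int,
    propose: Callable[[int], Iterable[int]],
    max_depth: int = 10,
) -> Optional[List[int]]:
    """Breadth-first search run outside the proposer.

    The outer loop, not the proposer, decides which candidates get
    expanded, tracks visited states, and checks for the goal.
    """
    frontier = deque([[start]])
    seen = {start}
    while frontier:
        path = frontier.popleft()
        if path[-1] == goal:
            return path
        if len(path) > max_depth:
            continue
        for nxt in propose(path[-1]):
            # Toy pruning rule: both moves only increase a positive
            # state, so anything past the goal can be dropped.
            if nxt not in seen and nxt <= goal:
                seen.add(nxt)
                frontier.append(path + [nxt])
    return None


path = search(1, 11, propose_moves)
unreachable = search(1, 3, propose_moves)
```

The point of the sketch is only the division of labor: the proposer suggests candidates, while the bookkeeping and the decision about what to expand next live in ordinary code wrapped around it.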


I'm making this prediction because I was listening to Paul Christiano's appearance on the Bankless podcast from a few weeks ago.

Around the 28:00 mark the hosts ask Paul if we should be concerned about AI developments from vectors other than LLM-like systems, broadly construed.

Paul's own answer is good and worth listening to on its own (up to the 33 minute mark), but I think he leaves out (at least in this part of the podcast) the actual answer to the question, which is that, yes, there are other avenues of AI development that don't involve larger networks, more training data, and more generalized prediction and generation abilities.

I have no special / non-public knowledge about what is likely to be promising here (and wouldn't necessarily speculate if I did), but I get the sense that the zeitgeist among some people (not necessarily Paul himself) in alignment and x-risk focused communities is that model-based RL systems and relatively complicated architectures like MuZero have recently been left somewhat in the dust by advances in LLMs. I think capabilities researchers absolutely do not see things this way, and they will not overlook these methods as avenues for further advancing capabilities. Alignment and x-risk focused researchers should be aware of this avenue if they want to have accurate models of what the near future plausibly looks like.

Comment by Max H (Maxc) on Open & Welcome Thread - May 2023 · 2023-05-11T22:20:06.349Z · LW · GW

Are there any plans to add user-visible analytics features to LW, like on the EA forum? Screenshot from an EA forum draft for reference:

I'm often pretty curious about whether the reception to a post is lukewarm, or if most people just didn't see it before it dropped off the front page.

Analytics wouldn't tell me everything, but it would allow me to distinguish between the case that people are clicking and then not reading / voting vs. not clicking through at all.

On the EA forum, analytics are only viewable by the post author. I'm most interested in analytics for my own posts (e.g. my recent post, which received just a single vote), but I think having the data be public, or at least having it be configurable that way, is another possibility. Maybe this could take the form of a "view counter", like Twitter recently introduced, with an option for the post author to control whether it shows up or not.

Comment by Max H (Maxc) on In defence of epistemic modesty [distillation] · 2023-05-10T15:36:28.496Z · LW · GW

You should practice strong epistemic modesty: On a given issue, adopt the view experts generally hold, instead of the view you personally like.

What if the general view of experts (specifically, experts in the field of epistemics) is that at least some people, and perhaps me personally, should not always practice strong epistemic modesty?

Even if you believe that deferring to the specific people above as experts is pathological, suppose we show this post (or the original) to a wider and more diverse group of experts and get their consensus, weighted by our belief about each expert's epistemic virtue. Suppose the (virtue-weighted) consensus view among a wider class of experts is that strong epistemic modesty is sometimes or generally a bad idea, or misguided in some way. Should we then (under epistemic modesty) be pretty skeptical of epistemic modesty itself?
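To make the "virtue-weighted consensus" in this thought experiment concrete, here is a minimal sketch that treats it as a weighted average of expert credences. The numbers are entirely made up, and modeling consensus as a weighted mean (with `weighted_consensus` as a hypothetical helper name) is my own simplifying assumption, not something the original post specifies.

```python
from typing import List, Tuple


def weighted_consensus(views: List[Tuple[float, float]]) -> float:
    """Weighted mean of credences.

    Each entry is (credence, weight): the expert's credence that strong
    epistemic modesty is a good policy, and our weight on that expert's
    epistemic virtue.
    """
    total_weight = sum(weight for _, weight in views)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(credence * weight for credence, weight in views) / total_weight


# Made-up example: one enthusiastic expert with low weight and two
# skeptical experts with higher weight.
views = [(0.9, 1.0), (0.2, 3.0), (0.3, 2.0)]
consensus = weighted_consensus(views)  # (0.9 + 0.6 + 0.6) / 6 = 0.35
```

With these illustrative numbers the consensus credence is 0.35, so someone applying strong epistemic modesty would be deferring to a view that leans against strong epistemic modesty.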

Note, this is a distinct issue from deciding which experts to trust: I'm assuming there is widespread consensus on how to determine the "all things considered" outside view on the validity / usefulness of epistemic modesty, and then asking: what if that view is that it's not valid?

Apologies if this sounds like a slightly troll-ish objection, but if so, I remark that calls for epistemic modesty during object-level arguments often have the same sort of troll-ish feeling!