Thoughts on gradient hacking 2021-09-03T13:02:44.181Z
A short introduction to machine learning 2021-08-30T14:31:45.357Z
SL4 in more legible format 2021-08-12T06:36:03.371Z
Deep limitations? Examining expert disagreement over deep learning 2021-06-27T00:55:53.327Z
Distinguishing claims about training vs deployment 2021-02-03T11:30:06.636Z
Deutsch and Yudkowsky on scientific explanation 2021-01-20T01:00:04.235Z
Some thoughts on risks from narrow, non-agentic AI 2021-01-19T00:04:10.108Z
Excerpt from Arbital Solomonoff induction dialogue 2021-01-17T03:49:47.405Z
Why I'm excited about Debate 2021-01-15T23:37:53.861Z
Meditations on faith 2021-01-15T22:20:02.651Z
Eight claims about multi-agent AGI safety 2021-01-07T13:34:55.041Z
Commentary on AGI Safety from First Principles 2020-11-23T21:37:31.214Z
Continuing the takeoffs debate 2020-11-23T15:58:48.189Z
My intellectual influences 2020-11-22T18:00:04.648Z
Why philosophy of science? 2020-11-07T11:10:02.273Z
Responses to Christiano on takeoff speeds? 2020-10-30T15:16:02.898Z
Reply to Jebari and Lundborg on Artificial Superintelligence 2020-10-25T13:50:23.601Z
AGI safety from first principles: Conclusion 2020-10-04T23:06:58.975Z
AGI safety from first principles: Control 2020-10-02T21:51:20.649Z
AGI safety from first principles: Alignment 2020-10-01T03:13:46.491Z
AGI safety from first principles: Goals and Agency 2020-09-29T19:06:30.352Z
AGI safety from first principles: Superintelligence 2020-09-28T19:53:40.888Z
AGI safety from first principles: Introduction 2020-09-28T19:53:22.849Z
Safety via selection for obedience 2020-09-10T10:04:50.283Z
Safer sandboxing via collective separation 2020-09-09T19:49:13.692Z
The Future of Science 2020-07-28T02:43:37.503Z
Thiel on Progress and Stagnation 2020-07-20T20:27:59.112Z
Environments as a bottleneck in AGI development 2020-07-17T05:02:56.843Z
A space of proposals for building safe advanced AI 2020-07-10T16:58:33.566Z
Arguments against myopic training 2020-07-09T16:07:27.681Z
AGIs as collectives 2020-05-22T20:36:52.843Z
Multi-agent safety 2020-05-16T01:59:05.250Z
Competitive safety via gradated curricula 2020-05-05T18:11:08.010Z
Against strong bayesianism 2020-04-30T10:48:07.678Z
What is the alternative to intent alignment called? 2020-04-30T02:16:02.661Z
Melting democracy 2020-04-29T20:10:01.470Z
Richard Ngo's Shortform 2020-04-26T10:42:18.494Z
What achievements have people claimed will be warning signs for AGI? 2020-04-01T10:24:12.332Z
What information, apart from the connectome, is necessary to simulate a brain? 2020-03-20T02:03:15.494Z
Characterising utopia 2020-01-02T00:00:01.268Z
Technical AGI safety research outside AI 2019-10-18T15:00:22.540Z
Seven habits towards highly effective minds 2019-09-05T23:10:01.020Z
What explanatory power does Kahneman's System 2 possess? 2019-08-12T15:23:20.197Z
Why do humans not have built-in neural i/o channels? 2019-08-08T13:09:54.072Z
Book review: The Technology Trap 2019-07-20T12:40:01.151Z
What are some of Robin Hanson's best posts? 2019-07-02T20:58:01.202Z
On alien science 2019-06-02T14:50:01.437Z
A shift in arguments for AI risk 2019-05-28T13:47:36.486Z
Would an option to publish to AF users only be a useful feature? 2019-05-20T11:04:26.150Z
Which scientific discovery was most ahead of its time? 2019-05-16T12:58:14.628Z


Comment by Richard_Ngo (ricraz) on Buck's Shortform · 2021-09-20T15:44:14.845Z · LW · GW

For example, people who care a lot about particular causes might end up getting really mindkilled by politics, or might end up strongly affiliated with groups that have false beliefs as part of their tribal identity.

These both seem pretty common, so I'm curious about the correlation that you've observed. Is it mainly based on people you know personally? In that case I expect the correlation not to hold amongst the wider population.

Also, a big effect which probably doesn't show up much amongst the people you know: younger people seem more altruistic (or at least signal more altruism) and also seem to have worse epistemics than older people.

Comment by Richard_Ngo (ricraz) on Writing On The Pareto Frontier · 2021-09-17T14:48:40.116Z · LW · GW

I notice that you didn't give any arguments in this post for why people should try to write on the Pareto frontier. Intuitively it seems like something you should sometimes aim for, and sometimes not. But it seems like you find it a more compelling goal than I do, if you've made it a rule for yourself. Would you mind briefly explaining the main reasons you think this is a good goal?

Also, do you intend this to apply to fiction as well as nonfiction?

Comment by Richard_Ngo (ricraz) on Thoughts on gradient hacking · 2021-09-06T14:41:18.072Z · LW · GW

I discuss the possibility of it going in some other direction when I say "The two most salient options to me". But the bit of Evan's post that this contradicts is:

Now, if the model gets to the point where it's actually just failing because of this, then gradient descent will probably just remove that check—but the trick is never to actually get there.

Comment by Richard_Ngo (ricraz) on Formalizing Objections against Surrogate Goals · 2021-09-03T13:42:46.129Z · LW · GW

Interesting report :) One quibble:

For one, our AIs can only use “things like SPI” if we actually formalize the approach

I don't see why this is the case. If it's possible for humans to start using things like SPI without a formalisation, why couldn't AIs too? (I agree it's more likely that we can get them to do so if we formalise it, though.)

Comment by Richard_Ngo (ricraz) on Combining the best of Georgian and Harberger taxes · 2021-08-12T11:46:19.410Z · LW · GW

It needs to be higher than what the competitor is willing to pay since they can always counter offer.

Agreed. My previous comment was incorrect; it should have been "it just needs to be more than what it's sustainable for the competitor to pay". The underlying problem is still that the cost of disruption may be far higher than the sustainable price to rent each piece of land. So if I gain a billion dollars from Apple's supply chains being cut, then I can bid up to a billion dollars for each piece of land used in their supply chain, and the only way Apple can stop me is by paying a billion dollars of rent per piece of land.

You could also require that a bid to take over a piece of land stands for at least a length of time equal to K * (total value of current property / bid rental price) for some K

How would we determine the total value of the current property?

Comment by Richard_Ngo (ricraz) on Combining the best of Georgian and Harberger taxes · 2021-08-12T09:01:13.274Z · LW · GW

The thing is, it doesn't have to be a "ridiculous" price, it just needs to be more than what the competitor can sustainably pay. Imagine the cost to Apple of not being able to produce any smartphones for a year because a competitor had snatched a crucial factory - it's probably orders of magnitude greater than the rental cost of all the land they use. And remember that the attacker only has to remove one piece of the supply chain, whereas the defender has to ensure that they maintain control over every piece of infrastructure needed.

Given this disparity, you wouldn't even need malicious destruction, you could just snatch pieces of land from under them and wait it out.

(Re tit-for-tat, see patent wars - these things do happen!)

Comment by Richard_Ngo (ricraz) on Combining the best of Georgian and Harberger taxes · 2021-08-12T06:49:49.959Z · LW · GW

Suppose my competitor has just built a factory on land they are renting for $X. I offer $X+1 rent for a day, then demolish their factory, then give the land back to them to rent for $X.

You mention that the exact details will need to be hashed out separately, but I don't think they can prevent this problem, because it derives from the asymmetry between renting being temporary and demolition being permanent.

For example, you might say that the original owner can just outbid me. But I only need the land for the shortest available unit of time in order to destroy their factory and drive them out of business. Suppose that if this happens, I'll gain 20% of the profits my competitor would have earned. So they'll need to pay 20% of their profits for each time period that I can bid on the land.

Insurance doesn't solve this either, because my competitor will still be paying the same cost in expectation, just to the insurance company.
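To make the asymmetry concrete, here's a toy sketch of the attack economics (all numbers are hypothetical, and the functions are my own illustrative stand-ins, not anything from the post under discussion):

```python
# Toy model of the demolition attack described above. Under a Harberger-style
# rule, the defender keeps the land only by matching the highest standing bid.
# An attacker who gains `attacker_gain` from destroying a rival's factory can
# profitably bid up to that gain for a single period, so the defender's rent
# is driven towards the attacker's gain, not the land's market value.

def max_profitable_bid(attacker_gain, demolition_cost=0):
    """Highest one-period rent the attacker can pay and still profit."""
    return attacker_gain - demolition_cost

def defender_rent_per_period(attacker_gain, demolition_cost=0):
    """Rent the defender must pay each period to keep control of the land."""
    return max_profitable_bid(attacker_gain, demolition_cost)

market_rent = 1_000          # sustainable rent for the land itself
attacker_gain = 1_000_000    # profit from the rival's supply chain failing

rent = defender_rent_per_period(attacker_gain)
print(rent)                  # far above the sustainable market rent
print(rent / market_rent)    # the asymmetry: 1000x the land's rental value
```

The point is just that the defender's ongoing cost scales with the attacker's one-off gain, which is why insurance only moves the cost around rather than removing it.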

Comment by Richard_Ngo (ricraz) on My Marriage Vows · 2021-07-21T22:26:05.155Z · LW · GW

I will also never withhold information that [pronoun] would in hindsight prefer to know

This phrasing seems a little strange: partly because the only thing you can actually commit to is never withholding information which you believe [pronoun] will end up wishing they'd known, but also because the optimal communication policy has some non-zero rate of false negatives.

Comment by Richard_Ngo (ricraz) on What will the twenties look like if AGI is 30 years away? · 2021-07-14T22:44:32.991Z · LW · GW

I'd actually be fine with a solution where we all agree to stop using the terms "long timelines" and "short timelines" and just use numbers instead. How does that sound?

Yeah, that sounds very reasonable; let's do that.

Ajeya's report says median 2050, no?

I just checked, and it's at 2055 now. Idk what changed.

Comment by Richard_Ngo (ricraz) on What will the twenties look like if AGI is 30 years away? · 2021-07-14T06:14:22.384Z · LW · GW

The most comprehensive (perhaps only comprehensive?) investigation into this says median +35 years, with 35% credence on 50+, and surveys of experts in ML give even higher numbers. I don't know who you're counting as a timelines expert, but I haven't seen any summaries/surveys of their opinions which justify less than 20 years being the default option.

I'm not saying that this makes your view unreasonable. But presenting your view as consensus without presenting any legible evidence for that consensus is the sort of thing which makes me concerned about information cascades - particularly given the almost-total lack of solid public (or even private, to my knowledge) defences of that view.

Comment by Richard_Ngo (ricraz) on What will the twenties look like if AGI is 30 years away? · 2021-07-14T05:27:32.417Z · LW · GW

I want to push back on calling 20+ years "long timelines". It's a linguistic shift which implicitly privileges a pretty radical position, in a way which is likely to give people a mistaken impression of general consensus.

Comment by Richard_Ngo (ricraz) on Some thoughts on risks from narrow, non-agentic AI · 2021-06-30T20:54:54.441Z · LW · GW

Take a big language model like GPT-3, and then train it via RL on tasks where it gets given a language instruction from a human, and then it gets reward if the human thinks it's done the task successfully.
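The setup described above can be sketched as a toy loop (the policy, tasks, and "human" judge here are my own minimal stand-ins, not a real language model or real human feedback):

```python
# Illustrative sketch of RL from human approval: a policy gets a language
# instruction, produces an output, and is reinforced when a judge (standing
# in for the human) approves. All components are toy stand-ins.

import random

random.seed(0)

INSTRUCTIONS = ["say hi", "say bye"]
ACTIONS = ["hi", "bye"]

# Policy: preference scores for each (instruction, action) pair.
prefs = {(i, a): 0.0 for i in INSTRUCTIONS for a in ACTIONS}

def act(instruction):
    # Pick the currently-preferred action, breaking ties randomly.
    best = max(prefs[(instruction, a)] for a in ACTIONS)
    return random.choice([a for a in ACTIONS if prefs[(instruction, a)] == best])

def human_reward(instruction, action):
    # Toy judge: approves when the output matches the instruction.
    return 1.0 if instruction.endswith(action) else 0.0

# RL loop: sample a task, act, reinforce in proportion to approval.
for _ in range(200):
    instruction = random.choice(INSTRUCTIONS)
    action = act(instruction)
    prefs[(instruction, action)] += human_reward(instruction, action) - 0.5
```

The key feature, for the purposes of the argument, is that the training signal is the judge's approval rather than any direct specification of the task.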

Comment by Richard_Ngo (ricraz) on Richard Ngo's Shortform · 2021-06-28T13:27:31.958Z · LW · GW

A short complaint (which I hope to expand upon at some later point): there are a lot of definitions floating around which refer to outcomes rather than processes. In most cases I think that the corresponding concepts would be much better understood if we worked in terms of process definitions.

Some examples: Legg's definition of intelligence; Karnofsky's definition of "transformative AI"; Critch and Krueger's definition of misalignment (from ARCHES).

Sure, these definitions pin down what you're talking about more clearly - but that comes at the cost of understanding how and why it might come about.

E.g. when we hypothesise that AGI will be built, we know roughly what the key variables are. Whereas transformative AI could refer to all sorts of things, and what counts as transformative could depend on many different political, economic, and societal factors.

Comment by Richard_Ngo (ricraz) on Richard Ngo's Shortform · 2021-06-28T13:04:41.755Z · LW · GW

Half-formed musing: what's the relationship between being a nerd and trusting high-level abstractions? In some sense they seem to be the opposite of each other - nerds focus obsessively on a domain until they understand it deeply, not just at high levels of abstraction. But if I were to give a very brief summary of the rationalist community, it might be: nerds who take very high-level abstractions (such as Moloch, optimisation power, the future of humanity) very seriously.

Comment by Richard_Ngo (ricraz) on Frequent arguments about alignment · 2021-06-23T22:47:39.202Z · LW · GW

Whether this is a point for the advocate or the skeptic depends on whether advances in RL from human feedback unlock other alignment work more than they unlock other capabilities work. I think there's room for reasonable disagreement on this question, although I favour the former.

Comment by Richard_Ngo (ricraz) on How do you deal with people cargo culting COVID-19 defense? · 2021-06-23T22:07:13.076Z · LW · GW

Here are three less cynical ways of thinking about this:

Firstly, other people have different preferences than you do. Most people in Berlin worry much less about the personal consequences of catching covid than most people on LessWrong do. And sure, wearing masks more carefully helps other people too. But so does recycling, or helping old ladies across the street. It's reasonable for people to make different choices about what types of altruistic behaviour they should prioritise (at least in cases where the size of the benefit is pretty unclear, like it is here).

Secondly, thinking things through for yourself is costly in time and effort. It's reasonable for most people to spend very little time reasoning from first principles about how to reduce their covid risk, and to instead just listen to what the government/everyone else says, e.g. by following "socially approved rituals against COVID-19". (Indeed, I think many rationalists would also have benefited from spending far less time thinking about covid.)

Thirdly, policymakers are dealing with many complex tradeoffs under a great deal of uncertainty. Mandating masks is a step which helps reduce the spread of covid. Ensuring that every individual step in implementing the mask mandate is carried out competently takes a great deal of effort, and so even if the highest-level decisionmakers were totally rational, you'd expect to see a bunch of local inefficiencies like security guards wearing suboptimal masks.

Comment by Richard_Ngo (ricraz) on Frequent arguments about alignment · 2021-06-23T21:40:45.711Z · LW · GW

Skeptic: It seems to me that the distinction between "alignment" and "misalignment" has become something of a motte and bailey. Historical arguments that AIs would be misaligned used it in sense 1: "AIs having sufficiently general and large-scale motivations that they acquire the instrumental goal of killing all humans (or equivalently bad behaviour)". Now people are using the word in sense 2: "AIs not quite doing what we want them to do". But when our current AIs aren't doing quite what we want them to do, is that mainly evidence that future, more general systems will be misaligned₁ (which I agree is bad) or misaligned₂?

Advocate: Concepts like agency are continuous spectra. GPT-3 is a little bit agentic, and we'll eventually build AGIs that are much more agentic. Insofar as GPT-3 is trying to do something, it's trying to do the wrong thing. So we should expect future systems to be trying to do the wrong thing in a much more worrying way (aka be misaligned₁) for approximately the same reason: that we trained them on loss functions that incentivised the wrong thing.

Skeptic: I agree that this is possible. But what should our update be after observing large language models? You could look at the difficulties of making GPT-3 do exactly what we want, and see this as evidence that misalignment is a big deal. But actually, large language models seem like evidence against misalignment₁ being a big deal (because they seem to be quite intelligent without being very agentic, but the original arguments for worrying about misalignment₁ relied on the idea that intelligence and agency are tightly connected, making it very hard to build superintelligent systems which don't have large-scale goals).

Advocate: Even if that's true for the original arguments, it's not for more recent arguments.

Skeptic: These newer arguments rely on assumptions about economic competition and coordination failures which seem quite speculative to me, and which haven't been vetted very much.

Advocate: These assumptions seem like common sense to me - e.g. lots of people are already worried about the excesses of capitalism. But even if they're speculative, they're worth putting a lot of effort into understanding and preparing for.

In case it wasn't clear from inside the dialogue, I'm quite sympathetic to both sides of this conversation (indeed, it's roughly a transcript of a debate that I've had with myself a few times). I think more clarity on these topics would be very valuable.

Comment by Richard_Ngo (ricraz) on I’m no longer sure that I buy dutch book arguments and this makes me skeptical of the "utility function" abstraction · 2021-06-23T19:03:58.843Z · LW · GW

To state the obvious, it adds formality.

Here are two ways to relate to formality.  Approach 1: this formal system is much less useful for thinking about the phenomenon than our intuitive understanding, but we should keep developing it anyway because eventually it may overtake our intuitive understanding.

Approach 2: by formalising our intuitive understanding, we have already improved it. When we make arguments about the phenomenon, using concepts from the formalism is better than using our intuitive concepts.

I have no problem with approach 1; most formalisms start off bad, and get better over time. But it seems like a lot of people around here are taking the latter approach, and believe that the formalism of utility theory should be the primary lens through which we think about the goals of AGIs.

I'm not sure if you defend the latter. If you do, then it's not sufficient to say that utility theory adds formalism, you also need to explain why that formalism is net positive for our understanding. When you're talking about complex systems, there are plenty of ways that formalisms can harm our understanding. E.g. I'd say behaviourism in psychology was more formal and also less correct than intuitive psychology. So even though it made a bunch of contributions to our understanding of RL, which have been very useful, at the time people should have thought of it using approach 1 not approach 2. I think of utility theory in a similar way to how I think of behaviourism: it's a useful supplementary lens to see things through, but (currently) highly misleading as a main lens to see things like AI risk arguments through.

If I thought "goals" were a better way of thinking than "utility functions", I would probably be working on formalizing goal theory.

See my point above. You can believe that "goals" are a better way of thinking than "utility functions" while still believing that working on utility functions is more valuable. (Indeed, "utility functions" seem to be what "formalising goal theory" looks like!)

Utility theory, on the other hand, can still be saved

Oh, cool. I haven't thought enough about the Jeffrey-Bolker approach enough to engage with it here, but I'll tentatively withdraw this objection in the context of utility theory.

From a descriptive perspective, relativity suggests that agents won't convergently think in states, because doing so doesn't reflect the world perfectly.

I still strongly disagree (with what I think you're saying). There are lots of different problems which agents will need to think about. Some of these problems (which involve relativity) are more physically fundamental. But that doesn't mean that the types of thinking which help solve them need to be more mentally fundamental to our agents. Our thinking doesn't reflect relativity very well (especially on the intuitive level which shapes our goals the most), but we manage to reason about it alright at a high level. Instead, our thinking is mostly shaped to be useful for the types of problems we tend to encounter at human scales; and we should expect our agents to also converge to thinking in whatever way is most useful for the majority of problems they face, which likely won't involve relativity much.

(I think this argument also informs our disagreement about the normative claim, but that seems like a trickier one to dig into, so I'll skip it for now.)

Comment by Richard_Ngo (ricraz) on I’m no longer sure that I buy dutch book arguments and this makes me skeptical of the "utility function" abstraction · 2021-06-22T22:03:30.535Z · LW · GW

relativity tells us that a simple "state" abstraction isn't quite right

Hmm, this sentence feels to me like a type error. It doesn't seem like the way we reason about agents should depend on the fundamental laws of physics. If agents think in terms of states, then our model of agent goals should involve states regardless of whether that maps onto physics. (Another way of saying this is that agents are at a much higher level of abstraction than relativity.)

I don't like reward functions, since that implies observability (at least usually it's taken that way).

Hmm, you mean that reward is taken as observable? Yeah, this does seem like a significant drawback of talking about rewards. But if we assume that rewards are unobservable, I don't see why reward functions aren't expressive enough to encode utilitarianism - just let the reward at each timestep be net happiness at that timestep. Then we can describe utilitarians as trying to maximise reward.
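The encoding suggested here is simple enough to write out (the happiness numbers below are arbitrary illustrative values, and the dictionary representation of states is my own stand-in):

```python
# A minimal sketch of the encoding above: if the (unobservable) reward at each
# timestep is defined as net happiness at that timestep, then the utilitarian
# value of a trajectory is exactly the undiscounted return.

def reward(state):
    # "Net happiness" of a world-state; here, states carry it directly.
    return state["net_happiness"]

def utilitarian_value(trajectory):
    # What a utilitarian assigns to a whole history.
    return sum(s["net_happiness"] for s in trajectory)

def undiscounted_return(trajectory):
    # What a reward-maximiser accumulates over the same history.
    return sum(reward(s) for s in trajectory)

trajectory = [{"net_happiness": h} for h in (3.0, -1.0, 2.5)]
assert utilitarian_value(trajectory) == undiscounted_return(trajectory) == 4.5
```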

I expect "simple goals and simple world-models" is going to generalize better than "simple policies".

I think we're talking about different debates here. I agree with the statement above - but the follow-up debate I'm interested in is the comparison between "utility theory" and "a naive conception of goals and beliefs" (in philosophical parlance, the folk theory), and this actually seems like a point in favour of the latter. What does utility theory add to the folk theory of agency? Here's one example: utility theory says that deontological goals are very complicated. To me, it seems like the folk theory wins this one, because lots of people have pretty deontological goals. Or another example: utility theory says that there's a single type of entity to which we assign value. Folk theory doesn't have a type system for goals, and again that seems more accurate to me (we have meta-goals, etc).

To be clear, I do think that there are a bunch of things which the folk theory misses (mostly to do with probabilistic reasoning) and which utility theory highlights. But on the fundamental question of the content of goals (e.g. will they be more like "actually obey humans" or "tile the universe with tiny humans saying 'good job'") I'm not sure how much utility theory adds.

Comment by Richard_Ngo (ricraz) on I’m no longer sure that I buy dutch book arguments and this makes me skeptical of the "utility function" abstraction · 2021-06-22T06:28:42.561Z · LW · GW

I think the key issue here is what you take as an "outcome" over which utility functions are defined. If you take states to be outcomes, then trying to model sequential decisions is inherently a mess. If you take trajectories to be outcomes, then this problem goes away - but then for any behaviour you can very easily construct totally arbitrary utility functions which that behaviour maximises. At this point I really don't know what people who talk about coherence arguments on LW are actually defending. But broadly speaking, I expect that everything would be much clearer if phrased in terms of reward rather than utility functions, because reward functions are inherently defined over sequential decisions.
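The triviality worry about trajectory-outcomes can be made precise with a two-line construction (the policy and dynamics below are toy stand-ins; the construction itself is the standard one):

```python
# If outcomes are whole trajectories, any deterministic behaviour maximises
# *some* utility function: define u(trajectory) = 1 iff the trajectory is the
# one that behaviour would produce, and 0 otherwise.

def rollout(policy, start_state, steps):
    """Deterministic trajectory produced by following `policy`."""
    traj, state = [start_state], start_state
    for _ in range(steps):
        state = state + policy(state)   # toy dynamics: action shifts state
        traj.append(state)
    return tuple(traj)

def rationalising_utility(policy, start_state, steps):
    """A utility function that `policy`'s behaviour maximises, by construction."""
    chosen = rollout(policy, start_state, steps)
    return lambda trajectory: 1.0 if trajectory == chosen else 0.0

# Even a plainly "incoherent" policy comes out as a perfect utility maximiser.
erratic_policy = lambda s: 1 if s % 2 == 0 else -3
u = rationalising_utility(erratic_policy, start_state=0, steps=4)
assert u(rollout(erratic_policy, 0, 4)) == 1.0   # its own behaviour: maximal
assert u((0, 1, 2, 3, 4)) == 0.0                 # any other trajectory: not
```

So with trajectory-outcomes, "maximises a utility function" places no constraint at all on behaviour, which is the sense in which the coherence arguments lose their bite.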

I don’t think utility functions being a poor abstraction for agency in the real world has much bearing on whether there is AI risk. It might change the shape and tenor of the problem, but highly capable agents with alien seed preferences are still likely to be catastrophic to human civilization and human values.

If argument X plays an important role in convincing you of conclusion Y, and also the proponents of Y claim that X is important to their views, then it's surprising to hear that X has little bearing on Y. Was X redundant all along? Also, you currently state this in binary terms (whether there is AI risk); maybe it'd be clearer to state how you expect your credences to change (or not) based on updates about utility functions.

Comment by Richard_Ngo (ricraz) on What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs) · 2021-06-18T03:46:24.591Z · LW · GW

These aren't complicated or borderline cases, they are central examples of what we are trying to avert with alignment research.

I'm wondering if the disagreement over the centrality of this example is downstream from a disagreement about how easy the "alignment check-ins" that Critch talks about are. If they are the sort of thing that can be done successfully in a couple of days by a single team of humans, then I share Critch's intuition that the system in question starts off only slightly misaligned. By contrast, if they require a significant proportion of the human time and effort that was put into originally training the system, then I am much more sympathetic to the idea that what's being described is a central example of misalignment.

My (unsubstantiated) guess is that Paul pictures alignment check-ins becoming much harder (i.e. closer to the latter case mentioned above) as capabilities increase? Whereas maybe Critch thinks that they remain fairly easy in terms of number of humans and time taken, but that over time even this becomes economically uncompetitive.

Comment by Richard_Ngo (ricraz) on Taboo "Outside View" · 2021-06-17T13:29:06.937Z · LW · GW

I really like this post, it feels like it draws attention to an important lack of clarity.

One thing I'd suggest changing: when introducing new terminology, I think it's much better to use terms that are already widely comprehensible if possible, than terms based on specific references which you'd need to explain to people who are unfamiliar in each case.

So I'd suggest renaming 'ass-number' to 'wild guess' and 'foxy aggregation' to 'multiple models', or similar.

Comment by Richard_Ngo (ricraz) on Challenge: know everything that the best go bot knows about go · 2021-06-03T23:48:25.742Z · LW · GW

I'm not sure what you mean by "actual computation rather than the algorithm as a whole". I thought that I was talking about the knowledge of the trained model which actually does the "computation" of which move to play, and you were talking about the knowledge of the algorithm as a whole (i.e. the trained model plus the optimising bot).

Comment by Richard_Ngo (ricraz) on What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs) · 2021-05-29T07:45:49.249Z · LW · GW

Rhymes with carp.

Comment by Richard_Ngo (ricraz) on Richard Ngo's Shortform · 2021-05-27T09:26:09.500Z · LW · GW

In the scenario governed by data, the part that counts as self-improvement is where the AI puts itself through a process of optimisation by stochastic gradient descent with respect to that data.

You don't need that much hardware for data to be a bottleneck. For example, I think that there are plenty of economically valuable tasks that are easier to learn than StarCraft. But we get StarCraft AIs instead because games are the only task where we can generate arbitrarily large amounts of data.

Comment by Richard_Ngo (ricraz) on Richard Ngo's Shortform · 2021-05-26T10:58:01.153Z · LW · GW

Yudkowsky mainly wrote about recursive self-improvement from a perspective in which algorithms were the most important factors in AI progress - e.g. the brain in a box in a basement which redesigns its way to superintelligence.

Sometimes when explaining the argument, though, he switched to a perspective in which compute was the main consideration - e.g. when he talked about getting "a hyperexponential explosion out of Moore’s Law once the researchers are running on computers".

What does recursive self-improvement look like when you think that data might be the limiting factor? It seems to me that it looks a lot like iterated amplification: using less intelligent AIs to provide a training signal for more intelligent AIs.

I don't consider this a good reason to worry about iterated amplification, though: in a world where data is the main limiting factor, recursive approaches to generating it still seem much safer than alternatives.
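The recursive data-generation scheme sketched above can be illustrated with a toy aggregation step (the accuracy numbers and majority-vote aggregator here are illustrative assumptions, not anything from the amplification literature):

```python
# Toy sketch of weak models providing a training signal for stronger ones:
# several copies of a weak, noisy labeller are aggregated (here by majority
# vote) to produce labels more reliable than any single copy, which can then
# supervise the next generation of models.

import random

random.seed(0)

def weak_label(true_answer, accuracy=0.7):
    """A weak model: returns the right binary answer only `accuracy` of the time."""
    return true_answer if random.random() < accuracy else 1 - true_answer

def amplified_label(true_answer, copies=9):
    """Training signal from aggregating many weak copies (majority vote)."""
    votes = [weak_label(true_answer) for _ in range(copies)]
    return int(sum(votes) * 2 > len(votes))

def accuracy_of(labeller, trials=2000):
    truths = [random.randint(0, 1) for _ in range(trials)]
    return sum(labeller(t) == t for t in truths) / trials

print(accuracy_of(weak_label))       # around 0.70
print(accuracy_of(amplified_label))  # noticeably higher: the amplified
                                     # signal can train a stronger labeller
```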

Comment by Richard_Ngo (ricraz) on Snyder-Beattie, Sandberg, Drexler & Bonsall (2020): The Timing of Evolutionary Transitions Suggests Intelligent Life Is Rare · 2021-05-20T12:18:11.047Z · LW · GW

Yeah, this seems like a reasonable argument. It feels like it really relies on this notion of "pretty smart" though, which is hard to pin down. There's a case for including all of the following in that category:

And yet I'd guess that none of these were/are on track to reach human-level intelligence. Agree/disagree?

Comment by Richard_Ngo (ricraz) on Snyder-Beattie, Sandberg, Drexler & Bonsall (2020): The Timing of Evolutionary Transitions Suggests Intelligent Life Is Rare · 2021-05-20T07:07:48.953Z · LW · GW

My argument is consistent with the time from dolphin- to human-level intelligence being short in our species, because for anthropic reasons we find ourselves with all the necessary features (dexterous fingers, sociality, vocal cords, etc).

The claim I'm making is more like: for every 1 species that reaches human-level intelligence, there will be N species that get pretty smart, then get stuck, where N is fairly large. (And this would still be true if neurons were, say, 10x smaller and 10x more energy efficient.)

Now there are anthropic issues with evaluating this argument by pegging "pretty smart" to whatever level the second-most-intelligent species happens to be at. But if we keep running evolution forward, I can imagine elephants, whales, corvids, octopuses, big cats, and maybe a few others reaching dolphin-level intelligence. But I have a hard time picturing any of them developing cultural evolution.

Comment by Richard_Ngo (ricraz) on Formal Inner Alignment, Prospectus · 2021-05-18T11:57:42.286Z · LW · GW

Mesa-optimizers are in the search space and would achieve high scores in the training set, so why wouldn't we expect to see them?

I like this as a statement of the core concern (modulo some worries about the concept of mesa-optimisation, which I'll save for another time).

With respect to formalization, I did say up front that less-formal work, and empirical work, is still valuable.

I missed this disclaimer, sorry. So that assuages some of my concerns about balancing types of work. I'm still not sure what intuitions or arguments underlie your optimism about formal work, though. I assume that this would be fairly time-consuming to spell out in detail - but given that the core point of this post is to encourage such work, it seems worth at least gesturing towards those intuitions, so that it's easier to tell where any disagreement lies.

Comment by Richard_Ngo (ricraz) on Formal Inner Alignment, Prospectus · 2021-05-17T14:45:03.714Z · LW · GW

I have fairly mixed feelings about this post. On one hand, I agree that it's easy to mistakenly address some plausibility arguments without grasping the full case for why misaligned mesa-optimisers might arise. On the other hand, there has to be some compelling (or at least plausible) case for why they'll arise, otherwise the argument that 'we can't yet rule them out, so we should prioritise trying to rule them out' is privileging the hypothesis. 

Secondly, it seems like you're heavily prioritising formal tools and methods for studying mesa-optimisation. But there are plenty of things that formal tools have not yet successfully analysed. For example, if I wanted to write a constitution for a new country, then formal methods would not be very useful; nor if I wanted to predict a given human's behaviour, or understand psychology more generally. So what's the positive case for studying mesa-optimisation in big neural networks using formal tools?

In particular, I'd say that the less we currently know about mesa-optimisation, the more we should focus on qualitative rather than quantitative understanding, since the latter needs to build on the former. And since we currently do know very little about mesa-optimisation, this seems like an important consideration.

Comment by Richard_Ngo (ricraz) on Challenge: know everything that the best go bot knows about go · 2021-05-16T14:25:40.859Z · LW · GW

The trained AlphaZero model knows lots of things about Go, in a comparable way to how a dog knows lots of things about running.

But the algorithm that gives rise to that model can know arbitrarily few things. (After all, the laws of physics gave rise to us, but they know nothing at all.)

Comment by Richard_Ngo (ricraz) on Challenge: know everything that the best go bot knows about go · 2021-05-16T14:20:51.627Z · LW · GW

I'd say that this is too simple and programmatic to be usefully described as a mental model. The amount of structure encoded in the computer program you describe is very small, compared with the amount of structure encoded in the neural networks themselves. (I agree that you can have arbitrarily simple models of very simple phenomena, but those aren't the types of models I'm interested in here. I care about models which have some level of flexibility and generality, otherwise you can come up with dumb counterexamples like rocks "knowing" the laws of physics.)

As another analogy: would you say that the quicksort algorithm "knows" how to sort lists? I wouldn't, because you can instead just say that the quicksort algorithm sorts lists, which conveys more information (because it avoids anthropomorphic implications). Similarly, the program you describe builds networks that are good at Go, and does so by making use of the rules of Go, but can't do the sort of additional processing with respect to those rules which would make me want to talk about its knowledge of Go.
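To make the point concrete, here's a minimal quicksort (a standard textbook sketch, not anything specific to the systems discussed above):

```python
def quicksort(xs):
    # Base case: lists of length 0 or 1 are already sorted.
    if len(xs) <= 1:
        return xs
    pivot, rest = xs[0], xs[1:]
    # Partition the remaining elements around the pivot,
    # then recursively sort each side.
    left = [x for x in rest if x <= pivot]
    right = [x for x in rest if x > pivot]
    return quicksort(left) + [pivot] + quicksort(right)
```

The complete description of what this program does is "it sorts lists"; there's no further internal structure which saying it "knows how to sort" would pick out.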

Comment by Richard_Ngo (ricraz) on Agency in Conway’s Game of Life · 2021-05-13T19:10:43.462Z · LW · GW

I don't think there is a fundamental difference in kind between trees, bacteria, humans, and hypothetical future AIs

There's at least one important difference: some of these are intelligent, and some of these aren't.

It does seem plausible that the category boundary you're describing is an interesting one. But when you indicate in your comment below that you see the "AI hypothesis" and the "life hypothesis" as very similar, then that mainly seems to indicate that you're using a highly nonstandard definition of AI, which I expect will lead to confusion.

Comment by Richard_Ngo (ricraz) on Agency in Conway’s Game of Life · 2021-05-13T15:41:18.131Z · LW · GW

It feels like this post pulls a sleight of hand. You suggest that it's hard to solve the control problem because of the randomness of the starting conditions. But this is exactly the reason why it's also difficult to construct an AI with a stable implementation. If you can do the latter, then you can probably also create a much simpler system which creates the smiley face.

Similarly, in the real world, there's a lot of randomness which makes it hard to carry out tasks. But there are a huge number of strategies for achieving things in the world which don't require instantiating an intelligent controller. For example, trees and bacteria started out small but have now radically reshaped the earth. Do they count as having "perception, cognition, and action that are recognizably AI-like"?

Comment by Richard_Ngo (ricraz) on Challenge: know everything that the best go bot knows about go · 2021-05-13T15:27:20.690Z · LW · GW

The human knows the rules and the win condition. The optimisation algorithm doesn't, for the same reason that evolution doesn't "know" what dying is: neither are the types of entities to which you should ascribe knowledge.

Comment by Richard_Ngo (ricraz) on Challenge: know everything that the best go bot knows about go · 2021-05-13T15:23:02.837Z · LW · GW

it's not obvious to me that this is a realistic target

Perhaps I should instead have said: it'd be good to explain to people why this might be a useful/realistic target. Because if you need propositions that cover all the instincts, then it seems like you're basically asking for people to revive GOFAI.

(I'm being unusually critical of your post because it seems that a number of safety research agendas lately have become very reliant on highly optimistic expectations about progress on interpretability, so I want to make sure that people are forced to defend that assumption rather than starting an information cascade.)

Comment by Richard_Ngo (ricraz) on Challenge: know everything that the best go bot knows about go · 2021-05-11T10:26:58.485Z · LW · GW

As an additional reason for the importance of tabooing "know", note that I disagree with all three of your claims about what the model "knows" in this comment and its parent.

(The definition of "know" I'm using is something like "knowing X means possessing a mental model which corresponds fairly well to reality, from which X can be fairly easily extracted".)

Comment by Richard_Ngo (ricraz) on Challenge: know everything that the best go bot knows about go · 2021-05-11T10:24:57.979Z · LW · GW

I think at this point you've pushed the word "know" to a point where it's not very well-defined; I'd encourage you to try to restate the original post while tabooing that word.

This seems particularly valuable because there are some versions of "know" for which the goal of knowing everything a complex model knows seems wildly unmanageable (for example, trying to convert a human athlete's ingrained instincts into a set of propositions). So before people start trying to do what you suggested, it'd be good to explain why it's actually a realistic target.

Comment by Richard_Ngo (ricraz) on Gradations of Inner Alignment Obstacles · 2021-05-06T16:00:24.440Z · LW · GW

I used to define "agent" as "both a searcher and a controller"

Oh, I really like this definition. Even if it's too restrictive, it seems like it gets at something important.

I'm not sure what you meant by "more compressed".

Sorry, that was quite opaque. I guess what I mean is that evolution is an optimiser but isn't an agent, and in part this has to do with how it's a very distributed process with no clear boundary around it. Whereas when you have the same problem being solved in a single human brain, then that compression makes it easier to point to the human as being an agent separate from its environment.

The rest of this comment is me thinking out loud in a somewhat incoherent way; no pressure to read/respond.

It seems like calling something a "searcher" describes only a very simple interface: at the end of the search, there needs to be some representation of the output which it has found. But that output may be very complex.

Whereas calling something a "controller" describes a much more complex interface between it and its environment: you need to be able to point not just to outcomes, but also to observations and actions. But each of those actions is usually fairly simple for a pure controller; if it's complex, then you need search to find which action to take at each step.

Now, it seems useful to sometimes call evolution a controller. For example, suppose you're trying to wipe out a virus, but it keeps mutating. Then there's a straightforward sense in which evolution is "steering" the world towards states where the virus still exists, in the short term. You could also say that it's steering the world towards states where all organisms have high fitness in the long term, but organisms are so complex that it's easier to treat them as selected outcomes, and abstract away from the many "actions" by evolution which led to this point.

In other words, evolution searches using a process of iterative control. Whereas humans control using a process of iterative search.

(As a side note, I'm now thinking that "search" isn't quite the right word, because there are other ways to do selection than search. For example, if I construct a mathematical proof (or a poem) by writing it one line at a time, letting my intuition guide me, then it doesn't really seem accurate to say that I'm searching over the space of proofs/poems. Similarly, a chain of reasoning may not branch much, but still end up finding a highly specific conclusion. Yet "selection" also doesn't really seem like the right word either, because it's at odds with normal usage, which involves choosing from a preexisting set of options - e.g. you wouldn't say that a poet is "selecting" a poem. How about "design" as an alternative? Which allows us to be agnostic about how the design occurred - whether it be via a control process like evolution, or a process of search, or a process of reasoning.)

Comment by Richard_Ngo (ricraz) on What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs) · 2021-05-05T09:52:16.955Z · LW · GW

My default would be "raahp", which doesn't have any of the problems you mentioned.

Comment by Richard_Ngo (ricraz) on Why I Work on Ads · 2021-05-04T16:03:07.867Z · LW · GW

+1 for making the case for a side that's not the one your personal feelings lean towards.

Comment by Richard_Ngo (ricraz) on Gradations of Inner Alignment Obstacles · 2021-04-30T16:12:10.279Z · LW · GW

To me it sounds like you're describing (some version of) agency, and so the most natural term to use would be mesa-agent.

I'm a bit confused about the relationship between "optimiser" and "agent", but I tend to think of the latter as more compressed, and so insofar as we're talking about policies it seems like "agent" is appropriate. Also, mesa-optimiser is taken already (under a definition which assumes that optimisation is equivalent to some kind of internal search).

Comment by Richard_Ngo (ricraz) on Debate on Instrumental Convergence between LeCun, Russell, Bengio, Zador, and More · 2021-04-30T09:47:55.697Z · LW · GW

Yann LeCun: ... instrumental subgoals are much weaker drives of behavior than hardwired objectives. Else, how could one explain the lack of domination behavior in non-social animals, such as orangutans.

What's your specific critique of this? I think it's an interesting and insightful point.

Comment by Richard_Ngo (ricraz) on Coherence arguments imply a force for goal-directed behavior · 2021-04-28T13:52:27.097Z · LW · GW

My internal model of you is that you believe this approach would not be enough because the utility would not be defined on the internal concepts of the agent. Yet I think it doesn't have so much to be defined on these internal concepts itself than to rely on some assumption about these internal concepts.

Yeah, this is an accurate portrayal of my views. I'd also note that the project of mapping internal concepts to mathematical formalisms was the main goal of the whole era of symbolic AI, and failed badly. (Although the analogy is a little loose, so I wouldn't take it as a decisive objection, but rather a nudge to formulate a good explanation of what they were doing wrong that you will do right.)

I agree more and more with you that the big mistake with using utility functions/reward for thinking about goal-directedness is not so much that they are a bad abstractions, but that they are often used as if every utility function is as meaningful as any other.

I don't think this is an accurate portrayal of my views. I am trying to say that utility functions are a bad abstraction for reasoning about AGI, for similar reasons to why health points are a bad abstraction for reasoning about livers. (I think I agree with the rest of the paragraph though.)

Comment by Richard_Ngo (ricraz) on Coherence arguments imply a force for goal-directed behavior · 2021-04-28T09:53:32.152Z · LW · GW

Wouldn't these coherence arguments be pretty awesome? Wouldn't this be a massive step forward in our understanding (both theoretical and practical) of health, damage, triage, and risk allocation?

Insofar as such a system could practically help doctors prioritise, then that would be great. (This seems analogous to how utilities are used in economics.)

But if doctors use this concept to figure out how to treat patients, or use it when designing prostheses for their patients, then I expect things to go badly. If you take HP as a guiding principle - for example, you say "our aim is to build an artificial liver with the most HP possible" - then I'm worried that this would harm your ability to understand what a healthy liver looks like on the level of cells, or tissues, or metabolic pathways, or roles within the digestive system. Because HP is just not a well-defined concept at that level of resolution.

Analogously, it seems very hard to have a good understanding of goals without talking about concepts, instincts, desires, etc, and the roles that all of these play within cognition as a whole - concepts which people just don't talk about much around here. I hypothesise that this is partly because they think they can talk about utilities instead. But when people reason about how to design AGIs in terms of utilities, on the basis of coherence theorems, then I think they're making a very similar mistake as a doctor who tries to design artificial livers based on the theoretical triage virtues of HP.

Comment by Richard_Ngo (ricraz) on Gradations of Inner Alignment Obstacles · 2021-04-26T09:16:20.388Z · LW · GW

Do you think that's a problem?

I'm inclined to think so, mostly because terms shouldn't be introduced unnecessarily. If we can already talk about systems that are capable/competent at certain tasks, then we should just do that directly.

I guess the mesa- prefix helps point towards the fact that we're talking about policies, not policies + optimisers.

Probably my preferred terminology would be:

  • Instead of mesa-controller, "competent policy".
  • And then we can say that competent policies sometimes implement search or learning (or both, or possibly neither).
  • And when we need to be clear, we can add the mesa- prefix to search or learning. (Although I'm not sure whether something like AlphaGo is a mesa-searcher - does the search need to be emergent?)

This helps make it clear that mesa-controller isn't a disjoint category from mesa-searcher, and also that mesa-controller is the default, rather than a special case.

Having written all this I'm now a little confused about the usefulness of the mesa-optimisation terminology at all, and I'll need to think about it more. In particular, it's currently unclear to me what the realistic alternative to mesa-optimisation is, which makes me wonder if it's actually carving off an important set of possibilities, or just reframing the whole space of possibilities. (If the policy receives a gradient update every minute, is it useful to call it a mesa-optimiser? Or every hour? Or...)

Comment by Richard_Ngo (ricraz) on Gradations of Inner Alignment Obstacles · 2021-04-23T09:55:06.270Z · LW · GW

Mesa-controller refers to any effective strategies, including mesa-searchers but also "dumber" strategies which nonetheless effectively steer toward a misaligned objective. For example, thermostat-like strategies, or strategies which have simply memorized a number of effective interventions.

I'm confused about what wouldn't qualify as a mesa-controller. In practice, is this not synonymous with "capable"?

Also, why include "misaligned" in this definition? If mesa-controller turns out to be a useful concept, then I'd want to talk about both aligned and misaligned mesa-controllers.

Comment by Richard_Ngo (ricraz) on Against "Context-Free Integrity" · 2021-04-15T19:16:45.377Z · LW · GW

You’re saying things like ‘provocative’ and ‘mindkilling’ and ‘invoking tribal loyalties’, but you’ve not made any arguments relating that to my writing

I should be clear here that I'm talking about a broader phenomenon, not specifically your writing. As I noted above, your post isn't actually a central example of the phenomenon. The "tribal loyalties" thing was primarily referring to people's reactions to the SSC/NYT thing. Apologies if it seemed like I was accusing you personally of all of these things. (The bits that were specific to your post were mentions of "evil" and "disgust".)

Nor am I saying that we should never talk about emotions; I do think that's important. But we should try to also provide argumentative content which isn't reliant on the emotional content. If we make strong claims driven by emotions, then we should make sure to also defend them in less emotionally loaded ways, in a way which makes them compelling to someone who doesn't share these particular emotions. For example, in the quotation you gave, what makes science's principles "fake" just because they failed in psychology? Is that person applying an isolated demand for rigour because they used to revere science? I can only evaluate this if they defend their claims more extensively elsewhere.

On the specific example of facebook, I disagree that you're using evil in a central way. I think the central examples of evil are probably mass-murdering dictators. My guess is that opinions would be pretty divided about whether to call drug dealers evil (versus, say, amoral); and the same for soldiers, even when they end up causing a lot of collateral damage.

Your conclusion that facebook is evil seems particularly and unusually strong because your arguments are also applicable to many TV shows, game producers, fast food companies, and so on. Which doesn't make those arguments wrong, but it means that they need to meet a pretty high bar, since either facebook is significantly more evil than all these other groups, or else we'll need to expand the scope of words like "evil" until they refer to a significant chunk of society (which would be quite different from how most people use it).

(This is not to over-focus on the specific word "evil", it's just the one you happened to use here. I have similar complaints about other people using the word "insane" gratuitously; to people casually comparing current society to Stalinist Russia or the Cultural Revolution; and so on.)

Comment by Richard_Ngo (ricraz) on Against "Context-Free Integrity" · 2021-04-15T09:19:41.424Z · LW · GW

Whether I agree with this point or not depends on whether you're using Ben's framing of the costs and benefits, or the framing I intended; I can't tell.

Comment by Richard_Ngo (ricraz) on Against "Context-Free Integrity" · 2021-04-15T09:06:27.999Z · LW · GW

I think we're talking past each other a little, because we're using "careful" in two different senses. Let's say careful1 is being careful to avoid reputational damage or harassment. Careful2 is being careful not to phrase claims in ways that make it harder for you or your readers to be rational about the topic (even assuming a smart, good-faith audience).

It seems like you're mainly talking about careful1. In the current context, I am not worried about backlash or other consequences from failure to be careful1. I'm talking about careful2. When you "aim to say valuable truths that aren't said elsewhere", you can either do so in a way that is careful2 to be nuanced and precise, or you can do so in a way that is tribalist and emotionally provocative and mindkilling. From my perspective, the ability to do the former is one of the core skills of rationality.

In other words, it's not just a question of the "worst" interpretation of what you write; rather, I think that very few people (even here) are able to dispassionately evaluate arguments which call things "evil" and "disgusting", or which invoke tribal loyalties. Moreover, such arguments are often vague because they appeal to personal standards of "evil" or "insane" without forcing people to be precise about what they mean by it (e.g. I really don't know what you actually mean when you say facebook is evil). So even if your only goal is to improve your personal understanding of what you're writing about, I would recommend being more careful2.