Deutsch and Yudkowsky on scientific explanation 2021-01-20T01:00:04.235Z
Some thoughts on risks from narrow, non-agentic AI 2021-01-19T00:04:10.108Z
Excerpt from Arbital Solomonoff induction dialogue 2021-01-17T03:49:47.405Z
Why I'm excited about Debate 2021-01-15T23:37:53.861Z
Meditations on faith 2021-01-15T22:20:02.651Z
Eight claims about multi-agent AGI safety 2021-01-07T13:34:55.041Z
Commentary on AGI Safety from First Principles 2020-11-23T21:37:31.214Z
Continuing the takeoffs debate 2020-11-23T15:58:48.189Z
My intellectual influences 2020-11-22T18:00:04.648Z
Why philosophy of science? 2020-11-07T11:10:02.273Z
Responses to Christiano on takeoff speeds? 2020-10-30T15:16:02.898Z
Reply to Jebari and Lundborg on Artificial Superintelligence 2020-10-25T13:50:23.601Z
AGI safety from first principles: Conclusion 2020-10-04T23:06:58.975Z
AGI safety from first principles: Control 2020-10-02T21:51:20.649Z
AGI safety from first principles: Alignment 2020-10-01T03:13:46.491Z
AGI safety from first principles: Goals and Agency 2020-09-29T19:06:30.352Z
AGI safety from first principles: Superintelligence 2020-09-28T19:53:40.888Z
AGI safety from first principles: Introduction 2020-09-28T19:53:22.849Z
Safety via selection for obedience 2020-09-10T10:04:50.283Z
Safer sandboxing via collective separation 2020-09-09T19:49:13.692Z
The Future of Science 2020-07-28T02:43:37.503Z
Thiel on Progress and Stagnation 2020-07-20T20:27:59.112Z
Environments as a bottleneck in AGI development 2020-07-17T05:02:56.843Z
A space of proposals for building safe advanced AI 2020-07-10T16:58:33.566Z
Arguments against myopic training 2020-07-09T16:07:27.681Z
AGIs as collectives 2020-05-22T20:36:52.843Z
Multi-agent safety 2020-05-16T01:59:05.250Z
Competitive safety via gradated curricula 2020-05-05T18:11:08.010Z
Against strong bayesianism 2020-04-30T10:48:07.678Z
What is the alternative to intent alignment called? 2020-04-30T02:16:02.661Z
Melting democracy 2020-04-29T20:10:01.470Z
ricraz's Shortform 2020-04-26T10:42:18.494Z
What achievements have people claimed will be warning signs for AGI? 2020-04-01T10:24:12.332Z
What information, apart from the connectome, is necessary to simulate a brain? 2020-03-20T02:03:15.494Z
Characterising utopia 2020-01-02T00:00:01.268Z
Technical AGI safety research outside AI 2019-10-18T15:00:22.540Z
Seven habits towards highly effective minds 2019-09-05T23:10:01.020Z
What explanatory power does Kahneman's System 2 possess? 2019-08-12T15:23:20.197Z
Why do humans not have built-in neural i/o channels? 2019-08-08T13:09:54.072Z
Book review: The Technology Trap 2019-07-20T12:40:01.151Z
What are some of Robin Hanson's best posts? 2019-07-02T20:58:01.202Z
On alien science 2019-06-02T14:50:01.437Z
A shift in arguments for AI risk 2019-05-28T13:47:36.486Z
Would an option to publish to AF users only be a useful feature? 2019-05-20T11:04:26.150Z
Which scientific discovery was most ahead of its time? 2019-05-16T12:58:14.628Z
When is rationality useful? 2019-04-24T22:40:01.316Z
Book review: The Sleepwalkers by Arthur Koestler 2019-04-23T00:10:00.972Z
Arguments for moral indefinability 2019-02-12T10:40:01.226Z
Coherent behaviour in the real world is an incoherent concept 2019-02-11T17:00:25.665Z
Vote counting bug? 2019-01-22T15:44:48.154Z


Comment by ricraz on Literature Review on Goal-Directedness · 2021-01-26T14:35:19.591Z · LW · GW

Kinda, but I think both of these approaches are incomplete. In practice, finding a definition and studying examples of it need to be interwoven: you'll have a gradual process where you start with a tentative definition, identify examples and counterexamples, adjust the definition, and so on. And insofar as our examples should focus on things which are actually possible to build (rather than weird thought experiments like Blockhead or the Chinese Room), what I'm proposing has aspects of both of the approaches you suggest.

My guess is that it's more productive to continue discussing this on my response to your other post, where I make this argument in a more comprehensive way.

Comment by ricraz on Deutsch and Yudkowsky on scientific explanation · 2021-01-21T02:04:35.909Z · LW · GW

To summarise, I interpret TAG as saying something like "when SI assigns a probability of x to a program P, what does that mean; how can we cash that out in terms of reality?" And Vaniver is saying "It means that, if you sum up the probabilities assigned to all programs which implement roughly the same function, then you get the probability that this function is 'the underlying program of reality'".

I think there are three key issues with this response (if I've understood it correctly):

  1. It is skipping all the hard work of figuring out which functions are roughly the same. This is a difficult unsolved (and maybe unsolvable?) problem, which is, for example, holding back progress on FDT.
  2. It doesn't actually address the key problem of epistemology. We're in a world, and we'd like to know lots of things about it. Solomonoff induction, instead of giving us lots of knowledge about the world, gives us a massive Turing machine which computes the quantum wavefunction, or something, and then outputs predictions of future observations. For example, suppose that the previous inputs were the things I've seen in the past, and the predictions are of what I'll see in the future. But those predictions might tell us very few interesting things about the world! For example, they probably won't help me derive general relativity. In some sense the massive Turing machine contains the fact that the world runs on general relativity, but accessing that fact from the Turing machine might be even harder than accessing it by studying the world directly. (Relatedly, see Deutsch's argument (which I quote above) that even having a predictive oracle doesn't "solve" science.)
  3. There's no general way to apply SI to answer a bounded question with a sensible bounded answer. Hence, when you say "you can make your stable of hypotheses infinitely large", this is misleading: programs aren't hypotheses, or explanations, in the normal sense of the word, for almost all of the questions we'd like to understand.
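To make issue 1 concrete: as I understand Vaniver's proposal (the notation here is my own gloss, not anything from the original thread), the probability assigned to a function f is obtained by summing the Solomonoff prior over all programs implementing it:

```latex
P(f) \;=\; \sum_{p \,:\, p \text{ implements } f} 2^{-|p|}
```

where |p| is the length of program p in bits. All the difficulty is hidden in the side condition "p implements f": we have no general account of when two programs count as implementing roughly the same function.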
Comment by ricraz on Literature Review on Goal-Directedness · 2021-01-20T16:26:09.625Z · LW · GW

Hmm, okay, I think there's still some sort of disagreement here, but it doesn't seem particularly important. I agree that my distinction doesn't sufficiently capture the middle ground of interpretability analysis (although the intentional stance doesn't make use of that, so I think my argument still applies against it).

Comment by ricraz on Against the Backward Approach to Goal-Directedness · 2021-01-20T16:19:17.015Z · LW · GW

Hmmm, it doesn't seem like these two approaches are actually that distinct. Consider: in the forward approach, which intuitions about goal-directedness are you using? If you're only using intuitions about human goal-directedness, then you'll probably miss out on a bunch of important ideas. Whereas if you're using intuitions about extreme cases, like superintelligences, then this is not so different from the backwards approach.

Meanwhile, I agree that the backward approach will fail if we try to find "the fundamental property that the forward approach is trying to formalise". But this seems like bad philosophy. We shouldn't expect there to be a formal or fundamental definition of agency, just like there's no formal definition of tables or democracy (or knowledge, or morality, or any of the other complex concepts philosophers have spent centuries trying to formalise). Instead, the best way to understand complex concepts is often to treat them as a nebulous cluster of traits, analyse which traits it's most useful to include and how they interact, and then do the same for each of the component traits. On this approach, identifying convergent instrumental goals is one valuable step in fleshing out agency; and another valuable step is saying "what cognition leads to the pursuit of convergent instrumental goals"; and another valuable step is saying "what ways of building minds lead to that cognition"; and once we understand all this stuff in detail, then we will have a very thorough understanding of agency. Note that even academic philosophy is steering towards this approach, under the heading of "conceptual engineering".

So I count my approach as a backwards one, consisting of the following steps:

  1. It's possible to build AGIs which are dangerous in a way that intuitively involves something like "agency".
  2. Broadly speaking, the class of dangerous agentic AGIs have certain cognition in common, such as making long-term plans, and pursuing convergent instrumental goals (many of which will also be shared by dangerous agentic humans).
  3. By thinking about the cognition that agentic AGIs would need to carry out to be dangerous, we can identify some of the traits which contribute a lot to danger, but contribute little to capabilities.
  4. We can then try to design training processes which prevent some of those traits from arising.

(Another way of putting this: the backwards approach works when you use it to analyse concepts as being like network 1, not network 2.)

If you're still keen to find a "fundamental property", then it feels like you'll need to address a bunch of issues in embedded agency.

Comment by ricraz on Some thoughts on risks from narrow, non-agentic AI · 2021-01-20T12:35:41.089Z · LW · GW

Cool, thanks for the clarifications. To be clear, overall I'm much more sympathetic to the argument as I currently understand it, than when I originally thought you were trying to draw a distinction between "new forms of reasoning honed by trial-and-error" in part 1 (which I interpreted as talking about systems lacking sufficiently good models of the world to find solutions in any other way than trial and error) and "systems that have a detailed understanding of the world" in part 2.

Let me try to sum up the disagreement. The key questions are:

  1. What training data will we realistically be able to train our agents on?
  2. What types of generalisation should we expect from that training data?
  3. How well will we be able to tell that these agents are doing the wrong thing?

On 1: you think long-horizon real-world data will play a significant role in training, because we'll need it to teach agents to do the most valuable tasks. This seems plausible to me; but I think that in order for this type of training to be useful, the agents will need to already have robust motivations (else they won't be able to find rewards that are given over long time horizons). And I don't think that this training will be extensive enough to reshape those motivations to a large degree (whereas I recall that in an earlier discussion on amplification, you argued that small amounts of training could potentially reshape motivations significantly). Our disagreement about question 1 affects questions 2 and 3, but it affects question 2 less than I previously thought, as I'll discuss.

On 2: previously I thought you were arguing that we should expect very task-specific generalisations like being trained on "reduce crime" and learning "reduce reported crime", which I was calling underspecified. However, based on your last comment it seems that you're actually mainly talking about broader generalisations, like being trained on "follow instructions" and learning "do things that the instruction-giver would rate highly". This seems more plausible, because it's a generalisation that you can learn in many different types of training; and so our disagreement on 1 becomes less consequential.

I don't have a strong opinion on the likelihood of this type of generalisation. I guess your argument is that, because we're doing a lot of trial and error, we'll keep iterating until we either get something aligned with our instructions, or something which optimises for high ratings directly. But it seems to me that, by default, during early training periods the AI won't have much information about the overseer's knowledge (or the overseer's existence), and may not even have the concept of rewards, making alignment with instructions much more natural. Above, you disagree; in either case my concern is that this underlying concept of "natural generalisation" is doing a lot of work, despite not having been explored in your original post (or anywhere else, to my knowledge). We could go back and forth about where the burden of proof is, but it seems more important to develop a better characterisation of natural generalisation; I might try to do this in a separate post.

On 3: it seems to me that the resources which we'll put into evaluating a single deployment are several orders of magnitude higher than the resources we'll put into evaluating each training data point - e.g. we'll likely have whole academic disciplines containing thousands of people working full-time for many years on analysing the effects of the most powerful AIs' behaviour.

You say that you expect people to work to design training procedures that get good performance on type 2 measurements. I agree with this - but if you design an AI that gets good performance on type 2 measurements despite never being trained on them, then that rules out the most straightforward versions of the "do things that the instruction-giver would rate highly" motivation. And since the trial and error to find strategies which fool type 2 measurements will be carried out over years, the direct optimisation for fooling type 2 measurements will be weak.

I guess the earlier disagreement about question 1 is also relevant here. If you're an AI trained primarily on data and feedback which are very different from real-world long-term evaluations, then there are very few motivations which lead you to do well on real-world long-term evaluations. "Follow instructions" is one of them; some version of "do things that the instruction-giver would rate highly" is another, but it would need to be quite a specific version. In other words, the greater the disparity between the training regime and the evaluation regime, the fewer ways there are for an AI's motivations to score well on both, but also score badly on our idealised preferences.

In another comment, you give a bunch of ways in which models might generalise successfully to longer horizons, and then argue that "many of these would end up pursuing goals that are closely related to the goals they pursue over short horizons". I agree with this, but note that "aligned goals" are also closely related to the goals pursued over short time horizons. So it comes back to whether motivations will generalise in a way which prioritises the "obedience" aspect or the "produces high scores" aspect of the short-term goals.

Comment by ricraz on Some thoughts on risks from narrow, non-agentic AI · 2021-01-19T21:21:58.366Z · LW · GW

To clarify your position: if I train a system that makes good predictions over 1 minute and 10 minutes and 100 minutes, is your position that there's not much reason that this system would make a good prediction over 1000 minutes? Analogously, if I train a system by meta-learning to get high rewards over a wide range of simulated environments, is your position that there's not much reason to think it will try to get high rewards when deployed in the real world?

In most of the cases you've discussed, trying to do tasks over much longer time horizons involves doing a very different task. Reducing reported crime over 10 minutes and reducing reported crime over 100 minutes have very little to do with reducing reported crime over a year or 10 years. The same is true for increasing my wealth, or increasing my knowledge (which over 10 minutes involves telling me things, but over a year might involve doing novel scientific research). I tend to be pretty optimistic about AI motivations generalising, but this type of generalisation seems far too underspecified. "Making predictions" is perhaps an exception, insofar as it's a very natural concept, and also one which transfers very straightforwardly from simulations to reality. But it probably depends a lot on what type of predictions we're talking about.

On meta-learning: it doesn't seem realistic to think about an AI "trying to get high rewards" on tasks where the time horizon is measured in months or years. Instead it'll try to achieve some generalisation of the goals it learned during training. But as I already argued, we're not going to be able to train on single tasks which are similar enough to real-world long-term tasks that motivations will transfer directly in any recognisable way.

Insofar as ML researchers think about this, I think their most common position is something like "we'll train an AI to follow a wide range of instructions, and then it'll generalise to following new instructions over longer time horizons". This makes a lot of sense to me, because I expect we'll be able to provide enough datapoints (mainly simulated datapoints, plus language pre-training) to pin down the concept "follow instructions" reasonably well, whereas I don't expect we can provide enough datapoints to pin down a motivation like "reduce reports of crime". (Note that I also think that we'll be able to provide enough datapoints to incentivise influence-seeking behaviour, so this isn't a general argument against AI risk, but rather an argument against the particular type of task-specific generalisation you describe.)

In other words, we should expect generalisation to long-term tasks to occur via a general motivation to follow our instructions, rather than on a task-specific basis, because the latter is so underspecified. But generalisation via following instructions doesn't have a strong bias towards easily-measurable goals.

I agree that it's only us who are operating by trial and error---the system understands what it's doing. I don't think that undermines my argument. The point is that we pick the system, and so determine what it's doing, by trial and error, because we have no understanding of what it's doing (under the current paradigm). For some kinds of goals we may be able to pick systems that achieve those goals by trial and error (modulo empirical uncertainty about generalization, as discussed in the second part). For other goals there isn't a plausible way to do that.

I think that throughout your post there's an ambiguity between two types of measurement. Type one measurements are those which we can make easily enough to use as a feedback signal for training AIs. Type two measurements are those which we can make easily enough to tell us whether an AI we've deployed is doing a good job. In general many more things are type-two-measurable than type-one-measurable, because training feedback needs to be very cheap. So if we train an AI on type one measurements, we'll usually be able to use type two measurements to evaluate whether it's doing a good job post-deployment. And that AI won't game those type two measurements even if it generalises its training signal to much longer time horizons, because it will never have been trained on type two measurements.

These seem like the key disagreements, so I'll leave off here, to prevent the thread from branching too much. (Edited one out because I decided it was less important).

Comment by ricraz on Literature Review on Goal-Directedness · 2021-01-19T16:55:12.300Z · LW · GW

Really, the only issue for our purposes with this definition is that it focuses on how goal-directedness emerges, instead of what it entails for a system. Hence it gives less of a handle to predict the behavior of a system than Dennett’s intentional stance for example.

Another way to talk about this distinction is between definitions that allow you to predict the behaviour of agents which you haven't observed yet given how they were trained, versus definitions of goal-directedness which allow you to predict the future behaviour of an existing system given its previous behaviour.

I claim that the former is more important for our current purposes, for three reasons. Firstly, we don't have any AGIs to study, and so when we ask the question of how likely it is that AGIs will be goal-directed, we need to talk about the way in which that trait might emerge.

Secondly, because of the possibility of deceptive alignment, it doesn't seem like focusing on observed behaviour is sufficient for analysing goal-directedness.

Thirdly, suppose that we build a system that's goal-directed in a dangerous way. What do we do then? Well, we need to know why that goal-directedness emerges, and how to change the training regime so that it doesn't happen again.

Comment by ricraz on Some thoughts on risks from narrow, non-agentic AI · 2021-01-19T13:41:30.363Z · LW · GW

In the second half of WFLL, you talk about "systems that have a detailed understanding of the world, which are able to adapt their behavior in order to achieve specific goals". Does the first half of WFLL also primarily refer to systems with these properties? And if so, does "reasoning honed by trial-and-error" refer to the reasoning that those systems do?

If yes, then this undermines your core argument that "[some things] can’t be done by trial and error. To solve such tasks we need to understand what we are doing and why it will yield good outcomes", because "systems that have a detailed understanding of the world" don't need to operate by trial and error; they understand what they're doing.

We do need to train them by trial and error, but it's very difficult to do so on real-world tasks which have long feedback loops, like most of the ones you discuss. Instead, we'll likely train them to have good reasoning skills on tasks which have short feedback loops, and then transfer them to real-world tasks with long feedback loops. But in that case, I don't see much reason why systems that have a detailed understanding of the world will have a strong bias towards easily-measurable goals on real-world tasks with long feedback loops. (Analogously: when you put humans in a new domain, and give them tasks and feedback via verbal instructions, then we can quickly learn sophisticated concepts in that new domain, and optimise for those, not just the easily-measured concepts in that new domain.)

I'm pretty agnostic on whether AI will in fact be optimizing for the easily measured objectives used in training or for unrelated values that arise naturally in the learning process (or more likely some complicated mix), and part of my point is that it doesn't seem to much matter.

Why is your scenario called "You get what you measure" if you're agnostic about whether we actually get what we measure, even on the level of individual AIs?

Or do you mean part 1 to be the case where we do get what we measure, and part 2 to be the case where we don't?

I'm saying: it's easier to pursue easily-measured goals, and so successful organizations and individuals tend to do that and to outcompete those whose goals are harder to measure (and to get better at / focus on the parts of their goals that are easy to measure, etc.). I'm not positing any change in the strength of competition, I'm positing a change in the extent to which goals that are easier to measure are in fact easier to pursue.

Firstly, I think this is only true for organisations whose success is determined by people paying attention to easily-measured metrics, and not by reality. For example, an organisation which optimises for its employees having beliefs which are correct in easily-measured ways will lose out to organisations where employees think in useful ways. An organisation which optimises for revenue growth is more likely to go bankrupt than an organisation which optimises for sustainable revenue growth. An organisation which optimises for short-term customer retention loses long-term customer retention. Etc.

The case in which this is more worrying is when an organisation's success is determined by (for example) whether politicians like it, and politicians only pay attention to easily-measurable metrics. In this case, organisations which pursue easily-measured goals will be more successful than ones which pursue the goals the politicians actually want to achieve. This is why I make the argument that actually the pressure on politicians to pursue easily-measurable metrics is pretty weak (hence why they're ignoring most economists' recommendations on how to increase GDP).

I don't disagree with [AI improving our ability to steer our future] at all. The point is that right now human future-steering is basically the only game in town. We are going to introduce inhuman reasoning that can also steer the future, and over time human reasoning will lose out in relative terms. That's compatible with us benefiting enormously, if all of those benefits also accrue to automated reasoners---as your examples seem to. We will try to ensure that all this new reasoning will benefit humanity, but I describe two reasons that might be difficult.

I agree that you've described some potential harms; but in order to make this a plausible long-term concern, you need to give reasons to think that the harms outweigh the benefits of AI enhancing (the effective capabilities of) human reasoning. If you'd written a comparable post a few centuries ago talking about how human physical power will lose out to inhuman physical power, I would have had the same complaint.

(If you classify all future-steering machinery as "agentic" then evidently I'm talking about agents and I agree with the informal claim that "non-agentic" reasoning isn't concerning.)

I classify Facebook's newsfeed as future-steering in a weak sense (it steers the future towards political polarisation), but non-agentic. Do you agree with this? If so, do you agree that if FB-like newsfeeds became prominent in many ways that would not be very concerning from a longtermist perspective?

Comment by ricraz on Why I'm excited about Debate · 2021-01-17T21:15:03.026Z · LW · GW

suppose we sorted out a verbal specification of an aligned AI and had a candidate FAI coded up - could we then use Debate on the question "does this candidate match the verbal specification?"

I'm less excited about this, and more excited about candidate training processes or candidate paradigms of AI research (for example, solutions to embedded agency). I expect that there will be a large cluster of techniques which produce safe AGIs, we just need to find them - which may be difficult, but hopefully less difficult with Debate involved.

Comment by ricraz on Why I'm excited about Debate · 2021-01-17T01:16:26.072Z · LW · GW

I think I agree with all of this. In fact, this argument is one reason why I think Debate could be valuable, because it will hopefully increase the maximum complexity of arguments that humans can reliably evaluate.

This eventually fails at some point, but hopefully it fails after the point at which we can use Debate to solve alignment in a more scalable way. (I don't have particularly strong intuitions about whether this hope is justified, though.)

Comment by ricraz on Why I'm excited about Debate · 2021-01-17T01:01:13.608Z · LW · GW

If arguments had no meaning but to argue other people into things, if they were being subject only to neutral selection or genetic drift or mere conformism, there really wouldn't be any reason for "the kind of arguments humans can be swayed by" to work to build a spaceship.  We'd just end up with some arbitrary set of rules fixed in place.

I agree with this. My position is not that explicit reasoning is arbitrary, but that it developed via an adversarial process where arguers would try to convince listeners of things, and then listeners would try to distinguish between more and less correct arguments. This is in contrast with theories of reason which focus on the helpfulness of reason in allowing individuals to discover the truth by themselves, or theories which focus on its use in collaboration.

Here's how Sperber and Mercier describe their argument: 

Reason is not geared to solitary use, to arriving at better beliefs and decisions on our own. What reason does, rather, is help us justify our beliefs and actions to others, convince them through argumentation, and evaluate the justifications and arguments that others address to us.

I can see how my summary might give a misleading impression; I'll add an edit to clarify. Does this resolve the disagreement?

Comment by ricraz on ricraz's Shortform · 2021-01-14T19:12:49.108Z · LW · GW

rules out a bunch of methods of cognition as being clearly in conflict with that theoretical ideal

Which ones? In Against Strong Bayesianism I give a long list of methods of cognition that are clearly in conflict with the theoretical ideal, but in practice are obviously fine. So I'm not sure how we distinguish what's ruled out from what isn't.

which I now realize is because those tended to be problems where embededness or logical uncertainty mattered a lot

Can you give an example of a real-world problem where logical uncertainty doesn't matter a lot, given that without logical uncertainty, we'd have solved all of mathematics and considered all the best possible theories in every other domain?

Comment by ricraz on ricraz's Shortform · 2021-01-14T13:24:38.854Z · LW · GW

This seems reasonable, thanks. But I note that "in order to actually think about anything you have to somehow move beyond naive bayesianism" is a very strong criticism. Does this invalidate everything that has been said about using naive bayesianism in the real world? E.g. every instance where Eliezer says "be bayesian".

One possible answer is "no, because logical induction fixes the problem". My uninformed guess is that this doesn't work because there are comparable problems with applying to the real world. But if this is your answer, follow-up question: before we knew about logical induction, were the injunctions to "be bayesian" justified?

(Also, for historical reasons, I'd be interested in knowing when you started believing this.)

Comment by ricraz on ricraz's Shortform · 2021-01-14T13:16:42.836Z · LW · GW

Hmmm, but what does this give us? He talks about the difference between vague theories and technical theories, but then says that we can use a scoring rule to change the probabilities we assign to each type of theory.

But my question is still: when you increase your credence in a vague theory, what are you increasing your credence about? That the theory is true?

Nor can we say that it's about picking the "best theory" out of the ones we have, since different theories may overlap partially.

Comment by ricraz on Radical Probabilism · 2021-01-14T12:55:24.665Z · LW · GW

DP: (sigh...) OK. I'm still never going to design an artificial intelligence to have uncertain observations. It just doesn't seem like something you do on purpose.

What makes you think that having certain observations is possible for an AI?

Comment by ricraz on ricraz's Shortform · 2021-01-14T12:37:06.314Z · LW · GW

Scott Garrabrant and Abram Demski, two MIRI researchers.

For introductions to their work, see the Embedded Agency sequence, the Consequences of Logical Induction sequence, and the Cartesian Frames sequence.

Comment by ricraz on ricraz's Shortform · 2021-01-14T00:25:39.837Z · LW · GW

In a bayesian rationalist view of the world, we assign probabilities to statements based on how likely we think they are to be true. But truth is a matter of degree, as Asimov points out. In other words, all models are wrong, but some are less wrong than others.

Consider, for example, the claim that evolution selects for reproductive fitness. Well, this is mostly true, but there's also sometimes group selection, and the claim doesn't distinguish between a gene-level view and an individual-level view, and so on...

So just assigning it a single probability seems inadequate. Instead, we could assign a probability distribution over its degree of correctness. But because degree of correctness is such a fuzzy concept, it'd be pretty hard to connect this distribution back to observations.

Or perhaps the distinction between truth and falsehood is sufficiently clear-cut in most everyday situations for this not to be a problem. But questions about complex systems (including, say, human thoughts and emotions) are messy enough that I expect the difference between "mostly true" and "entirely true" to often be significant.

Has this been discussed before? Given Less Wrong's name, I'd be surprised if not, but I don't think I've stumbled across it.

Comment by ricraz on Coherent decisions imply consistent utilities · 2021-01-13T23:07:56.407Z · LW · GW

Your argument is plausible. On the other hand, this review is for 2019, not 2017 (when this post was written) nor 2013 (when this series of ideas was originally laid out). So it seems like it should reflect our current-ish thinking.

I note that the page for the review doesn't have anything about voting criteria. This seems like something of an oversight?

Comment by ricraz on Coherent decisions imply consistent utilities · 2021-01-13T18:01:15.876Z · LW · GW

I don't see Eliezer saying that coherence theorems are the justification for his claim about the anti-naturalness of deference.

If coherence theorems are consistent with deference being "natural", then I'm not sure what argument Eliezer is trying to make in this post, because then couldn't they also be consistent with other deontological cognition being natural, and therefore likely to arise in AGIs?

effective cognition will generically involve trading off these resources in a way that does not reliably lose them

In principle, maybe. In practice, if we'd been trying to predict how monkeys will evolve, what does this claim imply about human-monkey differences?

Comment by ricraz on Imitative Generalisation (AKA 'Learning the Prior') · 2021-01-11T14:40:30.659Z · LW · GW

Ooops, yes, this seems correct. I'll edit mine accordingly.

Comment by ricraz on Imitative Generalisation (AKA 'Learning the Prior') · 2021-01-11T01:01:56.381Z · LW · GW

A few things that I found helpful in reading this post:

  • I mentally replaced D with "the past" and D' with "the future".
  • I mentally replaced z with "a guide to reasoning about the future".

This gives us a summary something like:

We want to understand the future, based on our knowledge of the past. However, training a neural net on the past might not lead it to generalise well about the future. Instead, we can train a network to be a guide to reasoning about the future, by evaluating its outputs based on how well humans with access to it can reason about the past, plus how well humans expect it to generalise to the future, plus immense amounts of interpretability work. (Note that this summary was originally incorrect, and has been modified in response to Lanrian's corrections below.)

Some concerns that arise from my understanding of this proposal:

  • It seems like the only thing stopping z from primarily containing object-level knowledge about the world is the human prior about the unlikelihood of object-level knowledge. But humans are really bad at assigning priors even to relatively simple statements - this is the main reason that we need science.
  • z will consist of a large number of claims, but I have no idea how to assign a prior to the conjunction of many big claims about the world, even in theory. That prior can't be calculated recursively, because there may be arbitrarily-complicated interactions between different components of z.
  • Consider the following proposal: "train an oracle to predict the future, along with an explanation of its reasoning. Reward it for predicting correctly, and penalise it for explanations that sound fishy". Is there an important difference between this and imitative generalisation?
  • An agent can "generalise badly" because it's not very robust, or because it's actively pursuing goals that are misaligned with those of humans. It doesn't seem like this proposal distinguishes between these types of failures. Is this distinction important in motivating the proposal?

Comment by ricraz on Eight claims about multi-agent AGI safety · 2021-01-10T20:44:31.144Z · LW · GW

This all seems straightforwardly correct, so I've changed the line in question accordingly. Thanks for the correction :)

One caveat: technical work to address #8 currently involves either preventing AGIs from being misaligned in ways that lead them to make threats, or preventing AGIs from being aligned in ways which make them susceptible to threats. The former seems to qualify as an aspect of the "alignment problem", the latter not so much. I should have used the former as an example in my original reply to you, rather than using the latter.

Comment by ricraz on Eight claims about multi-agent AGI safety · 2021-01-10T19:29:12.110Z · LW · GW

I'd say that each of #5-#8 changes the parts of "AI alignment" that you focus on. For example, you may be confident that your AI system is not optimising against you, without being confident that 1000 copies of your AI system working together won't be optimising against you. Or you might be confident that your AI system won't do anything dangerous in almost all situations, but no longer confident once you realise that threats are adversarially selected to be extreme.

Whether you count these shifts as "moving beyond the standard paradigm" depends, I guess, on how much they change alignment research in practice. It seems like proponents of #7 and #8 believe that, conditional on those claims, alignment researchers' priorities should shift significantly. And #5 has already contributed to a shift away from the agent foundations paradigm. On the other hand, I'm a proponent of #6, and I don't currently believe that this claim should significantly change alignment research (although maybe further thought will identify some ways).

I think I'll edit the line you quoted to say "beyond standard single-AGI safety paradigms" to clarify that there's no single paradigm everyone buys into.

Comment by ricraz on Coherent decisions imply consistent utilities · 2021-01-07T15:46:40.154Z · LW · GW

It seems to me that there has been enough unanswered criticism of the implications of coherence theorems for making predictions about AGI that it would be quite misleading to include this post in the 2019 review. 

In an earlier review, johnswentworth argues:

I think instrumental convergence provides a strong argument that...we can use trade-offs with those resources in order to work out implied preferences over everything else, at least for the sorts of "agents" we actually care about (i.e. agents which have significant impact on the world).

I think this is a reasonable point, but also a very different type of argument from Eliezer's argument, since it relies on things like economic incentives. Instead, when Eliezer critiques Paul's concept of corrigibility, he says things like "deference is an unusually anti-natural shape for cognition". How do coherence theorems translate to such specific claims about the "shape of cognition"; and why is grounding these theorems in "resources" a justifiable choice in this context? These are the types of follow-up arguments which seem necessary at this point in order for further promotion of this post to be productive rather than harmful.

Comment by ricraz on "Other people are wrong" vs "I am right" · 2021-01-07T15:18:03.781Z · LW · GW

This has been one of the most useful posts on LessWrong in recent years for me personally. I find myself often referring to it, and I think almost everyone underestimates the difficulty gap between critiquing others and proposing their own, correct, ideas.

Comment by ricraz on ricraz's Shortform · 2020-12-22T17:35:38.834Z · LW · GW

Cool, glad to hear it. I'd clarify the summary slightly: I think all safety techniques should include at least a rough intuition for why they'll work in the scaled-up version, even when current work on them only applies them to simple AIs. (Perhaps this was implicit in your summary already, I'm not sure.)

Comment by ricraz on ricraz's Shortform · 2020-12-21T11:12:50.412Z · LW · GW

One source of our disagreement: I would describe evolution as a type of local search. The difference is that it's local with respect to the parameters of a whole population, rather than an individual agent. So this does introduce some disanalogies, but not particularly significant ones (to my mind). I don't think it would make much difference to my heuristic if we imagined that humans had evolved via gradient descent over our genes instead.

In other words, I like the heuristic of backchaining to local search, and I think of it as a subset of my heuristic. The thing it's missing, though, is that it doesn't tell you which approaches will actually scale up to training regimes which are incredibly complicated, applied to fairly intelligent agents. For example, impact penalties make sense in a local search context for simple problems. But to evaluate whether they'll work for AGIs, you need to apply them to massively complex environments. So my intuition is that, because I don't know how to apply them to the human ancestral environment, we also won't know how to apply them to our AGIs' training environments.

Similarly, when I think about MIRI's work on decision theory, I really have very little idea how to evaluate it in the context of modern machine learning. Are decision theories the type of thing which AIs can learn via local search? Seems hard to tell, since our AIs are so far from general intelligence. But I can reason much more easily about the types of decision theories that humans have, and the selective pressures that gave rise to them.

As a third example, my heuristic endorses Debate due to a high-level intuition about how human reasoning works, in addition to a low-level intuition about how it can arise via local search.

Comment by ricraz on ricraz's Shortform · 2020-12-10T16:53:13.462Z · LW · GW

A well-known analogy from Yann LeCun: if machine learning is a cake, then unsupervised learning is the cake itself, supervised learning is the icing, and reinforcement learning is the cherry on top.

I think this is useful for framing my core concerns about current safety research:

  • If we think that unsupervised learning will produce safe agents, then why will the comparatively small contributions of SL and RL make them unsafe?
  • If we think that unsupervised learning will produce dangerous agents, then why will safety techniques which focus on SL and RL (i.e. basically all of them) work, when they're making comparatively small updates to agents which are already misaligned?

I do think it's more complicated than I've portrayed here, but I haven't yet seen a persuasive response to the core intuition.

Comment by ricraz on Continuing the takeoffs debate · 2020-12-08T17:25:18.368Z · LW · GW

I think that, because culture is eventually very useful for fitness, you can either think of the problem as evolution not optimising for culture, or evolution optimising for fitness badly. And these are roughly equivalent ways of thinking about it, just different framings. Paul notes this duality in his original post:

If we step back from skills and instead look at outcomes we could say: “Evolution is always optimizing for fitness, and humans have now taken over the world.” On this perspective, I’m making a claim about the limits of evolution. First, evolution is theoretically optimizing for fitness, but it isn’t able to look ahead and identify which skills will be most important for your children’s children’s children’s fitness. Second, human intelligence is incredibly good for the fitness of groups of humans, but evolution acts on individual humans for whom the effect size is much smaller (who barely benefit at all from passing knowledge on to the next generation).

It seems like most of your response is an objection to this framing. I may need to think more about the relative advantages and disadvantages of each framing, but I don't think either is outright wrong.

What does "useful" mean here? If by "useful" you mean "improves an individual's reproductive fitness", then I disagree with the claim and I think that's where the major disagreement is.

Yes, I meant useful for reproductive fitness. Sorry for ambiguity.

Comment by ricraz on Continuing the takeoffs debate · 2020-12-07T16:05:52.025Z · LW · GW

Hmm, let's see. So the question I'm trying to ask here is: do other species lack proto-culture mainly because of an evolutionary oversight, or because proto-culture is not very useful until you're close to human-level in other respects? In other words, is the discontinuity we've observed mainly because evolution took a weird path through the landscape of possible minds, or because the landscape is inherently quite discontinuous with respect to usefulness? I interpret Paul as claiming the former.

But if the former is true, then we should expect that there are many species (including chimpanzees) in which selection for proto-culture would be useful even in the absence of other changes like increased brain size or social skills, because proto-culture is a useful thing for them to have in ways that evolution has been ignoring. So by "simple changes" I mean something like: changes which could be induced by a relatively short period of medium-strength selection specifically for proto-culture (say, 100,000 years; much less than the human-chimp gap).

Another very similar question which is maybe more intuitive: suppose we take animals like monkeys, and evolve them by selecting the ones which seem like they're making the most progress towards building a technological civilisation, until eventually they succeed. Would their progress be much more continuous than the human case, or fairly similar? Paul would say the former, I'm currently leaning (slightly) towards the latter. This version of the question doesn't make so much sense with chimpanzees, since it may be the case that by the time we reach chimpanzees, we've "locked in" a pretty sharp discontinuity.

Both of these are proxies for the thing I'm actually interested in, which is whether more direct optimisation for reaching civilisation leads to much more continuous paths to civilisation than the one we took.

The question isn't whether there are simple changes -- it seems likely there were -- the question is whether we should expect humans not to find these simple changes.

Both of these are interesting questions, if you interpret the former in the way I just described.

Separately, even if we concede that evolutionary progress could have been much more continuous if it had been "optimising for the right thing", we can also question whether humans will "optimise for the right thing".

You seem to be arguing that the dumber we expect human optimization to be, the more we should expect discontinuities. This seems kind of wild

Paul's argument is that evolution was discontinuous specifically because evolution was dumb in certain ways. My claim is that AGI may be discontinuous specifically because humans are dumb in certain ways (i.e. taking a long time to notice big breakthroughs, during which an overhang builds up). There are other ways in which humans being dumb would make discontinuities less likely (e.g. if we're incapable of big breakthroughs). That's why I phrased the question as "Will humans continually pursue all simple yet powerful changes to our AIs?", because I agree that humans are smart enough to find simple yet powerful changes if we're looking in the right direction, but I think there will be long periods in which we're looking in the wrong direction (i.e. not "continually pursuing" the most productive directions).

Thanks for the feedback. My responses are all things that I probably should have put in the original post. If they make sense to you (even if you disagree with them) then I'll edit the post to add them in.

Oh, one last thing I should mention: a reason that this topic seems quite difficult for me to pin down is that the two questions seem pretty closely tied together. So if you think that the landscape of usefulness is really weird and discontinuous, then maybe humans can still find a continuous path by being really clever. Or maybe the landscape is actually pretty smooth, but humans are so much dumber than evolution that by default we'll end up on a much more discontinuous path (because we accumulate massive hardware overhangs while waiting for the key insights). I don't know how to pin down definitions for each of the questions which don't implicitly depend on our expectations about the other question.

Comment by ricraz on ricraz's Shortform · 2020-12-05T15:14:44.763Z · LW · GW

So I think Debate is probably the best example of something that makes a lot of sense when applied to humans, to the point where they're doing human experiments on it already.

But this heuristic is actually a reason why I'm pretty pessimistic about most safety research directions.

Comment by ricraz on ricraz's Shortform · 2020-12-04T15:30:58.139Z · LW · GW


Comment by ricraz on ricraz's Shortform · 2020-12-04T15:29:42.223Z · LW · GW

I don't think that even philosophers take the "genie" terminology very seriously. I think the more general lesson is something like: it's particularly important to spend your weirdness points wisely when you want others to copy you, because they may be less willing to spend weirdness points.

Comment by ricraz on ricraz's Shortform · 2020-12-04T15:27:40.146Z · LW · GW

No. I meant: suppose we were rerunning a simulation of evolution, but can modify some parts of it (e.g. evolution's objective). How do we ensure that whatever intelligent species comes out of it is safe in the same ways we want AGIs to be safe?

(You could also think of this as: how could some aliens overseeing human evolution have made humans safe by those aliens' standards of safety? But this is a bit trickier to think about because we don't know what their standards are. Although presumably current humans, being quite aggressive and having unbounded goals, wouldn't meet them).

Comment by ricraz on ricraz's Shortform · 2020-12-02T23:39:08.239Z · LW · GW

The crucial heuristic I apply when evaluating AI safety research directions is: could we have used this research to make humans safe, if we were supervising the human evolutionary process? And if not, do we have a compelling story for why it'll be easier to apply to AIs than to humans?

Sometimes this might be too strict a criterion, but I think in general it's very valuable in catching vague or unfounded assumptions about AI development.

Comment by ricraz on Pain is not the unit of Effort · 2020-11-29T12:14:47.717Z · LW · GW

This is the second time you've (inaccurately) accused me of something, while simultaneously doing that thing yourself.

In the first case, I quoted a specific claim from your post and argued that it wasn't well-supported and, interpreted as a statement of fact, was false. In response, you accused me of "rounding off a specific technical claim to the nearest idea they've heard before", and then rounded off my criticism to a misunderstanding of the overall post.

Here, I asked "what justifies claims like [claim you made]"? The essence of my criticism was that you'd made a bold claim while providing approximately zero evidence for it. You accuse me of being uncharitable because I highlighted the "never" part in particular, which you interpreted as me taking you totally literally. But this is itself rather uncharitable, because in fact I'm also uninterested in whether "the probability is literally zero", and was just trying to highlight that you'd made a strong claim which demands correspondingly strong evidence. If you'd written "almost never" or "very rarely", I would have responded in approximately the same way: "Almost never? Based on what?" In other words, I was happy to use "never" in whatever sense you intended it, but you then did exactly what you criticised me for, and jumped to a "literally zero" interpretation.

I would suggest being more restrained with such criticisms in the future.

In any case, it's not unreasonable for you to make a substantive part of your post about "useful heuristics" (even though you do propose them as "beliefs"). It's not the best, epistemically, but there's plenty of space in an intellectual ecosystem for memorable, instrumentally useful blog posts. The main problem, from my point of view, is that Less Wrong still seems to think that insight porn is the unit of progress, as judged by engagement and upvotes. You get what you reward, and I wish our reward mechanism were more aligned. But this is a community-level issue which means your post may be interpreted in ways that you didn't necessarily intend, so it's probably not too useful for me to continue criticising it (even though I think we do have further territory-level disagreements - e.g. I agree with your statement about happiness, but would also say "nobody is trying their best, and not feeling enough pain is a particularly high ROI dimension along which to notice this", which I expect you'd disagree with). 

Comment by ricraz on ricraz's Shortform · 2020-11-28T21:42:44.395Z · LW · GW

Ah, yeah, that's a great point. Although I think act-based agents is a pretty bad name, since those agents may often carry out a whole bunch of acts in a row - in fact, I think that's what made me overlook the fact that it's pointing at the right concept. So not sure if I'm comfortable using it going forward, but thanks for pointing that out.

Comment by ricraz on Snyder-Beattie, Sandberg, Drexler & Bonsall (2020): The Timing of Evolutionary Transitions Suggests Intelligent Life Is Rare · 2020-11-28T21:35:12.619Z · LW · GW

I think it might be quite hard to go from dolphin- to human-level intelligence.

I discuss some possible reasons in this post:

I expect that most animals [if magically granted as many neurons as humans have] wouldn’t reach sufficient levels of general intelligence to do advanced mathematics or figure out scientific laws. That might be because most are too solitary for communication skills to be strongly selected for, or because language is not very valuable even for social species (as suggested by the fact that none of them have even rudimentary languages). Or because most aren’t physically able to use complex tools, or because they’d quickly learn to exploit other animals enough that further intelligence isn’t very helpful, or...

Comment by ricraz on Pain is not the unit of Effort · 2020-11-28T18:58:17.678Z · LW · GW

I take some responsibility for my original point being misinterpreted, because it was phrased in an unnecessarily confrontational way. Sorry about that.

I think where I went wrong and raised rationalist red flags is that the way I make this argument: (a) makes it seem like I don't believe in the strong form of the lemma and am intentionally stating false observations for instrumental reasons.

I think this falls on a spectrum of epistemic rigour. The good end involves treating instrumentally useful observations with the same level of scrutiny as instrumentally anti-useful observations (or even more, to counteract bias). The bad end involves intentionally saying things known to be false, because they are instrumentally useful. I interpret you as doing something in the middle, which I'd describe as: applying lower epistemic standards to instrumentally useful claims, and exaggerating them to make them more instrumentally useful.

To be clear, I don't think it's a particularly big deal, because I expect most people to have defensive filters that prevent them from taking these types of motivational sayings too seriously. However, this post has been very highly upvoted, which makes me a bit more concerned that people will start treating your two antidotes as received knowledge - especially given my background beliefs about this being a common mistake on LW. Hence why I pushed back on it. 

Moving to the object level claims: I accept that the main point you're making doesn't depend on the truth of the antidotes. I've already critiqued #1, but #2 also seems false to me. Consider someone who's very depressed, and also trying very hard to become less depressed. Are they "not trying their best"? Or someone who is working a miserable minimum-wage job while putting themselves through university and looking after children? Is there always going to be a magic bullet that solves these problems and makes them happy, apart from gritting their teeth and getting through it?

I tentatively accept the applicability of this claim to the restricted domain of people who are physically/mentally healthy, economically/socially privileged and focused on their long-term impact. Since I'm in that category, it may well be useful for me actually, so I'll try think about it more; thanks for raising the argument.

Comment by ricraz on Pain is not the unit of Effort · 2020-11-27T12:02:12.087Z · LW · GW

I am specifically saying that when you measure effort in units of pain this systematically leads to really bad places.

I think this is probably a useful insight, and seems to have resonated with quite a few people.

I'm specifically disputing your further conclusion that people in general should believe: "if it hurts, you're probably doing it wrong" (and also "You're not trying your best if you're not happy."). In fact, these are quite different from the original claim, and also broader than it, which is why they seem like overstatements to me.

I'm reminded of Buck's argument that it's much easier to determine that other people are wrong, than to be right yourself. In this case, even though I buy the criticism of the existing heuristic, proposing new heuristics is a difficult endeavour. Yet you present them as if they follow directly from your original insight. I mean, what justifies claims like "in practice [trading off happiness for short bursts of productivity] is never worth it"? Like, never? Based on what?

I get that this is an occupational hazard of writings posts that are meant to be very motivational/call-to-arms type posts. But you can motivate people without making blanket assertions about how to think about all domains. This seems particularly important on Less Wrong, where there's a common problem that content of the category "interesting speculation about psychology and society, where I have no way of knowing if it's true" is interpreted as solid intellectual progress.

Comment by ricraz on ricraz's Shortform · 2020-11-26T18:05:46.782Z · LW · GW

Oracle-genie-sovereign is a really useful distinction that I think I (and probably many others) have avoided using mainly because "genie" sounds unprofessional/unacademic. This is a real shame, and a good lesson for future terminology.

Comment by ricraz on ricraz's Shortform · 2020-11-26T18:00:17.959Z · LW · GW

Oh, actually, you're right (that you were wrong). I think I made the same mistake in my previous comment. Good catch.

Comment by ricraz on Pain is not the unit of Effort · 2020-11-26T17:44:42.378Z · LW · GW

"If it hurts, you're probably doing it wrong." This is just an assertion from an analogy with sports, where even the analogy is false - elite athletes put themselves through a ridiculous amount of pain.

In fact, I'd argue the exact opposite: the fact that intellectual work is currently much less painful than athletics suggests that there are still big gains to be made via painful interventions. Perhaps that pain comes from not seeing your friends as often, and spending weekends in the lab; or from alienating people by demanding higher standards from them. (Sure, not all pain is useful - but I'm arguing on the basis of a few examples that there are some good painful interventions, whereas you seem to be arguing from a couple of examples that there are almost none). People don't currently make those interventions because the rewards aren't high enough, but they would if the rewards of academic work were more heavy-tailed, like they are in sports. As evidence for this, note that the domain where rewards are most heavy-tailed (entrepreneurship) is notorious for being painful: "It’s like chewing glass and staring into the abyss."

Normatively, also, people in intellectual domains probably should make more of those painful interventions because the rewards to society of them becoming better are very heavy-tailed even though their own personal rewards are not (being the best academic in a field is not that different from being any other tenured academic).

I originally made this argument here:

Comment by ricraz on ricraz's Shortform · 2020-11-26T17:31:18.568Z · LW · GW

Wait, really? I thought it made sense (although I'd contend that most people don't think about AIXI in terms of those TMs reinforcing hypotheses, which is the point I'm making). What's incorrect about it?

Comment by ricraz on ricraz's Shortform · 2020-11-26T02:41:23.790Z · LW · GW

Yes we do: training is our evolutionary history, deployment is an individual lifetime. And our genomes are our reusable parameters.

Unfortunately I haven't yet written any papers/posts really laying out this analogy, but it's pretty central to the way I think about AI, and I'm working on a bunch of related stuff as part of my PhD, so hopefully I'll have a more complete explanation soon.

Comment by ricraz on Continuing the takeoffs debate · 2020-11-25T12:59:43.594Z · LW · GW

So my reasoning is something like:

  • There's the high-level argument that AIs will recursively self-improve very fast.
  • There's support for this argument from the example of humans.
  • There's a rebuttal to that support from the concept of changing selection pressures.
  • There's a counterrebuttal to changing selection pressures from my post.

By the time we reach the fourth level down, there's not that much scope for updates on the original claim, because at each level we lose confidence that we're arguing about the right thing, and also we've zoomed in enough that we're ignoring most of the relevant considerations.

I'll make this more explicit.

Comment by ricraz on Continuing the takeoffs debate · 2020-11-25T12:56:09.521Z · LW · GW

Yeah, I don't think this is a conclusive argument, it's just pointing to an intuition (which was then backed up by simulations in the paper). And the importance of transmission fidelity is probably higher when we're thinking about cumulative culture (with some skills being prerequisites for others), not just acquiring independent skills. But I do think your point is a good one.

Comment by ricraz on ricraz's Shortform · 2020-11-25T11:45:27.759Z · LW · GW

I suspect that AIXI is misleading to think about in large part because it lacks reusable parameters - instead it just memorises all inputs it's seen so far. Which means the setup doesn't have episodes, or a training/deployment distinction; nor is any behaviour actually "reinforced".

Comment by ricraz on Snyder-Beattie, Sandberg, Drexler & Bonsall (2020): The Timing of Evolutionary Transitions Suggests Intelligent Life Is Rare · 2020-11-25T00:33:13.141Z · LW · GW

Thinking out loud:

Suppose we treat ourselves as a random sample of intelligent life, and make two observations: first, we're on a planet that will last for X billion years, and second that we emerged after Y billion years. And we're trying to figure out Z, the expected time that life would take to emerge (if planet longevity weren't an issue).

This paper reasons from these facts to conclude that Z >> Y, and that (as a testable prediction) we'll eventually find that planets which are much longer-lived than the Earth are much less habitable for other reasons, because otherwise we would almost certainly have emerged there.

But it seems like this reasoning could go exactly the other way. In particular, why shouldn't we instead reason: "We have some prior over how habitable long-lived planets are. According to this prior, it would be quite improbable if Z >> Y, because then we would have almost definitely found ourselves on a long-lived planet."

So what I'm wondering is, what licenses us to ignore this when doing the original bayesian calculation of Z?
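For concreteness, the paper-style calculation being questioned might look something like this (the exponential emergence model, the log-uniform prior, and all the numbers here are my own illustrative assumptions, not the paper's):

```python
import math

# Assume life emerges at exponential rate 1/Z, and we only ever observe
# planets where it emerged before the habitable lifetime X ran out.
X = 5.0   # habitable lifetime (illustrative units)
Y = 4.0   # observed emergence time

def likelihood(Z):
    # Density of emergence at time Y, conditioned on emergence before X.
    return (math.exp(-Y / Z) / Z) / (1 - math.exp(-X / Z))

# Log-uniform prior over Z from 0.1 to 1000 (equal weight per grid point).
Zs = [0.1 * 10 ** (i / 100) for i in range(401)]
post = [likelihood(Z) for Z in Zs]
total = sum(post)
post = [p / total for p in post]

# Posterior mass on "life is slow" (Z much greater than X):
slow = sum(p for Z, p in zip(Zs, post) if Z > 10 * X)
print(slow)  # a majority of the posterior mass, roughly
```

The key feature is that for Z >> X the conditioned likelihood flattens out near 1/X, so a late emergence time pushes the posterior towards large Z. The comment's question is what licenses conditioning only on this, rather than also on which planet we find ourselves on.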

Comment by ricraz on A space of proposals for building safe advanced AI · 2020-11-21T16:59:29.399Z · LW · GW

Wouldn't it just be "train M* to win debates against itself as judged by H"? Since in the original formulation of debate a human inspects the debate transcript without assistance.

Anyway, I agree that something like this is also a reasonable way to view debate. In this case, I was trying to emphasise the similarities between Debate and the other techniques: I claim that if we call the combination of the judge plus one debater Amp(M), then we can think of the debate as M* being trained to beat Amp(M) by Amp(M)'s own standards.

Maybe an easier way to visualise this is that, given some question, M* answers that question, and then Amp(M) tries to identify any flaws in the argument by interrogating M*, and rewards M* if no flaws can be found.
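A toy sketch of that loop, with arithmetic questions standing in for hard questions and a simple checker standing in for Amp(M) (all names here are illustrative, not a real Debate implementation):

```python
import random

def m_star_answer(question, skill=0.9):
    # M* answers the question, but is sometimes wrong.
    a, b = question
    return a + b if random.random() < skill else a + b + 1

def amp_find_flaw(question, answer):
    # Amp(M) interrogates the answer, returning a flaw if it finds one.
    a, b = question
    return None if answer == a + b else "sum is incorrect"

def debate_reward(question):
    # M* is rewarded iff Amp(M) can find no flaw in its answer.
    answer = m_star_answer(question)
    return 1.0 if amp_find_flaw(question, answer) is None else 0.0

random.seed(0)
rewards = [debate_reward((random.randint(0, 9), random.randint(0, 9)))
           for _ in range(1000)]
print(sum(rewards) / len(rewards))  # close to M*'s underlying accuracy
```

The point of the toy is only the shape of the incentive: M* is trained against Amp(M)'s standards, not directly against ground truth.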