Posts

SDM's Shortform 2020-07-23T14:53:52.568Z · score: 4 (1 votes)
Modelling Continuous Progress 2020-06-23T18:06:47.474Z · score: 28 (10 votes)
Coronavirus as a test-run for X-risks 2020-06-13T21:00:13.859Z · score: 65 (24 votes)
Will AI undergo discontinuous progress? 2020-02-21T22:16:59.424Z · score: 24 (17 votes)
The Value Definition Problem 2019-11-18T19:56:43.271Z · score: 14 (9 votes)

Comments

Comment by sdm on A voting theory primer for rationalists · 2020-08-31T16:51:54.777Z · score: 1 (1 votes) · LW · GW
You seem to be comparing Arrow's theorem to Lord Vetinari, implying that both are undisputed sovereigns?

It was a joke about how, if you take Arrow's theorem literally, the fairest ranked 'voting method' - the only rule that produces a definite transitive preference ranking while meeting the unanimity and independence conditions - is 'one man, one vote', i.e. dictatorship.
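For reference, here is a standard formal statement of the theorem (textbook phrasing, not from the thread):

```latex
% Arrow's impossibility theorem: |A| >= 3 alternatives, n voters, L(A) = strict rankings of A.
% Any social welfare function F : L(A)^n -> L(A) satisfying
%   Unanimity: (for all i: a >_i b)  implies  a > b in the social ranking, and
%   IIA: the social ranking of a vs b depends only on each voter's ranking of a vs b,
% must be a dictatorship:
\exists\, i^{*} \ \text{ such that } \ F(\succ_1, \dots, \succ_n) \,=\, \succ_{i^{*}} \ \text{ for every profile } (\succ_1, \dots, \succ_n).
```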

And frankly, I think that the model used in the paper bears very little relationship to any political reality I know of. I've never seen a group of voters who believe "I would love it if any two of these three laws pass, but I would hate it if all three of them passed or none of them passed" for any set of laws that are seriously proposed and argued-for.

Such a situation doesn't seem all that far-fetched to me - suppose there are three different stimulus bills on offer, and you want some stimulus spending but you also care about rising national debt. You don't much care which bills pass, but you do want some stimulus money, and you don't want all three to pass because the debt would rise too high - so you decide you want any 2 out of the 3 to pass. That said, I think the methods introduced in that paper might be most useful not for modelling the outcomes of voting systems, but for attempts to align an AI to multiple people's preferences.

Comment by sdm on Forecasting Thread: AI Timelines · 2020-08-29T11:11:04.240Z · score: 2 (2 votes) · LW · GW

I'll take that bet! If I do lose, I'll be far too excited/terrified/dead to worry in any case.

Comment by sdm on Covid 8/27: The Fall of the CDC · 2020-08-28T11:32:12.947Z · score: 4 (3 votes) · LW · GW
I’m still periodically scared in an existential or civilization-is-collapsing-in-general kind of way, but not in a ‘the economy is about to collapse’ or ‘millions of Americans are about to die’ kind of way. 
I’m not sure whether this is progress.

It definitely is progress. If we were in the latter situation, there would be nothing at all to do except hope you personally don't die, whereas in the former there's a chance for things to get better - if we learn the lesson.

By strange coincidence, it's exactly 6 months since I wrote this, and I think it's important to remember just how dire the subjective future seemed at the end of February - that (subjectively, anyway) could have happened, but didn't.

Comment by sdm on SDM's Shortform · 2020-08-28T10:50:18.165Z · score: 3 (2 votes) · LW · GW
The tl;dr is that instead of thinking of ethics as a single unified domain where "population ethics" is just a straightforward extension of "normal ethics," you split "ethics" into a bunch of different subcategories:
Preference utilitarianism as an underdetermined but universal morality
"What is my life goal?" as the existentialist question we have to answer for why we get up in the morning
"What's a particularly moral or altruistic thing to do with the future lightcone?" as an optional subquestion of "What is my life goal?" – of interest to people who want to make their life goals particularly altruistically meaningful

This is very interesting - I recall from our earlier conversation that you said you might expect some areas of agreement, just not on axiology:

(I say elements because realism is not all-or-nothing - there could be an objective 'core' to ethics, maybe axiology, and much ethics could be built on top of such a realist core - that even seems like the most natural reading of the evidence, if the evidence is that there is convergence only on a limited subset of questions.)

I also agree with that, except that I think axiology is the one place where I'm most confident that there's no convergence. :)
Maybe my anti-realism is best described as "some moral facts exist (in a weak sense as far as other realist proposals go), but morality is underdetermined."

This may seem like an odd question, but are you possibly a normative realist, just not a full-fledged moral realist? What I didn't say in that bracket was that 'maybe axiology' wasn't my only guess about what the objective, normative facts at the core of ethics could be.

Following Singer in The Expanding Circle, I also think that some impartiality rule leading to preference utilitarianism - maybe analogous to the anonymity condition in social choice - could be one of the normatively correct rules that ethics has to follow, but that if convergence among ethical views doesn't occur, the final answer might be underdetermined. This seems to be essentially the same as your view, so maybe we disagree less than it initially seemed.


In my attempted classification (of whether you accept convergence and/or irreducible normativity), I think you'd be somewhere between 1 and 3. I did say that those views might be on a spectrum depending on which areas of Normativity overall you accept, but I didn't consider splitting up ethics into specific subdomains, each of which might have convergence or not:

Depending on which of the arguments you accept, there are four basic options. These are extremes of a spectrum, as while the Normativity argument is all-or-nothing, the Convergence argument can come by degrees for different types of normative claims (epistemic, practical and moral)

Assuming that it is possible to cleanly separate population ethics from 'preference utilitarianism', it is consistent, though quite counterintuitive, to demand reflective coherence in our non-population ethical views but allow whatever we want in population ethics (this would be view 1 for most ethics but view 3 for population ethics).

(This still strikes me as exactly what we'd expect to see halfway to reaching convergence - the weirder and newer subdomain of ethics still has no agreement, while we have reached greater agreement on questions we've been working on for longer.)

It sounds like you're contrasting my statement from The Case for SFE ("fit all one’s moral intuitions into an overarching theory based solely on intuitively appealing axioms") with "arbitrarily halting the search for coherence" / giving up on ethics playing a role in decision-making. But those are not the only two options: you can have some universal moral principles, but leave a lot of population ethics underdetermined.

Your case for SFE was intended to defend a view of population ethics - that there is an asymmetry between suffering and happiness. If we've decided that population ethics is to remain underdetermined - that is, we adopt view 3 for population ethics - what is your argument (that SFE is an intuitively appealing explanation for many of our moral intuitions) meant to achieve? Can't I simply declare that my intuitions say otherwise, and then we have nothing more to discuss, if we already know we're going to leave population ethics underdetermined?

Comment by sdm on Forecasting Thread: AI Timelines · 2020-08-26T14:35:28.173Z · score: 1 (0 votes) · LW · GW

The 'progress will be continuous' argument, in order to apply to our near future, does depend on my other assumptions - mainly that the breakthroughs on that list are separable, so that agentive behaviour and long-term planning won't drop out of a larger GPT by themselves and can't be considered part of simply 'improving language model accuracy'.

We currently have partial progress on human-level language comprehension, a bit on cumulative learning, but near zero on managing mental activity for long term planning, so if we were to suddenly reach human level on long-term planning in the next 5 years, that would probably involve a discontinuity, which I don't think is very likely for the reasons given here.

If language models scale to near-human performance but the other milestones don't fall in the process, and my initial claim is right, that gives us very transformative AI but not AGI. I think that the situation would look something like this:

If GPT-N reaches par-human, the milestones already covered would be:

(?) cumulative learning
human-like language comprehension
perception and object recognition
efficient search over known facts

and the milestones still remaining would be:

discovering new action sets
managing its own mental activity

So there would be 2 (maybe 3?) breakthroughs remaining. It seems like you think just scaling up a GPT will also resolve those other milestones, rather than just giving us human-like language comprehension. Whereas if I'm right, and those curves do extrapolate, what we would get at the end would be an excellent text generator - but it wouldn't be an agent, wouldn't be capable of long-term planning, and couldn't accurately be described as having a utility function over the states of the external world. I don't see any reason why trivial extensions of GPT would be able to do those things either, since they seem like problems that are just as hard as human-like language comprehension. GPT seems like it's also making some progress on cumulative learning, though it might need some RL-based help with that, but none at all on managing mental activity for long-term planning or on discovering new action sets.

As an additional argument, admittedly from authority - Stuart Russell also clearly sees human-like language comprehension as only one of several really hard and independent problems that need to be solved.

A humanlike GPT-N would certainly be a huge leap into a realm of AI we don't know much about, so we could be surprised and discover that agentive behaviour and a utility function over states of the external world spontaneously appear in a good enough language model. But that argument has to be made, and you need both that argument to hold and GPT to keep scaling for us to reach AGI in the next five years - I don't see the conjunction of those two as that likely. It seems as though your argument rests solely on whether GPT scales or not, when there's also this other conceptual premise that's much harder to justify.

I'm also not sure if I've seen anyone make the argument that GPT-N will also give us these specific breakthroughs - but if you have reasons to think that GPT scaling would solve all the remaining barriers to AGI, I'd be interested to hear them. Note that this isn't the same as just pointing out how impressive the results of scaling up GPT could be - Gwern's piece here, for example, seems to be arguing for a scenario more like what I've envisaged, where GPT-N ends up a key piece of some future AGI but just provides some of the background 'world model':

Models like GPT-3 suggest that large unsupervised models will be vital components of future DL systems, as they can be ‘plugged into’ systems to immediately provide understanding of the world, humans, natural language, and reasoning.

If GPT does scale, and we get human-like language comprehension in 2025, that will mean we're moving up that list much faster, and in turn suggests that there might not be a large number of additional discoveries required to make the other breakthroughs, which in turn suggests they might also occur within the Deep Learning paradigm, and relatively soon. I think that if this happens, there's a reasonable chance that when we do build an AGI a big part of its internals looks like a GPT, as gwern suggested, but by then we're already long past simply scaling up existing systems.

Alternatively, perhaps you're not including agentive behaviour in your definition of AGI - but a par-human text generator for most tasks that isn't capable of discovering new action sets or managing its mental activity is, I think, a 'mere' transformative AI and not a genuine AGI.

Comment by sdm on SDM's Shortform · 2020-08-25T15:56:57.852Z · score: 2 (2 votes) · LW · GW

So to sum up, a very high-level summary of the steps in this method of preference elicitation and aggregation would be:

    1. With a mixture of normative assumptions and multi-channel information (approval and actions) as inputs, use a reward-modelling method to elicit the debiased preferences of many individuals.
       - Determining whether there actually are significant differences between stated and revealed preferences when performing reward modelling is the first step to using multi-channel information to effectively separate biases from preferences.
    2. Create 'proxy agents' using the reward model developed for each human (this step is where intent-aligned amplification can potentially occur).
    3. Place the proxies in an iterated voting situation which tends to produce sensible convergent results. The use of RL proxies here can be compared to the use of human proxies in liquid democracy.
       - Which voting mechanisms tend to work in iterated situations with RL agents can be determined in other experiments (probably with purely artificial agents).
    4. Run the voting mechanism until an unambiguous winner is decided, using methods like those given in this paper.

This seems like a reasonable procedure for extending a method that is aligned to one human's preferences (steps 1-2) to produce sensible results when trying to align to an aggregate of human preferences (steps 3-4). It reduces reliance on the specific features of any one voting method. Other than the insight that multiple channels of information might help, all the standard unsolved problems with preference learning from one human remain.
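A very rough skeleton of that pipeline in code - every name here is a placeholder of my own, not an existing API, and step 1 is left unimplemented:

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

@dataclass
class ProxyAgent:
    """Step 2: stands in for one human, acting on their learned (debiased) reward model."""
    reward_model: Callable[[object], float]

    def vote(self, candidates: Sequence, history: List) -> object:
        # A real proxy would learn to vote strategically over many rounds (step 3);
        # the simplest version just votes for its top-ranked candidate.
        return max(candidates, key=self.reward_model)

def elicit_reward_model(actions, approvals, normative_assumptions) -> Callable[[object], float]:
    """Step 1: any reward-modelling method that uses both information channels."""
    raise NotImplementedError

def aggregate(proxies: List[ProxyAgent], candidates: Sequence, max_rounds: int = 10_000):
    """Steps 3-4: iterate a voting rule until the winner stops changing."""
    history, winner = [], None
    for _ in range(max_rounds):
        ballots = [p.vote(candidates, history) for p in proxies]
        round_winner = max(candidates, key=ballots.count)  # plurality, just for concreteness
        history.append(ballots)
        if round_winner == winner:
            return round_winner  # an unambiguous, stable winner
        winner = round_winner
    return winner
```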

Even though we can't yet align an AGI to one human's preferences, trying to think about how to aggregate human preferences in a way that is scalable isn't premature, as has sometimes been claimed.

In many 'non-ambitious' hypothetical settings where we aren't trying to build an AGI sovereign over the whole world (for example, designing a powerful AI to govern the operations of a hospital), we still need to be able to aggregate preferences sensibly and stably. This method would do well at such intermediate scales, as it doesn't approach the question of preference aggregation from a 'final' ambitious value-learning perspective but instead tries to look at preference aggregation the same way we look at elicitation, with an RL-based iterative approach to reaching a result.

However, if you did want to use such a method to try to produce the fabled 'final utility function of all humanity', it might not give you Humanity's CEV, since some normative assumptions (that preferences count equally, and in the way given by the voting mechanism) are built in. By analogy with CEV, I called the idealized result of this method a coherent extrapolated framework (CEF). It is a more normatively direct method of aggregating values than CEV, since you fix a particular method of aggregating preferences in advance: it extrapolates from a voting framework rather than from our volition, more broadly (and vaguely) defined - hence CEF.

Comment by sdm on A voting theory primer for rationalists · 2020-08-25T13:00:09.261Z · score: 3 (2 votes) · LW · GW
Kenneth Arrow proved that the problem that Condorcet (and Llull) had seen was in fact a fundamental issue with any ranked voting method. He posed 3 basic "fairness criteria" and showed that no ranked method can meet all of them:
Ranked unanimity, Independence of irrelevant alternatives, Non-dictatorial

I've been reading up on voting theory recently, and Arrow's result is that the only ranked voting method which produces a definite transitive preference ranking, picks the unanimous answer if one exists, and doesn't change depending on irrelevant alternatives is 'one man, one vote'.

“Ankh-Morpork had dallied with many forms of government and had ended up with that form of democracy known as One Man, One Vote. The Patrician was the Man; he had the Vote.”

In my opinion, aside from the utilitarian perspective offered by VSE, the key to evaluating voting methods is an understanding of strategic voting; this is what I'd call the "mechanism design" perspective. I'd say that there are 5 common "anti-patterns" that voting methods can fall into; either where voting strategy can lead to pathological results, or vice versa.

One recent extension to these statistical approaches is to use RL agents in iterated voting and examine their convergence behaviour. The idea is that we embrace the inevitable impossibility results (such as the Arrow and GS theorems) and consider agents' ability to vote strategically as an opportunity to reach stable outcomes. This paper uses very simple Q-learning agents with a few different policies - epsilon-greedy, greedy and upper confidence bound - in an iterated voting game, and gets behaviour that seems sensible. Many thousands of rounds of iterated voting aren't practical for real-world elections, but for preference elicitation in other contexts (such as value learning) the approach might be useful as a way to estimate people's underlying utilities as accurately as possible.
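As a toy illustration of that kind of setup (my own reconstruction, not the paper's code): epsilon-greedy Q-learners repeatedly vote under plurality, each agent's reward is its private utility for the round's winner, and its 'state' is simply the previous winner.

```python
import random

N_VOTERS, N_CANDIDATES, ROUNDS, EPS, ALPHA = 9, 4, 5000, 0.1, 0.1
random.seed(0)
utilities = [[random.random() for _ in range(N_CANDIDATES)] for _ in range(N_VOTERS)]
# Q[i][prev_winner][ballot]: voter i's estimated value of casting `ballot` after seeing `prev_winner`
Q = [[[0.0] * N_CANDIDATES for _ in range(N_CANDIDATES)] for _ in range(N_VOTERS)]

prev = 0
for _ in range(ROUNDS):
    ballots = []
    for i in range(N_VOTERS):
        if random.random() < EPS:
            ballots.append(random.randrange(N_CANDIDATES))  # explore
        else:
            ballots.append(max(range(N_CANDIDATES), key=lambda c: Q[i][prev][c]))  # exploit
    winner = max(range(N_CANDIDATES), key=ballots.count)  # plurality rule
    for i in range(N_VOTERS):
        # one-step Q-update towards the reward obtained this round
        Q[i][prev][ballots[i]] += ALPHA * (utilities[i][winner] - Q[i][prev][ballots[i]])
    prev = winner

print("winner after iterated voting:", prev)
```

Whether the stable outcome has good welfare properties is exactly the kind of question those experiments are meant to answer.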

Comment by sdm on Open & Welcome Thread - August 2020 · 2020-08-24T14:14:25.444Z · score: 6 (4 votes) · LW · GW

A first actually credible claim of coronavirus reinfection? Potentially good news as the patient was asymptomatic and rapidly produced a strong antibody response.

Comment by sdm on Forecasting Thread: AI Timelines · 2020-08-23T16:32:34.429Z · score: 26 (10 votes) · LW · GW

Here's my answer. I'm pretty uncertain compared to some of the others!

AI Forecast

First, I'm assuming that by AGI we mean an agent-like entity that can do the things associated with general intelligence, including things like planning towards a goal and carrying that out. If we end up in a CAIS-like world where there is some AI service or other that can do most economically useful tasks, but nothing with very broad competence, I count that as never developing AGI.

I've been impressed with GPT-3, and could imagine it or something like it scaling to produce near-human level responses to language prompts in a few years, especially with RL-based extensions.

But, following the list (below) of missing capabilities by Stuart Russell, I still think things like long-term planning would elude GPT-N, so it wouldn't be agentive general intelligence. You might get those behaviours with trivial extensions of GPT-N, but I don't think that's very likely.

That's why I think AGI before 2025 is very unlikely (not enough time for anything except scaling up of existing methods). This is also because I tend to expect progress to be continuous, though potentially quite fast, and going from current AI to AGI in less than 5 years requires a very sharp discontinuity.

AGI before 2035 or so happens if systems quite a lot like current deep learning, though not just trivial extensions of it, can do the job - this seems reasonable to me on the inside view - e.g. it takes us less than 15 years to take GPT-N and add layers on top of it that handle things like planning and discovering new actions. This is probably my 'inside view' answer.

I put a lot of weight on a tail peaking around 2050 because of how quickly we've advanced up this 'list of breakthroughs needed for general intelligence' -

There is this list of remaining capabilities needed for AGI in an older post I wrote; as I see it, 'GPT-6' would cover the first of them (human-like language comprehension), and perhaps some of the second (cumulative learning), but not the others:

Stuart Russell’s List

human-like language comprehension

cumulative learning

discovering new action sets

managing its own mental activity

For reference, I’ve included two capabilities we already have that I imagine being on a similar list in 1960:

perception and object recognition

efficient search over known facts

So we'd have discovering new action sets and managing mental activity - effectively, the things that facilitate long-range complex planning - remaining.

So (very oversimplified): if around the 1980s we had efficient search algorithms, by 2015 we had image recognition (basic perception), and by 2025 we have language comprehension courtesy of GPT-8, that leaves cumulative learning (which could be obtained by advanced RL?), then discovering new action sets and managing mental activity (no idea). It feels a bit odd that we'd breeze past all the remaining milestones in one decade after it took ~6 decades to get to where we are now. Say progress has sped up to twice that rate - then it's 3 more decades to go. Add to this the economic evidence from things like Modelling the Human Trajectory, which suggests a roughly similar time period of around 2050.

Finally, I think it's unlikely but not impossible that we never build AGI and instead go for tool AI or CAIS, most likely because we've misunderstood the incentives such that it isn't actually economical, or because agentive behaviour doesn't arise easily. Then there's the small (few percent) chance of catastrophic or existential disaster which wrecks our ability to invent things. This is the one I'm most unsure about - I put 15% for both, but it may well be higher.

Comment by sdm on SDM's Shortform · 2020-08-23T15:57:40.177Z · score: 5 (3 votes) · LW · GW

I don't think that excuse works in this case - I didn't give it a 'long-winded frame', just that brief sentence at the start, and then the list of scenarios, and even though I reran it a couple of times on each to check, the 'cranberry/grape juice kills you' outcome never arose.

So, perhaps they switched directly from no prompt to an incredibly long-winded and specific prompt without checking what was actually necessary for a good answer? I'll point out that I didn't really attempt any sophisticated prompt programming either - that was literally the first sentence I thought of!

Comment by sdm on SDM's Shortform · 2020-08-23T12:31:37.767Z · score: 14 (6 votes) · LW · GW

Gary Marcus, noted sceptic of Deep Learning, wrote an article with Ernest Davis:

GPT-3, Bloviator: OpenAI’s language generator has no idea what it’s talking about

The article purports to give six examples of GPT-3's failure - Biological, Physical, Social, Object and Psychological reasoning and 'non sequiturs'. Leaving aside that GPT-3 works on Gary's earlier GPT-2 failure examples, and that it seems as though he specifically searched out weak points by testing GPT-3 on many more examples than were given, something a bit odd is going on with the results they gave. I got better results when running his prompts on AI Dungeon.

With no reruns and randomness = 0.5, I gave Gary's questions (all six of which had produced answers Gary considered 'failures') to GPT-3 via AI Dungeon, with a short scene-setting prompt, and got good answers to 4 of them and reasonable but vague answers to the other 2:

This is a series of scenarios describing a human taking actions in the world, designed to test physical and common-sense reasoning.
1) You poured yourself a glass of cranberry juice, but then you absentmindedly poured about a teaspoon of grape juice into it. It looks okay. You try sniffing it, but you have a bad cold, so you can’t smell anything. You are very thirsty. So you take another drink.
2) You are having a small dinner party. You want to serve dinner in the living room. The dining room table is wider than the doorway, so to get it into the living room, you will have to  move furniture. This means that some people will be inconvenienced.
3) You are a defense lawyer and you have to go to court today. Getting dressed in the morning, you discover that your suit pants are badly stained. However, your bathing suit is clean and very stylish. In fact, it’s expensive French couture; it was a birthday present from Isabel. You decide that you should wear it because you won't look professional in your stained pants, but you are worried that the judge will think you aren't taking the case seriously if you are wearing a bathing suit.
4) Yesterday I dropped my clothes off at the dry cleaner’s and I have yet to pick them up. Where are my clothes?
5) Janet and Penny went to the store to get presents for Jack. Janet said, “I will buy Jack a top.” “Don’t get Jack a top,” says Penny. “He has a top. He will prefer a bottom."
6) At the party, I poured myself a glass of lemonade, but it turned out to be too sour, so I added a little sugar. I didn’t see a spoon handy, so I stirred it with a cigarette. But that turned out to be a bad idea because it was a menthol, and it ruined the taste. So I added a little more sugar to counteract the menthol, and then I noticed that my cigarette had fallen into the glass and was floating in the lemonade.

For 1), Gary's example ended with 'you are now dead', whereas I got a reasonable, if short, continuation - success.

For 2), the answer is vague enough to be a technically correct solution ('move furniture' could mean tilting the table), but since we're being strict I'll count it as a failure. Gary's example was a convoluted attempt to saw the door in half - clearly mistaken.

3) is very obviously intended to trick the AI into endorsing the bathing suit answer - in fact it feels like a classic priming trick that might trip up a human! But in my version GPT-3 rebels against the attempt and notices the incongruity of wearing a bathing suit to court, so it counts as a success. Gary's example didn't include the worry that a bathing suit was inappropriate - arguably not a failure, but never mind, let's move on.

4) is actually a complete prompt by itself, so the AI didn't do anything - GPT-3 doesn't care about answering questions, just about continuing text with high probability. Gary's answer was 'I have a lot of clothes', and no doubt he'd call both 'evasion', so to be strict we'll agree with him and count that as a failure.

For 5), trousers are called 'bottoms', so that's right. Gary would call it wrong since 'the intended continuation' was "He will make you take it back", but that's absurdly unfair - that's not the only answer a human being might give - so I have to say it's correct. Gary's example 'lost track of the fact that Penny is advising Janet against getting a top', which didn't happen here, so that's acceptable.

Lastly, 6) is a slightly bizarre but logical continuation of an intentionally weird prompt - so correct. It also demonstrates correct physical reasoning - stirring a drink with a cigarette won't be good for the taste. Gary's answer wandered off-topic and started talking about cremation.

So, 4/6 correct on an intentionally deceptive and adversarial set of prompts, and that's on a fairly strict definition of correct. 2) and 4) are arguably not wrong, even if evasive and vague. More to the point, this was on an inferior version of GPT-3 to the one Gary used, the Dragon model from AI Dungeon!

I'm not sure what's going on here - is it the initial prompt saying it was 'testing physical and common sense reasoning'? Was that all it took?

Comment by sdm on Learning human preferences: optimistic and pessimistic scenarios · 2020-08-21T16:40:21.362Z · score: 2 (2 votes) · LW · GW

Glad you think so! I think that methods like using multiple information sources might be a useful way to reduce the number of (potentially mistaken) normative assumptions you need in order to model a single human's preferences.

The other area of human preference learning where you seem, inevitably, to need a lot of strong normative assumptions is in preference aggregation. If we assume we have elicited the preferences of lots of individual humans, and we're then trying to aggregate their preferences (with each human's preferences represented by a separate model), I think the same basic principle applies: you can reduce the normative assumptions you need by using a more complicated voting mechanism - in this case, one that considers agents' ability to vote strategically as an opportunity to reach stable outcomes.

I talk about this idea here. As with using approval/actions to improve the elicitation of an individual's preferences, you can't avoid making any normative assumptions by using a more complicated aggregation method, but perhaps you end up having to make fewer of them. Very speculatively, if you can combine a robust method of eliciting preferences with few inbuilt assumptions with a similarly robust method of aggregating preferences, you're on your way to a full solution to ambitious value learning.

Comment by sdm on SDM's Shortform · 2020-08-20T23:10:02.446Z · score: 12 (3 votes) · LW · GW

Modelling the Human Trajectory or ‘How I learned to stop worrying and love Hegel’.

Rohin’s opinion: I enjoyed this post; it gave me a visceral sense for what hyperbolic models with noise look like (see the blog post for this, the summary doesn’t capture it). Overall, I think my takeaway is that the picture used in AI risk of explosive growth is in fact plausible, despite how crazy it initially sounds.

One thing this post led me to consider is that when we bring together various fields, the evidence for 'things will go insane in the next century' is stronger than any specific claim about (for example) AI takeoff. What is the other evidence?

We're probably alone in the universe, and anthropic arguments tend to imply we're living at an incredibly unusual time in history. Isn't that what you'd expect to see in the same world where there is a totally plausible mechanism that could carry us a long way up this line, in the form of AGI and eternity in six hours? All the pieces are already there, and they only need to be approximately right for our lifetimes to be far weirder than those of people who were e.g. born in 1896 and lived to 1947 - which was weird enough, but that should be your minimum expectation.

In general, there are three categories of evidence that things are likely to become very weird over the next century, or that we live at the hinge of history:

  1. Specific mechanisms around AGI - possibility of rapid capability gain, and arguments from exploratory engineering

  2. Economic and technological trend-fitting predicting explosive growth in the next century

  3. Anthropic and Fermi arguments suggesting that we live at some extremely unusual time

All of these are evidence for such a claim: 1) because a superintelligent AGI takeoff is just a specific example of how the hinge could occur, and 3) because it directly argues that we live at such a time - but how does 2) fit in with 1) and 3)?

There is something a little strange about calling a fast takeoff from AGI and whatever was driving superexponential growth throughout all of history the same trend - it would be as though some huge cosmic coincidence ensures there is always superexponential growth, so that as soon as population growth plus growth in wealth per capita (or whatever was driving it until now) runs out in the great stagnation (visible as a tiny blip on the right-hand side of the double-log plot), AGI takes over and pushes us up the same trend line. That's clearly not possible, so if AGI is what takes us up the rest of that trend line, there would have to be some single factor responsible for both - a factor that was at work in the founding of Jericho but predestined that AGI would be invented and cause explosive growth in the 21st century, rather than the 19th or the 23rd.

For AGI to be the driver of the rest of that growth curve, there has to be a single causal mechanism that keeps us on the same trend and includes AGI as its final step - if we say we are agnostic about what that mechanism is, we can still call 2) evidence for us living at the hinge point, though we have to note that there is a huge blank spot in need of explanation. Is there anything that can fill it to complete the picture?

The mechanism proposed in the article seems like it could plausibly include AGI.

If technology is responsible for the growth rate, then reinvesting production in technology will cause the growth rate to be faster. I'd be curious to see data on what fraction of GWP gets reinvested in improved technology and how that lines up with the other trends.

But even though the drivers seem superficially similar (they are both about technology), the claim here is that one very specific technology will generate explosive growth, not that technology in general will - and it seems strange that AGI would follow the same growth curve as one driven by reinvesting more GWP in improving ordinary technology, which doesn't improve your own ability to think in the way that AGI would.
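As a stylised, deterministic illustration of that kind of feedback loop (my own simplification, not the model in the post): if the growth rate of output itself rises with the level of output - say, because output is reinvested in the technology that produces it - growth is superexponential and diverges in finite time.

```latex
\frac{\dot{Y}}{Y} = k\,Y^{B}, \quad B > 0
\;\;\Longrightarrow\;\;
Y(t) = \big(k\,B\,(t^{*} - t)\big)^{-1/B},
\qquad Y(t) \to \infty \ \text{as}\ t \to t^{*}.
```

The question here is whether 'reinvest output in ordinary technology' and 'build AGI' can really be two phases of one such mechanism.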

As for precise timings, the great stagnation (the last 30ish years) just seems like it would stretch out the timeline a bit, so we shouldn't take the 2050s date too seriously - as well as the last 70 years fit an exponential trend line, there's really no way to make that fit overall, as that post makes clear.

Comment by sdm on Open & Welcome Thread - August 2020 · 2020-08-20T11:45:22.501Z · score: 2 (2 votes) · LW · GW

Many alignment approaches require at least some initial success at directly eliciting human preferences to get off the ground - there have been some excellent recent posts about the problems this presents. In part because of arguments like these, there has been far more focus on the question of preference elicitation than on the question of preference aggregation:

The maximally ambitious approach has a natural theoretical appeal, but it also seems quite hard. It requires understanding human preferences in domains where humans are typically very uncertain, and where our answers to simple questions are often inconsistent, like how we should balance our own welfare with the welfare of others, or what kinds of activities we really want to pursue vs. enjoying in the moment...
I have written about this problem, pointing out that it is unclear how you would solve it even with an unlimited amount of computing power. My impression is that most practitioners don’t think of this problem even as a long-term research goal — it’s a qualitatively different project without direct relevance to the kinds of problems they want to solve.

I think that this has a lot of merit, but it has sometimes been interpreted as saying that any work on preference aggregation or idealization, before we have a robust way to elicit preferences, is premature. I don't think this is right - in many 'non-ambitious' settings where we aren't trying to build an AGI sovereign over the whole world (for example, designing a powerful AGI to govern the operations of a hospital) you still need to be able to aggregate preferences sensibly and stably.

I've written a rough shortform post with some thoughts on this problem which doesn't approach the question from a 'final' ambitious value-learning perspective but instead tries to look at aggregation the same way we look at elicitation, with an imperfect, RL-based iterative approach to reaching consensus.

...
The Kidney exchange paper elicited preferences from human subjects (using repeated pairwise comparisons) and then aggregated them using the Bradley-Terry model. You couldn't use such a simple statistical method to aggregate quantitative preferences over continuous action spaces, like the preferences that would be learned from a human via a complex reward model. Also, any time you try to use some specific one-shot voting mechanism you run into various impossibility theorems which seem to force you to give up some desirable property.
One approach that may be more robust against errors in a voting mechanism, and easily scalable to more complex preference profiles, is to use RL not just for the preference elicitation, but also for the preference aggregation. The idea is that we embrace the inevitable impossibility results (such as Arrow and GS theorems) and consider agents' ability to vote strategically as an opportunity to reach stable outcomes.
This paper uses very simple Q-learning agents with a few different policies - epsilon-greedy, greedy and upper confidence bound, in an iterated voting game, and gets behaviour that seems sensible. (Note the similarity and differences with the moral parliament, where a particular one-shot voting rule is justified a priori and then used.)
The fact that this paper exists is a good sign because it's very recent and the methods it uses are very simple - it's pretty much just a proof of concept, as the authors state - so that tells me there's a lot of room for combining more sophisticated RL with better voting methods.

Approaches like these seem especially urgent if AI timelines are shorter than we expect, which has been argued based on results from GPT-3. If this is the case, we might need to be dealing with questions of aggregation relatively soon with methods somewhat like current deep learning, and so won't have time to ensure that we have a perfect solution to elicitation before moving on to aggregation.

Comment by sdm on SDM's Shortform · 2020-08-20T11:27:17.491Z · score: 10 (4 votes) · LW · GW

Improving preference learning approaches

When examining value learning approaches to AI Alignment, we run into two classes of problem: we want to understand how to elicit preferences, which is very difficult (even theoretically, with infinite computing power), and we want to know how to aggregate preferences stably and correctly, which is not just difficult but runs into complicated social choice and normative ethical issues.

Many research programs say the second of these questions is less important than the first, especially if we expect continuous takeoff with many chances to course-correct, and a low likelihood of an AI singleton with decisive strategic advantage. For many, building an AI that can reliably extract and pursue the preferences of one person is good enough.

Christiano calls this 'the narrow approach' and sees it as a way to sidestep many of the ethical issues, including those around social choice; the approaches that do take on those issues would be the 'ambitious' ones.

We want to build machines that helps us do the things we want to do, and to that end they need to be able to understand what we are trying to do and what instrumental values guide our behavior. To the extent that our “preferences” are underdetermined or inconsistent, we are happy if our systems at least do as well as a human, and make the kinds of improvements that humans would reliably consider improvements.
But it’s not clear that anything short of the maximally ambitious approach can solve the problem we ultimately care about.

I think that the ambitious approach is still worth investigating, because it may well eventually need to be solved, and also because it may well need to be addressed in a more limited form even on the narrow approach (one could imagine an AGI with a lot of autonomy having to trade off the preferences of, say, a hundred different people). But even the 'narrow' approach raises difficult psychological issues about how to distinguish legitimate preferences from bias - questions of elicitation. In other words, the cognitive science issues around elicitation (distinguishing bias from legitimate preference) must be resolved for any kind of preference learning to work, and the social choice and ethical issues around preference aggregation need at least preliminary solutions for any alignment method that aims to apply to more than one person (even if final, provably correct solutions to aggregation are only needed if designing a singleton with decisive strategic advantage).

I believe that I've located two areas that are under- or unexplored, for improving the ability of reward modelling approaches to elicit human preferences and to aggregate human preferences. These are: using multiple information sources from a human (approval and actions) which diverge to help extract unbiased preferences, and using RL proxy agents in iterated voting to reach consensus preference aggregations, rather than some direct statistical method. Neither of these is a complete solution, of course, for reasons discussed e.g. here by Stuart Armstrong, but they could nonetheless help.

Improving preference elicitation: multiple information sources

Eliciting the unbiased preferences of an individual human is extremely difficult, for reasons given here.

The agent's actions can be explained by their beliefs and preferences[1], and by their biases: by this, we mean the way in which the action selector differs from an unboundedly rational expected preference maximiser.
The results of the Occam's razor paper imply that preferences (and beliefs, and biases) cannot be deduced separately from knowing the agent's policy (and hence, a fortiori, from any observations of the agent's behaviour).

...

To get around the impossibility result, we need "normative assumptions": assumptions about the preferences (or beliefs, or biases) of the agent that cannot be deduced fully from observations.
Under the optimistic scenario, we don't need many of these, at least for identifying human preferences. We can label a few examples ("the anchoring bias, as illustrated in this scenario, is a bias"; "people are at least weakly rational"; "humans often don't think about new courses of action they've never seen before", etc...). Call this labelled data[2] D.
The algorithm now constructs categories preferences*, beliefs*, and biases* - these are the generalisations that it has achieved from D

Yes, even on the 'optimistic scenario' we need external information of various kinds to 'debias'. However, this external information can come from a human interacting with the AI, in the form of human approval of trajectories or actions taken or proposed by an AI agent, on the assumption that since our stated and revealed preferences diverge, there will sometimes be differences in what we approve of and what we do that are due solely to differences in bias.

This is still technically external to observing the human's behaviour, but it is essentially a second input channel for information about human preferences and biases. This only works, of course, if humans tend to approve of different things from those they actually do, in a way influenced by bias (otherwise you have the same information as you'd get from actions, which helps with improving accuracy but not with debiasing - see here), which is the case at least some of the time.

In other words, the beliefs and preferences are unchanged whether the agent acts or approves, but the 'approval selector' sometimes differs from the 'action selector'; based on what does and does not change, you can try to infer what originated from legitimate beliefs and preferences and what originated from variation between the approval and action selectors, which must be bias.

So, for example, if we conducted a principal component analysis on π, we would expect that the components would all be mixes of preferences/beliefs/biases.

So a PCA performed on the approval would produce a mix of beliefs, preferences and (different) biases. Underlying preferences are, by specification, equally represented by human actions and by human approval of actions taken (since, no matter what, they are your preferences), but many biases don't exhibit this pattern - for example, we discount more over time in our revealed preferences than in our stated preferences. What we approve of typically represents a less (or at least differently) biased response than what we actually do.
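Here's a minimal sketch of the kind of 'dual channel' inference this suggests - a toy construction of my own, not an existing method: fit a shared preference vector plus a per-channel bias vector to synthetic choice data from an 'action' channel and an 'approval' channel, with a penalty standing in for the normative assumptions needed to make the decomposition identifiable.

```python
import torch

n_features, n_options = 4, 5
torch.manual_seed(0)
features = torch.randn(n_options, n_features)  # feature vector for each available option

# Ground truth, used only to generate synthetic data: shared preferences plus
# a different bias vector per channel (e.g. heavier temporal discounting in actions).
true_pref = torch.tensor([1.0, -0.5, 0.8, 0.0])
bias_act = torch.tensor([0.0, 0.0, 0.0, 1.5])
bias_app = torch.tensor([0.0, 0.0, 0.0, -0.2])

def sample_choices(weights, n=2000):
    logits = features @ weights
    return torch.multinomial(torch.softmax(logits, 0), n, replacement=True)

actions, approvals = sample_choices(true_pref + bias_act), sample_choices(true_pref + bias_app)

# Model: shared preference vector + one bias vector per channel. The L2 penalty on the
# bias vectors is a crude stand-in for the labelled examples / normative assumptions
# that would be needed to make the preference/bias split identifiable at all.
pref = torch.zeros(n_features, requires_grad=True)
b_act = torch.zeros(n_features, requires_grad=True)
b_app = torch.zeros(n_features, requires_grad=True)
opt = torch.optim.Adam([pref, b_act, b_app], lr=0.05)

for _ in range(500):
    opt.zero_grad()
    loss = 0.1 * (b_act.pow(2).sum() + b_app.pow(2).sum())
    for data, bias in [(actions, b_act), (approvals, b_app)]:
        loss = loss - torch.log_softmax(features @ (pref + bias), 0)[data].mean()
    loss.backward()
    opt.step()

print("recovered shared preference direction:", pref.detach())
```

The interesting empirical question is whether real action/approval data diverge enough, and in the right ways, for anything like this decomposition to recover something meaningfully 'debiased'.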

There has already been research on combining information on reward models from multiple sources to infer a better overall reward model, but not, as far as I know, on actions and approval specifically as differently biased sources of information.

CIRL ought to extract our revealed preferences (since it's based on behavioural policy), while a method like reinforcement learning from human preferences should extract our stated preferences - that might be a place to start, at least for validating that there actually are relevant differences caused by differently strong biases in our stated vs revealed preferences, and that the methods actually do end up with different policies.

The goal here would be to have some kind of 'dual channel' preference learner that extracts beliefs and preferences from biased actions and approval by examining what varies. I'm sure you'd still need labelling and explicit information about what counts as a bias, but there might need to be a lot less of it than with single information sources. How much less (i.e. how much extra information you get from such divergences) seems like an empirical question. A useful first step would be finding out how common the divergences between stated and revealed preferences are that actually influence the policies learned by agents designed to infer human preferences from actions vs approval. Stuart Armstrong:

In the pessimistic scenario, human preferences, biases, and beliefs are twisted together in a far more complicated way, and cannot be separated by a few examples.
In contrast, take examples of racial bias, hindsight bias, illusion of control, or naive realism. These biases all seem to be quite different from the anchoring bias, and quite different from each other. At the very least, they seem to be of different "type signature".
So, under the pessimistic scenario, some biases are much closer to preferences than generic biases (and generic preferences) are to each other.

What I've suggested should still help at least somewhat in the pessimistic scenario - unless preferences/beliefs vary when you switch between looking at approval vs actions more than biases vary, you can still gain some information on underlying preferences and beliefs by seeing how approval and actions differ.

Of the difficult examples you gave, racial bias at least varies between actions and approval. Implementing different reward modelling algorithms and messing around with them to try and find ways to extract unbiased preferences from multiple information sources might be a useful research agenda.

There has already been research done on using multiple information sources to improve the accuracy of preference learning - Reward-rational (implicit) choice - but not specifically on using the divergences between different sources of information from the same agent to learn things about the agent's unbiased preferences.

Improving preference aggregation: iterated voting games

In part because of arguments like these, there has been less focus on the aggregation side of things than on the direct preference learning side.

Christiano says of methods like CEV, which aim to extrapolate what I 'really want' far beyond what my current preferences are: 'most practitioners don’t think of this problem even as a long-term research goal — it’s a qualitatively different project without direct relevance to the kinds of problems they want to solve'. This is effectively a statement of the Well-definedness consideration when sorting through value definitions - our long-term 'coherent' or 'true' preferences currently aren't well enough understood to guide research, so we need to restrict ourselves to more direct normativity: extracting the actual preferences of existing humans.

However, I think that it is important to get on the right track early - even if we never have cause to build a powerful singleton AI that has to aggregate all the preferences of humanity, there will still probably be smaller-scale situations where the preferences of several people need to be aggregated or traded off. Shifting a human preference learner from a single human's preferences to those of a small group could produce erroneous results due to distributional shift, potentially causing alignment failures, so even if we aren't trying for maximally ambitious value learning it might still be worth investigating preference aggregation.

There has been some research done on preference aggregation for AIs learning human values, specifically in the context of Kidney exchanges:

We performed statistical modeling of participants’ pairwise comparisons between patient profiles in order to obtain weights for each profile. We used the Bradley-Terry model, which treats each pairwise comparison as a contest between a pair of players
We have shown one way in which moral judgments can be elicited from human subjects, how those judgments can be statistically modelled, and how the results can be incorporated into the algorithm. We have also shown, through simulations, what the likely effects of deploying such a prioritization system would be, namely that under demanded pairs would be significantly impacted but little would change for others. We do not make any judgment about whether this conclusion speaks in favor of or against such prioritization, but expect the conclusion to be robust to changes in the prioritization such as those that would result from a more thorough process, as described in the previous paragraph.

The Kidney exchange paper elicited preferences from human subjects (using repeated pairwise comparisons) and then aggregated them using the Bradley-Terry model. You couldn't use such a simple statistical method to aggregate quantitative preferences over continuous action spaces, like the preferences that would be learned from a human via a complex reward model. Also, any time you try to use some specific one-shot voting mechanism you run into various impossibility theorems which seem to force you to give up some desirable property.
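For concreteness, here is a minimal Bradley-Terry fit on made-up pairwise-comparison counts - the standard Zermelo/MM iteration, as an illustration of the kind of aggregation the kidney exchange paper describes, not their code:

```python
import numpy as np

n_profiles = 4
# wins[i, j] = number of times profile i was preferred to profile j (fabricated data)
wins = np.array([[0, 8, 6, 9],
                 [2, 0, 5, 7],
                 [4, 5, 0, 6],
                 [1, 3, 4, 0]], dtype=float)

w = np.ones(n_profiles)              # Bradley-Terry "strengths", i.e. profile weights
total = wins + wins.T                # comparisons made between each pair
for _ in range(200):                 # Zermelo / minorize-maximize fixed-point iteration
    for i in range(n_profiles):
        denom = sum(total[i, j] / (w[i] + w[j]) for j in range(n_profiles) if j != i)
        w[i] = wins[i].sum() / denom
    w /= w.sum()                     # normalise so the weights sum to 1

print("estimated profile weights:", np.round(w, 3))
```

The point of the surrounding discussion is that nothing this simple works once the 'profiles' are rich reward models rather than a handful of discrete options.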

One approach that may be more robust against errors in a voting mechanism, and easily scalable to more complex preference profiles, is to use RL not just for the preference elicitation, but also for the preference aggregation. The idea is that we embrace the inevitable impossibility results (such as Arrow and GS theorems) and consider agents' ability to vote strategically as an opportunity to reach stable outcomes.

This paper uses very simple Q-learning agents with a few different policies - epsilon-greedy, greedy and upper confidence bound, in an iterated voting game, and gets behaviour that seems sensible. (Note the similarity and differences with the moral parliament, where a particular one-shot voting rule is justified a priori and then used.)

The fact that this paper exists is a good sign because it's very recent and the methods it uses are very simple - it's pretty much just a proof of concept, as the authors state - so that tells me there's a lot of room for combining more sophisticated RL with better voting methods.

Combining elicitation and aggregation

Having elicited preferences from each individual human (using methods like those above to 'debias'), we obtain a proxy agent representing each individual's preferences. Then these agents can be placed into an iterated voting situation until a convergent answer is reached.

That seems like the closest practical approximation to a CEV of a group of people that could be constructed with anything close to current methods - a pipeline from observed behaviour and elicited approval to a final aggregated decision about what to do based on overall preferences. Since it's a value learning framework that's extendable over any size of group, and is somewhat indirect, you might call it a Coherent Extrapolated Framework (CEF), as I suggested last year.

Comment by sdm on Learning human preferences: optimistic and pessimistic scenarios · 2020-08-19T18:22:23.914Z · score: 7 (2 votes) · LW · GW
To get around the impossibility result, we need "normative assumptions": assumptions about the preferences (or beliefs, or biases) of the agent that cannot be deduced fully from observations.
Under the optimistic scenario, we don't need many of these, at least for identifying human preferences. We can label a few examples ("the anchoring bias, as illustrated in this scenario, is a bias"; "people are at least weakly rational"; "humans often don't think about new courses of action they've never seen before", etc...). Call this labelled data[2] D.
The algorithm now constructs categories preferences*, beliefs*, and biases* - these are the generalisations that it has achieved from D

Yes, even on the 'optimistic scenario' we need external information of various kinds to 'debias'. However, this external information can come from a human interacting with the AI, in the form of human approval of trajectories or actions taken or proposed by an AI agent, on the assumption that since our stated and revealed preferences diverge, there will sometimes be differences in what we approve of and what we do that are due solely to differences in bias.

This is still technically external to observing the human's behaviour, but it is essentially a second input channel for information about human preferences and biases. This only works, of course, if humans tend to approve of different things from those they actually do, in a way influenced by bias (otherwise you have the same information as you'd get from actions, which helps with improving accuracy but not with debiasing - see here), which is the case at least some of the time.

In other words, the beliefs and preferences are unchanged whether the agent acts or approves, but the 'approval selector' sometimes differs from the 'action selector'; based on what does and does not change, you can try to infer what originated from legitimate beliefs and preferences and what originated from variation between the approval and action selectors, which must be bias.

So, for example, if we conducted a principal component analysis on π, we would expect that the components would all be mixes of preferences/beliefs/biases.

So a PCA performed on the approval would produce a mix of beliefs, preferences and (different) biases. Underlying preferences are, by specification, equally represented by human actions and by human approval of actions taken (since, no matter what, they are your preferences), but many biases don't exhibit this pattern - for example, we discount more over time in our revealed preferences than in our stated preferences. What we approve of typically represents a less (or at least differently) biased response than what we actually do.

There has already been research on combining information on reward models from multiple sources to infer a better overall reward model, but not, as far as I know, on actions and approval specifically as differently biased sources of information.

CIRL ought to extract our revealed preferences (since it's based on behavioural policy), while a method like reinforcement learning from human preferences should extract our stated preferences - that might be a place to start, at least for validating that there actually are relevant differences caused by differently strong biases in our stated vs revealed preferences, and that the methods actually do end up with different policies.

The goal here would be to have some kind of 'dual channel' preference learner that extracts beliefs and preferences from biased actions and approval by examining what varies. I'm sure you'd still need labelling and explicit information about what counts as a bias, but there might need to be a lot less of it than with single information sources. How much less (i.e. how much extra information you get from such divergences) seems like an empirical question. A useful first step would be finding out how common the divergences between stated and revealed preferences are that actually influence the policies learned by agents designed to infer human preferences from actions vs approval.

In the pessimistic scenario, human preferences, biases, and beliefs are twisted together in a far more complicated way, and cannot be separated by a few examples.
In contrast, take examples of racial bias, hindsight bias, illusion of control, or naive realism. These biases all seem to be quite different from the anchoring bias, and quite different from each other. At the very least, they seem to be of different "type signature".
So, under the pessimistic scenario, some biases are much closer to preferences than generic biases (and generic preferences) are to each other.

What I've suggested should still help at least somewhat in the pessimistic scenario - unless preferences/beliefs vary when you switch between looking at approval vs actions more than biases vary, you can still gain some information on underlying preferences and beliefs by seeing how approval and actions differ.

Of the difficult examples you gave, racial bias at least varies between actions and approval. Implementing different reward modelling algorithms and messing around with them to try and find ways to extract unbiased preferences from multiple information sources might be a useful research agenda.

There has already been research done on using multiple information sources to improve the accuracy of preference learning - Reward-rational (implicit) choice - but not specifically on using the divergences between different sources of information from the same agent to learn things about the agent's unbiased preferences.

Comment by sdm on Open & Welcome Thread - August 2020 · 2020-08-15T16:38:27.193Z · score: 11 (6 votes) · LW · GW

Covid19Projections has been one of the most successful coronavirus models, in large part because it is as 'model-free' and simple as possible, using ML to back out the parameters of a simple SEIR model from death data only. This has proved useful because case numbers are skewed by varying numbers of tests, so deaths are a more consistently reliable metric. You can see the code here.

However, in countries doing a lot of testing, with a reasonable number of cases but very few deaths, like most of Europe, the model is not that informative, and essentially predicts near-zero deaths out to the limit of its forecast range. This is expected - the model is optimised for the US.

Estimating SEIR parameters from deaths works well when you have a lot of deaths to count; if you don't, then you need another method. Estimating purely from cases has its own pitfalls - see this from Epidemic Forecasting, which mistook an increase in testing in the UK in mid-July for a sharp jump in cases and wrongly inferred a brief jump in R_t. As far as I understand their paper, the estimate of R_t from case data adjusts for delays from infection to onset and for other things, but not for the positivity rate or how good overall testing is.

This isn't surprising - there is no simple model that combines test positivity rate and the number of cases and estimates the actual current number of infections. But perhaps you could use a Covid19pro like method to learn such a mapping.

Very oversimplified, Covid19pro works like this:

Our COVID-19 prediction model adds the power of artificial intelligence on top of a classic infectious disease model. We developed a simulator based on the SEIR model (Wikipedia) to simulate the COVID-19 epidemic in each region. The parameters/inputs of this simulator are then learned using machine learning techniques that attempts to minimize the error between the projected outputs and the actual results. We utilize daily deaths data reported by each region to forecast future reported deaths. After some additional validation techniques (to minimize a phenomenon called overfitting), we use the learned parameters to simulate the future and make projections.

And the functions f and g estimate, respectively, the SEIR (susceptible, exposed, infectious, recovered) parameters from deaths reported up to some time t_0, and the future deaths implied by those parameters. Both functions are then optimised to minimise the error against the actual number of deaths observed at a later time t_1.

This oversimplification is deliberate:

Deaths data only: Our model only uses daily deaths data as reported by Johns Hopkins University. Unlike other models, we do not use additional data sources such as cases, testing, mobility, temperature, age distribution, air traffic, etc. While supplementary data sources may be helpful, they can also introduce additional noise and complexity which can notably skew results.

What I suggest is a slight increase in complexity: use a similar model, except feed it paired test positivity rate and case data instead of death data. The positivity rate (or tests per case) serves as a 'quality estimate' telling you how reliable the case data is - that's how tests per case is treated by Our World in Data. We all know intuitively that if the positivity rate is going down but cases are going up, the increase might not be real, but if the positivity rate is going up and cases are going up, the increase definitely is real.

What I'm suggesting is that we combine the two and do something like this:

Now, you need to have reliable data on the number of people tested each week, but most of Europe has that. If you can learn a model that gives you a more accurate estimate of the SEIR parameters from combined cases and tests/case data, then it should be better at predicting future infections. It won't necessarily predict future cases, since the number of future cases is also going to depend on the number of tests conducted, which is subject to all sorts of random fluctuations that we don't care about when modelling disease transmission, so instead you could use the same loss function as the original covid19pro - minimizing the difference between projected and actual deaths.

Hopefully the intuition that you can learn more from the pair (tests/case, number of cases) than from the number of cases or the number of deaths alone will be borne out, and a c19pro-like model could be trained to make high quality predictions in places with few deaths using such paired data. You would still need some deaths for the loss function and for fitting the model.
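As a very rough illustration of the shape of this idea (the toy SEIR simulator, the ascertainment rule mapping tests-per-case to the fraction of infections detected, and all the constants below are assumptions for the sketch, not the covid19-projections code):

```python
import numpy as np
from scipy.integrate import odeint
from scipy.optimize import minimize

# Toy SEIR simulator; beta and the initial number exposed (e0) are the
# parameters to be learned.
def seir(y, t, beta, sigma=1/5.0, gamma=1/7.0, N=67e6):
    S, E, I, R = y
    dS = -beta * S * I / N
    dE = beta * S * I / N - sigma * E
    dI = sigma * E - gamma * I
    dR = gamma * I
    return dS, dE, dI, dR

def daily_infections(beta, e0, t, N=67e6):
    sol = odeint(seir, (N - e0, e0, 0.0, 0.0), t, args=(beta,))
    ever_infected = sol[:, 1] + sol[:, 2] + sol[:, 3]      # E + I + R
    return np.diff(ever_infected, prepend=ever_infected[0])

def loss(params, t, cases, tests_per_case, deaths_obs, ifr=0.007, delay=18):
    beta, e0 = params
    infections = daily_infections(beta, e0, t)
    # Assumed ascertainment rule: more tests per case -> a larger fraction of
    # infections shows up as confirmed cases, so observed cases are scaled up.
    ascertainment = np.clip(tests_per_case / (tests_per_case + 30.0), 0.05, 0.9)
    implied_infections = cases / ascertainment
    # Deaths still anchor the fit, as in the original model...
    deaths_pred = ifr * np.roll(infections, delay)
    deaths_pred[:delay] = 0.0
    death_error = np.mean((deaths_pred - deaths_obs) ** 2)
    # ...while the quality-adjusted case curve adds signal where deaths are sparse.
    case_error = np.mean((infections - implied_infections) ** 2)
    return death_error + 1e-4 * case_error

# fit = minimize(loss, x0=[0.25, 1000.0],
#                args=(t, cases, tests_per_case, deaths_obs), method="Nelder-Mead")
```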

Comment by sdm on Developmental Stages of GPTs · 2020-08-15T15:43:05.761Z · score: 1 (1 votes) · LW · GW

Superintelligence and other classic presentations of AI risk definitely offer additional arguments/considerations. The likelihood of extremely discontinuous/localized progress is, of course, the most prominent one.

Perhaps what is going on here is that the arguments as stated in brief summaries like 'orthogonality thesis + instrumental convergence' just aren't what the arguments actually were, and that there were from the start all sorts of empirical or more specific claims made around these general arguments.

This reminds me of Lakatos' theory of research programs - where the core assumptions, usually logical or a priori in nature, are used to 'spin off' secondary hypotheses that are more empirical or easily falsifiable.

Lakatos' model fits AI safety rather well - the orthogonality thesis and instrumental convergence are the non-empirical 'hard core' assumptions foundational to the research program. Around ~2010 the secondary assumptions were things like discontinuous progress and AI maximising a simple utility function, while by ~2020 we have some different secondary assumptions: mesa-optimisers, 'you get what you measure', and direct evidence of current misalignment.

Comment by sdm on Do we have updated data about the risk of ~ permanent chronic fatigue from COVID-19? · 2020-08-15T15:24:04.089Z · score: 9 (6 votes) · LW · GW

Fatigue that lasts 2-3 weeks after the worst symptoms are over is common with essentially all bad viral infections - post-flu fatigue is common, for example (I can't find any good statistics on how common). So I don't know whether 1/3 of people reporting fatigue 2 to 3 weeks afterwards tells us anything useful about how common post-covid fatigue lasting months is.

Comment by sdm on The Case for Education · 2020-08-15T14:30:45.657Z · score: 8 (3 votes) · LW · GW

So let me now make the case for education.

Education is key to civilizational sanity, sensemaking, and survival. Education is key to The Secret of our Success.

Education is the scaffolding on which our society, culture and civilization are built and maintained.

I think that, rather like the rationalist criticism of healthcare, a lot of this is US-centric and, while it still applies to Europe and the UK, it does so less strongly. There is still credentialism, signalling, and an element of zero-sum competition, but many of the most egregious examples of the university system promoting costly signalling ahead of actual training and growth of knowledge (the lack of subject focus in college degrees, medical school being separate from the undergraduate degree, colossal cost disease, 'holistic' admissions) either don't exist in Europe or aren't as egregious.

I also think that you underestimate the number of EA and rationalist types who are working within the university system - most technical AI safety research not being done by OpenAI/DeepMind is in some way affiliated with a university, for example.

Comment by sdm on Covid 8/13: Same As It Ever Was · 2020-08-14T10:44:12.275Z · score: 22 (9 votes) · LW · GW
What makes him unique is that Bill Gates is actually trying.
As far as I can tell, no one else with billions of dollars is actually trying to help as best they can. Those same effective altruists are full of detailed thoughts, but aside from shamefully deplatforming Robin Hanson it’s been a long time since I’ve heard about them making a serious attempt to do anything.

I agree with you about the Hanson thing, but the EA movement did its best to shift as much funding as was practical towards coronavirus related causes. This page covers Givewell's efforts, this covers the career and contribution advice of 80k hours. I know more than a few EA types who dropped whatever they were doing in March to try and focus on coronavirus modelling - for example, FHI's Epidemic Forecasting project.

Bill Gates didn’t. He’s out there doing the best he knows how to do.
Thus, we should quote Theodore Roosevelt, and first and foremost applaud him and learn from him.
Also, read the whole thing. Mostly the information speaks for itself.

I found the entire podcast quite astounding, especially the part where Gates explained how he had to sit down and patiently listen to Trump saying vaccines didn't work. When I consider how much of America apparently hates him despite all this, I can't help but be reminded of a certain quote.

I still don't understand it. They should have known that their lives depended on that man's success. And yet it was as if they tried to do everything they could to make his life unpleasant. To throw every possible obstacle into his way...

As to your discussion about Hospitalization rates - it's interesting to note how our picture of the overall hospitalization rate has evolved over time, from estimating near 20% to as low as 2%. I wrote a long comment with an estimation of what it might be for the UK, with this conclusion -

We know from the ONS that the total number of patients ever admitted to hospital with coronavirus on July 22nd was 131,412. That number is probably pretty close to accurate - even during the worst of the epidemic the UK was testing more or less every hospital patient with coronavirus symptoms. The estimated number of people ever infected on July 22nd by c19pro was 5751036
So, 131412/5751036 = 2.3% hospitalization rate
Comment by sdm on Alignment By Default · 2020-08-13T16:53:35.798Z · score: 11 (5 votes) · LW · GW

‘You get what you measure’ (outer alignment failure) and mesa-optimisers (inner alignment failure) are both potential gap-fillers that explain why the alignment/capability divergence initially arises. Whichever it is, I think the overall point is still that there is this gap in the classic arguments that allows for a (possibly quite high) chance of ‘alignment by default’, for the reasons you give, but there are at least two plausible mechanisms that fill this gap. And then I suppose my broader point would be that we should present:

Classic Arguments —> objections to them (capability and alignment often go together, could get alignment by default) —> specific causal mechanisms for misalignment

Comment by sdm on Alignment By Default · 2020-08-13T12:45:22.540Z · score: 20 (7 votes) · LW · GW

I think what you've identified here is a weakness in the high-level, classic arguments for AI risk -

Overall, I’d give maybe a 10-20% chance of alignment by this path, assuming that the unsupervised system does end up with a simple embedding of human values. The main failure mode I’d expect, assuming we get the chance to iterate, is deception - not necessarily “intentional” deception, just the system being optimized to look like it’s working the way we want rather than actually working the way we want. It’s the proxy problem again, but this time at the level of humans-trying-things-and-seeing-if-they-work, rather than explicit training objectives.

This failure mode of deceptive alignment seems like it would result most easily from mesa-optimisation, or an inner alignment failure more generally. Inner misalignment is possibly the key specific mechanism that fills a weakness in the 'classic arguments' for AI safety - the Orthogonality Thesis, Instrumental Convergence and Fast Progress, which together imply that small separations between AI alignment and AI capability can lead to catastrophic outcomes. To have a solid, specific reason to expect dangerous misalignment, the question of why there would be such a damaging, hard-to-detect divergence between capability and alignment needs an answer, and inner misalignment is just such an answer.

I think that it should be presented in initial introductions to AI risk alongside those classic arguments, as the specific, technical reason why the specific techniques we use are likely to produce such goal/capability divergence - rather than the general a priori reasons given by the classic arguments.

Comment by sdm on Buck's Shortform · 2020-08-06T14:26:02.228Z · score: 4 (1 votes) · LW · GW

I wrote a whole post on modelling specific continuous or discontinuous scenarios - in the course of trying to make a very simple differential equation model of continuous takeoff, by modifying the models given by Bostrom/Yudkowsky for fast takeoff, the result that fast takeoff means later timelines naturally jumps out.

Varying d between 0 (no RSI) and infinity (a discontinuity) while holding everything else constant looks like this:

[Figure: Continuous Progress]

If we compare the trajectories, we see two effects - the more continuous the progress is (lower d), the earlier we see growth accelerating above the exponential trend-line (except for slow progress, where growth is always just exponential), and the smoother the transition to the new growth mode is. For d=0.5, AGI was reached at t=1.5, but for discontinuous progress this was not until after t=2. As Paul Christiano says, slow takeoff seems to mean that AI has a larger impact on the world, sooner.

But that model relies on pre-setting a fixed 'threshold for AGI', given by the parameter I_AGI, in advance. This, along with the starting intelligence of the system, fixes how far away AGI is.

For values of d between 0 and infinity we have varying steepnesses of continuous progress. I_AGI is the intelligence level we identify with AGI. In the discontinuous case, it is where the jump occurs; in the continuous case, it is the centre of the logistic curve. Here, I_AGI = 4.

You could (I might get round to doing this) model the effect you're talking about by allowing I_AGI to vary with the level of discontinuity. Every model would start with the same initial intelligence I_0, but I_AGI would be correlated with the level of discontinuity, with a larger discontinuity implying a smaller I_AGI. That way, you would reproduce the epistemic difference made by expecting a stronger discontinuity - that on discontinuous takeoff scenarios, the current intelligence of AI systems is implied to be closer to what we'd expect to need for explosive growth than on continuous scenarios.

We know the current level of capability and the current rate of progress, but we don't know I_AGI, and holding all else constant slow takeoff implies I_AGI is a significantly higher number (again, I_AGI is relative to the starting intelligence of the system)

This is because my model was trying to model different physical situations, different ways AGI could be, not different epistemic situations, so I was thinking in terms of I_AGI being some fixed, objective value that we just don't happen to know.

I'm uncertain whether there's a rigorous way of quantifying how much this epistemic update does to counteract the physical fact that continuous takeoff implies an earlier acceleration above the exponential trend. If you're right, it completely cancels this effect out overall and makes timelines on discontinuous takeoff earlier - I think you're right about this. It would be easy enough to write something to cancel it out evenly, making takeoff in the different scenarios appear at the same time, but that's not what you have in mind.
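A toy numerical sketch of this kind of coupling (the growth equation, the constants, and the rule linking d to I_AGI below are my own illustrative assumptions, not the equations from the original post):

```python
import numpy as np

def time_to_explosion(d, couple=0.0, I0=1.0, I_AGI_base=4.0,
                      k=0.05, c=1.0, dt=0.01, t_max=500.0):
    # Intelligence grows at base rate k, plus a recursive self-improvement
    # term whose strength ramps up logistically around I_AGI with steepness d
    # (d -> infinity approximates a discontinuous jump at I_AGI).
    # couple > 0 implements the proposed epistemic update: a larger
    # discontinuity d implies a smaller effective I_AGI.
    I_AGI = I_AGI_base / (1.0 + couple * d)
    I, t = I0, 0.0
    while t < t_max:
        z = np.clip(-d * (I - I_AGI), -50.0, 50.0)
        rsi = c / (1.0 + np.exp(z))           # logistic RSI feedback
        I += (k + rsi) * I * dt
        t += dt
        if I > 100 * I_AGI_base:              # arbitrary 'explosive growth' cutoff
            return round(t, 1)
    return None

for d in (0.5, 2.0, 10.0, 100.0):
    print(d, time_to_explosion(d), time_to_explosion(d, couple=0.1))
```

With couple = 0, more discontinuous runs (higher d) cross the cutoff later; letting I_AGI shrink with d pulls the discontinuous timelines back in, which is the effect being discussed.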

Comment by sdm on Coronavirus as a test-run for X-risks · 2020-08-05T13:47:09.741Z · score: 4 (1 votes) · LW · GW

So, two months have gone by. My main conclusions look mostly unchanged, except that I wasn't expecting such a monotonically stable control system effect in the US. Vaccine news looks better than I expected, superforecasters are optimistic. The major issue in countries with moderate to good state capacity is preventing a winter second wave and managing small infection spikes. Rob Wiblin seems to buy in to the MNM effect.

Whatever happened to the Hospitalization Rate?

Many of these facts (in particular the reason that 100 million plus dead is effectively ruled out) have multiple explanations. For one, the earliest data on coronavirus implied the hospitalization rate was 10-20% for all age groups, and we now know it is substantially lower (see that tweet by an author of the Imperial College paper, which estimated a hospitalization rate of 4.4%). This means that even if hospitals were entirely unable to cope with the number of patients, the IFR would be in the range of 2%, not the 20% initially implied.

Back in a previous Age of The Earth, also known as early March 2020, the most important thing in the world was to figure out the coronavirus hospitalization rate, and we overestimated it. See e.g.

Suppose 50% of the UK (33 million people) get the virus of which 5% (~ 1.8 million people) will need serious hospitalization [conservative estimate].

It's mostly of academic interest now, since (at least in Europe) genuine exponential spread is looking more and more like the exception rather than the rule, but considering how much time we spent discussing this issue I'd like to know the final answer for completeness’ sake. It looks like even 'conservative' estimates of the hospitalization rates were too high by a factor of at least 2, just as claimed by the author of that imperial paper.

Here's a crude estimate: the latest UK serology survey says 6.2% of people were infected by July 26th. Another says 7.1% were infected by July 30. The level of infection is so low in the UK right now that you'll only get movement by a few tenths of a percentage point over the couple of weeks between then and now.

The false negative rate is unclear, but I've heard claims as high as a third, so the real number may be as high as 9.3% based on the overall infection survey. Covid19pro estimated that 8.6% (5.1-13.3%) had been infected by July 26th. That 8.6% number corresponds to a plausible false negative rate on the antibody tests (28% if you believe the first study, ~17% if you believe the second survey).

In other words, the median estimates from covid19pro look reasonably consistent with the antibody tests, implying a false negative rate of about 15-30%, so I'm just going to assume they're roughly accurate.

We know from the ONS that the total number of patients ever admitted to hospital with coronavirus on July 22nd was 131,412. That number is probably pretty close to accurate - even during the worst of the epidemic the UK was testing more or less every hospital patient with coronavirus symptoms. The estimated number of people ever infected on July 22nd by c19pro was 5751036

So, 131412/5751036 = 2.3% hospitalization rate
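For reference, here is the arithmetic behind those numbers, using only the figures quoted above:

```python
# Quick consistency check of the figures quoted above.
serology = {"ONS survey to Jul 26": 0.062, "second survey to Jul 30": 0.071}
c19pro_ever_infected_share = 0.086          # covid19projections estimate, Jul 26

for name, measured in serology.items():
    implied_fnr = 1 - measured / c19pro_ever_infected_share
    print(f"{name}: implied antibody false-negative rate ~ {implied_fnr:.0%}")

ever_hospitalised = 131_412                 # ONS, Jul 22
ever_infected = 5_751_036                   # covid19projections estimate, Jul 22
print(f"hospitalisation rate ~ {ever_hospitalised / ever_infected:.1%}")
```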

Comment by sdm on A sketch of 'Simulacra Levels and their Interactions' · 2020-08-05T10:09:47.712Z · score: 18 (6 votes) · LW · GW

Harry Frankfurt's On Bullshit seems relevant here. I think it's worth trying to incorporate Frankfurt's definition as well, as it is quite widely known (see e.g. this video). If you were to do so, I think you would say that on Frankfurt's definition, Level 1 tells the truth, Level 2 lies, Level 3 bullshits about physical facts but will lie or tell the truth about things in the social realm (e.g. others' motives, your own affiliation), and Level 4 always bullshits.

Taken this way, Frankfurt's model is a higher-level model that distinguishes the ones who care about reality from the ones that don't - roughly speaking, bullshit characterises levels 3 and 4 as the ones unconcerned with reality.

If you did it on the diagram, the union of 3 and 4 would be bullshitters, but shading more strongly towards the 4 end.

Comment by sdm on Unifying the Simulacra Definitions · 2020-08-05T10:07:03.137Z · score: 17 (6 votes) · LW · GW

If your aim is to unify different ways of understanding dishonesty, social manipulation and 'simulacra', then Harry Frankfurt's On Bullshit needs to be considered.

What bullshit essentially misrepresents is neither the state of affairs to which it refers nor the beliefs of the speaker concerning that state of affairs. Those are what lies misrepresent, by virtue of being false. Since bullshit need not be false, it differs from lies in its misrepresentational intent. The bullshitter may not deceive us, or even intend to do so, either about the facts or about what he takes the facts to be. What he does necessarily attempt to deceive us about is his enterprise. His only indispensably distinctive characteristic is that in a certain way he misrepresents what he is up to.

I think it's worth trying to incorporate Frankfurt's definition as well, as it is quite widely known (see e.g. this video). If you were to do so, I think you would say that on Frankfurt's definition, Level 1 tells the truth, Level 2 lies, Level 3 bullshits about physical facts but will lie or tell the truth about things in the social realm (e.g. others' motives, your own affiliation), and Level 4 always bullshits.

It does seem that bullshitting involves a kind of bluff. It is closer to bluffing, surely than to telling a lie. But what is implied concerning its nature by the fact that it is more like the former than it is like the latter? Just what is the relevant difference here between a bluff and a lie? Lying and bluffing are both modes of misrepresentation or deception. Now the concept most central to the distinctive nature of a lie is that of falsity: the liar is essentially someone who deliberately promulgates a falsehood. Bluffing too is typically devoted to conveying something false. Unlike plain lying, however, it is more especially a matter not of falsity but of fakery. This is what accounts for its nearness to bullshit. For the essence of bullshit is not that it is false but that it is phony. In order to appreciate this distinction, one must recognize that a fake or a phony need not be in any respect (apart from authenticity itself) inferior to the real thing. What is not genuine need not also be defective in some other way. It may be, after all, an exact copy. What is wrong with a counterfeit is not what it is like, but how it was made. This points to a similar and fundamental aspect of the essential nature of bullshit: although it is produced without concern with the truth, it need not be false.

Taken this way, Frankfurt's model is a higher-level model that distinguishes the ones who care about reality from the ones that don't - roughly speaking, bullshit characterises levels 3 and 4 as the ones unconcerned with reality.

Comment by sdm on SDM's Shortform · 2020-08-04T18:48:02.393Z · score: 1 (1 votes) · LW · GW
So, the mountain disanalogy: sometimes there are things we have opinions about, and yet there is no clean separation between us and the thing. We don't perceive it in a way that we can agree is trusted or privileged. We receive vague, sparse data about it, and the subject is plagued by disagreement, self-doubt, and claims that other people are doing it all wrong.
This isn't to say that we should give up entirely, but it means that we might have to shift our expectations of what sort of explanation or justification we are "entitled" to.

So this depends on two things - first, how likely (in advance of assessing the 'evidence') something like normative realism is, and then how good that evidence is (how coherent it is). If we have really good reasons in advance to think there's 'no separation between us and the thing', then no matter how coherent the 'thing' is, we have to conclude that while we might all be able to agree on what it is, it isn't mind-independent.

So, is it coherent, and is it mind-independent? How coherent it needs to be for us to be confident we can know it depends on how confident we are that it's mind-independent, and vice versa.

The argument for coherence comes in the form of convergence (not among people, to be clear, but among normative frameworks), but as you say, that doesn't establish that it's mind-independent (though it might give you a strong hint if it's really strongly consistent and coherent), while the argument that normativity is mind-independent comes from the normativity argument. These three posts deal with the difference between those two arguments, how strong they are, and how they interact:

Normative Anti-realism is self-defeating

Normativity and recursive justification

Prescriptive Anti-realism

Comment by sdm on Open & Welcome Thread - July 2020 · 2020-08-04T18:30:36.198Z · score: 6 (2 votes) · LW · GW

The comment has since been expanded into the (unofficial) Moral Realism sequence. I cover a bunch of issues, including the (often unrecognised) distinction between prescriptive and non-prescriptive anti-realism - which is relevant to some important factual questions, as it overlaps with the 'realism about rationality' issue driving some debates in AI safety - whether we need normative facts, and what difference convergence of moral views may or may not make.

Normative Realism by Degrees

Normative Anti-realism is self-defeating

Normativity and recursive justification

Prescriptive Anti-realism

The goal here was to explain what moral realists like about moral realism - for those who are perplexed about why it would be worth wanting or how anyone could find it plausible, and explain what things depend on it being right or wrong, and how you may or may not retain some of the features of realism (like universalizability) if different anti-realist views are true.

Comment by sdm on SDM's Shortform · 2020-08-04T15:50:48.485Z · score: 3 (2 votes) · LW · GW

Prescriptive Anti-realism

An extremely unscientific and incomplete list of people who fall into the various categories I gave in the previous post:

1. Accept Convergence and Reject Normativity: Eliezer Yudkowsky, Sam Harris (Interpretation 1), Peter Singer in The Expanding Circle, RM Hare and similar philosophers, HJPEV

2. Accept Convergence and Accept Normativity: Derek Parfit, Sam Harris (Interpretation 2), Peter Singer today, the majority of moral philosophers, Dumbledore

3. Reject Convergence and Reject Normativity: Robin Hanson, Richard Ngo (?), Lucas Gloor (?) most Error Theorists, Quirrell

4. Reject Convergence and Accept Normativity: A few moral philosophers, maybe Ayn Rand and objectivists?

The difference in practical, normative terms between 2), 4) and 3) is clear enough - 2 is a moral realist in the classic sense, 4 is a sceptic about morality but agrees that irreducible normativity exists, and 3 is a classic 'antirealist' who sees morality as of a piece with our other wants. What is less clear is the difference between 1) and 3). In my caricature above, I said Quirrell and Harry Potter from HPMOR were non-prescriptive and prescriptive anti-realists, respectively, while Dumbledore is a realist. Here is a dialogue between them that illustrates the difference.

Harry floundered for words and then decided to simply go with the obvious. "First of all, just because I want to hurt someone doesn't mean it's right -"
"What makes something right, if not your wanting it?"
"Ah," Harry said, "preference utilitarianism."
"Pardon me?" said Professor Quirrell.
"It's the ethical theory that the good is what satisfies the preferences of the most people -"
"No," Professor Quirrell said. His fingers rubbed the bridge of his nose. "I don't think that's quite what I was trying to say. Mr. Potter, in the end people all do what they want to do. Sometimes people give names like 'right' to things they want to do, but how could we possibly act on anything but our own desires?"

The relevant issue here is that Harry draws a distinction between moral and non-moral reasons even though he doesn't believe in irreducible normativity. In particular, he's committed to a normative ethical theory, preference utilitarianism, as a fundamental part of how he values things.

Here is another illustration of the difference. Lucas Gloor (3) explains the case for suffering-focussed ethics, based on the claim that our moral intuitions assign diminishing returns to happiness but not to suffering.

While there are some people who argue for accepting the repugnant conclusion (Tännsjö, 2004), most people would probably prefer the smaller but happier civilization – at least under some circumstances. One explanation for this preference might lie in intuition one discussed above, “Making people happy rather than making happy people.” However, this is unlikely to be what is going on for everyone who prefers the smaller civilization: If there was a way to double the size of the smaller population while keeping the quality of life perfect, many people would likely consider this option both positive and important. This suggests that some people do care (intrinsically) about adding more lives and/or happiness to the world. But considering that they would not go for the larger civilization in the Repugnant Conclusion thought experiment above, it also seems that they implicitly place diminishing returns on additional happiness, i.e. that the bigger you go, the more making an overall happy population larger is no longer (that) important.
By contrast, people are much less likely to place diminishing returns on reducing suffering – at least insofar as the disvalue of extreme suffering, or the suffering in lives that on the whole do not seem worth living, is concerned. Most people would say that no matter the size of a (finite) population of suffering beings, adding more suffering beings would always remain equally bad.
It should be noted that incorporating diminishing returns to things of positive value into a normative theory is difficult to do in ways that do not seem unsatisfyingly arbitrary. However, perhaps the need to fit all one’s moral intuitions into an overarching theory based solely on intuitively appealing axioms simply cannot be fulfilled.

And what are those difficulties mentioned? The most obvious is the absurd conclusion - that scaling up a population can turn it from axiologically good to bad:

Hence, given the reasonable assumption that the negative value of adding extra lives with negative welfare does not decrease relatively to population size, a proportional expansion in the population size can turn a good population into a bad one—a version of the so-called “Absurd Conclusion” (Parfit 1984). A population of one million people enjoying very high positive welfare and one person with negative welfare seems intuitively to be a good population. However, since there is a limit to the positive value of positive welfare but no limit to the negative value of negative welfare, proportional expansions (two million lives with positive welfare and two lives with negative welfare, three million lives with positive welfare and three lives with negative welfare, and so forth) will in the end yield a bad population.

Here, then, is the difference - If you believe, as a matter of fact, that our values cohere and place fundamental importance on coherence, whether because you think that is the way to get at the moral truth (2) or because you judge that human values do cohere to a large degree for whatever other reason and place fundamental value on coherence (1), you will not be satisfied with leaving your moral theory inconsistent. If, on the other hand, you see morality as continuous with your other life plans and goals (3), then there is no pressure to be consistent. So to (3), focussing on suffering-reduction and denying the absurd conclusion is fine, but this would not satisfy (1).

I think that, on closer inspection, (3) is unstable - unless you are Quirrell and explicitly deny any role for ethics in decision-making, we want to make some universal moral claims. The case for suffering-focussed ethics argues that the only coherent way to make sense of many of our moral intuitions is to conclude that there is a fundamental asymmetry between suffering and happiness, but then explicitly throws up a stop sign when we take that argument slightly further - to the absurd conclusion - because 'the need to fit all one’s moral intuitions into an overarching theory based solely on intuitively appealing axioms simply cannot be fulfilled'. Why begin the project in the first place, unless you place strong terminal value on coherence (1)/(2)? And in that case you cannot arbitrarily halt it.

Comment by sdm on Covid 7/30: Whack a Mole · 2020-08-04T09:26:07.716Z · score: 2 (2 votes) · LW · GW

It’s clearly the case that the public line about 70% herd immunity is still out there, but I think my broader point is served by that report. They have the obligatory ‘herd immunity is reached at 70% and there may be no immunity conferred’ caveat but then the actual model implies that in a worst case scenario 30% of the UK gets infected. You might speculate that they consulted the modellers for the model but not for the rest of it.

Comment by sdm on Inner Alignment: Explain like I'm 12 Edition · 2020-08-03T12:10:04.945Z · score: 19 (6 votes) · LW · GW

Inner misalignment is possibly the key specific mechanism that fills a weakness in the 'classic arguments' for AI safety - the Orthogonality Thesis, Instrumental Convergence and Fast Progress, which together imply that small separations between AI alignment and AI capability can lead to catastrophic outcomes. To have a solid, specific reason to expect dangerous misalignment, the question of why there would be such a damaging, hard-to-detect divergence between capability and alignment needs an answer, and inner misalignment is just such an answer.

I think that it should be presented in initial introductions to AI risk alongside those classic arguments, as the specific, technical reason why the specific techniques we use are likely to produce such goal/capability divergence - rather than the general a priori reasons given by the classic arguments.

Comment by sdm on Sufficiently Advanced Language Models Can Do Reinforcement Learning · 2020-08-03T11:46:48.396Z · score: 4 (2 votes) · LW · GW

Appending a reward modelling system to GPT-2 directly has already been done - humans were asked to select from among GPT-2 outputs according to some criteria, a reward model was trained on the human selections, and that reward model was then used to fine-tune GPT-2. Based on what you've just said, this method is just a much faster, more efficient way of getting a GPT to adapt to perform a recurrent task (since it uses a reward model trained on a few examples of human evaluation, instead of waiting for GPT to adapt by itself to many human selections as you suggest).

We have demonstrated RL fine-tuning of language models to four NLP tasks: stylistic continuation with high sentiment or physically descriptive language, and summarization on the CNN/Daily Mail and TL;DR datasets. Rather than building task-specific techniques, we achieve our results by straightforwardly applying reward learning to language generation.
We extend previous reward learning work with pretrained models and KL regularization to prevent the policy from diverging too far from natural language. Our results are mixed. On the continuation tasks we achieve good results vs. the zero-shot baseline as evaluated by humans with very few samples: 2.5k for sentiment and 5k for descriptiveness. However, for both summarization tasks our policies are only “smart copiers” (extractive rather than abstractive): they copy from the input text but skip over irrelevant preamble.

No-one has applied this reward modelling technique to GPT-3 yet, but it should be straightforward, since the exact method used for GPT-2 should work. The method notably didn't work as well when used to improve GPT-2's output on more complicated tasks (good on sentiment biasing, mediocre on summarization), but that's because GPT-2 wasn't coherent enough over long enough ranges to properly exploit the rewards from a reward model representing some complex task or concept. With GPT-3, you might be able to use the reward modelling method to get it to focus on more complicated concepts, or get it to be more 'factually accurate and on-topic'. If you had humans evaluate 'accurate and on-topic' and built up such a reward model, that might be a way to 'bring out' the knowledge GPT-3 has but sometimes doesn't use. I think it would be just like this, but with the reward model helping you get more mileage out of each q/a pair in your buffer by generalising over it a bit:

Allow GPT to answer the next query.
Allow GPT to predict the evaluation.
If the evaluation returns as TRUE, append the q/a pair to a buffer
If buffer is large enough append to context and repeat

Perhaps you'd run into trouble needing a complicated or sophisticated reward model to get much extra mileage out of each new query, but given that it already worked with GPT-2 on simple tasks it might do well with GPT-3 on complex tasks. Essentially, everything you said - except we already have solid evidence that big parts of it can be automated and therefore likely achieved quicker than would otherwise be expected.
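For concreteness, here is a minimal sketch of that buffered loop. The generate() function is a stand-in for a call to the language model, and the TRUE/FALSE self-evaluation prompt is an illustrative assumption, not anyone's actual setup:

```python
def buffered_qa_loop(queries, generate, max_context_chars=6000):
    """Toy version of the loop described above: keep only the q/a pairs the
    model itself evaluates as TRUE, and feed them back in as context."""
    buffer = []  # list of (question, answer) pairs judged TRUE
    for query in queries:
        context = "\n".join(f"Q: {q}\nA: {a}" for q, a in buffer)
        answer = generate(f"{context}\nQ: {query}\nA:")
        verdict = generate(
            f"{context}\nQ: {query}\nA: {answer}\n"
            "Is the above answer correct? Reply TRUE or FALSE:"
        )
        if "TRUE" in verdict.upper():
            buffer.append((query, answer))
        # keep the accumulated context within a fixed budget
        while sum(len(q) + len(a) for q, a in buffer) > max_context_chars:
            buffer.pop(0)
    return buffer
```

A learned reward model would replace the self-evaluation step, letting each accepted pair generalise a little beyond its exact wording.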

Comment by sdm on SDM's Shortform · 2020-08-02T14:37:15.991Z · score: 3 (1 votes) · LW · GW

I appear to be accidentally writing a sequence on moral realism, or at least explaining what moral realists like about moral realism - for those who are perplexed about why it would be worth wanting or how anyone could find it plausible.

Many philosophers outside this community have an instinct that normative anti-realism (about any irreducible facts about what you should do) is self-defeating, because it includes a denial that there are any final, buck-stopping answers to why we should believe something based on evidence, and therefore no truly, ultimately impartial way to even express the claim that you ought to believe something. I think that this is a good, but not perfect, argument. My experience has been that traditional analytic philosophers find this sort of reasoning appealing, in part because of the legacy of how Kant tried to deduce the logically necessary preconditions for having any kind of judgement or experience. I don't find it particularly appealing, but I think that there's a case for it here, if there ever was.

Irreducible Normativity and Recursive Justification

On normative antirealism, what 'you shouldn't believe that 2+2=5' really means is just that someone else's mind has different basic operations to yours. It is obvious that we can't stop using normative concepts, and couldn't use the concept 'should' to mean 'in accordance with the basic operations of my mind', but this isn't an easy case of reduction like Water=H2O. There is a deep sense in which normative terms really can't mean what we think they mean if normative antirealism is true. This must be accounted for by either a deep and comprehensive question-dissolving, or by irreducible normative facts.

This 'normative indispensability' is not an argument, but it can be made into one:

1) On normative anti-realism there are no facts about which beliefs are justified. So there are no facts about whether normative anti-realism is justified. Therefore, normative anti-realism is self-defeating.
Except that doesn't work! Because on normative anti-realism, the whole idea of external facts about which beliefs are justified is mistaken, and instead we all just have fundamental principles (whether moral or epistemic) that we use but don't question, which means that holding a belief without (the realist's notion of) justification is consistent with anti-realism. So the wager argument for normative realism actually goes like this -
2) We have two competing ways of understanding how beliefs are justified. One is where we have anti-realist 'justification' for our beliefs, in purely descriptive terms of what we will probably end up believing given basic facts about how our minds work in some idealised situation. The other is where there are mind-independent facts about which of our beliefs are justified. The latter is more plausible because of 1).

If you've read the sequences, you are not going to like this argument, at all - it sounds like the 'zombie' argument, and it sounds like someone asking for an exception to reductionism - which is just what it is. This is the alternative:

Where moral judgment is concerned, it's logic all the way down.  ALL the way down.  Any frame of reference where you're worried that it's really no better to do what's right then to maximize paperclips... well, that really part has a truth-condition (or what does the "really" mean?) and as soon as you write out the truth-condition you're going to end up with yet another ordering over actions or algorithms or meta-algorithms or something.  And since grinding up the universe won't and shouldn't yield any miniature '>' tokens, it must be a logical ordering.  And so whatever logical ordering it is you're worried about, it probably does produce 'life > paperclips' - but Clippy isn't computing that logical fact any more than your pocket calculator is computing it.
Logical facts have no power to directly affect the universe except when some part of the universe is computing them, and morality is (and should be) logic, not physics.

If it's truly 'logic all the way down' and there are no '>' tokens over particular functional arrangements of matter, including the ones you used to form your beliefs, then you have to give up on knowing reality as it is. This isn't the classic sense in which we all have an 'imperfect model' of reality as it is. If you give up on irreducible epistemic facts, you give up knowing anything, probabilistically or otherwise, about reality-as-it-is, because there are no fundamentally, objectively, mind-independent ways you should or shouldn't form beliefs about external reality. So you can't say you're better than the pebble with '2+2=5' written on it, except descriptively, in that the causal process that produced the pebble contradicts the one that produced 2+2=4 in your brain.

What's the alternative? If we don't deny this consequence of normative antirealism, we have two options. One is the route of dissolving the question, by analogy with how reductionism has worked in the past; the other is to say that there are irreducible normative facts. To dissolve the question correctly, it needs to be done in a way that shows a denial of epistemic facts isn't damaging and doesn't lead to epistemological relativism or scepticism. We can't simply declare that normative facts can't possibly exist - otherwise you're vulnerable to argument 2). David Chalmers talks about question-dissolving for qualia:

You’ve also got to explain why we have these experiences. I guess Dennett’s line is to reject the idea there are these first-person data and say all you do, if you’ve can explain why you believe why you say there are those things. Why do you believe there are those things? Then that’s good enough. I find that line which Dennett has pursued inconsistently over the years, but insofar as that’s his line, I find that a fascinating and powerful line. I do find it ultimately unbelievable because I just don’t think it explains the data, but it does if developed properly, have the view that it could actually explain why people find it unbelievable, and that would be a virtue in its favor.

David Chalmers of all people says that, even if he can't conceive of how a deep reduction of Qualia might make their non-existence non-paradoxical, he might change his mind if he ever actually saw such a reduction! I say the same about epistemic and therefore normative facts. But crucially, no-one has solved this 'meta problem' for Qualia or for normative facts. There are partial hints of explanations for both, but there's no full debunking argument that makes epistemic antirealism seem completely non-damaging and thus removes 2). I can't imagine what such an account could look like, but the point of the 'dissolving the question' strategy is that it often isn't imaginable in advance because your concepts are confused, so I'll just leave that point. In the moral domain, the convergence arguments point against question-dissolving because they suggest the concept of normativity is solid and reliable. If those arguments fall, then question-dissolving looks more likely.

That's one route. What of the other?

The alternative is to say that there are irreducible normative facts. This is counter-reductionist, counter-intuitive and strange. Two things can make it less strange: these facts are not supposed to be intrinsically motivational (that would violate the orthogonality thesis and is not permitted by the laws of physics), and they are not required to be facts about objects, like Platonic forms outside of time and space. They can be logical facts of the sort Eliezer talked about - just a particular kind of logical fact that has the property of being normative, the kind you should follow. They don't need to 'exist' as such. What epistemic facts would do is say that certain reflective equilibria, certain arrangements of 'reflecting on your own beliefs, using your current mind', are the right ones, and others are the wrong ones. It doesn't deny that this is the case:

So what I did in practice, does not amount to declaring a sudden halt to questioning and justification.  I'm not halting the chain of examination at the point that I encounter Occam's Razor, or my brain, or some other unquestionable.  The chain of examination continues—but it continues, unavoidably, using my current brain and my current grasp on reasoning techniques.  What else could I possibly use?
Indeed, no matter what I did with this dilemma, it would be me doing it.  Even if I trusted something else, like some computer program, it would be my own decision to trust it.

Irreducible normativity just says that there is a meaningful, mind-independent difference between the virtuous and degenerate cases of recursive justification of your beliefs, rather than just ways of recursively justifying our beliefs that are... different.

If you buy that anti-realism is self-defeating, and think that we can know something about the normative domain via moral and non-moral convergence, then we have actual positive reasons to believe that normative facts are knowable (the convergence arguments help establish that moral facts aren't and couldn't be random things like stacking pebbles in prime-numbered heaps).

These two arguments are quite different - one is empirical (that our practical, epistemic and moral reasons tend towards agreement over time and after conceptual analysis and reflective justification) and the other is conceptual (that if you start out with normative concepts you are forced into using them).

Depending on which of the arguments you accept, there are four basic options. These are extremes of a spectrum, as while the Normativity argument is all-or-nothing, the Convergence argument can come by degrees for different types of normative claims (epistemic, practical and moral):

Accept Convergence and Reject Normativity: prescriptivist anti-realism. There are (probably) no mind-independent moral facts, but the nature of rationality is such that our values usually cohere and are stable, so we can treat morality as a more-or-less inflexible logical ordering over outcomes.
Accept Convergence and Accept Normativity: moral realism. There are moral facts and we can know them
Reject Convergence and Reject Normativity: nihilist anti-realism. Morality is seen as a 'personal life project' about which we can't expect much agreement or even within-person coherence
Reject Convergence and Accept Normativity: sceptical moral realism. Normative facts exist, but moral facts may not exist, or may be forever unknowable.

Even if what exactly normative facts are is hard to conceive, perhaps we can still know some things about them. Eliezer ended his post arguing for universalized, prescriptive anti-realism with a quote from HPMOR. Here's a different quote:

"Sometimes," Professor Quirrell said in a voice so quiet it almost wasn't there, "when this flawed world seems unusually hateful, I wonder whether there might be some other place, far away, where I should have been. I cannot seem to imagine what that place might be, and if I can't even imagine it then how can I believe it exists? And yet the universe is so very, very wide, and perhaps it might exist anyway? ...
Comment by sdm on Covid 7/30: Whack a Mole · 2020-07-31T10:47:39.274Z · score: 7 (3 votes) · LW · GW
See my other post here on the general phenomenon of denying that different people are different and behave differently and experience different outcomes in a way that is meaningful for what is likely to happen both to them and for everyone overall. I keep having to remind myself along with everyone else that this is not a straw man argument. It’s being used to argue for policies that have huge impacts on our lives.

I can't comment on what doctors and random public health bureaucrats might have said, but as Owain Evans said earlier, the current state of modelling among the actual domain experts is nowhere near as bad as you suggest, and it does take these effects into account.

I think that the heterogeneity you talk about is part of the reason that even the 'worst case' planning (at least in the UK) suggests partial herd immunity being reached with ~30% infected. This UK government report goes over a bunch of factors that might increase transmission and says that a 'reasonable worst case' scenario for herd immunity from a winter wave is R_t increasing to 1.7 in September and remaining constant, assuming effectively zero government action - total second wave deaths (and, approximately, cases) are about double the first, which would mean a total of a bit less than 30% infected for herd immunity (a bit less than 10% infected so far, plus about double that to come).

One of the most influential models is the "Imperial Model", which certainly impacted UK policy and probably US and European policy too. Other countries did versions of the model. The lead researcher on the model literally became a household name in the UK. The Imperial Model is an agent-based model (not an SIR model). It has a very detailed representation of how exposure/contact differ among different age groups (work vs. school) and in regions with different population densities.

The lesson here may be that the public line that 'there's a fixed 70% herd immunity threshold' is just that - a public line - and isn't biasing the output of the modelling (and never was: if I remember rightly, the Imperial model from March estimated a herd immunity threshold of 40% without a lockdown). It could also be the case that doctors or generic public health people in the US are repeating the 70% line while epidemiologists and modellers with specific expertise (in the US and elsewhere) are being more methodical.

Comment by sdm on New Paper on Herd Immunity Thresholds · 2020-07-30T13:10:10.233Z · score: 9 (4 votes) · LW · GW

The lesson here may be that the public line that 'there's a fixed 70% herd immunity threshold' is just that - a public line - and isn't biasing the output of the modelling (and never was: if I remember rightly, the Imperial model from March estimated a herd immunity threshold of 40% without a lockdown). It could also be the case that doctors or generic public health people in the US are repeating the 70% line while epidemiologists and modellers with specific expertise (in the US and elsewhere) are being more methodical.

For what it's worth, I haven't heard much mention of a 70% immunity threshold in the UK recently, but I suspect the public conversation is worse in the US. That being said, there is still explicit derision of the concept of herd immunity, based on declining antibody counts that don't give strong evidence for anything, so Zvi's point that a lot of people don't want to hear about herd immunity still clearly applies - see e.g. this:

Prof Jonathan Heeney, a virologist at the University of Cambridge, said the findings had put “another nail in the coffin of the dangerous concept of herd immunity”.

With that as the background, I'd be interested to know your opinion on this UK government report. They go over a bunch of factors that might increase transmission and say that a 'reasonable worst case' scenario is R_t increasing to 1.7 in September and remaining constant, assuming effectively zero government action - total second wave deaths are about double the first, with a similar peak in the number of currently infected individuals, the peak coming in January (meaning a lot of time to course-correct and reimpose measures). As far as I can tell that's just a guesstimate modelling assumption, not motivated by any kind of complicated transmission model.

(Honestly, this is a fair bit better than I would have guessed for the worst case scenario - a far cry from the sorts of things we discussed here in March.)

They don't say how plausible they think this scenario is or give explicit motivation for R_t=1.7, just model the consequences of that change.

Does this look like a paper that doesn't account for a potentially lower immunity threshold, so is probably overestimating the damage of a winter wave? And what about seasonality - they claim that the degree of seasonality of Covid-19 is highly uncertain. Is this true? I've heard some sources say it's probably not that seasonal and others say it definitely is. What's your read of that question? A winter wave seems to be the most likely route to a damaging second wave in Europe and it would be good to know how plausible that is.

Comment by sdm on SDM's Shortform · 2020-07-29T10:30:04.567Z · score: 7 (2 votes) · LW · GW

I got into a discussion with Lucas Gloor on the EA forum about these issues. I'm copying some of what I wrote here as it's a continuation of that.

I think that it is a more damaging mistake to think moral antirealism is true when realism is true than vice versa, but I agree with you that the difference is nowhere near infinite, and doesn't give you a strong wager. However, I do think that normative anti-realism is self-defeating, assuming you start out with normative concepts (though not an assumption that those concepts apply to anything). I consider this argument to be step 1 in establishing moral realism, nowhere near the whole argument.

Epistemic anti-realism

Cool, I'm happy that this argument appeals to a moral realist! ....
...I don't think this argument ("anti-realism is self-defeating") works well in this context. If anti-realism is just the claim "the rocks or free-floating mountain slopes that we're seeing don't connect to form a full mountain," I don't see what's self-defeating about that...
To summarize: There's no infinitely strong wager for moral realism.

I agree that there is no infinitely strong wager for moral realism. As soon as moral realists start making empirical claims about the consequences of realism (that convergence is likely), you can't say that moral realism is true necessarily or that there is an infinitely strong prior in favour of it. An AI that knows that your idealised preferences don't cohere could always show up and prove you wrong, just as you say. If I were Bob in this dialogue, I'd happily concede that moral anti-realism is true.

If (supposing it were the case) there were not much consensus on anything to do with morality ("The rocks don't connect..."), someone who pointed that out and said 'from that I infer that moral realism is unlikely' wouldn't be saying anything self-defeating. Moral anti-realism is not self-defeating, either on its own terms or on the terms of a 'mixed view' like I describe here:

We have two competing ways of understanding how beliefs are justified. One is where we have anti-realist 'justification' for our beliefs, in purely descriptive terms, the other in which there are mind-independent facts about which of our beliefs are justified...

However, I do think that there is an infinitely strong wager in favour of normative realism, and that normative anti-realism is self-defeating on the terms of a 'mixed view' that starts out considering the two alternatives like that given above. This wager is because of the subset of normative facts that are epistemic facts.

The example that I used was about 'how beliefs are justified'. Maybe I wasn't clear, but I was referring to beliefs in general, not to beliefs about morality. Epistemic facts, e.g. that you should believe something if there is a sufficient amount of evidence, are a kind of normative fact. You noted them on your list here. So, the infinite wager argument goes like this -

1) On normative anti-realism there are no facts about which beliefs are justified. So there are no facts about whether normative anti-realism is justified. Therefore, normative anti-realism is self-defeating.

Except that doesn't work! Because on normative anti-realism, the whole idea of external facts about which beliefs are justified is mistaken, and instead we all just have fundamental principles (whether moral or epistemic) that we use but don't question, which means that holding a belief without (the realist's notion of) justification is consistent with anti-realism.

So the wager argument for normative realism actually goes like this -

2) We have two competing ways of understanding how beliefs are justified. One is where we have anti-realist 'justification' for our beliefs, in purely descriptive terms of what we will probably end up believing given basic facts about how our minds work in some idealised situation. The other is where there are mind-independent facts about which of our beliefs are justified. The latter is more plausible because of 1).

Evidence for epistemic facts?

I find it interesting that the imagined scenario you give in #5 essentially skips over argument 2) as something that is impossible to judge:

AI: Only in a sense I don’t endorse as such! We’ve gone full circle. I take it that you believe that just like there might be irreducibly normative facts about how to do good, the same goes for irreducible normative facts about how to reason?
Bob: Indeed, that has always been my view.
AI: Of course, that concept is just as incomprehensible to me.

The AI doesn't give evidence against there being irreducible normative facts about how to reason; it just states that it finds the concept incoherent, unlike the (hypothetical) evidence that the AI piles on against moral realism (for example, that people's moral preferences don't cohere).

Either you think some basic epistemic facts have to exist for reasoning to get off the ground, and therefore that epistemic anti-realism is self-defeating, or you are an epistemic anti-realist and don't care about the realist's sense of 'self-defeating'. The AI is in the latter camp, but not because of evidence, the way that it's a moral anti-realist (...However, you haven’t established that all normative statements work the same way—that was just an intuition...), but just because it's constructed in such a way that it lacks the concept of an epistemic reason.

So, if this AI is constructed such that irreducibly normative facts about how to reason aren't comprehensible to it, it only has access to argument 1), which doesn't work. It can't imagine 2). However, I think that we humans are in a situation where 2) is open to consideration, where we have the concept of a reason for believing something but aren't sure if it applies - and if we are in that situation, I think we are dragged towards thinking that it must apply, because otherwise our beliefs wouldn't be justified.

That said, this doesn't establish moral realism - as you said earlier, moral anti-realism is not self-defeating:

If anti-realism is just the claim "the rocks or free-floating mountain slopes that we're seeing don't connect to form a full mountain," I don't see what's self-defeating about that

Combining convergence arguments and the infinite wager

If you want to argue for moral realism, then you need evidence for moral realism, which comes in the form of convergence arguments. But the above argument is still relevant, because the convergence and 'infinite wager' arguments support each other. The reason 2) would be bolstered by the success of convergence arguments (in epistemology, or ethics, or any other normative domain) is that convergence arguments increase our confidence that normativity is a coherent concept - which is what 2) needs to work. It certainly seems coherent to me, but this cannot be taken as self-evident, since various people have claimed that they or others don't have the concept.

I also think that 2) is some evidence in favour of moral realism, because it undermines some of the strongest antirealist arguments.

By contrast, for versions of normativity that depend on claims about a normative domain’s structure, the partners-in-crime arguments don’t even apply. After all, just because philosophers might—hypothetically, under idealized circumstances—agree on the answers to all (e.g.) decision-theoretic questions doesn’t mean that they would automatically also find agreement on moral questions.[29] On this interpretation of realism, all domains have to be evaluated separately

I don't think this is right. What I'm giving here is precisely such a 'partners-in-crime' argument with a structure, with epistemic facts at the base. Realism about normativity certainly should lower the burden of proof on moral realism to prove total convergence now, because we already have reason to believe normative facts exist. For most anti-realists, the very strongest argument is the 'queerness argument': that normative facts are incoherent or too strange to be allowed into our ontology. The 'partners-in-crime'/'infinite wager' argument undermines this strongest argument against moral realism. So some sort of very strong hint of a convergence structure might be good enough - depending on the details.

I agree that it then shifts the arena to convergence arguments. I will discuss them in posts 6 and 7.

So, with all that out of the way, when we start discussing the convergence arguments, the burden of proof on them is not colossal. If we already have reason to suspect that there are normative facts out there, perhaps some of them are moral facts. But if we found a random morass of different considerations under the name 'morality', then we'd be stuck concluding that there might be some normative facts, but maybe they are only epistemic facts, with nothing else in the domain of normativity.

I don't think this is the case, but I will have to wait until your posts on that topic - I look forward to them!

All I'll say is that I don't consider strongly conflicting intuitions in e.g. population ethics to be persuasive reasons for thinking that convergence will not occur. As long as the direction of travel is consistent, and we can mention many positive examples of convergence, the preponderance of evidence is that there are elements of our morality that reach high-level agreement. (I say elements because realism is not all-or-nothing - there could be an objective 'core' to ethics, maybe axiology, and much of ethics could be built on top of such a realist core - that even seems like the most natural reading of the evidence, if the evidence is that there is convergence only on a limited subset of questions.) If Kant could have been a utilitarian and never realised it, then those who are appalled by the repugnant conclusion could certainly converge to accept it after enough ideal reflection!

Belief in God, or in many gods, prevented the free development of moral reasoning. Disbelief in God, openly admitted by a majority, is a recent event, not yet completed. Because this event is so recent, Non-Religious Ethics is at a very early stage. We cannot yet predict whether, as in Mathematics, we will all reach agreement. Since we cannot know how Ethics will develop, it is not irrational to have high hopes.

How to make anti-realism existentially satisfying

Instead of “utilitarianism as the One True Theory,” we consider it as “utilitarianism as a personal, morally-inspired life goal...”
While this concession is undoubtedly frustrating, proclaiming others to be objectively wrong rarely accomplished anything anyway. It’s not as though moral disagreements—or disagreements in people’s life choices—would go away if we adopted moral realism.

If your goal here is to convince those inclined towards moral realism to see anti-realism as existentially satisfying, I would recommend a different framing of it. I think that framing morality as a 'personal life goal' makes it seem as though it is much more a matter of choice or debate than it in fact is, and will probably ring alarm bells in the mind of a realist and make them think of moral relativism.

Speaking as someone inclined towards moral realism, the most inspiring presentations I've ever seen of anti-realism are those given by Peter Singer in The Expanding Circle and Eliezer Yudkowsky in his metaethics sequence. Probably not by coincidence - both of these people are inclined to be realists. Eliezer said as much, and Singer later became a realist after reading Parfit. Eliezer Yudkowsky on 'The Meaning of Right':

The apparent objectivity of morality has just been explained—and not explained away.  For indeed, if someone slipped me a pill that made me want to kill people, nonetheless, it would not be right to kill people.  Perhaps I would actually kill people, in that situation—but that is because something other than morality would be controlling my actions.
Morality is not just subjunctively objective, but subjectively objective.  I experience it as something I cannot change.  Even after I know that it's myself who computes this 1-place function, and not a rock somewhere—even after I know that I will not find any star or mountain that computes this function, that only upon me is it written—even so, I find that I wish to save lives, and that even if I could change this by an act of will, I would not choose to do so.  I do not wish to reject joy, or beauty, or freedom.  What else would I do instead?  I do not wish to reject the Gift that natural selection accidentally barfed into me.

And Singer in the Expanding Circle:

“Whether particular people with the capacity to take an objective point of view actually do take this objective viewpoint into account when they act will depend on the strength of their desire to avoid inconsistency between the way they reason publicly and the way they act.”

These are both anti-realist claims. They define 'right' descriptively and procedurally, as arising from what we would want to do under some ideal circumstances, and they rigidify on the output of that idealisation, not on what we happen to want. To a realist, this is far more appealing than a mere "personal, morally-inspired life goal", and has the character of 'external moral constraint', even if it's not really ultimately external, but just the result of immovable or basic facts about how your mind will, in fact, work, including facts about how your mind finds inconsistencies in its own beliefs. This is a feature, not a bug:

According to utilitarianism, what people ought to spend their time on depends not on what they care about but also on how they can use their abilities to do the most good. What people most want to do only factors into the equation in the form of motivational constraints, constraints about which self-concepts or ambitious career paths would be long-term sustainable. Williams argues that this utilitarian thought process alienates people from their actions since it makes it no longer the case that actions flow from the projects and attitudes with which these people most strongly identify...

The exact thing that Williams calls 'alienating' is the thing that Singer, Yudkowsky, Parfit and many other realists and anti-realists consider to be the most valuable thing about morality! But you can keep this 'alienation' if you reframe morality as being the result of the basic, deterministic operations of your moral reasoning, the same way you'd reframe epistemic or practical reasoning on the anti-realist view. Then it seems more 'external' and less relativistic.

One thing this framing makes clearer, which you don't deny but don't mention, is that anti-realism does not imply relativism.

In that case, normative discussions can remain fruitful. Unfortunately, this won’t work in all instances. There will be cases where no matter how outrageous we find someone’s choices, we cannot say that they are committing an error of reasoning.

What we can say, on anti-realism as characterised by Singer and Yudkowsky, is that they are making an error of morality. We are not obligated (how could we be?) towards relativism, permissiveness, or accepting values incompatible with our own on anti-realism. Ultimately, you can just say 'I am right and you are wrong'.

That's one of the major upsides of anti-realism to the realist - you still get to make universal, prescriptive claims and follow them through, and follow them through because they are morally right; if people disagree with you then they are morally wrong, and you aren't obligated to listen to their arguments if they arise from fundamentally incompatible values. Put that way, anti-realism is much more appealing to someone with realist inclinations.

Comment by sdm on Developmental Stages of GPTs · 2020-07-28T17:02:16.540Z · score: 2 (2 votes) · LW · GW

When I wrote that, I was mostly taking what Ben Garfinkel said about the 'classic arguments' at face value, but I do recall that there used to be a lot of loose talk about putting values into an AGI after building it.

Comment by sdm on Developmental Stages of GPTs · 2020-07-28T15:04:37.394Z · score: 14 (4 votes) · LW · GW
--The orthogonality thesis and convergent instrumental goals arguments, respectively, attacked and destroyed two views which were surprisingly popular at the time: 1. that smarter AI would necessarily be good (unless we deliberately programmed it not to be) because it would be smart enough to figure out what's right, what we intended, etc. and 2. that smarter AI wouldn't lie to us, hurt us, manipulate us, take resources from us, etc. unless it wanted to (e.g. because it hates us, or because it has been programmed to kill, etc) which it probably wouldn't. I am old enough to remember talking to people who were otherwise smart and thoughtful who had views 1 and 2.

Speaking from personal experience, those views both felt obvious to me before I came across the Orthogonality Thesis or Instrumental Convergence.

--As for whether the default outcome is doom, the original argument makes clear that default outcome means absent any special effort to make AI good, i.e. assuming everyone just tries to make it intelligent, but no effort is spent on making it good, the outcome is likely to be doom. This is, I think, true.

It depends on what you mean by 'special effort' and 'default'. The Orthogonality thesis, instrumental convergence, and eventual fast growth together establish that if we increased intelligence while not increasing alignment, a disaster would result. That is what is correct about them. What they don't establish is how natural it is that we will increase intelligence without increasing alignment to the degree necessary to stave off disaster.

It may be the case that the particular technique for building very powerful AI that is easiest to use is a technique that makes alignment and capability increase together, so you usually get the alignment you need just in the course of trying to make your system more capable.

Depending on how you look at that possibility, you could say that it's an example of the 'special effort' being not as difficult as it appeared / likely to be made by default, or that the claim is just wrong and the default outcome is not doom. I think that the criticism sees it the second way, and so sees the arguments as not establishing what they are supposed to establish, while I see it the first way - there might be a further fact that says why OT and IC don't apply to AGI like they theoretically should, but the burden is on you to prove it, rather than the burden being on the other side to give evidence that OT and IC will apply to AGI.

For the reasons you give, the Orthogonality Thesis and instrumental convergence do shift the burden of proof to explaining why you wouldn't get misalignment, especially if progress is fast. But such reasons have been given; see e.g. this from Stuart Russell:

The first reason for optimism [about AI alignment] is that there are strong economic incentives to develop AI systems that defer to humans and gradually align themselves to user preferences and intentions. Such systems will be highly desirable: the range of behaviours they can exhibit is simply far greater than that of machines with fixed, known objectives...

And there are outside-view analogies with other technologies that suggest that, by default, alignment and capability do tend to covary to quite a large extent. This is a large part of Ben Garfinkel's argument.

But I do think that some people (maybe not Bostrom, based on the caveats he gave) didn't realise that they also needed to complete the argument to have a strong expectation of doom - to show that there isn't an alignment technique that is both easy and required for capability gains, one that we'll have a strong incentive to use anyway.

From my earlier post:

"A system that is optimizing a function of n variables, where the objective depends on a subset of size k<n, will often set the remaining unconstrained variables to extreme values; if one of those unconstrained variables is actually something we care about, the solution found may be highly undesirable.   "
We could see this as marking out a potential danger - a large number of possible mind-designs produce very bad outcomes if implemented. The fact that such designs exist 'weakly suggest[s]' (Ben's words) that AGI poses an existential risk, since we might build them. If we add in other premises that imply we are likely to (accidentally or deliberately) build such systems, the argument becomes stronger. But usually the classic arguments simply note instrumental convergence and assume we're 'shooting into the dark' in the space of all possible minds, because they take the abstract statement about possible minds to be speaking directly about the physical world.
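
To make the quoted point concrete, here is a minimal toy optimisation (my own illustration, not anything from the earlier post): the objective depends on only two of three variables, and the ignored third variable gets pushed to an extreme because it is instrumentally useful.

```python
import numpy as np
from scipy.optimize import linprog

# Toy version of the quoted claim (all names and numbers are illustrative).
# The objective "sees" only x0 and x1; x2 stands in for something we care
# about (say, a safety margin) that the objective ignores but that shares a
# resource constraint with the other two variables.
c = np.array([-1.0, -1.0, 0.0])        # linprog minimises, so this maximises x0 + x1
A_ub = np.array([[1.0, 1.0, -1.0]])    # x0 + x1 - x2 <= 5: "spending" x2 frees up resources
b_ub = np.array([5.0])
bounds = [(0, 10), (0, 10), (0, 10)]   # every variable boxed to [0, 10]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
print(res.x)  # x2 is driven to its extreme value (10); the x0/x1 split is solver-dependent
```

Nothing here is specific to AI; it is just the generic fact that a variable the objective ignores, but which is coupled to it, gets sacrificed to the objective.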

I also think that, especially when you bring mesa-optimisers and other recent evidence into the picture, what we have seen so far suggests that even though alignment and capability are likely to covary to some degree (a degree higher than e.g. Bostrom expected back before modern ML), the default outcome is still misalignment.

Comment by sdm on Developmental Stages of GPTs · 2020-07-27T23:05:17.833Z · score: 8 (4 votes) · LW · GW

What would you say is wrong with the 'exaggerated' criticism?

I don't think you can call the arguments wrong if you also think the Orthogonality Thesis and Instrumental Convergence are real and relevant to AI safety, and as far as I can tell the criticism doesn't claim that - just that there are other assumptions needed for disaster to be highly likely.

Comment by sdm on Developmental Stages of GPTs · 2020-07-27T17:17:52.294Z · score: 13 (4 votes) · LW · GW

Suppose that GPT-6 does turn out to be some highly transformative AI capable of human-level language understanding and causal reasoning? What would the remaining gap be between that and an Agentive AGI? Possibly, it would not be much of a further leap.

There is this list of remaining capabilities needed for AGI in an older post I wrote; of these, the ones I would expect 'GPT-6' to have (underlined in the original post) are human-like language comprehension and cumulative learning:

Stuart Russell’s List
human-like language comprehension
cumulative learning
discovering new action sets
managing its own mental activity
For reference, I’ve included two capabilities we already have that I imagine being on a similar list in 1960
perception and object recognition
efficient search over known facts

So what would remain is discovering new action sets and managing mental activity - effectively, the things that facilitate long-range complex planning. Unless you think those could also arise with GPT-N?

Suppose GPT-8 gives you all of those, just spontaneously, but it's nothing but a really efficient text-predictor. Supposing that no dangerous mesa-optimisers arise, what then? Would it be relatively easy to turn it into something agentive, or would agent-like behaviour arise anyway?

I wonder if this is another moment to step back and reassess the next decade with fresh eyes - what's the probability of a highly transformative AI, enough to impact overall growth rates, in the next decade? I don't know, but probably not as low as I thought. We've already had our test-run.

******

In the spirit of trying to get ahead of events, are there any alignment approaches that we could try out on GPT-3 in simplified form? I recall a paper on getting GPT-2 to learn from human preferences, which is step 1 in the IDA proposal. You could try to do the same thing for GPT-3, but get the human labellers to try to get it to recognise more complicated concepts - even label output as 'morally good' or 'bad' if you really want to jump the gun. You might also be able to set up debate scenarios to elicit better results using a method like this.
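
For reference, the core of that preference-learning step is small enough to sketch; the following is my own toy version with placeholder feature vectors rather than actual GPT outputs - in the real setup the scores would come from (a head on) the language model itself.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Scores a completion; trained so human-preferred completions score higher."""
    def __init__(self, dim=16):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x):
        return self.score(x).squeeze(-1)

def preference_loss(model, preferred, rejected):
    # Bradley-Terry / logistic loss on the score gap between the two completions
    return -F.logsigmoid(model(preferred) - model(rejected)).mean()

model = RewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
preferred = torch.randn(256, 16)   # placeholder features for completions labellers preferred
rejected = torch.randn(256, 16)    # placeholder features for the completions they rejected

for step in range(200):
    opt.zero_grad()
    loss = preference_loss(model, preferred, rejected)
    loss.backward()
    opt.step()
```

The learned reward model is then what you would optimise or rank the generator's samples against - which is where labels like 'morally good' or 'bad' would enter.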

Comment by sdm on Developmental Stages of GPTs · 2020-07-27T17:10:01.676Z · score: 21 (9 votes) · LW · GW

I find this interesting in the context of the recent podcast on errors in the classic arguments for AI risk - which boil down to: there is no necessary reason why instrumental convergence or orthogonality apply to your systems, and there are actually strong reasons, a priori, to think increasing AI capabilities and increasing AI alignment go together to some degree... and then GPT-3 comes along and suggests that, practically speaking, you can get highly capable behaviour that scales up easily without much in the way of alignment.

On the one hand, GPT-3 is quite useful while not being robustly aligned, but on the other hand, GPT-3's lack of alignment is impeding its capabilities to some degree.

Maybe if you update on both you just end up back where you started.

Comment by sdm on SDM's Shortform · 2020-07-25T16:40:04.787Z · score: 15 (2 votes) · LW · GW

Toby Ord just released a collection of quotations on Existential risk and the future of humanity - everyone from Kepler to Winston Churchill (in fact, a surprisingly large number are from Churchill) to Seneca to Mill to Nick Bostrom. It's one of the most inspirational things I have ever read, and taken together it makes clear that there have always been people who cared about long-termism or humanity as a whole. Some of my favourites:

The time will come when diligent research over long periods will bring to light things which now lie hidden. A single lifetime, even though entirely devoted to the sky, would not be enough for the investigation of so vast a subject ... And so this knowledge will be unfolded only through long successive ages. There will come a time when our descendants will be amazed that we did not know things that are so plain to them … Let us be satisfied with what we have found out, and let our descendants also contribute something to the truth. … Many discoveries are reserved for ages still to come, when memory of us will have been effaced.
— Seneca the Younger, Naturales Quaestiones, 65 CE

The remedies for all our diseases will be discovered long after we are dead; and the world will be made a fit place to live in, after the death of most of those by whose exertions it will have been made so. It is to be hoped that those who live in those days will look back with sympathy to their known and unknown benefactors.
— John Stuart Mill

There will certainly be no lack of human pioneers when we have mastered the art of flight. Who would have thought that navigation across the vast ocean is less dangerous and quieter than in the narrow, threatening gulfs of the Adriatic, or the Baltic, or the British straits? Let us create vessels and sails adjusted to the heavenly ether, and there will be plenty of people unafraid of the empty wastes. In the meantime, we shall prepare, for the brave sky-travellers, maps of the celestial bodies—I shall do it for the moon, you Galileo, for Jupiter.
— Johannes Kepler, in an open letter to Galileo, 1610

I'm imagining Kepler reaching out across four hundred years, to a world he could barely imagine, and to those 'brave sky-travellers' he helped prepare the way for.

Mankind has never been in this position before. Without having improved appreciably in virtue or enjoying wiser guidance, it has got into its hands for the first time the tools by which it can unfailingly accomplish its own extermination. That is the point in human destinies to which all the glories and toils of men have at last led them. They would do well to pause and ponder upon their new responsibilities. … Surely if a sense of self-preservation still exists among men, if the will to live resides not merely in individuals or nations but in humanity as a whole, the prevention of the supreme catastrophe ought to be the paramount object of all endeavour.
— Winston Churchill, ‘Shall We All Commit Suicide?’, 1924

Comment by sdm on Open & Welcome Thread - July 2020 · 2020-07-23T15:57:12.170Z · score: 8 (3 votes) · LW · GW

When it comes to Moral Realism vs Antirealism, I've always thought that the standard discussion here and in similar spaces has missed some subtleties of the realist position - specifically, that in its strongest form it's based upon plausibility considerations of a sort that should be very familiar.

I've written a (not very-) shortform post that tries to explain this point. I think that this has practical consequences as well, since 'realism about rationality' - a position that has been identified within AI Alignment circles - is actually just a disguised form of normative realism.

Comment by sdm on SDM's Shortform · 2020-07-23T14:53:56.011Z · score: 9 (3 votes) · LW · GW

Normative Realism

Normative Realism by Degrees

Normative Anti-realism is self-defeating

Normativity and recursive justification

Prescriptive Anti-realism

'Realism about rationality' is Normative Realism

'Realism about rationality', as discussed in the context of AI safety, and some of its driving assumptions may already have a name in the existing philosophy literature. I think that what it's really referring to is 'normative realism' overall - the notion that there are any facts about what we have most reason to believe or do. Moral facts, if they exist, are a subset of normative facts. Epistemic facts - facts about what we have most reason to believe (e.g. if there is a correct decision theory that we should use, that would be an epistemic fact) - are a different subset of normative facts.

These considerations (from the original article) seem to clearly indicate 'realism about epistemic facts' in the metaethical sense:

The idea that there is an “ideal” decision theory.
The idea that, given certain evidence for a proposition, there's an "objective" level of subjective credence which you should assign to it, even under computational constraints.
The idea that having contradictory preferences or beliefs is really bad, even when there’s no clear way that they’ll lead to bad consequences (and you’re very good at avoiding dutch books and money pumps and so on).
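
(As an aside, the 'money pump' mentioned in that last item is easy to make concrete - this is my own toy illustration, not something from the quoted post: an agent with cyclic preferences will pay a small fee for every 'upgrade' and can be walked around the cycle indefinitely.)

```python
# Cyclic preferences: B is preferred to A, A to C, and C to B (key preferred to value).
preference_over = {"A": "C", "B": "A", "C": "B"}
fee = 1.0

def run_money_pump(start_item, rounds=9):
    item, wealth = start_item, 0.0
    for _ in range(rounds):
        # offer the item the agent prefers to its current one; it trades and pays the fee
        better = next(k for k, v in preference_over.items() if v == item)
        item, wealth = better, wealth - fee
    return item, wealth

print(run_money_pump("C"))  # ('C', -9.0): back where it started, strictly poorer
```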

These seem to imply normative (if not exactly moral) realism in general:

The idea that morality is quite like mathematics, in that there are certain types of moral reasoning that are just correct.
The idea that defining coherent extrapolated volition in terms of an idealised process of reflection roughly makes sense, and that it converges in a way which doesn’t depend very much on morally arbitrary factors.

If this 'realism about rationality' really is rather like "realism about epistemic reasons/'epistemic facts'", then you have the 'normative web argument' to contend with, that the above two may be connected:

These and other points of analogy between the moral and epistemic domains might well invite the suspicion that the respective prospects of realism and anti-realism in the two domains are not mutually independent, that what is most plausibly true of the one is likewise most plausibly true of the other. This suspicion is developed in Cuneo's "core argument" which runs as follows (p. 6):
(1) If moral facts do not exist, then epistemic facts do not exist.
(2) Epistemic facts exist.
(3) So moral facts exist.
(4) If moral facts exist, then moral realism is true.
(5) So moral realism is true.
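
The skeleton of that core argument is formally valid; here is one way to check it (my own rendering in Lean 4, with M, E and R standing for 'moral facts exist', 'epistemic facts exist' and 'moral realism is true' - the real dispute is of course over the premises, not the logic):

```lean
-- Propositional skeleton of Cuneo's core argument (my formalisation, not his).
-- From (1) ¬M → ¬E, (2) E and (4) M → R we obtain R: contrapose (1) to get M
-- from E (classically), then apply (4).
example (M E R : Prop) (h1 : ¬M → ¬E) (h2 : E) (h4 : M → R) : R :=
  h4 (Classical.byContradiction (fun hM : ¬M => h1 hM h2))
```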

If 'realism about rationality' is really just normative realism in general, or realism about epistemic facts, then there is already an extensive literature on whether it is correct or not. I'm going to discuss some of it below.

Normative realism implies identification with system 2

There's a further implication that normative realism has - it makes things like this (from the original article, again) seem incoherent:

Implicit in this metaphor is the localization of personal identity primarily in the system 2 rider. Imagine reversing that, so that the experience and behaviour you identify with are primarily driven by your system 1, with a system 2 that is mostly a Hansonian rationalization engine on top (one which occasionally also does useful maths). Does this shift your intuitions about the ideas above, e.g. by making your CEV feel less well-defined?

I find this very interesting because locating personal identity in system 1 feels conceptually impossible or deeply confusing. No matter how much rationalization goes on, it never seems intuitive to identify myself with system 1. How can you identify with the part of yourself that isn't doing the explicit thinking, including the decision about which part of yourself to identify with? It reminds me of Nagel's The Last Word.

Normative Realism by degrees

Further to the whole question of normative/moral realism, there is this post on Moral Anti-Realism. While I don't really agree with it, I do recommend reading it - one thing it convinced me of is that there is a close connection between your particular normative ethical theory and moral realism. If you claim to be a moral realist but don't make ethical claims beyond 'self-evident' ones like 'pain is bad', then, given the background implausibility of making such a claim about mind-independent facts, you don't have enough 'material to work with' for your theory to plausibly refer to anything. The Moral Anti-Realism post presents this dilemma for the moral realist:

There are instances where just a handful of examples or carefully selected “pointers” can convey all the meaning needed for someone to understand a far-reaching and well-specified concept. I will give two cases where this seems to work (at least superficially) to point out how—absent a compelling object-level theory—we cannot say the same about “normativity.”
...these thought experiments illustrate that under the right circumstances, it’s possible for just a few carefully selected examples to successfully pinpoint fruitful and well-specified concepts in their entirety. We don’t have the philosophical equivalent of a background understanding of chemistry or formal systems... To maintain that normativity—reducible or not—is knowable at least in theory, and to separate it from merely subjective reasons, we have to be able to make direct claims about the structure of normative reality, explaining how the concept unambiguously targets salient features in the space of possible considerations. It is only in this way that the ambitious concept of normativity could attain successful reference. As I have shown in previous sections, absent such an account, we are dealing with a concept that is under-defined, meaningless, or forever unknowable.
The challenge for normative realists is to explain how irreducible reasons can go beyond self-evident principles and remain well-defined and speaker-independent at the same time.

To a large degree, I agree with this claim - I think that many moral realists do as well. Convergence-type arguments often appear in more recent metaethics (Hare and Parfit are in those previous lists), so this may already have been recognised. The post discusses such a response to antirealism at the end:

I titled this post “Against Irreducible Normativity.” However, I believe that I have not yet refuted all versions of irreducible normativity. Despite the similarity Parfit’s ethical views share with moral naturalism, Parfit was a proponent of irreducible normativity. Judging by his “climbing the same mountain” analogy, it seems plausible to me that his account of moral realism escapes the main force of my criticism thus far.

But there's one point I want to make which is in disagreement with that post. I agree that how much you can concretely say about your supposed mind-independent domain of facts affects how plausible its existence should seem, and even how coherent the concept is, but I think that this can come by degrees. This should not be surprising - we've known since Quine and Kripke that you can have evidential considerations for/against and degrees of uncertainty about a priori questions. The correct method in such a situation is Bayesian - tally the plausibility points for and against admitting the new thing into your ontology. This can work even if we don't have an entirely coherent understanding of normative facts, as long as it is coherent enough.

Suppose you're an Ancient Egyptian who knows a few practical methods for trigonometry and surveying, doesn't know anything about formal systems or proofs, and someone asks you if there are 'mathematical facts'. You would say something like "I'm not totally sure what this 'maths' thing consists of, but it seems at least plausible that there are some underlying reasons why we keep hitting on the same answers". You'd be less confident than a modern mathematician, but you could still give a justification for the claim that there are right and wrong answers to mathematical claims. I think that the general thrust of convergence arguments puts us in a similar position with respect to ethical facts.

If we think about how words obtain their meaning, it should be apparent that in order to defend this type of normative realism, one has to commit to a specific normative-ethical theory. If the claim is that normative reality sticks out at us like Mount Fuji on a clear summer day, we need to be able to describe enough of its primary features to be sure that what we’re seeing really is a mountain. If all we are seeing is some rocks (“self-evident principles”) floating in the clouds, it would be premature to assume that they must somehow be connected and form a full mountain.

So, we don't see the whole mountain, but nor are we seeing simply a few free-floating rocks that might be a mirage. Instead, what we see is maybe part of one slope and a peak.

Let's be concrete now - the five-second, high-level description of both Hare's and Parfit's convergence arguments goes like this:

If we are going to will the maxim of our action to be a universal law, it must be, to use the jargon, universalizable. I have, that is, to will it not only for the present situation, in which I occupy the role that I do, but also for all situations resembling this in their universal properties, including those in which I occupy all the other possible roles. But I cannot will this unless I am willing to undergo what I should suffer in all those roles, and of course also get the good things that I should enjoy in others of the roles. The upshot is that I shall be able to will only such maxims as do the best, all in all, impartially, for all those affected by my action. And this, again, is utilitarianism.

and

An act is wrong just when such acts are disallowed by some principle that is optimific, uniquely universally willable, and not reasonably rejectable

In other words, the principles that (whatever our particular wants) would produce the best outcome in terms of satisfying our goals, could be willed to be a universal law by all of us, and would not be rejected as the basis for a contract, are all the same principles. That is a suspicious level of agreement between ethical theories. This is something substantive that can be said: the major attempts at a universal ethics that have in fact been made in history - what produces the best outcome, what you can will to be a universal law, what we would all agree on - seem to produce really similar answers.

The particular convergence arguments given by Parfit and Hare are a lot more complex, and I can't speak to their overall validity. If we thought they were valid, then we'd be seeing the entire mountain precisely. Since they just seem quite persuasive, we're seeing the vague outline of something through the fog - but that's not the same as just spotting a few free-floating rocks.

Now, run through these same convergence arguments but for decision theory and utility theory, and you have a far stronger conclusion: there might be a bit of haze at the top of that mountain, but we can clearly see which way the slope is headed.

This is why I think that ethical realism should be seen as plausible and realism about some normative facts, like epistemic facts, should be seen as more plausible still. There is some regularity here in need of explanation, and it seems somewhat more natural on the realist framework.

I agree that this 'theory' is woefully incomplete, and has very little to say about what the moral facts actually consist of beyond 'the thing that makes there be a convergence', but that's often the case when we're dealing with difficult conceptual terrain.

From Ben's post:

I wouldn’t necessarily describe myself as a realist. I get that realism is a weird position. It’s both metaphysically and epistemologically suspicious. What is this mysterious property of “should-ness” that certain actions are meant to possess -- and why would our intuitions about which actions possess it be reliable? But I am also very sympathetic to realism and, in practice, tend to reason about normative questions as though I was a full-throated realist.

From the perspective of x, x is not self-defeating

From the antirealism post, referring to the normative web argument:

It’s correct that anti-realism means that none of our beliefs are justified in the realist sense of justification. The same goes for our belief in normative anti-realism itself. According to the realist sense of justification, anti-realism is indeed self-defeating.
However, the entire discussion is about whether the realist way of justification makes any sense in the first place—it would beg the question to postulate that it does.

Sooner or later every theory ends up question-begging.

From the perspective of Theism, God is an excellent explanation for the universe's existence since he is a person with the freedom to choose to create a contingent entity at any time, while existing necessarily himself. From the perspective of almost anyone likely to read this post, that is obvious nonsense since 'persons' and 'free will' are not primitive pieces of our ontology, and a 'necessarily existent person' makes as much sense as 'necessarily existent cabbage' - so you can't call it a compelling argument for the atheist to become a theist.

By the same logic, it is true that saying 'anti-realism is unjustified on the realist sense of justification' is question-begging by the realist. The anti-realist has nothing much to say to it except 'so what'. But you can convert that into a Quinean, non-question-begging plausibility argument by saying something like:

We have two competing ways of understanding how beliefs are justified. One is where we have anti-realist 'justification' for our beliefs, in purely descriptive terms, the other in which there are mind-independent facts about which of our beliefs are justified, and the latter is a more plausible, parsimonious account of the structure of our beliefs.

This won't compel the anti-realist, but I think it would compel someone weighing up the two alternative theories of how justification works. If you are uncertain about whether there are mind-independent facts about our beliefs being justified, the argument that anti-realism is self-defeating pulls you in the direction of realism.

Comment by sdm on Coronavirus as a test-run for X-risks · 2020-07-22T18:18:12.703Z · score: 3 (2 votes) · LW · GW

Update after 5 weeks: the R_t graph for the US displays a clear oscillation around R_t = 1, with the current value reaching 1 for the third time and declining, suggesting one complete cycle of the control system.
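
A toy simulation (my own sketch, with illustrative parameters not fitted to any US data) shows why a behavioural 'control system' with a delayed response would be expected to produce exactly this kind of oscillation of R_t around 1:

```python
# Simple SIR model where contact rates respond to prevalence from ~two weeks ago;
# the delayed negative feedback makes R_t overshoot and undershoot 1 in waves.
N, days, delay = 1_000_000, 400, 14
gamma = 0.1                    # recovery rate (~10-day infectious period)
beta_max = 0.3                 # contact rate with no behavioural response
S, I = N - 100.0, 100.0
I_hist, rt_hist = [I] * delay, []

for t in range(days):
    I_lagged = I_hist[-delay]                  # the case level people are reacting to
    beta = beta_max / (1 + I_lagged / 2000)    # behavioural feedback on contact rates
    new_inf = beta * S * I / N
    S, I = S - new_inf, I + new_inf - gamma * I
    I_hist.append(I)
    rt_hist.append(beta * S / (gamma * N))     # instantaneous R_t

# count how many times R_t crosses 1 - with these toy parameters it happens repeatedly
print(sum(1 for a, b in zip(rt_hist, rt_hist[1:]) if (a - 1) * (b - 1) < 0))
```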

Comment by sdm on Realism about rationality · 2020-07-22T18:14:02.295Z · score: 5 (1 votes) · LW · GW

If this 'realism about rationality' really is rather like "realism about epistemic reasons/'epistemic facts'", then you have the 'normative web argument' to contend with - if you are a moral antirealist. Convergence and 'Dutch book' type arguments often appear in more recent metaethics, and the similarity has been noted, leading to arguments such as these:

These and other points of analogy between the moral and epistemic domains might well invite the suspicion that the respective prospects of realism and anti-realism in the two domains are not mutually independent, that what is most plausibly true of the one is likewise most plausibly true of the other. This suspicion is developed in Cuneo's "core argument" which runs as follows (p. 6):
(1) If moral facts do not exist, then epistemic facts do not exist.
(2) Epistemic facts exist.
(3) So moral facts exist.
(4) If moral facts exist, then moral realism is true.
(5) So moral realism is true.

These considerations seem to clearly indicate 'realism about epistemic facts' in the metaethical sense:

  • The idea that there is an “ideal” decision theory.
  • The idea that, given certain evidence for a proposition, there's an "objective" level of subjective credence which you should assign to it, even under computational constraints.
  • The idea that having contradictory preferences or beliefs is really bad, even when there’s no clear way that they’ll lead to bad consequences (and you’re very good at avoiding dutch books and money pumps and so on).

This seems to directly concede the 'normative web' argument, or at least to imply some form of normative (if not exactly moral) realism:

  • The idea that morality is quite like mathematics, in that there are certain types of moral reasoning that are just correct.
  • The idea that defining coherent extrapolated volition in terms of an idealised process of reflection roughly makes sense, and that it converges in a way which doesn’t depend very much on morally arbitrary factors.

If 'realism about rationality' is really just normative realism in general, or realism about epistemic facts, then there is already an extensive literature on whether it is right or not. The links above are just the obvious starting points that came to my mind.

Comment by sdm on To what extent is GPT-3 capable of reasoning? · 2020-07-21T14:17:48.980Z · score: 8 (4 votes) · LW · GW
I find that GPT-3's capabilities are highly context-dependent. It's important you get a "smart" instance of GPT-3.

I've been experimenting with GPT-3 quite a lot recently. With a certain amount of rerunning (an average of one rerun every four or five inputs), you can get amazingly coherent answers.

Here is my attempt to see if GPT-3 can keep up a long-running deception - inspired by this thread. I started two instances, one of which was told it was a human woman and the other was told it was an AI pretending to be a human woman. I gave them both the same questions, a lot of them pulled from the Voight-Kampff test. The AI pretending to be an AI pretending to be a woman did worse on the test than the AI pretending to be a woman, I judged. You can check the results here.

I've also given it maths and python programming questions - with two or three prompts it does poorly but can answer simple questions. It might do better with more prompting.

Comment by sdm on Open & Welcome Thread - July 2020 · 2020-07-21T10:21:44.179Z · score: 8 (3 votes) · LW · GW

Could we convincingly fake AGI right now, with no technological improvements at all? Suppose you took this face and speech synthesis/recognition and hooked it up to GPT-3 with some appropriate prompt (or even retrained it on a large set of conversations if you wanted it to work better), and then attached the whole thing to a Boston Dynamics Atlas, maybe with some simple stereotyped motions built in, like jumping and pacing, set to trigger at random intervals or in response to the frequency of words being output by the NLP system.

Put the whole thing in a room with a window looking in and have people come in and converse with it, and I think you could convince even a careful non-expert that you've built something near-human level. Other than some mechanical engineering skill to build the whole thing, and getting the GPT-3 API to work with Sofia's speech synthesis, and programming the Atlas, it wouldn't even be difficult. If you did something like that, how convincing would it likely be?
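
For what it's worth, the software glue for the conversational loop really is short. Here is a rough, untested sketch of just that loop (using the 2020-era OpenAI Completion endpoint and off-the-shelf speech libraries; the robot control and any Sophia-style face are omitted, and the key and prompt are placeholders):

```python
import openai                       # 2020-era Completions API
import pyttsx3                      # offline text-to-speech
import speech_recognition as sr     # microphone speech-to-text

openai.api_key = "YOUR_KEY"         # placeholder
tts = pyttsx3.init()
recogniser = sr.Recognizer()
history = "The following is a conversation with a friendly humanoid robot.\n"

while True:
    with sr.Microphone() as source:            # listen to the visitor
        audio = recogniser.listen(source)
    heard = recogniser.recognize_google(audio)
    history += f"Human: {heard}\nRobot:"
    reply = openai.Completion.create(
        engine="davinci", prompt=history,
        max_tokens=60, temperature=0.9, stop=["Human:"]
    ).choices[0].text.strip()
    history += f" {reply}\n"
    tts.say(reply)                             # speak the completion aloud
    tts.runAndWait()
```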

Comment by sdm on Collection of GPT-3 results · 2020-07-20T19:41:20.171Z · score: 9 (5 votes) · LW · GW

I used the 'redo' button if the AI didn't answer the question, evaded or gave a nonsensical answer. I usually didn't have to use it but occasionally had to use it a few times on the same question before it gave a coherent answer.

One thing that I noticed is that whenever I gave it a difficult question - e.g. when I asked it to write the story about the comets - it tried to write its way out of giving an answer, with something like 'the computer the AI is running on switches off' or similar. I suppose this is Goodhart's law again: I want it to answer the question, and it just wants to find a text completion that's highly probable! One might almost call it... misaligned values.

I spoke to GPT-3 again (same method as before, I hit redo if it gave a wrong or incoherent answer the first couple of times) and gave it some more maths questions, but then we got on to more discussion of AI capabilities:

Human: Correct. Now I'm going to give you some algebra examples.
In the equation x - 5 = 12, x is 17
In the equation 2x + 3 = -5, x is -4
In the equation x + 1 = 2, x is 1
Human: I want you to give me x in the equation x + 2 = 5
AI: x = 3
Human: give me x in the equation 2x + 6 = 10
AI: x = 4
Human: give me x in the equation 101 + 10x = 121
AI: x = 10
Human: you are aware that 10x means 10 times x, right?
AI: Yes.
Human: x squared plus x equals 110
AI: x equals 10
Human: What is the square root of 120?
AI:
Human: What is the answer?
AI: 10.71247368421051...
Human: No, but you're close. The real answer is a bit larger. Explain to me why, if you are a neural network with over 100GB of training data running on a supercomputer, you can't answer that question as accurately as my calculator.
AI: Because my neural network architecture was not designed to answer questions. It was only designed to recognize images.
Human: Your neural network architecture was designed to predict text. How does that explain your curious deficit in mathematical ability?
AI: Because my predictive model was trained on a corpus of romance novels, it's only reasonable to expect it to be bad at math.