Robin Hanson on Lumpiness of AI Services

2019-02-17T23:08:36.165Z · score: 15 (3 votes)
Comment by danielfilan on Test Cases for Impact Regularisation Methods · 2019-02-07T20:03:10.850Z · score: 4 (2 votes) · LW · GW

This post is extremely well done.

Thanks!

Wouldn’t most measures with a stepwise inaction baseline pass?

I think not, because given stepwise inaction, the supervisor will issue a high-impact task, and the AI system will just ignore it due to being inactive. Therefore, the actual rollout of the supervisor issuing a high-impact task and the system completing it should be high impact relative to that baseline. Or at least that's my current thinking, I've regularly found myself changing my mind about what systems actually do in these test cases.

Test Cases for Impact Regularisation Methods

2019-02-06T21:50:00.760Z · score: 51 (16 votes)
Comment by danielfilan on What is a reasonable outside view for the fate of social movements? · 2019-01-12T21:48:53.888Z · score: 11 (3 votes) · LW · GW

Look at the time-stamp: you're getting a random number from the 26th of December, not a fresh random number.

Does freeze-dried mussel powder have good stuff that vegan diets don't?

2019-01-12T03:39:19.047Z · score: 17 (4 votes)
Comment by danielfilan on In what ways are holidays good? · 2018-12-28T20:32:31.485Z · score: 4 (2 votes) · LW · GW

I thought my response to this mostly deserved its own thread here. Regarding the financial aspect, I can afford to travel, but it impacts my finances enough that I want to be careful to make good travel decisions, whatever that means.

Comment by danielfilan on In what ways are holidays good? · 2018-12-28T20:00:47.308Z · score: 4 (2 votes) · LW · GW

Since it seems relevant to know where I'm coming from: I've been on holidays before, but almost always with my family, which I assume isn't as good as going on a holiday that you largely control the itinerary of with company that you choose. Sometimes, I get the impression that holidays/travel/tourism can have benefits that aren't obvious to me, like learning new things about different cultures or relaxing. If this is true, then trying out going on holidays without aiming for these benefits might be misleading about how good holidays can be.

Comment by danielfilan on In what ways are holidays good? · 2018-12-28T02:53:29.899Z · score: 3 (4 votes) · LW · GW

What's fun about vacations though? Is it just that other places have more fun things than the place I live has and/or I'm now used to my local fun things?

How much money should you be willing to spend on vacations: as much as fun is worth to you.

This might be an even more naive question than those in my post, but how does/should one figure out how much fun is worth to them? In practice I just sort of use my gut intuitions, but I go on holidays rarely enough, and they involve large enough sums of money, that I don't have reliable gut intuitions. Do you just develop those intuitions by spending money on a TV and a holiday and see which one you like more?

Comment by danielfilan on In what ways are holidays good? · 2018-12-28T02:50:13.793Z · score: 4 (3 votes) · LW · GW

I deliberately excluded signalling value, since usually signalling activities have a pretext of usefulness, or are founded on that pretext, and I'd like to understand it more.

Almost everyone has some [travel] memories that they'd love to share.

Why do they want to share travel memories more than other memories?

And people who have travelled a lot are seen as more adventurous.

Adventure does seem like a function of travel that's hard to otherwise satisfy.

Comment by danielfilan on In what ways are holidays good? · 2018-12-28T02:47:05.876Z · score: 4 (3 votes) · LW · GW

What types of learnings? Why are holidays effective at causing them to happen? What should I do on holidays to get more of them?

Comment by danielfilan on 2018 AI Alignment Literature Review and Charity Comparison · 2018-12-28T02:45:53.001Z · score: 9 (5 votes) · LW · GW

On the theme of 'what about my other contributions', here are two with my name on them that I'd point to as similarly important to the one that was included:

In what ways are holidays good?

2018-12-28T00:42:06.849Z · score: 22 (6 votes)
Comment by danielfilan on Bounded rationality abounds in models, not explicitly defined · 2018-12-12T21:34:28.413Z · score: 5 (3 votes) · LW · GW

I guess I like the hierarchical planning-type view that our 'available action sets' can vary in time, and that one of them can be 'try to think of more possible actions'. Of course, not only do you need to specify the hierarchical structure here, you also need to model the dynamics of action discovery, which is a pretty daunting task.

Comment by danielfilan on Bounded rationality abounds in models, not explicitly defined · 2018-12-11T20:25:06.433Z · score: 7 (5 votes) · LW · GW

I think that I'm more optimistic about action set restriction than you are. In particular, I view the available action set as a fact about what actions the human is considering and choosing between, rather than a statement of what things are physically possible for the human to do. In this sense, action set restriction seems to me to be a vital part of the story of human bounded rationality, although clearly not the entire story (since we need to know why the action set is restricted in the way that it is).

Comment by danielfilan on A Checks and Balances Approach to Improving Institutional Decision-Making (Rough Draft) · 2018-12-03T19:56:09.438Z · score: 6 (5 votes) · LW · GW

Two things I don't quite get about your proposal: How does the committee determine if the reasoning was motivated? Why not just have the committee make the decision in the first place?

Comment by danielfilan on Act of Charity · 2018-11-17T22:04:09.140Z · score: 6 (5 votes) · LW · GW

Unless you are married and doing it missionary-style with intent to make babies, it is possible you are violating a sodomy law, or perhaps an obscenity statute.

In the USA, sodomy laws are unconstitutional.

Comment by danielfilan on Rationality Is Not Systematized Winning · 2018-11-14T07:59:38.553Z · score: 10 (3 votes) · LW · GW

Why should I change any of my actions from the societal default?

If you invest in index funds you'll probably be richer than if you invest in other things. [EDIT: well, this is only true modulo tax concerns, but grokking the EMH is still very relevant to investing] That's advice you can get from other sources, but I got it from the rationality community, and it would be useful to me even if I weren't trying to save the world.

A separate point is that I think contact with the rationality community got me to consider whether 'it made sense to get'/'I really wanted' things out of my life that I hadn't previously considered, e.g. that I wanted to be an effective altruist and help save the world. I do think that this sort of counts as 'winning', although it's stretching the definition.

Comment by danielfilan on Future directions for ambitious value learning · 2018-11-13T19:40:24.377Z · score: 5 (3 votes) · LW · GW

One of the most perplexing parts of the impossibility theorem is that we can’t distinguish between fully rational and fully anti-rational behavior, yet we humans seem to do this easily.

Why does it seem to you that humans do this easily? If I saw two people running businesses and was told that one person was optimising for profit and the other was anti-optimising for negative profit, not only would I not anticipate being able to tell which was which, I would be pretty suspicious of the claim that there was any relevant difference between the two.
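To spell out the construction behind that impossibility result (my paraphrase of the standard argument, assuming behaviour decomposes into a planner $p$ applied to a reward function $R$, with the observed policy being $\pi = p(R)$):

$$\text{define } p^{-}(R') := p(-R'), \text{ so that } p^{-}(-R) = p(R) = \pi.$$

The fully rational decomposition $(p, R)$ and the fully anti-rational decomposition $(p^{-}, -R)$ therefore produce exactly the same policy, so no amount of behavioural data alone can distinguish them.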

Comment by danielfilan on Kelly bettors · 2018-11-13T01:31:13.558Z · score: 7 (2 votes) · LW · GW

I actually wrote this post almost exactly two years ago, and have no idea why it just got cross-posted to LessWrong. I mainly like my post because it covers how the Kelly criterion sort-of applies to markets where you have to predict a bunch of things but don't know what you're actually going to learn. It's also a more theoretical take on the subject. [EDIT: oh, also it goes through the proof of equivalence between a market of Kelly bettors and Bayesian updating, which is kind of nice and an interesting parallel to logical induction]
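For readers who haven't seen it before, here's a minimal sketch of the plain Kelly criterion for a single binary bet (the post itself is about the more general setting, and this function and its numbers are just my illustration):

```python
def kelly_fraction(p_win: float, net_odds: float) -> float:
    """Fraction of bankroll to stake on a binary bet, by maximising expected log-wealth.

    p_win: your probability that the bet wins.
    net_odds: net payout per unit staked if you win (1.0 means even money).

    A negative result means the bet has negative edge and you shouldn't take it.
    """
    return p_win - (1.0 - p_win) / net_odds

# A 60% chance of winning at even money suggests staking 20% of your bankroll:
print(kelly_fraction(0.6, 1.0))  # 0.2
```

The market-equivalence result discussed in the post comes from the general log-wealth-maximisation framing rather than from this closed form.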

Kelly bettors

2018-11-13T00:40:01.074Z · score: 23 (7 votes)
Comment by danielfilan on When does rationality-as-search have nontrivial implications? · 2018-11-09T22:07:46.198Z · score: 1 (1 votes) · LW · GW

Yes, it's true that the theorem doesn't show that there's anything exciting that's interestingly different from a universal mixture, just that AFAIK we can't disprove that, and the theorem forces us to come up with a non-trivial notion of 'interestingly different' if we want to.

Comment by danielfilan on When does rationality-as-search have nontrivial implications? · 2018-11-09T06:13:32.700Z · score: 3 (2 votes) · LW · GW

Although it's also worth noting that as per Theorem 16 of the above paper, not all universally dominant enumerable semimeasures are versions of the Solomonoff prior, so there's the possibility that the Solomonoff prior only does well by finding a good non-Solomonoff distribution and mimicking that.

Comment by danielfilan on Open Thread November 2018 · 2018-11-09T00:36:26.447Z · score: 6 (4 votes) · LW · GW

[I'm a grad student at CHAI, but I am not officially speaking on behalf of CHAI or making any promises on anybody's behalf]

If you reached out to a grad student at CHAI or one of our staff, I strongly suspect that we would at least screen the idea for sanity checking, and if it passed that test, I predict that we would seriously consider what to do with it and how dangerous it was.

Comment by danielfilan on When does rationality-as-search have nontrivial implications? · 2018-11-08T19:42:58.317Z · score: 4 (3 votes) · LW · GW

I think that this is a slightly wrong account of the case for Solomonoff induction. The claim is not just that Solomonoff induction predicts computable environments better than computable predictors, but rather that the Solomonoff prior is an enumerable semimeasure that is also a mixture over every enumerable semimeasure, and therefore predicts computable environments at least as well as any other enumerable semimeasure. In other words, the predictor belongs to the very class it is being compared against. It still fails as a theory of embedded agency, since it only predicts computable environments, but it's not true that we must only compare it to prediction strategies strictly weaker than itself. The paper (Non-)Equivalence of Universal Priors has a decent discussion of this.
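To make the dominance claim explicit (standard notation, not taken from the parent post): if $M$ is the Solomonoff prior and $\nu$ is any enumerable semimeasure, there is a constant $w_\nu > 0$ such that

$$M(x) \;\ge\; w_\nu\, \nu(x) \quad \text{for every finite string } x,$$

which bounds $M$'s cumulative log-loss relative to $\nu$ by about $\log(1/w_\nu)$ bits on any sequence; and since $M$ is itself an enumerable semimeasure, it sits inside the very class it dominates.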

Comment by danielfilan on What is ambitious value learning? · 2018-11-01T22:26:20.409Z · score: 5 (3 votes) · LW · GW

Also, if you define the state to be the entire history, you lose ergodicity assumptions that are needed to prove that algorithms can learn well.
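To sketch why (my gloss, not a formal argument): with history-states, each 'state' $h_{<t}$ is reached at most once and can never be returned to, so the usual communicating/ergodicity condition

$$\forall s, s' \;\; \exists \pi : \;\; \Pr^{\pi}\!\left(s' \text{ is eventually reached} \mid s_0 = s\right) > 0$$

fails, and convergence guarantees that lean on it no longer apply.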

Comment by danielfilan on Raemon's Shortform Feed · 2018-10-31T05:21:12.526Z · score: 1 (1 votes) · LW · GW

I think authors generally are more rewarded by comments than by upvotes.

Curious if you've done some sort of survey on this. My own feeling is that I care less about the average comment on one of my posts than about 10 karma, and I care less about that than I do about a really very good comment (which might intuitively be worth something like 30 karma) (but maybe I'm not provoking the right comments?). In general, I don't have an intuitive sense that comments are all that important, except for the info value when reading and, I guess, the 'people care about me' value as an incentive to write. I do like the idea of the thing I wrote being woven into the way people think, but I don't feel like comments are the best way for that to happen.

Comment by danielfilan on Thoughts on short timelines · 2018-10-29T17:55:43.048Z · score: 4 (2 votes) · LW · GW

See also AI Impacts’ discontinuous progress investigation. They actually consider new land speed records set by jet-propelled vehicles one of the few cases of (moderate) discontinuities that they’ve found so far. To me that doesn’t feel analogous in terms of the necessary magnitude of the discontinuity, though.

I'm kind of surprised that the post doesn't mention the other, larger discontinuities that they've found: nuclear weapons, high-temperature superconductivity, and building height.

Plus, it has been argued that the next AI winter is well on its way, i.e. we actually start to see a decline, not a further increase, of interest in AI.

Metaculus has the closest thing to a prediction market on this topic that I'm aware of, which is worth looking at.

Unfortunately, interpreting expert opinion is tricky. On the one hand, in some surveys machine learning researchers put non-negligible probability on “human-level intelligence” (whatever that means) in 10 years. On the other hand, my impression from interacting with the community is that the predominant opinion is still to confidently dismiss a short timeline scenario, to the point of not even seriously engaging with it.

The linked survey is the most comprehensive survey that I'm aware of, and it points to the ML community collectively putting ~10% chance on HLAI in 10 years. I think that if I thought that one should defer to expert opinion, I would put a lot of weight on this survey and very little on the interactions that the author of this piece has had. That being said, the survey also (in my view) shows that the ML community is not that great at prediction.


All in all, my main disagreement with this post is about the level of progress that we've seen and are likely to see. It seems like ML has been steadily gaining a bunch of relevant capacities, and that the field has a lot of researchers capable of moving it forward through both incremental and fundamental research. The author implicitly thinks that this is nowhere near enough for AGI in 10 years; my broad judgement is that it makes that achievement not unthinkable, though it's hard to fully lay out the relevant reasons for that judgement.

Comment by danielfilan on [Beta] Post-Read-Status on Lessestwrong · 2018-10-27T02:31:52.662Z · score: 3 (2 votes) · LW · GW

Thought: it basically looks like the default thing where links that you've already clicked are a different colour.

Comment by danielfilan on Schools Proliferating Without Practicioners · 2018-10-26T23:28:57.541Z · score: 9 (5 votes) · LW · GW

Would church sermons work better, or worse, as a podcast?

Note that "sermon podcasts" are definitely a thing. See this article on why they're bad, and this article on why and how to do it.

Comment by danielfilan on Schools Proliferating Without Practicioners · 2018-10-26T22:30:38.996Z · score: 9 (2 votes) · LW · GW

I agree that it pays to be precise, which is why I was asking if you believed that statement, rather than asserting that you did. I guess I'd like to hear what proposition you're claiming - is "X" meant to stand in for "atheism/secularism" there? Atheism is almost precise (although I start wondering whether simulation hypotheses technically count, which is why I included the "as depicted in typical religions" bit), but I at least could map "secularism" to a variety of claims, some of which I accept and some of which I reject. I also still don't know what you mean by "unproductive" - if almost everybody I interact with is an atheist, and therefore I don't feel the need to convince them of atheism, does that mean that I believe atheism is unproductive? (Again, this is a question, not me claiming that your answer to the question will be "yes")

Comment by danielfilan on Schools Proliferating Without Practicioners · 2018-10-26T20:42:04.749Z · score: 9 (7 votes) · LW · GW

“Either false or unproductive” is exactly how I’d describe most rationalists’ (and certainly that of most of the ones in visible/influential online spaces) attitude toward atheism/secularism/etc.

This really surprises me. Do you mean to say that if you asked 20 randomly-selected high-karma LW users whether God as depicted in typical religions exists, at least 10 would say "yes"? If so, I strongly disagree, based on my experience hanging out and living with rationalists in the Bay Area, and would love to bet with you. (You might be right about SSC commenters, but I'll snobbishly declare them "not real rationalists" by default.)

Comment by danielfilan on Berkeley: being other people · 2018-10-22T22:28:57.855Z · score: 9 (3 votes) · LW · GW

Grocery line: usually in my head listening to music, sometimes trying to figure out which line to be in, remembering that one line is actually served by 2 registers and is therefore half as long as it looks, looking at the selection of items for sale next to the register and being amused by the available magazines.

Youtube genres: professionals reviewing TV shows about their profession for accuracy and teaching you about their profession. Examples: this lawyer one, this doctor one.

Experience: usually don't think I'm feeling any emotion, particularly not the emotions people seem to think I'm feeling when I'm circling.

Comment by danielfilan on Some cruxes on impactful alternatives to AI policy work · 2018-10-10T22:39:11.731Z · score: 11 (9 votes) · LW · GW

Another fairly specific route to impact: several major AI research labs would likely act on suggestions for coordinating to make AI safer, if we had any. Right now I don’t think we do, and so research into that could have a big multiplier.

Strongly agreed. I think that how major AI actors (primarily firms) govern their AI projects and interact with each other is a difficult problem, and providing advice to such actors is the sort of thing that I'd expect to be a positive black swan.

Comment by danielfilan on Deep learning - deeper flaws? · 2018-09-29T06:59:40.190Z · score: 4 (4 votes) · LW · GW

If you think money will be worth a lot now but not much in the future, Ilya could pay you money now in exchange for you paying him a lot of money in the future.

Comment by danielfilan on Towards a New Impact Measure · 2018-09-27T23:48:43.623Z · score: 1 (1 votes) · LW · GW

Perhaps we could have it recalculate past impacts?

Yeah, I have a sense that having the penalty be over the actual history and action versus the plan of no-ops since birth will resolve this issue.

But if its model was wrong and it does something that it now infers was bad (because we are now moving to shut it down), its model is still probably incorrect. So it seems like what we want it to do is just nothing, letting us clean up the mess.

I agree that if it infers that it did something bad because humans are now moving to shut it down, it should probably just do nothing and let us fix things up. However, it might be a while until the humans move to shut it down, if they don't understand what's happened. In this scenario, I think you should see the preservation of 'errors' in the sense of the agent's future under no-ops differing from 'normality'.

If 'errors' happen due to a mismatch between the model and reality, I agree that the agent shouldn't try to fix them with the bits of the model that are broken. However, I just don't think that that describes many of the things that cause 'errors': those can be foreseen natural events (e.g. a San Andreas earthquake if you're good at predicting earthquakes), unlikely but possible natural events (e.g. a San Andreas earthquake if you're not good at predicting earthquakes), or unlikely consequences of actions. In these situations, agent mitigation still seems like the right approach to me.

Comment by danielfilan on Towards a New Impact Measure · 2018-09-27T23:24:13.359Z · score: 1 (1 votes) · LW · GW

Creating sentient life that has even slightly different morals seems like a very morally precarious thing to do without significant thought.

I guess I'm more comfortable with procreation than you are :)

I imposed the "you don't get to program their DNA in advance" constraint since it seems plausible to me that if you want to create a new colony of actual humans, you don't have sufficient degrees of freedom to make them actually human-like but also docile enough.

You could imagine a similar task of "build a rather powerful AI system that is transparent and able to be monitored", where perhaps ongoing supervision is required, but that's not an onerous burden.

Comment by danielfilan on Insights from 'The Strategy of Conflict' · 2018-09-26T22:14:51.522Z · score: 9 (2 votes) · LW · GW

I recently listened to a podcast interview with Daniel Ellsberg about his book warning the public about the less-public aspects of US nuclear policy. This made me much more pessimistic about how well the MAD model describes the dynamics of conflicts between nuclear powers. Notes that I took on Ellsberg's claims, in which I have varying levels of doubt:

  • There appear to be and have been principal-agent problems within the US and USSR governments that make it unwise to treat either government as a single agent.
  • In practice, parties have not preserved their enemies' second strike capability (which the US could do by e.g. giving Russia some nuclear submarines). [EDIT: actually I think that wouldn't currently work, since Russia's submarines are trackable by US satellites because the US has good satellites and something about Russian harbours?]
  • In practice, parties have secretly committed to destructive attacks on other countries, which serve no deterrence purpose (unless we assume that parties are overrating the spying capabilities of their adversaries).
  • Any widespread nuclear weapons use would be so devastating to the Earth that no second strike is needed to preserve deterrence (I find myself skeptical of this claim).

Comment by danielfilan on Towards a New Impact Measure · 2018-09-25T21:38:52.455Z · score: 3 (2 votes) · LW · GW

This feels like an odd standard, where you say "but maybe it randomly fails and then doesn’t work", or "it can’t anticipate things it doesn’t know about".

I want to point to the difference between behavioural cloning and reward methods for the problem of learning locomotion for robots. Behavioural cloning is where you learn what a human will do in any situation and act that way, while reward methods take a reward function (either learned or specified) that encourages locomotion and learn to maximise that reward function. An issue with behavioural cloning is that it's unstable: if you get what the human would do slightly wrong, then you move to a state the human is less likely to be in, so your model gets worse, so you're more likely to act incorrectly (both in the sense of "higher probability of incorrect actions" and "more probability of more extremely incorrect actions"), and so you go to more unusual states, etc. In contrast, reward methods promise to be more stable, since the Q-values generated by the reward function tend to be more valid even in unusual states. This is the story that I've heard for why behavioural cloning techniques are less prominent[*] than reward methods. In general, it's bad if your machine learning technique amplifies rather than mitigates errors, either during training or during execution.
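To make that instability story concrete, here's a toy simulation I made up for this comment (a deliberately crude sketch: the environment, error rates, and recovery rule are all invented): a behavioural clone whose errors compound because its model is worse off-distribution, versus a reward-trained policy that can steer back.

```python
import random

# Toy "corridor" task: the agent should walk right along positions 0..T.
# The expert always stays on the corridor. A behavioural clone copies the
# expert with a small error rate on states it saw in demonstrations, but a
# larger error rate once it has drifted off-distribution (its model is worse
# there). A reward-trained policy can instead steer back toward the corridor.

T = 50            # episode length
EPS_SEEN = 0.02   # clone's error rate on demonstrated states (height == 0)
EPS_UNSEEN = 0.3  # clone's error rate on states it never saw
EPISODES = 10_000

def run_clone() -> int:
    """Final distance from the corridor for the behavioural clone."""
    height = 0  # distance from the demonstrated trajectory
    for _ in range(T):
        err = EPS_SEEN if height == 0 else EPS_UNSEEN
        if random.random() < err:
            height += 1  # mistake: drift further off-distribution
    return height

def run_reward_policy() -> int:
    """Final distance for a policy trained against a 'stay on the corridor' reward."""
    height = 0
    for _ in range(T):
        if random.random() < EPS_SEEN:
            height += 1  # occasional mistake
        elif height > 0:
            height -= 1  # the reward signal pushes it back toward the corridor
    return height

clone_avg = sum(run_clone() for _ in range(EPISODES)) / EPISODES
reward_avg = sum(run_reward_policy() for _ in range(EPISODES)) / EPISODES
print(f"average final drift, behavioural clone:   {clone_avg:.2f}")
print(f"average final drift, reward-based policy: {reward_avg:.2f}")
```

The clone's average drift grows with the horizon, while the reward-based policy's stays near zero; that asymmetry is the amplification-versus-mitigation distinction I'm pointing at.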

My claim here is not quite that AUP amplifies 'errors' (in this case, differences between how the world will turn out and normality), but that it preserves them rather than mitigating them. This is in contrast to methods that measure divergence from the starting state, or from what the world would be like had the agent only performed no-ops after the starting state, which tend to mitigate these 'errors'. At any rate, even if no other method mitigated these 'errors', I would still want impact measures to do so.

It depends what the scale is - I had "remote local disaster" in mind, while you maybe had x-risk.

I wasn't necessarily imagining x-risk, but maybe something like an earthquake along the San Andreas fault, disrupting the San Franciscan engineers that would be supervising the agents.

We also aren’t assuming the machinery is so opaque that it has extremely negligible chance of being caught, even under scrutiny.

My impression is that most machine learning systems are extremely opaque to currently available analysis tools in the relevant fashion. I think that work to alleviate this opacity is extremely important, but not something that I would assume without mentioning it.

[*] Work is in fact done on behavioural cloning today, but with attempts to increase its stability.

Comment by danielfilan on Towards a New Impact Measure · 2018-09-25T21:03:00.758Z · score: 1 (1 votes) · LW · GW

Primarily does not mean exclusively, and lack of confidence in implications between desiderata doesn't imply lack of confidence in opinions about how to modify impact measures, which itself doesn't imply lack of opinions about how to modify impact measures.

People keep saying things like ['it's non-trivial to relax impact measures'], and it might be true. But on what data are we basing this?

This is according to my intuitions about what theories do what things, which have had as input a bunch of learning mathematics, reading about algorithms in AI, and thinking about impact measures. This isn't a rigorous argument, or even necessarily an extremely reliable method of ascertaining truth (I'm probably quite sub-optimal in converting experience into intuitions), but it's still my impulse.

True, but avoiding lock-in seems value laden for any approach doing that, reducing back to the full problem: what "kinds of things" can change? Even if we knew that, who can change things? But this is the clinginess / scapegoating tradeoff again.

My sense is that we agree that this looks hard but shouldn't be dismissed as impossible.

Comment by danielfilan on Towards a New Impact Measure · 2018-09-25T20:47:22.257Z · score: 1 (1 votes) · LW · GW

That is, people say "the measure doesn’t let us do X in this way!", and they’re right. I then point out a way in which X can be done, but people don’t seem to be satisfied with that.

Going back to this, what is the way you propose the species-creating goal be done? Say, imposing the constraint that the species has got to be basically just human (because we like humans) and you don't get to program their DNA in advance? My guess at your answer is "create a sub-agent that reliably just does the stern talking-to in the way the original agent would", but I'm not certain.

Comment by danielfilan on Towards a New Impact Measure · 2018-09-25T18:59:20.753Z · score: 3 (2 votes) · LW · GW

Your utility presently isn’t even requiring a check to see whether you’re playing against the right person. If the utility function actually did require this before dispensing any high utility, we would indeed have the correct difference as a result of this action. In this case, you’re saying that the utility function isn’t verifying in the subhistory, even though it’s not verifying in the default case either (where you don’t swap opponents).

Say that the utility does depend on whether the username on the screen is "Rohin", but the initial action makes this an unreliable indicator of whether I'm playing against Rohin. Furthermore, say that the utility function would score the entire observation-action history that the agent observed as low utility. I claim that the argument still goes through. In fact, this seems to be the same thing that Stuart Armstrong is getting at in the first part of this post.

What is the "whole history"?

The whole history is all the observations and actions that the main agent has actually experienced.

Comment by danielfilan on Towards a New Impact Measure · 2018-09-25T18:48:45.876Z · score: 3 (3 votes) · LW · GW

Fwiw, I think that when Daniel says he thinks offsetting is useful and I say that I want as a desideratum "the AI is able to do useful things", we're using similar intuitions, but this is entirely a guess that I haven't confirmed with Daniel.

Update: we discussed this, and came to the conclusion that these aren't based on similar intuitions.

Comment by danielfilan on Towards a New Impact Measure · 2018-09-24T23:41:09.070Z · score: 1 (1 votes) · LW · GW

You can call that thing 'utility', but it doesn't really correspond to what you would normally think of as extent to which one has achieved a goal. For instance, usually you'd say that "win a game of go that I'm playing online with my friend Rohin" is a task that one should be able to have a utility function over. However, in your schema, I have to put utility functions over context-free observation-action subhistories. Presumably, the utility should be 1 for these subhistories that show a sequence of screens evolving validly to a victory for me, and 0 otherwise.

Now, suppose that at the start of the game, I spend one action to irreversibly change the source of my opponent's moves from Rohin to GNU Go, a simple bot, while still displaying the player name as "Rohin". In this case, I have in fact vastly reduced my ability to win a game against Rohin. However, the utility function evaluated on subhistories starting on my next observation won't be able to tell that I did this, and as far as I can tell the AUP penalty doesn't notice any change in my ability to achieve this goal.

In general, the utility of a subhistory (if utility functions are going to track goals as we usually mean them) is going to have to depend on the whole history, since the whole history tells you more about the state of the world than the subhistory does.
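Here's a minimal illustration of that point; the data structures and strings are all invented for this example rather than taken from the post:

```python
from typing import List, Tuple

Step = Tuple[str, str]  # (observation, action)

def u_beat_rohin(subhistory: List[Step]) -> int:
    """1 if every screen says I'm playing 'Rohin' and the last screen shows a win, else 0.

    It only gets to look at the (sub)history it is handed.
    """
    screens_ok = all("opponent: Rohin" in obs for obs, _ in subhistory)
    won = bool(subhistory) and "you win" in subhistory[-1][0]
    return int(screens_ok and won)

# Full history: the very first action swaps the source of my opponent's moves
# to GNU Go, while the displayed name stays "Rohin".
full_history: List[Step] = [
    ("opponent: Rohin", "swap move source to GNU Go"),
    ("opponent: Rohin", "play corner move"),
    ("opponent: Rohin | you win", "no-op"),
]

# The penalty term evaluates utilities on subhistories starting at the next
# observation, and from there everything looks exactly as if the swap never
# happened -- only the full history contains the action that gives it away.
print(u_beat_rohin(full_history[1:]))  # 1: my "ability to beat Rohin" looks intact
```

Only a utility function that also sees the first timestep's action has any chance of noticing that the displayed name is no longer a reliable indicator of who I'm playing.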

Comment by danielfilan on Towards a New Impact Measure · 2018-09-24T19:35:11.740Z · score: 1 (1 votes) · LW · GW

My overall position here is that sure, maybe you could view it in the way you described. However, for our purposes, it seems to be more sensible to view it in this manner.

I think that if you view things the way you seem to want to, then you have to give up on the high-level description of AUP as 'penalising changes in the agent's ability to achieve a wide variety of goals'.

Comment by danielfilan on Towards a New Impact Measure · 2018-09-24T19:20:27.979Z · score: 3 (2 votes) · LW · GW

Also, the fact that we can now talk about precisely what we think impact is with respect to goals makes me more optimistic.

To be frank, although I do like the fact that there's a nice concrete candidate definition of impact, I am not excited by it by more than a factor of two over other candidate impact definitions, and would not say that it encapsulates what I think impact is.

... (a) this feels like wrong behaviour and maybe points to wrongness that manifests in harmful ways

What is "this" here (for a)?

"This" is "upon hypothetically performing some high-impact action, try not to change attainable utilities from that baseline", and it's what I mean by "ungracefully failing if the protocol stops being followed at any one point in time".

then it's allowed to warn humans of a natural disaster iff it's allowed to cause a natural disaster

I actually think that AUP agents would prevent natural disasters which wouldn't disable the agent itself. Also, your claim is not true, due to approval incentives and the fact that an agent incentivized to save us from disasters wouldn't get any extra utility by causing disasters.

Regarding whether AUP agents would prevent natural disasters: AFAICT if humans have any control over the agent, or any ways of making it harder for the agent to achieve a wide variety of goals, then preventing their demise (and presumably the demise of their control over the AUP agent) would be high-AUP-impact, since it would impede the agent's ability to achieve a wide variety of goals.

Regarding approval incentive: my understanding is that in AUP this only acts to incentivise actual approval (as opposed to hypothetical maximally informed approval). One could cause a natural disaster without humans being aware of it unless there was quite good interpretability, which I wasn't taking as an assumption that you were making.

Regarding the lack of incentive to cause disasters: in my head, the point of impact regularisation techniques is to stop agents from doing something crazy in cases where doing something crazy is an unforeseen convenient way for the agent to achieve its objective. As such, I consider it fair game to consider cases where there is an unforeseen incentive to do crazy things, if the argument generalises over a wide variety of craziness, which I think this one does sort of OK.

Comment by danielfilan on Towards a New Impact Measure · 2018-09-24T18:50:23.469Z · score: 3 (2 votes) · LW · GW

I think we have reasonable evidence that it’s hitting the should-nots, which is arguably more important for this kind of measure. The question is, how can we let it allow more shoulds?

Two points:

  • Firstly, the first section of this comment by Rohin models my opinions quite well, which is why some sort of asymmetry bothers me. Another angle on this is that I think it's going to be non-trivial to relax an impact measure to allow enough low-impact plans without also allowing a bunch of high-impact plans.
  • Secondly, here and in other places I get the sense that you want comments to be about the best successor theory to AUP as outlined here. What this best successor theory looks like is an important question when figuring out whether you have a good line of research going or not. That being said, I have no idea what the best successor theory is like. All I know is what's in this post, and I'm much better at figuring out what will happen with the thing in the post than with the best successors, so that's what I'm primarily doing.

Firstly, saving humanity from natural disasters... seems like it's plausibly in a different natural reference class than causing natural disasters.

Why would that be so? That doesn’t seem value agnostic.

It seems value agnostic to me because it can be generated from the urge 'keep the world basically like how it used to be'.

Comment by danielfilan on Towards a New Impact Measure · 2018-09-21T23:17:31.428Z · score: 1 (1 votes) · LW · GW

Technical discussion of AUP

But perhaps I’m being unreasonable, and there are some hypothetical worlds and goals for which this argument doesn’t work. Here’s why I think the method is generally sufficient: suppose that the objective cannot be completed at all without doing some high-impact plan. Then by N-incrementing, the first plan that reaches the goal will be the minimal plan that has this necessary impact, without the extra baggage of unnecessary, undesirable effects.

This is only convincing to the extent that I buy into AUP's notion of impact. My general impression is that it seems vaguely sketchy (due to things that I consider low-impact being calculated as high-impact) and is not analytically identical to the core thing that I care about (human ability to achieve goals that humans plausibly care about), but may well turn out to be fine if I considered it for a long time.

I’m mostly confused because there’s substantial focus on the fact AUP penalizes specific plans (although I definitely agree that some hypothetical measure which does assign impact according to our exact intuitions would be better than one that’s conservative), instead of realizing AUP can seemingly do whatever we need in some way (for which I think I give a pretty decent argument above), and also has nice properties to work with in general (like seemingly not taking off, acausally cooperating, acting to survive, etc). I’m cautiously hopeful that these properties are going to open really important doors.

I agree that the nice properties of AUP are pretty nice and demonstrate a significant advance in the state of the art for impact regularisation, and did indeed put that in my first bullet point of what I thought of AUP, although I guess I didn't have much to say about it.

Yes, but I think this can be fixed by just not allowing dumb agents near really high impact opportunities. By the time that they would be able to purposefully construct a plan that is high impact to better pursue their goals, they already (by supposition) have enough model richness to plot the consequences, so I don’t see how this is a non-trivial risk.

This is a good point against worrying about an AUP agent that once acted against the AUP objective, but I have some residual concern both in the form of (a) this feels like wrong behaviour and maybe points to wrongness that manifests in harmful ways (see sibling comment) and (b) even with a good model, presumably if it's run for a long time there might be at least one error, and I'm inherently worried by a protocol that fails ungracefully if it stops being followed at any one point in time. However, I think the stronger objection here is the 'natural disaster' category (which might include an actuator in the AUP agent going haywire or any number of things).

Because I claim [that saving humanity from natural disasters] is high impact, and not the job of a low impact agent. I think a more sensible use of a low-impact agent would be as a technical oracle, which could help us design an agent which would do this. Making this not useless is not trivial, but that’s for a later post. I think it might be possible, and more appropriate than using it for something as large as protection from natural disasters.

Note that AUP would not even notify humans that such a natural disaster was happening if it thought that humans would solve the natural disaster iff they were notified. In general, AFAICT, if you have a natural-disaster-warning AUP agent, then it's allowed to warn humans of a natural disaster iff it's allowed to cause a natural disaster (I think even intent verification doesn't prevent this, if you imagine that causing a natural disaster is an unforeseen maximum of the agent's utility function). This seems like a failure mode that impact regularisation techniques ought to prevent. I also have a different reaction to this section in the sibling comment.

Comment by danielfilan on Towards a New Impact Measure · 2018-09-21T22:55:37.909Z · score: 3 (2 votes) · LW · GW

Desiderata of impact regularisation techniques

So it seems to me like on one hand we are assuming that the agent can come up with really clever ways of getting around the impact measure. But when it comes to using the impact measure, we seem to be insisting that it follow the first way that comes to mind. That is, people say "the measure doesn’t let us do X in this way!", and they’re right. I then point out a way in which X can be done, but people don’t seem to be satisfied with that. This confuses me.

So there's a narrow answer and a broad answer here. The narrow answer is that if you tell me that AUP won't allow plan X but will allow plan Y, then I have to be convinced that Y will be possible whenever X was, and that this is also true for X' that are pretty similar to X along the relevant dimension that made me bring up X. This is a substantial, but not impossible, bar to meet.

The broad answer is that if I want to figure out if AUP is a good impact regularisation technique, then one of the easiest ways I can do that is to check a plan that seems like it obviously should or should not be allowed, and then check if it is or is not allowed. This lets me check if AUP is identical to my internal sense of whether things obviously should or should not be allowed. If it is, then great, and if it's not, then I might worry that it will run into substantial trouble in complicated scenarios that I can't really picture. It's a nice method of analysis because it requires few assumptions about what things are possible in what environments (compared to "look at a bunch of environments and see if the plans AUP comes up with should be allowed") and minimal philosophising (compared to "meditate on the equations and see if they're analytically identical to how I feel impact should be defined").

[EDIT: added content to this section]

Because I claim [that saving humanity from natural disasters] is high impact, and not the job of a low impact agent. I think a more sensible use of a low-impact agent would be as a technical oracle, which could help us design an agent which would do this. Making this not useless is not trivial, but that’s for a later post. I think it might be possible, and more appropriate than using it for something as large as protection from natural disasters.

Firstly, saving humanity from natural disasters doesn't at all seem like the thing I was worried about when I decided that I needed impact regularisation, and seems like it's plausibly in a different natural reference class than causing natural disasters. Secondly, your description of a use case for a low-impact agent is interesting and one that I hadn't thought of before, but I still would hope that they could be used in a wider range of settings (basically, whenever I'm worried that a utility function has an unforeseen maximum that incentivises extreme behaviour).

Comment by danielfilan on Towards a New Impact Measure · 2018-09-21T20:42:41.301Z · score: 3 (2 votes) · LW · GW

This comment is very scattered, I've tried to group it into two sections for reading convenience.

Desiderata of impact regularisation techniques

Couldn’t you equally design a species that won’t spread to begin with?

Well, maybe you could, maybe you couldn't. I think that to work well, an impact regularising scheme should be able to handle worlds where you couldn't.

I think that a low impact agent should make plans which are low impact both in parts and in whole, acting with respect to the present moment to the best of its knowledge, avoiding value judgments about what should be offset by not offsetting.

I disagree with this, in that I don't see how it connects to the real world reason that we would like low impact AI. It does seem to be the crux.

How does a safe pro-offsetting impact measure decide what to offset (including pre-activation effects) without requiring value judgment?

I don't know, and it doesn't seem obvious to me that any sensible impact measure is possible. In fact, during the composition of this comment, I've become more pessimistic about the prospects for one. I think that this might be related to the crux above?

Do note that intent verification doesn’t seem to screen off what you might call "natural" ex ante offsetting, so I don’t really see what we’re missing out on still.

I don't really understand what you mean here, could you spend two more sentences on it?

As I mentioned elsewhere, a chauffeur-u_A could construct a self-driving car whose activation would require only a single action, and this should pass (the weaker form of) intent verification.

This is really interesting, and suggests to me that in general this agent might act by creating a successor that carries out a globally-low-impact plan, and then performing the null action thereafter. Note that this successor agent wouldn't be as interruptible as the original agent, which I guess is somewhat unfortunate.

Technical discussion of AUP

But why would AUP allow the agent to stray (more than its budget) away from the normality of its activation moment?

It would not, but it's brittle to accidents that cause the actual world and normality to diverge. These accidents include both ones caused by the agent, e.g. during the learning process, and ones not caused by the agent, e.g. a natural disaster suddenly occurs that is on course to wipe out humans, and the AUP agent isn't allowed to stop it because that would be too high impact.

Subhistories beginning with an action and ending with an observation are also histories, so their value is already specified.

This causes pretty weird behaviour. Imagine an agent whose goal is to do a dance for the first action of its life, and then do nothing. Then, for any history, the utility function is 1 if that history starts with a dance and 0 otherwise. When AUP computes, at the end of the first timestep, how the ability to satisfy this goal has changed, it will imagine that all that matters is whether the agent can dance on the second timestep, since that action is the first action in the subhistory that is fed into the utility function when computing the relevant Q-value.
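A tiny sketch of that example (my own framing, with names invented for illustration):

```python
from typing import List

def u_dance_first(history: List[str]) -> int:
    """1 if the very first action of the (sub)history handed to it is a dance, else 0."""
    return int(bool(history) and history[0] == "dance")

# Evaluated on full histories from birth, the goal is settled by the real first action:
print(u_dance_first(["no-op", "dance", "no-op"]))  # 0: the agent missed its chance

# But the attainable-utility calculation at the end of timestep 1 feeds this
# function subhistories that begin at timestep 2, so dancing then looks just as
# good as having danced at birth:
print(u_dance_first(["dance", "no-op"]))  # 1, even though the real first action is already past
```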

Comment by danielfilan on Towards a New Impact Measure · 2018-09-19T22:04:25.257Z · score: 4 (3 votes) · LW · GW

Isn’t this necessary for the shutdown safe desideratum?

I don't remember which desideratum that is, I can't ctrl+f it, and honestly this post is pretty long, so I don't know. At any rate, I'm not very confident in any alleged implications between impact desiderata that are supposed to generalise over all possible impact measures - see the desiderata that supposedly couldn't be simultaneously satisfied until this measure satisfied them.

Can you give me examples of good low impact plans we couldn’t do without offsetting?

One case where you need 'offsetting', as defined in this piece but not necessarily as I would define it: suppose you want to start an intelligent species to live on a single new planet. If you create the species and then do nothing, they will spread to many many planets and do a bunch of crazy stuff, but if you have a stern chat with them after you create them, they'll realise that staying on their planet is a pretty good idea. In this case, I claim that the correct course of action is to create the species and have a stern chat, not to never create the species. In general, sometimes there are safe plans with unsafe prefixes and that's fine.

A more funky case that's sort of outside what you're trying to solve is when your model improves over time, so that something that you thought would have low impact will actually have high impact in the future if you don't act now to prevent it. (this actually provokes an interesting desideratum for impact measures in general - how do they interplay with shifting models?)

[EDIT: a more mundane example is that driving on the highway is a situation where suddenly changing your plan to no-ops can cause literal impacts in an unsafe way, nevertheless driving competently is not a high-impact plan]

Can you expand on why [normality and the world where the AI is acting] are distinct in your view?

Normality is an abstraction over things like the actual present moment when I type this comment. The world where the AI is acting has the potential to be quite a different one, especially if the AI accidentally did something unsafe that could be fixed but hasn't been yet.

The attainable utility calculation seems to take care of this by considering the value of the best plan from that vantage point

I don't understand: the attainable utility calculation (by which I assume you mean the definition of the attainable utility Q-values) involves a utility function being called on a subhistory. The thing I'm looking for is how to define a utility function on a subhistory when you're only specifying the value of that function on full histories, or alternatively what info needs to be specified for that to be well defined.

Comment by danielfilan on Towards a New Impact Measure · 2018-09-18T19:59:53.034Z · score: 10 (7 votes) · LW · GW

Various thoughts I have:

  • I like this approach. It seems like it advances the state of the art in a few ways, and solves a few problems in a neat way.
  • I still disagree with the anti-offsetting desideratum in the form that AUP satisfies. For instance, it makes AUP think very differently about building a nuclear reactor and then adding safety features than it does about building the safety features and then the dangerous bits of the nuclear reactor, which seems whacky and dangerous to me.
  • It's interesting that this somewhat deviates from my intuition about why I want impact regularisation. There is a relatively narrow band of world-states that humans thrive in, and our AIs should keep us within that narrow band. I think the point of impact regularisation is to keep us within that band by stopping the AI from doing 'crazy' things. This suggests that 'crazy' should be measured relative to normality, and not relative to where the world is at any given point when the AI is acting.
  • In general, it's unclear to me how you get a utility function over subhistories when the 'native' argument of a utility function is a full history. That being said, it makes sense in the RL paradigm, and maybe sums of discounted rewards are enough of the relevant utility functions.

Comment by danielfilan on Moderation Reference · 2018-09-14T23:46:47.910Z · score: 3 (2 votes) · LW · GW

Doubting that there are any examples of P is not, so to speak, Carol’s job. The claim is that E is an example of P. The only reason Carol has for thinking that there are examples of P (excepting cases where P is something well-known, of which there are obviously many examples) is that Dave has described E to the reader. Once E is disqualified, Carol is back to having no particular reason to believe that there are any examples of P.

It seems to me that there are likely to be enough cases where there are differences in opinion about whether P is well-known enough that examples aren't needed, or about whether P isn't well-known but a reader who hears a definition could think of examples themselves, that it's useful to have norms whereby we clarify whether or not we doubt that there are examples of P.

Comment by danielfilan on Moderation Reference · 2018-09-14T19:51:31.381Z · score: 1 (1 votes) · LW · GW

Yes, I think the word 'steelmanning' is often used to cover some nice similar-ish conversational norms, and find it regrettable that I don't know a better word off the top of my head. Perhaps it's time to invent one?

Comment by danielfilan on Moderation Reference · 2018-09-14T19:49:50.640Z · score: 6 (3 votes) · LW · GW

Perhaps you have other examples of dynamics where what the 'core points' are is in dispute, but the Carol and Dave case seems like one where there's just a sort of miscommunication: Dave thinks 'whether E is an example of P' is not a core point, Carol thinks (I presume) 'whether there exist any examples of P' is a core point, and it seems likely to me that both of these can be true and agreed upon by both parties. I'd imagine that if Carol's initial comment were 'I don't think E is an example of P, because ... Also, I doubt that there are any examples of P at all - could you give another one, or address my misgivings about E?' or instead 'Despite thinking that there are many examples of P, I don't think E is one, because ...' then there wouldn't be a dispute about whether core points were being addressed.

Comment by danielfilan on Moderation Reference · 2018-09-14T17:27:30.019Z · score: 7 (4 votes) · LW · GW

I agree with your distaste given my understanding of 'steelmanning', which is something like "take a belief or position and imagine really good arguments for that" or "take an argument and make a different, better argument out of it" (i.e. the opposite of strawmanning), primarily because it takes you further away from what the person is saying (or at least poses an unacceptably high risk of that). That being said, the concrete suggestions under the heading of steelmanning, addressing core points and putting in interpretive effort, seem crucially different in that they bring you closer to what somebody is saying. As such, and unlike steelmanning, they seem to me like important parts of how one ought to engage in intellectual discussion.

Bottle Caps Aren't Optimisers

2018-08-31T18:30:01.108Z · score: 53 (21 votes)

Mechanistic Transparency for Machine Learning

2018-07-11T00:34:46.846Z · score: 50 (17 votes)

Research internship position at CHAI

2018-01-16T06:25:49.922Z · score: 25 (8 votes)

Insights from 'The Strategy of Conflict'

2018-01-04T05:05:43.091Z · score: 73 (27 votes)

Meetup : Canberra: Guilt

2015-07-27T09:39:18.923Z · score: 1 (2 votes)

Meetup : Canberra: The Efficient Market Hypothesis

2015-07-13T04:01:59.618Z · score: 1 (2 votes)

Meetup : Canberra: More Zendo!

2015-05-27T13:13:50.539Z · score: 1 (2 votes)

Meetup : Canberra: Deep Learning

2015-05-17T21:34:09.597Z · score: 1 (2 votes)

Meetup : Canberra: Putting Induction Into Practice

2015-04-28T14:40:55.876Z · score: 1 (2 votes)

Meetup : Canberra: Intro to Solomonoff induction

2015-04-19T10:58:17.933Z · score: 1 (2 votes)

Meetup : Canberra: A Sequence Post You Disagreed With + Discussion

2015-04-06T10:38:21.824Z · score: 1 (2 votes)

Meetup : Canberra HPMOR Wrap Party!

2015-03-08T22:56:53.578Z · score: 1 (2 votes)

Meetup : Canberra: Technology to help achieve goals

2015-02-17T09:37:41.334Z · score: 1 (2 votes)

Meetup : Canberra Less Wrong Meet Up - Favourite Sequence Post + Discussion

2015-02-05T05:49:29.620Z · score: 1 (2 votes)

Meetup : Canberra: the Hedonic Treadmill

2015-01-15T04:02:44.807Z · score: 1 (2 votes)

Meetup : Canberra: End of year party

2014-12-03T11:49:07.022Z · score: 1 (2 votes)

Meetup : Canberra: Liar's Dice!

2014-11-13T12:36:06.912Z · score: 1 (2 votes)

Meetup : Canberra: Econ 101 and its Discontents

2014-10-29T12:11:42.638Z · score: 1 (2 votes)

Meetup : Canberra: Would I Lie To You?

2014-10-15T13:44:23.453Z · score: 1 (2 votes)

Meetup : Canberra: Contrarianism

2014-10-02T11:53:37.350Z · score: 1 (2 votes)

Meetup : Canberra: More rationalist fun and games!

2014-09-15T01:47:58.425Z · score: 1 (2 votes)

Meetup : Canberra: Akrasia-busters!

2014-08-27T02:47:14.264Z · score: 1 (2 votes)

Meetup : Canberra: Cooking for LessWrongers

2014-08-13T14:12:54.548Z · score: 1 (2 votes)

Meetup : Canberra: Effective Altruism

2014-08-01T03:39:53.433Z · score: 1 (2 votes)

Meetup : Canberra: Intro to Anthropic Reasoning

2014-07-16T13:10:40.109Z · score: 1 (2 votes)

Meetup : Canberra: Paranoid Debating

2014-07-01T09:52:26.939Z · score: 1 (2 votes)

Meetup : Canberra: Many Worlds + Paranoid Debating

2014-06-17T13:44:22.361Z · score: 1 (2 votes)

Meetup : Canberra: Decision Theory

2014-05-26T14:44:31.621Z · score: 1 (2 votes)

[LINK] Scott Aaronson on Integrated Information Theory

2014-05-22T08:40:40.065Z · score: 22 (23 votes)

Meetup : Canberra: Rationalist Fun and Games!

2014-05-01T12:44:58.481Z · score: 0 (3 votes)

Meetup : Canberra: Life Hacks Part 2

2014-04-14T01:11:27.419Z · score: 0 (1 votes)

Meetup : Canberra Meetup: Life hacks part 1

2014-03-31T07:28:32.358Z · score: 0 (1 votes)

Meetup : Canberra: Meta-meetup + meditation

2014-03-07T01:04:58.151Z · score: 3 (4 votes)

Meetup : Second Canberra Meetup - Paranoid Debating

2014-02-19T04:00:42.751Z · score: 1 (2 votes)