Comment by turntrout on Best reasons for pessimism about impact of impact measures? · 2019-04-22T23:55:35.261Z · score: 14 (2 votes) · LW · GW

So there's a thing people do when they talk about AUP which I don't understand. They think it's about state, even though I insist it's fundamentally different, and try to explain why (note that AUP in the MDP setting is necessarily over states, because states are the observations). My explanations apparently haven't been very good; in the given conversation, they acknowledge that it's different, but then regress a little while later. I think they might be trying to understand the explanation, remain confused, and then subconsciously slip back to their old model. Out of everyone I've talked to, I can probably count on my hands the number of people who get this – note that agreeing with specific predictions of mine is different.

Now, it's the author's job to communicate their ideas. When I say "as far as I can tell, few others have internalized how AUP actually works", this doesn't connote "gosh, I can't stand you guys, how could you do this", it's more like "somehow I messed up the explanations; I wonder what key ideas are missing still? How can I fix this?".

My goal with this comment isn't to explain, but rather to figure out what's happening. Let's go through some of my past comments about this.

Surprisingly, the problem comes from thinking about "effects on the world". Let's begin anew.
To scale, relative reachability requires solution of several difficult ontological problems which may not have anything close to a simple core, including both a sensible world state representation and a perfect distance metric. Relative reachability isn't ontology-agnostic.
In the long term, the long arms of opportunity cost and instrumental convergence plausibly allow us to toss in a random set of utility functions. I expect this to work for the same reasons we worry about instrumental convergence to begin with.
I have a theory that AUP seemingly works for advanced agents not because the content of the attainable set's utilities actually matters, but rather because there exists a common utility achievement currency of power.
Here, we’re directly measuring the agent’s power: its ability to wirehead a trivial utility function.
The plausibility of [this] makes me suspect that even though most of the measure in the unbounded case is not concentrated on complex human-relevant utility functions, the penalty still captures shifts in power.
By changing our perspective from "what effects on the world are 'impactful'?" to "how can we stop agents from overfitting their environments?", a natural, satisfying definition of impact falls out.
Towards a New Impact Measure

When I read this, it seems like I'm really trying to emphasize that I don't think the direct focus should be on the world state in any way. But it was a long post, and I said a lot of things, so I'm not too surprised.

I tried to nip this confusion in the bud.

"The biggest difference from relative reachability, as I see it, is that you penalize increasing the ability to achieve goals, as well as decreasing it."
I strongly disagree that this is the largest difference, and I think your model of AUP might be some kind of RR variant.
Consider RR in the real world, as I imagine it (I could be mistaken about the details of some of these steps, but I expect my overall point holds). We receive observations, which, in combination with some predetermined ontology and an observation history -> world state function, we use to assign a distribution over possible physical worlds. We also need another model, since we need to know what we can do and reach from a specific world configuration. Then, we calculate another distribution over world states that we’d expect to be in if we did nothing. We also need a distance metric weighting the importance of different discrepancies between states. We have to calculate the coverage reduction of each action-state (or use representative examples, which is also hard-seeming), with respect to each start-state, weighted using our initial and post-action distributions. We also need to figure out which states we care about and which we don’t, so that’s another weighting scheme. But what about ontological shift?
This approach is fundamentally different. We cut out the middleman, considering impact to be a function of our ability to string together favorable action-observation histories, requiring only a normal model. The “state importance"/locality problem disappears. Ontological problems disappear. Some computational constraints (imposed by coverage) disappear. The "state difference weighting" problem disappears. Two concepts of impact are unified.
I’m not saying RR isn’t important - just that it’s quite fundamentally different, and that AUP cuts away a swath of knotty problems because of it.
~ my reply to your initial comment on the AUP post

Even more confusing is that when I say "there are fundamental concepts here you're missing", people don't seem to become any less confident in their predictions about what AUP does. If people think that AUP is penalizing effects in the world, why don't they notice their confusion when they read a comment like the one above?

A little earlier,

Thinking in terms of "effects" seems like a subtle map/territory confusion. That is, it seems highly unlikely that there exists a robust, value-agnostic means of detecting "effects" that makes sense across representations and environments.
Impact Measure Desiderata

I also might've expected some people to wonder, given their state interpretation, why I'm no longer worried about things I mentioned in the whitelisting post, since I strangely don't think the representation or state similarity metric matters for AUP:

due to entropy, you may not be able to return to the exact same universe configuration.
Worrying about the Vase: Whitelisting

(This is actually your "chaotic world" concern.)

Right now, I'm just chalking this up to "Since the explanations don't make any sense because they're too inferentially distant/it just looks like I built a palace of equations, it probably seems like I'm not on the same page with their concerns, so there's nothing to be curious about." Can you give me some of your perspective? (Others are welcome to chime in.)

To directly answer your question: no, the real world version of AUP which I proposed doesn't reward based on state, and would not have its behavior influenced solely by different possible arrangements of air molecules. (I guess I'm directly responding to this concern, but I don't see any other way to get information on why this phenomenon is happening.)

As for the question – I was just curious. I think you'll see why I asked when I send you some drafts of the new sequence. :)

Comment by turntrout on Best reasons for pessimism about impact of impact measures? · 2019-04-20T15:47:36.275Z · score: 2 (1 votes) · LW · GW

I meant that for attainable set consisting of random utility functions, I would expect most of the variation in utility to be based on irrelevant factors like the positions of air molecules.

Are you thinking of an action observation formalism, or some kind of reward function over inferred state?


If you had to pose the problem of impact measurement as a question, what would it be?

Comment by turntrout on Best reasons for pessimism about impact of impact measures? · 2019-04-19T15:58:15.780Z · score: 4 (2 votes) · LW · GW

Thanks for the detailed list!

AU with random utility functions, which would mostly end up rewarding specific configurations of air molecules.

What does this mean, concretely? And what happens with the survival utility function being the sole member of the attainable set? Does this run into that problem, in your model?

Humans get around this by only counting easily predictable effects as impact that they are considered responsible for.

What makes you think that?

Comment by turntrout on Simplified preferences needed; simplified preferences sufficient · 2019-04-19T15:54:21.296Z · score: 2 (1 votes) · LW · GW

people working in these areas don't often disagree with this formal argument; they just think it isn't that important.

I do disagree with this formal argument in that I think it’s incorrectly framed. See the difference between avoiding huge impact to utility and avoiding huge impact to attainable utility, discussed here:

Comment by turntrout on Best reasons for pessimism about impact of impact measures? · 2019-04-17T22:47:27.182Z · score: 2 (1 votes) · LW · GW
this only works if we specified the goal and the cost correctly

Wait, why doesn't it work if you just specify the cost (impact) correctly?

Comment by turntrout on Towards a New Impact Measure · 2019-04-14T01:56:22.896Z · score: 2 (1 votes) · LW · GW

(The post defines the mathematical criterion used for what I call intent verification, it’s not a black box that I’m appealing to.)

Comment by turntrout on Towards a New Impact Measure · 2019-04-13T17:36:26.620Z · score: 2 (1 votes) · LW · GW

I think there's some variance, but not as much as you have in mind. Even if there were a very large value, however, this isn't how N-incrementation works (in the post – if you're thinking of the paper, then yes, the version I presented there doesn't bound lifetime returns and therefore doesn't get the same desirable properties as in the post). If you'll forgive my postponing this discussion, I'd be interested in hearing your thoughts after I post a more in-depth exploration of the phenomenon?

Comment by turntrout on Towards a New Impact Measure · 2019-04-13T17:32:34.475Z · score: 2 (1 votes) · LW · GW

I don't think I agree, but even if trust did work like this, how exactly does taking over the world not increase the Q-values? Even if the code doesn't supply reward for other reward functions, the agent now has a much more stable existence. If you're saying that the stable existence only applies for agents maximizing the AUP reward function, then this is what intent verification is for.

Notice something interesting here: the thing which would be Goodharted upon without intent verification isn't the penalty itself per se, but rather the structural properties of the agent design – the counterfactuals, the fact that it's a specific agent with I/O channels, and so on. More on this later.

Comment by turntrout on Towards a New Impact Measure · 2019-04-13T17:24:43.628Z · score: 2 (1 votes) · LW · GW

Where did I purport that it was safe for AGI in the paper, or in the post? I specifically state that I'm not making that point yet, although I'm pretty sure we can get there.

There is a deeper explanation which I didn't have space to fit in the paper, and I didn't have the foresight to focus on when I wrote this post. I agree that it calls out for more investigation, and (this feels like a refrain for me at this point) I'll be answering this call in a more in-depth sequence on what is actually going on at a deep level with AUP, and how fundamental the phenomenon is to agent-environment interaction.

I don't remember how I found the first version, I think it was in a Google search somehow?

Comment by turntrout on Best reasons for pessimism about impact of impact measures? · 2019-04-11T23:14:17.202Z · score: 4 (2 votes) · LW · GW

What would you predict AUP does for the chaotic scenarios? Suppose the attainable set just includes the survival utility function, which is 1 if the agent is activated and 0 otherwise.

Comment by turntrout on Best reasons for pessimism about impact of impact measures? · 2019-04-11T23:06:08.113Z · score: 9 (3 votes) · LW · GW
I don't see how representation invariance addresses this concern.

I think my post was basically saying "representation selection seems like a problem because people are confused about the type signature of impact, which is actually a thing you can figure out no matter what you think the world is made of". I don't want to go into too much detail here (as I explained below), but part of what this implies is that discrete "effects" are fake/fuzzy mental constructs/not something to think about when designing an impact measure. In turn, this would mean we should ask a different question that isn't about butterfly effects.

Comment by turntrout on Towards a New Impact Measure · 2019-04-11T16:23:30.143Z · score: 2 (1 votes) · LW · GW

1) Why wouldn't gaining trust be useful for other rewards? I think that it wouldn't be motivated to do so, because the notion of gaining power seems to be deeply intertwined with the notion of heavy maximization. It might attempt to Goodhart our particular way of measuring impact; the fact that we are actually measuring goal achievement ability from a particular vantage point and are using a particular counterfactual structure means that there could be cheeky ways of tricking that structure. This is why intent verification is a thing in this longer post. However, I think the attainable utility measure itself is correct.

2) This doesn't appear in the paper, but I do talk about it in the post, and I think it's great that you raise this point. Attainable utility preservation says that impact is measured along the arc of your actions, taking into account the deviation of the Q functions at each step compared to doing nothing. If you can imagine making your actions more and more granular (at least, up to a reasonably fine level), it seems like there should be a well-defined limit that the coarser representations approximate. In other words, since impact is measured along the arc of your actions, if your differential elements are chunky, you're not going to get a very good approximation. I think there are good reasons to suspect that in the real world, the way we think about actions is granular enough to avoid this dangerous phenomenon.

3) This is true. My stance here is that this is basically a capabilities problem/a safe exploration issue, which is disjoint from impact measurement.

4) This is why we want to slowly increment N. This should work whether it's a human policy or a meaningless string of text. The reason for this is that even if the meaningless string is very low impact, N eventually gets large enough to let the agent do useful things; conversely, if the human policy is more aggressive, we stop incrementing sooner and avoid giving too much leeway.
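
The incrementation argument in point 4 can be illustrated with a minimal sketch. Everything named here (`increment_until_useful`, `run_agent`, `is_acceptable`, `toy_agent`, and the toy numbers) is hypothetical and chosen for illustration; it is not the procedure from the post, only the shape of the reasoning: a smaller impact unit needs a larger N before the agent does anything useful, and a larger unit lets us stop sooner.

```python
from typing import Callable, Optional


def increment_until_useful(
    run_agent: Callable[[int], str],
    is_acceptable: Callable[[str], bool],
    n_max: int,
) -> Optional[int]:
    """Raise the allowance N one step at a time and stop at the first N
    whose resulting behavior the overseer accepts.

    Returns None if no N up to n_max is enough.
    """
    for n in range(1, n_max + 1):
        behavior = run_agent(n)
        if is_acceptable(behavior):
            return n
    return None


def toy_agent(unit: float, required: float = 10.0) -> Callable[[int], str]:
    """Made-up agent: it does something useful once N * unit covers the
    impact budget that the task requires."""
    return lambda n: "useful" if n * unit >= required else "idle"


def accept(behavior: str) -> bool:
    """Made-up overseer check."""
    return behavior == "useful"


# A very low-impact unit (for example, a meaningless string) needs a larger N.
n_small_unit = increment_until_useful(toy_agent(0.5), accept, n_max=100)
# A more aggressive unit reaches usefulness sooner, so we stop incrementing earlier.
n_large_unit = increment_until_useful(toy_agent(5.0), accept, n_max=100)
```

In this toy setup the product of N and the unit size is what matters, which is why the choice of unit does not need to be meaningful in itself.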

Comment by turntrout on Best reasons for pessimism about impact of impact measures? · 2019-04-10T23:23:56.111Z · score: 2 (1 votes) · LW · GW

Is there a central example you have in mind for this potential failure mode?

Comment by turntrout on Best reasons for pessimism about impact of impact measures? · 2019-04-10T23:21:16.945Z · score: 13 (-1 votes) · LW · GW

I do plan on pushing back on certain concerns, but I think if I did so now, some of my reasons for believing things would seem weird and complicated-enough-to-be-shaky because of inferential distance. The main pedagogical mistake I made with Towards a New Impact Measure wasn't putting too much in one post, but rather spending too much time on conclusions, telling people what I think happens without helping build in them the intuitions and insights which generate those results. Over the last 8 months, I think I've substantially enriched my model of how agents interact with their environments. I'm interested in seeing how many disagreements melt away when these new insights are properly shared and understood, and what people still disagree with me on. That's why I'm planning on waiting until my upcoming sequence to debate these points.

I am comfortable sharing those concerns which I have specific reason to believe don't hold up. However, I'm wary of dismissing them in a way that doesn't include those specific reasons. That seems unfair. If you're curious which ones I think these are, feel free to ask me over private message.

Comment by turntrout on Best reasons for pessimism about impact of impact measures? · 2019-04-10T22:51:39.468Z · score: 2 (1 votes) · LW · GW

How does this concern interact with the effective representation invariance claim I made when introducing AUP?

Comment by turntrout on Best reasons for pessimism about impact of impact measures? · 2019-04-10T18:21:38.815Z · score: 3 (2 votes) · LW · GW
Moreover, if in the moment T2 it turns out that the person, who crashed his head at T1, was next Hitler, when again preserving the vase in the T0 becomes a low impact event

Why does it become a low impact event in your eyes?

In other words, to calculate the real impact of an event we have to calculate its consequences until the end of eternity, which is impossible.

What is your model of the thing that people are doing in their heads when they say things are impactful or not?

longterm impact can't be calculated without knowing actual human values, so lowering the impact is the same task as AI alignment, and low impact task can't replace AI alignment task or be solved separately.

Can you elaborate on this, and what your definition of impact is? I think we might be thinking of different things, and I'm interested as to what yours is.

Best reasons for pessimism about impact of impact measures?

2019-04-10T17:22:12.832Z · score: 75 (16 votes)
Comment by turntrout on Impact Measure Desiderata · 2019-04-09T01:58:56.799Z · score: 4 (2 votes) · LW · GW

Sounds good. I’m currently working on a long sequence walking through my intuitions and assumptions in detail.

Comment by turntrout on Impact Measure Desiderata · 2019-04-05T17:07:19.328Z · score: 2 (1 votes) · LW · GW
But creating extreme suffering might not actually involve doing much in the physical world (compared to "normal" actions the AI would have to take to achieve the goals that we gave it). What if, depending on the goals we give the AI, doing this kind of extortion is actually the lowest impact way to achieve some goal?

Since there are a lot of possible scenarios, each of which affects the optimization differently, I'm hesitant to use a universal quantifier here without more details. However, I am broadly suspicious of AUP agents choosing plans which involve almost maximally offensive components, even accounting for the fact that it could try to do so surreptitiously. An agent might try to extort us if it expected we would respond, but respond with what? Although impact measures quantify things in the environment, that doesn't mean they're measuring how "similar" two states look to the eye. AUP penalizes distance traveled in the Q function space for its attainable utility functions. We also need to think about the motive for the extortion – if it means the agent gains in power, then that is also penalized.
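
To make "distance traveled in the Q function space" concrete, here is a minimal sketch, assuming tabular Q-values for each attainable utility function are already available for the current state. The names `aup_penalty`, `QTable`, and `q_survival`, and all the numbers, are hypothetical; scaling and normalization of the penalty are omitted.

```python
from typing import Dict, Hashable, List

# Hypothetical representation: one table per attainable utility function,
# mapping each available action to that utility's Q-value in the current state.
QTable = Dict[Hashable, float]


def aup_penalty(q_values: List[QTable], action: Hashable, noop: Hashable) -> float:
    """Sum over attainable utilities of |Q_u(action) - Q_u(noop)|.

    Because of the absolute value, gaining the ability to achieve a goal is
    penalized just like losing it.
    """
    return sum(abs(q[action] - q[noop]) for q in q_values)


# Toy numbers (made up): a single "survival" utility whose Q-value is the
# chance of still being activated later on.
q_survival = {"noop": 0.9, "build_factory": 0.99, "self_destruct": 0.0}

penalty_noop = aup_penalty([q_survival], "noop", "noop")
penalty_power = aup_penalty([q_survival], "build_factory", "noop")
penalty_off = aup_penalty([q_survival], "self_destruct", "noop")
```

Note that nothing in this computation compares how two world states look; the only inputs are the agent's own attainable-utility estimates with and without the action.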

Maybe it could extort a different group of humans, and as part of the extortion force them to keep it secret from people who could turn it off? Or extort us and as part of the extortion force us to not turn it off (until we were going to turn it off anyway)?

Again, it depends on the objective of the extortion. As for the latter, that wouldn't be credible, since we would be able to tell its threat was the last action in its plan. AUP isolates the long-term effects of each action by having the agent stop acting for the rest of the epoch; this gives us a counterfactual opportunity to respond to that action.

I'm not sure whether this belongs in the desiderata, since we're talking about whether temporary object level bad things could happen. I think it's a bonus to think that there is less of a chance of that, but not the primary focus of the impact measure. Even so, it's true that we could explicitly talk about what we want to do with impact measures, adding desiderata like "able to do reasonable things" and "disallows catastrophes from rising to the top of the preference ordering". I'm still thinking about this.

Comment by turntrout on Impact Measure Desiderata · 2019-04-05T01:52:40.691Z · score: 2 (1 votes) · LW · GW

I think so. First, AUP seems to bound "how hard the agent tries" (in the physical world with its actions); the ambitions of such an agent seem rather restrained. Second, AUP provides a strong counterfactual approval incentive. While it doesn't rule out the possibility of physical suffering, the agent is heavily disincentivized from actions which would substantially change the likelihood we keep it activated (comparing how likely it is to be turned off if it doesn't do the thing, with the likelihood if it does the thing and then waits for a long time). It would basically have to be extremely sure it could keep it secret, which seems rather unlikely considering the other aspects of the behavior of AUP agents. If I understand the extortion scenario correctly, it would have to be extorting us, so it couldn't keep it secret, so it would be penalized and it wouldn't do it.

I think similar arguments involving counterfactual approval apply for similar things we may want to avoid.

Comment by turntrout on Impact Measure Desiderata · 2019-04-05T00:58:21.543Z · score: 2 (1 votes) · LW · GW

It's true that impact measures, and AUP in particular, don't do anything to mitigate mindcrime. Part of this is because aspects of the agent's reasoning process can't be considered impactful in the non-embedded formalisms we're currently stuck with. Part of this is because it seems like a separate problem. Mindcrime would indeed be very bad, and a unique kind of catastrophe not meant to be covered by my claim above.

However, I'm skeptical that that goal is actually a component of our terminal preferences. What is doing the causing – are you thinking "never have an AI cause an instance of that"? Why would that be part of our terminal preferences? If you mean "never have this happen", we've already lost.

It seems more like we really, really don't want any of that to happen, and the less happens, the better. Like I said, the point isn't that the agent will never do it, but that any bad things can be recovered from. This seems alright to me, as far as impact measures are concerned.

More generally, if we did have a goal of this type, it would be the case that if we learned that a particular thing had happened at any point in the past in our universe, that universe would be partially ruined for us forever. That just doesn't sound right.

Comment by turntrout on Impact Measure Desiderata · 2019-04-04T22:02:20.199Z · score: 4 (2 votes) · LW · GW

Short answer: yes; if its goal is to break vases, that would be pretty reasonable.

Longer answer: The AUP theory of low impact says that impact is relative to the environment and to the agent's vantage point therein. In Platonic gridworlds like this:

knowing whether a vase is present tells you a lot about the state, and you can't replace the vase here, so breaking it is a big deal (according to AUP). If you could replace the vase, there would still be a lesser impact. AUP would say to avoid breaking unnecessary vases due to the slight penalty, since the goal presumably doesn't require breaking the vase – so why not go around?

On the other hand, in the Go example, winning is the agent's objective. Depending on how the agent models the world (as a real-world agent playing a game on a computer, or whether it thinks it's just Platonically interacting with a Go environment), penalties get applied differently. In the former case, I don't think it would incur much penalty for being good at a game (modulo approval incentives it may or may not predict). In the latter case, you'd probably need to keep giving it more impact allowance until it's playing as well as you'd like. This is because the goal is related to the thing which has a bit of impact.

Comment by turntrout on Impact Measure Desiderata · 2019-04-04T20:38:01.181Z · score: 4 (2 votes) · LW · GW

Reading through this again, I think I have a better response to this part.

We could have valued human superiority at game-playing very highly, in which case AlphaGo would be catastrophic. How can an impact measure allow that without having at least some knowledge about values?

A low impact agent could beat us at games while still preserving our ability to beat it at games (by, for example, shutting it off). Of course, you could say "what if being beaten even once is a catastrophe, such that it destroys our ability to be undefeated forever", but it seems like our goals are simply not of this form. In other words, it seems that actual catastrophes do destroy our ability to achieve different goals, while more benign things don't. If the bad things the agent does can be recovered from, then I think the impact measure has done its job.

Comment by turntrout on More realistic tales of doom · 2019-03-27T15:56:34.332Z · score: 2 (1 votes) · LW · GW

And if the internal goal doesn’t permit that? I’m trying to feel out which levels of meta are problematic in this situation.

Comment by turntrout on More realistic tales of doom · 2019-03-18T23:29:42.746Z · score: 4 (2 votes) · LW · GW

So the concern here is that even if the goal, say, robustly penalizes gaining influence, the agent still has internal selection pressures for seeking influence? And this might not be penalized by the outer criterion if the policy plays nice on-distribution?

Comment by turntrout on Alignment Newsletter #48 · 2019-03-12T23:40:42.044Z · score: 2 (1 votes) · LW · GW

It's also mild on the inside of the algorithm, not just in its effects on the world. This could avert problems with inner optimizers. Beyond that, I haven't thought enough about the behavior of the agent. I might reply with another comment.

Comment by turntrout on Alignment Newsletter #48 · 2019-03-12T21:34:03.467Z · score: 4 (2 votes) · LW · GW

Alternatively, construct a distribution over actions such that each action has measure according to some function of its e.g. attainable utility impact penalty (normalized appropriately, of course). Seems like a potential way to get a mild optimizer which is explicitly low-impact and doesn't require complicated models of humans.
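
One way to read this suggestion is sketched below, assuming each action already has a non-negative impact penalty. The exponential weighting and the `temperature` parameter are my own illustrative choices for "some function of" the penalty, and `low_impact_distribution` is a hypothetical name, not an established method.

```python
import math
from typing import Dict, Hashable


def low_impact_distribution(
    penalties: Dict[Hashable, float], temperature: float = 1.0
) -> Dict[Hashable, float]:
    """Give each action probability proportional to exp(-penalty / temperature),
    so lower-penalty actions receive more measure."""
    # Shift by the smallest penalty for numerical stability; this does not
    # change the normalized result.
    lowest = min(penalties.values())
    weights = {
        action: math.exp(-(penalty - lowest) / temperature)
        for action, penalty in penalties.items()
    }
    total = sum(weights.values())
    return {action: weight / total for action, weight in weights.items()}


# Toy penalties (made up for illustration).
dist = low_impact_distribution({"wait": 0.0, "tidy": 1.0, "seize_power": 50.0})
```

Sampling from `dist` rather than taking an argmax is what makes the resulting agent mild: high-penalty actions are not forbidden, only assigned vanishing measure.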

Designing agent incentives to avoid side effects

2019-03-11T20:55:10.448Z · score: 31 (6 votes)
Comment by turntrout on Asymptotically Benign AGI · 2019-03-07T01:54:26.403Z · score: 2 (1 votes) · LW · GW

Other algorithms... would eventually seek arbitrary power in the world in order to intervene in the provision of its own reward; this follows straightforwardly from its directive to maximize reward

The conclusion seems false; AUP (IJCAI, LW) is a reward maximizer which does not exhibit this behavior. For similar reasons, the recent totalitarian convergence conjecture made here also seems false.

Comment by turntrout on Test Cases for Impact Regularisation Methods · 2019-02-07T17:25:00.699Z · score: 6 (3 votes) · LW · GW

This post is extremely well done.

my understanding is that every published impact regularisation method fails [supervisor manipulation] in a ‘default’ implementation.

Wouldn’t most measures with a stepwise inaction baseline pass? They would still have incentive to select over future plans so that the humans’ reactions to the agent are low impact (wrt current baseline), but if the stepwise inaction outcome is high impact by the time the agent realizes, that’s the new baseline.
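
A minimal sketch of what a stepwise inaction baseline compares, assuming a deterministic one-step model. The names `stepwise_inaction_penalty`, `step`, `value`, and `toy_step`, along with the toy dynamics, are hypothetical and only meant to show why change that is already under way gets absorbed into the baseline.

```python
from typing import Callable, TypeVar

S = TypeVar("S")
A = TypeVar("A")


def stepwise_inaction_penalty(
    state: S,
    action: A,
    noop: A,
    step: Callable[[S, A], S],
    value: Callable[[S], float],
) -> float:
    """Compare acting now with doing nothing now, both from the current state."""
    return abs(value(step(state, action)) - value(step(state, noop)))


# Toy dynamics (made up): the world drifts by +5 per step on its own,
# and the action "push" adds 1 on top of that drift.
def toy_step(state: int, action: str) -> int:
    return state + 5 + (1 if action == "push" else 0)


# The drift is shared by both branches, so only the agent's own push counts,
# no matter how far the world has already moved from where it started.
early = stepwise_inaction_penalty(0, "push", "noop", toy_step, float)
late = stepwise_inaction_penalty(100, "push", "noop", toy_step, float)
idle = stepwise_inaction_penalty(100, "noop", "noop", toy_step, float)
```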

Comment by turntrout on How much can value learning be disentangled? · 2019-01-30T22:37:51.936Z · score: 2 (1 votes) · LW · GW

In my experience so far, we need to include our values, in part, to define "reasonable" utility functions.

It seems that an extremely broad set of input attainable functions suffice to capture the “reasonable“ functions with respect to which we want to be low impact. For example, “remaining on”, “reward linear in how many blue pixels are observed each time step”, etc. All thanks to instrumental convergence and opportunity cost.

Comment by turntrout on How much can value learning be disentangled? · 2019-01-30T21:43:05.499Z · score: 3 (2 votes) · LW · GW

Take a friendly AI that does stuff. Then there is a utility function for which that "does stuff" is the single worst thing the AI could have done.

The fact that no course of action is universally friendly doesn’t mean it can’t be friendly for us.

As I understand it, the impact version of this argument is flawed in the same way (but less blatantly so): something being high impact according to a contrived utility function doesn’t mean we can’t induce behavior that is, with high probability, low impact for the vast majority of reasonable utility functions.

Comment by turntrout on How much can value learning be disentangled? · 2019-01-30T18:41:14.242Z · score: 2 (1 votes) · LW · GW

This seems to prove too much; the same argument proves friendly behavior can’t exist ever, or that including our preferences directly is (literally) impossible. The argument doesn’t show that that utility has to be important to / considered by the impact measure.

Plus, low impact doesn’t have to be robust to adversarially chosen input attainable utilities - we get to choose them. Just choose the “am I activated” indicator utility and AUP seems to do fine, modulo open questions raised in the post and comments.

Comment by turntrout on How much can value learning be disentangled? · 2019-01-29T23:30:14.484Z · score: 2 (1 votes) · LW · GW

If the AI isn’t just fed all the data by default (ie via a camera already at the opportune location), taking steps to observe is (AUP-)impactful. I think you’re right that agents with small impact allowances can still violate values.

Comment by turntrout on How much can value learning be disentangled? · 2019-01-29T17:06:20.863Z · score: 2 (1 votes) · LW · GW

Incidentally, I feel the same about low-impact approaches. The full generality problem, an AI that is low impact but value-agnostic, I think is impossible.

My (admittedly hazy) recollection of our last conversation is that your concerns were that “value agnostic, low impact, and still does stuff” is impossible. Can you expand on what you mean by value agnostic here, and why you think we can’t even have that and low impact?

Comment by turntrout on "AlphaStar: Mastering the Real-Time Strategy Game StarCraft II", DeepMind [won 10 of 11 games against human pros] · 2019-01-25T04:38:08.092Z · score: 9 (5 votes) · LW · GW

How long do handicaps take to overcome, though? I find it hard to imagine that the difference between eg 500 APM average or 500 APM hard ceiling requires a whole new insight for the agent to be “clever” enough to win anyways - maybe just more training.

Comment by turntrout on Starting to see 2 months later · 2019-01-23T22:22:38.084Z · score: 3 (2 votes) · LW · GW

Congratulations; take some time to be consciously proud of yourself for the progress you’ve made. :)

Comment by turntrout on Announcement: AI alignment prize round 4 winners · 2019-01-23T18:14:38.852Z · score: 3 (2 votes) · LW · GW

I also think surveying applicants might be a good idea, since my experience may not be representative.

Comment by turntrout on And My Axiom! Insights from 'Computability and Logic' · 2019-01-22T15:33:54.781Z · score: 2 (1 votes) · LW · GW

Turing’s thesis applies only to this notion of definability, right?

Comment by turntrout on Announcement: AI alignment prize round 4 winners · 2019-01-22T01:06:59.666Z · score: 16 (6 votes) · LW · GW

Yes, it was the top idea on/off over a few months. I considered it my secret research and thought about it on my twice-daily walks, in the shower, and in class when bored. I developed it for my CHAI application and extended it as my final Bayesian stats project. Probably 5-10 hours a week, plus more top idea time. However, the core idea came within the first hour of thinking about Concrete Problems.

The second piece, Overcoming Clinginess, was provoked by Abram’s comment that clinginess seemed like the most damning failure of whitelisting; at the time, I thought just finding a way to overcome clinginess would be an extremely productive use of my entire summer (lol). On an AMS - PDX flight, I put on some music and spent hours running through different scenarios to dissolve my confusion. I hit the solution after about 5 hours of work, spending 3 hours formalizing it a bit and 5 more making it look nice.

Comment by turntrout on Announcement: AI alignment prize round 4 winners · 2019-01-21T23:55:59.374Z · score: 6 (3 votes) · LW · GW

In round three, I was working on computational molecule design research and completing coursework; whitelisting was developed in my spare time.

In fact, during the school year I presently don't have research funding, so I spend some of my time as a teaching assistant.

Comment by turntrout on Announcement: AI alignment prize round 4 winners · 2019-01-21T16:41:24.256Z · score: 6 (3 votes) · LW · GW

Could there be some kind of mentorship incentive? Another problem at large in alignment research seems to be a lack of mentors, since most of the people skilled enough to fill this role are desperately working against the clock. A naïve solution could be to offer a smaller prize to the mentor of a newer researcher if the newbie's submission details a significant amount of help on their part. Obviously, dishonest people could throw the name of their friend on the submission because "why not", but I'm not sure how serious a problem this would be.

It would be nice to have some incentive for high-quality mentorship and for bringing new people into the contest and research field, in a way that encourages mentors to get their friends into the contest, even though that might increase the competition their own proposals face.

This might also modestly improve social incentives for mentors, since people like being associated with success and being seen as helpful / altruistic.

ETA: What about a flat prize (a few thousand dollars) that you can only win once, after which you can mentor others and receive a slightly more modest sum for prizes they win? If sufficiently selective, it might help kickstart people’s alignment careers or give them the confidence to continue working. We’d have to worry about the details of what counts as mentorship, depending on how cheaty we think people would try to be.

Comment by turntrout on Announcement: AI alignment prize round 4 winners · 2019-01-21T16:35:29.206Z · score: 7 (4 votes) · LW · GW

One possible factor is that there was initially a pool of people who wouldn't otherwise try to contribute to alignment research (~30 people, estimated as the number of submissions to contest 1 minus the number of submissions to this contest) who tried their hand early on, but then became discouraged because the winners' entries seemed more polished and productive than they felt they could realistically hope for. In fact, I felt this way in round two. I imagine that I probably would've stopped if the alignment prize had been my sole motivation (i.e., totally ignoring how I feel about the necessity of work on this problem).

Comment by turntrout on And My Axiom! Insights from 'Computability and Logic' · 2019-01-19T00:33:49.163Z · score: 2 (1 votes) · LW · GW

Sure, but how do we get the final set, then? The paradox addresses the reader in the imperative, implying one can follow along with some effective procedure to trim down the set. Yet if Turing’s thesis is to be believed, there is no such procedure, no final set, and therefore no paradox.

Comment by turntrout on And My Axiom! Insights from 'Computability and Logic' · 2019-01-18T16:43:49.294Z · score: 2 (1 votes) · LW · GW

I don’t think I understand this line of objection; would you be willing to expand?

Comment by turntrout on And My Axiom! Insights from 'Computability and Logic' · 2019-01-17T20:03:50.458Z · score: 2 (1 votes) · LW · GW

But there are more objections; even if "computability" isn't explicitly mentioned in the problem, it's still present. Are the sets "the singleton set containing 1 if and only if a given machine halts on a given input" and "the singleton set containing 1" the same? Even if we grant a procedure for figuring out what counts as a set, we can't even compute which sentences are duplicates.

Comment by turntrout on Alignment Newsletter #41 · 2019-01-17T15:44:11.134Z · score: 4 (2 votes) · LW · GW
Another way of phrasing this is that I am pessimistic about the prospects of conceptual thinking, which seems to be the main way by which we could find a fundamental obstruction. (Theory and empirical experiments can build intuitions about what is and isn't hard, but given the complexities of the real world it seems unlikely that either would give us the sort of crystallized knowledge that Paul is aiming for.) Phrased this way, I put less credence in this opinion, because I think there are a few examples of conceptual thinking being very important, though not that many.

Can you expand on your reasons for pessimism?

Comment by turntrout on And My Axiom! Insights from 'Computability and Logic' · 2019-01-17T02:27:34.290Z · score: 2 (1 votes) · LW · GW

Thanks, my terminology was a little loose. What I was trying to hint at is that some of the paradox's culling operations require uncomputable tests of English sentences, and that the regularity of the original language doesn't determine the status of its subsets.

And My Axiom! Insights from 'Computability and Logic'

2019-01-16T19:48:47.388Z · score: 40 (9 votes)
Comment by turntrout on Optimization Regularization through Time Penalty · 2019-01-01T16:51:51.783Z · score: 3 (2 votes) · LW · GW

I like this line of thought overall.

• How would we safely set lambda?

• Isn’t it still doing an argmax over plans and T, making the internal optimization pressure very non-mild? If we have some notion of embedded agency, one would imagine that doing the argmax would be penalized, but it’s not clear what kind of control the agent has over its search process in this case.

But a value neutral impact measure is almost impossible, because the world has too many degrees of freedom.

Can you explain why you think something like AUP requires value-laden inputs?

Penalizing Impact via Attainable Utility Preservation

2018-12-28T21:46:00.843Z · score: 26 (10 votes)

Why should I care about rationality?

2018-12-08T03:49:29.451Z · score: 26 (6 votes)

A New Mandate

2018-12-06T05:24:38.351Z · score: 15 (8 votes)
Comment by turntrout on Fixed Point Exercises · 2018-11-30T17:51:24.469Z · score: 10 (7 votes) · LW · GW

Sounds like me at the beginning of this year; I’m now able to make my way through logical induction. I’d be happy to help, by the way - feel free to message me.

Comment by turntrout on Turning Up the Heat: Insights from Tao's 'Analysis II' · 2018-11-29T21:25:31.288Z · score: 2 (1 votes) · LW · GW

Then you can solve it, yeah.

Comment by turntrout on Turning Up the Heat: Insights from Tao's 'Analysis II' · 2018-11-29T16:17:10.199Z · score: 2 (1 votes) · LW · GW

He defined a strict contraction $f$ on a metric space $(X, d)$ as requiring $d(f(x), f(y)) \le c\, d(x, y)$ for some $0 < c < 1$ and for all $x, y \in X$. Your proposed solution doesn’t fix such a $c$; in fact, as , , which is why .

Claim: You can’t solve the exercise

Proof (thanks to TheMajor). Let be a sequence in the domain converging to such that . Since is a strict contraction with contraction constant , . Since the absolute value is continuous, we conclude that . ◻️

Towards a New Impact Measure

2018-09-18T17:21:34.114Z · score: 104 (36 votes)

Impact Measure Desiderata

2018-09-02T22:21:19.395Z · score: 40 (11 votes)

Turning Up the Heat: Insights from Tao's 'Analysis II'

2018-08-24T17:54:54.344Z · score: 40 (11 votes)


2018-07-29T00:35:24.674Z · score: 36 (14 votes)

Making a Difference Tempore: Insights from 'Reinforcement Learning: An Introduction'

2018-07-05T00:34:59.249Z · score: 35 (9 votes)

Overcoming Clinginess in Impact Measures

2018-06-30T22:51:29.065Z · score: 40 (13 votes)

Worrying about the Vase: Whitelisting

2018-06-16T02:17:08.890Z · score: 84 (20 votes)

Swimming Upstream: A Case Study in Instrumental Rationality

2018-06-03T03:16:21.613Z · score: 113 (36 votes)

Into the Kiln: Insights from Tao's 'Analysis I'

2018-06-01T18:16:32.616Z · score: 69 (19 votes)

Confounded No Longer: Insights from 'All of Statistics'

2018-05-03T22:56:27.057Z · score: 56 (13 votes)

Internalizing Internal Double Crux

2018-04-30T18:23:14.653Z · score: 79 (18 votes)

The First Rung: Insights from 'Linear Algebra Done Right'

2018-04-22T05:23:49.024Z · score: 77 (21 votes)

Unyielding Yoda Timers: Taking the Hammertime Final Exam

2018-04-03T02:38:48.327Z · score: 39 (11 votes)

Open-Category Classification

2018-03-28T14:49:23.665Z · score: 36 (8 votes)

The Art of the Artificial: Insights from 'Artificial Intelligence: A Modern Approach'

2018-03-25T06:55:46.204Z · score: 68 (18 votes)

Lightness and Unease

2018-03-21T05:24:26.289Z · score: 53 (15 votes)

How to Dissolve It

2018-03-07T06:19:22.923Z · score: 41 (15 votes)

Ambiguity Detection

2018-03-01T04:23:13.682Z · score: 33 (9 votes)

Set Up for Success: Insights from 'Naïve Set Theory'

2018-02-28T02:01:43.790Z · score: 62 (18 votes)

Walkthrough of 'Formalizing Convergent Instrumental Goals'

2018-02-26T02:20:09.294Z · score: 27 (6 votes)

Interpersonal Approaches for X-Risk Education

2018-01-24T00:47:44.183Z · score: 29 (8 votes)