Research ideas to study humans with AI Safety in mind 2020-07-03T16:01:25.314Z · score: 22 (6 votes)
What messy problems do you see Deep Reinforcement Learning applicable to? 2020-04-05T17:43:45.945Z · score: 5 (4 votes)
What is the relationship between Preference Learning and Value Learning? 2020-01-13T21:08:40.334Z · score: 5 (4 votes)


Comment by riccardo-volpato on Human-AI Interaction · 2020-07-23T11:36:16.869Z · score: 1 (1 votes) · LW · GW

I think we raise children to satisfy our common expected wellbeing (ours + theirs + society's overall). Thus, the goal-directedness comes from society as a whole. I think there is a key difference between this system and one where a smarter-than-human AI focuses solely on the well-being of its users, even if it does Context Extrapolated Volition, which I think is what you are referring to when you talk about expected well-being (and I agree that if you look only at their CEV-like property, the two systems are equivalent).

Comment by riccardo-volpato on Human-AI Interaction · 2020-07-21T10:04:40.174Z · score: 3 (2 votes) · LW · GW

The problem with this line of reasoning is that it assumes that the goal-directedness comes from the smarter member of the decision-maker/bearer-of-consequences duo. With children and animals, we consider their preferences as an input into our decision making, which mainly seeks to satisfy our own preferences. We do not raise children solely for the purpose of satisfying their preferences.

This is why Rohin stresses particularly the idea that the danger is in the source of goal-directedness: if it comes from humans, then we are safer.

Comment by riccardo-volpato on Akrasia is confusion about what you want · 2020-07-16T10:37:25.458Z · score: 1 (1 votes) · LW · GW

Helpful post - thanks for writing it. From a phenomenological perspective, how can we reason well about the truth of this kind of "principle" (i.e. that the dual-model view, where S2 is better than S1, is less effective at dealing with motivational conflicts than the perspective-shift you suggest) when it is to some extent non-falsifiable?

Comment by riccardo-volpato on Akrasia is confusion about what you want · 2020-07-16T10:34:07.515Z · score: 3 (2 votes) · LW · GW

This seems true to me (that it happens all the time). I think the article helps by showing that we often fail to recognise that A) and B) can both be true. Also, if we accept that A) and B) are both true and don't create an identity conflict about it, we can probably be more effective in striking a compromise (i.e. giving up either one, or finding some other way to get A that does not involve B).

Comment by riccardo-volpato on Intuitions about goal-directed behavior · 2020-07-06T13:24:27.358Z · score: 1 (1 votes) · LW · GW

My rough mental summary of these intuitions

  • Generalisation abilities suggest that behaviour is goal-directed because they demonstrate adaptability (and goals are more adaptable/compact ways of defining behaviour than others, like enumeration)
  • Power grabs suggest that behaviour is goal-directed because they reveal instrumentalism
  • Our understanding of intelligence might be limited to human intelligence, which is sometimes goal-directed, so we use this as a proxy of intelligence (adding some, perhaps irrefutable, skepticism of goal-directedness as a model of intelligence)

Comment by riccardo-volpato on Risks from Learned Optimization: Conclusion and Related Work · 2020-06-26T16:49:32.991Z · score: 5 (3 votes) · LW · GW

we currently don't have a formal specification of optimization

This seems to me a significant bottleneck for progress. Has no formal specification of optimisation been attempted before? What has been achieved? Is anyone working on this?

Comment by riccardo-volpato on Deceptive Alignment · 2020-06-26T15:49:48.194Z · score: 1 (1 votes) · LW · GW
  1. Could internalization and modelling of the base objective happen simultaneously? In some sense, since Darwin discovered evolution, isn't that the state in which humans are? I guess that this is equivalent to saying that even if the mesa-optimiser has a model of the base-optimiser (condition 2 met), it cannot expect the threat of modification to eventually go away (condition 3 not met), since it is still under selection pressure and is experiencing internalization of the base objective. So if humans are ever able to defeat mortality (i.e. can expect the threat of modification to eventually go away), will they stop having any incentive to self-improve?

once a mesa-optimizer learns about the base objective, the selection pressure acting on its objective will significantly decrease

This seems context-dependent to me, as in my example with humans: did learning about evolution reduce our selection pressure?

  2. How would the reflections on training vs testing apply to something like online learning? Could we simply solve deceptive alignment by never (fully) ending training?

Comment by riccardo-volpato on The Inner Alignment Problem · 2020-06-23T22:25:28.527Z · score: 2 (1 votes) · LW · GW

If we model reachability of an objective as simply its length in bits, then distinguishing O-base from every single more reachable O-mesa gets exponentially harder as O-base gets more complex. Thus, for a very complicated O-base, sufficiently incentivizing the base optimizer to find a mesa-optimizer with that O-base is likely to be very difficult, though not impossible

What is the intuition that makes you think that, despite being exponentially harder, this would not be impossible?
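To make the quoted exponential-hardness claim concrete, here is a toy count (my own sketch, not from the post), assuming reachability is measured purely by description length in bits: the candidates strictly more reachable than an n-bit O-base are the shorter bit-strings, whose number grows exponentially with n.

```python
def more_reachable_candidates(n_bits: int) -> int:
    """Count bit-strings strictly shorter than n_bits bits.

    Sum of 2**k for k in 0..n_bits-1, which equals 2**n_bits - 1,
    so the set to distinguish O-base from grows exponentially in n_bits.
    """
    return sum(2 ** k for k in range((n_bits)))

for n in (4, 8, 16):
    print(n, more_reachable_candidates(n))
```

This is only an illustration of the counting argument; the real difficulty also depends on how the base optimizer's inductive biases weight those shorter objectives.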

Comment by riccardo-volpato on Synthesizing amplification and debate · 2020-06-10T13:31:32.070Z · score: 1 (1 votes) · LW · GW

you can anneal whatever combination of the different losses you are using to eventually become exclusively imitative amplification, exclusively debate, or anything else in between

How necessary is annealing for this? Could you choose other optimisation procedures? Or are you referring to annealing in a more general sense?
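For concreteness, here is a minimal sketch (my own reading of the quoted mechanism, not the author's implementation) of annealing a convex combination of two losses so that training starts as a mixture and ends exclusively on one objective:

```python
def combined_loss(loss_amplification: float, loss_debate: float,
                  step: int, total_steps: int) -> float:
    """Linearly anneal the mixing weight alpha from 0.5 to 1.0 over training,
    so the objective ends as exclusively imitative amplification.

    A different annealing schedule (or endpoint) would recover exclusively
    debate, or any fixed combination in between.
    """
    alpha = 0.5 + 0.5 * (step / total_steps)
    return alpha * loss_amplification + (1 - alpha) * loss_debate
```

The function names and the linear schedule are my assumptions; the post only specifies that some combination of the losses is annealed toward one endpoint.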

Comment by riccardo-volpato on An overview of 11 proposals for building safe advanced AI · 2020-06-06T12:18:33.801Z · score: 3 (2 votes) · LW · GW

I will keep track of all questions during our discussion and, if there is anything that makes sense to send over to you, I will, or will invite the attendees to do so.

I feel like we as a community still haven't really explored the full space of possible prosaic AI alignment approaches

I agree, and I have mixed feelings about the current trend of converging towards somewhat equivalent approaches all containing a flavour of recursive supervision (at least 8 of your 11). On one hand, the fact that many attempts point in a similar direction is a good indication of the potential of such a direction. On the other hand, its likelihood of succeeding may be lower than that of a portfolio approach, which seemed like what the community was originally aiming for. However, I (and I suspect most junior researchers too) don't have a strong intuition on what very different directions might be promising. Perhaps one possibility would be to not completely abandon modelling humans. While it is undoubtedly hard, it may be worth exploring this possibility from an ML perspective as well, since others are still working on it from a theoretical perspective. It may be that, granted some breakthroughs in Neuroscience, it could be less hard than what we anticipate.

Another open problem is improving our understanding of transparency and interpretability

Also agree. In fact, I find it a bit vague whenever you refer to "transparency tools" in the post. However, if we aim for some kind of guarantees, this problem may either involve modelling humans or loop back to the main alignment problem, in the sense that specifying the success of a transparency tool is itself prone to specification error and outer/inner alignment problems. Not sure my point here is clear, but it is something I am interested in pondering.

Thanks for all the post pointers. I will have an in-depth read.

Comment by riccardo-volpato on An overview of 11 proposals for building safe advanced AI · 2020-06-05T16:54:45.320Z · score: 10 (4 votes) · LW · GW

Thanks for the great post. It really provides an awesome overview of the current progress. I will surely come back to this post often and follow pointers as I think about and research things.

Just before I came across this, I was thinking of hosting a discussion about "Current Progress in ML-based AI Safety Proposals" at the next AI Safety Discussion Day (Sunday June 7th).

Having read this, I think that the best thing to do is to host an open-ended discussion about this post. It would be awesome if you can and want to join. More details can be found here.

One additional thing that I was thinking of discussing (and that could be a minor way to go beyond this post) is the set of open problems across the different proposals and which might be more impactful to contribute to or focus on. Do you have any thoughts on this?

Many thanks again for the precious resource you put together.

Comment by riccardo-volpato on What is the relationship between Preference Learning and Value Learning? · 2020-01-15T09:20:20.548Z · score: 3 (2 votes) · LW · GW

Interesting points. The distinctions you mention could equally apply in distinguishing narrow from ambitious value learning. In fact, I think preference learning is pretty much the same as narrow value learning. Thus, could it be that ambitious value learning researchers may not be very interested in preference learning, to a similar extent as they are not interested in narrow value learning?

"How important safety concerns" is certainly right, but the history of science teaches us that taking something from a domain with different concerns into another domain has often proven extremely useful.