What alignment-related concepts should be better known in the broader ML community?

post by Lauro Langosco · 2021-12-09T20:44:09.228Z

This is a question post.

Contents

  Answers
    14 jbkjr
    5 Daniel Kokotajlo
    5 Daniel Kokotajlo
    2 Charlie Steiner

We want to work towards a world in which the alignment problem is a mainstream concern among ML researchers. An important part of this is popularizing alignment-related concepts within the ML community. Here are a few recent examples:

(I'm sure this list is missing many examples; let me know if there are any in particular I should include).

Meanwhile, there are many other things that alignment researchers have been thinking about that are not well known within the ML community. Which concepts would you most want to be more widely known / understood?

Answers

answer by jbkjr · 2021-12-09T20:02:30.218Z

This is kind of vague, but I have this sense that almost everybody doing RL and related research takes the notion of "agent" for granted, as if it were some metaphysical primitive*, as opposed to a (very) leaky abstraction that exists in the world models of humans. But I don't think the average alignment researcher has much better intuitions about agency either, to be honest, even though some spend time thinking about things like embedded agency. It's hard to think meaningfully about the illusoriness of the Cartesian boundary when you still live 99% of your life and think 99% of your thoughts as if you were a Cartesian agent, fully "in control" of your choices, thoughts, and actions.

(*Not that "agent" couldn't, in fact, be a metaphysical primitive, just that such "agents" are hardly "agents" in the way most people consider humans to "be agents" [and, equally importantly, other things, like thermostats and quarks, to "not be agents"].)

answer by Daniel Kokotajlo · 2021-12-09T22:50:39.270Z

Saints vs. Schemers vs. Sycophants as different kinds of trained models / policies we might get. (I'm drawing from Ajeya's post here).

There are more academic-sounding terms for these concepts too; I forget where exactly, but probably in Paul's posts about "the intended model" vs. "the instrumental policy" and the like.

answer by Daniel Kokotajlo · 2021-12-09T22:48:05.907Z

Inner vs. outer alignment, mesa-optimizers

answer by Charlie Steiner · 2021-12-13T19:34:49.283Z

Human values exist within human-scale models of the world.

No comments