A foundation model approach to value inference

post by sen · 2023-02-21

Contents

  The covering set
  Creating the model
  Factoring the model
  Why bother

Epistemic status: shower thoughts.

I'm going to write this out as a pseudo-proof. Please pardon the lack of narrative structure. Conceptually, I'm splitting the problem of value inference into three sub-problems:

  1. Finding a "covering set" of all causal implications of a person's values. The goal here is to describe a concrete "values" dataset. Modeling that dataset should be sufficient to model values.
  2. Creating a model of that covering set. The goal here is to show that it is feasible to model values, along with a bunch of other stuff that we eventually want to separate out.
  3. Factoring the model to separate the effects of values from the effects of other variables. The goal is to show how to isolate values in a model and thereby get a more robust model of values.

I'm not going to claim that everything in this post is sound or complete. But I do suspect that this process, if iterated on for a fairly short period of time, could lead to a reasonably accurate model of values.

The covering set

Conjecture: Any system that models all of the effects of a person's values must also model that person's values.

In practice, the effects of a person's values may include:

  1. The emotional responses they have to scenarios.
  2. The feedback they give to others.
  3. What they consider important or worthwhile.
  4. The scenarios they consider ideal.

Hypothesis: The above four points form the basis of all things causally influenced by a person's values. In other words, any system that models the above four things perfectly must also perfectly model a person's values.

So, tentatively, "the dataset" would only need to consist of data points on those four things: emotional responses, feedback that people give, what people consider important or worthwhile, and the scenarios people consider ideal.
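
To make the shape of such a dataset concrete, here is a minimal sketch in Python of what a single data point covering those four things might look like. The field names and the example record are illustrative assumptions, not something specified above:

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class ValueDataPoint:
        """One record in the hypothetical "values" dataset (all field names illustrative)."""
        scenario: str                    # the situation the person is responding to
        emotional_response: str          # the emotional response they reported or displayed
        feedback: Optional[str] = None   # approval, disapproval, or corrective feedback they gave
        importance_judgment: Optional[str] = None  # what they said was important or worthwhile
        ideal_outcome: Optional[str] = None        # how they said the scenario should ideally go

    # Example record
    example = ValueDataPoint(
        scenario="A friend cancels plans at the last minute to help a stranger.",
        emotional_response="mild annoyance mixed with admiration",
        feedback="told the friend it was the right call",
        importance_judgment="helping someone in immediate need outweighs a minor inconvenience",
        ideal_outcome="the friend gives earlier notice but still helps",
    )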

Creating the model

Hypothesis: The effects of a person's values can be reasonably well modeled through a large language model.

A language model would need to model a lot more than just a person's values to perform well on such a dataset, but that's okay. As long as it needs to model values well to perform well on the dataset, it's fine for it to model extraneous things. They'll be handled in the next step.
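
As a sketch of what this could look like in practice, one option is to fine-tune an off-the-shelf causal language model to predict a person's recorded responses given the scenario. The model choice, file name, and text format below are assumptions for illustration only:

    # Sketch: fine-tune a causal LM on the "values" dataset (record format assumed above).
    from datasets import load_dataset
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer,
                              TrainingArguments)

    model_name = "gpt2"  # stand-in; any causal language model would do
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(model_name)

    def tokenize_example(ex):
        # Serialize one data point as plain text: given the scenario,
        # the model learns to predict the person's recorded responses.
        text = (f"Scenario: {ex['scenario']}\n"
                f"Emotional response: {ex['emotional_response']}\n"
                f"Feedback: {ex['feedback']}\n"
                f"Considered important: {ex['importance_judgment']}\n"
                f"Ideal outcome: {ex['ideal_outcome']}")
        return tokenizer(text, truncation=True, max_length=512)

    dataset = load_dataset("json", data_files="values_dataset.jsonl")["train"]
    tokenized = dataset.map(tokenize_example, remove_columns=dataset.column_names)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="values-model",
                               num_train_epochs=1,
                               per_device_train_batch_size=8),
        train_dataset=tokenized,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()

Performance on held-out data points would then operationalize "models the effects of values well", while whatever else the model picks up along the way is left for the next step.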

Factoring the model

Premise: Values are grounded in measurable physical observables.

Hypothesis: Using the model from Step 2, it's possible to make otherwise-uninterpretable value-oriented neural signals interpretable. Outline of the process:

With that, if you create a model of the conditioning data by any means, you end up with a value model that both matches intuition and is grounded in physical observables.
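
As one possible instantiation of the idea, here is a hedged sketch: pair the Step-2 model's hidden activations on each scenario with physiological measurements taken while the person responds, fit a linear map between them, and treat the directions that map picks out as the candidate value-relevant subspace. Every detail below (the pairing, the linear probe, the file names) is an assumption for illustration, not the post's procedure:

    # Sketch of one way to ground value-oriented structure in physical observables.
    # Assumption: for each data point we have (a) hidden activations from the Step-2
    # model and (b) physiological signals (heart rate, skin conductance, neural
    # recordings, ...) measured while the person responded to the same scenario.
    import numpy as np
    from sklearn.linear_model import RidgeCV

    activations = np.load("activations.npy")        # shape: (n_examples, hidden_dim)
    signals = np.load("physiological_signals.npy")  # shape: (n_examples, n_channels)

    # Linear probe from activation space to the measured signals.
    probe = RidgeCV(alphas=np.logspace(-3, 3, 13)).fit(activations, signals)

    # The probe's weights pick out the directions in activation space that track
    # the physical observables; projecting onto them isolates a candidate
    # "value-relevant" subspace from everything else the language model learned.
    _, _, value_directions = np.linalg.svd(probe.coef_, full_matrices=False)
    value_subspace = activations @ value_directions.T  # (n_examples, n_directions)
    print(value_subspace.shape)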

Why bother

  1. Decomposing the problem makes it easier to think about. Each of the three steps above feels a lot more intuitive and tractable than the problem of "value inference" in its entirety.
  2. It's modular. It splits up the philosophical work, the computational work, and the scientific work such that each of these things has the absolute minimum dependency on the others (maybe). That means people don't need to spend a huge amount of time catching up on everything before they're able to contribute anything.
  3. It takes advantage of machine learning progress. Advances in the ability to model more things will provide more flexibility for Step 1. Advances in creating more efficient models will benefit Step 2. Advances in making models easier to control will benefit Step 3.
  4. It makes disagreements precise. Disagreements about how to perform Step 2 (modeling the effects of values) and Step 3 (the physical observables of values) can be resolved through experiments. Disagreements about Step 1 (what data should be considered relevant) can be discussed over concrete data points rather than abstract arguments.
  5. This seems like a feasible starting point, and, assuming people can ever agree on what "values" are, it should converge on the true model.
  6. It gives people something to optimize that might actually help with safety.
