Posts

Measuring Coherence and Goal-Directedness in RL Policies 2024-04-22T18:26:37.903Z
Measuring Coherence of Policies in Toy Environments 2024-03-18T17:59:08.118Z
Supervised Program for Alignment Research (SPAR) at UC Berkeley: Spring 2023 summary 2023-08-19T02:27:30.153Z

Comments

Comment by dx26 (dylan-xu) on If all trade is voluntary, then what is "exploitation?" · 2024-12-28T08:51:29.120Z

In this case, the starving person presumably has to press the button or else starve to death, and thus has no bargaining power. The other person only has to offer the bare minimum beyond what the starving person needs to survive, and the starving person must take the deal. In Econ 101 (assuming away monopolies, information asymmetry, etc.), exploited workers do have bargaining power, because they can go work for other companies; this is why companies can’t get away with stupid, spiteful actions in the long term.

Comment by dx26 (dylan-xu) on Coherence of Caches and Agents · 2024-05-03T17:38:51.738Z

It might be relevant to note that the meaningfulness of this coherence definition depends on the chosen environment. For instance, consider a deterministic forest MDP in which an agent at a state $s$ can never return to $s$, for any $s \in S$, and there is only one path between any two states. Suppose we have a deterministic policy $\pi$, pick a starting state $s_0$, and let $s_{i+1}$ be the state reached by taking action $\pi(s_i)$ in $s_i$. Then for the zero-current-payoff Bellman equations, we only need that $V(s_1) \geq V(s')$ for any successor $s'$ of $s_0$, that $V(s_2) \geq V(s')$ for any successor $s'$ of $s_1$, and so on. We can achieve this easily by, for example, letting all values except $V(s_0), V(s_1), V(s_2), \ldots$ be near-zero; since $s_j$ is a successor of $s_i$ iff $j = i + 1$ (as otherwise there would be a cycle), this fits our criterion. Thus, every $\pi$ is coherent in this environment. (I haven't done the explicit math here, but I suspect that this also works for non-deterministic $\pi$ and non-stochastic MDPs.)
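To sanity-check the forest argument, here is a minimal Python sketch. Everything in it is an illustrative assumption rather than the setup of the original post: a random tree MDP (`random_tree_mdp`) in which each action moves deterministically to one child, an arbitrary deterministic policy (`random_policy`), the value assignment from the paragraph above (`trajectory_values`: 1 on the policy's trajectory from the root, 0 elsewhere), and a check (`is_coherent`) that reads the zero-current-payoff Bellman equations as undiscounted, with $\pi(s)$ required to attain the max over successors.

```python
import random

def random_tree_mdp(n_states, seed=0):
    """Hypothetical toy environment: a deterministic tree MDP rooted at state 0.
    Each state s > 0 gets exactly one parent with a smaller index, so there are
    no cycles and exactly one path between any two states. Choosing a child of s
    moves there deterministically; childless states are terminal."""
    rng = random.Random(seed)
    children = {s: [] for s in range(n_states)}
    for s in range(1, n_states):
        children[rng.randrange(s)].append(s)
    return children

def random_policy(children, seed=0):
    """An arbitrary deterministic policy: one child for every non-terminal state."""
    rng = random.Random(seed)
    return {s: rng.choice(kids) for s, kids in children.items() if kids}

def trajectory_values(children, policy, start=0):
    """High value (1) on the trajectory pi follows from `start`, and (near-)zero
    value (here exactly 0) everywhere else. Starting at the root means every
    trajectory state's parent is also on the trajectory."""
    value = {s: 0.0 for s in children}
    s = start
    value[s] = 1.0
    while s in policy:
        s = policy[s]
        value[s] = 1.0
    return value

def is_coherent(children, policy, value):
    """Assumed reading of the zero-current-payoff, undiscounted Bellman equations:
    V(s) = max over successors s' of V(s'), and pi(s) attains that max."""
    for s, kids in children.items():
        if not kids:
            continue  # terminal states impose no constraint
        best = max(value[k] for k in kids)
        if value[s] != best or value[policy[s]] != best:
            return False
    return True

# Every randomly drawn deterministic policy passes the check.
children = random_tree_mdp(200, seed=1)
for seed in range(20):
    pi = random_policy(children, seed)
    assert is_coherent(children, pi, trajectory_values(children, pi))
```

Because off-trajectory values are exactly 0 here, every choice off the trajectory is a tie, so $\pi$ is only weakly optimal there.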

Importantly, using the common definition of language models in an RL setting, where each state represents a sequence of tokens and each action adds a token to the end of a sequence of length $n$ to produce a sequence of length $n + 1$, the environment is a deterministic forest, as there is only one way to "go between" two sequences (if one is a prefix of the other, append the remaining tokens in order). Thus, any language model is coherent, which seems unsatisfying. We could try using a different environment, but this risks losing stochasticity (as the output logits of an LM are determined by its input sequence) and gets complicated pretty quickly (use natural abstractions or a world model as states?).
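A minimal Python sketch of the token-sequence environment, assuming a hypothetical three-token vocabulary (`VOCAB`) and a length cutoff (`MAX_LEN`) so the state space stays finite. It checks that every non-empty sequence has exactly one predecessor (itself minus its last token), that `path` gives the only route from a prefix to its extension, and that a hypothetical deterministic stand-in for greedy decoding (`greedy_toy_lm`) passes a coherence check built from the trajectory-value construction in the previous paragraph. A real LM would replace `greedy_toy_lm`, and sampling would make the policy stochastic, which this sketch does not cover.

```python
from itertools import product

VOCAB = ("a", "b", "c")  # hypothetical toy vocabulary
MAX_LEN = 4              # hypothetical cutoff: sequences of this length are terminal

# States: every token sequence of length 0..MAX_LEN, stored as tuples.
STATES = [seq for n in range(MAX_LEN + 1) for seq in product(VOCAB, repeat=n)]

def successors(seq):
    """Deterministic transitions: each action appends one token (length n -> n + 1)."""
    if len(seq) == MAX_LEN:
        return []
    return [seq + (tok,) for tok in VOCAB]

def parents(seq):
    """Every state with a one-step transition into `seq`."""
    return [s for s in STATES if seq in successors(s)]

def path(prefix, seq):
    """The only way to go from `prefix` to `seq`: append the remaining tokens in order."""
    if seq[:len(prefix)] != prefix:
        return None  # unreachable: `prefix` is not a prefix of `seq`
    return list(seq[len(prefix):])

def greedy_toy_lm(seq):
    """Hypothetical stand-in for an LM's argmax next token; any deterministic rule works."""
    return VOCAB[sum(ord(c) for c in seq) % len(VOCAB)]

def trajectory_values(policy):
    """V = 1 on the rollout of `policy` from the empty sequence, 0 everywhere else."""
    value = {s: 0 for s in STATES}
    seq = ()
    value[seq] = 1
    while successors(seq):
        seq = seq + (policy(seq),)
        value[seq] = 1
    return value

def is_coherent(policy, value):
    """Zero-payoff, undiscounted Bellman check: V(s) equals the max over successors,
    and the policy's chosen token attains that max. Terminal sequences are skipped."""
    for s in STATES:
        nxt = successors(s)
        if not nxt:
            continue
        best = max(value[t] for t in nxt)
        if value[s] != best or value[s + (policy(s),)] != best:
            return False
    return True

# Unique predecessors make the sequence space a tree, and the toy "LM" is coherent.
assert all(parents(s) == [s[:-1]] for s in STATES if s)
assert is_coherent(greedy_toy_lm, trajectory_values(greedy_toy_lm))
```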

Comment by dx26 (dylan-xu) on Measuring Coherence of Policies in Toy Environments · 2024-03-19T01:30:05.036Z

Right, I think this somewhat corresponds to the "how long it takes a policy to reach a stable loop" (the "distance to loop" metric), which we used in our experiments.
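For a deterministic policy in a finite, deterministic environment, a "distance to loop" quantity can be computed by following the policy and recording when each state is first visited. The sketch below is a hypothetical illustration, not necessarily the exact definition used in our experiments; the `transition` argument (the next-state function under the policy) and the returned loop length are assumptions.

```python
def distance_to_loop(transition, start):
    """Steps a deterministic policy takes from `start` before entering the cycle
    it then repeats forever. `transition(s)` returns the next state under the
    policy (assumed: finite state space, deterministic dynamics)."""
    first_visit = {}
    s, t = start, 0
    while s not in first_visit:
        first_visit[s] = t
        s = transition(s)
        t += 1
    # `s` is the first repeated state, i.e. the entry point of the loop.
    return first_visit[s], t - first_visit[s]  # (distance to loop, loop length)

# 0 -> 1 -> 2 -> 3 -> 4 -> 2: two steps to reach the loop 2 -> 3 -> 4 -> 2.
nxt = {0: 1, 1: 2, 2: 3, 3: 4, 4: 2}
print(distance_to_loop(nxt.__getitem__, 0))  # (2, 3)
```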

What did you use your coherence definition for?