Posts

Measuring Coherence and Goal-Directedness in RL Policies 2024-04-22T18:26:37.903Z
Measuring Coherence of Policies in Toy Environments 2024-03-18T17:59:08.118Z
Supervised Program for Alignment Research (SPAR) at UC Berkeley: Spring 2023 summary 2023-08-19T02:27:30.153Z

Comments

Comment by dx26 (dylan-xu) on Coherence of Caches and Agents · 2024-05-03T17:38:51.738Z · LW · GW

It might be relevant to note that the meaningfulness of this coherence definition depends on the chosen environment. For instance, consider a deterministic forest MDP, where an agent at a state $s_t$ can never return to $s_{t'}$ for any $t' \leq t$ and there is only one path between any two states. Suppose we have a deterministic policy $\pi$, and let $s_{i+1} = \pi(s_i)$ along the trajectory $s_0, s_1, s_2, \ldots$ from some start state $s_0$. Then for the zero-current-payoff Bellman equations, we only need that $v(s_1) \geq v(s')$ for any successor $s'$ from $s_0$, $v(s_2) \geq v(s'')$ for any successor $s''$ from $s_1$, etc. We can achieve this easily by, for example, letting all values except $v(s_0), v(s_1), v(s_2), \ldots$ be near-zero; since $s_i$ is a successor of $s_j$ iff $i > j$ (as otherwise there would be a cycle), this fits our criterion. Thus, every $\pi$ is coherent in this environment. (I haven't done the explicit math here, but I suspect that this also works for non-deterministic $\pi$ and for stochastic MDPs.)
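As a quick sanity check, here is a minimal sketch of the construction in Python. The toy tree, `coherence_certificate`, and `is_coherent` are all illustrative names and instances of my own, and "coherent" is taken to mean that the policy's choice at every state attains the max of $v$ over that state's successors:

```python
import random

# Hypothetical toy forest MDP (a single tree): state -> successor states.
succ = {0: [1, 2], 1: [3, 4], 2: [5], 3: [], 4: [], 5: []}

def coherence_certificate(policy, root=0, eps=1e-6):
    """Value function from the construction above: 1 along the policy's
    trajectory from the root, near-zero everywhere else."""
    v = {s: eps for s in succ}
    s = root
    while True:
        v[s] = 1.0
        if not succ[s]:
            break
        s = policy[s]
    return v

def is_coherent(policy, v):
    """Zero-current-payoff Bellman check: at every non-leaf state, the
    policy's chosen successor attains the max of v over all successors."""
    return all(v[policy[s]] >= max(v[t] for t in succ[s])
               for s in succ if succ[s])

# Any deterministic policy should pass, as claimed above.
for _ in range(20):
    pi = {s: random.choice(ts) for s, ts in succ.items() if ts}
    assert is_coherent(pi, coherence_certificate(pi))
```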

Importantly, under the common formulation of language models in an RL setting, where each state represents a sequence of tokens and each action appends a token to a sequence of length $n$ to produce a sequence of length $n+1$, the environment is a deterministic forest, as there is only one way to "go between" two sequences (if one is a prefix of the other, choose the remaining tokens in order). Thus, any language model is coherent, which seems unsatisfying. We could try using a different environment, but this risks losing stochasticity (as the output logits of an LM are determined by its input sequence) and gets complicated pretty quickly (use natural abstractions/a world model as states?).
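To make the forest claim concrete, here is a small sketch, assuming states are tuples of tokens (the function names are just illustrative):

```python
# Each non-empty sequence has exactly one predecessor (drop the last
# token), so the state graph has no cycles and at most one path exists
# between any two states: a forest.
def successors(state, vocab):
    return [state + (tok,) for tok in vocab]

def predecessor(state):
    return state[:-1] if state else None  # unique parent

# The unique path from a prefix to an extension: append the remaining
# tokens in order.
def path(prefix, target):
    assert target[:len(prefix)] == prefix, "reachable iff prefix"
    return [target[:i] for i in range(len(prefix), len(target) + 1)]

assert path((7,), (7, 3, 5)) == [(7,), (7, 3), (7, 3, 5)]
```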

Comment by dx26 (dylan-xu) on Measuring Coherence of Policies in Toy Environments · 2024-03-19T01:30:05.036Z · LW · GW

Right, I think this roughly corresponds to "how long it takes a policy to reach a stable loop" (the "distance to loop" metric), which we used in our experiments.
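Concretely, a minimal sketch of one way to compute it for a deterministic policy in a finite MDP. This takes "distance to loop" to be the number of steps before the trajectory first revisits a state (the exact definition used in the experiments may differ, and the names here are illustrative):

```python
def distance_to_loop(policy, start):
    """Steps a deterministic policy takes from `start` before first
    entering its eventual cycle (tail length of the rho-shaped orbit)."""
    seen = {}
    s, t = start, 0
    while s not in seen:
        seen[s] = t
        s, t = policy[s], t + 1
    return seen[s]

# Example: 0 -> 1 -> 2 -> 3 -> 2 -> ...  has tail length 2 before the
# {2, 3} loop.
assert distance_to_loop({0: 1, 1: 2, 2: 3, 3: 2}, start=0) == 2
```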

What did you use your coherence definition for?