The distinction between inner and outer alignment is quite unnatural. For example, even the concept of reward hacking implies a twofold failure: a reward that is not robust enough against exploitation, and a model that develops instrumental capabilities in order to find a way to trick the reward. Indeed, in the case of reward hacking, it's worth noting that, depending on the autonomy of the system in question, we could attribute the misalignment to either an inner or an outer failure. At its core, this distinction comes out of the policy <-> reward scheme of RL, though prediction <-> loss function in SL can be similarly characterized; I doubt that this framing generalizes well to other engineering choices.
Eliezer seems on track to win: the current AI score on the benchmark of IMO geometry problems is 27/30 (IMO Gold human performance is at 25.9/30). This new score was set by an LLM-augmented neurosymbolic AI.
Thank you for the insightful post! You mentioned that:
Consider the relation a transformer has to an HMM that produced the data it was trained on. This is general - any dataset consisting of sequences of tokens can be represented as having been generated from an HMM.
and the linear projection consists of:
Linear regression from the residual stream activations (64 dimensional vectors) to the belief distributions (3 dimensional vectors).
Given an arbitrary natural language dataset for which we don't have the ground-truth belief distribution, is it possible to reverse engineer an HMM (i.e., a model of the data) and extract the topology of the residual stream activations?
I've been running task-salient representation experiments on larger models and am very interested in replicating and possibly extending your result to noisier settings.
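To make the question concrete, here is a minimal sketch of the setting where the ground truth is known: a 3-state HMM generates tokens, the belief distribution is tracked by Bayesian filtering, and a linear regression maps 64-dimensional vectors to the 3-dimensional beliefs. Everything in it is an assumption for illustration: the HMM parameters are random, and `acts` is a synthetic stand-in (a noisy linear embedding of the beliefs), not the residual stream of a real transformer. The open question above is what to do when `T`, and hence `beliefs`, is not available.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-state, 2-token HMM (random parameters, not from the post).
# T[x][i, j] = P(emit token x and move to state j | current state i).
n_states, n_tokens, d_model = 3, 2, 64
T = rng.random((n_tokens, n_states, n_states))
T /= T.sum(axis=(0, 2), keepdims=True)  # normalize over (token, next state)


def sample_beliefs(n_steps):
    """Sample a token sequence and return the observer's belief after each token."""
    state = rng.integers(n_states)
    belief = np.full(n_states, 1.0 / n_states)  # assumed uniform prior
    beliefs = []
    for _ in range(n_steps):
        # Jointly sample (token, next state) from the current hidden state.
        idx = rng.choice(n_tokens * n_states, p=T[:, state, :].ravel())
        token, state = divmod(idx, n_states)
        # Bayesian filtering: propagate the belief through T[token], renormalize.
        belief = belief @ T[token]
        belief /= belief.sum()
        beliefs.append(belief)
    return np.array(beliefs)


beliefs = sample_beliefs(2000)  # shape (2000, 3)

# Synthetic stand-in for residual stream activations (an assumption):
# a random linear embedding of the beliefs plus small Gaussian noise.
W_embed = rng.normal(size=(n_states, d_model))
acts = beliefs @ W_embed + 0.01 * rng.normal(size=(len(beliefs), d_model))

# Linear regression (with a bias column) from activations to beliefs.
X = np.hstack([acts, np.ones((len(acts), 1))])
coef, *_ = np.linalg.lstsq(X, beliefs, rcond=None)
pred = X @ coef
mse = np.mean((pred - beliefs) ** 2)
print(f"regression MSE: {mse:.2e}")
```

With real activations, `acts` would be replaced by the model's residual stream at each token position, and a low regression error would be the evidence that the belief geometry is linearly represented.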