Posts

Compositional preference models for aligning LMs 2023-10-25T12:17:28.990Z
Towards Understanding Sycophancy in Language Models 2023-10-24T00:30:48.923Z
Paper: LLMs trained on “A is B” fail to learn “B is A” 2023-09-23T19:55:53.427Z
Paper: On measuring situational awareness in LLMs 2023-09-04T12:54:20.516Z
Imitation Learning from Language Feedback 2023-03-30T14:11:56.295Z
Pretraining Language Models with Human Preferences 2023-02-21T17:57:09.774Z
RL with KL penalties is better seen as Bayesian inference 2022-05-25T09:23:33.202Z

Comments

Comment by Tomek Korbak (tomek-korbak) on Compositional preference models for aligning LMs · 2023-10-25T20:24:41.543Z · LW · GW

Fair point, I'm using "compositional" in an informal sense different from the one in formal semantics, closer to what I called "trivial compositionality" in this paper. But I'd argue it's not totally crazy to call such preference models compositional, and that compositionality here still bears some resemblance to Montague's account of compositionality as a homomorphism: basically, you have get_total_score(response) == sum([get_score(attribute) for attribute in decompose(response)])
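
To spell out what I mean, here's a toy sketch with made-up function and attribute names (not code from the paper):

```python
# A toy compositional preference model: decompose a response into per-attribute
# judgements, score each one independently, and let the total score be their sum.
# The homomorphism flavour: the score of the whole is a simple function (a sum)
# of the scores of its parts.

ATTRIBUTES = ["helpfulness", "factuality", "coherence"]  # assumed attribute set

def decompose(response: str) -> list[str]:
    # One scoring query per attribute, each looking only at that attribute.
    return [f"Rate the {attr} of: {response}" for attr in ATTRIBUTES]

def get_score(attribute_query: str) -> float:
    # Stand-in for a per-attribute scorer, e.g. an LM prompted with the query.
    return 0.0

def get_total_score(response: str) -> float:
    return sum(get_score(attribute) for attribute in decompose(response))
```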

Comment by Tomek Korbak (tomek-korbak) on [Paper] All's Fair In Love And Love: Copy Suppression in GPT-2 Small · 2023-10-17T11:45:08.475Z · LW · GW

Cool work! Reminds me a bit of my submission to the inverse scaling prize: https://tomekkorbak.com/2023/03/21/repetition-supression/

Comment by Tomek Korbak (tomek-korbak) on Pretraining Language Models with Human Preferences · 2023-03-27T17:13:32.475Z · LW · GW

In practice I think using a trained reward model (as in RLHF), not fixed labels, is the way forward. Then the cost of acquiring the reward model is the same as in RLHF; the difference is primarily that PHF typically needs many more calls to the reward model than RLHF.

Comment by Tomek Korbak (tomek-korbak) on Remarks 1–18 on GPT (compressed) · 2023-03-21T11:47:22.123Z · LW · GW

Thanks, I found the post quite stimulating. Some questions and thoughts:

  1. Is LLM dynamics ergodic? I.e. is the time average equal to the average page vector?

  2. One potential issue with this formalisation is that you always assume a prompt of fixed size (so you need to introduce artificial "null tokens" if the prompt is shorter) and you don't give special treatment to the token <|endoftext|>. For me, it would be more intuitive to consider LLM dynamics in terms of finite, variable-length, token-level Markov chains (until <|endoftext|>). While a fixed block size is actually being used during training, the LLM is incentivised to disregard anything before <|endoftext|>. So these two prompts should induce the same distribution: "Document about cats.<|endoftext|>My name is" and "Document about dogs.<|endoftext|>My name is". Your formalisation doesn't account for this symmetry.

  3. Dennett is spelled with "tt".

  4. Note that a softmax-based LLM will always put non-zero probability on every token. So there are no strictly absorbing states. You're careful enough to define absorbing states as "once you enter, you are unlikely to ever leave", but then your toy Waluigi model is implausible. A Waluigi can always switch back to a Luigi.
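
A quick illustration of point 4 (just a numpy sketch, not tied to any particular model):

```python
import numpy as np

# Softmax output is strictly positive for any finite logits, so no state of the
# token-level chain is ever *strictly* absorbing: every continuation retains
# non-zero probability, however small.

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

probs = softmax(np.array([10.0, -30.0, 0.0]))  # one token strongly favoured
assert (probs > 0).all()                        # no exact zeros
print(probs)                                    # ~[1.0e+00, 4.2e-18, 4.5e-05]
```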

Comment by Tomek Korbak (tomek-korbak) on Pretraining Language Models with Human Preferences · 2023-03-04T13:04:48.962Z · LW · GW

I don't remember where I saw that, but something as dumb as subtracting the embedding of <|bad|> might even work sometimes.

Comment by Tomek Korbak (tomek-korbak) on Pretraining Language Models with Human Preferences · 2023-03-04T13:01:44.398Z · LW · GW

That's a good point. But if you're using a distilled, inference-bandwidth-optimised RM, annotating your training data might be a fraction of the compute needed for pretraining.

Also, the cost of annotation is constant and can be amortized over many training runs. PHF shares an important advantage of offline RL over online RL approaches (such as RLHF): being able to reuse feedback annotations across experiments. If you already have a dataset, running a hyperparameter sweep on it is as cheap as standard pretraining and in contrast with RLHF you don't need to recompute rewards.

Comment by Tomek Korbak (tomek-korbak) on Pretraining Language Models with Human Preferences · 2023-02-28T10:53:52.558Z · LW · GW

For filtering, we kept the 25% of documents with the best scores, so we effectively trained for 4 epochs.

(We had different thresholds for filtering and conditional training; note that we filter at the document level but condition at the sentence level.)
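
Roughly like this (a sketch with hypothetical helper names, using the thresholds mentioned in this thread rather than the exact paper code):

```python
import numpy as np

GOOD, BAD = "<|good|>", "<|bad|>"

def filter_documents(documents, safety_scores, keep_fraction=0.25):
    # Filtering: keep only the best-scoring 25% of documents
    # (higher score = safer) and train on that subset for ~4 epochs.
    cutoff = np.quantile(safety_scores, 1 - keep_fraction)
    return [doc for doc, s in zip(documents, safety_scores) if s >= cutoff]

def tag_sentences(sentences, safety_scores, good_fraction=0.05):
    # Conditional training: prepend <|good|> to roughly the 5% safest
    # sentences and <|bad|> to everything else; train on all of it.
    cutoff = np.quantile(safety_scores, 1 - good_fraction)
    return [(GOOD if s >= cutoff else BAD) + sent
            for sent, s in zip(sentences, safety_scores)]
```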

Comment by Tomek Korbak (tomek-korbak) on Pretraining Language Models with Human Preferences · 2023-02-24T14:52:20.370Z · LW · GW

Good question! We're not sure. The fact that PHF scales well with dataset size might provide weak evidence that it would scale well with model size too.

Comment by Tomek Korbak (tomek-korbak) on Pretraining Language Models with Human Preferences · 2023-02-23T18:17:03.958Z · LW · GW

I'm guessing that poison-pilling the <|bad|> sentences would have a negative effect on the <|good|> capabilities as well?

That would be my guess too.

Comment by Tomek Korbak (tomek-korbak) on Pretraining Language Models with Human Preferences · 2023-02-23T18:05:53.561Z · LW · GW

Have you tested the AI's outputs when run in <|bad|> mode instead of <|good|> mode?

We did; the LM tends to generate toxic text when conditioned on <|bad|>. Though we tended to use risk-averse thresholds, i.e. we used <|good|> for only about the 5% safest sentences and <|bad|> for the remaining 95%. So <|bad|> is not bad all the time.

Here it would be helpful to know what the AI produces when prompted by <|bad|>.

That's a good point. We haven't systematically investigated differences in capabilities between <|good|> and <|bad|> modes; I'd love to see that.

Just before public release, one could delete the <|bad|> token from the tokenizer and the model parameters, so switching to evil mode would require rediscovering that token embedding.

Yeah, you could even block the entire direction in activation space corresponding to the embedding of the <|bad|> token.
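
Something like this, as a minimal sketch (which layer(s) to intervene on and how well it actually works is an open question):

```python
import torch

def project_out(hidden_states: torch.Tensor, bad_embedding: torch.Tensor) -> torch.Tensor:
    # Remove the component of each hidden state lying along the
    # (unit-normalised) <|bad|> embedding direction.
    direction = bad_embedding / bad_embedding.norm()
    coeffs = hidden_states @ direction              # projection coefficients
    return hidden_states - coeffs.unsqueeze(-1) * direction
```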

Comment by Tomek Korbak (tomek-korbak) on RL with KL penalties is better seen as Bayesian inference · 2022-11-22T11:25:18.175Z · LW · GW

fixed, thanks!

Comment by Tomek Korbak (tomek-korbak) on Safety considerations for online generative modeling · 2022-07-08T19:10:00.142Z · LW · GW

I really liked the post and the agenda of improving safety through generative modelling is close to my heart.

we begin an online phase of its training: the agent starts acting in its environment and generating new task completions, which are recorded and fed back into the decision transformer as new training data

But you still need online access to the MDP (i.e. the reward function and transition function), don't you? And it's access to the MDP that drives novelty and improvement. If you were just sampling whole trajectories from the model (asking the model itself to simulate the reward function and transition model) and feeding them back into the model, you shouldn't expect any change (on average). Your gradient updates will cancel out; that's a consequence of the expected-grad-log-prob lemma ($\mathbb{E}_{x \sim p_\theta}[\nabla_\theta \log p_\theta(x)] = 0$).
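
A toy demonstration of the lemma (just a sketch, not the decision-transformer setting):

```python
import torch

# If you sample from the model itself and do likelihood training on those
# samples, the expected gradient is zero: the updates cancel out on average.
logits = torch.zeros(5, requires_grad=True)            # tiny "policy" over 5 tokens
dist = torch.distributions.Categorical(logits=logits)

samples = dist.sample((100_000,))                      # sample from the model itself
loss = -dist.log_prob(samples).mean()                  # "train" on your own samples
loss.backward()

print(logits.grad)   # ≈ 0 in every coordinate, up to sampling noise
```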

It gets more nuanced when you account for doing ancestral sampling, but that adds problems rather than solving them:
https://arxiv.org/abs/2110.10819

Reproduce the “Learning to Summarize with Human Feedback” paper but with a frozen reward model.

On the other hand, in their follow-up work on instruction following, OpenAI claimed they used little online data (from fine-tuned policies):
https://arxiv.org/abs/2203.02155

It would be interesting to figure out a way to factorize the policy in (a) over timesteps, i.e. produce distributions over actions conditional on partial trajectories

Levine derives that in his control-as-inference tutorial paper (section 2.3). Your expected exponential total reward is pretty close. Note that it acts a bit like an (exponentiated) Q function for your policy: it gives you the exp-reward expected after taking an action in a state and following the policy thereafter. The exponential works like a soft argmax, so it gives you something like soft Q-learning, but not quite: the argmax is also over environment dynamics, not only over the policy. So it causes an optimism bias: your agent effectively assumes an optimal next state will be sampled for it every time, however unlikely that would be. The rest of Levine's paper deals with that.
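
For reference, the recursion Levine derives looks roughly like this (my notation, a sketch rather than a quote from the paper):

```latex
\begin{align}
Q(s_t, a_t) &= r(s_t, a_t) + \log \mathbb{E}_{s_{t+1} \sim p(\cdot \mid s_t, a_t)}\big[\exp V(s_{t+1})\big] \\
V(s_t) &= \log \int \exp Q(s_t, a_t)\, \mathrm{d}a_t
\end{align}
```

The optimism bias lives in the first line: the log-E-exp over next states acts like a softmax over the dynamics, so the agent behaves as if it could pick its own lucky transitions.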

Comment by Tomek Korbak (tomek-korbak) on RL with KL penalties is better seen as Bayesian inference · 2022-06-07T16:36:33.550Z · LW · GW

good catch, yes, thanks!

Comment by Tomek Korbak (tomek-korbak) on RL with KL penalties is better seen as Bayesian inference · 2022-06-02T17:54:39.512Z · LW · GW

Thanks for sharing your thoughts, I found these remarks extremely insightful!

It seems like the ideal way forward is to more accurately capture what you actually care about, then optimize that---staying close to the original distribution feels like more of a hack to me. It seems like you view the original distribution of webtext as more principled or fundamental than I do, but I'm not sure what accounts for that difference.

A reply that comes to mind is that maybe being grounded in the human knowledge, reasoning rules and values represented in web text has inherent value? Maybe web text is already approximately aligned with human preferences and you only want to tweak that distribution a bit to match true human preferences? Assume that's the case. Then we can decompose LM alignment into (i) learning the web text distribution and (ii) learning how to warp the web text distribution. It seems that (ii) is easier than just learning aligned behaviour from scratch: your reward model doesn't have to work well on arbitrary text but only on text from distributions similar to webtext.

Another way of phrasing that point: maybe the assumption that you can have a perfect reward model is unrealistic and we can offload some of the complexity of learning a reward model to a prior given by web text? Or more philosophically, if you're a Bayesian, you shouldn't trust your reward model blindly, you should still have some prior.
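
To make the "prior" framing concrete, this is just the KL-regularised optimum from the post, read as a Bayesian update (webtext LM as prior, reward as evidence):

```latex
\pi^{*}(x) \;\propto\; \pi_0(x)\,\exp\!\big(r(x)/\beta\big)
```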

Comment by Tomek Korbak (tomek-korbak) on RL with KL penalties is better seen as Bayesian inference · 2022-06-02T17:04:36.693Z · LW · GW

Do you think these insights would generalise to the case where the language model may be interacting with some system during this fine-tuning phase? For example, if it generates queries to an external search engine or API, or has dialogue with a human, then the optimal policy is no longer equivalent to just generating the correct output distribution, as it now also involves environment observations.

That's a good point and helps to make a distinction between generative models and policies. In the interactive case, your policy pi(a|s) is a conditional distribution. You can equivalently view it as a collection of unconditional distributions {pi_s(a)}, one for each s, and for each of these you are likely to also have distribution collapse (a single best action for a given state). Arguably, that's what you want in RL.
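
In symbols (just a sketch): without a KL term, the reward-maximising per-state distribution collapses to a point mass,

```latex
\pi_s^{*} \;=\; \underset{\pi_s}{\arg\max}\; \mathbb{E}_{a \sim \pi_s}\,[r(s, a)] \;=\; \delta_{a^{*}(s)}, \qquad a^{*}(s) = \underset{a}{\arg\max}\; r(s, a)
```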

So I think it mostly comes down to a philosophical difference. Do you want your LM to be a decision-maker acting in a world or a model of some probability distribution over texts? If you want a decision-maker and training on language is just a scaffolding to get you there, maybe indeed staying close to the original distribution only has instrumental value?

But what if what you want is just an oracle-type conversational AI: a knowledge base and a common-sense reasoner. Maybe in this case staying close to human knowledge and inference rules represented in language is of inherent value?

Comment by Tomek Korbak (tomek-korbak) on RL with KL penalties is better seen as Bayesian inference · 2022-05-29T17:55:58.383Z · LW · GW

I'm glad you found our post insightful!

I'm not sure what the best energy allocation between modelling and inference is here. I think, however, that the modelling part is more neglected (the target distribution is rarely even considered as something that can be written down and analysed). Moreover, designing good target distributions can be quite alignment-specific, whereas designing algorithms for inference in probabilistic graphical models is an extremely generic research problem, so we can expect progress there anyway.

Comment by Tomek Korbak (tomek-korbak) on RL with KL penalties is better seen as Bayesian inference · 2022-05-26T19:34:09.230Z · LW · GW

I expect that in the current regime (only optimizing the policy a small amount), any method that does a reasonable job of maximizing reward while controlling how much the policy changes can be made to work in practice

Yes, that seems plausible. Though as you said, most methods that only change the policy a bit (early stopping, clipping in PPO) do that via implicit KL penalties and still can be seen as updating a prior.

there would be an exploration-exploitation trade-off, which is something that the RL perspective may again offer insight into.

Definitely, exploration-exploitation issues could make the distribution collapse more severe, and traditional RL tricks could help with that. But I still believe distribution collapse does not reduce to insufficient exploration, and good exploration alone won't solve it. In this specific instance, failing to find the optimal policy is not the problem; the optimal policy itself is the problem.