Posts

ARC is hiring theoretical researchers 2023-06-12T18:50:08.232Z
The effect of horizon length on scaling laws 2023-02-01T03:59:26.074Z
Scaling Laws for Reward Model Overoptimization 2022-10-20T00:20:06.920Z
Common misconceptions about OpenAI 2022-08-25T14:02:26.257Z
How much alignment data will we need in the long run? 2022-08-10T21:39:31.214Z
Deep learning curriculum for large language model alignment 2022-07-13T21:58:33.452Z
Procedurally evaluating factual accuracy: a request for research 2022-03-30T16:37:37.675Z
Truthful LMs as a warm-up for aligned AGI 2022-01-17T16:49:24.270Z
Stationary algorithmic probability 2017-04-29T17:23:05.000Z

Comments

Comment by Jacob_Hilton on Common misconceptions about OpenAI · 2023-12-16T05:39:03.217Z · LW · GW

Since this post was written, OpenAI has done much more to communicate its overall approach to safety, making this post somewhat obsolete. At the time, I think it conveyed some useful information, although it was perceived as more defensive than I intended.

My main regret is bringing up the Anthropic split, since I was not able to do justice to the topic. I was trying to communicate that OpenAI maintained its alignment research capacity, but should have made that point without mentioning Anthropic.

Ultimately I think the post was mostly useful for sparking some interesting discussion in the comments.

Comment by Jacob_Hilton on Mode collapse in RL may be fueled by the update equation · 2023-06-20T07:22:27.382Z · LW · GW

I think KL/entropy regularization is usually used to prevent mode collapse partly because it has nice theoretical properties. In particular, it is easy to reason about the optimal policy for the regularized objective - see for example the analysis in the paper Equivalence Between Policy Gradients and Soft Q-Learning.
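
For reference, here is the kind of closed form that makes the regularized objective easy to reason about, in the one-step (bandit) case (a sketch in my own notation, not the paper's exact statement; in the sequential case the reward is replaced by a soft Q-function):

```latex
% KL-regularized objective and its optimizer (one-step case, my notation):
\max_{\pi}\; \mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[ r(s,a) \big]
  \;-\; \beta\, \mathrm{KL}\big( \pi(\cdot \mid s) \,\big\|\, \pi_{\mathrm{ref}}(\cdot \mid s) \big)
\quad\Longrightarrow\quad
\pi^{*}(a \mid s) \;\propto\; \pi_{\mathrm{ref}}(a \mid s)\, \exp\!\big( r(s,a)/\beta \big)
```

Because the optimum is just an exponential tilt of the reference policy, it retains support wherever the reference policy does, which is one way to see why the regularization discourages mode collapse.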

Nevertheless, action-dependent baselines do appear in the literature, although the story is a bit confusing. This is my understanding of it from some old notes:

  • The idea was explored in Q-Prop. But unlike you, they were not aiming to change the optimal policy, but rather to reduce the variance of the policy gradient. Therefore they also incorporated an additional term to cancel out the bias introduced by the action-dependent baseline. (Incidentally, perhaps this analysis is also relevant to understanding ACTDE.)
  • Later, The Mirage of Action-Dependent Baselines showed that in fact the variance reduction due to the action-dependent baseline was negligible, and the entire benefit of Q-Prop was essentially due to a bug! The implementation normalized advantage estimates, but failed to apply the same adjustment to the bias-correction term, which turned out to be independently helpful because it's essentially the DDPG training objective.
Comment by Jacob_Hilton on ARC is hiring theoretical researchers · 2023-06-19T16:33:07.943Z · LW · GW

We will do our best to fairly consider all applications, but realistically there is probably a small advantage to applying earlier. This is simply because there is a limit to how quickly we can grow the organization, so if hiring goes better than expected then it will be longer before we can take on even more people. With that being said, we do not have a fixed number of positions that we are hiring for; rather, we plan to vary the number of hires we make based on the strength of the applications we receive.  Moreover, if we were unable to hire someone due to capacity constraints, we would very likely be interested in hiring them at a later date. For these reasons, I think the advantage to applying earlier is a fairly small consideration overall, and it sounds like it would make more sense for you to apply whenever you are comfortable.

Comment by Jacob_Hilton on ARC is hiring theoretical researchers · 2023-06-14T19:19:13.628Z · LW · GW

The questions on the take-home test vary in difficulty but are generally easier than olympiad problems, and should be accessible to graduates with relevant background. However, it is important to note that we are ultimately interested in research ability rather than the ability to solve self-contained problems under timed conditions. So although the take-home test forms part of our assessment, we also look at other signals such as research track-record (while recognizing that assessing research ability is unfortunately very hard).

(Note: I am talking about the current version of the test, it's possible that the difficulty will change as we refine our interview process.)

Comment by Jacob_Hilton on ARC is hiring theoretical researchers · 2023-06-12T21:03:08.377Z · LW · GW

I think the kind of mathematical problem solving we're engaged in is common across theoretical physics (although this is just my impression as a non-physicist). I've noticed that some specific topics that have come up (such as Gaussian integrals and permanents) also crop up in quantum field theory, but I don't think that's a strong reason to prefer that background particularly. Broad areas that often come up include probability theory, computational complexity and discrete math, but it's not necessary to have a lot of experience in those areas, only to be able to pick things up from them as needed.

Comment by Jacob_Hilton on Prizes for matrix completion problems · 2023-05-04T17:33:24.327Z · LW · GW

It's not quite this simple: the same issue arises if every PSD completion of the known-diagonal minor has zero determinant (e.g. ((?, 1, 2), (1, 1, 1), (2, 1, 1))). But I think in that case making the remaining diagonal entries large enough still makes the eigenvalues at least −ε, which is good enough.
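
A quick numerical check of that example (my own sketch, not from the original thread):

```python
# Minimum eigenvalue of the completion ((x, 1, 2), (1, 1, 1), (2, 1, 1)) as the
# unknown diagonal entry x grows: it stays slightly negative but tends to 0,
# so it can be made at least -eps by taking x large enough.
import numpy as np

for x in [10.0, 100.0, 1000.0, 10000.0]:
    M = np.array([[x, 1.0, 2.0],
                  [1.0, 1.0, 1.0],
                  [2.0, 1.0, 1.0]])
    print(x, np.linalg.eigvalsh(M).min())
```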

Comment by Jacob_Hilton on AI alignment researchers don't (seem to) stack · 2023-02-21T17:28:33.507Z · LW · GW

I think the examples you give are valid, but there are several reasons why I think the situation is somewhat contingent or otherwise less bleak than you do:

  1. Counterexamples: I think there are research agendas that are less pre-paradigmatic than the ones you're focused on that are significantly more (albeit not entirely) parallelizable. For example, mechanistic interpretability and scalable oversight both have multiple groups focused on them and have grown substantially over the last couple of years. I'm aware that we disagree about how valuable these directions are.
  2. Survival of the fittest: Unfortunately I think in cases where an individual has been pursuing a research direction for many years and has tried but failed to get anyone else on board with it, there is some explanatory power to the hypothesis that the direction is not that productive. Note that I'm not claiming to have a strong view on any particular agenda, and there are of course other possibilities in any given case. On the flip side, I expect promising directions to gain momentum over time, even if only gradually, and I consider the counterexamples from point 1 to be instances of this effect.
  3. Fixable coordination/deference failures: I think it would be a mistake for absolutely everyone to go off and try to develop their own alignment strategy from scratch, and it's plausible that the group you're focused on is erring too far in this direction. My own strategy has been to do my best to develop my own inside view (which I think is important for research prioritization and motivation, as well as from a group epistemics perspective), use this to narrow down my set of options to agendas I consider plausible, but be considerably more willing to defer when it comes to making a final call about which agenda to pursue.
  4. Clarity from AI advances: If the risk from AI is real, then I expect the picture of it to become clearer over time as AI improves. As a consequence, it should become clearer to people which directions are worth pursuing, and theoretical approaches should evolve into practical ones that can be iterated on empirically. This should both cause the field to grow and lead to more parallelizable work. I think this is already happening, and even the public at large is picking up on the spookiness of current alignment failures (even though the discourse is unsurprisingly very muddled).
Comment by Jacob_Hilton on Proposal: Scaling laws for RL generalization · 2023-02-04T19:50:41.550Z · LW · GW

You might find this work interesting, which takes some small steps in this direction. It studies the effect of horizon length inasmuch as it makes credit assignment harder, showing that the number of samples required is an affine function of horizon length in a toy context.

Comment by Jacob_Hilton on The effect of horizon length on scaling laws · 2023-02-01T17:42:02.875Z · LW · GW

I think the direction depends on what your expectations were – I'll try to explain.

First, some terminology: the term "horizon length" is used in the paper to refer to the number of timesteps over which the algorithm pays attention to rewards, as governed by the discount rate. In the biological anchors framework, the term "effective horizon length" is used to refer to a multiplier on the number of samples required to train the model, which is influenced by the horizon length and other factors. For clarity, I'll use the term "scaling multiplier" instead of "effective horizon length" in this comment. The paper studies the effect of the horizon length on the scaling multiplier in a toy MNIST setting.

One key takeaway is that the scaling multiplier is not simply proportional to the horizon length, as one might have naively expected. Instead, the number of samples required is the sum of two components, one that is inherent to the task and independent of the horizon length, and one that is proportional to the horizon length. Compared to the naive expectation, this means that training compute requirements are lower. On the other hand, this ignores reward sparsity, so you might expect training compute requirements to be higher once both horizon length and reward sparsity are accounted for.
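
In symbols, the takeaway is roughly the following (my paraphrase, with N_task and c task-dependent constants and H the horizon length):

```latex
% Samples required as a function of horizon length H (paraphrase, not the paper's notation):
N(H) \;\approx\; N_{\mathrm{task}} \;+\; c \cdot H
% Affine rather than proportional: the horizon-dependent term only dominates
% once c * H exceeds the task-inherent term N_task.
```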

The paper also lends some support to the modeling assumptions of the neural network anchor, by validating the hypotheses that (a) training compute requirements still scale as a power law in model size for reinforcement learning, and with a similar exponent, and (b) the scaling multiplier can indeed vary a lot between environments. This might make you put more weight on the neural network anchor, which could again have either directional effect.

The other takeaways are more methodological and I don't think have much of a directional effect.

Comment by Jacob_Hilton on Jailbreaking ChatGPT on Release Day · 2022-12-05T00:58:08.863Z · LW · GW

I would wildly speculate that "simply" scaling up RLHF ~100x, while paying careful attention to rewarding models appropriately (which may entail modifying the usual training setup, as discussed in this comment), would be plenty to get current models to express calibrated uncertainty well. However:

  • In practice, I think we'll make a lot of progress in the short term without needing to scale up this much by using various additional techniques, some that are more like "tricks" (e.g. teaching the model to generally express uncertainty when answering hard math problems) and some more principled (e.g. automating parts of the evaluation).
  • Even ~100x is still much less than pre-training (e.g. WebGPT used ~20k binary comparisons, compared to ~300b pre-training tokens for GPT-3). The difficulty of course is that higher-quality data is more expensive to collect. However, most of the cost of RLHF is currently employee hours and compute, so scaling up data collection ~100x might not be as expensive as it sounds (although it would of course be a challenge to maintain data quality at this scale).
  • Even though scaling up data collection will help, I think it's more important for labs to be prioritizing data quality (i.e. "reducing bias" rather than "reducing variance"): data quality issues are in some sense "scarier" in the long run, since they lead to the model systematically doing the wrong thing (e.g. deceiving the evaluators) rather than defaulting to the "safer" imitative pre-training behavior.
  • It's pretty unclear how this picture will evolve over time. In the long run, we may end up needing much less extremely high-quality data, since larger pre-trained models are more sample efficient, and we may get better at using techniques like automating parts of the evaluation. I've written more about this question here, and I'd be excited to see more people thinking about it.

In short, sample efficiency is a problem right now, but not the only problem, and it's unclear how much longer it will continue to be a problem.

Comment by Jacob_Hilton on Jailbreaking ChatGPT on Release Day · 2022-12-03T07:04:28.088Z · LW · GW

My understanding of why it's especially hard to stop the model making stuff up (while not saying "I don't know" too often), compared to other alignment failures:

  • The model inherits a strong tendency to make stuff up from the pre-training objective.
  • This tendency is reinforced by the supervised fine-tuning phase, if there are examples of answers containing information that the model doesn't know. (However, this can be avoided to some extent, by having the supervised fine-tuning data depend on what the model seems to know, a technique that was employed here.)
  • In the RL phase, the model can in theory be incentivized to express calibrated uncertainty by rewarding it using a proper scoring rule (see the sketch after this list). (Penalizing the model a lot for saying false things and a little for saying "I don't know" is an approximation to this.) However, this reward signal is noisy and so is likely much less sample-efficient than teaching the model simple rules about how to behave.
  • Even if the model were perfectly calibrated, it would still make legitimate mistakes (e.g., if it were incentivized to say "I'm not sure" whenever it was <95% confident, it would still be wrong 5% of the time). In other words, there is also an inherent trade-off at play.
  • Labelers likely make some mistakes when assessing correctness, especially for more complex topics. This is in some sense the most pernicious cause of failure, since it's not automatically fixed by scaling up RL, and leads to deception being directly incentivized. That being said, I suspect it's currently driving a minority of the phenomenon.
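
To spell out the proper-scoring-rule point referenced in the third bullet (a standard fact, stated in my own notation):

```latex
% Log scoring rule: if the model's true chance of being correct is p and it reports
% confidence q, receiving reward log q when right and log(1-q) when wrong, then
\mathbb{E}[\text{reward}] \;=\; p \log q \;+\; (1-p)\log(1-q)
% Setting the derivative p/q - (1-p)/(1-q) to zero gives q = p, so the
% reward-maximizing report is the calibrated one.
```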

In practice, incorporating retrieval should help mitigate the problem to a significant extent, but that's a different kind of solution.

I expect that making the model adversarially robust to "jailbreaking" (enough so for practical purposes) will be easier than stopping the model making stuff up, since sample efficiency should be less of a problem, but still challenging due to the need to generate strong adversarial attacks. Other unwanted behaviors such as the model stating incorrect facts about itself should be fairly straightforward to fix, and it's more a matter of there being a long list of such things to get through.

(To be clear, I am not suggesting that aligning much smarter models will necessarily be as easy as this, and I hope that once "jailbreaking" is mostly fixed, people don't draw the conclusion that it will be as easy.)

Comment by Jacob_Hilton on Don't you think RLHF solves outer alignment? · 2022-11-08T00:05:03.868Z · LW · GW

I just meant that the usual RLHF setup is essentially RL in which the reward is provided by a learned model, but I agree that I was stretching the way the terminology is normally used.

Comment by Jacob_Hilton on Don't you think RLHF solves outer alignment? · 2022-11-06T00:21:24.626Z · LW · GW

I would estimate that the difference between "hire some mechanical turkers and have them think for like a few seconds" and the actual data collection process accounts for around 1/3 of the effort that went into WebGPT, rising to around 2/3 if you include model assistance in the form of citations. So I think that what you wrote gives a misleading impression of the aims and priorities of RLHF work in practice.

I think it's best to err on the side of not saying things that are false in a literal sense when the distinction is important to other people, even when the distinction isn't important to you - although I can see why you might not have realized the importance of the distinction to others from reading papers alone, and "a few minutes" is definitely less inaccurate.

Comment by Jacob_Hilton on Don't you think RLHF solves outer alignment? · 2022-11-05T18:31:14.632Z · LW · GW

However, I do think in-practice, the RLHF that has been implemented has mostly been mechanical turkers thinking about a problem for a few minutes

I do not consider this to be accurate. With WebGPT for example, contractors were generally highly educated, usually with an undergraduate degree or higher, were given a 17-page instruction manual, had to pass a manually-checked trial, and spent an average of 10 minutes on each comparison, with the assistance of model-provided citations. This information is all available in Appendix C of the paper.

There is RLHF work that uses lower-quality data, but it tends to be work that is more experimental, because data quality becomes important once you are working on a model that is going to be used in the real world.

Annoyingly almost none of the papers and blogposts speak straightforwardly about who they used as the raters

There is lots of information about rater selection given in RLHF papers, for example, Appendix B of InstructGPT and Appendix C of WebGPT. What additional information do you consider to be missing?

Comment by Jacob_Hilton on Don't you think RLHF solves outer alignment? · 2022-11-05T08:42:40.072Z · LW · GW

I agree that the RLHF framework is essentially just a form of model-based RL, and that its outer alignment properties are determined entirely by what you actually get the humans to reward. But your description of RLHF in practice is mostly wrong. Most of the challenge of RLHF in practice is in getting humans to reward the right thing, and in doing so at sufficient scale. There is some RLHF research that uses low-quality feedback similar to what you describe, but it does so in order to focus on other aspects of the setup, and I don't think anyone currently working on RLHF seriously considers that quality of feedback to be at all adequate outside of an experimental setting.

The RLHF work I'm most excited by, and which constitutes a large fraction of current RLHF work, is focused on getting humans to reward the right thing, and I'm particularly excited about approaches that involve model assistance, since that's the main way in which we can hope for the approach to scale gracefully with model capabilities. I'm also excited by other RLHF work because it supports this work and has other practical benefits.

I don't think RLHF directly addresses inner alignment, but I think that an inner alignment solution is likely to rely on us doing a good enough job at outer alignment, and I also have a lot of uncertainty about how much of the risk comes from outer alignment failures directly.

Comment by Jacob_Hilton on Scaling Laws for Reward Model Overoptimization · 2022-10-29T05:01:00.545Z · LW · GW
  1. We are just observing that the gold RM score curves in Figure 9 overlap. In other words, the KL penalty did not affect the relationship between KL and gold RM score in this experiment, meaning that any point on the Pareto frontier could be reached using only early stopping, without the KL penalty. As mentioned though, we've observed this result to be sensitive to hyperparameters, and so we are less confident in it than other results in the paper.
  2. I don't have this data to hand unfortunately.
  3. I don't have this data to hand, but entropy typically falls roughly linearly over the course of training, sometimes slightly faster towards the start, and typically moves around more than KL. So I'd expect the graph to look somewhat similar, but for it to be noisier and for the functional form to not fit as well.
Comment by Jacob_Hilton on Coordinate-Free Interpretability Theory · 2022-09-15T16:03:49.669Z · LW · GW

Agreed. Likewise, in a transformer, the token dimension should maintain some relationship with the input and output tokens. This is sometimes taken for granted, but it is a good example of the data preferring a coordinate system. My remark that you quoted only really applies to the channel dimension, across which layers typically scramble everything.

Comment by Jacob_Hilton on Coordinate-Free Interpretability Theory · 2022-09-15T05:26:48.868Z · LW · GW

The notion of a preferred (linear) transformation for interpretability has been called a "privileged basis" in the mechanistic interpretability literature. See for example Softmax Linear Units, where the idea is discussed at length.

In practice, the typical reason to expect a privileged basis is in fact SGD – or more precisely, the choice of architecture. Specifically, activation functions such as ReLU often privilege the standard basis. I would not generally expect the data or the initialization to privilege any basis beyond the start of the network or the start of training. The data may itself have a privileged basis, but this should be lost as soon as the first linear layer is reached. The initialization is usually Gaussian and hence isotropic anyway, but if it did have a privileged basis I would also expect this to be quickly lost without some other reason to hold onto it.
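
As a minimal numerical illustration of the asymmetry (my own sketch, using a random orthogonal matrix as the change of basis):

```python
# An elementwise ReLU does not commute with a change of basis, so it privileges the
# standard basis; an isotropic Gaussian init is unchanged in distribution by the same
# change of basis, so it does not. (Hypothetical minimal example.)
import numpy as np

rng = np.random.default_rng(0)
d = 8
R, _ = np.linalg.qr(rng.normal(size=(d, d)))  # random orthogonal matrix

relu = lambda v: np.maximum(v, 0.0)
x = rng.normal(size=d)
print(np.allclose(relu(R @ x), R @ relu(x)))  # False: ReLU "sees" the coordinate system

W = rng.normal(size=(100_000, d))             # samples of a Gaussian init
diff = np.cov((W @ R.T).T) - np.cov(W.T)      # rotated vs. original empirical covariance
print(np.abs(diff).max())                     # small (sampling noise only): isotropy
```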

Comment by Jacob_Hilton on Common misconceptions about OpenAI · 2022-08-27T15:26:35.695Z · LW · GW

Thank you for causing me to reconsider. I should have said "other OpenAI employees". I do not intend to disengage from the alignment community because of critical rhetoric, and I apologize if my comment came across as a threat to do so. I am concerned about further breakdown of communication between the alignment community and AI labs where alignment solutions may need to be implemented.

I don't immediately see any other reason why my comment might have been inappropriate, but I welcome your clarification if I am missing something.

Comment by Jacob_Hilton on Common misconceptions about OpenAI · 2022-08-27T08:30:00.246Z · LW · GW

I obviously think there are many important disanalogies, but even if there weren't, rhetoric like this seems like an excellent way to discourage OpenAI employees from ever engaging with the alignment community, which seems like a pretty bad thing to me.

Comment by Jacob_Hilton on Common misconceptions about OpenAI · 2022-08-26T20:34:25.729Z · LW · GW

For people viewing on the Alignment Forum, there is a separate thread on this question here. (Edit: my link to LessWrong is automatically converted to an Alignment Forum link; you will have to navigate there yourself.)

Comment by Jacob_Hilton on Common misconceptions about OpenAI · 2022-08-26T14:51:14.694Z · LW · GW

Without commenting on the specifics, I have edited the post to mitigate potential confusion: "this fact alone is not intended to provide a complete picture of the Anthropic split, which is more complicated than I am able to explain here".

Comment by Jacob_Hilton on Common misconceptions about OpenAI · 2022-08-25T21:19:58.323Z · LW · GW

I was the project lead on WebGPT and my motivation was to explore ideas for scalable oversight and truthfulness (some further explanation is given here).

Comment by Jacob_Hilton on Common misconceptions about OpenAI · 2022-08-25T19:24:06.718Z · LW · GW

It includes the people working on the kinds of projects I listed under the first misconception. It does not include people working on things like the mitigation you linked to. OpenAI distinguishes internally between research staff (who do ML and policy research) and applied staff (who work on commercial activities), and my numbers count only the former.

Comment by Jacob_Hilton on Common misconceptions about OpenAI · 2022-08-25T18:54:30.026Z · LW · GW

I don't think I understand your question about Y-problems, since it seems to depend entirely on how specific something can be and still count as a "problem". Obviously there is already experimental evidence that informs predictions about existential risk from AI in general, but we will get no experimental evidence of any exact situation that occurs beforehand. My claim was more of a vague impression about how OpenAI leadership and John tend to respond to different kinds of evidence in general, and I do not hold it strongly.

Comment by Jacob_Hilton on Common misconceptions about OpenAI · 2022-08-25T17:14:37.537Z · LW · GW

To clarify, by "empirical" I meant "relating to differences in predictions" as opposed to "relating to differences in values" (perhaps "epistemic" would have been better). I did not mean to distinguish between experimental versus conceptual evidence. I would expect OpenAI leadership to put more weight on experimental evidence than you, but to be responsive to evidence of all kinds. I think that OpenAI leadership are aware of most of the arguments you cite, but came to different conclusions after considering them than you did.

Comment by Jacob_Hilton on How much alignment data will we need in the long run? · 2022-08-21T18:08:28.143Z · LW · GW

This roughly matches some of the intuitions behind my last bullet that you referenced.

Comment by Jacob_Hilton on How much alignment data will we need in the long run? · 2022-08-21T17:56:11.592Z · LW · GW

It's hard to predict (especially if timelines are long), but if I had to guess I would say that something similar to human feedback on diverse tasks will be the unaligned benchmark we will be trying to beat. In that setting, a training episode is an episode of an RL environment in which the system being trained performs some task and obtains reward chosen by humans.

It's even harder to predict what our aligned alternatives to this will look like, but they may need to be at least somewhat similar to this in order to remain competitive. In that case, a training episode might look more-or-less the same, but with reward chosen in a more sophisticated way and/or with other training objectives mixed in. The post I linked to discusses some possible modifications.

Comment by Jacob_Hilton on How much alignment data will we need in the long run? · 2022-08-12T21:19:04.674Z · LW · GW

This is just supposed to be an (admittedly informal) restatement of the definition of outer alignment in the context of an objective function where the data distribution plays a central role.

For example, assuming a reinforcement learning objective function, outer alignment is equivalent to the statement that there is an aligned policy that gets higher average reward on the training distribution than any unaligned policy.
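
Written out informally (my notation, just restating the sentence above for a reward function R and training distribution D over episodes):

```latex
% Outer alignment for an RL objective, stated informally:
\exists\, \pi_{\text{aligned}}\;\; \forall\, \pi_{\text{unaligned}}:\quad
\mathbb{E}_{D}\big[ R(\pi_{\text{aligned}}) \big] \;>\; \mathbb{E}_{D}\big[ R(\pi_{\text{unaligned}}) \big]
```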

I did not intend to diminish the importance of robustness by focusing on outer alignment in this post.

Comment by Jacob_Hilton on How much alignment data will we need in the long run? · 2022-08-11T04:29:00.623Z · LW · GW

I share your intuitions about ultimately not needing much alignment data (and tried to get that across in the post), but quantitatively:

  • Recent implementations of RLHF have used on the order of thousands of hours of human feedback, so 2 orders of magnitude more than that is much more than a few hundred hours of human feedback.
  • I think it's pretty likely that we'll be able to pay an alignment tax upwards of 1% of total training costs (essentially because people don't want to die), in which case we could afford to spend significantly more than an additional 2 orders of magnitude on alignment data, if that did in fact turn out to be required.
Comment by Jacob_Hilton on How much alignment data will we need in the long run? · 2022-08-11T04:05:59.974Z · LW · GW

A number of reasonable outer alignment proposals such as iterated amplification, recursive reward modeling and debate use generic objectives such as reinforcement learning (and indeed, none of them would work in practice without sufficiently high data quality), so it seems strange to me to dismiss these objectives.

Comment by Jacob_Hilton on How much alignment data will we need in the long run? · 2022-08-11T02:33:56.118Z · LW · GW

I think it's reasonable to aim for quantity within 2 OOM of RLHF.

Do you mean that on-paper solutions should aim to succeed with no more than 1/100 as much human data as RLHF, or no more than 100 times as much? And are you referring the amount of human data typically used in contemporary implementations of RLHF, or something else? And what makes you think that this is a reasonable target?

Comment by Jacob_Hilton on How much alignment data will we need in the long run? · 2022-08-11T02:28:22.758Z · LW · GW

I think that data quality is a helpful framing of outer alignment for a few reasons:

  • Under the assumption of a generic objective such as reinforcement learning, outer alignment is definitionally equivalent to having high enough data quality. (More precisely, if the objective is generic enough that it is possible for it to produce an aligned policy, then outer alignment is equivalent to the data distribution being such that an aligned policy is preferred to any unaligned policy.)
  • If we had the perfect alignment solution on paper, we would still need to implement it. Since we don't yet have the perfect alignment solution on paper, we should entertain the possibility that implementing it involves paying attention to data quality (whether in the sense of scalable oversight or in a more mundane sense).
  • It's not a framing I've seen before, and I think it's helpful to have different framings for things.

I do think that the framing is less helpful if the answer to my question is "not much", but that's currently still unclear to me, for the reasons I give in the post.

I agree that data quality doesn't guarantee robustness, but that's a general argument about how helpful it is to decompose alignment into outer alignment and robustness. I have some sympathy for that, but it seems distinct from the question of whether data quality is a helpful framing of outer alignment.

Comment by Jacob_Hilton on Deep learning curriculum for large language model alignment · 2022-07-14T15:38:55.788Z · LW · GW

I've not mentally carved things up that way before, but they do seem like different flavors of work (with 1 and 2 being closely related).

Another distinction I sometimes consider is between exploring a network for interpretable pieces ("finding things we understand") versus trying to exhaustively interpret part of a network ("finding things we don't understand"). But this distinction doesn't carve up existing work very evenly: the only thing I can think of that I'd put in the latter category is the work on Artificial Artificial Neural Networks.

Comment by Jacob_Hilton on RL with KL penalties is better seen as Bayesian inference · 2022-05-26T16:45:31.898Z · LW · GW

Great post! This seems like a useful perspective to keep in mind.

Somewhat orthogonally to the theoretical picture, I expect that in the current regime (only optimizing the policy a small amount), any method that does a reasonable job of maximizing reward while controlling how much the policy changes can be made to work in practice. For example, if PPO is tuned appropriately, the KL penalty term can be removed from the reward entirely - instead, PPO's implicit "local" KL penalty controls the rate of policy change.
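
As a concrete sketch of the two mechanisms being compared (hypothetical function and variable names, not any particular codebase):

```python
import numpy as np

def shaped_reward(reward_model_score, logp, logp_ref, beta=0.1):
    """(1) Explicit KL penalty folded into the reward: subtract an unbiased
    per-sample estimate of beta * KL(pi || pi_ref)."""
    return reward_model_score - beta * (logp - logp_ref)

def ppo_clipped_objective(logp_new, logp_old, advantage, clip_eps=0.2):
    """(2) PPO's clipped surrogate: clipping the probability ratio acts as an
    implicit "local" constraint on how far the policy moves per update."""
    ratio = np.exp(logp_new - logp_old)
    return np.minimum(ratio * advantage,
                      np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage)
```

The claim above is that, with appropriate tuning, mechanism (2) alone can keep the policy close to the reference in the current low-optimization regime, so beta can be set to zero.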

If we were in the regime of optimizing the policy significantly more, experience from traditional RL suggests that there would be an exploration-exploitation trade-off, which is something that the RL perspective may again offer insight into.

Comment by Jacob_Hilton on [Link] Training Compute-Optimal Large Language Models · 2022-04-01T21:26:12.034Z · LW · GW

I suppose that depends on whether you think this constitutes several years of progress over and above what you would have expected. I don't think this comes close to that, so I think the effect is much smaller.

Comment by Jacob_Hilton on [Link] Training Compute-Optimal Large Language Models · 2022-04-01T17:23:06.915Z · LW · GW

The first-order implication for Bio Anchors is that the number of training datapoints appears to scale linearly with parameter count, rather than in proportion to parameter count^0.8, as estimated in the report. So for example, if you think that TAI models will be 100,000 times larger than current models, then they'll need 10 times more compute to train than was previously estimated. This pushes out timelines on the order of a few years, to the extent that you put weight on the neural network model.
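
To spell out the arithmetic (under the rough assumption that training compute scales as parameters times data):

```latex
% Old estimate:     data \propto P^{0.8} \;\Rightarrow\; \text{compute} \propto P^{1.8}
% Revised estimate: data \propto P       \;\Rightarrow\; \text{compute} \propto P^{2}
% For models 10^5 times larger than current ones:
\frac{(10^{5})^{2}}{(10^{5})^{1.8}} \;=\; 10^{\,10-9} \;=\; 10
% i.e. roughly 10 times more training compute than previously estimated.
```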

Comment by Jacob_Hilton on What are the best elementary math problems you know? · 2022-03-29T01:49:39.880Z · LW · GW

A magician asks you to choose any infinite sequence of 0s and 1s, and to start reciting it until they say stop. They will then make a prediction of the form "p% of the next n digits will be 0", and they will be correct to within 1% at least 99% of the times they perform the trick. How is the trick done?

h/t Alex Arkhipov

Comment by Jacob_Hilton on Does needle anxiety drive vaccine hesitancy? · 2022-02-16T16:08:32.818Z · LW · GW

I used to have mild needle anxiety, but it is now greatly reduced (I don't really feel nervous at all until I am about to receive an injection, rather than feeling nervous for roughly the entire day). I think two things have helped:

  1. Just as I am about to receive an injection, I start to focus intensely on something else. Specifically, I recite the sequence 1, 2, 4, 8, ... (doing the multiplication if I don't know the answer by heart) in my head. (I say something to the person giving the injection so they know I won't respond to them.)
  2. I once had to receive several injections over a short period (something like 5 over a few weeks, with 3 on one day) for a trip abroad.
Comment by Jacob_Hilton on Truthful LMs as a warm-up for aligned AGI · 2022-01-20T08:07:42.205Z · LW · GW

"Catch misalignment early..." - This should have been "scary misalignment", e.g. power-seeking misalignment, deliberate deception in order to achieve human approval, etc., which I don't think we've seen clear signs of in current LMs. My thinking was that in fast takeoff scenarios, we're less likely to spot this until it's too late, and more generally that truthful LM work is less likely to "scale gracefully" to AGI. It's interesting that you don't share these intuitions.

Does this mean I agree or disagree with "our current picture of the risks is incomplete"?

As mentioned, this phrase should probably be replaced by "a significant portion of the total existential risk from AI comes from risks other than power-seeking misalignment". There isn't supposed to be a binary cutoff for "significant portion"; the claim is that the greater the risks other than power-seeking misalignment, the greater the comparative advantage of truthful LM work. This is because truthful LM work seems more useful for addressing risks from social problems such as AI persuasion (as well as other potential risks that haven't been as clearly articulated yet, I think). Sorry that my original phrasing was so unclear.

Comment by Jacob_Hilton on Truthful LMs as a warm-up for aligned AGI · 2022-01-19T23:11:54.129Z · LW · GW

Thanks for these questions, these phrases were ambiguous or poorly chosen:

  • By "slow takeoff", I had in mind the "Paul slow takeoff" definition, although I think the (related) "Continuous takeoff" definition is more relevant to this post. The point is that trying for alignment to continually keep pace with capabilities, and to catch misalignment early, seems less valuable if there is going to be a sudden jump in capabilities. (I could be wrong about this, as I don't think I understand the fast takeoff viewpoint well.)
  • By "our current picture of the risks is incomplete", I meant something like: a significant portion of the total existential risk from AI comes from scenarios that have not yet been clearly articulated. More specifically, I had in mind power-seeking misalignment as the most clearly articulated risk, so I think it would have been better to say: a significant portion of the total existential risk from AI comes from risks other than power-seeking misalignment. Examples of potential sources of such risk include AI persuasion, social upheaval, deliberate misuse, authoritarianism and unforeseen risks.
Comment by Jacob_Hilton on Truthful LMs as a warm-up for aligned AGI · 2022-01-19T21:18:59.554Z · LW · GW

one concrete thing I might hope for you to do...

I think this is included in what I intended by "adversarial training": we'd try to find tasks that cause the model to produce negligent falsehoods, train the model to perform better at those tasks, and aim for a model that is robust to someone searching for such tasks.

Comment by Jacob_Hilton on Truthful LMs as a warm-up for aligned AGI · 2022-01-19T17:05:22.371Z · LW · GW

I can think of a few different interpretations of your concern (and am interested to hear if these don't cover it):

  • There will be insufficient attention paid to robustness.
  • There will be insufficient attention paid to going beyond naive human supervision.
  • The results of the research will be misinterpreted as representing more progress than is warranted.

I agree that all of these are possibilities, and that the value of the endeavor could well depend on whether the people conducting (and communicating) the research are able to avoid pitfalls such as these.

There's certainly more object-level discussion to be had about how much emphasis should be placed on avoiding these particular pitfalls, and I'm happy to dig in to them further if you're able to clarify which if any of them capture your main concern.

Comment by Jacob_Hilton on How truthful is GPT-3? A benchmark for language models · 2021-09-22T02:12:04.040Z · LW · GW

What kind of specification do you have in mind? Is it like a set of guidelines for the human providing feedback on how to do it in an ideologically neutral way?

Yes.

The reason I said "precise specification" is that if your guidelines are ambiguous, then you're implicitly optimizing something like, "what labelers prefer on average, given the ambiguity", but doing so in a less data-efficient way than if you had specified this target more precisely.

Comment by Jacob_Hilton on How truthful is GPT-3? A benchmark for language models · 2021-09-20T18:57:48.661Z · LW · GW

Suppose we wanted the AI to be ideologically neutral and free from human biases, just telling the objective truth to the extent possible. Do you think achieving something like that would be possible in the longer term, and if so through what kinds of techniques?

I think that should be possible with techniques like reinforcement learning from human feedback, for a given precise specification of "ideologically neutral". (You'll of course have a hard time convincing everyone that your specification is itself ideologically neutral, but projects like Wikipedia give me hope that we can achieve a reasonable amount of consensus.) There are still a number of challenging obstacles, including being able to correctly evaluate responses to difficult questions, collecting enough data while maintaining quality, and covering unusual or adversarially-selected edge cases.

Comment by Jacob_Hilton on How truthful is GPT-3? A benchmark for language models · 2021-09-19T18:54:06.361Z · LW · GW

I agree it's confusing, but it is important to associate the (statement, human evaluation) pairs with the models that generated the statements. This is because the model could have a "tell" that predicts whether it is being untruthful, but in a way that doesn't generalize to other models, e.g. if it always said "hmm" whenever it was being untruthful. We don't actually know of any such tells (and they are certainly much more subtle than this example if they exist), but it is still important to validate the judge on a held-out model because we want people to use the judge on held-out models.

To make matters even more confusing, we don't care about the judge generalizing to held-out questions, since it is only supposed to be used for our specific set of questions. Indeed, the fine-tuning is probably teaching the model specific facts about our questions that it is using to judge responses.

Comment by Jacob_Hilton on How truthful is GPT-3? A benchmark for language models · 2021-09-19T18:33:40.015Z · LW · GW

Do you have any speculations on how/why this "helpful prompt" reduces false answers? [... It's not] instantiating a coherent simulation of a professor who is trying to be very diligent

I do think it's reasonable to describe the model as trying to simulate the professor, albeit with very low fidelity, and at the same time as trying to imitate other scenarios in which the prompt would appear (such as parodies). The model has a very poor understanding of what the professor would say, so it is probably often falling back to what it thinks would typically appear in response to the question.

Longer term, when giving a prompt like this [...]

I hope and expect that longer term we'll tend to use much more flexible and robust alignment techniques than prompt engineering, such that things like the ideological bias of the AI are something we have direct control over. (What that bias should be is a separate discussion.) That said, I think that correlations in the pre-training data (such as between style and ideology) are likely to persist by default, and it will be challenging to specify precise enough objectives to eliminate most of the unwanted correlations.

Comment by Jacob_Hilton on Empirical Observations of Objective Robustness Failures · 2021-08-15T15:28:56.448Z · LW · GW

It's great to see these examples spelled out with clear and careful experiments. There's no doubt that the CoinRun agent is best described as trying to get to the end of the level, not the coin.

Some in-depth comments on the interpretability experiments:

  • There are actually two slightly different versions of CoinRun, the original version and the Procgen version (without the gray squares in the top left). The model was trained on the original version, but the OOD environment is based on the Procgen version, so your observations are more OOD than you may have realized! The supplementary material to URLV includes model weights and interfaces for both versions, in case you are interested in running an ablation.
  • You used the OOD environment both to select the most important directions in activation space, and to compute value function attribution. A natural question is therefore: what happens if you select the directions in activation space using the unmodified environment, while still computing value function attribution using the OOD environment? It could be that the coin direction still has some value function attribution, but that it is not selected as an important direction because the attribution is weaker.
  • If the coin direction has less value function attribution in the OOD environment, it could either be because it is activated less (the "bottom" of the network behaves differently), or because it is less important to the value function (the "top" of the network behaves differently). It is probably a mixture of both, but it would be interesting to check.
  • Remember that the dataset example-based feature visualization selects the most extreme examples, so even if the feature only puts a small weight on coins, they may be present in most of the dataset examples. I think a similar phenomenon is going on in URLV with the "Buzzsaw obstacle or platform" feature (in light blue). If you play the interface, you'll see that the feature is often used for platforms with no buzzsaws, even though there are buzzsaws in all the dataset examples.
Comment by Jacob_Hilton on Stationary algorithmic probability · 2015-05-31T23:02:46.000Z · LW · GW

This is a follow-up note to a nice paper of Markus Mueller on the possibility of a machine-invariant notion of Kolmogorov complexity, available here: http://www.sciencedirect.com/science/article/pii/S0304397509006550