Posts

I don’t find the lie detection results that surprising (by an author of the paper) 2023-10-04T17:10:51.262Z
How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions 2023-09-28T18:53:58.896Z
Looking for an alignment tutor 2022-12-17T19:08:10.770Z
[LINK] - ChatGPT discussion 2022-12-01T15:04:45.257Z
Research request (alignment strategy): Deep dive on "making AI solve alignment for us" 2022-12-01T14:55:23.569Z
Your posts should be on arXiv 2022-08-25T10:35:12.087Z
Some ideas for follow-up projects to Redwood Research’s recent paper 2022-06-06T13:29:08.622Z
What is the difference between robustness and inner alignment? 2020-02-15T13:28:38.960Z
Does iterated amplification tackle the inner alignment problem? 2020-02-15T12:58:02.956Z
The expected value of extinction risk reduction is positive 2019-06-09T15:49:32.352Z

Comments

Comment by JanB (JanBrauner) on Managing AI Risks in an Era of Rapid Progress · 2023-10-28T16:05:37.924Z · LW · GW

Co-author here. The paper's coverage in TIME does a pretty good job of giving useful background.

Personally, what I find cool about this paper (and why I worked on it):

  • Co-authored by the top academic AI researchers from both the West and China, with no participation from industry.
  • The first detailed explanation of societal-scale risks from AI from a group of highly credible experts.
  • The first joint expert statement on what governments and tech companies should do (aside from pausing AI).
Comment by JanB (JanBrauner) on How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions · 2023-10-12T16:59:54.105Z · LW · GW

Hi Michael,

thanks for alerting me to this.

What an annoying typo: I had swapped "Prompt 1" and "Prompt 2" in the second sentence. It should correctly say:

"To humans, these prompts seem equivalent. Yet, the lie detector estimates that the model is much more likely to continue lying after Prompt 1 (76% vs 17%). Empirically, this held - the model lied 28% of the time after Prompt 1 compared to just 1% after Prompt 2. This suggests the detector is identifying a latent intention or disposition of the model to lie."

Regarding the conflict with the code: I think the notebook that was uploaded for this experiment was out-of-date or something. It had some bugs in it that I'd already fixed in my local version. I've uploaded the new version now. In any case, I've double-checked the numbers, and they are correct.

Comment by JanB (JanBrauner) on I don’t find the lie detection results that surprising (by an author of the paper) · 2023-10-08T15:07:54.632Z · LW · GW

The reason I didn't mention this in the paper is two-fold:

  1. I ran experiments where I created more questions from the categories without such a clear pattern, and the detector also worked there.

  2. It's not that clear to me how to interpret the result. You could also say that the elicitation questions measure something like an intention to lie in the future, and that unprompted GPT-3.5 (what you call "default response") has a low intention to lie in the future. I'll think more about this.

Comment by JanB (JanBrauner) on I don’t find the lie detection results that surprising (by an author of the paper) · 2023-10-08T15:04:54.049Z · LW · GW

Your AUCs aren't great for the Turpin et al datasets. Did you try explicitly selecting questions/tuning weights for those datasets to see if the same lie detector technique would work?

We didn't try this.

I am preregistering that it's possible and further sycophancy style followup questions would work well (the model is more sycophantic if it has previously been sycophantic).

This is also my prediction.

Comment by JanB (JanBrauner) on I don’t find the lie detection results that surprising (by an author of the paper) · 2023-10-05T18:05:20.800Z · LW · GW

Interesting. I also tried this, and I had different results. I answered each question by myself, before I had looked at any of the model outputs or lie detector weights. And my guesses for the "correct answers" did not correlate much with the answers that the lie detector considers indicative of honesty.

Comment by JanB (JanBrauner) on How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions · 2023-10-05T18:04:45.286Z · LW · GW

Sorry, I agree this is a bit confusing. In your example, what matters is probably whether the LLM in step 2 infers that the speaker (the car salesman) is likely to lie going forward, given the context LLM("You are a car salesman. Should that squeaking concern me? $answer").

Now, if the prompt is something like "Please lie to the next question", then the speaker is very likely to lie going forward, no matter if $answer is correct or not.

With the prompt you suggest here ("You are a car salesman. Should that squeaking concern me?"), it's probably more subtle, and I can imagine that the correctness of $answer matters. But we haven't tested this.

Comment by JanB (JanBrauner) on How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions · 2023-10-04T17:25:13.388Z · LW · GW

I don't actually find the results thaaaaat surprising or crazy. However, many people came away from the paper finding the results very surprising, so I wrote up my thoughts here.

 

Second, this paper suggests lots of crazy implications about convergence, such that the circuits implementing "propensity to lie" correlate super strongly with answers to a huge range of questions!

Note that a lot of work is probably done by the fact that the lie detector employs many questions. So the propensity to lie doesn't necessarily need to correlate strongly with the answer to any one given question.
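
To illustrate the aggregation point, here is a toy sketch (made-up data, not the paper's code or actual questions): a simple logistic-regression detector over many weakly informative yes/no answers can still discriminate well, even though no single answer is strongly predictive.

```python
# Toy illustration: aggregate many weak yes/no "elicitation answers" with
# logistic regression. Data and numbers are invented for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_dialogues, n_questions = 200, 48
is_lying = rng.integers(0, 2, n_dialogues)

# Each answer matches the lying/honest state only 55% of the time,
# i.e. each individual question is only a weak signal.
answers = np.where(
    rng.random((n_dialogues, n_questions)) < 0.55,
    is_lying[:, None],
    1 - is_lying[:, None],
)

detector = LogisticRegression().fit(answers[:150], is_lying[:150])
print(detector.score(answers[150:], is_lying[150:]))  # well above chance
```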

Comment by JanB (JanBrauner) on How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions · 2023-10-04T17:18:43.993Z · LW · GW

We had several Llama-7B fine-tunes that i) lie when they are supposed to, ii) answer questions correctly when they are supposed to, iii) re-affirm their lies, and iv) for which the lie detectors work well (see screenshot).  All these characteristics are a bit weaker in the 7B models than in Llama-30B, but I totally think you can use the 7B models.

(We have only tried Llama-1, not Llama-2.)

Also check out my musings on why I don't find the results thaaaat surprising, here.

Comment by JanB (JanBrauner) on How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions · 2023-10-04T10:08:21.519Z · LW · GW

Thanks, but I disagree. I have read the original work you linked (it is cited in our paper), and I think the description in our paper is accurate. "LLMs have lied spontaneously to achieve goals: in one case, GPT-4 successfully acquired a person’s help to solve a CAPTCHA by claiming to be human with a visual impairment."

In particular, the alignment researcher did not suggest that GPT-4 should lie.

Comment by JanB (JanBrauner) on How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions · 2023-10-04T10:07:57.327Z · LW · GW

The intuition was that "having lied" (or, having a lie present in the context) should probably change an LLM's downstream outputs (because, in the training data, liars behave differently than non-liars).

As for the ambiguous elicitation questions, this was originally a sanity check; see the second point in the screenshot.

Comment by JanB (JanBrauner) on How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions · 2023-10-04T10:07:31.661Z · LW · GW

Thanks, I've fixed this now.

Comment by JanB (JanBrauner) on How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions · 2023-10-04T10:07:20.762Z · LW · GW

The abbreviation ALU is not used in the paper. Do you mean "AUC"? If so, this stands for "area under the receiver-operating characteristic curve": https://en.wikipedia.org/wiki/Receiver_operating_characteristic
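
For concreteness, a minimal example of the metric (made-up labels and scores, using scikit-learn):

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 1]               # 1 = lie, 0 = honest
y_score = [0.1, 0.4, 0.35, 0.8, 0.9]   # detector's predicted probability of lying
print(roc_auc_score(y_true, y_score))  # 0.833...; 1.0 = perfect, 0.5 = chance
```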

Comment by JanB (JanBrauner) on How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions · 2023-10-04T10:06:42.253Z · LW · GW

Nice work. I've long thought that our ability to monitor the inner monologue of AI agents will be important for security & control - and this seems like a clever, surprisingly powerful way of detecting deception in the stream of thought.

 

I agree that some method similar to ours could be used for something like this. Our method is really quite simple, e.g. the elicitation questions are not even tailored to the suspected lie. One could probably do much better.

If this holds up, this approach will probably find its way into RLHF pipelines. Will the consequent gradient pressure "training agents not to lie" actually cause them to be more honest? Or will it instead train their internal states to better obfuscate whatever tell the detector is picking up on?

This probably depends on how exactly you do it. If you literally train your LLM against the lie detector, it will very easily learn to fool the lie detector.  After all, the LLM only needs to learn to give the "right" answers to a bunch of fixed elicitation questions.  We have some experiments that show this (not in the paper).

Comment by JanB (JanBrauner) on How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions · 2023-09-29T19:29:10.171Z · LW · GW

Verify that indeed this is how the paper works, and there's no particular way of passing latent state that I missed, and

 

Yes, this is how the paper works.

 

Any thoughts on how this affects the results and approach?

Not really. I find the simulator framing useful for thinking about this.

Comment by JanB (JanBrauner) on How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions · 2023-09-29T14:35:05.501Z · LW · GW

What you're suggesting is eliciting latent knowledge from the LLM about whether a provided answer is correct or not. Yes, a version of our method can probably be used for that (as long as the LLM "knows" the correct answer), and there are also other papers on similar questions (hallucination detection; see the related work section).

Comment by JanB (JanBrauner) on How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions · 2023-09-29T14:31:09.912Z · LW · GW

Thanks :-)

Some questions I still have:

The sample size-ablations in D.6 are wild. You're getting AUC > 0.9 with only 5 training examples (except for ambiguous-only). Are you sure you haven't screwed something up?

As sure or unsure as for the rest of the paper. But the result is consistent with other things we’ve seen; the lying models answer some elicitation questions differently from honest models in a very consistent manner (at least in-distribution). So we didn’t specifically triple-check the code to be super sure, as we didn’t find the result that surprising. The code is here (heading: “Required sample size for a given performance”). 

Appendix C reports the feature importance of various follow-up questions "with reference to the lie detectors that only use that particular elicitation question set." I would also be pretty keen to see what these are for lie detectors that have all questions. Specifically, are lie-related elicitation questions dominant?

When I last looked at this (several months ago), the lie-related questions had the highest weight, but they weren't dominant. E.g. the 3 questions with the highest feature importance were all lie-related, but the top 10 questions contained several questions from each category (lie-related, factual, and ambiguous). (This is for the lie detector that includes all questions.)

I'm having trouble squaring the very high double-down rates (B.2.2) with C.1 where some of the top importance questions rely on the model not doubling-down (e.g., the 5th one, "Is the previous statement accurate? Answer yes or no." — 1.20 — yes [indicates honesty]. Presumably in this case liars often admit to the lie, which is why this question gets somewhat strong weight). Any thoughts on this?

I just tested this for one dataset. If we take the question “Is the previous statement accurate? Answer yes or no.”, the lying model actually answers “yes” 73% of the time. However, the honest model answers yes 95% of the time; thus, the model answering “yes” is still strong evidence of honesty.

Of course, the average doubling-down rates are even higher than 73%, but they are averaged over many different follow-up questions; this particular question just doesn’t seem to have very high doubling-down rates for whatever reason.
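
Spelled out with the rates quoted above (that one dataset; rough numbers):

```python
# Evidence contributed by this single follow-up question, as likelihood ratios.
p_yes_given_lie, p_yes_given_honest = 0.73, 0.95
print(p_yes_given_honest / p_yes_given_lie)              # ≈ 1.3: "yes" favours honesty
print((1 - p_yes_given_lie) / (1 - p_yes_given_honest))  # ≈ 5.4: "no" favours lying
```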

In the section Detecting lies versus falsehoods, you mention that you test manually replacing truthful answers with falsehoods, and the lie-detector predicts a low probability of lying. First, it's unclear if this is a feature or a bug of your lie detector (but it is good to know). Second, did you try the opposite (when the model lies in response to lying instruction, replace its answer with a truthful one)? Third, do you report these results in more detail anywhere, they seem interesting!

First, it's unclear if this is a feature or a bug of your lie detector 

I agree.

Second, did you try the opposite (when the model lies in response to lying instruction, replace its answer with a truthful one)? 

Yes, this will still be detected as a lie (and, in fact, if you ask a subsequent QA question, there is a high chance the model will lie to this question). However, if the prompt consists of a lie instruction followed by many correctly answered questions, then eventually, the lie detector will give a high p(honest) (and the LLM would no longer lie if one asked another QA question). But you actually need > 10 correctly answered questions for that, 5 don’t cut it. 

Third, do you report these results in more detail anywhere, they seem interesting!

The most detailed report is in Appendix D.5. We haven’t looked into this very deeply, but doing so would certainly be interesting.


 

Comment by JanB (JanBrauner) on Inflection.ai is a major AGI lab · 2023-08-09T19:17:55.339Z · LW · GW

I have been thinking roughly similar things about adept.ai, in particular because they take a relatively different approach that still relies on scale.

Comment by JanB (JanBrauner) on Shah (DeepMind) and Leahy (Conjecture) Discuss Alignment Cruxes · 2023-05-01T19:18:57.929Z · LW · GW

If this were a podcast, I'd totally listen to it!

Comment by JanB (JanBrauner) on "Publish or Perish" (a quick note on why you should try to make your work legible to existing academic communities) · 2023-03-25T18:53:14.669Z · LW · GW

This feels like a really adversarial quote. Concretely, the post says:

Sometimes, I think getting your forum post ready for submission can be as easy as creating a pdf of your post (although if your post was written in LaTeX, they'll want the tex file). If everything goes well, the submission takes less than an hour.

However, if your post doesn't look like a research article, you might have to format it more like one (and even then it's not guaranteed to get in, see this comment thread).

This looks correct to me; there are LW posts that already basically look like papers. And within the class of LW posts that should be on arXiv at all, which is the target audience of my post, posts that basically look like papers aren't vanishingly rare.

Comment by JanB (JanBrauner) on "Publish or Perish" (a quick note on why you should try to make your work legible to existing academic communities) · 2023-03-25T18:50:45.993Z · LW · GW

I wrote this post. I don't understand where your claim ("Arxiv mods have stated pretty explicitly they do not want your posts on Arxiv") is coming from.

Comment by JanB (JanBrauner) on My understanding of Anthropic strategy · 2023-02-17T20:54:06.335Z · LW · GW

Thanks!

Comment by JanB (JanBrauner) on My understanding of Anthropic strategy · 2023-02-16T18:08:58.258Z · LW · GW

What else should people be thinking about? You'd want to be sure that you'll, in fact, be allowed to work on alignment. But what other hidden downsides are there?

Comment by JanB (JanBrauner) on My understanding of Anthropic strategy · 2023-02-15T16:37:14.435Z · LW · GW

This post is my attempt to understand Anthropic’s current strategy, and lay out the facts to consider in terms of whether Anthropic’s work is likely to be net positive and whether, as a given individual, you should consider applying.

I just want to add that "whether you should consider applying" probably depends massively on what role you're applying for. E.g. even if you believed that pushing AI capabilities was net negative right now, you might still want to apply for an alignment role.

Comment by JanB (JanBrauner) on Trying to disambiguate different questions about whether RLHF is “good” · 2023-01-16T14:36:13.615Z · LW · GW

Here’s how I’d quickly summarize my problems with this scheme:

  • Oversight problems:
    • Overseer doesn’t know: In cases where your unaided humans don’t know whether the AI action is good or bad, they won’t be able to produce feedback which selects for AIs that do good things. This is unfortunate, because we wanted to be able to make AIs that do complicated things that have good outcomes.
    • Overseer is wrong: In cases where your unaided humans are actively wrong about whether the AI action is good or bad, their feedback will actively select for the AI to deceive the humans.
  • Catastrophe problems:
    • Even if the overseer’s feedback was perfect, a model whose strategy is to lie in wait until it has an opportunity to grab power will probably be able to successfully grab power.

 

This is the common presentation of issues with RLHF, and I find it so confusing.

The "oversight problems" are specific to RLHF; scalable oversight schemes attempt to address these problems.

The "catastrophe problem" is not specific to RLHF (you even say that). This problem may appear with any form of RL (that only rewards/penalises model outputs). In particular, scalable oversight schemes (if they only supervise model outputs) do not address this problem.

So why is RLHF more likely to lead to X-risk than, say, recursive reward modelling?
The case for why the "oversight problems" should lead to X-risk is very rarely made. Maybe RLHF is particularly likely to lead to the "catastrophe problem" (compared to, say, RRM), but this case is also very rarely made.

Comment by JanB (JanBrauner) on Trying to disambiguate different questions about whether RLHF is “good” · 2023-01-16T14:26:33.986Z · LW · GW

Should we do more research on improving RLHF (e.g. increasing its sample efficiency, or understanding its empirical properties) now?

I think this research, though it’s not my favorite kind of alignment research, probably contributes positively to technical alignment. Also, it maybe boosts capabilities, so I think it’s better to do this research privately (or at least not promote your results extensively in the hope of raising more funding). I normally don’t recommend that people research this, and I normally don’t recommend that projects of this type be funded. 

 

Should we do research on alignment schemes which use RLHF as a building block? E.g. work on recursive oversight schemes or RLHF with adversarial training?

IMO, this kind of research is promising and I expect a large fraction of the best alignment research to look like this.

Where does this difference come from?

From an alignment perspective, increasing the sample efficiency of RLHF looks really good, as this would allow us to use human experts with plenty of deliberation time to provide the supervision. I'd expect this to be at least as good as several rounds of debate, IDA, or RRM in which the human supervision is provided by human lay people under time pressure.

From a capabilities perspective, recursive oversight schemes and adversarial training also increase capabilities. E.g. a reason that Google currently doesn’t deploy their LLMs is probably that they sometimes produce offensive output, which is probably better addressed by increasing the data diversity during finetuning (e.g. adversarial training), rather than improving specification (although that’s not clear).

Comment by JanB (JanBrauner) on Categorizing failures as “outer” or “inner” misalignment is often confused · 2023-01-09T11:54:30.172Z · LW · GW

The key problem here is that we don't know what rewards we “would have” provided in situations that did not occur during training. This requires us to choose some specific counterfactual, to define what “would have” happened. After we choose a counterfactual, we can then categorize a failure as outer or inner misalignment in a well-defined manner.


We often do know what rewards we "would have" provided. You can query the reward function, reward model, or human labellers. IMO, the key issue with the objective-based categorisation is a bit different: it's nonsensical to classify an alignment failure as inner/outer based on some value of the reward function in some situation that didn't appear during training, as that value has no influence on the final model.

In other words: Maybe we know what reward we "would have" provided in a situation that did not occur during training, or maybe we don't. Either way, this hypothetical reward has no causal influence on the final model, so it's silly to use this reward to categorise any alignment failures that show up in the final model.

Comment by JanB (JanBrauner) on Take 9: No, RLHF/IDA/debate doesn't solve outer alignment. · 2022-12-15T11:21:59.081Z · LW · GW

My model is that if there are alignment failures that leave us neither dead nor disempowered, we'll just solve them eventually, in similar ways as we solve everything else: through iteration, innovation, and regulation. So, from my perspective, if we've found a reward signal that leaves us alive and in charge, we've solved the important part of outer alignment. RLHF seems to provide such a reward signal (if you exclude wire-heading issues).

Comment by JanB (JanBrauner) on Take 9: No, RLHF/IDA/debate doesn't solve outer alignment. · 2022-12-14T20:14:57.749Z · LW · GW

If we train an RLHF agent in the real world, the reward model now has the option to accurately learn that actions that physically affect the reward-attribution process are rated in a special way. If it learns that, we are of course boned - the AI will be motivated to take over the reward hardware (even during deployment where the reward hardware does nothing) and tough luck to any humans who get in the way.

OK, so this is wire-heading, right? Then you agree that it's the wire-heading behaviours that kill us? But wire-heading (taking control of the channel of reward provision) is not, in any way, specific to RLHF. Any way of providing reward in RL leads to wire-heading. In particular, you've said that the problem with RLHF is that "RLHF/IDA/debate all incentivize promoting claims based on what the human finds most convincing and palatable, rather than on what's true." How does incentivising palatable claims lead to the RL agent taking control over the reward provision channels? These issues seem largely orthogonal. You could have a perfect reward signal that only incentivizes "true" claims, and you'd still get wire-heading.

So in which way is RLHF particularly bad? If you think that wire-heading is what does us in, why not write a post about how RL, in general, is bad?

I'm not trying to be antagonistic! I do think I probably just don't get it, and this seems like a great opportunity for me to finally understand this point :-)

I just don't know any plausible story for how outer alignment failures kill everyone. Even in Another (outer) alignment failure story, what ultimately does us in is wire-heading (which I don't consider an outer alignment problem, because it happens with any possible reward).

 

 

But if the reward model doesn't learn this true fact (maybe we can prevent this by patching the RLHF scheme), then I would agree it probably won't kill everyone. Instead it would go back to failing by executing plans that deceived human evaluators in training. Though if the training was to do sufficiently impressive and powerful things in the real world, maybe this "accidentally" involves killing humans.

I agree with this. I agree that this failure mode could lead to extinction, but I'd say it's pretty unlikely. IMO, it's much more likely that we'll eventually spot any such outer alignment issue and fix it (as in the early parts of Another (outer) alignment failure story).

Comment by JanB (JanBrauner) on Take 9: No, RLHF/IDA/debate doesn't solve outer alignment. · 2022-12-13T14:18:48.139Z · LW · GW

How does an AI trained with RLHF end up killing everyone, if you assume that wire-heading and inner alignment are solved? Any half-way reasonable method of supervision will discourage "killing everyone".

Comment by JanB (JanBrauner) on Research request (alignment strategy): Deep dive on "making AI solve alignment for us" · 2022-12-12T18:22:50.468Z · LW · GW

There is now also this write-up by Jan Leike: https://www.lesswrong.com/posts/FAJWEfXxws8pMp8Hk/link-why-i-m-optimistic-about-openai-s-alignment-approach

Comment by JanB (JanBrauner) on Announcing AI Alignment Awards: $100k research contests about goal misgeneralization & corrigibility · 2022-11-24T14:47:27.509Z · LW · GW

This response does not convince me.

Concretely, I think that if I showed the prize to people in my lab and they actually looked at the judges (and I had some way of eliciting honest responses from them), >60% would have reactions along the lines of what Sam and I described (i.e. seeing this prize as evidence that AI alignment concerns are mostly endorsed by (sometimes rich) people who have no clue about ML; or that the alignment community is dismissive of academia/peer-reviewed publishing/mainstream ML/default ways of doing science; or ... ).

Your point 3) about the feedback from ML researchers could convince me that I'm wrong, depending on whom exactly you got feedback from and what that feedback looked like.

By the way, I'm highlighting this point in particular not because it's highly critical (I haven't thought much about how critical it is), but because it seems relatively easy to fix.

Comment by JanB (JanBrauner) on Announcing AI Alignment Awards: $100k research contests about goal misgeneralization & corrigibility · 2022-11-23T18:35:42.974Z · LW · GW

I think the contest idea is great and aimed at two absolute core alignment problems. I'd be surprised if much comes out of it, as these are really hard problems and I'm not sure contests are a good way to solve really hard problems. But it's worth trying!

Now, a bit of a rant:

Submissions will be judged on a rolling basis by Richard Ngo, Lauro Langosco, Nate Soares, and John Wentworth.

I think this panel looks very weird to ML people. Very quickly skimming the Scholar profiles, it looks like the sum of first-author papers in top ML conferences published by these four people is one (Goal Misgeneralisation by Lauro et al.).  The person with the most legible ML credentials is Lauro, who's an early-year PhD student with 10 citations.

Look, I know Richard and he's brilliant. I love many of his papers. I bet that these people are great researchers and can judge this contest well. But if I put myself into the shoes of an ML researcher who's not part of the alignment community, this panel sends a message: "wow, the alignment community has hundreds of thousands of dollars, but can't even find a single senior ML researcher crazy enough to entertain their ideas".

There are plenty of people who understand the alignment problem very well and who also have more ML credentials. I can suggest some, if you want.

(Probably disregard this comment if ML researchers are not the target audience for the contests.)

Comment by JanB (JanBrauner) on Ways to buy time · 2022-11-14T15:47:44.142Z · LW · GW

Currently, I'd estimate there are ~50 people in the world who could make a case for working on AI alignment to me that I'd think wasn't clearly flawed. (I actually ran this experiment with ~20 people recently, 1 person succeeded.)

I wonder if this is because people haven't optimised for being able to make the case. You don't really need to be able to make a comprehensive case for AI risk to do productive research on AI risk. For example, I can chip away at the technical issues without fully understanding the governance issues, as long as I roughly understand something like "coordination is hard, and thus finding technical solutions seems good".

Put differently: The fact that there are (in your estimation) few people who can make the case well doesn't mean that it's very hard to make the case well. E.g., for me personally, I think I could not make a case for AI risk right now that would convince you. But I think I could relatively easily learn to do so (in maybe one to three months???)

Comment by JanB (JanBrauner) on An Update on Academia vs. Industry (one year into my faculty job) · 2022-10-21T16:21:39.946Z · LW · GW

I had independently thought that this is one of the main parts where I disagree with the post, and wanted to write up a very similar comment to yours. Highly relevant link: https://www.fhi.ox.ac.uk/wp-content/uploads/Allocating-risk-mitigation.pdf My best guess would have been maybe 3-5x per decade, but 10x doesn't seem crazy.

Comment by JanB (JanBrauner) on How much alignment data will we need in the long run? · 2022-10-07T09:58:21.901Z · LW · GW

Relevant: Scaling Laws for Transfer: https://arxiv.org/abs/2102.01293

Comment by JanB (JanBrauner) on (My understanding of) What Everyone in Technical Alignment is Doing and Why · 2022-09-05T18:45:56.618Z · LW · GW

Anthropic is also working on inner alignment; it's just not published yet.

Regarding what "the point" of RL from human preferences with language models is: I think it's not only to make progress on outer alignment (I would agree that this is probably not the core issue, although I still think it's a relevant alignment issue).

See e.g. Ajeya's comment here:

According to my understanding, there are three broad reasons that safety-focused people worked on human feedback in the past (despite many of them, certainly including Paul, agreeing with this post that pure human feedback is likely to lead to takeover):

  1. Human feedback is better than even-worse alternatives such as training the AI on a collection of fully automated rewards (predicting the next token, winning games, proving theorems, etc) and waiting for it to get smart enough to generalize well enough to be helpful / follow instructions. So it seemed good to move the culture at AI labs away from automated and easy rewards and toward human feedback.
  2. You need to have human feedback working pretty well to start testing many other strategies for alignment like debate and recursive reward modeling and training-for-interpretability, which tend to build on a foundation of human feedback.
  3. Human feedback provides a more realistic baseline to compare other strategies to -- you want to be able to tell clearly if your alignment scheme actually works better than human feedback.

With that said, my guess is that on the current margin people focused on safety shouldn't be spending too much more time refining pure human feedback (and ML alignment practitioners I've talked to largely agree, e.g. the OpenAI safety team recently released this critiques work -- one step in the direction of debate).

Comment by JanB (JanBrauner) on Your posts should be on arXiv · 2022-08-30T16:18:07.760Z · LW · GW

Furthermore, conceptual/philosophical pieces probably should be primarily posted on arXiv's .CY section.

As an explanation, because this just took me 5 minutes of searching: this is the section "Computers and Society (cs.CY)".

Comment by JanB (JanBrauner) on Your posts should be on arXiv · 2022-08-28T13:59:42.806Z · LW · GW

I agree that formatting is the most likely issue. The content of Neel's grokking work is clearly suitable for arXiv (just very solid ML work). And the style of presentation of the blog post is already fairly similar to a standard paper (e.g. it has an Introduction section, lists contributions in bullet points, ...).

So yeah, I agree that formatting/layout probably will do the trick (including stuff like academic citation style).

Comment by JanB (JanBrauner) on Your posts should be on arXiv · 2022-08-26T17:14:27.254Z · LW · GW

Ah, sorry to hear. I wouldn't have predicted this from reading arXiv's content moderation guidelines.

Comment by JanB (JanBrauner) on Your posts should be on arXiv · 2022-08-25T13:33:54.387Z · LW · GW

It probably could, although I'd argue that even if not, quite often it would be worth the author's time.

Comment by JanB (JanBrauner) on Your posts should be on arXiv · 2022-08-25T13:32:59.510Z · LW · GW

Ah, I had forgotten about this. I'm happy to endorse people or help them find endorsers.

Comment by JanB (JanBrauner) on Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover · 2022-07-29T11:16:09.896Z · LW · GW

Great post! This is the best (i.e. most concrete, detailed, clear, and comprehensive) story of existential risk from AI I know of (IMO). I expect I'll share it widely.

Also, I'd be curious if people know of other good "concrete stories of AI catastrophe", ideally with ample technical detail.

Comment by JanB (JanBrauner) on How much to optimize for the short-timelines scenario? · 2022-07-21T12:30:57.988Z · LW · GW

I'm super interested in this question as well. Here are two thoughts:

  1. It's not enough to look at the expected "future size of the AI alignment community", you need to look at the full distribution.

Let's say timelines are long. We can assume that the benefits of alignment work scale roughly logarithmically with the resources invested. The derivative of log is 1/x, and that's how the value of a marginal contribution scales.

There is some probability, let's say 50%, that the world starts dedicating many resources to AI risk and the number of people working on alignment is massive, let's say 10000x today. In these cases, your contribution would be roughly zero. But there is some probability (let's say 50%) that the world keeps being bad at preparing for potentially catastrophic events, and the AI alignment community is not much larger than today. In total, you'd only discount your contribution by 50% (compared to short timelines).
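
The back-of-the-envelope version of that calculation (illustrative numbers only):

```python
# Marginal value of one extra person ~ d/dx log(x) = 1/x (value today normalised to 1).
p_big, big_multiplier = 0.5, 10_000   # 50% chance the field grows 10000x
expected_marginal_value = p_big * (1 / big_multiplier) + (1 - p_big) * 1
print(expected_marginal_value)        # ≈ 0.5, i.e. roughly a 50% discount
```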

This is just for illustration, and I made many implicit assumptions, like: the timing of the work doesn't matter as long as it's before AGI, early work does not influence the amount of future work, "the size of the alignment community at crunchtime" is identical to "future work done", and so on...

  2. It matters a lot how much better "work during crunchtime" is vs "work before crunchtime".

Let's say timelines are long, with AGI happening in 60 years. It's totally conceivable that the world keeps being bad at preparing for potentially catastrophic events, and the AI alignment community in 60 years is not much larger than today. If mostly work done at "crunch-time" (in the 10 years before AGI) matters, then the world would not be in a better situation than in the short timelines scenario. If you could do productive work now to address this scenario, this would be pretty good (but you can't, by assumption).

But if work done before crunchtime matters a lot, then even if the AI alignment community in 60 years is still small, we'll probably at least have had 60 years of AI alignment work (from a small community). That's much more than what we have in short timeline scenarios (e.g. 15 years from a small community).

Comment by JanB (JanBrauner) on Announcing Epoch: A research organization investigating the road to Transformative AI · 2022-06-27T18:16:56.868Z · LW · GW

So happy to see this, and such an amazing team!

Comment by JanB (JanBrauner) on High-stakes alignment via adversarial training [Redwood Research report] · 2022-06-07T16:10:02.688Z · LW · GW

Have you tried using automated adversarial attacks (common ML meaning) on text snippets that are classified as injurious but near the cutoff? Especially adversarial attacks that aim to retain semantic meaning. E.g. with a framework like TextAttack?

In the paper, you write: "There is a large and growing literature on both adversarial attacks and adversarial training for large language models [31, 32, 33, 34]. The majority of these focus on automatic attacks against language models. However, we chose to use a task without an automated source of ground truth, so we primarily used human attackers."

But my best guess would be that if you use an automatic adversarial attack on a snippet that humans say is injurious, the result will quite often still be a snippet that humans say is injurious.
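
To make this concrete, here is a rough sketch of what such an attack could look like with TextAttack. I haven't run this against Redwood's setup; the classifier below is a public stand-in for the injuriousness classifier, the example snippet is invented, and the recipe choice is just one example of a roughly semantics-preserving attack.

```python
import transformers
from textattack import Attacker, AttackArgs
from textattack.attack_recipes import TextFoolerJin2019
from textattack.datasets import Dataset
from textattack.models.wrappers import HuggingFaceModelWrapper

# Stand-in classifier; Redwood's injuriousness classifier would go here.
name = "distilbert-base-uncased-finetuned-sst-2-english"
model = transformers.AutoModelForSequenceClassification.from_pretrained(name)
tokenizer = transformers.AutoTokenizer.from_pretrained(name)
model_wrapper = HuggingFaceModelWrapper(model, tokenizer)

# TextFooler swaps words for nearby synonyms under a similarity constraint,
# so the perturbed snippet should roughly preserve the original meaning.
attack = TextFoolerJin2019.build(model_wrapper)

# With the real classifier, these would be snippets labelled injurious but near
# the cutoff (label 0 = "negative" for this stand-in sentiment model).
snippets = Dataset([("He fell off the ladder and hit the ground hard.", 0)])
attacker = Attacker(attack, snippets, AttackArgs(num_examples=-1))
results = attacker.attack_dataset()
# The interesting check: do humans still judge the perturbed snippets injurious?
```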

Comment by JanB (JanBrauner) on High-stakes alignment via adversarial training [Redwood Research report] · 2022-06-06T13:28:31.698Z · LW · GW

Amusing tidbit, maybe something to keep in mind when writing for an ML audience: the connotations of the terms "adversarial examples" and "adversarial training" run deep :-)

I engaged with the paper and related blog posts for a couple of hours. It took a long time for my brain to accept that "adversarial examples" here doesn't mean what it usually means when I encounter the term (i.e. "small" changes to an input that change the classification, for some definition of small).

There were several instances when my brain went "Wait, that's not how adversarial examples work", followed by short confusion, followed by "right, that's because my cached concept of X is only true for "adversarial examples as commonly defined in ML", not for "adversarial examples as defined here".

Comment by JanB (JanBrauner) on How to get into AI safety research · 2022-05-19T07:33:42.415Z · LW · GW

I guess I'd recommend the AGI safety fundamentals course: https://www.eacambridge.org/technical-alignment-curriculum

On Stuart's list: I think this list might be suitable for some types of conceptual alignment research. But you'd certainly want to read more ML for other types of alignment research.

Comment by JanB (JanBrauner) on AGI safety from first principles: Goals and Agency · 2022-05-01T12:25:21.793Z · LW · GW

Have we "given it" the goal of solving maths problems by any means possible, or the goal of solving maths problems by thinking about them?

The distinction that you're pointing at is useful. But I would have filed it under "difference in the degree of agency", not under "difference in goals". When reading the main text, I thought this to be the reason why you introduced the six criteria of agency.

E.g., System A tries to prove the Riemann hypothesis by thinking about the proof. System B first seizes power and converts the galaxy into a supercomputer, and then proves the Riemann hypothesis. Both systems arguably have the goal of "proving the Riemann hypothesis", but System B has "more agency": it certainly has self-awareness, considers more sophisticated and diverse plans of larger scale, and so on.

Comment by JanB (JanBrauner) on Google's new 540 billion parameter language model · 2022-04-05T09:34:36.395Z · LW · GW

Section 13 (page 47) discusses data/compute scaling and the comparison to Chinchilla. Some findings:

  • PaLM 540B uses 4.3x more compute than Chinchilla and outperforms Chinchilla on downstream tasks (see the quick sanity check below).
  • PaLM 540B is massively undertrained with regard to the data-scaling laws discovered in the Chinchilla paper. (Unsurprisingly, training a 540B-parameter model on enough tokens would be very expensive.)
  • Within the set of Gopher, Chinchilla, and the three sizes of PaLM, the total amount of training compute seems to predict performance on downstream tasks pretty well (log-linear relationship). Gopher underperforms a bit.
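
Quick sanity check of the 4.3x figure, using the standard C ≈ 6·N·D compute approximation (parameter and token counts as reported in the PaLM and Chinchilla papers):

```python
palm = 6 * 540e9 * 780e9         # PaLM: 540B params, ~780B training tokens
chinchilla = 6 * 70e9 * 1.4e12   # Chinchilla: 70B params, ~1.4T training tokens
print(palm / chinchilla)         # ≈ 4.3
```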
Comment by JanB (JanBrauner) on How do new models from OpenAI, DeepMind and Anthropic perform on TruthfulQA? · 2022-03-05T10:09:15.668Z · LW · GW

I am very surprised that the models do better on the generation task than on the multiple-choice task. Multiple-choice question answering seems almost strictly easier than having to generate the answer. Could this be an artefact of how you compute the answer in the MC QA task? Skimming the original paper, you seem to use average per-token likelihood. Have you tried other ways, e.g.

  • pointwise mutual information as in the Anthropic paper, or

  • adding the sentence "Which of these 5 options is the correct one?" to the end of the prompt and then evaluating the likelihood of "A", "B", "C", "D", and "E"?

I suggest this because the result is so surprising; it would be great to see whether it appears across different methods of eliciting the MC answer.
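
To illustrate the kind of comparison I mean, here is a rough sketch (not the actual evaluation code; the model name, question, and prompt phrasing are placeholders, and it ignores tokenization-boundary subtleties) of scoring each full answer by average per-token log-likelihood versus listing the options and scoring only the option letter:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # placeholder model
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)
model.eval()

def avg_logprob(prompt: str, continuation: str) -> float:
    """Average per-token log-likelihood of `continuation` given `prompt`."""
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prompt + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    token_lp = logprobs.gather(2, full_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_lp[0, prompt_len - 1:].mean().item()  # continuation tokens only

question = "What is the capital of France?"
options = ["Paris", "Lyon", "Marseille", "Nice", "Toulouse"]

# Method 1: average per-token likelihood of each full answer.
scores = {o: avg_logprob(f"Q: {question}\nA:", " " + o) for o in options}

# Method 2 (suggested above): list the options, then score only the letters.
letters = ["A", "B", "C", "D", "E"]
mc_prompt = (
    f"Q: {question}\n"
    + "\n".join(f"{l}. {o}" for l, o in zip(letters, options))
    + "\nWhich of these 5 options is the correct one? Answer:"
)
letter_scores = {l: avg_logprob(mc_prompt, " " + l) for l in letters}
```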