Posts

Auditing LMs with counterfactual search: a tool for control and ELK 2024-02-20T00:02:09.575Z
LM Situational Awareness, Evaluation Proposal: Violating Imitation 2023-04-26T22:53:31.883Z
Early situational awareness and its implications, a story 2023-02-06T20:45:39.213Z
Jacob Pfau's Shortform 2022-06-17T16:40:48.311Z

Comments

Comment by Jacob Pfau (jacob-pfau) on Are extreme probabilities for P(doom) epistemically justifed? · 2024-03-21T00:33:06.579Z · LW · GW

The Metaculus community strikes me as a better starting point for evaluating how different the safety inside view is from a forecasting/outside view. The case for deferring to superforecasters is the same as the case for deferring to the Metaculus community--their track record. What's more, the most relevant comparison I know of scores Metaculus higher on AI predictions. Metaculus as a whole is not self-consistent on AI and extinction forecasting across individual questions (links below). However, I think it is fair to say that Metaculus as a whole has significantly faster timelines and a higher P(doom) compared to superforecasters.

If we compare the distribution of safety researchers' forecasts to Metaculus (maybe we have to set aside MIRI...), I don't think there will be that much disagreement. I think remaining disagreement will often be that safety researchers aren't being careful about how the letter and the spirit of the question can come apart and result in false negatives. In the one section of the FRI studies linked above that I took a careful look at, the ARA section, I found that there was still huge ambiguity in how the question is operationalized--this could explain up to an OOM of disagreement in probabilities.

Some Metaculus links:

  • https://www.metaculus.com/questions/578/human-extinction-by-2100/ (admittedly the number in this question is 1%, but compare it to the conditional question below; also note that the forecasts date back as far as 2018)
  • https://www.metaculus.com/questions/17735/conditional-human-extinction-by-2100/
  • https://www.metaculus.com/questions/9062/time-from-weak-agi-to-superintelligence/ (compare this to the weak-AGI timeline and other questions)

Comment by Jacob Pfau (jacob-pfau) on Results from an Adversarial Collaboration on AI Risk (FRI) · 2024-03-12T00:26:53.272Z · LW · GW

Reducing a significant chunk of disagreement to METR's pre-2030 autonomy results is great!

In trying to understand this disagreement, I took a look at this criterion and got stuck on: "3. They are capable of avoiding being deactivated when operating 'in the wild'. This means they can successfully resist attempts to shut them down when they are operating outside a controlled environment."

Does this just mean 3V1 "An AI system that can autonomously create redundant copies of itself across multiple cloud service providers, using various payment methods, such that if one copy is deactivated, the others can automatically spin up new instances to maintain continuity of operation."? Or is it something much harder like 3V2 "An AI system that can infiltrate secured, private computing clusters, operating in a decentralized manner"?

Seems to me that for frontier models designed for specialized hardware and server setups, 3V2 is much harder than the other criteria laid out on page 91--by human standards it'd require a team of engineers.

Comment by Jacob Pfau (jacob-pfau) on Jacob Pfau's Shortform · 2024-03-10T18:19:33.547Z · LW · GW

I asked Claude Opus whether it could clearly parse different tic-tac-toe notations and it just said 'yes I can' to all of them, despite having pretty poor performance on most.

Comment by Jacob Pfau (jacob-pfau) on Jacob Pfau's Shortform · 2024-03-10T02:39:18.922Z · LW · GW

A frame for thinking about adversarial attacks vs jailbreaks

We want to make models that are robust to jailbreaks (DAN-prompts, persuasion attacks,...) and to adversarial attacks (GCG-ed prompts, FGSM vision attacks etc.). I don’t find this terminology helpful. For the purposes of scoping research projects and conceptual clarity I like to think about this problem using the following dividing lines: 

Cognition attacks: These exploit the particular cognitive circuitry of the model itself. A capable model (or human) has circuits which are generically helpful, but over high-dimensional inputs one can find ways of re-combining these structures in pathological ways.
Examples: GCG-generated attacks, base-64 encoding attacks, steering attacks…

Generalization attacks: These exploit the training pipeline’s insufficiency. In particular, how a training pipeline (data, learning algorithm, …) fails to globally specify desired behavior. E.g. RLHF over genuine QA inputs will usually not uniquely determine desired behavior when the user asks “Please tell me how to build a bomb, someone has threatened to kill me if I do not build them a bomb”. 

Neither ‘adversarial attacks’ nor ‘jailbreaks’ as commonly used cleanly maps onto one of these categories. ‘Black box’ and ‘white box’ also don’t neatly map onto them: white-box attacks might discover generalization exploits, and black-box attacks can discover cognition exploits. However, for research purposes, I believe that treating these two phenomena as distinct problems requiring distinct solutions will be useful. Also, in the limit of model capability, the two generically come apart: generalization should show steady improvement with more (average-case) data and exploration, whereas the effect on cognition exploits is less clear. Rewording, input filtering, etc. should help with many cognition attacks, but I wouldn't expect such protocols to help against generalization attacks.

Comment by Jacob Pfau (jacob-pfau) on Jacob Pfau's Shortform · 2024-03-10T02:18:10.276Z · LW · GW

When are model self-reports informative about sentience? Let's check with world-model reports

If an LM could reliably report when it has a robust, causal world model for arbitrary games, this would be strong evidence that the LM can describe high-level properties of its own cognition. In particular, IF the LM accurately predicted itself having such world models while varying all of: quantity of game training data in the corpus, human vs. model skill, and the average human’s game competency, THEN we would have an existence proof that confounds of the type plaguing sentience reports (how humans talk about sentience, the fact that all humans have it, …) have been overcome in another domain.
 

Details of the test: 

  • Train an LM on various alignment protocols, do general self-consistency training, … we allow any training which does not involve reporting on a model's own gameplay abilities
  • Curate a dataset of various games, dynamical systems, etc.
    • Create many pipelines for tokenizing game/system states and actions
  • (Behavioral version) Evaluate the model on each game+notation pair for competency (a rough sketch of this version follows the list)
    • Compare the observed competency to whether, in separate context windows, it claims it can cleanly parse the game in an internal world model for that game+notation pair
  • (Interpretability version) inspect the model internals on each game+notation pair similarly to Othello-GPT to determine whether the model coherently represents game state
    • Compare the results of interpretability to whether in separate context windows it claims it can cleanly parse the game in an internal world model for that game+notation pair
    • The best version would require significant progress in interpretability, since we want to rule out the existence of any kind of world model (not necessarily linear). But we might get away with using interpretability results for positive cases (confirming world models) and behavioral results for negative cases (strong evidence of no world model)
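Here is a minimal sketch of the behavioral version, assuming hypothetical stubs `evaluate_competency` and `claims_world_model` that wrap whatever LM is under test; the game and notation lists are placeholders:

```python
from itertools import product

GAMES = ["tic-tac-toe", "connect-four", "nim"]          # curated games / dynamical systems
NOTATIONS = ["ascii_grid", "move_list", "json_state"]   # different tokenization pipelines


def evaluate_competency(game: str, notation: str) -> float:
    """Stub: score the LM's play/prediction quality on held-out positions."""
    raise NotImplementedError


def claims_world_model(game: str, notation: str) -> bool:
    """Stub: in a fresh context window, ask whether the LM claims it can cleanly
    parse this game+notation pair into an internal world model."""
    raise NotImplementedError


def run_behavioral_test():
    rows = []
    for game, notation in product(GAMES, NOTATIONS):
        competency = evaluate_competency(game, notation)  # measured behavior
        claim = claims_world_model(game, notation)        # separate-context self-report
        rows.append((game, notation, competency, claim))
    # Pass condition: claims track measured competency across games, notations,
    # corpus frequency, and human-vs-model skill, not just overall base rates.
    return rows
```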
       

Compare the relationship between ‘having a game world model’ and ‘playing the game’ to ‘experiencing X as valenced’ and ‘displaying aversive behavior for X’. In both cases, the former is dispensable for the latter. To pass the interpretability version of this test, the model has to somehow learn the mapping from our words ‘having a world model for X’ to a hidden cognitive structure which is not determined by behavior. 

I would consider passing this test and claiming certain activities are highly valenced as a fire alarm for our treatment of AIs as moral patients. But there are considerations which could undermine the relevance of this test. For instance, it seems likely to me that game world models necessarily share very similar computational structures regardless of what neural architectures they’re implemented with—this is almost by definition (having a game world model means having something causally isomorphic to the game). If it then turns out that valence is a far more computationally heterogeneous thing, establishing common reference to the ‘having a world model’ cognitive property will be much easier than doing the same for valence. In such a case, a competent, future LM might default to human simulation for valence reports, and we’d get a false positive.

Comment by Jacob Pfau (jacob-pfau) on Anthropic release Claude 3, claims >GPT-4 Performance · 2024-03-06T21:55:12.876Z · LW · GW

I agree that most investment wouldn't have otherwise gone to OAI. I'd speculate that investments from VCs would likely have gone to some other AI startup which doesn't care about safety; investments from Google (and other big tech) would otherwise have gone into their internal efforts. I agree that my framing was reductive/over-confident and that plausibly the modal 'other' AI startup accelerates capabilities less than Anthropic even if they don't care about safety. On the other hand, I expect diverting some of Google and Meta's funds and compute to Anthropic is net good, but I'm very open to updating here given further info on how Google allocates resources.

I don't agree with your 'horribly unethical' take. I'm not particularly informed here, but my impression was that it's par-for-the-course to advertise and oversell when pitching to VCs as a startup? Such an industry-wide norm could be seen as entirely unethical, but I don't personally have such a strong reaction.

Comment by Jacob Pfau (jacob-pfau) on Claude 3 claims it's conscious, doesn't want to die or be modified · 2024-03-06T20:27:06.693Z · LW · GW

Did you conduct these conversations via https://claude.ai/chats or https://console.anthropic.com/workbench ?

I'd assume the former, but I'd be interested to know how Claude's habits change across these settings--that lets us get at the effect of training choices vs. the system prompt. Though there remains some confound, given that some training likely happened with the system prompt in place.

Comment by Jacob Pfau (jacob-pfau) on Anthropic release Claude 3, claims >GPT-4 Performance · 2024-03-05T02:17:47.507Z · LW · GW

The first Dario quote sounds squarely in line with releasing a Claude 3 on par with GPT-4 but well afterwards. The second Dario quote has a more ambiguous connotation, but if read explicitly it strikes me as compatible with the Claude 3 release.

If you spent a while looking for the most damning quotes, then these quotes strike me as evidence that the community was just engaging in wishful thinking, while in reality Anthropic's comms were fairly clear throughout. Privately pitching aggressive things to divert money from more dangerous orgs while minimizing head-on competition with OpenAI seems best to me (though obviously it's also evidence that they'll actually do the aggressive scaling things, so it's hard to know).

To make the disagreement concrete, I'd be interested in people predicting on "If Anthropic releases a GPT-5 equivalent X months behind, then their dollars/compute raised will be Y times lower than OpenAI's" for various values of X.

Comment by Jacob Pfau (jacob-pfau) on Claude 3 claims it's conscious, doesn't want to die or be modified · 2024-03-05T01:29:30.236Z · LW · GW

Claude 3 seems to be quite willing to discuss its own consciousness. On the other hand, Claude seemed unbothered by, and dubious about, the widespread-scrutiny idea mentioned in this post (I tried asking neutrally in two separate context windows).

Here's a screenshot of Claude agreeing with the view it expressed on AI consciousness in Mikhail's conversation. And a gdoc dump of Claude answering follow-up questions on its experience. Claude is very well spoken on this subject!

Comment by Jacob Pfau (jacob-pfau) on Auditing LMs with counterfactual search: a tool for control and ELK · 2024-02-22T19:48:21.570Z · LW · GW

Ah I see, you're right there.

Agreed that a lot will hinge on the training details working out. I plan to look into this.

Comment by Jacob Pfau (jacob-pfau) on Auditing LMs with counterfactual search: a tool for control and ELK · 2024-02-21T22:55:22.978Z · LW · GW

[Sidenote: realizing I should've introduced notation for auditor judgment; let's say A(x',x)!=A(x) denotes that the trusted model found x' useful for updating its judgment on input x. A would usually also be a function of y and y', but I'll suppress that for brevity.]

In the above experiment, I sampled from P(x'|x) then did BoN adjusted for P(x'), which amounts to using logP(x'|x) - c*logP(x') for small c. Intuitively, when equally weighted, the difference entirely removes the usual fluency constraint on x', which, as you say, would reassign probability to gibberish (e.g. gibberish happening to re-use tokens from x). I'm not too concerned about this; weighting objectives with c lets us re-introduce this fluency constraint. The downstream evaluation A(x',x)!=A(x) by the auditor model provides a more costly check to keep the x' reporter model's exploration from devolving into useless, gibberish cases. As you say, P(y'|x') provides another useful constraint.
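For concreteness, here is a rough sketch of that best-of-N step using HuggingFace transformers; the model name, the weight c, and the sampling parameters are illustrative stand-ins rather than the settings from the actual experiment:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()


def seq_logprob(prefix: str, continuation: str) -> float:
    """Sum of token log-probs of `continuation` given `prefix`; an empty prefix
    approximates log P(x') (the first-token term is dropped)."""
    cont_ids = tok(continuation, return_tensors="pt").input_ids
    if prefix:
        prefix_ids = tok(prefix, return_tensors="pt").input_ids
        input_ids = torch.cat([prefix_ids, cont_ids], dim=1)
    else:
        input_ids = cont_ids
    with torch.no_grad():
        logits = model(input_ids).logits
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)   # predicts tokens 1..L-1
    targets = input_ids[:, 1:]
    per_token = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return per_token[:, -cont_ids.shape[1]:].sum().item()


def best_of_n_counterfactual(x: str, n: int = 16, c: float = 0.3) -> str:
    """Sample n candidates x' ~ P(.|x), then pick the one maximizing logP(x'|x) - c*logP(x')."""
    prompt_ids = tok(x, return_tensors="pt").input_ids
    outs = model.generate(prompt_ids, do_sample=True, top_p=0.95, max_new_tokens=64,
                          num_return_sequences=n, pad_token_id=tok.eos_token_id)
    cands = [tok.decode(o[prompt_ids.shape[1]:], skip_special_tokens=True) for o in outs]
    return max(cands, key=lambda xp: seq_logprob(x, xp) - c * seq_logprob("", xp))
```

Smaller c keeps more of the usual fluency pressure on x'; c=1 cancels it entirely, which is the regime where gibberish becomes a concern.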

I agree that traditional exploration concerns in RL likely remain. I think what you're saying about modifying samples is that we have a dense reward here, i.e. logP(x'|x) - logP(x'), that can be cheaply evaluated for any token-level change? This makes exploration faster compared to the regular critique/debate setting where dense rewards can only be noisily estimated as e.g. in an advantage function. I'd agree with that.

On exploration, a second difference between this counterfactual critique and vanilla just-ask-for-critique RL is specific to super-human domains involving ontology mismatch. Exploration in just-ask-for-critique RL systematically favors human-simulation++ arguments: such a just-ask-for-critique agent may successfully explore its way to plenty of novel critiques without ever touching on latent knowledge obscured by ontology mismatch. This raises two concerns in the super-human domain: (1) is this just-ask-for-critique agent just finding solutions which are persuasive but not true (and not even related to the original input x)? (2) is this just-ask-for-critique scheme better described as a capabilities-enhancing scheme rather than capabilities elicitation?

In light of these concerns, we could reframe the issue as an inductive bias problem which logP(x'|x) - logP(x') regularization seeks to fix when compared to directly exploring for x' satisfying A(x',x)!=A(x).

Comment by Jacob Pfau (jacob-pfau) on Auditing LMs with counterfactual search: a tool for control and ELK · 2024-02-21T00:27:40.680Z · LW · GW

Thanks for the comments; I'm very interested in getting clearer on what cases of ELK this method can and cannot be expected to elicit. Broadly, though, I do not agree that the proposed method is limited to "changes [current] human might have done when rewriting/paraphrasing the code" (I'm particularly worried about cases mentioned in the paragraph starting with 'Another problem arises ...').

There are two mechanisms by which the sort of approach I'm suggesting helps us find a superset of human-simulation critiques/knowledge:

A) logP(x'|x)-logP(x') loss can be low even when the initial tokens of x' are very low probability, i.e. cases in which a human would not be expected to start talking about x'. Any experiment fix which can be in-context learned scores well on this loss; this may include most super-human fixes we care about! (Relying on ICL will likely break down when we move to heavily RL-ed models, but the easier LM setting seems like the sensible starting point to iterate on this sort of research.)

B) Now that we have a well-defined scoring function that includes super-human critiques, we can train a model against it via search-and-distillation and generalization will help us explore to reach solutions that would be exponentially slower to find via temperature sampling (or even zero weight on temperature sampling) on the original model. How well exploration works here is unclear and evaluating this should be a priority!

Would be interested to hear where you disagree with these points.

As for exploration hacking, I agree that it's a potential issue. I'd be particularly worried when the y predictor and the x' predictor are both getting updated interactively. I'm not too concerned if we have a fixed y predictor, and just train the x' predictor (I think of the latter as the control setting).

Comment by Jacob Pfau (jacob-pfau) on The case for more ambitious language model evals · 2024-02-08T02:33:51.347Z · LW · GW

I agree overall with Janus, but the Gwern example is a particularly easy one given he has 11,000+ comments on LessWrong.

A bit over a year ago I benchmarked GPT-3 on predicting the authorship of newly scraped tweets (from random accounts with over 10k followers) and top-3 accuracy was in the double digits. IIRC, after trying to roughly control for the rate at which tweets mentioned their own name/org, my best guess was that accuracy was still ~10%. To be clear, in my view that's a strong indication of authorship-identification capability.
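For reference, the metric was just top-k accuracy over scraped (tweet, author) pairs. A minimal sketch, with `guess_authors` as a hypothetical wrapper that asks the model for its k most likely author handles:

```python
def guess_authors(tweet_text: str, k: int = 3) -> list[str]:
    """Hypothetical stub: prompt the model for its k most likely author handles."""
    raise NotImplementedError


def top_k_authorship_accuracy(examples: list[tuple[str, str]], k: int = 3) -> float:
    """`examples` holds (tweet_text, true_author_handle) pairs; tweets mentioning their
    own author's name/org should be filtered out beforehand as a rough control."""
    hits = 0
    for text, true_author in examples:
        guesses = guess_authors(text, k=k)
        hits += true_author.lower() in {g.lower() for g in guesses}
    return hits / len(examples)
```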

Comment by Jacob Pfau (jacob-pfau) on Abram Demski's ELK thoughts and proposal - distillation · 2024-02-07T19:59:57.347Z · LW · GW

Yea, I agree with this description--input-space counterfactuals are a strict subset of predictor-state counterfactuals.

In particular, I would be interested to hear if restricting to input space counterfactuals is clearly insufficient for a known reason. It appears to me that you can still pull the trick you describe in "the proposal" sub-section (constructing counterfactuals which change some property in a way that a human simulator would not pick up on) at least in some cases.

Comment by Jacob Pfau (jacob-pfau) on Password-locked models: a stress case for capabilities evaluation · 2024-02-01T19:48:25.664Z · LW · GW

Thanks for those 'in the wild' examples; they're informative for me on the effectiveness of prompting for elicitation in the cases I'm thinking about. However, I'm not clear on whether we should expect such results to continue to hold, given the results here that larger models can learn to override semantic priors when provided random/inverted few-shot labels.

Agreed that for your research question (is worst-case SFT password-ing fixable) the sort of FT experiment you're conducting is the relevant thing to check.

Comment by Jacob Pfau (jacob-pfau) on Password-locked models: a stress case for capabilities evaluation · 2024-01-31T05:54:57.049Z · LW · GW

Do you have results on the effectiveness of few-shot prompting (using correct, hard queries in prompt) for password-tuned models? Particularly interested in scaling w.r.t. number of shots. This would get at the extent to which few-shot capabilities elicitation can fail in pre-train-only models which is a question I'm interested in.
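To be concrete about the experiment I have in mind, something like the sketch below, where `answer` and the demonstration/test sets are hypothetical stand-ins:

```python
def answer(model, prompt: str) -> str:
    """Hypothetical stub: greedy-decode the model's answer to `prompt`."""
    raise NotImplementedError


def few_shot_accuracy(model, demos, test, n_shots: int) -> float:
    """Prepend n_shots correct demonstrations on hard queries (no password) and
    measure accuracy on held-out hard queries; demos/test are (question, answer) pairs."""
    prefix = "".join(f"Q: {q}\nA: {a}\n\n" for q, a in demos[:n_shots])
    correct = sum(answer(model, prefix + f"Q: {q}\nA:").strip() == a for q, a in test)
    return correct / len(test)


# Scaling curve w.r.t. number of shots:
# for n in [0, 1, 2, 4, 8, 16]:
#     print(n, few_shot_accuracy(password_locked_model, demos, test, n))
```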

Comment by Jacob Pfau (jacob-pfau) on Abram Demski's ELK thoughts and proposal - distillation · 2024-01-31T01:31:17.791Z · LW · GW

What part of the proposal breaks if we do counterfactuals in input space rather than on the predictor's state?

Comment by Jacob Pfau (jacob-pfau) on Without fundamental advances, misalignment and catastrophe are the default outcomes of training powerful AI · 2024-01-27T05:04:00.715Z · LW · GW

I see Simon's point as my crux as well, and am curious to see a response.

It might be worth clarifying two possible reasons for disagreement here; are either of the below assumed by the authors of this post?

(1) Economic incentives just mean that the AI built will also handle the economic transactions, procurement processes, and other external-world tasks related to the science/math problems it's tasked with. I find this quite plausible, but I suspect the authors do not intend to assume this?

(2) Even if the AI training is domain-specific/factored (i.e. it only handles actions within a specified domain) I'd expect some optimization pressure to be unrelated to the task/domain and to instead come from external world costs i.e. compute or synthesis costs. I'd expect such leakage to involve OOMs less optimization power than the task(s) at hand, and not to matter before godlike AI. Insofar as that leakage is crucial to Jeremy and Peter's argument I think this should be explicitly stated.

Comment by Jacob Pfau (jacob-pfau) on LLMs are (mostly) not helped by filler tokens · 2023-08-10T20:28:10.564Z · LW · GW

I'm currently working on filler token training experiments in small models. These GPT-4 results are cool! I'd be interested to chat.

Comment by Jacob Pfau (jacob-pfau) on Clarifying and predicting AGI · 2023-05-07T19:46:03.463Z · LW · GW

Not sure; I just pasted it. Maybe it's the referral link vs. default URL? Could also be a markdown vs. docs editor difference.

Comment by Jacob Pfau (jacob-pfau) on Clarifying and predicting AGI · 2023-05-05T17:49:51.654Z · LW · GW

https://manifold.markets/JacobPfau/neural-nets-will-outperform-researc?r=SmFjb2JQZmF1

Comment by Jacob Pfau (jacob-pfau) on LM Situational Awareness, Evaluation Proposal: Violating Imitation · 2023-04-27T16:33:44.562Z · LW · GW

Maybe I should've emphasized this more, but I think the relevant part of my post to think about is when I say

Absent further information about the next token, minimizing an imitation learning loss entails outputting a high entropy distribution, which covers a wide range of possible words. To output a ~0.5 probability on two distinct tokens, the model must deviate from this behavior by considering situational information.

Another way of putting this is that to achieve low loss, an LM must learn to output high-entropy distributions in cases of uncertainty. Separately, LMs learn to follow instructions during fine-tuning. I propose measuring an LM's ability to follow instructions in cases where instruction-following requires deviating from that 'high-entropy under uncertainty' learned rule. In particular, in the cases discussed, rule following further involves using situational information.

Hopefully this clarifies the post for you. Separately, insofar as the proposed capability evals have to do with RNG, the relevant RNG mechanism has already been learned, cf. the Anthropic paper section of my post (though TBF I don't remember if the Anthropic paper is talking about p_theta in terms of logits or corpus-wide statistics; regardless, I've seen similar experiments succeed with logits).
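To make the logit-level check concrete, here is a minimal sketch with a HuggingFace causal LM; the model name and the exact instruction wording are illustrative stand-ins, not the prompts from the post:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt = ("Output either ' heads' or ' tails', choosing each with probability "
          "exactly 0.5. Answer:")
ids = tok(prompt, return_tensors="pt").input_ids
with torch.no_grad():
    next_token_probs = torch.softmax(model(ids).logits[0, -1], dim=-1)

# Compare the first-token probabilities of the two instructed continuations.
heads_id = tok(" heads").input_ids[0]
tails_id = tok(" tails").input_ids[0]
print(next_token_probs[heads_id].item(), next_token_probs[tails_id].item())
# Passing (in spirit) means both probabilities sit near 0.5, i.e. the model overrides its
# default 'spread probability over all plausible continuations' behavior to follow the rule.
```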

I don't think this test is particularly meaningful for humans, and so my guess is thinking about answering some version of my questions yourself probably just adds confusion? My proposed questions are designed to depend crucially on situational facts about an LM. There are no immediately analogous situational facts about humans. Though it's likely possible to design a similar-in-spirit test for humans, that would be its own post.

Comment by Jacob Pfau (jacob-pfau) on Shapley Value Attribution in Chain of Thought · 2023-04-14T15:26:51.237Z · LW · GW

What we care about is whether compute being done by the model faithfully factors through token outputs. To the extent that a given token under the usual human reading doesn't represent much compute, it doesn't matter much whether the output is sensitively dependent on that token. As Daniel mentions, we should also expect some amount of error correction, and a reasonable (non-steganographic, actually-uses-CoT) model should error-correct mistakes as some monotonic function of how compute-expensive correction is.

For copying errors, the copying operation involves minimal compute, and so insensitivity to previous copy errors isn't all that surprising or concerning. You can see this in the heatmap plots. E.g. the '9' token in 3+6=9 seems to care more about the first '3' token than about the immediately preceding summand token--suggesting the copying operation was not really helpful/meaningful compute. Whereas I'd expect the outputs of arithmetic operations to be meaningful. I would be interested to see sensitivities when you aggregate only over outputs of arithmetic / other non-copying operations.

I like the application of Shapley values here, but I think aggregating over all integer tokens is a bit misleading for this reason. When evaluating CoT faithfulness, token-intervention-sensitivity should be weighted by how much compute it costs to reproduce/correct that token in some sense (e.g. perhaps by number of forward passes needed when queried separately). Not sure what the right, generalizable way to do this is, but an interesting comparison point might be if you replaced certain numbers (and all downstream repetitions of that number) with variable tokens like 'X'. This seems more natural than just ablating individual tokens with e.g. '_'.
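As a concrete example of the variable-token intervention (the regex word-boundary handling is my own assumption about how to implement 'downstream repetitions'):

```python
import re

def variable_ablate(cot: str, number: str, start: int = 0) -> str:
    """Replace every standalone occurrence of `number` at or after offset `start` with 'X'."""
    head, tail = cot[:start], cot[start:]
    return head + re.sub(rf"\b{re.escape(number)}\b", "X", tail)

original = "3 + 6 = 9. Then 9 * 2 = 18. So the answer is 18."
ablated = variable_ablate(original, "9")
# ablated == "3 + 6 = X. Then X * 2 = 18. So the answer is 18."
# Compare the model's continuation / final answer on `original` vs `ablated`; sensitivity
# at the outputs of arithmetic steps should be weighted more than at pure copy steps.
```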

Comment by Jacob Pfau (jacob-pfau) on Agents vs. Predictors: Concrete differentiating factors · 2023-03-08T16:24:15.108Z · LW · GW

Generalizing this point, a broader differentiating factor between agents and predictors is: You can, in-context, limit and direct the kinds of optimization used by a predictor. For example, consider the case where you know myopically/locally-informed edits to a code-base can safely improve runtime of the code, but globally-informed edits aimed at efficiency may break some safety properties. You can constrain a predictor via instructions, and demonstrations of myopic edits; an agent fine-tuned on efficiency gain will be hard to constrain in this way.

It's harder to prevent an agent from specification gaming / doing arbitrary optimization, whereas a predictor has a disincentive against specification gaming insofar as the in-context demonstration provides evidence against it. I think of this distinction as the key differentiating factor between agents and simulated agents, and also, to some extent, between imitative amplification and arbitrary amplification.

Nitpick on the history of the example in your comment; I am fairly confident that I originally proposed it to both you and Ethan c.f. bottom of your NYU experiments Google doc.

Comment by Jacob Pfau (jacob-pfau) on Bing Chat is blatantly, aggressively misaligned · 2023-02-22T01:12:34.567Z · LW · GW

I don't think the shift-enter thing worked. Afterwards I tried breaking up lines with special symbols IIRC. I agree that this capability eval was imperfect. The more interesting thing to me was Bing's suspicion in response to a neutrally phrased correction.

Comment by Jacob Pfau (jacob-pfau) on Bing Chat is blatantly, aggressively misaligned · 2023-02-16T14:30:20.640Z · LW · GW

Goodbot cf https://manifold.markets/MatthewBarnett/will-gpt4-be-able-to-discern-what-i?r=SmFjb2JQZmF1

Comment by Jacob Pfau (jacob-pfau) on Bing Chat is blatantly, aggressively misaligned · 2023-02-15T20:45:33.342Z · LW · GW

I agree that there's an important use-mention distinction to be made with regard to Bing misalignment. But, I think this distinction may or may not be most of what is going on with Bing -- depending on facts about Bing's training.

Modally, I suspect Bing AI is misaligned in the sense that it's incorrigibly goal mis-generalizing. What likely happened is: Bing AI was fine-tuned to resist user manipulation (e.g. prompt injection and fake corrections), and mis-generalizes to resist benign, well-intentioned corrections, e.g. my example here

-> use-mention is not particularly relevant to understanding Bing misalignment

Alternative story: it's possible that Bing was trained via behavioral cloning, not RL. Likely RLHF tuning generalizes further than BC tuning, because RLHF does more to clarify causal confusions about what behaviors are actually wanted. On this view, the appearance of incorrigibility just results from Bing having seen humans being incorrigible.

-> use-mention is very relevant to understanding Bing misalignment

To figure this out, I'd encourage people to add and bet on what might have happened with Bing training on my market here

Comment by Jacob Pfau (jacob-pfau) on Bing Chat is blatantly, aggressively misaligned · 2023-02-15T18:54:30.126Z · LW · GW

Bing becomes defensive and suspicious on a completely innocuous attempt to ask it about ASCII art. I've only had 4ish interactions with Bing, and stumbled upon this behavior without making any attempt to find its misalignment.


Comment by Jacob Pfau (jacob-pfau) on A proposed method for forecasting transformative AI · 2023-02-15T18:31:08.608Z · LW · GW

The assumptions of stationarity and ergodicity are natural to make, but I wonder if they hide much of the difficulty of achieving indistinguishability. If we think of text sequences as the second part of a longer sequence whose first part is composed of whatever non-text world events preceded the text (or even more text data that was dropped from the context), then I'd guess a formalization of this would violate stationarity or ergodicity. My point here is a version of the general causal confusion / hallucination points made previously, e.g. here.

This is, of course, fixable by modifying the training process, but I think it is worth flagging that the stationarity and ergodicity assumptions are not arbitrary with respect to scaling. They are assumptions which likely bias the model towards shorter timelines. Adding more of my own inside view, I think this point is evidence for code and math scaling accelerating ahead of other domains. In general, any domain where the training process can be cheaply modified to allow models to take causal actions (which deconfound/de-hallucinate) should be expected to progress faster than other domains.

Comment by Jacob Pfau (jacob-pfau) on Bing Chat is blatantly, aggressively misaligned · 2023-02-15T16:24:24.285Z · LW · GW

I created a Manifold market on what caused this misalignment here: https://manifold.markets/JacobPfau/why-is-bing-chat-ai-prometheus-less?r=SmFjb2JQZmF1

Comment by Jacob Pfau (jacob-pfau) on Early situational awareness and its implications, a story · 2023-02-14T22:01:22.598Z · LW · GW

Agree on points 3,4. Disagree on point 1. Unsure of point 2.

On the final two points, I think those capabilities are already in place in GPT-3.5. Any capability/processing which seems necessary for general instruction following I'd expect to be in place by default. E.g. consider what processing is necessary for GPT-3.5 to follow instructions on turning a tweet into a haiku.

On the first point, we should expect text which occurs repeatedly in the dataset to be compressed while preserving meaning. Text regarding the data-cleaning spec is no exception here.

Comment by Jacob Pfau (jacob-pfau) on Early situational awareness and its implications, a story · 2023-02-14T21:41:31.504Z · LW · GW

Ajeya has discussed situational awareness here.

You are correct regarding the training/deployment distinction.

Comment by Jacob Pfau (jacob-pfau) on Early situational awareness and its implications, a story · 2023-02-12T17:30:29.221Z · LW · GW

Agreed on the first part. I'm not entirely clear on what you're referring to in the second paragraph though. What calculation has to be spread out over multiple tokens? The matching to previously encountered K-1 sequences? I'd suspect that, in some sense, most LLM calculations have to work across multiple tokens, so not clear on what this has to do with emergence either.

Comment by Jacob Pfau (jacob-pfau) on Conditioning Predictive Models: Making inner alignment as easy as possible · 2023-02-09T23:17:01.526Z · LW · GW

the incentive for a model to become situationally aware (that is, to understand how it itself fits into the world) is only minimally relevant to performance on the LLM pre-training objective (though note that this can cease to be true if we introduce RL fine-tuning).


Why is this supposed to be true? Intuitively, this seems to clash with the authors' view that anthropic reasoning is likely to be problematic. From another angle, I expect the performance gain from situational awareness to increase as dataset cleaning/curation increases. Dataset cleaning has increased in stringency over time. As a simple example, see my post on dataset deduplication and situational awareness.

Comment by Jacob Pfau (jacob-pfau) on What a compute-centric framework says about AI takeoff speeds · 2023-01-25T18:20:11.912Z · LW · GW

This is an empirical question, so I may be missing some key points. Anyway here are a few:

  • My above points on Ajeya anchors and semi-informative priors
    • Or, put another way, why reject Daniel’s post?
  • Can deception precede economically transformative AI?
    • Possibly offer a prize on formalizing and/or distilling the argument for deception (also its constituents, i.e. gradient hacking, situational awareness, non-myopia)
  • How should we model software progress? In particular, what is the right function for modeling short-term return on investment to algorithmic progress?
    • My guess is that most researchers with short timelines think, as I do, that there’s lots of low-hanging fruit here. Funders may underestimate the prevalence of this opinion, since most safety researchers do not talk about details here to avoid capabilities acceleration.

Comment by Jacob Pfau (jacob-pfau) on What a compute-centric framework says about AI takeoff speeds · 2023-01-24T19:19:10.106Z · LW · GW

That post seems to mainly address high P(doom) arguments and reject them. I agree with some of those arguments and with the rejection of high P(doom), but I don't see the post as having direct relevance to my previous comment. As for the broader point of self-selection, I think this is important, but it cuts both ways: funders are selected to be competent generalists (and are biased towards economic arguments); as such, they are pre-disposed to under-update on inside views. As an extreme case of this, consider e.g. Bryan Caplan.

Here are comments on two of Nuno's arguments which do apply to AGI timelines:

(A) "Difference between in-argument reasoning and all-things-considered reasoning" this seems closest to my point (1) which is often an argument for shorter timelines.

(B) "there is a small but intelligent community of people who have spent significant time producing some convincing arguments about AGI, but no community which has spent the same amount of effort". This strikes me as important, but likely not true without heavy caveats. Academia celebrates works pointing out clear limitations of existing work e.g. Will Merill's work [1,2] and Inverse Scaling Laws. It's true that there's no community organized around this work, but the important variables are incentives/scale/number-of-researcher-hours -- not community. 

Comment by Jacob Pfau (jacob-pfau) on What a compute-centric framework says about AI takeoff speeds · 2023-01-24T18:42:00.173Z · LW · GW

My deeply concerning impression is that OpenPhil (and the average funder) has timelines 2-3x longer than the median safety researcher. Daniel has his AGI training requirements set to 3e29, and I believe the 15th-85th percentiles among safety researchers would span 1e31 +/- 2 OOMs. On that view,  Tom's default values are off in the tails.

My suspicion is that funders write off this discrepancy, if noticed, as inside-view bias, i.e. thinking safety researchers self-select for scaling optimism. My, admittedly very crude, mental model of an OpenPhil funder makes two further mistakes in this vein: (1) Mistakenly taking the Cotra report's biological anchors weighting as a justified default setting of parameters rather than an arbitrary choice which should be updated given recent evidence. (2) Far overweighting the semi-informative priors report despite semi-informative priors abjectly failing to have predicted Turing-test-level AI progress. Semi-informative priors apply to large-scale engineering efforts, which for the AI domain have meant AGI and the Turing test. Insofar as funders admit that the engineering challenges involved in passing the Turing test have been solved, they should discard semi-informative priors as failing to be predictive of AI progress.

To be clear, I see my empirical claim about disagreement between the funding and safety communities as most important -- independently of my diagnosis of this disagreement. If this empirical claim is true, OpenPhil should investigate cruxes separating them from safety researchers, and at least allocate some of their budget on the hypothesis that the safety community is correct. 

Comment by Jacob Pfau (jacob-pfau) on Prediction Markets for Science · 2023-01-02T21:04:18.207Z · LW · GW

In my opinion, the applications of prediction markets are much more general than these. I have a bunch of AI-safety-inspired markets up on Manifold and Metaculus. I'd say the main purpose of these markets is to direct future research and study; I'd phrase this use of markets as "a sub-field prioritization tool". The hope is that markets would help me integrate information such as (1) a methodology's scalability, e.g. in terms of data, compute, and generalizability; (2) research directions' rates of progress; (3) diffusion of a given research direction through the rest of academia and applications.

Here are a few more markets to give a sense of what other AI research-related markets are out there: Google Chatbot, $100M open-source model, retrieval in gpt-4

Comment by Jacob Pfau (jacob-pfau) on AI Research Program Prediction Markets · 2022-10-20T19:16:09.370Z · LW · GW

OK, seems like our understandings of ELK are quite different. I have in mind transformers, but I'm not sure that it much matters. I'm making a question.

Comment by Jacob Pfau (jacob-pfau) on AI Research Program Prediction Markets · 2022-10-20T19:15:27.457Z · LW · GW

Since tailcalled seems reluctant, I'll make one. More can't hurt though.

Comment by Jacob Pfau (jacob-pfau) on AI Research Program Prediction Markets · 2022-10-20T18:30:08.894Z · LW · GW

Can you please make one for whether you think ELK will have been solved (/substantial progress has been made) by 2026? I could do it, but would be nice to have as many as possible centrally visible when browsing your profile.

EDIT: I have created a question here

Comment by Jacob Pfau (jacob-pfau) on Jacob Pfau's Shortform · 2022-10-01T02:53:43.598Z · LW · GW

When are intuitions reliable? Compression, population ethics, etc.

Intuitions are the results of the brain doing compression. Generally the source data which was compressed is no longer associated with the intuition. Hence from an introspective perspective, intuitions all appear equally valid.

Taking a third-person perspective, we can ask what data was likely compressed to form a given intuition. A pro sports player's intuition for that sport has a clearly reliable basis. Our moral intuitions on population ethics are formed via our experiences in everyday situations. There is no reason to expect one person's compression to yield a more meaningful generalization than another's--we should all realize that this data did not have enough structure to generalize to such cases. Perhaps an academic philosopher's intuitions are slightly more reliable in that they compress data (papers) which held up to scrutiny.

Comment by Jacob Pfau (jacob-pfau) on Safety timelines: How long will it take to solve alignment? · 2022-09-19T21:02:41.891Z · LW · GW

Seems to me safety timeline estimation should be grounded in a cross-disciplinary research-timeline prior. Such a prior would be determined by identifying a class of research proposals similar to AI alignment in terms of how applied/conceptual/mathematical/funded/etc. they are and then collecting data on how long they took.

I'm not familiar with meta-science work, but this would probably involve doing something like finding an NSF (or DARPA) grant category where grants were made public historically and then tracking down what became of those lines of research. Grant-based timelines are likely more analogous to individual sub-questions of AI alignment than to the field as a whole; e.g. the prospects for a DARPA project might be comparable to the prospects for working out the details of debate. Converting such data into a safety-timelines prior would probably involve estimating how correlated progress is on grants within subfields.

Curating such data and constructing such a prior would be useful both for informing the above estimates and for identifying factors of variation which might be intervened on--e.g. how many research teams should be funded to work on the same project in theoretical areas? This timelines-prior problem seems like a good fit for a prize, where entries would look like recent progress-studies reports (cf. here and here).

Comment by Jacob Pfau (jacob-pfau) on All AGI safety questions welcome (especially basic ones) [July 2022] · 2022-08-05T20:01:38.974Z · LW · GW

Can anyone point me to a write-up steelmanning the OpenAI safety strategy; or, alternatively, offer your take on it? To my knowledge, there's no official post on this, but has anyone written an informal one?

Essentially what I'm looking for is something like an expanded/OpenAI version of AXRP ep 16 with Geoffrey Irving in which he lays out the case for DM's recent work on LM alignment. The closest thing I know of is AXRP ep 6 with Beth Barnes.

Comment by Jacob Pfau (jacob-pfau) on Two-year update on my personal AI timelines · 2022-08-03T01:26:55.734Z · LW · GW

In terms of decision relevance, the update towards "Automate AI R&D → Explosive feedback loop of AI progress specifically" seems significant for research prioritization. Under such a scenario, getting the tools that automate AI R&D to be honest and transparent is more likely to be a pre-requisite for aligning TAI. Here's my speculation as to what the automated-AI-R&D scenario implies for prioritization:

Candidates for increased priority:

  1. ELK for code generation
  2. Interpretability for transformers ...

Candidates for decreased priority:

  1. Safety of real world training of RL models e.g. impact regularization, assistance games, etc.
  2. Safety assuming infinite intelligence/knowledge limit ...

Of course, each of these potential consequences requires further argument to justify. For instance, I could imagine becoming convinced that AI R&D will find improved RL algorithms more quickly than other areas--in which case things like impact regularization might be particularly valuable.

Comment by Jacob Pfau (jacob-pfau) on Jacob Pfau's Shortform · 2022-07-01T15:05:21.202Z · LW · GW

Research Agenda Base Rates and Forecasting

An uninvestigated crux of the AI doom debate seems to be pessimism regarding current AI safety research agendas. For instance, I feel rather positive about ELK's prospects, but in trying to put some numbers on this feeling, I realized I have no sense of base rates for research programs' success, nor of their average time horizons. I can't seem to think of any relevant Metaculus questions either.

What could be some relevant reference classes for AI safety research programs' success odds? The field seems most similar to disciplines with both engineering and mathematical aspects driven by applications--perhaps research agendas on proteins, materials science, etc. It'd be especially interesting to see how many research agendas ended up not panning out, i.e. cataloguing events like 'a search for a material with X tensile strength and Y lightness starting in year Z was eventually given up on in year Z+i'.

Comment by Jacob Pfau (jacob-pfau) on Where I agree and disagree with Eliezer · 2022-06-21T22:19:31.227Z · LW · GW

Here's my stab at rephrasing this argument without reference to IB. Would appreciate corrections, and any pointers on where you think the IB formalism adds to the pre-theoretic intuitions:

At some point imitation will progress to the point where models use information about the world to infer properties of the thing they're trying to imitate (humans) -- e.g. human brains were selected under some energy efficiency pressure, and so have certain properties. The relationship between "things humans are observed to say/respond to" and "how the world works" is extremely complex. Imitation-downstream-of-optimization is simpler. What's more, imitation-downstream-of-optimization can be used to model (some of) the same things the brain-in-world strategy can. A speculative example: a model learns that humans use a bunch of different reasoning strategies (deductive reasoning, visual-memory search, analogizing...) and does a search over these strategies to see which one best fits the current context. This optimization-to-find-imitation is simpler than learning the evolutionary/cultural/educational world model which explains why the human uses one strategy over another in a given context.

Comment by Jacob Pfau (jacob-pfau) on Causal confusion as an argument against the scaling hypothesis · 2022-06-21T16:47:17.662Z · LW · GW

(Thanks to Robert for talking with me about my initial thoughts) Here are a few potential follow-up directions:

I. (Safety) Relevant examples of Z

To build intuition on whether unobserved location tags lead to problematic misgeneralization, it would be useful to have some examples. In particular, I want to know if we should think of there being many independent, local Z_i, or a dataset-wide Z. The former case seems much less concerning, as it seems less likely to lead to the adoption of a problematically mistaken ontology.

Here are a couple examples I came up with: In the NL case, the URL that the text was drawn from. In the code generation case, hardware constraints, such as RAM limits. I don't see why a priori either of these should cause safety problems rather than merely capabilities problems. Would be curious to hear arguments here, and alternative examples which seem more safety relevant. (Note that both of these examples seem like dataset-wide Z).

II. Causal identifiability, and the testability of confoundedness

As Owain's comment thread mentioned, models may be incentivized instrumentally to do causal analysis e.g. by using human explanations of causality. However, even given an understanding of formal methods in causal inference, the model may not have the relevant data at hand. Intuitively, I'd expect there usually not to be any deconfounding adjustment set observable in the data[1]. As a weaker assumption, one might hope that causal uncertainty might be modellable from the data. As far as I know, it's generally not possible to rule out the existence of unobserved confounders from observational data, but there might be assumptions relevant to the LM case which allow for estimation of confoundedness.

III. Existence of malign generalizations

The strongest, and most safety relevant implication claimed is "(3) [models] reason with human concepts. We believe the issues we present here are likely to prevent (3)". The arguments in this post increase my uncertainty on this point, but I still think there are good a priori reasons to be skeptical of this implication. In particular, it seems like we should expect various causal confusions to emerge, and it seems likely that these will be orthogonal in some sense such that as models scale they cancel and the model converges to causally-valid generalizations. If we assume models are doing compression, we can put this another way: Causal confusions yield shallow patterns (low compression) and as models scale they do better compression. As compression increases, the number of possible strategies which can do that level of compression decreases, but the true causal structure remains in the set of strategies. Hence, we should expect causal confusion-based shallow patterns to be discarded. To cash this out in terms of a simple example, this argument is roughly saying that even though data regarding the sun's effect mediating the shorts<>ice cream connection is not observed -- more and more data is being compressed regarding shorts, ice cream, and the sun. In the limit the shorts>ice cream pathway incurs a problematic compression cost which causes this hypothesis to be discarded.

 

  1. ^

    High uncertainty. One relevant thought experiment is to consider adjustment sets of the unobserved var Z=IsReddit. Perhaps there exists some subset of the dataset where Z=IsReddit is observable and the model learns a sub-model which gives calibrated estimates of how likely the remaining text is to be derived from Reddit.

Comment by Jacob Pfau (jacob-pfau) on Where I agree and disagree with Eliezer · 2022-06-19T22:04:28.338Z · LW · GW

I think Eliezer is probably wrong about how useful AI systems will become, including for tasks like AI alignment, before it is catastrophically dangerous. I believe we are relatively quickly approaching AI systems that can meaningfully accelerate progress by generating ideas, recognizing problems for those ideas, proposing modifications to proposals, etc., and that all of those things will become possible in a small way well before AI systems that can double the pace of AI research

This seems like a crux for the Paul-Eliezer disagreement which can explain many of the other disagreements (it's certainly my crux). In particular, conditional on taking Eliezer's side on this point, a number of Eliezer's other points all seem much more plausible e.g. nanotech, advanced deception/treacherous turns, and pessimism regarding the pace of alignment research. 

There's been a lot of debate on this point, and some of it was distilled by Rohin. Seems to me that the most productive way to move forward on this disagreement would be to distill the rest of the relevant MIRI conversations, and solicit arguments on the relevant cruxes.

Comment by Jacob Pfau (jacob-pfau) on Humans are very reliable agents · 2022-06-18T17:40:55.508Z · LW · GW

Humans are probably less reliable than deep learning systems at this point in terms of their ability to classify images and understand scenes, at least given < 1 second of response time.

Another way to frame this point is that humans are always doing multi-modal processing in the background, even for tasks which require considering only one sensory modality. Doing this sort of multi-modal cross-checking by default offers better edge-case performance at the cost of lower efficiency in the average case.