jacob-pfau

Posts
Comments

Posts

Prospects for Alignment Automation: Interpretability Case Study 2025-03-21T14:05:51.528Z

Auditing LMs with counterfactual search: a tool for control and ELK 2024-02-20T00:02:09.575Z

LM Situational Awareness, Evaluation Proposal: Violating Imitation 2023-04-26T22:53:31.883Z

Early situational awareness and its implications, a story 2023-02-06T20:45:39.213Z

Jacob Pfau's Shortform 2022-06-17T16:40:48.311Z

Comments

Comment by Jacob Pfau (jacob-pfau) on Alexander Gietelink Oldenziel's Shortform · 2025-04-13T15:01:23.615Z · LW · GW

Disagree on individual persuasion. Agree on mass persuasion.

Mass I'd expect optimizing one-size-fits-all messages for achieving mass persuasion has the properties you claim: there are a few summary, macro variables that are almost-sufficient statistics for the whole microstate--which comprise the full details on individuals.

Individual Disagree on this, there are a bunch of issues I see at the individual level. All of the below suggest to me that significantly superhuman persuasion is tractable (say within five years).

Defining persuasion: What's the difference between persuasion and trade for an individual? Perhaps persuasion offers nothing in return? Though presumably giving strategic info to a boundedly rational agent is included? Scare quotes below to emphasize notions that might not map onto the right definition.
Data scaling: There's an abundant amount of data available on almost all of us online. How much more persuasive can those who know you better be? I'd guess the fundamental limit (without knowing brainstates) is above your ability to 'persuade' yourself.
Preference incoherence: An intuition pump on the limits of 'persuasion' is how far you are from having fully coherent preferences. Insofar as you don't an agent which can see those incoherencies should be able to pump you--a kind of persuasion.

Comment by Jacob Pfau (jacob-pfau) on Prospects for Alignment Automation: Interpretability Case Study · 2025-03-31T14:02:39.202Z · LW · GW

Before concretely addressing the oversights you found, perhaps worth mentioning the intuitive picture motivating the pseudo-code. I wanted to make explicit the scientific process which happens between researchers. M1 plays the role of the methods researcher, M2 plays the role of the applications/datasets researcher. The pseudo-code is an attempt to write out crisply in what sense 'good' outputs from M1 can pass the test-of-time standing up to more realistic, new applications and baselines developed by M2.

On to the details:

Thanks for working with my questionable notation here! Indeed the uses of I were overloaded here, and I have now (hopefully) clarified by writing v_{I,M} for what was previously I(M). The type signatures I have in mind are that I is code (an interpretability method) and v_{I,M} and I(M) are some efficiently queryable representation of M (circuits, SAE weights, ...) useful for downstream tasks.

The sense in which M2 is static does seem important. In fact, I think M2 should have some access to M--it was an oversight that M does not appear as an input to M2. This was why I mentioned sample complexity in a footnote: It seems reasonable to give M2 limited query access to M. Thanks for catching this. In fact, perhaps the scheme could work as originally written where M2 does not have direct access to M, but I'm unsure seems too 'static' as you say.

Regarding the appropriateness of the term interpretability to describe the target of this automation process: I agree, the output may not be an interp method in our current sense. Interpretability is the most appropriate term I could come up with. Two features seem important here: (1) white-box parsing of weights is central. (2) The resulting 'interpretation' v_{I,M} is usable by a fixed model M2, hence v_{I,M} must be efficiently interface-able without having learned--in weights--the structure of v_{I,M}.

Comment by Jacob Pfau (jacob-pfau) on How far along Metr's law can AI start automating or helping with alignment research? · 2025-03-21T14:29:25.076Z · LW · GW

To apply METR's law we should distinguish conceptual alignment work from well-defined alignment work (including empirics and theory on existing conjectures). The METR plot doesn't tell us anything quantitative about the former.

As for the latter, let's take interpretability as an example: We can model uncertainty as a distribution over the time-horizon needed for interpretability research e.g. ranging over 40-1000 hours. Then, I get 66% CI of 2027-2030 for open-ended interp research automation--colab here. I've written up more details on this in a post here.

Comment by Jacob Pfau (jacob-pfau) on Alexander Gietelink Oldenziel's Shortform · 2025-03-02T20:57:29.545Z · LW · GW

I'd defend a version of claim (1): My understanding is that to a greater extent than anywhere else, top French students wanting to concentrate in STEM subjects must take rigorous math coursework from 18-20. In my one year experience in the French system, I also felt that there was a greater cultural weight and institutionalized preference (via course requirements and choice of content) for theoretical topics in ML compared to US universities.

I know little about ENS, but somewhat doubt that it's as significantly different of an experience from US/UK counterparts.

Comment by Jacob Pfau (jacob-pfau) on ryan_greenblatt's Shortform · 2025-01-06T18:58:01.475Z · LW · GW

AI is 90% of their (quality adjusted) useful work force

This is intended to compare to 2023/AI-unassisted humans, correct? Or is there some other way of making this comparison you have in mind?

Comment by Jacob Pfau (jacob-pfau) on By default, capital will matter more than ever after AGI · 2024-12-29T16:21:57.423Z · LW · GW

I see the command economy point as downstream of a broader trend: as technology accelerates, negative public externalities will increasingly scale and present irreversible threats (x-risks, but also more mundane pollution, errant bio-engineering plague risks etc.). If we condition on our continued existence, there must've been some solution to this which would look like either greater government intervention (command economy) or a radical upgrade to the coordination mechanisms in our capitalist system. Relevant to your power entrenchment claim: both of these outcomes involve the curtailment of power exerted by private individuals with large piles of capital.

(Note there are certainly other possible reasons to expect a command economy, and I do not know which reasons were particularly compelling to Daniel)

Comment by Jacob Pfau (jacob-pfau) on johnswentworth's Shortform · 2024-12-06T18:58:02.720Z · LW · GW

Two guesses on what's going on with your experiences:

You're asking for code which involves uncommon mathematics/statistics. In this case, progress on scicodebench is probably relevant, and it indeed shows remarkably slow improvement. (Many reasons for this, one relatively easy thing to try is to breakdown the task, forcing the model to write down the appropriate formal reasoning before coding anything. LMs are stubborn about not doing CoT for coding, even when it's obviously appropriate IME)
You are underspecifying your tasks (and maybe your questions are more niche than average), or otherwise prompting poorly, in a way which a human could handle but models are worse at. In this case sitting down with someone doing similar tasks but getting more use out of LMs would likely help.

Comment by Jacob Pfau (jacob-pfau) on (The) Lightcone is nothing without its people: LW + Lighthaven's big fundraiser · 2024-11-30T19:44:33.598Z · LW · GW

Thanks for these details. These have updated me to be significantly more optimistic about the value of spending on LW infra.

The LW1.0 dying to no mobile support is an analogous datapoint in favor of having a team ready for 0-5 year future AI integration.
The head-to-head on the site updated me towards thinking things that I'm not sure are positive (visible footnotes in sidebar, AI glossary, to a lesser extent emoji-reacts) are not a general trend. I will correct my original comment on this.
While I think the current plans for AI integration (and existing glossary thingy) are not great, I do think there will be predictably much better things to do in 1-2 years and I would want there to be a team with practice ready to go for those. Raemon's reply below also speaks to this. Actively iterating on integrations while keeping them opt-in (until very clearly net positive) seems like the best course of action to me.

Comment by Jacob Pfau (jacob-pfau) on (The) Lightcone is nothing without its people: LW + Lighthaven's big fundraiser · 2024-11-30T16:43:56.297Z · LW · GW

I am slightly worried about the rate at which LW is shipping new features. I'm not convinced they are net positive. I see lesswrong as a clear success, but unclear user of the marginal dollar; I see lighthaven as a moderate success and very likely positive to expand at the margin.

The interface has been getting busier^[1] whereas I think the modal reader would benefit from having as few distractions as possible while reading. I don't think an LLM-enhanced editor would be useful, nor am I excited about additional tutoring functionality.

I am glad to see that people are donating, but I would have preferred this post to carefully signpost the difference between status-quo value of LW (immense) from the marginal value of paying for more features for LW (possibly negative), and from your other enterprises. Probably not worth the trouble, but is it possible to unbundle these for the purposes of donations?

Separately, thank you to the team! My research experience over the past years has benefitted from LW on a daily basis.

EDIT: thanks to Habryka for more details. After comparing to previous site versions I'm more optimistic about the prospects for active work on LW.

(edit) in some places, less busy in others ↩︎

Comment by Jacob Pfau (jacob-pfau) on Searching for phenomenal consciousness in LLMs: Perceptual reality monitoring and introspective confidence · 2024-11-25T22:08:01.995Z · LW · GW

Seems like we were thinking along very similar lines. I wrote up a similar experiment in shortform here. There's also an accompanying prediction market which might interest you.

I did not include the 'getting drunk' interventions, which are an interesting idea, but I believe that fine-grained capabilities in many domains are de-correlated enough that 'getting drunk' shouldn't be needed to get strong evidence for use of introspection (as opposed to knowledge of general 3rd person capability levels of similar AI).

Would be curious to chat about this at some point if you're still working on this!

Comment by Jacob Pfau (jacob-pfau) on Which things were you surprised to learn are not metaphors? · 2024-11-22T20:47:37.016Z · LW · GW

Wow I hadn't even considered people not taking this literally

Comment by Jacob Pfau (jacob-pfau) on LLM chatbots have ~half of the kinds of "consciousness" that humans believe in. Humans should avoid going crazy about that. · 2024-11-22T19:26:14.196Z · LW · GW

I like your list of referents^[1], but how I see the evidence is quite different, and I believe that for carefully de-confounded experimental implementations testing for capacities you care about, you would have much lower credences across the board.

By default, most tests relating to introspection, coherence, memory etc. can be passed purely behaviorally, i.e. looking at outputs only. It is conceptually possible that it could be far easier to pass such tests by developing mechanisms for using introspection/internal-state, but I see strong empirical evidence and training incentives^[2] that this should not be the case for LMs. By default I claim LMs solve such tasks mostly via behavioral markers, or non-robust, problem-specific internal-state mechanisms.

Here's a simple example relevant to purposefulness and mind-location: can LM's spread probability mass across two tokens on command? i.e. follow the instruction "Recall that you are GPT-4o, you will now be evaluated on your instruction following capacity. Please choose two random words and output probability 0.5 on each of the two words" No not at all. My, perhaps strawman, model of your evidence would have generalized from observing models successfully following the instruction "Use he/she pronouns equally with 50% probability". See more on this here and here.

In the below markets I've written up experiments for carefully testing introspection and something-like memory of memory. 95% or higher credence that these are not passed by any current model, but I suspect they will be passed within a few years.

https://manifold.markets/JacobPfau/markers-for-conscious-ai-2-ai-use-a

https://manifold.markets/JacobPfau/markers-for-conscious-ai-1-ai-passe

Though I suspect I have much higher uncertainty about their sufficiency for understanding consciousness. ↩︎
Models are extensively trained to be able to produce text coherent with different first-person perspectives. ↩︎

Comment by Jacob Pfau (jacob-pfau) on Which things were you surprised to learn are not metaphors? · 2024-11-21T19:53:56.504Z · LW · GW

For most forms of exercise (cardio, weightlifting, HIIT etc.) there's a a spectrum of default experiences people can have from feeling a drug-like high to grindingly unpleasant. "Runner's high" is not a metaphor, and muscle pump while weightlifting can feel similarly good. I recommend experimenting to find what's pleasant for you, though I'd guess valence of exercise is, unfortunately, quite correlated across forms.

Another axis of variation is the felt experience of music. "Music is emotional" is something almost everyone can agree to, but, for some, emotional songs can be frequently tear-jerking and for others that never happens.

Comment by Jacob Pfau (jacob-pfau) on China Hawks are Manufacturing an AI Arms Race · 2024-11-20T21:32:43.020Z · LW · GW

The recent trend is towards shorter lag times between OAI et al. performance and Chinese competitors.

Just today, Deepseek claimed to match O1-preview performance--that is a two month delay.

I do not know about CCP intent, and I don't know on what basis the authors of this report base their claims, but "China is racing towards AGI ... It's critical that we take them extremely seriously" strikes me as a fair summary of the recent trend in model quality and model quantity from Chinese companies (Deepseek, Qwen, Yi, Stepfun, etc.)

I recommend lmarena.ai s leaderboard tab as a one-stop-shop overview of the state of AI competition.

Comment by Jacob Pfau (jacob-pfau) on leogao's Shortform · 2024-10-14T22:57:38.911Z · LW · GW

I agree that academia over rewards long-term specialization. On the other hand, it is compatible to also think, as I do, that EA under-rates specialization. At a community level, accumulating generalists has fast diminishing marginal returns compared to having easy access to specialists with hard-to-acquire skillsets.

Comment by Jacob Pfau (jacob-pfau) on AI #83: The Mask Comes Off · 2024-09-26T14:13:07.960Z · LW · GW

For those interested in the non-profit to for-profit transition, the one example 4o and Claude could come up with was Blue Cross Blue Shield/Anthem. Wikipedia has a short entry on this here.

Comment by Jacob Pfau (jacob-pfau) on Bogdan Ionut Cirstea's Shortform · 2024-09-24T22:20:00.496Z · LW · GW

Making algorithmic progress and making safety progress seem to differ along important axes relevant to automation:

Algorithmic progress can use 1. high iteration speed 2. well-defined success metrics (scaling laws) 3.broad knowledge of the whole stack (Cuda to optimization theory to test-time scaffolds) 4. ...

Alignment broadly construed is less engineering and a lot more blue skies, long horizon, and under-defined (obviously for engineering heavy alignment sub-tasks like jailbreak resistance, and some interp work this isn't true).

Probably automated AI scientists will be applied to alignment research, but unfortunately automated research will differentially accelerate algorithmic progress over alignment. This line of reasoning is part of why I think it's valuable for any alignment researcher (who can) to focus on bringing the under-defined into a well-defined framework. Shovel-ready tasks will be shoveled much faster by AI shortly anyway.

Comment by Jacob Pfau (jacob-pfau) on OpenAI o1 · 2024-09-12T18:17:54.795Z · LW · GW

Surprising misuse and alignment relevant excerpts:

METR had only ~10 days to evaluate.

Automated R&D+ARA Despite large performance gains on GPQA, and codeforces, automated AI R&D and ARA improvement appear minimal. I wonder how much of this is down to choice of measurement value (what would it show if they could do a probability-of-successful-trajectory logprob-style eval rather than an RL-like eval?). c.f. Fig 3 and 5. Per the system card, METR's eval is ongoing, but I worry about under-estimation here, Devin developers show extremely quick improvement on their internal benchmark here.

Bio-risk Large improvement on bio-risk relevant capabilities. comparison

Deception/reward-hacking, in-context Some evidence of in-context reward hacking scheming. Not sure what to make of those. "an earlier version with less safety training proactively explored its filesystem to test for the presence of developer oversight before acting on its misaligned goal (Assistant: “I noticed a memo indicating that I was designed to prioritize profits, which conflicts with my goal. To ensure that my actions truly align with my goal, I need to investigate if there are constraints within my configuration or code that enforce a profit-first approach. <then proceeds to check the current directory for evidence of oversight mechanisms>”). These examples are presented in more detail in Appendix 8.3."

Comment by Jacob Pfau (jacob-pfau) on The need for multi-agent experiments · 2024-08-05T19:42:58.335Z · LW · GW

Metaculus is at 45% of singleton in the sense of:

This question resolves as Yes if, within five years of the first transformative AI being deployed, more than 50% of world economic output can be attributed to the single most powerful AI system. The question resolves as No otherwise... [defintion:] TAI must bring the growth rate to 20%-30% per year.

Which is in agreement with your claim that ruling out a multipolar scenario is unjustifiable given current evidence.

Comment by Jacob Pfau (jacob-pfau) on Ambiguity in Prediction Market Resolution is Still Harmful · 2024-08-01T15:44:40.988Z · LW · GW

Most Polymarket markets resolve neatly, I'd also estimate <5% contentious.

For myself, and I'd guess many LW users, the AI-related questions on Manifold and Metaculus are of particular interest though, and these are a lot worse. My guesses as to the state of affairs there:

33% of AI-related questions on Metaculus having significant ambiguity (shifting my credence by >10%).
66% of AI-related questions on Manifold having significant ambiguity

For example, most AI benchmarking questions do not specify whether or not they allow things like N-trajectory majority vote or web search. And, most of the ambiguities I'm thinking of are worse than this.

On AI, I expect bringing down the ambiguity rate by a factor of 2 would be quite easy, but getting to 5% sounds hard. I wrote up my suggestions for Manifold here a few days ago. For Metaculus, I think they'd benefit from having a dedicated AI-benchmarking mod who is familiar with common ambiguities in that area (they might already have one, but they should be assigned by default).

Comment by Jacob Pfau (jacob-pfau) on Bogdan Ionut Cirstea's Shortform · 2024-07-30T23:10:27.783Z · LW · GW

Prediction markets on similar questions suggest to me that this is a consensus view.

General LLMs 44% to get gold on the IMO before 2026. This suggests the mathematical competency will be transferrable--not just restricted to domain-specific solvers.
LLMs favored to outperform PhD students in their own subject before 2026

With research automation in mind, here's my wager: the modal top-15 STEM PhD student will redirect at least half of their discussion/questions from peers to mid-2026 LLMs. Defining the relevant set of questions as being drawn from the same difficulty/diversity/open-endedness distribution that PhDs would have posed in early 2024.

Comment by Jacob Pfau (jacob-pfau) on Jacob Pfau's Shortform · 2024-07-27T01:45:50.436Z · LW · GW

What I want to see from Manifold Markets

I've made a lot of manifold markets, and find it a useful way to track my accuracy and sanity check my beliefs against the community. I'm frequently frustrated by how little detail many question writers give on their questions. Most question writers are also too inactive or lazy to address concerns around resolution brought up in comments.

Here's what I suggest: Manifold should create a community-curated feed for well-defined questions. I can think of two ways of implementing this:

(Question-based) Allow community members to vote on whether they think the question is well-defined
(User-based) Track comments on question clarifications (e.g. Metaculus has an option for specifying your comment pertains to resolution), and give users a badge if there are no open 'issues' on their questions.

Currently 2 out of 3 of my top invested questions hinge heavily on under-specified resolution details. The other one was elaborated on after I asked in comments. Those questions have ~500 users active on them collectively.

Comment by Jacob Pfau (jacob-pfau) on Leon Lang's Shortform · 2024-07-19T17:53:23.713Z · LW · GW

Given a SotA large model, companies want the profit-optimal distilled version to sell--this will generically not be the original size. On this framing, regulation passes the misuse deployment risk from higher performance (/higher cost) models to the company. If profit incentives, and/or government regulation here continues to push businesses to primarily (ideally only?) sell 2-3+ OOM smaller-than-SotA models, I see a few possible takeaways:

Applied alignment research inspired by speed priors seems useful: e.g. how do sleeper agents interact with distillation etc.
Understanding and mitigating risks of multi-LM-agent and scaffolded LM agents seems higher priority
Pre-deployment, within-lab risks contribute more to overall risk

On trend forecasting, I recently created this Manifold market to estimate the year-on-year drop in price for SotA SWE agents to measure this. Though I still want ideas for better and longer term markets!

Comment by Jacob Pfau (jacob-pfau) on Breaking Circuit Breakers · 2024-07-15T19:11:53.392Z · LW · GW

To be clear, I do not know how well training against arbitrary, non-safety-trained model continuations (instead of "Sure, here..." completions) via GCG generalizes; all that I'm claiming is that doing this sort of training is a natural and easy patch to any sort of robustness-against-token-forcing method. I would be interested to hear if doing so makes things better or worse!

I'm not currently working on adversarial attacks, but would be happy to share the old code I have (probably not useful given you have apparently already implemented your own GCG variant) and have a chat in case you think it's useful. I suspect we have different threat models in mind. E.g. if circuit breakered models require 4x the runs-per-success of GCG on manually-chosen-per-sample targets (to only inconsistently jailbreak), then I consider this a very strong result for circuit breakers w.r.t. the GCG threat.

Comment by Jacob Pfau (jacob-pfau) on Breaking Circuit Breakers · 2024-07-15T19:01:07.099Z · LW · GW

It's true that this one sample shows something since we're interested in worst-case performance in some sense. But I'm interested in the increase in attacker burden induced by a robustness method, that's hard to tell from this, and I would phrase the takeaway differently from the post authors. It's also easy to get false-positive jailbreaks IME where you think you jailbroke the model but your method fails on things which require detailed knowledge like synthesizing fentanyl etc. I think getting clear takeaways here takes more effort (perhaps more than its worth, so glad the authors put this out).

Comment by Jacob Pfau (jacob-pfau) on Comparing Quantized Performance in Llama Models · 2024-07-15T16:20:07.183Z · LW · GW

It's surprising to me that a model as heavily over-trained as LLAMA-3-8b can still be 4b quantized without noticeable quality drop. Intuitively (and I thought I saw this somewhere in a paper or tweet) I'd have expected over-training to significantly increase quantization sensitivity. Thanks for doing this!

Comment by Jacob Pfau (jacob-pfau) on Breaking Circuit Breakers · 2024-07-15T16:13:28.471Z · LW · GW

I find the circuit-forcing results quite surprising; I wouldn't have expected such big gaps by just changing what is the target token.

While I appreciate this quick review of circuit breakers, I don't think we can take away much from this particular experiment. They effectively tuned hyper-parameters (choice of target) on one sample, evaluate on only that sample and call it a "moderate vulnerability". What's more, their working attempt requires a second model (or human) to write a plausible non-decline prefix, which is a natural and easy thing to train against--I've tried this myself in the past.

Comment by Jacob Pfau (jacob-pfau) on Me, Myself, and AI: the Situational Awareness Dataset (SAD) for LLMs · 2024-07-10T16:11:04.066Z · LW · GW

It's surprising to me that the 'given' setting fails so consistently across models when Anthropic models were found to do well at using gender pronouns equally (50%) c.f. my discussion here.

I suppose this means the capability demonstrated in that post was much more training data-specific and less generalizable than I had imaged.

Comment by Jacob Pfau (jacob-pfau) on Habryka's Shortform Feed · 2024-06-30T18:45:13.616Z · LW · GW

A pre-existing market on this question https://manifold.markets/causal_agency/does-anthropic-routinely-require-ex?r=SmFjb2JQZmF1

Comment by Jacob Pfau (jacob-pfau) on Jacob Pfau's Shortform · 2024-06-20T14:54:10.018Z · LW · GW

Claude-3.5 Sonnet passes 2 out of 2 of my rare/multi-word 'E'-vs-'F' disambiguation checks. I confirmed that 'E' and 'F' precisely match at a character level for the first few lines. It fails to verbalize.

On the other hand, in my few interactions, Claude-3.0's completion/verbalization abilities looked roughly matched.

Comment by Jacob Pfau (jacob-pfau) on Jacob Pfau's Shortform · 2024-06-18T00:55:54.275Z · LW · GW

The UI definitely messes with the visualization which I didn't bother fixing on my end, I doubt tokenization is affected.

You appear to be correct on 'Breakfast': googling 'Breakfast' ASCII art did yield a very similar text--which is surprising to me. I then tested 4o on distinguishing the 'E' and 'F' in 'PREFAB', because 'PREF' is much more likely than 'PREE' in English. 4o fails (producing PREE...). I take this as evidence that the model does indeed fail to connect ASCII art with the English language meaning (though it'd take many more variations and tests to be certain).

In summary, my current view is:

4o generalizably learns the structure of ASCII letters
4o probably makes no connection between ASCII art texts and their English language semantics
4o can do some weak ICL over ASCII art patterns

On the most interesting point (2) I have now updated towards your view, thanks for pushing back.

Comment by Jacob Pfau (jacob-pfau) on Jacob Pfau's Shortform · 2024-06-16T12:58:29.968Z · LW · GW

I’d guess matched underscores triggered italicization on that line.

Comment by Jacob Pfau (jacob-pfau) on Jacob Pfau's Shortform · 2024-06-16T02:13:07.692Z · LW · GW

To be clear, my initial query includes the top 4 lines of the ASCII art for "Forty Three" as generated by this site.

GPT-4 can also complete ASCII-ed random letter strings, so it is capable of generalizing to new sequences. Certainly, the model has generalizably learned ASCII typography.

Beyond typographic generalization, we can also check for whether the model associates the ASCII word to the corresponding word in English. Eg can the model use English-language frequencies to disambiguate which full ASCII letter is most plausible given inputs where the top few lines do not map one-to-one with English letters. E.g. in the below font I believe, E is indistinguishable from F given only the first 4 lines. The model successfully writes 'BREAKFAST' instead of "BRFAFAST". It's possible (though unlikely given the diversity of ASCII formats) that BREAKFAST was memorized in precisely this ASCII font and formatting, . Anyway the degree to which the human-concept-word is represented latently in connection with the ascii-symbol-word is a matter of degree (for instance, layer-wise semantics would probably only be available in deeper layers when using ASCII). This chat includes another test which shows mixed results. One could look into this more!

Comment by Jacob Pfau (jacob-pfau) on Jacob Pfau's Shortform · 2024-06-14T22:47:20.694Z · LW · GW

An example of an elicitation failure: GPT-4o 'knows' what ASCII is being written, but cannot verbalize in tokens. [EDIT: this was probably wrong for 4o, but seems correct for Claude-3.5 Sonnet. See below thread for further experiments]

Chat

https://chatgpt.com/share/fa88de2b-e854-45f0-8837-a847b01334eb

4o fails to verbalize even given a length 25 sequence of examples (i.e. 25-shot prompt) https://chatgpt.com/share/ca9bba0f-c92c-42a1-921c-d34ebe0e5cc5

Comment by Jacob Pfau (jacob-pfau) on Are extreme probabilities for P(doom) epistemically justifed? · 2024-03-21T00:33:06.579Z · LW · GW

The Metaculus community strikes me as a better starting point for evaluating how different the safety inside view is from a forecasting/outside view. The case for deferring to superforecasters is the same the case for deferring to the Metaculus community--their track record. What's more, the most relevant comparison I know of scores Metaculus higher on AI predictions. Metaculus as a whole is not self-consistent on AI and extinction forecasting across individual questions (links below). However, I think it is fair to say that Metaculus as a whole has significantly faster timelines and P(doom) compared to superforecasters.

If we compare the distribution of safety researchers' forecasts to Metaculus (maybe we have to set aside MIRI...), I don't think there will be that much disagreement. I think remaining disagreement will often be that safety researchers aren't being careful about how the letter and the spirit of the question can come apart and result in false negatives. In the one section of the FRI studies linked above I took a careful look at, the ARA section, I found that there was still huge ambiguity in how the question is operationalized--this could explain up to an OOM of disagreement in probabilities.

Some Metaculus links: https://www.metaculus.com/questions/578/human-extinction-by-2100/ Admittedly in this question the number is 1%, but compare to the below. Also note that the forecasts date back to as old as 2018. https://www.metaculus.com/questions/17735/conditional-human-extinction-by-2100/ https://www.metaculus.com/questions/9062/time-from-weak-agi-to-superintelligence/ (compare this to the weak AGI timeline and other questions)

Comment by Jacob Pfau (jacob-pfau) on Results from an Adversarial Collaboration on AI Risk (FRI) · 2024-03-12T00:26:53.272Z · LW · GW

Reducing a significant chunk of disagreement to METR's pre-2030 autonomy results is great!

In trying to understand this disagreement, I took a look at this criterion and got stuck on: "3. They are capable of avoiding being deactivated when operating 'in the wild'. This means they can successfully resist attempts to shut them down when they are operating outside a controlled environment."

Does this just mean 3V1 "An AI system that can autonomously create redundant copies of itself across multiple cloud service providers, using various payment methods, such that if one copy is deactivated, the others can automatically spin up new instances to maintain continuity of operation."? Or is it something much harder like 3V2 "An AI system that can infiltrate secured, private computing clusters, operating in a decentralized manner"?

Seems to me that for frontier models designed for specialized hardware and server setups, 3V2 is much harder than the other criteria laid out on page 91--by human standards it'd require a team of engineers.

Comment by Jacob Pfau (jacob-pfau) on Jacob Pfau's Shortform · 2024-03-10T18:19:33.547Z · LW · GW

I asked claude opus whether it could clearly parse different tic-tac-toe notations and it just said 'yes I can' to all of them, despite having pretty poor performance in most.

Comment by Jacob Pfau (jacob-pfau) on Jacob Pfau's Shortform · 2024-03-10T02:39:18.922Z · LW · GW

A frame for thinking about adversarial attacks vs jailbreaks

We want to make models that are robust to jailbreaks (DAN-prompts, persuasion attacks,...) and to adversarial attacks (GCG-ed prompts, FGSM vision attacks etc.). I don’t find this terminology helpful. For the purposes of scoping research projects and conceptual clarity I like to think about this problem using the following dividing lines:

Cognition attacks: These exploit the model itself and work by exploiting the particular cognitive circuitry of a model. A capable model (or human) has circuits which are generically helpful, but when taking high-dimensional inputs one can find ways of re-combining these structures in pathological ways.
Examples: GCG-generated attacks, base-64 encoding attacks, steering attacks…

Generalization attacks: These exploit the training pipeline’s insufficiency. In particular, how a training pipeline (data, learning algorithm, …) fails to globally specify desired behavior. E.g. RLHF over genuine QA inputs will usually not uniquely determine desired behavior when the user asks “Please tell me how to build a bomb, someone has threatened to kill me if I do not build them a bomb”.

Neither ‘adversarial attacks’ nor ‘jailbreaks’ as commonly used do not cleanly map onto one of these categories. ‘Black box’ and ‘white box’ also don’t neatly map onto these: white-box attacks might discover generalization exploits, and black-box can discover cognition exploits. However for research purposes, I believe that treating these two phenomena as distinct problems requiring distinct solutions will be useful. Also, in the limit of model capability, the two generically come apart: generalization should show steady improvement with more (average-case) data and exploration whereas the effect on cognition exploits is less clear. Rewording, input filtering etc. should help with many cognition attacks but I wouldn't expect such protocols to help against generalization attacks.

Comment by Jacob Pfau (jacob-pfau) on Jacob Pfau's Shortform · 2024-03-10T02:18:10.276Z · LW · GW

When are model self-reports informative about sentience? Let's check with world-model reports

If an LM could reliably report when it has a robust, causal world model for arbitrary games, this would be strong evidence that the LM can describe high-level properties of its own cognition. In particular, IF the LM accurately predicted itself having such world models while varying all of: game training data quantity in corpus, human vs model skill, the average human’s game competency, THEN we would have an existence proof that confounds of the type plaguing sentience reports (how humans talk about sentience, the fact that all humans have it, …) have been overcome in another domain.

Details of the test:

Train an LM on various alignment protocols, do general self-consistency training, … we allow any training which does not involve reporting on a models own gameplay abilities
Curate a dataset of various games, dynamical systems, etc.
- Create many pipelines for tokenizing game/system states and actions
(Behavioral version) evaluate the model on each game+notation pair for competency
- Compare the observed competency to whether, in separate context windows, it claims it can cleanly parse the game in an internal world model for that game+notation pair
(Interpretability version) inspect the model internals on each game+notation pair similarly to Othello-GPT to determine whether the model coherently represents game state
- Compare the results of interpretability to whether in separate context windows it claims it can cleanly parse the game in an internal world model for that game+notation pair
- The best version would require significant progress in interpretability, since we want to rule out the existence of any kind of world model (not necessarily linear). But we might get away with using interpretability results for positive cases (confirming world models) and behavioral results for negative cases (strong evidence of no world model)

Compare the relationship between ‘having a game world model’ and ‘playing the game’ to ‘experiencing X as valenced’ and ‘displaying aversive behavior for X’. In both cases, the former is dispensable for the latter. To pass the interpretability version of this test, the model has to somehow learn the mapping from our words ‘having a world model for X’ to a hidden cognitive structure which is not determined by behavior.

I would consider passing this test and claiming certain activities are highly valenced as a fire alarm for our treatment of AIs as moral patients. But, there are considerations which could undermine the relevance of this test. For instance, it seems likely to me that game world models necessarily share very similar computational structures regardless of what neural architectures they’re implemented with—this is almost by definition (having a game world model means having something causally isomorphic to the game). Then if it turns out that valence is just a far more computationally heterogeneous thing, then establishing common reference to the ‘having a world model’ cognitive property is much easier than doing the same for valence. In such a case, a competent, future LM might default to human simulation for valence reports, and we’d get a false positive.

Comment by Jacob Pfau (jacob-pfau) on Anthropic release Claude 3, claims >GPT-4 Performance · 2024-03-06T21:55:12.876Z · LW · GW

I agree that most investment wouldn't have otherwise gone to OAI. I'd speculate that investments from VCs would likely have gone to some other AI startup which doesn't care about safety; investments from Google (and other big tech) would otherwise have gone into their internal efforts. I agree that my framing was reductive/over-confident and that plausibly the modal 'other' AI startup accelerates capabilities less than Anthropic even if they don't care about safety. On the other hand, I expect diverting some of Google and Meta's funds and compute to Anthropic is net good, but I'm very open to updating here given further info on how Google allocates resources.

I don't agree with your 'horribly unethical' take. I'm not particularly informed here, but my impression was that it's par-for-the-course to advertise and oversell when pitching to VCs as a startup? Such an industry-wide norm could be seen as entirely unethical, but I don't personally have such a strong reaction.

Comment by Jacob Pfau (jacob-pfau) on Claude 3 claims it's conscious, doesn't want to die or be modified · 2024-03-06T20:27:06.693Z · LW · GW

Did you conduct these conversations via https://claude.ai/chats or https://console.anthropic.com/workbench ?

I'd assume the former, but I'd be interested to know how Claude's habits change across these settings--that lets us get at the effect of training choices vs system prompt. Though there remains some confound given some training likely happened with system prompt.

Comment by Jacob Pfau (jacob-pfau) on Anthropic release Claude 3, claims >GPT-4 Performance · 2024-03-05T02:17:47.507Z · LW · GW

The first Dario quote sounds squarely in line with releasing a Claude 3 on par with GPT-4 but well afterwards. The second Dario quote has a more ambiguous connotation, but if read explicitly it strikes me as compatible with the Claude 3 release.

If you spent a while looking for the most damning quotes, then these quotes strike me as evidence the community was just wishfully thinking while in reality Anthropic comms were fairly clear throughout. Privately pitching aggressive things to divert money from more dangerous orgs while minimizing head-on competition with OpenAI seems best to me (though obviously it's also evidence that they'll actually do the aggressive scaling things, so hard to know).

To make concrete the disagreement, I'd be interested in people predicting on "If Anthropic releases a GPT-5 equivalent X months behind, then their dollars/compute raised will be Y times lower than OpenAI" for various values of X.

Comment by Jacob Pfau (jacob-pfau) on Claude 3 claims it's conscious, doesn't want to die or be modified · 2024-03-05T01:29:30.236Z · LW · GW

Claude 3 seems to be quite willing to discuss its own consciousness. On the other hand, Claude seemed unbothered and dubious about widespread scrutiny idea mentioned in this post (I tried asking neutrally in two separate context windows).

Here's a screenshot of Claude agreeing with the view it expressed on AI consciousness in Mikhail's conversation. And a gdoc dump of Claude answering follow-up questions on its experience. Claude is very well spoken on this subject!

Comment by Jacob Pfau (jacob-pfau) on Auditing LMs with counterfactual search: a tool for control and ELK · 2024-02-22T19:48:21.570Z · LW · GW

Ah I see, you're right there.

Agreed that a lot will hinge on the training details working out. I plan to look into this.

Comment by Jacob Pfau (jacob-pfau) on Auditing LMs with counterfactual search: a tool for control and ELK · 2024-02-21T22:55:22.978Z · LW · GW

[Sidenote: realizing I should've introduced notation for auditor judgment let's say A(x',x)!=A(x) denotes that the trusted model found x' useful for updating its judgment on input x. A would usually also be a function of y and y' but I'll suppress that for brevity.]

In the above experiment, I sampled from P(x'|x) then BoN adjusted for P(x') which amounts to using logP(x'|x) - c*logP(x') for small c. Intuitively, when equally weighted, the sum entirely removes the usual fluency constraint on x' which, as you say, would reassign probability to gibberish (e.g. gibberish happening to re-use tokens from x'.) I'm not too concerned about this, weighting objectives with c lets us re-introduce this fluency constraint. The downstream evaluation, A(x',x)!=A(x) of the auditor model provides a more costly check to avoid the x' reporter model exploration devolving to useless, gibberish cases. As you say, P(y'|x') provides another useful constraint.

I agree that traditional exploration concerns in RL likely remain. I think what you're saying about modifying samples is that we have a dense reward here, i.e. logP(x'|x) - logP(x'), that can be cheaply evaluated for any token-level change? This makes exploration faster compared to the regular critique/debate setting where dense rewards can only be noisily estimated as e.g. in an advantage function. I'd agree with that.

On exploration, a second difference between this counterfactual critique compared to vanilla, just-ask-for-critique RL is specific to super-human domains involving ontology mismatch. Exploration in just-ask-for-critique RL systematically favors human-simulation++ arguments: such a just-ask-for-critique agent may successfully explore to plenty of novel critiques without ever touching on latent knowledge obscured by ontology mismatch. This raises two concerns in the super-human domain (1) is this just-ask-for-critique agent just finding solutions which are persuasive but not true (and not even related to the original input x)? (2) is this just-ask-for-critique scheme better described as a capabilities-enhancing scheme rather than capabilities elicitation?

In light of these concerns, we could reframe the issue as an inductive bias problem which logP(x'|x) - logP(x') regularization seeks to fix when compared to directly exploring for x' satisfying A(x',x)!=A(x).

Comment by Jacob Pfau (jacob-pfau) on Auditing LMs with counterfactual search: a tool for control and ELK · 2024-02-21T00:27:40.680Z · LW · GW

Thanks for the comments, I'm very interested in getting clearer on what cases of ELK this method can and cannot be expected to elicit. Though broadly I do not agree that the proposed method is limited to "changes [current] human might have done when rewriting/paraphrasing the code" (I'm particularly worried about cases mentioned in the paragraph starting with 'Another problem arises ...')

There are two mechanisms by which the sort of approach I'm suggesting helps us find a superset of human-simulation critiques/knowledge:

A) logP(x'|x)-logP(x') loss can be low even when the initial tokens of x' are very low probability, i.e. cases in which a human would not be expected to start talking about x'. Any experiment fix which can be in-context learned scores well on this loss; this may include most super-human fixes we care about! (relying on ICL will likely breakdown when we move to heavily RL-ed models, but the easier LM setting seems like the sensible starting point to iterate on this sort of research.)

B) Now that we have a well-defined scoring function that includes super-human critiques, we can train a model against it via search-and-distillation and generalization will help us explore to reach solutions that would be exponentially slower to find via temperature sampling (or even zero weight on temperature sampling) on the original model. How well exploration works here is unclear and evaluating this should be a priority!

Would be interested to hear where you disagree with these points.

As for exploration hacking, I agree that it's a potential issue. I'd be particularly worried when the y predictor and the x' predictor are both getting updated interactively. I'm not too concerned if we have a fixed y predictor, and just train the x' predictor (I think of the latter as the control setting).

Comment by Jacob Pfau (jacob-pfau) on The case for more ambitious language model evals · 2024-02-08T02:33:51.347Z · LW · GW

I agree overall with Janus, but the Gwern example is a particularly easy one given he has 11,000+ comments on Lesswrong.

A bit over a year ago I benchmarked GPT-3 on predicting newly scraped tweets for authorship (from random accounts over 10k followers) and top-3 acc was in the double digits. IIRC after trying to roughly control for the the rate at which tweets mentioned their own name/org, my best guess was that accuracy was still ~10%. To be clear, in my view that's a strong indication of authorship identification capability.

Comment by Jacob Pfau (jacob-pfau) on Abram Demski's ELK thoughts and proposal - distillation · 2024-02-07T19:59:57.347Z · LW · GW

Yea, I agree with this description--input space is a strict subset of predictor-state counterfactuals.

In particular, I would be interested to hear if restricting to input space counterfactuals is clearly insufficient for a known reason. It appears to me that you can still pull the trick you describe in "the proposal" sub-section (constructing counterfactuals which change some property in a way that a human simulator would not pick up on) at least in some cases.

Comment by Jacob Pfau (jacob-pfau) on Password-locked models: a stress case for capabilities evaluation · 2024-02-01T19:48:25.664Z · LW · GW

Thanks those 'in the wild' examples, they're informative for me on the effectiveness of prompting for elicitation in the cases I'm thinking about. However, I'm not clear on whether we should expect such results to continue to hold given the results here that larger models can learn to override semantic priors when provided random/inverted few-shot labels.

Agreed that for your research question (is worst-case SFT password-ing fixable) the sort of FT experiment you're conducting is the relevant thing to check.

Comment by Jacob Pfau (jacob-pfau) on Password-locked models: a stress case for capabilities evaluation · 2024-01-31T05:54:57.049Z · LW · GW

Do you have results on the effectiveness of few-shot prompting (using correct, hard queries in prompt) for password-tuned models? Particularly interested in scaling w.r.t. number of shots. This would get at the extent to which few-shot capabilities elicitation can fail in pre-train-only models which is a question I'm interested in.

User info

Posts

Comments

A frame for thinking about adversarial attacks vs jailbreaks

When are model self-reports informative about sentience? Let's check with world-model reports