maxnadeau

Posts
Comments

Posts

Research directions Open Phil wants to fund in technical AI safety 2025-02-08T01:40:00.968Z

Open Philanthropy Technical AI Safety RFP - $40M Available Across 21 Research Areas 2025-02-06T18:58:53.076Z

Update on Harvard AI Safety Team and MIT AI Alignment 2022-12-02T00:56:45.596Z

Apply to the Redwood Research Mechanistic Interpretability Experiment (REMIX), a research program in Berkeley 2022-10-27T01:32:44.750Z

Comments

Comment by maxnadeau on The 4-Minute Mile Effect · 2025-04-16T07:20:52.661Z · LW · GW

Typo: should be "Gell-Mann"

Comment by maxnadeau on Tormenting Gemini 2.5 with the [[[]]][][[]] Puzzle · 2025-03-30T20:18:26.283Z · LW · GW

I figured out the encoding, but I expressed the algorithm for computing the decoding in different language from you. My algorithm produces equivalent outputs but is substantially uglier. I wanted to leave a note here in case anyone else had the same solution.

Alt phrasing of the solution:

Each expression (i.e. a well-formed string of brackets) has a "degree", which is defined as the the number of well-formed chunks that the encoding can be broken up into. Some examples: [], [[]], and [-[][]] have degree one, [][], -[][[]], and [][[][]] have degree two, etc.

Here's a special case: the empty string maps to 0, i.e. decode("") = 0

When an encoding has degree one, you take off the outer brackets and do 2^decode(the enclosed expression), defined recursively. So decode([]) = 2^decode("") = 2^0 = 1, decode([[]]) = 2^decode([]) = 2, etc.

Negation works as normal. So decode([-[]]) = 2^decode(-[]) = 2^(-decode([])) = 2^(-1) = 1/2

So now all we have to deal with is expressions with degree >1.

When an expression has degree >1, you compute its decoding as the product of decoding of the first subexpression and inc(decode(everything after the first subexpression)). I will define the "inc" function shortly.

So decode([][[]]) = decode([]) * inc(decode([[]]) = 1 * inc(2)

decode([[[]]][][[]]) = decode([[[]]]) * inc(decode([][[]])) = 4 * inc(decode([]) * inc(decode([[]]))) = 4 * inc(1 * inc(2))

What is inc()? inc() is a function that computes a prime factorization of a number and the increments (from one prime to the next) all the prime bases. So inc(10) = inc(2 * 5) = 3 * 7 = 21, and inc(36) = inc(2^2 * 3^2) = 3^2 * 5^2 = 225. But inc() doesn't just take in integers, it can take in any number representable as a product of primes raised to powers. So inc(2^(1/2) * 3^(-1)) = 3^(1/2) * 5^(-1) = sqrt(3)/5. I asked the language models whether there's a standard name for the set of numbers definable in this way, and they didn't have ideas.

Comment by maxnadeau on Six Thoughts on AI Safety · 2025-03-07T02:26:11.189Z · LW · GW

On point 6, "Humanity can survive an unaligned superintelligence": In this section, I initially took you to be making a somewhat narrow point about humanity's safety if we develop aligned superintelligence and humanity + the aligned superintelligence has enough resources to out-innovate and out-prepare a misaligned superintelligence. But I can't tell if you think this conditional will be true, i.e. whether you think the existential risk to humanity from AI is low due to this argument. I infer from this tweet of yours that AI "kill[ing] us all" is not among your biggest fears about AI, which suggests to me that you expect the conditional to be true—am I interpreting you correctly?

Comment by maxnadeau on We should start looking for scheming "in the wild" · 2025-03-07T02:14:06.967Z · LW · GW

To make a clarifying point (which will perhaps benefit other readers): you're using the term "scheming" in a different sense from how Joe's report or Ryan's writing uses the term, right?

I assume your usage is in keeping with your paper here, which is definitely different from those other two writers' usages. In particular, you use the term "scheming" to refer to a much broader set of failure modes. In fact, I think you're using the term synonymously with Joe's "alignment-faking"—is that right?

Comment by maxnadeau on Open problems in emergent misalignment · 2025-03-03T23:42:19.585Z · LW · GW

People interested in working on these sorts of problems should consider applying to Open Phil's request for proposals: https://www.openphilanthropy.org/request-for-proposals-technical-ai-safety-research/

Comment by maxnadeau on Detecting Strategic Deception Using Linear Probes · 2025-02-11T05:27:00.296Z · LW · GW

This section of our RFP has some other related work you might want to include, e.g. Orgad et al.

Comment by maxnadeau on Six Thoughts on AI Safety · 2025-01-25T03:42:16.892Z · LW · GW

I think the link in footnote two goes to the wrong place?

Comment by maxnadeau on Jesse Hoogland's Shortform · 2025-01-23T19:26:10.183Z · LW · GW

I haven't read the paper, but based only on the phrase you quote, I assume it's referring to hacks like the one shown here: https://arxiv.org/pdf/2210.10760#19=&page=19.0

Comment by maxnadeau on AI Timelines · 2025-01-08T01:50:24.382Z · LW · GW

Do you think that cyber professionals would take multiple hours to do the tasks with 20-40 min first-solve times? I'm intuitively skeptical.

One (edit: minor) component of my skepticism is that someone told me that the participants in these competitions are less capable than actual cyber professionals, because the actual professionals have better things to do than enter competitions. I have no idea how big that selection effect is, but it at least provides some countervailing force against the selection effect you're describing.

Comment by maxnadeau on AI Timelines · 2025-01-08T01:31:06.388Z · LW · GW

You mentioned CyBench here. I think CyBench provides evidence against the claim "agents are already able to perform self-contained programming tasks that would take human experts multiple hours". AFAIK, the most up-to-date CyBench run is in the joint AISI o1 evals. In this study (see Table 4.1, and note the caption), all existing models (other than o3, which was not evaluated here) succeed on 0/10 attempts at almost all the Cybench tasks that take >40 minutes for humans to complete.

Comment by maxnadeau on Brief analysis of OP Technical AI Safety Funding · 2024-11-04T23:47:49.346Z · LW · GW

(I work at Open Phil on TAIS grantmaking)

I agree with most of this. A lot of our TAIS grantmaking over the last year was to evals grants solicited through this RFP. But I want to make a few points of clarification:

Not all the grants that have been or will be approved in 2024 are on our website. For starters, there are still two months left in the year. But also, there are some grants that have been approved, but haven't been put on the website yet. So $28 million is an modest underestimate, so it isn't directly comparable to the 2022/2023 numbers.
I agree that evals don't create new technological approaches to making AIs safer, but I think there are lots of possible worlds where eval results create more willpower and enthusiasm for putting safeguards in place (especially when those safeguards take work/cost money/etc). Specifically, I think evals can show people what a model is capable of, and what the trend lines are over time, and these effects can (if AIs are increasingly more capable) get people to invest more effort in safeguards. So I don't agree with the claim that evals won't "cause any action if they are built", and I also disagree that "very few TAIS grants are directly focused on making sure systems are aligned / controllable / built safely".

I appreciate you examining our work and giving your takes!

Comment by maxnadeau on the case for CoT unfaithfulness is overstated · 2024-10-05T00:36:26.024Z · LW · GW

Edit: Let me know if you or someone you know is interested in working on this sort of research. I work at Open Phil and we're interested in getting a better sense of how much demand for funding in this area there is.

Good post. I agree with the headline that faithful CoT is a promising path forward for interpreting and supervising LLMs.

TL;DR: I'd suggest researchers focus more narrowly on a subset of unfaithful chain-of-thought: steganographic/encoded reasoning that the model is using to perform hidden serial reasoning, as studied in Roger et al. IMO this type of unfaithfulness is more important, more neglected, and more tractable to fix than the sort of post-hoc reasoning studied in Turpin et al., or the hidden parallel reasoning studied in Pfau et al.

I think it's worth distinguishing between three kinds of unfaithful chain of thought that can occur when you ask a model a question and give it time to generate a chain of thought before producing an answer (I'm slightly adapting this trichotomy from Lanham et al., who should get the credit for this conceptual breakdown):

Epiphenomenal (typically post-hoc) reasoning: This is when the process of writing a CoT explanation is not actually helping the model produce its answer. This can be okay, if (as OP mentions), the model is just accurately explaining the reasoning process that it followed to produce its answer. But it can also lead to explanations that are inaccurate, as Turpin et al. demonstrate; these explanations leave out important reasons for a decision and/or make false predictions about how the model would behave on other inputs. For more on whether models' explanations make accurate predictions about their counterfactual behaviors, see here.
Hidden parallelized reasoning: This is when a model does a bunch of reasoning while generating a chain of thought, but it doesn't encode any of its intermediate calculations/conclusions in the tokens it produces. It just uses the fact that it has lots of tokens to think with to do a large number of parallelized calculations. See Pfau et al. for a demonstration of what this looks like; you can make models that get a question wrong if forced to answer immediately, but get it right if you ask them "[Question]?...........................Give your answer now:".
- Something like hidden parallelized reasoning may emerge naturally from pre-training, as studied in Wu et al. They show that larger models do more "pre-caching", i.e. computing a feature that is useless for predicting the literal next token, but will be useful for predicting some future next token down the line.
Steganography/encoded serial reasoning: This is when a model uses the words it generates in its chain of thought to accomplish as task that it could not do in a single forward pass, but it encodes meaning in its word choice that is opaque to a typical overseer. This is demonstrated in Roger et al., which takes a model that is too dumb to learn a QA task without CoT but can learn the task with CoT, and then teaches it to perform the task using steganographic CoTs.

I think steganography is the most concerning form of unfaithful chain-of-thought of these three, but it's also the easiest to study and mitigate. That is to say, I think it's more important, more neglected, and more tractable than the other two.

Importance/scariness of steg vs. other forms of unfaithfulness: I think steganographic reasoning in chains of thought is both reasonably likely to emerge naturally and potentially quite dangerous, especially compared to epiphenomenal and hidden parallelized reasoning.
- Why likely? It seems useful for multi-step reasoning to arbitrarily assign information-laden connotations to pairs of words that are synonymous to humans, especially under a financial pressure to use reduce the number of tokens used. Also, OpenAI performed no public tests for steganography on o1 and this OpenAI RS says he'd be totally happy with steganography, so I get the sense that OpenAI at least isn't prioritizing avoiding steg highly.
- Why dangerous? Steganographic CoTs allow models to perform long chains of serial reasoning outside human oversight, and to send messages to other AIs (esp. copies of themselves).
  - This could allow models to:
    - Generate complex plans for escaping safety measures and going rogue.
    - Carefully analyze their inputs to distinguish real chances to defect from synthetic honeypots, or distinguish insecure containment environments (that could be escaped from) from secure ones.
    - Subtly collude with other copies of themselves in order to coordinate on e.g. sabotaging AI developers' cybersecurity, alignment research, or AI training code.
- Meanwhile, epiphenomenal and hidden parallelized reasoning don't allow for very long and complex trains of thought, or for hidden communication with other AIs.
Neglectedness: Almost all work on CoT faithfulness (e.g. this) focuses entirely on unfaithful post-hoc reasoning and ignores steganography entirely. I conjecture that this is because long, RL-trained CoTs are new with o1, and there hasn't been clear evidence of steganographic reasoning occurring organically yet. That said, I'd guess o1 is already steganographic, on this basis of this sort of thing.
Tractability: Roger et al. find that "paraphrasing is an effective defense against all steganography schemes we’ve tried" and Radhakrishnan et al. find that "Question Decomposition Improves the Faithfulness of Model-Generated Reasoning". I am pretty skeptical that these basic defenses can resist optimization pressures against them, and I'd bet that further research will overturn Roger et al.'s blue-team victory. But I'd like to see the SOTA on defenses pushed further; I think that defenses can probably reduce steganography somewhat and make it incrementally harder for future AIs to defect.

I think there are a lot of exciting follow-up directions for future research here; finding cases where steg emerges organically, testing out different paraphrasing approaches for reducing steg, finding ways of decomposing tasks into subproblems to reduce how much context/background info any given copy of a model has (and then preventing the copies from steganographically communicating).

(I'd be interested in hearing about any good work in this area that I haven't mentioned in this comment, if anyone is aware of some)

Comment by maxnadeau on The Checklist: What Succeeding at AI Safety Will Involve · 2024-09-04T23:02:25.522Z · LW · GW

I got a bit lost in understanding your exit plan. You write

My preferred exit plan is to build human-obsoleting AIs which are sufficiently aligned/trustworthy that we can safely defer to them

Some questions about this and the text that comes after it:

How do you achieve such alignment? You wrote that you worry about the proposal of perfectly + scalably solving alignment, but I worry about how to achieve even the imperfect alignment of human-ish-level AIs that you're describing here. What techniques are you imagining using?
Why do these AIs need to be human-obsoleting? Why not just human-accelerating?
Why does your exit plan involve using powerful and aligned AIs to prepare for superintelligence, rather than merely using controlled AIs of that capability level? Do you think that it would be hard/dangerous to try to control "human-obsoleting" AIs?
Why do you "expect that ruling out egregious misalignment is the hardest part in practice"? That seems pretty counterintuitive to me. It's easy to imagine descendants of today's models that don't do anything egregious but have pretty different values from me and/or the general public; these AIs wouldn't be "philosophically competent".
What are you buying time to do? I don't understand how you're proposing spending the "3 years of time prior to needing to build substantially superhuman AIs". Is it on alignment for those superhuman AIs?
You mention having 3 years, but then you say "More generally, it just seems really heuristically scary to very quickly go from AIs which aren't much smarter than the best humans to AIs which are wildly smarter in only a few years." I found this confusing.
What do you mean by "a high fraction of risk comes from building wildly superhuman AI and it seems much easier to mitigate risks prior to this point." It seems easier to mitigate which risks prior to what point? And why? I didn't follow this.

Comment by maxnadeau on Would catching your AIs trying to escape convince AI developers to slow down or undeploy? · 2024-08-26T18:06:14.745Z · LW · GW

In the past, you've treated catching an AI escape attempt as a "win condition" for a control protocol, because such an attempt leads you to either 1) shut down your AI or 2) do few-shot catastrophe prevention. You wrote that having these two tools would "substantially" change the situation w.r.t. the model's chance of success.

But the arguments in this post suggest that catching an attempt may not change the overall strategic landscape/level of AI risk much.

Is this a change in your views? Or have you always felt that catching an AI trying to escape substantially changes that AI's chance of success without substantially changing the overall level of AI risk?

Separately, another takeaway of this post: it seems valuable to work on research that allows you to provide evidence about the alignment of other labs' AIs, e.g. better technical work on searching for inputs that trigger competent misaligned behavior.

Comment by maxnadeau on Getting 50% (SoTA) on ARC-AGI with GPT-4o · 2024-06-27T20:51:52.688Z · LW · GW

Thanks, edited my post to reference this (lmk if you understand what's happening here better than I do)

Comment by maxnadeau on Getting 50% (SoTA) on ARC-AGI with GPT-4o · 2024-06-20T00:16:16.049Z · LW · GW

Thanks, this is a helpful comment. Fixed the typo

Comment by maxnadeau on Getting 50% (SoTA) on ARC-AGI with GPT-4o · 2024-06-18T00:42:24.258Z · LW · GW

Edit: The sitation has evolved but is still somewhat confusing. There is now a leaderboard of scores on the public test set that Ryan is #1 on (see here). But this tweet from Jack Cole indicates that his (many month old) solution gets a higher score on the public test set than Ryan's top score on that leaderboard. I'm not really sure what's going on here,

Why isn't Jack's solution on the public leaderboard?
Is the semi-pubic test set the same as the old private set?
If not, is it equal in difficulty to the public test set, or the harder private test set?
Here it says "New high scores are accepted when the semi-private and public evaluation sets are in good agreement". What does that mean?

One important caveat to the presentation of results in this post (and the discussion on Twitter) is that there are reasons to think this approach may not be SOTA, as it performs similarly to the prior best-performing approach when tested apples-to-apples, i.e. on the same problems.

There are three sets of ARC problems: the public training set, the public eval set, and the private eval set.

Buck and Ryan got 71% on the first, 51% on the second, and [we don't know] on the third.
The past SOTA got [we don't know] on the first, 52% on the second, and 34% on the third.
Humans get 85% on the first, [we don't know] on the second, and [we don't know] on the third

My two main deductions from this are:

It's very misleading to compare human perf on the train set and AI perf on either of the test sets; the test sets seem way harder! Note that 71% is approaching 85%, so it seems like AIs are not far from human perf when you compare apples-to-apples. So graphs from the ARC folks like the one showing little progress towards human-level perf on this page are not scientifically valid.
Buck and Ryan's approach doesn't exceed the past AI SOTA on the only apples-to-apples comparison we have so far. Unclear if it will beat it on the private test set.

Apparently, lots of people get better performance on the public test set than the private one, which is a little surprising given that if you read this page from the ARC folks, you'll see the following:

The public training set is significantly easier than the others (public evaluation and private evaluation set) since it contains many "curriculum" type tasks intended to demonstrate Core Knowledge systems. It's like a tutorial level.
The public evaluation sets and the private test sets are intended to be the same difficulty.

Two explanations come to mind: maybe the public and private test sets are not IID, and/or maybe past SOTA method overfit to the public set. Chollet claims it's (accidentally) the latter here, but he doesn't rule out the former. He says the tasks across the two public test sets are meant to be equally hard for a human, but he doesn't say they're divided in an IID manner.

I guess we'll see how the results on the public leaderboard shake out.

(Expanding on a tweet)

Comment by maxnadeau on Anthropic Fall 2023 Debate Progress Update · 2023-12-20T01:04:03.629Z · LW · GW

What are the considerations around whether to structure the debate to permit the judge to abstain (as Michael et al do, by allowing the judge to end the round with low credence) versus forcing the judge to pick an answer each time? Are there pros/cons to each approach? Any arguments about similarity of one or the other to the real AI debates that might be held in the future?

It's possible I'm misremembering/misunderstanding the protocols used for the debate here/in that other paper.

Comment by maxnadeau on What are the best non-LW places to read on alignment progress? · 2023-07-07T01:26:10.692Z · LW · GW

"Follow the right people on twitter" is probably the best option. People will often post twitter threads explaining new papers they put out. There's also stuff like:

News put together by CAIS: https://newsletter.mlsafety.org/ and https://newsletter.safe.ai/ and https://twitter.com/topofmlsafety
News put together by Daniel Paleka: https://newsletter.danielpaleka.com/ and twitter summaries like https://twitter.com/dpaleka/status/1664617835178631170

Comment by maxnadeau on Geoffrey Hinton - Full "not inconceivable" quote · 2023-04-21T18:49:36.769Z · LW · GW

I appreciate you transcribing these interviews William!

Comment by maxnadeau on DragonGod's Shortform · 2023-01-18T20:23:41.306Z · LW · GW

Did/will this happen?

Comment by maxnadeau on My first year in AI alignment · 2023-01-02T04:00:38.256Z · LW · GW

I've been loving your optimization posts so far; thanks for writing them. I've been feeling confused about this topic for a while and feel like "being able to answer any question about optimization" would be hugely valuable for me.

Comment by maxnadeau on Apply to the Redwood Research Mechanistic Interpretability Experiment (REMIX), a research program in Berkeley · 2022-11-04T02:14:28.119Z · LW · GW

Thanks, fixed

Comment by maxnadeau on Apply to the Redwood Research Mechanistic Interpretability Experiment (REMIX), a research program in Berkeley · 2022-10-30T20:38:33.874Z · LW · GW

We're expecting familiarity with PyTorch, unlike MLAB. The level of Python background expected is otherwise similar. The bar will vary somewhat depending on each applicant's other traits, e.g. mathematical and empirical-science backgrounds

Comment by maxnadeau on Apply to the Redwood Research Mechanistic Interpretability Experiment (REMIX), a research program in Berkeley · 2022-10-30T19:12:53.127Z · LW · GW

30 min, 45 min, 20-30 min (respectively)

Comment by maxnadeau on Infra-Exercises, Part 1 · 2022-09-02T01:19:59.866Z · LW · GW

Video link in the pdf doesn't work

Comment by maxnadeau on chinchilla's wild implications · 2022-08-01T14:53:34.152Z · LW · GW

Confusion:

You write "Only PaLM looks better than Chinchilla here, mostly because it trained on 780B tokens instead of 300B or fewer, plus a small (!) boost from its larger size."

But earlier you write:

"Chinchilla is a model with the same training compute cost as Gopher, allocated more evenly between the two terms in the equation.

It's 70B params, trained on 1.4T tokens of data"

300B vs. 1.4T. Is this an error?

Comment by maxnadeau on Some thoughts on vegetarianism and veganism · 2022-02-15T04:39:14.584Z · LW · GW

I agree with your description about the hassle of eating veg when away from home. The point I was trying to make is that buying hunted meat seems possibly ethically preferable to veganism on animal welfare grounds, would address Richard's nutritional concerns, and also satisfies meat cravings.

Of course, this only works if you condition on the brutality of nature as the counterfactual. But for the time being, that won't change.

Comment by maxnadeau on Some thoughts on vegetarianism and veganism · 2022-02-15T03:04:02.421Z · LW · GW

I was thinking yesterday that I'm surprised more EAs don't hunt or eat lots of mail-ordered hunted meat, like eg this. Regardless of whether you think nature should exist in the long term, as it stands the average deer, for example, has a pretty harsh life and death. Studies like this on American white-tailed deer enumerate the alternative modes of death, which I find universally unappealing. You've got predation (which surprisingly to me is the number one cause of death for fawns), car accidents, disease, and starvation. These all seem orders of magnitude worse than being killed by a hunter with a good shot.

I'd assume human hunting basically trades off against predation and starvation, so the overall quantity of deer and deer consciousness isn't affected much by hunting. The more humans kill, the fewer coyotes kill.

Edit: So it seems to me that buying hunted meat/encouraging hunting might have a better animal welfare profile than veganism, while also satisfying Richard's concerns about nutrition and satisfying meat cravings. That being said, it is not really scalable in the way veg*ism is.

User info

Posts

Comments