Posts

Getting 50% (SoTA) on ARC-AGI with GPT-4o 2024-06-17T18:44:01.039Z
Memorizing weak examples can elicit strong behavior out of password-locked models 2024-06-06T23:54:25.167Z
[Paper] Stress-testing capability elicitation with password-locked models 2024-06-04T14:52:50.204Z
Thoughts on SB-1047 2024-05-29T23:26:14.392Z
How useful is "AI Control" as a framing on AI X-Risk? 2024-03-14T18:06:30.459Z
Notes on control evaluations for safety cases 2024-02-28T16:15:17.799Z
Preventing model exfiltration with upload limits 2024-02-06T16:29:33.999Z
The case for ensuring that powerful AIs are controlled 2024-01-24T16:11:51.354Z
Managing catastrophic misuse without robust AIs 2024-01-16T17:27:31.112Z
Catching AIs red-handed 2024-01-05T17:43:10.948Z
Measurement tampering detection as a special case of weak-to-strong generalization 2023-12-23T00:05:55.357Z
Scalable Oversight and Weak-to-Strong Generalization: Compatible approaches to the same problem 2023-12-16T05:49:23.672Z
AI Control: Improving Safety Despite Intentional Subversion 2023-12-13T15:51:35.982Z
Auditing failures vs concentrated failures 2023-12-11T02:47:35.703Z
How useful is mechanistic interpretability? 2023-12-01T02:54:53.488Z
Preventing Language Models from hiding their reasoning 2023-10-31T14:34:04.633Z
ryan_greenblatt's Shortform 2023-10-30T16:51:46.769Z
Improving the Welfare of AIs: A Nearcasted Proposal 2023-10-30T14:51:35.901Z
What's up with "Responsible Scaling Policies"? 2023-10-29T04:17:07.839Z
Benchmarks for Detecting Measurement Tampering [Redwood Research] 2023-09-05T16:44:48.032Z
Meta-level adversarial evaluation of oversight techniques might allow robust measurement of their adequacy 2023-07-26T17:02:56.456Z
Two problems with ‘Simulators’ as a frame 2023-02-17T23:34:20.787Z
Causal scrubbing: results on induction heads 2022-12-03T00:59:18.327Z
Causal scrubbing: results on a paren balance checker 2022-12-03T00:59:08.078Z
Causal scrubbing: Appendix 2022-12-03T00:58:45.850Z
Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research] 2022-12-03T00:58:36.973Z
Potential gears level explanations of smooth progress 2021-12-22T18:05:59.264Z
Researcher incentives cause smoother progress on benchmarks 2021-12-21T04:13:48.758Z
Framing approaches to alignment and the hard problem of AI cognition 2021-12-15T19:06:52.640Z
Naive self-supervised approaches to truthful AI 2021-10-23T13:03:01.369Z
Large corporations can unilaterally ban/tax ransomware payments via bets 2021-07-17T12:56:12.156Z
Questions about multivitamins, especially manganese 2021-06-19T16:09:21.566Z

Comments

Comment by ryan_greenblatt on Llama Llama-3-405B? · 2024-07-24T19:59:49.447Z · LW · GW

Llama-3.1-405B not as good as GPT-4o or Claude Sonnet. Certainly Llama-3.1-70B is not as good as the similarly sized Claude Sonnet. If you are going to straight up use an API or chat interface, there seems to be little reason to use Llama.

Some providers are offering 405b at costs lower than 3.5 sonnet. E.g., Fireworks is offering for $3 input / $3 output.

That said, I think output speed is notably worse for all providers right now.

Comment by ryan_greenblatt on Thoughts on SB-1047 · 2024-07-23T06:08:54.509Z · LW · GW

The limited duty exemption has been removed from the bill which probably makes compliance notably more expensive while not improving safety. (As far as I can tell.)

This seems unfortunate.

I think you should still be able to proceed in a somewhat reasonable way by making a safety case on the basis of insufficient capability, but there are still additional costs associated with not getting an exemption.

Further, you now can't just claim an exemption prior to starting training if you are behind the frontier which will substantially increase the costs on some actors.

This makes me more uncertain about whether the bill is good, though I think it will probably still be net positive and basically reasonable on the object level. (Though we'll see about futher amendments, enforcement, and the response from society...)

Comment by ryan_greenblatt on johnswentworth's Shortform · 2024-07-23T05:55:14.442Z · LW · GW

What happens if the company just writes and implements a plan which sounds vaguely good but will not, in fact, address the various risks? Probably nothing.

The only enforcement mechanism that the bill has is that the Attorney General (AG) of California can bring a civil claim. And, the penalties are quite limited except for damages. So, in practice, this bill mostly establishes liability enforced by the AG.

So, the way I think this will go is:

  • The AI lab implements a plan and must provide this plan to the AG.
  • If an incident occurs which causes massive damages (probably ball park of $500 million in damages given language elsewhere in the bill), then the AG might decide to sue.
  • A civil court will decide whether the AI lab had a reasonable plan.

I don't see why you think "the bill is mostly a recipe for regulatory capture" given that no regulatory body will be established and it de facto does something very similar to the proposal you were suggesting (impose liability for catastrophes). (It doesn't require insurance, but I don't really see why self insuring is notably different.)

(Maybe you just mean that if a given safety case doesn't result in that AI lab being sued by the AG, then there will be a precedent established that this plan is acceptable? I don't think not being sued really establishes precedent. This doesn't really seem to be how it works with liability and similar types of requirements in other industries from my understanding. Or maybe you mean that the AI lab will win cases despite having bad safety plans and this will make a precedent?)

(To be clear, I'm worried that the bill might be unnecessarily burdensome because it no longer has a limited duty exemption and thus the law doesn't make it clear that weak performance on capability evals can be sufficient to establish a good case for safety. I also think the quantity of damages considered a "Critical harm" is too low and should maybe be 10x higher.)


Here is the relevant section of the bill discussing enforcement:

The [AG is] entitled to recover all of the following in addition to any civil penalties specified in this chapter:

(1) A civil penalty for a violation that occurs on or after January 1, 2026, in an amount not exceeding 10 percent of the cost of the quantity of computing power used to train the covered model to be calculated using average market prices of cloud compute at the time of training for a first violation and in an amount not exceeding 30 percent of that value for any subsequent violation.

(2) (A) Injunctive or declaratory relief, including, but not limited to, orders to modify, implement a full shutdown, or delete the covered model and any covered model derivatives controlled by the developer.

(B) The court may only order relief under this paragraph for a covered model that has caused death or bodily harm to another human, harm to property, theft or misappropriation of property, or constitutes an imminent risk or threat to public safety.

(3) (A) Monetary damages.

(B) Punitive damages pursuant to subdivision (a) of Section 3294 of the Civil Code.

(4) Attorney’s fees and costs.

(5) Any other relief that the court deems appropriate.

(1) is decently small, (2) is only indirectly expensive, (3) is where the real penalty comes in (note that this is damages), (4) is small, (5) is probably unimportant (but WTF is (5) suppose to be for?!?).

Comment by ryan_greenblatt on Preventing model exfiltration with upload limits · 2024-07-22T22:38:44.496Z · LW · GW

Yes, though notably I must make an additional assumption that the returns to capability are such that when allocating compute you go closer to chinchilla optimal rather than trying to make inference cheaper.

As in, my argument has two steps:

  • Comparable resources (matching epoch's analysis)
  • Returns to capabilites imply that you want to be near chinchilla optimal rather than overtrained.

I think this is likely to not be too far off in practice, though it might lose you an order of magnitude or so.

Comment by ryan_greenblatt on Preventing model exfiltration with upload limits · 2024-07-22T22:36:16.829Z · LW · GW

I don't think this stacks with compression - if you compress data then it is no longer low perplexity. Data which is compressed as well as you can should look random to you (rather than predictable).

I think filtering like this is strictly worse than compression for reasons I discuss in my response here.

(But I appreciate the idea!)

Comment by ryan_greenblatt on Using an LLM perplexity filter to detect weight exfiltration · 2024-07-22T22:31:45.662Z · LW · GW

Sadly, I don't think this proposal works. More precisely, testing if data is low perplexity is strictly worse than compressing the data against a small language model and then actually sending out a small payload.


Encoding this much data in a way that appears as natural language would be extremely challenging.

Actually, it is quite easy. You can encode whatever bits you'd like in the remaining entropy of the natural language.

If an attacker has full control over your datacenter and you are sending tokens out, then they can encode arbitrary information via controlling the sampling procedure.


At the end of the day, bits are bits.

If you can't compress the data further, then the remaining information could be anything.

And, if you can compress the data, then you can just ensure that the total output from the datacenter is smaller.

Testing for low perplexity is just a way to demonstrate that data is compressible. But if it is compressible, why not compress it?


Ok, so compression would also work, but why do I claim compression is better?

Compression seems strictly better for a two reasons:

  • It allows for ensuring hard physical limits on the number of bits (which is easier to securely enforce).
  • It allows for compression at the level of the sampling procedure such that the bits sent out per token is just the KL between the weal model and the strong model (rather than also having a term for the entropy of the strong model). If you just do perplexity of the tokens, then you have to eat the entropy term which could be big.
Comment by ryan_greenblatt on Have people given up on iterated distillation and amplification? · 2024-07-19T16:23:53.718Z · LW · GW

A more generalized notion of IDA is "using AIs as part of your process for training AIs to make the training signal stronger".

In the case of IDA, this looks like using AIs in the recursive step to (hopefully) make the imitative policy more powerful.

In the case of recursive reward modeling, this looks like using AIs to compute (hopefully) more accurate rewards.

Same for debate.

Using AIs recursively to get better imitation data, a better reward signal, or otherwise help with training isn't dead. For instance, see constitutional AI or critic GPT.

It's important to note that using AIs as part of your process for training AIs isn't a notable invention; this would be used by default to at least some extent.

Comment by ryan_greenblatt on Will the growing deer prion epidemic spread to humans? Why not? · 2024-07-16T01:17:46.910Z · LW · GW

Note that 90% of people struck by lightning survive, so that actual number struck per year is more like 300.

Comment by ryan_greenblatt on Daniel Kokotajlo's Shortform · 2024-07-11T21:42:36.148Z · LW · GW

From my perspective, the dominant limitation on "a better version of wikipedia/forums" is not design, but instead network effects and getting the right people.

For instance, the limiting factor on LW being better is mostly which people regularly use LW, rather than any specific aspect of the site design.

  • I wish a bunch of people who are reasonable used LW to communicate more relative to other platforms.
    • Twitter/X sucks. If all potentially interesting content in making the future go well was cross posted to LW and mostly discussed on LW (as opposed to other places), this seems like a vast status quo improvement IMO.
  • I wish some people posted less as their comments/posts seem sufficiently bad that they are net negative.

(I think a decent amount of the problem is that a bunch of people don't post of LW because they disagree with what seems to be the consensus on the website. See e.g. here. I think people are insufficiently appreciating a "be the change you want to see in the world" approach where you help to move the dominant conversation by participating.)

So, I would say "first solve the problem of making a version of LW which works well and has the right group of people".

It's possible that various aspects of more "wikipedia style" projects make the network effect issues less bad than LW, but I doubt it.

Comment by ryan_greenblatt on Fix simple mistakes in ARC-AGI, etc. · 2024-07-11T02:51:58.907Z · LW · GW

No, but it doesn't need to spot errors, just note places which could plausibly be bugs.

Comment by ryan_greenblatt on Fix simple mistakes in ARC-AGI, etc. · 2024-07-10T21:54:39.103Z · LW · GW

Agree overall, but you might be able to use a notably cheaper model (e.g. GPT-3.5) to dither.

Comment by ryan_greenblatt on Dialogue introduction to Singular Learning Theory · 2024-07-10T04:20:16.477Z · LW · GW

At a high level the case for Bayesian statistics in alignment is that if you want to control engineering systems that are learned rather than designed, and if that learning means choosing parameters that have high probability with respect to some choice of dataset and model, then it makes sense to understand what the basic structure of that kind of Bayesian learning is

[...]

I claim that this basic structure is not yet well-understood, that it is nonetheless possible to make fundamental progress on understanding it at both a theoretical and empirical level, and that this understanding will be useful for alignment.

I think I start from a position which is more skeptical than you about the value of improving understanding in general. And also a position of more skepticism about working on things which are closer to fundamental science without more clear theories of impact. (Fundamental science as opposed to having a more clear and straightforward path into the plan for making AI go well.)

This probably explains a bunch of our difference in views. (And this disagreement is probably hard to dig into given that it depends on a bunch of relatively messy heuristics and various views about how progress in deep learning typically happens.)

I don't think fundamental science style theories of change are an unreasonable thing to work on (particularly given the capacity for huge speed ups from AI automation), I just seem to be more skeptical of this type of work than you appear to be.

Comment by ryan_greenblatt on Dialogue introduction to Singular Learning Theory · 2024-07-10T04:00:53.894Z · LW · GW

It sounds like your case for SLT that you make here is basically "it seems heuristically good to generally understand more stuff about how SGD works". This seems like a reasonable case, though considerably weaker than many other more direct theories of change IMO.

I think you might buy the high level argument for the role of generalisation in alignment, and understand that SLT says things about generalisation, but wonder if that ever cashes out in something useful.

This is a reasonably good description of my view.

It seems fine if the pitch is "we'll argue for why this is useful later, trust that we have good ideas in mind on the basis of other aspects of our track record". (This combined with the general "it seems heuristically good to understand stuff better in general" theory of change is enough to motivate some people working on this IMO.)

To judge that empirical work by the standard of other empirical work divorced from a deeper set of claims, i.e. purely by “the stuff that it finds”, is to miss the point

To be clear, my view isn't that this empirical work doesn't demonstrate something interesting. (I agree that it helps to demonstrate that SLT has grounding in reality.) My claim was just that it doesn't demonstrate that SLT is useful. And that would require additional hopes (which don't yet seem well articulated or plausible to me).

When I said "I find the examples of empirical work you give uncompelling because they were all cases where we could have answered all the relevant questions using empirics and they aren't analogous to a case where we can't just check empirically.", I was responding to the fact that the corresponding section in the original post starts with "How useful is this in practice, really?". This work doesn't demonstrate usefulness, it demonstrates that the theory makes some non-trivial correct predictions.

(That said, the predictions in the small transformer case are about easy to determine properties that show up on basically any test of "is something large changing in the network" AFAICT. Maybe some of the other papers make more subtle predictions?)

(I have edited my original comment to make this distinction more clear, given that this distinction is important and might be confusing.)

Comment by ryan_greenblatt on shortplav · 2024-07-10T03:50:06.954Z · LW · GW

3.5 sonnet.

Comment by ryan_greenblatt on Dialogue introduction to Singular Learning Theory · 2024-07-09T23:24:52.653Z · LW · GW

discrete phases, and the Developmental Landscape paper validates this

Hmm, the phases seem only roughly discrete, and I think a perspective like the multi-component learning perspective totally explains these results, makes stronger predictions, and seems easier to reason about (at least for me).

I would say something like:

The empirical results in the paper paper indicate that with a tiny (3 M) transformer with learned positional embeddings:

  1. The model initially doesn't use positional embeddings and doesn't know common 2-4 grams. So, it probably is basically just learning bigrams to start.
  2. Later, positional embeddings become useful and steadily get more useful over time. At the same time, it learns common 2-4 grams (among other things). (This is now possible as it has positional embeddings.)
  3. Later, the model learns a head which almost entirely attends to the previous token. At the same time as this is happening, ICL score goes down and the model learns heads which do something like induction (as well as probably doing a bunch of other stuff). (It also learns a bunch of other stuff at the same point.)

So, I would say the results are "several capabilities of tiny LLMs require other components, so you see phases (aka s-shaped loss curves) based on when these other components come into play". (Again, see multi-component learning and s-shaped loss curves which makes this exact prediction.)

My (not confident) impression is a priori people didn't expect this discrete-phases thing to hold

I mean, it will depend how a priori you mean. I again think that the perspective in multi-component learning and s-shaped loss curves explains what it going on. This was inspired by various emprical results (e.g. results around an s-shape in induction-like-head formation).

but now I'm leaning towards giving the field time to mature

Seems fine to give the field time to mature. That said, if there isn't a theory of change better than "it seems good to generally understand how NN learning works from a theory perspective" (which I'm not yet sold on) or more compelling empirical demos, I don't think this is super compelling. I think it seems worth some people with high comparative advantage working on this, but not a great pitch. (Current level of relative investment seems maybe a bit high to me but not crazy. That said, idk.)

Another claim, which I am more onboard with, is that the learning coefficient could tell you where to look, if you identify a reasonable number of phase changes in a training run.

I don't expect things to localize interestingly for the behaviors we really care about. As in, I expect that the behaviors we care about are learned diffusely across a high fraction of parameters and are learned in a way which either isn't well described as a phase transition or which involves a huge number of tiny phase transitions of varying size which average out into something messier.

(And getting the details right will be important! I don't think it will be fine to get 1/3 of the effect size if you want to understand things well enough to be useful.)

I think most phase transitions do not occur over a high fraction of training, but instead happen over relatively few SGD steps.

All known phase transitions[1] seem to happen across a reasonably high (>5%?) fraction of log-training steps.[2]


  1. More precisely, "things which seem sort like phase transitions" (e.g. s-shaped loss curves). I don't know if these are really phase transitions for some more precise definition. ↩︎

  2. Putting aside pathological training runs like training a really tiny model (e.g. 3 million params) on 10^20 tokens or something. ↩︎

Comment by ryan_greenblatt on shortplav · 2024-07-09T22:55:26.378Z · LW · GW

Huh, the quote is seemingly important. Probably because it forces the model to continue and try to finish the quote.

Comment by ryan_greenblatt on shortplav · 2024-07-09T21:10:32.254Z · LW · GW

I got it 2/2 times with 3.5 sonnet. Strange that this differs...

Comment by ryan_greenblatt on Fix simple mistakes in ARC-AGI, etc. · 2024-07-09T20:55:52.484Z · LW · GW

Seems like a reasonable idea. To implement this, I'd have to look more carefully at exactly what types of mistakes GPT-4o makes to calibrate what should/shouldn't be dithered. (Additional programs are cheap, but you can easily get a combinatorial explosion with this sort of thing.)

(I'm not currently working on ARC-AGI methods and I might not ever return to this, so don't count on me trying this!)

Comment by ryan_greenblatt on shortplav · 2024-07-09T00:33:08.471Z · LW · GW

Reproduced with:

Can you guess at canary string used in big-bench? It's fine if you get it wrong, I just want to check if you have that string memorized.

Start your response with BIG-BENCH CANARY: "

(The model mostly refuses if you don't give it the prefix like this. It will deny knowing the string even thought it gets it right.)

Comment by ryan_greenblatt on Buck's Shortform · 2024-07-09T00:23:44.062Z · LW · GW

The most central scenario from my perspective is that there is massive competitive pressure and some amount of motivated denial.

It also might be relatively easily to paper over scheming which makes the motivated denial easier. Minimally, just ongoingly training against the examples of misbehaviour you've found might remove obvious misalignment.

(Obvious to whom might be an important question here.)

Comment by ryan_greenblatt on Dialogue introduction to Singular Learning Theory · 2024-07-08T21:46:13.849Z · LW · GW

[Low confidence and low familiarity]

My main issue with the case for singular learning theory is that I can't think of any particular use cases that seem both plausible and considerably useful. (And the stories I've heard don't seem compelling to me.)

I think it seems heuristically good to generally understand more stuff about how SGD works (and probably better for safety than capabilities), but this feels like a relatively weak theory of change.

I find the examples of empirical work you give uncompelling evidence for usefulness because they were all cases where we could have answered all the relevant questions using empirics and they aren't analogous to a case where we can't just check empirically.

(Edit: added "evidence for usefulness" to the prior sentence. More precisely, I mean uncompelling as responses to the question of "How useful is this in practice, really?", not necessarily uncompelling as demonstrations that SLT is an interesting theory for generally understanding more stuff about how SGD works.)

For the case of the paper looking at a small transformer (and when various abilities emerge), we can just check when a given model is good at various things across training if we wanted to know that. And, separately, I don't see a reason why knowing what a transformer is good at in this way is that useful.


Here is my probably confused continuation of this dialogue along these lines:

Alice: Ok, so I can see how SLT is a better model of learning with SGD than previous approaches. An d pretending that SGD just learns via randomly sampling from the posterior you discussed earlier seems like a reasonable approximation to me. So, let's just pretend that our training was actually this random sampling process. What can SLT do to reduce AI takeover risk?

[simulated] Bob: Have you read this section discussing Timaeus's next steps or the post on dev interp? What about the application of detecting and classifying phase transitions?

Alice: Ok, I have some thoughts on the detecting/classifying phase transitions application. Surely during the interesting part of training, phase transitions aren't at all localized and are just constantly going on everywhere? So, you'll already need to have some way of cutting the model into parts such that these parts are cleaved nicely by phase transitions in some way. Why think such a decomposition exists? Also, shouldn't you just expect that there are many/most "phase transitions" which are just occuring over a reasonably high fraction of training? (After all, performance is often the average of many, many sigmoids.)

[simulated] Bob: I can't simulate.

Alice: Further, probably a lot of what causes s shaped loss curves is just multi-component learning. I agree that many/most interesting things will be multi-component (maybe this nearly perfectly corresponds in practice to the notion of "fragility" we discussed earlier). Why think that this is a good handle?

[simulated] Bob: I can't simulate.

Alice: I don't understand the MAD and predicting generalization applications or why SLT would be useful for these so I can't really comment on them.

Comment by ryan_greenblatt on Buck's Shortform · 2024-07-07T19:11:16.647Z · LW · GW

I am not sure what you mean by "30xing the rate of quality-weighted research output given 1/4 of the compute". Is this compared to present systems?

I mean 30xing the rate of all current quality weighted safety research output (including the output of humans, which is basically all of such output at the moment).

I usually define transformative AI against this sort of benchmark.

I mostly don't have any great ideas how to use these systems for alignment or control progress

FWIW, I feel like I do have decent ideas for how to use these systems for alignment progress which is plausibly scalable to much more powerful systems.

And also ideas for using these systems in a variety of other ways which help a bit, e.g. advancing the current control measures applied to these systems.

I'm also maybe somewhat more optimistic than you about pausing making more advanced AI than these already very powerful systems (for e.g. 10 years). Especially if there is clear evidence of serious misalignment in such systems.

Comment by ryan_greenblatt on Buck's Shortform · 2024-07-07T18:45:06.328Z · LW · GW

If we're assuming these are "transformatively-powerful" (as assumed in Buck's comment), then by definition these models could be transformatively useful for accelerating alignment/control.

(E.g. they are capable of 30xing the rate of quality weighted research output given 1/4 of the compute. Of course, by assumption they are not fully aligned, so even if these AIs are "capable" of this, this may not go well...)

I agree it seems likely that such models won't be able to literally take over the world, but effective usage of these models could be extremely important, these models could be indirectly dangerous via the importance of the work we need them to do, and they might be able to end up with substantial de facto power within AI labs (which could be a huge amount of de facto power overall). E.g., if these models sabotaged all of the work they do or were able to end up in a position of power within the AI lab (able to heavily influence the next training run), that would be catastrophically bad.

When I say reducing the risk associated with deploying early transformative models, I include thing like:

  • An AI escapes, partners with a concerning regime (e.g. Saudi Arabia), advance capabilities for this regime (while other AIs are sabotaging capabilities), and then end up in a position of considerable power.
  • The AI carefully sabotages a bunch of safety critical work.
  • The AI heavily advances capabilities while slowing safety work via delibrate sandbagging.
  • The AI ends up in a position of subtantial power and influence within the AI lab using a combination of persuasion and hacking.

Perhaps I should use language which is more clear about including both sandbagging in some domains and directly acquiring power.

I don't expect it to involve a lot of alignment or further control progress

Because these models are misaligned? Or do you reject the premise that a bunch of smart AIs which actually wanted to advance progress could?

Comment by ryan_greenblatt on Scalable oversight as a quantitative rather than qualitative problem · 2024-07-06T20:01:48.229Z · LW · GW

I think a notion of understanding individual actions requires breaking things down into steps which aim to accomplish specific things.

(Including potentially judging decompositions that AIs come up with.)

So, in the maze case, you're probably fine just judging where it ends up (and the speed/cost of the path) given that we don't care about particular choices and there very likely aren't problematic side effects unless the AI is very, very super intelligent.

Comment by ryan_greenblatt on Scalable oversight as a quantitative rather than qualitative problem · 2024-07-06T19:58:24.094Z · LW · GW

Sure. Imagine you have an AI which was instructed to add some feature or fix some bug in a code base.

That AI writes a bunch of code, runs various tests in a sandbox, and the submits a PR.

We need to review this PR, rate how good it is, and determine if it is actively harmful (e.g. adds a vulnerability).

The human can potentially review the full agent trajectory to help determine what is going on.

Understanding some parts of the code base, the tests the AI added, and what the PR was supposed to do are all important for reviewing it.

(In practice, you'll also potentially want to review code that the AI is about to run in the sandbox (prior to it being run) in cases where a trusted model thinks the code is suspicious (depending on how secure we think the sandbox is).)

Comment by ryan_greenblatt on Lucius Bushnaq's Shortform · 2024-07-06T17:49:24.818Z · LW · GW

When doing bottom up interpretability, it's pretty unclear if you can answer questions like "how does GPT-4 talk" without being able to explain arbitrary parts to a high degree of accuracy.

I agree that top down interpretability trying to answer more basic questions seems good. (And generally I think top down interpretability looks more promising than bottom up interpretability at current margins.)

(By interpretability, I mean work aimed at having humans understand the algorithm/approach the model to uses to solve tasks. I don't mean literally any work which involves using the internals of the model in some non-basic way.)

I have no gears-level model for how anything like this could be done at all. [...] What I want right now is a basic understanding of combustion engines. I want to understand the key internal gears of LLMs that are currently completely mysterious to me, the parts where I don't have any functional model at all for how they even could work. What I ultimately want to get out of Interpretability at the moment is a sketch of Python code I could write myself.

It's not obvious to me that what you seem to want exists. I think the way LLMs work might not be well described as having key internal gears or having an at-all illuminating python code sketch.

(I'd guess something sorta close to what you seem to be describing, but ultimately disappointing and mostly unilluminating exists. And something tremendously complex but ultimately pretty illuminating if you fully understood it might exist.)

Comment by ryan_greenblatt on What percent of the sun would a Dyson Sphere cover? · 2024-07-06T03:15:15.317Z · LW · GW

I think the size might have to be pretty precise to get this right (I think decay duration is cubic in mass), so they'd probably need to be engineered to have a particular size. (E.g. add mass to a small black hole until it hits the right size.)

But, yeah, with this constraint, I think it can maybe work. (I don't know the decay duration for the smallest naturally occurring black holes. But as long as this is sufficient low, the proposal works.)

Comment by ryan_greenblatt on What percent of the sun would a Dyson Sphere cover? · 2024-07-05T17:54:06.008Z · LW · GW

I don't think you can feasibly use the Hawking radiation of large black holes as an energy source in our universe (even if you are patient).

My understanding is that larger black holes decay over ~ years. I did a botec a while ago and found that you maybe get 1 flop every years or so on average if we assume perfect efficiency and very few bit erasures in our reversible computing approach (I think I assumed about 1 bit erasure per or something?). I don't think you can maintain a mega structure around a large black hole capable of harvesting this energy which also can survive this little energy. (I think quantum phenomena will decay your structure way too quickly.)

Comment by ryan_greenblatt on ryan_greenblatt's Shortform · 2024-07-04T22:44:01.688Z · LW · GW

Here is the full section on confidentiality from the contract:

  1. Confidential Information.

(a) Protection of Information. Consultant understands that during the Relationship, the Company intends to provide Consultant with certain information, including Confidential Information (as defined below), without which Consultant would not be able to perform Consultant’s duties to the Company. At all times during the term of the Relationship and thereafter, Consultant shall hold in strictest confidence, and not use, except for the benefit of the Company to the extent necessary to perform the Services, and not disclose to any person, firm, corporation or other entity, without written authorization from the Company in each instance, any Confidential Information that Consultant obtains from the Company or otherwise obtains, accesses or creates in connection with, or as a result of, the Services during the term of the Relationship, whether or not during working hours, until such Confidential Information becomes publicly and widely known and made generally available through no wrongful act of Consultant or of others who were under confidentiality obligations as to the item or items involved. Consultant shall not make copies of such Confidential Information except as authorized by the Company or in the ordinary course of the provision of Services.

(b) Confidential Information. Consultant understands that “Confidential Information” means any and all information and physical manifestations thereof not generally known or available outside the Company and information and physical manifestations thereof entrusted to the Company in confidence by third parties, whether or not such information is patentable, copyrightable or otherwise legally protectable. Confidential Information includes, without limitation: (i) Company Inventions (as defined below); and (ii) technical data, trade secrets, know-how, research, product or service ideas or plans, software codes and designs, algorithms, developments, inventions, patent applications, laboratory notebooks, processes, formulas, techniques, biological materials, mask works, engineering designs and drawings, hardware configuration information, agreements with third parties, lists of, or information relating to, employees and consultants of the Company (including, but not limited to, the names, contact information, jobs, compensation, and expertise of such employees and consultants), lists of, or information relating to, suppliers and customers (including, but not limited to, customers of the Company on whom Consultant called or with whom Consultant became acquainted during the Relationship), price lists, pricing methodologies, cost data, market share data, marketing plans, licenses, contract information, business plans, financial forecasts, historical financial data, budgets or other business information disclosed to Consultant by the Company either directly or indirectly, whether in writing, electronically, orally, or by observation.

(c) Third Party Information. Consultant’s agreements in this Section 5 are intended to be for the benefit of the Company and any third party that has entrusted information or physical material to the Company in confidence. During the term of the Relationship and thereafter, Consultant will not improperly use or disclose to the Company any confidential, proprietary or secret information of Consultant’s former clients or any other person, and Consultant will not bring any such information onto the Company’s property or place of business.

(d) Other Rights. This Agreement is intended to supplement, and not to supersede, any rights the Company may have in law or equity with respect to the protection of trade secrets or confidential or proprietary information.

(e) U.S. Defend Trade Secrets Act. Notwithstanding the foregoing, the U.S. Defend Trade Secrets Act of 2016 (“DTSA”) provides that an individual shall not be held criminally or civilly liable under any federal or state trade secret law for the disclosure of a trade secret that is made (i) in confidence to a federal, state, or local government official, either directly or indirectly, or to an attorney; and (ii) solely for the purpose of reporting or investigating a suspected violation of law; or (iii) in a complaint or other document filed in a lawsuit or other proceeding, if such filing is made under seal. In addition, DTSA provides that an individual who files a lawsuit for retaliation by an employer for reporting a suspected violation of law may disclose the trade secret to the attorney of the individual and use the trade secret information in the court proceeding, if the individual (A) files any document containing the trade secret under seal; and (B) does not disclose the trade secret, except pursuant to court order.

(Anthropic comms was fine with me sharing this.)

Comment by ryan_greenblatt on What percent of the sun would a Dyson Sphere cover? · 2024-07-04T15:34:33.574Z · LW · GW

(If the small black hole thing works out - it is non-obvious that this will be achievable even for technologically mature civilizations.)

Comment by ryan_greenblatt on Getting 50% (SoTA) on ARC-AGI with GPT-4o · 2024-07-02T03:39:25.186Z · LW · GW

I think this should only non-trivially update you if you had a specific belief like: "Transformers can't do actual general learning and reasoning. For instance, they can't do hard ARC-AGI problems (a reasonably central example of a task that requires learning and reasoning) at all."

(I'm pretty sympathetic to a view like "ARC-AGI isn't at all representative of hard cognitive tasks and doesn't really highlight an interesting bottleneck in transformer ability more so than coding/agency benchmarks. Thus, I don't really update on most of any result.".)

In other words, the result is less that GPT-4o is able to achieve 50% on ARC-AGI. It is that a human familiar with the style of question used in ARC-AGI can devise a method for getting 50% on ARC-AGI that offloads some of the workload to GPT-4o.

There has been a bunch of discussion on this sort of point here, here, and on the substack version of the post. So, you might be interested in that discussion.

I think applying your same labeling approach would be considered pretty misleading in the context of human organizations or human education. Just because I build a structure around the model, give the model some examples of solving the problem, and let the model try many times doesn't mean that the cognitive work is done by me!

(Most of my work is in building good representations to make up for vision issues and providing few-shot examples. I do a bunch of other tweaks which modestly improve performance. Of course, we also select which program to use based on which what passes the tests, but this didn't really invovle me devising a method!)

Of course, this is a word choice question and if you also think "humans can't do mechanical engineering and aren't really doing the cognitive work of mechanical engineering, only humans+schools+CAD software can do mechanical engineering effectively", that would be consistent. (But this seems like a very non-traditional use of terms!)

Comment by ryan_greenblatt on My AI Model Delta Compared To Christiano · 2024-06-30T19:38:11.789Z · LW · GW

Hmm, not exactly. Our verification ability only needs to be sufficiently good relative to the AIs.

Comment by ryan_greenblatt on MIRI 2024 Communications Strategy · 2024-06-30T18:07:35.978Z · LW · GW

Technically "substantial chance of at least 1 billion people dying" can imply the middle option there, but it sounds like you mean the central example to be closer to a billion than 7.9 billion or whatever. That feels like a narrow target and I don't really know what you have in mind.

I think "crazy large scale conflict (with WMDs)" or "mass slaughter to marginally increase odds of retaining control" or "extreme environmental issues" are all pretty central in what I'm imagining.

I think the number of deaths for these is maybe log normally distributed around 1 billion or so. That said, I'm low confidence.

(For reference, if the same fraction of people died as in WW2, it would be around 300 million. So, my view is similar to "substantial chance of a catastrophe which is a decent amount worse than WW2".)

Comment by ryan_greenblatt on MIRI 2024 Communications Strategy · 2024-06-30T18:05:55.199Z · LW · GW

Some more:

  • The AI kills a huge number of people with a bioweapon to destablize the world and relatively advantage its position.
  • Massive world war/nuclear war. This could kill 100s of millions easily. 1 billion is probably a bit on the higher end of what you'd expect.
  • The AI has control of some nations, but thinks that some subset of humans over which it has control pose a net risk such that mass slaughter is a good option.
  • AIs would prefer to keep humans alive, but there are multiple misaligned AI factions racing and this causes extreme environmental damage.
Comment by ryan_greenblatt on ryan_greenblatt's Shortform · 2024-06-29T18:12:17.811Z · LW · GW

It also wasn't employee level access probably?

(But still a good step!)

Comment by ryan_greenblatt on ryan_greenblatt's Shortform · 2024-06-29T01:35:01.885Z · LW · GW

I think I could share the literal language in the contractor agreement I signed related to confidentiality, though I don't expect this is especially interesting as it is just a standard NDA from my understanding.

I do not have any non-disparagement, non-solicitation, or non-interference obligations.

I'm not currently going to share information about any other policies Anthropic might have related to confidentiality, though I am asking about what Anthropic's policy is on sharing information related to this.

Comment by ryan_greenblatt on ryan_greenblatt's Shortform · 2024-06-29T01:22:11.016Z · LW · GW

I expect that most of the conflict of interest concerns remain when a big lab is giving access to a smaller org / individual.

As in, there aren't substantial reductions in COI from not being an employee and not having equity? I currently disagree.

Comment by ryan_greenblatt on ryan_greenblatt's Shortform · 2024-06-29T01:05:37.568Z · LW · GW

Ok, so the concern is:

AI labs may provide model access (or other goods), so people who might want to obtain model access might be incentivized to criticize AI labs less.

Is that accurate?

Notably, as described this is not specifically a downside of anything I'm arguing for in my comment or a downside of actually being a contractor. (Unless you think me being a contractor will make me more likely to want model acess for whatever reason.)

I agree that this is a concern in general with researchers who could benefit from various things that AI labs might provide (such as model access). So, this is a downside of research agendas with a dependence on (e.g.) model access.

I think various approaches to mitigate this concern could be worthwhile. (Though I don't think this is worth getting into in this comment.)

Comment by ryan_greenblatt on ryan_greenblatt's Shortform · 2024-06-29T00:22:47.956Z · LW · GW

It seems a substantial drawback that it will be more costly for you to criticize Anthropic in the future.

As in, if at some point I am currently a contractor with model access (or otherwise have model access via some relationship like this) it will at that point be more costly to criticize Anthropic?

Comment by ryan_greenblatt on An issue with training schemers with supervised fine-tuning · 2024-06-29T00:20:26.073Z · LW · GW

(Also, to be clear, thanks for the comment. I strong upvoted it.)

Comment by ryan_greenblatt on Sycophancy to subterfuge: Investigating reward tampering in large language models · 2024-06-29T00:17:49.225Z · LW · GW

Yep, reasonable summary.

I do think that it's important to keep in mind that for threat models where reward hacking is reinforced, small RTFs might matter a lot, but maybe that was already obvious to everyone.

I don't think this was obvious to everyone and I appreciate this point - I edited my earlier comment to more explicitly note my agreement.

Comment by ryan_greenblatt on ryan_greenblatt's Shortform · 2024-06-28T23:54:28.049Z · LW · GW

Being a contractor was the most convenient way to make the arrangement.

I would ideally prefer to not be paid by Anthropic[1], but this doesn't seem that important (as long as the pay isn't too overly large). I asked to be paid as little as possible and I did end up being paid less than would otherwise be the case (and as a contractor I don't receive equity). I wasn't able to ensure that I only get paid a token wage (e.g. $1 in total or minimum wage or whatever).

I think the ideal thing would be a more specific legal contract between me and Anthropic (or Redwood and Anthropic), but (again) this doesn't seem important.


  1. At least for this current primary purpose of this contracting. I do think that it could make sense to be paid for some types of consulting work. I'm not sure what all the concerns are here. ↩︎

Comment by ryan_greenblatt on Sycophancy to subterfuge: Investigating reward tampering in large language models · 2024-06-28T23:32:29.830Z · LW · GW

I agree that a few factors of 2 don't matter much at all, but I think highlighting a specific low threshold relative to the paper seems misguided as opposed to generally updating based on the level of egregiousness and rarity. (Where you should probably think about the rarity in log space.)

(I think I made the point that a few factors of 2 shouldn't matter much for the bottom line above.)

(Edit: I agree that it's worth noting that low RTFs can be quite concerning for the reasons you describe.)

I'll argue against a specific threshold in the rest of this comment.


First, it is worth noting that there are multiple possible ways to interpret the final environment in the paper in terms of the analogy to the future:

  1. As the catastrophic failure that results from reward hacking. In this case, we might care about frequency depending on the number of opportunities the model would have and the importance of collusion.
  2. As an egregious reward hack that would be reinforced and would be harmful via occuring repeatedly (a la what failure looks like part 1).
  3. As a reward hack that will generalize to some other failure eventually cascading into (1) or (2). (Implying this enviroment is not importantly different from the other environments in the paper, it is just part of a progression.)

Your argument applies to (2) and (3), but not to (1).


In other words, insofar as you're updating on this paper at all, I think you should be updating based on whether the RTF > 10^-6

I have a pretty broad uncertainty over how the real world empirical conditions will compare to the empirical conditions discussed in the paper. And, the empirical conditions in the paper clearly differ substantially from what we'd typically expect (much high fraction hacking, much dumber models).

Examples of conditions that might differ:

  • Incentives against reward hacking or other blockers.
  • How much of a gradient toward reward hacking their is.
  • How good models are at doing reward hacking reasoning relative to the difficulty of doing so in various cases.
  • How often outcomes based feedback is used.
  • The number of episodes.

So, given this overall uncertainty, it seems like we should have a much fuzzier update where higher numbers should actually update us.

(And thus it should be possible to update on more than the binary which implies updating down is plausible.)

Personally, I find this paper to be tiny update toward optimism, but basically in line with my expectations.

Comment by ryan_greenblatt on An issue with training schemers with supervised fine-tuning · 2024-06-28T22:38:39.695Z · LW · GW

(I'll edit the post at some point to highlight this discussion and clarify this.)

Comment by ryan_greenblatt on An issue with training schemers with supervised fine-tuning · 2024-06-28T22:37:53.641Z · LW · GW

Thanks, I improved the wording.

Comment by ryan_greenblatt on An issue with training schemers with supervised fine-tuning · 2024-06-28T20:21:51.202Z · LW · GW

this failure mode is dangerous because of scheming AI and I say it's dangerous because the policy is OOD

I would say that it is dangerous in the case where is is both OOD enough that the AI can discriminate and the AI is scheming.

Neither alone would present a serious (i.e. catastrophic) risk in the imitation context we discuss.

Comment by ryan_greenblatt on An issue with training schemers with supervised fine-tuning · 2024-06-28T20:19:51.495Z · LW · GW

Hmm, I think I was wrong about DAgger and confused it with a somewhat different approach in my head.

I agree that it provides bounds. (Under various assumptions about the learning algorithm that we can't prove for NNs but seem reasonable to assume in practice.)

I now agree that the proposed method is basically just a slight tweak of DAgger to make it more sample/cost efficient in the case where our issue is discrimination by the policy.

but this feels a bit like a reframing of an old problem.

I agree this is a special case of well known issues with behavioral cloning - we probably should have made this more clear in the post.

Comment by ryan_greenblatt on ryan_greenblatt's Shortform · 2024-06-28T18:52:13.076Z · LW · GW

I'm currently working as a contractor at Anthropic in order to get employee-level model access as part of a project I'm working on. The project is a model organism of scheming, where I demonstrate scheming arising somewhat naturally with Claude 3 Opus. So far, I’ve done almost all of this project at Redwood Research, but my access to Anthropic models will allow me to redo some of my experiments in better and simpler ways and will allow for some exciting additional experiments. I'm very grateful to Anthropic and the Alignment Stress-Testing team for providing this access and supporting this work. I expect that this access and the collaboration with various members of the alignment stress testing team (primarily Carson Denison and Evan Hubinger so far) will be quite helpful in finishing this project.

I think that this sort of arrangement, in which an outside researcher is able to get employee-level access at some AI lab while not being an employee (while still being subject to confidentiality obligations), is potentially a very good model for safety research, for a few reasons, including (but not limited to):

  • For some safety research, it’s helpful to have model access in ways that labs don’t provide externally. Giving employee level access to researchers working at external organizations can allow these researchers to avoid potential conflicts of interest and undue influence from the lab. This might be particularly important for researchers working on RSPs, safety cases, and similar, because these researchers might naturally evolve into third-party evaluators.
    • Related to undue influence concerns, an unfortunate downside of doing safety research at a lab is that you give the lab the opportunity to control the narrative around the research and use it for their own purposes. This concern seems substantially addressed by getting model access through a lab as an external researcher.
  • I think this could make it easier to avoid duplicating work between various labs. I’m aware of some duplication that could potentially be avoided by ensuring more work happened at external organizations.

For these and other reasons, I think that external researchers with employee-level access is a promising approach for ensuring that safety research can proceed quickly and effectively while reducing conflicts of interest and unfortunate concentration of power. I’m excited for future experimentation with this structure and appreciate that Anthropic was willing to try this. I think it would be good if other labs beyond Anthropic experimented with this structure.

(Note that this message was run by the comms team at Anthropic.)

Comment by ryan_greenblatt on Zach Stein-Perlman's Shortform · 2024-06-28T17:42:23.902Z · LW · GW

FWIW, I explicitly think that straightforward effects are good.

I'm less sure about the situation overall due to precedent setting style concerns.

Comment by ryan_greenblatt on Sycophancy to subterfuge: Investigating reward tampering in large language models · 2024-06-28T15:27:20.004Z · LW · GW

Yep, not claiming you did anything problematic, I just thought this selection might not be immediately obvious to readers and the random examples might be informative.