Posts

New report: Safety Cases for AI 2024-03-20T16:45:27.984Z
List of strategies for mitigating deceptive alignment 2023-12-02T05:56:50.867Z
New paper shows truthfulness & instruction-following don't generalize by default 2023-11-19T19:27:30.735Z
Testbed evals: evaluating AI safety even when it can’t be directly measured 2023-11-15T19:00:41.908Z
Red teaming: challenges and research directions 2023-05-10T01:40:16.556Z
Safety standards: a framework for AI regulation 2023-05-01T00:56:39.823Z
Are short timelines actually bad? 2023-02-05T21:21:41.363Z
[MLSN #7]: an example of an emergent internal optimizer 2023-01-09T19:39:47.888Z
Prizes for ML Safety Benchmark Ideas 2022-10-28T02:51:46.369Z
What is the best critique of AI existential risk arguments? 2022-08-30T02:18:42.463Z

Comments

Comment by joshc (joshua-clymer) on New report: Safety Cases for AI · 2024-03-23T21:04:08.202Z · LW · GW

> which feels to me like it implies it's easy to get medium-scalable safety cases that get you acceptable levels of risks by using only one or two good layers of security

I agree there's a communication issue here. Based on what you described, I'm not sure if we disagree.


> (maybe 0.3 bits to 1 bit)

I'm glad we are talking bits. My intuitions here are pretty different. e.g. I think you can get 2-3 bits from testbeds. I'd be keen to discuss standards of evidence etc in person sometime.

Comment by joshc (joshua-clymer) on New report: Safety Cases for AI · 2024-03-22T03:05:51.810Z · LW · GW

This makes sense. Thanks for the resources!

Comment by joshc (joshua-clymer) on New report: Safety Cases for AI · 2024-03-21T22:03:14.827Z · LW · GW

Thanks for leaving this comment on the doc and posting it.

> But I feel like that's mostly just a feature of the methodology, not a feature of the territory. Like, if you applied the same methodology to computer security, or financial fraud, or any other highly complex domain you would end up with the same situation where making any airtight case is really hard.

You are right in that safety cases are not typically applied to security. Some of the reasons for this are explained in this paper, but I think the main reason is this:

"The obvious difference between safety and security is the presence of an intelligent adversary; as Anderson (Anderson 2008) puts it, safety deals with Murphy’s Law while security deals with Satan’s Law... Security has to deal with agents whose goal is to compromise systems."

My guess is that most safety evidence will come down to claims like "smart people tried really hard to find a way things could go wrong and couldn't." This is part of why I think 'risk cases' are very important.

I share the intuitions behind some of your other reactions.

> I feel like the framing here tries to shove a huge amount of complexity and science into a "safety case", and then the structure of a "safety case" doesn't feel to me like it is the kind of thing that would help me think about the very difficult and messy questions at hand.

Making safety cases is probably hard, but I expect explicitly enumerating their claims and assumptions would be quite clarifying. To be clear, I'm not claiming that decision-makers should only communicate via safety and risk cases. But I think that relying on less formal discussion would be significantly worse.

Part of why I think this is that my intuitions have been wrong over and over again. I've often figured this out after eventually asking myself "what claims and assumptions am I making? How confident am I that these claims are correct?"

> it also feels more like it just captures "the state of fashionable AI safety thinking in 2024" more than it is the kind of thing that makes sense to enshrine into a whole methodology.

To be clear, the methodology is separate from the enumeration of arguments (which I probably could have done a better job signaling in the paper). The safety cases + risk cases methodology shouldn't depend too much on what arguments are fashionable at the moment.

I agree that the arguments will evolve to some extent in the coming years. I'm more optimistic about the robustness of the categorization, but that's maybe minor.

Comment by joshc (joshua-clymer) on New report: Safety Cases for AI · 2024-03-21T21:12:05.051Z · LW · GW

> To me, the introduction made it sound a little bit like the specifics of applying safety cases to AI systems have not been studied


This is a good point. In retrospect, I should have written a related work section to cover these. My focus was mostly on AI systems that have only existed for ~ a year and future AI systems, so I didn't spend much time reading safety cases literature specifically related to AI systems (though perhaps there are useful insights that transfer over).

> The reason the "nebulous requirements" aren't explicitly stated is that when you make a safety case you assure the safety of a system against specific relevant hazards for the system you're assuring. These are usually identified by performing a HAZOP analysis or similar. Not all AI systems have the same list of hazards, so its obviously dubious to expect you can list requirements a priori.

My impression is that there is still a precedent for fairly detailed guidelines that describe how safety cases are assessed in particular industries and how hazards should be analyzed. For example, see the UK's Safety Assessment Principles for Nuclear Facilities. I don't think anything exists like this for evaluating risks from advanced AI agents.

I agree, however, that not everyone who mentions that developers should provide 'safety evidence' should need to specify in detail what this could look like.

Comment by joshc (joshua-clymer) on New report: Safety Cases for AI · 2024-03-21T20:58:37.801Z · LW · GW

Thanks, this is good feedback!

Addressing the disagreements:

> The level of practicality you assign to some approaches is just insanely high. Neither modeling generalization nor externalized reasoning, and certainly not testbeds seem to "strongly practical" (defined as "Applies to arguments that do not require fundamental research advances, but might require"). The only one that is justifiably "strongly practical" to me is "absence of precursor abilities."

Externalized reasoning: just a capability argument, no?
Testbeds: I think the best testbed ideas rn just require training a bunch of models to have particular policies, which seems pretty straightforward. I think these are good enough to make a moderate-strength argument (which is the bar where practicality is evaluated).
Modeling generalization: I definitely think that versions of this aren't very practical, but e.g. looking at efficiency inductive biases is pretty straightforward -- and this is sufficient to make a moderate strength not-alignment-faking argument.

Strong practicality ~refers to "does there exist an argument of this type of moderate strength that can be made without fundamental research." So, lots of arguments of these types will still require fundamental research, but if some don't, then that means they are strongly practical.

> I disagree with putting testbeds and amplification in trustworthiness (at least not as strong or even medium practicality), I think they usually don't help to know if AIs can be trusted (and in particular they don't prevent any forms of alignment faking).

I agree that amplification arguments do not help to rule out alignment faking, but testbeds could help. I've read your post on protocol evaluations vs control. If I remember correctly, the main objection you had to testbeds ruling out alignment faking is that the testbed models could intentionally throw off the results. I think this is a knock-down problem. For example, consider a testbed where models have been trained to defect if they are confident they've been deployed to the real internet. Then, developers could try to tell these models from benign ones without being able to access the real internet. It is straightforward in this case to verify that the behavior of the testbed models is consistent with their intended policies (nice when not on the internet, and otherwise bad). So, the only thing the models can do to throw off the test is to change their internal states (e.g. think in simple ways when they know they could otherwise obfuscate their reasoning by thinking in convoluted ways).

> "Claim 4: AI systems are not faking alignment" is just extremely hard to assess. I think that the focus of trustworthiness assessment on "Claim 3: All enumerated causes of correlated infractions are unlikely aside from alignment faking." is somewhat unwarranted.

I agree claim 4 is hard to assess. I'm not sure what you mean by 'unwarranted' so I can't comment much on this.

> I think the description of macrosystem risks was not very convincing, aside from risks that make systems less trustworthy - so I think macrosystem should maybe have been folded into trustworthiness.

I don't think it would have made sense to fold macrosystem risks into trustworthiness. One reason for this is that some macrosystem risks relate to control / inability. e.g. models might combine their intelligence at scale and become much more capable.

Another reason for this is that I also think it is harder to assess trustworthiness arguments at larger scales. In particular, not-alignment-faking arguments claim that de-correlated misalignment is not going to be an issue. At a macrosystem level, this requires answering the question: "would it be bad if 1% of my AI population was misaligned?" This is pretty different from answering the question "if one model in X subsystem is misaligned, would that be bad?"

> The strength of almost all "high strength" measures is overrated, almost nothing is "strong" (where "strong" is defined as "Applies to arguments that could bring catastrophic risk below acceptable levels without needing other arguments."). Control is not strong, monitoring is not strong, externalized reasoning is not strong (for control and trustworthiness), ... Even with a notion of acceptable that is quite generous (e.g. 2% p(doom)) I think nothing is strong, but given the stakes I think we shouldn't use "acceptable" for anything above 1% (and ideally reserve that word for 0.1% or below).

I suspect that this disagreement might be mostly semantic. The main reason I labeled capability arguments as strong is that developers could in principle demonstrate a large margin of error. e.g. it's clear that gpt-2 is not going to cause a catastrophe. Or it could be very clear that smarter models are controlled if they are sufficiently isolated.

Strength labels indicate an upper bound. It's on me for not making that more clear.
 

Comment by joshc (joshua-clymer) on The case for ensuring that powerful AIs are controlled · 2024-01-28T05:34:38.651Z · LW · GW

Thanks, I'll summarize my takeaways in the form of an argument that control is sufficient to safely accelerate alignment research by 30x. This is certainly not the only argument one could make, but I like it because it is clear and concrete.

I'll specifically focus on justifying that the red team is competitive.

If your AI systems can accelerate AI safety research by 30x, then they are producing effectively superhuman outputs; however, this does not imply they can produce superhuman evasion plans because AI systems can't spend as much of their thinking on how to evade control measures as they can on safety research.

Suppose that the AI systems have approximately human capabilities and are sped up by 30x.

Suppose also that the human red team has spent 2 months identifying evasion strategies (call this 'coup research' following your control paper).

I'll argue that the AI systems will not be able to do more than the equivalent of 2 months of coup research. Since our red team has lots of advantages over the AI systems, this is a conservative argument that the red team is competitive.

Suppose that we only want to justify that the red team is competitive for the next two months (and we'll continuously search for attack strategies in order to maintain this 1 to 1 ratio).

Two months of deployment is equivalent to 5 years of human thinking since our AI system is sped up by 30x. We need to show that no more than 1 / 30th of this thinking is coup research.

  • Training algorithms might pressure AI systems to be at least somewhat efficient about doing useful work (and coup research isn't useful for the task at hand). Perhaps this implies a 1 / N coup research budget.
  • Furthermore, suppose the AI systems' reasoning is externalized. Monitoring setups are very likely to catch coup research if it is present at a frequency greater than 1 / M.
  • min(1 / N, 1 / M) <= 1 / 30
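
To make the arithmetic above concrete, here is a minimal sketch in Python. The function names and the example values of N and M are hypothetical placeholders, not estimates; the only numbers taken from the argument are the 2 months of red-team coup research, the 2-month deployment window, and the 30x speedup.

```python
from fractions import Fraction


def coup_research_months(deploy_months, speedup, n, m):
    """Upper bound on human-equivalent months of coup research.

    deploy_months * speedup is the total human-equivalent thinking time.
    The fraction of it that can go to coup research is capped by both the
    efficiency pressure (1 / n) and the monitoring threshold (1 / m).
    """
    total_thinking = Fraction(deploy_months) * speedup
    coup_fraction = min(Fraction(1, n), Fraction(1, m))
    return total_thinking * coup_fraction


def red_team_is_competitive(red_team_months, deploy_months, speedup, n, m):
    """True if the AI systems cannot out-research the red team."""
    return coup_research_months(deploy_months, speedup, n, m) <= red_team_months


# Hypothetical values for N and M, chosen only for illustration.
print(coup_research_months(deploy_months=2, speedup=30, n=40, m=60))
print(red_team_is_competitive(2, deploy_months=2, speedup=30, n=40, m=60))
```

With these inputs the check passes exactly when min(1 / N, 1 / M) <= 1 / 30, which is the condition in the last bullet.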

I think I buy this argument or some variation of it.

Comment by joshc (joshua-clymer) on The case for ensuring that powerful AIs are controlled · 2024-01-25T20:34:31.431Z · LW · GW

Thanks for writing this. I think it does a good job laying out the assumptions needed for control.

It seems like it could be challenging to justify that a red team is competitive  -- especially if you are aiming to extract massive amounts of work from AI systems.

For example, if your AI system is like a human sw engineer but 30X faster, then it might come up with effectively superhuman plans just because it has lots of time to think. Externalized reasoning can help, but it's not very clear to me that externalized reasoning can scale to domains where AI systems are doing a bunch of research that humans (and GPT-4) can't readily disentangle and understand. I'm imagining that chain of thought will amount to thousands of pages of technical content that is written like rambly google docs. It might be hard to pick out the AI safety research from the cyberoffense planning in this case  -- especially with dumb models like GPT-4.

It seems like to the extent you want to use AI systems for work that is hard to evaluate, control measures are difficult to trust. Maybe a key crux is whether AI systems will have to do work that is hard to evaluate to yield a 30x acceleration factor. I'd imagine this would apply for AI systems to achieve fundamental research advances -- e.g. in interpretability, formal verification, etc.

Curious about your thoughts.

Comment by joshc (joshua-clymer) on Revealing Intentionality In Language Models Through AdaVAE Guided Sampling · 2023-11-18T01:45:12.920Z · LW · GW

The results seem to be cherry picked or else perhaps I am using the code incorrectly. I'm trying to use the VAE for a separate project and the encoded vectors don't steer generations very well (or reconstruct  -- which is what I was hoping to use this for).

Comment by joshc (joshua-clymer) on The case for becoming a black-box investigator of language models · 2023-05-10T04:03:13.387Z · LW · GW

In addition to seeing more AI behavioral psychology work, I would be excited about seeing more AI developmental psychology -- i.e. studying how varying properties of training or architecture affect AI behavior. Shard theory is an example of this.

I've written a bit about the motivations for AI developmental psychology here.

Comment by joshc (joshua-clymer) on Reframing the burden of proof: Companies should prove that models are safe (rather than expecting auditors to prove that models are dangerous) · 2023-04-26T21:05:09.946Z · LW · GW

I agree with this norm, though I think it would be better to say that the "burden of evidence" should be on labs. When I first read the title, I thought you wanted labs to somehow prove the safety of their system in a conclusive way. What this probably looks like in practice is "we put x resources into red teaming and didn't find any problems." I would be surprised if 'proof' was ever an appropriate term. 

Comment by joshc (joshua-clymer) on AI alignment researchers don't (seem to) stack · 2023-03-14T06:16:36.921Z · LW · GW

The analogy between AI safety and math or physics is assumed in a lot of your writing and I think it is a source of major disagreement with other thinkers. ML capabilities clearly isn’t the kind of field that requires building representations over the course of decades.

I think it’s possible that AI safety requires more conceptual depth than AI capabilities; but in these worlds, I struggle to see how the current ML paradigm coincides with conceptual ‘solutions’ that can’t be found via iteration at the end. In those worlds, we are probably doomed so I’m betting on the worlds in which you are wrong and we must operate within the current empirical ML paradigm. It’s odd to me that you and Eliezer seem to think the current situation is very intractable, and yet you are confident enough in your beliefs that you won’t operate on the assumption that you are wrong about something in order to bet on a more tractable world.

Comment by joshc (joshua-clymer) on AI alignment researchers don't (seem to) stack · 2023-03-14T05:55:10.122Z · LW · GW

+1. As a toy model, consider how the expected maximum of a sample from a heavy tailed distribution is affected by sample size. I simulated this once and the relationship was approximately linear. But Soares’ point still holds if any individual bet requires a minimum amount of time to pay off. You can scalably benefit from parallelism while still requiring a minimum amount of serial time.
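
A minimal sketch of this kind of toy simulation, in Python. This is not the code I originally ran: the choice of a Pareto distribution, the shape parameter `alpha`, and the trial counts are all assumptions made for illustration, and how close the relationship comes to linear depends on the tail index you pick.

```python
import random
import statistics


def mean_max_of_sample(sample_size, trials=500, alpha=1.5, seed=0):
    """Estimate the expected maximum of `sample_size` heavy-tailed draws.

    Each trial draws `sample_size` values from a Pareto distribution with
    shape `alpha` (an assumed stand-in for "heavy tailed") and records the
    largest one. The estimate is the average of those maxima over trials.
    """
    rng = random.Random(seed)  # seeded so repeated runs give the same numbers
    maxima = [
        max(rng.paretovariate(alpha) for _ in range(sample_size))
        for _ in range(trials)
    ]
    return statistics.fmean(maxima)


# Compare how the estimated expected maximum grows with sample size.
for n in (10, 100, 1000):
    print(n, round(mean_max_of_sample(n), 1))
```

Here each draw plays the role of one independent bet, so the sample size stands in for how many bets are run in parallel.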

Comment by joshc (joshua-clymer) on Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover · 2023-01-08T21:48:48.981Z · LW · GW

At its core, the argument appears to be "reward maximizing consequentialists will necessarily get the most reward." Here's a counterexample to this claim: if you train a Go-playing AI with RL, you are unlikely to get a reward maximizing consequentialist. Why? There's no reason for the Go-playing AI to think about how to take over the world or hack the computer that is running the game. Thinking this way would be a waste of computation. AIs that think about how to win within the boundaries of the rules therefore do better.

In the same way, if you could robustly enforce rules like "turn off when the humans tell you to" or "change your goal when humans tell you to" etc, perhaps you end up with agents that follow these rules rather than agents that think "hmmm... can I get away with being disobedient?"

Both achieve the same reward if the rules are consistently enforced during training and I think there are weak reasons to expect deontological agents to be more likely.

Comment by joshua-clymer on [deleted post] 2022-12-08T17:39:55.255Z

Also, AI startups make AI safety resources more likely to scale with AI capabilities.

Comment by joshc (joshua-clymer) on Prizes for ML Safety Benchmark Ideas · 2022-12-04T06:28:58.976Z · LW · GW

It hasn't been canceled.

Comment by joshc (joshua-clymer) on A philosopher's critique of RLHF · 2022-11-07T03:00:24.787Z · LW · GW

I don't mean to distract from your overall point though which I take to be "a philosopher said a smart thing about AI alignment despite not having much exposure." That's useful data.

Comment by joshc (joshua-clymer) on A philosopher's critique of RLHF · 2022-11-07T02:56:08.026Z · LW · GW

I don't know why this is a critique of RLHF. You can use RLHF to train a model to ask you questions when it's confused about your preferences. To the extent to which you can easily identify when a system has this behavior and when it doesn't, you can use RLHF to produce this behavior just like you can use RLHF to produce many other desired behaviors.

Comment by joshc (joshua-clymer) on (My understanding of) What Everyone in Technical Alignment is Doing and Why · 2022-11-05T03:42:39.300Z · LW · GW

Thoughts on John's comment: this is a problem with any method for detecting deception that isn't 100% accurate. I agree that finding a 100% accurate method would be nice, but good luck.

Also, you can somewhat get around this by holding some deception detecting methods out (i.e. not optimizing against them). When you finish training and the held out methods tell you that your AI is deceptive, you start over. Then you have to try to think of another approach that is more likely to actually discourage deception than fool your held out detectors. This is the difference between gradient descent search and human design search, which I think is an important distinction.

Also, FWIW, I doubt that trojans are currently a good microcosm for detecting deception. Right now, it is too easy to search for the trigger using brute force optimization. If you ported this over to sequential-decision-making land where triggers can be long and complicated, that would help a lot. I see a lot of current trojan detection research as laying the groundwork for future research that will be more relevant.

In general, it seems better to me to evaluate research by asking "where is this taking the field/what follow-up research is this motivating?" rather than "how are the words in this paper directly useful if we had to build AGI right now?" Eventually, the second one is what matters, but until we have systems that look more like agents that plan and achieve goals in the real world, I'm pretty skeptical of a lot of the direct value of empirical research.

Comment by joshc (joshua-clymer) on My Thoughts on the ML Safety Course · 2022-10-28T20:59:33.773Z · LW · GW

From what I understand, Dan plans to add more object-level arguments soon.

Comment by joshc (joshua-clymer) on My Thoughts on the ML Safety Course · 2022-10-02T04:02:31.199Z · LW · GW

(opinions are my own)
I think this is a good review. Some points that resonated with me:
1. "The concepts of systemic safety, monitoring, robustness, and alignment seem rather fuzzy." I don't think the difference between objective and capabilities robustness is discussed but this distinction seems important. Also, I agree that Truthful AI could easily go into monitoring.
2. "Lack of concrete threat models." At the beginning of the course, there are a few broad arguments for why AI might be dangerous but not a lot of concrete failure modes. Adding more failure modes here would better motivate the material.

3. Give more clarity on how the various ML safety techniques address the alignment problem, and how they can potentially scale to solve bigger problems of a similar nature as AIs scale in capabilities
4. Give an assessment on the most pressing issues that should be addressed by the ML community and the potential work that can be done to contribute to the ML safety field

You can read more about how these technical problems relate to AGI failure modes and how they rank on importance, tractability, and crowdedness in Pragmatic AI Safety 5. I think the creators included this content in a separate forum post for a reason.

The course is intended for two audiences: people who are already worried about AI X-risk and people who are only interested in the technical content. The second group doesn't necessarily care about why each research direction relates to reducing X-risk.

Putting a lot of emphasis on this might just turn them off. It could give them the impression that you have to buy X-risk arguments in order to work on these problems (which I don't think is true) or it could make them less likely to recommend the course to others, causing fewer people to engage with the X-risk material overall.

Comment by joshua-clymer on [deleted post] 2022-09-05T05:29:54.589Z

These are good points. Maybe we'll align these things enough to where they'll give us a little hamster tank to run around in.

Comment by joshc (joshua-clymer) on (My understanding of) What Everyone in Technical Alignment is Doing and Why · 2022-09-05T05:17:35.677Z · LW · GW

PAIS #5 might be helpful here. It explains how a variety of empirical directions are related to X-Risk and probably includes many of the ones that academics are working on. 

Comment by joshua-clymer on [deleted post] 2022-08-29T02:36:17.403Z

I agree that I did not justify this claim and it is controversial in the ML community. I'll try to explain why I think this.

First of all, when I say an AI 'wants things' or is 'an agent' I just mean it robustly and autonomously brings about specific outcomes. I think all of the arguments made above apply with this definition and don't rely on anthropomorphism.

Why are we likely to build superintelligent agents?
1. It will be possible to build superintelligent agents
I'm just going to take it as a given here for the sake of time. Note: I'm not claiming it will be easy, but I don't think anyone can really say with confidence that it will be impossible to build such agents within the century.
2. There will be strong incentives to build superintelligent agents
I'm generally pretty skeptical of claims that if you just train an AI system for long enough it becomes a scary consequentialist. Instead, I think it is likely someone will build a system like this because humans want things and building a superintelligent agent that wants to do what you want pretty much solves all your problems. For example, they might want to remain technologically relevant/be able to defend themselves from other countries/extraterrestrial civilizations, expand human civilization, etc. Building a thing that 'robustly and autonomously brings about specific outcomes' would be helpful for all this stuff.

In order to use current systems ('tool AIs' as some people call them) to actually get things done in the world, a human needs to be in the loop, which is pretty uncompetitive with fully autonomous agents.

I wouldn't be super surprised if humans didn't build superintelligent agents for a while -- even if they could. Like I mentioned in the post, most people prob want to stay in control. But I'd put >50% credence on it happening pretty soon after it is possible because of coordination difficulty and the unilateralist curse.

Comment by joshc (joshua-clymer) on Some conceptual alignment research projects · 2022-08-28T03:54:59.890Z · LW · GW
Comment by joshc (joshua-clymer) on Announcing Encultured AI: Building a Video Game · 2022-08-28T03:03:47.181Z · LW · GW

I don't doubt this. I was more reporting on how the branding came across to me.

Comment by joshc (joshua-clymer) on Announcing Encultured AI: Building a Video Game · 2022-08-28T02:49:28.262Z · LW · GW

Your emphasis on how 'fun' your organization is is kind of off-putting (I noticed this on your website as well). I think it gives me the impression that you are not very serious about what you are doing. Maybe it's just me though.

Comment by joshc (joshua-clymer) on The longest training run · 2022-08-28T02:33:04.460Z · LW · GW

> This is because longer runs will be outcompeted by runs that start later and therefore use better hardware and better algorithms.

Wouldn't companies port their partially-trained models to new hardware? I guess the assumption here is that when more compute is available, actors will want to train larger models. I don't think this is obviously true because:
1. Data may be the bigger bottleneck. There was some discussion of this here. Making models larger doesn't help very much after a certain point compared with training them with more data.
2. If training runs are happening over months, there will be strong incentives to make use of previously trained models -- especially in a world where people are racing to build AGI. This could look like anything from slapping on more layers to developing algorithms that expand the model in all relevant dimensions as it is being trained. Here's a paper about progressive learning for vision transformers. I didn't find anything for NLP, but I also haven't looked very hard.

Comment by joshc (joshua-clymer) on Refining the Sharp Left Turn threat model, part 1: claims and mechanisms · 2022-08-14T18:57:37.651Z · LW · GW

> Claim 1: there is an AI system that (1) performs well ... (2) generalizes far outside of its training distribution.

Don't humans provide an existence proof of this? The point about there being a 'core' of general intelligence seems unnecessary.

Comment by joshua-clymer on [deleted post] 2022-06-25T01:20:51.647Z

Thanks!

Comment by joshua-clymer on [deleted post] 2022-06-24T18:21:44.098Z

I know that this is a common argument against amplification, but I've never found it super compelling.

People often point to evil corporations to show that unaligned behavior can emerge from aligned humans, but I don't think this analogy is very strong. Humans in fact do not share the same goals and are generally competing with each other over resources and power, which seems like the main source of inadequate equilibria to me. 

If everyone in the world was a copy of Eliezer, I don't think we would have a coordination problem around building AGI. They would probably have an Eliezer government that is constantly looking out for emergent misalignment and suggesting organizational changes to squash it. Since everyone in this world is optimizing for making AGI go well and not for profit or status among their Eliezer peers, all you have to do is tell them what the problem is and what they need to do to fix it. You don't have to threaten them with jail time or worry that they will exploit loopholes in Eliezer law. 

I think it is quite likely that I am missing something here and it would be great if you could flesh this argument out a little more or direct me towards a post that does.

Comment by joshua-clymer on [deleted post] 2022-06-24T17:45:05.646Z

That's a good point. I guess I don't expect this to be a big problem because:
1. I think 1,000,000 copies of myself could still get a heck of a lot done. 
2. The first human-level AGI might be way more creative than your average human. It would probably be trained on data from billions of humans, so all of those different ways of thinking could be latent in the model.
3. The copies can potentially diverge. I'm expecting the first transformative model to be stateful and be able to meta-learn. This could be as simple as giving a transformer read and write access to an external memory and training it over longer time horizons. The copies could meta-learn on different data and different sub-problems and bring different perspectives to the table.

Comment by joshc (joshua-clymer) on Infra-Bayesianism Distillation: Realizability and Decision Theory · 2022-06-19T00:48:07.839Z · LW · GW

Wait... I'm quite confused. In the decision rule, how is the set of environments 'E' determined? If it contains every possible environment, then this means I should behave like I am in the worst possible world, which would cause me to do some crazy things.

Also, when you say that an infra-bayesian agent models the world with a set of probability distributions, what does this mean? Does the set contain every distribution that would be consistent with the agent's observations? But isn't this almost all probability distributions? Some distributions match the data better than others, so do you weigh them according to P(observation | data generating distribution)? But then what would you do with these weights?

Sorry if I am missing something obvious. I guess this would have been clearer for me if you explained the infra-bayesian framework a little more before introducing the decision rule.

Comment by joshc (joshua-clymer) on Infra-Bayesianism Distillation: Realizability and Decision Theory · 2022-06-19T00:18:43.127Z · LW · GW

Interesting post! I'm not sure if I understand the connection between infra-bayesianism and Newcomb's paradox very well. The decision procedure you outlined in the first example seems equivalent to an evidential decision theorist placing 0 credence on worlds where Omega makes an incorrect prediction. What is the infra-bayesianism framework doing differently? It just looks like the credence distribution over worlds is disguised by the 'Nirvana trick.'

Comment by joshc (joshua-clymer) on Explaining inner alignment to myself · 2022-06-18T22:26:09.859Z · LW · GW

It's great that you are trying to develop a more detailed understanding of inner alignment. I noticed that you didn't talk about deception much. In particular, the statement below is false:

Generalization <=> accurate priors + diverse data

You have to worry about what John Wentworth calls 'manipulation of imperfect search.' Even with accurate priors and diverse data, the training process could (unless you have infinite data) still produce a deceptive agent that is able to maintain its misalignment.

Comment by joshua-clymer on [deleted post] 2022-06-18T18:50:21.377Z

I'm guessing that you are referring to this:

Another strategy is to use intermittent oversight – i.e. get an amplified version of the current aligned model to (somehow) determine whether the upgraded model has the same objective before proceeding.

The intermittent oversight strategy does depend on some level of transparency. This is only one of the ideas I mentioned though (and it is not original). The post in general does not assume anything about our transparency capabilities. 

Comment by joshua-clymer on [deleted post] 2022-06-17T04:37:52.770Z

I'm not sure I understand. We might not be on the same page.

Here's the concern I'm addressing:
Let's say we build a fully aligned human-level AGI, but we want to scale it up to superintelligence. This seems much harder to do safely than to train the human-level AGI since you need a training signal that's better than human feedback/imitation.

Here's the point I am making about that concern:
It might actually be quite easy to scale an already aligned AGI up to superintelligence -- even if you don't have a scalable outer-aligned training signal -- because the AGI will be motivated to crystallize its aligned objective.

Comment by joshua-clymer on [deleted post] 2022-06-17T03:20:09.393Z

Thanks for the thoughtful review! I think this is overall a good read of what I was saying. I agree now that redundancy would not work. 

One clarification:

The mesaobjective that was aligned to our base objective in the original setting is no longer aligned in the new setting

When I said that the 'human-level' AGI is assumed to be aligned, I meant that it has an aligned mesa-objective (corrigibly or internally) -- not that it has an objective that was functionally aligned on the training distribution, but may not remain aligned under distribution shift. I thought that internally/corrigibly aligned mesa-objectives are intent-aligned on all (plausible) distributions by definition...

Comment by joshua-clymer on [deleted post] 2022-06-16T21:53:10.712Z

Adding some thoughts that came out of a conversation with Thomas Kwa:

Gradient hacking seems difficult. Humans have pretty weak introspective access to their goals. I have a hard time determining whether my goals have changed or if I have gained information about what they are. There isn't a good reason to believe that the AIs we build will be different.

Comment by joshc (joshua-clymer) on A Bird's Eye View of the ML Field [Pragmatic AI Safety #2] · 2022-05-25T18:05:02.848Z · LW · GW

Safety and value alignment are generally toxic words, currently. Safety is becoming more normalized due to its associations with uncertainty, adversarial robustness, and reliability, which are thought respectable. Discussions of superintelligence are often derided as “not serious”, “not grounded,” or “science fiction.”

 

Here's a relevant question in the 2016 survey of AI researchers:

 

These numbers seem to conflict with what you said, but maybe I'm misinterpreting you. If there is a conflict here, do you think the results would be different if this survey were done again? Or do you think these responses do not provide an accurate impression of how researchers actually feel/felt (maybe because of agreement bias or something)?

Comment by joshc (joshua-clymer) on Biology-Inspired AGI Timelines: The Trick That Never Works · 2022-05-03T00:13:57.106Z · LW · GW

I have an objection to the point about how AI models will be more efficient because they don't need to do massive parallelization:

Massive parallelization is useful for AI models too and for somewhat similar reasons. Parallel computation allows the model to spit out a result more quickly. In the biological setting, this is great because it means you can move out of the way when a tiger jumps toward you. In the ML setting, this is great because it allows the gradient to be computed more quickly. The disadvantage of parallelization is that it means that more hardware is required. In the biological setting, this means bigger brains. Big brains are costly. They use up a lot of energy and make childbearing more difficult as the skulls need to fit through the birth canal.

In the ML setting, however, big brains are not as costly. We don't need to fit our computers in a skull. So, it is not obvious to me that ML models will do fewer computations in parallel than biological brains.

Some relevant information:

  • According to Scaling Laws for Neural Language Models, model performance depends strongly on model size but very weakly on shape (depth vs width).
  • An explanation for the above is that deep residual networks have been observed to behave like ensembles of shallow networks.
  • GPT-3 uses 96 layers (decoder blocks). That isn't very many serial computations. If a matrix multiplication, softmax, relu, or vector addition count as atomic computations, then there are 11 serial computations per layer, so that's only 1056 serial computations. It is unclear how to compare this to biological neurons as each neuron may require a number of these serial computations to properly simulate.
  • PaLM has about 3 times as many parameters as GPT-3 but only has 118 layers.
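
As a quick check on the arithmetic in the bullets above, here is a minimal sketch. The 11 operations per layer is my own rough count from above, and the parameter counts are the published sizes of the largest GPT-3 and PaLM models.

```python
# GPT-3: serial depth under the rough "11 atomic operations per layer" assumption.
gpt3_layers = 96
ops_per_layer = 11  # rough assumption, not a measured figure
serial_ops = gpt3_layers * ops_per_layer
print(serial_ops)

# PaLM vs GPT-3: parameters grow much faster than depth.
gpt3_params, palm_params = 175e9, 540e9
palm_layers = 118
param_ratio = palm_params / gpt3_params
depth_ratio = palm_layers / gpt3_layers
print(round(param_ratio, 2), round(depth_ratio, 2))
```

The parameter count grows by roughly 3x while depth grows by under 1.25x, so most of the extra computation is parallel rather than serial.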

Comment by joshc (joshua-clymer) on What 2026 looks like · 2022-04-20T18:32:11.549Z · LW · GW

Here's another milestone in AI development that I expect to happen in the next few years which could be worth noting:
I don't think any of the large language models that currently exist write anything to an external memory. You can get a chatbot to hold a conversation and 'remember' what was said by appending the dialogue to its next input, but I'd imagine this would get unwieldy if you want your language model to keep track of details over a large number of interactions. 
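
A minimal sketch of the "append the dialogue to the next input" trick, to show why it gets unwieldy. `fake_model` is a hypothetical stand-in for a language model call, not a real API.

```python
def fake_model(prompt: str) -> str:
    """Stand-in for a language model call (hypothetical, not a real API)."""
    return f"reply to {len(prompt)} chars of context"


def chat(user_messages):
    """Simulate 'memory' by prepending the whole dialogue to each new prompt."""
    history = []
    prompt_lengths = []
    for message in user_messages:
        history.append(f"User: {message}")
        prompt = "\n".join(history) + "\nBot:"
        prompt_lengths.append(len(prompt))
        history.append(f"Bot: {fake_model(prompt)}")
    return prompt_lengths


lengths = chat(["hello", "what did I just say?", "and before that?"])
print(lengths)
```

The prompt grows with every turn, so the whole conversation eventually has to be truncated or summarized to fit in the model's input.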

Fine-tuning a language model so that it makes use of a memory could lead to:
1. More consistent behavior
2. 'Mesa-learning' (it could learn things about the world from its inputs instead of just by gradient descent)

This seems relevant from a safety perspective because I can imagine 'mesa-learning' turning into 'mesa-agency.'

Comment by joshc (joshua-clymer) on ARC's first technical report: Eliciting Latent Knowledge · 2022-03-05T22:11:56.876Z · LW · GW

I'm pretty confused about the plan to use ELK to solve outer alignment. If Cakey is not actually trained, how are amplified humans accessing its world model?

"To avoid this fate, we hope to find some way to directly learn whatever skills and knowledge Cakey would have developed over the course of training without actually training a cake-optimizing AI...

  1. Use imitative generalization combined with amplification to search over some space of instructions we could give an amplified human that would let them make cakes just as delicious as Cakey’s would have been.
  2. Avoid the problem of the most helpful instructions being opaque (e.g. “Run this physics simulation, it’s great”) by solving ELK — i.e., finding a mapping from whatever possibly-opaque model of the world happens to be most useful for making superhumanly delicious cakes to concepts humans care about like “people” being “alive.”
  3. Spell out a procedure for scoring predicted futures that could be followed by an amplified human who has access to a) Cakey’s great world model, and b) the correspondence between it and human concepts of interest. We think this procedure should choose scores using some heuristic along the lines of “make sure humans are safe, preserve option value, and ultimately defer to future humans about what outcomes to achieve in the world” (we go into much more detail in Appendix: indirect normativity).
  4. Distill their scores into a reward model that we use to train Hopefully-Aligned-Cakey, which hopefully uses its powers to help humans build the utopia we want."


 

Comment by joshua-clymer on [deleted post] 2022-02-26T01:55:04.706Z

I don't think I agree that this undermines my argument. I showed that the utility function of person 1 is of the form h(x + y) where h is monotonically increasing. This respects the fact that the utility function is not unique. 2(x + y) + 1 would qualify, as would 3 log(x + y), etc.

Showing that the utility function must have this form is enough to prove total utilitarianism in this case since when you compare h(x + y) to h(x'+ y'), h becomes irrelevant. It is the same as comparing x + y to x' + y'.
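
A minimal sketch of why h drops out of the comparison, using the two example forms above. The outcome values are made up, and this checks only a few cases numerically rather than proving anything.

```python
import math


def prefers(utility, a, b):
    """True if outcome a = (x, y) is strictly preferred to outcome b under `utility`."""
    return utility(*a) > utility(*b)


def total(x, y):
    """Plain total welfare x + y."""
    return x + y


# Two monotonically increasing transforms h applied to x + y.
candidates = [
    lambda x, y: 2 * (x + y) + 1,
    lambda x, y: 3 * math.log(x + y),
]

# Made-up outcomes with distinct positive totals.
outcomes = [(1.0, 2.0), (0.5, 4.0), (3.0, 3.0)]

agree = all(
    prefers(h, a, b) == prefers(total, a, b)
    for h in candidates
    for a in outcomes
    for b in outcomes
)
print(agree)
```

Every pairwise ranking under h(x + y) matches the ranking under x + y, which is all the argument needs.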

Comment by joshua-clymer on [deleted post] 2022-02-26T01:50:02.764Z

This is a much more agreeable assumption. When I get a chance, I'll make sure it can replace the fairness one, add it to the proof, and give you credit.

Comment by joshua-clymer on [deleted post] 2022-02-26T01:47:38.248Z

I am defining it as you said. They are like movie frames that haven't been projected yet. I agree that the pre-arranged nature of the snapshots is irrelevant -- that was the point of the example (sorry that this wasn't clear).

The purpose of the example was to falsify the following hypothesis:
 "In order for a simulation to produce conscious experiences, it must compute the next state based on the previous state. It can't just 'play the simulation from memory'"

Maybe what you are getting at is that this hypothesis doesn't do justice to the intuitions that inspired it. Something complex is happening inside the brain that is analogous to 'computation on-demand.' This differentiates the computer-brain system from the stack of papers that are being moved around. 

This seems legit... I just would like to have a more precise understanding of what this 'computation on-demand' property is.

Comment by joshua-clymer on [deleted post] 2022-02-26T01:34:58.684Z

  1. If they are simulating both, then I think they are simulating a lot of things. Could they be simulating Miami on a Monday morning? If the simulation states are represented with N bits and states of Miami can be expressed with <= N bits, then I think they could be. You just need to define a one-to-one map from the sequence of bit arrays to a sensible sequence of states of Miami, which is guaranteed to exist (every bit array is unique and every state of Miami is unique). Extending this argument implies we are almost certainly in an accidental simulation.
  2. Then what determines what 'their time' is? Does the order of the pages matter? Why would their time need to correspond to our space? 
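
For point 1, here is a minimal sketch of the mapping I mean, assuming the simulation never repeats a state. The bit strings and the Miami state labels are made-up placeholders.

```python
# A run of distinct simulation states, in the order they occur (made-up values).
run = ["010", "111", "001"]

# A "sensible" sequence of distinct Miami states (hypothetical labels).
monday_morning = ["sunrise", "rush_hour", "lunch"]

# Pairing distinct items with distinct items always gives a one-to-one map.
interpretation = dict(zip(run, monday_morning))

decoded = [interpretation[state] for state in run]
print(decoded)
```

Nothing about the bit strings themselves constrains the interpretation, which is why the same run could be read as simulating almost anything.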

Comment by joshua-clymer on [deleted post] 2022-02-26T01:26:57.410Z

This looks great. I'll check it out, thanks!

Comment by joshc (joshua-clymer) on My Overview of the AI Alignment Landscape: A Bird's Eye View · 2022-02-25T02:56:09.530Z · LW · GW

This is great. I'd love to see more stuff like this.

Is anyone aware of articles like Chris Olah's views on AI Safety but for other prominent researchers?

Also, it would be great to see a breakdown of how the approaches of ARC, MIRI, Anthropic, Redwood, etc. differ. Does this exist somewhere?

Comment by joshua-clymer on [deleted post] 2022-02-24T17:35:18.774Z

I agree that the claims are doing all of the work and that this is not a convincing argument for utilitarianism. I often hear arguments for moral philosophies that make a ton of implicit assumptions. I think that once you make them explicit and actually try to be rigorous, the argument always seems less impressive and less convincing.

Comment by joshua-clymer on [deleted post] 2022-02-24T17:28:40.972Z

Good point, I overlooked this.