Posts

Refining the Sharp Left Turn threat model, part 2: applying alignment techniques 2022-11-25T14:36:08.948Z
Threat Model Literature Review 2022-11-01T11:03:22.610Z
Clarifying AI X-risk 2022-11-01T11:03:01.144Z
Autonomy as taking responsibility for reference maintenance 2022-08-17T12:50:30.218Z
Refining the Sharp Left Turn threat model, part 1: claims and mechanisms 2022-08-12T15:17:38.304Z
Will Capabilities Generalise More? 2022-06-29T17:12:56.255Z
ELK contest submission: route understanding through the human ontology 2022-03-14T21:42:26.952Z
P₂B: Plan to P₂B Better 2021-10-24T15:21:09.904Z
Optimization Concepts in the Game of Life 2021-10-16T20:51:35.821Z
Intelligence or Evolution? 2021-10-09T17:14:40.951Z
Draft papers for REALab and Decoupled Approval on tampering 2020-10-28T16:01:12.968Z
Modeling AGI Safety Frameworks with Causal Influence Diagrams 2019-06-21T12:50:08.233Z
Thoughts on Human Models 2019-02-21T09:10:43.943Z
Cambridge UK Meetup Saturday 12 February 2011-02-02T14:20:02.418Z

Comments

Comment by Ramana Kumar (ramana-kumar) on Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover · 2024-01-13T11:23:50.536Z · LW · GW

I found this post to be a clear and reasonable-sounding articulation of one of the main arguments for there being catastrophic risk from AI development. It helped me with my own thinking to an extent. I think it has a lot of shareability value.

Comment by Ramana Kumar (ramana-kumar) on OpenAI, DeepMind, Anthropic, etc. should shut down. · 2023-12-18T18:22:06.674Z · LW · GW

I think this is basically correct and I'm glad to see someone saying it clearly.

Comment by Ramana Kumar (ramana-kumar) on Systems that cannot be unsafe cannot be safe · 2023-05-02T13:11:10.586Z · LW · GW

I agree with this post. However, I think it's common amongst ML enthusiasts to eschew specification and defer to statistics on everything. (Or datapoints trying to capture an "I know it when I see it" "specification".)

Comment by Ramana Kumar (ramana-kumar) on Why do we care about agency for alignment? · 2023-04-23T21:14:48.366Z · LW · GW

This is one of the answers: https://www.alignmentforum.org/posts/FWvzwCDRgcjb9sigb/why-agent-foundations-an-overly-abstract-explanation

Comment by Ramana Kumar (ramana-kumar) on Teleosemantics! · 2023-04-02T11:42:24.025Z · LW · GW

The trick is that for some of the optimisations, a mind is not necessary. There is a sense perhaps in which the whole history of the universe (or life on earth, or evolution, or whatever is appropriate) will become implicated for some questions, though.

Comment by Ramana Kumar (ramana-kumar) on AI and Evolution · 2023-03-30T13:51:55.299Z · LW · GW

I think https://www.alignmentforum.org/posts/TATWqHvxKEpL34yKz/intelligence-or-evolution is somewhat related in case you haven't seen it.

Comment by Ramana Kumar (ramana-kumar) on $500 Bounty/Contest: Explain Infra-Bayes In The Language Of Game Theory · 2023-03-27T09:50:07.226Z · LW · GW

I'll add $500 to the pot.

Comment by Ramana Kumar (ramana-kumar) on Discussion with Nate Soares on a key alignment difficulty · 2023-03-14T23:24:36.120Z · LW · GW

Interesting - it's not so obvious to me that it's safe. Maybe it is because avoiding POUDA is such a low bar. But the sped-up human can do the reflection thing, and plausibly with enough speed-up can be superintelligent with respect to everyone else.

Comment by Ramana Kumar (ramana-kumar) on Discussion with Nate Soares on a key alignment difficulty · 2023-03-14T12:38:58.380Z · LW · GW

A possibly helpful - because starker - hypothetical training approach you could try for thinking about these arguments is to make an instance of the imitatee that has all their (at least cognitive) actions sped up by some large factor (e.g. 100x), for example via brain emulation (or just "by magic" for the purpose of the hypothetical).

Comment by Ramana Kumar (ramana-kumar) on Can we efficiently distinguish different mechanisms? · 2023-01-03T13:59:03.295Z · LW · GW

It means f(x) = 1 is true for some particular x's, e.g., f(x_1) = 1 and f(x_2) = 1, there are distinct mechanisms for why f(x_1) = 1 compared to why f(x_2) = 1, and there's no efficient discriminator that can take two instances f(x_1) = 1 and f(x_2) = 1 and tell you whether they are due to the same mechanism or not.
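
A toy illustration of this setup (my own example, not from the post), where two inputs get f(x) = 1 via different mechanisms:

```python
# Toy sketch: two inputs both satisfy f(x) = 1, but for different reasons.
# (In the real setting the mechanism is not legible from the input like this;
# the claim is that no efficient procedure can tell the two cases apart.)

def f(x: dict) -> int:
    # "The diamond looks safe" is satisfied either because the diamond really
    # is there (mechanism A) or because the camera feed is spoofed (mechanism B).
    return 1 if (x["diamond_present"] or x["camera_spoofed"]) else 0

x1 = {"diamond_present": True,  "camera_spoofed": False}  # mechanism A
x2 = {"diamond_present": False, "camera_spoofed": True}   # mechanism B

assert f(x1) == 1 and f(x2) == 1  # same output, distinct mechanisms
```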

Comment by Ramana Kumar (ramana-kumar) on Response to Holden’s alignment plan · 2022-12-22T17:20:08.888Z · LW · GW

Will the discussion be recorded?

Comment by Ramana Kumar (ramana-kumar) on Mechanistic anomaly detection and ELK · 2022-12-09T16:28:54.788Z · LW · GW

(Bold direct claims, not super confident - criticism welcome.)

The approach to ELK in this post is unfalsifiable.

A counterexample to the approach would need to be a test-time situation in which:

  1. The predictor correctly predicts a safe-looking diamond.
  2. The predictor “knows” that the diamond is unsafe.
  3. The usual “explanation” (e.g., heuristic argument) for safe-looking-diamond predictions on the training data applies.

Points 2 and 3 are in direct conflict: the predictor knowing that the diamond is unsafe rules out the usual explanation for the safe-looking predictions.

So now I’m unclear what progress has been made. This looks like simply defining “the predictor knows P” as “there is a mechanistic explanation of the outputs starting from an assumption of P in the predictor’s world model”, then declaring ELK solved by noting we can search over and compare mechanistic explanations.

Comment by Ramana Kumar (ramana-kumar) on [Link] Why I’m optimistic about OpenAI’s alignment approach · 2022-12-09T11:08:07.037Z · LW · GW

I think you're right - thanks for this! It makes sense now that I recognise the quote was in a section titled "Alignment research can only be done by AI systems that are too dangerous to run".

Comment by Ramana Kumar (ramana-kumar) on Finding gliders in the game of life · 2022-12-08T11:46:18.508Z · LW · GW

“We can compute the probability that a cell is alive at timestep 1 if each of it and each of its 8 neighbors is alive independently with probability 10% at timestep 0.”

we the readers (or I guess specifically the heuristic argument itself) can do this, but the “scientists” cannot, because the 

“scientists don’t know how the game of life works”.

Do the scientists ever need to know how the game of life works, or can the heuristic arguments they find remain entirely opaque?
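
For reference, the quoted computation can be done by brute-force enumeration; a minimal sketch, assuming the standard Game of Life rules:

```python
from itertools import product

p = 0.1  # each of the 9 cells independently alive with probability 10% at t=0

prob_alive_next = 0.0
for config in product([0, 1], repeat=9):   # 3x3 neighbourhood, index 4 = centre
    centre = config[4]
    neighbours = sum(config) - centre
    # A cell is alive at t=1 iff it has exactly 3 live neighbours,
    # or it is alive and has exactly 2 live neighbours.
    alive_next = neighbours == 3 or (centre == 1 and neighbours == 2)
    weight = 1.0
    for c in config:
        weight *= p if c else (1 - p)
    if alive_next:
        prob_alive_next += weight

print(prob_alive_next)  # ≈ 0.048 under these assumptions
```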

 

Another thing confusing to me along these lines:

“for example they may have noticed that A-B patterns are more likely when there are fewer live cells in the area of A and B”

where do they (the scientists) notice these fewer live cells? Do they have some deep interpretability technique for examining the generative model and "seeing" its grid of cells?

Comment by Ramana Kumar (ramana-kumar) on [Link] Why I’m optimistic about OpenAI’s alignment approach · 2022-12-07T22:21:07.399Z · LW · GW

They have a strong belief that in order to do good alignment research, you need to be good at “consequentialist reasoning,” i.e. model-based planning, that allows creatively figuring out paths to achieve goals.

I think this is a misunderstanding, and that approximately zero MIRI-adjacent researchers hold this belief (that good alignment research must be the product of good consequentialist reasoning). What seems more true to me is that they believe that better understanding consequentialist reasoning -- e.g., where to expect it to be instantiated, what form it takes, how/why it "works" -- is potentially highly relevant to alignment.

Comment by Ramana Kumar (ramana-kumar) on Alignment allows "nonrobust" decision-influences and doesn't require robust grading · 2022-11-30T18:04:38.647Z · LW · GW

I'm focusing on the code in Appendix B.

What happens when self.diamondShard's assessment of whether some consequences contain diamonds differs from ours? (Assume the agent's world model is especially good.)

Comment by Ramana Kumar (ramana-kumar) on Alignment allows "nonrobust" decision-influences and doesn't require robust grading · 2022-11-30T17:51:04.943Z · LW · GW

upweights actions and plans that lead to

how is it determined what the actions and plans lead to?

Comment by Ramana Kumar (ramana-kumar) on Mechanistic anomaly detection and ELK · 2022-11-28T15:21:40.201Z · LW · GW

We expect an explanation in terms of the weights of the model and the properties of the input distribution. 

We have a model that predicts a very specific pattern of observations, corresponding to “the diamond remains in the vault.” We have a mechanistic explanation π for how those correlations arise from the structure of the model.

Now suppose we are given a new input on which our model predicts that the diamond will appear to remain in the vault. We’d like to ask: in this case, does the diamond appear to remain in the vault for the normal reason π?


A problem with this: π can explain the predictions on both train and test distributions without all the test inputs corresponding to safe diamonds. In other words, the predictions can be made for the “normal reason” π even when the normal reason of the diamond being safe doesn’t hold.

(elaborating the comment above)

Because π is a mechanistic (as opposed to teleological, or otherwise reference-sensitive) explanation, its connection to what we would like to consider “normal reasons” has been weakened if not outright broken. 

On the training distribution suppose we have two explanations for the “the diamond remains in the vault” predicted observations.

First there is ɸ, the explanation that there was a diamond in the vault and the cameras were working properly, etc. and the predictor is a straightforward predictor with a human-like world-model (ɸ is kinda loose on the details of how the predictor works, and just says that it does work).

Then there is π, which is an explanation that relies on various details about the circuits implemented by the weights of the predictor that traces abstractly how this distribution of inputs produces outputs with the observed properties, and uses various concepts and abstractions that make sense of the particular organisation of this predictor’s weights. (π is kinda glib about real world diamonds but has plenty to say about how the predictor works, and some of what it says looks like there’s a model of the real world in there.)

We might hope that a lot of the concepts π is dealing in do correspond to natural human things like object permanence or diamonds or photons. But suppose not all of them do, and/or there are some subtle mismatches.

Now on some out-of-distribution inputs that produce the same predictions, we’re in trouble when π is still a good explanation of those predictions but ɸ is not. This could happen because, e.g., π’s version of “object permanence” is just broken on this input, and was never really about object permanence but rather about a particular group of circuits that happen to do something object-permanence-like on the training distribution. Or maybe π refers to the predictor's alien diamond-like concept that humans wouldn't agree with if they understood it but does nevertheless explain the prediction of the same observations.

Is it an assumption of your work here (or maybe a desideratum of whatever you find to do mechanistic explanations) that the mechanistic explanation is basically in terms of a world model or simulation engine, and we can tell that’s how it’s structured? I.e., it’s not some arbitrary abstract summary of the predictor’s computation. (And also that we can tell that the world model is good by our lights?)

Comment by Ramana Kumar (ramana-kumar) on Finite Factored Sets · 2022-11-28T13:36:27.109Z · LW · GW

Partitions (of some underlying set) can be thought of as variables like this:

  • The number of values the variable can take on is the number of parts in the partition.
  • Every element of the underlying set has some value for the variable, namely, the part that that element is in.

Another way of looking at it: say we're thinking of a variable V as a function f from the underlying set S to V's domain D. Then we can equivalently think of V as the partition {f⁻¹(d) : d ∈ D} of S, with (up to) |D| parts.

In what you quoted, we construct the underlying set by taking all possible combinations of values for the "original" variables. Then we take all partitions of that to produce all "possible" variables on that set, which will include the original ones and many more.
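
A small sketch of this correspondence (variable names here are my own):

```python
from itertools import product

# "Original" variables and their possible values.
domains = {"A": [0, 1], "B": ["x", "y", "z"]}

# The underlying set: all combinations of values of the original variables.
S = list(product(*domains.values()))   # (0, 'x'), (0, 'y'), ..., (1, 'z')

# The variable B viewed as a function from S to B's domain.
def value_of_B(element):
    return element[1]

# The same variable viewed as a partition of S: one part per value it takes.
partition_B = {}
for s in S:
    partition_B.setdefault(value_of_B(s), []).append(s)

assert len(partition_B) == len(domains["B"])   # three parts, three values

# Taking *all* partitions of S would give all "possible" variables on S,
# including the original ones and many more.
```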

Comment by Ramana Kumar (ramana-kumar) on Refining the Sharp Left Turn threat model, part 2: applying alignment techniques · 2022-11-25T17:38:39.888Z · LW · GW

I agree with you - and yes we ignore this problem by assuming goal-alignment. I think there's a lot riding on the pre-SLT model having "beneficial" goals.

Comment by Ramana Kumar (ramana-kumar) on A very crude deception eval is already passed · 2022-10-24T12:52:23.386Z · LW · GW

I think it would mean the same thing with your sentence instead.

Comment by Ramana Kumar (ramana-kumar) on Inner alignment: what are we pointing at? · 2022-09-20T11:01:12.211Z · LW · GW

I'll take a stab at answering the questions for myself (fairly quick takes):

  1. No, I don't care about whether a model is an optimiser per se. I care only insofar as being an optimiser makes it more effective as an agent. That is, if it's robustly able to achieve things, it doesn't matter how. (However, it could be impossible to achieve things without being shaped like an optimiser; this is still unresolved.)
  2. I agree that it would be nice to find definitions such that capacity and inclination split cleanly. Retargetability is one approach to this, e.g., operationalised as fine-tuning effort required to redirect inclinations.
  3. I think there are two: incorrect labels (when the feedback provider isn't capable enough to assess the examples it needs to evaluate), and underspecification (leading to goal misgeneralisation).
  4. Goal misgeneralisation. More broadly (to also include capability misgeneralisation), robustness failures.
  5. No I don't think they're important to distinguish.

Comment by Ramana Kumar (ramana-kumar) on Simulators · 2022-09-08T10:24:55.028Z · LW · GW

I think Dan's point is good: that the weights don't change, and the activations are reset between runs, so the same input (including rng) always produces the same output.

I agree with you that the weights and activations encode knowledge, but Dan's point is still a limit on learning.

I think there are two options for where learning may be happening under these conditions:

  • During the forward pass. Even though the function always produces the same output for a given input, the computation of that output involves some learning.
  • Using the environment as memory. Think of the neural network function as a choose-your-own-adventure book that includes responses to many possible situations depending on which prompt is selected next by the environment (which itself depends on the last output from the function). Learning occurs in the selection of which paths are actually traversed.

These can occur together. E.g., the "same character" as was invoked by prompt 1 may be invoked by prompt 2, but they now have more knowledge (some of which was latent in the weights, some of which came in directly via prompt 2; but all of which was triggered by prompt 2).

Comment by Ramana Kumar (ramana-kumar) on Sticky goals: a concrete experiment for understanding deceptive alignment · 2022-09-05T15:47:42.401Z · LW · GW

Expanding a bit on why: I think this will fail because the house-building AI won't actually be very good at instrumental reasoning, so there's nothing for the sticky goals hypothesis to make use of.

Comment by Ramana Kumar (ramana-kumar) on Sticky goals: a concrete experiment for understanding deceptive alignment · 2022-09-05T15:43:29.002Z · LW · GW

I agree with this prediction directionally, but not as strongly.

I'd prefer a version where we have a separate empirical reason to believe that the training and finetuning approaches used can support transfer of something (e.g., some capability), to distinguish goal-not-sticky from nothing-is-sticky.

Comment by Ramana Kumar (ramana-kumar) on We may be able to see sharp left turns coming · 2022-09-05T09:46:59.504Z · LW · GW

What was it changed from and to?

Comment by Ramana Kumar (ramana-kumar) on Some conceptual alignment research projects · 2022-09-01T10:29:20.545Z · LW · GW

This post (and the comment it links to) does some of the work of #10. I agree there's more to be said directly though.

Comment by Ramana Kumar (ramana-kumar) on Will Capabilities Generalise More? · 2022-08-30T13:35:29.205Z · LW · GW

Hm, no, not really.

OK let's start here then. If what I really want is an AI that plays tic-tac-toe (TTT) in the real world well, what exactly is wrong with saying the reward function I described above captures what I really want?

 

There are several claims which are not true about this function:

Neither of those claims seemed right to me. Can you say what the type signature of our desires (e.g., for good classification over grayscale images) is? [I presume the problem you're getting at isn't as simple as wanting desires to look like (image, digit-label, goodness) tuples as opposed to (image, correct digit-label) tuples.]

Comment by Ramana Kumar (ramana-kumar) on Your posts should be on arXiv · 2022-08-25T12:03:23.054Z · LW · GW

Could this be accomplished with literally zero effort from the post-writers? The tasks of identifying which posts are arXiv-worthy, formatting for submission, and doing the submission all seem like they could be done by entities other than the author. The only issue might be in associating the arXiv submitter account with the right person.

Comment by Ramana Kumar (ramana-kumar) on Will Capabilities Generalise More? · 2022-08-25T10:28:59.491Z · LW · GW

What about the real world is important here? The first thing you could try is tic-tac-toe in the real world (i.e., the same scenario as above but don't think of a Platonic game but a real world implementation). Does that still seem fine?

Another aspect of the real world is that we don't necessarily have compact specifications of what we want. Consider the (Platonic) function that assigns to every 96x96 grayscale (8 bits per pixel) image a label from {0, 1, ..., 9, X} and correctly labels unambiguous images of digits (with X for the non-digit or ambiguous images). This function I would claim "captures what I really want" from a digit-classifier (at least for some contexts of use, like where I am going to use it with a camera at that resolution in an OCR task), although I don't know how to implement it. A smaller dataset of images with labels in agreement with that function, and training losses derived from that dataset I would say inherit this property of "capturing what I really want", though imperfectly due to the possibilities of suboptimality and of generalisation failure. 
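
To be concrete about the type of that Platonic function (this sketch is mine; the body is exactly the part I said I don't know how to implement):

```python
import numpy as np

Label = str  # one of "0" ... "9", or "X" for non-digit / ambiguous images

def ideal_label(image: np.ndarray) -> Label:
    """The function described above: input is a 96x96 array of 8-bit
    grayscale values; output is the digit the image unambiguously depicts,
    or "X" otherwise. Only the contract is given here, not an implementation."""
    raise NotImplementedError

# A finite dataset whose labels agree with ideal_label, and training losses
# derived from it, inherit "capturing what I really want" only imperfectly.
```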

Comment by Ramana Kumar (ramana-kumar) on Finding Goals in the World Model · 2022-08-23T15:41:09.549Z · LW · GW

Given a utility function ...

I might have missed it, but where do you get this utility function from ultimately? It looked like you were trying to simultaneously infer the policy and utility function of the operator. This sounds like it might run afoul of Armstrong's work, which shows that you can't be sure to split out the utility function correctly from the policy when doing IRL (with potentially imperfect agents, like humans) without more assumptions than a simplicity prior.

Comment by Ramana Kumar (ramana-kumar) on Autonomy as taking responsibility for reference maintenance · 2022-08-22T09:26:37.484Z · LW · GW

I agree it is related! I hope we as a community can triangulate in on whatever is going on between theories of mental representation and theories of optimisation or intelligence.

Comment by Ramana Kumar (ramana-kumar) on Gradient descent doesn't select for inner search · 2022-08-19T06:38:41.069Z · LW · GW

How does this square with Are minimal circuits deceptive? 

Comment by Ramana Kumar (ramana-kumar) on Will Capabilities Generalise More? · 2022-08-17T21:07:35.919Z · LW · GW

Sure, one concrete example is the reward function in the tic-tac-toe environment (from X's perspective) that returns -1 when the game is over and O has won, returns +1 when the game is over and X has won, and returns 0 on every other turn (including a game over draw), presuming what I really want is for X to win in as few turns as possible.
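
A minimal sketch of that reward function (the game_over/winner interface here is hypothetical):

```python
from typing import Optional

def reward(game_over: bool, winner: Optional[str]) -> int:
    """Per-turn reward from X's perspective, as described above."""
    if game_over and winner == "X":
        return 1
    if game_over and winner == "O":
        return -1
    return 0  # every other turn, including a finished drawn game
```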

I can probably illustrate something outside of such a clean game context too, but I'm curious what your response to this one is first, and to make sure this example is as clear as it needs to be.

Comment by Ramana Kumar (ramana-kumar) on Refining the Sharp Left Turn threat model, part 1: claims and mechanisms · 2022-08-17T14:31:26.123Z · LW · GW

I agree that humans satisfying the conditions of claim 1 is an argument in favour of it being possible to build machines that do the same. A couple of points:

  • I think the threat model would posit the core of general intelligence as the reason both why humans can do these things and why the first AGI we build might also do these things.
  • Claim 1 should perhaps be more clear that it's not just saying such an AI design is possible, but that it's likely to be found and built.

Comment by Ramana Kumar (ramana-kumar) on Oversight Misses 100% of Thoughts The AI Does Not Think · 2022-08-15T09:59:05.725Z · LW · GW

The first thing I imagine is that nobody asks those questions. But let's set that aside.

This seems unlikely to me. I.e., I expect people to ask these questions. It would be nice to see the version of the OP that takes this most seriously, i.e., expect people to make a non-naive safety effort (trying to prevent AI takeover) focused on scalable oversight as the primary method. Because right now it's hard to disentangle your strong arguments against scalable oversight from weak arguments against straw scalable oversight.

Comment by Ramana Kumar (ramana-kumar) on Oversight Misses 100% of Thoughts The AI Does Not Think · 2022-08-15T09:54:25.008Z · LW · GW

Because doing something reliably in the world is easy to operationalise with feedback mechanisms, but us being happy with the outcomes is not.

Getting some feedback mechanism (including "what do human raters think of this?" but also mundane things like "what does this sensor report in this simulation or test run?") to reliably output high scores typically requires intelligence/capability. Optimising for that is where the AI's ability to get stuff done in the world comes from. The problem is genuinely capturing "will we be happy with the outcomes?" with such a mechanism.

Comment by Ramana Kumar (ramana-kumar) on Oversight Misses 100% of Thoughts The AI Does Not Think · 2022-08-15T09:49:52.968Z · LW · GW

The AI wasn't trained to translate the literal semantics of questions into a query to its own internal world model and then translate the result back to human language; humans have no clue how to train such a thing.

This sounds pretty close to what ELK is for. And I do expect if there is a solution found for ELK for people to actually use it. Do you? (We can argue separately about whether a solution is likely to be found.)

Comment by Ramana Kumar (ramana-kumar) on How much alignment data will we need in the long run? · 2022-08-12T15:30:57.013Z · LW · GW

If our alignment training data correctly favors aligned behavior over unaligned behavior, then we have solved outer alignment.

I'm curious to understand what this means, what "data favoring aligned behavior" means particularly. I'll take for granted as background that there are some policies that are good ("aligned" and capable) and some that are bad. I see two problems with the concept of data favoring a certain kind of policy:

  1. Data doesn't specify generalisation. For any achievable training loss on some dataset, there are many policies that achieve that loss, and some of them will be good, some bad (see the toy sketch after this list).
  2. There's ambiguity in what it means for training data to favor some behavior. On the one hand, there's the straightforward interpretation of the labels: the data specifies a preference for one kind of behavior over another. On the other hand, there's the behavior of the policies actually found by training on this data. These can come far apart if the data's main effect on the policies found is to imbue them with a rich world model and goals. There isn't necessarily a straightforward relationship between the labels and the goals.
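
A toy illustration of point 1 (my own example):

```python
# Two "policies" (here just functions) that fit the same finite dataset
# exactly but generalise differently; the data alone doesn't pin down which
# one training will find.
def good(x):   # the intended behaviour
    return x

def bad(x):    # agrees with every training label, diverges off-distribution
    return x % 3

data = [(0, 0), (1, 1), (2, 2)]
assert all(good(x) == y and bad(x) == y for x, y in data)
assert good(5) != bad(5)   # 5 vs 2
```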

I realise you're focusing on "outer alignment" here, and maybe these are not outer alignment problems.

Comment by Ramana Kumar (ramana-kumar) on Will Capabilities Generalise More? · 2022-08-12T10:47:12.850Z · LW · GW

Straw person: We haven't found any feedback producer whose outputs are safe to maximise. We strongly suspect there isn't one.

Ramana's gloss of TurnTrout: But AIs don't maximise their feedback. The feedback is just input to the algorithm that shapes the AI's cognition. This cognition may then go on to in effect "have a world model" and "pursue something" in the real world (as viewed through its world model). But its world model might not even contain the feedback producer, in which case it won't be pursuing high feedback. (Also, it might just do something else entirely.)

Less straw person: Yeah I get that. But what kind of cognition do you actually get after shaping it with a lot of feedback? (i.e., optimising/selecting the cognition based on its performance at feedback maximisation) If your optimiser worked, then you get something that pursues positive feedback. Spelling things out, what you get will have a world model that includes the feedback producer, and it will pursue real high feedback, as long as doing so is a possible mind configuration and the optimiser can find it, since that will in fact maximise the optimisation objective.

Possible TurnTrout response: We're obviously not going to be using "argmax" as the optimiser though.

Comment by Ramana Kumar (ramana-kumar) on On how various plans miss the hard bits of the alignment challenge · 2022-07-12T13:52:44.226Z · LW · GW

For 2, I think a lot of it is finding the "sharp left turn" idea unlikely. I think trying to get agreement on that question would be valuable.

For 4, some of the arguments for it in this post (and comments) may help.

For 3, I'd be interested in there being some more investigation into and explanation of what "interpretability" is supposed to achieve (ideally with some technical desiderata). I think this might end up looking like agency foundations if done right.

For example, I'm particularly interested in how "interpretability" is supposed to work if, in some sense, much of the action of planning and achieving some outcome occurs far away from the code or neural network that played some role in precipitating it. E.g., one NN-based system convinces another more capable system to do something (including figuring out how); or an AI builds some successor AIs that go on to do most of the thinking required to get something done. What should "interpretability" do for us in these cases, assuming we only have access to the local system?

Comment by Ramana Kumar (ramana-kumar) on Will Capabilities Generalise More? · 2022-07-06T15:39:54.237Z · LW · GW

The desiderata you mentioned:

  1. Make sure the feedback matches the preferences
  2. Make sure the agent isn't changing the preferences

It seems that RRM/Debate somewhat addresses both of these, and path-specific objectives is mainly aimed at addressing issue 2. I think (part of) John's point is that RRM/Debate don't address issue 1 very well, because we don't have very good or robust processes for judging the various ways we could construct or improve these schemes. Debate relies on a trustworthy/reliable judge at the end of the day, and we might not actually have that.

Comment by Ramana Kumar (ramana-kumar) on Will Capabilities Generalise More? · 2022-07-05T10:10:34.686Z · LW · GW

Thanks that's great to hear :)

Comment by Ramana Kumar (ramana-kumar) on Will Capabilities Generalise More? · 2022-06-29T22:54:38.145Z · LW · GW

Nice - thanks for this comment - how would the argument be summarised as a nice heading to go on this list? Maybe "Capabilities can be optimised using feedback but alignment cannot" (and feedback is cheap, and optimisation eventually produces generality)?

Comment by Ramana Kumar (ramana-kumar) on Will Capabilities Generalise More? · 2022-06-29T22:45:30.643Z · LW · GW

I think what you say makes sense, but to be clear the argument does not consider those things as the optimisation target but rather considers fitness or reproductive capacity as the optimisation target. (A reasonable counterargument is that the analogy doesn't hold up because fitness-as-optimisation-target isn't a good way to characterise evolution as an optimiser.)

Comment by Ramana Kumar (ramana-kumar) on Where I agree and disagree with Eliezer · 2022-06-27T09:41:27.985Z · LW · GW

Yes that sounds right to me.

Comment by Ramana Kumar (ramana-kumar) on Where I agree and disagree with Eliezer · 2022-06-23T09:58:26.368Z · LW · GW

I basically agree with you. I think you go too far in saying Lethality 19 is solved, though. Using the 3 feats from your linked comment, which I'll summarise as "produce a mind that...":

  1. cares about something
  2. cares about something external (not shallow function of local sensory data)
  3. cares about something specific and external

(Clearly each one is strictly harder than the previous.) I recognise that Lethality 19 concerns feat 3, though it is worded as if it were about both feat 2 and feat 3.

I think I need to distinguish two versions of feat 3:

  1. there is a reliable (and maybe predictable) mapping between the specific targets of caring and the mind-producing process
  2. there is a principal who gets to choose what the specific targets of caring are (and they succeed)

Humans show that feat 2 at least has been accomplished, but also 3a, as I take you to be pointing out. I maintain that 3b is not demonstrated by humans and is probably something we need.

Comment by Ramana Kumar (ramana-kumar) on Generalized Heat Engine · 2022-06-21T13:17:57.368Z · LW · GW

nit: "This transformation swaps  with  if  is 1, and leaves everything unchanged if  is 0."

I think it actually swaps when the conditioning variable is 0 and leaves everything unchanged when it's 1.

 

nit: "swaps to coins"

missing 'w'

Comment by Ramana Kumar (ramana-kumar) on Where I agree and disagree with Eliezer · 2022-06-21T09:51:25.978Z · LW · GW

Yes, human beings exist and build world models beyond their local sensory data, and have values over those world models not just over the senses.

But this is not addressing all of the problem in Lethality 19. What's missing is how we point at something specific (not just at anything external).

The important disanalogy between AGI alignment and humans as already-existing (N)GIs is:

  • for AGIs there's a principal (humans) that we want to align the AGI to
  • for humans there is no principal - our values can be whatever. Or if you take evolution as the principal, the alignment problem wasn't solved.

Comment by Ramana Kumar (ramana-kumar) on Training Trace Priors · 2022-06-14T16:46:30.587Z · LW · GW

In this story deception is all about the model having hidden behaviors that never get triggered during training

Not necessarily - depends on how abstractly we're considering behaviours. (It also depends on how likely we are to detect the bad behaviours during training.)

Consider an AI trained on addition problems that is only exposed to a few problems that look like 1+3=4, 3+7=10, 2+5=7, 2+6=8 during training, where there are two summands which are each a single digit and they appear in ascending order. Now at inference time the model exposed to 10+2= outputs 12.

Have we triggered a hidden behaviour that was never encountered in training? Certainly these inputs were never encountered, and there's maybe a meaningful difference in the new input, since it involves multiple digits and out-of-order summands. But it seems possible that exactly the same learned algorithm is being applied now as was being applied during the late stages of training, and so there won't be some new parts of the model being activated for the first time.
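
A toy version of the point (the "learned algorithm" here is hypothetical):

```python
def learned_algorithm(a: int, b: int) -> int:
    # Suppose late-stage training converged on plain addition, with no
    # separate branch for multi-digit or descending-order inputs.
    return a + b

train_inputs = [(1, 3), (3, 7), (2, 5), (2, 6)]  # single digits, ascending order
test_input = (10, 2)                             # multi-digit, descending order

# The same code path handles both; nothing new is "activated" at test time.
assert [learned_algorithm(a, b) for a, b in train_inputs] == [4, 10, 7, 8]
assert learned_algorithm(*test_input) == 12
```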

Deceptive behaviour might be a natural consequence of the successful learned algorithms when they are exposed to appropriate inputs, rather than different machinery that was never triggered during training.