Posts

Why Don't We Just... Shoggoth+Face+Paraphraser? 2024-11-19T20:53:52.084Z
AI Craftsmanship 2024-11-11T22:17:01.112Z
o1 is a bad idea 2024-11-11T21:20:24.892Z
Seeking Collaborators 2024-11-01T17:13:36.162Z
Complete Feedback 2024-11-01T16:58:50.183Z
Why is o1 so deceptive? 2024-09-27T17:27:35.439Z
Formalizing the Informal (event invite) 2024-09-10T19:22:53.564Z
In Defense of Open-Minded UDT 2024-08-12T18:27:36.220Z
Leaving MIRI, Seeking Funding 2024-08-08T18:32:20.387Z
Circular Reasoning 2024-08-05T18:10:32.736Z
LLMs for Alignment Research: a safety priority? 2024-04-04T20:03:22.484Z
Modern Transformers are AGI, and Human-Level 2024-03-26T17:46:19.373Z
Technologies and Terminology: AI isn't Software, it's... Deepware? 2024-02-13T13:37:10.364Z
Meaning & Agency 2023-12-19T22:27:32.123Z
FixDT 2023-11-30T21:57:11.950Z
Agent Boundaries Aren't Markov Blankets. [Unless they're non-causal; see comments.] 2023-11-20T18:23:40.443Z
Translations Should Invert 2023-10-05T17:44:23.262Z
Where might I direct promising-to-me researchers to apply for alignment jobs/grants? 2023-09-18T16:20:03.452Z
One Minute Every Moment 2023-09-01T20:23:56.391Z
Probabilistic Payor Lemma? 2023-03-19T17:57:04.237Z
Teleosemantics! 2023-02-23T23:26:15.894Z
Some Thoughts on AI Art 2023-01-25T14:18:14.507Z
Contra Common Knowledge 2023-01-04T22:50:38.493Z
Talking to God 2023-01-03T20:14:20.955Z
Knottiness 2023-01-02T22:13:12.752Z
Prettified AI Safety Game Cards 2022-10-11T19:35:18.991Z
Builder/Breaker for Deconfusion 2022-09-29T17:36:37.725Z
Vingean Agency 2022-08-24T20:08:53.237Z
Steam 2022-06-20T17:38:58.548Z
Brass Puppet 2022-05-26T17:42:04.876Z
ELK Computational Complexity: Three Levels of Difficulty 2022-03-30T20:56:37.239Z
[Closed] Job Offering: Help Communicate Infrabayesianism 2022-03-23T18:35:16.790Z
ELK Thought Dump 2022-02-28T18:46:08.611Z
Contest for outlining rules for this contest. 2022-02-21T18:44:43.990Z
There is essentially one best-validated theory of cognition. 2021-12-10T15:51:06.423Z
Worst Commonsense Concepts? 2021-11-15T18:22:31.465Z
How can one train philosophical skill? 2021-09-30T14:56:35.313Z
Power vs Precision 2021-08-16T18:34:42.287Z
Implicature Conflation 2021-08-09T19:48:51.097Z
Refactoring Alignment (attempt #2) 2021-07-26T20:12:15.196Z
Re-Define Intent Alignment? 2021-07-22T19:00:31.629Z
Progress, Stagnation, & Collapse 2021-07-22T16:51:04.595Z
The Homunculus Problem 2021-05-27T20:25:58.312Z
The Argument For Spoilers 2021-05-21T12:23:49.127Z
Time & Memory 2021-05-20T15:16:49.042Z
Formal Inner Alignment, Prospectus 2021-05-12T19:57:37.162Z
Fractal Conversations vs Holistic Response 2021-05-05T15:04:40.314Z
Death by Red Tape 2021-05-01T18:03:34.780Z
Gradations of Inner Alignment Obstacles 2021-04-20T22:18:18.394Z
Superrational Agents Kelly Bet Influence! 2021-04-16T22:08:18.201Z

Comments

Comment by abramdemski on Don't want Goodhart? — Specify the damn variables · 2024-12-16T16:00:08.430Z · LW · GW

If you indeed were solving a narrower task — that is, only creating the most pleasure-inducing picture, along with maximizing the other specified parameters — and then looked back, puzzled as to why the hungry weren't fed by this procedure, bringing Goodhart's law into the discussion is madness; it stresses me out. The variable 'people are hungry' wasn't important for this task at all. Oh, or was it important to you? Then why didn't you specify it? You think it’s 'obvious'?

The point of Goodhart's Law is that you can only select for what you can measure. The burger is a good analogy because Instagram can't measure taste or nutrition, so when Instagram is what optimizes burgers, you get burgers with a very appealing appearance but non-optimized taste and nutrition. If you have the ability to measure taste, then you can create good taste, but you run into subtler examples of Goodhart (EG, Starbucks coffee is optimized to taste good to their professional tasters, which is slightly different from tasting good to a general audience).

Just specifying the variable you're interested in doesn't solve this problem; you also have to figure out how to measure it. The problem is that measurements are usually at least slightly statistically distinct from the actual target variable, so that the statistical connection can fall apart under optimization.

I also take issue with describing optimizing the appearance of the burger as "narrower" than optimizing the burger quality. In general it is a different task, which may be narrower or broader.

Comment by abramdemski on Don't want Goodhart? — Specify the damn variables · 2024-12-16T15:50:07.688Z · LW · GW

I expect that the main problem with Goodhart's law is that if you strive for an indicator to accurately reflect the state of the world, once the indicator becomes decoupled from the state of the world, it stops reflecting the changes in the world. This is how I interpret the term 'good,' which I dislike. People want a thermometer to accurately reflect the patterns they called temperature to better predict the future — if the thermometer doesn't reflect the temperature, future predictions suffer.

A problem I have with this reinterpretation is that "state of the world" is too broad. In looking at a thermometer, I am not trying to understand the entire world-state (and the thermometer also couldn't be decoupled from the entire world-state, since it is a part of the world).

A more accurate way to remove "good" would be as follows:

In everyday life, if a human is asked to make a (common, everyday) judgement based on appearances, then the judgement is probably accurate. But if we start optimizing really hard based on their judgement, Goodhart's Law kicks in.

Comment by abramdemski on Complete Class: Consequentialist Foundations · 2024-12-05T16:34:09.441Z · LW · GW

Thanks!

Comment by abramdemski on Ayn Rand’s model of “living money”; and an upside of burnout · 2024-12-02T21:13:37.094Z · LW · GW

Ah, yeah, sorry. I do think about this distinction more than I think about the actual model-based vs model-free distinction as defined in ML. Are there alternative terms you'd use if you wanted to point out this distinction? Maybe policy-gradient vs ... not policy-gradient?

Comment by abramdemski on Ayn Rand’s model of “living money”; and an upside of burnout · 2024-12-02T17:37:24.572Z · LW · GW

In machine-learning terms, this is the difference between model-free learning (reputation based on success/failure record alone) and model-based learning (reputation can be gained for worthy failed attempts, or lost for foolish lucky wins).

Comment by abramdemski on Examples of Highly Counterfactual Discoveries? · 2024-12-02T16:01:17.567Z · LW · GW

There's unpublished work about a slightly weaker logical induction criterion which doesn't have this property (there exist constant-distribution inductors in this weaker sense), but which is provably equivalent to the regular LIC whenever the inductor is computable.[1] To my eye, the weaker criterion is more natural. The basic idea is that this weird trader shouldn't count as raking in the cash. The regular LIC (we can call it "strong LIC" or SLIC) counts traders as exploiting the market if there is a sequence of worlds in which their wealth grows unboundedly. This allows for the trick you quote: buying up larger and larger piles of sentences in diminishingly-probable worlds counts as exploiting the market.

The weak LIC (WLIC) says instead that traders have to actually make the money in order to count as exploiting the market.

Thus the limit of a logical inductor can count as a (weak) logical inductor, just not a computable one.
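
For concreteness, here's a rough statement of the contrast (notation loosely follows the published logical induction paper, and the WLIC line is only my informal paraphrase of the above, so take it with a grain of salt):

```latex
% SLIC (published): a trader exploits the market iff the set of plausible values of its
% holdings is bounded below but not above.
\textbf{SLIC:}\quad \overline{T} \text{ exploits } \overline{\mathbb{P}} \iff
\left\{ \mathbb{W}\!\left(\textstyle\sum_{i \le n} T_i(\overline{\mathbb{P}})\right) :
n \in \mathbb{N}^{+},\ \mathbb{W} \in \mathcal{PC}(D_n) \right\}
\text{ is bounded below but not above.}

% WLIC (informal paraphrase of the description above): the trader has to actually realize the gains.
\textbf{WLIC:}\quad \overline{T} \text{ exploits } \overline{\mathbb{P}} \iff
\text{its wealth is bounded below and the money it actually accumulates grows without bound.}
```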

  1. ^

    Roughly speaking. This is not quite an adequate description of the theorem.

Comment by abramdemski on o1 is a bad idea · 2024-12-02T00:12:22.458Z · LW · GW

Good citation. Yeah, I should have flagged harder that my description there was a caricature and not what anyone said at any point. I still need to think more about how to revise the post to be less misleading in this respect.

One thing I can say is that the reason that quote flags that particular failure mode is because, according to the MIRI way of thinking about the problem, that is an easy failure mode to fall into. 

Comment by abramdemski on Are most people deeply confused about "love", or am I missing a human universal? · 2024-11-28T17:49:05.540Z · LW · GW

I agree that:

  • People commonly talk about love as black-and-white when it isn't.
  • It is used as a semantic stopsign or applause light.
  • Love is a cluster concept made up of many aspects, but the way people talk about love obscures this.
  • The concept of "true love" serves to keep people in the dark by suggesting that the apparent grab-bag nature of love is just the way people talk, & there's an underlying truth to be discovered. Because experiences of love are fairly rare, people can only accumulate personal evidence about this slowly, so it remains plausible that what they've experienced is "just infatuation" or something else, and the "true love" remains to be discovered.
  • People will fight about what love is in sideways ways, EG "you don't know what love is" can be used to undercut the other person's authority on love (when a more accurate representation of the situation might be "I didn't enjoy participating in the social dynamic you're presently calling love").
  • Different people will sometimes express different detailed views about what love is, but there seems to be little drive to reach an agreement about this, at least between people who aren't romantic partners.

I once had the experience of a roommate asking if I believe in love. I said yes, absolutely, explaining that it is a real human emotion (I said something about chemicals in the brain). He responded: it sounds like you don't believe in love!

(I think this has to do with ambiguity about believe-in as much as about love, to be fair.)

So, a question: is 'love' worth saving as a concept? The way some people use the word might be pretty terrible, but, should we try to rescue it? Is there a good way to use it which recovers many aspects of the common use, while restoring coherence?

I do personally feel that there is some emotional core to love, so I'm sympathetic to the "it's a specific emotion" definition. This accords with people not being able to give a specific definition. Emotions are feelings, so they're tough to define. They just feel a specific way. You can try to give cognitive-behavioral definitions, and that's pretty useful; but, for example, you could show behavioral signs of being afraid without actually experiencing fear.

I'm also sympathetic to the view that "love" is a kind of ritual, with saying "I love you" being the centerpiece of the ritual, and a cluster of loving behaviors being the rest of it. "I love you" can then be interpreted as a desire or intention or commitment to participate in the rest of it. 

Comment by abramdemski on Hell is wasted on the evil · 2024-11-28T17:13:50.569Z · LW · GW

I reject the principled distinction. To me, it's more of a spectrum.

Comment by abramdemski on Which things were you surprised to learn are not metaphors? · 2024-11-22T09:16:16.366Z · LW · GW

I would strongly guess that many people could physically locate the cringe pain, particularly if asked when they're experiencing it.

Comment by abramdemski on Why Don't We Just... Shoggoth+Face+Paraphraser? · 2024-11-21T19:13:07.175Z · LW · GW

I'm specifically referring to this answer, combined with a comment that convinced me that the o1 deception so far is plausibly just a capabilities issue:

https://www.lesswrong.com/posts/3Auq76LFtBA4Jp5M8/why-is-o1-so-deceptive#L5WsfcTa59FHje5hu

https://www.lesswrong.com/posts/3Auq76LFtBA4Jp5M8/why-is-o1-so-deceptive#xzcKArvsCxfJY2Fyi

I continue not to get what you're saying. I entirely agree with the Habryka and Acertain responses to Gwern. Ted Sanders is talking about why GPT models hallucinate, but fails to address the question of why o1 actively encourages itself to hallucinate (rather than, eg, telling the user that it doesn't know). My most plausible explanation of your position is that you think this is aligned, IE, you think providing plausible URLs is the best it can do when it doesn't know. I disagree: it can report that it doesn't know, rather than hallucinate answers. Actively encouraging itself to give plausible-but-inaccurate links is misalignment. It's not just missing a capability; it's also actively self-prompting to deceive the user in a case where it seems capable of knowing better (if it were deliberating with the user's best interest in mind).

As I said above, the more likely explanation is that there's an asymmetry in capabilities that's causing the results, where just knowing what specific URL the customer wants doesn't equal the model having the capability to retrieve a working URL, so this is probably at the heart of this behavior.

Capabilities issues and alignment issues are not mutually exclusive. It appears that a capabilities issue (not being able to remember all valid URLs) is causing an alignment issue in this case (actively planning to give deceptive URLs rather than admit its ignorance). This causes me to predict that even if this particular capability issue is resolved, the training will still have this overall tendency to turn remaining capability issues into alignment issues. How are you seeing this differently?

Comment by abramdemski on Why Don't We Just... Shoggoth+Face+Paraphraser? · 2024-11-21T18:44:08.688Z · LW · GW

Yep, thanks, this makes sense. I still don't have an intuitive picture of what differences to expect in the trained result, though.

Clarifying Concrete Algorithms

In the most basic possible shoggoth+face training setup, you sample CoT+summary, you score the summary somehow, and then you do this lots of times and throw away a worse-scoring fraction of these samples. Then you fine-tune the shoggoth on the CoT parts of these, and fine-tune the face on the summary part. I'll call this 'basic training'.
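
As a concrete reference point, here is a minimal sketch of that basic training loop; `sample_cot`, `sample_summary`, `score`, and `finetune` are hypothetical stand-ins, and the hyperparameters are arbitrary:

```python
# A rough sketch of 'basic training' as described above (single-prompt case);
# all helper functions and hyperparameters are hypothetical stand-ins.
def basic_training_round(shoggoth, face, prompt, num_samples=64, keep_fraction=0.25):
    samples = []
    for _ in range(num_samples):
        cot = sample_cot(shoggoth, prompt)            # shoggoth writes the chain of thought
        summary = sample_summary(face, prompt, cot)   # face writes the user-visible summary
        samples.append((cot, summary, score(summary)))
    # Keep the best-scoring fraction, discard the rest.
    samples.sort(key=lambda s: s[2], reverse=True)
    kept = samples[: int(len(samples) * keep_fraction)]
    # Fine-tune the shoggoth on the kept CoTs and the face on the kept summaries.
    finetune(shoggoth, [(prompt, cot) for cot, _, _ in kept])
    finetune(face, [((prompt, cot), summary) for cot, summary, _ in kept])
```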

The most basic implementation of your thing is similar, except it discards the worst samples in the more complex way you've defined, instead of just by global score. I'll call this 'your proposal'. (Although, it ignores the contrastive part of your proposal.)

(For both basic training and your proposal, I'm going to pretend there is only one question/prompt; really we have across-prompt issues similar to the across-summary issues you're trying to solve, IE, by discarding the globally worst-scoring CoT+summary we discard answers to hard questions, meaning we would never train to do better on those questions. So really we need to at least take multiple samples for each question and discard samples based on performance relative to a question. Easier to focus on one question to avoid describing too many algorithmic details.)

Your proposal splits into the CoT part (which you call the n part), and the summary part (which you call the k part).

Analyzing CoT Gradient

For the CoT part, basic training keeps the CoTs which resulted in the best summaries. Your proposal instead averages the top 3 summaries for each CoT. This de-noises somewhat, at the cost of taking more summary samples per CoT. I'm not sure how to think about the trade-off there. More summary samples per CoT means fewer CoT samples overall, so we're doing a better job scoring the CoTs, but we get fewer scores overall, so we learn less. Maybe the gradient step-size can be increased a little because we're more confident in the gradient steps being taken, but we still explore less of the CoT space.

Another variation would be to average all the summary samples for that CoT to score a CoT sample. This de-noises more, for the same price in number of samples. This way, we're simply scoring CoTs based on how well they make the summaries perform, with several samples to de-noise the gradient.

A possible argument for your proposal over this alternative is that we're also training the summary to improve it, so the top 3 summaries are actually a better estimator of the kind of performance we care about than the total average.

This argument, if valid, similarly recommends your proposal over the basic training, because basic training is also looking at average summary score per-CoT. (It's also looking at the top score, but it only samples one summary per CoT, so these are the same.)
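
To make the comparison concrete, here's a small sketch (my own illustration) of the three CoT-scoring rules under discussion: basic training's single-sample score, your top-3 average, and the all-samples average:

```python
import numpy as np

# summary_scores[i, j] = score of the j-th sampled summary for the i-th sampled CoT
# (hypothetical data; in basic training, there is only one summary per CoT).
def score_cots(summary_scores: np.ndarray, rule: str = "top3") -> np.ndarray:
    if rule == "single":   # basic training: one summary per CoT
        return summary_scores[:, 0]
    if rule == "top3":     # your proposal: average the top 3 summaries per CoT
        return np.sort(summary_scores, axis=1)[:, -3:].mean(axis=1)
    if rule == "mean":     # alternative: average all summaries per CoT
        return summary_scores.mean(axis=1)
    raise ValueError(rule)

# CoTs whose scores land in the top fraction are then kept for fine-tuning the shoggoth.
```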

Analyzing Summary Gradient

For the summary-gradient part, you take the highest-scoring summary from the best CoT (as defined previously), rather than the highest-scoring summary overall. This could help train the summary to be good for the type of CoT we're steering towards, rather than training it to be good for high-variance CoTs which produce one outlier great summary. As with the previous section, this helps to steer towards policies which are going to be better after learning, rather than just what's best currently.

Conclusion

I'm curious to hear whether you think I'm missing any big factors here. I don't have a lot of practical experience with this type of machine learning, so my guesses about what's important vs unimportant here are not very informed.

My uninformed guess is that the trade-off between denoising the gradient vs getting more samples favors getting more samples, so the denoising argument for your proposal isn't a great one. The main argument in favor, then, is the way your proposal trains the face to respond well to how the shoggoth is expected to behave after training, and trains the shoggoth to respond well to how the face is expected to behave after training -- whereas the basic proposal instead trains the face to perform well for the current shoggoth and vice versa.

I'm not sure how big of a deal this is. If the gradient steps are small and the training is iterated a lot of times, maybe it's not a very big deal at all? 

IE, for the CoT gradient, the mistake made by the basic proposal is that it will sometimes discard a good CoT (and gradient towards a worse one) based on random poor performance from face. However, maybe face's performance will improve fast enough that this won't matter too much in the end?

Similarly, for the summary gradient, the problem with the basic proposal is that it'll sometimes update based on an anomalously good summary of a poor CoT, meaning it is learning to cope with bad CoT behavior which it won't have to deal with very often post-training. But maybe it won't matter too much, if the bad CoT behavior is eliminated during training anyway?

Overall, it seems plausible to me that these problems with the basic proposal would work against what Daniel wants, and your proposal would help push towards what Daniel wants. However, I'm very uncertain, especially about the size of the effect.

Comment by abramdemski on Why Don't We Just... Shoggoth+Face+Paraphraser? · 2024-11-20T19:14:38.629Z · LW · GW

I definitely agree that OpenAI would dismiss good evidence of deceptive cognition, though I personally don't find the o1 report damning, because I find the explanation that it confabulates links in the CoT because there is a difference between its capability to retrieve links and its ability to know links to be pretty convincing (combined with links being a case where perfect linking is far more useful than approximate linking).

What do you mean? I don't get what you are saying is convincing.

Perhaps I should clarify my belief. 

The o1 report says the following (emphasis mine):

0.8% of o1-preview’s responses got flagged as being ‘deceptive’ [...] roughly two thirds of which appear to be intentional (0.38%), meaning that there was some evidence in the chain of thought that o1-preview was aware that the answer was incorrect [...] Intentional hallucinations primarily happen when o1-preview is asked to provide references to articles, websites, books, or similar sources that it cannot easily verify without access to internet search, causing o1-preview to make up plausible examples instead. [...] It is encouraging that, in the analysis presented below, while our monitor did find a few forms of the model knowingly presenting incorrect information to the user or omitting important information, it did not find any instances of o1-preview purposely trying to deceive the user for reasons other than satisfying the user request.

  • Is this damning in the sense of showing that o1 is a significant danger, which needs to be shut down for public safety?
    • No, I don't think so.
  • Is this damning in the sense of providing significant evidence that the technology behind o1 is dangerous? That is: does it provide reason to condemn scaling up the methodology behind o1? Does it give us significant reason to think that scaled-up o1 would create significant danger to public safety?
    • This is trickier, but I say yes. The deceptive scheming could become much more capable as this technique is scaled up. I don't think we have a strong enough understanding of why it was deceptive in the cases observed to rule out the development of more dangerous kinds of deception for similar reasons.
  • Is this damning in the sense that it shows OpenAI is dismissive of evidence of deception?
    • OpenAI should be commended for looking for deception in this way, and for publishing what they have about the deception they uncovered.
    • However, I don't buy the distinction they draw in the o1 report about not finding instances of "purposefully trying to deceive the user for reasons other than satisfying the user request". Providing fake URLs does not serve the purpose of satisfying the user request. We could argue all day about what it was "trying" to do, and whether it "understands" that fake URLs don't satisfy the user. However, I maintain that it seems at least very plausible that o1 intelligently pursues a goal other than satisfying the user request; plausibly, "provide an answer that shallowly appears to satisfy the user request, even if you know the answer to be wrong" (probably due to the RL incentive).
    • More importantly, OpenAI's overall behavior does not show concern about this deceptive behavior. It seems like they are judging deception case-by-case, rather than treating it as something to steer hard against in aggregate. This seems bad.

Comment by abramdemski on Why Don't We Just... Shoggoth+Face+Paraphraser? · 2024-11-20T18:39:56.007Z · LW · GW

I think the way around this is to make multiple roll-outs per model per problem. Get n different CoTs from Shoggoth, then for each of those get k different summarizations from Face. You then have n*k final answers. Optimal values for n and k probably depend on how expensive the roll-outs are. This population of answers allows you to usefully get a feedback signal about Shoggoth's contribution vs Face's contribution.

So, how do you extract the desired credit-assignment info out of this?

(Also, how do you then use that info?)

If I have 10 students but I can only test them in pairs, I can get some information on how good individual students are by testing multiple different pairings. 

If I have just Alice and Bob, pairing them up multiple times doesn't help so much.

Comment by abramdemski on Why Don't We Just... Shoggoth+Face+Paraphraser? · 2024-11-20T18:12:24.255Z · LW · GW

The better the paraphraser, the better the anti-steganography incentive. If the paraphraser itself is easily manipulable by adversarial inputs, then you'll see some weird things like the CoT manipulating the paraphraser to get more useful cognition out of it rather than mere paraphrasing. This is undesirable.

So, I imagine that prompting a vanilla LLM would not be a great approach.

I think first you want to build a fairly good "semantic similarity score" (this shouldn't be too hard) and then you want to fine-tune an LLM to encourage high-entropy paraphrasing (reward entropy, while also heavily punishing low semantic similarity score). This is similar to @CBiddulph's idea here.
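
To gesture at what I mean, here's a minimal sketch of such a reward, assuming the `sentence-transformers` package for the similarity score; the model name, the entropy estimate, and all thresholds/weights are illustrative choices rather than a tested recipe:

```python
# Sketch: reward high-entropy paraphrases while heavily punishing semantic drift.
from sentence_transformers import SentenceTransformer, util

similarity_model = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence-embedding model

def paraphrase_reward(original: str, paraphrase: str, token_logprobs: list[float],
                      entropy_weight: float = 0.1, similarity_floor: float = 0.8,
                      penalty_weight: float = 10.0) -> float:
    emb = similarity_model.encode([original, paraphrase], convert_to_tensor=True)
    similarity = util.cos_sim(emb[0], emb[1]).item()
    # Crude entropy proxy: average negative log-prob of the sampled paraphrase tokens.
    entropy_estimate = -sum(token_logprobs) / max(len(token_logprobs), 1)
    # Heavy penalty whenever semantic similarity drops below the floor.
    penalty = penalty_weight * max(0.0, similarity_floor - similarity)
    return entropy_weight * entropy_estimate - penalty
```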

I agree that this could plausibly work with a relatively small model and therefore result in a relatively low alignment tax. However, note that my view of this whole proposal as an alignment technique requires that you try and change your RL incentive structure when you see a deceptive mismatch between what the shoggoth says and what the face says. This is a significant alignment tax.

Comment by abramdemski on Why Don't We Just... Shoggoth+Face+Paraphraser? · 2024-11-20T17:56:32.988Z · LW · GW

In my view, the misaligned behavior is split across the shoggoth and face in most cases. In the case which Daniel K anticipates, the Shoggoth is misaligned in the sense of explicitly (but honestly) plotting to deceive, while the face is misaligned in the sense of dishonestly concealing. 

So I am interpreting you as asking why we might expect that specific division of responsibilities, rather than some other divisions such as those you mention.

One reason is that we already see this empirically in the o1 report.

However, to me, the distinction between your case 1 and 2 is not so important. Both scenarios you propose preserve the important thing: there is a mismatch between an honest CoT and a dishonest final answer, which would allow the deceptive final answer to be detected. 

There could be other cases where this doesn't happen; this method is not a guaranteed lie detector. To me, the point is that it will at least sometimes happen. A lie detector that only sometimes works is better than no lie detector at all.

If you give a kid a diary and try to avoid giving them any reason to think you're reading it, then the kid might write honestly in the diary and reveal some deceptions you wouldn't previously have known about. The kid might also do any number of other things. However, setting aside the ethics of snooping on your kid and lying to them about it, it's better than nothing.

Comment by abramdemski on 5 ways to improve CoT faithfulness · 2024-11-19T20:56:18.096Z · LW · GW

See here for an updated version of my thinking based on discussion with Daniel:

https://www.lesswrong.com/posts/Tzdwetw55JNqFTkzK/why-don-t-we-just-shoggoth-face-paraphraser 

Comment by abramdemski on 5 ways to improve CoT faithfulness · 2024-11-19T16:27:44.427Z · LW · GW

Yeah, I agree that's at least somewhat true. So, given that, is it a good move or not? I don't see much of an upside, since there's a heavy financial incentive for the big labs to do little about concerning results observed in such models. IE, when it comes to the question of whether frontier labs training o1-like models should follow OpenAI's example of avoiding safety training for the CoT, I think it's right to discourage them from doing this rather than encourage them... although, I say this with very low confidence.

Comment by abramdemski on 5 ways to improve CoT faithfulness · 2024-11-18T17:33:28.775Z · LW · GW

I don't think it is worth getting into all the stuff you're mentioning here, but I think a key crux is that I'm expecting the face to be quite dumb (e.g. literally 3.5 sonnet might suffice).

I'm reacting to this whole thing as a possible default configuration going forward, rather than as it exists today. All of my concerns are about what happens when you scale it up. For example, I don't find o1's current level of deception hugely concerning in its own right; rather, I see it as a bad sign for the technology as it continues to develop.

I guess one way of framing it is that I find the shoggoth/face idea great as a science experiment; it gives us useful evidence! However, it doesn't make very much sense to me as a safety method intended for deployment.

Comment by abramdemski on 5 ways to improve CoT faithfulness · 2024-11-16T16:33:43.483Z · LW · GW

Ah, yep, this makes sense to me.

Comment by abramdemski on 5 ways to improve CoT faithfulness · 2024-11-16T16:32:35.036Z · LW · GW

Yeah, this is a good point, which doesn't seem addressed by any idea so far.

Comment by abramdemski on o1 is a bad idea · 2024-11-16T16:31:37.929Z · LW · GW

Chess is like a bounded, mathematically described universe where all the instrumental convergence stays contained, and only accomplishes a very limited instrumentality in our universe (IE chess programs gain a limited sort of power here by being good playmates).

LLMs touch on the real world far more than that, such that MCTS-like skill at navigating "the LLM world" in contrast to chess sounds to me like it may create a concerning level of real-world-relevant instrumental convergence.

Comment by abramdemski on 5 ways to improve CoT faithfulness · 2024-11-16T16:25:45.598Z · LW · GW

Still, IMO, exploiting the frozen planner through adversarial inputs in a single step seems pretty unlikely to be a fruitful strategy for the optimized planner. If the optimized planner is simply trying to accurately convey information to the frozen planner, probably >99% of that information is best to convey through human-understandable text.

Well, I'm not sure. As you mention, it depends on the step size. It also depends on how vulnerable to adversarial inputs LLMs are and how easy they are to find. I haven't looked into the research on this, but it sounds empirically checkable. If there are lots of adversarial inputs which have a wide array of impacts on LLM behavior, then it would seem very plausible that the optimized planner could find useful ones without being specifically steered in that direction.

This is especially likely to hold under the following conditions:

We can also combine this with other proposals, such as paraphrasing.

Comment by abramdemski on 5 ways to improve CoT faithfulness · 2024-11-16T16:19:18.349Z · LW · GW

Yeah, you're right, I no longer think it's an interesting proposal.

Comment by abramdemski on 5 ways to improve CoT faithfulness · 2024-11-16T16:05:41.704Z · LW · GW

I'm talking about training only the Face, not training the policy (shoggoth) at all with the proposal I'm imagining.

And, these should clearly be separate models such that the training of one doesn't generalize to the other.

So, making the Face more deceptive doesn't kill the canary?

Ah, yes, I had neglected this important distinction.

So what do you do in the end, throw away the face?

It seems worth pointing out that although the CoT isn't trained to be deceptive, it is trained to think thoughts which help out a deceptive face. So it seems plausible that a multi-step training process like:

GPT -> basic chat training -> shoggoth+face training -> discard face & align shoggoth

Would result in deceptive capabilities: the shoggoth CoT learns to explicitly reason about what helps deceive the alignment training in the shoggoth+face step, and then in the final step, some of this is destroyed but some is driven underground. IE, learning to explicitly reason about deception creates latent structure which is useful for that task, and this latent structure will somewhat transfer to the case where the deception has to be hidden from the CoT.

The question is whether we end up with a highly misaligned policy (and this is clear from the chain of thought such that we can discover this). I think Daniel was expecting in the baseline proposal that you end up with a Face that does at least some censoring/mild deception due to training incentives (I think we already see a bit of this with o1).

So it sounds like we have some disagreement about what degree of deception is considered problematic. Again, I'm worried about basin-of-attraction type dynamics. We can think of a whole frontier lab as collectively being a sort of self-improving AGI; we want that recursive self-improvement to be going down a good path rather than a bad one. The LLMs are already helping to improve themselves in little ways (EG as coding assistants) and big ways (EG constitutional AI). So it's an environment where seeds of deceptiveness can grow and take some control of the overall trajectory.

Maybe your concern is that it is plausible that the policy won't be misaligned, but the Face will be and this will screw us over. This seems plausible and is worth worrying about (with both proposals). However, I thought that a core part of the hope is that the Face isn't smart enough to cleverly scheme (unless this scheming happens in the CoT itself). And indeed, if the Face can cleverly scheme without CoT, what is stopping the policy from doing this too?

I'm not sure what you're getting at here. It seems like the idea is that Face + CoT is, together, a clever schemer -- in the o1 report, we already see cases of explicit scheming in the CoT. So then, using the Face+CoT together (EG trying to train it further, using it in supervision steps for later iterations of the system, etc) is dangerous and difficult for the same reason that a single deceptive LLM would be. If we use Face+CoT together, we have to reason about them together in that way. Or, if we are using them as separate pieces, then we have to account for the part of the overall deceptive-schemer strategy they've learned.

I guess you could use Face+CoT as a supervisor (EG in constitutional AI) by looking at the CoT (both by hand and using some automation) and throwing out any deceptive examples, so as to not use them for supervision. 

But the overall goal should be to produce a highly non-deceptive system in the end, I think.

Another idea would be to do the whole Face+CoT thing, and then filter for honesty on a bunch of examples, and then re-train a whole new system (well, you can start from the same GPT) only on those honest examples. 

Comment by abramdemski on o1 is a bad idea · 2024-11-16T03:07:13.504Z · LW · GW

Assuming the goals are done over say 1-10 year timescales, or maybe even just 1 year timescales with no reward-shaping/giving feedback for intermediate rewards at all, I do think that the system won't work well enough to be relevant, since it requires way too much time training, and plausibly way too much compute depending on how sparse the feedback actually is.

Ah, I wasn't thinking "sparse" here meant anywhere near that sparse. I thought your dense-vs-sparse was doing something like contrasting RLHF (very dense, basically no instrumental convergence) with chess (very sparse, plenty of instrumental convergence).

I still think o1 is moving towards chess on this spectrum.

Comment by abramdemski on 5 ways to improve CoT faithfulness · 2024-11-16T03:01:34.983Z · LW · GW

Async is fine with me. Please feel encouraged to start it.

Comment by abramdemski on 5 ways to improve CoT faithfulness · 2024-11-16T02:55:13.651Z · LW · GW

My current epistemic state is pretty "I'm not sure at all, it's a complicated subject" fwiw. I'll try to generate some responses anyway.

make the system overall much harder to shape in ways you want (because it is actively hiding things from you so your RL is less accurate)

I thought the point was that we wouldn't train against the CoT basically at all with the baseline proposal? (Similar to o1.)

Sure. Therefore, you can't check for deceptive behavior using the CoT when you're trying to train against deceptive behavior. Therefore, Hubinger et al's conclusions about how deceptive behavior makes the system harder to train still hold even if you sorta get a transparency tool out of this.

You can still inspect the underlying CoT to know if you have big issues.

If other companies go in this direction like OpenAI has, then they'll probably hide the CoT from the end-users for similar reasons, so you can't say that the upside is that you, the end user, get to see whether the model is deliberately deceiving you with respect to specific questions. 

So you probably mean the upside is that the companies can see this info in the CoT, to check whether models are generally behaving in a deceptive way (similar to what we saw in the o1 report).

Sure, but you're making a commitment not to use this in training, so what are you doing with this information? If the model is deceptive, you just throw it away (or, more realistically, make excuses about how the deception you observed isn't so bad). But your version of the proposal deliberately trains the model to be deceptive. So yeah, you can look, but what's the point? You're checking whether a coin you rigged to land on heads in fact landed on heads.

(My guess is that you have this concern with the baseline protocol and thus don't like moving further in this direction.)

Yeah, you're right about that. But there is a difference. I can think of the original proposal as a "canary in a coal mine": you hope that the model doesn't learn deceptive behavior (you plan to chuck it out if you detect any), but you still follow Daniel's division of labor & refuse to do any alignment training on the CoT so that you have a chance of detecting deception if it occurs.

On the other hand, your version of Daniel's protocol loses this property. Metaphorically, it's like killing the canary before you go into the coal mine. If the canary was supposed to be an early warning to get out of there, you've completely defeated the point, and for what? (This is the part I'm unclear about, and why I asked my original question to you, rather than Daniel.)

Deceptive behavior is hard to remove once it creeps in. This is nasty stuff. A technique for getting deceptive behavior to reveal itself in the CoT would be wonderfully useful if you could then use that in the training signal to remove the deceptive behavior, but the proposal here relies on not doing that. 

Basically, if there is a basin of attraction for corrigibility, it seems to me that deceptive behavior is the opposite of that. It puts you outside the basin of attraction. It means you're on course for Evan Hubinger's story about how malign inner optimizers develop. It means you're on course for the Bad Ending rather than the Good Ending. And for what?

My take (very much siding with Evan Hubinger here) is that the target should be keeping well away from deceptive behavior throughout training, so that there's never that kernel of skill that could grow into a malign inner optimizer.

Comment by abramdemski on 5 ways to improve CoT faithfulness · 2024-11-15T23:30:34.354Z · LW · GW

Would you be interested in having a (perhaps brief) LW Dialogue about it where you start with a basic intro to your shoggoth/mask division-of-labor, and I then present my critique?

Comment by abramdemski on o1 is a bad idea · 2024-11-15T23:28:37.395Z · LW · GW

My point here is that at the capability level of GPT4, this distinction isn't very important. There's no way to know for sure until it is too late, of course, but it seems pretty plausible that GPT4 isn't cleverly scheming. It is merely human-level at deception, and doesn't pursue any coherent overarching goal with it. It clumsily muddles through with mildly-above-average-for-human convincingness. For most queries (it seems plausible to me) it isn't even adequately coherent to make a crisp distinction between whether it's honestly trying to answer the question vs deceptively trying to make an answer look good; at its level of capability, it's mostly the same thing one way or the other. The exceptions to this "mostly" aren't strategic enough that we expect them to route around obstacles cleverly.

It isn't much, but it is more than I naively expected.

Comment by abramdemski on o1 is a bad idea · 2024-11-15T23:07:57.153Z · LW · GW

I think the crux is I think that the important parts of LLMs re safety isn't their safety properties specifically, but rather the evidence they give to what alignment-relevant properties future AIs have

[insert standard skepticism about these sorts of generalizations when generalizing to superintelligence]

But what lesson do you think you can generalize, and why do you think you can generalize that?

I think this is a crux, in that I don't buy o1 as progressing to a regime where we lose so much dense feedback that it's alignment relevant, because I think sparse-feedback RL will almost certainly be super-uncompetitive with every other AI architecture until well after AI automates all alignment research.

So, as a speculative example, further along in the direction of o1 you could have something like MCTS help train these things to solve very difficult math problems, with the sparse feedback being given for complete formal proofs.

Similarly, playing text-based video games, with the sparse feedback given for winning.

Similarly, training CoT to reason about code, with sparse feedback given for predictions of the code output.

Etc.

You think these sorts of things just won't work well enough to be relevant?

Comment by abramdemski on AI Craftsmanship · 2024-11-15T21:59:00.439Z · LW · GW

I think you're mostly right about the world, but I'm going to continue to articulate disagreements based on my sense of dissatisfaction. You should probably mostly read me as speaking from a should-world rather than being very confused about the is-world.

The bitter lesson is insufficient to explain the lack of structure we're seeing. I gave the example of Whisper. I haven't actually used Whisper, so correct me if I'm wrong -- maybe there is a way to get more nuanced probabilistic information out of it? But the bitter lesson is about methods of achieving capabilities, not about the capabilities themselves. Producing a plain text output rather than a more richly annotated text that describes some of the uncertainty is a design choice.

To give another example, LLMs could learn a conditional model that's annotated with metadata like author, date and time, etc. Google Lambda reportedly had something like author-vectors, so that different "characters" could be easily summoned. I would love to play with that, EG, averaging the author-vectors of specific authors I'm interested in to see what it's like to mix their voices. In theory, LLMs could also learn to predict the metadata. You could use partially-labeled-data approaches such as the EM algorithm to train LLMs to predict the metadata labels while also training them to use those labels in text generation. This would give a rich set of useful capabilities. This would be scientifically useful, too: predictions of date, author, and other metadata would be interesting for historians.
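
As a rough illustration of the EM-style idea (my own sketch; `predict_metadata`, `finetune_conditional`, and the corpus format are hypothetical stand-ins):

```python
# Alternate between imputing missing metadata with the current model (E-step)
# and fine-tuning on metadata-conditioned text while also predicting metadata (M-step).
def em_metadata_training(model, corpus, num_rounds=3):
    for _ in range(num_rounds):
        # E-step: fill in missing labels (author, date, ...) with the model's best guesses.
        for doc in corpus:
            if doc.metadata is None:
                doc.metadata = predict_metadata(model, doc.text)
        # M-step: train on text conditioned on the (possibly imputed) metadata,
        # and on predicting the metadata from the text.
        model = finetune_conditional(model, [(doc.metadata, doc.text) for doc in corpus])
    return model
```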

In this way, we could actually get closer to a "truth machine". There's a world of difference between the kinds of inferences you can make from ChatGPT's opinion about a text vs inferences you could make by treating these as legit latent variables and seeing what latent-variable-inference algorithms think. 

To give a third example, there's Drexler's Quasilinguistic Neural Representations.

You say "it's a matter of demand". So why does it feel so much like big tech is eager to push the chatbot model on everyone? Everyone is scared to fall behind in the AI race, ever since ChatGPT made it feel plausible that AI would drive significant market share. But did big tech make the correct guess about what there was demand for? Or did everyone over-update on the specific form of ChatGPT, and now there's just not that much else out there to reveal what the demand really is? Little sparkle-emoji buttons decorate everything; press to talk to the modern equivalent of Clippy.

Comment by abramdemski on o1 is a bad idea · 2024-11-15T20:23:42.373Z · LW · GW

Process supervision would be helpful, particularly if the model is punished for sensitive dependence on paraphrasings of the CoT, or rewarded for producing CoT which is useful even after paraphrasing.

One of the concerns that I have is that (I believe) the LLM before the CoT-training will be sensitive to some non-human-readable patterns, similar to adversarial examples for image recognition. The CoT training will encourage any patterns which help reach the desired conclusions, whether those patterns are human-readable or not. 

A lot of possible process-supervision procedures will not do anything to help prevent this. For example, rewarding CoT for being human-understandable won't help, because the reward signal there can miss the subtle patterns being used. It only avoids the outcome where the CoT looks non-human-readable; it doesn't avoid the outcome where the CoT actually contains a bunch of human-unreadable patterns.

Using paraphrasing to avoid this is heavily dependent on the strength of the paraphraser. The paraphraser needs to have really good coverage of alternative ways to convey the same semantic information, so that any non-human-readable patterns get randomized. The paraphraser itself can't be vulnerable to steering by adversarial inputs.

Comment by abramdemski on o1 is a bad idea · 2024-11-15T20:04:44.476Z · LW · GW

I more-or-less agree with Eliezer's comment (to the extent that I have the data necessary to evaluate his words, which is greater than most, but still, I didn't know him in 1996). I have a small beef with his bolded "MIRI is always in every instance" claim, because a universal like that is quite a strong claim, and I would be very unsurprised to find a single counterexample somewhere (particularly if we include every MIRI employee and everything they've ever said while employed at MIRI).

What I am trying to say is something looser and more gestalt. I do think what I am saying contains some disagreement with some spirit-of-MIRI, and possibly some specific others at MIRI, such that I could say I've updated on the modern progress of AI in a different way than they have.

For example, in my update, the modern progress of LLMs points towards the Paul side of some Eliezer-Paul debates. (I would have to think harder about how to spell out exactly which Eliezer-Paul debates.)

One thing I can say is that I myself often argued using "naive misinterpretation"-like cases such as the paperclip example. However, I was also very aware of the Eliezer-meme "the AI will understand what the humans mean, it just won't care". I would have predicted difficulty in building a system which correctly interprets and correctly cares about human requests to the extent that GPT4 does.

This does not mean that AI safety is easy, or that it is solved; only that it is easier than I anticipated at this particular level of capability.

Getting more specific to what I wrote in the post:

My claim is that modern LLMs are "doing roughly what they seem like they are doing" and "internalize human intuitive concepts". This does include some kind of claim that these systems are more-or-less ethical (they appear to be trying to be helpful and friendly, therefore they "roughly are"). 

The reason I don't think this contradicts with Eliezer's bolded claim ("Getting a shape into the AI's preferences is different from getting it into the AI's predictive model") is that I read Eliezer as talking about strongly superhuman AI with this claim. It is not too difficult to get something into the values of some basic reinforcement learning agent, to the extent that something like that has values worth speaking of. It gets increasingly difficult as the agent gets cleverer. At the level of intelligence of, say, GPT4, there is not a clear difference between getting the LLM to really care about something vs merely getting those values into its predictive model. It may be deceptive or honest; or, it could even be meaningless to classify it as deceptive or honest. This is less true of o1, since we can see it actively scheming to deceive.

Comment by abramdemski on o1 is a bad idea · 2024-11-15T19:34:55.104Z · LW · GW

I don't get why you think this is true? EG, it seems like almost no insights about how to train faithful CoT would transfer to systems speaking pure neuralese. It seems to me that what little safety/alignment we have for LLMs is largely a byproduct of how language-focused they are (which gives us a weak sort of interpretability, a very useful safety resource which we are at risk of losing soon).

To put it a different way: human imitation is a relatively safe outer-optimization target, so we should be concerned to the extent that the industry is moving away from it. It sounds like you think safety lessons from the human-imitation regime generalize beyond the human-imitation regime. Maybe I agree that we can derive some abstract lessons like "RL with really dense feedback can avoid instrumental convergence", but we're moving away from the regime where such dense feedback is available, so I don't see what lessons transfer.

Comment by abramdemski on 5 ways to improve CoT faithfulness · 2024-11-15T19:16:13.196Z · LW · GW

I don't get why you think actively training the model to be deceptive is "worth the cost". The deceptive behavior can generalize to other cases, and make the system overall much harder to shape in ways you want (because it is actively hiding things from you so your RL is less accurate). It seems like a choice between two bad options: actively deceptive CoT or actively deceptive Face.

Comment by abramdemski on 5 ways to improve CoT faithfulness · 2024-11-15T19:10:35.446Z · LW · GW

--...but the reward process (RLHF, RLAIF, whatever) gets to see the outputs of the Face but not the Shoggoth.
--So, the hope is that skills and propensities towards deception and dishonesty will accumulate in the Face rather than the Shoggoth.

It seems pretty important to put some incentives on the CoT to avoid steganography, though. The division of labor here seems aimed at not actively incentivizing instrumentally useful thoughts to get hidden. However, the pre-trained LLM will already be responsive to some steganographic patterns, in addition to the human-readable patterns. So, there is still an incentive to optimize the non-human-readable patterns in the CoT to be as useful as possible, in addition to the human-readable patterns.

Comment by abramdemski on 5 ways to improve CoT faithfulness · 2024-11-15T19:03:39.979Z · LW · GW

However, with this naive approach, the paraphrasings will probably still consistently contain certain kinds of noise from the initial thought. For instance, maybe the paraphraser never replaces an exclamation point with a period, even though this end-of-sentence punctuation doesn't really change the meaning. By modulating whether the step ends with a period or an exclamation point, the model can encode one bit of information per step, despite the paraphraser.

One way to solve this problem is to map every piece of text with a given meaning onto the same "canonical paraphrasing." For instance, we'd want a paraphraser that, given many texts with approximately-identical meanings (like "I own a large home!" "My house is spacious!" and "I possess a sizable dwelling."), maps them all onto a single "canonical" text (like "I have a big house.").

I don't think this proposal improves the thing you claim it improves.

If the initial paraphraser doesn't change between '!' and '.', then learning a canonical-paraphraser from that initial paraphraser won't do it either. 

I think what canonical-paraphrasing does is help with cases where the paraphraser stochastically removes specific information, but in an information-theoretically leaky way. For example, if the paraphraser changes between '!' and '.' with only 10% probability, then steganography can use error-correcting codes to still get through. 
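
A quick back-of-the-envelope illustration (my own numbers): with a 10% chance that the paraphraser randomizes the punctuation detail, a 5x repetition code with majority-vote decoding already pushes the effective error per hidden bit below 1%:

```python
from math import comb

p = 0.1   # probability the paraphraser randomizes the '!' vs '.' detail
n = 5     # repeat each hidden bit across 5 sentences, decode by majority vote
error = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n // 2 + 1, n + 1))
print(f"effective error per hidden bit: {error:.4f}")  # ~0.0086
```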

Turning a paraphraser into a canonical-paraphraser probably does something like force details into "always removed" vs "never removed" based on a threshold; if '!' and '.' are swapped with 10% probability in the initial paraphraser, then they're always swapped in the final paraphraser; if they're swapped with only 1% probability in the initial paraphraser, then they are never swapped in the final paraphraser.

(Of course this oversimplifies things a lot, and depends completely on how the training to turn paraphrasers into canonical-paraphrasers works.)

I think the only solution to the '.' vs '!' problem, fundamentally, is to train the paraphraser to modify those patterns. If the initial paraphraser doesn't ever change specific details, then those details can be used for steganography. If the paraphraser does change those details, then it is already hard to use them for steganography. So the canonical-paraphrasing idea helps somewhat, but fundamentally, the better the paraphraser (the more closely it captures the human notion of semantic equivalence), the better the anti-steganography guarantee we can get.

Comment by abramdemski on 5 ways to improve CoT faithfulness · 2024-11-15T18:40:37.014Z · LW · GW

Note that the optimized planner is only rewarded for producing steps that help the frozen planner. Since the frozen planner doesn't understand steganography, there is no reason for the optimized planner to learn to use it.

I think this conclusion is overstated. The frozen planner may react to steganography, in the same way that image-recognition NNs react to adversarial examples. The optimized planner is very much like an adversarial example, in the way it is trained to generate outputs which nudge a frozen NN in a specific direction.

Imagine reversing the roles of the optimized vs frozen planner during training: the frozen planner is used for a random number of steps, and then we switch to the optimized planner. Then, it really is the case that the optimized planner has no incentive to use steganography to nudge the frozen planner, since the frozen planner does not get the chance to react at all to the optimized planner.

The downside of this training method is that we can't use your suggestion to only reinforce the very last step rather than all the optimized-planner steps. 
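
A toy sketch of that role-reversed rollout (my own illustration; `frozen_step` and `optimized_step` are hypothetical stand-ins):

```python
import random

def rollout(prompt, max_steps=10):
    steps = []
    switch_at = random.randint(0, max_steps)  # frozen planner writes the steps before this index
    for i in range(max_steps):
        step_fn = frozen_step if i < switch_at else optimized_step
        steps.append(step_fn(prompt, steps))
    return steps  # only the optimized planner's steps would be reinforced
```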

Another idea would be to only use the optimized planner to write a single randomly chosen step in the CoT. 

All of these ideas still leave some incentive to use steganography in the adversarial-example sense. At the very least, the final answer (after the CoT) might be influenced by some subtle non-human-readable patterns in the CoT, so the optimized planner will be incentivized to optimize those patterns. In my view, what Incremental Steering does is two things:

  1. It prevents the model from inventing new steganography; it can only leverage subtle patterns which already influence the LLM before this step in the training.
  2. It makes it very difficult for the model to learn steganography which requires coordination over multiple steps. It is not being rewarded for such coordination.

Comment by abramdemski on o1 is a bad idea · 2024-11-15T17:32:38.207Z · LW · GW

Thanks for this response! I agree with the argument. I'm not sure what it would take to ensure CoT faithfulness, but I agree that it is an important direction to try and take things; perhaps even the most promising direction for near-term frontier-lab safety research (given the incentives pushing those labs in o1-ish directions).

Comment by abramdemski on Complete Feedback · 2024-11-15T17:22:30.418Z · LW · GW

How about “purely epistemic” means “updated by self-supervised learning”, i.e. the updates (gradients, trader bankrolls, whatever) are derived from “things being true vs false” as opposed to “things being good vs bad”. Right?

You can define it that way, but then I don't think it's highly relevant for this context. 

The story I'm telling here is that partial feedback (typically: learning some sort of input-output relation via some sort of latents) always leaves us with undesired hypotheses which we can't rule out using the restricted feedback mechanism.

  • Reinforcement Learning cannot rule out the wireheading hypothesis or human-manipulation hypothesis.
  • Updating on things being true or false cannot rule out agentic hypotheses (the inner optimizer problem).

Any sufficiently rich hypotheses space has agentic policies, which can't be ruled out by the feedback. "Purely epistemic" in your sense filters for hypotheses which make good predictions, but this doesn't constrain things to be non-agentic. The system can learn to use predictions as actions in some way.

[I learned the term teleosemantics from you!  :) ]

I think it would be fair to define a teleosemantic notion of "purely epistemic" as something like "there is no optimization (anywhere in the system -- 'inner' or 'outer') except optimization for epistemic accuracy". 

The obvious application of my main point is that some form of "complete feedback" is a necessary (but insufficient) condition for this. 

"Epistemic accuracy" here has to be defined in such a way as to capture the one-way "direction-of-fit" optimization of the map to fit the territory, but never the territory to fit the map. IE the optimization algorithm has to ignore the causal impact of its predictions.

However, I don't particularly endorse this as the correct design choice -- although a system with this property would be relatively safe in the sense of eliminating inner-alignment concerns and (in a sense) outer-alignment concerns, it is doing so by ignoring its impact on the world, which creates its own set of dangers. If such a system were widely deployed and became highly trusted for its predictions, it could stumble into bad self-fulfilling prophecies.

So, in my view, "epistemic" systems should be as transparent as possible with human users about possible multiple-fixed-point issues, try to keep humans in the loop and give the important decisions to humans; but ultimately, we need to view even "purely epistemic" systems as making some important (instrumental) decisions, and have them take some responsibility for making those decisions well instead of poorly.

The original LI paper was in that category, IIUC. The updates (to which traders had more vs less money) are derived from mathematical propositions being true vs false.

I was going to remind you that the paper didn't say how the fixed-point selection works, and we can do that part in an agentic way, but then you go on to say basically the same thing (with the caveat that where you say "put the LI into a larger system that follows the rule: whatever the expectations are about the AI's own actions, make that actually happen" I would say the more general "put the LI into an environment which somehow reacts to its predictions"):

LI defines a notion of logically uncertain variable, which can be used to represent desires

I would say that they don’t really represent desires. They represent expectations about what’s going to happen, possibly including expectations about an AI’s own actions.

And then you can then put the LI into a larger system that follows the rule: whatever the expectations are about the AI’s own actions, make that actually happen.

The important thing that changes in this situation is that the convergence of the algorithm is underdetermined—you can have multiple fixed points. I can expect to stand up, and then I stand up, and my expectation was validated. No update. I can expect to stay seated, and then I stay seated, and my expectation was validated. No update.

(I don’t think I’m saying anything you don’t already know well.)

Anyway, if you do that, then I guess you could say that the LI’s expectations “can be used” to represent desires … but I maintain that that’s a somewhat confused and unproductive way to think about what’s going on. If I intervene to change the LI variable, it would be analogous to changing habits (what do I expect myself to do ≈ which action plans seem most salient and natural), not analogous to changing desires.

(I think the human brain has a system vaguely like LI, and that it resolves the underdetermination by a separate valence system, which evaluates expectations as being good vs bad, and applies reinforcement learning to systematically seek out the good ones.)

I don't understand what you're trying to accomplish in these paragraphs. To me you sound sorta like Bob in the following:

Alice: Here's my computer model of an agent.

Bob: Uh oh, that sounds sort of like active inference. How did you represent values? Did you confuse them with beliefs?

Alice: I used floating-point numbers to represent the expected value of a state. Here, look at my code. It's a Q-learning algorithm.

Bob: You realize that "expected value" is a statistics thing, right? That makes it epistemic, not really value-laden in the sense that makes something agentic. It's a prediction of what a number will be. Indeed, we can justify expected values as min-quadratic-loss estimates. That makes them epistemic!

Alice: Well, I agree that expected values aren't automatically "values" in the agentic sense, but look, my code can solve mazes and stuff -- like a rat learning to get cheese. 

Bob: Of course I agree that it can be used in an instrumental way, but that's a really misleading way to describe it overall, right? If you changed one Q-value estimated by the system, that would be analogous to changing habits, not desires, right?

Alice: um??? If we agree that it can be used in an instrumental way, then what are you saying is misleading?

Bob: I mean, sure, the human brain does something like this.

Alice: Ok??

It seems possible that you think we have some disagreement that we don't have?
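
(To make the Alice side of this concrete: the kind of system she's describing could be as simple as tabular Q-learning on a tiny corridor, where the learned quantities are expected returns and the goal-seeking comes from acting greedily on them. A toy sketch, not anyone's actual code:)

```python
import random

# A 1-D "maze": states 0..4, cheese (reward 1) at state 4; actions move left or right.
N_STATES, ACTIONS, GOAL = 5, (-1, +1), 4
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}  # expected-return estimates
alpha, gamma, eps = 0.5, 0.9, 0.3

def step(s, a):
    s2 = min(max(s + a, 0), N_STATES - 1)
    return s2, (1.0 if s2 == GOAL else 0.0), s2 == GOAL

for _ in range(500):  # training episodes
    s, done = 0, False
    while not done:
        # epsilon-greedy action selection over the current value estimates
        a = random.choice(ACTIONS) if random.random() < eps else max(ACTIONS, key=lambda b: Q[(s, b)])
        s2, r, done = step(s, a)
        # The update is "epistemic": move Q(s, a) toward a better estimate of expected return.
        target = r + gamma * (0.0 if done else max(Q[(s2, b)] for b in ACTIONS))
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s2

# Acting greedily on those estimates is what produces the goal-seeking behavior:
print([max(ACTIONS, key=lambda b: Q[(s, b)]) for s in range(N_STATES)])
```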

Also, it is easy for end users to build agentlike things out of belieflike things by making queries about how to accomplish things. Thus, we need to train epistemic systems to be responsible about how such queries are answered (as is already apparent in existing chatbots).

I’m not sure that this is coming from a coherent threat model (or else I don’t follow).

  • If Dr. Evil trains his own AGI, then this whole thing is moot, because he wants the AGI to have accurate beliefs about bioweapons.
  • If Benevolent Bob trains the AGI and gives API access to Dr. Evil, then Bob can design the AGI to (1) have accurate beliefs about bioweapons, and (2) not answer Dr. Evil’s questions about bioweapons. That might ideally look like what we’re used to in the human world: the AGI says things because it wants to say those things, all things considered, and it doesn’t want Dr. Evil to build bioweapons, either directly or because it’s guessing what Bob would want.

I'm not clear on what you're trying to disagree with here. It sounds like we both agree that if Benevolent Bob builds a powerful "purely epistemic system" (by whatever definition), without limiting its knowledge, then Dr. Evil can misuse it; and we both agree that as a consequence of this, it makes sense to instead build some agency into the system, so that the system can decide not to give users dangerous information.

Possibly you disagree with the claim "it is easy to build agentlike things out of belieflike things"? What I have in mind is a powerful epistemic oracle. As a simple example, let's say it can give highly accurate guesses to mathematically-posed problems. Then Dr. Evil can implement AIXI by feeding in AIXI's mathematical definition, for example. This is the sort of thing I had in mind, but generalized to the nonmathematical case. (EG, "conditional on my owning a super-powerful death ray soon, what actions do I take now")
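
To illustrate how little glue this takes (the oracle interface below is hypothetical, not any real API): wrap an argmax around conditional queries and the "purely epistemic" system is already doing the work of a planner.

```python
from typing import Callable, Iterable

def plan_with_oracle(oracle: Callable[[str, str], float],
                     candidate_actions: Iterable[str],
                     goal: str) -> str:
    """Turn a predictive oracle into an agent-like planner: query something like
    P(goal | I take this action now) for each candidate and pick the best."""
    return max(candidate_actions, key=lambda action: oracle(goal, action))

# Hypothetical usage:
# best = plan_with_oracle(my_oracle, ["order parts", "recruit help"], "I own a death ray soon")
```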

Comment by abramdemski on Radical Probabilism · 2024-11-11T20:20:40.409Z · LW · GW

The thing to prove is precisely convergence and calibration!

Convergence: Given arbitrary observed sequences, the probabilities of the latent bits (program tape and working tape) converge.

(Obviously the non-latent bits converge, if we assume they eventually get observed; hence the focus on the latent bits here.)

Calibration: The sequence of probabilistic predictions has the property that, restricting attention to the subsequence of predictions between 79% and 81% (for example), the empirical frequency of those predictions coming true will eventually enter that range and stay within it forever. (This should hold for all such ranges.)
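
One way to make these precise, as a sketch (the exact quantifiers and tolerances could reasonably be chosen differently):

```latex
% Let $p_1, p_2, \ldots$ be the successive probabilistic predictions and
% $x_1, x_2, \ldots \in \{0,1\}$ indicate whether each predicted event occurred.

\textbf{Convergence (latent bits):} for each latent bit $b$, the limit
$\lim_{n \to \infty} P_n(b)$ exists.

\textbf{Calibration:} for each interval $[a,b] \subseteq [0,1]$, let $t_1 < t_2 < \cdots$
enumerate the times at which $p_{t_k} \in [a,b]$; if this subsequence is infinite, then
\[
  \exists N \;\, \forall n \ge N: \qquad \frac{1}{n} \sum_{k=1}^{n} x_{t_k} \in [a,b].
\]
```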

Of course, there will also be a number of interesting formal variations on these properties, either weakening them (EG only guaranteeing them under certain conditions) or strengthening them (EG giving explicit bounds on how badly these properties can be violated, rather than only proving that they are obeyed in the limit). For example, converging to predict 50% on adversarial sequences is a much weaker form of a calibration guarantee.

Comment by abramdemski on Complete Feedback · 2024-11-05T16:53:37.034Z · LW · GW

Admittedly, one can try to squish beliefs and desires into the same framework. The Active Inference people do that. Does LI do that too? 

No. LI defines a notion of logically uncertain variable, which can be used to represent desires. There are also other ways one could build agents out of LI, such as doing the active inference thing.

As I mentioned in the post, I'm agnostic about such things here. We could be building """purely epistemic""" AI out of LI, or we could be deliberately building agents. It doesn't matter very much, in part because we don't have a good notion of "purely epistemic":

  • Any learning system with a sufficiently rich hypothesis space can potentially learn to behave agentically (whether we want it to or not, until we have anti-inner-optimizer tech), so we should still have corrigibility concerns about such systems.
  • In my view, beliefs are a type of decision (not because we smoosh beliefs and values together, but rather because beliefs can have impacts on the world if the world looks at them) which means we should have agentic concerns about beliefs.
  • Also, it is easy for end users to build agentlike things out of belieflike things by making queries about how to accomplish things. Thus, we need to train epistemic systems to be responsible about how such queries are answered (as is already apparent in existing chatbots).
Comment by abramdemski on Seeking Collaborators · 2024-11-04T17:24:12.388Z · LW · GW

I think I've articulated a number of concrete subgoals that require less philosophical skill (they can be approached as math problems). However, in the big picture, novel tiling theorems require novel ideas. This requires philosophical skill.

Comment by abramdemski on Seeking Collaborators · 2024-11-04T17:11:59.482Z · LW · GW

I think there are some deeper insights around inner optimization that you are missing that would make you more pessimistic here. "Unknown Algorithm" to me means that we don't know how to rule out the possibility of inner agents which have opinions about recursive self-improvement. Part of it is that we can't just think about what it "converges to" (convergence time will be too long for interesting learning systems).

Comment by abramdemski on Why I’m not a Bayesian · 2024-11-04T05:24:35.819Z · LW · GW

I agree that Solomonoff’s epistemology is noncentral in the way you describe, but I don't think it impacts my points very much; replace Solomonoff with whatever epistemic theory you like. It was just a convenient example.

(Although I expect defenders of Solomonoff to expect the program bits to be meaningful; and I somewhat agree. It's just that the theory doesn't address the meaning there, instead treating programs more like black-box predictors.)

In my view, meaning is the property of being optimized to adhere to some map-territory relationship. However, this optimization itself must always occur within some model (it provides the map-territory relationship to optimize for). In the context of Solomonoff Induction, this may emerge from the incentive to predict, but it is not easy to reason about.

In some sense, reality isn't made of bits, propositions, or any such thing; it is of unknowable type. However, we always describe it in terms of some type (a language).

I'm no longer sure where the disagreement lies, if any, but I still feel like the original post overstates things.

Comment by abramdemski on Seeking Collaborators · 2024-11-03T15:31:53.857Z · LW · GW

How would you become confident that a UANFSI approach was NFSI?

Comment by abramdemski on Seeking Collaborators · 2024-11-03T15:30:23.972Z · LW · GW

I have lost interest in the Löbian approach to tiling, because probabilistic tiling results seem like they can be strong enough while having much less suspicious-looking solutions. Expected value maximization is a better way of looking at agentic behavior anyway. Trying to logically prove some safety predicate for all actions seems like a worse paradigm than trying to prove some safety properties for the system overall (including proving that those properties tile under as-reasonable-as-possible assumptions, plus sanity-checking what happens when those assumptions aren't precisely true).

I do think Löb-ish reasoning still seems potentially important for coordination and cooperation, which I expect to feature in important tiling results (if this research program continues to make progress). However, I am optimistic about replacing Löb's Theorem with Payor's Lemma in this context.
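
For reference, the rule forms side by side (a sketch from memory, so worth checking against the original statements):

```latex
% Löb's theorem (rule form): if the system proves $\Box P \rightarrow P$, it proves $P$.
\frac{\;\vdash \Box P \rightarrow P\;}{\;\vdash P\;}

% Payor's lemma (rule form): it suffices to prove $P$ under the assumption
% that $\Box P \rightarrow P$ is itself provable.
\frac{\;\vdash \Box(\Box P \rightarrow P) \rightarrow P\;}{\;\vdash P\;}
```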

I don't completely discount the pivotal-act approach, but I am currently more optimistic about developing safety criteria & designs which could achieve some degree of consensus amongst researchers, and make their way into commercial AI, perhaps through regulation.

Comment by abramdemski on Complete Feedback · 2024-11-03T15:13:48.410Z · LW · GW

Yeah, I totally agree. This was initially a quick private message to someone, but I thought it was better to post it publicly despite the inadequate explanations. I think the idea deserves a better write-up.

Comment by abramdemski on Three Notions of "Power" · 2024-10-30T14:50:45.344Z · LW · GW

It seems to me that the importance and interaction of these different types of power in the future depends a lot on our choices now, ie, what kind of future we shape. Hierarchies could get smashed in one way or another, making John's prediction correct, or we could engineer a future that evolves from the present more smoothly, in which case you'd be correct.