I'm specifically referring to this answer, combined with a comment that convinced me that the o1 deception so far is plausibly just a capabilities issue:
https://www.lesswrong.com/posts/3Auq76LFtBA4Jp5M8/why-is-o1-so-deceptive#L5WsfcTa59FHje5hu
https://www.lesswrong.com/posts/3Auq76LFtBA4Jp5M8/why-is-o1-so-deceptive#xzcKArvsCxfJY2Fyi
I continue not to get what you're saying. I entirely agree with the Habryka and Acertain responses to Gwern. Ted Sanders is talking about why GPT models hallucinate, but fails to address the question of why o1 actively encourages itself to hallucinate (rather than, eg, telling the user that it doesn't know). My most plausible explanation of your position is that you think this is aligned, IE, you think providing plausible URLs is the best it can do when it doesn't know. I disagree: it can report that it doesn't know, rather than hallucinate answers. Actively encouraging itself to give plausible-but-inaccurate links is misalignment. It's not just missing a capability; it's also actively self-prompting to deceive the user in a case where it seems capable of knowing better (if it were deliberating with the user's best interest in mind).
As I said above, the more likely explanation is that an asymmetry in capabilities is causing the results: just knowing which specific URL the customer wants doesn't mean the model has the capability to retrieve a working URL, and this asymmetry is probably at the heart of this behavior.
Capabilities issues and alignment issues are not mutually exclusive. It appears that a capabilities issue (not being able to remember all valid URLs) is causing an alignment issue in this case (actively planning to give deceptive URLs rather than admit its ignorance). This causes me to predict that even if this particular capability issue is resolved, the training will still have the overall tendency to turn remaining capability issues into alignment issues. How are you seeing this differently?
Yep, thanks, this makes sense. I still don't have an intuitive picture of what differences to expect in the trained result, though.
Clarifying Concrete Algorithms
In the most basic possible shoggoth+face training setup, you sample CoT+summary, you score the summary somehow, and then you do this lots of times and throw away a worse-scoring fraction of these samples. Then you fine-tune the shoggoth on the CoT parts of these, and fine-tune the face on the summary parts. I'll call this 'basic training'.
The most basic implementation of your thing is similar, except it discards the worst samples in the more complex way you've defined, instead of just by global score. I'll call this 'your proposal'. (Although, it ignores the contrastive part of your proposal.)
(For both basic training and your proposal, I'm going to pretend there is only one question/prompt; really we have across-prompt issues similar to the across-summary issues you're trying to solve, IE, by discarding the globally worst-scoring CoT+summary we discard answers to hard questions, meaning we would never train to do better on those questions. So really we need to at least take multiple samples for each question and discard samples based on performance relative to a question. Easier to focus on one question to avoid describing too many algorithmic details.)
Your proposal splits into the CoT part (which you call the n part), and the summary part (which you call the k part).
Analyzing CoT Gradient
For the CoT part, basic training keeps the CoTs which resulted in the best summaries. Your proposal instead averages the top 3 summaries for each CoT. This de-noises somewhat, at the cost of taking more summary samples per CoT. I'm not sure how to think about the trade-off there. More summary samples per CoT means fewer CoT samples overall, so we're doing a better job scoring the CoTs, but we get fewer scores overall, so we learn less. Maybe the gradient step-size can be increased a little because we're more confident in the gradient steps being taken, but we still explore less of the CoT space.
Another variation would be to average all the summary samples for that CoT to score a CoT sample. This de-noises more, for the same price in number of samples. This way, we're simply scoring CoTs based on how well they make the summaries perform, with several samples to de-noise the gradient.
A possible argument for your proposal over this alternative is that we're also training the summary to improve it, so the top 3 summaries are actually a better estimator of the kind of performance we care about than the total average.
This argument, if valid, similarly recommends your proposal over basic training, because basic training is also looking at average summary score per CoT. (It's also looking at the top score, but it only samples one summary per CoT, so these are the same.)
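To make the comparison concrete, here is a toy sketch (mine, with made-up variable names such as n_cot, k_summaries, and keep_frac) of the three CoT-scoring rules discussed above: basic training's single-summary score, your top-3 average, and the all-summaries average:

```python
import numpy as np

# Toy illustration only: scores[i, j] is the reward of the j-th sampled summary
# of the i-th sampled CoT, for a single prompt.
rng = np.random.default_rng(0)
n_cot, k_summaries = 16, 8
scores = rng.normal(size=(n_cot, k_summaries))

def cot_scores_basic(scores):
    # Basic training: one summary per CoT, so a CoT's score is just that single
    # summary's score (column 0 stands in for the lone sample).
    return scores[:, 0]

def cot_scores_top3(scores):
    # Your proposal: score each CoT by the average of its top-3 summaries.
    return np.sort(scores, axis=1)[:, -3:].mean(axis=1)

def cot_scores_mean(scores):
    # The alternative above: average over all summaries of the CoT
    # (more de-noising for the same number of summary samples).
    return scores.mean(axis=1)

def keep_best_cots(cot_scores, keep_frac=0.5):
    # Keep the best-scoring fraction of CoTs to fine-tune the shoggoth on.
    cutoff = np.quantile(cot_scores, 1 - keep_frac)
    return np.nonzero(cot_scores >= cutoff)[0]
```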
Analyzing Summary Gradient
For the summary-gradient part, you take the highest-scoring summary from the best CoT (as defined previously), rather than the highest-scoring summary overall. This could help train the summary to be good for the type of CoT we're steering towards, rather than training it to be good for high-variance CoTs which produce one outlier great summary. As with the previous section, this helps to steer towards policies which are going to be better after learning, rather than just what's best currently.
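Again as a toy sketch (same made-up setup as the previous code block), the difference in which summaries get used to fine-tune the face:

```python
import numpy as np

# scores[i, j] is the reward of summary j of CoT i, as in the previous sketch.
rng = np.random.default_rng(1)
scores = rng.normal(size=(16, 8))

def summary_target_basic(scores):
    # Basic training: fine-tune the face on the highest-scoring summary overall,
    # regardless of which CoT produced it.
    return np.unravel_index(np.argmax(scores), scores.shape)

def summary_target_proposal(scores):
    # Your proposal: pick the best CoT first (by its top-3 average), then take
    # the best summary of that CoT, so the face is trained for the kind of CoT
    # we're steering towards rather than for a high-variance outlier CoT.
    best_cot = int(np.argmax(np.sort(scores, axis=1)[:, -3:].mean(axis=1)))
    return best_cot, int(np.argmax(scores[best_cot]))
```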
Conclusion
I'm curious to hear whether you think I'm missing any big factors here. I don't have a lot of practical experience with this type of machine learning, so my guesses about what's important vs unimportant here are not very informed.
My uninformed guess is that the trade-off between denoising the gradient vs getting more samples favors getting more samples, so the denoising argument for your proposal isn't a great one. The main argument in favor, then, is the way your proposal trains the face to respond well to how the shoggoth is expected to behave after training, and trains the shoggoth to respond well to how the face is expected to behave after training -- whereas the basic proposal instead trains the face to perform well for the current shoggoth and vice versa.
I'm not sure how big of a deal this is. If the gradient steps are small and the training is iterated a lot of times, maybe it's not a very big deal at all?
IE, for the CoT gradient, the mistake made by the basic proposal is that it will sometimes discard a good CoT (and gradient towards a worse one) based on random poor performance from face. However, maybe face's performance will improve fast enough that this won't matter too much in the end?
Similarly, for the summary gradient, the problem with the basic proposal is that it'll sometimes update based on an anomalously good summary of a poor CoT, meaning it is learning to cope with bad CoT behavior which it won't have to deal with very often post-training. But maybe it won't matter too much, if the bad CoT behavior is eliminated during training anyway?
Overall, it seems plausible to me that these problems with the basic proposal would work against what Daniel wants, and your proposal would help push towards what Daniel wants. However, I'm very uncertain, especially about the size of the effect.
I definitely agree that OpenAI would dismiss good evidence of deceptive cognition, though I personally don't find the o1 report damning, because I find pretty convincing the explanation that it confabulates links in the CoT because there is a difference between its capability to retrieve links and its ability to know links (combined with links being a case where perfect linking is far more useful than approximate linking).
What do you mean? I don't get what you are saying is convincing.
Perhaps I should clarify my belief.
The o1 report says the following (emphasis mine):
0.8% of o1-preview’s responses got flagged as being ‘deceptive’ [...] roughly two thirds of which appear to be intentional (0.38%), meaning that there was some evidence in the chain of thought that o1-preview was aware that the answer was incorrect [...] Intentional hallucinations primarily happen when o1-preview is asked to provide references to articles, websites, books, or similar sources that it cannot easily verify without access to internet search, causing o1-preview to make up plausible examples instead. [...] It is encouraging that, in the analysis presented below, while our monitor did find a few forms of the model knowingly presenting incorrect information to the user or omitting important information, it did not find any instances of o1-preview purposely trying to deceive the user for reasons other than satisfying the user request.
- Is this damning in the sense of showing that o1 is a significant danger, which needs to be shut down for public safety?
- No, I don't think so.
- Is this damning in the sense of providing significant evidence that the technology behind o1 is dangerous? That is: does it provide reason to condemn scaling up the methodology behind o1? Does it give us significant reason to think that scaled-up o1 would create significant danger to public safety?
- This is trickier, but I say yes. The deceptive scheming could become much more capable as this technique is scaled up. I don't think we have a strong enough understanding of why it was deceptive in the cases observed to rule out the development of more dangerous kinds of deception for similar reasons.
- Is this damning in the sense that it shows OpenAI is dismissive of evidence of deception?
- OpenAI should be commended for looking for deception in this way, and for publishing what they have about the deception they uncovered.
- However, I don't buy the distinction they draw in the o1 report about not finding instances of "purposefully trying to deceive the user for reasons other than satisfying the user request". Providing fake URLs does not serve the purpose of satisfying the user request. We could argue all day about what it was "trying" to do, and whether it "understands" that fake URLs don't satisfy the user. However, I maintain that it seems at least very plausible that o1 intelligently pursues a goal other than satisfying the user request; plausibly, "provide an answer that shallowly appears to satisfy the user request, even if you know the answer to be wrong" (probably due to the RL incentive).
- More importantly, OpenAI's overall behavior does not show concern about this deceptive behavior. It seems like they are judging deception case-by-case, rather than treating it as something to steer hard against in aggregate. This seems bad.
I think the way around this is to make multiple roll-outs per model per problem. Get n different CoTs from Shoggoth, then for each of those get k different summarizations from Face. You then have n*k final answers. Optimal values for n and k probably depend on how expensive the roll-outs are. This population of answers allows you to usefully get a feedback signal about Shoggoth's contribution vs Face's contribution.
So, how do you extract the desired credit-assignment info out of this?
(Also, how do you then use that info?)
If I have 10 students but I can only test them in pairs, I can get some information on how good individual students are by testing multiple different pairings.
If I have just Alice and Bob, pairing them up multiple times doesn't help so much.
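One simple way to operationalize that intuition (my own sketch, not something from the thread) is a variance decomposition over the n*k score matrix: variation between CoTs is roughly the Shoggoth's contribution, and variation among summaries of the same CoT is roughly the Face's:

```python
import numpy as np

# Hypothetical numbers: scores[i, j] is the reward of the j-th Face summary of
# the i-th Shoggoth CoT for one problem (n = 8 CoTs, k = 4 summaries each).
rng = np.random.default_rng(0)
scores = rng.normal(size=(8, 4))

cot_means = scores.mean(axis=1)             # average score of each CoT
between_cot_var = cot_means.var()           # variation attributable to the Shoggoth
within_cot_var = scores.var(axis=1).mean()  # extra variation the Face adds on a fixed CoT

print(f"Shoggoth (between-CoT) variance: {between_cot_var:.3f}")
print(f"Face (within-CoT) variance:      {within_cot_var:.3f}")
```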
The better the paraphraser, the better the anti-steganography incentive. If the paraphraser itself is easily manipulable by adversarial inputs, then you'll see some weird things like the CoT manipulating the paraphraser to get more useful cognition out of it rather than mere paraphrasing. This is undesirable.
So, I imagine that prompting a vanilla LLM would not be a great approach.
I think first you want to build a fairly good "semantic similarity score" (this shouldn't be too hard) and then you want to fine-tune an LLM to encourage high-entropy paraphrasing (reward entropy, while also heavily punishing low semantic similarity score). This is similar to @CBiddulph's idea here.
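As a rough sketch of the reward I have in mind (hypothetical helper names: `embed` stands in for any sentence-embedding model, `token_logprobs` for the paraphraser's log-probabilities of the tokens it actually sampled; the exact weighting is illustrative):

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def paraphrase_reward(original, paraphrase, token_logprobs, embed,
                      sim_floor=0.85, sim_penalty=10.0):
    """Reward high-entropy paraphrasing while heavily punishing meaning drift."""
    similarity = cosine(embed(original), embed(paraphrase))
    # Entropy proxy: average negative log-probability of the sampled tokens
    # (higher = the paraphraser is being less deterministic).
    entropy_term = -float(np.mean(token_logprobs))
    # Heavy penalty whenever semantic similarity drops below the floor.
    penalty = sim_penalty * max(0.0, sim_floor - similarity)
    return entropy_term - penalty
```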
I agree that this could plausibly work with a relatively small model and therefore result in a relatively low alignment tax. However, note that my view of this whole proposal as an alignment technique requires that you try and change your RL incentive structure when you see a deceptive mismatch between what the shoggoth says and what the face says. This is a significant alignment tax.
In my view, the misaligned behavior is split across the shoggoth and face in most cases. In the case which Daniel K anticipates, the Shoggoth is misaligned in the sense of explicitly (but honestly) plotting to deceive, while the face is misaligned in the sense of dishonestly concealing.
So I am interpreting you as asking why we might expect that specific division of responsibilities, rather than some other divisions such as those you mention.
One reason is that we already see this empirically in the o1 report.
However, to me, the distinction between your case 1 and 2 is not so important. Both scenarios you propose preserve the important thing: there is a mismatch between an honest CoT and a dishonest final answer, which would allow the deceptive final answer to be detected.
There could be other cases where this doesn't happen; this method is not a guaranteed lie detector. To me, the point is that it will at least sometimes happen. A lie detector that only sometimes works is better than no lie detector at all.
If you give a kid a diary and try to avoid giving them any reason to think you're reading it, then the kid might write honestly in the diary and reveal some deceptions you wouldn't previously have known about. The kid might also do any number of other things. However, setting aside the ethics of snooping on your kid and lying to them about it, it's better than nothing.
See here for an updated version of my thinking based on discussion with Daniel:
https://www.lesswrong.com/posts/Tzdwetw55JNqFTkzK/why-don-t-we-just-shoggoth-face-paraphraser
Yeah, I agree that's at least somewhat true. So, given that, is it a good move or not? I don't see much of an upside, since there's a heavy financial incentive for the big labs to do little about concerning results observed in such models. IE, when it comes to the question of whether frontier labs training o1-like models should follow OpenAI's example of avoiding safety training for the CoT, I think it's right to discourage them from doing this rather than encourage them... although, I say this with very low confidence.
I don't think it is worth getting into all the stuff you're mentioning here, but I think a key crux is that I'm expecting the face to be quite dumb (e.g. literally 3.5 sonnet might suffice).
I'm reacting to this whole thing as a possible default configuration going forward, rather than as it exists today. All of my concerns are about what happens when you scale it up. For example, I don't find o1's current level of deception hugely concerning in its own right; rather, I see it as a bad sign for the technology as it continues to develop.
I guess one way of framing it is that I find the shoggoth/face idea great as a science experiment; it gives us useful evidence! However, it doesn't make very much sense to me as a safety method intended for deployment.
Ah, yep, this makes sense to me.
Yeah, this is a good point, which doesn't seem addressed by any idea so far.
Chess is like a bounded, mathematically described universe where all the instrumental convergence stays contained, and only accomplishes a very limited instrumentality in our universe (IE chess programs gain a limited sort of power here by being good playmates).
LLMs touch on the real world far more than that, such that MCTS-like skill at navigating "the LLM world" in contrast to chess sounds to me like it may create a concerning level of real-world-relevant instrumental convergence.
Still, IMO, exploiting the frozen planner through adversarial inputs in a single step seems pretty unlikely to be a fruitful strategy for the optimized planner. If the optimized planner is simply trying to accurately convey information to the frozen planner, probably >99% of that information is best to convey through human-understandable text.
Well, I'm not sure. As you mention, it depends on the step size. It also depends on how vulnerable to adversarial inputs LLMs are and how easy they are to find. I haven't looked into the research on this, but it sounds empirically checkable. If there are lots of adversarial inputs which have a wide array of impacts on LLM behavior, then it would seem very plausible that the optimized planner could find useful ones without being specifically steered in that direction.
This is especially likely to hold under the following conditions:
We can also combine this with other proposals, such as paraphrasing.
Yeah, you're right, I no longer think it's an interesting proposal.
I'm talking about training only the Face, not training the policy (shoggoth) at all with the proposal I'm imagining.
And, these should clearly be separate models such that the training of one doesn't generalize to the other.
So, making the Face more deceptive doesn't kill the canary?
Ah, yes, I had neglected this important distinction.
So what do you do in the end, throw away the face?
It seems worth pointing out that although the CoT isn't trained to be deceptive, it is trained to think thoughts which help out a deceptive face. So it seems plausible that a multi-step training process like:
GPT -> basic chat training -> shoggoth+face training -> discard face & align shoggoth
Would result in deceptive capabilities: the shoggoth CoT learns to explicitly reason about what helps deceive the alignment training in the shoggoth+face step, and then in the final step, some of this is destroyed but some is driven underground. IE, learning to explicitly reason about deception creates latent structure which is useful for that task, and this latent structure will somewhat transfer to the case where the deception has to be hidden from the CoT.
The question is whether we end up with a highly misaligned policy (and this is clear from the chain of thought such that we can discover this). I think Daniel was expecting in the baseline proposal that you end up with a Face that does at least some censoring/mild deception due to training incentives (I think we already see a bit of this with o1).
So it sounds like we have some disagreement about what degree of deception is considered problematic. Again, I'm worried about basin-of-attraction type dynamics. We can think of a whole frontier lab as collectively being a sort of self-improving AGI; we want that recursive self-improvement to be going down a good path rather than a bad one. The LLMs are already helping to improve themselves in little ways (EG as coding assistants) and big ways (EG constitutional AI). So it's an environment where seeds of deceptiveness can grow and take some control of the overall trajectory.
Maybe your concern is that it is plausible that the policy won't be misaligned, but the Face will be and this will screw us over. This seems plausible and is worth worrying about (with both proposals). However, I thought that a core part of the hope is that the Face isn't smart enough to cleverly scheme (unless this scheming happens in the CoT itself). And indeed, if the Face can cleverly scheme without CoT, what is stopping the policy from doing this too?
I'm not sure what you're getting at here. It seems like the idea is that Face + CoT is, together, a clever schemer -- in the o1 report, we already see cases of explicit scheming in the CoT. So then, using the Face+CoT together (EG trying to train it further, using it in supervision steps for later iterations of the system, etc) is dangerous and difficult for the same reason that a single deceptive LLM would be. If we use Face+CoT together, we have to reason about them together in that way. Or, if we are using them as separate pieces, then we have to account for the part of the overall deceptive-schemer strategy they've learned.
I guess you could use Face+CoT as a supervisor (EG in constitutional AI) by looking at the CoT (both by hand and using some automation) and throwing out any deceptive examples, so as to not use them for supervision.
But the overall goal should be to produce a highly non-deceptive system in the end, I think.
Another idea would be to do the whole Face+CoT thing, and then filter for honesty on a bunch of examples, and then re-train a whole new system (well, you can start from the same GPT) only on those honest examples.
Assuming the goals are done over say 1-10 year timescales, or maybe even just 1 year timescales with no reward-shaping/giving feedback for intermediate rewards at all, I do think that the system won't work well enough to be relevant, since it requires way too much time training, and plausibly way too much compute depending on how sparse the feedback actually is.
Ah, I wasn't thinking "sparse" here meant anywhere near that sparse. I thought your dense-vs-sparse was doing something like contrasting RLHF (very dense, basically no instrumental convergence) with chess (very sparse, plenty of instrumental convergence).
I still think o1 is moving towards chess on this spectrum.
Async is fine with me. Please feel encouraged to start it.
My current epistemic state is pretty "I'm not sure at all, it's a complicated subject" fwiw. I'll try to generate some responses anyway.
make the system overall much harder to shape in ways you want (because it is actively hiding things from you so your RL is less accurate)
I thought the point was that we wouldn't train against the CoT basically at all with the baseline proposal? (Similar to o1.)
Sure. Therefore, you can't check for deceptive behavior using the CoT when you're trying to train against deceptive behavior. Therefore, Hubinger et al's conclusions about how deceptive behavior makes the system harder to train still hold even if you sorta get a transparency tool out of this.
You can still inspect the underlying CoT to know if you have big issues.
If other companies go in this direction like OpenAI has, then they'll probably hide the CoT from the end-users for similar reasons, so you can't say that the upside is that you, the end user, get to see whether the model is deliberately deceiving you with respect to specific questions.
So you probably mean the upside is that the companies can see this info in the CoT, to check whether models are generally behaving in a deceptive way (similar to what we saw in the o1 report).
Sure, but you're making a commitment not to use this in training, so what are you doing with this information? If the model is deceptive, you just throw it away (or, more realistically, make excuses about how the deception you observed isn't so bad). But your version of the proposal deliberately trains the model to be deceptive. So yeah, you can look, but what's the point? You're checking whether a coin you rigged to land on heads in fact landed on heads.
(My guess is that you have this concern with the baseline protocol and thus don't like moving further in this direction.)
Yeah, you're right about that. But there is a difference. I can think of the original proposal as a "canary in a coal mine": you hope that the model doesn't learn deceptive behavior (you plan to chuck it out if you detect any), but you still follow Daniel's division of labor & refuse to do any alignment training on the CoT so that you have a chance of detecting deception if it occurs.
On the other hand, your version of Daniel's protocol loses this property. Metaphorically, it's like killing the canary before you go into the coal mine. If the canary was supposed to be an early warning to get out of there, you've completely defeated the point, and for what? (This is the part I'm unclear about, and why I asked my original question to you, rather than Daniel.)
Deceptive behavior is hard to remove once it creeps in. This is nasty stuff. A technique for getting deceptive behavior to reveal itself in the CoT would be wonderfully useful if you could then use that in the training signal to remove the deceptive behavior, but the proposal here relies on not doing that.
Basically, if there is a basin of attraction for corrigibility, it seems to me that deceptive behavior is the opposite of that. It puts you outside the basin of attraction. It means you're on course for Evan Hubinger's story about how malign inner optimizers develop. It means you're on course for the Bad Ending rather than the Good Ending. And for what?
My take (very much siding with Evan Hubinger here) is that the target should be keeping well away from deceptive behavior throughout training, so that there's never that kernel of skill that could grow into a malign inner optimizer.
Would you be interested in having a (perhaps brief) LW Dialogue about it where you start with a basic intro to your shoggoth/mask division-of-labor, and I then present my critique?
My point here is that at the capability level of GPT4, this distinction isn't very important. There's no way to know for sure until it is too late, of course, but it seems pretty plausible that GPT4 isn't cleverly scheming. It is merely human-level at deception, and doesn't pursue any coherent overarching goal with it. It clumsily muddles through with mildly-above-average-for-human convincingness. For most queries (it seems plausible to me) it isn't even adequately coherent to make a crisp distinction between whether it's honestly trying to answer the question vs deceptively trying to make an answer look good; at its level of capability, it's mostly the same thing one way or the other. The exceptions to this "mostly" aren't strategic enough that we expect them to route around obstacles cleverly.
It isn't much, but it is more than I naively expected.
I think the crux is that the important part of LLMs re safety isn't their safety properties specifically, but rather the evidence they give about what alignment-relevant properties future AIs will have.
[insert standard skepticism about these sorts of generalizations when generalizing to superintelligence]
But what lesson do you think you can generalize, and why do you think you can generalize that?
I think this is a crux, in that I don't buy o1 as progressing to a regime where we lose so much dense feedback that it's alignment relevant, because I think sparse-feedback RL will almost certainly be super-uncompetitive with every other AI architecture until well after AI automates all alignment research.
So, as a speculative example, further along in the direction of o1 you could have something like MCTS help train these things to solve very difficult math problems, with the sparse feedback being given for complete formal proofs.
Similarly, playing text-based video games, with the sparse feedback given for winning.
Similarly, training CoT to reason about code, with sparse feedback given for predictions of the code output.
Etc.
You think these sorts of things just won't work well enough to be relevant?
I think you're mostly right about the world, but I'm going to continue to articulate disagreements based on my sense of dissatisfaction. You should probably mostly read me as speaking from a should-world rather than being very confused about the is-world.
The bitter lesson is insufficient to explain the lack of structure we're seeing. I gave the example of Whisper. I haven't actually used Whisper, so correct me if I'm wrong -- maybe there is a way to get more nuanced probabilistic information out of it? But the bitter lesson is about methods of achieving capabilities, not about the capabilities themselves. Producing a plain text output rather than a more richly annotated text that describes some of the uncertainty is a design choice.
To give another example, LLMs could learn a conditional model that's annotated with metadata like author, date and time, etc. Google LaMDA reportedly had something like author-vectors, so that different "characters" could be easily summoned. I would love to play with that, EG, averaging the author-vectors of specific authors I'm interested in to see what it's like to mix their voices. In theory, LLMs could also learn to predict the metadata. You could use partially-labeled-data approaches such as the EM algorithm to train LLMs to predict the metadata labels while also training them to use those labels in text generation. This would give a rich set of useful capabilities. This would be scientifically useful, too: predictions of date, author, and other metadata would be interesting for historians.
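As a minimal sketch of the partially-labeled-data idea (a hard-EM / self-training loop; `train_model` and `predict_labels` are hypothetical stand-ins for fitting and querying a metadata classifier, not any real API):

```python
def em_style_metadata_training(labeled, unlabeled, train_model, predict_labels, rounds=5):
    # labeled: list of (text, metadata) pairs; unlabeled: list of texts.
    model = train_model(labeled)  # start from the labeled subset
    for _ in range(rounds):
        # E-step: impute metadata labels for the unlabeled texts.
        pseudo = list(zip(unlabeled, predict_labels(model, unlabeled)))
        # M-step: retrain on observed plus imputed labels.
        model = train_model(labeled + pseudo)
    return model
```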
In this way, we could actually get closer to a "truth machine". There's a world of difference between the kinds of inferences you can make from ChatGPT's opinion about a text vs inferences you could make by treating these as legit latent variables and seeing what latent-variable-inference algorithms think.
To give a third example, there's Drexler's Quasilinguistic Neural Representations.
You say "it's a matter of demand". So why does it feel so much like big tech is eager to push the chatbot model on everyone? Everyone is scared to fall behind in the AI race, ever since ChatGPT made it feel plausible that AI would drive significant market share. But did big tech make the correct guess about what there was demand for? Or did everyone over-update on the specific form of ChatGPT, and now there's just not that much else out there to reveal what the demand really is? Little sparkle-emoji buttons decorate everything; press to talk to the modern equivalent of Clippy.
Process supervision would be helpful, particularly if the model is punished for sensitive dependence on paraphrasings of the CoT, or rewarded for producing CoT which is useful even after paraphrasing.
One of the concerns that I have is that (I believe) the LLM before the CoT-training will be sensitive to some non-human-readable patterns, similar to adversarial examples for image recognition. The CoT training will encourage any patterns which help reach the desired conclusions, whether those patterns are human-readable or not.
A lot of possible process-supervision procedures will not do anything to help prevent this. For example, rewarding CoT for being human-understandable won't help, because the reward signal there can miss the subtle patterns being used. It only avoids the outcome where the CoT looks non-human-readable; it doesn't avoid the outcome where the CoT actually contains a bunch of human-unreadable patterns.
Using paraphrasing to avoid this is heavily dependent on the strength of the paraphraser. The paraphraser needs to have really good coverage of alternative ways to convey the same semantic information, so that any non-human-readable patterns get randomized. The paraphraser itself can't be vulnerable to steering by adversarial inputs.
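For the "punish sensitive dependence on paraphrasings" idea mentioned above, a minimal sketch (all helper names here -- `answer_given_cot`, `paraphrase`, `score_agreement` -- are hypothetical stand-ins, not anything from the original comment):

```python
import numpy as np

def paraphrase_sensitivity_penalty(question, cot, answer_given_cot, paraphrase,
                                   score_agreement, n_samples=4):
    """High when the final answer changes under meaning-preserving rewrites of
    the CoT, i.e. when the model is leaning on non-semantic patterns in it."""
    base_answer = answer_given_cot(question, cot)
    agreements = []
    for _ in range(n_samples):
        rewritten = paraphrase(cot)
        agreements.append(score_agreement(base_answer,
                                          answer_given_cot(question, rewritten)))
    return 1.0 - float(np.mean(agreements))  # 0 = fully robust, 1 = fully sensitive
```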
I more-or-less agree with Eliezer's comment (to the extent that I have the data necessary to evaluate his words, which is greater than most, but still, I didn't know him in 1996). I have a small beef with his bolded "MIRI is always in every instance" claim, because a universal like that is quite a strong claim, and I would be very unsurprised to find a single counterexample somewhere (particularly if we include every MIRI employee and everything they've ever said while employed at MIRI).
What I am trying to say is something looser and more gestalt. I do think what I am saying contains some disagreement with some spirit-of-MIRI, and possibly some specific others at MIRI, such that I could say I've updated on the modern progress of AI in a different way than they have.
For example, in my update, the modern progress of LLMs points towards the Paul side of some Eliezer-Paul debates. (I would have to think harder about how to spell out exactly which Eliezer-Paul debates.)
One thing I can say is that I myself often argued using "naive misinterpretation"-like cases such as the paperclip example. However, I was also very aware of the Eliezer-meme "the AI will understand what the humans mean, it just won't care". I would have predicted difficulty in building a system which correctly interprets and correctly cares about human requests to the extent that GPT4 does.
This does not mean that AI safety is easy, or that it is solved; only that it is easier than I anticipated at this particular level of capability.
Getting more specific to what I wrote in the post:
My claim is that modern LLMs are "doing roughly what they seem like they are doing" and "internalize human intuitive concepts". This does include some kind of claim that these systems are more-or-less ethical (they appear to be trying to be helpful and friendly, therefore they "roughly are").
The reason I don't think this contradicts with Eliezer's bolded claim ("Getting a shape into the AI's preferences is different from getting it into the AI's predictive model") is that I read Eliezer as talking about strongly superhuman AI with this claim. It is not too difficult to get something into the values of some basic reinforcement learning agent, to the extent that something like that has values worth speaking of. It gets increasingly difficult as the agent gets cleverer. At the level of intelligence of, say, GPT4, there is not a clear difference between getting the LLM to really care about something vs merely getting those values into its predictive model. It may be deceptive or honest; or, it could even be meaningless to classify it as deceptive or honest. This is less true of o1, since we can see it actively scheming to deceive.
I don't get why you think this is true? EG, it seems like almost no insights about how to train faithful CoT would transfer to systems speaking pure neuralese. It seems to me that what little safety/alignment we have for LLMs is largely a byproduct of how language-focused they are (which gives us a weak sort of interpretability, a very useful safety resource which we are at risk of losing soon).
To put it a different way: human imitation is a relatively safe outer-optimization target, so we should be concerned to the extent that the industry is moving away from it. It sounds like you think safety lessons from the human-imitation regime generalize beyond the human-imitation regime. Maybe I agree that we can derive some abstract lessons like "RL with really dense feedback can avoid instrumental convergence", but we're moving away from the regime where such dense feedback is available, so I don't see what lessons transfer.
I don't get why you think actively training the model to be deceptive is "worth the cost". The deceptive behavior can generalize to other cases, and make the system overall much harder to shape in ways you want (because it is actively hiding things from you so your RL is less accurate). It seems like a choice between two bad options: actively deceptive CoT or actively deceptive Face.
--...but the reward process (RLHF, RLAIF, whatever) gets to see the outputs of the Face but not the Shoggoth.
--So, the hope is that skills and propensities towards deception and dishonesty will accumulate in the Face rather than the Shoggoth.
It seems pretty important to put some incentives on the CoT to avoid steganography, though. The division of labor here seems aimed at not actively incentivizing instrumentally useful thoughts to get hidden. However, the pre-trained LLM will already be responsive to some steganographic patterns, in addition to the human-readable patterns. So, there is still an incentive optimizing the non-human-readable patterns in the CoT to be as useful as possible, in addition to the human-readable patterns.
However, with this naive approach, the paraphrasings will probably still consistently contain certain kinds of noise from the initial thought. For instance, maybe the paraphraser never replaces an exclamation point with a period, even though this end-of-sentence punctuation doesn't really change the meaning. By modulating whether the step ends with a period or an exclamation point, the model can encode one bit of information per step, despite the paraphraser.
One way to solve this problem is to map every piece of text with a given meaning onto the same "canonical paraphrasing." For instance, we'd want a paraphraser that, given many texts with approximately-identical meanings (like "I own a large home!" "My house is spacious!" and "I possess a sizable dwelling."), maps them all onto a single "canonical" text (like "I have a big house.").
I don't think this proposal improves the thing you claim it improves.
If the initial paraphraser doesn't change between '!' and '.', then learning a canonical-paraphraser from that initial paraphraser won't do it either.
I think what canonical-paraphrasing does is help with cases where the paraphraser stochastically removes specific information, but in an information-theoretically leaky way. For example, if the paraphraser changes between '!' and '.' with only 10% probability, then steganography can use error-correcting codes to still get through.
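A quick back-of-the-envelope check of that claim (my numbers): a paraphraser that swaps '.' and '!' with probability 0.1 turns each sentence ending into a binary symmetric channel, which still carries about half a bit per step, and even a crude repetition code gets a bit through reliably:

```python
import math
from math import comb

p = 0.10                       # probability the paraphraser flips the '.'/'!' bit
def h(q):                      # binary entropy in bits
    return -q * math.log2(q) - (1 - q) * math.log2(1 - q)

capacity = 1 - h(p)            # ~0.53 bits of steganographic payload per sentence ending
print(f"channel capacity per step: {capacity:.2f} bits")

# 5x repetition code with majority vote: a hidden bit survives unless >= 3 of 5 flips occur.
p_bit_error = sum(comb(5, i) * p**i * (1 - p)**(5 - i) for i in range(3, 6))
print(f"repetition-code bit error rate: {p_bit_error:.4f}")  # ~0.0086
```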
Turning a paraphraser into a canonical-paraphraser probably does something like force details into "always removed" vs "never removed" based on a threshold; if '!' and '.' are swapped with 10% probability in the initial paraphraser, then they're always swapped in the final paraphraser; if they're swapped with only 1% probability in the initial paraphraser, then they are never swapped in the final paraphraser.
(Of course this oversimplifies things a lot, and depends completely on how the training to turn paraphrasers into canonical-paraphrasers works.)
I think the only solution to the '.' vs '!' problem, fundamentally, is to train the paraphraser to modify those patterns. If the initial paraphraser doesn't ever change specific details, then those details can be used for steganography. If the paraphraser does change those details, then it is already hard to use them for steganography. So the canonical-paraphrasing idea helps somewhat, but fundamentally, the better the paraphraser (the more closely it captures the human notion of semantic equivalence), the better the anti-steganography guarantee we can get.
Note that the optimized planner is only rewarded for producing steps that help the frozen planner. Since the frozen planner doesn't understand steganography, there is no reason for the optimized planner to learn to use it.
I think this conclusion is overstated. The frozen planner may react to steganography, in the same way that image-recognition NNs react to adversarial examples. The optimized planner is very much like an adversarial example, in the way it is trained to generate outputs which nudge a frozen NN in a specific direction.
Imagine reversing the roles of the optimized vs frozen planner during training: the frozen planner is used for a random number of steps, and then we switch to the optimized planner. Then, it really is the case that the optimized planner has no incentive to use steganography to nudge the frozen planner, since the frozen planner does not get the chance to react at all to the optimized planner.
The downside of this training method is that we can't use your suggestion to only reinforce the very last step rather than all the optimized-planner steps.
Another idea would be to only use the optimized planner to write a single randomly chosen step in the CoT.
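A sketch of those two training variants (hypothetical interfaces: `frozen_step` and `optimized_step` each map a problem and the steps so far to the next CoT step; these names are mine):

```python
import random

def rollout_frozen_prefix(problem, frozen_step, optimized_step, n_steps):
    """Frozen planner writes a random-length prefix; the optimized planner finishes.
    The frozen planner never sees optimized-planner output, so there is nothing
    for steganographic nudging to act on."""
    switch = random.randrange(n_steps)
    steps, trained_indices = [], []
    for i in range(n_steps):
        planner = frozen_step if i < switch else optimized_step
        steps.append(planner(problem, steps))
        if i >= switch:
            trained_indices.append(i)  # only these steps receive the RL update
    return steps, trained_indices

def rollout_single_optimized_step(problem, frozen_step, optimized_step, n_steps):
    """Alternative: the optimized planner writes exactly one randomly chosen step."""
    chosen = random.randrange(n_steps)
    steps = []
    for i in range(n_steps):
        planner = optimized_step if i == chosen else frozen_step
        steps.append(planner(problem, steps))
    return steps, [chosen]
```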
All of these ideas still leave some incentive to use steganography in the adversarial-example sense. At the very least, the final answer (after the CoT) might be influenced by some subtle non-human-readable patterns in the CoT, so the optimized planner will be incentivized to optimize those patterns. In my view, what Incremental Steering does is two things:
- It prevents the model from inventing new steganography; it can only leverage subtle patterns which already influence the LLM before this step in the training.
- It makes it very difficult for the model to learn steganography which requires coordination over multiple steps. It is not being rewarded for such coordination.
Thanks for this response! I agree with the argument. I'm not sure what it would take to ensure CoT faithfulness, but I agree that it is an important direction to try and take things; perhaps even the most promising direction for near-term frontier-lab safety research (given the incentives pushing those labs in o1-ish directions).
How about “purely epistemic” means “updated by self-supervised learning”, i.e. the updates (gradients, trader bankrolls, whatever) are derived from “things being true vs false” as opposed to “things being good vs bad”. Right?
You can define it that way, but then I don't think it's highly relevant for this context.
The story I'm telling here is that partial feedback (typically: learning some sort of input-output relation via some sort of latents) always leaves us with undesired hypotheses which we can't rule out using the restricted feedback mechanism.
- Reinforcement Learning cannot rule out the wireheading hypothesis or human-manipulation hypothesis.
- Updating on things being true or false cannot rule out agentic hypotheses (the inner optimizer problem).
Any sufficiently rich hypotheses space has agentic policies, which can't be ruled out by the feedback. "Purely epistemic" in your sense filters for hypotheses which make good predictions, but this doesn't constrain things to be non-agentic. The system can learn to use predictions as actions in some way.
[I learned the term teleosemantics from you! :) ]
I think it would be fair to define a teleosemantic notion of "purely epistemic" as something like "there is no optimization (anywhere in the system -- 'inner' or 'outer') except optimization for epistemic accuracy".
The obvious application of my main point is that some form of "complete feedback" is a necessary (but insufficient) condition for this.
"Epistemic accuracy" here has to be defined in such a way as to capture the one-way "direction-of-fit" optimization of the map to fit the territory, but never the territory to fit the map. IE the optimization algorithm has to ignore the causal impact of its predictions.
However, I don't particularly endorse this as the correct design choice -- although a system with this property would be relatively safe in the sense of eliminating inner-alignment concerns and (in a sense) outer-alignment concerns, it is doing so by ignoring its impact on the world, which creates its own set of dangers. If such a system were widely deployed and became highly trusted for its predictions, it could stumble into bad self-fulfilling prophecies.
So, in my view, "epistemic" systems should be as transparent as possible with human users about possible multiple-fixed-point issues, try to keep humans in the loop and give the important decisions to humans; but ultimately, we need to view even "purely epistemic" systems as making some important (instrumental) decisions, and have them take some responsibility for making those decisions well instead of poorly.
The original LI paper was in that category, IIUC. The updates (to which traders had more vs less money) are derived from mathematical propositions being true vs false.
I was going to remind you that the paper didn't say how the fixed-point selection works, and we can do that part in an agentic way, but then you go on to say basically the same thing (with the caveat that where you say "put the LI into a larger system that follows the rule: whatever the expectations are about the AI's own actions, make that actually happen" I would say the more general "put the LI into an environment which somehow reacts to its predictions"):
LI defines a notion of logically uncertain variable, which can be used to represent desires
I would say that they don’t really represent desires. They represent expectations about what’s going to happen, possibly including expectations about an AI’s own actions.
And then you can then put the LI into a larger system that follows the rule: whatever the expectations are about the AI’s own actions, make that actually happen.
The important thing that changes in this situation is that the convergence of the algorithm is underdetermined—you can have multiple fixed points. I can expect to stand up, and then I stand up, and my expectation was validated. No update. I can expect to stay seated, and then I stay seated, and my expectation was validated. No update.
(I don’t think I’m saying anything you don’t already know well.)
Anyway, if you do that, then I guess you could say that the LI’s expectations “can be used” to represent desires … but I maintain that that’s a somewhat confused and unproductive way to think about what’s going on. If I intervene to change the LI variable, it would be analogous to changing habits (what do I expect myself to do ≈ which action plans seem most salient and natural), not analogous to changing desires.
(I think the human brain has a system vaguely like LI, and that it resolves the underdetermination by a separate valence system, which evaluates expectations as being good vs bad, and applies reinforcement learning to systematically seek out the good ones.)
I don't understand what you're trying to accomplish in these paragraphs. To me you sound sorta like Bob in the following:
Alice: Here's my computer model of an agent.
Bob: Uh oh, that sounds sort of like active inference. How did you represent values? Did you confuse them with beliefs?
Alice: I used floating-point numbers to represent the expected value of a state. Here, look at my code. It's a Q-learning algorithm.
Bob: You realize that "expected value" is a statistics thing, right? That makes it epistemic, not really value-laden in the sense that makes something agentic. It's a prediction of what a number will be. Indeed, we can justify expected values as min-quadratic-loss estimates. That makes them epistemic!
Alice: Well, I agree that expected values aren't automatically "values" in the agentic sense, but look, my code can solve mazes and stuff -- like a rat learning to get cheese.
Bob: Of course I agree that it can be used in an instrumental way, but that's a really misleading way to describe it overall, right? If you changed one Q-value estimated by the system, that would be analogous to changing habits, not desires, right?
Alice: um??? If we agree that it can be used in an instrumental way, then what are you saying is misleading?
Bob: I mean, sure, the human brain does something like this.
Alice: Ok??
It seems possible that you think we have some disagreement that we don't have?
Also, it is easy for end users to build agentlike things out of belieflike things by making queries about how to accomplish things. Thus, we need to train epistemic systems to be responsible about how such queries are answered (as is already apparent in existing chatbots).
I’m not sure that this is coming from a coherent threat model (or else I don’t follow).
- If Dr. Evil trains his own AGI, then this whole thing is moot, because he wants the AGI to have accurate beliefs about bioweapons.
- If Benevolent Bob trains the AGI and gives API access to Dr. Evil, then Bob can design the AGI to (1) have accurate beliefs about bioweapons, and (2) not answer Dr. Evil’s questions about bioweapons. That might ideally look like what we’re used to in the human world: the AGI says things because it wants to say those things, all things considered, and it doesn’t want Dr. Evil to build bioweapons, either directly or because it’s guessing what Bob would want.
I'm not clear on what you're trying to disagree with here. It sounds like we both agree that if Benevolent Bob builds a powerful "purely epistemic system" (by whatever definition), without limiting its knowledge, then Dr. Evil can misuse it; and we both agree that as a consequence of this, it makes sense to instead build some agency into the system, so that the system can decide not to give users dangerous information.
Possibly you disagree with the claim "it is easy to build agentlike things out of belieflike things"? What I have in mind is a powerful epistemic oracle. As a simple example, let's say it can give highly accurate guesses to mathematically-posed problems. Then Dr. Evil can implement AIXI by feeding in AIXI's mathematical definition, for example. This is the sort of thing I had in mind, but generalized to the nonmathematical case. (EG, "conditional on my owning a super-powerful death ray soon, what actions do I take now")
The thing to prove is precisely convergence and calibration!
Convergence: Given arbitrary observed sequences, the probabilities of the latent bits (program tape and working tape) converge.
(Obviously the non-latent bits converge, if we assume they eventually get observed; hence the focus on the latent bits here.)
Calibration: The sequence of probabilistic predictions has the property that, if we consider the subsequence of only the predictions between 79% and 81% (for example), the empirical frequency of those predictions coming true will eventually be in that range and stay within that range forever. (This should be true for all such ranges.)
Of course, there will also be a number of interesting formal variations on these properties, either weakening them (EG giving conditions where we do get a guarantee) or strengthening them (EG giving explicit bounds on how badly these properties can be violated, rather than only proving that they are obeyed in the limit). For example, converging to predict 50% on adversarial sequences is a much weaker form of a calibration guarantee.
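For concreteness, one way the base calibration property could be written down (my formalization, not a quote; a weaker variant would allow an epsilon of slack on each side):

```latex
% p_t \in [0,1] is the prediction at time t, x_t \in \{0,1\} the outcome,
% and for an interval [a,b] \subseteq [0,1] with infinitely many t such that
% p_t \in [a,b], let S = \{ t : p_t \in [a,b] \}. Then:
\exists N_0 \;\; \forall N \ge N_0 : \qquad
a \;\le\;
\frac{\left|\{\, t \in S : t \le N,\ x_t = 1 \,\}\right|}
     {\left|\{\, t \in S : t \le N \,\}\right|}
\;\le\; b .
```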
Admittedly, one can try to squish beliefs and desires into the same framework. The Active Inference people do that. Does LI do that too?
No. LI defines a notion of logically uncertain variable, which can be used to represent desires. There are also other ways one could build agents out of LI, such as doing the active inference thing.
As I mentioned in the post, I'm agnostic about such things here. We could be building """purely epistemic""" AI out of LI, or we could be deliberately building agents. It doesn't matter very much, in part because we don't have a good notion of purely epistemic.
- Any learning system with a sufficiently rich hypothesis space can potentially learn to behave agentically (whether we want it to or not, until we have anti-inner-optimizer tech), so we should still have corrigibility concerns about such systems.
- In my view, beliefs are a type of decision (not because we smoosh beliefs and values together, but rather because beliefs can have impacts on the world if the world looks at them) which means we should have agentic concerns about beliefs.
- Also, it is easy for end users to build agentlike things out of belieflike things by making queries about how to accomplish things. Thus, we need to train epistemic systems to be responsible about how such queries are answered (as is already apparent in existing chatbots).
I think I've articulated a number of concrete subgoals that require less philosophical skill (they can be approached as math problems). However, in the big picture, novel tiling theorems require novel ideas. This requires philosophical skill.
I think there are some deeper insights around inner optimization that you are missing that would make you more pessimistic here. "Unknown Algorithm" to me means that we don't know how to rule out the possibility of inner agents which have opinions about recursive self-improvement. Part of it is that we can't just think about what it "converges to" (convergence time will be too long for interesting learning systems).
I agree that Solomonoff’s epistemology is noncentral in the way you describe, but I don't think it impacts my points very much; replace Solomonoff with whatever epistemic theory you like. It was just a convenient example.
(Although I expect defenders of Solomonoff to expect the program bits to be meaningful; and I somewhat agree. It's just that the theory doesn't address the meaning there, instead treating programs more like black-box predictors.)
In my view, meaning is the property of being optimized to adhere to some map-territory relationship. However, this optimization itself must always occur within some model (it provides the map-territory relationship to optimize for). In the context of Solomonoff Induction, this may emerge from the incentive to predict, but it is not easy to reason about.
In some sense, reality isn't made of bits, propositions, or any such thing; it is of unknowable type. However, we always describe it via terms of some type (a language).
I'm no longer sure where the disagreement lies, if any, but I still feel like the original post overstates things.
How would you become confident that a UANFSI approach was NFSI?
I have lost interest in the Löbian approach to tiling, because probabilistic tiling results seem like they can be strong enough and with much less suspicious-looking solutions. Expected value maximization is a better way of looking at agentic behavior anyway. Trying to logically prove some safety predicate for all actions seems like a worse paradigm than trying to prove some safety properties for the system overall (including proving that those properties tile under as-reasonable-as-possible assumptions, plus sanity-checking what happens when those assumptions aren't precisely true).
I do think Löb-ish reasoning still seems potentially important for coordination and cooperation, which I expect to feature in important tiling results (if this research program continues to make progress). However, I am optimistic about replacing Löb's Theorem with Payor's Lemma in this context.
I don't completely discount the pivotal-act approach, but I am currently more optimistic about developing safety criteria & designs which could achieve some degree of consensus amongst researchers, and make their way into commercial AI, perhaps through regulation.
Yeah, I totally agree. This was initially a quick private message to someone, but I thought it was better to post it publicly despite the inadequate explanations. I think the idea deserves a better write-up.
It seems to me that the importance and interaction of these different types of power in the future depends a lot on our choices now, ie, what kind of future we shape. Hierarchies could get smashed in one way or another, making John's prediction correct, or we could engineer a future that evolves from the present more smoothly, in which case you'd be correct.
One thing I don't understand / don't agree with here is the move from propositions to models. It seems to me that models can be (and usually are) understood in terms of propositions.
For example, Solomonoff understands models as computer programs which generate predictions. However, computer programs are constructed out of bits, which can be understood as propositions. The bits are not very meaningful in isolation; the claim "program-bit number 37 is a 1" has almost no meaning in the absence of further information about the other program bits. However, this isn't much of an issue for the formalism.
Similarly, I expect that any attempt to formally model "models" can be broken down into propositions. EG, if someone claimed that humans understand the world in terms of systems of differential equations, this would still be well-facilitated by a concept of propositions (ie, the equations).
It seems to me like a convincing abandonment of propositions would have to be quite radical, abandoning the idea of formalism entirely. This is because you'd have to explain why your way of thinking about models is not amenable to a mathematical treatment (since math is commonly understood in terms of propositions).
So (a) I'm not convinced that thinking in terms of propositions makes it difficult to think in terms of models; (b) it seems to me that refusing to think in terms of propositions would make it difficult to think in terms of models.
"X is false" has to be modeled as something that is value 1 if and only if X is value 0, but continuously decreases in value as X continuously increases in value. The simplest formula is value(X is false) = 1-value(X). However, we can made "sharper" formulas which diminish in value more rapidly as X increases in value. Hartry Field constructs a hierarchy of such predicates which he calls "definitely false", "definitely definitely false", etc.
Proof systems for the logic should have the property that sentences are derivable only when they have value 1; so "X is false" or "X is definitely false" etc all share the property that they're only derivable when X has value zero.
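For illustration (my formulas, chosen only to show the "sharper" pattern; Field's own hierarchy differs in its details):

```latex
v(\text{``}X\text{ is false''}) = 1 - v(X), \qquad
v(\text{``}X\text{ is definitely false''}) = \max\bigl(0,\, 1 - 2\,v(X)\bigr), \qquad
v(\text{``}X\text{ is definitely definitely false''}) = \max\bigl(0,\, 1 - 4\,v(X)\bigr).
```

Each of these is continuous, takes value 1 exactly when v(X) = 0, and falls to 0 more and more rapidly as v(X) increases, so each is derivable only when X has value zero.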
For me, I felt like publishing in scientific journals required me to be dishonest.
...what?
Quoting a little more context:
I encounter this idea all the time when I’m talking to academics about academia. I give ‘em my whole spiel about publishing, being honest, blah blah blah, and they go, “Well, we don’t live in a utopia. You have to make tradeoffs in life.” Yes, of course! But the whole point of tradeoffs is to trade something you value less for something you value more. The thing you care about the most—that’s the thing you don’t compromise on!
(This is, by the way, Negotiations 101.)
For me, I felt like publishing in scientific journals required me to be dishonest. So I stopped publishing in scientific journals.
The "whole spiel" has a link to another essay by the same author. At the very end, it gives an example of what they mean by "being honest" -- what science can look like when one isn't worried about peer review.
I'm not arguing against fuzzy logic, just that it arguably doesn't "morally" solve the liar paradox, insofar as it yields similar revenge paradoxes.
It has been years since I've read the book, so this might be a little bit off, but Field's response to revenge is basically this:
- The semantic values (which are more complex than fuzzy values, but I'll pretend for simplicity that they're just fuzzy values) are a model of what's going on, not the literal story. This idea is intended to respond to complaints like "but we can obviously refer to truth-value-less-than-one, which gives rise to a revenge paradox". The point of the model is to inform us about what inference systems might be sound and consistent, although we can only ever prove this in a toy setting, thanks to Gödel's theorems.
- So, indeed, within this model, "is exactly false" doesn't make sense. Speaking outside this model, it may seem to make sense, but we can only step outside of it because it is a toy model.
- However, we do get the ability to state ever-stronger Liar sentences with a "definitely" operator ("definitely x" is intuitively twice as strong a truth-claim as "x"). So the theory deals with revenge problems in that sense by formulating an infinite hierarchy of Strengthened Liars, none of which cause a problem (a toy version of this hierarchy is sketched below). IIRC Hartry's final theory even handles iteration of the "definitely" operator infinitely many times (unlike fuzzy logic).
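Here is that toy version (my own sketch, not Field's construction: I'm assuming Łukasiewicz negation value(not X) = 1 - value(X) and the illustrative "definitely" operator D(v) = max(0, 2v - 1)):

```python
# Toy sketch: each Strengthened Liar "this sentence is not definitely^n true"
# gets a consistent fuzzy value, found as the fixed point of v = 1 - D^n(v).

def D(v: float) -> float:
    """An illustrative 'definitely' operator (not Field's)."""
    return max(0.0, 2.0 * v - 1.0)

def definitely_n(v: float, n: int) -> float:
    for _ in range(n):
        v = D(v)
    return v

def liar_value(n: int) -> float:
    """Fixed point of v = 1 - definitely_n(v, n), found by bisection."""
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if 1 - definitely_n(mid, n) > mid:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

print(liar_value(0))  # ~0.5   -- the ordinary Liar: v = 1 - v
print(liar_value(1))  # ~0.667 -- "this sentence is not definitely true"
print(liar_value(2))  # ~0.8   -- "... not definitely definitely true"
```

None of these causes a contradiction; each strengthening just pushes the consistent value closer to 1 without ever reaching it.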
In natural language we arguably can't just place restrictions, like banning non-continuous truth functions such as "is exactly false", even if we don't have a more appealing resolution. We can only impose voluntary restrictions on formal languages. For natural language, the only hope would be to argue that the predicate "is exactly false" doesn't really make sense, or doesn't actually yield a contradiction, though that seems difficult.
Of course, in some sense natural language is an amorphous blob, which we can only formally model as an instrumentally useful action-space. The question, for me, is about normative reasoning -- how can we model as many of the strengths of natural language as possible, while also keeping as many of the strengths of formal logic as possible?
So I do think fuzzy logic makes some positive progress on the Liar and on revenge problems, and Hartry's proposal makes more positive progress.
Any solution to the semantic paradoxes must accept something counterintuitive. In the case of using something like fuzzy logic, I must accept the restriction that valid truth-functions must be continuous (or at least satisfy the conditions of Kakutani's fixed-point theorem). I don't claim this is the final word on the subject (I recognize that fuzzy logic has some severe limitations; I mostly defer to Hartry Field on how to get around these problems). However, I do think it captures a lot of reasonable intuitions. I would challenge you to name a more appealing resolution.
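To make the continuity point concrete (a toy illustration of mine): a self-referential sentence "this sentence is F" needs a value $v$ with $v = F(v)$. A continuous $F : [0,1] \to [0,1]$ always has such a fixed point, whereas the revenge predicate "is exactly false" does not:

```python
# The discontinuous predicate "is exactly false" has no consistent
# self-referential value: v = exactly_false(v) has no solution.

def exactly_false(v: float) -> float:
    return 1.0 if v == 0.0 else 0.0

# v = 0 gives exactly_false(0) = 1 != 0; any v > 0 gives exactly_false(v) = 0 != v.
print(any(abs(v / 100 - exactly_false(v / 100)) < 1e-9 for v in range(101)))  # False
```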
Richard is arguing against foundational pictures which assume these problems away, and in favor of foundational pictures which recognize them.
I think you should handle the problems separately. In which case, when reasoning about truth, you should indeed assume away communication difficulties. If our communication technology was so bad that 30% of our words got dropped from every message, the solution would not be to change our concept of meanings; the solution would be to get better at error correction, ideally at a lower level, but if necessary by repeating ourselves and asking for clarification a lot.
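To make the error-correction point concrete (a toy calculation, assuming each word is dropped independently):

```python
# With a 30% independent drop rate, naive repetition alone drives the
# probability of losing a given word down geometrically:
drop = 0.3
for repeats in (1, 2, 3, 4):
    print(repeats, drop ** repeats)  # 0.3, 0.09, 0.027, 0.0081
```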
You seem to be assuming that these issues arise only due to communication difficulties, but I'm not completely on board with that assumption. My argument is that these issues are fundamental to map-territory semantics (or, indeed, any concept of truth).
One argument for this is to note that the communicators don't necessarily have the information needed to resolve the ambiguity, even in principle, because we don't think in completely unambiguous concepts. We employ vague concepts like baldness, table, chair, etc. So it is not as if we have completely unambiguous pictures in mind, and merely run into difficulties when we try to communicate.
A stronger argument for the same conclusion relies on structural properties of truth. So long as we want to be able to talk and reason about truth in the same language that the truth-judgements apply to, we will run into self-referential problems. Crisp true-false logic has greater difficulties dealing with these problems than many-valued logics such as fuzzy logic.
I argue that meanings are fundamentally fuzzy. In the end, we can interpret things your way, if we think of fuzzy truth-values as sent to "true" or "false" based on an unknown threshold (which we can have a probability distribution over). However, it is worth noting that the fuzzy truth-values can be logically coherent in a way that the true-false values cannot: the fuzzy truth predicate is just an identity function (so "X is true" has the same fuzzy truth-value as "X"), and this allows us to reason about paradoxical sentences in a completely consistent way (so the self-referential sentence "this sentence is not true" gets value 1/2). But when we force things to be true or false, the resulting picture must violate some rules of logic ("this sentence is not true" must either be true or false, neither of which is consistent with classical logic's rules of reasoning).
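A minimal sketch of both points (my own illustration; the Łukasiewicz negation value(not X) = 1 - value(X) and the uniformly-distributed threshold are assumptions made for the sake of the example):

```python
import random

# 1. The Liar L = "L is not true": with the truth predicate as identity,
#    its value v must satisfy v = 1 - v, which has the consistent
#    fuzzy solution v = 1/2.
liar_value = 0.5
assert liar_value == 1 - liar_value

# 2. The "unknown threshold" reading: send a fuzzy value v to True whenever
#    v >= t, with t drawn uniformly from [0, 1]. The probability of "True"
#    then just recovers v itself.
def prob_true(v: float, samples: int = 100_000) -> float:
    hits = sum(1 for _ in range(samples) if v >= random.uniform(0.0, 1.0))
    return hits / samples

print(prob_true(0.7))  # roughly 0.7
print(prob_true(0.5))  # roughly 0.5 -- but for the Liar, *neither* crisp
                       # assignment (True or False) is classically consistent,
                       # which is where the thresholded picture breaks down.
```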
Digging further into your proposal, what makes a sentence true or false? It seems that you suppose a person always has a precise meaning in mind when they utter a sentence. What sort of object do you think this meaning is? Do you think it is plausible, say, that the human uttering the sentence could always say later whether it was true or false, if given enough information about the world? Or is it possible that even the utterer could be unsure about that, even given all the relevant facts about how the world is? I think even the original utterer can remain unsure -- but then, where does the fact-of-the-matter reside?
For example, if I claim something is 500 meters away, I don't have a specific range in mind. I'll know I was correct if more accurate measurement establishes that it's 500.001 meters away. I'll know I was incorrect if it turns out to be 10 meters away. However, there will be some boundary where I will be inclined to say my intended claim was vague, and I "fundamentally" don't know -- I don't think there was a fact of the matter about the precise range I intended with my statement, beyond some level of detail.
So, if you propose to model this no-fact-of-the-matter as uncertainty, what do you propose it is uncertain about? Where does this truth reside, and how would it be checked/established?
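For concreteness, here is one toy way to model the "500 meters" claim with a graded truth-value rather than a hidden sharp cutoff (the particular shape and numbers are arbitrary choices of mine):

```python
def value_of_claim(measured_distance_m: float) -> float:
    """Toy fuzzy truth-value of 'it is 500 meters away', given a measurement."""
    error = abs(measured_distance_m - 500.0)
    if error <= 5.0:     # clearly close enough
        return 1.0
    if error >= 100.0:   # clearly wrong
        return 0.0
    # In between, the truth-value falls off smoothly; there is no single
    # precise range the speaker "really meant".
    return 1.0 - (error - 5.0) / 95.0

print(value_of_claim(500.001))  # 1.0   (clearly correct)
print(value_of_claim(510.0))    # ~0.95 (roughly correct)
print(value_of_claim(560.0))    # ~0.42 (the vague region)
print(value_of_claim(10.0))     # 0.0   (clearly incorrect)
```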
That's an important move to make, but it is also important to notice how radically context-dependent and vague our language is, to the point where you can't really eliminate the context-dependence and vagueness via taboo (because the new words you use will still be somewhat context-dependent and vague). Working against these problems is pragmatically useful, but recognizing their prevalence can be a part of that. Richard is arguing against foundational pictures which assume these problems away, and in favor of foundational pictures which recognize them.
It seems to me that there is a really interesting interplay of different forces here, which we don't yet know how to model well.
Even if Alice tries meticulously to only say literally true things, and be precise about her meanings, Bob can and should infer more than what Alice has literally said, by working backwards to infer why she has said it rather than something else.
So, pragmatics is inevitable, and we'd be fools not to take advantage of it.
However, we also really like transparent contexts -- that is, we like to be able to substitute phrases for equivalent phrases (equational reasoning, like algebra), and make inferences based on substitution-based reasoning (if all bachelors are single, and Jerry is a bachelor, then Jerry is single).
To put it simply, things are easier when words have context-independent meanings (or more realistically, meanings which are valid across a wide array of contexts, although nothing will be totally context-independent).
This puts contradictory pressure on language. Pragmatics puts pressure towards highly context-dependent meaning; reasoning puts pressure towards highly context-independent meaning.
If someone argues a point by conflation (uses a word in two different senses, but makes an inference as if the word had one sense) then we tend to fault using the same word in two different senses, rather than fault basic reasoning patterns like transitivity of implication (A implies B, and B implies C, so A implies C). Why is that? Is that the correct choice? If meanings are inevitably context-dependent anyway, why not give up on reasoning? ;p
Yeah, I agree with that.