Posts

dxu's Shortform 2023-12-04T00:05:24.263Z

Comments

Comment by dxu on 'Empiricism!' as Anti-Epistemology · 2024-03-14T15:31:04.427Z · LW · GW

To the extent that I buy the story about imitation-based intelligences inheriting safety properties via imitative training, I correspondingly expect such intelligences not to scale to having powerful, novel, transformative capabilities—not without an amplification step somewhere in the mix that does not rely on imitation of weaker (human) agents.

Since I believe this, it's hard for me to concretely visualize the hypothetical of a superintelligent GPT+DPO agent that nevertheless only does what it is instructed to do. I mostly don't expect to be able to get to superintelligence without either (1) the "RL" portion of the GPT+RL paradigm playing a much stronger role than it does for current systems, or (2) using some other training paradigm entirely. And the argument for obedience/corrigibility becomes, respectively, weaker or nonexistent in each of those cases.

Possibly we're in agreement here? You say you expect GPT+DPO to stagnate and be replaced by something else; I agree with that. I merely happen to think the reason it will stagnate is that its safety properties don't come free; they're bought and paid for at a price in capabilities.

Comment by dxu on 'Empiricism!' as Anti-Epistemology · 2024-03-14T13:43:19.070Z · LW · GW

That (on its own, without further postulates) is a fully general argument against improving intelligence.

Well, it's primarily a statement about capabilities. The intended construal is that if a given system's capabilities profile permits it to accomplish some sufficiently transformative task, then that system's capabilities are not limited to only benign such tasks. I think this claim applies to most intelligences that can arise in a physical universe like our own (though necessarily not in all logically possible universes, given NFL theorems): that there exists no natural subclass of transformative tasks that includes only benign such tasks.

(Where, again, the rub lies in operationalizing "transformative" such that the claim follows.)

We have to accept some level of danger inherent in existence; the question is what makes AI particularly dangerous. If this special factor isn't present in GPT+DPO, then GPT+DPO is not an AI notkilleveryoneism issue.

I'm not sure how likely GPT+DPO (or GPT+RLHF, or in general GPT-plus-some-kind-of-RL) is to be dangerous in the limits of scaling. My understanding of the argument against is that the base (large language) model derives most (if not all) of its capabilities from imitation, and the amount of RL needed to elicit desirable behavior from that base set of capabilities isn't enough to introduce substantial additional strategic/goal-directed cognition compared to the base imitative paradigm, i.e. the amount and kinds of training we'll be doing in practice are more likely to bias the model towards behaviors that were already a part of the base model's (primarily imitative) predictive distribution, than they are to elicit strategic thinking de novo.

That strikes me as substantially an empirical proposition, which I'm not convinced the evidence from current models says a whole lot about. But the disjunct I mentioned doesn't come in as an argument for or against that proposition; you can instead see it as a larger claim that parametrizes the class of systems for which the smaller claim might or might not be true, with respect to certain capabilities thresholds associated with specific kinds of tasks. And what the larger claim says is that, to the extent that GPT+DPO (and associated paradigms) fail to produce reasoners which could (in terms of capability, saying nothing about alignment or "motive") be dangerous, they will also fail to be "transformative"—which in turn is an issue in precisely those worlds where systems with "transformative" capabilities are economically incentivized over systems without those capabilities (which is itself another empirical question!).

Comment by dxu on 'Empiricism!' as Anti-Epistemology · 2024-03-14T11:44:44.716Z · LW · GW

The methods we already have are not sufficient to create ASI, and also if you extrapolate out the SOTA methods at larger scale, it's genuinely not that dangerous.

I think I like the disjunct “If it’s smart enough to be transformative, it’s smart enough to be dangerous”, where the contrapositive further implies competitive pressures towards creating something dangerous (as opposed to not doing that).

There’s still a rub here—namely, operationalizing “transformative” in such a way as to give the necessary implications (both “transformative -> dangerous” and “not transformative -> competitive pressures towards capability gain”). This is where I expect intuitions to differ the most, since in the absence of empirical observations there seem to be multiple consistent views.

Comment by dxu on The Aspiring Rationalist Congregation · 2024-01-11T23:33:23.106Z · LW · GW

(9) is a values thing, not a beliefs thing per se. (I.e. it's not an epistemic claim.)

(11) is one of those claims that is probabilistic in principle (and which can therefore be updated via evidence), but for which the evidence in practice is so one-sided that arriving at the correct answer is basically usable as a sort of FizzBuzz test for rationality: if you can’t get the right answer on super-easy mode, you’re probably not a good fit.

Comment by dxu on dxu's Shortform · 2023-12-04T00:05:24.400Z · LW · GW

Something I wrote recently as part of a private conversation, which feels relevant enough to ongoing discussions to be worth posting publicly:

The way I think about it is something like: a "goal representation" is basically what you get when it's easier to state some compact specification on the outcome state, than it is to state an equivalent set of constraints on the intervening trajectories to that state.

In principle, this doesn't have to equate to "goals" in the intuitive, pretheoretic sense, but in practice my sense is that this happens largely when (and because) permitting longer horizons (in the sense of increasing the length of the minimal sequence needed to reach some terminal state) causes the intervening trajectories to explode in number and complexity, s.t. it's hard to impose meaningful constraints on those trajectories that don't map to (and arise from) some much simpler description of the outcomes those trajectories lead to.
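
(For concreteness, here's a minimal sketch of the explosion I mean, using a toy gridworld that is purely my own illustrative assumption: the outcome specification stays constant-size no matter the horizon, while the number of trajectories reaching it grows combinatorially.)

from math import comb

def goal_reached(state, target=(10, 10)):
    # Compact outcome specification: constant size, independent of horizon.
    return state == target

def num_shortest_trajectories(target=(10, 10)):
    # Number of distinct monotone paths from (0, 0) to the target:
    # C(x + y, x), which explodes as the horizon grows.
    x, y = target
    return comb(x + y, x)

for n in (2, 5, 10, 20):
    print(n, num_shortest_trajectories((n, n)))
# 2 -> 6, 5 -> 252, 10 -> 184756, 20 -> 137846528820

Constraining each of those trajectories directly gets hopeless fast, while constraining the outcome stays cheap, which is (roughly) the condition under which I'd expect goal representations to show up.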

This connects with the "reasoners compress plans" point, on my model, because a reasoner is effectively a way to map that compact specification on outcomes to some method of selecting trajectories (or rather, selecting actions which select trajectories); and that, in turn, is what goal-oriented reasoning is. You get goal-oriented reasoners ("inner optimizers") precisely in those cases where that kind of mapping is needed, because simple heuristics relating to the trajectory instead of the outcome don't cut it.

It's an interesting question as to where exactly the crossover point occurs, where trajectory-heuristics stop functioning as effectively as consequentialist outcome-based reasoning. On one extreme, there are examples like tic-tac-toe, where it's possible to play perfectly based on a myopic set of heuristics without any kind of search involved. But as the environment grows more complex, the heuristic approach will in general be defeated by non-myopic, search-like, goal-oriented reasoning (unless the latter is too computationally intensive to be implemented).

That last parenthetical adds a non-trivial wrinkle, and in practice reasoning about complex tasks subject to bounded computation does best via a combination of heuristic-based reasoning about intermediate states, coupled to a search-like process of reaching those states. But that already qualifies in my book as "goal-directed", even if the "goal representations" aren't as clean as in the case of something like (to take the opposite extreme) AIXI.

To me, all of this feels somewhat definitionally true (though not completely, since the real-world implications do depend on stuff like how complexity trades off against optimality, where the "crossover point" lies, etc). It's just that, in my view, the real world has already provided us enough evidence about this that our remaining uncertainty doesn't meaningfully change the likelihood of goal-directed reasoning being necessary to achieve longer-term outcomes of the kind many (most?) capabilities researchers have ambitions about.

Comment by dxu on Ability to solve long-horizon tasks correlates with wanting things in the behaviorist sense · 2023-11-24T22:26:27.034Z · LW · GW

It's pretty unclear if a system that is good at answering the question "Which action would maximize the expected amount of X?" also "wants" X (or anything else) in the behaviorist sense that is relevant to arguments about AI risk. The question is whether, if you ask that system "Which action would maximize the expected amount of Y?", it will also be wanting the same thing, or whether it will just be using cognitive procedures that are good at figuring out what actions lead to what consequences.

Here's an existing Nate!comment that I find reasonably persuasive, which argues that these two things are correlated in precisely those cases where the outcome requires routing through lots of environmental complexity:

Part of what's going on here is that reality is large and chaotic. When you're dealing with a large and chaotic reality, you don't get to generate a full plan in advance, because the full plan is too big. Like, imagine a reasoner doing biological experimentation. If you try to "unroll" that reasoner into an advance plan that does not itself contain the reasoner, then you find yourself building this enormous decision-tree, like "if the experiments come up this way, then I'll follow it up with this experiment, and if instead it comes up that way, then I'll follow it up with that experiment", and etc. This decision tree quickly explodes in size. And even if we didn't have a memory problem, we'd have a time problem -- the thing to do in response to surprising experimental evidence is often "conceptually digest the results" and "reorganize my ontology accordingly". If you're trying to unroll that reasoner into a decision-tree that you can write down in advance, you've got to do the work of digesting not only the real results, but the hypothetical alternative results, and figure out the corresponding alternative physics and alternative ontologies in those branches. This is infeasible, to say the least.

Reasoners are a way of compressing plans, so that you can say "do some science and digest the actual results", instead of actually calculating in advance how you'd digest all the possible observations. (Note that the reasoner specification comprises instructions for digesting a wide variety of observations, but in practice it mostly only digests the actual observations.)

Like, you can't make an "oracle chess AI" that tells you at the beginning of the game what moves to play, because even chess is too chaotic for that game tree to be feasibly representable. You've gotta keep running your chess AI on each new observation, to have any hope of getting the fragment of the game tree that you consider down to a manageable size.

Like, the outputs you can get out of an oracle AI are "no plan found", "memory and time exhausted", "here's a plan that involves running a reasoner in real-time" or "feed me observations in real-time and ask me only to generate a local and by-default-inscrutable action". In the first two cases, your oracle is about as useful as a rock; in the third, it's the realtime reasoner that you need to align; in the fourth, all [the] word "oracle" is doing is mollifying you unduly, and it's this "oracle" that you need to align.


Could you give an example of a task you don't think AI systems will be able to do before they are "want"-y? At what point would you update, if ever? What kind of engineering project requires an agent to be want-y to accomplish it? Is it something that individual humans can do? (It feels to me like you will give an example like "go to the moon" and that you will still be writing this kind of post even once AI systems have 10x'd the pace of R&D.)

Here's an existing Nate!response to a different-but-qualitatively-similar request that, on my model, looks like it ought to be a decent answer to yours as well:

a thing I don't expect the upcoming multimodal models to be able to do: train them only on data up through 1990 (or otherwise excise all training data from our broadly-generalized community), ask them what superintelligent machines (in the sense of IJ Good) should do, and have them come up with something like CEV (a la Yudkowsky) or indirect normativity (a la Beckstead) or counterfactual human boxing techniques (a la Christiano) or suchlike.

Note that this is only tangentially a test of the relevant ability; very little of the content of what-is-worth-optimizing-for occurs in Yudkowsky/Beckstead/Christiano-style indirection. Rather, coming up with those sorts of ideas is a response to glimpsing the difficulty of naming that-which-is-worth-optimizing-for directly and realizing that indirection is needed. An AI being able to generate that argument without following in the footsteps of others who have already generated it would be at least some evidence of the AI being able to think relatively deep and novel thoughts on the topic.

(The original discussion that generated this example was couched in terms of value alignment, but it seems to me the general form "delete all discussion pertaining to some deep insight/set of insights from the training corpus, and see if the model can generate those insights from scratch" constitutes a decent-to-good test of the model's cognitive planning ability.)

(Also, I personally think it's somewhat obvious that current models are lacking in a bunch of ways that don't nearly require the level of firepower implied by a counterexample like "go to the moon" or "generate this here deep insight from scratch", s.t. I don't think current capabilities constitute much of an update at all as far as "want-y-ness" goes, and continue to be puzzled at what exactly causes [some] LLM enthusiasts to think otherwise.)

Comment by dxu on On the lethality of biased human reward ratings · 2023-11-18T10:57:14.735Z · LW · GW

I think I'm not super into the U = V + X framing; that seems to inherently suggest that there exists some component of the true utility V "inside" the proxy U everywhere, and which is merely perturbed by some error term rather than washed out entirely (in the manner I'd expect to see from an actual misspecification). In a lot of the classic Goodhart cases, the source of the divergence between measurement and desideratum isn't regressional, and so V and X aren't independent.

(Consider e.g. two arbitrary functions U' and V', and compute the "error term" X' between them. It should be obvious that when U' is maximized, X' is much more likely to be large than V' is; which is simply another way of saying that X' isn't independent of V', since it was in fact computed from V' (and U'). The claim that the reward model isn't even "approximately correct", then, is basically this: that there is a separate function U being optimized whose correlation with V within-distribution is in some sense coincidental, and that out-of-distribution the two become basically unrelated, rather than one being expressible as a function of the other plus some well-behaved error term.)
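
(If it helps, here's a minimal simulation of that parenthetical; the domain and distributions are assumptions I'm making purely for illustration. Draw U' and V' independently at random, define X' = U' - V', and look at what happens at the argmax of U'.)

import numpy as np

rng = np.random.default_rng(0)
N_POINTS, N_TRIALS = 1000, 10_000

v_at_max, x_at_max = [], []
for _ in range(N_TRIALS):
    U = rng.standard_normal(N_POINTS)   # arbitrary proxy, the thing being optimized
    V = rng.standard_normal(N_POINTS)   # arbitrary "true" utility, independent of U
    X = U - V                           # the induced "error term"
    i = np.argmax(U)                    # optimize the proxy
    v_at_max.append(V[i])
    x_at_max.append(X[i])

print("mean V' at argmax U':", np.mean(v_at_max))   # ~0: optimizing U' does nothing for V'
print("mean X' at argmax U':", np.mean(x_at_max))   # ~3.2: the "error" soaks up the optimization
print("P(X' > V' at argmax U'):", np.mean(np.greater(x_at_max, v_at_max)))  # well above 1/2

Which is just the non-independence of X' and V' showing up numerically.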

Comment by dxu on On the lethality of biased human reward ratings · 2023-11-18T10:30:09.277Z · LW · GW

(Which, for instance, seems true about humans, at least in some cases: If humans had the computational capacity, they would lie a lot more and calculate personal advantage a lot more. But since those are both computationally expensive, and therefore can be caught-out by other humans, the heuristic / value of "actually care about your friends", is competitive with "always be calculating your personal advantage."

I expect this sort of thing to be less common with AI systems that can have much bigger "cranial capacity". But then again, I guess that at whatever level of brain size, there will be some problems for which it's too inefficient to do them the "proper" way, and for which comparatively simple heuristics / values work better.

But maybe at high enough cognitive capability, you just have a flexible, fully-general process for evaluating the exact right level of approximation for solving any given problem, and the binary distinction between doing things the "proper" way and using comparatively simpler heuristics goes away. You just use whatever level of cognition makes sense in any given micro-situation.)

+1; this seems basically similar to the cached argument I have for why human values might be more arbitrary than we'd like—very roughly speaking, they emerged from a solution to a specific set of computational tradeoffs encountered while navigating a specific set of repeated-interaction games, plus a bunch of contingent historical religion/philosophy layered on top of that. (That second part isn't in the argument you [Eli] gave, but it seems relevant to point out; not all historical cultures ended up valuing egalitarianism/fairness/agency the way we seem to.)

Comment by dxu on Genetic fitness is a measure of selection strength, not the selection target · 2023-11-16T03:12:57.964Z · LW · GW

It sounds like you're arguing that uploading is impossible, and (more generally) have defined the idea of "sufficiently OOD environments" out of existence. That doesn't seem like valid thinking to me.

Comment by dxu on Genetic fitness is a measure of selection strength, not the selection target · 2023-11-14T05:25:11.601Z · LW · GW

Notice I replied to that comment you linked and agreed with John, not that any generalized vector dot product model is wrong, but that the specific one in that post is wrong, as it doesn't weight by expected probability (i.e. an incorrect distance function).

Anyway I used that only as a convenient example to illustrate a model which separates degree of misalignment from net impact, my general point does not depend on the details of the model and would still stand for any arbitrarily complex non-linear model.

The general point being that degree of misalignment is only relevant to the extent it translates into a difference in net utility.

Sure, but if you need a complicated distance metric to describe your space, that makes it correspondingly harder to actually describe utility functions corresponding to vectors within that space which are "close" under that metric.

If you actually believe the sharp left turn argument holds water, where is the evidence?

As I said earlier, this evidence must take a specific form, as evidence in the historical record.

Hold on; why? Even for simple cases of goal misspecification, the misspecification may not become obvious without a sufficiently OOD environment; does that thereby mean that no misspecification has occurred?

And in the human case, why does it not suffice to look at the internal motivations humans have, and describe plausible changes to the environment for which those motivations would then fail to correspond even approximately to IGF, as I did w.r.t. uploading?

But I see that as much more contingent than necessarily true, and mainly a consequence of the fact that, for all of our technological advances, we haven't actually given rise to that many new options preferable to us but not to IGF. On the other hand, something like uploading I would expect to completely shatter any relation our behavior has to IGF maximization.

It seems to me that this suffices to establish that the primary barrier against such a breakdown in correspondence is that of insufficient capabilities—which is somewhat the point!

Comment by dxu on Genetic fitness is a measure of selection strength, not the selection target · 2023-11-05T03:39:01.703Z · LW · GW

No AI we create will be perfectly aligned, so instead all that actually matters is the net utility that AI provides for its creators: something like the dot product between our desired future trajectory and that of the agents. More powerful agents/optimizers will move the world farther faster (longer trajectory vector) which will magnify the net effect of any fixed misalignment (cos angle between the vectors), sure. But that misalignment angle is only relevant/measurable relative to the net effect - and by that measure human brain evolution was an enormous unprecedented success according to evolutionary fitness.

The vector dot product model seems importantly false, for basically the reason sketched out in this comment; optimizing a misaligned proxy isn't about taking a small delta and magnifying it, but about transitioning to an entirely different policy regime (vector space) in which the angle between our proxy and our true alignment target is much, much larger (with their dot product effectively no different from that of any other randomly selected pair of vectors in the new space).

(You could argue humans haven't fully made that phase transition yet, and I would have some sympathy for that argument. But I see that as much more contingent than necessarily true, and mainly a consequence of the fact that, for all of our technological advances, we haven't actually given rise to that many new options preferable to us but not to IGF. On the other hand, something like uploading I would expect to completely shatter any relation our behavior has to IGF maximization.)
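
(As a rough numerical intuition for "no different from a randomly selected pair of vectors": random directions in a high-dimensional space are nearly orthogonal, so a proxy that has effectively been re-drawn relative to the target retains almost none of the original alignment. The dimensions and distributions below are assumptions chosen purely for illustration.)

import numpy as np

rng = np.random.default_rng(0)

def mean_abs_cosine(dim, trials=2000):
    # Average |cosine similarity| between independent random vectors in R^dim.
    a = rng.standard_normal((trials, dim))
    b = rng.standard_normal((trials, dim))
    cos = np.sum(a * b, axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
    return np.abs(cos).mean()

for dim in (3, 30, 300, 3000):
    print(dim, round(mean_abs_cosine(dim), 3))
# Typical |cos| shrinks like ~1/sqrt(dim): roughly 0.5, 0.15, 0.05, 0.015.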

Comment by dxu on What's Hard About The Shutdown Problem · 2023-11-05T03:01:05.474Z · LW · GW

It looks a bit to me like your Timestep Dominance Principle forbids the agent from selecting any trajectory which loses utility at a particular timestep in exchange for greater utility at a later timestep, regardless of whether the trajectory in question actually has anything to do with manipulating the shutdown button? After all, conditioning on the shutdown being pressed at any point after the local utility loss but before the expected gain, such a decision would give lower sum-total utility within those conditional trajectories than one which doesn't make the sacrifice.

That doesn't seem like behavior we really want; depending on how closely together the "timesteps" are spaced, it could even wreck the agent's capabilities entirely, in the sense of no longer being able to optimize within button-not-pressed trajectories.

(It also doesn't seem to me a very natural form for a utility function to take, assigning utility not just to terminal states, but to intermediate states as well, and then summing across the entire trajectory; humans don't appear to behave this way when making plans, for example. If I considered the possibility of dying at every instant between now and going to the store, and permitted myself only to take actions which Pareto-improve the outcome set after every death-instant, I don't think I'd end up going to the store, or doing much of anything at all!)
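
(To make the worry concrete, here's the arithmetic I have in mind, with made-up per-timestep utilities and my own reading of the setup: a "sacrifice" trajectory gives up a little utility early in exchange for a larger gain later.)

no_sacrifice = [1, 1, 1, 1, 1]
sacrifice    = [1, 0, 1, 1, 5]   # loses 1 util at t=2 in exchange for +4 at t=5

def utility_given_shutdown_after(traj, t):
    # Sum-total utility conditional on the button being pressed right after timestep t.
    return sum(traj[:t])

for t in range(1, 6):
    print(t, utility_given_shutdown_after(no_sacrifice, t),
             utility_given_shutdown_after(sacrifice, t))
# Conditional on shutdown at t = 2, 3, or 4 (after the loss, before the gain),
# the sacrificing trajectory is strictly worse (1 < 2, 2 < 3, 3 < 4);
# it only pulls ahead if shutdown comes after t = 5 (8 > 5).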

Comment by dxu on Invulnerable Incomplete Preferences: A Formal Statement · 2023-10-31T16:33:24.207Z · LW · GW

In your example, DSM permits the agent to end up with either A+ or B. Neither is strictly dominated, and neither has become mandatory for the agent to choose over the other. The agent won't have reason to push probability mass from one towards the other.

But it sounds like the agent's initial choice between A and B is forced, yes? (Otherwise, it wouldn't be the case that the agent is permitted to end up with either A+ or B, but not A.) So the presence of A+ within a particular continuation of the decision tree influences the agent's choice at the initial node, in a way that causes it to reliably choose one incomparable option over another.

Further thoughts: under the original framing, instead of choosing between A and B (while knowing that B can later be traded for A+), the agent instead chooses whether to go "up" or "down" to receive (respectively) A, or a further choice between A+ and B. It occurs to me that you might be using this representation to argue for a qualitative difference in the behavior produced, but if so, I'm not sure how much I buy into it.

For concreteness, suppose the agent starts out with A, and notices a series of trades which first involves trading A for B, and then B for A+. It seems to me that if I frame the problem like this, the structure of the resulting tree should be isomorphic to that of the decision problem I described, but not necessarily the "up"/"down" version—at least, not if you consider that version to play a key role in DSM's recommendation.

(In particular, my frame is sensitive to which state the agent is initialized in: if it is given B to start, then it has no particular incentive to want to trade that for either A or A+, and so faces no incentive to trade at all. If you initialize the agent with A or B at random, and institute the rule that it doesn't trade by default, then the agent will end up with A+ when initialized with A, and B when initialized with B—which feels a little similar to what you said about DSM allowing both A+ and B as permissible options.)

It sounds like you want to make it so that the agent's initial state isn't taken into account—in fact, it sounds like you want to assign values only to terminal nodes in the tree, take the subset of those terminal nodes which have maximal utility within a particular incomparability class, and choose arbitrarily among those. My frame, then, would be equivalent to using the agent's initial state as a tiebreaker: whichever terminal node shares an incomparability class with the agent's initial state will be the one the agent chooses to steer towards.

...in which case, assuming I got the above correct, I think I stand by my initial claim that this will lead to behavior which, while not necessarily "trammeling" by your definition, is definitely consequentialist in the worrying sense: an agent initialized in the "shutdown button not pressed" state will perform whatever intermediate steps are needed to navigate to the maximal-utility "shutdown button not pressed" state it can foresee, including actions which prevent the shutdown button from being pressed.

Comment by dxu on Value systematization: how values become coherent (and misaligned) · 2023-10-28T13:45:41.274Z · LW · GW

This is a good post! It feels to me like a lot of the discussion I've recently encountered seems to be converging on this topic, and so here's something I wrote on Twitter not long ago that feels relevant:

I think most value functions crystallized out of shards of not-entirely-coherent drives will not be friendly to the majority of the drives that went in; in humans, for example, a common outcome of internal conflict resolution is to explicitly subordinate one interest to another.

I basically don’t think this argument differs very much between humans and ASIs; the reason I expect humans to be safe(r) under augmentation isn’t that I expect them not to do the coherence thing, but that I expect them to do it in a way I would meta-endorse.

And so I would predict the output of that reflection process, when run on humans by humans, to be substantially likelier to contain things we from our current standpoint recognize as valuable—such as care for less powerful creatures, less coherent agents, etc.

If you run that process on an arbitrary mind, the stuff inside the world-model isn’t guaranteed to give rise to something similar, because (I predict) the drives themselves will be different, and the meta-reflection/extrapolation process will likewise be different.

Comment by dxu on Symbol/Referent Confusions in Language Model Alignment Experiments · 2023-10-28T06:42:21.248Z · LW · GW

The main way I'd imagine shutdown-corrigibility failing in AutoGPT (or something like it) is not that a specific internal sim is "trying" to be incorrigible at the top level, but rather that AutoGPT has a bunch of subprocesses optimizing for different subgoals without a high-level picture of what's going on, and some of those subgoals won't play well with shutdown. That's the sort of situation where I could easily imagine that e.g. one of the subprocesses spins up a child system prior to shutdown of the main system, without the rest of the main system catching that behavior and stopping it.

Something like this story, perhaps?

Comment by dxu on Invulnerable Incomplete Preferences: A Formal Statement · 2023-10-21T06:53:56.315Z · LW · GW

This looks to me like a misunderstanding that I tried to explain in section 3.1. Let me know if not, though, ideally with a worked-out example of the form: "here's the decision tree(s), here's what DSM mandates, here's why it's untrammelled according to the OP definition, and here's why it's problematic."

I don't think I grok the DSM formalism enough to speak confidently about what it would mandate, but I think I see a (class of) decision problem where any agent (DSM or otherwise) must either pass up a certain gain, or else engage in "problematic" behavior (where "problematic" doesn't necessarily mean "untrammeled" according to the OP definition, but instead more informally means "something which doesn't help to avoid the usual pressures away from corrigibility / towards coherence"). The problem in question is essentially the inverse of the example you give in section 3.1:

Consider an agent tasked with choosing between two incomparable options A and B, and if it chooses B, it will be further presented with the option to trade B for A+, where A+ is incomparable to B but comparable (and preferable) to A.

(I've slightly modified the framing to be in terms of trades rather than going "up" or "down", but the decision tree is isomorphic.)

Here, A+ isn't in fact "strongly maximal" with respect to A and B (because it's incomparable to B), but I think I'm fairly confident in declaring that any agent which foresees the entire tree in advance, and which does not pick B at the initial node (going "down", if you want to use the original framing), is engaging in a dominated behavior—and to the extent that DSM doesn't consider this a dominated strategy, DSM's definitions aren't capturing a useful notion of what is "dominated" and what isn't.
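
(Here's the tree I have in mind, spelled out; the encoding below is mine and isn't meant to reproduce the DSM formalism, only to make the "one certainly-reachable outcome strictly beats another" structure explicit.)

strictly_better = {("A+", "A")}   # A+ > A; every other pair is incomparable

tree = {
    "start":   {"pick A": "A", "pick B": "after B"},
    "after B": {"keep B": "B", "trade B for A+": "A+"},
}

def terminal_outcomes(node):
    # All outcomes reachable (with certainty, by the agent's own choices) from this node.
    if node not in tree:
        return {node}
    return set().union(*(terminal_outcomes(child) for child in tree[node].values()))

reachable = terminal_outcomes("start")                     # {"A", "B", "A+"}
dominated = {o for o in reachable
             if any((better, o) in strictly_better for better in reachable)}
print(dominated)   # {"A"}: the outcome of refusing to pick B at the initial node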

Again, I'm not claiming this is what DSM says. You can think of me as trying to run an obvious-to-me assertion test on code which I haven't carefully inspected, to see if the result of the test looks sane. But if a (fully aware/non-myopic) DSM agent does constrain itself into picking B ("going down") in the above example, despite the prima facie incomparability of {A, A+} and {B}, then I would consider this behavior problematic once translated back into the context of real-world shutdownability, because it means the agent in question will at least in some cases act in order to influence whether the button is pressed.

(The hope behind incomplete preferences, after all, is that an agent whose preferences over world-states can be subdivided into "incomparability classes" will only ever act to improve its lot within the class of states it finds itself in to begin with, and will never act to shift—or prevent itself from being shifted—to a different incomparability class. I think the above example presents a deep obstacle to this hope, however. Very roughly speaking, if the gaps in the agent's preferences can be bridged via certain causal pathways, then a (non-myopic) agent which does not exploit these pathways to its own benefit will notice itself failing to exploit them, and self-modify to stop doing that.)

Comment by dxu on Invulnerable Incomplete Preferences: A Formal Statement · 2023-10-19T10:27:08.447Z · LW · GW

My results above on invulnerability preclude the possibility that the agent can predictably be made better off by its own lights through an alternative sequence of actions. So I don't think that's possible, though I may be misreading you. Could you give an example of a precommitment that the agent would take? In my mind, an example of this would have to show that the agent (not the negotiating subagents) strictly prefers the commitment to what it otherwise would've done according to DSM etc.

On my understanding, the argument isn’t that your DSM agent can be made better off, but that the reason it can’t be made better off is because it is engaging in trammeling/“collusion”, and that the form of “trammeling” you’ve ruled out isn’t the useful kind.

As far as an example goes, consider a sequence of actions which, starting from an unpressed world state, routes through a pressed world state (or series of pressed world states), before eventually returning to an unpressed world state with higher utility than the initial state.

(The real-world context of such a sequence shouldn’t be too relevant to a theoretical example like this, but if you had to imagine something, you could imagine a “memory gambit”-esque stratagem, where the agent spends some time inactive in order to avoid detection, but has set things up in advance to eventually reactivate itself under more favorable circumstances. Again, the plausibility of the scenario isn’t super relevant here.)

If your proposed DSM agent passes up this action sequence on the grounds that some of the intermediate steps need to bridge between “incomparable” pressed/unpressed trajectories, then it does in fact pass up the certain gain. Conversely, if it doesn’t pass up such a sequence, then its behavior is the same as that of a set of negotiating subagents cooperating in order to form a larger macroagent.
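
(One crude way to concretize this, where the states, utilities, and the stand-in comparability rule are all my own assumptions rather than anything from the DSM machinery:)

stay_put = [("unpressed", 1), ("unpressed", 1), ("unpressed", 1)]
gambit   = [("unpressed", 1), ("pressed",   0), ("unpressed", 5)]   # the "memory gambit"

def total_utility(traj):
    return sum(u for _, u in traj)

def comparable(t1, t2):
    # Stand-in rule: trajectories are comparable only if they agree on the button
    # state at every step, i.e. they never bridge pressed/unpressed.
    return all(s1 == s2 for (s1, _), (s2, _) in zip(t1, t2))

print(total_utility(gambit) > total_utility(stay_put))   # True: passing it up forgoes a certain gain
print(comparable(stay_put, gambit))                      # False: a rule that only acts on within-class
                                                         # comparisons never registers that gain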

Comment by dxu on Invulnerable Incomplete Preferences: A Formal Statement · 2023-10-19T00:35:05.825Z · LW · GW

I'll first flag that the results don't rely on subagents. Creating a group agent out of multiple subagents is possibly an interesting way to create an agent representable as having incomplete preferences, but this isn't the same as creating a single agent whose single preference relation happens not to satisfy completeness.

Flagging here that I don't think the subagent framing is super important and/or necessary for "collusion" to happen. Even if the "outer" agent isn't literally built from subagents, "collusion" can still occur in the sense that it [the outer agent] can notice that its (incomplete) preferences factorize, in a way that allows it to deliberately trade particular completions of them against each other and thereby acquire more resources. The outer agent would then choose to do this for basically the same reason that a committee of subagents would: to acquire more resources for itself as a whole, without disadvantaging any of the completions under consideration.

Comment by dxu on Conditional on living in a AI safety/alignment by default universe, what are the implications of this assumption being true? · 2023-07-17T18:45:20.330Z · LW · GW

If we live in an “alignment by default” universe, that means we can get away with being careless, in the sense of putting forth minimal effort to align our AGI, above and beyond the effort put in to get it to work at all.

This would be great if true! But unfortunately, I don’t see how we’re supposed to find out that it’s true, unless we decide to be careless right now, and find out afterwards that we got lucky. And in a world where we were that lucky—lucky enough to not need to deliberately try to get anything right, and get away with it—I mostly think misuse risks are tied to how powerful of an AGI you’re envisioning, rather than the difficulty of aligning it (which, after all, you’ve assumed away in this hypothetical).

Comment by dxu on Frames in context · 2023-07-03T22:57:14.329Z · LW · GW

Can you say more about how a “frame” differs from a “model”, or a “hypothesis”?

(I understand the distinction between those three and “propositions”. It’s less clear to me how they differ from each other. And if they don’t differ, then I’m pretty sure you can just integrate over different “frames” in the usual way to produce a final probability/EV estimate on whatever proposition/decision you’re interested in. But I’m pretty sure you don’t need Garrabrant induction to do that, so I mostly think I don’t understand what you’re talking about.)

Comment by dxu on Douglas Hofstadter changes his mind on Deep Learning & AI risk (June 2023)? · 2023-07-03T22:49:33.656Z · LW · GW

I’ll bite even further, and ask for the concept of “recurrence” itself to be dumbed down. What is “recurrence”, why is it important, and in what sense does e.g. a feedforward network hooked up to something like MCTS not qualify as relevantly “recurrent”?

Comment by dxu on Going Crazy and Getting Better Again · 2023-07-03T22:37:09.302Z · LW · GW

You have my (mostly abstract, fortunately/unfortunately) sympathies for what you went through, and I’m glad for you that you sound to be doing better than you were.

Having said that: my (rough) sense, from reading this post, is that you've got a bunch of "stuff" going on, some of it plausibly still unsorted, and that that stuff is mixed together in a way that I feel is unhelpful. For example, the things included at the beginning of the post as "necessary background" don't feel to me entirely separate from what you later describe occurring; they mostly feel like an eclectic, esoteric mixture of mental practices—some of which I have no issue with!—stirred together into a hodgepodge of things that, taken together, may or may not have contributed to your later psychosis—and the fact that it is hard to tell is, to my mind, a sort of meta-level cause for concern.

Of course, I acknowledge that you have better introspective access to your own mind than I do, and so when you say those things are separable, safe, and stable, I do put a substantial amount of credence on you being right about that. It just doesn’t feel that way to me, on reading. (Nor do I intend to try and make you explain or justify anything, obviously. It’s your life.)

On the whole, however, reading this post mostly reinforced my impression that the rationalist memeplex seems to disproportionately attract the walking wounded, psychologically speaking—which wouldn’t be as big a deal if it weren’t currently very unclear to me which direction the causality runs. I say this, even as a (relatively) big fan of the rationalist project as a whole.

Comment by dxu on My tentative best guess on how EAs and Rationalists sometimes turn crazy · 2023-06-22T21:26:58.930Z · LW · GW

I am pushing back because, if you are St. Petersburg Paradox-pilled like SBF and make public statements that actually you should keep taking double or nothing bets, perhaps you are more likely to make tragic betting decisions and that's because you're taking certain ideas seriously. If you have galaxy brained the idea of the St. Petersburg Paradox, it seems like Alameda style fraud is +EV.

This is conceding a big part of your argument. You’re basically saying, yes, SBF’s decision was -EV according to any normal analysis, but according to a particular incorrect (“galaxy-brained”) analysis, it was +EV.

(Aside: what was actually the galaxy-brained analysis that’s supposed to have led to SBF’s conclusion, according to you? I don’t think I’ve seen it described, and I suspect this lack of a description is not a coincidence; see below.)

There are many reasons someone might make an error of judgement—but when the error in question stems (allegedly) from an incorrect application of a particular theory or idea, it makes no sense to attribute responsibility for the error to the theory. And as the mistake in question grows more and more outlandish (and more and more disconnected from any result the theory could plausibly have produced), the degree of responsibility that can plausibly be attributed to the theory correspondingly shrinks (while the degree of responsibility of specific brain-worms grows).

In other words,

they did X because they believe Y which implies X

is a misdescription of what happened in these cases, because in these cases the “Y” in question actually does not imply X, cannot reasonably be construed to imply X, and if somehow the individuals in question managed to bamboozle themselves badly enough to think Y implied X, that signifies unrelated (and causally prior) weirdness going on in their brains which is not explained by belief in Y.

In short: SBF is no more an indictment of expected utility theory (or of “taking ideas seriously”) than Deepak Chopra is of quantum mechanics; ditto Ziz and her corrupted brand of “timeless decision theory”. The only reason one would use these examples to argue against “taking ideas seriously” is if one already believed that “taking ideas seriously” was bad for some reason or other, and was looking for ways to affirm that belief.

Comment by dxu on But why would the AI kill us? · 2023-04-25T04:22:48.087Z · LW · GW

RE: decision theory w.r.t. how "other powerful beings" might respond - I really do think Nate has already argued this, and his arguments continue to seem more compelling to me than the opposition's. Relevant quotes include:

It’s possible that the paperclipper that kills us will decide to scan human brains and save the scans, just in case it runs into an advanced alien civilization later that wants to trade some paperclips for the scans. And there may well be friendly aliens out there who would agree to this trade, and then give us a little pocket of their universe-shard to live in, as we might do if we build an FAI and encounter an AI that wiped out its creator-species. But that's not us trading with the AI; that's us destroying all of the value in our universe-shard and getting ourselves killed in the process, and then banking on the competence and compassion of aliens.

[...]

Remember that it still needs to get more of what it wants, somehow, on its own superintelligent expectations. Someone still needs to pay it. There aren’t enough simulators above us that care enough about us-in-particular to pay in paperclips. There are so many things to care about! Why us, rather than giant gold obelisks? The tiny amount of caring-ness coming down from the simulators is spread over far too many goals; it's not clear to me that "a star system for your creators" outbids the competition, even if star systems are up for auction.

Maybe some friendly aliens somewhere out there in the Tegmark IV multiverse have so much matter and such diminishing marginal returns on it that they're willing to build great paperclip-piles (and gold-obelisk totems and etc. etc.) for a few spared evolved-species. But if you're going to rely on the tiny charity of aliens to construct hopeful-feeling scenarios, why not rely on the charity of aliens who anthropically simulate us to recover our mind-states... or just aliens on the borders of space in our universe, maybe purchasing some stored human mind-states from the UFAI (with resources that can be directed towards paperclips specifically, rather than a broad basket of goals)?

Might aliens purchase our saved mind-states and give us some resources to live on? Maybe. But this wouldn't be because the paperclippers run some fancy decision theory, or because even paperclippers have the spirit of cooperation in their heart. It would be because there are friendly aliens in the stars, who have compassion for us even in our recklessness, and who are willing to pay in paperclips.

(To the above, I personally would add that this whole genre of argument reeks, to me, essentially of giving up, and tossing our remaining hopes onto a Hail Mary largely insensitive to our actual actions in the present. Relying on helpful aliens is what you do once you're entirely out of hope about solving the problem on the object level, and doesn't strike me as a very dignified way to go down!)

Comment by dxu on No, really, it predicts next tokens. · 2023-04-22T18:40:03.781Z · LW · GW

I concretely disagree with (what I see as) your implied premise that the outer (training) task has any direct influence on the inner optimizer's cognition. I think this disagreement (which I internally feel like I've already tried to make a number of times) has been largely ignored so far. As a result, many of the things you wrote seem to me to be answerable by largely the same objection:

As I see it: in training, it was optimized for that. The trained model likely contains one or more optimizers optimized by that training. But what the model is trained/optimized to do, is actually answer the questions.

The model's "training/optimization", as characterized by the outer loss, is not what determines the inner optimizer's cognition.

If the model in training has an optimizer, a goal of the optimizer for being capable of answering questions wouldn't actually make the optimizer more capable, so that would not be reinforced. A goal of actually answering the questions, on the other hand, would make the optimizer more capable and so would be reinforced.

The model's "training/optimization", as characterized by the outer loss, is not what determines the inner optimizer's cognition.

Likewise, the heuristics/"adaptations" that coalesced to form the optimizer would have been oriented towards answering the questions.

...why? (The model's "training/optimization", as characterized by the outer loss, is not what determines the inner optimizer's cognition.)

All this points to mask-level goals and does not provide a reason to believe in non-mask goals, and so a "goal slot" remains more parsimonious than an actor with a different underlying goal.

I still don't understand your "mask" analogy, and currently suspect it of mostly being a red herring (this is what I was referring to when I said I think we're not talking about the same thing). Could you rephrase your point without making mention to "masks" (or any synonyms), and describe more concretely what you're imagining here, and how it leads to a (nonfake) "goal slot"?

(Where is a human actor's "goal slot"? Can I tell an actor to play the role of Adolf Hitler, and thereby turn him into Hitler?)

Regarding the evolutionary analogy, while I'd generally be skeptical about applying evolutionary analogies to LLMs, because they are very different, in this case I think it does apply, just not the way you think. I would analogize evolution -> training and human behaviour/goals -> the mask.

I think "the mask" doesn't make sense as a completion to that analogy, unless you replace "human behaviour/goals" with something much more specific, like "acting". Humans certainly are capable of acting out roles, but that's not what their inner cognition actually does! (And neither will it be what the inner optimizer does, unless the LLM in question is weak enough to not have one of those.)

I really think you're still imagining here that the outer loss function is somehow constraining the model's inner cognition (which is why you keep making arguments that seem premised on the idea that e.g. if the outer loss says to predict the next token, then the model ends up putting on "masks" and playing out personas)—but I'm not talking about the "mask", I'm talking about the actor, and the fact that you keep bringing up the "mask" is really confusing to me, since it (in my view) forces an awkward analogy that doesn't capture what I'm pointing at.

Actually, having written that out just now, I think I want to revisit this point:

Likewise, the heuristics/"adaptations" that coalesced to form the optimizer would have been oriented towards answering the questions.

I still think this is wrong, but I think I can give a better description of why it's wrong than I did earlier: on my model, the heuristics learned by the model will be much more optimized towards world-modelling than towards answering questions. "Answering questions" is (part of) the outer task, but the process of doing that requires the system to model and internalize and think about things having to do with the subject matter of the questions—which effectively means that the outer task becomes a wrapper which trains the system by proxy to acquire all kinds of potentially dangerous capabilities.

(Having heuristics oriented towards answering questions is a misdescription; you can't correctly answer a math question you know nothing about by being very good at "generic question-answering", because "generic question-answering" is not actually a concrete task you can be trained on. You have to be good at math, not "generic question-answering", in order to be able to answer math questions.)

Which is to say, quoting from my previous comment:

I strongly disagree that the "extra machinery" is extra; instead, I would say that it is absolutely necessary for strong intelligence. A model capable of producing plans to take over the world if asked, for example, almost certainly contains an inner optimizer with its own goals; not because this was incentivized directly by the outer loss on token prediction, but because being able to plan on that level requires the formation of goal-like representations within the model.

None of this is about the "mask". None of this is about the role the model is asked to play during inference. Instead, it's about the thinking the model must have learned to do in order to be able to don those "masks"—which (for sufficiently powerful models) implies the existence of an actor which (a) knows how to answer, itself, all of the questions it's asked, and (b) is not the same entity as any of the "masks" it's asked to don.

Comment by dxu on The basic reasons I expect AGI ruin · 2023-04-21T22:32:57.043Z · LW · GW

Full Solomonoff Induction on a hypercomputer absolutely does not just "learn very similar internal functions/models"; it effectively recreates actual human brains.

Full SI on a hypercomputer is equivalent to instantiating a computational multiverse and allowing us to access it. Reading out data samples corresponding to text from that is equivalent to reading out samples of actual text produced by actual human brains in other universes close to ours.

...yes? And this is obviously very, very different from how humans represent things internally?

I mean, for one thing, humans don't recreate exact simulations of other humans in our brains (even though "predicting other humans" is arguably the high-level cognitive task we are most specced for doing). But even setting that aside, the Solomonoff inductor's hypothesis also contains a bunch of stuff other than human brains, modeled in full detail—which again is not anything close to how humans model the world around us.

I admit to having some trouble following your (implicit) argument here. Is it that, because a Solomonoff inductor is capable of simulating humans, that makes it "human-like" in some sense relevant to alignment? (Specifically, that doing the plan-sampling thing Rob mentioned in the OP with a Solomonoff inductor will get you a safe result, because it'll be "humans in other universes" writing the plans? If so, I don't see how that follows at all; I'm pretty sure having humans somewhere inside of your model doesn't mean that that part of your model is what ends up generating the high-level plans being sampled by the outer system.)

It really seems to me that if I accept what looks to me like your argument, I'm basically forced to conclude that anything with a simplicity prior (trained on human data) will be aligned, meaning (in turn) the orthogonality thesis is completely false. But... well, I obviously don't buy that, so I'm puzzled that you seem to be stressing this point (in both this comment and other comments, e.g. this reply to me elsethread):

Note I didn't actually reply to that quote. Sure that's an explicit simplicity prior. However there's a large difference under the hood between using an explicit simplicity prior on plan length vs an implicit simplicity prior on the world and action models which generate plans. The latter is what is more relevant for intrinsic similarity to human thought processes (or not).

(to be clear, my response to this is basically everything I wrote above; this is not meant as its own separate quote-reply block)

you need to first investigate the actual internal representations of the systems in question, and verify that they are isomorphic to the ones humans use.

This has been ongoing for over a decade or more (dating at least back to Sparse Coding as an explanation for V1).

That's not what I mean by "internal representations". I'm referring to the concepts learned by the model, and whether analogues for those concepts exist in human thought-space (and if so, how closely they match each other). It's not at all clear to me that this occurs by default, and I don't think the fact that there are some statistical similarities between the high-level encoding approaches being used means that similar concepts end up being converged to. (Which is what is relevant, on my model, when it comes to questions like "if you sample plans from this system, what kinds of plans does it end up outputting, and do they end up being unusually dangerous relative to the kinds of plans humans tend to sample?")

I agree that sparse coding as an approach seems to have been anticipated by evolution, but your raising this point (and others like it), seemingly as an argument that this makes systems more likely to be aligned by default, feels thematically similar to some of my previous objections—which (roughly) are that you seem to be taking a fairly weak premise (statistical learning models likely have some kind of simplicity prior built in to their representation schema) and running with that premise wayyy further than I think is licensed—running, so far as I can tell, directly to the absolute edge of plausibility, with a conclusion something like "And therefore, these systems will be aligned." I don't think the logical leap here has been justified!

Comment by dxu on No, really, it predicts next tokens. · 2023-04-21T22:05:38.371Z · LW · GW

Yeah, I'm growing increasingly confident that we're talking about different things. I'm not referring to "masks" in the sense that you mean it.

I don't know what you mean by "one" or by "inner". I would expect different masks to behave differently, acting as if optimizing different things (though that could be narrowed using RLHF), but they could re-use components between them. So, you could have, for example, a single calculation system that is reused but takes as input a bunch of parameters that have different values for different masks, which (again just an example) define the goals, knowledge and capabilities of the mask.

Yes, except that the "calculation system", on my model, will have its own goals. It doesn't have a cleanly factored "goal slot", which means that (on my model) "takes as input a bunch of parameters that [...] define the goals, knowledge, and capabilities of the mask" doesn't matter: the inner optimizer need not care about the "mask" role, any more than an actor shares their character's values.

  1. That there is some underlying goal that this optimizer has that is different than satisfying the current mask's goal, and it is only satisfying the mask's goal instrumentally.

This I think is very unlikely for the reasons I put in the original post. It's extra machinery that isn't returning any value in training.

Yes, this is the key disagreement. I strongly disagree that the "extra machinery" is extra; instead, I would say that it is absolutely necessary for strong intelligence. A model capable of producing plans to take over the world if asked, for example, almost certainly contains an inner optimizer with its own goals; not because this was incentivized directly by the outer loss on token prediction, but because being able to plan on that level requires the formation of goal-like representations within the model. And (again) because these goal representations are not cleanly factorable into something like an externally visible "goal slot", and are moreover not constrained by the outer loss function, they are likely to be very arbitrary from the perspective of outsiders. This is the same point I tried to make in my earlier comment:

And in that case, the "awakened shoggoth" does seem likely to me to have an essentially arbitrary set of preferences relative to the outer loss function—just as e.g. humans have an essentially arbitrary set of preferences relative to inclusive genetic fitness, and for roughly the same reason: an agentic cognition born of a given optimization criterion has no reason to internalize that criterion into its own goal structure; much more likely candidates for being thus "internalized", in my view, are useful heuristics/"adaptations"/generalizations formed during training, which then resolve into something coherent and concrete.

The evolutionary analogy is apt, in my view, and I'd like to ask you to meditate on it more directly. It's a very concrete example of what happens when you optimize a system hard enough on an outer loss function (inclusive genetic fitness, in this case) that inner optimizers arise with respect to that outer loss (animals with their own brains). When these "inner optimizers" are weak, they consist largely of a set of heuristics, which perform well within the training environment, but which fail to generalize outside of it (hence the scare-quotes around "inner optimizers"). But when these inner optimizers do begin to exhibit patterns of cognition that generalize, what they end up generalizing is not the outer loss, but some collection of what were originally useful heuristics (e.g. kludgey approximations of game-theoretic concepts like tit-for-tat), reified into concepts which are now valued in their own right ("reputation", "honor", "kindness", etc).

This is a direct consequence (in my view) of the fact that the outer loss function does not constrain the structure of the inner optimizer's cognition. As a result, I don't expect the inner optimizer to end up representing, in its own thoughts, a goal of the form "I need to predict the next token", any more than humans explicitly calculate IGF when choosing their actions, or (say) a mathematician thinks "I need to do good maths" when doing maths. Instead, I basically expect the system to end up with cognitive heuristics/"adaptations" pertaining to the subject at hand—which in the case of our current systems is something like "be capable of answering any question I ask you." Which is not a recipe for heuristics that end up unfolding into safely generalizing goals!

Comment by dxu on The basic reasons I expect AGI ruin · 2023-04-19T20:19:09.013Z · LW · GW

I want to revisit what Rob actually wrote:

If you sampled a random plan from the space of all writable plans (weighted by length, in any extant formal language), and all we knew about the plan is that executing it would successfully achieve some superhumanly ambitious technological goal like "invent fast-running whole-brain emulation", then hitting a button to execute the plan would kill all humans, with very high probability.

(emphasis mine)

That sounds a whole lot like it's invoking a simplicity prior to me!
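
To spell out what sampling "weighted by length" cashes out to, here is a minimal formalization (my own gloss, assuming plans are encoded as strings in some fixed formal language):

$$P(p) \propto 2^{-\ell(p)},$$

where $\ell(p)$ is the length of plan $p$'s encoding. Shorter (i.e., simpler) plans receive exponentially more probability mass.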

Comment by dxu on The basic reasons I expect AGI ruin · 2023-04-19T20:07:06.644Z · LW · GW

LLMs and human brains learn from basically the same data with similar training objectives powered by universal approximations of bayesian inference and thus learn very similar internal functions/models.

This argument proves too much. A Solomonoff inductor (AIXI) running on a hypercomputer would also "learn from basically the same data" (sensory data produced by the physical universe) with "similar training objectives" (predict the next bit of sensory information) using "universal approximations of Bayesian inference" (a perfect approximation, in this case), and yet it would not be the case that you could then conclude that AIXI "learns very similar internal functions/models". (In fact, the given example of AIXI is much closer to Rob's initial description of "sampling from the space of possible plans, weighted by length"!)

In order to properly argue this, you need to talk about more than just training objectives and approximations to Bayes; you need to first investigate the actual internal representations of the systems in question, and verify that they are isomorphic to the ones humans use. Currently, I'm not aware of any investigations into this that I'd consider satisfactory.

(Note here that I've skimmed the papers you cite in your linked posts, and for most of them it seems to me either (a) they don't make the kinds of claims you'd need to establish a strong conclusion of "therefore, AI systems think like humans", or (b) they do make such claims, but then the described investigation doesn't justify those claims.)

Comment by dxu on No, really, it predicts next tokens. · 2023-04-19T19:54:53.138Z · LW · GW

E.g. a system capable of correctly answering questions like "given such-and-such chess position, what is the best move for the current player?" must in fact be performing agentic/search-like thoughts internally, since there is no other way to correctly answer this question.

Yes, but that sort of question is in my view answered by the "mask", not by something outside the mask.

I don't think this parses for me. The computation performed to answer the question occurs inside the LLM, yes? Whether you classify said computation as coming from "the mask" or not, clearly there is an agent-like computation occurring, and that's concretely dangerous regardless of the label you choose to slap on it.

(Example: suppose you ask me to play the role of a person named John. You ask "John" what the best move is in a given chess position. Then the answer to that question is actually being generated by me, and it's no coincidence that—if "John" is able to answer the question correctly—this implies something about my chess skills, not "John's".)

The masks can indeed think whatever - in the limit of a perfect predictor some masks would presumably be isomorphic to humans, for example - though all is underlain by next-token prediction.

I don't think we're talking about the same thing here. I expect there to be only one inner optimizer (because more than one would point to cognitive inefficiencies), whereas you seem like you're talking about multiple "masks". I don't think it matters how many different roles the LLM can be asked to play; what matters is what the inner optimizer ends up wanting.

Mostly, I'm confused about the ontology you appear to be using here, and (more importantly) how you're manipulating that ontology to get us nice things. "Next-token prediction" doesn't get us nice things by default, as I've already argued, because of the existence of inner optimizers. "Masks" also don't get us nice things, as far as I understand the way you're using the term, because "masks" aren't actually in control of the inner optimizer.

Comment by dxu on The basic reasons I expect AGI ruin · 2023-04-18T23:03:12.054Z · LW · GW

In order to make the doom conclusion actually go through, arguments should make stronger claims about the priors involved, and how they differ from those of the human learning process.

Isn't it enough that they do differ? Why do we need to be able to accurately/precisely characterize the nature of the difference, to conclude that an arbitrary inductive bias different from our own is unlikely to sample the same kinds of plans we do?

Comment by dxu on No, really, it predicts next tokens. · 2023-04-18T21:00:51.482Z · LW · GW

It would conflict with a deceptive awake Shoggoth, but IMO such a thing is unlikely because the model is super-well optimized for next token prediction

Yeah, so I think I concretely disagree with this. I don't think being "super-well optimized" for a general task like sequence prediction (and what does it mean to be "super-well optimized" anyway, as opposed to "badly optimized" or some such?) means that inner optimizers fail to arise in the limit of sufficient ability, or that said inner optimizers will be aligned on the outer goal of sequence prediction.

Intuition: some types of cognitive work seem so hard that a system capable of performing said cognitive work must be, at some level, performing something like systematic reasoning/planning on the level of thoughts, not just the level of outputs. E.g. a system capable of correctly answering questions like "given such-and-such chess position, what is the best move for the current player?" must in fact be performing agentic/search-like thoughts internally, since there is no other way to correctly answer this question.

If so, this essentially demands that an inner optimizer exist—and, moreover, since the outer loss function makes no reference whatsoever to such an inner optimizer, the structure of the outer (prediction) task poses essentially no constraints on the kinds of thoughts the inner optimizer ends up thinking. And in that case, the "awakened shoggoth" does seem likely to me to have an essentially arbitrary set of preferences relative to the outer loss function—just as e.g. humans have an essentially arbitrary set of preferences relative to inclusive genetic fitness, and for roughly the same reason: an agentic cognition born of a given optimization criterion has no reason to internalize that criterion into its own goal structure; much more likely candidates for being thus "internalized", in my view, are useful heuristics/"adaptations"/generalizations formed during training, which then resolve into something coherent and concrete.

(Aside: it seems to have become popular in recent times to claim that the evolutionary analogy fails for some reason or other, with justifications like, "But look how many humans there are! We're doing great on the IGF front!" I consider these replies more-or-less a complete non sequitur, since it's nakedly obvious that, however much success we have had in propagating our alleles, this success does not stem from any explicit tracking/pursuit of IGF in our cognition. To the extent that human behavior continues to (imperfectly) promote IGF, this is largely incidental on my view—arising from the fact that e.g. we have not yet moved so far off-distribution as to have ways of getting what we want without having biological children.)

One possible disagreement someone might have with this, is that they think the kinds of "hard" cognitive work I described above can be accomplished without an inner optimizer ("awakened shoggoth"), by e.g. using chain-of-thought prompting or something similar, so as to externalize the search-like/agentic part of the solution process instead of conducting it internally. (E.g. AlphaZero does this by having its model be responsible only for the static position evaluation, which is then fed into/amplified via an external, handcoded search algorithm.)

However, I mostly think that

  1. This doesn't actually make you safe, because the ability to generate a correct plan via externalized thinking still implies a powerful internal planning process (e.g. AlphaZero with no search still performs at a 2400+ Elo level, corresponding to the >99th percentile of human players). Obviously the searchless version will be worse than the version with search, but that won't matter if the dangerous capabilities still exist within the searchless version. (Intuition: suppose we have a model which, with chain-of-thought prompting, is capable of coming up with a detailed-and-plausible plan for taking over the world. Then I claim this model is clearly powerful enough to be dangerous in terms of its underlying capabilities, regardless of whether it chooses to "think aloud" or not, because coming up with a good plan for taking over the world is not the kind of thing "thinking aloud" helps you with unless you're already smarter than any human.)

  2. Being able to answer complicated questions using chain-of-thought prompting (or similar) is not actually the task incentivized during training; what is incentivized is (as you yourself stressed continuously throughout your post) next token prediction, which—in cases where the training data contains sentences where substantial amounts of "inference" occurred between tokens (which happens a lot on the Internet!)—directly incentivizes the model to perform internal rather than external search. (Intuition: suppose we have a model trained to predict source code. Then, in order to accurately predict the next token, the model must have the capability to assess whatever is being attempted by the lines of code visible within the current context, and come up with a logical continuation of that code, all within a single inference pass. This strongly promotes internalization of thought—and various other types of training input have this property, such as mathematical proofs, or even more informal forms of argumentation such as e.g. LW comments.)

Comment by dxu on [deleted post] 2023-04-16T05:22:26.829Z

I think I'm having some trouble parsing this, but not in a way that necessarily suggests your ideas are incoherent and/or bad—simply that your (self-admittedly) unusual communication style is making it hard for me to understand what you are saying.

It's possible you wrote this post the way you did because this is the way the ideas in question were natively represented in your brain, and translating them out of that representation and into something more third-party legible would have been effortful and/or infeasible. If so, there's plausibly not much to be done here besides one of us trying much harder/longer at bridging the gap (and in the interests of full disclosure, I will tell you right now that I don't intend to be that person)!

But in case you do think it would be possible to explain your points in a less flowery/poetic way, without too much added effort, I'd like this comment to function as a request for you to do just that.

Comment by dxu on Moderation notes re: recent Said/Duncan threads · 2023-04-16T05:15:30.178Z · LW · GW

Gotcha. Thanks for explaining, in any case; I appreciate it.

Comment by dxu on Moderation notes re: recent Said/Duncan threads · 2023-04-16T05:00:29.185Z · LW · GW

With the caveat that I think this sort of “litigation of minutiae of nuance” is of very limited utility

Yeah, I think I probably agree.

would you consider “you A’d someone as a consequence of their B’ing” different from both the other two forms? Synonymous with them both? Synonymous with one but not the other?

Synonymous as far as I can tell. (If there's an actual distinction in your view, which you're currently trying to lead me to via some kind of roundabout, Socratic pathway, I'd appreciate skipping to the part where you just tell me what you think the distinction is.)

Comment by dxu on Moderation notes re: recent Said/Duncan threads · 2023-04-16T04:36:58.435Z · LW · GW

As a single point of evidence: it's immediately obvious to me what the difference is between "X is true" and "I think X" (for starters, note that these two sentences have different subjects, with the former's subject being "X" and the latter's being "I"). On the other hand, "you A'd someone due to their B'ing" and "you A'd someone for B'ing" do, actually, sound synonymous to me—and although I'm open to the idea that there's a distinction I'm missing here (just as there might be people to whom the first distinction is invisible), from where I currently stand, the difference between the first pair of sentences looks, not just 10x or 1000x bigger, but infinitely bigger than the difference between the second, because the difference between the second is zero.

(And if you accept that [the difference between the second pair of phrases is zero], then yes, it's quite possible for some other difference to be massively larger than that, and yet not be tremendously important.)

Here, I do think that Duncan is doing something different from even the typical LWer, in that he—so far as I can tell—spends much more time and effort talking about these fine-grained distinctions than do others, in a way that I think largely drags the conversation in unproductive directions; but I also think that in this context, where the accusation is that he "splits hairs" too much, it is acceptable for him to double down on the hair-splitting and point out that, actually, no, he only splits those hairs that are actually splittable.

Comment by dxu on Moderation notes re: recent Said/Duncan threads · 2023-04-16T04:18:26.389Z · LW · GW

Might I ask what you hoped to achieve in this thread by writing this comment?

Comment by dxu on Moderation notes re: recent Said/Duncan threads · 2023-04-15T21:03:20.603Z · LW · GW

If so, I find this reasoning unconvincing

Why?

I mostly don't agree that "the pattern is clear"—which is to say, I do take issue with saying "we do not need to imagine counterfactuals". Here is (to my mind) a salient example of a top-level comment which provides an example illustrating the point of the OP, without the need for prompting.

I think this is mostly what happens, in the absence of such prompting: if someone thinks of a useful example, they can provide it in the comments (and accrue social credit/karma for their contribution, if indeed other users found said contribution useful). Conversely, if no examples come to mind, then a mere request from some other user ("Examples?") generally will not cause sudden examples to spring into mind (and to the extent that it does, the examples in question are likely to be ad hoc, generated in a somewhat defensive frame of mind, and accordingly less useful).

And, of course, the crucial observation here is that in neither case was the request for examples useful; in the former case, the request was unnecessary, as the examples would have been provided in any case, and in the latter case, the request was useless, as it failed to elicit anything of value.

Here, I anticipate a two-pronged objection from you—one prong for each branch I have described. The first prong I anticipate is that, empirically, we do observe people providing examples when asked, and not otherwise. My response to this is that (again) this does not serve as evidence for your thesis, since we cannot observe the counterfactual worlds in which this request was/wasn't made, respectively. (I also observe that we have some evidence to the contrary, in our actual world, wherein sometimes an exhortation to provide examples is simply ignored; moreover, this occurs more often in cases where the asker appears to have put in little effort to generate examples of their own before asking.)

The second prong is that, in the case where no useful examples are elicited, this fact in itself conveys information—specifically, it conveys that the post's thesis is (apparently) difficult to substantiate, which should cause us to question its very substance. I am more sympathetic to this objection than I am to the previous—but still not very sympathetic, as there are quite often other reasons, unrelated to the defensibility of one's thesis, one might not wish to invest effort in producing such a response. In fact, I read Duncan's complaint as concerned with just this effect: not that being asked to provide examples is bad, but that the accompanying (implicit) interpretation wherein a failure to respond is interpreted as lack of ability to defend one's thesis creates an asymmetric (and undue) burden on him, the author.

That last bit in bold is, in my mind, the operative point here. Without that, even accepting everything else I said as valid and correct, you would still be able to respond, after all, that

What’s not fine is if, instead, you debit me for that comment. That would be completely backwards, and fundamentally confused about what sorts of contributions are valuable, and indeed about what the point of this website even is.

After all, even if such a comment is not particularly valuable in and of itself, it is not a net negative for discussion—and at least (arguably) sometimes positive. But with the inclusion of the bolded point, the cost-benefit analysis changes: asking for examples (without accompanying interpretive effort, much of whose use is in signaling to the author that you, the commenter, are interested in reducing the cost to them of responding) is, in this culture, not merely a "formative evaluation" or even a start to such, but a challenge to them to respond—and a timed challenge, at that. And it is not hard at all for me to see why we ought to increase the cost ("debit", as you put it) for writing minimally useful comments that issue (or often get construed as issuing) unilateral challenges to others!

Comment by dxu on Moderation notes re: recent Said/Duncan threads · 2023-04-15T20:02:13.895Z · LW · GW

This, however, assumes that “formative evaluations” must be complete works by single contributors, rather than collaborative efforts contributed to by multiple commenters. That is an unrealistic and unproductive assumption, and will lead to less evaluative work being done overall, not more.

I am curious as to your assessment of the degree of work done by a naked "this seems unclear, please explain"?

My own assessment would place the value of this (and nothing else) at fairly close to zero—unless, of course, you are implicitly taking credit for some of the discussion that follows (with the reasoning that, had the initiating comment been absent, the resulting discussion would not counterfactually exist). If so, I find this reasoning unconvincing, but I remain open to hearing reasons you might disagree with me about this—if in fact you do disagree. (And if you don't disagree, then from my perspective that sounds awfully like conceding the point; but perhaps you disagree with that, and if so, I would also like to hear why.)

Comment by dxu on On "aiming for convergence on truth" · 2023-04-11T18:53:20.067Z · LW · GW

I like this post! Positive reinforcement. <3

Comment by dxu on Eliezer Yudkowsky’s Letter in Time Magazine · 2023-04-07T22:04:31.051Z · LW · GW

You continue to assert things without justification, which is fine insofar as your goal is not to persuade others. And perhaps this isn't your goal! Perhaps your goal is merely to make it clear what your beliefs are, without necessarily providing the reasoning/evidence/argumentation that would convince a neutral observer to believe the same things you do.

But in that case, you are not, in fact, licensed to act surprised, and to call others "irrational", if they fail to update to your position after merely seeing it stated. You haven't actually given anyone a reason they should update to your position, and so—if they weren't already inclined to agree with you—failing to agree with you is not "irrational", "wordcel", or whatever other pejorative you are inclined to use, but merely correct updating procedure.

So what are we left with, then? You seem to think that this sentence says something meaningful:

If ground truth reality supports 1 and 2 I am right, if it does not I am wrong.

but it is merely a tautology: "If I am right I am right, whereas if I am wrong I am wrong." If there is additional substance to this statement of yours, I currently fail to see it. This statement can be made for any set of claims whatsoever, and so to observe it being made for a particular set of claims does not, in fact, serve as evidence for that set's truth or falsity.

Of course, the above applies to your position, and also to my own, as well as to EY's and to anyone else who claims to have a position on this topic. Does this thereby imply that all of these positions are equally plausible? No, I claim—no more so than, for example, "either I win the lottery or I don't" implies a 50/50 spread on the outcome space. This, I claim, is structurally isomorphic to the sentence you emitted, and equally as invalid.

Arguing that a particular possibility ought to be singled out as likelier than the others requires more than just stating it and thereby privileging it with all of your probability mass. You must do the actual hard work of coming up with evidence, and interpreting that evidence so as to favor your model over competing models. This is work that you have not yet done, despite being many comments deep into this thread—which is, in my view, substantial evidence that it is work you cannot do (else you could easily win this argument—or at the very least advance it substantially—by doing just that)!

Of course, you claim you are not here to do that. Too "wordcel", or something along those lines. Well, good for you—but in that case I think the label "irrational" applies squarely to one participant in this conversation, and the name of that participant is not "Eliezer Yudkowsky".

Comment by dxu on Eliezer Yudkowsky’s Letter in Time Magazine · 2023-04-06T21:17:33.040Z · LW · GW
  1. one is straightforwardly true. Aging is going to kill every living creature. Aging is caused by complex interactions between biological systems and bad evolved code. An agent able to analyze thousands of simultaneous interactions, cross millions of patients, and essentially decompile the bad code (by modeling all proteins/ all binding sites in a living human) is likely required to shut it off, but it is highly likely with such an agent and with such tools you can in fact save most patients from aging. A system with enough capabilities to consider all binding sites and higher level system interactions at the same (this is how a superintelligence could perform medicine without unexpected side effects) is obviously far above human level.

To be clear: I am straightforwardly in favor of longevity research—and, separately, I am agnostic on the question of whether superhuman general intelligence is necessary to crack said research; that seems like a technical challenge, and one that I presently see no reason to consider unsolvable at current levels of intelligence. (I am especially skeptical of the part where you seemingly think a solution will look like "analyzing thousands of simultaneous interactions across millions of patients and model all binding sites in a living human"—especially as you didn't argue for this claim at all.) As a result, the dichotomy you present here seems clearly unjustified.

(You are, in fact, justified in arguing that doing longevity research without increased intelligence of some kind will cause the process to take longer, but (i) that's a different argument from the one you're making, with accordingly different costs/benefits, and (ii) even accepting this modified version of the argument, there are more ways to get to "increased intelligence" than AI research—human intelligence enhancement, for example, seems like another viable road, and a significantly safer one at that.)

  1. This is not possible per the laws of physics. Intelligence isn't the only factor. I don't think we can have a reasonable discussion if you are going to maintain a persistent belief in magic. Note by foom I am claiming you believe in a system that solely based on a superior algorithm will immediately take over the planet. It is not affected by compute, difficulty in finding a recursively better algorithm, diminishing returns on intelligence in most tasks, or money/robotics. I claim each of these obstacles takes time to clear. (time = decades)

I dispute that FOOM-like scenarios are ruled out by laws of physics, or that this position requires anything akin to a belief in "magic". (That I—and other proponents of this view—would dispute this characterization should have been easily predictable to you in advance, and so your choice to adopt this phrasing regardless speaks ill of your ability to model opposing views.)

The load-bearing claim here (or rather, set of claims) is, of course, located within the final parenthetical: ("time = decades"). You appear to be using this claim as evidence to justify your previous assertions that FOOM is physically impossible/"magic", but this ignores that the claim that each of the obstacles you listed represents a decades-long barrier is itself in need of justification.

(Additionally, if we were to take your model as fact—and hence accept that any possible AI systems would require decades to scale to a superhuman level of capability—this significantly weakens the argument from aging-related costs you made in your point 1, by essentially nullifying the point that AI systems would significantly accelerate longevity research.)

  1. Who says the system needs to be agentic at all or long running? This is bad design. EY is not a SWE.

Agency does not need to be built into the system as a design property, on EY's model or on mine; it is something that tends to naturally arise (on my model) as capabilities increase, even from systems whose inherent event/runtime loop does not directly map to an agent-like frame. You have not, so far as I can tell, engaged with this model at all; and in the absence of such engagement "EY is not a SWE" is not a persuasive counterargument but a mere ad hominem.

(Your response folded point 4 into point 3, so I will move on to point 5.)

  1. https://www.lesswrong.com/posts/HByDKLLdaWEcA2QQD/applying-superintelligence-without-collusion https://www.lesswrong.com/posts/5hApNw5f7uG8RXxGS/the-open-agency-model

Thank you very much for the links! For the first post you link, the top comment is from EY, in direct contradiction to your initial statement here:

  1. He has ignored reasonable and buildable AGI systems proposed by Eric fucking Drexler himself, on this very site, and seems to pretend the idea doesn't exist.

Given the factual falsity of this claim, I would request that you explicitly acknowledge it as false, and retract it; and (hopefully) exercise greater moderation (and less hyperbole) in your claims about other people's behavior in the future.

In any case—setting aside the point that your initial allegation was literally false—EY's comment on that post makes [what looks to me like] a reasonably compelling argument against the core of Drexler's proposal. There follows some back-and-forth between the two (Yudkowsky and Drexler) on this point. It does not appear to me from that thread that there is anything close to a consensus that Yudkowsky was wrong and Drexler was right; both commenters received large amounts of up- and agree-votes throughout.

Given this, I think the takeaway you would like for me to derive from these posts is less clear than you would like it to be, and the obvious remedy would be to state specifically what it is you think is wrong with EY's response(s). Is it the argument you made in this comment? If so, that seems essentially to be a restatement of your point 2, phrased interrogatively rather than declaratively—and my objection to that point can be considered to apply here as well.

  1. This is irrational because no discount rate. Risking a nuclear war raises the pkill of millions of people now. The quadrillions of people this could 'save' may never exist because of many unknowns, hence there needs to be a large discount rate.

P(doom) is unacceptably high under the current trajectory (on EY's model). Do you think that the people who are alive today will not be counted towards the kill count of a future unaligned AGI? The value that stands to be destroyed (on EY's model) consists, not just of these quadrillions of future individuals, but each and every living human who would be killed in a (hypothetical) nuclear exchange, and then some.

You can dispute EY's model (though I would prefer you do so in more detail than you have up until now—see my replies to your other points), but disputing his conclusion based on his model (which is what you are doing here) is a dead-end line of argument: accepting that ASI presents an unacceptably high existential risk makes the relevant tradeoffs quite stark, and not at all in doubt.

(As was the case with points 4/5, point 7 was folded into point 6, and so I will move on to the final point.)

  1. CAIS is an extension of stateless microservices, and is how all reliable software built now works. Giving the machines self modification or a long running goal is not just bad because it's AI, it's generally bad practice.

Setting aside that you (again) didn't provide a link, my current view is that Richard Ngo has provided some reasonable commentary on CAIS as an approach; my own view largely accords with his on this point and so I think claiming this as the one definitive approach to end all AI safety approaches (or anything similar) is massively overconfident.

And if you don't think that—which I would hope you don't!—then I would move to asking what, exactly, you would like to convey by this point. "CAIS exists" is true, and not helpful; "CAIS seems promising to me" is perhaps a weaker but more defensible claim than the outlandish one given above, but nonetheless doesn't seem strong enough to justify your initial statement:

  1. Alignment proposals he has described are basically are impossible, while CAIS is just straightforward engineering and we don't need to delay anything it's the default approach.

So, unfortunately, I'm left at present with a conclusion that can be summarized quite well by taking the final sentence of your great-grandparent comment, and performing a simple replacement of one name with another:

Unfortunately I have to start to conclude [Gerald Monroe] is not rational or worth paying attention to, which is ironic.

Comment by dxu on LW Team is adjusting moderation policy · 2023-04-06T10:19:00.756Z · LW · GW

Categories like “conflicts of interest”, “discussions about who should be banned”, “arguments about moderation in cases in which you’re involved”, etc., already constitute “evidence” that push the conclusion away from the prior of “on the whole, people are more likely to say true things than false things”, without even getting into anything more specific.

The strength of the evidence is, in fact, a relevant input. And of the evidential strength conferred by the style of reasoning employed here, much has already been written.

You’ve misunderstood. My point was that “Said keeps finding mistakes in what I have written” is a good first approximation (but only that!) of what Duncan allegedly finds unpleasant about interacting with me, not that it’s a good first approximation of Duncan’s description of same.

Then your response to gjm's point seems misdirected, as the sentence you were quoting from his comment explicitly specifies that it concerns what Duncan himself said. Furthermore, I find it unlikely that this is an implication you could have missed, given that the first quote-block above speaks specifically of the likelihood that "people" (Duncan) may or may not say false things with regards to a topic in which they are personally invested; indeed, this back-and-forth stemmed from discussion of that initial point!

Setting that aside, however, there is a further issue to be noted (one which, if anything, is more damning than the previous), which is that—having now (apparently) detached our notion of what is being "approximated" from any particular set of utterances—we are left with the brute claim that "'Said keeps finding mistakes in what Duncan has written' is a good approximation of what Duncan finds unpleasant about interacting with Said"—a claim of which I don't see how you could even have positive knowledge, much less establish the truth! After all, neither of us has telepathic access to Duncan's inner thoughts, and so the claim that his ban of you was motivated by some factor X—a factor whose influence he in fact explicitly denies—is speculation at best, and psychologizing at worst.

A single circumspectly disagreeing comment on a tangential, secondary (tertiary? quaternary?) point, buried deep in a subthread, having minimal direct bearing on the claims in the post under which it’s posted. “Robust disagreement”, this ain’t.

I appreciate the starkness of this response. Specifically, your response makes it quite clear that the word "robust" is carrying essentially the entirety of the weight of your argument. However, you don't appear to have operationalized this anywhere in your comment, and (unfortunately) I confess myself unclear as to what you mean by it. "Disagreement" is obvious enough, which is why I was able to provide an example on such short notice, but if you wish me to procure an example of whatever you are calling "robust disagreement", you will have to explain in more detail what this thing is, and (hopefully) why it matters!

I am (moreover) quite confident in my ability to find additional such examples if necessary

Please do. So far, the example count remains at zero.

but in lieu of that, I will instead question the necessity of such: did you, Said Achmiz, (prior to my finding an example) honestly expect/suspect that there were no such examples to be found?

Given that you did not, in fact, find an example, I think that this question remains unmotivated.

[...]

So my request for examples of the alleged phenomenon wherein “other people have disagreed robustly with Duncan and not had him ban them from commenting on his posts” is not so absurd, after all.

It is my opinion that the response to the previous quoted block also serves adequately as a response to these miscellaneous remarks.

Indeed, this observation has me questioning the reliability of your stance on this particular issue, since the tendency to get things like this wrong suggests a model of (this subregion of) reality so deeply flawed, little to no wisdom avails to be extracted.

I think that, on the contrary, it is you who should re-examine your stance on the matter. Perhaps the absurdity heuristic, coupled with a too-hasty jump to a conclusion, has led you astray?

This question is, in fact, somewhat difficult to answer as of this exact moment, since the answer depends in large part on the meaning of a term ("robustness") whose contextual usage you have not yet concretely operationalized. I of course invite such an operationalization, and would be delighted to reconsider my stance if presented with a good one; until that happens, however, I confess myself skeptical of what (in my estimation) amounts to an uncashed promissory note.

As alluded to in the quote/response pair at the beginning of this comment, this is not a valid inference. What you propose is a valid probabilistic inference in the setting where we are presented only with the information you describe (although even then the strength of update justified by such information is limited at best). Nonetheless, there are plenty of remaining hypotheses consistent with the information in question, and which have (hence) not been ruled out merely by observing Bob to have banned Alice.

That’s why I said “default”.

Well. Let's review what you actually said, shall we?

If Alice criticizes one of Bob’s posts, and Bob immediately or shortly thereafter bans Alice from commenting on Bob’s posts, the immediate default assumption should be that the criticism was the reason for the ban. Knowing nothing else, just based on these bare facts, we should jump right to the assumption that Bob’s reasons for banning Alice were lousy.

Rereading, it appears that the word you singled out ("default") was in fact part of a significantly longer phrase (which you even italicized for emphasis); and this phrase, I think, conveys a notion substantially stronger than the weakened version you appear to have retreated to in response to my pushback. We are presented with the idea, not just of a "default" state, but an immediate assumption regarding Bob's motives—quite a forceful assertion to make!

An assumption with what confidence level, might I ask? And (furthermore) what kind of extraordinarily high "default" confidence level must you postulate, sufficient to outweigh other, more situationally specific forms of evidence, such as—for example—the opinions of onlookers (as conveyed through third-party comments such as gjm's or mine, as well as through voting behavior)?

For example, suppose it is the case that Alice (in addition to criticizing Bob’s object-level points) also takes it upon herself to include, in each of her comments, a remark to the effect that Bob is physically unattractive.

That would be one of those “exceptional circumstances” I referred to. Do you claim such circumstances obtain in the case at hand?

I claim that Duncan so claims, and that (moreover) you have thus far made no move to refute that claim directly, preferring instead to appeal to priors wherever possible (a theme present throughout many of the individual quote/response pairs in this comment). Of course, that doesn't necessarily mean that Duncan's claim here is correct—but as time goes on and I continue to observe [what appear to me to be] attempts to avoid analyzing the situation on the object level, I do admit that one side's position starts to look increasingly favored over the other!

(Having said that, I realize that the above may come off as "taking sides" to some extent, and so—both for your benefit and for the benefit of onlookers—I would like to stress for myself the same point gjm stressed upthread, which is that I consider both Said and Duncan to be strong positive contributors to LW content/culture, and would be accordingly sad to see either one of them go. That I am to some extent "defending" Duncan in this instance is not in any way a broader indictment of Said—only of the accusations of misconduct he [appears to me to be] leveling at Duncan.)

and if Bob then proceeded to ban Alice for such provocations, we would not consider this evidence that he cannot tolerate criticism. The reason for the ban, after all, would have been explained, and thus screened off, leaving us with no reason to suspect him of banning Alice for “lousy reasons”.

All of this, as I said, was quite comprehensively covered in the comment to which you’re responding. (I begin to suspect that you did not read it very carefully.)

Perhaps the topic of discussion (as you have construed it) differs substantially from how I see it, because this statement is, so far as I can tell, simply false. Of course, it should be easy enough to disconfirm this merely by pointing out the specific part of the grandparent comment you believe addresses the point I made inside of the nested quote block; and so I will await just such a response.

But the claim that you have not, in any of your prior interactions with him, engaged in a style of discourse that made him think of you as an unusually unlikely-to-be-productive commenter, is, I think, unsupported.

But of course I never claimed anything like this. What the heck sort of strawman is this? Where is it coming from? And what relevance does it have?

Well, by the law of the excluded middle, can I take your seeming disavowal of this claim as an admission that its negation holds—in other words, that you have, in fact, engaged with Duncan in ways that he considers unproductive? If so, the relevance of this point seems nakedly obvious to me: if you are, in fact, (so far as Duncan can tell) an unproductive presence in the comment section of his posts, then... well, I might as well let my past self of ~4 hours ago say it:

And if he had perceived you as such, why, this might then be perceived as sufficient grounds to remove the possibility of such unproductive interactions going forward, and to make that decision independent of the quality (or, indeed, existence) of your object-level criticisms.

What is this passive-voice “might then be perceived” business? Do you perceive this to be the case?

It seems like you are saying something like “if Bob decides that he is unlikely to engage in productive discussion with Alice, then that is a good and honorable reason for Bob to ban Alice from commenting on his posts”. Are you, in fact, saying that? If not—what are you saying?

And in response to this, I can only say: the sentence within quotation marks is very nearly the opposite of what I am saying—which, phrased within the same framing, would go like this:

"If Bob decides that Alice is unlikely to engage in productive discussion with him, then that is a good and honorable reason for Bob to ban Alice from commenting on his posts."

We're not talking about a commutative operation here; it does in fact matter, whose name goes where!

Comment by dxu on Giant (In)scrutable Matrices: (Maybe) the Best of All Possible Worlds · 2023-04-06T06:30:13.541Z · LW · GW

I'm not sure what predictions you're making that are different than mine, other than maybe "a research program that skips NN's and just try to build the representations that they build up directly without looking at NNs has reasonable chances of success." Which doesn't seem like one you'd actually want to make.

I think I would, actually, want to make this prediction. The problem is that I'd want to make it primarily in the counterfactual world where the NN approach had been abandoned and/or declared off-limits, since in any world where both approaches exist, I would also expect the connectionist approach to reach dividends faster (as has occurred in e.g. our own world). This doesn't make my position inconsistent with the notion that a GOFAI-style approach is workable; it merely requires that I think such an approach requires more mastery and is therefore slower (which, for what it's worth, seems true almost by definition)!

I do, however, think that "building the high-level representations", despite being slower, would not be astronomically slower than using SGD on connectionist models (which is what you seem to be gesturing at, with claims like "for a many (though not all) substantial learning tasks, it seems likely you will wait until the continents collide and the sun cools before you are able to find that algorithm"). To be fair, you did specify that you were talking about "decision-tree specific algorithms" there, which I agree are probably too crude to learn anything complex in a reasonable amount of time; but I don't think the sentiment you express there carries over to all manner of GOFAI-style approaches (which is the strength of claim you would actually need for [what looks to me like] your overall argument to carry through).

(A decision-tree based approach would likely also take "until the continents collide and the sun cools" to build a working chess evaluation function from scratch, for example, but humans coded by hand what were, essentially, decision trees for evaluating positions, and achieved reasonable success until that approach was obsoleted by neural network-based evaluation functions. This seems like it reasonably strongly suggests that whatever the humans were doing before they started using NNs was not a completely terrible way to code high-level feature-based descriptions of chess positions, and that—with further work—those representations would have continued to be refined. But of course, that didn't happen, because neural networks came along and replaced the old evaluation functions; hence, again, why I'd want primarily to predict GOFAI-style success in the counterfactual world where the connectionists had for some reason stopped doing that.)
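
To make the flavor of those hand-coded evaluation functions concrete, here is a minimal sketch in Python of the sort of feature-based scoring pre-neural-network engines used; the board representation and the specific weights are illustrative placeholders of my own, not taken from any particular engine.

```python
# Minimal sketch of a hand-coded, feature-based chess evaluation function,
# in the spirit of pre-neural-network engines. All weights here are
# illustrative placeholders, not values from any real engine.

# Rough material values in centipawns (uppercase = White, lowercase = Black).
PIECE_VALUES = {"P": 100, "N": 320, "B": 330, "R": 500, "Q": 900, "K": 0}

# A toy "central control" feature: bonus for pawns/knights on the four center squares.
CENTER_SQUARES = {(3, 3), (3, 4), (4, 3), (4, 4)}
CENTER_BONUS = 20


def evaluate(board):
    """Score a position from White's perspective, in centipawns.

    `board` maps (rank, file) coordinates (each 0-7) to piece letters,
    e.g. {(0, 4): "K", (7, 4): "k", (3, 3): "P"}.
    """
    score = 0
    for square, piece in board.items():
        sign = 1 if piece.isupper() else -1
        score += sign * PIECE_VALUES[piece.upper()]
        if piece.upper() in ("P", "N") and square in CENTER_SQUARES:
            score += sign * CENTER_BONUS
    return score


if __name__ == "__main__":
    # White: king plus a pawn on a center square; Black: bare king.
    position = {(0, 4): "K", (7, 4): "k", (3, 3): "P"}
    print(evaluate(position))  # 120 = +100 material +20 center bonus
```

Real hand-tuned evaluators stacked many more such features (king safety, pawn structure, mobility, and so on), but the basic shape, a weighted sum of human-chosen features, is the same.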

Comment by dxu on LW Team is adjusting moderation policy · 2023-04-06T05:54:19.930Z · LW · GW

This is a claim so general as to be meaningless. If we knew absolutely nothing except “a person said a thing”, then retreating to this sort of maximally-vague prior might be relevant. But we in fact are discussing a quite specific situation, with quite specific particular and categorical features. There is no good reason to believe that the quoted prior survives that descent to specificity unscathed (and indeed it seems clear to me that it very much does not).

The prior does in fact survive, in the absence of evidence that pushes one's conclusion away from it. And this evidence, I submit, you have not provided. (And the inferences you do put forth as evidence are—though this should be obvious from my previous sentence—not valid as inferences; more on this below.)

it isn’t just “Said keeps finding mistakes in what I have written”

It’s slightly more specific, of course—but this is, indeed, a good first approximation.

This is a substantially load-bearing statement. It would appear that Duncan denies this, that gjm thinks otherwise as well, and (to add a third person to the tally) I also find this claim suspicious. Numerical popularity of course does not determine the truth (or falsity) of a claim, but in such a case I think it behooves you to offer some additional evidence for your claim, beyond merely stating it as a brute fact. To wit:

What, of the things that Duncan has written in explanation of his decision to ban you from commenting on his posts (as was the subject matter being discussed in the quoted part of the grandparent comment, with the complete sentence being "Duncan has said at some length what he claims to find unpleasant about interacting with you, it isn't just 'Said keeps finding mistakes in what I have written', and it is (to me) very plausible that someone might find it unpleasant and annoying"), do you claim "approximates" the explanation that he did so because you "keep finding mistakes in what he has written"? I should like to see a specific remark from him that you think is reasonably construed as such.

(I’m pretty sure that) other people have disagreed robustly with Duncan and not had him ban them from commenting on his posts.

Let’s see some examples, then we can talk.

I present myself as an example; I confirm that, after leaving this comment expressing clear disagreement with Duncan, I have not been banned from commenting on any of his posts.

I am (moreover) quite confident in my ability to find additional such examples if necessary, but in lieu of that, I will instead question the necessity of such: did you, Said Achmiz, (prior to my finding an example) honestly expect/suspect that there were no such examples to be found? This would seem to equate to a belief that Duncan has banned anyone and everyone who has dared to disagree with him in the past, which in turn would (given his prolific writing and posting behavior) imply that he should have a substantial fraction of the regular LW commentariat banned—which should have been extremely obviously false to you from the start!

Indeed, this observation has me questioning the reliability of your stance on this particular issue, since the tendency to get things like this wrong suggests a model of (this subregion of) reality so deeply flawed, little to no wisdom avails to be extracted.

If Alice criticizes one of Bob’s posts, and Bob immediately or shortly thereafter bans Alice from commenting on Bob’s posts, the immediate default assumption should be that the criticism was the reason for the ban. Knowing nothing else, just based on these bare facts, we should jump right to the assumption that Bob’s reasons for banning Alice were lousy.

As alluded to in the quote/response pair at the beginning of this comment, this is not a valid inference. What you propose is a valid probabilistic inference in the setting where we are presented only with the information you describe (although even then the strength of update justified by such information is limited at best). Nonetheless, there are plenty of remaining hypotheses consistent with the information in question, and which have (hence) not been ruled out merely by observing Bob to have banned Alice.

For example, suppose it is the case that Alice (in addition to criticizing Bob's object-level points) also takes it upon herself to include, in each of her comments, a remark to the effect that Bob is physically unattractive. I don't expect it controversial to suggest that this behavior would be considered inappropriate by the standards, not just of LW, but of any conversational forum that considers itself to have standards at all; and if Bob then proceeded to ban Alice for such provocations, we would not consider this evidence that he cannot tolerate criticism. The reason for the ban, after all, would have been explained, and thus screened off, leaving us with no reason to suspect him of banning Alice for "lousy reasons".

No doubt you will claim, here, that the situation is not relevantly analogous, since you have not, in fact, insulted Duncan's physical appearance. But the claim that you have not, in any of your prior interactions with him, engaged in a style of discourse that made him think of you as an unusually unlikely-to-be-productive commenter, is, I think, unsupported. And if he had perceived you as such, why, this might then be perceived as sufficient grounds to remove the possibility of such unproductive interactions going forward, and to make that decision independent of the quality (or, indeed, existence) of your object-level criticisms.

Comment by dxu on The Orthogonality Thesis is Not Obviously True · 2023-04-06T04:55:20.762Z · LW · GW

Your link looks broken; here's a working version.

(Note: your formatting looks correct to me, so I suspect the issue is that you're not using the Markdown version of the LW editor. If so, you can switch to that using the dropdown menu directly below the text input box.)

Comment by dxu on LW Team is adjusting moderation policy · 2023-04-06T04:53:07.814Z · LW · GW

I think diverting people to a real-time discussion location like Discord could be more effective.

Agreed—which raises the following question: does LW currently have anything like an official/primary public chatroom (whether hosted on Discord or elsewhere)? If not, it may be worth creating one, announcing it in a post (for visibility), and maintaining a prominently visible link to it on e.g. the sidebar (which is what many subreddits do).

Comment by dxu on Eliezer Yudkowsky’s Letter in Time Magazine · 2023-04-06T04:22:37.752Z · LW · GW

Do you have preferred arguments (or links to preferred arguments) for/against these claims? From where I stand:

- Point 1 looks to be less a positive claim and more a policy criticism (for which I'd need to know what specifically you dislike about the policy in question to respond in more depth).
- Points 2 and 3 are straightforwardly true statements on my model (albeit I'd somewhat weaken my phrasing of point 3; I don't necessarily think agency is "automatic", although I do consider it quite likely to arise by default).
- Point 4 seems likewise true, because the argmax function is sensitive only to the sign of the difference between its inputs, not to the magnitude of that difference (see the sketch below).
- Point 5 is the kind of thing that would benefit immensely from liberal usage of hyperlinks.
- Point 6 is again a policy criticism in need of corresponding explanation.
- Point 7 seems ill-supported and would benefit from more concrete analysis (both numerical, i.e. where are you getting your numbers, and probabilistic, i.e. how are you assigning your likelihoods).
- Point 8 again seems like the kind of thing where links would be immensely beneficial.
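
For the argmax point, a minimal sketch of the property being invoked (the `argmax` helper below is my own illustration, not any particular library's implementation): which index gets selected depends only on which entry is largest, not on how much larger it is than the alternatives.

```python
def argmax(xs):
    """Return the index of the largest element of xs."""
    return max(range(len(xs)), key=lambda i: xs[i])

# The selected index depends only on which entry is largest,
# not on the magnitude of the gap between entries.
assert argmax([1.0, 1.0 + 1e-9]) == argmax([1.0, 1e9]) == 1
```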

On the whole, I think your comment generates more heat than light, and I think there were significantly better moves available to you if your aim was to open a discussion (several of which I predict would have resulted in comments I would counterfactually have upvoted). As it is, however, your comment does not meet the bar for discourse quality I would like to see for comments on LW, which is why I have given it a strong downvote (and a weak disagree-vote).

Comment by dxu on Communicating effectively under Knightian norms · 2023-04-04T23:51:43.945Z · LW · GW

For example, I find it hard to predict when and how AGI is developed, and I expect that many of my ideas and predictions about that will be mistaken. This makes me more pessimistic, rather than less, since it seems pretty hard to get AI alignment right if we can't even predict basic things like "when will this system have situational awareness", etc.

Yes, and this can be framed as a consequence of a more general principle, which is that model uncertainty doesn't save you from pessimistic outcomes unless your prior (which after all is what you fall back to in the subset of possible worlds where your primary inside-view models are significantly flawed) offers its own reasons to be reassured. And if your prior doesn't say that (and for the record: mine doesn't), then having model uncertainty doesn't actually reduce P(doom) by very much!
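
To put toy numbers on it (these figures are hypothetical, purely for illustration): if your inside-view model says P(doom | model right) = 0.8, you give the model itself a 50% chance of being right, and your fallback prior says P(doom | model wrong) = 0.5, then

$$P(\text{doom}) = 0.8 \times 0.5 + 0.5 \times 0.5 = 0.65.$$

The model uncertainty only buys you much reassurance to the extent that the fallback term is itself low.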

Comment by dxu on LW Team is adjusting moderation policy · 2023-04-04T21:20:34.163Z · LW · GW

I would be interested in helping out with a newbie comment queue to keep it moving quickly so that newbies can have a positive early experience on lesswrong, whereas I would not want to volunteer for the "real" mod team because I don't have the requisite time and skills for reliably showing up for the more nuanced aspects of the role.

Were such a proposal to be adopted, I would be likewise willing to participate.