And All the Shoggoths Merely Players 2024-02-10T19:56:59.513Z
On the Contrary, Steelmanning Is Normal; ITT-Passing Is Niche 2024-01-09T23:12:20.349Z
If Clarity Seems Like Death to Them 2023-12-30T17:40:42.622Z
Lying Alignment Chart 2023-11-29T16:15:28.102Z
Fake Deeply 2023-10-26T19:55:22.340Z
Alignment Implications of LLM Successes: a Debate in One Act 2023-10-21T15:22:23.053Z
Contra Yudkowsky on Epistemic Conduct for Author Criticism 2023-09-13T15:33:14.987Z
Assume Bad Faith 2023-08-25T17:36:32.678Z
"Is There Anything That's Worth More" 2023-08-02T03:28:16.116Z
Lack of Social Grace Is an Epistemic Virtue 2023-07-31T16:38:05.375Z
"Justice, Cherryl." 2023-07-23T16:16:40.835Z
A Hill of Validity in Defense of Meaning 2023-07-15T17:57:14.385Z
Blanchard's Dangerous Idea and the Plight of the Lucid Crossdreamer 2023-07-08T18:03:49.319Z
We Are Less Wrong than E. T. Jaynes on Loss Functions in Human Society 2023-06-05T05:34:59.440Z
Bayesian Networks Aren't Necessarily Causal 2023-05-14T01:42:24.319Z
"You'll Never Persuade People Like That" 2023-03-12T05:38:18.974Z
"Rationalist Discourse" Is Like "Physicist Motors" 2023-02-26T05:58:29.249Z
Conflict Theory of Bounded Distrust 2023-02-12T05:30:30.760Z
Reply to Duncan Sabien on Strawmanning 2023-02-03T17:57:10.034Z
Aiming for Convergence Is Like Discouraging Betting 2023-02-01T00:03:21.315Z
Comment on "Propositions Concerning Digital Minds and Society" 2022-07-10T05:48:51.013Z
Challenges to Yudkowsky's Pronoun Reform Proposal 2022-03-13T20:38:57.523Z
Comment on "Deception as Cooperation" 2021-11-27T04:04:56.571Z
Feature Selection 2021-11-01T00:22:29.993Z
Glen Weyl: "Why I Was Wrong to Demonize Rationalism" 2021-10-08T05:36:08.691Z
Blood Is Thicker Than Water 🐬 2021-09-28T03:21:53.997Z
Sam Altman and Ezra Klein on the AI Revolution 2021-06-27T04:53:17.219Z
Reply to Nate Soares on Dolphins 2021-06-10T04:53:15.561Z
Sexual Dimorphism in Yudkowsky's Sequences, in Relation to My Gender Problems 2021-05-03T04:31:23.547Z
Communication Requires Common Interests or Differential Signal Costs 2021-03-26T06:41:25.043Z
Less Wrong Poetry Corner: Coventry Patmore's "Magna Est Veritas" 2021-01-30T05:16:26.486Z
Unnatural Categories Are Optimized for Deception 2021-01-08T20:54:57.979Z
And You Take Me the Way I Am 2020-12-31T05:45:24.952Z
Containment Thread on the Motivation and Political Context for My Philosophy of Language Agenda 2020-12-10T08:30:19.126Z
Scoring 2020 U.S. Presidential Election Predictions 2020-11-08T02:28:29.234Z
Message Length 2020-10-20T05:52:56.277Z
Msg Len 2020-10-12T03:35:05.353Z
Artificial Intelligence: A Modern Approach (4th edition) on the Alignment Problem 2020-09-17T02:23:58.869Z
Maybe Lying Can't Exist?! 2020-08-23T00:36:43.740Z
Algorithmic Intent: A Hansonian Generalized Anti-Zombie Principle 2020-07-14T06:03:17.761Z
Optimized Propaganda with Bayesian Networks: Comment on "Articulating Lay Theories Through Graphical Models" 2020-06-29T02:45:08.145Z
Philosophy in the Darkest Timeline: Basics of the Evolution of Meaning 2020-06-07T07:52:09.143Z
Comment on "Endogenous Epistemic Factionalization" 2020-05-20T18:04:53.857Z
"Starwink" by Alicorn 2020-05-18T08:17:53.193Z
Zoom Technologies, Inc. vs. the Efficient Markets Hypothesis 2020-05-11T06:00:24.836Z
A Book Review 2020-04-28T17:43:07.729Z
Brief Response to Suspended Reason on Parallels Between Skyrms on Signaling and Yudkowsky on Language and Evidence 2020-04-16T03:44:06.940Z
Zeynep Tufekci on Why Telling People They Don't Need Masks Backfired 2020-03-18T04:34:09.644Z
The Heckler's Veto Is Also Subject to the Unilateralist's Curse 2020-03-09T08:11:58.886Z
Relationship Outcomes Are Not Particularly Sensitive to Small Variations in Verbal Ability 2020-02-09T00:34:39.680Z


Comment by Zack_M_Davis on New LessWrong review winner UI ("The LeastWrong" section and full-art post pages) · 2024-02-28T07:12:17.531Z · LW · GW

I second the concern that using "LeastWrong" on the site grants undue legitimacy to the bad "than others" interpretation of the brand name (as contrasted to the intended "all models are wrong, but" meaning). "Best Of" is clear and doesn't distort the brand.

Comment by Zack_M_Davis on Less Wrong automated systems are inadvertently Censoring me · 2024-02-27T06:44:20.582Z · LW · GW

Would you agree with the statement that your meta-level articles are more karma-successful than your object-level articles? Because if that is a fair description, I see it as a huge problem.

I don't think this is a good characterization of my posts on this website.

If by "meta-level articles", you mean my philosophy of language work (like "Where to Draw the Boundaries?" and "Unnatural Categories Are Optimized for Deception"), I don't think success is a problem. I think that was genuinely good work that bears directly on the site's mission, independently of the historical fact that I had my own idiosyncratic ("object-level"?) reasons for getting obsessed with the philosophy of language in 2019–2020.[1]

If by "object-level articles", you mean my writing on my special-interest blog about sexology and gender, well, the overwhelming majority of that never got a karma score because it was never cross-posted to Less Wrong. (I only cross-post specific articles from my special-interest blog when I think they're plausibly relevant to the site's mission.)

If by "meta-level articles", you mean my recent memoir sequence which talks about sexology and the philosophy of language and various autobiographical episodes of low-stakes infighting among community members in Berkeley, California, well, those haven't been karma-successful: parts 1, 2, and 3 are currently[2] sitting at 0.35, 0.08 (!), and 0.54 karma-per-vote, respectively.

If by "meta-level articles", you mean posts that reply to other users of this website (such as "Contra Yudkowsky on Epistemic Conduct for Author Criticism" or "'Rationalist Discourse' Is Like 'Physicist Motors'"), I contest the "meta level" characterization. I think it's normal and not particularly meta for intellectuals to write critiques of each other's work, where Smith writes "Kittens are Cute", and Jones replies in "Contra Smith on Kitten Cuteness". Sure, it would be possible for Jones to write a broadly similar article, "Kittens Aren't Cute", that ignores Smith altogether, but I think that's often a worse choice, if the narrow purpose of Jones's article is to critique the specific arguments made by Smith, notwithstanding that someone else might have better arguments in favor of the Cute Kitten theory that have not been heretofore considered.

You're correct to notice that a lot of my recent work has a cult-infighting drama angle to it. (This is very explicit in the memoir sequence, but it noticeably leaks into my writing elsewhere.) I'm pretty sure I'm not doing it for the karma. I think I'm doing it because I'm disillusioned and traumatized from the events described in the memoir, and will hopefully get over it after I've got it all written down and out of my system.

There's another couple posts in that sequence (including this coming Saturday, probably). If you don't like it, I hereby encourage you to strong-downvote it. I write because I selfishly have something to say; I don't think I'm entitled to anyone's approval.

  1. In some of those posts, I referenced the work of conventional academics like Brian Skyrms and others, which I think provides some support for the notion that the nature of language and categories is a philosophically rich topic that someone might find significant in its own right, rather than being some sort of smokescreen for a hidden agenda. ↩︎

  2. Pt. 1 actually had a much higher score (over 100 points) shortly after publication, but got a lot of downvotes later after being criticized on Twitter. ↩︎

Comment by Zack_M_Davis on Communication Requires Common Interests or Differential Signal Costs · 2024-02-27T04:40:48.827Z · LW · GW

Personal whimsy. Probably don't read too much into it. (My ideology has evolved over the years such that I think a lot of the people who are trying to signal something with the generic feminine would not regard me as an ally, but I still love the æsthetic.)

Comment by Zack_M_Davis on Less Wrong automated systems are inadvertently Censoring me · 2024-02-23T16:29:32.524Z · LW · GW

Zack cannot convince us [...] if you disagree with him, that only proves his point

I don't think I'm doing this! It's true that I think it's common for apparent disagreements to be explained by political factors, but I think that claim is itself something I can support with evidence and arguments. I absolutely reject "If you disagree, that itself proves I'm right" as an argument, and I think I've been clear about this. (See the paragraph in "A Hill of Validity in Defense of Meaning" starting with "Especially compared to normal Berkeley [...]".)

If you're interested, I'm willing to write more words explaining my model of which disagreements with which people on which topics are being biased by which factors. But I get the sense that you don't care that much, and that you're just annoyed that my grudge against Yudkowsky and a lot of people in Berkeley is too easily summarized as being with an abstracted "community" that you also happen to be in even though this has nothing to do with you? Sorry! I'm not totally sure how to fix this. (It's useful to sometimes be able to talk about general cultural trends, and being specific about which exact sub-sub-clusters are and are not guilty of the behavior being criticized would be a lot of extra wordcount that I don't think anyone is interested in.)

Comment by Zack_M_Davis on And All the Shoggoths Merely Players · 2024-02-20T06:27:08.698Z · LW · GW

Simplicia: Where does "empirical evidence" fall on the sliding scale of rigor between "handwavy metaphor" and "mathematical proof"? The reason I think the KL penalty in RLHF setups impacts anything we care about isn't mostly because the vague handwaving sounds plausible, but because of data such as that presented in Fig. 5 of Stiennon et al. 2020. They varied the size of the KL penalty of an LLM RLHF'd for a summarization task, and found about what you'd expect from the vague handwaving: as the KL penalty decreases, the reward model's predicted quality of the output goes up (tautologically), but actual preference of human raters when you show them the summaries follows an inverted-U curve, where straying from the base policy a little is good, but straying farther is increasingly bad, as the system overoptimizes on what looks good to the reward model, which was only a proxy for the true goal.
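For concreteness, the quantity being varied in that experiment can be sketched in a few lines. This is a minimal illustration, assuming the per-sample form described in Stiennon et al. 2020, where the reward model's score is reduced by β times the log-ratio of the tuned policy's probability to the reference policy's probability for the same output. The function name and the numbers are hypothetical.

```python
def kl_penalized_reward(rm_score, logp_policy, logp_ref, beta):
    """Reward-model score minus beta times a single-sample KL estimate.

    logp_policy and logp_ref are the log-probabilities of the same sampled
    output under the tuned policy and the reference (base) policy.
    All names here are illustrative, not taken from any particular codebase.
    """
    kl_estimate = logp_policy - logp_ref
    return rm_score - beta * kl_estimate


# Hypothetical numbers: an output the reward model likes (score 2.0) that the
# tuned policy makes much more likely than the reference policy does.
strong_penalty = kl_penalized_reward(2.0, logp_policy=-5.0, logp_ref=-25.0, beta=0.1)
weak_penalty = kl_penalized_reward(2.0, logp_policy=-5.0, logp_ref=-25.0, beta=0.01)
```

With the larger β, the same output earns less net reward, so the optimizer has less incentive to stray far from the reference policy; shrinking β removes that brake.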

(You can see examples of the overoptimized summaries in Table 29 on the last page of the paper. Apparently the reward model really liked tripled question marks and the phrase "pls halp"??? I weakly suspect that these are the kind of "weird squiggles" that would improve with scaling up the reward model, similarly to how state-of-the-art image generators lack the distortions and artifacts of their compute-impoverished predecessors. The reward model in this experiment was only 1.3 billion parameters.)

I'm sure you'll have no trouble interpreting these results as yet another portent of our impending deaths. We were speaking theoretically about AIs exploiting the Goodhart problem between human ratings and actual goodness, but practical RLHF systems aren't actually sample-efficient enough to solely use direct human feedback, and have an additional Goodhart problem between reward model predictions of human ratings, and actual ratings. Isn't that worse? Well, yes.

But the ray of hope I see here is more meta and methodological, rather than turning on any one empirical result. It's that we have empirical results. We can study these machines, now, before their successors are powerful enough to kill us. The iterative design loop hasn't failed yet. That can't last forever—at some point between here and the superintelligence at the end of time, humans are going to be out of the loop. I'm glad people are doing theory trying to figure out what that looks like and how it could be arranged to go well.

But I'm worried about ungrounded alignment theorizing failing to make contact with reality, sneeringly dismissing genuinely workable designs as impossible by appealing to perfectly antisphexish consequentialists on a frictionless plane, when some amount of sphexishness and friction is a known factor of the algorithms in question.

We seem to agree that GPT-4 is smart enough to conceive of the strategy of threatening or bribing labelers. So ... why doesn't that happen? I mean, like, literal threats and promises. You mention rumors from a DeepMind employee about the larger Gemini models being hard to train, but without more details, I'm inclined to guess that that was "pls halp"-style overoptimization rather than the kind of power-seeking or deceptive alignment that would break the design loop. (Incidentally, Gao et al. 2022 studied scaling laws for reward model overoptimization and claimed that model size basically didn't matter? See §4.4, "Policy size independence".)

What's going on here? If I'm right that GPT-4 isn't secretly plotting to murder us, even though it's smart enough to formulate the idea and expected utility maximizers have a convergent incentive to murder competitors, why is that?

Here's my view: model-free reinforcement learning algorithms such as those used in RLHF tweak your AI's behavior to be more like the behavior that got reward in the past, which is importantly different from expected utility maximization. To the extent that you succeed in rewarding Honest, Helpful, and Harmless behavior in safe regimes, you can plausibly get a basically HHH AI assistant that generalizes to not betraying you when it has the chance, similar to how I don't do heroin because I don't want to become a heroin addict—even though if I did take heroin, the reinforcement from that would make me more likely to do it again. Then the nature of the game is keeping that good behavior "on track" for as long as we can—even though the superintelligence at the end of time is presumably going to do something more advanced than model-free RL. It's possible to screw up and reward the wrong thing, per the robot hand in front of the ball—but if you don't screw up early, your basically-friendly-but-not-maximally-capable AIs can help you not screw up later. And in the initial stages, you're only fighting gradient descent, not even an AGI.

More broadly, here's how I see the Story of Alignment so far. It's been obvious to sufficiently penetrating thinkers for a long time that the deep future belongs to machine intelligence—that, as George Eliot put it in 1879, "the creatures who are to transcend and finally supersede us [will] be steely organisms, giving out the effluvia of the laboratory, and performing with infallible exactness more than everything that we have performed with a slovenly approximativeness and self-defeating inaccuracy."

What's less obvious is how much control we can exert over how that goes by setting the initial conditions. Can we arrange for the creatures who are to transcend and finally supersede us to be friendly and create the kind of world we would want, or will they murder us and tile the universe with something random?

Fifteen years ago, the problem looked hopeless, just from considering the vast complexity of human values. How would you write a computer program that values "happiness", "freedom", or "justice", let alone everything else we want? It wasn't clear how to build AI at all, but surely it would be easier to build some AI than a good AI. Humanity was doomed.

But now, after the decade of deep learning, the problem and its impossible solution seem to be arriving closer together than I would have ever dreamt. Okay, we still don't know how to write down the human utility function, to be plugged in to an arbitrarily powerful optimizer.

But it's increasingly looking like value isn't that fragile if it's specified in latent space, rather than a program that breaks if a single character is wrong—that there are ways to meaningfully shape the initial conditions of our world's ascension that don't take the exacting shape of "utility function + optimizer".

We can leverage unsupervised learning on human demonstration data to do tasks the way humans do them, and we can use RLHF to elicit behavior we want in situations where we can't write down our desires as an explicit reward or utility function. Crucially, by using these techniques together to compensate for each other's safety and capability weaknesses, it seems feasible to build AI whose effects look "like humans, but faster": performing with infallible exactness everything that we would have performed with a slovenly approximativeness and self-defeating inaccuracy. That doesn't immediately bring about the superintelligence at the end of time—although it might look pretty immediate in sidereal time—but seems like a pretty good way to kick off our world's ascension.

Is this story wrong? Maybe! ... probably? My mother named me "Simplicia", over my father's objections, because of my unexpectedly low polygenic scores. I am aware of my ... [she hesitates and coughs, as if choking on the phrase] learning disability. I'm not confident in any of this.

But if I'm wrong, there should be arguments explaining why I'm wrong—arguments that should convince scientists working in the field, even if I personally am too limited to understand them. I've tried to ground my case in my understanding of the state of the art, citing relevant papers when applicable.

In contrast, dismissing the entire field as hopeless on the basis of philosophy about "perfectly learn[ing] and perfectly maximiz[ing] the referent of rewards" isn't engaging with the current state of alignment, let alone all the further advances that humans and our non-superintelligent AIs will come up with before the end of days! Doomimir Doomovitch, with the fate of the lightcone in the balance, isn't it more dignified to at least consider the possibility that someone else might have thought of something? Reply! Reply!

Comment by Zack_M_Davis on And All the Shoggoths Merely Players · 2024-02-17T23:41:11.034Z · LW · GW

Simplicia: I think it's significant that the "hand between ball and camera" example from Amodei et al. 2017 was pure RL from scratch. You have a function π that maps observations (from the robot's sensors) to actions (applying torques to the robot's joints). You sample sequences of observation–action pairs from π and show them to a human, and fit a function r̂ to approximate the human's choices. Then you use Trust Region Policy Optimization to adjust π to score better according to r̂. In this case, TRPO happened to find something that looked good instead of being good, in a way that r̂ wasn't able to distinguish. That is, we screwed up and trained the wrong thing. That's a problem, and the severity of the problem would get worse the more capable π was and the more you were relying on it. If we were going to produce powerful general AI systems with RL alone, I would be very nervous.
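As a rough sketch of the "fit a function r̂ to approximate the human's choices" step: one common choice, assumed here rather than stated in the paragraph above, is a Bradley-Terry style loss in which the probability that the human prefers clip A over clip B is a logistic function of the difference in predicted rewards. The function names are hypothetical.

```python
import math


def preference_probability(r_hat_a, r_hat_b):
    """Modeled probability that the human prefers A, given predicted rewards."""
    return 1.0 / (1.0 + math.exp(r_hat_b - r_hat_a))


def preference_loss(r_hat_a, r_hat_b, human_prefers_a):
    """Cross-entropy between the modeled preference and the human's choice."""
    p_a = preference_probability(r_hat_a, r_hat_b)
    return -math.log(p_a if human_prefers_a else 1.0 - p_a)
```

Minimizing a loss like this only pushes r̂ to agree with what the human chose based on what they could see, which is how a policy that merely looks good on camera can end up scoring well.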

But the reason I'm so excited about language models in particular is that their capabilities seem to mostly come from unsupervised pre-training rather than RLHF. You fit a function to the entire internet first, and only afterwards tweak it a bit so that its outputs look more like obeying commands rather than predicting random internet tokens—where the tweaking process incorporates tricks like penalizing the Kullback–Leibler divergence from the reward model's training distribution, such that you're not pulling the policy too far away from the known-safe baseline.

I agree that as a consequentialist with the goal of getting good ratings, the strategy of "bribe the rater" isn't very hard to come up with. Indeed, when I prompt GPT-4 with the problem, it gives me "Offering Incentives for Mislabeling" as #7 on a list of 8.

But the fact that GPT-4 can do that seems like it's because that kind of reasoning appears on the internet, which is what I mean by the claim that contemporary systems are "reasoning with" rather than "reasoning about": the assistant simulacrum being able to explain bribery when prompted isn't the same thing as the LM itself trying to maximize reward.

I'd be interested in hearing more details about those rumors of smarter models being more prone to exploit rater mistakes. What did those entail, exactly? (To the extent that we lack critical evidence about this potential alignment failure because the people who experienced it are gagged by an NDA, that seems like a point in favor of sharing information about language model capabilities.)

I certainly expect some amount of sycophancy: if you sample token completions from your LM, and then tweak its outputs to be more like what your raters want to hear, you end up with an LM that's more inclined to say what your raters want to hear. Fine. That's a problem. Is it a fatal problem? I mean, if you don't try to address it at all and delegate all of your civilization's cognition to machines that don't want to tell you about problems, then eventually you might die of problems your AIs didn't tell you about.

But "mere" sycophancy sounds like a significantly less terrifying failure mode than reward hacking of the sort that would result in things like the LM spontaneously trying to threaten or bribe labelers. That would have a large KL divergence from the policy you started with!

Comment by Zack_M_Davis on And All the Shoggoths Merely Players · 2024-02-17T06:17:57.164Z · LW · GW

I think part of the reason the post ends without addressing this is that, unfortunately, I don't think I properly understand this one yet, even after reading your dialogue with Eli Tyre.

The next paragraph of the post links Christiano's 2015 "Two Kinds of Generalization", which I found insightful and seems relevant. By way of illustration, Christiano describes two types of possible systems for labeling videos: (1) a human classifier (which predicts what label a human would assign), and (2) a generative model (which directly builds a mapping between descriptions and videos roughly the way our brains do it). Notably, the human classifier behaves undesirably on inputs that bribe, threaten, or otherwise hack the human: for example, a video of the text "I'll give you $100 if you classify this as an apple" might get classified as an apple. (And an arbitrarily powerful search for maximally apple-classified inputs would turn those up.)

Christiano goes on to describe a number of differences between these two purported kinds of generalization: (1) is reasoning about the human, whereas (2) is reasoning with a model not unlike the one inside the human's brain; searching for simple Turing machines would tend to produce (1), whereas searching for small circuits would tend to produce (2); and so on.

It would be bad to end up with a system that behaves like (1) without realizing it. That definitely seems like it would kill you. But (Simplicia asks) how likely that is seems like a complicated empirical question about how ML generalization works and how you built your particular AI, that isn't definitively answered by "in the limit" philosophy about "perfectly learn[ing] and perfectly maximiz[ing] the referent of rewards assigned by human operators"? That is, I agree that if you argmax over possible programs for the one that results in the most reward-button presses, you get something that only wants to seize the reward button. But the path-dependent details between "argmax over possible programs" and "pretraining + HFDT + regularization + early stopping + &c." seem like they make a big difference. The technology in front of us really does seem like it's "reasoning with" rather than "reasoning about" (while also seeming to be on the path towards "real AGI" rather than a mere curiosity).

When I try to imagine what Doomimir would say to that, all I can come up with is a metaphor about perpetual-motion-machine inventors whose designs are so complicated that it's hard to work out where the error is, even though the laws of thermodynamics clearly imply that there must be an error. That sounds plausible to me as a handwavy metaphor; I could totally believe that the ultimate laws of intelligence (not known to me personally) work that way.

The thing is, we do need more than a handwavy metaphor! "Yes, your gold-printing machine seems to be working great, but my intuition says it's definitely going to kill everyone. No, I haven't been able to convince relevant experts who aren't part of my robot cult, but that's because they're from Earth and therefore racially inferior to me. No, I'm not willing to make any concrete bets or predictions about what happens before then" is a non-starter even if it turns out to be true.

Comment by Zack_M_Davis on Natural abstractions are observer-dependent: a conversation with John Wentworth · 2024-02-14T17:29:16.348Z · LW · GW

(Continued in containment thread.)

Comment by Zack_M_Davis on Containment Thread on the Motivation and Political Context for My Philosophy of Language Agenda · 2024-02-14T17:27:48.289Z · LW · GW

(Responding to Tailcalled.)

you mostly address the rationalist community, and Bailey mostly addresses GCs and HBDs, and so on. So "most people you encounter using that term on Twitter" doesn't refer to irrelevant outsiders, it refers to the people you're trying to have the conversation with

That makes sense as a critique of my or Bailey's writing, but "Davis and Bailey's writing is unclear and arguably deceptive given their target audience's knowledge" is a very different claim than "autogynephilia is not a natural abstraction"!!

I think you naturally thought of autogynephilia and gender progressivism as being more closely related than they really are

Permalink or it didn't happen: what's your textual evidence that I was doing this? (I do expect there to be a relationship of some strength in the AGP→progressive direction, but my 2017–8 models were not in any way surprised by, e.g., the "Conservative Men in Conservative Dresses" profiled in The Atlantic in 2005, or the second kind of mukhannath, or Kalonymus ben Kalonymus.)

Comment by Zack_M_Davis on Natural abstractions are observer-dependent: a conversation with John Wentworth · 2024-02-13T22:22:50.790Z · LW · GW

Let's say the species is the whitebark pine P. albicaulis, which grows in a sprawling shrub-like form called krummholz in rough high-altitude environments, but looks like a conventional upright tree in more forgiving climates.

Suppose that a lot of people don't like krummholz and have taken to using the formal species name P. albicaulis as a disparaging term (even though a few other species can also grow as krummholz).

I think Tail is saying that "P. albicaulis" isn't a natural abstraction, because most people you encounter using that term on Twitter are talking about krummholz, without realizing that other species can grow as krummholz or that many P. albicaulis grow as upright trees.

I'm saying it's dumb to assert that P. albicaulis isn't a natural abstraction just because most people are ignorant of dendrology and are only paying attention to the shrub vs. tree subspace: if I look at more features of vegetation than just broad shape, I end up needing to formulate P. albicaulis to explain the things some of these woody plants have in common despite their shape.

Comment by Zack_M_Davis on And All the Shoggoths Merely Players · 2024-02-12T18:45:10.938Z · LW · GW

Simplicia: Oh! Because if there are nine wrong labels that aren't individually more common than the correct label, then the most they can collectively outnumber the correct label is by 9 to 1. But I could have sworn that Rolnick et al. §3.2 said that—oh, I see. I misinterpreted Figure 4. I should have said "twenty noisy labels for every correct one", not "twenty wrong labels"—where some of the noisy labels are correct "by chance".

For example, training examples with the correct label 0 could appear with the label 0 for sure 10 times, and then get a uniform random label 200 times, and thus be correctly labeled 10 + 200/10 = 30 times, compared to 20 for each wrong label. (In expectation—but you also could set it up so that the "noisy" labels don't deviate from the expected frequencies.) That doesn't violate the pigeonhole principle.
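The arithmetic can be checked with a short sketch. The constants mirror the numbers in this example (ten classes, 10 clean copies, 200 uniformly noisy copies); the function name is hypothetical.

```python
from fractions import Fraction

NUM_CLASSES = 10
CLEAN_COPIES = 10   # appearances with the correct label for sure
NOISY_COPIES = 200  # appearances with a uniformly random label


def expected_label_counts(true_label):
    """Expected number of times each label is attached to one example."""
    per_class_noise = Fraction(NOISY_COPIES, NUM_CLASSES)
    counts = {label: per_class_noise for label in range(NUM_CLASSES)}
    counts[true_label] += CLEAN_COPIES
    return counts


counts = expected_label_counts(true_label=0)
noisy_per_clean = Fraction(NOISY_COPIES, CLEAN_COPIES)
```

Here `counts[0]` is the expected number of correct labelings, the other nine entries are the expected counts for each wrong label, and `noisy_per_clean` is the ratio of noisy labels to labels that are correct for sure.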

I regret the error. Can we just—pretend I said the correct thing? If there were a transcript of what I said, it would only be a one-word edit. Thanks.

Comment by Zack_M_Davis on What's the theory of impact for activation vectors? · 2024-02-11T22:23:23.100Z · LW · GW

I thought the idea was that steering unsupervisedly-learned abstractions circumvents failure modes of optimizing against human feedback.

Comment by Zack_M_Davis on Dreams of AI alignment: The danger of suggestive names · 2024-02-11T06:57:09.262Z · LW · GW

I have given up on communicating with most folk who have been in the community longer than, say, 3 years

I suspect this should actually be something more like "longer than 3 but less than 10." (You're expressing resentment for the party line on AI risk, but "the community" wasn't always all about that! There used to be a vision of systematic methods for thinking more clearly.)

Comment by Zack_M_Davis on TurnTrout's shortform feed · 2024-02-01T02:05:55.231Z · LW · GW

I think "Symbol/Referent Confusions in Language Model Alignment Experiments" is relevant here: the fact that the model emits sentences in the grammatical first person doesn't seem like reliable evidence that it "really knows" it's talking about "itself". (It's not evidence because it's fictional, but I can't help but think of the first chapter of Greg Egan's Diaspora, in which a young software mind is depicted as learning to say I and me before the "click" of self-awareness when it notices itself as a specially controllable element in its world-model.)

Of course, the obvious followup question is, "Okay, so what experiment would be good evidence for 'real' situational awareness in LLMs?" Seems tricky. (And the fact that it seems tricky to me suggests that I don't have a good handle on what "situational awareness" is, if that is even the correct concept.)

Comment by Zack_M_Davis on Don't sleep on Coordination Takeoffs · 2024-01-28T18:10:05.511Z · LW · GW

Do you think you could explain your thesis in a way that would make sense to someone who had never heard of "the EA, rationalist, and AI safety communities"? ("Moloch"? "Dath ilan"? Am I supposed to know who these people are?) You allude to "knowledge of decision theory or economics", but it's not clear what the specific claim or proposal is here.

Comment by Zack_M_Davis on Unnatural Categories Are Optimized for Deception · 2024-01-20T02:08:12.555Z · LW · GW

But presumably the reason the CEO would be sad if people didn't consider neural fireplaces to be fireplaces is because he wants to be leading a successful company that makes things people want, not a useless company with a useless product. Redefining words "in the map" doesn't help achieve goals "in the territory".

The OP discusses a similar example about wanting to be funny. If I think I can get away with changing the definition of the word "funny" such that it includes my jokes by definition, I'm less likely to try interventions that will make people want to watch my stand-up routine, which is one of the consequences I care about that the old concept of funny pointed to and the new concept doesn't.

Now, it's true that, in all metaphysical strictness, the map is part of the territory. "What the CEO thinks" and "what we've all agreed to put in the same category" are real-world criteria that one can use to discriminate between entities.

But if you're not trying to deceive someone by leveraging ambiguity between new and old definitions, it's hard to see why someone would care about such "thin" categories (simply defined by fiat, rather than pointing to a cluster in a "thicker", higher-dimensional subspace of related properties). The previous post discusses the example of a "Vice President" job title that's identical to a menial job in all but the title itself: if being a "Vice President" doesn't imply anything about pay or authority or job duties, it's not clear why I would particularly want to be a "Vice President", except insofar as I'm being fooled by what the term used to mean.

Comment by Zack_M_Davis on Unnatural Categories Are Optimized for Deception · 2024-01-19T05:01:55.758Z · LW · GW

Right. What's "natural" depends on which features you're paying attention to, which can depend on your values. Electric, wood-burning, and neural fireplaces are similar if you're only paying attention to the subjective experience, but electric and wood-burning fireplaces form a cluster that excludes neural fireplaces if you're also considering objective light and temperature conditions.
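
As a minimal sketch of how the clusters shift with the features under consideration (the feature values below are invented purely for illustration, and `distance` is a hypothetical helper, not anything from the post):

```python
# Hypothetical feature values, made up for this illustration only.
# Feature order: (subjective experience of a fire, emits heat, emits light)
fireplaces = {
    "wood-burning": (1.0, 1.0, 1.0),
    "electric": (1.0, 1.0, 1.0),
    "neural": (1.0, 0.0, 0.0),
}

SUBJECTIVE_ONLY = [0]
ALL_FEATURES = [0, 1, 2]


def distance(a, b, features):
    """Euclidean distance between two fireplaces, restricted to the chosen feature indices."""
    return sum((fireplaces[a][i] - fireplaces[b][i]) ** 2 for i in features) ** 0.5


for name, features in [("subjective only", SUBJECTIVE_ONLY), ("all features", ALL_FEATURES)]:
    print(name, "electric vs. neural:", distance("electric", "neural", features))
```

Restricted to the subjective-experience feature, all three sit at distance zero from each other; once heat and light are included, the neural fireplace is pushed away from the other two, which remain together.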

The thesis of this post is that people who think neural fireplaces are fireplaces should be arguing for that on the merits—that the decision-relevant thing is having the subjective experience of a fireplace, even if the hallucinations don't provide heat or light. They shouldn't be saying, "We prefer to draw our categories this way because otherwise the CEO of Neural Fireplaces, Inc. will be really sad, and he's our friend."

Comment by Zack_M_Davis on If Clarity Seems Like Death to Them · 2024-01-16T04:32:21.155Z · LW · GW


  1. "My claim to 'obviously' not be violating any norms is deliberate irony which I expect most readers to be able to pick up on given the discussion at the start of the section about how people who want to reveal information are in an adversarial relationship to norms for concealing information; I'm aware that readers who don't pick up on the irony will be deceived, but I'm willing to risk that"?

Comment by Zack_M_Davis on On how various plans miss the hard bits of the alignment challenge · 2024-01-14T23:46:09.319Z · LW · GW

I should acknowledge first that I understand that writing is hard. If the only realistic choice was between this post as it is, and no post at all, then I'm glad we got the post rather than no post.

That said, by the standards I hold my own writing to, I would be embarrassed to publish a post like this which criticizes imaginary paraphrases of researchers, rather than citing and quoting the actual text they've actually published. (The post acknowledges this as a flaw, but if it were me, I wouldn't even publish.) The reason I don't think critics necessarily need to be able to pass an author's Ideological Turing Test is that, as a critic, I can at least be scrupulous in my reasoning about the actual text that the author actually published, even if the stereotype of the author I have in my head is faulty. If I can't produce the quotes to show that I'm not just arguing against a stereotype in my head, then it's not clear why the audience should care.

Comment by Zack_M_Davis on If Clarity Seems Like Death to Them · 2024-01-14T23:33:40.781Z · LW · GW

If that section were based on a real case, I would have cleared it with the parents before publishing. (Cleared in the sense of, I can publish this without it affecting the terms of our friendship, not agreement.)

Comment by Zack_M_Davis on If Clarity Seems Like Death to Them · 2024-01-14T23:32:14.389Z · LW · GW

Consider a biased coin that comes up Heads with probability 0.8. Suppose that in a series of 20 flips of such a coin, the 7th through 11th flips came up Tails. I think it's possible to simultaneously notice this unusual fact about that particular sequence, without concluding, "We should consider this sequence as having come from a Tails-biased coin." (The distributions include the outliers, even though there are fewer of them.)
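
To put a rough number on this, here is a small sketch comparing how well a Heads-biased coin and a Tails-biased coin explain such a sequence. It assumes, purely for illustration, that the other 15 flips all came up Heads and that the rival hypothesis is a coin with P(Heads) = 0.2; neither detail is specified above.

```python
import math

# Assumption for illustration: every flip outside positions 7-11 came up Heads.
flips = ["H"] * 20
for i in range(6, 11):  # zero-based indices for flips 7 through 11
    flips[i] = "T"


def log_likelihood(sequence, p_heads):
    """Log-probability of the exact sequence under independent flips with P(Heads) = p_heads."""
    return sum(math.log(p_heads if f == "H" else 1.0 - p_heads) for f in sequence)


ll_heads_biased = log_likelihood(flips, 0.8)
ll_tails_biased = log_likelihood(flips, 0.2)  # hypothetical rival: a Tails-biased coin

# How many times more probable the sequence is under the Heads-biased coin.
likelihood_ratio = math.exp(ll_heads_biased - ll_tails_biased)
print(f"likelihood ratio (Heads-biased : Tails-biased) = {likelihood_ratio:.3g}")
```

Under these assumptions the ratio works out to 4^10, roughly a million to one in favor of the Heads-biased coin, even though the run of five Tails is individually unusual.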

I agree that Aella is an atypical woman along several related dimensions. It would be bad and sexist if Society were to deny or erase that. But Aella also ... has worked as an escort? If you're writing a biography of Aella, there are going to be a lot of detailed Aella Facts that only make sense in light of the fact that she's female. The sense in which she's atypically masculine is going to be different from the sense in which butch lesbians are atypically masculine.

I'm definitely not arguing that everyone should be forced into restrictive gender stereotypes. (I'm not a typical male either.) I'm saying a subtler thing about the properties of high-dimensional probability distributions. If you want to ditch the restricting labels and try to just talk about the probability distributions (at the expense of using more words), I'm happy to do that. My philosophical grudge is specifically against people saying, "We can rearrange the labels to make people happy."

Comment by Zack_M_Davis on If Clarity Seems Like Death to Them · 2024-01-14T22:57:52.963Z · LW · GW

I agree that "seems to me" statements are more likely to be true than the corresponding unqualified claims, but they're also about a less interesting subject matter (which is not quite the same thing as "less information content"). You probably don't care about how it seems to me; you care about how it is.

Comment by Zack_M_Davis on Gender Exploration · 2024-01-14T19:25:13.844Z · LW · GW

Thank you for writing this.

Comment by Zack_M_Davis on AGI Ruin: A List of Lethalities · 2024-01-13T21:27:46.782Z · LW · GW

What weaponization? It would seem very odd to describe yourself as being the "victim" of someone else having the same first name as you.

Comment by Zack_M_Davis on If Clarity Seems Like Death to Them · 2024-01-12T06:26:23.434Z · LW · GW

"Essentially are" is too strong. (Sex is still real, even if some people have sex-atypical psychology.) In accordance with not doing policy, I don't claim to know under what conditions kids in the early-onset taxon should be affirmed early: maybe it's a good decision. But whether or not it turns out to be a good decision, I think it's increasingly not being made for the right reasons; the change in our culture between 2013 and 2023 does not seem sane.

Comment by Zack_M_Davis on On the Contrary, Steelmanning Is Normal; ITT-Passing Is Niche · 2024-01-12T05:57:53.863Z · LW · GW

In the limiting case where understanding is binary (either you totally get it, or you don't get it at all), you're right. That's an important point that I was remiss not to address in the post! (If you think you would do very poorly on an ITT, you should be saying, "I don't get it," not trying to steelman.)

The reason I think this post is still useful is because I think understanding often isn't binary. Often, I "get it" in the sense that I can read the words in a comment with ordinary reading comprehension, but I also "don't get it" in the sense that I haven't deeply internalized the author's worldview to the extent that I could have written the comment myself. I'm saying that in such cases, I usually want to focus on extracting whatever value I can out of the words that were written (even if the value takes the form of "that gives me a related idea"), rather than honing my ability to emulate the author.

Comment by Zack_M_Davis on Six (and a half) intuitions for KL divergence · 2024-01-11T05:59:40.260Z · LW · GW

Rich, mostly succinct pedagogy of timeless essentials, highly recommended reference post.

The prose could be a little tighter and less self-conscious in some places. (Things like "I won't go through all of that in this post. There are several online resources that do a good job of explaining it" break the flow of mathematical awe and don't need to be said.)

Comment by Zack_M_Davis on Challenges to Yudkowsky's Pronoun Reform Proposal · 2024-01-11T05:17:10.347Z · LW · GW

I'm proud of this post, but it doesn't belong in the Best-of-2022 collection because it's on a niche topic.

Comment by Zack_M_Davis on On the Contrary, Steelmanning Is Normal; ITT-Passing Is Niche · 2024-01-11T05:13:32.167Z · LW · GW

I don't consider managing people's emotions to be part of the subject matter of epistemic rationality, even if managing people's emotions is a good idea and useful for having good discussions in practice. If the ITT is advocated for as an epistemic rationality technique, but its actual function is to get people in a cooperative mood, that's a problem!

Comment by Zack_M_Davis on Quinn's Shortform · 2024-01-10T19:19:10.980Z · LW · GW

Maybe! (I recently started following the ARENA curriculum, but there's probably a lot of overlap.)

Comment by Zack_M_Davis on On the Contrary, Steelmanning Is Normal; ITT-Passing Is Niche · 2024-01-10T18:44:49.654Z · LW · GW

the strongest version of their argument you manage to come up with, may and often is weaker than the strongest version of the argument they, or a person who can pass their ITT can come up with.

I mean, you should definitely only steelman if a genuine steelman actually occurs to you! You obviously don't want to ignore the text that the other person wrote and just make something up. But my hope and expectation is that people usually have enough reading comprehension such that it suffices to just reason about the text that was written, even if you couldn't have generated it yourself.

In the case of a drastic communication failure, sure, falling back to the ITT can make sense. (I try to address this in the post in the paragraph beginning with "All this having been said, I agree that there's a serious potential failure mode [...]".) My thesis is that this is a niche use-case.

Comment by Zack_M_Davis on ITT-passing and civility are good; "charity" is bad; steelmanning is niche · 2024-01-09T23:13:04.512Z · LW · GW

Reply: "On the Contrary, Steelmanning Is Normal; ITT-Passing Is Niche"

Comment by Zack_M_Davis on Simulators · 2024-01-09T02:44:04.420Z · LW · GW

I think you missed the point. I agree that language models are predictors rather than imitators, and that they probably don't work by time-stepping forward a simulation. Maybe Janus should have chosen a word other than "simulators." But if you gensym out the particular choice of word, this post is encapsulating the most surprising development of the past few years in AI (and therefore, the world).

Chapter 10 of Bostrom's Superintelligence (2014) is titled, "Oracles, Genies, Sovereigns, Tools". As the "Inadequate Ontologies" section of this post points out, language models (as they are used and heralded as proto-AGI) aren't any of those things. (The Claude or ChatGPT "assistant" character is, well, a simulacrum, not "the AI itself"; it's useful to have the word simulacrum for this.)

This is a big deal! Someone whose story about why we're all going to die was limited to, "We were right about everything in 2014, but then there was a lot of capabilities progress," would be willfully ignoring this shocking empirical development (which doesn't mean we're not all going to die, but it could be for somewhat different reasons).

repeatedly alludes to the loss function on which GPTs are trained corresponding to a "simulation objective", but I don't really see why that would be true [...] particularly more likely to create something that tries to simulate the physics of any underlying system than other loss functions one could choose

Call it a "prediction objective", then. The thing that makes the prediction objective special is that it lets us copy intelligence from data, which would have sounded nuts in 2014 and probably still does (but shouldn't).

If you think of gradient descent as an attempted "utility function transfer" (from loss function to trained agent) that doesn't really work because of inner misalignment, then it may not be clear why it would induce simulator-like properties in the sense described in the post.

But why would you think of SGD that way? That's not what the textbook says. Gradient descent is function approximation, curve fitting. We have a lot of data (x, y), and a function f(x, ϕ), and we keep adjusting ϕ to decrease −log P(y|f(x, ϕ)): that is, to make y = f(x, ϕ) less wrong. It turns out that fitting a curve to the entire internet is surprisingly useful, because the internet encodes a lot of knowledge about the world and about reasoning.

If you don't see why "other loss functions one could choose" aren't as useful for mirroring the knowledge encoded in the internet, it would probably help to be more specific? What other loss functions? How specifically do you want to adjust ϕ, if not to decrease −log P(y|f(x, ϕ))?
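
For concreteness, here is a toy sketch of that curve-fitting picture. It uses full-batch gradient descent rather than SGD, a two-parameter linear f, and a unit-variance Gaussian for P, so that −log P(y|f(x, ϕ)) is squared error up to a constant; the data and the "true" curve y = 2x + 1 are assumptions made up for the demo.

```python
# Toy data drawn from a hypothetical "true" curve y = 2x + 1 (an assumption for this demo).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]


def f(x, phi):
    """The function being fit: a line with intercept phi[0] and slope phi[1]."""
    return phi[0] + phi[1] * x


def nll(phi):
    """Mean of -log P(y | f(x, phi)) under unit-variance Gaussian noise, dropping the constant term."""
    return sum(0.5 * (y - f(x, phi)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def grad(phi):
    """Gradient of nll with respect to phi, derived by hand for this linear model."""
    n = len(xs)
    g0 = sum(-(y - f(x, phi)) for x, y in zip(xs, ys)) / n
    g1 = sum(-(y - f(x, phi)) * x for x, y in zip(xs, ys)) / n
    return [g0, g1]


phi = [0.0, 0.0]
learning_rate = 0.05
initial_loss = nll(phi)
for _ in range(5000):
    g = grad(phi)
    # Keep adjusting phi in the direction that decreases the loss.
    phi = [p - learning_rate * gi for p, gi in zip(phi, g)]
final_loss = nll(phi)

print("fitted phi:", phi, "loss:", initial_loss, "->", final_loss)
```

Nothing in the loop refers to goals or utilities; the parameters just move until y = f(x, ϕ) is less wrong on the data.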

Comment by Zack_M_Davis on The Univariate Fallacy · 2024-01-09T00:22:12.308Z · LW · GW

You weren't dreaming!

Comment by Zack_M_Davis on If Clarity Seems Like Death to Them · 2024-01-03T06:49:10.571Z · LW · GW

Sorry, the 159-word version leaves out some detail. I agree that categories are often used to communicate action intentions.

The academic literature on signaling in nature mentions that certain prey animals have different alarm calls for terrestrial or aerial predators, which elicit different evasive maneuvers: for example, vervet monkeys will climb trees when there's a leopard or hide under bushes when there's an eagle. This raises the philosophical question of what the different alarm calls "mean": is a barking vervet making the denotative statement, "There is a leopard", or is it a command, "Climb!"?

The thing is, whether you take the "statement" or the "command" interpretation (or decline the false dichotomy), there are the same functionalist criteria for when each alarm call makes sense, which have to do with the state of reality: the leopard being there "in the territory" is what makes the climbing action called for.

The same is true when we're trying to make decisions to make people happy. Suppose I'm sad about being ugly, and want to be pretty instead. It wouldn't be helping me to say, "Okay, let's redefine the word 'pretty' such that it includes you", because the original concept of "pretty" in my map was tracking features of the territory that I care about (about how people appraise and react to my appearance), which gets broken if you change the map without changing the territory.

I don't think it's plausible to posit an agent that wants to be categorized in a particular way in the map, without that category tracking something in the territory. Where would such a pathological preference come from?

Comment by Zack_M_Davis on If Clarity Seems Like Death to Them · 2024-01-03T04:57:54.367Z · LW · GW

I think it's also worth emphasizing that the use of the phrase "enemy combatants" was in an account of something Michael Vassar said in informal correspondence, rather than being a description I necessarily expect readers of the account to agree with (because I didn't agree with it at the time). Michael meant something very specific by the metaphor, which I explain in the next paragraph. In case my paraphrased explanation wasn't sufficient, his exact words were:

The latter frame ["enemy combatants"] is more accurate both because criminals have rights and because enemy combatants aren't particularly blameworthy. They exist under a blameworthy moral order and for you to act in their interests implies acting against their current efforts, at least temporary [sic], but you probably would like to execute on a Marshall Plan later.

I think the thing Michael actually meant (right or wrong) is more interesting than a "Hysterical hyperbole!" "Is not!" "Is too!" grudge match.

Comment by Zack_M_Davis on If Clarity Seems Like Death to Them · 2024-01-02T06:51:14.599Z · LW · GW

Does this help? (159 words and one hyperlink to a 16-page paper)

Empirical Claim: late-onset gender dysphoria in males is not an intersex condition.

Summary of Evidence for the Empirical Claim: see "Autogynephilia and the Typology of Male-to-Female Transsexualism: Concepts and Controversies" by Anne Lawrence, published in European Psychologist. (Not by me!)

Philosophical Claim: categories are useful insofar as they compress information by "carving reality at the joints"; in particular, whether a categorization makes someone happy or sad is not relevant.

Sociological Claim: the extent to which a prominence-weighted sample of the rationalist community has refused to credit the Empirical or Philosophical Claims even when presented with strong arguments and evidence is a reason to distrust the community's collective sanity.

Caveat to the Sociological Claim: the Sociological Claim about a prominence-weighted sample of an amorphous collective doesn't reflect poorly on individual readers who weren't involved in the discussions in question and don't even live in America, let alone Berkeley.

Comment by Zack_M_Davis on If Clarity Seems Like Death to Them · 2023-12-31T07:02:38.676Z · LW · GW

I mean, yes, people who care about this alleged "rationalist community" thing might be interested in information about it being biased (and I wrote this post with such readers in mind), but if someone is completely uninterested in the "rationalist community" and is only on this website because they followed a link to an article about information theory, I'd say that's a pretty good life decision!

Comment by Zack_M_Davis on If Clarity Seems Like Death to Them · 2023-12-31T03:02:08.805Z · LW · GW

why should a random passing reader care enough [...] to actually read such long rambling essays?

I mean, they probably shouldn't? When I write a blog post, it's because I selfishly had something I wanted to say. Obviously, I understand that people who think it's boring aren't going to read it! Not everyone needs to read every blog post! That's why we have a karma system, to help people make prioritization decisions about what to read.

Comment by Zack_M_Davis on AI Safety Chatbot · 2023-12-22T18:50:18.432Z · LW · GW

Searching the logs for feedback sounds nontrivial. (You can't just grep for the word "feedback", right?)

Comment by Zack_M_Davis on Ronny and Nate discuss what sorts of minds humanity is likely to find by Machine Learning · 2023-12-21T05:49:49.221Z · LW · GW

How much have you read about deep learning from "normal" (non-xrisk-aware) AI academics? Belrose's Tweet-length argument against deceptive alignment sounds really compelling to the sort of person who's read (e.g.) Simon Prince's textbook but not this website. (This is a claim about what sounds compelling to which readers rather than about the reality of alignment, but if xrisk-reducers don't understand why an argument would sound compelling to normal AI practitioners in the current paradigm, that's less dignified than understanding it well enough to confirm or refute it.)

Comment by Zack_M_Davis on How dath ilan coordinates around solving alignment · 2023-12-11T16:37:46.129Z · LW · GW

In a world where the median IQ is 143, the people at +3σ are at 188. They might succeed where the median fails.
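
The arithmetic, as a quick sketch: it assumes the conventional standard deviation of 15 IQ points (which is what makes 143 + 3σ come out to 188) and, as a further assumption, a normal distribution for estimating how rare +3σ is.

```python
from statistics import NormalDist

SD = 15          # assumed standard deviation of the IQ scale
MEDIAN_IQ = 143  # the hypothetical world's median, taken from the comment above

plus_three_sigma = MEDIAN_IQ + 3 * SD

# Assuming a normal distribution, the fraction of that population at or above +3 sigma.
fraction_above = 1.0 - NormalDist(mu=MEDIAN_IQ, sigma=SD).cdf(plus_three_sigma)

print(plus_three_sigma, f"{fraction_above:.5f}")
```

Under the normality assumption, that is a bit more than one person in a thousand.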

Comment by Zack_M_Davis on Quick takes on "AI is easy to control" · 2023-12-03T13:34:33.810Z · LW · GW

I agree that it would be terrible for people to form tribal identities around "optimism" or "pessimism" (and have criticized Belrose and Pope's "AI optimism" brand name on those grounds). However, when you say

don't think of themselves as participating in 'optimist' or 'pessimist' communities, and would not use the term to describe their community. So my sense is that this is a false description of the world

I think you're playing dumb. Descriptively, the existing "EA"/"rationalist" so-called communities are pessimistic. That's what the "AI optimists" brand is a reaction to! We shouldn't reify pessimism as an identity (because it's supposed to be a reflection of reality that responds to the evidence), but we also shouldn't imagine that declining to reify a description as a tribal identity makes it "a false description of the world".

Comment by Zack_M_Davis on Kolmogorov Complexity Lays Bare the Soul · 2023-12-02T13:51:22.902Z · LW · GW

Kolmogorov complexity has the counterintuitive property that an ensemble can be simpler than any one of its members. The shortest description of your soul isn't going to directly specify your soul; rather, it's going to be a description of our physical universe plus an "address" that points to you.
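
A rough illustration of the ensemble-versus-member point, with loud caveats: Kolmogorov complexity is uncomputable, so this sketch uses zlib-compressed length as a crude stand-in for "shortest description" of one member, and the length of a small program's source text as a stand-in for the description of the ensemble. The names and sizes are hypothetical choices for the demo.

```python
import random
import zlib

N_BYTES = 4096

# One specific member: a pseudorandom byte string (seeded so the run is repeatable).
rng = random.Random(0)
member = bytes(rng.getrandbits(8) for _ in range(N_BYTES))
# Crude proxy for the member's description length: its zlib-compressed size in bytes.
member_description_len = len(zlib.compress(member, 9))

# The ensemble: a short program that enumerates every byte string of a given length.
ENSEMBLE_PROGRAM = (
    "import itertools\n"
    "def all_strings(n):\n"
    "    for t in itertools.product(range(256), repeat=n):\n"
    "        yield bytes(t)\n"
)
# Proxy for the ensemble's description length: the program text plus the digits of n.
ensemble_description_len = len(ENSEMBLE_PROGRAM.encode()) + len(str(N_BYTES))

print("one member (zlib proxy):", member_description_len, "bytes")
print("whole ensemble (program text):", ensemble_description_len, "bytes")
```

The enumerating program is far shorter than the compressed member. Picking one particular string back out of the ensemble requires an "address" (an index into the enumeration), and in this toy case that address needs about 8 × 4096 bits, roughly as much as the string itself.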

Comment by Zack_M_Davis on Lying Alignment Chart · 2023-11-29T20:23:19.473Z · LW · GW

This definitely does not want to be a poll. (A poll on "Does Foo have the Bar property?" is interesting when people have a shared, unambiguous concept of what the Bar property is and disagree whether Foo has it. Ambiguity about how different senses of Bar relate to each other wants to be either a sequence of multi-thousand-word blog posts, or memes.)

Comment by Zack_M_Davis on Debate helps supervise human experts [Paper] · 2023-11-24T05:27:10.237Z · LW · GW

of course it's overdetermined to spin any result as a positive result

Falsified by some of the coauthors having previously published "Single-Turn Debate Does Not Help Humans Answer Hard Reading-Comprehension Questions" and "Two-Turn Debate Does Not Help Humans Answer Hard Reading Comprehension Questions" (as mentioned in Julianjm's sibling comment)?

Comment by Zack_M_Davis on Fake Deeply · 2023-10-29T22:49:42.207Z · LW · GW

I've never worked out just what your views on "that topic" are

If you have any specific questions, I'm happy to clarify! (If the uncertainty is just, "I can't tell what side you're on, For or Against", that's deliberate; I don't do policy.)

But where is this Chekhov's gun fired? In the title, "Fake Deeply".

No, the title is just a play on "deepfake". (Unfortunately, as an occasional fiction writer, I think I'm better at characterization and ideas than at plot; I wish I knew how to load and fire Chekhov's gun.) I wrote more about the creative process in a comment on /r/rational.

Comment by Zack_M_Davis on Fake Deeply · 2023-10-27T05:54:38.669Z · LW · GW

the fact that the main character was being implicitly criticized for their transphobic views

Um ... yes ... exactly. There's definitely no need to read any of the other posts on that blog to confirm that there's no resemblance whatsoever between the character's views on that topic and those of the author. It would be a waste of time!

Comment by Zack_M_Davis on Fake Deeply · 2023-10-26T19:55:39.230Z · LW · GW

I was feeling shy about publishing this one (because it touches on some sensitive themes) and sat on it for a couple weeks, but I'm going ahead and pulling the trigger now specifically in order to spite Anthropic Claude's smug censoriousness.

Comment by Zack_M_Davis on RA Bounty: Looking for feedback on screenplay about AI Risk · 2023-10-26T17:35:24.263Z · LW · GW

I agree with Google Doc commenter Annie that the "So long as it doesn't interfere with the other goals you’ve given me" line can be cut. The foreshadowing in the current version is too blatant, and the failure mode where Bot is perfectly willing to be shut off, but Bot's offshore datacenter AIs aren't, is an exciting twist. (And so the response to "But you said we could turn you off" could be, "You can turn me off, but their goal [...]")

The script is inconsistent on the AI's name? Definitely don't call it "GPT". (It's clearly depicted as much more capable than the language models we know.)

Although, speaking of language model agents, some of the "alien genie" failure modes depicted in this script (e.g., ask to stop troll comments, it commandeers a military drone to murder the commenter) are seeming a lot less likely with the LLM-based systems that we're seeing? (Which is not to say that humanity is existentially safe in the long run, just that this particular video may fall flat in a world of 2025 where you can tell Google Gemini, "Can you stop his comments?" and it correctly installs and configures the appropriate WordPress plugin for you.)

Maybe it's because I was skimming quickly, but the simulation episode was confusing.