Posts

The Evolution of Humans Was Net-Negative for Human Values 2024-04-01T16:01:10.037Z
My Interview With Cade Metz on His Reporting About Slate Star Codex 2024-03-26T17:18:05.114Z
"Deep Learning" Is Function Approximation 2024-03-21T17:50:36.254Z
Agreeing With Stalin in Ways That Exhibit Generally Rationalist Principles 2024-03-02T22:05:49.553Z
And All the Shoggoths Merely Players 2024-02-10T19:56:59.513Z
On the Contrary, Steelmanning Is Normal; ITT-Passing Is Niche 2024-01-09T23:12:20.349Z
If Clarity Seems Like Death to Them 2023-12-30T17:40:42.622Z
Lying Alignment Chart 2023-11-29T16:15:28.102Z
Fake Deeply 2023-10-26T19:55:22.340Z
Alignment Implications of LLM Successes: a Debate in One Act 2023-10-21T15:22:23.053Z
Contra Yudkowsky on Epistemic Conduct for Author Criticism 2023-09-13T15:33:14.987Z
Assume Bad Faith 2023-08-25T17:36:32.678Z
"Is There Anything That's Worth More" 2023-08-02T03:28:16.116Z
Lack of Social Grace Is an Epistemic Virtue 2023-07-31T16:38:05.375Z
"Justice, Cherryl." 2023-07-23T16:16:40.835Z
A Hill of Validity in Defense of Meaning 2023-07-15T17:57:14.385Z
Blanchard's Dangerous Idea and the Plight of the Lucid Crossdreamer 2023-07-08T18:03:49.319Z
We Are Less Wrong than E. T. Jaynes on Loss Functions in Human Society 2023-06-05T05:34:59.440Z
Bayesian Networks Aren't Necessarily Causal 2023-05-14T01:42:24.319Z
"You'll Never Persuade People Like That" 2023-03-12T05:38:18.974Z
"Rationalist Discourse" Is Like "Physicist Motors" 2023-02-26T05:58:29.249Z
Conflict Theory of Bounded Distrust 2023-02-12T05:30:30.760Z
Reply to Duncan Sabien on Strawmanning 2023-02-03T17:57:10.034Z
Aiming for Convergence Is Like Discouraging Betting 2023-02-01T00:03:21.315Z
Comment on "Propositions Concerning Digital Minds and Society" 2022-07-10T05:48:51.013Z
Challenges to Yudkowsky's Pronoun Reform Proposal 2022-03-13T20:38:57.523Z
Comment on "Deception as Cooperation" 2021-11-27T04:04:56.571Z
Feature Selection 2021-11-01T00:22:29.993Z
Glen Weyl: "Why I Was Wrong to Demonize Rationalism" 2021-10-08T05:36:08.691Z
Blood Is Thicker Than Water 🐬 2021-09-28T03:21:53.997Z
Sam Altman and Ezra Klein on the AI Revolution 2021-06-27T04:53:17.219Z
Reply to Nate Soares on Dolphins 2021-06-10T04:53:15.561Z
Sexual Dimorphism in Yudkowsky's Sequences, in Relation to My Gender Problems 2021-05-03T04:31:23.547Z
Communication Requires Common Interests or Differential Signal Costs 2021-03-26T06:41:25.043Z
Less Wrong Poetry Corner: Coventry Patmore's "Magna Est Veritas" 2021-01-30T05:16:26.486Z
Unnatural Categories Are Optimized for Deception 2021-01-08T20:54:57.979Z
And You Take Me the Way I Am 2020-12-31T05:45:24.952Z
Containment Thread on the Motivation and Political Context for My Philosophy of Language Agenda 2020-12-10T08:30:19.126Z
Scoring 2020 U.S. Presidential Election Predictions 2020-11-08T02:28:29.234Z
Message Length 2020-10-20T05:52:56.277Z
Msg Len 2020-10-12T03:35:05.353Z
Artificial Intelligence: A Modern Approach (4th edition) on the Alignment Problem 2020-09-17T02:23:58.869Z
Maybe Lying Can't Exist?! 2020-08-23T00:36:43.740Z
Algorithmic Intent: A Hansonian Generalized Anti-Zombie Principle 2020-07-14T06:03:17.761Z
Optimized Propaganda with Bayesian Networks: Comment on "Articulating Lay Theories Through Graphical Models" 2020-06-29T02:45:08.145Z
Philosophy in the Darkest Timeline: Basics of the Evolution of Meaning 2020-06-07T07:52:09.143Z
Comment on "Endogenous Epistemic Factionalization" 2020-05-20T18:04:53.857Z
"Starwink" by Alicorn 2020-05-18T08:17:53.193Z
Zoom Technologies, Inc. vs. the Efficient Markets Hypothesis 2020-05-11T06:00:24.836Z
A Book Review 2020-04-28T17:43:07.729Z

Comments

Comment by Zack_M_Davis on My Interview With Cade Metz on His Reporting About Slate Star Codex · 2024-03-27T06:19:36.834Z · LW · GW

Just because the defendant is actually guilty, doesn't mean the prosecutor should be able to get away with making a tenuous case! (I wrote more about this in my memoir.)

Comment by Zack_M_Davis on My Interview With Cade Metz on His Reporting About Slate Star Codex · 2024-03-27T05:38:45.109Z · LW · GW

I affirm Seth's interpretation in the grandparent. Real-time conversation is hard; if I had been writing carefully rather than speaking extemporaneously, I probably would have managed to order the clauses correctly. ("A lot of people think criticism is bad, but one of the secret-lore-of-rationality things is that criticism is actually good.")

Comment by Zack_M_Davis on "Deep Learning" Is Function Approximation · 2024-03-24T23:45:39.799Z · LW · GW

I am struggling to find anything in Zack's post which is not just the old wine of the "just" fallacy [...] learned more about the power and generality of 'next token prediction' etc than you have what they were trying to debunk.

I wouldn't have expected you to get anything out of this post!

Okay, if you project this post into a one-dimensional "AI is scary and mysterious" vs. "AI is not scary and not mysterious" culture war subspace, then I'm certainly writing in a style that mood-affiliates with the latter. The reason I'm doing that is because the picture of what deep learning is that I got from being a Less Wrong-er felt markedly different from the picture I'm getting from reading the standard textbooks, and I'm trying to supply that diff to people who (like me-as-of-eight-months-ago, and unlike Gwern) haven't read the standard textbooks yet.

I think this is a situation where different readers need to hear different things. I'm sure there are grad students somewhere who already know the math and could stand to think more about what its power and generality imply about the future of humanity or lack thereof. I'm not particularly well-positioned to help them. But I also think there are a lot of people on this website who have a lot of practice pontificating about the future of humanity or lack thereof, who don't know that Simon Prince and Christopher Bishop don't think of themselves as writing about agents. I think that's a problem! (One which I am well-positioned to help with.) If my attempt to remediate that particular problem ends up mood-affiliating with the wrong side of a one-dimensional culture war, maybe that's because the one-dimensional culture war is crazy and we should stop doing it.

Comment by Zack_M_Davis on "Deep Learning" Is Function Approximation · 2024-03-23T18:32:22.373Z · LW · GW

For what notion is the first problem complicated, and the second simple?

I might be out of my depth here, but—could it be that sparse parity with noise is just objectively "harder than it sounds" (because every bit of noise inverts the answer), whereas protein folding is "easier than it sounds" (because if it weren't, evolution wouldn't have solved it)?

Just because the log-depth xor tree is small, doesn't mean it needs to be easy to find, if it can hide amongst vastly many others that might have generated the same evidence ... which I suppose is your point. (The "function approximation" frame encourages us to look at the boolean circuit and say, "What a simple function, shouldn't be hard to noisily approximate", which is not exactly the right question to be asking.)
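(A toy sketch of why "every bit of noise inverts the answer," with made-up parameters: a k-sparse parity is the XOR of k hidden relevant positions, so flipping any one relevant bit flips the label, while irrelevant bits do nothing.)

```python
import random

def sparse_parity(x, relevant):
    """k-sparse parity: XOR of the k relevant bits of x."""
    y = 0
    for i in relevant:
        y ^= x[i]
    return y

# hypothetical instance: n = 20 input bits, k = 3 relevant positions
random.seed(0)
n, relevant = 20, (2, 7, 13)
x = [random.randint(0, 1) for _ in range(n)]
y = sparse_parity(x, relevant)

# flipping any single relevant bit inverts the answer ...
x2 = list(x); x2[7] ^= 1
assert sparse_parity(x2, relevant) == 1 - y
# ... while flipping an irrelevant bit doesn't
x3 = list(x); x3[4] ^= 1
assert sparse_parity(x3, relevant) == y
```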

Comment by Zack_M_Davis on "Deep Learning" Is Function Approximation · 2024-03-22T05:41:45.854Z · LW · GW

This comment had apparently been deleted by the commenter (the comment display box having a "deleted because it was a little rude, sorry" deletion note in lieu of the comment itself), but the ⋮-menu in the upper-right gave me the option to undelete it, which I did because I don't think my critics are obligated to be polite to me. (I'm surprised that post authors have that power!) I'm sorry you didn't like the post.

Comment by Zack_M_Davis on [deleted post] 2024-03-19T03:21:40.007Z

whether his charisma is more like +2SD or +5SD above the average American (concept origin: planecrash, likely doesn't actually follow a normal distribution in reality) [bolding mine]

The concept of measuring traits in standard deviation units did not originate in someone's roleplaying game session in 2022! Statistically literate people have been thinking in standardized units for more than a century. (If anyone has priority, it's Karl Pearson in 1894.)

If you happened to learn about it from someone's RPG session, that's fine. (People can learn things from all different sources, not just from credentialed "teachers" in officially accredited "courses.") But to the extent that you elsewhere predict changes in the trajectory of human civilization on the basis that "fewer than 500 people on earth [are] currently prepared to think [...] at a level similar to us, who read stuff on the same level" as someone's RPG session, learning an example of how your estimate of the RPG session's originality was a reflection of your own ignorance should make you re-think your thesis.

Comment by Zack_M_Davis on 'Empiricism!' as Anti-Epistemology · 2024-03-18T20:57:08.696Z · LW · GW

saddened (but unsurprised) to see few others decrying the obvious strawmen

In general, the "market" for criticism just doesn't seem very efficient at all! You might have hoped that people would mostly agree about what constitutes a flaw, critics would compete to find flaws in order to win status, and authors would learn not to write posts with flaws in them (in order to not lose status to the critics competing to point out flaws).

I wonder which part of the criticism market is failing: is it more that people don't agree about what constitutes a flaw, or that authors don't have enough of an incentive to care, or something else? We seem to end up with a lot of critics who specialize in detecting a specific kind of flaw ("needs examples" guy, "reward is not the optimization target" guy, "categories aren't arbitrary" guy, &c.), with very limited reaction from authors or imitation by other potential critics.

Comment by Zack_M_Davis on jeffreycaruso's Shortform · 2024-03-13T16:21:25.055Z · LW · GW

I mean, I agree that there are psycho-sociological similarities between religions and the AI risk movement (and indeed, I sometimes pejoratively refer to the latter as a "robot cult"), but analyzing the properties of the social group that believes that AI is an extinction risk is a separate question from whether AI in fact poses an extinction risk, which one could call Armageddon. (You could spend vast amounts of money trying to persuade people of true things, or false things; the money doesn't care either way.)

Obviously, there's not going to be a "proof" of things that haven't happened yet, but there's lots of informed speculation. Have you read, say, "The Alignment Problem from a Deep Learning Perspective"? (That may not be the best introduction for you, depending on the reasons for your skepticism, but it's the one that happened to come to mind, which is more grounded in real AI research than previous informed speculation that had less empirical data to work from.)

Comment by Zack_M_Davis on My Clients, The Liars · 2024-03-06T17:53:22.333Z · LW · GW

Why are you working for the prosecutors?

This is a pretty reasonable question from the client's perspective! When I was in psychiatric prison ("hospital", they call it a "hospital") and tried to complain to the staff about the injustice of my confinement, I was told that I could call "patient's rights".

I didn't bother. If the staff wasn't going to listen, what was the designated complaint line going to do?

Later, I found out that patient's rights advocates apparently are supposed to be independent, and not just a meaningless formality. (Scott Alexander: "Usually the doctors hate them, which I take as a pretty good sign that they are actually independent and do their job.")

This was not at all obvious from the inside. I can only imagine a lot of criminal defendants have a similar experience. Defense attorneys are frustrated that their clients don't understand that they're trying to help—but that "help" is all within the rules set by the justice system. From the perspective of a client who doesn't think he did anything particularly wrong (whether or not the law agrees), the defense attorney is part of the system.

I think my intuition was correct to dismiss patient's rights as useless. I'm sure they believe that they're working to protect patients' interests, and would have been frustrated that I didn't appreciate that. But what I wanted was not redress of any particular mistreatment that the system recognized as mistreatment, but to be let out of psych jail—and on that count, I'm sure patient's rights would have told me that the evidence was harmful to my case. They were working for the doctors, not for me.

Comment by Zack_M_Davis on Many arguments for AI x-risk are wrong · 2024-03-05T18:29:47.325Z · LW · GW

I can’t address them all, but I [...] am happy to dismantle any particular argument

You can't know that in advance!!

Comment by Zack_M_Davis on Agreeing With Stalin in Ways That Exhibit Generally Rationalist Principles · 2024-03-04T07:48:59.289Z · LW · GW

IQ seems like the sort of thing Feynman could be "honestly" motivatedly wrong about. The thing I'm trying to point at is that Feynman seemingly took pride in being a straight talker, in contrast to how Yudkowsky takes pride in not lying.

These are different things. Straight talkers sometimes say false or exaggerated things out of sloppiness, but they actively want listeners to know their reporting algorithm. Prudently selecting which true sentences to report in the service of a covert goal is not lying, but it's definitely not straight talk.

Comment by Zack_M_Davis on Agreeing With Stalin in Ways That Exhibit Generally Rationalist Principles · 2024-03-04T04:42:55.467Z · LW · GW

Yes, that would be ridiculous. It would also be ridiculous in a broadly similar way if someone spent eight years in the prime of their life prosecuting a false advertising lawsuit against a "World's Best" brand ice-cream for not actually being the best in the world.

But if someone did somehow make that mistake, I could see why they might end up writing a few blog posts afterwards telling the Whole Dumb Story.

Comment by Zack_M_Davis on Agreeing With Stalin in Ways That Exhibit Generally Rationalist Principles · 2024-03-04T03:41:25.164Z · LW · GW

You are perhaps wiser than me. (See also footnote 20.)

Comment by Zack_M_Davis on Agreeing With Stalin in Ways That Exhibit Generally Rationalist Principles · 2024-03-02T22:07:41.212Z · LW · GW

(I think this is the best and most important post in the sequence; I suspect that many readers who didn't and shouldn't bother with the previous three posts may benefit from this one.)

Comment by Zack_M_Davis on New LessWrong review winner UI ("The LeastWrong" section and full-art post pages) · 2024-02-28T07:12:17.531Z · LW · GW

I second the concern that using "LeastWrong" on the site grants undue legitimacy to the bad "than others" interpretation of the brand name (as contrasted to the intended "all models are wrong, but" meaning). "Best Of" is clear and doesn't distort the brand.

Comment by Zack_M_Davis on Less Wrong automated systems are inadvertently Censoring me · 2024-02-27T06:44:20.582Z · LW · GW

Would you agree with the statement that your meta-level articles are more karma-successful than your object-level articles? Because if that is a fair description, I see it as a huge problem.

I don't think this is a good characterization of my posts on this website.

If by "meta-level articles", you mean my philosophy of language work (like "Where to Draw the Boundaries?" and "Unnatural Categories Are Optimized for Deception"), I don't think success is a problem. I think that was genuinely good work that bears directly on the site's mission, independently of the historical fact that I had my own idiosyncratic ("object-level"?) reasons for getting obsessed with the philosophy of language in 2019–2020.[1]

If by "object-level articles", you mean my writing on my special-interest blog about sexology and gender, well, the overwhelming majority of that never got a karma score because it was never cross-posted to Less Wrong. (I only cross-post specific articles from my special-interest blog when I think they're plausibly relevant to the site's mission.)

If by "meta-level articles", you mean my recent memoir sequence which talks about sexology and the philosophy of language and various autobiographical episodes of low-stakes infighting among community members in Berkeley, California, well, those haven't been karma-successful: parts 1, 2, and 3 are currently[2] sitting at 0.35, 0.08 (!), and 0.54 karma-per-vote, respectively.

If by "meta-level articles", you mean posts that reply to other users of this website (such as "Contra Yudkowsky on Epistemic Conduct for Author Criticism" or "'Rationalist Discourse' Is Like 'Physicist Motors'"), I contest the "meta level" characterization. I think it's normal and not particularly meta for intellectuals to write critiques of each other's work, where Smith writes "Kittens are Cute", and Jones replies in "Contra Smith on Kitten Cuteness". Sure, it would be possible for Jones to write a broadly similar article, "Kittens Aren't Cute", that ignores Smith altogether, but I think that's often a worse choice, if the narrow purpose of Jones's article is to critique the specific arguments made by Smith, notwithstanding that someone else might have better arguments in favor of the Cute Kitten theory that have not been heretofore considered.

You're correct to notice that a lot of my recent work has a cult-infighting drama angle to it. (This is very explicit in the memoir sequence, but it noticeably leaks into my writing elsewhere.) I'm pretty sure I'm not doing it for the karma. I think I'm doing it because I'm disillusioned and traumatized from the events described in the memoir, and will hopefully get over it after I've got it all written down and out of my system.

There's another couple posts in that sequence (including this coming Saturday, probably). If you don't like it, I hereby encourage you to strong-downvote it. I write because I selfishly have something to say; I don't think I'm entitled to anyone's approval.


  1. In some of those posts, I referenced the work of conventional academics like Brian Skyrms and others, which I think provides some support for the notion that the nature of language and categories is a philosophically rich topic that someone might find significant in its own right, rather than being some sort of smokescreen for a hidden agenda. ↩︎

  2. Pt. 1 actually had a much higher score (over 100 points) shortly after publication, but got a lot of downvotes later after being criticized on Twitter. ↩︎

Comment by Zack_M_Davis on Communication Requires Common Interests or Differential Signal Costs · 2024-02-27T04:40:48.827Z · LW · GW

Personal whimsy. Probably don't read too much into it. (My ideology has evolved over the years such that I think a lot of the people who are trying to signal something with the generic feminine would not regard me as an ally, but I still love the æsthetic.)

Comment by Zack_M_Davis on Less Wrong automated systems are inadvertently Censoring me · 2024-02-23T16:29:32.524Z · LW · GW

Zack cannot convince us [...] if you disagree with him, that only proves his point

I don't think I'm doing this! It's true that I think it's common for apparent disagreements to be explained by political factors, but I think that claim is itself something I can support with evidence and arguments. I absolutely reject "If you disagree, that itself proves I'm right" as an argument, and I think I've been clear about this. (See the paragraph in "A Hill of Validity in Defense of Meaning" starting with "Especially compared to normal Berkeley [...]".)

If you're interested, I'm willing to write more words explaining my model of which disagreements with which people on which topics are being biased by which factors. But I get the sense that you don't care that much, and that you're just annoyed that my grudge against Yudkowsky and a lot of people in Berkeley is too easily summarized as being with an abstracted "community" that you also happen to be in, even though this has nothing to do with you? Sorry! I'm not totally sure how to fix this. (It's useful to sometimes be able to talk about general cultural trends, and being specific about which exact sub-sub-clusters are and are not guilty of the behavior being criticized would be a lot of extra wordcount that I don't think anyone is interested in.)

Comment by Zack_M_Davis on And All the Shoggoths Merely Players · 2024-02-20T06:27:08.698Z · LW · GW

Simplicia: Where does "empirical evidence" fall on the sliding scale of rigor between "handwavy metaphor" and "mathematical proof"? The reason I think the KL penalty in RLHF setups impacts anything we care about isn't mostly because the vague handwaving sounds plausible, but because of data such as that presented in Fig. 5 of Stiennon et al. 2020. They varied the size of the KL penalty of an LLM RLHF'd for a summarization task, and found about what you'd expect from the vague handwaving: as the KL penalty decreases, the reward model's predicted quality of the output goes up (tautologically), but actual preference of human raters when you show them the summaries follows an inverted-U curve, where straying from the base policy a little is good, but straying farther is increasingly bad, as the system overoptimizes on what looks good to the reward model, which was only a proxy for the true goal.

(You can see examples of the overoptimized summaries in Table 29 on the last page of the paper. Apparently the reward model really liked tripled question marks and the phrase "pls halp"??? I weakly suspect that these are the kind of "weird squiggles" that would improve with scaling up the reward model, similarly to how state-of-the-art image generators lack the distortions and artifacts of their compute-impoverished predecessors. The reward model in this experiment was only 1.3 billion parameters.)
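(For concreteness, a minimal sketch of the KL-penalized objective being discussed, with illustrative numbers: the policy is trained on the reward model's score minus β times the log-ratio of the tuned policy to the frozen base policy, a per-sample estimate of the KL term. Names and values here are my own, not from the paper.)

```python
import math

def shaped_reward(r_hat, logp_policy, logp_base, beta=0.1):
    """Reward actually optimized in KL-penalized RLHF (a sketch):
    the reward model's score r_hat, minus beta times the log-ratio
    of tuned policy to base policy (a per-token KL estimate)."""
    return r_hat - beta * (logp_policy - logp_base)

# if the tuned policy assigns a token 10x the base probability,
# the KL penalty eats into the reward model's score:
r = shaped_reward(r_hat=1.0, logp_policy=math.log(0.5), logp_base=math.log(0.05))
# penalty is beta * log(10) ≈ 0.23, so r ≈ 0.77
```

Larger β holds the policy closer to the base model; shrinking β lets it chase the reward model's score further, which is the knob Stiennon et al. varied.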

I'm sure you'll have no trouble interpreting these results as yet another portent of our impending deaths. We were speaking theoretically about AIs exploiting the Goodhart problem between human ratings and actual goodness, but practical RLHF systems aren't actually sample-efficient enough to solely use direct human feedback, and have an additional Goodhart problem between reward model predictions of human ratings, and actual ratings. Isn't that worse? Well, yes.

But the ray of hope I see here is more meta and methodological, rather than turning on any one empirical result. It's that we have empirical results. We can study these machines, now, before their successors are powerful enough to kill us. The iterative design loop hasn't failed yet. That can't last forever—at some point between here and the superintelligence at the end of time, humans are going to be out of the loop. I'm glad people are doing theory trying to figure out what that looks like and how it could be arranged to go well.

But I'm worried about ungrounded alignment theorizing failing to make contact with reality, sneeringly dismissing genuinely workable designs as impossible by appealing to perfectly antisphexish consequentialists on a frictionless plane, when some amount of sphexishness and friction is a known factor of the algorithms in question.

We seem to agree that GPT-4 is smart enough to conceive of the strategy of threatening or bribing labelers. So ... why doesn't that happen? I mean, like, literal threats and promises. You mention rumors from a DeepMind employee about the larger Gemini models being hard to train, but without more details, I'm inclined to guess that that was "pls halp"-style overoptimization rather than the kind of power-seeking or deceptive alignment that would break the design loop. (Incidentally, Gao et al. 2022 studied scaling laws for reward model overoptimization and claimed that model size basically didn't matter? See §4.4, "Policy size independence".)

What's going on here? If I'm right that GPT-4 isn't secretly plotting to murder us, even though it's smart enough to formulate the idea and expected utility maximizers have a convergent incentive to murder competitors, why is that?

Here's my view: model-free reinforcement learning algorithms such as those used in RLHF tweak your AI's behavior to be more like the behavior that got reward in the past, which is importantly different from expected utility maximization. To the extent that you succeed in rewarding Honest, Helpful, and Harmless behavior in safe regimes, you can plausibly get a basically HHH AI assistant that generalizes to not betraying you when it has the chance, similar to how I don't do heroin because I don't want to become a heroin addict—even though if I did take heroin, the reinforcement from that would make me more likely to do it again. Then the nature of the game is keeping that good behavior "on track" for as long as we can—even though the superintelligence at the end of time is presumably going to do something more advanced than model-free RL. It's possible to screw up and reward the wrong thing, per the robot hand in front of the ball—but if you don't screw up early, your basically-friendly-but-not-maximally-capable AIs can help you not screw up later. And in the initial stages, you're only fighting gradient descent, not even an AGI.
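(A deliberately cartoonish sketch of the distinction being drawn, with invented names and numbers: model-free RL nudges action propensities toward whatever got reward, while an expected-utility maximizer plans by argmax over a world model, regardless of reinforcement history.)

```python
def reinforce(prefs, action, reward, lr=0.5):
    """Model-free flavor: behavior drifts toward rewarded actions."""
    prefs[action] += lr * reward
    return prefs

def eu_maximize(utilities):
    """EU-maximizer flavor: pick whatever the world model says is
    best, independent of what was reinforced in the past."""
    return max(utilities, key=utilities.get)

prefs = {"cooperate": 0.0, "defect": 0.0}
prefs = reinforce(prefs, "cooperate", reward=1.0)  # rewarded in the past
assert prefs["cooperate"] > prefs["defect"]       # so it's now more likely

# the planner defects the moment its model says defecting pays more:
assert eu_maximize({"cooperate": 1.0, "defect": 2.0}) == "defect"
```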

More broadly, here's how I see the Story of Alignment so far. It's been obvious to sufficiently penetrating thinkers for a long time that the deep future belongs to machine intelligence—that, as George Eliot put it in 1879, "the creatures who are to transcend and finally supersede us [will] be steely organisms, giving out the effluvia of the laboratory, and performing with infallible exactness more than everything that we have performed with a slovenly approximativeness and self-defeating inaccuracy."

What's less obvious is how much control we can exert over how that goes by setting the initial conditions. Can we arrange for the creatures who are to transcend and finally supersede us to be friendly and create the kind of world we would want, or will they murder us and tile the universe with something random?

Fifteen years ago, the problem looked hopeless, just from considering the vast complexity of human values. How would you write a computer program that values "happiness", "freedom", or "justice", let alone everything else we want? It wasn't clear how to build AI at all, but surely it would be easier to build some AI than a good AI. Humanity was doomed.

But now, after the decade of deep learning, the problem and its impossible solution seem to be arriving closer together than I would have ever dreamt. Okay, we still don't know how to write down the human utility function, to be plugged in to an arbitrarily powerful optimizer.

But it's increasingly looking like value isn't that fragile if it's specified in latent space, rather than a program that breaks if a single character is wrong—that there are ways to meaningfully shape the initial conditions of our world's ascension that don't take the exacting shape of "utility function + optimizer".

We can leverage unsupervised learning on human demonstration data to do tasks the way humans do them, and we can use RLHF to elicit behavior we want in situations where we can't write down our desires as an explicit reward or utility function. Crucially, by using these techniques together to compensate for each other's safety and capability weaknesses, it seems feasible to build AI whose effects look "like humans, but faster": performing with infallible exactness everything that we would have performed with a slovenly approximativeness and self-defeating inaccuracy. That doesn't immediately bring about the superintelligence at the end of time—although it might look pretty immediate in sidereal time—but seems like a pretty good way to kick off our world's ascension.

Is this story wrong? Maybe! ... probably? My mother named me "Simplicia", over my father's objections, because of my unexpectedly low polygenic scores. I am aware of my ... [she hesitates and coughs, as if choking on the phrase] learning disability. I'm not confident in any of this.

But if I'm wrong, there should be arguments explaining why I'm wrong—arguments that should convince scientists working in the field, even if I personally am too limited to understand them. I've tried to ground my case in my understanding of the state of the art, citing relevant papers when applicable.

In contrast, dismissing the entire field as hopeless on the basis of philosophy about "perfectly learn[ing] and perfectly maximiz[ing] the referent of rewards" isn't engaging with the current state of alignment, let alone all the further advances that humans and our non-superintelligent AIs will come up with before the end of days! Doomimir Doomovitch, with the fate of the lightcone in the balance, isn't it more dignified to at least consider the possibility that someone else might have thought of something? Reply! Reply!

Comment by Zack_M_Davis on And All the Shoggoths Merely Players · 2024-02-17T23:41:11.034Z · LW · GW

Simplicia: I think it's significant that the "hand between ball and camera" example from Amodei et al. 2017 was pure RL from scratch. You have a function π that maps observations (from the robot's sensors) to actions (applying torques to the robot's joints). You sample sequences of observation–action pairs from π and show them to a human, and fit a function r̂ to approximate the human's choices. Then you use Trust Region Policy Optimization to adjust π to score better according to r̂. In this case, TRPO happened to find something that looked good instead of being good, in a way that r̂ wasn't able to distinguish. That is, we screwed up and trained the wrong thing. That's a problem, and the severity of the problem would get worse the more capable π was and the more you were relying on it. If we were going to produce powerful general AI systems with RL alone, I would be very nervous.

But the reason I'm so excited about language models in particular is that their capabilities seem to mostly come from unsupervised pre-training rather than RLHF. You fit a function to the entire internet first, and only afterwards tweak it a bit so that its outputs look more like obeying commands rather than predicting random internet tokens—where the tweaking process incorporates tricks like penalizing the Kullback–Leibler divergence from the reward model's training distribution, such that you're not pulling the policy too far away from the known-safe baseline.

I agree that as a consequentialist with the goal of getting good ratings, the strategy of "bribe the rater" isn't very hard to come up with. Indeed, when I prompt GPT-4 with the problem, it gives me "Offering Incentives for Mislabeling" as #7 on a list of 8.

But the fact that GPT-4 can do that seems like it's because that kind of reasoning appears on the internet, which is what I mean by the claim that contemporary systems are "reasoning with" rather than "reasoning about": the assistant simulacrum being able to explain bribery when prompted isn't the same thing as the LM itself trying to maximize reward.

I'd be interested in hearing more details about those rumors of smarter models being more prone to exploit rater mistakes. What did those entail, exactly? (To the extent that we lack critical evidence about this potential alignment failure because the people who experienced it are gagged by an NDA, that seems like a point in favor of sharing information about language model capabilities.)

I certainly expect some amount of sycophancy: if you sample token completions from your LM, and then tweak its outputs to be more like what your raters want to hear, you end up with an LM that's more inclined to say what your raters want to hear. Fine. That's a problem. Is it a fatal problem? I mean, if you don't try to address it at all and delegate all of your civilization's cognition to machines that don't want to tell you about problems, then eventually you might die of problems your AIs didn't tell you about.

But "mere" sycophancy sounds like a significantly less terrifying failure mode than reward hacking of the sort that would result in things like the LM spontaneously trying to threaten or bribe labelers. That would have a large KL divergence from the policy you started with!

Comment by Zack_M_Davis on And All the Shoggoths Merely Players · 2024-02-17T06:17:57.164Z · LW · GW

I think part of the reason the post ends without addressing this is that, unfortunately, I don't think I properly understand this one yet, even after reading your dialogue with Eli Tyre.

The next paragraph of the post links Christiano's 2015 "Two Kinds of Generalization", which I found insightful and seems relevant. By way of illustration, Christiano describes two types of possible systems for labeling videos: (1) a human classifier (which predicts what label a human would assign), and (2) a generative model (which directly builds a mapping between descriptions and videos roughly the way our brains do it). Notably, the human classifier behaves undesirably on inputs that bribe, threaten, or otherwise hack the human: for example, a video of the text "I'll give you $100 if you classify this as an apple" might get classified as an apple. (And an arbitrarily powerful search for maximally apple-classified inputs would turn those up.)

Christiano goes on to describe a number of differences between these two purported kinds of generalization: (1) is reasoning about the human, whereas (2) is reasoning with a model not unlike the one inside the human's brain; searching for simple Turing machines would tend to produce (1), whereas searching for small circuits would tend to produce (2); and so on.

It would be bad to end up with a system that behaves like (1) without realizing it. That definitely seems like it would kill you. But (Simplicia asks) how likely that is seems like a complicated empirical question about how ML generalization works and how you built your particular AI, one that isn't definitively answered by "in the limit" philosophy about "perfectly learn[ing] and perfectly maximiz[ing] the referent of rewards assigned by human operators"? That is, I agree that if you argmax over possible programs for the one that results in the most reward-button presses, you get something that only wants to seize the reward button. But the path-dependent details between "argmax over possible programs" and "pretraining + HFDT + regularization + early stopping + &c." seem like they make a big difference. The technology in front of us really does seem like it's "reasoning with" rather than "reasoning about" (while also seeming to be on the path towards "real AGI" rather than a mere curiosity).

When I try to imagine what Doomimir would say to that, all I can come up with is a metaphor about perpetual-motion-machine inventors whose designs are so complicated that it's hard to work out where the error is, even though the laws of thermodynamics clearly imply that there must be an error. That sounds plausible to me as a handwavy metaphor; I could totally believe that the ultimate laws of intelligence (not known to me personally) work that way.

The thing is, we do need more than a handwavy metaphor! "Yes, your gold-printing machine seems to be working great, but my intuition says it's definitely going to kill everyone. No, I haven't been able to convince relevant experts who aren't part of my robot cult, but that's because they're from Earth and therefore racially inferior to me. No, I'm not willing to make any concrete bets or predictions about what happens before then" is a non-starter even if it turns out to be true.

Comment by Zack_M_Davis on Natural abstractions are observer-dependent: a conversation with John Wentworth · 2024-02-14T17:29:16.348Z · LW · GW

(Continued in containment thread.)

Comment by Zack_M_Davis on Containment Thread on the Motivation and Political Context for My Philosophy of Language Agenda · 2024-02-14T17:27:48.289Z · LW · GW

(Responding to Tailcalled.)

you mostly address the rationalist community, and Bailey mostly addresses GCs and HBDs, and so on. So "most people you encounter using that term on Twitter" doesn't refer to irrelevant outsiders, it refers to the people you're trying to have the conversation with

That makes sense as a critique of my or Bailey's writing, but "Davis and Bailey's writing is unclear and arguably deceptive given their target audience's knowledge" is a very different claim than "autogynephilia is not a natural abstraction"!!

I think you naturally thought of autogynephilia and gender progressivism as being more closely related than they really are

Permalink or it didn't happen: what's your textual evidence that I was doing this? (I do expect there to be a relationship of some strength in the AGP→progressive direction, but my 2017–8 models were not in any way surprised by, e.g., the "Conservative Men in Conservative Dresses" profiled in The Atlantic in 2005, or the second kind of mukhannath, or Kalonymus ben Kalonymus.)

Comment by Zack_M_Davis on Natural abstractions are observer-dependent: a conversation with John Wentworth · 2024-02-13T22:22:50.790Z · LW · GW

Let's say the species is the whitebark pine P. albicaulis, which grows in a sprawling shrub-like form called krummholz in rough high-altitude environments, but looks like a conventional upright tree in more forgiving climates.

Suppose that a lot of people don't like krummholz and have taken to using the formal species name P. albicaulis as a disparaging term (even though a few other species can also grow as krummholz).

I think Tail is saying that "P. albicaulis" isn't a natural abstraction, because most people you encounter using that term on Twitter are talking about krummholz, without realizing that other species can grow as krummholz or that many P. albicaulis grow as upright trees.

I'm saying it's dumb to assert that P. albicaulis isn't a natural abstraction just because most people are ignorant of dendrology and are only paying attention to the shrub vs. tree subspace: if I look at more features of vegetation than just broad shape, I end up needing to formulate P. albicaulis to explain the things some of these woody plants have in common despite their shape.

Comment by Zack_M_Davis on And All the Shoggoths Merely Players · 2024-02-12T18:45:10.938Z · LW · GW

Simplicia: Oh! Because if there are nine wrong labels that aren't individually more common than the correct label, then the most they can collectively outnumber the correct label is by 9 to 1. But I could have sworn that Rolnick et al. §3.2 said that—oh, I see. I misinterpreted Figure 4. I should have said "twenty noisy labels for every correct one", not "twenty wrong labels"—where some of the noisy labels are correct "by chance".

For example, training examples with the correct label 0 could appear with the label 0 for sure 10 times, and then get a uniform random label 200 times, and thus be correctly labeled 10 + 200/10 = 30 times, compared to 20 for each wrong label. (In expectation—but you also could set it up so that the "noisy" labels don't deviate from the expected frequencies.) That doesn't violate the pigeonhole principle.
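The expected-count arithmetic above can be checked directly:

```python
# Check the expected label counts: each true-label-0 example appears with
# the label 0 "for sure" 10 times, plus 200 appearances with a uniform
# random label over 10 classes.
n_classes = 10
sure_correct = 10
noisy_draws = 200

expected_correct = sure_correct + noisy_draws / n_classes  # 10 + 20 = 30
expected_per_wrong_label = noisy_draws / n_classes         # 20 each

# Ratio of noisy labels to sure-correct labels: 20 to 1.
noise_ratio = noisy_draws / sure_correct
```

So the correct label still outnumbers each wrong label in expectation (30 vs. 20), even though noisy labels outnumber sure-correct ones 20 to 1.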

I regret the error. Can we just—pretend I said the correct thing? If there were a transcript of what I said, it would only be a one-word edit. Thanks.

Comment by Zack_M_Davis on What's the theory of impact for activation vectors? · 2024-02-11T22:23:23.100Z · LW · GW

I thought the idea was that steering unsupervisedly-learned abstractions circumvents failure modes of optimizing against human feedback.

Comment by Zack_M_Davis on Dreams of AI alignment: The danger of suggestive names · 2024-02-11T06:57:09.262Z · LW · GW

I have given up on communicating with most folk who have been in the community longer than, say, 3 years

I suspect this should actually be something more like "longer than 3 but less than 10." (You're expressing resentment for the party line on AI risk, but "the community" wasn't always all about that! There used to be a vision of systematic methods for thinking more clearly.)

Comment by Zack_M_Davis on TurnTrout's shortform feed · 2024-02-01T02:05:55.231Z · LW · GW

I think "Symbol/Referent Confusions in Language Model Alignment Experiments" is relevant here: the fact that the model emits sentences in the grammatical first person doesn't seem like reliable evidence that it "really knows" it's talking about "itself". (It's not evidence because it's fictional, but I can't help but think of the first chapter of Greg Egan's Diaspora, in which a young software mind is depicted as learning to say I and me before the "click" of self-awareness when it notices itself as a specially controllable element in its world-model.)

Of course, the obvious followup question is, "Okay, so what experiment would be good evidence for 'real' situational awareness in LLMs?" Seems tricky. (And the fact that it seems tricky to me suggests that I don't have a good handle on what "situational awareness" is, if that is even the correct concept.)

Comment by Zack_M_Davis on Don't sleep on Coordination Takeoffs · 2024-01-28T18:10:05.511Z · LW · GW

Do you think you could explain your thesis in a way that would make sense to someone who had never heard of "the EA, rationalist, and AI safety communities"? ("Moloch"? "Dath ilan"? Am I supposed to know who these people are?) You allude to "knowledge of decision theory or economics", but it's not clear what the specific claim or proposal is here.

Comment by Zack_M_Davis on Unnatural Categories Are Optimized for Deception · 2024-01-20T02:08:12.555Z · LW · GW

But presumably the reason the CEO would be sad if people didn't consider neural fireplaces to be fireplaces is that he wants to be leading a successful company that makes things people want, not a useless company with a useless product. Redefining words "in the map" doesn't help achieve goals "in the territory".

The OP discusses a similar example about wanting to be funny. If I think I can get away with changing the definition of the word "funny" such that it includes my jokes by definition, I'm less likely to try interventions that will make people want to watch my stand-up routine, which is one of the consequences I care about that the old concept of funny pointed to and the new concept doesn't.

Now, it's true that, in all metaphysical strictness, the map is part of the territory. "what the CEO thinks" and "what we've all agreed to put in the same category" are real-world criteria that one can use to discriminate between entities.

But if you're not trying to deceive someone by leveraging ambiguity between new and old definitions, it's hard to see why someone would care about such "thin" categories (simply defined by fiat, rather than pointing to a cluster in a "thicker", higher-dimensional subspace of related properties). The previous post discusses the example of a "Vice President" job title that's identical to a menial job in all but the title itself: if being a "Vice President" doesn't imply anything about pay or authority or job duties, it's not clear why I would particularly want to be a "Vice President", except insofar as I'm being fooled by what the term used to mean.

Comment by Zack_M_Davis on Unnatural Categories Are Optimized for Deception · 2024-01-19T05:01:55.758Z · LW · GW

Right. What's "natural" depends on which features you're paying attention to, which can depend on your values. Electric, wood-burning, and neural fireplaces are similar if you're only paying attention to the subjective experience, but electric and wood-burning fireplaces form a cluster that excludes neural fireplaces if you're also considering objective light and temperature conditions.

The thesis of this post is that people who think neural fireplaces are fireplaces should be arguing for that on the merits—that the decision-relevant thing is having the subjective experience of a fireplace, even if the hallucinations don't provide heat or light. They shouldn't be saying, "We prefer to draw our categories this way because otherwise the CEO of Neural Fireplaces, Inc. will be really sad, and he's our friend."

Comment by Zack_M_Davis on If Clarity Seems Like Death to Them · 2024-01-16T04:32:21.155Z · LW · GW

Alternatively,

  1. "My claim to 'obviously' not being violating any norms is deliberate irony which I expect most readers to be able to pick up on given the discussion at the start of the section about how people who want to reveal information are in an adversarial relationship to norms for concealing information; I'm aware that readers who don't pick up on the irony will be deceived, but I'm willing to risk that"?
Comment by Zack_M_Davis on On how various plans miss the hard bits of the alignment challenge · 2024-01-14T23:46:09.319Z · LW · GW

I should acknowledge first that I understand that writing is hard. If the only realistic choice was between this post as it is, and no post at all, then I'm glad we got the post rather than no post.

That said, by the standards I hold my own writing to, I would be embarrassed to publish a post like this, which criticizes imaginary paraphrases of researchers, rather than citing and quoting the actual text they've actually published. (The post acknowledges this as a flaw, but if it were me, I wouldn't even publish.) The reason I don't think critics necessarily need to be able to pass an author's Ideological Turing Test is because, as a critic, I can at least be scrupulous in my reasoning about the actual text that the author actually published, even if the stereotype of the author I have in my head is faulty. If I can't produce the quotes to show that I'm not just arguing against a stereotype in my head, then it's not clear why the audience should care.

Comment by Zack_M_Davis on If Clarity Seems Like Death to Them · 2024-01-14T23:33:40.781Z · LW · GW

If that section were based on a real case, I would have cleared it with the parents before publishing. (Cleared in the sense of, I can publish this without it affecting the terms of our friendship, not agreement.)

Comment by Zack_M_Davis on If Clarity Seems Like Death to Them · 2024-01-14T23:32:14.389Z · LW · GW

Consider a biased coin that comes up Heads with probability 0.8. Suppose that in a series of 20 flips of such a coin, the 7th through 11th flips came up Tails. I think it's possible to simultaneously notice this unusual fact about that particular sequence, without concluding, "We should consider this sequence as having come from a Tails-biased coin." (The distributions include the outliers, even though there are fewer of them.)
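The likelihood arithmetic behind this (15 Heads and 5 Tails in 20 flips) is quick to check:

```python
# Likelihood of the observed sequence (15 Heads, 5 Tails) under each model.
def sequence_likelihood(p_heads, n_heads, n_tails):
    return p_heads ** n_heads * (1 - p_heads) ** n_tails

heads_biased = sequence_likelihood(0.8, 15, 5)  # the true coin
tails_biased = sequence_likelihood(0.2, 15, 5)  # the tempting misdiagnosis

# The run of five Tails is individually surprising...
run_probability = 0.2 ** 5

# ...but the whole sequence is still overwhelmingly more likely under the
# Heads-biased model: the likelihood ratio works out to 4**10.
ratio = heads_biased / tails_biased
```

The unusual run has probability about 0.0003, but the full sequence favors the Heads-biased hypothesis by a factor of over a million: you can notice the outlier without revising the model.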

I agree that Aella is an atypical woman along several related dimensions. It would be bad and sexist if Society were to deny or erase that. But Aella also ... has worked as an escort? If you're writing a biography of Aella, there are going to be a lot of detailed Aella Facts that only make sense in light of the fact that she's female. The sense in which she's atypically masculine is going to be different from the sense in which butch lesbians are atypically masculine.

I'm definitely not arguing that everyone should be forced into restrictive gender stereotypes. (I'm not a typical male either.) I'm saying a subtler thing about the properties of high-dimensional probability distributions. If you want to ditch the restricting labels and try to just talk about the probability distributions (at the expense of using more words), I'm happy to do that. My philosophical grudge is specifically against people saying, "We can rearrange the labels to make people happy."

Comment by Zack_M_Davis on If Clarity Seems Like Death to Them · 2024-01-14T22:57:52.963Z · LW · GW

I agree that "seems to me" statements are more likely to be true than the corresponding unqualified claims, but they're also about a less interesting subject matter (which is not quite the same thing as "less information content"). You probably don't care about how it seems to me; you care about how it is.

Comment by Zack_M_Davis on Gender Exploration · 2024-01-14T19:25:13.844Z · LW · GW

Thank you for writing this.

Comment by Zack_M_Davis on AGI Ruin: A List of Lethalities · 2024-01-13T21:27:46.782Z · LW · GW

What weaponization? It would seem very odd to describe yourself as being the "victim" of someone else having the same first name as you.

Comment by Zack_M_Davis on If Clarity Seems Like Death to Them · 2024-01-12T06:26:23.434Z · LW · GW

"Essentially are" is too strong. (Sex is still real, even if some people have sex-atypical psychology.) In accordance with not doing policy, I don't claim to know under what conditions kids in the early-onset taxon should be affirmed early: maybe it's a good decision. But whether or not it turns out to be a good decision, I think it's increasingly not being made for the right reasons; the change in our culture between 2013 and 2023 does not seem sane.

Comment by Zack_M_Davis on On the Contrary, Steelmanning Is Normal; ITT-Passing Is Niche · 2024-01-12T05:57:53.863Z · LW · GW

In the limiting case where understanding is binary (either you totally get it, or you don't get it at all), you're right. That's an important point that I was remiss not to address in the post! (If you think you would do very poorly on an ITT, you should be saying, "I don't get it," not trying to steelman.)

The reason I think this post is still useful is because I think understanding often isn't binary. Often, I "get it" in the sense that I can read the words in a comment with ordinary reading comprehension, but I also "don't get it" in the sense that I haven't deeply internalized the author's worldview to the extent that I could have written the comment myself. I'm saying that in such cases, I usually want to focus on extracting whatever value I can out of the words that were written (even if the value takes the form of "that gives me a related idea"), rather than honing my ability to emulate the author.

Comment by Zack_M_Davis on Six (and a half) intuitions for KL divergence · 2024-01-11T05:59:40.260Z · LW · GW

Rich, mostly succinct pedagogy of timeless essentials, highly recommended reference post.

The prose could be a little tighter and less self-conscious in some places. (Things like "I won't go through all of that in this post. There are several online resources that do a good job of explaining it" break the flow of mathematical awe and don't need to be said.)

Comment by Zack_M_Davis on Challenges to Yudkowsky's Pronoun Reform Proposal · 2024-01-11T05:17:10.347Z · LW · GW

I'm proud of this post, but it doesn't belong in the Best-of-2022 collection because it's on a niche topic.

Comment by Zack_M_Davis on On the Contrary, Steelmanning Is Normal; ITT-Passing Is Niche · 2024-01-11T05:13:32.167Z · LW · GW

I don't consider managing people's emotions to be part of the subject matter of epistemic rationality, even if managing people's emotions is a good idea and useful for having good discussions in practice. If the ITT is advocated for as an epistemic rationality technique, but its actual function is to get people in a cooperative mood, that's a problem!

Comment by Zack_M_Davis on Quinn's Shortform · 2024-01-10T19:19:10.980Z · LW · GW

Maybe! (I recently started following the ARENA curriculum, but there's probably a lot of overlap.)

Comment by Zack_M_Davis on On the Contrary, Steelmanning Is Normal; ITT-Passing Is Niche · 2024-01-10T18:44:49.654Z · LW · GW

the strongest version of their argument you manage to come up with, may and often is weaker than the strongest version of the argument they, or a person who can pass their ITT can come up with.

I mean, you should definitely only steelman if a genuine steelman actually occurs to you! You obviously don't want to ignore the text that the other person wrote and just make something up. But my hope and expectation is that people usually have enough reading comprehension that it suffices to just reason about the text that was written, even if you couldn't have generated it yourself.

In the case of a drastic communication failure, sure, falling back to the ITT can make sense. (I try to address this in the post in the paragraph beginning with "All this having been said, I agree that there's a serious potential failure mode [...]".) My thesis is that this is a niche use-case.

Comment by Zack_M_Davis on ITT-passing and civility are good; "charity" is bad; steelmanning is niche · 2024-01-09T23:13:04.512Z · LW · GW

Reply: "On the Contrary, Steelmanning Is Normal; ITT-Passing Is Niche"

Comment by Zack_M_Davis on Simulators · 2024-01-09T02:44:04.420Z · LW · GW

I think you missed the point. I agree that language models are predictors rather than imitators, and that they probably don't work by time-stepping forward a simulation. Maybe Janus should have chosen a word other than "simulators." But if you gensym out the particular choice of word, this post is encapsulating the most surprising development of the past few years in AI (and therefore, the world).

Chapter 10 of Bostrom's Superintelligence (2014) is titled, "Oracles, Genies, Sovereigns, Tools". As the "Inadequate Ontologies" section of this post points out, language models (as they are used and heralded as proto-AGI) aren't any of those things. (The Claude or ChatGPT "assistant" character is, well, a simulacrum, not "the AI itself"; it's useful to have the word simulacrum for this.)

This is a big deal! Someone whose story about why we're all going to die was limited to, "We were right about everything in 2014, but then there was a lot of capabilities progress," would be willfully ignoring this shocking empirical development (which doesn't mean we're not all going to die, but it could be for somewhat different reasons).

repeatedly alludes to the loss function on which GPTs are trained corresponding to a "simulation objective", but I don't really see why that would be true [...] particularly more likely to create something that tries to simulate the physics of any underlying system than other loss functions one could choose

Call it a "prediction objective", then. The thing that makes the prediction objective special is that it lets us copy intelligence from data, which would have sounded nuts in 2014 and probably still does (but shouldn't).

If you think of gradient descent as an attempted "utility function transfer" (from loss function to trained agent) that doesn't really work because of inner misalignment, then it may not be clear why it would induce simulator-like properties in the sense described in the post.

But why would you think of SGD that way? That's not what the textbook says. Gradient descent is function approximation, curve fitting. We have a lot of data (x, y), and a function f(x, ϕ), and we keep adjusting ϕ to decrease −log P(y|f(x, ϕ)): that is, to make y = f(x, ϕ) less wrong. It turns out that fitting a curve to the entire internet is surprisingly useful, because the internet encodes a lot of knowledge about the world and about reasoning.

If you don't see why "other loss functions one could choose" aren't as useful for mirroring the knowledge encoded in the internet, it would probably help to be more specific? What other loss functions? How specifically do you want to adjust ϕ, if not to decrease −log P(y|f(x, ϕ))?
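A minimal sketch of that curve-fitting view, with a toy linear model standing in for the neural network and made-up data standing in for the internet (under a Gaussian noise model, decreasing −log P(y|f(x, ϕ)) is just decreasing squared error):

```python
# Fit f(x, phi) = phi0 + phi1 * x to data by gradient descent on the
# mean squared error, i.e., on -log P(y | f(x, phi)) under a Gaussian
# noise model (up to constants).
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]  # y = 2x + 1

phi0, phi1 = 0.0, 0.0  # initial parameters
lr = 0.05              # learning rate

for _ in range(2000):
    grad0 = grad1 = 0.0
    for x, y in data:
        residual = (phi0 + phi1 * x) - y
        grad0 += 2 * residual          # d/d(phi0) of residual**2
        grad1 += 2 * residual * x      # d/d(phi1) of residual**2
    phi0 -= lr * grad0 / len(data)
    phi1 -= lr * grad1 / len(data)

# phi converges toward (1, 2): the curve has been fit to the data.
```

Nothing here is a "utility function transfer": we keep adjusting ϕ to make y = f(x, ϕ) less wrong, and the knowledge in the data ends up mirrored in the parameters.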

Comment by Zack_M_Davis on The Univariate Fallacy · 2024-01-09T00:22:12.308Z · LW · GW

You weren't dreaming!

Comment by Zack_M_Davis on If Clarity Seems Like Death to Them · 2024-01-03T06:49:10.571Z · LW · GW

Sorry, the 159-word version leaves out some detail. I agree that categories are often used to communicate action intentions.

The academic literature on signaling in nature mentions that certain prey animals have different alarm calls for terrestrial and aerial predators, which elicit different evasive maneuvers: for example, vervet monkeys will climb trees when there's a leopard or hide under bushes when there's an eagle. This raises the philosophical question of what the different alarm calls "mean": is a barking vervet making the denotative statement, "There is a leopard", or is it a command, "Climb!"?

The thing is, whether you take the "statement" or the "command" interpretation (or decline the false dichotomy), there are the same functionalist criteria for when each alarm call makes sense, which have to do with the state of reality: the leopard being there "in the territory" is what makes the climbing action called for.

The same is true when we're trying to make decisions to make people happy. Suppose I'm sad about being ugly, and want to be pretty instead. It wouldn't be helping me to say, "Okay, let's redefine the word 'pretty' such that it includes you", because the original concept of "pretty" in my map was tracking features of the territory that I care about (about how people appraise and react to my appearance), which gets broken if you change the map without changing the territory.

I don't think it's plausible to posit an agent that wants to be categorized in a particular way in the map, without that category tracking something in the territory. Where would such a pathological preference come from?

Comment by Zack_M_Davis on If Clarity Seems Like Death to Them · 2024-01-03T04:57:54.367Z · LW · GW

I think it's also worth emphasizing that the use of the phrase "enemy combatants" was in an account of something Michael Vassar said in informal correspondence, rather than being a description I necessarily expect readers of the account to agree with (because I didn't agree with it at the time). Michael meant something very specific by the metaphor, which I explain in the next paragraph. In case my paraphrased explanation wasn't sufficient, his exact words were:

The latter frame ["enemy combatants"] is more accurate both because criminals have rights and because enemy combatants aren't particularly blameworthy. They exist under a blameworthy moral order and for you to act in their interests implies acting against their current efforts, at least temporary [sic], but you probably would like to execute on a Marshall Plan later.

I think the thing Michael actually meant (right or wrong) is more interesting than a "Hysterical hyperbole!" "Is not!" "Is too!" grudge match.