Apologizing is a Core Rationalist Skill

post by johnswentworth · 2024-01-02T17:47:35.950Z · LW · GW · 42 comments

Contents

  Mistake/Misdeed + Apology can be Net Gainful to Social Status
  Apology-Adjacent Things
  Takeaways

There’s this narrative about a tradeoff between:

  • publicly updating when one turns out to be wrong, and
  • maintaining one’s social status.

In an ideal world - goes the narrative - social status mechanisms would reward people for publicly updating, rather than defending or spinning their every mistake. But alas, that’s not how the world actually works, so as individuals we’re stuck making difficult tradeoffs.

I claim that this narrative is missing a key piece. There is a social status mechanism which rewards people for publicly updating. The catch is that it’s a mechanism which the person updating must explicitly invoke; a social API which the person updating must call, in order to be rewarded for their update.

That social API is apologizing.

Mistake/Misdeed + Apology can be Net Gainful to Social Status

A personal example: there was a post called “Common Misconceptions about OpenAI [LW · GW]”, which (among many other points) estimated that ~30 alignment researchers work there. I replied (also among many other points):

I'd guess that is an overestimate of the number of people actually doing alignment research at OpenAI, as opposed to capabilities research in which people pay lip service to alignment. In particular, all of the RLHF work is basically capabilities work which makes alignment harder in the long term (because it directly selects for deception), while billing itself as "alignment".

There was a lot of pushback against that. Paul Christiano replied “Calling work you disagree with ‘lip service’ seems wrong and unhelpful.” I clarified that I was not using this as a generic criticism of something I disagreed with, but rather that RLHF seemed so obviously antihelpful to alignment that I did not expect most people working on it had actually thought about whether it would help, but were instead just doing it for other reasons while, y’know, paying lip service to alignment.[1] Richard Ngo eventually convinced me the people working on RLHF had thought at least somewhat about whether it would help; his evidence was a comment elsewhere from Paul, which in particular said:

To be clear, this is not post hoc reasoning. I talked with WebGPT folks early on while they were wondering about whether these risks were significant, and I said that I thought this was badly overdetermined. If there had been more convincing arguments that the harms from the research were significant, I believe that it likely wouldn't have happened.

I was wrong; the people working on RLHF (for WebGPT) apparently had actually thought about how it would impact alignment to at least some extent.

So, I replied to Richard to confirm that he had indeed disproved my intended claim, and thanked him for the information. I struck out the relevant accusation from my original comment, and edited in an apology there:

I have been convinced that I was wrong about this, and I apologize. I still definitely maintain that RLHF makes alignment harder and is negative progress for both outer and inner alignment, but I have been convinced that the team actually was trying to solve problems which kill us, and therefore not just paying lip service to alignment.

And, finally, I sent a personal apology message to Jacob Hilton, the author of the original post.

Why do I bring up this whole story here?

LessWrong has a convenient numerical proxy-metric of social status: site karma. Prior to the redaction and apology, my comment had been rather controversial - lots of upvotes, lots of downvotes, generally low-positive karma overall but a rollercoaster. After the redaction and apology, it stabilized at a reasonable positive number, and the comment in which I confirmed that Richard had disproved my claim (and thanked him for the information) ended up one of the most-upvoted in that thread.

The point: apologizing probably worked out to a net-positive marginal delta in social status. Not just relative to further defending my claim, but even relative to not having left any comment in the first place.

More generally: when I admit a mistake/misdeed and apologize for it, I lose some social standing for having made the mistake or committed the misdeed. But I also get a large boost of social status, by sending the strongest possible signal that I am the sort of person who is willing to admit their own mistakes/misdeeds and apologize for them. The LessWrong community nominally places especially high value on this, but in practice it extends to the rest of the world too: explicitly admitting one’s mistakes/misdeeds and apologizing for them is rare enough to be an extremely strong signal of integrity. It’s the sort of signal which most people instinctively recognize, and will instinctively respect, a lot. Overall, then, it’s not that unusual for mistake/misdeed + apology to add up to a net gain in social status - if one actually registers one’s update via the social apology-API.[2]

Now, I am not saying a Machiavellian should go around intentionally making mistakes or committing misdeeds and then apologizing for them. It’s high-variance at best, and besides, we all inevitably have plenty of mistakes/misdeeds to work with anyway. But I am saying that, even from a basically-Machiavellian perspective, ignoring all that “right thing to do” stuff, apologizing is very often a net-beneficial move, even compared to convincing people you didn't make a mistake or commit a misdeed in the first place. It’s severely underexploited.

Apology-Adjacent Things

Beyond literal apologies, there are other apology-like things which seem to evoke similar emotions (both in the apologizer and the recipient/audience), and which “use the same API” in some sense.

Another example: a few years ago I wrote a post called Why Subagents? [LW · GW], arguing that a market/committee of utility maximizers is a better baseline model for agents than a monolithic utility maximizer. Nate Soares later convinced me that a core part of my argument was wrong: the subagent-systems I was talking about will tend toward monolithic utility maximization. So I eventually wrote another post: Why Not Subagents? [LW · GW]. At the very beginning, I quoted the older post, and stated that Nate had convinced me the older argument was wrong.

That’s not literally an apology. This wasn’t the kind of situation where a literal apology made sense; there wasn’t a specific person or a few specific people who’d been harmed by my mistake. (Even had there been, it was a very nonobvious mistake, so I doubt most people would consider me particularly blameworthy.)

But it still had much of the structure of an apology. The admission “I was wrong” (admittedly not in those exact words, but clear enough), spelling out exactly what I was wrong about, then what I would believe differently going forward. All the standard apology-pieces.

And it felt like an apology, when writing it. It pulled the same internal levers in my brain, drew on the same skills. There was a feeling of… a choice to willingly “give up ground”, like I could have been defensive but chose otherwise. An instinctive feeling that I’d lose respect of a different kind (the “being the sort of person who explicitly admits mistake/misdeed” kind) if I tried to dig in my heels. The nudge of an ingrained habit from having apologized many times before, which pushed me to apologize a lot rather than a little - to be maximally upfront and explicit about my own errors, rather than try to portray them as minor. Because the more direct and explicit and comprehensive the apology, the more I gain that other kind of respect from having apologized.

That feels to me like a core part of what makes apologizing a skill, i.e. something one can improve at with practice: that feeling of leaning into the apology, being maximally direct and explicit about one’s mistakes/misdeeds rather than downplaying, and growing the gut feel that such directness tends to be rewarded with its own kind of respect/status.

Takeaways

Wouldn’t it be great if social status mechanisms would reward us for publicly updating, rather than defending or spinning our mistakes/misdeeds? Wouldn’t it be great if integrity were better aligned with Machiavellian incentives?

Well, it often can be, if we invoke the right social API call. Apologizing is the standard social status mechanism by which one can receive social credit for updating, after making a mistake or committing a misdeed.

Even from a purely Machiavellian perspective, apologizing can often leave us better off than we started. The people around us might trust us less for our mistake/misdeed, but we earn a different kind of respect by sending the strongest possible signal that we are the sort of person who is willing to admit their own mistakes/misdeeds and apologize for them - the sort of person who can explicitly and publicly update.

And apologizing is a skill which can be developed. We can build the instinct to “lean into it” - to be maximally upfront and explicit about our errors when apologizing, rather than try to downplay them. We can build the gut feel that being maximally forthright, rather than minimally, will maximize that other kind of respect which a good apology earns.

By building the skill of apologizing well, we can interface with standard social reality in a way more compatible with the virtue of Saying Oops.

  1. ^

    I wish to note here that Richard took this “as evidence that John would fail an intellectual turing test for people who have different views than he does about how valuable incremental empiricism is”. Of course I couldn’t just ignore an outright challenge to my honor like that, so I wrote a brief reply which Richard himself called “a pretty good ITT”.

  2. ^

    In certain circumstances, apologizing can also be a countersignalling power-move, i.e. “I am so high status that I can grovel a bit without anybody mistaking me for a general groveller”. But that’s not really the type of move this post is focused on.

42 comments

Comments sorted by top scores.

comment by Elessar2 · 2024-01-03T22:57:15.762Z · LW(p) · GW(p)

Unfortunately, in a substantial segment of society, if one were to apologize and admit the actual truth, his or her status in that segment would go way down, likely to zero - especially if the truth goes against the prevailing mores of a group that would rather keep on believing the Big Lie than admit that it, and they, have been wrong.

Usually when I see such an apology, it isn't directed at the in-group in question at all, but instead at the out-group that has been at odds with the in-group and is more receptive in principle to said information. In that case it is often perceived as an inauthentic, manipulative, and disingenuous attempt to garner sympathy - and yes, status - from this new group, if not to troll them.

In other words, authentic apologies are only as good as the authenticity and honesty of the individual or audience you are apologizing to. If they put little to no weight on such ideals, your attempt at apologizing would be worse than useless.

Replies from: kristin-lindquist
comment by Kristin Lindquist (kristin-lindquist) · 2024-01-04T01:01:49.161Z · LW(p) · GW(p)

+1

I internalized the value of apologizing proactively, sincerely, specifically, and without any "but". While I recommend it from a virtue ethics perspective, I'd urge starry-eyed green rationalists to be cautious. Here are some potential pitfalls:

- People may be confused by this type of apology and conclude that you are neurotic or insincere. Both can signal low status if you lack unambiguous status markers or aren't otherwise effectively conveying high status.
- If someone is an adversary (whether or not you know it), apologies can be weaponized. As a conscientious but sometimes off-putting aspie, I try to apologize for my frustration-inducing behaviors such as being intense, overly persistent and inappropriately blunt - no matter the suboptimal behavior of the other person(s) involved. In short, apology is an act of cooperation and people around you might be inclined to defect, so you must be careful.

I've been too naive on this front, possibly because some of the content I've found most inspirational comes from high status people (the Dalai Lama, Sam Harris, etc.) and different rules apply (i.e. great apologies as counter-signaling). It's still really good to develop these virtues; in this case, to learn how to be self-aware, accountable and courageously apologetic. But in some cases, it might be best to just write it in a journal rather than sharing it to your disadvantage.

comment by bideup · 2024-01-02T18:11:06.196Z · LW(p) · GW(p)

"Not invoking the right social API call" feels like a clarifying way to think about a specific conversational pattern that I've noticed that often leads to a person (e.g. me) feeling like they're virtuosly giving up ground, but not getting any credit for it.

It goes something like:

Alice: You were wrong to do X and Y.

Bob: I admit that I was wrong to do X and I'm sorry about it, but I think Y is unfair.

discussion continues about Y and Alice seems not to register Bob's apology

It seems like maybe bundling in your apology for X with a protest against Y just doesn't invoke the right API call. I'm not entirely sure what the simplest fix is, but it might just be swapping the order of the protest and the apology.

Replies from: Kei, M. Y. Zuo, PoignardAzur
comment by Kei · 2024-01-02T23:29:37.371Z · LW(p) · GW(p)

It also helps to dedicate a complete sentence (or multiple sentences if the action you're apologizing for wasn't just a minor mistake) to your apology. When apologizing in-person, you can also pause for a bit, giving your conversational partner the opportunity to respond if they want to.

When you immediately switch into the next topic, as in your example apology above, it looks like you're trying to distract from the fact that you were wrong, and also makes it less likely your conversational partner internalizes that you apologized.

Replies from: jarviniemi
comment by Olli Järviniemi (jarviniemi) · 2024-01-03T01:36:51.483Z · LW(p) · GW(p)

When you immediately switch into the next topic, as in your example apology above, it looks like you're trying to distract from the fact that you were wrong

 

Yep. Reminds me of the saying "everything before the word 'but' is bullshit". This is of course not universally true, but it often has a grain of truth. Relatedly, I remember seeing writing advice that went like "keep in mind that the word 'but' negates the previous sentence".

I've made a habit of noticing my "but"s in serious contexts. Often I rephrase my point so that the "but" is not needed. This seems especially useful for apologies, as there is more focus on sincerity and more reading between the lines going on.

comment by M. Y. Zuo · 2024-01-03T18:12:35.424Z · LW(p) · GW(p)

Typically people show genuine sincerity by their actions, not just by words... 

So focusing on the 'right social API calls' seems a bit tangential.

Replies from: bideup
comment by bideup · 2024-01-04T12:53:05.399Z · LW(p) · GW(p)

Words are a type of action, and I guess apologising and then immediately moving on to defending yourself is not the sort of action which signals sincerity.

Replies from: M. Y. Zuo
comment by M. Y. Zuo · 2024-01-04T15:55:18.657Z · LW(p) · GW(p)

Well, technically it does take energy and time to move the vocal cords, mouth, tongue, etc., but it's such a low-cost action that even doing something as simple as treating someone to lunch will outweigh it a hundredfold.

Replies from: bideup
comment by bideup · 2024-01-30T22:31:44.464Z · LW(p) · GW(p)

I think what I was thinking of is that words can have arbitrary consequences and be arbitrarily high cost.

In the apologising case, making the right social API call might be an action of genuine significance. E.g. it might mean taking the hit on lowering onlookers' opinion of my judgement, where if I'd argued instead that the person I wronged was talking nonsense I might have got away with preserving it.

John's post is about how you can gain respect for apologising, but it does often have costs too, and I think the respect is partly for being willing to pay them.

comment by PoignardAzur · 2024-02-10T00:26:15.621Z · LW(p) · GW(p)

I think "API calls" are the wrong way to word it.

It's more that an apology is a signal; to make it effective, you must communicate that it's a real signal reflecting your actual internal processes, and not a result of a surface-level "what words can I say to appear maximally virtuous" process.

So for instance, if you say a sentence equivalent to "I admit that I was wrong to do X and I'm sorry about it, but I think Y is unfair", then you're not communicating that you underwent the process of "I realized I was wrong, updated my beliefs based on it, and wondered if I was wrong about other things".

I'm not entirely sure what the simplest fix is

A simple fix would be "I admit I was wrong to do X, and I'm sorry about it. Let me think about Y for a moment." And then actually think about Y, because if you did one thing wrong, you probably did other things wrong too.

comment by michaelkeenan · 2024-01-05T18:18:23.571Z · LW(p) · GW(p)

I like and agree with this post, but want to caution that a 2019 paper, Does Apologizing Work? An Empirical Test of the Conventional Wisdom, studied a couple of examples of people saying something politically controversial, and found that apologizing either had no effect or made things worse. I suspect (hope?) that this harmful-apology effect is limited to moral-outrage scandals, and that those cases are unusual.

[F]uture research should investigate the extent to which circumstances make it more or less helpful to apologize for a controversial statement. It may be that the effect was greater in the Summers example because Rand Paul is a well-known political figure. It also might be the case that the key difference lies in the fact that Summers apologized for a statement expressing a belief in a theory that can be tested empirically, while Paul had originally been criticized for giving a normative opinion. Finally, Summers gave reasons for his defense, while Paul went on the attack when questioned about his comments, perhaps unfairly implying that the controversy was the result of a partisan witch-hunt. More research is needed before conclusions can be drawn about when apologies have no effect, and when they increase or reduce the desire on the part of observers to punish the embroiled figure.

comment by Darklight · 2024-01-02T20:45:17.226Z · LW(p) · GW(p)

Minor point, but the apology needs to sound sincere and credible, usually by being specific about the mistakes, concise, and to the point - not like, say, Bostrom's defensive apology about the racist email a while back. Otherwise you can instead signal that you are trying to invoke the social API call in a disingenuous way, which can clearly backfire.

Things like "sorry you feel offended" also tend to sound like you're not actually remorseful for your actions and are just trying to elicit the benefits of an apology. None of the apologies you described sound anything like that, but it's a common failure state among the less emotionally mature and the syncophantic.

Replies from: johnswentworth
comment by johnswentworth · 2024-01-02T21:41:43.747Z · LW(p) · GW(p)

Expanding on this...

The "standard format" for calling the apology API has three pieces:

  • "I'm sorry"/"I apologize"
  • Explicitly state the mistake/misdeed
  • Explicitly state either what you should have done instead, or will do differently next time

Notably, the second and third bullet points are both costly signals: it's easier for someone to state the mistake/misdeed, and what they would/will do differently, if they have actually updated. Thus, those two parts contribute heavily to the apology sounding sincere.

comment by Algon · 2024-01-02T23:32:09.617Z · LW(p) · GW(p)
  1. There are two kinds of apologies. Those where you admit you intentionally made an error and sincerely regret what you did, e.g. calling someone an idiot because they support Republicans/Democrats/Official Monster Raving Loony Party. And those where you did something by accident (usually something minor) and apologize to signal that you had no bad intent, e.g. bumping into someone. You're invoking the first sort, which should be pretty clear, but if someone thinks of examples which don't match, this may be why.
     
  2. Apologizing and changing your behaviour is a core rationalist skill. Merely changing your behaviour/model to not make that specific mistake again isn't enough. The generator has to change too. For instance: if you threatened a Loony supporter, apologized, and stopped belittling that particular person, then the generator of the behaviour is still there. It is the instrumental/value counterpart to rationalizing, contorting your model to locally account for contradictory evidence. You have to change the generator to something that wouldn't have made that type of error in the first place. Not threatening Loony supporters is better, not threatening Flying Spaghetti Monster Cultists too is better still, and not threatening people you disagree with is, perhaps, best.
  3. If the analogy between updating and apologizing is a good one, then practices/conditions for updating well should transfer to apologizing properly. It could be interesting to test this out. For instance, this recent post [LW · GW] suggests some requirements for bayesian updating, and what it looks like in practice. Are there analogues for the three requirements? I.e.
    1. Ability to inhabit hypothetical worlds, i.e. you can simulate how reality would seem to be in a world where that hypothesis is actually true. Within that hypothetical, updates propagate through all the variables in the causal graph in all directions.
    2. Ability to form any plausible hypothesis at all, i.e. hypotheses that wouldn't register lots of surprise at the evidence at hand.
    3. Avoid double-counting dependent observations. You've got to be able to actually condition on all of the pieces of evidence, which is hard if the observations aren't independent in a given world, or you struggle to inhabit that world deeply. (See the numeric sketch after this list.)
  4. If I had to suggest some equivalents to the three requirements above, they might be:
    1. Empathizing with another person's perspective to the extent that you can see all the implications of your actions being wrong, and see the flaws in character that implies.
    2. Seeing some perspective where what you did is blatantly wrong. Though this feels weird because of a value/free-energy analogy where high-value worlds have low surprise.
    3. ??? Suggestions welcome. 
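
To make requirement 3 above concrete, here's a tiny numeric sketch (toy numbers of my own: a 0.5 prior and a single observation with a 4:1 likelihood ratio) of how double-counting a dependent observation overshoots:

```python
# Double-counting dependent evidence: two sensors that always agree carry the
# same information as one sensor, but naively multiplying in both likelihood
# ratios counts the evidence twice.
def posterior(prior, likelihood_ratio):
    odds = (prior / (1 - prior)) * likelihood_ratio
    return odds / (1 + odds)

prior = 0.5  # P(hypothesis) before any observation
lr = 4.0     # likelihood ratio of a single sensor reading

print(posterior(prior, lr))       # correct, one observation: 0.8
print(posterior(prior, lr * lr))  # duplicate counted as new evidence: ~0.94
```
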
comment by Conor Moreton · 2024-01-02T22:23:45.349Z · LW(p) · GW(p)

Related: Reneging Prosocially [LW · GW]

Replies from: PoignardAzur
comment by PoignardAzur · 2024-02-10T00:31:56.905Z · LW(p) · GW(p)

I think Duncan's post touches on something this post misses with its talk of "social API": apologies only work when they're a costly signal.

The people you deliver the apology to need to feel it cost you something to make that apology, either pride or effort or something valuable; or at least that you're offering to give up something costly to earn forgiveness.

comment by mruwnik · 2024-01-03T11:49:42.413Z · LW(p) · GW(p)

Another, related Machiavellian tactic, when starting a relationship that you suspect will be highly valuable to you, is to have an argument with them as soon as possible, and then to patch things up with a (sincere!) apology. I'm not suggesting you go out of your way to start a quarrel; more that it's both a valuable data point as to how they handle problems (as most relationships will have patchy moments) and a good signal to them that you value them highly enough to go through a proper apology.

Replies from: PoignardAzur
comment by PoignardAzur · 2024-02-10T00:28:12.393Z · LW(p) · GW(p)

The slightly less machiavellian version is to play Diplomacy with them.

(Or do a group project, or go to an escape game, or any other high-tension low-stakes scenario.)

comment by bideup · 2024-01-02T20:15:47.680Z · LW(p) · GW(p)

The second footnote seems to be accidentally duplicated as the intro. Kinda works though.

Replies from: johnswentworth
comment by johnswentworth · 2024-01-02T21:38:18.239Z · LW(p) · GW(p)

WOW I missed that typo real hard. Thanks for mentioning.

comment by Review Bot · 2024-02-19T21:38:14.118Z · LW(p) · GW(p)

The LessWrong Review [? · GW] runs every year to select the posts that have most stood the test of time. This post is not yet eligible for review, but will be at the end of 2025. The top fifty or so posts are featured prominently on the site throughout the year.

Hopefully, the review is better than karma at judging enduring value. If we have accurate prediction markets on the review results, maybe we can have better incentives on LessWrong today. Will this post make the top fifty?

comment by Joseph Van Name (joseph-van-name) · 2024-01-03T12:20:55.278Z · LW(p) · GW(p)

"Lesswrong has a convenient numerical proxy-metric of social status: site karma."-As long as I get massive downvotes for talking correctly about mathematics and using it to create interpretable AI systems, we should all regard karma as a joke. Karma can only be as good as the community here.

Replies from: SimonF, Algon
comment by Simon Fischer (SimonF) · 2024-01-04T12:23:16.378Z · LW(p) · GW(p)

(I downvoted your comment because it's just complaining about downvotes to unrelated comments/posts and not meaningfully engaging with the topic at hand)

Replies from: joseph-van-name
comment by Joseph Van Name (joseph-van-name) · 2024-01-04T15:31:44.043Z · LW(p) · GW(p)

I am pointing out something wrong with the community here. The name of this site is LessWrong. On this site, it is better to acknowledge wrongdoing so that the people here do not fall into traps like FTX again. If you read the article, you would know that it is better to acknowledge wrongdoing or a community weakness than to double down.

Replies from: FireStormOOO
comment by FireStormOOO · 2024-01-04T17:50:53.416Z · LW(p) · GW(p)

It is still a forum; all the usual norms about avoiding off-topic posts and not hijacking threads apply.  Perhaps a Q&A on how to get more engagement with math-heavy posts would be more constructive?  Speaking just for myself, a cheat-sheet on notation would do wonders.

Nobody is under any illusions that karma is perfect AFAICT, though much discussion has already been had on to what extent it just mirrors the flaws in people's underlying rating choices.

Replies from: joseph-van-name
comment by Joseph Van Name (joseph-van-name) · 2024-01-04T18:23:12.287Z · LW(p) · GW(p)

If you have any questions about the notation or definitions that I have used, you should ask about it in the mathematical posts that I have made and not here. Talking about it here is unhelpful, condescending, and it just shows that you did not even attempt to read my posts. That will not win you any favors with me or with anyone who cares about decency. 

Karma is not only imperfect, but Karma has absolutely no relevance whatsoever because Karma can only be as good as the community here.

P.S. Asking a question about the notation does not even signify any lack of knowledge since a knowledgeable person may ask questions about the notation because the knowledgeable person thinks that the post should not assume that the reader has that background knowledge.

P.P.S. I got downvotes, so I got enough engagement on the mathematics. The problem is the community here thinks that we should solve problems with AI without using any math for some odd reason that I cannot figure out.

Replies from: FireStormOOO
comment by FireStormOOO · 2024-01-09T06:30:58.190Z · LW(p) · GW(p)

I did go pull up a couple of your posts as that much is a fair critique:

That first post is only the middle section of what would already be a dense post, and is missing the motivating "what's the problem?" and "what does this get us?"; without understanding substantially all of the math and spending hours, I don't think I could even ask anything meaningful.  That first post in particular suffers from an approachable-ish sounding title followed by a wall of math, so you're getting laypeople who expected to at least get an intro paragraph for their trouble.

The August 19th post piqued my interest substantially more on account of including intro and summary sections, and enough text to let me follow along only understanding part of the math.  A key feature of good math text is I should be able to gloss over challenging proofs on a first pass, take your word for it, and still get something out of it.  Definitely don't lose the rigor, but have mercy on those of us not cut out for a math PhD.  If you had specific toy examples you were playing with while figuring out the post, those can also help make posts more approachable.  That post seemed well received, just not viewed much; my money says the title is scaring off everyone but the full-time researchers (which I'm not, I'm in software).

I think I and most other interested members not in the field default to staying out of the way when people open up with a wall of post-grad math or something that otherwise looks like a research paper, unless specifically invited to chime in.  And then same story with meta; this whole thread is something most folks aren't going to start under your post uninvited, especially when you didn't solicit this flavor of feedback.

I bring up notation specifically as the software crowd is very well represented here, and frequently learn advanced math concepts without bothering to learn any of the notation common in math texts.  So not like, 1 or 2 notation questions, but more like you can have people who get the concepts but all of the notation is Greek to them.

Replies from: joseph-van-name
comment by Joseph Van Name (joseph-van-name) · 2024-01-10T13:16:17.405Z · LW(p) · GW(p)

I have made a few minor and mostly cosmetic edits to the post about the dimensionality reduction of tensors that produces so many trace free matrices and also to the post about using LSRDRs to solve a combinatorial graph theory problem.

"What's the problem?"-Neural networks are horribly uninterpretable, so it would be nice if we could use more interpretable AI models or at least better interpretability tools. Neural networks seem to include a lot of random information, so it would be good to use AI models that do not include so much random information. Do you think that we would have more interpretable models by forsaking all mathematical theory?

 "what does this get us?"-This gets us systems trained by gradient ascent that behave much more mathematically. Mathematical AI is bound to be highly interpretable. 

The downvotes display a very bad attitude: at worst, they indicate that the LW community is a community that I really do not want much to do with, and at best, the LW community is a community that lacks discipline, and such mathematics texts will be needed to instill that discipline.  In those posts that you have looked at, I did not include any mathematical proofs (these are empirical observations, so I could not include proof), and the lack of mathematical proofs makes the text much easier to go through. I also made the texts quite short; I only included enough text to pretty much define the fitness function and then state what I have observed.

For toy examples, I just worked with random complex matrices, and I wanted these matrices to be sufficiently small so that I can make and run the code to compute with these matrices quite quickly, but these matrices need to be large enough so that I can properly observe what is going on. I do not want to make an observation about tiny matrices that do not have any relevance to what is going on in the real world.

If we want to be able to develop safer AI systems, we will need to make them much more mathematical, and people are doing a great disservice by hating the mathematics needed for developing these safer AI systems.

Replies from: FireStormOOO
comment by FireStormOOO · 2024-01-18T00:09:56.562Z · LW(p) · GW(p)

Wouldn't be engaging at all if I didn't think there was some truth to what you're saying about the math being important and folks needing to be persuaded to "take their medicine" as it were and use some rigor.  You are not the first person to make such an observation and you can find posts on point from several established/respected members of the community.

That said, I think "convincing people to take their medicine" mostly looks like those answers you gave just being at the intro of the post(s) by default (and/or the intro to the series if that makes more sense).  Alongside other misc readability improvements.  Might also try tagging the title as [math heavy] or some such.

I think you're taking too narrow a view on what sorts of things people vote on and thus what sort of signal karma is.  If that theory of mind is wrong, any of the inferences that flow from it are likely wrong too.  Keep in mind also (especially when parsing karma in comments) that anything that parses as whining costs you status even if you're right (not just a LW thing).  And complaining about internet points almost always parses that way.

I don't think it necessarily follows that math heavy post got some downvotes therefore everyone hates math and will downvote math in the future.  As opposed to something like people care a lot about readability and about being able to prioritize their reading to the subjects they find relevant, neither of which scores well if the post is math to the exclusion of all else.

I didn't find any of those answers surprising but it's an interesting line of inquiry all the same.  I don't have a good sense of how it's simultaneously true that LLMs keep finding it helpful to make everything bigger, but also large sections of the model don't seem to do anything useful, and increasingly so in the largest models.

Replies from: joseph-van-name
comment by Joseph Van Name (joseph-van-name) · 2024-01-18T12:03:13.814Z · LW(p) · GW(p)

Talking about whining and my loss of status is a good way to get me to dislike the LW community and consider them to be anti-intellectuals who fall for garbage like FTX. Do you honestly think the people here should try to interpret large sections of LLMs while simultaneously being afraid of quaternions?

It is better to comment on threads where we are interacting in a more positive manner.

I thought apologizing and recognizing inadequacies was a core rationalist skill. And I thought rationalists were supposed to like mathematics. The lack of mathematical appreciation is one of these inadequacies of the LW community. But instead of acknowledging this deficiency, the community here blasts me as talking about something off topic. How ironic!

Replies from: FireStormOOO
comment by FireStormOOO · 2024-01-18T16:40:04.602Z · LW(p) · GW(p)

Any conversation about karma would necessarily involve talking about what does and doesn't factor into votes, likely both here and in the internet or society at large.  Not thinking we're getting anywhere on that point.

I've already said clearly and repeatedly I don't have a problem with math posts and I don't think others do either.  You're not going to get what you want by continuing to straw-man myself and others.  I disagree with your premise, and you've thus far failed to acknowledge or engage with any of those points.

Replies from: joseph-van-name
comment by Joseph Van Name (joseph-van-name) · 2024-01-18T18:40:59.015Z · LW(p) · GW(p)

Let's see whether the notions that I have talked about are sensible mathematical notions for machine learning.

Tensor product-Sometimes data in a neural network has tensor structure. In this case, the weight matrices should be tensor products or tensor sums. Respecting the structure of the data works well with convolutional neural networks, and it should also work well for data with tensor structure.

Trace-The trace of a matrix measures how much the matrix maps vectors onto themselves, since tr(A) = E[⟨Av, v⟩] where v follows the multivariate normal distribution N(0, I).

Spectral radius-Suppose that we are iterating a smooth function f. Suppose furthermore that f(x_0) = x_0 and x is near x_0. We would like to determine whether lim_{n→∞} f^n(x) = x_0 or not. If the Jacobian of f at x_0 has spectral radius less than 1, then lim_{n→∞} f^n(x) = x_0. If the Jacobian of f at x_0 has spectral radius greater than 1, then this limit does not converge.

The notions that I have been talking about are sensible and arise in machine learning. And understanding these notions is far easier than trying to interpret very large networks like GPT-4 without using these notions. Many people on this site just act like clowns. Karma is only a good metric when the people on the network value substance over fluff. And the only way to convince me otherwise will be for the people here to value posts that involve basic notions like the trace, eigenvalues, and spectral radius of matrices.

P.S. I can make the trace, determinant, and spectral radius even simpler. These operations are what you get when you take the sum, product, and the maximum absolute value of the eigenvalues. Yes. Those are just the basic eigenvalue operations.
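
For concreteness, here's a minimal NumPy sketch of the trace identity and the spectral-radius fact above (an illustration only; the matrix size, sample count, and the 0.9 rescaling are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 16

# Trace as expected self-overlap: for v ~ N(0, I), E[v^T A v] = tr(A).
A = rng.standard_normal((n, n))
vs = rng.standard_normal((100_000, n))
estimate = np.einsum("ij,jk,ik->i", vs, A, vs).mean()
print(estimate, np.trace(A))  # the two numbers should be close

# Spectral radius and iteration: iterate x -> Jx near the fixed point 0.
# The iterates converge to 0 iff the spectral radius of J is < 1.
J = rng.standard_normal((n, n))
J *= 0.9 / np.abs(np.linalg.eigvals(J)).max()  # rescale spectral radius to 0.9
x = rng.standard_normal(n)
for _ in range(200):
    x = J @ x
print(np.linalg.norm(x))  # ~0, since the spectral radius is 0.9 < 1
```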

Replies from: FireStormOOO
comment by FireStormOOO · 2024-01-18T19:57:14.812Z · LW(p) · GW(p)

You're still hammering on stuff I never disagreed with in the first place.  In so far as I don't already understand all the math (or math notation) I'd need to follow this, that's a me problem not a you problem, and having a pile of cool papers I want to grok is prime motivation for brushing up on some more math.  I'm definitely not down-voting merely on that.

What I'm mostly trying to get across is just how large of a leap of logic you're making from [post got 2 or 3 downvotes] => [everyone here hates math].  There's got to be at least 3 or 4 major inferences there you haven't articulated here and I'm still not sure what you're reacting so strongly to.  Your post with the lowest karma is the first one and it's sitting at neutral, based on a grand total of 3 votes besides yours.  You are definitely sophisticated enough on math to understand the hazards of reasoning from a sample size that small.

Replies from: joseph-van-name, joseph-van-name
comment by Joseph Van Name (joseph-van-name) · 2024-01-18T20:07:41.643Z · LW(p) · GW(p)

I will work with whatever data I have, and I will make a value judgment based on the information that I have. The fact that Karma relies on very small amounts of information is a testament to a fault of Karma, and that is further evidence of how the people on this site do not want to deal with mathematics.  And the information that I have indicates that there are many people here who are likely to fall for more scams like FTX. Not all of the people here are so bad, but I am making a judgment based on the general atmosphere here. If you do not like my judgment, then the best thing would be to try to do better. If this site has made a mediocre impression on me, then I am not at fault for the mediocrity here.

comment by Joseph Van Name (joseph-van-name) · 2024-01-27T01:47:14.254Z · LW(p) · GW(p)

You are judging my reasoning without knowing all that went into my reasoning. That is not good.

Replies from: FireStormOOO
comment by FireStormOOO · 2024-01-28T22:42:58.407Z · LW(p) · GW(p)

Again you're saying that without engaging with any of my arguments or giving me any more of your reasoning to consider.  Unless you care to share substantially more of your reasoning, I don't see much point continuing this?

Replies from: joseph-van-name
comment by Joseph Van Name (joseph-van-name) · 2024-01-29T07:22:47.945Z · LW(p) · GW(p)

I do not care to share much more of my reasoning because I have shared enough and also because there is a reason that I have vowed to no longer discuss except possibly with lots of obfuscation. This discussion that we are having is just convincing me more that the entities here are not the entities I want to have around me at all. It does not do much good to say that the community here is acting well or to question my judgment about this community. It will do good for the people here to act better so that I will naturally have a positive judgment about this community.

Replies from: FireStormOOO
comment by FireStormOOO · 2024-01-29T20:46:16.677Z · LW(p) · GW(p)

There's a presumption you're open to discussing on a discussion forum, not just grandstanding.  Strong downvoted much of this thread for the amount of my time you've wasted trolling.

comment by Algon · 2024-01-04T12:58:35.189Z · LW(p) · GW(p)

If I could give some advice: Show that you can do something interesting from an interpretability perspective using your methodology[1], rather than something interesting from a mathematical perspective. 

By "something interesting form an interpretability perspective" I mean things like explaining some of the strange goings on in GPT-embedding spaces [LW · GW]. Or looking at some weird regularity in AI systems and pointing out that it isn't obviously explained by other theories but explained by some theory you have/

  1. ^

    Presumably there is some intuition, some angle of attack, some perspective that's driving all of that mathematics. Frankly, I'd rather hear what that intuition is before reading a whole bunch of mathematics I haven't used in a while.

Replies from: joseph-van-name
comment by Joseph Van Name (joseph-van-name) · 2024-01-04T15:29:36.253Z · LW(p) · GW(p)

I already did that. But it seems like the people here simply do not want to get into much mathematics regardless of how closely related to interpretability it is. 

 P.S. If anyone wants me to apply my techniques to GPT, I would much rather see the embedding spaces as more organized objects. I cannot deal very well with words that are represented as vectors of length 4096. I would rather deal with words that are represented as 64 by 64 matrices (or with some other dimensions). If we want better interpretability, the data needs to be structured in a more organized fashion so that it is easier to apply interpretability tools to the data.

comment by Alexander Gietelink Oldenziel (alexander-gietelink-oldenziel) · 2024-01-02T20:50:02.083Z · LW(p) · GW(p)

Here. Have some more karma. 

Replies from: alexander-gietelink-oldenziel
comment by Alexander Gietelink Oldenziel (alexander-gietelink-oldenziel) · 2024-01-03T11:51:14.771Z · LW(p) · GW(p)

Seems people are reading my message in a passive-aggressive tone. The original message should be read without any irony. I think it's good to (publicly) apologize and I think it's even better that John is writing a separate post to say it again (I missed the original).