LessWrong 2.0 Reader

Bridging the VLM and mech interp communities for multimodal interpretability
Sonia Joseph (redhat) · 2024-10-28T14:41:41.969Z · comments (5)

Open Thread Winter 2024/2025
habryka (habryka4) · 2024-12-25T21:02:41.760Z · comments (8)

subfunctional overlaps in attentional selection history implies momentum for decision-trajectories
Emrik (Emrik North) · 2024-12-22T14:12:49.027Z · comments (1)

An exhaustive list of cosmic threats
Jordan Stone (jordan-stone) · 2025-01-09T19:59:08.368Z · comments (2)

Write Good Enough Code, Quickly
Oliver Daniels (oliver-daniels-koch) · 2024-12-15T04:45:56.797Z · comments (10)

Theoretical Alignment's Second Chance
lunatic_at_large · 2024-12-22T05:03:51.653Z · comments (0)

Whistleblowing Twitter Bot
Mckiev · 2024-12-26T04:09:45.493Z · comments (5)

AGI with RL is Bad News for Safety
Nadav Brandes (nadav-brandes) · 2024-12-21T19:36:03.970Z · comments (22)

Dmitry's Koan
Dmitry Vaintrob (dmitry-vaintrob) · 2025-01-10T04:27:30.346Z · comments (0)

Turning up the Heat on Deceptively-Misaligned AI
J Bostock (Jemist) · 2025-01-07T00:13:28.191Z · comments (16)

Definition of alignment science I like
quetzal_rainbow · 2025-01-06T20:40:38.187Z · comments (0)

[link] Forecast 2025 With Vox's Future Perfect Team — $2,500 Prize Pool
ChristianWilliams · 2024-12-20T23:00:35.334Z · comments (0)

[link] Update on the Mysterious Trump Buyers on Polymarket
Annapurna (jorge-velez) · 2024-11-04T19:22:06.540Z · comments (9)

[link] Fragile, Robust, and Antifragile Preference Satisfaction
adamShimi · 2024-11-02T17:25:55.986Z · comments (0)

Really radical empathy
MichaelStJules · 2025-01-06T17:46:31.269Z · comments (0)

Balsa Research 2024 Update
Zvi · 2024-12-03T12:30:06.829Z · comments (0)

[link] Why OpenAI’s Structure Must Evolve To Advance Our Mission
stuhlmueller · 2024-12-28T04:24:19.937Z · comments (1)

Higher and lower pleasures
Chris_Leong · 2024-12-05T13:13:46.526Z · comments (3)

Review: “The Case Against Reality”
David Gross (David_Gross) · 2024-10-29T13:13:29.643Z · comments (9)

Economics Roundup #4
Zvi · 2024-10-15T13:20:06.923Z · comments (4)

[link] AI & wisdom 2: growth and amortised optimisation
L Rudolf L (LRudL) · 2024-10-28T21:07:39.449Z · comments (0)

Reality is Fractal-Shaped
silentbob · 2024-12-17T13:52:16.946Z · comments (1)

Latent Adversarial Training (LAT) Improves the Representation of Refusal
alexandraabbas · 2025-01-06T10:24:53.419Z · comments (6)

2024 NYC Secular Solstice & Megameetup
Joe Rogero · 2024-11-12T17:46:18.674Z · comments (0)

Beliefs and state of mind into 2025
RussellThor · 2025-01-10T22:07:01.060Z · comments (7)

Economic Post-ASI Transition
[deleted] · 2025-01-01T22:37:31.722Z · comments (11)

[link] Genesis
PeterMcCluskey · 2024-12-31T22:01:17.277Z · comments (0)

In the Name of All That Needs Saving
pleiotroth · 2024-11-07T15:26:12.252Z · comments (2)

[link] From the Archives: a story
Richard_Ngo (ricraz) · 2024-12-27T16:36:50.735Z · comments (1)

Advisors for Smaller Major Donors?
jefftk (jkaufman) · 2024-11-06T14:30:06.187Z · comments (2)

Word Spaghetti
Gordon Seidoh Worley (gworley) · 2024-10-23T05:39:20.105Z · comments (9)

[question] Does the "ancient wisdom" argument have any validity? If a particular teaching or tradition is old, to what extent does this make it more trustworthy?
SpectrumDT · 2024-11-04T15:20:14.822Z · answers+comments (49)

[link] AI & Liability Ideathon
Kabir Kumar (kabir-kumar) · 2024-11-26T13:54:01.820Z · comments (2)

[link] AI safety content you could create
Adam Jones (domdomegg) · 2025-01-06T15:35:56.167Z · comments (0)

[link] Can o1-preview find major mistakes amongst 59 NeurIPS '24 MLSB papers?
Abhishaike Mahajan (abhishaike-mahajan) · 2024-12-18T14:21:03.661Z · comments (0)

[link] GPT-4o Guardrails Gone: Data Poisoning & Jailbreak-Tuning
ChengCheng (ccstan99) · 2024-11-01T00:10:50.718Z · comments (0)

Monthly Roundup #25: December 2024
Zvi · 2024-12-23T14:20:04.682Z · comments (3)

Announcing the CLR Foundations Course and CLR S-Risk Seminars
JamesFaville (elephantiskon) · 2024-11-19T01:18:10.085Z · comments (0)

[link] AI & wisdom 3: AI effects on amortised optimisation
L Rudolf L (LRudL) · 2024-10-28T21:08:56.604Z · comments (0)

Proposal to increase fertility: University parent clubs
Fluffnutt (Pear) · 2024-11-18T04:21:26.346Z · comments (3)

Everything you care about is in the map
Tahp · 2024-12-17T14:05:36.824Z · comments (27)

Most Minds are Irrational
Davidmanheim · 2024-12-10T09:36:33.144Z · comments (4)

[link] Building AI safety benchmark environments on themes of universal human values
Roland Pihlakas (roland-pihlakas) · 2025-01-03T04:24:36.186Z · comments (3)

Using Dangerous AI, But Safely?
habryka (habryka4) · 2024-11-16T04:29:20.914Z · comments (2)

Rebuttals for ~all criticisms of AIXI
Cole Wyeth (Amyr) · 2025-01-07T17:41:10.557Z · comments (11)

Heresies in the Shadow of the Sequences
Cole Wyeth (Amyr) · 2024-11-14T05:01:11.889Z · comments (12)

[link] A primer on machine learning in cryo-electron microscopy (cryo-EM)
Abhishaike Mahajan (abhishaike-mahajan) · 2024-12-22T15:11:58.860Z · comments (0)

Incredibow
jefftk (jkaufman) · 2025-01-07T03:30:02.197Z · comments (3)

Computational functionalism probably can't explain phenomenal consciousness
EuanMcLean (euanmclean) · 2024-12-10T17:11:28.044Z · comments (34)

Should you have children? All LessWrong posts about the topic
Sherrinford · 2024-11-26T23:52:44.113Z · comments (0)

Recent comments

zack_m_davis on Comment on "Death and the Gorgon"

But he's not complaining about the traditional pages of search results! He's complaining about the authoritative-looking Knowledge Panel to the side:

Obviously it's not Google's fault that some obscure SF web sites have stolen pictures from the Monash University web site of Professor Gregory K Egan and pretended that they're pictures of me ... but it is Google's fault when Google claim to have assembled a mini-biography of someone called "Greg Egan" in which the information all refers to one person (a science fiction writer), while the picture is of someone else entirely (a professor of engineering). [...] this system is just an amateurish mash-up. And by displaying results from disparate sources in a manner that implies that they refer to the same subject, it acts as a mindless stupidity amplifier that disseminates and entrenches existing errors.

Regarding the site URLs, I don't know, I think it's pretty common for people to have a problem that would take five minutes to fix if you're a specialist who already knows what you're doing, but non-specialists just reach for the first duct-tape solution that comes to mind without noticing how bad it is.

Like: you have a website at myname.somewebhost.com. One day, you buy myname.net, but end up following a tutorial that makes it a redirect rather than a proper CNAME or A record, because you don't know what those are. You're happy that your new domain works in that it's showing your website, but you notice that the address bar is still showing the old URL. So you say, "Huh, I guess I'll put a note on my page template telling people to use the myname.net address in case I ever change webhosts" and call it a day. I guess you could characterize that as a form of "cognitive rigidity", but "fanaticism"? Really?

I agree that Egan still hasn't seen the writing on the wall regarding deep learning. (A line in "Death and the Gorgon" mentions Sherlock's "own statistical tables", which is not what someone familiar with the technology would write.)

I agree that preëmptive blocking is kind of weird, but I also think your locked account with "Follow requests ignored due to terrible UI" is kind of weird.

chris_leong on Dialogue introduction to Singular Learning Theory

Excellent post. It helped clear up some aspects of SLT for me. Any chance you could clarify why this volume is called the "learning coefficient"?

jimmy on Preference Inversion

No, that does not sound like a fair characterization. My claims cover a lot more than "it doesn't always happen" and yours sure don't seem limited to "it doesn't never happen".

Here's the motivating question for this whole essay:

You asked why people who "believe in" avoiding nonmarital sex so frequently engage in and report badly regretting it

and here's part of your conclusion

At this point the behavior you describe should no longer be perplexing.

You're talking about this as if it needs falsification of preferences to explain, and my stance is that no, this is the default. Any time people have to face anything as complex as sexuality, even if people are doing their best to pro-socially guide people, this is necessarily what's going to happen. Perversions can sneak in too, and I don't deny that they exist, but postulating perversions is absolutely not needed in order to explain the data you're seeking to explain.

To narrow things down a bit, we can return to the original comment:

Sometimes people profess or try to reveal a preference for X, as a response to coercive pressures that are specifically motivated by prior underlying preferences for anti-X. This is what I'm calling preference inversion.

I don't disagree with this.

My intuition is that generally, upon reflection, people would prefer to satisfy their and others' preferences as calculated prior to such influences. I don't know whether there are other sorts of analogous distorting factors nearly all reasonable people would not like to satisfy upon reflection, but in general, I'm using the term "intrinsic preferences" to refer to whatever's left over after all such generally appealing adjustments.

It's this second part I was taking issue with.

Here, you're talking about what generally happens, not what "sometimes" happens, and I don't think "intrinsic preferences" is defined well enough to do what you want it to do here. I don't think it can be, unless you introduce more concepts, because I don't think "external vs intrinsic" can do justice to this multidimensional space no matter how you cleave it.

Part of this is because what counts as "external" cannot be well defined. If daddy yells at me to not drink, that sounds external, and my revealed preferences are likely to revert when he's not looking. But maybe being a reasonable person, upon reflection I'd agree with him. Does that make it "not a preference inversion"? If my boss threatens to fire me if I show up drunk, that sounds external too. But that's not very different than my boss reminding me that he can only afford to hire productive people -- and that's starting to sound like "just reality". Certainly if a doctor tells me that my liver is failing, that sounds like "just reality" and "internal". But it's external to my brain, and maybe if someone offers me an artificial liver I'd revert to my "intrinsic preferences"?

Our preferences necessarily depend on the reality we find ourselves embedded in, and cannot exist in isolation except perhaps in the highest abstraction (e.g. "I prefer to continue existing" or something), so the concept of "intrinsic preferences" for concrete things necessarily falls apart. What doesn't fall apart is the structure of incoherence in our own preferences.

We're constantly trying to shape and reshape the reality that others live in such that their revealed preferences given this reality satisfy our own. Part of this is making laws forbidding theft, how we indoctrinate in church, our hiring and firing decisions, how we inform our friends, etc. Some of these actions are purely cooperative, others are pure defections, and many are somewhere in between. Often we have fairly superficial pressure applied which results in fairly superficial changes in revealed preferences which easily revert, but that superficiality is fundamentally a property of the person containing the preferences not the person applying the pressures. There is indeed skill in facilitating deeper shifts in preferences to better match reality, and this is indeed a good thing to pursue, but the "intrinsic vs external" binary obscures the interplay between shifting reality, shifting perceptions of reality, and shifting preferences -- and therefore most of what is going on.

To use your example, the positive value of marital intimacy is inherently intertwined with the power of sexuality, the importance of getting sexuality right, and therefore the badness of sexuality done inappropriately. There is all sorts of room for this guidance to be given skillfully or clumsily, purely or corruptly, for it to be received coherently or superficially, in concordance with reality or not, and everywhere in between. Like you've noticed, there isn't always a legible distinction between the conventional conservative Christians who pull this off well and those that do more poorly.

My own perception is that almost none of our preferences can be cleanly described as "intrinsic" or "externally pressured", or as "valid" or "invalid". There's just differing degrees of coherence and differing degrees of fit to reality. The average case of conventional conservative Christians pushing against non-marital sex, and the average case of the person "believing in" and regretting not living by their "beliefs", is in between the picture Christianity portrays and the one you portray of falsified preferences. That's because the ground truth is in between "nonmarital sex is always bad" and "nonmarital sex is always as good as it seems".

Generally, when I interact with people on the topic of sexuality, I see people who don't know what their preferences resolve to with regards to non-marital sex -- and whose genuine preferences would resolve in different ways depending on the culture they're embedded within and the opportunities they have. I could sell either picture, and make it look "intrinsic", if I'm willing to sweep the right things under the rug in order to do so. Most people's belief structures surrounding sex (and most things) are shoddily built. I could argue for their destruction, and destroy them. I could argue for their utility, and preserve them. The optimal solution necessarily involves seeing both the utility and imperfections, both a degree of destruction and of reconstruction.

Like you said, this isn't just theoretical. This is a thing I've actually done when it has come up. I can give examples if it'd help.

shankar-sivarajan on Shankar Sivarajan's Shortform

I got a question (maybe more than one? The email left that ambiguous) accepted to the "Humanity's Last Exam" AI Benchmark!

vaniver on Human takeover might be worse than AI takeover

By contrast, today’s AIs are really nice and ethical. They’re humble, open-minded, cooperative, kind. Yes, they care about some things that could give them instrumental reasons to seek power (eg being helpful, human welfare), but their values are great

They also aren't facing the same incentive landscape humans are. You talk later about evolution to be selfish; not only is the story for humans far more complicated (why do humans often offer an even split in the ultimatum game?), but also humans talk a nicer game than they act (see construal level theory, or social-desirability bias). Once you start looking at AI agents who have similar affordances and incentives to the ones humans have, I think you'll see a lot of the same behaviors.

(There are structural differences here between humans and AIs. As an analogy, consider the difference between large corporations and individual human actors. Giant corporate chain restaurants often have better customer service than individual proprietors because they have more reputation on the line, and so are willing to pay more to not have things blow up on them. One might imagine that AIs trained by large corporations will similarly face larger reputational costs for misbehavior and so behave better than individual humans would. I think the overall picture is unclear and nuanced and doesn't clearly point to AI superiority.)

though there’s a big question mark over how much we’ll unintentionally reward selfish superhuman AI behaviour during training

Is it a big question mark? It currently seems quite unlikely to me that we will have oversight systems able to actually detect and punish superhuman selfishness on the part of the AI.

screwtape on Takeaways from calibration training

I wish more people 1. tried practicing the skills and techniques they think are important as rationalists and 2. reported back on how it went. Thank you Olli for doing so and writing up what happened!

Being well calibrated is something I aspire to, and so the advice on particular places where one might stumble (pointing out the >90% region is difficult, pointing out that one's gut may get anchored on a particular percentage for no good reason, pointing out switching domains threw things off for a little) is helpful. I'm a little nervous about how changing question category apparently led to poorer calibration for a while. It makes sense why that would be the case, but my ideal art of rationality would work well across domains. Otherwise, why not study that particular domain more? I do like the application to day-to-day problems; "do I have peanut butter at home or did I run out?" is the kind of thing I run into on at least a daily basis.
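For anyone who wants to try this on their own predictions, here is a minimal sketch of what checking calibration could look like. It is an illustration under my own assumptions, not the method from Olli's post: the function name `calibration_table`, the ten-bucket default, and the sample forecasts are all made up for the example.

```python
from collections import defaultdict


def calibration_table(forecasts, n_buckets=10):
    """Compare stated probabilities with observed frequencies.

    `forecasts` is a list of (probability, happened) pairs.
    Returns one (mean stated probability, hit rate, count) tuple
    per non-empty bucket, ordered from low to high probability.
    The bucketing scheme is an assumption made for illustration.
    """
    buckets = defaultdict(list)
    for p, happened in forecasts:
        # Clamp so that p == 1.0 lands in the top bucket.
        idx = min(int(p * n_buckets), n_buckets - 1)
        buckets[idx].append((p, happened))

    table = []
    for idx in sorted(buckets):
        items = buckets[idx]
        mean_p = sum(p for p, _ in items) / len(items)
        hit_rate = sum(1 for _, h in items if h) / len(items)
        table.append((mean_p, hit_rate, len(items)))
    return table


# Hypothetical everyday predictions, e.g. "do I have peanut butter at home?"
sample = [(0.95, True), (0.95, True), (0.95, False), (0.65, True), (0.65, False)]
for mean_p, hit_rate, count in calibration_table(sample):
    print(f"said {mean_p:.2f}, happened {hit_rate:.2f} of the time (n={count})")
```

In this made-up sample the 95% claims came true only two times out of three, which is the kind of overconfidence in the >90% region mentioned above. With so few data points per bucket the numbers are noisy, so a real check would need many more predictions.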

I'd love to have a dozen such reports from a dozen people's attempts, both to see if a pattern stood out of where common mistakes are ("Be cautious, Laplace's Rule works a bit differently when there can be multiple outcomes") and to get more datapoints that practice works. That's not a knock against what Olli's written here, that's a wish for more people to follow up and do this! Without feedback on what techniques work and what it looks like to improve, building a martial art of rationality gets much harder. With feedback like this, other people can better understand what's worth practicing and what's realistic to expect.

That's the most important takeaway I had from this takeaway. The repeated practice worked, and Olli got more calibrated as they practiced.

I'm inclined to think the Best Of LessWrong posts should include, not just the big insights or the shiny new techniques, but the dutiful reports years later about how those techniques have impacts on normal life. I'd like to lightly recommend Takeaways From Calibration Training for inclusion in the Best Of LessWrong Posts.
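On the Laplace's Rule caution quoted above: I can't reproduce exactly what the post said, but the standard generalization of the rule of succession to m possible outcomes estimates the probability of outcome i as (k_i + 1) / (n + m), where k_i is how often outcome i has been seen in n trials. With m = 2 that reduces to the familiar (k + 1) / (n + 2). A small sketch, where the function name and the example counts are my own:

```python
from fractions import Fraction


def rule_of_succession(counts):
    """Generalized rule of succession (uniform prior over m outcomes).

    `counts` maps every possible outcome to how often it was observed,
    including outcomes that have been observed zero times.
    Returns P(next trial is outcome i) = (k_i + 1) / (n + m).
    """
    n = sum(counts.values())  # total number of observations
    m = len(counts)           # number of possible outcomes
    return {outcome: Fraction(k + 1, n + m) for outcome, k in counts.items()}


# Two possible outcomes, three observations of "yes": (3 + 1) / (3 + 2)
print(rule_of_succession({"yes": 3, "no": 0}))

# Three possible outcomes, same three observations: (3 + 1) / (3 + 3)
print(rule_of_succession({"a": 3, "b": 0, "c": 0}))
```

The same three observations give 4/5 in the two-outcome case but 2/3 when a third outcome is possible, so the estimate depends on how many outcomes you count as possible in the first place.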

oliver-daniels on Scaling Sparse Feature Circuit Finding to Gemma 9B

I'm not that convinced that attribution patching is better than ACDC. As far as I can tell, Syed et al. only measure ROC with respect to "ground truth" (manually discovered) circuits and not faithfulness, completeness, etc. Also, Interp Bench finds ACDC is better than attribution patching.

habryka4 on (The) Lightcone is nothing without its people: LW + Lighthaven's big fundraiser

I reached out to them and they said pooling isn't possible.

charlie-steiner on Human-AI Complementarity: A Goal for Amplified Oversight

Thanks for the great reply :) I think we do disagree after all.

humans are definitionally the source of information about human values, even if it may be challenging to elicit this information from humans

Except about that - here we agree.

Now, what this human input looks like could (and probably should) go beyond introspection and preference judgments, which, as you point out, can be unreliable. It could instead involve expert judgment from humans with diverse cultural backgrounds, deliberation and/or negotiation, incentives to encourage deep, reflective thinking rather than snap judgments or falling back on heuristics. It could also involve AI assistance to help counter human biases, find common ground, and consider the logical consequences of communicated values.

This might be summarized as "If humans are inaccurate, let's strive to make them more accurate."

I think this, as a research priority or plan A, is doomed by a confluence of practical facts (humans aren't actually that consistent, even in what we'd consider a neutral setting) and philosophical problems (What if I think the snap judgments and heuristics are important parts of being human? And, how do you square a univariate notion of 'accuracy' with the sensitivity of human conclusions to semi-arbitrary changes to e.g. their reading lists, or the framings of arguments presented to them?).

Instead, I think our strategy should be "If humans are inconsistent and disagree, let's strive to learn a notion of human values that's robust to our inconsistency and disagreement."

We contend that even as AI gets really smart, humans ultimately need to be in the loop to determine whether or not a constitution is aligned and reasonable.

A committee of humans reviewing an AI's proposal is, ultimately, a physical system that can be predicted. If you have an AI that's good at predicting physical systems, then before it makes an important decision it can just predict this Committee(time, proposal) system and treat the predicted output as feedback on its proposal. If the prediction is accurate, then actual humans meeting in committee is unnecessary.

(And indeed, putting human control of the AI in the physical world actually exposes it to more manipulation than if the control is safely ensconced in the logical structure of the AI's decision-making.)

transhumanist_atom_understander on The Laws of Large Numbers

Well, usually I'm not inherently interested in a probability density function, I'm using it to calculate something else, like a moment or an entropy or something. But I guess I'll see what you use it for in future posts.