LessWrong 2.0 Reader

Elon Musk and Solar Futurism
transhumanist_atom_understander · 2024-12-21T02:55:28.554Z · comments (27)

Boston Secular Solstice 2024: Call for Singers and Musicans
jefftk (jkaufman) · 2024-11-15T13:50:07.827Z · comments (0)

[link] Miles Brundage: Finding Ways to Credibly Signal the Benignness of AI Development and Deployment is an Urgent Priority
Zach Stein-Perlman · 2024-10-28T17:00:18.660Z · comments (4)

Geoffrey Hinton on the Past, Present, and Future of AI
Stephen McAleese (stephen-mcaleese) · 2024-10-12T16:41:56.796Z · comments (5)

[link] Job Opening: SWE to help improve grant-making software
Ethan Ashkie (ethan-ashkie-1) · 2025-01-08T00:54:22.820Z · comments (1)

The average rationalist IQ is about 122
Rockenots (Ekefa) · 2024-12-28T15:42:07.067Z · comments (23)

[link] AI safety tax dynamics
owencb · 2024-10-23T12:18:32.243Z · comments (0)

Why Isn't Tesla Level 3?
jefftk (jkaufman) · 2024-12-11T14:50:01.159Z · comments (7)

Plausibly Factoring Conjectures
Quinn (quinn-dougherty) · 2024-11-22T20:11:56.479Z · comments (1)

[link] Genetically edited mosquitoes haven't scaled yet. Why?
alexey · 2024-12-30T21:37:32.942Z · comments (0)

[question] What should OpenAI do that it hasn't already done, to stop their vacancies from being advertised on the 80k Job Board?
WitheringWeights (EZ97) · 2024-10-21T13:57:30.934Z · answers+comments (0)

Magnitudes: Let's Comprehend the Incomprehensible!
joec · 2024-12-01T03:08:46.503Z · comments (8)

Text Posts from the Kids Group: 2018
jefftk (jkaufman) · 2024-11-23T12:50:05.325Z · comments (0)

Filled Cupcakes
jefftk (jkaufman) · 2024-11-26T03:20:08.504Z · comments (2)

The absolute basics of representation theory of finite groups
Dmitry Vaintrob (dmitry-vaintrob) · 2025-01-08T09:47:13.136Z · comments (0)

Non-Obvious Benefits of Insurance
jefftk (jkaufman) · 2024-12-23T03:40:02.184Z · comments (5)

A short project on Mamba: grokking & interpretability
Alejandro Tlaie (alejandro-tlaie-boria) · 2024-10-18T16:59:45.314Z · comments (0)

[question] Meal Replacements in 2025?
alkjash · 2025-01-06T15:37:25.041Z · answers+comments (9)

Dmitry's Koan
Dmitry Vaintrob (dmitry-vaintrob) · 2025-01-10T04:27:30.346Z · comments (2)

Long Live the Usurper
pleiotroth · 2024-11-27T12:10:51.025Z · comments (0)

Gwerns
Tomás B. (Bjartur Tómas) · 2024-11-16T14:31:57.791Z · comments (2)

[link] Towards the Operationalization of Philosophy & Wisdom
Thane Ruthenis · 2024-10-28T19:45:07.571Z · comments (2)

AI Can be “Gradient Aware” Without Doing Gradient hacking.
Sodium · 2024-10-20T21:02:10.754Z · comments (0)

[link] I read every major AI lab’s safety plan so you don’t have to
sarahhw · 2024-12-16T18:51:38.499Z · comments (0)

[link] Mechanistic Interpretability of Llama 3.2 with Sparse Autoencoders
PaulPauls · 2024-11-24T05:45:20.124Z · comments (3)

AXRP Episode 38.3 - Erik Jenner on Learned Look-Ahead
DanielFilan · 2024-12-12T05:40:06.835Z · comments (0)

Broken Latents: Studying SAEs and Feature Co-occurrence in Toy Models
chanind · 2024-12-30T22:50:54.964Z · comments (3)

Is AI Alignment Enough?
Aram Panasenco (panasenco) · 2025-01-10T18:57:48.409Z · comments (4)

A Generalization of the Good Regulator Theorem
Alfred Harwood · 2025-01-04T09:55:25.432Z · comments (5)

Lab governance reading list
Zach Stein-Perlman · 2024-10-25T18:00:28.346Z · comments (3)

Grading my 2024 AI predictions
Nikola Jurkovic (nikolaisalreadytaken) · 2025-01-02T05:01:46.587Z · comments (1)

[link] Announcement: AI for Math Fund
sarahconstantin · 2024-12-05T18:33:13.556Z · comments (9)

[question] What is the alpha in one bit of evidence?
J Bostock (Jemist) · 2024-10-22T21:57:09.056Z · answers+comments (13)

[link] It looks like there are some good funding opportunities in AI safety right now
Benjamin_Todd · 2024-12-22T12:41:02.151Z · comments (0)

AGI with RL is Bad News for Safety
Nadav Brandes (nadav-brandes) · 2024-12-21T19:36:03.970Z · comments (22)

Measuring Nonlinear Feature Interactions in Sparse Crosscoders [Project Proposal]
Jason Gross (jason-gross) · 2025-01-06T04:22:12.633Z · comments (0)

Review: “The Case Against Reality”
David Gross (David_Gross) · 2024-10-29T13:13:29.643Z · comments (9)

[link] Chess As The Model Game
criticalpoints · 2024-11-17T19:45:26.499Z · comments (0)

minifest
Austin Chen (austin-chen) · 2024-12-07T03:50:38.573Z · comments (1)

Bridging the VLM and mech interp communities for multimodal interpretability
Sonia Joseph (redhat) · 2024-10-28T14:41:41.969Z · comments (5)

Definition of alignment science I like
quetzal_rainbow · 2025-01-06T20:40:38.187Z · comments (0)

Turning up the Heat on Deceptively-Misaligned AI
J Bostock (Jemist) · 2025-01-07T00:13:28.191Z · comments (16)

D/acc AI Security Salon
Allison Duettmann (allison-duettmann) · 2024-10-19T22:17:57.067Z · comments (0)

[link] Forecast 2025 With Vox's Future Perfect Team — $2,500 Prize Pool
ChristianWilliams · 2024-12-20T23:00:35.334Z · comments (0)

Latent Adversarial Training (LAT) Improves the Representation of Refusal
alexandraabbas · 2025-01-06T10:24:53.419Z · comments (6)

Balsa Research 2024 Update
Zvi · 2024-12-03T12:30:06.829Z · comments (0)

Open Thread Winter 2024/2025
habryka (habryka4) · 2024-12-25T21:02:41.760Z · comments (9)

Whistleblowing Twitter Bot
Mckiev · 2024-12-26T04:09:45.493Z · comments (5)

[link] Why OpenAI’s Structure Must Evolve To Advance Our Mission
stuhlmueller · 2024-12-28T04:24:19.937Z · comments (1)

Really radical empathy
MichaelStJules · 2025-01-06T17:46:31.269Z · comments (0)

Recent comments

jbash on What are some scenarios where an aligned AGI actually helps humanity, but many/most people don't like it?

Well, OK, but you also said "actually helps humanity", which assumes some kind of outside view. And you used "aligned" without specifying any particular one of the conflicting visions of "alignment" that are out there.

I absolutely agree that "aligned with whom" is a huge issue. It's one of the things that really bugs me about the word.

I do also agree that there are going to be irreconcilable differences, and that, barring mind surgery to change their opinions, many people will be unhappy with whatever happens. That applies no matter what an AI does, and in fact no matter what anybody who's "in charge" does. It applies even if nobody is in charge. But if somebody is in charge, it's guaranteed that a lot of people will be very angry at that somebody. Sometimes all you can change is who is unhappy.

For example, a whole lot of Christians, Muslims, and possibly others believe that everybody who doesn't wholeheartedly accept their religion is not only wrong, but also going to suffer in hell for eternity. Those religions are mutually contradictory at their cores. And a probably smaller but still large number of atheists believe that all religion is mindrot that intrinsically reduces the human dignity of anybody who accepts it.

You can't solve that, no matter how smart you are. Favor one view and the other view loses. Favor none, and the other views say that a bunch of people are seriously harmed, even if it's voluntary. It doesn't even matter how you favor a view. Gentle persuasion is still a problem. OK, technically you can avoid people being mad about it after the fact by extreme mind surgery, but you can't reconcile their original values. You can prevent violent conflict by sheer force, but you can't remove the underlying issue.

Still, a lot of the approaches you describe are pretty ham-handed even if you agree with the underlying values. Some of the desired outcomes you list even sound to me like good ideas... but you ought to be able to work toward those goals, even achieve them, without doing it in a way that pisses off the maximum possible number of people. So I guess I'm reacting to the extreme framing and the extreme measures. I don't think the Taliban actively want people to be mad.

[Edited unusually heavily after posting because apparently I can't produce coherent, low-typo text in the morning]

meedstrom on CFAR Takeaways: Andrew Critch

Basically agree, but downvoted because not useful.

I'd add the nuance that being alive and energetic is fun -- but when my body no longer grants energy, it's like death already. Say I'm trying to take notes about the content of this thread, but I'm so tired I barely produce anything. If the terms of my body are such that I must first do a timeskip to tomorrow to get more energy, then I want the timeskip.

I guess I understand becoming sleep-deprived and staying up anyway if you don't notice your IQ dropping...

mikbp on Is Musk still net-positive for humanity?

Oh, it is probably my mistake XD I'm also not native. I meant increase, not that it is the maximum it could be, sorry.

aram-panasenco on AGI Ruin: A List of Lethalities

I really appreciate this post, as much as it's making me feel that I and everyone I care about have terminal cancer with only 12-60 months to live.

I found the idea that a pivotal act is necessary especially valuable, and expanded on it in my post [Is AI Alignment Enough?](https://www.lesswrong.com/posts/tdrK7r4QA3ifbt2Ty/is-ai-alignment-enough).

dmitry-vaintrob on Dmitry's Koan

Thanks for asking! I said in a later shortform that I was trying to do too many things in this post, with only vague relationships between them, and I'm planning to split it into pieces in the future.

Your 1-3 are mostly correct. I'd comment as follows:

(and also kind of 3) That advice of using the tempered local Bayesian posterior (I like the term -- let's shorten it to TLBP) is mostly aimed at non-SLT researchers (but may apply also to some SLT experiments). The suggestion is not to compute expectations, but rather to run a single experiment at a weight sampled from the TLBP. The result is analogous to tuning a precision dial on your NN to noise away all circuits for which the quotient (usefulness)/(description length) is bounded above by 1/t (where usefulness is measured in reduction of loss). At t = 0, you're adding no noise, and at t = ∞ you're fully noising it.
This is interesting to do in interp experiments for two general reasons:
1. You can see whether the behavior your experiment finds is general or spurious. The higher the temperature range it persists over, the more general it is in the sense of usefulness/description length (and all else being equal, the more important your result is).
2. If you are hoping to say that a behavior you found, e.g. a circuit, is "natural from the circuit's point of view" (i.e., plausibly occurs in some kind of optimal weight- or activation-level description of your model), you need to make sure your experiment isn't just putting together bits of other circuits in an ad-hoc way and calling it a circuit. One way to see this, that works 0% of the time, is to notice that turning this circuit on or off affects the output on exactly the context/structure you care about, and has absolutely no effect at all on performance elsewhere. This never works because our interp isn't at a level where we can perform uber-precise targeted interventions, and whenever we do something to a network in an experiment, this always significantly affects loss on unrelated inputs. By having a tunable precision parameter (as given by the TLBP for example), you have more freedom to find such "clean" effects that only do what you want and don't affect loss otherwise. In general, in an imprecise sense, you expect each "true" circuit to have some "temperature of entanglement" with the rest of the model, and if this circuit is important enough to survive tempering to this temperature of entanglement, you expect to see much cleaner and nicer results in the resulting tempered model.
In the above context, you rarely want to use the Watanabe temperature or any other temperature that only depends on the number of samples n, since it's much too low in most cases. Instead, you're either looking for a characteristic temperature associated with an experiment or circuit (which in general will not depend on n much), or fishing for behaviors that you hope are "significantly general". Here the characteristic temperature associated with the level of generality that "is not literally memorizing" is the Watanabe temperature or very similar, but it is probably more interesting to consider larger scales.
(maybe more related to your question 1): Above, I explained why I think performing experiments at TLBP weight values is useful for "general interp". I also explain that you sometimes have a natural "characteristic temperature" for the TLBP that is independent of sample number (e.g. meaningful at infinite samples), which is the difference between the loss of the network you're studying and a SOTA NN, which you think of as the "true optimal loss". In large-sample (highly underparameterized) cases, this is probably a better characteristic temperature than the Watanabe temperature, including for notions of effective parameter count: indeed, insofar as your NN is "an imperfect approximation of an optimal NN", the noise inherent in this imperfection is on this scale (and not the Watanabe scale). Of course there are issues with this PoV, as less expressive NNs are rarely well-conceptualized as TLBP samples (insofar as they find a subset of a "perfect NN's circuits", they find the easily learnable ones rather than the maximally general ones). However, it's still reasonable to think of this as a first stab at the inherent noise scale associated with an underparameterized model, and to think of the effective parameter count at this scale (i.e., free energy / log temperature) as a better approximation of some "inherent" parameter count.
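
A toy sketch of the "precision dial" picture above, under loud assumptions that are not in the comment itself: the TLBP is taken to have density proportional to exp(-L(w)/t) times a Gaussian localizing term with strength `gamma` around the trained weights, and the loss is taken to be an exactly quadratic bowl with a diagonal Hessian, so the posterior is Gaussian and can be sampled in closed form. The names `tlbp_std`, `sample_tlbp`, `gamma`, and `hessian_diag` are hypothetical. A real network would need an approximate sampler instead.

```python
import numpy as np

def tlbp_std(hessian_diag, t, gamma=1.0):
    """Per-coordinate noise scale of the assumed tempered local posterior.

    Assumed density: p_t(w) ∝ exp(-L(w)/t - gamma/2 * ||w - w_star||^2),
    with toy loss L(w) = 0.5 * sum(hessian_diag * (w - w_star)**2).
    For this quadratic case the posterior precision is hessian_diag/t + gamma.
    """
    h = np.asarray(hessian_diag, dtype=float)
    if t == 0:
        # Zero temperature: the posterior collapses onto w_star (no noise).
        return np.zeros_like(h)
    # At t = inf the loss term vanishes and only the localizing term remains.
    return 1.0 / np.sqrt(h / t + gamma)

def sample_tlbp(w_star, hessian_diag, t, gamma=1.0, rng=None):
    """Draw a single weight vector, as in 'run one experiment at a sample'."""
    rng = np.random.default_rng() if rng is None else rng
    w_star = np.asarray(w_star, dtype=float)
    std = tlbp_std(hessian_diag, t, gamma)
    return w_star + std * rng.standard_normal(w_star.shape)

if __name__ == "__main__":
    w_star = np.array([1.0, -2.0])
    # One sharply constrained direction and one nearly flat direction.
    hessian_diag = [100.0, 0.01]
    for t in [0.0, 0.1, 1.0, 10.0, np.inf]:
        print(t, tlbp_std(hessian_diag, t))
```

In this toy setting, raising t noises the nearly flat direction long before the sharply constrained one, which is the sense in which a behavior that persists over a wider temperature range is more robust. How curvature maps onto (usefulness)/(description length) in a real model is not something this sketch settles.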

sharmake-farah on Human takeover might be worse than AI takeover

Conditional on takeover happening, I'd say that a human takeover and an AI takeover have pretty similar distributions of outcomes, mostly because I expect the variance in human values and in AI values to lead to surprisingly similar results. A key factor here is that I expect a lot of the more alien values to result in extinction (though partial alignment can make things worse), and compared to the horror show that quite a few people's values amount to, death can be pretty good. That's because I'm quite a bit more skeptical of the average person's values, especially of the idea that, conditional on takeover, they automatically lead to good outcomes.

exmateriae on Is Musk still net-positive for humanity?

I thought you said he was very close to the maximum he could do? English is my second language, so maybe I misunderstood something. Also, only my first paragraph is really related to the quote; the rest is more of a free flow of what I think.

meedstrom on CFAR Takeaways: Andrew Critch

I think some Rationalists believe everything is supposed to fit into one frame, but Frames != The Truth. [...] we should be able to pick up and drop frames as needed, at will.

Aye - see also In Praise of Fake Frameworks. It's helped me interface with a lot of people that would've otherwise befuddled me. That gives me a more fleshed-out range of possible perspectives on things, which shortcuts to new knowledge.

But perhaps it's worth thinking twice when or at least how to introduce this skill, because it looks like a method of doing Salvage Epistemology and so could invite its downsides if taught poorly. I'm undecided whether that's worth worrying about.

richard_kennaway on What are some scenarios where an aligned AGI actually helps humanity, but many/most people don't like it?

The AI, for its own inscrutable reasons, seizes upon the sort of idea that you have to be really smart to be stupid enough to take seriously, and imposes it on everyone.

I think all the scenarios above are instances of this.

zac-hatfield-dodds on POC || GTFO culture as partial antidote to alignment wordcelism

"POC || GTFO culture" need not be literal, and generally cannot be when speculating about future technologies. I wouldn't even want a proof-of-concept misaligned superintelligence!

Nonetheless, I think the field has been improved by an increasing emphasis on empiricism and demonstrations over the last two years, in technical research, in governance research, and in advocacy. I'd still like to see more careful caveating of claims for which we have arguments but not evidence, and it's useful to have a short handle for that idea - "POC || admit you're unsure", perhaps?