LessWrong 2.0 Reader


[link] electric turbofans
bhauth · 2024-11-02T22:50:59.807Z · comments (2)

Offering AI safety support calls for ML professionals
Vael Gates · 2024-02-15T23:48:12.797Z · comments (1)

What is SB 1047 *for*?
Raemon · 2024-09-05T17:39:39.871Z · comments (8)

Inspired by: Failures in Kindness
X4vier · 2024-07-27T01:21:42.848Z · comments (2)

Showing SAE Latents Are Not Atomic Using Meta-SAEs
Bart Bussmann (Stuckwork) · 2024-08-24T00:56:46.048Z · comments (9)

Why imperfect adversarial robustness doesn't doom AI control
Buck · 2024-11-18T16:05:06.763Z · comments (25)

Cognitive Work and AI Safety: A Thermodynamic Perspective
Daniel Murfet (dmurfet) · 2024-12-08T21:42:17.023Z · comments (9)

Consider the humble rock (or: why the dumb thing kills you)
pleiotroth · 2024-07-04T13:54:15.593Z · comments (11)

[link] Research Report: Sparse Autoencoders find only 9/180 board state features in OthelloGPT
Robert_AIZI · 2024-03-05T13:55:33.483Z · comments (24)

Checking in on Scott's composition image bet with imagen 3
Dave Orr (dave-orr) · 2024-12-22T19:04:17.495Z · comments (0)

MATS Alumni Impact Analysis
utilistrutil · 2024-09-30T02:35:57.273Z · comments (7)

There Should Be More Alignment-Driven Startups
Vaniver · 2024-05-31T02:05:06.799Z · comments (14)

[question] We might be dropping the ball on Autonomous Replication and Adaptation.
Charbel-Raphaël (charbel-raphael-segerie) · 2024-05-31T13:49:11.327Z · answers+comments (30)

Managing risks while trying to do good
Wei Dai (Wei_Dai) · 2024-02-01T18:08:46.506Z · comments (26)

Balancing Games
jefftk (jkaufman) · 2024-02-24T14:40:04.237Z · comments (18)

A case for donating to AI risk reduction (including if you work in AI)
tlevin (trevor) · 2024-12-02T19:05:06.658Z · comments (2)

AI #78: Some Welcome Calm
Zvi · 2024-08-22T14:20:10.812Z · comments (15)

A civilization ran by amateurs
Olli Järviniemi (jarviniemi) · 2024-05-30T17:57:32.601Z · comments (7)

ReSolsticed vol I: "We're Not Going Quietly"
Raemon · 2024-12-26T17:52:33.727Z · comments (4)

Measuring whether AIs can statelessly strategize to subvert security measures
Alex Mallen (alex-mallen) · 2024-12-19T21:25:28.555Z · comments (0)

[link] How do open AI models affect incentive to race?
jessicata (jessica.liu.taylor) · 2024-05-07T00:33:20.658Z · comments (13)

[link] More people getting into AI safety should do a PhD
AdamGleave · 2024-03-14T22:14:48.855Z · comments (24)

[link] Is Claude a mystic?
jessicata (jessica.liu.taylor) · 2024-06-07T04:27:09.118Z · comments (23)

[link] Linkpost: Memorandum on Advancing the United States’ Leadership in Artificial Intelligence
Nisan · 2024-10-25T04:37:00.828Z · comments (2)

An Actually Intuitive Explanation of the Oberth Effect
Isaac King (KingSupernova) · 2024-01-10T20:23:17.216Z · comments (33)

Toward Safety Cases For AI Scheming
Mikita Balesni (mykyta-baliesnyi) · 2024-10-31T17:20:06.019Z · comments (1)

[link] Results from an Adversarial Collaboration on AI Risk (FRI)
Josh Rosenberg (josh-rosenberg) · 2024-03-11T20:00:24.642Z · comments (3)

[question] What do we know about the AI knowledge and views, especially about existential risk, of the new OpenAI board members?
Zvi · 2024-03-11T14:55:05.128Z · answers+comments (2)

0th Person and 1st Person Logic
Adele Lopez (adele-lopez-1) · 2024-03-10T00:56:14.446Z · comments (28)

Against empathy-by-default
Steven Byrnes (steve2152) · 2024-10-16T16:38:49.926Z · comments (24)

A "Bitter Lesson" Approach to Aligning AGI and ASI
RogerDearnaley (roger-d-1) · 2024-07-06T01:23:22.376Z · comments (40)

Self-explaining SAE features
Dmitrii Kharlapenko (dmitrii-kharlapenko) · 2024-08-05T22:20:36.041Z · comments (13)

Base LLMs refuse too
Connor Kissane (ckkissane) · 2024-09-29T16:04:21.343Z · comments (20)

Approaching Human-Level Forecasting with Language Models
Fred Zhang (fred-zhang) · 2024-02-29T22:36:34.012Z · comments (6)

AI Alignment via Slow Substrates: Early Empirical Results With StarCraft II
Lester Leong (lester-leong) · 2024-10-14T04:05:05.096Z · comments (9)

5 Physics Problems
DaemonicSigil · 2024-03-18T08:05:45.971Z · comments (0)

Pollsters Should Publish Question Translations
jefftk (jkaufman) · 2024-09-08T22:10:04.932Z · comments (3)

Interdictor Ship
lsusr · 2024-08-19T04:59:18.487Z · comments (9)

[link] Towards shutdownable agents via stochastic choice
EJT (ElliottThornley) · 2024-07-08T10:14:24.452Z · comments (11)

Thoughts on SB-1047
ryan_greenblatt · 2024-05-29T23:26:14.392Z · comments (1)

Rationalists are missing a core piece for agent-like structure (energy vs information overload)
tailcalled · 2024-08-17T09:57:19.370Z · comments (9)

[link] How much I'm paying for AI productivity software (and the future of AI use)
jacquesthibs (jacques-thibodeau) · 2024-10-11T17:11:27.025Z · comments (18)

"Metastrategic Brainstorming", a core building-block skill
Raemon · 2024-06-11T04:27:52.488Z · comments (5)

LessOnline Festival Updates Thread
Ben Pace (Benito) · 2024-04-18T21:55:08.003Z · comments (26)

[link] Testing for Scheming with Model Deletion
Guive (GAA) · 2025-01-07T01:54:13.550Z · comments (20)

[link] Pacing Outside the Box: RNNs Learn to Plan in Sokoban
Adrià Garriga-alonso (rhaps0dy) · 2024-07-25T22:00:55.398Z · comments (8)

[link] Linkpost: Surely you can be serious
kave · 2024-07-18T22:18:09.271Z · comments (8)

Feature Targeted LLC Estimation Distinguishes SAE Features from Random Directions
Lidor Banuel Dabbah · 2024-07-19T20:32:15.095Z · comments (6)

D&D.Sci: The Mad Tyrant's Pet Turtles
abstractapplic · 2024-03-29T16:22:13.732Z · comments (18)

AI #48: Exponentials in Geometry
Zvi · 2024-01-18T14:20:07.869Z · comments (9)


Recent comments

jbash on What are some scenarios where an aligned AGI actually helps humanity, but many/most people don't like it?

Well, OK, but you also said "actually helps humanity", which assumes some kind of outside view. And you used "aligned" without specifying any particular one of the conflicting visions of "alignment" that are out there.

I absolutely agree that "aligned with whom" is a huge issue. It's one of the things that really bugs me about the word.

I do also agree that there are going to be irreconcilable differences, and that, barring mind surgery to change their opinions, many people will be unhappy with whatever happens. That applies no matter what an AI does, and in fact no matter what anybody who's "in charge" does. It applies even if nobody is in charge. But if somebody is in charge, it's guaranteed that a lot of people will be very angry at that somebody. Sometimes all you can change is who is unhappy.

For example, a whole lot of Christians, Muslims, and possibly others believe that everybody who doesn't wholeheartedly accept their religion is not only wrong, but also going to suffer in hell for eternity. Those religions are mutually contradictory at their cores. And a probably smaller but still large number of atheists believe that all religion is mindrot that intrinsically reduces the human dignity of anybody who accepts it.

You can't solve that, no matter how smart you are. Favor one view and the other view loses. Favor none, and the other views say that a bunch of people are seriously harmed, even if it's voluntary. It doesn't even matter how you favor a view. Gentle persuasion is still a problem. OK, technically you can avoid people being mad about it after the fact by extreme mind surgery, but you can't reconcile their original values. You can prevent violent conflict by sheer force, but you can't remove the underlying issue.

Still, a lot of the approaches you describe are pretty ham-handed even if you agree with the underlying values. Some of the desired outcomes you list even sound to me like good ideas... but you ought to be able to work toward those goals, even achieve them, without doing it in a way that pisses off the maximum possible number of people. So I guess I'm reacting to the extreme framing and the extreme measures. I don't think the Taliban actively want people to be mad.

[Edited unusually heavily after posting because apparently I can't produce coherent, low-typo text in the morning]

meedstrom on CFAR Takeaways: Andrew Critch

Basically agree, but downvoted because not useful.

I'd add the nuance that being alive and energetic is fun, but when my body no longer grants energy, it's like death already. Say I'm trying to take notes about the content of this thread, but I'm so tired I barely produce anything. If the terms of my body are such that I must first do a timeskip to tomorrow to get more energy, then I want the timeskip.

I guess I understand becoming sleep-deprived and staying up anyway if you don't notice your IQ dropping...

mikbp on Is Musk still net-positive for humanity?

Oh, it is probably my mistake XD I'm also not native. I meant increase, not that it is the maximum it could be, sorry.

aram-panasenco on AGI Ruin: A List of Lethalities

I really appreciate this post, as much as it's making me feel that I and everyone I care about have terminal cancer with only 12-60 months to live.

I found the idea that a pivotal act is necessary especially valuable and expanded on it in my post [Is AI Alignment Enough?](https://www.lesswrong.com/posts/tdrK7r4QA3ifbt2Ty/is-ai-alignment-enough).

dmitry-vaintrob on Dmitry's Koan

Thanks for asking! I said in a later shortform that I was trying to do too many things in this post, with only vague relationships between them, and I'm planning to split it into pieces in the future.

Your 1-3 are mostly correct. I'd comment as follows:

(and also kind of 3) That advice of using the tempered local Bayesian posterior (I like the term -- let's shorten it to TLBP) is mostly aimed at non-SLT researchers (but may also apply to some SLT experiments). The suggestion is not to compute expectations, but rather to run a single experiment at a weight sampled from the TLBP. The result is analogous to tuning a precision dial on your NN to noise away all circuits for which the quotient (usefulness)/(description length) is bounded above by 1/t (where usefulness is measured in reduction of loss). At t = 0, you're adding no noise, and at the other extreme you're fully noising it.
This is interesting to do in interp experiments for two general reasons:
1. You can see whether the behavior your experiment finds is general or spurious. The higher the temperature range it persists over, the more general it is in the sense of usefulness/description length (and all else being equal, the more important your result is).
2. If you are hoping to say that a behavior you found, e.g. a circuit, is "natural from the circuit's point of view" (i.e., plausibly occurs in some kind of optimal weight- or activation-level description of your model), you need to make sure your experiment isn't just putting together bits of other circuits in an ad-hoc way and calling it a circuit. One way to see this, that works 0% of the time, is to notice that turning this circuit on or off affects the output on exactly the context/structure you care about, and has absolutely no effect at all on performance elsewhere. This never works because our interp isn't at a level where we can perform uber-precise targeted interventions, and whenever we do something to a network in an experiment, this always significantly affects loss on unrelated inputs. By having a tunable precision parameter (as given by the TLBP for example), you have more freedom to find such "clean" effects that only do what you want and don't affect loss otherwise. In general, in an imprecise sense, you expect each "true" circuit to have some "temperature of entanglement" with the rest of the model, and if this circuit is important enough to survive tempering to this temperature of entanglement, you expect to see much cleaner and nicer results in the resulting tempered model.
In the above context, you rarely want to use the Watanabe temperature or any other temperature that only depends on the number of samples n, since it's much too low in most cases. Instead, you're either looking for a characteristic temperature associated with an experiment or circuit (which in general will not depend on n much), or fishing for behaviors that you hope are "significantly general". Here the characteristic temperature associated with the level of generality that "is not literally memorizing" is the Watanabe temperature or very similar, but it is probably more interesting to consider larger scales.
(maybe more related to your question 1): Above, I explained why I think performing experiments at TLBP weight values is useful for "general interp". I also explained that you sometimes have a natural "characteristic temperature" for the TLBP that is independent of sample number (e.g. meaningful at infinite samples), which is the difference between the loss of the network you're studying and that of a SOTA NN, which you think of as the "true optimal loss". In large-sample (highly underparameterized) cases, this is probably a better characteristic temperature than the Watanabe temperature, including for notions of effective parameter count: indeed, insofar as your NN is "an imperfect approximation of an optimal NN", the noise inherent in this imperfection is on this scale (and not the Watanabe scale). Of course there are issues with this PoV, as less expressive NNs are rarely well-conceptualized as TLBP samples (insofar as they find a subset of a "perfect NN's circuits", they find the easily learnable ones rather than the maximally general ones). However, it's still reasonable to think of this as a first stab at the inherent noise scale associated to an underparametrized model, and to think of the effective parameter count at this scale (i.e., free energy / log temperature) as a better approximation of some "inherent" parameter count.
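
A minimal sketch of the "run a single experiment at a weight sampled from the TLBP" suggestion, under assumptions that are not in the comment itself: the TLBP is taken to be the density proportional to exp(-L(w)/t - (gamma/2)·||w - w*||²) around a trained weight w*, the loss is a toy quadratic, and the sampler is plain unadjusted Langevin dynamics. The names `sample_tlbp`, `gamma`, and `curvature_bound` are hypothetical and chosen only for illustration.

```python
import numpy as np

def sample_tlbp(grad_loss, w_star, t, gamma=1.0, curvature_bound=10.0,
                n_steps=2000, n_chains=1, seed=0):
    """Approximate samples from p(w) ∝ exp(-L(w)/t - gamma/2 * ||w - w_star||^2).

    Uses unadjusted Langevin dynamics started at w_star. Requires t > 0.
    curvature_bound is an assumed upper bound on the loss curvature, used
    only to pick a step size that keeps the discretization stable.
    """
    rng = np.random.default_rng(seed)
    w_star = np.asarray(w_star, dtype=float)
    eps = 0.1 / (curvature_bound / t + gamma)      # small relative to stiffest direction
    w = np.tile(w_star, (n_chains, 1))             # one row per independent chain
    for _ in range(n_steps):
        # Gradient of the negative log-density: tempered loss plus localizing term.
        drift = grad_loss(w) / t + gamma * (w - w_star)
        w = w - 0.5 * eps * drift + np.sqrt(eps) * rng.standard_normal(w.shape)
    return w

# Toy stand-in for a network: quadratic loss with one stiff ("useful")
# direction and one flat direction, minimized at w_star = 0.
curvatures = np.array([10.0, 0.1])
w_star = np.zeros(2)

def grad_loss(w):
    return w * curvatures

for t in (0.01, 1.0, 100.0):
    samples = sample_tlbp(grad_loss, w_star, t, n_chains=500)
    spread = samples.std(axis=0)                   # per-direction noise at this temperature
    print(f"t={t}: std along stiff/flat directions = {spread}")
```

In this toy setting the temperature acts as the precision dial: at low t the sampled weights stay close to w*, and as t grows the flat direction is noised first while the stiff direction holds out longer. For a real interp experiment one would draw a single sample and rerun the experiment at that weight, which is a much cheaper operation than estimating expectations over the posterior.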

sharmake-farah on Human takeover might be worse than AI takeover

From my perspective, conditional on takeover happening, I'd say a human takeover and an AI takeover have pretty similar distributions of outcomes, mostly because I expect the variance in human values and in AI values to produce surprisingly similar results. A key factor here is that I expect a lot of the more alien values to result in extinction, though partial alignment can make things worse. Compared to the horror show that quite a few people have in their values, death can be pretty good. That's because I'm quite a bit more skeptical that the average person's values, especially conditional on takeover, automatically lead to good outcomes.

exmateriae on Is Musk still net-positive for humanity?

I thought you said he was very close to the maximum he could do? English is my second language, so maybe I misunderstood something. Also, only my first paragraph is really related to the quote; the rest is more of a free flow of what I think.

meedstrom on CFAR Takeaways: Andrew Critch

I think some Rationalists believe everything is supposed to fit into one frame, but Frames != The Truth. [...] we should be able to pick up and drop frames as needed, at will.

Aye - see also In Praise of Fake Frameworks. It's helped me interface with a lot of people that would've otherwise befuddled me. That gives me a more fleshed-out range of possible perspectives on things, which shortcuts to new knowledge.

But perhaps it's worth thinking twice about when, or at least how, to introduce this skill, because it looks like a method of doing Salvage Epistemology and so could invite its downsides if taught poorly. I'm undecided whether that's worth worrying about.

richard_kennaway on What are some scenarios where an aligned AGI actually helps humanity, but many/most people don't like it?

The AI, for its own inscrutable reasons, seizes upon the sort of idea that you have to be really smart to be stupid enough to take seriously, and imposes it on everyone.

I think all the scenarios above are instances of this.

zac-hatfield-dodds on POC || GTFO culture as partial antidote to alignment wordcelism

"POC || GTFO culture" need not be literal, and generally cannot be when speculating about future technologies. I wouldn't even want a proof-of-concept misaligned superintelligence!

Nonetheless, I think the field has been improved by an increasing emphasis on empiricism and demonstrations over the last two years, in technical research, in governance research, and in advocacy. I'd still like to see more careful caveating of claims for which we have arguments but not evidence, and it's useful to have a short handle for that idea - "POC || admit you're unsure", perhaps?