LessWrong 2.0 Reader

On Dwarkesh’s Podcast with Leopold Aschenbrenner
Zvi · 2024-06-10T12:40:03.348Z · comments (7)
Sam Altman's sister, Annie Altman, claims Sam has severely abused her
pl5015 · 2023-10-07T21:06:49.396Z · comments (107)
Value systematization: how values become coherent (and misaligned)
Richard_Ngo (ricraz) · 2023-10-27T19:06:26.928Z · comments (48)
On the Executive Order
Zvi · 2023-11-01T14:20:01.657Z · comments (4)
[link] LessOnline (May 31—June 2, Berkeley, CA)
Ben Pace (Benito) · 2024-03-26T02:34:00.000Z · comments (24)
General Thoughts on Secular Solstice
Jeffrey Heninger (jeffrey-heninger) · 2024-03-23T18:48:43.940Z · comments (60)
Simple versus Short: Higher-order degeneracy and error-correction
Daniel Murfet (dmurfet) · 2024-03-11T07:52:46.307Z · comments (6)
A simple model of math skill
Alex_Altair · 2024-07-21T18:57:33.697Z · comments (16)
LLM Applications I Want To See
sarahconstantin · 2024-08-19T21:10:03.101Z · comments (4)
Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small
Joseph Bloom (Jbloom) · 2024-02-02T06:54:53.392Z · comments (37)
What's up with "Responsible Scaling Policies"?
habryka (habryka4) · 2023-10-29T04:17:07.839Z · comments (8)
[link] My cover story in Jacobin on AI capitalism and the x-risk debates
garrison · 2024-02-12T23:34:16.526Z · comments (5)
[link] Advice for Activists from the History of Environmentalism
Jeffrey Heninger (jeffrey-heninger) · 2024-05-16T18:40:02.064Z · comments (8)
Announcing the London Initiative for Safe AI (LISA)
James Fox · 2024-02-02T23:17:47.011Z · comments (0)
[link] Transformer Circuit Faithfulness Metrics Are Not Robust
Joseph Miller (Josephm) · 2024-07-12T03:47:30.077Z · comments (5)
Learning-theoretic agenda reading list
Vanessa Kosoy (vanessa-kosoy) · 2023-11-09T17:25:35.046Z · comments (0)
Truthseeking when your disagreements lie in moral philosophy
Elizabeth (pktechgirl) · 2023-10-10T00:00:04.130Z · comments (4)
What's Hard About The Shutdown Problem
johnswentworth · 2023-10-20T21:13:27.624Z · comments (33)
Comments on Anthropic's Scaling Monosemanticity
Robert_AIZI · 2024-06-03T12:15:44.708Z · comments (8)
I don’t find the lie detection results that surprising (by an author of the paper)
JanB (JanBrauner) · 2023-10-04T17:10:51.262Z · comments (8)
“X distracts from Y” as a thinly-disguised fight over group status / politics
Steven Byrnes (steve2152) · 2023-09-25T15:18:18.644Z · comments (14)
Explaining a Math Magic Trick
Robert_AIZI · 2024-05-05T19:41:52.048Z · comments (10)
Lsusr's Rationality Dojo
lsusr · 2024-02-13T05:52:03.757Z · comments (17)
[link] CIV: a story
Richard_Ngo (ricraz) · 2024-06-15T22:36:50.415Z · comments (6)
OpenAI #8: The Right to Warn
Zvi · 2024-06-17T12:00:02.639Z · comments (8)
[link] "Deep Learning" Is Function Approximation
Zack_M_Davis · 2024-03-21T17:50:36.254Z · comments (28)
[link] Perplexity wins my AI race
Elizabeth (pktechgirl) · 2024-08-24T19:20:10.859Z · comments (12)
Deceptive AI ≠ Deceptively-aligned AI
Steven Byrnes (steve2152) · 2024-01-07T16:55:13.761Z · comments (19)
[link] The Minority Faction
Richard_Ngo (ricraz) · 2024-06-24T20:01:27.436Z · comments (5)
[question] Lying to chess players for alignment
Zane · 2023-10-25T17:47:15.033Z · answers+comments (54)
Key takeaways from our EA and alignment research surveys
Cameron Berg (cameron-berg) · 2024-05-03T18:10:41.416Z · comments (10)
[question] How to get nerds fascinated about mysterious chronic illness research?
riceissa · 2024-05-27T22:58:29.707Z · answers+comments (50)
Kids or No kids
Kids or no kids (grosseholz.f@gmail.com) · 2023-11-14T18:37:02.799Z · comments (10)
I am the Golden Gate Bridge
Zvi · 2024-05-27T14:40:03.216Z · comments (6)
Catching AIs red-handed
ryan_greenblatt · 2024-01-05T17:43:10.948Z · comments (20)
[link] Explaining Impact Markets
Saul Munn (saul-munn) · 2024-01-31T09:51:27.587Z · comments (2)
[link] MIRI's April 2024 Newsletter
Harlan · 2024-04-12T23:38:20.781Z · comments (0)
[Valence series] 1. Introduction
Steven Byrnes (steve2152) · 2023-12-04T15:40:21.274Z · comments (14)
Dialogue introduction to Singular Learning Theory
Olli Järviniemi (jarviniemi) · 2024-07-08T16:58:10.108Z · comments (11)
[link] Ilya Sutskever created a new AGI startup
harfe · 2024-06-19T17:17:17.366Z · comments (35)
Announcing FAR Labs, an AI safety coworking space
bgold · 2023-09-29T16:52:37.753Z · comments (0)
[link] Almost everyone I’ve met would be well-served thinking more about what to focus on
Henrik Karlsson (henrik-karlsson) · 2024-01-05T21:01:27.861Z · comments (8)
OpenAI's Sora is an agent
CBiddulph (caleb-biddulph) · 2024-02-16T07:35:52.171Z · comments (25)
[link] Ideological Bayesians
Kevin Dorst · 2024-02-25T14:17:25.070Z · comments (4)
Investigating the learning coefficient of modular addition: hackathon project
Nina Panickssery (NinaR) · 2023-10-17T19:51:29.720Z · comments (5)
Symbol/Referent Confusions in Language Model Alignment Experiments
johnswentworth · 2023-10-26T19:49:00.718Z · comments (44)
On Claude 3.5 Sonnet
Zvi · 2024-06-24T12:00:05.719Z · comments (14)
[link] RAND report finds no effect of current LLMs on viability of bioterrorism attacks
StellaAthena · 2024-01-25T19:17:30.493Z · comments (14)
[link] Compact Proofs of Model Performance via Mechanistic Interpretability
LawrenceC (LawChan) · 2024-06-24T19:27:21.214Z · comments (3)
Sparsify: A mechanistic interpretability research agenda
Lee Sharkey (Lee_Sharkey) · 2024-04-03T12:34:12.043Z · comments (22)