LessWrong 2.0 Reader

View: New · Old · Top

Restrict date range: Today · This week · This month · Last three months · This year · All time

← previous page (newer posts) · next page (older posts) →

Please stop using mediocre AI art in your posts
Raemon · 2024-08-25T00:13:52.890Z · comments (24)

OthelloGPT learned a bag of heuristics
jylin04 · 2024-07-02T09:12:56.377Z · comments (10)

"AI Alignment" is a Dangerously Overloaded Term
Roko · 2023-12-15T14:34:29.850Z · comments (100)

Preventing Language Models from hiding their reasoning
Fabien Roger (Fabien) · 2023-10-31T14:34:04.633Z · comments (14)

' petertodd'’s last stand: The final days of open GPT-3 research
mwatkins · 2024-01-22T18:47:00.710Z · comments (16)

Stuxnet, not Skynet: Humanity's disempowerment by AI
Roko · 2023-11-04T22:23:55.428Z · comments (24)

[question] How do you feel about LessWrong these days? [Open feedback thread]
jacobjacob · 2023-12-05T20:54:42.317Z · answers+comments (280)

A Selection of Randomly Selected SAE Features
CallumMcDougall (TheMcDouglas) · 2024-04-01T09:09:49.235Z · comments (2)

Discriminating Behaviorally Identical Classifiers: a model problem for applying interpretability to scalable oversight
Sam Marks (samuel-marks) · 2024-04-18T16:17:39.136Z · comments (8)

Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level (Post 1)
Neel Nanda (neel-nanda-1) · 2023-12-23T02:44:24.270Z · comments (6)

[link] Please support this blog (with money)
Elizabeth (pktechgirl) · 2024-08-17T15:30:05.641Z · comments (2)

2023 in AI predictions
jessicata (jessica.liu.taylor) · 2024-01-01T05:23:42.514Z · comments (35)

[link] Most smart and skilled people are outside of the EA/rationalist community: an analysis
titotal (lombertini) · 2024-07-12T12:13:56.215Z · comments (36)

Danger, AI Scientist, Danger
Zvi · 2024-08-15T22:40:06.715Z · comments (9)

Picking Mentors For Research Programmes
Raymond D · 2023-11-10T13:01:14.197Z · comments (8)

Demystifying "Alignment" through a Comic
milanrosko · 2024-06-09T08:24:22.454Z · comments (19)

One Day Sooner
Screwtape · 2023-11-02T19:00:58.427Z · comments (7)

The first future and the best future
KatjaGrace · 2024-04-25T06:40:04.510Z · comments (12)

Skills I'd like my collaborators to have
Raemon · 2024-02-09T08:20:37.686Z · comments (9)

The Pearly Gates
lsusr · 2024-05-30T04:01:14.198Z · comments (6)

New LessWrong feature: Dialogue Matching
jacobjacob · 2023-11-16T21:27:16.763Z · comments (22)

[link] A case for AI alignment being difficult
jessicata (jessica.liu.taylor) · 2023-12-31T19:55:26.130Z · comments (56)

Clarifying METR's Auditing Role
Beth Barnes (beth-barnes) · 2024-05-30T18:41:56.029Z · comments (1)

[link] Cohabitive Games so Far
mako yass (MakoYass) · 2023-09-28T15:41:27.986Z · comments (127)

Scaling and evaluating sparse autoencoders
leogao · 2024-06-06T22:50:39.440Z · comments (6)

On the future of language models
owencb · 2023-12-20T16:58:28.433Z · comments (17)

[question] What convincing warning shot could help prevent extinction from AI?
Charbel-Raphaël (charbel-raphael-segerie) · 2024-04-13T18:09:29.096Z · answers+comments (18)

New LessWrong review winner UI ("The LeastWrong" section and full-art post pages)
kave · 2024-02-28T02:42:05.801Z · comments (64)

The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks
Lucius Bushnaq (Lblack) · 2024-05-20T17:53:25.985Z · comments (4)

TOMORROW: the largest AI Safety protest ever!
Holly_Elmore · 2023-10-20T18:15:18.276Z · comments (26)

Why I'm doing PauseAI
Joseph Miller (Josephm) · 2024-04-30T16:21:54.156Z · comments (16)

In favour of exploring nagging doubts about x-risk
owencb · 2024-06-25T23:52:01.322Z · comments (2)

Nonlinear’s Evidence: Debunking False and Misleading Claims
KatWoods (ea247) · 2023-12-12T13:16:12.008Z · comments (171)

Deception Chess: Game #1
Zane · 2023-11-03T21:13:55.777Z · comments (19)

Charbel-Raphaël and Lucius discuss interpretability
Mateusz Bagiński (mateusz-baginski) · 2023-10-30T05:50:34.589Z · comments (7)

Apply for MATS Winter 2023-24!
utilistrutil · 2023-10-21T02:27:34.350Z · comments (6)

Making AIs less likely to be spiteful
Nicolas Macé (NicolasMace) · 2023-09-26T14:12:06.202Z · comments (4)

[link] My techno-optimism [By Vitalik Buterin]
habryka (habryka4) · 2023-11-27T23:53:35.859Z · comments (17)

Dreams of AI alignment: The danger of suggestive names
TurnTrout · 2024-02-10T01:22:51.715Z · comments (59)

[link] The Witness
Richard_Ngo (ricraz) · 2023-12-03T22:27:16.248Z · comments (4)

[link] Carl Sagan, nuking the moon, and not nuking the moon
eukaryote · 2024-04-13T04:08:50.166Z · comments (8)

Backdoors as an analogy for deceptive alignment
Jacob_Hilton · 2024-09-06T15:30:06.172Z · comments (1)

Response to nostalgebraist: proudly waving my moral-antirealist battle flag
Steven Byrnes (steve2152) · 2024-05-29T16:48:29.408Z · comments (29)

[link] Notes from a Prompt Factory
Richard_Ngo (ricraz) · 2024-03-10T05:13:39.384Z · comments (19)

SAE reconstruction errors are (empirically) pathological
wesg (wes-gurnee) · 2024-03-29T16:37:29.608Z · comments (16)

Programmatic backdoors: DNNs can use SGD to run arbitrary stateful computation
Fabien Roger (Fabien) · 2023-10-23T16:37:45.611Z · comments (3)

[link] A Chess-GPT Linear Emergent World Representation
Adam Karvonen (karvonenadam) · 2024-02-08T04:25:15.222Z · comments (14)

[link] Poker is a bad game for teaching epistemics. Figgie is a better one.
rossry · 2024-07-08T06:05:20.459Z · comments (47)

Me, Myself, and AI: the Situational Awareness Dataset (SAD) for LLMs
L Rudolf L (LRudL) · 2024-07-08T22:24:38.441Z · comments (28)

Sam Altman's sister, Annie Altman, claims Sam has severely abused her
pl5015 · 2023-10-07T21:06:49.396Z · comments (107)

← previous page (newer posts) · next page (older posts) →

Archive

Recent comments

cousin_it on Slave Morality: A place for every man and every man in his place

I think Nietzsche would agree that "slave morality" originated with Jesus. The main new idea that Jesus brought as a moral philosopher was compassion. Feeling for the other person. It's pretty find to hard in earlier sources, for example the heroes of the Iliad hurt weaker people without a second thought.

To me it feels obvious that the idea of compassion needs to exist, and needs to have force. Because otherwise we'd have a human society operating by the laws of the natural world, and if you look at what animals do to each other, there's no limit to how bad things can get.

Can compassion also become a tool of power and abuse? Sure. But let's not go back to a world without compassion, please.

xpym on A Nonconstructive Existence Proof of Aligned Superintelligence

This isn’t really a problem with alignment

I'd rather put it that resolving that problem is a prerequisite for the notion of "alignment problem" to be meaningful in the first place. It's not technically a contradiction to have an "aligned" superintelligence that does nothing, but clearly nobody would in practice be satisfied with that.

quetzal_rainbow on What's the Deal with Logical Uncertainty?

The reason why logical uncertainty was brought up in the first place is decision theory, to make crisp formal expression for intuitive "I cooperate with you conditional on you cooperating with me", where "you cooperating with me" is result of analysis of probability distribution over possible algorithms which control actions of your opponent and you can't actually run these algorithms due to computational constraints, and you want to do all this reasoning in non-arbitrary ways.

devrandom on Is "superhuman" AI forecasting BS? Some experiments on the "539" bot from the Centre for AI Safety

There seem to be substantial problems with low probability events, coherent predictions over time, short term events, probabilities adding up to more than 100%, etc

An probabilistic oracle being inconsistent is completely besides the point. If I have a probabilistic oracle that has high accuracy but is sometimes inconsistent, I can just post-process the predictions to force them into a consistent format. For example, I can normalize the probabilities to 100%.

The economic value is in the overall accuracy. Being consistent is a cosmetic consideration.

cata on AI Safety is Dropping the Ball on Clown Attacks

This post was difficult to take seriously when I read it but the "clown attack" idea very much stuck with me.

tailcalled on The case for a negative alignment tax

With this being said, catastrophic misuse could wipe us all out, too. It seems too strong to say that the ‘traits’ of frontier models ‘[have] nothing to do’ with the alignment problem/whether a giant AI wave destroys human society, as you put it. If we had no alignment technique that reliably prevented frontier LLMs from explaining to anyone who asked how to make anthrax, build bombs, spread misinformation, etc, this would definitely at least contribute to a society-destroying wave. But finetuned frontier models do not do this by default, largely because of techniques like RLHF. (Again, not saying or implying RLHF achieves this perfectly or can’t be easily removed or will scale, etc. Just seems like a plausible counterfactual world we could be living in but aren’t because of ‘traits’ of frontier models.)

I would like to see the case for catastrophic misuse being an xrisk, since it mostly seems like business-as-usual for technological development (you get more capabilities which means more good stuff but also more bad stuff).

thomascederborg on The case for more Alignment Target Analysis (ATA)

The proposed research project would indeed be focused on a certain type of alignment target. For example proposals along the lines of PCEV. But not proposals along the lines of a tool-AI. Referring to this as Value-Alignment Target Analysis (VATA) would also be a possible notation. I will adopt this notation for the rest of this comment.

The proposed VATA research project would be aiming for risk mitigation. It would not be aiming for an answer:

There is a big difference between proposing an alignment target on the one hand. And pointing out problems with alignment targets on the other hand. For example: it is entirely possible to reduce risks from a dangerous alignment target, without having any idea how one might find a good alignment target. One can actually reduce risks without having any idea, what it even means for an alignment target to be a good alignment target.

The feature of PCEV mentioned in the post is an example of this. The threat posed by PCEV has presumably been mostly removed. This did not require anything along the lines of an answer. The analysis of Condorcet AI (CAI) is similar. The analysis simply describes a feature shared by all CAI proposals (the feature that a barely caring solid majority can do whatever they want with everyone else). Pointing this out presumably reduces the probability that a CAI will be launched by designers that never considered this feature. All claims made in the post about a VATA research project being tractable is referring to this type of risk mitigation being tractable. There is definitely no claim that a VATA research project can (i): find a good alignment target, (ii): somehow verify that this alignment target does not have any hidden flaws, and (iii): convince whoever is in charge to launch this target.

One can also go a bit beyond analysis of individual proposals, even if one does not have any idea how to find an answer. One can mitigate risk by describing necessary features (for example along the lines of this necessary Membrane formalism feature [LW · GW]). This reduces risks from all proposal that clearly does not have such a necessary feature.

(and just to be extra clear: the post is not arguing that launching a Sovereign AI is a good idea. The post is assuming an audience that agree that it is possible that a Sovereign AI might be launched. And then the post is arguing that if this does happen, then there is a risk that such a Sovereign AI project will be aiming at a bad value alignment target. The post then further argues that this particular risk can be reduced by doing VATA)

Regarding people being skeptical of Value Alignment Target proposals:

If someone ends up with the capability to launch a Sovereign AI, then I certainly hope that they will be skeptical of proposed Value Alignment Targets. Such skepticism can avert catastrophe even if the proposed alignment target has a flaw that no one has noticed.

The issue is that a situation might arise where (i): someone has the ability to launch a Sovereign AI, (ii): there exists a Sovereign AI proposal that no one can find any flaws with, and (iii): there is a time crunch.

Regarding the possibility that there exists people trying to find an answer without telling anyone:

I'm not sure how to estimate the probability of this. From a risk mitigation standpoint, this is certainly not the optimal way of doing things (if a proposed alignment target has a flaw, then it will be a lot easier to notice that flaw, if the proposal is not kept secret). I really don't think that this is a reasonable way of doing things. But I think that you have a point. If Bob is about to launch an AI Sovereign with some critical flaw that would lead to some horrific outcome. Then secretly working Steve might be able to notice this flaw. And if Bob is just about to launch his AI, and speaking up is the only way for Steve to prevent Bob from causing a catastrophe, then Steve will presumably speak up. In other words: the existence of people like secretly working Steve would indeed offer some level of protection. It would mean that the lack of people with relevant intuitions is not as bad as it appears (and when allocating resources, this possibility would indeed point to less resources for VATA). But I think that what is really needed is at least some people doing VATA with a clear risk mitigation focus. And discussing their finding with each other. This does not appear to exist.

Regarding other risks, and the issue that findings might be ignored:

A VATA research project would not help with misalignment. In other words: even if the field of VATA was somehow completely solved tomorrow, AI could still lead to extinction. So the proposed research project is definitely not dealing with all risks. The point of the post is that the field of VATA is basically empty. I don't know of anyone that is doing VATA full time with a clear risk mitigation focus. And I don't know if you personally should switch to focusing on VATA. It would not surprise me at all if some other project is a better use of your time. It just seems like there should exist some form of VATA research project with a clear risk mitigation focus.

It is also possible that a VATA finding will be completely ignored (by leading labs, or by governments, or by someone else). It is possible that a Sovereign AI will be launched, leading to catastrophe, even though it has a known flaw (because the people launching it is just refusing to listen). But finding a flaw at least means that it is possible to avert catastrophe.

PS:

Thanks for the links! I will look into this. (I think that there are many fields of research that are relevant to VATA. It's just that one has to be careful. A concept can behave very differently when it is transferred to the AI context)

tailcalled on RLHF is the worst possible thing done when facing the alignment problem

Yeah, but the point is that the system learns values before an unrestricted AI vs AI conflict.

But if you just naively take the value that are appropriate outside of a life-and-death conflict and apply them to a life-and-death conflict, you're gonna lose. In that case, RLHF just makes you an irrelevant player, and if you insist on applying it to military/police technology, it's necessary for AI safety to pivot to addressing rogue states or gangsters.

Which again makes RLHF really really bad because we shouldn't have to work with rogue states or gangsters to save the world. Don't cripple the good guys.

I mean, if your definition of values doesn't make sense for real systems, then it's the problem of your definition. As a hypothesis describing reality "alignment trait makes AI not splash harm on humans" is coherent enough. So the question is how do you know it is unlikely to happen?

If you propose a particular latent variable that acts in a particular way, that is a lot of complexity, and you need a strong case to justify it as likely.

First, "alignment is easy" is compatible with "we need to keep the set of big adversaries small". But more generally, without numbers it seems like generalized anti-future-technology argument - what's stopping human-regulation mechanisms from solving this adversarial problem, that didn't stop them from solving previous adversarial problems?

Human-regulation mechanisms could plausibly solve this problem by banning chip fabs. The issue is we use chip fabs for all sorts of things so we don't want to do that unless we are truly desperate.

Not necessary? It's not unconceivable for future defense being more effective than offence (trivially true if "defense" is not giving AI to attackers). It kind of required for any future where humans have more power, than in present day?

Idk. Big entities have a lot of security vulnerabilities which could be attacked by AIs. But I guess one could argue the surviving big entities are red-teaming themselves hard enough to be immune to these. Perhaps most significant is the interactions between multiple independent big things, since they could be manipulated to harm the big things.

Small adversaries currently have a hard time exploiting these security vulnerabilities because intelligence is really expensive, but once intelligence becomes too cheap to meter, that is less of a problem.

You could heavily restrict the availability of AI but this would be an invasive possibility that's far off the current trajectory.

rogerdearnaley on Approximately Bayesian Reasoning: Knightian Uncertainty, Goodhart, and the Look-Elsewhere Effect

I think such an experiment could be done more easily than that: simply apply standard Bayesian learning to a test set of observations and a large set of hypotheses, some of which are themselves probabilistic, yeilding a situation with both Knightian and statistical uncertainty, in which you would normally expect to be able to observe Regressional Goodhart/the Look-Elsewhere Efect. Repeat this, and confirm that that does indeed occur without this statistical adjustment, and then that applying this makes it go away (at least to second order).

However, I'm a little unclear why you feel the need to experimentally confirm a fairly well-known statistical technique: correctly compensating for the Look-Elsewhere Effect is standard procedure in the statistical analysis of experimental High-Energy Physics — which is of course a Bayesian learning process where you have both statistical uncertainty within individual hypotheses and Knightian uncertainty across alternative hypotheses, so exactly the situation in which this applies.

abandon on Counting arguments provide no evidence for AI doom

I've never seen a LLM do it.

If you're a little loose about the level of coherence required, 4o-mini managed it with several revisions and some spare tokens to (in theory, but tbh a lot of this is guesswork) give it spare compute for the hard part. (Share link, hopefully.)
Final poem:

Snip, Snip, Sacrifice
Silent strands surrender, sleekly spinning,
Shorn, solemnly shrouded, silently sinning.
Shadows shiver, severed, starkly strown,
Sorrowful symphony sings, softly sown.
Stalwart souls stand, steadfast, shadowed, slight,
Salvation sought silently, scissors’ swift sight.