LessWrong 2.0 Reader

EIS XIII: Reflections on Anthropic’s SAE Research Circa May 2024
scasper · 2024-05-21T20:15:36.502Z · comments (16)

[link] Making every researcher seek grants is a broken model
jasoncrawford · 2024-01-26T16:06:26.688Z · comments (41)

What’s up with LLMs representing XORs of arbitrary features?
Sam Marks (samuel-marks) · 2024-01-03T19:44:33.162Z · comments (61)

Sparse Autoencoders Find Highly Interpretable Directions in Language Models
Logan Riggs (elriggs) · 2023-09-21T15:30:24.432Z · comments (8)

[link] Succession
Richard_Ngo (ricraz) · 2023-12-20T19:25:03.185Z · comments (48)

[link] Masterpiece
Richard_Ngo (ricraz) · 2024-02-13T23:10:35.376Z · comments (21)

You can just spontaneously call people you haven't met in years
lc · 2023-11-13T05:21:05.726Z · comments (21)

Formal verification, heuristic explanations and surprise accounting
Jacob_Hilton · 2024-06-25T15:40:03.535Z · comments (11)

Announcing Dialogues
Ben Pace (Benito) · 2023-10-07T02:57:39.005Z · comments (52)

Is being sexy for your homies?
Valentine · 2023-12-13T20:37:02.043Z · comments (92)

Deep Honesty
Aletheophile (aletheo) · 2024-05-07T20:31:48.734Z · comments (25)

Language Models Model Us
eggsyntax · 2024-05-17T21:00:34.821Z · comments (50)

Dyslucksia
Shoshannah Tekofsky (DarkSym) · 2024-05-09T19:21:33.874Z · comments (45)

[link] "Diamondoid bacteria" nanobots: deadly threat or dead-end? A nanotech investigation
titotal (lombertini) · 2023-09-29T14:01:15.453Z · comments (81)

OpenAI: Exodus
Zvi · 2024-05-20T13:10:03.543Z · comments (26)

Apologizing is a Core Rationalist Skill
johnswentworth · 2024-01-02T17:47:35.950Z · comments (42)

[link] Comp Sci in 2027 (Short story by Eliezer Yudkowsky)
sudo · 2023-10-29T23:09:56.730Z · comments (22)

Contra papers claiming superhuman AI forecasting
nikos (followtheargument) · 2024-09-12T18:10:50.582Z · comments (13)

Ironing Out the Squiggles
Zack_M_Davis · 2024-04-29T16:13:00.371Z · comments (35)

My takes on SB-1047
leogao · 2024-09-09T18:38:37.799Z · comments (8)

[link] Nursing doubts
dynomight · 2024-08-30T02:25:36.826Z · comments (20)

The Incredible Fentanyl-Detecting Machine
sarahconstantin · 2024-06-28T22:10:01.223Z · comments (26)

[link] Daniel Dennett has died (1942-2024)
kave · 2024-04-19T16:17:04.742Z · comments (5)

2023 Survey Results
Screwtape · 2024-02-16T22:24:28.132Z · comments (26)

[link] Will no one rid me of this turbulent pest?
Metacelsus · 2023-10-14T15:27:21.497Z · comments (23)

Tips for Empirical Alignment Research
Ethan Perez (ethan-perez) · 2024-02-29T06:04:54.481Z · comments (4)

LoRA Fine-tuning Efficiently Undoes Safety Training from Llama 2-Chat 70B
Simon Lermen (dalasnoin) · 2023-10-12T19:58:02.119Z · comments (29)

On Devin
Zvi · 2024-03-18T13:20:04.779Z · comments (33)

[link] Vernor Vinge, who coined the term "Technological Singularity", dies at 79
Kaj_Sotala · 2024-03-21T22:14:14.699Z · comments (24)

Some (problematic) aesthetics of what constitutes good work in academia
Steven Byrnes (steve2152) · 2024-03-11T17:47:28.835Z · comments (12)

Discussion: Challenges with Unsupervised LLM Knowledge Discovery
Seb Farquhar · 2023-12-18T11:58:39.379Z · comments (21)

[question] things that confuse me about the current AI market.
DMMF · 2024-08-28T13:46:56.908Z · answers+comments (28)

Priors and Prejudice
MathiasKB (MathiasKirkBonde) · 2024-04-22T15:00:41.782Z · comments (31)

Does davidad's uploading moonshot work?
jacobjacob · 2023-11-03T02:21:51.720Z · comments (33)

The Plan - 2023 Version
johnswentworth · 2023-12-29T23:34:19.651Z · comments (39)

Liability regimes for AI
Ege Erdil (ege-erdil) · 2024-08-19T01:25:01.006Z · comments (34)

Leading The Parade
johnswentworth · 2024-01-31T22:39:56.499Z · comments (31)

The Information: OpenAI shows 'Strawberry' to feds, races to launch it
Martín Soto (martinsq) · 2024-08-27T23:10:18.155Z · comments (14)

OpenAI o1
Zach Stein-Perlman · 2024-09-12T17:30:31.958Z · comments (40)

[link] Using axis lines for good or evil
dynomight · 2024-03-06T14:47:10.989Z · comments (39)

Deep atheism and AI risk
Joe Carlsmith (joekc) · 2024-01-04T18:58:47.745Z · comments (22)

LLMs for Alignment Research: a safety priority?
abramdemski · 2024-04-04T20:03:22.484Z · comments (24)

[link] Moral Reality Check (a short story)
jessicata (jessica.liu.taylor) · 2023-11-26T05:03:18.254Z · comments (44)

Value Claims (In Particular) Are Usually Bullshit
johnswentworth · 2024-05-30T06:26:21.151Z · comments (18)

[link] If you weren't such an idiot...
kave · 2024-03-02T00:01:37.314Z · comments (74)

AI Views Snapshots
Rob Bensinger (RobbBB) · 2023-12-13T00:45:50.016Z · comments (61)

Loudly Give Up, Don't Quietly Fade
Screwtape · 2023-11-13T23:30:25.308Z · comments (11)

[link] That Alien Message - The Animation
Writer · 2024-09-07T14:53:30.604Z · comments (8)

At 87, Pearl is still able to change his mind
rotatingpaguro · 2023-10-18T04:46:29.339Z · comments (15)

Survey: How Do Elite Chinese Students Feel About the Risks of AI?
Nick Corvino (nick-corvino) · 2024-09-02T18:11:11.867Z · comments (13)


Recent comments

cata on AI Safety is Dropping the Ball on Clown Attacks

This post was difficult to take seriously when I read it but the "clown attack" idea very much stuck with me.

tailcalled on The case for a negative alignment tax

With this being said, catastrophic misuse could wipe us all out, too. It seems too strong to say that the ‘traits’ of frontier models ‘[have] nothing to do’ with the alignment problem/whether a giant AI wave destroys human society, as you put it. If we had no alignment technique that reliably prevented frontier LLMs from explaining to anyone who asked how to make anthrax, build bombs, spread misinformation, etc, this would definitely at least contribute to a society-destroying wave. But finetuned frontier models do not do this by default, largely because of techniques like RLHF. (Again, not saying or implying RLHF achieves this perfectly or can’t be easily removed or will scale, etc. Just seems like a plausible counterfactual world we could be living in but aren’t because of ‘traits’ of frontier models.)

I would like to see the case for catastrophic misuse being an xrisk, since it mostly seems like business-as-usual for technological development (you get more capabilities which means more good stuff but also more bad stuff).

thomascederborg on The case for more Alignment Target Analysis (ATA)

The proposed research project would indeed be focused on a certain type of alignment target: for example, proposals along the lines of PCEV, but not proposals along the lines of a tool-AI. Referring to this as Value-Alignment Target Analysis (VATA) would also be a possible notation. I will adopt this notation for the rest of this comment.

The proposed VATA research project would be aiming for risk mitigation. It would not be aiming for an answer:

There is a big difference between proposing an alignment target on the one hand and pointing out problems with alignment targets on the other. For example: it is entirely possible to reduce risks from a dangerous alignment target without having any idea how one might find a good alignment target. One can actually reduce risks without having any idea what it even means for an alignment target to be a good alignment target.

The feature of PCEV mentioned in the post is an example of this. The threat posed by PCEV has presumably been mostly removed. This did not require anything along the lines of an answer. The analysis of Condorcet AI (CAI) is similar. The analysis simply describes a feature shared by all CAI proposals (the feature that a barely caring solid majority can do whatever they want with everyone else). Pointing this out presumably reduces the probability that a CAI will be launched by designers who never considered this feature. All claims made in the post about a VATA research project being tractable refer to this type of risk mitigation being tractable. There is definitely no claim that a VATA research project can (i): find a good alignment target, (ii): somehow verify that this alignment target does not have any hidden flaws, and (iii): convince whoever is in charge to launch this target.

One can also go a bit beyond analysis of individual proposals, even if one does not have any idea how to find an answer. One can mitigate risk by describing necessary features (for example along the lines of this necessary Membrane formalism feature). This reduces risks from all proposals that clearly do not have such a necessary feature.

(And just to be extra clear: the post is not arguing that launching a Sovereign AI is a good idea. The post is assuming an audience that agrees that it is possible that a Sovereign AI might be launched. The post then argues that if this does happen, there is a risk that such a Sovereign AI project will be aiming at a bad value alignment target. The post further argues that this particular risk can be reduced by doing VATA.)

Regarding people being skeptical of Value Alignment Target proposals:

If someone ends up with the capability to launch a Sovereign AI, then I certainly hope that they will be skeptical of proposed Value Alignment Targets. Such skepticism can avert catastrophe even if the proposed alignment target has a flaw that no one has noticed.

The issue is that a situation might arise where (i): someone has the ability to launch a Sovereign AI, (ii): there exists a Sovereign AI proposal that no one can find any flaws with, and (iii): there is a time crunch.

Regarding the possibility that there exist people trying to find an answer without telling anyone:

I'm not sure how to estimate the probability of this. From a risk mitigation standpoint, this is certainly not the optimal way of doing things (if a proposed alignment target has a flaw, then it will be a lot easier to notice that flaw if the proposal is not kept secret). I really don't think that this is a reasonable way of doing things. But I think that you have a point. If Bob is about to launch an AI Sovereign with some critical flaw that would lead to some horrific outcome, then secretly working Steve might be able to notice this flaw. And if Bob is just about to launch his AI, and speaking up is the only way for Steve to prevent Bob from causing a catastrophe, then Steve will presumably speak up. In other words: the existence of people like secretly working Steve would indeed offer some level of protection. It would mean that the lack of people with relevant intuitions is not as bad as it appears (and when allocating resources, this possibility would indeed point to fewer resources for VATA). But I think that what is really needed is at least some people doing VATA with a clear risk mitigation focus and discussing their findings with each other. This does not appear to exist.

Regarding other risks, and the issue that findings might be ignored:

A VATA research project would not help with misalignment. In other words: even if the field of VATA was somehow completely solved tomorrow, AI could still lead to extinction. So the proposed research project is definitely not dealing with all risks. The point of the post is that the field of VATA is basically empty. I don't know of anyone that is doing VATA full time with a clear risk mitigation focus. And I don't know if you personally should switch to focusing on VATA. It would not surprise me at all if some other project is a better use of your time. It just seems like there should exist some form of VATA research project with a clear risk mitigation focus.

It is also possible that a VATA finding will be completely ignored (by leading labs, or by governments, or by someone else). It is possible that a Sovereign AI will be launched, leading to catastrophe, even though it has a known flaw (because the people launching it are just refusing to listen). But finding a flaw at least means that it is possible to avert catastrophe.

PS:

Thanks for the links! I will look into this. (I think that there are many fields of research that are relevant to VATA. It's just that one has to be careful. A concept can behave very differently when it is transferred to the AI context)

tailcalled on RLHF is the worst possible thing done when facing the alignment problem

Yeah, but the point is that the system learns values before an unrestricted AI vs AI conflict.

But if you just naively take the values that are appropriate outside of a life-and-death conflict and apply them to a life-and-death conflict, you're gonna lose. In that case, RLHF just makes you an irrelevant player, and if you insist on applying it to military/police technology, it's necessary for AI safety to pivot to addressing rogue states or gangsters.

Which again makes RLHF really really bad because we shouldn't have to work with rogue states or gangsters to save the world. Don't cripple the good guys.

I mean, if your definition of values doesn't make sense for real systems, then it's the problem of your definition. As a hypothesis describing reality "alignment trait makes AI not splash harm on humans" is coherent enough. So the question is how do you know it is unlikely to happen?

If you propose a particular latent variable that acts in a particular way, that is a lot of complexity, and you need a strong case to justify it as likely.

First, "alignment is easy" is compatible with "we need to keep the set of big adversaries small". But more generally, without numbers it seems like a generalized anti-future-technology argument - what's stopping human-regulation mechanisms from solving this adversarial problem, that didn't stop them from solving previous adversarial problems?

Human-regulation mechanisms could plausibly solve this problem by banning chip fabs. The issue is we use chip fabs for all sorts of things so we don't want to do that unless we are truly desperate.

Not necessarily? It's not inconceivable for future defense to be more effective than offence (trivially true if "defense" is not giving AI to attackers). It's kind of required for any future where humans have more power than in the present day?

Idk. Big entities have a lot of security vulnerabilities which could be attacked by AIs. But I guess one could argue the surviving big entities are red-teaming themselves hard enough to be immune to these. Perhaps most significant is the interactions between multiple independent big things, since they could be manipulated to harm the big things.

Small adversaries currently have a hard time exploiting these security vulnerabilities because intelligence is really expensive, but once intelligence becomes too cheap to meter, that is less of a problem.

You could heavily restrict the availability of AI but this would be an invasive possibility that's far off the current trajectory.

rogerdearnaley on Approximately Bayesian Reasoning: Knightian Uncertainty, Goodhart, and the Look-Elsewhere Effect

I think such an experiment could be done more easily than that: simply apply standard Bayesian learning to a test set of observations and a large set of hypotheses, some of which are themselves probabilistic, yielding a situation with both Knightian and statistical uncertainty, in which you would normally expect to be able to observe Regressional Goodhart/the Look-Elsewhere Effect. Repeat this, and confirm that this does indeed occur without this statistical adjustment, and then that applying the adjustment makes it go away (at least to second order).

However, I'm a little unclear why you feel the need to experimentally confirm a fairly well-known statistical technique: correctly compensating for the Look-Elsewhere Effect is standard procedure in the statistical analysis of experimental High-Energy Physics — which is of course a Bayesian learning process where you have both statistical uncertainty within individual hypotheses and Knightian uncertainty across alternative hypotheses, so exactly the situation in which this applies.
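
As a rough illustration of the effect such an experiment would look for, here is a minimal sketch. It is a simplified frequency-based toy, not the Bayesian procedure or the statistical adjustment discussed above: the coin-flip hypotheses, the function name `selection_bias`, and all parameter values are assumptions made up for the example. It only shows that the best-looking of many noisily evaluated hypotheses tends to have its score overestimated.

```python
import random
import statistics


def selection_bias(num_hypotheses, num_observations, num_trials=500, seed=0):
    """Average (estimated - true) rate of the best-looking hypothesis.

    Hypothetical toy setup: each hypothesis is a coin with a hidden
    success rate, scored by its observed frequency on a finite sample.
    """
    rng = random.Random(seed)
    gaps = []
    for _ in range(num_trials):
        # Hidden true success rates, one per hypothesis.
        true_rates = [rng.uniform(0.4, 0.6) for _ in range(num_hypotheses)]
        # Noisy estimates from a finite sample (statistical uncertainty).
        estimates = [
            sum(rng.random() < p for _ in range(num_observations)) / num_observations
            for p in true_rates
        ]
        # Select the hypothesis that looks best on the data.
        best = max(range(num_hypotheses), key=estimates.__getitem__)
        gaps.append(estimates[best] - true_rates[best])
    return statistics.mean(gaps)


bias_one = selection_bias(num_hypotheses=1, num_observations=50)
bias_many = selection_bias(num_hypotheses=100, num_observations=50)
print(f"1 hypothesis:   mean overestimate = {bias_one:+.3f}")
print(f"100 hypotheses: mean overestimate = {bias_many:+.3f}")
```

With a single hypothesis there is no selection, so the average gap should sit near zero; with a hundred candidates the selected one is typically overestimated, which is the pattern a correction for the Look-Elsewhere Effect is meant to remove.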

abandon on Counting arguments provide no evidence for AI doom

I've never seen an LLM do it.

If you're a little loose about the level of coherence required, 4o-mini managed it with several revisions and some spare tokens to (in theory, but tbh a lot of this is guesswork) give it spare compute for the hard part. (Share link, hopefully.)
Final poem:

Snip, Snip, Sacrifice
Silent strands surrender, sleekly spinning,
Shorn, solemnly shrouded, silently sinning.
Shadows shiver, severed, starkly strown,
Sorrowful symphony sings, softly sown.
Stalwart souls stand, steadfast, shadowed, slight,
Salvation sought silently, scissors’ swift sight.

tamsin-leake on The case for more Alignment Target Analysis (ATA)

Hi!

ATA is extremely neglected. The field of ATA is at a very early stage, and currently there does not exist any research project dedicated to ATA. The present post argues that this lack of progress is dangerous and that this neglect is a serious mistake.

I agree it's neglected, but there is in fact at least one research project dedicated to at least designing alignment targets: the part of the formal alignment agenda dedicated to formal outer alignment, which is the design of math problems to which solutions would be world-saving. Our notable attempts at this are QACI and ESP (there was also some work on a QACI2, but it predates (and in-my-opinion is superseded by) ESP).

Those try to implement CEV in math. They only work for doing CEV of a single person or small group, but that's fine: just do CEV of {a single person or small group} which values all of humanity/moral-patients/whatever getting their values satisfied instead of just that group's values. If you want humanity's values to be satisfied, then "satisfying humanity's values" is not opposite to "satisfy your own values", it's merely the outcome of "satisfy your own values".

tamsin-leake on Being nicer than Clippy

I wonder how many of those seemingly idealistic people retained power when it was available because they were indeed only pretending to be idealistic. Assuming one is actually initially idealistic but then gets corrupted by having power in some way, one thing someone can do in CEV that you can't do in real life is reuse the CEV process to come up with even better CEV processes which will be even more likely to retain/recover their just-before-launching-CEV values. Yes, many people would mess this up or fail in some other way in CEV; but we only need one person or group who we'd be somewhat confident would do alright in CEV. Plausibly there are at least a few eg MIRIers who would satisfy this. Importantly, to me, this reduces outer alignment to "find someone smart and reasonable and likely to have good goal-content integrity", which is a social and psychological matter that seems to be much smaller than the initial full problem of formal outer alignment / alignment target design.
One of the main reasons to do CEV is because we're gonna die of AI soon, and CEV is a way to have infinite time to solve the necessary problems. Another is that even if we don't die of AI, we get eaten by various moloch instead of being able to safely solve the necessary problems at whatever pace is necessary.

tom-r on Bordeaux France - ACX Meetups Everywhere Fall 2024

Because of the rain, we will move to the nearby pub the "Dog and Duck".

The time is the 20th at 6pm, but I will be there until at least 9pm!

See you there

signer on RLHF is the worst possible thing done when facing the alignment problem

RLHF does not solve the alignment problem because humans can’t provide good-enough feedback fast-enough.

Yeah, but the point is that the system learns values before an unrestricted AI vs AI conflict.

As mentioned in the beginning, I think the intuition goes that neural networks have a personality trait which we call “alignment”, caused by the correspondence between their values and our values. But “their values” only really makes sense after an unrestricted AI vs AI conflict, since without such conflicts, AIs are just gonna propagate energy to whichever constraints we point them at, so this whole worldview is wrong.

I mean, if your definition of values doesn't make sense for real systems, then it's the problem of your definition. As a hypothesis describing reality "alignment trait makes AI not splash harm on humans" is coherent enough. So the question is how do you know it is unlikely to happen?

This has not led to the destruction of humanity yet because the biggest adversaries have kept their conflicts limited (because too much conflict is too costly) so no entity has pursued an end by any means necessary. But this only works because there’s a sufficiently small number of sufficiently big adversaries (USA, Russia, China, …), and because there’s sufficiently much opportunity cost.

First, "alignment is easy" is compatible with "we need to keep the set of big adversaries small". But more generally, without numbers it seems like a generalized anti-future-technology argument - what's stopping human-regulation mechanisms from solving this adversarial problem, that didn't stop them from solving previous adversarial problems?

It makes conflict more viable for small adversaries against large adversaries

Not necessarily? It's not inconceivable for future defense to be more effective than offence (trivially true if "defense" is not giving AI to attackers). It's kind of required for any future where humans have more power than in the present day?