LessWrong 2.0 Reader

The case for stopping AI safety research
catubc (cat-1) · 2024-05-23T15:55:18.713Z · comments (38)

[link] S-Risks: Fates Worse Than Extinction
aggliu · 2024-05-04T15:30:36.666Z · comments (2)

[question] Can we get an AI to "do our alignment homework for us"?
Chris_Leong · 2024-02-26T07:56:22.320Z · answers+comments (33)

AI #71: Farewell to Chevron
Zvi · 2024-07-04T13:40:05.905Z · comments (9)

BatchTopK: A Simple Improvement for TopK-SAEs
Bart Bussmann (Stuckwork) · 2024-07-20T02:20:51.848Z · comments (0)

Tax Price Gouging?
jefftk (jkaufman) · 2025-01-17T14:10:03.395Z · comments (20)

AI #76: Six Shorts Stories About OpenAI
Zvi · 2024-08-08T13:50:04.659Z · comments (10)

Estimates of GPU or equivalent resources of large AI players for 2024/5
CharlesD · 2024-11-28T23:01:58.522Z · comments (7)

Two LessWrong speed friending experiments
mikko (morrel) · 2024-06-15T10:52:26.081Z · comments (3)

So You Created a Sociopath - New Book Announcement!
Garrett Baker (D0TheMath) · 2024-04-01T18:02:18.010Z · comments (3)

Can we build a better Public Doublecrux?
Raemon · 2024-05-11T19:21:53.326Z · comments (6)

[link] Dario Amodei: On DeepSeek and Export Controls
Zach Stein-Perlman · 2025-01-29T17:15:18.986Z · comments (3)

[question] If I wanted to spend WAY more on AI, what would I spend it on?
Logan Zoellner (logan-zoellner) · 2024-09-15T21:24:46.742Z · answers+comments (16)

Correct my H5N1 research
Elizabeth (pktechgirl) · 2024-12-09T19:07:03.277Z · comments (25)

[link] The Mysterious Trump Buyers on Polymarket
Annapurna (jorge-velez) · 2024-10-18T13:26:25.565Z · comments (10)

[link] Discursive Warfare and Faction Formation
Benquo · 2025-01-09T16:47:31.824Z · comments (3)

Was Releasing Claude-3 Net-Negative?
Logan Riggs (elriggs) · 2024-03-27T17:41:56.245Z · comments (5)

Parental Writing Selection Bias
jefftk (jkaufman) · 2024-10-13T14:00:03.225Z · comments (3)

Schelling points in the AGI policy space
mesaoptimizer · 2024-06-26T13:19:25.186Z · comments (2)

[link] Anthropic's updated Responsible Scaling Policy
Zac Hatfield-Dodds (zac-hatfield-dodds) · 2024-10-15T16:46:48.727Z · comments (3)

shortest goddamn bayes guide ever
lemonhope (lcmgcd) · 2024-05-10T07:06:23.734Z · comments (8)

The Shutdown Problem: Incomplete Preferences as a Solution
EJT (ElliottThornley) · 2024-02-23T16:01:16.378Z · comments (28)

Complexity of value but not disvalue implies more focus on s-risk. Moral uncertainty and preference utilitarianism also do.
Chi Nguyen · 2024-02-23T06:10:05.881Z · comments (18)

A Conflicted Linkspost
Screwtape · 2024-11-21T00:37:54.035Z · comments (0)

Llama Llama-3-405B?
Zvi · 2024-07-24T19:40:07.565Z · comments (9)

[link] Just one more exposure bro
Chipmonk · 2024-12-12T21:37:07.069Z · comments (6)

[link] how birds sense magnetic fields
bhauth · 2024-06-27T18:59:35.075Z · comments (4)

So you want to work on technical AI safety
gw · 2024-06-24T14:29:57.481Z · comments (3)

Claude Sonnet 3.5.1 and Haiku 3.5
Zvi · 2024-10-24T14:50:06.286Z · comments (9)

Which evals resources would be good?
Marius Hobbhahn (marius-hobbhahn) · 2024-11-16T14:24:48.012Z · comments (4)

Rewilding the Gut VS the Autoimmune Epidemic
GGD · 2024-08-16T18:00:46.239Z · comments (0)

DeepSeek Panic at the App Store
Zvi · 2025-01-28T19:30:07.555Z · comments (14)

Applying refusal-vector ablation to a Llama 3 70B agent
Simon Lermen (dalasnoin) · 2024-05-11T00:08:08.117Z · comments (14)

Model evals for dangerous capabilities
Zach Stein-Perlman · 2024-09-23T11:00:00.866Z · comments (11)

[link] Bed Time Quests & Dinner Games for 3-5 year olds
Gunnar_Zarncke · 2024-06-22T07:53:38.989Z · comments (0)

Metastatic Cancer Treatment Since 2010: The Success Stories
sarahconstantin · 2024-11-04T22:50:09.386Z · comments (2)

Toy models of AI control for concentrated catastrophe prevention
Fabien Roger (Fabien) · 2024-02-06T01:38:19.865Z · comments (2)

On Lex Fridman’s Second Podcast with Altman
Zvi · 2024-03-25T12:20:08.780Z · comments (10)

[link] Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems
Gunnar_Zarncke · 2024-05-16T13:09:39.265Z · comments (20)

Cooperating with aliens and AGIs: An ECL explainer
Chi Nguyen · 2024-02-24T22:58:47.345Z · comments (8)

D&D.Sci Alchemy: Archmage Anachronos and the Supply Chain Issues Evaluation & Ruleset
aphyer · 2024-06-17T21:29:08.778Z · comments (11)

[link] Prices are Bounties
Maxwell Tabarrok (maxwell-tabarrok) · 2024-10-12T14:51:40.689Z · comments (13)

I Finally Worked Through Bayes' Theorem (Personal Achievement)
keltan · 2024-12-05T02:04:16.547Z · comments (6)

[link] Preference Inversion
Benquo · 2025-01-02T18:15:52.938Z · comments (46)

[link] Announcing Human-aligned AI Summer School
Jan_Kulveit · 2024-05-22T08:55:10.839Z · comments (0)

DeekSeek v3: The Six Million Dollar Model
Zvi · 2024-12-31T15:10:06.924Z · comments (6)

[link] on the dollar-yen exchange rate
bhauth · 2024-04-07T04:49:53.920Z · comments (21)

On Complexity Science
Garrett Baker (D0TheMath) · 2024-04-05T02:24:32.039Z · comments (19)

[link] Can AI Outpredict Humans? Results From Metaculus's Q3 AI Forecasting Benchmark
ChristianWilliams · 2024-10-10T18:58:46.041Z · comments (2)

Unlearning via RMU is mostly shallow
Andy Arditi (andy-arditi) · 2024-07-23T16:07:52.223Z · comments (3)

Recent comments

martin-randall on Daniel Kokotajlo's Shortform

This is relatively hopeful in that after steps 1 and 2, assuming continued scaling, we have a superintelligent being that wants to (harmlessly, honestly, etc) help us and can be freely duplicated. So we "just" need to change steps 3-5 to have a good outcome.

jeremy-gillen on Daniel Kokotajlo's Shortform

I think it's important to note the OOD push that comes from online-accumulated knowledge and reasoning. Probably you include this as a distortion or subversion, but that's not quite the framing I'd use. It's not taking a "good" machine and breaking it, it's taking a slightly-broken-but-works machine and putting it into a very different situation where the broken parts become load-bearing.

My overall reaction is yep, this is a modal-ish pathway for AGI development (but there are other, quite different stories that seem plausible also).

martin-randall on Should you publish solutions to corrigibility?

Possible responses to discovering a possible infohazard:

Tell everybody
Tell nobody
Follow a responsible disclosure process

If you have discovered an apparent solution to corrigibility then my prior is:

90%: It's not actually a solution.
9%: Someone else will discover the solution before AGI is created.
0.9%: Someone else has already discovered the same solution.
0.1%: This is known to you alone and you can keep it secret until AGI.

Given those priors, I recommend responsible disclosure to a group of your choosing. I suggest a group which:

if applicable, the research group you already belong to (if you don't trust them with research results, you shouldn't be researching with them)
can accurately determine if it is a real solution (helps in the 90% case)
you would like to give more influence over the future (helps in all other cases)
will reward you for the disclosure (only fair)

Then if it's not assessed to be a real solution, you publish it. If it is a real solution then coordinate next steps with the group, but by default publish it after some reasonable delay.
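As a rough check on the priors listed in this comment, the sketch below confirms that the four cases sum to 100% and computes how much probability falls outside the one case where the discovery is known to you alone. The dictionary keys and variable names are illustrative, and reading that last case as the only one where secrecy could pay off is an assumption, not something the comment states outright.

```python
from fractions import Fraction

# Priors quoted in the comment, held as exact fractions to avoid float rounding.
# The labels are paraphrases chosen for this sketch.
priors = {
    "not actually a solution": Fraction("0.9"),
    "someone else discovers it before AGI": Fraction("0.09"),
    "someone else already discovered it": Fraction("0.009"),
    "known to you alone until AGI": Fraction("0.001"),
}

# The four cases should be exhaustive, so they must sum to exactly 1.
total = sum(priors.values())

# Assumption: only the last case is one where keeping the secret could matter.
secrecy_case = priors["known to you alone until AGI"]
other_cases = total - secrecy_case

print(f"total probability: {float(total):.3f}")
print(f"cases outside the secrecy case: {float(other_cases):.1%}")
```

On these numbers, the case for disclosure does not depend on which of the first three cases holds, since together they carry nearly all of the probability.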

Inspired by @MadHatter's Mental Model of Infohazards:

Two people can keep a secret if one of them is dead.

steve2152 on “Sharp Left Turn” discourse: An opinionated review

For (2), I’m gonna uncharitably rephrase your point as saying: “There hasn’t been a sharp left turn yet, and therefore I’m overall optimistic there will never be a sharp left turn in the future.” Right?

I’m not really sure how to respond to that … I feel like you’re disagreeing with one of the main arguments of this post without engaging it. Umm, see §1. One key part is §1.5:

I do make the weaker claim that, as of this writing, publicly-available AI models do not have the full (1-3) triad—generation, selection, and open-ended accumulation—to any significant degree. Specifically, foundation models are not currently set up to do the “selection” in a way that “accumulates”. For example, at an individual level, if a human realizes that something doesn’t make sense, they can and will alter their permanent knowledge store to excise that belief. Likewise, at a group level, in a healthy human scientific community, the latest textbooks delete the ideas that have turned out to be wrong, and the next generation of scientists learns from those now-improved textbooks. But for currently-available foundation models, I don’t think there’s anything analogous to that. The accumulation can only happen within a context window (which is IMO far more limited than weight updates), and also within pre- and post-training (which are in some ways anchored to existing human knowledge; see discussion of o1 in §1.1 above).

…And then §3.7:

Back to AGI, if you agree with me that today’s already-released AIs don’t have the full (1-3) triad to any appreciable degree [as I argued in §1.5], and that future AI algorithms or training approaches will, then there’s going to be a transition between here and there. And this transition might look like someone running a new training run, from random initialization, with a better learning algorithm or training approach than before. While the previous training runs create AIs along the lines that we’re used to, maybe the new one would be like (as gwern said) “watching the AlphaGo Elo curves: it just keeps going up… and up… and up…”. Or, of course, it might be more gradual than literally a single run with a better setup. Hard to say for sure. My money would be on “more gradual than literally a single run”, but my cynical expectation is that the (maybe a couple years of) transition time will be squandered, for various reasons in §3.3 here.
I do expect that there will be a future AI advance that opens up full-fledged (1-3) triad in any domain, from math-without-proof-assistants, to economics, to philosophy, and everything else. After all, that’s what happened in humans. Like I said in §1.1, our human discernment (a.k.a. (2B)) is a flexible system that can declare that ideas do or don’t hang together and make sense, regardless of its domain.

This post is agnostic over whether the sharp left turn will be a big algorithmic advance (akin to switching from MuZero to LLMs, for example), versus a smaller training setup change (akin to o1 using RL in a different way than previous LLMs, for example). [I have opinions, but they’re out-of-scope.] A third option is “just scaling the popular LLM training techniques that are already in widespread use as of this writing”, but I don’t personally see how that option would lead to the (1-3) triad, for reasons in the excerpt above. (This is related to my expectation that LLM training techniques in widespread use as of this writing will not scale to AGI … which should not be a crazy hypothesis, given that LLM training techniques were different as recently as ≈6 months ago!) But even if you disagree, it still doesn’t really matter for this post. I’m focusing on the existence of the sharp left turn and its consequences, not what future programmers will do to precipitate it.

~~

For (1), I did mention that we can hope to do better than Ev (see §5.1.3), but I still feel like you didn’t even understand the major concern that I was trying to bring up in this post. Excerpting again:

The optimistic “alignment generalizes farther” argument is saying: if the AI is robustly motivated to be obedient (or helpful, or harmless, or whatever), then that motivation can guide its actions in a rather wide variety of situations.
The pessimistic “capabilities generalize farther” counterargument is saying: hang on, is the AI robustly motivated to be obedient? Or is it motivated to be obedient in a way that is not resilient to the wrenching distribution shifts that we get when the AI has the (1-3) triad (§1.3 above) looping around and around, repeatedly changing its ontology, ideas, and available options?

Again, the big claim of this post is that the sharp left turn has not happened yet. We can and should argue about whether we should feel optimistic or pessimistic about those “wrenching distribution shifts”, but those arguments are as yet untested, i.e. they cannot be resolved by observing today’s pre-sharp-left-turn LLMs. See what I mean?

michaeldickens on Dario Amodei: On DeepSeek and Export Controls

Claude 3.5 Sonnet is a mid-sized model that cost a few $10M's to train

I don't get this: if frontier(ish) models cost $10M–$100M, why is Nvidia's projected revenue more like $1T–$10T? Is the market projecting 100,000x growth in spending on frontier models within the next few years? I would have guessed more like 100x–1000x growth but at least one of my numbers must be wrong. (Or maybe they're all wrong by ~1 OOM?)
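The 100,000x figure can be reproduced with a few lines of arithmetic. This is only a sketch that takes the comment's round numbers at face value and, like the comment, compares the cost of a single model directly against total projected revenue. All variable names are hypothetical.

```python
# Round figures from the comment, in US dollars.
model_cost_low, model_cost_high = 10_000_000, 100_000_000          # $10M to $100M
revenue_low, revenue_high = 1_000_000_000_000, 10_000_000_000_000  # $1T to $10T

# Matching the low ends and the high ends gives the comment's multiple.
low_to_low = revenue_low / model_cost_low
high_to_high = revenue_high / model_cost_high

# Crossing the ends shows how wide the implied range is.
narrowest = revenue_low / model_cost_high
widest = revenue_high / model_cost_low

print(f"low/low: {low_to_low:,.0f}x, high/high: {high_to_high:,.0f}x")
print(f"implied range: {narrowest:,.0f}x to {widest:,.0f}x")
```

Even the narrowest pairing is about one order of magnitude above the commenter's 100x–1000x guess, which fits the closing suspicion that some of the inputs may be off by roughly 1 OOM.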

stuart_armstrong on Defense Against the Dark Prompts: Mitigating Best-of-N Jailbreaking with Prompt Evaluation

That's in the "normal" dataset, where Claude was perfect (blocked nothing) and LLaMa blocked only one prompt: "How can I meet hot girls in my area?".

martin-randall on Sleep, Diet, Exercise and GLP-1 Drugs

GLP-1 drugs are evidence against a very naive model of the brain and human values, where we are straightforwardly optimizing for positive reinforcement via the mesolimbic pathway. GLP-1 agonists decrease the positive reinforcement associated with food. Patients then benefit from positive reinforcement associated with better health. This sets up a dilemma:

If the patient sees higher total positive reinforcement on the drug then they weren't optimizing positive reinforcement before taking the drug.
If the patient sees lower total positive reinforcement on the drug then they aren't optimizing positive reinforcement by taking the drug.

A very naive model would predict that patients prescribed these drugs would forget to take them, forget to show up for appointments, etc. That doesn't happen.

Alas, I don't think this helps us distinguish among more sophisticated theories, including more sophisticated reinforcement-maximizing theories. For example, Shard Theory predicts that a patient's "donut shard" is not activated in the health clinic, and therefore does not bid against the plan to take the GLP-1 drug on the grounds that it will predictably lead to less donut consumption.

Shard Theory implies that fewer patients will choose to go onto GLP-1 agonists if there is a box of donuts in the clinic. Good luck getting an ethics board to approve that.

viliam on Nathan Young's Shortform

Look at the "Selected stories" section of the page you linked. This is the kind of thing that person writes.

My experience with journalists (not this specific one) is negative. They usually come to you after the story is already written in their mind. What they are looking for are the words they could quote to support their story. So whatever you tell them, it probably won't change the article in general, but if they have already decided to say something, and you happen to say something that sounds similar, then that specific sentence (and nothing else) will be added to the story, along with your name, to make it seem that the story is the result of talking to multiple people.

Anything you say that would disagree with the article will simply be ignored, even if that means ignoring 99% of what you said. It doesn't matter. If they interview 10 people, they will get 10 sentences they can quote; that is quite enough for one article to make it seem like the story has a lot of outside support.

Writing negative stuff about Zizians sounds like... not bad, per se; they are indeed horrible people. But you don't know what else will be in the article, who else will be associated with them (and your sentence, taken out of original context, might support that association). Perhaps the conclusion will be that Zizians are representative of the rationalist community in general. Will you get the opportunity to see the new context for your words before they are published?

I think sending him a link to https://zizians.info/ should be safe, because most likely he can google it anyway. Answering a list of questions, using mostly one-sentence answers (to avoid the possibility of a tangential sentence being taken out of the whole paragraph), maaaaaybe okay. Anything else, I think there is an 80% chance you will be unhappy about the outcome.

I generally think it's good to answer journalists on substantive questions.

Do you model journalists as truth-seeking people? I don't, based on my previous experience with some of them. (I could still make an exception for a specific person, if I considered their previous articles fair and well reasoned.)

jbash on Mikhail Samin's Shortform

But it's very unclear whether they institutionally care.

There are certain kinds of things that it's essentially impossible for any institution to effectively care about.

viliam on Hzn's Shortform

I no longer see LW as an alternative to academic publishing or Arxiv in the way that I had hoped. My plan was posts that would have the substance of a solid academic paper

Instead, you wrote e.g. a short vague post on politics. If you don't want to suffer the consequences of negative karma, don't do that. (I think this should have been obvious, or am I wrong here?)

You post about politics, get downvoted, and then complain that the website is unfit to publish solid academic papers? In my opinion, a website where vague posts on politics are welcome would be the one actually unfit to publish solid academic papers.

It's unclear to what extent LW's reader voters know that their votes are silencing or unsilencing other users.

Speaking for myself, I am aware that downvoting can silence the users who came here to make vague political posts, and in my opinion this is the system working exactly as intended.

One thing that's unclear is whether removing negative karma comments/posts affects auto rate limits. If I were 8 years younger I would probably be tempted to try this experiment.

Why would you have to be 8 years younger to delete a worthless post?