LessWrong 2.0 Reader

Applying refusal-vector ablation to a Llama 3 70B agent
Simon Lermen (dalasnoin) · 2024-05-11T00:08:08.117Z · comments (14)

On OpenAI’s Preparedness Framework
Zvi · 2023-12-21T14:00:05.144Z · comments (4)

[link] The Good Balsamic Vinegar
jenn (pixx) · 2024-01-26T19:30:57.435Z · comments (4)

Provably Safe AI: Worldview and Projects
bgold · 2024-08-09T23:21:02.763Z · comments (43)

How to Give in to Threats (without incentivizing them)
Mikhail Samin (mikhail-samin) · 2024-09-12T15:55:50.384Z · comments (26)

Does literacy remove your ability to be a bard as good as Homer?
Adrià Garriga-alonso (rhaps0dy) · 2024-01-18T03:43:14.994Z · comments (19)

Book Review: Righteous Victims - A History of the Zionist-Arab Conflict
Yair Halberstadt (yair-halberstadt) · 2024-06-24T11:02:03.490Z · comments (8)

Rewilding the Gut VS the Autoimmune Epidemic
GGD · 2024-08-16T18:00:46.239Z · comments (0)

[link] Anthropic's updated Responsible Scaling Policy
Zac Hatfield-Dodds (zac-hatfield-dodds) · 2024-10-15T16:46:48.727Z · comments (3)

D&D.Sci Alchemy: Archmage Anachronos and the Supply Chain Issues Evaluation & Ruleset
aphyer · 2024-06-17T21:29:08.778Z · comments (11)

[link] Bed Time Quests & Dinner Games for 3-5 year olds
Gunnar_Zarncke · 2024-06-22T07:53:38.989Z · comments (0)

[link] The Evals Gap
Marius Hobbhahn (marius-hobbhahn) · 2024-11-11T16:42:46.287Z · comments (7)

Claude Sonnet 3.5.1 and Haiku 3.5
Zvi · 2024-10-24T14:50:06.286Z · comments (9)

Model evals for dangerous capabilities
Zach Stein-Perlman · 2024-09-23T11:00:00.866Z · comments (9)

[link] Prices are Bounties
Maxwell Tabarrok (maxwell-tabarrok) · 2024-10-12T14:51:40.689Z · comments (13)

Llama Llama-3-405B?
Zvi · 2024-07-24T19:40:07.565Z · comments (9)

Cooperating with aliens and AGIs: An ECL explainer
Chi Nguyen · 2024-02-24T22:58:47.345Z · comments (8)

Will 2024 be very hot? Should we be worried?
A.H. (AlfredHarwood) · 2023-12-29T11:22:50.200Z · comments (12)

[link] how birds sense magnetic fields
bhauth · 2024-06-27T18:59:35.075Z · comments (4)

On Lex Fridman’s Second Podcast with Altman
Zvi · 2024-03-25T12:20:08.780Z · comments (10)

[link] Announcing Human-aligned AI Summer School
Jan_Kulveit · 2024-05-22T08:55:10.839Z · comments (0)

Toy models of AI control for concentrated catastrophe prevention
Fabien Roger (Fabien) · 2024-02-06T01:38:19.865Z · comments (2)

n of m ring signatures
DanielFilan · 2023-12-04T20:00:06.580Z · comments (7)

Transfer learning and generalization-qua-capability in Babbage and Davinci (or, why division is better than Spanish)
RP (Complex Bubble Tea) · 2024-02-09T07:00:45.825Z · comments (6)

Sherlockian Abduction Master List
Cole Wyeth (Amyr) · 2024-07-11T20:27:00.000Z · comments (63)

[link] on the dollar-yen exchange rate
bhauth · 2024-04-07T04:49:53.920Z · comments (21)

Goal-Completeness is like Turing-Completeness for AGI
Liron · 2023-12-19T18:12:29.947Z · comments (26)

AI #82: The Governor Ponders
Zvi · 2024-09-19T13:30:04.863Z · comments (8)

The Shortest Path Between Scylla and Charybdis
Thane Ruthenis · 2023-12-18T20:08:34.995Z · comments (8)

[Intuitive self-models] 8. Rooting Out Free Will Intuitions
Steven Byrnes (steve2152) · 2024-11-04T18:16:26.736Z · comments (16)

Applications of Chaos: Saying No (with Hastings Greer)
Elizabeth (pktechgirl) · 2024-09-21T16:30:07.415Z · comments (16)

Why you should learn a musical instrument
cata · 2024-05-15T20:36:16.034Z · comments (23)

Unlearning via RMU is mostly shallow
Andy Arditi (andy-arditi) · 2024-07-23T16:07:52.223Z · comments (3)

Scenario Forecasting Workshop: Materials and Learnings
elifland · 2024-03-08T02:30:46.517Z · comments (3)

On Complexity Science
Garrett Baker (D0TheMath) · 2024-04-05T02:24:32.039Z · comments (19)

Vipassana Meditation and Active Inference: A Framework for Understanding Suffering and its Cessation
Benjamin Sturgeon (benjamin-sturgeon) · 2024-03-21T12:32:22.475Z · comments (8)

Observations on Teaching for Four Weeks
ClareChiaraVincent · 2024-05-06T16:55:59.315Z · comments (14)

[link] A starter guide for evals
Marius Hobbhahn (marius-hobbhahn) · 2024-01-08T18:24:23.913Z · comments (2)

So you want to work on technical AI safety
gw · 2024-06-24T14:29:57.481Z · comments (3)

AI #52: Oops
Zvi · 2024-02-22T21:50:07.393Z · comments (9)

Altman firing retaliation incoming?
trevor (TrevorWiesinger) · 2023-11-19T00:10:15.645Z · comments (23)

Gemini 1.0
Zvi · 2023-12-07T14:40:05.243Z · comments (7)

Changes in College Admissions
Zvi · 2024-04-24T13:50:03.487Z · comments (11)

Apply to the Conceptual Boundaries Workshop for AI Safety
Chipmonk · 2023-11-27T21:04:59.037Z · comments (0)

[link] Can AI Outpredict Humans? Results From Metaculus's Q3 AI Forecasting Benchmark
ChristianWilliams · 2024-10-10T18:58:46.041Z · comments (2)

Paper in Science: Managing extreme AI risks amid rapid progress
JanB (JanBrauner) · 2024-05-23T08:40:40.678Z · comments (2)

Consent across power differentials
Ramana Kumar (ramana-kumar) · 2024-07-09T11:42:03.177Z · comments (12)

[link] Finding Backward Chaining Circuits in Transformers Trained on Tree Search
abhayesian · 2024-05-28T05:29:46.777Z · comments (1)

[link] Anthropic announces interpretability advances. How much does this advance alignment?
Seth Herd · 2024-05-21T22:30:52.638Z · comments (4)

[question] why did OpenAI employees sign
bhauth · 2023-11-27T05:21:28.612Z · answers+comments (23)


Recent comments

seth-herd on If we solve alignment, do we die anyway?

My pleasure. Evan Hubinger made this point to me when I'd misunderstood his scalable oversight proposal.

Thanks again for engaging with my work!

dakara on If we solve alignment, do we die anyway?

"If you have an agent that's aligned and smarter than you, you can trust it to work on further alignment schemes. It's wiser to spot-check it, but the humans' job becomes making sure the existing AGI is truly aligned, and letting it do the work to align its successor, or keep itself aligned as it learns."

Ah, that's the link that I was missing. Now it makes sense. You can use AGI as a reviewer for other AGIs, once it is better than humans at reviewing AGIs. Thank you a lot for clarifying!

seth-herd on If we solve alignment, do we die anyway?

Thanks for reading, and responding! It's very helpful to know where my arguments cease being convincing or understandable.

I fully agree that just having AI do the work of solving alignment is not a good or convincing plan. You need to know that AI is aligned to trust it.

Perhaps the missing piece is that I think alignment is already solved for LLM agents. They don't work well, but they are quite eager to follow instructions. Adding more alignment methods as they improve makes good odds that our first capable/dangerous agents are also aligned. I listed some of the obvious and easy techniques we'll probably use in Internal independent review for language model agent alignment. I'm not happy with the clarity of that post, though, so I'm currently working on two followups that might be clearer.

Or perhaps the missing link is going from aligned AI systems to aligned "Real AGI". I do think there's a discontinuity in alignment once a system starts to learn continuously and reflect on its beliefs (which change how its values/goals are interpreted). However, I think the techniques most likely to be used are probably adequate to make those systems aligned - IF that alignment is for following instructions, and the humans wisely instruct it to be honest about ways its alignment could fail.

So that's how I get to the first aligned AGI at roughly human level or below.

From there it seems easier, although still possible to fail.

If you have an agent that's aligned and smarter than you, you can trust it to work on further alignment schemes. It's wiser to spot-check it, but the humans' job becomes making sure the existing AGI is truly aligned, and letting it do the work to align its successor, or keep itself aligned as it learns.

I usually think about the progression from AGI to superintelligence as one system/entity learning, being improved, and self-improving. But there's a good chance that progression will look more generational, with several distinct systems/entities as successors with greater intelligence, designed by the previous system and/or humans. Those discontinuities seem to present more danger of getting alignment wrong.

algon on Making a conservative case for alignment

If I squint, I can see where they're coming from. People often say that wars are foolish, and both sides would be better off if they didn't fight. And this is standardly called "naive" by those engaging in realpolitik. Sadly, for any particular war, there's a significant chance they're right. Even aside from human stupidity, game theory is not so kind as to allow for peace unending. But the China-America AI race is not like that. The Chinese don't want to race. They've shown no interest in being part of a race. It's just American hawks on a loud, Quixotic quest masking the silence.

If I were to continue the story, it'd show Simplicio asking Galactico not to play Chicken and Galactico replying "race? What race?". Then Sophistico crashes into Galactico and Simplicio. Everyone dies, The End.

hmys on Reducing x-risk might be actively harmful

Seems unlikely to me. I mean, I think, in large part due to factory farming, that the current immediate existence of humanity, and also its history, are net negatives. The reason I'm not a full-blown antinatalist is because these issues are likely to be remedied in the future, and the goodness of the future will astronomically dwarf the current negativity humanity has brought and is bringing about. (assuming we survive and realize a non-negligible fraction of our cosmic endowment)

The reason I think this is, well, the way I view it, it's an immediate corollary of the standard Yudkowsky/Bostrom AI arguments. Animals existing and suffering is an extremely specific state of affairs, just like humans existing and being happy is an extremely specific state of affairs. This means that, if you optimize hard enough for anything that's not exactly that (humans happy or animals suffering), you're not gonna get it.

And, maybe this is me being too optimistic (but I really hope not, and I really don't think so), but I don't think many humans want animals to suffer for its own sake. They'd eat lab-grown meat if it was cheaper and better tasting than animal-grown meat. Lab-grown meat is a good example of the general principle I'm talking about. Suffering of sentient minds is a complex thing. If you have a powerful optimizer going about optimizing the universe, you're virtually never gonna get suffering sentient minds unless that is what the optimizer is deliberately aiming for.

sharmake-farah on "The Solomonoff Prior is Malign" is a special case of a simpler argument

I think this is in fact the crux, in that I don't think they can do this in the general case, no matter how much compute is used, and even in the more specific cases, I still expect it to be extremely hard verging on impossible to actually get the distribution, primarily because you get equal evidence for almost every value, for the same reasons that getting more compute is an instrumentally convergent goal, so you cannot infer the values of basically anyone solely from the fact that you live in a simulation.

In the general case, the distribution/probability isn't even well defined at all.

turntrout on Announcing turntrout.com, my new digital home

Great point! I made this design choice back in April, so I wasn't as aware of the implications of localStorage.

Adds his 61st outstanding to-do item.

seth-herd on OpenAI Email Archives (from Musk v. Altman)

Very interesting. This does imply that Page was pretty committed to this view.

Note that he doesn't explicitly state that non-sentient machine successors would be fine; he could be assuming that the winning machines would be human-plus in all ways we value.

I think that's a foolish thing to assume and a foolish aspect of the question to overlook. That's why I think more careful philosophy would have helped resolve this disagreement with words instead of a gigantic industrial competition that's now putting us all at risk.

fabien-roger on Buck's Shortform

"Technical measures to prevent users from using the AI for particular tasks don’t help against the threat of the lab CEO trying to use the AI for those harmful tasks"

Actually, it is not that clear to me. I think adversarial robustness is helpful (in conjunction with other things) to prevent CEOs from misusing models.

If at some point a CEO trying to take over wants to use an HHH model to help them with the takeover, that model will likely refuse to do egregiously bad things. So the CEO might need to use helpful-only models. But there might be processes in place to access helpful-only models - which might make it harder for the CEO to take over. So while I agree that you need good security and governance to prevent a CEO from using helpful-only models to take over, I think that without good adversarial robustness, it is much harder to build adequate security/governance measures without destroying an AI-assisted-CEO's productivity.

There is a lot of power concentration risk that just comes from people in power doing normal people-in-power things, such as increasing surveillance on dissidents - for which I agree that adversarial robustness is ~useless. But security against insider threats is quite useless too.

seth-herd on OpenAI Email Archives (from Musk v. Altman)

Maybe Page does believe that. I think it's nearly a self-contradictory position, and that Page is a smart guy, so with more careful thought, his beliefs are likely to converge on the more common view here on LW: replacing humanity might be OK only if our successors are pretty much better at enjoying the world in the same way we do.

I think people who claim to not care whether our successors are conscious are largely confused, which is why doing more philosophy would be really valuable.

Beff Jezos is exactly my model. Digging through his writings, I found him at one point explicitly stating that he was referring to machine offspring with some sort of consciousness or enjoyment when he says humanity should be replaced. In other places he's not clear on it. It's bad philosophy, because it's taking a backseat to arguments.

This is why I want to assume that Page would converge to the common belief: so we don't mark people who seem to disagree with us as enemies, and drive them away from doing the careful, collaborative thinking that would get our beliefs to converge.

Addenda on why I think beliefs on this topic converge with additional thought: I don't think there's a universal ethics, but I do think that humans have built-in mechanisms that tend to make us care about other humans. Assuming we'd care about something that acts sort of like a sentient being, but internally just isn't one, is an easy mistake to make without managing to imagine that scenario in adequate detail.